Hello everyone! In the previous article, Streaming Analytics in Google Cloud Platform - Introduction, we covered what streaming analytics is, which services we are going to use, and a quick introduction to each of them. In this part of the series, we will install the SDKs and libraries and set up our environment.
Having a well-configured development environment is crucial down the road. I have seen many questions on Stack Overflow and other online forums along the lines of "module not found", "unable to import packages", "access denied", "resource not found", and so on. The problem is not getting errors and fixing them by asking online; in fact, it is good practice to ask someone who knows and to do our own research to solve our technical problems. If you look back, the problems you spent many hours researching are usually the ones that stay in your mind, maybe forever. In my first job, I spent a huge amount of time asking questions and researching how to set up MySQL replication, and the knowledge gained back then is still fresh in my mind, even after many years. So asking questions is good, and doing your own research is very important. The problem is the time! You will spend a lot of time on these kinds of questions and waiting for answers, which could be avoided by simply spending a few minutes setting up your development environment correctly. I assume you now have a clear understanding of why this part is so important, so let's begin with the installations.

## Google Cloud Command-line SDK (gcloud)

gcloud is the command-line interface (CLI) tool for the Google Cloud Platform (GCP). It is powerful and flexible, and lets you perform various tasks on GCP from the command line, such as creating, managing and interacting with services, and deploying applications. It is an essential tool for developers working with GCP. I am going to cover the installation of the gcloud CLI for macOS; the official documentation contains detailed instructions for other operating systems, so please refer to it if you are using Windows. The macOS steps, sketched below, are:

1. Create a folder in a preferred location
2. Download the gcloud SDK
3. Extract the gcloud archive in the current directory
4. Run the installer script to start the gcloud SDK installation
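The commands below are a minimal sketch of these four steps, assuming an Intel (x86_64) Mac. The archive name and version change over time, so check the official download page for the file that matches your machine:

```sh
# 1. Create a folder in a preferred location and move into it
mkdir ~/gcloud && cd ~/gcloud

# 2. Download the gcloud SDK archive (example name - pick the one for your OS/arch)
curl -O https://dl.google.com/dl/cloudsdk/channel/rapid/downloads/google-cloud-cli-darwin-x86_64.tar.gz

# 3. Extract the archive in the current directory
tar -xf google-cloud-cli-darwin-x86_64.tar.gz

# 4. Run the installer script
./google-cloud-sdk/install.sh
```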
Once the installation has completed, start a new terminal window so that the changes take effect. To verify the installation, run the following command; a successful installation prints the versions of the installed SDK components.
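```sh
# prints the Google Cloud SDK version and installed components
gcloud version
```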
To initialise gcloud, run the following command:
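```sh
# interactive setup: sign in, pick a project, set defaults
gcloud init
```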
The init command launches an interactive getting-started workflow for the gcloud command line. It authorises gcloud and the other SDKs to access Google Cloud resources and sets the current project. This step is a must to complete authentication and gain access to Google Cloud resources. A few useful commands to try are shown below.
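The original list of commands was not preserved here, so the following is a suggested set of common gcloud commands that are useful right after initialisation:

```sh
# show the account(s) gcloud is authorised with
gcloud auth list

# show the active configuration (account, project, etc.)
gcloud config list

# list the projects the active account can access
gcloud projects list
```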
## Apache Beam Python SDK

The Apache Beam Python SDK officially supports Python 3.6, 3.7 and 3.8. I will be using Python 3.9 - at the time of writing this article, the Apache Beam Python SDK does not support Python 3.10. Please note that you might not see the 3.9 support information in the official documentation, and as Beam is a continuously evolving project, the documentation may be updated at any time. The safer option is Python 3.8, but for our implementation Python 3.9 works fine. To check your current Python version:
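```sh
python3 --version
```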
You also need the pip installer to install the Apache Beam package. If you do not have it, please install it; the steps are available here: https://pip.pypa.io/en/stable/installation/. If you are using Anaconda, there is a high chance that pip is already installed; you can check by running the command below:
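```sh
pip --version
```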
Create a virtual environment for the stream analytics project by running the command below:
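A sketch using Python's built-in venv module; the environment name `streamanalytics` is an assumption, so use any name you prefer:

```sh
# create a virtual environment named "streamanalytics" (assumed name)
python3 -m venv streamanalytics
```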
And activate your environment by running:
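```sh
# macOS/Linux; the activate script lives inside the environment folder
source streamanalytics/bin/activate
```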
If you are using Anaconda, you can create a virtual environment by running the command below:
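Again assuming `streamanalytics` as the environment name, and pinning the Python version discussed above:

```sh
conda create --name streamanalytics python=3.9
```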
To activate the conda environment:
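```sh
conda activate streamanalytics
```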
Quick note on virtual environments in Python: a virtual environment is an isolated Python installation with its own interpreter and packages, so the dependencies of one project do not conflict with system-wide packages or with other projects.
Download and install the Apache Beam packages:
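Note the quotes around the package name - in zsh (the default macOS shell), square brackets are expanded by the shell unless quoted:

```sh
pip install 'apache-beam[gcp]'
```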
You may have seen just `pip install apache-beam` instead of `apache-beam[gcp]` in much of the documentation. The reason we need `apache-beam[gcp]` is that the `[gcp]` extra installs the dependencies required to run data pipelines on Google Cloud Platform. It provides built-in integration with Google Cloud services such as the Cloud Dataflow runner, BigQuery, Cloud Storage, etc. One more thing: to define our pipeline in code, we need a text editor. I am using VS Code, but feel free to use any editor you are comfortable with.

We are done with the package installations and setting up an isolated virtual environment to run our pipeline. Let us now create a project in Google Cloud Platform, create a Cloud Pub/Sub topic and subscription using the gcloud command-line tool, and enable the required access.

## Google Cloud Project

Create a new project for our streaming analytics work (project IDs are globally unique, so if `streaming-analytics` is already taken, pick a unique ID of your own):

```sh
gcloud projects create streaming-analytics
```

List projects - this command lists the Project ID, Project Name and Project Number; note down the Project ID of the streaming-analytics project:

```sh
gcloud projects list
```

Set streaming-analytics as our current project:

```sh
gcloud config set project streaming-analytics
```

To verify the above steps and view the current project and user, run the command below:

```sh
gcloud config list
```

You must have the below permission or the Project Creator role to create a project on Google Cloud:

- resourcemanager.projects.create

## Cloud Pub/Sub - Topic & Subscription

Create a Pub/Sub topic:

```sh
gcloud pubsub topics create analytics-topic
```

Create a Pub/Sub subscription and attach it to the topic created above:

```sh
gcloud pubsub subscriptions create analytics-subscription --topic analytics-topic
```

## Cloud Pub/Sub - Publishing & Pulling Messages

To publish a test message, run the following command:

```sh
gcloud pubsub topics publish analytics-topic --message '{"name":"streamanalytics", "age":2}'
```
This message will be delivered to the Pub/Sub subscription; you can pull it by running the following command:

```sh
gcloud pubsub subscriptions pull analytics-subscription --format="json(ackId, message.attributes, message.data.decode(\"base64\").decode(\"utf-8\"), message.messageId, message.publishTime)"
```

You must have the below roles assigned to create Cloud Pub/Sub topics and subscriptions, and to publish and consume messages:

- roles/pubsub.publisher
- roles/pubsub.subscriber
- roles/pubsub.viewer

## BigQuery - Creating Dataset

A BigQuery dataset can be created using many client tools; we will use the Google Cloud Console. For other options, refer to the official documentation.

Go to the BigQuery Console -> Click on View actions (next to the project name) -> Create dataset.

Enter the Dataset ID, data location and other options, and click on Create dataset.

## BigQuery - CREATE & SELECT Table

In BigQuery, a table can be created in many ways; let us create one with standard SQL as below and run it in the BigQuery editor:

```sql
CREATE TABLE streamanalytics.user(
  name STRING,
  age INTEGER
);
```

In a streaming analytics pipeline, we can create tables and populate data from the data pipeline itself; to verify the environment is set up correctly, we are creating the table in the console and checking the access.

Populate the table with sample data:

```sql
INSERT INTO `streamanalytics.user` VALUES ("stream", 1);
```

To query the user table:

```sql
SELECT * FROM `streamanalytics.user`;
```

You must have one of the below roles to create datasets and tables, insert records and select records: BigQuery Admin OR BigQuery Editor OR BigQuery Data Owner.

## Cloud Dataflow Jobs

To create and deploy Cloud Dataflow jobs, you need the below permissions/roles assigned: