The objectives of this section are to
- carry out data science tasks in notebooks,
- rehost notebooks on the cloud,
- execute ad-hoc queries at scale, and
- invoke pre-trained ML models from Datalab.
The integrated development environment used here is Cloud Datalab. Cloud Datalab notebooks run on virtual machines, so Compute Engine and Cloud Storage will also be discussed. The notebooks themselves are stored in a cloud repository, so they are under version control.
Because Cloud Datalab runs on a VM:
1. You can control and change what kind of machine is running your notebook. As examples, you can give the notebook more memory and add a GPU. Changing the machine is very easy.
2. VMs are ephemeral. This means that anything you want to persist needs to be saved outside the VM, and the best place is in Cloud Storage.
BigQuery will also be demonstrated. It is a managed data analysis service on the cloud that allows the user to execute ad-hoc queries at a scale and speed that are not possible with traditional database systems.
Increasingly, data scientists are using self-descriptive, shareable, executable notebooks like Jupyter or IPython notebooks for data analysis and ML. Datalab is Google’s open-source notebook tool, and it is based on Jupyter. As with Jupyter, it has code sections interleaved with markup and output, and it is possible to export a notebook as a standalone file.
Cloud Datalab also enables Google Docs-style collaboration. One difference between Datalab and traditional notebooks (like Jupyter) is that Datalab notebooks are hosted in the cloud, so there isn’t any concern about ensuring the server is “up” so colleagues can work on them. The notebooks themselves can be persisted in Git, so you can delete the VM when it is no longer needed.
If you are running a Datalab notebook and discover that you need a machine with more memory, it is possible to “rehost” the notebook. To do this, you “stop” the virtual machine, reprovision it, and start it back up again. Most of the time, projects in this specialization will be run on plain vanilla virtual machines.
Working with Managed Services
You can develop locally with Datalab and then scale out data processing to the cloud. As an example, you can read CSV files with Apache Beam, process them with Pandas DataFrames, and then use the data in machine learning models within TensorFlow. When it is time to scale, you can use Google Cloud Storage to hold the data, process it with Cloud Dataflow, and run distributed training and hyperparameter optimization in Cloud ML Engine. This is possible with Datalab because it integrates so well with other GCP products.
|Exploring & Analyzing|BigQuery, Google Cloud Storage|
|Machine Learning & Modeling|TensorFlow, GCML|
|Visualizing|Google Charts, Plotly, Matplotlib|
|Seamless Product Combination|CMLE, Dataflow, Cloud Storage|
|Integration|Authentication and code source control|
Computation and Storage
Compute Engine is managed infrastructure. When done with an analysis, you can save your notebook to Git and stop the machine, so you only pay for what you use. Compute Engine is essentially a globally distributed CPU, and Cloud Storage is a globally distributed disk.
Datalab is a single-node program, so it runs on a single Compute Engine instance. But when we kick off Dataflow jobs or ML Engine jobs, we send the processing to many Compute Engine instances. Compute Engine allows you to rent a VM on the cloud. There are numerous parameters that can be customized, including the number of cores, the amount of memory, the disk size, and the operating system.
When we run TensorFlow programs, they read directly off Cloud Storage. The purpose of Cloud Storage is to provide a global filesystem. A Cloud Storage URL starts with gs://, followed by a bucket name and a path. In a URL such as gs://acme-sales/data/, the gs://acme-sales/ part is called a bucket. The bucket name is globally unique, and is commonly a reverse domain name. The remainder of the URL is a conventional folder structure.
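The bucket-plus-path structure described above can be illustrated with a small parser. This is only a sketch using Python's standard library: the helper name `split_gcs_url` and the file name `sales003.csv` are made up for illustration and do not come from the course or from any Google library.

```python
from urllib.parse import urlsplit


def split_gcs_url(url):
    """Split a gs:// URL into (bucket, object_path)."""
    parts = urlsplit(url)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a Cloud Storage URL: {url!r}")
    # The network location is the bucket; the rest (minus the leading
    # slash) is the folder-like object path inside that bucket.
    return parts.netloc, parts.path.lstrip("/")


# "sales003.csv" is a hypothetical file name used only for this example.
bucket, path = split_gcs_url("gs://acme-sales/data/sales003.csv")
print(bucket)
print(path)
```

Only the bucket name has to be globally unique; everything after it is an ordinary path chosen by the bucket's owner.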
You can interact with Cloud Storage using a command-line utility called gsutil, a tool that comes with the Google Cloud SDK. The syntax used with gsutil is similar to Unix command-line syntax. For example, mb and rb are make bucket and remove bucket, and cp is copy. The following line would copy a set of local files to a location on Cloud Storage.
gsutil cp sales*.csv gs://acme-sales/data/
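To make the effect of the wildcard concrete, the sketch below works out which destination each matching file would end up at. It does not call gsutil or touch Cloud Storage; `plan_copy` and the file names are hypothetical, and the only behavior assumed is that a destination ending in "/" is treated as a folder, so each file keeps its base name.

```python
import fnmatch
import posixpath


def plan_copy(local_files, pattern, destination):
    """Return (source, destination_url) pairs for files matching pattern."""
    matched = sorted(fnmatch.filter(local_files, pattern))
    # The destination ends in "/", so each file keeps its own base name
    # underneath that folder.
    return [(name, destination + posixpath.basename(name)) for name in matched]


# Hypothetical directory listing, for illustration only.
files = ["sales001.csv", "sales002.csv", "notes.txt"]
for src, dst in plan_copy(files, "sales*.csv", "gs://acme-sales/data/"):
    print(src, "->", dst)
```

Files that do not match the pattern, such as notes.txt here, are simply left out of the copy.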
Latency is still a concern, so it is necessary to choose zones and regions close to your compute clusters. An example of a zone is us-central1-a. A zone is an isolated location within a region. Another consideration is service disruptions, so it may be necessary to distribute your apps and data across multiple zones to protect yourself in case a single zone goes down, for example due to a power outage. A final reason to distribute apps and data across regions is to ensure they are available to customers worldwide.
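Since a zone name is its region name plus a letter suffix, the relationship between the two can be sketched in a few lines. The helper names and the zone "us-east1-b" are illustrative assumptions; this is string handling only and does not query GCP.

```python
def region_of(zone):
    """Return the region that a Compute Engine zone belongs to."""
    region, sep, suffix = zone.rpartition("-")
    if not (region and sep and suffix):
        raise ValueError(f"unexpected zone name: {zone!r}")
    return region


def is_colocated(vm_zone, bucket_region):
    """True when the VM's zone lies inside the bucket's region."""
    return region_of(vm_zone).lower() == bucket_region.lower()


print(region_of("us-central1-a"))
# "us-east1-b" and "us-east1" are example values, not taken from a real project.
print(is_colocated("us-east1-b", "us-east1"))
print(is_colocated("us-central1-a", "us-east1"))
```

A check like this is one way to catch a compute instance and its data sitting in different regions, which is the latency problem described above.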
In this initial lab, I create a virtual machine, configure the security, and access it remotely. Then, I carry out an ingest-transform-and-publish data pipeline.
- Create a Compute Engine instance
Browse to cloud.google.com, click the three horizontal lines in the top left, select Compute Engine, and then select VM Instances. For this lab I leave all options at their defaults.
- SSH into the instance
SSHing into the instance allows you to access it remotely. This is done by clicking on SSH in the list of VM instances, which brings up a console-like interface. I run this command to print some information about the instance's CPU.
~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU @ 2.60GHz
stepping : 7
microcode : 0x1
cpu MHz : 2600.000
cache size : 20480 KB
[...]
- Next I update the apt-get package lists and use apt-get to install git. It is necessary to use sudo because root privileges are required to install software on the VM.
sudo apt-get update
sudo apt-get -y -qq install git
I verify it is now installed.
git --version
git version 2.11.0
- Using Git, I ingest some data from the United States Geological Survey. To do this, I clone a publicly available Google Cloud GitHub repo that hosts a Bash script for the download.
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd training-data-analyst/courses/machine_learning/deepdive/01_googleml/earthquakes
The script is ingest.sh.
~/training-data-analyst/courses/machine_learning/deepdive/01_googleml/earthquakes$ ls -la
total 32
drwxr-xr-x 3 google751886_student google751886_student 4096 Jul 29 15:50 .
drwxr-xr-x 8 google751886_student google751886_student 4096 Jul 29 15:50 ..
-rw-r--r-- 1 google751886_student google751886_student 637 Jul 29 15:50 commands.sh
-rw-r--r-- 1 google751886_student google751886_student 751 Jul 29 15:50 earthquakes.htm
-rwxr-xr-x 1 google751886_student google751886_student 759 Jul 29 15:50 ingest.sh
I run the script to download the data.
Then I verify that it was downloaded.
head earthquakes.csv
time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,net,id,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
2018-07-29T15:52:16.650Z,19.4193325,-155.2653351,0.84,2.24,ml,13,65,0.005549,0.22,hv,hv70493867,2018-07-29T15:57:57.590Z,"3 km WSW of Volcano, Hawaii",earthquake,0.34,0.26,0.27,11,automatic,hv,hv
The file is stored on the Compute Engine instance’s disk.
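Beyond eyeballing the output of head, the download can be sanity-checked by parsing it. The sketch below uses the header and the single record shown above, with the header's wrapped field read as magNst. The helper name `read_quakes` is made up, and the sketch assumes every numeric field is present, which may not hold for every row of the real file.

```python
import csv
import io

# Header and first record as shown in the `head earthquakes.csv` output.
SAMPLE = (
    "time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,net,id,"
    "updated,place,type,horizontalError,depthError,magError,magNst,status,"
    "locationSource,magSource\n"
    "2018-07-29T15:52:16.650Z,19.4193325,-155.2653351,0.84,2.24,ml,13,65,"
    "0.005549,0.22,hv,hv70493867,2018-07-29T15:57:57.590Z,"
    '"3 km WSW of Volcano, Hawaii",earthquake,0.34,0.26,0.27,11,automatic,'
    "hv,hv\n"
)


def read_quakes(text):
    """Parse USGS-style CSV text into dicts with numeric coordinates."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Assumption: these fields are never empty in the rows we read.
        row["latitude"] = float(row["latitude"])
        row["longitude"] = float(row["longitude"])
        row["mag"] = float(row["mag"])
        rows.append(row)
    return rows


quakes = read_quakes(SAMPLE)
print(len(quakes), quakes[0]["place"], quakes[0]["mag"])
```

The csv module is used rather than a plain split on commas because the place field is quoted and contains a comma of its own.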
- Now I run a Google-provided Python script to transform the data. To do this, I first install a few missing Python packages on the Compute Engine instance using another Google-provided Bash script. The Python libraries that script installs include Basemap, which allows users to draw geographic maps, Numpy, a library for numeric manipulation, and Matplotlib, a basic MATLAB-style plotting library.
Then I run the actual script, which creates a .png image file.
- The next step is to create a "bucket" on Google Cloud Storage. From cloud.google.com, I click the menu (three bars), then "Storage," and from there "Create Bucket." For Cloud ML Engine, only certain regions can be used. For this lab I use a "Regional" storage class with the region set to "us-east1." For the name of the bucket, I use the Google Cloud project ID, which is a unique value. If the bucket will be used to store website data, using the website URL for the bucket name would be appropriate. The important thing is that the name be unique. Unless there is a good reason not to, best practice is to locate the Compute Engine instance and the Cloud Storage bucket in the same region. This reduces the latency involved in getting the data.
- To place the data into the newly-created bucket, I execute the following command, where I have removed the actual bucket name.
gsutil cp earthquakes.* gs://<BUCKET-NAME>/earthquakes/
The first time I ran this command, I received a “403 Oauth error.” To remedy this, I had to go through all the preceding steps again, this time creating the Compute Engine instance with its “Cloud API Access Scope” set to “Allow full access to all Cloud APIs.” With that done, I was able to copy from the Compute Engine instance to Cloud Storage with no issues.
- Next, within the storage bucket, I publicly share the three files I just placed there. To do this, in the "bucket details" view of the GCP console, I tick the "Share Publicly" checkboxes, which reveals the publicly available link. Generally, the link is given as 'storage.googleapis.com/' + project + '/' + the appropriate folder structure.
- Finally, I delete the Compute Engine instance by clicking the ... next to the VM entry under the "Compute Engine" section of the GCP console, and then selecting Delete. Deleting the Compute Engine instance does not have any effect on the Cloud Storage bucket.
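The public link pattern described in the sharing step can be written out as a small helper. This is a sketch under assumptions: `public_url` is a made-up name, "my-project-id" is a placeholder for the bucket name (in this lab the bucket is named after the project ID, which is not shown), and the https:// prefix is added here for completeness.

```python
from urllib.parse import quote


def public_url(bucket, object_path):
    """Build the public link for an object shared from a bucket."""
    # quote() escapes characters such as spaces but keeps "/" separators.
    return "https://storage.googleapis.com/" + bucket + "/" + quote(object_path)


# "my-project-id" is a placeholder; the lab's real bucket name is not shown.
print(public_url("my-project-id", "earthquakes/earthquakes.csv"))
```

The link only resolves for objects that have actually been shared publicly; building the string does not change any permissions.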
In summary, this introductory lab involved using GCP as rented infrastructure. It entailed spinning up a Compute Engine instance, installing custom software onto it, running a processing job, and storing the results onto Cloud Storage. At the close of the lab, Google points out that it is possible to abstract steps like those above away entirely, and to work with computation problems simply as software that needs to run.
The approach used here was very traditional. These are the fundamentals that managed services are built on, but future labs in this specialization will not require manually provisioning VMs and installing software on them. Rather, we will give Google the code that we need to have run, and the result will be the effect of having run that code. In the future, we will interact with GCP in a higher-level way.
In general, we will not create a Compute Engine VM just for the purpose of running a few scripts. This is very wasteful. Instead, it is preferable to use a system called the Google Cloud Shell to execute simple developer tasks. It can be thought of as a “micro VM.” The advantage of using Google Cloud Shell is that many of the utilities we would otherwise need to install are already present; “git,” for example. The Google Cloud Shell is a very ephemeral VM. If it is not being used, it will be recycled in under an hour. In the future, we will use it to do things like start Google Datalab.
To access it, browse to console.cloud.google.com and click "Activate Google Cloud Shell" in the upper right.
The specialization is sponsored by Google Cloud and this particular course is presented by Valliappa Lakshmanan, or "Lak," a Technical Lead for Google Cloud's Big Data and Machine Learning professional services.
Some of this content is my personally-owned "Creator Content" as set forth in the Qwiklabs Terms of Service, Section 7.