3rd Wave of Cloud

Cloud Shell

In general, we will not create a Compute Engine VM just to run a few scripts; that would be wasteful. Instead, it is preferable to use Google Cloud Shell to execute simple developer tasks. It can be thought of as a “micro VM.” The advantage of Cloud Shell is that many of the utilities we would otherwise need to install, “git” for example, are already present. Cloud Shell is a very ephemeral VM: if it sits idle, it is recycled within about an hour. In the future, we will use it to do things like start Google Datalab.

To access it, go to console.cloud.google.com and click Activate Cloud Shell in the upper right.

Third Wave of Cloud

The third wave of cloud is “fully-managed services” that scale for you.

There have been two previous “waves.” The second wave, the “Google Big Data Stack 1.0,” was characterized by “virtualized data centers” and rented infrastructure. It included technologies such as GFS and MapReduce, described in Google’s published papers, which surface on GCP in products like Cloud Dataproc. Those papers gave rise to the open source implementations Hadoop and HDFS. This wave lasted from 2002 until 2005, at which point the third wave, the “Google Big Data Stack 2.0,” began.

Third-wave Google products include BigQuery, Pub/Sub, Cloud Dataflow, and Cloud Spanner. The underlying technologies include Dremel, PubSub, FlumeJava, Megastore, MillWheel, Dataflow, TensorFlow, and Spanner. This wave is described as a fully elastic, scale-independent, serverless architecture that lets you forget about infrastructure entirely.

As more specific examples, Dremel essentially runs SQL queries and surfaces on GCP as BigQuery. Colossus is a file system that allows extremely high-throughput reads and surfaces on GCP as Google Cloud Storage.

BigQuery is a data warehouse. Lak demonstrates an example where he queries 276 MB (24 million rows) of publicly available data in a few seconds, without provisioning any virtual machines or installing any software; he just runs the query. The specific benefits of BigQuery include:

  • interactive, ad-hoc analysis of petabyte-scale databases,
  • the SQL 2011 query language and functions,
  • many ways to ingest, transform, load, and export data to BigQuery, including the web GUI, staging files in Cloud Storage, and streaming data into BigQuery (a streaming sketch follows this list), and
  • export of data from BigQuery via a variety of APIs or by saving it in any format that you want.
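
As an illustration of the streaming path, below is a minimal sketch that inserts rows into an existing table using the google-cloud-bigquery client library. The project, dataset, table, and column names are hypothetical placeholders, and the sketch assumes the library is installed and credentials are configured; the lab later in this section uses the Datalab-specific google.datalab.bigquery module instead.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical fully-qualified table name; replace with your own table.
table_id = "my-project.my_dataset.flight_delays"

rows = [
    {"departure_delay": -5.0, "arrival_delay": -12.0},
    {"departure_delay": 20.0, "arrival_delay": 35.0},
]

# insert_rows_json streams the rows into the table and returns a list of
# per-row errors, which is empty on success.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Streaming insert errors:", errors)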

Additionally, BigQuery offers

  • inexpensive data storage,
  • queries that are charged based on the amount of data processed (a dry-run cost check is sketched after this list), and
  • integration with Datalab for data analysis, so it is easy to export BigQuery results into a Pandas Dataframe and plot them using Python.
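
Because queries are billed by the bytes they scan, it can be worth checking a query's footprint before running it. The sketch below is one way to do that, using the dry-run mode of the google-cloud-bigquery client library; it assumes that library is installed and credentials are configured (it is not the Datalab module used in the lab below).

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT departure_delay, COUNT(1) AS num_flights
FROM `bigquery-samples.airline_ontime_data.flights`
GROUP BY departure_delay
"""

# A dry run validates the query and reports how many bytes it would scan,
# without executing it or incurring query charges.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)
print("This query would process about {:.1f} MB".format(job.total_bytes_processed / 1e6))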

Lab: Datalab and BigQuery

In this lab, I use BigQuery to calculate aggregates, percentile values, and other information over 70 million rows. These results are imported into a Pandas Dataframe with a dozen rows. This could take hours on other systems, but here it takes seconds. The temptation with less efficient systems is to sample the data, but this is bad practice for machine learning.

Lak likes to say that one of the differences between machine learning and statistics is the manner in which outliers are handled. In statistics, outliers are frequently removed, but this is bad practice in machine learning, where outliers should be learned. In order to learn outliers, you need to work with all your data. In this and other labs, we will use BigQuery to process data at scale, and then bring it back onto more familiar in-memory structures like Pandas Dataframes, and then plot them using other Python libraries.

To start Datalab, enter the following command into Cloud Shell. <ZONE> should be replaced with a zone from a geographically nearby region, chosen from us-east1, us-central1, asia-east1, or europe-west1. Those are the regions that, at the time I completed the Google cert, supported Cloud ML Engine jobs.

datalab create mydatalabvm --zone <ZONE>

To print the list of available zones, run the following.

gcloud compute zones list

The datalab create command can take up to 5 minutes to execute.

The BigQuery console is available from the GCP console. Into its query editor, I enter the following.

#standardSQL
SELECT
  departure_delay,
  COUNT(1) AS num_flights,
  APPROX_QUANTILES(arrival_delay, 5) AS arrival_delay_quantiles
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_delay
HAVING
  num_flights > 100
ORDER BY
  departure_delay ASC

The results include several rows of data; the first row is shown below.

Row  departure_delay  num_flights  arrival_delay_quantiles
1    -37.0            107          -66.0, -41.0, -30.0, -17.0, 33.0

The table is 7.95 GB, with 70,588,485 rows, yet the query above executes in a couple of seconds, because it is run across a very large number of distributed machines.

The interpretation of these values is that, in the dataset, there were 107 flights that departed 37 minutes early. Of those 107 flights, the earliest arrived 66 minutes early, the earliest 20% arrived between 66 and 41 minutes early, the next 20% arrived between 41 and 30 minutes early, and so on.

A different query over the same table follows. This query finds the pairs of airports with the greatest number of flights between them.

#standardSQL
SELECT
  departure_airport,
  arrival_airport,
  COUNT(1) AS num_flights
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_airport,
  arrival_airport
ORDER BY
  num_flights DESC
LIMIT
  10

Row  departure_airport  arrival_airport  num_flights
1    LAX                SAN              133394
2    SAN                LAX              133068
3    LAS                LAS              130381
4    LAS                LAX              128546
5    SFO                LAX              123488

Again, processing 70 million records took around 2.3 seconds, because the query is not run on a single laptop but across thousands of machines.

Datalab can also run BigQuery queries, as shown below. The deciles in the results come back as a Python list.

query="""
SELECT
  departure_delay,
  COUNT(1) AS num_flights,
  APPROX_QUANTILES(arrival_delay, 10) AS arrival_delay_deciles
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_delay
HAVING
  num_flights > 100
ORDER BY
  departure_delay ASC
"""

import google.datalab.bigquery as bq

# Run the query in BigQuery and pull the results into a Pandas Dataframe.
df = bq.Query(query).execute().result().to_dataframe()
df.head()
departure_delay num_flights arrival_delay_deciles
0 -37.0 107 [-66.0, -44.0, -41.0, -35.0, -30.0, -23.0, -17…
1 -36.0 139 [-74.0, -43.0, -39.0, -37.0, -32.0, -25.0, -18…
2 -35.0 191 [-68.0, -45.0, -40.0, -36.0, -28.0, -19.0, -14…
3 -34.0 195 [-58.0, -44.0, -40.0, -35.0, -30.0, -25.0, -19…
4 -33.0 227 [-59.0, -43.0, -39.0, -36.0, -32.0, -28.0, -20…

Convert the list of deciles to individual columns.

import pandas as pd

# Expand the list of deciles into one column per decile.
percentiles = df['arrival_delay_deciles'].apply(pd.Series)
# Column i holds the (i*10)th percentile, so rename columns 0..10 to "0%".."100%".
percentiles = percentiles.rename(columns = lambda x : str(x*10) + "%")
df = pd.concat([df['departure_delay'], percentiles], axis = 1)
df.head()
departure_delay 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
0 -37.0 -66.0 -44.0 -41.0 -35.0 -30.0 -23.0 -17.0 -12.0 -3.0 6.0 33.0
1 -36.0 -74.0 -43.0 -39.0 -37.0 -32.0 -25.0 -18.0 -14.0 -7.0 2.0 49.0
2 -35.0 -68.0 -45.0 -40.0 -36.0 -28.0 -19.0 -14.0 -8.0 -4.0 3.0 85.0
3 -34.0 -58.0 -44.0 -40.0 -35.0 -30.0 -25.0 -19.0 -14.0 -8.0 2.0 39.0
4 -33.0 -59.0 -43.0 -39.0 -36.0 -32.0 -28.0 -20.0 -14.0 -7.0 5.0 25.0

Datalab also has access to all of Python's plotting capabilities; for example, the following code plots the data.

# Drop the extremes (minimum and maximum) so that outliers do not dominate the plot.
without_extremes = df.drop(['0%', '100%'], axis=1)
without_extremes.plot(x='departure_delay', xlim=(-30,50), ylim=(-50,50));