Data and Visual Analytics Lecture Notes - Week 2

These are my lecture notes for week 2 of Data and Visual Analytics, CSE 6242, which I am taking as part of GA Tech’s OMSCS program.

The course material is presented by Professor Polo Chau, an Associate Director of Georgia Tech’s MS in Analytics program and an Assistant Professor in the College of Computing, specifically in the School of Computational Science and Engineering.

Data Collection

Three Primary Methods

There are three primary ways of collecting data. The simplest is to download the data directly as a CSV or other delimited file. Downloading via an API requires a bit more effort, possibly including writing a script or short program. Scraping/crawling involves examining the actual HTML of a website and identifying the elements that need to be captured. This parsing will probably also require the use of regular expressions, and may need to be redone because it is a brittle solution; webpages can change at any time.

Method    Download    API       Scrape/Crawl
Effort    Low         Medium    High

A few specific examples of downloadable data:

  • NYC Taxi data (11 GB), Fare (7.7 GB)
  • StackOverflow (xml)
  • Wikipedia (data dump)
  • Atlanta crime data (csv)

Data that can be collected via APIs:

  • Google Data API (Google Maps directions API)
  • Twitter (subset of all of Twitter)
  • Last.fm
  • Pandora (unofficial API)
  • Flickr
  • data.nasa.gov
  • data.gov

Data that needs to be scraped:

  • Amazon (reviews, product info)
  • ESPN
  • eBay
  • Google Play
  • Google Scholar

Scraping/Crawling

For each app on the Google Play store, there is a list of similar apps. It is possible to crawl Google Play, creating a network where each node is an app, and an edge connects two similar apps.

Scraping a webpage requires becoming familiar with the Developer Tools included with most modern browsers. The task is to uniquely identify elements of interest on the webpage. As an example, the app Shazam is located at https://play.google.com/store/apps/details?id=com.shazam.android. The string that uniquely identifies Shazam is com.shazam.android. Similarly for Spotify the unique identifier is com.spotify.music.
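As an aside, the unique identifier can also be pulled out of the URL programmatically. Below is a minimal Python sketch using the standard library's urllib.parse, applied to the Shazam URL above.

from urllib.parse import urlparse, parse_qs

url = "https://play.google.com/store/apps/details?id=com.shazam.android"

# Extract the "id" query parameter, which uniquely identifies the app.
query = parse_qs(urlparse(url).query)
print(query["id"][0])  # com.shazam.android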

On Google Chrome, open the developer tools by clicking the menu button (the three vertical dots in the top right) > More Tools > Developer Tools. Select Elements, then mouse over any element in the web page to highlight it and inspect its HTML. Specifically, in this case, mousing over the similar apps will show how the links to similar apps are identified within the HTML.

There are several popular web scraping libraries, including the following; a short Beautiful Soup sketch follows the list.

  • Selenium (multiple languages)
  • Beautiful Soup (Python)
  • Scrapy (Python)
  • JSoup (Java)
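Here is a minimal sketch using requests and Beautiful Soup to pull the similar-app identifiers from an app's page. The CSS selector is an assumption: it matches any anchor whose href points at another app detail page, since Google Play's actual markup changes frequently and would need to be confirmed in the developer tools.

import requests
from bs4 import BeautifulSoup

url = "https://play.google.com/store/apps/details?id=com.shazam.android"

# A browser-like User-Agent makes the request less likely to be blocked.
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "html.parser")

# Assumption: similar apps are linked via anchors whose href contains
# the app-details path. Extract the unique id from each link.
similar_ids = set()
for anchor in soup.select("a[href*='/store/apps/details?id=']"):
    similar_ids.add(anchor["href"].split("id=")[-1])

print(similar_ids)

Repeating this for each newly discovered id is the crawl that builds the similar-apps network described above.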

Two important considerations:

  • Depending on the web browser used, different web content may be served. The scraper may need a particular “web driver” or “user agent.”
  • Sometimes data may only show up after the user takes a certain action, such as clicking a button. Selenium can simulate many such actions; Beautiful Soup only parses the HTML it is given, so it cannot interact with the page itself. A short Selenium sketch follows this list.
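The sketch below sets a custom user agent and clicks a button before reading the page source. The button's selector is a made-up placeholder, since the real selector depends on the page being scraped.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Set a custom user agent so the site serves the expected content.
options = webdriver.ChromeOptions()
options.add_argument("user-agent=Mozilla/5.0")

driver = webdriver.Chrome(options=options)
driver.get("https://play.google.com/store/apps/details?id=com.shazam.android")

# Hypothetical selector: click a "show more"-style button so that
# content loaded on demand appears in the page source.
driver.find_element(By.CSS_SELECTOR, "button.show-more").click()

html = driver.page_source  # now includes the dynamically loaded content
driver.quit()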

Data Storage

Dr. Chau discusses two primary means of storing data.

CSV

The easiest way to store data is using a .csv, or “comma-separated values” file. An example line in a CSV file might be 1997,Ford,E350. This method is popular because anyone with a text editor (so, anyone) can do it.

There are two primary downsides to CSV files.

  1. The data is all stored in a single table, so data cannot be flexibly combined/joined/aggregated using SQL methods.
  2. The data itself may contain commas. Quoting conventions exist to work around this issue, such as wrapping those fields in double quotes, but then a field may itself contain quotes as well as commas, and so on; see the sketch after this list.
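Python's standard csv module implements these quoting rules, which is usually preferable to hand-rolled parsing. A minimal sketch (the file name and rows are made up for illustration):

import csv

# One field contains a comma and embedded quotes.
rows = [[1997, "Ford", "E350", 'Luxury truck, "extended" edition']]

# The writer quotes fields as needed; the reader undoes the quoting.
with open("cars.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

with open("cars.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)  # ['1997', 'Ford', 'E350', 'Luxury truck, "extended" edition']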

SQLite

SQLite is the most popular embedded database in the world. It is used in everything from iTunes, iOS, and Android to Adobe Reader, Dropbox, and even the flight software for the Airbus A350 XWB. It has also been included in Python's standard library (as the sqlite3 module) since version 2.5.
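Because sqlite3 ships with Python, a database can be created and queried without installing anything. A minimal sketch, mirroring the command-line examples later in these notes:

import sqlite3

# Connect to (and create, if absent) a database file.
conn = sqlite3.connect("database.db")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS student(id integer, name text)")
cur.execute("INSERT INTO student VALUES(?, ?)", (111, "Smith"))
conn.commit()

for row in cur.execute("SELECT * FROM student"):
    print(row)  # (111, 'Smith')

conn.close()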

There are several advantages to using SQLite.

  • Self-contained system: one file contains the data as well as the schema.
  • Serverless: no separate database server process is required; the library reads and writes the database file directly.
  • Zero-configuration: no setup is required.

For these and other reasons, Dr. Chau recommends using SQLite for data storage.

Practical SQLite

Starting up SQLite on a Mac is as simple as executing the following command in the terminal, where database.db can be replaced by the name of the database.

sqlite3 database.db

That command will create a database file in the current working directory. This will also start the SQLite command prompt, which can be exited with the command .exit.

Ryans-MacBook-Pro:Desktop ryanwingate$ sqlite3 database.db
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
sqlite>

The commands in the following sections are entered in the SQLite command prompt.

The statement below creates a table called student with an integer field called id and a text field called name.

CREATE TABLE student(id integer, name text);

The command .schema is specific to SQLite; it prints out the schema of the database. In this case, it echoes back the CREATE TABLE statement issued previously.

.schema
CREATE TABLE student(id integer, name text);

Once the table schema is set up, data can be entered into it.

INSERT INTO student VALUES(111, 'Smith');
INSERT INTO student VALUES(222, 'Johnson');
INSERT INTO student VALUES(333, 'Lee');
SELECT * FROM student;

The SELECT statement above returns the following.

111|Smith
222|Johnson
333|Lee

Following creation of a second table, the .schema command returns multiple lines.

CREATE TABLE takes(id integer, course_id integer, grade integer);
.schema
CREATE TABLE student(id integer, name text);
CREATE TABLE takes(id integer, course_id integer, grade integer);

Insert a few entries into the takes table.

INSERT INTO takes VALUES(111, 6242, 100);
INSERT INTO takes VALUES(222, 6242, 90);
INSERT INTO takes VALUES(222, 6242, 80);

With two tables, it is possible to perform more elaborate select statements that include joins.

SELECT name FROM student, takes
 WHERE student.id = takes.id
   AND takes.course_id = 6242;
Smith
Johnson
Johnson

Likewise, aggregate functions are possible.

SELECT id, avg(grade)
  FROM takes
 GROUP BY id;
111|100.0
222|85.0

Typical SQL filtering is possible in SQLite as well; here, HAVING restricts the output to groups whose average grade exceeds 90.

SELECT id, avg(grade)
  FROM takes
 GROUP BY id
HAVING AVG(grade) > 90;
111|100.0

Professor Chau highly recommends taking a database course to obtain more in-depth knowledge of these topics, because SQL and database knowledge are crucial parts of data science.

As is true for almost all flavors of SQL, the following query ordering is required.

 SELECT a1, a2, ..., an
   FROM t1, t2, ..., tm
  WHERE predicate
 [GROUP BY ...]
[HAVING ...]
 [ORDER BY ...]
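As an illustration, a query against the takes table that exercises every clause in this order might look like the following; the output shown assumes the sample rows inserted above.

SELECT id, avg(grade)
  FROM takes
 WHERE course_id = 6242
 GROUP BY id
HAVING avg(grade) > 80
 ORDER BY avg(grade) DESC;
111|100.0
222|85.0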

Indexes

Indexes are a means of speeding up queries of a SQLite database. Internally, SQLite uses a B-tree index structure, which enables O(log n) time for adding, finding, and deleting items. The index is necessary to obtain this speed, however. Consider a table of 1,000,000 records. Without an index, a sequential scan is required, touching 500,000 records on average. With an index, the average number of lookups drops to about six; a B-tree with a branching factor of 10, for example, needs roughly log10(1,000,000) = 6 page reads.

As an example, the following statement creates an index on the student table's id field.

CREATE INDEX student_id_index
    ON student(id);

This index is reflected in the schema.
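Running .schema again lists the index alongside the two tables.

.schema
CREATE TABLE student(id integer, name text);
CREATE TABLE takes(id integer, course_id integer, grade integer);
CREATE INDEX student_id_index
    ON student(id);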
