Data and Visual Analytics Lecture Notes - Week 1

These are my lecture notes for week 1 of Data and Visual Analytics, CSE 6242, which I am taking as part of GA Tech’s OMSCS program.

The course material is presented by Professor Polo Chau, an Associate Director of Georgia Tech’s MS in Analytics program and an Assistant Professor in the College of Computing, specifically in the School of Computational Science and Engineering.

Course Introduction

Dr. Chau’s Research

Professor Polo Chau’s research interests lie in very large datasets (truly “big data”). His research group is called the Polo Club of Data Science. The datasets that interest him include the following.

| Label | Size | Nodes | Edges |
| --- | --- | --- | --- |
| The Internet | 50 Billion Pages | Pages | Links |
| Facebook | 1.2 Billion Users | Users | Friendships |
| YahooWeb | - | 1.4 Billion | 6 Billion |
| Symantec Machine-File Graph | - | 1 Billion | 37 Billion |
| Twitter | - | 104 Million | 3.7 Billion |
| Phone Call Network | - | 30 Million | 260 Million |
  • Twitter, who follows whom, 500 million users
  • Amazon, who buys what, 120 million users
  • Cellphone network, who calls whom, 100 million users
  • Protein-protein interactions, 200 million possible interactions in human genome

Dr. Chau states that when designing a system to work with large datasets, it is very important to first test it on small datasets.

Data of these magnitudes can be exciting, but the limits of human cognition pose a serious problem. According to George A. Miller in 1956, the average human can only hold $7 \pm 2$ items in working memory. Dr. Chau says this has likely decreased with today's reliance on computers. So, the challenge is to distill truly huge data down to a very small number of relevant details, so the human mind can cope with it. Data science turns data into insights.

Specifically, his research consists of an innovative approach to analytics that combines data mining (which is automatic, involves summarization, clustering, and classification, and copes with millions of items) with HCI, or "human-computer interaction" (which is user-driven and interactive, involves interaction and visualization, and can only cope with thousands of items). This course, likewise, involves combining computation and human intuition. The course answers the question "how do we leverage human perception to analyze data?" The mission and vision are "scalable, interactive, usable tools for big data analytics."

| Computation | Interactive Visualization |
| --- | --- |
| Automatic | User-driven, interactive |
| Summarization, classification, clustering | Interaction, visualization |
| Millions of nodes (or more) | Thousands of nodes |

Both are approaches to making sense of network data.

"Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination." (attributed, perhaps apocryphally, to Einstein)

Specific Research Projects

Apolo: Combines machine learning and visualization to explore million-node graphs in real-time.
Carina: A follow-up to Apolo, which enables million-node graph exploration in a browser, not on a desktop.
Visage: Interactive visual graph querying. The code resembles SQL.
ActiVis: Visualization and interpretation of deep learning models, which are often considered black boxes. Currently deployed on Facebook's machine learning platform.

Dr. Chau’s primary research area is Cybersecurity. Projects in that area are the following:

Polonium & AESOP: Patented system with Symantec, finds malware in 37 billion file relationships.
NetProbe: Auction fraud detection on eBay.
MARCO: Detects fake Yelp reviews.

Why Do Data and Visual Analytics?

Professor Chau defines Data and Visual Analytics as the interdisciplinary science of combining computational techniques and interactive visualization to transform data in order to make discoveries or help with important decisions. Note that the end goals, "discovery" and "decisions," are a key part of this definition.

In recent years, data has grown enormously in size…

(Infographic from good.is.)

The size of today's data has made data analytics its own domain of study, called "data science." There are many facets, or "ingredients," to data science, including:

  • Storage,
  • Complex System Design,
  • Scalability of Algorithms,
  • Visualization Techniques,
  • Interaction Techniques,
  • Statistical Tests, and more.

More specifically, if the data are large, they may not fit on a single machine. A cluster will then be needed, which in turn means the design of the overall system must be considered. Because the data live on multiple machines, algorithm scalability must be considered, as well as the visualization techniques that can be employed. This is obviously a complex and evolving skillset, needed both by traditional businesses with growing amounts of data and by entirely new types of businesses that rely on data as a core part of their business model.

"The data scientist role is critical for organizations looking to extract insight from information assets for 'big data' initiatives and requires a broad combination of skills that may be better fulfilled as a team." (Gartner)

Data scientists do not currently have a well-defined job description, but there is consensus that the role requires a very broad set of skills and extensive collaboration within a large team. Professor Chau thinks this course will help develop a number of these important skills.

Course Goals

  • Learn visual and computational techniques, and use them in a complementary manner.
  • Gain a breadth of knowledge.
  • Learn practical know-how by working on real data and problems.

The course schedule roughly follows the analytics building blocks listed below, which will be explored in much more depth in subsequent parts of the course. (In the lecture slides, colors indicate larger groupings of blocks.) The building blocks are conceived of as parts of a larger, fluid process, not as a series of rigid steps.

  • Collection,
  • Cleaning,
  • Integration,
  • Analysis,
  • Visualization,
  • Presentation, and
  • Dissemination.

In particular, the data types can inform algorithm selection and visualization type, and visualization type can inform data cleaning and algorithm selection. The progression through the stages is not linear, but iterative.

Analytics Building Blocks

Example Project: Apolo

Visualizing very large datasets often produces a type of image Professor Polo refers to as a "beautiful hairball." These images can be beautiful (examples here), but they are not very useful for actually deriving insights about the dataset. The goal of Apolo is to help determine which features of these large datasets are worth exploring. It accomplishes this by finding the most relevant nodes for users to home in on.

An example application Polo introduces is a citation network. Academic papers generally contain large numbers of references, or citations, of other works. Each paper can be visualized as a node, and each citation as an edge. So, researching the literature on a given topic amounts to traversing the citation graph.
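To make that representation concrete, here is a minimal sketch of a citation graph as a directed graph in Python, using networkx; the paper identifiers below are hypothetical placeholders rather than data from the lecture.

```python
# Minimal sketch of a citation graph (hypothetical paper IDs, not lecture data).
# Each paper is a node; a directed edge (a, b) means "paper a cites paper b".
import networkx as nx

citations = nx.DiGraph()
citations.add_edge("russell_1993_sensemaking", "norman_1988_design")
citations.add_edge("pirolli_2005_sensemaking", "russell_1993_sensemaking")
citations.add_edge("card_1999_infovis", "russell_1993_sensemaking")

# Researching the literature on a topic then amounts to traversing this graph.
seed = "russell_1993_sensemaking"
print("Papers the seed cites:", list(citations.successors(seed)))
print("Papers citing the seed:", list(citations.predecessors(seed)))
```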

Apolo is a tool for exploring large graph data. It uses machine learning in combination with visualization techniques. In a demo, Professor Polo uses Apolo to analyze the citation network for the "sensemaking" literature in the human-computer interaction field. The graph for his demo contained 80K papers (nodes) crawled from Google Scholar and 150K citations (edges) among them.

To begin the demo, he starts with a single node that represents a seminal paper. From a list of papers that cite that seminal paper, he drags a few additional nodes into the graph, labeling them with appropriate terms, "search" and "info visualization." These form the starting points for groups of additional papers; the colors for the groups are assigned automatically. The two newly added papers then become the starting points for discovering additional papers, some of which are also added to the groups already in the graph. The size of each node scales with the number of citations, and the color saturation scales with the paper's relevance.

Summary: First, the user specifies exemplars. Then, Apolo finds other relevant nodes using belief propagation.
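As a rough illustration of the guilt-by-association idea behind that second step (a simplified score-spreading loop, not Apolo's actual belief propagation implementation), the sketch below diffuses relevance from user-chosen exemplar nodes to their neighbors over a few iterations; the graph, node names, and damping parameter are all made up.

```python
# Simplified "guilt by association" relevance spreading (not Apolo's real
# belief propagation). Exemplars chosen by the user start at score 1.0;
# relevance then diffuses along edges, so nodes close to many exemplars
# end up with higher scores.
def spread_relevance(neighbors, exemplars, rounds=5, damping=0.5):
    scores = {node: 0.0 for node in neighbors}
    for node in exemplars:
        scores[node] = 1.0
    for _ in range(rounds):
        updated = {}
        for node, nbrs in neighbors.items():
            avg = sum(scores[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
            updated[node] = damping * avg
        for node in exemplars:      # exemplars keep their user-assigned label
            updated[node] = 1.0
        scores = updated
    return scores

# Toy undirected adjacency list with hypothetical paper IDs.
graph = {
    "seminal_paper": ["paper_a", "paper_b"],
    "paper_a": ["seminal_paper", "paper_c"],
    "paper_b": ["seminal_paper"],
    "paper_c": ["paper_a"],
}
print(spread_relevance(graph, exemplars={"seminal_paper"}))
```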

| Analytics Building Block | Specific Steps |
| --- | --- |
| Collection | Scrape Google Scholar (no API available) |
| Cleaning | - |
| Integration | - |
| Analysis | Design inference algorithm: which nodes are next? |
| Visualization | Interactive visualizations shown during demo |
| Presentation | Paper, talks, lectures |
| Dissemination | In progress; currently working to recreate Apolo as a web app |

Example Project: NetProbe

The problem NetProbe was developed to solve is bad sellers (Professor Polo calls them “fraudsters”) on eBay who routinely commit non-delivery fraud. Non-delivery fraud entails selling an item to a customer, but never delivering it. NetProbe works by examining the network of buyers and sellers on eBay.

eBay fraudsters routinely interact with a large number of accomplices who do business with them to boost the fraudsters' reputation on eBay. Those accomplices are often second accounts the fraudster himself created. The accomplices usually appear to be legitimate individuals who may also use eBay to interact with honest customers, but who also make "heavyweight transactions" with fraudster accounts in order to fabricate the fraudster's reputation. Importantly, accomplice accounts often interact with a large set of fraudsters.

The fraudsters are discovered using belief propagation in combination with actual fraud reports from honest customers. Once an account has been flagged as a fraudster, its trading history can be used to identify possible accomplices. Other likely fraudsters can then be identified on the basis of having interacted with those accomplices.
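The sketch below illustrates that two-hop intuition on a tiny, made-up transaction graph. The real NetProbe system runs belief propagation over fraudster, accomplice, and honest states; this simple counting heuristic is only meant to show how shared accomplices expose additional fraudsters.

```python
# Toy illustration of NetProbe's intuition (NOT its actual belief propagation
# algorithm): accounts trading with many reported fraudsters look like
# accomplices, and unflagged accounts trading with several accomplices
# become suspects. All accounts and transactions are made up.
from collections import Counter

transactions = [                      # undirected trading edges
    ("fraudster1", "accomplice1"), ("fraudster1", "accomplice2"),
    ("fraudster2", "accomplice1"), ("fraudster2", "accomplice2"),
    ("suspect", "accomplice1"), ("suspect", "accomplice2"),
    ("honest1", "honest2"),
]
reported_fraudsters = {"fraudster1", "fraudster2"}

# Step 1: candidate accomplices trade with two or more reported fraudsters.
accomplice_counts = Counter(
    other
    for a, b in transactions
    for other, partner in ((a, b), (b, a))
    if partner in reported_fraudsters and other not in reported_fraudsters
)
accomplices = {acct for acct, n in accomplice_counts.items() if n >= 2}

# Step 2: unflagged accounts trading with two or more accomplices are suspects.
suspect_counts = Counter(
    other
    for a, b in transactions
    for other, partner in ((a, b), (b, a))
    if partner in accomplices
    and other not in reported_fraudsters
    and other not in accomplices
)
print({acct for acct, n in suspect_counts.items() if n >= 2})  # {'suspect'}
```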

| Analytics Building Block | Specific Steps |
| --- | --- |
| Collection | Scraping |
| Cleaning | - |
| Integration | - |
| Analysis | Design detection algorithm |
| Visualization | - |
| Presentation | Paper, talks, lectures |
| Dissemination | Not released |

Data Science Buzzwords

The data science field, being young and full of promise, is currently rife with “buzzwords.” Professor Polo defines buzzwords as words or phrases that become fashionable within a certain period of time.

The Buzzword Hype Cycle

Gartner defines the buzzword hype cycle using a graph of expectations as a function of time. Specifically, Gartner identifies several phases that a given technology passes through, including the following.

| Phase | Expectations |
| --- | --- |
| Innovation Trigger | Rapidly ramping expectations as knowledge of the innovation grows |
| Peak of Inflated Expectations | Expectations grow to a peak, then begin to fall |
| Trough of Disillusionment | Steep plummet as new difficulties and realities associated with the technology emerge |
| Slope of Enlightenment | Slow increase in expectations as many of the difficulties are overcome and collective, realistic knowledge disseminates |
| Plateau of Productivity | Expectations level off as the technology becomes commonplace and widely available |

Obviously, the specific trajectory a given technology follows on the hype cycle graph will vary with how the technology develops and what headwinds it encounters; some technologies may appear very promising but then fail to deliver entirely. Professor Polo also disagrees with where Gartner has placed some specific technologies on the hype cycle. Still, he says this is a reasonable rough average trajectory that many emerging technologies follow.

Professor Polo’s advice: given the constantly evolving nature of technology, the right approach is to constantly learn new things, and be cautiously optimistic about emerging technologies. Try to understand the deeper reason behind the hype and popularity. What is the promise of the technology?

Anyone pursuing a data science career should be aware of new, emerging technologies and should constantly be learning about their promise.

Buzzphrase: Artificial Intelligence

Artificial Intelligence is one of the biggest buzzwords of recent years. Two recent positive examples of AI in the news are self-driving taxis in Singapore and Google’s AlphaGo. Two recent negative examples of AI in the news are a Microsoft system called Tay that learned to make racist comments, and a Tesla vehicle that did not notice another vehicle, resulting in a collision that killed the driver.

We are currently in what Professor Polo calls the "Third Wave of AI." There have been two previous "AI winters," periods in which interest in AI decreased and funding for AI research was cut. The idea of artificial intelligence was born in the 1950s and, due to the limits of computation at the time, had its first winter in the 1970s. The second winter came in the early 1980s.

In the early 1990s, "machine learning" began growing in popularity. Machine learning was a means of scoping down the promise of AI and focusing on problems that could be solved more easily. From the start, machine learning was a means of containing hype: the idea was to focus on small, narrowly defined, specific tasks and to deliver on promises. One example of machine learning that has already been deployed is the USPS's use of OCR to read addresses.

Artificial Intelligence, unfortunately, has become a very broad term in popular usage. Many systems are described as using artificial intelligence when really they use more narrowly-prescribed machine learning techniques. Professor Polo states that a good way of distinguishing between AI and ML is to read a White House report on the subject.

Three important terms are defined on pages 7 and 8 of the report.

Narrow AI

Remarkable progress has been made on what is known as Narrow AI, which addresses specific application areas such as playing strategic games, language translation, self-driving vehicles, and image recognition. Narrow AI underpins many commercial services such as trip planning, shopper recommendation systems, and ad targeting, and is finding important applications in medical diagnosis, education, and scientific research.

There seems to be some overlap between the White House’s definitions of narrow AI and ML. Image recognition, as an example, is called a narrow AI system in the quote above, but current image recognition systems rely heavily on machine learning techniques.

General AI

General AI (sometimes called Artificial General Intelligence, or AGI) refers to a notional future AI system that exhibits apparently intelligent behavior at least as advanced as a person across the full range of cognitive tasks. A broad chasm seems to separate today’s Narrow AI from the much more difficult challenge of General AI. Attempts to reach General AI by expanding Narrow AI solutions have made little headway over many decades of research. The current consensus of the private-sector expert community, with which the NSTC Committee on Technology concurs, is that General AI will not be achieved for at least decades.

Fictional examples of General AI are Agent Smith in The Matrix and the Terminators in The Terminator.

Machine Learning

Machine learning is one of the most important technical approaches to AI and the basis of many recent advances and commercial applications of AI. Modern machine learning is a statistical process that starts with a body of data and tries to derive a rule or procedure that explains the data or can predict future data.

Machine learning is very closely related to the more general field of analytics because most machine learning techniques require large amounts of input data for “training.” Arguably, the advent of “big data” was a necessary precondition for the rapid growth in demand for machine learning engineers and data scientists in recent years.
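As a small illustration of the report's definition (start with a body of data, derive a rule, predict future data), here is a sketch using scikit-learn; the feature, labels, and numbers are made-up toy values, not anything from the course.

```python
# Toy machine learning example: derive a rule from data, then predict.
# The data (hours studied vs. pass/fail) are invented for illustration.
from sklearn.linear_model import LogisticRegression

X_train = [[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]]  # hours studied
y_train = [0, 0, 0, 1, 1, 1]                            # 0 = fail, 1 = pass

model = LogisticRegression()
model.fit(X_train, y_train)        # "training": derive the rule from the data

# Use the learned rule to predict future (unseen) data.
print(model.predict([[2.5], [7.5]]))   # expected roughly [0, 1]
```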
