“Data Science from Scratch” by Joel Grus

Summary & Discussion

Joel Grus’s Data Science from Scratch is an excellent and unique guide to understanding this new field. It distinguishes itself from similar Data Science books by building up the reader’s understanding from first principles, without relying on the many libraries available to Data Scientists, such as NumPy and pandas.

Regarding such libraries, Joel Grus says on page xii: “They are great for doing Data Science. But they are also a good way to start doing Data Science without actually understanding Data Science.”
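To give a flavor of what working “from scratch” can look like, here is a minimal sketch of my own (it is not code taken from the book): the mean, sample variance, and standard deviation written in plain Python, with no NumPy or pandas. The function names and the sample data are my own choices for illustration.

```python
import math
from typing import List


def mean(xs: List[float]) -> float:
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(xs) / len(xs)


def variance(xs: List[float]) -> float:
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    assert len(xs) >= 2, "variance requires at least two values"
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


def standard_deviation(xs: List[float]) -> float:
    """Square root of the sample variance."""
    return math.sqrt(variance(xs))


# Made-up data, purely for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(data), variance(data), standard_deviation(data))
```

Writing these out by hand makes choices visible that a library call would hide, such as dividing by n - 1 rather than n.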

Recall the “Danger Zone” of Drew Conway’s famous Data Science Venn Diagram. Drew Conway comments on this area: “It is from this part of the diagram that the phrase ‘lies, damned lies, and statistics’ emanates, because either through ignorance or malice this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created” (source).

This book’s purpose is to directly address this potential problem. Assuming little to no prior knowledge, it builds a complete understanding of Data Science from the ground up. Diligent readers therefore develop a strong grasp of the basic principles upon which Data Science is built. This understanding is crucial in fields like Data Science, where practitioners make judgments based on information that can be multiple abstraction layers away from the actual, raw data. For better or for worse, today’s computers afford very easy and fast computation. This cheap computational horsepower can be misused by people who don’t know (or forget) the mathematical or statistical basis for their reasoning, or who are simply looking for quick results.

With Joel Grus’s help, hopefully this can be avoided.


I will be relying heavily on this book in building my own library of Data Science notes. Joel Grus has graciously made all the data and source code for the book freely available online, and I expect to reproduce and comment on large parts of it in my notes as well.

As part of my ongoing attempt to understand the knowledge base a Data Scientist requires, I’ve distilled the book’s contents below, in the order in which they appear.

Major Contents

Basics

  • Python
  • Visualizing Data
    • Matplotlib
    • Bar Charts, Line Charts, Scatterplots

Math & Stats

  • Linear Algebra
  • Statistics
    • Central Tendencies, Dispersion
    • Correlation & Causation
    • Simpson’s Paradox
  • Probability
    • Dependence / Independence
    • Bayes’s Theorem
    • Continuous Distributions, Normal Distribution, Central Limit Theorem
  • Hypothesis Testing & Inference
    • p-Values
    • Confidence Intervals
  • Gradient Descent

General Approach to Data Analysis

  • Getting Data
    • Reading Files
    • Scraping the Web
    • JSON & XML
  • Working with Data
    • Exploring
    • Cleaning & Munging
    • Manipulating

Machine Learning

  • Machine Learning (…)
  • k-Nearest Neighbors (…)
  • Naive Bayes (…)
  • Simple Linear Regression (…)
  • Multiple Regression (…)
  • Logistic Regression (…)
  • Decision Trees (…)
  • Neural Networks (…)
  • Clustering (…)
  • Natural Language Processing (…)
  • Network Analysis (…)
  • Recommender Systems (…)

Other Tools

  • Databases and SQL (…)
  • MapReduce (…)

Items marked with (…) may be further subdivided in the future.


Content for this article is taken from:

Data Science from Scratch by Joel Grus (O’Reilly). Copyright 2015 Joel Grus, 978-1-4919-0142-7. Get it here.