“Data Science for Business” Ch 1 Key Concepts

These are my notes on chapter 1 of Data Science for Business by Foster Provost and Tom Fawcett (O’Reilly). Copyright 2013 Foster Provost and Tom Fawcett, 978-1-449-36132-7.

It is recommended reading for my Data and Visual Analytics course, CSE 6242, which I am taking as part of GA Tech’s OMSCS program. The book has received high praise and deserves it. Highly recommended.

Over the course of the past 20 years, companies have invested significantly in IT infrastructure. This infrastructure is the enabling technology behind today’s massive datasets. The book lists several sources of data in the modern corporation, including customer behavior, marketing campaign performance, operations, supply chain, etc. Companies the world over have realized the value of this data, and this has resulted in the emergence of a new field: data science.

The book has two primary goals: (1) help the reader view business problems with data in mind, and (2) understand the process by which useful knowledge can be obtained from data. The book refers to this as “data-analytic thinking.”

Data mining and data science are terms that are often used interchangeably. But, technically, the terms are related as follows: data mining is the practical application of data science principles. I think the analogy is something like:
Data Mining : Data Science :: Engineering : Science.

Data Science, fundamentally, consists of the set of techniques, processes, and principles required to gain deeper understanding of complex phenomena based upon the automated analysis of relevant data. The goal of data science is to improve decision making beyond what is possible with simple intuition. The jargon-y term for this is “Data-Driven Decision making.”

Big Data Processing

Data engineering and data processing are critical functions that support data science, but they are not data science. Data engineering and processing might involve modern web system processing, online advertising campaign management, or efficient transaction processing. Specific “Big Data” technologies include MongoDB, Hadoop, and HBase. The term “Big Data” essentially means datasets that are too large for traditional processing systems and therefore require new technology to cope with the scale.

The authors of the book draw an analogy between “Web 1.0” and the current state of analytics adoption. During Web 1.0, companies were primarily concerned with basic web infrastructure, such as establishing a site, building e-commerce capability, and improving operational efficiency. A subsequent mindset shift led companies to actually begin developing business models based on the new capabilities of the web, and “Web 2.0” began. Similarly, as companies become able to process massive data flexibly, the authors say we should expect a similar mindset shift where the companies begin to consider analytics capability as an important strategic asset.

Data and Data Science Capabilities: Strategic Assets

Analytics capabilities require both substantial data and the talent to derive value from it. The best data science team cannot generate value without data, and the best dataset in the world is useless without the appropriate talent. Investing in one while neglecting the other will not create value for the business.

One of the aims of this book is to encourage a perspective of approaching business problems “data-analytically,” which the authors define as approaching problems with the aim of determining whether and how data could be used to solve them. Digital companies the world over already excel at this (think Amazon, Facebook, Twitter, Google), but companies that work in more traditional industries will, increasingly, need to invest in analytics capabilities to keep up with their competition.

The authors are at pains to point out that developing analytics capabilities also requires close collaboration between data science teams and business units. The business decision makers and the data science teams each need to understand what the other is doing. A lack of communication or, worse, a serious disconnect, could be disastrous.

The Aim of the Book

This book focuses on the fundamentals of data science and data mining. That focus allows a fairly in-depth understanding of data science processes and methods without getting to the level of a mathematical discussion of the actual algorithms.

A few of the specific concepts that will be investigated are listed below. The book will explore 12 concepts of this kind.

  • Extracting knowledge from data to solve business problems can be handled systematically by following step-by-step processes.
  • Information technology can be used to discover descriptive attributes of important entities, based on large masses of data.
  • Look too hard at a dataset and you will find something, but it may not generalize beyond the dataset you have (“overfitting”).
  • The context within which a particular data analytic solution is applied matters greatly.
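
The overfitting point in the list above can be made concrete with a small simulation. This is my own minimal sketch in Python, not an example from the book. The function names, the sizes (200 features, 40 training rows, 1000 holdout rows), and the seed are all assumptions chosen for illustration. Every feature and every label is an independent coin flip, so there is no real pattern to find. Even so, searching across enough features turns up one that appears to predict the training labels well, and that apparent pattern disappears on data the search never saw.

```python
import random


def accuracy(feature_column, labels):
    """Fraction of rows where a binary feature equals the binary label."""
    matches = sum(1 for f, y in zip(feature_column, labels) if f == y)
    return matches / len(labels)


def random_bits(rng, n):
    """Return n independent fair coin flips as 0/1 values."""
    return [rng.randint(0, 1) for _ in range(n)]


def search_for_pattern(n_features=200, n_train=40, n_holdout=1000, seed=0):
    """Pick the feature that best matches the training labels, then
    score that same feature on holdout data it was not chosen on."""
    rng = random.Random(seed)

    # Labels and features are pure noise: no feature truly predicts the label.
    train_labels = random_bits(rng, n_train)
    holdout_labels = random_bits(rng, n_holdout)
    train_features = [random_bits(rng, n_train) for _ in range(n_features)]
    holdout_features = [random_bits(rng, n_holdout) for _ in range(n_features)]

    # "Looking too hard": keep whichever feature fits the training set best.
    best = max(
        range(n_features),
        key=lambda j: accuracy(train_features[j], train_labels),
    )

    return (
        accuracy(train_features[best], train_labels),
        accuracy(holdout_features[best], holdout_labels),
    )


train_acc, holdout_acc = search_for_pattern()
print(f"training accuracy of best feature: {train_acc:.2f}")
print(f"holdout accuracy of same feature:  {holdout_acc:.2f}")
```

The training accuracy should come out well above chance, while the holdout accuracy should sit near 0.5. The gap between the two numbers is the overfitting: the "pattern" was produced by the search itself and is not a property of the data.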

Data Science versus Engineering, Again

Data Science is a younger field, even, than computer science. The authors state that the current state of data science can be compared to chemistry in the mid 1800s, when the field was still very experimental. At that time, every chemist was also a lab tech. This is reflected in job postings for data scientists, where there is little distinction between the analytical techniques employed and the popular tools used. Currently, the tools are often synonymous with the technique.

This book focuses on the science and not the technology. The general principles this book explores will be more enduring than the specific software currently used to apply the principles.
