Learning about data science

Here’s what I understand about data science after taking Coursera’s crash course on data science from Johns Hopkins University.

What data science is

Data science is the science of answering questions with data.

Use of statistics in data science

It usually involves statistics, which comprises four activities:

  1. Descriptive analysis, which is the exploration of the data to get an idea of what it tells; this can include creating graphs or visualizations of the raw data;
  2. Inference, which is the drawing of conclusions about a population from samples, such as estimating the number of voters from past polls; inference can be seen as the formalization of descriptive analysis;
  3. Prediction, which is an attempt to say what future data will look like based on current or past data; this takes the form of linear regression or other machine learning algorithms, as well as deep learning with constructs such as neural networks;
  4. Experimental design, which is the definition of the methods needed to conduct meaningful experiments; randomization of the population groups is a key aspect of experimental design, as it improves generalizability.
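As a minimal illustration of descriptive analysis (the first activity above), here is a sketch in Python using only the standard library; the height data is invented for the example.

```python
# Descriptive analysis sketch: summarize raw data before any modeling.
# The heights below are made-up sample values, not from the course.
import statistics

heights_cm = [172, 168, 181, 175, 169, 178, 174]

mean = statistics.mean(heights_cm)    # central tendency
stdev = statistics.stdev(heights_cm)  # spread (sample standard deviation)

print(f"mean={mean:.1f} cm, stdev={stdev:.1f} cm")
```

In a real project this step would also include plots (histograms, scatter plots) to spot patterns and outliers before any formal analysis.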

Machine learning

There are 2 main types of machine learning:

  1. Unsupervised learning: trying to see things previously unseen - for instance, groups of people within a population, such as “voters of the Republicans” or “voters of the Democrats”;
  2. Supervised learning: trying to predict something based on previously known data - for instance, trying to predict the height of a son based on the height of his father.
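The father/son example can be sketched as a tiny supervised-learning model: an ordinary least-squares line fitted by hand. The height pairs below are invented for illustration.

```python
# Supervised learning sketch: fit son_height = slope * father_height + intercept
# by ordinary least squares. The data pairs are invented, not from the course.
fathers = [165.0, 170.0, 175.0, 180.0, 185.0]
sons = [168.0, 171.0, 174.0, 178.0, 183.0]

n = len(fathers)
mean_x = sum(fathers) / n
mean_y = sum(sons) / n

# Closed-form least-squares estimates.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(fathers, sons)) \
    / sum((x - mean_x) ** 2 for x in fathers)
intercept = mean_y - slope * mean_x

def predict_son_height(father_height):
    return slope * father_height + intercept

print(f"{predict_son_height(177):.1f} cm")  # prints "176.3 cm" for this data
```

Linear regression is the simplest case; the same predict-from-labeled-data pattern underlies more complex supervised methods.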

Even though they are related, there are some differences between traditional statistics and machine learning. Traditional statistics cares primarily about model simplicity and interpretability. Machine learning cares primarily about model performance.

Structure of a data science project

The structure of a data science project is as follows:

  1. A question is asked, whose answer will determine what to do next;
  2. Data that can help answer the question is gathered;
  3. Exploratory data analysis gives a vague idea of what the answer might be;
  4. Formal data analysis gives a deeper understanding of the data;
  5. Interpretation infers conclusions from the data samples;
  6. Communication of the experiment delivers the answer to the question;
  7. The data science project is over; now it is decision time.
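The steps above can be sketched as a toy workflow. Every function and value here is a placeholder of my own (an assumption), shown only to make the sequence concrete.

```python
# Toy data science workflow mirroring the seven steps. All function names
# and data are invented placeholders, not part of the course material.

def gather_data():
    # Step 2: gather data that can help answer the question.
    return [168, 171, 174, 178, 183]

def exploratory_analysis(data):
    # Step 3: get a vague idea of the answer (quick summaries, plots).
    return min(data), max(data)

def formal_analysis(data):
    # Step 4: deeper understanding (here, just a mean as a stand-in model).
    return sum(data) / len(data)

def interpret(estimate):
    # Step 5: infer a conclusion from the estimate.
    return f"heights in this sample average about {estimate:.0f} cm"

# Step 1: the question is "how tall are people in this sample?"
data = gather_data()
low, high = exploratory_analysis(data)
conclusion = interpret(formal_analysis(data))
print(conclusion)  # Step 6: communicate; step 7 (the decision) follows.
```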

Sometimes, exploratory data analysis can come first, leading to questions which you can then try to answer using the steps above.

Illusion of progress

The course mentions the paper Classifier Technology and the Illusion of Progress, by David John Hand. The paper explains how many current machine learning algorithms may well be too complicated to yield value in real-life projects. Some techniques have a theoretical advantage that is very hard to translate to the real world, due to unexpected edge cases or a lack of domain knowledge on the part of the executor.

Output of a data science project

A data science project should have some kind of output. Ideally the output should be reproducible, meaning it should contain all the data and code needed to run the experiment from scratch. This is useful for auditing the experiment or for running it with another data set.
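One concrete ingredient of reproducibility, as a sketch: fixing the random seed so the experiment can be rerun with identical results. The function and data here are my own assumptions for illustration.

```python
# Reproducibility sketch: a seeded random generator makes the "experiment"
# deterministic, so an auditor can rerun it and get the same result.
import random

def run_experiment(seed=42):
    rng = random.Random(seed)  # local generator; avoids hidden global state
    sample = [rng.gauss(170, 10) for _ in range(100)]  # simulated data
    return sum(sample) / len(sample)

assert run_experiment() == run_experiment()  # same seed, same answer
print("experiment reproduces:", run_experiment())
```

Shipping the code and the raw data together extends the same idea to the whole analysis.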

Beyond that output, the outcome of a data science project can be good or bad.

Good outcomes

The best outcome of a data science project is an answer to the question the project set out to answer, together with the reproducible output discussed above. Other good outcomes include increased knowledge, or the conclusion that the data set is not suited to answering the question, in which case you can decide to look for more data or to drop the question. Either way, it’s useful to know that the data set is insufficient.

Bad outcomes

Sometimes, even in the hands of smart people, machine learning can make bad predictions, as was the case with some predictions by Google Flu Trends:

In the 2012-2013 season, it predicted twice as many doctors’ visits as the US Centers for Disease Control and Prevention (CDC) eventually recorded. — New Scientist

And sometimes, machine learning can yield very good performance but be too expensive to run at scale in production, as was the case with an algorithm meant to improve Netflix recommendations:

(…) the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. — Netflix blog

Some other papers are mentioned in the course: