This week, I took Coursera's crash course on data science from Johns Hopkins University. From the use of statistics to the structure of a data science project, here's what I learned.
Data science is the science of answering questions with data.
It usually involves statistics, which the course breaks down into four parts:
- Descriptive analysis: the exploration of the data to get an idea of what it tells us. This can include creating graphs or visualizations of the raw data.
- Inference: the drawing of conclusions about a population from samples. For example, estimating how an electorate will vote from a poll of a sample of voters. Inference can be seen as the formalization of the descriptive analysis.
- Prediction: an attempt to say what future data will be, based on current or past data. This takes the form of linear regression or other machine learning algorithms, as well as deep learning with constructs such as neural networks.
- Experimental design: the definition of the methods needed to conduct meaningful experiments. Randomization of the population group is a key aspect of experimental design, as it helps the results generalize better.
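To make the inference item above more concrete, here is a minimal sketch of estimating a vote share from a poll. The population, the 52/48 split, the sample size and the function name `proportion_confidence_interval` are all made up for illustration; the interval uses the usual normal approximation, which is only one of several ways to do this.

```python
import math
import random


def proportion_confidence_interval(sample, z=1.96):
    """Return (estimate, low, high) for the share of 1s in a 0/1 sample.

    Uses the normal approximation; z=1.96 gives a roughly 95% interval.
    """
    n = len(sample)
    p_hat = sum(sample) / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Hypothetical population: 1 means "votes for candidate A", 0 means otherwise.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [1] * 52_000 + [0] * 48_000
sample = rng.sample(population, 1_000)  # the "poll"

estimate, low, high = proportion_confidence_interval(sample)
print(f"estimated share: {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
```

The point is that the poll only sees 1,000 people, yet it gives an estimate for the whole population together with a measure of how uncertain that estimate is.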
There are 2 main types of machine learning:
- unsupervised learning: trying to find structure that was not known beforehand - for instance, groups of people within a population, such as "Republican voters" or "Democratic voters",
- supervised learning: trying to predict something based on previously known data - for instance, trying to predict the height of a son based on the height of his father.
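The two types above can be sketched with a few lines of plain Python. This is only an illustration under my own assumptions: the heights and scores are invented, and `fit_line` (ordinary least squares) and `two_means` (a tiny one-dimensional k-means with two clusters) are hypothetical helpers, not something taken from the course.

```python
def fit_line(xs, ys):
    """Supervised: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_means(values, iterations=20):
    """Unsupervised: 1-D k-means with k=2, returns the two cluster centers.

    Assumes the values are not all identical.
    """
    low, high = min(values), max(values)  # start from the two extremes
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


# Supervised: labelled pairs (father's height, son's height), made-up values in cm.
fathers = [165, 170, 175, 180, 185]
sons = [168, 172, 175, 179, 182]
slope, intercept = fit_line(fathers, sons)
predicted_son = slope * 178 + intercept
print(f"predicted height of the son of a 178 cm father: {predicted_son:.1f} cm")

# Unsupervised: no labels, just made-up scores in which two groups are hiding.
scores = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
center_a, center_b = two_means(scores)
print(f"group centers: {center_a:.2f} and {center_b:.2f}")
```

In the first case the algorithm is told the right answers (the sons' heights) and learns to predict them; in the second it is given no answers and has to discover the groups by itself.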
Even though they are related, there are some differences between traditional statistics and machine learning. Traditional statistics cares primarily about model simplicity and interpretability. Machine learning cares primarily about model performance.
The structure of a data science project is as follows:
- a question is asked, the answer to which will determine what to do next,
- data is gathered that will help to answer the question,
- exploratory data analysis gives a rough idea of what the answer might be,
- formal data analysis gives a deeper understanding of the data,
- interpretation infers conclusions from the data samples,
- communication of the experiment gives the answer to the question,
- the data science project is over; now it is decision time.
Sometimes, exploratory data analysis comes first and leads to questions, which you can then try to answer using the steps above.
The course mentions the paper "Classifier Technology and the Illusion of Progress", by David John Hand.
The paper explains how many current machine learning algorithms may well be too complicated to yield value in real-life projects.
Some technologies have a theoretical advantage that is very hard to carry over to the real world, because of unexpected edge cases or a lack of domain knowledge on the part of the person applying them.
A data science project should have some kind of output.
Ideally the output should be reproducible, meaning it should contain all the data and code needed to run the experiment from scratch. This is useful for auditing the experiment or for running it with another data set.
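As a rough sketch of what "reproducible" can mean in practice, the snippet below records a fingerprint of the input data and the random seed next to the result, so that someone auditing the experiment can check they are re-running the same thing. The data, the seed value and the names `fingerprint` and `run_experiment` are hypothetical; a real project would more likely rely on a notebook or report tool to bundle code and data.

```python
import hashlib
import json
import random


def fingerprint(rows):
    """Hash the data so a later run can verify it used the same input."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_experiment(rows, seed):
    """Hypothetical analysis: the mean of a seeded random subsample."""
    rng = random.Random(seed)  # same seed, same subsample
    subsample = rng.sample(rows, k=len(rows) // 2)
    return sum(subsample) / len(subsample)


data = [2.5, 3.1, 4.8, 5.0, 6.2, 7.7]  # made-up data set
seed = 2024

manifest = {
    "seed": seed,
    "data_sha256": fingerprint(data),
    "result": run_experiment(data, seed),
}

# Re-running from scratch with the recorded seed gives the same result.
assert run_experiment(data, manifest["seed"]) == manifest["result"]
print(json.dumps(manifest, indent=2))
```

Running the same code with another data set changes the fingerprint, which makes it obvious that the result no longer refers to the original experiment.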
The output of a data science project could be:
- a report (see Jupyter Notebooks and Knitr),
- an interactive presentation (see Slidify),
- an application or an interactive web page (see Shiny).
The best outcome of a data science project is an answer to the question the project was meant to answer, together with the output discussed above.
Some other good outcomes may be increased knowledge or the conclusion that the data set is not suited to answer the question, in which case you can decide to look for more data or to drop the question. Either way, it's useful to know if the data set is insufficient.
Sometimes, even in the hands of smart people, machine learning can make bad predictions, as was the case with some predictions by Google Flu Trends:
In the 2012-2013 season, it predicted twice as many doctors’ visits as the US Centers for Disease Control and Prevention (CDC) eventually recorded. — New Scientist
And sometimes, machine learning can yield very good performance but be too expensive to run at scale in production, as was the case with an algorithm meant to improve Netflix recommendations:
(...) the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. — Netflix blog
Some other papers are mentioned in the course:
I'd like to read those.