Evolution of Data Science. Part 1

The “Evolution of Data Science” blog series will consist of a number of posts that take you through the key historical events that have shaped Data Science as we know it today.

Historical Development

Analyzing data to extract meaningful insights from hidden patterns, or to discover universal laws that are not obvious from simply “eyeballing” the raw data, has been practiced for hundreds of years, albeit on a smaller scale. A few centuries ago, Galileo, the Italian mathematician and physicist, collected observational astronomy data and analyzed it to produce findings ranging from tracking the motion of sunspots to discovering the moons of Jupiter, and he helped develop the new field of kinematics, which is now part of classical mechanics. The work of data analysts, scientists and theoreticians in quantitative fields over the last 500 years, such as Galileo, Newton, Laplace, Faraday, Maxwell, Poincaré, Planck and Einstein, to name a few, was all done by hand until World War II began. After that point, calculating machines, or primitive computers, were built to help physicists analyze their data. This was the dawn of machine-based data analysis.

Dawn of Machine-based Data Analysis

In the late 1930s, physicists in the US were concerned that Germany would be the first to develop the nuclear bomb, because German scientists had successfully split the uranium atom in 1938, before war broke out in 1939. They persuaded Einstein to write a letter to then-President Roosevelt urging him to approve funding for research on nuclear fission. This led to the creation of the super-secret Manhattan Project in 1942 to develop the nuclear bomb. It was based at Los Alamos National Laboratory in New Mexico, although other researchers were scattered across various universities in the US.

The development of early computing benefited enormously from the Manhattan Project’s innovations. The volume of data generated by the project was massive for its time, and the physicists and mathematicians involved used analog computers to analyze it. Those analog computers were integral to the Manhattan Project and were used so heavily that they frequently broke down. Researchers at Los Alamos also used older punch-card machines produced by IBM.

Principal members of the project, such as John von Neumann, Enrico Fermi, Richard Feynman and Nicholas Metropolis, developed various computational methods for analyzing nuclear fission data, which were run on analog computers. Among these were the Monte Carlo method and its variant, the Metropolis algorithm, both of which are pervasive in modern data analysis. Their applications range from text mining and natural language processing, such as named-entity recognition using a CRF (conditional random field) sampled via Metropolis-Hastings, to speech recognition systems using an HMM (hidden Markov model) sampled via Monte Carlo, to the sequential training of an ANN (artificial neural network) for image recognition tasks, also via Monte Carlo.
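To give a flavour of what the Metropolis algorithm does, here is a minimal modern sketch in Python. It is an illustration only, not a reconstruction of anything run at Los Alamos. The function names (metropolis_samples, standard_normal_log_density), the Gaussian random-walk proposal and the choice of a standard normal target are all assumptions made for this example.

```python
import math
import random


def metropolis_samples(log_density, start, step_size, n_samples, seed=0):
    """Draw samples from an unnormalised 1-D density with a random-walk Metropolis sampler.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    x = start
    log_p = log_density(x)
    samples = []
    for _ in range(n_samples):
        # Propose a symmetric random step around the current point.
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
        # A rejected proposal repeats the current point in the chain.
        samples.append(x)
    return samples


def standard_normal_log_density(x):
    # Log of exp(-x^2 / 2); the normalising constant is not needed.
    return -0.5 * x * x


samples = metropolis_samples(standard_normal_log_density, start=0.0,
                             step_size=1.0, n_samples=20000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean = {mean:.3f}, sample variance = {variance:.3f}")
```

With enough samples, the mean and variance of the chain should settle near 0 and 1, the values for the standard normal target. The sampler only ever needs the ratio of two density values, which is why it still works when the normalising constant of the density is unknown.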

An interesting aside is that the term Monte Carlo was a code name used by John von Neumann’s team for their algorithm, owing to the super-secret nature of the Manhattan Project. A member of the team coined the name because he had visited Monte Carlo in Europe prior to World War II and had fallen in love with the place.

Some of the scientists involved in the Manhattan Project, notably John von Neumann and Richard Feynman, made important contributions to the advancement of modern computing after the war. John von Neumann proposed the von Neumann computer architecture, and Richard Feynman proposed the quantum computer.

These forefathers of computing and data analysis laid the foundations on which the field of Data Science is built.

In Part 2 of this blog series, we will cover “The rise of modern computing in massive memory intensive data analysis”.

– Sione K. Palu, Data Scientist