What should a data scientist know? What are the core skills of a data scientist? I have not seen another job title so vague and ambiguous, or one that arouses so much debate. The BD2K (Big Data to Knowledge) Centers of the NIH (National Institutes of Health) [Ohno-Machado 2014] have funded a few tertiary institutions in the United States to develop data science curricula, which keeps these discussions going.
This is an interdisciplinary field. Around 15 years ago, when I was still a matriculation student in Hong Kong, the University of Hong Kong (HKU) started a major called bioinformatics. People were puzzled about what it really was, because it looked like a melting pot of several unrelated disciplines (in fact, many freshmen complained that they did not understand the purpose of the undergraduate program). But we now understand how important it is.
So what should the students learn? It was suggested in the following figure:
You can see that the core competencies include statistics, machine learning, software engineering, reproducible research, and data visualization. Some of these belong to mathematics and computing, some to the sciences, and some to the arts. And of course, each individual data scientist job also requires the corresponding business knowledge.
Honestly, I do not excel in all of them. I have a physics background, which makes it easy for me to learn machine learning and research. Software engineering is not hard to pick up. But statistics feels alien to me, and visualization requires an artistic sense that I don’t possess.
Anyway, a lot to learn. Stay humble.
- K. Sainani, “The Landscape of Bioinformatics Education: Ever-Expanding and Heterogeneous”, Biomedical Computation Review (Fall 2015).
- L. Welch, F. Lewitter, R. Schwartz, C. Brooksbank, P. Radivojac, B. Gaeta, M. V. Schneider, “Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies”, PLoS Comput Biol 10(3): e1003496. (2014).
- L. Ohno-Machado, “NIH’s Big Data to Knowledge initiative and the advancement of biomedical informatics”, J. Am. Med. Inform. Assoc. 21, 193 (2014).
- BD2K: Data Science in NIH.