The Sexiest Job: About What?

(Chart taken from Analyzing the Analyzers)

D. J. Patil, the Chief Data Scientist of the United States at the moment, is credited with coining the term “data scientist,” and called it “the sexiest job of the 21st century.” As a result, we now have a job title, “data scientist,” that I find difficult to place in the Standard Occupational Classification (SOC) codes. While I respect D. J. Patil a lot (I loved his speech at my commencement ceremony at the University of Maryland), this is the least well-defined job title I have ever seen.


DJ Patil, the U. S. Chief Data Scientist (from his LinkedIn)

So what does a data scientist do? I have seen many articles about it, and different employers have different expectations of the data scientists they hire. Sometimes their expectations are so unreasonable that what they really want is a god. And a lot of people call themselves data scientists on LinkedIn, despite the fact that their official titles are software engineers, software developers, data analysts, quantitative analysts, research scientists, researchers,… With a Ph.D. in theoretical physics, I want to call myself a data scientist too, because of the word “scientist.” I find it cool and sexy. But I realize the risk of calling myself one: people would expect something different from what I really am. I would rather call myself an “applied quantitative researcher,” as shown on my LinkedIn.

Of course, such a loosely defined title gives opportunists room to make money by distorting their image and rebranding themselves from time to time.

Regarding the skills we need, I love the chart above. (Read that book, which gives a good description.) Despite my complicated feelings toward the term “data scientist,” I believe that, as R&D people in the big data era, we should know:

  1. Statistics, Machine Learning, Natural Language Processing (NLP) and Information Retrieval (IR): the mathematical modeling part.
  2. Domain Knowledge, or Business Knowledge: the knowledge about the industry, the world, the people, the company, …
  3. Software Development: the skills of the development cycle, such as object-oriented (OO) programming, functional programming, unit tests, …, and some recent distributed computing technologies such as Hadoop and Spark.

Employers hire data scientists from diverse backgrounds. Statisticians, research scientists in machine learning, physicists, chemists, or mathematicians might know the mathematics and research methodologies very well, but they often do not know how to write maintainable code. This article described it well. On the other hand, some people are trained as software developers, but they do not have enough mathematical background to handle the analytics well.

The word “data” attracts eyeballs, but we really need to define what terms like “big data,” “data scientists,” or “data products” mean. By the way, despite the vaguely defined term “data products,” this article does describe the trend very well. But no matter what, data will only become more accessible in this age of information explosion, so any skills for tackling data will stay in high demand.


Scala as the Next Influential Programming Language

I have been learning Scala. Some time ago, I doubted whether it was worth it, as the learning curve is quite steep. But today I read the first chapter of my newly ordered book, Advanced Analytics with Spark (Spark being a tool written in Scala for big data analytics), and I am reassured that I bet on the right thing.

I believe it will be the most common programming language of the coming generation in this big data era because:

  1. It runs on the JVM: a lot of libraries are maintained as Java packages. Why discard Java when everything keeps getting better over time? It is the same reason we do not discard our old Fortran code in scientific computing, but wrap it in MATLAB or Python instead.
  2. It is object-oriented: we have learned about modularization and design patterns all along. It keeps the strengths of Java.
  3. It is functional: analytics involves functions, and we want to handle functions flexibly. It shortens our code and makes it more readable (provided that we write it appropriately). Mathematical manipulation is easier when we can express operations with less code. Lambda expressions are available.
  4. Interactive programming is available: what makes R and Python great is that they let us program interactively, especially when handling data and mathematical models. And yes, this is also available in Scala.
  5. Parallel computing comes naturally: with actors or additional packages like Spark, Scala is well suited for scalable computing on huge data. This is something that R and Python lack.
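
To make the first three points a little more concrete, here is a minimal sketch in plain Scala (no Spark). It is only an illustration under my own assumptions: the `Reading` class, the `ScalaSketch` object, the sample numbers, and the start date are all made up for this example and do not come from any library or from the book.

```scala
import java.time.LocalDate // a Java standard-library class, used directly from Scala

// Object-oriented side: a small immutable class (hypothetical, for illustration only)
final case class Reading(day: LocalDate, value: Double)

object ScalaSketch {
  // Functional side: a function is a value, written here as a lambda expression
  val square: Double => Double = x => x * x

  // Mean of a non-empty sequence
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population variance: the mean of the squared deviations from the mean
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    mean(xs.map(x => square(x - m)))
  }

  def main(args: Array[String]): Unit = {
    // Arbitrary start date and sample values, chosen only for the example
    val start = LocalDate.of(2015, 1, 1)
    val readings = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0).zipWithIndex.map {
      case (v, i) => Reading(start.plusDays(i.toLong), v)
    }
    val values = readings.map(_.value)
    println(s"mean = ${mean(values)}, variance = ${variance(values)}")
  }
}
```

The Java class `LocalDate` is imported and used without any wrapper (point 1), `Reading` is an ordinary class (point 2), and the variance is computed by passing small functions to `map` (point 3). Interactive use and parallelism are left out to keep the sketch short.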


