The MOOC Trap

Type “learn to become a data scientist” into Google and you will get results such as DataCamp, Udacity, Udemy, Coursera, and DataQuest. These Massive Open Online Courses, MOOCs for short, are invaluable when learning a new skill set. They give students access to material that was once guarded behind academic walls. They sharpen skill sets and add value to any resume.



However, I have noticed that past a certain point they become a crutch, trading the rigors of self-improvement for comfort. This is not a criticism of others; it is a reflection on my own progress in advanced analytics. We strive to learn the most advanced methods, the coolest visualizations, the highest-performing algorithms. Yet we seldom strive for originality in our work because we are afraid it might be too mundane.


To add to one’s proverbial toolkit in data science, MOOCs and educational resources that guide the user through a well-formatted analysis are crucial. There are few ways to learn the fundamentals other than structured educational materials. But this can lead to the “MOOC trap.” A typical case study of the MOOC trap is the burgeoning data scientist who has completed an intensive 6-month online program. She has dedicated 300 hours of intense study, through both sandbox exercises and a few semi-structured projects. She feels like she has a basic understanding of a host of skills, but is too timid to try her analytical toolset on a problem of her choosing. Instead, she signs up for another 6-month MOOC, a mere regurgitation of the material that she just covered. Enamored with the ads and displays of a polished portfolio on GitHub that the MOOC promises, she forks over another $200 a month.


This individual felt the excitement of looking for a question, venturing the internets for a dataset, and the feeling of struggle as she looked at the mess that real-world data provides. But she regressed to the comfort of the MOOC. I feel the same in my own work. There are so many datasets that we as a community have access to, structured and unstructured, clean and, well, terribly scattered and messy. We are trained through our educational systems, in college and grad school, online courses and structured tutorials, to create something advanced and analytically perfect. We are pressured to post this to GitHub, to display our certificate of accomplishment with a stamp from an official organization.


The problem with the MOOC trap is that it no longer trains us for the real world; it trains us to become great followers of directions. We fear that our analysis of an original piece of work will not be cool enough or advanced enough, and that we might have to grind just to produce an exploratory analysis of things we had already assumed. But this is the challenge: to create something original, because it gives us ownership. Completing basic analytics with an original dataset that we went out and found adds to the data science community. This builds the foundations of what science is and hones the fundamental skills so sorely needed in the workforce.


While MOOCs offer a structured and nicely formatted addition to our repositories and portfolios of glistening analytical work, they have the potential to leave us in a comfortable position where growth decays. There is a certain point to which educational training and online courses can take us; beyond that, the returns diminish. Each nascent data scientist will have a different inflection point, but the feeling is the same: you have a burning question, but feel your skill set is unpracticed. In this instance, forgo the MOOC and find the data in the world. Produce the basic analysis, ask your peers to review it, and struggle a little more. Only then will we grow as data scientists.





Defining your “T”

I once had an interview with a data science consulting firm that I completely bombed. It wasn’t the SQL test and it wasn’t the questions on machine learning algorithms. The question that stumped me came when the interviewer asked, “What is your T?” Dumbstruck, I answered that I didn’t know this terminology. Politely, the interviewer explained that she wanted to know what my core competencies were: in what area I was known as the “go-to” guy, and where I was competent to perform on my own. This brings me to what I believe is the most important story in data science: the story of you.

The T-model was described by Harlan Harris and his co-authors in “Analyzing the Analyzers,” published by O’Reilly Media in 2013. Billed as “an introspective survey of data scientists and their work,” the book argues that there are many categories of individuals calling themselves data scientists. Which category one fits in depends on the contents of one’s T. I created the following T as a visual example.

[Image: example T diagram (Screen Shot 2017-12-16 at 2.49.07 PM.png)]

The vertical of the T is one’s core competency, their comparative advantage in the field of advanced analytics. This is the area where researchers distinguish themselves and are frequently called upon to lead a team. You might remember these people from graduate school: the ones who seemed to effortlessly explain the underpinnings of a complex derivation on the board, pulling proofs out of thin air as if a celestial power were whispering answers in their ear. Now, you don’t have to have analytic superpowers to fill in the vertical of your T, but you should be quite competent in that area.

The horizontal bar holds the areas that you feel comfortable operating in, but are not solely known for. These can be skills that you use for a project every now and then or that are part of your analytical process. You might need a little extra time with your buddy Google or a sojourn on Stack Overflow, but you know that, given that extra time, you will have no trouble performing.

The programs to the right of the T are the programming languages, software, and technologies that go hand-in-hand with your competencies. For example, if someone lists machine learning as their core competency, they will most likely have R or Python listed in parallel to perform those analyses. Alongside Python would be the libraries or modules that one depends on to perform these duties. These could include pandas for data manipulation, NumPy for quick calculations, Matplotlib and seaborn for visualization, Beautiful Soup for scraping data from the web, NLTK for natural language processing and scikit-learn for machine learning algorithms.

The purpose of defining your T is not to include every buzzword technology and programming language that is hot, but to include those resources that get you 90% of the way to your goal. That final 10% could be something that you seldom use and need to call on additional expertise to complete. Creating a firm narrative of your core strengths helps the nascent and advanced data scientist alike explain what they bring to the team. It creates a mutual understanding between you and those hiring you about what you bring to the table. But more importantly, it provides a visual elevator pitch when communicating what data science is to you.


An interview with Booz Allen Hamilton Data Scientist, Nicholas Kallfa

Last weekend I had a sit-down conversation with Nicholas Kallfa. Nicholas is a data scientist at Booz Allen Hamilton, where he creates tailored products for government clients, using R and mathematical computing techniques to visualize information in Shiny, R’s open-source web application framework. Not only is Nicholas a great communicator of advanced mathematics, but he shared with us the various methods he uses to guide clients through difficult concepts to reach a mutual understanding. This interview was quite enjoyable since there was an abundance of cats knocking things over in the background. If there is residual noise, I blame it on them.

Toolkit: R (RStudio), Python (Spyder), Git/GitHub, Postgres, Microsoft SQL Server, QGIS, LaTeX, Tableau, Excel




Nick’s favorite online resources:

  • Udemy (mostly to learn something completely new for the first time)
  • Stack Overflow
  • Relevant subreddits
  • Mostly just Googling and seeing where he ends up

Influential books about data:

Bio: Nicholas Kallfa is a Data Scientist at Booz Allen Hamilton where he specializes in providing research and data science support to clients within the federal government. Some of his previous clients include U.S. Customs and Border Protection, Immigration and Customs Enforcement, and multiple clients in the Department of Defense. He has a background in mathematics, holds an MS from the University of Iowa, and currently lives in Alexandria, VA with his fiancée and 4 mischievous cats.


What is a Data-Driven Organization?

Recently, a former colleague reached out to ask how his organization could become data-driven. Specifically, he was interested in what the various terms and buzzwords he had been hearing actually meant. The following is a brief response that summarizes, at a high level, the various underpinnings of what it means to be “data-driven.” I have included the e-mail response in its entirety.

“Ok, so what is this topic of a data-driven business? Well, for starters, most of what we hear about informing decisions through advanced analytics and big data is nothing new. What is new is the sheer quantity and quality of the data that we are being presented with. In theory, these buzzwords and wing-dings we see in business news have been around in operations research for decades. But we have never had the amount of data now being produced by what we call the “Internet of Things,” or IoT for short. Think about your cell phone. There are about 12-15 sensors in a typical iPhone that are continuously recording data and feeding it to various services: your location, speed, altitude, when your phone activity is low and when it is high. All of these things can be used to inform decisions and create new products. But it is the same process that you learned in the Navy: optimize a problem given the data and based on the constraints. So here is a little introduction to what I think a data-driven business needs to have in place.
1. Theory: Big data is broken up into three Vs (variety, velocity, volume).
A) Variety deals with data coming from different sources, be it IoT from your phone or online purchases, weather data, Word documents, web page text, etc. These sources differ not only in where they come from but in the format they are in, be it CSV, JSON, XML, HTML, JavaScript, etc. All of these formats have pros and cons for the task at hand. Some are “lighter” and take up less space (JSON and CSV), while others have built-in hierarchies that help maintain a webpage’s layout (HTML). Big data and data-related decisions for a business will revolve around how to collect these sources and make them play nice with each other. This is called the “data wrangling” phase. We wrangle data from different sources. It is then the data scientist’s job to take the collected data and clean it, or change it to a format that is readable by other programs.
B) Velocity is the speed at which data comes in. Processing comes in two types: batch and stream. Batch processing happens in batches, or all at once. Stream processing is when data drips into a system. This can be thought of as clicks on a webpage. A company like Amazon would be interested in seeing where a customer clicked, how long they waited on a webpage before buying something and the order that they shop in. These are rather simple things to follow since web-based applications can time stamp and track cursor selection. When we get into batch processing, most of the time we think of big data platforms like Hadoop. Hadoop is a widely used platform for dividing a big job into tasks and sending them to different machines in a cluster (think of a room full of server towers). Each piece of the job is mapped to a specific machine, and once those individual tasks are done, the partial results are sent back and reduced, or combined, into a single answer. This process is called MapReduce and is a keystone of Hadoop. Think of this as giving the crew of a ship multiple tasks. You cannot do this yourself, but you can command junior officers to delegate tasks. Essentially, you are the Hadoop master node and your junior officers are the workers. You map tasks out to your junior officers, and their reports are reduced into a single picture. The junior officers give out commands to the enlisted folks (this could be a machine learning algorithm that runs the math equations and operations research models on the individual machines in your Hadoop cluster and then sends the results back to the junior officers, who then report to you). You, or the Hadoop master, then take all those individual reports at the end of the day to come to a conclusion and summarize results for the Admiral.
C) Volume. Simply put, this deals with the sheer size of big data. It hurts my head to think about how big data is in a military or business world, but I will try. Think of Trump’s proposed wall. Well, DHS is going to have quite the mission in dealing with all the data that the sensors on the wall will be sending back. UAVs on patrol need someplace to store data. This is where we get the terms “data management” and “data warehousing.” We need someplace to store data, and Amazon (AWS) and Google are leading the way in putting these “server farms” in the desert or other places where they can use solar power to cool the servers off (they get really hot, which is why IT always puts huge fans in server rooms in your office).
2. Analysis: There is a process in data science that can usually be broken up into 3 parts…
A. Descriptive: Basic statistics, basic graphs, basic outlier analysis. You are using these fundamentals of data analysis to get an overall picture of the data. Essentially, you have a specific question in mind.
B. Exploratory: You are looking at trends. You don’t have a specific question, but you are hoping to find something interesting; then you can think of a question and go back to descriptive statistics.
C. Predictive: This is usually called “machine learning.” The methods range from the simple (linear regression, time series forecasting, cluster analysis, decision trees) to the more complex (naive Bayes, random forests, neural networks, deep learning). You are forecasting here.
3. Tools: This is the data analysis process broken into steps and the respective tools you will need
A. Wrangling: Find the data out in the world (JSON, HTML, CSS, JavaScript, Java, etc.)
B. Collection and Cleaning: Formatting data into a structure that you can process.
C. Storage: In-house or cloud-based
D. Query: How do we get data from the warehouse? Think of a library. You want to get some books about France but you don’t want to bring the entire library home with you. On the computer at the library you select a few filters to get a cookbook, French for Dummies and a novel about France. Under the hood, the library computer is running a SQL query. SQL stands for Structured Query Language and is the backbone of how most organizations query data. Databases can be SQL or NoSQL (Not only SQL). SQL is for structured data and NoSQL is for unstructured data. Think of SQL as vertical and NoSQL as horizontal. Vertical means you can stack the pages of a spreadsheet on top of each other and drill a hole through the name column for the personnel on a ship. All these spreadsheets have personal information on them, but you want to query their rank. Well, you would drill a hole through the “name” column and then through the “rank” column. Then you would have a new spreadsheet called “Name and Rank” that would list every person on your ship. Horizontal would be your shelf at home that has a CD next to a DVD next to a book in Japanese. Drilling a hole through these will yield nothing. So, NoSQL is used to have these talk to each other and connect. It is more cumbersome, but it is now widely used for the unstructured data that IoT sources provide. For SQL you will see the open-source MySQL and the paid Microsoft SQL Server. Access databases also run on SQL. For NoSQL, MongoDB is the major player.
E. Analysis: Now that you have your data clean and ready, you can perform your analysis. The open-source options are R and Python. They are similar languages, and both support object-oriented programming. R is favored by stats people and Python is favored by computer scientists and engineers. R is the simpler language, but Python can be used for virtually any task. For stats junkies there are also Stata, SPSS, Excel and SAS.
F. Sharing Conclusions: This is where reporting and telling the story of your data comes into play. This is the most crucial step, and the one that most data wizards fail. If you cannot explain your analysis in terms that people understand, then all is lost. The tools here will usually be business intelligence (BI) programs that visualize data. Tableau is the heavyweight in the business/government area. You can create an interactive dashboard that has all of your data and analysis and link it into your own website for users to explore. Other players are Microsoft Power BI and Qlik.”
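
To make the query step in the e-mail a little more concrete, here is a minimal sketch of the “Name and Rank” example using Python’s built-in sqlite3 module. Everything in it is an assumption made for illustration: the table name personnel, the column names (crew_rank stands in for the “rank” column), and the three made-up crew members. A real organization would point the same kind of query at its own database rather than an in-memory one.

```python
import sqlite3

# Hypothetical in-memory database; the table, columns and rows are invented
# purely to illustrate the "Name and Rank" example from the e-mail.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE personnel (name TEXT, crew_rank TEXT, division TEXT, hometown TEXT)"
)
conn.executemany(
    "INSERT INTO personnel VALUES (?, ?, ?, ?)",
    [
        ("Alvarez", "Lieutenant", "Engineering", "San Diego"),
        ("Brooks", "Ensign", "Navigation", "Norfolk"),
        ("Chen", "Petty Officer", "Engineering", "Seattle"),
    ],
)

# "Drill a hole" through only the name and rank columns, and filter the rows
# the same way the library computer filters books about France.
rows = conn.execute(
    "SELECT name, crew_rank FROM personnel WHERE division = ? ORDER BY name",
    ("Engineering",),
).fetchall()

for name, crew_rank in rows:
    print(name, crew_rank)

conn.close()
```

The SELECT clause picks the columns (the holes drilled through the stack of spreadsheets) and the WHERE clause picks the rows, so the full table never has to leave the warehouse.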