An interview with International Development Data Scientist, Anton Prokopyev


Last weekend I had the opportunity to sit down with a fascinating data scientist who works in a large non-governmental organization that specializes in international development. Anton views data scientists as “forensic investigators” who have a responsibility to the community at large to develop repeatable and reputable analytical stories. 

Bio: Anton is a data scientist working in an international development organization in Washington, D.C., where he applies his programming and research skills to contribute to data-for-good goals. While Anton is a full-stack data scientist, meaning he works through a project from prototyping to production, he has a keen affection for natural language processing and text analytics. Prior to starting his data science journey, Anton worked in tech companies and Silicon Valley startups.

LinkedIn: linkedin.com/in/prokopyev

GitHub: github.com/prokopyev

Twitter: twitter.com/prokopsky

 

Anton’s favorite online resources:

Siraj Raval

3Blue1Brown

Talk Python

Kaggle Kernels

ML Trainings


Book Summary: Weapons of Math Destruction


Cathy O’Neil’s Weapons of Math Destruction, or WMD for short, offers a moralistic argument that with big data comes big responsibility. She uncovers what she refers to as the inherent prejudice, racism, and economic unfairness in predictive algorithms and machine learning models; no analytical model is safe from her fierce criticism. With an opening salvo that our very democracy hangs on the morality of our mathematical models, O’Neil unveils the destructive power of the models that impact our daily lives.

O’Neil defines a WMD by the following characteristics:

  1. It is opaque, as if a hidden veil sits between the subject and the controller
  2. It causes a negative feedback loop, where the output of the model adversely affects future decisions about the subject, only to feed further damaging data back in
  3. It disadvantages a specific group, most notably minorities
  4. It presumes guilt, treating innocence as a mere constraint in the model
  5. It favors efficiency, resulting in further model variance
  6. It fails to account for fairness, relying solely on numbers and data
  7. It lacks a feedback structure, resulting in noise without signals to readjust

O’Neil sums up the paradox of a WMD as follows:

“Do you see the paradox? An algorithm processes a slew of statistics and comes up with a probability that a certain person might be a bad hire, a risky borrower, a terrorist, or a miserable teacher. That probability is distilled into a score, which can turn someone’s life upside down. And yet when the person fights back, ‘suggestive’ countervailing evidence simply won’t cut it. The case must be ironclad. The human victims of WMD’s, we’ll see time and again, are held to a far higher standard of evidence than the algorithms themselves.”

Some of the intriguing examples O’Neil cites are common in the analytics community. One is the star high school student who applies to a state-level university, only to be turned away because the school’s predictive model for acceptance says that all-star students have a high probability of turning down so-called “safety schools” for more prestigious institutions. Knowing that university rankings reward a low acceptance rate, the state school rejects the all-star student, and the cycle is perpetuated in a negative feedback loop.

Another example is the predictive policing models that city police departments rely on. Using inputs such as the number of ATMs, geographic features, and classifications of high-risk zones, the models attempt to optimize police resources by placing patrol cars in areas with a higher probability of a crime occurring. Unfortunately, these areas tend to be lower-income and home to large minority populations. The sheer act of an increased police presence results in minor crimes, such as drug use, being seen, acted upon, and enforced. More police reports and arrests feed back into the machine learning model, resulting in a further bias toward placing more police units in the area. This is another prime example of a negative feedback loop presuming guilt without model readjustment.
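
To make that feedback loop concrete, here is a minimal toy simulation (my own sketch, not from the book; the neighborhoods, rates, and allocation rule are invented for illustration). Two hypothetical neighborhoods have the same underlying rate of minor crime, but a small difference in historically recorded crime steers patrol allocation, and the extra enforcement keeps widening the gap in the records.

```python
import random

random.seed(0)

# Two hypothetical neighborhoods with the SAME true rate of minor crime.
true_rate = {"A": 0.05, "B": 0.05}
# A small, essentially arbitrary difference in historically recorded crime.
recorded = {"A": 12, "B": 10}
patrol_hours = 200  # patrol-hours the "model" allocates each week

for week in range(52):
    total = sum(recorded.values())
    for hood in recorded:
        # The model sends patrols in proportion to past recorded crime...
        hours = int(patrol_hours * recorded[hood] / total)
        # ...and more patrol hours mean more minor offenses are observed
        # and written up, even though the true rate never changed.
        observed = sum(random.random() < true_rate[hood] for _ in range(hours))
        recorded[hood] += observed

print(recorded)  # the small initial gap in the records has grown into a large one
```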

As seen from these examples, O’Neil argues that more data is not the panacea that will fix our predictive problems. If the algorithm favors a certain slice of the population, or refuses to readjust no matter how it is tuned (for accuracy, precision, recall, or their combination in the F1 or F-beta scores), then no matter the size of the data, the model will still fail. However, there is hope! O’Neil offers the following suggestions to improve our models and restore trust in a fairer, more robust set of algorithms.

  1. Ensure there is an opportunity for outside inspection. Let the general public understand the models that judge them, be it credit e-scores or teacher evaluation criteria
  2. Require user sign-off before their online actions are used as data that can be resold; Europe follows a similar model that could be copied
  3. Make public the assumptions that the model is built upon
  4. Audit algorithms as if they were financial institutions

While O’Neil’s work falls squarely into the category of personal rights and privacy in a data-filled world, much of her anecdotal evidence is convincing. The data we implicitly hand over through our online actions, voting behavior, consumer choices, and personal health and well-being should not be a product for others, but for ourselves. This is a question that will plague future technologists as we come to terms with what is and isn’t in the public domain.


The MOOC Trap

Type “learn to become a data scientist” into Google and you will get the following results: DataCamp, Udacity, Udemy, Coursera, DataQuest, etc. These Massive Open Online Courses, MOOCs for short, are invaluable when learning a new skillset. They allow students to enter a once-guarded elite inside the academic walls. They sharpen skill sets and add value to any resume.


However, I have noticed that past a certain point they become a crutch, traded for comfort over the rigors of self-improvement. This realization is not aimed at others but is reflected in my own advancement in advanced analytics. We strive to learn the most advanced methods, the coolest visualizations, the highest-performing algorithms. Yet we seldom strive for originality in our work because we are afraid it might be too mundane.

 

To add to one’s proverbial toolkit in data science, MOOCs and educational resources that guide the user through a well-formatted analysis are crucial. There are few ways to learn the fundamentals other than structured educational materials. But this can lead to the “MOOC trap.” A typical case study of the MOOC trap is the burgeoning data scientist who has completed an intensive 6-month online program. She has dedicated 300 hours of intense study, through both sandbox exercises and a few semi-structured projects. She feels like she has a basic understanding of a host of skills, but is timid about trying her analytical toolset on a problem of her choosing. Instead, she signs up for another 6-month MOOC, a mere regurgitation of the material she just covered. Enamored with the ads and displays of the polished GitHub portfolio that the MOOC promises, she forks over another $200 a month.

 

This individual felt the excitement of looking for a question, venturing across the internet for a dataset, and the struggle of staring at the mess that real-world data provides. But she regressed back to the comfort of the MOOC. I feel the same in my own work. There are so many datasets that we as a community have access to: structured and unstructured, clean and, well, terribly scattered and messy. We are trained through our educational systems in college and grad school, online courses, and structured tutorials to create something advanced and analytically perfect. We are pressured to post this to GitHub, to display our certificate of accomplishment with a stamp from an official organization.

 

The problem with the MOOC trap is that it no longer trains us for the real world; it trains us to become great followers of directions. We fear that our analysis of an original piece of work will not be cool enough, it will not be advanced enough, and, well, we might have to grind just to produce an exploratory analysis of things we might have already assumed. But this is the challenge: to create something original, because it gives us ownership. Completing basic analytics with an original dataset that we went out and found adds to the data science community. It builds the foundations of what science is and hones the fundamental skills so sorely needed in the workforce.

 

While MOOCs offer a structured and nicely formatted addition to our repositories and portfolios of glistening analytical work, they have the potential to leave us in a comfortable position where growth decays. There is a certain point to which educational training and online courses can take us, but beyond that, it is a series of diminishing returns. Each nascent data scientist will have a different inflection point, but the feeling is the same: you have a burning question, but feel your skillset is unpracticed. In this instance, forgo the MOOC and find the data out in the world. Produce the basic analysis, ask your peers to review it, and struggle a little more. Only then will we grow as data scientists.



Defining your “T”

I once had an interview with a data science consulting firm that I completely bombed. It wasn’t the SQL test and it wasn’t the questions on machine learning algorithms. The question that stumped me was when the interviewer asked, “What is your T?” Dumbstruck, I answered that I didn’t know the terminology. Politely, the interviewer explained that she wanted to know what my core competencies were, in what area I was known as the “go-to” guy, and where I was competent performing on my own. This brings me to what I believe is the most important story in data science: the story of you.

The T-model was described by Harlan Harris in “Analyzing the Analyzers,” published by O’Reilly in 2013. In the book, heralded as “an introspective survey of data scientists and their work,” Harris and his colleagues argue that there are many categories of individuals calling themselves data scientists, and that defining which category one fits into depends on the contents of their T. I created the following T as a visual example.

[Image: an example T for a data scientist]

The vertical bar of the T is one’s core competency, their comparative advantage in the field of advanced analytics. This is the area where researchers distinguish themselves and are frequently called upon to lead a team. You might remember these people from graduate school who seemed to effortlessly explain the underpinnings of a complex derivation on the board, pulling proofs out of thin air as if a celestial power were whispering answers in their ear. Now, you don’t have to have analytic superpowers to fill in the vertical of your T, but you should be quite competent in that area.

The horizontal bar covers the areas you feel comfortable operating in but are not solely known for. These can be skills you pull out for a project every now and then or that are part of your analytical process. You might need a little extra time with your buddy Google or a sojourn on Stack Overflow, but these are areas where you know that, given that time, you will have no trouble performing.

The programs to the right of the T are the programming languages, software, and technologies that go hand-in-hand with your competencies. For example, if someone lists machine learning as their core competency, they will most likely have R or Python listed alongside it to perform those analyses. With Python come the libraries or modules that one depends on to perform these duties: pandas for data manipulation, numpy for quick calculations, matplotlib and seaborn for visualization, BeautifulSoup for scraping data from the web, nltk for natural language processing, and scikit-learn for machine learning algorithms.
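
As a rough illustration of how a few of those libraries slot together in one small workflow (a sketch only; the file name and column names here are hypothetical):

```python
import pandas as pd                       # data manipulation
import numpy as np                        # quick calculations
import matplotlib.pyplot as plt           # visualization
import seaborn as sns                     # statistical plots
from sklearn.linear_model import LogisticRegression   # machine learning
from sklearn.model_selection import train_test_split

# Hypothetical survey data with numeric features and a binary outcome column.
df = pd.read_csv("survey.csv")
df["income_log"] = np.log1p(df["income"])           # numpy for a quick transform

sns.histplot(df["income_log"])                       # seaborn on top of matplotlib
plt.savefig("income_distribution.png")

X = df[["age", "income_log"]]
y = df["satisfied"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)   # scikit-learn estimator
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```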

The purpose of defining your T is not to include every hot buzzword technology and programming language, but to include the resources that get you 90% of the way to your goal. The final 10% could be something you seldom use and need to call on additional expertise to complete. Creating a firm narrative of your core strengths helps the nascent and advanced data scientist alike explain what they bring to the team. It creates a mutual understanding with those hiring you about what you bring to the table. But more importantly, it provides a visual elevator pitch when communicating what data science is to you.


An interview with Booz Allen Hamilton Data Scientist, Nicholas Kallfa

Last weekend I had a sit-down conversation with Nicholas Kallfa. Nicholas is a data scientist at Booz Allen Hamilton, where he creates tailored products for government clients, using R and mathematical computing techniques to visualize information in Shiny, R’s open-source web application framework. Not only is Nicholas a great communicator of advanced mathematics, but he also shared the various methods he uses to guide clients through difficult concepts to reach a mutual understanding. This interview was quite enjoyable since there was an abundance of cats knocking things over in the background. If there is residual noise, I blame it on them.

Toolkit: R (RStudio), Python (Spyder), Git/GitHub, Postgres, Microsoft SQL Server, QGIS, LaTeX, Tableau, Excel


Nick’s favorite online resources:

  • Udemy (mostly to learn something completely new for the first time)
  • Stack Overflow
  • Relevant subreddits
  • Mostly just googling and seeing where he ends up

Influential books about data:

Bio: Nicholas Kallfa is a Data Scientist at Booz Allen Hamilton, where he specializes in providing research and data science support to clients within the federal government. Some of his previous clients include U.S. Customs and Border Protection and Immigration and Customs Enforcement, as well as multiple clients in the Department of Defense. He has a background in mathematics, holding an MS from the University of Iowa, and currently lives in Alexandria, VA with his fiancée and four mischievous cats.


What is a Data-Driven Organization?

Recently, a former colleague reached out to me to ask how his organization could become data-driven. Specifically, he was interested in what the various terms and buzzwords he had been hearing actually meant. The following is a brief response summarizing, at a high level, the various underpinnings of what it means to be “data-driven.” I have included the e-mail response in its entirety.

“Ok, so what is this topic of a data-driven business? Well, for starters, most of what we hear about informing decisions through advanced analytics and big data is nothing new. What is new is the sheer quantity and quality of the data we are being presented with. In theory, these buzzwords and wing-dings we see in business news have been around in operations research for decades. But we have never had the amount of data being produced by what we call the “Internet of Things,” or IoT for short. Think about your cell phone. There are about 12-15 sensors in a typical iPhone that are continuously recording data and feeding it to various services: your location, speed, altitude, when your phone activity is low and when it is high. All of these things can be used to inform decisions and create new products. But it is the same process you learned in the Navy: optimize a problem given the data and based on the constraints. So here is a little introduction to what I view a data-driven business to entail.
1. Theory: Big data is broken up into three Vs (variety, velocity, volume).
A) Variety deals with data coming from different sources, be it IoT data from your phone, online purchases, weather data, Word documents, web page text, etc. These sources differ in the formats they arrive in, be it CSV, JSON, XML, HTML, JavaScript, etc. All of these formats have pros and cons for the task at hand. Some are “lighter” and take up less space (JSON and CSV), while others have built-in hierarchies that help maintain a webpage’s layout (HTML). Big data and data-related decisions for a business will revolve around how to collect these sources and make them play nice with each other. This is called the “data wrangling” phase. We wrangle data from different sources. It is then the data scientist’s job to take the collected data and clean it, or change it into a format that is readable by other programs.
B) Velocity is the speed at which data comes in. It is separated into two types: batch and stream processing. Batch processing happens in batches, or all at once. Stream processing is when data drips into a system. This can be thought of as clicks on a webpage. A company like Amazon would be interested in seeing where a customer clicked, how long they waited on a webpage before buying something, and the order in which they shop. These are rather simple procedures to follow since all web-based applications can time-stamp and track cursor selection. When we get into batch processing, most of the time we think of big data platforms like Hadoop. Hadoop is the industry standard for setting up a system that can divide tasks and send them to different processors in a server (think a big server tower). Each job is mapped to a specific core, and once those individual tasks are done, the results are sent back to the Hadoop server and combined (the reduce step). This process is called MapReduce and is a cornerstone of Hadoop (a toy sketch of the idea appears after this e-mail). Think of it as giving the crew of a ship multiple tasks. You cannot do everything yourself, but you can command junior officers to delegate tasks. Essentially, you are the Hadoop server and your junior officers are the elements of MapReduce. You map tasks to your junior officers and they reduce your workload. The junior officers then give out commands to the enlisted folks (this could be a machine learning algorithm that runs the math equations and operations research models on the individual cores in your Hadoop server and then sends results back to the junior officers, who then report to you). You, or the Hadoop server, then take all those individual reports at the end of the day to come to a conclusion and summarize results for the Admiral.
C) Volume. Simply put, this deals with the sheer size of big data. It hurts my head to think about how big data is in a military or business world, but I will try. Think of Trump’s proposed wall. DHS is going to have quite the mission in dealing with all the data that the sensors on the wall will be sending back, and the UAVs patrolling it need someplace to store data. This is where we get the terms “data management” and “data warehousing.” We need someplace to store data, and Amazon (AWS) and Google are leading the way by putting “server farms” in the desert, or in places where they can use solar power to cool the servers off (they get really hot, which is why IT always puts huge fans in the server rooms in your office).
2. Analysis: The data science process can usually be broken up into three parts:
A. Descriptive: Basic statistics, basic graphs, basic outlier analysis. You are using these fundamentals of data analysis to get an overall picture of the data. Essentially, you have a specific question in mind.
B. Exploratory: You are looking at trends. You don’t have a specific question, but you are hoping to find something interesting, then you can think of a question and go back to descriptive statistics.
C. Predictive: This is usually called “machine learning.” Techniques range from the simple (linear regression, time series forecasting, cluster analysis, decision trees) to the more complex (naive Bayes, random forests, neural networks, deep learning). You are forecasting here.
3. Tools: This is the data analysis process broken into steps and the respective tools you will need
A. Wrangling: Find the data out in the world (JSON, HTML, CSS, JavaScript, Java, etc.)
B. Collection and Cleaning: Formatting data into a structure that you can process.
C. Storage: In-house or cloud-based
D. Query: How do we get data out of the warehouse? Think of a library. You want to get some books about France, but you don’t want to bring the entire library home with you. On the computer at the library you select a few filters to get a cookbook, French for Dummies, and a novel about France. Under the hood, the library computer is running a SQL query. SQL stands for Structured Query Language and is the backbone of how every organization on the planet queries data (a small example query appears after this e-mail). This can be SQL or NoSQL (“Not only SQL”). SQL is for structured data and NoSQL is for unstructured data; think of SQL as vertical integration and NoSQL as horizontal. Vertical means you can stack pages of a spreadsheet on top of each other and drill a hole through the name column for personnel on a ship. All these spreadsheets have personal information on them, but you want to query their rank. You would drill a hole through the “name” column and then through the “rank” column, and you would end up with a new spreadsheet called “Name and Rank” that lists every person on your ship. Horizontal would be the shelf at home that has a CD next to a DVD next to a book in Japanese. Drilling a hole through these will yield nothing, so NoSQL is used to have these talk to each other and connect. It is more cumbersome, but it is now the industry standard for all the unstructured data sources that the IoT provides. For SQL you will see the open-source MySQL and the paid Microsoft SQL Server; Access databases also run on SQL. For NoSQL, MongoDB is the major player.
E. Analysis: Now that your data is clean and ready, you can perform your analysis. The major open-source languages are R and Python, and both support object-oriented programming. R is favored by stats people and Python is favored by computer scientists and engineers. R is the simpler language, but Python can be used for virtually any task. For stats junkies there are Stata, SPSS, Excel, and SAS.
F. Sharing Conclusions: This is where reporting and telling the story of your data come into play. This is the most crucial step, and the one most data wizards fail at. If you cannot explain your analysis in terms that people understand, then all is lost. These will usually be business intelligence (BI) programs that visualize data. Tableau is the heavyweight in the business/government arena. You can create an interactive dashboard that holds all of your data and analysis and link it to your own website for users to explore. Other players are Microsoft Power BI and Qlik.”
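
To ground the MapReduce analogy from the e-mail in something runnable, here is a minimal sketch that uses Python’s built-in multiprocessing pool instead of a real Hadoop cluster; the “log” chunks are invented for illustration. The worker processes play the junior officers (the map step), and the final merge is the summary handed back up the chain (the reduce step).

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

# Hypothetical log fragments; on Hadoop these would be blocks of a file in HDFS.
CHUNKS = [
    "engine nominal fuel low",
    "fuel nominal engine nominal",
    "radar contact fuel low",
]

def map_chunk(text):
    """Map step: each worker counts the words in its own chunk."""
    return Counter(text.split())

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        partials = pool.map(map_chunk, CHUNKS)   # delegate to the "junior officers"
    totals = reduce(merge_counts, partials)       # roll the results back up
    print(totals.most_common(3))
```

And to make the library analogy for SQL concrete, here is a tiny query against Python’s built-in sqlite3 module; the personnel table and its rows are made up for the sake of the example.

```python
import sqlite3

# An in-memory database standing in for a (hypothetical) personnel warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE personnel (name TEXT, rank TEXT, division TEXT)")
conn.executemany(
    "INSERT INTO personnel VALUES (?, ?, ?)",
    [("Rivera", "LTJG", "Engineering"),
     ("Chen", "ENS", "Operations"),
     ("Okafor", "LT", "Weapons")],
)

# Pull only the columns you need (the "Name and Rank" sheet from the analogy)
# instead of bringing the whole library home.
for name, rank in conn.execute("SELECT name, rank FROM personnel ORDER BY name"):
    print(name, rank)
```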

Stuck in the Middle

My name is Nick and I am stuck in the middle. In fact, most of the analytics community is stuck in the middle somewhere. Search LinkedIn and you will see thousands of professionals calling themselves “data scientist.” However, there is no unified definition of what a data scientist is. What do they study? What tools do they use? How many years does one have to be labeled a data analyst and climb the totem pole of professional networking to be recognized as a data scientist? How many GitHub posts must one make, or Kaggle competitions must one enter? Does one use R or Python? Does simply being an Excel Jedi make someone a data scientist? Do you have to know advanced calculus and be an expert in linear algebra to earn the title of data scientist? These questions plague the analytics community and there are no simple answers.

Through my own work in the advanced analytics community I have witnessed a wide range of skillsets among many accomplished data miracle workers. These skillsets stem from different backgrounds: economics, mathematics, engineering, computer science, policy, international relations. Not one of these accomplished analysts followed the same path, yet all have risen to the top of their respective analytics communities. This is not the traditional route to becoming a scientist. One does not become a data scientist by taking a test, performing hard labor to earn a PhD, or taking a 1,000-hour online course. The reputation is earned. But how?

Through the following blog posts, podcast interviews, and a persistent asking of the question, “What is a data scientist?”, I hope to unstick both myself and other confused data nerds from the middle ground between data analyst and data scientist. More importantly, I want this to be an opportunity to drive a discussion that is lacking in the analytics community. For if we are to be translators of nerd, how do we define our profession?
