Future-proofing your data science toolkit

We are constantly inundated with the latest and greatest tools. Over and over, we see friends, LinkedIn acquaintances, and former classmates posting about new certifications and technologies being used on the data science front. It can feel like we are stuck in a never-ending GIF that never quite lets us reach completion in our data science careers.

That feeling of almost learning all data science

So how do you cut through this massive tidal wave of tech stacks to focus on what is important? How do you find clarity through the haze and separate the skills that matter from the tools that amount to a worthless certification that will never be used on the job?

Over the past few years, there have been a handful of tools that have transcended industries and shown a clear signal of being key to a high-performing data science team in the near future. Mainly, these technologies revolve around the concept of big data. Now, you may be asking yourself: what does "big data" really mean? Is it gigabytes, terabytes, or petabytes of data? Is it streaming, large batch jobs, or a hybrid? Or is it anything that doesn't fit in an Excel spreadsheet?

What is big data?

We are used to hearing the three Vs of big data: velocity, volume, veracity. But these are opaque and hard to grasp. They don't translate well across organizations, from a mom-and-pop analytics start-up to a fully fledged Fortune 500 firm leveraging analytics solutions. A better definition of big data comes from Jesse Anderson's recent book Data Teams. In it, he states that big data is simply whatever makes your analytics team say they cannot do something due to data processing constraints. I like to think of it with the following analogy.

My own take: Imagine that you grew up in Enterprise, Oregon, a farming town of approximately 2,000 people. If you were to venture across the border into Washington state, you would think that Seattle is the biggest city on earth. Conversely, if you grew up in New York City, you would think that Seattle is small in comparison to your home city.

Small town, small data

This analogy applies to the size of an organization's data: when you work with small data, everything seems big, and when you work with large data, it takes a far larger magnitude of data ingestion to become overwhelmed. But every organization has a breaking point, the point where the data team throws up their hands and says, "we can't go any further." This is where the future of data science lies. To leverage massive amounts of data (the three Vs of big data again: velocity, volume, and veracity), we need tools that allow us to compute at scale.

Big city, big data

Soaring through the clouds

It is no wonder that scaling up data science workloads involves using someone else's machine. Your (and my) dinky MacBook Air or glorified HP netbook can only hold about 8 GB of information in local memory (RAM), 16 if you are lucky enough to stuff another memory card in the thing. This creates myriad problems when trying to do anything beyond basic querying with larger datasets (over 1 GB). Once we enter this space, a simple Pandas GroupBy command, or its R data.frame equivalent, will render your machine worthless until a reboot.
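
Before renting anything, it is worth knowing the usual stopgap: pandas can stream a large CSV in chunks and aggregate incrementally instead of loading everything into RAM at once. A minimal sketch, with made-up file and column names for illustration:

import pandas as pd

# Stream a file that is too big for RAM and aggregate it chunk by chunk.
# "sales.csv", "region", and "revenue" are placeholder names.
totals = {}
for chunk in pd.read_csv("sales.csv", chunksize=500_000):
    partial = chunk.groupby("region")["revenue"].sum()
    for region, value in partial.items():
        totals[region] = totals.get(region, 0) + value

result = pd.Series(totals).sort_values(ascending=False)
print(result.head())

This only delays the inevitable; once the joins and models get heavier, renting compute is the saner path.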

To gain compute capacity, you need to rent another machine; renting one is referred to as spinning up a compute instance. These compute instances go by different names, and there is a wide variety of hardware and software that can be leveraged when increasing our data science workload capacity. However, when it comes to whom to rent from, there are really only three major players on the market: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). Of course, Oracle and IBM have their own dedicated clouds, but they barely make a dent when competing with the big girls and boys.

Data Lakes

When venturing into the cloud, it is best to begin with the two primary services a data scientist will interact with: storage and compute. When developing data science products on the cloud, you will need to store your data somewhere. In fact, what you really want is somewhere you can store your data with as few rules as possible. This is where the concept of the data lake comes into play. Data lakes give you the opportunity to dump data, regardless of file type, into a central location where you can then grab that data and pull it into an experiment. The respective players here are the following:

  • AWS Simple Storage Service (S3)
  • Microsoft Azure Blob Storage
  • GCP Cloud Storage

Now, don't get me wrong: being able to store data in its proper home, be it a relational database based on SQL or an unstructured store using NoSQL, will always be preferable to dumping your data in a data lake. Moreover, for those data sources that you want on hand at a moment's notice, creating a data warehouse will be preferable. But to get up and running with data science experimentation and skill building on the cloud, data lakes are fantastic.
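
To make the "dump it in the lake" idea concrete, here is a minimal boto3 sketch that pushes a local file into an S3 bucket and pulls it back down for an experiment. The bucket name and file paths are placeholders, and it assumes your AWS credentials are already configured; the Azure and GCP SDKs differ in syntax but follow the same pattern.

import boto3

s3 = boto3.client("s3")  # assumes credentials set up, e.g. via `aws configure`

# Placeholder names for illustration only.
bucket = "my-data-lake-bucket"
local_file = "raw/experiment_data.csv"
s3_key = "raw/experiment_data.csv"

# Dump the raw file into the lake, whatever its format.
s3.upload_file(local_file, bucket, s3_key)

# Later, grab it back down into a working directory for an experiment.
s3.download_file(bucket, s3_key, "working/experiment_data.csv")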

Compute Power

Once you have your data in a safe location, you can start looking at renting compute power. Again, our three main players will look like this:

  • AWS Elastic Compute Cloud (EC2)
  • Microsoft Azure Virtual Machines
  • GCP Compute Engine

When getting started, you want to learn a few things:

  1. How do I make sure that I don't get charged too much money if I accidentally (you will) leave this thing on? There are free-tier compute instances that you can select to avoid unnecessary shocks when looking at your cloud bill (see the sketch after this list).
  2. How do I ensure the correct ports are open? You will need to make sure the right ports are reachable when creating a compute instance; otherwise, data science tools like Jupyter Notebook will not run successfully. Leave ports open that you aren't supposed to, and the whole world can see your instance.
  3. How do I make things talk to each other? Ensure that you have the correct permissions set up. AWS, Azure, and GCP all have access management portals where the minimum permissions and rules need to be set for your storage and compute instances.
  4. If I had two more hours to spend on this, what would the next steps be? Knowing what the next level of complexity looks like is often as important as knowing how to do it. Being able to see how your application naturally builds upon itself and ties into other services will allow you to see the world as a true solutions architect.
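
On the first point, and speaking only for AWS here, a small boto3 script can act as a safety net by listing and stopping anything still running at the end of the day. This is a sketch under the assumption that your credentials are already configured; the region name is a placeholder.

import boto3

# Placeholder region; use whatever region your instances live in.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Find every instance that is still running.
response = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)
running = [
    instance["InstanceId"]
    for reservation in response["Reservations"]
    for instance in reservation["Instances"]
]

if running:
    print(f"Stopping instances left on: {running}")
    ec2.stop_instances(InstanceIds=running)
else:
    print("Nothing running; your bill thanks you.")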

Abstracting the noise

Oftentimes, you will find yourself on a team that does not want to worry about managing a cluster or HDFS (the Hadoop Distributed File System). Naturally, when jumping into the world of big data and scaling out data science workflows, you need to answer the question of how involved you want to be in the day-to-day management of the big data ecosystem. This is why there has been a recent rise in products that abstract the more nuanced data engineering and system administration aspects of the cloud behind a more user-friendly platform.

Databricks is one mover and shaker on the market that has been receiving significant fanfare and accolades for its ease of use. But don't let its new-kid-on-the-block persona dissuade you from using it. Started by the original researchers from UC Berkeley who created Apache Spark, the Databricks team has built a product that lets you leverage cloud computing and a Jupyter-style notebook environment with ease. For a nominal fee, depending on how much power you seek, you can set the cluster size with the click of a button, create instant autoscaling groups, and serve as your own cluster admin team with little upkeep. To sweeten the pot, there is a free Community Edition, and Databricks staff recently released a legit Coursera course.

Zeppelin notebooks are another data science tool that operates on top of distributed systems. Normally a little more difficult to implement, due to the manual creation of HDFS clusters and the permission granting between cloud applications, Zeppelin notebooks allow you to leverage a Spark environment through PySpark and SparkR. When a company's requirements impose constraints, such as a contract with a cloud support service, Zeppelin notebooks let you operate within your already existing cluster setup.
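
To make that concrete, here is a minimal PySpark sketch of the kind of aggregation that chokes a laptop but runs comfortably on a managed Spark cluster such as Databricks or a Zeppelin-backed setup. The file path and column names are placeholders; on Databricks or Zeppelin a SparkSession usually already exists as spark.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create a session if one is not already provided by the notebook environment.
spark = SparkSession.builder.appName("big-groupby-sketch").getOrCreate()

# Placeholder path and columns; on Databricks this might be a DBFS or S3 mount.
df = spark.read.csv("/data/events.csv", header=True, inferSchema=True)

summary = (
    df.groupBy("customer_id")
      .agg(F.count("*").alias("events"),
           F.avg("purchase_amount").alias("avg_purchase"))
)

summary.show(10)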

While there are more tools that allow your company to leverage big data compute, distributed file systems, and all the buzzwords that lie between, the prime movers are going to be built on the big three cloud computing providers (AWS, Azure, GCP). Learning one translates easily to learning the others. Oftentimes, within the data science space, you will find yourself moving from one service to another. Simply knowing the mapping of related terms is often enough to ensure consistency in thought and action when creating new applications across cloud platforms. Personally, I have found that keeping a chart of the naming conventions used by AWS and Azure is quite helpful and ensures continued success in both learning and developing new data science products.

Examples:

  • AWS S3 : Azure Blob Storage
  • GCP VM : AWS EC2
  • Azure ML Studio : AWS SageMaker


Walking the Talk: Implementing an end-to-end machine learning project

3-2-1, Lift Off!

In the previous post, “How to create your first end-to-end machine learning project”, a four-stage process was offered to get you out of the endless MOOC Trap and jump that fence to greener pastures. These included:

  1. Find a tutorial
  2. Follow said tutorial
  3. Re-follow tutorial with your own data
  4. Customize your pipeline with your own bling

To illustrate these concepts, I am going to walk you through how to do this on a newer tech stack. Let’s say, for example, that the Udemy commercials on your YouTube feed have been blaring at you that “Python is where it is at.” So naturally, you want to be like the rest of the cool kids and learn yourself some new tech. You also have a vague idea that cloud-based environments are terrific for Python-based deep learning. Not having a lot of experience in deep learning, you Google, “Deep Learning Projects” and come across some really sweet algorithms that can help you generate text. You now have two things in front of you that you have little experience with: deep learning and renting GPUs.

“You know, deep learning is where it is at”

Last year, I found myself in a similar situation. I had a solid understanding and had completed a 300-hour MOOC on deep learning through Udacity, but I had yet to complete a project outside of MOOC land. I had learned the math, the tech stack by following along with tutorials, and how to leverage GPUs from a sandbox environment. But what I truly needed in my development was the chance to take this new proverbial tool belt and test it out. Enter the four-step process of moving out of the MOOC Trap!

1. Find a Tutorial

Tutorial searching.

The first thing that I needed to do was a solid review of how to use deep learning for text generation. Using Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks is not at the same level as explaining linear regression. This is where reading blogs and watching YouTube videos of folks talking about their use cases, and deep dives into the algorithms, becomes important. Naturally, Medium and Towards Data Science are going to be your starting resources.

After many failed attempts to find a tutorial online that fit my particular use case, I ran across one of my favorite online contributors, Jason Brownlee, PhD, and attempted to follow his tutorial "Text Generation with LSTM Recurrent Neural Networks in Python with Keras". There were many concepts that needed refreshing, so before beginning Dr. Brownlee's tutorial, I needed to review some other examples. In fact, many of those links were attempted first, and a portion of the ideas set forth there were used over the course of my pipeline. But as any student of data science knows, the first GitHub repo or tutorial you look at rarely has the gold you are looking to uncover.

2. Follow tutorial

The tutorial leverages LSTMs by taking Lewis Carroll's Alice's Adventures in Wonderland and trying to generate unseen text. Text generation is powerful because it can be used for bots that support and reduce the human effort needed for tasks such as customer support. This was certainly an area that I wanted to experiment in, and Dr. Brownlee's posts are among the most well-articulated and presented sets of tutorials on the open market.

Once I read through the tutorial, I opened up my local Jupyter Notebook and got cracking! Leveraging resources outside the technical tutorial was important for understanding the math and the reasons why LSTMs are a superior form of RNN (see the links above), namely which activation functions, dropout rates, and learning rates needed to be applied out of the gate (no pun intended, LSTM fans).
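
For flavor, here is a minimal sketch in the spirit of that tutorial: a character-level LSTM in Keras. This is not Dr. Brownlee's exact code; the corpus path is a placeholder and the hyperparameters are just common starting points.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.utils import to_categorical

# Placeholder corpus path.
text = open("wonderland.txt", encoding="utf-8").read().lower()
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}

# Slice the corpus into fixed-length input sequences and next-character targets.
seq_length = 100
dataX, dataY = [], []
for i in range(len(text) - seq_length):
    dataX.append([char_to_int[c] for c in text[i:i + seq_length]])
    dataY.append(char_to_int[text[i + seq_length]])

# Reshape to [samples, time steps, features] and normalize the integer codes.
X = np.reshape(dataX, (len(dataX), seq_length, 1)) / float(len(chars))
y = to_categorical(dataY)

model = Sequential([
    LSTM(256, input_shape=(X.shape[1], X.shape[2])),
    Dropout(0.2),
    Dense(y.shape[1], activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()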

3. Re-follow tutorial with your own data

After following Dr. Brownlee's tutorial, I found that my computer's compute was severely limited. Wanting to make things easy on myself, I transferred my Jupyter Notebook to Google Colab, where I could leverage a GPU to speed up the process. While the end goal was to run this on AWS, Google Colab was a safer bet so I wouldn't accidentally be charged a month's rent to run a fun side project on EC2.

Those EC2 charges.

I have always had this idea of creating a project that could actually solve something in my life. Since I was a little kid, I remember my mother waking up in the dark each Sunday to finish her sermon. She has been a minister for more than 30 years, so this practice happened weekly. The idea came to me that if I could create a bot to help complete her work, much of her time could be freed up. The main issue with training LSTMs is the need for large amounts of data. Lucky for me, my mother saved every sermon she ever gave in Microsoft Word. Over the past year, she had been sending me emails with five sermons at a time, which I stored in a folder, thinking that I would get around to this project but never actually pulling the trigger. After a year of emails, I had 30 years' worth of documents, well over 1,500 of them at roughly 52 sermons a year. To play it safe, I chose her most recent 300 documents.

Soooo, do you have a document I can store my documents in?

Immediately, it became obvious that this would not be like changing the batteries in a remote. There was a lot of work to customize the various string-cleaning steps, make sure that certain words were treated differently, and ensure that capitalization mattered where needed. NLP and text analytics are not all about algorithms and knowing which hyperparameters to set; more and more, they are about domain knowledge. So, like any experienced data science consultant, I called up mom to find out more about her sermons. Having had a front-row seat to most of her career, from waddling under pews in diapers to sitting in my 20s and 30s listening to her ideas flow on social justice and how the historical prophets preached about helping the disenfranchised, I felt I had a good sense of her writings. But just as you gather requirements from clients at a work site, I felt that a phone call to mom to gather her biblical requirements and interpretations was key.

Ramping up that GPU!

Once all my customizations were in place, I was able to generate basic text. Since my LSTM was learning from one document, it didn't have a lot to go on. I needed more power; a lot more power to run over 300 documents. I went to Google Colab, selected the $10-a-month GPU subscription so this bad boy could run uninterrupted overnight, and transferred my version of Dr. Brownlee's tutorial from the Jupyter Notebook on my weak little MacBook Air to Colab. This is where things got interesting!

4. Customize your pipeline with your own bling

First off, Dr. Brownlee offers his neural wisdom to recommend algorithm improvements. I quote:

1) Predict fewer than 1,000 characters as output for a given seed.

2) Remove all punctuation from the source text, and therefore from the models’ vocabulary.

3) Try a one hot encoded for the input sequences.

4) Train the model on padded sentences rather than random sequences of characters.

5) Increase the number of training epochs to 100 or many hundreds.

6) Add dropout to the visible input layer and consider tuning the dropout percentage.

7) Tune the batch size, try a batch size of 1 as a (very slow) baseline and larger sizes from there.

8) Add more memory units to the layers and/or more layers.

9) Experiment with scale factors (temperature) when interpreting the prediction probabilities.

10) Change the LSTM layers to be “stateful” to maintain state across batches.

And he is also nice enough to offer LSTM recurrent neural net "office hour" materials as further reading.

The main issue I had was creating a mechanism that could take each document, extract only the body text, no headers, no footers, no fluff, and place it neatly in a .txt document. It took a few iterations, but the script was able to extract all the text and append it neatly into one file. As you can see from the text below, we have dates, locations, names, and biblical passages that all need cleaning before being converted into vector space for the LSTM model.

The below sample sermon shows these areas:

As you can see, there are many areas that need to be cleaned in this document. There are various titles, headers, and specialized characters that need to be taken into account. The script reads a Microsoft Word document or .txt file and appends it to a master .txt file that can be fed into the LSTM model's cleaning job. Sources for the initial scripting this is based on are linked at the beginning of the article. Overall, you can see that the documents are aggregated from the folder on my desktop, the same folder I dropped my mom's emails into each week, and loaded into a master document.
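
A minimal sketch of that idea, assuming the sermons sit in a local folder as .docx and .txt files (the folder and file names are placeholders), might look like the following; the domain-specific cleaning described above would still come afterwards.

from pathlib import Path
from docx import Document  # pip install python-docx

# Placeholder folder: wherever the emailed sermons were saved.
source_dir = Path("sermons")
master_path = Path("master_sermons.txt")

def extract_text(path: Path) -> str:
    """Pull the raw text out of a .docx or .txt file."""
    if path.suffix.lower() == ".docx":
        doc = Document(str(path))
        return "\n".join(p.text for p in doc.paragraphs)
    return path.read_text(encoding="utf-8", errors="ignore")

# Append every document to one master text file for the model to consume.
with master_path.open("w", encoding="utf-8") as master:
    for path in sorted(source_dir.glob("*")):
        if path.suffix.lower() in {".docx", ".txt"}:
            master.write(extract_text(path))
            master.write("\n\n")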

Once I had 300 sermons read into my pipeline on Google Colab, I was able to begin fiddling with the hyperparameters of the LSTM model. Setting multiple layers, optimization methods, activation functions, and back-propagation learning rates, I felt that I could let the neural net train overnight. As you can see, there are 100,352 unique words in my network. These can be seen as nodes that abstract themselves with magic (linear algebra) into a layer of 8,643 words used to predict which word will come next based on the previous words.

Simple LSTM net, but effective.
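
For what an overnight run looks like in practice, here is a small sketch that continues the Keras example from the tutorial section above (so model, X, and y are assumed to already exist). Checkpointing matters on Colab so a dropped session does not throw away a night of GPU time; the file name and epoch count are illustrative.

from tensorflow.keras.callbacks import ModelCheckpoint

# Keep the best weights seen so far in case the Colab session disconnects.
checkpoint = ModelCheckpoint(
    "weights-best.hdf5", monitor="loss", save_best_only=True, mode="min"
)

# `model`, `X`, and `y` come from the earlier sketch.
model.fit(X, y, epochs=60, batch_size=128, callbacks=[checkpoint])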

Waking up the next morning was like being a kid on Christmas morning. The excitement I felt opening up Google Colab to see the results pulsated through my veins. I opened up the first paragraph to see the following:

A puppy, representative of Christmas morning, because, well… puppies.
 

“I found myself wondering about the wire this church, the people and the post of the seen of the complical seeming the church.  The and in the and the lives that the see the worship in the pertans the life the hell the story that the people and the people that the people the work.  The work that the light and viel the final see the make of us the for the world the healing of the conter and the people and this people at the story and the say.”





Ok, not exactly the Sermon on the Mount, but hot dang, I got something running! Of course, getting closer to actual human speech will take more effort, and the idea behind this end-to-end machine learning project was to get something on the board and move from there. Currently, the next step is to move this onto AWS, store the documents in an S3 bucket, and pull them into SageMaker, where I can control the level of compute power needed for a more in-depth tuning of the product.

As you can tell, there are now many avenues to take this project. We can look at creating a front-end user interface with Flask, deployed on EC2 in a Docker container with an Elastic IP address and a custom Route 53 domain. We can deploy on Azure's Machine Learning Studio if we have a specific bent toward Microsoft.

Getting started and out of the MOOC Trap is the main battle for nascent data scientists and experts alike. We pride ourselves on the knowledge we have accumulated but may feel shy about going out into the world and looking at a blank page. I hope that explaining this process is helpful to those data scientists out there just getting their feet wet.

The following are links to my code, the resources I used, and the proper accreditation for the code that was borrowed and influential.

AWS integration, here we come!

Link to GitHub with code


How to create your first end-to-end machine learning project

You want to impress future employers by creating a dope end-to-end machine learning project, fully kitted out with a web-scraping data collection strategy, a deep-dive exploratory phase, followed by a sick feature engineering strategy, coupled with a stacked-ensemble method as the engine, polished off with a sleek front-end microservice fully deployed on the cloud. You have a plan, you have the bootcamp or degree program under your belt, and you have $100 of AWS credits saved from that last Udacity course. You fire up your laptop, spin up Jupyter Notebook in local mode, and log into AWS. You then draw a blank. A blinking cursor in your notebook next to "In [1]:". The coding log-jam akin to writer's block, the proverbial trap from following too many MOOCs, a deep hole of despair. What do you do?

Entering the wilderness

Entering the wilderness of a real project (credit: Hendrik Cornelissen)

When venturing out into the wild beyond official coursework, bootcamp code-alongs, and the tutorials of Massive Open Online Courses (MOOCs), taking that first step into uncharted territory to create your first end-to-end project can be scary. Much of the time, it is difficult to know where to start. When re-skilling, up-tooling, or revamping our way into data science, we tend to get distracted by the latest and greatest in algorithmic development. As written in the previous post, end-to-end machine learning projects rarely leverage the most complicated algorithm in academia. In fact, many of the machine learning ecosystems in development these days at major companies around the world are slight deviations from the tried-and-true approach of what we see as "standard data science" pipelines. So why put pressure on yourself to apply the latest cutting-edge ML algorithm when you are just starting out?

Knowing how to leverage data science tutorials is your first step
(credit)

All about tutorials

The approach that I like to take when learning something new and wanting to try it on my own use case normally follows a four-step pattern:

  1. Find a tutorial
  2. Follow said tutorial
  3. Re-follow tutorial with your own data
  4. Customize your pipeline with your own bling
A tutorial in a tutorial found in the wild. Double meta!

Excellent places to start for tutorials include:

Complete the tutorial

Follow along with the tutorial. Many times YouTube is a great place to really get to know the flow and how to interact with the tech stack and dataset that you are using. I always like to find video tutorials that have an accompanying blog as well. Much of the time, the author will guide you through the data science pipeline, while referring to the documentation that they created in the blog. AWS does a fantastic job of this, especially in regards to their curated videos that follow SageMaker examples.

Change the data

Me after changing the data (credit)

Once you feel comfortable with the data science approach that has been taught, and are able to understand all the code and dataset particulars, it is time to bring in your own data. Your data should mirror the data that the tutorial uses. For example, if you are bringing in churn prediction data when the tutorial takes a regression-based approach, then you should rethink your target variable strategy. Try to find data that fits the algorithm family you are working with: classification with classification outcomes, regression with regression, and the same for unsupervised learning problems.

Plus it up and go into the wild

Unleash the beast! (Credit: Prince David)

You are now at the point where you can begin adding a custom flavor to the pipeline. You have already succeeded in bringing your own data; now it is time to put the pieces together into a true end-to-end project. If the tutorial you are following only moves from EDA (exploratory data analysis) to evaluation criteria for the machine learning algorithm's predictions, or maybe you are learning the front-end component of deploying Flask or Django on EC2, then this is the perfect opportunity to spice things up!

Did someone say "plus it up"? (credit)

Try to think about what end-to-end really means. Where does the data come from? How is it collected? Can you track down an API to bring in the data on a schedule? If no API exists, can you scrape it? Can you automate that scrape with a cron job? Once the data is in, can you write functions that perform the EDA sections for you and automate the output? Can you do a deeper dive and create a story around the EDA you are digging into? Once you have created your EDA, what feature creation can you do? Can you bring in another dataset and stitch those sources together? In other words, can you add a data engineering component?
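
As one small illustration of that automation step, a data pull can be wrapped in a script that appends each run's results to a growing CSV and is then scheduled with cron. The API URL and field names below are purely hypothetical placeholders.

import csv
import datetime
import requests

# Hypothetical API endpoint and fields, for illustration only.
URL = "https://example.com/api/daily-stats"

def pull_and_append(path="daily_stats.csv"):
    """Fetch today's records and append them to a growing CSV."""
    records = requests.get(URL, timeout=30).json()
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for row in records:
            writer.writerow([datetime.date.today().isoformat(),
                             row.get("metric"), row.get("value")])

if __name__ == "__main__":
    pull_and_append()
    # Schedule with cron, e.g.:  0 6 * * *  python pull_stats.py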

As you can see, no matter the tutorial that you are following, there are always areas for improvement. There are always ways that you can get your feet wet then really take off with your own touch. Once you have created this pipeline, think of the story that you want to tell about it. Is this something that I can talk about in a future interview? Is this something that I need to communicate to non-technical members of my team back at work to show that I am ready for sponsorship to the next level? How can I tell the story of what I have done? Can I write a blog piece about this and share with the world?

One final thought

Stick around for more! (credit)

Throughout the data science process, we learn from others. As always, properly credit those who came before you and were influential in your work. If you followed a tutorial that really helped with a challenging section, include that tutorial link in your notebook. Just as we try to move our careers and curiosities forward, it is paramount that we give a bow to those who conducted the science before us.

Coming next week: An example of a step-by-step approach to using a Recurrent Neural Network to do my mom’s job for her (with limited success of course because she is a Rockstar)!


The Data Science Tool that Never Goes Out of Style

Admit it, we have all been there. You are on your fifth Udemy course on what you think is the next must-have algorithm or tech skill, from Generative Adversarial Networks to YOLO and ANNOY, when you ask yourself: why am I doing this? Yes, it looks super cool to post about how only real data scientists do back-propagation by hand or grow their own neural nets with organic soil under their desk, but does this actually translate into on-the-job success? Does this further drive the passion for data science and the creative spirit, or just the urge to constantly follow the most recent trend? Or, as previously stated a few years ago on this blog, to be lost in the utter darkness of the "MOOC Trap"? So hit that CAPS LOCK key and let's get down to business!

I am not saying that there is no place for advanced algorithmic learning and the implementation of cutting-edge machine learning techniques on the job. But in my experience working for various Fortune 500 companies, in the consulting world, and in government, it is rarely the case that these are implemented on a day-to-day basis. What truly matters are the core skills that we tend to forget. This is why I am hitting that #ThrowbackThursday button on your social media feed and recommending you pour a glass of wine, light a candle, and take your relationship with SQL to the next level.

No one loves SQL. Seriously, it is not what we think about when we hear "sexiest job of the 21st century." Running queries does not win Kaggle competitions, land you on the front page of Analytics Vidhya, make you want to hit that subscribe button and pay for a monthly membership to TowardsDataScience.com, or make you yearn for some sweet, sweet credits to run queries on AWS RDS. But it is important, oh so very important. Most of the jobs on the market in both data analytics and data science not only require SQL; it acts as the first cone you must pass to actually speak with a human being beyond a phone screen. So where to start?

“SQL Rocks!” or something like that. You get the idea.

Write one query a day

Habits grow out of consistency, but our better natures tend to pull us away, telling us that one day won't make a difference and that we can wait to start tomorrow. But the simple act of one query a day can be quite transformative. This can be as simple as SELECT * FROM table. That basic act will create a routine within your subconscious. One query will become two the next day, then five the day after. Next thing you know, you will move from basic queries to more complex WHERE clauses, conditional statements, and playing interior designer with those WINDOW functions that you will be dropping in no time.

But how do I start?

It is no secret that data science is more and more influenced by game theory and behavioral economics, and one thing we learn there is that incentives matter. In fact, research has shown that losing something hurts more than gaining something (Kahneman & Tversky, 1979). This is why starting with a penalty rather than a reward can get you to SQL habit formation faster.

Action Item

Create a daily penalty, and make it something outside your own benefit. Believe me, I have tried creating penalties that involve paying off $10 of student loans for each day I miss. That just ended up with me patting myself on the back, saying, "I can skip, because it is going to something I need to accomplish regardless." The penalty should be a donation to charity, 50 burpees you have to do that day, or the loss of TV privileges for your favorite show. This may seem tough, but to create a habit, sometimes we need a fire under our butt.

I don’t like penalties, just give me something fun!

Ok, ok, I may have come off a little harsh with the whole penalty thing. If you are someone who is big on motivational phrases, likes to watch motivational videos before going on a jog, or is big on the social media scene, create a challenge. We see a lot of #100DaysOfCode challenges out there. This is where you can jump right in. Challenge a friend to a bet: whoever can accumulate the most consecutive days of writing a SQL query receives something (cash, a six-pack of beer, or a back rub, you figure it out). Create a leaderboard, invite colleagues and those friends who are on the job hunt, and get after it!

What are some good resources?

Some fine resources out there are the following:

  1. SQL by Jose Portilla (Udemy). The guy is a legit phenom of an instructor.
  2. DataCamp
  3. LeetCode
  4. w3schools SQL course
  5. learnsql

Advanced: Make friends with the ETL team

If you are already on the job and on a data science team, slide one seat over on the proverbial team chairs and make friends with a member of the ETL team. These are the folks who create the pipelines that move data in and out of the applications and machine learning algorithms we design. ETL team members are expert-level SQL junkies. Almost every member of an ETL or solutions architecture team has SQL skills whose warm-ups look like my most advanced queries. By talking about the data challenges they face, seeing how they approach complex table joins, watching them create data warehouses for your analytic dashboarding pleasure, or simply having them unclog your data faucets, you can learn a lot. In my experience, these individuals are also excited to learn about data science, and friendly lunch-and-share sessions can not only create internal team synergies but also lead to you acquiring new SQL skills in the process.


The Effective Data Science Resume


 

Writing an effective data science resume can be challenging. There are so many divergent signals from the industry about what good content looks like. Do I include the latest tech stack from this year's AWS re:Invent conference webinars? Do I put down all the Kaggle competitions that I have completed? Do I include the Ames Housing regression model that I finished at a boot camp but never had any passion for? How do I show a future employer that I communicate well within and between teams?

The field of data science has many roles, and tailoring your resume to the specific function you want to play on a data science or analytics team is crucial. This post is not a step-by-step guide to creating a unicorn-inspired resume; rather, it helps lay the groundwork for a resume that you can then customize to the position you are applying for.

 

Step 1: The “About Me” Intro

Start with a small blurb about yourself. This can be pithy and short. Include a strong message that highlights your passion for analytics but doesn't simply list your accomplishments. This helps set the theme for the resume. This is not the place to say that you are a "good communicator" or that you are "an inspiring individual." If you think that what you are writing sounds generic, then don't include it. Putting that you excel in "interpersonal skills" is equivalent to saying that you know how to open Microsoft Word. This should be seen as your mission statement, your raison d'être for why you are in analytics. Put some heart into this little opener!

 

Step 2: Education


Don't worry if you got an undergraduate degree unrelated to computer science, programming, or engineering. If your previous degree is not related to your current situation, put it down anyway. It shows that you are well-rounded and have the ability to think outside the norms of statistical jargon. Some of these "unrelated" fields are even better suited to emerging areas in data science. We are seeing that majors such as English literature have a benefit when fitting models for natural language processing (NLP) routines; parsing language and NER (named entity recognition) are the heartbeat of many unstructured machine learning products. If you studied political science, it shows that you are able to understand the societal boundaries that we face when looking at the morality and bias within machine learning. If you studied gender studies, a new era of image detection and of how we understand gender within computer vision will prove that major useful. This is all to make the point that you did not need to study computer science to become an effective data scientist. Now, if you want to work at Facebook and help develop PyTorch's new neural network architecture, then this post isn't for you. But if you want to help bring data science practices into organizations that are inherently human-centered, such as optimizing product placement, predicting voter turnout via political polling, or engineering features for how to get people into the latest Mission Impossible thriller, then these liberal arts skills will come in handy.

 

Step 3: Technical Skills

Within the data science consulting world, especially amongst the Big Four, having technical skills laid out nice and neat in front of the recruiter is crucial. Having sections for programming languages, statistical knowledge, algorithms, cloud computing resources, and visualization tools is highly critical. This helps recruiters center in on what they are looking for: that elusive purple squirrel of job applicants.

 

Step 4: Job Experience

Now is the time to shine! If you have years of data or analytics related job experience, be sure to bullet your accomplishments. Using the STAR (Situation, Task, Action, Results) method is more important than listing the tools and techniques that you used to complete a task. It shows initiative and a "go get 'em" attitude. An example of a data science consultant leveraging the STAR method could be the following: "Helped major airline client automate 10,000 email requests per day by using NLP LDA techniques (Latent Dirichlet Allocation) to automate the response process, resulting in a reduction of in-person hours by 30 percent." This STAR approach shows that you had a situational grasp of the task at hand, the methods to complete it, and the business justification for your work.

Step 5: Projects

Personally, this is an area that I see too much of on resumes. Much of the time there will be a candidate who has about one year of experience in an analytics-related field, but half the resume will be school or side projects. The school or side project is important, but it is not the reason for creating the resume. You want to show potential employers that you have been battle-tested and already have the experience that they desire. Now, you may be asking yourself, "But I am just finishing (fill in the blank) education program. How can I put experience?" If you were in an internship, see Step 4. If you were in a co-op, see Step 4. If you contributed to a volunteer organization, see Step 4. Remember, the resume is there to give employers a signal that you have been there before.

On to projects. Include projects that are unique, those topics that you don’t see too often. A few projects that should NEVER be on a resume: Titanic dataset, Iris dataset, Ames or Boston Housing dataset. If you have done projects on those datasets, no matter how cool they were or how dope the visualization and dashboard you created, please don’t put them on there. The goal is to separate yourself from the pack, not to mirror the herd.


The following are the kinds of projects that are effective. If you had to go out and collect the data from an outside source or had to do an immense amount of cleaning, include that. The project can have an end result of summary statistics and a few bar charts. But if that dataset has not been analyzed before, maybe you are looking at weather data from Timbuktu or baseball games in the one stadium you went to as a child, include those! These are what we refer to as "passion projects": projects that, when you discuss them with others, let them see your excitement. I once knew a data analytics professional who recorded how he got to work each day. He did this for one year, jotting down the means of transportation (subway, bus, bike, walk), the weather (cloudy, sunny, snowy, rainy), and how much time it took him to get to the office. He made a basic Excel dashboard from this information. It was one of the coolest passion projects I have ever seen. Don't get me wrong, I am not saying to include Excel dashboards on your resume, but don't think that the passion project needs to win Kaggle competitions or use the newest machine learning algorithm out there.

One final note on projects. The best projects are the ones that follow the data science lifecycle. You identified a question, formed a hypothesis, collected the data, cleaned the data, created features, created a model, analyzed the results, visualized those results, and explained those results. That is what future employers are looking for. If you happened to push that model to production via a cloud service, then hats off to you my friend, because you just told an analytics story that most organizations yearn to have.

Step 6: Certifications

If you have any certifications, put them. I am talking about ones that you had to take a test for. Most of the time, these are through the actual platform distributor themselves. Tableau Desktop Associate, AWS Cloud Developer, Azure, GCP, IBM, etc. A trend that I am seeing is people listing every Udemy or Coursera course that they have taken. I am all for online learning platforms and frequently use Udemy when I need to learn something quickly. (“Hey, Python is where it’s at. You know, you should take this course on Udemy, it’s taught by a real data scientist”). Heck, I used Udemy last week to learn how to enhance security on my AWS S3 bucket. But what can show a potential employer a signal that you may not be experienced, in my humble opinion, is to include that Udemy Intro to NumPy course that you took which lasted 2.5 hours. My personal rule of thumb is that if you think it separates you from others applying to the position, then put it. But if it is something that doesn’t, then leave it off. The resume has limited space and including what brings you to the top of the pile, rather than what helps you blend in, is paramount.

Step 7: Interpersonal Skills

Yes, I said it in the beginning: don't include this as a skill set. Rather than explicitly stating that you are an effective communicator or leader, prove it! If you have attended Toastmasters and given speeches through their program, that is one of the best ways to both improve your public speaking and signal that you keep improving it.

Being an effective data scientist is about persuading others that your techniques and process are solid. Working on oration skills will help you get there. On a similar note, if you have presented at conferences, been a community leader of some sort, or volunteered at an organization, include that in a personal section at the end. This all shows the inherent humanity in your candidacy for a job and your ability to be both a leader and a team player.

Last remarks

There is no right way to write a resume, but there are many ways to make yourself seem like every other data science candidate out there. The goal is to show your uniqueness to an organization, to exude confidence and proven performance, even if you don't have a lot of experience. Companies want individuals who contribute to the team, so your resume should be the embodiment of that.

 


Extracting Stays from GPS Points

Contributing Author/Data Scientist: Brandon Segal

 

As many news outlets have been reporting over the past few weeks, there is a new push to begin leveraging the wealth of location data captured from citizens' cellphones every day. Setting aside the ethics of whether that should be done, it does bring up an interesting question: what is the best way to find out where devices are dwelling?

GPS Data Overview


Knowing where devices dwell could help public services determine whether social distancing efforts are working, by showing locations with high densities of devices or by showing whether certain businesses are ignoring the given guidelines. To answer these kinds of questions, local governments could turn to data brokers who provide high-precision GPS data from millions of citizens who have apps on their phones with location permissions. This data is what is commonly referred to as spatiotemporal data. Spatiotemporal data is, as the name suggests, data that has both a time and a location associated with it, and it is used to describe moving objects or geometries that change over time.

Let’s look at an example of the spatiotemporal data your phones produce and send to these different data brokers.

{
  "latitude": 39.9526,
  "longitude": 75.1652,
  "horizontalAccuracy": 30,
  "speed": 10,
  "timestamp": "2020-03-26T23:11:26.787Z"
}

Spatiotemporal Data Sample

On a map, this data would be a single point and would be pretty uninteresting by itself. However, when records are taken over a longer period, we begin to see patterns emerge. Below is an example of a GPS trace from a university student in Seoul over two months. Notice how from these traces we can likely determine their common travel patterns, common dwell points, and the outliers on the map caused by infrequent trips or device inaccuracies.

Student at Yonsei University: GPS traces over two months taken from an Android device (link)

 

Extracting Stay Points

Plotting these traces on a map makes drawing general qualitative trends easy, but it is still difficult to measure where a device spends its time, whether that is a specific coffee shop, a mall, or even a census block. Grouping these unlabeled data points into logical groups calls for a clustering algorithm that can group them based on the density of points. These kinds of clustering algorithms are called density-based clustering algorithms, with some examples being:

  • DBSCAN: A density-based algorithm based on the distance between points and a minimum amount of points necessary to define a cluster
  • OPTICS: A variant of DBSCAN but with a hierarchical component to find clusters of various densities
  • DJ-Cluster: Standing for Density-Join Clustering Algorithm, it is a memory sensitive alternative to DBSCAN

Each of the algorithms above has its own benefits but they all depend on the following:

 
  • Having a defined distance metric to measure how far the points are from each other
  • A method for defining whether a group of points can be considered a cluster.

Each of these algorithms can be used with spatial data to find groups of points, but the main challenge appears when the temporal component is introduced. Distance can no longer be defined just by the Euclidean distance between the x and y components; you now have to take into account how far apart the points are in time. While working with spatiotemporal data I have used several clustering methods, but the one that worked best with GPS data derived from phones was one called D-StaR [1].
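
For the spatial half of that distance, GPS coordinates are usually compared with the haversine (great-circle) formula rather than a plain Euclidean distance. A small self-contained sketch:

import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in meters."""
    r = 6_371_000  # mean earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Two points in downtown Philadelphia, roughly 180 m apart.
print(haversine_m(39.9526, -75.1652, 39.9537, -75.1668))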

D-StaR stands for the Duration based Stay Region extraction algorithm which was developed by Kyosuke Nishida, Hiroyuki Toda, and Yoshimasa Koike for extracting arbitrarily shaped stay regions while ignoring outliers. The benefits this algorithm has over others include:

  • It allows for the points to be sampled more irregularly than other algorithms that depend on a minimum amount of points to define a cluster
  • It has computational and memory benefits over algorithms that require computing the distance one point is from every other point
  • It has the potential to handle streaming data and emit stay regions in a more event-driven application

D-StaR expects the input data set to be sorted by time from beginning to end. After sorting the data, the algorithm takes a few input parameters, including the following:

  • ϵ: A spatial distance that a point has to be within to be considered a spatial neighbor to a core point (measured in meters)
  • q: A window size that determines the number of points before and after a core point used to find spatial neighbors (unitless)
  • t: A temporal threshold to consider a merged cluster a stay region (measured in seconds)

Example of a cluster determined by D-StaR for distance threshold ϵ, sliding window size q = 3, and core-point duration threshold t = 15. Point pi is written as (i, ti) in this figure.
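
The exact D-StaR procedure is laid out in the paper, but the general flavor of duration-based stay extraction can be shown with the older, simpler stay-point detection idea: scan forward until a point falls outside ϵ of the anchor point, and keep the run if it lasted at least t seconds. This is a simplified stand-in rather than D-StaR itself (it has no sliding window q and does not merge regions), and it reuses the haversine_m helper sketched earlier.

def extract_stays(points, eps_m=100, t_sec=900):
    """points: list of (lat, lon, unix_time) tuples sorted by time.
    Returns (mean_lat, mean_lon, arrive, depart) for each detected stay."""
    stays, i, n = [], 0, len(points)
    while i < n:
        j = i + 1
        # Walk forward while points remain within eps_m of the anchor point.
        while j < n and haversine_m(points[i][0], points[i][1],
                                    points[j][0], points[j][1]) <= eps_m:
            j += 1
        # Keep the run only if the device lingered for at least t_sec.
        if points[j - 1][2] - points[i][2] >= t_sec:
            lats = [p[0] for p in points[i:j]]
            lons = [p[1] for p in points[i:j]]
            stays.append((sum(lats) / len(lats), sum(lons) / len(lons),
                          points[i][2], points[j - 1][2]))
        i = j
    return stays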

 

Using this algorithm, I found that F1 scores for finding actual stay regions improved by 20% on data derived from phones. I invite you to read the paper linked above that covers D-StaR, as that helps support this kind of academic research. In the next post, I will go over how to implement this algorithm using Python and how we can drop it into a data processing pipeline.

About the author:

Brandon Segal is a software engineer currently working on building enterprise machine learning infrastructure, with a strong focus on data engineering and real-time inference. Previously, Brandon worked in data visualization and as a geospatial data scientist, where he learned about the interesting challenges of building data-intensive applications. Located in the DMV area and originally from the Philly area, he loves spending time with his wife, volunteering in Arlington, and continuing to learn.

Github: https://github.com/brandon-segal

Email: brandonsegal.k@gmail.com

 


PGA Tour Analytics: Accuracy Off the Tee


 

We Looked at Distance

In the previous analysis of distance, we found that there was a severe uptick in distance gained off the tee after the introduction of the Pro V1 golf ball. As you can see from the plot below, the red line indicates when the Pro V1 was introduced in October 2000.


Now Enter Accuracy


But what has happened to accuracy off the tee during this same time period? Again, using data derived from Shotlink and scraped from the PGA Tour's public-facing website with Python, we have collected information at the tournament-week level for every player to have made a cut since 1980. This dataset adds another piece to the puzzle of an eventual model that can help us determine the most important features of the modern Tour player's success. (Note: the year 2005 is missing from this dataset due to issues scraping that particular year. All calculations impute values for 2005 based on the years 2004 and 2006.)

Leading up to 2000, technology helped Tour players find the fairway off the tee. In fact, the trend from 1980 to 1999 is a story of increased accuracy off the tee. The sharp decrease occurred immediately when golf balls started flying further.


The narrative of distance over accuracy becomes apparent when we view distance and accuracy off the tee together. On average, Tour players got longer at the cost of accuracy. The relationship holds, if not more strongly, for players who won that week: Tour winners show an even more exaggerated drop in the percentage of fairways hit off the tee, while hitting it further than the average Tour player.


Forming Relationships

Let's now take a look at the correlation between driving distance and accuracy. Taking each player that has made a cut in a Tour-sanctioned event since 1980, approximately 112,619 observations, we can plot distance against accuracy. Each blue dot represents a player, and the marginal distributions of accuracy and distance are shown on the top and right axes. More importantly, the combined scatterplot lets us see the relationship between distance and accuracy. Using the Pearson correlation coefficient, we can calculate the linear co-movement of these two variables. Simply put, how well do they move in relation to each other? For an additional yard of distance, what decrease in accuracy can we infer?

And for the stats nerds out there, the equation for your enjoyment.

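In standard notation, with x as driving distance and y as driving accuracy over n player-week observations, the sample Pearson correlation is:

r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}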

The following scatterplot highlights the average correlation coefficient between 1980 and 2020 of -0.27, meaning that for every additional yard, a player could expect to lose roughly 0.27 percent in accuracy. Put another way, an increase of 10 yards would correspond to a decrease in accuracy of about 2.7 percent.


Now, this is all on average, and it is very difficult to assume that 1980 looks like 2020. When running the numbers for 1980 alone, the Pearson coefficient was -0.24, while for 2020 it was -0.33. What would be interesting is to see these coefficients over time.

As you can see from the plot below, a Pearson coefficient was calculated for each year and the coefficients were then plotted over time. A linear trend line was added to demonstrate that, while there were fluctuations from year to year, the overall story is that players have been giving up accuracy as they get longer.


For example, in 1980, a player gave up approximately 2.5% in fairway accuracy for every additional 10 yards gained. In 2020, a player has to give up almost an additional 1% of accuracy off the tee to gain those same 10 yards. This makes sense when we think about it for a moment: players hit it longer, and a 5-degree miss with the driver will be further offline at 300 yards than at 250 yards. It is simple geometry; the further you travel along a line diverging at an angle, the further you end up from the original line.

While none of these findings are earth-shattering, my hope is that through iterations of exploring these PGA statistics, a meager contribution to the golf analytics community can be made.

As always, the code used in this analysis is available at the author's GitHub repository, https://github.com/nbeaudoin/PGA-Tour-Analytics, and the author can be found on LinkedIn at https://www.linkedin.com/in/nicholas-beaudoin-805ba738/

 


PGA Tour Driving Distance Analytics

Distance Debate

The USGA's February 2020 distance report has riveted the professional golf community. Calls by top players for a bifurcation between the ball played at the Tour level and an amateur golf ball have created a rift that will most likely be played out in the courts as ball manufacturers go to battle to protect their patented technology. The debate about distance centers on courses becoming unplayable for the modern PGA Tour professional due to increasing distance off the tee. Holes such as the 13th at Augusta National going from an iconic Par 5 to a mere drive, pitch, and putt have necessitated the lengthening of courses and the purchase of additional property to accommodate the added length. Both environmental concerns and the degradation of strategy on courses such as Riviera, host of the 2028 Olympics, are painstakingly debated. The following analysis helps paint the narrative of how we got here.


 

Enter Pro V1: King of the Golf Balls

October 11, 2000, was a day that rocked the golf world. On that day, the Pro V1 entered its first round of tournament play at the Invensys Classic in Las Vegas. The "Professional Veneer One," a.k.a. the Pro V1, was a solid-core ball that displaced the wound technology of the traditional golf ball. Tour players immediately began to put the Pro V1 in play as their prime gamer for its distance benefits. Not wanting to be left behind the curve, Nike developed its own version of the urethane golf ball, the Nike Tour Accuracy, which Tiger Woods immediately put into rotation. Throughout the early 2000s, it is said that distance gains were dramatic. But just how much did distance increase during this time period?


The following analysis comes from data derived from the PGA Tour's public-facing website. Using Python to scrape the data from the site's HTML, it provides a unique glimpse into 2,022 PGA Tour players and how the PGA Tour derives its statistics from Shotlink (see my former article on Shotlink analytics). Since Shotlink no longer authorizes or allows academic usage of its data, it has become more challenging for researchers of the game to dig into pressing questions at the top of the game. My hope is that the following analysis helps contribute to a more analytical narrative of golf's top performers.
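
The actual scraper lives in the GitHub repository linked under Methodology and Code below. As a rough illustration of the approach, and assuming the statistic of interest is rendered as a plain HTML table (the URL here is a placeholder for a PGA Tour stats page; real pages may be rendered by JavaScript and need more work), a pandas-based pull could look like this:

import pandas as pd
import requests

# Placeholder URL for a single season's driving-distance page; the real
# site layout and URL scheme may differ.
url = "https://www.pgatour.com/stats/stat.101.1980.html"

html = requests.get(url, timeout=30).text
tables = pd.read_html(html)   # parses every <table> element on the page
driving = tables[-1]          # assume the stats table is the last one

driving["year"] = 1980
driving.to_csv("driving_distance_1980.csv", index=False)
print(driving.head())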

 

The Numbers

When looking at the average driving distance from 1980 through February 2020, it is obvious that the mean distance has increased. In fact, since the introduction of the Pro V1, the average distance on tour has increased by 18.17 yards.


When looking at the year-over-year change after the introduction of the Pro V1, we see a phenomenal spike in distance. 2000 and 2002 saw the biggest jumps in distance off the tee. As each company followed suit after Titleist's Pro V1, it can be hypothesized that the performance of PGA Tour players off the tee rose dramatically due to this technology change.

[Figure: year-over-year change in average driving distance]
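The year-over-year jumps can be computed directly from the yearly means (a sketch reusing the hypothetical `yearly_mean` series from above):

```python
# Year-over-year change in tour-average driving distance, in yards.
yoy_change = yearly_mean.diff()

# Largest single-season jumps; the seasons around the Pro V1's debut
# should appear near the top if the narrative above holds.
print(yoy_change.sort_values(ascending=False).head())
```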

To take a deeper look at the spike in driving distance, we examined the distribution of distance off the tee between 1999 and 2003. The assumption is that since the Pro V1 was introduced in late 2000, competing ball manufacturers adopted the urethane cover and their Tour players soon gamed the same technology. This four-year span covers the period of technology adoption among Tour members and, thanks to its large sample size, should be able to detect a statistically significant difference if one exists.

[Figure: distribution of driving distance off the tee, 1999–2003]
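One hedged way to formalize that comparison is a two-sample test between the 1999 and 2003 seasons (again assuming the illustrative `driving` frame and column names; this is a sketch, not the test used in the original analysis):

```python
from scipy import stats

dist_1999 = driving.loc[driving["year"] == 1999, "avg_distance"]
dist_2003 = driving.loc[driving["year"] == 2003, "avg_distance"]

# Welch's t-test: do mean driving distances differ between the two seasons,
# without assuming equal variances?
t_stat, p_value = stats.ttest_ind(dist_2003, dist_1999, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```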


What’s Next?

While this analysis provides only a summary glimpse of how driving distance has changed over the past 20 years, it is important to keep in mind that driving distance was already increasing at an unprecedented rate before the Pro V1 was introduced. The debate rages on over whether to scale back the ball currently played on tour. Further analysis is warranted to determine how much of an impact the modern solid-core ball had on player performance. Such an impact evaluation would call for multivariate regression, checks of parallel trends before and after the Pro V1’s introduction, and a difference-in-differences estimation strategy.
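As a rough illustration of what that difference-in-differences specification could look like, here is a sketch with a toy panel; the `switched` and `post` indicators, the column names, and the values are all hypothetical and not part of the original analysis:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical player-season panel: "switched" flags players who adopted a
# solid-core ball, "post" flags seasons after late 2000. Toy values only.
panel = pd.DataFrame({
    "player_id":    [1, 1, 2, 2, 3, 3, 4, 4],
    "switched":     [1, 1, 1, 1, 0, 0, 0, 0],
    "post":         [0, 1, 0, 1, 0, 1, 0, 1],
    "avg_distance": [270, 285, 268, 281, 265, 270, 272, 276],
})

# Difference-in-differences: the coefficient on switched:post estimates the
# distance gain attributable to switching, net of the common time trend.
did_model = smf.ols("avg_distance ~ switched + post + switched:post", data=panel).fit()
print(did_model.params["switched:post"])
```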


Methodology and Code

The data analytics code and notebooks created for this analysis are available at the author’s GitHub:

https://github.com/nbeaudoin/PGA-Tour-Analytics


Data sources: 

http://www.pgatour.com

Image sources:

2015 Titleist Pro V1 ball review

https://www.brandsoftheworld.com/logo/pga-tour-4


The Match: Tiger vs. Phil on the Green

*** Legal note: Written permission has been granted by CDW to use the Shotlink database for academic purposes. No profit is gained from posting this analysis. ***

The Match

Tiger Woods and Phil Mickelson have always been on the fan wish list for a Sunday showdown down the final stretch of a major championship. Surprisingly, each player has captured the drama and nail-biting finishes of major championship wins without the other nipping at his heels. Craving more Tiger-Phil action, the golf gods (a.k.a. pay-per-view TV) have presented an opportunity to watch these superstars go head-to-head in a made-for-TV affair. “The Match” goes down over Thanksgiving weekend in Las Vegas at Shadow Creek: 18 holes of match play (hole by hole, winner take all) for $9 million. While the Vegas odds have Tiger favored at –220 to Phil’s +180, those numbers may shift as golf fans dig deeper into the pair’s recent Ryder Cup struggles.


The purpose of the analysis below is to examine one component of Tiger and Phil’s games. Each week leading up to “The Match,” I hope to present a piece of the puzzle that will help golf fans and analysts alike understand the driving factors behind each golfer’s success. This week we focus on where most of the drama will occur: the putting green.

Data Collection

The data used in this analysis come from CDW’s PGA Tour Shotlink database. (You can read about Shotlink here.) This database is a proprietary collection of every shot hit on the PGA Tour regular season from 2003 to 2018. Each shot records 46 different variables, ranging from tournament location and golfer identification to shot distance, starting coordinates relative to the hole, strokes gained, and Boolean flags (0s and 1s) for things such as whether the ball is on the green or whether it is the first putt. Needless to say, there is a lot of data; 2018 alone recorded 1.17 million shots.


Methodology

The analysis used a local SQLite3 database to store the more than 18 million rows of data. Feature engineering (creating new variables) was used to determine how long courses played during tournaments and to build better features for identifying players. Python was used for all cleaning, wrangling, feature engineering, and analytics, with SQL queries run from within Python against the database.
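A minimal sketch of that workflow, with an illustrative file name, table name, and column names (the actual Shotlink schema is proprietary and not reproduced here):

```python
import sqlite3
import pandas as pd

# Connect to the local SQLite database holding the ShotLink rows.
# File, table, and column names below are assumptions for illustration.
conn = sqlite3.connect("shotlink.db")

query = """
    SELECT player_name, year, distance_to_pin_ft, holed
    FROM shots
    WHERE on_green = 1
      AND player_name IN ('Tiger Woods', 'Phil Mickelson')
"""
# Pull the putting rows for both players into a DataFrame for analysis.
putts = pd.read_sql_query(query, conn)
conn.close()
```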

2003 – 2017 PGA Tour Regular Season

Tiger and Phil are wizards around the green. There are many moments when each of them has made an incredible save that seemed to defy the odds. (Tiger fans will recall a late Sunday charge at the Valspar this year, when he drained a 40-plus-foot putt on the 17th hole to get within one of the lead.) But how do these two magicians of the green compare?

First, let’s look at their putting before the 2018 PGA Tour regular season. Using data on every putt they hit during these 14 years, we find that Phil had many more opportunities on the green, possibly because of a busier playing schedule, hitting more greens in regulation, missing more putts, or whatever else leads a player to putt more often (room for future analysis). Regardless, Tiger hit 13,647 putts to Phil’s 21,641.

[Figure: total putts attempted, Tiger vs. Phil, 2003–2017]

When we look at how well they did side by side, we see little deviation between them outside of 20 feet. The graph below plots every putt from the 2003 through 2017 seasons, taking the putts made and dividing them by the putts attempted at each one-foot interval. This gives us the probability that a putt is made from the distance shown on the horizontal axis. (Note: the sample is large enough that each player has at least one putt recorded from every foot inside 60 feet. Statistically, the line within 20 feet should be given more weight, since the majority of putts were attempted and made in this range.)

[Figure: putt make probability by distance, 2003–2017]
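The make-probability curve behind a plot like this can be computed with a simple group-by (a sketch reusing the illustrative `putts` frame from the query above):

```python
# One row per putt: player_name, distance_to_pin_ft, holed (0/1).
putts["dist_bucket"] = putts["distance_to_pin_ft"].round().astype(int)

# Putts made divided by putts attempted at each foot, per player.
make_prob = (
    putts.groupby(["player_name", "dist_bucket"])["holed"]
    .mean()
    .rename("make_probability")
    .reset_index()
)
```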

So if we can place more importance on putts within 20 feet, wouldn’t it be interesting to look at those putt probabilities within that distance?

[Figure: putt make probability within 20 feet, 2003–2017]

Once we zoom in closer, we notice that both players make the same proportion of putts from 3.5 feet and in. There is only one recorded miss from Tiger Woods at 1 foot in this 14-year window, while Phil missed a few one-footers over the same span. Tiger appears to gain the upper hand outside of 7 feet, with a 10 percent higher probability of making putts between 7 and 10 feet. Overall, I give the advantage to Tiger.

To compare each player to the rest of the field, the indicator that differentiates putting ability is “strokes gained.” The strokes gained formula measures a player’s putting against the rest of the field. For example, if the field putted poorly in a tournament and Tiger putted average, he would have gained strokes on the field through putting. Let’s look at strokes gained putting through the same process of isolating putts at each foot. This time, we will stay within 20 feet for the sample-size reasons explained earlier.

[Figure: strokes gained putting by distance within 20 feet, 2003–2017]
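As a rough illustration of the idea (not the Tour’s exact formula), strokes gained on a single putt can be thought of as the field’s expected number of putts from that distance minus the number the player actually took:

```python
# Illustrative strokes-gained-putting helper. `baseline` maps a distance (feet)
# to the field's average putts-to-hole from that distance; the 1.78 value for
# 15 feet is an assumed example, not an official Tour benchmark.
def strokes_gained_putting(distance_ft: int, putts_taken: int, baseline: dict) -> float:
    return baseline[distance_ft] - putts_taken

baseline = {15: 1.78}
print(strokes_gained_putting(15, 1, baseline))   # holing a 15-footer gains ~0.78
print(strokes_gained_putting(15, 2, baseline))   # two-putting loses ~0.22
```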

It appears that both Tiger and Phil outpace the field on strokes gained within 7 feet. Keep in mind that many tournaments are won and lost from this distance, and mental fortitude must be outstanding to lead in this category. Each player’s aggressive putting style may explain the lower-than-expected results from outside 10 feet.

2018 PGA Tour Regular Season

Since sample size is an issue for a single season, we will look at 2018 putts within 20 feet. Comparing Tiger’s performance to Phil’s, we see that Phil has the advantage. The advantage widens after 7 feet and is neutralized thereafter. Since Tiger only recently returned to competitive play, it could be that Phil was simply more comfortable on the greens during the 2018 season.

[Figure: putt make probability within 20 feet, 2018]

Let’s look at some other statistics. If we compare putts at exactly 15 feet, Tiger looks slightly better, with a 0.5 percent higher probability of making the putt. That said, this is decimal dust and carries little weight.

[Figure: make probability from 15 feet, 2018]

From 6 to 10 feet, Phil had a much better 2018 regular season, with a 10 percent higher probability of making a putt than Tiger.

[Figure: make probability from 6 to 10 feet, 2018]

So what about strokes gained putting in 2018? Throwing statistical caution out the door, let’s set aside the small sample size and see where they stack up against the field.

[Figure: strokes gained putting, 2018]

While Phil gains a consistent 0.03 to 0.04 strokes on the field, Tiger sits almost exactly at the tour average. Aside from some exceptional putting from 7 feet and in, Tiger failed to deliver on the greens in 2018. Remember, these are very small numbers and should be read as indicative of roughly average putting on tour.

[Figure: strokes gained putting within 20 feet, 2018]

Again, zooming in on 2018 regular season strokes gained putting from within 20 feet, we see good putting inside 7 feet from both players, but roughly average putting relative to the PGA Tour at large from outside 10 feet, especially for Tiger.

So who is the better putter? The analysis shows that Tiger was the better putter from 2003 to 2017, but Phil gained ground in 2018. Neither has had an astonishing putting career, but it is clutch putting in the heat of battle, when everything is on the line, that makes the difference and has truly separated Tiger and Phil from the rest of the Tour.


Global Lead Developer for Data Science at General Assembly

Matt sat down with Translating Nerd in a conference room at General Assembly, the Washington, DC data science and programming school. Matt teaches a recurring 12-week, full-time data science program that takes data novices and transforms them into employment-ready data scientists. He discusses the data science pipeline, machine learning procedures, and the sticking points students need to overcome.


Matt is currently a global lead instructor for General Assembly’s Data Science Immersive program, taught in ten cities across the U.S., and most enjoys bridging the gap between theoretical statistics and real-world insights. Matt is a recovering politico, having worked as a data scientist for a political consulting firm through the 2016 election. Before his work in politics, he earned his master’s degree in statistics from The Ohio State University. Matt is passionate about putting the revolutionary power of machine learning into the hands of as many people as possible. When he isn’t teaching, he’s thinking about how to be a better teacher, falling asleep to Netflix, and/or cuddling with his pug.

How to contact Matt:

Twitter: @matthewbrems