The Effective Data Science Resume

[Photo by Green Chameleon on Unsplash]

 

Writing an effective data science resume can be challenging. There are so many divergent signals from the industry about what good content looks like. Do I include the latest tech stack from this year's AWS re:Invent webinars? Do I put down all the Kaggle competitions I have completed? Do I include the Ames Housing regression model that I finished at a boot camp but never had any passion for? How do I show a future employer that I communicate well within and between teams?

The field of data science has many roles, and tailoring your resume to the specific function you want to play on a data science or analytics team is crucial. This post does not give step-by-step instructions for creating a unicorn-inspired resume; rather, it lays the groundwork for a resume that you can then customize to the position you are applying for.

 

Step 1: The “About Me” Intro

Start with a small blurb about yourself. This can be pithy and short. Include a strong message that highlights your passion for analytics but doesn't simply list your accomplishments. This helps set the theme for the resume. This is not the place to say that you are a “good communicator” or that you are “an inspiring individual.” If what you are writing sounds generic, don't include it. Putting that you excel in “interpersonal skills” is equivalent to saying that you know how to open Microsoft Word. Treat this as your mission statement, your raison d’être for being in analytics. Put some heart into this little opener!

 

Step 2: Education

[Photo by Roman Mager on Unsplash]

Don’t worry if you got an undergraduate degree unrelated to computer science, programming, or engineering. If your previous degree is not related to your current situation, include it anyway. It shows that you are well-rounded and can think outside the norms of statistical jargon. Some of these “unrelated” fields are even better suited to emerging areas in data science. Majors such as English literature are an asset when building natural language processing (NLP) models; parsing language and named entity recognition (NER) are the heartbeat of many machine learning systems built on unstructured text. If you studied political science, it shows that you can understand the societal boundaries we face when weighing the morality and bias within machine learning. If you studied gender studies, the new era of image detection and how we understand gender within computer vision will prove that major useful.

This is all to make the point that you don't need to have studied computer science to be an effective data scientist. Now, if you want to work at Facebook and help build out PyTorch's next neural network architecture, then this post isn't for you. But if you want to help bring data science practices into organizations that are inherently human-centered, such as optimizing product placement, predicting voter turnout via political polling, or engineering features to get people into the latest Mission Impossible thriller, then these liberal arts skills will come in handy.

 

Step 3: Technical Skills

[Photo by Shahadat Rahman on Unsplash]

Within the data science consulting world, especially amongst the Big Four, having technical skills laid out nice and neat in front of the recruiter is crucial. Include sections for programming languages, statistical knowledge, algorithms, cloud computing resources, and visualization tools. This helps recruiters zero in on what they are looking for: that elusive purple squirrel of a job applicant.

 

Step 4: Job Experience

[Photo by Tim van der Kuip on Unsplash]

Now is the time to shine! If you have years of data or analytics related job experience, be sure to bullet your accomplishments. Use the STAR (Situation, Task, Action, Result) method rather than simply listing the tools and techniques you used to complete a task. This shows initiative and a “go get ‘em” attitude. An example of a data science consultant leveraging the STAR method could be the following: “Helped a major airline client automate 10,000 email requests per day by using NLP LDA (Latent Dirichlet Allocation) techniques to automate the response process, resulting in a reduction of in-person hours by 30 percent.” This STAR approach shows that you had a situational grasp of the task at hand, the methods to complete it, and the business justification for your work.

Step 5: Projects

[Photo by Jeff Sheldon on Unsplash]

Personally, this is an area I see too much of on resumes. Often a candidate will have about one year of experience in an analytics-related field, but half the resume will be school or side projects. The school or side project is important, but it is not the reason for creating the resume. You want to show potential employers that you have been battle tested and already have the experience they desire. Now, you may be asking yourself, “But I am just finishing (fill in the blank) education program. How can I show experience?” If you were in an internship, see Step 4. If you were in a co-op, see Step 4. If you contributed to a volunteer organization, see Step 4. Remember, the resume is meant to give employers a signal that you have been there before.

On to projects. Include projects that are unique, those topics that you don’t see too often. A few projects that should NEVER be on a resume: Titanic dataset, Iris dataset, Ames or Boston Housing dataset. If you have done projects on those datasets, no matter how cool they were or how dope the visualization and dashboard you created, please don’t put them on there. The goal is to separate yourself from the pack, not to mirror the herd.

[Photo by Ian Schneider on Unsplash]

The following are the kinds of projects that are effective. If you had to go out and collect the data from an outside source, or had to do an immense amount of cleaning, include that. The end result can be as simple as summary statistics and a few bar charts. But if that dataset has not been analyzed before, maybe weather data from Timbuktu or baseball games at the one stadium you went to as a child, include it! These are what we refer to as “passion projects”: projects that, when you discuss them with others, let them see your excitement. I once knew a data analytics professional who recorded how he got to work each day. He did this for one year, jotting down the means of transportation (subway, bus, bike, walk), the weather (cloudy, sunny, snowy, rainy), and how much time it took him to get to the office. He made a basic Excel dashboard from this information. It was one of the coolest passion projects I have ever seen. Don't get me wrong, I am not saying to include Excel dashboards on your resume, but don't think that a passion project needs to win Kaggle competitions or use the newest machine learning algorithm out there.

One final note on projects. The best projects are the ones that follow the data science lifecycle. You identified a question, formed a hypothesis, collected the data, cleaned the data, created features, created a model, analyzed the results, visualized those results, and explained those results. That is what future employers are looking for. If you happened to push that model to production via a cloud service, then hats off to you my friend, because you just told an analytics story that most organizations yearn to have.

Step 6: Certifications

If you have any certifications, list them. I am talking about ones you had to take a test for; most of the time, these come from the platform vendors themselves: Tableau Desktop Associate, AWS Cloud Developer, Azure, GCP, IBM, etc. A trend I am seeing is people listing every Udemy or Coursera course they have taken. I am all for online learning platforms and frequently use Udemy when I need to learn something quickly. (“Hey, Python is where it's at. You know, you should take this course on Udemy, it's taught by a real data scientist.”) Heck, I used Udemy last week to learn how to enhance security on my AWS S3 bucket. But what can signal to a potential employer that you may not be experienced, in my humble opinion, is including that 2.5-hour Udemy Intro to NumPy course. My personal rule of thumb: if it separates you from others applying to the position, put it; if it doesn't, leave it off. The resume has limited space, and including what brings you to the top of the pile, rather than what helps you blend in, is paramount.

Step 7: Interpersonal Skills

Yes, I said it in the beginning: don't include this as a skill set. Rather than explicitly stating that you are an effective communicator or leader, prove it! If you have attended Toastmasters and given speeches through their program, that is one of the best ways to both improve your public speaking and signal that you continuously work at it.

Being an effective data scientist is about persuading others that your techniques and process are solid, and working on oration skills will help you get there. On a similar note, if you have presented at conferences, been a community leader of some sort, or volunteered at an organization, include that in a personal section at the end. It all shows the inherent humanity in your candidacy and your ability to be both a leader and a team player.

Last remarks

There is no one right way to write a resume, but there are many ways to make yourself seem like every other data science candidate out there. The goal is to show your uniqueness to an organization, to exude confidence and proven performance, even if you don't have a lot of experience. Companies want individuals who contribute to teams, so your resume should be the embodiment of that.

 

Photo credits:

Photo by Green Chameleon on Unsplash
Photo by Shahadat Rahman on Unsplash
Photo by Ian Schneider on Unsplash
Photo by William Moreland on Unsplash
Photo by Roman Mager on Unsplash

Extracting Stays from GPS Points

Contributing Author/Data Scientist: Brandon Segal

 

As many news outlets have reported over the past few weeks, there is a new push to leverage the wealth of location data captured from citizens' cellphones every day. Setting aside the ethics of whether that should be done, it raises an interesting question: what is the best way to find out where devices are dwelling?

GPS Data Overview


Knowing where devices dwell could help public agencies determine whether social distancing efforts are working, by showing locations with high densities of devices or by showing whether certain businesses are not observing the guidelines. To answer these kinds of questions, local governments could turn to data brokers who provide high-precision GPS data from millions of citizens who have apps on their phones with location permissions. This data is what is commonly referred to as spatiotemporal data. Spatiotemporal data is, as the name suggests, data that has both a time and a location associated with it and is used to describe moving objects or geometries that change over time.

Let’s look at an example of the spatiotemporal data your phones produce and send to these different data brokers.

Spatiotemporal Data Sample

On a map, this data would be a single point and would be pretty uninteresting by itself. However, when records are taken over a longer period we can begin seeing patterns emerge. Below is an example of a GPS trace from a University student in Seoul over two months. Notice how from these traces we can likely determine their common travel patterns, common dwell points, and outliers on the map caused by infrequent trips or device inaccuracies.

[Figure] GPS traces from a student at Yonsei University over two months, taken from an Android device (link)

 

Extracting Stay Points

Plotting these traces on a map makes it easy to draw general qualitative trends, but it is still difficult to measure where a device spends its time, whether in a specific coffee shop, a mall, or even a census block. Grouping these unlabeled data points into logical groups calls for a clustering algorithm that can group points based on their density. These kinds of algorithms are called density-based clustering algorithms, with some examples being:

  • DBSCAN: A density-based algorithm based on the distance between points and a minimum number of points necessary to define a cluster
  • OPTICS: A variant of DBSCAN with a hierarchical component to find clusters of varying densities
  • DJ-Cluster: Standing for Density-Join Clustering Algorithm, a memory-sensitive alternative to DBSCAN

Each of the algorithms above has its own benefits but they all depend on the following:

 
  • Having a defined distance metric to measure how far the points are from each other
  • Having a method for deciding whether a group of points can be considered a cluster

Each of these algorithms can be used with spatial data to find groups of points, but the main challenge comes when the temporal component is introduced. The distance can no longer be defined by just the Euclidean distance between the x and y components; you also have to take into account how far apart points are in time. While working with spatiotemporal data I have used several clustering methods, but the one that worked best with GPS data derived from phones was one called D-StaR [1].
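To make the spatial-only case concrete before adding time, here is a minimal sketch (not production code) that clusters raw GPS points with scikit-learn's DBSCAN using a haversine metric, so that ϵ can be expressed in meters. The lat/lon column names and threshold values are assumptions for illustration, and note that it ignores the temporal component entirely, which is exactly the limitation described above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def spatial_clusters(df: pd.DataFrame, eps_m: float = 100, min_samples: int = 5) -> pd.DataFrame:
    """Label each GPS point with a spatial cluster id (-1 = noise).

    Assumes `df` has `lat` and `lon` columns in decimal degrees.
    The haversine metric works on radians, so eps is converted from meters.
    """
    coords = np.radians(df[["lat", "lon"]].to_numpy())
    labels = DBSCAN(
        eps=eps_m / EARTH_RADIUS_M,   # meters -> radians
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    ).fit_predict(coords)
    return df.assign(cluster=labels)

# Example with a handful of made-up points around Washington, DC
points = pd.DataFrame({
    "lat": [38.8895, 38.8896, 38.8894, 38.9072, 38.9073],
    "lon": [-77.0353, -77.0352, -77.0355, -77.0369, -77.0370],
})
print(spatial_clusters(points, eps_m=150, min_samples=2))
```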

D-StaR stands for the Duration-based Stay Region extraction algorithm, developed by Kyosuke Nishida, Hiroyuki Toda, and Yoshimasa Koike for extracting arbitrarily shaped stay regions while ignoring outliers. The benefits this algorithm has over others include:

  • It allows the points to be sampled more irregularly than algorithms that depend on a minimum number of points to define a cluster
  • It has computational and memory benefits over algorithms that require computing the distance from one point to every other point
  • It has the potential to handle streaming data and emit stay regions in a more event-driven application

D-StaR expects the input data to be sorted by time from beginning to end. After sorting, the algorithm takes a few input parameters:

  • ϵ: A spatial distance that a point has to be within to be considered a spatial neighbor of a core point (measured in meters)
  • q: A window size that determines the number of points before and after a core point in which to look for spatial neighbors (unitless)
  • t: A temporal threshold for considering a merged cluster a stay region (measured in seconds)
[Figure] Example of a cluster determined by D-StaR for a distance threshold ϵ, sliding window size q = 3, and duration threshold for a core point t = 15. Point pi is denoted (i, ti) in the figure.
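Ahead of the implementation post, here is a rough sketch of the sliding-window idea, with hypothetical field names and thresholds; it is a simplification, not a faithful reproduction of the published D-StaR algorithm. A point is treated as a core point when its spatial neighbors within a window of q points on either side span at least t seconds, and overlapping core neighborhoods are merged into stay regions.

```python
import math
from dataclasses import dataclass

@dataclass
class Point:
    lat: float   # decimal degrees
    lon: float   # decimal degrees
    ts: float    # unix seconds; the input list must be sorted by ts

def haversine_m(p: Point, q: Point) -> float:
    """Great-circle distance between two points in meters."""
    r = 6_371_000
    phi1, phi2 = math.radians(p.lat), math.radians(q.lat)
    dphi = math.radians(q.lat - p.lat)
    dlmb = math.radians(q.lon - p.lon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def stay_regions(points, eps_m=50, q=3, t_sec=900):
    """Return lists of point indices, one list per stay region (a D-StaR-like sketch, not the paper's algorithm)."""
    n = len(points)
    core_neighbors = []
    for i in range(n):
        lo, hi = max(0, i - q), min(n, i + q + 1)
        nbrs = [j for j in range(lo, hi) if haversine_m(points[i], points[j]) <= eps_m]
        # a point is "core" here if its nearby-in-time, nearby-in-space neighbors span enough time
        span = points[max(nbrs)].ts - points[min(nbrs)].ts
        core_neighbors.append(set(nbrs) if span >= t_sec and len(nbrs) > 1 else None)

    regions, current = [], set()
    for i in range(n):
        if core_neighbors[i] is None:
            continue
        if current and not (current & core_neighbors[i]):
            regions.append(sorted(current))   # close the previous region
            current = set()
        current |= core_neighbors[i]          # merge overlapping core neighborhoods
    if current:
        regions.append(sorted(current))
    return regions
```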

 

Using this algorithm, I found that F1 scores for finding actual stay regions improved by 20 percent on data derived from phones. I invite you to read the paper linked above that introduces D-StaR, as that helps support this kind of academic research. In the next post, I will go over how to implement this algorithm in Python and how we can use it in a data processing pipeline.

About the author:

Brandon Segal is a software engineer currently working on building enterprise machine learning infrastructure, with a strong focus on data engineering and real-time inference. Previously, Brandon worked in data visualization and as a geospatial data scientist, where he learned about the interesting challenges of building data-intensive applications. Located in the DMV area and originally from the Philly area, he loves spending time with his wife, volunteering in Arlington, and continuing to learn.

Github: https://github.com/brandon-segal

Email: brandonsegal.k@gmail.com

 

Photo credit: Paul Hanaoka on Unsplash


PGA Tour Analytics: Accuracy Off the Tee


 

We Looked at Distance

In the previous analysis of distance, we found a sharp uptick in distance gained off the tee after the introduction of the Pro V1 golf ball. In the plot below, the red line indicates when the Pro V1 was introduced in October 2000.

[Figure: average driving distance by year; the red line marks the Pro V1's introduction in October 2000]

Now Enter Accuracy


But what has happened to accuracy off the tee during this same time period? Again, using data derived from Shotlink and scraped from the PGA Tour's public-facing website with Python, we have collected information at the tournament-week level for every player to have made the cut since 1980. With this dataset, we have added another piece to the puzzle of an eventual model that can help us determine the most important features of the modern Tour player's success. (Note: the year 2005 is missing from this dataset due to issues scraping that particular year. All calculations impute values for 2005 from the 2004 and 2006 data.)

Leading up to 2000, technology helped Tour players find the fairway off the tee. In fact, the trend from 1980 to 1999 is a story of increased accuracy off the tee. The sharp decrease occurred immediately when golf balls started flying further.

[Figure: driving accuracy (percentage of fairways hit) by year, 1980–2020]

The narrative of distance over accuracy becomes apparent when we view distance and accuracy off the tee together. On average, Tour players got longer at the cost of accuracy. The relationship holds for players who won during the week, if not more strongly: Tour winners show an even more exaggerated drop in the percentage of fairways hit off the tee, while their distance exceeds that of the average Tour player.

[Figure: driving distance and accuracy over time, average Tour players vs. tournament winners]

Forming Relationships

Let’s now take a look at the correlation between driving distance and accuracy. Taking each player that has made the cut in a Tour-sanctioned event since 1980, roughly 112,619 observations, we can plot distance against accuracy. Each blue dot represents a player, which also lets us view the distribution of accuracy and the distribution of distance on the top and right axes. More importantly, this combined scatterplot lets us see the relationship between distance and accuracy. Using the Pearson correlation coefficient, we can measure the linear co-movement of these two variables. Simply put, how do they move in relation to each other? For an additional yard of distance, what decrease in accuracy can we infer?

And for the stats nerds out there, the equation for your enjoyment.

[Image: Pearson correlation coefficient formula]
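In standard notation, the sample Pearson correlation between distance x and accuracy y is:

r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}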

The following scatterplot highlights the average correlation coefficient between 1980 and 2020 of -0.27, meaning that, on average, each additional yard is associated with a 0.27 percent drop in accuracy. For easier interpretation, an increase of 10 yards would come with a decrease in accuracy of 2.7 percent.

[Figure: scatterplot of driving distance vs. driving accuracy for all observations, with marginal distributions]

Now, this is all on average, and it is difficult to assume that 1980 looks like 2020. Running the numbers for 1980, the Pearson coefficient was -0.24, while for 2020 it was -0.33. What would be interesting to see is these coefficients over time.

In the scatterplot below, a Pearson coefficient was calculated for each year and then plotted over time. A linear trend line demonstrates that while there were fluctuations between years, the overall story is that players have been giving up accuracy as they get longer.

[Figure: yearly Pearson correlation between driving distance and accuracy, 1980–2020, with a linear trend line]
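For anyone who wants to reproduce this kind of plot, here is a minimal sketch assuming a pandas DataFrame with hypothetical year, distance, and accuracy columns (one row per player-week); the actual scraping and analysis code lives in the GitHub repository linked below.

```python
import pandas as pd

def yearly_correlation(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation between distance and accuracy, computed per year.

    Assumes one row per player-week with 'year', 'distance' (yards),
    and 'accuracy' (% of fairways hit) columns.
    """
    return (
        df.groupby("year")
          .apply(lambda g: g["distance"].corr(g["accuracy"]))  # Pearson by default
          .rename("pearson_r")
    )

# Example usage:
# r_by_year = yearly_correlation(df)
# r_by_year.plot(style="o")            # coefficients over time
# print(r_by_year.loc[[1980, 2020]])   # compare the endpoints
```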

For example, in 1980, a player gave up approximately 2.5 percent in fairway accuracy for each additional 10 yards gained. By 2020, a player gives up almost an additional 1 percent of accuracy off the tee to gain those same 10 yards. This makes sense when we think about it for a moment: players hit it longer, and a 5-degree miss with the driver will be farther offline at 300 yards than at 250 yards. It is simple geometry; the farther a ball travels along a line that diverges from the target line, the farther it ends up from that target line.

While none of these findings are earth-shattering, my hope is that through iterations of exploring these PGA statistics, a meager contribution to the golf analytics community can be made.

As always, the code used in this analysis is available at the author's GitHub repository: https://github.com/nbeaudoin/PGA-Tour-Analytics. The author can be found on LinkedIn at https://www.linkedin.com/in/nicholas-beaudoin-805ba738/

 

Image sources:

https://www.wallstreetmojo.com/pearson-correlation-coefficient/

https://www.golfdiscount.com/blog/fun-facts/2018-pga-tour-driving-statistics/#prettyPhoto/0/

https://www.pgatour.com/news/2019/05/11/nine-things-to-know-pga-championship-bethpage-black.html

 

 


PGA Tour Driving Distance Analytics

Distance Debate

The USGA’s February 2020 distance report has riveted the professional golf community. Calls by top players for a bifurcation between the ball played at the Tour level and an amateur golf ball have created a rift that will most likely play out in the courts as ball manufacturers battle to protect their patented technology. The distance debate centers on courses becoming unplayable for the modern PGA Tour professional due to increasing distance off the tee. Holes such as the 13th at Augusta National, once an iconic par 5 and now merely a drive, pitch, and putt, have necessitated the lengthening of courses and the purchase of additional property to accommodate the added length. Both environmental concerns and the degradation of strategy on courses such as Riviera, host of the 2028 Olympics, are painstakingly debated. The following analysis helps paint the narrative of how we got here.


 

Enter Pro V1: King of the Golf Balls

October 11, 2000, was a day that rocked the golf world. On that day, the Pro V1 entered its first round of tournament play at the Invensys Classic in Las Vegas. The “Professional Veneer One,” a.k.a. the Pro V1, was a solid-core ball that displaced the wound technology of the traditional golf ball. Tour players immediately began to turn to the Pro V1 as their gamer of choice for its distance benefits. Not wanting to be left behind, Nike developed its own version of the urethane golf ball, the Nike Tour Accuracy, which Tiger Woods immediately put into rotation. Throughout the early 2000s, it is said that distance gains were dramatic. But just how much did distance increase during this time period?


The following analysis is based on data scraped from the PGA Tour's public-facing website. Using Python to scrape the data from the site's HTML, it provides a unique glimpse into 2,022 PGA Tour players and how the PGA Tour derives its data from Shotlink (see my former article on Shotlink analytics). Since Shotlink no longer authorizes academic usage of its data, it has become more challenging for researchers of the game to investigate pressing questions at the top of the game. My hope is that the following analysis helps contribute to a more analytical narrative of golf's top performers.

 

The Numbers

When looking at the average driving distance from 1980 through February 2020, it is obvious that the mean distance has increased. In fact, since the introduction of the Pro V1, the average distance on tour has increased by 18.17 yards.

[Figure: average PGA Tour driving distance by year, 1980–February 2020]

Looking at the year-over-year change after the introduction of the Pro V1, we see a phenomenal spike in distance; 2000 and 2002 saw the biggest jumps in distance off the tee. As each company followed suit with its own answer to Titleist's Pro V1, it can be hypothesized that the game-changing performance of PGA Tour players off the tee rose dramatically because of this technology change.

[Figure: year-over-year change in average driving distance]

To take a deeper look at the spike in driving distance, we examined the distribution of distance off the tee between 1999 and 2003. The assumption is that since the Pro V1 was introduced in late 2000, competing ball manufacturers adopted the urethane cover and had their Tour players game the same technology. This four-year span covers the technology's adoption across Tour members and, given the large sample size, should reveal a statistically significant difference.

[Figure: distributions of driving distance, 1999 vs. 2003]
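The year-over-year changes and the 1999-versus-2003 comparison can be sketched in a few lines of pandas and SciPy. The column names here are assumed, and a Welch t-test stands in for a formal significance check:

```python
import pandas as pd
from scipy import stats

def distance_summary(df: pd.DataFrame):
    """Year-over-year change in average driving distance, plus a 1999 vs. 2003 comparison.

    Assumes one row per player-week with 'year' and 'distance' (yards) columns.
    """
    avg_by_year = df.groupby("year")["distance"].mean()
    yoy_change = avg_by_year.diff()                      # change from the prior year

    d_1999 = df.loc[df["year"] == 1999, "distance"]
    d_2003 = df.loc[df["year"] == 2003, "distance"]
    t_stat, p_value = stats.ttest_ind(d_2003, d_1999, equal_var=False)  # Welch's t-test

    return yoy_change, d_2003.mean() - d_1999.mean(), p_value
```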

 

What’s Next?

While this analysis merely provides a summary glimpse into how driving distance has changed over the past 20 years, it is important to keep in mind that before the Pro V1 was introduced, driving distance was already increasing at an unprecedented rate. The debate rages on over whether to scale back the current ball played on Tour. Further analysis is warranted to determine how much of an impact the modern solid-core ball had on player performance. That analysis calls for an impact evaluation leveraging multivariate regression, coupled with checks for statistical robustness via parallel trends before and after the Pro V1's introduction, including a difference-in-differences estimation strategy.

 

Methodology and Code

Data analytics code and notebooks created are available at the author’s GitHub:

https://github.com/nbeaudoin/PGA-Tour-Analytics

 

Data sources: 

http://www.pgatour.com

Images sources:

2015 Titleist Pro V1 ball review

https://www.brandsoftheworld.com/logo/pga-tour-4

 

 

 


The Match: Tiger vs. Phil on the Green

*** Legal Note: Written permission has been granted by CDW to use the Shotlink database for academic means. Therefore, no profits are gained from the posting of this analysis. ***

The Match

Tiger Woods and Phil Mickelson have always been on the fan wish list for a Sunday showdown through the final stretch of a major championship. Surprisingly, each player has captured the drama and nail-biting finishes of major championship wins without the other nipping at his heels. Craving more Tiger-Phil action, the golf gods (a.k.a. PPV TV) have presented an opportunity to watch these golf superstars go head-to-head in a made-for-TV affair. “The Match” goes down on Thanksgiving Day in Las Vegas at Shadow Creek and is an 18-hole match-play event (winner take all on each hole) for $9 million. While the Vegas odds have Tiger favored at -220 to Phil's +180, these numbers can change as golf fans dive deeper into their recent Ryder Cup failures.


The purpose of the below analysis is to take a peek at one component of Tiger and Phil’s game. Each week leading up to “The Match” I hope to present a piece of the puzzle that will help golf fans and analysts alike understand the driving factors for each golfer’s success. This week we focus on where most of the drama will occur: the putting green.

Data Collection

The data used for this analysis comes from CDW's PGA Tour Shotlink database. (You can read about Shotlink here.) This database is a proprietary collection of every golf shot hit on the PGA Tour regular season from 2003 to 2018. Each shot records 46 different variables, ranging from tournament location, golfer identification, and shot distance to starting coordinates relative to the hole, strokes gained, and Booleans (0s and 1s) for whether a player is on the green and whether it is the first putt. Needless to say, there is a lot of data. In fact, 2018 alone recorded 1.17 million shots.


Methodology

The analysis utilized a local SQLite3 database that stored the more than 18 million rows of data. Feature engineering (creating new variables) was used to determine how long courses played during tournaments and to create better features for identifying players. Python was used for all cleaning, wrangling, feature engineering, and analytics, while SQL queries were run from within Python to pull from the database.
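As a rough illustration of that setup, queries can be run against a local SQLite file directly from Python. The file name, table, and column names below are hypothetical stand-ins, not the actual Shotlink schema.

```python
import sqlite3
import pandas as pd

# Hypothetical database file and schema, for illustration only
conn = sqlite3.connect("shotlink.db")

query = """
    SELECT player_name,
           distance_to_hole_ft,
           holed            -- 1 if the putt was made, else 0
    FROM   shots
    WHERE  shot_type = 'putt'
      AND  season BETWEEN 2003 AND 2017
      AND  player_name IN ('Tiger Woods', 'Phil Mickelson')
"""
putts = pd.read_sql_query(query, conn)
conn.close()
print(putts.head())
```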

2003 – 2017 PGA Tour Regular Season

Tiger and Phil are wizards around the green. In fact, there have been many times when each of them has made incredible saves that seemed to defy the odds. (Tiger fans will recall a late Sunday charge at the Valspar this year, when he drained a 40-plus-foot putt on the 17th hole to get within one of the lead.) But how do these two magicians of the green compare?

First, let's take a look at their putting before the 2018 PGA Tour regular season. Using data on every putt they hit during these 14 years, we find that Phil had many more opportunities on the green, possibly from a busier playing schedule, being on the green in regulation more often, missing more putts, or whatever else makes someone hit the ball more often (room for analysis in the future). Regardless, Tiger hit 13,647 putts to Phil's 21,641.

[Figure: total putts recorded, Tiger vs. Phil, 2003–2017]

When we look at how well they did side by side, we see little deviation from one another outside of 20 feet. The graph below covers every putt from the 2003 through 2017 seasons, taking the putts made and dividing them by the putts attempted at each one-foot interval. This gives us the probability that a putt will be made at each distance on the horizontal axis. (Note: there is a large sample size, so each player has at least one putt recorded from every foot inside 60 feet. Statistically, the line within 20 feet should be given more weight, since the majority of putts were attempted and made in this range.)

[Figure: putt make probability by distance in feet, Tiger vs. Phil, 2003–2017]
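The make-probability curve itself is just a grouped ratio. Here is a minimal sketch, assuming the hypothetical putts DataFrame from the query above, with player_name, distance_to_hole_ft, and holed columns.

```python
import numpy as np
import pandas as pd

def make_probability(putts: pd.DataFrame) -> pd.DataFrame:
    """Probability of holing a putt at each one-foot distance interval, per player."""
    putts = putts.assign(distance_ft=np.floor(putts["distance_to_hole_ft"]).astype(int))
    return (
        putts.groupby(["player_name", "distance_ft"])["holed"]
             .mean()                      # putts made / putts attempted
             .unstack("player_name")
    )

# Example usage:
# probs = make_probability(putts)
# probs.loc[:20].plot()   # make probability inside 20 feet
```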

So if we can place more importance on putts within 20 feet, wouldn’t it be interesting to look at those putt probabilities within that distance?

[Figure: putt make probability within 20 feet, Tiger vs. Phil, 2003–2017]

Once we zoom in closer, we notice that both players make the same proportion of putts from 3.5 feet and in. There is only one recorded miss from Tiger Woods at 1 foot in this 14-year window, while Phil missed a few one-footers during this time as well. Tiger appears to gain the upper hand once we move outside of 7 feet, with a 10 percent higher probability of making putts between 7 and 10 feet. Overall, I give the advantage to Tiger.

The indicator that will help us compare putting ability against the rest of the field is called “strokes gained.” The strokes gained formula compares a player's putting to the rest of the field; for example, if the field putted poorly in a tournament and Tiger putted average, then he gained strokes on the field through putting. Let's look at strokes gained putting through the same process of isolating putts at each foot. This time, we will stick to putts within 20 feet for the sample size reasons explained earlier.

[Figure: strokes gained putting by distance within 20 feet, 2003–2017]

It appears that both Tiger and Phil outpace the field on strokes gained within 7 feet. Keep in mind that many tournaments are won and lost within this distance and mental fortitude must be outstanding to lead in this category. Each player’s aggressive putting preference may yield the lower than expected results from outside 10 feet.

2018 PGA Tour Regular Season

Since sample size is an issue for a single season, we will look at 2018 putts within 20 feet. Comparing Tiger's performance to Phil's, we see that Phil has the advantage. This advantage widens after 7 feet before neutralizing at longer distances. Since Tiger only recently began to play competitively again, it could be that Phil was more comfortable on the greens during the 2018 season.

[Figure: putt make probability within 20 feet, Tiger vs. Phil, 2018]

Let's take a look at some other statistics. If we compare putts at exactly 15 feet, it looks like Tiger is slightly better, with a 0.5 percent higher probability of making the putt, although this is really decimal dust and carries little weight.

[Figure: make probability at exactly 15 feet, 2018]

From 6 to 10 feet, Phil had a much better 2018 regular season with a 10 percent higher probability of making a putt than Tiger.

[Figure: make probability from 6 to 10 feet, 2018]

So what about strokes gained putting in 2018? Well, throwing statistical caution out the door, let's forget about small sample sizes and see where they stack up against the field.

[Figure: strokes gained putting vs. the field, 2018]

While Phil has a consistent 0.03 to 0.04 strokes gained on the field, Tiger is almost at the Tour average. Aside from some exceptional putting from 7 feet and in, Tiger failed to deliver on the greens in 2018. Remember, these are very small numbers and should be read as roughly average putting on Tour.

[Figure: strokes gained putting within 20 feet, 2018]

Again, if we zoom in and look at 2018 regular season putting from within 20 feet for strokes gained, we see good putting from within 7 feet for both players, but almost average putting compared to the PGA Tour at large when putting from distances outside 10 feet, especially Tiger.

So who is the better putter? The analysis shows that Tiger was the better putter from 2003 to 2017, but Phil gained traction in 2018. Neither has had an astonishing putting career, but it is clutch putting in the heat of battle, when everything is on the line, that makes the difference and has truly separated Tiger and Phil from the rest of the Tour.


Global Lead Developer for Data Science at General Assembly

Matt sat down with Translating Nerd in a conference room at General Assembly, the data science and programming school in Washington, DC. Matt teaches a recurring 12-week, full-time data science program that takes data novices and transforms them into employment-ready data scientists. He discusses the data science pipeline, machine learning procedures, and sticking points that students need to overcome.


 

Matt is currently a global lead instructor for General Assembly's Data Science Immersive program in ten cities across the U.S. and most enjoys bridging the gap between theoretical statistics and real-world insights. Matt is a recovering politico, having worked as a data scientist for a political consulting firm through the 2016 election. Prior to his work in politics, he earned his master's degree in statistics from The Ohio State University. Matt is passionate about putting the revolutionary power of machine learning into the hands of as many people as possible. When he isn't teaching, he's thinking about how to be a better teacher, falling asleep to Netflix, and/or cuddling with his pug.

How to contact Matt?

Twitter: @matthewbrems

Should I call my mom?: Machine Learning using NLP for non-nerds

My mom really likes to write. She writes emails that would make folks in college English departments award her valedictorian status. Now, I know this makes me seem like a bad son, but there are times when I would like an indicator on her emails that tells me how urgently I need to respond. When there is great news, I can respond over the weekend with a phone call, but if something is upsetting, I need a red flag to pop up and let me know to respond by phone, or else my father is going to start texting me, “Dude, call your mom.” (Preface: see the image below; I think everyone should be calling their mom when asking this question. But that does not make for a convincing data science blog.)

[Image: “Call your mom”]

Well, natural language processing (NLP) offers a solution. In fact, NLP is at the forefront of industry data science for its ability to handle unstructured (text) data. Once-arduous tasks like combing through piles of PDFs and reading long emails have been replaced by techniques that automate this time-consuming work. We have spam detectors that filter out suspicious emails based on text cues, predicting with astonishing accuracy which emails are spam and which are notes from grandma. But what the data science community lacks is the “Mom Alert.”


To understand the complexities that NLP has to offer, let’s break down the “Mom Alert.” First, we need a corpus (collection) of past emails that my mom has sent me. These need to be labeled by hand as “upset” mom and “happy” mom. Once I have created labels for my mom’s historical emails, I can take those emails and break them down into a format that the computer algorithm can understand.

But first, I need to separate these emails into two sets of data: one with known labels of upset or happy, and one without labels that I want to predict. The set with labels will be called the training emails, and the set without labels will be called the testing emails. This is important in the NLP process because it will help me build a model that generalizes, that is, one that better predicts future emails.


Training and Testing Data

It is important to note that I will want to pre-process my data. That is, I want to make sure that all the words are lowercase, because I don't want my model to treat capitalized and lowercase versions of the same word as different words. I will then remove words that occur frequently in English but offer little value. These are called “stop words” and usually carry little semantic weight: “and,” “but,” “the,” “a,” and “an” give little meaning to the purpose of a sentence and can all go. I will also remove punctuation because it is not going to give my model useful information. Finally, I will take any word that is plural and bring it down to its singular form, thus shoes to shoe and cars to car. The point of all this pre-processing is to reduce the number of words that I need to have in my model.
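Here is a minimal sketch of that pre-processing in plain Python. The stop-word list is a tiny stand-in, and the crude strip-the-trailing-s rule stands in for proper lemmatization; a real pipeline would lean on a library such as NLTK or spaCy.

```python
import string

STOP_WORDS = {"and", "but", "the", "a", "an"}  # tiny stand-in for a real stop-word list

def preprocess(email: str) -> list[str]:
    """Lowercase, strip punctuation, drop stop words, and crudely singularize."""
    text = email.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = []
    for word in text.split():
        if word in STOP_WORDS:
            continue
        if word.endswith("s") and len(word) > 3:   # naive stand-in for lemmatization
            word = word[:-1]
        tokens.append(word)
    return tokens

print(preprocess("The shoes and the cars were GREAT, but the weather was not!"))
# ['shoe', 'car', 'were', 'great', 'weather', 'was', 'not']
```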

Now that I have separated the emails into training and testing sets and pre-processed the words, I need to put those emails into a format that my computer algorithm can understand: numbers. This is the “vectorization” process, where we create a document-term matrix. The matrix is simply a count of the words in a document (mom's email) and the number of times each word occurred. This matrix is then used to compare documents against one another. The reason we pre-processed the words is that otherwise the vectorization process would result in a massive, extremely clumsy matrix.

[Figure: example document-term matrix (aka vectorization)]

In the document-term matrix, each email is displayed as its own row. The matrix then takes each unique word across ALL of the training emails and places it in a column. These columns are called features, and they can make the matrix incredibly wide: think of every unique word in thousands of emails. That is one wide matrix, which is the reason we did pre-processing in the first place! We want to reduce the number of columns, or features, a process known as dimensionality reduction. In other words, we are reducing the numbers that our algorithm needs to digest.
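scikit-learn's CountVectorizer builds exactly this kind of document-term matrix. A small sketch with made-up emails (the sample text is hypothetical, not my mom's actual writing):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "So proud of you, love you",
    "Call me right now, I am upset",
    "Great news about the new job",
]

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
dtm = vectorizer.fit_transform(emails)          # sparse document-term matrix

# One row per email, one column per unique word
print(pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out()))
```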

[Image: the wrong kind of matrix]

Now that I have my emails represented as a matrix, I can create an algorithm to take those numerical representations and convert them into a prediction. Recall that a previous post introduced Bayes' Theorem (see Translating Nerd's post on Bayes). We could use Bayes' Theorem to create a predicted probability that my mom is upset. We will call upset = 1 and happy = 0.

Side note: I know this seems pessimistic, but my outcome variable is going to be the probability that she is upset, which is why we need to constrain our algorithm between zero and one. Full disclosure: my mom is wonderful and I prefer her happy.


Now, there are other algorithms that could be used, such as logistic regression, support vector machines, and even neural networks, but let's keep this simple. Actually, the first email spam detectors used Naïve Bayes because it works so darn well with large numbers of features (words, in our case). But “naïve,” what does that mean, you may ask? The model makes the naïve assumption that the words are unrelated to each other. We know this cannot be strictly true, because words take on meaning when coupled with each other; that is how we create sentences. Of course, every algorithm has drawbacks, but Naïve Bayes proves to be quite accurate with large numbers of features (i.e., words).

Once we have implemented our Naïve Bayes model on the document-term matrix, we can make a prediction for each email in the test set. The test set acts as a validation of the training set, which allows us to make changes to our model and get as close as possible to a generalized model for detecting an upset email from mom. A key tenet of machine learning is to create a model that doesn't just fit our training data, because we need it to generalize to new data; fitting the training data too closely is called overfitting and should be avoided at all costs. Of course, there is a trade-off with underfitting that needs to be balanced, but again, I digress (see image below).

[Figure: overfitting vs. underfitting (machine learning basics for future posts)]

Once we have tuned our Naïve Bayes algorithm to both fit the training emails and generalize well enough to future emails in the test set, we are ready to test it out on a new email from mom. When mom sends us a new email, our algorithm will output a predicted probability. Let’s say that any email that has a predicted probability of 50 percent or more (0.5) will be called upset (1) and any predicted probability that is under 50 percent will be called happy (0).

[Figure: predicted probabilities for four new emails from mom]

If we look at the new emails above, run (without labels) through our Naïve Bayes algorithm, the predicted probabilities, measured against our 0.5 threshold, yield classifications of happy, upset, upset, and happy. It looks like I will have some calls to make this evening!
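For anyone who wants to see the whole flow end to end, here is a minimal sketch with a toy, made-up corpus standing in for mom's real emails: vectorize, fit a Naïve Bayes model, check it on a held-out test set, and then score a new email against the 0.5 threshold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: 1 = upset mom, 0 = happy mom (made-up examples)
emails = [
    "so proud of you sweetheart",           # 0
    "call me immediately we need to talk",  # 1
    "loved the photos from your trip",      # 0
    "i am very upset you did not call",     # 1
    "great news about grandma",             # 0
    "why have you not responded to me",     # 1
]
labels = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.33, random_state=42, stratify=labels
)

vectorizer = CountVectorizer(stop_words="english")
model = MultinomialNB()
model.fit(vectorizer.fit_transform(X_train), y_train)

print("test accuracy:", model.score(vectorizer.transform(X_test), y_test))

# Score a brand-new email against the 0.5 threshold
new_email = ["please call me tonight, I am upset about the garden"]
prob_upset = model.predict_proba(vectorizer.transform(new_email))[0, 1]
print("call mom now" if prob_upset >= 0.5 else "a weekend call is fine", prob_upset)
```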

 

 

Images sources:

https://datascienceomar.wordpress.com/2016/07/04/classification-with-scikit-learn/
http://www.cs.nthu.edu.tw/~shwu/courses/ml/labs/08_CV_Ensembling/08_CV_Ensembling.html
https://www.includehelp.com/ml-ai/data-splitting.aspx
https://lespetitesgourmettes.com/recipes/friday-funnies-8/attachment/call-your-mom/
https://logobaker.ru/logo/3407-reazon.html
https://kultura.zpravy.idnes.cz/filmove-novinky-zari-2017-0sa-/filmvideo.aspx?c=A170912_120514_filmvideo_ts

 

 

 

 


World Bank to Front End Developer: Coffee with Andres Meneses

A couple of months ago, Andres Meneses and I sat down at a busy Adams Morgan cafe in the heart of Washington, DC to discuss his success as a frontend developer. What makes his story remarkable is how he navigated from a successful long-term position at the World Bank to a coding boot camp to gain a foothold in the web development world. His background is one of the many success stories popping up of mid-career individuals making the jump into the world of data science and technology development.


 

Bio: Andrés Meneses is the proud owner of the happiest dog in Washington, DC and a passionate pro-butter advocate. He is also a web developer, committed first and foremost to optimizing user experience. He involves users from the outset of all projects because, as he likes to put it, “There is nothing worse than working hard on a digital product that no one ever uses!” By leveraging his combined expertise in product and project management and digital communications, Andrés approaches his work by thinking broadly. Why and what does this organization have to share, and how will that information engage, inform, surprise, and help the intended audience? And how will the information best deliver tangible outcomes? Throughout his time working in all types of organizations, most notably his more than 10 years at the World Bank as a technical project manager, he never stopped learning, and he made the leap a few years ago into full-time web development. Just like his dog, he has never been happier.

 

Three websites Dr. Nicholas uses to keep current:

Contact information:

 


Interview with General Assembly Data Scientist: Dr. Farshad Nasiri

Dr. Farshad Nasiri is the local instructor lead for the Data Science Immersive (DSI) program at General Assembly in Washington, DC. He received his B.S. from Sharif University of Technology in Tehran and his Ph.D. in mechanical engineering from George Washington University, where he applied machine learning tools to predict air bubble generation on ship hulls. Prior to joining General Assembly, he worked as a computational fluid dynamics engineer and a graduate research assistant. As the DSI instructor, he delivers lectures on the full spectrum of data science-related subjects.


Farshad is interested in high-performance computing and the implementation of machine learning algorithms in low-level, highly scalable programming languages such as Fortran. He is also interested in data science in medicine, specifically preventive care through data collected by wearable devices.

Favorite Data Books:

Favorite websites:

Contact info:

https://www.linkedin.com/in/farshad-nasiri/


Machine Learning Post-GDPR: My conversation with a Spotify chatbot

I have a dilemma. I love Spotify and find their recommendation engine and machine learning algorithms spot on (no pun intended). I think their suggestions are fairly accurate to my listening preferences. As they should be; I let them take my data.


Spotify has been harvesting my personal likes and dislikes for the past 6 years. They have been collecting metadata on what music I listen to, how long I listen to it, what songs I skip and what songs I add to my playlists. I feel like I have a good relationship with Spotify. Then came the emails. Service Term Agreements, Privacy Changes, Consent Forms. Something about GDPR.

Which brings me to the topic of discussion: what is GDPR, and how will services that rely on our metadata to power algorithms and machine learning predictions function in a post-GDPR world? First, let's backtrack for a second. Over the past week, most people's inboxes have looked like this…

[Image: an inbox full of privacy policy update emails]

Or you feel like you have won an all-expenses-paid “phishing” trip for your Gmail account.

[Image: a phishing email]

Nope, your email hasn't been hacked, but something monumental has occurred in the data world that has been years in the making. Two years ago, the European Union Parliament passed a law adding strict, user-friendly protections to an already stringent set of EU data regulations. GDPR stands for the General Data Protection Regulation, and it went into force today in the member countries of the European Union. There are a number of things that GDPR seeks to accomplish, but the overarching aim of the regulation is to streamline how a user of a good or service gives consent to an entity that uses their private data.

Now I hear you saying, “I thought that GDPR was just an EU regulation, how does it affect me as an American? I don’t like this European way of things and this personal rights stuff.”

Well, you are correct, it goes into effect May 25, 2018 (today or in the past, if you are reading this now), but any organization that has European customers will need to abide by it. Since most organizations have European customers, everyone is jumping in and updating their privacy policies, begging for your consent in your inbox.

“Who cares? Aren't companies going to use my data anyway without me knowing?” Well, not anymore. Not only does GDPR allow individuals and groups to bring suit against companies that are in breach of GDPR, but these companies can be fined 4 percent of their annual global revenue. Yes, revenue, not profit, and global revenue at that. Already today, Facebook has been hit with 3.9 billion euros and Google with 3.7 billion euros in fines. (Source)

In his book Data and Goliath: The Hidden Battles to Collect and Control Your World, Bruce Schneier speaks to the deluge of data that has congested society's machines. This data exhaust, as Schneier refers to it, holds a wealth of extractable information known as metadata. Metadata, while not the specific conversations you have on your phone or the text of your emails, is collected by the organizations we use in our daily lives. Facebook, Amazon, Google, Yahoo (yes, some of us still use Yahoo email), and Apple all rely on gigantic server farms to maintain petabytes of saved metadata.


Schneier refers to the information age as being built on data exhaust, and he is correct. Modern algorithms that power recommendation engines for Amazon, the cell towers that collect phone signals to show your location, the GPS inside your iPhone, your Netflix account: all rely on metadata and the effective storage of, well, a boatload of data for lack of a better term.

Over the course of the past couple decades, we have been consenting to privacy agreements that allow these companies, and companies like them, to use our data for purposes that we are not aware of. That is the reason for GDPR. This is where consent comes to play.

According to the New York Times, GDPR rests on two main principles. “The first is that companies need your consent to collect data. The second is that you should be required to share only data that is necessary to make their services work.” Source

Broken down, here is Translating Nerd’s list of important highlights of GDPR:

1. You must give your consent for a company to use your data
2. Companies must explain how your data will be used
3. You can request your data be deleted
4. If there is a data breach at a company, they have 72 hours to notify a public data regulatory agency
5. You can request access to the personal data that is kept
6. You can object to any processing of your personal data in any situation that arises
7. Simply put, consent must be easy to give and easy to take away

So after receiving Spotify's updated privacy policy terms in my inbox, I decided to contact a representative of my favorite music streaming company (her name is Joyce). Namely, I was curious whether, if I declined to give my consent for Spotify to use my data, their recommendation engine would be left without the personal data needed to provide those terrific suggestions that all seem to begin with Nine Inch Nails and regress slowly toward Adele. Regression to the mean, perhaps? A conversation for another post, or a comment on society's love of Adele.

The conversation I had went like this:

 

 

As with most chatbots (sorry, Joyce), I was referred to a standard account terms page, so I needed to do some more digging. I found that, as stated in the GDPR terms, I could deactivate certain features, such as what data from Facebook is kept on me or whether third-party sites are used to help target ads to me.

[Images: Spotify privacy settings]

But this isn't enough. What I am curious about is whether a machine learning algorithm like Spotify's recommendation engine can work without access to my data. And then I saw it, all the way at the bottom of their user terms in this link: a clear and definitive answer that no, to delete my personal data I would need to close my account, which means no Spotify and no recommendations for Adele.

[Image: excerpt from Spotify's user terms on deleting personal data]

Simply put, algorithms run on data. No data, no prediction. No prediction, no recommendation engine to suggest Adele. My intuition proved correct: data powers the machine learning algorithms we build. This should be obvious. Just as you need fuel to power a car, you need data from a user to run an algorithm. What GDPR ensures is that the data you give is used SOLELY for the service you requested. But it remains unclear how this is differentiated in the user agreement.

And this is exactly where GDPR places us as a society. How much data are we willing to give to receive a service? What if the service is enhanced with more personal data? We rely on recommendation engines to tell us what to buy on Amazon, to inform us which Netflix shows we would like, and to help match us on dating websites like OkCupid and Match.com, but all of that requires the full consent of the user to unlock the full power of the predictive algorithms that drive these products and our cravings for smarter services.
