The Effective Data Science Resume


 

Writing an effective data science resume can be challenging. The industry sends so many divergent signals about what good content looks like. Do I include the latest tech stack from this year's AWS re:Invent conference webinar? Do I put down all the Kaggle competitions I have completed? Do I include the Ames Housing dataset regression model that I finished at a boot camp but never had any real passion for? How do I show a future employer that I communicate well within and between teams?

The field of data science has many roles, and tailoring your resume to the specific role you want to play on a data science or analytics team is crucial. This post does not give step-by-step instructions on how to create a unicorn-inspired resume; rather, it lays the groundwork for a resume that you can then customize to the position you are applying for.

 

Step 1: The “About Me” Intro

Start with a small blurb about yourself. Keep it pithy. Include a strong message that highlights your passion for analytics, but don’t simply list your accomplishments. This helps set the theme for the resume. This is not the place to say that you are a “good communicator” or “an inspiring individual.” If what you are writing sounds generic, don’t include it. Putting that you excel in “interpersonal skills” is equivalent to saying that you know how to open Microsoft Word. Think of this as your mission statement, your raison d’être for being in analytics. Put some heart into this little opener!

 

Step 2: Education


Don’t worry if you got an undergraduate degree unrelated to computer science, programming, or engineering. If your previous degree is not related to your current situation, put it down anyway. It shows that you are well-rounded and can think outside the norms of statistical jargon. Some of these “unrelated” fields are even better suited to emerging areas in data science. Majors such as English Literature can be an asset when fitting natural language processing (NLP) models; parsing language and named entity recognition (NER) are the heartbeat of many unstructured-data machine learning pipelines. If you studied political science, it shows that you can understand the societal boundaries we face when examining morality and bias in machine learning. If you studied gender studies, a new era of image detection and how we understand gender within computer vision will prove that major useful. The point is that you didn’t need to study computer science to become an effective data scientist. Now, if you want to work at Facebook and help develop PyTorch’s next neural network architecture, then this post isn’t for you. But if you want to bring data science practices into organizations that are inherently human-centered, such as optimizing product placement, predicting voter turnout via political polling, or engineering features to get people into the latest Mission Impossible thriller, then these liberal arts skills will come in handy.

 

Step 3: Technical Skills

Within the data science consulting world, especially amongst the Big Four, having technical skills laid out nice and neat in front of the recruiter is crucial. Include sections for programming languages, statistical knowledge, algorithms, cloud computing resources, and visualization tools. This helps recruiters zero in on what they are looking for: that elusive purple squirrel of job applicants.

 

Step 4: Job Experience

Now is the time to shine! If you have years of data or analytics-related job experience, be sure to bullet your accomplishments. Use the STAR (Situation, Task, Action, Result) method rather than simply listing the tools and techniques you used to complete a task. This shows initiative and a “go get ‘em” attitude. An example of a data science consultant leveraging the STAR method: “Helped a major airline client handle 10,000 email requests per day by using Latent Dirichlet Allocation (LDA), an NLP technique, to automate the response process, resulting in a 30 percent reduction of in-person hours.” This STAR approach shows that you had a situational grasp of the task at hand, the methods to complete it, and the business justification for your work.

Step 5: Projects

Personally, this is an area I see too much of on resumes. Often a candidate with about one year of experience in an analytics-related field will devote half the resume to school or side projects. The school or side project is important, but it is not the reason for creating the resume. You want to show potential employers that you have been battle-tested and already have the experience they desire. Now, you may be asking yourself, “But I am just finishing (fill in the blank) education program. How can I list experience?” If you were in an internship, see Step 4. If you were in a co-op, see Step 4. If you contributed to a volunteer organization, see Step 4. Remember, the resume is to give employers a signal that you have been there before.

On to projects. Include projects that are unique, those topics that you don’t see too often. A few projects that should NEVER be on a resume: Titanic dataset, Iris dataset, Ames or Boston Housing dataset. If you have done projects on those datasets, no matter how cool they were or how dope the visualization and dashboard you created, please don’t put them on there. The goal is to separate yourself from the pack, not to mirror the herd.


The following kinds of projects are effective. If you had to go out and collect data from an outside source, or had to do an immense amount of cleaning, include that. The end result can be as simple as summary statistics and a few bar charts. But if that dataset has not been analyzed before, maybe you are looking at weather data from Timbuktu, or baseball games played only in the one stadium you went to as a child, include it! These are what we refer to as “passion projects”: projects that, when you discuss them with others, let them see your excitement. I once knew a data analytics professional who recorded how he got to work each day. He did this for one year, jotting down the means of transportation (subway, bus, bike, walk), the weather (cloudy, sunny, snowy, rainy), and how much time it took him to get to the office. He made a basic Excel dashboard from this information. It was one of the coolest passion projects I have ever seen. Don’t get me wrong, I am not saying to include Excel dashboards on your resume, but don’t think that a passion project needs to win Kaggle competitions or use the newest machine learning algorithm out there.

One final note on projects. The best projects are the ones that follow the data science lifecycle. You identified a question, formed a hypothesis, collected the data, cleaned the data, created features, created a model, analyzed the results, visualized those results, and explained those results. That is what future employers are looking for. If you happened to push that model to production via a cloud service, then hats off to you my friend, because you just told an analytics story that most organizations yearn to have.

Step 6: Certifications

If you have any certifications, put them down. I am talking about ones that required a test, most of the time offered by the platform vendors themselves: Tableau Desktop Associate, AWS Cloud Developer, Azure, GCP, IBM, etc. A trend I am seeing is people listing every Udemy or Coursera course they have taken. I am all for online learning platforms and frequently use Udemy when I need to learn something quickly (“Hey, Python is where it’s at. You know, you should take this course on Udemy, it’s taught by a real data scientist”). Heck, I used Udemy last week to learn how to enhance security on my AWS S3 bucket. But what can signal to a potential employer that you may not be experienced, in my humble opinion, is including that 2.5-hour Udemy Intro to NumPy course. My personal rule of thumb: if you think it separates you from others applying to the position, put it; if it doesn’t, leave it off. The resume has limited space, and including what brings you to the top of the pile, rather than what helps you blend in, is paramount.

Step 7: Interpersonal Skills

Yes, I said it in the beginning: don’t include this as a skillset. Rather than explicitly stating that you are an effective communicator or leader, prove it! If you have attended Toastmasters and given speeches through their program, that is one of the best ways to both improve your public speaking and signal that you are continuously working on it.

Being an effective data scientist is about persuading others that your techniques and process are solid, and working on oration skills will help you get there. On a similar note, if you have presented at conferences, been a community leader of some sort, or volunteered at an organization, include that in a personal section at the end. This all shows the inherent humanity in your candidacy for a job and your ability to be both a leader and a team player.

Last remarks

There is no one right way to write a resume, but there are many ways to make yourself seem like every other data science candidate out there. The goal is to show your uniqueness to an organization, to exude confidence and proven performance, even if you don’t have a lot of experience. Companies want individuals who contribute to the team, so your resume should be the embodiment of that.

 

Photo credits:

Photo by Green Chameleon on Unsplash
Photo by Shahadat Rahman on Unsplash
Photo by Ian Schneider on Unsplash
Photo by William Moreland on Unsplash
Photo by Roman Mager on Unsplash

Extracting Stays from GPS Points

Contributing Author/Data Scientist: Brandon Segal

 

As many news outlets have reported over the past few weeks, there is a new push to begin leveraging the wealth of location data captured from citizens’ cellphones every day. Setting aside the ethics of whether that should be done, it brings up an interesting question: what is the best way to find out where devices are dwelling?

GPS Data Overview


Knowing where devices are dwelling could help public services determine whether social distancing efforts are working, by showing locations with high densities of devices or whether certain businesses are not observing the given guidelines. To answer these kinds of questions, local governments could turn to data brokers who provide high-precision GPS data from millions of citizens who have apps on their phones with location permissions. This is what is commonly referred to as spatiotemporal data. Spatiotemporal data is, as the name suggests, data that has both a time and a location associated with it, and it is used to describe moving objects or geometries that change over time.

Let’s look at an example of the spatiotemporal data your phones produce and send to these different data brokers.

{
"latitude":39.9526,
"longitude":75.1652,
"horizontalAccuracy":30,
"speed":10,
"timestamp":"2020-03-26T23:11:26.787Z"
}

Spatiotemporal Data Sample
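Purely as an illustration (and not part of the original pipeline), here is a minimal Python sketch that loads a batch of records shaped like the sample above into a pandas DataFrame and orders them by time; the file name records.json is hypothetical.

import json
import pandas as pd

# "records.json" is a hypothetical file holding a JSON array of records
# shaped like the sample above.
with open("records.json") as f:
    records = json.load(f)

df = pd.DataFrame(records)

# Parse the ISO-8601 timestamps and order the points chronologically; the
# stay-extraction approaches discussed below assume time-ordered input.
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.sort_values("timestamp").reset_index(drop=True)

print(df[["latitude", "longitude", "timestamp"]].head())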

On a map, this data would be a single point and would be pretty uninteresting by itself. However, when records are taken over a longer period we can begin seeing patterns emerge. Below is an example of a GPS trace from a University student in Seoul over two months. Notice how from these traces we can likely determine their common travel patterns, common dwell points, and outliers on the map caused by infrequent trips or device inaccuracies.

Student at Yonsei University: GPS traces over 2 months taken from an Android device (link)

 

Extracting Stay Points

Plotting these traces on a map makes drawing general qualitative trends easy, but it is still difficult to measure where a device spends its time, whether that is a specific coffee shop, a mall, or even a census block. Grouping these unlabeled data points into logical groups calls for a clustering algorithm that can group them based on the density of points. These kinds of algorithms are called density-based clustering algorithms, some examples being:

  • DBSCAN: A density-based algorithm based on the distance between points and a minimum number of points necessary to define a cluster
  • OPTICS: A variant of DBSCAN with a hierarchical component to find clusters of varying densities
  • DJ-Cluster: Short for Density-Join Clustering Algorithm, a memory-sensitive alternative to DBSCAN

Each of the algorithms above has its own benefits, but they all depend on the following (a minimal sketch follows this list):

 
  • Having a defined distance metric to measure how far the points are from each other
  • A method for defining whether a group of points can be considered a cluster.
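To make these two requirements concrete, here is a minimal sketch, my own illustration rather than anything from the paper discussed below, of a purely spatial clustering pass using scikit-learn’s DBSCAN with a haversine distance metric. It reuses the df built in the earlier sketch, and eps_meters and min_samples are arbitrary illustrative values.

import numpy as np
from sklearn.cluster import DBSCAN

# df holds the GPS points from the earlier sketch (latitude/longitude in degrees).
# The haversine metric expects coordinates in radians.
coords = np.radians(df[["latitude", "longitude"]].to_numpy())

EARTH_RADIUS_M = 6_371_000   # mean Earth radius in meters
eps_meters = 100             # illustrative: neighbors must be within 100 m
min_samples = 10             # illustrative: at least 10 points form a cluster

db = DBSCAN(
    eps=eps_meters / EARTH_RADIUS_M,  # convert meters to radians
    min_samples=min_samples,
    metric="haversine",
    algorithm="ball_tree",
).fit(coords)

# Label -1 marks noise; every other label is a candidate dwell cluster.
df["cluster"] = db.labels_
print(df["cluster"].value_counts())

Note that this sketch ignores time entirely, which is exactly the limitation described next.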

Each of these algorithms can be used with spatial data to find groups of points, but the main challenge arises when the temporal component is introduced. Distance can no longer be defined by just the Euclidean distance between the x and y components; you also have to take into account how far apart the points are in time. While working with spatiotemporal data I have used several clustering methods, but the one that worked best with GPS data derived from phones was one called D-StaR [1].

D-StaR stands for the Duration-based Stay Region extraction algorithm, developed by Kyosuke Nishida, Hiroyuki Toda, and Yoshimasa Koike for extracting arbitrarily shaped stay regions while ignoring outliers. The benefits this algorithm has over others include:

  • It allows for the points to be sampled more irregularly than other algorithms that depend on a minimum number of points to define a cluster
  • It has computational and memory benefits over algorithms that require computing the distance one point is from every other point
  • It has the potential to handle streaming data and emit stay regions in a more event-driven application

D-StaR expects the input data set to be sorted by time from beginning to end. After sorting the data, the algorithm takes a few input parameters, including the following (a simplified sketch follows the figure below):

  • ϵ: A spatial distance that a point has to be within to be considered a spatial neighbor to a core point (measured in meters)
  • q: A window size that determines the number of points before and after a core point used to find spatial neighbors (unitless)
  • t: A temporal threshold to consider a merged cluster a stay region (measured in seconds)
Example of a cluster determined by D-StaR for distance threshold ϵ, sliding window size q = 3, and duration threshold for a core point t = 15. Point p_i is described as (i, t_i) in this figure.
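To ground the description above, the sketch below is my own simplified rendering of the sliding-window idea, not the authors’ reference implementation: for each point it inspects the q points before and after, collects those within ϵ meters, and keeps a merged run of such points as a stay only if it spans at least t seconds. The helper function and default parameter values are illustrative.

from math import radians, sin, cos, asin, sqrt

def haversine_m(p1, p2):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (p1[0], p1[1], p2[0], p2[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def extract_stays(points, eps=100, q=3, t=900):
    """Simplified D-StaR-style stay extraction (illustrative only).

    points: list of dicts with 'latitude', 'longitude', and 'timestamp'
            (a datetime), sorted by timestamp.
    eps:    spatial neighbor threshold in meters.
    q:      number of points before and after a core point to inspect.
    t:      minimum duration in seconds for a merged run to count as a stay.
    """
    stays, current = [], []
    for i, p in enumerate(points):
        # Candidate neighbors of p inside the sliding window of q points per side.
        window = points[max(0, i - q): i + q + 1]
        neighbors = [
            n for n in window
            if haversine_m((p["latitude"], p["longitude"]),
                           (n["latitude"], n["longitude"])) <= eps
        ]
        if len(neighbors) > 1:  # p has at least one spatial neighbor besides itself
            current.extend(n for n in neighbors if n not in current)
        elif current:
            # The dense run ended; keep it only if it lasted long enough.
            span = (max(n["timestamp"] for n in current)
                    - min(n["timestamp"] for n in current)).total_seconds()
            if span >= t:
                stays.append(current)
            current = []
    if current:
        span = (max(n["timestamp"] for n in current)
                - min(n["timestamp"] for n in current)).total_seconds()
        if span >= t:
            stays.append(current)
    return stays

# Example usage with the DataFrame from the earlier sketches:
# stays = extract_stays(df.sort_values("timestamp").to_dict("records"))

Each returned stay is simply the list of points in one dense, sufficiently long run; summarizing a run into a stay region (for example, its centroid and time span) is left out of this sketch.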

 

Using this algorithm, I improved F1 scores for finding actual stay regions by 20 percent on data derived from phones. I invite you to read the paper linked above that covers D-StaR, as that helps support this kind of academic research. In the next post, I will go over how to implement this algorithm in Python and how we can incorporate it into a data processing pipeline.

About the author:

Brandon Segal is a Software Engineer currently working on building enterprise machine learning infrastructure with a strong focus on data engineering and real-time inference. Previously, Brandon worked in data visualization and as a geospatial data scientist, where he learned the interesting challenges of building data-intensive applications. Located in the DMV area and originally from the Philly area, he loves spending time with his wife, volunteering in Arlington, and continued learning.

Github: https://github.com/brandon-segal

Email: brandonsegal.k@gmail.com

 

Photo credit: Paul Hanaoka on Unsplash
