You want to impress future employers by creating a dope end-to-end machine learning project, fully kitted out with a web scraping data collection strategy, a deep dive exploratory phase, and a sick feature engineering strategy, coupled with a stacked-ensemble method as the engine and polished off with a sleek front-end microservice fully deployed on the cloud. You have a plan, you have the bootcamp/degree program under your belt, and you have $100 of AWS credits saved from that last Udacity course. You fire up your laptop, spin up Jupyter Notebook in local mode, and log into AWS. You then draw a blank. A blinking cursor sits in your notebook next to “In [ ]:”. It is the coding log-jam akin to writer’s block, the proverbial trap of following too many MOOCs, a deep hole of despair. What do you do?
Entering the wilderness
When venturing out into the wild beyond official coursework, bootcamp code-alongs, and the tutorials of Massive Open Online Courses (MOOCs), taking that first step into uncharted territory to create your first end-to-end project can be scary. Much of the time, it is difficult to know where to start. When re-skilling, up-tooling, or revamping our way into data science, we tend to get distracted by the latest and greatest in algorithmic development. As discussed in the previous post, end-to-end machine learning projects rarely leverage the most complicated algorithm in academia. In fact, many of the machine learning ecosystems in development at major companies around the world are slight variations on the tried-and-true approach that we see as “standard data science” pipelines. So why should you put pressure on yourself to apply the latest cutting-edge ML algorithm when you are just starting out?
All about tutorials
The approach that I like to take when learning something new and wanting to try it on my own use case normally follows a four-step pattern:
- Find a tutorial
- Follow said tutorial
- Re-follow tutorial with your own data
- Customize your pipeline with your own bling
Excellent places to start for tutorials include:
- Towards Data Science
- Susan Li tutorials on medium.com
- Machine Learning Mastery by Jason Brownlee, PhD
- Hands-on tutorials from AWS
- A Cloud Guru for Azure, AWS and GCP tutorials
- Sentdex Python tutorials for everything under the sun in 1,000 videos
Complete the tutorial
Follow along with the tutorial. YouTube is often a great place to get to know the flow and how to interact with the tech stack and dataset that you are using. I always like to find video tutorials that have an accompanying blog as well. Much of the time, the author will guide you through the data science pipeline while referring to the documentation that they created in the blog. AWS does a fantastic job of this, especially with its curated videos that follow SageMaker examples.
Change the data
Once you feel comfortable with the data science approach that has been taught, and are able to understand all the code and dataset particulars, it is time to bring in your own data. Your data should mirror the data that the tutorial is using. For example, if you are bringing in churn data, which is naturally a classification problem, when the tutorial takes a regression-based approach, then you should either rethink your target variable strategy or pick a different dataset. Try to find data that fits the algorithm family that you are working with. Classification should go with classification outcomes, regression with regression, and the same for unsupervised learning problems.
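One quick way to sanity-check that match is to look at the target column before you plug it into the tutorial's pipeline. The sketch below is only a rough heuristic under my own assumptions, not a rule from any library: the function name `infer_problem_type` and the `max_classes` cutoff of 20 are hypothetical, and a low-cardinality integer target (such as a small count) will be labeled classification even when regression might suit it better.

```python
from collections import Counter
from numbers import Real


def infer_problem_type(target_values, max_classes=20):
    """Guess whether a target column suits classification or regression.

    Hypothetical heuristic: non-numeric targets, and whole-number targets
    with at most `max_classes` distinct values, count as classification.
    Everything else counts as regression.
    """
    values = [v for v in target_values if v is not None]
    if not values:
        raise ValueError("target column has no usable values")

    # bool is a subclass of int, so exclude it from the numeric check.
    all_numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in values
    )
    if not all_numeric:
        return "classification"

    distinct = Counter(values)
    all_whole = all(float(v).is_integer() for v in values)
    if all_whole and len(distinct) <= max_classes:
        return "classification"
    return "regression"


# Made-up sample targets, for illustration only.
churn_flags = [0, 1, 1, 0, 1]
monthly_spend = [12.5, 30.1, 7.75]
print(infer_problem_type(churn_flags))
print(infer_problem_type(monthly_spend))
```

If the answer disagrees with the family of algorithm the tutorial uses, that is the signal to go find a different dataset or a different tutorial.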
Plus it up and go into the wild
You are now at the point where you can begin adding a custom flavor to the pipeline. You have already succeeded in bringing your own data. Now it is time to put the pieces together into a true end-to-end project. If the tutorial that you are following only moves from EDA (exploratory data analysis) to evaluation criteria for the machine learning algorithm's predictions, or maybe you are learning the front-end component of deploying Flask or Django on EC2, then this is the perfect opportunity to spice things up!
Try to think about what end-to-end really means. Where does the data come from? How is it collected? Can you track down an API to bring in the data on a schedule? If no API exists, can you scrape it? Can you automate that scrape with a cron job? Once the data is in, can you write functions that perform the EDA sections for you to automate the output? Can you do a deeper dive and create a story around the EDA that you are digging into? Once you have created your EDA, what feature creation can you do? Can you bring in another dataset and stitch those sources together? In other words, can you add a data engineering component?
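To make two of those questions concrete, here is a minimal sketch of an EDA helper and a dataset-stitching helper, written with only the Python standard library so it runs anywhere. Everything in it is an assumption for illustration: the function names `summarize_columns` and `stitch`, the record layout (a list of dicts), and the sample customer and plan data are all made up. In a real project you would most likely reach for a dataframe library instead.

```python
import statistics


def summarize_columns(rows):
    """Return missing counts, distinct counts, and basic numeric stats
    for every column found in a list of dict records."""
    summary = {}
    columns = sorted({key for row in rows for key in row})
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        # Only report numeric stats when every present value is numeric.
        if present and len(numeric) == len(present):
            info["mean"] = statistics.mean(numeric)
            info["min"] = min(numeric)
            info["max"] = max(numeric)
        summary[col] = info
    return summary


def stitch(left_rows, right_rows, key):
    """Left-join two lists of dict records on a shared key column."""
    lookup = {row[key]: row for row in right_rows}
    return [{**row, **lookup.get(row[key], {})} for row in left_rows]


# Made-up sample data, for illustration only.
customers = [
    {"id": 1, "tenure": 12},
    {"id": 2, "tenure": None},
    {"id": 3, "tenure": 30},
]
plans = [{"id": 1, "plan": "basic"}, {"id": 3, "plan": "pro"}]

merged = stitch(customers, plans, "id")
for column, info in summarize_columns(merged).items():
    print(column, info)
```

Because both helpers are plain functions, the same script can be rerun on every fresh pull of the data, which is the kind of thing a scheduled job could call once the collection step is in place.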
As you can see, no matter the tutorial that you are following, there are always areas for improvement. There are always ways that you can get your feet wet then really take off with your own touch. Once you have created this pipeline, think of the story that you want to tell about it. Is this something that I can talk about in a future interview? Is this something that I need to communicate to non-technical members of my team back at work to show that I am ready for sponsorship to the next level? How can I tell the story of what I have done? Can I write a blog piece about this and share with the world?
One final thought
Throughout the data science process, we learn from others. As always, properly credit those who came before you and were influential in your work. If you followed a tutorial to help with a challenging section, include that tutorial link in your notebook. Just as we try to move our careers and curiosities forward, it is paramount that we give a bow to those who conducted the science before us.
Coming next week: An example of a step-by-step approach to using a Recurrent Neural Network to do my mom’s job for her (with limited success of course because she is a Rockstar)!