Future proofing your data science toolkit

We are constantly inundated with the latest and greatest tools. Over and over again, we see friends, LinkedIn acquaintances, and former classmates posting about new certifications and technologies being used on the data science front. It can feel like we are stuck in a never-ending GIF that never quite lets us reach completion in our data science careers.

That feeling of almost learning all data science

So how do you cut through this massive tidal wave of tech stacks to focus on what is important? How do you separate the skills that matter from the tools that amount to a worthless certification you will never use on the job?

Over the past few years, a handful of tools have transcended industries and shown a clear signal of being key to a high-performing data science team in the near future. Mainly, these technologies revolve around the concept of big data. Now, you may be asking yourself, what does “big data” really mean? Is it gigabytes, terabytes, petabytes of data? Is it streaming, large batch jobs, or a hybrid? Or is it anything that doesn’t fit on an Excel spreadsheet?

What is big data?

We are used to hearing about the three Vs of big data: volume, velocity, and variety. But these are opaque and hard to grasp. They don’t translate well across organizations, from a mom-and-pop analytics start-up to a fully fledged Fortune 500 firm leveraging analytics solutions. A better definition of big data comes from Jesse Anderson’s recent book Data Teams. In it, he states that big data is simply when your analytics team says they cannot do something due to data processing constraints. I like to think of it with the following analogy.

My own take: Imagine that you grew up in Enterprise, Oregon, a small Oregon farming town of approximately 2,000 people. Now, if you were to venture across the border to Washington state, you would think that Seattle is the biggest city on earth. Conversely, if you grew up in New York City, you would think that Seattle is small in comparison to your home city.

Small town, small data

This analogy applies to the size of an organization’s data: when you work with small data, everything seems big, and when you work with large data, it takes a far greater magnitude of data ingestion to become overwhelmed. But each organization has a breaking point, the point where the data team throws up their hands and says, “we can’t go any further.” This is where the future of data science lies. To leverage massive amounts of data (or, in terms of the three Vs, data with high volume, velocity, and variety), we need tools that allow us to compute at scale.

Big city, big data

Soaring through the clouds

It is no wonder that scaling up data science workloads involves using someone else’s machine. Your (and my) dinky MacBook Air or glorified HP netbook can only hold about 8 GB of information in local memory (RAM), 16 if you are lucky enough to stuff another memory card in the thing. This creates myriad problems when trying to conduct anything beyond basic querying on larger datasets (over 1 GB). Once we enter this space, a simple Pandas GroupBy command or its R data.frame equivalent can render your machine worthless until a reboot.
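
To make that concrete, here is a minimal sketch, with a hypothetical transactions.csv and made-up column names, of one way to keep an aggregation inside a laptop's RAM by reading the file in chunks rather than all at once:

```python
import pandas as pd

totals = {}

# Read the (hypothetical) large CSV a million rows at a time instead of all at once
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    partial = chunk.groupby("customer_id")["amount"].sum()
    for customer, amount in partial.items():
        totals[customer] = totals.get(customer, 0) + amount

# Combine the partial sums into one result without ever holding the full file in memory
result = pd.Series(totals, name="total_amount").sort_values(ascending=False)
print(result.head())
```

Even tricks like this only take you so far, which is exactly where renting someone else's machine comes in.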

To gain more compute capacity, you need to rent another machine. Renting a machine is referred to as spinning up a compute instance. These compute instances go by different names, and there are wide varieties of hardware and software that are leveraged when increasing our data science workload capacity. However, when it comes to who to rent from, there are really only three major players on the market: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). Of course, Oracle and IBM have their own dedicated clouds, but they barely make a dent when competing with the big girls and boys.

Data Lakes

When venturing into the cloud, it is best to begin with the two primary services a data scientist will interact with: storage and compute. When developing data science products on the cloud, you will need to store your data somewhere. In fact, what you really want is somewhere you can store your data with as few rules as possible. This is where the concept of the data lake comes into play. Data lakes give you the opportunity to dump data, regardless of file type, into a central location where you can then grab that data and pull it into an experiment. The respective players here are the following:

  • AWS Simple Storage Service (S3)
  • Microsoft Azure Blob Storage
  • GCP Cloud Storage

Now, don’t get me wrong, being able to store data in its proper home, be it a relational database based on SQL or an unstructured store using NoSQL, will always be preferable to dumping your data in a data lake. Moreover, for those data sources that you want on hand at a moment’s notice, creating a data warehouse will be preferable. But to get up and running with data science experimentation and skill building on the cloud, data lakes are fantastic.
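
As a rough illustration, here is a minimal sketch of what “dumping data into the lake” looks like on AWS with boto3; the bucket name and file names are placeholders, and it assumes your AWS credentials are already configured:

```python
import boto3

s3 = boto3.client("s3")

# One-time setup: a bucket to act as the data lake (bucket names must be globally unique;
# outside us-east-1 you also need to pass a CreateBucketConfiguration argument)
s3.create_bucket(Bucket="my-data-lake-example")

# Dump a raw file of any type into the lake; no schema or table design required
s3.upload_file("sermon_archive.zip", "my-data-lake-example", "raw/sermon_archive.zip")

# Later, pull the same object back down into an experiment
s3.download_file("my-data-lake-example", "raw/sermon_archive.zip", "local_copy.zip")
```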

Compute Power

Once you have your data in a safe location, you can start looking at renting compute power. Again, our three main players will look like this:

  • AWS Elastic Compute Cloud (EC2)
  • Microsoft Azure Virtual Machines
  • GCP Compute Engine (VMs)

When getting started, you want to learn a few things:

  1. How do I make sure that I don’t get charged too much money if I accidentally (you will) leave this thing on? There are free-tier compute instances that you can select to avoid unnecessary shocks when looking at your cloud bill.
  2. How do I ensure the correct ports are turned on? You will need to make sure the correct web ports are open when creating a compute instance; otherwise, data science tools like Jupyter Notebook will not run successfully. Leave ports open that you aren’t supposed to, though, and the whole world can see your instance (see the sketch after this list).
  3. How do I make things talk to each other? Ensure that you have the correct permissions set up. AWS, Azure, and GCP all have access management portals where you set the minimum permissions and rules for your storage and compute instances.
  4. If I had two more hours to spend on this, what would be the next steps? Knowing what the next level of complexity is can be as important as knowing how to do it. Being able to see how your application naturally builds upon itself and ties into other services will allow you to see the world as a true solutions architect.
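
To make point 2 a little more concrete, here is a minimal sketch, with placeholder IDs and a placeholder IP address, of opening Jupyter’s default port to your own IP only using boto3; the same idea applies in the Azure and GCP consoles:

```python
import boto3

ec2 = boto3.client("ec2")

MY_IP = "203.0.113.25"                       # replace with your own public IP
SECURITY_GROUP_ID = "sg-0123456789abcdef0"   # placeholder security group ID

# Allow inbound traffic on Jupyter's default port (8888) from a single IP only,
# rather than 0.0.0.0/0, so the whole world cannot see your notebook server
ec2.authorize_security_group_ingress(
    GroupId=SECURITY_GROUP_ID,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8888,
        "ToPort": 8888,
        "IpRanges": [{"CidrIp": f"{MY_IP}/32", "Description": "Jupyter from my IP only"}],
    }],
)
```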

Abstracting the noise

Oftentimes, you will find yourself on a team that does not want to worry about how to manage a cluster or HDFS (Hadoop Distributed File System). Naturally, when jumping into the world of big data and scaling out data science workflows, you need to answer the question of how involved you want to be in the day-to-day management of the big data ecosystem. This is why there has been a recent rise in products that abstract the more nuanced data engineering and system administration work of the cloud into a more user-friendly platform.

Databricks is one mover and shaker on the market that has been receiving significant fanfare and accolades for its ease of use. But don’t let its new-kid-on-the-block persona dissuade you from using it. Started by the original UC Berkeley researchers who created Apache Spark, the Databricks team has created a product that lets you leverage cloud computing and a Jupyter-style notebook environment with ease. For a nominal fee, depending on how much power you seek, you can set the cluster size with the click of a button, create instant autoscaling groups, and serve as your own cluster admin team with little upkeep. To sweeten the pot, there is a free Community Edition, and a legit Coursera course by Databricks staff dropped two weeks ago.

Zeppelin notebooks are another data science tool that operates on top of distributed systems. They are normally a little more difficult to implement, due to the manual creation of HDFS clusters and the permission granting between cloud applications, but they let you leverage a Spark environment through PySpark and SparkR. If your company has constraints around its requirements, such as a contract with a cloud support service, Zeppelin notebooks allow you to operate within your already existing cluster setup.
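
Whichever platform manages the cluster, the day-to-day code looks about the same. Here is a minimal PySpark sketch of the kind of cell you might run in a Databricks or Zeppelin notebook; the bucket path and column names are made up, and in Databricks the spark session already exists, so the builder line is only needed elsewhere:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-groupby").getOrCreate()

# Read a (hypothetical) multi-gigabyte CSV straight out of the data lake
df = spark.read.csv(
    "s3a://my-data-lake-example/raw/transactions.csv",
    header=True,
    inferSchema=True,
)

# The same aggregation that crushes a laptop gets distributed across the cluster
summary = df.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
summary.orderBy(F.desc("total_amount")).show(10)
```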

While there are more tools that allow your company to leverage big data compute, distributed file systems, and all the buzzwords that lie between, the prime movers are going to be built on the big three cloud computing platforms (AWS, Azure, GCP). Learning one easily translates to learning the others. Oftentimes, within the data science space, you will find yourself moving from one service to another. Simply knowing the mapping of related terms is often enough to ensure consistency in thought and action when creating new applications across cloud platforms. Personally, I have found that keeping a chart of the naming conventions used by AWS and Azure is quite helpful and ensures continued success in both learning and developing new data science products.

Examples:

  • AWS S3 : Azure Blob Storage
  • GCP Compute Engine (VM) : AWS EC2
  • Azure ML Studio : AWS SageMaker


Walking the Talk: Implementing an end-to-end machine learning project

3-2-1, Lift Off!

In the previous post, “How to create your first end-to-end machine learning project”, a four-stage process was offered to get you out of the endless MOOC Trap and jump the fence to greener pastures. The stages were:

  1. Find a tutorial
  2. Follow said tutorial
  3. Re-follow tutorial with your own data
  4. Customize your pipeline with your own bling

To illustrate these concepts, I am going to walk you through how to do this on a newer tech stack. Let’s say, for example, that the Udemy commercials on your YouTube feed have been blaring at you that “Python is where it is at.” So naturally, you want to be like the rest of the cool kids and learn yourself some new tech. You also have a vague idea that cloud-based environments are terrific for Python-based deep learning. Not having a lot of experience in deep learning, you Google, “Deep Learning Projects” and come across some really sweet algorithms that can help you generate text. You now have two things in front of you that you have little experience with: deep learning and renting GPUs.

“You know, deep learning is where it is at”

Last year, I found myself in a similar situation. I had a solid understanding of the material and had completed a 300-hour MOOC on deep learning through Udacity, but I had yet to complete a project outside of MOOC land. I had learned the math, the tech stack for following along with a tutorial, and how to leverage GPUs from a sandbox environment. But what I truly needed in my development was the chance to take this new proverbial toolbelt and test it out. Enter the four-step process of moving out of the MOOC Trap!

1. Find a Tutorial

Tutorial searching.

The first thing that I needed to do was a solid review of how to use deep learning for text generation. Using Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) neural nets is not at the same level as explaining linear regression. This is where reading blogs and watching YouTube videos of folks talking about their use cases and deep dives into the algorithms is important. Naturally, Medium and TowardsDataScience are going to be your starting resources.

After many failed attempts to find a tutorial online that fit my particular use case, I ran across one of my favorite online contributors, Jason Brownlee, PhD, and set out to follow his tutorial “Text Generation with LSTM Recurrent Neural Networks in Python with Keras”. There were many concepts that needed refreshing, so before beginning Dr. Brownlee’s tutorial, I needed to review some other examples. In fact, many of these links were attempted first, and a portion of the ideas set forth in them were used over the course of my pipeline. But as any student of data science knows, the first GitHub repo or tutorial you look at rarely has the gold you are looking to uncover:

2. Follow tutorial

The tutorial leverages LSTMs by taking Lewis Carroll’s Alice’s Adventures in Wonderland and trying to generate unseen text. Text generation is powerful because it can be used for bots that support humans and reduce the manual effort needed for tasks such as customer support. This was certainly an area I wanted to experiment in, and Dr. Brownlee’s posts are among the most well-articulated and well-presented sets of tutorials on the open market.

Once I read through the tutorial, I opened up my local Jupyter Notebook and got cracking! Leveraging resources outside the technical tutorial was important for understanding the math and the reasons why LSTMs are a superior form of RNN (see the links above). Namely, which activation functions, dropout rates, and learning rates needed to be applied out of the gate (no pun for LSTM fans intended).
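
For flavor, here is a rough sketch of the data prep step as I followed it, mapping characters to integers and carving the text into fixed-length sequences; the file name and variable names here are mine, not necessarily Dr. Brownlee’s exact code:

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# Load the source text and build a character-to-integer lookup
raw_text = open("wonderland.txt", encoding="utf-8").read().lower()
chars = sorted(set(raw_text))
char_to_int = {c: i for i, c in enumerate(chars)}

# Carve the text into 100-character input sequences, each paired with the next character
seq_length = 100
X_data, y_data = [], []
for i in range(len(raw_text) - seq_length):
    X_data.append([char_to_int[c] for c in raw_text[i:i + seq_length]])
    y_data.append(char_to_int[raw_text[i + seq_length]])

# Reshape to [samples, time steps, features], normalize, and one-hot encode the targets
X = np.reshape(X_data, (len(X_data), seq_length, 1)) / float(len(chars))
y = to_categorical(y_data)
```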

3. Re-follow tutorial with your own data

After following Dr. Brownlee’s tutorial, I found that my computer’s compute was severely limited. Wanting to make things easy on myself, I transferred my Jupyter Notebook to Google Colab, where I could leverage a GPU to speed up the process. While the end goal was to run this on AWS, Google Colab was a safer bet so I wouldn’t accidentally be charged a month’s rent to run a fun side project on EC2.

Those EC2 charges.

I have always had this idea of creating a project that could actually solve something in my life. Ever since I was a little kid, I remember my mother waking up in the dark each Sunday to finish her sermon. She has been a minister for 30+ years, so this practice occurred weekly. The idea came to me that if I could create a bot to complete her work, then much of her time could be freed up. The main issue with training LSTMs is that they need large amounts of data. Lucky for me, my mother saved every sermon she ever gave in Microsoft Word. Over the past year, she had been sending me emails of 5 sermons at a time that I stored in a folder, thinking that I would get around to this project but never actually pulling the trigger. After a year of emails, I had decades’ worth of documents (20 years x 52 weeks = 1,040 documents)! To play it safe, I chose her most recent 300 documents.

Soooo, do you have a document I can store my documents in?

Immediately, it became obvious that this would not be like changing the batteries in a remote. There was much work to do: customizing the various string-cleaning steps, making sure that certain words were treated differently, and ensuring that capitalization was preserved where it mattered. NLP and text analytics are not all about algorithms and knowing which hyperparameters to set; more and more, they are about domain knowledge. So, like any experienced data science consultant, I called up mom to find out more about her sermons. Having had a front-row seat to most of her career, whether waddling under pews in diapers or sitting in my 20s and 30s listening to her ideas on social justice and how the historical prophets preached about helping the disenfranchised, I felt I had a good sense of her writings. But just as with gathering requirements from clients on a work site, I felt that a phone call to mom to gather her biblical requirements and interpretations was key.

Ramping up that GPU!

Once all my customizations were in place, I was able to return basic text generations. Since my LSTM was learning on one document, it didn’t have a lot to go off of. I needed more power; a lot more power, to run over 300 documents. I went to Google Colab, selected the $10-a-month GPU subscription so this bad boy could run uninterrupted overnight, and transferred my worked-through version of Dr. Brownlee’s tutorial from the Jupyter Notebook on my weak little MacBook Air to Colab. This is where things got interesting!

4. Customize your pipeline with your own bling

First off, Dr. Brownlee offers his neural wisdom in the form of recommended algorithm improvements. I quote:

1) Predict fewer than 1,000 characters as output for a given seed.

2) Remove all punctuation from the source text, and therefore from the models’ vocabulary.

3) Try a one-hot encoding for the input sequences.

4) Train the model on padded sentences rather than random sequences of characters.

5) Increase the number of training epochs to 100 or many hundreds.

6) Add dropout to the visible input layer and consider tuning the dropout percentage.

7) Tune the batch size, try a batch size of 1 as a (very slow) baseline and larger sizes from there.

8) Add more memory units to the layers and/or more layers.

9) Experiment with scale factors (temperature) when interpreting the prediction probabilities.

10) Change the LSTM layers to be “stateful” to maintain state across batches.

And he is also nice enough to offer LSTM recurrent neural net “office hour” materials in the following:

The main issue I had was creating a mechanism that could take each document, extract only the body text (no headers, no footers, no fluff), and place it neatly in a .txt document. It took a few iterations, but the following script was able to extract all the text and append it neatly into one file. As you can see from the text below, we have dates, locations, names, and biblical passages that all need cleaning before being converted into vector space for the LSTM model.

The below sample sermon shows these areas:

As you can see, there are many areas that need to be cleaned in this document. There are various titles, headers, and specialized characters that need to be taken into account. The following script allows me to read a Microsoft Word document or .txt file and append it to a master .txt file that can be fed into the LSTM model's cleaning job. Sources for the initial scripting that this is based on are linked at the beginning of the article. But overall, you can see that the documents are being aggregated from the folder I have on my desktop, the same folder I dropped my mom’s emails into each week, and loaded into a master document.
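
In spirit, the script looked something like the sketch below. It assumes the sermons sit in a desktop folder as .docx and .txt files and uses the python-docx package for the Word documents; the folder and file names are placeholders, and my actual version had more cleaning steps than this:

```python
import os
from docx import Document  # pip install python-docx

SERMON_DIR = os.path.expanduser("~/Desktop/sermons")  # the folder the emails land in
MASTER_FILE = "all_sermons.txt"

with open(MASTER_FILE, "w", encoding="utf-8") as master:
    for name in sorted(os.listdir(SERMON_DIR)):
        path = os.path.join(SERMON_DIR, name)
        if name.endswith(".docx"):
            # python-docx only walks body paragraphs, which conveniently skips headers and footers
            text = "\n".join(p.text for p in Document(path).paragraphs if p.text.strip())
        elif name.endswith(".txt"):
            with open(path, encoding="utf-8") as f:
                text = f.read()
        else:
            continue  # ignore anything that isn't a sermon document
        master.write(text + "\n\n")
```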

Once I had 300 sermons read into my pipeline on Google Colab, I was able to begin fiddling with the hyperparameters of the LSTM model. Setting multiple layers, optimization methods, activation functions, and back-propagation learning rates, I felt that I could let the neural net train overnight. As you can see, there are 100,352 unique words in my network. These can be seen as nodes that abstract themselves with magic (linear algebra) to create a layer of 8,643 words that predicts which word will come next based on the previous words.

Simple LSTM net, but effective.
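
For reference, the network was in the spirit of the sketch below: a stacked LSTM with dropout feeding a softmax layer. The layer sizes, dropout rate, and learning rate here are illustrative rather than the precise values I ended up with, and X and y are the arrays from the prep sketch earlier:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

model = Sequential([
    # Two stacked LSTM layers with dropout in between to fight overfitting
    LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True),
    Dropout(0.2),
    LSTM(256),
    Dropout(0.2),
    # One output unit per item in the vocabulary; softmax picks the next token
    Dense(y.shape[1], activation="softmax"),
])

model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.001))
model.fit(X, y, epochs=50, batch_size=128)  # the overnight part
```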

Waking up the next morning was like being a kid on Christmas morning. The excitement I felt opening up Google Colab to see the results pulsated through my veins. I opened up the first paragraph to see the following:

A puppy, representative of Christmas morning, because, well… puppies.
 

“I found myself wondering about the wire this church, the people and the post of the seen of the complical seeming the church.  The and in the and the lives that the see the worship in the pertans the life the hell the story that the people and the people that the people the work.  The work that the light and viel the final see the make of us the for the world the healing of the conter and the people and this people at the story and the say.”

Ok, not exactly the Sermon on the Mount, but hot dang, I got something running! Of course, getting closer to actual human speech would take more effort, and the idea behind this end-to-end machine learning project was to get something on the board and move from there. Currently, the next step is to place this onto AWS, store the documents in an S3 bucket, and pull them into SageMaker, where I can control the level of compute power needed for a more in-depth tuning of the product.

As you can tell, there are now many avenues to take this project down. We can look at creating a front-end user interface with Flask, deployed on EC2 with a Docker container, a Route 53 domain, and an Elastic IP address. We can deploy on Azure Machine Learning Studio if we have a specific bent toward Microsoft.

Getting started and out of the MOOC Trap is the main battle for nascent data scientists and experts alike. We pride ourselves on the knowledge that we have accumulated but may feel shy about going out into the world and staring at a blank page. I hope that explaining this process will be helpful to those data scientists out there just getting their feet wet.

The following are links to my code, the resources I used, and the proper credits for the code that was borrowed and influential.

AWS integration, here we come!

Link to GitHub with code


How to create your first end-to-end machine learning project

You want to impress future employers by creating a dope end-to-end machine learning project, fully kitted out with a web-scraping data collection strategy and a deep-dive exploratory phase, followed by a sick feature engineering strategy, coupled with a stacked-ensemble method as the engine, and polished off with a sleek front-end microservice fully deployed on the cloud. You have a plan, you have the bootcamp/degree program under your belt, and you have $100 of AWS credits saved from that last Udacity course. You fire up your laptop, spin up Jupyter Notebook in local mode, and log into AWS. Then you draw a blank. A blinking cursor in your notebook next to “[Ln] 1:”. The coding log-jam akin to writer’s block, the proverbial trap from following too many MOOCs, a deep hole of despair. What do you do?

Entering the wilderness

Entering the wilderness of a real project (credit: Hendrik Cornelissen)

When venturing out into the wild beyond official coursework, bootcamp code-alongs, and the tutorials of Massive Open Online Courses (MOOCs), taking that first step into the uncharted territory of your first end-to-end project can be scary. Much of the time, it is difficult to know where to start. When re-skilling, up-tooling, or revamping our way into data science, we tend to get distracted by the latest and greatest in algorithmic development. As written about in the previous post, end-to-end machine learning projects rarely leverage the most complicated algorithm in academia. In fact, many of the machine learning ecosystems in development these days at major companies around the world are slight deviations from the tried-and-true approach of what we see as “standard data science” pipelines. So why should you put pressure on yourself to apply the latest cutting-edge ML algorithm when you are just starting out?

Knowing how to leverage data science tutorials is your first step
(credit)

All about tutorials

The approach that I like to take when learning something new and wanting to try it on my own use case normally follows a four-step pattern:

  1. Find a tutorial
  2. Follow said tutorial
  3. Re-follow tutorial with your own data
  4. Customize your pipeline with your own bling

A tutorial in a tutorial found in the wild. Double meta!

Excellent places to start for tutorials include:

Complete the tutorial

Follow along with the tutorial. Many times, YouTube is a great place to really get to know the flow and how to interact with the tech stack and dataset that you are using. I always like to find video tutorials that have an accompanying blog as well. Much of the time, the author will guide you through the data science pipeline while referring to the documentation they created in the blog. AWS does a fantastic job of this, especially with the curated videos that follow their SageMaker examples.

Change the data

Me after changing the data (credit)

Once you feel comfortable with the data science approach that has been taught, and are able to understand all the code and dataset particulars, it is time to bring in your own data. Your data should mirror the data that the tutorial is using. For example, if you are bringing in data about churn prediction when the tutorial is a regression-based approach, then you should rethink your target variable strategy. Try to find data that fits the algorithm family that you are working with. Classification should go with classification outcomes, regression with regression, and the same for unsupervised learning problems.

Plus it up and go into the wild

Unleash the beast! (Credit: Prince David)

You are now at the point where you can begin adding a custom flavor to the pipeline. You have already succeeded in bringing your own data; now it is time to put the pieces together into a true end-to-end project. If the tutorial that you are following only moves from EDA (exploratory data analysis) to evaluation criteria for the machine learning algorithm's predictions, or maybe you are learning the front-end component by deploying Flask or Django on EC2, then this is the perfect opportunity to spice things up!

Did someone say "plus it up"? (credit)

Try to think about what end-to-end really means. Where does the data come from? How is it collected? Can you track down an API to bring in the data on a schedule? If no API exists, can you scrape it? Can you automate that scrape with a cron job? Once the data is in, can you write functions that perform the EDA sections for you and automate the output? Can you do a deeper dive and create a story around the EDA that you are digging into? Once you have created your EDA, what feature creation can you do? Can you bring in another dataset and stitch those sources together? In other words, can you add a data engineering component?
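
As one hedged example of what that first data engineering step could look like, here is a small collector (the endpoint and fields are entirely made up) that you could schedule with a cron job to keep fresh data flowing into your pipeline:

```python
import csv
import datetime
import os

import requests

API_URL = "https://api.example.com/daily-metrics"  # placeholder endpoint

def collect_once(output_path="raw_data.csv"):
    """Pull one record from the API and append it to a growing CSV for later EDA."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    record = response.json()  # assumes the API returns a flat JSON object
    record["collected_at"] = datetime.datetime.utcnow().isoformat()

    file_exists = os.path.exists(output_path) and os.path.getsize(output_path) > 0
    with open(output_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(record.keys()))
        if not file_exists:
            writer.writeheader()  # write the header only on the very first run
        writer.writerow(record)

if __name__ == "__main__":
    collect_once()
```

A single crontab entry pointed at a script like this is all it takes to turn a one-off pull into a scheduled feed.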

As you can see, no matter the tutorial that you are following, there are always areas for improvement. There are always ways that you can get your feet wet and then really take off with your own touch. Once you have created this pipeline, think of the story that you want to tell about it. Is this something that I can talk about in a future interview? Is this something that I need to communicate to non-technical members of my team back at work to show that I am ready for sponsorship to the next level? How can I tell the story of what I have done? Can I write a blog piece about this and share it with the world?

One final thought

Stick around for more! (credit)

Throughout the data science process, we learn from others. As always, properly credit those that came before you and were influential in your work. If you followed a tutorial that really helped with a challenging section, include that tutorial link in your notebook. Just as we try to move our careers and curiosities forward, it is paramount that we give a bow to those that conducted the science before us.

Coming next week: An example of a step-by-step approach to using a Recurrent Neural Network to do my mom’s job for her (with limited success of course because she is a Rockstar)!


The Data Science Tool that Never Goes Out of Style

Admit it, we have all been there. You are on your 5th Udemy course on what you think is the next must-have algorithm or tech skill, from Generative Adversarial Networks to YOLO and ANNOY, when you ask yourself: why am I doing this? Yes, it looks super cool to post about how only real data scientists do back-propagation by hand or grow their own neural nets with organic soil under their desk, but does this actually translate into on-the-job success? Does this further drive the passion for data science and the creative spirit, or does constantly chasing the most recent trend leave you, as previously stated a few years ago on this blog, lost in the utter darkness of the “MOOC Trap”? So hit that CAPS LOCK key and let’s get down to business!

I am not saying that there is no place for advanced algorithmic learning and the implementation of cutting-edge machine learning techniques on the job. But in my experience working for various Fortune 500 companies in the consulting world and in government, it is rarely the case that these are being implemented on a day-to-day basis. What truly matters are the core skills that we tend to forget. This is why I am hitting that #ThrowbackThursday button on your social media feed and recommending that you pour a glass of wine, light a candle, and take your relationship with SQL to the next level.

No one loves SQL. Seriously, it is not what we think about when we hear “sexiest job of the 21st century.” Running queries does not win Kaggle competitions, get you on the front page of Analytics Vidhya, make you want to hit that subscribe button and pay for a monthly membership to TowardsDataScience.com, or make you yearn for some sweet, sweet credits to run queries on AWS RDS. But it is important, oh so very important. Most of the jobs on the market in both data analytics and data science not only require SQL; it is also the first cone you must pass to actually speak with a human being beyond a phone screen. So where to start?

“SQL Rocks!” or something like that. You get the idea.

Write one query a day

Habits grow out of consistency, but our better natures tend to pull us away, telling us that one day won’t make a difference and that we can wait to start tomorrow. But the simple act of one query a day can be quite transformative. This can be as simple as SELECT * FROM table. That basic act will create a routine within your subconscious. That one query will become two the next day, then five the day after. Next thing you know, you will move from basic queries to more complex WHERE clauses, conditional statements, and playing interior designer with those WINDOW functions you will be dropping in no time.
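
If it helps to see the progression, here is a minimal sketch using Python’s built-in sqlite3 module and a made-up sales table; the same queries work against whichever SQL engine you happen to practice on:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, rep TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('West', 'Ana', 120.0), ('West', 'Bo', 80.0),
        ('East', 'Cam', 200.0), ('East', 'Dee', 40.0);
""")

# Day 1: the humble starting point
print(conn.execute("SELECT * FROM sales").fetchall())

# A few days in: filtering with a WHERE clause
print(conn.execute("SELECT rep, amount FROM sales WHERE amount > 100").fetchall())

# A couple of weeks in: aggregating by group and sorting the result
print(conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY total DESC"
).fetchall())
```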

But how do I start?

It is no secret that data science is more and more influenced by game theory and behavioral economics, and one thing we learn there is that incentives matter. In fact, research has shown that losing something hurts more than gaining something (Kahneman & Tversky, 1979). This is why starting with a penalty rather than a reward can get you to SQL habit formation faster.

Action Item

Create a daily penalty, one that cannot benefit you in any way. Believe me, I have tried creating penalties that involve paying off $10 of student loans for each day I miss. That just ended up with me patting myself on the back, saying, “I can skip, because it is going toward something I need to accomplish regardless.” The penalty should be a donation to charity, 50 burpees you have to do that day, or the loss of TV privileges for your favorite show. This may seem tough, but to create a habit, sometimes we need a fire lit under our butt.

I don’t like penalties, just give me something fun!

Ok, ok, I may have come off a little harsh with the whole penalty thing. If you are someone who is big on motivational phrases, likes to watch motivational videos before going on a jog, or is big on the social media scene, create a challenge. We see a lot of #100DaysOfCode challenges out there; this is where you can jump right in. Challenge a friend to a bet: whoever can accumulate the most days in a row of writing a SQL query receives something (cash, a 6-pack of beer, or a back rub, you figure it out). Create a leaderboard, invite colleagues and those friends that are on the job hunt, and get after it!

What are some good resources?

Some fine resources out there are the following:

  1. SQL by Jose Portilla. The guy is a legit phenom of an instructor.
  2. DataCamp
  3. LeetCode
  4. w3schools SQL course
  5. learnsql

Advanced: Make friends with the ETL team

If you are already on the job and on a data science team, slide one seat over on the proverbial team chairs and make friends with a member of the ETL team. These are the folks who create the pipelines that move data in and out of the applications and machine learning algorithms we design. ETL team members are expert-level SQL junkies. Almost every member of an ETL or solutions architecture team has SQL skills that make my most advanced SQL queries look like warm-ups. By talking about the data challenges they face, seeing how they approach complex table joins, create the data warehouses behind your analytics dashboarding pleasure, or simply unclog your data faucets, you can learn a lot. In my experience, these individuals are also excited to learn about data science, and friendly lunch-and-shares can not only create internal team synergies but also lead to you acquiring new SQL skills in the process.

Images:

https://9gag.com/gag/aqnW0QY

https://www.pinterest.com/pin/499055202427808374/

https://cheezburger.com/8323023360

https://medium.com/datadriveninvestor/
