In the previous post, “How to create your first end-to-end machine learning project”, I offered a four-stage process to get you out of the endless MOOC Trap and over that fence to greener pastures. The stages were:
- Find a tutorial
- Follow said tutorial
- Re-follow tutorial with your own data
- Customize your pipeline with your own bling
To illustrate these concepts, I am going to walk you through how to do this on a newer tech stack. Let’s say, for example, that the Udemy commercials on your YouTube feed have been blaring at you that “Python is where it is at.” So naturally, you want to be like the rest of the cool kids and learn yourself some new tech. You also have a vague idea that cloud-based environments are terrific for Python-based deep learning. Not having a lot of experience in deep learning, you Google, “Deep Learning Projects” and come across some really sweet algorithms that can help you generate text. You now have two things in front of you that you have little experience with: deep learning and renting GPUs.
Last year, I found myself in a similar situation. I had a solid understanding of the theory and had completed a 300-hour MOOC on deep learning through Udacity, but I had yet to complete a project outside of MOOC land. I had learned the math, the tech stack while following along with tutorials, and how to leverage GPUs from a sandbox environment. But what I truly needed in my development was the chance to take this new proverbial toolbelt and test it out. Enter the four-step process of moving out of the MOOC Trap!
1. Find a Tutorial
The first thing that I needed to do was a solid review of how to use deep learning for text generation. Recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) neural nets are not as easy to explain as linear regression. This is where reading blogs and watching YouTube videos of folks talking through their use cases and deep dives into the algorithms is important. Naturally, Medium and TowardsDataScience are going to be starting resources.
After many failed attempts to find a tutorial online that fit my particular use case, I ran across one of my favorite online contributors, Jason Brownlee PhD, and attempted to follow his tutorial “Text Generation with LSTM Recurrent Neural Networks in Python with Keras”. There were many concepts that needed refreshing, so before beginning Dr. Brownlee’s tutorial, I needed to review some other examples. In fact, I worked through many of the links below first, and a portion of their ideas ended up in my pipeline. But as any student of data science knows, the first GitHub repo or tutorial you look at rarely has the gold you are looking to uncover:
- Keras tutorial on LSTMs and supporting notebook
- Jeff Heaton’s deep learning notebook tutorial
- Jeff Heaton’s video series
- Dive into how LSTMs work
2. Follow tutorial
The tutorial leverages LSTMs by taking Lewis Carroll’s Alice’s Adventures in Wonderland and trying to generate unseen text. Text generation is powerful because it can be used for bots that support people and reduce human effort by automating tasks such as customer support. This was certainly an area that I wanted to experiment in, and Dr. Brownlee’s posts are one of the most well-articulated and well-presented sets of tutorials on the open market.
Once I read through the tutorial, I opened up my local Jupyter Notebook and got cracking! Leveraging resources outside the technical tutorial was important for understanding the math and the reasons why LSTMs are a superior form of RNN (see the links above), and in particular which activation functions, dropout rates and learning rates needed to be applied out of the gate (no pun for LSTM fans intended).
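To make the setup concrete, here is a minimal sketch of the kind of data preparation a character-level text generation tutorial relies on: map each character to an integer, then slide a fixed-length window over the text so that each window is an input and the character that follows it is the target. The function name `build_char_sequences` and the short sample string are my own illustrative assumptions, not code from the tutorial, and the real thing would go on to reshape and normalize these lists before feeding them to a Keras LSTM.

```python
def build_char_sequences(text, seq_length=100):
    """Return (char_to_int, inputs, targets) for next-character prediction.

    Hypothetical helper for illustration: each input is a window of
    `seq_length` character ids, each target is the id of the next character.
    """
    chars = sorted(set(text))
    char_to_int = {c: i for i, c in enumerate(chars)}

    inputs, targets = [], []
    for start in range(len(text) - seq_length):
        window = text[start:start + seq_length]
        next_char = text[start + seq_length]
        inputs.append([char_to_int[c] for c in window])
        targets.append(char_to_int[next_char])
    return char_to_int, inputs, targets


# Tiny sample string and window, just to show the shapes involved.
char_to_int, X, y = build_char_sequences("alice was beginning", seq_length=5)
print("vocabulary size:", len(char_to_int))
print("training patterns:", len(X))
```

A longer window gives the network more context per prediction but produces larger inputs and slower training, which is part of why the compute question comes up so quickly.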
3. Re-follow tutorial with your own data
After following Dr. Brownlee’s tutorial, I found that my computer’s compute was severely limited. Wanting to make things easy on myself, I transferred my Jupyter Notebook to Google Colab, where I could leverage a GPU to speed up the process. While the end goal was to run this on AWS, Google Colab was a safer bet so I wouldn’t accidentally be charged a month’s rent to run a fun side project on EC2.
I have always had this idea of creating a project that could actually solve something in my life. Since I was a little kid, I remember my mother waking up in the dark each Sunday to finish her sermon. She was a minister for 30+ years, so this happened every week. The idea came to me that if I could create a bot to help her complete her work, much of that time could be freed up. The main issue with training LSTMs is usually the absence of large amounts of data. Lucky for me, my mother saved every sermon she ever gave in Microsoft Word. Over the past year, she had been sending me emails containing 5 sermons, which I stored in a folder, thinking that I would get around to this project but never actually pulling the trigger. After a year of emails, I had 30 years’ worth of documents! (Roughly 20 years x 52 weeks, or 1,042 documents.) To play it safe, I chose her most recent 300 documents.
Immediately, it became obvious that this would not be like changing the batteries in a remote. There was much work to do: customizing the various string editors, making sure that certain words were treated differently, and ensuring that capitalization was preserved where it mattered. NLP and text analytics are not all about algorithms and knowing what hyperparameters to set; they are more and more about domain knowledge. So like any experienced data science consultant, I called up mom to find out more about her sermons. Having had a front-row seat to most of her career, from waddling under pews in diapers to sitting in my 20s and 30s listening to her ideas on social justice and how the historical prophets preached about helping the disenfranchised, I felt I had a good sense of her writings. But just as when gathering requirements from clients on a work site, I felt that a phone call to mom to gather her requirements for how she interprets biblical passages was key.
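The string cleanup described above can be sketched roughly as follows. This is only an illustration under assumptions: the two regular expressions, the `protected` word list and the function name `clean_sermon_text` are hypothetical, and the real rules came out of conversations about the sermons themselves. The idea is to strip long-form dates and scripture references, then lowercase everything except a short list of words whose capitalization matters.

```python
import re

# Hypothetical patterns for illustration; real sermons need rules tuned by hand.
LONG_DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s\d{1,2},\s\d{4}\b"
)
# Matches references such as "Luke 4:18-19" or "1 Corinthians 13:4".
SCRIPTURE_REF = re.compile(r"\b(?:[1-3]\s)?[A-Z][a-z]+\s\d+:\d+(?:-\d+)?")


def clean_sermon_text(text, protected=("God", "Jesus")):
    """Drop dates and scripture references, lowercase all but protected words."""
    text = LONG_DATE.sub(" ", text)
    text = SCRIPTURE_REF.sub(" ", text)

    cleaned = []
    for word in text.split():
        core = word.strip(".,;:!?\"'()")  # compare without trailing punctuation
        cleaned.append(word if core in protected else word.lower())
    return " ".join(cleaned)


sample = "March 3, 2019 Luke 4:18-19 The Spirit of God is upon me."
print(clean_sermon_text(sample))
```

Keeping the protected list as a parameter makes it easy to extend as domain knowledge accumulates, which is where most of the effort goes.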
Once all my customizations were in place, I was able to return basic text generations. Since my LSTM was learning on one document, it didn’t have a lot to go off of. I needed more power. I needed a lot more power to run over 300 documents. I went to Google Colab, selected the $10 a month GPU subscription so this bad boy could run uninterrupted overnight, and transferred my version of Dr. Brownlee’s tutorial from the Jupyter Notebook on my weak little MacBook Air to Colab. This is where things got interesting!
4. Customize your pipeline with your own bling
First off, Dr. Brownlee offers his neural wisdom to recommend algorithm improvements. I quote:
1) Predict fewer than 1,000 characters as output for a given seed.
2) Remove all punctuation from the source text, and therefore from the models’ vocabulary.
3) Try a one hot encoded for the input sequences.
4) Train the model on padded sentences rather than random sequences of characters.
5) Increase the number of training epochs to 100 or many hundreds.
6) Add dropout to the visible input layer and consider tuning the dropout percentage.
7) Tune the batch size, try a batch size of 1 as a (very slow) baseline and larger sizes from there.
8) Add more memory units to the layers and/or more layers.
9) Experiment with scale factors (temperature) when interpreting the prediction probabilities.
10) Change the LSTM layers to be “stateful” to maintain state across batches.
And he is also nice enough to offer LSTM recurrent neural net “office hour” materials in the following:
- Generating Text with Recurrent Neural Networks [pdf], 2011
- Keras code example of LSTM for text generation.
- Lasagne code example of LSTM for text generation.
- MXNet tutorial for using an LSTM for text generation.
- Auto-Generating Clickbait With Recurrent Neural Networks.
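Of the improvements quoted above, the temperature idea (item 9) is the easiest to try in isolation. Here is a minimal sketch, assuming the model hands back a list of next-character probabilities: divide the log-probabilities by a temperature, renormalize, and sample from the result. The names `apply_temperature` and `sample_index` are hypothetical helpers of my own, not part of Keras or the tutorial.

```python
import math
import random


def apply_temperature(probs, temperature=1.0):
    """Rescale a probability distribution.

    Temperatures below 1 sharpen the distribution (safer, more repetitive
    text); temperatures above 1 flatten it (more surprising text).
    """
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, temperature=1.0, rng=random):
    """Draw one index at random according to the rescaled distribution."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


predicted = [0.5, 0.3, 0.2]  # made-up probabilities for three characters
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(predicted, t)])
```

Always taking the most likely character tends to produce loops of the same few words, so sampling with a temperature is a cheap experiment before retraining anything.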
The main issue I had was creating a mechanism that could take each document and extract only the body text, no headers, no footers, no fluff, and place it neatly in a .txt document. It took a few iterations, but the following script was able to extract all the text and append it neatly into one file. As you can see from the text below, we have dates, locations, names and biblical passages that all need cleaning before being converted into vector space for the LSTM model.
The below sample sermon shows these areas:
As you can see, there are many areas that need to be cleaned in this document. There are various titles, headers and special characters that need to be taken into account. The following script allows me to read a Microsoft Word document or .txt file and append it to a master .txt file that can be cleaned and fed into the LSTM model. Sources for the initial scripting that this is based on are linked in the beginning of the article. But overall, you can see that the documents are being aggregated from the folder I have on my desktop, the same folder that I dropped my mom’s emails in each week, and loaded into a master document.
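A minimal sketch of this kind of aggregation step, using only the Python standard library, might look like the following. It is an illustration rather than the exact script used for the project: the function names `read_docx_body` and `build_master_file` are hypothetical, and it assumes the sermons are .docx or .txt files (older binary .doc files would need converting first). It leans on the fact that a .docx file is a zip archive whose body text lives in `word/document.xml`, while page headers and footers are stored in separate parts of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_body(path):
    """Return the body paragraphs of a .docx file as newline-separated text."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return "\n".join(paragraphs)


def build_master_file(folder, master_path):
    """Append every .docx and .txt file in `folder` to one master text file.

    Returns the number of documents written.
    """
    master_path = Path(master_path)
    count = 0
    with open(master_path, "w", encoding="utf-8") as master:
        for path in sorted(Path(folder).iterdir()):
            if path.resolve() == master_path.resolve():
                continue  # never read the output file back into itself
            suffix = path.suffix.lower()
            if suffix == ".docx":
                body = read_docx_body(path)
            elif suffix == ".txt":
                body = path.read_text(encoding="utf-8")
            else:
                continue  # skip anything that is not a supported document
            master.write(body.strip() + "\n\n")
            count += 1
    return count
```

Calling `build_master_file("sermons", "master.txt")`, with both names standing in for your own paths, would leave one blank line between documents. Titles, dates and scripture references that sit inside the body still come through and need the cleaning pass afterwards.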
Once I had 300 sermons read into my pipeline on Google Colab, I was able to begin fiddling with the hyperparameters of the LSTM model. After setting multiple layers, optimization methods, activation functions, and back-propagation learning rates, I felt that I could let the neural net train overnight. As you can see, there are 100,352 unique words in my network. These can be seen as nodes that are going to abstract themselves with magic (linear algebra) to create a layer of 8,643 words to predict what word will come next based on the previous words.
Waking up the next morning was like being a kid on Christmas morning. The excitement I felt opening up Google Colab to see the results pulsated through my veins. I opened up the first paragraph to see the following:
“I found myself wondering about the wire this church, the people and the post of the seen of the complical seeming the church. The and in the and the lives that the see the worship in the pertans the life the hell the story that the people and the people that the people the work. The work that the light and viel the final see the make of us the for the world the healing of the conter and the people and this people at the story and the say.”
Ok, not exactly the Sermon on the Mount, but hot dang, I got something running! Of course, getting closer to actual human speech would take more effort, and the idea behind this end-to-end machine learning project was to get something on the board and move from there. Currently, the next step is to place this onto AWS, store the documents on S3 inside a bucket, and pull them into SageMaker, where I can control the level of compute power needed for a more in-depth tuning of the product.
As you can tell, there are now many avenues to take this project. We can look at creating a front-end user interface with Flask, deployed on EC2 in a Docker container behind a custom Route 53 domain and an Elastic IP address. We can deploy on Azure’s Machine Learning Studio if we have a specific bent toward Microsoft.
Getting started and getting out of the MOOC Trap is the main battle for nascent data scientists and experts alike. We pride ourselves on the knowledge that we have accumulated but may feel shy about going out into the world and looking at a blank page. I hope that explaining this process will be helpful to those data scientists out there just getting their feet wet.
The following are links to my code, the resources I used, and the proper attributions for the code that was borrowed and influential.