My mom really likes to write. She writes emails that would make folks in college English departments want to grant her valedictorian status. Now I know this makes me seem like a bad son, but there are times when I would like an indicator on her emails that tells me how urgently I need to respond. When there is great news, I can respond over the weekend by giving her a phone call, but if there is something upsetting, I need a red flag to pop up and let me know that I need to call right away, or else my father is going to start texting me saying, “Dude, call your mom.” (To preface, see the image below: I think everyone should simply call their mom when asking this question. But that does not make for a convincing data science blog.)
Well, natural language processing (NLP) offers a solution. In fact, NLP is at the forefront of the data science industry for its use on unstructured (text) data. Once-arduous tasks such as combing through piles of PDFs and reading long emails can now be automated. We have spam detectors that use text cues to predict, with astonishing accuracy, which emails are spam and which are notes from grandma. But what the data science community lacks is the “Mom Alert.”
To understand the complexities that NLP has to offer, let’s break down the “Mom Alert.” First, we need a corpus (collection) of past emails that my mom has sent me. These need to be labeled by hand as “upset” mom and “happy” mom. Once I have created labels for my mom’s historical emails, I can take those emails and break them down into a format that the computer algorithm can understand.
But first, I need to separate these emails into two sets of data: one whose labels of upset and happy the model gets to see, and another whose labels I hold back so that I can ask the model to predict them. The first set will be called the training emails and the second will be called the testing emails. This is important in the NLP process because comparing the predictions on the testing emails against their held-back labels tells me how well the model generalizes, that is, how well it predicts emails it has not seen.
Training and Testing Data
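To make the split concrete, here is a minimal sketch in plain Python. The function name `split_emails`, the 25 percent test fraction, the fixed seed, and the example emails are all my own assumptions for illustration, not real emails from my mom.

```python
import random

def split_emails(labeled_emails, test_fraction=0.25, seed=42):
    """Shuffle labeled emails and split them into (training, testing) lists."""
    shuffled = list(labeled_emails)          # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed keeps the split repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up examples: (email text, label) where 1 = upset and 0 = happy
emails = [
    ("Your sister got the job, we are so proud!", 0),
    ("You forgot to call on my birthday.", 1),
    ("The garden looks lovely this spring.", 0),
    ("I am worried, nobody has heard from you.", 1),
    ("Dinner on Sunday was wonderful.", 0),
    ("I am disappointed you missed the reunion.", 1),
    ("We loved the photos you sent.", 0),
    ("Why do you never answer my emails?", 1),
]

train, test = split_emails(emails)
print(len(train), "training emails,", len(test), "testing emails")
```

The labels of the testing emails are kept in the list, but the model should never look at them while it is learning. They are only used afterwards to check its predictions.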
It is important to note that I will want to pre-process my data. That is, I want to make sure that all the words are lowercase because I don’t want my model to treat capitalized and uncapitalized versions of a word as different words. I will then remove words that occur frequently in the English language but carry little meaning. These are called “stop words.” “And”, “but”, “the”, “a”, “an” add little to the meaning of a sentence and can all go. I will also remove punctuation because, for this simple model, it adds little information. Finally, I will take any word that is plural and bring it down to its singular form, thus shoes to shoe and cars to car. The point of all this pre-processing is to reduce the number of distinct words that I need to have in my model.
Now that I have separated the emails into training and testing sets and pre-processed the words, I need to put those emails into a format that my computer algorithm can understand: numbers. This is called the “vectorization” process, where we create a document-term matrix. The matrix simply records, for each document (mom’s email), the number of times each word occurred. This matrix is then used to compare documents with one another. The reason we pre-processed the words is that vectorization would otherwise result in a massive, extremely clumsy matrix.
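The pre-processing steps can be sketched in a few lines of Python. This is a rough illustration under my own assumptions: the stop-word list only contains the five words mentioned in this post, and the `singular` helper is a crude rule that just drops a trailing "s", which a real project would replace with a proper stemmer or lemmatizer.

```python
import string

# Tiny illustrative stop-word list; real lists contain a few hundred words
STOP_WORDS = {"and", "but", "the", "a", "an"}

def singular(word):
    """Crude plural handling: drop a trailing 's' (but not 'ss') on longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    """Lowercase, strip ASCII punctuation, drop stop words, then singularize."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [word for word in text.split() if word not in STOP_WORDS]
    return [singular(word) for word in tokens]

print(preprocess("The shoes and the cars are GONE, but I miss them!"))
# ['shoe', 'car', 'are', 'gone', 'i', 'miss', 'them']
```

Note that `string.punctuation` only covers ASCII punctuation, so curly quotes would need to be handled separately.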
Document-Term Matrix (aka, Vectorization)
As you can see in the image above, each email is displayed as its own row in the document-term matrix. The matrix then takes each unique word across ALL the training emails and places it in its own column. These columns are called features. Features can make this matrix incredibly wide. Think of every unique word in thousands of emails. That is one wide matrix! Which is the reason we did pre-processing in the first place! We want to reduce the number of columns, or features, which is a simple form of dimensionality reduction. In other words, we are reducing the amount of numbers that our algorithm needs to digest.
Now that I have my emails represented as a matrix, I can use an algorithm to take those numerical representations and convert them to a prediction. Recall that a previous post introduced Bayes’ Theorem (see Translating Nerd’s post on Bayes). Well, we can use Bayes’ Theorem to create a predicted probability that my mom is upset. We will call upset = 1 and happy = 0.
Side Note: I know this seems pessimistic, but my outcome variable is going to be the probability that she is upset, and that is why the output of our algorithm needs to fall between zero and one. Full disclosure, my mom is wonderful and I prefer her happy. Again, see images under the text.
Now, there are other algorithms that can be used, such as logistic regression, support vector machines and even neural networks, but let’s keep this simple. Actually, early email spam detectors used Naïve Bayes because it works so darn well with large numbers of features (words in our case). But “naïve”, what does that mean, you may ask? The model makes the naïve assumption that, once we know whether an email is upset or happy, the words in it occur independently of one another. We know this cannot be true because that is how we create sentences, that is, words have meaning when coupled with each other. Of course, each algorithm has drawbacks, but Naïve Bayes proves to be quite accurate with large numbers of features (i.e., words).
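Here is a small sketch of how a document-term matrix could be built by hand, assuming the emails have already been pre-processed into lists of words. The function names and the three toy documents are made up for illustration.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Collect every unique word across all documents, in sorted order."""
    return sorted({token for tokens in token_lists for token in tokens})

def document_term_matrix(token_lists, vocabulary):
    """One row per document, one column per vocabulary word, cells hold counts."""
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)                       # missing words count as 0
        rows.append([counts[term] for term in vocabulary])
    return rows

# Three toy "emails", already pre-processed into tokens
docs = [
    ["call", "me", "call", "soon"],
    ["love", "garden"],
    ["call", "garden"],
]

vocab = build_vocabulary(docs)
matrix = document_term_matrix(docs, vocab)

print(vocab)        # ['call', 'garden', 'love', 'me', 'soon']
for row in matrix:
    print(row)
# [2, 0, 0, 1, 1]
# [0, 1, 1, 0, 0]
# [1, 1, 0, 0, 0]
```

Even in this toy case most cells are zero. With thousands of emails the matrix gets one column per unique word, which is why trimming the vocabulary matters.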
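For reference, Bayes’ Theorem applied to the Mom Alert can be written as follows. The notation is my own shorthand: "email" stands for the words observed in one email.

```latex
P(\text{upset} \mid \text{email})
  = \frac{P(\text{email} \mid \text{upset})\, P(\text{upset})}{P(\text{email})},
\qquad
P(\text{email})
  = P(\text{email} \mid \text{upset})\, P(\text{upset})
  + P(\text{email} \mid \text{happy})\, P(\text{happy})
```

Because the denominator adds up both possibilities, the result always lands between zero and one.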
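To show what the naïve assumption buys us, here is a minimal from-scratch sketch of a Naïve Bayes classifier that works on word counts. It assumes the emails are already pre-processed into word lists, that labels are 1 for upset and 0 for happy, and it uses add-one (Laplace) smoothing, which is my own addition so that a word never seen in one class does not force a probability of zero. The function names and toy emails are hypothetical.

```python
import math
from collections import Counter

def train_naive_bayes(token_lists, labels):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    vocab = {token for tokens in token_lists for token in tokens}
    log_prior, log_likelihood = {}, {}
    for label in sorted(set(labels)):
        docs = [t for t, y in zip(token_lists, labels) if y == label]
        log_prior[label] = math.log(len(docs) / len(token_lists))
        counts = Counter(token for tokens in docs for token in tokens)
        total = sum(counts.values())
        # Add-one (Laplace) smoothing: an assumption of this sketch
        log_likelihood[label] = {
            word: math.log((counts[word] + 1) / (total + len(vocab)))
            for word in vocab
        }
    return log_prior, log_likelihood

def predict_upset_probability(tokens, log_prior, log_likelihood):
    """Return P(upset = 1 | tokens); words not seen in training are ignored."""
    scores = {}
    for label in log_prior:
        # Naive assumption: word log likelihoods simply add up within a class
        scores[label] = log_prior[label] + sum(
            log_likelihood[label][word]
            for word in tokens
            if word in log_likelihood[label]
        )
    top = max(scores.values())                     # subtract max for stability
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    return weights[1] / sum(weights.values())

# Made-up training emails (1 = upset, 0 = happy)
train_docs = [
    ["worried", "call", "me"],
    ["disappointed", "missed", "call"],
    ["love", "garden"],
    ["wonderful", "dinner", "love"],
]
train_labels = [1, 1, 0, 0]

model = train_naive_bayes(train_docs, train_labels)
p_upset = predict_upset_probability(["worried", "call"], *model)
p_happy = predict_upset_probability(["love", "dinner"], *model)
print(round(p_upset, 2), round(p_happy, 2))
```

Working in logarithms avoids multiplying many tiny probabilities together, which would otherwise underflow to zero on long emails.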
Once we have implemented our Naïve Bayes model on the document-term matrix, we can make a prediction on each email in the test set. This test set acts as a check on what the model learned from the training set, which allows us to make changes to our model and get as close as possible to a generalized model for determining an upset email from mom. A key goal of machine learning is to create a model that doesn’t just fit our training data, because we need it to be general enough to handle new data. Fitting the training data too closely is called overfitting a model and should be avoided. Of course, there is a trade-off with underfitting, where the model is too simple to capture the pattern at all, and the two need to be balanced, but again, I digress (see image below).
Machine learning basics for future posts
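One simple way to spot overfitting is to compare accuracy on the training emails with accuracy on the testing emails. The sketch below uses made-up labels and predictions purely to show the arithmetic. They are not results from a real model.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Made-up labels and predictions (1 = upset, 0 = happy)
train_true = [1, 1, 0, 0, 1, 0]
train_pred = [1, 1, 0, 0, 1, 0]   # the model gets every training email right
test_true = [1, 0, 1, 0]
test_pred = [1, 0, 0, 1]          # but only half of the unseen emails

train_acc = accuracy(train_true, train_pred)
test_acc = accuracy(test_true, test_pred)
gap = train_acc - test_acc

print(train_acc, test_acc, gap)   # 1.0 0.5 0.5
```

A large gap like this one suggests the model memorized the training emails instead of learning something general. A model that scores poorly on both sets is more likely underfitting.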
Once we have tuned our Naïve Bayes algorithm to both fit the training emails and generalize well to the emails in the test set, we are ready to try it out on a new email from mom. When mom sends us a new email, our algorithm will output a predicted probability. Let’s say that any email that has a predicted probability of 0.5 (50 percent) or more will be called upset (1) and any email with a predicted probability under 0.5 will be called happy (0).
For a simple illustration, look at the new emails above, which have been run (without labels) through our Naïve Bayes algorithm. Comparing each predicted probability with our 0.5 threshold gives classifications of happy, upset, upset, and happy. It looks like I will have some calls to make this evening!
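The threshold rule itself is tiny. Here is a sketch where the probabilities are invented for illustration and are not the values pictured in this post.

```python
def classify(probability_upset, threshold=0.5):
    """Return 1 (upset) when the probability reaches the threshold, else 0 (happy)."""
    return 1 if probability_upset >= threshold else 0

LABEL_NAMES = {1: "upset", 0: "happy"}

# Invented predicted probabilities for three new emails
probabilities = [0.08, 0.5, 0.73]

for p in probabilities:
    print(p, "->", LABEL_NAMES[classify(p)])
# 0.08 -> happy
# 0.5 -> upset
# 0.73 -> upset
```

An email sitting exactly at 0.5 counts as upset here, matching the "50 percent or more" rule. Lowering the threshold would make the alert more cautious, at the cost of more false alarms.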