Should I call my mom?: Machine Learning using NLP for non-nerds

My mom really likes to write. She writes emails that would earn her valedictorian status in a college English department. Now I know this makes me seem like a bad son, but there are times when I would like an indicator on her emails that tells me how urgently I need to respond. When there is great news, I can respond over the weekend with a phone call, but if something is upsetting her, I need a red flag to pop up and tell me to call right away, or else my father is going to start texting me, "Dude, call your mom." (To be clear, see image below: I think everyone should just call their mom when asking this question. But that does not make for a convincing data science blog.)

[Image: "Call your mom" meme]

Well, natural language processing (NLP) offers a solution. In fact, NLP is at the forefront of data science for its power on unstructured (text) data. Once-arduous tasks like combing through piles of PDFs and reading long emails are being replaced by techniques that automate this time-consuming work. We have spam detectors that filter out suspicious emails based on text cues, predicting with astonishing accuracy which emails are spam and which are notes from grandma. But what the data science community still lacks is the "Mom Alert."


To understand the complexities that NLP has to offer, let’s break down the “Mom Alert.” First, we need a corpus (collection) of past emails that my mom has sent me. These need to be labeled by hand as “upset” mom and “happy” mom. Once I have created labels for my mom’s historical emails, I can take those emails and break them down into a format that the computer algorithm can understand.

But first, I need to separate these emails into two sets of data: one whose labels (upset or happy) the model learns from, and one that is held back so I can check the model's predictions against labels it never saw. The first set will be called the training emails and the held-back set will be called the testing emails. This split is important in the NLP process because it helps me build a model that generalizes, that is, one that predicts well on future emails.

[Image: Training and Testing Data]
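For the code-curious, here is a minimal sketch of that split using scikit-learn. The emails and labels below are invented stand-ins for mom's real inbox (1 = upset, 0 = happy), and the held-out test emails keep their labels so we can score the model later.

```python
from sklearn.model_selection import train_test_split

# Hypothetical corpus: a few of mom's emails, labeled by hand (1 = upset, 0 = happy)
emails = [
    "So proud of you, call when you can",
    "Why haven't you called your grandmother back?",
    "The garden is blooming, have a great weekend",
    "We need to talk about Thanksgiving. Today.",
]
labels = [0, 1, 0, 1]

# Hold out 25% of the labeled emails as a test set so we can check
# how well the model generalizes to emails it has never seen.
train_emails, test_emails, train_labels, test_labels = train_test_split(
    emails, labels, test_size=0.25, random_state=42
)

print(len(train_emails), "training emails;", len(test_emails), "testing emails")
```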

It is important to note that I will want to pre-process my data. That is, I want to make sure all the words are lowercase, because I don't want my model to treat capitalized and lowercase versions of the same word as different words. I will then remove words that occur frequently in English but offer little value. These are called "stop words" and usually carry little semantic weight: "and", "but", "the", "a", and "an" give little meaning to the purpose of a sentence and can all go. I will also remove punctuation, because it is not going to give my model useful information. Finally, I will take any word that is plural and bring it down to its singular form, turning shoes into shoe and cars into car. The point of all this pre-processing is to reduce the number of words my model needs to handle.
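Here is a rough, hand-rolled illustration of that pre-processing in plain Python. The stop-word list and the plural-stripping rule are deliberately tiny stand-ins; a real project would lean on a library like NLTK or spaCy.

```python
import string

# A deliberately tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"and", "but", "the", "a", "an", "is", "to", "of"}

def preprocess(text):
    # 1. Lowercase so "Call" and "call" are treated as the same word.
    text = text.lower()
    # 2. Strip punctuation, which carries little signal for this model.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = []
    for word in text.split():
        # 3. Drop stop words that add little semantic value.
        if word in STOP_WORDS:
            continue
        # 4. Crude singularization: "shoes" -> "shoe", "cars" -> "car".
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]
        tokens.append(word)
    return tokens

print(preprocess("The shoes AND the cars are in the garage!"))
# ['shoe', 'car', 'are', 'in', 'garage']
```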

Now that I have separated the emails into training and testing sets and pre-processed the words, I need to put those emails into a format that my computer algorithm can understand: numbers. This is the "vectorization" process, where we create a document-term matrix. The matrix is simply a count of how many times each word occurred in each document (each of mom's emails). The matrix is then used to compare documents against one another. The reason we pre-processed the words is that, without it, vectorization would produce a massive, extremely clumsy matrix.

[Image: Document-Term Matrix (aka, Vectorization)]

As you can see in the image above, each email gets its own row in the document-term matrix. The matrix then takes each unique word in ALL of the training emails and places it in its own column. These columns are called features, and they can make the matrix incredibly wide. Think of every unique word across thousands of emails, that is one wide matrix! Which is the reason we did the pre-processing in the first place: we want to reduce the number of columns, or features, a process known as dimensionality reduction. In other words, we are reducing the amount of numbers our algorithm needs to digest.
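In practice, we rarely build this matrix by hand. Here is a minimal sketch using scikit-learn's CountVectorizer, which folds in the lowercasing and stop-word removal described above (the email snippets are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training emails (one "document" per email)
train_emails = [
    "So proud of you, call when you can",
    "Why haven't you called your grandmother back?",
    "We need to talk about Thanksgiving today",
]

# CountVectorizer lowercases the text, ignores punctuation during tokenization,
# drops English stop words, then counts how often each remaining word appears
# in each email.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
doc_term_matrix = vectorizer.fit_transform(train_emails)

print(vectorizer.get_feature_names_out())   # the columns (features): unique words
print(doc_term_matrix.toarray())            # one row per email, one column per word
```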

[Image: Wrong Matrix]

Now that I have my emails represented as a matrix, I can create an algorithm to take those numerical representations and convert them into a prediction. Recall from a previous post that we introduced Bayes' Theorem (see Translating Nerd's post on Bayes). Well, we can use Bayes' Theorem to create a predicted probability that my mom is upset. We will call upset = 1 and happy = 0.

Side Note: I know this seems pessimistic, but my outcome variable is going to be the probability that she is upset, and that is why we need to constrain our algorithm between zero and one. Full disclosure, my mom is wonderful and I prefer her happy. Again, see images under the text.


Now, there are other algorithms that could be used, such as logistic regression, support vector machines, and even neural networks, but let's keep this simple. In fact, the first email spam detectors used Naïve Bayes because it works so darn well with large numbers of features (words, in our case). But what does "naïve" mean, you may ask? The model makes the naïve assumption that the words are independent of, that is, unrelated to, one another. We know this cannot be true, because that is how we create sentences: words take on meaning when coupled with each other. Of course, every algorithm has drawbacks, but Naïve Bayes proves to be quite accurate with large numbers of features (i.e., words).

Once we have trained our Naïve Bayes model on the document-term matrix, we can make a prediction for each email in the test set. The test set acts as a validation of the training, which allows us to make changes to our model and get as close as possible to a generalized model for spotting an upset email from mom. A key point of machine learning is to create a model that doesn't just memorize our training data; fitting the training data too closely is called overfitting, and it should be avoided because an overfit model won't generalize to new data. Of course, there is a trade-off with underfitting that also needs to be balanced, but again, I digress (see image below).
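Putting the pieces together, a minimal sketch of training Naïve Bayes on the document-term matrix and checking it against held-out test emails might look like this (the handful of emails here are invented, and a real corpus would be far larger):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled emails: 1 = upset mom, 0 = happy mom
train_emails = [
    "So proud of you, call when you can",
    "Why haven't you called your grandmother back?",
    "The garden is blooming, have a great weekend",
    "We need to talk about Thanksgiving. Today.",
]
train_labels = [0, 1, 0, 1]

test_emails = [
    "So proud of you, have a great weekend",
    "We need to talk about your grandmother today",
]
test_labels = [0, 1]

# Vectorize: fit the vocabulary on the training emails only,
# then reuse that same vocabulary to transform the test emails.
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_emails)
X_test = vectorizer.transform(test_emails)

# Train Naive Bayes and score it on the held-out test set.
model = MultinomialNB()
model.fit(X_train, train_labels)
print("Test accuracy:", model.score(X_test, test_labels))
```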

[Image: Machine learning basics for future posts]

Once we have tuned our Naïve Bayes algorithm to both fit the training emails and generalize well to future emails in the test set, we are ready to try it out on a new email from mom. When mom sends us a new email, our algorithm will output a predicted probability. Let's say that any email with a predicted probability of 50 percent or more (0.5) will be called upset (1), and any email with a predicted probability under 50 percent will be called happy (0).
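Continuing the sketch from above, scoring a brand-new email and applying that 50 percent cutoff could look like this (the emails are again invented; the model is refit here so the snippet runs on its own):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Refit the toy model from the previous sketch (invented data; 1 = upset, 0 = happy).
train_emails = [
    "So proud of you, call when you can",
    "Why haven't you called your grandmother back?",
    "The garden is blooming, have a great weekend",
    "We need to talk about Thanksgiving. Today.",
]
train_labels = [0, 1, 0, 1]

vectorizer = CountVectorizer(stop_words="english")
model = MultinomialNB().fit(vectorizer.fit_transform(train_emails), train_labels)

# A brand-new email from mom, with no label.
new_email = ["Why haven't you called? We need to talk."]
prob_upset = model.predict_proba(vectorizer.transform(new_email))[0, 1]

# Apply the 50 percent decision threshold.
label = "upset (call tonight!)" if prob_upset >= 0.5 else "happy (call this weekend)"
print(f"P(upset) = {prob_upset:.2f} -> {label}")
```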

[Image: Naive Bayes predictions for four new emails]

To see this simplistic model in action, we can look at the new emails above, which were run (without labels) through our Naive Bayes algorithm. Measured against our 0.5 threshold, the predicted probabilities yield classifications of happy, upset, upset, and happy. It looks like I have some calls to make this evening!

 

 

Image sources:

https://datascienceomar.wordpress.com/2016/07/04/classification-with-scikit-learn/
http://www.cs.nthu.edu.tw/~shwu/courses/ml/labs/08_CV_Ensembling/08_CV_Ensembling.html
https://www.includehelp.com/ml-ai/data-splitting.aspx
https://lespetitesgourmettes.com/recipes/friday-funnies-8/attachment/call-your-mom/
https://logobaker.ru/logo/3407-reazon.html
https://kultura.zpravy.idnes.cz/filmove-novinky-zari-2017-0sa-/filmvideo.aspx?c=A170912_120514_filmvideo_ts

 

 

 

 


World Bank to Front End Developer: Coffee with Andres Meneses

A couple months ago, Andres Meneses and I sat down at a busy Adams Morgan cafe in the heart of Washington, DC to discuss his success as a frontend developer. A remarkable part of his story is how he navigated from a successful long-term position at The World Bank through a coding boot camp to gain a foothold in the web development world. His background is one of the many success stories popping up of mid-career individuals making the jump into the world of data science and technology development.

[Photo: Andrés Meneses]

 

Bio: Andrés Meneses is the proud owner of the happiest dog in Washington, DC and a passionate pro-butter advocate. He is also a web developer, committed first and foremost to optimizing user experience. He involves users from the outset of all projects because, as he likes to put it, "There is nothing worse than working hard on a digital product that no one ever uses!" By leveraging his combined expertise in product and project management and digital communications, Andrés approaches his work by thinking broadly. Why and what does this organization have to share, and how will that information engage, inform, surprise, and help the intended audience? And how will the information best deliver tangible outcomes? Throughout his time working in all types of organizations, most notably his more than 10 years at the World Bank as a technical project manager, he never stopped learning, and he made the leap a few years ago into full-time web development. Just like his dog, he has never been happier.

 

Three websites Andrés uses to keep current:

Contact information:

 


Interview with General Assembly Data Scientist: Dr. Farshad Nasiri

Dr. Farshad Nasiri is the local instructor lead for the Data Science Immersive (DSI) program at General Assembly in Washington, DC. He received his B.S. from Sharif University of Technology in Tehran and his Ph.D. from George Washington University in mechanical engineering, where he applied machine learning tools to predict air bubble generation on ship hulls. Prior to joining General Assembly, he worked as a computational fluid dynamics engineer and a graduate research assistant. As the DSI instructor, he delivers lectures on the full spectrum of data science-related subjects.

[Photo: Farshad Nasiri]

Farshad is interested in high-performance computing and the implementation of machine learning algorithms in low-level, highly scalable programming languages such as Fortran. He is also interested in data science in medicine, specifically preventive care through data collected by wearable devices.

Favorite Data Books:

Favorite websites:

Contact info:

https://www.linkedin.com/in/farshad-nasiri/


Machine Learning Post-GDPR: My conversation with a Spotify chatbot

I have a dilemma. I love Spotify and find their recommendation engine and machine learning algorithms spot on (no pun intended). I think their suggestions are fairly accurate to my listening preferences. As well they should be: I let them take my data.


Spotify has been harvesting my personal likes and dislikes for the past 6 years. They have been collecting metadata on what music I listen to, how long I listen to it, what songs I skip and what songs I add to my playlists. I feel like I have a good relationship with Spotify. Then came the emails. Service Term Agreements, Privacy Changes, Consent Forms. Something about GDPR.

Which brings me to the topic of discussion: what is GDPR, and how will services that rely on our metadata to power algorithms and machine learning predictions function in a post-GDPR world? First, let's back-track for a second. Over the past week, most people's inboxes have looked like this…

[Image: an inbox full of privacy-policy update emails]

Or you feel like you have won an all-expenses-paid "phishing" trip for your Gmail account.


Nope, your email hasn't been hacked, but something monumental has occurred in the data world that has been years in the making. Two years ago, the European Union Parliament passed a law adding strict, user-friendly protections to an already stringent set of EU data regulations. GDPR stands for the General Data Protection Regulation, and it went into force today in the member countries of the European Union. There are a number of things that GDPR seeks to accomplish, but the overarching aim of the regulation is to streamline how a user of a good or service gives consent to an entity that uses their private data.

Now I hear you saying, “I thought that GDPR was just an EU regulation, how does it affect me as an American? I don’t like this European way of things and this personal rights stuff.”

Well, you are correct, it went into effect May 25, 2018 (today, or in the past if you are reading this later), but any organization that has European customers will need to abide by it. Since most organizations have European customers, everyone is jumping in, updating their privacy policies and begging for your consent in your inbox.

"Who cares? Aren't companies going to use my data anyway without me knowing?" Well, not anymore. Not only does GDPR allow individuals and groups to bring suit against companies that are in breach of it, but those companies can be fined 4 percent of their annual global revenue. Yes, revenue, not profit, and global revenue at that. Already today, complaints have been filed against Facebook and Google seeking fines of 3.9 billion and 3.7 billion euros, respectively. Source

In his book Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World, Bruce Schneier speaks to the deluge of data that has congested society's machines. This data exhaust, as Schneier refers to it, holds a plethora of extractable information, known as metadata. Metadata, while not linked to the specific conversations you have on your phone or the text contents of your emails, is collected by the organizations we use in our daily lives. Facebook, Amazon, Google, Yahoo (yes, some of us still use Yahoo email), and Apple all rely on gigantic server farms to maintain petabytes of saved metadata.


Schneier says the information age is built on data exhaust, and he is correct. Modern algorithms that power recommendation engines for Amazon, the phone towers that collect cell phone signals to show your location, the GPS inside your iPhone, your Netflix account, all rely on metadata and the effective storage of, well, a boatload of data, for lack of a better term.

Over the course of the past couple decades, we have been consenting to privacy agreements that allow these companies, and companies like them, to use our data for purposes we are not aware of. That is the reason for GDPR. This is where consent comes into play.

According to the New York Times, GDPR rests on two main principles. “The first is that companies need your consent to collect data. The second is that you should be required to share only data that is necessary to make their services work.” Source

Broken down, here is Translating Nerd’s list of important highlights of GDPR:

1. You must give your consent for a company to use your data
2. Companies must explain how your data will be used
3. You can request your data be deleted
4. If there is a data breach at a company, they have 72 hours to notify a public data regulatory agency
5. You can request access to the personal data that is kept
6. You can object to any processing of your personal data in any situation that arises
7. Simply put, consent must be easy to give and easy to take away

So after receiving Spotify's updated privacy policy terms in my inbox, I decided to contact a representative of my favorite music streaming company (her name is Joyce). Namely, I was curious: if I declined to give my consent for Spotify to use my data, would their recommendation engine lose the data from my personal account that powers those terrific suggestions, which all seem to begin with Nine Inch Nails and regress slowly toward somewhere near Adele? Regression to the mean, perhaps? A conversation for another post, or perhaps a comment on society's love of Adele.

The conversation I had went like this:

 

 

As with most chatbots, sorry Joyce, I was referred to a standard account terms source. So I needed to do some more digging. I found that, as provided for under GDPR, I could deactivate certain features, such as whether data from Facebook is kept on me or whether third-party sites are used to help target ads to me.

[Images: Spotify privacy settings]

But this isn't enough. What I am really curious about is whether a machine learning algorithm like Spotify's recommendation engine can work without access to my data. And then I saw it, all the way at the bottom of their user terms in this link. A clear and definitive answer: no. To delete my personal data, I would need to close my account, which means no Spotify and no recommendations for Adele.


Simply put, algorithms run on data. No data, no prediction. No prediction, no recommendation engine to suggest Adele. My intuition proved correct: data powers the machine learning algorithms we build. This should be obvious. Just as you need fuel to power a car, you need data from a user to run an algorithm. What GDPR ensures is that the data you are giving is used SOLELY for the service you requested. But it remains unclear how this is differentiated in the user agreement.

And this is exactly where GDPR places us as a society. How much data are we willing to give up to receive a service? What if the service is enhanced with more personal data? We rely on recommendation engines to tell us what to buy from Amazon, to inform us which Netflix shows we would like, and to help match us on dating websites like OKCupid and Match.com, but receiving the full power of the predictive algorithms that drive these products, and our craving for smarter services, will require the full consent of the user.


Interview with Johns Hopkins University Data Scientist: Dr. Paul Nicholas

A couple weeks ago I sat down with Dr. Paul Nicholas, data scientist at Johns Hopkins University, to discuss his unique path in operations research. Dr. Nicholas brings a wealth of knowledge, from analyzing survey data of local populations in Afghanistan while deployed by the US military to working as a fellow at MIT. Our conversation moved from essential tools in a data scientist's toolkit to briefing strategies for helping clients understand complex solutions. Dr. Nicholas tells interesting stories of what it is like to be on the front lines of operations research and data science.

 
[Photo: Dr. Paul Nicholas]

 

Bio: Dr. Paul J. Nicholas is an operations research scientist at the Johns Hopkins University Applied Physics Laboratory (JHU APL). He also teaches a graduate course on analytics and decision analysis as part of GMU’s Data Analytics Engineering (DAEN) program, and serves in the U.S. Marine Corps Reserve. Paul’s research focuses on the use of advanced analytic techniques to examine problems in national security, including large-scale combinatorial optimization, simulation, and data analytics. Paul received his B.S. from the U.S. Naval Academy, his M.S. from the Naval Postgraduate School, and his Ph.D. from George Mason University. On active duty, he deployed several times to Iraq and Afghanistan in support of combat operations as a communications officer and operations research analyst. He was a visiting fellow to the Johns Hopkins University Applied Physics Lab, and an MIT Seminar XXI Fellow. He is the author of numerous peer-reviewed publications and holds two U.S. patents.

Books that Dr. Nicholas recommends:

Three websites Dr. Nicholas uses to keep current:

Contact information:

http://mason.gmu.edu/~pnichol6/

SEOR Department
Engineering Building, Rm 2100, MS 4A6
George Mason University
Fairfax VA 22030
Email: pnichol6 [at] gmu [dot] edu


Interview with Excella Consulting Data Scientist and ML Expert: Patrick Smith

A couple weeks ago, I had the opportunity to sit down in the studio with Patrick Smith. Patrick brings Natural Language Processing (NLP) expertise as it relates to deep learning and neural networks. His work in the financial quant area before joining Excella Consulting offers incredible insight into the mathematical rigor that is crucial for a data scientist to master. Listen as Patrick offers his take on what is current and what is on the horizon in data science consulting!

[Photo: Patrick Smith]

Bio: Patrick Smith is the data science lead at Excella in Arlington, Virginia, where he developed the data science practice. Previously, he both led and helped create the data science program at General Assembly in Washington, DC, and was a data scientist with Booz Allen Hamilton's Strategic Innovations Group. Prior to data science, Patrick worked in risk and quantitative securities analysis and institutional portfolio management. He has a B.A. in Economics from The George Washington University and has done master's work at Harvard and Stanford Universities in AI and computer science.

Patrick is passionate about AI and deep learning and has contributed to significant research and development projects as part of Excella's artificial intelligence research effort. He architected Excella's DALE intelligent assistant solution, which provides state-of-the-art results on question answering tasks, and he is the author of an upcoming book on artificial intelligence.

 

Favorite Data Books:
Sites that Keep Patrick Current:
How to Contact Patrick:

Searching for Lost Nuclear Bombs: Bayes’ Theorem in Action

 


What do searching for lost nuclear weapons, hunting German U-boats, breaking Nazi codes, scouring the high seas for Soviet submarines, and searching for airliners that go missing over the ocean have in common? No, not a diabolical death wish, but a 200-year-old algorithm, deceivingly simple yet computationally robust. That algorithm, and its myriad offshoots, is Bayes' Theorem. The book The Theory That Would Not Die, by Sharon Bertsch McGrayne, opens the reader to the theorem's 200-year controversial history, beginning in the late 1700s with Reverend Thomas Bayes' discovery, through its application by Pierre-Simon Laplace, to its contentious struggle against the frequentist approach to statistics in the early 1900s.


It wasn't until the dawn of the computer age that Bayes made an impact on predictive analytics. The narrative McGrayne portrays is one of the misunderstood analyst, desperately clinging to a belief that offers clearer predictive power, trying to convey a simple algorithm that provides a more powerful means of testing phenomena.

What is Bayes?

P(A|B) = P(B|A) × P(A) / P(B)

A: Hypothesis

B: Data

The idea is simple: we learn new information each day. In essence, we update the knowledge we already have, drawing on our past experiences. With each new day that passes, we update our prior beliefs, and we assign probabilities to future events based on those prior beliefs. This prior belief system is at the core of Bayes' theorem. Simply put, Bayes is a way of updating our beliefs with new information to arrive at a more exact, probability-based prediction. Another way of looking at the rule is below, from www.psychologyinaction.org.

[Image: Bayes' rule illustration from psychologyinaction.org]

 

What does Bayes look like in action? 

Caveat: This is an extremely simplified version of the model used and by no means represents the sheer volume of calculations carried out by highly skilled statisticians.

A popular case documented by McGrayne in The Theory That Would Not Die is an incident that happened in the 1960s. The US military loses things, more often than we would like to think; several nuclear bombs have been accidentally dropped or lost in transit. While not armed, these bombs are a national security threat that most presidents want to get hold of as quickly as possible. One specific example occurred in 1966, when a B-52G bomber collided mid-air with a KC-135 tanker during refueling over the Mediterranean Sea. Four nuclear warheads fell from the aircraft. Three were found on land, while the fourth lay in the waters off the Spanish coast.


The Navy employed probabilistic Bayes experts to find the missing warhead. Specifically, they used Bayes' Rule to find the probability of the warhead being in a given area given a positive signal from their sonar devices. Since going 2,550 feet below the ocean surface in a submersible is expensive and dangerous, the Navy wanted to ensure it was making the trip with a purpose, and not to find a cylindrical rock formation.

The Navy searched for 80 days without results. However, Bayes tells us that an inconclusive result is a constructive result nonetheless. As McGrayne states,  “Once a large search area was divided into small cells, Bayes’ rule said that the failure to find something in one cell enhances the probability of finding it in the others. Bayes described in mathematical terms an everyday hunt for a missing sock: an exhaustive but fruitless search of the bedroom and a cursory look in the bath would suggest the sock is more likely to be found in the laundry. Thus, Bayes could provide useful information even if the search was unsuccessful.” (McGrayne, 189)

The simplistic model for this search in a single square would be…

P(A|B) = P(B|A) × P(A) / P(B)

  • P(A|B): Probability the square contains the warhead given a positive signal from Navy instruments
  • P(A): Probability the square contains the warhead
  • P(B|A): Probability of a positive signal from the instrument given the warhead is present
  • P(B): Probability of a positive signal from instrument
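To make the arithmetic concrete, here is a quick sketch with made-up numbers; the prior, the sonar's detection rate, and the false-alarm rate below are illustrative guesses, not the values the Navy's analysts actually used.

```python
# Hypothetical numbers for a single search cell
p_warhead = 0.10               # P(A): prior probability the cell contains the warhead
p_signal_given_warhead = 0.90  # P(B|A): chance the sonar pings if the warhead is there
p_signal_given_empty = 0.05    # P(B|not A): false-alarm rate on an empty cell

# P(B): total probability of a positive signal from this cell
p_signal = (p_signal_given_warhead * p_warhead
            + p_signal_given_empty * (1 - p_warhead))

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_warhead_given_signal = p_signal_given_warhead * p_warhead / p_signal
print(f"P(warhead | positive signal) = {p_warhead_given_signal:.2f}")  # ~0.67
```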

We would then combine these cells to build a more general picture of the search area. That picture would be updated as new data flowed in, with each new prior and posterior hypothesis stemming directly from the new data. Remember, this is one component of a much larger model, but it should help you get the picture. A more robust model would require similar calculations for the last known position of the aircraft when it went down, the ocean currents, the climate over the preceding months, and so on. Computationally, this can get very heavy very quickly, but the underlying principle remains the same: base a hypothesis on prior knowledge and update it to form a posterior hypothesis.
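The "missing sock" logic, where a fruitless search of one cell pushes probability into the others, can be sketched in a few lines of Python. The cell priors and detection probability here are invented for illustration.

```python
# Hypothetical priors over four search cells (they sum to 1).
priors = {"cell_A": 0.40, "cell_B": 0.30, "cell_C": 0.20, "cell_D": 0.10}
p_detect = 0.80  # chance the sonar finds the warhead if we search the right cell

def update_after_empty_search(priors, searched_cell, p_detect):
    """Bayes update after searching one cell and finding nothing."""
    # P(no detection) = 1 - P(warhead in searched cell) * P(we would have spotted it)
    p_no_detection = 1 - priors[searched_cell] * p_detect
    posterior = {}
    for cell, prior in priors.items():
        if cell == searched_cell:
            # The searched cell keeps only the chance that we looked and missed it.
            posterior[cell] = prior * (1 - p_detect) / p_no_detection
        else:
            # Every unsearched cell's probability goes up.
            posterior[cell] = prior / p_no_detection
    return posterior

print(update_after_empty_search(priors, "cell_A", p_detect))
# cell_A drops from 0.40 to about 0.12; the other cells rise accordingly.
```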

As you can see, as Navy vessels complete searches in one area, the Bayesian model is updated. The images below are from a study of the search for Air France Flight 447 in the Atlantic Ocean, but the story is the same: we create a hypothesis, we test it, we gain new data, and we use that data to update our hypothesis. If you are curious about how Bayes was used in the successful recovery of Air France Flight 447, I highly recommend the 2015 Metron slides.

[Image: Prior hypothesis]

 

[Image: Posterior hypothesis after initial search]

Similar methodologies in large-scale search endeavors have been well documented; these include the ongoing search for the missing Malaysia Airlines Flight 370. Below are links to interesting use cases of Bayes in action.

Bayesian Methods in the Search for MH370

How Statisticians Found Air France Flight 447 Two Years After It Crashed Into Atlantic

Missing Flight Found Using Bayes’ Theorem?

Operations Analysis During the Underwater Search for Scorpion

Can A 250-Year-Old Mathematical Theorem Find A Missing Plane?

And yes, they found the bomb!


 

Background reference:

The Theory That Would Not Die, Sharon Bertsch McGrayne

https://en.wikipedia.org/wiki/1966_Palomares_B-52_crash

 

Image references:

https://sinews.siam.org/Details-Page/Bayesian-Search-for-Missing-Aircraft-Ships-and-People

https://www.motherjones.com/politics/2015/08/nuclear-weapon-obama-most-expensive-ever/

https://www.military.com/daily-news/2017/10/17/surge-wake-ship-collisions-tests-navy-new-deployment-plan.html

https://www.rogerebert.com/reviews/the-hunt-for-red-october-1990

http://thediplomatinspain.com/ecologistas-en-accion-lleva-al-supremo-la-radiactividad-de-palomares/

https://www.psychologyinaction.org/psychology-in-action-1/2012/10/22/bayes-rule-and-bomb-threats

https://navaltoday.com/#newsitem-123777

 

 

 


Interview with Radiant Solutions Data Scientist: Josh Lipsmeyer

Recently, Translating Nerd interviewed Josh Lipsmeyer about the work he does as a software engineer/data scientist. Josh has an interesting background: he leverages mathematics and physics to solve some pretty tough data-related problems. The conversation with this Arkansas native was lively as we dug deep into how he simplifies the data world for us. There are a few shenanigans in this podcast, since Josh and I go way back.

[Photo: Josh Lipsmeyer]

 

Bio: Josh Lipsmeyer is a Software Engineer/Data Scientist at Radiant Solutions, where he specializes in algorithm development, data science, and application deployment, primarily for customers within the Department of Defense. His primary interest is the study of complex network resilience, information diffusion, and evolution. He has a background in mathematics and physics, holding an M.S. in mathematics from the University of Tennessee. He currently lives in Centreville, VA with his wife and 1-year-old son.

Toolkit: Critical thinking, Python, Spark, Java, Microsoft SQL Server, QGIS, STK, Latex, Tableau, Excel

Josh’s favorite online resources:

  • Udemy
  • Stack Overflow
  • ArXiv.org
  • Coursera
  • Open Source University Courses

Influential books about data:

Understanding Machine Learning: From Theory to Algorithms

How to get ahold of Josh:

https://www.linkedin.com/in/josh-lipsmeyer-769884b4/


Book Summary: Turning Text into Gold


 

Key Takeaways

Turning Text into Gold: Taxonomies and Textual Analytics, by Bill Inmon, covers a plethora of text analytics and Natural Language Processing foundations. Inmon makes it abundantly clear in the first chapter of his book that organizations are underutilizing their data. He states that 98 percent of corporate decisions are based on only 20 percent of available data. That 20 percent, labeled structured data because it fits neatly into matrices, spreadsheets, and relational databases and is easily ingested into machine learning models, is well understood. However, unstructured data, the text and words that our world generates on a daily basis, is seldom used. Like the alchemists of the Middle Ages who searched for a method to turn ordinary metals into gold, Inmon describes a process to turn unstructured data into decisions: turning text into gold.

What the heck is a taxonomy?

Taxonomies are the dictionaries we use to tie the words in a document, book, or corpus of materials to a business-related understanding. For example, if I were a car manufacturer, I would have a taxonomy of various car-related concepts so I could identify those concepts in the text. We then start to see repetition and patterns in the text. We might begin to see new words in the text that relate to car manufacturing, and we can add those terms to our taxonomy. While the original taxonomy might capture 70 percent of the car-related words in a document, 90 percent is usually a business-appropriate level at which to move from taxonomy/ontology building to database migration.
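A toy sketch of that matching step is below; the taxonomy is a tiny, made-up slice of what a real car manufacturer's taxonomy would contain.

```python
# A tiny, hypothetical slice of a car manufacturer's taxonomy.
car_taxonomy = {"engine", "brake", "transmission", "chassis", "warranty", "recall"}

document = ("The customer reported a grinding noise from the brake assembly; "
            "the dealer inspected the transmission and found no defect, "
            "but recommended a warranty claim.")

# Normalize the raw text and flag every word that appears in the taxonomy.
words = document.lower().replace(";", " ").replace(",", " ").replace(".", " ").split()
matches = [w for w in words if w in car_taxonomy]

print(matches)  # ['brake', 'transmission', 'warranty']
```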

Now what?

Once we have the necessary inputs from our long list of taxonomies, textual disambiguation begins: the raw text from our document is compared against the taxonomy we have created. If there is a match, that text is pulled from the document and stored in a processing stage, where we look for more distinct patterns in the newly extracted text. Using regular expressions, a pattern-matching technique in code, we can pick out those patterns and move the raw text into a matrix, or what many people are familiar with as a spreadsheet. Transferring text into a matrix means converting words to numbers, and the resulting matrix can be rather large. While there are specific choices to be made (e.g., sparse versus dense matrix representations), the goal is the same: make the text machine-readable. Words become zeros and ones, and analytical models can now be applied to the document. Machine learning algorithms, such as offshoots of Bayes' Theorem and other classification techniques, can then be used to categorize and cluster the text.
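As a rough sketch of what that regular-expression step can look like (the claim text, patterns, and field names here are invented for illustration, not Inmon's actual method):

```python
import re

# A made-up warranty claim, as raw text pulled from a document.
claim = "Claim #4821 filed 03/14/2018: brake assembly replaced under warranty, cost $412.50."

# Regular expressions pull out the distinct patterns we want as spreadsheet columns.
row = {
    "claim_id": re.search(r"Claim #(\d+)", claim).group(1),
    "date": re.search(r"filed (\d{2}/\d{2}/\d{4})", claim).group(1),
    "part": re.search(r":\s*([a-z ]+) replaced", claim).group(1),
    "cost": float(re.search(r"\$(\d+\.\d{2})", claim).group(1)),
}

print(row)
# {'claim_id': '4821', 'date': '03/14/2018', 'part': 'brake assembly', 'cost': 412.5}
```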

A simple example

Imagine you go to the ER one day and a report is generated when you are discharged. This record holds many important elements of your medical history. However, having someone manually extract your name, address, medications, condition, treating doctor's information, health vitals, and so on would take a lot of time, more time than a swamped hospital staff on a limited budget can spare. Text analytics is used to pull all this information into a spreadsheet that can then be loaded into the hospital's database. Add up enough of these records and you can start looking for patterns.


  1. Your visit to the ER is documented as text
  2. The hospital takes a pre-defined “dictionary”, or taxonomy, of medical-related terms
  3. The taxonomy is compared against your medical evaluation and processed into a spreadsheet/matrix.
  4. The spreadsheet is uploaded into a relational database the hospital maintains
  5. An analyst queries the database to build a machine learning model that produces value-added predictions.
  6. Based on the model's output, a value is produced and a decision is made.

 

Image sources:

www.medicalexpo.com
http://openres.ersjournals.com/content/2/1/00077-2015
https://www.sharesight.com/blog/ode-to-the-spreadsheet/

A conversation with RAND economist (and data scientist), Dr. Ben Miller

Last week I had the opportunity to drop by RAND to interview Dr. Ben Miller. Ben is an economist who specializes in econometric modeling and is at the leading edge of applied data science within the think tank world. Ben offers many insights into causal inference, discovering patterns, and uncovering signals in cloudy data. He explains the tools he uses, the methods he calls upon, and the manner in which he guides RAND's "sponsors" to make informed decisions as they relate to national priorities.

As Ben says, “My toolkit is ‘applied econometrics,’ aka using data to estimate causal relationships.  So think of techniques like instrumental variables, differences-in-differences, regression discontinuities, etc.  Overall, it’s about putting a quantitative estimate on a relation between two variables in a way that is (hopefully!) unbiased, and at the same time understanding the uncertainty associated with that estimate.  A really approachable and widely respected introduction to that toolkit is Angrist & Pischke’s ‘Mostly Harmless Econometrics.'”

[Photo: Dr. Ben Miller]

 

Referenced works from the audio:

The Dog that Didn’t Bark: Item Non-response Effects on Earnings and Health

Models for Censored and Truncated Data

 

Bio: Ben Miller is an Associate Economist at the RAND Corporation and a Professor at the Pardee RAND Graduate School. His research spans a wide variety of topics including disaster mitigation, infrastructure finance, energy resilience, geospatial information, insurance, tax policy, regulatory affairs, health care supply, agriculture, econometrics, and beyond.  Recent publications examine the link between flood insurance prices and housing affordability, review federal policies surrounding transportation and water infrastructure finance, estimate the causal impact of weather warning systems on fatalities and injuries from tornadoes, and overview econometric techniques for determining the value of geospatial information. Prior to joining RAND, Miller worked as a statistician supporting the U.S. Census Bureau’s Survey of Income and Program Participation. He holds a Ph.D. in economics from the University of California, San Diego and a B.S. in economics from Purdue University.

LinkedIn: https://www.linkedin.com/in/mille419/

RAND: https://www.rand.org/about/people/m/miller_benjamin_m.html
