Turning Text into Gold: Taxonomies and Textual Analytics, by Bill Inmon, covers a wide range of text analytics and natural language processing foundations. Inmon makes it abundantly clear in the first chapter of his book that organizations are underutilizing their data. He states that 98 percent of corporate decisions are based on only 20 percent of available data. This data, labeled structured data because it fits neatly into matrices, spreadsheets, and relational databases and is easily ingested into machine learning models, is well understood. However, unstructured data, the text and words that our world generates on a daily basis, is seldom used. Similar to the alchemists of the Middle Ages who searched for a method to turn ordinary metals into gold, Inmon describes a process to turn unstructured data into decisions: turning text into gold.
What the heck is a taxonomy?
Taxonomies are the dictionaries that we use to tie the words in a document, book, or corpus of materials to a business-related understanding. For example, if I were a car manufacturer, I would have a taxonomy of various car-related concepts so I could identify those concepts in the text. We then start to see repetition and patterns in the text. We might also begin to see new words that relate to car manufacturing but are not yet in the taxonomy, and we can add these terms to it. While the original taxonomy might capture 70 percent of the car-related words in the document, 90 percent is usually a business-appropriate level at which to move on from building the taxonomy/ontology to migrating the text into a database.
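To make the idea concrete, here is a minimal sketch of a taxonomy lookup in Python. It is not Inmon's implementation: the taxonomy entries, the `tag_concepts` and `coverage` helpers, and the sample sentence are all made up for illustration, and "coverage" is assumed here to mean the share of reviewer-identified domain terms that the taxonomy already contains.

```python
import re

# Hypothetical taxonomy: surface term -> business concept (illustrative only).
TAXONOMY = {
    "engine": "powertrain",
    "transmission": "powertrain",
    "brake": "safety",
    "airbag": "safety",
    "sedan": "body_style",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tag_concepts(text, taxonomy):
    """Return (token, concept) pairs for every token found in the taxonomy."""
    return [(tok, taxonomy[tok]) for tok in tokenize(text) if tok in taxonomy]

def coverage(domain_terms, taxonomy):
    """Fraction of known domain terms that the taxonomy already contains."""
    domain_terms = set(domain_terms)
    if not domain_terms:
        return 0.0
    return len(domain_terms & set(taxonomy)) / len(domain_terms)

text = "The sedan has a new engine, but the brake and the turbocharger failed."
tags = tag_concepts(text, TAXONOMY)

# Assume a reviewer flags these as the car-related words in the sentence;
# "turbocharger" is car-related but missing from the taxonomy.
domain_terms = ["sedan", "engine", "brake", "turbocharger"]
before = coverage(domain_terms, TAXONOMY)

# Add the newly discovered term and measure coverage again.
TAXONOMY["turbocharger"] = "powertrain"
after = coverage(domain_terms, TAXONOMY)

print(tags)
print(before, after)
```

Each pass over new text can surface more missing terms, so in practice this loop would repeat until coverage reaches whatever level the business considers good enough.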
Once we have the necessary inputs from our long list of taxonomies, textual disambiguation begins: the raw text from our document is compared to the taxonomy we have created. If there is a fit, the text is moved from the document and stored in a processing stage. This stage involves looking for more distinct patterns in the newly moved text. Using regular expressions, a pattern-matching tool available in most programming languages, we can discern more distinct patterns in the text. We can then move this text into a matrix, which many people know as a spreadsheet. Transferring text into a matrix means converting words to numbers, and the resulting matrix can be rather large. While there are specific choices to make along the way (e.g., sparse vs. dense matrices), the goal is the same: make text machine-readable. Words become zeros and ones, and analytical models can now be applied to the document. Machine learning algorithms, such as classifiers derived from Bayes' theorem and other classification and clustering techniques, can be used to categorize and cluster text.
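The "words become zeros and ones" step can be sketched as a binary document-term matrix. The example below is an illustration under simple assumptions, not the book's procedure: the three short documents are invented, a regular expression does the tokenizing, and the "sparse" form is just a list of the column positions that hold a 1, which stands in for what a dedicated sparse-matrix library would do.

```python
import re

def tokenize(text):
    """Use a regular expression to pull lowercase word tokens out of raw text."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    """Sorted list of the distinct tokens across all documents (the columns)."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def dense_matrix(docs, vocab):
    """One row per document, one column per term: 1 if the term occurs, else 0."""
    rows = []
    for doc in docs:
        present = set(tokenize(doc))
        rows.append([1 if term in present else 0 for term in vocab])
    return rows

def sparse_matrix(docs, vocab):
    """Same information, storing only the column indices whose value is 1."""
    index = {term: i for i, term in enumerate(vocab)}
    return [sorted({index[tok] for tok in tokenize(doc)}) for doc in docs]

# Invented sample documents, for illustration only.
docs = ["engine failure reported", "brake failure", "engine and brake replaced"]

vocab = build_vocabulary(docs)
dense = dense_matrix(docs, vocab)
sparse = sparse_matrix(docs, vocab)

print(vocab)
for row in dense:
    print(row)
print(sparse)
```

With a realistic vocabulary of tens of thousands of terms, almost every cell in the dense form is zero, which is why the sparse form matters. Either way, each row is now a numeric vector that a classifier such as naive Bayes could take as input.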
A simple example
Imagine you go to the ER one day and a report is generated when you are discharged. This record holds many important elements of your medical history. However, having someone manually extract your name, address, medications, condition, treating doctor's information, health vitals, etc. would take a lot of time, more time than a swamped hospital staff on a limited budget can spare. Text analytics is used to pull all of this information into a spreadsheet that can then be loaded into the hospital's database. Add up enough of these records and you can start looking for patterns.
- Your visit to the ER is documented as text
- The hospital takes a pre-defined “dictionary”, or taxonomy, of medical-related terms
- The taxonomy is compared against your medical evaluation, and the matches are processed into a spreadsheet/matrix
- The spreadsheet is uploaded into a relational database the hospital maintains
- An analyst queries the database to build a machine learning model that can create value-added predictions
- Based on the model, a value is produced that results in a decision being made
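The first four steps above can be sketched end to end with Python's standard library. Everything specific here is an assumption made for illustration: the report text, the four-term taxonomy, the `extract_row` helper, the `er_visits` table, and the blood pressure threshold in the query (which is not clinical guidance). An in-memory SQLite database stands in for the hospital's relational database, and the modeling step is left out.

```python
import re
import sqlite3

# Hypothetical mini-taxonomy: term -> category. Real medical taxonomies are far larger.
TAXONOMY = {
    "ibuprofen": "medication",
    "lisinopril": "medication",
    "fracture": "condition",
    "hypertension": "condition",
}

# Made-up ER report text, for illustration only.
REPORT = (
    "Patient: Jane Doe. BP: 150/95. "
    "Diagnosed with hypertension and a wrist fracture. "
    "Prescribed lisinopril and ibuprofen."
)

def extract_row(report, taxonomy):
    """Turn one free-text report into a flat record (one spreadsheet row)."""
    tokens = re.findall(r"[a-z]+", report.lower())
    found = {"medication": [], "condition": []}
    for tok in tokens:
        category = taxonomy.get(tok)
        if category and tok not in found[category]:
            found[category].append(tok)
    # Regular expressions pick out the fields that follow a fixed pattern.
    name = re.search(r"Patient:\s*([A-Za-z ]+)\.", report)
    bp = re.search(r"BP:\s*(\d+)/(\d+)", report)
    return {
        "name": name.group(1) if name else None,
        "systolic": int(bp.group(1)) if bp else None,
        "diastolic": int(bp.group(2)) if bp else None,
        "conditions": ",".join(found["condition"]),
        "medications": ",".join(found["medication"]),
    }

row = extract_row(REPORT, TAXONOMY)

# Load the row into a relational table (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE er_visits "
    "(name TEXT, systolic INTEGER, diastolic INTEGER, conditions TEXT, medications TEXT)"
)
conn.execute(
    "INSERT INTO er_visits VALUES "
    "(:name, :systolic, :diastolic, :conditions, :medications)",
    row,
)

# An analyst can now query structured columns instead of reading free text.
# The 140 cutoff is an arbitrary illustrative value.
high_bp = conn.execute(
    "SELECT name FROM er_visits WHERE systolic >= 140"
).fetchall()

print(row)
print(high_bp)
```

Once thousands of reports have been loaded this way, the same table becomes the input for the pattern-finding and predictive modeling described in the last two steps.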