Recently, a former colleague reached out to ask how his organization could become data-driven. Specifically, he was interested in what the various terms and buzzwords he had been hearing actually meant. The following is a brief response summarizing, at a high level, the underpinnings of what it means to be “data-driven.” I have included the e-mail response in its entirety.
“Ok, so what is this topic of a data-driven business? Well, for starters, most of what we hear about informing decisions through advanced analytics and big data is nothing new. What is new is the sheer quantity and quality of the data we are being presented with. The techniques behind the buzzwords and wing-dings we see in business news have been around in operations research for decades. But we have never had the amount of data being produced by what we call the “Internet of Things,” or IoT for short. Think about your cell phone. There are about 12-15 sensors in a typical iPhone continuously recording data and feeding it to various services: your location, speed, altitude, when your phone activity is low and when it is high. All of these things can be used to inform decisions and create new products. But it is the same process you learned in the Navy: optimize a problem given the data and subject to the constraints. So here is a little introduction to what I view as the foundations of a data-driven business.
1. Theory: Big data is usually broken up into three Vs: variety, velocity, and volume.
A) Variety is the range of forms data takes: structured tables, free text, images, audio, sensor feeds. Much of what the IoT produces is unstructured, which is exactly what makes it hard to handle with traditional tools.
B) Velocity is the speed at which data comes in. It is handled in two ways: batch and stream processing. Batch processing works on data in batches, all at once; stream processing handles data as it drips into a system. Think of clicks on a webpage: a company like Amazon wants to see where a customer clicked, how long they lingered on a page before buying something, and the order in which they shop. These are simple events to capture, since web-based applications can timestamp and track every cursor selection. When we get into batch processing, most of the time we think of big data platforms like Hadoop. Hadoop is the industry standard for dividing a job into tasks and sending them to different machines in a cluster (think a room full of server towers). Each job is split into map tasks that run on individual nodes; once those individual tasks are done, the results are sent back, shuffled, and combined in a reduce step. This process is called MapReduce and is a keystone of Hadoop. Think of it as giving the crew of a ship multiple tasks. You cannot do everything yourself, but you can command junior officers to delegate. Essentially, you are the Hadoop master and your junior officers are the mappers: you map tasks to your junior officers and they reduce your workload. The junior officers then give out commands to the enlisted folks (think of a machine learning algorithm running the math and operations research models on the individual cores in your cluster, then reporting back up the chain). You, or the Hadoop master, then take all those individual reports at the end of the day, come to a conclusion, and summarize the results for the Admiral.
C) Volume. Simply put, this deals with the sheer size of big data. It hurts my head to think about how big data is in a military or business world, but I will try. Think of Trump’s proposed wall: DHS is going to have quite the mission dealing with all the data that sensors on the wall will send back, and UAVs on patrol need someplace to store data. This is where we get the terms “data management” and “data warehousing.” We need someplace to store data, and Amazon (AWS) and Google are leading the way, building “server farms” in the desert and other places with cheap land and power. Cooling is a major expense, because servers run hot, which is why IT always puts huge fans in the server rooms in your office.
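If you like code, the MapReduce flow described above can be sketched in plain Python. This is a toy word count, the canonical MapReduce example; there is no real cluster here, just function calls standing in for the mappers and reducers, and the ship documents are made up:

```python
# A minimal sketch of the MapReduce idea from the ship analogy above.
# On real Hadoop these phases run distributed across a cluster; here the
# "junior officers" are just function calls.
from collections import defaultdict

def map_phase(document):
    # Each mapper emits (word, 1) pairs for its slice of the data.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # The reducer combines all counts for each key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["the ship sailed", "the crew worked the deck"]

# Map step: every document is processed independently
# (in parallel, on a real cluster).
mapped = [pair for doc in documents for pair in map_phase(doc)]

# Reduce step: pairs are grouped by key and combined.
word_counts = reduce_phase(mapped)
print(word_counts)  # {'the': 3, 'ship': 1, 'sailed': 1, 'crew': 1, 'worked': 1, 'deck': 1}
```

On a real Hadoop cluster, the framework shuffles the mapped pairs to reducers by key and handles node failures; the logic is the same.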
2. Analysis: There is a process in data science that can usually be broken into three parts:
A. Descriptive: Basic statistics, basic graphs, basic outlier analysis. You are using these fundamentals of data analysis to get an overall picture of the data. Essentially, you have a specific question in mind.
B. Exploratory: You are looking at trends. You don’t have a specific question, but you are hoping to find something interesting, then you can think of a question and go back to descriptive statistics.
C. Predictive: This is usually called “machine learning.” These methods range from the simple (linear regression, time series forecasting, cluster analysis, decision trees) to the more complex (naive Bayes, random forests, neural networks, deep learning). You are forecasting here.
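To make the three stages concrete, here is a minimal sketch in Python using only the standard library. The monthly sales figures are invented for illustration:

```python
# Descriptive: summarize. Exploratory: look for a trend.
# Predictive: fit a model and forecast. Toy data, standard library only.
import statistics

months = [0, 1, 2, 3, 4, 5]
monthly_sales = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]  # made-up figures

# Descriptive: basic statistics give an overall picture of the data.
mean_sales = statistics.mean(monthly_sales)
stdev_sales = statistics.stdev(monthly_sales)

# Exploratory: is there an upward trend? Compare early vs. late months.
first_half = statistics.mean(monthly_sales[:3])
second_half = statistics.mean(monthly_sales[3:])

# Predictive: fit an ordinary least-squares line by hand,
# then forecast the next (seventh) month.
mx, my = statistics.mean(months), mean_sales
slope = (sum((x - mx) * (y - my) for x, y in zip(months, monthly_sales))
         / sum((x - mx) ** 2 for x in months))
intercept = my - slope * mx
forecast = slope * 6 + intercept
print(mean_sales, first_half, second_half, round(forecast, 2))
```

The same least-squares line is one call in R (`lm`) or in Python's scientific libraries; writing it out shows there is nothing mystical underneath.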
3. Tools: This is the data analysis process broken into steps, with the respective tools you will need:
A. Collection and Cleaning: Formatting raw data into a structure that you can process.
B. Storage: In-house or cloud-based.
C. Query: How do we get data out of the warehouse? Think of a library. You want some books about France, but you don’t want to bring the entire library home with you. On the library computer you select a few filters and get a cookbook, French for Dummies, and a novel about France. Under the hood, the library computer is running a SQL query. SQL stands for Structured Query Language and is the backbone of how nearly every organization on the planet queries data. Databases come in two broad flavors: SQL and NoSQL (“Not only SQL”). SQL is for structured data and NoSQL is for unstructured data; SQL databases typically scale vertically, while NoSQL databases scale horizontally. Think of structured data as pages of a spreadsheet you can stack on top of each other and drill a hole through. All the spreadsheets for the personnel on a ship carry the same columns, so if you want names and ranks, you drill a hole through the “name” column and another through the “rank” column, and you get a new spreadsheet called “Name and Rank” listing every person on your ship. Unstructured data is like the shelf at home with a CD next to a DVD next to a book in Japanese: drilling a hole through those will yield nothing. NoSQL is used to make those kinds of sources talk to each other and connect. It is more cumbersome, but it has become the standard for the unstructured data sources the IoT provides. For SQL you will see the open-source MySQL and the paid Microsoft SQL Server; Access databases also run on SQL. For NoSQL, MongoDB is the major player.
D. Analysis: Now that your data is clean and ready, you can perform your analysis. The leading open-source languages are R and Python; both are object-oriented programming languages. R is favored by stats people, Python by computer scientists and engineers. R is the simpler language to pick up, but Python can be used for virtually any task. For stats junkies there are Stata, SPSS, Excel, and SAS.
E. Sharing Conclusions: This is where reporting and telling the story of your data comes into play. It is the most crucial step, and the one where most data wizards fail. If you cannot explain your analysis in terms that people understand, then all is lost. The tools here are usually business intelligence (BI) programs that visualize data. Tableau is the heavyweight in the business/government arena: you can create an interactive dashboard holding all of your data and analysis and embed it in your own website for users. Other players are Microsoft Power BI and Qlik.”
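To make the query step in the e-mail concrete: SQLite is a small SQL engine that ships with Python, and the “Name and Rank” example translates almost directly. The crew table and its rows below are invented for illustration:

```python
# A concrete version of the "Name and Rank" query from the e-mail, using
# SQLite (a small SQL database bundled with Python's standard library).
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE crew (name TEXT, rank TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO crew VALUES (?, ?, ?)",
    [("Rivera", "LTJG", "Engineering"),   # made-up personnel records
     ("Chen", "ENS", "Navigation"),
     ("Okafor", "LT", "Operations")],
)

# SELECT pulls only the columns you want -- the "drill a hole through the
# name and rank columns" step -- instead of bringing the whole table home.
rows = conn.execute("SELECT name, rank FROM crew ORDER BY name").fetchall()
print(rows)  # [('Chen', 'ENS'), ('Okafor', 'LT'), ('Rivera', 'LTJG')]
```

The same `SELECT name, rank FROM crew` statement would run unchanged against MySQL or SQL Server; only the connection setup differs.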