Interview with Johns Hopkins University Data Scientist: Dr. Paul Nicholas

A couple of weeks ago I sat down with Dr. Paul Nicholas, a data scientist at Johns Hopkins University, to discuss his unique path in operations research. Dr. Nicholas brings a wealth of experience, from analyzing survey data on local populations in Afghanistan while deployed with the US military to his time as an MIT Seminar XXI Fellow. Our conversation moved from the essential tools in a data scientist’s toolkit to briefing strategies for helping clients understand complex solutions. Dr. Nicholas tells interesting stories of what it is like to be on the front lines of operations research and data science.



Bio: Dr. Paul J. Nicholas is an operations research scientist at the Johns Hopkins University Applied Physics Laboratory (JHU APL). He also teaches a graduate course on analytics and decision analysis as part of GMU’s Data Analytics Engineering (DAEN) program, and serves in the U.S. Marine Corps Reserve. Paul’s research focuses on the use of advanced analytic techniques to examine problems in national security, including large-scale combinatorial optimization, simulation, and data analytics. Paul received his B.S. from the U.S. Naval Academy, his M.S. from the Naval Postgraduate School, and his Ph.D. from George Mason University. On active duty, he deployed several times to Iraq and Afghanistan in support of combat operations as a communications officer and operations research analyst. He was a visiting fellow to the Johns Hopkins University Applied Physics Lab, and an MIT Seminar XXI Fellow. He is the author of numerous peer-reviewed publications and holds two U.S. patents.

Books that Dr. Nicholas recommends:

Three websites Dr. Nicholas uses to keep current:

Contact information:

SEOR Department
Engineering Building, Rm 2100, MS 4A6
George Mason University
Fairfax VA 22030
Email: pnichol6 [at] gmu [dot] edu


Interview with Excella Consulting Data Scientist and ML Expert: Patrick Smith

A couple of weeks ago, I had the opportunity to sit down in the studio with Patrick Smith. Patrick brings expertise in Natural Language Processing (NLP) as it relates to deep learning and neural networks. His work in quantitative finance before joining Excella Consulting offers incredible insight into the mathematical rigor that is crucial for a data scientist to master. Listen as Patrick offers his take on what is current and on the horizon in data science consulting!


Bio: Patrick Smith is the data science lead at Excella in Arlington, Virginia, where he developed the data science practice. Previously, he both led and helped create the data science program at General Assembly in Washington, DC, and was a data scientist with Booz Allen Hamilton’s Strategic Innovations Group. Prior to data science, Patrick worked in risk and quantitative securities analysis, and institutional portfolio management. He has his B.A. in Economics from The George Washington University and has done master’s work at Harvard and Stanford Universities in AI and computer science.

Patrick is passionate about AI and deep learning and has contributed to significant research and development projects as part of Excella’s artificial intelligence research effort. He architected Excella’s DALE intelligent assistant solution, which provides state-of-the-art results in question-answering tasks, and is the author of an upcoming book on artificial intelligence.


Favorite Data Books:
Sites that Keep Patrick Current:
How to Contact Patrick:

Searching for Lost Nuclear Bombs: Bayes’ Theorem in Action



What do searching for lost nuclear weapons, hunting German U-boats, breaking Nazi codes, scouring the high seas for Soviet submarines, and searching for airliners that go missing over the ocean have in common? No, not a diabolical death wish, but a 250-year-old algorithm, deceptively simple yet computationally robust. This algorithm, with its myriad offshoots, is Bayes’ Theorem. The book The Theory That Would Not Die, by Sharon Bertsch McGrayne, introduces the reader to the theorem’s 200-year history of controversy, beginning in the mid-1700s with Reverend Thomas Bayes’ discovery, through its application by Pierre-Simon Laplace, to its contentious struggle against the frequentist approach to statistics in the early 1900s.


It wasn’t until the dawn of the computer age that Bayes made an impact on predictive analytics. The narrative that McGrayne portrays is one of the misunderstood analyst, desperately clinging to a belief that offers clearer predictive power and trying to convey a simple algorithm that offers a more powerful means of testing phenomena.

What is Bayes?

P(A|B) = P(B|A) × P(A) / P(B)

A: Hypothesis

B: Data

The idea is simple: we learn new information each day. In essence, we continually update the knowledge we already have from past experience. With each new day, we update our prior beliefs, and we assign probabilities to future events based on those beliefs. This system of prior beliefs is at the core of Bayes’ theorem. Simply put, Bayes is a way of updating our beliefs with new information to arrive at a more accurate prediction based on probability.


What does Bayes look like in action? 

Caveat: This is an extremely simplified version of the model used and by no means represents the sheer volume of calculations performed by highly skilled statisticians.

A popular case documented by McGrayne in The Theory That Would Not Die is an incident that happened in the 1960s. The US military loses things more often than we like to think. Many nuclear bombs have been accidentally dropped or lost in transit. While not armed, these bombs are a national security threat that most presidents want to get hold of as quickly as possible. One specific example occurred in 1966, when a B-52G bomber collided in mid-air with a KC-135 refueling tanker over the Mediterranean Sea. The plane subsequently jettisoned four nuclear warheads. Three of these were found on land, while the fourth lay in the waters off the Spanish coast.


The Navy employed experts in Bayesian probability to find the missing warhead. Specifically, they used Bayes’ Rule to find the probability of the warhead being in a given area given a positive signal from their sonar devices. Since going 2,550 feet below the ocean surface in a submersible is expensive and dangerous, the Navy wanted to ensure it was making the trip with a purpose and not to find a cylindrical rock formation.

The Navy searched for 80 days without results. However, Bayes tells us that an inconclusive result is a constructive result nonetheless. As McGrayne states, “Once a large search area was divided into small cells, Bayes’ rule said that the failure to find something in one cell enhances the probability of finding it in the others. Bayes described in mathematical terms an everyday hunt for a missing sock: an exhaustive but fruitless search of the bedroom and a cursory look in the bath would suggest the sock is more likely to be found in the laundry. Thus, Bayes could provide useful information even if the search was unsuccessful.” (McGrayne, 189)

The simplistic model for this search in a single square would be…

P(A|B) = P(B|A) × P(A) / P(B)

  • P(A|B): Probability the square contains the warhead given a positive signal from Navy instruments
  • P(A): Probability the square contains the warhead
  • P(B|A): Probability of a positive signal from the instrument given the warhead is present
  • P(B): Probability of a positive signal from instrument
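To make the four terms above concrete, here is a minimal sketch of the single-square calculation. All of the numbers are hypothetical and chosen only for illustration; the function name and the false-alarm rate (the chance of a positive signal when the warhead is absent, needed to compute P(B)) are assumptions that do not come from McGrayne's account.

```python
def posterior_given_signal(prior, p_signal_if_present, p_signal_if_absent):
    """Return P(A|B): probability the square holds the warhead given a positive signal."""
    # P(B): total probability of a positive signal, whether or not the warhead is there.
    p_signal = p_signal_if_present * prior + p_signal_if_absent * (1 - prior)
    # Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
    return p_signal_if_present * prior / p_signal


# Hypothetical inputs, for illustration only.
prior = 0.10             # P(A): belief that the square contains the warhead
hit_rate = 0.80          # P(B|A): positive signal when the warhead is present
false_alarm_rate = 0.20  # positive signal when it is absent (e.g. a rock formation)

posterior = posterior_given_signal(prior, hit_rate, false_alarm_rate)
print(f"P(warhead | positive signal) = {posterior:.3f}")
```

With these made-up inputs, a single positive signal raises the belief from 10 percent to roughly 31 percent: more convincing than before, but far from certain, which is why a dive was not ordered on every ping.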

We would then add these cells to derive a more general picture of the search area. The search area would be updated as new data flowed in, creating a new prior and posterior hypothesis that stemmed directly from the new data. Remember, this is one component of a much larger model, but it should help you get the picture. To get a more robust model, we would need to create similar calculations for the last known position of the aircraft when it went down, the currents of the ocean, the climate over the past months, etc. Computationally, this can get very heavy very quickly, but the underlying principles remain the same: base a hypothesis on prior knowledge and update to form a posterior hypothesis.
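The updating step described above can be sketched for a handful of cells. This is a toy illustration under stated assumptions, not the Navy's model: the four-cell grid, the prior values, and the 90 percent detection probability are all hypothetical, and `update_after_miss` is a name invented here.

```python
def update_after_miss(priors, searched, p_detect):
    """Update cell probabilities after searching one cell and finding nothing.

    priors   -- probabilities per cell that the object is there (sum to 1)
    searched -- index of the cell that was searched
    p_detect -- assumed chance of spotting the object if it is in the searched cell
    """
    # Probability that the search of this cell comes up empty.
    p_miss = 1 - priors[searched] * p_detect
    posteriors = []
    for i, p in enumerate(priors):
        if i == searched:
            # The object could still be here, but only if the search overlooked it.
            posteriors.append(p * (1 - p_detect) / p_miss)
        else:
            # Every unsearched cell becomes more likely.
            posteriors.append(p / p_miss)
    return posteriors


# Hypothetical four-cell search area.
priors = [0.4, 0.3, 0.2, 0.1]
posteriors = update_after_miss(priors, searched=0, p_detect=0.9)
print([round(p, 3) for p in posteriors])
```

The posterior from one search becomes the prior for the next, so calling the function again with the updated list mirrors the sock hunt in McGrayne's quote: each fruitless search shifts belief toward the cells not yet examined.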

As Navy vessels complete searches in one area, Bayes’ model is updated. While the figures below are from a study involving the search for Air France Flight 447 in the Atlantic Ocean, the story remains the same. We create a hypothesis, we test it, we gain new data, and we use that data to update our hypothesis. If you are curious about how Bayes was used in the successful recovery of Air France Flight 447, I highly recommend the 2015 METRON slides.

Prior hypothesis


Posterior hypothesis after initial search


Similar methodologies in large-scale search endeavors have been well documented. These include the ongoing search for the missing Malaysia Airlines Flight 370. Below are links to interesting use cases of Bayes in action.

Bayesian Methods in the Search for MH370

How Statisticians Found Air France Flight 447 Two Years After It Crashed Into Atlantic

Missing Flight Found Using Bayes’ Theorem?

Operations Analysis During the Underwater Search for Scorpion

Can A 250-Year-Old Mathematical Theorem Find A Missing Plane?

And yes, they found the bomb!



Background reference:

The Theory That Would Not Die, Sharon Bertsch McGrayne







Interview with Radiant Solutions Data Scientist: Josh Lipsmeyer

Recently Translating Nerd interviewed Josh Lipsmeyer about the work he does as a Software Engineer/Data Scientist. Josh has an interesting background because he leverages a mathematics and physics background to solve some pretty tough data-related problems. The conversation was lively with this Arkansas native as we dug deep into how he simplifies the data world for us. There are a few shenanigans in this podcast since Josh and I go back.



Bio: Josh Lipsmeyer is a Software Engineer/Data Scientist at Radiant Solutions where he specializes in algorithm development, data science, and application deployment primarily for customers with the Department of Defense. His primary interest is the study of complex network resilience, information diffusion, and evolution. He has a background in mathematics and physics, holding an MS in mathematics from the University of Tennessee. He currently lives in Centreville, VA with his wife and 1-year-old son.

Toolkit: Critical thinking, Python, Spark, Java, Microsoft SQL Server, QGIS, STK, LaTeX, Tableau, Excel

Josh’s favorite online resources:

  • Udemy
  • Stack Overflow
  • Coursera
  • Open Source University Courses

Influential books about data:

Understanding Machine Learning: From Theory to Algorithms

How to get ahold of Josh:


Book Summary: Turning Text into Gold



Key Takeaways

Turning Text into Gold: Taxonomies and Textual Analytics, by Bill Inmon, covers a plethora of text analytics and Natural Language Processing foundations. Inmon makes it abundantly clear in the first chapter of his book that organizations are underutilizing their data. He states that 98 percent of corporate decisions are based on only 20 percent of available data. This data, labeled structured data because it fits neatly into matrices, spreadsheets, and relational databases and is easily ingested into machine learning models, is well understood. However, unstructured data, the text and words that our world generates on a daily basis, is seldom used. Similar to the alchemists of the Middle Ages who searched for a method to turn ordinary metals into gold, Inmon describes a process to turn unstructured data into decisions: turning text into gold.

What the heck is a taxonomy?

Taxonomies are the dictionaries that we use to tie the words in a document, book, or corpus of materials to a business-related understanding. For example, if I were a car manufacturer, I would have a taxonomy of various car-related concepts so I could identify those concepts in the text. We then start to see repetition and patterns in the text. We might begin to see new words that relate to car manufacturing in the text. We can then add these terms to our taxonomy. While the original taxonomy might capture 70 percent of the car-related words in the document, 90 percent is usually a business-appropriate level at which to move from taxonomy/ontology to database migration.

Now what?

Once we have our taxonomies in place, we have the necessary inputs. Through textual disambiguation, the raw text from our document is compared to the taxonomy we have created. If there is a fit, the text is pulled from the document and stored in a processing stage. This stage involves looking for more distinct patterns in the newly moved text. Using regular expressions, a pattern-matching method in coding, we can discern more distinct patterns in the text. We can then move this raw text into a matrix, which many people are familiar with as a spreadsheet. Transferring the text into a matrix involves converting text to numbers, and the resulting matrix can be rather large. While there are specific steps that can be taken (i.e., sparse matrix vs. dense matrix), the process is the same: make text machine-readable. Words become zeros and ones, and analytical models can now be applied to the document. Machine learning algorithms, such as offshoots of Bayes’ Theorem and other classification techniques, can be used to categorize and cluster text.
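As a rough sketch of the text-to-matrix step, the snippet below compares two short documents against a tiny taxonomy and produces a count matrix with one row per document and one column per concept. The taxonomy, the sample sentences, and the helper names are all made up for illustration, and this is not Inmon's actual process: real textual disambiguation handles plurals, context, and synonyms that this exact-word match ignores.

```python
import re

# Hypothetical taxonomy: business concept -> words that signal it.
taxonomy = {
    "engine": ["engine", "piston", "turbocharger"],
    "brakes": ["brake", "rotor", "caliper"],
}

# Hypothetical raw text.
documents = [
    "The engine stalled and a piston was replaced.",
    "Front brake rotor worn; caliper sticking.",
]


def tokenize(text):
    """Lowercase the text and keep only runs of letters, using a regular expression."""
    return re.findall(r"[a-z]+", text.lower())


def concept_counts(text, taxonomy):
    """Count how many tokens in the text match each concept's terms."""
    tokens = tokenize(text)
    return {
        concept: sum(tokens.count(term) for term in terms)
        for concept, terms in taxonomy.items()
    }


concepts = sorted(taxonomy)  # fixed column order: ['brakes', 'engine']
matrix = [[concept_counts(doc, taxonomy)[c] for c in concepts] for doc in documents]

print(concepts)
for row in matrix:
    print(row)
```

Once the text is in this numeric form, each row can be fed to a classifier or clustering routine. With a real corpus most entries would be zero, which is where the sparse versus dense matrix choice mentioned above comes in.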

A simple example

Imagine you go to the ER one day and a report is generated when you are out-processed. This record holds many important elements of your medical history. However, having someone extract your name, address, medications, condition, treating doctor’s information, health vitals, etc. would take a lot of time, more time than a swamped hospital staff on a limited budget can spare. Text analytics is used to pull all this information into a spreadsheet that can then be loaded into the hospital’s database. Add up enough of these records and you can start looking for patterns.


  1. Your visit to the ER is documented as text
  2. The hospital takes a pre-defined “dictionary”, or taxonomy, of medical-related terms
  3. The taxonomy is compared against your medical evaluation and processed into a spreadsheet/matrix.
  4. The spreadsheet is uploaded into a relational database the hospital maintains
  5. An analyst queries the database to build a machine learning model that can create value-added predictions.
  6. Based on the model, a value is produced that results in a decision being made.
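Steps 1 through 5 above can be sketched end to end in a few lines. Everything here is hypothetical: the report text, the field names, and the `visits` table are invented for illustration, and simple labeled-field regular expressions stand in for a real medical taxonomy. An in-memory SQLite database plays the role of the hospital's relational database.

```python
import re
import sqlite3

# Step 1: the visit is documented as text (made-up example).
report = "Patient: Jane Doe. Condition: sprained ankle. Medication: ibuprofen. Pulse: 72 bpm."

# Step 2: a stand-in "dictionary" of the fields we want, expressed as regex patterns.
patterns = {
    "patient": r"Patient:\s*([^.]+)\.",
    "condition": r"Condition:\s*([^.]+)\.",
    "medication": r"Medication:\s*([^.]+)\.",
    "pulse": r"Pulse:\s*(\d+)\s*bpm",
}


def extract(text):
    """Step 3: compare the text against the patterns and build one spreadsheet-like row."""
    row = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        row[field] = match.group(1).strip() if match else None
    if row["pulse"] is not None:
        row["pulse"] = int(row["pulse"])  # store the vital sign as a number
    return row


row = extract(report)

# Step 4: load the row into a relational database (in memory, for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (patient TEXT, condition TEXT, medication TEXT, pulse INTEGER)"
)
conn.execute(
    "INSERT INTO visits VALUES (:patient, :condition, :medication, :pulse)", row
)

# Step 5: an analyst queries the structured data.
result = conn.execute(
    "SELECT patient, pulse FROM visits WHERE condition = ?", ("sprained ankle",)
).fetchall()
print(result)
```

The point of the sketch is the shape of the pipeline rather than the patterns themselves: free text goes in, a structured row comes out, and from then on ordinary queries and models can work with it.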



A conversation with RAND economist (and data scientist), Dr. Ben Miller

Last week I had the opportunity to drop by RAND to interview Dr. Ben Miller. Ben is an economist who specializes in econometric modeling and is at the leading edge of applied data science within the think tank world. Ben offers many insights into causal inference, discovering patterns, and uncovering signals in cloudy data. Ben explains the tools he uses, the methods he calls upon, and the manner in which he guides RAND’s “sponsors” to make informed decisions as they relate to national priorities.

As Ben says, “My toolkit is ‘applied econometrics,’ aka using data to estimate causal relationships. So think of techniques like instrumental variables, differences-in-differences, regression discontinuities, etc. Overall, it’s about putting a quantitative estimate on a relation between two variables in a way that is (hopefully!) unbiased, and at the same time understanding the uncertainty associated with that estimate. A really approachable and widely respected introduction to that toolkit is Angrist & Pischke’s ‘Mostly Harmless Econometrics.’”



Referenced works from the audio:

The Dog that Didn’t Bark: Item Non-response Effects on Earnings and Health

Models for Censored and Truncated Data


Bio: Ben Miller is an Associate Economist at the RAND Corporation and a Professor at the Pardee RAND Graduate School. His research spans a wide variety of topics including disaster mitigation, infrastructure finance, energy resilience, geospatial information, insurance, tax policy, regulatory affairs, health care supply, agriculture, econometrics, and beyond.  Recent publications examine the link between flood insurance prices and housing affordability, review federal policies surrounding transportation and water infrastructure finance, estimate the causal impact of weather warning systems on fatalities and injuries from tornadoes, and overview econometric techniques for determining the value of geospatial information. Prior to joining RAND, Miller worked as a statistician supporting the U.S. Census Bureau’s Survey of Income and Program Participation. He holds a Ph.D. in economics from the University of California, San Diego and a B.S. in economics from Purdue University.




An interview with International Development Data Scientist, Anton Prokopyev


Last weekend I had the opportunity to sit down with a fascinating data scientist who works in a large non-governmental organization that specializes in international development. Anton views data scientists as “forensic investigators” who have a responsibility to the community at large to develop repeatable and reputable analytical stories. 

Bio: Anton is a data scientist working in an international development organization in Washington, D.C., where he applies his programming and research skills to contribute to data-for-good goals. While Anton is a full-stack data scientist, meaning he works through a project from prototyping to production, he has a keen affection for natural language processing and text analytics. Prior to starting his data science journey, Anton worked in tech companies and Silicon Valley startups.





Anton’s favorite online resources:

Siraj Raval


Talk Python

Kaggle Kernels

ML Trainings


Book Summary: Weapons of Math Destruction


Cathy O’Neil’s Weapons of Math Destruction, or WMD for short, offers a moralistic argument that with big data comes big responsibility. Uncovering what she refers to as the inherent prejudice, racism, and economic unfairness in predictive algorithms and machine learning models, she spares no analytical model from her fierce criticism. With an opening salvo that our very democracy hangs on the morality of our mathematical models, O’Neil unveils the destructive power of the models that impact our daily lives.

O’Neil defines a WMD by the following characteristics:

  1. It is opaque, as if a hidden veil sits between the subject and the controller
  2. It causes a harmful, self-reinforcing feedback loop in which the output of the model adversely affects future decisions about the subject, only to feed further damaging data back into the model
  3. It disadvantages a specific group, most notably minorities
  4. It presumes guilt, treating innocence as a mere constraint in the model
  5. It favors efficiency resulting in further model variance
  6. It doesn’t calculate fairness by relying solely on numbers and data
  7. It lacks a feedback structure, resulting in creating noise without signals to readjust

O’Neil sums up the paradox of a WMD as follows:

“Do you see the paradox? An algorithm processes a slew of statistics and comes up with a probability that a certain person might be a bad hire, a risky borrower, a terrorist, or a miserable teacher. That probability is distilled into a score, which can turn someone’s life upside down. And yet when the person fights back, ‘suggestive’ countervailing evidence simply won’t cut it. The case must be ironclad. The human victims of WMD’s, we’ll see time and again, are held to a far higher standard of evidence than the algorithms themselves.”

Some of the intriguing examples O’Neil cites are familiar to the analytic community. These include the star high school student who applies to a state university, only to be turned away because the school’s predictive model for acceptance indicates that star students have a high probability of turning down so-called “safety schools” for more prestigious institutions. Knowing that university rankings reward a low acceptance rate, the state school rejects the star student. This cycle perpetuates itself in a harmful feedback loop.

Another example is the predictive models that city police departments rely on. Using inputs such as the number of ATMs, geographical features, and classifications of high-risk zones, the machine learning models attempt to optimize police resources by placing patrol cars in areas with a higher probability of a crime occurring. Unfortunately, these areas tend to be lower-income and to have large minority populations. The sheer act of increasing police presence results in minor crimes, such as drug use, being seen, acted upon, and enforced. More police reports and arrests feed back into the machine learning model, resulting in a further bias towards placing more police units in the area. This is another prime example of a harmful feedback loop that presumes guilt without model readjustment.
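The loop described above can be illustrated with a deliberately crude toy model. It is not taken from O’Neil’s book or from any real policing system: the two districts, the starting patrol counts, and the rule of moving one patrol per round toward the district with more recorded incidents are all assumptions made for the sketch. The key assumption is that both districts behave identically, so recorded incidents depend only on how many patrols are there to observe them.

```python
def simulate_patrols(patrols, rounds, incidents_per_patrol=2):
    """Toy feedback loop between patrol placement and recorded incidents.

    patrols              -- starting patrol counts for two districts, e.g. (6, 4)
    rounds               -- number of reallocation rounds to simulate
    incidents_per_patrol -- incidents each patrol records per round (same in both districts)
    """
    patrols = list(patrols)
    history = [tuple(patrols)]
    for _ in range(rounds):
        # What gets recorded reflects where police are looking, not a difference in behaviour.
        recorded = [p * incidents_per_patrol for p in patrols]
        # The "model" sends one more patrol to whichever district recorded more incidents.
        hot = recorded.index(max(recorded))
        cold = 1 - hot
        if patrols[cold] > 0:
            patrols[cold] -= 1
            patrols[hot] += 1
        history.append(tuple(patrols))
    return history


history = simulate_patrols(patrols=(6, 4), rounds=4)
for step, allocation in enumerate(history):
    print(step, allocation)
```

Under these assumptions a small initial imbalance grows every round, because the data the model learns from is produced by its own earlier decisions. Breaking the loop requires an outside signal, which is the kind of readjustment the paragraph above says is missing.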

As seen from these examples, O’Neil argues that more data is not the panacea that will fix our predictive problems. If the algorithm favors a certain segment of the population, or is never readjusted based on evaluation metrics (accuracy, precision, recall, or a combination of the latter two in the F1 or F-beta scores), then no matter the size of the data, the model will still fail. However, there is hope! O’Neil offers the following suggestions to improve our models and restore trust in a fairer and more robust set of algorithms.

  1. Ensure there is an opportunity for outside inspection. Let the general public understand the models that judge them, be it credit e-scores or teacher evaluation criteria
  2. Allow user sign-off to use their online actions as data that can be resold. Europe follows a similar model that could be copied
  3. Make public the assumptions that the model is built upon
  4. Audit algorithms as if they were financial institutions
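An audit of the kind suggested above would likely start with the evaluation metrics mentioned earlier: accuracy, precision, recall, and the F1 and F-beta scores. The sketch below computes them from the four cells of a confusion matrix. The counts are hypothetical, the function name is invented here, and the code assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute common evaluation metrics from confusion-matrix counts.

    tp, fp, fn, tn -- true positives, false positives, false negatives, true negatives
    beta           -- weight of recall relative to precision (beta=1 gives the F1 score)
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of those flagged, how many were truly positive
    recall = tp / (tp + fn)     # of the true positives, how many were flagged
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
    }


# Hypothetical counts for a model that flags "risky" applicants.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Computing the same metrics separately for each demographic group is one simple way to check whether a model disadvantages a specific population. A beta above 1 weights recall more heavily, and a beta below 1 favors precision.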

While O’Neil’s work falls squarely into the category of personal rights and privacy in a data-filled world, much of her anecdotal evidence is convincing. What we implicitly hand to the data world through our online actions, voting behavior, consumer choices, and personal health should not be a product for others, but should belong to ourselves. This is a question that will plague future technologists as we come to terms with what is and isn’t in the public domain.


The MOOC Trap

Type “learn to become a data scientist” into Google and you will get the following results: DataCamp, Udacity, Udemy, Coursera, DataQuest, etc. These Massive Open Online Courses, MOOCs for short, are invaluable when learning a new skillset. They allow students into a once-guarded elite domain behind academic walls. They sharpen skill sets and add value to any resume.



However, I have noticed that past a certain point they become a crutch, trading the rigors of self-improvement for comfort. This is not a judgment on others; it is a reflection on my own progress in advanced analytics. We strive to learn the most advanced methods, the coolest visualizations, the highest performing algorithms. Yet, we seldom strive for originality in our work because we are afraid it might be too mundane.


To add to one’s proverbial toolkit in data science, MOOCs and educational resources that guide the user through a well-formatted analysis are crucial. There are few ways to learn the fundamentals other than structured educational materials. But this can lead to the “MOOC trap.” A typical case study of the MOOC trap is the burgeoning data scientist who has completed an intensive 6-month online program. She has dedicated 300 hours of intense study, through both sandbox exercises and a few semi-structured projects. She feels like she has a basic understanding of a host of skills, but is timid to try her analytical toolset on a problem of her choosing. Instead, she signs up for another 6-month MOOC, a mere regurgitation of the material that she just covered. Enamored with the ads and displays of a polished portfolio on GitHub that the MOOC promises, she forks over another $200 a month.


This individual felt the excitement of looking for a question, scouring the internet for a dataset, and the struggle of facing the mess that real-world data provides. But she retreated to the comfort of the MOOC. I feel the same in my own work. There are so many datasets that we as a community have access to, structured and unstructured, clean and, well, terribly scattered and messy. We are trained through our educational systems, in college/grad school, online courses, and structured tutorials, to create something advanced and analytically perfect. We are pressured to post this to GitHub, to display our certification of accomplishment with a stamp from an official organization.


The problem with the MOOC trap is that it no longer trains us for the real world; it trains us to become great followers of directions. We fear that our analysis on an original piece of work will not be cool enough, it will not be advanced enough, and, well, we might have to grind just to produce an exploratory analysis of things that we might have already assumed. But this is the challenge: to create something original, because it gives us ownership. Completing basic analytics with an original dataset that we went out and found adds to the data science community. This builds the foundations of what science is and hones our fundamental skills so sorely needed in the workforce.


While MOOCs offer a structured and nicely formatted addition to our repositories/portfolios of glistening analytical work, they have the potential to leave us in a comfortable position where growth decays. Educational training and online courses can take us to a certain point, but beyond that, the returns diminish. Each nascent data scientist will have a different inflection point, but the feeling is the same: you have a burning question, but feel your skillset is unpracticed. In this instance, forgo the MOOC and find the data in the world. Produce the basic analysis, ask your peers to review, and struggle a little more. Only then will we grow as data scientists.





Defining your “T”

I once had an interview with a data science consulting firm that I completely bombed. It wasn’t the SQL test and it wasn’t the questions on machine learning algorithms. The question that stumped me was when the interviewer asked, “What is your T?” Dumbstruck, I answered that I didn’t know this terminology. Politely, the interviewer explained that she wanted to know what my core competencies were, in what area I was known as the “go-to” guy, and where I was competent performing on my own. This brings me to what I believe is the most important story in data science: the story of you.

The T-model was defined by Harlan Harris and his co-authors in “Analyzing the Analyzers,” published by O’Reilly Media in 2013. Billed as “an introspective survey of data scientists and their work,” the book argues that there are many categories of individuals calling themselves data scientists. Which category one fits into depends on the contents of one’s T. I created the following T as a visual example.

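(Figure: an example T, with a core competency as the vertical, supporting skill areas along the horizontal bar, and the associated tools to the right.)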

The vertical of the T is one’s core competency, their comparative advantage in the field of advanced analytics. This is the area where researchers distinguish themselves and are frequently called upon to perform as the leader of a team. You might remember those people from graduate school who seemed to effortlessly explain the underpinnings of a complex derivation on the board, pulling proofs out of thin air as if a celestial power were whispering answers in their ear. Now, you don’t have to have analytic superpowers to complete the vertical of your T, but you should be quite competent in that area.

The horizontal bar represents the areas that you feel comfortable operating in, but are not solely known for. These can be areas that you draw on for a project every now and then or that are part of your analytical process. You might need a little extra time with your buddy Google or a sojourn on Stack Overflow, but these are areas where you know that, given a little extra time, you will have no trouble performing.

The programs to the right of the T are the programming languages, software, and technologies that go hand-in-hand with your competencies. For example, if someone lists machine learning as their core competency, they will most likely have R or Python listed in parallel to perform those analyses. With Python would be the libraries or modules that one would depend on to perform these duties. These could include pandas for data manipulation, NumPy for quick calculations, Matplotlib and seaborn for visualization, Beautiful Soup for scraping data from the web, NLTK for natural language processing, and scikit-learn for machine learning algorithms.

The purpose of defining your T is not to include every buzzword technology and programming language that is hot, but to include those resources that get you 90% of the way to your goal. That final 10% could be something that you seldom use and need to call on additional expertise to complete. Creating a firm narrative of your core strengths helps nascent and advanced data scientists alike explain what they bring to the team. It creates a mutual understanding with those hiring you of what you bring to the table. But more importantly, it provides a visual elevator pitch when communicating what data science is to you.