Interview with Johns Hopkins University Data Scientist: Dr. Paul Nicholas

A couple of weeks ago I sat down with Dr. Paul Nicholas, a data scientist at Johns Hopkins University, to discuss his unique path in operations research. Dr. Nicholas brings a wealth of knowledge, from analyzing survey data of local populations while deployed to Afghanistan with the U.S. military to serving as a fellow at MIT. Our conversation moved from essential tools in a data scientist’s toolkit to briefing strategies for helping clients understand complex solutions. Dr. Nicholas tells interesting stories about what it is like to be on the front lines of operations research and data science.

 

 

Bio: Dr. Paul J. Nicholas is an operations research scientist at the Johns Hopkins University Applied Physics Laboratory (JHU APL). He also teaches a graduate course on analytics and decision analysis as part of George Mason University’s (GMU) Data Analytics Engineering (DAEN) program, and serves in the U.S. Marine Corps Reserve. Paul’s research focuses on the use of advanced analytic techniques to examine problems in national security, including large-scale combinatorial optimization, simulation, and data analytics. Paul received his B.S. from the U.S. Naval Academy, his M.S. from the Naval Postgraduate School, and his Ph.D. from George Mason University. On active duty, he deployed several times to Iraq and Afghanistan in support of combat operations as a communications officer and operations research analyst. He was a visiting fellow at the Johns Hopkins University Applied Physics Lab and an MIT Seminar XXI Fellow. He is the author of numerous peer-reviewed publications and holds two U.S. patents.

Books that Dr. Nicholas recommends:

Three websites Dr. Nicholas uses to keep current:

Contact information:

http://mason.gmu.edu/~pnichol6/

SEOR Department
Engineering Building, Rm 2100, MS 4A6
George Mason University
Fairfax VA 22030
Email: pnichol6 [at] gmu [dot] edu


Interview with Excella Consulting Data Scientist and ML Expert: Patrick Smith

A couple of weeks ago, I had the opportunity to sit down in the studio with Patrick Smith. Patrick brings Natural Language Processing (NLP) expertise as it relates to deep learning and neural networks. His work in quantitative finance before joining Excella Consulting offers incredible insight into the mathematical rigor that is crucial for a data scientist to master. Listen as Patrick offers his take on what is current and on the horizon in data science consulting!


Bio: Patrick Smith is the data science lead at Excella in Arlington, Virginia, where he developed the data science practice. Previously, he both led and helped create the data science program at General Assembly in Washington, DC, and was a data scientist with Booz Allen Hamilton’s Strategic Innovations Group. Prior to data science, Patrick worked in risk and quantitative securities analysis and institutional portfolio management. He has a B.A. in Economics from The George Washington University and has done master’s work at Harvard and Stanford Universities in AI and computer science.

Patrick is passionate about AI and deep learning and has contributed to significant research and development projects as part of Excella’s artificial intelligence research effort. He architected Excella’s DALE intelligent assistant solution, which provides state-of-the-art results on question-answering tasks, and he is the author of an upcoming book on Artificial Intelligence.

 

Favorite Data Books:
Sites that Keep Patrick Current:
How to Contact Patrick:

Searching for Lost Nuclear Bombs: Bayes’ Theorem in Action

 


What do searching for lost nuclear weapons, hunting German U-boats, breaking Nazi codes, scouring the high seas for Soviet submarines, and searching for airliners that go missing over the open ocean have in common? No, not a diabolical death wish, but a 200-year-old algorithm: deceptively simple, yet computationally robust. That algorithm, along with its myriad offshoots, is Bayes’ theorem. The book The Theory That Would Not Die, by Sharon Bertsch McGrayne, opens the reader to the theorem’s 200-year controversial history, beginning in the 1700s with Reverend Thomas Bayes’ discovery, through its application by Pierre-Simon Laplace, to its contentious struggle against the frequentist school of statistics in the early 1900s.


It wasn’t until the dawn of the computer age that Bayes made a real impact on predictive analytics. The narrative McGrayne portrays is one of the misunderstood analyst, desperately clinging to a belief that offers clearer predictive power, trying to convey a simple algorithm that provides a more powerful means of testing phenomena.

What is Bayes?

P(A|B) = P(B|A) × P(A) / P(B)

A: Hypothesis

B: Data

The idea is simple: we learn new information each day. In essence, we update the knowledge we already have based on our past experiences. With each new day that passes, we update our prior beliefs, and we assign probabilities to future events based on those beliefs. This prior belief system is at the core of Bayes’ theorem. Simply put, Bayes is a way of updating our beliefs with new information to arrive at a more accurate, probability-based prediction. Another way of looking at the rule is shown below, from www.psychologyinaction.org.

[Image: Bayes’ rule illustrated, via www.psychologyinaction.org]
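To make the update concrete, here is a tiny sketch in Python with completely made-up numbers (the prior and the sensor accuracy below are purely illustrative, not from any real search):

```python
# A tiny worked example of Bayes' rule with purely hypothetical numbers.
# A = "the object is in this spot", B = "the sensor gives a positive signal".

p_a = 0.05              # prior: 5% chance the object is in this spot
p_b_given_a = 0.90      # sensor fires 90% of the time when the object is there
p_b_given_not_a = 0.10  # false-positive rate: fires 10% of the time otherwise

# Total probability of a positive signal, P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # ~0.321: one positive signal lifts the 5% prior to ~32%
```

One noisy signal doesn’t make us certain, but it shifts our belief, and that shift is the whole game.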

 

What does Bayes look like in action? 

Caveat: this is an extremely simplified version of the model used, and it by no means represents the sheer volume of calculations carried out by highly skilled statisticians.

A popular case documented by McGrayne in The Theory That Would Not Die is an incident that happened in the 1960s. The US military loses things more often than we like to think; several nuclear bombs have been accidentally dropped or lost in transit. While not armed, these bombs are a national security threat that any president wants recovered as quickly as possible. One specific example occurred in 1966, when a B-52G bomber collided mid-air with a KC-135 tanker during refueling over Palomares, on Spain’s Mediterranean coast. The bomber’s four nuclear weapons fell; three were found on land, while the fourth lay in the waters off the Spanish coast.


The Navy employed Bayesian search experts to find the missing weapon. Specifically, they used Bayes’ rule to find the probability that the warhead was in a given area given a positive signal from their sonar devices. Since sending a submersible 2,550 feet below the surface is expensive and dangerous, the Navy wanted to ensure each trip was made with a purpose, and not to find a cylindrical rock formation.

The Navy searched for 80 days without results. However, Bayes tells us that an inconclusive result is a constructive result nonetheless. As McGrayne states,  “Once a large search area was divided into small cells, Bayes’ rule said that the failure to find something in one cell enhances the probability of finding it in the others. Bayes described in mathematical terms an everyday hunt for a missing sock: an exhaustive but fruitless search of the bedroom and a cursory look in the bath would suggest the sock is more likely to be found in the laundry. Thus, Bayes could provide useful information even if the search was unsuccessful.” (McGrayne, 189)

A simplified model for this search in a single square would be:

P(A|B) = P(B|A) × P(A) / P(B)

  • P(A|B): Probability the square contains the warhead given a positive signal from Navy instruments
  • P(A): Probability the square contains the warhead
  • P(B|A): Probability of a positive signal from the instrument given the warhead is present
  • P(B): Probability of a positive signal from instrument

We would then combine these cells to build a more general picture of the search area, updating it as new data flowed in: each posterior becomes the prior for the next round of searching. Remember, this is one component of a much larger model, but it should help you get the picture. A more robust model would need similar calculations for the aircraft’s last known position before it went down, the ocean currents, the weather over the preceding months, and so on. Computationally, this gets very heavy very quickly, but the underlying principle remains the same: base a hypothesis on prior knowledge and update it to form a posterior.
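To show what that cell-by-cell updating looks like, here is a minimal Python sketch. The grid, priors, and detection probability are hypothetical stand-ins, not the Navy’s actual numbers, but the mechanics of the update are the same: an empty search lowers the searched cell’s probability and raises everyone else’s.

```python
# Hypothetical Bayesian search update: searching one cell and finding nothing
# lowers that cell's probability and raises every other cell's probability.

priors = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}  # prior P(warhead is in cell)
p_detect = 0.8  # chance the sonar spots the warhead IF we search the right cell

def update_after_empty_search(priors, searched_cell, p_detect):
    """Posterior P(cell) given that a search of `searched_cell` found nothing."""
    # P(no detection) = P(miss | in searched cell) * P(searched cell) + P(elsewhere)
    p_no_detection = (1 - p_detect) * priors[searched_cell] + (1 - priors[searched_cell])
    posterior = {}
    for cell, prob in priors.items():
        likelihood = (1 - p_detect) if cell == searched_cell else 1.0
        posterior[cell] = likelihood * prob / p_no_detection
    return posterior

print(update_after_empty_search(priors, "A", p_detect))
# Cell A falls from 0.40 to about 0.12, while B, C, and D all rise.
```

Run the same update after every sortie and the probability map gradually concentrates on the cells you haven’t yet ruled out.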

As you can see, as Navy vessels complete searches in one area, the Bayesian model is updated. While the images below come from a study of the search for Air France Flight 447 in the Atlantic Ocean, the story is the same: we create a hypothesis, we test it, we gain new data, and we use that data to update the hypothesis. If you are curious about how Bayes was used in the successful recovery of Air France Flight 447, I highly recommend the 2015 METRON slides.

[Image: Prior hypothesis]

 

[Image: Posterior hypothesis after initial search]

Similar methodologies in large-scale search efforts have been well documented, including the search for the missing Malaysia Airlines Flight 370. Below are links to interesting use cases of Bayes in action.

Bayesian Methods in the Search for MH370

How Statisticians Found Air France Flight 447 Two Years After It Crashed Into Atlantic

Missing Flight Found Using Bayes’ Theorem?

Operations Analysis During the Underwater Search for Scorpion

Can A 250-Year-Old Mathematical Theorem Find A Missing Plane?

And yes, they found the bomb!


 

Background reference:

The Theory That Would Not Die, Sharon Bertsch McGrayne

https://en.wikipedia.org/wiki/1966_Palomares_B-52_crash

 

Image references:

https://sinews.siam.org/Details-Page/Bayesian-Search-for-Missing-Aircraft-Ships-and-People

https://www.motherjones.com/politics/2015/08/nuclear-weapon-obama-most-expensive-ever/

https://www.military.com/daily-news/2017/10/17/surge-wake-ship-collisions-tests-navy-new-deployment-plan.html

https://www.rogerebert.com/reviews/the-hunt-for-red-october-1990

http://thediplomatinspain.com/ecologistas-en-accion-lleva-al-supremo-la-radiactividad-de-palomares/

https://www.psychologyinaction.org/psychology-in-action-1/2012/10/22/bayes-rule-and-bomb-threats

https://navaltoday.com/#newsitem-123777

 

 

 


Interview with Radiant Solutions Data Scientist: Josh Lipsmeyer

Recently, Translating Nerd interviewed Josh Lipsmeyer about his work as a Software Engineer/Data Scientist. Josh has an interesting background: he leverages mathematics and physics to solve some pretty tough data-related problems. The conversation with this Arkansas native was lively as we dug deep into how he simplifies the data world for us. There are a few shenanigans in this podcast, since Josh and I go way back.


 

Bio: Josh Lipsmeyer is a Software Engineer/Data Scientist at Radiant Solutions, where he specializes in algorithm development, data science, and application deployment, primarily for customers in the Department of Defense. His primary interests are complex network resilience, information diffusion, and evolution. He has a background in mathematics and physics, holding an M.S. in mathematics from the University of Tennessee. He currently lives in Centreville, VA, with his wife and one-year-old son.

Toolkit: Critical thinking, Python, Spark, Java, Microsoft SQL Server, QGIS, STK, LaTeX, Tableau, Excel

Josh’s favorite online resources:

  • Udemy
  • Stack Overflow
  • ArXiv.org
  • Coursera
  • Open Source University Courses

Influential books about data:

Understanding Machine Learning: From Theory to Algorithms

How to get ahold of Josh:

https://www.linkedin.com/in/josh-lipsmeyer-769884b4/


Book Summary: Turning Text into Gold


 

Key Takeaways

Turning Text into Gold: Taxonomies and Textual Analytics, by Bill Inmon, covers a plethora of text analytics and Natural Language Processing foundations. Inmon makes it abundantly clear in the first chapter that organizations are underutilizing their data. He states that 98 percent of corporate decisions are based on only 20 percent of available data. That 20 percent, labeled structured data because it fits neatly into matrices, spreadsheets, and relational databases and is easily ingested into machine learning models, is well understood. Unstructured data, however, the text and words our world generates every day, is seldom used. Like the alchemists of the Middle Ages who searched for a way to turn ordinary metals into gold, Inmon describes a process for turning unstructured data into decisions: turning text into gold.

What the heck is a taxonomy?

Taxonomies are the dictionaries we use to tie the words in a document, book, or corpus of materials to a business-related understanding. For example, if I were a car manufacturer, I would build a taxonomy of car-related concepts so I could identify those concepts in the text. We then start to see repetition and patterns in the text, and we might spot new words related to car manufacturing, which we can add to the taxonomy. While the original taxonomy might capture 70 percent of the car-related words in a document, 90 percent is usually a business-appropriate level of coverage before moving from taxonomy/ontology work to database migration.
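Here is a toy sketch of that matching step in Python. The taxonomy terms and the sample sentence are invented for illustration; a real taxonomy would contain thousands of business-specific concepts.

```python
import re

# A tiny, invented taxonomy of car-manufacturing concepts
taxonomy = {"engine", "chassis", "transmission", "assembly line", "brake"}

text = "The new assembly line cut engine and transmission defects in half."

# Which taxonomy concepts appear in the raw text? (case-insensitive, whole phrases)
hits = {term for term in taxonomy
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)}

print(hits)  # {'assembly line', 'engine', 'transmission'}

# Domain words in the document that the taxonomy missed are candidates to add,
# which is how the taxonomy grows toward the ~90 percent coverage Inmon describes.
```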

Now what?

Once we have the necessary inputs from our long list of taxonomies, textual disambiguation begins: the raw text from the document is compared against the taxonomy we have created. When there is a match, that text is pulled from the document and stored for a processing stage, where we look for more distinct patterns in the extracted text. Using regular expressions, a pattern-matching technique in code, we can discern those finer patterns. We can then move the raw text into a matrix, what most people would recognize as a spreadsheet. Converting text into a matrix means turning words into numbers, and the resulting matrices can be quite large. While there are specific choices to make (e.g., sparse versus dense matrix representations), the goal is the same: make the text machine-readable. Words become zeros and ones, and analytical models can be applied to the document. Machine learning algorithms, such as offshoots of Bayes’ theorem (for example, naive Bayes classifiers) and other classification techniques, can then be used to categorize and cluster the text.
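As a rough sketch of that text-to-matrix step, here is a small Python example using scikit-learn’s CountVectorizer on a few invented sentences. The result is a sparse document-term matrix of the kind a classifier can ingest.

```python
# Rough sketch: turning raw text into a sparse document-term matrix
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "engine failure reported on the assembly line",
    "brake and transmission inspection passed",
    "engine and brake recall issued",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: rows = documents, columns = words

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())  # counts (mostly zeros and ones) that a model such as naive Bayes can ingest
```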

A simple example

Imagine you go to the ER one day, and a report is generated when you are discharged. This record holds many important elements of your medical history. However, having someone manually extract your name, address, medications, condition, treating doctor’s information, health vitals, and so on would take a lot of time, more time than a swamped hospital staff on a limited budget can spare. Text analytics is used to pull all of this information into a spreadsheet that can then be loaded into the hospital’s database. Add up enough of these records and you can start looking for patterns.


  1. Your visit to the ER is documented as text.
  2. The hospital takes a pre-defined “dictionary”, or taxonomy, of medical-related terms.
  3. The taxonomy is compared against your medical evaluation, and the matches are processed into a spreadsheet/matrix.
  4. The spreadsheet is uploaded into a relational database the hospital maintains.
  5. An analyst queries the database to build a machine learning model that can create value-added predictions.
  6. Based on the model, a prediction is produced that informs a decision.

 

Image sources:

www.medicalexpo.com
http://openres.ersjournals.com/content/2/1/00077-2015
https://www.sharesight.com/blog/ode-to-the-spreadsheet/

A conversation with RAND economist (and data scientist), Dr. Ben Miller

Last week I had the opportunity to drop by RAND to interview Dr. Ben Miller. Ben is an economist who specializes in econometric modeling and is at the leading edge of applied data science within the think tank world. Ben offers many insights into causal inference, discovering patterns, and uncovering signals in cloudy data. He explains the tools he uses, the methods he calls upon, and the manner in which he guides RAND’s “sponsors” to make informed decisions as they relate to national priorities.

As Ben says, “My toolkit is ‘applied econometrics,’ aka using data to estimate causal relationships.  So think of techniques like instrumental variables, differences-in-differences, regression discontinuities, etc.  Overall, it’s about putting a quantitative estimate on a relation between two variables in a way that is (hopefully!) unbiased, and at the same time understanding the uncertainty associated with that estimate.  A really approachable and widely respected introduction to that toolkit is Angrist & Pischke’s ‘Mostly Harmless Econometrics.'”
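As an illustration of one tool in that kit, here is a bare-bones difference-in-differences regression in Python on simulated data. This is my own sketch, not an example from Dr. Miller’s work; the data and effect size are made up.

```python
# Bare-bones difference-in-differences on simulated data:
# outcome ~ treated + post + treated:post, where the interaction term
# is the causal effect of interest.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),  # 1 = treatment group
    "post": rng.integers(0, 2, n),     # 1 = after the policy change
})
true_effect = 2.0
df["outcome"] = (1.0 + 0.5 * df["treated"] + 1.5 * df["post"]
                 + true_effect * df["treated"] * df["post"]
                 + rng.normal(0, 1, n))

model = smf.ols("outcome ~ treated * post", data=df).fit()
print(model.params["treated:post"])  # estimate should land near the true effect of 2.0
```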


 

Referenced works from the audio:

The Dog that Didn’t Bark: Item Non-response Effects on Earnings and Health

Models for Censored and Truncated Data

 

Bio: Ben Miller is an Associate Economist at the RAND Corporation and a Professor at the Pardee RAND Graduate School. His research spans a wide variety of topics including disaster mitigation, infrastructure finance, energy resilience, geospatial information, insurance, tax policy, regulatory affairs, health care supply, agriculture, econometrics, and beyond.  Recent publications examine the link between flood insurance prices and housing affordability, review federal policies surrounding transportation and water infrastructure finance, estimate the causal impact of weather warning systems on fatalities and injuries from tornadoes, and overview econometric techniques for determining the value of geospatial information. Prior to joining RAND, Miller worked as a statistician supporting the U.S. Census Bureau’s Survey of Income and Program Participation. He holds a Ph.D. in economics from the University of California, San Diego and a B.S. in economics from Purdue University.

LinkedIn: https://www.linkedin.com/in/mille419/

RAND: https://www.rand.org/about/people/m/miller_benjamin_m.html
