My goal in this blog is to set down an outline of the ways in which I see that decision science and data science are better together than apart. And by better, I mean more relevant to top-level decision makers, executives and the C-suite, at their respective companies and non-profits.

While in graduate school my research was in the area of decision making under uncertainty. The decisions concerned were really big ones, decisions so expensive to undo that it was often worth the cost of building in flexibility to handle uncertain futures. My applications were in capital-intensive product development, but the decision theory used in these scenarios is the same decision theory that can be applied to any important decision. The goal of these decisions was to maximize discounted utility, often in the form of discounted cash flows, under exogenous and endogenous uncertainties.

In the time before 'big data' these decisions were often made with little data, and making them was expensive, so the techniques were applied only to big decisions. In addition, what little data existed was sometimes too questionable to use for decision making, owing to uncertain futures and the fact that the past does not necessarily predict the future. Decision analysis produced and deployed many techniques: decision trees, influence diagrams, utility functions, subjective probability (that is, eliciting probabilities from experts), multi-criteria decision analysis, and more. The heroes of this journey were Ronald Howard, Howard Raiffa, Ralph Keeney, and others.

[Aside: One of my favorite cautionary tales about using predictions is the history of estimates of the speed of light (believed to be a universal constant). Confidence-interval estimates as late as 1940 did not overlap with what is now believed to be the true value.]

[Figure: speed of light measurements and confidence intervals over time]

Today, for some decisions, we have copious amounts of data, enough even to make 'accurate' predictions and probability estimates of outcomes. This is the realm of predictive analytics, and its tools are statistical and machine learning algorithms. Common industrial and non-profit questions are 'To what extent is a customer/donor at risk of leaving an offering?' (customer churn) and 'If I take an action, what is the probability that the target person will purchase/donate?'. Because we have data on many previous interventions and their results, we build models and use those models on unseen cases to make probability estimates.

These predictions still need to be used for decision making, and this is where decision analysis comes into play. Decision analysis lets us answer: given the probability of this customer leaving, the probability of them responding to an intervention, the cost of the intervention, and the expected benefit if the intervention succeeds, is it profitable to intervene for this customer? Now that the cost of making that decision is low, thanks to the data at hand and the strength of machine learning models, it can be made customer by customer. At a higher level, combining predictive analytics and decision analysis lets us answer more strategic questions such as 'Should we invest in a customer retention program, and if so, at what level?' Provost and Fawcett do a very good job of explaining how to use decision analysis and data science to answer these sorts of questions. At a more technical level, Slater Stitch provides a good example of making these types of decisions by combining machine learning and decision analysis.
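As a back-of-the-envelope sketch of that per-customer calculation (all of the probabilities and dollar figures below are invented for illustration, not from any real model):

```python
def intervention_value(p_churn, p_respond, benefit, cost):
    """Expected value of intervening for one customer: the benefit is
    realized only if the customer would otherwise churn AND responds
    to the offer; the cost is paid either way."""
    return p_churn * p_respond * benefit - cost

# Invented numbers: a 30% churn risk, a 40% chance of responding to a
# $10 offer, and $150 of future profit if the customer is retained.
ev = intervention_value(p_churn=0.30, p_respond=0.40, benefit=150.0, cost=10.0)
print(f"expected value of intervening: ${ev:.2f}")  # $8.00 -> intervene
```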


In the research arena, MIT's Prediction Analysis Lab, Cynthia Rudin, and other researchers are working on combining machine learning and decision making. In my prior experience I combined methods from operations research (constrained optimization problems) with machine learning by using machine learning predictions as forecast inputs to the decision (that is, optimization) problem. My particular application, optimizing asset utilization (trucks, people, other equipment), used forecasts of demand, which could often change within a day, sometimes making it optimal not to fully allocate all resources and instead let the day unfold. Rudin's group has taken this a step beyond that two-step linear process: their research uses the outcome of the optimization problem as feedback to the predictive modeling step, applies machine learning to robust optimization, and has established a (mathematical) connection between machine learning and classic operations research problems (papers are referenced below).
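As a toy sketch of that two-step pattern, with a linear program standing in for the allocation problem (the demand numbers, truck capacity, and fleet size are invented; my actual application was far richer):

```python
import numpy as np
from scipy.optimize import linprog

# Step 1 (machine learning): pretend a trained model produced these
# point forecasts of demand at three sites.
demand = np.array([120.0, 80.0, 150.0])
cap = 60.0     # capacity per truck
fleet = 3      # trucks available

# Step 2 (operations research): choose trucks x_i per site and served
# demand s_i to maximize total served demand.  linprog minimizes, so
# the objective is -sum(s).  Variables: [x1, x2, x3, s1, s2, s3].
c = np.concatenate([np.zeros(3), -np.ones(3)])
A_ub = np.vstack([
    np.hstack([-cap * np.eye(3), np.eye(3)]),        # s_i <= cap * x_i
    np.hstack([np.ones((1, 3)), np.zeros((1, 3))]),  # sum(x) <= fleet
])
b_ub = np.concatenate([np.zeros(3), [fleet]])
bounds = [(0, None)] * 3 + [(0, d) for d in demand]  # s_i <= forecast

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
x_trucks, served = res.x[:3], res.x[3:]
print(x_trucks, served.sum())  # allocation and total demand served
```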

Data science and decision science / operations research are linked. INFORMS, the professional society for operations research and management science, sensed this when it added analytics to its charter. The notion that it is one versus the other, or that one begins where the other ends, is narrow and ignores their shared history in common scientific and mathematical domains. Rudin and other researchers are demonstrating methods for systems optimization that combine the data and decision sciences. Data scientists and decision analysts who embrace this merger will strategically align their work to the goals of their businesses and non-profits. Maybe we need a new title for the people who do these super-human activities: rational decision scientist (no one has ever invited me to work in branding) or cyborg decision analyst. In any case, it is my experience and opinion that the combined forces of data science and decision analysis are greater than the sum of their parts.


Speed of light measurements graph courtesy of http://micro.magnet.fsu.edu/primer/lightandcolor/speedoflight.html

T. Tulabandhula and C. Rudin. Machine learning with operational costs. Journal of Machine Learning Research, 14:1989–2028, 2013.

T. Tulabandhula and C. Rudin. Generalization bounds for learning with linear, polygonal, quadratic and conic side knowledge. Machine Learning, pages 1–34, 2014.

T. Tulabandhula and C. Rudin. On combining machine learning with decision making. Machine Learning, 93:33–64, 2014.

T. Tulabandhula and C. Rudin. Robust optimization using machine learning for uncertainty sets. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2014.

The last post in this series covered loading the data, creating the prediction target, and doing some visual exploratory analysis on the problem of predicting remaining useful life for (simulated) turbofans. In this post I take one of the initial steps of predictive modeling: setting up cross validation.

Cross Validation

I chose to do 10-fold cross validation for this model. k-fold validation makes sense here because the training data set is relatively small, with only 100 turbofan units and 20,631 observations across all units. The folds are selected across units. A simple choice was to use the modulo of the unit number to make the folds:

Fold   Training Units                   Validation Units
1      2,3,4,5,6,7,8,9,10,12,13,14,…    1,11,21,31,…
2      1,3,4,5,6,7,8,9,10,11,13,14,…    2,12,22,32,…

Creating the fold column is easy, and the folds are then used by specifying the fold column by name during the model building phase.
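A minimal h2o-py sketch of both steps, assuming the column names used throughout this series (UnitNumber, RemainingUsefulLife), an assumed file name, and a GBM as a stand-in for whatever model is being built:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train_FD001.csv")  # assumed file name/location

# Create the fold column: units 1, 11, 21, ... share a fold, and so on,
# giving 10 folds split across units rather than across rows.
train["fold"] = train["UnitNumber"] % 10

# Use the fold column by name when training; H2O then performs the
# 10-fold cross validation during model building.
features = [c for c in train.columns
            if c not in ("UnitNumber", "fold", "RemainingUsefulLife")]
gbm = H2OGradientBoostingEstimator()
gbm.train(x=features, y="RemainingUsefulLife",
          training_frame=train, fold_column="fold")
```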


This setup for cross validation meshes well with the target of the exercise, which is to predict the remaining useful life of unseen turbofan units. When working with IoT / time series / spatial data, I find that setting up the cross validation strategy is key. Many considerations go into this decision, and it is problem dependent. Sometimes one wants to memorize the existing data so that nearby data in space-time can be predicted; in this case leveraging autocorrelation can be useful for the prediction step. In other cases one wants to make predictions on unseen scenarios and needs to be careful to avoid issues of autocorrelation. A general-purpose version of the unit-wise split used here is sketched below.
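Outside of H2O, the same unit-wise split can be written with scikit-learn's GroupKFold; here is a self-contained sketch on synthetic stand-in data (in the real problem the features, target, and groups come from the turbofan frames):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data: 100 units with 200 cycles each.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(100), 200)   # unit number for every row
X = rng.normal(size=(groups.size, 5))     # fake sensor features
y = rng.normal(size=groups.size)          # fake RUL target

# GroupKFold keeps all of a unit's cycles in the same fold, so every
# validation score measures performance on entirely unseen units.
gkf = GroupKFold(n_splits=10)
for train_idx, valid_idx in gkf.split(X, y, groups=groups):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    print(model.score(X[valid_idx], y[valid_idx]))
```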

The next post in this series will cover model evaluation and customized cross validation scoring for this use case.


All of the code for this work is available in my GitHub repository for this project. I have previously presented this material at MLConf Atlanta and the Unstructured Data Science Pop-Up Seattle, with the support of H2O.ai.

In the last post I set up the problem of predicting the remaining useful life of turbofans from the number of operational cycles completed and snapshot sensor measures. In this post I do some exploratory analysis on the simplest dataset: one operating mode and one failure mode.

First up was some preprocessing. The data presents the cycle number, incrementing by one for each cycle a turbofan is operated. For example, turbofan 1 has operating settings and sensor measurements for cycles 1, 2, 3, … up to the last cycle of its life.

[Figure: first and last rows of the raw training data]

I wanted (needed) all turbofans in the training data set to be counting down the number of cycles remaining in their useful life. I wanted it because it makes some of the exploratory data analysis easier to interpret, and I needed it because it is the target variable. A little function does the trick. As a side benefit, this function demonstrates a common feature engineering pattern: group the data (here, by unit number), calculate some metric on each group (here, the maximum number of cycles completed), and then join (merge) the metric back to the original frame as a new feature. This particular feature took a couple of extra steps, but the overall pattern is common.
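A sketch of that function in h2o-py (the column names are assumptions carried through this series, and max_Cycle is what I believe H2O's group-by names the aggregate column by default):

```python
def add_remaining_useful_life(frame):
    """Common feature engineering pattern: group, aggregate, merge back."""
    # Group by unit and compute the metric: each unit's last cycle.
    max_cycles = frame.group_by("UnitNumber").max("Cycle").get_frame()
    # Merge the per-unit metric back onto every row (joins on UnitNumber).
    frame = frame.merge(max_cycles)
    # New feature/target: cycles remaining before the unit's last cycle.
    frame["RemainingUsefulLife"] = frame["max_Cycle"] - frame["Cycle"]
    return frame

train = add_remaining_useful_life(train)
```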


Finally, for this post, we will look at how a few of the sensor measures for a few of the machines proceed over time. Here I took a sample of 10 turbofan units from H2O into a Pandas data frame and used Seaborn for the plotting.
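A sketch of that hand-off, taking the first ten units rather than a random sample and using a modern seaborn call (column names as before are assumptions):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pull ten units out of the H2O frame into pandas for plotting.
sample = train[train["UnitNumber"] <= 10, :].as_data_frame()

# One line per unit, tracking a single sensor as RUL counts down.
sns.lineplot(data=sample, x="RemainingUsefulLife", y="SensorMeasure4",
             hue="UnitNumber")
plt.gca().invert_xaxis()  # read left-to-right toward failure at RUL = 0
plt.show()
```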

[Figure: selected sensor measures over remaining useful life for 10 sampled units]

Some sensors show trending over time (e.g., SensorMeasure4), some are constant (e.g., SensorMeasure1), some trend over operating cycles but in different trajectories (e.g., SensorMeasure9), and others seem to change but with no discernible trend (e.g., SensorMeasure6). Other measures take on discrete values, like SensorMeasure17, and do demonstrate a trend over cycles (see below).

The good news is that there are trends over the target variable, RemainingUsefulLife. This gives us some promise that we can build a model to predict how many cycles are left in each turbofan's RUL. That will come in the next post, along with the tips needed to cross validate the models.

[Figure: discrete-valued SensorMeasure17 trending over operating cycles]


All of the code for this work is available in my GitHub repository for this project. I have previously presented this material at MLConf Atlanta and the Unstructured Data Science Pop-Up Seattle, with the support of H2O.ai.

One of the projects I recently finished was predictive modeling of turbofan engine degradation. The goal is to predict the remaining useful life, the remaining number of cycles until a turbofan engine can no longer perform up to requirements, from sensor information and the number of cycles completed. The original dataset was published by NASA's Prognostics Center of Excellence [1].

[Figure: turbofan operation diagram [2]]

I particularly liked this data set because it gave me a chance to combine some of my favorite things: data mining, machine learning, data from sensors, and physical things. In general the problem falls under the engineering domain of prognostics. The data simulates different degrees of initial wear and manufacturing variability for a series of engines. Each engine develops a fault at some point, speeding up its degradation. From the data one knows the operational settings of the turbofan for each cycle and a snapshot measurement of each of 21 sensors during that cycle. Sensor noise and bias are also simulated.

There are four datasets available:

- FD001: 1 operating condition and 1 failure mode
- FD002: 6 operating conditions and 1 failure mode
- FD003: 1 operating condition and 2 failure modes
- FD004: 6 operating conditions and 2 failure modes

Now that the groundwork has been laid, the next part will focus on exploratory data analysis for FD001. Over the course of this series I will investigate and model the first (simplest) and last (most complex) data sets, including the use of Kalman filters to ensemble multiple machine learning models and tips and tricks for modeling and cross validation of time series data.

To be continued…



All of the code for this work is available in my GitHub repository for this project. I have previously presented this material at MLConf Atlanta and the Unstructured Data Science Pop-Up Seattle, with the support of H2O.ai.

[1] A. Saxena and K. Goebel (2008). “Turbofan Engine Degradation Simulation Data Set”, NASA Ames Prognostics Data Repository (http://ti.arc.nasa.gov/project/prognostic-data-repository), NASA Ames Research Center, Moffett Field, CA

[2] Image courtesy “Turbofan operation” by K. Aainsqatsi – Own work. Licensed under CC BY 2.5 via Commons – https://commons.wikimedia.org/wiki/File:Turbofan_operation.svg#/media/File:Turbofan_operation.svg

You've found my new blog, DataScientistInABox.com. Thank you for visiting; I hope to make it worth your time. I plan to bring my unique perspective, combining data science, systems thinking, and decision analysis.

[Figure: my career timeline]

After completing a degree in physics, like many physicists of the time, I entered the world of software and systems development. My early work included making (pseudo) random number generators with the right mix of performance and "randomness", business intelligence tools for data warehouses, social media sites, online business-to-business systems, and Internet-based project collaboration software. Employers and clients included Procter & Gamble, National Car Rental, Perot Systems, and Red Sky Interactive.

In September 2001 an opportunity at John Deere presented itself; I was employed at Deere until April 2015 when I left to join H2O.ai.

At Deere I developed software products in the areas of B2B exchanges, food traceability, logistics and coordination, and geospatial data management and insights.

In 2010 and 2011 I started as a Systems Design and Management fellow at MIT and began working in Deere's central research group, respectively. My research at MIT focused on optimal decision making under uncertainty, applied to complex systems development efforts. My research at Deere had the same flavor until I became very focused on the impact that algorithms, artificial intelligence, and machine learning would have on people, companies, and society at large.

Deciding that talking about "the rise of the machines" was no substitute for doing actual work, since 2012 data science has been my sole professional focus. I've worked in the areas of consumer research, soil physics and machine optimization (combining probabilistic physics models with sensor measurements for model development and prediction), logistics optimization meshed with predictive models, customer churn and adoption, market segmentation, transaction fraud detection and prevention, the Internet of Things, applied machine learning, data mining, and more.

The more closely I work in these technical areas, the more I have been able to bring a systems thinking perspective to the table. That is the perspective I plan to present here.
