The last post in this series covered loading the data, creating the prediction target, and doing some visual exploratory analysis on the problem of predicting remaining useful life for (simulated) turbofans.  In this post I will take one of the initial steps in predictive modeling: setting up cross validation.

Cross Validation

I chose to do 10-fold cross validation for this model.  k-fold cross validation was chosen because the training data set is relatively small, with only 100 turbofan units and 20,631 observations across all units.  The folds are split across units, so that every observation from a given unit lands in the same fold.  A simple choice was to use the modulo of the unit number to assign the folds:

Fold   Training Units                       Validation Units
1      2, 3, 4, 5, 6, 7, 8, 9, 10, 12, …    1, 11, 21, 31, …
2      1, 3, 4, 5, 6, 7, 8, 9, 10, 11, …    2, 12, 22, 32, …

Creating the fold column is easily done, and it is used by specifying the fold column by name during the model-building phase:
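A minimal sketch of that step in pandas, assuming hypothetical column names UnitNumber and fold; the commented lines show how the fold column would then be referenced by name when training an H2O model:

```python
import pandas as pd

# Toy stand-in for the training frame: a few units, a few cycles each.
df = pd.DataFrame({
    "UnitNumber": [1, 1, 2, 2, 11, 11, 21, 21],
    "Cycle":      [1, 2, 1, 2, 1,  2,  1,  2],
})

# Assign each unit to one of 10 folds via modulo of the unit number,
# so every observation from a given unit lands in the same fold.
df["fold"] = df["UnitNumber"] % 10

# Units 1, 11, and 21 all share fold 1; unit 2 falls in fold 2.
print(df.drop_duplicates("UnitNumber")[["UnitNumber", "fold"]])

# With H2O, the fold column is then passed by name during training, e.g.:
# model = H2OGradientBoostingEstimator(...)
# model.train(x=features, y="RemainingUsefulLife",
#             training_frame=train, fold_column="fold")
```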



This setup for cross validation meshes well with the target of the exercise, which is to predict the remaining useful life of unseen turbofan units.  I find when working with IoT / time series / spatial data that setting up the cross validation strategy is key.  Many considerations go into this decision, and it is problem dependent.  Sometimes one wants to memorize the existing data so that nearby data in space-time can be predicted; in this case leveraging autocorrelation can be useful for the prediction step.  In other cases one wants to make predictions on unseen scenarios and needs to be careful to avoid issues of autocorrelation.

The next post in this series will cover model evaluation and customized cross validation scoring for this use case.

All of the code for this work is available in my GitHub repository for this project.  I have previously presented this material at MLConf Atlanta and the Unstructured Data Science Pop-Up in Seattle, with the support of



In the last post I set up the problem of predicting remaining useful life of turbofans based on information about the number of operational cycles and snapshot sensor measures.  In this post I will do some exploratory analysis for the simplest dataset: one operating mode and one failure mode.

First up was some preprocessing.  The data presents the cycle number, incrementing by one for each cycle a turbofan is operated.  For example, turbofan 1 will have operating settings and sensor measurements for cycles 1, 2, 3, … up to the last cycle it completed.


I wanted (needed) all turbofans in the training data set to be counting down the number of cycles they had in their remaining useful life.  I wanted it because it makes some of the exploratory data analysis easier to interpret, and I needed it because it is the target variable.  A little function does the trick.  As a side benefit, this function demonstrates a common pattern of feature engineering:  do some grouping on the data (here, by unit number), calculate some metric on each group (here, the maximum number of cycles completed), and then join (merge) the metric back to the original frame as a new feature.  This new feature took a couple of extra steps, but the overall pattern is common.
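A sketch of that group / aggregate / merge pattern in pandas, with hypothetical column names UnitNumber, Cycle, and RemainingUsefulLife (the original code operates on H2O frames, but the pattern is the same):

```python
import pandas as pd

def add_remaining_useful_life(df):
    """Group by unit, find each unit's max cycle, merge it back,
    and derive the countdown to the last observed cycle."""
    # 1. Group by unit and compute the metric: maximum cycle completed.
    max_cycle = (df.groupby("UnitNumber")["Cycle"]
                   .max()
                   .reset_index()
                   .rename(columns={"Cycle": "MaxCycle"}))
    # 2. Join (merge) the metric back onto the original frame.
    df = df.merge(max_cycle, on="UnitNumber", how="left")
    # 3. New feature / target: cycles remaining until the final cycle.
    df["RemainingUsefulLife"] = df["MaxCycle"] - df["Cycle"]
    return df.drop(columns="MaxCycle")

# Toy example: unit 1 ran 3 cycles, unit 2 ran 2 cycles.
toy = pd.DataFrame({"UnitNumber": [1, 1, 1, 2, 2],
                    "Cycle":      [1, 2, 3, 1, 2]})
out = add_remaining_useful_life(toy)
# RemainingUsefulLife counts down to zero: [2, 1, 0, 1, 0]
```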


Finally, for this post, we will look at how a few of the sensor measures for a few of the machines proceed over time.  Here I took a sample of 10 turbofan units from H2O into a Pandas data frame and used Seaborn for the plotting.


Some sensors show trending over time (e.g., SensorMeasure4), some are constant (e.g., SensorMeasure1), some trend over operating cycles but in different trajectories (e.g., SensorMeasure9), and others seem to change but in no discernible trend (e.g., SensorMeasure6). Other measures take on discrete values, like SensorMeasure17, and do demonstrate a trend over cycles.

The good news is that there are trends over the target variable, RemainingUsefulLife.  This gives us some promise that we can build a model to predict how many cycles are left in each turbofan’s RUL.  This will be in the next post, along with tips needed to cross validate the models.



One of the recent projects I have finished was predictive modeling of turbofan engine degradation.  The goal is to predict the remaining useful life, i.e., the remaining number of cycles until a turbofan engine can no longer perform to requirements, using information from sensors on the turbofan and the number of cycles completed.  The original dataset was published by NASA’s Prognostics Center of Excellence [1].


I particularly liked this data set as it gave me a chance to combine some of my favorite things: data mining, machine learning, data from sensors, and physical things.  In general the problem falls under the engineering domain of prognostics.  The data simulates different degrees of initial wear and manufacturing variability for a series of engines.  Each engine develops a fault at some point, speeding up the degradation.  From the data one knows the operational settings of the turbofan for each cycle and a snapshot measurement of each of 21 sensors during that cycle.  Sensor noise and bias are also simulated.

There are four datasets available:
FD001: 1 operating condition, 1 failure mode
FD002: 6 operating conditions, 1 failure mode
FD003: 1 operating condition, 2 failure modes
FD004: 6 operating conditions, 2 failure modes

Now that the groundwork has been laid, the next part will focus on exploratory data analysis for FD001.  Over the course of this series I will investigate and model the first (simplest) and last (most complex) data sets, including the use of Kalman filters for ensembling multiple machine learning models together and tips and tricks for modeling and cross validation of time series data.

To be continued…



[1] A. Saxena and K. Goebel (2008). “Turbofan Engine Degradation Simulation Data Set”, NASA Ames Prognostics Data Repository, NASA Ames Research Center, Moffett Field, CA.

[2] Image courtesy “Turbofan operation” by K. Aainsqatsi, own work, licensed under CC BY 2.5 via Wikimedia Commons.
