The last post in this series covered loading the data, creating the prediction target, and doing some visual exploratory analysis for the problem of predicting remaining useful life for (simulated) turbofans. In this post I take an initial step in predictive modeling: setting up cross validation.
I chose to do 10-fold cross validation for this model. k-fold validation was chosen because the training data set is relatively small, with only 100 turbofan units and 20631 observations across all units. The folds are selected across units; a simple choice is to use the modulo of the unit number to make the folds:
| Fold | Training Units | Validation Units |
|------|----------------|------------------|
| k (0–9) | units where unit number mod 10 ≠ k | units where unit number mod 10 = k |
Creating the fold column is straightforward, and it is used by specifying the fold column by name during the model building phase:
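A minimal sketch of this fold assignment, using pandas for illustration (the column names `unit` and `fold` are assumptions, not necessarily those in the original project, which works with H2O frames):

```python
import pandas as pd

# Hypothetical stand-in data: 100 turbofan units, a few observations each
df = pd.DataFrame({"unit": [u for u in range(1, 101) for _ in range(3)]})

# Assign each unit to one of 10 folds via modulo of the unit number,
# so every observation from a given unit lands in the same fold
df["fold"] = df["unit"] % 10

# In H2O, a fold column like this is passed by name at training time,
# e.g. model.train(x=..., y=..., training_frame=..., fold_column="fold")
print(df.groupby("fold")["unit"].nunique())
```

With 100 units this assignment puts 10 distinct units in each of the 10 folds, and no unit ever appears in more than one fold.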
This setup for cross validation meshes well with the goal of the exercise, which is to predict the remaining useful life of unseen turbofan units. When working with IoT, time series, or spatial data, I find that setting up the cross validation strategy is key. Many considerations go into this decision, and it is problem dependent. Sometimes one wants to memorize the existing data so that nearby data in space-time can be predicted; in that case, leveraging autocorrelation can be useful for the prediction step. In other cases one wants to make predictions on unseen scenarios and needs to be careful to avoid issues of autocorrelation.
The next post in this series will cover model evaluation and customized cross validation scoring for this use case.
All of the code for this work is available in my GitHub repository for this project. I have previously presented this material at MLConf Atlanta and at the Unstructured Data Science Pop-Up in Seattle, with the support of H2O.ai.