Data points acquired as a series of readings over time can cause additional modeling difficulties. The problem is subtle, but it is common whenever the underlying process changes slowly compared to the rate at which the data is taken. In that situation, the inputs and outputs of each data point are very close to those of the adjacent data points in time. That means the output of one data point can be predicted accurately just by finding its near neighbors in time and averaging their outputs. How can a learner trick itself into doing that? Because the inputs of the near neighbors in time are similar, an approximation method that is very local in input space will effectively identify the near neighbors in time. Why is this bad? When such a model is used to predict new data in the future, it fails because it no longer has the near neighbors in time to look at. In fact, some of those near neighbors lie further into the future than the point being predicted, so there is no hope of them ever being available. Note that when LOO-XVE is used, or when a random subset of points is held out (as is the default for the CV Police), the near neighbors in time, including those occurring after the query point, are usually available in the training data.
We will use Vizier to observe an instance of this problem and see how it can be avoided. The data file for this example comes from the semiconductor industry. The sixteen input attributes are various sensor readings available during an etching process. The output variable is a measure of quality that only becomes available much later in the processing. The factory managers would like to estimate the quality from the sensor readings; then they could discard the bad wafers immediately rather than spending precious resources finishing the processing, only to find at the end that they are defective. The data file has the time series problem because the readings are taken from a series of wafers as they go through production, and the sensor readings and quality of successive wafers are similar.
File -> Open -> semitrain.mbl
Edit -> Metacode -> A31:{9}
Model -> Blackbox -> Launch!
The actual metacode chosen by Blackbox will vary depending on the speed of your computer (because Blackbox runs for a chosen number of seconds). The following description is based on one particular run, but your results should be qualitatively similar. Blackbox chooses the metacode A30:FN:99----99-----999-. It reports that simply choosing the global average output would result in an error of 1.387. The chosen metacode gets an error of 0.535, which is a 59.8% improvement. At this point a factory manager might feel that he can do a good job of predicting wafer quality from his sensors.
To see the problems this would cause, we have a separate data file taken from the same process, but much later in time. We can use the batchpredict command to see how well the quality of these future wafers can be predicted using the original data set and the metacode found by Blackbox.
Edit -> Metacode -> A30:FN:99----99-----999-
Model -> Batchpredict -> Show_errors ON
Testfile: semitest.mbl
Batch Predict
It turns out that the error for predicting this data set is 1.667, which is even worse than always predicting the average output! This is exactly the problem described above. If the factory manager had implemented his plan, he would have caught very few of the defective wafers and discarded many good ones.
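"Worse than always predicting the average" is worth making concrete. Below is a small hypothetical helper (not part of Vizier; the function name and the toy numbers are invented for illustration) that compares a model's mean absolute error against the constant predict-the-training-mean baseline. When the process has drifted, a model that echoes training-range values can lose to that trivial baseline, just as the 1.667 error here loses to 1.387.

```python
def baseline_comparison(train_outputs, test_outputs, predictions):
    """Compare a model's mean absolute error on the test set against the
    constant baseline that always predicts the training-set mean output."""
    mean = sum(train_outputs) / len(train_outputs)
    baseline_err = sum(abs(t - mean) for t in test_outputs) / len(test_outputs)
    model_err = sum(abs(t - p)
                    for t, p in zip(test_outputs, predictions)) / len(test_outputs)
    return model_err, baseline_err

# Toy illustration: the process drifted, so test outputs (10, 11) sit far
# from the training outputs (1..3). A model still predicting values near
# the training range does worse than the constant-mean baseline.
model_err, baseline_err = baseline_comparison(
    train_outputs=[1.0, 2.0, 3.0],
    test_outputs=[10.0, 11.0],
    predictions=[1.0, 1.5])
print(model_err > baseline_err)  # the model loses to the trivial baseline
```

A model error above the baseline error means the learned structure was an artifact of the training period, not a real input-output relationship.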
There is a way to safeguard against this when modeling time series data. The data set should be manually separated so that the test set comes from a single contiguous section of time, or a second data set can be collected at a completely different time. Then the CV Police can be set to use that second data set rather than randomly subsampling from the training data set. Although we don't do it here, we would get the most efficient searching by separating out a third data set and giving it to the CV Expert.
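The manual separation step can be sketched in a few lines. This is a generic illustration, not a Vizier feature: the function name is hypothetical, and the point is simply that the held-out block must be a contiguous slice of time (like semitest.mbl relative to semitrain.mbl), never a random subsample.

```python
def chronological_split(records, test_fraction=0.25):
    """Hold out the final contiguous block of a time-ordered data set as
    the test set, instead of a random subsample of the whole series."""
    assert 0.0 < test_fraction < 1.0
    cut = int(len(records) * (1 - test_fraction))
    return records[:cut], records[cut:]

# Stand-in for time-ordered wafer records: indices double as timestamps.
wafers = list(range(100))
train, test = chronological_split(wafers)
# train covers wafers 0..74 and test covers wafers 75..99, so the test
# set plays the role of "future" data the model has never seen.
```

Because the test block lies entirely after the training block, no test point has near neighbors in time hiding in the training data, and the measured error reflects real deployment conditions.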
Edit -> Metacode -> A31:{9}
Model -> Blackbox -> Reset ON
CV Police Use testset
Testset semitest.mbl
Launch!
This time Blackbox chooses the metacode L90:9, which is global linear regression on all the attributes. It now reports that it can achieve only a 19.8% reduction in error over just guessing the average every time. This is very disappointing, but it is also a correct analysis. As it turns out, the sensor readings in this data are not good indicators of the final wafer quality, and a learner that claimed otherwise would be misleading its user.