Overfitting detection in ndCurveMaster
Overfitting occurs when a statistical model has too many parameters relative to the size of the sample from which it was built. It is a problem found primarily in machine learning and does not usually arise with regression models.
But ndCurveMaster offers advanced algorithms that allow the user to build complicated multivariable models to accurately describe empirical data. Overfitting may occur under these conditions.
In regression analysis with one independent variable, overfitting is easy to detect by inspecting the graph:
But in a statistical analysis with many variables, overfitting cannot be detected in this way, because the fitted model cannot be drawn as a single curve.
Therefore, an overfitting detection technique has been implemented in ndCurveMaster. The test set method is used to detect overfitting. ndCurveMaster may randomly choose part of the data and use it in a test set:
Next, ndCurveMaster performs the regression using the remaining data. Finally, ndCurveMaster detects overfitting by comparing the RMS error of the test set with the RMS error of the dataset.
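The test set method described above can be sketched in a few lines of Python with NumPy. This is only an illustration of the general technique, not ndCurveMaster's internal implementation: the function names `rms_error` and `holdout_rms`, the 20% test fraction, the synthetic data, and the use of ordinary least squares are all assumptions made for the example.

```python
import numpy as np

def rms_error(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def holdout_rms(X, y, test_fraction=0.2, seed=0):
    """Fit least squares on a random subset and return (dataset_rms, test_rms)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                 # random order of the rows
    n_test = max(1, int(round(test_fraction * len(y))))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # Regression uses only the remaining (non-test) data
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    dataset_rms = rms_error(y[train_idx], X[train_idx] @ coef)
    test_rms = rms_error(y[test_idx], X[test_idx] @ coef)
    return dataset_rms, test_rms

# Synthetic one-variable data: a straight line plus noise (assumed for the demo)
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=30)

X_simple = np.vander(x, 2)     # columns: x, 1
X_complex = np.vander(x, 10)   # polynomial of degree 9

# Same seed, so both models are judged on the same random test set
simple_train, simple_test = holdout_rms(X_simple, y, seed=0)
complex_train, complex_test = holdout_rms(X_complex, y, seed=0)

print(f"simple : dataset RMS={simple_train:.4f}  test RMS={simple_test:.4f}")
print(f"complex: dataset RMS={complex_train:.4f}  test RMS={complex_test:.4f}")
```

A test set RMS error that is much larger than the dataset RMS error is the warning sign. The more flexible model always fits the data it was built from at least as well as the simple one, which is why the dataset error alone cannot reveal overfitting.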
Here is an example multivariable regression model:
Y= a0 + a1*exp(x1) + a2*x2^-8 + a3*x3^5.6 + a4*x4^-1 + a5*x5^9 + a6*x6^4.1 + a7*exp(x6)^3 + a8*x5^16 + a9*(1/2)^(x4) + a10*x2^-6 + a11*x1^1.9 + a12*x3^5.2 + a13*x1^-11 + a14*(ln(x3))^8 + a15*exp(x5)^-1 + a16*x4^1.9 + a17*x6^16 + a18*x2^10 + a19*exp(x3)^5 + a20*(ln(x6))^2 + a21*x4^-4 + a22*exp(x5)^2 + a23*x4^4.2 + a24*x5^15 + a25*x6^15 + a26*x3^12
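In a model of this shape the coefficients a0, a1, ... enter linearly, while each term is a fixed nonlinear transformation of one variable. The sketch below shows how such a model could be fitted by linear least squares once every term is computed as a column of a design matrix. It uses only the first four terms of the model above, and the synthetic data, the coefficient values, and the helper `design_matrix` are hypothetical; nothing here is taken from ndCurveMaster itself.

```python
import numpy as np

# First four terms of the example model, one function per coefficient.
# x is an array with one column per independent variable (x1, x2, x3).
basis = [
    lambda x: np.ones(x.shape[0]),   # a0
    lambda x: np.exp(x[:, 0]),       # a1*exp(x1)
    lambda x: x[:, 1] ** -8.0,       # a2*x2^-8
    lambda x: x[:, 2] ** 5.6,        # a3*x3^5.6
]

def design_matrix(x, basis):
    """Stack each transformed term as one column."""
    return np.column_stack([f(x) for f in basis])

# Synthetic inputs kept positive so that every power is defined (assumption)
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 2.0, size=(50, 3))
true_coef = np.array([0.5, 1.0, -2.0, 0.3])   # made-up coefficients

A = design_matrix(x, basis)
y = A @ true_coef                              # noise-free response
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # estimated a0..a3
print(coef)
```

With 27 coefficients, as in the full model, the same approach applies, but the number of columns grows quickly relative to the number of data points, which is the situation in which overfitting becomes likely.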
Standard statistical analysis of the data set does not detect overfitting:
But ndCurveMaster can also check test data and data set RMS errors:
The test set RMS error is 9.55, while the dataset RMS error is only 0.138.
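To put the two numbers side by side, the ratio can be worked out as below. The RMS formula shown is the usual textbook definition and is an assumption here; the exact formula ndCurveMaster uses is not stated in this text.

```latex
\mathrm{RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\frac{\mathrm{RMS}_{\text{test}}}{\mathrm{RMS}_{\text{dataset}}}
  = \frac{9.55}{0.138} \approx 69.2
```

The model's error on the points it has not seen is roughly 69 times larger than its error on the points used to build it.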
The overfitting is clearly shown in the graph below. The blue points represent the dataset and the yellow points represent the test set:
The graph shows that the model fits the data set points almost perfectly but fits the test set points poorly.