ndCurveMaster

ndCurveMaster Utilizes Machine Learning for Equation Discovery

ndCurveMaster combines random and iterative searches to find the best equations. The process starts with a random search that runs for a specified time and generates several models. The three models with the lowest RMSE are then refined through random iterative searches. The user can configure the search by setting the duration of the first phase and the type of search used in the second phase.
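The two phases described above can be sketched roughly as follows. This is a minimal illustration, not ndCurveMaster's actual implementation: it assumes a single positive predictor, a one-term model y = a0 + a1*x^p, and fixed iteration counts (`n_random`, `n_refine`) standing in for the user-selected time limit. All function and parameter names are hypothetical.

```python
import random

def fit_rmse(xs, ys, p):
    """Fit y = a0 + a1 * x**p by ordinary least squares and return the RMSE.
    Assumes every x is positive so that x**p is real."""
    ts = [x ** p for x in xs]
    n = len(xs)
    mt, my = sum(ts) / n, sum(ys) / n
    stt = sum((t - mt) ** 2 for t in ts)
    if stt == 0.0:                      # transformed predictor is constant
        return float("inf")
    a1 = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / stt
    a0 = my - a1 * mt
    return (sum((a0 + a1 * t - y) ** 2 for t, y in zip(ts, ys)) / n) ** 0.5

def two_phase_search(xs, ys, n_random=200, n_refine=100, seed=0):
    """Return (rmse, exponent) of the best model found."""
    rng = random.Random(seed)

    # Phase 1: random search. The iteration count is an assumption that
    # stands in for the time limit mentioned in the text.
    candidates = []
    for _ in range(n_random):
        p = rng.uniform(-4.0, 4.0)
        candidates.append((fit_rmse(xs, ys, p), p))
    best_three = sorted(candidates)[:3]  # three lowest-RMSE models

    # Phase 2: refine each of the three by small random perturbations,
    # keeping a perturbed exponent only when it lowers the RMSE.
    refined = []
    for rmse, p in best_three:
        for _ in range(n_refine):
            q = p + rng.gauss(0.0, 0.1)
            r = fit_rmse(xs, ys, q)
            if r < rmse:
                rmse, p = r, q
        refined.append((rmse, p))
    return min(refined)
```

Calling `two_phase_search(xs, ys)` on data generated from a power law should return an exponent close to the true one; changing `seed` changes the path the search takes.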

Detection of Multicollinearity in ndCurveMaster

ndCurveMaster provides multicollinearity detection to improve the quality of the models it creates, using two tools:

Variance Inflation Factor (VIF)

The VIF index is widely used for detecting multicollinearity (more information can be found on Wikipedia). There is no strict VIF threshold that establishes the presence of multicollinearity. VIF values above 10 are often taken as an indication of multicollinearity, but in weaker models values above 2.5 may already be a cause for concern. ndCurveMaster calculates the VIF value of each predictor in every model and displays it in the last column of the regression analysis table, as shown below:

ndCurveMaster Variance Inflation Factor VIF

In addition, ndCurveMaster offers a search option that limits the VIF of the models it returns. The user can select the "VIF cannot exceed" checkbox to display only models whose VIF values do not exceed the selected limit. The default VIF limit is 10, as shown below:

ndCurveMaster VIF button

The user can adjust the limit as needed.
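For readers who want to reproduce the VIF figures outside the program, here is a minimal sketch. It relies on the standard identity that the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix, which is equivalent to 1/(1 - R_j^2). The function names and the `limit` parameter are illustrative assumptions, not ndCurveMaster's API.

```python
def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF of each predictor; `columns` is a list of predictor columns."""
    k = len(columns)
    corr = [[pearson(columns[i], columns[j]) for j in range(k)]
            for i in range(k)]
    inv = invert(corr)
    return [inv[j][j] for j in range(k)]

def within_vif_limit(columns, limit=10.0):
    """Mimics a "VIF cannot exceed" filter: True if no VIF is above limit."""
    return all(value <= limit for value in vif(columns))
```

With two predictors the result reduces to 1/(1 - r^2) for both, where r is their Pearson correlation, which makes a convenient sanity check.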

Pearson Correlation Matrix

The Pearson Correlation Matrix window displays Pearson Correlation coefficients between each pair of variables, as shown below:

ndCurveMaster Pearson Correlation matrix

Examining the correlations between variables is the simplest way to detect multicollinearity. In general, an absolute correlation coefficient above 0.7 between a pair of predictors indicates the presence of multicollinearity.
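The pairwise check can be sketched in a few lines. The 0.7 threshold comes from the text above; the function names and the dictionary layout of the input are assumptions made for this example.

```python
from itertools import combinations

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair of variables whose
    absolute Pearson correlation exceeds the threshold.
    `columns` maps a variable name to its list of values."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged
```

A pair that is flagged here is a candidate for removing or combining one of the two variables before fitting.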


Detecting Overfitting in ndCurveMaster

Overfitting occurs when a statistical model has too many parameters relative to the size of the sample used to build it. It is a common problem in machine learning and is less often a concern for simple regression models with few parameters. However, ndCurveMaster's advanced algorithms make it possible to build complex multi-variable models, which can also be susceptible to overfitting.

In regression analysis with one independent variable, overfitting can be easily detected by examining the graph:

ndCurveMaster Overfitting Curve

With many variables, however, overfitting cannot be detected this way. ndCurveMaster therefore implements an overfitting detection technique based on the test set method. The software randomly selects part of the data for the test set:

ndCurveMaster Overfitting Program Option

and uses the remaining data for regression analysis. Overfitting is detected by comparing the root mean square (RMS) errors of the test set and the entire dataset.
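The test set method described above can be sketched as follows. This is an illustration under stated assumptions rather than the program's own code: the `test_fraction` and `seed` parameters, the function names, and the straight-line `fit_line` model are all hypothetical, and any function that takes training data and returns a predictor could be passed as `fit`. The source does not state which error ratio triggers a warning, so the sketch only returns the two errors.

```python
import random

def rms_error(model, xs, ys):
    """Root mean square error of `model` on the given points."""
    return (sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

def fit_line(xs, ys):
    """Least-squares straight line, used here only as a stand-in model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def holdout_check(xs, ys, fit, test_fraction=0.2, seed=0):
    """Randomly hold out part of the data, fit on the remainder, and
    return (test-set RMS error, entire-dataset RMS error)."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    n_test = max(1, round(test_fraction * len(xs)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # The model never sees the test points during fitting.
    model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])

    test_rms = rms_error(model, [xs[i] for i in test_idx],
                         [ys[i] for i in test_idx])
    full_rms = rms_error(model, xs, ys)
    return test_rms, full_rms
```

A ratio `test_rms / full_rms` close to 1 suggests the model generalizes to unseen points; a ratio far above 1 is the warning sign the text describes.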

The following example shows a multi-variable regression model: Y= a0 + a1*exp(x1) + a2*x2^-8 + a3*x3^5.6 + a4*x4^-1 + a5*x5^9 + a6*x6^4.1 + a7*exp(x6)^3 + a8*x5^16 + a9*(1/2)^(x4) + a10*x2^-6 + a11*x1^1.9 + a12*x3^5.2 + a13*x1^-11 + a14*(ln(x3))^8 + a15*exp(x5)^-1 + a16*x4^1.9 + a17*x6^16 + a18*x2^10 + a19*exp(x3)^5 + a20*(ln(x6))^2 + a21*x4^-4 + a22*exp(x5)^2 + a23*x4^4.2 + a24*x5^15 + a25*x6^15 + a26*x3^12
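Each term in this model is a coefficient multiplied by a fixed base function of one variable, so the model is linear in its coefficients a0 to a26 even though it is nonlinear in the variables. The sketch below evaluates only the first three terms of the example. The coefficient values are not given in the source, so any values passed in are placeholders, and the `TERMS` layout is an assumption made for illustration.

```python
import math

# Each entry is (index of the variable, base function applied to it).
# Only the first three terms of the example model are reproduced here.
TERMS = [
    (0, math.exp),             # a1 * exp(x1)
    (1, lambda x: x ** -8),    # a2 * x2^-8
    (2, lambda x: x ** 5.6),   # a3 * x3^5.6
]

def predict(coeffs, x):
    """Evaluate Y = a0 + a1*exp(x1) + a2*x2^-8 + a3*x3^5.6.
    coeffs = [a0, a1, a2, a3] (placeholder values); x = (x1, x2, x3)."""
    return coeffs[0] + sum(a * f(x[v]) for a, (v, f) in zip(coeffs[1:], TERMS))
```

Because the coefficients enter linearly, once the base functions are chosen the coefficients can be estimated by ordinary least squares; the hard part is choosing the base functions.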

Standard statistical analysis of the entire dataset may not detect overfitting, so ndCurveMaster also compares the RMS error of the test set with that of the entire dataset. In the example below, the test set RMS error is 9.55 and the dataset RMS error is 0.138. ndCurveMaster detects overfitting because the test set error is roughly 69 times the dataset error (68.775 as reported, versus about 69.2 from the rounded values given here).

ndCurveMaster Overfitting Detect

The overfitting is clearly demonstrated in the graph below, where blue points represent the entire dataset and yellow points represent the test set:

ndCurveMaster Overfitting Detect Chart

The graph shows that the fit of the entire dataset looks perfect, but the test set points do not fit well.


Heuristic Techniques Employed by ndCurveMaster

ndCurveMaster employs heuristic techniques for curve fitting and incorporates scientific algorithms. Finding the optimal combinations in multi-dimensional models (3D, 4D, 5D, 6D, etc.) and selecting the best-fitting functions from a set of functions involves a very large number of possible combinations. An exact algorithm that searches through all of them would be computationally expensive, so ndCurveMaster uses a heuristic approach instead.
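To get a feel for how quickly the number of combinations grows, consider a rough upper bound. The counting rule below is an assumption made for illustration: each term independently picks one variable and one base function, so repeated and reordered terms are counted more than once. The size of ndCurveMaster's function set is not stated in the source, so the value 20 used in the usage line is hypothetical.

```python
def search_space_size(n_variables, n_base_functions, n_terms):
    """Upper bound on the number of candidate models when every term
    independently chooses a (variable, base function) pair."""
    per_term = n_variables * n_base_functions
    return per_term ** n_terms

# Six variables and 26 terms, as in the overfitting example on this page,
# with a hypothetical set of 20 base functions.
print(search_space_size(6, 20, 26))
```

Even with generous deduplication, a count of this order rules out exhaustive enumeration, which is why a heuristic search is used.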

The best nonlinear functions and variable combinations are determined through randomization and iterative searching, using the following methods:

The "AutoFit" method uses an algorithm in which variables are randomized and base functions are iterated. This algorithm is fast and efficient, but the solutions it can reach are limited by the iteration. The search finishes automatically when the correlation coefficient reaches its maximum.

The "Random Search" method uses an algorithm in which both variables and base functions are fully randomized. This method takes more time than the "AutoFit" method, but because of the randomization the solutions it can reach are not limited in the same way. The search can run for an unlimited time, and the user must stop it manually by pressing the ESC key.
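The difference between the two strategies can be sketched with a generic scoring function. This is one possible reading of the descriptions above, not ndCurveMaster's actual algorithms: `score` stands for whatever quality measure is maximized (the correlation coefficient of the fitted model, in the program's case), a candidate is a list holding one base-function index per variable, and `n_trials` stands in for the user pressing ESC. All names are hypothetical.

```python
import random

def autofit(n_vars, n_funcs, score, rng):
    """Visit the variables in random order and, for each one, iterate over
    every base function, keeping any change that raises the score.
    Stops by itself when a full pass brings no improvement."""
    choice = [rng.randrange(n_funcs) for _ in range(n_vars)]
    best = score(choice)
    improved = True
    while improved:
        improved = False
        order = list(range(n_vars))
        rng.shuffle(order)                  # variables are randomized
        for v in order:
            for f in range(n_funcs):        # base functions are iterated
                trial = choice[:v] + [f] + choice[v + 1:]
                s = score(trial)
                if s > best:
                    choice, best, improved = trial, s, True
    return choice, best

def random_search(n_vars, n_funcs, score, rng, n_trials):
    """Draw every base function at random on each trial and keep the best.
    It has no stopping rule of its own; n_trials ends the run."""
    best_choice, best = None, float("-inf")
    for _ in range(n_trials):
        trial = [rng.randrange(n_funcs) for _ in range(n_vars)]
        s = score(trial)
        if s > best:
            best_choice, best = trial, s
    return best_choice, best
```

In this sketch `autofit` can stall at a combination that no single-variable change improves, while `random_search` can in principle reach any combination given enough trials, which mirrors the trade-off described above.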

These methods make it easier to discover better models, although solutions obtained through heuristic techniques are not guaranteed to be as good as those obtained through exact methods. Running multiple searches can help the user find a solution that is closer to the optimal result.

Because the algorithm is heuristic, both the search path and the resulting models may differ from run to run, even on the same data set. It is therefore advisable to fit models to the same data several times.
