Scientific Solutions for Data Analysis and Curve Fitting

Optimizing Complex Nonlinear Models Using Monte Carlo–Based Heuristic Search
Sensitivity Analysis
Enhanced Multicollinearity Detection
Preventing Overfitting in Multivariable Models
Q-Q Plot and Normality Tests
Discover Equations from Data

Optimizing Complex Nonlinear Models Using Monte Carlo–Based Heuristic Search

Finding complex multivariable nonlinear models (3D, 4D, 5D, 6D, etc.) and selecting optimal functional forms leads to an extremely large combinatorial search space. An exhaustive (exact) search is computationally infeasible. Therefore, ndCurveMaster employs Monte Carlo–based heuristic optimization, in which randomized sampling and iterative evaluation are used to efficiently explore the solution space.

In this framework, all randomized searching implemented in the software can be interpreted as a Monte Carlo method. Randomized selection and transformation of base functions allow the algorithm to escape local minima and probabilistically identify near-optimal predictor structures with respect to RMSE and additional user-defined quality constraints.

The discovered models are optimized not only for prediction error minimization but also for satisfying multiple statistical and methodological criteria, including:

Statistical significance of predictors (F-test and t-tests)
Limited overfitting at a user-defined training–test error ratio
Controlled multicollinearity (maximum allowed VIF)
Absence of proportional bias verified using the Bland–Altman test

Within this Monte Carlo framework, ndCurveMaster implements two complementary randomized search strategies:

AutoFit: A fast heuristic search in which the model structure is optimized using a combination of partial randomization and iterative refinement of base functions. The procedure converges automatically once further improvements in model quality become negligible.
Random Search: A fully unconstrained Monte Carlo strategy where both predictors and base functions are continuously randomized. This method enables unrestricted exploration of the model space and is terminated manually by the user, allowing arbitrarily long searches.

Due to the stochastic nature of Monte Carlo optimization, repeated runs may yield different—but comparably high-quality—solutions even for the same dataset. Performing multiple independent searches therefore increases the probability of identifying models that are very close to the global optimum, both in terms of predictive accuracy and statistical robustness.
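The idea of a randomized search over base functions can be illustrated with a minimal sketch. It is not ndCurveMaster's actual algorithm: the candidate exponent list `EXPONENTS`, the helper names `fit_rmse` and `random_search`, and the synthetic data are all assumptions made for illustration. Each predictor is given a randomly drawn power-type base function, the coefficients are fitted by linear least squares, and the best structure seen so far is kept.

```python
import numpy as np

# Hypothetical candidate exponents for power-type base functions x**p.
EXPONENTS = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])


def fit_rmse(X, y, powers):
    """Fit y = a0 + sum(a_i * x_i**p_i) by linear least squares; return the RMSE."""
    design = np.column_stack([np.ones(len(y)), X ** powers])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))


def random_search(X, y, n_iter=500, seed=0):
    """Draw random exponent vectors and keep the one with the lowest RMSE."""
    rng = np.random.default_rng(seed)
    best_powers = np.ones(X.shape[1])        # start from the plain linear model
    best_rmse = fit_rmse(X, y, best_powers)
    for _ in range(n_iter):
        candidate = rng.choice(EXPONENTS, size=X.shape[1])
        rmse = fit_rmse(X, y, candidate)
        if rmse < best_rmse:                 # accept only improvements
            best_powers, best_rmse = candidate, rmse
    return best_powers, best_rmse


# Synthetic demonstration data (an assumption): y = 2 + 3*x1**2 - 4*x2**0.5
rng = np.random.default_rng(42)
X = rng.uniform(1.0, 5.0, size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] ** 2 - 4.0 * X[:, 1] ** 0.5

linear_rmse = fit_rmse(X, y, np.ones(2))
powers, rmse = random_search(X, y)
print("best exponents:", powers, "RMSE:", rmse, "linear RMSE:", linear_rmse)
```

Changing `seed` changes the sequence of candidates, which mirrors the point above that repeated runs may end at different solutions.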

Sensitivity Analysis

Sensitivity analysis is a technique used in data analysis and machine learning to assess the impact of individual input variables on forecast outcomes. It identifies the input variables that are critical to prediction accuracy, as well as those that can be omitted without loss of quality. It is especially useful when other significance measures, such as the t-value, cannot be used, which is the case when the model residuals are not normally distributed and inference becomes difficult.

The ndCurveMaster program determines the importance of a given variable based on the ratio of the estimation error (RMSE) for the model with the omitted variable (RMSEo) to the estimation error for the model with all input variables (RMSEf), as follows:

SA = (RMSEo/RMSEf - 1) * 100

where:

RMSEo – the RMSE error with the omitted variable,
RMSEf – the RMSE error with all variables.

A higher value of this coefficient indicates a greater impact of the given variable on the prediction outcome. The variable whose removal causes the largest increase in error has the highest SA value and can be considered the most significant. Based on SA values, a ranking of variables by importance can be developed, and non-significant variables can be eliminated.
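As a rough illustration of the SA formula, the sketch below computes SA for each predictor of an ordinary linear model. It assumes that the reduced model is refitted after a variable is omitted, which the text does not state explicitly. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def rmse_of_fit(design, y):
    """Least-squares fit of y on the design matrix; return the RMSE."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sqrt(np.mean((y - design @ coef) ** 2)))


def sensitivity_analysis(X, y):
    """Return SA (%) per predictor: (RMSEo / RMSEf - 1) * 100."""
    n = len(y)
    rmse_f = rmse_of_fit(np.column_stack([np.ones(n), X]), y)   # all variables
    sa = []
    for j in range(X.shape[1]):
        reduced = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        rmse_o = rmse_of_fit(reduced, y)                        # variable j omitted
        sa.append((rmse_o / rmse_f - 1.0) * 100.0)
    return np.array(sa)


# Synthetic data (an assumption): x1 matters a lot, x2 a little, x3 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 1.0 + 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=300)

sa = sensitivity_analysis(X, y)
print("SA % per predictor:", np.round(sa, 1))
```

Sorting the predictors by these values gives the importance ranking described above.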

The "SA %" column in the Statistics window displays the results of the sensitivity analysis:

SA values that are too low are highlighted in red; they indicate that the predictor is not statistically significant.

Enhanced Multicollinearity Detection in ndCurveMaster

ndCurveMaster provides multicollinearity detection to improve the quality of the models it creates, using:

Variance Inflation Factor (VIF)
Pearson Correlation Matrix

Variance Inflation Factor (VIF) for Multicollinearity Detection

The VIF index is widely used for detecting multicollinearity (more information can be found on Wikipedia). There is no strict VIF threshold that determines the presence of multicollinearity: values above 10 are often considered an indication of multicollinearity, but values above 2.5 may already be a cause for concern in weaker models. ndCurveMaster calculates the VIF values for each model and displays them, for each predictor, in the last column of the regression analysis table, as shown below:

[Figure: ndCurveMaster Variance Inflation Factor (VIF)]

In addition, ndCurveMaster offers a search option for models with a VIF limit value. The user can select the "VIF cannot exceed" checkbox to display only models that do not exceed the selected VIF value. The default VIF limit is 10; the user can adjust it as needed.
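For readers who want to reproduce VIF values outside the program, here is a minimal sketch using the standard definition VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The function name `vif` and the synthetic data are assumptions. The limit of 10 is the default mentioned above.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot          # R^2 of predictor j on the others
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic predictors (an assumption): x3 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

values = vif(X)
limit = 10.0
for name, v in zip(["x1", "x2", "x3"], values):
    print(f"{name}: VIF = {v:.1f}", "(exceeds limit)" if v > limit else "")
```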

Utilizing the Pearson Correlation Matrix in ndCurveMaster

The Pearson Correlation Matrix window displays Pearson Correlation coefficients between each pair of variables, as shown below:

[Figure: ndCurveMaster Pearson Correlation matrix]

Examining the correlations between variables is the simplest way to detect multicollinearity. In general, an absolute Pearson correlation coefficient above 0.7 between two predictors indicates the presence of multicollinearity.
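The same check can be sketched in a few lines with `numpy.corrcoef`. The helper name `high_correlation_pairs`, the variable names, and the synthetic data are assumptions. The 0.7 threshold is the rule of thumb quoted above.

```python
import numpy as np


def high_correlation_pairs(X, names, threshold=0.7):
    """Return the correlation matrix and the pairs with |r| above the threshold."""
    corr = np.corrcoef(X, rowvar=False)      # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return corr, pairs


# Synthetic predictors (an assumption): x3 is strongly negatively related to x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = -x1 + 0.3 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

corr, pairs = high_correlation_pairs(X, ["x1", "x2", "x3"])
print(np.round(corr, 2))
print("flagged pairs:", pairs)
```

Pairwise correlations can miss multicollinearity that involves three or more predictors jointly, which is one reason to look at VIF as well.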

Preventing Overfitting in Multivariable Nonlinear Models

Overfitting occurs when a statistical model has too many parameters relative to the sample size used to build it. It is a common problem in machine learning and is less often a concern for regression models with only a few parameters. However, ndCurveMaster's advanced algorithms make it possible to build complex multivariable models, which can also be susceptible to overfitting.

In regression analysis with one independent variable, overfitting can be easily detected by examining the graph:

In regression analysis with many variables, however, overfitting cannot be detected in this way. ndCurveMaster therefore implements an overfitting detection technique based on the test set method. The software randomly selects part of the data for the test set and uses the remaining data for regression analysis. Overfitting is detected by comparing the root mean square (RMS) errors of the test set and the entire dataset.

[Figure: ndCurveMaster Overfitting Program Option]

The following example shows a multi-variable regression model: Y = a₀ + a₁*exp(x₁) + a₂*x₂^-8 + a₃*x₃^5.6 + a₄*x₄^-1 + a₅*x₅⁹ + a₆*x₆^4.1 + a₇*exp(x₆)³ + a₈*x₅¹⁶ + a₉*(1/2)^x₄ + a₁₀*x₂^-6 + a₁₁*x₁^1.9 + a₁₂*x₃^5.2 + a₁₃*x₁^-11 + a₁₄*(ln(x₃))⁸ + a₁₅*exp(x₅)^-1 + a₁₆*x₄^1.9 + a₁₇*x₆¹⁶ + a₁₈*x₂¹⁰ + a₁₉*exp(x₃)⁵ + a₂₀*(ln(x₆))² + a₂₁*x₄^-4 + a₂₂*exp(x₅)² + a₂₃*x₄^4.2 + a₂₄*x₅¹⁵ + a₂₅*x₆¹⁵ + a₂₆*x₃¹²

Standard statistical analysis of the entire dataset may not detect overfitting. However, ndCurveMaster can also check the RMS errors of the test set and the entire dataset. In this example, the test set RMS error is 9.55 and the dataset RMS error is 0.138. ndCurveMaster detects overfitting because the test set error is 68.775 times the dataset error (the ratio of the rounded values shown here is about 69).

The overfitting is clearly demonstrated in the graph below, where blue points represent the entire dataset and yellow points represent the test set:

The graph shows that the fit of the entire dataset looks perfect, but the test set points do not fit well.
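The test set method can be sketched as follows. This is an illustration under stated assumptions, not the program's implementation. It uses a one-variable polynomial as the model and compares the test error with the error on the rows used for fitting, whereas the text above refers to the entire dataset. The names `overfitting_check`, `test_fraction` and `max_ratio` are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def overfitting_check(x, y, degree, test_fraction=0.2, max_ratio=2.0, seed=0):
    """Fit a polynomial on the training rows and compare test vs. training RMSE."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_test = max(1, int(round(test_fraction * len(y))))
    test_idx, train_idx = order[:n_test], order[n_test:]

    design = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x**2, ...
    coef, *_ = np.linalg.lstsq(design[train_idx], y[train_idx], rcond=None)
    pred = design @ coef

    rmse_train = rmse(y[train_idx], pred[train_idx])
    rmse_test = rmse(y[test_idx], pred[test_idx])
    ratio = rmse_test / rmse_train
    return rmse_train, rmse_test, ratio, ratio > max_ratio


# Synthetic noisy data (an assumption).
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, size=40))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=40)

for degree in (3, 15):
    rmse_train, rmse_test, ratio, flagged = overfitting_check(x, y, degree)
    print(f"degree={degree}: train={rmse_train:.3f} test={rmse_test:.3f} "
          f"ratio={ratio:.2f} overfitting={flagged}")
```

A model with many parameters relative to the number of training rows will typically show a much larger ratio than a small model, which is the signal the test set method looks for.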

Q-Q Plot and Normality Tests

In ndCurveMaster, it is possible to test the normality of residuals from the discovered equations. This allows for assessing the significance of predictors in the calculated regression equations. The following methods are implemented:

Shapiro-Wilk Test for data sets containing 3 to 5000 observations,
Anderson-Darling Test for any number of observations,
Q-Q Plot (Quantile-Quantile Plot), which provides a visual assessment of whether a data set follows a specific distribution (e.g., normal distribution).
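The construction of a normal Q-Q plot can be sketched with the Python standard library alone. The sketch covers only the Q-Q part, not the Shapiro-Wilk or Anderson-Darling tests. The plotting positions (i - 0.5) / n are one common convention and an assumption here. The correlation of the Q-Q points is used only as an informal summary of how straight the plot is.

```python
import random
from statistics import NormalDist, mean


def qq_points(residuals):
    """Return (theoretical quantile, ordered residual) pairs for a normal Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def qq_correlation(points):
    """Pearson correlation of the Q-Q points; values near 1 suggest normality."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Synthetic residuals (an assumption): one normal sample, one skewed sample.
random.seed(1)
normal_res = [random.gauss(0.0, 1.0) for _ in range(200)]
skewed_res = [random.expovariate(1.0) for _ in range(200)]

r_normal = qq_correlation(qq_points(normal_res))
r_skewed = qq_correlation(qq_points(skewed_res))
print(f"normal sample: r = {r_normal:.4f}, skewed sample: r = {r_skewed:.4f}")
```

Plotting the pairs returned by `qq_points` gives the Q-Q plot itself; points that follow a straight line indicate approximately normal residuals.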

To perform normality tests, select the Q-Q Plot & Normality option in the "Graph" menu, as shown below:

If the residuals do not follow a normal distribution, this may affect the validity of statistical results and inferences. In such cases, the significance of individual predictors can be assessed using the sensitivity analysis (SA %) in the "Statistics" window. SA % indicates the percentage increase in the RMSE (Root Mean Square Error) of the entire equation after removing a given variable/predictor from the regression model. The higher the percentage, the more significant the predictor.

The critical values for the Shapiro-Wilk Test are determined as follows:

For sample sizes (N) ranging from 3 to 50, they are read directly from a table of critical values,
For sample sizes (N) from 51 to 5000, they are calculated using approximation equations.

The approximation equations for the Shapiro-Wilk Test critical values were derived from Monte Carlo simulations with 10,000 iterations. These equations were developed using ndCurveMaster and reproduce the simulated critical values with 100% accuracy: the sum of squared residuals across all data points is zero.

Below are the approximation equations created with ndCurveMaster, which are also implemented in the software, for various significance levels (α):

for α = 0.01:
S = 0.999955813553981 - 4.2000018710969 * N^-1 - 2977188846 * N^-5.4 + 195600085 * N^-4.85 - 8350.66049575806 * N^-2.45 + 10004269352 * N^-5.9 + 4173.90900230408 * N^-2.3
for α = 0.02:
S = 0.996940119599458 - 1.09465537965298 * N^-0.65 + 0.000141521624755114 * N^0.26 - 5368.41369628906 * N^-2.9 - 42.2440037727356 * N^-1.85 + 3289.34020996094 * N^-2.7 + 0.244719844311476 * N^-0.45
for α = 0.05:
S = 1.09685223293491 - 0.0862856622552499 * N^-0.15 + 1.56707056060867e27 * N^-18 + 0.00625888709328137 * N^0.32 - 0.0376221658079885 * N^0.16 - 0.000274505593097274 * N^0.51 - 1.24064072500914 * N^-0.85
for α = 0.1:
S = 0.997943465801654 + 92836.087890625 * N^-3.2 - 6594.45446777344 * N^-2.3 - 0.584783179685473 * N^-0.75 - 74523.06640625 * N^-2.95 + 0.00500002509215847 * N^(-1/12) + 17327.6760253906 * N^-2.35
for α = 0.5:
S = 1.00000902912007 - 0.733561176806688 * N^-0.9 + 7.67083466053009 * N^-1.4 - 4.34975712005325e15 * N^-11 - 580986.578125 * N^-4.4 + 1528148.203125 * N^-4.65 - 4.96384191513062 * N^-1.3
for α = 0.9:
S = 0.999988040423204 - 2.46055405586958 * N^-1.05 + 17054.1804199219 * N^-3.05 + 15054383.25 * N^-5.45 - 263485.02734375 * N^-4.05 + 4.4287580922246 * N^-1.25 - 5155.42724609375 * N^-2.8
for α = 0.95:
S = 1.00000214339047 + 82.4580221176147 * N^-2.25 - 12370693.875 * N^-4.4 - 0.434151018271223 * N^-0.9 + 9107632.5 * N^-4.2 + 6.82077260139725e15 * N^-11 - 914461.28125 * N^-3.85
for α = 0.98:
S = 0.996691931562964 - 14680.5522460938 * N^-3.8 - 0.14887606119737 * N^-0.65 + 10816.9770507813 * N^-3.7 + 0.000971209832641762 * N^0.09 + 0.0137125887558796 * N^(-1/4)
for α = 0.99:
S = 0.999998431990093 - 0.592065832577646 * N^-0.95 - 4201280438272 * N^-8 + 7.94718337312489e17 * N^-11 + 0.988007228355855 * N^-1.25 + 724202386620416 * N^-9 - 4.15988488370913e16 * N^-10

The complete table, containing all exact critical values for the Shapiro-Wilk Test, calculated using the Monte Carlo method and approximations generated with ndCurveMaster for sample sizes (N) ranging from 51 to 5000, can be downloaded in Excel format from the following link: Shapiro_Wilk_Critical_Values_Table_5000.xlsx.
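As a usage illustration, the α = 0.05 equation above can be transcribed directly into a small function. The function name is hypothetical, and the range check reflects the stated validity of the approximation (N from 51 to 5000). The function only evaluates the published formula and does not compute the Shapiro-Wilk statistic itself.

```python
def shapiro_wilk_critical_05(n):
    """Approximate Shapiro-Wilk critical value for alpha = 0.05 (51 <= n <= 5000)."""
    if not 51 <= n <= 5000:
        raise ValueError("approximation is stated for 51 <= N <= 5000")
    return (
        1.09685223293491
        - 0.0862856622552499 * n ** -0.15
        + 1.56707056060867e27 * n ** -18
        + 0.00625888709328137 * n ** 0.32
        - 0.0376221658079885 * n ** 0.16
        - 0.000274505593097274 * n ** 0.51
        - 1.24064072500914 * n ** -0.85
    )


for n in (51, 100, 1000, 5000):
    print(n, round(shapiro_wilk_critical_05(n), 4))
```

A W statistic computed from the residuals would then be compared against this value: normality is rejected at the 5% level when W falls below it.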

Discover Equations from Data

Equation discovery involves identifying mathematical relationships that best describe the behavior of a dataset. This process is crucial for understanding complex systems and uncovering the underlying laws governing data. By leveraging advanced curve fitting and model optimization techniques, researchers can develop predictive models and gain deeper insights.

ndCurveMaster is a powerful tool designed specifically for such tasks. Unlike traditional software like Excel, which may struggle with local optima, ndCurveMaster employs a randomized search algorithm to explore a broader solution space.

In this tutorial, both Excel and ndCurveMaster were used to discover the following equation:

Y = 3 + 3 · x₁^2.5 + 4 · x₂³ + (-3.5) · x₃^(1/2)

Using only data and curve fitting techniques, ndCurveMaster successfully discovered this equation, while Excel provided a close approximation. The results are presented in the table below.

Software	Discovered Equation	RMSE	Pearson Correlation Coefficient
Excel	Y = 0.89 + 3.518 * x₁^2.65 + 1.314 * x₂^2.11 - 0.066 * x₃^-4.16	4.38	0.99999715
ndCurveMaster	Y = 3 + 3 · x₁^2.5 + 4 · x₂³ + (-3.5) · x₃^(1/2)	0	1
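The zero RMSE in the table is what one would expect once the correct base functions are found: the model is then linear in its coefficients, and noise-free data determine them exactly. The sketch below checks this on synthetic data generated from the target equation. The data, the sampling range, and the variable names are assumptions; the tutorial's actual dataset is not reproduced here. The exponents are taken as known, so the sketch omits the search over functional forms, which is the hard part of the task.

```python
import numpy as np

# Synthetic, noise-free data from the target equation (an assumption).
rng = np.random.default_rng(7)
x1, x2, x3 = rng.uniform(1.0, 5.0, size=(3, 100))
y = 3 + 3 * x1 ** 2.5 + 4 * x2 ** 3 + (-3.5) * x3 ** 0.5

# With the base functions fixed, the coefficients follow from linear least squares.
design = np.column_stack([np.ones_like(x1), x1 ** 2.5, x2 ** 3, x3 ** 0.5])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
rmse = float(np.sqrt(np.mean((y - design @ coef) ** 2)))

print("coefficients:", np.round(coef, 6))
print("RMSE:", rmse)
```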

Read the full tutorial here: Curve Fitting in Excel: A Tutorial on Fitting a Complex Nonlinear Regression Model to Your Data