ndCurveMaster

Nonlinear Regression Tutorial: A Step-by-Step Modeling Workflow

This tutorial shows how to build, validate, and improve a nonlinear regression model in ndCurveMaster. The workflow is divided into three stages to demonstrate not only how to find an accurate model, but also how to check whether it is reliable and how to refine it further using a data-driven approach.

In stage 1, the goal is to build an initial nonlinear model and compare it with a basic linear model. This stage shows how ndCurveMaster searches for better predictor forms and how model quality can be improved.

In stage 2, the analysis is repeated with overfitting detection. The purpose of this stage is to verify that the model performs well not only on the main dataset but also on unseen test data.

In stage 3, the predictor form selected in stage 2 is kept unchanged, but the regression coefficients are recalculated using the full dataset. This makes it possible to use all available data while preserving a model structure that has already been checked for overfitting.

Before moving to these stages, it is useful to briefly explain how the program works.

Using elements of random model-space exploration, ndCurveMaster generates different forms of nonlinear functions, including power, logarithmic, and trigonometric functions, and then fits them to the data. This is a data-driven approach in which the model structure is determined on the basis of the data.

Model search can be performed in three ways:

  1. Random Search – search performed exclusively by random exploration:
    Random Search documentation
  2. Randomly Iterated Search – search performed in an iterative way:
    Randomly Iterated Search documentation
  3. Advanced Search – a combined approach: the first phase works like Random Search and the second phase like Randomly Iterated Search:
    Advanced Search documentation

The iterative method also includes an element of randomness because the program randomly chooses the order in which individual predictors are optimized.

Due to the random nature of the search algorithm, models developed with ndCurveMaster may differ from one run to another. However, when more thorough and time-consuming calculations are performed, the results are usually similar. In tasks related to discovering physical equations, when the number of base functions is small, the resulting solutions are usually identical.

In practice, the key to developing an appropriate model is the proper selection of the functions used for fitting. For this reason, the program includes several ready-made sets of functions and also allows users to create their own. A detailed description is available here:

Function sets and configuration

For most problems, I use power functions up to the third degree. I try to avoid logarithmic, trigonometric, and exponential functions unless there are clear reasons to use them.

Using power functions up to the third degree, and preferably up to the second degree, usually gives good-quality models while also reducing the risk of overfitting.

I also often use my own functions by means of the Custom functions option. Ready-made sets of custom functions are also included with the program. Details are described here.

Stage 1. Basic nonlinear regression model

After loading the file Tutorial.txt, use the default settings in the program window, as shown below:

Default settings in ndCurveMaster

Here we select the Y variable and the X variables. We do not use weights. We use a simple model, so we do not select the options for models with combinations of variables or the log-linear model.

We leave the significance level alpha at its default value. We select the Multicollinearity detection option in order to check multicollinearity using the VIF indicator.
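ndCurveMaster computes the VIF internally, but the underlying calculation is easy to reproduce: the VIF of predictor j is 1 / (1 − R²_j), where R²_j comes from regressing x_j on the remaining predictors. A minimal sketch, assuming Python with NumPy and synthetic data (not the tutorial dataset):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the predictor matrix X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on the remaining columns (with an intercept).
    """
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Nearly uncorrelated predictors give VIFs close to 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
print(vif(X))
```

Values near 1 indicate no multicollinearity; values above roughly 5 to 10 are commonly treated as problematic.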

We choose the first set of base functions. This set contains the functions described here:

Base function set documentation

In this exercise, we will not yet check for overfitting, so we do not select the Use a test set to detect overfitting option.

If you click View Data, you will see the loaded data:

Loaded dataset in ndCurveMaster

Next, confirm the settings by clicking the OK button. You will see the initial linear regression model screen:

Initial linear regression model screen

The Statistics window contains the basic statistical parameters of the model. Details about this window are available here. In addition, for each predictor the program displays the SA% value, where SA stands for sensitivity analysis.

Sensitivity analysis is used to assess the influence of individual input variables on the prediction result. It makes it possible to identify variables that are particularly important, as well as variables that can be omitted without a significant deterioration in model quality. This is especially useful when classical measures such as the t-value cannot be interpreted unambiguously, for example because of a non-normal residual distribution.

The SA% indicator is calculated using the following equation:

SA% = (RMSEo / RMSEf - 1) * 100

where:

RMSEo is the RMSE of the model refitted with the given predictor omitted,
RMSEf is the RMSE of the full model.

The higher the value of this indicator, the greater the influence of a given variable on the prediction result. The variable whose removal causes the greatest increase in error has the highest SA% value and may be regarded as the most important.
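The SA% calculation can be sketched outside the program as well: refit the model with each predictor omitted and compare the resulting RMSE with that of the full model. A minimal illustration, assuming Python with NumPy and made-up data:

```python
import numpy as np

def rmse_fit(A, y):
    """Least-squares fit; returns the RMSE of the residuals."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sqrt(np.mean((y - A @ beta) ** 2))

def sensitivity(X, y):
    """SA% per predictor: relative RMSE increase when it is removed."""
    n, p = X.shape
    rmse_f = rmse_fit(np.column_stack([np.ones(n), X]), y)
    sa = []
    for j in range(p):
        reduced = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        sa.append((rmse_fit(reduced, y) / rmse_f - 1.0) * 100.0)
    return np.array(sa)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# The first predictor dominates; the third contributes nothing.
y = 5.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=300)
print(sensitivity(X, y))
```

In this toy case, the first predictor gets by far the largest SA%, while the noise predictor's SA% is close to zero, mirroring how the indicator is read in the Statistics window.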

Unacceptable values of statistics such as t-value, VIF, or SA are displayed in the Statistics window in red. In such cases, the Recommendation column displays the message suggested removal, which means that the given predictor may be insignificant.

The Statistics window shows that the model as a whole is valid, but all predictors except x2 are insignificant and are marked in red. Predictors x3 and x4 are the least important, because their p-values exceed the significance level by the largest margin. The SA% analysis confirms this: these predictors have the lowest values, meaning that removing x3 or x4 increases the error by only about 0.16% and 1.21%, respectively.

The VIF values are low, which indicates that the variables are not strongly correlated with one another. This is also confirmed by the Pearson correlation matrix:

Pearson correlation matrix

This table confirms that there is a strong relationship between variable x2 and Y.
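A matrix of this kind can be reproduced outside the program with a one-liner. A minimal illustration, assuming Python with NumPy; the data below are synthetic and only mimic the pattern of one dominant predictor:

```python
import numpy as np

rng = np.random.default_rng(5)
x2 = rng.normal(size=200)
x_noise = rng.normal(size=200)          # a predictor unrelated to Y
Y = 3.0 * x2 + 0.5 * rng.normal(size=200)

# Columns: x2, x_noise, Y; rowvar=False treats columns as variables.
R = np.corrcoef(np.column_stack([x2, x_noise, Y]), rowvar=False)
print(np.round(R, 2))
```

The x2–Y entry comes out close to 1, while the entries involving the unrelated predictor stay near 0, just as in the tutorial's matrix.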

In the next part, we will check whether it is possible to fit better equations to these predictors in order to increase estimation accuracy and improve model quality.

Before starting the search, click Settings and check whether the default options are consistent with the figure below:

Program settings in ndCurveMaster

A description of these options is available here.

Selecting the Random Search using only basic function forms option means that during Random Search, or in the first phase of Advanced Search, only the basic functions from a given set are sampled. This speeds up the search and helps to find more general solutions. At the beginning, however, I suggest leaving this option unselected.

The Multicollinearity detection option checks multicollinearity during the search by calculating the VIF indicator.

In the Search Strategy section, the duration of the first phase of Advanced Search is set to 15 seconds. During this phase, the program generates different forms of nonlinear functions and fits them to the data. This time may be insufficient: on a fast computer it is usually acceptable, but on weaker hardware it is worth increasing.

The second phase consists of an iterative search through the solutions found in the first phase. The first phase is very important because it makes it possible to find diverse solutions and reduces the risk of becoming trapped in a local minimum.

In the field Select the number of CPU threads, set the maximum possible value. This will speed up the equation search.

Click the Advanced Search button to start the search.

During the first search phase, information about the number of discovered models appears in the lower-right corner. This makes it possible to assess whether the program has found any general solutions:

First phase of nonlinear model search

In the example shown, the program has found 7 models. If no models appear in this area, it usually means that the search time in the first phase is too short. In such a case, it is recommended to increase the search time.

After the search is completed, a large set of new functions is obtained:

Nonlinear models found by ndCurveMaster

The most accurate model has the following form:

Y = a0 + a1 · x1^(1.70) + a2 · x2^(2.60) + a3 · x3^(0.45) + a4 · x4^(1.10) + a5 · x5^(-1)

The regression coefficients are given in the Value of a column. The equation can also be copied in different formats by clicking the Copy button above the model list in the upper-left window.

Copying the nonlinear model
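Once the exponents have been fixed by the search, the equation above is nonlinear in the x's but linear in the coefficients a0 through a5, so the values in the Value of a column can be reproduced with ordinary least squares. A minimal sketch, assuming Python with NumPy; the exponents are those shown above, but the dataset is synthetic:

```python
import numpy as np

# Exponents discovered by the search; the model stays linear in a0..a5.
EXPONENTS = [1.70, 2.60, 0.45, 1.10, -1.0]

def design_matrix(X):
    """Intercept column plus each x_j raised to its fixed exponent."""
    cols = [np.ones(len(X))]
    cols += [X[:, j] ** e for j, e in enumerate(EXPONENTS)]
    return np.column_stack(cols)

rng = np.random.default_rng(2)
X = rng.uniform(0.5, 3.0, size=(100, 5))  # positive inputs for fractional powers
true_a = np.array([1.0, 2.0, -0.5, 3.0, 0.7, -4.0])
y = design_matrix(X) @ true_a             # noiseless, so OLS recovers true_a

a_hat, *_ = np.linalg.lstsq(design_matrix(X), y, rcond=None)
print(np.round(a_hat, 3))
```

This separation of structure (the exponents) from coefficients is also what makes stage 3 of the tutorial possible.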

This equation is of very high quality. This is confirmed by the correlation coefficients, especially the CCC, as well as the low errors. The RRMSE is particularly important; if it is lower than 10%, the result can be regarded as very good. All predictors are now significant, and the most important predictor is x2.
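RRMSE is defined in more than one way in the literature; a common convention, assumed in the sketch below, divides the RMSE by the mean of the observed values:

```python
import numpy as np

def rrmse(y, yhat):
    """Relative RMSE: RMSE divided by the mean of the observed values.
    One common convention; other definitions normalize differently."""
    return np.sqrt(np.mean((y - yhat) ** 2)) / np.mean(y)

y = np.array([10.0, 12.0, 8.0, 10.0])
yhat = y + np.array([0.5, -0.5, 0.5, -0.5])
print(rrmse(y, yhat))  # 0.05, i.e. 5%, well under the 10% threshold
```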

Compared with the initial linear model, the normality of the residual distribution improved, and the proportional bias indicated by the Bland–Altman test dropped to zero. The plots below compare the initial linear model with the fitted nonlinear model.

Linear model id: 1 Nonlinear model id: 31
Comparison of linear model diagnostics Comparison of nonlinear model diagnostics
Linear model residual plot Nonlinear model residual plot

It should be noted, however, that the model was developed using a data-driven method. Although the power exponents are moderate, with the exception of x2^(2.60), overfitting may still have occurred; this phenomenon can appear even in linear models.

Therefore, it is worth repeating the analysis, this time using a test set to detect overfitting. In my experience, to reduce the risk of overfitting, it is best to build the model using power functions up to the third degree at most.

If you want to open the result obtained at this stage immediately, you can use the ready-made file phase1.ndc.

Stage 2. Nonlinear model with overfitting control

In the next stage, we repeat the analysis and build a new model using the following settings:

Stage 2 settings with overfitting detection

Here we choose the set Polynomial, rational and power functions up to 3, and we also select the Overfitting detection option.

After loading the new model, it is worth checking whether the test data, marked in red, are properly scattered:

Distribution of test data used for overfitting detection

In the settings, we turn off Multicollinearity detection, because the previous stage showed that the variables are not significantly collinear. Turning this option off also speeds up the calculations.

I suggest setting the overfitting control level to a safe value of 1.2. This means that only those models will be accepted in which the RMSE for the test set exceeds the RMSE for the main dataset by no more than 20%.
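This acceptance rule can be sketched as follows, assuming Python with NumPy; the model, data, and train/test split below are synthetic:

```python
import numpy as np

def rmse(y, yhat):
    """Root-mean-square error."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def overfit_check(A_train, y_train, A_test, y_test, level=1.2):
    """Fit on the main dataset; return the Test/Dataset RMSE ratio
    and whether it stays within the overfitting-control level."""
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    ratio = rmse(y_test, A_test @ beta) / rmse(y_train, A_train @ beta)
    return ratio, ratio <= level

# A correctly specified linear model generalizes, so the ratio stays near 1.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 300)
y = 2.0 * x + rng.normal(size=300)
A = np.column_stack([np.ones(300), x])
ratio, ok = overfit_check(A[:200], y[:200], A[200:], y[200:])
print(round(float(ratio), 3), bool(ok))
```

An overfitted model, by contrast, drives the test RMSE far above the training RMSE and fails the check.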

Overfitting control settings

Then click Advanced Search.

During the first search phase, only one model was found, which suggests that the 15-second search time is slightly too short.

Even so, a very good model was obtained:

Stage 2 nonlinear regression result

The equation has the following form:

Y = a0 + a1 · x1^(1.75) + a2 · x2^(2.6) + a3 · x3^(1/2) + a4 · x4^(1.15) + a5 · x5^(-3/4)

After including the regression coefficients, we obtain:

Y = 7.27685889198256 + 2.67170670020036 · x1^(1.75) + 3.99827130385565 · x2^(2.6) + (-3.40478304807294) · x3^(1/2) + 1.36065220987535 · x4^(1.15) + (-10.6735426823943) · x5^(-3/4)

The error for the test data is only 19% higher than the error for the main dataset, because Test/Dataset RMSE = 1.19. This means that the model meets the adopted overfitting-control criterion. For both the dataset and the test set, the Bland–Altman test gives very good results.

If you want to open the result from this stage immediately, use the file phase2.ndc.

Stage 3. Re-estimating coefficients using the full dataset

Finally, the model can be improved slightly further. The aim is to recalculate the regression coefficients for the full dataset while preserving the same predictor forms that were selected in stage 2.

Changing only the regression coefficients should not increase the risk of overfitting, while it makes it possible to use the entire available dataset more effectively.
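Outside the program, this step amounts to rebuilding the same design matrix, with exponents frozen at their stage-2 values, and solving an ordinary least-squares problem on all rows. A minimal sketch, assuming Python with NumPy; the exponents match the stage-2 equation, while the data and true coefficients are made up (loosely modeled on the tutorial's values):

```python
import numpy as np

# Predictor forms frozen after the stage-2 overfitting check.
EXPONENTS = [1.75, 2.6, 0.5, 1.15, -0.75]

def design_matrix(X):
    cols = [np.ones(len(X))] + [X[:, j] ** e for j, e in enumerate(EXPONENTS)]
    return np.column_stack(cols)

rng = np.random.default_rng(4)
X = rng.uniform(0.5, 3.0, size=(120, 5))
a_true = np.array([7.28, 2.67, 4.0, -3.40, 1.36, -10.67])
y = design_matrix(X) @ a_true + 0.05 * rng.normal(size=120)

# Stage 2: coefficients estimated on the main subset (first 90 rows).
a_subset, *_ = np.linalg.lstsq(design_matrix(X[:90]), y[:90], rcond=None)
# Stage 3: same predictor forms, coefficients re-estimated on all rows.
a_full, *_ = np.linalg.lstsq(design_matrix(X), y, rcond=None)
print(np.round(a_full, 2))
```

As in the tutorial, the two coefficient vectors differ only slightly: the structure is unchanged, and only the estimates are refined by the extra rows.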

To do this, the equation from stage 2 should be saved to a file. Click the Save button located above the list of functions in the left panel, and then save the equation under any name to the function archive:

Saving a function to a file in ndCurveMaster

A description of saving and loading functions can be found here.

Next, load the dataset Tutorial.txt again. Choose the same function set Polynomial, rational and power functions up to 3, but this time clear the Overfitting detection option, as shown below:

Stage 3 settings without overfitting detection

After loading the model, click the Load button in the toolbar above the list of functions in the left panel and load the previously saved function file:

Loading the saved function

Then confirm the model:

Model confirmation window

As a result, you will obtain the full dataset and the function that previously did not show overfitting:

Model after re-estimating coefficients

This is the same function form as before, but with re-estimated regression coefficients:

Y = 7.04312320227064 + 2.67375576330874 · x1^(1.75) + 3.99940460829172 · x2^(2.6) + (-3.40226314895395) · x3^(1/2) + 1.38542511445383 · x4^(1.15) + (-10.7010013884196) · x5^(-3/4)

In summary, this model is slightly more accurate than the model from stage 2. For example, RRMSE is 0.10165, whereas in stage 2 it was 0.114813. The model uses all the data, and the predictor form does not show a tendency toward overfitting.

The model accuracy is high. Both the entire equation and all predictors are significant. The Pearson matrix shows no significant multicollinearity among the variables. The residual distribution indicates no heteroscedasticity, whereas in the linear model a slight funnel effect, suggesting heteroscedasticity, was visible. In addition, the normality of the residual distribution improved slightly, and the Bland–Altman test performed much better than for the initial linear model.

Below is a comparison of the linear model and the final model with nonlinear predictors:

Linear model Model with nonlinear predictors
Final comparison of linear model diagnostics Final comparison of nonlinear model diagnostics
Final linear model residual plot Final nonlinear model residual plot
Linear model comparison chart Nonlinear model comparison chart
Linear model diagnostic chart Nonlinear model diagnostic chart

Ready-made result files

At the end, you can download the ready-made files from the individual stages of the analysis:

Frequently Asked Questions

What does this nonlinear regression tutorial show?

This tutorial explains how to build a nonlinear regression model in ndCurveMaster, compare it with a linear model, detect overfitting, and then improve the final model by re-estimating coefficients using the full dataset.

Why is overfitting detection important?

Overfitting detection helps verify whether a model performs well not only on the data used for model development but also on unseen test data. This improves model reliability.

Can ndCurveMaster be used for curve fitting and multivariable nonlinear regression?

Yes. ndCurveMaster is designed for curve fitting and nonlinear regression with multiple independent variables, including data-driven model discovery and statistical validation.

Why re-estimate coefficients using the full dataset?

Once a predictor form has been checked for overfitting, re-estimating the coefficients on the full dataset makes it possible to use all available information and slightly improve predictive accuracy.