AN IMPROVED NEURAL NETWORK MODEL FOR ENHANCING RUTTING DEPTH PREDICTION

Abstract. Rutting is the main distress form of asphalt pavement, and its prediction accuracy directly affects the reliability of the designed road. This research developed a neural network model to improve the ability to predict rutting, a key pavement performance criterion, and compared it with a multiple linear regression model and an existing neural network model. The neural network model was developed using the Keras module of the TensorFlow package in Python. Two reports generated by National Cooperative Highway Research Program Project 01-37A and records from the Long-Term Pavement Performance website, reliable data preserved over years of monitoring, served as data sources for training the neural network model. The input variables include the pavement thickness, service time, average annual daily traffic of trucks, and the deformation of the asphalt concrete layer, granular base layer and subgrade layer. This experiment used 440 samples, of which 352 (80%) were used for model training and 88 (20%) for testing.


Introduction
Rutting is the main distress form of asphalt pavement, and it is also a significant criterion for measuring the performance of asphalt pavement. The Mechanistic-Empirical Pavement Design Guide (MEPDG), funded by the National Cooperative Highway Research Program (NCHRP) in 1996 and published in 2004, represents the most advanced pavement design guide in the United States. Because rutting depth is an essential criterion of pavement performance, the accuracy of its prediction directly affects the reliability of pavement design (Gong et al., 2018) and must be considered carefully. Prediction accuracy depends heavily on the model form and on the data used during calibration. Researchers have invested much time and effort in the more than 20 performance prediction models in the MEPDG (Schram & Abdelrahman, 2006). However, as Pierce & McGovern (2014) reported, only three institutions or organisations indicated that the MEPDG is their primary pavement design tool. A potential reason is that multiple linear regression (MLR) models have some disadvantages when used to estimate the occurrence of distress and the service life of the pavement.
Rutting on the pavement comes from the permanent deformations (PD) of each structural layer or sub-layer under vehicle load, combined with climate, environment and material factors (Ali et al., 2017). As mentioned above, the MEPDG linearly combines the PD of all sensitive layers when calculating the total rutting. In practice, however, many factors affect the PD of the pavement: some are strongly linearly related to it, while others are not. Therefore, when establishing a pavement model to predict distress, incorporating nonlinear functions addresses this problem well and improves accuracy.
Artificial intelligence technologies, such as neural networks (NN) and deep learning, have been widely utilised to establish models for predicting pavement performance. Najafi et al. (2019) applied an artificial neural network model to predict the rate of wet and dry vehicle crashes based on surface friction, traffic level, and speed limit. Esra'a & Abo-Qudais (2018) combined MLR analysis and feed-forward neural networks.

Rutting depth calculation

Calculation formula
Rutting is a plastic deformation accumulated under repeated loads during long-term service, so both its rate and its accumulation must be considered. The accumulation rate of plastic deformation is measured in the laboratory using repeated load PD triaxial tests for both AC mixtures and unbound materials. The laboratory-derived relationship is then adjusted to match the rut depth measured on the roadway (American Association of State…, 2008). For all AC mixtures, the MEPDG field-calibrated form of the laboratory-derived relationship from repeated load PD tests is shown in Equation (1), where Δp(AC) is the accumulated permanent or plastic vertical deformation in the AC layer/sublayer, mm; εp(AC) is the accumulated permanent or plastic axial strain in the AC layer/sublayer, mm/mm; h(AC) is the thickness of the AC layer/sublayer, mm; β1r, β2r, β3r are the local or mixture field calibration constants, all set to 1.0 for the global calibration; kz is the depth confinement factor; k1r, k2r, k3r are the global field calibration parameters (k1r = −3.35412, k2r = 0.47910, k3r = 1.56060); εr(AC) is the resilient or elastic strain calculated by the structural response model at the mid-depth of each AC sublayer, mm/mm; n is the number of axle-load repetitions; T is the mix or pavement temperature, °F; D is the depth below the surface, mm; HAC is the total AC thickness, mm.
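Equation (1) follows the standard MEPDG form; reconstructed from the definitions above (in the original guide the depth confinement factor is expressed with D and HAC in inches), it reads:

```latex
\Delta_{p(AC)} = \varepsilon_{p(AC)}\, h_{(AC)}, \qquad
\frac{\varepsilon_{p(AC)}}{\varepsilon_{r(AC)}}
  = k_z\, \beta_{1r}\, 10^{k_{1r}}\, n^{k_{2r}\beta_{2r}}\, T^{k_{3r}\beta_{3r}} \tag{1}
% depth confinement factor per the MEPDG:
k_z = (C_1 + C_2 D)\, 0.328196^{D}, \quad
C_1 = -0.1039\, H_{AC}^{2} + 2.4868\, H_{AC} - 17.342, \quad
C_2 = 0.0172\, H_{AC}^{2} - 1.7331\, H_{AC} + 27.428
```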
The PD of unbound layers and the subgrade is calculated with Equation (2), where Δp(soil) is the permanent or plastic deformation of the layer/sublayer, mm; ks1 is the global calibration coefficient (2.03 for granular materials and 1.35 for fine-grained materials); εv is the average vertical resilient or elastic strain in the layer/sublayer, calculated by the structural response model, mm/mm; hsoil is the thickness of the unbound layer/sublayer, mm; ε0 is the intercept determined from laboratory repeated load permanent deformation tests, mm/mm; εr is the resilient strain imposed in the laboratory test to obtain the material properties ε0, β and ρ, mm/mm; n is the number of axle-load applications; βs1 is the local calibration constant for rutting in the unbound layers (set to 1.0 for the global calibration effort).
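Equation (2) takes the standard MEPDG form for unbound materials; reconstructed from the definitions above:

```latex
\Delta_{p(soil)} = \beta_{s1}\, k_{s1}\, \varepsilon_{v}\, h_{soil}
  \left(\frac{\varepsilon_{0}}{\varepsilon_{r}}\right)
  e^{-\left(\rho/n\right)^{\beta}} \tag{2}
```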
The parameters β and ρ are determined by Equation (3), where Wc is the water content, %; Mr is the resilient modulus of the unbound layer or sublayer, kPa; a1 and a9 are regression constants (a1 = 0.15 and a9 = 20.0); b1 and b9 are regression constants (b1 = 0 and b9 = 0). As with the formulas above, some parameters had to be calibrated to find the best combination. After selecting appropriate values of k2r and k3r in Equation (1), the parameter minimisation function of Equation (4) is used to determine the value of ks1 in Equation (2). The final result Δtotal is obtained from the linear combination in Equation (5), where kij is the matrix of calibration coefficients and Δlayers is the vector of plastic deformation values of each layer; both are one-dimensional column vectors.
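Equations (3) and (5) can be reconstructed from the MEPDG formulation (the expression for ρ below, with C₀ = ln(a₁Mr^{b₁}/(a₉Mr^{b₉})), is the guide's standard form and should be checked against the source):

```latex
\log \beta = -0.61119 - 0.017638\, W_c, \qquad
\rho = 10^{9} \left( \frac{C_0}{1 - (10^{9})^{\beta}} \right)^{1/\beta} \tag{3}
% total rutting as a linear combination of layer deformations:
\Delta_{total} = k_{ij}^{T}\, \Delta_{layers} \tag{5}
```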

Regression metric evaluation on calibration
After the model building steps above, the next step is to evaluate the performance of the model. The metrics selected here are the coefficient of determination R² and the standard error Se. R² describes how well the regression model is likely to predict future samples (Wang et al., 2021); it also gives the proportion of the variance of the response variable explained by the model (Haroon & Clustering, 2017). Its value ranges from 0 to 1.0, where 1.0 is best. Se reflects the deviation of the predicted values from the observed values: the smaller the Se, the better the model fit. Equation (6) and Equation (7) describe how to calculate R² and Se, where ŷi is the estimated target output of the i-th sample; yi is the corresponding (correct) target output; and nsamples is the number of samples. Table 1 shows the achievements of different organisations or individuals in optimizing linear regression models. As seen in Table 1, only the R² of American Association of State… (2008), Kaya (2015), Mallela et al. (2009a, 2009b), and Schram & Abdelrahman (2010) exceed 0.5, indicating that most linear models perform poorly. When developing a linear regression model to predict rutting, the impact of nonlinear factors on the target value was ignored, so the final models were inaccurate and their R² deviated from 1.0. Such models can hardly explain the observed behaviour, let alone predict rutting.
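Equations (6) and (7) are simple to implement; the following minimal Python sketch mirrors the definitions above (the function names are illustrative, and the convention of dividing by n − 1 in Se is an assumption, as the paper does not state the denominator):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 per Equation (6):
    1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def standard_error(y_true, y_pred):
    """Standard error Se per Equation (7), here using an (n - 1)
    denominator as one common convention."""
    n = len(y_true)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    return (ss_res / (n - 1)) ** 0.5
```

A perfect predictor gives R² = 1.0 and Se = 0; a predictor no better than the mean of the observations gives R² = 0.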

Artificial neural networks
Inspired by biology, an NN is composed of a series of closely connected simple units, each with a number of real-valued inputs and a single real-valued output (Dayhoff & DeLeo, 2001; Hopfield, 1988). An essential use of NNs is to accept and process complex inputs from sensors and perform adaptive learning. The neural network algorithm simulates a biological neural network and is a pattern-matching algorithm usually used to solve classification and regression problems.

The framework of the neural network method
This section explains the overall establishment of the NN model for estimating pavement rutting using sample data collected by the NCHRP and the Federal Highway Administration (FHWA) Long-Term Pavement Performance (LTPP) program. The data set includes 11 columns, such as predicted rutting in the pavement structural layers, time of service, original air void, and resilient modulus. The training data are used in their original form, without algorithmic modification, to preserve the practical validity of the model. As shown in Figure 1, the NN framework consists of four parts: 1. the NN architecture; 2. data set preparation for preprocessing; 3. preprocessing and data set allocation; 4. rutting estimation through deep learning. In the first step, the NN architecture is designed by converting pavement performance parameters (i.e., historical record data) into a variable spreadsheet. The second step uses the actual road data in the spreadsheet to generate a sample table; before the corresponding samples are used for training, feature standardisation, data cleaning, and feature extraction are performed, so that after processing each feature falls within a narrow numeric range. In the third step, the processed data are divided into data sets, with one part used as the training set for model building and the other as the testing set for model evaluation. In the last step, the prepared data set is used to train the NN model, which then estimates the rutting depth for untrained samples.

Proposed neural network architecture
Figure 2 shows the NN structure used for rutting estimation, consisting of an input layer, several hidden layers, and an output layer; each layer's transformation adds a bias term. The input layer receives the pavement performance data read from the variable spreadsheet, and the data are passed to the first hidden layer through a feature transformation. The first hidden layer passes its output to the second hidden layer through a further transformation, and so on, until all hidden layers are traversed; this process is called forward propagation. A regularisation penalty is introduced during forward propagation to prevent the model from overfitting. Finally, the result of the NN is produced by the output layer.

Standardisation and regularisation
Since the attributes of the samples have different magnitudes, the data need to be standardised so that the features are scaled to a specified range and no attribute dominates through its units. Equation (8) shows the standardisation formula, where Z is the standardised data; X is the original data; Xmean is the mean of the original data; and std(X) is the standard deviation of the original data. As shown in Figure 3, overfitting refers to the phenomenon in which the model fits the training data too closely, so it fails to output high-precision predictions for untrained data. It happens when the features transformed from the original data are fed to the NN and the network fits the noisy data by memorising these features instead of learning from them (Dietterich, 1995; Jeong et al., 2020). Overfitted models have low bias and high variance and lack generalisation to untrained data sets, resulting in inaccurate predictions (Jeong et al., 2020). Regularisation penalises large weight norms and favours weight parameters with stronger generalisation ability, so the learned weights tend to be stable. Usually the L2 regularisation of Equation (9) is chosen, which adds the squared sum of the weight parameters as a penalty, so large weights are penalised more strongly,
where J is the regularised loss function; J0 is the sum of squared errors between actual values and predictions; α is the penalty coefficient; w is the weight vector.
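Equations (8) and (9) can be sketched directly in Python (the function names are illustrative, and `squared_error_sum` stands for the J0 term defined above):

```python
def standardise(data):
    """Z-score standardisation per Equation (8): Z = (X - Xmean) / std(X)."""
    n = len(data)
    mean = sum(data) / n
    std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
    return [(x - mean) / std for x in data]

def l2_loss(squared_error_sum, weights, alpha):
    """Regularised loss per Equation (9): J = J0 + alpha * sum of w^2."""
    return squared_error_sum + alpha * sum(w * w for w in weights)
```

After standardisation each feature has zero mean and unit variance, which keeps all inputs on a comparable scale before they reach the network.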

Activation function
In the NN architecture, after the data are input, the features are transformed layer by layer, and the error is finally evaluated through the loss function; this is forward propagation. Next, an appropriate optimisation method is chosen and backpropagation proceeds from the output layer backwards, updating the weight parameters layer by layer. If the transformations between layers were all linear, a single set of weight parameters could represent their entire composition, and such a collapsed network could not capture the underlying behaviour. On the one hand, the NN must solve both linear and nonlinear problems (Hopfield, 1988); on the other hand, it needs to filter the transformed features so that the valuable features play a more significant role through the activation function. Common activation functions include the sigmoid function and the rectified linear unit (ReLU) function (Ramachandran et al., 2017). As shown in Figure 4, derivatives are taken layer by layer in the backpropagation process. The derivative of the sigmoid function is close to zero when the input is large in magnitude; since backpropagation is carried out layer by layer, if the gradient of a layer is near zero, all the layers behind it receive almost no update (Wang et al., 2022), which is the biggest problem with the sigmoid function (Bengio et al., 1994).
The ReLU function is a continuous piecewise function consisting of two parts: its value on the negative semi-axis of x is zero, and on the positive semi-axis its expression is y = x. The gradient is kept active by preventing gradient saturation, and the linear operation makes the computation cheap. For an input x, the output is zero when x is less than zero; when x is greater than zero, the output equals the input itself. ReLU is still a nonlinear function, but it avoids the vanishing-gradient problem, and its straightforward computation speeds up the iteration of the network.
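The contrast between the two activation functions and their gradients can be sketched as follows (a minimal illustration, not code from the paper):

```python
import math

def sigmoid(x):
    """Sigmoid activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative s(x)(1 - s(x)); vanishes for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    """ReLU activation: zero on the negative semi-axis, y = x on the positive."""
    return x if x > 0 else 0.0

def relu_grad(x):
    """ReLU derivative: stays at 1 on the positive side, so gradients
    do not saturate during backpropagation."""
    return 1.0 if x > 0 else 0.0
```

For x = 10, `sigmoid_grad` is already nearly zero while `relu_grad` remains 1, which is exactly the vanishing-gradient contrast described above.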

Optimizing with adaptive moment estimation
The predicted value of the regression equation and the actual value of a sample are generally unequal; the difference between them is usually called the error term. The objective function introduced in Equation (10) is minimised to reduce this error: setting its partial derivatives to zero yields the optimal parameters θ. Therefore, the extremum can be found as long as an appropriate learning rate is selected and the gradient is calculated; this process is the gradient descent algorithm. The batch gradient descent algorithm must consider all sample data: every iteration of the optimisation recomputes the formula over all samples, so iteration becomes very slow when the number of samples is huge. Facts have proved that the Stochastic Gradient Descent (SGD) algorithm is an effective optimisation tool and is essential in many successful machine learning cases, such as advanced deep learning technologies (Deng et al., 2013; Kingma & Ba, 2014). It uses only one sample at a time, dramatically improving the iteration speed. However, SGD also has the following problems: 1. difficulty selecting a suitable initial learning rate; 2. the pre-specified adjustment rules limit the learning-rate adjustment strategy; 3. the same learning rate is applied to every parameter; 4. the optimisation of a highly non-convex error function tends to fall into local sub-optimal solutions or saddle points.
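The least-squares objective of Equation (10), written consistently with the symbols defined just below, is:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m}
  \left( h_{\theta}\!\left(x^{(i)}\right) - y^{(i)} \right)^{2} \tag{10}
```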
where m is the number of samples in the entire data set; hθ(x(i)) is the regression equation; y(i) is the actual value. To address the problems of plain SGD, Duchi et al. (2011) released the adaptive gradient (AdaGrad) optimisation algorithm, which adapts the learning rate to each parameter: frequently updated parameters receive smaller steps, while sparse parameters are updated with larger steps. Its main flaw comes from the continuous accumulation of squared gradients in the denominator term: as the time step increases, the denominator grows larger and larger, eventually shrinking the learning rate until parameters can no longer be effectively updated. Root mean square propagation (RMSProp) is an algorithm introduced by Hinton et al. (2012) in their lecture notes that adjusts the learning rate using an exponential moving average of the squared gradient to deal with the non-convergence caused by a non-stationary objective function.
Kingma & Ba (2014) proposed the adaptive moment estimation (Adam) optimiser in 2014, as shown in Figure 5, combining the advantages of the AdaGrad and RMSProp optimisation algorithms. Building on them, Adam computes the update step size by considering both the first-moment estimate (the mean of the gradient) and the second-moment estimate (the uncentred variance of the gradient). Adam is considered an optimiser with good default performance in many cases, with the following significant advantages: 1. it weakens the influence of gradient fluctuations on parameter updates; 2. very few adjustments are required, and the hyperparameters have good interpretability; 3. the step size is limited to a specific range (the initial learning rate); 4. it is suitable for scenarios with large amounts of data or parameters; 5. it handles non-stationary objective functions well; 6. it helps with sparse gradients or gradients with considerable noise.
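A bare-bones Adam update loop, sketched from the description above, is shown below; the decay rates and epsilon are the defaults suggested by Kingma & Ba (2014), while the learning rate, step count, and quadratic test function are illustrative assumptions:

```python
def adam_minimise(grad, theta, steps=2000, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise a scalar function given its gradient, using Adam."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g       # first moment: mean of gradients
        v = beta2 * v + (1 - beta2) * g * g   # second moment: uncentred variance
        m_hat = m / (1 - beta1 ** t)          # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (v_hat ** 0.5 + eps)
    return theta

# Usage: minimise f(theta) = theta^2, whose gradient is 2*theta
theta = adam_minimise(lambda th: 2.0 * th, theta=5.0)
```

Because the step is scaled by the ratio of the two moment estimates, its magnitude stays bounded by roughly the learning rate, which is advantage 3 in the list above.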

Data acquisition and input
Plentiful and reliable data sets are essential for a successful supervised learning approach. Obtaining a large amount of high-quality labelled data is difficult in practice, yet feeding such data into the neural network is crucial to the training effect of the model. In this NN training, historical monitoring data collected from the LTPP database, covering a total of 88 sites in 28 states, were used to improve the reliability of the model training results. These data are currently available in the two reports issued by NCHRP Project 01-37A, Appendix GG-1 (Applied Research Associates, 2004a) and Appendix EE-1 (Applied Research Associates, 2004c). Additional traffic data are accessed directly from the LTPP InfoPave website (https://infopave.fhwa.dot.gov/). Figure 6 shows the source of these data.
The data extracted from Appendix EE-1 (Applied Research Associates, 2004c) and Appendix GG-1 (Applied Research Associates, 2004a) are listed in Table 2 and Table 3, including ten columns of feature values, such as the rutting of the AC, granular base (GB) and subgrade (SG) layers, and a column of target values representing the actual rutting depth. The selection and sources of the specific variables are listed in Table 4.

Neural network model construction
Four hundred and forty data set samples were used to develop the rutting prediction model. For the NN training process, 352 samples (80% of the data) were randomly selected, while the remaining 88 samples (20% of the data) were reserved for testing (Esra'a & Abo-Qudais, 2018). This neural network is called NN-pro. Because the sample data are limited, setting aside a separate validation set would be wasteful. To make full use of the existing data, the cross-validation method from machine learning was introduced, which divides the training set into multiple parts. After numerous attempts, fourfold cross-validation was finally adopted, as shown in Figure 7. Verifying a particular configuration takes four steps. In the first step, the first three folds are used as the training set and the last as the validation set to obtain one result. In each subsequent step, a different fold serves as the validation set and the rest as the training set. After four steps, four results are obtained, one per fold, and together the folds cover all the data in the original training set. The four results are then averaged to obtain the final model evaluation result. Cross-validation plays a significant role in evaluating the model and makes the results more accurate.
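The fourfold split described above can be sketched in plain Python (the function name is illustrative):

```python
def fourfold_splits(samples, k=4):
    """Yield (train, validation) splits for k-fold cross-validation:
    each fold serves as the validation set exactly once."""
    n = len(samples)
    fold = n // k
    for i in range(k):
        start = i * fold
        stop = (i + 1) * fold if i < k - 1 else n  # last fold absorbs remainder
        val = samples[start:stop]
        train = samples[:start] + samples[stop:]
        yield train, val

# Usage with the paper's 352 training samples: each step trains on three
# folds (264 samples) and validates on the remaining fold (88 samples).
splits = list(fourfold_splits(list(range(352))))
```

Averaging the four validation scores gives the single model-evaluation result described in the text.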
When configuring the network, appropriate hyperparameters, such as the number of layers, the number of neurons in each layer, the learning rate, the batch size, and the number of epochs, need to be determined. Different hyperparameters affect the model differently: the regularisation parameters, the number of layers, and the number of neurons mainly affect the accuracy of the network, while the learning rate and the activation function primarily affect the training process (Huang et al., 2004). To find the best configuration, the number of hidden layers was varied from one to five, and a random search was finally exploited to determine the best model (Bergstra & Bengio, 2012; Huang et al., 2004, 2006; Pao et al., 1994). The specific structure of NN-pro and the selected hyperparameters are shown in Table 5. Figure 8 shows how the error generated during model training decreases as iterations accumulate. The results show that the standard error of the network dramatically decreases as learning deepens; when the error value stops changing, model training ends. After about 2000 iterations, the error stabilises, and the standard error on the training set is only 1.034, implying that the NN performs well in predicting rutting depth.
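A random hyperparameter search of the kind used here can be sketched as follows (the candidate neuron counts and learning-rate range are illustrative assumptions, not the values reported in Table 5):

```python
import random

def random_search(evaluate, trials=20, seed=0):
    """Random hyperparameter search (Bergstra & Bengio, 2012): sample
    configurations at random and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(trials):
        cfg = {
            "hidden_layers": rng.randint(1, 5),          # range from the text
            "neurons": rng.choice([16, 32, 64, 128]),     # illustrative grid
            "learning_rate": 10 ** rng.uniform(-4, -2),   # illustrative range
        }
        score = evaluate(cfg)  # e.g. the mean cross-validation standard error
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Usage with a dummy objective that favours three hidden layers
best, _ = random_search(lambda c: abs(c["hidden_layers"] - 3))
```

In practice `evaluate` would train NN-pro with the sampled configuration and return the averaged fourfold cross-validation error.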

Results and analysis
As illustrated in Equation (5), the rutting depth calculation method in the MEPDG computes the PD of the AC, GB and SG layers and integrates the deformation of all layers through a linear combination, which is equivalent to fitting ten zero-intercept rutting feature variables with an MLR model (Gong et al., 2018). Consequently, to better evaluate the performance of the NN, a multiple linear regression model, MLR-pro, was built with the same input variables for comparison. The MLR-pro model is also compared with the MLR20 and NN20 models established by Gong et al. (2018) to show the effect of model optimisation, with R² and Se as evaluation indicators. Figure 9 and Figure 10 show the actual and predicted rutting from MLR20 and MLR-pro. On the training set of 352 samples, R² increased by 123.5% and Se decreased by 47.8%. On the testing set of 88 samples, the R² and Se of the MLR20 model are 0.265 and 3.09 mm, respectively, while MLR-pro achieved an R² of 0.712 and an Se of 1.798 mm: compared to MLR20, the R² of MLR-pro increased by 168% and the Se decreased by 41.8%. Figure 11 and Figure 12 show the performance of the NN20 and NN-pro models. For NN20, R² and Se on the training set are 0.899 and 1.402 mm, and on the testing set 0.867 and 1.403 mm, respectively. For NN-pro, R² and Se are 0.929 and 1.034 mm on the training set and 0.902 and 1.050 mm on the testing set. The R² of NN-pro is thus 3% and 4% higher than that of NN20 on the training and testing sets, and the Se values are reduced by 26% and 25%. Compared to the MLR-pro model, the R² increased by 26.7% and the Se decreased by 41.6% on the testing set.
Moreover, the R² of NN-pro exceeds 0.9, meaning the model explains more than 90% of the variance in the observed rutting depth and implying that NN-pro has good generalisation ability. The detailed comparison is listed in Table 6.

Model analysis
Choosing the correct variables to feed the NN is also very important because the NN model is prone to overfitting: when the input features are too many and the dimensionality too large, the model performs excellently on the training set but unsatisfactorily on the testing set. NN-pro optimises the input variables of NN20, selects ten variables to feed the neurons, and combines Adam to optimise the loss function, so that the residual of the model is significantly reduced. Although ten input variables are used, their status is not equal, and their impact on the model varies in importance. The random forest algorithm from ensemble learning makes it easy to measure the importance of the different variables in the model. Figure 13 depicts the importance of the different variables to the results. The main influences on rut depth are the PD of the AC layer and the AADT of trucks, because the surface layer is loaded directly and accumulates plastic deformation under the repeated rolling action of vehicles. Deformation of the GB layer also plays a critical role in the development of rutting: in a pavement structure composed of loose materials, slip and displacement occur among particles, causing loss of bearing capacity and triggering distress. In summary, the NN is much better than MLR in rutting prediction performance and has good generalisation. The difference between the models is essentially the selection of different weights and biases, meaning the final result of a model is entirely determined by its weight and bias parameters. Multiple linear regression is a vital algorithm in machine learning with relatively strong predictive ability; however, linear correlations among variables coexist with nonlinear influencing factors in practical applications, and it is the activation function in the NN that captures the latter.
The input-to-loss calculation is completed through forward propagation, the parameters are optimised through backpropagation, and suitable weights and biases are obtained. Without activation functions, the output of each layer is a linear function of the previous layer's input, so even if the number of NN layers increases, the output remains a linear combination of the input. If an activation function introducing nonlinear factors is used, however, there are both linear and nonlinear connections among neurons, so such an NN can approximate any nonlinear function or model. Figure 14 plots the correlation between the three most influential variables and the actual rutting values. The results show that although the rutting depth does increase with each independent variable, the relationship can hardly be explained as linear, which again demonstrates the necessity of introducing nonlinear factors.
However, the proposed NN model still has some shortcomings. The sample points near the centreline are somewhat sparse, especially on the testing set, which keeps the model from reaching a higher level of prediction. The main reason is the insufficient training data set of only 440 samples. The outstanding performance of the neural network model helps optimise asphalt pavement design and promotes the development of precise and intelligent road maintenance, which in turn helps improve the method of the Mechanistic-Empirical Pavement Design Guide. In the future, it is recommended to use high-performance test equipment to collect more accurate road performance data and to expand the data set to develop the full potential of the model.