3.3. Data Preparation
The LiDAR data were first adjusted to a local Cartesian coordinate system in which the x-axis was parallel to the moving direction, the y-axis was transverse to it, and the road surface was as close to horizontal as possible. With the physical LiDAR configuration angles $\theta_x$, $\theta_y$, and $\theta_z$ representing the rotation in degrees around the x, y, and z axes, respectively, the rotation matrices are defined as follows:

$$R_x(\theta_x)=\begin{bmatrix}1 & 0 & 0\\ 0 & \cos\theta_x & -\sin\theta_x\\ 0 & \sin\theta_x & \cos\theta_x\end{bmatrix},\quad R_y(\theta_y)=\begin{bmatrix}\cos\theta_y & 0 & \sin\theta_y\\ 0 & 1 & 0\\ -\sin\theta_y & 0 & \cos\theta_y\end{bmatrix},\quad R_z(\theta_z)=\begin{bmatrix}\cos\theta_z & -\sin\theta_z & 0\\ \sin\theta_z & \cos\theta_z & 0\\ 0 & 0 & 1\end{bmatrix}$$

The origin is taken as the physical center of the LiDAR; thus, no translation is necessary, and the rotation is performed around the origin (0, 0, 0). The combined rotation matrix $R$ is defined as:

$$R = R_z(\theta_z)\, R_y(\theta_y)\, R_x(\theta_x)$$
Given a point cloud $P$ represented as:

$$P = \{p_1, p_2, \dots, p_n\}$$

Each point $p_i$ in the point cloud is represented as a vector (in addition to a corresponding intensity $I_i$):

$$p_i = \begin{bmatrix} x_i & y_i & z_i \end{bmatrix}^T$$

The rotation matrix $R$ is applied to each point in the point cloud $P$, resulting in the rotated point cloud $P'$:

$$p_i' = R\, p_i$$

All subsequent references to the point cloud $P$ herein pertain to the rotated point cloud $P'$, unless explicitly stated otherwise. An example of the top, cross-sectional, and intensity cross-section views of a single scan is shown in Figure 5.
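For illustration, a minimal Python sketch of this alignment step is given below. The function and variable names, the use of NumPy, and the (n, 4) array layout of [x, y, z, intensity] are assumptions made for the example rather than details taken from the study.

import numpy as np

def rotation_matrix(theta_x, theta_y, theta_z):
    """Build the combined rotation matrix R = Rz @ Ry @ Rx from angles in degrees."""
    ax, ay, az = np.radians([theta_x, theta_y, theta_z])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az), np.cos(az), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

def rotate_point_cloud(points, theta_x, theta_y, theta_z):
    """Rotate an (n, 4) array of [x, y, z, intensity] around the origin; intensity is left unchanged."""
    R = rotation_matrix(theta_x, theta_y, theta_z)
    rotated = points.copy()
    rotated[:, :3] = points[:, :3] @ R.T  # apply R to every point (row-vector convention)
    return rotated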
To extract the lane marking data points from the LiDAR scans, a process was developed to simplify manual segmentation. Rasterized graphs of the top view and intensity cross-section were used to segment the lane marking. The image axis limits were set independently of the range of the point cloud being plotted, so that each pixel in the rasterized image corresponded exactly to known dimensions in the original point cloud. Next, the rasterized images were manually annotated to segment the points belonging to the pavement marking at the retroreflectometer's measuring window. The annotated bounding boxes were then translated back to actual dimensions. Finally, points belonging to the annotated areas were extracted from the full point cloud. A rasterized intensity cross-section graph is shown in
Figure 6.
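As a simplified sketch of the final extraction step, the following function selects the points falling inside an annotated bounding box once its pixel coordinates have been converted back to point-cloud units; the box format and variable names are illustrative assumptions, not the study's exact implementation.

import numpy as np

def extract_marking_points(points, box):
    """Return the subset of an (n, 4) point cloud [x, y, z, intensity] that lies inside
    a bounding box (x_min, x_max, y_min, y_max) expressed in point-cloud units."""
    x_min, x_max, y_min, y_max = box
    mask = ((points[:, 0] >= x_min) & (points[:, 0] <= x_max) &
            (points[:, 1] >= y_min) & (points[:, 1] <= y_max))
    return points[mask]  # marking points at this station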
After annotation and segmentation of the marking points, the output of this process at each station is an array of shape (n, 4), where n is the number of point cloud points at the measured station and 4 represents the x, y, z, and intensity readings.
Figure 7 summarizes the proposed data preparation process.
3.4. Feature Extraction
After selecting points for each pavement marking location using the procedure discussed previously, the dataset was structured in such a way that each measured retroreflectivity value corresponded to a collection of point cloud points. The number of point cloud points varied between locations, making it challenging to fit a regression model without consistent input. To address this issue, features were extracted from the point cloud data. This feature extraction process transformed the dataset into a standardized format, where each pavement marking location was represented by a fixed set of features, regardless of the original number of points in the point cloud. These extracted features captured the relevant information from the point cloud data, providing a consistent shape of input for the regression model. The shape of the dataset is described in
Figure 8, where, at location $N$: $P_N$ is the subset of point cloud points of shape $(n_N, 4)$, $RL_N$ is the measured retroreflectivity value, and $F_N$ is the vector of extracted features of constant shape ($m$).
For the task of extracting and evaluating a wide range of features, the Python package tsfresh proved to be a valuable tool. tsfresh is designed to automatically compute numerous features from time series data [
17]. In this study, the data could be treated as time series data, such that the lateral axis (
y axis) corresponded to the time axis. By using tsfresh, we could leverage its functionality to extract a multitude of features for each channel of the input data. These features included basic statistics (such as minimum, maximum, mean, various percentiles, and skewness), complexity features (such as count above and below mean), linear trend features, and entropy features, among many others. In general, the feature names were encoded such that the first letter represented the variable from which the feature was extracted, followed by double underscores and the feature family, followed by any parameters used to define the feature. For example, a quantile feature computed on the intensity channel would begin with the letter representing the intensity, followed by the feature family (quantile), followed by the particular quantile that the feature represented. Following is a summary of the most important extracted features:
These include the mean, median, mode, standard deviation, root mean square, absolute energy, maximum, minimum, range, inter-quartile range, quantiles from 0.1 to 0.9, skewness, kurtosis, sum of recurring values, and many other basic and statistical features describing the central tendency, the variation, and the distribution shape of the data. The full list of features extracted is described in [
18].
This family of features involves fixing a window with top and bottom limits based on two quantiles, $q_{low}$ (the lower quantile) and $q_{high}$ (the higher quantile). Only values inside this window are considered, and the consecutive changes of the values are determined. The differences are taken either raw or as absolute values. Finally, the differences are aggregated using an aggregation function such as the mean, median, variance, etc.
This feature is a measure of the nonlinearity of the data values across the direction of y (lateral distance). The feature is implemented in the tsfresh library and was proposed in [19]. It is calculated as [20]:

$$c3(lag) = \frac{1}{n - 2\,lag} \sum_{i=1}^{n - 2\,lag} x_{i + 2\,lag} \cdot x_{i + lag} \cdot x_i$$

where:
$n$ is the number of data points.
$x_i$ is the intensity value.
$lag$ is the lag.
This family of features involves calculating a linear least-squares regression over values aggregated in defined chunks. First, the series was divided into chunks of a defined length. Next, an aggregation function was applied on each chunk to summarize its values. The aggregation function could be the max, min, mean, or median. Finally, a linear regression was fitted on the aggregated values. In this regression, the aggregated values were the dependent variable, and the chunk number was the independent variable. The attribute returned could be the "p-value", "r-value", "intercept", "slope", or "stderr".
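As an illustrative sketch (not the exact extraction script used in the study), features of this kind can be computed with tsfresh roughly as follows; the segmented per-station arrays, column names, and settings shown here are assumptions made for the example.

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters

# `stations` is assumed to be a list of (n, 4) arrays [x, y, z, intensity],
# one per measured location, produced by the segmentation step described above.
frames = []
for loc_id, pts in enumerate(stations):
    df = pd.DataFrame(pts, columns=["x", "y", "z", "i"])
    df = df.sort_values("y")  # the lateral y axis plays the role of the time axis
    df["id"] = loc_id
    frames.append(df)
long_df = pd.concat(frames, ignore_index=True)

# One row of features per location; names follow the pattern
# <variable>__<feature family>__<parameters>, e.g. i__quantile__q_0.7.
features = extract_features(long_df, column_id="id",
                            default_fc_parameters=EfficientFCParameters())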
3.5. Feature Selection
The feature extraction process from the previous section resulted in more than 3000 extracted features. Dealing with such high-dimensional data required a rigorous procedure to exclude uninformative features. The feature selection process implemented in this study consisted of multiple steps, which are discussed in detail in the following subsections.
3.5.1. Filter Methods
The first step was simply to analyze the variance of each feature and discard any feature whose values remained essentially unchanged across the different samples. The second step of the feature selection process utilized a set of three filter methods: Pearson's correlation coefficient, Spearman's rank correlation coefficient, and the maximal information coefficient (MIC). These approaches were chosen for their ability to evaluate features based on their relationship with the output variable, independent of any specific model's influence. Each method, with its unique properties, provided valuable insight into the presence of a relationship between each feature and the measured retroreflectivity. Thresholds were set for each of the filter methods used, and the final set of features was selected based on passing the thresholds of the filter methods. This step drastically reduced the number of features to be considered in the following steps, in general from more than 3000 features to fewer than 100 related features.
The variance $\sigma^2$ of a feature $x$ is calculated as:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

where:
$n$ is the number of samples.
$x_i$ is the value of the feature for the ith sample.
$\bar{x}$ is the mean value of the feature across all samples.
The variance threshold method then checks the variance of each feature against a specified threshold $t$ such that, if $\sigma^2 < t$, then feature $x$ is removed.
Pearson's correlation coefficient is a metric measuring the linear association between two continuous variables. Its value ranges between −1 and 1, with values near the extremes indicating strong negative or positive linear relationships. When selecting features, a significant Pearson's correlation with the target variable can suggest the importance of the feature. However, it is imperative to note that features with high inter-correlations can introduce redundancy, suggesting the potential removal of overlapping features; in this study, this was addressed using a different method in later steps of feature selection. The Pearson's correlation coefficient ($r$) is calculated using the formula:

$$r = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}}$$

where:
$n$ is the number of data points.
$x_i, y_i$ are the individual data points of the feature $x$ and the target variable $y$.
$\bar{x}, \bar{y}$ are the means of all the data points of $x$ and $y$, respectively.
Spearman’s rank correlation coefficient is a non-parametric measure used to assess the strength and direction of the monotonic relationship between two variables. The value of Spearman’s rank coefficient can range from −1 to 1, with values close to the extremes indicating a strong negative or positive monotonic relationship. The ranking process makes Spearman’s method more robust to non-linear relationships and outliers compared to methods that rely on actual data values, such as Pearson’s correlation.
Spearman's rank correlation coefficient ($\rho$) is given by:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)}$$

where:
$d_i$ is the difference between the ranks of the two variables for each data point (the ranks refer to the ordering of data points when sorted in ascending or descending order).
$n$ is the number of data points.
The maximal information coefficient (MIC) is a modern statistical measure used to identify and quantify associations between two variables. The basis of MIC lies in the concept of mutual information. Mutual information quantifies the amount of information one can acquire about one variable by observing another. For MIC, the objective becomes finding an optimal grid partitioning for the paired variables that pushes this mutual information to its peak. Once the mutual information is computed, it undergoes a normalization process which aims to confine the MIC values within a range of [0, 1]. A MIC value close to 0 would suggest a negligible association between the variables in question. In contrast, a value veering towards 1 would be indicative of a pronounced association.
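The three filter scores can be computed with standard Python tooling. The following minimal sketch assumes SciPy for the two correlation coefficients and the minepy package for MIC; the threshold values are illustrative, not those used in the study.

import numpy as np
from scipy.stats import pearsonr, spearmanr
from minepy import MINE

def filter_scores(feature, target):
    """Return (|Pearson r|, |Spearman rho|, MIC) for one feature against the target."""
    r, _ = pearsonr(feature, target)
    rho, _ = spearmanr(feature, target)
    mine = MINE(alpha=0.6, c=15)  # default MINE parameters
    mine.compute_score(feature, target)
    return abs(r), abs(rho), mine.mic()

def passes_filters(feature, target, thresholds=(0.3, 0.3, 0.3)):
    """Keep a feature only if it clears all three (illustrative) thresholds."""
    scores = filter_scores(np.asarray(feature), np.asarray(target))
    return all(score >= t for score, t in zip(scores, thresholds))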
3.5.2. Exhaustive Search
After the preliminary round of feature selection, a refined subset of features remained. The features that demonstrated their significance in the initial selection phase were then subjected to a rigorous evaluation of their individual and combined performance in a model. First, each feature was employed independently to train the model, after which the model's performance was gauged. This individual analysis aimed to pinpoint the feature that, on its own, offered the most substantial contribution to the model's performance. After the singular evaluations, Exhaustive Search was employed to explore all possible pairings of the shortlisted features. In practical terms, for a dataset with $m$ features remaining after the initial selection, Exhaustive Search would evaluate $\binom{m}{2} = m(m-1)/2$ potential feature pairs. Each of these pairs was used to train the model, and its performance was recorded.
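A sketch of this pairwise Exhaustive Search is shown below, scoring each pair by cross-validated R²; the estimator and scoring choices are assumptions made for illustration.

from itertools import combinations
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def best_feature_pair(X, y, feature_names):
    """Evaluate every pair of shortlisted features in X (a pandas DataFrame) against
    the target y and return the best-scoring pair with its mean cross-validated R^2."""
    results = []
    for f1, f2 in combinations(feature_names, 2):
        scores = cross_val_score(RandomForestRegressor(n_estimators=200),
                                 X[[f1, f2]], y, cv=5, scoring="r2")
        results.append(((f1, f2), scores.mean()))
    return max(results, key=lambda item: item[1])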
3.5.3. Wrapper Feature Selection: Sequential Backward Selection
In general, wrapper methods are a type of feature selection technique where different subsets of features are used to train a model. The performance of the models trained on the different subsets is then used to gauge feature importance and select the subset that results in the best performance metrics. In contrast to filter methods, where a feature's usefulness is determined by intrinsic properties of the data, wrapper methods depend on the empirical performance of the model trained on the particular subset of features. In sequential backward selection, the evaluation is initiated by including the full set of features and then sequentially removing the least important ones. The decision to remove a feature is based on the impact of its removal on the model performance. The process is repeated until a desired number of features is reached, or until the removal of further features significantly degrades the model performance.
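Sequential backward selection is available off the shelf; a minimal sketch using scikit-learn's SequentialFeatureSelector is given below as one possible implementation, where X_shortlisted and y stand for the shortlisted feature matrix and the measured retroreflectivity values.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SequentialFeatureSelector

# Start from the full shortlisted set and drop the least useful features one at a time,
# keeping the subset that best preserves cross-validated R^2.
sbs = SequentialFeatureSelector(GradientBoostingRegressor(),
                                n_features_to_select=5,  # illustrative target size
                                direction="backward",
                                scoring="r2",
                                cv=5)
sbs.fit(X_shortlisted, y)
selected = sbs.get_feature_names_out()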
3.6. Evaluated Models
In the comprehensive evaluation of potential models that correlate LiDAR data to retroreflectivity, a diverse set of regression models was explored. Evaluating a diverse suite of algorithms ensured that the analysis was both comprehensive and robust, catering to the myriad possible structures and relationships within the data. Following is a discussion of the evaluated models.
Linear Regression is one of the foundational algorithms in regression analysis. It assumes a linear relationship between the independent and dependent variables, aiming to find the best-fitting line that minimizes the sum of squared errors. While it offers simplicity and interpretability, its assumption of linearity can be a limitation, especially when the true relationship between variables is more complex.
Lasso Regression, which stands for Least Absolute Shrinkage and Selection Operator, introduces L1 regularization to the traditional linear regression, adding a penalty equivalent to the absolute value of the magnitude of coefficients. This often results in some feature coefficients being exactly zero, effectively leading to feature selection. Lasso can be particularly useful when it is suspected that many features are redundant or irrelevant.
Ridge Regression is another variant of linear regression that employs L2 regularization. Instead of eliminating certain features as Lasso does, Ridge tends to shrink the coefficients of less important features closer to zero, balancing out the contribution of all features. It is especially beneficial in situations where features are highly correlated, known as multicollinearity.
ElasticNet Regression offers a middle ground between Lasso and Ridge by combining both L1 and L2 regularizations. This hybrid approach ensures that ElasticNet inherits the strengths of both methods, making it versatile in handling various data structures.
Random Forest is an ensemble technique that builds multiple decision trees and aggregates their results. By leveraging the power of multiple trees, it mitigates the risk of overfitting, to which a single decision tree might be prone. Random Forest starts by creating multiple sets of data from the original dataset using a technique called bootstrap sampling. This means that for a dataset of size $n$, several samples, each of size $n$, are randomly selected with replacement, such that a sample can be picked more than once. Let us say that $B$ bootstrapped samples are created. For each of these bootstrap samples, a decision tree is constructed. However, instead of considering all features at each split, only a random subset of the features is considered. The final prediction of the regression is then the average of the predictions from all $B$ trees.
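To make the bootstrap-and-average idea concrete, a toy sketch is shown below; it illustrates the mechanism only and is not the library implementation used in the study.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def bagged_forest_predict(X_train, y_train, X_test, n_trees=100):
    """Train n_trees decision trees on bootstrap samples and average their predictions."""
    n = len(X_train)
    predictions = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap sample of size n, drawn with replacement
        tree = DecisionTreeRegressor(max_features="sqrt")  # random feature subset at each split
        tree.fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_test))
    return np.mean(predictions, axis=0)  # final prediction = average over all trees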
Extra Trees, or Extremely Randomized Trees, is an ensemble method similar to Random Forest. The primary distinction lies in how the trees are constructed: while splits in Random Forest are based on the most discriminative thresholds, Extra Trees introduces more randomness in selecting features and thresholds, often leading to increased diversity among individual trees.
Gradient Boosting is another ensemble technique, but instead of building trees in parallel like Random Forest, it constructs them sequentially. Each tree in the sequence tries to correct the errors made by the previous ones. The algorithm starts with a simple model, often a constant value representing the mean of the target variable. The primary objective then becomes the minimization of residuals or the discrepancies between the predicted and actual values. As the algorithm progresses, each tree is specifically trained on these residuals from the preceding model. This strategy ensures that every new tree in the sequence addresses and rectifies the shortcomings of the previous ones.
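Purely as an illustration of this residual-fitting loop (real gradient boosting libraries add shrinkage schedules, regularization, and more general loss functions), a toy sketch is:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_predict(X_train, y_train, X_test, n_rounds=200, learning_rate=0.1):
    """Fit a sequence of shallow trees to the residuals of the running prediction."""
    pred_train = np.full(len(y_train), y_train.mean())  # start from a constant model (the mean)
    pred_test = np.full(len(X_test), y_train.mean())
    for _ in range(n_rounds):
        residuals = y_train - pred_train  # errors left by the current ensemble
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X_train, residuals)  # each new tree is trained on the residuals
        pred_train += learning_rate * tree.predict(X_train)
        pred_test += learning_rate * tree.predict(X_test)
    return pred_test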
XGBoost stands for eXtreme Gradient Boosting. It is an optimized and more efficient implementation of the gradient boosting algorithm, known for its speed and performance. XGBoost introduces regularization to the boosting process, reducing the risk of overfitting and often leading to better generalization on unseen data.
LightGBM or Light Gradient Boosting Machine is another gradient boosting framework designed for speed and efficiency. It differs from other boosting algorithms in its approach to tree growth, opting for a leaf-wise strategy rather than the traditional depth-wise approach.
kNN (k-Nearest Neighbors) for Regression is a non-parametric algorithm that predicts output based on the average of the k nearest training examples in the feature space. It is especially adept at capturing localized patterns within the data.
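A sketch of how such a suite of regressors could be compared under a common cross-validation protocol is given below; the hyperparameters are illustrative defaults rather than the tuned values from this study, and X_selected and y stand for the selected feature matrix and measured retroreflectivity values.

from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

models = {
    "Linear": LinearRegression(),
    "Lasso": Lasso(alpha=0.01),
    "Ridge": Ridge(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=0.01, l1_ratio=0.5),
    "RandomForest": RandomForestRegressor(n_estimators=300),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=300),
    "GradientBoosting": GradientBoostingRegressor(),
    "XGBoost": XGBRegressor(n_estimators=300),
    "LightGBM": LGBMRegressor(n_estimators=300),
    "kNN": KNeighborsRegressor(n_neighbors=5),
}

# Compare all candidate models under the same 5-fold cross-validation protocol.
for name, model in models.items():
    r2 = cross_val_score(model, X_selected, y, cv=5, scoring="r2")
    print(f"{name:16s} mean R^2 = {r2.mean():.3f}")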
3.7. Performance Metrics
Two fundamental performance metrics in the context of regression modeling are the coefficient of determination, $R^2$, and the mean squared error (MSE). The coefficient of determination, denoted as $R^2$, quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables. It is mathematically defined as:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

Here, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the observed data.
In this study, the coefficient of determination was predominantly employed as the performance metric. The primary rationale behind this choice was to maintain consistency with the conventions established in related literature.
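For reference, both metrics can be computed directly, for example with scikit-learn, where y_true and y_pred stand for the measured and predicted retroreflectivity values:

from sklearn.metrics import r2_score, mean_squared_error

r2 = r2_score(y_true, y_pred)  # coefficient of determination
mse = mean_squared_error(y_true, y_pred)  # mean squared error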
Cross-Validation
One of the foremost challenges in machine learning is overfitting, which occurs when a model, while aiming to achieve the best possible performance on training data, becomes overly complex, capturing not just underlying patterns but also the noise or random fluctuations present. The result is a model that performs exceptionally on the training set but falters when exposed to unseen data. Cross-validation emerges as a safeguard against this. Furthermore, cross-validation provides an estimate of a model’s generalization performance. If a model performs well across different cross-validation folds, it is more likely to perform well on unseen data.
In k-fold cross-validation, the dataset is randomly partitioned into $k$ equal-sized subsamples or folds. Of these $k$ folds, a single fold is retained as the validation set, and the remaining $k-1$ folds are used as the training set. The model is then trained on the $k-1$ training folds and validated on the held-out fold. This process is repeated $k$ times. At the end of this process, $k$ different models are obtained and $k$ performance estimates are calculated. The number of folds $k$ can be selected based on multiple factors, such as the size of the dataset available. In this study, the number of folds $k$ was selected to be 5. The overall performance was then calculated as the mean of these $k$ performance metrics. The process is illustrated in Figure 9. Let us say that $M$ is the performance metric used; the overall $\bar{M}$ is calculated as:

$$\bar{M} = \frac{1}{k}\sum_{j=1}^{k} M_j$$

where $M_j$ is the performance metric calculated on the $j$th fold.
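A minimal sketch of this 5-fold procedure, averaging the per-fold R² scores, could look as follows; the estimator is an illustrative placeholder, and X_selected and y again denote the selected feature matrix and the measured retroreflectivity values.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kf.split(X_selected):
    model = RandomForestRegressor(n_estimators=300)
    model.fit(X_selected[train_idx], y[train_idx])  # train on the k-1 training folds
    pred = model.predict(X_selected[val_idx])  # predict on the held-out fold
    fold_scores.append(r2_score(y[val_idx], pred))
overall_r2 = np.mean(fold_scores)  # overall performance = mean of the k fold scores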