2.2. Data Collection
The research was conducted in the city of Wroclaw, Poland, where waste collection follows the fixed scheme presented in Figure 1.
The datasets used in this paper combine information from two different sources (Figure 2). Dataset 1 consisted of garbage truck driving data, and Dataset 2 consisted of the factors affecting the garbage truck's stopping time at each WCP.
Dataset 1 contains data on garbage truck driving and mixed waste collection. The data include the start and end times of each process: driving to the WCP, stopping at the WCP, and emptying the container. The dataset covers 14 routes served by different vehicles, with different loaders, and for different WCPs located in different areas of the city. No route is repeated. In total, these measurements provide 661 individual records of the time spent at a WCP.
Dataset 2 was based on field research that collected information on the various factors affecting the duration of the collection process at a WCP. These factors were identified from a literature review and from information provided by garbage truck employees. The following factors were examined: WCP cover type, building type, WCP surface type, and the number of containers. The data were collected via tablets equipped with proprietary measurement applications (the survey was conducted in June–August 2020). The survey delivered up-to-date information about the individual characteristics of 5983 WCPs. The data were collected as part of the basic research of a project supported by the National Science Centre, Poland (grant number 2019/03/X/ST8/00287), which aimed to determine the influence of the studied factors on the WCP service time.
Creating Dataset 3 required linking each record from Dataset 1 with the corresponding WCP from Dataset 2. Because the two databases share no key that would link them directly, we relied on GPS coordinates. We were able to link 258 of the 661 records with their corresponding factors. This allowed us to consider seven factors (five categorical and two numerical) influencing the time a garbage truck spends at a WCP. In summary, Dataset 3 contains eight variables: seven factors and the variable they influence, the time spent at WCP. The considered variables are summarized in Table 2.
Time spent at WCP: This indicator represents the time required to perform all the necessary actions within the WCP. It is counted from the moment the vehicle stops until it leaves the WCP. Data were collected in Dataset 1.
WCP type: WCPs are divided mainly according to the type of small-architecture object within which the containers are placed. Three types of WCP were distinguished: freestanding containers, covered and open, and covered and closed. We tested whether the need to avoid obstacles and open the cover has a significant impact on the time a garbage truck spends at a WCP. Data were collected as part of Dataset 2.
Building type: The type of building often determines a different pickup technique for loaders. The collection process differs between single-family housing, multi-family housing, and other buildings (e.g., stores and mixed building types). Data were collected within Dataset 2.
WCP surface: This factor describes the type of ground over which the containers are hauled. There are two types of surface: paved and unpaved. This factor is likely to be much more critical when analyzed together with weather-related factors. However, including weather-related factors would complicate the final model and make it impractical. Data were collected as part of Dataset 2.
Number of loaders: This is one of the factors considered in the literature [9, 19]. Data were collected as part of Dataset 1.
Planned cleaning of WCPs: Cleaning of containers may result from a random event (this type was not included in the model), but most of the work to keep WCPs clean is planned. There are cleaning schedules for all WCPs as well as schedules for specific WCPs based on residents' demands. When cleaning is planned, the WCP service takes much longer, and this factor should be taken into account during route planning. Data were collected within Dataset 1.
Number of containers: Where a container was empty or no containers were reported for collection at a given WCP (mainly single-family housing), we assumed 0. The quantities 8, 9, 11, and 12 were insufficiently represented, so they were not included in the model. Data were collected as part of Dataset 1.
Truck distance from WCP: This variable captures the fixed distance between the vehicle's stopping place and the actual WCP. Gated communities are a typical example: vehicles often do not enter them but stop in front of the gate, and the containers are hauled from the cover to the vehicle. Data were collected as part of Dataset 1.
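The text states only that the records of Dataset 1 were linked to the WCPs of Dataset 2 through GPS coordinates; the matching rule itself is not described. The following minimal Python sketch shows one way such a link could be made: nearest-neighbour matching on great-circle (haversine) distance with a cut-off. The record keys (`id`, `lat`, `lon`), the function names, the sample coordinates, and the 30 m threshold are all assumptions made for illustration, not values from the study.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def link_stops_to_wcps(stops, wcps, max_dist_m=30.0):
    """Attach the nearest WCP to each stop; skip stops with no WCP within max_dist_m.

    stops, wcps: lists of dicts with hypothetical keys "id", "lat", "lon".
    max_dist_m: assumed matching threshold (not taken from the paper).
    """
    linked = []
    for stop in stops:
        best_wcp, best_dist = None, float("inf")
        for wcp in wcps:
            dist = haversine_m(stop["lat"], stop["lon"], wcp["lat"], wcp["lon"])
            if dist < best_dist:
                best_wcp, best_dist = wcp, dist
        if best_wcp is not None and best_dist <= max_dist_m:
            linked.append({"stop_id": stop["id"], "wcp_id": best_wcp["id"], "dist_m": best_dist})
    return linked


# Made-up coordinates near Wroclaw, for illustration only.
stops = [
    {"id": "s1", "lat": 51.10790, "lon": 17.03850},
    {"id": "s2", "lat": 51.20000, "lon": 17.20000},
]
wcps = [
    {"id": "A", "lat": 51.10795, "lon": 17.03855},
    {"id": "B", "lat": 51.12000, "lon": 17.05000},
]
linked = link_stops_to_wcps(stops, wcps)
print(linked)
```

In this toy input only stop `s1` lies within the threshold of a WCP, so `s2` is dropped. A cut-off of this kind is one plausible reason why only part of the records (258 of 661 in the study) could be linked.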
2.3. Multiple Regression Model
A regression model based on only one factor proved insufficient for estimating the time a garbage truck spends at a WCP (Table 3). Among the simple regression models built on each factor separately, the highest $R^2$, 0.584, was obtained with the number of containers as the only predictor. Therefore, multiple regression was used to predict the time spent at WCP by a garbage truck.
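In linear regression with $p$ independent variables (predictors) $X_1, X_2, \ldots, X_p$ and a dependent variable (predicted value) $Y$, Equation (1) can be obtained [21]:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \tag{1}$$

where $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_p$ are the regression coefficients, and $\varepsilon$ is the error term.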
In our case, we consider the variables listed in Table 2: the time spent at WCP by a garbage truck as the dependent variable and the factors from Dataset 2 as independent variables.
The main stages and procedures of multiple regression analysis, developed according to Reference [22], are presented in Figure 3.
Based on this scheme, the following steps were performed:
Owing to the difficulties outlined in the description of how Dataset 3 was developed, the sample size is 258. This is considered sufficient under the sample-size rule based on the number of predictors, $p$, proposed in Reference [22]: $N > 50 + 8p$. Additionally, the data come from different regions of the city and from different routes.
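As a quick check of the rule $N > 50 + 8p$, the short Python sketch below evaluates the bound for the seven initially chosen factors and for the ten predictors obtained after dummy coding. The function name `min_sample_size` is introduced here for illustration only.

```python
def min_sample_size(p):
    """Bound from the rule N > 50 + 8 * p; N must exceed the returned value."""
    return 50 + 8 * p


n_records = 258  # records in Dataset 3, as reported in the text

for p in (7, 10):  # 7 initial factors; 10 predictors after dummy coding
    bound = min_sample_size(p)
    print(f"p = {p}: N must exceed {bound}; N = {n_records} is sufficient: {n_records > bound}")
```

The bound is 106 for $p = 7$ and 130 for $p = 10$, so 258 records satisfy the rule in both cases. The 200 measurements used for model building also exceed both bounds.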
For predicting time spent at WCP by garbage truck (dependent variable), seven factors connected with WCP (independent variables) were initially chosen: WCP cover type, building type, WCP surface, number of loaders, planned cleaning, number of containers, and truck distance from WCP.
As the first step of data preparation, the data were divided into two subsets:
- Subset 1: 200 measurements of the collected data for internal validation and model building;
- Subset 2: 58 measurements of the collected data for external validation, taken from two independent routes and from two city regions not included in Dataset 1.
Of the seven chosen independent variables, five are categorical. To use them in the model, dummy coding [23, 24] was necessary. WCP surface and planned cleaning are categorical variables with only two categories. A variable with two categories yields one dummy variable with the value 0 (absence of the chosen category) or 1 (presence of the chosen category). WCP type, building type, and truck distance from WCP each have three categories. A categorical variable with three categories yields two dummy variables with the value 0 or 1; one category must be omitted to eliminate collinearity. Dummy coding therefore increased the number of independent variables: the five categorical variables were transformed into eight independent variables (Table 4). Consequently, the initial seven independent variables were expanded to ten.
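The dummy coding described above can be illustrated with a minimal Python sketch. The helper `dummy_code`, the sample observations, and the choice of "other" as the omitted reference category are assumptions made for the example. The category counts are taken from the text.

```python
def dummy_code(values, categories, reference):
    """Return one 0/1 column per category, omitting the reference category.

    values: category label of each observation.
    categories: all possible labels; reference: the omitted category.
    """
    kept = [c for c in categories if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in kept}


# Hypothetical observations of the three-category variable "building type".
building_type = ["single-family", "multi-family", "other", "multi-family"]
columns = dummy_code(
    building_type,
    categories=["single-family", "multi-family", "other"],
    reference="other",  # assumed reference category
)
print(columns)

# A variable with k categories yields k - 1 dummy variables.
n_categories = {
    "WCP surface": 2,
    "planned cleaning": 2,
    "WCP type": 3,
    "building type": 3,
    "truck distance from WCP": 3,
}
n_dummies = sum(k - 1 for k in n_categories.values())
n_numerical = 2  # number of loaders, number of containers
print(n_dummies, n_dummies + n_numerical)
```

An observation from the reference category is coded as 0 in every dummy column. The count reproduces the figures in the text: eight dummy variables, and ten independent variables once the two numerical ones are added.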
In the data preparation stage, there is also a need to check basic assumptions of multiple regression, which, among others, are normality, linearity, and multicollinearity.
According to the assumption of normality, residuals (the differences between observed and predicted values) should be normally distributed [25]. To check this assumption, a $\chi^2$ test was performed; it gave no grounds for rejecting the hypothesis that the residuals are normally distributed. The assumption of linearity, i.e., a linear relationship between the independent and dependent variables [26], is also fulfilled (non-linearity test: Lagrange multiplier = 0.735, p-value = 0.391 > α = 0.05). Multicollinearity occurs when one of the independent variables is in a linear relationship (is strongly correlated) with one of the others [21]. It can be detected, for example, by examining the correlation matrix (Table 5) or by using the variance inflation factor (VIF), whose minimum value is 1 and for which a value above 10 indicates multicollinearity [27].
The correlation coefficients listed in Table 5 (from −0.43 to 0.23) do not indicate any strong correlation between the independent variables. This is confirmed by the VIF values, which are only slightly above 1 for every independent variable (from 1.125 to 1.651). Both the correlation matrix and the VIF therefore indicate that there is no multicollinearity among the independent variables.
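For reference, the two diagnostics can be written in a few lines of Python. The sketch below uses the standard definitions: the Pearson correlation coefficient, and $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ is obtained by regressing predictor $j$ on the remaining predictors. The function names and the toy data are assumptions for illustration. With only two predictors, $R_j^2$ equals the squared correlation between them, which is the case shown here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_from_r2(r2_j):
    """VIF of predictor j, given R^2 of regressing it on the other predictors."""
    return 1.0 / (1.0 - r2_j)


# Toy data for two hypothetical predictors.
x1 = [1, 2, 3, 4]
x2 = [2, 1, 4, 3]
r = pearson(x1, x2)
vif = vif_from_r2(r ** 2)  # two-predictor case: R_j^2 = r^2
print(round(r, 3), round(vif, 4), vif > 10)
```

An uncorrelated predictor has $R_j^2 = 0$ and hence the minimum VIF of 1. The threshold of 10 mentioned in the text corresponds to $R_j^2 > 0.9$.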
Of the three main types of multiple regression (standard, sequential, and stepwise) described in Reference [28], stepwise multiple regression was used; it is suitable only for prediction purposes [28]. Stepwise regression offers three techniques for choosing independent variables: forward selection (adding variables one by one based on a statistical criterion), backward elimination (removing variables one by one based on a statistical criterion), and the stepwise procedure (a combination of forward and backward). We chose backward elimination, in which model building starts with all the independent variables, which are then eliminated one by one based on a chosen criterion (for example, the p-value or Mallows's Cp [29]). The elimination process (based on an F-test p-value greater than 0.05) is presented in Table 6.
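The control flow of backward elimination can be sketched in Python as follows. The sketch does not fit a regression itself: `fit_p_values` is a hypothetical callback that refits the model on the given predictors and returns their p-values. Removing the predictor with the largest p-value at each step is a common convention assumed here; the text states only that variables with a p-value above 0.05 were eliminated one by one.

```python
def backward_elimination(predictors, fit_p_values, alpha=0.05):
    """Remove the least significant predictor until all p-values are <= alpha.

    predictors: names of all candidate predictors (the starting model).
    fit_p_values: hypothetical callback; refits the model on a list of
        predictor names and returns {name: p_value}.
    Returns (retained predictors, removed predictors in order of removal).
    """
    current = list(predictors)
    removed = []
    while current:
        p_values = fit_p_values(current)  # refit after every removal
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break  # every remaining predictor is significant
        current.remove(worst)
        removed.append(worst)
    return current, removed


# Stub standing in for a real model fit; the p-values are made up.
FAKE_P = {"a": 0.01, "b": 0.30, "c": 0.08, "d": 0.04}


def fake_fit(names):
    return {name: FAKE_P[name] for name in names}


kept, dropped = backward_elimination(["a", "b", "c", "d"], fake_fit)
print(kept, dropped)
```

In a real analysis the p-values change after each refit, which is why the callback is invoked inside the loop and not once up front. Each pass of the loop corresponds to one successive model in the elimination sequence.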
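After eliminating all independent variables with a p-value greater than 0.05, the final model (Model 5) was obtained. Of the ten independent variables entered at the beginning, four were removed as insignificant (p-value greater than 0.05). For every model, $R^2$ and adjusted $R^2$ were calculated according to Equations (2) and (3):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{2}$$

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} \tag{3}$$

where $x$ denotes the actual values, $\bar{x}$ their mean, $y$ the predicted values, $p$ the number of predictors, and $n$ the number of observations.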
$R^2$ shows the percentage of the variance in the dependent variable predicted by the independent variables [28]. The more independent variables, the greater $R^2$. Therefore, adjusted $R^2$, which accounts for the number of predictors, should be used.
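A minimal Python sketch of the two measures, using their standard definitions, is given below. The function names and the toy data are assumptions for illustration; `actual` and `predicted` play the roles of $x$ and $y$.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((x - y) ** 2 for x, y in zip(actual, predicted))
    ss_tot = sum((x - mean_actual) ** 2 for x in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Toy data, not taken from the study.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
r2 = r_squared(actual, predicted)
r2_adj = adjusted_r_squared(r2, n=len(actual), p=1)
print(round(r2, 3), round(r2_adj, 3))
```

For a fixed $R^2$ and $n$, the adjusted value falls as $p$ grows, which is why it is the fairer basis for comparing models with different numbers of predictors.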
The highest adjusted $R^2$, representing predictive power, was 0.806, obtained for the final model (Model 5). This model also has the lowest standard error of the estimate.
Table 7 shows the analysis results of the obtained coefficients of the final regression model.
Based on the presented coefficients, the time spent at WCP increases as the number of containers increases. The time spent at WCP decreases with a single-family or multi-family building type, a truck distance from WCP of 0–15 m, no planned cleaning, and an increasing number of loaders. Moreover, the single-family building type reduces the time spent at WCP more than twice as much as the multi-family building type.
In accordance with the coefficients presented in Table 7, the regression model equation can be formulated as Formula (4):
Internal and external validation of the developed model is described in Section 3, which presents the metrics for both validation results. Besides $R^2$, the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) were calculated using Formulas (5) and (6):

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2} \tag{5}$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - y_i \right| \tag{6}$$

where $x$ denotes the actual values, $y$ the predicted values, $p$ the number of predictors, and $n$ the number of observations.
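Both error metrics can be computed directly from paired actual and predicted values. The Python sketch below uses the standard definitions of RMSE and MAE. The function names and the toy data are assumptions for illustration.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error of paired actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean Absolute Error of paired actual and predicted values."""
    n = len(actual)
    return sum(abs(x - y) for x, y in zip(actual, predicted)) / n


# Toy data, not taken from the study.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
print(round(rmse(actual, predicted), 4), mae(actual, predicted))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and reacts more strongly to a few large deviations.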