3.2. Applications
In this section, we demonstrate the proposed criterion on several real applications: product advertising, heart blood pressure health, and software reliability analysis. Based on a preliminary study of the collected data, the multiple linear regression model is an appropriate assumption for Applications 1 and 2, which we use to illustrate model selection.
Application 1: Advertising Budget.
In this study, we use the advertising budget data set [15] to illustrate the proposed criterion, where the sales of a particular product is the dependent variable of a multiple regression and the budgets for three media channels (TV, radio, and newspaper) are the independent variables. The data set consists of the sales of a product in 200 different markets (200 rows), together with the advertising budget for the product in each of those markets for each of the three media channels. Sales are in thousands of units and budgets are in thousands of dollars.
Table 6 shows the first few rows of the advertising budget data set.
We now discuss the results of the linear regression model using this advertising data.
Figure 1 and Figure 2 present the data plot and the correlation coefficients between pairs of variables in the advertising budget data, respectively. Sales and TV have the highest pairwise correlation, which suggests that TV advertising has a direct positive effect on Sales. The results also show a statistically significant positive effect of both TV and Radio advertising on Sales. From Table 7, TV is the most significant of the three advertising channels and has the strongest impact on Sales. The R² is 0.8972, so 89.72% of the variability in Sales is explained by the three media channels. From Table 8, the R² values of the model with all three variables and of the model with just two variables (TV and Radio) are the same, which implies that we can select the two-variable model. We can now examine the adjusted R² measure. For the regression model with TV and Radio, the adjusted R² is 0.8962; adding the third variable (Newspaper) reduces the adjusted R² of the full model to 0.8956. Based on the proposed criterion, the model with the two advertising channels (TV and Radio) is the best of the seven candidate models shown in Table 8. This result is consistent with all the other criteria: MSE, AIC, AICc, BIC, RMSE, and adjusted R².
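The two adjusted R² values quoted above can be checked by hand. The short sketch below assumes the standard definition, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), with n = 200 markets and k predictors; the function name `adjusted_r2` is ours, and the definition used elsewhere in the paper may be stated differently.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a linear model with k predictors plus an intercept.

    Assumes the standard textbook definition; n is the number of observations.
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


n = 200      # markets in the advertising data set
r2 = 0.8972  # R^2 reported for both the full model and the TV + Radio model

adj_two = adjusted_r2(r2, n, k=2)    # TV + Radio
adj_three = adjusted_r2(r2, n, k=3)  # TV + Radio + Newspaper

print(f"{adj_two:.4f} {adj_three:.4f}")  # prints: 0.8962 0.8956
```

Because R² is unchanged to four decimals when Newspaper is added, the extra parameter only increases the penalty factor (n - 1)/(n - k - 1), which is why the adjusted R² drops from 0.8962 to 0.8956.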
Application 2: Heart Blood Pressure Health Data.
Blood pressure (BP) is one of the main risk factors for cardiovascular diseases. BP is the force of blood pushing against the artery walls as it circulates through the body [16]. Abnormal BP is a serious problem that can cause strokes, heart attacks, and kidney failure, so it is important to check blood pressure regularly. The author has monitored the blood pressure of an individual daily since January 2019 using a Microlife product. He measured his blood pressure each morning and evening within the same time interval and recorded three measures each time: systolic blood pressure ("systolic"), diastolic blood pressure ("diastolic"), and heart rate ("pulse"), as shown in Table 9. Systolic BP is the pressure when the heart beats, that is, while the heart muscle is contracting (squeezing) and pumping oxygen-rich blood into the blood vessels. Diastolic BP is the pressure on the blood vessels when the heart muscle relaxes. The diastolic pressure is always lower than the systolic pressure [17]. The pulse, or heart rate, is the number of heartbeats per minute (BPM).
The newly collected heart blood pressure health data set consists of measurements of this individual, including the heart rate (pulse), over 86 days with 2 data points per day, for a total of 172 observations. The first few rows of the data set are shown in Table 9. For example, the first row of Table 9 reads as follows: on a Thursday ("day" = 5) morning ("time" = 0), the systolic (high), diastolic (low), and heart rate ("pulse") measurements were 154, 99, and 71, respectively. Similarly, on Thursday afternoon (the second row of Table 9, "time" = 1), the systolic, diastolic, and heart rate measurements were 144, 94, and 75, respectively.
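To make the row layout concrete, the two rows described above can be written as a small record type. This is only an illustrative sketch: the class name `Reading` and the helper `describe` are hypothetical, and the only codings used are those stated in the text ("day" = 5 for Thursday, "time" = 0 for morning and 1 for afternoon).

```python
from dataclasses import dataclass

# Coding of the "time" column as described in the text.
TIME_LABELS = {0: "morning", 1: "afternoon"}


@dataclass(frozen=True)
class Reading:
    day: int        # day-of-week code; the text gives 5 = Thursday
    time: int       # 0 = morning, 1 = afternoon
    systolic: int   # "high" blood pressure
    diastolic: int  # "low" blood pressure
    pulse: int      # heart rate in beats per minute

    def describe(self) -> str:
        return (f"day {self.day}, {TIME_LABELS[self.time]}: "
                f"{self.systolic}/{self.diastolic}, pulse {self.pulse}")


# The first two rows of Table 9 as quoted in the text.
rows = [Reading(5, 0, 154, 99, 71), Reading(5, 1, 144, 94, 75)]
for r in rows:
    print(r.describe())

# 86 days with 2 readings per day gives the stated 172 observations.
n_observations = 86 * 2
```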
From Figure 3, systolic BP and diastolic BP have the highest correlation. In this study, we decided not to include the Time variable (column 2 in Table 9) in the analysis, since it may not necessarily reflect the health measurement. The analysis shows that systolic blood pressure appears to be the most significant factor affecting the heart rate. The R² is 0.09997, so about 10.0% of the variability in heart rate is explained by the three variables (Day, Systolic, Diastolic), as shown in Table 10. Based on the proposed criterion, the model with only the systolic blood pressure variable is the best of the seven candidate models shown in Table 10. Among the other criteria, only BIC agrees with this selection. In other words, the best model under our proposed criterion contains only the systolic BP variable.
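With three predictors there are 2³ - 1 = 7 non-empty subsets, which is where the seven candidate models come from. The sketch below enumerates them and computes R², adjusted R², AIC, and BIC for each. It runs on synthetic data generated in the code, not on the author's measurements, and it uses common Gaussian least-squares forms of AIC and BIC (up to an additive constant) that may differ from the definitions used in this paper. The proposed criterion is not reproduced because its formula is not given in this section.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n = 172
names = ["day", "systolic", "diastolic"]

# Synthetic stand-in data (NOT the author's measurements).
X = np.column_stack([
    rng.integers(1, 8, size=n),   # day-of-week code 1..7
    rng.normal(140, 10, size=n),  # systolic-like values
    rng.normal(90, 8, size=n),    # diastolic-like values
]).astype(float)
y = 60 + 0.08 * X[:, 1] + rng.normal(0, 5, size=n)  # synthetic pulse


def fit_stats(X_sub: np.ndarray, y: np.ndarray) -> dict:
    """Ordinary least squares with intercept, plus common fit criteria."""
    n_obs, k = X_sub.shape
    A = np.column_stack([np.ones(n_obs), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - sse / sst
    p = k + 1  # estimated coefficients, including the intercept
    return {
        "r2": r2,
        "adj_r2": 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - k - 1),
        "aic": n_obs * np.log(sse / n_obs) + 2 * p,
        "bic": n_obs * np.log(sse / n_obs) + p * np.log(n_obs),
    }


# All non-empty subsets of the three predictors: 7 candidate models.
subsets = [c for r in range(1, len(names) + 1)
           for c in combinations(range(len(names)), r)]
results = {tuple(names[i] for i in c): fit_stats(X[:, list(c)], y)
           for c in subsets}

for model, s in results.items():
    print(f"{' + '.join(model):32s} R2={s['r2']:.4f} "
          f"adjR2={s['adj_r2']:.4f} AIC={s['aic']:.1f} BIC={s['bic']:.1f}")
```

R² can never decrease when a predictor is added, so it always favors the full model; the penalized criteria can instead prefer a smaller model, as happens in this application.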
Application 3: Software Reliability Dataset #1.
In this example, we use the numerical results recently reported by Song et al. [12] to illustrate the new criterion by comparing it with existing criteria on two real data sets from software reliability engineering. Table 11 shows the numerical results for 19 different software reliability models under four existing criteria (MSE, AIC, R², and adjusted R²) and the new criterion, called the Pham criterion, using dataset #1 [18]. In dataset #1, the week index ranges from 1 to 21 weeks, and there are 38 cumulative failures at 14 weeks. Detailed information is recorded in Musa et al. [18]. Model 6 in Table 11 provides the best fit based on the MSE, R², adjusted R², and the new criterion. However, Model 1 appears to be the best fit based on the AIC.
Application 4: Software Reliability based on Dataset #2.
Similarly, in this example we use the numerical results recently reported by Song et al. [12] to illustrate the new criterion on real dataset #2 [19]. In dataset #2, the weekly index is expressed in cumulative system days, and failures are recorded over 58,633 system days. Detailed information is recorded in [19]. Table 12 presents the numerical results for 19 different software reliability models under four existing criteria (MSE, AIC, R², and adjusted R²) and the proposed criterion.
Based on dataset #2, Model 7 (see Table 12) provides the best fit based on the AIC and the new criterion, whereas Model 17 is the best fit based on the MSE, R², and adjusted R².
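Applications 3 and 4 both show that different criteria can select different models. The sketch below illustrates how a "best model per criterion" summary can be read off a table of scores: MSE and AIC are minimized, while R² and adjusted R² are maximized. The model labels and numbers are hypothetical placeholders, not values from Table 11 or Table 12, and the proposed criterion is omitted because its formula is not given in this section.

```python
# Direction of improvement for each criterion.
LOWER_IS_BETTER = {"MSE": True, "AIC": True, "R2": False, "adj_R2": False}


def best_by_criterion(table: dict) -> dict:
    """Return, for each criterion, the model label with the best score."""
    best = {}
    for crit, lower in LOWER_IS_BETTER.items():
        pick = min if lower else max
        best[crit] = pick(table, key=lambda m: table[m][crit])
    return best


# Hypothetical scores for three placeholder models (not from the paper).
scores = {
    "A": {"MSE": 4.1, "AIC": 95.0, "R2": 0.93, "adj_R2": 0.92},
    "B": {"MSE": 3.2, "AIC": 101.5, "R2": 0.95, "adj_R2": 0.94},
    "C": {"MSE": 5.0, "AIC": 98.3, "R2": 0.91, "adj_R2": 0.90},
}

winners = best_by_criterion(scores)
print(winners)
# {'MSE': 'B', 'AIC': 'A', 'R2': 'B', 'adj_R2': 'B'}
```

In this toy table the goodness-of-fit measures agree on model B while AIC prefers model A, the same kind of disagreement reported for Models 6 and 1 on dataset #1 and for Models 17 and 7 on dataset #2.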