1. Introduction
Patient-Centered Communication (PCC) has been one of the most widely debated subjects in healthcare over the past few decades and has important implications for promoting a harmonious doctor-patient relationship and improving health care. Quality PCC was originally defined by the Institute of Medicine as a model that aims to obtain the diagnostic and treatment information relevant to medical care in addition to the wishes, needs, and preferences of the patient. The aim is to make clinical decisions consistent with the patient’s values and to enhance understanding and consensus between doctors and patients. PCC is a quality not only of an individual practitioner but also of the entire health system [1].
According to research, PCC has a clear positive impact on healthcare, reducing disease symptoms and improving clinical outcomes in cancer treatment [2,3,4]. PCC is also essential for patient care, medical education, clinician licensure, and quality assessment [5]. Evidence suggests that patient-centered care improves disease outcomes and quality of life, alleviates medical conflict, and is critical for addressing racial, ethnic, and socioeconomic disparities in health care and health outcomes [6,7]. Therefore, the identification of important PCC predictors is crucial.
Existing research has linked PCC to multi-dimensional variables, such as individual sociodemographic characteristics, health status, and attention to health problems. For example, patients who prefer PCC tend to be younger and more educated [8]. Racial or ethnic minorities are less likely to participate in PCC with providers due to a lack of emotional communication, which may influence whether providers use a patient-centered approach, putting patients at risk for persistent health conditions [9]. In addition, studies have shown that those with strong self-efficacy in caring for their health, as well as those with good overall health, reported better PCC from providers [10,11]. However, the existing literature has investigated only a few variables, and the selection of variables is prone to some degree of subjectivity, so it is difficult to investigate the variables of interest comprehensively and objectively.
Machine learning methods are widely used in the medical and health field for drug discovery, disease prediction, and diagnosis [12]; however, few studies have applied machine learning to large-scale variable sets in PCC research. Existing studies on PCC typically use traditional statistical methods, such as interaction analysis and linear regression, to model a limited set of easily measurable variables to predict PCC [13,14,15]. These simple, cost-effective types of models are often preferred in many settings, including population-wide screening or diagnosis in resource-limited settings [16]. However, as big data has evolved in the medical field, the cost of data collection has decreased, and the scale of data has increased [17]. Although machine learning methods are more complex than traditional statistical methods, their performance on large-scale data has certain advantages [18]. Furthermore, traditional statistical analysis methods and machine learning methods should be complementary [19]. In the literature relevant to the research question in this paper, machine learning methods are increasingly used to identify important predictors of research variables [20,21,22,23]. After variable screening or feature extraction, high-precision classification or prediction tasks with small errors can be achieved, which can improve the timely diagnosis of the prodromal stage of related diseases and important problems and provide a reference for early intervention and prevention [24]. Therefore, this study combined machine learning methods with large-scale data to identify important PCC predictors.
The present analysis aimed to gain a clearer picture of how sociodemographic, healthcare access, and health status variables affect the quality of patient-healthcare provider communication, and to identify significant predictors based on a national sample. To this end, based on the extensive variable set of the Health Information National Trends Survey (HINTS), four machine learning methods were employed to select the important factors for predicting PCC from a set of characteristic variables according to variable importance measures. This study combined the strengths of feature selection, machine learning, and extensive datasets to support more comprehensive predictor identification and prediction of PCC.
2. Materials and Methods
2.1. Data Source
Data from the National Cancer Institute’s 2019–2020 Health Information National Trends Survey (HINTS) were collected. HINTS regularly collects nationally representative data on the American public’s knowledge, attitudes, and use of cancer and health-related information. This study analyzed pooled data from cycles 3 and 4 of HINTS 5. This survey provides an opportunity to examine the perceived PCC levels on a population-level basis. The dataset contains metric variables for PCC and extensive demographic survey data to help identify sociodemographic, lifestyle, and health-related factors associated with respondents’ perceived PCC. The administration of HINTS is approved by the Westat Inc. Institutional Review Board and exempted by the Office of Human Research at the National Institutes of Health. HINTS also offered additional useful information about the survey design and allowed estimates for individual countries.
2.2. Statistical Analysis
The combined data of HINTS 5 Cycles 3 and 4 were used to perform descriptive statistical analysis on relevant variables. The original variable set of HINTS, which includes a very large number of variables, was screened using the t-test for binary and continuous independent variables and the F-test for multi-category variables. The significance threshold was set to 0.1, and the variables with a p-value less than 0.1 were included in the candidate set. In this way, eighteen candidate variables were obtained for modeling. This paper deals with some of the filtered variables (see Supplementary Material Table S1 for details). Four machine learning methods were employed to identify significant predictors of PCC: Generalized Linear Models (GLM), Random Forests, Deep Neural Networks (Deep Learning), and Gradient Boosting Machines (GBM). Variable importance measures were used to identify important predictors, and the metric corresponding to each algorithm is described in the introduction of that method. Furthermore, the model performance and prediction performance of each algorithm were evaluated using the Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Root Mean Squared Logarithmic Error (RMSLE) values under five-fold cross-validation. See Figure 1 for the simple computational framework of this study. Feature selection consisted of two steps: preliminary screening and identification of important predictors by machine learning. All statistical analyses were performed in R version 4.1.2. R is a widely used programming language created by the statisticians Ross Ihaka and Robert Gentleman; its official software environment is free, open-source software that is part of the GNU project and is distributed under the GNU General Public License.
2.3. Measures
2.3.1. Patient-Centered Communication
The focus variable in this paper is PCC, which is described by seven items in the HINTS. Participants were asked: “The following questions are about your communication with all doctors, nurses, or other health professionals you saw during the past 12 months. How often did they do each of the following: (a) Give you the chance to ask all the health-related questions you had; (b) Give the attention you needed to your feelings and emotions; (c) Involve you in decisions about your health care as much as you wanted; (d) Make sure you understood the things you needed to do to take care of your health; (e) Explain things in a way you could understand; (f) Spend enough time with you; (g) Help you deal with feelings of uncertainty about your health or health care.” Response options included: (1) Always, (2) Usually, (3) Sometimes, and (4) Never. These questions were addressed only to participants who had seen a doctor, nurse, or other health professional in the past 12 months. HINTS created a composite PCC scale based on these seven items, with values ranging from 0 to 100 and higher scores indicating more positive communication with healthcare providers.
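To make the 0 to 100 scale concrete, the following is a minimal sketch of one plausible scoring rule. It assumes that the seven item codes are averaged and linearly rescaled so that "Always" maps to 100 and "Never" to 0; this rescaling, and the names `RESPONSE_CODES` and `pcc_score`, are illustrative assumptions and not the documented HINTS procedure.

```python
# Hypothetical scoring helper. The linear rescaling below is an assumption,
# not the documented HINTS procedure for the composite PCC scale.
RESPONSE_CODES = {"Always": 1, "Usually": 2, "Sometimes": 3, "Never": 4}


def pcc_score(responses):
    """Map seven item responses to a 0-100 score (higher = better communication)."""
    if len(responses) != 7:
        raise ValueError("expected responses to all seven PCC items")
    codes = [RESPONSE_CODES[r] for r in responses]
    mean_code = sum(codes) / len(codes)
    # Reverse the coding so that "Always" (1) maps to 100 and "Never" (4) to 0.
    return (4 - mean_code) / 3 * 100


print(pcc_score(["Always"] * 7))
print(pcc_score(["Always", "Usually", "Sometimes", "Never", "Always", "Usually", "Sometimes"]))
```

Under this assumed rule, a respondent answering "Always" to every item receives the maximum score, and mixed answers fall strictly between the two extremes.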
2.3.2. Demographic Variables and Other Related Variables
Based on the combined data of HINTS 5 Cycles 3 and 4, a total of 143 initial variables were obtained. Some of the obtained variables had the same meaning. In general, this initial set of variables involved sociodemographic characteristics, such as Age, Gender, Education, and Race; variables related to personal health status, such as GeneralHealth, EverHadCancer, Deaf, and OwnAbilityTakeCareHealth; and personal living habits, such as UseInternet, DrinkDaysPerWeek, and WeeklyMinutesModerateExercise. The variables were sorted according to the questionnaire section. Because of the large number of variables, they are not listed individually here. Except for PCC, only variables from questions answered by all participants were retained; questions addressed to a subgroup, for example the women-only item “How long ago did you have your most recent Pap test to check for cervical cancer”, were not considered in this paper.
2.4. Methods
The four machine learning algorithms GLM, Random Forests, Deep Learning, and GBM were applied to analyze the factors affecting PCC. All four methods provide a variable importance measure for the variables entered into the model, which can be used to evaluate the importance of each variable. For the specific variable importance indicators, see the introduction of each method. The implementation from the R package “h2o” was used for all models. Using grid search and five-fold cross-validation, the optimal hyperparameters of each machine learning model were selected as those giving the smallest RMSE.
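The selection procedure described above (grid search scored by five-fold cross-validated RMSE) can be sketched as follows. This is an illustrative pure-Python sketch, not the "h2o" implementation used in the study: the one-variable ridge model, the contiguous unshuffled folds, the toy data, and all function names are assumptions made for the example.

```python
import math


def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_slope(xs, ys, lam):
    """Toy stand-in model: one-variable ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def cv_rmse(xs, ys, lam, k=5):
    """Mean held-out RMSE over k folds for one hyperparameter value."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        slope = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(rmse([ys[i] for i in fold], [slope * xs[i] for i in fold]))
    return sum(scores) / k


def grid_search(xs, ys, grid, k=5):
    """Return the grid value with the smallest cross-validated RMSE."""
    return min(grid, key=lambda lam: cv_rmse(xs, ys, lam, k))


xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]  # noise-free toy data, not study data
best_lam = grid_search(xs, ys, grid=[0.0, 1.0, 10.0, 100.0])
print(best_lam)
```

Because the toy data are noise-free, any positive penalty only biases the slope, so the search is expected to prefer the unpenalized setting; with real survey data the folds would normally be shuffled before splitting.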
2.4.1. GLM
The GLM model constructed in this paper adopts regularization to address possible overfitting. Regularization can reduce the variance of prediction errors and deal with correlated predictors by introducing $\ell_1$ and $\ell_2$ penalty terms during model building. The combination of the $\ell_1$ and $\ell_2$ penalties in this algorithm is parameterized by $\alpha$ and $\lambda$. Here, $\alpha$, which controls the distribution of the elastic net penalty between the $\ell_1$ and $\ell_2$ norms, is selected by a grid search in the (0, 1) interval, and the penalty strength is controlled by the parameter $\lambda$. Therefore, the best regularized model was constructed by performing an automatic search for every value of $\alpha$ set in the (0, 1) interval by the grid search. The final model had a regularization parameter of 0.65. Variable importance is measured according to the “absolute value of normalization coefficient”.
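For reference, the elastic net penalty is commonly written in the following form. This is a sketch of the usual parameterization rather than a formula taken from this paper: $\beta$ denotes the vector of model coefficients (a symbol introduced here for illustration), $\alpha \in [0, 1]$ is the mixing parameter between the two norms, and $\lambda \ge 0$ is the penalty strength.

```latex
\[
P(\beta) \;=\; \lambda \left( \alpha \,\lVert \beta \rVert_{1}
  \;+\; \frac{1-\alpha}{2}\, \lVert \beta \rVert_{2}^{2} \right)
\]
```

Under this form, $\alpha = 1$ gives a pure lasso ($\ell_1$) penalty and $\alpha = 0$ a pure ridge ($\ell_2$) penalty, while intermediate values mix the two.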
2.4.2. Random Forests
Random forest is an ensemble learning method that integrates many decision trees into a forest and obtains the final prediction by aggregating all trees (majority voting for classification, averaging for regression) to ensure the stability of the model. Breiman (2000) combined classification trees into random forests, which improved the prediction accuracy without significantly increasing the amount of computation [25]. Random forest is insensitive to multi-collinearity, robust to missing and unbalanced data, and is a powerful classification and regression tool. In this paper, the three hyperparameters “ntrees”, “max_depth”, and “mtries” of the random forest in the “h2o” package were tuned. Hyperparameter “ntrees” represents the number of trees, set in the range of 10 to 50; “max_depth” represents the maximum tree depth, set in the range of 2 to 12; and “mtries” represents the number of variables selected at each node of a tree, set in the range of 5 to 30. Variable importance was measured by the “mean decrease gini” indicator.
2.4.3. Deep Learning
The back-propagation algorithm used in the deep learning model is based on gradient descent and is a popular supervised learning algorithm for training feed-forward neural networks. The neural network consists of an input layer, one or more hidden layers, and an output layer. Each neuron in the first layer of the network receives the input vector and computes its activation level as a weighted sum, and an activation function is then applied to the activation level to obtain the neuron's result. These results are fed to the next layer of neurons. This procedure continues until the last layer (i.e., the output layer) calculates its result, which is the output vector of the neural network.
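The forward pass described above can be sketched in a few lines. This is a minimal illustration rather than the "h2o" implementation: the bias terms, the tiny network, its weights, and the function names are assumptions made for the example, and dropout and training by back-propagation are omitted.

```python
def rectifier(z):
    """Rectified linear activation: max(0, z)."""
    return max(0.0, z)


def identity(z):
    """Linear output activation, as is typical for regression."""
    return z


def dense_layer(inputs, weights, biases, activation):
    """One layer: weighted sum per neuron, then the activation function."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        level = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(level))
    return outputs


def forward(inputs, layers):
    """Feed the input vector through each (weights, biases, activation) layer."""
    vector = inputs
    for weights, biases, activation in layers:
        vector = dense_layer(vector, weights, biases, activation)
    return vector


# Hypothetical network: 2 inputs, one hidden layer of 2 neurons, 1 linear output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], rectifier),
    ([[2.0, 1.0]], [0.5], identity),
]
print(forward([3.0, 1.0], layers))
```

Each tuple in `layers` plays the role of one layer of neurons, so a deeper network is obtained simply by appending further tuples to the list.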
The number of hidden layers and the number of nodes per hidden layer are hyperparameters in the “h2o” package. The number of hidden layers in this paper was set to two or three, the number of nodes in the first two layers was set to 100~200 and 50~100, respectively, and five nodes were used if there was a third layer. The rectifier with dropout (dropout ratio is 0.5 by default) was selected as the most suitable activation function for the deep learning model. The variable importance was calculated from the first two layers of the network using the weight-based Gedeon method, and the top ten variables by importance were selected. In this paper, the final deep learning model consisted of two hidden layers, the first with 200 nodes and the second with 100 nodes.
2.4.4. GBM
GBM is a type of boosting algorithm, a machine learning technique for regression and classification problems. This method generates a predictive model in the form of an ensemble of weak predictive models, usually decision trees. The basic idea is that multiple weak learners are generated serially, and each weak learner is fitted to the negative gradient of the loss function of the model accumulated so far, so that adding the weak learner reduces the loss of the accumulated model along the direction of the negative gradient. The base learners are combined linearly with different weights, so that the better learners contribute more. The GBM implementation in the “h2o” package builds regression trees sequentially on all features of the dataset in a fully distributed manner, with each tree built in parallel. In this paper, the three hyperparameters “ntrees”, “max_depth”, and “rate” were tuned. Hyperparameter “ntrees” indicates the number of trees and was set from 10 to 50, “max_depth” indicates the maximum tree depth and was set from 2 to 12, and “rate” represents the learning rate, which was set between 0.01 and 0.10. The final model used 50 trees, a maximum depth of 5, and a learning rate of 0.10. Variable importance was assessed according to whether the variable was selected to split on during tree building and how much the squared error changed as a result.
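The serial fitting of weak learners to the negative gradient can be illustrated with a small sketch for squared-error loss, where the negative gradient is simply the current residual. This is not the "h2o" GBM: it uses a single feature, depth-one trees (stumps) instead of the depth-five trees reported above, toy data, and hypothetical function names; only the defaults `ntrees=50` and `rate=0.1` echo the final settings of the paper.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (least squares).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, ntrees=50, rate=0.1):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = sum(ys) / len(ys)  # initial model: the mean of the target
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(ntrees):
        # For squared-error loss the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]  # toy data, not study data
few = fit_gbm(xs, ys, ntrees=5)
many = fit_gbm(xs, ys, ntrees=50)
print(mse(few, xs, ys) > mse(many, xs, ys))
```

The comparison at the end only checks the training error, which shrinks as more stumps are added; in practice the number of trees and the learning rate would be chosen by cross-validation, as described in the paper.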
2.4.5. Evaluation Indicators
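The model evaluation indicators used in this paper were MAE, RMSE, RMSLE, and MAPE, which are commonly used for prediction problems. Equations (1)–(4) are the calculation formulas for each indicator, where $n$ represents the sample size, and $y_i$ and $\hat{y}_i$ represent the actual value and predicted value, respectively.

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right| \tag{1}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2} \tag{2}$$

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(\hat{y}_i+1)-\log(y_i+1)\right)^2} \tag{3}$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right| \tag{4}$$

MAE and RMSE measure the mean absolute error and the root of the mean squared error between the predicted and true values, respectively, and are appropriate for cases where the error is relatively obvious. RMSLE is a variant of RMSE computed on log-transformed values, so it reflects the ratio of predicted to actual values; it is primarily used when outliers in a dataset are particularly large. MAPE is the average of the absolute percentage error of each entry in a dataset. The smaller the value of each index, the better the fitting effect.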
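The four indicators can be computed directly from paired actual and predicted values. The sketch below follows the standard definitions and makes two assumptions that are not stated in the paper: RMSLE uses log(1 + value), and MAPE is returned as a fraction rather than multiplied by 100. The input values are toy numbers, not study data.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rmsle(actual, predicted):
    """Root mean squared logarithmic error; log1p(v) = log(1 + v), so v must exceed -1."""
    total = sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; undefined if an actual value is 0."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


actual = [100.0, 50.0, 80.0]     # toy values, not study data
predicted = [90.0, 60.0, 80.0]
for name, metric in [("MAE", mae), ("RMSE", rmse), ("RMSLE", rmsle), ("MAPE", mape)]:
    print(name, round(metric(actual, predicted), 4))
```

Because the PCC scale starts at 0, a respondent with an actual score of exactly 0 would make MAPE undefined, so such cases would need separate handling in practice.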
3. Results
After removing missing data, the combined dataset from HINTS Cycles 3 and 4 yielded a sample of 4593 respondents. The mean value of the dependent variable PCC was 80.59, and the standard deviation was 20.9988. Overall, the level of communication quality between patients and medical staff was good.
In the combined dataset, 143 variables were present in addition to PCC. Because the dependent variable PCC is continuous, in the analysis of variable significance the t-test was performed on the continuous and binary variables among the 143 variables, and the F-test was performed on the multi-categorical variables. Eighteen variables were selected for regression analysis based on significance (p < 0.1), including twelve categorical variables and six continuous variables.
Table 1 provides the descriptive statistics of the eighteen variables remaining after feature selection, covering the sociodemographic and health-related characteristics of all participants, as well as the results of the significance tests for all independent variables. Overall, 94.27% of individuals are confident that they can access advice or information about cancer when needed, and 72.81% trust information about cancer provided by government health agencies, but more than half do not trust information from charitable organizations. The largest share of people in the sample (45.55%) consulted a doctor or healthcare provider first when they needed cancer information. Furthermore, among people who own electronic devices (97.24%), 80.84% of them do not suffer from diabetes or hyperglycemia, and there is little difference in whether individuals have psychological distress (50.99% vs. 49.01%). In addition, 68.28% believed that everything could cause cancer, and 74.51% believed that the quality of medical services they received in the past 12 months was low. In terms of numerical variables, the average weight of the individuals was 181.2077 pounds, they sat for about 6.9375 h per day, and the average age was 54.7037. On average, individuals did at least 173.5256 min of moderate-intensity exercise per week, drank alcohol 3.5785 times per week, and had a mean BMI of 28.5081, which is outside the normal range and is considered overweight.
Following missing value removal and the recoding of some categorical variables into two categories, the eighteen independent variables were re-tested for significance. Among them, the six continuous variables (Weight, AverageTimeSitting, Age, BMI, WeeklyMinutesModerateExercise, and AvgDrinksPerWeek) remained highly significant (p < 0.0001), while some categorical variables changed their significance levels after binary recoding. Five variables (CancerConfidentGetHealthInf, CancerTrustGov, StrongNeedCancerInfo, HaveDevice_Cat, and EverythingCauseCancer) had p-values greater than 0.1 when tested against the dependent variable. To retain as many variables as possible within the scope of the study, all eighteen variables were nevertheless introduced into the models as candidate variables to further identify important variables.
Table 2 shows the top ten important predictors of PCC in the regression analysis performed by each of the four algorithms. A total of fifteen important predictors were screened out by the four algorithms, including individual sociodemographic characteristics, health-related factors, and living habits. Among them, the variables QualityCare and Weight were identified as important predictors by all four algorithms, and QualityCare had the highest variable importance value in each algorithm. The variables CancerTrustCharities, EverythingCauseCancer, HealthIns_Other, AverageTimeSitting, WeeklyMinutesModerateExercise, StrongNeedCancerInfo, and AvgDrinksPerWeek also showed high importance and were identified as important predictors by three of the four algorithms.
Table 3 evaluates the performance of the four machine learning methods using commonly used regression model performance evaluation metrics (MAE, RMSE, RMSLE, and MAPE). The values of each index are the five-fold cross-validation results obtained by constructing a model based on the top ten important predictors identified by each algorithm. Overall, the prediction effect of the Random Forest model was the best, with the metric values of MAE, RMSE, RMSLE, and MAPE being 14.8905, 18.4192, 0.3701, and 0.2537, respectively.
4. Discussion
This paper comprehensively examined the impact on PCC of individual sociodemographic characteristics, living habits, health status, and variables that reflect the attention individuals pay to health-related content. Four machine learning methods were used to screen for significant predictors by variable importance measures. Studying the current status of PCC and exploring the factors that influence it is critical for patient-centered care. It helps to encourage patients to participate in nursing and treatment and to better meet the psychological expectations and feelings of patients [26,27].
A total of fifteen significant predictors related to patient communication were derived by four machine learning approaches, based on the extensive variable set of the HINTS database. The results showed that the significant predictors identified relate to various aspects of the individual. Sociodemographic characteristics reflect the basic characteristics of individuals, which often affect the living habits, behavior patterns, and communication attitudes of an individual. Although the conclusions on the relationship between various sociodemographic characteristics and PCC are inconsistent [28,29,30], sociodemographic characteristics are an important aspect affecting PCC, which is consistent with the findings of this study.
From the perspective of sociodemographic characteristics, the important predictors of PCC screened in this paper include age, weight, and BMI, among other indicators. Age brings personal growth and life experience, which can change personal characteristics and affect personal communication ability. The perception of medical interaction varies with age, so age can have an impact on PCC [30,31]. Some studies suggest that, on average, obese patients may receive less patient-centered care than non-obese patients, and that both weight and BMI have an impact on the quality of patient communication [32,33,34]. However, unlike in most studies, common indicators such as gender, income, and marital status are missing from the sociodemographic characteristics screened in the present paper. A potential reason is that, unlike the subjective selection of variables in previous studies, the variables in this paper were drawn from an extensive set and selected objectively by statistical methods. Furthermore, the relationship between sociodemographic characteristics and PCC may be complex. Thus, more research is needed in this area.
Consistent with previous literature, health-related predictors associated with PCC include the perceived quality of care (QualityCare), cancer-related perceptions (the belief that anything can cause cancer; EverythingCauseCancer), the way health information is queried (StrongNeedCancerInfo), and whether people believe the cancer information published by the relevant agencies (CancerTrustCharities) [35,36,37,38]. QualityCare had the highest relative importance value in all four machine learning methods in this paper. An important relationship is present between the quality of care an individual receives and PCC, as patients who receive better care can be motivated to communicate with their healthcare providers [39].
As novel findings, the study found that a high degree of linguistic isolation affects PCC, possibly because limited English proficiency can impede communication between patients and healthcare personnel. Exercise time, drinking frequency, and sitting time also affect PCC. This may be because, on the one hand, personal living habits can affect attitude to life, which in turn affects personality and thus communication habits. On the other hand, personal living habits are closely related to physical health. Unhealthy living habits can lead to chronic diseases, which in turn will have an impact on PCC. These findings suggest that not only common factors, such as individual sociodemographic characteristics and health status, but also individual living habits should be considered when predicting PCC. Providers of patient-centered care should also fully understand the health, hygiene, exercise, and other multi-dimensional conditions of the patient.
Briefly, our contributions can be summarized in the following points. First, to the best of our knowledge after reviewing existing research on PCC, this is the first study to introduce machine learning methods to PCC predictor identification. Second, the variable selection method is relatively objective. Previous studies on PCC have introduced predictors subjectively, so their conclusions carry a certain degree of subjectivity. In this paper, four machine learning methods are used to select variables objectively according to variable importance indicators. Third, the range of candidate variables is wide. Comprehensive experiments on the HINTS database demonstrate that the adopted methods can effectively discover important predictors among large-scale patient variables without the intervention of human experts. Large-scale variable sets not only guarantee the objectivity of variable selection to a certain extent but also facilitate the discovery of novel variable relationships and impact patterns.
A known limitation of non-linear and ensemble machine learning algorithms is their poor interpretability. While the predictors of PCC were identified in this study, it was difficult to explain the direction and magnitude of the influence of each variable. In addition, the machine learning algorithms used in this paper showed limited prediction performance and need further tuning. Therefore, the predictors identified in this paper should be evaluated in conjunction with other well-interpretable models or clinical evidence. Further work should combine the important predictors obtained in this paper with easy-to-interpret models to analyze the direction and mechanism of influence of the newly obtained predictors on PCC. On this basis, a decision support system can be constructed to support doctors or clinical staff. Understanding the sociodemographic and health-related factors that predict PCC can help to achieve more accurate predictions of PCC and enhance the quality of communication between doctors and patients, supporting medical staff in providing higher-quality medical services to patients.
5. Conclusions
Based on the National Cancer Institute’s 2019–2020 Health Information National Trends Survey (HINTS) database, four machine learning methods were used to identify important predictors of PCC from an extensive variable set based on variable importance measures. A total of fifteen significant predictors were obtained, involving multiple dimensions, such as personal sociodemographic characteristics, living habits, and concerns about health problems. Notably, this paper identified four novel, potentially relevant variables (linguistic isolation, exercise time, drinking frequency, and sitting time) that affected respondents’ perceived PCC quality. Understanding the sociodemographic and health-related factors that predict PCC can help researchers make high-precision predictions of PCC and also provide references for improving the level of communication between medical staff and patients to provide optimal care.