1. Introduction
Data shape the modern world largely through being collected, analyzed, generated, and served as the basis of reliable solutions. In 2023, the world generated around 120 ZB of data, equivalent to 337,080 PB per day, or 17.85 TB of daily data per internet user worldwide. The data science field has therefore become extremely important for dealing with this huge volume of data and extracting meaningful information using statistics, scientific computing, and algorithms. Data science learns from data mainly through four types of algorithms: supervised learning [1], unsupervised learning [2], semi-supervised learning [3], and reinforcement learning [4]. Each of these methods has been successfully applied to various fields, including biology, economics, and astronomy, to explore vast and complex data and discover new knowledge for future development. In this research, the application domain is wine.
Wine culture has a long and rich history, and wine remains a popular alcoholic beverage made from fermented fruit juice, typically grape juice. Wine production and consumption also have a significant economic impact worldwide. Spherical Insights, which provides market statistics and insights across 170 industries and more than 150 countries, reports that the global wine market was valued at approximately USD 409.25 billion in 2022 [5]. World wine consumption reached 232 million hectoliters, while world wine production reached 258 million hectoliters in 2022, according to the state of the world vine and wine sector in 2022 published by the International Organization of Vine and Wine (OIV) [6]. In this huge market with its wide variety of choices, wine reviews, which describe a wine's characteristics and report the score, vintage, price, and comments from professional sommeliers, are useful and valuable for wine makers, distributors, and consumers. Many wine reviewers and wine magazines publish reviews around the world. Among these, reviews from Wine Spectator [7], a world leader in magazine wine reviews, are collected and transformed into usable knowledge in this research.
The price of wine ranges very widely, from a few dollars to thousands of dollars. The price of a bottle of wine is influenced by several factors. The first is the cost of production, including raw materials such as grapes, barrels, and bottles, as well as utility and labor costs; administrative, sales, and marketing expenses are also factored in. When wine is purchased at a restaurant, a mark-up, an additional charge on the wine price, is applied [8]. Distributors, wholesalers, and retailers also apply mark-ups to make a profit. Natural conditions are another variable that plays a significant role: they affect overall supply and demand, and challenging years may result in higher labor expenses. The second factor is consumer preferences and willingness to pay, which are shaped by the reputation of both the wine and its producer [9]. These reputations are established by famous wine magazines and reviewers, such as Wine Spectator and Robert Parker, one of the most famous wine experts in the world. The score and review that a wine receives affect trends and customers' preferences: if a wine receives a high score and a glowing comment from an influential reviewer, its price may be driven upward, while an unfavorable evaluation may lead to a price decrease.
Wineinformatics is a new data science research area focused on understanding wine through machine learning algorithms applied to wine datasets. Wine datasets are structured and include physicochemical laboratory data and wine reviews [10]. Physicochemical laboratory data [11] can easily be read and analyzed by computers since they are numeric. Wine reviews, written in human language, contain important and detailed information about wine features that is essential in this field. To make this human-language information usable, it is processed by the computational wine wheel, a natural language processing tool [12,13]. The computational wine wheel was developed based on Wine Spectator's wine reviews and works as a dictionary: it captures keywords in the wine reviews and converts them into binary information so that a computer can perform the analysis. In previous research [12,13,14], the computational wine wheel demonstrated the capability of transforming wine reviews into computer-understandable codes and enabling machine learning algorithms to recognize the correlation between wine reviews and scores. In this research, the computational wine wheel is applied to extract attributes from wine reviews, and we explore the possibility of including other commonly available wine data among the attributes.
Instead of predicting wine prices [15], this paper focuses on using wine price as an additional attribute and aims to determine whether the price attribute increases wine score prediction accuracy by comparing results obtained with and without this attribute in the dataset. Since price takes numerical values ranging from single digits to the thousands, the price values are normalized using various methods: the mean, median, boxplot mean, and boxplot median. Two supervised methods are employed: naive Bayes, a white-box algorithm, and SVM, a black-box algorithm. The major contributions of this paper are:
Including price as an additional attribute in the processed review data for predicting wine score categories.
Proposing and testing several methods for converting continuous price values into a binary dataset.
Laying the groundwork for incorporating additional information into processed wine review data, enabling the application of neural networks and deep learning in similar wineinformatics research.
2. Materials and Methods
For this study, the ALL Bordeaux Wine Dataset, a collection of wine reviews from Wine Spectator, was used to compare accuracies across different collections of key attributes. Language conversion was performed with the computational wine wheel, a dictionary designed to transform human language into a machine-understandable format. The key attribute in this research is price. Price values are normalized using three measures: the mean, the median, and the boxplot (boxplot mean and boxplot median). All combinations of the presence or absence of the price attribute and its normalizations are analyzed with two supervised learning algorithms: naive Bayes and SVM classifiers. Five-fold cross-validation is utilized to ensure fair evaluation.
2.1. Wine Reviews
Wine Spectator is a trustworthy source of information about wine, focusing on wine and wine culture. Each year, its experts review more than 15,000 wines, and the magazine publishes 15 issues, each of which includes 400 to 1000 wine reviews with detailed tasting comments and drink recommendations [7]. When wines are submitted for review, Wine Spectator conducts its tastings in a single-blind manner: the reviewers know information necessary for tasting, such as the wine's grape variety and vintage, but do not know the wine's producer or price, in order to avoid bias [7]. Wine Spectator uses a 100-point scale system:
95–100 Classic: a great wine
90–94 Outstanding: a wine of superior character and style
85–89 Very good: a wine with special qualities
80–84 Good: a solid, well-made wine
75–79 Mediocre: a drinkable wine that may have minor flaws
50–74 Not recommended
The following is an example wine review, for Château Latour Pauillac 2009.
Château Latour Pauillac 2009 ° 99 pts ° USD 1600
This seems to come full circle, with a blazing iron note and mouthwatering acidity up front leading to intense, vibrant cassis, blackberry and cherry skin flavors that course along, followed by the same vivacious minerality that started things off. The tobacco, ganache and espresso notes seem almost superfluous right now, but they’ll join the fray in due time. The question is, can you wait long enough? Best from 2020 through 2040. 9580 cases made—JM.
Country: France • Region: Bordeaux • Issue Date: 31 March 2012
2.2. The Computational Wine Wheel 2.0
Wine reviews, expressed in human language, require processing and conversion into a machine-understandable format via the computational wine wheel (CWW), a natural language processing application. The CWW 2.0 was created based on 1100 wine reviews from Wine Spectator [13]. An updated version, the CWW 3.0, was developed using Robert Parker's wine reviews in addition to Wine Spectator's and contains more attributes [16]. Both versions convert words to attributes in the same way. Since our study utilized wine reviews sourced from Wine Spectator, the CWW 2.0 was employed.
Keywords are extracted from the reviews by this application and encoded using one-hot encoding to transform the categorical information into numeric vectors [16]. For example, if a wine review mentions fruit terms such as apple, blueberry, or plum, the CWW captures these words and encodes the corresponding predefined attribute as 1; attributes not mentioned are encoded as 0. In addition to fruit flavors, the CWW covers a variety of other wine characteristics, including descriptive adjectives (balance, beautifully, etc.) and the body of the wine (acidity, tannin, etc.). The CWW also generalizes similar words into the same coding. For instance, apple, fresh apple, and ripe apple are all generalized to the "Apple" attribute since they express the same flavor, yet green apple maps to the separate "Green Apple" attribute since the green apple flavor is distinct from the apple flavor.
Figure 1 shows a detailed example of this conversion.
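To make the mechanism concrete, the following minimal Python sketch illustrates this kind of dictionary-based one-hot encoding; the keyword map, attribute names, and review text are illustrative stand-ins, not the actual CWW 2.0 vocabulary.

```python
# Minimal sketch of CWW-style one-hot encoding (illustrative keyword map,
# not the actual CWW 2.0 vocabulary). Longer phrases are matched first so
# that "green apple" is not swallowed by the more general "apple".
KEYWORD_TO_ATTRIBUTE = {
    "green apple": "GREEN APPLE",
    "fresh apple": "APPLE",
    "ripe apple": "APPLE",
    "apple": "APPLE",
    "blackberry": "BLACKBERRY",
    "cassis": "CASSIS",
    "tobacco": "TOBACCO",
}
ATTRIBUTES = sorted(set(KEYWORD_TO_ATTRIBUTE.values()))

def encode_review(review: str) -> dict:
    """Return a binary attribute vector for one review."""
    text = review.lower()
    vector = {attr: 0 for attr in ATTRIBUTES}
    # Match longer keywords first, then blank them out so that substrings
    # (e.g. "apple" inside "green apple") do not fire twice.
    for keyword in sorted(KEYWORD_TO_ATTRIBUTE, key=len, reverse=True):
        if keyword in text:
            vector[KEYWORD_TO_ATTRIBUTE[keyword]] = 1
            text = text.replace(keyword, " ")
    return vector

print(encode_review("Vibrant cassis, blackberry and green apple flavors with tobacco notes."))
# {'APPLE': 0, 'BLACKBERRY': 1, 'CASSIS': 1, 'GREEN APPLE': 1, 'TOBACCO': 1}
```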
2.3. Data
For this study, the ALL Bordeaux Wine Dataset is utilized. This dataset was developed in a previous study [14] that collected all Bordeaux wines from 2000 to 2016. That investigation developed two datasets: the ALL Bordeaux Wine Dataset and the 1855 Bordeaux Wine Official Classification Dataset, the latter collecting, for the years 2000 to 2016, all wines listed in the famous 1855 Bordeaux Wine Official Classification [14]. These datasets were analyzed with SVM and naive Bayes classifier methodologies.
These Bordeaux wine data were gathered from Wine.com, an e-commerce website based in the United States. Wine.com is the leading wine retailer, offering customers access to the world's largest wine store. To provide detailed and varied guidance to its customers, the platform includes professional wine reviews from critics such as Wine Spectator, Wine Enthusiast, and Decanter, as well as wine experts like Robert Parker and James Suckling. Wine.com was selected as the data source due to its reliability and convenience.
The dataset used in this research, the ALL Bordeaux Wine Dataset, contains a total of 14,349 Bordeaux wines produced in the 21st century (2000–2016). There are 10,086 wines rated 89 or below (89− wines) and 4263 wines rated 90 or above (90+ wines); the number of 90+ wines is approximately 57.73% lower than the number of 89− wines.
Figure 2a illustrates the distribution of scores in the dataset. Most of the wines are scored between 86 and 90, representing “Very Good” wines.
Figure 2b shows the trend in the number of wines reviewed each year, reflecting the quality of the vintages. The line chart indicates that more than 1200 wines were reviewed in each of 2009 and 2010, implying that 2009 and 2010 were good vintages in the Bordeaux region.
Figure 2 is adapted from [14].
Using this dataset, the score, wine reviews, and price were collected. The score serves as the class label, with a threshold set at 90 points. In this research, two models were created to predict whether a wine would receive 90 points or above, or 89 points or below, and the accuracies of models with and without the price attribute were compared.
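As a minimal sketch of this setup, the class label can be binarized at the 90-point threshold (the data frame and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical frame holding the collected fields (names assumed).
df = pd.DataFrame({"score": [99, 88, 91, 85], "price": [1600.0, 25.0, 60.0, 12.0]})
df["label"] = (df["score"] >= 90).astype(int)   # 1 = 90+ wine, 0 = 89- wine
print(df)
```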
2.3.1. Preprocessing of Price Data
In the dataset, the price attribute appears in various formats. For example, some values are written simply as USD 50, while others are written as USD 50/375 mL, USD 50/500 mL, or USD 50/750 mL. There are also null values indicated in different ways, such as USD NA, USD NA/375 mL, USD NA/500 mL, or USD NA/750 mL. To standardize these formats, all null values were first unified into a consistent format and then dropped, allowing a direct comparison without any imputation bias. For a fair comparison, all prices were then adjusted to a 750 mL basis, since the simple format, such as USD 50, usually refers to a 750 mL bottle. For instance, a value of USD 50/750 mL remains 50. A value of USD 50/375 mL is adjusted to 100 by doubling, since 375 mL is half of 750 mL. Similarly, a value of USD 50/500 mL is adjusted to 75 by multiplying by 1.5, since 750 mL is 1.5 times 500 mL.
Figure 3 shows the overall price distribution after this preprocessing.
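The volume adjustment can be sketched as follows; the raw string formats come from the examples above, while the function name and the use of pandas are our own illustrative choices.

```python
import pandas as pd

def price_per_750ml(raw: str):
    """Normalize raw price strings (e.g. 'USD 50', 'USD 50/375 mL') to a
    750 mL basis; return None for null entries such as 'USD NA/500 mL'."""
    raw = raw.replace("USD", "").strip()
    amount, _, volume = raw.partition("/")
    amount = amount.strip()
    if amount.upper() == "NA":
        return None                      # null price: dropped later
    price = float(amount)
    ml = float(volume.strip().rstrip("mL").strip()) if volume else 750.0
    return price * (750.0 / ml)          # 375 mL doubles, 500 mL scales by 1.5

prices = pd.Series(["USD 50", "USD 50/375 mL", "USD 50/500 mL", "USD NA/750 mL"])
print(prices.map(price_per_750ml).dropna().tolist())   # [50.0, 100.0, 75.0]
```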
After that, the price values were normalized using several measures: the mean, the median, and the boxplot. First, the mean normalization used the average price of USD 50.04. Second, the median normalization used the middle value of the price distribution, USD 30.00. Third, the boxplot method was employed to handle outliers, values significantly higher or lower than a specified range. In boxplot analysis, five key numbers describe a distribution: the minimum value (USD 1.00), Q1 (the first quartile, USD 20.00), Q2 (the median, USD 30.00), Q3 (the third quartile, USD 46.00), and the maximum value (USD 985.00). Q1 is the 25th percentile, the value below which 25% of the data fall, while Q3 is the 75th percentile, the value below which 75% of the data fall. The interquartile range (IQR) is the difference between Q3 and Q1, here USD 26.00. A point is considered an outlier if it lies more than 1.5 times the IQR below Q1 or above Q3. After all outliers were removed from the dataset, two measurements were calculated, the mean (USD 31.32) and the median (USD 28.00), referred to in this paper as the boxplot mean and the boxplot median, for use as thresholds. The outliers were then concatenated back into the dataset so that all the data were used in the analysis.
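The threshold computation described above can be sketched as follows, assuming the cleaned, 750 mL-adjusted prices are held in a pandas Series; the dollar values in the comments are from this dataset, and the direction of the binary encoding is an assumption.

```python
import pandas as pd

def price_thresholds(price: pd.Series) -> dict:
    """Compute the four candidate thresholds from the cleaned price column."""
    q1, q3 = price.quantile(0.25), price.quantile(0.75)
    iqr = q3 - q1
    # Tukey fences: points beyond 1.5 * IQR below Q1 or above Q3 are outliers.
    inliers = price[(price >= q1 - 1.5 * iqr) & (price <= q3 + 1.5 * iqr)]
    return {
        "mean": price.mean(),               # USD 50.04 in this dataset
        "median": price.median(),           # USD 30.00
        "boxplot_mean": inliers.mean(),     # USD 31.32, outliers removed
        "boxplot_median": inliers.median(), # USD 28.00, outliers removed
    }

# Binarization direction assumed: 1 when the wine costs at least the threshold.
# df["price_attr"] = (df["price"] >= price_thresholds(df["price"])["boxplot_median"]).astype(int)
```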
2.3.2. Price Distribution
To analyze the distribution of wine price values and their corresponding scores in the dataset, distribution tables, Table 1 and Table 2, are presented. These tables are organized by the threshold used, shown in the top-left corner of each table. They provide a clear and structured way to observe how the price attribute correlates with wine scores, which helps in understanding the impact of price on wine quality as perceived through the scores.
Some patterns can be identified in the tables: wines priced below the thresholds (whether median, mean, or quartile) tend to have lower scores (89−), while wines priced above the thresholds are roughly equally likely to score 90+ or 89−, indicating no clear separation. In particular, wines priced below the mean and median thresholds show a higher proportion of 89− scores. Reflecting the overall distribution of scores in the dataset, which contains more 89− wines than 90+ wines, the number of wines scoring 89− is generally higher.
As indicated in the boxplot distribution (Table 3), the most expensive wines tend to receive scores of 90+. Although most of the distributions are clearly separated, the proportions for wines priced above the mean, those priced between Q2 and Q3, and those priced above the boxplot mean or median are relatively unclear compared to other categories. This indicates that wines priced above these thresholds do not exhibit a clear pattern, raising the possibility that the classification algorithms will struggle to find consistent patterns and accurately predict the class label.
To analyze the distribution of wine price values and their corresponding scores through boxplot analysis, the distribution tables, Table 4 and Table 5, are presented. Boxplot_mean indicates that the mean value was used as the threshold after outliers and null values were dropped; Boxplot_median indicates that the median value was used as the threshold after outliers and null values were dropped. In the dataset, the lowest price is USD 1.00 and the highest price is USD 985.00, as indicated at the top right of each table. Q0 indicates the lower fence (the boxplot minimum), and prices smaller than Q0 are considered outliers; likewise, Q4 indicates the upper fence (the boxplot maximum), and prices larger than Q4 are considered outliers. The fences are computed as $Q_0 = Q_1 - 1.5 \times IQR$ and $Q_4 = Q_3 + 1.5 \times IQR$. In each range, the left endpoint is included but not the right. For example, the range between Q1 and Q2 covers prices equal to or greater than USD 20.00 and less than USD 30.00.
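These half-open ranges can be reproduced with pandas' cut function (right=False makes each bin include its left edge only); the sample prices and bucket labels below are illustrative, and the fences follow the formulas above.

```python
import pandas as pd

# Quartile boundaries from the text; Q0/Q4 are the Tukey fences.
Q1, Q2, Q3, IQR = 20.00, 30.00, 46.00, 26.00
Q0, Q4 = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR   # -19.00 and 85.00

# Half-open bins [left, right): the left edge is included, the right is not.
prices = pd.Series([5, 20, 29.99, 30, 84.99, 200], name="price")
buckets = pd.cut(prices,
                 bins=[Q0, Q1, Q2, Q3, Q4, float("inf")],
                 right=False,
                 labels=["Q0-Q1", "Q1-Q2", "Q2-Q3", "Q3-Q4", "outlier_high"])
print(buckets.tolist())
# ['Q0-Q1', 'Q1-Q2', 'Q1-Q2', 'Q2-Q3', 'Q3-Q4', 'outlier_high']
```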
2.4. Classification Algorithms
The goal of this research is to examine the impact of the price attribute on model accuracy. In previous research, the naive Bayes classifier achieved the best accuracy among all applied white-box classification algorithms, while the support vector machine (SVM) classifier, a black-box algorithm, consistently achieved slightly better accuracy than naive Bayes [16]. Therefore, the naive Bayes and SVM classifier algorithms were applied to find out whether the price attribute improves accuracy and to determine which algorithm performs better with the features collected in this study.
2.4.1. Naïve Bayes
Naive Bayes is a statistical classifier that calculates probabilities and predicts a class based on Bayes' theorem. It is commonly used for machine learning classification as a white-box algorithm, and all input attributes are treated as independent. Bayes' theorem [17,18] is as follows:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$

$P(H \mid X)$: the posterior probability of hypothesis $H$ given training data $X$.
$P(X \mid H)$: the posterior probability of observing attribute $X$ given hypothesis $H$.
$P(H)$: the prior probability of hypothesis $H$.
$P(X)$: the prior probability of training data $X$.
By applying the above formula, the naive Bayes classifier can handle multi-dimensional datasets. Here, $X$ represents an $n$-dimensional attribute vector $X = (x_1, x_2, \ldots, x_n)$, and class $C$ has $m$ classes $C_1, C_2, \ldots, C_m$. The classification derives the maximum a posteriori class. The formula of the naive Bayes classifier is as follows:

$$C_{\text{pred}} = \arg\max_{i = 1, \ldots, m} P(C_i \mid X) = \arg\max_{i = 1, \ldots, m} P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)$$

However, when a value of $X$ never appears in the training dataset, the conditional probability of that value will be 0, as indicated by $P(x_k \mid C_i) = 0$ (for some $i = 1, 2, \ldots, m$), which zeroes out the entire product. To handle this zero multiplication, Laplace smoothing is introduced:

$$P_\lambda(x_k \mid C_i) = \frac{N_{ik} + \lambda}{N_i + \lambda K}$$

where $N_{ik}$ is the number of training samples of class $C_i$ with attribute value $x_k$, $N_i$ is the total number of training samples of class $C_i$, $\lambda$ is the smoothing parameter, and $K$ is the number of classes.
For our research, λ is simply set to 1, and K is 2, as the prediction task is binary, distinguishing between wines rated 90 or above and those rated 89 or below.
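As an illustrative sketch, scikit-learn's BernoulliNB with alpha=1.0 implements exactly this Laplace-smoothed naive Bayes over binary attributes; the toy matrix below stands in for the CWW-plus-price dataset and is not the authors' implementation.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary design matrix: CWW attribute columns plus the binarized price
# attribute as the final column (illustrative data, not the real dataset).
X = np.array([[1, 0, 1, 1],
              [0, 1, 0, 0],
              [1, 1, 1, 1],
              [0, 0, 0, 0]])
y = np.array([1, 0, 1, 0])   # 1 = 90+ wine, 0 = 89- wine

# alpha=1.0 is Laplace smoothing (lambda = 1 in the formula above).
model = BernoulliNB(alpha=1.0)
model.fit(X, y)
print(model.predict([[1, 0, 1, 0]]))        # predicted class
print(model.predict_proba([[1, 0, 1, 0]]))  # posterior probabilities
```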
2.4.2. SVM
SVM is a black-box machine learning algorithm used for classification and prediction, and it is effective for both linear and nonlinear data [19]. This method was employed for this study due to its strong performance in binary classification problems. It uses nonlinear mapping to transform the original training data into a higher dimension, where they can be linearly separated. The goal of SVM is to find the hyperplane, the decision boundary, that best separates the data into classes. The hyperplane is chosen to maximize the margin, meaning the nearest data points of each class lie at the maximum distance from the boundary [20]; these nearest points are known as support vectors. Notable advantages of SVM are its high prediction accuracy, its robustness across many different types of data even when training data contain errors, and the quick evaluation of the learned target function. Despite these strengths, training can take a long time, and the learned function is difficult to interpret since it is a black-box algorithm. In this project, SVMlight [21] was employed to perform the classification. The process requires two input datasets, one for training, used to build the model, and one for testing, used for prediction. In our study, SVM was trained on more than 7700 wine samples in each training fold, and more than 2600 support vectors were identified to distinguish the two classes.
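Since SVMlight itself is a command-line tool, the following scikit-learn sketch shows an equivalent linear-kernel workflow; the toy data and the kernel choice are assumptions, not the study's exact configuration (SVMlight defaults to a linear kernel).

```python
from sklearn import svm

# Minimal equivalent of the SVMlight workflow using scikit-learn.
X_train = [[1, 0, 1, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
y_train = [1, 0, 1, 0]   # 1 = 90+ wine, 0 = 89- wine

clf = svm.SVC(kernel="linear")
clf.fit(X_train, y_train)

print(clf.predict([[1, 0, 1, 0]]))      # class prediction for a new wine
print(clf.support_vectors_.shape[0])    # number of support vectors found
```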
2.5. Evaluation
All experiments in this research use 5-fold cross-validation to avoid overfitting and to evaluate the predictive performance of the classification models. Several steps are taken to split the ALL Bordeaux Wine Dataset into five subsets with the same class distribution as the original dataset [14]. First, the dataset is shuffled randomly. Second, it is split into two sets: one containing wines scoring 90 or above (the 90+ wine group) and the other containing wines scoring 89 or below (the 89− wine group). Third, each of these two sets is divided into five subsets. Finally, the first subset from the 90+ wine group is combined with the first subset from the 89− wine group to create a new set, and this process is repeated for the remaining subsets.
Figure 4 illustrates these steps.
After the above process, for fold 1, subset 1 is used as the testing set and the remaining subsets serve as the training set. The model is trained on the training set, and the accuracy is obtained from the testing set, as shown in Figure 5. After repeating this four more times for the remaining folds, the average accuracy, precision, recall, and F-score are taken as the performance result of the cross-validation.
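A hedged sketch of this procedure: scikit-learn's StratifiedKFold shuffles and then preserves the 90+/89− ratio in each fold, matching the split-and-recombine steps described above; the placeholder data are random stand-ins for the real attribute matrix.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

# X: binary attribute matrix, y: 1 for 90+ wines, 0 for 89- wines
# (random placeholder data; the real matrix comes from the CWW + price).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 20))
y = rng.integers(0, 2, size=100)

# StratifiedKFold keeps the class ratio in every fold, mirroring the
# split-by-class-then-recombine procedure described above.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracies = []
for train_idx, test_idx in skf.split(X, y):
    model = BernoulliNB(alpha=1.0).fit(X[train_idx], y[train_idx])
    accuracies.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"mean accuracy over 5 folds: {np.mean(accuracies):.4f}")
```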
To evaluate the performance of the classification model, four statistical measures are used: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). As shown in Table 6, a true positive is a correct prediction in which the predicted value is positive (90+ wine) and the actual value is also positive. A true negative is a correct prediction in which the predicted value is negative (89− wine) and the actual value is also negative. A false positive is an incorrect prediction in which the predicted value is positive (90+ wine) but the actual value is negative (89− wine). A false negative is an incorrect prediction in which the predicted value is negative (89− wine) but the actual value is positive (90+ wine).
Based on this evaluation matrix, four measures are used to evaluate the classification results: accuracy, recall, precision, and specificity.
Accuracy is the percentage of wines that are correctly classified out of all wines in the dataset; it tells us how many wines were predicted accurately as 90+ or 89−.
Recall is the percentage of classic (90+) wines that are predicted correctly.
Precision is the percentage of wines classified as classic that actually are classic.
Specificity is the percentage of non-classic (89−) wines that are predicted correctly.
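Expressed in terms of the confusion matrix counts, these standard definitions are:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Recall} = \frac{TP}{TP + FN},$$
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Specificity} = \frac{TN}{TN + FP}.$$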
3. Results
With the ALL Bordeaux Wine Dataset, null values in the price attribute were dropped from the original dataset, yielding 9721 wine entries, and the accuracy of classifiers with and without the price attribute was compared across the normalization methods. Both classifiers performed slightly better when the price attribute was included alongside the wine reviews. SVM achieved the highest accuracy across all experiments, 87.98%, when the price attribute was normalized using the mean. For naive Bayes, the largest improvement was a 0.99% increase in accuracy when the price attribute was included, reaching 86.92%. Across all experiments, SVM consistently outperformed naive Bayes, and the results suggest that the boxplot median was the best normalization for this dataset.
3.1. Absence of Price Attribute
Using the ALL Bordeaux Wine Dataset with only the wine review attributes, Table 7 shows that both the naive Bayes and SVM classifiers achieved over 85% accuracy on the 9721 wine samples. Naive Bayes achieved 85.93% accuracy, and SVM reached 87.41% (Table 7), 1.48% higher than naive Bayes. These results establish a consistent baseline pattern: SVM outperforms naive Bayes on this dataset, a pattern that holds across all subsequent results and serves as the foundation for further analysis with the price attribute. Since the dataset has high dimensionality, with 986 attributes before the price feature is added, SVM effectively leverages these features to find an optimal separating hyperplane that maximizes the margin between classes. Additionally, these results highlight the relationship between wine reviews and their corresponding scores, as well as how effectively the computational wine wheel captures the influential keywords in the reviews that contribute to the received scores.
3.2. Presence of Price Attribute
To evaluate the impact of the price attribute on prediction accuracy, the same dataset was used with price incorporated as an additional feature. As shown in Table 7, the naive Bayes and SVM classifiers achieved improved accuracies of 86.92% and 87.98%, respectively, and all results improved compared to the models without the price attribute. For naive Bayes, the largest improvement was 0.99% with the boxplot median normalization, and the average improvement across the four normalization methods was 0.90%. This level of improvement is notable, considering that it resulted from adding a single attribute to a dataset already containing 986 wine review attributes; it suggests that the price attribute has predictive power, contributing up to roughly a 1% increase in accuracy on this dataset. Naive Bayes, which directly calculates the relationship between class labels and attributes, clearly showed the influence of the price feature. For SVM, the accuracy improvements were consistent across all models, with an average increase of 0.50%. Although SVM did not improve as much as naive Bayes, the inclusion of the price attribute still led to better decision making. These improvements demonstrate the positive impact of incorporating the price attribute and indicate that wine price is correlated with wine scores and reviews. Including price allowed the models to better capture and learn the data patterns, boosting prediction performance.
Note: to ensure a fair comparison, results should be compared between models trained on the same number of wine samples. In this study, all null values in the price attribute were removed. However, when applying the boxplot normalization methods, the outliers (1016 wines) that were initially excluded to compute the boxplot mean and median thresholds were reintroduced into the dataset. This approach is reasonable because the price range of wines is inherently broad, and expensive wines are crucial for a comprehensive analysis. These outliers represent significant variation in the data that could influence the relationship between price and quality; removing them would risk oversimplifying the model and missing important trends or patterns that could improve prediction accuracy. Additionally, keeping all data points, including outliers, matters because larger datasets generally lead to higher model accuracy: models benefit from more comprehensive training data, which allows them to generalize better to unseen data. Therefore, to maximize the robustness and reliability of the findings, no data, including outliers, were excluded from the analysis.
5. Conclusions
In this research, we examined the relationship between wine price and score, as well as the impact of including price on prediction accuracy, using the ALL Bordeaux Wine Dataset. The results demonstrated that the price attribute enhanced the performance of both the naive Bayes and SVM classifiers, improving accuracy from 85.93% to 86.92% and from 87.41% to 87.98%, respectively. Naive Bayes most clearly demonstrated the positive impact of the price attribute, with a 0.99% improvement. Among the four normalization methods, the boxplot median normalization (USD 28.00) performed best in maximizing accuracy, as this threshold distributed the 90+ wines optimally and created a stronger correlation between wine price and wine score. Therefore, wine price, especially when normalized effectively, is a valuable attribute for more accurate wine score prediction.
The findings related to the boxplot normalization method open a new challenge for future work: focusing on wines priced within the range where the score distribution is ambiguous, specifically between USD 28.00 and USD 46.00. A more detailed analysis of wines in this range could provide a better understanding of why they are particularly challenging and how this range could be addressed to improve predictive performance. Similar research [15,22,23,24] can be consulted for deeper insights and improvements. Additionally, future studies could replicate these experiments with different datasets, such as wines from various regions, wines reviewed by other experts, or data collected from sources other than Wine.com to mitigate the inherent bias associated with wine sales. This would help to further explore the influence of price on wine classification.
One of the key tasks is to incorporate other learning algorithms, such as neural networks, which are highly regarded in the machine learning field for their strong predictive performance. Neural networks in particular have demonstrated impressive results in wineinformatics research [12] and could enhance the accuracy of wine score predictions while better capturing the correlations between wine price and score. One possibility is to build a neural network that takes the wine price and the extracted wine review keywords as inputs and predicts the wine grade category as output, as the SVM and naive Bayes classifiers did in this work. Another is to build a neural network that takes the extracted wine review keywords as inputs and predicts the wine price category, and then to use the predicted price category, paired with the extracted wine review keywords, as input for wine grade category prediction, as demonstrated in Figure 6. This simulates the human mind, which considers multiple aspects of a wine before purchasing, and forms a deep learning structure for wineinformatics [25,26,27,28,29].
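As a rough sketch of the two-stage idea in Figure 6, the pipeline can be prototyped with scikit-learn's MLPClassifier; the placeholder data, network sizes, and hyperparameters are assumptions, not a tuned design.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Random placeholder data standing in for the real CWW attributes.
rng = np.random.default_rng(1)
keywords = rng.integers(0, 2, size=(200, 50))   # extracted review keywords
price_cat = rng.integers(0, 2, size=200)        # binarized price category
grade = rng.integers(0, 2, size=200)            # 1 = 90+, 0 = 89-

# Stage 1: predict the price category from the review keywords alone.
price_net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
price_net.fit(keywords, price_cat)

# Stage 2: the predicted price category joins the keywords as an extra
# input for grade prediction, mimicking a buyer weighing review and price.
stage2_X = np.column_stack([keywords, price_net.predict(keywords)])
grade_net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
grade_net.fit(stage2_X, grade)
print(grade_net.score(stage2_X, grade))   # training accuracy of the sketch
```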