1. Introduction
Taxes are an essential source of income for any government. Being able to detect companies that are likely to incur tax arrears as accurately as possible would enable tax authorities to better target their tax audits and implement preventive measures aimed at ensuring the timely payment of taxes. However, despite the high economic importance of ensuring tax compliance, studies on predicting corporate tax arrears have so far been scarce. In the machine learning domain, more attention has been directed to the detection of tax fraudsters (e.g., [1,2,3,4]).
The main drawbacks of current studies (e.g., [
5,
6,
7]) are that they have mostly concentrated on using financial ratios as predictors of tax arrears, and have only proposed models for predicting tax arrears for the following year. The disadvantage of using financial ratios is that they become available with a considerable time lag after the payment irregularities have already been going on for some time. In addition, they cannot be used if financial reports are unavailable, which is much more likely to happen in the case of financially distressed firms [
8,
9], which in turn are also more likely to have tax arrears [
10]. Moreover, the accuracy of models using financial ratios has been moderate. The disadvantage of annual predictions is that they can only be made once a year, and they only predict whether a company will have tax arrears at any time in the following year, which seems too coarse for practical purposes.
To the best of the authors’ knowledge, there are no studies where the behavior of tax arrears in the past has been applied to predict future tax arrears. Thus, this paper intends to fill this research gap by applying monthly time series of corporate tax arrears for predicting tax arrears in the next month. Using data with monthly instead of annual frequency would have much higher practical value, since, in carrying out their daily activities, tax authorities would need to be able to detect companies likely to incur tax arrears not only once a year and not only for the entire following year, but at any time and in the more immediate future, using the most recent information available. Besides outlining the practical applicability, this paper contributes to the financial management literature by showing whether and how past payment defaults signal future payment defaults.
The aim of this paper is to explore which machine learning methods and types of independent variables work best in predicting whether companies will have tax arrears in the next month, given the time series of their tax arrears in the preceding 12 months. In addition to the 12 monthly amounts of tax arrears, two alternative types of variables constructed from them are considered. One of those types includes statistical measures and counts of events, and the other includes monthly amounts with the aggregation of amounts in earlier months into period means. The machine learning methods used are decision tree (DT), random forest (RF), k-nearest neighbors (KNN) and multilayer perceptron (MLP), which have also been applied in the related area of failure prediction (see, for example, [
11,
12,
13]). The two areas are related, since defaulting on taxes is a strong sign of financial distress, which in turn might eventually lead to bankruptcy [
6].
Data used for this paper were corporate monthly tax arrears of the entire population of Estonian SMEs for the period 2011–2018, from which more than 2 million observations, i.e., company-13 month period pairs, were collected using the moving window approach. In total, 49,156 companies were included.
A specific characteristic of monthly tax arrears is that they are rare events, as companies usually try to pay their taxes on time. However, learning from mostly zero-valued data is a difficult task for machine learning models. The approach used in this paper for reducing the high proportion (92%) of zero values was to build machine learning models only for observations which had tax arrears in at least two of the 12 preceding months, while the rest of the observations, where tax arrears in the next month were very unlikely to occur and difficult to predict, were always predicted not to have tax arrears in the next month. Thus, two types of accuracies are reported in this paper: (a) the accuracy of the different machine learning approaches in predicting the presence of tax arrears for observations with at least two months with tax arrears, and (b) the accuracy on the entire test set, obtained by combining the predictions of the best model from (a) with the simple rule described above for all other observations. While the accuracy of type (a) expresses the performance of the machine learning methods in solving the classification task, the accuracy of type (b) was also needed to measure performance on the entire test set and thus make the results comparable to previous studies.
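The relation between the two accuracy types can be illustrated with a minimal sketch. The function name, its arguments and the numbers in the example are hypothetical and are not taken from the paper; the sketch only assumes that the rule-based part always predicts "no tax arrears".

```python
def combined_accuracy(n_model, acc_model, n_rule, n_rule_positive):
    """Accuracy over the whole test set (type b).

    n_model          observations scored by the machine learning model
    acc_model        accuracy of the model on those observations (type a)
    n_rule           observations always predicted "no tax arrears"
    n_rule_positive  how many of those actually had tax arrears
    """
    correct_model = acc_model * n_model
    # The fixed rule is correct for every observation without tax arrears.
    correct_rule = n_rule - n_rule_positive
    return (correct_model + correct_rule) / (n_model + n_rule)


# Made-up illustrative numbers, not results from the paper.
example = combined_accuracy(n_model=150, acc_model=0.80,
                            n_rule=850, n_rule_positive=10)
```

Because the rule-based part is large and almost always correct, the type (b) accuracy is dominated by it, which is why type (a) is reported separately.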
The models were implemented in the Python programming language, using the Keras neural networks library for the MLP model, the scikit-learn machine learning library for the other models and the SciPy library for statistical tests.
The rest of the paper is organized as follows. The literature review is provided in Section 2. Section 3 contains a description of the dataset, variables and methodology. The results, accompanied by the discussion, are presented in Section 4, and conclusions in Section 5.
2. Literature Review
The literature focusing on corporate taxes is multifaceted, spanning finance, accounting, economics, ethics and other domains (see, e.g., [14]). The topic of taxes is also closely interconnected with economic sustainability, as firms with tax arrears are more likely to be insolvent and engage in law violations [
15,
16]. Bibliometric analyses indicate that economic sustainability remains among the key topics in the literature domain of firms’ sustainable development [
17,
18].
Despite the high economic importance of ensuring tax compliance, studies on predicting corporate tax arrears have so far been scarce. At the same time, there is more abundant recent literature focusing on predicting the tax revenues of governments or tax fraud committed by firms. Nevertheless, these topics are very different, as tax arrears prediction studies usually forecast a binary variable (occurrence of default), the roots of which are in the lack of liquidity [
10]. Tax revenue prediction studies, on the contrary, usually forecast continuous variables with predictors portraying growth [
19], while tax fraud studies forecast a rare binary event, the roots of which often lie in causes other than poor liquidity [
20]. In previous studies on tax arrears prediction, the focus has been on using companies’ financial statements data (in most cases in the form of financial ratios) for tax arrears prediction. Contrary to this paper, all those studies make predictions for the next year, instead of the next month. An overview of previous studies is provided in
Table 1.
There have been a handful of studies where the presence of tax arrears has been predicted based on financial ratios. For example, Marghescu et al. [
5] used data on 328 Finnish companies to predict the presence of arrears in employer contribution taxes using logistic regression. The classification accuracy of their model was very low (61.6%), and only exceeded the naïve baseline model of predicting none of the companies to have tax arrears by less than one percentage point. As the model was heavily underspecified, they suggested that more variables should be added.
Höglund [
6] used genetic algorithm-based variable selection, followed by linear discriminant analysis (LDA), to predict tax arrears in the next year, using a dataset of 768 Finnish firms. The independent variables used in that study included 17 financial ratios and two industry-related variables (bankruptcy risk and payment default risk). The accuracy of the best model was 73.8%. Höglund’s [6] dataset was also applied in Abedin et al. [21], together with a rich selection of machine learning tools, with accuracy remaining in a comparable range.
Batista et al. [
7] used financial ratios to classify Portuguese real estate agencies as tax-compliant (i.e., not having tax arrears), using discriminant analysis and logistic regression. They built separate models for each of the three years (2007–2009) included in the dataset, using data of ca. 200 companies for each model, as their aim was also to compare results before and after the financial crisis. In addition to conventional financial ratios, independent variables in their model also included the Taxation Effective Rate, which is an indicator associated with tax evasion. The accuracy of all their models was rather similar, with the accuracy level of 72.4% achieved by the discriminant analysis model built for the year 2008 being the best.
The abovementioned four studies reveal that the performance of financial ratios as predictors of tax arrears is rather low. The likely reason for this is that the annual reports from which the ratios are calculated become available with a considerable time lag, after the payment irregularities have already been going on for some time. The latter argument has been proven by Lukason and Laitinen [
22], who indicated that around three quarters of European bankrupt firms witness financial problems portrayed through financial ratios only very shortly before bankruptcy. Namely, when financial problems occur, they might be detectable only through the last annual report, for which, in all jurisdictions, there is a submission time lag set in laws. Thus, the empirically validated theoretical concept by Lukason and Laitinen [
22] indicates that payment defaults (including tax arrears) are very likely to occur in circumstances when the available annual report does not signal any financial problems. In addition, it could be assumed that for predicting tax arrears with monthly frequency, as is done in this paper, using solely financial ratios would be even more difficult, because the presence of tax arrears may change monthly, while financial ratios only change annually in the unlisted firm segment.
Another disadvantage of using financial ratios for predicting tax arrears is that the resulting models cannot be used for cases where financial reports are unavailable. This, however, is much more likely to happen in the case of financially distressed firms, which are therefore also more likely to have tax arrears [
9,
10]. For example, when predicting tax arrears based on financial ratios, Höglund [
6] left as many as 63% of the companies with tax arrears out of the model for this reason alone. If tax authorities were to use such models in practice for selecting companies for tax auditing, they would not be able to predict tax arrears for a large proportion of companies that are likely to have them, which represents a serious drawback for the practical application of those models.
A methodologically entirely different approach for predicting tax debt was taken by Zhao et al. [
23], who used sequence classifiers, i.e., frequent pattern mining models, where independent variables are temporally ordered, to predict social security debts. The independent variables were the activity codes of 155 possible activities of ca. 10,000 taxpayers in the Australian tax database, with each activity code being accompanied by a taxpayer ID and the date and time of the activity. They constructed ca. 16,000 activity sequences from that data with the aim of predicting which sequences lead to social security debts. The accuracy of their best classifier was moderate (76.0%). Additionally, “debt” in their study had a different meaning than “tax arrears” in this paper—instead of corporate taxes left unpaid, by “debt”, they meant overpayment of social security benefits by the tax authorities. As the benefits depended on entries in the tax database, the issue that they solved had some similarities to fraud discovery tasks.
Finally, Su et al. [
24] built an ensemble classification model composed of k-nearest neighbors, multilayer perceptron and several tree-based algorithms (random forest, extremely randomized trees, gradient tree boosting and XGBoost), using data of 70,000 Chinese companies for training and 50,000 for testing to predict the presence of tax arrears in the next year. To the best of the authors’ knowledge, this is the only study where some tax-related variables have been used to predict tax arrears in the future. However, in that study, the majority of independent variables consisted of financial variables originating from balance sheets and income statements. The accuracy of the proposed model was excellent (90.58%). Contrary to this paper, they used classification into three classes (no tax arrears, and tax arrears below or above the threshold of 5000 RMB) instead of binary classification, and the model was built for making annual instead of monthly predictions.
Tax arrears prediction has some similarities to bankruptcy prediction. In both cases, the aim is to assess a company’s ability to fulfil its obligations [
7], i.e., whether it is in financial distress. In this regard, defaulting on taxes is a strong sign of financial distress, which in turn might eventually lead to bankruptcy [
6]. Therefore, tax arrears prediction can detect financial distress earlier than bankruptcy prediction, i.e., before the temporary tax payment difficulties (temporary insolvency) have evolved into bankruptcy (permanent insolvency). This is important in ensuring tax compliance, since, as shown by Kukalová et al. [
25], the recovery rate of unpaid taxes in insolvency proceedings can be remarkably low. Since the two research topics are related, to a certain extent, knowledge drawn from the bankruptcy prediction field is also applicable in the field of tax arrears prediction.
Given the above, studies where tax arrears have been used as independent variables in predicting bankruptcy might be relevant. For example, Lukason and Andresson [
10] compared the performance of tax arrears and financial ratios in bankruptcy prediction, and they found that tax arrears were in fact better predictors of bankruptcy. They also noted that payment defaults can be a vital substitute for financial ratios in cases where annual reports are not available. The independent variables based on tax arrears used in their study (maximum and median of tax arrears, number of month-ends with tax arrears and length of the longest sequence of month-ends with tax arrears) were also included among the initial independent variables considered in this paper.
In their study, Kubicová and Faltus [
26] also tried to use tax obligations for bankruptcy prediction. Their approach was, however, quite different and experimental in nature, as they used ratios which had financial statement items related to income tax (e.g., total income tax, deferred income tax) in the numerator and own capital, sales and total assets in the denominator. They concluded, however, that such ratios are not suitable for predicting company defaults, at least not for Slovakian companies, which were the object of their study.
As this paper uses previous tax arrears, studies concerning previous payment behavior might also be relevant. In this regard, there have been a few studies where previous payment behavior has been used for predicting company defaults or credit risk. For example, Ciampi et al. [
27] used the numbers and values of more than 60 days past due and/or overdrawn exposures of bank loans, along with financial ratios, for bankruptcy prediction. Karan et al. [
28] used independent variables such as the proportion of invoices paid late among all invoices, sum of days paid before deadline, total debt/total purchases and average amount paid, among other independent variables, for predicting the credit risk that retailers pose for a wholesaler. Finally, Back [
29] used, inter alia, independent variables such as numbers of payment disturbances and delays for predicting the financial difficulties of firms. The study by Back [
29] also indicates that firms with current payment defaults are likely to have previous payment defaults as well. The latter could be due to the poor financial management practices inherent to many SMEs [
30], while a more general financial explanation could be that these firms are inefficient, struggling with constant liquidity problems and, more broadly, for survival [
31,
32]. Thus, the findings of previous studies in the financial domain would lend support to the idea proposed in this paper, that the presence or absence of tax arrears in the past could possess value in predicting the same phenomenon in the future.
When choosing the machine learning methods to be used in this paper, methods that have previously been applied in the related area of bankruptcy prediction were considered. According to Veganzones and Severin [
33], these methods can be divided into three categories: traditional statistical methods, machine learning methods and ensemble methods (i.e., combinations of several methods), although some researchers (e.g., [
34]) place hazard models and neural networks in separate categories. When comparing recent trends, Veganzones and Severin [
33] found that among bankruptcy prediction articles published in 2008–2017, only 13% use traditional statistical methods, while 36% use machine learning methods and 51% use ensemble methods. As noted by Domingos [
35], composing ensemble models has become a standard practice in machine learning, as they often provide better results than single models. A possible explanation for the high proportion of studies using ensemble models in bankruptcy prediction could also be that this field of research has already been thoroughly studied with a wide variety of standalone methods, which could be why researchers are now trying to increase the predictive performance by combining the models in different ways.
Based on a review by Shi and Li [
36] of articles published in 1968–2017, traditional statistical methods used in bankruptcy prediction include logit (logistic regression) and probit, multivariate discriminant analysis (MDA) and hazard models. Among the machine learning methods, they identify neural networks, support vector machines (SVM), decision trees, genetic algorithms, fuzzy sets and rough sets as methods that had been applied in the bankruptcy prediction literature before 2007, while methods such as random forest, AdaBoost, particle swarm optimization, naïve Bayes and k-nearest neighbors (KNN) appeared only after 2007. According to du Jardin [
37], ensemble techniques widely used in bankruptcy prediction include bagging, boosting, rotation forest, Decorate and random subspace.
In general, it has been found that machine learning methods have higher accuracy in bankruptcy prediction than traditional statistical methods [
13]. For example, Alaka et al. [
11] found that across bankruptcy prediction articles published in 2010–2015, the average accuracy levels of the most widely used machine learning methods (neural networks, SVM and decision tree) were all higher than those of the most widely used statistical methods (logit and MDA), with the average accuracy of neural networks being the highest, followed by SVM. The disadvantage of statistical methods is that they are subject to some restrictive assumptions. For example, MDA assumes variables to be normally distributed and have equal covariance matrices, logistic regression assumes the absence of multicollinearity between independent variables [
38], and probit assumes cumulative normal distribution [
39]. On the other hand, machine learning methods have the advantage that they can deal with non-linear distributions and do not have stringent assumptions on the data [
33]. For the reasons above, traditional statistical methods were not used in this paper.
As regards the predictive performance of machine learning methods, there is no consensus on which one of them performs best in bankruptcy prediction [
13], since no method performs consistently better than all others across different datasets [
40]. Therefore, it was not possible to choose the methods for this paper based on which methods have been established as best-performing in bankruptcy prediction.
The methods chosen for this paper included decision tree (DT), k-nearest neighbors (KNN), multilayer perceptron (MLP) and random forest (RF), where DT and KNN are conventional machine learning methods, MLP is a neural network method and RF is an ensemble method. While DT and neural networks (along with SVM) rank among the three most widely used machine learning methods in bankruptcy prediction [
11], RF and KNN have appeared in the literature of this research field only recently [
36]. SVM was not used in this paper, since, according to the scikit-learn documentation for SVM, the time complexity of the algorithm makes its application impractical for sample sizes exceeding a few tens of thousands of observations. All methods chosen for this paper can handle data in which classes are not linearly separable, which is also the case with tax arrears.
3. Data, Variables and Methodology
3.1. Data
The dataset used for this paper included monthly amounts of tax arrears of Estonian SMEs during the period 2011–2018. The original dataset, which was obtained from the Estonian Tax and Customs Board, contained tax arrears of 419,210 legal entities at different monthly reporting dates, as well as the end-of-month figures. However, in this paper, only the end-of-month figures were used, partly because figures for other dates were not available for all the years. In addition, using only end-of-month figures allowed us to disregard cases of less economic importance, where taxes were paid just a few days late. In Estonia, tax arrears lasting only a few days are very common, and they tend to stem from technical causes or negligence rather than reflect a temporary liquidity crisis [
10].
In order to increase the homogeneity of the data, only SMEs (by the European Union’s definition) that were going concerns and had at least a minimal level of economic activity (being VAT liable with at least 16,000 EUR annual turnover) at the time their tax arrears were recorded were included. Companies in bankruptcy or liquidation or that had ceased their activities were removed starting from the date of their bankruptcy or liquidation notice or deletion date. Finally, data on public entities and NGOs were also left out.
After additionally removing a few outliers with tax arrears exceeding 2.5 million EUR, the dataset contained a total of 49,156 companies. The data for each company could also include non-full years, as the cut-off dates (VAT registration start and end dates, as well as bankruptcy, liquidation and deletion dates) could be any dates within a year. All consecutive 13-month periods for each company, i.e., 12 months for the independent variables and the last month for the dependent variable, were then collected into the final dataset using a moving window approach.
The resulting dataset contained 2,078,408 company-13 month period pairs, with each company being included in the dataset on average 42 times (i.e., on average, 3.5 years of data were available for each company). The advantage of the moving window approach was that it allowed us to capture the dynamics of tax arrears in the 12 months preceding the prediction while using a large amount of data. Based on Lukason and Andresson [
10], it could be assumed that long-horizon prediction (i.e., spanning several years) of future tax arrears with previous tax arrears is not applicable. The latter argument is proven in the empirical section as well.
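The moving window approach described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the example series are hypothetical, and the sketch assumes that each company's end-of-month amounts are stored in chronological order without gaps.

```python
def moving_windows(series, window=13):
    """Yield (history, target) pairs from one company's monthly amounts.

    Assumption: `series` is chronologically ordered and has no missing months.
    `history` holds the first 12 months of each window, `target` the 13th.
    """
    for start in range(len(series) - window + 1):
        chunk = series[start:start + window]
        yield chunk[:-1], chunk[-1]


# Hypothetical company with 15 months of end-of-month tax arrears (EUR).
amounts = [0, 0, 150, 0, 0, 0, 0, 300, 300, 0, 0, 0, 0, 0, 120]
pairs = list(moving_windows(amounts))
```

A company with n months of data contributes n - 12 observations, so a series shorter than 13 months yields none.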
A specific characteristic of the dataset was the sparsity of data, with an overwhelming proportion (91.82%) of the monthly tax arrears being zero. This shows that most of the time, most companies do not have tax arrears, most likely because they try to avoid owing money to the government and pay their taxes in time. It could also be observed that tax arrears have a tendency to persist. Namely, the larger the number of month-ends with tax arrears during the preceding 12 months, the more likely a company was to have tax arrears also in the next month (see
Figure 1 and
Figure 2). At the same time, the proportion of companies having a certain number of previous month-ends with tax arrears decreased with each additional month with tax arrears. Thus, on the one extreme, there were companies with no previous tax arrears, making up as much as 76.9% of the observations, for which the probability of having tax arrears in the next month was only 0.8%. On the other extreme, there were companies with tax arrears in all 12 preceding months, which made up only 1.8% of the dataset, but for which the probability of tax arrears in the next month was 93.0%. The rest of the observations lay in between, with the proportion of the respective observations in the dataset decreasing and the probability of tax arrears in the next month increasing with each additional month with previous tax arrears.
While the large proportion of zero values can be explained by companies trying to avoid indebtedness towards tax authorities, for any machine learning method, learning from mostly zero-valued data is a challenging task that is unlikely to yield good results. The reason for this is that the variation between observations may become dominated by noise [
41].
The approach used to reduce data sparsity was to only consider observations with tax arrears in at least two of the 12 months preceding the prediction (15% of the training data) for building the machine learning models, while the rest of the observations (85% of the training data), where the probability of tax arrears in the next month was very low (1.33%), were always predicted not to have tax arrears in the next month. Dividing the dataset into two parts in this way achieved a considerable reduction in zero values (from 91.82% to 48.84%) in the part of the data used for the models. In addition, this part of the data was indeed economically the most interesting, as a company could be expected to be much more likely to have tax arrears in the next month if it had already incurred tax arrears at least twice during the 12 preceding months.
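The division of observations into a modeled part and a rule-based part might be sketched as below. All names are hypothetical. The sketch assumes that a month counts as having tax arrears when its amount is above a `threshold` of zero, which the text does not specify, and `model_predict` stands in for any trained classifier.

```python
def months_with_arrears(history, threshold=0.0):
    """Count the months whose amount exceeds `threshold` (assumed to be 0)."""
    return sum(1 for amount in history if amount > threshold)


def route(history, min_months=2):
    """Return "model" for observations with at least `min_months` months
    with tax arrears, and "rule" for all other observations."""
    return "model" if months_with_arrears(history) >= min_months else "rule"


def predict(history, model_predict):
    """Predict tax arrears in the next month (1) or their absence (0)."""
    if route(history) == "rule":
        return 0  # always predicted not to have tax arrears
    return model_predict(history)
```

Only observations routed to "model" would be used for training and evaluating the classifiers; the remaining ones never reach a model.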
Twelve-month periods starting in any month in 2011–2016 (with dependent variable in 2012–2017) were used as the training set and those starting in 2017 (with dependent variable in 2018) were used as the test set. Using different periods for training and testing ensured independence of the training and test data. Training and test set sizes, including their parts containing observations with zero, one and more than one months with tax arrears, along with the percentage of observations with tax arrears in the next month, are presented in
Table 2.
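The period-based split can be sketched as follows, assuming (hypothetically) that each observation records the year and month in which its 12-month period starts under a `"start"` key. The helper names are illustrative only.

```python
def target_month(start):
    """Month of the dependent variable: 12 months after the period start."""
    year, month = start
    return (year + 1, month)


def split_by_start_year(windows, last_train_year=2016):
    """Split observations into training and test sets by period start year.

    Assumption: each window is a dict with a "start" key holding (year, month).
    """
    train, test = [], []
    for window in windows:
        if window["start"][0] <= last_train_year:
            train.append(window)
        else:
            test.append(window)
    return train, test


# Hypothetical observations on either side of the cut-off.
windows = [{"start": (2016, 12)}, {"start": (2017, 1)}]
train, test = split_by_start_year(windows)
```

With this rule, a period starting in December 2016 has its dependent variable in December 2017 and still belongs to the training set, while every test observation has its dependent variable in 2018.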
In classification, a dataset is imbalanced if the number of observations in one class exceeds the number of observations in the other class [
42]. As shown in
Table 2, the dataset as a whole was heavily imbalanced, with only 8.16% of observations in the training set having tax arrears in the next month. However, the part of the dataset containing observations with 2–12 months with tax arrears, which was the only one for which machine learning models were to be built, was almost perfectly balanced, with 46.48% of observations in the training set having tax arrears in the next month. Machine learning models tend to perform better if they are trained on balanced datasets [
42], where the sizes of the classes are equal. For balancing the dataset used for building the models, undersampling was used by randomly removing 17,844 observations without tax arrears in the next month from the part of the training set containing observations with 2–12 months with tax arrears. No balancing was performed for the test set, as this would have resulted in biased estimates on how well the models would perform on new real-life data.
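Random undersampling of the majority class can be sketched as below. This is a minimal illustration under stated assumptions: observations are dicts with a hypothetical label key `"y"`, and the fixed seed is only there to make the sketch reproducible.

```python
import random


def undersample(observations, label_key="y", seed=0):
    """Randomly drop majority-class observations until both classes match.

    Assumption: each observation is a dict whose `label_key` entry is 0 or 1.
    """
    rng = random.Random(seed)
    positives = [o for o in observations if o[label_key] == 1]
    negatives = [o for o in observations if o[label_key] == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    kept = rng.sample(majority, len(minority))  # sampling without replacement
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


# Hypothetical training part: 6 observations without and 4 with tax arrears.
data = [{"y": 0} for _ in range(6)] + [{"y": 1} for _ in range(4)]
balanced = undersample(data)
```

As in the text, such balancing would be applied to the training data only, so that test results still reflect the class proportions of real data.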
3.2. Dependent Variable
The dependent variable used in the machine learning models was a dummy variable, “tax arrears next month”, the value of which was “1” for observations with tax arrears in the next month and “0” for observations without tax arrears in the next month. Due to the low economic significance of tax arrears below 100 EUR, a company was only considered to have tax arrears in the next month if its tax arrears in the next month were at least 100 EUR. This is in line with §14(5) of the Estonian Taxation Act, according to which tax authorities are required to issue a certificate concerning the absence of tax arrears if tax arrears of the person requesting such a certificate are below 100 EUR.
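The definition of the dependent variable amounts to a simple threshold rule, sketched here with hypothetical names:

```python
MIN_ARREARS_EUR = 100  # threshold stated in the text


def tax_arrears_next_month(amount_eur):
    """Return 1 if next month's tax arrears are at least 100 EUR, else 0."""
    return 1 if amount_eur >= MIN_ARREARS_EUR else 0
```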
3.3. Independent Variables
Three types of independent variables were considered in this paper: 12 monthly amounts of tax arrears without aggregation (M12) and with aggregation of amounts in earlier months into period means (M5), and counts of events and statistical measures (STATS) (see
Table 3). As regards the notation of the months in
Table 3 and elsewhere in this paper, month 1 is the earliest month of the period, and month 12 is the last month of the period (i.e., the month preceding the month for which predictions were made). The reason for considering other types of independent variables besides the 12 monthly figures was that it seemed uncertain whether predictive models would perform well with 48.84% of the independent variable values being zero. The added types of independent variables contained a much lower proportion of zero values, and also helped to capture different aspects of the dynamics of tax arrears during the 12 months preceding the prediction.
The M12 type of independent variables (see
Table 3) were just the 12 monthly amounts of tax arrears. The STATS type of independent variables contained counts of events (months with or without tax arrears) and statistical measures, which were included under a single type of variable, because otherwise both would have only had two variables in the final models. Four of the STATS variables considered in this paper (“d max”, “d med”, “d m in debt” and “d longest”) correspond to independent variables that have previously been used successfully in bankruptcy prediction by Lukason and Andresson [
10]. In their research, Lukason and Andresson [
10] found that tax arrears were in fact better predictors of bankruptcy than financial ratios.
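The four STATS variables named above could be computed from a 12-month history along the following lines. This is a sketch based on the verbal descriptions (maximum, median, number of month-ends with tax arrears and longest sequence of such month-ends); the exact definitions in Table 3 may differ, and the zero `threshold` for counting a month as being in debt is an assumption.

```python
from statistics import median


def stats_features(history, threshold=0.0):
    """Compute four STATS-type variables from 12 monthly amounts.

    Assumption: a month is "in debt" when its amount exceeds `threshold`.
    """
    in_debt = [amount > threshold for amount in history]
    longest = current = 0
    for flag in in_debt:
        current = current + 1 if flag else 0  # reset the run on a debt-free month
        longest = max(longest, current)
    return {
        "d max": max(history),
        "d med": median(history),
        "d m in debt": sum(in_debt),
        "d longest": longest,
    }


# Hypothetical history with arrears in months 3, 8 and 9.
features = stats_features([0, 0, 150, 0, 0, 0, 0, 300, 300, 0, 0, 0])
```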
The motivation for using the M5 type of independent variables was the observation that, in a decision tree model, the Gini importances of all but the last four of the 12 monthly tax arrears amounts were very low (below 1%) (see “Gini before aggregation” in
Table 4). In essence, Gini importances show the relative importance of each independent variable compared to other independent variables in making the decisions about the best splits in a decision tree model. Aggregating amounts in earlier months into period means allowed us to increase the Gini importances of the resulting independent variables, and was therefore expected to also increase the performance of the models.
In order to decide which months to aggregate, all possible combinations for aggregating the first ten monthly amounts into period means were explored, with the restriction that, as a result of the aggregation, all Gini importances were to be above 1%. The best possible choice for aggregation, chosen based on the accuracy of the resulting decision tree model (81.18%), was to aggregate months 1–5 and 6–9 into period means, and not to aggregate the last three months (see “Gini after aggregation” in
Table 4). A decision tree has previously been used as a variable selection method, for example, by Cho et al. [
43], who also used it as a preliminary technique to select independent variables that were subsequently used for building models with other machine learning methods. In their study, decision tree outperformed stepwise logistic regression as an independent variable selection tool.
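The chosen aggregation (means of months 1–5 and 6–9, with months 10–12 kept as they are) can be sketched as a simple transformation of the 12 monthly amounts. The function name and the example history are hypothetical.

```python
def to_m5(history):
    """Turn 12 monthly amounts into the five M5-type variables.

    Months 1-5 and 6-9 are replaced by their period means;
    months 10, 11 and 12 are kept without aggregation.
    """
    if len(history) != 12:
        raise ValueError("expected exactly 12 monthly amounts")
    early = history[0:5]   # months 1-5
    middle = history[5:9]  # months 6-9
    return [sum(early) / 5, sum(middle) / 4,
            history[9], history[10], history[11]]


# Hypothetical history: arrears in months 1, 8, 9 and 11.
m5 = to_m5([100, 0, 0, 0, 0, 0, 0, 200, 200, 0, 50, 0])
```

The period means are non-zero whenever any month in the period had tax arrears, which is one way the aggregated variables end up with fewer zero values than the raw monthly amounts.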
In deciding whether to leave any of the initially selected independent variables out from the final models, their descriptive statistics (
Appendix A Table A1), correlation matrix (
Appendix A Table A2) and univariate prediction accuracies (
Appendix A Table A3) were considered. The latter were obtained by training univariate models for each independent variable with all four machine learning methods that were later also used for training the multivariate models. The parameters used in the univariate models that differ from the default parameters are given in
Appendix A Table A4. Univariate prediction accuracies and correlations were not calculated for the M12 type of independent variables, since leaving out any of the 12 monthly amounts would have jeopardized the integrity of the time series.
All univariate prediction accuracies were satisfactory (above 60%) (see
Appendix A Table A3), indicating that all independent variables that were initially considered could have been useful predictors of tax arrears in the next month. However, due to high correlations (see
Appendix A Table A2), some of the independent variables were left out of the final models (see
Table 3). Namely, among the STATS variables, “d m in debt” and “d longest” were left out due to high correlations with other count-of-events variables, “d max” due to high correlations with other statistical measures, and “d mean” due to its high correlation with “d med”. The independent variables that were left out due to high correlation had lower univariate prediction performance (see
Appendix A Table A3) than the independent variables that they correlated with. Since the M5 type of variables essentially constituted a time series, and autocorrelation is a typical property of time series data, none of the M5 type of variables were excluded due to high correlation. For all independent variables included in the final models, the distribution properties of the classes were different (see descriptive statistics in
Appendix A Table A1), which confirmed that they could be useful predictors of tax arrears in the next month.
In order to ensure the best possible performance of the models, the independent variables used in KNN and MLP models needed to be on similar scales. For KNN, this requirement was due to distance calculations performed in order to find the closest neighbors. For MLP, rescaling was necessary because having independent variables on different scales makes it harder for the algorithm to learn appropriate weights.
The rescaling method used for KNN was the signed natural logarithm, as defined in [
44]:
The latter was applied to all independent variables that were expressed in euros (i.e., all except counts of events, where the variances were small). The reason for using the signed natural logarithm instead of the natural logarithm was that, in the case of M5 and M12 types of independent variables, some values were non-positive.
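For illustration, a signed logarithm can be sketched as below. The form sign(x)·ln(1 + |x|) is an assumption made here because it is a common way to handle zero and negative amounts; the exact definition given in [44] may differ. The function name `signed_log` is hypothetical.

```python
import math


def signed_log(x):
    """Assumed form: sign(x) * ln(1 + |x|). Maps 0 to 0 and keeps the sign of x."""
    return math.copysign(math.log1p(abs(x)), x)


# Hypothetical amounts in euros, including a zero and a negative value.
amounts = [0.0, 250.0, -250.0, 12000.0]
rescaled = [signed_log(a) for a in amounts]
```

Under this assumed form, large amounts are compressed strongly while the ordering and the sign of the values are preserved, which is what distance-based methods such as KNN need.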
For MLP, rescaling was done by first using the signed natural logarithm in the same way as for KNN, and then applying a widely used standardization method that consists in subtracting the mean and dividing by the standard deviation and is sometimes called the z-score (see, for example, [45]):
$$z = \frac{x - \bar{x}}{s}$$
where $\bar{x}$ is the mean and $s$ is the standard deviation of the independent variable in the training set. In the case of M5 and M12 types of independent variables, the mean and standard deviation of all variables belonging to the respective type of independent variables were used instead of standardizing each variable separately.
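The pooled standardization for time-series-like variables can be sketched as follows. The helper names are hypothetical, the sample rows are invented, and the sample standard deviation is an assumption, since the text does not state whether the sample or the population form was used. The sketch covers only the z-score step, not the preceding logarithmic rescaling.

```python
from statistics import mean, stdev


def fit_pooled_scaler(train_rows):
    """Return (mean, sd) computed over every value of every variable in the group."""
    pooled = [value for row in train_rows for value in row]
    return mean(pooled), stdev(pooled)  # sample standard deviation (assumption)


def standardize(rows, mu, sd):
    """Apply z = (x - mean) / sd with the statistics fitted on the training set."""
    return [[(value - mu) / sd for value in row] for row in rows]


# Hypothetical training rows, one row per observation.
train = [[0.0, 2.0, 4.0], [6.0, 8.0, 10.0]]
mu, sd = fit_pooled_scaler(train)
scaled = standardize(train, mu, sd)
```

Using one mean and one standard deviation for the whole group keeps the relative differences between months intact, which separate per-variable standardization would remove.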
The Anderson–Darling test was used to check the normality of distributions of the independent variables. Results showed that none of the independent variables were normally distributed. Then, a two-sample Kolmogorov–Smirnov test was used to check the statistical significance of the independent variables in discriminating between the classes. The advantage of this test is that, unlike many other statistical tests, it does not require data to be normally distributed. The test results showed that all independent variables were significant at the 1% significance level.
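To make the idea behind the two-sample Kolmogorov–Smirnov test concrete, the sketch below computes only its test statistic, the largest gap between the empirical distribution functions of the two classes. It does not compute a p-value, the function name and sample values are hypothetical, and in practice a statistics library would normally be used instead.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in points
    )


# Hypothetical values of one independent variable, split by next-month class.
no_arrears = [0.0, 0.0, 1.2, 2.5, 3.1]
arrears = [4.0, 5.5, 6.1, 7.3, 8.8]
d_stat = ks_statistic(no_arrears, arrears)
```

A statistic near 1 means the two classes barely overlap on that variable, while a value near 0 means the distributions are almost identical.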
3.4. Methodology
As described in Section 3.1, the approach used in this paper for reducing the overwhelming proportion of zero values among the monthly tax arrears was to build machine learning models only for observations which had tax arrears in at least two of the 12 preceding months, while the rest of the observations were always predicted not to have tax arrears in the next month. This was justified because, among observations with previous tax arrears in zero or one months, the probability of tax arrears in the next month was very low (0.77% and 6.69%, respectively), and because, with all or nearly all monthly figures being zero, they were difficult to predict (see also Figure 1 and Figure 2).
In the final model, predictions made by the best-performing machine learning model were combined with the predictions made for observations with fewer than two months with tax arrears. More specifically, each test set observation with tax arrears in at least two of the 12 months preceding the prediction was classified by the best machine learning model, while each observation with tax arrears in fewer than two of those months was predicted not to have tax arrears in the next month. This way, predictions were obtained for all test data, regardless of the number of months with previous tax arrears, which made the results comparable to previous studies.
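The two-part decision rule can be sketched as follows. The function `predict_final` and the stand-in classifier are hypothetical, and counting a month as having tax arrears when its amount is strictly positive is an assumption of this sketch.

```python
def predict_final(monthly, ml_predict):
    """Combine the fixed rule with a trained classifier.

    monthly: the 12 monthly amounts preceding the prediction month.
    ml_predict: callable returning 0 or 1, standing in for the best trained model.
    """
    months_in_arrears = sum(1 for amount in monthly if amount > 0)  # assumption
    if months_in_arrears < 2:
        return 0  # always predicted not to have tax arrears next month
    return ml_predict(monthly)


def always_arrears(monthly):
    """Stand-in for a trained model; a real model would use the features."""
    return 1


rule_only = predict_final([0] * 11 + [50], always_arrears)
model_used = predict_final([0] * 10 + [50, 60], always_arrears)
```

Because the rule handles the sparse observations, the classifier only ever sees observations with at least two months of arrears, both in training and at prediction time.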
For building the machine learning models, four widely used classification methods, which have also worked well in the related area of bankruptcy prediction, were used in this paper: decision tree (DT), random forest (RF), k-nearest neighbors (KNN) and multilayer perceptron (MLP).
Decision tree is a classification method where decision rules are learnt from the values of the independent variables. The trained model can be represented as a binary tree structure, where classification is based on the predominant class in the leaf node at the end of the decision path. In the learning process, at each iteration, the best split is chosen. In this paper, the criterion used for choosing the best split was Gini impurity, which for two classes is calculated as [45]:
$$G = 1 - p^2 - (1 - p)^2$$
where $p$ is the proportion of observations belonging to one class among total observations in the node.
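As an illustration of how this criterion ranks candidate splits, the sketch below computes the two-class Gini impurity in the standard form 1 − p² − (1 − p)² and the size-weighted impurity of a split. The function names and the label lists are hypothetical, and both child nodes are assumed to be non-empty.

```python
def gini_impurity(p):
    """Two-class Gini impurity for class proportion p."""
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def node_gini(labels):
    """Gini impurity of a node given its 0/1 labels."""
    return gini_impurity(sum(labels) / len(labels))


def split_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * node_gini(left) + len(right) / n * node_gini(right)


# Hypothetical candidate splits of the same six observations.
pure_split = split_gini([0, 0, 0], [1, 1, 1])
useless_split = split_gini([0, 1], [0, 1, 0, 1])
```

The split with the lower weighted impurity is preferred, so here the first candidate, which separates the classes completely, would be chosen.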
Other parameter values of the DT models are provided in
Appendix A Table A5. All three parameters are criteria for stopping the recursive splitting process of nodes (i.e., for pruning the tree). Setting stopping criteria helped to avoid overfitting. The DT algorithm used in scikit-learn is an optimized version of the CART algorithm.
Random forest is a classification method that consists in building a certain number of DT models, each time using only a randomly chosen part of the training data. Classification is then performed using the averaged results of all DT models. Since RF combines the results of a number of models, it is considered an ensemble model. Similar to DT models,
Gini impurity was also used in RF models as the criterion for choosing the best split. Other parameter values are provided in
Appendix A Table A5.
K-nearest neighbors is a classification method that maps training set observations in the multi-dimensional space and makes the prediction for each test set observation based on k training set observations that are closest to it. The values of k used for the models in this paper are given in
Appendix A Table A5. The distance measure used in all KNN models was Euclidean distance [45]:
$$\operatorname{dist}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$
where $d$ is the number of independent variables, and $x_i$ and $y_i$ are the values of the $i$-th independent variable in training set observations $x$ and $y$.
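A compact sketch of KNN classification with Euclidean distance is given below. Majority voting among the k nearest neighbors is an assumption of this sketch, and the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two observations of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_x, train_y, query, k):
    """Predict the majority class among the k training observations nearest to query."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical rescaled observations with two independent variables.
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
train_y = [0, 0, 1, 1, 1]
near_cluster_one = knn_predict(train_x, train_y, [5.5, 5.5], k=3)
near_cluster_zero = knn_predict(train_x, train_y, [0.0, 0.5], k=3)
```

Since every variable enters the distance with equal weight, a variable on a much larger scale would dominate the result, which is why the rescaling described in Section 3.3 matters for this method.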
Multilayer perceptron is a neural network method. The network consists of an input layer, a number of hidden layers and an output layer. Each layer contains a certain number of neurons. The learning process is split into several epochs, where the network parameters (weights of the edges between neurons in each layer and the bias term of each neuron) are learnt using the back-propagation mechanism. Within each epoch, data are handled in a number of batches. A loss function is used in optimizing the parameter values.
The MLP models used in this paper had three hidden layers, with 4, 4 and 2 neurons, respectively. The parameter values of the models are given in Appendix A Table A5. All MLP models used the Adam optimizer and binary cross-entropy as the loss measure. The ReLU (rectified linear unit) activation function was used in the hidden layers, and the sigmoid activation function in the output layer. The formulas of these functions, as provided in the Keras documentation, are:
$$\operatorname{relu}(x) = \max(x, 0), \qquad \operatorname{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
Using the sigmoid activation function meant that the output of the network was given in the form of probabilities. The probabilities were then converted into dependent variables with value “1” (tax arrears next month) if they exceeded 0.5, and with value “0” (no tax arrears next month) otherwise.
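The two activation functions and the conversion of probabilities into classes can be written out directly. This is a plain-Python illustration rather than the Keras implementation, and the function names are hypothetical.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(x, 0.0)


def sigmoid(x):
    """Logistic function: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def to_class(probability, threshold=0.5):
    """Return 1 (tax arrears next month) only if the probability exceeds the threshold."""
    return 1 if probability > threshold else 0


# Hypothetical raw outputs of the last layer before the sigmoid.
raw_outputs = [-2.0, 0.0, 1.5]
classes = [to_class(sigmoid(v)) for v in raw_outputs]
```

A raw output of exactly zero gives a probability of 0.5, which does not exceed the threshold and is therefore classified as no tax arrears.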
In order to compare the performance of different machine learning methods on different types of independent variables, separate models were built for each of the three types of independent variables described in
Section 3.3 using each of the four machine learning methods. The encoding of the model names, as well as independent variables included in each model, is given in
Table 5. The description of the independent variables is provided in
Table 3.
The performance of the models was measured based on accuracy. Accuracy is calculated as a ratio of correct predictions over all predictions [
46]. Additionally, the misclassification rates were calculated, showing the ratio of falsely classified observations among observations which actually had tax arrears in the next month (type I error), and among observations which actually did not have tax arrears in the next month (type II error).
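The three performance measures can be computed as in the sketch below, which follows the definitions used in this paper: type I error is the share of misclassified observations among those that actually had tax arrears, and type II error is the share among those that did not. The function name and the label vectors are hypothetical, and both classes are assumed to be present.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, type I error, type II error) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    with_arrears = [p for t, p in pairs if t == 1]
    without_arrears = [p for t, p in pairs if t == 0]
    type_i = sum(p == 0 for p in with_arrears) / len(with_arrears)
    type_ii = sum(p == 1 for p in without_arrears) / len(without_arrears)
    return accuracy, type_i, type_ii


# Hypothetical actual and predicted classes for ten observations.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
accuracy, type_i, type_ii = evaluate(y_true, y_pred)
```

Reporting the two error rates separately shows which class the model handles worse, something that accuracy alone hides when the classes differ in size.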
Cross-validation (1:10) on the training set was used for choosing the best parameters for each of the models. The criteria for choosing the model with the best parameters were the arithmetic means of accuracies on cross-validation test sets, as well as minimum overfitting and underfitting. Then, models with the best parameters were trained on the training set and tested on the test set.
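Assuming that the notation above refers to 10-fold cross-validation, the splitting of the training set can be sketched as follows. The function name is hypothetical, and the folds are taken as contiguous blocks without shuffling, which is a simplification.

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, validation_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, validation
        start += size


# Hypothetical training set of 25 observations.
folds = list(k_fold_indices(25, k=10))
validation_sizes = [len(validation) for _, validation in folds]
```

Each candidate parameter setting would be trained on the training part of every fold and scored on the held-out part, and the mean of the ten accuracies would then be compared across settings.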
The models were built in Python. The DT, RF and KNN models used the DecisionTreeClassifier, RandomForestClassifier and KNeighborsClassifier classes of the scikit-learn library, respectively. For building the MLP models, the Keras neural networks library was used.
5. Conclusions
The aim of this paper was to explore which machine learning methods and types of independent variables are most useful in predicting whether a company will have tax arrears in the next month, given the time series of its tax arrears in the preceding 12 months. The data were the monthly tax arrears of Estonian SMEs in 2011–2018.
A specific characteristic of tax arrears is that they are rare events, showing that companies usually pay their taxes on time. Since learning from mostly zero-valued data is a difficult task for machine learning models, the approach in this paper was to build those models only for observations which had tax arrears in at least two of the 12 preceding months, while the rest of the observations were always predicted not to have tax arrears in the next month. The approach was justified because, among observations with previous tax arrears in fewer than two months (85% of the data), the probability of tax arrears in the next month was very low (1.33%) and, due to all or nearly all monthly figures being zero, they were difficult to predict. This approach succeeded in reducing zero values in the dataset from 91.82% to 48.84% and resulted in a nearly balanced dataset.
The machine learning methods used were decision tree (DT), random forest (RF), k-nearest neighbors (KNN) and multilayer perceptron (MLP). With each of these methods, models were built using three alternative types of independent variables: the 12 monthly amounts of tax arrears; statistical measures and counts of events; and monthly amounts with aggregation of months 1–5 and 6–9 into period means.
The final decision support system consists of two parts. First, for firms with at least two months of tax arrears during a twelve-month period, the best method was random forest trained on monthly tax arrears with aggregation of months 1–5 and 6–9 into period means (accuracy 84.46%), where the months to aggregate were chosen based on the Gini importances of the 12 monthly amounts. Second, observations with fewer than two months with previous tax arrears were all simply predicted not to have tax arrears in the next month, which yielded 98.88% accuracy. Thus, the accuracy of the final model was 95.28%, which could be considered excellent. The model was better at correctly predicting a company not to have tax arrears in the next month: the percentage of false classifications among observations without tax arrears in the next month was only 2.5% (type II error), while among observations with tax arrears in the next month, it was 19.8% (type I error).
This paper represents the first attempt to predict corporate tax arrears based on the historical monthly time series of previous tax arrears. While there have been a handful of studies where tax arrears have been predicted based on financial ratios or annual tax arrears among other independent variables, using data with monthly instead of annual frequency has much higher practical value. This is because in carrying out their daily activities, tax authorities would greatly benefit from being able to detect companies likely to incur tax arrears not only once a year and not only for the entire next year, but at any time and for the more immediate future, using the most recent information available.
This paper has high practical value, since the proposed approach could enable tax authorities to better target their tax audits to companies that are likely to default on their corporate tax obligations, and better focus preventive measures aimed at ensuring the timely payment of taxes. In addition, it is important to note that the very low type II error will result in only a small number of good firms being groundlessly targeted, thus ensuring that administrative resources are well employed to deal with likely debtors. The main limitation of the paper is that tax systems, the collection of taxes and punitive measures for not paying taxes differ among countries; thus, the results might not be fully applicable in other environments. Nevertheless, since Estonia holds 12th place in the World Bank’s [
48] ranking on the ease of paying taxes, and according to the last wave of the World Value Survey [
49], in Estonia, the propensity to cheat on taxes is lower than the world average, the results could be reasonably transferable to many countries. Therefore, a practical guideline for tax authorities that is free of the latter limitation would be that: (a) occasional tax arrears lasting for a short period of time are usually not a sign of increased risk; (b) several consecutive months of tax arrears demand the relevant authority’s attention, while the exact response depends on the specific country’s circumstances; (c) when predicting tax arrears, different machine learning tools seem not to have remarkable advantages over each other; (d) concerning variables, a simple approach of accounting for the presence of tax arrears in each of the most recent months, combined with a more aggregated treatment of earlier months, could be the best choice.
For future research, we first suggest implementing additional variables from different domains to further enhance the already high prediction accuracy of the current decision support system. For instance, we believe that real-time information possessed by tax authorities in different countries, e.g., about firms’ cooperation partners and members of management, structure of paid taxes or even transactional data from bank accounts, could be put into use for solving the respective classification task. Second, future research could explore the possibilities for building multiannual models, or separate models for each company, or combining patterns discovered for each company with general patterns applicable to all companies. Third, it might also be useful to develop models that predict the probability of tax arrears in the next month. Such models could then be developed further to predict the amount of tax arrears in the next month for cases where the probability of tax arrears in the next month exceeds a certain threshold. Finally, future research could explore the possibilities for predicting the occurrence of tax arrears for different numbers of months ahead, instead of predicting it only for the next month.