1. Introduction
Complications of preterm birth (PTB), defined as birth before the 37th week of pregnancy, are the most common cause of mortality among children aged 5 years or younger [1,2,3]. In addition, PTB has been shown to be a critical factor in the survival of newborns [2]. Preterm-born babies present a major challenge to medical care, which must support their not yet fully developed vital organs [4]. Research aimed at understanding and preventing the causes of PTB has become increasingly common, especially with the recent emergence of progressively more trustworthy and more complex government-owned datasets. Coming closer to this goal could mean finding ways of preventing PTB, or at least of anticipating it, so that assistance reaches the mother in time, possibly reducing the number of lives lost.
The work presented in [5] shows that PTB’s etiology is multi-factorial and that the risk of PTB could be associated with the socioeconomic situation of a given region (neighbourhood socioeconomic status or neighbourhood SES). Neighbourhood SES is an area-level measurement which aggregates SES factors (such as income, education and employment) at a certain geographic level [6]. Research shows that PTB rates are higher in areas with low SES than in areas with high SES [7].
Numerous machine learning techniques have previously been applied to the problem of PTB prediction or stratification, including SVMs [8], neural networks [9,10,11] and decision trees [12,13]. However, the most commonly applied techniques are logistic regression and linear regression, employed in the analysis and prediction of PTB from various factors: poverty [14], the pregnant mother’s working conditions [15,16], general social factors [17,18] and, mainly, clinical and hereditary factors [19,20,21,22]. There is a vast literature associating different factors with preterm birth using traditional statistical methods [23,24,25], including associations with social factors [26,27].
One way of understanding the risk associated with the many SES factors in the occurrence of PTB is through data clustering. Clustering is a family of unsupervised machine learning techniques that seek to group elements together without any prior knowledge of the data themselves. To do so, clustering techniques use distance measures to judge how close or similar two points are to each other and whether or not they should be grouped in the same cluster. Clustering techniques have been used in scientific analyses for decades in many areas, such as psychology [28], genetics [29] and geophysics [30].
Clustering as a method for discovering the groups most vulnerable to PTB risk is less common than traditional statistical methods. However, its application can already be seen in some recent studies. In [31], spatial clustering shows a possible relation between living closer to landfills and PTB occurrences. The studies presented in [32,33] cluster hereditary and behavioural factors and associate them with PTB risk. In addition, the study in [34] investigates the geographical distribution of PTB risk in Paris by clustering at the level of “census blocks”.
Therefore, the main objective of this work is to stratify the risk of PTB in Brazil from SES factors in order to confirm or deny that PTB occurrence is indeed related to socioeconomic conditions, a common general finding of multiple previous works in this area. The stratification process is conducted through clustering analysis based on unsupervised machine learning techniques. The analysis was performed by combining three freely available datasets collected by the Federal Government of Brazil: Sistema de Informações sobre Nascidos Vivos (SINASC) [35], containing data regarding gestation, birth, newborns and mothers; Cadastro Único (CADU) [36], containing a wide range of socioeconomic data on Brazilian citizens at a personal and family level; and the population estimate as disclosed by Instituto Brasileiro de Geografia e Estatística (IBGE) [37]. A new dataset was generated from the combination of these datasets, and a new metric, the PTB Municipal Rate (PMR), was created. These two were used together in a clustering analysis, at municipal level, seeking to visualise the relation between SES factors and PTB risk. This article presents an analysis of some SES factors, associating them with each discovered cluster. In this way, the results presented in this work might contribute to the elaboration of more efficient and specialised policies for the Brazilian public health service.
2. State of the Art
The relationship between SES factors and the occurrence of PTB, as mentioned before, has been studied by many authors, mostly but not exclusively dealing with just one or two “dimensions” of SES (e.g., education and income) and with traditional statistical comparison methods rather than machine learning. For instance, this is the case in [16], where working conditions are examined together with preterm birth. Women with long working hours and those reportedly dissatisfied with their current work are shown to have a significantly higher risk of PTB in European countries. Women working excessively long hours (over 43 h/week) were found to have a preterm delivery odds ratio of 1.33 compared with the reference group (30–39 h/week), and women who had to work in a standing position for over 6 h had an odds ratio of 1.26 compared with the reference group (less than 2 h). These findings place working conditions and stressful situations among the possible non-biological factors influencing PTB, a view also strengthened by the results of [38], whose study observes the same relation between work, stress and PTB in Cypriot women.
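As a reminder of how an odds ratio such as the 1.33 above is read, the following minimal sketch computes one from a 2x2 table. The counts are hypothetical and were chosen only so that the result lands near 1.33; they are not taken from the cited study, and the function name is illustrative.

```python
def odds_ratio(exposed_cases, exposed_noncases, reference_cases, reference_noncases):
    """Odds of the outcome in the exposed group divided by the odds in the reference group."""
    exposed_odds = exposed_cases / exposed_noncases
    reference_odds = reference_cases / reference_noncases
    return exposed_odds / reference_odds


# Hypothetical counts (NOT from the cited study): preterm vs. term deliveries.
or_long_hours = odds_ratio(
    exposed_cases=40, exposed_noncases=360,      # e.g. women working over 43 h/week
    reference_cases=30, reference_noncases=360,  # e.g. women working 30-39 h/week
)
print(round(or_long_hours, 2))
```

An odds ratio of 1 means the two groups have the same odds of PTB; values above 1 indicate higher odds in the exposed group than in the reference group.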
The possibility that a region’s sanitation and housing conditions affect the timing of delivery is explored in related recent studies [26,39,40]. These studies present a lack of access to proper sanitation facilities as a possibly important factor in increasing PTB occurrence among Indian women. All studies obtained a statistically significant difference in the frequency of PTB outcomes when contrasting people with toilet access with people with no toilet access. Furthermore, the results of [26] also suggest that the harassment of girls and women (a stressful event) and excessive time spent fetching water (over 2 h/day, manual labour) increase the risk of PTB, with odds ratios of 1.26 and 1.33, respectively. The results of [39,40] also include analyses of education data, and, in both studies, women with higher levels of education appear to have a significantly lower risk of PTB.
Education as a social factor affecting the risk of PTB is also supported by the meta-analysis presented in [27]. The analysis covered groups of mothers from 12 distinct countries, collected in different years and using different education indicators. Its results indicate that mothers with low levels of education are more likely to experience PTB, with an increased risk of 48% and 84% under the two scoring methods used. This is a considerably large difference and a strong indication that education is an important aspect when exploring PTB factors. The same idea is supported by the study in [41], where more highly educated women in Lombardy are shown to have a 19% lower risk of experiencing PTB; a reduced risk was also observed when analysing foreign-born and local-born mothers separately.
The notion of relating all or most of these social factors at once, and of studying and treating them together as factors of social deprivation or social inequality, is applied to preterm birth in the study presented in [5]. The study merges these factors into an SES Neighbourhood feature and associates it with personal data from the patients. The results given by intra-cluster correlation indicate that SES neighbourhood-level circumstances are responsible for 5.72% of all variance in PTB. Although this is only a small portion of the total variance, it can have a considerable impact on model fine-tuning if one aims to develop a preterm predictor, and it provides a strong case for continued studies on socioeconomic factors and PTB.
Another study tackling the relationship between SES Neighbourhood and preterm birth was presented in [42], and its results also advocate for the importance of the socioeconomic environment to PTB. The main difference between this study and [5] is that it used income variation over time as a way of measuring the socioeconomic status of neighbourhoods. With this different method, it found that women living in areas with low socioeconomic indices, or in areas where socioeconomic levels are declining, have a higher risk of PTB occurrence. Stable Low-level areas (i.e., low-level areas that do not show progress in socioeconomic factors) had the highest odds ratio, 1.20, compared to Stable High-level areas.
A recent study [34] also uses SES Neighbourhood to investigate preterm birth across the block areas of the city of Paris using spatial clustering, and its results endorse the idea of SES as an influential factor in PTB. When using SES Neighbourhood as cluster detection variables, the clustering resulted in a final cluster division with a p-value of 0.06, but when adjusting for SES, removing it from the clustering, the p-value increased to 0.81, a much less significant number, indicating that SES Neighbourhood was responsible for a large portion of the explainable PTB variance.
As already noted, most of the works presented above, as well as most of the related literature not cited here, make use of traditional statistical methods, associating a selected range of features and verifying possible correlations. The last three works mentioned [5,34,42] go one step further and work with a merged value of many dimensions, but they still need a subjective human decision on how to unite these values into a meaningful feature. A few questions on PTB and SES remain unanswered or only partially answered: (1) If such a relationship exists and is significant, can high and low PTB areas be discovered through the clustering of SES factors? (2) Is this relationship intrinsic enough that it can be found automatically by a machine without any significant feature selection? (3) Is it possible to uncover the socioeconomically deprived areas most likely to suffer from high PTB numbers? (4) Which SES factors are most likely to differ considerably between regions with high and low PTB occurrence? Therefore, the current work contributes to the research area by attempting to answer these questions, fully or partially, using two distinct unsupervised learning methods to explore large Brazilian datasets of SES and birth data.
By combining k-Means and DBSCAN, two very different clustering algorithms, the first method also contributes a new approach to targeted cluster analysis. The algorithm first applies a free layer of k-Means clustering, whose results are then filtered by a target variable that was excluded from the initial clustering. The filtered results are passed to a final (decision) clustering step, which generalises and removes clusters to provide the final results. This method allows us to completely isolate PTB from the SES clustering while still finding significant clusters, without having to rely on traditional optimal-number-of-clusters techniques, which would ignore the external target variable.
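The two-level idea can be illustrated with a minimal sketch using scikit-learn. It is not the implementation used in this work: the function names, the synthetic data, the relative-deviation reading of the filter threshold and the choice to run DBSCAN directly on the k-Means centres are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans


def find_clusters_of_interest(X, target, k_values, threshold=0.10, seed=0):
    """First level: k-Means on SES features only, then filter by the held-out target."""
    overall = target.mean()
    centres, kinds = [], []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        for c in range(k):
            members = km.labels_ == c
            if not members.any():
                continue
            # Assumed filter rule: keep clusters whose mean target deviates
            # from the overall mean by at least `threshold` (relative).
            rel_diff = (target[members].mean() - overall) / overall
            if abs(rel_diff) >= threshold:
                centres.append(km.cluster_centers_[c])
                kinds.append("high" if rel_diff > 0 else "low")
    return np.array(centres), np.array(kinds)


def final_clusters(centres, eps=1.5, min_samples=3):
    """Second level: DBSCAN groups recurring clusters of interest; label -1 is noise."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centres)


# Synthetic stand-in data: two groups of "municipalities" with different
# SES features and different target rates (the target is never clustered on).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (100, 4)), rng.normal(3.0, 0.3, (100, 4))])
pmr = np.concatenate([rng.normal(0.08, 0.005, 100), rng.normal(0.12, 0.005, 100)])

centres, kinds = find_clusters_of_interest(X, pmr, k_values=range(2, 7))
labels = final_clusters(centres)
```

Because the target only enters through the filter, the first-level clusters are formed from SES features alone; recurrence across several values of k is what lets the second level separate stable groups from noise.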
4. Results
After the application of
,
and
, the
model generated a total of 1337 CoI. The number of detected CoIs grew according to the total numbers of clusters defined in the k-Means models,
, as it can be seen in
Figure 5. It also shows that the first cases of CoI appear when the
k-means input number of centres,
, equals 5, reaching about 90 CoIs for the highest number of
(27 to 30).
The correlation matrix,
, and its reordered version can be seen in items (a) and (b) of
Figure 6, respectively. It is possible to observe some cluster patterns from the distance-based reordering alone. In item (c) of
Figure 6, the final clusters for each sample are highlighted in different colours, allowing a visual comparison between the
output and the sample distance algorithm.
After applying the
preprocessing steps, the final clustering was performed by using
.
found seven final clusters, divided into four clusters with high PMR (PTB Municipal Rate) and three clusters with low PMR. Item (c) of Figure 6 shows that some rows of the correlation matrix were not assigned to any final cluster. In item (a) of Figure 7, it is possible to observe a stagnation or even a reduction in the identification of valid clusters for the highest input numbers of centres in comparison with median values. In item (b), the same clusters are shown, now separated not by high and low PMR but by individual final cluster.
The CoI’s PMR distribution for each final cluster was calculated. This distribution can be observed in
Figure 8, where each cluster is represented on the
x-axis, the distributions on the
y-axis and the national average PMR is indicated visually (approx.
). It is shown that almost all validated clusters have their centroid PMR varying from
to
in comparison to the national average, with the exception of Cluster 1, with a centroid PMR almost
units above the average.
The regional distribution of these clusters was also observed, that is, which municipalities belong to which cluster. Through data visualisation, it is possible to contextualise—as well as to validate—the discovered clusters. Since the input of the problem is social data, it was expected for at least some of the clusters to be located in socially similar concentrated areas. Three visualisations were generated to verify that.
The first visualisation, shown in Figure 9, is a binary plot generated using the type of cluster (high or low PMR). The number of times each municipality was classified into a validated high or low PMR cluster was counted, and each municipality was marked with the type it was most often classified as. White-coloured municipalities were never classified in a CoI.
The second visualisation, shown in Figure 10, was generated by subtracting the number of times a municipality was classified as low PMR from the number of times it was classified as high PMR, which gives a degree of intensity, or of belonging, of each municipality to each type of cluster.
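The counting behind these two maps can be sketched as follows. The membership records and municipality names are hypothetical, and the handling of ties is an assumption, since the text does not say how a municipality with equal high and low counts is drawn.

```python
from collections import Counter

# Hypothetical records: one (municipality, type) pair each time a
# municipality fell into a validated cluster of interest.
memberships = [
    ("A", "high"), ("A", "high"), ("A", "low"),
    ("B", "low"), ("B", "low"),
    ("C", "high"),
]
all_municipalities = ["A", "B", "C", "D"]

counts = Counter(memberships)


def summarise(municipality):
    """Return (majority type for the binary map, high-minus-low intensity)."""
    high = counts[(municipality, "high")]
    low = counts[(municipality, "low")]
    if high == low == 0:
        majority = None      # never classified: left white on the binary map
    elif high == low:
        majority = "tie"     # assumed tie rule, not specified in the text
    else:
        majority = "high" if high > low else "low"
    return majority, high - low


summary = {m: summarise(m) for m in all_municipalities}
```

The first element of each pair corresponds to the binary map and the second to the intensity map, where positive values lean towards high PMR and negative values towards low PMR.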
The third visualisation, seen in
Figure 11, reveals in which of the seven final clusters each municipality was mostly classified by the
models, making it possible to visualise the regional aspects of the clusters. Clusters 1, 2, 3, and 6 appear to be more concentrated in specific regions of the map, while 0, 4, and 5 have a more sparse distribution.
Looking at Figure 9 and Figure 10, it is possible to see a clear regional aspect not only for the individual clusters but also for the types of clusters, with High PMR clusters located mostly in the North and Northeast regions and Low PMR clusters in the Centre-South area. In the Northeast, High PMR clusters are concentrated in the state of Maranhão and across the São Francisco River valley. The most intense Low PMR clusters are seen in the state of São Paulo and in Southern Minas Gerais. The North region is almost entirely classified in clusters of High PMR and, as shown in Figure 11, the most frequently observed cluster in the region is Cluster 1, notably the one with the highest PMR.
In order to measure how the clusters differ from one another, t-tests were performed to obtain a p-value for each variable. Two additional sets, treated here as clusters, were created for comparison: N, containing all municipalities that were not grouped in any of the final seven clusters, and A, containing all municipalities, regardless of clustering.
The p-value was calculated for every variable and for every pair of clusters. The comparison of a cluster to itself was done by generating two random sub-samples of the cluster and testing them against each other. After every p-value was determined, the percentage of variables with a p-value above the 5% threshold for every pair of clusters was calculated and is shown in Figure 12. It is possible to observe how clusters are significantly distinct from each other through most variables. The closest similarity was observed between clusters 5 and 6 and between clusters 0 and 5 (only 37% and 50% of variables were significantly different, respectively).
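A minimal sketch of this pairwise comparison is given below, using SciPy. The cluster data are synthetic, the function names are illustrative, and the use of Welch's unequal-variance t-test is an assumption, since the text only states that t-tests were used. The function returns the share of variables below the threshold; its complement is the share above it.

```python
import itertools

import numpy as np
from scipy.stats import ttest_ind


def share_significant(a, b, alpha=0.05):
    """Fraction of variables (columns) whose means differ between two groups."""
    # Welch's t-test per column (assumed variant; not specified in the text).
    p_values = ttest_ind(a, b, axis=0, equal_var=False).pvalue
    return float(np.mean(p_values < alpha))


def self_comparison(a, rng, alpha=0.05):
    """Compare a cluster with itself by testing two random halves against each other."""
    idx = rng.permutation(len(a))
    half = len(a) // 2
    return share_significant(a[idx[:half]], a[idx[half:]], alpha)


# Synthetic clusters: 200 municipalities x 20 variables each.
rng = np.random.default_rng(0)
clusters = {
    "c0": rng.normal(0.0, 1.0, (200, 20)),
    "c1": rng.normal(1.0, 1.0, (200, 20)),
}

pairwise = {
    (m, n): share_significant(clusters[m], clusters[n])
    for m, n in itertools.combinations(clusters, 2)
}
same = {name: self_comparison(data, rng) for name, data in clusters.items()}
```

The self-comparison acts as a baseline: two halves of the same cluster should differ on only a small share of variables, roughly the false-positive rate implied by the threshold.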
In addition, in order to obtain a general view of how High PMR and Low PMR compare to each other on different aspects of SES, the features used for clustering were categorised into seven segments: Sanitation, Employment, Living Conditions, Education, Household Type, Race and Income. Then, a subset of the data was created for each segment, containing all municipalities, their assigned clusters and only the features of the respective segment. Dimensionality was reduced using t-SNE for visualisation purposes to create 2D maps of the subsets, and an SVM-RBF classifier was applied to the t-SNE maps to find the boundaries in the generated space that best separate High PMR and Low PMR clusters. The t-SNE outcome and the boundaries can be seen in Figure 13. The first (upper) plot for each segment contains only the Low PMR cluster points; the second (lower) also shows the High PMR cluster points and the separation boundaries. It is possible to see, even without reaching the most easily understandable feature-level view, how High and Low PMR clusters follow distinct patterns SES-wise. Some segments, such as Sanitation and Living Conditions, show High PMR cluster points as very well grouped, and when comparing upper and lower plots, it is almost as if the High PMR points filled an empty space in the lower plot. Others, such as Education and Income, show High and Low PMR cluster points being more mixed up but with High PMR points centred in a smaller area.
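The per-segment procedure can be sketched with scikit-learn as below. The data are synthetic and the hyperparameters (perplexity, kernel width) are illustrative assumptions rather than the settings used for Figure 13.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical segment subset: 5 features for 60 Low PMR and 60 High PMR municipalities.
features = np.vstack([rng.normal(0.0, 1.0, (60, 5)), rng.normal(6.0, 1.0, (60, 5))])
is_high_pmr = np.array([0] * 60 + [1] * 60)

# 2D map of the segment, used for visualisation only.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

# RBF-kernel SVM fitted on the 2D map; its decision function gives the boundary to draw.
svm = SVC(kernel="rbf", gamma="scale").fit(embedding, is_high_pmr)
train_accuracy = svm.score(embedding, is_high_pmr)
```

Since t-SNE does not preserve global distances, a boundary found this way describes the 2D map rather than the original feature space, which fits its use here as a visual aid.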
Finally, the core of each validated cluster was extracted, containing information about the mean and variance observed for each of the seven clusters. With that information, it is possible to view the detailed features of each cluster.
It is possible, by checking the individual characteristics of each cluster, to see the relationship between SES factors used as input and the PMR.
Figure 14 shows the percentage difference between each cluster and the national average for some of those characteristics: higher education, race, water supply, garbage destination, sewage access and number of rooms in residence. It is noticeable that there is a clear contrast between High PMR and Low PMR clusters among these characteristics.
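One plausible reading of "percentage difference" is the relative difference of a cluster mean from the national mean, sketched below. Both this formula and the values are assumptions for illustration; the figure could instead report a difference in percentage points.

```python
def pct_difference(cluster_mean, national_mean):
    """Relative difference of a cluster mean from the national mean, in percent."""
    return 100.0 * (cluster_mean - national_mean) / national_mean


# Hypothetical shares of households per characteristic (not from the datasets).
national = {"sewage_access": 0.50, "higher_education": 0.10}
cluster = {"sewage_access": 0.30, "higher_education": 0.12}

diffs = {name: pct_difference(cluster[name], national[name]) for name in national}
```

Under this reading, a negative value means the characteristic is less common in the cluster than nationally, and a positive value means it is more common.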
Cluster 4 is notable for being the only cluster that does not strongly follow this contrast. In Figure 8, Cluster 4 is shown as the one with the lowest PMR among those with above-average PMR. In Figure 11, Cluster 4 is visibly the most dispersed among the High-PMR-type clusters, with a noticeable number of coastal municipalities both in the Northeast and in the state of Rio de Janeiro. In contrast, Clusters 1, 2 and 3, with higher PMRs, are concentrated in the North region and in the Northeast region’s countryside.
5. Discussion
In this work, an unsupervised learning technique, a two-level clustering, was used to discover clusters of High and Low preterm birth (PTB) rate among Brazilian municipalities while clustering only on SES factors. The clustering resulted in seven final clusters, four with a High PTB Rate and three with a Low PTB Rate, and found significant socioeconomic differences between these High and Low PTB Rate clusters. The results found in this clustering process corroborate and add to the discoveries made by [5]. Their study uses a considerably smaller group for analysis (5297 pregnant women), performs prediction (logistic regression) instead of clustering and uses the SES factors of income, education and employment. Their results suggest that SES factors can help improve accuracy when predicting PTB, thus implying the existence of a relationship between SES factors and PTB. The fact that we were able to observe a similar relationship, with a significant difference in PTB among varying SES clusters, even when working with data from a different country with a much larger population, with a larger number of features and with a different learning algorithm, strengthens the idea that such a relationship is indeed meaningful. The feature-level analysis endorses the results found by [7,26,27], with High PTB clusters generally having lower levels of income, running water, sanitation access and (mother’s) education.
Medical and health sciences extensively use data, especially biological data, to tackle daily problems. Preterm birth, despite much research, is still not fully understood, but studies suggest the influence of external factors, including SES factors. Although deeper research is needed to fully explain why SES factors can affect PTB, some strong possibilities are a lack of health assistance and infrastructure, leading to poorer monitoring of pregnancies by health professionals and, therefore, to higher chances of pregnancy issues leading to PTB, as well as a worse quality of sanitation services, causing pregnant women to have overall worse health conditions that could increase the chances of PTB. By finding SES neighbourhoods that are more prone to the occurrence of PTB, the health system may be able to adjust itself better, and earlier, in order to provide assistance to the maximum number of newborns. The use of machine learning clustering techniques allows the analysis of multiple factors at once, with the algorithm naturally adjusting the relevance of each dimension during the training process, creating a situation that is less dependent on a single person’s or a few people’s take on the subject. This characteristic makes it a convenient choice to test and compare assumptions made by less feature-rich models, challenging or reinforcing the current understanding of the subject. In addition, the possibility of applying clustering to the problem also provides a fast, self-adjusting method that could possibly serve as part of a larger, automated and maybe live predictive model for health management.
The two-level clustering method described, followed by , allows k-Means clustering to be used in contexts usually not covered by traditional “optimal number of clusters” techniques. By setting an initially designated cluster target rule, k-Means can be used to track down specific sorts of clusters, guaranteeing the significance of the found cluster(s) through recurrence and DBSCAN validation, while also maintaining the algorithm’s explainability for subsequent analyses. Using this method, we were able to identify seven distinct clusters of notably outlying PTB Rate (10% threshold), as well as how strongly each municipality is associated with those clusters and how different these clusters are from each other at the feature level, at the segment level and overall. The two-level clustering validation behaved as expected, selecting similar cluster centres and discarding the noise generated mostly by the “over-fitting” high numbers of clusters in some units. The validated final clusters, even if chosen only by their PTB Rate, were shown to be significantly distinguishable in most of the SES factors used in the process. The clustering, working in a PCA-reduced hyperspace, was also able to find clusters that are shown to be distinct even in specific SES segments such as sanitation and living conditions. In addition, since neighbouring municipalities tend to be more socially alike than municipalities further away, the more regionally concentrated clusters found here are another expected outcome.
Although the goal of finding the High PTB Rate clusters was achieved, it represents just one step in what could be followed by a series of analyses concerning each variable or segment of SES individually. Careful studies and analyses should follow to discover the most relevant features among the 104 considered, to explain why such features matter for PTB and to identify which features are not relevant, so that they can be ignored in a future improved model. Preterm birth analyses reach many areas of study, and SES is just one of the factors considered, so an isolated study such as this is naturally limited in its results. Although many studies have explored the subject analysed here, no study was found for comparison that combines precisely the three key points: preterm birth, SES data and unsupervised learning. Finally, this work provides a method that allows cluster analysis on high-dimensional datasets and applies this method to enable the analysis of PTB Rate through SES factors.
6. Comparison to State of the Art
The results found in this clustering process corroborate and add to the discoveries made in [5]. Their study uses a considerably smaller group for analysis (5297 pregnant women), performs prediction (logistic regression) instead of clustering and uses the SES factors of income, education and employment. Their results suggest that SES factors can help improve accuracy when predicting PTB, thus implying the existence of a relationship between SES factors and PTB. The fact that we were able to observe a similar relationship, with a significant difference in PTB among manifold SES clusters, even when working with data from a different country, with a much larger population, with a larger number of features and with a different learning algorithm, strengthens the idea that such a relationship is indeed meaningful. Their work also notes how such a relationship is restricted: having such SES information, even combined with the individual-level data that they used, is still insufficient for a real-world clinical application of predicting PTB (their work’s goal). This matches our prior view of PTB as a multi-factorial problem and also helps to explain our difficulty, and our need to develop an alternative method, in achieving our clustering goals using k-Means. There are many aspects of PTB invisible to both our work and theirs, and the small differences between levels of SES are observable but naturally limited.
The fact that we were able to cluster SES factors at a non-personal level and observe that the regions found had considerable differences in PTB draws comparison to, and supports, the findings in [34]. Their study finds socioeconomic clusters among the blocks of the city of Paris using a spatial clustering technique. They clustered the city into areas, found the ones where mothers are most likely to experience PTB and then adjusted the clustering using, as a control variable, an SES index created from the 41 original SES variables available. Their results suggest a considerable influence of SES factors on PTB occurrences, as the non-adjusted model showed a much more significant (smaller) p-value. This interpretation is supported by the results of our work with the k-Means method, as it is perceptible in the Brazilian map of clusters shown for both methods how some regions differ considerably in terms of PTB rate. Similarly to their work, those clusters predominantly assume a regional-centred aspect, creating contiguous areas of similar SES characteristics, of which some have a significantly higher or lower PTB rate. Unlike in [34], which uses geographical location as part of its spatial algorithm, this regional-centred aspect was neither intended nor influenced by any location features, all of which were removed from the original dataset. For that reason, we obtained an outcome that re-emphasises this strong regional aspect of SES and PTB and, consequently, how neighbouring regions affect the SES and PTB outcomes of a given municipality or city block.
In the feature-wise view presented for the k-Means method, a few types of variables stood out. Sanitation variables were among the most outstanding in both models, with proper sewage systems, water systems and garbage collection systems being considerably less present in High PMR regions; this can be linked to, and reinforces, the findings of [26,39,40]. Although their works do not assess all of these sanitation points, many key variables found are strongly linked to their work, and the “Household with Bathroom” variable, a major point of their analyses, showed a much lower presence in higher PMR clusters, being one of the most diverging features in the
k-Means method. This was also viewed in a general comparison for the k-Means method and corroborates the findings of [27,41], both of which found statistically significant differences in PTB when stratified by mother’s education. Another clear disparity observed was related to race/skin colour, with white skin being one of the features most strongly related to Low PMR; this corroborates the findings of the meta-analysis presented in [48], which aggregates several studies related to race and preterm birth occurrences in the United States between 2010 and 2015 and finds a higher risk (OR 1.51) of PTB among historically disfavoured racial groups. Although the exact groups cannot be compared properly, as the Brazilian and U.S. populations have significant racial differences, the high presence of whites in Low PMR clusters and the high presence of pardos and indigenous people in the High PMR clusters provide strong support for their results.