1. Introduction & Relevant Work
Insurance activities are conducted on the basis of a truthful relationship between insured and insurer. Data completeness and reliability help to bolster this relationship. For instance, the residence region of the insured plays a major role in pricing. However, this information is often unavailable, incorrectly entered in the database, or simply no longer current. It has been observed that this lack of structure sometimes enables poor pricing and occasionally even hints at the possibility of fraud. This paper presents an effort to mitigate this risk by exploiting the predictive power of a telematics dataset in an insurance context.
Different machine learning methodologies are exploited in the literature in the context of telematics. Meiring et al. (2015) reviewed intelligent approaches for using telematics data in different applications. Moreover, Wuthrich (2017) utilized the k-means clustering method for categorizing the driving styles of drivers to enrich the insurance pricing scheme. In another application, Dong et al. (2016), by means of deep learning (RNN and CNN), identified the driving styles of drivers based on recorded GPS data. They presented a clustering-based ML approach for predicting the risk of the insured drivers based on their recorded GPS data as well as the acceleration and velocity in all directions.
On the topic of telematics being used as a means of predicting claims frequency and improving pricing, Pesantez-Narvaez (2019) examined the performance of the XGBoost approach for predicting claims based on a telematics dataset, including total annual distance driven and percentage of total distance driven in urban areas.
Telematics data in combination with the CNN algorithm has also been used for vehicle type identification by Nguyen (2018), based on data that were collected from the users’ smartphones.
Verbelen and Antonio (2018) combined traditional insurance pricing with the telematics approach to include the driving behavior of the insured driver, using generalized additive models and compositional predictors.
Wang et al. (2017) implemented driver identification based on telematics data and the Random Forest classification method.
Qazvini (2019) compared the performance of a Poisson regression model and a zero-inflated Poisson (ZIP) model for the prediction of the claim frequency based on the telematics data that were collected from insured drivers.
The rich information coupled to big geo-tagged datasets has sparked many projects aiming to generate some level of structure out of unstructured information.
One such application is given here (footnote 1), where points of interest (POI) are automatically identified based on the spatial density of the pickup locations requested on a taxi platform. Their approach utilizes density-based models to automatically identify clusters of pickup locations where the density is higher. The main assumption is that clusters with a high density of points are to be considered more important.
Along the same lines, Yang et al. (2017) have also considered the density of the spatial distribution of points. Using the Laplacian zero crossing, they define the boundaries of what they consider an increased-interest region (POI), based on the locations of millions of geo-tagged Flickr photos. The added value of that approach is the simultaneous definition of the POI, as well as its physical boundaries.
Finally, the work of Deng et al. (2019) utilizes the same concept of spatial density to address the problem of the automatic identification of an urban center. In this work, the geo-tagged items to be clustered are POI, and their density is thought to represent the city center. By representing the POI density as a surface, similar to that utilized in terrain representation, the authors introduce the concept of the surface contours of the density function. They subsequently utilize the contours to generate a contour tree, which hierarchically groups high-density regions together. Here, too, the physical boundaries of the region of interest are defined automatically.
A different source of geo-tagged data is telematics. By monitoring the motility patterns of an individual via specialized sensors, we are now able to generate vast amounts of geospatial data, rich in information and ready to be used for various life-improving projects. In one such application, McKenzie et al. (2019) utilize user-generated geo-content to help identify financial access points in sub-Saharan Africa.
The methods that are discussed above provide the background for this study; however, they present two shortcomings in the context of our application. Firstly, when the data richness is lower and the denser regions are scarce, fitting a density surface might prove problematic, whereas a straightforward clustering method (which is, however, density based) is the most efficient approach, without compromising on accuracy. In our application, the median number of trips per driver is 177, significantly lower than the vast volumes of geo-tagged pictures or available trips per user in the other studies. Secondly, the aforementioned applications do not deal with competing classes of dense geo-tagged items. In our specific application, after identifying the denser regions (dwell locations), we also have to classify them according to the specific meaning that they carry within the context of the application, i.e., identify a destination as being a work address versus a home address.
This paper proposes a methodology for the automatic identification and classification of a dwell location (or POI) of an individual. Subsequently, we apply the proposed methodology for the automatic identification of a user’s residence address based on motility patterns that arise from a telematics dataset.
The remainder of this paper is organized as follows: Section 2 provides a framework, including the dataset utilized (Section 2.1), a discussion of the algorithms used (Section 2.2), and how the datasets are preprocessed (Section 2.3). In Section 3, the prediction results of the clustering algorithms are presented and discussed (Section 3.1), the performance of the exploited algorithms is evaluated (Section 3.2), and the results of classification as residential address or not are presented (Section 3.3 and Section 3.4).
2. Methodology
In this work, we take advantage of available telematics data in order to identify the user-relevant POI in the form of spatial clusters. Each spatial cluster for a user j groups a subset of that user’s trip destinations, and a point of interest l for user j is defined as the set of spatial clusters that are relevant to the intended point of interest.
Out of the defined user-relevant POI, a prediction is made of the particular POI that corresponds to the intended POI, for instance, the residence address of the driver. The prediction is made over the set of POI considered for that user, using the user’s trips that are relevant to each of them.
To this end, a single user is described by a point cloud of the end locations of all of that user’s trips. Users are indexed over the n total users, and the trips of a given user are indexed over the m total trips of that user.
Using the Mean Shift or DBSCAN clustering methods, the points are clustered into what we define as individual destinations. Here, the residential address is considered as the intended POI. Other types of POI can also be considered; this choice does not reduce the generality of the methodology and is made only for the purpose of illustration.
We assume two types of datasets are available for carrying out the POI identification task. The first is a database of geo-tagged items, representing the motility pattern of the user in question. The second is a database of ground truth information about the exact location of the target POI. We have examined the ratio of dual-address drivers and we do not expect it to severely impact our results, as, in our portfolio, their proportion is quite low. This fact led us to the fundamental assumption that each user has only one true residential address.
Figure 1 shows a schematic representation of the trip point cloud around the user’s residence address. Both Mean Shift and DBSCAN clustering methods are tuned via a parametric study so that the result of the clustering is representative of a series of single destinations, each attributed to a subset of the user’s trip end locations.
In the figure, after clustering, the points have been split into two clusters, with their respective centroids represented by an orange circle. Each centroid acts as a single-point representative of the whole cluster, and it is checked for its proximity to the real home address. One such center is identified for each single destination.
For the training and evaluation of the model, a circle of fixed radius (the home distance tolerance) is defined per user, with its center positioned on the exact coordinates of the user’s actual address. The prediction is considered to be successful when the centroid of the cluster that is predicted as a home address lies within this circle.
In the case of Figure 1, cluster A is considered to be a positive observation, while cluster B is a negative one. In Section 3, we address the effect of this radius on the prediction accuracy.
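The success criterion described above amounts to a distance check between a cluster centroid and the registered address. The snippet below is a minimal sketch of such a check, assuming that coordinates are given as (latitude, longitude) pairs in degrees and that the haversine formula on a spherical Earth is accurate enough at this scale. The function names and the sample coordinates are hypothetical and are not taken from the study.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_positive(centroid, home, tolerance_m=500.0):
    """True when the centroid lies inside the circle drawn around the home address."""
    return haversine_m(*centroid, *home) <= tolerance_m


# Hypothetical coordinates, for illustration only.
home = (50.8503, 4.3517)
cluster_a = (50.8510, 4.3525)  # roughly a hundred metres from the address
cluster_b = (50.8800, 4.4000)  # several kilometres away

label_a = is_positive(cluster_a, home)
label_b = is_positive(cluster_b, home)
```

With the default tolerance of 500 m, `cluster_a` is labelled as a positive observation and `cluster_b` as a negative one, mirroring clusters A and B in the schematic.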
Each destination cluster is subsequently enhanced with information that describes its properties. The information introduced about the cluster is descriptive of the user’s interaction pattern with it. The specific features that are extracted or created are introduced and discussed in Section 2.3.
Following the feature engineering phase, a classification model is constructed with the purpose of identifying the destination that is the most likely to be the residence address of the individual under examination.
2.1. Dataset
The dataset utilized for the analysis has been extracted from an in-house telematics database. The information in the database has been recorded utilizing SMAAS (smartphone as a sensor). The dataset has a total of 1438 individuals and 525,663 trips.
Each user was monitored throughout a certain time interval, while their location was recorded at a frequency of around 1/3 Hz. For the purposes of this analysis, the dataset is processed and reorganized so that the intermediate points are dropped and only the start and end locations of the user’s trips remain in the dataset. In addition to the location data, temporal information on arrival and departure is also included in the dataset. Another dataset containing the address of the user is utilized in combination with the start/finish trip locations.
During a data cleaning step, we removed from the training dataset all users who, throughout their trips, never once arrived within 500 m of their true registered address. As such, the useful dataset was reduced to 928 individuals with a corresponding total of 371,258 trips, for further processing.
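The cleaning rule above can be illustrated with a short filter. This is only a sketch under stated assumptions: trip end locations and registered addresses are held in plain dictionaries keyed by a user identifier, coordinates are (latitude, longitude) in degrees, and distances are computed with the haversine formula. The data layout, names, and sample values are hypothetical.

```python
import math


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlmb = math.radians(q[1] - p[1])
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6_371_000.0 * math.asin(math.sqrt(a))


def users_with_home_visit(trip_ends, homes, max_dist_m=500.0):
    """Return the users with at least one trip end within max_dist_m of their address."""
    kept = set()
    for user, ends in trip_ends.items():
        if any(haversine_m(end, homes[user]) <= max_dist_m for end in ends):
            kept.add(user)
    return kept


# Hypothetical toy data: user "u2" never arrives near the registered address.
homes = {"u1": (50.8503, 4.3517), "u2": (51.2194, 4.4025)}
trip_ends = {
    "u1": [(50.8505, 4.3520), (50.9000, 4.4000)],
    "u2": [(51.3000, 4.5000), (51.1000, 4.3000)],
}
kept_users = users_with_home_visit(trip_ends, homes)
```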
2.2. Clustering Algorithms
Two different clustering methods, DBSCAN and Mean Shift, are employed for the identification of the individual destinations of the user. In the following paragraphs, we elaborate on these two algorithms and their theoretical backgrounds, and present a performance comparison.
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is a robust clustering approach for a dataset mixed with noise, as indicated in Chakraborty et al. (2011) and Tran et al. (2013). This method utilizes two independent parameters for detection of the body and border of a cluster of an arbitrary shape, solely based on the density of the points within it. The algorithm works well in detecting regions of higher density (Zhong et al. 2019).
However, it presents a few problems in the context of user-relevant POI detection. This algorithm only works well for clusters of similar density (Wang et al. 2019). Furthermore, DBSCAN tends to merge different clusters into a single one, regardless of the shape and size of the resulting cluster, if a chain of equal-density points exists as a bridge between them. This property, albeit necessary and desired in other applications, presents a problem in the context of POI detection, as it allows for regions of interest that are arbitrarily large, which inherently contradicts the notion of a POI.
The most important drawback of this algorithm is its sensitivity to the model parameters, as indicated in Ren et al. (2014). Parameter optimization is necessary to ensure that the clustering performance is not negatively affected by the selected model parameters. Despite the possible complications of the algorithm, it has been tried with success, also because of its ability to exclude points that are considered to be noise. The DBSCAN algorithm has the ability to handle noisy data efficiently, even in a dynamic environment with an ever-changing dataset (Chakraborty et al. 2011).
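To make the behavior described above concrete, the following is a minimal, unoptimized sketch of the textbook DBSCAN procedure. It assumes planar coordinates in metres (rather than raw latitude and longitude) and counts a point as its own neighbor; the parameter names `eps` and `min_pts` and the sample points are illustrative choices and do not reproduce the configuration used in the study.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""

    def neighbours(i):
        # Brute-force radius query; the point itself is included in the count.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nbrs)
    return labels


# Synthetic trip end points: two dense groups and one isolated point.
points = [(0, 0), (10, 0), (0, 10), (10, 10),
          (1000, 1000), (1010, 1000), (1000, 1010),
          (5000, 5000)]
labels = dbscan(points, eps=50, min_pts=3)

# A chain of evenly spaced points is merged into one elongated cluster.
chain = [(40 * i, 0) for i in range(10)]
chain_labels = dbscan(chain, eps=50, min_pts=3)
```

The `chain` example illustrates the bridging effect discussed above: although the end points are 360 m apart, every point ends up in the same cluster because consecutive points lie within `eps` of each other.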
The non-parametric Mean Shift algorithm is based on the assumption that each dense region represents a cluster. This approach to clustering allows the method to be independent of further assumptions regarding the features of the clusters, such as the number of clusters and their distribution.
The method shifts the kernel centroid iteratively towards the direction of the average of the points contained in the kernel. The difference between subsequent centroids defines the mean shift vector, which always points towards the direction of maximum increase in the density. The governing parameter of the Mean Shift algorithm, the bandwidth parameter, is the radius of proximity that is considered by the kernel function, which is adjustable based on the application.
The governing parameters of the algorithm are interpretable, and the performance is stable. Despite its potential for improved clustering accuracy, the Mean Shift approach is a computationally expensive method for very large datasets. In this work, the dataset size does not impose a high computational burden, thus computational efficiency is not a governing performance criterion.
The Mean Shift algorithm is very sensitive to the selection of the bandwidth parameter (h). Thus, the bandwidth parameter needs to be chosen carefully based on the application. Inappropriate adjustment of h can lead to an incorrect number of clusters or an undesirable cluster configuration.
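The iterative shifting of the kernel and the role of the bandwidth can be sketched as follows. This is a simplified illustration that assumes a flat (uniform) kernel, planar coordinates in metres, and a simple rule that merges converged modes lying within one bandwidth of each other; these choices, the function name, and the sample points are assumptions made for the example and are not taken from the study.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift. Returns (centroids, labels)."""
    modes = []
    for p in points:
        x = p
        for _ in range(max_iter):
            # Points falling inside the kernel window centred on x.
            window = [q for q in points if math.dist(x, q) <= bandwidth]
            if not window:
                break
            new_x = tuple(sum(c) / len(window) for c in zip(*window))
            shift = math.dist(x, new_x)  # length of the mean shift vector
            x = new_x
            if shift < tol:
                break
        modes.append(x)

    # Merge modes that converged to (nearly) the same location.
    centroids, labels = [], []
    for m in modes:
        for k, c in enumerate(centroids):
            if math.dist(m, c) <= bandwidth:
                labels.append(k)
                break
        else:
            centroids.append(m)
            labels.append(len(centroids) - 1)
    return centroids, labels


# Synthetic trip end points: two dense groups and one isolated point.
points = [(0, 0), (10, 0), (0, 10), (10, 10),
          (1000, 1000), (1010, 1000), (1000, 1010),
          (5000, 5000)]
centroids, labels = mean_shift(points, bandwidth=200)

# With a very small bandwidth every trip becomes its own destination.
tiny_centroids, _ = mean_shift(points, bandwidth=1)
```

Unlike DBSCAN, this procedure has no noise label, so the isolated point forms a cluster of its own, and an overly small bandwidth yields as many clusters as there are trips.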
2.3. Feature Engineering
The dataset undergoes three preparation steps before any meaningful prediction can take place. First, the user’s trips are filtered, such that only the last known location of the trip remains as a representative of the trip’s destination. Subsequently, the total population of ending locations is segmented into individual destinations, such that the dataset reduces in size to the total number of different unique destinations. A single point represents the population of each destination, namely, the coordinates of the centroid of the identified cluster. Finally, for every separate cluster, a feature-engineering step is necessary in order to enrich the knowledge about the user interaction with the cluster.
Table 1 presents the list of covariates that are generated.
To summarize the above, a concise representation of the methodological steps is presented in Figure 2.
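As an illustration of the feature-engineering step, the sketch below derives a few descriptive quantities per destination cluster from the arrival timestamps of one user. The covariates shown here (visit count, share of trips, share of night-time arrivals, and number of distinct days) are hypothetical examples chosen for the illustration; they are not claimed to be the covariates listed in Table 1, and the night-time window is an arbitrary assumption.

```python
from collections import defaultdict
from datetime import datetime


def cluster_features(arrivals):
    """arrivals: list of (cluster_id, arrival datetime) pairs for one user."""
    by_cluster = defaultdict(list)
    for cluster_id, ts in arrivals:
        by_cluster[cluster_id].append(ts)

    total = len(arrivals)
    features = {}
    for cluster_id, stamps in by_cluster.items():
        # Assumed night window: arrivals from 20:00 up to (not including) 06:00.
        night = sum(1 for ts in stamps if ts.hour >= 20 or ts.hour < 6)
        features[cluster_id] = {
            "n_visits": len(stamps),
            "visit_share": len(stamps) / total,
            "night_share": night / len(stamps),
            "n_distinct_days": len({ts.date() for ts in stamps}),
        }
    return features


# Hypothetical arrivals for one user, labelled with their destination cluster.
arrivals = [
    (0, datetime(2020, 1, 6, 22, 15)),
    (1, datetime(2020, 1, 7, 8, 55)),
    (0, datetime(2020, 1, 7, 21, 40)),
    (0, datetime(2020, 1, 8, 13, 0)),
]
features = cluster_features(arrivals)
```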
3. Performance Comparison & Results
3.1. Clustering Performance Comparison
Following the feature-engineering phase, a prediction is made using one of the supervised machine learning approaches, based on the observed feature space presented in Section 2 (Section 2.3).
In this section, the achieved prediction accuracy of the two different clustering techniques is compared. Considering the imbalanced nature of the dataset, the Precision-Recall curve of the positive class is taken as the main performance criterion. A high area under the Precision-Recall curve is desired, representing a high level of both precision and recall. A metric of this property is the f1 score of the model, which will also be presented in the results.
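For reference, precision, recall, and the f1 score of the positive class can be computed from binary labels as in the sketch below. The function name and the toy labels are illustrative; the definitions themselves are the standard ones, with f1 being the harmonic mean of precision and recall.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and f1 of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 marks a destination cluster that is a home address.
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

As a consistency check, a precision of 0.81 and a recall of 0.85, the values quoted in the Conclusions, give `f1_from(0.81, 0.85)` of approximately 0.83.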
For the considered clustering techniques, a parametric study is carried out to investigate the effect of the independent model parameter on performance, while the home distance tolerance is kept constant and equal to 500 m. The value of 500 m is selected such that all intended applications of the method are applicable without ambiguity. The effect of this selection is also addressed, in an effort to assess the robustness of the method for business applications that potentially require higher spatial precision.
3.1.1. Tuning DBSCAN
For DBSCAN, the influence of the neighborhood radius parameter, ε, has been examined within a range from 0.0005 to 0.05, with the obtained f1 score presented in Figure 3. These ε values roughly correspond to a physical radius that is in the range of 35 m to 350 m. The second hyper-parameter of DBSCAN, the minimum number of neighboring points, has been chosen to be equal to three.
For a given ε and corresponding physical radius, the algorithm keeps expanding the cluster as long as three or more close neighbors are found within distance ε. A radius that is too small will lead to many destinations being excluded as noise, while a radius that is too large runs the danger of joining destinations that should ideally have been separated, giving the cluster an irregular shape. The latter will move the centroid of the combined cluster far from the true home address. Moreover, the types of points that end up being clustered together will be very diverse, which will generate features of limited predictive power.
3.1.2. Tuning Mean Shift
The same study is performed for Mean Shift, when considering values for the h parameter in the same range. A change in this radius would mean that the size of the control circle is affected. Just as with DBSCAN, a very large radius would lead to clustering of distant locations as a single destination, and the centroid would drift from the ideal, close-to-home location.
One difference of Mean Shift is that the area of influence of a single cluster is predetermined, and equal to the circle in question. The method is not allowed to append points that are in close proximity to the generated cluster, as DBSCAN would. Instead, it will generate a new destination with its own points, which is h distance away from the former cluster.
Given the nature of the problem that we are trying to solve, it appears that Mean Shift can more robustly cluster the individual trips to different destinations. In the extreme case where the parameter value is too low, the total number of clusters will be equal to the number of individual trips made by the user, in which case the feature engineering process that is mentioned in Section 2 (Section 2.3) would not provide any added value.
Following the hyper-parameter tuning of the two clustering methods, the best-performing setup from each model is presented for comparison in Figure 4. The superior performance of Mean Shift is notable, with a Precision-Recall (PR) curve obtaining recall rates in the order of 85% and a precision of 80%.
3.2. Spatial Accuracy Investigation
Finally, the possibility of improving the spatial accuracy of the prediction is assessed. A parametric study is performed, varying the home distance tolerance radius. For smaller values of the radius, a more exact prediction of the home address is obtained. It is logical that the model would only be accurate within a certain distance tolerance, given the use of movement data as a proxy for the definition of the true address. It is therefore expected that any improvement in spatial accuracy will come at the expense of some of the predictive capacity of the model.
The best-performing algorithm, Mean Shift, is selected for the analysis. The model-specific parameter is kept constant and equal to 0.005 (≈200 m), and the model fitting is repeated for home distance tolerances of 500 m, 200 m, and 100 m. The results are depicted in Figure 5.
As expected, a drop in the predictive capacity is observed for smaller tolerances (meaning higher spatial accuracy). However, it is notable that, even in the most demanding scenario with a tolerance of just 100 m, the model performs relatively well, providing an educated guess with an f1 score of 0.744. In all cases, the model performs much better than the “no-skill” prediction that represents randomness (see Figure 5).
The simulations are performed on a PC equipped with an Intel Core i7-8750H 2.20 GHz CPU and 32 GB RAM.
3.3. Rule-Based Prediction
A simple rule is devised for the ‘discovery’ of the home address of every individual, out of the total set of destinations from the recorded trips. For each person, the destination that has been visited most frequently will be selected as the residence address. The approach implies that every individual of the dataset is expected to have exactly one residence address. This is in accordance with the data cleaning process indicated in Section 2 (Section 2.1). The method performs relatively well, with an achieved f1 score of 0.785. This result was used as a baseline, and in the next section it is compared to machine learning-based methodologies.
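The rule described above reduces to picking the most frequent cluster label among a user’s trip ends. The sketch below shows one way to express it, assuming each trip end already carries an integer destination label and that a label of -1 marks noise (as produced by DBSCAN-style clustering); the function name, the noise convention, and the tie-breaking behavior are assumptions of this example.

```python
from collections import Counter


def predict_home_by_frequency(trip_labels, noise_label=-1):
    """Return the most visited destination label, or None if there is none."""
    counts = Counter(label for label in trip_labels if label != noise_label)
    if not counts:
        return None
    # Ties are resolved in favour of the label that was encountered first.
    return counts.most_common(1)[0][0]


# Hypothetical destination labels for the trips of one user.
trip_labels = [0, 0, 1, 2, 0, 1, -1]
home_cluster = predict_home_by_frequency(trip_labels)
```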
3.4. Machine Learning Prediction
For the more advanced modeling, the problem is represented as a binary classification problem, where the model aims to classify every individual destination cluster of the user as either being a home address or not (unity or zero labels, respectively). The cluster classification phase (as being a residence address or not) is performed using one linear and two non-linear methods, namely, Logistic Regression, Random Forest, and Multilayer Perceptron. The results are subsequently summarized and compared with the rule-based approach in Table 2. The analysis indicates that the three ML models perform at a comparable level to each other, but outperform the rule-based prediction by about 4%. The increased performance of the ML models is justified by the utilization of all 10 generated features made available in the feature engineering phase. In contrast, the rule-based prediction only looks at the relative cluster size before classifying the destination.
The prediction appears to be very sensitive to the clustering methodology used and a lot less to the classification method. This indicates that the first phase of modeling is the most important aspect of the proposed methodology for user-relevant POI identification, namely the clustering of the individual trips into meaningful destinations. The importance of the correct selection of the clustering method becomes paramount given the sensitivity of the method in the clustering phase. Mean Shift has proven that, despite DBSCAN being preferred for spatial clustering in literature, it can perform better and bring an uplift in the f1 value in the order of 17%.
Given the similarity of the performance and for the sake of simplicity, only the Multilayer Perceptron (MLP) results will be demonstrated and discussed in this section. The considered MLP was tuned to comprise two hidden layers of 30 and 15 nodes, respectively. The model is solved using the ‘Adam’ optimizer. The dataset is split into train and test sets, chosen at 80% and 20% of the total observations, respectively, for which the full set of available features is included. Using five-fold cross validation, the fitting was performed multiple times, such that potential bias in the presented results would be addressed. The uncertainty band accompanying the presented curves accounts for the standard deviation in the results of the five model runs.
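The evaluation protocol can be sketched as a five-fold partition of the observations, in which each fold serves once as the 20% test set while the remaining 80% is used for training, followed by the mean and standard deviation of the per-fold scores. The code below is a hedged illustration of that idea only: the shuffling seed, the use of the sample standard deviation, and the placeholder scores are assumptions, and no model is actually fitted.

```python
import random
import statistics


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_no, fold in enumerate(folds) if f_no != i for j in fold]
        yield train_idx, test_idx


def summarize(scores):
    """Mean and sample standard deviation of the per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


splits = list(k_fold_indices(10, k=5))

# Placeholder per-fold scores, for illustration only (not results of the study).
fold_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
mean_score, std_score = summarize(fold_scores)
```

The standard deviation obtained this way is what would define the width of an uncertainty band around the mean curve.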
The model itself is comprised of two sequential steps. In the first step, the prediction of the machine learning model returns a probability vector for the total number of clusters for all the users in the dataset. This is succeeded by a logical check where the probabilities are converted to binary values, taking into account the fact that only a single home address will correspond to each user. Here, too, the assumption is in line with the data cleaning process that is indicated in Section 2 (Section 2.1).
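One plausible way to implement the logical check described above is to keep, for each user, only the cluster with the highest predicted probability. The sketch below assumes exactly that (an argmax per user); the text does not spell out the precise rule, so this interpretation, the function name, and the sample values are assumptions.

```python
def one_home_per_user(user_ids, probabilities):
    """Set label 1 for the highest-probability cluster of each user, 0 elsewhere."""
    best = {}  # user id -> index of that user's most probable cluster
    for idx, (user, prob) in enumerate(zip(user_ids, probabilities)):
        if user not in best or prob > probabilities[best[user]]:
            best[user] = idx
    labels = [0] * len(user_ids)
    for idx in best.values():
        labels[idx] = 1
    return labels


# Hypothetical classifier output: one row per destination cluster.
user_ids = ["a", "a", "a", "b", "b"]
probabilities = [0.2, 0.7, 0.6, 0.4, 0.1]
labels = one_home_per_user(user_ids, probabilities)
```

Note that, under this rule, a user receives exactly one positive label even when all of that user’s probabilities are below 0.5, as for user "b" in the example.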
The obtained performance level for the configured MLP algorithm is given in Table 3 and Table 4.
4. Conclusions
In this paper, we utilize the DBSCAN and Mean Shift clustering methods in order to identify the residential address of subscribed drivers. A parameter study is performed for fine-tuning the model parameters and home distance tolerance. Both the DBSCAN and Mean Shift approaches perform well, with f1 scores of 0.66 and 0.83, respectively. Furthermore, it is shown that clustering trips with Mean Shift facilitates predictions that outperform the ones based on DBSCAN clustering. Finally, a supervised ML approach using Logistic Regression, Random Forest, and Multilayer Perceptron is implemented for the identification of the user’s residential address. The model performance measures show that the ML-based address identification approach yields reliable predictions, with a precision of 81% and a recall of 85%.
The methodology in this paper is demonstrated for the identification of the home address. However, we believe it to be suitable for the identification of any kind of user-relevant POI, as long as ground truth data exist for the algorithm to train on.
The proposed model could work very well within an insurance context as a fraud prevention or identification mechanism, by contrasting the predicted addresses to the registered ones. It would also be a source of useful information at a primitive stage in the acquisition of new contracts, as it would facilitate marketing, and automatically populate part of the questionnaire that is necessary at the subscription phase. The latter is also known in an insurance context as a one-click policy.
The concept is promising and additional research on this topic could result in predictions that are even more accurate. Other clustering methodologies (like OPTICS) in combination with a more elaborate feature-engineering phase could possibly render even better predictions. In this work, the utilized ML model is deemed to be sufficient to demonstrate the predictive capacity of the concept. A thorough examination of different ML techniques would potentially result in an uplift to the achieved accuracy.