1. Introduction
Data are increasingly produced, maintained, and used by heterogeneous groups of people, across cultures, country borders, differing levels of education, and so forth, which leads to a diversity of characteristics [1]. Among other things, data are often stored in a loose format or even in different formats within one dataset; people often have diverse motivations to contribute and use data; and the data are maintained with differing intensity. Further, such differences lead to differing degrees of organization and collaboration among the contributors and users of a dataset. These factors pose major challenges to the interpretation and use of such data. Rather than being interpreted through well-defined ontologies, the data often need to be interpreted on the fly, taking into account the context of their genesis. The geographical domain challenges the interpretation of data in a particular way because geographical data are often of an intercultural and global nature.
Data (and corresponding projects) that are the result of a social process have been termed in different ways, among them ‘User-Generated Data’ and ‘User-Created Content’ [2]. These and other terms are often context dependent and thus refer only to a subset of such data sources. They carry a connotation and thus do not convey the class of such datasets in its full generality. The term ‘User-Generated Data’, for example, refers to the creation process of the data but not to their use. This is why we coin the term ‘Shared Data Sources’ (SDS) as a more generic umbrella term without such a connotation. We define:
A dataset or project is called a ‘Shared Data Source’ if its production, maintenance, and use are predominantly social processes.
The reasons behind the introduction and choice of the term ‘SDS’ are manifold. First, the term ‘SDS’ puts emphasis on the fact that the data are the result of a creation and maintenance process and are used in some social context. A ‘source’ refers to the process from which the data derive and thereby also to their use: the creation process usually has the aim of producing data for a certain use, and the future use is the reason why the data are created. In comparison, most of the terms used to identify a subset of SDSs refer to either information or data, such as Volunteered Geographic Information, while only some refer to a process [3]. The term ‘SDS’ contains reference to both the data and the process. Secondly, the introduction of the term ‘SDS’ is meant to highlight that all three aspects (data creation, maintenance, and use) need to be predominantly of a social nature. Data created in a social process but used only by a single company are not an SDS, and neither are data created by a small number of people, as long as the social interaction between these people does not strongly influence and shape the resulting data. The social nature of SDSs is, in fact, the determining factor because purely technical aspects become less important whenever the data are shaped by a social process [4].
In addition to the term ‘Shared Data Source’, we coin the term ‘Geographical Shared Data Source’ (GSDS) for referring to an SDS in the geographical domain. At first glance, one might question why this term should be introduced because the restriction to a specific domain does not seem to be vital. In fact, many factors that relate to the heterogeneity of the social process, among them the cultural, social, and educational background of the involved people, are acknowledged to be significant to geography. When an SDS is about geographical content on a global scale
1, its creation, maintenance, and use are thus often particularly subject to social influence. This is why we use GSDSs as examples in the following.
This article focusses on the methodological challenge of making sense of SDSs, both individually by setting an SDS in the context of others and by more holistically examining a larger number of SDSs in their entirety. Thereby, the following research questions are addressed:
- RQ1
Which ‘dimensions’ can be used to characterize an SDS?
The social nature of SDSs imposes structural complexity on both the data and the entire projects. Hence, there exists a multitude of dimensions that can be used to characterize an SDS. Which of these dimensions are important, and how can they be grouped by their role in the process of creating, maintaining, and using the data? Further, how can one evaluate the importance of a certain dimension for characterizing an SDS?
- RQ2
How to characterize an SDS in the context of other SDSs? And how to characterize the change of an SDS over time?
A characterization of an SDS relative to other ones is particularly efficient because differences and similarities become apparent. Again, the dimensions discussed in RQ1 can be utilized for such a comparison. Which methods exist to measure distances between SDSs? How to investigate similarities and differences between SDSs by visualizations? And how to trace changes of an SDS over time, i.e., the evolution of an SDS?
- RQ3
How to choose suitable prototypes for grouping SDSs by their characteristics? And how to assess existing prototypes?
Shared Data Sources are commonly classified into different types
2, among them, Volunteered Geographic Information (VGI; Goodchild [
8]), Ambient Geographic Information (AGI; [
9]), and Public Participation Geographic Information Systems (PPGIS; [
10]). Such typification creates obstacles: the types are fuzzy to some degree, they overlap, and the typification is often ambiguous. As a result, the definitions of these types evolve over time and even different definitions coexist. How can we identify groups of SDSs that share common characteristics? How to evaluate and visualize their fuzziness? And how to make sense of SDSs evolving over time in terms of prototypes?
It should be noted that these questions are purely methodological in nature. In this article, we propose conceptual means for characterizing ‘(Geographical) Shared Data Sources’. The proposed dimensions and the discussed prototypes (VGI, PGI
3, and AGI) serve only as examples, and additional dimensions and prototypes can easily be integrated into the proposed framework. The aim of the article is to discuss the methodological opportunities and issues of making sense of SDSs in their mutual context by utilizing the proposed framework. Thereby, we introduce new lenses for the interpretation of the data that incorporate their genesis and characteristics, which allows for a more fine-grained typification of SDSs. The practical usefulness of the approach is demonstrated by setting GSDSs mutually into context and by discussing existing prototypes in terms of GSDSs.
The article is structured as follows. After a literature review, we establish the notions of the prototypes VGI, PGI, and AGI as well as of SDSs and GSDSs (Section 2). Further, we argue why a conceptual framework for setting SDSs mutually into context is needed (Section 3). It is natural to conceptualize SDSs by their characteristics in such a framework. These characteristics can formally be represented by several dimensions, which describe different aspects of an SDS (Section 4). Visualizations and statistical analysis can take advantage of these dimensions when analysing SDSs. We have described a large number of SDSs from the geographical domain, which can be set in context relative to the prototypes. As a result, the entirety of the described GSDSs, and not only individual ones, can be examined both visually and statistically. The utilized methods of visualization are by no means the only possible ones. We describe in detail why they are suitable and which aspects they focus on (Section 5). Finally, the findings of our analysis are discussed. In particular, we show that the introduced dimensions are compatible to a high degree with both a (common) thematic categorization of the SDSs and the prototypes of VGI, PGI, and AGI, which is why the dimensions can serve as a reference frame for characterizing SDSs. This demonstrates the usefulness of the proposed methodological means (Section 6).
2. Related Work and Prototypes of Shared Data Sources
Shared Data Sources expose very different characteristics, even within the geographical domain. For instance, OpenStreetMap (OSM), Twitter, and civic issue tracking projects are very different. In the literature, little work can be found on how to gain a systematic comprehension for classifying GSDSs in their generality and for describing the differences among them. In order to distinguish between volunteered and contributed information, Harvey [11] has discussed a classification of GSDSs by opt-in and opt-out strategies. According to Harvey [11], contributors volunteer if they decide on their own to contribute, and they ‘only’ contribute (in contrast to volunteer) if they have to actively veto the use of the contributed data. A more fine-grained characterization has been used by Saxton et al. [12], who refer, among others, to the level of collaboration; compensation schemes, which reflect the amount of money one can earn; the control mechanisms; and trust-building systems. By this characterization, they are able to distinguish different mechanisms to create GSDSs, among them citizen media and collaborative software development. Another characterization has been proposed by Spyratos et al. [13]. They refer, among others, to the knowledge and the technical resources of the contributor; data quality requirements; and the actual data quality, including the mechanisms to ensure data quality. In addition, Spyratos et al. [13] have examined thematic categories of GSDSs. Finally, Comber et al. [14] have used text-based analysis methods to investigate scientific articles about GSDSs in order to achieve a common understanding of these SDSs in the geographical domain.
Geographical Shared Data Sources have been grouped into types, each of them containing GSDSs that share common characteristics. A number of terms have been introduced to refer to these different types of GSDSs, among them Volunteered Geographic Information (VGI), Ambient Geographic Information (AGI), Public Participation Geographic Information Systems (PPGIS), Citizen Science, Citizen-Contributed Geographic Information, Collaborative Mapping, Crowdsourcing, Participatory Sensing, Neogeography, and Science 2.0 [
3]. These terms have been coined by different people, and while most of these terms refer to different concepts, there exist substantial overlaps between many of them. An overview of such terms has been provided by See et al. [
3]. The inconsistent use of these terms is, among others, the result of the fact that the ways we contribute and use data change over time. The evolving nature of GSDSs has been discussed for a long time, among others, in respect to trust [
15], their future potential, and corresponding obstacles [
16]. Also, the influence of GSDSs on science has been discussed, both in general [
17] and in respect to pluralism in science [
18]. In the following, we discuss three of the most important types, namely VGI, AGI, and PPGIS.
The term Volunteered Geographic Information (VGI) is widely used. Introduced by Goodchild [8], it refers to ‘a special case of the more general Web phenomenon of user-generated content’ (p. 212). Thereby, he refers several times to the ‘volunteer effort’. Further thoughts, in particular on data quality as a result of the particular characteristics of VGI, have been expressed by Goodchild and Li [19]. Bordogna et al. [20] also discuss how the creation process of VGI shapes the data quality due to its characteristics. They outline to what extent VGI is particularly prone to fuzziness of the ontology as well as to varying precision of geometric information. The use of VGI in the context of geographical science and even in social practice has been discussed by Elwood et al. [4] and by Sui and DeLyser [21]. As volunteers contribute knowingly to VGI, they often collaborate in some way, which leads to common insights and shared concepts [1]. As of 2018, VGI is one of the most used types of SDSs in the geographical domain.
The term Ambient Geographic Information (AGI) has been introduced by Stefanidis et al. [9]. People often leave ‘footprints’ by using social media, that is, ‘passively contributed’ data. Burke et al. [22] discuss the concept of ‘participatory sensing’, which is strongly related to the idea of AGI: mobile devices collect data and share them automatically. Thereby, opportunities and potential uses of such data become apparent, e.g., in public health, urban planning, cultural identity, and natural resource management [22]. Further critical reflection refers to the context, which is often unknown in the case of AGI [23]. The ability of AGI to communicate places through a description of how people experience them is, e.g., limited by the missing context in which the descriptions were created.
The term Public Participation Geographic Information Systems (PPGIS) has been used at least since 1996 when it was coined in a workshop [10]. Originally, it was used to refer to GISs that foster the ‘public involvement in the definition and analysis of questions tied to location and geography’, among others, ‘to incorporate public knowledge from multiple sources into decision frameworks’ that are ‘now primarily in the hands of expert managers of data-driven technologies’ [10]. Sieber [24] summarized the historical development of the concept of PPGIS. Thereby, she highlighted the main factors of PPGIS in terms of how people interact, which technology they use, how they collect and use the data, and the outcomes and evaluation of PPGIS. Despite using different terms, PPGIS has some commonalities with citizen science [25,26,27,28,29].
In this article, we will refer to three exemplary prototypes, which will be used to discuss methodological questions, in particular RQ3. The terms discussed before (VGI, AGI, and PPGIS) refer to types of SDSs, but as their definitions are fuzzy in nature and inconsistent definitions and naming schemes coexist in the literature [3], there is a need to introduce definitions of prototypes at least for the scope of this article. This helps to avoid confusion about how to define these terms. In contrast to these, a prototype does not describe a set of SDSs, which is why the discussion of the boundaries of such a set is avoided as well. Also, a prototype is not an existing SDS. Each SDS has its unique characteristics, which cannot be summarized to their fullest extent by a prototype. The prototypes rather act as a reference frame when setting SDSs into context. As the considerations are of a merely methodological nature and the prototypes are only used as examples in this context, the considerations provided in the following sections readily allow further prototypes to be incorporated. In the scope of this article, we use three prototypes, which can be distinguished by the following characteristics:
Volunteered Geographic Information (VGI)4. A GSDS that people knowingly and actively contribute to. The contributors are aware of their contribution and have access to large parts of the resulting dataset. They can potentially join the team of organizers.
Participatory Geographic Information (PGI). A GSDS to which people are invited to contribute. The contributors are aware of their contribution but have no access to the resulting dataset. They cannot easily join the team of organizers.
Ambient Geographic Information (AGI). A GSDS to which people contribute unintentionally when performing other activities, often even without being aware of the contribution. The contributors have no access to the resulting dataset. They cannot join the team of organizers.
These prototypes can be distinguished by two main properties: the awareness of the contributor while contributing and the possibility to join the team of organizers (Table 1). While these characteristics allow the three prototypes to be distinguished, the prototypes are more complex and expose far more characteristics. We consider a larger array of characteristics, which relate to how the GSDSs are characterized in the literature.
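The two distinguishing properties can be sketched as a small lookup. This is only an illustrative encoding, assuming that each property is reduced to a yes/no value (for AGI, "often unaware" is simplified to "unaware"); the names `PROTOTYPES` and `matching_prototypes` are hypothetical and do not appear in the article.

```python
# Illustrative only: each prototype is reduced to the two properties named above.
# Simplification (assumption): AGI contributors are treated as always unaware.
PROTOTYPES = {
    # name: (contributor is aware of contributing, contributor can join the organizers)
    "VGI": (True, True),
    "PGI": (True, False),
    "AGI": (False, False),
}


def matching_prototypes(aware: bool, can_join: bool) -> list:
    """Return the names of all prototypes whose two properties equal the given ones."""
    return [name for name, props in PROTOTYPES.items() if props == (aware, can_join)]
```

A combination that matches no prototype, such as an unaware contributor who can join the organizers, yields an empty list, which illustrates that the prototypes act as reference points and not as an exhaustive partition.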
With respect to the term PGI, it should be noted that the terms Public Participation GIS (PPGIS) and Participatory GIS (PGIS) have been used in the literature [30]. The terms differ in the context in which they are used: urban and regional context in developed countries and rural context in developing countries, respectively. In the scope of this article, we use one prototype to refer to any kind of SDS to which people are invited to contribute but to the data of which they have no or only very indirect access. In contrast to the terms PPGIS and PGIS, which include a reference to a system for creating, maintaining, and using the data, we refer to the information itself. This is why we use the term Participatory Geographic Information (PGI)
5.
4. Dimensions for Conceptualizing Shared Data Sources
As has been motivated in the last section, there is a need for a conceptual framework for characterizing individual SDSs in their mutual contexts and for characterizing a set of SDSs. The aspects that characterize an SDS can be arranged into dimensions, i.e., sets with a diverging character, such as low and high values. The dimensions do, however, conform to different scales of measurement. In a first step, we motivate how to conceptualize SDSs and thus which factors can be used to group the dimensions. Subsequently, the dimensions related to these factors are discussed.
4.1. Conceptualizing Shared Data Sources
Shared Data Sources are more than data. They are the result of one or more social processes, in which a number of people are involved. These social processes create the data and render their meaning. Without an understanding of how the data are created by social processes, the data cannot be interpreted as information in a meaningful way.
A formal conceptualization of an SDS needs to consider all parts of the social processes. This includes the creation by the contributors, the data themselves and their meaning, and the use of the data by consumers. The contribution and potentially also the use of the data are often coordinated and structured by an organizer. The factors involved in the process are depicted in Figure 1 and explained in the following:
Contributors. In the case of an SDS, there are a number of people contributing. These people often organize themselves to some degree, resulting in a community structure. The contribution process itself can be of different nature, e.g., of differing difficulty.
Data and information. The collected data are often of social relevance. In case of GSDSs, they describe either a part of the world or the entire surface of the Earth. The data also describe differing temporal extents, and they might be available only for a limited time.
Consumers of the data. The number of people or organizations using the data varies among projects. Also, the data can have a very limited scope of usage, or it can serve for many purposes.
Organizational structure of the project. An SDS can expose various degrees of coordination. For instance, NGOs or governmental organizations might organize such a project more strictly, while a community-based project is often more open to differing types of contribution. Projects also differ in the number of levels of the hierarchy, which is often a result of the evolution of the project.
Organizers of the project. Shared Data Sources are organized by very different kinds of groups and organizations. For instance, OpenStreetMap is organized by its community and the OpenStreetMap Foundation, and Twitter is organized by the company Twitter Inc. While the contributors can easily become part of the organizer in the case of OpenStreetMap, this is much harder to achieve in the case of Twitter, where one would have to apply for a position at Twitter Inc. Also, organizers have very differing intentions. Some intend to make money while others act more altruistically.
Many aspects related to these factors have been discussed in the literature. As an example, Elwood et al. [4] have discussed who contributes to GSDSs, which motivations lie behind the contributions, and what legitimates them. Further, it has been discussed how contributors collaborate [4,39]. It has been explored how taxonomies become folksonomies as a result of collaboration [1,40]. Trust has been discussed as a source of collaborative behaviour [41], and trust has been examined as a result of social interaction [42,43]. Further, Harvey [11] has examined to what extent contributions to GSDSs are actually volunteered.
In subsequent sections, we discuss dimensions related to these factors. There exist numerous aspects that can be used to characterize SDSs. Due to this multiplicity, a choice needs to be made as to which dimensions to include. We have selected dimensions that discriminate well between SDSs while also aiming to cover all major factors that render an SDS. Further factors exist and could easily be added, e.g., about the contributors’ motivations or the ways users are able to access the data. The methodological means described in Section 5 are exemplified by referring to this choice of dimensions, but introducing additional dimensions would not alter the general way in which the methodological means work. The description of each dimension consists of a name, questions or a short summary defining the dimension, and one or two practical examples illustrating the dimension.
4.2. Dimensions Related to the Contributor
Type of contributor. Who contributes? Can everyone contribute or is only a person with strong technical knowledge able to contribute?—Everyone can contribute to Twitter, while only ‘techies’ are able to contribute to platforms that collect sensor data, such as OpenSenseMap.
Intention of the contributor. Why do people contribute? Do they contribute to share or support, or do they contribute for some other less altruistic reason?—People contribute to OpenStreetMap mostly because they want to share knowledge and use open geo data, while people use Facebook for more personal benefit without having user analytics in mind.
Awareness of the organizers’ intention. Are the contributors aware of the intention of the organizers?—People contributing to OpenStreetMap or the HOT community are aware of what the data is used for, while most users are unaware of their mobile phones collecting and sharing telemetry data, and of the corresponding intention of the company behind the data source.
Awareness of the contribution. Are the contributors aware of their contribution? In particular, are the contributors aware of how the data will be used?—Despite the fact that people writing Twitter messages are aware of writing a message, they have only little awareness of contributing to a dataset that can be accessed at a larger scale.
Effort of contributing. Is it easy to contribute? Do formal or technical issues make it hard or cumbersome to contribute?—Sharing information via social media is easy, while sharing data using Wikidata is more complex.
Type of the contributions. Is the information about general and long-lasting facts, or is it about short-time and individual events only?—Contributions in OpenStreetMap describe the general environment, while contributions in OpenSenseMap refer to discrete points in time at which an observation is made—temperature, e.g., changes more quickly than the location of a street.
4.3. Dimensions Related to the Data and Information within the Project
Factuality of the contribution. Are the contributions about factual information that everyone can agree on, or are they based on personal perception?—Wikipedia aims to collect only factual knowledge, while opinions, personal reflections, and feelings are often shared on social media.
Temporal extent of the project. How long is the project intended to exist?—OpenStreetMap aims to exist for an unlimited time, while the HOT platform consists of subprojects of both limited and unlimited temporal extent.
Temporal extent of the data usage. How long will the data of the project be available?—In contrast to the existence of a project, the data collected in a project is sometimes only accessible for a shorter period, e.g., in case of civic issue tracking platforms.
Spatial extent. What is the spatial extent of the project? Is the project only of local nature, does it concern a country or continent, or is it a worldwide project?—Information in social media is potentially of global nature, while civic issue tracking platforms often focus on a single city.
4.4. Dimensions Related to the Consumer of the Data
Targeted beneficiary. Who shall mainly benefit from the SDS, according to the intention of the organizer? Do, by intention, the organizers themselves benefit, or a specific group of people, or even everyone?—OpenStreetMap aims to provide information to everyone, while open civic issue tracking platforms are targeting local citizens as beneficiaries.
Verifiability (of the content) of the contribution. Can the contributed information be verified independently in an objective way?—Wikipedia data needs to be both verifiable and referenced, while the data of the OpenSenseMap can hardly be verified as it is impossible to remeasure environmental data of the past.
Consumer of the data. Who is able to access large parts of the data and actually also uses them? Is it some organization or NGO only? Do the contributors themselves consume the data? Or can potentially everyone consume the data?—Data of civic issue tracking platforms are mostly consumed by the local government, while Wikipedia data can be consumed by everyone.
4.5. Dimensions Related to the Organizational Structure
Level of organization. Is the community (including the organizers) organized by a strong and distinctive hierarchy?—The hierarchical structures in OpenStreetMap are much less distinctive than in Wikipedia. In the former, there exist basically only the roles of non-authenticated readers, contributors, moderators, and administrators with root access to the servers, whereas the latter additionally has the roles of extended contributors, administrators, bureaucrats, changes reviewers, rollbackers, account creators, and so forth.
Organizational engagement. How strongly does the organizer lead and engage with the contribution process?—In OpenStreetMap, the organizer informally recommends which semantics to use for a contribution, while in Facebook, the organizer mainly concentrates on attracting new users.
4.6. Dimensions Related to the Organizer
Type of organizer. Which formal status do the project and its organizer have? Is the project organized by, e.g., a company or by a community?—Facebook is organized by a company, while Wikipedia is organized by a foundation and a subgroup of the contributors.
Effort of joining the organizer. How open is the organizer to new people being involved in the organizational process? Does the team of organizers welcome new people who would like to organize, or is it nearly impossible to join the team?—People can join the organizing team of OpenStreetMap without much effort because the organizational work is, by and large, carried out via open mailing lists and the OpenStreetMap Wiki, while it is much harder to join a company organizing commercial social media services.
Size of organizer. How many people are part of the organizing team? Is the project organized by one person only or is it organized by a larger group of people?—Projects like Wikipedia and OpenStreetMap are organized by a large number of people, while civic issue tracking platforms are often organized by only a few people.
Specificness of the organizer’s intention. What is the motivation of the organizer to establish and run the project? Is it a very specific one like using the data for a certain well-defined purpose, such as making money, or is it rather a general one of providing useful data?—OpenStreetMap aims at providing geo data for multiple purposes, while civic issue tracking platforms focus on very specific purposes.
Commercial orientation of the organizer. Is the motivation of the organizer a commercial or a non-commercial one?—Wikipedia and OpenStreetMap are not organized with a commercial intention, while many social media platforms like Twitter and Facebook have a commercial background.
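The dimensions of Sections 4.2 to 4.6 can be collected in a simple data structure, grouped by the factor they relate to. The following is a minimal sketch: the dimension names are taken from the text above, whereas the dictionary layout and the names `DIMENSIONS_BY_FACTOR` and `all_dimensions` are assumptions made for illustration.

```python
# Dimension names follow Sections 4.2-4.6; the grouping into a dictionary is
# an illustrative assumption, not a structure prescribed by the article.
DIMENSIONS_BY_FACTOR = {
    "contributor": [
        "Type of contributor",
        "Intention of the contributor",
        "Awareness of the organizers' intention",
        "Awareness of the contribution",
        "Effort of contributing",
        "Type of the contributions",
    ],
    "data and information": [
        "Factuality of the contribution",
        "Temporal extent of the project",
        "Temporal extent of the data usage",
        "Spatial extent",
    ],
    "consumer": [
        "Targeted beneficiary",
        "Verifiability of the contribution",
        "Consumer of the data",
    ],
    "organizational structure": [
        "Level of organization",
        "Organizational engagement",
    ],
    "organizer": [
        "Type of organizer",
        "Effort of joining the organizer",
        "Size of organizer",
        "Specificness of the organizer's intention",
        "Commercial orientation of the organizer",
    ],
}


def all_dimensions() -> list:
    """Flatten the grouping into one ordered list of dimension names."""
    return [name for names in DIMENSIONS_BY_FACTOR.values() for name in names]
```

Adding a further dimension then amounts to appending a name to the list of the factor it belongs to, which leaves code that iterates over `all_dimensions()` unchanged.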
5. Exploration and Visualization of Shared Data Sources
The presented framework aims to facilitate a fine-grained comparison and analysis of SDSs, allowing them to be explored in different visual and analytical ways. In particular, methods like parallel coordinates, spider diagrams, correlation matrices, and clustering techniques can be utilized to get a better sense of where and how SDSs resemble each other and differ. In this section, we first discuss how to practically describe SDSs by the introduced dimensions (Section 5.1). Then, we discuss several ways of exploring SDSs in their entirety, both by setting them mutually into context and by setting them into the context of the three prototypes VGI, PGI, and AGI (Section 5.2, Section 5.3 and Section 5.4). Finally, we discuss ways of analysing the dimensions that are used to set the SDSs into context (Section 5.5 and Section 5.6). It should become clear from the description of the framework that additional dimensions and SDSs can be taken into consideration. The aim of this section is to discuss and illustrate how to apply the framework rather than to explore the SDSs by the chosen dimensions. The findings and results of applying the framework are discussed in Section 6.
5.1. Collection and Preparation of the Data
The authors of this article have collected a list of GSDSs. Although the list is not complete (there exist too many GSDSs, and new ones appear and disappear on a regular basis), it contains SDSs with various characteristics. To the best of our knowledge, no type of SDS has been left out. However, the list is limited by our perspectives on GSDSs. The GSDSs have been grouped into several thematic categories, which are widely used and accepted: augmented reality/games, citizen science, civic issue tracking, crowd-sourced sensor data, mobility trajectories, OSM-related SDSs, social media, and the Wikipedia ecosystem. It should be noted that this categorization is by no means the only possible one.
Each of the SDSs has been characterized according to the dimensions that have been discussed in Section 4. As a result, each SDS is described by a textual description for each dimension. In the case of some SDSs, it is hard to unambiguously assign a single value to a certain dimension because the data source itself is too heterogeneous. In these cases, the value that characterizes the data source best has been chosen. The textual description has then been converted to a numerical value. As a result of the characterization, for each data source s, there exists a value for each dimension i, which lies between 0 and a maximum.
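The conversion from a textual description to a numerical value can be sketched as an ordinal encoding. The example below is a minimal sketch under stated assumptions: the levels of the dimension ‘Spatial extent’ are borrowed from its description in Section 4.3, but their ordering, the numeric codes, and the names `SPATIAL_EXTENT_LEVELS`, `encode`, and `maximum` are hypothetical and not taken from the article.

```python
# Hypothetical ordinal scale for one dimension. The article does not list the
# levels or their numeric codes; these are assumptions for illustration.
SPATIAL_EXTENT_LEVELS = ["local", "country", "continent", "worldwide"]


def encode(level: str, levels: list) -> int:
    """Map a textual level to its position on the scale (0 .. len(levels) - 1)."""
    return levels.index(level)


def maximum(levels: list) -> int:
    """Largest value that can occur for a dimension with the given levels."""
    return len(levels) - 1
```

With such an encoding, every value of a dimension lies between 0 and the maximum of that dimension, and different dimensions may have different maxima.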
As a result of the characterization of an SDS by linguistic prototypes, each dimension is described by a discrete value. While numbers are usually regarded as being ‘crisp’, textual representations are mostly regarded as exposing some fuzziness. By translating from a textual description to a numerical one, this fuzziness is hidden to some degree. When interpreting the data, one should, however, be aware of this fuzziness. In some cases, elements of the visualization are hidden by other elements, which is the result of the discrete scale. The values characterizing an SDS have thus, for some visualizations, been slightly randomized in order to avoid such situations in which visual elements remain hidden.
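The slight randomization used to avoid hidden visual elements can be sketched as follows. This is a minimal sketch, assuming uniform noise of a small fixed width and clamping to the valid range; the article does not specify how the randomization was performed, and the function name `jitter` and its parameters are hypothetical.

```python
import random
from typing import Optional


def jitter(value: float, maximum: float, amount: float = 0.1,
           rng: Optional[random.Random] = None) -> float:
    """Shift a discrete value by uniform noise in [-amount, amount].

    The result is clamped to [0, maximum] so that it stays on the scale of
    the dimension. Noise width and clamping are assumptions for illustration.
    """
    rng = rng or random.Random()
    shifted = value + rng.uniform(-amount, amount)
    return min(max(shifted, 0.0), maximum)
```

Passing a seeded `random.Random` instance makes a visualization reproducible. The jittered values are meant for display only; statistical analysis would still use the original discrete values.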
In addition to the SDSs, prototypes have also been described by referring to these dimensions. Such prototypes are typical for a cluster of SDSs with similar characteristics. A prototype does not, however, reflect an existing SDS but rather exposes characteristics similar to most of the SDSs of the cluster. In subsequent sections, SDSs are both visually and statistically compared to these prototypes.
5.2. Setting Shared Data Sources Into Context by a Trilinear Graph
Two-dimensional space can represent at most two aspects independently. It is thus impossible to naturally embed all dimensions of SDSs into two- or even three-dimensional space. Instead, one might set the SDSs into the context of prototypes, which reduces the number of required dimensions of the space used for the visualization. An SDS is accordingly not explicitly characterized by its dimensions but rather by its similarity (or dissimilarity) to each prototype. To further reduce the number of involved dimensions, we will visualize the data in projective space, which means that only the proportions but no absolute values are kept.
Each data source is compared to each prototype. For this purpose, the value of each dimension of the data source is subtracted from the value of the corresponding dimension of the prototype. These comparisons can be used to characterize each data source relative to the prototypes. In the visualization, each prototype shall be represented at a corner of a triangle. Each data source is positioned relative to the coordinates of these corners, with its own coordinates derived from the comparisons to the respective prototypes. As a result, a pair of two-dimensional coordinates is assigned to each of the data sources.
Even though the difference between a prototype p and itself vanishes for every prototype p, the prototypes themselves are not assigned to the corners of the triangle. A prototype shares certain characteristics with other prototypes, which is why, for two prototypes p and q, the difference between p and q is not maximal in general. Thus, each prototype is assigned coordinates in the interior of the triangle (Figure 2a). In Figure 2b, a linear transformation has been applied to the coordinates such that the prototypes are reprojected to the corners of the triangle. The SDSs are mostly located within the triangle but can, due to the reprojection, also be located outside of it. As the locations of the respective prototypes in the visualization form an equilateral triangle, the visualization is a trilinear graph [44], which we call the ‘Triangle of Shared Data Sources’ for short.
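The positioning of a data source within the triangle can be illustrated by a rough sketch. Everything specific in it is an assumption made for illustration and not necessarily the computation behind Figure 2: the corner coordinates, the aggregation of dimension-wise differences by a sum of absolute values, the conversion of differences into similarities, and the hypothetical prototypes. The sketch only reproduces the qualitative behaviour described above, namely that only proportions are kept and that a prototype ends up in the interior of the triangle before any reprojection.

```python
import math

# Corners of an equilateral triangle, one per prototype (illustrative coordinates).
CORNERS = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]

def difference(source, prototype):
    """Sum of absolute dimension-wise differences (an assumed aggregation)."""
    return sum(abs(s - p) for s, p in zip(source, prototype))

def triangle_position(source, prototypes, maxima):
    """Place a source between the corners according to its relative similarity."""
    largest = sum(maxima)  # largest possible difference for values in [0, maximum]
    similarities = [largest - difference(source, p) for p in prototypes]
    total = sum(similarities)
    if total == 0:
        shares = [1 / 3] * 3  # maximally dissimilar to all prototypes: centre
    else:
        shares = [s / total for s in similarities]  # only proportions are kept
    x = sum(w * cx for w, (cx, _) in zip(shares, CORNERS))
    y = sum(w * cy for w, (_, cy) in zip(shares, CORNERS))
    return x, y

MAXIMA = [4, 4, 4]                              # hypothetical maximum per dimension
PROTOTYPES = [[4, 0, 0], [0, 4, 0], [0, 0, 4]]  # hypothetical prototypes

print(triangle_position([2, 2, 2], PROTOTYPES, MAXIMA))       # equally similar to all
print(triangle_position(PROTOTYPES[0], PROTOTYPES, MAXIMA))   # a prototype itself
```

In this sketch, a source that is equally similar to all three prototypes is placed at the centre, and the first prototype is placed inside the triangle rather than at its corner, because it still shares some similarity with the other two prototypes.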
Trilinear graphs allow for a visual clustering of SDSs. Similarities can easily be recognized and the evolution of a dataset can be visualized as a trace in the graph. However, different characteristics of an SDS can lead to similar locations in the graph. In particular, an SDS that is very similar to all prototypes cannot be distinguished from one that is very dissimilar to all of them. In both cases, the SDS would be depicted in the centre of the graph.
5.3. Examination of Individual Shared Data Sources by Parallel Coordinates
Parallel coordinates are a technique to represent a larger number of dimensions in two-dimensional space [45]. The dimensions are not naturally embedded but rather displayed independently (Figure 3), which allows for a comparison of more dimensions than the embedding space has. The dimensions are represented by axes aligned in parallel. Each data source accordingly corresponds to several locations, one on each axis. These locations are joined by a polygonal line to allow for a visual clustering of SDSs with similar characteristics. Parallel coordinates thus provide a good opportunity to identify similarities between SDSs with respect to a smaller subset of all dimensions, namely dimensions that are depicted on neighbouring axes. On an interactive display, the possibility to reorder the axes even aids the comparison of two axes that are not neighbouring by default. Parallel coordinates are of limited use when displayed statically, as is the case in Figure 3. When displayed interactively on a website, however, they are a very useful technique.
5.4. Visual Clustering of Shared Data Sources by Spider Charts
Spider charts are a variant of parallel coordinates [44]. Instead of aligning the axes in parallel, the axes are circularly arranged (with uniform angle) and share a common starting point at the centre (Figure 4 and Figure 5). Such a layout makes each value less readable because the axes are oriented differently. When the points on the axes are, however, connected by a polygonal line, the overall characterization of the SDS can easily be recognized by the shape of that line. As a result, the spider chart allows for a visual clustering of SDSs.
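The layout of a spider chart can be made concrete by a short sketch that computes the vertices of the polygonal line. The function name `spider_vertices`, the scaling of each axis to unit length, and the example values are assumptions made for illustration.

```python
import math

def spider_vertices(values, maxima):
    """Return the polygon vertices of one data source in a spider chart.

    Axis k points in direction 2*pi*k/n; each value is scaled by the maximum
    of its dimension so that every axis has unit length.
    """
    n = len(values)
    vertices = []
    for k, (value, maximum) in enumerate(zip(values, maxima)):
        angle = 2 * math.pi * k / n   # uniform angle between neighbouring axes
        radius = value / maximum      # distance from the common centre
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices

# Hypothetical data source with four dimensions, each ranging from 0 to 4.
polygon = spider_vertices([4, 2, 0, 1], [4, 4, 4, 4])
print(polygon)
```

Joining these vertices in order, and closing the line back to the first vertex, yields the shape by which the data source can be recognized.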
5.5. Correlation Analysis for Examining the Dimensions
The dimensions described in Section 4 are correlated to some degree. For instance, an organizer with clear financial interests might, at least statistically, not have a strong interest in involving contributors in the organizational process because this could mean sharing the revenue. An organizer without financial interests is, however, in many cases interested in involving contributors because the workload can be shared. Correlations between several dimensions can lead to a bias when characterizing SDSs. At the same time, correlations can also indicate similarities between a number of SDSs because they share similar characteristics. A correlation analysis of the dimensions (with respect to the considered SDSs) can, accordingly, provide insights about the interpretation and explanatory power of other methods that refer to these dimensions.
The correlation of several dimensions can be examined statistically. As there is no reason why the dimensions should be strictly linearly related, the dimensions need to be examined for various kinds of correlation. The Pearson correlation coefficient is widely used but measures linear correlation only. This is why we employ Kendall’s rank correlation coefficient τ. This coefficient has the advantage of being able to detect non-linear, monotonic relations. Compared to Spearman’s rank correlation coefficient, Kendall’s rank correlation coefficient is regarded as being more robust [46]. To account for the different value ranges of the dimensions, which could lead to significant bias in the results, the data were rescaled to a common value range (the unit interval) prior to analysis. The resulting correlation is visualized as a heat map in Figure 6.
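The two steps described above, rescaling to the unit interval and computing Kendall’s rank correlation coefficient, can be sketched as follows. The sketch uses the tau-b variant, which accounts for ties; this choice is an assumption, motivated by the fact that discrete values produce many ties, and the variant used for Figure 6 is not stated. Function names and example data are hypothetical.

```python
import math
from itertools import combinations

def rescale(values):
    """Rescale values linearly to the unit interval (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def kendall_tau_b(xs, ys):
    """Kendall's tau-b: rank correlation with a correction for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied in both dimensions: ignored
        elif dx == 0:
            ties_x += 1           # tied only in the first dimension
        elif dy == 0:
            ties_y += 1           # tied only in the second dimension
        elif dx * dy > 0:
            concordant += 1       # both dimensions order the pair the same way
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt((untied + ties_x) * (untied + ties_y))

# Hypothetical values of two dimensions: non-linear but monotonic relation.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]
print(kendall_tau_b(rescale(xs), rescale(ys)))  # 1.0
```

In practice, a library routine such as `scipy.stats.kendalltau` can be used instead of a hand-written implementation.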
A correlation of two or more dimensions can originate from many different scenarios. For instance, dimensions can be causally related. High values in one dimension may, e.g., cause high values in another dimension. However, dimensions can also be correlated without any causal relationship, i.e., the correlation can be spurious. Given these various origins of correlation, it makes sense to examine in more detail why a correlation has been detected. In Figure 7, two pairs of dimensions are displayed in detail. As before, the data are slightly randomized in order to make identical values visible. Although both pairs of dimensions expose a similarly strong correlation, more detailed information can be derived from this type of visualization.
5.6. Hierarchical Clustering of Shared Data Sources
The clustering of SDSs into categories can be achieved in various ways: by experience and common sense; by a visual exploration of their characteristics as is, e.g., discussed in Section 5.4; or based on a statistical analysis. The latter way has the advantage of not relying on human judgement and thus of potentially creating more fine-grained categorizations while avoiding human bias. In Figure 8, Figure 9 and Figure 10, a hierarchical clustering of GSDSs is shown. For generating these figures, the GSDSs have been (hierarchically) clustered based on how similar they are. For the hierarchical clustering, the distances between the clusters were calculated using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm [47] in combination with the average Manhattan metric for determining the distance between two clusters. The UPGMA algorithm is widely employed because it is robust towards outliers. The similarity of the SDSs has been measured by their characteristics expressed in terms of the previously introduced dimensions (Figure 8) as well as by their similarity to the three prototypes AGI, PGI, and VGI (Figure 9). Such clustering can also be performed for a subgroup of dimensions, e.g., the ones related to the organizer (Figure 10).
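The clustering procedure described above can be sketched in a few lines. This is a minimal, unoptimized illustration of UPGMA with the Manhattan metric, not the implementation used for the figures; the function names and the example data are hypothetical.

```python
from itertools import combinations

def manhattan(a, b):
    """Sum of absolute dimension-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def upgma(points):
    """Agglomerative clustering with average linkage (UPGMA).

    Returns the merge steps as (members_a, members_b, distance), where the
    members are indices into `points`.
    """
    def average_distance(c1, c2):
        # Arithmetic mean of all pairwise distances between the two clusters.
        total = sum(manhattan(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters that are closest on average.
        a, b = min(combinations(clusters, 2), key=lambda pair: average_distance(*pair))
        merges.append((a, b, average_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical data sources described by two dimensions each.
merges = upgma([[0, 0], [0, 1], [4, 4], [4, 6]])
for step in merges:
    print(step)
```

The distances recorded for the merge steps correspond to the heights at which groups are joined in a dendrogram. For larger datasets, a library routine such as `scipy.cluster.hierarchy.linkage` with `method="average"` and `metric="cityblock"` is a more efficient alternative.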
The similarity between an SDS and a prototype can be measured in different ways. Among the metrics commonly used to measure such similarity in terms of the dimensions are the Euclidean metric, the Manhattan metric, cosine similarity, and Kendall’s rank correlation coefficient. The Euclidean metric, defined for two vectors $x$ and $y$ as $d_{\mathrm{E}}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$, computes the shortest distance between the vectors as we know it from the space we live in. This metric only makes sense if the space exposes some concept of neighbourhood because the dimensions are set into their mutual context. This is the case when regarding the space spanned by the previously introduced dimensions as a conceptual space [38]. Accordingly, the differences in the dimensions do not just add up but are examined in combination. This is in contrast to the Manhattan metric, which is defined as $d_{\mathrm{M}}(x, y) = \sum_i |x_i - y_i|$. The Manhattan metric considers the difference in each dimension separately and is thus meaningful also for ‘abstract’ spaces that do not expose the concept of neighbourhoods other than separately for each dimension. The cosine similarity differs from these two metrics by only considering the angle spanned by the two vectors, not their magnitude. Instead of comparing each dimension independently, as is the case with the Manhattan metric, the cosine similarity puts emphasis on the combination of the dimensions. It is defined by $\cos(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \, \sqrt{\sum_i y_i^2}}$ and requires the dimensions to be rescaled such that the distribution of the respective values is centred around 0. As a result, the cosine similarity only considers how the difference in one dimension relates to the difference in another one, making it particularly useful when the dimensions shall be compared relatively rather than absolutely. Finally, Kendall’s rank correlation coefficient τ derives from the intuition of measuring whether the rank of each dimension is about the same for two vectors (Figure 9).
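The first three metrics can be sketched directly from their standard definitions. The function names and example vectors are chosen for illustration; the vectors passed to the cosine similarity are assumed to have been centred around 0 beforehand.

```python
import math

def euclidean(a, b):
    """Straight-line distance: differences are combined across dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Differences are considered separately per dimension and added up."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors; magnitude is ignored."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical (already centred) vectors of an SDS and a prototype.
sds, prototype = [3, 4], [0, 0]
print(euclidean(sds, prototype))            # 5.0
print(manhattan(sds, prototype))            # 7
print(cosine_similarity([1, 0], [0, 1]))    # 0.0
```

The example shows that the same pair of vectors is assigned a larger distance by the Manhattan metric than by the Euclidean metric, because the former adds up the differences per dimension.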
The results of the clustering are visualized by a dendrogram, which is attached to a heat map that represents the results of the similarity metrics. The dendrogram groups SDSs that are very similar. Such groups and further SDSs can, in turn, be grouped again. The resulting groups are depicted by horizontal lines, whose vertical position refers to the similarity of the grouped SDSs. While the dendrogram provides an overview of the hierarchical clustering, the heat map can be used to retrace why the algorithm has clustered in the way it did. The results of the statistical analysis and the visualizations, as well as the insights about the methods and the SDSs gained from this and from previous methods, are discussed in the next section.
7. Conclusions
This article aims to provide a lens on new and collaborative forms of geographical data sources. We have introduced and coined the notions of ‘Shared Data Sources’ (SDS), ‘Geographical Shared Data Sources’ (GSDS), and ‘Participatory Geographic Information’ (PGI). Thereby, we have discussed the need for a conceptual framework for describing SDSs, which derives from incorporating dimensions to characterize different types of SDSs. A number of dimensions for conceptualizing SDSs have been introduced, among them dimensions related to the contributors, the data and information, consumers, the organizational structure, and the organizers. Finally, we have introduced tools and instruments to examine SDSs in their entirety, leading to different lenses through which we can learn about and make sense of such data sources.
The provided tools—visualizations and statistical analysis—allow for an examination of a set of SDSs in its entirety but an examination of differences and similarities within VGI, PGI, AGI, and similar prototypes would be of interest for future research. Categories of SDSs similar to AGI can easily be distinguished from those dissimilar to AGI. It seems, though, to be much harder to distinguish between VGI and PGI-related SDSs in the same way. Future research might focus on how SDSs can be distinguished better in terms of these prototypes and what hinders us from doing so at the moment. In particular, it may be discussed why the categories, prototypes, and characteristics discussed in this article are compatible and which limitations exist for this compatibility.
A number of dimensions have been introduced to characterize SDSs. Given some desired characteristics, can these dimensions be used to construct new forms of SDSs? For instance, can we derive from the desired characteristics what the nature of the social process creating, maintaining, and using the data should be? If not, which characteristics contradict one another and hinder us from such a construction? The reasons behind these contradictions may even provide clues about further correlations and characteristics of SDSs. Also, one may ask which parts of the conceptual space remain ‘unused’ so far and would thus give rise to new types of SDSs.
When analysing SDSs and making sense of them in their entirety, data about the SDSs are needed, in particular, data about the contributions and the resulting data, about the consumers of the data, and about the roles of the organizers. The data used in this article have been collected by ourselves, which poses limitations to their interpretation and creates biases. Future research might explore how these limitations and biases can be characterized and how they can be avoided. In particular, the views of organizers, contributors, and users could be incorporated, which would create different biases. Also, it would be interesting to examine how such biases influence the resulting analysis. Such research would ideally build upon the methodology of the social sciences, requiring very different perspectives than the one used in this article.
Some SDSs are too heterogeneous to be described in a meaningful way, or too broad to be properly demarcated from one another. Among these examples is the Internet, which is very diverse and heterogeneous in its nature. Another example is Linked Open Data, which forms, by definition, a web of statements. These statements are created by various people and organizations, and they share common vocabularies. These common vocabularies and semantics make it possible to relate the statements to one another, leading (more or less) to only one big and heterogeneous data source. Future research might explore how such heterogeneous and broad SDSs can be conceptualized and incorporated into the analysis.
The discussed characterizations allow for making sense of SDSs. Further research might discuss structural differences between GSDSs and SDSs in general. Also, the characterization by the ‘Triangle of Shared Data Sources’ allows for an examination of the temporal development of an individual SDS. Having examined several such trajectories, one might conclude how types and prototypes like VGI, PGI, and AGI evolve over time and, in addition, how our categorization into categories such as augmented reality/games, citizen science, civic issue tracking, crowd-sourced sensor data, and social media reacts to this temporal evolution. Finally, such understanding might make it possible to trace or even predict the future development of SDSs, or of a prototype like VGI or AGI.