1. Summary
The primary motivation for compiling the presented datasets comes from the needs of a project on the spatial differentiation and visualization of geodemographic processes in Czechia over the last 25 years, with a focus on households in an aging society. The project goal is to explore various geodemographic processes at a very detailed level, specifically at the level of local administrative units 2 (LAU2) associated with the Nomenclature of Territorial Units for Statistics (NUTS), commonly used in the European Union for statistical purposes [1]. While most national statistical data, especially those coming from censuses, are published for clearly defined administrative units [2], spatial analyses and visualization require these units in the form of spatial data. It is not complicated to obtain the current spatial data of administrative boundaries and the corresponding statistical indicators for a selected year. However, exploring trends over a longer period (e.g., five, ten, or twenty years) makes preparing and harmonizing spatial and statistical data a very complex task. When we confronted our project vision with the actual state of the data, we encountered key issues we had to solve:
“How can we display a given geodemographic indicator in a single large-format map considering all the changes in municipalities’ spatial composition?”
“Is it even possible to explore trends of given geodemographic indicator in certain regions while we do not have unified municipal units?”
“Will geodemographic changes among municipalities and between years even be comparable?”
The latter issue is also explicitly mentioned in [2] as one of the crucial limitations of statistical data, i.e., the lack of mutual comparability due to frequent changes of administrative or statistical units. Efforts exist to substitute administrative boundaries with regular grid structures (e.g., the GEOSTAT population grid), and methods for transferring grids into administrative units and vice versa are available (e.g., [3,4]) and have been used in several scientific studies (e.g., [5,6]). Nevertheless, to the authors’ knowledge, no country in Europe yet deploys grids as an official approach to socioeconomic data curation, from collection, management, and analysis to visualization. Moreover, as mentioned by [7], analyses and results should correspond with administrative boundaries as an essential requirement for real-world applications of scientific conclusions (e.g., [8,9,10]). In other words, measures taken to address, e.g., changes in the elderly population will never be targeted at specific grid cells; instead, policy powers will be applied within the respective administrative boundaries. Therefore, our approach to data harmonization focused on keeping administrative boundaries for subsequent analysis and visualization.
Our presented dataset unifies municipal administrative units, allowing analyses of data as they change over time. The added value of the dataset does not lie in a novel use of spatial methods (we used a purposeful sequence of necessary individual tools) but in the final product in the form of a universal dataset. The data were curated with extremely high precision in order to meet the requirements of our project colleagues, demographers, who work with data on individual citizens of Czechia (a non-public specialized database that the demographers obtained from the Czech Statistical Office, containing detailed information from official state registers about every single citizen). This means, literally, ten million records that ultimately have to correspond with the total sums of municipal statistics. Hence, our treatment and assignment of the spatial representation of municipalities had to be 100% precise. The individual and semi-manual data adjustment was therefore time-consuming, heterogeneous (in terms of the complications that arose), and necessary. Moreover, a number of minor, but significant, data adjustments had to be made. Finally, the article introduces a workflow from an Excel sheet to open data publishing. It is based on cooperation between experts in cartography, geoinformatics, and demography.
We used an approach based on the principle of a “common spatial denominator”, i.e., we aggregated data into larger units with stable boundaries. Although we lost spatial detail to increase the time extent (more about these drawbacks in, e.g., [2,11]), the benefit of being able to use all statistical data from 2019 backwards prevailed. With regard to the data itself, we obtained (a) a list of codes of municipalities for each year from 1995 to 2019, and (b) spatial data with administrative boundaries and their centroids for selected years only (1996, 2000, 2002, 2003, and each year from 2009 onwards).
Additionally, the spatial data came in two different generalization levels: older data (before 2009) with coarser administrative boundaries, and newer data with high-precision boundaries. Most of the spatial data centroids also contained municipality IDs (compatible with the list of codes); however, these were not available for the years with missing spatial data. More about spatial data curation is in the Methods section. With regard to the non-spatial data, i.e., the list of codes giving municipal IDs and information about the absolute number of units, Table 1 provides an overview of their development in the observed period (1995 to 2019). Please note that Table 1 shows the numbers of official (and existing) municipalities in particular years.
However, we identified four broad problems during data processing that many researchers are likely to share. These problems may arise when dealing with administrative boundaries in general:
- (1) Splitting features: administrative units were divided over time. Table 1 shows that the number of administrative units increased from 6232 in 1995 to 6258 in 2019, an absolute difference of 26 units. In general, this number is not significant (given the total of more than 6000 units); however, changes were detected in more than 70 cases over the period due to other discrepancies described further below. This first problem, which requires identification and recoding of administrative units, is illustrated in Figure 1.
- (2) Merging features: administrative units were merged over time. Although the changes in the absolute number of municipalities (Table 1) were not dramatic, we have to keep in mind that the division of administrative units (mentioned in point 1) was masked by the merging of other ones. The merged administrative units had to be found and the spatial data corrected in order to maintain the “common denominator” principle. Moreover, we had to select the ID code to be kept in the data. The second problem is illustrated in Figure 2.
- (3) Re-indexing features: administrative units changed their ID codes. In several cases, municipalities changed their official status (due to changes in the systematic classification of settlement units) and had to be recoded (Figure 3).
- (4) Topology mismatch: the spatial detail of administrative boundaries changed. As mentioned before, the spatial data were provided in two different levels of detail in terms of administrative boundary precision (Figure 4), which complicated the spatial treatment of the data. For instance, due to the non-corresponding topology of boundaries, some of the originally intended geographical information system (GIS) tools could not be applied (e.g., the “select by location” tool), and “spatial join” remained the only option. In general, any spatially based calculations, such as areal measurements or choropleth map creation (when the area is needed), would be inaccurate. Therefore, this problem had to be solved as well.
Although all of the problems mentioned above concerned approximately 70 municipalities, they could not be ignored. First, in several cases, they involved large towns and cities or military areas. Excluding them from the dataset would therefore diminish the quality of the data and of the subsequent analysis and visualization. Second, the geodemographic dataset intended for joining with the spatial data was obtained from the Czech Statistical Office, with extremely detailed characteristics of individual inhabitants. For that reason, all calculations and aggregations to administrative units had to be performed with 100% precision (in terms of total counts of inhabitants). Moreover, several specific “spatial” problems had to be treated individually (see details in Section 3.1). To summarize the final output, we prepared a dataset consisting of the spatial representation of aggregated administrative units valid for the period from 1995 to 2019, and a non-spatial part in the form of a correspondence table, where the “old” (former) and “new” (based on aggregation) ID codes of municipalities are listed. We gave this dataset the working title “universal” or “superlayer” (we use “universal” for the rest of the manuscript).
The dataset allows users to take statistical data from any year in the period and link them with the spatial representation of administrative boundaries. The most significant advantage of the universal dataset is that users can (1) compute derived indexes from any statistical data from 1995 onwards, and (2) analyze the time variability and trends of such data without further need for spatial data curation. An example combining both benefits is Figure 5, which depicts the overall trend of the vital index [12] in Czech municipalities from 1995 to 2018. Without the universal layer, this would not be possible without losing valuable information about some municipal units (by excluding them from the map). Analogously, year-to-year comparisons can be performed simply by joining the statistical data of the given years with the correspondence table and subsequently with the spatial data (more in the User Notes in Section 4).
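The year-to-year comparison described above can be illustrated with a minimal Python sketch. It assumes the correspondence table is available as a simple mapping from former ID codes to universal ID codes; all codes, values, and the function name `aggregate_to_universal` are hypothetical and serve only as an illustration, not as the actual structure of the published table.

```python
from collections import defaultdict

# Hypothetical correspondence table: former ID code -> universal ID code.
correspondence = {
    "500001": "500001",
    "500002": "500001",  # unit that split off later, mapped back to its parent
    "500003": "500003",
}

# Hypothetical yearly statistics keyed by the ID codes valid in each year.
births_1995 = {"500001": 40, "500003": 12}
births_2018 = {"500001": 30, "500002": 8, "500003": 15}


def aggregate_to_universal(values_by_code, table):
    """Sum the values of former units into their universal units."""
    totals = defaultdict(int)
    for code, value in values_by_code.items():
        # A KeyError here signals a code missing from the correspondence table.
        totals[table[code]] += value
    return dict(totals)


agg_1995 = aggregate_to_universal(births_1995, correspondence)
agg_2018 = aggregate_to_universal(births_2018, correspondence)

# Both years now refer to identical spatial units, so they can be compared.
change = {uid: agg_2018[uid] - agg_1995[uid] for uid in agg_1995}
print(change)
```

Because the aggregation only sums values, the national totals before and after the join should stay identical, which gives a simple check of the kind of precision required by the project.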
Similarly, other phenomena (data and layers) can be treated in the way we present in this paper, e.g., areas of geological regions, climatic types, and catchment areas. Moreover, we used a retrospective approach, which was challenging because current archiving and metadata instructions and tutorials differ from those of 15–20 years ago (if there were any at the time) [13]. Therefore, the presented universal dataset can save other researchers, regardless of the topic they deal with, a significant part of the time they would spend on data processing.
3. Methods
This chapter describes the main methodological steps taken during the creation of the universal dataset. The chapter is divided into three sections based on the nature of the data curation. A general overview of the data curation flow is depicted in Appendix A.
3.1. Treatment of the Data from a Spatial Perspective
Because the main goal was to compile a dataset for the analysis and visualization of geodemographic indicators over time, the study started with obtaining the spatial data of Czech municipalities (LAU2). As mentioned above, spatial data were missing for some of the earlier years in our observed period (1995–2019). Therefore, the gap filling started with the year 1996 and proceeded one year backwards (to 1995) and through the following years onwards. In doing so, we checked the changes between years against the CZSO code list and manually changed the administrative boundaries accordingly if municipalities had merged. When municipalities split, we noted the split and recorded their ID codes in the correspondence table. This procedure was repeated consecutively until the year 2019, which ensured that the administrative boundaries (polygons) remained spatially constant throughout the observed period (1995–2019).
Additionally, we received the official data from ČÚZK from 2009 onwards in finer spatial detail (see the differences in Figure 8a), which forced us to decide which spatial data to use. We decided to keep the former (2008 and earlier), more generalized administrative boundaries because the final visualizations were intended to display the whole of Czechia. The finer detail of administrative boundaries would otherwise cause additional cartographic problems in map-making (e.g., rendering errors in the final map due to the output map scale).
As mentioned in the Summary chapter, the spatial data obtained from ČÚZK were in a vector format as administrative boundaries (polygons) and their centroids (points with information on the municipalities’ ID codes). Therefore, it was necessary to link the centroids with the polygons to get the ID codes into the polygons for future connection with other statistical data. From the analytical perspective, the “spatial join” tool served this purpose. In general, the “spatial join” tool transfers attribute information from points to polygons based on their mutual geographical location. However, during the application of “spatial join”, errors emerged when a point carrying the ID code did not fall within the right administrative boundary (see Figure 8). Although these problems occurred in a small number of municipalities, they had to be corrected to fulfil the requirement of 100% correct ID assignment for future matching with tabular data.
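To make the point-to-polygon transfer concrete, the following Python sketch imitates the idea of a spatial join with a plain ray-casting point-in-polygon test. It is not the GIS tool used in the study; the geometries, ID codes, and function names are hypothetical, and real boundaries would need a proper GIS library that handles holes, multipart polygons, and points lying exactly on a boundary.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices (not closed)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Check whether a horizontal ray from (x, y) crosses this edge.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(polygons, points):
    """Collect the ID codes of all points that fall inside each polygon."""
    joined = {name: [] for name in polygons}
    unmatched = []
    for code, (x, y) in points.items():
        hits = [n for n, ring in polygons.items() if point_in_polygon(x, y, ring)]
        if hits:
            joined[hits[0]].append(code)
        else:
            unmatched.append(code)
    return joined, unmatched


# Hypothetical generalized boundaries of two neighbouring municipalities.
polygons = {
    "poly_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "poly_b": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
# Hypothetical centroids; "1002" belongs to poly_b but falls just inside poly_a.
points = {"1001": (2, 2), "1002": (3.9, 2), "1003": (9, 2)}

joined, unmatched = spatial_join(polygons, points)
# Polygons that did not receive exactly one ID code need manual inspection.
suspicious = [name for name, codes in joined.items() if len(codes) != 1]
print(joined, unmatched, suspicious)
```

In this toy case both polygons are flagged: one received two codes and the other none, which mirrors the kind of misplaced centroid that had to be corrected by hand.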
In summary, the errors were first checked visually, corrected manually, and then cross-checked (and corrected) again after the data validation process. At this point, it is crucial to note that no automatic GIS tools could be applied to such data adjustments, since automation was not able to handle these issues in the data sufficiently. Automatic tools (or semi-automatic combinations of GIS tools) are indeed beneficial for processing large datasets; however, they often leave some “outliers” unsolved (e.g., Figure 8b,c). In such cases, the individual deviations in the data unfortunately had to be treated manually. Since we did not work with attribute data other than municipality IDs (qualitative information), we could not even apply any of the methods commonly deployed for the modifiable areal unit problem (MAUP) [16], e.g., proportional redistribution of values.
Once all of the municipal ID codes were contained in the polygon representation of administrative boundaries, we applied “spatial join” again for data validation to check the correct assignment of ID codes. In this procedure, the administrative boundaries (polygons) were transformed back into their centroids (points). As a result, 25 centroids (mostly lying on top of each other) were obtained for each municipality in Czechia. Subsequently, we used “spatial join” once more: the target feature was the layer containing the edited polygons of administrative boundaries based on the “common denominator” principle, to which all 25 centroids with ID code attributes were joined. This time, however, we applied a different merge rule for attributes (coincidentally, also called “join”) in the advanced settings of the tool, which lists all joined attributes within one record of the target layer. This helped us to:
- (1) Check the total number of ID codes: one municipality should contain 25 ID codes. If there were more or fewer ID codes, further data inspection and review were necessary. This validation step was in terms of quantity.
- (2) Check whether there were two or more different ID codes: one municipality could have several different ID codes, which indicated a split or merger of several municipalities. This validation step was in terms of quality.
All issues identified in spatial data processing were recorded and immediately taken into account in the non-spatial part of the universal dataset.
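The two validation checks listed above can be sketched in Python as follows. The sketch assumes that the joined ID codes of each universal polygon have already been read into a list (the GIS tool stores them within a single record); the unit names, codes, and the function `validate` are hypothetical.

```python
EXPECTED_YEARS = 25  # one centroid per year, 1995 to 2019


def validate(joined_codes):
    """Return the units whose joined ID codes need further inspection."""
    report = {}
    for unit, codes in joined_codes.items():
        issues = []
        # Check (1), quantity: exactly 25 ID codes are expected per unit.
        if len(codes) != EXPECTED_YEARS:
            issues.append(f"count={len(codes)}")
        # Check (2), quality: different codes hint at a split, merger, or recoding.
        distinct = sorted(set(codes))
        if len(distinct) > 1:
            issues.append("codes=" + ",".join(distinct))
        if issues:
            report[unit] = issues
    return report


# Hypothetical joined attributes for four universal polygons.
joined_codes = {
    "U1": ["500001"] * 25,                    # stable unit, no issue
    "U2": ["500010"] * 20 + ["500011"] * 5,   # code changed during the period
    "U3": ["500020"] * 25 + ["500021"] * 10,  # a unit split off for ten years
    "U4": ["500030"] * 24,                    # one centroid is missing
}

report = validate(joined_codes)
for unit, issues in report.items():
    print(unit, issues)
```

Only the flagged units then need manual review, which keeps the visual inspection focused on the small share of municipalities affected by changes.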
3.2. Treatment of the Data from a Non-Spatial Perspective
Since the changes (mentioned in points 1 and 2 in the Summary section, and partially described in Section 3.1) emerged in different years, it was necessary to search for them individually in each year. Once a change was detected and identified, it had to be decided which ID code would be maintained. For change detection, we used a combination of an official document on changes in the administrative delimitation of Czechia from ČÚZK, official historical registers from CZSO, results from the spatial data treatment, and other internet searches. Unfortunately, these sources were not mutually coherent, so each change had to be verified individually. When a unit was divided over time, the former ID code remained with one of the resulting units, and the newly established municipality received the ID code of the “common denominator” spatial unit with stable boundaries instead of keeping its new one (fortunately, all the units with former ID codes existed throughout the whole period). For example, municipality A was divided into A1 and A2; A1 kept its ID (based on its size or importance within the settlement system), and A2 was assigned the ID of the former bigger unit (usually the same ID as A1). If two or more administrative units merged over the given period, recoding was done backwards (all municipalities forming the new, bigger unit were assigned the identical ID). In other words, the newly established administrative unit (and its spatial representation) is projected back in time. Following the same logic, this newly created unit acted as the “common denominator” and was therefore kept in the final dataset.
Thus, for every year after a detected split or merger of an administrative unit, the former code replaced the new one, while the reference information about both codes was maintained. The correspondence table containing both codes represents the final product of the non-spatial data preparation. This table allows linking any statistical table with the spatial part of the dataset (see more in the User Notes section).
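The recoding logic for splits and mergers can be sketched in Python as a small resolution of code changes. This is only an illustration under stated assumptions: the list of changes is given as hypothetical pairs (replaced code, kept code), the letter codes are placeholders, and the function `build_correspondence` is not part of the published workflow, which was carried out semi-manually.

```python
def build_correspondence(all_codes, changes):
    """Map every former ID code to the code kept in the universal dataset.

    changes is a list of (replaced_code, kept_code) pairs, one per detected
    split or merger, in any order.
    """
    kept = {code: code for code in all_codes}
    for replaced, target in changes:
        kept[replaced] = target

    def resolve(code):
        # Follow chains such as B -> C -> D until a stable code is reached.
        seen = set()
        while kept[code] != code:
            if code in seen:
                raise ValueError(f"circular recoding involving {code}")
            seen.add(code)
            code = kept[code]
        return code

    return {code: resolve(code) for code in kept}


# Hypothetical codes: A2 split off from A and receives A's code;
# B merged into C, and C later merged into D.
all_codes = ["A", "A2", "B", "C", "D"]
changes = [("A2", "A"), ("B", "C"), ("C", "D")]

table = build_correspondence(all_codes, changes)
for old, new in sorted(table.items()):
    print(old, "->", new)
```

Resolving chains of changes matters when a unit is affected more than once in the period, since every former code must end at a unit that exists in the final layer.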
Methodologically, we applied a combination of tools available in Microsoft Excel (e.g., the lookup function to search for differences, a contingency table to cross-check overall counts, and so on) and the programming language R (functions na.omit and setdiff) in order to find differences between spatial and tabular data automatically. This combination of non-spatial tools was chosen mainly for practical reasons. After the initial preparation of the spatial dataset (using a geodatabase), the demographers asked us to deliver the list of municipalities in MS Excel, as they commonly work in such tabular environments. They then cross-checked the data against the statistical tables they possessed and sent back notes for corrections (highlighted in MS Excel). This iterative process was, therefore, easier to handle directly in Excel.
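The difference check performed with na.omit and setdiff in R can be approximated in Python with set operations. The sketch below is an analogue for illustration, not the authors' script; the ID codes and the helper `clean` are hypothetical.

```python
def clean(codes):
    """Drop missing values (None or blank strings), similar to na.omit in R."""
    return [c.strip() for c in codes if c is not None and c.strip() != ""]


# Hypothetical ID codes exported from the spatial layer and from a statistical table.
spatial_codes = ["500001", "500003", None, "500007"]
tabular_codes = ["500001", "500002", "500003", ""]

spatial = set(clean(spatial_codes))
tabular = set(clean(tabular_codes))

# Like setdiff(x, y) in R: elements of the first set that are absent from the second.
only_spatial = sorted(spatial - tabular)
only_tabular = sorted(tabular - spatial)

print("Only in spatial data:", only_spatial)
print("Only in tabular data:", only_tabular)
```

An empty result in both directions indicates that the two sources list the same municipalities, while any remaining code points to a unit that needs individual verification.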
3.3. Open-Data Portal Design
Sharing the universal layer via an open data portal is based on two steps: metadata and geometry preprocessing, and publishing. Metadata is structured information used to characterize, identify, and interpret each dataset. In the field of spatial data, metadata is essential for using the relevant data correctly [17]. Metadata should be part of every dataset and web service. In our case, the metadata comprise the title, description, spatial extent (bounding box), author, date of publishing, license and permission, original source, number of features, and attribute specification (title, format, count statistics). All metadata for the universal dataset is available in a standardized format at URL: https://gislib.upol.cz/portal/sharing/rest/content/items/08a30f65288f40ccac2e31fc6ce6b908/info/metadata/metadata.xml.
Once appropriate metadata is implemented, the publishing phase follows. Publishing is the formal and technical process of data publication. After publishing, the data are visualized and available for download within the open data concept. Choosing whether to share with the public or only with team members is a crucial option. Within the ArcGIS platform, the user is asked which data formats to offer for download; in our case, all possible options are available: native Esri spatial geodatabase as the original source, feature collection, CSV, KML, Shapefile, GeoJSON, XLSX, and the standardized GeoServices API. All data, applications, and websites created through the ArcGIS Open Data Portal are stored in the Esri Geospatial Cloud repository, which means that they are available to administrators for further updates. The Open Data Portal of the Department of Geoinformatics, Palacký University Olomouc, is available in the Supplementary Materials. It contains dozens of datasets divided by categories. Each dataset is available via a specific URL. The case study of the universal layer is available via the Supplementary Materials.
The interface of a specific dataset contains spatial, tabular, and attribute segments (see Figure 9). The interactive map provides a general overview of the phenomena. It is not a fully developed application with a high level of interactivity, and it does not meet all cartographic rules. It allows the basic functionality of web maps: zoom in/out, pan, and attribute selection by click for data preview. The tabs allow the user to switch between the spatial and tabular visualization of the whole dataset. Usually, a web design is tested with real users before publishing, e.g., by eye-tracking methods [18,19]; however, since the Esri platform is devoted to fast, effective, and easy publication of data, it does not offer advanced options for changing the user interface.
Moreover, the interface implements advanced filtering. The second part includes all metadata, including descriptions and attributes. Buttons allowing download in the specified formats are fundamental from the users’ point of view. An Esri user interested in the dataset can simply use it for a custom project within the ArcGIS platform, directly upload the layer with the “Make web map” button, and create highly interactive web map applications.