1. Introduction to BASECOL
This publication describes the latest technical developments performed on the BASECOL database. The intention is to convey the technical quality of the BASECOL service in relation to the Virtual Atomic and Molecular Data Center (VAMDC) [
1,
2,
3], which is supported by the VAMDC consortium
1. The current VAMDC e-infrastructure interconnects about 38 atomic and molecular databases that cover atomic and molecular spectroscopy and processes. About 90% of the interconnected databases handle data that are used for the interpretation of astronomical spectra and for the modeling of media in many fields of astrophysics. VAMDC offers a common entry point to all incorporated databases through the VAMDC portal
2 and it develops central services such as the species database
3 or the Query Store service
4. VAMDC also develops standalone tools in order to retrieve and handle the data such as the SPECTCOL tool [
4]
5. VAMDC provides software and support in order to include new databases within the VAMDC e-infrastructure
6. One current feature of VAMDC e-infrastructure is the constrained environment for the description of data, in particular the XSAMS schema and other standardized protocols
7 that ensure a higher quality for the distribution of data.
The genesis of this database was prompted by the need for inelastic collisional rate coefficients for the analysis of astrophysical spectra obtained with the latest space-based and ground-based heterodyne instruments. The first publication [
5] on BASECOL introduced the status of BASECOL in 2012 and included an exhaustive table of the 2012 scientific content. The database has been partially updated since 2013, and, once the update is completed, a further communication will present the scientific status. An excellent review of the scientific needs and issues related to such collisional data can be found in [
6].
The BASECOL database collects from the refereed literature the state-to-state rate coefficients for the inelastic excitation of rotational, vibrational and ro-vibrational levels of molecules by atoms, molecules, and electrons. Currently, the collisional rate coefficients come from theoretical calculations. The collisional processes are described in the temperature range relevant to the interstellar medium, to circumstellar atmospheres and to cometary atmospheres. In addition, the database includes the energy tables that identify the levels involved in the collisional transitions, fits to the numerical data for some collisional data sets, and the references attached to the collisional data sets. Using the services described below (“Search Collision” and “Search Article”), we provide here the following statistics
8: there are currently 284 articles and 217 collisional data sets. Among those 217 collisional data sets, we identify the excitation of 58 molecular species (16 of which are molecular ions) and of four atomic species. The molecular species span 26 diatomic molecules, 20 triatomic molecules, and 12 molecules with more than three atoms. Among the 217 collisional data sets, 71, 97, 19, and 30 have as a collider He, H₂ (among those, 32 sets are with the ortho-H₂ collider), H, and the electron, respectively.
BASECOL is primarily a scientific product that can be trusted and that is aimed both at producers of collisional rate coefficients and at users of those quantities. Indeed, one aspect of the BASECOL database is the quality, completeness, and safety of the provided information, as BASECOL is managed by scientists having a deep knowledge of the type of data ingested in the database. This means that the data are checked with respect to the methodologies used to calculate the data and with respect to the consistency of the data sets. The data policy of BASECOL is to keep a record of all available and published data sets. Other important scientific aspects of BASECOL are the versioning of the collisional data sets and the references attached to all sets of data. This allows traceability of the information.
From a technical point of view, BASECOL is a sophisticated product that follows international rules for management of FAIR data [
7] (cf.
Section 6) and that is compliant with the VAMDC infrastructure. In the following sections, we will describe the new multitier architecture of the BASECOL service, the steps taken towards improving data integrity and how the connection to the VAMDC is integrated into the BASECOL service. Finally, we will describe the public web interface that provides access to the BASECOL data.
3. Technical Evolution for Improved Data Integrity and for VAMDC Content Requirements
We designed a new data import system that improves data integrity and authenticity (R7 of CoreTrustSeal recommendation
13), and that introduces rigorous procedures for managing the long-term archival storage of data. The data import system is composed of an import ASCII file, of a Java application that parses and loads the import file into the database and of a git repository. We first describe the import ASCII file and then the import procedure.
3.1. Description of the Import ASCII File
The import ASCII file is composed of several “Data” sections, and the concatenation of these sections constitutes what is called a “collisional data set”. Each section of the import file corresponds to an “object” that follows the versioning rules (cf.
Appendix B). In this paper, we give the same name both to the section of the import ASCII file and to the “object”, which is the concept to which the versioning rules are applied. An “object” may include metadata and numerical data or just metadata. Some of the metadata ensure the compatibility of BASECOL with the VAMDC architecture, some metadata are either mandatory or optional, and some metadata are related to the uniqueness of an “object”. The concept of uniqueness is attached to the numerical data as well, and is used in the versioning process handled by the import script (cf.
Section 3.2 and
Appendix B). The different data sections are the “Element” section, the “Energy table” section, the “Energy Origin” section, the “Rate Coefficients” section, the optional “Fitting coefficients” section and the “Publications” section. The “Element”, “Energy table”, and “Energy Origin” sections must be provided for the molecule called “target”, i.e., the molecule whose collisional excitation is of interest for astrophysical users, and for the “collider”, i.e., the perturbing atom, molecule or electron. These sections are described thereafter.
3.1.1. Element Section
The “Element” section is compulsory and corresponds to the object “Element” of
Appendix B which is composed of a list of metadata given in
Table 1. An example of an “Element” section input for N₂H⁺ is given in
Table A1. The metadata of the “Element” object allowing compatibility with VAMDC concern the identification of the species. The VAMDC standards provide standardization of the species through the metadata “stoichiometric formula” (the atoms are listed in alphabetical order and the number of occurrences of each atom is given after the atomic symbol), and through the “inchikey”. When a species is already present in BASECOL it is possible to retrieve the species metadata from BASECOL through a web user interface (cf.
Appendix C); otherwise, information can be retrieved from the species database
14 when the metadata are linked to VAMDC, with the exception of “mass”. If neither BASECOL nor the species database contain the species information, the BASECOL data provider can contact the VAMDC team to get advice (
[email protected]).
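The ordering rule of the “stoichiometric formula” can be illustrated with a minimal Python sketch. The function name `stoichiometric_formula`, the dictionary input, and the choice to omit a count of 1 are assumptions made for this illustration only; charge handling is deliberately left out, and the VAMDC standards remain the authoritative definition.

```python
def stoichiometric_formula(atom_counts):
    """Build a formula string with atomic symbols in alphabetical order,
    each followed by its number of occurrences.

    Assumption (not taken from the BASECOL documentation): a count of 1
    is omitted, so {"C": 1, "O": 1} gives "CO".
    """
    parts = []
    for symbol in sorted(atom_counts):
        count = atom_counts[symbol]
        if count < 1:
            raise ValueError(f"invalid count for {symbol}: {count}")
        parts.append(symbol if count == 1 else f"{symbol}{count}")
    return "".join(parts)


# Usage: the input order of the atoms does not matter.
print(stoichiometric_formula({"O": 1, "H": 2}))  # H2O
print(stoichiometric_formula({"N": 2, "H": 1}))  # HN2
```

Sorting the symbols before concatenation makes the result independent of how the data provider happens to write the species, which is the point of a standardized identifier.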
3.1.2. Energy Table Section
The “Energy Table” section is compulsory and corresponds to the object “Energy Table” of
Appendix B, which is composed of a list of metadata given in
Table 2 and of a numerical table. The imported numerical energy table identifies the labels of the collisional transition in the rate coefficients table. The administrator can import the energy table used in the rate coefficients calculations, or an energy table created with data from spectroscopic databases such as CDMS, JPL, and HITRAN [
10,
11,
12,
13]. A template of an “Energy table” section input is given in
Table A2. The numerical energy table is composed of labels, values of the quantum numbers, and numerical values for the energy. The energy tables are currently being harmonized so that the default energy unit is “cm⁻¹”. The quantum numbers’ description follows the VAMDC standards: the atomic quantum numbers are standardized in the XSAMS document
15 and the molecular quantum numbers are described in the case-by-case document
16 where different situations are identified (for example, case = “dcs” means diatomic closed-shell molecules, case = “hunda” means Hund’s case (a) diatomics, etc.). For each “case”, a set of quantum numbers is defined, and the import file must follow those standards. If a given pattern of quantum numbers characterization exists in BASECOL (similar type of molecule and coupling case), the choice of quantum numbers can be retrieved from the import file WebUI (see
Appendix C).
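As an illustration of the structure described above (labels, quantum numbers, energies, and a fixed set of quantum numbers per “case”), the following Python sketch checks an energy table for consistency. The class `EnergyLevel`, the function `check_energy_table`, and the set `{"J"}` used in the usage lines are hypothetical; the actual set of quantum numbers for each case is defined in the VAMDC case-by-case document, and the real checks are performed by the BASECOL import application.

```python
from dataclasses import dataclass


@dataclass
class EnergyLevel:
    label: int              # label used to identify the level in the rate coefficients table
    quantum_numbers: dict   # quantum number name -> value
    energy: float           # energy in the unit of the table


def check_energy_table(levels, expected_names):
    """Return a list of problems; an empty list means the table is consistent."""
    problems = []
    seen_labels = set()
    for level in levels:
        if level.label in seen_labels:
            problems.append(f"duplicate label {level.label}")
        seen_labels.add(level.label)
        if set(level.quantum_numbers) != set(expected_names):
            problems.append(
                f"level {level.label}: quantum numbers "
                f"{sorted(level.quantum_numbers)} do not match {sorted(expected_names)}"
            )
    return problems


# Usage with illustrative values only.
table = [EnergyLevel(1, {"J": 0}, 0.0), EnergyLevel(2, {"J": 1}, 10.0)]
print(check_energy_table(table, {"J"}))  # []
```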
3.1.3. Energy Origin Section
The “Energy Origin” section is compulsory and corresponds to the object “Energy Origin” of
Appendix B which is composed of a list of metadata given in
Table 3 and of a numerical table with a single line. This single line gives the values of the quantum numbers associated with the energy origin of the corresponding “Energy Table” section (cf.
Section 3.1.2). A template of an “Energy Origin” section input is given in
Table A3.
3.1.4. Rate Coefficients Section
The “Rate Coefficients” section comes next; it is compulsory and corresponds to the object “Rate Coefficients” of
Appendix B, which is composed of a list of metadata given in
Table 4 and of a numerical table. A template of a “Rate Coefficients” section input is given in
Table A4.
BASECOL allows the collider to undergo transitions, and therefore the initial and final levels of the collider must be provided. The BASECOL collisional rate coefficients (in units of cm³ s⁻¹) can be state-to-state rate coefficients, effective rate coefficients, and thermalized coefficients; those items are scientifically defined in Dubernet et al. [
5] and must be specified in the metadata “rateType” of
Table A4.
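Because both the target and the collider may change level, each row of a rate coefficients table is identified by four level labels. A minimal Python sketch of such a table follows; the names `Transition`, `RATE_TYPES`, and `make_rate_table` and the three rate type strings are assumptions for illustration, not the values accepted by the BASECOL import file.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    target_initial: int     # labels refer to the target energy table
    target_final: int
    collider_initial: int   # labels refer to the collider energy table
    collider_final: int


# Hypothetical spellings of the three kinds of rate coefficients.
RATE_TYPES = {"state-to-state", "effective", "thermalized"}


def make_rate_table(rate_type, temperatures, rows):
    """Map each transition to {temperature: rate coefficient}."""
    if rate_type not in RATE_TYPES:
        raise ValueError(f"unknown rate type: {rate_type}")
    table = {}
    for transition, rates in rows:
        if len(rates) != len(temperatures):
            raise ValueError(f"{transition}: one rate per temperature is required")
        table[transition] = dict(zip(temperatures, rates))
    return table


# Usage with made-up numbers (temperatures in K, rates in cm^3 s^-1).
rows = [(Transition(2, 1, 1, 1), [1.0e-11, 2.0e-11])]
table = make_rate_table("state-to-state", [10.0, 20.0], rows)
print(table[Transition(2, 1, 1, 1)][20.0])  # 2e-11
```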
3.1.5. Fitting Coefficients Section
The import script optionally processes the “Fitting Coefficients” section. This section corresponds to the object “Fitting Coefficients” of
Appendix B, which is composed of a list of metadata given in
Table 5 and of a numerical table. A template of a “Fitting Coefficients” section input is given in
Table A5. The transitions of the numerical fitting coefficients table must be organized identically to the transitions of the related rate coefficients table.
Currently, 86 collisional rate coefficients sets have fitting tables, and those fits have been either provided by the authors or calculated by the BASECOL maintainers. The list of fitted collisional sets is indicated in
Table A1 of our previous publication [
5]. No additional sets have been fitted since 2013 because of lack of manpower
17. The fitting functions are the so-called “common fit equation”:
with
T the temperature in Kelvin,
R the rate coefficients in cm³/s,
,
, and
the fitted parameters, and specific fitting functions used in various publications for their data: “Faure et al. 2004” [
14], “Faure et al. 2001” [
15], “Sarpal et al. 1993” [
16], “Lim et al. 1999” [
17] as indicated in
Table A5.
3.1.6. Publications Section
Finally, the import file must contain the “Publications” section. This section may include several “Publication” objects; each object is composed of a list of metadata given in
Table 6 and follows the versioning defined in
Appendix B. The “Publications” section input file must contain at least one “Publication” object, which is the reference to the paper where the collisional data set has been published. This main reference is indicated in the publication input file with
$$mainArticle = yes; this reference appears in red in the “References” section of the public WebUI (cf.
Section 4.2). If the collisional dataset is obtained through several papers, additional references, identified with
$$mainArticle = no, can be included. The BASECOL team strongly recommends including the references to the potential energy surface used in the collisional calculations. Additional references may be considered, such as references linked to the collider and to the target energy levels, as well as to the method used in the calculations. We do not provide an input file for this entry as some minor format modifications are currently being made. In the case of old papers with no DOI, an internal DOI is created from the string “tmp_doi” followed by a random unique identifier. This internal DOI is not provided to VAMDC, as the DOI is not mandatory in VAMDC. In the import file, the DOI is mandatory, as we want to enforce the good practice of using unique identifiers. The reference entries of the BASECOL database are currently being updated with their DOI. The keywords metadata are composed of a key metadata followed by its value. The key metadata are “Perturbing Element”, “Target Element”, “Possible systems”, “Transitions”, “Type of data”, “Possible Method”, and “Miscellaneous”. To each reference, a set of “key metadata:value” pairs is attached that characterizes the content of the corresponding paper. These keywords allow the search by keywords in the “Articles” part of the BASECOL website (cf.
Section 4).
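The rules on the main reference and on missing DOIs can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary keys `mainArticle` and `doi`, the function names, and the exact layout of the placeholder (the string “tmp_doi” directly followed by a UUID in hexadecimal form) are hypothetical, and the example DOI is a dummy value. The actual BASECOL import application is written in Java and may differ.

```python
import uuid

TMP_PREFIX = "tmp_doi"


def is_internal_doi(doi):
    """True for placeholder DOIs created for papers without a DOI."""
    return doi.startswith(TMP_PREFIX)


def ensure_doi(publication):
    """Return a copy of the publication that always carries a DOI."""
    pub = dict(publication)
    if not pub.get("doi"):
        # Assumed layout: prefix + random unique identifier.
        pub["doi"] = TMP_PREFIX + uuid.uuid4().hex
    return pub


def check_publications(publications):
    """Require at least one publication flagged as the main article."""
    if not publications:
        raise ValueError("at least one Publication object is required")
    if not any(p.get("mainArticle") == "yes" for p in publications):
        raise ValueError("no publication is flagged with mainArticle = yes")
    return [ensure_doi(p) for p in publications]


def doi_for_vamdc(publication):
    """Internal placeholder DOIs are not passed on to VAMDC."""
    doi = publication["doi"]
    return None if is_internal_doi(doi) else doi


# Usage with dummy entries.
pubs = check_publications([
    {"mainArticle": "yes", "doi": ""},
    {"mainArticle": "no", "doi": "10.1000/example"},
])
print([doi_for_vamdc(p) for p in pubs])  # [None, '10.1000/example']
```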
It should be stressed that the publication part of the BASECOL database is extremely important because it ensures provenance of the data, and it ensures that BASECOL follows FAIR principles [
7] (cf.
Section 6), which is an intrinsic feature in VAMDC.
In addition, the BASECOL policies ask the users to cite the authors of the collisional datasets. This section could be improved in the future by deploying software similar to that developed for the HITRAN and the AMBDAS databases [
18] that allows for retrieving the information using a DOI.
3.2. Import Procedure
The import script performs integrity and format checks on the import ASCII file. The importation process starts when the import file is valid. The import script is designed either to import a whole new “collisional data set” or to update an existing “collisional data set”. In each “Data” section of the “collisional data set” (
Section 3.1), the import script checks the values of the metadata related to the uniqueness of an “object”, and, in addition, for the “Energy Table”, “Energy Origin”, “Rate Coefficients”, and “Fitting Coefficients” sections, it checks the values of the numerical tables. We call each such metadata or numerical value a “uniqueness item”. If the value of a single uniqueness item does not correspond to what is already in the database, the “collisional data set” is considered as a new collisional data set; if all the uniqueness items have the same value as in the database, the “collisional data set” is considered as an update of an existing “collisional data set”.
When a new “collisional data set” is imported in the database, a collisional entry is created. In order to ensure data-traceability and reproducibility, a version number is attributed to the freshly created set and a comment describing the details of this creation operation is stored. Alternatively, when an existing “collisional data set” is updated, the metadata not linked to the uniqueness concept and affected by the change are modified, a new version number of the “collisional data set” is created, and a comment describing the type of change is stored. Detailed information on the versioning is provided in
Appendix B.
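The decision between “new collisional data set” and “update” can be summarized by the following Python sketch. It is a simplified model under explicit assumptions: the in-memory dictionary standing in for the database, the integer version numbers starting at 1, and the function names are all hypothetical, and the detailed versioning rules of Appendix B are not reproduced.

```python
def classify_import(incoming_uniqueness, database):
    """Return ("update", set_id) if all uniqueness items match a stored set,
    otherwise ("new", None)."""
    for set_id, entry in database.items():
        if entry["uniqueness"] == incoming_uniqueness:
            return ("update", set_id)
    return ("new", None)


def apply_import(database, uniqueness, other_metadata, comment):
    """Create a new set or update an existing one, recording version and comment."""
    action, set_id = classify_import(uniqueness, database)
    if action == "new":
        set_id = max(database, default=0) + 1
        database[set_id] = {
            "uniqueness": uniqueness,
            "metadata": dict(other_metadata),
            "version": 1,                  # assumed numbering scheme
            "history": [(1, comment)],
        }
    else:
        entry = database[set_id]
        entry["metadata"].update(other_metadata)  # only non-uniqueness metadata change
        entry["version"] += 1
        entry["history"].append((entry["version"], comment))
    return action, set_id


# Usage: the same uniqueness items twice, then a changed numerical value.
db = {}
print(apply_import(db, {"target": "CO", "rates": [1.0, 2.0]}, {"note": "a"}, "creation"))
print(apply_import(db, {"target": "CO", "rates": [1.0, 2.0]}, {"note": "b"}, "metadata fix"))
print(apply_import(db, {"target": "CO", "rates": [1.0, 2.5]}, {}, "creation"))
```

In this model a change in any numerical value produces a separate data set, while a change limited to the other metadata produces a new version of the same set.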
Once the import file has been processed, its scientific content, i.e., the “collisional data set”, is copied into the database of the BASECOL “ingestion instance”. This “collisional data set” is visible on the website of the BASECOL “ingestion instance”, and the administrator can check the imported data. Once the information is checked, the administrator allows its visibility in the “production instance” of BASECOL. Indeed, by default, any new “collisional data set” status is “non visible” in the production database, thus protecting the “production instance” of the database from unchecked information when the “ingestion” database is dumped into the “production” database. The visibility on the “production instance” website of BASECOL implies that VAMDC can access the “collisional data set”.
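The interplay between the two instances and the visibility status can be modeled with a short Python sketch. The dictionary layout and the function names are hypothetical and only mirror the workflow described in the text: new sets are hidden by default, the full ingestion database is dumped into production, and only approved sets are exposed.

```python
import copy


def register_import(ingestion, set_id, data):
    """A freshly imported set is not visible by default."""
    ingestion[set_id] = {"data": data, "visible": False}


def approve(ingestion, set_id):
    """The administrator makes a checked set visible."""
    ingestion[set_id]["visible"] = True


def dump_to_production(ingestion):
    """The whole ingestion database is copied, including hidden sets."""
    return copy.deepcopy(ingestion)


def publicly_visible(production):
    """Only visible sets are exposed on the production website."""
    return sorted(set_id for set_id, entry in production.items() if entry["visible"])


# Usage: two imports, only the first one has been checked.
ingestion = {}
register_import(ingestion, 1, "checked set")
register_import(ingestion, 2, "unchecked set")
approve(ingestion, 1)
production = dump_to_production(ingestion)
print(publicly_visible(production))  # [1]
```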
The processed import ASCII file is additionally copied into a git versioning system. The git versioning system allows traceability in case of a later update and allows restoration of the database in case of corruption and/or loss of the database. This process guarantees the data integrity.
4. Public Access to BASECOL via the Web User Interface (WebUI) at basecol.vamdc.org
The public interface of BASECOL has been simplified, and we describe here its current features. The “Home” section is the landing page for the URL:
https://basecol.vamdc.org; it provides the citation policies and the units. There are two ways to query the database: a guided query with the “Browse Collisions” (see
Section 4.1.1) section that corresponds to a simplified version of BASECOL2012 collisional query browser and the “Search Collisions” (see
Section 4.1.2) that provides queries by several free search criteria. The “Search articles” section (cf.
Figure 3) provides a search on the bibliographic entries which are associated with numerical data. The “Tools” section gives access to information on the SPECTCOL tool [
4] and on a scientific package for the water-H₂ collisions [
19,
20,
21,
22,
23,
24]. The search on energy tables has been removed following recommendations from the French PCMI community
18.
4.1. Query Part
4.1.1. Browse Collisions
The query part of the “Browse Collisions” section is a four-step process: from a single page, the user successively chooses a target, its symmetry, then a collider and its symmetry. The interface provides an auto-completion for the species name and can suggest available colliders according to the selected target. The available datasets corresponding to the restrictions submitted by the user are listed immediately below the query section.
Figure 4 shows an example for the request of CO colliding with ortho–H₂. In this example, we see that the data set of Yang et al. [
25] has the flag recommended = “yes” as it is more complete and more recent than the previous sets [
26,
27] which have recommended = “no”.
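The guided query can be pictured as two small operations: suggesting the colliders available for a chosen target, and listing the data sets that match the four selections. The Python sketch below assumes a hypothetical list of dictionaries as the data source; the keys, the placeholder identifiers, and the function names are not part of the BASECOL code base.

```python
def available_colliders(datasets, target, target_symmetry):
    """Colliders (with symmetry) for which a data set exists for the chosen target."""
    found = {
        (d["collider"], d["collider_symmetry"])
        for d in datasets
        if d["target"] == target and d["target_symmetry"] == target_symmetry
    }
    return sorted(found, key=lambda pair: (pair[0], pair[1] or ""))


def browse_collisions(datasets, target, target_symmetry, collider, collider_symmetry):
    """Data sets matching the four successive choices of the guided query."""
    wanted = (target, target_symmetry, collider, collider_symmetry)
    return [
        d for d in datasets
        if (d["target"], d["target_symmetry"], d["collider"], d["collider_symmetry"]) == wanted
    ]


# Usage with placeholder entries.
SETS = [
    {"id": "A", "target": "CO", "target_symmetry": None,
     "collider": "H2", "collider_symmetry": "ortho", "recommended": True},
    {"id": "B", "target": "CO", "target_symmetry": None,
     "collider": "H2", "collider_symmetry": "ortho", "recommended": False},
    {"id": "C", "target": "CO", "target_symmetry": None,
     "collider": "He", "collider_symmetry": None, "recommended": True},
]
print(available_colliders(SETS, "CO", None))
print([d["id"] for d in browse_collisions(SETS, "CO", None, "H2", "ortho")])
```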
4.1.2. Search Collisions
The query part of the “Search Collisions” section allows the database to be queried freely by the year of publication and the authors of the papers attached to the datasets, and by the target and collider species. It is therefore possible to see the complete content of the database when no criteria are chosen, to see all collisional sets with electrons as the collider, and so on. This is an extremely useful tool for both the users and the managers of the database. We provide an example in
Figure 5.
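The behavior of a free search, where every criterion is optional and an empty query returns the whole content, can be sketched in Python as below. The function `search_collisions`, the dictionary keys, and the sample entries are hypothetical placeholders rather than the actual query implementation.

```python
def search_collisions(datasets, year=None, author=None, target=None, collider=None):
    """Return the data sets matching every criterion that is not None."""
    def matches(d):
        return (
            (year is None or d["year"] == year)
            and (author is None or author in d["authors"])
            and (target is None or d["target"] == target)
            and (collider is None or d["collider"] == collider)
        )
    return [d for d in datasets if matches(d)]


# Usage with placeholder entries.
SETS = [
    {"id": 1, "year": 2010, "authors": ["Author X"], "target": "CO", "collider": "H2"},
    {"id": 2, "year": 2015, "authors": ["Author Y"], "target": "HCN", "collider": "e-"},
    {"id": 3, "year": 2015, "authors": ["Author X", "Author Y"], "target": "CN", "collider": "e-"},
]
print(len(search_collisions(SETS)))                                # 3
print([d["id"] for d in search_collisions(SETS, collider="e-")])   # [2, 3]
```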
4.2. Returned Information
4.2.1. General Display
Once the user has selected a collisional data set, the user interface displays on the same single page the complete numerical and documentary information.
Figure 6 shows the return page for the selection of Yang et al. [
25] data set. From this page, one can access the rate coefficients, the fits to the rate coefficients, the energy tables with state labels that are used to identify states in the rate coefficient tables, the bibliographic references attached to the collisional data set (the main reference where the dataset is published is shown in red), and then a set of information that allows for rapidly characterizing the methodology used in the calculations. In BASECOL2012 [
5] user interface, those items were split between several tabs.
The user can visualize the rate coefficients and energy tables on the WebUI and can download the data as text files. When fitting coefficients are available, a new graphical display allows for comparing the calculated rate coefficients with the fitting function.
4.2.2. History and Versioning
In the “History” part of the return page, the different versions of the collisional datasets will be available. At the time of publication, none of the sets have been updated since the creation of the versioning system. Therefore, we show how it works in
Figure 7. The versions correspond to major changes in the database (numerical data), and each version has a creation date and might have some minor modifications done at a later date. The currently displayed version is explicitly notified to the user by the “(this page)” tag after the corresponding version number. Clicking on any of the other versions will bring a similar return page (such as
Figure 6) corresponding to the selected version. That history part reflects the complex versioning system described in this paper and is an important achievement as it allows full traceability and reproducibility of the data.
7. Conclusions
This paper provides a detailed analysis of the latest technical developments on the BASECOL database and its environment. The major upgrades concern its very careful versioning system linked to an ingestion system that verifies the imported data with respect to the content of the database, as well as the possibility for users to retrieve previous versions of the collisional data sets. Another major update is the dual system of a validation environment and a production environment, coupled to the storage of the ASCII files of the collisional data sets in a GitHub repository. BASECOL is now designed to be a database fully compliant with the VAMDC e-infrastructure, in particular through the import script that contains the parameters necessary for the VAMDC interoperability. Citing one referee: “The 2020 version is undoubtedly a major update compared to the 2012 version: simpler, clearer with excellent traceability and reproducibility. The website is well designed and user friendly.” Nevertheless, some issues must still be solved: they concern a more user-friendly import management environment and, of course, the scientific update of the database. In the future, we might foresee partnerships with websites that provide the above cited RADEX files, and we might adopt a different business model for the sustainability of the database.