1. Introduction
First, scientists in translational medicine must understand how to use the Google search engine. You may be surprised that search results can differ depending on the browser. There are two types of keyword search: word keyword search and phrase keyword search. In a phrase keyword search, quotation marks indicate an ordered sequence of words. For example, “set operations” is composed of two words, set and operations, where set must come first and operations must immediately follow it.
An exhaustive search for articles containing the two phrases “vaccine safety” and “set operations” revealed only three articles on the Internet [1,2,3]. Jacquez et al. showed how to use set operations for breast cancer analysis, where the dataset is composed of only 285 instances [1]. Lu et al. did not use set operations in their analysis; the phrase “set operations” appeared only in their references [2]. Barry DeVille et al. published a SAS book that briefly introduced set operations using VAERS datasets with the Statistical Analysis System (SAS) [3]. However, it gave no detailed explanation of set operations and showed only graphical results produced with SAS. Since SAS requires a proprietary license, it is not open-source programming. To the best of our knowledge, there is no tutorial on set operations with open-source programming for vaccine safety. This paper’s open-source programming in Python will be critical for translational medicine when dealing with large datasets.
The author has published a tutorial paper on PyPI packaging for translational medicine [4]. However, that tutorial did not cover software reproducibility or set operations for efficient computing with large datasets; covering both is the significant contribution of this paper. This paper details the calculations on set operations used in translational medicine. Set operations are used for calculating adverse effects on deaths due to COVID-19 using VAERS datasets [5].
There are many articles on the efficacy of vaccines, but few on their adverse effects. This tutorial on set operations with open-source Python for translational medicine is motivated by four reasons: (1) we need to show that efficient computation, such as set operations in Python, is crucial for manipulating large datasets such as VAERS with 748,230 instances; (2) the computational complexity should be understood in order to accelerate computation; (3) there is no tutorial analysis on the extensive adverse effects of COVID-19 vaccines; and (4) PyPI packaging and software reproducibility are essential for scientists in translational medicine for maximum software dissemination to the world.
This paper presents a data analysis with set operations. The computational time complexity depends on the structure of nested loops and the size of individual loops in algorithms or programs. For example, if your program has a single loop, the size of the loop determines the computational time complexity. In Python, the time complexity of a single for loop is determined by the number of instances (n) and is written O(n) in Big O notation:
for i in range(len(instances)):
In double-nested or triple-nested loops, the time complexity can be expressed as O(n^2) and O(n^3), respectively. With set operations, the double-nested loops, the triple-nested loops, and other loops can be converted to O(n). Therefore, this paper introduces set operations to significantly reduce the time complexity.
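The difference can be illustrated with a minimal sketch. The ID lists below are hypothetical and are not taken from VAERS. The nested loops compare every pair of IDs, while the set version hashes each ID once, which takes linear time on average.

```python
# Hypothetical patient IDs for illustration only (not real VAERS data).
death_ids = [101, 102, 105, 109]
pfizer_ids = [100, 102, 103, 105, 110]

# Double-nested loop: every death ID is compared with every Pfizer ID,
# so the work grows with len(death_ids) * len(pfizer_ids).
common_nested = []
for d in death_ids:
    for p in pfizer_ids:
        if d == p:
            common_nested.append(d)

# Set intersection: each ID is hashed once, so the work grows
# linearly (on average) with the total number of IDs.
common_set = set(death_ids).intersection(pfizer_ids)

print(sorted(common_nested))  # [102, 105]
print(sorted(common_set))     # [102, 105]
```

Both approaches return the same IDs; only the amount of work differs as the lists grow.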
For example, the number of deaths among patients who received both the Pfizer and Moderna vaccines can be obtained with set intersection in O(n) time.
In the datasets, the number of instances is equivalent to the number of patients. Patient IDs are unique and shared across the three VAERS datasets, so they can be used in set operations that span multiple datasets.
The set of Pfizer death patients, deathPFIZER, can be calculated simply by intersecting the deathIDs and PFIZERIDs sets; its size is the number of Pfizer death patients. Similarly, the set of Moderna death patients can be computed by intersecting the deaths set and the Moderna set. Therefore, deaths among patients who received both the Pfizer and Moderna vaccines can be calculated by intersecting the Pfizer-death-patients set and the Moderna-death-patients set. However, this does not tell us whether Pfizer was the first vaccine. In other words, the result includes both Pfizer-Moderna death patients and Moderna-Pfizer death patients. The time complexity of the above calculations is O(n).
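The chain of intersections described above can be sketched as follows. The set names follow the text, but the ID values are hypothetical and chosen only to make the result easy to check by hand.

```python
# Hypothetical patient IDs for illustration only (not real VAERS data).
deathIDs = {2, 4, 6, 8}
PFIZERIDs = {1, 2, 3, 4, 8}
MODERNAIDs = {4, 5, 6, 8, 9}

# Deaths among Pfizer recipients and among Moderna recipients.
deathPFIZER = set(deathIDs).intersection(PFIZERIDs)
deathMODERNA = set(deathIDs).intersection(MODERNAIDs)

# Deaths among patients who appear in both vaccine sets.
# The order of the two doses cannot be recovered from this result.
M_Pdeath = set(deathMODERNA).intersection(deathPFIZER)

print(sorted(deathPFIZER))   # [2, 4, 8]
print(sorted(deathMODERNA))  # [4, 6, 8]
print(sorted(M_Pdeath))      # [4, 8]
```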
The maleIDs and femaleIDs sets can be similarly generated in O(n) for set operations on the gender class. All features, such as vaccine type, gender class (male or female), death or alive (non-death), and age, can be computed in this manner with set operations in O(n). In other words, the computation time with set operations is drastically reduced from O(n^3) or O(n^2) to O(n).
The advantage of PyPI is that it allows vaers to run on Windows, MacOS, and Linux, regardless of the operating system, as long as Python is installed. This feature of PyPI maximizes the open-source dissemination of software to the world.
This paper introduces Code Ocean for the reproducibility of software codes after showing the PyPI packaging. Code Ocean is the de facto service provider for software reproducibility.
In traditional software development, programmers must write a program from scratch. With the rapid progress of open-source software, programmers must instead choose the right libraries from repositories and glue them together with minimum effort. The selected libraries and packages are available to the public and can be installed with a simple pip command in the terminal [6]. In other words, programmers must be familiar with bash commands in the terminal.
In this tutorial, we will follow the order of the execution of the commands in the bash shell based on reverse engineering. There is no significant difference between Windows, MacOS, and Linux operating systems.
This paper presents the vaers executable package [7] as an example for calculating adverse effects on the number of deaths due to COVID-19 by gender and age group for the Moderna [8] and Pfizer [9] vaccines. The vaers method is currently under review.
First, programmers must understand how to scrape a dataset from the Internet. The vaers executable uses the VAERS datasets. VAERS stands for Vaccine Adverse Event Reporting System. VAERS is a national early warning system to detect possible safety problems in US-licensed vaccines. VAERS is not designed to determine whether a vaccine caused a health problem, but it is especially useful for detecting unusual or unexpected patterns of adverse event reporting that might indicate a possible safety problem with a vaccine.
Second, the dataset files must be read in Python. The 2021 VAERS dataset is composed of three csv files: 2021VAERSDATA.csv, 2021VAERSSYMPTOMS.csv, and 2021VAERSVAX.csv. In vaers.py, 2021VAERSDATA.csv and 2021VAERSVAX.csv are used. csv stands for comma-separated values.
Third, a program is built to compute the target values using set operations. This paper shows how to calculate adverse events of death by sex and age group for each of the Novartis [10], Moderna, and Pfizer vaccines.
Fourth, the Python program is converted to a PyPI package with three files: setup.py, vaers.py, and README.md. The README.md file can be created on the GitHub site, so you need to create a GitHub account.
Finally, the PyPI package is uploaded using a twine command. In order to upload a PyPI package, you need to have an account on the pypi.org site.
In order to use and run a Python program, you must choose a proper installation package, miniconda, depending on your operating system from the following site:
For Windows, double-click on the file Miniconda3-py38_4.11.0-Windows-x86_64.exe. Python 3.8 is recommended in this paper. For MacOS, the file Miniconda3-py38_4.11.0-MacOSX-x86_64.sh should be installed with one of the following terminal commands, using zsh or bash [11,12]:
$ zsh Miniconda3-py38_4.11.0-MacOSX-x86_64.sh
or
$ bash Miniconda3-py38_4.11.0-MacOSX-x86_64.sh
For Linux, download Miniconda3-py38_4.11.0-Linux-x86_64.sh and run the following command:
$ bash Miniconda3-py38_4.11.0-Linux-x86_64.sh
Windows users have two options for Miniconda: one on Windows 11 or 10 itself and the other on Windows Subsystem for Linux (WSL). WSL is a compatibility layer for running Linux binary executables (in ELF format) natively on Windows 11 or 10. WSL is still under development, but it already lets you run Linux binary executables on Windows from the WSL command line.
From here onwards, there is no difference between operating systems. You should be familiar with the conda and pip commands and their options:
First, open a terminal and update the Miniconda environment with the following command. The leading ($) is the prompt printed by the terminal, not a character that you type.
$ conda update conda
Second, update the pip installation command. “-U” is the short form of the “--upgrade” option.
$ pip install -U pip
or
$ python -m pip install -U pip
In order to install pandas, for example, run the following command.
$ pip install -U pandas
or
$ conda install pandas
To check the Python version number, run:
$ python -V
Python 3.8.4
The “which” command reports the location of the Python executable.
$ which python
/home/takefuji/miniconda3/bin/python
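The same information is also available from inside Python. The sketch below uses only the standard library. shutil.which searches the PATH for an executable named python and may return None if none is found, so its output depends on the machine.

```python
import shutil
import sys

# Version of the interpreter that is running this script, e.g. "3.8.4".
version = "{}.{}.{}".format(*sys.version_info[:3])
print("Python", version)

# Full path of the running interpreter.
print(sys.executable)

# Path that a shell lookup of "python" would resolve to, or None.
print(shutil.which("python"))
```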
If the library is not Python-related, install it with the apt command on Linux or WSL, or with brew on MacOS.
First, apt should be updated and upgraded on Linux or WSL on Windows.
$ sudo apt update
$ sudo apt upgrade
Then, you can install the necessary package. For example, “wget” is the name of a command-line download tool, and “sudo” runs a command with superuser privileges.
$ sudo apt install wget
MacOS users must first install the brew command, then run the following command to install the matplotlib library.
$ brew install matplotlib
In vaers, the wget command is needed.
In WSL and MacOS, you must install an X Window System server. Windows users should download and install the VcXsrv Windows X Server exe file. Mac users should install XQuartz. Before running Python, start the X server.
vaers was selected for this tutorial because there is no tutorial on Python set operations for vaccine safety. Set operations are useful for calculating the adverse effects on death by gender (male and female), age group, and vaccine group (Moderna, Pfizer, and Novartis).
In traditional programming, the programmer must write the target software from scratch. In open-source programming, the right libraries must be chosen from repositories, and the selected libraries are simply glued together with minimum effort. This is called rapid open-source prototyping. vaers.py was developed within a few hours.
In other words, the skill in open-source programming lies in selecting the right libraries from the variety of existing libraries [13]. The more examples that are available in open-source libraries, the easier it is for users to create the desired code.
This tutorial was written based on our experience with 19 PyPI projects:
3. Discussion on Set Operations
This tutorial allows researchers to submit a new PyPI package and to showcase their skills with PyPI packages to the world. All that is required is to create three files, vaers.py, setup.py, and README.md, by following the instructions in the Materials and Methods Section. Before submitting the new package, you should test it on your local machine.
There are four set operations, as shown in Figure 1: union, intersection, exclusive OR, and subtraction. In Python, the union of set A and set B can be calculated by the following:
set(A).union(B)
Similarly, the intersection of A and B is calculated by:
set(A).intersection(B)
The exclusive OR of A and B is calculated by:
set(A).symmetric_difference(B)
The subtraction of B from A is calculated by:
set(A).difference(B)
In vaers.py, only the set intersection operation is used.
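As a worked example of the four operations, consider two small hypothetical lists A and B. The variable names for the results are introduced here for illustration only.

```python
# Two small hypothetical collections for illustration only.
A = [1, 2, 3, 4]
B = [3, 4, 5]

union = set(A).union(B)                         # elements in A or B
intersection = set(A).intersection(B)           # elements in both A and B
exclusive_or = set(A).symmetric_difference(B)   # elements in exactly one of A, B
subtraction = set(A).difference(B)              # elements in A but not in B

print(sorted(union))         # [1, 2, 3, 4, 5]
print(sorted(intersection))  # [3, 4]
print(sorted(exclusive_or))  # [1, 2, 5]
print(sorted(subtraction))   # [1, 2]
```

Only the left operand needs to be converted with set(); the methods accept any iterable, such as a list or a pandas Series, as their argument.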
In vaers.py, the lines before def main() check that the two files exist and, if they do, read them with pd.read_csv from the pandas library.
d=pd.read_csv(sys.argv[1]+'VAERSDATA.csv',low_memory=False,encoding='cp1252')
vax=pd.read_csv(sys.argv[1]+'VAERSVAX.csv',low_memory=False,encoding='cp1252')
The three csv files share common ID numbers, so deathIDs, the IDs of the patients who died, can be used across files. The following two lines prepare the calculation of the number of total instances and the number of deaths. d is the pandas DataFrame read from the 'VAERSDATA.csv' file, while vax is the DataFrame read from 'VAERSVAX.csv'.
d['DIED']=d['DIED'].fillna("N")
deathIDs=d.loc[d.DIED=='Y','VAERS_ID']
There are two values in the DIED column: Y or N. Therefore, the total number of instances is calculated by len(d['DIED']), where len is the length (size) function in Python. deathIDs holds the IDs of the rows where d.DIED=='Y', so len(deathIDs) gives the number of deaths.
The pandas .loc indexer is convenient for selecting the rows that satisfy an equality condition (==) in the dataset.
The SEX column plays a key role in the set operations.
maleIDs=d.loc[d.SEX=="M",'VAERS_ID']
femaleIDs=d.loc[d.SEX=="F",'VAERS_ID']
maleIDs holds the male IDs, while femaleIDs holds the female IDs.
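The selection steps above can be tried on a miniature table. The sketch below is not part of vaers.py: it assumes pandas is installed, and it reads a hypothetical five-row stand-in for VAERSDATA.csv from a string, keeping only the three columns used here.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for VAERSDATA.csv (illustration only).
csv_text = """VAERS_ID,SEX,DIED
1,F,Y
2,M,
3,F,
4,M,Y
5,U,
"""
d = pd.read_csv(io.StringIO(csv_text))

# Blank DIED cells are read as missing values; mark them as "N".
d["DIED"] = d["DIED"].fillna("N")

# .loc[row_condition, column] keeps VAERS_ID for the rows that match.
deathIDs = d.loc[d.DIED == "Y", "VAERS_ID"]
maleIDs = d.loc[d.SEX == "M", "VAERS_ID"]

# IDs that are both in the death set and in the male set.
deathmaleIDs = set(deathIDs).intersection(maleIDs)

print(len(d["DIED"]), len(deathIDs), sorted(deathmaleIDs))  # 5 2 [4]
```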
Novartis IDs can be calculated by:
NOVIDs=vax.loc[vax.VAX_MANU=="NOVARTIS VACCINES AND DIAGNOSTICS",'VAERS_ID']
where the condition on the VAX_MANU column selects the rows whose value is “NOVARTIS VACCINES AND DIAGNOSTICS”.
The following three lines show the calculation of the intersection of two sets: the Moderna IDs and the Pfizer IDs.
M_P is the intersection of the Moderna and Pfizer ID sets.
MODERNAIDs=vax.loc[vax.VAX_MANU=="MODERNA",'VAERS_ID']
PFIZERIDs=vax.loc[vax.VAX_MANU=="PFIZER\\BIONTECH",'VAERS_ID']
M_P=set(MODERNAIDs).intersection(PFIZERIDs)
M_Pdeath is the intersection of two sets, deathMODERNA and deathPFIZER, which are themselves the intersections of deathIDs with MODERNAIDs and PFIZERIDs, respectively.
M_Pdeath=set(deathMODERNA).intersection(deathPFIZER)
The following set operation is the intersection of two sets: deathPFIZER and femaleIDs. In other words, len(deathPFIZERfemaleIDs) is the number of female deaths reported for the Pfizer vaccine.
deathPFIZERfemaleIDs=set(deathPFIZER).intersection(femaleIDs)
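A small sketch shows how such a set leads to a per-group death ratio. The IDs are hypothetical, and the ratio is assumed to be the number of deaths in the group divided by the size of the group, which is consistent with the Moderna figures in Listing 1.

```python
# Hypothetical patient IDs for illustration only (not real VAERS data).
PFIZERIDs = {1, 2, 3, 4, 8, 10, 12}
femaleIDs = {4, 5, 8, 9, 10, 12}
deathPFIZER = {2, 4, 8}

# Female Pfizer recipients, and the deaths among them.
PFIZERfemaleIDs = set(PFIZERIDs).intersection(femaleIDs)
deathPFIZERfemaleIDs = set(deathPFIZER).intersection(femaleIDs)

# Assumed definition: deaths in the group divided by the group size.
rate = len(deathPFIZERfemaleIDs) / len(PFIZERfemaleIDs)

print(len(deathPFIZERfemaleIDs), len(PFIZERfemaleIDs), rate)  # 2 4 0.5
```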
Set operations are useful not only for calculating the target sets for translational medicine, but also for efficient computing, by converting O(n^2) or O(n^3) time complexity to O(n).
In order to run vaers, type the following command in the terminal. vaers will automatically start to calculate with set operations (Listing 1).
Listing 1. The result of vaers execution.
$ vaers
total instances: 748230
total deaths 10125
NOVIDs instances: 1475
NOVIDs deaths: 2
NOV death per instance 0.001356
MODERNA+PFIZER: 996
MODERNA+PFIZER death: 5
MODERNA+PFIZER death per instance: 0.00502
MODERNAIDs instances: 325993
MODERNA deaths 4071
MODERNA 0.012488
deathMODERNAmaleIDs 2330
deathMODERNAfemaleIDs 1657
MODERNAfemaleIDs: 224687
MODERNAmaleIDs: 87945
MODERNA female death 0.007375
MODERNA male death 0.026494
PFIZERIDs instances: 313773
PFIZER deaths 4488
PFIZER 0.014303
PFIZERfemaleIDs: 210111
PFIZERmaleIDs: 87945
PFIZER female death 0.009043
PFIZER male death 0.024402
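The per-instance values in Listing 1 appear to be the number of deaths divided by the number of instances in each group. The following sketch, which is not part of vaers.py, recomputes four of them from the counts printed above. The dictionary names counts and rates are introduced here for illustration only.

```python
# (deaths, instances) pairs copied from Listing 1.
counts = {
    "NOVARTIS": (2, 1475),
    "MODERNA+PFIZER": (5, 996),
    "MODERNA": (4071, 325993),
    "PFIZER": (4488, 313773),
}

# Assumed definition: deaths divided by instances, rounded to 6 decimals.
rates = {
    name: round(deaths / instances, 6)
    for name, (deaths, instances) in counts.items()
}

for name, rate in rates.items():
    print(name, rate)
# NOVARTIS 0.001356
# MODERNA+PFIZER 0.00502
# MODERNA 0.012488
# PFIZER 0.014303
```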