3. Results
As mentioned in the Materials and Methods Section, genes associated with the 318 probes for the treated cell lines (contact with collagen–glycosaminoglycan mesh) were uploaded to Enrichr (no probes were selected for control cell lines using this method). The full list of probes, genes, and enrichment analysis is provided in the
Supplementary Materials (Data S1). Several enriched biological terms were determined.
The top ranked term in the GO biological process (BP) (
Table 2) is “regulation of apoptotic process”. Na et al. reported [
9] that collagen–glycosaminoglycan has an anti-apoptosis effect. Thus, the fact that this term is ranked first is reasonable.
“Focal adhesion” is the top ranked term in “GO Cellular Component 2021” (
Table 3) and the ninth ranked in “KEGG 2021 Human” (
Table 4); moreover, Murphy et al. [
10] reported that the collagen–glycosaminoglycan scaffold plays critical roles in focal adhesion.
Other than these three categories, there are some additional categories that support the suitability of our analysis. For example, “ARCHS4 Cell-lines” lists IMR90, which is the cell line used in this study, as the top ranked cell line (
Table 5).
Moreover, although it is not the top ranked term, “FETAL LUNG”, from which IMR90 cell lines were derived, is ranked within the top 10 ranked terms in “ARCHS4 Tissues” (
Table 6).
Although we provide only a few examples, our results suggest that our analysis was robust.
4. Discussion
Although we successfully applied our methodology to the dataset, one might wonder whether more conventional methods can achieve similar performance. Since this dataset was generated using archaic technology, namely, microarray, more modern methods designed for high-throughput sequencing technology (e.g., edgeR [
11] or DESeq2 [
12]) cannot be employed. Moreover, the older methods designed for microarrays (e.g., SAM [
13] and limma [
14]) cannot be employed, because they can only deal with categorical classification, whereas we need to identify genes whose expression changes as a function of a numerical variable (hours). Thus, we decided to employ a more conventional methodology than SAM or limma, namely, gene selection using linear regression.
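Regression-based gene selection of this kind can be sketched as follows. This is a minimal illustration, not the exact procedure used in this study: the function name, the use of Benjamini-Hochberg adjustment, and the threshold of 0.01 are assumptions made for the example.

```python
import numpy as np
from scipy import stats


def select_probes_by_regression(expr, hours, alpha=0.01):
    """Regress each probe on time and return the probes with a significant slope.

    expr  : array of shape (n_probes, n_samples), expression values
    hours : array of shape (n_samples,), time point of each sample
    alpha : threshold on adjusted p-values (assumed value, for illustration)
    """
    expr = np.asarray(expr, dtype=float)
    hours = np.asarray(hours, dtype=float)

    # Two-sided p-value for the null hypothesis "slope == 0", one per probe.
    pvals = np.array([stats.linregress(hours, row).pvalue for row in expr])

    # Benjamini-Hochberg adjustment (assumed correction method).
    n = pvals.size
    order = np.argsort(pvals)
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)

    return np.flatnonzero(adjusted < alpha), adjusted


if __name__ == "__main__":
    # Synthetic data: 5 probes that follow time, 95 probes of pure noise.
    rng = np.random.default_rng(0)
    hours = np.arange(12, dtype=float)
    expr = rng.normal(size=(100, 12))
    expr[:5] = hours + 0.1 * expr[:5]
    selected, _ = select_probes_by_regression(expr, hours)
    print("selected probe indices:", selected)
```

Because each probe is tested separately against time, the number of selected probes depends directly on the chosen threshold, which is one reason the count alone is a weak measure of performance.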
As described in the Materials and Methods Section, we identified 813 probes using linear regression-based FE and uploaded the gene symbols associated with the identified probes to Enrichr. Considering only the number of probes selected, linear regression performed better than PCA-based unsupervised FE, which identified only 324 probes. As with PCA-based unsupervised FE, no probes were selected for the control cell lines. Thus, at first glance, applying PCA-based unsupervised FE instead of linear regression seems not to have been productive.
Nevertheless, if we consider the performance of the enrichment analysis more carefully, this impression is reversed. The full list of probes and genes, together with the results of the enrichment analysis, is provided in the
Supplementary Materials (Data S2). First, for “GO BP 2021”, in which PCA-based unsupervised FE ranked apoptosis first (
Table 2 and
Table 7), although the top ranked term “regulation of apoptotic process” in
Table 2 is associated with an adjusted
p-value as small as
, the top ranked term in
Table 7 is associated with an adjusted
p-value as large as
, which is much less significant. Even the tenth ranked term in
Table 2 is more significant than the top ranked term in
Table 7. Generally, uploading more genes provides more opportunities for significant enrichment. Nevertheless, the genes associated with the 813 probes, which outnumber the 324 probes identified using PCA-based unsupervised FE, were associated with less significant terms. This suggests the inferiority of linear regression as compared to PCA-based unsupervised FE.
Regarding the comparison of the “GO Cellular Component 2021” in
Table 3 and
Table 8, we have a similar impression.
Although “focal adhesion” is ranked first in both tables, its significance differs markedly. It is associated with an adjusted
p-value as small as
in
Table 3, whereas it is associated with one as large as
in
Table 8. The number of overlapping genes is only 39 in Table 8, whereas it is higher (43) in Table 3, even though more genes in total were uploaded to Enrichr for Table 8. Thus, the performance of linear regression is again poorer than that of PCA-based unsupervised FE.
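The effect of list size on significance can be illustrated with a one-sided hypergeometric (Fisher-type) test, which is a common way to score term enrichment. The sketch below assumes that scoring scheme. Only the overlap counts (43 and 39) come from the text; the uploaded list sizes, the term size, and the background size are hypothetical values chosen for illustration.

```python
from math import comb


def enrichment_p_value(overlap, n_uploaded, n_term, n_background):
    """One-sided hypergeometric tail probability P(X >= overlap).

    overlap      : uploaded genes that are annotated to the term
    n_uploaded   : number of uploaded genes
    n_term       : number of genes annotated to the term
    n_background : size of the gene universe
    """
    total = comb(n_background, n_uploaded)
    upper = min(n_uploaded, n_term)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_uploaded - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical sizes: only the overlaps 43 and 39 are taken from the text.
N_BACKGROUND = 20000  # assumed gene universe
N_TERM = 400          # assumed size of the "focal adhesion" gene set

p_short_list = enrichment_p_value(43, 300, N_TERM, N_BACKGROUND)  # shorter list
p_long_list = enrichment_p_value(39, 750, N_TERM, N_BACKGROUND)   # longer list

print(f"shorter list, overlap 43: p = {p_short_list:.3e}")
print(f"longer list,  overlap 39: p = {p_long_list:.3e}")
```

Under these assumed sizes, the longer list is expected to overlap the term by chance more often, so a smaller overlap from a longer list yields a much weaker p-value.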
For KEGG, not only are the adjusted
p-values generally larger (i.e., less significant) in
Table 9 than those in
Table 4, but also “Glycolysis/Gluconeogenesis” and “Focal adhesion”, which are ranked within the top 10 in
Table 4, are not even listed in
Table 9, and no other terms seemingly related to the experiments are mentioned. Thus, the performance of linear regression is again poorer than that of PCA-based unsupervised FE.
For “ARCHS4 Cell-lines” and “ARCHS4 Tissues”, the results are similar. In
Table 10, not only are the adjusted
p-values generally larger (i.e., less significant) than those in
Table 5, but the adjusted
p-values attributed to IMR90 in
Table 10 (
) are also much larger (i.e., less significant) than those in
Table 5. The number of overlapping genes for IMR90 is only 89 in Table 10, whereas that in Table 5 is 128, despite the fact that more than twice as many genes in total were uploaded to Enrichr for Table 10. However, the number of overlapping genes for HUVEC, which is an incorrect cell line, is as large as 113 in
Table 10, whereas that in
Table 5 is only 64. Thus, the increased number of genes selected using linear regression substantially contributes to the increase in overlapping genes assigned to the wrong answer. Moreover, lower ranked terms failed to demonstrate an association with significant
p-values (e.g., less than 0.015). These findings suggest the inferiority of linear regression as compared to PCA-based unsupervised FE.
Although “FETAL LUNG” is ranked fourth in
Table 11, its adjusted
p-value is
, which is much less significant than that in
Table 6 (
). Thus, overall, PCA-based unsupervised FE performed better than linear regression.
Finally, we attempted to conduct a time-series analysis, which is more widely used than linear regression for time course data. To this end, we used the fsMTS [
15] package implemented in R [
5], which includes multiple methods, such as correlation-based, lasso-based, mutual information-based, and random forest-based methods. Nevertheless, none of the fsMTS methods could be run. This was because time-series analysis requires auto/cross-correlations, whose storage requires memory proportional to the square of the number of features. Since the number of features in this analysis was as high as
, it was computationally infeasible to execute the methods in fsMTS. Thus, our strategy, PCA-based unsupervised FE, was the only one applicable to the present data set.
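The quadratic growth in memory can be made concrete with a short calculation. The sketch below assumes a dense matrix of 8-byte floating-point values with one entry per pair of features. The feature counts are hypothetical round numbers, not the size of the present dataset, and methods that also store lagged correlations would need a multiple of this amount.

```python
def correlation_matrix_bytes(n_features, bytes_per_value=8):
    """Bytes needed for a dense n_features x n_features matrix."""
    return n_features * n_features * bytes_per_value


def to_gib(n_bytes):
    """Convert a byte count to gibibytes."""
    return n_bytes / 2**30


# Hypothetical feature counts, for illustration only.
for n_features in (1_000, 10_000, 50_000):
    size = correlation_matrix_bytes(n_features)
    print(f"{n_features:>6} features: {to_gib(size):8.2f} GiB")
```

Doubling the number of features quadruples the requirement, so a method that is comfortable with thousands of features can become infeasible with tens of thousands.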
A limitation of our methodology is that, because of its unsupervised nature, there is no way to improve it when it fails to select biologically reasonable genes, although it worked effectively in the present study.