Co-evolution based machine-learning for predicting functional interactions between human genes – Nature Communications
Clade-wise phylogenetic profiling outperforms traditional approaches
Clade-wise phylogenetic profiling (PP) takes into consideration the co-evolution of genes in different evolutionary scales, from the kingdom to the species level3,7,10. In addition, it was shown that different pathway types might show different co-evolutionary patterns, e.g. metabolic pathways being more conserved throughout evolution while signaling pathways often rewire14. Accordingly, we sought to utilize clade-wise PP to improve the predictive power of PP and enable the prediction of the interaction context by developing a supervised machine-learning based approach (Machine-Learning based Phylogenetic ProfilingMLPP). This approach integrates the phylogenetic profiling signals from 49 clades throughout 1154 species encompassing the eukaryotic tree of life. As the tree of life is hierarchical by nature, the clades for the analysis were chosen to cover the entire eukaryotes spaces, while reducing overlap and co-linearity between input features (see Methods, Supplementary Table1, Supplementary Fig.1).
We first computed a species-by-gene matrix representing the sequence similarity of every gene in each species to its human ortholog, with a given row comprising the phylogenetic profile of a single gene (see Methods). Then, for each clade, we calculated the covariance between the phylogenetic profiles for each pair of genes as features to the machine-learning algorithm. Thus, for each gene pair, we used 49 clade-wise covariances of the genes phylogenetic profiles as features. We then trained a binary classification model to predict gene pair functional interactions defined as co-occurrence in any Reactome pathway20. We used the same 49 features to train additional models to predict the interaction contexts for each context separately. The interaction context refers to the ways and functions in which genes are functionally related and is hereby defined as the co-occurrence of genes in some pathway type (12 Reactome top-level pathways, e.g., Metabolism, Immune System and 28 high level Gene Ontology terms) or protein complex co-occurrence in Reactome (see Methods)
We compared the performance of several machine-learning algorithms and positive-unlabeled frameworks21,22,23,24, choosing a random forest classification algorithm (similar to Claesen et al.23) on the basis of performance and robustness to unlabeled data (see Methods, Supplementary Methods, Supplementary Figs.2, 3). To determine the added benefit of using clades in comparison to random sets of organisms, we compared the real clades to randomized clades. The comparison revealed that the tree structure and clade-specific evolution are indeed important to the performance of the method (Supplementary Text1, Supplementary Table2). The method is also robust to the choice of blast pre-processing (See Methods, Supplementary Table3).
To test the performance of our method, we compared it to four established PP methods: normalized Phylogenetic Profiling (NPP)1, SVD-Phy25, PrePhyloPro (PPP)26 and the Hamming distance on a binarized phylogenetic profile (BPP)9,26. These four methods do not take clades into consideration and are based solely on similarity metrics between genes phylogenetic profiles. We showed that our method, trained on functional interactions from Reactome, outperformed the others in terms of auROC (Fig.1A) as well as partial auROC (at FPR<0.1) and average precision (Supplementary Table4, Supplementary Fig.5), achieving a 14%, 3%, and 10% increase relative to the next best methods, respectively.
As several biases may obscure this comparison, we performed cross-validation with stratification in addition to random allocation to train and test splits. Previous studies found that functional interaction prediction models tend to overfit to genes found in pairs in both the training and test set (but not in the same pairs). Gene pairs in the test set were thus stratified to having both, just one, or neither of the genes in the training set as previously suggested27 (see Methods, Supplementary Fig.5, Supplementary Table4). In addition, genes with high sequence similarity (for example, paralogous genes) tend to be functionally related and co-evolved. However, this relationship is easily captured without PP and thus produces optimistic results for predicting functional interaction. We thus stratified gene pairs for this phenomenon (see Methods). When filtering out gene pairs with high-sequence similarity, differences in performance were even more pronounced (Supplementary Table4, Supplementary Fig.5). Another consideration is that of gene age. More recent genes (i.e. first appeared in a common ancestor closer to humans) may prove more difficult for co-evolutionary based methods. This can be attributed to greater similarity between closer organisms, leading to high phylogenetic profiling similarity between these genes regardless of function. We thus stratified for this phenomenon and showed that indeed the models performance is reduced for the subset of genes found only in Metazoa and Chordata (See Methods, Supplementary Fig.4). However, these gene pairs constitute only a small portion of functional interactions in Reactome (1% for Metazoa specific and 0.5% of Chordata specific, mutually inclusive) and a high percentage of paralogous pairs (20% for Metazoa, 17% for Chordata, 5% for all genes).
In addition to functional interactions, our model predicts for each gene pair the interaction context. Previous studies showed that interactions belonging to different interaction contexts may show a globally different phylogenetic profile14. The interaction context represents additional information about the functional interaction of a gene pair, such as the pathway type. We showed that our approach outperformed the other PP methods in predicting pathway types from Reactome and high-level terms from GO and achieved high auROC, partial auROC and average precision in cross-validation and stratifications as described above (Fig.1B compared to NPP, see Supplementary Information for additional comparisons).
Further comparisons of temporal splits and external validation databases revealed similar gains in performance. We assessed the performance of our functional interaction model, which was trained on Reactome (Feb. 2019), in predicting functional interactions from a future version of Reactome (Jan. 2021). Our model was robust to these temporal changes (Supplementary Fig.6). Additionally, we externally validated our functional interaction model performance, trained on functional interactions from Reactome, in predicting functional interactions from both datasets similar and dissimilar to Reactome. Our model was robust for predicting PPIs from BioGridthe Biological General Repository for Interaction Datasets28 (Supplementary Fig.7AD) and from IntActthe EMBL-EBI Molecular Interaction Database29 (Supplementary Fig.7EH); functional interactions from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database30 (Supplementary Fig.8); and protein complex co-occurrence from CORUM31 (Supplementary Fig.9AD), and IntAct Complex29 (Supplementary Fig.9EH) databases. For protein complex co-occurrence, we also compared PP approaches to the In Complex interaction context model trained on complex co-occurrence in Reactome (Supplementary Fig.10). These external validations were robust for each dataset both at the whole dataset and excluding functional interactions found in Reactome, and when excluding functional interactions between paralogous genes.
As phylogenetic profiling is commonly used to understand functional interactions at the pathway level, we compared the different methods at this level. For each pathway, we calculated the pairwise score for all pairs of genes in the pathway. To enable a comparison between the different methods, the scores of all gene pairs for each method were normalized by conversion to percentiles. Comparing the median percentile per pathway for all KEGG pathways, MLPP (functional interactions model) outperformed the NPP method in 77.5% of cases and identified 43.8% of pathways at the 95% percentile level (Fig.2A, Supplementary Fig.11A). For example, for the KEGG30 pathway Fatty Acid Metabolism, MLPP predicted its pairwise interactions at higher percentiles (Fig.2B, the redder, the higher the percentile) than the BPP (Fig.2C) and NPP (Fig.2D) methods. A similar comparison is shown for the KEGG Valine Leucine and Isoleucine pathway (Fig.2EG). Larger gains in performance occurred when accounting for sequence similarity (Supplementary Fig.11C) and similarly when compared to the BPP method (Supplementary Fig.11B, D) and when comparing using the CORUM database of complexes (Supplementary Fig.11EH).
We then applied our method to identify modules of functionally interacting genes. For this, we clustered genes by their predicted functional interactions (i.e. predicted probability of interaction across all gene pairs) and extracted tightly interconnected modules by hierarchical clustering (see Methods). We identified many modules of known functionally interacting genes (Fig.3). Some of the modules, such as ciliary5,14 and heme biogenesis genes2, were described previously as highly co-evolved (Fig.3B, G, respectively). However, we also identified clusters where the signal was indeed contained in only a subset of the clades and thus more easily found by our method. For example, the mitochondrial respiratory complex III and IV genes (Fig.3C) and NADH dehydrogenase (Fig.3D) have a strong co-evolutionary signal in Fungi. Other clusters, such as the B12 metabolism cluster (Fig.3E) and the Histidine catabolism cluster (Fig.3F), show signals in both Fungi and Nematoda. Finally, the cluster in Fig.3A contains genes related to mRNA splicing as well as some genes with no previous association to splicing (in red). Many of the modules found contained mostly genes with high sequence similarity (Supplementary Fig.12) such as alcohol dehydrogenase enzymes (A), receptor/ion channel subunits (B, D, G), ribosome subunits (E, F), the exosome (C), collagen subunits (H) and histones (I). As previously stated, these modules were expected as genes with high sequence similarity are highly co-evolved and are often functionally related.
Functional annotation of the Ignorome
Next, we applied our method to predict biological function for less studied genes. Many genes (termed the Ignorome) are less studied and rarely mentioned if at all in the literature15. This makes understanding their function challenging. Our systematic and unbiased approach does not depend on additional data, generalizes well to genes found only in the test set (see Methods) and, accordingly could help capture the function of the Ignorome. Hence we focused on genes lacking functional annotation in Uniprot32. These genes were additionally required to be in the lower 20% of PubMed mentions (less than 10 papers) or belong to ~2000 genes identified by the neXtProt consortium as genes with an unknown function18 (Supplementary Fig.13).
To predict gene function, we utilized a random-walk based prioritization of genes, which we call the PathScore. Using our MLPP approach described above, we generated the fullpredicted functional interaction network for each of the interaction-context pathway type models. We then scored genes according to the equilibrium distribution of random walks on this network (see Methods). The score given for each gene signals its importance for this pathway type according to its connectivity in the predicted functional interaction network (see Methods). Shown for DNA Repair, PathScore ranks genes known to belong to this pathway type at the top (Fig.4A) and is robust across train and test splits in multiple cross-validations (Fig.4B, similar analyses can be found for the rest of the interaction context models in Supplementary Figs.14,15). By inspecting the less studied genes, we identified tens of genes at the top 250 PathScore ranks for each pathway type, yielding one or more annotations to 238 Ignorome genes (Fig.4C, Supplementary Data3).
Specifically, for DNA-repair type pathways, we identified several potentially functionally related genes. Out of the first 50 genesranked for DNA repair, we identified nine genes annotated as related in Reactome (22 for top 200). In addition, we identified nine genes that are known to be related to DNA repair but were not found in Reactome. These include EXO5 (rank 8), an exonuclease related to DNA-repair and genome stability33; C17orf53 (rank 16), previously an Ignorome gene, which was recently identified to be involved in homologous recombination repair34; the DNA polymerase DNTT (rank 18); the telomerase TERT (rank 23); the SMC5-6 complex gene SMC5 (rank 47, also EID3 in rank 55)35; and genes previously prioritized as related to DNA repair (ELP636rank 33, PIGN37rank 34, NUDT1538rank 36, STK1939rank 46). These serve as a strong external validation of the PathScore prioritization approach. In addition we identified 18 genes in the top 200 that were prioritized by several CRISPR assays to be related to DNA repair39 (rank in brackets)GPATCH8 (3), SCNM1 (7), OMA1 (10), AOC2 (50), RCE1 (71), ALG3 (92), THUMPD1 (111), DPH6 (115), PIGW (119), TYW1 (131), VPS16 (132), PPOX (143), DUSP12 (146), ISCA2 (158), NAALADL2 (187), POLA2 (194).
This approach also highlighted hundreds of genes that may be functionally related to several pathway types. Overall, we identified 1554 non-Ignorome and 58 Ignorome genes at the top 250 ranks for more than one pathway type. For example, the Yippee-type proteins YPEL1, YPEL2, and YPEL4 were ranked high in Cell Cycle, Disease, Gene Expression, Homeostasis, Metabolism of Proteins, Signal Transduction, Transport of Small Molecules and Vesicle-mediated Transport with YPEL1 and YPEL2, ranking in the top 250 for these eight pathway types. The Yippee family proteins are putative zinc-fingers known to be related to the centromere40.
MLPP uncovers evolutionary insights underlying pathway co-evolution
Phylogenetic profiling can be backtracked to produce evolutionary insights into pathway evolution. These insights include important loss events5,12 and analyses on different pathway types14. Our method enables similar evolutionary inference by calculating the contribution of each clade (feature) to the prediction of functional interaction for a gene pair. Clade contribution to the prediction is calculated using the SHAP method for tree-based models41,42. SHAP values are calculated by considering the change in predictions when the clade is present or absent from the model through all possible combinations. For example, for the gene-pair ACO1–IDH1 from the citric acid cycle (TCA), the probability for functional interaction is 0.87. The probability can be decomposed by SHAP to 0.117 for Fungi, 0.06 for Chromadorea, 0.01 for Ascomycota, and so forth (with a bias term of 0.427, Fig.5A; clades with a SHAP value less than 0.002 in absolute value are not shown). The interpretation of these values is conceptuallysimilar to the interpretation of the coefficients and intercept (bias) of a linear model. The evolutionary inference is made at the clade level and thus cannot point to the timing of specific loss events such as previously described, e.g. in ref. 5. Nevertheless, it can reveal clades where loss events may have happened, the pathways first introduction, or loss of co-evolution at the common ancestor level. Moreover, our model allows for a unified assessment of these evolutionary insights across all gene pairs, pathways and pathway types. We thus present insights into functional interactions and pathway evolution at these three levels.
As an example, we focus on the citric acid cycle (TCA) pathway. The model identifies Fungi and Chromadorea as the clades with the highest importance for predictions in this pathway (Fig.5B). These clades are complementary in predicting some of the functional interactions. While Fungi is the single most informative clade, it failed to predict the interactions of the genes PCK1 and PCK2 with the rest of the TCA genes. However, these interactions are well captured in Chromadorea (Fig.5C, D). This may relate to the function of PCK1/2 as these genes control gluconeogenesis from TCA intermediate metabolites and are thus peripheral in the pathway.
Overall, the phylogenetic profile of TCA in Fungi revealed that most genes are conserved throughout the clade except for the known loss of the pathway in the Microsporidia parasites43 (Fig.5E, demarcated with a box). Thus, the model links the phenotypic change in Microsporidians to the loss event by utilizing the importance measure provided by our method, demonstrating its applicability in identifying evolutionary insights. Two additional examples of pathway evolutionary insights are provided; methylmalonic acid metabolism and histidine metabolism (Fig.3E, F, respectively, the top part of each subfigure). In these pathways, the model identified specific loss events in nematodes by clade importance. Overall we showed that our model could capture specific loss events similar to those found in the previous approaches5.
Next, we sought to derive insights at a higher level of pathway co-evolution. We first assessed the general informativeness of clades in predicting functional interactions. Overall, for functional interactions the most critical clades were Fungi (mean absolute SHAP value of 0.04), Nematoda (0.022) and their subclades Fungi Incartae sedis (0.03) and Chromadorea (0.033) (Fig.6A, from the top by decreasing importance as the mean absolute SHAP values). Unexpectedly, these specific clades had a higher importance than using all Eukaryotes (0.018), suggesting that specific clades may prove more informative in general, both for our approach and for phylogenetic profiling in general.
For the interaction context, different pathway types had different clade importance (Supplementary Fig.16). It was initially expected that the more ancient pathway types would rely more on distant organisms and vice versa. This hypothesis was recapitulated in our analysis. For example, the Metabolism model assigns higher importance to distant clades such as Alveolates, Stramenopiles, Fungi, and all Eukaryotes, while the Immune System model assigns higher importance to clades closer to humans such as Metazoa and Ecdysozoa. However, some pathways displayed a counterintuitive attribution of clade importance. For example, the Signal Transduction model assigns high importance to all Eukaryotes while expected to rewire often and thus be more informative in organisms closer to human.
We then examined the clade importance of gene pairs and pathways. For this, average SHAP values per pathway were projected using UMAP44. This analysis clusters together pathways with a similar clade importance attribution (Fig.6B). For example, many metabolic pathways (Fig.6B, bottom right) give high importance to Chromadorea and Fungi. Similarly, a group of receptor types, complexes, and signaling pathways give high importance to all Eukaryotes and Chromadorea (Fig.6B, top left). These differences highlight the added value of clade-wise phylogenetic profiling, which is able to detect co-evolution in subsets of the Eukaryotic tree. UMAP projection of gene-pair SHAP values identified similar insights. Here, clusters of gene pairs with similar clade importance show that the highest-scoring pairs gave high importance to both Fungi and Nematoda (Supplementary Fig.17).
Overall, our method enables one to uncover specific patterns across gene pairs, specific pathways, and all pathways level thus shedding light on pathway evolution. These insights can be categorized into two types. First, identifying clades with gene loss events that translate to meaningful phenotypical effects and, second, shedding light on the underlying evolutionary processes behind pathways of various kinds. These include per pathway gain, differences among clades in pathway losses and the informativeness of various clades for phylogenetic profiling of functional interactions in general and specific pathway types particularly.
Analysis of parasitic organisms signal in phylogenetic profiling
Many of the most informative clades described above, such as Chromadorea, Stramenopiles, Alveolata, and Fungi Incertae Sedis contain a large percentage of parasitic organisms. Parasitic organisms are known to undergo vast gene losses and drastically diverge from their free-living counterparts45,46,47,48. We thus hypothesized that these insights may be related, identifying parasitic organisms across the tree of life as a key signal in phylogenetic profiling.
Parasitic organisms (see Methods, Supplementary Table5) are generally less conserved with respect to humans than free-living organisms (Fig.7A, in red). The lowest percent of orthologs are found in parasitic organisms in Alveolates, Microsporidia (Fungi Incertae Sedis, denoted as Fungi I.S) Kinetoplastids and intestinal flagellates (Hexamitidae, denoted as other eukaryotes) (Fig.7A, in red). Only two organisms show similar loss levels, one micro-algae (Nannochloropsis gaditana in Stramenopiles, marked with a green arrow) and one endosymbiotic Kinetoplastida (Perkinsela, marked with a red arrow). For the endosymbiont, the same rational of gene loss related to host adaptation was previously described49.
Six parasites containing clades were further compared to non-parasitic (free-living or mutualistic) organisms in the same clades and a reference parent clade (Fig.7B, C, a full list of organisms considered parasitic can be found in Supplementary Table5, see Methods). Parasitic clades showed a statistically significant reduction in conservation level (as compared to human) from both the non-parasitic organisms (Nematoda, Alveolata) and the reference clade (all but Stramenopiles, 2-sided MannWhitney test, Fig.7B). Moreover, many of the lost genes in parasitic organisms were highly conserved in non-parasitic organisms in the reference clade (Fig.7C).
Thus, we were interested in mapping the genes lost in each clade and how this signal translates across clades and into pathways. We analyzed genes that are highly conserved in all Eukaryotes but have low conservation in at least one parasitic clade (see Methods), ending up with 4114 genes (Fig.7D). The presence and absence of each gene in each clade were considered, and all clade combinations underwent enrichment analysis for Gene Ontology (GO) biologic processes (see Methods). Parasitic clades often lack orthologs for specific metabolic, signaling, and developmental pathways (Fig.7D, top 10 clade combinations by number of genes in the intersection). By inspecting the different combinations, we found that some pathways were enriched in certain combinations. For example, mRNA splicing genes were lost both in Kinetoplastida and Microsporidia, while GTP signaling genes were lost in either the 2nd, 3rd, or 8th combination (from the left), all consisting of Microsporidia and Alveolates. Hence, some pathways can be identified by their loss pattern in parasitic organisms.
However, a sensitivity analysis of excluding parasitic organisms in the training of the method results in conflicting evidence. On one hand, while many of the clades with highest importance contain parasites (as mentioned above), excluding parasites or these clades causes only a mild reduction in performance (see Methods, Supplementary Table6). On the other hand, clade importance does indeed shift from these clades to others (Supplementary Fig.18). This suggests that while parasitic organisms do indeed contribute to the model presented in this work, other signals exist in phylogenetic profiling which can reach a similar performance.