RESEARCHOpen Access icon OPENOpen Access license

Computational Structural Genomics Unravels Common Folds and Novel Families in the Secretome of Fungal Phytopathogen Magnaporthe oryzae

    Affiliations
    Authors and Affiliations
    • Kyungyong Seong
    • Ksenia V. Krasileva
    1. Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, U.S.A.

    Published Online:https://doi.org/10.1094/MPMI-03-21-0071-R

    Abstract

    Structural biology has the potential to illuminate the evolution of pathogen effectors and their commonalities that cannot be readily detected at the primary sequence level. Recent breakthroughs in protein structure modeling have demonstrated the feasibility to predict the protein folds without depending on homologous templates. These advances enabled a genome-wide computational structural biology approach to help understand proteins based on their predicted folds. In this study, we employed structure prediction methods on the secretome of the destructive fungal pathogen Magnaporthe oryzae. Out of 1,854 secreted proteins, we predicted the folds of 1,295 proteins (70%). We showed that template-free modeling by TrRosetta captured 514 folds missed by homology modeling, including many known effectors and virulence factors, and that TrRosetta generally produced higher quality models for secreted proteins. Along with sensitive homology search, we employed structure-based clustering, defining not only homologous groups with divergent members but also sequence-unrelated structurally analogous groups. We demonstrate that this approach can reveal new putative members of structurally similar MAX effectors and novel analogous effector families present in M. oryzae and possibly in other phytopathogens. We also investigated the evolution of expanded putative ADP-ribose transferases with predicted structures. We suggest that the loss of catalytic activities of the enzymes might have led them to new evolutionary trajectories to be specialized as protein binders. Collectively, we propose that computational structural genomics approaches can be an integral part of studying effector biology and provide valuable resources that were inaccessible before the advent of machine learning-based structure prediction.

    Copyright © 2021 The Author(s). This is an open access article distributed under the CC BY-NC-ND 4.0 International license.

    Fungal phytopathogens encode a diverse set of effector proteins to infect and colonize plant hosts. Effectors include rapidly evolving or recently emerged proteins, the biological roles of which cannot be easily inferred with primary sequence information alone. On the other hand, structural biology has illuminated effector evolution through effectors’ structural commonality (Franceschetti et al. 2017). MAX effectors from Magnaporthe oryzae and Pyrenophora tritici-repentis are unrelated by their sequences but share a common β-sandwich fold (de Guillen et al. 2015). A few sequence-unrelated proteins form an analogous LARS effector family in Leptosphaeria maculans (Blondeau et al. 2015; Lazar et al. 2020). RALPH effectors commonly adopt the RNase fold despite highly divergent sequences in powdery mildews (Bauer et al. 2021; Pedersen et al. 2012; Pennington et al. 2019; Spanu 2017). Common structural folds without sequence similarity may be the outcome of convergent evolution or, alternatively, the loss of detectable homology (Andrie et al. 2008; de Guillen et al. 2015). Regardless of the true origin, their repeated appearance manifests their essential roles in virulence. However, because solved effector structures are limited, only a small number of common structural folds have been uncovered to date (Mukhi et al. 2020).

    Structural genomics is an approach that aims to determine the structures for all proteins by experimental and prediction methods (Baker and Sali 2001). The application of structural genomics is valuable to elucidate effector biology; however, it has remained largely limited. Experimental determination of all effector structures would require immense time and community effort. Prediction is an attractive alternative. Nonetheless, predicting the folds of proteins has, until recently, been predominantly dependent on homology modeling that has obvious limitations: because of the lack of template structures and detectable homology between related effectors, most of the effector structures are not predictable.

    The Critical Assessment of Structure Prediction (CASP) 13 in 2018 demonstrated the success of de novo folding algorithms. These algorithms, represented with AlphaFold, leverage machine learning to predict interresidue distances based on covariance inferred from a multiple sequence alignment (MSA) of the target and its homologous sequences (AlQuraishi 2019). Although structure prediction does not achieve atomic resolutions comparable with experimental structure determination techniques, the folds of unknown proteins were successfully predicted in the absence of homologous templates (Senior et al. 2020). Additionally, the expected quality score reported for predicted structures strongly correlates with the actual precision (Senior et al. 2020; Yang et al. 2020; Zhang 2008). Collectively, the ability to predict the structural folds and estimate the model quality enables us to apply structural prediction on phytopathogens’ secretomes.

    In this study, we tested the computational structural genomics approach on 1,854 secreted proteins from destructive fungal phytopathogen M. oryzae (Dean et al. 2012; Wilson and Talbot 2009). We show that the template-free algorithm TrRosetta performs better than template-based modeler I-TASSER in predicting known and unknown secreted protein structures. Using structural information, we found analogous gene families and uncovered evidence of new common folds across phytopathogens. We propose that computational structural genomics can complement traditional genomic approaches in the analyses of effector evolution. The rapidly evolving protein structure field would help to guide the rational selection of conserved effector folds in phytopathogens for the development of new disease resistance strategies in near future.

    RESULTS

    Template-free modeling with TrRosetta outperforms homology modeling in predicting the folds of known effectors.

    We assessed the performance of TrRosetta and I-TASSER on nonredundant 9 secreted and 15 nonsecreted Magnaporthe proteins available in the Protein Data Bank (PDB) (Supplementary Fig. S1; Supplementary Table S1) (Berman et al. 2000). We predicted their structures with each tool and superposed the computational models against the experimental ones with template modeling (TM)-align to measure actual precision as TM scores (Zhang and Skolnick 2005). A TM score >0.5 indicates that the two compared structures display approximately the same fold (Xu and Zhang 2010). Nonsecreted proteins with functional annotations were relatively easy targets for both methods, given the high precision of the predicted structures (Fig. 1A). TrRosetta produced structures with TM scores >0.5 for known effectors AvrPiz-t, Avr-Pia, Avr-Pib, and Avr-PikD, whereas I-TASSER only predicted the Avr-Pia structure (Supplementary Fig. S2) (De la Concepcion et al. 2018; Ose et al. 2015; Zhang et al. 2013; Zhang et al. 2018). This suggested that template-free modeling with TrRosetta outperforms homology-based modeling with I-TASSER in the selected effector structure modeling, successfully predicting the effector folds (TM score >0.5).

    Fig. 1.

    Fig. 1. TrRosetta predicts the folds of many known effector structures. Statistics of protein structure prediction on A, 24 Magnaporthe proteins and B, 15 fungal effector proteins with solved structures available in the Protein Data Bank. The actual precision was obtained as template modeling (TM) scores by superposing the computational models from TrRosetta or I-TASSER against the experimental structures with TM-align. The estimated precision is a metric TrRosetta produces as a measure of the prediction quality and is reported to well correlate with actual precision. Actual precision (TM score) >0.5 indicates that the predicted and experimentally determined structures display approximately the same fold. Estimated precision >0.5 indicates that the fold of the predicted structures is likely correct.

    Download as PowerPoint

    We collected 15 available effector structures across other fungal phytopathogens and examined whether TrRosetta could predict their folds (Fig. 1B; Supplementary Table S2). The MSAs for carbohydrate-binding Avr4, LysM-containing Ecp6, and RNAse-like BEC1054 contained diverse homologs (>2,000) for coevolutionary inference and, accordingly, TrRosetta predicted their folds. In contrast, AvrL567-D, AvrLm4-7, Avr2, SnTox3, and ToxB had only a limited number of homologs collected (≤60). Nonetheless, their folds were still predicted, suggesting that TrRosetta can capture the overall folds of many effector structures when coevolutionary information is limited.

    A combination of template-free and homology-based modeling resolves a large subset of secreted protein structures.

    We next modeled 1,854 putative secreted proteins encoded in the M. oryzae genome. TrRosetta and I-TASSER use the average probability of the top L predicted long- plus medium-range contacts (|ij| > 12) and the mean estimated TM score to estimate prediction qualities. These metrics, which we collectively termed as estimated precision, are reported to correlate well with actual precision, and the estimated precision >0.5 indicates that the fold of the predicted structures is likely correct (Yang et al. 2020; Zhang 2008). TrRosetta and I-TASSER predicted 627 structures in common with estimated precision >0.5 (Fig. 2A; Supplementary Table S3). Among them, 493 TrRosetta models displayed equal or higher expected precision. Each tool predicted additional 514 and 154 structures with estimated precision >0.5. Collectively, 70% of the secreted proteins were modeled by at least one of the methods with expected precision >0.5.

    Fig. 2.

    Fig. 2. Structure prediction statistics on 1,854 Magnaporthe oryzae secreted proteins. A, Structure prediction statistics on 1,854 M. oryzae secreted proteins. The average probability of the top L predicted long- plus medium-range contacts (|ij| > 12) and the mean estimated template modeling (TM) score were used as the expected precision for TrRosetta and I-TASSER, respectively. These values are reported to well correlate with the actual precision (Yang et al. 2020; Zhang 2008). The number of proteins belonging to each section is indicated in the plot. B, Mature protein length distribution (AA = amino acids) and C, number of homologs collected for coevolutionary inference and the proportion of predicted disordered residues for 559 proteins that were not predicted by both TrRosetta and I-TASSER. D, Structure-based classification of 527 proteins without functional annotations. The 527 protein structures were assigned to SCOPe and CATH categories with RUPEE. The TM score cut-off >0.5 was required for the classification. The top 15 hits are reported.

    Download as PowerPoint

    The lack of homologous templates explains the relatively poor performance of homology modeling with I-TASSER. Many models exclusively modeled by I-TASSER included very short or highly disordered proteins that may not be trustworthy (Supplementary Fig. S3; Supplementary Table S3). The 559 proteins missed by both TrRosetta and I-TASSER were predominantly small (Fig. 2B) and did not have a sufficient number of homologs for coevolutionary inference, possibly attributed to the loss of detectable homology or recent origin (Fig. 2C). A subset of these proteins was likely intrinsically disordered, because the proportion of predicted disordered residues is high (Fig. 2C). These structures tend not to adopt a single, foldable confirmation and, therefore, their structures would not be able to be predicted.

    We classified the 527 uncharacterized proteins without PFAM domains or with the domain of unknown functions by searching for similar structural folds and topologies in the SCOPe and CATH databases with RUPEE (Fig. 2D; Supplementary Table S4). The classification identified enzymes of diverse functions. These include hydrolytic enzymes adopting the α/β-hydrolase and Rossmann fold, glycosidases displaying the TIM (β/α) barrel structure, glucanases belonging to the concanavalin A-like lectins or glucanase and jelly roll structures, and metallopeptidase of the zincin-like and collagenase fold. Such results indicated that a subset of the unknown M. oryzae secretome contains divergently or rapidly evolving enzymes. Other structural folds and topologies were also present, which may supplement the infection mechanisms of M. oryzae (Supplementary Table S3).

    Template-free modeling with TrRosetta captures the structures of previously identified effector proteins.

    We predicted the folds of previously identified effectors from M. oryzae, the structures and functions of which are mostly unknown (Table 1). Among those with homologous structures, all were predicted, except for MoCDIP12, which has a homologous domain to Avr-Pik (Fig. 3A). In most cases, the TrRosetta models displayed higher expected precision, and structural comparisons supported the quality of the predicted models (Supplementary Fig. S4). Among the cloned effectors without homologous structures, TrRosetta could generate models with high estimated precision for MoNIS1, Avr-Pi54, EMP1, MoCDIP1, MoCDIP5, MoCDIP10, Avr-Pita1, and BAS4 (Fig. 3B). With lower estimated precision, the structures for PWL, BAS3, and MoCDIP3 were also predicted (Supplementary Fig. S5).

    Table 1. Collection of previously identified effectors in Magnaporthe oryzae and structure prediction statistics

    Fig. 3.

    Fig. 3. Predicted structures of known effectors in Magnaporthe oryzae. A, Predicted structures of known effectors that have easily identifiable template structures with BLASTP. B, Structures of known effectors predicted by TrRosetta that do not have easily identifiable templates. C to E, PFAM and structure-based annotations of MoCDIP1, MoCDIP5, and MoCDIP10. The PFAM domain search and structural similarity search with RUPEE against the SCOPe and CATH databases were used for the annotation. The structurally unaligned region of MoCDIP5 is indicated in lighter yellow (D).

    Download as PowerPoint

    We searched for structural matches of the effectors to reveal their putative biological roles. Although PFAM search did not uncover any conserved domains, structural comparability to pectin lyases and metallopeptidases existed for MoCDIP1 and MoCDIP5, suggesting that these cell-death-inducing proteins may be enzymes (Fig. 3C and D) (Chen et al. 2013). However, conserved catalytic residues found in structural matches were not present, indicating different modes of catalytic mechanisms (Supplementary Fig. S6) (Cho et al. 2001; Lenart et al. 2013). For MoCDIP10, ferritin-like domain and immunoglobulin-like β-sandwich fold were detected (Fig. 3E) (Guo et al. 2019). The newly detected β-sandwich fold resembled DNA-binding and other protein-binding domains, possibly suggesting that it may aid binding to host targets (Supplementary Fig. S6).

    Sensitive sequence similarity search and structural comparisons define secretome classes in M. oryzae.

    Structural comparisons together with sensitive sequence similarity searches can better unravel the interconnection between the secreted proteins in M. oryzae. Sequence-to-sequence similarity search with BLASTP revealed that 819 proteins had at least one other homolog present in the secretome (Fig. 4A; Supplementary Table S5). A more sensitive profile-to-sequence similarity search with HHblits linked an additional 186 sequences into homologous groups. Profile-to-profile similarity search with HHsearch and structural comparisons with TM-align added 49 and 69 proteins to the clusters, respectively. Eventually, 1,123 proteins (60%) had at least another sequence or structure-related protein in the secretome. In all, 731 proteins remained as singletons that lacked detectable paralogs and structural analogs in the secretome, and only 118 proteins had functional annotations (Fig. 4B). Of the 613 remaining proteins without PFAM annotations, 86 proteins had predicted structures with the estimated precision ≥ 0.6, which could be relatively reliably used for subsequent analyses.

    Fig. 4.

    Fig. 4. Statistics on secretome clustering and the MAX effector cluster. Four methods were sequentially used to reveal sequence-based and structure-based similarity. In all cases of the sequence-based methods, only the pairs with E-value < 10−4 and bidirectional coverage >50% were regarded as significant. A, The number of proteins found in clusters with at least another homolog or structural analog in the Magnaporthe oryzae secretome. B, The number of singletons without any homologs or analogs in the M. oryzae secretome. The number of proteins with meaningful PFAM domains, excluding domains of unknown functions, or structures predicted with estimated precision ≥ 0.6 are indicated. C and D, The network graph for MAX effectors and the number of newly retrieved singletons. C, Each node and edge represent a protein and similarity that can be detected by the method. The MAX effector cluster with 11 members (cluster 26) exists inside the yellow dotted ring. The newly retrieved singletons remain outside the ring. D, Criteria for the final model selection and the significant structural similarity were relaxed by the template modeling (TM) score of 0.01. The number of newly retrieved singleton members is indicated. Colors correspond in C and D. E, Structural superposition of the newly retrieved MAX effector members and their most similar structures available in the Protein Data Bank. The normalized TM scores, as a measure of similarity, were 0.47 and 0.63 for MGG_04384 and 2MM0 (ToxB), 0.47 and 0.46 for MGG_17464 and 2MM2 (ToxB), 0.81 and 0.89 for MGG_15972 and 6G10 (Avr-PikD), and 0.47 and 0.60 for MGG_16475 and 2N37 (Avr-Pia).

    Download as PowerPoint

    We specifically traced putative MAX effectors, because sensitive sequence similarity search and structural comparisons are required to reveal them (de Guillen et al. 2015). Among the final clusters was cluster 26, a group of 11 uncharacterized proteins, which includes homologs of Avr-Pib (MGG_12426) and AvrPiz-t (MGG_18041) (Fig. 4C; Supplementary Table S5). In sequence-to-sequence similarity search, these 11 proteins were separated into six singletons and two clusters. By the profile-to-sequence similarity search, two singletons and three individual clusters formed, which did not display sufficient sequence similarity to each other. However, the structure-to-structure comparison was able to link them all, eventually placing the Avr-Pib and AvrPiz-t homologs or analogs into a single cluster.

    Many predicted MAX effector structures were barely modeled with a precision of 0.5 (Fig. 1A), and structural similarity of some solved MAX effector pairs displayed TM scores ≤ 0.5 (Supplementary Fig. S7). Therefore, the standard criterion of TM score >0.5 will likely miss putative MAX effector candidates in the final model selection or structural similarity comparison. We reduced the TM score cut-off by 0.01 in the two procedures and examined whether any singletons could be retrieved into this MAX effector cluster (Fig. 4D). By decreasing the cut-off by 0.03, we found five new members, four of which identified solved MAX effector structures as their most similar analogs (Fig. 4E). By 0.09, 11 new members were retrieved. However, only one of them (MGG_17266) displayed similarity to a MAX effector structure with an average TM score of 0.45, indicating that the new members are unlikely MAX effectors (Supplementary Table S5). Such outcomes suggested that relaxing the TM score cut-offs to some extent may be appropriate for putative candidate effector selection for a family of interest.

    Structural clustering unravels novel families of sequence-unrelated structural analogs.

    With the structure-based clustering, we could capture the presence of novel sequence-unrelated, structural analogs (Supplementary Table S5). An example is BAS4, which is highly expressed at the invasive hyphae and participates in the transition from biotrophy to necrotrophy (Mosquera et al. 2009; Wang et al. 2019). BAS4 exists in cluster 17, with 14 other members initially divided into five distinct groups based on homology (Fig. 5A). However, the predicted structures disclosed a common ferredoxin-like fold and α-β plait topology, linking the groups into a single cluster (Fig. 5B). Interestingly, the CATH classification assigns LARS effectors, including AvrLm4-7 (4FPR), to the α-β plait topology (Blondeau et al. 2015). In comparison with the members in cluster 17, AvrLm4-7 is larger and displays a difference in the secondary structure topology. Nevertheless, the structural superposition illustrated that MGG_01064 is roughly contained in AvrLm4-7, possibly suggesting unknown virulence functions associated with this topology that may be important for phytopathogens (Fig. 5C).

    Fig. 5.

    Fig. 5. Large clusters of sequence-unrelated structurally similar effectors and the appearance of structural analogs in other phytopathogens. A and D, Network graphs for clusters 17 and 14, respectively. The node represents a member in the cluster and is colored according to its membership based on profile-to-profile and profile-to-sequence similarity search for clusters 17 and 14, respectively. The edge indicates detectable structural similarity. If such similarity is present between two proteins with different membership, their nodes are connected. B and E, Representative structures for clusters 17 and 14, respectively. Structures predicted with the highest expected precision were selected from each sequence-based cluster. C, Structural superposition between the predicted MGG_01064 and LARS effector protein 4FPR. The template modeling (TM) score, as a measure of similarity, normalized for each structure is indicated in the parentheses. F, Structural superposition between the MGG_16533 model and Ecp7 as well as the C-terminal region of Ecp29 (158-266) predicted by TrRosetta.

    Download as PowerPoint

    Structural comparisons revealed another fold comprising the effector family in M. oryzae and its analogs in a phytopathogen. The 18 members in cluster 14 are divergent, and the profile-to-sequence similarity searches only revealed three homologous groups and three singletons (Fig. 5D; Supplementary Fig. S8). Yet all of them share the γ-crystallin-like fold and, thus, are structurally related (Fig. 5E). Lineage-specific presence and absence variations, the exclusive appearance of some homolog in cereal-infecting fungi, and a high expression during the transition to biotrophy all indicate that this group of secreted proteins likely represents true effectors (Gardiner et al. 2012; Torres et al. 2016; Zhong et al. 2016). It was also suggested that four extracellular effectors (Ecp4, Ecp7, Ecp29, and Ecp30) from Cladosporium fulvum would adopt a β/γ-crystallin fold (Mesarich et al. 2018). Consistently, TrRosetta predicted the expected fold for the Ecps with the estimated precision >0.5 (Supplementary Fig. S9). Indeed, the Ecp proteins displayed noticeable similarity to the M. oryzae effectors, collectively indicating that structural analogs may play a significant role across phytopathogens (Fig. 5F).

    Secretome clustering and structure-based functional inference lead to new hypotheses for the infection mechanism of M. oryzae.

    Secretome clustering and structure-based annotation could help formulate new hypotheses. For instance, the structure prediction and comparisons identified a small RNase cluster in the M. oryzae secretome (cluster 74). The predicted structures belong to the T1 family with similarity to the Blumeria graminis RNase-like effector (6FMB), pointing to the possible existence of a common mechanism in the two distant phytopathogens (Supplementary Fig. S10). Another example is proteins containing the necrosis-inducing factor domain in cluster 33 (Hec2 domain; PF14856), the structures of which were predicted with high estimated precision (Stergiopoulos et al. 2010). The predicted model identified glycan-binding protein Y3 isolated from fungus Coprinus comatus (5V6J) as the closest analog and was linked to seven other sequence-unrelated members by structural analogy (Supplementary Table S5) (Zhang et al. 2017). The structural similarity to Y3 is limited to the fold level. However, because the previous study identified cooccurrence of the Hce2 domain with other glycan-binding LysM domains, chitin-binding modules, and chitinase, the Hec2 domain may be possibly involved in their common biological goal of glycan binding and processing (Stergiopoulos et al. 2012).

    A group of secreted pseudoADP-ribose transferases may have evolved from canonical ADP-ribose transferases to serve as specialized binders.

    One of the largest clusters, cluster 8, includes 27 members, 10 of which possessed predicted structures with estimated precision > 0.75 and the ADP-ribosylation fold (Fig. 1C; Supplementary Table S5). ADP-ribose transferases (ARTs) in plant pathogenesis are well represented with type III effectors and Scabin toxins (Feng et al. 2016; Fu et al. 2007; Lyons et al. 2016; Singer et al. 2004); however, their role is largely unknown in fungal pathogens. To elucidate how this clade evolved, we employed Shannon’s entropy on the 10 core ART members of this cluster (Prigozhin and Krasileva 2021; Seong et al. 2020). Well-conserved residues measured with entropy < 1 contained some known catalytic residues and residues around them (Fig. 6A) (Aravind et al. 2014; Katoh and Standley 2013). These residues primarily appeared in proximity in the three-dimensional structure (Fig. 6B), composing a highly similar structural core among paralogs (Fig. 6C) to which the NAD+ molecule was predicted to dock (Supplementary Fig. S11) (W. Zhang et al. 2020). Conversely, sequence variations correlated with structural deviations for other residues, suggesting that the core ARTs have evolved divergently while maintaining their core structures and functions.

    Fig. 6.

    Fig. 6. Evolution of putative ADP-ribose transferases (ARTs) informed through structural comparison. A, Entropy plot for the core putative ARTs with structures predicted with estimated precision > 0.75 and the ADP-ribosylation fold. The 10 sequences were aligned with MAFFT, and Shannon’s entropy was calculated for the columns that contain MGG_00230 sequences. The gap was ignored in the entropy calculation but the proportion of gap characters is indicated below the entropy plot. The cut-off for conserved residues was entropy < 1 and the proportion of gap < 0.1. Known catalytic residues (red) and residues around them (black) are indicated in the entropy plot. B, Ribbon structure of MGG_00230 with annotated conserved and catalytic residues in magnate and red, respectively. C, Structural superposition of the 10 core ARTs and Scabin toxin generated with multiple template modeling (mTM)-align (Dong et al. 2015). The structural core measured with the maximum pairwise residue distance < 4Å is indicated in magnate. D, Homology relationship between the core and secondary ARTs in cluster 8. The arrow indicates the direction of homology detection. For instance, MGG_00230 → MGG_09666 denotes that homology to MGG_09666 is detectable when MGG_00230 is used as a query. HHblits was used as a more sensitive sequence similarity method. E, Structural superposition of the four secondary ARTs and a core ART, MGG_00230, generated with mTM-align. The structural core is in magnate. F, Surface structures of secondary ARTs MGG_16321 and MGG_09666. The left models indicate the conserved structural core in magnate. The right models display interaction interfaces in red predicted by MaSIF. G, A proposed mechanism of the ART evolution in Magnaporthe oryzae.

    Download as PowerPoint

    The other members in this cluster are highly divergent from the core ARTs in primary sequences and, therefore, we designate them as secondary ARTs. The homology from core ARTs to the secondary ARTs is not always obviously detectable (Supplementary Fig. S12). Homology detection is often unidirectional, and sequence similarity to an intermediate sequence (e.g., MGG_09666) is required to infer homology between two members (Fig. 6D) (Park et al. 1997). To understand how the secondary ARTs may evolve, we focused on the four members with predicted structures (Supplementary Table S5). Upon superpositioning their structures against a core ART, noticeable differences in evolutionary patterns appeared in sequences and structures. First, the catalytic residues were absent, implying that these proteins may be pseudoenzymes incapable of mediating NAD+-dependent ADP-ribosylation (Supplementary Fig. S13) (Waterhouse et al. 2009). Second, structural conservation is skewed toward one side of the core ART (Fig. 6E). Because the conserved region may play a functional role, we employed MaSIF to predict the protein interaction interface (Gainza et al. 2020). The comparison of the surface structures revealed that the predicted interaction interface overlaps with the structurally conserved regions, while the overall fingerprints of the paralogs are distinct (Fig. 6F). This possibly indicates that the structural core may constitute the backbone of the interface, whereas the other sequence around the core may determine the shape complementarity for substrates or protein targets.

    The data suggest a hypothesis for the ART evolution in M. oryzae. Canonical, bifunctional ARTs, which bind to possibly essential host targets and transfer an ADP moiety to alter their cellular activities, first emerged and formed the core ART cluster by frequent duplication and diversification (Fig. 6G). One of the paralogs lost catalytic residues necessary to metabolize NAD+ and began to deviate from the canonical ARTs in evolution. Instead of being selected against, this paralog evolved the remaining protein–protein interaction interface and became specialized as host protein binders. Eventually, additional pseudoARTs arose by duplication events, and they subsequently diversified for different host proteins.

    DISCUSSION

    Although sequencing the genomes of new pathogen strains became common practice (Islam et al. 2016; Tembo et al. 2021; Win et al. 2021), elucidating the functions and structures of secreted proteins remains challenging. Additionally, primary sequences alone are typically insufficient to infer the roles of effectors, necessitating a novel approach to tackle the problem. Computational structural genomics has been a nearly infeasible concept with homology-based structural modeling. However, we show that template-free modeling enables us to apply this methodology to study pathogen proteins. The accuracy of predicted structures is not yet comparable with experimentally determined structures. Nonetheless, analyses such as structural classification, comparisons, and clustering can be adequately conducted with the predicted structures. It is also advantageous that such analyses are performed at the secretome-wide level, which would not be possible with experimentally determined structures because of their limited availability. With the new layer of information, computational structural genomics opens new possibilities in data analyses of pathogen secretomes.

    Comparative genomics with predicted structures may enhance our understanding of effector evolution.

    Computational structural genomics allows us to pinpoint unknown proteins that would adopt similar folds of known effector structures such as MAX effectors. However, whether the evolution of the new candidates would resemble that of the known effectors is unclear. Comparative genomics has provided insights about effector evolution, revealing lineage-specific presence or absence variations and sequence diversification (Kim et al. 2019). Comparative genomics should include predicted structures to examine evolutionary mechanisms that may govern effector evolution in analogous families.

    Comparative computational structural genomics may reveal commonly used effector folds for immune receptor engineering.

    Our computational structural genomics approach revealed novel effector families that display the α-β plait topology or the γ-crystallin-like folds. More importantly, structural analogy to other sequence-unrelated effectors in different phytopathogens was present, suggesting that phytopathogens may commonly employ effectors with similar folds. Therefore, we propose that effectorome-wide structure prediction for diverse phytopathogens and comparative computational structural genomic analyses should be followed. Such studies may also provide a new path for nucleotide binding leucine-rich repeat immune receptor engineering to improve plant immunity. In a recent study, the authors demonstrated that allelic intracellular MLA receptors recognize structurally similar RNase-like effectors through their polymorphic C-terminal leucine-rich repeats (Bauer et al. 2021). If phytopathogens share effectors with a common structural fold, with immune receptors that can recognize such an effector, we may be able to rationally design or directly evolve variants that can target other structurally similar effectors.

    Secretome-wide structural clustering helps to prioritize effectors for experimental structural determination and functional elucidation.

    The structure-based functional inference is analogous to sequence-based functional annotation transfer. Both require not only predicted structures or sequences of good quality but also the existence of references with known roles and functions. The scope of solved effector structures and our understanding of their functions are currently limited. Therefore, molecular and structural biology work remains critical, regardless of the improvement in structure prediction algorithms. Our work provides a unified structural genomics resource that can be used to group and prioritize candidate effectors for further analyses.

    Rapidly improving protein structure prediction algorithms are offering solutions to the current challenges.

    Approximately 30% of the secreted proteins could not be predicted by either TrRosetta or I-TASSER. Some of these proteins are predicted to be largely disordered and would likely be unfoldable and, therefore, protein structure prediction may not provide any useful information about their folds and functions. Others typically failed to retrieve a sufficient number of diverse homologs necessary for coevolutionary inference, likely attributed to homology detection failure or recent origins of effectors. There are a few available options to predict the folds of these proteins. First, other state-of-the-art prediction tools such as RaptorX and RoseTTAFold could be additionally utilized (Baek et al. 2021; Xu et al. 2021). Second, prediction performance with small MSAs could be improved through, for instance, MSA dropout and consistency learning (Liu et al. 2021). Third, more intensive structural refinement with DeepAccNet may be able to correct the inaccurate original structures (Hiranuma et al. 2021). Finally, AlphaFold 2 demonstrated exceptional success in the CASP 14 (Callaway 2020). Algorithms with a similar level of performance could become available in the near future, and we expect that they will be able to predict the folds of many proteins missed in this study.

    The emergence of pseudoenzymes—Can this be a common theme in effector evolution?

    We proposed the emergence of pseudoenzymes and their subsequent diversification in the ART evolution (Fig. 6G). Although the discussion about this notion is scarce as yet in effector evolution, it is not entirely new. B. graminis secretes a significant number of effectors that adopt the RNase fold; however, many of these effectors lack essential residues for catalytic activities, suggesting that they may be pseudoenzymes (Pedersen et al. 2012). Recent studies demonstrated that RNase-like effectors without a canonical enzymatic activity have a functional role in pathogenesis and are targeted by immune receptors (Bauer et al. 2021; Pennington et al. 2019).

    The expansion of pseudoenzymes and the validation work of their function raise an intriguing perspective in effector evolution. For both M. oryzae pseudoARTs and B. graminis RNase-like effectors, the ancestral, canonical protein would likely have properties to bind to important host targets. Loss of catalytic activities would not be evolutionarily deleterious to the effector, if its binding to the host targets was sufficient for virulence. Instead, this event would relax evolutionary constraints to maintain the enzymatic functions in both sequence and structure, eventually opening new paths in evolutionary trajectories that canonical ARTs or RNAse effectors could not access. Therefore, the ancestor could become specialized as a binder, and frequent duplication and diversification could subsequently allow the paralogous proteins to follow other accessible evolutionary paths, eventually leading to an expanded protein family that may target other host molecules.

    Future perspective.

    Genome-scale protein structure prediction is still time consuming and computationally intensive. However, with the advances in machine learning and parallel computing, the field of protein structure prediction is rapidly evolving to challenge this problem. The growing structural data will shift our perspectives on evolution toward three-dimensional space, unrestricted to primary sequences. We also foresee that computational structural genomics will be applied to much larger proteomes of the plant hosts. Such datasets will facilitate our understanding of the interplay between effectors and host proteins and coevolution stemming from this interaction.

    MATERIALS AND METHODS

    Secretome prediction.

    The 12,755 protein sequences of M. oryzae strain 70-15 were downloaded from Ensembl (Dean et al. 2005). We utilized the neural network of SignalP v3.0, one of the most sensitive to defect fungal effectors, to identify putative secreted proteins (Bendtsen et al. 2004; Sperschneider et al. 2015). SignalP initially predicted 2,348 secreted proteins with a D-score ≥ 0.43. 119 possible false positives were removed because their predicted signal peptides overlapped with PFAM domains annotated with InterProScan v5.30-69.0 over ≥10 amino acids (Quevillon et al. 2005). Following the predicted cleavage site based on the Y-score from SignalP, mature protein sequences were generated. We used TMHMM v2.0 to eliminate 344 putative integral membrane proteins with one or more transmembrane helices in the mature proteins (Krogh et al. 2001). Finally, 25 mature proteins with ≥1,000 amino acids were removed, and only the longest isoform was selected. In total, 1,854 proteins remained for the analysis.

    Gene prediction for the Magnaporthales.

    We obtained 248 genome assemblies for the organisms in the order Magnaporthales from the NCBI. Because the primary purpose of the genome annotation was to collect homologs of the predicted secreted proteins of M. oryzae, we mainly relied on Liftoff v1.3.0 to transfer existing annotations of M. oryzae 70-15 to a target genome (Shumate and Salzberg 2021). We then utilized BRAKER v2.1.5 to predict additional gene models uncaptured by Liftoff (Brůna et al. 2020, 2021; Buchfink et al. 2015; Gotoh et al. 2014; Iwata and Gotoh 2012; Lomsadze et al. 2005; Stanke et al. 2006). For 222 M. oryzae genomes, for instance, the reference annotation of M. oryzae 70-15 was initially mapped into each target genome with Minimap v2.17-r974-dirty (Li 2016). The gene models with ≥98% coverage and ≥40% sequence identity were annotated by Liftoff and collected if no premature stop codon existed (−a 0.98 −s 0.4 −sc 0.4 −copies −exclude_partial). Prior to running BRAKER, the repeat library for the reference genome was generated by RepeatModeler v2.0.1 (Flynn et al. 2020). RepeatMasker v4.1.0 was used to soft-mask each target genome (Smit et al. 2013). The fungal proteomes collected from OrthoDB v10 and the available genome annotations for the order Magnaporthales were used as protein evidence for BRAKER (Kim et al. 2019; Kriventseva et al. 2019). Among the final annotations, we selected all the gene models not overlapping with those previously predicted by Liftoff. For other species in Magnaporthales, we used the annotation sets of their closely related species as the reference and followed the annotation pipeline.

    Generation of multiple sequence alignments.

    To generate an MSA of a target protein and its homologs, DeepMSA was used (C. Zhang et al. 2020). DeepMSA iteratively searches the Uniclust30, Uniref90, and Metaclust databases to construct an MSA (Mirdita et al. 2017; Steinegger and Söding 2018; Suzek et al. 2015). We supplemented the Metaclust database with additional fungal protein sequences to facilitate collecting diverse fungal homologs. The fungal datasets consisted of 1,689 annotated proteomes from the Joint Genome Institute and the 248 Magnaporthales annotations we produced (Grigoriev et al. 2014). The two datasets were concatenated, and gene models with premature stop codons were removed. The 25,077,589 gene models were clustered with Linclust from MMseqs2 to reduce redundancy (–min-seq-id 0.95 –cov-mode 1 −c 0.99) (Hauser et al. 2016; Mirdita et al. 2017). The resulting 17,679,966 representative gene models were appended to the Metaclust database. DeepMSA was run on these databases to generate an MSA for each secreted protein without the final filtering step.

    Protein structure modeling and final structure selection.

    The protein structures of all the candidate secreted proteins were predicted with homology modeling by I-TASSER v5.1 and template-free modeling by TrRosetta (Yang et al. 2015, 2020). For I-TASSER, the template library was downloaded from the I-TASSER server (24 August 2020). I-TASSER was run in the light mode with the MSA from the previous step converted into the PSI-BLAST-readable format using the a3m2mtx script in the DeepMSA package (−a3m −neff 7). For the benchmarking set, any homologous templates with ≥70% sequence identity were excluded (−homoflag benchmark −idcut 0.7). We used the estimated TM score of the first model as a measure of precision, and the predicted structures with the mean TM score >0.5 were considered reliable.

    For TrRosetta, we selectively filtered the MSAs, limiting their sizes to 6,000 sequences to prevent unnecessarily deep, large MSAs. If the MSA from DeepMSA was larger than the limit, sequences with ≥75% of the query coverage were first sampled from high to low sequence identity over the aligned regions. If necessary, sequences with ≥50% of the query coverage were additionally sampled. TrRosetta was run with the filtered MSAs to predict interresidue orientations and distances, and we generated five full-atom models with PyRosetta (Chaudhury et al. 2010). The model with the lowest energy score was chosen as a final model. We used the average probability of the top L predicted long- plus medium-range contacts (|ij| > 12) estimated by the top_prob.py script in the TrRosetta suite as precision measurement. A structure with the probability >0.5 was considered reliable. Among the outputs from I-TASSER and TrRosetta, the structure with higher estimated precision was selected as a final model. We used PyMOL v2.4.0 (Schrödinger, LLC) to visualize the structures.

    Functional and structural annotations of the secretome.

    We employed InterProScan v5.30-69.0 to identify homologous PFAM v31.0 entries (El-Gebali et al. 2019). PaperBLAST and PHI-base were used to search for M. oryzae effectors and their homologs in other pathogens (Price and Arkin 2017; Urban et al. 2020). The secondary structures and disordered residues were predicted with PSIPRED v4.0 and DISOPRED v3.16 (McGuffin et al. 2000; Ward et al. 2004). The predicted structures were compared with RUPEE against the SCOPe v2.07 and CATH v4.2.0 databases as well as the PDB chains (TOP_ALIGNED, FULL_LENGTH) (Ayoub and Lee 2019; Berman et al. 2000; Fox et al. 2014; Sillitoe et al. 2019).

    Network analysis using homology and structures.

    We identified homology using sequence-to-sequence search with BLASTP v2.7.1, profile-to-sequence search with HHblits v3.1.0, and profile-to-profile search with HHsearch v3.1.0 (Camacho et al. 2009; Remmert et al. 2012). The MSAs generated in the previous step represented the profiles. In all cases, only the pairs with E-value < 10−4 and bidirectional coverage >50% were regarded as significant. The structural similarities were compared with TM-align by superposing each pair of structures predicted with the estimated precision >0.5 by TrRosetta or >0.55 by I-TASSER. We increased the cut-off for I-TASSER because the structures with the estimated precision of approximately 0.5 introduced spurious clustering, and the sources of their selected templates were often distantly related organisms like humans. We also did not use any proteins with ≥35% disordered residues for structural comparisons, because these proteins tended to be modeled by I-TASSER with similar homologous templates and, thus, introduced biases in the similarity search. The two compared structures were considered similar if the TM score was >0.5 for both structures or >0.6 and >0.4 for each. We used the igraph package v1.2.4.1 in R v3.6.1 for the network analyses and to reveal the cluster membership of secreted proteins (Csardi and Nepusz 2006; Ihaka and Gentleman 1996).

    Data availability.

    The genome annotations, MSAs. and structures generated for the secreted proteins, and the data used for network analyses are available to download in Zenodo.

    ACKNOWLEDGMENTS

    We thank B. Staskawicz for access to the computational resources; J. Yang for his advice on structure prediction; C. L. Shaw, D. M Prigozhin, E. L Baggs, and P. Joubert for the critical review of the manuscript; S. Kamoun, M. Banfield, and P. Gladieux for their thoughtful comments on the manuscript; and anonymous reviewers for their constructive comments.

    AUTHOR-RECOMMENDED INTERNET RESOURCES

    Schrödinger, LLC: https://pymol.org/2

    Zenodo: https://zenodo.org/record/4456015

    The author(s) declare no conflict of interest.

    LITERATURE CITED

    Funding: This research relied on the Savio computational cluster resource provided by the Berkeley Research Computing program at the University of California, Berkeley. K. Seong is supported by the Berkeley Fellowship. K. V. Krasileva is supported by the Gordon and Betty Moore Foundation (grant number 8802) as well as by the joint funding from the Foundation for Food and Agriculture and 2Blades (CA19-SS-0000000046) and the Innovative Genomics Institute.

    The author(s) declare no conflict of interest.