TECHNICAL ADVANCE (Open Access)

effectR: An Expandable R Package to Predict Candidate RxLR and CRN Effectors in Oomycetes Using Motif Searches

    Authors and Affiliations
    • Javier F. Tabima (1)
    • Niklaus J. Grünwald (2)
    1. Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, U.S.A.
    2. Horticultural Crops Research Laboratory, USDA-ARS, Corvallis, OR 97330, U.S.A.

    Published Online: https://doi.org/10.1094/MPMI-10-18-0279-TA

    Abstract

    Effectors are small, secreted proteins that facilitate infection of host plants by all major groups of plant pathogens. Effector protein identification in oomycetes relies on identification of open reading frames with certain amino acid motifs, along with additional minor criteria. To date, identification of effectors relies on custom scripts to identify motifs in candidate open reading frames. Here, we developed the R package effectR, which provides a convenient tool for rapid prediction of effectors in oomycete genomes, or with custom scripts for any genome, in a reproducible way. The effectR package relies on a combination of regular expression statements and hidden Markov model approaches to predict candidate RxLR and crinkler effectors. Other custom motifs for novel effectors can easily be implemented and added to package updates. The effectR package has been validated with published oomycete genomes. This package provides a convenient tool for wet lab researchers interested in reproducible identification of candidate effectors in oomycete genomes.

    Secreted effector proteins have been reported for all major groups of plant pathogens, including bacteria, fungi, oomycetes, nematodes, and viruses (Jones and Dangl 2006; Toruño et al. 2016). Effector proteins are defined as secreted proteins that manipulate plant processes to the advantage of the parasite, in order to promote infection and generate disease (Petre and Kamoun 2014). Effector proteins modulate and interfere with the normal physiology of the plant host in order to facilitate disease and infection (Jiang and Tyler 2012; Jones and Dangl 2006; Kamoun 2007). Effector proteins are common in bacterial plant pathogens. Up to 29 effector proteins are injected by the bacterial pathogen Pseudomonas syringae into the host plant cells via the type III secretion system (Chang et al. 2005). Effector discovery has predicted over 50 candidate effector proteins for the species Heterodera glycines, a plant-parasitic nematode that affects soybean crops across the world (Wang et al. 2001). Multiple effector proteins have been reported for a plethora of plant-pathogenic fungal and oomycete species (Selin et al. 2016).

    The discovery of secreted effector proteins in these organisms was a major breakthrough, as these proteins are directly involved in pathogenicity. These effector proteins that disrupt the normal physiology of the plant can be recognized by proteins encoded by specific resistance (R) genes (Jones and Dangl 2006). The recognition of an effector protein by an R gene product leads to effector-triggered immunity, which generates a signaling cascade that results in programmed cell death via a hypersensitive response. This programmed cell death slows the growth of the pathogen and avoids the proliferation of disease into neighboring plant cells (Jones and Dangl 2006).

    Plant-pathogenic oomycetes or water molds are a group of highly devastating plant pathogens. The diseases caused by these organisms can affect hundreds of plant species and have led to high mortality of trees in forest ecosystems, losses of millions of dollars to agriculture, and have been implicated in the Irish potato famine (Fry 2008; Grünwald et al. 2008). Widely recognized plant-pathogen genera found in the water molds include Phytophthora, Pythium, Albugo, and Peronospora (Erwin and Ribeiro 1996).

    Oomycetes contain a high number of predicted effector proteins (Birch et al. 2006; Kamoun 2006, 2007; Pais et al. 2013). These proteins are secreted by the pathogen haustorium and are translocated into the host plant cell, in which they are transported to different organelles to disrupt physiological functions and facilitate disease (Anderson et al. 2015; Birch et al. 2006; Kamoun 2006). Recent advances in molecular and computational biology have provided new information about the amino acid sequence of these oomycete effectors. Conserved motifs were identified in two key oomycete effector protein families, namely, the RxLR-dEER motif for RxLR effectors and the LFLAK-HVLV motif for crinkler (CRN) effectors. These canonical motifs are located near the N terminus of an amino acid sequence following the signal peptide sequence. To date, hundreds of these RxLR and CRN effectors per species have been predicted for important plant-pathogenic oomycetes from the genus Phytophthora, such as Phytophthora infestans (approximately 580 RxLR and 196 CRN effector genes [Haas et al. 2009]), P. sojae (approximately 470 RxLR and 100 CRN effector genes [Tyler et al. 2006]), and P. ramorum (approximately 260 RxLR and 19 CRN effector genes [Tyler et al. 2006]). In addition, the downy mildew pathogen Hyaloperonospora arabidopsidis contains up to 134 predicted RxLR-coding genes in its genome (Baxter et al. 2010). However, not all oomycete species contain evidence of RxLR-coding genes in their genomes. No RxLR effector proteins have, to date, been found in the genera Pythium, Saprolegnia, or Albugo (Lévesque et al. 2010; Links et al. 2011; Win et al. 2012).

    The process of identifying candidate effector-coding genes in sequenced genomes has, to date, been ad hoc and not reproducible, because each user can choose slightly varying thresholds. The most common bioinformatic process used for predicting effectors was described by Haas et al. (2009), in which the authors used a combination of pattern matching via regular expressions (REGEX) based on the canonical RxLR and LFLAK motifs, followed by homology searches using Markov models. This method was used for the prediction of effector proteins in Phytophthora infestans, resulting in a total of 562 predicted RxLR effectors and 196 CRN effectors (Haas et al. 2009). The approach used by Haas et al. (2009) has been modified to include predicted effector sequences from other species, in order to improve the homology search (Govers and Bouwmeester 2008). Several independent tools are available to perform different steps to recognize effectors (Dalio et al. 2018; Sonah et al. 2016), but, to date, no bioinformatic tool allows the prediction of oomycete effectors in a simple and fast manner.

    We developed the expandable effectR R package designed to predict effector proteins in a fast and reproducible way. This package uses both REGEX and homology search mechanisms based on hidden Markov models (HMM) to predict candidate effector proteins, using gene models obtained from whole-genome sequences. effectR provides functions in the statistical and computer language R (R Core Team 2018), which allows rapid identification and evaluation of candidate effector proteins for any oomycete species with a sequenced genome. The effectR package has been developed to facilitate the prediction of effector proteins for researchers with limited expertise in computational biology. We tested the effectR package using available oomycete genomes and successfully validated effector prediction for the species Phytophthora infestans. The effectR package is modular and can be expanded to predict effector proteins with different canonical motifs specified by the user. In fact, we encourage contributions to our GitHub repository of new functionality for effector prediction for other organismal groups. We included an example of a custom implementation for bacterial proteins that contain PAAR repeats. The effectR package is released on CRAN and a brief user tutorial is provided on GitHub.

    RESULTS AND DISCUSSION

    We developed the R package effectR that provides a convenient tool for rapid prediction of effectors in oomycete genomes or with custom scripts for any genome in a reproducible way. The effectR package relies on a combination of REGEX statements and HMM approaches to predict candidate RxLR and CRN effectors using several function calls (Table 1). Three steps are required to call effectors: i) A REGEX search to select amino acid sequences translated from all the open reading frames (ORFs) in a genome that match the motifs of interest, ii) a second, broad search of the amino acid sequences that match a Markov chain profile created with the amino acids that matched the motifs of interest in the REGEX step, and iii) a posthoc set of tools combining the results from the two previous steps after filtering for redundancies (Fig. 1). We are using the oomycete P. infestans as an example application.

    Table 1. Functions found in effectR and their descriptions

    Fig. 1. Workflow representing the steps used in the effectR R package. The package uses an amino acid FASTA file as input that contains either all translated gene models from a sequenced genome or the six-frame translation open reading frames (ORF) of a genome to obtain effectors. The genomic ORFs represent six-frame translations of a genome, using the external software getorf from EMBOSS. After reading the amino acid FASTA input file into R, the effectR package predicts all effector proteins using three steps: i) Searching MOTIF patterns via regular expression (REGEX), ii) searching for additional effectors using a homology search based on hidden Markov models (HMM) using hmmer, and iii) a posthoc analysis of the candidate effectors to visualize and summarize the motifs of the candidates obtained by the package. All steps are modular and can be modified for discovery of any kind of effector.

    Step 1: Obtaining candidate RxLR and CRN effectors using REGEX.

    The following example uses the function regex.search() with the test dataset test_infestans.fasta. This dataset contains a subset of 28 sequences, including six reference RxLR effectors (PexRD36, PexRD1, ipi01/Avrblb1, Avr1, Avr4, and Avr3a) (Haas et al. 2009), nine randomly selected predicted CRN effector proteins identified by Haas et al. (2009), and 14 in-house translated ORFs from the P. infestans reference genome sequence, of which eight translated ORFs contain RxLR and EER motifs. This test dataset is included in the effectR package and can be loaded as follows:

    # Loading the effectR package in R

    library("effectR")

    # Using the read.fasta function of the seqinr package to import the translated ORF FASTA file

    fasta.file <- system.file("extdata", "test_infestans.fasta", package = "effectR")

    ORF <- seqinr::read.fasta(fasta.file)

    # Verifying that the length of the loaded FASTA file corresponds to the 28 translated ORFs from P. infestans

    length(ORF)

    ## [1] 28

    # Executing step 1: Prediction of RxLR effectors from the ORF object. Results are saved in the rxlr.cand object

    rxlr.cand <- regex.search(ORF,motif = "RxLR")

    # What is the number of RxLR effectors predicted by step 1?

    length(rxlr.cand)

    ## [1] 15

    This code snippet illustrates that regex.search() was able to predict all 14 expected RxLR candidates from the test dataset, along with one additional candidate, out of the 28 reference effectors and translated ORFs from the included test_infestans.fasta dataset. In addition, the effectR package can also predict CRN effectors or predict other families of interest based on a custom REGEX provided by the end user. Changing the motif = "RxLR" option to "CRN" or "custom" will allow the prediction of these other motifs of interest without reloading the ORF dataset, thus significantly reducing processing time. Using the "CRN" option successfully predicts the nine CRN candidate effector proteins from Haas et al. (2009) included in the test dataset. Users are also encouraged to submit new functions to the effectR GitHub repository for prediction of other motifs for any organismal group (discussed below; see Supplementary Text S1 for an example).

    Step 2: Using HMMs to predict additional candidate effectors.

    The following code example uses the function hmm.search() on the test_infestans.fasta dataset after obtaining the REGEX RxLR candidates.

    # Loading the effectR package in R

    library("effectR")

    # Using the read.fasta function of the seqinr package to import the translated ORF FASTA file

    fasta.file <- system.file("extdata", "test_infestans.fasta", package = "effectR")

    ORF <- seqinr::read.fasta(fasta.file)

    # Step 1 prediction

    REGEX <- regex.search(ORF, motif = "RxLR")

    # Expanding the search of RxLR effectors using HMM searches (step 2). All candidate effectors predicted by step 2 will be saved in the candidate.rxlr object

    candidate.rxlr <- hmm.search(original.seq = fasta.file, regex.seq = REGEX)

    ## Starting MAFFT alignment.

    ## ---

    ## Executing MAFFT

    ## Please be patient

    ## MAFFT alignment finished!

    ## Starting HMM

    ## ---

    ## Creating HMM profile

    [additional console output was removed for brevity]

    ## HMM search done.

    ## ---

    ##

    ## Total of sequences found in REGEX: 15

    ## Total of sequences found in HMM: 17

    ## Total of redundant hits: 15

    ## Number of effector candidates: 17

    The previous code snippet shows that hmm.search() was able to identify two new RxLR candidates from the test_infestans.fasta dataset, increasing our total number of candidate RxLR effectors to 17. The hmm.search() function also detects candidate effectors previously predicted by the REGEX step. In the previous example, all 15 candidates from the REGEX step were detected again in the HMM step and are reported as redundant hits. The combined 17 candidate effectors based on the REGEX and HMM searches can now be evaluated in step 3.
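    The counts reported in the console output can be reproduced with plain set operations. The following is a minimal sketch in base R; the sequence identifiers are hypothetical placeholders, and the code illustrates only the bookkeeping, not how hmm.search() is implemented internally.

```r
# Hypothetical sequence IDs standing in for the hits of each step
regex_hits <- paste0("seq", 1:15)   # 15 candidates from the REGEX step
hmm_hits   <- paste0("seq", 1:17)   # 17 hits from the HMM step, 15 shared with REGEX

# Hits found by both steps, and the non-redundant set of candidates
redundant  <- intersect(regex_hits, hmm_hits)
candidates <- union(regex_hits, hmm_hits)

counts <- c(regex = length(regex_hits), hmm = length(hmm_hits),
            redundant = length(redundant), candidates = length(candidates))
counts
```

    With these placeholder identifiers, the four counts are 15, 17, 15, and 17, matching the summary printed by the example above.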

    Step 3: Post hoc tools for curation of candidate effector genes.

    In the following steps, effectR provides functions to summarize and visualize candidate effector genes.

    Step 3a: Summarizing the predicted effector proteins and determining the number and position of each motif in the effector sequences.

    The effectR package includes the effector.summary() function that combines the results from both REGEX and HMM searches into a list of unique predicted effectors that can be exported into multiple sequence format files. The effector.summary() function returns a table of the number of motifs per sequence and the position of the first residue of each motif of interest for all candidate effectors. In addition, the table includes a summary motif column that will define the candidate sequence as complete (includes both motifs of interest), only X motif (if the sequence includes only one of the two motifs of interest), or no motifs (no motifs of interest found).

    # Loading the effectR package in R

    library("effectR")

    # Using the read.fasta function of the seqinr package to import the translated ORF FASTA file

    fasta.file <- system.file("extdata", "test_infestans.fasta", package = "effectR")

    ORF <- seqinr::read.fasta(fasta.file)

    # Step 1 prediction

    REGEX <- regex.search(ORF, motif = "RxLR")

    # Expanding the search of RxLR effectors using HMM searches (step 2). All candidate effectors predicted by step 2 will be saved in the candidate.rxlr object

    candidate.rxlr <- hmm.search(original.seq = fasta.file, regex.seq = REGEX)

    # Summarizing the predictions from step 1 and step 2 using the effector.summary() function (Step 3a).

    # The summary of non-redundant RxLR predicted effectors will be stored in the RxLR.effectors$consensus.sequences object

    # The table of motif number and position will be stored in the RxLR.effectors$motif.table object

    RxLR.effectors <- effector.summary(candidate.rxlr, motif = "RxLR")

    # What is the number of non-redundant RxLR effectors predicted by step 1 and step 2?

    length(RxLR.effectors$consensus.sequences)

    ## [1] 17

    # Table of motif number and position for the first five non-redundant, predicted RxLR effector proteins

    head(RxLR.effectors$motif.table, n = 5)

    ## Sequence ID                    RxLR number  RxLR position  EER number  EER position  MOTIF     Length
    ## PITG_21388_ipi01/Avrblb1_RxLR  2            51,122         1           70            Complete  152
    ## PITG_23132_PexRD36_RxLR        1            28             1           55            Complete  76
    ## PITG_15287_PexRD1_RxLR         1            50             1           75            Complete  213
    ## PITG_07387_Avr4_RxLR           1            42             1           53            Complete  287
    ## PITG_14371_Avr3a_RxLR          1            44             1           57            Complete  147
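
    To make the columns of this table concrete, the following minimal sketch in base R counts and locates the RxLR and EER motifs in a single toy sequence. The helper names motif_positions and summarize_motifs and the toy sequence are hypothetical and invented for illustration; this is not the implementation used by effector.summary().

```r
# Hypothetical helper: start positions of every match of a pattern
motif_positions <- function(sequence, pattern) {
  hits <- gregexpr(pattern, sequence, perl = TRUE)[[1]]
  if (hits[1] == -1) integer(0) else as.integer(hits)
}

# Hypothetical helper: one summary row for a single sequence
summarize_motifs <- function(id, sequence) {
  rxlr <- motif_positions(sequence, "R\\wLR")
  eer  <- motif_positions(sequence, "EER")
  status <- if (length(rxlr) > 0 && length(eer) > 0) {
    "Complete"
  } else if (length(rxlr) > 0) {
    "Only RxLR motif"
  } else if (length(eer) > 0) {
    "Only EER motif"
  } else {
    "No motifs"
  }
  data.frame(
    id = id,
    rxlr_number = length(rxlr),
    rxlr_position = paste(rxlr, collapse = ","),
    eer_number = length(eer),
    eer_position = paste(eer, collapse = ","),
    motif = status,
    length = nchar(sequence),
    stringsAsFactors = FALSE
  )
}

# Toy sequence, invented for illustration only (not a real effector)
toy <- paste0(strrep("M", 20), "RSLR", strrep("A", 10), "EER", strrep("K", 5))
toy_row <- summarize_motifs("toy_seq", toy)
toy_row
```

    For the toy sequence, the RxLR motif starts at residue 21 and the EER motif at residue 35, so the sequence is labeled Complete.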

    Step 3b: Plotting the HMM profile.

    To visualize the position of the motifs of interest, we include the function hmm.logo(), which plots the HMM profile obtained from hmmbuild as a logo plot (Schneider and Stephens 1990), using the ggplot2 package (Wickham 2016). The hmm.logo() function reads the hmmer profile table and extracts a bit score for each amino acid at each position. Then, effectR plots the bit score on the y axis and the amino acid position on the x axis, and overlays the amino acid with the highest bit score on the plot in order to represent the frequency of each amino acid found at every position of the consensus sequence from the multiple sequence alignment (MSA) step. Here is an example of the hmm.logo() function for the P. infestans example dataset:

    # Loading the effectR package in R

    library("effectR")

    # Using the read.fasta function of the seqinr package to import the translated ORF FASTA file

    fasta.file <- system.file("extdata", "test_infestans.fasta", package = "effectR")

    ORF <- seqinr::read.fasta(fasta.file)

    # Step 1 prediction

    REGEX <- regex.search(ORF, motif = "RxLR")

    # Expanding the search of RxLR effectors using HMM searches (step 2). All candidate effectors predicted by step 2 will be saved in the candidate.rxlr object

    candidate.rxlr <- hmm.search(original.seq = fasta.file, regex.seq = REGEX)

    # Plotting the HMM profile created in the hmm.search() function using the hmm.logo() function

    hmm.logo(candidate.rxlr$HMM_Table)

    ## [R graphical output: logo plot, not shown here]
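
    The idea behind a logo plot can be sketched without hmmer. The following minimal example in base R computes the information content in bits of each column of a toy alignment, in the spirit of Schneider and Stephens (1990), and reports the most frequent residue per position. It is an assumption-laden illustration: the alignment is invented, the bits are computed from observed frequencies rather than read from an hmmer profile table as hmm.logo() does, and no small-sample correction is applied.

```r
# Toy alignment, invented for illustration (equal length, no gaps)
aln <- c("RSLRAEER", "RQLRTEER", "RSLRSEER", "RALRAEER")

# One row per sequence, one column per alignment position
mat <- do.call(rbind, strsplit(aln, ""))

# Information content (bits) of one column for a 20-letter alphabet:
# maximum entropy log2(20) minus the observed entropy of the column
column_bits <- function(column) {
  freq <- table(column) / length(column)
  entropy <- -sum(freq * log2(freq))
  log2(20) - entropy
}

bits <- apply(mat, 2, column_bits)
top_residue <- apply(mat, 2, function(column) names(which.max(table(column))))

logo_table <- data.frame(position = seq_along(bits), top_residue,
                         bits = round(bits, 2))
logo_table
```

    Fully conserved columns, such as the first R, reach the maximum of log2(20) bits (about 4.32), whereas variable columns score lower and would be drawn with shorter bars.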

    Validation.

    To test the developed package, we used a set of four sequenced genomes of three oomycete species and one fungal species to predict RxLR effectors. We used the genomes of the oomycete species Phytophthora infestans T30-4 version 1 (Haas et al. 2009), Pythium ultimum DAOM BR144 version 1 (Lévesque et al. 2010), and Albugo candida Ac2VRR version 1 (Links et al. 2011). P. infestans has more than 580 predicted RxLR effectors (Haas et al. 2009), while neither Pythium ultimum nor A. candida has any reported RxLR motifs. Instead, A. candida has 26 predicted gene models with a variant RxLR motif called Ac-RXL (Links et al. 2011). The genome of the ascomycete Fusarium oxysporum f. sp. lycopersici (Ma et al. 2010) was used as a negative control or outgroup with no expectation of finding RxLR motifs.

    The assembled reference genomes of each of the four species were downloaded from Fungi-DB (Stajich et al. 2012). For each genome assembly, a six-frame translation of all ORFs was predicted using getorf from EMBOSS (Rice et al. 2000). Translated ORFs with a length of less than 100 nucleotides were discarded, as none of the reported functional effector proteins in oomycete species are shorter than this size. We predicted all RxLR effector proteins for each translated sequence from the genome assembly in the effectR package. Additionally, prediction of the signal peptide using all predicted candidate effectors was performed in SignalP 3.0 (Bendtsen et al. 2004), as recommended by Sperschneider et al. (2015). A threshold D score > 0.8 was used. Any candidate effector with a predicted signal peptide was considered a high-confidence candidate effector. All high-confidence candidate effector proteins from P. infestans were compared with the published list of effectors from Haas et al. (2009). We created a custom blast database using the RxLR effector proteins and identified matches between the amino acid sequences of the effectR predictions against the RxLR candidate database in Blastp (Altschul et al. 1990).
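
    The signal peptide filter described above amounts to a threshold on a per-candidate score. The following is a minimal sketch in base R, assuming the SignalP results have already been collected into a data frame; the column names, candidate identifiers, and D scores are hypothetical and invented for illustration.

```r
# Hypothetical results table: one row per candidate effector
candidates <- data.frame(
  id = c("cand_1", "cand_2", "cand_3", "cand_4"),
  d_score = c(0.93, 0.42, 0.85, 0.80),   # invented SignalP D scores
  stringsAsFactors = FALSE
)

# Threshold from the text: D score strictly greater than 0.8
d_threshold <- 0.8
candidates$high_confidence <- candidates$d_score > d_threshold

high_confidence_ids <- candidates[candidates$high_confidence, "id"]
high_confidence_ids
```

    Because the comparison is strict, a candidate with a D score of exactly 0.8 is not retained in this sketch.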

    As a first step, we identified a total of 174 proteins for Pythium ultimum, 47 for A. candida, and three for F. oxysporum f. sp. lycopersici that showed evidence for one or both the RxLR and EER motifs. However, in the second step, none of these proteins were considered high-confidence candidate effectors, as they lacked a signal peptide cleavage site (Table 2). These results are consistent with previous reports of finding no RxLR effector proteins in genomes of these oomycete and fungal species. The Ac-RXL effector proteins of A. candida have an independent origin from the RxLR effector proteins found in Phytophthora species (Links et al. 2011), and no functional RxLR effector proteins have been reported for F. oxysporum f. sp. lycopersici (Ellis et al. 2009; Ma et al. 2010).

    Table 2. Predicted RxLR effector proteins in effectR for the open reading frame translations (ORF) of the assembled genomes of Phytophthora infestans, Pythium ultimum, Albugo candida, and Fusarium oxysporum f. sp. lycopersici

    For P. infestans, we predicted 395 candidate effector proteins in the REGEX step and 827 candidate effector proteins in the HMM step, for a total of 900 nonredundant RxLR effector proteins (a number similar to the 831 RxLR effectors predicted by Haas et al. [2009], using only a combination of REGEX and HMM methods). In contrast to the predictions in the previously screened fungal and oomycete species, a high number of high-confidence candidate effectors (631 high-confidence candidate effectors with evidence of a signal peptide cleavage site) were predicted for the translated sequences of P. infestans (Table 2). The logo plot shows the prevalence of the RxLR-EER motifs obtained by the REGEX step between residues 49 and 67 (Fig. 2). The high number of effector proteins predicted in the HMM step is a result of the low thresholds used by our package in order to obtain as many candidate effectors as possible. Of the 631 high-confidence candidate effector proteins, we predicted 453 RxLR effectors with a Blastp match of more than 90% identity with the Haas et al. (2009) RxLR predicted effector proteins. The genomic position of each of these 453 predicted proteins also corresponded with the position of the homologous effector protein reported by Haas et al. (2009), indicating that effectR successfully predicted previously known RxLR effector proteins from P. infestans. The functionally validated effector proteins PexRD36, PexRD1, ipi01/Avrblb1, Avr4, and Avr3a were correctly predicted by the effectR package, providing a successful positive control. The prediction of these 631 effector proteins was performed in under 20 min on a laptop computer using two cores and less than 1 GB of RAM. This is a modest processing time for a FASTA file with more than 200,000 translated ORFs.

    Fig. 2. Sequence logo plot of the candidate RxLRs obtained for Phytophthora infestans in the hidden Markov model profile built from the sequences obtained in the REGEX step (step 1). The size of the letters is proportional to the height of the bar and reflects the relative frequency of the particular amino acid. The plot also shows the high relative frequency of the RxLR-EER motifs around the 50th residue of the amino acid sequence.

    A total of 99 of the RxLR effectors listed by Haas et al. (2009) were not predicted by effectR. Of these 99 RxLR effector proteins, 18 proteins have neither an RxLR nor an EER domain, 24 proteins have no evidence of an EER domain, and the remaining 57 proteins have the RxLR or EER domain at a greater upstream or downstream distance from the residue position expected by the REGEX. These results are expected, as Haas et al. (2009) used a combination of Blastp searches, TribeMCL clustering, and homology searches from reference effectors in addition to the REGEX + HMM searches, leading to a higher number of candidate effector proteins than expected just from the searches implemented in this package. effectR also predicted 79 candidate effector proteins not present in Haas et al. (2009) that include evidence of a signal peptide cleavage site (Table 3). Of these 79 newly predicted effector proteins, 40 proteins contained both RxLR and EER domains, 14 only contained the RxLR motif, 18 only had the EER motif, and seven proteins did not include any of the motifs of interest (Table 3).

    Table 3. New candidate RxLR effector proteins (n = 79) from Phytophthora infestans predicted by the effectR package

    In addition to the prediction of RxLR effector proteins, we predicted the CRN effector proteins for the ORF translations of the species used in the proof of concept (Supplementary Table 1). The results of the CRN prediction are consistent with the expectation of a high number of CRN effector proteins present in the genomes of oomycete species, with 21 candidate CRN effectors predicted for Pythium ultimum and 214 CRN effectors predicted for P. infestans. Of the 214 CRN effectors of P. infestans predicted by effectR, we found 159 CRN effector proteins with a Blastp match of more than 90% identity with the Haas et al. (2009) CRN predicted effector proteins. The number of predicted effectors for these two species is similar to the reported number of CRN effectors (196 CRN effectors for P. infestans and 26 CRN effectors for Pythium ultimum) (Lévesque et al. 2010). However, only four CRN effectors in A. candida were predicted. These four predicted CRN proteins for A. candida only have the LxLAK motif, as reported by Links et al. (2011), and are not considered canonical CRN effectors (Links et al. 2011). Finally, only three CRN effectors were predicted for the fungal species Fusarium oxysporum f. sp. lycopersici. No CRN effectors have been reported for this fungal pathogen, and the CRN effectors predicted by effectR only contain the LxLAK motif, indicating that these proteins are not canonical CRN effectors, as in A. candida.

    Custom scripts for other effectors.

    effectR can easily be modified to identify other candidate effectors. For example, regex.search() can be modified to call bacterial proteins that contain PAAR repeats. These proteins are associated with the VgrG-like spikes found in the type VI secretion system of bacteria and have been shown to be essential in target cell killing by the bacterial species Vibrio cholerae and Acinetobacter baylyi (Shneider et al. 2013). The PAAR proteins have a homonymous amino acid sequence motif (PAAR) with one or more repeats. We created an example showing how effectR can predict potential candidate proteins in the proteome of the reference strain ATCC 39315 of V. cholerae (Heidelberg et al. 2000) by using the PAAR motif as part of the REGEX search. Our results indicate that effectR successfully predicted 19 candidate PAAR proteins, two of which are homologous to the previously reported PAAR proteins. These two predicted proteins differ by only one amino acid when compared with PAAR homolog proteins previously described by Shneider et al. (2013) from other strains of V. cholerae (Supplementary Fig. 1). These results show that effectR can correctly predict proteins with motifs other than those of the canonical oomycete effectors and also illustrate the importance of manual curation based on homologous proteins to avoid the detection of false positives.

    Conclusions.

    Our effectR package provides a novel tool for reproducible prediction of candidate effectors in oomycete genomes. The package is modular, in which every step can be modified to predict candidate effector proteins. Custom motifs for any new effector family can easily be added via custom scripts or by contributing to the GitHub repository. The package has been successfully tested for translated ORFs predicted from genomes of oomycete plant pathogens and can be used for a quick survey of the effector arsenal involved in any plant-pathogen interactions for any species of interest in which effector motifs are known. However, effectR only identifies candidate effectors, and further functional validation in the wet lab is needed.

    MATERIALS AND METHODS

    The effectR package.

    The effectR package is written in the R computer language (R Core Team 2018). effectR allows prediction of oomycete effector proteins. The package requires as input a FASTA file, ideally containing all six-frame amino acid translations for each ORF of the sequenced genome of interest or, at a minimum, all translated gene models. The package then returns the total number of predicted effectors, the amino acid sequence for each of the predicted effectors, the number and position of the motifs of interest for each predicted effector, and the Markov chain profile table, which can be conveniently visualized.

    External programs required to execute effectR.

    The effectR package requires installation of additional programs and the recommended use of external software to assure its correct functionality (reference resources provided on GitHub for instructions). effectR can use amino acid translations of the gene models predicted in a genome of interest, but we recommend the use of six-frame translations of all ORFs from a genome assembly. Including all six translations for each ORF in a genome will allow prediction of more candidate effectors present in an organism of interest. To generate six-frame ORF translations, we use getorf from EMBOSS (Rice et al. 2000). getorf can be run locally on the command line or online via the EMBOSS explorer (Rice et al. 2000). Other additional programs required by effectR are the MAFFT v7 MSA tool (Katoh and Standley 2013) and HMMER 3.1b2 (Eddy 2011). MAFFT performs a sequence alignment of the candidate effectors in order to build a reference profile to be scanned by HMMER. HMMER executes the searches based on HMMs (Fig. 1, step 2). These two programs are external to R and must be installed on the same machine as effectR. The effectR package includes functions to detect if both HMMER and MAFFT are available in the default user path or the user can specify the location of each of the binaries when executing effectR.
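
    Whether the external programs are visible from R can be checked with base R before running step 2. The following is a minimal sketch; the helper name find_programs is hypothetical and is not one of the detection functions shipped with effectR. It assumes the binaries use their usual names (mafft for MAFFT, hmmbuild and hmmsearch for HMMER).

```r
# Hypothetical helper: report which external programs are on the PATH
find_programs <- function(programs = c("mafft", "hmmbuild", "hmmsearch")) {
  paths <- Sys.which(programs)   # empty string when a program is not found
  data.frame(program = programs,
             path = unname(paths),
             found = unname(nzchar(paths)),
             stringsAsFactors = FALSE,
             row.names = NULL)
}

program_status <- find_programs()
program_status
```

    A row with found equal to FALSE indicates that the corresponding binary is not on the default path, in which case its location would need to be supplied explicitly.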

    While these additional programs are indispensable to complete step 2 of the effectR package, step 1 and step 3 can be executed independently of them. The user can execute step 1 within effectR, use the output from step 1 and the original ORF file to perform a MSA using other tools of interest, and execute HMM searches externally. The effectR package can import external MSA in FASTA format and HMM results in table format to be used in step 3. The inclusion of step 2 as part of the package was created for a more streamlined process in which the user can identify candidates using REGEX searches, broaden the number of candidate proteins via HMM searches, and summarize and quality control the results obtained from the effectR package in a fast and seamless manner.

    Obtaining candidate RxLR and CRN effectors using REGEX.

    To predict the first set of candidate effector proteins, effectR searches the ORF translation file to find sequences that match the motifs of interest (Fig. 1). These searches are based on REGEX matching. For the RxLR motif, the REGEX reported by Haas et al. (2009) is used:

    ^\w{10,40}\w{1,96}R\wLR\w{1,40}EER.

    This REGEX example shows the three parts used in step 1 to identify an RxLR effector candidate (Fig. 3). Part 1A (\w{10,40}) reserves the first 10 to 40 positions of the amino acid sequence for the signal peptide. The effectR package does not predict the cleavage site of a signal peptide and relies on external programs to predict the presence of said structural motif. Part 1B (\w{1,96}R\wLR) searches for amino acid residues that match the RxLR motif within the following 1 to 96 residues after the signal peptide. Part 1C (\w{1,40}EER) searches for the EER motif within the 40 residues following the RxLR motif. Note that, to simplify the initial REGEX search, this pattern is modified from that of Haas et al. (2009), who used [ED][ED][KR]. If the user wants to use the original Haas et al. (2009) motif as part of the REGEX search, the custom option of the regex.search() step can be used in the following manner: regex.search(seq = ORF, motif = "custom", reg.pat = "^\\w{10,40}\\w{1,96}R\\wLR\\w{1,40}[ED][ED][RK]"). In addition, to identify effectors with the canonical W-Y-L motif found in RxLR proteins (Win et al. 2012), the following custom script can be implemented: regex.search(seq = ORF, motif = "custom", reg.pat = "^\\w{10,40}\\w{1,96}R\\wLR\\w{1,40}[ED][ED][RK]\\w{1,20}[WYL]").
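
    The behavior of the three parts of this REGEX can be checked on toy sequences with base R alone. The following is a minimal sketch; the helper name is_rxlr_candidate and both sequences are hypothetical and invented for illustration (they are not real effectors), and the code does not reproduce the internals of regex.search().

```r
# RxLR pattern from the text (backslashes doubled for an R string)
rxlr_pattern <- "^\\w{10,40}\\w{1,96}R\\wLR\\w{1,40}EER"

# Hypothetical helper: TRUE when a sequence matches the pattern
is_rxlr_candidate <- function(seqs, pattern = rxlr_pattern) {
  grepl(pattern, toupper(seqs), perl = TRUE)
}

# Toy sequences, invented for illustration only
toy <- c(
  with_motifs = paste0(strrep("M", 20), "RSLR", strrep("A", 10), "EER", strrep("K", 5)),
  rxlr_only   = paste0(strrep("M", 20), "RSLR", strrep("A", 20))
)

rxlr_match <- is_rxlr_candidate(toy)
rxlr_match
```

    The first toy sequence carries an RxLR motif after 20 leading residues and an EER motif 10 residues further downstream, so it matches; the second lacks the EER motif and is rejected.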

    Fig. 3. Graphical representation of the regular expression (REGEX) used by effectR to predict candidate RxLR effectors. This REGEX is integrated in the regex.search() function of effectR. The REGEX search is divided into three parts (gray boxes): Part 1 reserves the space for the signal peptide; part 2 performs a downstream search for the RxLR motif; and part 3 searches for the EER motif downstream from the RxLR motif. If a query amino acid sequence has a match to the two motifs, the sequence is considered a candidate RxLR effector protein and is used to build the hidden Markov model profile for step 2 of the package.

    A limitation of effectR is that it does not directly predict the presence of a signal peptide. Other programs can be used for this purpose, e.g., SignalP version 3.0 (Bendtsen et al. 2004), which Sperschneider et al. (2015) recommend over later versions for oomycetes.

    Step 1 is summarized in the regex.search() function of the effectR package. Step 1 determines if a translated ORF is a candidate effector after identifying the presence of an RxLR or CRN motif via REGEX (RxLR motif pattern: ^\w{10,40}\w{1,96}R\wLR\w{1,40}EER; CRN motif pattern: ^\w{1,90}LFLAK\w+ [Haas et al. 2009]). In addition, we have added a custom option that allows the user to specify a custom REGEX, in order to identify candidate genes for other effectors of interest.
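
    The choice between the built-in RxLR and CRN patterns and a user-supplied pattern can be sketched in base R as follows. The function toy_regex_search() and the toy sequences are hypothetical and serve only to illustrate the logic; the actual regex.search() function may be implemented differently.

```r
# Hypothetical helper: selects one of the two patterns given in the text,
# or a user-supplied pattern, and returns the sequences that match it.
toy_regex_search <- function(seqs, motif = c("RxLR", "CRN", "custom"),
                             reg.pat = NULL) {
  motif <- match.arg(motif)
  pattern <- switch(motif,
    RxLR   = "^\\w{10,40}\\w{1,96}R\\wLR\\w{1,40}EER",
    CRN    = "^\\w{1,90}LFLAK\\w+",
    custom = reg.pat
  )
  if (is.null(pattern)) stop("motif = 'custom' requires a pattern in reg.pat")
  seqs[grepl(pattern, seqs, perl = TRUE)]
}

# Synthetic sequences, not real proteins.
toy <- c(
  crn_like = paste0("M", strrep("K", 30), "LFLAK", strrep("D", 15)),
  other    = strrep("A", 60)
)

crn_hits    <- toy_regex_search(toy, motif = "CRN")
custom_hits <- toy_regex_search(toy, motif = "custom", reg.pat = "^A{50,}")
print(names(crn_hits))
print(names(custom_hits))
```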

    Using HMMs to predict additional candidate effectors.

    A second, independent method of identifying candidate effectors relies on HMMs (Fig. 1, step 2). This search allows the identification of additional sequences that match a HMM profile. In effectR, the HMM profile is built using the candidate effectors predicted in the REGEX step. The HMM profile includes the probabilities of any amino acid occurring in a given position of the consensus sequence. The consensus sequence is the product of a MSA of the sequences used to build the HMM profile (Eddy 1998).
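
    To illustrate what per-position amino acid probabilities look like, the base R sketch below computes residue frequencies for each column of a tiny, invented alignment. This is a deliberate simplification: a real profile HMM, such as those built by HMMER, also models insertions and deletions and applies priors, none of which is shown here.

```r
# Tiny synthetic alignment (hypothetical, gap-free, equal-length sequences).
msa <- c("RSLRE", "RALRE", "RSLRD", "RTLRE")

# One row per sequence, one column per alignment position.
cols <- do.call(rbind, strsplit(msa, ""))
residues <- sort(unique(as.vector(cols)))

# Relative frequency of each residue at each alignment position.
freqs <- sapply(seq_len(ncol(cols)), function(j) {
  as.numeric(table(factor(cols[, j], levels = residues))) / nrow(cols)
})
rownames(freqs) <- residues
colnames(freqs) <- paste0("pos", seq_len(ncol(cols)))
print(freqs)
```

    Positions 1, 3, and 4 are fully conserved (R, L, and R), whereas positions 2 and 5 are variable, which is the kind of signal a profile captures.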

    To create the HMM profile, effectR aligns the candidate effectors to identify common motifs and builds the HMM profile based on these common motifs. effectR uses MAFFT (Katoh and Standley 2013) to calculate the MSA of the candidate effectors predicted from the REGEX step, using the E-INS-i iterative refinement algorithm. The E-INS-i algorithm is suitable for sequences containing common domains of interest flanked by large unalignable regions (Katoh and Standley 2013). These large, unalignable regions are typically observed in RxLR and CRN effectors. After generating the MSA from REGEX candidate effectors, effectR creates a HMM profile using the HMMER hmmbuild and hmmpress modules. After the HMM profile is built, effectR searches the original ORF file for sequences with hits against the HMM profile, using hmmsearch. To encourage the user to perform manual curation, effectR does not apply any significance thresholds in the hmmsearch step and, by default, returns all the sequences that match the HMM profile. However, we have included an option within hmm.search(), called hmm.thres, that sets a bit score cutoff for returning candidate effector proteins. The hmm.thres option is equivalent to the -T option of the hmmsearch program, which sets the bit score cutoff for the per-sequence ranked hit list. The result from the HMM search is a list of translated ORFs that match the HMM profile. This list can be used for manual curation steps (Fig. 1, step 3) or can be exported using the write.fasta function of the seqinr package (Charif and Lobry 2007). This step has been summarized in the hmm.search() function of the effectR package.
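
    For users who prefer to run these external steps themselves, the sketch below assembles the corresponding MAFFT and HMMER command lines in base R and runs them only if the programs and input files are found. All file names and the bit score cutoff are assumptions chosen for illustration, and the exact arguments used internally by effectR may differ. The options --genafpair --maxiterate 1000 are the settings the MAFFT documentation lists for E-INS-i, and -T and --tblout are standard hmmsearch options.

```r
# Hypothetical file names; replace with real paths.
regex_fasta <- "regex_candidates.fasta"  # candidates from the REGEX step
orf_fasta   <- "orfs.fasta"              # original translated ORFs
aln_fasta   <- "candidates_aln.fasta"    # MSA written by MAFFT
hmm_file    <- "candidates.hmm"          # profile written by hmmbuild
bit_cutoff  <- 0                         # example value for the -T cutoff

# Argument vectors for each external program.
steps <- list(
  mafft     = c("--genafpair", "--maxiterate", "1000", regex_fasta),
  hmmbuild  = c(hmm_file, aln_fasta),
  hmmpress  = c(hmm_file),
  hmmsearch = c("-T", bit_cutoff, "--tblout", "hits.tbl", hmm_file, orf_fasta)
)

tools_found <- all(nzchar(Sys.which(names(steps))))
if (tools_found && file.exists(regex_fasta) && file.exists(orf_fasta)) {
  # MAFFT writes the alignment to standard output, so redirect it to a file.
  system2("mafft", steps$mafft, stdout = aln_fasta)
  for (tool in c("hmmbuild", "hmmpress", "hmmsearch")) {
    system2(tool, steps[[tool]])
  }
} else {
  message("External tools or input files not found; commands only assembled.")
}
```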

    Post hoc tools for combining candidate effector genes from REGEX and HMM searches.

    The effectR package allows for manual curation of the predicted effector proteins to screen for highly heterogeneous sequences or for sequences that match the HMM profile but do not include any of the motifs of interest in their amino acid sequence. The manual curation functions created for the effectR package include the effector.summary() function. This function summarizes the results from steps 1 and 2 and generates a list of unique candidate effectors. The effector.summary() function also returns the number of motifs and their positions in the amino acid sequence for all predicted candidate effectors. Finally, we created the hmm.logo() function to plot logo-like figures based on the candidate effectors predicted in steps 1 and 2. Each step is modular and fully compatible with the results of either REGEX or HMM searching.
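
    The kind of table such a summary produces can be approximated in base R as shown below. The sequences, hit lists, and the helper motif_positions() are hypothetical, and the output format of effector.summary() itself may differ; the sketch only shows how unique candidates from both searches can be combined and how motif counts and positions can be derived.

```r
# Synthetic sequences and hit lists (hypothetical).
seqs <- c(
  orf1 = "MKLAARSLRTTEERGG",
  orf2 = "MRALRAAARTLRAEER",
  orf3 = "MAAAGGGTTT"
)
regex_hits <- c("orf1", "orf2")  # candidates from the REGEX search
hmm_hits   <- c("orf2", "orf3")  # candidates from the HMM search

# Start positions of all non-overlapping matches of a pattern in one sequence.
motif_positions <- function(sequence, pattern) {
  starts <- gregexpr(pattern, sequence, perl = TRUE)[[1]]
  if (starts[1] == -1) integer(0) else as.integer(starts)
}

# Unique candidates across both searches, with motif counts and positions.
candidates <- union(regex_hits, hmm_hits)
rxlr_pos <- lapply(seqs[candidates], motif_positions, pattern = "R\\wLR")

summary_tbl <- data.frame(
  candidate      = candidates,
  found_by_regex = candidates %in% regex_hits,
  found_by_hmm   = candidates %in% hmm_hits,
  n_rxlr         = lengths(rxlr_pos),
  rxlr_start     = vapply(rxlr_pos, paste, character(1), collapse = ","),
  row.names      = NULL,
  stringsAsFactors = FALSE
)
print(summary_tbl)
```

    In this toy example, orf3 is reported by the HMM search only and contains no RxLR motif, which is the type of sequence that manual curation is meant to flag.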

    Software availability.

    The effectR package can be downloaded from CRAN or GitHub. effectR can be used on the command line or can be deployed in a user-friendly, point-and-click graphical interface, via the shiny R framework (Chang et al. 2017), using the shiny.effectR() function (Fig. 4). The end user must install the packages shiny (Chang et al. 2017) and shinyjs (Attali 2018) in order to deploy the shiny R graphical interface.

    Fig. 4. Screen capture of the Shiny R web application for effectR. All functions available in the package are included in this web application. The web application can be initiated within an active R session by running the shiny.effectR() function.

    ACKNOWLEDGMENTS

    We thank B. Knaus, Z. Kamvar, and Z. Foster for their technical advice, extensive code review, and vast expertise in R programming.

    AUTHOR-RECOMMENDED INTERNET RESOURCE

    GitHub effectR page: https://github.com/grunwaldlab/effectR

    The author(s) declare no conflict of interest.

    LITERATURE CITED

    Funding: This work was supported by funds from the United States Department of Agriculture Agricultural Research Service project 2072-22000-041-00-D and National Institute of Food and Agriculture project 2010-511001-21649. Mention of trade names or commercial products in this manuscript is solely for the purpose of providing specific information and does not imply recommendation or endorsement.