AB12PHYLO: An Integrated Pipeline for Maximum Likelihood Phylogenetic Inference from ABI Trace Data
- Leo Kaindl
- Corinn Small
- Remco Stam †
- Chair of Phytopathology, School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
Multigene phylogenies constructed from multiplexed and Sanger sequencing data are regularly used in mycology and other disciplines as a cost-effective means of species identification and as a first step in investigating the genetic diversity of samples. We present AB12PHYLO, an integrated pipeline that can perform all necessary steps from reading raw Sanger sequencing data through visualizing and editing phylogenies. In addition, AB12PHYLO can calculate basic summary statistics for each gene in the phylogeny. AB12PHYLO is designed as a wrapper of several open access and commonly used tools for each of the intermediate stages and is intended to simplify the phylogenetic pipeline while still allowing a high degree of access. It comes as a command-line version for the highest reproducibility and as an intuitive graphical user interface (GUI) for easy adoption by IT-agnostic end-users. The use of AB12PHYLO significantly reduces the hands-on working time for these analyses.
Multigene phylogenies obtained through Sanger sequencing often use barcode sequences, specific genic or intergenic fragments of well-defined genes, that have been widely used over the past decades. Examples are regions of the Internal Transcribed Spacer (ITS) (White et al. 1990), Elongation Factor 1 alpha (EF1) (Carbone and Kohn 1999) or RNA polymerase II subunit (RPB2) (Liu et al. 1999). These genes have been sequenced for a large number of type specimens, and sequence comparison of the samples in question with stored type specimens, either through direct local alignments or database searches such as National Center for Biotechnology Information (NCBI) BLAST (Johnson et al. 2008), can help confirm species identity. Often, sequence data of a single barcode gene is not sufficient to determine fungal identity at the species level, whereas a combination of three or more barcodes can reliably determine which species the sample belongs to (see, e.g., Woudenberg et al. 2015). In another example, construction of multigene phylogenies formed the basis for phylogenetic reclassification. The fungal plant-pathogen genus Ulocladium appears morphologically different from the genus Alternaria, but multigene phylogenies did not result in monophyletic clades, suggesting that Ulocladium spp. be renamed; they now fall under the broader Alternaria genus (Woudenberg et al. 2013). This method is also in use to get better insights into pathogens in the field. Two recent studies used multigene phylogenetic analyses to confirm the nature and relationship of the pathogens A. alternata and A. solani in potato fields in Wisconsin or Brazil (Adhikari et al. 2020; Ding et al. 2020). Other recent studies used the method to identify and compare Colletotrichum spp. on tea (Orrock et al. 2019), strawberry (Chen et al. 2019), and a variety of hosts (He et al. 2019), to re-assess the taxonomic classification of Mycosphaerella spp. on persimmon (Hassan and Chang 2018) or to get first insights into the diversity of Phytophthora spp. in the Amazon forest (Legeay et al. 2020).
Today, a number of tools exist for each of the steps of a multigene phylogenetic analysis, including quality control and trimming, the generation of a multiple sequence alignment (MSA), extraction of informative sites, and the construction of the final phylogenetic tree. Additionally, a BLAST search in a reference database is often performed to identify sequences of type specimens against which the samples can be compared within the phylogeny. Developed over the past decades, these tools are all independent of one another and are often not well adapted to each other.
These steps involve manual inspection of the sequence quality, followed by manual data trimming. Some tools exist that automate sequence file inspection to a certain extent (Rausch et al. 2020; Singh and Bhatia 2016), yet these tools do not help the user with the subsequent steps, such as alignment with reference sequences or phylogenetic reconstruction, and such steps often require additional hands-on work as well, if only to prepare the output of one tool as input for the next. Manual editing of input and output files is also required when using popular web-based phylogeny tools like NGPhylogeny.fr (Lemoine et al. 2019). Manual processing slows down analysis and affects reproducibility in general, as many parameters or small conversion steps are often not properly recorded. To speed up data analyses of this kind and increase their reproducibility, we constructed a fully customizable pipeline which we call AB12PHYLO.
AB12PHYLO is developed as a Python 3 package around widely used open-source tools. It takes raw ab1 (ABI) files as input. Additionally, the user can provide a template specifying the corresponding sample names, which can be formatted in a 96-well plate format, to represent the way the samples are often loaded for sequencing. When no sample template is specified, AB12PHYLO uses regular expressions that the user can modify to search for and extract the file and gene names.
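The regular-expression fallback can be illustrated with a minimal sketch. The file-name convention `<sample>_<gene>_<suffix>.ab1`, the pattern, and the function name `parse_trace_name` are assumptions made for this example and are not the defaults shipped with AB12PHYLO.

```python
import re

# Hypothetical naming convention: <sample>_<gene>[_anything].ab1
FILE_PATTERN = re.compile(
    r"^(?P<sample>[^_]+)_(?P<gene>[^_.]+).*\.ab1$", re.IGNORECASE
)


def parse_trace_name(file_name):
    """Return (sample, gene) parsed from a trace file name, or None if it does not match."""
    match = FILE_PATTERN.match(file_name)
    if match is None:
        return None
    return match.group("sample"), match.group("gene")


print(parse_trace_name("S17_ITS_F.ab1"))
print(parse_trace_name("notes.txt"))
```

A user with a different naming scheme would only need to adapt the pattern, which is the kind of customization the text describes.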
Its command-line version is assembled from the following three parts (with eight main steps).
Part A—Sequence assessment
For file input, after the command line is supplemented with default configurations, the tables mapping plate coordinates to sample IDs are read into memory. Ab1 trace files are read using Biopython Bio.SeqIO (Cock et al. 2009), are matched to their original sample ID and gene, and are passed to quality control. Reference sequences are saved to the respective per-gene dataset.
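As an illustration of mapping plate coordinates to sample IDs, the sketch below reads a plate-shaped table into a dictionary. The CSV layout (column numbers in the header row, row letters in the first column) and the function name `read_plate_layout` are assumptions for this example; the exact template format expected by AB12PHYLO may differ.

```python
import csv
import io


def read_plate_layout(handle):
    """Map well coordinates such as 'A1' to sample IDs from a plate-shaped CSV table."""
    rows = list(csv.reader(handle))
    columns = rows[0][1:]  # header row holds the plate column numbers
    wells = {}
    for row in rows[1:]:
        if not row:
            continue
        row_letter = row[0].strip()
        for column, sample_id in zip(columns, row[1:]):
            sample_id = sample_id.strip()
            if sample_id:  # empty wells are skipped
                wells[f"{row_letter}{column.strip()}"] = sample_id
    return wells


# A tiny 2 x 3 excerpt of a plate, assumed layout
example = io.StringIO(",1,2,3\nA,S01,S02,\nB,S03,,S04\n")
layout = read_plate_layout(example)
print(layout)
```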
Quality control was modeled after SeqTrace (Stucky 2012): Read ends are trimmed until a user-defined proportion of characters in the chromatogram have a Phred quality score at or above a user-defined threshold, with 8/10 and 30 as pre-set default values. Reads of insufficient quality can be discarded at this end-trimming step. Consecutive stretches of characters with a score below the Phred threshold are replaced by an equal-length stretch of unknown N characters if they are longer than a third user-definable limit, which is pre-set at 5. Reverse reads are replaced by their reverse complement.
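The trimming and masking rules can be sketched as follows. This is one plausible reading of the description, assuming a sliding window of 10 bases in which at least 8 must reach Phred 30; the function names and the exact window semantics are assumptions, not the code used by AB12PHYLO or SeqTrace.

```python
def trim_ends(seq, quals, min_good=8, window=10, threshold=30):
    """Trim both ends up to the outermost windows holding at least `min_good` good bases.

    Returns (trimmed_seq, trimmed_quals), or None if no window qualifies.
    """
    good = [q >= threshold for q in quals]

    def first_good_window(flags):
        for start in range(len(flags) - window + 1):
            if sum(flags[start:start + window]) >= min_good:
                return start
        return None

    left = first_good_window(good)
    if left is None:
        return None  # the read would be discarded
    right = len(good) - first_good_window(good[::-1])
    return seq[left:right], quals[left:right]


def mask_low_quality(seq, quals, threshold=30, max_bad_run=5):
    """Replace runs of low-quality bases longer than `max_bad_run` with N."""
    out = list(seq)
    run_start = None
    # A sentinel value at the end closes a low-quality run that reaches the last base
    for i, q in enumerate(list(quals) + [threshold]):
        if q < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > max_bad_run:
                out[run_start:i] = "N" * (i - run_start)
            run_start = None
    return "".join(out)


seq = "ACGTACGTAC" * 3
quals = [10] * 3 + [40] * 8 + [10] * 6 + [40] * 10 + [10] * 3
trimmed_seq, trimmed_quals = trim_ends(seq, quals)
masked = mask_low_quality(trimmed_seq, trimmed_quals)
print(masked)
```

In this toy read, one base is trimmed from each end, and the internal run of six low-quality bases exceeds the limit of 5 and is masked, while the shorter runs that remain at the ends are kept.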
Part B—Sequence alignment
Edited sequences are passed to a multiple sequence alignment tool in per-gene datasets. AB12PHYLO is able to interface with local installations of MAFFT (Katoh et al. 2002), Clustal Omega (Sievers et al. 2011), or MUSCLE (Edgar 2004) or an EMBL-EBI online service for any of them (Notredame et al. 2000) at https://www.ebi.ac.uk/Tools/msa.
The alignments are trimmed with Gblocks (Castresana 2000). Requirements for a conserved site can be set at four different levels, from 90% identity to the most relaxed permissible parameters, and a fifth option skipping trimming entirely. The per-gene MSAs are concatenated into a supermatrix alignment.
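Concatenation into a supermatrix can be sketched in a few lines. The example assumes that each per-gene MSA is held as a dictionary of sample name to aligned sequence and that samples missing from a gene are padded with gap characters; both the data structure and the padding policy are assumptions for illustration.

```python
def concatenate_alignments(per_gene):
    """Join per-gene alignments (sample -> aligned sequence) into one supermatrix.

    Returns the supermatrix and the 1-based, inclusive column range of each gene.
    """
    samples = sorted({s for msa in per_gene.values() for s in msa})
    supermatrix = {s: "" for s in samples}
    partitions = {}
    offset = 0
    for gene, msa in per_gene.items():
        lengths = {len(seq) for seq in msa.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences for {gene} are not aligned to equal length")
        width = lengths.pop()
        for s in samples:
            # Assumed policy: pad samples lacking this gene with gaps
            supermatrix[s] += msa.get(s, "-" * width)
        partitions[gene] = (offset + 1, offset + width)
        offset += width
    return supermatrix, partitions


matrix, parts = concatenate_alignments(
    {"ITS": {"S1": "AC-T", "S2": "ACGT"}, "EF1": {"S1": "GGA"}}
)
print(matrix, parts)
```

Keeping track of the per-gene column ranges is useful because downstream steps, such as per-gene summary statistics, need to know where each locus sits in the concatenated alignment.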
A BLAST similarity search of data from the first gene in the analysis is carried out to identify source species. If this search is to be run locally, AB12PHYLO employs BLAST+ (Camacho et al. 2009), which will download, update, or check a user-defined database before searching it. By default, AB12PHYLO will query the NCBI nucleotide database for sequences not found in the local database with Biopython Bio.Blast (Cock et al. 2009). BLAST can also be run entirely via the public NCBI BLAST API, but this approach is not suitable for large datasets. Two further options are available: skipping BLAST altogether, or parsing one or several XML files from a previous analysis or a web BLAST.
By default, a maximum likelihood (ML) tree is inferred from the concatenated alignment with RAxML-NG (Kozlov et al. 2019). While the evolutionary model is pre-set to GTR+Γ and the number of ML tree searches to 10 with random or parsimony starting trees each, these parameters can be user-defined. Also, the number of parallel threads can be limited. Alternatively, trees can be inferred using IQ-TREE (Minh et al. 2020), which allows automated model selection. Moreover, IQ-TREE can also be executed in the Windows version of AB12PHYLO.
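For orientation, an ML tree search with these settings could be launched from Python roughly as sketched below. The helper `raxml_ng_command`, the file names, and the reading of the defaults as 10 random plus 10 parsimony starting trees are assumptions; this is not the invocation that AB12PHYLO itself builds. `GTR+G` is the RAxML-NG notation for GTR+Γ.

```python
def raxml_ng_command(msa_path, prefix, model="GTR+G", n_random=10,
                     n_parsimony=10, threads=4, seed=1):
    """Build an argument list for a RAxML-NG tree search (the command is not executed here)."""
    return [
        "raxml-ng", "--search",
        "--msa", str(msa_path),
        "--model", model,
        # Starting trees: e.g. rand{10},pars{10}
        "--tree", f"rand{{{n_random}}},pars{{{n_parsimony}}}",
        "--threads", str(threads),
        "--seed", str(seed),
        "--prefix", str(prefix),
    ]


# Hypothetical file names
cmd = raxml_ng_command("msa.fasta", "ml_search")
print(" ".join(cmd))
```

With RAxML-NG installed, the list could be passed to `subprocess.run(cmd, check=True)`; fixing the seed and recording the full command is one simple way to keep a run reproducible.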
With RAxML-NG or IQ-TREE, bootstrap replicates are generated from the best ML phylogeny found in the previous step. FBP (Felsenstein bootstrap proportion) and TBE (transfer bootstrap expectation) support values for the best ML tree are computed from the bootstrap trees constructed in parallel threads.
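To make the FBP support value concrete, the sketch below computes, for each internal branch of a best tree, the fraction of bootstrap trees that contain the same bipartition of taxa. Trees are represented as nested tuples of taxon names purely for illustration; in practice RAxML-NG or IQ-TREE computes these values, and TBE, which also credits near-matching branches, is not covered here.

```python
def leaf_set(node):
    """Return the set of taxon names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def splits(tree):
    """Collect the non-trivial bipartitions of a tree, independent of its rooting."""
    taxa = leaf_set(tree)
    reference = min(taxa)  # each split is stored as the side lacking this taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaf_set(node)
        side = taxa - below if reference in below else below
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def felsenstein_support(best_tree, bootstrap_trees):
    """Fraction of bootstrap trees containing each bipartition of the best tree."""
    replicate_splits = [splits(t) for t in bootstrap_trees]
    return {
        split: sum(split in rep for rep in replicate_splits) / len(replicate_splits)
        for split in splits(best_tree)
    }


best = (("A", "B"), ("C", "D"), "E")
replicates = [
    (("A", "B"), ("C", "D"), "E"),
    (("A", "B"), ("C", "E"), "D"),
    (("A", "C"), ("B", "D"), "E"),
    ((("A", "B"), "E"), ("C", "D")),
]
support = felsenstein_support(best, replicates)
print(support)
```

Here the branch separating A and B from the rest is found in three of four replicates, and the branch separating C and D from the rest in two of four.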
For output, the generated phylogeny is plotted with Toytree (Eaton 2020) and is shown alongside other results in an HTML results page. A CGI script allows an interactive search of taxa and selection of populations while computing diversity statistics and Tajima's D neutrality test. An overview of the main features of AB12PHYLO is shown in Figure 1. A more detailed model of the command-line AB12PHYLO program flow is shown in Supplementary Figure S1.
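Tajima's D compares two estimates of genetic diversity: the mean number of pairwise differences and the number of segregating sites scaled by a sample-size constant. The sketch below follows the standard formula of Tajima (1989) for a set of aligned, upper-case sequences. Skipping columns that contain gaps or ambiguous characters and returning `None` for undefined cases are simplifying assumptions, and the function is not the implementation used by AB12PHYLO.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for equal-length aligned sequences; None if it is undefined."""
    n = len(sequences)
    if n < 4:
        return None  # the variance term vanishes for fewer than four sequences
    # Assumption: only columns made up entirely of A, C, G, T are considered
    columns = [col for col in zip(*sequences) if set(col) <= set("ACGT")]
    seg_sites = sum(1 for col in columns if len(set(col)) > 1)
    if seg_sites == 0:
        return None
    n_pairs = n * (n - 1) / 2
    # Mean number of pairwise differences
    pi = sum(sum(a != b for a, b in combinations(col, 2)) for col in columns) / n_pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Only singleton variants: D is negative
print(tajimas_d(["AAAA", "AAAT", "AATA", "ATAA"]))
# Variants at intermediate frequency: D is positive
print(tajimas_d(["AA", "AA", "TT", "TT"]))
```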
The GUI version of AB12PHYLO implements the same process, while giving users direct control over each step. Visualizations of sequence trimming and MSAs allow immediate identification of out-of-register samples and carefully balanced MSA trimming to prevent both signal loss and trimming artifacts. Furthermore, the graphical AB12PHYLO enables convenient export of the computation-heavy ML tree inference to a more powerful computer, faster calculation of diversity statistics, and more extensive as well as easier tree modifications.
As a proof of concept, we obtained the data from two of the above-mentioned studies, Ding et al. (2020) and Legeay et al. (2020), to reconstruct their phylogenies. To repeat the study by Ding et al. (2020), we ran AB12PHYLO with default settings, providing both the raw ab1 files and the sequence data of the type specimens as used by Ding et al. (2020). Two samples did not pass the default quality controls. With the remaining 74 samples, we resolved a phylogenetic tree similar to the one in the original work, in which the same genotype groups can be annotated (Supplementary Fig. S2). The MSA used for the phylogeny was 1,822 bp long. Our analysis was run on 12 threads on a system with 64 GB of RAM, with a total runtime including parallelized bootstrapping and BLAST of less than 10 minutes.
When trying to reproduce the analysis by Legeay et al. (2020) from unpublished ABI trace data of the sequenced nuclear loci and the parameter configuration defined as AB12PHYLO defaults, we resolved a phylogenetic tree that was clearly different from the one published (Supplementary Fig. S3A). Therefore, we re-constructed a tree using both ABI traces and sequence data published on the NCBI website (Supplementary Fig. S3C), adjusting quality control so that trimmed traces visually resembled the submitted sequences while no samples were discarded, and removing more non-conserved MSA positions. We still observed clear differences to the published reference tree. First, the groups assigned by Legeay et al. (2020) can be reproduced neither from the sequence data in GenBank nor from the ABI trace data. While we could mostly resolve the group in green and its proximity to 1176 (in pink), the orange and gray groups cannot be separated. In particular, samples 438 and 563 exchange their positions with 534 and 567 in all trees we inferred, including one inferred from sequence data as submitted to the NCBI (Supplementary Fig. S3B). Furthermore, we saw significantly longer branches and wrong positions for 823 and 311. From visually inspecting MSAs, we know this is because of low-quality reads, and these samples as well as 1176 were likely re-sequenced later on. Finally, we re-constructed a phylogeny from the GenBank data with BEAST2 (Bouckaert et al. 2019), a Bayesian inference method, to eliminate our ML approach as a source of the discrepancy. When we compare the tree constructed with BEAST2 (Supplementary Fig. S4) to the tree constructed with AB12PHYLO (Supplementary Fig. S3B), we see that these trees are congruent. This indicates that the incongruence with the tree published by Legeay et al. (2020) is likely caused by variation in the input data.
This could arise from differences in quality control (trimming), the use of additional loci (e.g., EF1 was not provided properly on NCBI), or handling or labeling errors by the original authors. Even though some trace files did not pass AB12PHYLO's default filtering (as illustrated in Supplementary Figure S5), the remaining trace file data and the GenBank data produce congruent trees; therefore, we argue that automated processing of trace files by AB12PHYLO is an appropriate approach for replacing manual curation. Thus, we conclude that AB12PHYLO can produce high-quality multigene phylogenies rapidly. The use of AB12PHYLO significantly reduces hands-on working time for these analyses and overall runtime by parallelization of computation-heavy ML tree inference. Moreover, the fact that we observed differences between published phylogenies and our re-analyses highlights the importance of reproducibility.
We thank S. Ding and M. Buée and colleagues for providing the raw ABI sequence data from their studies and T. Schmey for testing AB12PHYLO. This work was in part funded by the German Science Foundation (DFG).
The author(s) declare no conflict of interest.
- 2020. Gene genealogies reveal high nucleotide diversity and admixture haplotypes within three Alternaria species associated with tomato and potato. Phytopathology 110:1449-1464. https://doi.org/10.1094/PHYTO-12-19-0487-R
- 2019. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15, 1-28. https://doi.org/10.1371/journal.pcbi.1006650
- 2009. BLAST+: Architecture and applications. BMC Bioinformatics 10:421. https://doi.org/10.1186/1471-2105-10-421
- 1999. A method for designing primer sets for speciation studies in filamentous ascomycetes. Mycologia 91:553-556. https://doi.org/10.2307/3761358
- 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540-552. https://doi.org/10.1093/oxfordjournals.molbev.a026334
- 2019. Genetic diversity of Colletotrichum spp. causing strawberry anthracnose in Zhejiang, China. Plant Dis. 104:1351-1357. https://doi.org/10.1094/PDIS-09-19-2026-RE
- 2009. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25:1422-1423. https://doi.org/10.1093/bioinformatics/btp163
- 2020. Spatiotemporal distribution of potato-associated Alternaria species in Wisconsin. Plant Dis. 105:149-155. https://doi.org/10.1094/PDIS-11-19-2290-RE
- 2020. Toytree: A minimalist tree visualization and manipulation library for Python. Methods Ecol. Evol. 11:187-191. https://doi.org/10.1111/2041-210X.13313
- 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797. https://doi.org/10.1093/nar/gkh340
- 2018. Phylogenetic and morphological reassessment of Mycosphaerella nawae, the causal agent of circular leaf spot in persimmon. Plant Dis. 103:200-213. https://doi.org/10.1094/PDIS-05-18-0857-RE
- 2019. Characterization and fungicide sensitivity of Colletotrichum spp. from different hosts in Shandong, China. Plant Dis. 103:34-43. https://doi.org/10.1094/PDIS-04-18-0597-RE
- 2008. NCBI BLAST: A better web interface. Nucleic Acids Res. 36:W5-9. https://doi.org/10.1093/nar/gkn201
- 2002. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059-3066. https://doi.org/10.1093/nar/gkf436
- 2019. RAxML-NG: A fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35:4453-4455. https://doi.org/10.1093/bioinformatics/btz305
- 2020. Surprising low diversity of the plant pathogen Phytophthora in Amazonian forests. Environ. Microbiol. 22:5019-5032. https://doi.org/10.1111/1462-2920.15099
- 2019. NGPhylogeny.fr: New generation phylogenetic services for non-specialists. Nucleic Acids Res. 47:W260-W265. https://doi.org/10.1093/nar/gkz303
- 1999. Phylogenetic relationships among ascomycetes: Evidence from an RNA polymerase II subunit. Mol. Biol. Evol. 16:1799-1808. https://doi.org/10.1093/OXFORDJOURNALS.MOLBEV.A026092
- 2020. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37:1530-1534. https://doi.org/10.1093/molbev/msaa015
- 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302:205-217. https://doi.org/10.1006/jmbi.2000.4042
- 2019. Anthracnose in U.S. tea: Pathogen characterization and susceptibility among six tea accessions. Plant Dis. 104:1055-1059. https://doi.org/10.1094/PDIS-07-19-1518-RE
- 2020. Tracy: Basecalling, alignment, assembly and deconvolution of Sanger chromatogram trace files. BMC Genomics 21:230. https://doi.org/10.1186/s12864-020-6635-8
- 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7:539. https://doi.org/10.1038/msb.2011.75
- 2016. Automated Sanger analysis pipeline (ASAP): A tool for rapidly analyzing Sanger sequencing data with minimum user interference. J. Biomol. Tech. 27:129-131. https://doi.org/10.7171/jbt.16-2704-005
- 2012. SeqTrace: A graphical tool for rapidly processing DNA sequencing chromatograms. J. Biomol. Tech. 23:90-93. https://doi.org/10.7171/jbt.12-2303-004
- 1990. Amplification and direct sequencing of fungal ribosomal RNA genes for phylogenetics. Pages 315-322 in: PCR Protocols: A Guide to Methods and Applications. M. A. Innis, D. H. Gelfand, J. J. Sninsky, and T. J. White, eds. Academic Press, San Diego, CA.
- 2013. Alternaria redefined. Stud. Mycol. 75:171-212. https://doi.org/10.3114/sim0015
- 2015. Alternaria section Alternaria: Species, formae speciales or pathotypes? Stud. Mycol. 82:1-21. https://doi.org/10.1016/j.simyco.2015.07.001
Data availability: AB12PHYLO is published under the GPLv3 license. It runs on standard desktop computers under Linux, macOS, or Windows and can be installed via the pip or conda package-management systems; the latter also allows easy installation of an environment with all external tools. Installation instructions and source code are available at https://github.com/lkndl/ab12phylo.
Funding: The project was funded by the German Science Foundation (DFG, STA1547/2, STA1547/4). Some analyses were performed on the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A537B, 031A533A, 031A538A, 031A533B, 031A535A, 031A537C, 031A534A, 031A532B).
Copyright © 2022 The Author(s). This is an open access article distributed under the CC BY 4.0 International license.