PEMA: a Pipeline for Environmental DNA Metabarcoding Analysis

PEMA is an HPC-centered, containerized assembly of key metabarcoding analysis tools. It supports the downstream analysis of four marker genes (16S/18S rRNA, ITS and COI), and because users can train the classifiers with custom reference databases, it can also be applied to other marker genes. By combining state-of-the-art technologies and algorithms with a framework that is easy to get, set up and use, PEMA lets researchers tune each study thoroughly through roll-back checkpoints and on-demand partial pipeline execution.

Date (Creation): 2020-03-12

Date (Publication): 2021-02-11

Date (Revision): 2021-02-10

Status: Completed

Creator

Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research - Haris Zafeiropoulos

Keywords: metagenomics

Keywords: remote sensing

Keywords: modelling

Keywords: e-DNA

Keywords: Metabarcoding

Keywords: 16S

Keywords: 18S

Keywords: ITS

Keywords: COI

Keywords: Marker gene analysis

Keywords: Taxonomy assignment

Access constraints: License

Use limitation: GNU GPLv3

OnLine resource: Paper describing the code (WWW:LINK-1.0-http--related)

OnLine resource: Home page/Github (WWW:LINK-1.0-http--link)

OnLine resource: Docker Hub (WWW:LINK-1.0-http--link)

OnLine resource: Singularity Hub (WWW:LINK-1.0-http--link)

Operation name: Sequence pre-processing

Web site: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Description: FastQC is used to obtain an overall read-quality summary.

Function: Sequence pre-processing

Operation name: Sequence pre-processing

Web site: http://www.usadellab.org/cms/?page=trimmomatic

Description: Trimmomatic is used for trimming steps.

Function: Sequence pre-processing

Operation name: Sequence pre-processing

Web site: https://cutadapt.readthedocs.io/en/stable/

Description: Cutadapt is used for ITS to address the variability in length of this marker gene.

Function: Sequence pre-processing

Operation name: Sequence pre-processing

Web site: http://cab.spbu.ru/software/spades/

Description: BayesHammer is taken from the SPAdes assembly toolkit to revise incorrectly-called bases.

Function: Sequence pre-processing

Operation name: Sequence pre-processing

Web site: https://github.com/neufeld/pandaseq

Description: PANDAseq assembles the overlapping paired-end reads.

Function: Sequence pre-processing
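
To illustrate what assembling overlapping paired-end reads means, here is a minimal Python sketch. It is not PANDAseq's algorithm and ignores base qualities and mismatches: it only looks for the longest exact overlap between the forward read and the reverse complement of the reverse read. The function names and the `min_overlap` parameter are hypothetical.

```python
# Illustrative sketch only: exact-match overlap merging, NOT PANDAseq's method.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=4):
    """Merge a read pair on the longest exact overlap, or return None.

    min_overlap is a hypothetical parameter chosen for this example.
    """
    rc = reverse_complement(reverse)
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-size:] == rc[:size]:
            return forward + rc[size:]
    return None


fragment = "ACGTTGCAAGGCT"
forward_read = fragment[:9]
reverse_read = reverse_complement(fragment[4:])
print(merge_pair(forward_read, reverse_read))
```

In this toy case the two reads share a five-base overlap, so the merged sequence equals the original fragment.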

Operation name: Sequence pre-processing

Web site: https://pythonhosted.org/OBITools/welcome.html

Description: The “obiuniq” program of OBITools groups the identical sequences in every sample, keeping track of their abundances.

Function: Sequence pre-processing
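
The dereplication idea described above (grouping identical sequences while keeping track of their abundances) can be sketched in a few lines of Python. This is an illustration of the concept, not the obiuniq implementation; the function name and the toy reads are assumptions made for the example.

```python
# Illustrative sketch only: conceptual dereplication, NOT the obiuniq code.
from collections import Counter


def dereplicate(reads):
    """Collapse identical sequences and count how often each one occurs."""
    counts = Counter(seq.upper() for seq in reads)
    # Most abundant first; ties are broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


sample_reads = ["ACGT", "ACGT", "TTGA", "acgt", "TTGA", "GGCC"]
for sequence, abundance in dereplicate(sample_reads):
    print(sequence, abundance)
```

Working on unique sequences with abundance counts, rather than on every read, keeps the later chimera-removal and clustering steps much smaller.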

Operation name: Sequence pre-processing

Web site: https://github.com/torognes/vsearch/releases/tag/v2.9.1

Description: The VSEARCH package is invoked for chimera removal.

Function: Sequence pre-processing

Operation name: OTU clustering

Web site: https://github.com/torognes/vsearch/releases/tag/v2.9.1

Description: VSEARCH is used for OTU clustering.

Function: OTU clustering
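
As a rough picture of what OTU clustering does, the Python sketch below performs greedy centroid-based clustering: sequences are visited from most to least abundant and join the first centroid they match at or above an identity threshold, otherwise they found a new cluster. It is a simplified stand-in, not VSEARCH itself. The identity measure (fraction of matching positions between equal-length sequences), the function names and the threshold value are assumptions made for the example.

```python
# Illustrative sketch only: simplified greedy clustering, NOT VSEARCH.
def identity(a, b):
    """Fraction of matching positions (assumes equal-length sequences)."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(abundances, threshold):
    """Map each centroid sequence to the list of sequences assigned to it."""
    clusters = {}
    # Visit sequences from most to least abundant (ties: alphabetical).
    ordered = sorted(abundances.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, _count in ordered:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            # No centroid was close enough: seq starts a new cluster.
            clusters[seq] = [seq]
    return clusters


toy = {"AAAAAAAAAA": 5, "AAAAAAAAAT": 2, "CCCCCCCCCC": 3}
print(greedy_cluster(toy, threshold=0.9))
```

Real tools compute identity from pairwise alignments, so sequences of different lengths can still be compared; the equal-length shortcut here only keeps the example short.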

Operation name: OTU clustering

Web site: https://github.com/tingchenlab/CROP

Description: For the COI marker gene, an unsupervised probabilistic Bayesian clustering algorithm (CROP) can be selected to perform the OTU clustering step.

Function: OTU clustering

Operation name: ASVs inference

Web site: https://github.com/torognes/swarm

Description: For all supported marker genes, PEMA invokes the Swarm v2 algorithm to infer ASVs.

Function: ASVs inference

Operation name: Taxonomy assignment

Web site: https://github.com/lanzen/CREST

Description: For the 16S/18S rRNA and ITS marker genes, the LCAClassifier algorithm of the CREST set of resources and tools is used together with the Silva and Unite databases. Two versions of Silva are included in PEMA: 128 and 132. Phylogeny-based assignment is also available for 16S rRNA marker gene data, using a custom reference tree of 1,000 Silva-derived consensus sequences.

Function: Taxonomy assignment of the OTUs or ASVs returned in OTU clustering / ASVs inference step
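
The lowest-common-ancestor (LCA) idea behind this kind of classifier can be sketched as follows: given the lineages of several reference hits for one query, report the deepest taxonomic path they all share. This Python sketch shows only that core idea and does not reproduce how LCAClassifier selects or filters hits. The function name and the example lineages are assumptions made for illustration.

```python
# Illustrative sketch only: the core LCA idea, NOT the CREST LCAClassifier.
def lowest_common_ancestor(lineages):
    """Return the longest rank-by-rank prefix shared by all lineages."""
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so the prefix never overruns any hit.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Vibrionales"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Alteromonadales"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
]
print(lowest_common_ancestor(hits))
```

When the hits disagree below a certain rank, the assignment stops at the last rank they agree on, which avoids over-confident species-level calls.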

Operation name: Taxonomy assignment

Web site: https://github.com/rdpstaff/classifier

Description: For the COI marker gene, PEMA supports the RDPClassifier with the Midori and Midori2 reference databases to assign taxonomy to the MOTUs.

Function: Taxonomy assignment of the OTUs or ASVs returned in OTU clustering / ASVs inference step

Operation name: Ecological downstream analysis

Web site: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0061217; http://joey711.github.io/phyloseq/index.html; https://cran.r-project.org/web/packages/vegan/index.html

Description: The phyloseq R package can be used for downstream ecological analysis of the taxonomically assigned OTUs or ASVs. This includes α- and β-diversity analysis, taxonomic composition, statistical comparisons, and calculation of correlations between samples.

Function: Ecological downstream analysis of the taxonomic tables
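
To make the α- and β-diversity measures mentioned above concrete, the Python sketch below computes a Shannon index for one sample and a Bray-Curtis dissimilarity between two samples from OTU/ASV count vectors. These are standard textbook formulas written out by hand, not phyloseq calls (phyloseq is an R package); the function names and the toy counts are assumptions made for the example.

```python
# Illustrative sketch only: textbook diversity formulas, NOT phyloseq code.
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with count > 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two equally ordered count vectors."""
    denominator = sum(sample_a) + sum(sample_b)
    if denominator == 0:
        return 0.0
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    return numerator / denominator


# Toy OTU counts for two samples (same OTU order in both vectors).
site_1 = [10, 10, 0]
site_2 = [0, 10, 10]
print(shannon_index(site_1))
print(bray_curtis(site_1, site_2))
```

The Shannon index summarises diversity within one sample (α-diversity), while Bray-Curtis compares the composition of two samples (β-diversity), ranging from 0 for identical counts to 1 for no shared taxa.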

Operation name: Alignment tool

Web site: https://mafft.cbrc.jp/alignment/software/

Description: MAFFT aligns the consensus sequences returned by the phat algorithm, which are used to build the reference tree. PEMA also uses it whenever a user requests a phylogenetic tree of the OTUs/ASVs found.

Function: Alignment of consensus sequences

Operation name: Alignment tool

Web site: https://cme.h-its.org/exelixis/web/software/papara/index.html

Description: PaPaRa aligns short reads to reference phylogenies and alignments. In PEMA it aligns the OTUs/ASVs, using the alignment of the reference-tree sequences as the core to align to.

Function: Alignment of short reads

Operation name: Build the reference tree

Web site: https://github.com/amkozlov/raxml-ng

Description: RAxML-NG builds the reference tree. As with MAFFT, it is also used to build a phylogenetic tree from the retrieved OTUs/ASVs when the user requests one.

Function: Build the reference tree

Operation name: Sequence placement on a phylogenetic tree

Web site: https://github.com/Pbdas/epa-ng

Description: EPA-ng performs maximum-likelihood-based phylogenetic placement of genetic sequences on a user-supplied reference tree and alignment. In the PEMA framework it places the OTUs/ASVs retrieved by PEMA on the reference tree.