View on GitHub

Assembler Components

Components of genome sequence assembly tools

Assembler Components

Components of genome sequence assembly tools

Rationale

Genome, metagenome and transcriptome assemblers range from fully integrated to fully modular. Fully modular assembly has a number of benefits. This repository is ongoing work to define some important checkpoints in a modular assembly pipeline, along with standard input/output formats. For now we have a bias towards Illumina-type sequencing data (single reads, paired reads, mate-pairs, 10x), but we aim to make the components also compatible with 3rd generation reads.

Feel free to contribute via pull-requests.

File formats

FASTA/FASTQ: Optional attributes are found in the comment field, and formatted as in SAM, XX:x:xxxx. The comment field follows a space in the header.

FASTQ: Paired-end reads are interleaved and may be compressed. Linked read barcodes may be indicated with the BX tag.

FASTA: The sequence should not be line wrapped. The FASTA file should be indexed (FAI).

GFA: GFA2 is preferred over GFA1. The sequence fields may be empty (*). The sequences may be stored in an adjacent FASTA file, named for example assembly.gfa and assembly.fa.

SAM/BAM: SAM/BAM files are sorted by position and indexed, unless otherwise stated. Different sequencing libraries may be indicated with the read group RG attribute.

GFA record types

GFA sequence segment attributes

File names

An assembly is contained in a single directory. The files are named according to the pattern [0-9]+_[a-z]+\.[a-z.]+. The numeric prefixes are zero-padded and identical in length, and they indicate the stage of the assembly. A descriptive name and file type extension follow. The files may be compressed.

Example

0_pe.fq.gz
1_unitig.gfa
2_denoise.gfa
3_debulge.gfa 3_debulge.bam 3_debulge.bam.bai
4_link.gfa
5_scaffold.gfa
6_assembly.gfa 6_assembly.fa 6_assembly.fa.fai 6_assembly.bam 6_assembly.bam.bai

Notation

type1(record[attributes],…) + … → type2(record[attributes],…) + …

This stage requires a file of type1 and produces a file of type2. For example, estimate copy number of unitigs. A GFA file of sequence segments and edges and a BAM or PAF file of mapped reads produces a GFA file with estimated copy numbers of unitigs.

GFA (SE) + BAM/PAF → GFA (S[CN],E)

Stages of genome assembly

Preprocess reads

Remove sequencing artifacts specific to each sequencing technology. Improve the quality of the input reads with minimal loss of information, for example heterozygous variants.

FASTQ → FASTQ

Quality control

Assess the quality of the sequencing, and estimate parameters of the genome.

FASTQ → TSV

Correct reads

Correct sequencing errors in reads.

FASTQ → FASTQ

Unitig

Assemble unitigs by de Bruijn graph (dBG) assembly or overlap, layout, consensus (OLC) assembly.

FASTQ → GFA (SE)

de Bruijn Graph (dBG)

Overlap, layout, consensus (OLC)

Denoise

Remove sequencing errors from the assembly graph. Retain variants.

GFA (SE) → GFA (SE)

Collapse variants

Identify and/or remove variants from the graph.

GFA (SE) → GFA (SE)

A single sequence or path through the bulge may be selected, or the bulge may be replaced by a consensus sequence, possibly using IUPAC ambiguity codes to represent the consensus.

Map reads

Map reads to the assembly sequences.

FASTQ + GFA (S)/FASTA → BAM

Estimate copy number

Estimate the copy number of each sequence segment.

GFA (SE) + BAM/PAF → GFA (S[CN],E)

Note that the median depth of coverage is more robust than the mean depth of coverage to the alignment artifacts caused by collapsed repeats, misaligned reads, and other issues.

Resolve repeats and scaffold

Expand repeats, and order and orient sequence segments into contigs and scaffolds.

FASTA/GFA (SE) + BAM/PAF → FASTA/GFA (SE)

Contigs are contiguous sequences with no gaps. Creating contigs requires expanding the repetitive sequence found between the unique contigs. Contigs are derived from contiguous paths of sequence segments without any gaps. Scaffolds are derived from discontiguous paths of sequence segments with gaps between the segments.

A tool may implement scaffolding as a single stage of assembly. Scaffolding however may be viewed as composed of the three distinct stages: link unitigs, order and orient unitigs to construct paths, and contract paths to create new sequence segments.

Fill gaps

Assemble the sequence found in the scaffold gaps between adjacent contigs.

FASTQ + FASTA/GFA (S) → FASTA/GFA (S)

Polish

Map the reads to the assembly and correct assembly errors.

FASTQ + FASTA/GFA (SE) + BAM/PAF → FASTA/GFA (SE)

Visualize the assembly graph

GFA (SE) → PNG/SVG

Assess assembly quality

Assess the contiguity and correctness of the assembly.

FASTA/GFA (S) → TSV

Tools

A tool may combine multiple assembly stages in a single tool.

Preprocess reads

Illumina mate-pair

Illumina paired-end

Linked reads

Nanopore

Quality control

Correct reads

Unitig

Denoise

Collapse variants

Map reads

Estimate copy number

Order and orient

Contract paths

Fill gaps

Polish

Visualize the assembly graph

Assess assembly quality

Pipelines

The data is from Unicycler: “These are synthetic reads from plasmids A, B and E from the Shigella sonnei 53G genome assembly”. Shigella sonnei plasmids (synthetic reads), short_reads_1.fastq.gz, short_reads_2.fastq.gz

ABySS

Assemble paired-end reads using ABySS. This ABySS pipeline is a minimal subset of tools run by the complete abyss-pe pipeline.

cd components
k=99
./download_test_data.sh
./abyss_unitigs.sh 0_pe.fq.gz $k
./abyss_contigs_from_unitigs.sh 1_unitig.gfa2 1_unitig.fa $k
./abyss_scaffolding.sh 6_contigs.fa 6_contigs.gfa $k
# Visualize the assembly graph
Bandage load 9_assembly.gfa1 &

BCALM+ABySS

This assembles reads using BCALM for unitigs, then uses the rest of the ABySS pipeline from the previous example.

cd components
k=99
./download_test_data.sh
./bcalm.sh 0_pe.fq.gz $k
./abyss_contigs_from_unitigs.sh 1_unitig.gfa2 1_unitig.fa $k
./abyss_scaffolding.sh 6_contigs.fa 6_contigs.gfa $k
# Visualize the assembly graph
Bandage load 9_assembly.gfa1 &

Components

Download test data

components/download_test_data.sh

# Download Unicycler test data
# input: nothing
# output: 0_pe.fq.gz

#install dependencies
brew install seqtk curl

seqtk mergepe <(curl -Ls https://github.com/rrwick/Unicycler/raw/master/sample_data/short_reads_1.fastq.gz) <(curl -Ls https://github.com/rrwick/Unicycler/raw/master/sample_data/short_reads_2.fastq.gz) | gzip >0_pe.fq.gz

From reads to unitigs with BCALM

components/bcalm.sh

# input: [reads] [k]
# e.g. 0_pe.fq.gz 99
#
# output: 1_unitig.gfa2
# contains unitigs created by BCALM

# make sure bcalm and convertToGFA.py are in your path (will later be automatically done by 'conda install bcalm' when someone makes bcalm availaon conda)

reads=$1
k=$2

# Install the dependencies
pip install gfapy
# Unitig with BCALM
bcalm -in $1 -out 1_bcalm -kmer-size $k -abundance-min 1 -verbose 0
mv 1_bcalm.unitigs.fa 1_unitig.fa
convertToGFA.py 1_unitig.fa 1_unitig.gfa $k
# convert bcalm output to gfa2
(printf "H\tVN:Z:1.0\n"; tail -n +2 1_unitig.gfa) >1_unitig.gfa1
gfapy-convert 1_unitig.gfa1 > 1_unitig.gfa2

# cleanup
rm -f 1_bcalm*glue* 1_bcalm.h5

From reads to unitigs with ABySS

components/abyss_unitigs.sh

# input: [reads] [k]
# e.g. 0_pe.fq.gz 99
#
# output: 1_unitig.gfa2 1_unitig.fa
# contains unitigs created by abyss

reads=$1
k=$2

# setting up
brew install abyss

# Unitig
gunzip -c $reads | ABYSS -k$k -t0 -c0 -b0 -o 1_unitig.fa -
AdjList --gfa2 -k$k 1_unitig.fa >1_unitig.gfa 
mv 1_unitig.gfa 1_unitig.gfa2

From unitigs to contigs with ABySS

components/abyss_contigs_from_unitigs.sh

# input: [unitigs.gfa2] [unitigs.fa] [k]
# e.g. 1_unitig.gfa2 1_unitigs.fa 100
#
# output: 6_contigs.gfa2 and 6_contigs.fa
# contigs produced by ABySS

unitigs_gfa2=$1
unitigs_fa=$2
k=$3

# install dependencies
brew install abyss pigz samtools

# Denoise
abyss-filtergraph --gfa2 -k$k -t200 -c3 -g 2_denoise.gfa $unitigs_gfa2
# Collapse variants
PopBubbles --gfa2 -k$k -p0.99 -g 3_debulge.gfa $unitigs_fa 2_denoise.gfa >3_debulge.path
MergeContigs --gfa2 -k$k -g 3_debulge.gfa -o 3_debulge.fa $unitigs_fa 2_denoise.gfa 3_debulge.path
# Map reads
gunzip -c 0_pe.fq.gz | abyss-map - 3_debulge.fa | pigz >3_debulge.sam.gz
# Link unitigs
gunzip -c 3_debulge.sam.gz | abyss-fixmate -h 4_link.tsv | samtools sort -Osam | DistanceEst --dist -k$k -s500 -n1 4_link.tsv >4_link.dist
# Resolve repeats
samtools faidx 3_debulge.fa
SimpleGraph -k$k -s500 -n5 -o 5_resolverepeats-1.path 3_debulge.gfa 4_link.dist
MergePaths -k$k -o 5_resolverepeats.path 3_debulge.fa.fai 5_resolverepeats-1.path
# Contract paths
MergeContigs --gfa2 -k$k -g 6_contigs.gfa2 -o 6_contigs.fa 3_debulge.fa 3_debulge.gfa 5_resolverepeats.path

# cleanup (comment to keep files)
rm -f 5_resolverepeats.path 3_debulge.gfa 3_debulge.fa 3_debulge.fa.fai 3_debulge.path 5_resolverepeats-1.path 4_link.dist 4_link.tsv 2_denoise.gfa 3_debulge.sam.gz 

From unitigs to contigs with gfaview

components/gfaview_contigs_from_unitigs.sh

# input: [unitigs_gfa1] [unitigs_gfa2] [unitigs_fa] [k] 
# e.g. 1_unitig.gfa1

# output: 6_contigs.gfa 6_contigs.fa
# 'contigs' generated by popping bubbles and removing tips using gfa1 tool
# https://github.com/lh3/gfa1 

unitigs_gfa1=$1
unitigs_gfa2=$2
unitigs_fa=$3
k=$4

# remove tips (2 rounds), misc trimming, pop bubbles, 
# but then cannot use -u to build unitigs of simplified graph (i.e. contigs), because it doesn't output sequences
gfaview -t -m -t -b $unitigs_gfa1 > 2_denoised.gfa1

# make the output of gfaview properly gfa1 and gfa2
sed -i 's/L1/l1/g' 2_denoised.gfa1 # gfapy wants lower case flags
sed -i 's/L2/l2/g' 2_denoised.gfa1
sort -n -k 2 2_denoised.gfa1 > 2_denoised.gfa1.sorted # because ABySS GFA parser wants S lines in same order as in FASTA file
head -n 1 $unitigs_gfa2 > 2_denoised.gfa2 # copy header as gfaview doesnt 
gfapy-convert 2_denoised.gfa1.sorted >> 2_denoised.gfa2

# would like to use use rgfa to do compaction but itdoesnt work on the Singella example
#./rgfa-mergelinear 2_denoised.gfa1 > 6_contigs.gfa

# convert gfa to fasta
awk '$1 ~/S/ {print ">"$2"\n"$3}' 2_denoised.gfa1.sorted > 2_denoised.fa

# just contract paths in the 2_denoised.gfa2, using ABySS as wasn't able to do it with neither gfaview nor rgfa-mergelinear
abyss-filtergraph --gfa2 -k$k --assemble 2_denoised.gfa2 2_denoised.fa -g 3_denoised.gfa2 > 3_denoised.paths
MergeContigs --gfa2 -k$k 2_denoised.fa 3_denoised.gfa2 3_denoised.paths -o 6_contigs.fa -g 6_contigs.gfa2 

# cleanup (comment to keep files)
rm -f 2_denoised.gfa1.sorted 2_denoise.gfa1 2_denoised.gfa1 3_denoised.paths 3_denoised.gfa2

From contigs to scaffolds with ABySS

components/abyss_scaffolding.sh

# input: [contigs.fa] [contigs.gfa] [k]
# e.g. 6_contigs.fa 6_contigs.gfa 99
# 
# output: 9_assembly.gfa and 9_assembly.gfa1
# scaffolds assembly using abyss scaffolder in GFA2 and GFA1 format

contigs_fa=$1
contigs_gfa=$2
k=$3

# Map reads
gunzip -c 0_pe.fq.gz | abyss-map - $contigs_fa | pigz >6_contigs.sam.gz
# Link unitigs
gunzip -c 6_contigs.sam.gz | abyss-fixmate -h 7_link.tsv | samtools sort -Osam | DistanceEst --dot -k$k -s500 -n1 7_link.tsv >7_link.gv
# Order and orient
abyss-scaffold -k$k -s500-1000 -n5-10 $contigs_gfa 7_link.gv >8_scaffold.path
# Contract paths
MergeContigs --gfa2 -k$k -g 9_assembly.gfa -o 9_assembly.fa $contigs_fa $contigs_gfa 8_scaffold.path
# Compute assembly metrics
abyss-fac 9_assembly.fa
# Convert GFA2 to GFA1
abyss-todot --gfa1 9_assembly.gfa >9_assembly.gfa1

# cleanup (comment to remove)
rm -f 6_contigs.sam.gz 7_link.gv 7_link.tsv 8_scaffold.path 8_scaffold.path