Low-Input Nanopore Sequencing and CarrierSeq
Update 2/9/21: This protocol may be outdated or obsolete given recent advances by Oxford Nanopore Technologies.
Long-read nanopore sequencing technology is of particular significance for taxonomic identification at or below the species level. For many environmental samples, the total extractable DNA is far below the current input requirements of nanopore sequencing, preventing “sample to sequence” metagenomics from low-biomass or recalcitrant samples.
We address this problem with carrier sequencing, a method that sequences low-input DNA by preparing the target DNA together with a genomic carrier, achieving ideal library preparation and sequencing stoichiometry without amplification. We then use CarrierSeq, a sequence analysis workflow, to separate the low-input target reads from the genomic carrier reads.
CarrierSeq has been tested experimentally by analyzing sequences from a combination of 0.2 ng Bacillus subtilis ATCC 6633 DNA in a background of 1000 ng Enterobacteria phage λ DNA. After filtering out carrier, low-quality, and low-complexity reads, we detected target reads (B. subtilis), contamination reads, and 'high-quality noise reads' (HQNRs) that do not map to the carrier, the target, or known lab contaminants. These reads appear to be artifacts of the nanopore sequencing process, as they are associated with specific channels (pores).
Update: In Mojarro et al. (2019), Astrobiology, we were able to detect 5 B. subtilis reads from approximately 2 pg of input DNA.
Methods
CarrierSeq uses bwa-mem (Li, 2013) to first map all reads to the genomic carrier, then extracts the unmapped reads with samtools (Li et al., 2009) and seqtk (Li, 2012). Thereafter, the user can define a quality score threshold, and CarrierSeq proceeds to discard low-quality reads and, with fqtrim (Pertea, 2015), low-complexity reads. This set of unmapped and filtered reads is labeled 'reads of interest' and should theoretically comprise target reads and likely contamination. However, reads of interest may also include 'high-quality noise reads' (HQNRs), defined as reads that satisfy the quality score and complexity filters yet do not match any database and disproportionately originate from specific channels. By treating reads as a Poisson arrival process, CarrierSeq models the expected distribution of reads of interest across channels and rejects data from channels exceeding a reads/channel threshold (xcrit). Reads of interest are then sorted into 08_target_reads (reads/channel ≤ xcrit) or 07_hqnrs (reads/channel > xcrit).
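The channel threshold described above can be illustrated with a minimal sketch. It assumes, and this is not confirmed by the text, that the Poisson rate is the number of reads of interest divided by the number of channels, and that xcrit is the smallest count whose cumulative Poisson probability reaches 1 - p. The function names `poisson_xcrit` and `split_channels` and the toy channel counts are hypothetical; the CarrierSeq repository remains the reference for the actual calculation.

```python
import math


def poisson_xcrit(reads_of_interest, n_channels=512, p_value=0.05):
    """Smallest x with P(X <= x) >= 1 - p_value for X ~ Poisson(lam).

    Assumption: lam = reads_of_interest / n_channels (mean reads per channel).
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    lam = reads_of_interest / n_channels
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    x = 0
    while cdf < 1.0 - p_value:
        x += 1
        term *= lam / x  # P(X = x) from P(X = x - 1)
        cdf += term
    return x


def split_channels(reads_per_channel, xcrit):
    """Split channels into target-like (<= xcrit) and HQNR-like (> xcrit)."""
    target = {ch: n for ch, n in reads_per_channel.items() if n <= xcrit}
    hqnr = {ch: n for ch, n in reads_per_channel.items() if n > xcrit}
    return target, hqnr


# Toy per-channel counts (hypothetical), thresholded with the example totals.
xcrit = poisson_xcrit(1811, n_channels=512, p_value=0.05)
target, hqnr = split_channels({1: 3, 2: 191, 3: 7, 4: 8}, xcrit)
print(xcrit, sorted(target), sorted(hqnr))
```

Under these assumptions, the totals reported later in this document (1,811 reads of interest, 512 channels, p = 0.05) give a threshold of 7 reads/channel, which agrees with the reported xcrit.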
Mojarro et al. (2017) BioRxiv, now published in BMC Bioinformatics.
Low-Input Library Protocol
Following the 1D Lambda Control Experiment for the MinION SQK-LSK108, simply replace the 5 µL of DNA CS 3.6 kb (the positive control) with your low-input sample. Then proceed without any special handling.
CarrierSeq Implementation
Full CarrierSeq documentation is available at https://github.com/amojarro/carrierseq
Reads to be analyzed must be compiled into a single fastq file and the carrier reference genome must be in fasta format. Run CarrierSeq with:
./carrierseq.sh -i <input.fastq> -r <reference.fasta> -q <q_score> -p <p_value> -o <output_directory> -t <bwa_threads>
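For scripted runs, the command above could be assembled programmatically. The sketch below only builds the argument list from the flags shown in this document; the helper name `build_carrierseq_command`, the file names, and the thread count are hypothetical, and the q-score and p-value defaults are simply the example values used in this document.

```python
import shlex


def build_carrierseq_command(input_fastq, reference_fasta, output_dir,
                             q_score=9, p_value=0.05, bwa_threads=1,
                             script="./carrierseq.sh"):
    """Return the carrierseq.sh invocation as an argument list."""
    return [
        script,
        "-i", str(input_fastq),      # single compiled FASTQ file
        "-r", str(reference_fasta),  # carrier reference genome (FASTA)
        "-q", str(q_score),
        "-p", str(p_value),
        "-o", str(output_dir),
        "-t", str(bwa_threads),
    ]


# Hypothetical file names, for illustration only.
cmd = build_carrierseq_command("all_reads.fastq", "lambda.fasta",
                               "carrierseq_out", bwa_threads=4)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing an argument list rather than a single string avoids shell quoting problems when paths contain spaces.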
Library and CarrierSeq Example
Library Preparation: 0.2 ng of B. subtilis DNA was prepared with 1000 ng of Lambda DNA using the Oxford Nanopore Technologies (ONT) ligation sequencing kit (SQK-LSK108). The library was then sequenced on a MinION Mark-1B sequencer with an R9.4 flowcell for 48 hours and basecalled using ONT’s Albacore (v1.10) offline basecaller.
CarrierSeq Parameters: q-score = 9 (default) and p-value = 0.05.
Example Data: https://doi.org/10.6084/m9.figshare.5868825.v1
Sequencing and CarrierSeq Summary
At Q9, the expected B. subtilis abundance is 590 reads (obtained by mapping all 'reads of interest' directly to the B. subtilis reference genomes) for the example data. The xcrit value was calculated to be 7 reads/channel.
All Reads (Lambda + B. subtilis + Contamination + Noise)
Total Reads: 717,432 reads
Total Bases: 6.4 gigabases
Reads of Interest (B. subtilis + Contamination + HQNRs)
Total Reads: 1,811 reads
Total Bases: 8,132,374 bases
HQNRs
Total Reads: 1,179 reads (including 17 false negative B. subtilis reads)
Total Bases: 7,282,767 bases
Target Reads
Total Reads: 632 reads (including 574 true positive B. subtilis reads, 4 true positive contamination reads, and 54 false positive HQNRs)
Total Bases: 849,607 bases
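The counts above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported in this summary; the variable names and the choice of derived ratios (the fraction of identified B. subtilis reads retained, and the B. subtilis fraction of the target bin) are illustrative and are not outputs of CarrierSeq itself.

```python
# Read counts as reported in the summary above.
reads_of_interest = 1811
hqnr_bin = 1179          # includes 17 false negative B. subtilis reads
target_bin = 632
true_pos_bsub = 574      # B. subtilis reads kept in the target bin
true_pos_contam = 4      # contamination reads kept in the target bin
false_pos_hqnr = 54      # HQNRs that slipped into the target bin
false_neg_bsub = 17      # B. subtilis reads lost to the HQNR bin

# Internal consistency of the two bins.
bins_add_up = hqnr_bin + target_bin == reads_of_interest
target_adds_up = true_pos_bsub + true_pos_contam + false_pos_hqnr == target_bin

# Fraction of identified B. subtilis reads that ended up in the target bin.
bsub_retained = true_pos_bsub / (true_pos_bsub + false_neg_bsub)
# Fraction of the target bin that is B. subtilis.
bsub_fraction_of_target = true_pos_bsub / target_bin

print(bins_add_up, target_adds_up)
print(f"B. subtilis retained: {bsub_retained:.3f}")
print(f"B. subtilis share of target bin: {bsub_fraction_of_target:.3f}")
```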
ROI Pore Occupancy
The matrix below illustrates the reads/channel distribution of B. subtilis, contamination, and HQNRs among the 'reads of interest' across all 512 nanopore channels. Here we are able to visually identify overly productive channels (e.g., 191 reads/channel) that likely produce HQNRs.
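A reads/channel tally of this kind could be produced with a short script. The sketch below assumes that each FASTQ record spans exactly four lines and that the header carries a `ch=<number>` field, as in Albacore-style headers; the function name `reads_per_channel` and the demo records are hypothetical.

```python
import io
import re
from collections import Counter

# Assumption: headers contain a field such as "ch=191".
CH_FIELD = re.compile(r"\bch=(\d+)")


def reads_per_channel(fastq_handle):
    """Count reads per channel from four-line FASTQ records."""
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 0:  # header line of each record
            match = CH_FIELD.search(line)
            if match:
                counts[int(match.group(1))] += 1
    return counts


# Hypothetical records, for illustration only.
demo = io.StringIO(
    "@read1 runid=x read=10 ch=191 start_time=t\nACGT\n+\n!!!!\n"
    "@read2 runid=x read=11 ch=191 start_time=t\nACGT\n+\n@@@@\n"
    "@read3 runid=x read=12 ch=42 start_time=t\nACGT\n+\n!!!!\n"
)
counts = reads_per_channel(demo)
print(counts.most_common())
```

Indexing by line position rather than by a leading `@` matters because a quality string may itself begin with `@`.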
HQNR Pore Occupancy
Bad channels identified by CarrierSeq as HQNR-associated (reads/channel > 7).
Target Reads Pore Occupancy
Good channels identified by CarrierSeq as non-HQNR-associated (reads/channel ≤ 7). By imposing a stricter p-value, CarrierSeq may be able to reject more HQNRs (e.g., xcrit = 5).