Implementation
Quick start
Suppose your sequence library looks like this:
Run a standard pre-processing pipeline using NanoPrePro as follows:
# --p5_sense: 5' adapter/primer sequence (sense strand; 5' to 3')
# --p3_sense: 3' adapter/primer sequence (sense strand; 5' to 3')
nanoprepro \
--input_fq input.fq \
--p5_sense ATCGATCG \
--p3_sense A{20}GCAATGA \
--beta 0.2 \
--output_full_length output.fq \
--trim_adapter \
--trim_poly \
--orientation 1 \
--filter_lowq 7 \
--report report.html
This command performs the following pre-processing steps and
generates a report file (report.html):
--beta 0.2: performs \(F_{\beta=0.2}\) optimization for adapter/primer alignment cutoffs (see Step 1).
--output_full_length output.fq: identifies full-length reads (see Step 2).
--trim_adapter: trims adapter/primer sequences (see Step 3).
--trim_poly: trims poly(A/T) sequences (see Step 4).
--orientation 1: reorients reads to the sense strand (see Step 5).
--filter_lowq 7: filters out low-quality reads (average Q-score < 7) (see Step 6).
Pre-processing pipeline
Step 1. \(F_{\beta}\) optimization
NanoPrePro optimizes adapter/primer alignment cutoffs by:
Simulating both true and random alignments.
Identifying cutoff values that best separate true from random alignments.
First, the adapter/primer sequences provided by the user (--p5_sense ATCGATCG and --p3_sense A{20}GCAATGA) are aligned twice to each read.
NanoPrePro then searches for the alignment cutoffs that maximize the \(F_{\beta}\) score
(--beta <float>), the weighted harmonic mean of precision and recall:
\(\text{Precision} = \frac{\text{true alignments that pass the cutoffs}}{\text{true alignments that pass the cutoffs} + \text{random alignments that pass the cutoffs}}\)
\(\text{Recall} = \frac{\text{true alignments that pass the cutoffs}}{\text{all true alignments}}\)
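The \(F_{\beta}\) score is calculated as:
\(F_{\beta} = (1 + \beta^{2}) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^{2} \cdot \text{Precision} + \text{Recall}}\)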
The \(\beta\) parameter controls the weighting of precision versus recall:
Higher \(\beta\) values emphasize recall.
Lower \(\beta\) values emphasize precision.
The alignment cutoff values achieving the highest \(F_{\beta}\) score are used for adapter/primer identification.
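The cutoff search described above can be illustrated with a minimal Python sketch. It is not NanoPrePro's implementation: it assumes each alignment is summarized by a single score and that an alignment passes when its score is at or above the cutoff, whereas NanoPrePro may optimize several cutoffs jointly. The function names f_beta and best_cutoff and the toy scores are hypothetical.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def best_cutoff(true_scores, random_scores, beta=0.2):
    """Return (cutoff, F_beta) for the score cutoff that maximizes F_beta.

    Assumption: an alignment passes when its score is >= cutoff.
    """
    best = (None, -1.0)
    # Every observed score is a candidate cutoff.
    for cutoff in sorted(set(true_scores) | set(random_scores)):
        tp = sum(s >= cutoff for s in true_scores)    # true alignments that pass
        fp = sum(s >= cutoff for s in random_scores)  # random alignments that pass
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / len(true_scores)
        score = f_beta(precision, recall, beta)
        if score > best[1]:
            best = (cutoff, score)
    return best


# Toy alignment scores (hypothetical values, for illustration only).
true_scores = [0.95, 0.90, 0.85, 0.80, 0.60]
random_scores = [0.65, 0.55, 0.50, 0.45, 0.40]

strict_cutoff, strict_f = best_cutoff(true_scores, random_scores, beta=0.2)
lenient_cutoff, lenient_f = best_cutoff(true_scores, random_scores, beta=2.0)
```

With these toy scores, \(\beta = 0.2\) selects the stricter cutoff 0.80, which excludes every random alignment at the cost of one true alignment, while \(\beta = 2\) selects 0.60, which keeps all true alignments and admits one random alignment.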
Note
A{20} indicates that up to 20 consecutive A nucleotides
may occur adjacent to the 3′ adapter/primer. These bases are NOT used
for alignment. See Poly A/T trimming for details.
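As an illustration of how a pattern such as A{20}GCAATGA separates into a poly(A/T) specification and the bases used for alignment, here is a small Python sketch. The parser, its name parse_adapter_pattern, and the returned keys are hypothetical and do not reflect NanoPrePro's internal code.

```python
import re

# One nucleotide followed by a maximum length in braces, e.g. "A{20}".
POLY_TOKEN = re.compile(r"([ACGT])\{(\d+)\}")


def parse_adapter_pattern(pattern):
    """Split an adapter pattern into the aligned bases and the poly(A/T) part."""
    match = POLY_TOKEN.search(pattern)
    if match is None:
        return {"adapter": pattern, "poly_base": None,
                "poly_max": 0, "poly_side": None}
    # Bases outside the N{M} token are the ones used for alignment.
    adapter = pattern[:match.start()] + pattern[match.end():]
    side = "before" if match.start() == 0 else "after"
    return {"adapter": adapter, "poly_base": match.group(1),
            "poly_max": int(match.group(2)), "poly_side": side}


parsed = parse_adapter_pattern("A{20}GCAATGA")
```

Here parsed["adapter"] is GCAATGA, and the poly(A) stretch of up to 20 bases is recorded as lying before the adapter in the sense-strand pattern.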
Step 2. Full-length/truncated/chimeric read classification
Reads are classified into three categories based on adapter/primer alignment results:
Full-length: 5′ and 3′ adapters/primers present, no internal adapters/primers.
Chimeric: contains internal adapter/primer sequences.
Truncated: not chimeric and not full-length.
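The three categories can be expressed as a simple decision rule. The Python sketch below assumes the alignment step has already produced three boolean flags per read; the function and argument names are hypothetical.

```python
def classify_read(has_p5, has_p3, has_internal):
    """Classify a read from its adapter/primer alignment results."""
    if has_internal:
        # Any internal adapter/primer marks the read as chimeric.
        return "chimeric"
    if has_p5 and has_p3:
        return "full_length"
    # Not chimeric and not full-length.
    return "truncated"


labels = [
    classify_read(True, True, False),   # both terminal adapters, none internal
    classify_read(True, False, False),  # 3' adapter missing
    classify_read(True, True, True),    # internal adapter present
]
```

Checking the chimeric condition first means a read with both terminal adapters and an internal adapter is still reported as chimeric, matching the definitions above.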
Output files for each read type can be specified using:
Full-length: --output_full_length (default: standard output).
Chimeric: --output_fusion.
Truncated: --output_truncated.
Step 3. Adapter/primer trimming
This step is activated with --trim_adapter.
It trims adapter/primer sequences from the output reads.
Note
Trimming is applied to all requested output reads, regardless of read type.
Step 4. Poly(A/T) trimming
This step is activated with --trim_poly.
The expected length, location, and nucleotide of homopolymers are specified together with the adapter/primer sequence.
Use a pattern of the form N{M}, where N is the nucleotide and M is the maximum length, to specify the location and length of poly(A/T) tails.
For example, this option tells NanoPrePro that poly(A) tails of up to 20 nucleotides are adjacent to the 3′ adapters/primers:
--p3_sense A{20}GCAATGA
NanoPrePro then uses a sliding-window approach to identify and trim poly(A/T) sequences.
The window size is set by --poly_w <int> (default: 6).
The minimum number of A or T bases in the window is set by --poly_k <int> (default: 4).
The length of the poly(A/T) tail is recorded in the ID line of each read (see Output Documentation).
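A sliding-window trimmer of this kind might look like the following Python sketch. It is an illustration under stated assumptions, not NanoPrePro's algorithm: it handles only a tail at the 3′ end of an adapter-trimmed read, moves the window one base at a time, and does not model the maximum length given in the N{M} pattern. The function name trim_poly_tail is hypothetical; w and k mirror --poly_w and --poly_k.

```python
def trim_poly_tail(seq, base="A", w=6, k=4):
    """Trim a poly(base) tail from the 3' end of seq.

    Returns (trimmed_seq, tail_length).
    """
    end = len(seq)
    start = end  # the tail occupies seq[start:end]
    pos = end - w
    # Slide the window inwards while it still holds at least k tail bases.
    while pos >= 0 and seq[pos:pos + w].count(base) >= k:
        start = pos
        pos -= 1
    # The innermost window may begin with non-tail bases; give them back.
    while start < end and seq[start] != base:
        start += 1
    return seq[:start], end - start


# A read body followed by a 12-nt poly(A) tail containing one miscalled G.
trimmed, tail_len = trim_poly_tail("GATTACAGGC" + "AAAAAGAAAAAA")
untouched, no_tail = trim_poly_tail("GATTACAGGC")
```

Requiring only k of w bases to match lets the window step across isolated sequencing errors inside the tail, which is why the miscalled G in the example does not stop the trimming.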
Note
Poly(A/T) trimming is applicable only if adapters/primers are trimmed. Similar to adapter/primer trimming, this step can be performed on all classes of output reads.
Step 5. Read reorientation
Read strands are determined based on the orientation of aligned adapters/primers.
Adapter/primer sequences should be provided in the sense direction (--p5_sense , --p3_sense).
Reads are classified as antisense if their adapters/primers align in the antisense direction.
Reorientation can be performed using --orientation 1/-1/0:
1: reorient reads to the sense strand
-1: reorient reads to the antisense strand
0: do not reorient
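Reorienting a read amounts to reverse-complementing its sequence and reversing its quality string whenever the detected strand differs from the requested one. The Python sketch below assumes the strand has already been inferred from the adapter/primer alignments and is encoded as 1 (sense) or -1 (antisense); the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reorient(seq, qual, strand, orientation=1):
    """Flip a read so that it matches the requested orientation.

    strand: 1 if the read was detected as sense, -1 if antisense.
    orientation: 1 (sense), -1 (antisense), or 0 (leave unchanged).
    """
    if orientation == 0 or strand == orientation:
        return seq, qual
    # Quality values must follow their bases, so the string is reversed too.
    return reverse_complement(seq), qual[::-1]


flipped = reorient("AACG", "!#%&", strand=-1, orientation=1)
kept = reorient("AACG", "!#%&", strand=-1, orientation=0)
```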
Step 6. Filtering low-quality reads
Average Q-scores are calculated after trimming adapter/primer and poly(A/T) sequences (if applied).
Trimming removes low-quality regions at read termini, providing a more accurate measure of read quality.
The threshold for filtering low-quality reads can be set with --filter_lowq <int>.
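The following Python sketch shows one common way to compute an average Q-score from a FASTQ quality string and apply a threshold. It assumes Phred+33 encoding and averages the per-base error probabilities before converting back to a Q-score; whether NanoPrePro averages error probabilities or raw Q-scores is not stated here, so treat that choice as an assumption. The function names are hypothetical.

```python
import math


def mean_qscore(quality, offset=33):
    """Average Q-score of a read, computed via mean error probability."""
    probs = [10 ** (-(ord(char) - offset) / 10) for char in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filter(quality, threshold=7):
    """Keep reads whose average Q-score is at or above the threshold."""
    return mean_qscore(quality) >= threshold


uniform_q20 = mean_qscore("5555")  # every base has Q = 20
mixed = mean_qscore("5$")          # one base at Q = 20, one at Q = 3
```

Averaging error probabilities penalizes low-quality bases more than a plain arithmetic mean of Q-scores would: the mixed example has an arithmetic mean of 11.5 but a probability-based average below 7, so it would be removed with --filter_lowq 7 under this assumption.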
Step 7. Output
NanoPrePro produces:
FASTQ: processed reads
HTML report: summary of pre-processing statistics
FASTQ Files
Processed reads are saved separately for full-length, truncated, and chimeric reads.
Output file names can be assigned with --output_full_length, --output_truncated, and --output_fusion.
Note
Gzip-compressed FASTQ files are supported. For example:
--output_full_length output.fq.gz
Per-read annotations are appended to FASTQ read IDs. See Output Documentation for details.
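For downstream scripts that consume these outputs, a reader that accepts both plain and gzip-compressed FASTQ can be sketched in Python as follows. It assumes four-line FASTQ records and uses only the standard library; the function names are hypothetical, and the full ID line, including any appended annotations, is returned unparsed.

```python
import gzip


def open_fastq(path, mode="rt"):
    """Open a FASTQ file as text, using gzip when the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def read_fastq(path):
    """Yield (id_line, sequence, quality) for each four-line FASTQ record."""
    with open_fastq(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break  # end of file
            seq = handle.readline().rstrip()
            handle.readline()  # separator line starting with "+"
            qual = handle.readline().rstrip()
            yield header[1:], seq, qual
```

For example, iterating over read_fastq("output.fq.gz") would yield one tuple per processed read.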
HTML Report
Written to the file specified by --report.
The report includes Q-score distributions, the proportion of full-length/truncated/chimeric reads, and adapter/primer alignment results from \(F_{\beta}\) optimization.
The simulated alignment results can also help users pick cutoffs manually. See Output Documentation for guidelines on selecting alignment cutoffs from the simulated alignment data.