Implementation

Quick start

Suppose your sequence library looks like this:

[Figure: library construction]

Run a standard pre-processing pipeline using NanoPrePro as follows:

# --p5_sense: 5' adapter/primer sequence (sense strand; 5' to 3')
# --p3_sense: 3' adapter/primer sequence (sense strand; 5' to 3')
nanoprepro \
   --input_fq input.fq \
   --p5_sense ATCGATCG \
   --p3_sense A{20}GCAATGA \
   --beta 0.2 \
   --output_full_length output.fq \
   --trim_adapter \
   --trim_poly \
   --orientation 1 \
   --filter_lowq 7 \
   --report report.html

This command performs the following pre-processing steps and generates a report file (report.html):

  1. --beta 0.2: performs \(F_{\beta=0.2}\) optimization for adapter/primer alignment cutoffs (see Step 1).

  2. --output_full_length output.fq: identifies full-length reads (see Step 2).

  3. --trim_adapter: trims adapter/primer sequences (see Step 3).

  4. --trim_poly: trims poly(A/T) sequences (see Step 4).

  5. --orientation 1: reorients reads to sense strand (see Step 5).

  6. --filter_lowq 7: filters low-quality (avg. Q-score < 7) reads (see Step 6).

Pre-processing pipeline

Step 1. \(F_{\beta}\) optimization

NanoPrePro optimizes adapter/primer alignment cutoffs by:

  1. Simulating both true and random alignments.

  2. Identifying cutoff values that best separate true from random alignments.

First, the adapter/primer sequences provided by the user (--p5_sense ATCGATCG and --p3_sense A{20}GCAATGA) are aligned twice to each read.

NanoPrePro then searches for the alignment cutoffs that maximize the \(F_{\beta}\) score (--beta <float>), the weighted harmonic mean of precision and recall:

\(\text{Precision} = \frac{\text{true alignments that pass the cutoffs}}{\text{true alignments that pass the cutoffs} + \text{random alignments that pass the cutoffs}}\)

\(\text{Recall} = \frac{\text{true alignments that pass the cutoffs}}{\text{all true alignments}}\)

The \(F_{\beta}\) score is calculated as:

\[F_{\beta} = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}} {(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}\]

The \(\beta\) parameter controls the weighting of precision versus recall:

  • Higher \(\beta\) values emphasize recall.

  • Lower \(\beta\) values emphasize precision.

The alignment cutoff values achieving the highest \(F_{\beta}\) score are used for adapter/primer identification.
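The cutoff search described above can be illustrated with a minimal sketch. It assumes a single score per alignment and that an alignment passes when its score is at least the cutoff; the function names and the toy scores are hypothetical and are not NanoPrePro's internal API.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_cutoff(true_scores, random_scores, beta):
    """Return (cutoff, F_beta) for the candidate cutoff with the highest F_beta.

    Assumption: an alignment passes when its score is >= cutoff.
    """
    best = (None, -1.0)
    for cutoff in sorted(set(true_scores) | set(random_scores)):
        tp = sum(s >= cutoff for s in true_scores)    # true alignments that pass
        fp = sum(s >= cutoff for s in random_scores)  # random alignments that pass
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / len(true_scores)
        score = f_beta(precision, recall, beta)
        if score > best[1]:
            best = (cutoff, score)
    return best


# Toy alignment scores (made up for illustration).
true_scores = [12, 11, 10, 9, 8, 5]
random_scores = [6, 5, 4, 4, 3, 2]
print(best_cutoff(true_scores, random_scores, beta=0.2))
print(best_cutoff(true_scores, random_scores, beta=2.0))
```

With these toy scores, \(\beta = 0.2\) selects the stricter cutoff (8), which excludes every random alignment, while \(\beta = 2.0\) selects the looser cutoff (5), which keeps every true alignment.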

Note

A{20} indicates that up to 20 consecutive A nucleotides may occur adjacent to the 3′ adapter/primer. These bases are NOT used for alignment. See Step 4 (Poly(A/T) trimming) for details.
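As an illustration of the notation in this note, the homopolymer part can be separated from the sequence used for alignment with a regular expression. This is a sketch only: parse_adapter and its return format are hypothetical, and it assumes at most one N{M} block per adapter/primer.

```python
import re

# One nucleotide followed by a repeat count in braces, e.g. A{20}.
POLY_PATTERN = re.compile(r"([ACGT])\{(\d+)\}")


def parse_adapter(spec):
    """Split a spec such as 'A{20}GCAATGA' into its alignment and poly parts.

    Returns (adapter_sequence, poly_info), where poly_info is None or a
    tuple (nucleotide, max_length, side). 'side' says whether the
    homopolymer sits before or after the adapter in the given string.
    """
    match = POLY_PATTERN.search(spec)
    if match is None:
        return spec, None
    adapter = spec[:match.start()] + spec[match.end():]
    side = "before" if match.start() == 0 else "after"
    return adapter, (match.group(1), int(match.group(2)), side)


print(parse_adapter("A{20}GCAATGA"))
print(parse_adapter("ATCGATCG"))
```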

Step 2. Full-length/truncated/chimeric read classification

Reads are classified into three categories based on adapter/primer alignment results:

  • Full-length: 5’ and 3’ adapter/primer present, no internal adapters/primers.

  • Chimeric: contains internal adapter/primer sequences.

  • Truncated: not chimeric and not full-length.
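The three categories above amount to a small decision rule. The sketch below assumes that adapter/primer detection has already been reduced to three booleans; the function and argument names are hypothetical.

```python
def classify_read(has_5p, has_3p, has_internal):
    """Classify a read from its adapter/primer alignment results.

    has_5p / has_3p: adapter/primer found at the 5' / 3' end of the read.
    has_internal:    adapter/primer found inside the read.
    """
    if has_internal:
        return "chimeric"
    if has_5p and has_3p:
        return "full-length"
    return "truncated"


print(classify_read(True, True, False))   # full-length
print(classify_read(True, True, True))    # chimeric
print(classify_read(True, False, False))  # truncated
```

Checking for internal adapters/primers first ensures that a read carrying both terminal adapters plus an internal one is reported as chimeric rather than full-length.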

Output files for each read type can be specified using:

  • Full-length: --output_full_length (defaults to standard output).

  • Chimeric: --output_fusion.

  • Truncated: --output_truncated.

Step 3. Adapter/Primer trimming

This step is activated with --trim_adapter. It trims adapter/primer sequences from the output reads.

Note

Trimming is applied to all requested output reads, regardless of read type.

Step 4. Poly(A/T) trimming

This step is activated with --trim_poly. The expected length, location, and nucleotide of the homopolymer are specified together with the adapter/primer sequence.

Use a pattern like N{M} to specify the location and length of poly(A/T) tails, where N is the nucleotide and M is the maximum length. For example, the following option tells NanoPrePro that a poly(A) tail of up to 20 nucleotides lies adjacent to the 3′ adapter/primer:

--p3_sense A{20}GCAATGA

NanoPrePro then uses a sliding-window approach to identify and trim poly(A/T) sequences. The window size is set by --poly_w <int> (default: 6). The minimum number of A or T bases in the window is set by --poly_k <int> (default: 4). The length of each poly(A/T) tail is recorded in the ID line of the read (see Output Documentation).
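One way to picture a sliding-window rule of this kind is sketched below. The exact procedure NanoPrePro uses is not described here, so the stepping rule, the max_len cap, and the function name are assumptions; only the roles of the window size (w) and the minimum base count (k) follow the text above.

```python
def trim_poly_tail(seq, base="A", w=6, k=4, max_len=20):
    """Trim a poly(base) tail from the 3' end of seq.

    Returns (trimmed_sequence, tail_length). Assumed rule: a window of w
    bases slides from the 3' end toward the 5' end, and the tail keeps
    growing while the window holds at least k copies of base.
    """
    n = len(seq)
    tail = 0
    for offset in range(0, n - w + 1):
        start = n - offset - w
        if seq[start:n - offset].count(base) < k:
            break
        tail = offset + w
    # Assumption: the M in N{M} caps the tail length.
    tail = min(tail, max_len)
    # Drop non-base characters at the 5' edge so the tail starts with base.
    while tail > 0 and seq[n - tail] != base:
        tail -= 1
    return seq[:n - tail], tail


print(trim_poly_tail("GGCTCCGT" + "A" * 10))
print(trim_poly_tail("GGCTCCGT" + "AAAAGAAAAA"))  # tolerates one miscalled base
```

Requiring only k of w bases to match lets the tail extend across occasional sequencing errors, at the cost of sometimes absorbing a few A-rich bases next to the tail.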

Note

Poly(A/T) trimming applies only when adapters/primers are also trimmed. Like adapter/primer trimming, it can be performed on all classes of output reads.

Step 5. Read reorientation

Read strands are determined from the orientation of the aligned adapters/primers. Adapter/primer sequences should be provided in the sense direction (--p5_sense, --p3_sense). A read is classified as antisense if its adapters/primers align in the antisense direction.

Reorientation can be performed using --orientation 1/-1/0:

  • 1: sense direction

  • -1: antisense

  • 0: do not reorient
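Reorienting a read means reverse-complementing its sequence and reversing its quality string whenever the detected strand differs from the requested one. The sketch below assumes the detected strand is encoded as +1 (sense) or -1 (antisense); the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reorient(seq, qual, strand, orientation):
    """Reorient a read to the requested strand.

    strand:      detected strand of the read, +1 (sense) or -1 (antisense).
    orientation: requested output, 1 (sense), -1 (antisense), 0 (keep as is).
    """
    if orientation == 0 or strand == orientation:
        return seq, qual
    # Quality values must be reversed together with the bases.
    return reverse_complement(seq), qual[::-1]


print(reorient("AACG", "!#%&", strand=-1, orientation=1))
```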

Step 6. Filtering low-quality reads

Average Q-scores are calculated after trimming adapter/primer and poly(A/T) sequences (if applied). Trimming removes low-quality regions at read termini, so the resulting average is a more accurate measure of read quality. Reads whose average Q-score falls below the threshold set with --filter_lowq <int> are filtered out.
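An average Q-score can be computed in more than one way, and the convention NanoPrePro uses is not stated here. The sketch below assumes Phred+33 encoding and averages the per-base error probabilities before converting back to a Q-score, a convention common for long reads; the function names are hypothetical.

```python
import math


def mean_qscore(quality, offset=33):
    """Average Q-score of a FASTQ quality string.

    Assumption: per-base error probabilities are averaged and converted
    back to the Phred scale (not a plain arithmetic mean of Q values).
    """
    if not quality:
        return 0.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filter(quality, threshold=7):
    """Keep a read when its average Q-score is at least the threshold."""
    return mean_qscore(quality) >= threshold


# '+' encodes Q10 and '?' encodes Q30 in Phred+33.
print(round(mean_qscore("+?"), 2))
print(passes_filter("+?", threshold=7))
```

Under this convention a read with bases at Q10 and Q30 averages about Q13 rather than Q20, because low-quality bases dominate the mean error probability.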

Step 7. Output

NanoPrePro produces:

  • FASTQ: processed reads

  • HTML report: summary of pre-processing statistics

FASTQ Files

Processed reads are saved separately for full-length, truncated, and chimeric reads. Output file names can be assigned with --output_full_length, --output_truncated, and --output_fusion.

Note

Gzip-compressed FASTQ files are supported. For example: --output_full_length output.fq.gz

Per-read annotations are appended to FASTQ read IDs. See Output Documentation for details.
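For downstream inspection, output files with or without gzip compression can be read with the Python standard library alone. This is a minimal sketch that assumes standard four-line FASTQ records; open_fastq and read_fastq are hypothetical helper names, and the read ID is returned unparsed because the annotation format is described in Output Documentation.

```python
import gzip


def open_fastq(path, mode="rt"):
    """Open a FASTQ file in text mode, handling a .gz suffix transparently."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def read_fastq(path):
    """Yield (read_id, sequence, quality) tuples from a 4-line FASTQ file."""
    with open_fastq(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            qual = handle.readline().rstrip("\n")
            # Strip the leading '@'; any annotations stay in the ID string.
            yield header[1:], seq, qual
```

For example, `for read_id, seq, qual in read_fastq("output.fq.gz"): ...` iterates over the full-length reads written by the command in Quick start when a .gz output name is used.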

HTML Report

The report is written to the file specified by --report. It includes Q-score distributions, the proportion of full-length/truncated/chimeric reads, and adapter/primer alignment results from \(F_{\beta}\) optimization.

The simulated alignment results can also help users pick cutoffs manually. See Output Documentation for guidelines on selecting alignment cutoffs from the simulated alignment data.