Output
======

NanoPrePro produces two types of output files:

- **Processed FASTQ** with per-read annotations
- **HTML report** summarizing pre-processing results

.. _per_read_annotation:

Processed FASTQ
---------------

NanoPrePro appends pre-processing annotations to the FASTQ read IDs.  
These annotations use standardized flags, summarized below:


================ ================== ======= ==========================================================
flag             regex              default explanation
================ ================== ======= ==========================================================
``strand``       ``-?\d+\.\d*``     0       0: unknown; > 0: sense; < 0: antisense
``full_length``  ``[01]``           0       0: non-full-length; 1: full-length
``fusion``       ``[01]``           0       0: non-chimeric/fusion; 1: chimeric/fusion
``ploc5``        ``-?\d+``          -1      -1: unknown; 0: removed; > 0: 5' adapter/primer location
``ploc3``        ``-?\d+``          -1      -1: unknown; 0: removed; > 0: 3' adapter/primer location
``poly5``        ``-?\d+``          0       0: unknown; > 0: 5' poly length; < 0: trimmed 5' poly
``poly3``        ``-?\d+``          0       0: unknown; > 0: 3' poly length; < 0: trimmed 3' poly
================ ================== ======= ==========================================================


**Example**

.. code-block:: text

   @read_1 strand=0.91 full_length=1 fusion=0 ploc5=0 ploc3=0 poly5=0 poly3=-20
   AGAGGCTGGCGGGAACGGGC......TTTCAAAGCCAGGCGGATTC
   +
   +,),+'$)'%671*%('&$%......((&'(*($%$&%&$-((84*

In this example, *read_1*:

- is on the sense strand (``strand=0.91``)
- is full-length (``full_length=1``)
- is not chimeric (``fusion=0``)
- has had its adapters/primers removed (``ploc5=0 ploc3=0``)
- has had its 3' polyA tail trimmed (``poly3=-20``)
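
The annotations can be read back with a few lines of Python. The sketch below is not part of NanoPrePro: ``parse_annotations`` and ``KNOWN_FLAGS`` are hypothetical names, and it assumes the flags appear as space-separated ``key=value`` pairs after the read ID, as in the example above.

.. code-block:: python

   import re

   # Flag names taken from the table above; the parser itself is illustrative only.
   KNOWN_FLAGS = {"strand", "full_length", "fusion",
                  "ploc5", "ploc3", "poly5", "poly3"}

   # Matches key=value pairs whose value is an optionally signed number.
   PAIR_PATTERN = re.compile(r"(\w+)=(-?\d+(?:\.\d*)?)")


   def parse_annotations(header):
       """Split a FASTQ header line into the read ID and a dict of known flags."""
       read_id, _, rest = header.lstrip("@").partition(" ")
       flags = {}
       for key, value in PAIR_PATTERN.findall(rest):
           if key not in KNOWN_FLAGS:
               continue  # ignore unrelated key=value fields in the header
           # strand is a decimal score; every other flag is an integer
           flags[key] = float(value) if key == "strand" else int(value)
       return read_id, flags


   header = ("@read_1 strand=0.91 full_length=1 fusion=0 "
             "ploc5=0 ploc3=0 poly5=0 poly3=-20")
   read_id, flags = parse_annotations(header)

   # Keep only full-length, non-chimeric reads with a known orientation.
   keep = (flags["full_length"] == 1
           and flags["fusion"] == 0
           and flags["strand"] != 0)

Flags missing from a header are simply absent from the returned dictionary here; a caller could fill them in with the defaults listed in the table.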

.. _html_report:

HTML report
-----------

The HTML report provides an overview of pre-processing results, including:

- Quality score histograms  
- Proportion of filtered/passed full-length, truncated, and chimeric reads  
- Simulated real/random adapter/primer alignment results with interactive cutoff exploration  

.. _guideline:

Choosing Cutoffs from Simulated Real/Random Alignment Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The report includes an interactive plot of simulated **true** and **random** alignments.  

.. image:: images/simulation.png
   :alt: simulated real/random alignments

- **X-axis**: sequence similarity  
- **Y-axis**: aligned location  
- **Blue dots**: true alignments  
- **Orange dots**: random alignments  

A slider allows switching between results for different adapter/primer substring lengths.  
Hovering over the plot displays precision, recall, and F-score under specific cutoff settings.

**Guidelines**  

- Higher precision: use higher sequence similarity cutoffs and longer adapter/primer substrings.  
- Higher recall: use lower similarity cutoffs and shorter adapter/primer substrings.  

.. note::

   Increasing the adapter/primer substring length too far can be counterproductive.
   Basecalling accuracy often drops near read termini, so the last nucleotides are unreliable.
   Stop increasing the length once most blue dots begin to deviate from a sequence similarity of 1.
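
The precision/recall trade-off behind these guidelines can be illustrated with a small calculation. This is a simplified sketch, not the report's own code: it assumes that a true alignment at or above the similarity cutoff counts as a true positive and a random alignment at or above it counts as a false positive, and it ignores the aligned location. The function name ``cutoff_metrics`` and the similarity values are made up for illustration.

.. code-block:: python

   def cutoff_metrics(true_sims, random_sims, cutoff):
       """Return (precision, recall, F-score) for a sequence similarity cutoff."""
       tp = sum(s >= cutoff for s in true_sims)    # true alignments accepted
       fn = len(true_sims) - tp                    # true alignments rejected
       fp = sum(s >= cutoff for s in random_sims)  # random alignments accepted
       precision = tp / (tp + fp) if tp + fp else 0.0
       recall = tp / (tp + fn) if tp + fn else 0.0
       total = precision + recall
       f_score = 2 * precision * recall / total if total else 0.0
       return precision, recall, f_score


   # Toy similarity values, not taken from any real report.
   true_sims = [1.0, 0.95, 0.9, 0.85, 0.7]
   random_sims = [0.8, 0.75, 0.6, 0.55, 0.5]

   for cutoff in (0.6, 0.8, 0.9):
       precision, recall, f_score = cutoff_metrics(true_sims, random_sims, cutoff)
       print(f"cutoff={cutoff}: precision={precision:.2f} "
             f"recall={recall:.2f} F={f_score:.2f}")

With these toy values, raising the cutoff increases precision and lowers recall, which is the pattern the guidelines describe.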