class: center, middle, inverse, title-slide .title[ # RNA-seq preprocessing ] .author[ ### Mikhail Dozmorov ] .institute[ ### Virginia Commonwealth University ] .date[ ### 2025-09-22 ] --- <!-- HTML style block --> <style> .large { font-size: 130%; } .small { font-size: 70%; } .tiny { font-size: 40%; } </style> ## Computational ecosystem of sequencing <img src="img/SequencingEcosystem.png" width="700px" style="display: block; margin: auto;" /> .small[ Conesa, A., Madrigal, P., Tarazona, S. et al. A survey of best practices for RNA-seq data analysis. Genome Biol 17, 13 (2016). https://doi.org/10.1186/s13059-016-0881-8 ] --- ## RNA-seq analysis workflow <img src="img/rnaseq_workflow.png" width="500px" style="display: block; margin: auto;" /> .small[ Brazas, M. D., & Ouellette, B. F. (2016). Continuing education workshops in bioinformatics positively impact research and careers. PLoS computational biology, https://doi.org/10.1371/journal.pcbi.1004916 ] --- ## FASTA/FASTQ format **FASTA**: text-based representation of nucleotide sequence <img src="img/fasta.png" width="800px" style="display: block; margin: auto;" /> **FASTQ**: sequence and quality info <img src="img/fastq.png" width="800px" style="display: block; margin: auto;" /> .small[ http://zhanglab.ccmb.med.umich.edu/FASTA/ ] --- ## Quality of base calling - **Phred quality score** is widely used to characterize the quality of base calling - Phred quality score = `\(-10*log_{10}(P)\)`, where P is probability that base-calling is wrong - Phred score of 30 means there is 1/1000 chance that the base-calling is wrong - The quality of the bases tend to drop at the end of the read, a pattern observed in sequencing-by-synthesis techniques --- ## Quality control - **FASTQC** - Quality of raw and aligned sequencing data - Base quality per position - Nucleotide per position - GC content - K-mer enrichment <img src="img/fastqc.png" width="600px" style="display: block; margin: auto;" /> .small[ http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, Video tutorial how to interpret, https://www.youtube.com/watch?v=bz93ReOv87Y ] --- ## RNA-seq-specific quality control - **RNASeQC** - quality of mapped (aligned) data - **RSeQC** - Python-based table and graph QC reports - **MultiQC** - Summarization and visualization QC results for multiple samples in one report. Recognizes multiple QC tools .small[ https://github.com/getzlab/rnaseqc, Deluca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire MD, Williams C, Reich M, Winckler W, Getz G. (2012) RNA- SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics https://doi.org/10.1093/bioinformatics/bts196 http://rseqc.sourceforge.net/, Wang L, Wang S, Li W* RSeQC: quality control of RNA- seq experiments Bioinformatics (2012) https://doi.org/10.1093/bioinformatics/bts356 http://multiqc.info/ ] --- ## Adapter trimming - **Cutadapt** - full control over adapter trimming - **FASTX-Toolkit** - set of tools for low-level sequence trimming/cutting - **Trimmomatic** - well-documented and easy-to-use adapter trimmer using multiple algorithms. Handles single- and paired-end reads, accountss for read quality - **Flexbar**: similar to Trimmomatic by functionality .small[ https://cutadapt.readthedocs.io/en/latest/guide.html http://hannonlab.cshl.edu/fastx_toolkit/ http://www.usadellab.org/cms/?page=trimmomatic https://github.com/seqan/flexbar/wiki/Manual ] --- ## Duplicates removal - Duplicates may correspond to biased PCR amplification of particular fragments - For highly expressed, short genes, duplicates are expected even if there is no amplification bias - Removing them may reduce the dynamic range of expression estimates Generally, do not remove duplicates from RNA-seq data - If you ultimately want to remove duplicates, use Picard tools' `MarkDuplicates` command .small[ https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates ] --- ## Alignment - RNA-seq aligners face an additional problem, not encountered in DNA-only alignment: many RNA-seq reads will span introns - The average human intron length is >6,000 bp (some are >1 Mbp in length) - In a typical human RNA-seq experiment using 100-bp reads, >35% of the reads will span multiple exons - align over splice junctions - Aligners must be splice-aware, especially when aligning longer (>50bp) reads --- ## Three types of reads: exonic-, junction- and poly(A) end-reads. <img src="img/rna-seq2.png" width="650px" style="display: block; margin: auto;" /> --- ## Strategies for gapped alignments of RNA-seq reads **Exon-first method** - Map full, unspliced reads (reads originating from a single exon) to exons - Divide the remaining reads into smaller pieces and map them to the genome - An extension process extends mapped smaller pieces to find candidate splice sites to support a spliced alignment. --- ## Strategies for gapped alignments of RNA-seq reads **Seed-and-extend methods** - Divide each RNA-seq read in small words (k-mers) of similar size - Store a map of all k-mers in the genome in an efficient lookup data structure - Map k-mers to the genome via the lookup structure - Mapped k-mers are extended into larger alignments, which may include gaps flanked by splice sites. --- ## Strategies for gapped alignments of RNA-seq reads <img src="img/gapped_alignment.png" width="700px" style="display: block; margin: auto;" /> .small[ Garber, M., Grabherr, M., Guttman, M. et al. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 8, 469–477 (2011). https://doi.org/10.1038/nmeth.1613 ] --- ## Alignment to the reference genome is the most frequently used for transcript quantification <img src="img/alignment.png" width="600px" style="display: block; margin: auto;" /> --- ## Alignment - Mapping RNA-seq reads to the genome * **General-purpose aligners** * **BWA** – efficient, based on Burrows-Wheeler Transform * **Bowtie** – ultrafast for short reads; originally paired with TopHat for RNA-seq * **Bowtie2** – supports longer reads and gapped alignment; more flexible than Bowtie * **RNA-seq optimized (spliced) aligners** * **STAR** – fast and accurate, detects splice junctions directly * **subread / subjunc** – spliced alignment; integrated with `featureCounts` for counting .small[ http://bio-bwa.sourceforge.net/ https://github.com/alexdobin/STAR https://bowtie-bio.sourceforge.net/bowtie2/index.shtml http://subread.sourceforge.net/ Timeline and extensive comparison of aligners: https://www.ebi.ac.uk/~nf/hts_mappers/ ] --- ## De Novo assembly using De Bruijn Graph * **De Bruijn Graph** - A graph-based data structure used to assemble sequences from short reads. * Break reads into **k-mers** (subsequences of length *k*). * Nodes represent **(k-1)-mers**, edges connect nodes if the corresponding k-mers overlap by k-1 bases. **Visual idea:** Each k-mer is a colored edge connecting two overlapping nodes (k-1-mers), forming a network that traces the sequence. * Enables reconstruction of full-length sequences (contigs) from short fragments. <img src="img/kallisto_deBruijn.png" width="350px" style="display: block; margin: auto;" /> --- ## De Novo RNA-Seq Assembly Tools * **Trinity** - De Bruijn graph-based assembly. * **Components**: Inchworm, Chrysalis, Butterfly. * **Trans-ABySS** - Parallelized de Bruijn graph assembler. * **rnaSPAdes** - Extension of SPAdes genome assembler tailored for RNA-Seq data. .small[ http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss https://github.com/trinityrnaseq/trinityrnaseq/wiki ] <!-- ## Trinity _de novo_ assembly - Trinity combines three independent software modules: **Inchworm**, **Chrysalis**, and **Butterfly** - Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. `\begin{center} \includegraphics[height=60px]{img/TrinityCompositeLogo.png} \end{center}` .small[ Grabherr, Manfred G., Brian J. Haas, Moran Yassour, Joshua Z. Levin, Dawn A. Thompson, Ido Amit, Xian Adiconis, et al. “Full-Length Transcriptome Assembly from RNA-Seq Data without a Reference Genome.” Nature Biotechnology 29, no. 7 (May 15, 2011): 644–52. doi:10.1038/nbt.1883. -->