Final Project Assignment: RNA-seq Data Analysis

Overview

The purpose of the final project is to gain hands-on experience with the full spectrum of RNA-seq analysis methods applied to real-world data. This project is designed to help you strengthen both your statistical and practical understanding of RNA-seq data analysis.

Your goal is to perform a complete and reproducible analysis and interpretation of an RNA-seq dataset. The project should include all key analytical steps—data retrieval, quality control, normalization, expression summaries, batch effect correction, clustering, differential expression, and functional enrichment analysis—supported by appropriate visualizations.

Dataset Selection

Select one RNA-seq dataset, provided as a gene-level count matrix, for analysis. Acceptable sources of processed RNA-seq data include:

Gene Expression Omnibus (GEO): https://www.ncbi.nlm.nih.gov/geo/
recount3 web resource: https://rna.recount.bio/ and the R package recount3: https://bioconductor.org/packages/recount3/

Read the publication associated with your chosen dataset to understand its biological and experimental context.

Dataset Requirements

At least two experimental conditions (e.g., cancer vs. normal, treated vs. control, or disease vs. healthy).
At least five samples per condition.
Preferably human data; however, model organism datasets are acceptable.

Project Organization

Create a dedicated project folder to store all scripts, data, and results.
Add a README.md file describing each script, its input, and its output.
Create a manuscript.Rmd file containing your project report written in R Markdown format.
The report should be compiled as an HTML document.
Follow the IMRaD structure (Introduction, Methods, Results, and Discussion).
Include BibTeX references where relevant.
Provide supplementary materials containing:
- Differential expression results
- Functional enrichment results
The main text (excluding references, tables, and figure legends) should not exceed 3,000 words.

Report Structure and Content

1. Introduction / Background

Provide a clear and concise description of your dataset and the research question you are addressing.
Summarize the biological context and relevance of the study.

2. Methods

Include detailed descriptions of all analysis steps:

Quality Assessment
- Describe and report quality control metrics (e.g., library size, count distribution, sample correlation, PCA).
- Present the results of your quality evaluation.
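The two simplest metrics above, library size and sample correlation, reduce to column sums and pairwise Pearson correlations on a log scale. The following is a minimal sketch in plain Python, intended only to illustrate the arithmetic; the project itself is expected to use R. The toy matrix, the gene names, and the function names are hypothetical.

```python
import math

# Hypothetical toy count matrix: gene -> raw counts, one value per sample.
counts = {
    "geneA": [100, 200, 150, 120],
    "geneB": [10, 20, 15, 12],
    "geneC": [0, 5, 3, 1],
    "geneD": [500, 1000, 750, 600],
}


def library_sizes(counts):
    """Total number of counts per sample (column sums)."""
    n_samples = len(next(iter(counts.values())))
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_correlation(counts, i, j):
    """Correlation between samples i and j on the log2(count + 1) scale."""
    xi = [math.log2(row[i] + 1) for row in counts.values()]
    xj = [math.log2(row[j] + 1) for row in counts.values()]
    return pearson(xi, xj)


print(library_sizes(counts))  # [610, 1225, 918, 733]
print(round(sample_correlation(counts, 0, 1), 3))
```

A sample whose library size or correlation with its replicates is far from the rest of its group is a candidate outlier worth discussing in the report.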
Preprocessing
- Describe all preprocessing steps (e.g., filtering lowly expressed genes, normalization).
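As an illustration of what filtering lowly expressed genes means, the sketch below scales counts to counts per million (CPM) and keeps genes that reach a CPM threshold in a minimum number of samples. It is written in plain Python only to show the logic. The thresholds (`min_cpm`, `min_samples`), the toy data, and the function names are assumptions chosen for the example, not required values. In R, edgeR's `filterByExpr` and `calcNormFactors` are commonly used for these steps, and CPM scaling alone is not a substitute for a normalization method such as TMM.

```python
# Hypothetical toy count matrix: each sample (column) sums to 1,000,000.
counts = {
    "geneA": [500000, 600000, 550000],
    "geneB": [499990, 399998, 449997],
    "geneC": [10, 2, 3],
    "geneD": [0, 0, 0],
}


def counts_per_million(counts):
    """Scale every sample so that its counts sum to one million."""
    n_samples = len(next(iter(counts.values())))
    lib = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * c / lib[j] for j, c in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_low_expression(counts, min_cpm=5.0, min_samples=2):
    """Keep genes with CPM >= min_cpm in at least min_samples samples."""
    cpm = counts_per_million(counts)
    return {
        gene: counts[gene]
        for gene, values in cpm.items()
        if sum(v >= min_cpm for v in values) >= min_samples
    }


kept = filter_low_expression(counts)
print(sorted(kept))  # ['geneA', 'geneB']
```

Whatever rule you use, state the threshold and the number of genes retained in your Methods section.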
Batch Effect Correction
- Apply ComBat (from the sva package) if batch information is known, or surrogate variable analysis (sva) if batches are unknown.
- Display PCA plots before and after batch correction.
- Perform hierarchical clustering on the top 10% most variable genes before and after correction and describe your observations.
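Selecting the top 10% most variable genes amounts to ranking genes by their variance across samples and keeping the highest-ranked tenth. The sketch below shows this in plain Python as an illustration only. It assumes the input is already on a log scale (for example log-CPM); the toy values and the function name are hypothetical.

```python
import statistics

# Hypothetical log-scale expression values: gene -> one value per sample.
# Variance grows with the gene index, so gene10 is the most variable.
expr = {
    f"gene{i}": [5.0, 5.0 + 0.1 * i, 5.0 - 0.1 * i, 5.0]
    for i in range(1, 11)
}


def top_variable_genes(expr, fraction=0.10):
    """Return the names of the `fraction` of genes with the highest variance."""
    variances = {gene: statistics.variance(v) for gene, v in expr.items()}
    n_keep = max(1, int(round(fraction * len(variances))))
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_keep]


print(top_variable_genes(expr))  # ['gene10']
```

Use the same gene-selection rule before and after batch correction so that the two clustering results are comparable.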
Differential Expression Analysis
- Use either edgeR or DESeq2.
- Include all relevant covariates in your model.
- Compare your results with those reported in the original publication, if applicable.
- Describe filtering criteria, multiple testing correction methods, and justify their appropriateness.
- Display boxplots of the top 10 differentially expressed genes between conditions.
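When justifying your multiple testing correction, it helps to know what the adjustment does. The sketch below implements the Benjamini-Hochberg procedure in plain Python as an illustration; the function name and the example p-values are hypothetical. In practice you would rely on the adjusted p-values reported by edgeR or DESeq2 (or on R's `p.adjust`) rather than on hand-written code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in benjamini_hochberg(raw)])  # [0.02, 0.04, 0.04, 0.02]
```

Genes whose adjusted p-value falls below your chosen false discovery rate threshold are the ones to report as differentially expressed.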
Functional Enrichment Analysis
- Perform both gene set enrichment analysis (GSEA) and overrepresentation analysis (ORA).
- Use the three Gene Ontology domains (Biological Process, Molecular Function, Cellular Component) and KEGG pathways.
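Overrepresentation analysis is usually based on a one-sided hypergeometric (Fisher's exact) test: given a gene universe, a gene set, and a list of selected genes, it asks how likely an overlap at least as large as the observed one would be by chance. The sketch below shows that calculation in plain Python for illustration only; the function name, argument names, and example numbers are hypothetical, and the R enrichment packages you use will handle this internally.

```python
from math import comb


def ora_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value, P(overlap >= n_overlap).

    n_universe: number of genes tested (the background)
    n_in_set:   number of background genes annotated to the gene set
    n_selected: number of differentially expressed genes
    n_overlap:  number of selected genes that are in the gene set
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 10 genes tested, 3 in the pathway, 3 selected,
# and all 3 selected genes fall in the pathway.
print(ora_pvalue(10, 3, 3, 3))  # 1/120, about 0.0083
```

The choice of background matters: the universe should be the genes that passed filtering and were tested, not the whole genome.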

3. Results

Present your findings clearly, supported by figures and tables.
Each figure and table should be numbered and captioned.
Highlight key findings and trends observed across analyses.

4. Discussion / Conclusion

Interpret your results in the biological context of the study.
Discuss how your analysis and findings agree with or differ from those reported in the original publication.
Address potential sources of variability and limitations.

5. References

Use BibTeX-style references consistent with the R Markdown bibliography format.

6. Computational Component

Include all code chunks used in your analysis within the R Markdown file.
Ensure all data and code necessary to reproduce results are provided.
Code should be well-commented and formatted for readability (use consistent indentation and spacing).

Submission Instructions

Push all scripts and data to your GitHub repository, and include the link to that repository in your manuscript file.
Submit the knitted HTML manuscript to Canvas.

Peer Review Process

After submission, you will be assigned to review one peer’s project.
- The goal is to learn from others’ analyses.
Instructions for peer review:
- The peer-to-peer assignment will be distributed via Canvas.
- Clone your peer’s repository and knit their final project document.
- Evaluate each section (Introduction, Methods, Results, Discussion, etc.) and rate it as Pass, Marginal, or Fail, with a brief justification.
Submit your assessment via Canvas on or before [To be determined].

Grading and Deadlines

The instructor will formally grade all projects, considering peer assessments.
Project grades will be released on or before [To be determined].
Final course grades will be entered in the system on or before [To be determined].

Summary of Deliverables

GitHub repository containing:
- All analysis scripts
- Data (or links to data)
- README.md and manuscript.Rmd
HTML report (manuscript.html)
Peer review submission