Overview
The purpose of the final project is to gain hands-on experience with
the full spectrum of RNA-seq analysis methods applied
to real-world data. This project is designed to help you strengthen both
your statistical and practical
understanding of RNA-seq data analysis.
Your goal is to perform a complete and reproducible
analysis and interpretation of an RNA-seq
dataset. The project should include all key analytical steps—data
retrieval, quality control, normalization, expression summaries, batch
effect correction, clustering, differential expression, and functional
enrichment analysis—supported by appropriate visualizations.
Dataset Selection
You must select an RNA-seq dataset (gene counts
matrix) for analysis. Acceptable sources of processed RNA-seq data
include:
Read the publication associated with your chosen dataset to
understand its biological and experimental context.
Dataset Requirements
- At least two experimental conditions (e.g., cancer
vs. normal, treated vs. control, or disease vs. healthy).
- At least five samples per condition.
- Preferably human data; however, model
organism datasets are acceptable.
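Before committing to a dataset, it can help to check the sample sheet against the requirements above. The following is a minimal sketch in Python, for illustration only (the project analysis itself is expected in R). The function name `check_design`, the sample IDs, and the condition labels are hypothetical and do not come from any particular dataset.

```python
from collections import Counter


def check_design(sample_conditions, min_conditions=2, min_per_condition=5):
    """Return a list of problems; an empty list means the design passes.

    sample_conditions maps each sample ID to its condition label.
    The default thresholds mirror the requirements listed above.
    """
    counts = Counter(sample_conditions.values())
    problems = []
    if len(counts) < min_conditions:
        problems.append(
            f"only {len(counts)} condition(s); need at least {min_conditions}"
        )
    for condition, n in sorted(counts.items()):
        if n < min_per_condition:
            problems.append(
                f"condition '{condition}' has {n} sample(s); "
                f"need at least {min_per_condition}"
            )
    return problems


# Hypothetical sample sheet: six tumor samples but only four normal samples.
samples = {f"T{i}": "tumor" for i in range(1, 7)}
samples.update({f"N{i}": "normal" for i in range(1, 5)})

for problem in check_design(samples):
    print(problem)
```

A dataset that passes returns an empty list, so the same function can be used as a quick gate at the top of an analysis script.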
Project Organization
- Create a dedicated project folder to store all
scripts, data, and results.
- Add a README.md file describing each script, its input, and its
output.
- Create a manuscript.md file containing your project report written in
R Markdown format. The report should be compiled as an HTML document.
- Follow the IMRaD
structure (Introduction, Methods, Results, and Discussion).
- Include BibTeX references where relevant.
- Provide supplementary materials containing:
  - Differential expression results
  - Functional enrichment results
- The main text (excluding references, tables, and figure legends)
should not exceed 3,000 words.
Report Structure and Content
1. Introduction / Background
- Provide a clear and concise description of your dataset and the
research question you are addressing.
- Summarize the biological context and relevance of
the study.
2. Methods
Include detailed descriptions of all analysis steps:
- Quality Assessment
  - Describe and report quality control metrics (e.g., library size,
    count distribution, sample correlation, PCA).
  - Present the results of your quality evaluation.
- Preprocessing
  - Describe all preprocessing steps (e.g., filtering lowly expressed
    genes, normalization).
- Batch Effect Correction
  - Apply ComBat if batch information is known, or sva if batches are
    unknown.
  - Display PCA plots before and after batch correction.
  - Perform hierarchical clustering on the top 10% most variable
    genes before and after correction and describe your observations.
- Differential Expression Analysis
  - Use either edgeR or DESeq2.
  - Include all relevant covariates in your model.
  - Compare your results with those reported in the original
    publication, if applicable.
  - Describe your filtering criteria and multiple testing correction
    methods, and justify their appropriateness.
  - Display boxplots of the top 10 differentially expressed genes
    between conditions.
- Functional Enrichment Analysis
  - Perform both GSEA and overrepresentation analysis.
  - Use the three Gene Ontology domains (Biological Process, Molecular
    Function, Cellular Component) and KEGG pathways.
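Two of the statistical ideas named above, multiple testing correction and overrepresentation analysis, can be sketched in a few lines. The Python below is illustrative only: it assumes the Benjamini-Hochberg procedure as the correction method (one common choice, not one mandated by this assignment) and a one-sided hypergeometric test for overrepresentation. The function names `bh_adjust` and `ora_pvalue` and all numbers are hypothetical. In the project itself, the corresponding R packages would normally handle these steps.

```python
from math import comb


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def ora_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe: genes tested; n_in_set: genes annotated to the gene set;
    n_selected: differentially expressed genes; n_overlap: genes in both.
    """
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_selected)


# Made-up p-values for four genes.
print(bh_adjust([0.01, 0.04, 0.03, 0.20]))

# Made-up counts: 10 genes tested, 4 in the gene set, 3 selected, 2 shared.
print(ora_pvalue(10, 4, 3, 2))
```

When justifying a correction method in the report, the point to convey is that adjusted p-values control the expected proportion of false discoveries across all tested genes, which is why the gene universe used in the enrichment test should be the set of genes actually tested after filtering.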
3. Results
- Present your findings clearly, supported by figures and
tables.
- Each figure and table should be numbered and
captioned.
- Highlight key findings and trends observed across analyses.
4. Discussion / Conclusion
- Interpret your results in the biological context of the study.
- Discuss how your analysis and findings differ from those reported in
the original publication.
- Address potential sources of variability and limitations.
5. References
- Use BibTeX-style references consistent with the R Markdown
bibliography format.
6. Computational Component
- Include all code chunks used in your analysis
within the R Markdown file.
- Ensure all data and code necessary to reproduce results are
provided.
- Code should be well-commented and formatted for
readability (use consistent indentation and spacing).
Submission Instructions
- Add the link to your GitHub repository to your manuscript file, and
push all scripts and data to that repository.
- Submit the knitted HTML manuscript to Canvas.
Peer Review Process
- After submission, you will be assigned to review one peer’s
project.
- The goal is to learn from others’ analyses.
- Instructions for peer review:
  - The peer-to-peer assignment will be distributed via Canvas.
  - Clone your peer’s repository and knit their final project document.
  - Evaluate each section (Introduction, Methods, Results, Discussion,
    etc.) and rate it as Pass, Fail, or Marginal, with a brief
    justification.
  - Submit your assessment via Canvas on or before [To be determined].
Grading and Deadlines
- The instructor will formally grade all projects, considering peer
assessments.
- Project grades will be released on or before
[To be determined].
- Final course grades will be entered in the system
on or before [To be determined].
Summary of Deliverables
- GitHub repository containing:
  - All analysis scripts
  - Data (or links to data)
  - README.md and manuscript.md
- HTML report (manuscript.html)
- Peer review submission