Proteogenomics for neoantigen discovery#
Martyna Siewiera 2025-09-26
Overview#
Computational set-up#
Windows 10 with wsl2, docker, nextflow, miniconda
Data input:#
• .fastq files from RNA sequencing of paired tumor and nascent adjacent tissue or PBMC
• Reference genome FASTA from ensemble, 113 release, soft-masked, primary assembly
Tools and environments:#
• GATK4
• Samtools …
• Bcftools …
• RNA-seq pipeline …
Notes:#
First process the truncated .fastq files to ensure the pipeline is working, use seqtk, for example:
gzip -d <1.FASTQ.GZ>
gzip -d <2.FASTQ.GZ>
seqtk -s100 <1.FASTQ> 100000 > sub1.FASTQ
seqtk -s100 <2.FASTQ> 100000 > sub2.FASTQ
gzip *.FASTQ
Step by step:#
1. FASTQC and MULTIQC#
It’s better to evaluate the sequencing quality before running the whole pipeline. Then you can skip those later.
conda activate MOOC
mkdir -p QC_Reports
fastqc *_1.fastq *_2.fastq --ourdir QC_Reports
multiqc .
2. Sample sheet preparation#
A list of fastq files and their association is required for the RNASEQ pipeline. Use ‘create-samplesheet.R’ script, that goes like this:
library(tidyverse)
path_to_project <- "D:/NSCLC_cohort" #directory of the project
setwd(path_to_project)
files <- list.files("data", full.names = TRUE) #directory of where the fastq files are
samples <- files %>%
str_sub(end = -12, start = 6) %>%
unique()
samplesheet <- data.frame(sample = samples,
fastq_1 = keep(files, str_detect(files, "_1")),
fastq_2 = keep(files, str_detect(files, "_2")),
strandedness = "auto") # or reverse if confirmed
write_csv(samplesheet, "samplesheet.csv")
3. RNA-seq Nextflow pipeline with optional reference mapping with STAR#
Open docker app, in terminal go to the project directory and run:
sudo nextflow run \
nf-core/rnaseq \
--input samplesheet.csv \
--fasta /mnt/d/Reference/113-Release/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
--gtf /mnt/d/Reference/113-Release/Homo_sapiens.GRCh38.dna_sm.primary_assembly.gtf \
--star_index /mnt/d/Reference/STAR_reference/index/star \
--outdir results \
--email mamuszcz@gmail.com \
--aligner star_salmon \
--stringtie_ignore_gtf true \ #will use new transcripts resulted from stringtie
--save_align_intermeds true \
--skip_fastqc true \
--extra_star_align_args '--outTmpDir ~/star_temp' \
-profile docker \
-c star.config
star.config file looks as follows:
process {
resourceLimits = [
cpus: 7, # leave some for other processes
memory: '27.GB', # with wsl2 set to 30 out of 32 GB
time: 24.h
]
withName: STAR_GENOMEGENERATE { # in case you wanna index with STAR
ext.args = [
'--genomeChrBinNbits 15',
'--genomeSAsparseD 2'
].join(' ').trim()
}
}
Evaluate new transcripts:#
gffcompare -G -r /mnt/d/Reference/113-Release/Homo_sapiens.GRCh38.113.gtf results/star_salmon/stringtie/SAMPLE.transcripts.gtf
You may want to add new transcripts to the fresh run and calculate their abundance
4. Create fasta dictionary#
gatk CreateSequenceDictionary -R /mnt/d/Reference/113-Release/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
5. Variant calling#
With RNA-seq data, we must splt N cigars which splits reads by splice junctions that Mutect2 isn’t aware of
gatk SplitNCigarReads \
-R /mnt/d/Reference/113-Release/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
-I sampleA.markdup.sorted.bam \
-O sampleA.markdup.sorted.split.bam \
--create-output-bam-index true
Somatic variants call - Mutect2#
gatk Mutect2 \
-R /mnt/d/Reference/113-Release/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
-I sampleB.markdup.sorted.split.bam \
-I sampleA.markdup.sorted.split.bam \
-normal sampleA \
--dont-use-soft-clipped-bases true \
--output sample_somatic_variants.vcf \
--min-base-quality-score 20 # optional
# --genotype-germline-sites true
Now filter Mutect2 calls
gatk FilterMutectCalls \
-R /mnt/d/Reference/113-Release/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
-V sample_somatic_variants.vcf \
-O sample_somatic_variants_FILT.vcf
Germline variant calling#
Call tumor and benign tissue one by one:
gatk HaplotypeCaller \
-Xmx26g \
-R /mnt/d/Reference/113-Release/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
-I sampleA.markdup.sorted.split.bam \
-O sampleA.g.vcf.gz \
-ERC GVCF \ # generates a GVCF for later joint genotyping for tumor and healthy tissue
--dont-use-soft-clipped-bases # i.a. dont use introns (?)
Combine GVCF files for common SNV <- germline
gatk CombineGVCFs \
-R /mnt/d/Reference/113-Release/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
--variant sampleA.g.vcf.gz \
--variant sampleB.g.vcf.gz \
-O sample_combined.g.vcf.gz
Genotype the variants:
gatk GenotypeGVCFs \
-R /mnt/d/Reference/113-Release/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
-V sample_combined.g.vcf.gz \
-O sample_germline_variants.vcf
Variant annotation and filtering#
How many variants pass the filter?
grep -o 'SITE' sample_somatic_variants_FILT.vcf | wc -l
You can save only the records that PASS the filter:
bcftools view -f PASS sample_somatic_variants_FILT.vcf -o sample_somatic_variants_FILT2.vcf
Annotate with SnpEff or … to estimate the effects of the variant.
snpEff -Xmx26g GRCh38.p14 -cvsStats sample_vc_summary.csv -v sample_somatic_variants_FILT.vcf > sample_somatic_variants_ANN.vcf
# or without csv summary
snpEff -Xmx26g GRCh38.p14 -v sample_somatic_variants_FILT.vcf > sample_somatic_variants_ANN.vcf
SnpSift -Xmx26g annotate \
-v /mnt/d/Reference/ClinVar/clinvar.vcf.gz \
sample_somatic_variants_ANN.vcf > sample_somatic_variants_ANN_CLINVAR.vcf
6. Filter for non-synonymous variants (i.a. non-silent, resulting in a change in AA sequence)#
bcftools view \
-i "INFO/ANN~'missense-variants\|stop_gained\|stop_lost\|start_lost'" sample_somatic_variants_ANN.vcf \
-o sample_somatic_variants_ANN_NS.vcf
7. Find mutations which were found as germline AND somatic#
Prepare for intersection:
bgzip -c sampleA.vcf > sampleA.vcf.gz
tabix -p vcf sampleA.vcf.gz
Find and save shared variants between germline and somatic:
bcftools isec \
-n=2 \
-O v \
-o sample_shared_variants.vcf \
sample_somatic_variants.vcf.gz \
sample_germline_variants.vcf.gz
Identify and save somatic only:
bcftools isec \
-n=1 \
-c none \
-O v \
-o sample_somatic_only_variants.vcf \
sample_somatic_variants.vcf.gz \
sample_germline_variants.vcf.gz
Identify and save germline only:
bcftools isec \
-n=1 \
-c none \
-O v \
-o sample_germline_only_variants.vcf \
sample_germline_variants.vcf.gz \
sample_somatic_variants.vcf.gz
Reasonable read depth for variants:#
somatic: 100 dp
germline polymorphism: 30 dp
8. Incorporate variants into nucleotide FASTA reference#
gatk FastaAlternateReferenceMaker \
-R /mnt/d/Reference/113-Release/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
-O output.fasta \
-V input.vcf \
9. Translate nucleotide FASTA with variants to FASTA AA sequence with 6 frames (three forward and three reverse frames)#
Acc. to perplexity, use either Online tools:
• https://www.ebi.ac.uk/jdispatcher/st/emboss_transeq
• https://www.bioinformatics.org/sms2/translate.html
Offline:
• biostring R package translate()
• Biopython Bio.Seq (Seq) and Bio.Alphabet (IUPAC) modules -
translate()