Week 1 - setup, sequencing QC, mapping, variant calling#
Setup#
Due to a couple of problems along the way, the whole setup is summarized in cheatsheets. This includes Miniconda3, packages, Nextflow and Singularity installations.
Sequencing Quality Control#
Download the fasta compressed file (MN908947.fasta.zip) containing Wuhan-1 reference sequence into working directory
Activate MOOC environment
$ conda activate MOOC
Download the sequence data for following samples using:
$ fastq-dump --split-files ERR5743893
Create a folder to store QC reports
$ mkdir -p QC_Reports
Run fastqc on the two fastq files we just downloaded
$ fastqc ERR5743893_1.fastq ERR5743893_2.fastq --outdir QC_Reports
FastQC creates an .html report for each file. To have it summarized in one report we use a tool called multiQC.
$ multiqc .
Open the multiqc_report.html in the web browser and explore the output
Mapping sample sequence to reference#
Steps:
Indexing reference genome - fasta.fai
Map sequences - .sam
Compress to binary file - .bam
Sort the .bam for quicker downstream processing - (sorted).bam
Index sorted.bam file - (sorted).bam.bai
*Visualize in Integrative Genomics Viewer (IGV)
Indexing reference genome
# be organized
$ mkdir Mapping
#index
$ bwa index MN908947.fasta
# or this, since this produces the fasta.fai file for IGV
$ samtools faidx MN908947.fasta
Map sequences
$ bwa mem MN908947.fasta ERR5743893_1.fastq ERR5743893_2.fastq > Mapping/ERR5743893.sam
Change to Mapping and check file size
$ cd Mapping
$ ls -lhrt
Compress .sam to .bam
$ cd ..
$ samtools view -@ 20 -S -b Mapping/ERR5743893.sam > Mapping/ERR5743893.bam # can change nr of threads as to specs
Check again the file size to confirm compression
Sort the .bam file
$ samtools sort -@ 32 -o Mapping/ERR5743893.sorted.bam Mapping/ERR5743893.bam
Index the (sorted).bam file
$ samtools index Mapping/ERR5743893.sorted.bam
*Visualize in IGV
Web software: Genome > load refseq.fasta & refseq.fasta.fai
Tracks > load sampleseq.sorted.bam & sampleseq.sorted.bam.bai
Variant calling#
Call variants from .bam file with FreeBayes. This results in .vcf file.
$ freebayes -f MN908947.fasta Mapping/ERR5743893.sorted.bam > ERR5743893.vcf
It’s good practive to compress and index the VCF file to take up less space. Most tools will accept compressed VCF files so there’s no need to uncompress them later.
$ bgzip ERR5743893.vcf # index
$ tabix ERR5743893.vcf.gz # compress
examples on how to quickly retrieve variant calls:
$ bcftools stats ERR5743893.vcf.gz > ERR5743893_stat.vcf.txt
# or
$ bcftools view -o output_snps.vcf.gz -O z -v snps,indels ERR5743893.vcf.gz
$ bcftools query -f '%TYPE\n' ERR5743893.vcf.gz | sort | uniq -c
Sources#
Wellcome Connecting Science course on FutureLearn: “Bioinformatics for Biologists: Analysing and interpreting genomics datasets”