Week 1 - setup, sequencing QC, mapping, variant calling

Week 1 - setup, sequencing QC, mapping, variant calling#

Setup#

Due to a couple of problems along the way, the whole setup is summarized in cheatsheets. This includes Miniconda3, packages, Nextflow and Singularity installations.

Sequencing Quality Control#

Download the fasta compressed file (MN908947.fasta.zip) containing Wuhan-1 reference sequence into working directory
Activate MOOC environment

$ conda activate MOOC

Download the sequence data for following samples using:

$ fastq-dump --split-files ERR5743893

Create a folder to store QC reports

$ mkdir -p QC_Reports

Run fastqc on the two fastq files we just downloaded

$ fastqc ERR5743893_1.fastq ERR5743893_2.fastq --outdir QC_Reports

FastQC creates an .html report for each file. To have it summarized in one report we use a tool called multiQC.

$ multiqc .

Open the multiqc_report.html in the web browser and explore the output

Mapping sample sequence to reference#

Steps:

Indexing reference genome - fasta.fai
Map sequences - .sam
Compress to binary file - .bam
Sort the .bam for quicker downstream processing - (sorted).bam
Index sorted.bam file - (sorted).bam.bai
*Visualize in Integrative Genomics Viewer (IGV)

Indexing reference genome

# be organized
$ mkdir Mapping

#index
$ bwa index MN908947.fasta
# or this, since this produces the fasta.fai file for IGV
$ samtools faidx MN908947.fasta

Map sequences

$ bwa mem MN908947.fasta ERR5743893_1.fastq ERR5743893_2.fastq > Mapping/ERR5743893.sam

Change to Mapping and check file size

$ cd Mapping
$ ls -lhrt

Compress .sam to .bam

$ cd ..
$ samtools view -@ 20 -S -b Mapping/ERR5743893.sam > Mapping/ERR5743893.bam # can change nr of threads as to specs

Check again the file size to confirm compression

Sort the .bam file

$ samtools sort -@ 32 -o Mapping/ERR5743893.sorted.bam Mapping/ERR5743893.bam

Index the (sorted).bam file

$ samtools index Mapping/ERR5743893.sorted.bam

*Visualize in IGV
Web software: Genome > load refseq.fasta & refseq.fasta.fai
Tracks > load sampleseq.sorted.bam & sampleseq.sorted.bam.bai

Variant calling#

Call variants from .bam file with FreeBayes. This results in .vcf file.

$ freebayes -f MN908947.fasta Mapping/ERR5743893.sorted.bam  > ERR5743893.vcf

It’s good practive to compress and index the VCF file to take up less space. Most tools will accept compressed VCF files so there’s no need to uncompress them later.

$ bgzip ERR5743893.vcf # index
$ tabix ERR5743893.vcf.gz # compress

examples on how to quickly retrieve variant calls:

$ bcftools stats ERR5743893.vcf.gz > ERR5743893_stat.vcf.txt
# or
$ bcftools view -o output_snps.vcf.gz -O z -v snps,indels ERR5743893.vcf.gz
$ bcftools query -f '%TYPE\n' ERR5743893.vcf.gz | sort | uniq -c

Sources#

Wellcome Connecting Science course on FutureLearn: “Bioinformatics for Biologists: Analysing and interpreting genomics datasets”
CDCgov/datasets-sars-cov-2
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
https://multiqc.info/
https://igv.org/app/

Week 1 - setup, sequencing QC, mapping, variant calling

Contents

Week 1 - setup, sequencing QC, mapping, variant calling#

Setup#

Sequencing Quality Control#

Mapping sample sequence to reference#

Variant calling#

Sources#

More info#