Week 1 - setup, sequencing QC, mapping, variant calling

Week 1 - setup, sequencing QC, mapping, variant calling#

Setup#

Due to a couple of problems along the way, the whole setup is summarized in cheatsheets. This includes Miniconda3, packages, Nextflow and Singularity installations.

Sequencing Quality Control#

  • Download the fasta compressed file (MN908947.fasta.zip) containing Wuhan-1 reference sequence into working directory

  • Activate MOOC environment

$ conda activate MOOC
  • Download the sequence data for following samples using:

$ fastq-dump --split-files ERR5743893
  • Create a folder to store QC reports

$ mkdir -p QC_Reports
  • Run fastqc on the two fastq files we just downloaded

$ fastqc ERR5743893_1.fastq ERR5743893_2.fastq --outdir QC_Reports
  • FastQC creates an .html report for each file. To have it summarized in one report we use a tool called multiQC.

$ multiqc .
  • Open the multiqc_report.html in the web browser and explore the output

Mapping sample sequence to reference#

Steps:

  1. Indexing reference genome - fasta.fai

  2. Map sequences - .sam

  3. Compress to binary file - .bam

  4. Sort the .bam for quicker downstream processing - (sorted).bam

  5. Index sorted.bam file - (sorted).bam.bai

  6. *Visualize in Integrative Genomics Viewer (IGV)

  1. Indexing reference genome

# be organized
$ mkdir Mapping

#index
$ bwa index MN908947.fasta
# or this, since this produces the fasta.fai file for IGV
$ samtools faidx MN908947.fasta
  1. Map sequences

$ bwa mem MN908947.fasta ERR5743893_1.fastq ERR5743893_2.fastq > Mapping/ERR5743893.sam
  • Change to Mapping and check file size

$ cd Mapping
$ ls -lhrt
  1. Compress .sam to .bam

$ cd ..
$ samtools view -@ 20 -S -b Mapping/ERR5743893.sam > Mapping/ERR5743893.bam # can change nr of threads as to specs
  • Check again the file size to confirm compression

  1. Sort the .bam file

$ samtools sort -@ 32 -o Mapping/ERR5743893.sorted.bam Mapping/ERR5743893.bam
  1. Index the (sorted).bam file

$ samtools index Mapping/ERR5743893.sorted.bam
  1. *Visualize in IGV
    Web software: Genome > load refseq.fasta & refseq.fasta.fai
    Tracks > load sampleseq.sorted.bam & sampleseq.sorted.bam.bai

Variant calling#

  • Call variants from .bam file with FreeBayes. This results in .vcf file.

$ freebayes -f MN908947.fasta Mapping/ERR5743893.sorted.bam  > ERR5743893.vcf
  • It’s good practive to compress and index the VCF file to take up less space. Most tools will accept compressed VCF files so there’s no need to uncompress them later.

$ bgzip ERR5743893.vcf # index
$ tabix ERR5743893.vcf.gz # compress
  • examples on how to quickly retrieve variant calls:

$ bcftools stats ERR5743893.vcf.gz > ERR5743893_stat.vcf.txt
# or
$ bcftools view -o output_snps.vcf.gz -O z -v snps,indels ERR5743893.vcf.gz
$ bcftools query -f '%TYPE\n' ERR5743893.vcf.gz | sort | uniq -c

Sources#

More info#