R stack for -omics data

R stack for -omics data#

Martyna Siewiera 2024-02-14

Scope#

This is a quick overview of my current R stack that I use for proteomic data analysis. The aim of this cheatsheet is to organize packages and their standard use as it is easy to get lost in all available Bioconductor packages that in the end don’t work well or are missing a lot of functions and need to be combined with another packages. This is what has worked for me so far and I wanted a place where everything is summed up in a somewhat sequential order of analysis.

So far I have been mainly working with two or three experimental groups but there was an instance I had to create long functions just to automatize DE, term enrichment and their visualization for multiple groups. These functions can be found under “3.4 Functions for multiple groups”.

1. affy#

Currently used for testing normalization performance with MA plots. Blue line represents theoretical perfect loess curve, red line - actual loess curve. The more similar they are, the better.

Note: Normalization principle is that majority of features (proteins/genes) do not change their expression under most conditions and so their trend is used to normalize samples.

M = log2(x/y) aka log2FC

A = (log2(x) + log2(y))/2 = log2(xy)*1/2, aka means of means=average expression

library(affy)
# Named list with column indexes for each experimental group
groups <- list(1:3,4:6,7:9)
names(groups) <- colnames(design) #from limma pkg - search below

#take 2 experimental groups from matrix
x <- rowMeans(qnorm[,c(groups$x3D)])
y <- rowMeans(qnorm[,c(groups$x2D)])

M <- x - y
A <- (x + y)/2

#plot and save
maplot <- ma.plot(A = A, M = M, cex = 1)
title("... normalisation")

2. limma#

Originally dedicated to microarray and RNA-seq data, which is also useful for proteomics. Currently used for normalization methods and differential expression analysis.

Load required packages
```
library(tidyverse)
library(limma)
```

Prepare an expression matrix - log2 transformed, normalized, tidy.

mat <- read.delim("expr_matrix.tsv")
mat <- as.matrix(mat)
head(mat)

#different normalization methods, assess with boxplots with jitter and MA plots
# normalizeVSN()
# normalizeQuantiles()
# normalizeCyclicLoess()
# normalizeMedianValues()

Create a design table, which is a 0/1 matrix of assigning samples to an experimental group. There are many ways to do it, but I find the following one the easiest. In the model.matrix() function in rep() first digit represents the column index that will be filled with nr of replicates in a group indicated by second digit.

design <- model.matrix(~ 0+factor(c(rep(1,3),
                                    rep(2,3),
                                    rep(3,3),
                                    rep(4,3))))
# assign names of experimental groups
colnames(design) <- c("ctrl", "treat1", "treat2", "treat3")
# assign sample names from expression matrix to exp groups or manually
row.names(design) <- colnames(mat)
design

Create an lm fit and matrix of contrasts - meaning comparisons between groups that will be made. To make things clear in next steps, assign a number in the comment as it will be needed for coef arguments later.

fit <- lmFit(mat,design)
cont.matrix <- makeContrasts(ctrl_treat1 = ctrl - treat1, #1
                             ctrl_treat2 = ctrl - treat2, #2
                             ctrl_treat3 = ctrl - treat3, #3
                             levels = design) 

Fit the contrasts and calculate Bayes statistics.

fit_cont <- contrasts.fit(fit, cont.matrix)
fit_bayes <- eBayes(fit_cont)

Write down the whole differential expression matrix into a data.frame. ‘coef’ argument indicates which comparison is taken into account. See that expression matrix has to have protein/gene ids in rownames.

DEP <- topTable(fit_bayes, 
                coef=1, 
                adjust.method = "BH", 
                genelist = rownames(mat), 
                resort.by = "logFC", 
                number = nrow(mat))

For a quick volcano plot do this:

volcanoplot(fit_bayes,
            coef=1,
            highlight = 10,
            names = rownames(mat))
abline(h = 1.2, v = c(-1.5,1.5), lty = 2)

A much more preferred version:

deps <- topTable(fit_bayes, coef = 1, adjust="BH", genelist = rownames(mat), resort.by = "logFC", number = nrow(qnorm))

#Retain columns for volcano plot and assign Category of expression
volc_data <- deps
volc_data$Category <- "Not significant"
volc_data$Category[volc_data$logFC <= -1.3 & volc_data$adj.P.Val <= 0.05] <- "Down-regulated"
volc_data$Category[volc_data$logFC >= 1.3 & volc_data$adj.P.Val <= 0.05] <- "Up-regulated"

#Assign label of DEPs
volc_data$Label <- NA
volc_data$Label[volc_data$Category != "Not significant"] <- volc_data$ID[volc_data$Category != "Not significant"]

#Plot
theme_set(theme_light())
v <- ggplot(volc_data,
            aes(x = logFC,
                y = (-1*log10(adj.P.Val)),
                color = Category,
                label = Label))+
  geom_point(alpha = .5)+
  scale_color_manual(values = c("Down-regulated" = "blue", "Not significant" = "gray", "Up-regulated" = "red"))+
  labs(title = names(coefs[i]),
       y = "-log10 adj. p-value",
       x = "log2 fold change")+
  geom_hline(yintercept = 1.3, lty=2)+
  geom_vline(xintercept = c(-1.3,1.3), lty=2)+
  scale_x_continuous(limits = c(-5.5,5.5))+
  scale_y_continuous(limits = c(0,10.5))

#Additional formatting, adding n of upregs and downregs
v+
  theme(legend.position = "top",
        text = element_text(size=15))+
  geom_text_repel(colour = "black", max.overlaps = 15, size = 3.5)+
  annotate("text", x = -5.5, y=0, col= "blue",label=paste0("n=",length(downreg)))+
  annotate("text", x = 5.5, y=0, col= "red",label=paste0("n=",length(upreg)))

3. clusterProfiler#

Used for Gene Ontology, KEGG, Reactome and WikiPathways GSEA or ORA analysis. Very useful feature for simplifying term results (removing similar/redundant terms, easing up the interpretation). Used with ‘enrichplot’ to visualize results.

Important note: This package likes to not cooperate, esp. if it comes to plots and gseKEGG functions. To resolve the problem consider these:

make sure you have the latest version of BiocManager;
update clusterProfiler, enrichplot, org db, dependancies etc;
If still not working, use “remotes::install_github(”YuLab-SMU/clusterProfiler”))” to download dev version - last time worked;
still nothing - download KEGG db and use internal db.

Load libraries

library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)

Prepare geneList which is a named list of genes/proteins with their log2 fold changes/pvalues from DE like limma’s topTable.

geneList <- DEP[,2] #fc
names(geneList) <- as.character(DEP[,1]) #gene symbols
geneList <- sort(geneList, decreasing = TRUE) #sort ranks

Some functions, like ‘gseKEGG’ don’t accept symbols as input, in this case it is needed to convert symbols into eg. ENTREZID.

entrezid <- bitr(as.character(DEP[,1]), 
                 fromType = "SYMBOL", 
                 toType = "ENTREZID", 
                 OrgDb = org.Hs.eg.db)

geneList2 <- filter(DEP, ID %in% entrezid$SYMBOL)[,2] #fc - filter out unmapped
names(geneList2) <- as.character(entrezid$ENTREZID) #gene symbols
geneList2 <- sort(geneList2, decreasing = TRUE) #sort ranks

Sometimes you want to use external tools like WebGESTALT. Write down the geneList into tab-seperated file:

write.table(geneList, "genelist.txt", sep = "\t", quote = FALSE)

3.1 GO GSEA

gsea <- gseGO(geneList = geneList,
              OrgDb = org.Hs.eg.db,
              ont = "BP", # or CC, MF
              minGSSize = 50, # min nr of genes mapped to term
              maxGSSize = 100, # max nr of genes mapped to term
              pvalueCutoff = 0.05,
              verbose = TRUE,
              keyType = "SYMBOL")

To view results from GSEA:

gsea_results <- gsea@result
upreg <- filter(gsea_results, NES > 0 & p.adjust < 0.05)
downreg <- filter(gsea_results, NES < 0 & p.adjust < 0.05)
head(gsea_results)

Sometimes it is possible to merge similar terms through ‘simplify()’:

simp <- clusterProfiler::simplify(gsea)
simp_results <- simp@result

3.2 KEGG GSEA

kegg <- clusterProfiler::gseKEGG(geneList = geneList2,
                organism = "hsa", 
                keyType = "ncbi-geneid", # if ENTREZID in geneList
                minGSSize = 20,
                maxGSSize = 500,
                pvalueCutoff = 0.05,
                verbose = TRUE)

kegg_result <- kegg@result
head(kegg_result)
#this function converts entrezid back to readable symbols
kegg <- setReadable(kegg)

3.3 Visualisation of GSEA results

You can choose from barplots, dotplots, ridgeplots, umaps, cnetplots etc. Check clustProfiler and enrichplot vignette for possibilities.

ridgeplot(kegg) +
  geom_vline(xintercept = 0, lty = 2)

In case you’d like to make manual barplot, which is nicely ordered by p-value:

library(ggplot2)
theme_set(theme_minimal())

upreg %>% 
  arrange(desc(p.adjust)) %>%
  mutate(Description = factor(Description, levels = Description)) %>%
  ggplot()+
    geom_col(aes(x = setSize, y = Description, fill = -log10(p.adjust)))+
    scale_fill_gradient(low = "blue", high = "red")

3.4 Functions for multiple groups

# Create a named list with coef numbers for topTable and its name of comparison
coefs <- seq.int(1,3,1)
names(coefs) <- c("ctrlVtrt1", "ctrlVtrt2", "trt1Vtrt2")

GO GSEA

compute_gsea <- function(fit, GO = c("BP|CC|MF"), coefs, simplify = TRUE) {
  ## Case 1. Simplified table results + figures---------
  if (simplify == TRUE) {
    path_figs <- paste0("figures/gsego_results/gsego_", GO, "_results/unfiltered/")
    path_data <- paste0("data/gsego_results/gsego_", GO, "_results/unfiltered/simplified/")

    if (dir.exists(path_figs) == FALSE) {
      dir.create(path_figs,showWarnings = TRUE,recursive = TRUE)
    }

    if (dir.exists(path_data) == FALSE) {
      dir.create(path_data,showWarnings = TRUE,recursive = TRUE)
    }

    for (i in coefs) {
      filename_data <- paste0("gsego_", GO, "_", names(coefs[i]), ".tsv")
      filename_fig <- paste0("gsego_", GO, "_top30_", names(coefs[i]), ".png")

      deps <- topTable(fit_bayes, coef = i, adjust="BH", genelist = rownames(mat), resort.by = "logFC", number = nrow(qnorm))

      geneList <- deps[,2] #fc
      names(geneList) <- as.character(deps[,1]) #gene symbols
      geneList <- sort(geneList, decreasing = TRUE) #sort ranks

      # define keytype for genes
      keyType = "SYMBOL"

      # GO GSEA 
      # min/max GSSize means how many genes per term should be used
      gsea <- gseGO(geneList = geneList,
                    OrgDb = org.Hs.eg.db,
                    ont = GO,
                    minGSSize = 15,
                    maxGSSize = 100,
                    pvalueCutoff = 0.05,
                    verbose = TRUE,
                    keyType = keyType)


      # remove redundant GO terms via GOSemSim methods
      simp <- clusterProfiler::simplify(gsea)
      simp_results <- simp@result
      write.table(file = paste0(path_data, filename_data), 
                  x = simp_results, 
                  quote = FALSE, 
                  sep = "\t",
                  row.names = FALSE)

      # Visualize
      ridgeplot(simp, label_format = 60)+
        geom_vline(xintercept = 0, lty = 2)+
        labs(x = "log2FC")

      ggsave(filename = filename_fig,
             device = "png",
             path = path_figs,
             width = 800,
             height = 780,
             units = "px",
             dpi = 100)
    }
  } else if (simplify == FALSE) {
    ## Case 2. NOT simplified table results ONLY---------
      path_data <- paste0("data/gsego_results/gsego_", GO, "_results/unfiltered/notsimplified/")

      if (dir.exists(path_data) == FALSE) {
        dir.create(path_data,showWarnings = TRUE,recursive = TRUE)
      }

      for (i in coefs) {
        filename_data <- paste0("gsego_", GO, "_", names(coefs[i]), ".tsv")

        deps <- topTable(fit_bayes, coef = i, adjust="BH", genelist = rownames(mat), resort.by = "logFC", number = nrow(qnorm))

        geneList <- deps[,2] #fc
        names(geneList) <- as.character(deps[,1]) #gene symbols
        geneList <- sort(geneList, decreasing = TRUE) #sort ranks

        # define keytype for genes
        keyType = "SYMBOL"

        # GO GSEA 
        # min/max GSSize means how many genes per term should be used
        gsea <- gseGO(geneList = geneList,
                      OrgDb = org.Hs.eg.db,
                      ont = GO,
                      minGSSize = 15,
                      maxGSSize = 100,
                      pvalueCutoff = 0.05,
                      verbose = TRUE,
                      keyType = keyType)


        gsea_results <- gsea@result
        write.table(file = paste0(path_data, filename_data), 
                    x = gsea_results, 
                    quote = FALSE, 
                    sep = "\t",
                    row.names = FALSE)
    }
  }
}

KEGG GSEA

compute_keggsea <- function(fit, coefs) {
  path_figs <- paste0("figures/gsekegg_results/unfiltered/")
  path_data <- paste0("data/gsekegg_results/unfiltered/")
  if (dir.exists(path_figs) == FALSE) {
    dir.create(path_figs,showWarnings = TRUE,recursive = TRUE)
  }

  if (dir.exists(path_data) == FALSE) {
    dir.create(path_data,showWarnings = TRUE,recursive = TRUE)
  }

  for (i in coefs) {
    filename_data <- paste0("gsekegg_", names(coefs[i]), ".tsv")
    filename_fig <- paste0("gsekegg_", "top30_", names(coefs[i]), ".png")

    deps <- topTable(fit_bayes, coef = i, adjust="BH", genelist = rownames(mat), resort.by = "logFC", number = nrow(qnorm))

    entrezid <- bitr(as.character(deps[,1]), fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)
    geneList <- filter(deps, ID %in% entrezid$SYMBOL)[,2] #fc
    names(geneList) <- as.character(entrezid$ENTREZID) #gene symbols
    geneList <- sort(geneList, decreasing = TRUE) #sort ranks

    kegg <- gseKEGG(geneList = geneList,                
                    organism = "hsa", 
                    keyType = "ncbi-geneid",
                    minGSSize = 20,
                    maxGSSize = 100,
                    pvalueCutoff = 0.05,
                    verbose = TRUE)

    kegg <- setReadable(kegg, OrgDb = org.Hs.eg.db, keyType = "ENTREZID")

    kegg_results <- kegg@result
    write.table(file = paste0(path_data, filename_data), 
                x = kegg_results, 
                quote = FALSE, 
                sep = "\t",
                row.names = FALSE)

    # Visualize
    ridgeplot(kegg, label_format = 60)+
      geom_vline(xintercept = 0, lty = 2)+
      labs(x = "log2FC")

    ggsave(filename = filename_fig,
           device = "png",
           path = path_figs,
           width = 800,
           height = 780,
           units = "px",
           dpi = 100)
  }
}

GO ORA

enrich_go <- function(fit, coefs, GO="BP|CC|MF", simplify = TRUE) {
  # simplify=TRUE -------------------------
  if (simplify == TRUE) {
    path_figs <- paste0("figures/enrichgo_results/enrich_", GO, "_results/unfiltered/")
    path_data <- paste0("data/enrichgo_results/enrich_", GO, "_results/unfiltered/simplified/")

    if (dir.exists(path_figs) == FALSE) {
      dir.create(path_figs,showWarnings = TRUE,recursive = TRUE)
    }

    if (dir.exists(path_data) == FALSE) {
      dir.create(path_data,showWarnings = TRUE,recursive = TRUE)
    }

    for (i in coefs) {
      filename_data <- paste0("enrich_", GO, "_sim_", names(coefs[i]))
      filename_fig <- paste0("enrich_", GO, "_top30_sim_", names(coefs[i]))

      deps <- topTable(fit_bayes, coef = i, adjust="BH", genelist = rownames(mat), resort.by = "logFC", number = nrow(qnorm))

      geneList <- deps[,2] #fc
      names(geneList) <- as.character(deps[,1]) #gene symbols
      geneList <- sort(geneList, decreasing = TRUE) #sort ranks

      # define keytype for genes
      keyType = "SYMBOL"

      #extract ids
      upreg <- deps$ID[deps$logFC >= 1.3 & deps$adj.P.Val <= 0.05]
      downreg <- deps$ID[deps$logFC <= -1.3 & deps$adj.P.Val <= 0.05]
      message(paste0(names(coefs[i]),": ", "Nr of upregulated genes:", length(upreg)))
      message(paste0(names(coefs[i]),": ", "Nr of downregulated genes:", length(downreg)))

      #enrich upreg
      go_enr <- enrichGO(gene = upreg,
                         universe = names(geneList),
                         OrgDb = org.Hs.eg.db,
                         keyType = "SYMBOL",
                         ont = GO,
                         pvalueCutoff = 0.05,
                         pAdjustMethod = "BH")
      go_enr_results <- go_enr@result
      write.table(file = paste0(path_data, filename_data, "_upreg.tsv"), 
                  x = go_enr_results, 
                  quote = FALSE, 
                  sep = "\t",
                  dec = ",",
                  row.names = FALSE)
      #Visualize upreg
      if(min(go_enr_results$p.adjust) <= 0.05) {
        go_enr_sim <- simplify(go_enr)
        barplot(go_enr_sim, showCategory = 30, label_format = 60)
        ggsave(filename = paste0(filename_fig,"_upreg.png"),
               device = "png",
               path = path_figs,
               width = 800,
               height = 780,
               units = "px",
               dpi = 100)
      } else {
        message(paste0("No enrichment for upregulated genes in ", names(coefs[i])))
      }
      #enrich downreg
      go_enr <- enrichGO(gene = downreg,
                         universe = names(geneList),
                         OrgDb = org.Hs.eg.db,
                         keyType = "SYMBOL",
                         ont = GO,
                         pvalueCutoff = 0.05,
                         pAdjustMethod = "BH")
      go_enr_results <- go_enr@result
      write.table(file = paste0(path_data, filename_data, "_downreg.tsv"), 
                  x = go_enr_results, 
                  quote = FALSE, 
                  sep = "\t",
                  dec = ",",
                  row.names = FALSE)
      #Visualize downreg
      if(min(go_enr_results$p.adjust) <= 0.05) {
        go_enr_sim <- simplify(go_enr)
        barplot(go_enr_sim, showCategory = 30, label_format = 60)
        ggsave(filename = paste0(filename_fig,"_downreg.png"),
               device = "png",
               path = path_figs,
               width = 800,
               height = 780,
               units = "px",
               dpi = 100)
      } else {
        message(paste0("No enrichment for downregulated genes in ", names(coefs[i])))
      }
      # simplify=FALSE ----------------------------------------------
    } 
  } else if (simplify == FALSE) {
    path_data <- paste0("data/enrichgo_results/enrich_", GO, "_results/unfiltered/not_simplified/")

    if (dir.exists(path_figs) == FALSE) {
      dir.create(path_figs,showWarnings = TRUE,recursive = TRUE)
    }

    if (dir.exists(path_data) == FALSE) {
      dir.create(path_data,showWarnings = TRUE,recursive = TRUE)
    }

    for (i in coefs) {
      filename_data <- paste0("enrich_", GO, "_notsim_", names(coefs[i]))

      deps <- topTable(fit_bayes, coef = i, adjust="BH", genelist = rownames(mat), resort.by = "logFC", number = nrow(qnorm))

      geneList <- deps[,2] #fc
      names(geneList) <- as.character(deps[,1]) #gene symbols
      geneList <- sort(geneList, decreasing = TRUE) #sort ranks

      # define keytype for genes
      keyType = "SYMBOL"

      #extract ids
      upreg <- deps$ID[deps$logFC >= 1.3 & deps$adj.P.Val <= 0.05]
      downreg <- deps$ID[deps$logFC <= -1.3 & deps$adj.P.Val <= 0.05]
      message(paste0(names(coefs[i]),": ", "Nr of upregulated genes:", length(upreg)))
      message(paste0(names(coefs[i]),": ", "Nr of downregulated genes:", length(downreg)))

      #enrich upreg
      go_enr <- enrichGO(gene = upreg,
                         universe = names(geneList),
                         OrgDb = org.Hs.eg.db,
                         keyType = "SYMBOL",
                         ont = GO,
                         pvalueCutoff = 0.05,
                         pAdjustMethod = "BH")
      go_enr_results <- go_enr@result
      write.table(file = paste0(path_data, filename_data, "_notsim_upreg.tsv"), 
                  x = go_enr_results, 
                  quote = FALSE, 
                  sep = "\t",
                  dec = ",",
                  row.names = FALSE)

      if(min(go_enr_results$p.adjust) >= 0.05) {
        message(paste0("No enrichment for upregulated genes in ", names(coefs[i])))
      } 
      #enrich downreg
      go_enr <- enrichGO(gene = downreg,
                         universe = names(geneList),
                         OrgDb = org.Hs.eg.db,
                         keyType = "SYMBOL",
                         ont = GO,
                         pvalueCutoff = 0.05,
                         pAdjustMethod = "BH")
      go_enr_results <- go_enr@result
      write.table(file = paste0(path_data, filename_data, "_notsim_downreg.tsv"), 
                  x = go_enr_results, 
                  quote = FALSE, 
                  sep = "\t",
                  dec = ",",
                  row.names = FALSE)
      if(min(go_enr_results$p.adjust) >= 0.05) {
        message(paste0("No enrichment for downregulated genes in ", names(coefs[i])))
      }
    }
  }  
}

Volcanoes

for (i in coefs) {
  if (dir.exists("figures/volcanoes/") == FALSE) {
    dir.create("figures/volcanoes/",showWarnings = TRUE,recursive = TRUE)
  }
  if (dir.exists("data/comparison_stats/") == FALSE) {
    dir.create("data/comparison_stats/",showWarnings = TRUE,recursive = TRUE)
  }

  deps <- topTable(fit_bayes, coef = i, adjust="BH", genelist = rownames(mat), resort.by = "logFC", number = nrow(qnorm))

  volc_data <- deps
  volc_data$Category <- "Not significant"
  volc_data$Category[volc_data$logFC <= -1.3 & volc_data$adj.P.Val <= 0.05] <- "Down-regulated"
  volc_data$Category[volc_data$logFC >= 1.3 & volc_data$adj.P.Val <= 0.05] <- "Up-regulated"
  volc_data$Label <- NA
  volc_data$Label[volc_data$Category != "Not significant"] <- volc_data$ID[volc_data$Category != "Not significant"]

  upreg <- volc_data$ID[volc_data$logFC >= 1.3 & volc_data$adj.P.Val <= 0.05]
  downreg <- volc_data$ID[volc_data$logFC <= -1.3 & volc_data$adj.P.Val <= 0.05]

  write.table(file = paste0("data/comparison_stats/", names(coefs[i]), ".tsv"), 
              x = volc_data, 
              quote = FALSE, 
              sep = "\t",
              row.names = FALSE)

  theme_set(theme_light())
  v <- ggplot(volc_data,
              aes(x = logFC,
                  y = (-1*log10(adj.P.Val)),
                  color = Category,
                  label = Label))+
    geom_point(alpha = .5)+
    scale_color_manual(values = c("Down-regulated" = "blue", "Not significant" = "gray", "Up-regulated" = "red"))+
    labs(title = names(coefs[i]),
         y = "-log10 adj. p-value",
         x = "log2 fold change")+
    geom_hline(yintercept = 1.3, lty=2)+
    geom_vline(xintercept = c(-1.3,1.3), lty=2)+
    scale_x_continuous(limits = c(-5.5,5.5))+
    scale_y_continuous(limits = c(0,10.5))
  v+
    theme(legend.position = "top",
          text = element_text(size=15))+
    geom_text_repel(colour = "black", max.overlaps = 15, size = 3.5)+
    annotate("text", x = -5.5, y=0, col= "blue",label=paste0("n=",length(downreg)))+
    annotate("text", x = 5.5, y=0, col= "red",label=paste0("n=",length(upreg)))

  ggsave(filename = paste0("volcano", names(coefs[i]), ".png"),
         path = "figures/volcanoes/",
         device = "png",
         width = 640,
         height = 660,
         units = "px",
         dpi = 100)
}

ANOVA filtering

For multiple comparisons it is best to check if removing low variance proteins results in better clustering/differentiation - e.g. with heatmap/dendrogram and/or PCA

# drop 'coef' argument for anova
anova <- topTable(fit_bayes, adjust = "BH", genelist = rownames(mat), resort.by = "adj.P.val", number = nrow(qnorm))

anova_filtr <- anova$ProbeID[anova$adj.P.Val <= 0.05]
temp <- filter(qnorm, !Gene %in% anova_filtr)

pheatmap::pheatmap(temp[1:18],
                   scale = "row",
                   show_rownames = FALSE,
                   cutree_rows = 5,
                   clustering_distance_cols = "euclidean",
                   clustering_distance_rows = "correlation",
                   annotation_col = annot_col,
                   annotation_colors = annot_colors,
                   main = "ANOVA filtered")

Visualize terms on a heatmap

term = "negative regulation of cell cycle process"
filtr <- str_split_1(gsea_results$core_enrichment[gsea_results$Description == term], pattern = "/")

pheatmap::pheatmap(qnorm[qnorm$Gene %in% filtr,1:18],
                   scale = "row",
                   show_rownames = TRUE,
                   clustering_distance_cols = "euclidean",
                   clustering_distance_rows = "correlation",
                   annotation_col = annot_col,
                   annotation_colors = annot_colors,
                   main = term)

4. pheatmap#

Global heatmap with experimental groups as column annotation.

# Tidying up--------------------------------------------------------------------
#drop nas and set rownames
msqrob <- drop_na(msqrob)
msqrob <- column_to_rownames(msqrob, "Gene")
msqrob <- as.matrix(msqrob[1:18])

# Global heatmap-----------------------------------------------------------------------
#Column annotation - each column is a named grouping variable
annot_col <- data.frame(
  Condition.L1 = as.factor(
    c(rep(c("2D", "3D.young", "3D.old"), each = 4),
      rep(c("PDX.2D", "PDX.3D"), each = 3))),
  Condition.L2 = as.factor(
    c(rep("2D", 4), 
      rep("3D", 8), 
      rep("PDX", 6))))
#set sample names as rownames
row.names(annot_col) <- colnames(msqrob)[1:18]

#Colors mapping for column annotation
#A named list for each grouping with color values
library(RColorBrewer)
annot_colors <- list(Condition.L1 = rev(brewer.pal(5, "Paired")),
                     Condition.L2 = c("firebrick1", "darkgreen", "blue"))
#Map names of experimental groups to the colors
names(annot_colors$Condition.L1) <- levels(annot_col$Condition.L1)
names(annot_colors$Condition.L2) <- levels(annot_col$Condition.L2)

#Plot hm
hm <- pheatmap::pheatmap(msqrob[1:18],
           scale = "row",
           show_rownames = FALSE,
           clustering_distance_cols = "euclidean",
           clustering_distance_rows = "correlation",
           annotation_col = annot_col,
           annotation_colors = annot_colors,
           main = "Global heatmap")

I like to check clustering after ANOVA filtering <0.05 adj.p-value and then after DE (limma pkg below).

Explore row dendrogram of hm after filtering of matrix.

filtr_deps <- unlist(deps_list) |> unique()

# View row tree, decide on nr of clusters
hm$tree_row %>% 
  as.dendrogram() %>%
  plot(horiz = TRUE)

# assign genes to clusters and annotate in hm, extract and explore in string network/go
clusters <- cutree(hm$tree_row, k = 5)
clusters <- as.data.frame(clusters) |> mutate(clusters=as.factor(clusters))
clipr::write_clip(rownames(clusters)[clusters$clusters == 5])

#plot hm again with clustered rows separated
hm <- pheatmap::pheatmap(qnorm[qnorm$Gene %in% filtr_deps,1:18],
                   scale = "row",
                   show_rownames = FALSE,
                   cutree_rows = 5,
                   annotation_row = clusters,
                   clustering_distance_cols = "euclidean",
                   clustering_distance_rows = "correlation",
                   annotation_col = annot_col,
                   annotation_colors = annot_colors,
                   main = "Significant DEPs")

Full list of all DEPs across comparisons:

#create deps list if multiple comparisons
deps_list <- function(fit){
  comb <- list()
  for (i in coefs) {
    deps <- topTable(fit_bayes, coef = i, adjust="BH", genelist = rownames(mat), resort.by = "logFC", number = nrow(qnorm))
    
    volc_data <- deps
    volc_data$Category <- "Not significant"
    volc_data$Category[volc_data$logFC <= -1.3 & volc_data$adj.P.Val <= 0.05] <- "Down-regulated"
    volc_data$Category[volc_data$logFC >= 1.3 & volc_data$adj.P.Val <= 0.05] <- "Up-regulated"
    volc_data$Label <- NA
    volc_data$Label[volc_data$Category != "Not significant"] <- volc_data$ID[volc_data$Category != "Not significant"]
    
    upreg <- volc_data$Label[volc_data$Category == "Up-regulated"]
    downreg <- volc_data$Label[volc_data$Category == "Down-regulated"]
    
    comb <- append(comb, c(list(upreg), list(downreg)))
  }
  comb_names <- rep((names(coefs)),each=2,times=1) |> paste0(rep(c("_upreg", "_downreg"), every=2))
  names(comb) <- comb_names
  return(comb)
}

5. PCAtools#

Principal component analysis. Better results after removing low variance features (e.g. after ANOVA).

library(PCAtools)
#column to rownames for pca function
mat <- as.matrix(msqrob[1:18])
#pca
p <- pca(mat, metadata = annot_col, scale = TRUE) #annot_col same as in pheatmap

#plot PCA
biplot(p, 
       lab = colnames(mat), 
       labSize = 5, 
       showLoadings = FALSE, #show genes driving the variation
       boxedLoadingsNames = FALSE, 
       colby = "Condition.L1", #colors of points
       colkey = annot_colors$Condition.L1, #names of points, same as in pheatmap
       title = "This is PCA biplot",
       encircle = TRUE, #encircle groups (no stats applied - for that use 'eclipse' arg)
       hline = 0, #add 0,0 coordinates
       vline = 0)

#show screeplot
screeplot(p)

#loadings plot to screen genes driving the variation in components
plotloadings(p)

6. eulerr#

Create area-proportional venn/euler diagrams. Most frequently used for peptide/protein overlap between samples.

#create a named list with peptide/proteins ids for each group/sample/comparison
comb <- list(l1, l2, l3, l4)
names(comb) <- c("Sample1", "Sample2", "Sample3", "Sample4")

euler(comb) |> plot(quantities = TRUE, #area-proportional?
                    main = "title", 
                    fills = c("#AB3428", "#FFB703", "white")) #circle colors

7. corrplot#

Visualize correlation matrix in a heatmap. Assess pre- and post-normalization.

#compute correlation coefs on matrix w/o NAs
corr <- cor(log2(mat), use="complete.obs")
#visualize
corrplot(corr, method = "color", hclust.method = "ward.D2", is.corr = FALSE, col = COL1("OrRd"))

8. ggpubr & patchwork#

Organize figures in ready-to-publish panels.

Session info#

sessionInfo()

## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=Polish_Poland.utf8  LC_CTYPE=Polish_Poland.utf8   
## [3] LC_MONETARY=Polish_Poland.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=Polish_Poland.utf8    
## 
## time zone: Europe/Warsaw
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] corrplot_0.92              eulerr_7.0.0              
##  [3] PCAtools_2.14.0            ggrepel_0.9.5             
##  [5] pheatmap_1.0.12            enrichplot_1.22.0         
##  [7] clusterProfiler_4.11.0.001 limma_3.58.1              
##  [9] affy_1.80.0                Biobase_2.62.0            
## [11] BiocGenerics_0.48.1        lubridate_1.9.3           
## [13] forcats_1.0.0              stringr_1.5.1             
## [15] dplyr_1.1.4                purrr_1.0.2               
## [17] readr_2.1.5                tidyr_1.3.1               
## [19] tibble_3.2.1               ggplot2_3.4.4             
## [21] tidyverse_2.0.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3        rstudioapi_0.15.0        
##   [3] jsonlite_1.8.8            magrittr_2.0.3           
##   [5] farver_2.1.1              rmarkdown_2.25           
##   [7] fs_1.6.3                  zlibbioc_1.48.0          
##   [9] vctrs_0.6.5               DelayedMatrixStats_1.24.0
##  [11] memoise_2.0.1             RCurl_1.98-1.14          
##  [13] ggtree_3.10.0             S4Arrays_1.2.0           
##  [15] htmltools_0.5.7           SparseArray_1.2.3        
##  [17] gridGraphics_0.5-1        plyr_1.8.9               
##  [19] cachem_1.0.8              igraph_2.0.1.1           
##  [21] lifecycle_1.0.4           pkgconfig_2.0.3          
##  [23] rsvd_1.0.5                Matrix_1.6-5             
##  [25] R6_2.5.1                  fastmap_1.1.1            
##  [27] gson_0.1.0                GenomeInfoDbData_1.2.11  
##  [29] MatrixGenerics_1.14.0     digest_0.6.34            
##  [31] aplot_0.2.2               colorspace_2.1-0         
##  [33] patchwork_1.2.0           AnnotationDbi_1.64.1     
##  [35] S4Vectors_0.40.2          dqrng_0.3.2              
##  [37] irlba_2.3.5.1             RSQLite_2.3.5            
##  [39] beachmat_2.18.0           fansi_1.0.6              
##  [41] timechange_0.3.0          abind_1.4-5              
##  [43] httr_1.4.7                polyclip_1.10-6          
##  [45] compiler_4.3.2            bit64_4.0.5              
##  [47] withr_3.0.0               BiocParallel_1.36.0      
##  [49] viridis_0.6.5             DBI_1.2.1                
##  [51] ggforce_0.4.1             MASS_7.3-60.0.1          
##  [53] DelayedArray_0.28.0       HDO.db_0.99.1            
##  [55] tools_4.3.2               ape_5.7-1                
##  [57] scatterpie_0.2.1          glue_1.7.0               
##  [59] nlme_3.1-164              GOSemSim_2.28.1          
##  [61] grid_4.3.2                shadowtext_0.1.3         
##  [63] reshape2_1.4.4            fgsea_1.28.0             
##  [65] generics_0.1.3            gtable_0.3.4             
##  [67] tzdb_0.4.0                preprocessCore_1.64.0    
##  [69] data.table_1.15.0         hms_1.1.3                
##  [71] ScaledMatrix_1.10.0       BiocSingular_1.18.0      
##  [73] tidygraph_1.3.1           utf8_1.2.4               
##  [75] XVector_0.42.0            pillar_1.9.0             
##  [77] yulab.utils_0.1.4         splines_4.3.2            
##  [79] tweenr_2.0.2              treeio_1.26.0            
##  [81] lattice_0.22-5            bit_4.0.5                
##  [83] tidyselect_1.2.0          GO.db_3.18.0             
##  [85] Biostrings_2.70.2         knitr_1.45               
##  [87] gridExtra_2.3             IRanges_2.36.0           
##  [89] stats4_4.3.2              xfun_0.41                
##  [91] graphlayouts_1.1.0        statmod_1.5.0            
##  [93] matrixStats_1.2.0         stringi_1.8.3            
##  [95] lazyeval_0.2.2            ggfun_0.1.4              
##  [97] yaml_2.3.8                evaluate_0.23            
##  [99] codetools_0.2-19          ggraph_2.1.0             
## [101] qvalue_2.34.0             BiocManager_1.30.22      
## [103] ggplotify_0.1.2           cli_3.6.2                
## [105] affyio_1.72.0             munsell_0.5.0            
## [107] Rcpp_1.0.12               GenomeInfoDb_1.38.5      
## [109] png_0.1-8                 parallel_4.3.2           
## [111] blob_1.2.4                DOSE_3.28.2              
## [113] sparseMatrixStats_1.14.0  bitops_1.0-7             
## [115] viridisLite_0.4.2         tidytree_0.4.6           
## [117] scales_1.3.0              crayon_1.5.2             
## [119] rlang_1.1.3               cowplot_1.1.3            
## [121] fastmatch_1.1-4           KEGGREST_1.42.0