In this module, we work through an example of a GWAS analysis using a marginal approach. Our approach is ‘marginal’ in the sense that we are testing the relationship between the outcome and the genetic data by going one SNP at a time. This is (by far) the most pervasive approach in the existing literature.
This module does not require that you have worked through the previous modules; however, I use language from the quality control, imputation, and population structure modules throughout my explanation of this marginal approach.
As always, we begin by loading the libraries with the tools we need for analysis.
library(data.table)
library(dplyr)
library(bigsnpr)
library(ggplot2)
Next, we need to load the genetic data. For this section, we will work with the quality controlled (QC’d) and imputed data
# load QC'd & imputed data
obj <- snp_attach("data/qc_penncath.rds")
Now, get the clinical data which has the binary outcome (coronary artery disease (‘CAD’)) of interest for analysis
clinical <- fread("data/penncath.csv")
# str(clinical) # if you need to remind yourself of what is here
# add the clinical data to the data in the fam file
# NOTE: the key here is to make IDs align
obj$fam <- dplyr::full_join(x = obj$fam,
y = clinical,
by = c('family.ID' = 'FamID'))
If you completed the population structure module, you should load the principal components as well - we need them for our analysis. If you did not work through the population structure module, you can skip this step.
# load principal components
svd_X <- readRDS(file = "data/svd_X.rds")
# calculate the PCs
pc <- sweep(svd_X$u, 2, svd_X$d, "*")
dim(pc) # will have same number of rows as U
# [1] 1401 10
names(pc) <- paste0("PC", 1:ncol(pc))
We begin our analysis with a logistic regression model which uses the predictors sex and age to study the binary outcome CAD. In addition to adjusting for sex and age, we will adjust for the first 5 principal components, as per the observations we made about the population structure in our data.
# Fit a logistic model between the phenotype and each SNP separately
# while adding PCs as covariates to each model
obj.gwas <- big_univLogReg(X = obj$geno_imputed,
y01.train = obj$fam$CAD,
covar.train = pc[,1:5],
ncores = nb_cores())
# this takes a min or two, so I will save these results
saveRDS(object = obj.gwas, file = "data/gwas.rds")
obj.gwas <- readRDS("data/gwas.rds")
There are two methods for visualizing the results of a GWAS analysis: qq plots and Manhattan plots. Both of these tools are meant to highlight the genetic variants (SNPs) with the smallest corresponding p-values.
Due to their magnitude, p-values from GWAS studies are typically illustrated on the log-transformed scale.
As a first look at our results, we will examine a qq-plot. This plot
compares the observed p-values from our SNP tests with the p-values we
would expect if there were no associations. To make our qq-plot, we will
use the qq()
function from the package
qqman
.
# Q-Q plot of the object
qq <- bigsnpr::snp_qq(obj.gwas)
(qq)
If the observed p-values follow exactly the distribution we would expect under the global null hypothesis, then all points will be plotted on the red line.
We notice in this qq-plot that some of the p-values (plotted as points) diverge from the red line to the left. This indicates that some observed values are below what is expected. From a statistical point of view, this is pretty promising. Having some p-values that are smaller than expected indicates that there are likely to be significant results in the data.
A Manhattan
plot makes the smallest p-values “pop” by plotting all p-values so
that the smallest ones are the highest. We can create a Manhattan plot
of our results using the manhattan()
function (from the
aforementioned qqman
package).1
viridis22 <- c(rep(c("#fde725", "#90d743", "#35b779", "#21918c", "#31688e",
"#443983", "#440154"), 3), "#fde725")
manh <- snp_manhattan(gwas = obj.gwas,
infos.chr = obj$map$chromosome,
infos.pos = obj$map$physical.pos,
colors = viridis22)
(manh)
Here, the x-axis of the plot is divided into “bins”, where each bin represents a chromosome. The y axis shows the negative, log-transformed p-values. Each of the p-values from our analysis is plotted in the bin corresponding to the chromosome of the gene in that particular test, and the height of the point directly correlates with the significance of the test.
The goal of Manhattan plots are to help identify SNPs (or a region of SNPs) that are associated with a phenotype of interest. The blue and red lines represent values for \(-\text{log}_{10}\) transformed p-values at two specified thresholds of “significance.” When we do have SNPs with a p-value that exceeds these lines, we are often interested in the one that is the highest in a given region.
I could improve the Manhattan plot above by adding annotations which indicate the SNPs with the smallest p-values:
# NB: 5 × 10e−8 is a common threshold for significance in GWAS studies,
# whereas 5 x 10e-6 is a common threshold for "suggestive" results
signif_threshold <- 5e-8
suggest_threshold <- 5e-6
(manh +
geom_hline(yintercept = -log10(signif_threshold),
# note: plot y-axis is on -log10 scale
color = "#35b779",
lty = 2) +
geom_hline(yintercept = -log10(suggest_threshold),
color = "#443983",
lty = 2)
)
Based on the Manhattan plots, one region of interest could be around rs9632884 on Chromosome 9 - this SNP is above the suggestive threshold (indicated by the blue line), and there are a lot of other notable results clustered near this SNP. If I wanted to highlight a region of interest, I could do this to highlight the specific genes in this region, or locus.
The above is a simple outline of tools for summarizing GWAS results
in R
. There are other tools available for implementing GWAS
analyses - one popular software is called PLINK
. The PLINK software is made up
of command line programs that implement a wide array of GWAS (and whole
genome) analyses. This
tutorial provides a place to begin for new PLINK users, and this other
tutorial is well-known and more comprehensive.