Set up

In this module, we work through an example of a GWAS analysis using a marginal approach. Our approach is ‘marginal’ in the sense that we are testing the relationship between the outcome and the genetic data by going one SNP at a time. This is (by far) the most pervasive approach in the existing literature.

This module does not require that you have worked through the previous modules; however, I use language from the quality control, imputation, and population structure modules throughout my explanation of this marginal approach.

As always, we begin by loading the libraries with the tools we need for analysis.

library(data.table)
library(dplyr)
library(bigsnpr)
library(ggplot2)

Next, we need to load the genetic data. For this section, we will work with the quality controlled (QC’d) and imputed data

# load QC'd & imputed data
obj <- snp_attach("data/qc_penncath.rds")

Now, get the clinical data which has the binary outcome (coronary artery disease (‘CAD’)) of interest for analysis

clinical <- fread("data/penncath.csv")
# str(clinical) # if you need to remind yourself of what is here

# add the clinical data to the data in the fam file
# NOTE: the key here is to make IDs align

obj$fam <- dplyr::full_join(x = obj$fam,
                            y = clinical,
                            by = c('family.ID' = 'FamID'))

If you completed the population structure module, you should load the principal components as well - we need them for our analysis. If you did not work through the population structure module, you can skip this step.

# load principal components 
svd_X <- readRDS(file = "data/svd_X.rds")

# calculate the PCs                          
pc <- sweep(svd_X$u, 2, svd_X$d, "*")
dim(pc) # will have same number of rows as U
# [1] 1401   10
names(pc) <- paste0("PC", 1:ncol(pc))

Implement tests

We begin our analysis with a logistic regression model which uses the predictors sex and age to study the binary outcome CAD. In addition to adjusting for sex and age, we will adjust for the first 5 principal components, as per the observations we made about the population structure in our data.

# Fit a logistic model between the phenotype and each SNP separately
# while adding PCs as covariates to each model
obj.gwas <- big_univLogReg(X = obj$geno_imputed,
                           y01.train = obj$fam$CAD,
                           covar.train = pc[,1:5],
                           ncores = nb_cores())

# this takes a min or two, so I will save these results 
saveRDS(object = obj.gwas, file = "data/gwas.rds")

obj.gwas <- readRDS("data/gwas.rds")

Visualize results

There are two methods for visualizing the results of a GWAS analysis: qq plots and Manhattan plots. Both of these tools are meant to highlight the genetic variants (SNPs) with the smallest corresponding p-values.

Due to their magnitude, p-values from GWAS studies are typically illustrated on the log-transformed scale.

Using a qqplot

As a first look at our results, we will examine a qq-plot. This plot compares the observed p-values from our SNP tests with the p-values we would expect if there were no associations. To make our qq-plot, we will use the qq() function from the package qqman.

# Q-Q plot of the object
qq <- bigsnpr::snp_qq(obj.gwas)
(qq)

If the observed p-values follow exactly the distribution we would expect under the global null hypothesis, then all points will be plotted on the red line.

We notice in this qq-plot that some of the p-values (plotted as points) diverge from the red line to the left. This indicates that some observed values are below what is expected. From a statistical point of view, this is pretty promising. Having some p-values that are smaller than expected indicates that there are likely to be significant results in the data.

Using a Manhattan plot

A Manhattan plot makes the smallest p-values “pop” by plotting all p-values so that the smallest ones are the highest. We can create a Manhattan plot of our results using the manhattan() function (from the aforementioned qqman package).¹

viridis22 <- c(rep(c("#fde725", "#90d743", "#35b779", "#21918c", "#31688e", 
                "#443983", "#440154"), 3), "#fde725")

manh <- snp_manhattan(gwas = obj.gwas,
              infos.chr = obj$map$chromosome,
              infos.pos = obj$map$physical.pos,
              colors = viridis22)

(manh)

Here, the x-axis of the plot is divided into “bins”, where each bin represents a chromosome. The y axis shows the negative, log-transformed p-values. Each of the p-values from our analysis is plotted in the bin corresponding to the chromosome of the gene in that particular test, and the height of the point directly correlates with the significance of the test.

The goal of Manhattan plots are to help identify SNPs (or a region of SNPs) that are associated with a phenotype of interest. The blue and red lines represent values for \(-\text{log}_{10}\) transformed p-values at two specified thresholds of “significance.” When we do have SNPs with a p-value that exceeds these lines, we are often interested in the one that is the highest in a given region.

I could improve the Manhattan plot above by adding annotations which indicate the SNPs with the smallest p-values:

# NB:  5 × 10e−8 is a common threshold for significance in GWAS studies, 
#   whereas 5 x 10e-6 is a common threshold for "suggestive" results

signif_threshold <- 5e-8 
suggest_threshold <- 5e-6 

(manh + 
  geom_hline(yintercept = -log10(signif_threshold),
             # note: plot y-axis is on -log10 scale 
             color = "#35b779",
             lty = 2) + 
  geom_hline(yintercept = -log10(suggest_threshold),
             color = "#443983",
             lty = 2)
)

Based on the Manhattan plots, one region of interest could be around rs9632884 on Chromosome 9 - this SNP is above the suggestive threshold (indicated by the blue line), and there are a lot of other notable results clustered near this SNP. If I wanted to highlight a region of interest, I could do this to highlight the specific genes in this region, or locus.

Doing GWAS with PLINK software (command line)

The above is a simple outline of tools for summarizing GWAS results in R. There are other tools available for implementing GWAS analyses - one popular software is called PLINK. The PLINK software is made up of command line programs that implement a wide array of GWAS (and whole genome) analyses. This tutorial provides a place to begin for new PLINK users, and this other tutorial is well-known and more comprehensive.

if you like viridis colors as much as I do, check out this site for HEX codes.↩︎

GWAS tutorial: SNP testing (marginal approach)

Patrick Breheny, Logan Harris, and Tabitha Peter

January 27, 2024