Objectives of this module

When you have completed this module, you will know:

  1. How to summarize the assumptions (biological and statistical) that a GWAS makes about the individuals in a study and the structure of the data.

  2. How to identify instances when your data show evidence of not meeting these assumptions.

  3. What tools are available for analyzing complex data that do not meet the basic GWAS assumptions.

  4. How to implement the tools mentioned above using bigsnpr.

Set up

The computational methods in this module depend on having fully imputed data, meaning there are no missing values in the genotype data.

As a start, load the packages we will need for this module:

# load our packages  (same ones mentioned in the data module)
library(data.table)
library(dplyr)
library(bigsnpr)
library(ggplot2)
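
Before going further, it is worth confirming that the genotype data really are fully imputed. The snippet below is a minimal sketch: it assumes your QC'd, imputed data are stored in data/qc_penncath.rds with the imputed genotypes in obj$geno_imputed (the same file and component we use later in this module), and simply counts missing calls.

# sketch: confirm there are no missing genotype calls in the imputed data
obj <- snp_attach("data/qc_penncath.rds")
n_missing <- sum(big_apply(obj$geno_imputed,
                           a.FUN = function(X, ind) sum(is.na(X[, ind])),
                           a.combine = 'c'))
n_missing # should be 0 for fully imputed data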

Population structure

Concept

In a GWAS context, population structure is a kind of relatedness among the individuals represented in the data set. When considering relatedness, it is helpful to define the phrase identical by descent. Two alleles at the same locus are identical by descent if they can be traced back to the same allele in an earlier generation. For more information on this concept, check out this example.

Population structure is defined by the existence of allele frequency differences that characterize sub-populations and is driven by the combined effects of evolutionary processes such as genetic drift, migration, demographic history, and natural selection.

Population structure is broadly categorized based on whether it describes recent or ancient relatedness. Ancient relatedness describes the sharing of a common ancestor many generations in the past. The presence of distinct ancestry groups with different allele frequencies in a sample is known as population stratification (a.k.a. ancestry differences). Recent relatedness describes the sharing of a common ancestor only several generations in the past. Pedigree-based methods may be used to explicitly model recent relatedness if familial relationships are known. In the absence of known familial relationships, recent relatedness is referred to as cryptic relatedness.1

Population stratification

Population stratification in particular has been of great concern in genetic studies because it can lead to spurious associations when subpopulations differ in both allele frequency and the trait or disease of interest.

As an example, consider a GWAS assessing genetic variants associated with lung cancer in a sample composed of subjects from two distinct subpopulations, A and B. Assume the minor allele of SNP X is present at higher frequency in subpopulation A than in subpopulation B, but has no direct effect on lung cancer. Also suppose these subpopulations are geographically segregated in such a way that subpopulation A is exposed to good air quality and subpopulation B to poor air quality, and that air quality has a direct effect on lung cancer. A GWAS of data from these subpopulations would find SNP X to be significantly associated with lung cancer, even though we know it has no effect on lung cancer. If subpopulations A and B were not exposed to different air quality, all else being equal, SNP X would not be found to be associated with the phenotype.
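
To make this concrete, here is a toy simulation; it is purely illustrative (none of the names or numbers come from the PennCATH data). The SNP has no effect on the outcome, yet a naive test flags it because allele frequency and exposure both differ between the subpopulations:

# toy simulation of confounding by population stratification (illustration only)
set.seed(1)
n <- 1000
pop <- rep(c("A", "B"), each = n / 2)     # two subpopulations
maf <- ifelse(pop == "A", 0.4, 0.1)       # SNP X is more common in subpopulation A
snp_x <- rbinom(n, size = 2, prob = maf)  # genotype = number of minor alleles
poor_air <- as.numeric(pop == "B")        # only subpopulation B has poor air quality
# lung cancer depends on air quality only, not on SNP X
cancer <- rbinom(n, size = 1, prob = plogis(-2 + 1.5 * poor_air))

# naive model: SNP X appears to be associated with lung cancer
summary(glm(cancer ~ snp_x, family = binomial))$coefficients["snp_x", ]
# adjusting for subpopulation removes the spurious association
summary(glm(cancer ~ snp_x + pop, family = binomial))$coefficients["snp_x", ]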

Cryptic relatedness

In the context of GWAS data, cryptic relatedness describes a situation in which individuals in a study are more closely related than the investigators assume. For example, some of the individuals in a study could be first cousins, unbeknownst to the researcher(s). This kind of unknown relatedness can act as a confounding factor in a GWAS analysis. Like population stratification, cryptic relatedness can lead to inflated false positive rates in genetic association studies. The former is usually characterized by differences among groups of subjects, while the latter is typically characterized by unknown relatedness among individual subjects in the study.

Tools/implementation

GWAS (and many statistical tests) assume samples are independent. Cryptic relatedness and population structure can invalidate these tests since the presence of non-independent samples, and thus non-independent errors, can lead to inflated test statistics.

With this in mind, it is critical to evaluate and account for potential population structure in our data in order to avoid false positives and negatives.

Principal component analysis (PCA)

What is PCA

In essence, principal component analysis (PCA) is a way to summarize data. Others have given really good explanations of what PCA is and how it is used, so I will not write in detail about that here. Suffice it to say that we will use PCA here to detect population structure in genetic data.

Connecting PCA to our data: let’s call our SNP data \(\mathbf{X}\). We load our data below:

# read in our QC'd, imputed data
obj <- snp_attach("data/qc_penncath.rds")

We can think of \(\mathbf{X}\) as a matrix with \(n\) rows (in this case, \(n = 1401\)) and \(p\) columns (in this case, \(p = 696644\)). For PCA, we will use a method that exploits the power of the singular value decomposition:

  1. Standardize (center and scale) the data \(\mathbf{X}\) to obtain \(\dot{\mathbf{X}}\).

  2. Calculate the singular value decomposition (SVD) \(\dot{\mathbf{X}} = \mathbf{U}\mathbf{D}\mathbf{V}^\top\). Here, \(\mathbf{U}\) and \(\mathbf{V}\) are both orthogonal matrices, and \(\mathbf{D}\) is a diagonal matrix of non-negative numbers. We say that \(\mathbf{U}\) and \(\mathbf{V}\) contain the left and right singular vectors, respectively. The values on the diagonal of \(\mathbf{D}\) are the singular values.

  3. We can calculate the principal components using the vectors of \(\mathbf{U}\) and the matrix \(\mathbf{D}\), which is super cool! This works because the principal component scores are \(\dot{\mathbf{X}}\mathbf{V} = \mathbf{U}\mathbf{D}\), and the columns of \(\mathbf{U}\) are eigenvectors of \(\dot{\mathbf{X}}\dot{\mathbf{X}}^\top\) (up to a constant, this is the realized relatedness matrix we will meet later in this module). Specifically, the \(i^{th}\) principal component \(p_i\) can be calculated as \(p_i = \mathbf{u}_i d_i\), where \(\mathbf{u}_i\) is the \(i^{th}\) column of \(\mathbf{U}\) and \(d_i\) is the \(i^{th}\) value on the diagonal of \(\mathbf{D}\). A small numeric check of this relationship follows this list.
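
The check below uses a small toy matrix rather than the genotype data, so it runs instantly; it simply confirms that the scores \(\mathbf{U}\mathbf{D}\) from an SVD match what prcomp() returns (up to sign).

# toy check: PC scores from the SVD (U D) match those from prcomp()
set.seed(2)
X_toy <- matrix(rnorm(20 * 5), nrow = 20)
X_dot <- scale(X_toy)                          # step 1: center and scale
sv <- svd(X_dot)                               # step 2: X_dot = U D V^T
pcs_svd <- sv$u %*% diag(sv$d)                 # step 3: PCs as U D
pcs_prcomp <- prcomp(X_dot, center = FALSE)$x  # PCs the usual way
all.equal(abs(pcs_svd), abs(pcs_prcomp), check.attributes = FALSE)
# columns agree up to sign, which is arbitrary in an SVD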

How to visualize PCA

To visualize PCA, we will look at two kinds of visuals:

  • A scree plot, which tells us the proportion of variance explained by each of the PCs.

  • A plot of PC scores, which is a scatter plot whose axes represent two principal components.

PCA in R

# SVD 
svd_X <- big_SVD(obj$geno_imputed, # must have imputed data
                 big_scale(), # centers and scales data -- REALLY IMPORTANT! 
                 k = 10 # use 10 PCs for now -- can ask for more if needed
                 )
# Note: this can take a while to run (several minutes, depending on your
# machine). I'm going to save this SVD as an RDS object, and refer back to that in future use. 

saveRDS(svd_X, "data/svd_X.rds")
# load data from above 
svd_X <- readRDS("data/svd_X.rds")

# look at what we have:
str(svd_X)
# List of 5
#  $ d     : num [1:10] 2086 1236 1134 1084 986 ...
#  $ u     : num [1:1401, 1:10] -0.0114 0.0127 0.0161 0.0125 0.0166 ...
#  $ v     : num [1:696644, 1:10] -0.002079 -0.00151 -0.000783 -0.000725 -0.000526 ...
#  $ center: num [1:696644] 0.102 0.353 0.268 0.266 0.3 ...
#  $ scale : num [1:696644] 0.317 0.568 0.489 0.485 0.496 ...
#  - attr(*, "class")= chr "big_SVD"

# scree plot
plot(svd_X) # 5 PCs seem to capture most of the variance


# plot PCs 1 and 2 
plot(svd_X, type = "scores") # first 2 PCs definitely capture some separation

We see some separation in the first two PCs – let’s look at the clinical data and see if we can identify a pattern:

clinical <- fread('data/penncath.csv') |>
  mutate(across(.cols = c(sex, CAD),
                .fns = as.factor))
dplyr::glimpse(clinical)
# Rows: 1,401
# Columns: 7
# $ FamID <int> 10002, 10004, 10005, 10007, 10008, 10009, 10010, 10011, 10012, 10013, 1001…
# $ CAD   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,…
# $ sex   <fct> 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 1,…
# $ age   <int> 60, 50, 55, 52, 58, 59, 54, 66, 58, 67, 45, 58, 58, 57, 42, 51, 40, 43, 54…
# $ tg    <int> NA, 55, 105, 314, 161, 171, 147, 124, 60, 160, 95, 120, 207, 150, 119, NA,…
# $ hdl   <int> NA, 23, 37, 54, 40, 46, 69, 47, 114, 40, 35, 36, 42, 46, 35, NA, 42, 38, 4…
# $ ldl   <int> NA, 75, 69, 108, 94, 92, 109, 84, 67, 112, 158, 72, 167, 176, 76, NA, 107,…

plot(svd_X, type = "scores") +
  aes(color = clinical$sex) +
  labs(color = "Sex")

Although sex does not delineate the separation we observe in the PC plot, this gives an idea of how you could go about examining how different sample-level features (e.g., batch, family, data collection site, etc.) create subpopulations in the data.
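
As another example (a sketch using the clinical data we just loaded), you could color the same score plot by CAD status:

# sketch: color the PC score plot by another sample-level feature (CAD status)
plot(svd_X, type = "scores", scores = c(1, 2)) +
  aes(color = clinical$CAD) +
  labs(color = "CAD")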

Since we see that there are subpopulations in these data, we want to make sure to adjust for this in our GWAS analysis. Hold onto the SVD results; we will use them again.
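
One common way to carry these results into the GWAS is to extract the PC scores and include them as covariates in the association model. A minimal sketch (in bigstatsr, predict() on a big_SVD object returns the scores \(\mathbf{U}\mathbf{D}\)):

# sketch: extract PC scores (U D) to use as covariates in a later GWAS model
pc_scores <- predict(svd_X)
dim(pc_scores) # one row per sample, one column per PC (here 1401 x 10)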

Family relatedness

In addition to PCA, one other way to visualize relationships is to plot the realized relatedness matrix, \(\mathbf{K} \equiv \frac{1}{p}\dot{\mathbf{X}}\dot{\mathbf{X}}^\top\). This is particularly relevant if you think you have family-based data.

# get K 
K <- big_tcrossprodSelf(X = obj$geno_imputed,
                        fun.scaling = big_scale())

saveRDS(K, 'data/K.rds')
K <- readRDS("data/K.rds")
library(corrplot)
corrplot::corrplot(K[1:50, 1:50]*(1/ncol(obj$geno_imputed)),
                   is.corr = FALSE,
                   tl.pos = "n")

If there are closely related samples or clusters of families/batches/etc., you will see a \(\mathbf{K}\) with a ‘quilt-like’ pattern, with blocks of samples appearing together. Here, we see no such pattern: individuals are only strongly correlated with themselves, which means family-based methods are not needed here.
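
If you prefer a number to a picture, a quick (and admittedly partial) sketch is to scan the off-diagonal entries of the scaled \(\mathbf{K}\) for large values; here we only look at the same 50-sample block plotted above, but you could repeat this over other blocks.

# sketch: numeric check for close relatives among the first 50 samples
K_scaled <- K[1:50, 1:50] / ncol(obj$geno_imputed) # same scaling as the plot
diag(K_scaled) <- NA                               # ignore self-relatedness
max(abs(K_scaled), na.rm = TRUE) # values near 0 suggest no close relatives here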

Further reading

The bigstatsr package documentation gives further examples of PCA, and bigsnpr has some vignettes that go into deeper detail on PCA for genetic data.


  1. NB: definitions of the terms recent and ancient are somewhat subjective and hand-wavy since, in theory, if you look back far enough, everyone shares a common ancestor. However, the idea is that humans migrated, separated, and mated such that over time distinct groups developed allele frequencies different enough to confound an analysis.↩︎