R packages we need to get started

library(data.table)
library(dplyr)
library(ggplot2)

Throughout, I’m going to use the data.table package to read in and work with data frames; feel free to use something else. With one exception, the files are just white space delimited text files, anything can open them.

Getting the data

This tutorial will use data from the PennCATH study of genetic risk factors for coronary artery disease. Download the data from one of the following sources (the contents are the same):

Data (zip): Windows
Data (tar): Mac/Linux

Download and unzip/untar the data; you can read the paper as well if you wish:

Paper

In what follows, I will assume that the unzipped data files are in a folder called data; if you store them somewhere else, change the directory references.

Downloading PLINK

PLINK is a command line tool (i.e., it is software that you typically run in the terminal/shell). PLINK was developed by researchers at Harvard, and it is widely used in genetics research. There are multiple versions of PLINK – for our tutorial, version 1.9 is fine (this is the version cited in the methods paper linked above). You can download PLINK 1.9 for free at this site – make sure to pick the download that matches your operating system.

Next, you need to choose a folder where you will save the PLINK software on your computer.

If you are already a command-line user: the typical place to put this would be ~/.local/bin – but you can choose wherever you’d like.
If you are new to the command line, it’s ok for now to save PLINK in the same folder where you are running this tutorial. When we get to later sections of this tutorial, you will need to tell the computer ‘where to look’ for the executable file for PLINK - that is, your R session has to know where plink.exe is located. You can run getwd() in your R console right now – by default, this is where your Rmarkdown or R script document will ‘look’ for PLINK. If you will use PLINK on multiple projects, pick a central location on your computer (i use ~/bin on my Mac) and save plink.exe here. Then, when you need to do something with PLINK, you can supply the filepath ~/myfolder/plink. Learning the command line is worth it’s own tutorial – check out this one to get started learning more.

To ensure that PLINK has installed correctly, run something like this:

system("plink --version")

The system() command is a way for you to run commands via R. Note: you may need to tell your computer where to look for the plink executable file – for example, if you downloaded PLINK and saved it to ~/Downloads/¹, you will need to run:

system("~/Downloads/plink --version")

If PLINK installed correctly, you will see a message print out that looks like this:

PLINK v1.90b6.26 64-bit (2 Apr 2022)

File formats

The data are given in “PLINK” format, which is the most common format for chip-based GWAS data (as of this writing). PLINK is an open-source whole genome association analysis toolset designed to perform a range of basic large-scale analyses in a computationally efficient manner. As part of the toolset, PLINK has its own file formats – again, these have become the standard format in the GWAS field. Data stored in PLINK format come in ‘bundles’ – that is, a PLINK data set consists of a set of files, not just one data file. It is worth knowing how to use the PLINK command line tool to analyze data in these files, but those with no experience on the command line may find that a daunting place to begin. Thankfully, we can also interface with PLINK via R. The package we will use for this is bigsnpr.

Let’s take a look at the PLINK file structure, examining what information is stored in each file.

Among the zipped files are three that are necessary to perform a GWAS, the .bed, .bim, and .fam files. I saved these to a folder called data/.

`.fam`

This contains information on the subjects:

head(fam <- fread('data/penncath.fam'))
#       V1    V2    V3    V4    V5    V6
#    <int> <int> <int> <int> <int> <int>
# 1: 10002     1     0     0     1     1
# 2: 10004     1     0     0     2     1
# 3: 10005     1     0     0     1     2
# 4: 10007     1     0     0     1     1
# 5: 10008     1     0     0     1     2
# 6: 10009     1     0     0     1     2

There are 1401 rows, one for each subject. The six colums are:

Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype

In this data set, columns 2-4 are unimportant. In general, they are used to specify pedigrees (e.g., subject 3 is the daughter of subjects 1 and 2). In this study, however, none of the subjects are related, so the only column that is important is the first, which records the subject’s unique ID.

Phenotype is typically used to record case-control status or something like that, but it is also quite common to just record clinical/biological information in a separate spreadsheet, which is what was done here.

(clinical <- fread('data/penncath.csv'))
#       FamID   CAD   sex   age    tg   hdl   ldl
#       <int> <int> <int> <int> <int> <int> <int>
#    1: 10002     1     1    60    NA    NA    NA
#    2: 10004     1     2    50    55    23    75
#    3: 10005     1     1    55   105    37    69
#    4: 10007     1     1    52   314    54   108
#    5: 10008     1     1    58   161    40    94
#   ---                                          
# 1397: 11591     0     2    59    34    44    89
# 1398: 11592     1     1    45    69   101    77
# 1399: 11593     1     1    59    77    27    41
# 1400: 11594     1     1    30    NA    NA    NA
# 1401: 11596     0     1    64   224    35    96

As you can see, we’ve got the FamID to match this spreadsheet up with the genetic data, the disease status (CAD=1 means that the subject has coronary artery disease), and some covariates (age, triglycerides, HDL and LDL cholesterol levels).

`.bim`

The .bim file, by contrast, contains information on the genetic loci (SNPs):

bim <- fread('data/penncath.bim')
head(bim)
#       V1         V2    V3     V4     V5     V6
#    <int>     <char> <int>  <int> <char> <char>
# 1:     1 rs10458597     0 564621     -9      C
# 2:     1 rs12565286     0 721290      G      C
# 3:     1 rs12082473     0 740857      T      C
# 4:     1  rs3094315     0 752566      C      T
# 5:     1  rs2286139     0 761732      C      T
# 6:     1 rs11240776     0 765269      G      A

As you can see, we have 861473 rows here, one for each SNP measured in the study. The columns are:

chromosome (1-22, X, Y or 0 if unplaced)
rs# or snp identifier
Genetic distance (morgans)
Base-pair position (bp units)
Allele 1 (usually minor allele, or ‘effect’ allele)
Allele 2 (usually major allele)

It is pretty common for column 3 to be ignored, as it is here. Many common analyses do not require base-pair position.

So, for example, the file tells us that genetic locus rs12565286 is located 721290 bases into chromosome 1, and that most people have a C there, but some have a G. Of course, “most people” will be relative to a population – be sure to confirm with collaborators that you have a correct understanding of which allele is considered ‘minor’, and what population(s) are described in the data. More on the idea of ‘population structure’ to come in a later module.

`.bed`

Finally, the .bed file, which has all the data. This is by far the largest of the three files, as it contains the entire 1401 by 861473 matrix of genotype calls for every subject and every locus. To keep things manageable, this file is encoded in a special binary format – i.e., you can’t just open the file in a text editor or load it with something like fread(). That said, we can convert our PLINK data into a bigSNP object (defined in bigsnpr) – this will let us look at our data in our R session.

# install.packages("bigsnpr") # run this if its your first time using this package
library(bigsnpr)
library(bigstatsr) # a useful dependency of bigsnpr

To read this into R, run the following chunk only once. This will create the .rds and .bk files you need, in the same directory where you’ve stored the penncath.bed (my directory is called data/)

# Note: The function assumes that all the files have the same base filename, and differ only in their extension. 

snp_readBed("data/penncath.bed")

Now, read in the data to your R session

penncath <- snp_attach("data/penncath.rds")

# check out this bigSNP object 
str(penncath)
# List of 3
#  $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 16 fields
#   ..$ extptr      :<externalptr> 
#   ..$ extptr_rw   :<externalptr> 
#   ..$ nrow        : int 1401
#   ..$ ncol        : int 861473
#   ..$ type        : Named int 1
#   .. ..- attr(*, "names")= chr "unsigned char"
#   ..$ backingfile : chr "/Users/tabithapeter/Desktop/adv-gwas-tutorial/data/penncath.bk"
#   ..$ is_read_only: logi FALSE
#   ..$ address     :<externalptr> 
#   ..$ address_rw  :<externalptr> 
#   ..$ bk          : chr "/Users/tabithapeter/Desktop/adv-gwas-tutorial/data/penncath.bk"
#   ..$ rds         : chr "/Users/tabithapeter/Desktop/adv-gwas-tutorial/data/penncath.rds"
#   ..$ is_saved    : logi TRUE
#   ..$ type_chr    : chr "unsigned char"
#   ..$ type_size   : int 1
#   ..$ file_size   : num 1.21e+09
#   ..$ code256     : num [1:256] 0 1 2 NA NA NA NA NA NA NA ...
#   ..and 26 methods, of which 12 are  possibly relevant:
#   ..  add_columns, as.FBM, bm, bm.desc, check_dimensions, check_write_permissions,
#   ..  copy#envRefClass, initialize, initialize#FBM, save, show#envRefClass, show#FBM
#  $ fam      :'data.frame':    1401 obs. of  6 variables:
#   ..$ family.ID  : int [1:1401] 10002 10004 10005 10007 10008 10009 10010 10011 10012 10013 ...
#   ..$ sample.ID  : int [1:1401] 1 1 1 1 1 1 1 1 1 1 ...
#   ..$ paternal.ID: int [1:1401] 0 0 0 0 0 0 0 0 0 0 ...
#   ..$ maternal.ID: int [1:1401] 0 0 0 0 0 0 0 0 0 0 ...
#   ..$ sex        : int [1:1401] 1 2 1 1 1 1 1 2 1 2 ...
#   ..$ affection  : int [1:1401] 1 1 2 1 2 2 2 1 2 0 ...
#  $ map      :'data.frame':    861473 obs. of  6 variables:
#   ..$ chromosome  : int [1:861473] 1 1 1 1 1 1 1 1 1 1 ...
#   ..$ marker.ID   : chr [1:861473] "rs10458597" "rs12565286" "rs12082473" "rs3094315" ...
#   ..$ genetic.dist: int [1:861473] 0 0 0 0 0 0 0 0 0 0 ...
#   ..$ physical.pos: int [1:861473] 564621 721290 740857 752566 761732 765269 777122 785989 792480 798959 ...
#   ..$ allele1     : chr [1:861473] "-9" "G" "T" "C" ...
#   ..$ allele2     : chr [1:861473] "C" "C" "C" "T" ...
#  - attr(*, "class")= chr "bigSNP"

From here, bigsnpr has a lot of functions to help us prepare data for analysis – these functions depend on you having PLINK 1.9 installed somewhere on your local machine. Go to that page, download the files, and save them on your computer (I keep files like this in ~/bin/). In order to tell whether PLINK installed correctly and is ready to go, type path/to/plink --version on either the command line prompt or in ‘Terminal’ in RStudio, where you fill in the path with the path to where you saved the downloaded files. If things are working correctly, you will see a message that says something like PLINK v1.90b6.26 64-bit (2 Apr 2022) – the PLINK v1.9 is the key here.

Note: in PLINK files, missing values are often recorded as “-9” - the bigsnpr package follows this convention as well. This is what you will see in the ‘map’ element of the bigSNP object – where allele codes are missing, “-9” is inserted.

Basic quality control steps

Once you have PLINK installed, we can start with some simple quality control (QC) steps in R:

Chromosome check

One helpful first step when getting any genetic data is to address the question: what chromosomes are the SNPs on? In our data, we can see this in the map element of our bigSNP object:

penncath$map$chromosome |> table()
# 
#     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15 
# 71038 73717 60565 55675 56178 54176 46391 48299 41110 47930 44213 42124 34262 28054 25900 
#    16    17    18    19    20    21    22 
# 27591 19939 26231 11482 22753 12463 11382

We notice here that (1) we do not have any sex chromosomes², and (2) we do not have any chromosomes with numbers outside 1-23. Sometimes, SNPs with unknown chromosome are labeled with a chromosome of 0. Another common convention is to label mitochondrial variants as having chromosome ‘26’. This is an issue that can throw off later analysis, so be sure to examine what data you have represented in your PLINK files.

Missing data

Any SNP with a lot of missing data (i.e., many samples do not have data for that marker) is probably represented with questionable quality in our data; these poorly-captured SNPs are often excluded from analysis. Likewise, any sample with lots of missing data suggests that there may be issues with the processing of that sample. Looking at our genotype data (i.e., what is in our .bed file), we see that some of our SNPs have missing values.

class(penncath$genotypes) # note: this is a file-backed object
# [1] "FBM.code256"
# attr(,"package")
# [1] "bigstatsr"
?bigstatsr::big_counts # handy function to summarize data 
# take a look at what this does 
big_counts(penncath$genotypes, ind.row = 1:10, ind.col = 1:10)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# 0      10    9    9    8    9    8    9    9   10     5
# 1       0    0    0    2    1    0    1    1    0     5
# 2       0    1    0    0    0    0    0    0    0     0
# <NA>    0    0    1    0    0    2    0    0    0     0
# summarize genotype data by *SNP*
snp_stats <- big_counts(penncath$genotypes)
dim(snp_stats) # 4th row has NA counts 
# [1]      4 861473
boxplot(snp_stats[4,])

summary(snp_stats[4,]) 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#     0.0     1.0     5.0    31.5    25.0  1401.0

There are definitely some SNPs with lots of missing values – at least one SNP is missing across all 1,401 samples! A common practice is to exclude SNPs with >10% missing data. We’ll actually do the subsetting a little later, right now we’re just exploring.

We also need to consider any samples that have high proportions of missing values:

# summarize genotype data by *sample* 
sample_stats <- big_counts(penncath$genotypes, byrow = TRUE)
# cell [1,1] is showing the # of people with "0" genotype at the 1st SNP
sample_stats[, 1:10]
#        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
# 0    547076 559924 559990 551565 561342 558493 563185 553232 550485 537783
# 1    231887 228864 228959 228728 225771 228300 228626 227605 233055 235100
# 2     51692  52374  51956  51357  49713  52277  53064  50701  52954  52947
# <NA>  30818  20311  20568  29823  24647  22403  16598  29935  24979  35643
boxplot(sample_stats[4,])

summary(sample_stats[4,])
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    4421   12605   17833   19368   24998   43330

We notice that there are no outlier values in the number of NA values per person – if there were outliers, we would need to filter out samples that have a high proportion of missing data.

Heterozygosity check

If an individual had a ton of A/B calls but no A/A or B/B calls, or vice versa, that would likely indicate something was wrong in that sample – we would typically expect all samples to have some heterozygous calls and some homozygous calls.

allele_dat <- sweep(x = sample_stats[1:3,], 
                    # ^ leave off 4th row -- don't count NAs here
                    MARGIN = 1,
                    STATS = c(0, 1, 0),
                    FUN = "*") |> colSums()

boxplot(allele_dat/ncol(snp_stats))

hist(allele_dat/ncol(snp_stats),
     main = "Zygosity",
     xlab = "Proportion of samples which are heterozygous") # should be bell-curved shaped

No big outliers here… that’s a good sign.

Minor allele frequency (MAF) filtering

As a first step, let’s look at which SNPs have minor alleles that occur very rarely – in a typical GWAS, we will need to exclude SNPs that have low variation. We see there are some SNPs for which all 1401 samples have the A/A alleles, or the major/major alleles. For analysis purposes, we typically need to exclude such SNPs in which variation is rare.

hist(snp_stats[1,])

summary(snp_stats[1,])
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#     0.0   609.0   922.0   909.7  1214.0  1401.0

The histogram and summary shows that there are some SNPs for which few, if any, samples have a minor allele. We will need to filter out these rare variants.

Hardy-Weinberg equilibrium filtering

The Hardy-Weinberg principle states that under the assumption of random mating, the distribution of genotypes should follow a binomial distribution with probability \(\pi\) equal to the MAF. If this doesn’t happen, this is an indication that either:

There was a genotyping error for this SNP, or
Mating is not random

In the real world, mating is of course not random, making it difficult to exclude SNPs on the basis of HWE. The usual recommendation is to exclude a SNP only if HWE is hugely violated (e.g., \(p < 10^{-10}\) for a test of whether the data follow a binomial distribution). We will use this filter as a criteria.

Relatedness check

Another common QC check that people apply is to see whether anyone in their data set is related to each other (i.e., their genomes are far more similar than the genomes of two unrelated people). This makes sense, as most statistical methods assume independent samples and if you have, say, two sisters in the analysis, they’re not really independent. However, the tools for assessing relatedness are related to methods for assessing population structure, which is a complex topic that I’ll discuss in the ‘structure’ module of the tutorial.

Summary & implementation

In summary, here are the basic QC steps that (almost) all GWAS need to consider:

We need to check the chromosomes we have represented in our data
We need to filter out samples that have a lot of missing values (e.g., samples that had poor quality in upstream data collection/analysis)
We need to check for heterozygosity, and filter out samples that are clear outliers
We need to filter out SNPs that have a really low minor allele frequency
We need to filter out SNPs that are far outside Hardy-Weinberg equilibrium
We need to consider relatedness in our data – both the relatedness that is known as well as any latent relationship structure(s).

Now, let’s attempt to do some quality control for real – the following code will produce an error, which I think is instructive to see:

path_to_qc_penncath <- snp_plinkQC(
  plink.path = "~/bin/plink", # you will need to change this according to your machine!
  prefix.in = "data/penncath", # input current data
  prefix.out = "data/qc_penncath", # creates *new* rds with quality-controlled data
  maf = 0.01
)

The command above is a wrapper for PLINK, and so it will print out messages to the console, so you can follow along with what PLINK is doing. We tried to filter down to just variants (SNPs) with a MAF of at least 1%, but we get this error from PLINK:

Error: .bim file has a split chromosome Use --make-bed by itself to remedy this.

This means that our penncath data need some reformmating before we can do QC. From R, type in the following line to address the issue pointed out in the PLINK error message:

system("plink --bfile data/penncath --make-bed --out data/penncath_clean")

Let’s go through what each of those options (or ‘flags’) is doing:

plink: calls the program
--bfile: tells PLINK that you have .bed/.bim/.fam files (PLINK can work with other file types too, but that’s beyond the scope of this tutorial). data/penncath names the folder with our PLINK data
--make-bed: tells PLINK to make a cleaned-up version of the .bed file. New .fam and .bim files will be created also, to ensure that those 3 files are always in sync.
--out: tells PLINK where to put the to-be-created clean data and what to call these new files. I am creating files with the prefix penncath_clean in the data/ folder.

Running the above will also create data/penncath_clean.log, which keeps a record of what flags you used to create this data set. Check out that file to make sure everything ran correctly (it will tell you if there was a problem). PLINK will also print out progress messages to your console as it works.

Now that we have fixed the ‘split chromosome’ issue, we can do QC steps (1), (2), (4), and (5) in one command:

path_to_qc_penncath_bed <- snp_plinkQC(
  plink.path = "~/bin/plink", # again, you may need to change this according to your machine!
  prefix.in = "data/penncath_clean", # input data
  prefix.out = "data/qc_penncath", # creates *new* rds with quality-controlled data
  maf = 0.01, # filter out SNPs with MAF < 0.01
  geno = 0.1, # filter out SNPs missing more than 10% of data
  mind = 0.1, # filter out SNPs missing more than 10% of data,
  hwe = 1e-10, # filter out SNPs that have p-vals below this threshold for HWE test
  autosome.only = TRUE # we want chromosomes 1-22 only
  
)

“Under the hood”, PLINK is applying all the filters we want and creating a new set of bed/bim/fam files with the prefix “qc_penncath.” We will use these QC’d files in our upcoming analysis.

Step (3) we have done by visual inspection – in this data set, that is all we need. Step (6) requires quite a bit of explanation, so this step has its own module in this tutorial.

Now, let’s actually do the subsetting. IMPORTANT: The key thing to remember here is that when you subset genetic data in PLINK files, you need to subset both the map object and the genotypes – if you don’t, these objects will no longer match and you will end up with a devastating row mismatch problem. If you are using PLINK (any version) on the command line or bigsnpr in R, then this is automatically taken care of for you. However, if you are reading in the .bim/.fam files as data frames (as we did at the beginning), then you would need to make sure you subset both files.

Let’s look at our new RDS object, the one we just created using our QC steps:

qc_file_path <- snp_readBed(bedfile = path_to_qc_penncath_bed)

qc <- snp_attach(qc_file_path)

dim(qc$genotypes)

We see that there are 1401 samples and 696644 variants represented in our new quality controlled data.

Using our QC’d data in qc_penncath, we will continue in the next section with imputation

Note on memory issues

One of the great features about the bigSNP objects in bigsnpr is that these objects save the large genotype data (i.e., what’s in the .bed PLINK file) as file-backed – in layman’s terms, file-backed objects let you have an object in your global environment that is ‘pointing at’ the large data file stored on disk. If you want to read genotype data from just a few SNPs into memory, we can do this:

# notice: data are file backed 
class(penncath$genotypes)
# [1] "FBM.code256"
# attr(,"package")
# [1] "bigstatsr"

# read in just a few SNPs 
first_10_snps <- penncath$genotypes[,1:10]

# these are now stored in memory as a matrix 
class(first_10_snps); dim(first_10_snps)
# [1] "matrix" "array"
# [1] 1401   10

# take a look at this
head(first_10_snps)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    0    0    0    0    0   NA    0    0    0     0
# [2,]    0    2    0    1    0    0    0    0    0     1
# [3,]    0    0    0    0    0    0    0    0    0     0
# [4,]    0    0    0    0    0    0    0    0    0     1
# [5,]    0    0    0    0    0    0    0    0    0     0
# [6,]    0    0    0    0    0    0    0    0    0     1

Advanced quality control steps

Sex check

In general, since we have genetic data on the individuals in the sample, including the X chromosome, we can determine (or at least, estimate) their “genetic sex” and compare that to the sex that is recorded in their clinical information. A discrepancy is very troubling, as it may be the result of a sample being switched or mis-labeled (there are other explanations as well).

Unfortunately, this data set does not include the X chromosome, so we can’t show that step here.

In general, though, one usually uses PLINK for this. The relevant command is called --check-sex. The relevant documentation: 1.9 documentation.

Note that there are no discrepancies between the sex recorded in clinical.csv and the one recorded in the .fam file:

table(penncath$fam$sex, clinical$sex)
#    
#       1   2
#   1 937   0
#   2   0 464

In our data, this is sufficient for us to know we are good to go.

Pruning, clumping, and LD filtering

SNPs which are close to one another in base-pair position are often in linkage disequilibrium (LD) with each other – we say, ‘the SNPs are in LD’. When SNPs are in LD, this can have a notable impact on analysis, particularly when the approach is a marginal analysis (i.e., testing one SNP at a time). Florian Privé, author of the bigstatsr and bigsnpr packages, has written some good vignettes on how and when to use pruning/clumping as a QC step to address LD. For now, I refer you there – I hope to do some of my own writing on the subject one day.

Note: the ~ just stands for your home directory↩︎
Recall that our example data represents humans; chromosomes are numbered differently in other organisms, so interpret the following comments in the context of the specific data you have↩︎

GWAS tutorial: Data formats, summary statistics, and quality control

Patrick Breheny and Tabitha Peter

March 16, 2024