R
packages we need to get started
library(data.table)
library(dplyr)
library(ggplot2)
Throughout, I’m going to use the data.table
package to
read in and work with data frames; feel free to use something else. With
one exception, the files are just white space delimited text files,
anything can open them.
This tutorial will use data from the PennCATH study of genetic risk factors for coronary artery disease. Download the data from one of the following sources (the contents are the same):
Download and unzip/untar the data; you can read the paper as well if you wish:
In what follows, I will assume that the unzipped data files are in a
folder called data
; if you store them somewhere else,
change the directory references.
PLINK is a command line tool (i.e., it is software that you typically run in the terminal/shell). PLINK was developed by researchers at Harvard, and it is widely used in genetics research. There are multiple versions of PLINK – for our tutorial, version 1.9 is fine (this is the version cited in the methods paper linked above). You can download PLINK 1.9 for free at this site – make sure to pick the download that matches your operating system.
Next, you need to choose a folder where you will save the PLINK software on your computer.
~/.local/bin
– but you can choose wherever
you’d like.R
session has to know where plink.exe
is
located. You can run getwd()
in your R
console
right now – by default, this is where your Rmarkdown or R script
document will ‘look’ for PLINK. If you will use PLINK on multiple
projects, pick a central location on your computer (i use
~/bin
on my Mac) and save plink.exe
here.
Then, when you need to do something with PLINK, you can supply the
filepath ~/myfolder/plink
. Learning the command line is
worth it’s own tutorial – check out this one to
get started learning more.To ensure that PLINK has installed correctly, run something like this:
system("plink --version")
The system()
command is a way for you to run commands
via R
. Note: you may need to tell your
computer where to look for the plink
executable file – for
example, if you downloaded PLINK and saved it to
~/Downloads/
1, you will need to run:
system("~/Downloads/plink --version")
If PLINK installed correctly, you will see a message print out that looks like this:
PLINK v1.90b6.26 64-bit (2 Apr 2022)
The data are given in “PLINK” format, which is the most common format
for chip-based GWAS data (as of this writing). PLINK
is an
open-source whole genome association analysis toolset designed to
perform a range of basic large-scale analyses in a computationally
efficient manner. As part of the toolset, PLINK has its own file formats
– again, these have become the standard format in the GWAS field. Data
stored in PLINK format come in ‘bundles’ – that is, a PLINK data set
consists of a set of files, not just one data file. It is worth knowing
how to use the PLINK command line tool to analyze data in these files,
but those with no experience on the command line may find that a
daunting place to begin. Thankfully, we can also interface with
PLINK
via R
. The package we will use for this
is bigsnpr.
Let’s take a look at the PLINK file structure, examining what information is stored in each file.
Among the zipped files are three that are necessary to perform a
GWAS, the .bed
, .bim
, and .fam
files. I saved these to a folder called data/
.
.fam
This contains information on the subjects:
head(fam <- fread('data/penncath.fam'))
# V1 V2 V3 V4 V5 V6
# <int> <int> <int> <int> <int> <int>
# 1: 10002 1 0 0 1 1
# 2: 10004 1 0 0 2 1
# 3: 10005 1 0 0 1 2
# 4: 10007 1 0 0 1 1
# 5: 10008 1 0 0 1 2
# 6: 10009 1 0 0 1 2
There are 1401 rows, one for each subject. The six colums are:
In this data set, columns 2-4 are unimportant. In general, they are used to specify pedigrees (e.g., subject 3 is the daughter of subjects 1 and 2). In this study, however, none of the subjects are related, so the only column that is important is the first, which records the subject’s unique ID.
Phenotype is typically used to record case-control status or something like that, but it is also quite common to just record clinical/biological information in a separate spreadsheet, which is what was done here.
(clinical <- fread('data/penncath.csv'))
# FamID CAD sex age tg hdl ldl
# <int> <int> <int> <int> <int> <int> <int>
# 1: 10002 1 1 60 NA NA NA
# 2: 10004 1 2 50 55 23 75
# 3: 10005 1 1 55 105 37 69
# 4: 10007 1 1 52 314 54 108
# 5: 10008 1 1 58 161 40 94
# ---
# 1397: 11591 0 2 59 34 44 89
# 1398: 11592 1 1 45 69 101 77
# 1399: 11593 1 1 59 77 27 41
# 1400: 11594 1 1 30 NA NA NA
# 1401: 11596 0 1 64 224 35 96
As you can see, we’ve got the FamID
to match this
spreadsheet up with the genetic data, the disease status
(CAD=1
means that the subject has coronary artery disease),
and some covariates (age, triglycerides, HDL and LDL cholesterol
levels).
.bim
The .bim
file, by contrast, contains information on the
genetic loci (SNPs):
bim <- fread('data/penncath.bim')
head(bim)
# V1 V2 V3 V4 V5 V6
# <int> <char> <int> <int> <char> <char>
# 1: 1 rs10458597 0 564621 -9 C
# 2: 1 rs12565286 0 721290 G C
# 3: 1 rs12082473 0 740857 T C
# 4: 1 rs3094315 0 752566 C T
# 5: 1 rs2286139 0 761732 C T
# 6: 1 rs11240776 0 765269 G A
As you can see, we have 861473 rows here, one for each SNP measured in the study. The columns are:
It is pretty common for column 3 to be ignored, as it is here. Many common analyses do not require base-pair position.
So, for example, the file tells us that genetic locus rs12565286 is located 721290 bases into chromosome 1, and that most people have a C there, but some have a G. Of course, “most people” will be relative to a population – be sure to confirm with collaborators that you have a correct understanding of which allele is considered ‘minor’, and what population(s) are described in the data. More on the idea of ‘population structure’ to come in a later module.
.bed
Finally, the .bed
file, which has all the data. This is
by far the largest of the three files, as it contains the entire 1401 by
861473 matrix of genotype calls for every subject and every locus. To
keep things manageable, this file is encoded in a special binary format
– i.e., you can’t just open the file in a text editor or load it with
something like fread()
. That said, we can convert our PLINK
data into a bigSNP
object (defined in bigsnpr
)
– this will let us look at our data in our R
session.
# install.packages("bigsnpr") # run this if its your first time using this package
library(bigsnpr)
library(bigstatsr) # a useful dependency of bigsnpr
To read this into R
, run the following chunk
only once. This will create the .rds and .bk files you
need, in the same directory where you’ve stored the
penncath.bed
(my directory is called
data/
)
# Note: The function assumes that all the files have the same base filename, and differ only in their extension.
snp_readBed("data/penncath.bed")
Now, read in the data to your R session
penncath <- snp_attach("data/penncath.rds")
# check out this bigSNP object
str(penncath)
# List of 3
# $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 16 fields
# ..$ extptr :<externalptr>
# ..$ extptr_rw :<externalptr>
# ..$ nrow : int 1401
# ..$ ncol : int 861473
# ..$ type : Named int 1
# .. ..- attr(*, "names")= chr "unsigned char"
# ..$ backingfile : chr "/Users/tabithapeter/Desktop/adv-gwas-tutorial/data/penncath.bk"
# ..$ is_read_only: logi FALSE
# ..$ address :<externalptr>
# ..$ address_rw :<externalptr>
# ..$ bk : chr "/Users/tabithapeter/Desktop/adv-gwas-tutorial/data/penncath.bk"
# ..$ rds : chr "/Users/tabithapeter/Desktop/adv-gwas-tutorial/data/penncath.rds"
# ..$ is_saved : logi TRUE
# ..$ type_chr : chr "unsigned char"
# ..$ type_size : int 1
# ..$ file_size : num 1.21e+09
# ..$ code256 : num [1:256] 0 1 2 NA NA NA NA NA NA NA ...
# ..and 26 methods, of which 12 are possibly relevant:
# .. add_columns, as.FBM, bm, bm.desc, check_dimensions, check_write_permissions,
# .. copy#envRefClass, initialize, initialize#FBM, save, show#envRefClass, show#FBM
# $ fam :'data.frame': 1401 obs. of 6 variables:
# ..$ family.ID : int [1:1401] 10002 10004 10005 10007 10008 10009 10010 10011 10012 10013 ...
# ..$ sample.ID : int [1:1401] 1 1 1 1 1 1 1 1 1 1 ...
# ..$ paternal.ID: int [1:1401] 0 0 0 0 0 0 0 0 0 0 ...
# ..$ maternal.ID: int [1:1401] 0 0 0 0 0 0 0 0 0 0 ...
# ..$ sex : int [1:1401] 1 2 1 1 1 1 1 2 1 2 ...
# ..$ affection : int [1:1401] 1 1 2 1 2 2 2 1 2 0 ...
# $ map :'data.frame': 861473 obs. of 6 variables:
# ..$ chromosome : int [1:861473] 1 1 1 1 1 1 1 1 1 1 ...
# ..$ marker.ID : chr [1:861473] "rs10458597" "rs12565286" "rs12082473" "rs3094315" ...
# ..$ genetic.dist: int [1:861473] 0 0 0 0 0 0 0 0 0 0 ...
# ..$ physical.pos: int [1:861473] 564621 721290 740857 752566 761732 765269 777122 785989 792480 798959 ...
# ..$ allele1 : chr [1:861473] "-9" "G" "T" "C" ...
# ..$ allele2 : chr [1:861473] "C" "C" "C" "T" ...
# - attr(*, "class")= chr "bigSNP"
From here, bigsnpr
has a lot of functions to help us
prepare data for analysis – these functions depend on you having PLINK 1.9 installed
somewhere on your local machine. Go to that page, download the files,
and save them on your computer (I keep files like this in
~/bin/
). In order to tell whether PLINK installed correctly
and is ready to go, type path/to/plink --version
on either
the command line prompt or in ‘Terminal’ in RStudio, where you fill in
the path with the path to where you saved the downloaded files. If
things are working correctly, you will see a message that says something
like PLINK v1.90b6.26 64-bit (2 Apr 2022)
– the
PLINK v1.9
is the key here.
Note: in PLINK files, missing values are often recorded as “-9” - the
bigsnpr
package follows this convention as well. This is
what you will see in the ‘map’ element of the bigSNP
object
– where allele codes are missing, “-9” is inserted.
Once you have PLINK installed, we can start with some simple quality
control (QC) steps in R
:
One helpful first step when getting any genetic data is to address
the question: what chromosomes are the SNPs on? In our data, we can see
this in the map
element of our bigSNP
object:
penncath$map$chromosome |> table()
#
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# 71038 73717 60565 55675 56178 54176 46391 48299 41110 47930 44213 42124 34262 28054 25900
# 16 17 18 19 20 21 22
# 27591 19939 26231 11482 22753 12463 11382
We notice here that (1) we do not have any sex chromosomes2, and (2) we do not have any chromosomes with numbers outside 1-23. Sometimes, SNPs with unknown chromosome are labeled with a chromosome of 0. Another common convention is to label mitochondrial variants as having chromosome ‘26’. This is an issue that can throw off later analysis, so be sure to examine what data you have represented in your PLINK files.
Any SNP with a lot of missing data (i.e., many samples do not have data for that marker) is probably represented with questionable quality in our data; these poorly-captured SNPs are often excluded from analysis. Likewise, any sample with lots of missing data suggests that there may be issues with the processing of that sample. Looking at our genotype data (i.e., what is in our .bed file), we see that some of our SNPs have missing values.
class(penncath$genotypes) # note: this is a file-backed object
# [1] "FBM.code256"
# attr(,"package")
# [1] "bigstatsr"
?bigstatsr::big_counts # handy function to summarize data
# take a look at what this does
big_counts(penncath$genotypes, ind.row = 1:10, ind.col = 1:10)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# 0 10 9 9 8 9 8 9 9 10 5
# 1 0 0 0 2 1 0 1 1 0 5
# 2 0 1 0 0 0 0 0 0 0 0
# <NA> 0 0 1 0 0 2 0 0 0 0
# summarize genotype data by *SNP*
snp_stats <- big_counts(penncath$genotypes)
dim(snp_stats) # 4th row has NA counts
# [1] 4 861473
boxplot(snp_stats[4,])
summary(snp_stats[4,])
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.0 1.0 5.0 31.5 25.0 1401.0
There are definitely some SNPs with lots of missing values – at least one SNP is missing across all 1,401 samples! A common practice is to exclude SNPs with >10% missing data. We’ll actually do the subsetting a little later, right now we’re just exploring.
We also need to consider any samples that have high proportions of missing values:
# summarize genotype data by *sample*
sample_stats <- big_counts(penncath$genotypes, byrow = TRUE)
# cell [1,1] is showing the # of people with "0" genotype at the 1st SNP
sample_stats[, 1:10]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# 0 547076 559924 559990 551565 561342 558493 563185 553232 550485 537783
# 1 231887 228864 228959 228728 225771 228300 228626 227605 233055 235100
# 2 51692 52374 51956 51357 49713 52277 53064 50701 52954 52947
# <NA> 30818 20311 20568 29823 24647 22403 16598 29935 24979 35643
boxplot(sample_stats[4,])
summary(sample_stats[4,])
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4421 12605 17833 19368 24998 43330
We notice that there are no outlier values in the number of NA values per person – if there were outliers, we would need to filter out samples that have a high proportion of missing data.
If an individual had a ton of A/B calls but no A/A or B/B calls, or vice versa, that would likely indicate something was wrong in that sample – we would typically expect all samples to have some heterozygous calls and some homozygous calls.
allele_dat <- sweep(x = sample_stats[1:3,],
# ^ leave off 4th row -- don't count NAs here
MARGIN = 1,
STATS = c(0, 1, 0),
FUN = "*") |> colSums()
boxplot(allele_dat/ncol(snp_stats))
hist(allele_dat/ncol(snp_stats),
main = "Zygosity",
xlab = "Proportion of samples which are heterozygous") # should be bell-curved shaped
No big outliers here… that’s a good sign.
As a first step, let’s look at which SNPs have minor alleles that occur very rarely – in a typical GWAS, we will need to exclude SNPs that have low variation. We see there are some SNPs for which all 1401 samples have the A/A alleles, or the major/major alleles. For analysis purposes, we typically need to exclude such SNPs in which variation is rare.
hist(snp_stats[1,])
summary(snp_stats[1,])
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.0 609.0 922.0 909.7 1214.0 1401.0
The histogram and summary shows that there are some SNPs for which few, if any, samples have a minor allele. We will need to filter out these rare variants.
The Hardy-Weinberg principle states that under the assumption of random mating, the distribution of genotypes should follow a binomial distribution with probability \(\pi\) equal to the MAF. If this doesn’t happen, this is an indication that either:
In the real world, mating is of course not random, making it difficult to exclude SNPs on the basis of HWE. The usual recommendation is to exclude a SNP only if HWE is hugely violated (e.g., \(p < 10^{-10}\) for a test of whether the data follow a binomial distribution). We will use this filter as a criteria.
In summary, here are the basic QC steps that (almost) all GWAS need to consider:
We need to check the chromosomes we have represented in our data
We need to filter out samples that have a lot of missing values (e.g., samples that had poor quality in upstream data collection/analysis)
We need to check for heterozygosity, and filter out samples that are clear outliers
We need to filter out SNPs that have a really low minor allele frequency
We need to filter out SNPs that are far outside Hardy-Weinberg equilibrium
We need to consider relatedness in our data – both the relatedness that is known as well as any latent relationship structure(s).
Now, let’s attempt to do some quality control for real – the following code will produce an error, which I think is instructive to see:
path_to_qc_penncath <- snp_plinkQC(
plink.path = "~/bin/plink", # you will need to change this according to your machine!
prefix.in = "data/penncath", # input current data
prefix.out = "data/qc_penncath", # creates *new* rds with quality-controlled data
maf = 0.01
)
The command above is a wrapper for PLINK, and so it will print out messages to the console, so you can follow along with what PLINK is doing. We tried to filter down to just variants (SNPs) with a MAF of at least 1%, but we get this error from PLINK:
Error: .bim file has a split chromosome
Use --make-bed by itself to remedy this.
This means that our penncath
data need some reformmating
before we can do QC. From R
, type in the following line to
address the issue pointed out in the PLINK error message:
system("plink --bfile data/penncath --make-bed --out data/penncath_clean")
Let’s go through what each of those options (or ‘flags’) is doing:
plink
: calls the program
--bfile
: tells PLINK that you have
.bed/.bim/.fam
files (PLINK can work with other file types
too, but that’s beyond the scope of this tutorial).
data/penncath
names the folder with our PLINK data
--make-bed
: tells PLINK to make a cleaned-up version
of the .bed
file. New .fam
and
.bim
files will be created also, to ensure that those 3
files are always in sync.
--out
: tells PLINK where to put the to-be-created
clean data and what to call these new files. I am creating files with
the prefix penncath_clean
in the data/
folder.
Running the above will also create
data/penncath_clean.log
, which keeps a record of what flags
you used to create this data set. Check out that file to make sure
everything ran correctly (it will tell you if there was a problem).
PLINK will also print out progress messages to your console as it
works.
Now that we have fixed the ‘split chromosome’ issue, we can do QC steps (1), (2), (4), and (5) in one command:
path_to_qc_penncath_bed <- snp_plinkQC(
plink.path = "~/bin/plink", # again, you may need to change this according to your machine!
prefix.in = "data/penncath_clean", # input data
prefix.out = "data/qc_penncath", # creates *new* rds with quality-controlled data
maf = 0.01, # filter out SNPs with MAF < 0.01
geno = 0.1, # filter out SNPs missing more than 10% of data
mind = 0.1, # filter out SNPs missing more than 10% of data,
hwe = 1e-10, # filter out SNPs that have p-vals below this threshold for HWE test
autosome.only = TRUE # we want chromosomes 1-22 only
)
“Under the hood”, PLINK is applying all the filters we want and creating a new set of bed/bim/fam files with the prefix “qc_penncath.” We will use these QC’d files in our upcoming analysis.
Step (3) we have done by visual inspection – in this data set, that is all we need. Step (6) requires quite a bit of explanation, so this step has its own module in this tutorial.
Now, let’s actually do the subsetting. IMPORTANT:
The key thing to remember here is that when you subset genetic data in
PLINK files, you need to subset both the map
object and the
genotypes – if you don’t, these objects will no longer match and you
will end up with a devastating row mismatch problem. If you are using
PLINK
(any version) on the command line or
bigsnpr
in R
, then this is automatically taken
care of for you. However, if you are reading in the .bim/.fam files as
data frames (as we did at the beginning), then you would need to make
sure you subset both files.
Let’s look at our new RDS
object, the one we just
created using our QC steps:
qc_file_path <- snp_readBed(bedfile = path_to_qc_penncath_bed)
qc <- snp_attach(qc_file_path)
dim(qc$genotypes)
We see that there are 1401 samples and 696644 variants represented in our new quality controlled data.
Using our QC’d data in qc_penncath
, we will continue in
the next section with imputation
One of the great features about the bigSNP
objects in
bigsnpr
is that these objects save the large genotype data
(i.e., what’s in the .bed PLINK file) as file-backed – in layman’s
terms, file-backed objects let you have an object in your global
environment that is ‘pointing at’ the large data file stored on disk. If
you want to read genotype data from just a few SNPs into memory, we can
do this:
# notice: data are file backed
class(penncath$genotypes)
# [1] "FBM.code256"
# attr(,"package")
# [1] "bigstatsr"
# read in just a few SNPs
first_10_snps <- penncath$genotypes[,1:10]
# these are now stored in memory as a matrix
class(first_10_snps); dim(first_10_snps)
# [1] "matrix" "array"
# [1] 1401 10
# take a look at this
head(first_10_snps)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 0 0 0 0 0 NA 0 0 0 0
# [2,] 0 2 0 1 0 0 0 0 0 1
# [3,] 0 0 0 0 0 0 0 0 0 0
# [4,] 0 0 0 0 0 0 0 0 0 1
# [5,] 0 0 0 0 0 0 0 0 0 0
# [6,] 0 0 0 0 0 0 0 0 0 1
In general, since we have genetic data on the individuals in the sample, including the X chromosome, we can determine (or at least, estimate) their “genetic sex” and compare that to the sex that is recorded in their clinical information. A discrepancy is very troubling, as it may be the result of a sample being switched or mis-labeled (there are other explanations as well).
Unfortunately, this data set does not include the X chromosome, so we can’t show that step here.
In general, though, one usually uses PLINK for this. The relevant
command is called --check-sex
. The relevant documentation:
1.9
documentation.
Note that there are no discrepancies between the sex recorded in
clinical.csv
and the one recorded in the .fam
file:
table(penncath$fam$sex, clinical$sex)
#
# 1 2
# 1 937 0
# 2 0 464
In our data, this is sufficient for us to know we are good to go.
SNPs which are close to one another in base-pair position are often
in linkage
disequilibrium (LD) with each other – we say, ‘the SNPs are in LD’.
When SNPs are in LD, this can have a notable impact on analysis,
particularly when the approach is a marginal analysis (i.e., testing one
SNP at a time). Florian Privé,
author of the bigstatsr
and bigsnpr
packages,
has written some good vignettes
on how and when to use pruning/clumping as a QC step to address LD. For
now, I refer you there – I hope to do some of my own writing on the
subject one day.