Objectives of this module

  • Understand why imputation is necessary
  • Know what kinds of imputation methods are available
  • Implement imputation using the bigsnpr package

Set up

To begin, read in the QC’d (quality-controlled) data from the earlier step (refer back to the first module of this tutorial).

# load our packages  (same ones mentioned in the data module)
library(data.table)
library(dplyr)
library(bigsnpr)
library(ggplot2)

# create a bigSNP data object using data from PLINK files created in previous module 
# snp_readBed("data/qc_penncath.bed") # again, snp_readBed is a one-time thing
obj <- snp_attach("data/qc_penncath.rds")
# double check dimensions 
dim(obj$fam)
# [1] 1401    6
dim(obj$map) 
# [1] 696644      6
dim(obj$genotypes)
# [1]   1401 696644

Imputation

In a genetics research context, most observations (e.g., patients or subjects) will have missing values for at least one SNP. A common method of dealing with missing SNP data is imputation.

Why impute?

There are two main reasons that one would use imputation:

  1. To replace missing SNP values with what these values are predicted to be, based upon a person’s (or subject’s) available SNP values near the loci of interest. For instance, suppose that in a given data set, patient A is missing SNP 123. This value for patient A could be imputed based on the patient’s other SNP values at loci near 123.

  2. To infer values for SNPs that were not measured at all for any patients (subjects). This would be the case if one were to merge data from studies that examined different loci. For instance, suppose I am merging data from studies A and B. Suppose further that study A measured SNP 123 for all patients, but study B did not. In the merged data, I would need to impute values for SNP 123 for all patients from study B.

For the purposes of this tutorial, let us limit ourselves to scenario (1).

Recall that in the QC step of our analysis, we excluded SNPs with call rates below \(90\%\) (i.e., more than \(10\%\) missingness). However, there may still be SNPs with some missing values. Genotypes that are not missing are described as “called,” and the call rate of a SNP is the proportion of its genotypes that are called. Therefore, a call rate of 1 indicates that a SNP has no missing values.
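In symbols, for a SNP genotyped on \(n\) samples,

\[ \text{call rate} = \frac{\#\{\text{called (non-missing) genotypes}\}}{n}, \]

so a SNP with any missing genotypes has a call rate strictly less than 1.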

First, let’s check how many SNPs in our QC’d data set still have some missing data:

snp_stats <- bigstatsr::big_counts(obj$genotypes)
# ^ bigstatsr is loaded with bigsnpr, just being explicit here for didactic purposes
colnames(snp_stats) <- obj$map$marker.ID
snp_stats[,1:5] # shows # of 0, 1, 2, and NA values for 1st 5 SNPs
#      rs12565286 rs3094315 rs2286139 rs2980319 rs2980300
# 0          1247       958      1018      1057       988
# 1           131       362       316       316       370
# 2             6        66        30        28        25
# <NA>         17        15        37         0        18

(any_missing <- sum(snp_stats[4,] != 0)) # shows # of SNPs with NA values 
# [1] 565138
call_rates <- colSums(snp_stats[1:3,])/colSums(snp_stats) 
head(call_rates)
# rs12565286  rs3094315  rs2286139  rs2980319  rs2980300 rs11240777 
#  0.9878658  0.9892934  0.9735903  1.0000000  0.9871520  0.9992862
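
# average call rate across all SNPs -- about 0.988, as noted below
mean(call_rates)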

This tells us that 565138 out of 696644 SNPs still have some missing values (albeit less than 10% per SNP). The average call rate is 0.988 – this high call rate is a good sign, affirming the quality of our data.

Even though the average call rate is high, having any missing values at all will be a problem for analysis downstream – most analytical approaches cannot handle missing values. To ‘fill in the blanks’, we can use an imputation method: we will look at bigsnpr::snp_fastImputeSimple() as a place to begin, and bigsnpr::snp_fastImpute() for a more advanced technique.

The bigsnpr::snp_fastImputeSimple() function offers 4 approaches to imputation (a toy example follows the list):

  • “mode”: replaces each missing value with the most common genotype (0/1/2) observed for that SNP
  • “mean0”: replaces each missing value with the mean genotype for that SNP, rounded to the nearest whole number
  • “mean2”: replaces each missing value with the mean genotype for that SNP, rounded to 2 decimal places
  • “random”: replaces each missing value with a value sampled according to the allele frequencies
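
To make these options concrete, here is a toy illustration. The vector x below is a made-up mini example (it is not part of our data), but the arithmetic mirrors what each method does per SNP:

# hypothetical genotypes for a single SNP, with one missing value
x <- c(0, 0, 1, 2, NA)
table(x)                        # "mode"  would fill the NA with the most common value, 0
round(mean(x, na.rm = TRUE))    # "mean0" would fill the NA with the rounded mean, 1
round(mean(x, na.rm = TRUE), 2) # "mean2" would fill the NA with the mean to 2 decimals, 0.75
# "random" would fill the NA with a 0, 1, or 2 sampled according to the allele frequencies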

The bigsnpr::snp_fastImpute() function uses an extreme gradient boosting (XGBoost) algorithm for imputing missing values.
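
For reference, here is a minimal (not-run) sketch of how a snp_fastImpute() call could look. It assumes the chromosome codes live in obj$map$chromosome (as they do for data read in with snp_readBed()); note that the function returns per-SNP summaries of missingness and estimated imputation error rather than a new genotype matrix, so you still set the code and save afterwards:

# not run: XGBoost-based imputation (this can take a while on larger data sets)
# infos <- snp_fastImpute(Gna = obj$genotypes,
#                         infos.chr = obj$map$chromosome,
#                         ncores = nb_cores())
# obj$genotypes$code256 <- CODE_IMPUTE_PRED # read the imputed values going forward
# obj <- snp_save(obj)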

For my own research in biostatistics, the simpler imputation methods are sufficient – below, I demonstrate how I would use the mode to impute missing values:

# impute based on mode
obj$geno_imputed <- snp_fastImputeSimple(Gna = obj$genotypes,
                                          method = "mode",
                                          ncores = nb_cores())

# use the code that reads the imputed (rather than missing) values from the backing file
obj$geno_imputed$code256 <- bigsnpr::CODE_IMPUTE_PRED

# look to see that imputation was successful:
imp_snp_stats <- big_counts(X.code = obj$geno_imputed)
imp_snp_stats[,1:5] # all 0s in NA row 
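# programmatic check: count SNPs that still have any missing values
sum(imp_snp_stats[4, ] != 0) # should now be 0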

Let’s save this fully imputed data set for future use in downstream analyses:

obj <- bigsnpr::snp_save(obj)
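
In later modules (or a new R session), the saved object – including the geno_imputed data we just added – can be pulled back up by re-attaching the same .rds file:

# not run: re-attach the saved object in a future session
# obj <- snp_attach("data/qc_penncath.rds")
# obj$geno_imputed # the imputed genotype data saved above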

Further resources

A more complex method of imputation involves the use of reference genome panels in addition to the observed data itself. The basic idea is to use known haplotypes, or groups of alleles inherited together, from reference genomes to give us better estimates of unobserved genotypes. These reference panels typically come from either the 1000 Genomes Project or the HapMap project, both of which are large-scale international efforts to develop haplotype maps of the human genome across diverse populations.

In addition to filling in sporadically missing genotypes at SNPs that were typed in our study (as we did above, where our SNPs of interest had call rates of less than 1), this method of imputation also allows us to estimate genotypes at SNPs that were not measured in a particular study at all. This can be useful for combining genetic data sets in which different SNPs were typed, or for evaluating associations in distinct genetic populations.

It’s important to be aware that this type of imputation is possible and commonly done. However, since it involves its own large array of software and expertise, it is probably something you would want to consult an expert about. The Michigan Imputation Server is a service that will do this more complex imputation for you; it also contains information about the various reference panels, their versions, etc.