Skip to contents
library(plmmr)
#> Loading required package: bigalgebra
#> Loading required package: bigmemory

A widely-used format for storing data from genome-wide association studies (GWAS) is the PLINK file formats, which consists of a triplet of ‘.bed’, ‘.bed’, and ‘.fam’ files. The plmmr package is equipped to analyze data from PLINK files. If you have data in this format, keep reading – if you want to know more about what each of these files contains, see this other tutorial or the PLINK documentation. If your data is in delimited files (e.g., .txt, .csv, etc.), read the article on analyzing data from delimited files at vignette("delim_files", package = "plmmr").

The plmmr package is designed to handle data so that users can analyze large data sets. For this reason, data must be preprocessed into a specific format. There are two steps to prepare for analysis: (1) process the data and (2) create a design. Processing the data means that we take the feature data and create a ‘.rds’ object that contains your feature data in a format compatible with the bigmemory package. Creating a design involves standardizing the input of the features, outcome, and penalty factor into the modeling functions plmm() and cv_plmm().

First, unzip your PLINK files if they are zipped. Our example data, penncath_lite data that ships with plmmr is zipped; if you are on MacOS or Linux, you can run this command to unzip:

temp_dir <- tempdir() # using a temp dir -- change to fit your preference
unzip_example_data(outdir = temp_dir)
#> Unzipped files are saved in /tmp/Rtmp7yTm6J

For GWAS data, we have to tell plmmr how to combine information across all three PLINK files (the .bed, .bim, and .fam files). We do this with process_plink().

Here, we will create the files we want in a temporary directory just for the sake of example. Users can specify the folder of their choice for rds_dir, as shown below:

# temp_dir <- tempdir() # using a temporary directory (if you didn't already create one above)
plink_data <- process_plink(data_dir = temp_dir, 
                            data_prefix = "penncath_lite",
                            rds_dir = temp_dir,
                            rds_prefix = "imputed_penncath_lite",
                            # imputing the mode to address missing values
                            impute_method = "mode",
                            # overwrite existing files in temp_dir
                            # (you can turn this feature off if you need to)
                            overwrite = TRUE,
                            # turning off parallelization - leaving this on causes problems knitting this vignette
                            parallel = FALSE)
#> 
#> Preprocessing penncath_lite data:
#> Creating penncath_lite.rds
#> 
#> There are 1401 observations and 4367 genomic features in the specified data files, representing chromosomes 1 - 22 
#> There are a total of 3514 SNPs with missing values
#> Of these, 13 are missing in at least 50% of the samples
#> 
#> Imputing the missing (genotype) values using mode method
#> 
#> process_plink() completed 
#> Processed files now saved as /tmp/Rtmp7yTm6J/imputed_penncath_lite.rds

You’ll see a lot of messages printed to the console here … the result of all this is the creation of 3 files: imputed_penncath_lite.rds and imputed_penncath_lite.bk contain the data. 1 These will show up in the folder where the PLINK data is. What is returned is a filepath. The .rds object at this filepath contains the processed data, which we will now use to create our design.

For didactic purposes, let’s examine what’s in imputed_penncath_lite.rds using the readRDS() function (Note Don’t do this in your analysis - the section below reads the data into memory. This is just for illustration):

pen <- readRDS(plink_data) # notice: this is a `processed_plink` object 
str(pen) # note: genotype data is *not* in memory
#> List of 5
#>  $ X  :Formal class 'big.matrix.descriptor' [package "bigmemory"] with 1 slot
#>   .. ..@ description:List of 13
#>   .. .. ..$ sharedType: chr "FileBacked"
#>   .. .. ..$ filename  : chr "imputed_penncath_lite.bk"
#>   .. .. ..$ dirname   : chr "/tmp/Rtmp7yTm6J/"
#>   .. .. ..$ totalRows : int 1401
#>   .. .. ..$ totalCols : int 4367
#>   .. .. ..$ rowOffset : num [1:2] 0 1401
#>   .. .. ..$ colOffset : num [1:2] 0 4367
#>   .. .. ..$ nrow      : num 1401
#>   .. .. ..$ ncol      : num 4367
#>   .. .. ..$ rowNames  : NULL
#>   .. .. ..$ colNames  : NULL
#>   .. .. ..$ type      : chr "double"
#>   .. .. ..$ separated : logi FALSE
#>  $ map:'data.frame': 4367 obs. of  6 variables:
#>   ..$ chromosome  : int [1:4367] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ marker.ID   : chr [1:4367] "rs3107153" "rs2455124" "rs10915476" "rs4592237" ...
#>   ..$ genetic.dist: int [1:4367] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ physical.pos: int [1:4367] 2056735 3188505 4275291 4280630 4286036 4302161 4364564 4388885 4606471 4643688 ...
#>   ..$ allele1     : chr [1:4367] "C" "T" "T" "G" ...
#>   ..$ allele2     : chr [1:4367] "T" "C" "C" "A" ...
#>  $ fam:'data.frame': 1401 obs. of  6 variables:
#>   ..$ family.ID  : int [1:1401] 10002 10004 10005 10007 10008 10009 10010 10011 10012 10013 ...
#>   ..$ sample.ID  : int [1:1401] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ paternal.ID: int [1:1401] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ maternal.ID: int [1:1401] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ sex        : int [1:1401] 1 2 1 1 1 1 1 2 1 2 ...
#>   ..$ affection  : int [1:1401] 1 1 2 1 2 2 2 1 2 -9 ...
#>  $ n  : int 1401
#>  $ p  : int 4367
#>  - attr(*, "class")= chr "processed_plink"
# notice: no more missing values in X
any(is.na(pen$genotypes[,]))
#> [1] FALSE

Creating a design

Now we are ready to create a plmm_design, which is an object with the pieces we need for our model: a design matrix \mathbf{X}, an outcome vector \mathbf{y}, and a the vector with the penalty factor indicators (1 = feature will be penalized, 0 = feature will not be penalized).

As a side note: in GWAS studies, it is typical to include some non-genomic factors as unpenalized covariates as part of the model. For instance, you may want to adjust for sex or age – these are factors that you want to ensure are always included in the selected model. The plmmr package allows you to include these additional unpenalized predictors via the ‘add_predictor’ and ‘predictor_id’ options, both of which are passed through create_design() to the internal function create_design_filebacked(). An example with these options is included in the create_design() documentation.

# get outcome data 
penncath_pheno <- read.csv(find_example_data(path = 'penncath_clinical.csv'))

phen <- data.frame(FamID = as.character(penncath_pheno$FamID),
                   CAD = penncath_pheno$CAD)

pen_design <- create_design(data_file = plink_data,
                            feature_id = "FID",
                            rds_dir = temp_dir,
                            new_file = "std_penncath_lite",
                            add_outcome = phen,
                            outcome_id = "FamID",
                            outcome_col = "CAD",
                            logfile = "design",
                            # again, overwrite if needed; use with caution
                            overwrite = TRUE)
#> There are 62 constant features in the data
#> Subsetting data to exclude constant features (e.g., monomorphic SNPs)
#> Column-standardizing the design matrix...
#> Standardization completed at 2024-10-23 22:07:33
#> Done with standardization. File formatting in progress

# examine the design - notice the components of this object 
pen_design_rds <- readRDS(pen_design)
str(pen_design_rds)
#> List of 15
#>  $ X_colnames    : chr [1:4367] "rs3107153" "rs2455124" "rs10915476" "rs4592237" ...
#>  $ X_rownames    : chr [1:1401] "10002" "10004" "10005" "10007" ...
#>  $ n             : int 1401
#>  $ p             : int 4367
#>  $ outcome_idx   : int [1:1401] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ y             : Named int [1:1401] 1 1 1 1 1 1 1 1 1 0 ...
#>   ..- attr(*, "names")= chr [1:1401] "CAD1" "CAD2" "CAD3" "CAD4" ...
#>  $ std_X_rownames: chr [1:1401] "10002" "10004" "10005" "10007" ...
#>  $ ns            : int [1:4305] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ std_X_colnames: chr [1:4305] "rs3107153" "rs2455124" "rs10915476" "rs4592237" ...
#>  $ std_X         :Formal class 'big.matrix.descriptor' [package "bigmemory"] with 1 slot
#>   .. ..@ description:List of 13
#>   .. .. ..$ sharedType: chr "FileBacked"
#>   .. .. ..$ filename  : chr "std_penncath_lite.bk"
#>   .. .. ..$ dirname   : chr "/tmp/Rtmp7yTm6J/"
#>   .. .. ..$ totalRows : int 1401
#>   .. .. ..$ totalCols : int 4305
#>   .. .. ..$ rowOffset : num [1:2] 0 1401
#>   .. .. ..$ colOffset : num [1:2] 0 4305
#>   .. .. ..$ nrow      : num 1401
#>   .. .. ..$ ncol      : num 4305
#>   .. .. ..$ rowNames  : NULL
#>   .. .. ..$ colNames  : NULL
#>   .. .. ..$ type      : chr "double"
#>   .. .. ..$ separated : logi FALSE
#>  $ std_X_n       : num 1401
#>  $ std_X_p       : num 4305
#>  $ std_X_center  : num [1:4305] 0.00785 0.35974 1.01213 0.06067 0.46253 ...
#>  $ std_X_scale   : num [1:4305] 0.0883 0.7783 0.8636 0.28 1.2791 ...
#>  $ penalty_factor: num [1:4305] 1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "class")= chr "plmm_design"

A key part of what create_design() is doing is standardizing the columns of the genotype matrix. Below is a didactic example showing that the columns of the std_X element in our design have mean = 0 and variance = 1. Note again that this is not something you should do in your analysis – this reads the data into memory.

# we can check to see that our data have been standardized 
std_X <- attach.big.matrix(pen_design_rds$std_X)
colMeans(std_X[,]) |> summary() # columns have mean zero...
#>       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
#> -1.356e-16 -2.334e-17  3.814e-19  9.868e-19  2.520e-17  2.635e-16
apply(std_X[,], 2, var) |> summary() # ... & variance 1
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.001   1.001   1.001   1.001   1.001   1.001

Fitting a model

Now that we have a design object, we are ready to fit a model. By default, the model fitting results are saved as files in the folder specified in the rds_dir argument of plmmm. If you want to return the model fitting results, set return_fit = TRUE in plmm().

pen_fit <- plmm(design = pen_design,
                trace = T,
                return_fit = T)
#> Note: The design matrix is being returned as a file-backed big.matrix object -- see bigmemory::big.matrix() documentation for details.
#> Reminder: the X that is returned here is column-standardized
#> Input data passed all checks at  2024-10-23 22:07:34
#> Starting decomposition.
#> Calculating the eigendecomposition of K
#> Eigendecomposition finished at  2024-10-23 22:07:36
#> Beginning rotation ('preconditioning').
#> Rotation (preconditiong) finished at  2024-10-23 22:07:36
#> Setting up lambda/preparing for model fitting.
#> Beginning model fitting.
#> Model fitting finished at  2024-10-23 22:07:39 
#> Beta values are estimated -- almost done!
#> Formatting results (backtransforming coefs. to original scale).
#> Model ready at  2024-10-23 22:07:40
# you can turn off the trace messages by letting trace = F (default)

We examine our model results below:

summary(pen_fit, idx = 50)
#> lasso-penalized regression model with n=1401, p=4368 at lambda=0.01217
#> -------------------------------------------------
#> The model converged 
#> -------------------------------------------------
#> # of non-zero coefficients:  534 
#> -------------------------------------------------

plot(pen_fit)

Cross validation

To choose a tuning parameter for a model, plmmr offers a cross validation method:

cv_fit <- cv_plmm(design = pen_design,
                return_fit = T)

There are plot and summary methods for CV models as well:

summary(cv_fit) # summary at lambda value that minimizes CV error
#> lasso-penalized model with n=1401 and p=4368
#> At minimum cross-validation error (lambda=0.0408):
#> -------------------------------------------------
#>   Nonzero coefficients: 5
#>   Cross-validation error (deviance): 0.22
#>   Scale estimate (sigma): 0.471
plot(cv_fit)

The call to create_design() involves these steps:

  • Integrate in the external phenotype information, if supplied. Note: Any samples in the PLINK data that do not have a phenotype value in the specified additional phenotype file will be removed from the analysis.

  • Identify missing values in both samples and SNPs/features.

  • Impute missing values per user’s specified method. See R documentation for bigsnpr::snp_fastImputeSimple() for more details. Note: the plmmr package cannot fit models if datasets have missing values. All missing values must be imputed or subset out before analysis.

  • Integrate in the external predictor information, if supplied. This could be a matrix of meta-data (e.g., age, principal components, etc.). Note: If there are samples in the supplied file that are not included in the PLINK data, these will be removed. For example, if you have more phenotyped participants than genotyped participants in your study, plmmr::create_design() will create a matrix of data representing all the genotyped samples that also have data in the supplied external phenotype file.

  • Create a design matrix that represents the nonsingular features and the samples that have predictor and phenotype information available (in the case where external data are supplied).

  • Standardize the design matrix so that all columns have mean of 0 and variance of 1.