Preprocess PLINK files using the bigsnpr package
Usage
process_plink(
  data_dir,
  data_prefix,
  rds_dir = data_dir,
  rds_prefix = NULL,
  logfile = NULL,
  impute = TRUE,
  impute_method = "mode",
  id_var = "IID",
  parallel = TRUE,
  quiet = FALSE,
  overwrite = FALSE,
  ...
)
Arguments
- data_dir
The path to the directory containing the bed/bim/fam data files, without a trailing "/" (e.g., use
data_dir = '~/my_dir', not data_dir = '~/my_dir/')
- data_prefix
The prefix (as a character string) shared by the bed/bim/fam data files (e.g.,
data_prefix = 'mydata')
- rds_dir
The path to the directory in which you want to create the new
.rds and .bk files. Defaults to data_dir
- rds_prefix
String specifying the user's preferred filename for the to-be-created .rds file (will be created inside the
rds_dir folder). If no rds_prefix is provided, the processed data files will be returned in memory. Note: rds_prefix cannot be the same as data_prefix
- logfile
Optional: the name (character string) of the prefix of the logfile to be written in
rds_dir. Defaults to NULL (no log file written). Note: do not append .log to the filename; this is done automatically.
- impute
Logical: should data be imputed? Defaults to TRUE.
- impute_method
If impute = TRUE, this argument specifies the kind of imputation desired. Options are:
  - mode (default): Imputes the most frequent call. See bigsnpr::snp_fastImputeSimple() for details.
  - random: Imputes by sampling according to allele frequencies.
  - mean0: Imputes the rounded mean.
  - mean2: Imputes the mean rounded to 2 decimal places.
  - xgboost: Imputes using an algorithm based on local XGBoost models. See bigsnpr::snp_fastImpute() for details. Note: this can take several minutes, even for a relatively small data set.
- id_var
String specifying which column of the PLINK
.fam file has the unique sample identifiers. Options are "IID" (default) and "FID".
- parallel
Logical: should the computations within this function be run in parallel? Defaults to TRUE. See
count_cores() and ?bigparallelr::assert_cores for more details. In particular, the user should be aware that too much parallelization can make computations slower.
- quiet
Logical: should console messages be silenced? Defaults to FALSE
- overwrite
Logical: if .bk/.rds files already exist for the specified directory/prefix, should these be overwritten? Defaults to FALSE. Set to TRUE if you want to re-process the same data with different settings (e.g., a different imputation method).
- ...
Optional: additional arguments to
bigsnpr::snp_fastImpute() (relevant only if impute_method = 'xgboost')
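To make the simpler impute_method options concrete, here is a minimal base R sketch of per-variant mode, mean0, and mean2 imputation on a toy genotype matrix. It is an illustration only, not the bigsnpr implementation: the helper impute_column() and the matrix X are hypothetical names introduced for this example, and the tie-breaking rule for the mode is an assumption.

```r
# Toy genotype matrix (rows = samples, columns = variants); NA marks a missing call.
# matrix() fills column by column, so each group of four values is one variant.
X <- matrix(c(0, 1, NA, 1,
              2, 2, 2, NA,
              0, NA, 1, 1),
            nrow = 4)

# Hypothetical helper: fill the missing values of one variant.
impute_column <- function(x, method = c("mode", "mean0", "mean2")) {
  method <- match.arg(method)
  observed <- x[!is.na(x)]
  fill <- switch(method,
    # Most frequent observed call; ties go to the smallest value (an assumption).
    mode  = as.numeric(names(which.max(table(observed)))),
    # Mean of observed calls, rounded to a whole number.
    mean0 = round(mean(observed)),
    # Mean of observed calls, rounded to 2 decimal places.
    mean2 = round(mean(observed), 2)
  )
  x[is.na(x)] <- fill
  x
}

# Apply the helper to every column (variant).
X_mode  <- apply(X, 2, impute_column, method = "mode")
X_mean0 <- apply(X, 2, impute_column, method = "mean0")
X_mean2 <- apply(X, 2, impute_column, method = "mean2")
```

Note that mean2 produces non-integer values, which is one reason a genotype matrix of type double is needed downstream.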
Details
Three files are created in the location specified by rds_dir:
- rds_prefix.rds: This is a list with three items:
  (1) X: the filebacked bigmemory::big.matrix object pointing to the imputed genotype data. This matrix has type double, which is important for downstream operations in create_design()
  (2) map: a data.frame with the PLINK bim data (i.e., the variant information)
  (3) fam: a data.frame with the PLINK fam data (i.e., the pedigree information)
- rds_prefix.bk: This is the backing file that stores the numeric data of the genotype matrix.
- rds_prefix.desc: This is the description file, needed to attach the genotype matrix to the R session.
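Because the three files share the rds_prefix, it can be handy to check whether they are already present before deciding on overwrite. The following is a small base R sketch; the directory and prefix values are hypothetical placeholders, and the file names simply follow the naming pattern listed above.

```r
# Hypothetical values; substitute the rds_dir and rds_prefix you passed to process_plink().
rds_dir    <- "~/my_dir"
rds_prefix <- "mydata_processed"

# Build the three expected paths: <rds_prefix>.rds, .bk, and .desc inside rds_dir.
expected_files <- file.path(
  path.expand(rds_dir),
  paste0(rds_prefix, c(".rds", ".bk", ".desc"))
)

# Named logical vector: TRUE where a file is already present on disk.
already_present <- setNames(file.exists(expected_files), basename(expected_files))
print(already_present)
```

If any entry is TRUE and you want to re-process the data, that is the situation in which overwrite = TRUE becomes relevant.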
Note that process_plink() need only be run once for a given set of PLINK
files; in subsequent data analysis/scripts, get_data() will access the .rds file.
For an example, see the vignette on processing PLINK files.
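As a quick orientation, a call might look like the sketch below. It uses only the arguments documented on this page; the directory, prefixes, and log name are hypothetical placeholders, and the guard around the call is an addition for this example so that the snippet does nothing when the function or the input files are unavailable.

```r
# Hypothetical locations and names; replace with your own.
data_dir    <- "~/my_dir"          # no trailing "/"
data_prefix <- "mydata"            # expects mydata.bed, mydata.bim, mydata.fam
rds_prefix  <- "mydata_processed"  # must differ from data_prefix

bed_file <- file.path(path.expand(data_dir), paste0(data_prefix, ".bed"))

# Only run when process_plink() is available in the session and the input exists.
if (exists("process_plink") && file.exists(bed_file)) {
  process_plink(
    data_dir      = data_dir,
    data_prefix   = data_prefix,
    rds_prefix    = rds_prefix,
    logfile       = "process_log",  # hypothetical name; ".log" is appended automatically
    impute        = TRUE,
    impute_method = "mode",
    id_var        = "IID",
    overwrite     = FALSE
  )
}
```

Here rds_dir is left at its default, so the output files are written next to the PLINK files in data_dir.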