Preprocess PLINK files using the bigsnpr
package
Usage
process_plink(
data_dir,
data_prefix,
rds_dir = data_dir,
rds_prefix,
logfile = NULL,
impute = TRUE,
impute_method = "mode",
id_var = "IID",
parallel = TRUE,
quiet = FALSE,
overwrite = FALSE,
...
)
Arguments
- data_dir
The path to the bed/bim/fam data files, without a trailing "/" (e.g., use
data_dir = '~/my_dir'
, notdata_dir = '~/my_dir/'
)- data_prefix
The prefix (as a character string) of the bed/fam data files (e.g.,
data_prefix = 'mydata'
)- rds_dir
The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to
data_dir
- rds_prefix
String specifying the user's preferred filename for the to-be-created .rds file (will be create insie
rds_dir
folder) Note: 'rds_prefix' cannot be the same as 'data_prefix'- logfile
Optional: the name (character string) of the prefix of the logfile to be written in 'rds_dir'. Default to NULL (no log file written). Note: if you supply a file path in this argument, it will error out with a "file not found" error. Only supply the string; e.g., if you want my_log.log, supply 'my_log', the my_log.log file will appear in rds_dir.
- impute
Logical: should data be imputed? Default to TRUE.
- impute_method
If 'impute' = TRUE, this argument will specify the kind of imputation desired. Options are: * mode (default): Imputes the most frequent call. See
bigsnpr::snp_fastImputeSimple()
for details. * random: Imputes sampling according to allele frequencies. * mean0: Imputes the rounded mean. * mean2: Imputes the mean rounded to 2 decimal places. * xgboost: Imputes using an algorithm based on local XGBoost models. Seebigsnpr::snp_fastImpute()
for details. Note: this can take several minutes, even for a relatively small data set.- id_var
String specifying which column of the PLINK
.fam
file has the unique sample identifiers. Options are "IID" (default) and "FID"- parallel
Logical: should the computations within this function be run in parallel? Defaults to TRUE. See
count_cores()
and?bigparallelr::assert_cores
for more details. In particular, the user should be aware that too much parallelization can make computations slower.- quiet
Logical: should messages to be printed to the console be silenced? Defaults to FALSE
- overwrite
Logical: if existing
.bk
/.rds
files exist for the specified directory/prefix, should these be overwritten? Defaults to FALSE. Set to TRUE if you want to change the imputation method you're using, etc. Note: If there are multiple.rds
files with names that start with "std_prefix_...", this will error out. To protect users from accidentally deleting files with saved results, only one.rds
file can be removed with this option.- ...
Optional: additional arguments to
bigsnpr::snp_fastImpute()
(relevant only if impute_method = "xgboost")
Details
Three files are created in the location specified by rds_dir
:
'rds_prefix.rds': This is a list with three items: (1)
X
: the filebackedbigmemory::big.matrix
object pointing to the imputed genotype data. This matrix has type 'double', which is important for downstream operations increate_design()
(2)map
: a data.frame with the PLINK 'bim' data (i.e., the variant information) (3)fam
: a data.frame with the PLINK 'fam' data (i.e., the pedigree information)'prefix.bk': This is the backingfile that stores the numeric data of the genotype matrix
'rds_prefix.desc'" This is the description file, as needed by the
Note that process_plink()
need only be run once for a given set of PLINK
files; in subsequent data analysis/scripts, get_data()
will access the '.rds' file.
For an example, see vignette on processing PLINK files