Skip to contents

Performs k-fold cross validation for lasso-, MCP-, or SCAD-penalized linear mixed models over a grid of values for the regularization parameter lambda.

Usage

cv_plmm(
  design,
  y = NULL,
  K = NULL,
  diag_K = NULL,
  eta_star = NULL,
  penalty = "lasso",
  type = "blup",
  gamma,
  alpha = 1,
  lambda_min,
  nlambda = 100,
  lambda,
  eps = 1e-04,
  max_iter = 10000,
  convex = TRUE,
  dfmax = NULL,
  warn = TRUE,
  init = NULL,
  cluster,
  nfolds = 5,
  seed,
  fold = NULL,
  returnY = FALSE,
  returnBiasDetails = FALSE,
  trace = FALSE,
  save_rds = NULL,
  save_fold_res = FALSE,
  return_fit = TRUE,
  compact_save = FALSE,
  ...
)

Arguments

design

The first argument must be one of three things: (1) plmm_design object (as created by create_design()) (2) a string with the file path to a design object (the file path must end in '.rds') (3) a matrix or data.frame object representing the design matrix of interest

y

Optional: In the case where design is a matrix or data.frame, the user must also supply a numeric outcome vector as the y argument. In this case, design and y will be passed internally to create_design(X = design, y = y).

K

Similarity matrix used to rotate the data. This should either be (1) a known matrix that reflects the covariance of y, (2) an estimate (Default is \(\frac{1}{p}(XX^T)\)), or (3) a list with components 'd' and 'u', as returned by choose_k().

diag_K

Logical: should K be a diagonal matrix? This would reflect observations that are unrelated, or that can be treated as unrelated. Defaults to FALSE. Note: plmm() does not check to see if a matrix is diagonal. If you want to use a diagonal K matrix, you must set diag_K = TRUE.

eta_star

Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1.

penalty

The penalty to be applied to the model. Either "lasso" (the default), "SCAD", or "MCP".

type

A character argument indicating what should be returned from predict.plmm(). If type == 'lp', predictions are based on the linear predictor, X beta. If type == 'blup', predictions are based on the sum of the linear predictor and the estimated random effect (BLUP). Defaults to 'blup', as this has shown to be a superior prediction method in many applications.

gamma

The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD.

alpha

Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. alpha=1 is equivalent to MCP/SCAD penalty, while alpha=0 would be equivalent to ridge regression. However, alpha=0 is not supported; alpha may be arbitrarily small, but not exactly 0.

lambda_min

The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise.

nlambda

Length of the sequence of lambda. Default is 100.

lambda

A user-specified sequence of lambda values. By default, a sequence of values of length nlambda is computed, equally spaced on the log scale.

eps

Convergence threshold. The algorithm iterates until the RMSD for the change in linear predictors for each coefficient is less than eps. Default is 1e-4.

max_iter

Maximum number of iterations (total across entire path). Default is 10000.

convex

(future idea; not yet incorporated) Calculate index for which objective function ceases to be locally convex? Default is TRUE.

dfmax

(future idea; not yet incorporated) Upper bound for the number of nonzero coefficients. Default is no upper bound. However, for large data sets, computational burden may be heavy for models with a large number of nonzero coefficients.

warn

Return warning messages for failures to converge and model saturation? Default is TRUE.

init

Initial values for coefficients. Default is 0 for all columns of X.

cluster

cv_plmm() can be run in parallel across a cluster using the parallel package. The cluster must be set up in advance using parallel::makeCluster(). The cluster must then be passed to cv_plmm().

nfolds

The number of cross-validation folds. Default is 5.

seed

You may set the seed of the random number generator in order to obtain reproducible results.

fold

Which fold each observation belongs to. By default, the observations are randomly assigned.

returnY

Should cv_plmm() return the linear predictors from the cross-validation folds? Default is FALSE; if TRUE, this will return a matrix in which the element for row i, column j is the fitted value for observation i from the fold in which observation i was excluded from the fit, at the jth value of lambda.

returnBiasDetails

Logical: should the cross-validation bias (numeric value) and loss (n x p matrix) be returned? Defaults to FALSE.

trace

If set to TRUE, inform the user of progress by announcing the beginning of each CV fold. Default is FALSE.

save_rds

Optional: if a filepath and name without the '.rds' suffix is specified (e.g., save_rds = "~/dir/my_results"), then the model results are saved to the provided location (e.g., "~/dir/my_results.rds"). Defaults to NULL, which does not save the result.

save_fold_res

Optional: a logical value indicating whether the results (loss and predicted values) from each CV fold should be saved? If TRUE, then two '.rds' files will be saved ('loss' and 'yhat') will be created in the same directory as 'save_rds'. Both files will be updated after each fold is done. Defaults to FALSE.

return_fit

Optional: a logical value indicating whether the fitted model should be returned as a plmm object in the current (assumed interactive) session. Defaults to TRUE.

compact_save

Optional: if TRUE, three separate .rds files will saved: one with the 'beta_vals', one with 'K', and one with everything else (see below). Defaults to FALSE. Note: you must specify save_rds for this argument to be called.

...

Additional arguments to plmm_fit

Value

a list with 12 items:

  • type: the type of prediction used ('lp' or 'blup')

  • cve: numeric vector with the cross validation error (CVE) at each value of lambda

  • cvse: numeric vector with the estimated standard error associated with each value of for cve

  • fold: numeric n length vector of integers indicating the fold to which each observation was assigned

  • lambda: numeric vector of lambda values

  • fit: the overall fit of the object, including all predictors; this is a list as returned by plmm()

  • min: The index corresponding to the value of lambda that minimizes cve

  • lambda_min: The lambda value at which cve is minmized

  • min1se: The index corresponding to the value of lambda within standard error of that which minimizes cve

  • lambda1se: largest value of lambda such that error is within 1 standard error of the minimum.

  • null.dev: numeric value representing the deviance for the intercept-only model. If you have supplied your own lambda sequence, this quantity may not be meaningful.

  • estimated_Sigma: an n x n matrix representing the estimated covariance matrix.

Examples

admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design, return_fit = TRUE)
print(summary(cv_fit))
#> lasso-penalized model with n=197 and p=101
#> At minimum cross-validation error (lambda=0.2411):
#> -------------------------------------------------
#>   Nonzero coefficients: 5
#>   Cross-validation error (deviance): 1.99
#>   Scale estimate (sigma): 1.410
plot(cv_fit)


# Note: for examples with filebacked data, see the filebacking vignette
# https://pbreheny.github.io/plmmr/articles/filebacking.html