biglasso extends lasso and elastic-net linear and logistic regression models for ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into memory. It uses memory-mapped files to store the massive data on disk and reads them into memory only when necessary during model fitting. Moreover, advanced feature screening rules are proposed and implemented to accelerate model fitting. As a result, this package is much more memory- and computation-efficient and highly scalable compared to existing lasso-fitting packages such as glmnet and ncvreg. Benchmarking experiments using both simulated and real data sets show that biglasso is not only 1.5x to 4x faster than existing packages, but also at least 2x more memory-efficient. More importantly, to the best of our knowledge, biglasso is the first R package that enables users to fit lasso models with data sets that are larger than available RAM, thus allowing for powerful big data analysis on an ordinary laptop.
To install the latest stable release version from CRAN:
install.packages("biglasso")
To install the latest development version from GitHub:
remotes::install_github("pbreheny/biglasso")
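Once installed, a basic fit might look like the following. This is a minimal sketch on toy simulated data: the dimensions, the coefficients, and the temporary file names are assumptions chosen for illustration. It stores the design matrix in a file-backed big.matrix via the bigmemory package and passes it to biglasso(). The block is guarded so that it is skipped if either package is unavailable.

```r
# Minimal usage sketch on toy data (all sizes and names are illustrative).
fit <- NULL
if (requireNamespace("biglasso", quietly = TRUE) &&
    requireNamespace("bigmemory", quietly = TRUE)) {
  library(bigmemory)
  library(biglasso)

  set.seed(1)
  n <- 100
  p <- 500
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  y <- as.numeric(X[, 1:5] %*% rep(1, 5) + 0.1 * rnorm(n))

  # Unique temporary file names so repeated runs do not collide.
  stem <- basename(tempfile("X_"))

  # File-backed (memory-mapped) big.matrix: the data live on disk.
  X.bm <- as.big.matrix(X,
                        backingfile    = paste0(stem, ".bin"),
                        descriptorfile = paste0(stem, ".desc"),
                        backingpath    = tempdir())

  # Fit the lasso path for a linear model.
  fit <- biglasso(X.bm, y, family = "gaussian")
  print(length(fit$lambda))  # number of lambda values on the path
  print(dim(fit$beta))       # coefficients: one column per lambda
}
```

For data that do not fit in memory, the big.matrix would be created from a file on disk rather than from an in-memory R matrix as done here.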
biglasso is at least 2x more memory-efficient than glmnet.

Packages compared: biglasso (1.4-0), glmnet (4.0-2), ncvreg (3.12-0), and picasso (1.3-1).

Experiments: the lasso is solved over 100 lambda values equally spaced on the log scale of lambda / lambda_max from 0.1 to 1; the number of observations n and the number of features p are varied; the mean computing time (in seconds) over 20 replications is reported.

Data-generating model: y = X * beta + 0.1 eps, where X and eps are i.i.d. sampled from N(0, 1).

biglasso is more computation-efficient:
In all the settings, biglasso (1 core) is uniformly faster than picasso, glmnet and ncvreg. When the data gets bigger, biglasso achieves a 6-9x speedup compared to the other packages. Moreover, the computing time of biglasso can be further reduced by half via parallel computation on multiple cores.
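The simulation setup behind these timings (the data-generating model and the lambda grid) can be sketched in base R. The dimensions, the number of nonzero coefficients `s`, and the coefficient values are assumptions, since the text does not state them; the lambda_max formula assumes the common lasso objective (1/(2n)) * RSS + lambda * ||beta||_1 with standardized columns and a centered response.

```r
set.seed(2024)

# Toy dimensions; the benchmarks vary n and p over much larger values.
n <- 200
p <- 1000
s <- 20  # number of nonzero coefficients (assumption; not stated in the text)

X    <- matrix(rnorm(n * p), nrow = n, ncol = p)  # X   ~ i.i.d. N(0, 1)
eps  <- rnorm(n)                                  # eps ~ i.i.d. N(0, 1)
beta <- c(rep(1, s), rep(0, p - s))               # assumed sparse signal
y    <- as.numeric(X %*% beta + 0.1 * eps)        # y = X * beta + 0.1 eps

# 100 ratios lambda / lambda_max, equally spaced on the log scale,
# running from 1 down to 0.1.
ratio <- exp(seq(log(1), log(0.1), length.out = 100))

# lambda_max: the smallest lambda at which every coefficient is zero,
# under the objective assumed above.
Xs         <- scale(X)
lambda_max <- max(abs(crossprod(Xs, y - mean(y)))) / n
lambda     <- ratio * lambda_max
```

The resulting `lambda` vector is decreasing, which is the order in which pathwise solvers typically traverse the grid.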
biglasso is more memory-efficient:

To demonstrate that biglasso is much more memory-efficient, we simulate a large 1000 x 100000 feature matrix. The raw data is 0.75 GB. We used Syrupy to measure the memory used in RAM (i.e., the resident set size, RSS) every second during lasso model fitting by each of the packages.
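The raw data size quoted above can be checked with a short calculation. This sketch assumes that the matrix is stored as 8-byte double-precision values and that GB is counted as 2^30 bytes; neither is stated in the text.

```r
n <- 1000
p <- 100000
bytes_per_double <- 8  # assumption: double-precision storage

# Size of the raw matrix in GB (taking 1 GB = 2^30 bytes).
raw_gb <- n * p * bytes_per_double / 2^30
print(round(raw_gb, 2))
```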
The maximum RSS (in GB) used by a single fit and by 10-fold cross-validation is reported in the table below. In the single fit case, biglasso consumes 0.60 GB of RAM, 23% of that used by glmnet and 24% of that used by ncvreg. Note that the memory consumed by glmnet and ncvreg is respectively 3.4x and 3.3x the size of the raw data. biglasso also requires less additional memory to perform cross-validation than the other packages. For serial 10-fold cross-validation, biglasso requires just 31% of the memory used by glmnet and 21% of that used by ncvreg, making it 3.2x and 4.8x more memory-efficient than these two, respectively.
| Package | picasso | ncvreg | glmnet | biglasso |
|---|---|---|---|---|
| Single fit | 0.74 | 2.47 | 2.57 | 0.60 |
| 10-fold CV | - | 4.62 | 3.11 | 0.96 |
Note:
* The memory savings offered by biglasso would be even more significant if cross-validation were conducted in parallel. However, measuring memory usage across parallel processes is not straightforward and is not implemented in Syrupy.
* Cross-validation is not implemented in picasso at this point.
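The relative memory figures can be recomputed directly from the table. The sketch below copies the table values into a data frame (the column names are made up for illustration) and takes the raw data size to be 0.75 GB, as stated earlier.

```r
# Maximum RSS in GB, copied from the table; NA marks the missing picasso CV entry.
mem <- data.frame(
  package = c("picasso", "ncvreg", "glmnet", "biglasso"),
  single  = c(0.74, 2.47, 2.57, 0.60),
  cv10    = c(NA, 4.62, 3.11, 0.96)
)
raw_gb <- 0.75  # raw data size stated in the text

big <- mem[mem$package == "biglasso", ]

# biglasso's memory use as a share of each package's memory use.
mem$single_share <- big$single / mem$single
mem$cv_share     <- big$cv10 / mem$cv10

# Single-fit memory use as a multiple of the raw data size.
mem$single_vs_raw <- mem$single / raw_gb

print(mem)
```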
The performance of the packages is also tested using diverse real data sets:
* Breast cancer gene expression data (GENE);
* MNIST handwritten image data (MNIST);
* Cardiac fibrosis genome-wide association study data (GWAS);
* Subset of New York Times bag-of-words data (NYT).
The following table summarizes the mean (SE) computing time (in seconds) of solving the lasso along the entire path of 100 lambda values equally spaced on the log scale of lambda / lambda_max from 0.1 to 1 over 20 replications.
| Package | GENE | MNIST | GWAS | NYT |
|---|---|---|---|---|
| n | 536 | 784 | 313 | 5,000 |
| p | 17,322 | 60,000 | 660,495 | 55,000 |
| picasso | 0.67 (0.02) | 2.94 (0.01) | 14.96 (0.01) | 15.91 (0.16) |
| ncvreg | 0.87 (0.01) | 4.22 (0.00) | 19.78 (0.01) | 25.59 (0.12) |
| glmnet | 0.74 (0.01) | 3.82 (0.01) | 16.19 (0.01) | 24.94 (0.16) |
| biglasso | 0.31 (0.01) | 0.61 (0.02) | 4.82 (0.01) | 5.91 (0.78) |
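To read the table as speedups, each package's mean time can be divided by the biglasso time for the same data set. The sketch below uses only the mean times from the table (standard errors are ignored); the object names are illustrative.

```r
# Mean computing times in seconds, copied from the table.
times <- rbind(
  picasso  = c(0.67, 2.94, 14.96, 15.91),
  ncvreg   = c(0.87, 4.22, 19.78, 25.59),
  glmnet   = c(0.74, 3.82, 16.19, 24.94),
  biglasso = c(0.31, 0.61, 4.82, 5.91)
)
colnames(times) <- c("GENE", "MNIST", "GWAS", "NYT")

# Speedup of biglasso over each other package, per data set.
others  <- times[rownames(times) != "biglasso", ]
speedup <- sweep(others, 2, times["biglasso", ], "/")

print(round(speedup, 1))
```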
To demonstrate the out-of-core computing capability of biglasso, a 96 GB real data set from a large-scale genome-wide association study is analyzed. The dimensionality of the design matrix is n = 973, p = 11,830,470. Note that the data set is 3x the size of the installed 32 GB of RAM.
Since the other three packages cannot handle this larger-than-RAM case, we compare the performance of the screening rules SSR and Adaptive within biglasso. In addition, two cases in terms of lambda_min are considered: (1) lam_min = 0.1 lam_max; and (2) lam_min = 0.5 lam_max, since in practice there is typically less interest in lower values of lambda for very high-dimensional data such as this. Again the entire solution path with 100 lambda values is obtained. The table below summarizes the overall computing time (in minutes) for the screening rule SSR (which is what the other three packages use) and our new rule Adaptive. (No replication is conducted.)
| Cases | SSR | Adaptive |
|---|---|---|
| lam_min / lam_max = 0.1, 1 core | 189.67 | 66.05 |
| lam_min / lam_max = 0.1, 4 cores | 86.31 | 46.91 |
| lam_min / lam_max = 0.5, 1 core | 177.84 | 24.84 |
| lam_min / lam_max = 0.5, 4 cores | 85.67 | 15.14 |
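A comparison of this kind could be set up along the following lines. This is a small-scale sketch on simulated data, not the GWAS analysis itself: the data, dimensions, and object names are assumptions, and it presumes a biglasso version in which the `screen` argument accepts "SSR" and "Adaptive" (as in the 1.4 series). The block is guarded so that it is skipped if the packages are unavailable.

```r
# Time the SSR and Adaptive screening rules for two values of lambda.min.
timings <- NULL
if (requireNamespace("biglasso", quietly = TRUE) &&
    requireNamespace("bigmemory", quietly = TRUE)) {
  library(bigmemory)
  library(biglasso)

  set.seed(42)
  n <- 200
  p <- 2000  # toy dimensions, far smaller than the GWAS data
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  y <- as.numeric(X[, 1:10] %*% rep(0.5, 10) + 0.1 * rnorm(n))
  X.bm <- as.big.matrix(X)

  # One row per (screening rule, lambda.min) combination.
  timings <- expand.grid(screen = c("SSR", "Adaptive"),
                         lambda.min = c(0.1, 0.5),
                         stringsAsFactors = FALSE)
  timings$seconds <- NA_real_

  for (i in seq_len(nrow(timings))) {
    elapsed <- system.time(
      biglasso(X.bm, y, family = "gaussian",
               screen = timings$screen[i],
               lambda.min = timings$lambda.min[i],
               nlambda = 100,
               ncores = 1)  # raise ncores to use several cores
    )[["elapsed"]]
    timings$seconds[i] <- elapsed
  }
  print(timings)
}
```

On data this small the timing differences are unlikely to be meaningful; the sketch only shows how the settings in the table map onto function arguments.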