vignettes/web/inference-local.rmd
Classical, single-feature hypothesis testing approaches rely upon tail-area probabilities, or the probability that a test statistic, $z$, exceeds a certain value. In contrast, local approaches like local mfdr base inference on that feature's specific value of $z$ without considering the hypothetical possibility of more extreme results.
Local false discovery rates are a Bayesian idea that can be implemented in large-scale testing situations by using empirical Bayes methods to estimate the probability that the null hypothesis is true, conditional upon the exact value of the observed test statistic $z_j$. This probability is defined as the local false discovery rate for feature $j$.
Using Bayes' rule, we have

$$\text{mfdr}_j \equiv \Pr(H_{0j} \mid z_j) = \frac{\pi_0 f_0(z_j)}{f(z_j)},$$

where $\pi_0$ is the prior probability of a true null hypothesis for the collection of tests, $f_0$ is the theoretical density of test statistics under the null, $f_1$ is the density of non-null test statistics, and $f = \pi_0 f_0 + (1 - \pi_0) f_1$ is the resulting mixture density.
A variety of estimators are possible depending on how one goes about estimating this mixture of densities. One simple approach, currently used by `ncvreg`, is to set $\pi_0 = 1$ and to avoid estimating $f_1$ by estimating only the marginal density $f$ using a kernel density approach. Thus:

$$\widehat{\text{mfdr}}_j = \frac{f_0(z_j)}{\hat{f}(z_j)}.$$

In situations where $\hat{f}(z_j) < f_0(z_j)$, local mfdr estimates are capped at 1.
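
As a concrete illustration of this ratio-of-densities estimator (a minimal sketch, not the package's internal implementation, which is handled by `summary()`), the calculation might look like the following in base R; the data and all object names (`z`, `f0`, `f_hat`, `mfdr_hat`) are hypothetical:

```r
# Hypothetical vector of test statistics: a few signals among mostly nulls
set.seed(1)
z <- c(rnorm(950), rnorm(50, mean = 4))

# Theoretical null density f0: standard normal
f0 <- dnorm(z)

# Marginal (mixture) density f, estimated by a kernel density estimate,
# then evaluated at each observed z via interpolation
dens <- density(z)
f_hat <- approx(dens$x, dens$y, xout = z)$y

# Local mfdr estimate: ratio of null to marginal density, capped at 1
mfdr_hat <- pmin(f0 / f_hat, 1)

# Large |z| values receive small mfdr estimates; z near 0 receives mfdr near 1
head(cbind(z, mfdr_hat)[order(abs(z), decreasing = TRUE), ])
```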
For each predictor, `mfdr()` constructs a test statistic $z_j$ based upon the mathematical conditions necessary for that variable to enter the model characterized by a given value of $\lambda$. For linear regression models, these statistics have the form:

$$z_j = \frac{\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}_{-j}\widehat{\boldsymbol{\beta}}_{-j})}{\widehat{\sigma}\sqrt{n}}.$$

The subscript $-j$ indicates the removal of the $j$th predictor. For logistic and Cox regression models, these statistics have the form:

$$z_j = \frac{u_j}{\sqrt{v_j}}.$$

Here $u_j$ is the unpenalized score function (i.e., the first derivative, with respect to $\beta_j$, of the log-likelihood), and $v_j$ is the $j$th diagonal element of the unpenalized Hessian matrix (i.e., the second derivative of the log-likelihood).
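
To make the linear-regression statistic concrete, here is a rough sketch of how $z_j$ could be computed by hand for a single feature, assuming the columns of $\mathbf{X}$ are standardized so that $\mathbf{x}_j^T\mathbf{x}_j = n$ (as `ncvreg` does internally); all object names here are illustrative, not package objects:

```r
# z_j for feature j, given penalized estimates beta_hat at the lambda of interest
z_stat <- function(X, y, beta_hat, j, sigma_hat) {
  n <- nrow(X)
  # Partial residuals: remove the fitted contribution of every feature except j
  r_partial <- y - X[, -j, drop = FALSE] %*% beta_hat[-j]
  # Standardized inner product of x_j with the partial residuals
  drop(crossprod(X[, j], r_partial)) / (sigma_hat * sqrt(n))
}

# Illustrative usage on simulated data
set.seed(2)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p)) * sqrt(n / (n - 1))  # columns with x_j'x_j = n
y <- X[, 1] + rnorm(n)
beta_hat <- c(1, rep(0, p - 1))   # pretend penalized estimates at some lambda
z_stat(X, y, beta_hat, j = 2, sigma_hat = 1)
```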
Under feature independence, each of these statistics will follow a standard normal distribution under the null hypothesis of that predictor being independent of the current model's residuals. Despite being derived under independence, mfdr tends to be accurate under mild to moderate dependence structures; see Miller and Breheny (2018) for additional details.
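
The standard-normal null behavior can be checked directly with a small simulation under a global null (no feature related to the response); this sketch reuses the hypothetical `z_stat()` helper from above rather than any function from `ncvreg`:

```r
# Under a global null, z_j for a feature outside the model should be ~ N(0, 1)
set.seed(3)
n <- 100; p <- 5
z_null <- replicate(2000, {
  X <- scale(matrix(rnorm(n * p), n, p)) * sqrt(n / (n - 1))
  y <- rnorm(n)                       # response unrelated to every feature
  z_stat(X, y, beta_hat = rep(0, p), j = 1, sigma_hat = 1)
})
c(mean = mean(z_null), sd = sd(z_null))   # should be close to 0 and 1
```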
Local mfdr estimates can be obtained via the `summary()` function:

```r
library(ncvreg)
data(Prostate)
fit <- ncvreg(Prostate$X, Prostate$y)
summary(fit, lambda = 0.07, number = Inf)
```
```
# MCP-penalized linear regression with n=97, p=8
# At lambda=0.0700:
# -------------------------------------------------
#   Features satisfying criteria       :     8
#   Average mfdr among chosen features : 0.592
#
#          Estimate       z     mfdr Selected
# lcavol   0.530785  8.7704  < 1e-04        *
# svi      0.684680  3.9737 0.010695        *
# lweight  0.622144  3.7369 0.026104        *
# lbph     0.038452  1.5077 0.901245        *
# age     -0.004084 -1.2704 0.926945        *
# pgg45    0.000000  0.8675 0.951263
# gleason  0.000000  0.7467 0.955590
# lcp      0.000000 -0.2711 0.964801
```
The argument `number = Inf` requests mfdr estimates for all features, regardless of whether or not they are active in the specified model. These estimates can be understood by studying the theoretical null and empirically estimated mixture densities for these data:
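
The rendered vignette displays this comparison as a figure; a rough sketch of such a plot is below. It assumes the per-feature $z$ statistics can be extracted from the summary object (the component `s$table$z` is an assumption here; inspect the object with `str(s)` to confirm where the statistics are stored):

```r
s <- summary(fit, lambda = 0.07, number = Inf)
z <- s$table$z   # assumed location of the per-feature test statistics

# Kernel estimate of the mixture density f(z) versus the theoretical null f0(z)
plot(density(z), main = "Null vs. estimated mixture density", xlab = "z")
curve(dnorm(x), add = TRUE, lty = 2)
legend("topright", c("Estimated mixture", "Theoretical null N(0,1)"),
       lty = c(1, 2), bty = "n")
```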
The feature `lcavol` has an extremely small estimated mfdr with a statistic of $z = 8.77$; the origin of this estimate is apparent when examining the ratio between the null and mixture densities at $z = 8.77$. In contrast, the feature `lbph` has an estimated mfdr of 0.90 with a statistic of $z = 1.51$; this estimate is explained by the null and mixture densities being similar near $z = 1.5$.