AdaPTGMM Package

The AdaPTGMM method is a flexible multiple testing algorithm that uses covariates to estimate the local false discovery rate with finite-sample FDR control.
#devtools::install_github("patrickrchao/AdaPTGMM")
library(AdaPTGMM)
First, we generate some data.
# Generate data
generate_data <- function(n, testing){
  x <- rnorm(n)
  expit <- function(x) 1/(1+exp(-x))
  pi1 <- expit(2*(x-1))
  H <- 1 * (runif(n) < pi1)
  theta <- H * rlogis(n, location = 1/2, scale = 1)
  z <- rnorm(n, mean = theta)
  x <- data.frame(x = x)
  if(testing == "one_sided"){
    pvals <- pnorm(z, lower.tail = FALSE)
  } else if(testing == "two_sided"){
    pvals <- 2*pnorm(abs(z), lower.tail = FALSE)
  }
  return(list(x = x, z = z, pvals = pvals))
}
# Two sided testing
data <- generate_data(3000, "two_sided")
x <- data$x
pvals <- data$pvals
z <- data$z
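As an optional sanity check (not part of the package workflow), we can glance at the simulated p-values with base R; under the mixture above, non-null hypotheses show up as a spike of small p-values on top of a uniform null component.
# Quick look at the simulated p-values (base R only)
summary(pvals)
mean(pvals < 0.05)   # rough share of small p-values
hist(pvals, breaks = 50, main = "Simulated two-sided p-values", xlab = "p-value")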
The AdaPTGMM package performs automatic model selection over the number of mixture components and a list of candidate models (or featurizations). The candidate models are specified through the beta_formulas input. Here we consider a spline basis with 2-4 degrees of freedom. The candidate numbers of mixture components are specified through the nclasses input. Here we consider 3 and 4 components. Unlike standard adjustment functions such as p.adjust, the user may specify a vector of alpha levels in alphas. Here we choose 0.01, 0.05, 0.1, and 0.2.
library("splines")
formulas <- paste("splines::ns(x, df = ", c(2,3,4), ")")
nclasses <- c(3,4)
alphas <- c(0.01,0.05,0.1,0.2)
res <- adapt_gmm(x = x, pvals = pvals, alphas = alphas,
                 beta_formulas = formulas, nclasses = nclasses)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1987, Number of Rej. 229
#> alpha = 0.1: FDPhat 0.0988, Number of Rej. 162
#> alpha = 0.05: FDPhat 0.0474, Number of Rej. 116
#> alpha = 0.01: FDPhat 0.0091, Number of Rej. 55
#> Complete.
The arguments of the model are in res$args. The model selection chose a spline with 2 degrees of freedom and 4 Gaussian components.
# Selected beta formula from model selection
print(paste("Selected formula:", format(res$args$beta_formula), collapse=" "))
#> [1] "Selected formula: class ~ splines::ns(x, df = 2)"
# Selected number of classes from model selection
print(paste("Selected number of classes:", res$args$nclasses, collapse=" "))
#> [1] "Selected number of classes: 4"
Instead, we may choose to model the \(z_i\)'s. Since we are now inputting \(z_i\)'s and using a two-sided test, we need to specify that the testing procedure is two sided with testing="two_sided".
zres <- adapt_gmm(x = x, z = z, alphas = alphas, testing = "two_sided",
                  beta_formulas = formulas, nclasses = nclasses)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1985, Number of Rej. 204
#> alpha = 0.1: FDPhat 0.0987, Number of Rej. 157
#> alpha = 0.05: FDPhat 0.05, Number of Rej. 130
#> alpha = 0.01: FDPhat 0.0098, Number of Rej. 51
#> Complete.
We may also look at the fitted means and variances of the Gaussian components.
# Fitted Gaussian means
print(paste(c("Means:", sapply(zres$params$mu,function(x) round(x,2))), collapse=" "))
#> [1] "Means: -0.42 0.07 0.21 0.52"
# Fitted Gaussian variances
print(paste(c("Variances:", sapply(zres$params$var,function(x) round(x,2))), collapse=" "))
#> [1] "Variances: 0 0 0 4.49"
The masking function in AdaPTGMM is governed by three parameters, \(\alpha_m\), \(\lambda\), and \(\zeta\); see our paper for more details. These three variables must satisfy \[0<\alpha_m\le \lambda \le \alpha\cdot\zeta + \lambda \le 1.\] The value of \(\zeta\) controls the minimum possible number of rejections. If the analyst expects a small number of rejections, e.g. fewer than 20, we recommend a large \(\zeta\), for example 5-20.
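As a worked check of this constraint, the following illustrative helper (not part of the package) computes \(\nu = \alpha\cdot\zeta + \lambda\) for the largest level \(\alpha = 0.2\) and the custom masking parameters used in the examples below.
# Illustrative check of 0 < alpha_m <= lambda <= alpha*zeta + lambda <= 1 (helper is not from the package)
check_masking <- function(alpha, alpha_m, lambda, zeta){
  nu <- alpha * zeta + lambda
  c(nu = nu, valid = (0 < alpha_m) && (alpha_m <= lambda) && (nu <= 1))
}
check_masking(alpha = 0.2, alpha_m = 0.2, lambda = 0.4, zeta = 3)  # nu = 1: valid, but lower power (see below)
check_masking(alpha = 0.2, alpha_m = 0.2, lambda = 0.3, zeta = 3)  # nu = 0.9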
If the user does not specify any values for \(\alpha_m\), \(\lambda\), and \(\zeta\), AdaPTGMM will choose default parameters, adapting to the total number of hypotheses. See our paper for more details.
We may input custom masking parameters. For the sake of example, we now work with one-sided p-values. In this dataset, we have some null p-values close to 1, and masking p-values near 0 and near 1 to the same value will lead to inflated FDP estimates and lower power.
res1 <- adapt_gmm(x = x, z = z, alphas = alphas, testing = "one_sided",
                  beta_formulas = formulas, nclasses = nclasses,
                  alpha_m = 0.2, lambda = 0.4, zeta = 3)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1971, Number of Rej. 93
#> alpha = 0.1: FDPhat 0.0976, Number of Rej. 41
#> alpha = 0.05: FDPhat 0.0256, Number of Rej. 13
#> alpha = 0.01: FDPhat 0.0256, Number of Rej. 0
#> Complete.
res2 <- adapt_gmm(x = x, z = z, alphas = alphas, testing = "one_sided",
                  beta_formulas = formulas, nclasses = nclasses,
                  alpha_m = 0.2, lambda = 0.3, zeta = 3)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1996, Number of Rej. 182
#> alpha = 0.1: FDPhat 0.099, Number of Rej. 128
#> alpha = 0.05: FDPhat 0.0495, Number of Rej. 101
#> alpha = 0.01: FDPhat 0.0072, Number of Rej. 46
#> Complete.
We see that using a masking function with \(\nu= \alpha\cdot\zeta + \lambda=1\) results in much lower power. For this reason, we recommend \(\alpha_m=\lambda\) and \(\nu=0.9\).
AdaPTGMM models the mixture proportions with a black-box classification model. We provide implementations of multinomial logistic regression, GAM, glmnet, and a neural network with a single hidden layer. The user may specify the model type with model_type="nnet", "mgcv", "glmnet", or "neural", respectively.
We may use a multinomial logistic regression model from nnet::multinom.
res_multinom <- adapt_gmm(x = x, z = z, alphas = alphas, testing = "one_sided",
                          beta_formulas = formulas, model_type = "nnet", nclasses = nclasses)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1989, Number of Rej. 186
#> alpha = 0.1: FDPhat 0.0977, Number of Rej. 133
#> alpha = 0.05: FDPhat 0.0476, Number of Rej. 84
#> alpha = 0.01: FDPhat 0.0102, Number of Rej. 0
#> Complete.
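The other built-in model types listed above can be requested in the same way; as a rough sketch, the single-hidden-layer neural network is selected with model_type = "neural". Reusing the spline formulas here is an illustrative assumption rather than a package requirement, since a neural network can capture non-linearity on its own.
# Sketch: single-hidden-layer neural network for the mixture proportions
# (reusing the spline formulas is an illustrative assumption)
res_neural <- adapt_gmm(x = x, z = z, alphas = alphas, testing = "one_sided",
                        beta_formulas = formulas, model_type = "neural",
                        nclasses = nclasses)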
Alternatively, the user may construct their own beta model. AdaPTGMM accepts the input custom_beta_model, which must be a function that takes a formula, a dataset, and (optional) initialization parameters. Below is a default template for writing a custom beta model. There are four key things the function must do: fit the model using the per-observation weights in data$weights, compute the fitted probabilities for the data, compute new model parameters for initialization (or leave them NULL), and report the model's degrees of freedom for model selection.
custom_beta <- function(formula, data, initialization){

  ########################################################################
  # EDIT THIS SECTION to fit any desired model
  # Formula, data, and weights for each observation in data$weights
  # Compute:
  # 1. The fitted probabilities for the data
  # 2. New model parameters for initialization (or leave as null if undesired)
  # 3. Degrees of freedom for model (for model selection)
  model <- fit_model(formula, data, params = initialization, weights = data$weights)

  fitted_probabilities <- fitted(model)
  new_weights <- model$weights
  df <- model$df
  ########################################################################

  output <- list()
  output$fitted_prob <- fitted_probabilities
  output$model_weights <- new_weights
  output$df <- df

  return(output)
}
Here is an example that uses a multinomial logistic regression model as the custom beta model. To use a custom beta model, set model_type="custom" and pass the custom beta function as custom_beta_model.
custom_multinom <- function(formula, data, initialization){

  ########################################################################
  # EDIT THIS SECTION to fit any desired model
  # Formula, data, and weights for each observation in data$weights
  # Compute:
  # 1. The fitted probabilities for the data
  # 2. New model parameters for initialization (or leave as null if undesired)
  # 3. Degrees of freedom for model (for model selection)
  if(is.null(initialization)){
    model <- nnet::multinom(formula = formula, data = data, weights = weights, trace = FALSE)
  } else{
    model <- nnet::multinom(formula, data, weights = weights, trace = FALSE, Wts = initialization)
  }

  fitted_probabilities <- fitted(model)
  new_weights <- model$wts
  df <- sum(model$edf)
  ########################################################################

  output <- list()
  output$fitted_prob <- fitted_probabilities
  output$model_weights <- new_weights
  output$df <- df

  return(output)
}
res_custom <- adapt_gmm(x = x, z = z, alphas = alphas, testing = "one_sided",
                        beta_formulas = formulas, nclasses = nclasses,
                        model_type = "custom", custom_beta_model = custom_multinom)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1989, Number of Rej. 186
#> alpha = 0.1: FDPhat 0.0977, Number of Rej. 133
#> alpha = 0.05: FDPhat 0.0476, Number of Rej. 84
#> alpha = 0.01: FDPhat 0.0102, Number of Rej. 0
#> Complete.
We see that we obtain the same results as the default implementation of multinomial logistic regression.
If speed is a core concern, we recommend decreasing the number of iterations used in fitting the model and number of EM iterations, as well as using a simple \(\beta\) model.
Overall Recommendations:
- Set nclasses=3.
- Set beta_formulas to be a spline with 3 degrees of freedom.
- Decrease nfits. By default nfits=20; we set it to 5 here.
- Decrease niter_fit. By default niter_fit=5; we set it to 2 here.
- Decrease niter_ms. By default niter_ms=10; we set it to 5 here.

Here is one example:
fast_res <- adapt_gmm(x = x, z = z, alphas = alphas, testing = "one_sided",
                      beta_formulas = "splines::ns(x, df=3)", model_type = "nnet", nclasses = 4,
                      nfits = 5, niter_fit = 2, niter_ms = 5, intercept_model = FALSE)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.2, Number of Rej. 165
#> alpha = 0.1: FDPhat 0.1, Number of Rej. 110
#> alpha = 0.05: FDPhat 0.0461, Number of Rej. 76
#> alpha = 0.01: FDPhat 0.01, Number of Rej. 50
#> Complete.
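To quantify the speedup on a particular machine, the default-settings call and the reduced-settings call can simply be wrapped in system.time(); actual timings vary by hardware, so none are reported here.
# Rough timing comparison of default versus reduced settings (results depend on hardware)
t_default <- system.time(
  adapt_gmm(x = x, z = z, alphas = alphas, testing = "one_sided",
            beta_formulas = formulas, nclasses = nclasses)
)
t_fast <- system.time(
  adapt_gmm(x = x, z = z, alphas = alphas, testing = "one_sided",
            beta_formulas = "splines::ns(x, df=3)", model_type = "nnet", nclasses = 4,
            nfits = 5, niter_fit = 2, niter_ms = 5, intercept_model = FALSE)
)
c(default = t_default["elapsed"], fast = t_fast["elapsed"])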