Introduction to AdaPTGMM Package

Patrick Chao

2021-07-27

The AdaPTGMM method is a flexible multiple testing algorithm that uses covariates to estimate the local false discovery rate while maintaining finite-sample FDR control.

#devtools::install_github("patrickrchao/AdaPTGMM")
library(AdaPTGMM)

First, we generate some data.

# Generate data
generate_data <- function(n,testing){
  x <- rnorm(n)
  expit <- function(x) 1/(1+exp(-x))
  pi1 <- expit(2*(x-1))
  H <- 1 * (runif(n) < pi1)
  theta <- H * rlogis(n, location=1/2, scale=1)
  z <- rnorm(n, mean=theta)
  x <- data.frame(x=x)
  if(testing == "one_sided"){
    pvals <- pnorm(z, lower.tail = FALSE)
  }else if(testing == "two_sided"){
    pvals <- 2*pnorm(abs(z), lower.tail = FALSE)
  }
  return(list(x=x,z=z,pvals=pvals))
}

# Two sided testing
data <- generate_data(3000,"two_sided")
x <- data$x
pvals <- data$pvals
z <- data$z

The AdaPTGMM package performs automatic model selection over the number of mixture components and a list of candidate models (or featurizations).

library("splines")
formulas <- paste0("splines::ns(x, df = ", c(2, 3, 4), ")")
nclasses <- c(3,4)
alphas <- c(0.01,0.05,0.1,0.2)

res <- adapt_gmm(x=x,pvals=pvals, alphas=alphas,
        beta_formulas = formulas, nclasses= nclasses)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1987, Number of Rej. 229
#> alpha = 0.1: FDPhat 0.0988, Number of Rej. 162
#> alpha = 0.05: FDPhat 0.0474, Number of Rej. 116
#> alpha = 0.01: FDPhat 0.0091, Number of Rej. 55
#> Complete.

The arguments of the model are in res$args. The model selection chose a spline with 2 degrees of freedom and 4 Gaussian components.

# Selected beta formula from model selection
print(paste("Selected formula:", format(res$args$beta_formula), collapse=" "))
#> [1] "Selected formula: class ~ splines::ns(x, df = 2)"

# Selected number of classes from model selection
print(paste("Selected number of classes:", res$args$nclasses, collapse=" "))
#> [1] "Selected number of classes: 4"

Instead, we may choose to model the \(z_i\)’s directly. Since we now input \(z_i\)’s rather than p-values and use a two-sided test, we must specify the testing procedure with testing="two_sided".

zres <- adapt_gmm(x=x,z=z, alphas=alphas, testing="two_sided",
        beta_formulas = formulas, nclasses= nclasses)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1985, Number of Rej. 204
#> alpha = 0.1: FDPhat 0.0987, Number of Rej. 157
#> alpha = 0.05: FDPhat 0.05, Number of Rej. 130
#> alpha = 0.01: FDPhat 0.0098, Number of Rej. 51
#> Complete.

We may also look at the fitted means and variances of the Gaussian components.

# Fitted Gaussian means
print(paste(c("Means:", sapply(zres$params$mu,function(x) round(x,2))), collapse=" "))
#> [1] "Means: -0.42 0.07 0.21 0.52"

# Fitted Gaussian variances
print(paste(c("Variances:", sapply(zres$params$var,function(x) round(x,2))), collapse=" "))
#> [1] "Variances: 0 0 0 4.49"

Masking

The masking function in AdaPTGMM is governed by three parameters, \(\alpha_m\), \(\lambda\), and \(\zeta\); see our paper for more details. These three parameters must satisfy \[0<\alpha_m\le \lambda \le \zeta\cdot\alpha_m + \lambda \le 1.\] The value of \(\zeta\) controls the minimum possible number of rejections. If the analyst expects a small number of rejections, i.e. fewer than 20, we recommend a large \(\zeta\), for example 5-20.
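
As a quick illustration (a small sketch, not a package function), we can check a candidate triple against this constraint before passing it to adapt_gmm; the values below match the custom parameters used later in this section.

# Sketch: verify a candidate masking triple satisfies
# 0 < alpha_m <= lambda and zeta * alpha_m + lambda <= 1
alpha_m <- 0.2; lambda <- 0.4; zeta <- 3
nu <- zeta * alpha_m + lambda
valid <- (0 < alpha_m) && (alpha_m <= lambda) && (nu <= 1)
print(paste("nu =", nu, "valid:", valid))
# Here nu = 1: the constraint holds, but only at the boundary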

If the user does not specify any values for \(\alpha_m\), \(\lambda\), and \(\zeta\), AdaPTGMM will choose default parameters, adapting to the total number of hypotheses. See our paper for more details.

We may input custom masking parameters. For the sake of example, we now perform one-sided testing on the same \(z_i\)’s.

In this dataset, some null p-values are close to 1. Masking p-values near 0 and near 1 to the same value leads to inflated FDP estimates and lower power.

res1 <- adapt_gmm(x=x,z=z, alphas=alphas, testing="one_sided",
        beta_formulas = formulas, nclasses= nclasses,
        alpha_m = 0.2, lambda = 0.4, zeta = 3)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1971, Number of Rej. 93
#> alpha = 0.1: FDPhat 0.0976, Number of Rej. 41
#> alpha = 0.05: FDPhat 0.0256, Number of Rej. 13
#> alpha = 0.01: FDPhat 0.0256, Number of Rej. 0
#> Complete.

res2 <- adapt_gmm(x=x,z=z, alphas=alphas, testing="one_sided",
        beta_formulas = formulas, nclasses= nclasses,
        alpha_m = 0.2, lambda = 0.3, zeta = 3)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1996, Number of Rej. 182
#> alpha = 0.1: FDPhat 0.099, Number of Rej. 128
#> alpha = 0.05: FDPhat 0.0495, Number of Rej. 101
#> alpha = 0.01: FDPhat 0.0072, Number of Rej. 46
#> Complete.

We see that using a masking function with \(\nu= \zeta\cdot\alpha_m + \lambda=1\) results in much lower power. For this reason, we recommend \(\alpha_m=\lambda\) and \(\nu=0.9\).
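
For example, one choice consistent with this recommendation (illustrative values, not the package defaults) is \(\alpha_m=\lambda=0.3\) and \(\zeta=2\), so that \(\nu = 2\cdot 0.3 + 0.3 = 0.9\):

# Illustrative masking parameters with alpha_m = lambda and nu = zeta * alpha_m + lambda = 0.9
# (these specific values are an example, not the package defaults)
res3 <- adapt_gmm(x=x, z=z, alphas=alphas, testing="one_sided",
        beta_formulas = formulas, nclasses = nclasses,
        alpha_m = 0.3, lambda = 0.3, zeta = 2)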

Modeling Mixture Proportions

AdaPTGMM models the mixture proportions with a black-box classification model. We provide implementations of multinomial logistic regression, GAMs, glmnet, and a neural network with a single hidden layer. The user may specify the model type with model_type = "nnet", "mgcv", "glmnet", or "neural", respectively.

We may use a multinomial logistic regression model from nnet::multinom.

res_multinom <- adapt_gmm(x=x,z=z, alphas=alphas, testing="one_sided",
        beta_formulas = formulas, model_type = "nnet", nclasses= nclasses)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1989, Number of Rej. 186
#> alpha = 0.1: FDPhat 0.0977, Number of Rej. 133
#> alpha = 0.05: FDPhat 0.0476, Number of Rej. 84
#> alpha = 0.01: FDPhat 0.0102, Number of Rej. 0
#> Complete.
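
The other model types use the same interface; only model_type (and, if needed, the formula syntax) changes. For instance, a GAM fit can be requested with model_type = "mgcv". The smooth-term formula "s(x)" below is our assumption about the expected mgcv-style syntax, so adjust it if your version expects something different.

# Sketch: fit the class probabilities with an mgcv GAM.
# The "s(x)" smooth-term formula is an assumption about the expected syntax.
res_gam <- adapt_gmm(x=x, z=z, alphas=alphas, testing="one_sided",
        beta_formulas = "s(x)", model_type = "mgcv", nclasses = nclasses)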

Alternatively, the user may construct their own beta model. AdaPTGMM accepts the argument custom_beta_model, which must be a function that accepts a formula, a dataset, and (optional) initialization parameters. Below is a default template for writing a custom beta model. There are four key things that the function must do:

  1. Fit the model using the formula, dataset, and weights per observation
  2. Return fitted probabilities for each row in the dataset
  3. Return the degrees of freedom of the model for model selection
  4. (Optionally) Return model weights for initialization
custom_beta <- function(formula, data, initialization){

  ########################################################################
  # EDIT THIS SECTION to fit any desired model
  # Formula, data, and weights for each observation in data$weights
  # Compute:
  # 1. The fitted probabilities for the data
  # 2. New model parameters for initialization (or leave as null if undesired)
  # 3. Degrees of freedom for model (for model selection)

  # `fit_model` is a placeholder: replace it with your own model-fitting call
  model <- fit_model(formula, data, params = initialization, weights = data$weights)

  fitted_probabilities <- fitted(model)
  new_weights <- model$weights
  df <- model$df

  ########################################################################
  output <- list()
  output$fitted_prob <- fitted_probabilities
  output$model_weights <- new_weights
  output$df <- df

  return(output)

}

Here is an example using a multinomial logistic regression model. To use a custom beta model, set model_type="custom" and pass the custom beta function as custom_beta_model.

custom_multinom <- function(formula, data, initialization){

  ########################################################################
  # EDIT THIS SECTION to fit any desired model
  # Formula, data, and weights for each observation in data$weights
  # Compute:
  # 1. The fitted probabilities for the data
  # 2. New model parameters for initialization (or leave as null if undesired)
  # 3. Degrees of freedom for model (for model selection)

  if(is.null(initialization)){
    model <- nnet::multinom(formula = formula, data = data, weights = weights, trace = FALSE)
  }else{
    model <- nnet::multinom(formula = formula, data = data, weights = weights, trace = FALSE, Wts = initialization)
  }

  fitted_probabilities <- fitted(model)
  new_weights <- model$wts
  df <- sum(model$edf)

  ########################################################################
  output <- list()
  output$fitted_prob <- fitted_probabilities
  output$model_weights <- new_weights
  output$df <- df

  return(output)

}

res_custom <- adapt_gmm(x=x,z=z, alphas=alphas, testing="one_sided",
        beta_formulas = formulas, nclasses= nclasses,
         model_type = "custom", custom_beta_model = custom_multinom)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.1989, Number of Rej. 186
#> alpha = 0.1: FDPhat 0.0977, Number of Rej. 133
#> alpha = 0.05: FDPhat 0.0476, Number of Rej. 84
#> alpha = 0.01: FDPhat 0.0102, Number of Rej. 0
#> Complete.

We see that we obtain the same results as the default implementation of multinomial logistic regression.

Fast Version of AdaPTGMM

If speed is a core concern, we recommend decreasing the number of model fits and EM iterations (nfits, niter_fit, and niter_ms), as well as using a simple \(\beta\) model.

Overall recommendations: use a single beta formula and number of classes, reduce nfits, niter_fit, and niter_ms, and set intercept_model = FALSE. Here is one example:

fast_res <- adapt_gmm(x=x,z=z, alphas=alphas, testing="one_sided",
        beta_formulas = "splines::ns(x, df=3)", model_type = "nnet", nclasses= 4,
        nfits = 5, niter_fit=2, niter_ms = 5,intercept_model = FALSE)
#> Model selection starting. Shrink the set of candidate models or reduce niter_fit/niter_ms/nfits if it is too time-consuming.
#> alpha = 0.2: FDPhat 0.2, Number of Rej. 165
#> alpha = 0.1: FDPhat 0.1, Number of Rej. 110
#> alpha = 0.05: FDPhat 0.0461, Number of Rej. 76
#> alpha = 0.01: FDPhat 0.01, Number of Rej. 50
#> Complete.