--- title: "User guide for the noisySBM package" author: "Tabea Rebafka" date: "20/10/2020" output: rmarkdown::html_vignette: toc: true toc_depth: 2 number_sections: true vignette: > %\VignetteIndexEntry{User guide for the noisySBM package} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(ggplot2) ``` `noisySBM` is the accompanying R package of the article *Powerful graph inference with false discovery rate control* by Rebafka, Roquain and Villers available at . It provides functions for generating data in the so-called noisy stochastic block model (NSBM), an estimation algorithm to fit parameters and cluster the nodes, as well as a powerful graph inference procedure. The context is the following. We observe a dense graph where most interactions are due to noise. In practice the graph may be a similarity matrix, a correlation matrix or a matrix containing the values of a test statistic applied on all pairs of nodes. The aim is to infer the meaningful edges in the graph. This is obtained by a new random graph model and a new powerful graph inference procedure that comes with a control the number of flase discoveries. The current version of the R package supports the entire analysis for undirected graphs in the Gaussian NSBM and the Gamma NSBM. The Poisson case as well as directed graphs are left for future developments. To start with, we load the package: ```{r} library(noisySBM) ``` # Noisy stochastic block model We introduce a new random graph model suited for the problem of graph inference. In this model the observed matrix is a perturbation of an unobserved binary graph, which is the *true* graph to be inferred. The binary graph is chosen to be a stochastic block model (SBM) for its capacity to model graph heterogeneity. We refer to the new model as the **noisy stochastic block model (NSBM)**. The definition is stated for an undirected graph without self-loops, but extensions are straightforward. Let $n\geq 2$ be the number of nodes and $\mathcal{A}=\{(i,j): 1\leq i 0\},\qquad\{g_{\nu},\nu \in \mathcal{V}\}\subset\{\mathcal N(\mu,\sigma^2),\mu\in\mathbb R,\sigma>0\}, \end{equation} where different choices for $\mathcal{V}_0$ and $\mathcal{V}$ lead to different variants. For example, often we assume that the mean of the null distribution is zero ($\mu_0=0$). We may also consider the **Gamma NSBM**, where the parametric density families are chosen as follows: \begin{equation}\label{gammamodel} \{g_{0,\nu},\nu\in \mathcal{V}_0\}=\{\text{Exp}(\nu_0),\nu_0>0\}, \:\:\:\{g_{\nu},\nu \in \mathcal{V}\}\subset\{\Gamma(a,b),a>0,b>0\}. \end{equation} In the **Poisson NSBM** both parametric families are the family of Poisson distirbutions: \begin{equation}\label{possoinmodel} \{g_{0,\nu_0},\nu_0\in \mathcal{V}\}= \{g_{\nu},\nu \in \mathcal{V}\}=\{\text{Poi}(\nu),\nu>0\}. \end{equation} # Model parameters and data generation of the NSBM with R In the `noisySBM` package there is an overall parameters: the graph type (directed or undirected). Let us consider the case: ```{r} directed <- FALSE ``` The global model parameter $\theta=(\pi,w,\nu_0,\nu)$ of the NSBM is a list and its elements are `pi`, `w`, `nu0` and `nu`. ```{r} theta <- list(pi=NULL, w=NULL, nu0=NULL, nu=NULL) ``` ## SBM parameters Denote `Q` the number of latent blocks in the SBM. Say ```{r} Q <- 2 ``` The parameters `pi` and `w` are the parameters of the latent binary SBM, where `pi` is a `Q`-vector indicating the mean proportions of nodes per block. The vector `pi` has to be normalized: ```{r} theta$pi <- c(2,1)/3 ``` Parameter `w` is a vector. For undirected graphs `w` is of length $Q(Q+1)/2$ with the following indexing $w=(w_{1,1},w_{1,2},\dots,w_{1,Q},w_{2,2},\dots,w_{2,Q},w_{3,3},\dots,w_{Q,Q})$. The parameter $w_{q,l}\in(0,1)$ is a Bernoulli parameter indicating the probability that a node in group $q$ is connected to a node in group $l$. For directed graphs `w` is of length $Q^2$ with the following indexing $w=(w_{1,1},w_{1,2},\dots,w_{1,Q},w_{1,2},\dots,w_{2,Q},w_{1,3},\dots,w_{Q,Q})$. In the directed case $w_{q,l}\in(0,1)$ is the probability of a directed edge from a node in group $q$ to a node in group $l$. For our undirected graph with two latent blocks, let us consider ```{r} theta$w <- c(.8,.1,.9) ``` This means that both blocks are communities (as $w_{1,1}=0.8$ and $w_{2,2}=0.9$), as nodes have a large probability to be connected to nodes in the same block, and a low probability to connect to nodes in the other group ($w_{1,2}=0.1$). ## Gaussian model For the noisy layer, we have to choose a model, Gaussian, Gamma or Poisson. In the Gaussian model, the null parameter `nu0` is a vector of length 2, giving the mean and the standard deviation of the the Gaussian null distrubtion ($\nu_0=(\nu_{0}^\text{mean},\nu_{0}^\text{sd})$). Mostly, we choose the standard normal distribution: ```{r} theta$nu0 <- c(0,1) ``` Parameter `nu` is a matrix with two columns: the first indicates Gaussian means $\nu_{q,l}^\text{mean}$, the second standard deviations $\nu_{q,l}^\text{sd}$ of the alternative distributions for the block pairs $(q,l)$. The number of rows of `nu` is identical to the length of `w`, that is, in the undirected case `nu` has $Q(Q+1)/2$ rows, otherwise $Q^2$. The rows are indexed like `w`, that is, $$ \nu^{\text{undirected}}=\begin{pmatrix}\nu_{1,1}\\\vdots\\\nu_{1,Q}\\\nu_{2,2} \\\vdots\\ \nu_{Q,Q} \end{pmatrix}\in\mathbb R^{Q(Q+1)/2\times 2}\qquad \nu^{\text{directed}}=\begin{pmatrix}\nu_{1,1}\\\nu_{1,Q}\vdots\\\\\nu_{2,1} \\\vdots\\ \nu_{Q,Q} \end{pmatrix}\in\mathbb R^{Q^2\times 2}\qquad \text{with } \nu_{q,l}=(\nu_{q,l}^\text{mean},\nu_{q,l}^\text{sd})\in\mathbb R^2. $$ For our undirected model with two latent blocks, we choose ```{r} theta$nu <- matrix(c(-1,5,-1, 1,1,1), ncol=2) theta$nu ``` ### Generate data To generate a data set with, say, $n=10$ nodes from the corresponding NSBM we use the function `rnsbm()`: ```{r} obs <- rnsbm(n = 10, theta, modelFamily = 'Gauss') round(obs$dataMatrix, digits = 2) ``` We see that the generated matrix is symmetric, as we are in the undirected case. The function `rnsbm()` also provides the latent binary graph, named `latentAdj`, which is also symmetric: ```{r} obs$latentAdj ``` as well as the latent blocks, names `latentZ`: ```{r} obs$latentZ ``` A more convenient visualisation of the data is given by the function `plotGraphs()`: ```{r} plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj) ``` ### Directed graphs To generate data from a directed graph, the parameters `w` and `nu` have to be adapted to length $Q^2$, say ```{r} theta$w <- c(.8,.1,.2,.9) theta$nu <- matrix(c(-1,5,10,-1, 1,1,1,1), ncol=2) theta$nu ``` To generate data from the directed NSBM, we set `directed = TRUE`: ```{r} obs <- rnsbm(n = 10, theta, modelFamily = 'Gauss', directed = TRUE) plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj) round(obs$dataMatrix, digits = 2) obs$latentAdj ``` Now the matrices are no longer symmetric. ## Gamma model As the Gamma distribution also has two parameters, `nu0` is a vector of length 2 and `nu` is a 2-column matrix as in the Gaussian case. Let us take as an example an undirectd graph in a model with $Q=3$ latent blocks with equiprobable blocks: ```{r} theta$pi <- rep(1/3, 3) ``` Suppose that inter-block probabilities are high, while intra-block probabilities are low: ```{r} theta$w <- c(.1, .8, .8, .1, .8, .1) ``` The null distribution is the exponential distribution Exp$(2)$: ```{r} theta$nu0 <- c(1,2) ``` For the alternative distributions we also choose exponentials: ```{r} theta$nu <- matrix(c(1, 1, 1, 1, 1, 1, 1, .5, .5, 1, .5, 1), ncol=2) theta$nu ``` We generate a graph with 10 nodes: ```{r} obs <- rnsbm(n = 10, theta, modelFamily = 'Gamma') plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj) round(obs$dataMatrix, digits = 2) obs$latentAdj obs$latentZ ``` ## Poisson model The Poisson distribution has a single parameter, so that `nu0` is a scalar and `nu` becomes a one-column matrix. As an example consider a directed $Q=1$-block Poisson NSBM with parameters: ```{r} theta <- list(pi=1, w=.5, nu0=1, nu=5) ``` We generate a graph with 10 nodes: ```{r} obs <- rnsbm(n = 10, theta, modelFamily = 'Poisson', directed = TRUE) plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj) round(obs$dataMatrix) obs$latentAdj obs$latentZ ``` # Estimation algorithm ## Basic function call `fitNSBM()` and output The function `fitNSBM()` fits the parameters of a NSBM to a data set in different models. The function applies to a symmetric real-valued square matrix `dataMatrix`. Let us see a basic example: ```{r} set.seed(1) theta1 <- list(pi=c(.5,.5), w=c(.8,.1,.2), nu0=c(0,1), nu=matrix(c(-1,5,10, 1,1,1), ncol=2)) obs1 <- rnsbm(n=30, theta1) ``` ```{r, eval=FALSE} res_gauss <- fitNSBM(obs1$dataMatrix, nbCores=2) ``` `fitNSBM()` applies the VEM-algorithm to estimate model parameters and compute a node clustering. By default, the algorithm explores different SBM models starting from a model with a single latent block (`Q=1`) up to 4 blocks or more if appropriate. The output of `fitNSBM()` is a list with estimation results for all visited models (for `Qmin=1` to `Qmax=4` (or more)). Here, for `Q=2` for instance, we obtain the estimates of the model parameters $\theta$ by ```{r} res_gauss[[2]]$theta ``` A node clustering is given by ```{r} res_gauss[[2]]$clustering ``` The list element `sbmParam` contains further results concerning the latent binary SBM, namely the number of latent blocks: ```{r} res_gauss[[2]]$sbmParam$Q ``` the variational parameters, i.e. the probabilities that a node belongs to one of the latent blocks: ```{r} res_gauss[[2]]$sbmParam$clusterProba ``` and the conditional probabilities of an edge between a pair of nodes given the block memberships of the nodes: ```{r} res_gauss[[2]]$sbmParam$edgeProba[,1:5] ``` These probabilities correspond to $1-l\text{-values}$, which are used in the test procedure for graph inference. The dimension of `sbmParam$edgeProba` is $Q(Q+1)/2$ times $n(n-1)/2$ (in the undirected case): ```{r} dim(res_gauss[[2]]$sbmParam$edgeProba) ``` that is, the rows correspond to group pairs $(q,l)$ and the columns to node pairs $(i,j)$. Moreover, `sbmParam` contains the values of the integrated classification criterion: ```{r} res_gauss[[2]]$sbmParam$ICL ``` Finally, the list element `convergence` provides details on the convergence of the algorithm: ```{r} res_gauss[[2]]$convergence ``` ## Models for edge distributions For the edges different distributions can be considered. The `noisySBM` package supports the Gaussian NSBM and the Gamma NSBM. In the Gaussian case a number of different restrictions on the parameter space are implemented. ### Gaussian NSBM By default, `fitNSBM()` considers a Gaussian NSBM with * `model='Gauss0'`, where the null distribution is a centered normal distribution with $\nu_{0}^{\text{mean}}=0$ and standard deviation $\nu_{0}^{\text{sd}}\in\mathbb R_+$ to be estimated, and all parameters $\nu_{q,l}=(\nu_{q,l}^\text{mean},\nu_{q,l}^\text{sd})$ for $1\leq q\leq l\leq Q$ of the alternative Gaussian distributions are estimated without any constraint. Other choices for `model` in the Gaussian case are the following: * `Gauss`: all Gaussian parameters of the null and the alternative distributions are unknown ; this is the Gaussian model with the maximum number of unknown parameters ($\nu_{0}^{\text{mean}}\in\mathbb R$, $\nu_{q,l}^{\text{mean}}\in\mathbb R$, $\nu_{0}^{\text{sd}}\in\mathbb R_+$, $\nu_{q,l}^{\text{sd}}\in\mathbb R_+$ for all $q\leq l$). * `Gauss01`: compared to `Gauss`, the null distribution is set to $\mathcal N(0,1)$ ($\nu_0^{\text{mean}}=0$, $\nu_0^{\text{sd}}=1$)) * `GaussEqVar`: compared to `Gauss`, all Gaussian variances (of both the null and the alternative) are supposed to be equal, but unknown ($\nu_0^{\text{sd}}=\nu_{q,l}^{\text{sd}}\in\mathbb R_+$ for all $q\leq l$). * `Gauss0EqVar`: compared to `GaussEqVar`, the mean of the null distribution is set to 0 ($\nu_0^{\text{mean}}=0$). * `Gauss0Var1`: compared to `Gauss`, all Gaussian variances are set to 1 and the null distribution is set to $\mathcal N(0,1)$ ($\nu_0^{\text{mean}}=0$, $\nu_0^{\text{sd}}=\nu_{q,l}^{\text{sd}}=1$) * `Gauss2distr`: the alternative distribution is a single Gaussian distribution, i.e. the block memberships of the nodes do not influence on the alternative distribution (all $\nu_{q,l}$ are identical). * `GaussAffil`: compared to `Gauss`, for the alternative distribution, there's a distribution for inter-group and another for intra-group interactions (there are parameters $\nu_{\text{in}}$ and $\nu_{\text{out}}$ such that $\nu_{q,q}=\nu_{\text{in}}$ for every $q$, and $\nu_{q,l}=\nu_{\text{out}}$ for every $q< l$). ### Gamma NSBM The estimation procedure `fitNSBM()` supports two exponential or Gamma models: * `Exp`: the null and the alternatives are all exponential distributions with unknown parameters (i.e. they are Gamma distributions with shape parameter equal to one and unknown scale parameter) * `ExpGamma`: the null distribution is exponential with unknown parameter, the alterantive distribution are Gamma distributions where both parameters are unknown. Let us consider the following example: ```{r} theta2 <- list(pi=rep(1/2, 2), w=c(.2, .8, .2), nu0=c(1,5), nu=matrix(c(1, 1, 1, 1, 1/3, 1), ncol=2)) set.seed(2) obs2 <- rnsbm(n=60, theta2, modelFamily='Gamma') ``` Let us fit both Gamma models to the data : ```{r, eval=FALSE} res_exp <- fitNSBM(obs2$dataMatrix, model='Exp', nbCores = 2) set.seed(3) res_gamma <- fitNSBM(obs2$dataMatrix, model='ExpGamma', nbCores = 2) ``` The parameter estimates obtained for $Q=2$ are close to the true parameter values in both cases: ```{r} Q <- 2 res_exp[[Q]]$theta res_gamma[[Q]]$theta ``` ## Model selection When the algorithm explores several model sizes $Q$, then model selection can be performed via the ICL criterion. The function `getBestQ()` gives the best model size $\hat Q$ that maximizes the ICL: ```{r} getBestQ(res_gauss) ``` Typically, the ICL curve is a concave function, as we can see by applying function `plotICL`: ```{r} plotICL(res_gauss) ``` Likewise, for our exponential data, we obtain: ```{r} getBestQ(res_exp) getBestQ(res_gamma) ``` and for the ICL curves: ```{r, warning=FALSE} dfICLexp <- data.frame(ICL = sapply(res_exp, function(el) el$sbmParam$ICL), Q = sapply(res_exp, function(el) el$sbmParam$Q), model='Exp') dfICLgam <- data.frame(ICL = sapply(res_gamma, function(el) el$sbmParam$ICL), Q = sapply(res_gamma, function(el) el$sbmParam$Q), model='ExpGamma') dfICL <- rbind(dfICLexp, dfICLgam) ggplot(dfICL, aes(x=Q, y=ICL, group=model, colour=model) ) + geom_line() + xlim(1,4) ``` The argument `sbmSize` of `fitNSBM()` is used to precise the models to be explored. When `sbmSize$Qmin` and `sbmSize$Qmax` are specified, then only this range of models is explored. For a single model size, do e.g. ```{r, eval=FALSE} res4 <- fitNSBM(obs1$dataMatrix, sbmSize=list(Qmin=2, Qmax=2), nbCores=2) ``` ## Initialization The argument `initParam` is a list and you may change default values to increase or descrease the number of initial points of the algorithm. ## Save output To directly save the output, a `filename` has to be precised in the call of `fitNSBM()`. ## Parlallel computing By default, computations are parallelized using the maximum number of available cores. However, the argument `nbCores` can be used to customize this choice. ## Node clustering Node clusterings can be compared vie the adjusted Rand index. The package provides a function `ARI()` to evaluate this indicator. On simulated data we can compare to the true clustering: ```{r} ARI(res_gauss[[3]]$clustering, obs1$latentZ) ``` To compare two node clusterings with different number of clusters: ```{r} ARI(res_gauss[[2]]$clustering, res_gauss[[3]]$clustering) ``` For the example for the Gamma NSBM, we find that the clusterings are perfect in both models (`Exp` and `ExpGamma`): ```{r} ARI(res_exp[[2]]$clustering, obs2$latentZ) ARI(res_gamma[[2]]$clustering, obs2$latentZ) ``` # Graph inference by multiple testing Graph inference is done by the function `graphInference()`. The function provides the estimated binary graph at level `alpha` using the approach of q-values in a NSBM. The arguments are the observed `dataMatrix`, a `clustering`, a parameter estimate `theta`, a significance level `alpha` and the model `modelFamily`. It returns a list containing the inferred graph `A`: ```{r} resGraph <- graphInference(obs1$dataMatrix, res_gauss[[2]]$clustering, res_gauss[[2]]$theta, alpha = .05, modelFamily='Gauss') resGraph$A[1:5, 1:5] ``` The ouptut also contains the q-values: ```{r} resGraph$qval[1:10] ``` which is a vector of length $n(n-1)/2$: ```{r} length(resGraph$qval) ``` To visualize the result, we can use again the function `graphPlots()` that also returns the FDR and TDR: ```{r} plotGraphs(obs1$dataMatrix, resGraph$A, obs1$latentAdj) ``` The graph inference procedure is also implemented in the Gamma model, but only in case where the shape parameter of all Gamma distributions under the null and under the alternative is the same, so for instance in the model where all distributions are exponential distributions. ```{r} resGraph2 <- graphInference(obs2$dataMatrix, res_exp[[2]]$clustering, res_exp[[2]]$theta, alpha = .05, modelFamily='Gamma') resGraph2$A[1:5, 1:5] ``` The ouptut also contains the q-values: ```{r} resGraph2$qval[1:10] ``` which is a vector of length $n(n-1)/2$: ```{r} length(resGraph2$qval) ``` To visualize the result, we can use again the function `graphPlots()` that also returns the FDR and TDR: ```{r} plotGraphs(obs2$dataMatrix, resGraph2$A, obs2$latentAdj) ```