noisySBM
is the
accompanying R package of the article Powerful graph inference with
false discovery rate control by Rebafka, Roquain and Villers
available at https://arxiv.org/abs/1907.10176. It provides functions
for generating data in the so-called noisy stochastic block model
(NSBM), an estimation algorithm to fit parameters and cluster the nodes,
as well as a powerful graph inference procedure.
The context is the following. We observe a dense graph where most interactions are due to noise. In practice the graph may be a similarity matrix, a correlation matrix or a matrix containing the values of a test statistic applied on all pairs of nodes. The aim is to infer the meaningful edges in the graph. This is obtained by a new random graph model and a new powerful graph inference procedure that comes with a control the number of flase discoveries.
The current version of the R package supports the entire analysis for undirected graphs in the Gaussian NSBM and the Gamma NSBM. The Poisson case as well as directed graphs are left for future developments.
To start with, we load the package:
We introduce a new random graph model suited for the problem of graph inference. In this model the observed matrix is a perturbation of an unobserved binary graph, which is the true graph to be inferred. The binary graph is chosen to be a stochastic block model (SBM) for its capacity to model graph heterogeneity. We refer to the new model as the noisy stochastic block model (NSBM).
The definition is stated for an undirected graph without self-loops, but extensions are straightforward. Let n ≥ 2 be the number of nodes and 𝒜 = {(i, j) : 1 ≤ i < j ≤ n} the set of all possible pairs of nodes. Denote Q the number of latent node blocks. In the NSBM we observe edges Xi, j, (i, j) ∈ 𝒜, and denote X = (Xi, j)1 ≤ i, j ≤ n ∈ ℝn2 the corresponding symmetric, real-valued observation matrix. The observations Xi, j, (i, j) ∈ 𝒜, are generated by the following random layers:
First, a vector Z = (Z1, …, Zn) of block memberships of the nodes is generated, such that Zi, 1 ≤ i ≤ n, are independent with common distribution given by the probabilities πq = ℙ(Z1 = q), q ∈ {1, …, Q}, for some parameter π = (πq)q ∈ {1, …, Q} ∈ (0, 1)Q such that $\sum_{q=1}^Q \pi_q=1$.
Conditionally on Z, the variables Ai, j, (i, j) ∈ 𝒜, are independent Bernoulli variables with parameter wZi, Zj, for some w = (wq, l)q, l ∈ {1, …, Q} ∈ (0, 1)Q2. As the graph is undirected, w is symmetric. We denote A = (Ai, j)1 ≤ i, j ≤ n the corresponding symmetric adjacency matrix, which is a standard binary SBM.
Conditionally on (Z, A), the observations
Xi, j,
(i, j) ∈ 𝒜 are
independent and Xi, j
is sampled from the distribution
Xi, j ∼ (1 − Ai, j)g0, ν0 + Ai, jgνZi, Zj,
where ν0 ∈ 𝒱0 and
νq, l ∈ 𝒱,
1 ≤ q, l ≤ Q
are unknown parameters, with νl, q = νq, l
for all q, l, and
{g0, ν, ν ∈ 𝒱0}
and {gν, ν ∈ 𝒱}
are given parametric density families with non-empty open subsets 𝒱0 ⊂ ℝd0
and 𝒱 ⊂ ℝd1.
The relation between A and the observation X is that missing edges (Ai, j = 0) are replaced with pure random noise modeled by the density g0, ν0, also called the null density, whereas in place of present edges (Ai, j = 1) there is a signal or effect. The intensity of the signal depends on the block membership of the interacting nodes in the underlying SBM, modeled by the density gνq, l, also called alternative density for parameter νq, l.
The unknown global model parameter is θ = (π, w, ν0, ν), where π and w come from the SBM, ν0 denotes the null parameter and ν = (νq, l)1 ≤ q, l ≤ Q ∈ 𝒱Q2 denotes the parameters of the effects. Due to the symmetry of w and ν, we sometimes consider, with some abuse of notation, that the w and ν belong to ℝQ(Q + 1)/2 rather than ℝQ2 .
Our main model is the Gaussian case. It is particularly suitable for modeling situations where the observations Xi, j correspond to test statistics or correlations, which are known to be approximately Gaussian. The Gaussian NSBM corresponds to the NSBM with the following choice of the parametric density families: where different choices for 𝒱0 and 𝒱 lead to different variants. For example, often we assume that the mean of the null distribution is zero (μ0 = 0).
We may also consider the Gamma NSBM, where the parametric density families are chosen as follows:
In the Poisson NSBM both parametric families are the family of Poisson distirbutions:
In the noisySBM
package there is an overall parameters:
the graph type (directed or undirected). Let us consider the case:
The global model parameter θ = (π, w, ν0, ν)
of the NSBM is a list and its elements are pi
,
w
, nu0
and nu
.
Denote Q
the number of latent blocks in the SBM. Say
The parameters pi
and w
are the parameters
of the latent binary SBM, where pi
is a
Q
-vector indicating the mean proportions of nodes per
block. The vector pi
has to be normalized:
Parameter w
is a vector. For undirected graphs
w
is of length Q(Q + 1)/2 with the
following indexing w = (w1, 1, w1, 2, …, w1, Q, w2, 2, …, w2, Q, w3, 3, …, wQ, Q).
The parameter wq, l ∈ (0, 1)
is a Bernoulli parameter indicating the probability that a node in group
q is connected to a node in
group l. For directed graphs
w
is of length Q2 with the following
indexing w = (w1, 1, w1, 2, …, w1, Q, w1, 2, …, w2, Q, w1, 3, …, wQ, Q).
In the directed case wq, l ∈ (0, 1)
is the probability of a directed edge from a node in group q to a node in group l.
For our undirected graph with two latent blocks, let us consider
This means that both blocks are communities (as w1, 1 = 0.8 and w2, 2 = 0.9), as nodes have a large probability to be connected to nodes in the same block, and a low probability to connect to nodes in the other group (w1, 2 = 0.1).
For the noisy layer, we have to choose a model, Gaussian, Gamma or Poisson.
In the Gaussian model, the null parameter nu0
is a
vector of length 2, giving the mean and the standard deviation of the
the Gaussian null distrubtion (ν0 = (ν0mean, ν0sd)).
Mostly, we choose the standard normal distribution:
Parameter nu
is a matrix with two columns: the first
indicates Gaussian means νq, lmean,
the second standard deviations νq, lsd
of the alternative distributions for the block pairs (q, l). The number of rows
of nu
is identical to the length of w
, that
is, in the undirected case nu
has Q(Q + 1)/2 rows, otherwise
Q2. The rows are
indexed like w
, that is, $$
\nu^{\text{undirected}}=\begin{pmatrix}\nu_{1,1}\\\vdots\\\nu_{1,Q}\\\nu_{2,2}
\\\vdots\\ \nu_{Q,Q}
\end{pmatrix}\in\mathbb R^{Q(Q+1)/2\times 2}\qquad
\nu^{\text{directed}}=\begin{pmatrix}\nu_{1,1}\\\nu_{1,Q}\vdots\\\\\nu_{2,1}
\\\vdots\\ \nu_{Q,Q}
\end{pmatrix}\in\mathbb R^{Q^2\times 2}\qquad
\text{with }
\nu_{q,l}=(\nu_{q,l}^\text{mean},\nu_{q,l}^\text{sd})\in\mathbb R^2.
$$
For our undirected model with two latent blocks, we choose
## [,1] [,2]
## [1,] -1 1
## [2,] 5 1
## [3,] -1 1
To generate a data set with, say, n = 10 nodes from the corresponding
NSBM we use the function rnsbm()
:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0.00 0.84 -1.02 0.93 0.24 -0.13 0.35 0.36 0.21 -1.82
## [2,] 0.84 0.00 -1.65 -0.99 -1.15 -0.90 -2.78 -1.47 1.83 -0.09
## [3,] -1.02 -1.65 0.00 0.57 0.36 0.94 -0.69 -0.18 -0.14 -1.28
## [4,] 0.93 -0.99 0.57 0.00 1.13 -0.39 -2.11 -1.20 -1.55 -0.66
## [5,] 0.24 -1.15 0.36 1.13 0.00 -0.38 -2.50 -0.82 -0.59 -1.53
## [6,] -0.13 -0.90 0.94 -0.39 -0.38 0.00 -2.19 -1.18 -1.15 0.71
## [7,] 0.35 -2.78 -0.69 -2.11 -2.50 -2.19 0.00 1.16 -0.61 0.54
## [8,] 0.36 -1.47 -0.18 -1.20 -0.82 -1.18 1.16 0.00 6.29 1.41
## [9,] 0.21 1.83 -0.14 -1.55 -0.59 -1.15 -0.61 6.29 0.00 -0.93
## [10,] -1.82 -0.09 -1.28 -0.66 -1.53 0.71 0.54 1.41 -0.93 0.00
We see that the generated matrix is symmetric, as we are in the
undirected case. The function rnsbm()
also provides the
latent binary graph, named latentAdj
, which is also
symmetric:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 0 0 0 1 1
## [2,] 0 0 1 0 0 1 1 1 0 0
## [3,] 0 1 0 1 0 0 0 1 0 0
## [4,] 0 0 1 0 0 1 1 1 0 0
## [5,] 0 0 0 0 0 1 1 1 0 0
## [6,] 0 1 0 1 1 0 1 1 0 0
## [7,] 0 1 0 1 1 1 0 0 0 0
## [8,] 0 1 1 1 1 1 0 0 1 0
## [9,] 1 0 0 0 0 0 0 1 0 1
## [10,] 1 0 0 0 0 0 0 0 1 0
as well as the latent blocks, names latentZ
:
## [1] 2 1 1 1 1 1 1 1 2 2
A more convenient visualisation of the data is given by the function
plotGraphs()
:
## NULL
To generate data from a directed graph, the parameters w
and nu
have to be adapted to length Q2, say
## [,1] [,2]
## [1,] -1 1
## [2,] 5 1
## [3,] 10 1
## [4,] -1 1
To generate data from the directed NSBM, we set
directed = TRUE
:
obs <- rnsbm(n = 10, theta, modelFamily = 'Gauss', directed = TRUE)
plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj)
## NULL
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0.00 -0.51 -0.48 -1.67 -0.12 0.04 -1.76 -0.59 -1.80 -0.32
## [2,] -0.01 0.00 -1.34 0.35 -1.09 -2.43 -1.89 -2.65 -0.01 -0.53
## [3,] 0.14 0.23 0.00 3.93 0.19 -1.60 -2.18 -1.35 -1.38 -1.34
## [4,] -0.75 -0.34 0.13 0.00 -1.59 -1.63 -0.35 -0.95 0.01 0.05
## [5,] -0.36 -0.50 0.46 -1.02 0.00 -1.51 9.18 11.10 -0.78 -0.35
## [6,] -2.93 -0.57 -1.41 -1.44 0.87 0.00 -0.07 -0.58 -1.23 0.46
## [7,] -2.39 -0.24 -1.55 4.91 0.05 -1.14 0.00 1.05 -0.21 -1.56
## [8,] -1.62 -1.47 -0.65 -2.46 -0.65 -0.21 -1.51 0.00 -0.03 -1.43
## [9,] -1.89 -1.77 -0.18 0.70 1.50 0.39 -0.98 -2.13 0.00 -0.55
## [10,] -0.18 -0.66 1.04 1.89 0.68 -2.34 -4.23 -0.65 -0.15 0.00
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 1 1 0 0 0 1 1 1 0
## [2,] 0 0 1 0 0 1 1 1 1 1
## [3,] 1 0 0 1 0 1 1 1 1 1
## [4,] 0 0 0 0 1 0 0 0 0 0
## [5,] 0 0 0 1 0 0 1 1 0 0
## [6,] 1 1 1 0 0 0 1 1 1 1
## [7,] 1 1 1 1 0 1 0 1 1 1
## [8,] 1 1 1 0 0 1 1 0 1 1
## [9,] 0 1 0 0 0 1 1 1 0 1
## [10,] 1 0 0 0 0 1 1 1 0 0
Now the matrices are no longer symmetric.
As the Gamma distribution also has two parameters, nu0
is a vector of length 2 and nu
is a 2-column matrix as in
the Gaussian case.
Let us take as an example an undirectd graph in a model with Q = 3 latent blocks with equiprobable blocks:
Suppose that inter-block probabilities are high, while intra-block probabilities are low:
The null distribution is the exponential distribution Exp(2):
For the alternative distributions we also choose exponentials:
## [,1] [,2]
## [1,] 1 1.0
## [2,] 1 0.5
## [3,] 1 0.5
## [4,] 1 1.0
## [5,] 1 0.5
## [6,] 1 1.0
We generate a graph with 10 nodes:
obs <- rnsbm(n = 10, theta, modelFamily = 'Gamma')
plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj)
## NULL
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0.00 1.11 4.27 0.12 0.20 1.91 2.07 0.60 0.28 0.01
## [2,] 1.11 0.00 3.96 0.79 1.40 2.53 3.00 1.06 1.47 1.77
## [3,] 4.27 3.96 0.00 0.44 3.88 2.07 1.72 0.26 1.03 2.28
## [4,] 0.12 0.79 0.44 0.00 3.40 4.74 6.96 0.04 2.77 0.58
## [5,] 0.20 1.40 3.88 3.40 0.00 1.81 0.42 0.21 0.55 0.62
## [6,] 1.91 2.53 2.07 4.74 1.81 0.00 1.20 0.10 0.37 0.12
## [7,] 2.07 3.00 1.72 6.96 0.42 1.20 0.00 0.64 0.39 0.89
## [8,] 0.60 1.06 0.26 0.04 0.21 0.10 0.64 0.00 0.38 0.26
## [9,] 0.28 1.47 1.03 2.77 0.55 0.37 0.39 0.38 0.00 0.95
## [10,] 0.01 1.77 2.28 0.58 0.62 0.12 0.89 0.26 0.95 0.00
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 1 0 1 0 1 0 0 1
## [2,] 0 0 0 1 1 1 1 1 1 1
## [3,] 1 0 0 0 1 1 1 1 1 1
## [4,] 0 1 0 0 1 1 1 0 1 1
## [5,] 1 1 1 1 0 1 0 0 0 0
## [6,] 0 1 1 1 1 0 0 0 0 0
## [7,] 1 1 1 1 0 0 0 0 0 0
## [8,] 0 1 1 0 0 0 0 0 0 1
## [9,] 0 1 1 1 0 0 0 0 0 1
## [10,] 1 1 1 1 0 0 0 1 1 0
## [1] 1 3 3 3 2 1 2 1 1 2
The Poisson distribution has a single parameter, so that
nu0
is a scalar and nu
becomes a one-column
matrix.
As an example consider a directed Q = 1-block Poisson NSBM with parameters:
We generate a graph with 10 nodes:
obs <- rnsbm(n = 10, theta, modelFamily = 'Poisson', directed = TRUE)
plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj)
## NULL
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 5 3 5 6 4 4 0 4 6
## [2,] 6 0 5 6 7 4 1 0 1 8
## [3,] 3 2 0 4 4 1 9 6 4 4
## [4,] 3 3 3 0 0 1 0 6 7 0
## [5,] 0 10 9 8 0 9 5 3 0 1
## [6,] 1 5 3 0 8 0 3 1 2 3
## [7,] 7 2 1 5 1 0 0 1 10 4
## [8,] 1 4 4 3 1 2 4 0 0 3
## [9,] 3 4 4 0 2 2 2 2 0 5
## [10,] 2 2 3 1 0 4 7 0 6 0
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 1 1 1 1 1 1 0 0 1
## [2,] 1 0 1 1 1 1 0 0 0 1
## [3,] 1 0 0 1 1 0 1 1 1 1
## [4,] 0 1 1 0 0 0 0 1 1 0
## [5,] 0 1 1 1 0 1 1 1 0 0
## [6,] 0 1 1 0 1 0 1 0 0 1
## [7,] 1 1 0 1 0 0 0 0 1 1
## [8,] 0 1 0 1 0 0 1 0 0 1
## [9,] 1 1 0 0 1 1 1 0 0 1
## [10,] 0 0 1 1 0 0 1 0 1 0
## [1] 1 1 1 1 1 1 1 1 1 1
fitNSBM()
and outputThe function fitNSBM()
fits the parameters of a NSBM to
a data set in different models. The function applies to a symmetric
real-valued square matrix dataMatrix
. Let us see a basic
example:
set.seed(1)
theta1 <- list(pi=c(.5,.5), w=c(.8,.1,.2), nu0=c(0,1), nu=matrix(c(-1,5,10, 1,1,1), ncol=2))
obs1 <- rnsbm(n=30, theta1)
fitNSBM()
applies the VEM-algorithm to estimate model
parameters and compute a node clustering. By default, the algorithm
explores different SBM models starting from a model with a single latent
block (Q=1
) up to 4 blocks or more if appropriate. The
output of fitNSBM()
is a list with estimation results for
all visited models (for Qmin=1
to Qmax=4
(or
more)). Here, for Q=2
for instance, we obtain the estimates
of the model parameters θ
by
## $w
## [1] 0.8336686 0.1025082 0.1392889
##
## $nu0
## [1] 0.00000 1.02319
##
## $nu
## [,1] [,2]
## [1,] -1.172879 0.8221477
## [2,] 5.089987 0.9116246
## [3,] 10.233534 0.9057642
##
## $pi
## [1] 0.4327493 0.5672507
A node clustering is given by
## [1] 2 2 1 1 2 1 1 1 1 2 2 2 1 2 1 2 1 1 2 2 1 2 1 2 2 2 2 2 1 2
The list element sbmParam
contains further results
concerning the latent binary SBM, namely the number of latent
blocks:
## [1] 2
the variational parameters, i.e. the probabilities that a node belongs to one of the latent blocks:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.0000001 0.0000001 0.9999999 0.9999999 0.0000001 0.9999999 0.9999999
## [2,] 0.9999999 0.9999999 0.0000001 0.0000001 0.9999999 0.0000001 0.0000001
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0.9999999 0.9999999 0.0000001 0.0000001 0.0000001 0.9999999 0.0000001
## [2,] 0.0000001 0.0000001 0.9999999 0.9999999 0.9999999 0.0000001 0.9999999
## [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## [1,] 0.9999999 0.0000001 0.9998480401 0.9999999 0.0000001 0.1307727 0.8518565
## [2,] 0.0000001 0.9999999 0.0001519599 0.0000001 0.9999999 0.8692273 0.1481435
## [,22] [,23] [,24] [,25] [,26] [,27] [,28]
## [1,] 0.0000001 0.9999999 0.0000001 0.0000001 0.0000001 0.0000001 0.0000001
## [2,] 0.9999999 0.0000001 0.9999999 0.9999999 0.9999999 0.9999999 0.9999999
## [,29] [,30]
## [1,] 0.9999999 0.0000001
## [2,] 0.0000001 0.9999999
and the conditional probabilities of an edge between a pair of nodes given the block memberships of the nodes:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 7.692067e-02 9.058548e-01 5.283000e-02 7.476060e-01 6.906621e-01
## [2,] 2.045408e-04 4.729178e-11 4.974843e-04 8.095808e-09 2.256252e-08
## [3,] 2.867288e-16 1.460504e-16 3.646087e-16 9.331119e-17 9.216202e-17
These probabilities correspond to 1 − l-values, which are used in the
test procedure for graph inference. The dimension of
sbmParam$edgeProba
is Q(Q + 1)/2 times n(n − 1)/2 (in the
undirected case):
## [1] 3 435
that is, the rows correspond to group pairs (q, l) and the columns to node pairs (i, j).
Moreover, sbmParam
contains the values of the integrated
classification criterion:
## [1] -1647.502
Finally, the list element convergence
provides details
on the convergence of the algorithm:
## $J
## [1] -762.9652
##
## $complLogLik
## [1] -791.6735
##
## $converged
## [1] TRUE
##
## $nbIter
## [1] 82
For the edges different distributions can be considered. The
noisySBM
package supports the Gaussian NSBM and the Gamma
NSBM. In the Gaussian case a number of different restrictions on the
parameter space are implemented.
By default, fitNSBM()
considers a Gaussian NSBM with
model='Gauss0'
, where the null distribution is a
centered normal distribution with ν0mean = 0 and
standard deviation ν0sd ∈ ℝ+
to be estimated, and all parameters νq, l = (νq, lmean, νq, lsd)
for 1 ≤ q ≤ l ≤ Q of
the alternative Gaussian distributions are estimated without any
constraint.Other choices for model
in the Gaussian case are the
following:
Gauss
: all Gaussian parameters of the null and the
alternative distributions are unknown ; this is the Gaussian model with
the maximum number of unknown parameters (ν0mean ∈ ℝ,
νq, lmean ∈ ℝ,
ν0sd ∈ ℝ+,
νq, lsd ∈ ℝ+
for all q ≤ l).
Gauss01
: compared to Gauss
, the null
distribution is set to 𝒩(0, 1) (ν0mean = 0,
ν0sd = 1))
GaussEqVar
: compared to Gauss
, all
Gaussian variances (of both the null and the alternative) are supposed
to be equal, but unknown (ν0sd = νq, lsd ∈ ℝ+
for all q ≤ l).
Gauss0EqVar
: compared to GaussEqVar
,
the mean of the null distribution is set to 0 (ν0mean = 0).
Gauss0Var1
: compared to Gauss
, all
Gaussian variances are set to 1 and the null distribution is set to
𝒩(0, 1) (ν0mean = 0,
ν0sd = νq, lsd = 1)
Gauss2distr
: the alternative distribution is a
single Gaussian distribution, i.e. the block memberships of the nodes do
not influence on the alternative distribution (all νq, l
are identical).
GaussAffil
: compared to Gauss
, for the
alternative distribution, there’s a distribution for inter-group and
another for intra-group interactions (there are parameters νin and νout such that νq, q = νin
for every q, and νq, l = νout
for every q < l).
The estimation procedure fitNSBM()
supports two
exponential or Gamma models:
Exp
: the null and the alternatives are all
exponential distributions with unknown parameters (i.e. they are Gamma
distributions with shape parameter equal to one and unknown scale
parameter)
ExpGamma
: the null distribution is exponential with
unknown parameter, the alterantive distribution are Gamma distributions
where both parameters are unknown.
Let us consider the following example:
theta2 <- list(pi=rep(1/2, 2),
w=c(.2, .8, .2),
nu0=c(1,5),
nu=matrix(c(1, 1, 1, 1, 1/3, 1), ncol=2))
set.seed(2)
obs2 <- rnsbm(n=60, theta2, modelFamily='Gamma')
Let us fit both Gamma models to the data :
res_exp <- fitNSBM(obs2$dataMatrix, model='Exp', nbCores = 2)
set.seed(3)
res_gamma <- fitNSBM(obs2$dataMatrix, model='ExpGamma', nbCores = 2)
The parameter estimates obtained for Q = 2 are close to the true parameter values in both cases:
## $w
## [1] 0.2572669 0.7285259 0.1674702
##
## $nu0
## [1] 1.00000 5.09704
##
## $nu
## [,1] [,2]
## [1,] 1 1.1857191
## [2,] 1 0.3159445
## [3,] 1 0.9510055
##
## $pi
## [1] 0.4666667 0.5333333
## $w
## [1] 0.1787426 0.6857469 0.1441840
##
## $nu0
## [1] 1.000000 4.879132
##
## $nu
## [,1] [,2]
## [1,] 1.522966 1.4000245
## [2,] 1.205248 0.3603533
## [3,] 1.064406 0.9241991
##
## $pi
## [1] 0.4666667 0.5333333
When the algorithm explores several model sizes Q, then model selection can be
performed via the ICL criterion. The function getBestQ()
gives the best model size Q̂
that maximizes the ICL:
## $Q
## [1] 3
##
## $ICL
## [1] -1625.099
Typically, the ICL curve is a concave function, as we can see by
applying function plotICL
:
Likewise, for our exponential data, we obtain:
## $Q
## [1] 2
##
## $ICL
## [1] -4034.005
## $Q
## [1] 2
##
## $ICL
## [1] -3855.48
and for the ICL curves:
dfICLexp <- data.frame(ICL = sapply(res_exp, function(el) el$sbmParam$ICL),
Q = sapply(res_exp, function(el) el$sbmParam$Q), model='Exp')
dfICLgam <- data.frame(ICL = sapply(res_gamma, function(el) el$sbmParam$ICL),
Q = sapply(res_gamma, function(el) el$sbmParam$Q), model='ExpGamma')
dfICL <- rbind(dfICLexp, dfICLgam)
ggplot(dfICL, aes(x=Q, y=ICL, group=model, colour=model) ) +
geom_line() +
xlim(1,4)
The argument sbmSize
of fitNSBM()
is used
to precise the models to be explored. When sbmSize$Qmin
and
sbmSize$Qmax
are specified, then only this range of models
is explored. For a single model size, do e.g.
The argument initParam
is a list and you may change
default values to increase or descrease the number of initial points of
the algorithm.
To directly save the output, a filename
has to be
precised in the call of fitNSBM()
.
By default, computations are parallelized using the maximum number of
available cores. However, the argument nbCores
can be used
to customize this choice.
Node clusterings can be compared vie the adjusted Rand index. The
package provides a function ARI()
to evaluate this
indicator. On simulated data we can compare to the true clustering:
## [1] 0.8891908
To compare two node clusterings with different number of clusters:
## [1] 0.8663651
For the example for the Gamma NSBM, we find that the clusterings are
perfect in both models (Exp
and ExpGamma
):
## [1] 1
## [1] 1
Graph inference is done by the function
graphInference()
. The function provides the estimated
binary graph at level alpha
using the approach of q-values
in a NSBM. The arguments are the observed dataMatrix
, a
clustering
, a parameter estimate theta
, a
significance level alpha
and the model
modelFamily
. It returns a list containing the inferred
graph A
:
resGraph <- graphInference(obs1$dataMatrix, res_gauss[[2]]$clustering, res_gauss[[2]]$theta, alpha = .05, modelFamily='Gauss')
resGraph$A[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 1 0
## [3,] 0 0 0 1 0
## [4,] 0 1 1 0 0
## [5,] 0 0 0 0 0
The ouptut also contains the q-values:
## [1] 0.6530847 0.6127403 0.1729543 0.6530847 0.4992396 0.5169365 0.2844466
## [8] 0.3909315 0.6530847 0.6530847
which is a vector of length n(n − 1)/2:
## [1] 435
To visualize the result, we can use again the function
graphPlots()
that also returns the FDR and TDR:
## $FDR
## [1] 0.05940594
##
## $TDR
## [1] 0.8050847
The graph inference procedure is also implemented in the Gamma model, but only in case where the shape parameter of all Gamma distributions under the null and under the alternative is the same, so for instance in the model where all distributions are exponential distributions.
resGraph2 <- graphInference(obs2$dataMatrix, res_exp[[2]]$clustering, res_exp[[2]]$theta, alpha = .05, modelFamily='Gamma')
resGraph2$A[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 1 0 0 1
## [2,] 1 0 0 1 0
## [3,] 0 0 0 1 0
## [4,] 0 1 1 0 0
## [5,] 1 0 0 0 0
The ouptut also contains the q-values:
## [1] 5.864995e-07 2.776895e-01 2.508740e-01 6.342863e-04 2.655163e-01
## [6] 4.825911e-02 1.583784e-03 4.200489e-01 4.114394e-04 1.662359e-01
which is a vector of length n(n − 1)/2:
## [1] 1770
To visualize the result, we can use again the function
graphPlots()
that also returns the FDR and TDR:
## $FDR
## [1] 0.04626866
##
## $TDR
## [1] 0.7412993