1. Motivation
There is growing interest in the development of models and algorithms that capture group-level interactions [
1,
2,
3]. For example, multiple co-authors may be involved in a collaboration, multiple workers may share an office space, and multiple proteins may contribute in a cellular process. In such cases, representing the connectivity via a network of pairwise interactions is an obvious, and often avoidable, simplification. Hypergraphs, where any number of nodes may be grouped together to form a hyperedge, form a natural generalisation. Hypergraph-based techniques have been developed for the following:
Studying the propagation of disease or information [
4,
5,
6,
7,
8,
9];
Investigating the importance or structural roles of individual components [
10,
11,
12];
Discovering and quantifying clusters [
13,
14,
15];
Predicting future connections [
16,
17];
Inferring a connectivity structure from time-series data [
18].
Just as in the pairwise setting, it is also of interest to consider processes that create hypergraphs [
19,
20,
21]. Comparing generative hypergraph models against real data sets may help us to understand the mechanisms through which interactions arise. Furthermore, realistic models can be used to produce synthetic data sets on which to base simulations and also to form null models for studying features of interest.
Models that use a geometric construction, with connectivity between elements determined by distance, have proved useful in many settings. Random geometric graphs were first introduced in [
22] to model communication between radio stations, although the author also mentioned their relevance to the spread of disease. These models have subsequently proved useful in many application areas, ranging from studies of the proteome [
23,
24,
25] to academic citations [
26]. In many settings, the notion of distance may relate to the embedding of nodes into a latent space that captures key features. Here, similarity is interpreted in an indirect or abstract sense. Random geometric graphs have also been studied theoretically, with many interesting results arising from the perspectives of analysis, probability, and statistical physics [
27,
28,
29,
30,
31,
32].
Our aim in this work is to motivate and analyse a random geometric hypergraph model. In a similar manner to [
19], we make use of the connection between hypergraphs and bipartite graphs. The model is introduced and motivated in
Section 2, where we also show the results of illustrative computational experiments concerning connectivity. Our main contribution is to derive a condition on the thresholding radius that asymptotically guarantees connectivity of the hypergraph. The result is stated and proved in
Section 3. Some further computations concerning the expected degree are presented and interpreted in
Section 4, and directions for future work are described in
Section 5.
2. The Random Geometric Model and Its Connectivity
In this section, we motivate and informally describe a random geometric hypergraph model and computationally investigate its connectivity. We make use of a well-known equivalence between hypergraphs and bipartite graphs [
19,
33]. Suppose we are given an undirected bipartite graph where nodes have been separated into two groups, A and B. By construction, any edge must join one node in group A with one node in group B. We may form a hypergraph on the nodes in group A with the following rule:
In this way, the nodes in set B may be viewed as hyperedge “centres.” Two nodes from group A that are attracted to the same centre are allocated to the same hyperedge. In many graph settings, there is a natural concept of distance between nodes. For example, in social networks, geographical distance between places of work or residences may play a strong role in determining connectivity. More generally, there may be a more nuanced set of features (hobbies, tastes in music, pet ownership, etc.) that help to explain whether pairwise relationships arise. This argument extends readily to the bipartite/hypergraph scenario. Hyperedge centres may correspond, for example, to shops, office buildings, gyms, train stations, restaurants, concert venues, churches, etc., with an individual joining a hyperedge if they are sufficiently close to that centre, for example, exercising at a local gym. In the absence of specific information, it is natural to assume that the features possessed by a node arise at random, so that a node is randomly embedded in for some dimension d. In a similar way, we may simultaneously embed our hyperedge centres in and assign a node to a hyperedge if and only if it is within some threshold distance of the centre.
Figure 1 illustrates the idea in the two-dimensional case. We have a bipartite graph with two types of nodes. Groups A and B are represented by circles and stars, respectively. We form a hypergraph by placing a circle node in a hyperedge if and only if it is within a certain distance of the corresponding star. Colours in the figure distinguish between the different hyperedges. We emphasise that mathematically the resulting hypergraph consists only of the list of hypergraph nodes and hyperedges. Information about the existence/number of hyperedge centres and the locations of all nodes in
is lost.
Our aim in this work is to study connectivity: a basic property that is of practical importance in many areas, including disease propagation, communication, and percolation. We consider the random geometric hypergraph to be connected if the underlying random geometric bipartite graph is connected. We focus on the smallest distance threshold that produces a connected network and study an asymptotic limit where the number of nodes tends to infinity.
We motivate our analytical results with computational experiments. To produce
Figure 2, we formed random geometric bipartite graphs based on
n points embedded in
. For each graph, the points had components chosen uniformly and independently in the range
. We separated these points into two groups of size
and
. We then used a bisection algorithm to compute the smallest radius
r that produced a connected bipartite graph. In other words, we found the smallest
r such that a connected graph arose when we created edges between pairs of nodes from different groups that were separated by a Euclidean distance less than
r. (Equivalently, we assigned
points to the role of the nodes in a random geometric hypergraph and
points to the role of the hyperedge centres, and we computed the smallest node–hyperedge centre radius that gave connectivity.) We ran the experiment for a range of
n values between
and
. For each choice of
n, we repeated the computation for 500 independent random node embeddings.
Figure 2 shows the mean, maximum, and minimum radius arising for each
n. Note that the axes are scaled logarithmically. We have superimposed a reference line of the form
, which is seen to be consistent with the behaviour of the radius.
Figure 3 and
Figure 4 repeat these computations with the points embedded into
and
, respectively. We see that the behaviour remains consistent with a decay roughly proportional to, and perhaps slightly slower than,
for dimension
d.
In the next section, we formalise our definition of a random geometric hypergraph and establish a condition on the radius decay rate for connectivity that agrees with
, up to log-dependent factors (which of course would be extremely difficult to pin down in computational experiments). We also note for comparison that a threshold of the form
has previously arisen in the study of random geometric graphs, refs. [
30,
34].
In related work, we note that Barthelemy [
19] proposed and studied a wide class of random hypergraph models, including examples where nodes are embedded in space and connections arise via a distance measure. That approach to defining a random geometric hypergraph differs from ours by assuming that the number of hyperedges is given and by considering a process where new nodes are added to the network, with new connections arising based on the current hyperedge memberships (Figure 6 in [
19]).
3. Connectivity Analysis
We now give a formal definition of a random geometric hypergraph and show that under reasonable conditions a thresholding radius of order ensures connectivity, asymptotically.
Let D be a bounded Euclidean domain in such that D has a Lipschitz boundary. Given , we let be a Poisson point process sampled from D with respect to some continuous and bounded distribution f such that everywhere on D. We use to denote the Euclidean norm. Let , and let be the expeted number of nodes and be the expected number of hyperedges, chosen such that . Let be a function of n, tending to 0 as .
Definition 1. Let be the probability space on the set of geometric hypergraphs, where the random nodes are chosen as a Poisson point process in D sampled with respect to f; the random hyperedges are induced by another Poisson point process in D sampled with respect to f; and where, using bipartite graph–hypergraph equivalence, a node and a hyperedge are connected by an edge if .
Suppose that the expected number of nodes
and of hyperedges
satisfy
Equivalently, this means that
and
as functions of
n satisfy
Let
be the smallest constant such that for all
,
Partition
into a grid of cubes
of width
, where
and
is to be determined. Let
, and for each
, let
and let
Because
D has a Lipschitz boundary, by compactness there exists
depending on
D and
d (but not on
), such that we can choose
sufficiently large such that for all
and all
We then choose
so that for all
,
Note also that we have where
Lemma 1 (Asymptotic coverage).
Suppose that m as a function of n satisfies, for all ,and suppose that satisfieswhere arbitrarily slowly as . With the probability tending to 1 as , for all , Proof. It suffices to show that the RHS in
tends to 0 as
.
Because
is a homogeneous Poisson point process, we have, using (
3), for all
,
and by the pigeonhole principle,
. Hence,
□
Note that with our choice of
, we have
Hence, Lemma 1 gives us a lower-bound estimate on the decay of
as a function of
n, to ensure that the balls centred at the points of
and of radius
tend to form a covering of the domain
D as
. This is an asymptotic result.
From a practical point of view, it is more useful to have a non-asymptotic version of Lemma 1, even if we must increase slightly the constraint on the decay of . This is the object of Lemma 2.
Lemma 2 (Non-asymptotic coverage).
Suppose that m as a function of n satisfies, for all ,and suppose this time that satisfiesfor some fixed , then a.s., there exists such that for all and all Proof. A proof proceeds similarly to that of Lemma 1, but the different constraint on
instead yields
and the required result then follows by using the Borel–Cantelli lemma, because then, the series
converges as
. □
We believe that the lower-bound condition on the decay of found in Lemma 1 is sharp and that the lower-bound condition in Lemma 2 is close to being sharp. In Theorem 1, we apply Lemmas 1 and 2 to obtain a sufficient lower-bound condition on for the connectivity of random geometric hypergraphs, with an extra factor of 2. We suspect that this factor could be reduced with a more sophisticated analysis.
Theorem 1. For every , let satisfy (1) and . If satisfies (3), then with the probability tending to 1 as , the random geometric bipartite graph is connected.
If satisfies (4), then a.s. there exists , such that for all , the random geometric bipartite graph is connected.
Proof. The result is a consequence of Lemmas 1 and 2 and the triangle inequality.
Suppose that
is such that for all
,
Given two points
, we can find a path of adjacent cubes from
such that the first cube contains
x and the last cube contains
y. From (
2) and the triangle inequality, the distance between a point in one cube and another point in an adjacent cube is at most
. Because for each cube in the path we can find a point from
and a point from
, we can then form a path of edges of a length that is at most
from
x to
y, alternating between points in
and points in
, and such a path is then a path in
This shows the connectivity of the graph for all
n, satisfying condition (
5).
This condition is true with the probability tending to 1 as
, if we assume that
satisfies (
3), using Lemma 1 with
and
instead of
m, giving us the first part of the theorem.
Using Lemma 2 with
and
instead of
m, if
satisfies (
4), there exists
such that (
5) is true for all
, giving us the second part of the theorem. □
4. Expected Degree at Connectivity Threshold
We now present some further computations that expand on the results in
Section 3. We recall that in
Figure 2,
Figure 3 and
Figure 4 we evaluated the threshold radius at which connectivity occurred. In
Figure 5,
Figure 6 and
Figure 7 we used the same random geometric hypergraph samples, each evaluated at its connectivity threshold. For each hypergraph, we computed the expected node degree and expected hyperedge degree, that is, for the underlying bipartite graph, the expected degree of the nodes in group A and in group B. The figures, which again are on a log–log scale, indicate that the two expected degrees grow slowly with
n. In each figure, we have included a reference curve proportional to
.
We can offer a heuristic explanation for these curves. To be concrete, we focus on the nodes in group A of the bipartite graph. Here, because the group B nodes are placed uniformly at random, the expected degree is roughly the number of group A nodes contained in a general ball of radius
r, where
r is the connectivity radius. We can compute this quantity as the sum of the independent probabilities that each node is in the ball, which is of the order of the volume of the ball, that is,
. Because there are order
n nodes, this suggests an expected degree of the order
. Using
from
Section 3 for the threshold radius, we arrive at an expected degree of order
. This prediction gives a reasonable match to the results in
Figure 5 and
Figure 6, where the embedding dimensions are two and four, respectively.
However, we note that in
Figure 7, where the embedding dimension is ten, the mean node degree and hyperedge degree appear to grow slightly faster than
. We believe that in a high dimension, a concentration-of-measure effect becomes relevant. In order to have more meaningful information on the mean degree, we would like to be able to control simultaneously the number of nodes of graph A in the
n balls of radius
r centred at the
n nodes of graph B and to be able to argue that this random number behaves asymptotically like
n times the volume of the ball (the expected number), i.e., remains near its expected value for each of the
n balls. For this to be true, we would need some concentration inequalities, which would ensure that the empirical measure induced by the random sample of the nodes of graph A yields a good approximation of the underlying sampling measure when evaluated at
n random balls of radius
r. Such concentration inequalities are known to hold in a regime slightly more restrictive than that of connectivity, i.e., where
grows slightly faster than
; see, for instance, Lemma
in [
35], where such concentration inequalities are valid provided
.
A second issue is that the effect of interchanging the order in which the expectation operation and operation are applied cannot be understood without careful analysis.
We leave for future work the task of formalising and proving appropriate asymptotic statements about the degree structure for this random geometric hypergraph model.