1. Introduction
Advances in machine intelligence have led to an increase in human-agent teaming. In this context, one or more machines act as semi-autonomous or autonomous agents interacting with other machine teammates and/or their human proxies. This phenomenon has led to cooperative work models in which the role of an agent can be filled, interchangeably, by a human or by a machine support system. Human counterparts that interact with automation become less like operators, supervisors, or monitors, and more like equal-authority peers.
Critical to the success of any team is efficient and effective communication, and multi-agent systems are no different. Information sharing is a key element in building collective cognition, and it enables agents to cooperate and ultimately achieve shared goals; in other words, information sharing, or communication, provides the foundation for a team's success. In complex multi-agent engagements, information is not always universally available to all agents. Such engagements are often characterized by distributed entities with limited communication channels among them, where no agent has a complete view of the solution space, and information relevant to team goals only becomes available to team members in spontaneous, unpredictable, and even unanticipated ways. Moreover, there is always a resource cost to inter-agent communication. Finding highly efficient and effective communication patterns is therefore a recurring problem in any multi-agent system, particularly when its agents are distributed.
We are concerned with how a Multi-agent System (MAS) [1], or Team, sends information between agents or teammates. By "how" we mean "how" in an information-theoretic [2] sense; in particular, we do not concentrate on the mechanics or physics of the transmission beyond how they affect the information theory. We are concerned with what strategy an agent can use to maximize its information flow to another agent. From an information-geometric standpoint, we use only a simple metric in this article, but we lay the groundwork for more complex Riemannian metrics. We consider a transmitting agent that sends a small number of distinct symbols in a fixed time; in fact, we restrict ourselves to two symbols to develop our theory (a list of notation is at the end of the article). We use a mathematical approach to model the communication between two agents. The equations we present are based on a series of assumptions that we explain below.
We assume that an agent sends one of two symbols to another agent; we refer to the symbols as "0" and "1". We are concerned with the fidelity with which the symbols are passed, and all symbols take the same time to pass. We will be looking at the (Shannon) capacity as one agent attempts to send a symbol to another agent.
Our scenario is illustrated in Figure 1 and Figure 2. The first agent sends a 0 or a 1 to the second agent. We have a clock, and the unit of time is t. Every t, the first agent transmits the symbol to the second agent. We assume that the symbol is received within the same time unit (i.e., we assume instantaneous transmission speeds during each interval t). There is no feedback from the receiving agent to the transmitting agent (which, for the channels we analyze, would not change the capacity anyway (p. 520 [3])), and the transmission is considered to be memoryless (quoting [4], "…channel is memoryless if the probability distribution of the output depends only on the input at that time and is conditionally independent of previous channel inputs or outputs"). Furthermore, it is implicit that the channel statistics never change (the literature sometimes refers to this as a "stationary" condition).
To summarize the above, we have a Discrete Memoryless Channel (DMC) between the two agents. This channel measures information flow in terms of bits per symbol (since t does not vary). We let X represent the input to this DMC, and we let Y denote the output random variable.
The probability distribution of the random variable $X$ is given by $p_i := P(X = i)$, the probability that the transmitting agent inputs symbol $i$, and $q_j := P(Y = j)$ is the probability that the receiving agent receives symbol $j$. The input distribution of $X$ is determined by the transmission fidelity of the transmitting agent; in particular, we write $x := P(X = 0)$, so that $P(X = 1) = 1 - x$. The output distribution of $Y$ is determined by the (assumed to be well-defined) conditional distribution between $X$ and $Y$ together with the input distribution; thus, $q_j = \sum_i P(Y = j \mid X = i)\, p_i$. The approach presented in this paper follows from [2,5,6,7].
The conditional probabilities of the DMC are given by a $2 \times 2$ channel matrix whose rows are $(a, 1 - a)$ and $(b, 1 - b)$; that is, $a$ is the probability that a transmitted 0 is received as a 0, and $b$ is the probability that a transmitted 1 is received as a 0. (Please keep in mind the swapping of the indices and, as we had above for the output probabilities, the notational conventions. Furthermore, the convention is that a conditional probability is fixed for all inputs, even if an input probability is 0; in the next footnote, we address the impact of this with respect to (w.r.t.) information theory.) Note that each row of a channel matrix consists of non-negative numbers that sum to 1.
Before we continue with the mathematics, let us put this research into some more perspective. Von Neumann's [8] seminal work had no concept of "Teamwork", which is at the core of what we are discussing. Sliwa's [9] review suggests that minimal communication channels become more important when context is understood during teamwork, a suggestion opposite to the direction of our work in this article and one that we hope to test in the future. Lawless [10] suggests that maximized channels become more important when Teams confront uncertainty in their environment. Schölkopf et al. [11] suggest that i.i.d. data are insufficient to reconstruct whatever social event is being captured, that something is missing, and that a new approach must be developed; this is our goal in this article. Our results will be discussed in situ for maximum effect.
1.1. Entropy and Mutual Information
We extend our random variables to allow more than two possible outcomes and give the following definitions in as much generality as possible. We now have $n$ possible inputs and $m$ possible outputs.
Given a discrete random variable $V$, we define the entropy of $V$ as
$$H(V) = -\sum_{v} P(V = v)\,\log P(V = v).$$
(By convention, log is the base-2 logarithm and ln is the natural logarithm. Furthermore, we are able to extend the definitions (p. 19 [4]), as is standard, so that $0 \log 0 = 0$. These conventions allow the most general derivation of (8) from (7).)
If $z \in [0, 1]$, then we define the binary entropy function of $z$ as
$$h(z) := -z \log z - (1 - z)\log(1 - z).$$
Note that if $B$ is a binary random variable taking the values 0 or 1, then $H(B)$ is the binary entropy function evaluated at $P(B = 0)$. In fact, we simplify the notation for the probability of an event such as $B = i$, and, when it is clear which distribution we are using, we simplify further and suppress the explicit distribution. Thus,
$$H(B) = h\big(P(B = 0)\big) = h\big(P(B = 1)\big).$$
Given two discrete random variables $V$ and $W$, we define [2] the conditional entropy of $V$ given $W$ as
$$H(V \mid W) = -\sum_{w} P(W = w) \sum_{v} P(V = v \mid W = w)\,\log P(V = v \mid W = w),$$
where, as in the $2 \times 2$ case, the conditional probabilities form the channel matrix. (Of course, as in the $2 \times 2$ case, conditional probability is only defined when the conditioning event has positive probability. However, as we note below, such terms are dealt with by using the limiting value of the constant conditional probability term, which makes our mutual information calculations consistent, keeping in mind that $0 \log 0$ is always taken to be 0. Furthermore, keep in mind that a distribution that achieves capacity for a 2-input channel (the subject of this paper) never has either probability value equal to zero (Ref. [12] gives better bounds). There are, however, channels with more inputs for which this does not hold, for example channels whose optimizing input distribution assigns probability zero to some input.)
We define the mutual information between $V$ and $W$ by [2]
$$I(V; W) = H(V) - H(V \mid W).$$
Using (5) and (7), and some substitutions [4] (again, division by 0 is taken care of in the usual way by using limiting values ([Section 2.3] [4])), we find that the mutual information can also be written as
$$I(V; W) = \sum_{v, w} P(V = v, W = w)\,\log \frac{P(V = v, W = w)}{P(V = v)\,P(W = w)}.$$
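To make the preceding definitions concrete, here is a minimal numerical sketch in Python with NumPy. The function names, the row-stochastic matrix convention, and the example values are our own illustrative choices, not notation from the article; the binary channel uses the parameterization with rows (a, 1-a) and (b, 1-b) discussed above. It computes mutual information via I(X;Y) = H(Y) - H(Y|X), with the 0·log 0 = 0 convention.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution; 0*log(0) is taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, Q):
    """I(X;Y) in bits, where x is the input distribution and Q is the
    channel matrix with Q[i, j] = P(Y = j | X = i) (rows sum to 1)."""
    x = np.asarray(x, dtype=float)
    Q = np.asarray(Q, dtype=float)
    h_y = entropy(x @ Q)                       # entropy of the output distribution
    h_y_given_x = float(sum(xi * entropy(row) for xi, row in zip(x, Q)))
    return h_y - h_y_given_x

if __name__ == "__main__":
    # A hypothetical binary channel with rows (a, 1-a) and (b, 1-b).
    a, b = 0.9, 0.2
    Q = np.array([[a, 1 - a], [b, 1 - b]])
    print(mutual_information([0.5, 0.5], Q))   # mutual information for a uniform input
```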
We now give Shannon’s definition [
2] of (channel) capacity. It has been well-studied since its inception. We will not delve into the Noisy Coding Theorem, or any of the other results which showcase its importance. Rather, we will assume in this paper that capacity is a standard measure of how much information a channel can transmit in an essentially noise-free manner [
2,
4]. The traditional units of capacity and mutual information are accepted in this article; they are bits per channel usage, which in our scenario is equivalent to bits per
t.
Definition 1.
We consider $W$ to be the input random variable to a DMC with output random variable $V$. The capacity $C$ of the DMC is
$$C := \sup_{P_W} I(W; V).$$
The optimization is taken over all possible distributions of $W$ with its fixed values. The supremum is actually achieved and can be taken as a maximum [2,4]. Note that, when trying to compare the magnitudes of two channel capacities (with the same number of inputs), it suffices to compare the mutual informations for all input distributions; of course, the two channels may have different optimizing distributions. Note the principle (and similar principles) that if the mutual information of the first channel dominates that of the second for every input distribution, and the second channel achieves capacity at some input distribution, then the capacity of the first channel is at least that of the second.
Of course swapping rows, or swapping columns from the channel matrix (
6) is just notational and leaves capacity unchanged. However, we end this subsection with some interesting results in information theory—some obvious, some not so obvious.
Property 1.
Removing a row from the channel matrix (6) never increases the capacity. Proof. Not using a channel input cannot increase mutual information. This is equivalent to using input probability distributions which are always zero for a particular index; therefore, the capacity can never be greater since capacity is the maximum over all input distributions. □
Property 2.
- A— For any input probability distribution, combining two columns of a channel matrix (by adding the two columns to form one column, hence reducing the channel matrix from $n \times m$ to $n \times (m - 1)$, as illustrated below) will never increase the mutual information.
- B— For input probability distributions with all terms non-zero, the mutual information stays the same iff one of the combined columns is a multiple of the other. Otherwise, the uncombined channel has a larger mutual information and hence a larger capacity. (Note that, for a 2-input channel, [12] has shown that the capacity-achieving distribution has both probabilities in a closed interval bounded away from 0 and 1, so we can apply this property to the capacity directly.)
Proof. A: The Data-Processing Inequality (Cascade of Channels) [3] shows that the capacity of the combined (third) channel above cannot be greater than that of the original (first) channel; that is, processing one channel into another can never increase the information sent. The actual statement of the inequality is for mutual information. However, if we use the input probability that maximizes the mutual information of the third channel (which gives its capacity), the data-processing inequality tells us that this value is less than or equal to the mutual information of the first channel at that input, which in turn is less than or equal to the first channel's capacity. This argument holds for any initial channel matrix (with adjustments to the second, combining matrix), not just the matrix, or the columns, we chose for simplicity above.
B: Without loss of generality (WLOG), combine the first two columns of the $n$ by $m$ channel matrix $Q = (q_{ij})$, where $q_{ij}$ is the probability of output $j$ given input $i$ (note how the indices are reversed as compared to (6)), to make the combined matrix $Q'$. For $Q$, the output symbols are $y_1, \dots, y_m$. For $Q'$, they are the same, but with $y_1$ and $y_2$ replaced by a single combined symbol. For both channels, the input symbols are $x_1, \dots, x_n$, with input probability vector $p = (p_1, \dots, p_n)$. Therefore,
$$P(Y = y_1) = \sum_i p_i q_{i1} \quad \text{and} \quad P(Y = y_2) = \sum_i p_i q_{i2}.$$
If either of these last two quantities is 0, WLOG we assume it is the first. This assumption means column 1 of $Q$ must be a 0 column (since the input probabilities are positive), so it contributes 0 to the mutual information. Therefore, the mutual informations are equal, and one column is (trivially) a constant multiple of the other. Now that we have dealt with this case, we can assume $P(Y = y_1)$ and $P(Y = y_2)$ are positive for the remainder of this proof. For fixed $p$, the mutual information of an $n$ by $m$ channel is
$$I(p) = \sum_{i=1}^{n} \sum_{j=1}^{m} p_i q_{ij} \log \frac{q_{ij}}{\sum_{k} p_k q_{kj}}.$$
Columns 3 through $m$ of $Q$ and columns 2 through $m - 1$ of $Q'$ are the same, so their contributions to the mutual information are the same. Therefore, we only need to consider columns 1 and 2 of $Q$ and column 1 of $Q'$. Let $I_1$ and $I_2$ denote these columns' contributions to the mutual information of $Q$ and of $Q'$, respectively. The log sum inequality [4] states that, for non-negative numbers $a_k$ and $b_k$ with sums $a$ and $b$, respectively, where $k$ goes from 1 to $K$,
$$\sum_{k=1}^{K} a_k \log \frac{a_k}{b_k} \;\ge\; a \log \frac{a}{b},$$
with equality iff the ratios $a_k / b_k$ are equal for all $k$. Applying this inequality term by term to the contributions above, we have that $I_2 \le I_1$, with equality iff $q_{i1}/q_{i2}$ is the same for all $i$. Since $P(Y = y_1)$ and $P(Y = y_2)$ are nonzero and independent of $i$, this is true iff column 1 of $Q$ is a constant multiple of column 2. In fact, this also shows that $P(Y = y_1)$ is then a constant multiple of $P(Y = y_2)$, regardless of the positive input probabilities. □
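As a quick numerical sanity check on Property 2, the following sketch (the helper functions are our own, mirroring the earlier snippet and repeated here so this runs on its own) compares the mutual information of randomly generated channels before and after combining two columns, and then checks the equality case for proportional columns.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, Q):
    """I(X;Y) = H(Y) - H(Y|X); Q[i, j] = P(Y = j | X = i)."""
    x, Q = np.asarray(x, float), np.asarray(Q, float)
    return entropy(x @ Q) - float(sum(xi * entropy(row) for xi, row in zip(x, Q)))

def combine_columns(Q, j, k):
    """Return Q with columns j and k added together, as in Property 2."""
    Q = np.asarray(Q, float)
    combined = Q[:, j] + Q[:, k]
    rest = np.delete(Q, [j, k], axis=1)
    return np.column_stack([combined, rest])

rng = np.random.default_rng(0)
for _ in range(5):
    Q = rng.random((2, 4))
    Q /= Q.sum(axis=1, keepdims=True)          # make each row sum to 1
    x = rng.random(2)
    x /= x.sum()
    before = mutual_information(x, Q)
    after = mutual_information(x, combine_columns(Q, 0, 1))
    print(f"{before:.4f} >= {after:.4f}: {before >= after - 1e-12}")

# Proportional columns give equality, as Property 2B states.
Qp = np.array([[0.2, 0.1, 0.7], [0.4, 0.2, 0.4]])   # column 0 is twice column 1
x = np.array([0.3, 0.7])
print(np.isclose(mutual_information(x, Qp),
                 mutual_information(x, combine_columns(Qp, 0, 1))))
```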
1.2. Back to Our Binary-Input Binary-Output DMC, the (2,2) Channel
Restating (1) and following the approach of [13], the above expressions simplify for our DMC under investigation. Using (1) and (2), the distribution of $Y$ is determined by the input probability $x = P(X = 0)$ together with the channel parameters $a$ and $b$. We define the differentiable function
$$y(x) := a x + b (1 - x),$$
which gives us $P(Y = 0) = y(x)$ and $P(Y = 1) = 1 - y(x)$. From (5), we have that $H(Y) = h\big(y(x)\big)$ and $H(Y \mid X) = x\, h(a) + (1 - x)\, h(b)$. Putting the above together gives us
$$I(X; Y) = h\big(a x + b(1 - x)\big) - x\, h(a) - (1 - x)\, h(b).$$
Using (9), we have that the capacity of the (2,2) channel is
$$C(a, b) = \max_{x \in [0, 1]} \Big[ h\big(a x + b(1 - x)\big) - x\, h(a) - (1 - x)\, h(b) \Big].$$
So, for the (2,2) channel, the capacity calculation boils down to a (not so simple) calculus problem. Silverman [14] was the first to express the closed-form result (see also [5,13] and ([Equation (5)] [7]) for derivations and alternate expressions), which is a continuous function on the unit square $0 \le a, b \le 1$. It is trivial to show that capacity is continuous on the unit square away from the main diagonal $a = b$. However, proving continuity on the entire unit square requires some work and uses the fact that (15) is continuous in $a$, $b$, and $x$ (see [Section 2.4] [15]).
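Since the closed form is somewhat involved, in practice C(a, b) can also be obtained numerically by maximizing the mutual information over the single parameter x = P(X = 0). The sketch below uses a simple grid search with refinement; this is our own illustrative routine, not the article's method, and it assumes the (a, b) parameterization described above.

```python
import numpy as np

def h(z):
    """Binary entropy function in bits, with h(0) = h(1) = 0."""
    if z <= 0.0 or z >= 1.0:
        return 0.0
    return -z * np.log2(z) - (1 - z) * np.log2(1 - z)

def mi_22(x, a, b):
    """Mutual information of the (2,2) channel with rows (a, 1-a) and (b, 1-b)."""
    return h(a * x + b * (1 - x)) - x * h(a) - (1 - x) * h(b)

def capacity_22(a, b, grid=2001, refinements=3):
    """Capacity C(a, b): maximize mi_22 over x in [0, 1] by nested grid search."""
    lo, hi = 0.0, 1.0
    for _ in range(refinements):
        xs = np.linspace(lo, hi, grid)
        vals = [mi_22(x, a, b) for x in xs]
        i = int(np.argmax(vals))
        lo, hi = xs[max(i - 1, 0)], xs[min(i + 1, grid - 1)]
    return max(vals)

print(capacity_22(1.0, 0.0))   # noiseless channel: capacity 1
print(capacity_22(0.5, 0.5))   # a = b: zero-capacity channel
print(capacity_22(0.9, 0.1))   # binary symmetric channel: 1 - h(0.1)
```

The grid search is adequate here because the mutual information is concave in x for a fixed channel matrix.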
One can easily show (see Figure 3), by simple algebraic substitution, that $C(a, b) = C(b, a)$ and that $C(a, b) = C(1 - a, 1 - b)$. The first identity is equivalent to capacity being symmetric across the line $b = a$, and together the two identities imply that capacity is also symmetric across the line $b = 1 - a$ (simple geometry proves this). This result is illustrated in Figure 3. Thus, capacity has a quadrant of the unit square as its principal domain (see ([Figure 1] [14])).
1.3. Power/Fidelity Constraints of the Transmitting Agent
We consider the situation where we attempt to increase the capacity by adjusting the terms $a$ and $b$. Ideas like this for a Team's interdependence, with a different measurement and no mention of information theory, were discussed in [1]. However, the values of $a$ and $b$ are a function of the transmitting environment between the two agents. If the agents were all-powerful, they could simply adjust $a$ to be 1 and $b$ to be 0 (or vice versa) to achieve a channel of maximal capacity 1.
1.3.1. Positive Channels
Let us start by considering positive channels [6], that is, channels with $a > b$. Note that if $a < b$, we have a negative channel, and if $a = b$, we have a 0-capacity channel. Of course, no matter what, $0 \le C(a, b) \le 1$. However, if we are at a point $(a, b)$, is it better to increase $a$, decrease $b$, or some combination thereof? Implicit in this question is that we stay in the domain of positive channels (under the line $b = a$).
Definition 2.
We say that we have a power constraint P when we are at the channel given by $(a, b)$ and the most we can adjust the channel is to a channel $(a', b')$ such that the standard Euclidean distance (the $\ell_2$ norm) between $(a, b)$ and $(a', b')$ is no more than P.
In terms of Information Geometry [1], our distance is obtained from the (flat) Riemannian metric
$$ds^2 = da^2 + db^2.$$
Of course, we can generalize this to a more general metric of the form
$$ds^2 = g_{11}\, da^2 + 2 g_{12}\, da\, db + g_{22}\, db^2,$$
which would put us in a non-Euclidean situation. This non-trivial situation may be necessary if $a$ and $b$ relate differently to various transmission characteristics.
It is shown in ([Theorem 4.9] [6]) that, if we restrict ourselves to positive channels, the capacity increases as $a$ increases and decreases as $b$ increases. This result makes physical sense in terms of adding or removing noise. Now consider the (closed) disk of radius $r$ about the point $(a, b)$, which we denote $D_r(a, b)$. We assume that $r$ is small enough that $D_r(a, b)$ is composed only of positive channels.
Example 1.
We illustrate this situation in Figure 4 by the channels that are in the disk of radius 0.15 about the point (0.6, 0.2).
Theorem 1.
Given a closed disk consisting of positive channels, the maximum capacity is achieved and occurs on the boundary circle.
Proof. Since capacity is a continuous function on the compact disk, it has a maximum. Assume that the maximum is achieved at an interior point. By ([Theorem 4.9] [6]), we know that increasing $a$ (which we can do while remaining in the disk) increases capacity, which contradicts the maximum being achieved at an interior point. □
We note that the above theorem still holds for non-positive channels by a simple adjustment of the proof.
Example 1 is illustrated in Figure 4 and is examined again in Figure 5 and Figure 6, where we can see the level sets of capacity and the surface plot of capacity. Furthermore, numerical calculations show that the maximum of capacity for the closed disk is obtained on the boundary circle and has a value of 0.32.
Of course, as the center of the disk and the radius vary, so does the relative position of the point on the circle at which capacity is achieved. What is interesting is that it is not obvious where this point should be. We will explain this further. For a positive channel, increasing $a$ brings increased capacity, and decreasing $b$ also results in increased capacity. So, considering our example of the disk centered at $(0.6, 0.2)$ with radius 0.15, one might think that the critical point is where $b$ is decreased by the same amount that $a$ is increased, this being the point on the boundary circle at $7\pi/4 \approx 5.50$ radians, which only gives us a capacity of 0.31. However, numerical methods tell us that the actual maximum occurs at 5.25 radians with a value, as noted, of 0.32. Of course, for this example, the difference is not much, but the size of the difference is relative to the size of the disk. What is important is that the actual critical point depends on the disk's position relative to the two lines $b = a$ and $b = 1 - a$. We do note that when the disk is centered on the line $b = 1 - a$, then $7\pi/4$ radians is the correct position for the critical point. One can also see this by examining the capacity level sets in Figure 5.
Of course, we are using an $\ell_2$ (Euclidean) metric, whose metric ball is a disk. If, for example, we used an $\ell_1$ metric, the ball would be a square rotated by 45 degrees.
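The boundary search described above is easy to carry out numerically. The sketch below (our own illustrative code; the capacity routine is the numerical one from the earlier snippet, repeated so this runs on its own) scans the circle of radius 0.15 around (0.6, 0.2) and reports the angle with the largest capacity; it should approximately reproduce the values reported in the text (a maximum near 5.25 radians versus the naive guess at 7π/4).

```python
import numpy as np

def h(z):
    """Vectorized binary entropy in bits; inputs clipped away from 0 and 1."""
    z = np.clip(np.asarray(z, dtype=float), 1e-12, 1 - 1e-12)
    return -z * np.log2(z) - (1 - z) * np.log2(1 - z)

def capacity_22(a, b, grid=2001):
    """Numerical capacity of the (2,2) channel with rows (a, 1-a) and (b, 1-b)."""
    x = np.linspace(0.0, 1.0, grid)
    return float(np.max(h(a * x + b * (1 - x)) - x * h(a) - (1 - x) * h(b)))

center, r = np.array([0.6, 0.2]), 0.15
thetas = np.linspace(0.0, 2 * np.pi, 720, endpoint=False)
caps = []
for t in thetas:
    a, b = center + r * np.array([np.cos(t), np.sin(t)])
    caps.append(capacity_22(a, b))

best = int(np.argmax(caps))
print(f"max capacity {caps[best]:.3f} at angle {thetas[best]:.2f} rad")

naive = center + r * np.array([np.cos(7 * np.pi / 4), np.sin(7 * np.pi / 4)])
print(f"capacity at the naive 7*pi/4 point: {capacity_22(*naive):.3f}")
```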
1.3.2. Power
We assume that the transmitting agent has adjustable power $P$. This power allows the transmission capabilities of the agent to vary. By way of example, say that the agent transmits with fidelity $(a, b)$. Now, the agent is given an increase in its transmitting power that allows it to change $(a, b)$ to $(a', b')$ such that the "distance" between the two points is less than $P$. Consider that we use the Euclidean norm and set $P = r$. This tells us that all such points are in the disk of radius $r$ about the center $(a, b)$. We note that this is a rudimentary concept of power. Power helps a transmission when we are restricted to the bottom quarter of this disk, where $a$ is increasing (giving more transmission fidelity) and $b$ is decreasing (also more transmission fidelity; recall that $b$ gives us the probability of a transmitted 1 going to the opposite symbol 0). However, the conclusion is still the same point that we made and illustrated above.
1.3.3. Results and Discussion
We end this section with a brief summary. We have discussed how one agent can pass Shannon information to another and how changing the transmission characteristics can increase or decrease this information transfer. We have used capacity as our measure of information transfer, and we have proven some information-theoretic properties for the reader (Properties 1 and 2). Let us now progress to multiple agents.
In the situation that we discussed in this section, where there is one transmitting agent and one receiving agent, we denote the channel as the 1-agent channel, given by the channel matrix described earlier (4). (In this article, we freely identify a channel with its matrix. Furthermore, for a 1-agent channel, we identify the channel with the ordered 2-tuple $(a, b)$ also.) We denote that channel capacity as $C_1$, which we have analyzed as $C(a, b)$ in this section.
2. Two Transmitting Agents
Say we have two transmitting agents acting independently with respect to each other. Assume they have the same transmitting characteristics; that is, their channel matrices are the same. The receiving agent gets symbols from both transmitting agents. How does this impact the information flow to the receiver?
In our scenario, the two transmitting agents both sense the same environment. That is, they both wish to send a 0 or they both wish to send a 1. So, as before, the possible inputs are 0 or 1, but the outputs are ordered pairs of received symbols, since we are assuming that the noise affects each transmitting agent independently. Keep in mind that both transmitting agents are attempting to transmit the same symbol.
The output that the receiving agent uses is given by the random variable $Y$. Ordering the outputs as $(0,0), (0,1), (1,0), (1,1)$, our channel matrix is $2 \times 4$ and is
$$\begin{pmatrix} a^2 & a(1 - a) & (1 - a)a & (1 - a)^2 \\ b^2 & b(1 - b) & (1 - b)b & (1 - b)^2 \end{pmatrix}.$$
We note that the second and third columns of the above channel matrix are identical. This has implications for the mutual information and, of course, the capacity of the channel.
Let us look at this in more generality. Say we have two channel matrices: one in which the two identical middle columns are kept separate, and one in which they are combined into a single column. Both channels have the same input random variable $X$ as above, and their output random variables are $Y_1$ and $Y_2$, respectively.
Consider the first channel. Its output probability values are determined by the input distribution and its channel matrix. We expand its mutual information into the sum of two functions: the first from the first and last columns, and the second from the middle column(s). As above, we express the mutual information of the second channel in the same way. A quick inspection tells us that the corresponding functions are equal; thus, the mutual information of both channels is the same. This result is not surprising, because if we combine output symbols whose columns of the channel matrix are identical, we lose nothing as far as the output information is concerned; there is no extra value in looking at those output symbols separately. This makes sense, and it is also what our mathematics has shown.
Let us keep in mind that we wish to find $C_2$, the capacity of the Shannon channel when there are two transmitting agents. (To keep our notation consistent, $C_1$ is the capacity given by the corresponding 1-agent channel matrix as in (4), whereas $C_2$ is the capacity of the 2-agent channel given above.)
Theorem 2.
$C_2 \ge C_1$.
Proof. The 2-agent channel has four output symbols, which are in essence 2-vectors. Ignore the second component of each vector; that is, collapse the output symbols whose first component is 0 into the single output 0, and those whose first component is 1 into the single output 1. This results in the 1-agent channel $(a, b)$, and since using more output symbols never lowers capacity, by Property 2 (also, a code that works for the 1-agent channel works for the 2-agent channel as well by collapsing the symbols), we are done. (Later in the paper, we do better than this result with Corollary 1 to Theorem 6.) □
We now form another channel related to what we discussed above. Say now that the receiving agent receives the symbols without any order. Therefore, instead of a 2-vector, the output is one of the three multisets $\{0,0\}$, $\{0,1\}$, $\{1,1\}$. We call this the multiset channel, and its channel matrix is
$$\begin{pmatrix} a^2 & 2a(1 - a) & (1 - a)^2 \\ b^2 & 2b(1 - b) & (1 - b)^2 \end{pmatrix}.$$
From what we discussed above about combining identical columns, we see that the capacity of the multiset channel is also $C_2$.
Let us examine the bound in Theorem 2 above. We will see that, not surprisingly, except for special cases, $C_2 > C_1$. Figure 7 is a plot of $C_2 - C_1$ as a function of $(a, b)$.
From Figure 7, we see that, except for the line $a = b$ (where both channels have 0 capacity) and the points $(a, b) = (1, 0)$ or $(0, 1)$ (where both channels have capacity 1), we have $C_2 > C_1$. We note that for the 2-agent channel and the other higher-dimensional channels that we will discuss, there is, to our knowledge, no closed form as there is for $C_1$. Therefore, for our calculations of capacity, we rely upon numerical results from the Blahut-Arimoto capacity algorithm [16,17].
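For reference, here is a compact sketch of the Blahut-Arimoto iteration [16,17] in its standard form (written in Python/NumPy; the helper names and example values are ours), together with the collapsed two-agent channel described above, so that C_2 can be compared numerically with C_1.

```python
import numpy as np

def _kl_rows_bits(Q, q):
    """Row-wise KL divergence D(Q_i || q) in bits, with 0*log(0) taken as 0."""
    out = np.zeros(Q.shape[0])
    for i, row in enumerate(Q):
        nz = row > 0
        out[i] = np.sum(row[nz] * np.log2(row[nz] / q[nz]))
    return out

def blahut_arimoto(Q, iters=1000):
    """Capacity (in bits) of a DMC with channel matrix Q[i, j] = P(Y = j | X = i)."""
    Q = np.asarray(Q, dtype=float)
    p = np.full(Q.shape[0], 1.0 / Q.shape[0])          # start from the uniform input
    for _ in range(iters):
        w = p * 2.0 ** _kl_rows_bits(Q, p @ Q)          # multiplicative update
        p = w / w.sum()
    return float(np.dot(p, _kl_rows_bits(Q, p @ Q)))    # I(X;Y) at the final input

def one_agent(a, b):
    return np.array([[a, 1 - a], [b, 1 - b]])

def two_agents_collapsed(a, b):
    """Multiset outputs {0,0}, {0,1}, {1,1} for two identical, independent agents."""
    return np.array([[a**2, 2 * a * (1 - a), (1 - a)**2],
                     [b**2, 2 * b * (1 - b), (1 - b)**2]])

a, b = 0.8, 0.3
c1 = blahut_arimoto(one_agent(a, b))
c2 = blahut_arimoto(two_agents_collapsed(a, b))
print(f"C_1 = {c1:.4f}, C_2 = {c2:.4f}, C_2 >= C_1: {c2 >= c1 - 1e-9}")
```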
Results and Discussion
In this section, we have laid the groundwork for n transmitting agents. We derived some capacity results. We concentrated on the effects of going from 1 to 2 transmitting agents. What happens as we go to three or more transmitting agents?
3. Multiple Transmitting Agents
We have the canonical representation for the channel of $n$ transmitting agents, and we denote its canonical channel matrix accordingly; it is formed by taking each output of the $(n - 1)$-agent channel and appending a 0 or a 1 to it. (Note that, due to the simplicity of the construction for "small" channels, this agrees with the channels we have already constructed.) For $n = 3$, this comes from taking the output for two agents, as given in canonical form by (21), and extending each output symbol by one more received bit, giving a $2 \times 8$ channel matrix.
Theorem 4.
Rearranging outputs/columns of a channel matrix does not affect capacity.
Proof. By looking at the expression for mutual information, we see that reordering its summands leaves it unchanged. The result follows, since capacity is the maximum of mutual information. □
Therefore, we can permute the columns of the canonical matrix and obtain a new matrix, which has the same capacity, in which columns with identical conditional probabilities are adjacent. Look at the above theorem in terms of the columns of this rearranged matrix, using the three-agent channel as an example: collapsing the output in this situation is equivalent to interchanging the 4th and 5th columns (which does not change capacity) and then combining identical columns.
As above, when we looked at the two-agent channel and its multiset version, we see that we may form the channel in which we identify output symbols that have the same conditional probabilities for both inputs. This gives us a collapsed channel with $n + 1$ outputs, indexed by the number of 0s received, whose conditional probabilities are binomial:
$$P(k \text{ zeros received} \mid 0 \text{ sent}) = \binom{n}{k} a^{k} (1 - a)^{n - k}, \qquad
P(k \text{ zeros received} \mid 1 \text{ sent}) = \binom{n}{k} b^{k} (1 - b)^{n - k}.$$
Theorem 5.
The capacity of this collapsed channel equals $C_n$, the capacity of the canonical $n$-agent channel.
Proof. As above for the two-agent case in Theorem 3, or we can just use Property 2 repeatedly. □
The reason we introduce the collapsed channel is that it is a cleaner way to express the channel, and the calculations are simpler than those for the canonical channel. For example, the collapsed channel is given by a $2 \times (n + 1)$ matrix, whereas the canonical channel is given by a $2 \times 2^n$ matrix. This obviously makes the coding issues easier. Now we examine Figure 8, which plots the difference between the capacity with more transmitting agents and $C_1$.
When we compare Figure 8 to Figure 7, we easily see that the difference $C_n - C_1$ grows as $n$ grows, except at the endpoints and on the line $a = b$ (where it stays at 0).
Nota Bene: We now look at the prior illustrative results in terms of a more general, encompassing theory. We included much of Section 2 so that the reader who is not familiar with some of the "tricks" will have a feel for why the more general results hold.
Theorem 6.
$C_{n+1} \ge C_n$ for any positive integer $n$.
Proof. (The proof is the same as the one above for going from one to two agents.) The $n$-agent channel can be obtained from the $(n+1)$-agent channel by combining certain columns together; the result follows from Property 2. □
Corollary 1.
$C_{n+1} > C_n$, except at $(a, b) = (1, 0)$ and $(0, 1)$, where both capacities equal 1, and on the line $a = b$, where both capacities equal 0.
Proof. We show the proof in three steps.
If $a = b$, the rows of the channel matrix are identical. In this case, it is trivial to show that the mutual information is 0 for every input distribution (the output gives no information about the channel input); one can see this from the fact that the output distribution does not depend on the input. In short, the capacities are equal (both are 0).
If $(a, b) = (1, 0)$ or $(0, 1)$, both the $n$-agent and the $(n+1)$-agent channel matrices are the identity matrix with zero columns added in; hence, both have capacity 1. In short, the channel capacities are equal.
Now, excluding the special cases $a = b$, $(1, 0)$, and $(0, 1)$, by Property 2 we only have to show that there are two combined columns that are not multiples of each other. By excluding the special cases, $a$ and $b$ cannot both be endpoints of the unit interval; therefore, at least one of them lies in $(0, 1)$ and, WLOG, we assume that $a \in (0, 1)$.
Consider a generic column of the $n$-agent matrix. By construction, the $(n+1)$-agent matrix has two columns (obtained by appending a received 0 or a received 1) that, when combined, result in that column. If one of these two columns is not a constant multiple of the other, we will have shown that $C_{n+1} > C_n$. Assume the opposite; since neither $a$ nor $1 - a$ is 0, the assumed proportionality forces the ratio $a/(1 - a)$ to equal the corresponding ratio for $b$. We now have three cases for $b$.
$b = 0$: in this case, we only look at the last column of the $n$-agent matrix, the only one with a nonzero entry in its second row. Of the two columns that combine into it, one has a zero in its second row and the other does not, while both have positive entries in the first row, so they cannot be multiples of each other, which is impossible under our assumption.
$b = 1$: using the same argument as above, just replace the last column of the $n$-agent matrix with the first. So again, it is impossible that the columns are multiples.
$b \in (0, 1)$: as above for $a$, the ratio $b/(1 - b)$ is defined, and the proportionality forces $a/(1 - a) = b/(1 - b)$, that is, $a = b$, which has been ruled out.
Thus, we have shown the existence of two columns of the $(n+1)$-agent matrix that are not multiples of each other and that combine into a column of the $n$-agent matrix. □
Theorem 7.
$\lim_{n \to \infty} C_n = 1$, except when $a = b$, in which case the channel capacity is 0 for every $n$.
Proof. WLOG, we assume $a > b$. We can do this because of the constraint $a \ne b$ and the fact that the rows of a channel matrix can be interchanged without affecting its capacity. Fix $\delta := (a - b)/3 > 0$, so that the closed interval $[\,b + \delta,\; a - \delta\,]$ is a non-trivial sub-interval of $(b, a)$. For every sufficiently large $n$, we can choose an integer $m$ such that $b + \delta \le m/n \le a - \delta$; for example, take $m$ to be the largest integer with $m/n \le a - \delta$ (the padding by $\delta$ prevents $m/n$ from converging to $a$ or to $b$). Keep in mind two characteristics of $m$ as a function of $n$: first, $m/n$ stays in the fixed interval $[\,b + \delta,\; a - \delta\,]$; second, both $na - m$ and $m - nb$ grow at least linearly in $n$.
Let the modified channel be the collapsed $n$-agent channel matrix, but modified as follows: all outputs with $i \ge m$, where $i$ is the number of 0s received, are combined into a single output symbol, and all of the other outputs are combined into a second output symbol. The channel matrix then looks like this:
$$\begin{pmatrix} P(S_a \ge m) & P(S_a < m) \\ P(S_b \ge m) & P(S_b < m) \end{pmatrix},$$
where $S_a$ and $S_b$ are binomial random variables giving the number of successes in $n$ Bernoulli trials with success probabilities $a$ and $b$, respectively (conditioned on a 0 being sent, the number of 0s received is $S_a$; conditioned on a 1 being sent, it is $S_b$).
If we let $\Phi$ be the cumulative standard normal distribution function, the De Moivre-Laplace limit theorem [18] states that (taking the thresholds as integers)
$$P(S_a \ge m) \approx 1 - \Phi\!\left(\frac{m - na}{\sqrt{n a (1 - a)}}\right)
\quad \text{and} \quad
P(S_b \ge m) \approx 1 - \Phi\!\left(\frac{m - nb}{\sqrt{n b (1 - b)}}\right),$$
with the approximations becoming exact as $n \to \infty$. (If $a = 1$ or $b = 0$, the corresponding entry is exactly 1 or 0 and no approximation is needed.)
Since $a$ and $1 - a$ are positive and $m \le (a - \delta)n$, the argument $(m - na)/\sqrt{n a (1 - a)}$ is negative and tends to $-\infty$, giving
$$P(S_a \ge m) \to 1.$$
For the second row, since $m \ge (b + \delta)n$, the argument $(m - nb)/\sqrt{n b (1 - b)}$ is positive, giving
$$P(S_b \ge m) \to 0.$$
(Even though $m$ changes as $n$ changes, the value of $(m - nb)/\sqrt{n b (1 - b)}$ remains greater than or equal to $\delta \sqrt{n / (b(1 - b))}$, which approaches $\infty$ as $n$ does. The same logic can also be used for the negative case.)
Thus, the modified channel matrix converges to
$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$
which has a channel capacity of 1; by the continuity of capacity on the unit square, the capacity of the modified channel therefore converges to 1. Since the modified channel was formed by combining the outputs of the $n$-agent channel, its capacity is at most $C_n \le 1$. Therefore, by the squeeze theorem, $\lim_{n \to \infty} C_n = 1$. □
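A quick numerical illustration of Theorem 7: building the collapsed n-agent (binomial) channel for growing n and running a capacity routine on it shows the capacity climbing toward 1. The Blahut-Arimoto helper is the same sketch given earlier (repeated so this snippet is self-contained), and the binomial parameterization assumes, as reconstructed above, that a and b are the probabilities of receiving a 0 given that a 0 or a 1 was sent; the values of a and b below are our own illustrative choices.

```python
import numpy as np
from math import comb

def _kl_rows_bits(Q, q):
    out = np.zeros(Q.shape[0])
    for i, row in enumerate(Q):
        nz = row > 0
        out[i] = np.sum(row[nz] * np.log2(row[nz] / q[nz]))
    return out

def blahut_arimoto(Q, iters=1000):
    Q = np.asarray(Q, dtype=float)
    p = np.full(Q.shape[0], 1.0 / Q.shape[0])
    for _ in range(iters):
        w = p * 2.0 ** _kl_rows_bits(Q, p @ Q)
        p = w / w.sum()
    return float(np.dot(p, _kl_rows_bits(Q, p @ Q)))

def collapsed_n_agents(a, b, n):
    """2 x (n+1) channel: column k is the probability of receiving exactly k zeros."""
    row0 = [comb(n, k) * a**k * (1 - a)**(n - k) for k in range(n + 1)]
    row1 = [comb(n, k) * b**k * (1 - b)**(n - k) for k in range(n + 1)]
    return np.array([row0, row1])

a, b = 0.7, 0.4                       # a modest, noisy hypothetical channel
for n in (1, 2, 5, 10, 25, 50):
    print(n, round(blahut_arimoto(collapsed_n_agents(a, b, n)), 4))
```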
Results and Discussion
The theorems presented in this section show what happens as the number of transmitters grows. The ultimate result of this section was Theorem 7, which used a rather non-trivial application of the Central Limit Theorem. At this point, we have established the seemingly obvious but difficult result that, as the number of transmitting agents grows, so does the reliability of the channel in terms of its capacity. This result, of course, is in line with the similar result that, if we have a code that consists of repeating a symbol many times, the error rate is small (the transmission rate may be low, but that penalty does not apply to our agent examples).
4. Non-Identical Transmitting Agents
In a shift, say we start with only two transmitting agents, but their noise characteristics are different. Of course, keep in mind that in this situation we have assumed that there is a master transmitter that uses the transmitting agents to communicate the input X to the receiver. The master transmitter picks the input symbols, and the transmitting agents do their best to communicate by forming one encompassing Shannon channel. We have shown above that, if all of the agents share the same values of $(a, b)$, the channel capacity increases as the number of agents increases. However, what happens if the values of $(a, b)$ are different for the various agents? Are we better off only using a subset of the agents, or is it still best to use as many agents as possible? We partially answer those questions below.
Let $(a_1, b_1)$ be the channel matrix for agent 1, and $(a_2, b_2)$ be the channel matrix for agent 2. The output is ordered so that the receiving agent lists agent 1's received symbol first, then agent 2's. If the agents wish to send a signal of 0, the possible outputs $(0,0), (0,1), (1,0), (1,1)$ have probabilities
$$a_1 a_2, \quad a_1 (1 - a_2), \quad (1 - a_1) a_2, \quad (1 - a_1)(1 - a_2).$$
If the agents wish to send a signal of 1 instead, we have
$$b_1 b_2, \quad b_1 (1 - b_2), \quad (1 - b_1) b_2, \quad (1 - b_1)(1 - b_2).$$
This gives us a combined channel matrix for the two transmitting agents:
$$\begin{pmatrix}
a_1 a_2 & a_1 (1 - a_2) & (1 - a_1) a_2 & (1 - a_1)(1 - a_2) \\
b_1 b_2 & b_1 (1 - b_2) & (1 - b_1) b_2 & (1 - b_1)(1 - b_2)
\end{pmatrix}.$$
We use our own notation to express the above channel as the tensor product $(a_1, b_1) \otimes (a_2, b_2)$.
We know, by Property 2, that collapsing output symbols does not increase capacity. If we collapse $(0,0)$ and $(0,1)$ into the single output 0, and $(1,0)$ and $(1,1)$ into the single output 1, we recover the channel matrix of agent 1:
$$\begin{pmatrix} a_1 & 1 - a_1 \\ b_1 & 1 - b_1 \end{pmatrix}.$$
Thus, $C\big((a_1, b_1) \otimes (a_2, b_2)\big) \ge C(a_1, b_1)$.
Now let us combine the first and third outputs of the combined channel into the single output 0, and the second and fourth outputs into the single output 1. This gives us the channel matrix of agent 2:
$$\begin{pmatrix} a_2 & 1 - a_2 \\ b_2 & 1 - b_2 \end{pmatrix}.$$
Thus, $C\big((a_1, b_1) \otimes (a_2, b_2)\big) \ge C(a_2, b_2)$. This result leads us to the next theorem:
Theorem 8.
As the number of agents increases, no matter whether they have different channel noises, the total channel capacity is non-decreasing.
Proof. In the above discussion, we have shown that
$$C\big((a_1, b_1) \otimes (a_2, b_2)\big) \ge \max\big(C(a_1, b_1),\; C(a_2, b_2)\big).$$
Therefore, by repeating the same argument, we see that as we add extra agents the capacity can never decrease. □
In fact, as before when the agents had identical characteristics, the channel capacity, except for special cases (dependent columns, a capacity of 0 or 1, etc.), is strictly greater than that of the individual agents. One can see this by examining the channel matrix: if unpacking the outputs reveals different statistics, extra information is learned. Let us now look at the special case of combining a channel with a 0-capacity channel.
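The tensor-product construction above amounts to taking the Kronecker product of the two agents' channel matrices row by row. The sketch below (using the same Blahut-Arimoto helper as before, repeated for self-containment, and two hypothetical agents of our own choosing) builds the combined channel and checks numerically that its capacity is at least the larger individual capacity, as Theorem 8 implies.

```python
import numpy as np

def _kl_rows_bits(Q, q):
    out = np.zeros(Q.shape[0])
    for i, row in enumerate(Q):
        nz = row > 0
        out[i] = np.sum(row[nz] * np.log2(row[nz] / q[nz]))
    return out

def blahut_arimoto(Q, iters=1000):
    Q = np.asarray(Q, dtype=float)
    p = np.full(Q.shape[0], 1.0 / Q.shape[0])
    for _ in range(iters):
        w = p * 2.0 ** _kl_rows_bits(Q, p @ Q)
        p = w / w.sum()
    return float(np.dot(p, _kl_rows_bits(Q, p @ Q)))

def channel(a, b):
    return np.array([[a, 1 - a], [b, 1 - b]])

def combine(M1, M2):
    """Row-wise Kronecker product: both agents try to send the same input bit."""
    return np.array([np.kron(M1[i], M2[i]) for i in range(M1.shape[0])])

A = channel(0.9, 0.2)        # a fairly clean hypothetical agent
B = channel(0.7, 0.4)        # a noisier hypothetical agent
cA, cB = blahut_arimoto(A), blahut_arimoto(B)
cAB = blahut_arimoto(combine(A, B))
print(f"C(A) = {cA:.4f}, C(B) = {cB:.4f}, C(A tensor B) = {cAB:.4f}")
print("combined >= max(individual):", cAB >= max(cA, cB) - 1e-9)
```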
Theorem 9.
For any zero-capacity channel given by $(c, c)$, we find that $C\big((a, b) \otimes (c, c)\big) = C(a, b)$.
Proof. If we can show that the mutual information of $(a, b) \otimes (c, c)$ equals the mutual information of $(a, b)$, as given by (15), we are done. The channel matrix for this situation is
$$\begin{pmatrix}
a c & a (1 - c) & (1 - a) c & (1 - a)(1 - c) \\
b c & b (1 - c) & (1 - b) c & (1 - b)(1 - c)
\end{pmatrix}.$$
Let $x := P(X = 0)$. Each output probability of the combined channel is then the corresponding output probability of the $(a, b)$ channel multiplied by either $c$ or $1 - c$, and the same holds for the conditional probabilities appearing in the conditional entropy. Now, again using the fact that the log of a product is the sum of the logs, and grouping like log terms, the terms involving $\log c$ and $\log(1 - c)$ cancel between the output entropy and the conditional entropy, and we see that the mutual information of the combined channel equals that of $(a, b)$. □
Results and Discussion
In this section, we showed what happens when two transmitting agents with different noise characteristics are used. Our important result was that, as the number of agents increases, no matter whether they have different channel noises, the total channel capacity is non-decreasing. As with many of our results, this relied upon the algebra of mutual information giving common-sense answers. However, without the proofs, we would have only intuition to rely upon.
5. Resource Allocation
We now concern ourselves with the physical limitations of the receiving agent. We assume that the receiving agent has a limited resource that it can use to receive messages. To the extent possible, this receiving resource may be measured in terms of various antennas, various allocations of frequencies, etc. It is not our goal in this article to discuss the engineering of the receiving agent in general; rather, we accept it as a given.
Upon completion of the mathematics in this section, the results do not seem surprising. That is good! It shows that our intuition is correct, and it lays a foundation for dealing with many agents and non-linear allocation schemes (where we lose elements of intuition). Furthermore, aside from linearity, we based our allocation scheme on a Euclidean metric; it is not at all clear whether an information-geometric-style Riemannian metric should be used instead. That is beyond the scope of the article.
Let us take the simplest case, where there are two transmitting agents. As before, each agent is given by a channel matrix of the form $(a, b)$. We model noise affecting each channel in a linear manner. Suppose that an agent is given, as before, by its channel matrix $(a, b)$. How does noise, which results from the receiving agent not allocating enough of its resources to that agent, change this channel matrix? The channel $(a, b)$ is a point in the unit square. Consider the shortest path from $(a, b)$ to the main diagonal (which consists of the zero-capacity channels). This path is a straight line segment that is orthogonal to the diagonal line of zero-capacity channels, goes through the point $(a, b)$, and intersects the line of zero-capacity channels at the midpoint $\big(\tfrac{a+b}{2}, \tfrac{a+b}{2}\big)$. The line segment of interest is given parametrically, for $t \in [0, 1]$, as
$$\Big( a + t\,\tfrac{b - a}{2},\;\; b + t\,\tfrac{a - b}{2} \Big).$$
We model noise as moving on this line segment from the point $(a, b)$ to the point $\big(\tfrac{a+b}{2}, \tfrac{a+b}{2}\big)$. No noise corresponds to $t = 0$, and total noise to $t = 1$; that is, we use $t$ as a measure of the noise, normalized in a linear manner between 0 and 1.
EXAMPLE: Consider a channel $(a, b)$ whose capacity is 0.12. If $t = 0$, the channel is unchanged and the capacity is 0.12. If $t = 1$, the channel is moved to the diagonal point $\big(\tfrac{a+b}{2}, \tfrac{a+b}{2}\big)$ and the capacity is 0. If we take a value of $t$ near 1, the resulting channel has a capacity of 0.001.
Now, taking a value of $t$ near 0 instead, the resulting channel has a capacity of 0.10. Note that, unsurprisingly, the cleaner channel has the larger capacity.
What we have been discussing motivates our modeling definition, which follows.
Definition 3.
An agent with channel matrix $(a, b)$ requires the receiving resource $R$ for its channel matrix to be unchanged. If the receiving agent only allocates $\rho \le R$ to the transmitting agent, the channel matrix is modified from $(a, b)$ in the following manner:
$$\Big( a + \big(1 - \tfrac{\rho}{R}\big)\tfrac{b - a}{2},\;\; b + \big(1 - \tfrac{\rho}{R}\big)\tfrac{a - b}{2} \Big).$$
Thus, $\rho = R$ corresponds to $t = 0$ above, and $\rho = 0$ corresponds to $t = 1$ above. As $\rho$ decreases, the channel "travels" the shortest path, in the Euclidean metric, to the line of the 0-capacity channels. This is the essence of our modeling assumption.
Note that a channel is a 0-capacity channel iff $a = b$. However, if we let $\rho = 0$, then the modified channel is $\big(\tfrac{a+b}{2}, \tfrac{a+b}{2}\big)$, which is a 0-capacity channel.
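Definition 3's allocation map is straightforward to code. The sketch below (with illustrative parameter values of our own; the linear map matches the definition above) moves a hypothetical channel toward the zero-capacity diagonal as the allocated fraction shrinks and prints the resulting capacity, which decreases monotonically, in line with Theorem 10 below.

```python
import numpy as np

def h(z):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    return 0.0 if z <= 0 or z >= 1 else -z * np.log2(z) - (1 - z) * np.log2(1 - z)

def capacity_22(a, b, grid=4001):
    """Numerical capacity of the channel with rows (a, 1-a) and (b, 1-b)."""
    xs = np.linspace(0.0, 1.0, grid)
    return max(h(a * x + b * (1 - x)) - x * h(a) - (1 - x) * h(b) for x in xs)

def allocate(a, b, rho, R):
    """Move (a, b) toward the zero-capacity diagonal as rho drops below R."""
    t = 1.0 - rho / R                 # t = 0: unchanged, t = 1: on the diagonal
    mid = (a + b) / 2.0
    return a + t * (mid - a), b + t * (mid - b)

a, b, R = 0.85, 0.25, 1.0             # hypothetical channel and required resource
for rho in (1.0, 0.75, 0.5, 0.25, 0.0):
    aa, bb = allocate(a, b, rho, R)
    print(f"rho = {rho:.2f}  channel = ({aa:.3f}, {bb:.3f})  "
          f"capacity = {capacity_22(aa, bb):.4f}")
```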
Theorem 10.
For a non-zero-capacity channel $(a, b)$, that is, one with $a \ne b$, the capacity of the modified channel decreases as $\rho$ decreases from $R$ to 0.
Proof. If $(a, b)$ is a positive channel, that is, if $a > b$, then the first coordinate of the modified channel decreases and the second coordinate increases as $\rho$ goes from $R$ to 0. This is easily shown with algebra, but even more simply by observation of the line segment. By ([Theorem 4.9] [6]), the capacity therefore decreases. If $(a, b)$ is a negative channel, the symmetry of capacity about the line $b = a$ completes the proof. □
Corollary 2.
If we have a 0-capacity channel $(a, a)$, then the capacity of the modified channel is constant at 0 as $\rho$ decreases.
Proof. Trivial, since the line segment reduces to a single point in this situation. □
5.1. Resource Allocation Amongst Different Transmitters
Assume that there are two transmitting agents: the first with channel matrix $(a_1, b_1)$ and the second with channel matrix $(a_2, b_2)$. The difference from before is that the receiver can only allocate a total resource $R$ to the reception from the two agents and, further, each agent requires the full resource $R$ to prevent degradation of its channel matrix.
If the receiver allocates $\rho$ to the first agent, we have the resulting channel matrix Equation (33) as given above. It then allocates the remainder $R - \rho$ to the second agent, resulting in the corresponding modified channel matrix for that agent. As we have shown in the previous section, the combined channel is the tensor product of the two modified channels.
Consider the situation when all of the resource is allocated to one channel; then, without loss of generality, we let $\rho = R$, giving
$$(a_1, b_1) \otimes \Big(\tfrac{a_2 + b_2}{2},\; \tfrac{a_2 + b_2}{2}\Big).$$
Keep in mind that the above result is the channel matrix obtained when we combine a 0-capacity channel with $(a_1, b_1)$. Intuitively, this should not change the capacity from that of $(a_1, b_1)$. Looking at the channel matrix and thinking in terms of coding, we see that the second agent's noise affects the first and second outputs in the same proportion as the third and fourth. Below, we present the mathematical details.
Theorem 11.
$C\Big((a_1, b_1) \otimes \big(\tfrac{a_2 + b_2}{2}, \tfrac{a_2 + b_2}{2}\big)\Big) = C(a_1, b_1)$.
Proof. Let us calculate the mutual information of the combined channel. We let $x := P(X = 0)$ and $c := \tfrac{a_2 + b_2}{2}$. Each output probability of the combined channel is then the corresponding output probability of the $(a_1, b_1)$ channel multiplied by either $c$ or $1 - c$, and the same holds for the conditional probabilities appearing in the conditional entropy. Again using the rule that the log of a product is the sum of the logs, and grouping like log terms, the terms involving $\log c$ and $\log(1 - c)$ cancel, and the result is the same as the mutual information of $(a_1, b_1)$. Thus, the maximum of the mutual information in both cases remains the same. □
Corollary 3.
$C\Big(\big(\tfrac{a_1 + b_1}{2}, \tfrac{a_1 + b_1}{2}\big) \otimes (a_2, b_2)\Big) = C(a_2, b_2)$.
Proof. If we swap the two transmitting agents, we establish the proof (details are left to the reader). □
Note that any 0-capacity channel is some channel with a 0 resource allocation. Thus:
Corollary 4.
Combining any channel with a 0-capacity channel results in a channel with the same capacity as the original channel.
We arrive at the question at hand: what happens with a partial allocation to each channel? That is, in general, how does the capacity of the combined allocated channel compare to the capacities of the two individual channels? Our answer follows.
Allocate Resources to an Agent and a 0-Capacity Channel
In this situation, we know that a full allocation gives the capacity of the first channel and that a zero allocation gives a capacity of 0. What happens for partial allocations in between? Not surprisingly, we get the following theorem:
Theorem 12.
Through an allocation $\rho$, if we combine the first channel, modified as in Definition 3, with a 0-capacity second channel, we find that the capacity of the combined channel equals the capacity of the modified first channel.
Proof. Trivial from Theorem 9. □
5.2. More Examples
We will find the capacity of the combined allocated channel by using (35) for various allocations and agent matrices.
From these results, we see that the combined capacity can be either larger or smaller than an individual channel's capacity. In fact, equalities are also possible by using the special cases examined at the beginning of this section. However, the combined capacity exceeding both individual capacities is not possible. (We show this by a re-wording and then proving that the combined capacity cannot be larger than both of the individual capacities.) Thus, we need a lemma.
Lemma 1.
For channels $(a_1, b_1)$ and $(a_2, b_2)$, we find that
$$C\big((a_1, b_1) \otimes (a_2, b_2)\big) \le C(a_1, b_1) + C(a_2, b_2),$$
with equality if $C(a_1, b_1) = 0$ or $C(a_2, b_2) = 0$.
Proof. The product channel of $(a_1, b_1)$ and $(a_2, b_2)$ has the four inputs $(0,0), (0,1), (1,0), (1,1)$, the corresponding pairs of outputs, and a channel matrix whose entries are the products of the component conditional probabilities. The capacity of this product channel equals the sum of the capacities of its component channels (p. 85 [5]). Removing the middle two rows (the inputs $(0,1)$ and $(1,0)$) gives us $(a_1, b_1) \otimes (a_2, b_2)$, and, since removing a row never increases capacity, we find that
$$C\big((a_1, b_1) \otimes (a_2, b_2)\big) \le C(a_1, b_1) + C(a_2, b_2). \qquad \square$$
Theorem 13.
If we combine, through an allocation $\rho$, the first channel with the second channel, then the capacity of the combined allocated channel cannot be greater than both of the individual channels' capacities.
Proof. Let $\lambda := \rho / R$, so that $\lambda$ is the fraction of the resource allocated to the first channel and $1 - \lambda$ is the fraction allocated to the second. For any input probability distribution held constant, the mutual information is convex with respect to the elements of the channel matrix ([Theorem 2.7.4] [4]). That is, for any given input probability distribution $x$ and for all $\lambda \in [0, 1]$,
$$I_{\lambda M_1 + (1 - \lambda) M_2}(x) \;\le\; \lambda\, I_{M_1}(x) + (1 - \lambda)\, I_{M_2}(x),$$
where $I_M(x)$ is the mutual information of channel $M$ with input distribution $x$. The allocated first channel is the convex combination, with weight $\lambda$, of $(a_1, b_1)$ and a 0-capacity channel; since the mutual information of a 0-capacity channel always equals 0, we have from convexity that, for any input probability distribution $x$, the mutual information of the allocated first channel is at most $\lambda\, I_{(a_1, b_1)}(x)$. Now, we let $x$ be a capacity-achieving input probability distribution (unique except for 0-channels) for the allocated first channel, giving
$$C(\text{allocated first channel}) \;\le\; \lambda\, C(a_1, b_1).$$
Therefore, by replacing the first channel with the second and repeating the above convexity argument, we find that
$$C(\text{allocated second channel}) \;\le\; (1 - \lambda)\, C(a_2, b_2).$$
By Lemma 1,
$$C(\text{combined allocated channel}) \;\le\; \lambda\, C(a_1, b_1) + (1 - \lambda)\, C(a_2, b_2) \;\le\; \max\big(C(a_1, b_1),\; C(a_2, b_2)\big). \qquad \square$$
Thus, we have shown that the combined allocated capacity never exceeds $\max\big(C(a_1, b_1), C(a_2, b_2)\big)$ and, by using Theorem 11 and Corollary 3, that equality can be obtained by letting $\rho = R$ or $\rho = 0$, the choice depending on the underlying original channels.
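Finally, a numeric check of the allocation trade-off discussed in this subsection: sweeping the allocated fraction from 0 to 1 for two hypothetical agents (parameter values our own; the Blahut-Arimoto helper is the same sketch used earlier, repeated so this runs on its own), the capacity of the combined allocated channel never exceeds the larger of the two individual capacities, and equals one of them at the two extremes, in line with Theorems 11-13.

```python
import numpy as np

def _kl_rows_bits(Q, q):
    out = np.zeros(Q.shape[0])
    for i, row in enumerate(Q):
        nz = row > 0
        out[i] = np.sum(row[nz] * np.log2(row[nz] / q[nz]))
    return out

def blahut_arimoto(Q, iters=1000):
    Q = np.asarray(Q, dtype=float)
    p = np.full(Q.shape[0], 1.0 / Q.shape[0])
    for _ in range(iters):
        w = p * 2.0 ** _kl_rows_bits(Q, p @ Q)
        p = w / w.sum()
    return float(np.dot(p, _kl_rows_bits(Q, p @ Q)))

def allocated(a, b, frac):
    """Channel (a, b) when it receives the fraction `frac` of its required resource."""
    t = 1.0 - frac
    mid = (a + b) / 2.0
    aa, bb = a + t * (mid - a), b + t * (mid - b)
    return np.array([[aa, 1 - aa], [bb, 1 - bb]])

def combine(M1, M2):
    """Row-wise Kronecker product of the two agents' channel matrices."""
    return np.array([np.kron(M1[i], M2[i]) for i in range(M1.shape[0])])

(a1, b1), (a2, b2) = (0.9, 0.2), (0.75, 0.35)      # two hypothetical agents
c1 = blahut_arimoto(allocated(a1, b1, 1.0))
c2 = blahut_arimoto(allocated(a2, b2, 1.0))
print(f"C_1 = {c1:.4f}, C_2 = {c2:.4f}")
for frac in np.linspace(0.0, 1.0, 6):
    c = blahut_arimoto(combine(allocated(a1, b1, frac),
                               allocated(a2, b2, 1.0 - frac)))
    print(f"rho/R = {frac:.1f}: combined capacity = {c:.4f}  "
          f"(<= max of individuals: {c <= max(c1, c2) + 1e-9})")
```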
Results and Discussion
In this section, we showed what happens when we have a limited receiving resource (power) and want to distribute it between two transmitting agents. The theorems of this section capture the physical properties of the power allocation and happily agree with intuition.
6. Conclusions
We considered the use of Shannon information theory, and its various entropic terms, to aid in reaching optimal decisions in a multi-agent/Team scenario. Our metric for agents passing information is the classical Shannon channel capacity. Our results are the mathematical theorems in this article showing how combining agents influences the channel capacity.
We have put the idea of multi-agent communication on a firm information-theoretic foundation. We examined simple scenarios in this paper to lay that foundation. We obtained results that may seem obvious but are quite difficult to prove. We ask the reader to keep in mind that there is a big difference between "it is obvious" and "it has been shown".
From our perspective we have shown that, except for certain boundary cases, one can achieve near perfect transmission of Shannon information, provided one has a large enough number of agents.
We have used maximizing information transfer under a resource (power) allocation as an optimizing criterion. With regard to resource allocation, our results tell us that the best thing to do is to just use the strongest channel. This result is not surprising; however, without the mathematics to prove it, we would be relying on intuition. Furthermore, note that we only used a simple linear allocation scheme, and we only combined two agents. Future work will consider non-linear allocation schemes and multiple agents to continue what we have started in this paper. Going forward, this path is especially meaningful if we adjust the Riemannian metric to influence the power allocated to each channel. For example, a geometric region with high noise levels can be reflected in the Riemannian metric by acknowledging that the terms of the metric are functions of $a$ and $b$. We will explore this direction in future work.
In addition, in future work, we will also consider more than two agents competing for the available resources, non-Euclidean Riemannian metrics, and more complicated signaling alphabets and schemes. We are also interested in information flow in the Vicsek [
19] bird flocking model.
7. Notation
We include some of the notation that is used repeatedly throughout the article. The other notation consists of variants of what we give here, with changes to the indices, and is made clear at its first usage.
MAS | Multi-agent System |
| Agent X |
M | A channel matrix, that is, every row contains non-negative numbers that sum to 1 |
| The 2 × 2^n channel matrix representing n (transmitting) Agents |
H(V) | Entropy of the (discrete) random variable V |
H(V|W) | Conditional entropy of the random variable V conditioned on W |
I(V;W) | Mutual information between the random variables V and W |
C | Capacity of a generic channel |
C_1 | Specifically, the capacity of a 1 (transmitting) agent channel |
(a, b) | A specific 1-agent channel with rows (a, 1-a) and (b, 1-b). Note: C(a, b) := C((a, b)) |
(a_2, b_2) | Another 1-agent channel |
(a_1, b_1) ⊗ (a_2, b_2) | The combined channel of the two transmitting agents, with the channel matrix given in Section 4 |
| The combined power-allocated channel with its channel matrix |
C_n | The capacity of the n-agent channel, formed from the (a, b) channel |