1. Introduction
A rather large number of indexes have been proposed in the literature that supposedly measure the significance of an author’s scientific publications. Among the most popular, we note:
- (i1) the total number of citations of a particular author [1,2,3];
- (i2) the Hirsch index of the author [4] (see also [5]).
It is these two indexes that we consider in this work.
The definition of the numerical value of the index (i1) is clear from its name.
Recall the definition of the Hirsch index (see [4]). The Hirsch index h is the largest number such that h of the author’s articles have each been cited at least h times. This index was introduced in [4], where its properties were explained. In our opinion, these properties do not correspond to the purpose of the index. However, we postpone the description of both the positive and negative sides of the Hirsch index until after constructing citation models for scientific articles. One of these models has already been presented by us in the preprint [6].
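For concreteness, the definition above can be turned into a short computation. The following Python sketch is only an illustration: the function name `h_index` and the sample citation counts are hypothetical and are not taken from the data discussed in this paper.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the first `rank` papers all have at least `rank` citations
        else:
            break
    return h


if __name__ == "__main__":
    example = [25, 8, 5, 4, 3, 1, 0]  # hypothetical citation counts of one author
    print("index (i1), total citations:", sum(example))
    print("index (i2), Hirsch index:", h_index(example))
```

Note that the two indexes use the same input but react differently to it: one very highly cited paper changes the total substantially, while it can raise the Hirsch index by at most one.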
2. Citation Model Construction
We now turn to the construction of the author’s citation model. It will be considered as a composite of two models. The first describes the process by which an author publishes an article that will be cited, and the second describes the process of citing such an article.
Let us make some assumptions, which we discuss later.
Assumption 1. Let the probability that a manuscript is rejected or is never cited be q, and let the decisions on publication of different manuscripts be taken independently.
Then it is clear that the probability that the scientist will have exactly
k cited papers equals
,
. In other words, the number of publications of a scientist has a geometric distribution with parameter
q. This distribution allows the number of an author’s publications to be arbitrarily large. However,
tends to zero rather fast as
and, therefore, the mean value of the number of publications is not too large. The generating function of this distribution has the form
Of course, here we assume that all the journals to which the author sends manuscripts have the same review system, i.e., all of them accept the manuscripts of this author with the same probability
. More realistic is the situation with a random parameter
q:
where
is a probability distribution on
interval and then
.
Let us go back to (
1). How much time may a scientist need to publish the corresponding number of papers? Of course, this time is a random variable
T, and we are interested in its distribution. The usual assumption is that the working time has an exponential distribution with parameter
and the Laplace transform
. Suppose that the time needed for the publication of the
j-th paper is
, and
are independent random variables with the same distribution as
T. Then the time needed for all publications has the Laplace transform
i.e., it has an exponential distribution with parameter
.
It is natural to assume that each cited publication will produce some number of citations. Of course, the likelihood that the article will be cited again depends on the number of previous citations.
Assumption 2. Assume the probability that an article having () citations will not have new quotes equalling where p is the probability that the article will not be quoted for the first time. The parameter γ is responsible for the speed of convergence of the rejection probability to zero.
Consequently, the likelihood that the article will be quoted exactly k times equals
. For the case of
, the generating probability function for the number of citations of this article is
. The corresponding distribution function is named after Sibuya [
7]. Below we consider the case of arbitrary positive
. The corresponding study is of general mathematical interest, so we present it in the following sections.
3. Distribution of Citation Number of a Paper
Let us consider an ordered sequence of experiments
, where an event
A may appear in each of the experiments with the probability
. Define a random variable
X as the number of the first experiment in which
A appears. We allow
X to be an improper random variable, in the sense that it may take an infinite value (that is, the event
A will never appear). For the case
we say that
X is a proper random variable. It is clear that, with the convention that an empty product (from 1 to 0) equals 1,
and
Particular cases are:
The probabilities
are constant. So (
2) is
corresponding to the classical geometric distribution. Its tail is
Clearly, the tail and probabilities (
3) decrease exponentially fast as
n tends to infinity.
The probabilities are given by
, where p is a number from the interval
. Equation (
3) is transformed to
According to (
4)
X is a proper random variable and has, in this case, the Sibuya distribution with parameter
with the following tail
which has a heavy power-type asymptotic for
. Such a distribution does not have a finite mean. It is not difficult to see that
The two distributions presented above can be regarded as a kind of “extreme points” from the perspective of the tail behavior of a proper random variable X. Hence, it is natural to study, roughly speaking, the cases “between them”; namely, to consider, for example, the situations when , with and . As mentioned above, the parameter γ is responsible for the speed of convergence of the rejection probability to zero.
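To make the scheme of this section concrete, the distribution of X can be computed directly from its product form. The sketch below is an illustration under an assumption: we take the probability of A in the k-th experiment to be p / k^γ, which is our reading of the family described above (γ = 0 gives the geometric case and γ = 1 the Sibuya case). The function names are hypothetical.

```python
import math


def first_success_pmf(p, gamma, n_max):
    """Return [P(X = 1), ..., P(X = n_max)] when A occurs in trial k with probability p / k**gamma."""
    pmf = []
    survive = 1.0  # probability that A did not occur in trials 1..n-1 (empty product = 1)
    for n in range(1, n_max + 1):
        p_n = p / n ** gamma
        pmf.append(survive * p_n)
        survive *= 1.0 - p_n
    return pmf


def sibuya_tail(p, n):
    """Closed form of P(X > n) for gamma = 1: Gamma(n + 1 - p) / (Gamma(1 - p) * Gamma(n + 1))."""
    return math.exp(math.lgamma(n + 1 - p) - math.lgamma(1 - p) - math.lgamma(n + 1))


if __name__ == "__main__":
    p = 0.5
    for gamma in (0.0, 0.5, 1.0):
        pmf = first_success_pmf(p, gamma, 1000)
        print(f"gamma={gamma}: P(X > 1000) = {1.0 - sum(pmf):.3e}")
    print("Sibuya closed form, P(X > 1000):", sibuya_tail(p, 1000))
```

For γ = 0 the tail decays exponentially, while for γ = 1 it decays only like a power of n, which is the contrast between the two “extreme points” discussed above.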
4. Main Result on Citation Number Distribution
The subject of our study is the asymptotic behavior of the probabilities (
2) for
with
. In addition to the previously discussed values of
or
, we distinguish the following two cases:
- (A)
;
- (B)
.
Let us consider the case (A). We have
Consider the product on the right-hand side of (
5) in more detail.
Here
is the integer part of
. It is not difficult to see that
has a finite positive limit as
. This limit may depend on
p and
. Let us denote it by
. Therefore,
Relations (
5) and (
7) give us
For
the following asymptotic representation is known
where
is the Riemann zeta function. Further considerations depend on the properties of the number
.
- (i)
Suppose that
is not an integer. Then
and
However,
and, therefore,
From this and (
10) it follows
where
depends on
p and
only.
- (ii)
Suppose that
is a positive integer. Then
and
It is known that
and
where
is Euler’s constant. Therefore,
Now we see that the asymptotic behavior of the probability
in the case (A) is given by (
11) and (
13). From the relations (
11) and (
13) it follows
so that
X is a proper random variable.
For the distribution tail
we have
If
is not a positive integer, then
where
depends on
p and
. Similarly, for the case of integer
,
Let us consider the case (B). We have
where
. Transform the product on the right-hand side:
The series under an exponential sign converges because
. From the latest relation we see that
and
X is an improper random variable.
Therefore, for conditional probabilities we have
where
depends on
p and
only.
Summarizing, we obtain the following theorem.
Theorem 1. For the considered experiment scheme with probabilities given in (5) the following statements are true: If then , .
If and is not a positive integer then If and is a positive integer then All depend on parameters p and γ only.
One of the reviewers of the first version of the paper advised us to study the form of the constants for some particular cases. We are very grateful to him for the advice. Below we consider the case
. In this case
so that the sum under exponential sign in (
19) contains only one summand. Calculations similar to those given above lead to the following expression
In other words, the constant
has form
However, precise calculation of the other constants is rather difficult. We do not need these constants for the aims of this paper and omit any further calculations of constants.
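The difference between a proper and an improper random variable X can also be observed numerically. The following sketch is an illustration only and assumes that the probability of A in the k-th experiment is p / k^γ (our reading of the family studied above); the function name is hypothetical. It evaluates the probability that A has not appeared in the first n experiments. For γ = 2 the result is compared with Euler’s classical product for the sine, ∏(1 − x²/k²) = sin(πx)/(πx) with x² = p; this identity is used here purely as a numerical check and is not a statement about the constants derived in the text.

```python
import math


def survival_probability(p, gamma, n):
    """P(X > n): the event A has not appeared in the first n experiments."""
    prob = 1.0
    for k in range(1, n + 1):
        prob *= 1.0 - p / k ** gamma  # assumed form of the k-th success probability
    return prob


if __name__ == "__main__":
    p = 0.5
    for gamma in (0.5, 1.0, 2.0):
        values = [survival_probability(p, gamma, n) for n in (10, 1000, 100000)]
        print(f"gamma={gamma}:", ", ".join(f"{v:.4e}" for v in values))
    x = math.sqrt(p)
    print("Euler product limit for gamma=2:", math.sin(math.pi * x) / (math.pi * x))
```

For γ ≤ 1 the printed values keep decreasing toward zero, so X is proper; for γ = 2 they stabilize at a positive number, which is the probability that A never appears.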
5. Comments
Theorem 1 shows that for , the tail of the corresponding distribution is not heavy. Namely, the distribution has finite moments of all positive orders. However, the tail becomes heavier with growing . In the case of the distribution is unimodal with mode equal to 1. For the values , the distribution has a power-type tail, which is heavier than the ones occurring for . In the case the conditional distribution under condition does not have a finite mean. However, for growing values of the tails of the conditional distributions appear to be less heavy. In the case of the conditional distribution has its mode at 1.
6. The Case of Growing
Above, we considered the case in which the probability of the event A decreases as the experiment number increases. For completeness, consider the case in which this probability increases.
Namely, suppose that in (
1)
for
and
. Then
It is clear that
, and the tail of the distribution
is a quickly decreasing function of
m. Of course, the distribution of
X has finite moments of all orders and it may have a mode not only at 1.
7. Back to the Distribution of Citation Number of One Author
We suppose now that the distribution of the citation number of one paper has the form (
5):
with
. The corresponding probability generating function is
As mentioned above, the number of cited papers is distributed according to the geometric law with probability generating function (
1):
The probability generating function of the citation number of one author equals the composition of and Q, i.e., it is . It is clear that the tail of the corresponding distribution is not heavy for , it is heavy for , and the distribution is improper for .
Although the case of an improper distribution seems unrealistic, we discuss it for some particular cases below, after considering the proper cases .
Recall that the case leads to light-tailed distributions, while leads to laws with a heavy tail. The choice between models with light or heavy tails can only be made based on real data. Below we analyze some data of this kind.
7.1. Analyzing Data from Google Scholar “Mathematics”
We present the data for the category “Mathematics” as of 16 February 2020 (see
Table 1). The data concern the 10 authors with the largest numbers of citations. We do not give the names of these scientists. The table shows:
The serial number of the author;
The total number of citations by the author;
Hirsch Index;
The number of citations of the most popular work (by the most popular work we mean the work with the largest number of citations among the works of this scientist);
The ratio of the total number of citations to the squared Hirsch index.
Table 1 shows that the first scientist has 2.76 times more citations than the second. In other words, the maximum of the observations is substantially greater than the next largest one. This observation leads us to think that the corresponding distribution has heavy tails (see [
8,
9]). As we have seen, it is possible for the case
only.
7.2. Analyzing Data from Google Scholar “Biostatistics”
We present the data for the category “Biostatistics” as of 16 February 2020 (see
Table 2). The structure of
Table 2 is the same as that of
Table 1.
Table 2 shows that the first scientist has 1.59 times more citations than the second. Although this ratio is smaller than in the case of
Table 1, the number is large enough to support our hypothesis on the presence of a heavy tail.
We do not give the data for the category “Statistics”, but note that the situation is similar to that in Table 1 and Table 2.
7.3. Final Model for the Distribution of Citations
From the considerations of the two previous subsections, it follows that the most natural way to describe the distribution of citations is to choose
. This means
and the probability generating function of the citation distribution is given by
Denote by Y the number of citations of a given scientist. It is clear that may be found as the n-th coefficient of expansion in power series. We have
where
is a hypergeometric function. Therefore,
It is possible to verify that for all integers . Therefore, the most probable outcome is a scientist with no citations at all, that is, one without papers or without cited papers. If we restrict ourselves to scientists having at least one citation, then the highest probability corresponds to authors with exactly one citation.
The Laplace transform of the distribution of
Y has the form
Its asymptotic as
is
This relation shows that the random variable
Y has moments of order less than
p and does not have moments of higher order. Because
the variable
Y has infinite mean. In practice, this means that some scholars have a very large number of citations. These citations refer to publications by a relatively small number of scholars. Of course, the data in
Table 1 and
Table 2 agree with these statements. It is important that the model is built on the assumption that all scientists have the same capabilities. Even so, we should expect to observe great variability in the number of citations of their publications. Thus, the difference in the number of citations can be purely random and say nothing about the real contribution of a scientist to the corresponding field of science.
Of course, the proposed model is very idealized, since it does not take into account the real differences in the capabilities of scientists, or in their access to the necessary tools and equipment. Taking these differences into account is likely to lead to the need to consider mixtures of the proposed distributions with different parameters p and q. However, such a complication will not make it possible to distinguish scientists with a large contribution to science from those with a smaller impact.
Admittedly, the arguments presented for the choice of are rather crude; in reality, it may happen that is close to unity. Although in this case the distribution tail is not heavy, over a very large (but finite) interval it is close to heavy. So, qualitatively, our conclusions remain unchanged.
Based on the foregoing, we conclude that it is practically senseless to use the number of citations of a scientist’s work to assess his contribution to science.
7.4. Remarks on the Model with
In this subsection, we try to justify the possibility of using models with γ greater than one. As already noted, in this model the probability is not equal to zero. It is unlikely that this corresponds to the situation in which all scientists working in a given field of science are considered. However, a very long citation process (ideally, endless) is quite possible in the case of the most prominent scientists. For example, in the field of Mathematics, the works of Professor Andrei Nikolaevich Kolmogorov (1903–1987) continue to be cited. Over the past 15 years, they have been cited about 30,000 times, although more than 30 years have passed since the death of their author. It is highly probable that the citation process for these works will continue for a long time.
In addition, the concept of citation is somewhat arbitrary in our opinion. For example, in Mathematics, some theorems or other objects bear the names of the scientists who contributed to them. Does the mention of these theorems and the corresponding names in an article count as a citation? For example, many articles and books mention the Gaussian distribution without reference to the corresponding publication by Gauss. Is this mention a citation? It seems to us that eponymous results of this kind are not counted in determining the citation index. However, they certainly indicate the scientific significance of the result. It is very likely that, to account for citations of this kind, models with γ greater than 1 may be required.
8. Hirsch Index
Recall that the definition of the Hirsch index was given in Section 1. Hirsch states that the proposed index h is intended to rank authors of articles in the field of Physics. At the same time, it is noted that the index can be used in other fields of science. Since the number of citations is used in determining the index h, it seems plausible that h is associated with this number. Hirsch notes that the number of citations is given by . He wrote: “I find empirically that κ ranges between 3 and 5” (we change Hirsch’s notation: his a is our κ). Further, Hirsch wrote: “ is very atypical value”.
Below we show that the Hirsch statements presented here are doubtful. In addition, the use of this index seems unreasonable.
Let’s start by analyzing the data in
Table 1 and
Table 2. Recall that column 5 gives the corresponding values of κ.
Table 1 does not contain any
while
Table 2 has only one such value
. Other values of
are “very atypical”, especially for
Table 1.
Table 2 contains 2 values of
. Therefore, at least for such fields as “Mathematics” and “Biostatistics”, Hirsch’s conclusion about the “typical” form of proportionality between the number of citations of an author and the square of the corresponding Hirsch index seems to be incorrect. However, was Hirsch right in the field of “Physics”?
8.1. Data in “Physics”
Now we give the data for the field “Physics”, arranged in a table in the same way as for
Table 1.
Again,
Table 3 has only one
, namely
. However, there are six values
. The κ values for the “Physics” area look smaller than for the “Biostatistics” area and significantly smaller than for the “Mathematics” area. The value of the Hirsch index for Physics has much less variability than for Biostatistics and Mathematics. The differences in citation numbers are much greater for Mathematics than in the case of Physics.
So, we see that Hirsch’s understanding of the situation in Physics is closer to reality than in the case of Biostatistics and, especially, Mathematics.
8.2. Data Comparison
The average value of the Hirsch index in the case of Table 1 is 99.3 with a standard deviation of 66.45. The same indicators for Table 2 are 153.8 and 47.97, and for Table 3 they are 198.2 and 21.73. We see that the standard deviation of the Hirsch index in the case of Mathematics is three times greater than in the case of Physics. Conversely, the average value of the index is largest in the case of Physics and smallest in the case of Mathematics. This shows that even if the Hirsch index is useful in the field of Physics, its usefulness in the field of Mathematics is doubtful. The same is probably true for Biostatistics.
Authors with a higher Hirsch index are often inferior to others in the number of citations of the most popular works. For example, in
Table 1, Author 1, having the highest Hirsch index, is inferior to Authors 2, 4, 5, 6 and 7 in the number of citations of the most popular work. In this case, Author 1 wrote his most cited work with co-authors, while Author 2 wrote his without co-authors.
It is clear that the Hirsch index does not exceed the number of cited publications of the author, which has a geometric distribution. Thus, the distribution of the Hirsch index has a light tail. Since the number of citations has a heavy tail, it is more variable than the Hirsch index. However, these two indicators are stochastically strongly related. Indeed, for the data in
Table 1, the sample correlation coefficient between these indicators is
. On the other hand, the correlation coefficient between the Hirsch index and the number of citations of the most popular works is
. This coefficient indicates a weak relationship between the indicators, and it is negative. In other words, a large Hirsch index is most likely not found among authors with highly cited individual articles. For
Table 2, the values of the correlation coefficients are equal to
,
, and for
Table 3 ,
.
The increase in the Hirsch index with a decrease in the number of citations of the most popular work may result from the division of a work into a series of publications. However, when assessing the quality of a scientist’s contribution, one should take into account that the publication of a series of articles instead of one may be caused not by a desire to increase the number of publications, but, for example, by a gradual insight into the essence of the problem under consideration. Such insight often requires a very long time, i.e., the publication of a series of articles is justified. It should be noted that the publication of a series of articles naturally leads to an increase in the number of self-citations. This increase cannot be considered a flaw of the author and does not indicate an attempt to artificially increase the number of citations. At the same time, the presence of a series of publications (which increases the Hirsch index) cannot be considered preferable to one highly cited work.
The presence of higher values of the Hirsch index in Physics compared to Mathematics can be explained by the use in modern Physics of expensive equipment in experimental Physics and/or the results obtained on it in theoretical Physics. Often this equipment is used by some laboratory or scientific group, and then transferred to another or others. After some time, this equipment again becomes available to the first group. Thus, new experimental facts arrive intermittently, and during the break they are processed and published. A theoretical analysis of the observed facts is also taking place. Then comes new information related to new experiments. Therefore, the very flow of information (both experimental and theoretical) contributes to the publication of not a single article, but a series of articles. This circumstance leads to an increase in the Hirsch index with a relative decrease in the number of citations of popular works.
A similar situation is absent in Pure Mathematics, so there are far fewer reasons for series of articles to appear there. Separate works appear, which often cover a substantial part of the problem under consideration. They cause a stream of citations of this particular work, rather than of a series of works. Thus, the Hirsch index becomes smaller than it would be if a series of articles were published instead of this one, but the most popular work receives more citations than each individual work in the series would.
So, the use of the Hirsch index has some basis in the field of Physics, but it is not related to what is happening in Mathematics.
For some areas of Applied Mathematics, a situation may be observed that is intermediate between what is happening in Physics and in Pure Mathematics.
However, it is not clear to us why the Hirsch index should not be replaced with two indicators. The first of these could be the number of all citations, and the second the number of citations of the most popular work. The Hirsch index is stochastically quite closely linked to the number of all citations, so it and this number are “interchangeable”. However, after a scientist stops working in a given field of science, the number of his publications does not increase and, therefore, the Hirsch index remains bounded, while the number of citations can continue to grow without bound. This is exactly what happens with the works of the most outstanding scientists of the past.
9. Distribution of the Hirsch Index
In this section, we obtain the probability distribution of the Hirsch index.
We introduce some notation. It is clear that the Hirsch index is a random variable. Let us denote it by H and its values by h. Our aim here is to determine the probabilities that H = h, i.e., P{H = h}. In order for the event {H = h} to occur, it is necessary and sufficient that:
- (a)
at least h works were published;
- (b)
h of the published works are cited at least h times each, and the rest are cited fewer than h times.
Suppose that
l works are published, and
. The probability of this event is
. Recall that the probability that a published work will be cited
k times equals
. Therefore, the probability that the published work will be cited at least
h times equals
where
is the Euler gamma function.
The probability that a published work will be cited less than
h times is defined as
Thus, the probability that
l papers are published, and the Hirsch index
H has taken the value
h is
So, the random variable
H has the following distribution
where
Note that this distribution is not a geometric one because the value of depends on h.
Next, we are interested in estimating the tail of the distribution of
H. To do this, we estimate the asymptotic behavior of the
. An application of the Stirling formula allows one to easily obtain that
This formula immediately leads us to an asymptotic expression for the logarithm of probability
for
. Namely,
It follows that the probability of the event decreases faster than the exponential function for . Of course, the tail of the distribution of H also decreases faster than the exponential function. Therefore, this distribution has moments of all orders. Note that the distribution of the number of citations of articles by this author has an infinite mean value. So, if an author has a fairly large number of citations, then the ratio of the number of citations to the square of the Hirsch index can be arbitrarily large. This fact contradicts Hirsch’s claim that κ is bounded.
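The contrast between the light tail of H and the heavy tail of the total number of citations can be illustrated by simulation. The sketch below is not the derivation given above; it is a Monte Carlo illustration under explicitly assumed forms. We assume that the number of cited papers K satisfies P(K = k) = (1 − q)^k q for k = 0, 1, ..., and that each cited paper receives a Sibuya-type number of citations, namely the first n at which an event of probability p / n occurs. The truncation `cap`, the parameter values and all function names are hypothetical choices made only to keep the example short and fast.

```python
import random


def sample_num_papers(q, rng):
    """Assumed geometric law on {0, 1, ...}: P(K = k) = (1 - q)**k * q."""
    k = 0
    while rng.random() >= q:
        k += 1
    return k


def sample_citations(p, rng, cap=10_000):
    """First n at which an event of probability p / n occurs, truncated at `cap`."""
    n = 1
    while n < cap and rng.random() >= p / n:
        n += 1
    return n


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    while h < len(ranked) and ranked[h] >= h + 1:
        h += 1
    return h


def simulate_author(q, p, rng):
    """Return (Hirsch index, total citations, number of cited papers) for one simulated author."""
    papers = [sample_citations(p, rng) for _ in range(sample_num_papers(q, rng))]
    return h_index(papers), sum(papers), len(papers)


if __name__ == "__main__":
    rng = random.Random(2020)
    authors = [simulate_author(q=0.1, p=0.5, rng=rng) for _ in range(2000)]
    ratios = [total / h ** 2 for h, total, _ in authors if h > 0]
    print("largest Hirsch index:", max(h for h, _, _ in authors))
    print("largest total number of citations:", max(t for _, t, _ in authors))
    print("largest ratio of citations to squared Hirsch index:", max(ratios))
```

In runs of this kind the Hirsch index stays within a narrow range, while the total number of citations and the ratio of citations to the squared Hirsch index vary over several orders of magnitude, even though all simulated authors share the same parameters.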