1. Introduction
In publications on networks, including the Internet, it is standard to mention the sources of traffic (see, e.g., [1]). A description of the sources is usually given in rather general terms (see, e.g., [2]), without the details that are necessary in order to proceed to any quantitative calculations. The data from [3] indicate that the main portion of the data volume is Internet video. This information, interesting from a general point of view, is of no help to someone analyzing network traffic measured at a specific place, for example, at a research university or a banking data processing unit. Therefore, in many cases, a more detailed concept of traffic sources is required.
Most of the theoretical publications are devoted to fairly abstract probabilistic models, such as Markov chains, in which the data source is an abstract generator of random events that, as a rule, models the generation of packets. Experimental results are available mainly for backbone data transmission networks, which are quite far from what could be taken as a specific data source in the usual sense (although not in the sense of a mathematical model). This situation motivates a study of traffic as the result of the activity of physical sources, i.e., of real objects rather than abstract models. By “real objects” we mean both people and computer programs, which exist and act in accordance with their own rules and/or wills rather than as abstract models.
In the classical teletraffic theory [4,5,6,7], the primary source is “someone who places a call”. In network communications, the primary source is “someone starting a network service”. In packet technology, the user does not generate any stream directly. The user only starts a network service, which is a computer program. The program, after being launched, transmits and receives data streams in accordance with its own rules.
In the classical theory [4,5,6,7], any user (even one very different from the others) does the same thing: occupies a channel. In this sense, all users are equivalent, and the traffic is homogeneous. Non-homogeneity of traffic can result from differences in the time for which each user occupies the channel, but this issue has been resolved within the framework of the classical teletraffic theory (see, e.g., [8,9]).
In packet data technology, data are generated by programs. A simple look at the traffic generated by different programs is enough to conclude that these services generate very different data streams, which cannot be considered equivalent, so the resulting data stream is extremely heterogeneous; see Figure 1. However, another question naturally arises: what about the data streams generated by the same service? Maybe there is some equivalence of data streams generated by the same service?
The answer to the latter question may be affirmative. However, before proceeding any further, we need to make the following important remark. In general, the analysis of a process depends on the scale of the consideration. This also applies to data flow analysis. The problem of the data streams’ equivalence may be effectively resolved with some specific choice of scale. In the theory of multi-traffic, the time scales displayed in Figure 2 are distinguished.
The data flows generated by a specific service during a session belong to the scale of “minutes”. In Figure 2, the scale of “seconds” corresponds to “bursts” [10,11,12], and the scale of “microseconds” corresponds to “packets”. We are not concerned with the scale of packets, although simulating communication systems at this scale is very popular nowadays [13,14]. There are toolkits for measuring data flows at both of these scales, for example, Tmeter [15]. Thus, the main question is formulated as follows: is it possible to identify a given service at the level of “minutes” and “seconds”, that is, at the level of “sessions” (“calls” in Figure 2) and “bursts”? This question correlates with the classical metrology approach [16], by which one should classify the random processes and determine their numerical characteristics [17], as well as with modern modifications of the classical approach (see, e.g., [18]).
The data streams that circulate on the Internet are the result of both the operation of Internet services and the activities of the people who use them. The period of human activity at the computer is several hours, and the cycle of this activity is a day, a work shift, the performance of a single duty, etc. This motivates adding one more level, a workday, to the scheme in Figure 1. This level accounts for the human factor. There is always a data flow not related directly to a user’s activity: the service data flow. However, this flow is a small part of the total, and we therefore neglect it in this paper.
A user sends no data directly to the Internet, but only starts services or programs that create and/or transfer data to the network. The immediate sources of data are the services initiated by the user.
The Internet is based on packet data technology [13,14,19], and an analysis of the data flow on the Internet can, in principle, be based on the study of packets and the protocols of Internet services, for both local and global networks. Unfortunately, keeping track of all possible interactions of these protocols, both among themselves and with the users, and a per-packet study of the data flows are so complex and subject to so many random influences that the analysis of data flow at the microscale (at the scale of data packets) becomes extremely complicated. This is why we do not use the level of “microseconds” in Figure 1. In this context, the macroscopic approach is justified: we consider the overall data flow over certain time intervals, for example, seconds, and analyze data flows during the session.
Even on a single computer, multiple services may be run. This raises the question of the interaction of multiple data flows. The latter problem may also be solved by macroscopic analysis.
The aim of this paper is to examine the data flows generated by typical Internet services and the interaction of these flows “at the source”: at a single computer and on the local network.
The basic tool used in this investigation is a series of reliable and reproducible experiments together with the standard statistical methods.
In this paper, the authors present experimental data concerning the output data flows.
5. Construction of the Data Rate Distribution by Using the Computer Simulation
To propose a hypothesis about the type of distribution, the skewness and the excess kurtosis (denoted “kurtosis” in the tables below) were calculated. The results are presented in Table 3 and Table 4.
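For concreteness, the following minimal sketch (not the authors’ code) shows how such sample moments can be computed; the input file name and the array of simulated rates are hypothetical placeholders.

```python
# A minimal sketch: sample skewness and excess kurtosis of simulated data rates,
# as reported in Tables 3 and 4. The input file is a hypothetical placeholder.
import numpy as np
from scipy import stats

rates = np.loadtxt("simulated_rates.txt")   # hypothetical: one data rate per line

skewness = stats.skew(rates)                # third standardized moment
excess_kurtosis = stats.kurtosis(rates)     # Fisher definition: 0 for the normal law

print(f"skewness = {skewness:.3f}, excess kurtosis = {excess_kurtosis:.3f}")
```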
It follows from Table 3 and Table 4 that the distribution should be selected among the asymmetric distributions. The distributions determined from the simulation have a characteristic form similar to the graphs of the gamma distribution. The gamma distribution density function is given by the following formula:

$$ f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}, \qquad x > 0, \qquad (1) $$

where $\alpha > 0$ and $\beta > 0$ are the parameters [21].
The numerical values of $\alpha$ and $\beta$ were determined by the least squares method (the mean square deviation of the empirical distribution functions from the function (1) was minimized). The computed parameters are presented in Table 5.
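One possible reading of this procedure is sketched below: the parameters of (1) are fitted to a histogram-based empirical density by least squares. The shape/rate parametrization ($\alpha$, $\beta$), the input data, and the moment-based initial guess are illustrative assumptions, not the authors’ implementation.

```python
# A minimal sketch: least-squares fit of the gamma parameters alpha and beta
# to an empirical density estimated from the simulated rates (illustrative only).
import numpy as np
from scipy import optimize
from scipy.stats import gamma

rates = np.loadtxt("simulated_rates.txt")             # hypothetical input file

# empirical density from a histogram
density, edges = np.histogram(rates, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

def residuals(params):
    alpha, beta = params
    # scipy uses shape/scale, so scale = 1/beta for the rate parametrization
    return gamma.pdf(centers, a=alpha, scale=1.0 / beta) - density

# initial guess from the method of moments
mean, var = rates.mean(), rates.var()
fit = optimize.least_squares(residuals, x0=(mean**2 / var, mean / var),
                             bounds=(1e-9, np.inf))
alpha_hat, beta_hat = fit.x
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")
```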
The graphs of the function (1) for the parameters indicated in Table 5 and the graphs of the density determined by using the computer simulation are shown in Figures 7–15. The abscissa axis shows the data rates in K/s, while the ordinate axis shows their probabilities.
An application. In Figure 15, the graphs for the first time become similar to the normal distribution, but 2000 users are impossible on a local network. As a result, we conclude that the data rate initiated by e-mail users is described by the gamma distribution.
An application. The results of the statistical analysis presented above have numerical applications. For example, they provide information about the maximal data stream as a function of the number of users. This information is useful for estimating the necessary local network capacity and the data stream from this local network to the global network. The maximum data rate is marked in Figures 7–14 by an asterisk, and the corresponding numerical values of the rate (in K) are presented in Table 6.
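As a further illustration (our assumption, not a result reported in the paper), a fitted gamma distribution can also be used to quote a high quantile of the data rate as an indicative capacity figure; the parameter values and the coverage level below are placeholders.

```python
# Illustrative only: a high quantile of the fitted gamma distribution as an
# indicative upper bound on the data rate. Parameters and level are placeholders.
from scipy.stats import gamma

alpha_hat, beta_hat = 2.5, 0.04     # hypothetical fitted parameters (not from Table 5)
level = 0.99                        # coverage level chosen for illustration

rate_bound = gamma.ppf(level, a=alpha_hat, scale=1.0 / beta_hat)
print(f"{level:.0%} quantile of the data rate: {rate_bound:.1f} K/s")
```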
6. Justification of the Constructed Density of the Data Flow Rate
Visually, there is a good match between the plots of the density functions determined from the computer simulations and the plots of the gamma distribution density functions in Figures 7–15. We now present the statistical justification for this conclusion. We propose the following hypothesis: the data rate determined from our computer simulation has a gamma distribution with the parameters indicated in Table 5.
To verify/reject this hypothesis, we use the Kolmogorov–Smirnov test [22], which consists of the following: as a measure of the discrepancy between the theoretical and statistical distributions, the maximum value of the modulus of the difference between the statistical (empirical) distribution function $F^{*}(x)$ and the corresponding theoretical distribution function $F(x)$ is considered:

$$ D = \max_{x} \left| F^{*}(x) - F(x) \right|. $$

The critical value of the Kolmogorov–Smirnov test is calculated by the formula $\lambda = D\sqrt{n}$, where $n$ is the number of relative empirical frequencies. The probability $P(\lambda)$ is determined from the table in [22] ($P(\lambda)$ is the probability that, due to purely random reasons, the maximum discrepancy between $F^{*}(x)$ and $F(x)$ will be no less than the one actually observed [22]). If $P(\lambda)$ is close to 1, the hypothesis of the gamma distribution of the computer-simulated data transfer rate is accepted; at a value close to zero, this hypothesis is rejected; see [22] for details.
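A minimal sketch of this criterion is given below, assuming the simulated rates and the fitted parameters are available; here $n$ is taken as the sample size, and $P(\lambda)$ is evaluated with the asymptotic Kolmogorov distribution rather than the table in [22]. All inputs and parameter values are placeholders.

```python
# A minimal sketch of the Kolmogorov criterion: D, lambda = D*sqrt(n), P(lambda).
# Inputs and parameter values are hypothetical placeholders.
import numpy as np
from scipy.stats import gamma, kstwobign

rates = np.loadtxt("simulated_rates.txt")        # hypothetical input file
alpha_hat, beta_hat = 2.5, 0.04                  # hypothetical fitted parameters

x = np.sort(rates)
n = x.size
cdf = gamma.cdf(x, a=alpha_hat, scale=1.0 / beta_hat)   # theoretical F(x)

# maximum discrepancy D between the empirical and theoretical CDFs
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
D = max(d_plus, d_minus)

lam = D * np.sqrt(n)            # lambda = D * sqrt(n)
P = kstwobign.sf(lam)           # P(lambda): probability of a discrepancy >= D by chance
print(f"D = {D:.4f}, lambda = {lam:.3f}, P(lambda) = {P:.3f}")
```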
The calculated values of $\lambda$ and $P(\lambda)$ are presented in Table 7.
Since the probabilities $P(\lambda)$ from Table 7 are close to 1, we accept the hypothesis of the gamma distribution of the simulated frequencies.
9. Conclusions
The information about the individual characteristics of the data streams generated “at the source”, about the superposition of these streams, and about a specific user’s activity allows for computing the total data streams on the network.
The authors suggest analyzing the characteristics of the traffic generated by Internet services and developing data rate “portraits” (or “passports”) of these services. The development of such “portraits” (or “passports”) assumes a statistical investigation of these services. Examples of such analysis are presented for the most common Internet services, with a detailed analysis given for an e-mail service.
The authors experimentally investigated the superposition of data streams on a local network and found that the data rates are additive to within a relative error of less than 5%, although the additive rule is not satisfied exactly.
These results may be used for the development of models of the data streams based on the “passport data” of the Internet services and on the statistical information about users’ activity. The models may be used to compute the total data flow on the local network depending on the types of services, the number of users, the users’ activity, etc. (see the examples in the subsections titled “An application”).