Article

Generalizations and Properties of Normalized Similarity Measures for Boolean Models

by Amelia Bădică 1, Costin Bădică 2,*, Doina Logofătu 3 and Ionuţ-Dragoş Neremzoiu 2
1 Department of Statistics and Business Informatics, University of Craiova, 200585 Craiova, Romania
2 Department of Computers and Information Technology, University of Craiova, 200440 Craiova, Romania
3 Faculty of Computer Science and Engineering, Frankfurt University of Applied Sciences, Nibelungenplatz 1, 60318 Frankfurt am Main, Germany
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(3), 384; https://doi.org/10.3390/math13030384
Submission received: 6 January 2025 / Revised: 21 January 2025 / Accepted: 22 January 2025 / Published: 24 January 2025
(This article belongs to the Special Issue Mathematics and Applications)

Abstract: In this paper, we provide a closer look at some of the most popular normalized similarity/distance measures for Boolean models. This work includes the generalization of three classes of measures described as generalized Kulczynski, generalized Jaccard, and generalized Consonni and Todeschini measures, theoretical ordering of the similarity measures inside each class, as well as between classes, and positive and negative results regarding the metric properties of measures related to satisfying or not satisfying the triangle inequality axiom.
MSC:
03E75; 68R01; 68T01

1. Introduction

Defining a quantitative measure that evaluates the similarity of two objects occurs in various scientific areas, including computing, natural sciences, medicine, forensics, socio-economic sciences, engineering, and arts. For example, biologists use similarity measures for sequence alignment, protein structure comparison, and studying the similarity in species distribution. Computer scientists are using similarity measures for comparing data points in clustering, classification, and anomaly detection, for determining relevant documents in information retrieval, for assessing semantic compatibility of natural language texts, as well as for object and speech recognition in computer vision and multi-modal user interfaces. Medical doctors are interested in assessing medical image similarity in radiology or evaluating disease spread patterns in epidemiology. Sociologists and psychologists are conducting survey response analysis and evaluating behavioral similarity, while forensic engineers are analyzing fingerprints and handwriting. Last but not least, economists are assessing the similarity of consumer behaviors and stock patterns, artists and musicians are interested in melody similarity and art styles, while Earth scientists are analyzing terrain similarity, map matching, and weather patterns.
In this paper, we provide a theoretical investigation of some of the most popular similarity/distance measures for Boolean models. For a fair discussion and comparison, we restrict our attention to normalized similarities (i.e., taking values in $[0, 1]$) of type 1 according to [1], which only depend on characteristics present in the compared objects. In particular, we are interested in developing clear definitions and generalizations of the measures, as well as analyzing their theoretical ordering and metric properties in detail.
We provide new theoretical results covering the following three aspects:
i.
Generalization of three classes of measures described as generalized Kulczynski, generalized Jaccard, and generalized Consonni and Todeschini measures (Definition 7, Proposition 6, Equation (36)).
ii.
Theoretical ordering of the similarity measures inside each class, as well as between classes (Proposition 4, Proposition 6, Proposition 7).
iii.
Positive and negative results regarding the metric properties of the measures related to satisfying or not satisfying the triangle inequality axiom (Proposition 5, Proposition 8, Proposition 9, Proposition 10).
To the best of our knowledge, the proposed generalizations of similarity measures, as well as the results regarding the theoretical ordering, are new. Moreover, the positive and negative results regarding satisfying or not satisfying the triangle inequality axiom can be seen as generalizations of particular results concerning specific similarity measures, like the metricity of Jaccard similarity and non-metricity of cosine similarity.
The paper starts with a brief overview of related works in Section 2. Section 3 covers background definitions of similarity and distance measures, metrics, and set measures, as well as notations used in the rest of the paper. Section 4 introduces the most popular normalized similarity measures found in the literature. Section 5, Section 6 and Section 7 cover the most important results regarding the generalization of Kulczynski, Jaccard, and Consonni and Todeschini similarity measures and their properties.

2. Related Works

The Jaccard similarity was proposed more than one century ago in [2] to study patterns of flora distribution. The first mathematical proof that the Jaccard distance satisfies the triangle inequality was provided many decades later in [3]. One year later, a shorter and very elegant proof was provided in [4]. Two recent generalizations of Jaccard similarity using modular and submodular set functions, including metricity proofs, were provided in [5]. It is interesting to note that these results apply to general distributive lattices rather than just sets. Note, however, that they are different from our proposal of a parameterized version of Jaccard similarity as a symmetric Tversky index.
A very recent axiomatic characterization and generalization of the Jaccard similarity metric is provided in [6], featuring the increase in marginal sensitivity to the gradual removal of common elements (i.e., updating axiom A3 from [6]). It is interesting to note that the cited study shows by means of a simple example that their proposed generalization of Jaccard (different from ours), Sørensen–Dice, cos (i.e., Driver and Kroeber or Ochiai), and overlap similarities do not satisfy triangle inequality (axiom A7 in [6]). However, our results are more general, stating that the generalized Kulczynski distance satisfies triangle inequality only for the particular Braun–Blanquet case and fails to satisfy it for all other mean functions.
An early overview of binary (i.e., presence–absence) similarity coefficients is provided in [7]. While the authors observed that most of the binary coefficients considered for discussion were developed intuitively and tested empirically, their purpose was to highlight their conceptual relationships and to standardize their symbolic expressions. Nevertheless, other theoretical results are missing.
An evaluation of 43 coefficients of association and similarity based on binary data is proposed in [8]. That analysis involves theoretical considerations of admissibility and other additional conditions, as well as an empirical evaluation of the significance of their association on a real data basis using the chi-square test. An interesting conclusion of that work was that “a set of measures that generally work well” should comprise Sørensen–Dice, Kulczynski, Driver and Kroeber (Ochiai), and Braun–Blanquet similarities. According to our analysis, these four measures are, in fact, special cases of generalized Kulczynski similarity, thus theoretically explaining some of the experimental conclusions of [8].
An experimental comparison of the effect of base rates on the grouping of 71 binary similarity coefficients was recently provided in [9]. However, while this comparison involves a much larger ecosystem of similarity measures for binary models than our work, the results obtained are purely experimental and address a very specific and practical problem rather than investigating the more general theoretical properties of the analyzed measures, like ordering and metricity.
A recent survey of similarity measures for both binary and numerical data is presented in [1]. While the results of this survey address both binary and numerical data, the analysis provided covers mainly semantic aspects: order-based comparison, value-based comparison, degree of severity, and power of discrimination, while other mathematical properties that are more relevant for stronger theoretical results, like metricity properties, are not investigated.
Yet another path of approaching the definition of distances between finite sets is to regard them as capturing discrete probability distributions. Probably the earliest work in this direction is provided by Rajski’s distance [10], which is defined based on Shannon entropy and joint entropy [11]. Rajski proved that his distance defines a metric space of discrete probability distributions, thus satisfying triangle inequality. An abstract definition of entropy and its metricity properties in the context of submodular and supermodular functions on lattices is proposed in [12]. These results could provide an interesting basis for investigating extensions of our work, covering the probabilistic aspect in the framework of general lattices.
Moreover, rather more recently, a new measure of similarity of Boolean vectors (i.e., finite sets), defined as the collision probability of optimal locality-sensitive hashing, was proposed in [13]. Their proposed probability Jaccard similarity is shown to be a natural generalization of the Jaccard similarity to probability distributions, as compared to the weighted Jaccard index. Interesting connections between locality-sensitive hashing and supermodularity of the similarity measures are proposed in [14]. This might suggest possible connections between their results and our results on generalized Consonni and Todeschini measures relying on the supermodularity property.

3. Background

A similarity measure or index, or simply similarity, at its core, quantifies the similarity between two (or more) objects. In what follows, we address only the binary case.
The most general axiomatic definition of similarity is provided below [15].
Definition 1.
(Similarity Measure) Let $\mathcal{X}$ be a set of objects. A function $s : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a similarity measure on $\mathcal{X}$ if and only if it satisfies the following axioms:
1. It is non-negative, i.e., $s(x, y) \ge 0$ holds for all $x, y \in \mathcal{X}$.
2. It is symmetric, i.e., $s(x, y) = s(y, x)$ holds for all $x, y \in \mathcal{X}$.
3. It satisfies the identity of indiscernibles, i.e., $s(x, y) \le s(x, x)$ holds for all $x, y \in \mathcal{X}$.
A distance measure or simply distance, also known as dissimilarity, at its core, quantifies the dissimilarity between two (or more) objects. Observe that distance is exactly the opposite of similarity, i.e., whenever the similarity is higher, the distance is lower and vice versa.
Definition 2.
(Distance measure) A function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a distance (or dissimilarity) measure on $\mathcal{X}$ if it satisfies the following axioms:
1. It is non-negative, i.e., $d(x, y) \ge 0$ holds for all $x, y \in \mathcal{X}$.
2. It is symmetric, i.e., $d(x, y) = d(y, x)$ holds for all $x, y \in \mathcal{X}$.
3. It is reflexive, i.e., $d(x, x) = 0$ holds for all $x \in \mathcal{X}$.
Very often, distance measures are required to reflect the real-world geometric intuition that the direct distance between two points is always shorter than or equal to the sum of distances through an intermediate point. This constraint is known as triangle inequality, and it has the advantage, on one hand, of interpretability and consistency with human reasoning, and on the other hand, of providing some logical properties that are necessary for the mathematical foundations of many algorithms and theories.
Definition 3.
((Semi)Metric) A distance measure $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a semimetric on $\mathcal{X}$ if it satisfies the triangle inequality axiom:
1. $d(x, y) \le d(x, z) + d(z, y)$ for all $x, y, z \in \mathcal{X}$.
If, additionally,
2. $d(x, y) = 0$ if and only if $x = y$, for all $x, y \in \mathcal{X}$,
then $d$ is called a metric.
Given a general similarity measure, one can easily build a corresponding distance measure as follows.
Proposition 1.
(Distance corresponding to a similarity) If $s : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a similarity, then $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined as follows:
$$d(x, y) = \min\{s(x, x), s(y, y)\} - s(x, y) \qquad (1)$$
is a distance.
Proof. 
We show that $d(\cdot, \cdot)$ defined by Equation (1) satisfies the distance axioms of Definition 2.
From item 3 of Definition 1, it follows that $s(x, x) \ge s(x, y)$ and $s(y, y) \ge s(y, x)$. Using the symmetry of $s(\cdot, \cdot)$ (item 2, Definition 1), we get the non-negativity of $d(\cdot, \cdot)$.
Also, from the symmetry of $s(\cdot, \cdot)$, we obtain $d(x, y) = \min\{s(x, x), s(y, y)\} - s(x, y) = \min\{s(y, y), s(x, x)\} - s(y, x) = d(y, x)$, proving that $d(\cdot, \cdot)$ is symmetric.
Finally, $d(x, x) = \min\{s(x, x), s(x, x)\} - s(x, x) = s(x, x) - s(x, x) = 0$, concluding the proof. □
The similarity function s is very often normalized, i.e., its values lie in the interval $[0, 1]$ and the similarity of each element $x \in \mathcal{X}$ to itself is 1; that is, $s(x, x) = 1$ for all $x \in \mathcal{X}$. In this way, axioms 1 and 3 of Definition 1 are automatically satisfied.
Definition 4.
(Normalized similarity measure) Let $\mathcal{X}$ be a set of objects. A function $s : \mathcal{X} \times \mathcal{X} \to [0, 1]$ is called a normalized similarity measure on $\mathcal{X}$ if it satisfies the following axioms:
1. It is symmetric, i.e., $s(x, y) = s(y, x)$ holds for all $x, y \in \mathcal{X}$.
2. $s(x, x) = 1$ for all $x \in \mathcal{X}$.
It is obvious that a normalized similarity measure is a similarity measure. Moreover, according to Proposition 1, if s is a normalized similarity, then $d(x, y) = 1 - s(x, y)$ is a normalized distance, i.e., $d(x, y) \in [0, 1]$ for all $x, y \in \mathcal{X}$.
In what follows, we focus on similarities and distances for objects described as sets of features, characteristics, or attributes. Moreover, we stick to the simplest Boolean model, i.e., each feature can be present or absent in a given object.
Let us consider the finite universal set $U$. This set can be interpreted as the universal set of characteristics or features that can be possessed by or observed at an object. According to the Boolean model, a feature $u \in U$ can be present or not in a given object or unit of observation. So, we can represent an object by its set of features, denoted as $X \subseteq U$.
We carefully analyze some of the most important and popular normalized similarities proposed in the literature for objects described using the Boolean model. Adhering to the normalized model has the advantage that various similarities can be compared, allowing us to theoretically establish a domain-independent ordering of some of the most well-known normalized similarities.
It is not difficult to see that, according to the notation from Definition 4, the set of objects $\mathcal{X}$ in the Boolean model is actually the power set $2^U$, while the objects themselves correspond to subsets of $U$.
For each set $X \subseteq U$, its complement set is defined as $\overline{X} = U \setminus X$. Moreover, we denote by $|X|$ the cardinality (number of elements) of $X$.
Similarity involves two objects; let us denote them by subsets X and Y of U .
First, observe the following:
i. $\{X \cap Y, X \setminus Y\}$ is a partition of $X$, and symmetrically, $\{X \cap Y, Y \setminus X\}$ is a partition of $Y$.
ii. $\{X \setminus Y, Y \setminus X, X \cap Y\}$ is a partition of $X \cup Y$.
iii. $\{X \setminus Y, Y \setminus X\}$ is a partition of the symmetric difference $X \Delta Y$.
iv. $\{X \setminus Y, Y \setminus X, X \cap Y, \overline{X \cup Y}\}$ is a partition of $U$.
Standard definitions of similarities are given in terms of the cardinality. However, in what follows, we generalize these definitions for arbitrary measures on finite sets.
Definition 5.
(Set measure) A measure on finite sets is a function $\mu : 2^U \to \mathbb{R}$ that satisfies the following axioms:
i. It is positive, i.e., $\mu(X) \ge 0$ for each set $X \subseteq U$.
ii. It is additive for disjoint sets, i.e., $\mu(X \cup Y) = \mu(X) + \mu(Y)$ for disjoint sets $X, Y \subseteq U$.
A set measure μ has some properties that are very useful in the theoretical analysis of similarity measures based on μ .
Proposition 2.
Let $\mu$ be a measure on the finite set $U$. Then, the following hold:
i. The empty set has a null measure, i.e., $\mu(\emptyset) = 0$.
ii. It is monotone, i.e., $\mu(X) \le \mu(Y)$ for $X \subseteq Y$.
iii. $\mu$ is modular, i.e., $\mu(X \cup Y) + \mu(X \cap Y) = \mu(X) + \mu(Y)$.
iv. $\mu$ is non-negative and bounded, i.e., $0 \le \mu(X) \le \mu(U)$ for all $X \subseteq U$.
Proof. 
Observe that from additivity, it follows that $\mu(\emptyset) + \mu(\emptyset) = \mu(\emptyset)$, so $\mu(\emptyset) = 0$.
Considering $X \subseteq Y$, it follows that $\{X, Y \setminus X\}$ is a partition of $Y$. Then, from additivity, $\mu(Y) = \mu(X) + \mu(Y \setminus X)$. Using positivity, $\mu(Y \setminus X) \ge 0$, and it follows that $\mu(X) \le \mu(Y)$, establishing the monotonicity of $\mu(\cdot)$.
Modularity follows from additivity and from partitioning $X$, $Y$, and $X \cup Y$ in terms of $X \setminus Y$, $Y \setminus X$, and $X \cap Y$.
From monotonicity, as $X \subseteq U$, it follows that $\mu(X) \le \mu(U)$ for all $X \subseteq U$. □
Moreover, it is not difficult to observe that measures of finite sets have a general form.
Proposition 3.
$\mu$ is a measure on the finite set $U$ if and only if there exist constants $m_i \ge 0$ for each $i \in U$, such that:
$$\mu(X) = \sum_{i \in X} m_i \text{ for all } X \subseteq U. \qquad (2)$$
Proof. 
It is very easy to observe that if $\mu$ satisfies Equation (2), then $\mu$ satisfies the axioms of Definition 5. Conversely, if $\mu$ satisfies these axioms, then letting $m_i = \mu(\{i\})$ for all $i \in U$, we obtain from the positivity axiom that $m_i \ge 0$ and from the additivity axiom that $\mu$ satisfies Equation (2). □

4. Similarity Measures

The definition of the most important and popular similarity measures is often given in terms of the following notations:
$$a = \mu(X \cap Y) \;\; \text{(attributes common to } X \text{ and } Y\text{)}, \quad b = \mu(X \setminus Y) \;\; \text{(attributes present in } X \text{ only, absent from } Y\text{)}, \quad c = \mu(Y \setminus X) \;\; \text{(attributes present in } Y \text{ only, absent from } X\text{)}, \quad d = \mu(\overline{X \cup Y}) \;\; \text{(attributes absent from both } X \text{ and } Y\text{)}. \qquad (3)$$
Note that, as $\{X \cap Y, X \setminus Y, Y \setminus X, \overline{X \cup Y}\}$ is a partition of $U$ and $\mu(\cdot)$ is a measure, the following hold:
$$\mu(U) = a + b + c + d, \quad \mu(X) = a + b, \quad \mu(Y) = a + c, \quad \mu(X \Delta Y) = b + c, \quad \mu(X \cup Y) = a + b + c.$$
As already pointed out in Section 2, many similarities have been proposed in the literature, with applications in very diverse areas. For a fair discussion and comparison, we are going to restrict our attention to normalized similarities (i.e., those satisfying the axioms of Definition 4).
According to [1], type 1 similarity measures are those that only depend on characteristics present either in $X$ or $Y$ (possibly in both, i.e., on a, b, and c) but are independent of the attributes absent from both objects (i.e., they do not depend on d). Type 2 similarity measures are those that take into account all four quantities derived from the objects, i.e., their intersection, their set differences, and the intersection of their complementary sets (i.e., a, b, c, and d). The major difference from type 1 similarity measures is that for type 2 measures, the size of the universe influences similarity. Consequently, depending on the measure, two objects can be more similar in a smaller universe than in a larger one, as "measured" by $\mu$.
In what follows, we will focus on type 1 similarities, i.e., those that depend solely on $X$ and $Y$, and not on the features found in other objects of the data set, i.e., not on features in $\overline{X \cup Y}$. In other words, we focus on those similarities depending only on a, b, c and not on d. Note that our assumption corresponds to the adoption of the "matching" axiom of similarity measures according to [16].
We are going to provide the defining equations of the similarities in terms of a, b, c and also in terms of $X, Y$. Moreover, all the definitions from the literature consider similarity measures based on set cardinality, $\mu(\cdot) \equiv |\cdot|$. In what follows, we provide more general definitions for arbitrary measures $\mu(\cdot)$.
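To make the notation concrete, the following minimal Python sketch (ours, not from the paper) implements a weighted set measure of the form given in Proposition 3 and computes the quantities a, b, c, d of Equation (3); the function names and sample sets are purely illustrative.

```python
# Illustrative sketch: a weighted set measure (Proposition 3) and the
# quantities a, b, c, d of Equation (3). Names and data are our own.
def make_measure(weights):
    """Return mu(X) = sum of weights m_i over i in X (default weight 1)."""
    return lambda X: sum(weights.get(i, 1) for i in X)

def abcd(X, Y, U, mu):
    a = mu(X & Y)          # attributes common to X and Y
    b = mu(X - Y)          # attributes present in X only
    c = mu(Y - X)          # attributes present in Y only
    d = mu(U - (X | Y))    # attributes absent from both X and Y
    return a, b, c, d

mu = make_measure({5: 2})              # element 5 weighs 2, all others weigh 1
X, Y, U = {1, 2, 3}, {2, 3, 4, 5}, set(range(1, 7))
print(abcd(X, Y, U, mu))               # -> (2, 1, 3, 1)
```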

4.1. Jaccard Similarity

The Jaccard similarity index [2] is defined as follows:
$$J(X, Y) = \frac{\mu(X \cap Y)}{\mu(X \cup Y)} = \frac{a}{a + b + c}. \qquad (5)$$

4.2. Sørensen–Dice Similarity

The Sørensen–Dice similarity index [17,18] is defined as follows:
$$SD(X, Y) = \frac{2 \mu(X \cap Y)}{\mu(X) + \mu(Y)} = \frac{2a}{2a + b + c}. \qquad (6)$$

4.3. Ochiai/Cos Similarity

The Driver and Kroeber [19] or Ochiai index [20] is defined as follows:
$$DKO(X, Y) = \frac{\mu(X \cap Y)}{\sqrt{\mu(X) \mu(Y)}} = \frac{a}{\sqrt{(a + b)(a + c)}}. \qquad (7)$$
Note that if $\mu(\cdot) \equiv |\cdot|$, then the Ochiai index is the same as the cosine of the Boolean vectors representing sets $X$ and $Y$. That is why the Ochiai index is sometimes called cos similarity. Equation (7) can also be seen as an immediate generalization of the cos similarity to arbitrary measures.

4.4. Sorgenfrei Similarity

The Sorgenfrei similarity index [21] is defined as follows:
$$SO(X, Y) = \frac{(\mu(X \cap Y))^2}{\mu(X) \mu(Y)} = \frac{a^2}{(a + b)(a + c)}. \qquad (8)$$
Observe the following:
$$SO(X, Y) = (DKO(X, Y))^2. \qquad (9)$$

4.5. Sokal and Sneath Similarity

The Sokal and Sneath similarity index [22] is defined as follows:
$$SS(X, Y) = \frac{\mu(X \cap Y)}{2 \mu(X \cup Y) - \mu(X \cap Y)} = \frac{a}{a + 2(b + c)}. \qquad (10)$$

4.6. Kulczynski Similarity

The Kulczynski similarity index [23] is defined as follows:
$$KU(X, Y) = \frac{\mu(X \cap Y)}{2} \left( \frac{1}{\mu(X)} + \frac{1}{\mu(Y)} \right) = \frac{a}{2} \left( \frac{1}{a + b} + \frac{1}{a + c} \right). \qquad (11)$$

4.7. Overlap or Szymkiewicz–Simpson Similarity

The overlap or Szymkiewicz–Simpson similarity index appears as formula #27 in Table 2 from [24], as well as in [25], and it is defined as follows:
$$OV(X, Y) = \frac{\mu(X \cap Y)}{\min(\mu(X), \mu(Y))} = \frac{a}{\min(a + b, a + c)}. \qquad (12)$$

4.8. Braun–Blanquet Similarity

The Braun–Blanquet similarity index [26] is defined as follows:
$$BB(X, Y) = \frac{\mu(X \cap Y)}{\max(\mu(X), \mu(Y))} = \frac{a}{\max(a + b, a + c)}. \qquad (13)$$

4.9. Consonni and Todeschini Similarity

The Consonni and Todeschini similarity index [27] is defined as follows:
$$CT(X, Y) = \frac{\log(1 + \mu(X \cap Y))}{\log(1 + \mu(X \cup Y))} = \frac{\log(1 + a)}{\log(1 + a + b + c)}. \qquad (14)$$
Example 1.
Let us consider $n \in \mathbb{N}^*$ and the following two sets:
$$X = \{3, 6, \ldots, 3n\}, \quad Y = \{1, 2, 3, \ldots, 3n - 2, 3n - 1, 3n\}. \qquad (15)$$
(Note that $Y = X \cup Z$ for the set $Z$ introduced below, in Example 3.) Let us consider that all the elements $2, 3, \ldots, 3n$ have measure 1, while element 1 has measure 2. We trivially obtain $a = n$, $b = 0$, and $c = 2n + 1$ using Equation (3) for sets $X$ and $Y$. Computing the similarity measures with Equations (5)–(8) and (10)–(14) for sets $X$ and $Y$, we obtain the results presented in Table 1.
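The entries of Table 1 can be recomputed directly from $a = n$, $b = 0$, $c = 2n + 1$; the short Python sketch below (ours) assumes only the defining Equations (5)–(14).

```python
# Recompute Table 1 from a = n, b = 0, c = 2n + 1 (Example 1); a sketch
# assuming only Equations (5)-(14).
import math

def table1(n):
    a, b, c = n, 0, 2 * n + 1
    return {
        "J":   a / (a + b + c),
        "SD":  2 * a / (2 * a + b + c),
        "DKO": a / math.sqrt((a + b) * (a + c)),
        "SO":  a ** 2 / ((a + b) * (a + c)),
        "SS":  a / (a + 2 * (b + c)),
        "KU":  (a / 2) * (1 / (a + b) + 1 / (a + c)),
        "OV":  a / min(a + b, a + c),
        "BB":  a / max(a + b, a + c),
        "CT":  math.log(1 + a) / math.log(1 + a + b + c),
    }

print(table1(1))   # J = BB = SO = 0.25, SD = 0.4, DKO = 0.5, KU = 0.625, ...
```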

5. Generalized Kulczynski Similarity

5.1. Definition and Basic Properties

Sørensen–Dice (6), Ochiai (7), Kulczynski (11), overlap (12), and Braun–Blanquet (13) similarities can be generalized to a unique similarity measure parameterized by a suitable mean function.
Definition 6.
(Mean function) A function $f : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ is called a mean function if it satisfies the following conditions:
i. Boundedness: $\min(x, y) \le f(x, y) \le \max(x, y)$ for all $x, y \in \mathbb{R}_+$.
ii. Symmetry: $f(x, y) = f(y, x)$ for all $x, y \in \mathbb{R}_+$.
iii. Homogeneity with respect to multiplication: $f(\alpha x, \alpha y) = \alpha f(x, y)$ for all $\alpha, x, y \in \mathbb{R}_+$.
Classical examples of mean functions are as follows [28]:
$$\text{Arithmetic mean: } AM(x, y) = \frac{x + y}{2}, \quad \text{Geometric mean: } GM(x, y) = \sqrt{xy}, \quad \text{Harmonic mean: } HM(x, y) = \frac{2xy}{x + y}, \quad \text{Power mean: } PM_p(x, y) = \left( \frac{x^p + y^p}{2} \right)^{1/p} \text{ for } p \in \mathbb{R}^*, \quad \text{Logarithmic mean: } LM(x, y) = \frac{y - x}{\log(y) - \log(x)}. \qquad (16)$$
Trivial examples of mean functions are max ( x , y ) and min ( x , y ) .
Note that from the first point of Definition 6, it follows that $x = \min(x, x) \le f(x, x) \le \max(x, x) = x$, so $f(x, x) = x$ for all $x \in \mathbb{R}_+$, i.e., f is idempotent.
We can now define the generalized Kulczynski similarity as follows:
Definition 7.
Let f be an arbitrary mean function. Generalized Kulczynski similarity is defined by the following:
$$GKU_f(X, Y) = \frac{\mu(X \cap Y)}{f(\mu(X), \mu(Y))} = \frac{a}{f(a + b, a + c)}. \qquad (17)$$
Note that the computational overhead of $GKU_f$ compared to the specific similarities (6), (7), (11), (12), and (13) is given solely by the computation of the mean function f.
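As a minimal illustration (ours, under the stated definitions), generalized Kulczynski similarity can be coded directly from Equation (17), with the mean functions of Equation (16) passed as parameters:

```python
# Generalized Kulczynski similarity GKU_f = a / f(a+b, a+c) (Definition 7);
# a sketch using the classical means of Equation (16). Names are illustrative.
import math

MEANS = {
    "AM":  lambda x, y: (x + y) / 2,
    "GM":  lambda x, y: math.sqrt(x * y),
    "HM":  lambda x, y: 2 * x * y / (x + y),
    "min": min,
    "max": max,
}

def gku(a, b, c, f):
    return a / f(a + b, a + c)

# Example 1 with n = 1 (a = 1, b = 0, c = 3) recovers SD, DKO, KU, OV, BB:
for name, f in MEANS.items():
    print(name, gku(1, 0, 3, f))   # AM: 0.4, GM: 0.5, HM: 0.625, min: 1.0, max: 0.25
```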
Proposition 4.
i. Generalized Kulczynski similarity (17) is a generalization of Sørensen–Dice (6), Ochiai (7), Kulczynski (11), overlap (12), and Braun–Blanquet (13) similarities as follows:
$$SD(X, Y) = GKU_{AM}(X, Y), \quad KU(X, Y) = GKU_{HM}(X, Y), \quad DKO(X, Y) = GKU_{GM}(X, Y), \quad OV(X, Y) = GKU_{\min}(X, Y), \quad BB(X, Y) = GKU_{\max}(X, Y). \qquad (18)$$
ii.
$$SO(X, Y) \le BB(X, Y) \le SD(X, Y) \le DKO(X, Y) \le KU(X, Y) \le OV(X, Y). \qquad (19)$$
Proof. 
The first point follows by direct substitution from the defining equations of the similarities.
The second point follows almost entirely (except for the first inequality) from Equation (18) and from the well-known ordering of the classical mean functions [28]:
$$\max(a, b) \ge AM(a, b) \ge GM(a, b) \ge HM(a, b) \ge \min(a, b).$$
We only have to prove the following:
$$SO(X, Y) \le BB(X, Y), \text{ i.e., } a \cdot \max(a + b, a + c) \le (a + b)(a + c).$$
The last inequality follows by observing that $\max(a + b, a + c)$ is either $a + b$ or $a + c$, that $a \le a + c$, and that $a \le a + b$. □
Example 2.
It is not difficult to observe that Inequality (19) is satisfied for sets X and Y from Example 1 by simple algebraic manipulation (first column of Table 1) and simple observation (second and third columns of Table 1).

5.2. Metric Properties

In almost all cases, the distance measure derived from generalized Kulczynski similarity does not satisfy the triangle inequality, with one notable exception: the Braun–Blanquet case. These results are established by the following proposition.
Proposition 5.
Let f be a mean function. The generalized Kulczynski similarity $GKU_f(\cdot, \cdot)$ satisfies the triangle inequality if and only if $GKU_f(\cdot, \cdot) \equiv BB(\cdot, \cdot)$.
Proof. 
The triangle inequality for the distance measure $d(X, Y) = 1 - s(X, Y)$ is stated as follows:
$$1 + s(X, Z) \ge s(X, Y) + s(Y, Z) \text{ for all } X, Y, Z \subseteq U. \qquad (21)$$
To prove the direct implication, we substitute an appropriate triple $X, Y, Z$ in (21) for $s(\cdot, \cdot) = GKU_f(\cdot, \cdot)$. Let us consider two arbitrary positive reals $a < b$, and let us denote $r = b/a - 1 > 0$. Let us consider a finite set $Y$ that is partitioned into two disjoint sets $X$ and $Z$, such that $\mu(Z) = r \mu(X)$, and let us denote $x = \mu(X)$. Substituting $X, Y, Z$ in $s = GKU_f$ and noting that $GKU_f(X, Z) = 0$, inequality (21) becomes the following:
$$1 + 0 \ge \frac{x}{f(x, (1 + r)x)} + \frac{rx}{f(rx, (1 + r)x)} = \frac{1}{f(1, 1 + r)} + \frac{r}{f(r, 1 + r)} \ge \frac{1}{1 + r} + \frac{r}{1 + r} = 1,$$
where the middle equality uses the homogeneity of f and the last inequality uses $f \le \max$. Since equality must therefore hold throughout, we conclude that $f(1, 1 + r) = 1 + r$.
It follows that $f(a, b) = a f(1, b/a) = a f(1, 1 + r) = a(1 + r) = b = \max(a, b)$.
To prove the reverse implication, we slightly adapt Gilbert’s proof of metricity of the Jaccard index from [4].
The distance is defined as follows:
$$d_{BB}(X, Y) = 1 - BB(X, Y) = 1 - \frac{\mu(X \cap Y)}{\max(\mu(X), \mu(Y))}.$$
The proof uses the notation of the sets introduced in Figure 1: $X', Y', Z'$ are the parts of $X, Y, Z$ contained in exactly one of the three sets, $X'' = (Y \cap Z) \setminus X$, $Y'' = (X \cap Z) \setminus Y$, $Z'' = (X \cap Y) \setminus Z$, and $V = X \cap Y \cap Z$. Let us denote $M = \max(\mu(X), \mu(Y), \mu(Z))$. Then, it is not difficult to observe the following:
$$d_{BB}(X, Z) \le 1 - \frac{\mu(V)}{M} = \frac{M - \mu(V)}{M} = \frac{\max(\mu(X') + \mu(Y'') + \mu(Z''), \; \mu(Y') + \mu(X'') + \mu(Z''), \; \mu(Z') + \mu(X'') + \mu(Y''))}{M},$$
$$d_{BB}(X, Y) = \frac{\max(\mu(X \setminus Y), \mu(Y \setminus X))}{\max(\mu(X), \mu(Y))} \ge \frac{\max(\mu(X') + \mu(Y''), \; \mu(Y') + \mu(X''))}{M},$$
$$\text{and, by analogy, } d_{BB}(Y, Z) \ge \frac{\max(\mu(Y') + \mu(Z''), \; \mu(Z') + \mu(Y''))}{M}. \qquad (23)$$
Then, adding the last two inequalities of (23), applying max ( a , b ) + max ( c , d ) = max ( a + c , a + d , b + c , b + d ) , and relating that to the first inequality of (23), we obtain the triangle inequality for d B B . □
Example 3.
Let us consider the following set:
$$Z = \{1, 2, 4, 5, \ldots, 3n - 2, 3n - 1\},$$
together with sets $X$ and $Y$ from Example 1. Observe that $\{X, Z\}$ is a partition of $Y$. It follows that
$$GKU_f(X, Z) = 0, \quad GKU_f(X, Y) = \frac{n}{f(n, 3n + 1)}, \quad GKU_f(Y, Z) = \frac{2n + 1}{f(2n + 1, 3n + 1)}.$$
Now, for a mean function $f \in \{AM, HM, GM, \min\}$, clearly $f(n, 3n + 1) < 3n + 1$ and $f(2n + 1, 3n + 1) < 3n + 1$. This shows the following:
$$GKU_f(X, Y) + GKU_f(Y, Z) > \frac{n}{3n + 1} + \frac{2n + 1}{3n + 1} = 1 = 1 + GKU_f(X, Z),$$
thus contradicting (21).
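A quick numeric check of this example (a sketch of ours, with n = 2) confirms that Inequality (21) fails for every classical mean except max, in line with Proposition 5:

```python
# Numeric companion to Example 3 (n = 2): 1 - GKU_f violates the triangle
# inequality for f in {AM, GM, HM, min} and satisfies it for f = max (BB).
import math

n = 2
mu_X, mu_Z, mu_Y = n, 2 * n + 1, 3 * n + 1        # {X, Z} partitions Y
for name, f in [("AM", lambda x, y: (x + y) / 2),
                ("GM", lambda x, y: math.sqrt(x * y)),
                ("HM", lambda x, y: 2 * x * y / (x + y)),
                ("min", min), ("max", max)]:
    s_xy = mu_X / f(mu_X, mu_Y)                   # GKU_f(X, Y), as X n Y = X
    s_yz = mu_Z / f(mu_Z, mu_Y)                   # GKU_f(Y, Z), as Y n Z = Z
    print(name, "satisfies (21):", 1 + 0 >= s_xy + s_yz)   # True only for max
```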

6. Generalization of Jaccard Similarity

6.1. Definition and Basic Properties

Following the introduction and argumentation of a set of axioms, the ratio model of normalized similarity known as the Tversky index is proposed in [16], according to the following equation with parameters $\alpha, \beta \ge 0$:
$$TV_{\alpha, \beta}(X, Y) = \frac{\mu(X \cap Y)}{\mu(X \cap Y) + \alpha \mu(X \setminus Y) + \beta \mu(Y \setminus X)}.$$
It is not difficult to observe that $TV_{\alpha, \beta}(X, Y)$ is generally asymmetric if $\alpha \ne \beta$. Moreover, the symmetric version for $\alpha = \beta$ is a generalization of Jaccard similarity with the same computational complexity as the standard Jaccard index.
Proposition 6.
Let us define the following:
$$J_\alpha(X, Y) = TV_{\alpha, \alpha}(X, Y) = \frac{\mu(X \cap Y)}{\mu(X \cap Y) + \alpha (\mu(X \setminus Y) + \mu(Y \setminus X))} \text{ for } \alpha > 0. \qquad (26)$$
Then,
i. 
$J_\alpha(X, Y)$ is a generalization of Jaccard (5), Sørensen–Dice (6), and Sokal and Sneath (10) similarities.
ii. 
$J_\beta(X, Y) \ge J_\alpha(X, Y)$ if $\alpha > \beta > 0$.
Proof. 
Observe the following:
$$J_\alpha(X, Y) = \frac{a}{a + \alpha(b + c)}, \qquad (27)$$
which clearly shows that
$$J(X, Y) = J_1(X, Y), \quad SD(X, Y) = J_{1/2}(X, Y), \quad SS(X, Y) = J_2(X, Y), \qquad (28)$$
thus concluding the first point of the proposition.
If $\alpha > \beta > 0$, then the denominator of $J_\alpha(X, Y)$ is at least as large as the denominator of $J_\beta(X, Y)$, thus concluding the second point of the proposition. □
A corollary of Proposition 6 is the relative ordering of Jaccard (5), Sørensen–Dice (6), and Sokal and Sneath (10) similarities:
$$SS(X, Y) \le J(X, Y) \le SD(X, Y). \qquad (29)$$
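The symmetric Tversky index of Proposition 6 is a one-line computation; the sketch below (ours) recovers SS, J, and SD as the cases $\alpha = 2, 1, 1/2$ and makes the ordering (29) visible on sample data.

```python
# The symmetric Tversky index J_alpha = a / (a + alpha (b + c)), Equation (27);
# alpha = 2, 1, 1/2 yield SS, J, SD. A sketch with the data of Example 1, n = 1.
def j_alpha(a, b, c, alpha):
    return a / (a + alpha * (b + c))

a, b, c = 1, 0, 3
print(j_alpha(a, b, c, 2))     # SS ~ 0.143
print(j_alpha(a, b, c, 1))     # J  = 0.250
print(j_alpha(a, b, c, 0.5))   # SD = 0.400, so SS <= J <= SD, cf. (29)
```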
Example 4.
It is not difficult to observe that Inequality (29) is satisfied for sets X and Y from Example 1 by simple algebraic manipulation (first column of Table 1) and simple observation (second and third columns of Table 1).
Proposition 7.
i. For any mean function f and $\alpha \ge 1$,
$$J_\alpha(X, Y) \le GKU_f(X, Y). \qquad (30)$$
ii. For any mean function f such that $f(x, y) \ge AM(x, y)$,
$$J_\alpha(X, Y) \ge GKU_f(X, Y) \qquad (31)$$
if and only if $\alpha \le 1/2$.
Proof. 
i.
Inequality (30) is equivalent to the following:
$$a + b + c + (\alpha - 1)(b + c) \ge f(a + b, a + c).$$
This follows from the facts that $f(a + b, a + c) \le \max(a + b, a + c)$, that $\alpha - 1 \ge 0$, and that $a + b + c$ is greater than or equal to both $a + b$ and $a + c$.
ii.
Let us assume that Inequality (31) holds. It follows that
$$a + \alpha(b + c) \le f(a + b, a + c).$$
Substituting $b = c > 0$ and noticing that $f(a + b, a + b) = a + b$, we obtain $\alpha \le 1/2$.
Conversely, Inequality (31) is equivalent to the following:
$$\frac{(a + b) + (a + c)}{2} + \left( \alpha - \frac{1}{2} \right)(b + c) = AM(a + b, a + c) + \left( \alpha - \frac{1}{2} \right)(b + c) \le f(a + b, a + c),$$
which follows from $f(x, y) \ge AM(x, y)$ and $\alpha \le 1/2$. □
Example 5.
It is not difficult to observe that Inequality (30) is satisfied for sets $X$ and $Y$ from Example 1, $\alpha \in \{1, 2\}$ (producing similarity measures $J$ and $SS$), and $f \in \{AM, HM, GM, \min, \max\}$ (producing similarity measures $SD$, $KU$, $DKO$, $OV$, and $BB$) by simple algebraic manipulation (first column of Table 1) and simple observation (second and third columns of Table 1).
Moreover, it is not difficult to observe that Inequality (31) is satisfied for sets $X$ and $Y$ from Example 1, $\alpha = 1/2$ (producing similarity measure $SD$), and $f \in \{AM, HM, GM, \min\}$ (producing similarity measures $SD$, $KU$, $DKO$, and $OV$) by simple algebraic manipulation (first column of Table 1) and simple observation (second and third columns of Table 1).

6.2. Metric Properties

The last part of this section is dedicated to a result concerning the metricity of generalized Jaccard similarity.
Proposition 8.
Generalized Jaccard similarity (26) satisfies the triangle inequality if and only if $\alpha \ge 1$.
Proof. 
For the direct implication, we consider a finite set $Y$ partitioned into two disjoint subsets $X$ and $Z$ of equal measure, denoted by n. Applying inequality (21) for $s(\cdot, \cdot) = J_\alpha(\cdot, \cdot)$, we obtain the following:
$$1 + 0 \ge \frac{n}{n + \alpha n} + \frac{n}{n + \alpha n} = \frac{2}{1 + \alpha}, \text{ so } \alpha \ge 1.$$
For the reverse implication, we adapt Gilbert’s proof of metricity of Jaccard similarity from [4]. Referring to Figure 1, we define the following:
$$X_1 = X' \cup X'', \quad Y_1 = Y' \cup Y'', \quad Z_1 = Z' \cup Z'', \quad T = X \cup Y \cup Z. \qquad (32)$$
Let us also introduce the following:
$$\mu(X \Delta Y) + \mu(Y \Delta Z) + \mu(X \Delta Z) = D, \quad \mu(X_1) + \mu(Y_1) + \mu(Z_1) = \frac{D}{2}.$$
Now, observe that the measure of the symmetric difference satisfies the triangle inequality, that $\mu(X \Delta Y)$ has an upper bound of $D/2$, and that $\mu(X \Delta Y) + \mu(Y \Delta Z)$ has a lower bound of $D/2$, as shown below.
$$\mu(X \Delta Y) = \mu(X_1) + \mu(Y_1),$$
$$\mu(X \Delta Y) + \mu(Y \Delta Z) = \mu(X_1) + 2\mu(Y_1) + \mu(Z_1) \ge \mu(X_1) + \mu(Z_1) = \mu(X \Delta Z),$$
$$2\mu(X \Delta Y) \le \mu(X \Delta Y) + \mu(X \Delta Z) + \mu(Z \Delta Y) = D, \text{ so } \mu(X \Delta Y) \le \frac{D}{2},$$
$$2(\mu(X \Delta Y) + \mu(Y \Delta Z)) \ge \mu(X \Delta Y) + \mu(Y \Delta Z) + \mu(X \Delta Z) = D, \text{ so } \mu(X \Delta Y) + \mu(Y \Delta Z) \ge \frac{D}{2}.$$
Also, observe that if $\alpha \ge 1$, denoting $\theta = \alpha - 1 \ge 0$, we can rewrite generalized Jaccard similarity as follows:
$$J_\alpha(X, Y) = \frac{\mu(X \cap Y)}{\mu(X \cup Y) + \theta \mu(X \Delta Y)}.$$
We obtain the following:
$$d_{J_\alpha}(X, Z) = 1 - \frac{\mu(X \cap Z)}{\mu(X \cup Z) + \theta \mu(X \Delta Z)} \le 1 - \frac{\mu(V)}{\mu(T) + \theta \frac{D}{2}} = \frac{\mu(X_1) + \mu(Y_1) + \mu(Z_1) + \theta \frac{D}{2}}{\mu(T) + \theta \frac{D}{2}} = \frac{D(1 + \theta)}{2\mu(T) + \theta D},$$
$$d_{J_\alpha}(X, Y) = \frac{\mu(X_1) + \mu(Y_1) + \theta \mu(X \Delta Y)}{\mu(X \cup Y) + \theta \mu(X \Delta Y)} \ge \frac{\mu(X_1) + \mu(Y_1) + \theta \mu(X \Delta Y)}{\mu(T) + \theta \frac{D}{2}},$$
$$\text{and, by analogy, } d_{J_\alpha}(Y, Z) \ge \frac{\mu(Y_1) + \mu(Z_1) + \theta \mu(Y \Delta Z)}{\mu(T) + \theta \frac{D}{2}}.$$
Adding the last two inequalities and using the two bounds established above, we obtain
$$d_{J_\alpha}(X, Y) + d_{J_\alpha}(Y, Z) \ge \frac{\frac{D}{2} + \theta \frac{D}{2}}{\mu(T) + \theta \frac{D}{2}} = \frac{D(1 + \theta)}{2\mu(T) + \theta D} \ge d_{J_\alpha}(X, Z),$$
thus concluding the proof. □
Example 6.
We check the triangle inequality for $J_\alpha$ when $\alpha = 1/2$ (i.e., for $SD$) and when $\alpha = 2$ (i.e., for $SS$). We consider sets $X$ and $Y$ from Example 1 and set $Z$ from Example 3.
As $X \cap Z = \emptyset$, it follows that $SD(X, Z) = SS(X, Z) = 0$.
$$SD(X, Y) + SD(Y, Z) = \frac{2n}{4n + 1} + \frac{4n + 2}{5n + 2} > \frac{2n}{5n + 2} + \frac{4n + 2}{5n + 2} = \frac{6n + 2}{5n + 2} > 1 = 1 + SD(X, Z),$$
thus contradicting (21), i.e., $J_{1/2} = SD$ does not satisfy the triangle inequality.
$$SS(X, Y) + SS(Y, Z) = \frac{n}{5n + 2} + \frac{2n + 1}{4n + 1} < \frac{n}{4n + 1} + \frac{2n + 1}{4n + 1} = \frac{3n + 1}{4n + 1} < 1 = 1 + SS(X, Z),$$
thus being consistent with (21), i.e., with $J_2 = SS$ satisfying the triangle inequality.
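These two computations can be replayed numerically; the sketch below (ours, for n = 2) evaluates both sides of Inequality (21) for SD and SS on the triple X, Y, Z.

```python
# Numeric companion to Example 6 (n = 2): SD violates Inequality (21) on the
# triple (X, Y, Z), while SS satisfies it; SD(X, Z) = SS(X, Z) = 0.
n = 2
sd_xy, sd_yz = 2 * n / (4 * n + 1), (4 * n + 2) / (5 * n + 2)
ss_xy, ss_yz = n / (5 * n + 2), (2 * n + 1) / (4 * n + 1)
print("SD satisfies (21):", 1 + 0 >= sd_xy + sd_yz)   # False: 0.444 + 0.833 > 1
print("SS satisfies (21):", 1 + 0 >= ss_xy + ss_yz)   # True:  0.167 + 0.556 < 1
```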

7. Generalization of Consonni and Todeschini Similarity

Consonni and Todeschini similarity [27] (see Equation (14)) can be generalized by replacing log with an appropriately chosen real function f as follows:
$$GCT_f(X, Y) = \frac{f(\mu(X \cap Y))}{f(\mu(X \cup Y))} = \frac{f(a)}{f(a + b + c)}. \qquad (36)$$
Note that the computational overhead of $GCT_f$ compared to the specific similarity (14) is given solely by the computation of the function f.
For a proper similarity, we impose the condition f ( 0 ) = 0 . Moreover, we shall assume in what follows that f is defined only for non-negative values, and it is non-decreasing, which are natural assumptions. It follows immediately that f is non-negative.
It is not difficult to observe that standard Consonni and Todeschini similarity is obtained for $f(x) = \log(1 + x)$, i.e., $CT(\cdot, \cdot) = GCT_{\log(1 + \cdot)}(\cdot, \cdot)$.
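A minimal sketch (ours) of Equation (36) follows; the choice $f = \log(1 + x)$ recovers standard CT, while any other non-decreasing f with $f(0) = 0$ gives a different member of the family.

```python
# Generalized Consonni and Todeschini similarity GCT_f = f(a) / f(a+b+c),
# Equation (36); a sketch for non-decreasing f with f(0) = 0.
import math

def gct(a, b, c, f):
    return f(a) / f(a + b + c)

a, b, c = 1, 0, 3                             # Example 1 with n = 1
print(gct(a, b, c, math.log1p))               # CT = log 2 / log 5 ~ 0.430
print(gct(a, b, c, lambda x: x ** 2))         # f(x) = x^2 gives 1/16 = 0.0625
```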
In what follows, we analyze the metric properties of generalized Consonni and Todeschini similarity. We start by formulating a necessary condition on f to ensure that G C T f ( · , · ) satisfies triangle inequality.
Proposition 9.
If $GCT_f(\cdot, \cdot)$ satisfies the triangle inequality, then f is supermodular, i.e.,
$$f(x + y + v) + f(v) \ge f(x + v) + f(y + v) \text{ for all } x, y, v \in \mathbb{R}_+. \qquad (37)$$
Proof. 
Let us consider three sets $X$, $Y$, and $Z = X \cup Y$, such that $\mu(X \cap Y) = v$, $\mu(X \setminus Y) = x$, and $\mu(Y \setminus X) = y$. It follows that $\mu(X) = x + v$ and $\mu(Y) = y + v$. Applying Inequality (21), we obtain the following:
$$1 + \frac{f(v)}{f(x + y + v)} \ge \frac{f(x + v)}{f(x + y + v)} + \frac{f(y + v)}{f(x + y + v)},$$
from which Inequality (37) follows immediately. □
It is not difficult to verify that the function $\log(1 + x)$ does not satisfy the supermodularity condition (37), clearly showing that standard Consonni and Todeschini similarity is not a metric.
Example 7.
We check the triangle inequality for $CT$. Let us consider sets $X$ and $Y$ from Example 1 and set $Z$ from Example 3.
As $X \cap Z = \emptyset$, it follows that $CT(X, Z) = 0$.
$$CT(X, Y) + CT(Y, Z) = \frac{\log(n + 1)}{\log(3n + 2)} + \frac{\log(2n + 2)}{\log(3n + 2)} = \frac{\log(2n^2 + 4n + 2)}{\log(3n + 2)} > \frac{\log(3n + 2)}{\log(3n + 2)} = 1 = 1 + CT(X, Z),$$
thus contradicting (21), i.e., C T does not satisfy triangle inequality.
Actually, it is not difficult to observe the following:
(i)
If f is strictly concave, then it is strictly submodular, i.e., it satisfies Inequality (37) in the opposite direction with strict inequality. So, according to Proposition 9, the corresponding generalized Consonni and Todeschini similarity is not a metric.
(ii)
If f is convex, then it is supermodular. In this case, the applicability of Proposition 10 can be investigated, as discussed in the following paragraphs.
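The dichotomy in (i) and (ii) above is easy to probe numerically; the sketch below (ours) tests the supermodularity condition (37) at a sample point for the concave $f(x) = \log(1 + x)$ and the convex $f(x) = x^2$.

```python
# Testing the supermodularity condition (37), f(x+y+v) + f(v) >= f(x+v) + f(y+v),
# at a sample point: it fails for the concave log(1+x) and holds for x^2.
import math

def supermodular_at(f, x, y, v):
    return f(x + y + v) + f(v) >= f(x + v) + f(y + v)

print(supermodular_at(math.log1p, 1, 1, 0))        # False: log 3 < 2 log 2
print(supermodular_at(lambda t: t ** 2, 1, 1, 0))  # True:  4 + 0 >= 1 + 1
```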
Our next result is the formulation of a sufficient condition on f that guarantees the corresponding generalized Consonni and Todeschini similarity (36) is a metric, i.e., it satisfies triangle inequality.
Proposition 10.
If f is differentiable and supermodular and log f is concave, then generalized Consonni and Todeschini similarity, defined by Equation (36), satisfies triangle inequality (21).
Proof. 
With reference to Figure 1 and Equation (32), let us denote with lowercase letters $x, y, z, x', y', z', x'', y'', z'', x_1, y_1, z_1, t, v$ the measures of the sets $X, Y, Z, X', Y', Z', X'', Y'', Z'', X_1, Y_1, Z_1, T, V$.
Substituting (36) into Equation (21), we obtain the following:
$$1 + \frac{f(v + y'')}{f(t - y')} \ge \frac{f(v + z'')}{f(t - z')} + \frac{f(v + x'')}{f(t - x')}. \qquad (38)$$
In order to prove Inequality (38), we need the following.
Lemma 1.
If f is differentiable, then for each $\alpha \ge 0$, the function g defined as $g(x) = f(x)/f(x + \alpha)$ is non-decreasing.
Proof. 
$$g'(x) = \frac{f'(x) f(x + \alpha) - f(x) f'(x + \alpha)}{f^2(x + \alpha)} = \frac{f(x)}{f(x + \alpha)} \left( \frac{f'(x)}{f(x)} - \frac{f'(x + \alpha)}{f(x + \alpha)} \right) = \frac{f(x)}{f(x + \alpha)} \left( (\log f)'(x) - (\log f)'(x + \alpha) \right) \ge 0,$$
as $\log f$ is concave and therefore $(\log f)'$ is decreasing.
As $g'(x) \ge 0$, we conclude that g is non-decreasing. □
Lemma 1 implies that if $\alpha \ge 0$ and $b \ge a \ge 0$, then the following inequality holds:
$$\frac{f(a)}{f(b)} \le \frac{f(a + \alpha)}{f(b + \alpha)}. \qquad (39)$$
Applying (39), we obtain the following:
$$\frac{f(v + z'')}{f(t - z')} \le \frac{f(v + z'' + z')}{f(t)} = \frac{f(v + z_1)}{f(t)}, \quad \text{and similarly} \quad \frac{f(v + x'')}{f(t - x')} \le \frac{f(v + x_1)}{f(t)}.$$
Moreover, since f is non-decreasing,
$$1 + \frac{f(v + y'')}{f(t - y')} \ge 1 + \frac{f(v)}{f(t)} = 1 + \frac{f(v)}{f(v + x_1 + y_1 + z_1)}.$$
Adding the first two inequalities, it is enough to show that:
$$f(v + x_1 + y_1 + z_1) + f(v) \ge f(v + x_1) + f(v + z_1).$$
But this follows from $f(v + x_1 + y_1 + z_1) \ge f(v + x_1 + z_1)$ and from the supermodularity condition (37), thus concluding the proof. □
Examples of functions satisfying the requirements of Proposition 10 are as follows:
i. $x^p$ with $p \ge 1$.
ii. $e^x - 1$.
iii. $x \log x$, extended by continuity with the value 0 at $x = 0$.
Examples of functions that are not supermodular, so that the corresponding generalized Consonni and Todeschini similarity fails to satisfy the triangle inequality by Proposition 9, are as follows:
i. $x^p$ with $p < 1$.
ii. $\log(x + 1)$.

8. Conclusions

The results reported in this paper shed light on some theoretical properties of the most popular normalized similarity/distance measures for Boolean models that only depend on the characteristics present in compared objects. The new theoretical results covered the following three aspects: generalization of three classes of measures described as generalized Kulczynski, generalized Jaccard, and generalized Consonni and Todeschini measures; theoretical ordering of the similarity measures inside each class, as well as between classes; positive and negative results regarding the metric properties of the measures related to satisfying or not satisfying the triangle inequality axiom. In future work, we foresee at least two possibilities of continuing this theoretical research: by expanding the applicability of our methodology to other classes of similarity measures, possibly in the context of richer data models, like multisets, numerical, probabilistic, or fuzzy, as well as by investigating more abstract frameworks provided by lattice theory or other axiomatic approaches. On the experimental side, it would be interesting to investigate the experimental comparison of similarity measures and the visualization of results on sample datasets of various sizes.

Author Contributions

All authors contributed equally to each section of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lesot, M.-J.; Rifqi, M.; Benhadda, H. Similarity measures for binary and numerical data: A survey. Int. J. Knowl. Eng. Soft Data Paradig. 2009, 1, 63–84.
2. Jaccard, P. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bull. Soc. Vaudoise Sci. Nat. 1901, 37, 547–579.
3. Lewandowsky, M.; Winter, D. Distance between Sets. Nature 1971, 234, 34–35.
4. Gilbert, G. Distance between Sets. Nature 1972, 239, 174.
5. Kosub, S. A note on the triangle inequality for the Jaccard distance. Pattern Recognit. Lett. 2019, 120, 36–38.
6. Gerasimou, G. Characterization of the Jaccard dissimilarity metric and a generalization. Discret. Appl. Math. 2024, 355, 57–61.
7. Cheetham, A.H.; Hazel, J.E. Binary (presence-absence) similarity coefficients. J. Paleontol. 1969, 43, 1130–1136.
8. Hubálek, Z. Coefficients of Association and Similarity, Based on Binary (Presence-Absence) Data: An Evaluation. Biol. Rev. 1982, 57, 669–689.
9. Brusco, M.; Cradit, J.D.; Steinley, D. A comparison of 71 binary similarity coefficients: The effect of base rates. PLoS ONE 2021, 16, e0247751.
10. Rajski, C. A metric space of discrete probability distributions. Inf. Control 1961, 4, 371–377.
11. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
12. Simovici, D.A. On Submodular and Supermodular Functions on Lattices and Related Structures. In Proceedings of the 2014 IEEE 44th International Symposium on Multiple-Valued Logic, Bremen, Germany, 19–21 May 2014; pp. 202–207.
13. Moulton, R.; Jiang, Y. Maximally Consistent Sampling and the Jaccard Index of Probability Distributions. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 347–356.
14. Berman, M.; Blaschko, M.B. Supermodular Locality Sensitive Hashes. arXiv 2018, arXiv:1807.06686v1.
15. Deza, M.M.; Deza, E. Encyclopedia of Distances, 4th ed.; Springer: Berlin/Heidelberg, Germany, 2016.
16. Tversky, A. Features of Similarity. Psychol. Rev. 1977, 84, 327–352.
17. Dice, L.R. Measures of the Amount of Ecologic Association Between Species. Ecology 1945, 26, 297–302.
18. Sørensen, T. A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species Content and Its Application to Analyses of the Vegetation on Danish Commons. Biol. Skr. Dan. Vidensk. Selsk. 1948, 5, 1–34.
19. Driver, E.S.; Kroeber, A.L. Quantitative Expression of Cultural Relationships. Univ. Calif. Publ. Am. Archaeol. Ethnol. 1932, 31, 211–256.
20. Ochiai, A. Zoogeographical Studies on the Soleoid Fishes Found in Japan and its Neighbouring Regions-III. Nippon Suisan Gakkaishi 1957, 22, 522–525.
21. Sorgenfrei, T. Molluscan assemblages from the marine middle Miocene of South Jutland and their environments. Den. Geol. Undersoegelse Ser. II 1958, 79, 356–503.
22. Sokal, R.R.; Sneath, P.H.A. Principles of Numerical Taxonomy; W. H. Freeman and Co.: San Francisco, CA, USA; London, UK, 1963.
23. Kulczynski, S. Die Pflanzenassoziationen der Pieninen. Bull. Int. l'Académie Pol. Sci. Lett. Classe Sci. Math. Nat. B (Sci. Nat.) 1927, 57–203.
24. McGill, M. An Evaluation of Factors Affecting Document Ranking by Information Retrieval Systems. ERIC Inst. Educ. Sci. 1979, ED188587.
25. Simpson, G.G. Notes on the Measurement of Faunal Resemblance. Am. J. Sci. 1960, 258-A, 300–311.
26. Braun-Blanquet, J. Zur Wertung der Gesellschaftstreue in der Pflanzensoziologie. Vierteljahrsschr. Naturf. Ges. Zürich 1925, 70, 12–149.
27. Todeschini, R.; Consonni, V.; Xiang, H.; Holliday, J.; Buscema, M.; Willett, P. Similarity Coefficients for Binary Chemoinformatics Data: Overview and Extended Comparison Using Simulated and Real Data Sets. J. Chem. Inf. Model. 2012, 52, 2884–2901.
28. Beliakov, G.; Bustince Sola, H.; Calvo Sánchez, T. Classical Averaging Functions. In A Practical Guide to Averaging Functions; Stud. Fuzziness Soft Comput. 2016, 329, 55–99.
Figure 1. Three sets and their intersections.
Table 1. Similarity values for sets X and Y from Example 1.
General Values for $n \in \mathbb{N}^*$ | Values for $n = 1$ | Values for $n = 2$
$J = BB = SO = \frac{n}{3n + 1}$ | 0.250 | 0.285
$SD = \frac{2n}{4n + 1}$ | 0.400 | 0.444
$DKO = \sqrt{\frac{n}{3n + 1}}$ | 0.500 | 0.534
$SS = \frac{n}{5n + 2}$ | 0.142 | 0.166
$KU = \frac{4n + 1}{6n + 2}$ | 0.625 | 0.642
$OV = 1$ | 1.000 | 1.000
$CT = \frac{\log(n + 1)}{\log(3n + 2)}$ | 0.430 | 0.528