Article

On the Depth of Decision Trees with Hypotheses

Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
Entropy 2022, 24(1), 116; https://doi.org/10.3390/e24010116
Submission received: 20 October 2021 / Revised: 6 January 2022 / Accepted: 10 January 2022 / Published: 12 January 2022
(This article belongs to the Special Issue Rough Set Theory and Entropy in Information Science)

Abstract

In this paper, based on results of rough set theory, test theory, and exact learning, we investigate decision trees over infinite sets of binary attributes represented as infinite binary information systems. We define the notion of a problem over an information system and study three Shannon-type functions, which characterize, in the worst case, how the minimum depth of a decision tree solving a problem depends on the number of attributes in the problem description. The three functions correspond to (i) decision trees using attributes, (ii) decision trees using hypotheses (an analog of equivalence queries from exact learning), and (iii) decision trees using both attributes and hypotheses. The first function has two possible types of behavior: logarithmic and linear (this result follows from more general results published by the author earlier). The second and third functions have three possible types of behavior: constant, logarithmic, and linear (these results were published by the author earlier without proofs, which are given in the present paper). Based on the obtained results, we divide the set of all infinite binary information systems into four complexity classes. Within each class, the type of behavior of each of the three functions does not change.

1. Introduction

Decision trees are studied in different areas of computer science, in particular in exact learning [1], rough set theory [2,3,4], and test theory [5]. In some sense, these theories deal with dual objects: for example, membership queries from exact learning correspond to attributes from test theory and rough set theory. In contrast to test theory and rough set theory, in exact learning, besides membership queries, equivalence queries are also considered.
We extend the model considered in test theory and rough set theory by adding the notion of a hypothesis, an analog of the equivalence queries used in exact learning. Papers [6,7,8,9,10] are devoted mainly to the experimental study of decision trees with hypotheses. The present paper contains a theoretical study of the depth of such decision trees.
An infinite binary information system is a pair $U = (A, F)$, where $A$ is an infinite set of elements and $F$ is an infinite set of functions (attributes) from $A$ to $\{0, 1\}$. A problem over $U$ is given by a finite number of attributes $f_1, \ldots, f_n$ from $F$: for $a \in A$, we should find the tuple $(f_1(a), \ldots, f_n(a))$. To solve this problem, we can use decision trees with two types of queries. We can ask about the value of an attribute $f_i \in \{f_1, \ldots, f_n\}$; as a result, we obtain an answer of the kind $f_i(x) = \delta$, where $\delta \in \{0, 1\}$. We can also ask whether a hypothesis $f_1(x) = \delta_1, \ldots, f_n(x) = \delta_n$ is true, where $\delta_1, \ldots, \delta_n \in \{0, 1\}$. Either we obtain a confirmation or a counterexample of the form $f_i(x) = \neg\delta_i$.
The depth of decision trees with hypotheses can be essentially less than the depth of decision trees using only attributes. As an example, we consider the problem of computing the disjunction $x_1 \vee \cdots \vee x_n$. The minimum depth of a decision tree solving this problem using only the attributes $x_1, \ldots, x_n$ is equal to $n$. However, the minimum depth of a decision tree with hypotheses solving this problem is equal to one: it is enough to ask only about the hypothesis $x_1 = 0, \ldots, x_n = 0$. If it is true, then the considered disjunction is equal to zero. Otherwise, it is equal to one.
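To make the query model concrete, here is a minimal Python sketch (illustrative code, not from the paper; the oracle encoding is our own). It hides an element $a \in \{0, 1\}^n$ and answers a hypothesis query with either a confirmation or a counterexample; the all-zeros hypothesis determines the disjunction in a single query.

```python
# Illustrative sketch of the one-query strategy for the disjunction problem.
def hypothesis_query(a, hypothesis):
    """Oracle for a hidden element a: return None if the hypothesis
    {x_1 = d_1, ..., x_n = d_n} holds for a, else a counterexample (i, a[i])."""
    for i, d in enumerate(hypothesis):
        if a[i] != d:
            return (i, a[i])  # counterexample x_i = not d_i
    return None  # confirmation

def disjunction(a):
    """Depth-1 decision tree: ask the hypothesis x_1 = 0, ..., x_n = 0 once."""
    return 0 if hypothesis_query(a, [0] * len(a)) is None else 1

assert disjunction([0, 0, 0]) == 0
assert disjunction([0, 1, 0]) == 1
```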
Based on the results of exact learning, rough set theory, and test theory [1,11,12,13,14,15,16], we study, for an arbitrary infinite binary information system, three Shannon-type functions that characterize the growth in the worst case of the minimum depth of a decision tree solving a problem with the growth of the number of attributes in the problem description. The three functions correspond to the following three cases:
(i)
Only attributes are used in decision trees;
(ii)
Only hypotheses are used in decision trees;
(iii)
Both attributes and hypotheses are used in decision trees.
We show that the first function has two possible types of behavior: logarithmic and linear. The second and third functions have three possible types of behavior: constant, logarithmic, and linear. Bounds for case (i) can be derived from more general results obtained in [15,16]. Results related to cases (ii) and (iii) were presented in the conference paper [17] without proofs. In the present paper, we give complete proofs for cases (ii) and (iii). We also investigate the joint behavior of these three functions and describe four complexity classes of infinite binary information systems; these results are completely new.
The obtained results clarify the difference in time complexity between conventional decision trees, which use only queries based on a single attribute each, and decision trees with hypotheses. Moreover, we now know which combinations of types of behavior of the three Shannon-type functions can occur for an arbitrary infinite binary information system, and we know criteria for each combination.
This paper consists of six sections. In Section 2 and Section 3, we consider the basic notions and main results. Section 4 and Section 5 contain proofs of the main results, and Section 6 gives a short conclusion.

2. Basic Notions

Let $A$ be a set of elements and $F$ be a set of functions from $A$ to $\{0, 1\}$. Functions from $F$ are called attributes, and the pair $U = (A, F)$ is called a binary information system (this notion is close to the notion of an information system proposed by Pawlak [18]). If $A$ and $F$ are infinite sets, then the pair $U = (A, F)$ is called an infinite binary information system.
A problem over $U$ is an arbitrary $n$-tuple $z = (f_1, \ldots, f_n)$, where $n \in \mathbb{N}$, $\mathbb{N}$ is the set of natural numbers $\{1, 2, \ldots\}$, and $f_1, \ldots, f_n \in F$. The problem $z$ may be interpreted as the problem of searching for the tuple $z(a) = (f_1(a), \ldots, f_n(a))$ for an arbitrary $a \in A$. The number $\dim z = n$ is called the dimension of the problem $z$. Denote $F(z) = \{f_1, \ldots, f_n\}$. We denote by $P(U)$ the set of problems over $U$.
A system of equations over $U$ is an arbitrary equation system of the kind
$$\{g_1(x) = \delta_1, \ldots, g_m(x) = \delta_m\},$$
where $m \in \mathbb{N} \cup \{0\}$, $g_1, \ldots, g_m \in F$, and $\delta_1, \ldots, \delta_m \in \{0, 1\}$ (if $m = 0$, then the considered equation system is empty). This equation system is called a system of equations over $z$ if $g_1, \ldots, g_m \in F(z)$. The considered equation system is called consistent (on $A$) if its set of solutions on $A$ is nonempty. The set of solutions of the empty equation system coincides with $A$.
As algorithms solving the problem $z$, we consider decision trees with two types of queries. We can choose an attribute $f_i \in F(z)$ and ask about its value. This query has two possible answers: $\{f_i(x) = 0\}$ and $\{f_i(x) = 1\}$. We can also formulate a hypothesis over $z$ in the form $H = \{f_1(x) = \delta_1, \ldots, f_n(x) = \delta_n\}$, where $\delta_1, \ldots, \delta_n \in \{0, 1\}$, and ask about this hypothesis. This query has $n + 1$ possible answers: $H, \{f_1(x) = \neg\delta_1\}, \ldots, \{f_n(x) = \neg\delta_n\}$, where $\neg 1 = 0$ and $\neg 0 = 1$. The first answer means that the hypothesis is true. The other answers are counterexamples.
A decision tree over $z$ is a marked finite directed tree with a root in which:
  • Each terminal node is labeled with an $n$-tuple from the set $\{0, 1\}^n$;
  • Each node that is not terminal (such nodes are called working) is labeled with an attribute from the set $F(z)$ or with a hypothesis over $z$;
  • If a working node is labeled with an attribute $f_i$ from $F(z)$, then two edges leave this node, labeled with the systems of equations $\{f_i(x) = 0\}$ and $\{f_i(x) = 1\}$, respectively;
  • If a working node is labeled with a hypothesis
    $$H = \{f_1(x) = \delta_1, \ldots, f_n(x) = \delta_n\}$$
    over $z$, then $n + 1$ edges leave this node, labeled with the systems of equations $H, \{f_1(x) = \neg\delta_1\}, \ldots, \{f_n(x) = \neg\delta_n\}$, respectively.
Let $\Gamma$ be a decision tree over $z$. A complete path in $\Gamma$ is an arbitrary directed path from the root to a terminal node in $\Gamma$. We now define an equation system $S(\xi)$ over $U$ associated with the complete path $\xi$. If there are no working nodes in $\xi$, then $S(\xi)$ is the empty system. Otherwise, $S(\xi)$ is the union of the equation systems assigned to the edges of the path $\xi$. We denote by $A(\xi)$ the set of solutions on $A$ of the system of equations $S(\xi)$ (if this system is empty, then its solution set is equal to $A$).
We say that a decision tree $\Gamma$ over $z$ solves the problem $z$ relative to $U$ if, for each element $a \in A$ and for each complete path $\xi$ in $\Gamma$ such that $a \in A(\xi)$, the terminal node of the path $\xi$ is labeled with the tuple $z(a)$.
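The definitions above can be read operationally. The following Python sketch (our own encoding of trees as nested tuples, assuming a simulated oracle that returns the first counterexample, although the model allows any) runs a decision tree over $z$ on an element $a$; the example tree solves the problem $z = (f_1, f_2)$ with one hypothesis query followed by at most one attribute query.

```python
# Sketch: trees are nested tuples.
#   ('leaf', t)                        -- terminal node labeled with tuple t
#   ('attr', i, {0: T0, 1: T1})        -- query attribute f_i
#   ('hyp', deltas, T_yes, {i: T_i})   -- query hypothesis {f_i(x) = deltas[i]}
def run(tree, a, z):
    """Run a decision tree on element a; z is the list of attributes f_1..f_n."""
    if tree[0] == 'leaf':
        return tree[1]
    if tree[0] == 'attr':
        _, i, children = tree
        return run(children[z[i](a)], a, z)
    _, deltas, confirm, counter = tree
    for i, f in enumerate(z):
        if f(a) != deltas[i]:
            return run(counter[i], a, z)  # counterexample f_i(x) = not deltas[i]
    return run(confirm, a, z)             # hypothesis confirmed

z = [lambda a: a[0], lambda a: a[1]]      # f_1, f_2 on A = {0,1}^2
tree = ('hyp', (0, 0), ('leaf', (0, 0)),
        {0: ('attr', 1, {0: ('leaf', (1, 0)), 1: ('leaf', (1, 1))}),
         1: ('attr', 0, {0: ('leaf', (0, 1)), 1: ('leaf', (1, 1))})})
for a in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert run(tree, a, z) == a           # the tree computes z(a) = a
```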
We now consider an equivalent definition of a decision tree solving a problem. Denote by $\Delta_U(z)$ the set of tuples $(\delta_1, \ldots, \delta_n) \in \{0, 1\}^n$ such that the system of equations $\{f_1(x) = \delta_1, \ldots, f_n(x) = \delta_n\}$ is consistent. The set $\Delta_U(z)$ is the set of all possible solutions to the problem $z$. Let $\Delta \subseteq \Delta_U(z)$, $f_{i_1}, \ldots, f_{i_m} \in \{f_1, \ldots, f_n\}$, and $\sigma_1, \ldots, \sigma_m \in \{0, 1\}$. Denote by
$$\Delta(f_{i_1}, \sigma_1) \cdots (f_{i_m}, \sigma_m)$$
the set of all $n$-tuples $(\delta_1, \ldots, \delta_n) \in \Delta$ for which $\delta_{i_1} = \sigma_1, \ldots, \delta_{i_m} = \sigma_m$.
Let $\Gamma$ be a decision tree over the problem $z$. We associate with each complete path $\xi$ in the tree $\Gamma$ a word $\pi(\xi)$ in the alphabet $\{(f_i, \delta) : f_i \in F(z), \delta \in \{0, 1\}\}$. If the equation system $S(\xi)$ is empty, then $\pi(\xi)$ is the empty word. If $S(\xi) = \{f_{i_1}(x) = \sigma_1, \ldots, f_{i_m}(x) = \sigma_m\}$, then $\pi(\xi) = (f_{i_1}, \sigma_1) \cdots (f_{i_m}, \sigma_m)$. The decision tree $\Gamma$ over $z$ solves the problem $z$ relative to $U$ if, for each complete path $\xi$ in $\Gamma$, the set $\Delta_U(z)\pi(\xi)$ contains at most one tuple, and if this set contains exactly one tuple, then this tuple is assigned to the terminal node of the path $\xi$.
As the time complexity of a decision tree $\Gamma$, we consider its depth $h(\Gamma)$, which is the maximum number of working nodes in a complete path in the tree $\Gamma$.
Let $z \in P(U)$. We denote by $h_U^{(1)}(z)$ the minimum depth of a decision tree over $z$ that solves $z$ relative to $U$ and uses only attributes from $F(z)$. We denote by $h_U^{(2)}(z)$ the minimum depth of a decision tree over $z$ that solves $z$ relative to $U$ and uses only hypotheses over $z$. We denote by $h_U^{(3)}(z)$ the minimum depth of a decision tree over $z$ that solves $z$ relative to $U$ and uses both attributes from $F(z)$ and hypotheses over $z$.
For $i = 1, 2, 3$, we define a Shannon-type function $h_U^{(i)}(n)$ that characterizes the dependence of $h_U^{(i)}(z)$ on $\dim z$ in the worst case. Let $i \in \{1, 2, 3\}$ and $n \in \mathbb{N}$. Then:
$$h_U^{(i)}(n) = \max\{h_U^{(i)}(z) : z \in P(U),\ \dim z \le n\}.$$
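On a finite fragment of a universe, $\Delta_U(z)$ can be enumerated by brute force, which also gives the trivial lower bound $\log_2 |\Delta_U(z)|$ on the depth of any decision tree using only attributes. A small illustrative Python sketch (the threshold attributes here are our own choice, similar to the functions $l_i$ defined in Section 5):

```python
# Sketch: enumerate Delta_U(z) over a finite fragment A of the universe.
def delta_U(A, z):
    """All consistent answer tuples (f_1(a), ..., f_n(a)), a in A."""
    return {tuple(f(a) for f in z) for a in A}

A = range(1, 9)                                     # fragment {1, ..., 8}
z = [lambda x, i=i: int(x > i) for i in (2, 4, 6)]  # threshold attributes
print(sorted(delta_U(A, z)))  # 4 of the 2^3 = 8 tuples are consistent
# Any decision tree using only attributes needs at least |Delta_U(z)| leaves,
# so its depth is at least log2(4) = 2 here.
```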

3. Main Results

Let $U = (A, F)$ be an infinite binary information system and $r \in \mathbb{N}$. The information system $U$ is called $r$-reduced if, for each system of equations over $U$ that is consistent on $A$, there exists a subsystem of this system that has the same set of solutions and contains at most $r$ equations. We denote by $\mathcal{R}$ the set of infinite binary information systems each of which is $r$-reduced for some $r \in \mathbb{N}$.
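For a finite system, the $r$-reduced property can be verified directly, although at an exponential cost. The following brute-force Python sketch is our own illustration (the bound max_m on the size of the inspected systems is an artificial cutoff); it anticipates Lemma 6, where threshold attributes are shown to form a 2-reduced system.

```python
from itertools import combinations, product

def solutions(A, system):
    """Solution set on A of a system given as pairs (f, delta)."""
    return {a for a in A if all(f(a) == d for f, d in system)}

def is_r_reduced(A, F, r, max_m=3):
    """Brute-force check (up to systems of max_m equations) that every
    consistent system over (A, F) has an equivalent subsystem of <= r equations."""
    for m in range(1, max_m + 1):
        for attrs in combinations(F, m):
            for deltas in product((0, 1), repeat=m):
                system = list(zip(attrs, deltas))
                sols = solutions(A, system)
                if not sols:
                    continue  # only consistent systems matter
                if not any(solutions(A, sub) == sols
                           for k in range(r + 1)
                           for sub in combinations(system, k)):
                    return False
    return True

A = range(1, 9)
F = [lambda x, i=i: int(x > i) for i in range(1, 8)]  # thresholds l_1..l_7
print(is_r_reduced(A, F, r=2))  # True, in line with Lemma 6
```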
The next theorem follows from the results obtained in [15], where we considered closed classes of test tables (decision tables). It also follows from the results obtained in [16], where we considered the weighted depth of decision trees.
Theorem 1.
Let $U$ be an infinite binary information system. Then, the following statements hold:
(a) If $U \in \mathcal{R}$, then $h_U^{(1)}(n) = \Theta(\log n)$;
(b) If $U \notin \mathcal{R}$, then $h_U^{(1)}(n) = n$ for any $n \in \mathbb{N}$.
A subset $\{f_1, \ldots, f_m\}$ of $F$ is called independent if, for any $\delta_1, \ldots, \delta_m \in \{0, 1\}$, the system of equations $\{f_1(x) = \delta_1, \ldots, f_m(x) = \delta_m\}$ is consistent on the set $A$. The empty set of attributes is independent by definition. We now define the independence dimension, or I-dimension, $I(U)$ of the information system $U$ (this notion is similar to the notion of the independence number of a family of sets considered by Naiman and Wynn in [19]). If, for each $m \in \mathbb{N}$, the set $F$ contains an independent subset of cardinality $m$, then $I(U) = \infty$. Otherwise, $I(U)$ is the maximum cardinality of an independent subset of the set $F$. We denote by $\mathcal{D}$ the set of infinite binary information systems with a finite independence dimension.
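Independence of a finite set of attributes is also directly checkable. A short illustrative Python sketch (the attribute families mirror the functions $p_i$ and $l_i$ defined in Section 5):

```python
from itertools import product

def is_independent(A, attrs):
    """True iff every system {f_1(x)=d_1, ..., f_m(x)=d_m} is consistent on A."""
    return all(any(all(f(a) == d for f, d in zip(attrs, deltas)) for a in A)
               for deltas in product((0, 1), repeat=len(attrs)))

A = range(1, 9)
l2, l4 = (lambda x: int(x > 2)), (lambda x: int(x > 4))
p1, p5 = (lambda x: int(x == 1)), (lambda x: int(x == 5))
print(is_independent(A, [l2]))      # True: one non-constant attribute
print(is_independent(A, [l2, l4]))  # False: l2(x)=0, l4(x)=1 has no solution
print(is_independent(A, [p1, p5]))  # False: p1(x)=1, p5(x)=1 has no solution
```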
Let $U = (A, F)$ be a binary information system, not necessarily infinite, $f \in F$, and $\delta \in \{0, 1\}$. Denote
$$A(f, \delta) = \{a : a \in A,\ f(a) = \delta\}.$$
We now define inductively the notion of a $k$-information system, $k \in \mathbb{N} \cup \{0\}$. The binary information system $U$ is called a $0$-information system if all attributes from $F$ are constant on the set $A$. Let, for some $k \in \mathbb{N} \cup \{0\}$, the notion of an $m$-information system be defined for $m = 0, \ldots, k$. The binary information system $U$ is called a $(k+1)$-information system if it is not an $m$-information system for $m = 0, \ldots, k$ and, for any $f \in F$, there exist numbers $\delta \in \{0, 1\}$ and $m \in \{0, \ldots, k\}$ such that the information system $(A(f, \delta), F)$ is an $m$-information system. It is easy to show by induction on $k$ that if $U = (A, F)$ is a $k$-information system, then $U' = (A', F)$, where $A' \subseteq A$, is an $l$-information system for some $l \le k$. We denote by $\mathcal{C}$ the set of infinite binary information systems for each of which there exists $k \in \mathbb{N}$ such that the considered system is a $k$-information system. The following theorem was presented in [17] without proof.
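For a finite system, the least $k$ for which $U$ is a $k$-information system can be computed by following the inductive definition literally: the rank is $0$ when every attribute is constant, and otherwise $1$ plus the maximum over $f$ of the minimum over $\delta$ of the rank on $A(f, \delta)$. A recursive Python sketch (our own illustration, with memoization keyed by the current subuniverse):

```python
def info_rank(A, F, cache=None):
    """Least k with (A, F) a k-information system (always defined for finite A)."""
    cache = {} if cache is None else cache
    A = frozenset(A)
    if A not in cache:
        if all(len({f(a) for a in A}) <= 1 for f in F):
            cache[A] = 0  # every attribute is constant on A
        else:
            best = []
            for f in F:
                parts = [frozenset(a for a in A if f(a) == d) for d in (0, 1)]
                # a constant attribute has one part equal to A: skip that part;
                # the other part is empty and has rank 0
                best.append(min(info_rank(p, F, cache) for p in parts if p != A))
            cache[A] = 1 + max(best)
    return cache[A]

# The attributes p_i(x) = [x == i] of Lemma 5, restricted to {1, ..., 5}:
A = range(1, 6)
F = [lambda x, i=i: int(x == i) for i in range(1, 6)]
print(info_rank(A, F))  # 1, matching the claim that U_3 is a 1-information system
```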
Theorem 2.
Let $U$ be an infinite binary information system. Then, the following statements hold:
(a) If $U \in \mathcal{C}$, then $h_U^{(2)}(n) = O(1)$ and $h_U^{(3)}(n) = O(1)$;
(b) If $U \in \mathcal{D} \setminus \mathcal{C}$, then $h_U^{(2)}(n) = \Theta(\log n)$, $h_U^{(3)}(n) = \Omega(\log n / \log\log n)$, and $h_U^{(3)}(n) = O(\log n)$;
(c) If $U \notin \mathcal{D}$, then $h_U^{(2)}(n) = n$ and $h_U^{(3)}(n) = n$ for any $n \in \mathbb{N}$.
Let $U$ be an infinite binary information system. We now consider the joint behavior of the functions $h_U^{(1)}(n)$, $h_U^{(2)}(n)$, and $h_U^{(3)}(n)$. It depends on the membership of the information system $U$ in the sets $\mathcal{R}$, $\mathcal{D}$, and $\mathcal{C}$. We associate with the information system $U$ its indicator vector $ind(U) = (c_1, c_2, c_3) \in \{0, 1\}^3$, in which $c_1 = 1$ if and only if $U \in \mathcal{R}$, $c_2 = 1$ if and only if $U \in \mathcal{D}$, and $c_3 = 1$ if and only if $U \in \mathcal{C}$.
Theorem 3.
For any infinite binary information system, its indicator vector coincides with one of the rows of Table 1. Each row of Table 1 is the indicator vector of some infinite binary information system.
For $i = 1, 2, 3, 4$, we denote by $V_i$ the class of all infinite binary information systems for which the indicator vector coincides with the $i$th row of Table 1. Table 2 summarizes Theorems 1–3. The first column contains the name of the complexity class $V_i$. The next three columns describe the indicator vector of information systems from this class. The last three columns contain information about the behavior of the functions $h_U^{(1)}(n)$, $h_U^{(2)}(n)$, and $h_U^{(3)}(n)$ for information systems from the class $V_i$.

4. Proof of Theorem 2

We precede the proof of Theorem 2 with two lemmas.
Let $d \in \mathbb{N}$. A d-complete tree over the information system $U = (A, F)$ is a marked finite directed tree with a root in which:
  • Each terminal node is not labeled;
  • Each nonterminal node is labeled with an attribute $f \in F$; two edges leave this node, labeled with the systems of equations $\{f(x) = 0\}$ and $\{f(x) = 1\}$, respectively;
  • The length of each complete path (a path from the root to a terminal node) is equal to $d$;
  • For each complete path $\xi$, the equation system $S(\xi)$, which is the union of the equation systems assigned to the edges of the path $\xi$, is consistent.
Let $G$ be a $d$-complete tree over $U$ and $F(G)$ be the set of all attributes attached to the nonterminal nodes of the tree $G$. The number of nonterminal nodes in $G$ is equal to $2^0 + 2^1 + \cdots + 2^{d-1} = 2^d - 1$. Therefore, $|F(G)| \le 2^d$.
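For threshold attributes, a $d$-complete tree is easy to construct by midpoint splitting. The following Python sketch is our own illustration (the attribute names l_i are borrowed from Section 5); it builds a $d$-complete tree over $A = \{lo, \ldots, hi\}$ whenever $hi - lo + 1 \ge 2^d$, so that every path remains consistent.

```python
def complete_tree(lo, hi, d):
    """Sketch: a d-complete tree over the thresholds l_i(x) = [x > i] on
    A = {lo, ..., hi}; each node splits its interval at the midpoint, so
    both branches stay consistent as long as hi - lo + 1 >= 2^d."""
    if d == 0:
        return None  # unlabeled terminal node
    mid = (lo + hi) // 2
    return (f'l_{mid}',
            complete_tree(lo, mid, d - 1),       # edge {l_mid(x) = 0}
            complete_tree(mid + 1, hi, d - 1))   # edge {l_mid(x) = 1}

print(complete_tree(1, 8, 3))  # uses the 7 = 2^3 - 1 attributes l_1, ..., l_7
```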
The results mentioned in the following lemma are obtained by methods similar to those used by Littlestone [12], Maass and Turán [13], and Angluin [11].
Lemma 1.
Let $U = (A, F)$ be a binary information system, $d \in \mathbb{N}$, $G$ be a $d$-complete tree over $U$, and $z$ be a problem over $U$ such that $F(G) \subseteq F(z)$. Then:
(a) $h_U^{(2)}(z) \ge d$;
(b) $h_U^{(3)}(z) \ge d / \log_2(2d)$.
Proof. 
(a) We prove the inequality $h_U^{(2)}(z) \ge d$ by induction on $d$. Let $d = 1$. Then, the tree $G$ has only one nonterminal node, which is labeled with an attribute $f$ that is not constant on $A$. Therefore, $|\Delta_U(z)| \ge 2$ and $h_U^{(2)}(z) \ge 1$. Let, for $t \in \mathbb{N}$ and for any natural $d$, $1 \le d \le t$, the considered statement hold. Assume now that $d = t + 1$, $G$ is a $d$-complete tree over $U$, $z$ is a problem over $U$ such that $F(G) \subseteq F(z)$, and $\Gamma$ is a decision tree over $z$ with the minimum depth, which solves the problem $z$ and uses only hypotheses. Let $f$ be the attribute attached to the root of the tree $G$ and $H$ be the hypothesis attached to the root of the decision tree $\Gamma$. Then, there is an edge that leaves the root of $\Gamma$ and is labeled with the equation system $\{f(x) = \delta\}$, where the equation $f(x) = \neg\delta$ belongs to the hypothesis $H$. This edge enters the root of a subtree of $\Gamma$, which we denote by $\Gamma_f$. There is also an edge that leaves the root of $G$ and is labeled with the equation system $\{f(x) = \delta\}$. This edge enters the root of a subtree of $G$, which we denote by $G_\delta$. One can show that the decision tree $\Gamma_f$ solves the problem $z$ relative to the information system $U' = (A(f, \delta), F)$ and that $G_\delta$ is a $t$-complete tree over $U'$. It is clear that $F(G_\delta) \subseteq F(z)$. Using the inductive hypothesis, we obtain $h(\Gamma_f) \ge t$. Therefore, $h(\Gamma) \ge t + 1 = d$ and $h_U^{(2)}(z) \ge d$.
(b) We now prove the inequality $h_U^{(3)}(z) \ge d / \log_2(2d)$. Let $z = (f_1, \ldots, f_n)$ and $\Gamma$ be a decision tree over $z$ with the minimum depth, which solves the problem $z$ and uses both attributes and hypotheses. The $d$-complete tree $G$ has $2^d$ complete paths $\xi_1, \ldots, \xi_{2^d}$. For $i = 1, \ldots, 2^d$, we denote by $a_i$ a solution of the equation system $S(\xi_i)$. Denote $B = \{a_1, \ldots, a_{2^d}\}$. We now show that the decision tree $\Gamma$ contains a complete path whose length is at least $d / \log_2(2d)$. We describe the process of this path construction beginning with the root of $\Gamma$.
Let the root of $\Gamma$ be labeled with an attribute $f_{i_0}$. For $\delta \in \{0, 1\}$, we denote by $B_\delta$ the set of solutions on $B$ of the equation system $\{f_{i_0}(x) = \delta\}$ and choose $\sigma \in \{0, 1\}$ for which $|B_\sigma| = \max\{|B_0|, |B_1|\}$. It is clear that $|B_\sigma| \ge |B| / 2 \ge |B| / (2d)$. In the considered case, the beginning of the constructed path in $\Gamma$ is the root of $\Gamma$, the edge that leaves the root and is labeled with the equation system $\{f_{i_0}(x) = \sigma\}$, and the node that this edge enters.
Let us assume now that the root of $\Gamma$ is labeled with a hypothesis $H = \{f_1(x) = \delta_1, \ldots, f_n(x) = \delta_n\}$. We denote by $\xi_H$ the complete path in $G$ for which the system of equations $S(\xi_H)$ is a subsystem of $H$. Let the nonterminal nodes of the complete path $\xi_H$ be labeled with the attributes $f_{i_1}, \ldots, f_{i_d}$. For $j = 1, \ldots, d$, we denote by $B_j$ the set of solutions on $B$ of the equation system $\{f_{i_j}(x) = \neg\delta_{i_j}\}$. It is clear that $|B_1| + \cdots + |B_d| \ge |B| - 1$. Therefore, there exists $l \in \{1, \ldots, d\}$ such that $|B_l| \ge (|B| - 1) / d \ge |B| / (2d)$. In the considered case, the beginning of the constructed path in $\Gamma$ is the root of $\Gamma$, the edge that leaves the root and is labeled with the equation system $\{f_{i_l}(x) = \neg\delta_{i_l}\}$, and the node that this edge enters.
We continue the construction of the complete path in $\Gamma$ in the same way, so that after the $t$th query, at least $|B| / (2d)^t$ elements of $B$ remain. The process of path construction continues at least until $|B| / (2d)^t \le 1$, i.e., at least until $\log_2 |B| \le t \log_2(2d)$. Since $|B| = 2^d$, we have $h(\Gamma) \ge t \ge d / \log_2(2d)$ and $h_U^{(3)}(z) \ge d / \log_2(2d)$. □
Lemma 2.
Let $U = (A, F)$ be a binary information system, $k \in \mathbb{N} \cup \{0\}$, and let $U$ not be an $m$-information system for $m = 0, \ldots, k$. Then, there exists a $(k+1)$-complete tree over $U$.
Proof. 
We prove the considered statement by induction on $k$. Let $k = 0$. In this case, $U$ is not a $0$-information system. Then, there exists an attribute $f \in F$ that is not constant on $A$. Using this attribute, it is easy to construct a $1$-complete tree over $U$.
Let the considered statement hold for some $k \ge 0$. We now show that it also holds for $k + 1$. Let $U = (A, F)$ be a binary information system that is not an $m$-information system for $m = 0, \ldots, k + 1$. Then, there exists an attribute $f \in F$ such that, for any $\delta \in \{0, 1\}$, the information system $U_\delta = (A(f, \delta), F)$ is not an $m$-information system for $m = 0, \ldots, k$. Using the inductive hypothesis, we conclude that, for any $\delta \in \{0, 1\}$, there exists a $(k+1)$-complete tree $G_\delta$ over $U_\delta$. Denote by $G$ a directed tree with a root in which the root is labeled with the attribute $f$ and, for any $\delta \in \{0, 1\}$, there is an edge that leaves the root, is labeled with the equation system $\{f(x) = \delta\}$, and enters the root of the tree $G_\delta$. One can show that the tree $G$ is a $(k+2)$-complete tree over $U$. □
Proof of Theorem 2.
It is clear that $h_U^{(3)}(z) \le h_U^{(2)}(z)$ for any problem $z$ over $U$. Therefore, $h_U^{(3)}(n) \le h_U^{(2)}(n)$ for any $n \in \mathbb{N}$.
(a) Let $k \in \mathbb{N} \cup \{0\}$. We now show by induction on $k$ that, for each binary $k$-information system $U$ (not necessarily infinite) and for each problem $z$ over $U$, the inequality $h_U^{(2)}(z) \le k$ holds. Let $U = (A, F)$ be a binary $0$-information system and $z$ be a problem over $U$. Since all attributes from $F(z)$ are constant on $A$, the set $\Delta_U(z)$ contains only one tuple. Therefore, the decision tree containing only one node, labeled with this tuple, solves the problem $z$ relative to $U$, and $h_U^{(2)}(z) = 0$.
Let $k \in \mathbb{N} \cup \{0\}$ and, for each $m$, $0 \le m \le k$, the considered statement hold. Let us show that it holds for $k + 1$. Let $U = (A, F)$ be a binary $(k+1)$-information system and $z = (f_1, \ldots, f_n)$ be a problem over $U$. For $i = 1, \ldots, n$, choose a number $\delta_i \in \{0, 1\}$ such that the information system $(A(f_i, \neg\delta_i), F)$ is an $m_i$-information system, where $0 \le m_i \le k$. Using the inductive hypothesis, we conclude that, for $i = 1, \ldots, n$, there is a decision tree $\Gamma_i$ over $z$, which uses only hypotheses, solves the problem $z$ relative to $(A(f_i, \neg\delta_i), F)$, and has depth at most $m_i$. We denote by $\Gamma$ a decision tree in which the root is labeled with the hypothesis $H = \{f_1(x) = \delta_1, \ldots, f_n(x) = \delta_n\}$, the edge leaving the root and labeled with $H$ enters the terminal node labeled with the tuple $(\delta_1, \ldots, \delta_n)$, and, for $i = 1, \ldots, n$, the edge leaving the root and labeled with $\{f_i(x) = \neg\delta_i\}$ enters the root of the tree $\Gamma_i$. One can show that $\Gamma$ solves the problem $z$ relative to $U$ and $h(\Gamma) \le k + 1$. Therefore, $h_U^{(2)}(z) \le k + 1$ for any problem $z$ over $U$.
Let $U \in \mathcal{C}$. Then, $U$ is a $k$-information system for some natural $k$, and for each problem $z$ over $U$, we have $h_U^{(3)}(z) \le h_U^{(2)}(z) \le k$. Therefore, $h_U^{(2)}(n) = O(1)$ and $h_U^{(3)}(n) = O(1)$.
(b) Let $U = (A, F) \in \mathcal{D} \setminus \mathcal{C}$. First, we show that $h_U^{(2)}(n) = O(\log n)$. Let $z = (f_1, \ldots, f_n)$ be an arbitrary problem over $U$. From Lemma 5.1 of [16], it follows that $|\Delta_U(z)| \le (4n)^{I(U)}$. The proof of this lemma is based on results similar to the ones obtained by Sauer [20] and Shelah [21]. We consider a decision tree $\Gamma$ over $z$, which solves $z$ relative to $U$ and uses only hypotheses. This tree is constructed by the halving algorithm [1,12]. We describe the work of this tree for an arbitrary element $a$ from $A$. Set $\Delta = \Delta_U(z)$. If $|\Delta| = 1$, then the only $n$-tuple from $\Delta$ is the solution $z(a)$ of the problem $z$ for the element $a$. Let $|\Delta| \ge 2$. For $i = 1, \ldots, n$, we denote by $\delta_i$ a number from $\{0, 1\}$ such that $|\Delta(f_i, \delta_i)| \ge |\Delta(f_i, \neg\delta_i)|$. The root of $\Gamma$ is labeled with the hypothesis $H = \{f_1(x) = \delta_1, \ldots, f_n(x) = \delta_n\}$. After this query, either the problem $z$ is solved (if the answer is $H$) or we at least halve the number of tuples in the set $\Delta$ (if the answer is a counterexample $\{f_i(x) = \neg\delta_i\}$). In the latter case, set $\Delta = \Delta_U(z)(f_i, \neg\delta_i)$. The decision tree $\Gamma$ continues to work with the element $a$ and the set of $n$-tuples $\Delta$ in the same way. Let, during the work with the element $a$, the considered decision tree make $q$ queries. After the $(q-1)$th query, the number of remaining $n$-tuples in the set $\Delta$ is at least two and at most $(4n)^{I(U)} / 2^{q-1}$. Therefore, $2^q \le (4n)^{I(U)}$ and $q \le I(U) \log_2(4n)$. Thus, during the processing of the element $a$, the decision tree $\Gamma$ makes at most $I(U) \log_2(4n)$ queries. Since $a$ is an arbitrary element from $A$, the depth of $\Gamma$ is at most $I(U) \log_2(4n)$. Since $z$ is an arbitrary problem over $U$, we obtain $h_U^{(2)}(n) = O(\log n)$. Therefore, $h_U^{(3)}(n) = O(\log n)$.
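The halving strategy used in this part of the proof can be simulated directly. In the following illustrative Python sketch (a finite universe A stands in for the infinite one), each counterexample keeps only the minority half of the candidate set, so the number of queries stays logarithmic in $|\Delta_U(z)|$:

```python
def halving_depth(A, z):
    """Simulate the hypothesis-only halving strategy: propose the
    coordinatewise-majority tuple of the remaining candidate set Delta;
    every counterexample leaves at most half of Delta."""
    def queries_for(a):
        delta = {tuple(f(x) for f in z) for x in A}  # Delta_U(z)
        q = 0
        while len(delta) > 1:
            hyp = tuple(int(2 * sum(t[i] for t in delta) > len(delta))
                        for i in range(len(z)))
            q += 1
            i = next((i for i, f in enumerate(z) if f(a) != hyp[i]), None)
            if i is None:
                return q                   # hypothesis confirmed: z(a) = hyp
            delta = {t for t in delta if t[i] != hyp[i]}  # keep minority half
        return q                           # single tuple left: it is z(a)
    return max(queries_for(a) for a in A)

A = range(1, 17)
z = [lambda x, i=i: int(x > i) for i in range(1, 16)]  # n = 15 thresholds
print(halving_depth(A, z))  # 4 = log2 |Delta_U(z)| queries, far below n = 15
```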
Using Lemma 2 and the relation $U \notin \mathcal{C}$, we obtain that, for any $d \in \mathbb{N}$, there exists a $d$-complete tree $G_d$ over $U$. Let $F(G_d) = \{f_1, \ldots, f_{n_d}\}$. We know that $n_d \le 2^d$. Denote $z_d = (f_1, \ldots, f_{n_d})$. From Lemma 1, it follows that $h_U^{(2)}(z_d) \ge d$ and $h_U^{(3)}(z_d) \ge d / \log_2(2d)$. As a result, we have $h_U^{(2)}(2^d) \ge d$ and $h_U^{(3)}(2^d) \ge d / \log_2(2d)$. Let $n \in \mathbb{N}$ and $n \ge 8$. Then, there exists $d \in \mathbb{N}$ such that $2^d \le n < 2^{d+1}$. We have $d > \log_2 n - 1$, $h_U^{(2)}(n) \ge \log_2 n - 1$, $h_U^{(2)}(n) = \Omega(\log n)$, and $h_U^{(2)}(n) = \Theta(\log n)$. It is easy to show that the function $x / \log_2(2x)$ is nondecreasing for $x \ge 2$. Therefore, $h_U^{(3)}(n) \ge (\log_2 n - 1) / \log_2(2(\log_2 n - 1))$ and $h_U^{(3)}(n) = \Omega(\log n / \log\log n)$.
(c) Let $U = (A, F) \notin \mathcal{D}$. We now consider an arbitrary problem $z = (f_1, \ldots, f_n)$ over $U$ and a decision tree over $z$, which uses only hypotheses and solves the problem $z$ relative to $U$ in the following way. For a given element $a \in A$, the first query is about the hypothesis $H_1 = \{f_1(x) = 1, \ldots, f_n(x) = 1\}$. If the answer is $H_1$, then the problem $z$ is solved for the element $a$. If, for some $i \in \{1, \ldots, n\}$, the answer is $\{f_i(x) = 0\}$, then the second query is about the hypothesis $H_2$ obtained from $H_1$ by replacing the equality $f_i(x) = 1$ with the equality $f_i(x) = 0$, etc. It is clear that after at most $n$ queries, the problem $z$ for the element $a$ will be solved. Thus, $h_U^{(2)}(z) \le n$ and $h_U^{(3)}(z) \le n$. Since $z$ is an arbitrary problem over $U$, we have $h_U^{(2)}(n) \le n$ and $h_U^{(3)}(n) \le n$ for any $n \in \mathbb{N}$.
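This sequential strategy is easy to simulate. An illustrative Python sketch (our own encoding, with a first-counterexample oracle) confirming the bound of at most $n$ queries:

```python
def sequential_strategy(a, z):
    """Strategy from part (c) above: start with the all-ones hypothesis and
    flip the refuted equation after each counterexample."""
    n = len(z)
    hyp, pinned, q = [1] * n, set(), 0
    while True:
        q += 1
        i = next((i for i in range(n) if z[i](a) != hyp[i]), None)
        if i is None:
            return tuple(hyp), q   # hypothesis confirmed
        hyp[i] = 1 - hyp[i]        # now hyp[i] = z[i](a), permanently correct
        pinned.add(i)
        if len(pinned) == n:       # every coordinate verified or corrected
            return tuple(hyp), q

z = [lambda x, i=i: int(x == i) for i in (1, 2, 3)]
for a in (1, 2, 3, 4):
    result, q = sequential_strategy(a, z)
    assert result == tuple(f(a) for f in z) and q <= len(z)
```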
Let $n \in \mathbb{N}$. Since $U \notin \mathcal{D}$, there exist attributes $f_1, \ldots, f_n \in F$ such that, for any $(\delta_1, \ldots, \delta_n) \in \{0, 1\}^n$, the equation system $\{f_1(x) = \delta_1, \ldots, f_n(x) = \delta_n\}$ is consistent on $A$. We now consider the problem $z = (f_1, \ldots, f_n)$ and an arbitrary decision tree $\Gamma$ over $z$, which solves the problem $z$ relative to $U$ and uses both attributes and hypotheses. Let us show that $h(\Gamma) \ge n$. If $n = 1$, then the considered inequality holds since $|\Delta_U(z)| \ge 2$. Let $n \ge 2$. It is easy to show that an equation system over $z$ is inconsistent if and only if it contains the equations $f_i(x) = 0$ and $f_i(x) = 1$ for some $i \in \{1, \ldots, n\}$. For each node $v$ of the decision tree $\Gamma$, we denote by $S_v$ the union of the systems of equations attached to the edges in the path from the root of $\Gamma$ to $v$. A node $v$ of $\Gamma$ will be called consistent if the equation system $S_v$ is consistent.
We now construct a complete path $\xi$ in the decision tree $\Gamma$ all of whose nodes are consistent. We start from the root, which is a consistent node. Let the path reach a consistent node $v$ of $\Gamma$. If $v$ is a terminal node, then the path $\xi$ is constructed. Let $v$ be a working node labeled with an attribute $f_i \in F(z)$. Then, there exists $\delta \in \{0, 1\}$ for which the system of equations $S_v \cup \{f_i(x) = \delta\}$ is consistent, and the path $\xi$ will pass through the edge leaving $v$ and labeled with the system of equations $\{f_i(x) = \delta\}$. Let $v$ be labeled with a hypothesis $H = \{f_1(x) = \delta_1, \ldots, f_n(x) = \delta_n\}$. If there exists $i \in \{1, \ldots, n\}$ such that the system of equations $S_v \cup \{f_i(x) = \neg\delta_i\}$ is consistent, then the path $\xi$ will pass through the edge leaving $v$ and labeled with the system of equations $\{f_i(x) = \neg\delta_i\}$. Otherwise, $S_v = H$, and the path $\xi$ will pass through the edge leaving $v$ and labeled with the system of equations $H$.
Suppose first that all edges in the path $\xi$ are labeled with systems of equations containing one equation each. Since all nodes of $\xi$ are consistent, the equation system $S(\xi)$ is consistent. We now show that $S(\xi)$ contains at least $n$ equations. Assume that this system contains fewer than $n$ equations. Then, the set $\Delta_U(z)\pi(\xi)$ contains more than one $n$-tuple, which is impossible. Therefore, the length of the path $\xi$ is at least $n$. Suppose now that there are edges in $\xi$ labeled with hypotheses, and let the first such edge leave a node $v$ and be labeled with a hypothesis $H$. Then, $S_v = H$, and the length of $\xi$ is at least $n$. Therefore, $h(\Gamma) \ge n$, $h_U^{(3)}(z) \ge n$, and $h_U^{(2)}(z) \ge n$. As a result, we obtain $h_U^{(3)}(n) \ge n$ and $h_U^{(2)}(n) \ge n$. Thus, $h_U^{(2)}(n) = n$ and $h_U^{(3)}(n) = n$ for any $n \in \mathbb{N}$. □

5. Proof of Theorem 3

First, we prove several auxiliary statements.
Proposition 1.
$\mathcal{R} \subseteq \mathcal{D}$.
Proof. 
Let $U \in \mathcal{R}$. By Theorem 1, $h_U^{(1)}(n) = \Theta(\log n)$. Let us assume that $U \notin \mathcal{D}$. Then, for any $n \in \mathbb{N}$, there exists a problem $z = (f_1, \ldots, f_n)$ over $U$ such that $|\Delta_U(z)| = 2^n$. Let $\Gamma$ be a decision tree over $z$, which solves the problem $z$ relative to $U$ and uses only attributes. Then, $\Gamma$ must have at least $2^n$ terminal nodes. One can show that the number of terminal nodes in the tree $\Gamma$ is at most $2^{h(\Gamma)}$. Then, $2^n \le 2^{h(\Gamma)}$, $h(\Gamma) \ge n$, and $h_U^{(1)}(z) \ge n$. Therefore, $h_U^{(1)}(n) \ge n$ for any $n \in \mathbb{N}$, which is impossible. Thus, $\mathcal{R} \subseteq \mathcal{D}$. □
Proposition 2.
$\mathcal{C} \subseteq \mathcal{D}$.
Proof. 
Let $U \in \mathcal{C}$. By Theorem 2, $h_U^{(2)}(n) = O(1)$. Let us assume that $U \notin \mathcal{D}$. Then, by Theorem 2, $h_U^{(2)}(n) = n$ for any $n \in \mathbb{N}$, which is impossible. Therefore, $\mathcal{C} \subseteq \mathcal{D}$. □
Proposition 3.
$\mathcal{R} \cap \mathcal{C} = \emptyset$.
Proof. 
Assume the contrary: $\mathcal{R} \cap \mathcal{C} \ne \emptyset$, and let $U = (A, F) \in \mathcal{R} \cap \mathcal{C}$. Let $r, k \in \mathbb{N}$ be such that $U$ is an $r$-reduced information system and a $k$-information system. We now consider an arbitrary problem $z = (f_1, \ldots, f_n)$ over $U$ and describe a decision tree $\Gamma$ over $z$, which uses only attributes, solves the problem $z$ relative to $U$, and has depth at most $kr$.
For $i = 1, \ldots, n$, let $\delta_i$ be a number from $\{0, 1\}$ such that $(A(f_i, \neg\delta_i), F)$ is an $m_i$-information system with $0 \le m_i < k$. Let $t$ be the maximum number from the set $\{1, \ldots, n\}$ such that the system of equations $S = \{f_1(x) = \delta_1, \ldots, f_t(x) = \delta_t\}$ is consistent. Then, there exists a subsystem $\{f_{i_1}(x) = \delta_{i_1}, \ldots, f_{i_p}(x) = \delta_{i_p}\}$ of the system $S$, which has the same set of solutions as $S$ and for which $p \le r$. For a given $a \in A$, the decision tree $\Gamma$ computes sequentially the values $f_{i_1}(a), \ldots, f_{i_p}(a)$.
If, for some $q \in \{1, \ldots, p\}$, $f_{i_1}(a) = \delta_{i_1}, \ldots, f_{i_{q-1}}(a) = \delta_{i_{q-1}}$, and $f_{i_q}(a) = \neg\delta_{i_q}$, then the decision tree $\Gamma$ continues to work with the problem $z$ and the information system $U' = (A', F)$, where $A'$ is the set of solutions on $A$ of the equation system $\{f_{i_1}(x) = \delta_{i_1}, \ldots, f_{i_{q-1}}(x) = \delta_{i_{q-1}}, f_{i_q}(x) = \neg\delta_{i_q}\}$. We have that $U'$ is an $l$-information system for some $l \le m_{i_q} < k$.
Let $f_{i_1}(a) = \delta_{i_1}, \ldots, f_{i_p}(a) = \delta_{i_p}$. If $t = n$, then $(\delta_1, \ldots, \delta_n)$ is the solution of the problem $z$ for the considered element $a$. Let $t < n$. Then, the decision tree $\Gamma$ continues to work with the problem $z$ and the information system $U' = (A', F)$, where $A'$ is the set of solutions on $A$ of the equation system $\{f_{i_1}(x) = \delta_{i_1}, \ldots, f_{i_p}(x) = \delta_{i_p}\}$. We know that the equation system $\{f_1(x) = \delta_1, \ldots, f_t(x) = \delta_t, f_{t+1}(x) = \delta_{t+1}\}$ is inconsistent. Therefore, the system $\{f_{i_1}(x) = \delta_{i_1}, \ldots, f_{i_p}(x) = \delta_{i_p}, f_{t+1}(x) = \delta_{t+1}\}$ is inconsistent. Hence, $A' \subseteq A(f_{t+1}, \neg\delta_{t+1})$ and $U'$ is an $l$-information system for some $l \le m_{t+1} < k$.
As a result, after the computation of the values of at most $r$ attributes, we either solve the problem $z$ or reduce the consideration of the problem $z$ over the $k$-information system $U$ to the consideration of the problem $z$ over some $l$-information system with $l < k$. After the computation of the values of at most $rk$ attributes, we solve the problem $z$, since each problem over a $0$-information system has exactly one possible solution. Therefore, $h_U^{(1)}(z) \le rk$ and $h_U^{(1)}(n) = O(1)$. By Theorem 1, $h_U^{(1)}(n) = \Theta(\log n)$. The obtained contradiction shows that $\mathcal{R} \cap \mathcal{C} = \emptyset$. □
Proposition 4.
For any infinite binary information system, its indicator vector coincides with one of the rows of Table 1.
Proof. 
Table 3 contains as rows all 3-tuples from the set $\{0, 1\}^3$. We now show that the rows with Numbers 5–8 cannot be indicator vectors of infinite binary information systems. Assume the contrary: there is $i \in \{5, 6, 7, 8\}$ such that the row with Number $i$ is the indicator vector of an infinite binary information system $U$. If $i = 5$, then $U \in \mathcal{R}$ and $U \notin \mathcal{D}$, but this is impossible since, by Proposition 1, $\mathcal{R} \subseteq \mathcal{D}$. If $i = 6$, then $U \in \mathcal{C}$ and $U \notin \mathcal{D}$, but this is impossible since, by Proposition 2, $\mathcal{C} \subseteq \mathcal{D}$. If $i = 7$, then $U \in \mathcal{R}$ and $U \notin \mathcal{D}$, but this is impossible since, by Proposition 1, $\mathcal{R} \subseteq \mathcal{D}$. If $i = 8$, then $U \in \mathcal{R}$ and $U \in \mathcal{C}$, but this is impossible since, by Proposition 3, $\mathcal{R} \cap \mathcal{C} = \emptyset$. Therefore, for any infinite binary information system, its indicator vector coincides with one of the rows of Table 3 with Numbers 1–4. Thus, it coincides with one of the rows of Table 1. □
Define an infinite binary information system $U_1 = (A_1, F_1)$ as follows: $A_1 = \mathbb{N}$ and $F_1$ is the set of all functions from $\mathbb{N}$ to $\{0, 1\}$.
Lemma 3.
The information system $U_1$ belongs to the class $V_1$.
Proof. 
It is easy to show that the information system $U_1$ has an infinite I-dimension. Therefore, $U_1 \notin \mathcal{D}$. Using Proposition 4, we obtain $ind(U_1) = (0, 0, 0)$, i.e., $U_1 \in V_1$. □
For any $i \in \mathbb{N}$, we define two functions $p_i : \mathbb{N} \to \{0, 1\}$ and $l_i : \mathbb{N} \to \{0, 1\}$. Let $j \in \mathbb{N}$. Then, $p_i(j) = 1$ if and only if $j = i$, and $l_i(j) = 1$ if and only if $j > i$.
Define an infinite binary information system $U_2 = (A_2, F_2)$ as follows: $A_2 = \mathbb{N}$ and $F_2 = \{p_i : i \in \mathbb{N}\} \cup \{l_i : i \in \mathbb{N}\}$.
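The first claim in the proof of Lemma 4 below can be spot-checked on a finite prefix of $\mathbb{N}$. An illustrative Python sketch (our own finite restriction of the attributes $p_i$):

```python
# Spot check on A = {1, ..., 10}: dropping any equation from
# S_n = {p_1(x) = 0, ..., p_n(x) = 0} enlarges the solution set,
# so no fixed r can make U_2 r-reduced.
A = range(1, 11)
p = {i: (lambda x, i=i: int(x == i)) for i in range(1, 11)}

def sols(system):
    return {x for x in A if all(p[i](x) == d for i, d in system)}

for n in (3, 5, 7):
    S_n = {(i, 0) for i in range(1, n + 1)}
    assert all(sols(S_n - {eq}) != sols(S_n) for eq in S_n)
```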
Lemma 4.
The information system $U_2$ belongs to the class $V_2$.
Proof. 
For $n \in \mathbb{N}$, denote $S_n = \{p_1(x) = 0, \ldots, p_n(x) = 0\}$. One can show that the equation system $S_n$ is consistent and that each proper subsystem of $S_n$ has a set of solutions different from the set of solutions of $S_n$. Therefore, $U_2 \notin \mathcal{R}$. Using attributes from the set $\{l_i : i \in \mathbb{N}\}$, we can construct a $d$-complete tree over $U_2$ for each $d \in \mathbb{N}$. By Lemma 1 and Theorem 2, $U_2 \notin \mathcal{C}$. One can show that $I(U_2) = 1$. Therefore, $U_2 \in \mathcal{D}$. Thus, $ind(U_2) = (0, 1, 0)$, i.e., $U_2 \in V_2$. □
Define an infinite binary information system $U_3 = (A_3, F_3)$ as follows: $A_3 = \mathbb{N}$ and $F_3 = \{p_i : i \in \mathbb{N}\}$.
Lemma 5.
The information system $U_3$ belongs to the class $V_3$.
Proof. 
It is easy to show that $U_3$ is a $1$-information system. Therefore, $U_3 \in \mathcal{C}$. Using Proposition 4, we obtain $ind(U_3) = (0, 1, 1)$, i.e., $U_3 \in V_3$. □
Define an infinite binary information system $U_4 = (A_4, F_4)$ as follows: $A_4 = \mathbb{N}$ and $F_4 = \{l_i : i \in \mathbb{N}\}$.
Lemma 6.
The information system $U_4$ belongs to the class $V_4$.
Proof. 
Let us consider an arbitrary consistent system of equations $S$ over $U_4$. We now show that there is a subsystem of $S$ that has at most two equations and the same set of solutions as $S$. Let $S$ contain both equations of the kind $l_i(x) = 1$ and equations of the kind $l_j(x) = 0$. Denote $i_0 = \max\{i : l_i(x) = 1 \in S\}$ and $j_0 = \min\{j : l_j(x) = 0 \in S\}$. One can show that the system of equations $S' = \{l_{i_0}(x) = 1, l_{j_0}(x) = 0\}$ has the same set of solutions as $S$. The case when $S$ contains, for some $\delta \in \{0, 1\}$, only equations of the kind $l_p(x) = \delta$ can be considered in a similar way; in this case, the equivalent subsystem contains only one equation. Therefore, the information system $U_4$ is $2$-reduced and $U_4 \in \mathcal{R}$. Using Proposition 4, we obtain $ind(U_4) = (1, 1, 0)$, i.e., $U_4 \in V_4$. □
Proof of Theorem 3.
From Proposition 4, it follows that, for any infinite binary information system, its indicator vector coincides with one of the rows of Table 1. Using Lemmas 3–6, we conclude that each row of Table 1 is the indicator vector of some infinite binary information system. □

6. Conclusions

Based on the results of exact learning, test theory, and rough set theory, for an arbitrary infinite binary information system, we studied three Shannon-type functions, which characterize, in the worst case, how the minimum depth of a decision tree solving a problem depends on the number of attributes in the problem description. These three functions correspond to (i) decision trees using attributes, (ii) decision trees using hypotheses, and (iii) decision trees using both attributes and hypotheses. We described the possible types of behavior for each of these three functions. We also studied the joint behavior of these functions and distinguished four corresponding complexity classes of infinite binary information systems. In the future, we plan to translate the obtained results into the language of exact learning.
The problems studied in this paper allowed us to confine ourselves to considering only crisp (conventional) sets, which are completely defined by attributes. However, in the future, when we investigate approximately defined problems or approximate decision trees, it will be necessary to work with rough sets given by their lower and upper approximations. This will require a wider range of rough set theory techniques than those used in the present paper.

Funding

Research funded by King Abdullah University of Science and Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

Research reported in this publication was supported by King Abdullah University of Science and Technology (KAUST). The author is greatly indebted to the anonymous reviewers for their useful comments and suggestions.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Angluin, D. Queries and concept learning. Mach. Learn. 1988, 2, 319–342.
  2. Pawlak, Z. Rough sets. Int. J. Parallel Program. 1982, 11, 341–356.
  3. Pawlak, Z. Rough Sets—Theoretical Aspects of Reasoning about Data; Theory and Decision Library: Series D; Kluwer: Dordrecht, The Netherlands, 1991; Volume 9.
  4. Pawlak, Z.; Skowron, A. Rudiments of rough sets. Inf. Sci. 2007, 177, 3–27.
  5. Chegis, I.A.; Yablonskii, S.V. Logical methods of control of work of electric schemes. Trudy Mat. Inst. Steklov 1958, 51, 270–360. (In Russian)
  6. Azad, M.; Chikalov, I.; Hussain, S.; Moshkov, M. Minimizing depth of decision trees with hypotheses. In Rough Sets—International Joint Conference, Proceedings of IJCRS 2021, Bratislava, Slovakia, 19–24 September 2021; Ramanna, S., Cornelis, C., Ciucci, D., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12872, pp. 123–133.
  7. Azad, M.; Chikalov, I.; Hussain, S.; Moshkov, M. Minimizing number of nodes in decision trees with hypotheses. In Proceedings of the 25th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2021), Szczecin, Poland, 8–10 September 2021; Watrobski, J., Salabun, W., Toro, C., Zanni-Merk, C., Howlett, R.J., Jain, L.C., Eds.; Elsevier: Amsterdam, The Netherlands, 2021; Volume 192, pp. 232–240.
  8. Azad, M.; Chikalov, I.; Hussain, S.; Moshkov, M. Sorting by decision trees with hypotheses (extended abstract). In Proceedings of the 29th International Workshop on Concurrency, Specification and Programming, CS&P 2021, Berlin, Germany, 27–28 September 2021; CEUR Workshop Proceedings; Schlingloff, H., Vogel, T., Eds.; CEUR-WS.org: Aachen, Germany, 2021; Volume 2951, pp. 126–130.
  9. Azad, M.; Chikalov, I.; Hussain, S.; Moshkov, M. Optimization of decision trees with hypotheses for knowledge representation. Electronics 2021, 10, 1580.
  10. Azad, M.; Chikalov, I.; Hussain, S.; Moshkov, M. Entropy-based greedy algorithm for decision trees using hypotheses. Entropy 2021, 23, 808.
  11. Angluin, D. Queries revisited. Theor. Comput. Sci. 2004, 313, 175–194.
  12. Littlestone, N. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Mach. Learn. 1988, 2, 285–318.
  13. Maass, W.; Turán, G. Lower bound methods and separation results for on-line learning models. Mach. Learn. 1992, 9, 107–145.
  14. Moshkov, M. Conditional tests. In Problemy Kibernetiki; Yablonskii, S.V., Ed.; Nauka Publishers: Moscow, Russia, 1983; Volume 40, pp. 131–170. (In Russian)
  15. Moshkov, M. On depth of conditional tests for tables from closed classes. In Combinatorial-Algebraic and Probabilistic Methods of Discrete Analysis; Markov, A.A., Ed.; Gorky University Press: Gorky, Russia, 1989; pp. 78–86. (In Russian)
  16. Moshkov, M. Time complexity of decision trees. In Transactions on Rough Sets III; Peters, J.F., Skowron, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3400, pp. 244–459.
  17. Moshkov, M. Test theory and problems of machine learning. In Proceedings of the International School-Seminar on Discrete Mathematics and Mathematical Cybernetics, Ratmino, Russia, 31 May–3 June 2001; MAX Press: Moscow, Russia, 2001; pp. 6–10.
  18. Pawlak, Z. Information systems theoretical foundations. Inf. Syst. 1981, 6, 205–218.
  19. Naiman, D.Q.; Wynn, H.P. Independence number and the complexity of families of sets. Discr. Math. 1996, 154, 203–216.
  20. Sauer, N. On the density of families of sets. J. Comb. Theory A 1972, 13, 145–147.
  21. Shelah, S. A combinatorial problem; stability and order for models and theories in infinitary languages. Pac. J. Math. 1972, 41, 241–261.
Table 1. Possible indicator vectors of infinite binary information systems.

  Row   $\mathcal{R}$   $\mathcal{D}$   $\mathcal{C}$
   1    0               0               0
   2    0               1               0
   3    0               1               1
   4    1               1               0
Table 2. Summary of Theorems 1–3.

  Class   $\mathcal{R}$  $\mathcal{D}$  $\mathcal{C}$  $h_U^{(1)}(n)$    $h_U^{(2)}(n)$    $h_U^{(3)}(n)$
  $V_1$   0              0              0              $n$               $n$               $n$
  $V_2$   0              1              0              $n$               $\Theta(\log n)$  $\Omega(\log n / \log\log n)$, $O(\log n)$
  $V_3$   0              1              1              $n$               $O(1)$            $O(1)$
  $V_4$   1              1              0              $\Theta(\log n)$  $\Theta(\log n)$  $\Omega(\log n / \log\log n)$, $O(\log n)$
Table 3. All 3-tuples from the set $\{0, 1\}^3$.

  Row   $\mathcal{R}$   $\mathcal{D}$   $\mathcal{C}$
   1    0               0               0
   2    0               1               0
   3    0               1               1
   4    1               1               0
   5    1               0               0
   6    0               0               1
   7    1               0               1
   8    1               1               1