Let $c$ be the number of classes. Unless otherwise specified, an example is a couple $(o, y)$ where $o$ is an observation described over $n$ variables, and $y$ its corresponding class among $\{1, 2, \ldots, c\}$; to each example $e$ is associated a weight $w(e)$, representing its appearance probability with respect to a learning sample $LS$ at our disposal. $LS$ is itself a subset of a whole domain which we denote $\mathcal{X}$. Obviously, we do not have entire access to $\mathcal{X}$ ($LS \subseteq \mathcal{X}$): in general, we even have $|LS| \ll |\mathcal{X}|$ ($|\cdot|$ denotes the cardinality; we suppose in all that follows that $\mathcal{X}$ is discrete with finite cardinality). In the particular case where $c = 2$, the two classes are noted ``$-$'' and ``$+$'', and called respectively the negative and positive class. The learning sample is then the union of two samples, noted $LS^-$ and $LS^+$, containing respectively the negative and positive examples. It is worthwhile to think of the positive examples as belonging to a subset of $\mathcal{X}$ containing all possible positive examples, usually called the target concept.
Part of our goal in machine learning is to build a reliable approximation of the true classification of the examples in $\mathcal{X}$, that is, a good approximation of the target concept, by using only the examples in $LS$. Good approximations shall have a high accuracy over $\mathcal{X}$; we do not have access to this quantity, but rather to an estimator of it: a more or less reliable accuracy computable over $LS$. We refer the reader to standard machine learning books [Mitchell1997] for further considerations about this issue.
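For concreteness, with the notation above (the hypothesis symbol $h$ and the hat notation are ours, added for illustration), this estimator is the weighted empirical accuracy
\[
\widehat{\mathrm{acc}}_{LS}(h) \; = \sum_{(o, y) \in LS \,:\, h(o) = y} w((o, y)) \:,
\]
i.e., the total weight of the examples of $LS$ that the hypothesis $h$ classifies correctly.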
A DC contains two parts:
- A set of unordered pairs (or rules) $\{(t_1, \vec{v}_1), (t_2, \vec{v}_2), \ldots, (t_r, \vec{v}_r)\}$, where each $t_j$ is a monomial (a conjunction of literals) over $\{x_1, \overline{x_1}, x_2, \overline{x_2}, \ldots, x_n, \overline{x_n}\}$ ($n$ being the number of description variables; each $x_i$ is a positive literal and each $\overline{x_i}$ is a negative literal), and each $\vec{v}_j$ is a vector in $\mathbb{R}^c$. For the sake of readability, this vectorial notation shall be kept throughout the paper, even for problems with only two classes. One might choose to add a single real rather than a two-component vector in that case.
- A default vector $\vec{v}_{\mathrm{def}}$ in $\mathbb{R}^c$. Again, in the two-class case, it is sufficient to replace $\vec{v}_{\mathrm{def}}$ by a default class in $\{-, +\}$.
For any observation $o$ and any monomial $t$, the proposition ``$o$ satisfies $t$'' is denoted by ``$o \models t$''. The opposite proposition ``$o$ does not satisfy $t$'' is denoted by ``$o \not\models t$''. The classification of any observation $o$ is made in the following way: define $\vec{v}(o)$ as follows:
\[
\vec{v}(o) \; = \;
\begin{cases}
\sum_{j \,:\, o \models t_j} \vec{v}_j & \text{if } o \models t_j \text{ for at least one } j \:,\\
\vec{v}_{\mathrm{def}} & \text{otherwise} \:.
\end{cases}
\]
The class assigned to $o$ is then:
\[
\mathop{\mathrm{argmax}}_{k \in \{1, 2, \ldots, c\}} \left(\vec{v}(o)\right)_k \:.
\]
In other words, if the maximal component of $\vec{v}(o)$ is unique, then its index gives the class assigned to $o$. Otherwise, we take the index of the maximal component of $\vec{v}(o)$ corresponding to the maximal component of $\vec{v}_{\mathrm{def}}$ (remaining ties are solved by a random choice among the maximal components).
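To make the decision rule concrete, the following minimal Python sketch implements it (the class name DecisionCommittee, the encoding of a monomial as a dictionary of required variable values, and the field names are our own illustrative choices, not notation from this paper):
\begin{verbatim}
import random

class DecisionCommittee:
    # rules: list of (monomial, vector) pairs; a monomial is a dict
    # mapping a variable index to the required boolean value, and each
    # vector has one component per class.
    def __init__(self, rules, default_vector):
        self.rules = rules
        self.default_vector = default_vector

    @staticmethod
    def satisfies(observation, monomial):
        # "o satisfies t" holds iff every literal of t holds in o.
        return all(observation[i] == val for i, val in monomial.items())

    def classify(self, observation):
        # Sum the vectors of all rules whose monomial is satisfied;
        # if no monomial is satisfied, fall back on the default vector.
        fired = [v for t, v in self.rules if self.satisfies(observation, t)]
        if fired:
            score = [sum(parts) for parts in zip(*fired)]
        else:
            score = list(self.default_vector)
        # Index of the maximal component; ties are broken first by the
        # default vector, then by a random choice.
        best = max(score)
        tied = [k for k, s in enumerate(score) if s == best]
        if len(tied) > 1:
            best_def = max(self.default_vector[k] for k in tied)
            tied = [k for k in tied if self.default_vector[k] == best_def]
        return random.choice(tied)
\end{verbatim}
Observations are encoded as sequences of Boolean values, one per description variable; a negative literal $\overline{x_i}$ corresponds to requiring the value False at index $i$.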
DC contains a subclass which is among the largest classes of Boolean formulas known to be PAC-learnable [Nock Gascuel1995]; however, this class is less interesting from a practical viewpoint, since its rules can be numerous and hard to interpret. Nevertheless, another subclass of DC [Nock Gascuel1995] presents an interesting compromise between representational power and interpretability. In this class, which is used by WIDC, each of the vector components is restricted to $\{-1, 0, +1\}$ and each monomial is present at most once. The values $-1$, $0$, $+1$ allow natural interpretations of the rules, each component being either in favor of the corresponding class ($+1$), neutral with respect to the class ($0$), or in disfavor of the corresponding class ($-1$). This subclass, to which we refer as DC$_{\{-1, 0, +1\}}$, suffers, as we now prove, from the same algorithmic drawbacks as DT [Hyafil Rivest1976] and DL [Nock Jappy1998]: even without restricting the components of the vectors, or with any restriction to a set containing at least one real value, the construction of small formulas with sufficiently high accuracy is hard. This is a clear motivation for using heuristics in decision committee induction.
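As a toy two-class illustration of this subclass (reusing the sketch given after the classification rule; the rules themselves are invented for the example):
\begin{verbatim}
# Classes: index 0 is "-", index 1 is "+"; components lie in {-1, 0, +1}.
dc = DecisionCommittee(
    rules=[({0: True, 2: False}, [-1, +1]),   # x1 AND not(x3): for "+", against "-"
           ({1: True},           [+1,  0])],  # x2: for "-", neutral on "+"
    default_vector=[+1, -1])                  # default leans towards "-"

print(dc.classify([True, False, False]))      # only the first rule fires -> 1 ("+")
\end{verbatim}
Each $\{-1, 0, +1\}$ component of a rule reads directly as a vote for, an abstention on, or a vote against the corresponding class, which is precisely what makes this subclass easy to interpret.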