Journal of Artificial Intelligence Research 1 (1994) 209-229                     Submitted 11/93; published 2/94


              Learning the Past Tense of English Verbs:

The Symbolic Pattern Associator vs.  Connectionist Models


Charles X. Ling                                                          ling@csd.uwo.ca
Department of Computer Science
The University of Western Ontario
London, Ontario, Canada N6A 5B7
                                           Abstract

        Learning the past tense of English verbs _ a seemingly minor aspect of language ac-
     quisition _ has generated heated debates since 1986, and has become a landmark task
     for testing the adequacy of cognitive modeling. Several artificial neural networks (ANNs)
     have been implemented, and a challenge for better symbolic models has been posed.  In
     this paper, we present a general-purpose Symbolic Pattern Associator (SPA) based upon
     the decision-tree learning algorithm ID3. We conduct extensive head-to-head comparisons
     on the generalization ability between ANN models and the SPA under different represen-
     tations. We conclude that the SPA generalizes the past tense of unseen verbs better than
     ANN models by a wide margin, and we offer insights as to why this should be the case.
     We also discuss a new default strategy for decision-tree learning algorithms.


1.  Introduction


Learning the past tense of English verbs, a seemingly minor aspect of language acquisition,
has generated heated debates since the first connectionist implementation in 1986 (Rumel-
hart & McClelland, 1986). Based on their results, Rumelhart and McClelland claimed that
the use and acquisition of human knowledge of language can best be formulated by ANN
(Artificial Neural Network) models without symbol processing that postulates the existence
of explicit symbolic representation and rules.  Since then, learning the past tense has be-
come a landmark task for testing the adequacy of cognitive modeling.  Over the years a
number of criticisms of connectionist modeling appeared (Pinker & Prince, 1988; Lachter &
Bever, 1988; Prasada & Pinker, 1993; Ling, Cherwenka, & Marinov, 1993). These criticisms
centered mainly upon the issues of high error rates and low reliability of the experimental re-
sults, the inappropriateness of the training and testing procedures, "hidden" features of the
representation and the network architecture that facilitate learning, as well as the opaque
knowledge representation of the networks.  Several subsequent attempts at improving the
original results with new ANN models have been made (Plunkett & Marchman, 1991; Cot-
trell & Plunkett, 1991; MacWhinney & Leinbach, 1991; Daugherty & Seidenberg, 1993).
Most notably, MacWhinney and Leinbach (1991) constructed a multilayer neural network
with backpropagation (BP), and attempted to answer early criticisms. On the other hand,
supporters of the symbolic approach believe that symbol structures such as parse trees,
propositions, etc., and the rules for their manipulations, are critical at the cognitive level,
while the connectionist approach may only provide an account of the neural structures
in which the traditional symbol-processing cognitive architecture is implemented (Fodor
& Pylyshyn, 1988).  Pinker (1991) and Prasada and Pinker (1993) argue that a proper


Oc1994 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
                                             Ling


accounting for regular verbs should be dependent upon production rules, while irregular
past-tense inflections may be generalized by ANN-like associative memory.
    The proper way of debating the adequacy of symbolic and connectionist modeling is by
contrasting competitive implementations. Thus, a symbolic implementation is needed that
can be compared with the ANN models. This is, in fact, a challenge posed by MacWhinney
and Leinbach (1991), who assert that no symbolic methods would work as well as their own
model. In a section titled "Is there a better symbolic model?" they claim:


          If there were some other approach that provided an even more accurate
      characterization of the learning process, we might still be forced to reject the
      connectionist approach, despite its successes. The proper way of debating con-
      ceptualizations is by contrasting competitive implementations. To do this in the
      present case, we would need a symbolic implementation that could be contrasted
      with the current implementation. (MacWhinney & Leinbach, 1991, page 153)


    In this paper, we present a general-purpose Symbolic Pattern Associator (SPA) based
upon the symbolic decision tree learning algorithm ID3 (Quinlan, 1986).  We have shown
(Ling & Marinov, 1993) that the SPA's results are much more psychologically realistic than
ANN models when compared with human subjects. On the issue of the predictive accuracy,
MacWhinney and Leinbach (1991) did not report important results of their model on unseen
regular verbs.  To reply to our criticism, MacWhinney (1993) re-implemented the ANN
model, and claimed that its raw generalization power is very close to that of our SPA. He
believed that this should be the case because both systems learn from the same data set:


          There is a very good reason for the equivalent performance of these two
      models.  [...]  When two computationally powerful systems are given the same
      set of input data, they both extract every bit of data regularity from that input.
      Without any further processing, there is only so much blood that can be squeezed
      out of a turnip, and each of our systems [SPA and ANN] extracted what they
      could. (MacWhinney, 1993, page 295)


    We will show that this is not the case; obviously there are reasons why one learning
algorithm outperforms another (otherwise why do we study different learning algorithms?).
The Occam's Razor Principle _ preferring the simplest hypothesis over more complex
ones _ creates preference biases for learning algorithms. A preference bias is a preference
order among competitive hypotheses in the hypothesis space. Different learning algorithms,
however, employ different ways of measuring simplicity, and thus concepts that they bias
to are different. How well a learning program generalizes depends upon the degree to which
the regularity of the data fits with its bias. We study and compare the raw generalization
ability of symbolic and ANN models on the task of learning the past tense of English
verbs. We perform extensive head-to-head comparisons between ANN and SPA, and show
the effects of different representations and encodings on their generalization abilities. Our
experimental results demonstrate clearly that


   1. the distributed representation, a feature that connectionists have been advocating,
      does not lead to better generalization when compared with the symbolic representa-
      tion, or with arbitrary error-correcting codes of a proper length;


                                               210
             Learning the Past Tense: Symbolic vs Connectionist Models


   2. ANNs cannot learn the identity mapping that preserves the verb stem in the past
      tense as well as the SPA can;


   3. a new representation suggested by MacWhinney (1993) improves the predictive accu-
      racy of both SPA and ANN, but SPA still outperforms ANN models;


   4. in sum, the SPA generalizes the past tense of unseen verbs better than ANN models
      by a wide margin.


In Section 5 we discuss reasons as to why the SPA is a better learning model for the
the task of English past-tense acquisition.  Our results support the view that many such
rule-governed cognitive processes should be better modeled by symbolic, rather than con-
nectionist, systems.


2.  Review of Previous Work


In this section, we review briefly the two main connectionist models of learning the past
tenses of English verbs, and the subsequent criticisms.


2.1  Rumelhart and McClelland's Model

Rumelhart and McClelland's model is based on a simple perceptron-based pattern associ-
ator interfaced with an input/output encoding/decoding network which allows the model
to associate verb stems with their past tenses using a special Wickelphone/Wickelfeature
phoneme-representation format. The learning algorithm is the classical perceptron conver-
gence procedure. The training and the testing sets are mutually disjoint in the experiments.
The errors made by the model during the training process broadly follow the U-shaped learn-
ing curve in the stages of acquisition of the English past tense exhibited by young children.
The testing sample consists of 86 "unseen" low frequency verbs (14 irregular and 72 regular)
that are not randomly chosen.  The testing sample results have a 93% error rate for the
irregulars. The regulars fare better with a 33.3% error rate. Thus, the overall error rate for
the whole testing sample is 43% _ 37 wrong or ambiguous past tense forms out of 86 tested.
Rumelhart and McClelland (1986) claim that the outcome of their experiment disconfirms
the view that there exist explicit (though inaccessible) rules that underlie human knowledge
of language.


2.2  MacWhinney and Leinbach's Model

MacWhinney and Leinbach (1991) report a new connectionist model on the learning of the
past tenses of English verbs.  They claim that the results from the new simulation are far
superior to Rumelhart and McClelland's results, and that they can answer most of the crit-
icisms aimed at the earlier model. The major departure from Rumelhart and McClelland's
model is that the Wickelphone/Wickelfeature representational format is replaced with the
UNIBET (MacWhinney, 1990) phoneme representational system which allows the assign-
ment of a single alphabetic/numerical letter to each of the total 36 phonemes. MacWhinney
and Leinbach use special templates with which to code each phoneme and its position in a
word. The actual input to the network is created by coding the individual phonemes as sets


                                               211
                                             Ling


of phonetic features in a way similar to the coding of Wickelphones as Wickelfeatures (cf
Section 4.3). The network has two layers of 200 "hidden" units fully connected to adjacent
layers. This number was arrived at through trial and error. In addition, the network has a
special-purpose set of connections that copy the input units directly onto the output units.

    Altogether, 2062 regular and irregular English verbs are selected for the experiment
_ 1650 of them are used for training (1532 regular and 118 irregular), but only 13 low
frequency irregular verbs are used for testing (MacWhinney & Leinbach, 1991, page 144).
Training the network takes 24,000 epochs.  At the end of training there still are 11 errors
on the irregular pasts.  MacWhinney and Leinbach believe that if they allow the network
to run for several additional days and give it additional hidden unit resources, it probably
can reach complete convergence (MacWhinney & Leinbach, 1991, page 151).  The only
testing error rate reported is based on a very small and biased test sample of 13 unseen
irregular verbs; 9 out of 13 are predicted incorrectly. They do not test their model on any
of the unseen regular verbs: "Unfortunately, we did not test a similar set of 13 regulars."
(MacWhinney & Leinbach, 1991, page 151).


2.3  Criticism of the Connectionist Models

Previous and current criticisms of the connectionist models of learning the past tenses of
English verbs center mainly on several issues.   Each of these issues is summarized in the
following subsections.


2.3.1  Error Rates

The error rate in producing the past tenses of the "unseen" test verbs is very high in
both ANN models, and important tests were not carried out in MacWhinney and Leinbach
(1991) model.  The experimental results indicate that neither model reaches the level of
adult competence. In addition, relatively large numbers of the errors are not psychologically
realistic since humans rarely make them.


2.3.2  Training and Testing Procedures

In both Rumelhart and McClelland's model, and MacWhinney and Leinbach's model, the
generalization ability is measured on only one training/testing sample. Further, the testing
sets are not randomly chosen, and they are very small.  The accuracy in testing irregular
verbs can vary greatly depending upon the particular set of testing verbs chosen, and thus
multiple runs with large testing samples are necessary to assess the true generalization
ability of a learning model. Therefore, the results of the previous connectionist models are
not reliable. In Section 4, we set up a reliable testing procedure to compare connectionist
models with our symbolic approach.  Previous connectionist simulations have also been
criticized for their crude training processes (for example, the sudden increase of regular
verbs in the training set), which create such behavior as the U-shaped learning curves.


2.3.3  Data Representation and Network Architecture

Most of the past criticisms of the connectionist models have been aimed at the data-
representation formats employed in the simulations.  Lachter and Bever (1988) pointed


                                               212
             Learning the Past Tense: Symbolic vs Connectionist Models


out that the results achieved by Rumelhart and McClelland's model would have been im-
possible without the use of several TRICS (The Representations It Crucially Supposes)
introduced with the adoption of the Wickelphone/Wickelfeature representational format.
MacWhinney and Leinbach claim that they have improved upon the earlier connectionist
model by getting rid of the Wickelphone/Wickelfeature representation format, and thus
to have responded to the many criticisms that this format entailed.  However, MacWhin-
ney and Leinbach also introduce several TRICS in their data-representation format.  For
example, instead of coding predecessor and successor phonemes as Wickelphones, they in-
troduce special templates with which to code positional information. This means that the
network will learn to associate patterns of phoneme/positions within a predetermined con-
sonant/vowel pattern.   Further, the use of restrictive templates gets rid of many English
verbs that  do not fit the chosen template.  This may bias the model in favour of shorter
verbs, predominantly of Anglo-Saxon origin, and against longer verbs, predominantly com-
posite or of Latin and French origin.  Another TRICS introduced is the phonetic feature
encoding (a distributed representation). It is not clear why phonetic features such as front,
centre, back, high, etc.  are chosen.  Do they represent finer grained "microfeatures" that
help to capture the regularities in English past tenses?  In Section 4.5, we will show that
the straightforward symbolic representation leads to better generalization than does the
carefully engineered distributed representation. This undermines the claimed advantages of
the distributed representation of connectionist models.
2.3.4  Knowledge Representation and Integration of Acquired Knowledge


Pinker and Prince (1988), and Lachter and Bever (1988) point out that Rumelhart and
McClelland try to model the acquisition of the production of the past tense in isolation
>from the rest of the English morphological system.  Rumelhart and McClelland, as well
as MacWhinney and Leinbach, assume that the acquisition process establishes a direct
mapping from the phonetic representation of the stem to the phonetic representation of
the past tense form. This direct mapping collapses some well-established distinctions such
as lexical item vs.  phoneme string, and morphological category vs.  morpheme.  Simply
remaining at the level of phonetic patterns,  it is impossible to express new categorical
information in first-order (predicate/function/variable) format. One of the inherent deficits
of the connectionist implementations is that there is no such thing as a variable for verb
stem, and hence there is no way for the model to attain the knowledge that one could
add suffix to a stem to get its past tense (Pinker & Prince, 1988, page 124).  Since the
acquired knowledge in such networks is a large weight matrix, which usually is opaque to the
human observer, it is unclear how the phonological levels processing that the connectionist
models carry out can be integrated with the morphological, lexical, and syntactical level
of processing. Neither Rumelhart and McClelland nor MacWhinney and Leinbach address
this issue.   In contrast to ANNs whose internal representations are entirely opaque, the
SPA can represent the acquired knowledge in the form of production rules, and allow for
further processing, resulting in higher-level categories such as the verb stem and the voiced
consonants, linguistically realistic production rules using these new categories for regular
verbs, and associative templates for irregular verbs (Ling & Marinov, 1993).


                                               213
                                             Ling


3.  The Symbolic Pattern Associator


We take up MacWhinney and Leinbach's challenge for a better symbolic model for learning
the past tense of English verbs, and present a general-purpose Symbolic Pattern Associator
(SPA)1 that can generalize the past tense of unseen verbs much more accurately than
connectionist models in this section.  Our model is symbolic for several reasons.  First,
the input/output representation of the learning program is a set of phoneme symbols,
which are the basic elements governing the past-tense inflection.   Second,  the learning
program operates on those phoneme symbols directly, and the acquired knowledge can be
represented in the form of production rules using those phoneme symbols as well.  Third,
those production rules at the phonological level can easily be further generalized into first-
order rules that use more abstract, high-level symbolic categories such as morphemes and
the verb stem (Ling & Marinov,  1993).  In contrast,  the connectionist models operate
on a distributed representation (phonetic feature vectors), and the acquired knowledge is
embedded in a large weight matrix; it is therefore hard to see how this knowledge can be
further generalized into more abstract representations and categories.


3.1  The Architecture of the Symbolic Pattern Associator

The SPA is based on C4.5 (Quinlan, 1993) which is an improved implementation of the ID3
learning algorithm (cf. (Quinlan, 1986)). ID3 is a program for inducing classification rules in
the form of decision trees from a set of classified examples. It uses information gain ratio as
a criterion for selecting attributes as roots of the subtrees. The divide-and-conquer strategy
is recursively applied in building subtrees until all remaining examples in the training set
belong to a single concept (class); then a leaf is labeled as that concept.  The information
gain guides a greedy heuristic search for the locally most relevant or discriminating attribute
that maximally reduces the entropy (randomness) in the divided set of the examples. The
use of this heuristic usually results in building small decision trees instead of larger ones
that also fit the training data.

    If the task is to learn to classify a set of different patterns into a single class of several
mutually exclusive categories, ID3 has been shown to be comparable with neural networks
(i.e., within about 5% range on the predictive accuracy) on many real-world learning tasks
(cf.  (Shavlik, Mooney, & Towell, 1991; Feng, King, Sutherland, & Henery, 1992; Ripley,
1992; Weiss & Kulikowski, 1991)). However, if the task is to classify a set of (input) patterns
into (output) patterns of many attributes, ID3 cannot be applied directly.  The reason is
that if ID3 treats the different output patterns as mutually exclusive classes, the number of
classes would be exponentially large and, more importantly, any generalization of individual
output attributes within the output patterns would be lost.

    To turn ID3 or any similar N-to-1 classification system into general purpose N-to-M
symbolic pattern associators, the SPA applies ID3 on all output attributes and combines
individual decision trees into a "forest", or set of trees. A similar approach was proposed for
dealing with the distributed (binary) encoding in multiclass learning tasks such as NETtalk
(English text-to-speech mapping) (Dietterich, Hild, & Bakiri, 1990).  Each tree takes as
input the set of all attributes in the input patterns, and is used to determine the value of
_______________________________________
 1. The SPA programs and relevant datasets can be obtained anonymously from ftp.csd.uwo.ca under
    pub/SPA/ .


                                               214
             Learning the Past Tense: Symbolic vs Connectionist Models


one attribute in its output pattern. More specifically, if a pair of input attributes ('1 to 'n )
and output attributes (!1 to !m ) is represented as:
        '1; :::; 'n  !  !1; :::; !m
then the SPA will build a total of m decision trees, one for each output attribute !i (1
i   m) taking all input attributes '1; :::; 'n per tree. Once all of m trees are built, the SPA
can use them jointly to determine the output pattern !1; :::; !m  from any input pattern
'1; :::; 'n .
    An important feature of the SPA is explicit knowledge representation. Decision trees for
output attributes can easily be transformed into propositional production rules (Quinlan,
1993).   Since entities of these rules are symbols with semantic meanings,  the acquired
knowledge often is comprehensible to the human observer. In addition, further processing
and integration of these rules can yield high-level knowledge (e.g., rules using verb stems)
(Ling & Marinov, 1993). Another feature of the SPA is that the trees for different output
attributes contain identical components (branches and subtrees) (Ling & Marinov, 1993).
These components have similar roles as hidden units in ANNs since they are shared in the
decision trees of more than one output attribute.  These identical components can also be
viewed as high-level concepts or feature combinations created by the learning program.


3.2  Default Strategies

An interesting research issue is how decision-tree learning algorithms handle the default
class.  A default class is the class to be assigned to leaves which no training examples are
classified into.  We call these leaves empty leaves.  This happens when the attributes have
many different values, or when the training set is relatively small. In these cases, during the
tree construction, only a few branches are explored for some attributes. When the testing
examples fall into the empty leaves, a default strategy is needed to assign classes to those
empty leaves.
    For easier understanding, we use the spelling form of verbs in this subsection to explain
how different default strategies work.  (In the actual learning experiment the verbs are
represented in phonetic form.) If we use consecutive left-to-right alphabetic representation,
the verb stems and their past tenses of a small training set can be represented as follows:


    a,f,f,o,r,d,`,`,`,`,`,`,`,`,`  =>  a,f,f,o,r,d,e,d,`,`,`,`,`,`,`
    e,a,t,`,`,`,`,`,`,`,`,`,`,`,`  =>  a,t,e,`,`,`,`,`,`,`,`,`,`,`,`
    l,a,u,n,c,h,`,`,`,`,`,`,`,`,`  =>  l,a,u,n,c,h,e,d,`,`,`,`,`,`,`
    l,e,a,v,e,`,`,`,`,`,`,`,`,`,`  =>  l,e,f,t,`,`,`,`,`,`,`,`,`,`,`


where _ is used as a filler for empty space. The left-hand 15 columns are the input patterns
for the stems of the verbs; the right-hand 15 columns are the output patterns for their
corresponding correct past tense forms.
    As we have discussed, 15 decision trees will be constructed, one for each output attribute.
The decision tree for the first output attribute can be constructed (see Figure 1 (a)) from
the following 4 examples:


    a,f,f,o,r,d,`,`,`,`,`,`,`,`,`  =>  a
    e,a,t,`,`,`,`,`,`,`,`,`,`,`,`  =>  a
    l,a,u,n,c,h,`,`,`,`,`,`,`,`,`  =>  l


                                               215
                                             Ling


    l,e,a,v,e,`,`,`,`,`,`,`,`,`,`  =>  l


where the last column is the classification of the first output attribute.  However, many
other branches (such as '1 = c in Figure 1 (a)) are not explored, since no training example
has that attribute value.  If a testing pattern has its first input attribute equal to c, what
class should it be assigned to?  ID3 uses the majority default.  That is, the most popular
class in the whole subtree under '1 is assigned to the empty leaves. In the example above,
either class a or l will be chosen since they each have 2 training examples. However, this is
clearly not the right strategy for this task since a verb such as create would be output as
l...... or a......, which is incorrect. Because it is unlikely for a small training set to have all
variations of attribute values, the majority default strategy of ID3 is not appropriate for
this task.
          Figure 1: (a) Passthrough default                     (b) Various default


    For applications such as verb past-tense learning, a new default heuristic _ passthrough
_ may be more suitable.  That is, the classification of an empty leaf should be the same
as the attribute value of that branch. For example, using the passthrough default strategy,
create will be output as c....... The passthrough strategy gives decision trees some first-order
flavor, since the production rules for empty leaves can be represented as If Attribute = X
then Class = X where X can be any unused attribute values.  Passthrough is a domain-
dependent heuristic strategy because the class labels may have nothing to do with the
attribute values in other applications.
    Applying the passthrough strategy alone, however, is not adequate for every output
attribute.  The endings of the regular past tenses are not identical to any of the input
patterns, and the irregular verbs may have vowel and consonant changes in the middle of
the verbs. In these cases, the majority default may be more suitable than the passthrough.
In order to choose the right default strategy _ majority or passthrough _ a decision is
made based upon the training data in the corresponding subtree. The SPA first determines
the majority class, and counts the number of examples from all subtrees that belong to
this class.  It then counts the number of examples in the subtrees that coincide with the


                                               216
             Learning the Past Tense: Symbolic vs Connectionist Models


passthrough strategy. These two numbers are compared, and the default strategy employed
by more examples is chosen.  For instance, in the example above (see Figure 1 (a)), the
majority class is l (or a) having 2 instances. However, there are 3 examples coinciding with
the passthrough default:  two l and one a.  Thus the passthrough strategy takes over, and
assigns all empty leaves at this level. The empty attribute branch c would then be assigned
the class c.  Note that the default strategy for empty leaves of attribute X depends upon
training examples falling into the subtree rooted at X. This localized method ensures that
only related objects have an influence on calculating default classes.  As a result, the SPA
can adapt the default strategy that is best suited at different levels of the decision trees. For
example, in Figure 1 (b), two different default strategies are used at different levels in the
same tree. We use the SPA with the adaptive default strategy throughout the remainder of
this paper. Note that the new default strategy is not a TRICS in the data representation;
rather, it represents a bias of the learning program.  Any learning algorithm has a default
strategy independent of the data representation. The effect of different data representations
on generalization is discussed in Sections 4.3, 4.5, and 4.6.  The passthrough strategy can
be imposed on ANNs as well by adding a set of copy connections between the input units
and the twin output units. See Section 4.4 for detail.


3.3  Comparisons of Default Strategies of ID3, SPA, and ANN


Which default strategy do neural networks tend to take in generalizing default classes
when compared with ID3 and SPA? We conducted several experiments to determine neural
networks' default strategy.  We assume that the domain has only one attribute X which
may take values a, b, c, and d. The class also can be one of the a, b, c, and d. The training
examples have attribute values a, b, and c but not d _ it is reserved for testing the default
class.  The training set contains multiple copies of the same example to form a certain
majority class.  Table 1 shows two sets of training/testing examples that we used to test
and compare default strategies of ID3, SPA and neural networks.


          _____________________________________________________________________________
          |______________Data_set_1______________|_|__________Data_set_2______________|_

          |__________Training_examples__________|_|_______Training_examples__________|_

          |__Values_of_X__|Class__|#_of_copies__|_|Values_of_X__|Class__|_#_of_copies__|

          |       a       |  a    |     10      | |     a       |   c    |    10      |
          |       b       |  b    |     2       | |     b       |   b    |     6       |
          |_______c_______|__c____|_____3_______|_|_____c_______|___c____|_____7_______|

          |___________Testing_example___________|_|________Testing_example___________|_

          |_______d_______|__?____|_____1_______|_|_____d_______|___?____|_____1_______|

          Table 1: Two data sets for testing default strategies of various methods.

    The classification of the testing examples by ID3 and SPA is quite easy to decide. Since
ID3 takes only the majority default, the output class is a (with 10 training examples) for
the first data set, and c (with 17 training examples) for the second data set. For SPA, the
number of examples using passthrough is 15 for the first data set, and 13 for the second


                                               217
                                             Ling


data set.  Therefore, the passthrough strategy wins in the first case with the output class
d, and the majority strategy wins in the second case with the output class c.
    For neural networks, various coding methods were used to represent values of the at-
tribute X.  In the dense coding, we used 00 to represent a, 01 for b, 10 for c and 11 for
d. We also tried the standard one-per-class encoding, and real number encoding (0.2 for a,
0.4 for b, 0.6 for c and 0.8 for d). The networks were trained using as few hidden units as
possible in each case.  We found that in most cases the classification of the testing exam-
ple is not stable; it varies with different random seeds that initialize the networks.  Table
2 summarises the experimental results.  For ANNs, various classifications obtained by 20
different random seeds are listed with the first ones occurring most frequently.  It seems
that not only do neural networks not have a consistent default strategy, but also that it
is neither the majority default as in ID3 nor the passthrough default as in SPA. This may
explain why connectionist models cannot generalize unseen regular verbs well even when
the training set contains only regular verbs (see Section 4.4). The networks have difficulty
(or are underconstrained) in generalizing the identity mapping that copies the attributes of
the verb stems into the past tenses.


                       ___________________________________________________
                       |____The_classification_for_the_testing_example____|

                       |_______________________|Data_set_1__|_Data_set_2__|

                       |__ID3__________________|_____a______|______c______ |

                       |__SPA__________________|_____d______|______c_______|

                       |__ANN,_dense_coding__|______b;_c_____|_____b______ |

                       |__ANN,_one-per-class__|____b;_c;_a____|__c;_b_____|

                       |__ANN,_real_numbers__|______c;_d_____|___d;_c_____|

            Table 2: Default strategies of ID3, SPA and ANN on two data sets.


4.  Head-to-head Comparisons between Symbolic and ANN Models


In this section, we perform a series of extensive head-to-head comparisons using several
different representations and encoding methods, and demonstrate that the SPA generalizes
the past tense of unseen verbs better than ANN models do by a wide margin.


4.1  Format of the data

Our verb set came from MacWhinney's original list of verbs.   The set contains about
1400 stem/past tense pairs.   Learning is based upon the phonological UNIBET repre-
sentation (MacWhinney, 1990), in which different phonemes are represented by different
alphabetic/numerical letters. There is a total of 36 phonemes. The source file is transferred
into the standard format of pairs of input and output patterns. For example, the verbs in
Table 3 are represented as pairs of input and output patterns (verb stem => past tense):


    6,b,&,n,d,6,n    =>    6,b,&,n,d,6,n,d
    I,k,s,E,l,6,r,e,t   =>   I,k,s,E,l,6,r,e,t,I,d


                                               218
             Learning the Past Tense: Symbolic vs Connectionist Models


    6,r,3,z   =>  6,r,o,z
    b,I,k,6,m   =>   b,I,k,e,m


See Table 3 (The original verb set is available in Online Appendix 1).  We keep only one
form of the past tense among multiple past tenses (such as hang-hanged and hang-hung)
in the data set.  In addition, no homophones exist in the original data set.  Consequently,
there is no noise (contradictory data which have the same input pattern but different output
patterns) in the training and testing examples.  Note also that information as to whether
the verb is regular or irregular is not provided in training/testing processes.


              ____________________________________________________________________
              |  base (stem)  |    UNIBET    |       b=base     |  1 = irregular  |
              |__spelling_form__|phonetic_form__|d=_past_tense__|___0_=_regular__|__

              |    abandon    |    6b&nd6n    |         b        |       0        |
              |   abandoned   |   6b&nd6nd   |          d        |       0        |
              |     benefit     |  bEn6fIt     |        b        |       0        |
              |    benefited    | bEn6fItId    |        d        |       0        |
              |      arise      |    6r3z       |       b        |       0        |
              |      arose      |    6roz       |       d        |       1        |
              |    become    |      bIk6m      |        b        |       0        |
              |    became    |      bIkem      |        d        |       1        |
              |______......______|______________|________________|________________|

                    Table 3: Source file from MacWhinney and Leinbach.


4.2  Experiment Setup

To guarantee unbiased and reliable comparison results, we use training and testing samples
randomly drawn in several independent runs.  Both SPA and ANN are provided with the
same sets of training/testing examples for each run.  This allows us to achieve a reliable
estimate of the inductive generalization capabilities of each model on this task.
    The neural network program we used is a package called Xerion, which was developed
at the University of Toronto.  It has several more sophisticated search mechanisms than
the standard steepest gradient descent method with momentum.  We found that training
with the conjugate-gradient method is much faster than with the standard backpropagation
algorithm. Using the conjugate-gradient method also avoids the need to search for proper
settings of parameters such as the learning rate.  However, we do need to determine the
proper number of hidden units.  In the experiments with ANNs,  we first tried various
numbers of hidden units and chose the one that produced the best predictive accuracy in
a trial run, and then use the network with that number of hidden units in the actual runs.
The SPA, on the other hand, has no parameters to adjust.
    One major difference in implementation between ANNs and SPA is that SPA can take
(symbolic) phoneme letters directly while ANNs normally encode each phoneme letter to
binary bits.  (Of course, SPA also can apply to the binary representation).  We studied
various binary encoding methods and compared results with SPA using symbolic letter


                                               219
                                             Ling


representation. Since outputs of neural networks are real numbers, we need to decode the
network outputs back to phoneme letters. We used the standard method of decoding: the
phoneme letter that has the minimal real-number Hamming distance (smallest angle) with
the network outputs was chosen.  To see how binary encoding affects the generalization,
the SPA was also trained with the binary representation.  Since the SPA's outputs are
binary, the decoding process may tie with several phoneme letters.  In this case, one of
them is chosen randomly. This reflects the probability of the correct decoding at the level
of phoneme letters. When all of the phoneme letters are decoded, if one or more letters are
incorrect, the whole pattern is counted as incorrect at the word level.


4.3  Templated, Distributed Representation

This set of experiments was conducted using the distributed representation suggested by
MacWhinney and Leinbach (1991). According to MacWhinney and Leinbach, the output is
a left-justified template in the format of CCCVVCCCVVCCCVVCCC, where C stands for
consonant and V for vowel space holders.  The input has two components:  a left-justified
template in the same format as the input, and a right-justified template in the format of
VVCCC. For example, the verb bet, represented in UNIBET coding as bEt, is shown in the
template format as follows (_ is the blank phoneme):

INPUT
bEt            b``E`t````````````        `E``t
template:    CCCVVCCCVVCCCVVCCC        VVCCC
                (left-justified)          (right-justified)
OUTPUT
bEt            b``E`t````````````
template:    CCCVVCCCVVCCCVVCCC
                (left-justified)

    A specific distributed representation _ a set of (binary) phonetic features _ is used
to encode all phoneme letters for the connectionist networks.  Each vowel (V in the above
templates) is encoded by 8 phonetic features (front, centre, back, high, low, middle, round,
and diphthong) and each consonant (C in the above templates) by 10 phonetic features
(voiced, labial, dental, palatal, velar, nasal, liquid, trill, fricative and interdental).  Note
that because the two feature sets of vowels and consonants are not identical, templates are
needed in order to decode the right type of the phoneme letters from the outputs of the
network.
    In our experimental comparison,  we decided not to use the right-justified template
(VVCCC) since this information is redundant.  Therefore, we used only the left-justified
template (CCCVVCCCVVCCCVVCCC) in both input and output.  (The whole verb set
in the templated phoneme representation is available in Online Appendix 1.  It contains
1320 pairs of verb stems and past tenses that fit the template).  To ease implementation,
we added two extra features that always were assigned to 0 in the vowel phonetic feature
set.  Therefore, both vowels and consonants were encoded by 10 binary bits.  The ANN
thus had 18 x 10 = 180 input bits and 180 output bits, and we found that one layer of 200
hidden units (same as MacWhinney (1993) model) reached the highest predictive accuracy
in a trial run. See Figure 2 for the network architecture used.


                                               220
             Learning the Past Tense: Symbolic vs Connectionist Models


             Figure 2: The architecture of the network used in the experiment.


    The SPA was trained and tested on the same data sets but with phoneme letters directly;
that is, 18 decision trees were built for each of the phoneme letters in the output templates.
To see how phonetic feature encoding affects the generalization, we also trained the SPA
with the the same distributed representation _ binary bit patterns of 180 input bits and
180 output bits _ exactly the same as those in the ANN simulation. In addition, to see how
the "symbolic" encoding works in ANN, we also train another neural network (with 120
hidden units) with the "one-per-class" encoding. That is, each phoneme letter (total of 37;
36 phoneme letters plus one for blank) is encoded by 37 bits, one for each phoneme letter.
We used 500 verb pairs (including both regular and irregular verbs) in the training and
testing sets.  Sampling was done randomly without replacement, and training and testing
sets were disjoint. Three runs of SPA and ANN were conducted, and both SPA and ANN
were trained and tested on the same data set in each run. Training reached 100% accuracy
for SPA and around 99% for ANN.

    Testing accuracy on novel verbs produced some interesting results.   The ANN model
and the SPA using the distributed representation have very similar accuracy, with ANN
slightly better.  This may well be caused by the binary outputs of SPA that suppress the
fine differences in prediction.  On the other hand, the SPA using phoneme letters directly
produces much higher accuracy on testing.  The SPA outperforms neural networks (with
either distributed or one-per-class representations) by 20 percentage points!  The testing
results of ANN and SPA can be found in Table 4.  Our findings clearly indicate that the
SPA using symbolic representation leads to much better generalization than ANN models.


4.4  Learning Regular Verbs

Predicting the past tense of an unseen verb, which can be either regular or irregular, is
not an easy task.  Irregular verbs are not learned by rote as traditionally thought since


                                               221
                                             Ling


  _____________________________________________________________________________________________
  |         Distributed representation         | |         Symbolic representation           |
  |___ANN:_%_Correct___|_|___SPA:_%_Correct___|_|__ANN:_%_Correct___|_|____SPA:_%_Correct___|__

  |__Reg__|Irrg__|Comb__|_|Reg__|_Irrg__|Comb__|_|Reg__|_Irrg__|Comb__|_|_Reg__|_Irrg__|Comb__|_

  |__65.3__|14.6__|60.4___|6|2.2__|18.8__|58.0___|6|3.3__|18.8__|59.2___|8|3.0__|29.2__|77.8___|

  |__59.7__|8.6__|53.8___|_|57.9__|8.2__|_52.2___|5|8.8__|10.3__|53.2___|8|3.3__|22.4__|76.2___|

  |__60.0__|16.0__|55.6___|5|8.0__|8.0__|_53.0___|5|8.7__|16.0__|54.4___|8|0.9__|20.0__|74.8___|_

  |__61.7__|13.1__|56.6___|5|9.4__|11.7__|54.4___|6|0.3__|15.0__|55.6___|8|2.4__|23.9__|76.3___|


Table 4:  Comparisons of testing accuracy of SPA and ANN with distributed and symbolic
          representations.

children and adults occasionally extend irregular inflection to irregular-sounding regular
verbs or pseudo verbs (such as cleef _ cleft) (Prasada & Pinker, 1993). The more similar
the novel verb is to the cluster of irregular verbs with similar phonological patterns, the
more likely the prediction of an irregular past-tense form. Pinker (1991) and Prasada and
Pinker (1993) argue that regular past tenses are governed by rules, while irregulars may
be generated by the associated memory which has this graded effect of irregular past-tense
generalization.  It is would be interesting, therefore, to compare SPA and ANN on the
past-tense generalization of regular verbs only. Because both SPA and ANN use the same,
position specific, representation, learning regular past tenses would require learning different
suffixes2 at different positions, and to learn the identity mapping that copies the verb stem
to the past tenses for verbs of different lengths.

    We used the same templated representation as in the previous section, but both training
and testing sets contained only regular verbs. Again samples were drawn randomly without
replacement.  To maximize the size of the testing sets, testing sets simply consisted of all
regular verbs that were not sampled in the training sets. The same training and testing sets
were used for each of the following methods compared.  To see the effect of the adaptive
default strategy (as discussed in Section 3.2) on generalization, the SPA with the majority
default only and with the adaptive default were both tested. The ANN models were similar
to those used in the previous section (except with 160 one-layer hidden units, which turned
out to have the best predictive accuracy in a test run). The passthrough default strategy can
be imposed on neural networks by adding a set of copy connections that connect directly
>from the input units to the twin output units.  MacWhinney and Leinbach (1991) used
such copy connections in their simulation. We therefore tested the networks with the copy
connection to see if generalization would be improved as well.

    The results on the predictive accuracy of the SPA and ANNs on one run with with
randomly sampled training and testing sets are summarized in Table 5.  As we can see,
the SPA with the adaptive default strategy, which combines the majority and passthrough
default, outperforms the SPA with only the majority default strategy used in ID3.  The
_______________________________________
 2. In phonological form there are three different suffixes for regular verbs. When the verb stem e*
 *nds with
    t or d (UNIBET phonetic representations), then the suffix is Id. For example, extend _ extended*
 * (in
    spelling form). When the verb stem ends with a unvoiced consonant, the suffix is t. For example*
 *, talk
    _ talked.  When the verb stem ends with a voiced consonant or vowel, the suffix is d.  For exam*
 *ple,
    solve _ solved.


                                               222
             Learning the Past Tense: Symbolic vs Connectionist Models


ANNs with copy connections do generalize better than the ones without.  However, even
ANN models with copy connections have a lower predictive accuracy than the SPA (ma-
jority).  In addition, the differences in the predictive accuracy are larger with smaller sets
of training examples.  Smaller training sets make the difference in testing accuracy more
evident.  When the training set contains 1000 patterns (out of 1184), the testing accuracy
becomes very similar, and would approach asymptotically to 100% with larger training sets.
Upon examination, most of the errors made in ANN models occur in the identity mapping
(i.e., strange phoneme change and drop); the verb stems cannot be preserved in the past
tense if the phonemes are not previously seen in the training examples. This contradicts the
findings of Prasada and Pinker (1993), which show that native English speakers generate
regular suffix-adding past tenses equally well with unfamiliar-sounding verb stems (as long
as these verb stems do not sound close to irregular verbs). This also indicates that the bias
of the ANN learning algorithms is not suitable to this type of task. See further discussion
in Section 5.


   __________________________________________________________________________________________
   |  Training _|_|_____________________Percent_correct_on_testing_________________________|_
   |____size____|S|PA_(adaptive)__|_SPA_(majority)__|_|ANN_(copy_con.)__|___ANN_(normal)__|__

   |_____50_____|_|____55.4________|______30.0________|_|_____14.6________|_______7.3________|

   |____100____|_|_____72.9________|______58.6________|_|_____34.6________|_______24.9_______|

   |____300____|_|_____87.0________|______83.7________|_|_____59.8________|_______58.2_______|

   |____500____|_|_____92.5________|______89.0________|_|_____82.6________|_______67.9_______|

   |____1000____|_|____93.5________|______92.4________|_|_____92.0________|_______87.3_______|

          Table 5: Predictive accuracy on learning the past tense of regular verbs


4.5  Error Correcting Codes


Dietterich and Bakiri (1991) reported an increase in the predictive accuracy when error-
correcting codes of large Hamming distances are used to encode values of the attributes.
This is because codes with larger Hamming distance (d) allow for correcting fewer than d=2
bits of errors. Thus, learning programs are allowed to make some mistakes at the bit level
without their outputs being misinterpreted at the word level.

    We wanted to find if performances of the SPA and ANNs are improved with the error-
correcting codes encoding all of the 36 phonemes. We chose error-correcting codes ranging
>from ones with small Hamming distance to ones with very large Hamming distance (using
the BHC codes, see Dietterich and Bakiri (1991)).  Because the number of attributes for
each phoneme is too large, the data representation was changed slightly for this experiment.
Instead of 18 phoneme holders with templates, 8 consecutive, left-to-right phoneme holders
were used. Verbs with stems or past tenses of more than 8 phonemes were removed from the
training/testing sets.  (The whole verb set in the this representation is available in Online
Appendix 1. It contains 1225 pairs of verb stems and past tenses whose lengths are shorter
than 8). Both SPA and ANN take exactly the same training/testing sets, each contains 500
pairs of verb stems and past tenses, with the error-correcting codes encoding each phoneme


                                               223
                                             Ling


letter. Still, training networks with 92-bit or longer error-correcting codes takes too long to
run (there are 8 x 92 = 736 input attributes and 736 output attributes).  Therefore, only
two runs with 23- and 46-bit codes were conducted. Consistent with Dietterich and Bakiri
(1991)'s findings, we found that the testing accuracy generally increases when the Hamming
distance increases.    However, we also observed that the testing accuracy decreases very
slightly when the codes become too long. The accuracy using 46-bit codes (with Hamming
distance of 20) reaches the maximum value (77.2%), which is quite close to the accuracy
(78.3%) of SPA using the direct phoneme letter representation. It seems there is a trade-off
between tolerance of errors with large Hamming distance and difficulty in learning with
longer codes. In addition, we found the testing accuracy of ANNs to be lower than the one
of SPA for both 23 bit- and 46-bit error-correcting codes.  The results are summarized in
Table 6.
     ________________________________________________________________________________________
     |_____ANN_____|___Hamming_Distance__|____Correct_at_bit_level__|Correct_at_word_level__|_

     |__23-bit_codes__|_________10__________|________93.5%________|__________65.6%__________|_

     |__46-bit_codes__|_________20__________|________94.1%________|__________67.4%__________|__

     |______SPA______|_Hamming_Distance__|____Correct_at_bit_level__|Correct_at_word_level__|_

     |__23-bit_codes__|_________10__________|________96.3%________|__________72.4%__________|_

     |__46-bit_codes__|_________20__________|________96.3%________|__________77.2%__________|_

     |__92-bit_codes__|_________40__________|________96.1%________|__________75.6%__________|_

     |__127-bit_codes__|________54__________|________96.1%________|__________75.4%__________|_


Table 6:  Comparisons of the testing accuracy of SPA and ANNs with error-correcting codes
    Our results in this and the previous two subsections undermine the advantages of the
distributed representation of ANNs, a unique feature advocated by connectionists. We have
demonstrated that, in this task, the distributed representation actually does not allow for
adequate generalization.  Both SPA using direct symbolic phoneme letters and SPA with
error-correcting codes outperform ANNs with distributed representation by a wide margin.
However, neither phoneme symbols nor bits in the error-correcting codes encode, implicitly
or explicitly, any micro-features as in the distributed representation.  It may be that the
distributed representation used was not optimally designed.  Nevertheless, straightforward
symbolic format requires little representation engineering compared with the distributed
representation in ANNs.


4.6  Right-justified, Isolated Suffix Representation

MacWhinney and Leinbach (1991) did not report important results of the predictive ac-
curacy of their model on unseen regular verbs.  In his reply (MacWhinney, 1993) to our
paper (Ling & Marinov, 1993), MacWhinney re-implemented the ANN model. In his new
implementation, 1,200 verb stem and past-tense pairs were in the training set, among which
1081 were regular and 119 were irregular.  Training took 4,200 epochs, and reached 100%
correct on regulars and 80% on irregulars. The testing set consisted of 87 regulars and 15
irregulars. The percent correct on testing at epoch 4,200 was 91% for regulars and 27% for
irregulars, with a combined 80.0% on the testing set.  MacWhinney claimed that the raw


                                               224
             Learning the Past Tense: Symbolic vs Connectionist Models


generalization power of ANN model is very close to that of our SPA. He believes that this
should be the case simply because both systems were trained on the same data set.

    We realize (via private communication) that a new representation used in MacWhinney's
recent implementation plays a critical role in the improved performance. In MacWhinney's
new representation,  the input (for verb stems) is coded by the right-justified template
CCCVVCCCVVCCCVVCCC. The output contains two parts:  a right-justified template
that is the same as the one in the input, and a coda in the form of VVCCC. The right-
justified template in the output is used to represent the past tense without including the
suffix for the regular verbs.  The suffix of the regular past tense always stays in the coda,
which is isolated from the main, right-justified templates. For the irregular past tense, the
coda is left empty. For example, the input and output templated patterns for the past tense
of verbs in Table 3 are represented as:


INPUT                       OUTPUT
(right-justified)        (right-justified)    (suffix  only)
CCCVVCCCVVCCCVVCCC       CCCVVCCCVVCCCVVCCC   VVCCC
```6`b``&`nd`6`n``       ```6`b``&`nd`6`n``   ``d``  (for  abandon-abandoned)
b``E`n``6`f``I`t``       b``E`n``6`f``I`t``   I`d``  (for  benefit-benefited)
````````6`r``3`z``       ````````6`r``o`z``   `````  (for  arise-arose)
`````b``I`k``6`m``       `````b``I`k``e`m``   `````  (for  become-became)


    Such data representation clearly facilitates learning.  For the regular verbs, the output
patterns are always identical to the input patterns. In addition, the verb-ending phoneme
letters always appear at a few fixed positions (i.e., the right most VVCCC section in the
input template) due to the right-justified, templated representation. Furthermore, the suffix
always occupies the coda, isolated from the right-justified templates.

    We performed a series of experiments to see how much improvement we could accom-
plish using the new representation over MacWhinney's recent ANN model and over the
left-justified representation discussed in Section 4.3. Our SPA (with an averaged predictive
accuracy of 89.0%) outperforms MacWhinney's recent ANN implementation (with the pre-
dictive accuracy of 80.0%) by a wide margin.  In addition, the predictive accuracy is also
improved from an average of 76.3% from the left-justified representation to 82.8% of the
right-justified, isolated suffix one. See results in Table 7.


5.  General Discussion and Conclusions


Two factors contribute to the generalization ability of a learning program.  The first is
the data representation, and the other is the bias of the learning program.  Arriving at
the right, optimal, representation is a difficult task.  As argued by Prasada and Pinker
(1993), regular verbs should be represented in a coarse grain in terms of the verb stem
and suffixes;  while irregular verbs in a finer grain in terms of phonological properties.
Admittedly, SPA works uniformly at the level of phoneme letters, as ANNs do.  However,
because SPA produces simple production rules that use these phoneme letters directly, those
rules can be further generalized to first-order rules with new representations such as stems
and the voiced consonants which can be used across the board in other such rule-learning
modules (Ling & Marinov, 1993).  This is one of the major advantages over ANN models.


                                               225
                                             Ling


     ________________________________________________________________________________________
     |          _|Predictive_accuracy_with_right-justified,_isolated_suffix_representation__|
     |          _|________________SPA_________________|____MacWhinney's_ANN_model____|______
     |           |training/testing  |training/testing  |         training/testing          |
     |___________|____500/500______|_____1200/102_____|______________1200/102______________|_

     |___Run_1___|______81.3________|______89.2________|                                    |
     |___Run_2___|______84.1________|______90.4________|                                    |
     |___Run_3___|______83.1________|______87.4________|____________________________________|_

     |__Average__|______82.8________|______89.0________|__________80.0_(one_run)___________|_


Table 7:  Comparisons of testing accuracy of SPA and ANN (with right-justified, isolated
          suffix representation)

It seems quite conceivable that children acquire these high-level concepts such as stems
and voiced consonants through learning noun plurals, verb past tense, verb third-person
singular, comparative adjectives, and so on.  With a large weight matrix as the result of
learning, it is hard to see how this knowledge can be further generalized in ANN models
and shared in other modules.

    Even with exactly the same data representation, there exist some learning tasks that
symbolic methods such as the SPA generalize categorically better than ANNs.  The con-
verse also is true.  This fact reflects the different inductive biases of the different learning
algorithms. The Occam's Razor Principle _ preferring the simplest hypothesis over more
complex ones _ creates a preference bias, a preference of choosing certain hypotheses over
others in the hypothesis space. However, different learning algorithms choose different hy-
potheses because they use different measurements for simplicity.  For example, among all
possible decision trees that fit the training examples, ID3 and SPA induce simple decision
trees instead of complicated ones.  Simple decision trees can be converted to small sets of
production rules.   How well a learning algorithm generalizes depends upon the degree to
which the underlying regularities of the target concept fit its bias.  In other words, if the
underlying regularities can be represented compactly in the format of hypotheses produced
by the learning algorithm, the data can be generalized well, even with a small set of training
examples.  Otherwise, if the underlying regularities only have a large hypothesis, but the
algorithm is looking for compact ones (as per the Occam's Razor Principle), the hypothe-
ses inferred will not be accurate. A learning algorithm that searches for hypotheses larger
than necessary (i.e., that does not use the Occam's Razor Principle) is normally "under-
constrained"; it does not know, based on the training examples only, which of the many
competitive hypotheses of the large size should be inferred.

    We also can describe the bias of a learning algorithm by looking at how training examples
of different classes are separated in the n-dimensional hyperspace where n is the number
of attributes.  A decision node in a decision tree forms a hyperplane as described by a
linear function such as X = a.  Not only are these hyperplanes perpendicular to the axis,
they are also partial-space hyperplanes that extend only within the subregion formed by
the hyperplanes of their parents' nodes. Likewise, hidden units with a threshold function in
ANNs can be viewed as forming hyperplanes in the hyperspace.  However, unlike the ones
in the decision trees, they need not be perpendicular to any axis, and they are full-space


                                               226
             Learning the Past Tense: Symbolic vs Connectionist Models


hyperplanes that extend through the whole space. If ID3 is applied to the concepts that fit
ANN's bias, especially if their hyperplanes are not perpendicular to any axis, then many
zigzag hyperplanes that are perpendicular to axes would be needed to separate different
classes of the examples. Hence, a large decision tree would be needed, but this does not fit
ID3's bias. Similarly, if ANN learning algorithms are applied to the concepts that fit ID3's
bias, especially if their hyperplanes form many separated, partial-space regions, then many
hidden units may be needed for these regions.

    Another major difference between ANNs and ID3 is that ANNs have a larger variation
and a weaker bias (cf.  (Geman, Bienenstock, & Doursat, 1992)) than ID3.  Many more
Boolean functions (e.g., linearly separable functions) can fit a small network (e.g., one with
no hidden units) than they can a small decision tree.  This is sometimes attributed to the
claimed versatility and flexibility of ANNs; they can learn (but not necessarily predict reli-
ably well) many functions, while symbolic methods are brittle. However, it is my belief that
we humans are versatile, not because we have a learning algorithm with a large variation,
but rather because we have a set of strong-biased learning algorithms, and we can somehow
search in the bias space and add new members into the set for the new learning tasks. Sym-
bolic learning algorithms have clear semantic components and explicit representation, and
thus we can more easily construct strong-based algorithms motivated from various specific
learning tasks. The adaptive default strategy in the SPA is such an example. On the other
hand, we still largely do not know how to effectively strengthen the bias of ANNs for many
specific tasks (such as the identity mapping, k-term DNF, etc.).  Some techniques (such
as adding copy connections and weight decaying) exist, but their exact effects on biasing
towards classes of functions are not clear.

    From our analyses (Ling & Marinov, 1993), the underlying regularities governing the
inflection of the past tense of English verbs do form a small set of production rules with
phoneme letters.  This is especially so for regular verbs; all the rules are either identity
rules or the suffix-adding rules.  For example, decision trees can be converted into a set of
precedence-ordered production rules with more complicated rules (rules with more condi-
tions) listed first. As an example, using consecutive, left-to-right phonetic representation, a
typical suffix-adding rule for verb stems with 4 phoneme letters (such as talk _ talked) is:
        If '4 = k and '5 = _, then !5 = t
That is, if the fourth input phoneme is k and the fifth is blank (i.e., if we are at a verb
ending) then the fifth output phoneme is t. On the other hand, the identity-mapping rules
have only one condition. A typical identity rule looks like:
        If '3 = l, then !3 = l
In fact, the passthrough default strategy allows all of the identity-mapping rules to be rep-
resented in a simple first-order format:
        If '3 = X, then !3 = X
where X can be any phoneme.  Clearly, the knowledge of forming regular past tenses can
thus be expressed in simple, conjunctive rules which fit the bias of the SPA (ID3), and
therefore, the SPA has a much better generalization ability than the ANN models.

    To conclude, we have demonstrated, via extensive head-to-head comparisons, that the
SPA has a more realistic and better generalization capacity than ANNs on learning the
past tense of English verbs.  We have argued that symbolic decision-tree/production-rule
learning algorithms should outperform ANNs. This is because, first, the domain seems to be


                                               227
                                             Ling


governed by a compact set of rules, and thus fits the bias of our symbolic learning algorithm;
second, the SPA directly manipulates on a representation better than ANNs do (i.e., the
symbolic phoneme letters vs.  the distributed representation); and third, the SPA is able
to derive high-level concepts used throughout English morphology. Our results support the
view that many such high-level, rule-governed cognitive tasks should be better modeled by
symbolic, rather than connectionist, systems.
Acknowledgements


I gratefully thank Steve Pinker for his constant encouragement, and Marin Marinov, Steve
Cherwenka and Huaqing Zeng for discussions and for help in implementing the SPA. I
thank Brian MacWhinney for providing the verb data used in his simulation.  Discussions
with Tom Dietterich, Dave Touretzky and Brian MacWhinney, as well as comments from
reviewers, have been very helpful. The research is conducted with support from the NSERC
Research Grant and computing facilities from our Department.


References


Cottrell, G., & Plunkett, K. (1991).  Using a recurrent net to learn the past tense.  In
      Proceedings of the Cognitive Science Society Conference.


Daugherty, K., & Seidenberg, M. (1993).  Beyond rules and exceptions:  A connectionist
      modeling approach to inflectional morphology.  In Lima,  S. (Ed.),  The Reality of
      Linguistic Rules. John Benjamins.


Dietterich, T., & Bakiri, G. (1991).  Error-correcting output codes:  A general method for
      improving multiclass inductive learning programs. In AAAI-91 (Proceedings of Ninth
      National Conference on Artificial Intelligence).


Dietterich, T., Hild, H., & Bakiri, G. (1990).  A comparative study of ID3 and backprop-
      agation for English text-to-speech mapping.  In Proceedings of the 7th International
      Conference on Machine Learning. Morgan Kaufmann.


Feng, C., King, R., Sutherland, A., & Henery, R. (1992).  Comparison of symbolic, statis-
      tical and neural network classifiers.  Manuscript. Department of Computer Science,
      University of Ottawa.


Fodor, J., & Pylyshyn, Z. (1988).  Connectionism and cognitive architecture:  A critical
      analysis.  In Pinker, S., & Mehler, J. (Eds.), Connections and Symbols, pp. 3 - 71.
      Cambridge, MA: MIT Press.


Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance
      dilemma. Neural Computation, 4, 1 - 58.


Lachter, J., & Bever, T. (1988).  The relation between linguistic structure and associative
      theories of language learning - a constructive critique of some connectionist learning
      models. In Pinker, S., & Mehler, J. (Eds.), Connections and Symbols, pp. 195 - 247.
      Cambridge, MA: MIT Press.


                                               228
             Learning the Past Tense: Symbolic vs Connectionist Models


Ling, X., Cherwenka, S., & Marinov, M. (1993).  A symbolic model for learning the past
      tenses of English verbs.  In Proceedings of IJCAI-93 (Thirteenth International Con-
      ference on Artificial Intelligence), pp. 1143-1149. Morgan Kaufmann Publishers.


Ling, X., & Marinov, M. (1993). Answering the connectionist challenge: a symbolic model
      of learning the past tense of English verbs. Cognition, 49 (3), 235-290.


MacWhinney, B. (1990).  The CHILDES Project: Tools for Analyzing Talk.  Hillsdale, NJ:
      Erlbaum.


MacWhinney, B. (1993).  Connections and symbols:  closing the gap.  Cognition, 49 (3),
      291-296.


MacWhinney, B., & Leinbach, J. (1991). Implementations are not conceptualizations: Re-
      vising the verb model. Cognition, 40, 121 - 157.


Pinker, S. (1991). Rules of language. Science, 253, 530 - 535.


Pinker, S., & Prince, A. (1988).  On language and connectionism:  Analysis of a parallel
      distributed processing model of language acquisition.  In Pinker, S., & Mehler, J.
      (Eds.), Connections and Symbols, pp. 73 - 193. Cambridge, MA: MIT Press.


Plunkett, K., & Marchman, V. (1991).  U-shaped learning and frequency effects in a mul-
      tilayered perceptron: Implications for child language acquisition. Cognition, 38, 43 -
      102.


Prasada, S., & Pinker, S. (1993).  Generalization of regular and irregular morphological
      patterns. Language and Cognitive Processes, 8 (1), 1 - 56.


Quinlan, J. (1986). Induction of decision trees. Machine Learning, 1 (1), 81 - 106.


Quinlan, J. (1993). C4.5 Programs for Machine Learning. Morgan Kaufmann: San Mateo,
      CA.


Ripley, B. (1992).  Statistical aspects of neural networks.  Invited lectures for SemStat
      (Seminaire Europeen de Statistique, Sandbjerg, Denmark, 25-30 April 1992).


Rumelhart, D., & McClelland, J. (1986).  On learning the past tenses of English verbs.
      In Rumelhart, D., McClelland, J., & the PDP Research Group (Eds.), Parallel Dis-
      tributed Processing Vol 2, pp. 216 - 271. Cambridge, MA: MIT Press.


Shavlik, J., Mooney, R., & Towell, G. (1991). Symbolic and neural learning algorithms: An
      experimental comparison. Machine Learning, 6 (2), 111 - 144.


Weiss, S., & Kulikowski, C. (1991). Computer Systems that Learn: classification and predic-
      tion methods from statistics, neural networks, machine learning, and expert systems.
      Morgan Kaufmann, San Mateo, CA.

                                               229