Just a brief note about class participation. According to the
syllabus, it's worth 10% of your grade, and each TA has complete
authority over the allocation of those points for all his students. I
will be using the "warm fuzzy" method for determining those grades.
This means that I won't keep track of who raises his or her hand how
many times, or give pop quizzes, or anything else that involves
keeping score in some way. I figure that after a full semester of
seeing you in section, fielding your questions during office hours,
checking over your problem sets and exams, and reading your postings
on the newsgroups, I'll have gotten to know you all well enough that I
can then make fair decisions on allocating the class participation
grades without having to look for numbers to back me up.
|
On pages 88 and 89 of L&P, you'll find one of the most useful theorems
you'll see in this course. It is the pumping theorem for regular
languages (there is one for context-free languages, too, but you won't
be responsible for knowing that one), and its primary use is in
proving that some language is not regular. Remember this well;
you'll have problems galore if you try to prove that some language
*is* regular using the pumping theorem, because it's not meant for
problems like that. (Usually, you construct an FA to prove that some
language is regular.) Read theorem 2.4.1 carefully. Yes, it's pretty difficult to make sense of, but one of the skills you should pick up from this course is learning how to make sense of jargon-laden technical writing. Here's what it says:

    Let L be a regular language. Then there is an integer n >= 1 such
    that any string w in L with length at least n can be rewritten as
    w = xyz, where y != e, the length of xy is at most n, and the
    string x y^i z is in L for every i >= 0.
OK, this is definitely not light reading. But if we take some time to take it apart, we'll find that it's not so bad, really. Here's the list of ingredients:

- L, the regular language we're given.
- n, a number that depends on L; the theorem only promises that such a number exists.
- w, any string in L that is at least n characters long.
- x, y, and z, three pieces that w gets chopped into, subject to two conditions: y is not empty, and xy is at most n characters long.
- x y^i z, the "pumped" versions of w, all of which must also be in L.
What does it mean for a string to be "pumpable"? That's where that xyz stuff comes in. Take some string w that you want to test for pumpability. Then chop it up into three pieces, and label them x, y, and z, in order from left to right. You must be careful to chop it up in such a way that y is at least one character long, and that x and y put back together are at most n characters long. Having chopped up w in a way that meets these criteria, you can then "pump" w in this way: remove the piece y from w (so you're left with xz), make as many clones of y as you want (including zero!), concatenate the clones to form a single sequence (essentially y repeated any number of times, or y*), and put it back in between x and z. This is called "pumping," because you're pumping up (or down) the string w with an arbitrary number of y's.

A quick example: say that n=3 and w=abaab. Then we can choose to chop up w into, say, x=e, y=ab, and z=aab. Since y is not empty and the length of xy is 2, which is at most n, this chop meets all the criteria. Now we can pump w any number of times. If we pump zero times, we get xz, which is aab. If we pump once, we get xyz, which is abaab. Pump twice, and we get xyyz, which is ababaab. Pump five times, and we get xyyyyyz, which is abababababaab. And so on.

Now we're ready to do an example, which will show how this theorem works in action, by proving that L = { a^m b^m : m >= 0 } is not regular. Suppose that L is regular; then by the pumping theorem, there must be some number n such that all strings in L of length n or longer are pumpable. Well, let's choose the string w = a^n b^n. It has length 2n, which is certainly at least n, so it is a valid choice. We now must chop it up into three pieces, but notice a funny thing: w starts with a run of n a's, and because the sum of the lengths of the first two pieces (x and y) must be no more than n, this means that both x and y must contain only a's!
Furthermore, y must contain at least one a, because it cannot be empty. So let's say that y contains j a's, where 1 <= j <= n. Now we pump zero times to get xz, which is simply a^(n-j) b^n. But this string doesn't contain an equal number of a's and b's, because j is at least 1, so it's not in the language L. Since this happens for every valid way of chopping w into x, y, and z, the theorem's conclusion fails for w, and we conclude that L is not regular.

One final note: the pumping theorem is a very powerful tool, but it's also a bit complicated to use. It's like a chainsaw -- it'll cut through most things, but it's probably not the most efficient choice for, say, slicing carrots. Often, we can use the closure properties of regular languages to simplify the task of proving that some language is not regular. An example is the language L = { a^m c^n b^m : m,n >= 0 }. To prove that L is not regular, we intersect it with a language that is known to be regular, namely a*b*. But L & a*b* (note that HTML doesn't have the INTERSECT symbol, so I use "&" to denote it) is the language { a^m b^m : m >= 0 }, which we just proved above is not regular. But we know that regular languages are closed under intersection, which means that one of the two languages we intersected must not be regular. And we know that a*b* is regular, so we conclude that L must not be regular. Notice how much simpler this proof is, compared to applying the pumping theorem.
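If you'd like to watch the a^n b^n argument succeed mechanically, here is a small Python sketch (my own illustration, not from L&P): it tries every legal chop of w = a^n b^n for one particular n and confirms that pumping down always falls out of the language. Keep in mind that a program can only test one n at a time, while the proof covers all of them.

```python
def in_L(s):
    """Membership test for L = { a^m b^m : m >= 0 }."""
    m = len(s) // 2
    return s == "a" * m + "b" * m

def pump(x, y, z, i):
    """Build the pumped string x y^i z."""
    return x + y * i + z

n = 5
w = "a" * n + "b" * n       # w = a^n b^n, length 2n >= n

# Try every chop w = xyz with |xy| <= n and y nonempty, and check
# that pumping zero times always leaves the language -- so no chop
# satisfies the theorem's conclusion.
for j in range(n + 1):            # j = |xy| <= n
    for k in range(1, j + 1):     # k = |y| >= 1
        x, y, z = w[:j - k], w[j - k:j], w[j:]
        assert not in_L(pump(x, y, z, 0))

print("no valid chop of", w, "survives pumping down")
```

Because |xy| <= n, every chop puts only a's into y, so pumping down removes a's without removing b's; the assertions all pass.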
|
A grammar is basically a set of rules that tell you how to
manipulate words. In our case, we're interested in a class of
grammars that are "context-free." What does this mean? It means that
the rules get applied without taking into account what's going on
around them. Imagine you have two rules: "Go to bed at midnight every
night," and, "If it's Friday, then stay up all night and par-tay!"
Here, the first rule is context-free; it gets applied no matter what.
But the second rule is context-sensitive; it gets applied only if it's
Friday. Getting back to our grammar, we'll find that a CFG never "looks" at the words that are already there to decide which rule to apply; it merely looks for any opportunity to apply some rule, then does so without giving any thought to anything else that's going on. (In case you're curious: yes, there are grammars whose rules do take context into account. They are called context-sensitive and, more generally, unrestricted grammars. We won't be dealing with them in this course, however.) CFGs consist of an alphabet and a bunch of rules. The alphabet contains two types of symbols: terminals and nonterminals. We usually notate terminals using lower-case letters and nonterminals using upper-case letters. Among the nonterminals, there is always one special nonterminal symbol called the start symbol. This is usually denoted as "S". Finally, the rules are all of the form:

    A -> x

where A is a single nonterminal and x is a string.
The string on the right-hand side of a rule can be any mixture of terminals and nonterminals. Here's an example of a CFG. Let SIGMA = {a,b,S} be the alphabet (S is the only nonterminal, and therefore the start symbol), and let there be two rules:

    S -> aSb
    S -> e
Now, to use this grammar, we simply start with the start symbol, then proceed to apply the rules for replacing the nonterminals. Whenever we apply a rule to get a new string, we notate that using a double-stem arrow "=>". For instance, we might try:

    S => aSb => aaSbb => aaaSbbb => aaaebbb
which is aaabbb. So the string aaabbb is in the language defined by this grammar. By now, it must be obvious that the language generated by the above grammar is { a^m b^m : m >= 0 }. So already, we have shown with this example that CFGs are capable of generating some languages that FAs cannot represent. But what about the other direction: are there languages that FAs can represent but CFGs cannot generate? In other words, can CFGs generate all regular languages? The answer, of course, is yes -- every regular language is context-free; however, we won't prove that quite yet.

Note that what we just did in the above example is "generate" the string aaabbb using the CFG. It is also possible to reverse this process; i.e., take a string and figure out which rules were applied to generate it. We call this parsing. Look in the book for examples of parse trees, which are graphical representations of how parsing takes place. Consider this grammar of three symbols and two rules:
(The vertical bars are "or" symbols; so the first rule is really a shorthand for two rules, "S -> A" and "S -> AA", and the second rule is a shorthand for four rules.) Now, if you try to parse the seemingly simple string "aa", you'll find that there are many different ways to parse it:
and so on. When you have a grammar under which some (or all) strings in the language have multiple parse trees, we say that the grammar is ambiguous.

CFGs define languages, which we call context-free languages. CFLs, like regular languages, are closed under certain operations: union, concatenation, and Kleene star. Note, however, that CFLs are not closed under intersection or complementation. This is an important difference between CFLs and RLs. However, there is one interesting twist: it turns out that CFLs *are* closed under intersection with RLs; i.e., CFL & RL -> CFL. As it turns out, this is a fairly handy property to use in proving that certain languages are not context-free.
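To make ambiguity concrete, here is a small Python sketch that counts parse trees by brute force. Since the exact rules of the grammar above aren't reproduced in these notes, the code uses a stand-in ambiguous grammar of my own, S -> SS | a: any string of three or more a's has more than one parse tree under it.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_parses(s):
    """Number of parse trees deriving s from S under S -> SS | a."""
    total = 1 if s == "a" else 0            # rule S -> a
    for i in range(1, len(s)):              # rule S -> SS: try every
        total += count_parses(s[:i]) * count_parses(s[i:])  # split point
    return total

print(count_parses("aa"))    # 1
print(count_parses("aaa"))   # 2 -- two parse trees, so the grammar
                             # is ambiguous on aaa
```

A string with a parse count greater than one witnesses the ambiguity of the grammar; the same brute-force idea works for any small CFG once you encode its rules.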
|