A Brief Introduction to Regular Expressions
Introduction
Regular Expressions are ubiquitous in computer science.
We will study one example here: using regular expressions to specify a
search pattern in the Metrowerks Editor.
Such patterns help us locate appropriate information in our source code.
Each regular expression is like the right hand side of an EBNF rule,
although it uses a different notation and includes more options.
Below is a picture of The Find window in Metrowerks
(bring it up by pressing Ctrl f or selecting Search | Find).
It has a checkbox to enable/disable Regular expressions.
By enabling this mode, Metrowerks interprets the Find text box
according to special notation described below, in which certain
characters specify not themselves, but instructions.
If this mode is disabled, then every character just matches itself.
It is easy to test your knowledge of regular expressions by typing
information into a small text file and then using regular expression
patterns in the Find window to search it.
Experiment!
Notation
Just as in EBNF, certain characters in regular expressions have special
meanings while others just stand for themselves.
Here is a brief discussion, with examples, of the meanings of various
components in regular expressions.
This material assumes that you are already familiar with EBNF.
- Character Matching
- Except for special characters, a character matches itself; note that a
checkbox in the Find window controls case-sensitivity in
regular expresions; to match any of the special characters described
below, \.[]-^?*+()$|, preface it with a \; e.g.
\. matches a dot and \\ matches a backslash.
- . matches any character except a newline character; e.g.,
.art matches cart, dart, etc.
- [...] matches any character in ...; e.g., [aeiou]
matches any vowel; in EBNF we would write this as a|e|i|o|u.
- [...-...] matches any character in the specified range; e.g.,
[a-z] matches any lower-case letter (or any letter if the
case sensitive box is not checked); [0-9] matches any digit.
- [^...] matches any character NOT in ...; e.g.,
[^aeiou] matches anything that is not a vowel, including
digits, strange characters, etc.
- Operators
- ? is a postfix operator; it optionally matches the operand before
it (like [] in EBNF; another way to say this is that it matches
0 or 1 occurences of the pattern preceding it); e.g.,
foos? matches either foo or foos (the operand
before ? is only the character s).
- * is a postfix operator; it matches 0 or more occurrences
of the operand before it (like {} in EBNF); e.g., booo*m
matches boom, booom, boooom, etc. (it is like
boo{o}m in EBNF).
- + is a postfix operator; it matches 1 or more occurrences
of the operand before it; e.g., boo+m does not match
boom, but matches booom, boooom, etc. (it is like
booo{o}m in EBNF).
- Grouping
- (...) matches the sequence of characters ...; use
parentheses for groupings that are followed by operators; e.g.,
(ab)+ matches ab, abab, etc. (it is like
ab{ab} in EBNF).
- (...|...) matches either of the sequences of characters
... (like | in EBNF, but alternatives must appear
within parentheses; you can write as many as you want); e.g.,
(ab|xy) matches either ab or xy;
(ab|xy)* matches nothing, ab, xy,
abab, abxy, abxyab, etc. (it is like
{ab|xy} in EBNF).
- Special Matches
- \t matches a tab; similarly \n matches a newline,
etc. (for other Java escape characters); note that ( |\t)*
is a frequently occuring subpattern that matches any amount of
white-space.
- ^, when not used in [] matches the beginning of a line;
e.g.,^( |\t)*Hi matches Hi but only if they are the
first non-blank characters on the line (skipping any spaces and tabs).
- $ matches the end of a line; e.g.,
}( |\t)*$ matches } but only if it is the last
non-blank character on the line.
Regular expressions are used heavily in Unix tools (e.g., grep) and and
programmingn languages like Perl.
Java 1.5 has a regular expression class for doing these kinds of searches.
Finally, for book-length coverage of this topic, see
Friedl,
Mastering Regular Expressions, O'Reilly.
Solved Problems
Here is some information that you might want to search for, and the
patterns that perform the search.
Most of these patterns aren't perfect: they sometimes find a bit more
information than we want; but specifying a perfect pattern would take
much longer.
For each solution, can you show some information that matches but is not
really what we want.
- Find any calls to getters or setters using the variable x
Solution: x.(get|set).*\( Technically, for identifiers,
.* should be replaced by (a-z|A-Z|0-9|_)*
- Find any lines that contain only a closing }
Solution: ^( |\t)*\}( |\t)*$
- Find any calls to the m method, using two arguments.
Solution: m\(.*,.*\)
- Find any locations where the variable x is stored into by the
= operator.
Solution: ( |\t)*x( |\t)*=
- Find any locations where the variable x is printed.
Solution: System.out.printn(ln)?\(x|\.*\+( |\t)*x