PERL -- Search and Modification Operations

m/PATTERN/gio

Searches a string for a pattern match, and returns true (1) or false (''). If no string is specified via the =~ or !~ operator, the $_ string is searched. (The string specified with =~ need not be an lvalue--it may be the result of an expression evaluation, but remember the =~ binds rather tightly.) See also the section on regular expressions.

If / is the delimiter then the initial 'm' is optional. With the 'm' you can use any pair of non-alphanumeric characters as delimiters. This is particularly useful for matching Unix path names that contain '/'. If the final delimiter is followed by the optional letter 'i', the matching is done in a case-insensitive manner. PATTERN may contain references to scalar variables, which will be interpolated (and the pattern recompiled) every time the pattern search is evaluated. (Note that $) and $| may not be interpolated because they look like end-of-string tests.) If you want such a pattern to be compiled only once, add an "o" after the trailing delimiter. This avoids expensive run-time recompilations, and is useful when the value you are interpolating won't change over the life of the script. If the PATTERN evaluates to a null string, the most recent successful regular expression is used instead.
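
As a rough sketch of the delimiter, case-insensitivity and interpolation rules just described (the path and the variable names are made up for illustration, and the snippet assumes a current perl interpreter):

```perl
# Hypothetical sample data; none of these names come from the manual.
$path = "/usr/spool/uucp/LOGFILE";
$dir  = "spool";

$hit_slash  = ($path =~ /\/usr\/spool/) ? 1 : 0;  # '/' delimiter needs escapes
$hit_hash   = ($path =~ m#^/usr/spool#) ? 1 : 0;  # '#' delimiter avoids them
$hit_case   = ($path =~ m|logfile|)     ? 1 : 0;  # case-sensitive: no match
$hit_nocase = ($path =~ m|logfile|i)    ? 1 : 0;  # 'i' ignores case
$hit_var    = ($path =~ /$dir/)         ? 1 : 0;  # $dir is interpolated

print "$hit_slash $hit_hash $hit_case $hit_nocase $hit_var\n";   # 1 1 0 1 1
```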

If used in a context that requires an array value, a pattern match returns an array consisting of the subexpressions matched by the parentheses in the pattern, i.e. ($1, $2, $3...). It does NOT actually set $1, $2, etc. in this case, nor does it set $+, $`, $& or $'. If the match fails, a null array is returned. If the match succeeds, but there were no parentheses, an array value of (1) is returned.

Examples:

    open(tty, '/dev/tty');
    <tty> =~ /^y/i && do foo();	# do foo if desired

    if (/Version: *([0-9.]*)/) { $version = $1; }

    next if m#^/usr/spool/uucp#;

    # poor man's grep
    $arg = shift;
    while (<>) {
        print if /$arg/o;	# compile only once
    }

    if (($F1, $F2, $Etc) = ($foo =~ /^(\S+)\s+(\S+)\s*(.*)/))

This last example splits $foo into the first two words and the remainder of the line, and assigns those three fields to $F1, $F2 and $Etc. The conditional is true if any variables were assigned, i.e. if the pattern matched.
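
The three possible array-context results (captured fields, the value (1) when there are no parentheses, and a null array on failure) can be sketched as follows. The sample string and array names are invented for the illustration:

```perl
# Hypothetical input; the array names are illustrative only.
$foo = "alpha beta gamma delta";

@fields = ($foo =~ /^(\S+)\s+(\S+)\s*(.*)/);  # three parenthesized fields
@plain  = ($foo =~ /beta/);                   # matches, no parentheses: (1)
@failed = ($foo =~ /^(\d+)/);                 # no match: null array

print join("|", @fields), "\n";               # alpha|beta|gamma delta
print scalar(@plain), " ", scalar(@failed), "\n";   # 1 0
```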

The "g" modifier specifies global pattern matching--that is, matching as many times as possible within the string. How it behaves depends on the context. In an array context, it returns a list of all the substrings matched by all the parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern. In a scalar context, it iterates through the string, returning TRUE each time it matches, and FALSE when it eventually runs out of matches. (In other words, it remembers where it left off last time and restarts the search at that point.) It presumes that you have not modified the string since the last match. Modifying the string between matches may result in undefined behavior. (You can actually get away with in-place modifications via substr() that do not change the length of the entire string. In general, however, you should be using s///g for such modifications.) Examples:

	# array context
	($one,$five,$fifteen) = (`uptime` =~ /(\d+\.\d+)/g);

	# scalar context
	$/ = ""; $* = 1;
	while ($paragraph = <>) {
	    while ($paragraph =~ /[a-z]['")]*[.!?]+['")]*\s/g) {
		$sentences++;
	    }
	}
	print "$sentences\n";
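
Since the output of uptime varies from system to system, the same two contexts can also be tried on a fixed string. This is only a sketch; the sample text and variable names are assumptions made for the illustration:

```perl
# Hypothetical stand-in for the output of uptime.
$text = "load average: 0.42, 1.07, 2.50";

# Array context: every match of the parenthesized group is returned.
($one, $five, $fifteen) = ($text =~ /(\d+\.\d+)/g);

# Array context without parentheses: the whole matches are returned.
@words = ("a bb ccc" =~ /\w+/g);

# Scalar context: each evaluation resumes where the previous match ended.
$count = 0;
while ($text =~ /\d+\.\d+/g) {
    $count++;
}

print "$one $five $fifteen\n";            # 0.42 1.07 2.50
print join(",", @words), " $count\n";     # a,bb,ccc 3
```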

?PATTERN?

This is just like the /pattern/ search, except that it matches only once between calls to the reset operator. This is a useful optimization when you only want to see the first occurrence of something in each file of a set of files, for instance. Only ?? patterns local to the current package are reset.
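
A small sketch of the once-only behavior follows. It assumes a current perl interpreter, where the operator has to be spelled with a leading m (m?PATTERN?); the list of lines and the variable names are made up for the illustration:

```perl
# Hypothetical data; two passes over the same lines stand in for two files.
@lines = ("foo 1", "foo 2", "bar", "foo 3");
@hits  = ();

foreach $pass (1, 2) {
    foreach $line (@lines) {
        # Succeeds only for the first matching line after each reset.
        push(@hits, "$pass: $line") if $line =~ m?foo?;
    }
    reset;      # let the ?? search match again on the next pass
}

print join("; ", @hits), "\n";    # 1: foo 1; 2: foo 1
```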

s/PATTERN/REPLACEMENT/gieo

Searches a string for a pattern, and if found, replaces that pattern with the replacement text and returns the number of substitutions made. Otherwise it returns false (0). The "g" is optional, and if present, indicates that all occurrences of the pattern are to be replaced. The "i" is also optional, and if present, indicates that matching is to be done in a case-insensitive manner. The "e" is likewise optional, and if present, indicates that the replacement string is to be evaluated as an expression rather than just as a double-quoted string.

Any non-alphanumeric delimiter may replace the slashes; if single quotes are used, no interpretation is done on the replacement string (the e modifier overrides this, however); if backquotes are used, the replacement string is a command to execute whose output will be used as the actual replacement text. If the PATTERN is delimited by bracketing quotes, the REPLACEMENT has its own pair of quotes, which may or may not be bracketing quotes, e.g. s(foo)(bar) or s<foo>/bar/.

If no string is specified via the =~ or !~ operator, the $_ string is searched and modified. (The string specified with =~ must be a scalar variable, an array element, or an assignment to one of those, i.e. an lvalue.) If the pattern contains a $ that looks like a variable rather than an end-of-string test, the variable will be interpolated into the pattern at run-time. If you only want the pattern compiled once the first time the variable is interpolated, add an "o" at the end. If the PATTERN evaluates to a null string, the most recent successful regular expression is used instead. See also the section on regular expressions. Examples:

    s/\bgreen\b/mauve/g;		# don't change wintergreen

    $path =~ s|/usr/bin|/usr/local/bin|;

    s/Login: $foo/Login: $bar/; # run-time pattern

    ($foo = $bar) =~ s/bar/foo/;

    $_ = 'abc123xyz';
    s/\d+/$&*2/e;		# yields 'abc246xyz'
    s/\d+/sprintf("%5d",$&)/e;	# yields 'abc  246xyz'
    s/\w/$& x 2/eg;		# yields 'aabbcc  224466xxyyzz'

    s/([^ ]*) *([^ ]*)/$2 $1/;	# reverse 1st two fields

(Note the use of $ instead of \ in the last example. See section on regular expressions.)
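
The return value of a substitution is easy to overlook, so here is a short sketch that captures it, together with bracketing delimiters and the e modifier. The strings and variable names are invented for the illustration, and the snippet assumes a current perl interpreter:

```perl
# Hypothetical sample text.
$text = "green wintergreen green";

$n    = ($text =~ s/\bgreen\b/mauve/g);   # number of substitutions made
$none = ($text =~ s/purple/pink/);        # nothing matched: false

($copy = $text) =~ s(mauve)(teal);        # bracketing delimiters, first match only

$price = "cost 21";
$price =~ s/(\d+)/$1 * 2/e;               # replacement evaluated as an expression

print "$n: $text\n";                      # 2: mauve wintergreen mauve
print "$copy\n";                          # teal wintergreen mauve
print "$price\n";                         # cost 42
```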

study(SCALAR)

study SCALAR

study

Takes extra time to study SCALAR ($_ if unspecified) in anticipation of doing many pattern matches on the string before it is next modified. This may or may not save time, depending on the nature and number of patterns you are searching on, and on the distribution of character frequencies in the string to be searched--you probably want to compare runtimes with and without it to see which runs faster. Those loops which scan for many short constant strings (including the constant parts of more complex patterns) will benefit most. You may have only one study active at a time--if you study a different scalar the first is "unstudied". (The way study works is this: a linked list of every character in the string to be searched is made, so we know, for example, where all the 'k' characters are. From each search string, the rarest character is selected, based on some static frequency tables constructed from some C programs and English text. Only those places that contain this "rarest" character are examined.)

For example, here is a loop that inserts index-producing entries before any line containing a certain pattern:

	while (<>) {
		study;
		print ".IX foo\n" if /\bfoo\b/;
		print ".IX bar\n" if /\bbar\b/;
		print ".IX blurfl\n" if /\bblurfl\b/;
		...
		print;
	}

In searching for /\bfoo\b/, only those locations in $_ that contain 'f' will be looked at, because 'f' is rarer than 'o'. In general, this is a big win except in pathological cases. The only question is whether it saves you more time than it took to build the linked list in the first place.

Note that if you have to look for strings that you don't know till runtime, you can build an entire loop as a string and eval that to avoid recompiling all your patterns all the time. Together with undefining $/ to input entire files as one record, this can be very fast, often faster than specialized programs like fgrep. The following scans a list of files (@files) for a list of words (@words), and prints out the names of those files that contain a match:

	$search = 'while (<>) { study;';
	foreach $word (@words) {
	    $search .= "++\$seen{\$ARGV} if /\\b$word\\b/;\n";
	}
	$search .= "}";
	@ARGV = @files;
	undef $/;
	eval $search;		# this screams
	$/ = "\n";		# put back to normal input delim
	foreach $file (sort keys(%seen)) {
	    print $file, "\n";
	}

tr/SEARCHLIST/REPLACEMENTLIST/cds

y/SEARCHLIST/REPLACEMENTLIST/cds

Replaces each occurrence of a character found in the search list with the corresponding character in the replacement list. It returns the number of characters replaced or deleted. If no string is specified via the =~ or !~ operator, the $_ string is translated. (The string specified with =~ must be a scalar variable, an array element, or an assignment to one of those, i.e. an lvalue.) For sed devotees, y is provided as a synonym for tr. If the SEARCHLIST is delimited by bracketing quotes, the REPLACEMENTLIST has its own pair of quotes, which may or may not be bracketing quotes, e.g. tr[A-Z][a-z] or tr(+-*/)/ABCD/.

If the c modifier is specified, the SEARCHLIST character set is complemented. If the d modifier is specified, any characters specified by SEARCHLIST that are not found in REPLACEMENTLIST are deleted. (Note that this is slightly more flexible than the behavior of some tr programs, which delete anything they find in the SEARCHLIST, period.) If the s modifier is specified, sequences of characters that were translated to the same character are squashed down to 1 instance of the character.

If the d modifier was used, the REPLACEMENTLIST is always interpreted exactly as specified. Otherwise, if the REPLACEMENTLIST is shorter than the SEARCHLIST, the final character is replicated till it is long enough. If the REPLACEMENTLIST is null, the SEARCHLIST is replicated. This latter is useful for counting characters in a class, or for squashing character sequences in a class.

Examples:

    $ARGV[1] =~ y/A-Z/a-z/;	# canonicalize to lower case

    $cnt = tr/*/*/;		# count the stars in $_

    $cnt = tr/0-9//;		# count the digits in $_

    tr/a-zA-Z//s;		# bookkeeper -> bokeper

    ($HOST = $host) =~ tr/a-z/A-Z/;

    y/a-zA-Z/ /cs;		# change non-alphas to single space

    tr/\200-\377/\0-\177/;	# delete 8th bit
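
To see the modifiers and the replication rules at work on known input, the following sketch applies them to copies of two fixed strings. The strings and variable names are assumptions chosen for the illustration, not part of the manual:

```perl
# Hypothetical sample strings.
$word = "bookkeeper";
$line = "hello, world 42";

($squashed = $word) =~ tr/a-z//s;       # null REPLACEMENTLIST plus s: squash runs
$digits = ($line =~ tr/0-9//);          # count only; $line is left unchanged
($letters = $line) =~ tr/a-z//cd;       # c and d: delete everything but a-z
($masked  = $line) =~ tr/0-9/N/;        # short list: final 'N' is replicated
($spaced  = "a--b...c") =~ tr/a-z/ /cs; # non-alphas become one space

print "$squashed $digits $letters\n";   # bokeper 2 helloworld
print "$masked\n";                      # hello, world NN
print "$spaced\n";                      # a b c
```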