Using Regular Expressions
To use the PV‑WAVE string handling functions STRMATCH, STRSPLIT, and STRSUBST, you must understand how regular expressions work.
Regular expressions are used in UNIX-based utilities such as grep, egrep, awk, and ed. UNIX users are probably familiar with the powerful pattern matching capabilities of regular expressions.
This section provides an elementary introduction to regular expressions. Additional sources of information on regular expressions are listed at the end of this section.
Simple Regular Expressions: A Brief Introduction
This section introduces some simple regular expression examples. More complex examples are presented in "Practical Regular Expression Examples".
Regular expressions can be very complex; entire books have been written on the subject. A regular expression normally consists of characters that you wish to match and special characters that perform specific pattern matching functions. For a list of commonly used special characters, see "Basic Special Characters Used In Regular Expressions".
In PV‑WAVE, the STRMATCH, STRSPLIT, and STRSUBST commands take regular expression arguments to perform pattern matching operations. The following examples demonstrate the use of regular expressions in the STRMATCH function.
Matching a Single Character
The regular expression special character '.' (dot) matches any single character except a newline.
For example, the regular expression used in the STRMATCH function:
result=STRMATCH(string, '.at')
matches any string containing a three-character sequence such as:
bat
cat
mat
oat
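The behavior of the dot can be checked with a short sketch. It is written in Python rather than PV‑WAVE, and it assumes that Python's re.search (which looks for the pattern anywhere in the string) is a reasonable stand-in for STRMATCH with this simple pattern; the sample strings are invented for illustration.

```python
import re

# Assumption: re.search stands in for STRMATCH; this is not PV-WAVE code.
# re.search succeeds if the pattern occurs anywhere in the string.
pattern = re.compile(r'.at')

# Hypothetical sample strings.
samples = ['bat', 'concatenate', 'at', 'a\nat']
dot_results = {s: bool(pattern.search(s)) for s in samples}

# 'at' fails because no character precedes it; 'a\nat' fails because
# the only character directly before 'at' is a newline.
print(dot_results)
```

Note that 'concatenate' matches because it contains 'cat'; the pattern is not tied to the start or end of the string.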
Matching Zero or More Characters
The regular expression special character '*' (asterisk) matches zero or more of the preceding character.
For example, the regular expression used in the STRMATCH function:
result=STRMATCH(string, 'x*y')
matches the following strings (zero or more “x” characters, followed by a single “y”):
y
xy
xxy
xxxy
Matching One or More Characters
The regular expression special character '+' (plus) matches one or more of the preceding character.
For example, the regular expression used in the STRMATCH function:
result=STRMATCH(string, 'x+y')
matches the following strings:
xy
xxy
xxxy
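The difference between '*' and '+' shows up only when the preceding character is absent. The following sketch uses Python, not PV‑WAVE, and assumes the two operators behave the same way in both. It uses re.fullmatch so that each sample string is compared as a whole; an unanchored search for 'x*y' would succeed on any string that contains a "y".

```python
import re

# Hypothetical sample strings.
samples = ['y', 'xy', 'xxxy', 'x']

# fullmatch requires the entire string to fit the pattern.
star_results = {s: bool(re.fullmatch(r'x*y', s)) for s in samples}
plus_results = {s: bool(re.fullmatch(r'x+y', s)) for s in samples}

print(star_results)  # 'y' alone is accepted: zero x's are allowed
print(plus_results)  # 'y' alone is rejected: at least one x is required
```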
Other Special Characters
Other characters, such as brackets, braces, parentheses, and backslashes, also have meaning in a regular expression, depending on the regular expression syntax used.
See the table in the following section for a list of the most basic regular expression special characters.
Basic Special Characters Used In Regular Expressions
Table 8-4: Special Characters lists the most basic regular expression special characters and explains what they match.
Escaping Special Characters
To match a special character as you would a normal character, you must “escape” it by preceding it with a backslash (\). Note, however, that in PV‑WAVE strings, two backslashes translate to a single backslash. For example, to match a period (.) in a regular expression in a PV‑WAVE function, you must use '\\.'.
Note: To match a single backslash in a PV‑WAVE string, you must use two pairs of backslashes: '\\\\'. Each pair, in PV‑WAVE strings, makes a single backslash, so you end up with a single escaped backslash. In other words, the first pair of backslashes is the “escape” character, and the second pair is the “escaped” backslash. If you get confused writing strings with multiple backslashes in PV‑WAVE, you can print the string to see what you get. For example:
PRINT, '\\\\'
\\
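The same two-level escaping can be explored in Python, whose ordinary (non-raw) string literals also collapse each pair of backslashes into one. The sketch below is an analogy only, not PV‑WAVE code; the sample strings are invented.

```python
import re

# Analogy, not PV-WAVE code: in an ordinary Python string literal,
# each pair of backslashes yields one backslash in the string itself.
dot_pattern = '\\.'          # the regular expression seen by the matcher is \.
backslash_pattern = '\\\\'   # the regular expression seen by the matcher is \\

# Printing the strings shows what the matcher actually receives.
print(dot_pattern)
print(backslash_pattern)

# Hypothetical sample strings.
dot_hits = [bool(re.search(dot_pattern, s)) for s in ['3.14159', 'the quick brown fox']]
backslash_hits = [bool(re.search(backslash_pattern, s)) for s in ['C:\\temp', 'no backslash here']]
print(dot_hits, backslash_hits)
```

Printing the pattern before using it, as the text suggests for PV‑WAVE, is a quick way to confirm how many backslashes survive.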
Practical Regular Expression Examples
Assume that string is a string array defined in PV‑WAVE. The following PV‑WAVE commands demonstrate the regular expression pattern matching used in the STRMATCH command.
; Matches any string containing the character 'a'.
result=STRMATCH(string, 'a')
; Matches any string beginning (^) with Cat, bat, and so on:
; 'Cat Woman', 'catatonic', 'Batman, the animated series'
; but does not match: ' cat' (begins with a space), 'cab', and
; so on.
result=STRMATCH(string, '^[CcBb]at')
; Matches any string containing 'L' followed by one or more
; occurrences of 'l': 'Get a Llama' matches; 'larry the llama'
; does not match (the first l in llama is lowercase).
result=STRMATCH(string, 'Ll+')
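; Matches any string that begins with any character other than
; 'C' and ends with 'x': 'Linux' matches; 'Cox' does not match.
result=STRMATCH(string, '^[^C].*x$')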
; Matches any string containing a period. '3.14159' matches;
; 'the quick brown fox' does not match. Remember that it takes
; two backslashes in a PV-WAVE string to produce the single
; backslash that “escapes” the dot (.), as explained previously.
result=STRMATCH(string, '\\.')
; Matches any string containing any character (that is, any
; non-null string).
result=STRMATCH(string, '.')
; Matches only empty strings (start and end with nothing in
; between).
result=STRMATCH(string, '^$')
; Matches either blank or null strings (Between the beginning (^)
; and the end ($) there are only zero or more spaces ( *)).
result=STRMATCH(string, '^ *$')
; Matches only three-character strings.
result=STRMATCH(string, '^...$')
; Matches strings three characters or longer.
result=STRMATCH(string, '^...+$')
; This interesting example matches any integer number, possibly
; surrounded by spaces and/or tabs. This expression means:
; From the beginning of the string (^), zero or more spaces or
; tabs (\011 is the octal ASCII number for a tab character), zero
; or one sign [-+], one or more digits [0-9], zero or more spaces
; or tabs, and finally the end of the string ($).
result=STRMATCH(string, '^[ \011]*[-+]?[0-9]+[ \011]*$')
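The integer pattern in the last example can be exercised with a short Python sketch. This is not PV‑WAVE code: it assumes the character classes, '?' and the anchors behave alike in both, and it writes the tab as \t, Python's escape for the character with octal code 011. The sample strings are invented.

```python
import re

# Assumption: Python's re is used as a stand-in for STRMATCH.
# [ \t] is a space or a tab (octal 011).
integer_pattern = re.compile(r'^[ \t]*[-+]?[0-9]+[ \t]*$')

# Hypothetical sample strings.
samples = ['42', '  -7\t', '+ 3', '3.14', '', '12 34']
integer_hits = {s: bool(integer_pattern.search(s)) for s in samples}

# '+ 3' fails because the sign must be followed directly by a digit;
# '12 34' fails because only spaces or tabs may follow the digits.
print(integer_hits)
```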
Regular Expressions vs. Wildcard Characters
Many users understandably confuse wildcard characters and regular expressions, because both are used for pattern matching, and because some of the same characters, like asterisk (*), question mark (?), and square brackets ([ ]), are used in both, yet have different meanings.
Note: Wildcard characters are commonly used in file matching contexts on Microsoft Windows systems. On UNIX systems, wildcards are used in the Bourne shell and C shell, as well as in the commands find and cpio. The most common wildcard is the asterisk (*), which matches any group of characters.
A common misconception is that the asterisk (*) is a wildcard character in regular expressions. In regular expressions, asterisk (*) means “match zero or more of the preceding character.”
To make a “wildcard” (that is, an expression that matches anything) with regular expressions, you must use '.*' (dot asterisk). This expression means, “match zero or more of any character.”
Example of Wildcards vs. Regular Expressions
Most computer users have used the asterisk (*) as a wildcard character in system commands such as ls and dir. For example:
dir file.*
is a wildcard expression that matches anything that begins with “file.”, such as file.c, file.o, file.dat, file.pro, and so on.
However, the regular expression character * means something entirely different from the wildcard character *. In regular expressions, the asterisk means match zero or more of the preceding character.
Therefore, the regular expression 'file.*' matches any string that contains “file” followed by zero or more of any character. It would match:
file.dat
myfile.c
myfile
myfiles
This result is quite different from the wildcard example shown previously.
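The contrast can be reproduced in Python, which offers both styles of matching: fnmatch.fnmatchcase applies shell-style wildcards, and re.search applies a regular expression anywhere in the string. The sketch assumes these two functions are fair stand-ins for a directory listing command and for STRMATCH, respectively.

```python
import re
from fnmatch import fnmatchcase

names = ['file.dat', 'myfile.c', 'myfile', 'myfiles', 'file.c']

# Wildcard reading of file.* : the name must start with "file."
wildcard_hits = [n for n in names if fnmatchcase(n, 'file.*')]

# Regular expression reading of file.* : "file" followed by zero or
# more of any character, found anywhere in the name.
regex_hits = [n for n in names if re.search(r'file.*', n)]

print(wildcard_hits)
print(regex_hits)
```

The wildcard selects only the two names that begin with "file.", while the regular expression selects all five.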
Regular Expressions are Versatile
You can, of course, construct a regular expression that is equivalent to the wildcard expression shown previously. Here is a regular expression that performs the same pattern matching function as the wildcard expression file.*:
'^file\\..*'
Here, the caret (^) matches the beginning of the string. The “\\.” matches a single dot (.), and the “.*” matches zero or more of any character.
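One way to gain confidence in the equivalence is to compare the two forms on a handful of names. The sketch below uses Python as a stand-in; it writes the pattern as a raw string, so a single backslash is enough, whereas the PV‑WAVE string needs two. The list of names is invented for illustration.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical file names, including edge cases.
names = ['file.c', 'file.dat', 'file.', 'file', 'myfile.c', 'filex.c']

# Anchored regular expression: "file." at the start, then anything.
anchored = re.compile(r'^file\..*')

regex_results = [bool(anchored.search(n)) for n in names]
wildcard_results = [fnmatchcase(n, 'file.*') for n in names]

# Both approaches should accept and reject the same names.
print(regex_results == wildcard_results)
```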
For More Information
For an excellent explanation of regular expressions, see:
UNIX Power Tools, Jerry Peek, Tim O’Reilly, and Mike Loukides, O’Reilly & Associates/Bantam, 1993.
Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools, Jeffrey E. F. Friedl, O’Reilly & Associates, 1997.
Many general books on UNIX programming contain information on regular expressions. In addition, books on the Perl programming language usually explain regular expressions in detail (Perl uses regular expressions extensively). For example, see:
Programming Perl, Larry Wall, Tom Christiansen, and Randal L. Schwartz, O’Reilly & Associates, Inc., Second Edition, 1996.
UNIX users can find regular expressions explained in the man page for the ed command.