Using Regular Expressions
To use the PV‑WAVE string handling functions STRMATCH, STRSPLIT, and STRSUBST, you must understand how regular expressions work.
Regular expressions are used in UNIX-based utilities such as grep, egrep, awk, and ed. UNIX users are probably familiar with the powerful pattern matching capabilities of regular expressions.
 
Note: Regular expressions are not the same as wildcard characters. See the section "Regular Expressions vs. Wildcard Characters" for information on this common source of confusion.
This section provides an elementary introduction to regular expressions. Additional sources of information on regular expressions are listed at the end of this section.
Simple Regular Expressions: A Brief Introduction
This section introduces some simple regular expression examples. More complex examples are presented in "Practical Regular Expression Examples".
Regular expressions can be very complex; entire books have been written on the subject. A regular expression normally consists of characters that you wish to match and special characters that perform specific pattern matching functions. For a list of commonly used special characters, see "Basic Special Characters Used In Regular Expressions".
In PV‑WAVE, the STRMATCH, STRSPLIT, and STRSUBST commands take regular expression arguments to perform pattern matching operations. The following examples demonstrate the use of regular expressions in the STRMATCH function.
Matching a Single Character
The regular expression special character '.' (dot) matches any single character except a newline.
For example, the regular expression used in the STRMATCH function:
result=STRMATCH(string, '.at')
matches any string that contains any single character followed by "at", for example:
bat
cat
mat
oat
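The dot behaves the same way in most regular expression engines. As a rough illustration only, the sketch below uses Python's re module as a stand-in; it is not PV‑WAVE code, and the word list is made up for the example. Like the STRMATCH call above, re.search reports a match found anywhere in the string.

```python
import re

# Python stand-in for the pattern '.at' (not PV-WAVE code).
# re.search looks for the pattern anywhere in the string.
pattern = re.compile(r'.at')

# Hypothetical sample strings.
words = ['bat', 'cat', 'concatenate', 'at', 'a\nat']
matches = [w for w in words if pattern.search(w)]

print(matches)  # ['bat', 'cat', 'concatenate']
```

The string "at" fails because the dot must consume one character, and the last sample fails because the only character in front of "at" is a newline.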
Matching Zero or More Characters
The regular expression special character '*' (asterisk) matches zero or more of the preceding character.
For example, the regular expression used in the STRMATCH function:
result=STRMATCH(string, 'x*y')
matches strings such as the following (zero or more "x" characters, followed by a single "y"). Because "x" may occur zero times, any string that contains a "y" matches:
y 
xy 
xxy 
xxxy 
Matching One or More Characters
The regular expression special character '+' (plus) matches one or more of the preceding character.
For example, the regular expression used in the STRMATCH function:
result=STRMATCH(string, 'x+y')
matches the following strings:
xy 
xxy 
xxxy 
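The difference between '*' and '+' shows up only when the preceding character is absent. The following sketch compares the two modifiers side by side; it uses Python's re module as an assumed stand-in for illustration, not PV‑WAVE, and the sample strings are hypothetical.

```python
import re

# Python stand-ins for the patterns 'x*y' and 'x+y' (not PV-WAVE code).
star = re.compile(r'x*y')   # zero or more 'x', then 'y'
plus = re.compile(r'x+y')   # one or more 'x', then 'y'

# Map each sample to (matches x*y, matches x+y).
results = {
    s: (bool(star.search(s)), bool(plus.search(s)))
    for s in ['y', 'xy', 'xxy', 'xxx']
}

for sample, (star_hit, plus_hit) in results.items():
    print(sample, star_hit, plus_hit)
# y True False
# xy True True
# xxy True True
# xxx False False
```

Only the bare "y" separates the two patterns: 'x*y' accepts it, while 'x+y' requires at least one "x" in front of it.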
Other Special Characters
Other characters, such as brackets, braces, parentheses, and backslashes, also have meaning in a regular expression, depending on the regular expression syntax used.
See the table in the following section for a list of the most basic regular expression special characters.
Basic Special Characters Used In Regular Expressions
Table 8-4 lists the most basic regular expression special characters and explains what they match.
 
Table 8-4: Special Characters

Special Character   Matches
.                   any single character except newline
^                   the beginning of the string (when used as the first character in the regular expression)
$                   the end of the string (when used as the last character in the regular expression)
*                   zero or more of the preceding character. (This character is a modifier, which means that it specifies how many times you expect to see the preceding character. Therefore, this character is only significant if it is preceded by another character.)
+                   one or more of the preceding character. (This character is also a modifier, because it must be preceded by another character.)
?                   zero or one of the preceding character. (This character is also a modifier, because it must be preceded by another character.)
[ ... ]             a single character that is in the enclosed group of characters: either a list of characters, like [abc], or a range of characters, like [0-9], or both, like [0-9ABCw-z]
[^ ... ]            any character except those enclosed in the square brackets, like [^0-9]
|                   acts as an OR operator, separating two regular expressions
( )                 encloses sub-expressions (used for grouping and for the registers variable in the STRMATCH function)
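To see several of these special characters in action, the sketch below exercises anchors, the '?' modifier, character classes, alternation, and grouping. It is written in Python with the re module as an assumed stand-in, not PV‑WAVE; the helper name found and all sample strings are hypothetical. Python's capture groups are only loosely comparable to the registers variable in STRMATCH.

```python
import re

def found(pattern, text):
    """Return True if pattern occurs anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

checks = {
    'start_anchor_hit':  found(r'^cat', 'catalog'),     # True
    'start_anchor_miss': found(r'^cat', 'concat'),      # False
    'end_anchor_hit':    found(r'cat$', 'concat'),      # True
    'optional_char':     found(r'^colou?r$', 'color'),  # True
    'char_class':        found(r'^[0-9]$', '7'),        # True
    'negated_class':     found(r'^[^0-9]$', '7'),       # False
    'alternation':       found(r'^(cat|dog)$', 'dog'),  # True
}

# Parentheses also capture the text matched by each sub-expression.
grouped = re.search(r'(x+)(y+)', 'aaxxyyy')
print(grouped.group(1), grouped.group(2))  # xx yyy
```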
Escaping Special Characters
To match a special character as you would a normal character, you must "escape" it by preceding it with a backslash (\). Note, however, that in PV‑WAVE strings, two backslashes translate to a single backslash. For example, to match a period (.) in a regular expression in a PV‑WAVE function, you must use '\\.'.
 
Note: To match a single backslash in a PV‑WAVE string, you must use two pairs of backslashes ('\\\\'). Each pair becomes a single backslash in the PV‑WAVE string, so the regular expression receives one escaped backslash. In other words, the first pair of backslashes produces the "escape" character, and the second pair produces the "escaped" backslash.
If you get confused writing strings with multiple backslashes in PV‑WAVE, you can print the string to see what you get. For example:
PRINT, '\\\\'
\\
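The same two-level escaping can be observed in other languages. As an analogy only, ordinary (non-raw) Python string literals also collapse each pair of backslashes into one before the regular expression engine sees the pattern. The sketch below is Python, not PV‑WAVE, and the variable names and sample strings are hypothetical.

```python
import re

# Ordinary Python literals: each '\\' in source becomes one backslash.
dot_pattern = '\\.'          # the regex engine receives: \.
backslash_pattern = '\\\\'   # the regex engine receives: \\

print(dot_pattern)             # \.
print(backslash_pattern)       # \\
print(len(backslash_pattern))  # 2

# The escaped dot matches only a literal period.
dot_hit = re.search(dot_pattern, '3.14159') is not None
dot_miss = re.search(dot_pattern, 'the quick brown fox') is not None

# The escaped backslash matches one literal backslash (C:\temp).
slash_hit = re.search(backslash_pattern, 'C:\\temp') is not None
```

Printing the pattern string, as suggested above for PV‑WAVE, is the quickest way to confirm what the regular expression engine actually receives.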
Practical Regular Expression Examples
Assume that string is a string array defined in PV‑WAVE. The following PV‑WAVE commands demonstrate the regular expression pattern matching used in the STRMATCH command.
; Matches any string containing the character 'a'. 
result=STRMATCH(string, 'a') 
 
; Matches any string beginning (^) with Cat, bat, and so on:
; 'Cat Woman', 'catatonic', 'Batman, the animated series'
; but does not match: ' cat' (begins with a space), 'cab', and 
; so on. 
result=STRMATCH(string, '^[CcBb]at') 
 
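; Matches any string containing 'L' followed by one or more 
; occurrences of 'l': 'Get a Llama' matches; 'larry the llama' 
; does not match (first l in llama is lower case). 
result=STRMATCH(string, 'Ll+') 
 
; Matches any string that begins with a character other than 'C' 
; and ends with 'x': 'box' matches; 'Cox' does not match. 
result=STRMATCH(string, '^[^C].*x$') 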
 
; Matches any string containing a period. '3.14159' matches; 
; 'the quick brown fox' does not match. Remember that it takes 
; two backslashes in a PV-WAVE string to produce the single 
; backslash that "escapes" the dot (.), as explained previously. 
result=STRMATCH(string, '\\.') 
 
; Matches any string containing any character (that is, any 
; non-null string). 
result=STRMATCH(string, '.') 
 
; Matches only empty strings (start and end with nothing in 
; between). 
result=STRMATCH(string, '^$')
 
; Matches either blank or null strings (between the beginning (^) 
; and the end ($) there are only zero or more spaces ( *)). 
result=STRMATCH(string, '^ *$')
 
; Matches only three-character strings. 
result=STRMATCH(string, '^...$')
 
; Matches strings three characters or longer. 
result=STRMATCH(string, '^...+$')
 
; This example matches any integer number, possibly surrounded 
; by spaces and/or tabs. This expression means: from the 
; beginning of the string (^), zero or more spaces or tabs 
; (\011 is the octal ASCII code for a tab character), zero or 
; one sign [-+], one or more digits [0-9], zero or more 
; spaces/tabs, and finally the end of the string ($). 
result=STRMATCH(string, '^[  \011]*[-+]?[0-9]+[  \011]*$')
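The integer pattern is a good one to try against a handful of inputs. The sketch below does so in Python, used here as an assumed stand-in and not PV‑WAVE. It writes the tab as \t instead of \011, and the sample strings are hypothetical. The result list uses 1 and 0 to indicate a match or no match for each string.

```python
import re

# Optional spaces/tabs, optional sign, one or more digits,
# optional spaces/tabs, anchored at both ends of the string.
integer = re.compile(r'^[ \t]*[-+]?[0-9]+[ \t]*$')

# Hypothetical sample strings.
samples = ['42', '  -7\t', '+15 ', '3.14', '12 34', '', '- 5']
result = [1 if integer.search(s) else 0 for s in samples]

print(result)  # [1, 1, 1, 0, 0, 0, 0]
```

The anchors do the important work here: without ^ and $, strings such as "3.14" and "12 34" would match because they contain a run of digits.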
Regular Expressions vs. Wildcard Characters
Many users understandably confuse wildcard characters and regular expressions, because both are used for pattern matching, and because some of the same characters, like asterisk (*), question mark (?), and square brackets ([ ]), are used in both, yet have different meanings.
 
Note: Wildcard characters are commonly used in file matching contexts on Microsoft Windows systems. On UNIX systems, wildcards are used in the Bourne shell and C shell, as well as in the commands find and cpio. The most common wildcard is the asterisk (*), which matches any group of characters.
A common misconception is that the asterisk (*) is a wildcard character in regular expressions. In regular expressions, asterisk (*) means “match zero or more of the preceding character.”
To make a "wildcard" (that is, an expression that matches anything) with regular expressions, you must use '.*' (dot asterisk). This expression means, "match zero or more of any character."
Example of Wildcards vs. Regular Expressions
Most computer users have used the asterisk (*) as a wildcard character in system commands such as ls and dir. For example:
dir file.*
is a wildcard expression that matches anything that begins with “file.”, such as file.c, file.o, file.dat, file.pro, and so on.
However, the regular expression character * means something entirely different from the wildcard character *. In regular expressions, the asterisk means match zero or more of the preceding character.
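Therefore, the regular expression 'file.*' matches any string that contains "file" followed by zero or more of any character. Because the expression is not anchored to the beginning of the string, it would match: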
file.dat
myfile.c
myfile
myfiles
This result is quite different from the wildcard example shown previously.
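The contrast can be reproduced by applying the same pattern text under both sets of rules. The sketch below is Python, an assumed stand-in and not PV‑WAVE: fnmatch.fnmatchcase applies shell-style wildcard rules to the whole name, while re.search applies regular expression rules anywhere in the string. The file names are hypothetical.

```python
import fnmatch
import re

# Hypothetical file names.
names = ['file.dat', 'file.c', 'myfile.c', 'myfile', 'myfiles']

# 'file.*' read as a wildcard: literal "file." followed by anything.
wildcard_hits = [n for n in names if fnmatch.fnmatchcase(n, 'file.*')]

# 'file.*' read as a regular expression: "file" anywhere in the string,
# followed by zero or more of any character.
regex_hits = [n for n in names if re.search(r'file.*', n)]

print(wildcard_hits)  # ['file.dat', 'file.c']
print(regex_hits)     # ['file.dat', 'file.c', 'myfile.c', 'myfile', 'myfiles']
```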
Regular Expressions are Versatile
You can, of course, construct a regular expression that is equivalent to the wildcard expression shown previously. Here is a regular expression that performs the same pattern matching function as the wildcard expression file.*:
'^file\\..*'
Here, the caret (^) matches the beginning of the string, "\\." matches a single dot (.), and ".*" matches zero or more of any character.
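As a quick check of that equivalence, the sketch below compares the anchored expression with the wildcard on a small, hypothetical list of names. It is Python, used as an assumed stand-in and not PV‑WAVE. Note that a Python raw string needs only one backslash before the dot, where a PV‑WAVE string needs two.

```python
import fnmatch
import re

# Hypothetical file names.
names = ['file.dat', 'file.c', 'file.', 'myfile.c', 'filedat', 'file']

# Anchored at the start, literal dot, then zero or more of any character.
anchored = re.compile(r'^file\..*')

regex_hits = [n for n in names if anchored.search(n)]
wildcard_hits = [n for n in names if fnmatch.fnmatchcase(n, 'file.*')]

print(regex_hits)                   # ['file.dat', 'file.c', 'file.']
print(regex_hits == wildcard_hits)  # True
```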
For More Information
For an excellent explanation of regular expressions, see:
* UNIX Power Tools, Jerry Peek, Tim O’Reilly, and Mike Loukides, O’Reilly & Associates/Bantam, 1993.
* Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools, Jeffrey E. F. Friedl, O’Reilly & Associates, 1997.
Many general books on UNIX programming contain information on regular expressions. In addition, books on the Perl programming language usually explain regular expressions in detail (Perl uses regular expressions extensively). For example, see:
* Programming Perl, Larry Wall, Tom Christiansen, and Randal L. Schwartz, O’Reilly & Associates, Inc., Second Edition, 1996.
UNIX users can find regular expressions explained in the man page for the ed command.
 

Version 2017.0
Copyright © 2017, Rogue Wave Software, Inc. All Rights Reserved.