


regex(3)            UNIX Programmer's Manual             regex(3)



NAME
     re_comp, re_exec, re_subs, re_modw, re_fail  - regular
     expression handling

ORIGIN
     Dept. of Computer Science
     York University

SYNOPSIS
     char *re_comp(pat)
     char *pat;

     re_exec(str)
     char *str;

     re_subs(src, dst)
     char *src;
     char *dst;

     void re_fail(msg, op)
     char *msg;
     char op;

     void re_modw(str)
     char *str;


DESCRIPTION
     These functions implement ed(1)-style partial regular
     expressions and supporting facilities.

     Re_comp compiles a pattern string into an internal form (a
     deterministic finite-state automaton) to be executed by
     re_exec for pattern matching.  Re_comp returns 0 if the pat-
     tern is compiled successfully, otherwise it returns an error
     message string. If re_comp is called with a 0 or a null
     string, it returns without changing the currently compiled
     regular expression.

     Re_comp supports the same limited set of regular expressions
     found in ed and the Berkeley regex(3) routines:

     [1]     char    Matches itself, unless it is a special char-
                     acter (meta-character): . \ [ ] * + ^ $

     [2]     .       Matches any character.

     [3]     \       Matches the character following it, except
                     when followed by a digit 1 to 9, (, ), <
                     or >.  (See [7], [8] and [9].)  It is used as
                     an escape character for all other meta-
                     characters, and itself. When used in a set



Printed 5/16/91               local                             1
                     ([4]), it is treated as an ordinary charac-
                     ter.

     [4]     [set]   Matches one of the characters in the set.
                     If the first character in the set is ^, it
                     matches a character NOT in the set. A short-
                     hand S-E is used to specify a set of charac-
                     ters S up to E, inclusive. The special char-
                     acters ] and - have no special meaning if
                     they appear as the first chars in the set.
                             examples:       match:
                             [a-z]           any lowercase alpha
                             [^]-]           any char except ] and -
                             [^A-Z]          any char except
                                             uppercase alpha
                             [a-zA-Z0-9]     any alphanumeric

     [5]     *       Any regular expression form [1] to [4], fol-
                     lowed by the closure char (*), matches zero
                     or more occurrences of that form.

     [6]     +       Same as [5], except it matches one or more.

     [7]             A regular expression in the form [1] to
                     [10], enclosed as \(form\) matches what form
                     matches. The enclosure creates a set of
                     tags, used for [8] and for pattern substitu-
                     tion in re_subs. The tagged forms are num-
                     bered starting from 1.

     [8]             A \ followed by a digit 1 to 9 matches what-
                     ever a previously tagged regular expression
                     ([7]) matched.

     [9]     \<      Matches the beginning of a word, that is, an
                     empty string followed by a letter, digit, or
                     _ and not preceded by a letter, digit, or _.
             \>      Matches the end of a word, that is, an empty
                     string preceded by a letter, digit, or _ and
                     not followed by a letter, digit, or _.

     [10]            A composite regular expression xy where x
                     and y are in the form of [1] to [10] matches
                     the longest match of x followed by a match
                     for y.

     [11]    ^ $     A regular expression starting with a ^
                     character and/or ending with a $ character
                     restricts the pattern matching to the
                     beginning of the line and/or the end of the
                     line [anchors]. Elsewhere in the pattern, ^
                     and $ are treated as ordinary characters.
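
     The set syntax in [4] can be illustrated with a small,
     self-contained sketch. It is not taken from the sources:
     parse_set and its 256-entry membership table are hypothetical
     names chosen for this example, and keeping the terminating
     null out of a negated set is an assumption of the sketch. It
     shows a leading ^, a literal ] or - at the start of the set,
     and S-E ranges.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the library): parse the body of a
 * set, starting just after the opening '[', into member[], where
 * member[c] != 0 means that character c belongs to the set.
 * Returns a pointer just past the closing ']', or NULL if it is missing.
 */
const char *parse_set(const char *p, unsigned char member[256])
{
    int negate = 0;
    int c;

    memset(member, 0, 256);
    if (*p == '^') {                /* leading ^ inverts the set */
        negate = 1;
        p++;
    }
    /* A - or ] at the start of the set is an ordinary member. */
    if (*p == '-')
        member[(unsigned char)*p++] = 1;
    if (*p == ']')
        member[(unsigned char)*p++] = 1;

    while (*p != '\0' && *p != ']') {
        if (*p == '-' && p[1] != '\0' && p[1] != ']') {
            /* S-E shorthand: everything from S up to E, inclusive */
            int lo = (unsigned char)p[-1];
            int hi = (unsigned char)p[1];
            for (c = lo; c <= hi; c++)
                member[c] = 1;
            p += 2;
        } else {
            member[(unsigned char)*p++] = 1;
        }
    }
    if (*p != ']')
        return NULL;                /* corresponds to "Missing ]" */

    if (negate) {
        for (c = 0; c < 256; c++)
            member[c] = !member[c];
        member[0] = 0;              /* assumption: never match the null */
    }
    return p + 1;
}
```

     Called on the text following the opening bracket of [^]-],
     this sketch leaves ] and - out of the table and admits every
     other non-null character.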


     Re_exec executes the internal form produced by re_comp and
     searches the argument string for the regular expression
     described by the internal form. Re_exec returns 1 if the
     last regular expression pattern is matched within the
     string, 0 if no match is found. In case of an internal error
     (corrupted internal form), re_exec calls the user-supplied
     re_fail and returns 0.

     The strings passed to both re_comp and re_exec may have
     trailing or embedded newline characters. The strings must be
     terminated by nulls.

     Re_subs does ed-style pattern substitution after re_exec has
     found a successful match. The source string parameter to
     re_subs is copied to the destination string with the
     following interpretations:

     [1]     &       Substitute the entire matched string in the
                     destination.

     [2]     \n      Substitute the substring matched by a tagged
                     subpattern numbered n, where n is between 1
                     and 9, inclusive.

     [3]     \char   Treat the next character literally, unless
                     the character is a digit ([2]).


     If the copy operation with the substitutions is successful,
     re_subs returns 1.  If the source string is corrupted, or
     the last call to re_exec failed, it returns 0.
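
     The three substitution rules above can be sketched as a
     stand-alone function. This is an illustration only, not the
     library's re_subs: the name expand_subst, the explicit
     destination length, and the bopat/eopat offset arrays are all
     assumptions made so that the example is self-contained. Entry
     0 of the arrays stands for the entire match and entries 1 to 9
     for the tagged subpatterns.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not the library's re_subs): copy src to dst,
 * expanding & and \1 .. \9 from offsets into subject.
 * bopat[n] and eopat[n] are assumed to hold the start and end offsets
 * of tag n; a negative start marks an unset tag.
 * Returns 1 on success, 0 on an unset tag or if dst is too small.
 */
int expand_subst(const char *src, char *dst, size_t dstlen,
                 const char *subject,
                 const int bopat[10], const int eopat[10])
{
    size_t out = 0;

    if (dstlen == 0)
        return 0;
    while (*src != '\0') {
        int tag = -1;               /* -1 means: copy c literally */
        char c = *src++;

        if (c == '&') {
            tag = 0;                /* rule [1]: entire match */
        } else if (c == '\\' && *src != '\0') {
            c = *src++;
            if (c >= '1' && c <= '9')
                tag = c - '0';      /* rule [2]: tagged subpattern */
            /* otherwise rule [3]: c is taken literally */
        }

        if (tag < 0) {
            if (out + 1 >= dstlen)
                return 0;
            dst[out++] = c;
        } else {
            size_t len;

            if (bopat[tag] < 0 || eopat[tag] < bopat[tag])
                return 0;
            len = (size_t)(eopat[tag] - bopat[tag]);
            if (out + len >= dstlen)
                return 0;
            memcpy(dst + out, subject + bopat[tag], len);
            out += len;
        }
    }
    dst[out] = '\0';
    return 1;
}
```

     With the subject foo-bar, the whole string as the match and
     foo as tag 1, the source string [&] \1 \& expands to
     [foo-bar] foo & under these assumptions.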

     Re_modw adds new characters to an internal table to change
     re_exec's understanding of what a word should look like when
     matching with the \< and \> constructs. If the string
     parameter is 0 or a null string, the table is reset to the
     default, which contains A-Z a-z 0-9 _ .
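
     A word table of this kind might look like the sketch below.
     The names word_tab, word_reset, word_modify, at_word_start and
     at_word_end are hypothetical and the layout is an assumption,
     not the library's internal table. The sketch mirrors the
     documented behaviour: characters are only ever added, and a 0
     or null string restores the default set.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table: word_tab[c] != 0 means c is a word character. */
static unsigned char word_tab[256];

/* Restore the documented default set: A-Z a-z 0-9 _ */
void word_reset(void)
{
    const char *defaults =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789_";
    const char *p;

    memset(word_tab, 0, sizeof word_tab);
    for (p = defaults; *p != '\0'; p++)
        word_tab[(unsigned char)*p] = 1;
}

/* Add the characters of s, or reset on a 0 or null string. */
void word_modify(const char *s)
{
    if (s == NULL || *s == '\0') {
        word_reset();
        return;
    }
    for (; *s != '\0'; s++)
        word_tab[(unsigned char)*s] = 1;
}

/* \< : a word character at text[i], not preceded by one. */
int at_word_start(const char *text, size_t i)
{
    return word_tab[(unsigned char)text[i]] &&
           (i == 0 || !word_tab[(unsigned char)text[i - 1]]);
}

/* \> : a word character before text[i], none at text[i]. */
int at_word_end(const char *text, size_t i)
{
    return i > 0 && word_tab[(unsigned char)text[i - 1]] &&
           !word_tab[(unsigned char)text[i]];
}
```

     In this sketch the table starts out empty, so word_modify must
     be called once with 0 before the first test. After
     word_modify("-"), the string x-ray counts as a single word.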

     Re_fail is a user-supplied routine to handle internal
     errors.  Re_exec calls re_fail with an error message string
     and the opcode character that caused the error.  The default
     re_fail routine simply prints the message and the opcode
     character to stderr and invokes exit(2).

EXAMPLES
     In the examples below, the dfaform describes the internal
     form after the pattern is compiled. For additional details,
     refer to the sources.




     foo*.*
          dfaform:  CHR f CHR o CLO CHR o END CLO ANY END END
          matches:  fo foo fooo foobar fobar foxx ...

     fo[ob]a[rz]
          dfaform:  CHR f CHR o CCL 2 o b CHR a CCL 2 r z END
          matches:  fobar fooar fobaz fooaz

     foo\\+
          dfaform:  CHR f CHR o CHR o CHR \ CLO CHR \ END END
          matches:  foo\ foo\\ foo\\\  ...

     \(foo\)[1-3]\1 (same as foo[1-3]foo, but takes less internal space)
          dfaform:  BOT 1 CHR f CHR o CHR o EOT 1 CCL 3 1 2 3 REF 1 END
          matches:  foo1foo foo2foo foo3foo

     \(fo.*\)-\1
          dfaform:  BOT 1 CHR f CHR o CLO ANY END EOT 1 CHR - REF 1 END
          matches:  foo-foo fo-fo fob-fob foobar-foobar ...
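
     To make the dfaform notation more concrete, the sketch below
     interprets a small subset of it (CHR, ANY, CLO and END only).
     It is a guess at how such a form could be executed, not the
     code from the sources: the opcode values, the byte layout and
     the names run_form and find_form are assumptions. A closure
     is matched greedily and then backed off one character at a
     time until the rest of the form matches.

```c
#include <stddef.h>

/* Hypothetical opcode values; the real sources may differ. */
enum { END = 0, CHR, ANY, CLO };

/* Size in bytes of the one-character form at ap: "CHR c" or "ANY". */
static size_t elem_len(const unsigned char *ap)
{
    return (*ap == CHR) ? 2 : 1;
}

/* Does the one-character form at ap accept the character c? */
static int match_one(const unsigned char *ap, char c)
{
    if (c == '\0')
        return 0;
    return *ap == ANY || (*ap == CHR && ap[1] == (unsigned char)c);
}

/* Match the form ap at lp; return the end of the match or NULL. */
const char *run_form(const char *lp, const unsigned char *ap)
{
    for (;;) {
        switch (*ap) {
        case END:
            return lp;
        case CHR:
        case ANY:
            if (!match_one(ap, *lp))
                return NULL;
            lp++;
            ap += elem_len(ap);
            break;
        case CLO: {
            const unsigned char *body = ap + 1;
            /* skip the closed form and the END that terminates it */
            const unsigned char *rest = body + elem_len(body) + 1;
            const char *start = lp;
            const char *e;

            while (match_one(body, *lp))    /* take the longest run */
                lp++;
            for (;;) {                      /* then back off */
                e = run_form(lp, rest);
                if (e != NULL)
                    return e;
                if (lp == start)
                    return NULL;
                lp--;
            }
        }
        default:
            return NULL;                    /* unknown opcode */
        }
    }
}

/* Unanchored search: try the form at every position of text. */
int find_form(const char *text, const unsigned char *prog)
{
    const char *p = text;

    for (;;) {
        if (run_form(p, prog) != NULL)
            return 1;
        if (*p == '\0')
            return 0;
        p++;
    }
}

/* The form shown above for foo*.* */
const unsigned char foo_prog[] = {
    CHR, 'f', CHR, 'o', CLO, CHR, 'o', END, CLO, ANY, END, END
};
```

     Under these assumptions, find_form("a fobar", foo_prog)
     reports a match and find_form("f", foo_prog) does not.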

DIAGNOSTICS
     Re_comp returns one of the following strings if an error
     occurs:

          No previous regular expression,
          Empty closure,
          Illegal closure,
          Cyclical reference,
          Undetermined reference,
          Unmatched \(,
          Missing ],
          Null pattern inside \(\),
          Null pattern inside \<\>,
          Too many \(\) pairs,
          Unmatched \).

REFERENCES
     Software Tools                Kernighan & Plauger
     Software Tools in Pascal      Kernighan & Plauger
     Grep sources [rsx-11 C dist]  David Conroy
     Ed - text editor              Unix Programmer's Manual
     Advanced editing on Unix      B. W. Kernighan
     RegExp sources                Henry Spencer

HISTORY AND NOTES
     These routines are derived from various implementations
     found in Software Tools books, and David Conroy's grep. They
     are NOT derived from licensed/restricted software.  For more
     interesting/academic/complicated implementations, see Henry
     Spencer's regexp routines (V8) or the GNU Emacs pattern
     matching module.




     The re_comp and re_exec routines perform almost as well as
     their licensed counterparts, sometimes better. In very few
     instances, they are about 10% to 15% slower.

AUTHOR
     Ozan S. Yigit (oz)
     usenet: utzoo!yetti!oz
     bitnet: oz@yusol || oz@yuyetti

SEE ALSO
     ed(1), ex(1), egrep(1), fgrep(1), grep(1), regex(3)

BUGS
     These routines are Public Domain. You can get them in
     source.
     The internal storage for the dfa form is not checked for
     overflows. Currently, it is 1024 bytes.
     Others, no doubt.