Regexp syntax is largely, but not completely, compatible with Perl, Java and extended POSIX regexps. The supported syntax is documented below.
Characters
At its core, a regular expression consists of character literals. For example, R/ f/ is a regular expression matching just the string 'f'. The usual escape codes are also provided, such as \t for the tab character and \uxxxxxx for an arbitrary Unicode code point, given by its hex value. In addition, any character can be preceded by a backslash to escape it, unless the resulting sequence has a special meaning. For example, to match a literal opening parenthesis, use \(.
Concatenation, alternation and grouping
Regular expressions can be built out of multiple characters by concatenation. For example, R/ ab/ matches a followed by b. The | (alternation) operator constructs a regexp which matches one of two alternatives. Parentheses can be used for grouping. So R/ f(oo|ar)/ would match either 'foo' or 'far'.
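As a minimal sketch of these operators in use, the following assumes the matches? word from the regexp vocabulary, which tests whether an entire string matches a regexp. The test strings are illustrative only.

```factor
USING: prettyprint regexp ;

! Concatenation: a followed by b.
"ab" R/ ab/ matches? .          ! t

! Alternation inside a group: f followed by either oo or ar.
"foo" R/ f(oo|ar)/ matches? .   ! t
"far" R/ f(oo|ar)/ matches? .   ! t
"fo" R/ f(oo|ar)/ matches? .    ! f
```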
Character classes
Square brackets define a convenient way to refer to a set of characters. For example, [ab] refers to either a or b, and [a-z] refers to all of the characters between a and z, in code point order. You can use these together, as in [ac-fz], which matches all of the characters between c and f, in addition to a and z. Character classes can be negated using a caret, as in [^a], which matches all characters which are not a.
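The class forms above can be tried out with a short sketch like the following, again assuming the matches? word from the regexp vocabulary; the single-character inputs are chosen purely for illustration.

```factor
USING: prettyprint regexp ;

"d" R/ [ac-fz]/ matches? .   ! t  (d lies between c and f)
"b" R/ [ac-fz]/ matches? .   ! f  (b is not in the class)
"b" R/ [^a]/ matches? .      ! t  (anything other than a)
"a" R/ [^a]/ matches? .      ! f
```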
Predefined character classes
Several character classes are predefined, both for convenience and because they are too large to represent directly. In Factor regular expressions, all character classes are Unicode-aware.
\d | Digits |
\D | Not digits |
\s | Whitespace |
\S | Not whitespace |
\w | Word character (alphanumeric or underscore) |
\W | Not word character |
\p{property} | Character which fulfils the property |
\P{property} | Character which does not fulfil the property |
Properties for \p and \P (case-insensitive):
\p{lower} | Lower case letters |
\p{upper} | Upper case letters |
\p{alpha} | Letters |
\p{ascii} | Characters in the ASCII range |
\p{alnum} | Letters or numbers |
\p{punct} | Punctuation |
\p{blank} | Non-newline whitespace |
\p{cntrl} | Control character |
\p{space} | Whitespace |
\p{xdigit} | Hexadecimal digit |
\p{Nd} | Character in Unicode category Nd |
\p{Z} | Character in Unicode category beginning with Z |
\p{script=Cham} | Character in the Cham writing system |
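A few of the predefined classes and properties from the tables above, sketched with ASCII inputs and the matches? word from the regexp vocabulary. The inputs are illustrative assumptions, not taken from the original text.

```factor
USING: prettyprint regexp ;

"2024" R/ \d+/ matches? .       ! t  (digits only)
"x_1" R/ \w+/ matches? .        ! t  (letters, digits and underscore)
"A" R/ \p{upper}/ matches? .    ! t  (upper case letter)
"a" R/ \P{upper}/ matches? .    ! t  (not an upper case letter)
```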
Character class operations
Character classes can be composed using four binary operations: || && ~~ --. These perform union, intersection, symmetric difference and difference, respectively. For example, characters which are lower-case but not in the Latin script could be matched as [\p{lower}--\p{script=latin}]. These operations are right-associative, and ^ binds tighter than any of them. There is no syntax for grouping.
Boundaries
Special operators exist to match certain points in the string. These are called 'zero-width' because they do not consume any characters.
^ | Beginning of a line |
$ | End of a line |
\A | Beginning of text |
\z | End of text |
\Z | Almost end of text: only thing after is newline |
\b | Word boundary (by Unicode word boundaries) |
\B | Not word boundary (by Unicode word boundaries) |
Greedy quantifiers
It is possible to have a regular expression which matches a variable number of occurrences of another regular expression.
a* | Zero or more occurrences of a |
a+ | One or more occurrences of a |
a? | Zero or one occurrences of a |
a{n} | n occurrences of a |
a{n,} | At least n occurrences of a |
a{,m} | At most m occurrences of a |
a{n,m} | Between n and m occurrences of a |
All of these quantifiers are greedy, meaning that they take as many repetitions as possible within the larger regular expression. Reluctant and possessive quantifiers are not yet supported.
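A brief sketch of bounded and greedy repetition, assuming the matches? and first-match words from the regexp vocabulary (first-match returns a slice of the input, converted here with >string from the strings vocabulary). The inputs are illustrative.

```factor
USING: prettyprint regexp strings ;

! Bounded repetition: between two and three occurrences of a.
"aa" R/ a{2,3}/ matches? .      ! t
"aaaa" R/ a{2,3}/ matches? .    ! f

! Greedy: a+ consumes the whole run of a's, not just the first one.
"baaad" R/ a+/ first-match >string .   ! "aaa"
```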
Lookaround
Operators are provided to look ahead of and behind the current point in the regular expression. These can be used in any context, but they are most useful at the beginning or end of a regular expression.
(?=a) | Asserts that the current position is immediately followed by a |
(?!a) | Asserts that the current position is not immediately followed by a |
(?<=a) | Asserts that the current position is immediately preceded by a |
(?<!a) | Asserts that the current position is not immediately preceded by a |
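As a hedged illustration of lookahead, the sketch below assumes the first-match word from the regexp vocabulary, which returns the first matching slice or f when there is no match. The input strings are made up for the example.

```factor
USING: prettyprint regexp strings ;

! Lookahead: match foo only where bar follows; bar itself is not consumed.
"foobar" R/ foo(?=bar)/ first-match >string .

! Here foo is followed by baz, so the assertion fails and nothing matches.
"foobaz" R/ foo(?=bar)/ first-match .
```

The first expression is meant to yield only foo, since the assertion is zero-width; the second is meant to yield f because no position satisfies the lookahead.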
Quotation
To make it convenient to match a long string which contains regexp operators, a special syntax is provided. If a substring begins with \Q, then everything up to \E is quoted (escaped). For example, R/ \Qfoo\bar|baz()\E/ matches exactly the string "foo\bar|baz()".
Unsupported features
Group capture
Reluctant and possessive quantifiers
Backreferences
Backreferences were omitted because of a design decision to allow only regular expressions following the formal theory of regular languages. For more information, see The theory of regular expressions.
To work around the lack of backreferences, consider using group capture and then creating a new regular expression to match the captured string using regexp.combinators.
Previous match
Another feature that is not included is Perl's \G syntax, which references the previous match. It is omitted because it is inherently stateful, and Factor regexps don't hold state.
Embedding code
Operations which embed code into a regexp are not supported. This would require the inclusion of the Factor parser and compiler in any deployed application which wants to expose regexps to the user, leading to an undesirable increase in the code size.
Casing operations
No special casing operations, such as Perl's \L, are included.