RUBY REGULAR EXPRESSION SYNTAX
Syntax Elements
- \
- escape (enable or disable meta character meaning)
- |
- alternation
- (…)
- group
- […]
- character class
Characters
- \t
- horizontal tab (0x09)
- \v
- vertical tab (0x0B)
- \n
- newline (0x0A)
- \r
- return (0x0D)
- \b
- back space (0x08)
\b is effective in character class […] only
- \f
- form feed (0x0C)
- \a
- bell (0x07)
- \e
- escape (0x1B)
- \nnn
- octal char (encoded byte value)
- \xHH
- hexadecimal char (encoded byte value)
- \x{7HHHHHHH}
- wide hexadecimal char (character code point value)
- \cx
- control char (character code point value)
- \C-x
- control char (character code point value)
- \M-x
- meta (x|0x80) (character code point value)
- \M-\C-x
- meta control char (character code point value)
Character types
- .
- any character (except newline)
- \w
- word character
Not Unicode:
- alphanumeric, "_" and multibyte char.
Unicode:
- General_Category — (Letter|Mark|Number|Connector_Punctuation)
- \W
- non word char
- \s
- whitespace char
Not Unicode:
Unicode:
- 0009, 000A, 000B, 000C, 000D, 0085(NEL),
- General_Category:
- — Line_Separator
- — Paragraph_Separator
- — Space_Separator
- \S
- non whitespace char
- \d
- decimal digit char
Unicode: General_Category — Decimal_Number
- \D
- non decimal digit char
- \h
- hexadecimal digit char [0-9a-fA-F]
- \H
- non hexadecimal digit char
Character Properties
\p{property-name}
\p{^property-name} (negative)
\P{property-name} (negative)
property-name:
Works on all encodings:
- Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower, Print, Punct, Space,
Upper, XDigit, Word, ASCII,
Works on EUC_JP, Shift_JIS:
Works on UTF8, UTF16, UTF32:
- Any, Assigned, C, Cc, Cf, Cn, Co, Cs, L, Ll, Lm, Lo, Lt, Lu, M, Mc, Me, Mn,
N, Nd, Nl, No, P, Pc, Pd, Pe, Pf, Pi, Po, Ps, S, Sc, Sk, Sm, So, Z, Zl, Zp,
Zs, Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese, Buhid,
Canadian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic, Deseret,
Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati,
Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada,
Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam,
Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi
Quantifiers
Greedy
- ?
- 1 or 0 times
- *
- 0 or more times
- +
- 1 or more times
- {n,m}
- at least n but not more than m times
- {n,}
- at least n times
- {,n}
- at least 0 but not more than n times ({0,n})
- {n}
- n times
Reluctant
- ??
- 1 or 0 times
- *?
- 0 or more times
- +?
- 1 or more times
- {n,m}?
- at least n but not more than m times
- {n,}?
- at least n times
- {,n}?
- at least 0 but not more than n times (== {0,n}?)
Possessive (greedy and does not backtrack after repeated)
- ?+
- 1 or 0 times
- *+
- 0 or more times
- ++
- 1 or more times
({n,m}+, {n,}+, {n}+ are possessive op. in ONIG_SYNTAX_JAVA only)
Anchors
- ^
- beginning of the line
- $
- end of the line
- \b
- word boundary
- \B
- not word boundary
- \A
- beginning of string
- \Z
- end of string, or before newline at the end
- \z
- end of string
- \G
- matching start position
Character class
- ^…
- negative class (lowest precedence operator)
- x-y
- range from x to y
- […]
- set (character class in character class)
- ..&&..
- intersection (low precedence at the next of ^)
If you want to use ’[’, ’-’, ’]’ as a
normal character in a character class, you should escape these characters
by ’\’.
POSIX bracket ([:xxxxx:], negate [:^xxxxx:])
Not Unicode Case:
- alnum
- alphabet or digit char
- alpha
- alphabet
- ascii
- code value: [0 - 127]
- blank
- \t, \x20
- cntrl
- control
- digit
- 0-9
- graph
- include all of multibyte encoded characters
- lower
- lower case
- print
- include all of multibyte encoded characters
- punct
- punctuation
- space
- \t, \n, \v, \f, \r, \x20
- upper
- upper case
- xdigit
- 0-9, a-f, A-F
- word
- alphanumeric, "_" and multibyte characters
Unicode Case:
- alnum
- Letter | Mark | Decimal_Number
- alpha
- Letter | Mark
- ascii
- 0000 - 007F
- blank
- Space_Separator | 0009
- cntrl
- Control | Format | Unassigned | Private_Use | Surrogate
- digit
- Decimal_Number
- graph
- [[:^space:]] && ^Control && ^Unassigned &&
^Surrogate
- lower
- Lowercase_Letter
- print
- [[:graph:]] | [[:space:]]
- punct
- Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
Final_Punctuation | Initial_Punctuation | Other_Punctuation |
Open_Punctuation
- space
- Space_Separator | Line_Separator | Paragraph_Separator | 0009 | 000A | 000B
| 000C | 000D | 0085
- upper
- Uppercase_Letter
- xdigit
- 0030 - 0039 | 0041 - 0046 | 0061 - 0066 (0-9, a-f, A-F)
- word
- Letter | Mark | Decimal_Number | Connector_Punctuation
Extended groups
- (?#…)
- comment
- (?imx-imx)
- option on/off:
- i: ignore case
- m: multi-line (dot(.) match newline)
- x: extended form
- (?imx-imx:subexp)
- option on/off for subexp
- (?:subexp)
- not captured group
- (subexp)
- captured group
- (?=subexp)
- look-ahead
- (?!subexp)
- negative look-ahead
- (?<=subexp)
- look-behind
- (?<!subexp)
- negative look-behind
Subexp of look-behind must be fixed character length. But different
character length is allowed in top level alternatives only. ex.
(?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
In negative-look-behind, captured group isn‘t allowed, but shy
group(?:) is allowed.
- (?>subexp)
- atomic group don‘t backtrack in subexp.
- (?<name>subexp)
- define named group (All characters of the name must be a word character.)
Not only a name but a number is assigned like a captured group.
Assigning the same name as two or more subexps is allowed. In this case, a
subexp call can not be performed although the back reference is possible.
Back reference
- \n
- back reference by group number (n >= 1)
- \k<name>
- back reference by group name
In the back reference by the multiplex definition name, a subexp with a
large number is referred to preferentially. (When not matched, a group of
the small number is referred to.)
- Back reference by group number is forbidden if named group is defined in
the pattern and ONIG_OPTION_CAPTURE_GROUP is not setted.
Back reference with nest level
- \k<name+n>
- n: 0, 1, 2, …
- \k<name-n>
- n: 0, 1, 2, …
Destinate relative nest level from back reference position.
Examples:
/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")
r = ORegexp.compile(<<'__REGEXP__'.strip, :options => Oniguruma::EXTENDED)
(?<element> \g<stag> \g<content>* \g<etag> ){0}
(?<stag> < \g<name> \s* > ){0}
(?<name> [a-zA-Z_:]+ ){0}
(?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
(?<etag> </ \k<name+1> >){0}
\g<element>
__REGEXP__
p r.match('<foo>f<bar>bbb</bar>f</foo>').captures
Subexp call ("Tanaka Akira special")
- \g<name>
- call by group name
- \g<n>
- call by group number (n >= 1)
Captured group
Behavior of the no-named group (…) changes with the following
conditions. (But named group is not changed.)
- case 1
- ORegexp.new( ’…’ ) (named group is not used, no
option)
… is treated as a captured group.
- case 2
- ORegexp.new( ’…’, :options =>
OPTION_DONT_CAPTURE_GROUP ) (named group is not used, ‘g’
option)
… is treated as a no-captured group (?:…).
- case 3
- ORegexp.new( ’…(?<name>…)…’ )
(named group is used, no option)
(?<name>…) is treated as a no-captured group (?:…)
numbered-backref/call is not allowed.
- case 2
- ORegexp.new( ’…’, :options => OPTION_CAPTURE_GROUP
) (named group is used, ‘G’ option)
(?<name>…) is treated as a captured group (?:…)
numbered-backref/call is allowed.
where
- g: OPTION_DONT_CAPTURE_GROUP
- G: OPTION_CAPTURE_GROUP
(‘g’ and ‘G’ options are argued in ruby-dev ML)
Syntax dependent options
ONIG_SYNTAX_RUBY
- (?m)
- dot(.) match newline
ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
- (?s)
- dot(.) match newline
- (?m)
- ^ match after newline, $ match before newline
Original extensions
- hexadecimal digit char type \h, \H
- named group (?<name>…)
- named backref \k<name>
- subexp call \g<name>, \g<group-num>
Lacking features compare with perl 5.8.0
Differences with Japanized GNU regex(version 0.12) of Ruby 1.8
- add character property (\p{property}, \P{property})
- add hexadecimal digit char type (\h, \H)
- add look-behind
(?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
- add possessive quantifier. ?+, *+, ++
- add operations in character class. [], &&
(’[’ must be escaped as an usual char in character class.)
- add named group and subexp call.
- octal or hexadecimal number sequence can be treated as a multibyte code
char in character class if multibyte encoding is specified.
(ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
- allow the range of single byte char and multibyte char in character class.
ex. [a-<<any EUC-JP character>>] in EUC-JP encoding.
- effect range of isolated option is to next ’)’. ex. (?:(?i)a|b)
is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
- isolated option is not transparent to previous pattern. ex. a(?i)*
is a syntax error pattern.
- allowed incompleted left brace as an usual string. ex. /{/, /({)/, /a{2,3/
etc…
- negative POSIX bracket [:^xxxx:] is supported.
- POSIX bracket [:ascii:] is added.
- repeat of look-ahead is not allowed. ex. (?=a)*, (?!b){5}
- Ignore case option is effective to numbered character. ex.
<code>/\x61/i =~ "A"<code>
- In the range quantifier, the number of the minimum is omissible.
<code>/a{,n}/ == /a{0,n}/<code>
The simultanious abbreviation of the number of times of the minimum and the
maximum is not allowed. (/a{,}/)
- <code>a{n}?<code> is not a non-greedy operator.
<code>/a{n}?/ == /(?:a{n})?/<code>
- invalid back reference is checked and cause error. /\1/, /(a)\2/
- Zero-length match in infinite repeat stops the repeat, then changes of the
capture group status are checked as stop condition.
/(?:()|())*\1\2/ =~ ""
/(?:\1a|())*/ =~ "a"
Problems
- Invalid encoding byte sequence is not checked in UTF-8.
- Invalid first byte is treated as a character.
/./u =~ "\xa3"
- Incomplete byte sequence is not checked.
/\w+/ =~ "a\xf3\x8ec"