Regex Engine Internals
Pattern matching is the process of finding a secion of text that is described by a regular expression. The underlying system that trys to match the text is known as a regular expression engine.
All regular expression engines follow these two rules:
- leftmost matches are found first
Target strings are searched starting at the first character and continues until the last - once a match is found, the engine is done
- standard quantifiers are "greedy"
Regular expression quantifiers direct the engine as to how many times something can be repeated - the standard quantifiers try to match as many times as possible
When working with Regular Expressions, it is important to know the engine type. There are two classes of engines:
- Deterministic Finite Automaton (DFA)
- Nondeterministic Finite Automaton (NFA)
In general, DFA engines are faster, but lack capturing, lookaround, and nongreedy quantifiers.
DFA engines simply compares each character in the target to the regular expression. Without support for capture or lookaround, each character is only compared once making DFA engines the fastest.
Note: The alternation (|) metasequence always matches the longest option
There are two types of NFA engines: Traditional and POSIX:
Traditional NFA Engines
Traditional NFA engines compare each element of the regex to the target string, keeping track of positions where it between two options in the regex. If any option fails, the engine works backwards to the last saved position.
For standard quantifiers, the engine chooses the greedy option first. If the option fails, the engine returns to the saved position and tries any less greedy options.
For alternation, the traditional NFA engine tries each option sequentially.
POSIX NFA Engines
POSIX NFA engines are similar to Traditional NFA engines with one exception: a POSIX engine always chooses the longest of the lestmost matches.