Compile Time Regular Expressions
with
a Deterministic Finite Automaton

Hana Dusíková

Hana Dusíková

  • Researcher in Avast
    • Improving things
    • High-performance code
  • Avast Prague C++ Meetup
  • Czech National Body in WG21
  • Occasional hiker
  • Enthusiastic photographer
  • Fast book reader

@hankadusikova

#ctre #cppnow

"And if thou gaze long at a finite automaton,
a finite automaton also gazes into thee."
– Friedrich Nietzsche
(after taking a computer science class)

The
Compile Time Regular Expressions
Library

Example: Basic usage


							bool is_a_date(std::string_view input) noexcept {
								return ctre::match<"[0-9]{4}/[0-9]{2}/[0-9]{2}">(input);
							}
						

Example: Extracting a value


							std::optional<std::string_view> extract_sha256(std::string_view input) noexcept {
								if (auto r = ctre::match<"[a-fA-F0-9]{64}">(input)) {
									return r;
								} else {
									return std::nullopt;
								}
							}
						

Features of CTRE

  • PCRE2 compatible dialect
    • basic unicode support
    • optimizing greedy cycles into possessive if possible
  • constexpr
    • compile time constructing
    • constexpr & runtime matching
    • doesn't support runtime defined patterns
  • quick regex matching/searching
    • structured-bindings for extracting captures
    • generates compact assembly

Technical details of CTRE

  • small, header only C++17/20 library (8k LoC)
  • supports all major compilers
    • clang 6.0
    • gcc 7.4
    • msvc 15.8.8+
  • open-source

Basic API


							namespace ctre {
								
								template <fixed_string Pattern>
								constexpr auto match(const Range auto &) noexcept;
								
								template <fixed_string Pattern>
								constexpr auto match(ForwardIterable auto &&, ForwardIterable auto &&) noexcept;
								
								template <fixed_string Pattern>
								constexpr auto search(const Range auto &) noexcept;
								
								template <fixed_string Pattern>
								constexpr auto search(ForwardIterable auto &&, ForwardIterable auto &&) noexcept;
							
							}
						

DFA based API


							namespace ctre {
								
								template <fixed_string Pattern>
								constexpr bool fast_match(const Range auto &) noexcept;
								
								template <fixed_string Pattern>
								constexpr bool fast_match(ForwardIterable auto &&, ForwardIterable auto &&) noexcept;
								
								template <fixed_string Pattern>
								constexpr bool fast_search(const Range auto &) noexcept;
								
								template <fixed_string Pattern>
								constexpr bool fast_search(ForwardIterable auto &&, ForwardIterable auto &&) noexcept;
							
							}
						

Return type of the basic API


							namespace ctre {
								
								template <impl-spec> struct regex_results {
									constexpr operator bool() const noexcept;
									
									constexpr operator basic_string_view<char_type>() const noexcept;
									constexpr explicit operator basic_string<char_type>() const noexcept;
									
									constexpr auto begin() const noexcept;
									constexpr auto end() const noexcept;
									
									// number of captures
									constexpr static size_t size() const noexcept;
									
									// returns object similar to regex_results (without 'get' method)
									template <unsigned Id> constexpr auto get() const noexcept;
									template <fixed_string Name> constexpr auto get() const noexcept;
								};
								
							}
						

Example: Structured Binding


							// .get<0> is an implicit capture of whole pattern
							auto [r, year, month, day] = ctre::match<"([0-9]{4})/([0-9]{2})/([0-9]{2})">(input);
							
							if (r) {
								// do something with successful match
							}
						

Example: Iterate over
concatenated sequence of numbers


							int sum(std::string_view input) { 
								int output = 0;
								for (const auto & capt: ctre::range<"^,?([0-9]+)">(input)) 
									output += to_integer(capt.get<1>());
								return output;
							}
						

Iterable API


							namespace ctre {
								
								template <fixed_string Pattern>
								constexpr auto range(const Range auto &) noexcept;
								
								template <fixed_string Pattern>
								constexpr auto range(ForwardIterable auto &&, ForwardIterable auto &&) noexcept;
								
							}
						

Example: 🥰🦄


							constexpr bool is_emoji_only(std::u32string_view input) noexcept {
								return ctre::match<"\\p{emoji}+">(input);
							}
						

							  static_assert( is_emoji_only(U"🥰😱😆"));
							  static_assert(!is_emoji_only(U"no!😡"));
						
*unicode support is not yet fully merged
**thanks to Corentin Jabot for providing constexpr unicode tables ❤️

Caveats of a backtracking engine

pattern: aloha|[a-z]+
input: alohaha
aloha (fail, backtrack) alohaha (accepts)

Back-tracking

A common problem of many regular expression engines.

How can we avoid it?

We need to talk (a little bit) about theory.

Regular Expressions

hello|aloha|guten tag|dobrý den|bonjour

[a-z]+[0-9]+

A Regular Expression is (formally)

  • the empty set ( or {}),
  • an empty string (ε or {""}),
  • a literal character (e.g. a or {"a"}),
  • the concatenation of two REs (A⋅B),
  • the alternation of two REs (A|B),
  • or repetitions (Kleene star) (A*)

Non-formal syntax constructs

  • Optional A (A? is ε|A)
  • Plus repetition of A (A⋅A*)
  • Back-reference
not all features of a "regular expression"
implementation is technically a regular expression

Example: Non-formal syntax constructs

We are all used to a pattern like this:
(ct)?re|[a-f]+
The pattern is equivalent to:
(ε|ct)re|(a|b|c|d|e|f)(a|b|c|d|e|f)*
Every pattern can be converted into a formal form.

A Regular Expression
can be described as an abstract syntax tree

(ct)?re|[a-f]+

How to store the AST
in C++ compile-time evaluated code?

  • tree-like (allocated) data structure (we can't allocate in constexpr, yet)
  • type based expression
    • expression templates (like boost::xpressive)
    • tuple-like empty types

Types as the building blocks


							// building blocks
							struct epsilon { };
							template <Character auto C> struct ch { };
							
							// operations
							template <typename...> struct concat { };
							template <typename...> struct alt { };
							template <typename> struct star { };
							
							// for convenient usage
							template <typename E> using opt = alternation<E, epsilon>;
							template <typename E> using plus = concat<E, star<E>>;
						

Example: A Regex Type

Converting
A pattern into an AST: Parsing

How to convert the pattern?

  • Use a generic LL(1) parser for converting a pattern into a type.
  • The parser uses a provided PCRE compatible grammar.
  • Output type is the AST.

How does an LL(1) parser work?

  • Starts with a start symbol on the stack.
  • On every step it pops one symbol from the stack
    and checks the current character at the input.
  • Based on the pair of symbol and character it decides to:
    • push a string of symbols to the stack,
    • pop a character from the input,
    • or reject.
  • Repeat until the stack and input are empty then accept.

What does the grammar look like?

f(symbol,char) → (...) ε pop input reject accept
 ()*+?|otherε
→ S( alt0 ) mod seq alt      other char mod seq alt ε
alt0( alt0 ) mod seq alt      other char mod seq alt  
alt ε   | seq0 alt alt  ε
modεε* star + plus ? opt εεε
seq0( alt0 ) mod seq      other char mod seq  
seq( alt0 ) mod seq seq ε   εother char mod seq seq ε
( pop              
)   pop            
*     pop          
+       pop        
?         pop      
|           pop    
other             pop  
Z0               accept

How does an LL1 parser work?

input:
a*b*ε
step: 0
stack:
S
 ()*+?|otherε
→ S( alt0 ) mod seq alt      other char mod seq alt ε
alt0( alt0 ) mod seq alt      other char mod seq alt  
alt ε   | seq0 alt alt  ε
modεε* star + plus ? opt εεε
seq0( alt0 ) mod seq      other char mod seq  
seq( alt0 ) mod seq seq ε   εother char mod seq seq ε
terminal pop pop pop pop pop pop pop  
Z0               accept

How can we represent
the symbols in C++?


							struct S {};
							struct alt0 {};
							struct alt {};
							struct mod {};
							struct seq0 {};
							struct seq {};
					
							using start_symbol = S;
						

How can we represent
an LL(1) table in C++?


							// f (symbol, char) → (...)
							auto f(symbol, term<'c'>) -> list{...};
					
							// f (symbol, symbol) → pop input
							template <auto S> auto f(term<S>, term<S>) -> pop_input;
					
							// f (symbol, char) → reject
							auto f(...) -> reject;
					
							// f (Z0, ε) → accept
							auto f(empty_stack, epsilon) -> accept;
						

How can we pass
the grammar into the parser?


							struct pcre {
								struct S {};
								struct alt0 {};
								struct alt {};
								// ...
						
								using start_symbol = S;
						
								auto f(...) -> reject;
								// ...
							}
						

How is the parser used?


							constexpr bool ok = parser<pcre, "a+b+">::correct;
						

How is the parser implemented?


							template <typename Grammar, ...> struct parser {
								//...
								auto next_move = Grammar::f(top_of_stack, current_term);
								//...
							}
						

How do we implement the input string?


							template <typename CharT, size_t N> struct fixed_string {
								CharT data[N+1];
								// constexpr constructor from const char[N]
								constexpr auto operator[](size_t i) const noexcept { return data[i]; }
								constexpr size_t size() const noexcept { return N; }
								constexpr auto operator<=>(const fixed_string &) = default;
							};
					
							template <typename CharT, size_t N>
							 fixed_string(const CharT[N]) -> fixed_string<CharT, N>;
							// more info about class NTTP in p0732 by Jeff Snyder and Louis Dionne
						 

How do we implement the stack?


							template <typename... Ts> struct list { };
							
							template <typename... Ts, typename... As>
							 constexpr auto push(list<Ts...>, As...) -> list<As..., Ts...>;
							 
							template <typename T, typename... As>
							 constexpr auto pop(list<T, Ts...>) -> list<Ts...>;
							 
							template <typename T, typename... Ts>
							 constexpr auto top(list<T, Ts...>) -> T;
							 
							struct empty { };
							constexpr auto top(list<>) -> empty;
						

How does it fit together?


							constexpr bool ok = parser<pcre, "a*b*">::correct;
						

LL1 constexpr Parser


							template <typename Grammar, fixed_string Str> struct parser {
								
								static constexpr bool correct = parse(list<Grammar::start_symbol>{});
							
								// return current term
								template <size_t Pos> constexpr auto get_character() const {
									if constexpr (Pos < Str.size()) return term<Str[Pos]>{};
									else return epsilon{};
								}
							
								// prepare each step and move to next
								template <size_t Pos = 0, typename S> static constexpr bool parse(S stack) {
							
									auto symbol = top(stack);
									auto current_term = get_character<Pos>();
									
									auto next_move = decltype(Grammar::f(symbol, current_term));
							
									return move<Pos>(next_move, pop(stack));
								}
						
								// pushing something to stack (epsilon case included)
								template <size_t Pos, typename... Push, typename Stack>
								static constexpr bool move(list<Push...>, Stack stack) {
									
									return parse<Pos>(push(stack, Push{}...));
								}
				
								// move to next character
								template <size_t Pos, typename Stack>
								static constexpr bool move(pop_input, Stack stack) {
							
									return parse<Pos+1>(stack);
								}
				
								// reject
								template <size_t Pos, typename Stack>
								static constexpr bool move(reject, Stack) {	
									return false;
								}
				
								// accept
								template <size_t Pos, typename Stack>
								static constexpr bool move(accept, Stack) {
									return true;
								}
					
							};
						

We need more than
just returning a boolean.

A Pattern to the AST:
Let there be a type

How can we build a type from a string?

Where are the semantic actions placed?

 ()*+?|otherε
→ S( alt0 ) mod seq alt      other char mod seq alt ε
alt0( alt0 ) mod seq alt      other char mod seq alt  
alt ε   | seq0 alt alt  ε
modεε* star + plus ? opt εεε
seq0( alt0 ) mod seq      other char mod seq  
seq( alt0 ) mod concat seq ε   εother char mod concat seq ε
( pop              
)   pop            
*     pop          
+       pop        
?         pop      
|           pop    
other             pop  
Z0               accept

What do the Semantic Action
symbols look like?


							struct pcre {
								struct _char: action {};
								struct alpha: action {};
								struct digit: action {};
								struct seq: action {};
								struct star: action {};
								struct plus: action {};
								struct opt: action {};
								// ...
							}
						

What are the changes to the parser?


							template <typename Grammar, fixed_string Str> struct parser {
								
								template <typename Subject>
								static constexpr auto output = parse(list<Grammar::start_symbol>{}, Subject{});
							
								// return current term
								template <size_t Pos> constexpr auto get_character() const {
									if constexpr (Pos < Str.size()) return term<Str[Pos]>{};
									else return epsilon{};
								}
							
								// prepare each step and move to next
								template <size_t Pos = 0, typename S, typename T>
								static constexpr auto parse(S stack, T subject) {
							
									auto symbol = top(stack);
									
									if constexpr (is_semantic_action(symbol)) {
										auto previous_term = get_character<Pos-1>();
										
										auto next_subject = decltype( modify(symbol, previous_term, subject) ){};
										
										return parse<Pos>(pop(stack), next_subject);
									} else {
										auto current_term = get_character<Pos>();
										auto next_move = decltype(Grammar::f(symbol, current_term));
							
										return move<Pos>(next_move, pop(stack), subject);
									}
								}
						
								// pushing something to stack (epsilon case included)
								template <size_t Pos, typename... Push, typename Stack, typename Subject>
								static constexpr bool move(list<Push...>, Stack stack, Subject subject) {
									
									return parse<Pos>(push(stack, Push{}...), subject);
								}
				
								// move to next character
								template <size_t Pos, typename Stack, typename Subject>
								static constexpr bool move(pop_input, Stack stack, Subject subject) {
							
									return parse<Pos+1>(stack, subject);
								}
				
								// reject
								template <size_t Pos, typename Stack, typename Subject>
								static constexpr auto move(reject, Stack, Subject subject) {	
									return std::pair{false, subject};
								}
				
								// accept
								template <size_t Pos, typename Stack>
								static constexpr auto move(accept, Stack, Subject subject) {
									return std::pair{true, subject};
								}
								
								bool correct = output<list<>{}>.first;
								
								using type = decltype( output<list<>{}>.second );
							};
						

What does building from a string look like?

What about the modify function?

Building the AST on a stack


							// pushing a character
							template <Character auto C, typename... Ts>
							 auto modify(pcre::_char, term<C>, list<Ts...>) 
							  -> list<ch<C>, Ts...>;
							  
							// concatenating a sequence (notice the switched order of A & B on the stack)
							template <Character auto C, typename A, typename B, typename... Ts>
							 auto modify(pcre::seq, term<C>, list<B, A, Ts...>)
							  -> list<concat<A, B>, Ts...>;
							  
							// adding to a concatenated sequence
							template <Character auto C, typename... As, typename B, typename... Ts>
							 auto modify(pcre::seq, term<C>, list<B, concat<As...>, Ts...>)
							  -> list<concat<As..., B>, Ts...>;
							  
							// making an alternate (again, switched order of A & B)
							template <Character auto C, typename A, typename B, typename... Ts>
							 auto modify(pcre::alt, term<C>, list<B, A, Ts...>)
							  -> list<alt<A, B>, Ts...>;
							  
							// adding to an alternate
							template <auto C, typename... As, typename B, typename... Ts>
							 auto modify(pcre::alt, term<C>, list<B, alt<As...>, Ts...>)
							  -> list<alt<As..., B>, Ts...>;
							  
							// making something optional
							template <Character auto C, typename Opt, typename... Ts>
							 auto modify(pcre::opt, term<C>, list<Opt, Ts...>)
							  -> list<opt<Opt>, Ts...>;
							  
							// making a plus cycle
							template <Character auto C, typename Plus, typename... Ts>
							 auto modify(pcre::plus, term<C>, list<Plus, Ts...>)
							  -> list<plus<Plus>, Ts...>;
							
							// making a star cycle
							template <Character auto C, typename Star, typename... Ts>
							 auto modify(pcre::star, term<C>, list<Star, Ts...>)
							  -> list<star<Star>, Ts...>;
						

We have the AST!


						static_assert(std::is_same_v<
							,
							decltype( top(parser<pcre,"">::type) )
						>);
						

How do we match
a regular expression
in the form of an AST?

Finite Automaton

What's a finite automaton?

(Q, Σ, δ, q0, F)
  • a finite set of states Q
  • a finite set of input symbols (the alphabet) Σ
  • a transition function δ: Q × Σ →
    • P(Q) (nondeterministic FA)
    • Q (deterministic FA)
  • a start state q0 ∈ Q
  • a set of accept (final) states F ⊆ Q
A finite automaton accepts exactly the same class of languages as a regular expression.

Example: Integral Numbers
[0-9]+

How do you convert
a RegEx into an FA?

∅ (empty set)

There is a direct mapping between
the RE definition and operations over the FAs.

And we need to map it onto C++ code.

How to represent an FA in C++?

(Q, Σ, δ, q0, F)
  • a finite set of states Qset<int> int (implicit)
  • a finite set of input symbols (the alphabet) Σchar32_t
  • a transition function δ: Q × Σ →
    • P(Q) (nondeterministic FA) set<tuple<int, int, char32_t>>
    • Q (deterministic FA)
  • a start state q0 ∈ Q int(0)
  • a set of accept (final) states F ⊆ Q set<int>

constexpr implementation of a FA

"There is no such thing as a zero-cost abstraction."
– Chandler Carruth

constexpr problems

  • every operation counts
    • there are limits (and different in every compiler)
  • move semantics don't matter
  • problematic error handling & debugging
    • use a conditional UB or throw-statement as form of an assert
    • printf styled debugging
      template <typename T> struct identify_type;

The constexpr problem

sloooooooooow
2 minutes (gcc 9.1)
40 seconds (clang 8)
< 5 second (php 7.1)
(NFA determinization)

constexpr fixed_set


							template <size_t Sz, typename T> class fixed_set {
								T data[Sz];
							public:
								constexpr auto begin();
								constexpr auto end();
								constexpr auto begin() const;
								constexpr auto end() const;
			
								constexpr size_t size() const;
								constexpr auto & operator[](size_t);
								constexpr const auto & operator[](size_t) const;
								
								constexpr auto insert(T);
								
								template <typename K> constexpr auto erase(K);
								constexpr auto erase(T *);
								
								constexpr set();
								// ...
							};
						

constexpr finite_automaton


							struct transition {
								int source;
								int target;
								char32_t term;
								
								constexpr transition(int, int, char32_t);
								constexpr bool match(char32_t c) const;
								// comparable with int against source
							};
							
							template <size_t Tr, size_t FSz> struct finite_automaton {
								fixed_set<Tr, transition> transitions;
								fixed_set<FSz, int> final_states;
								
								// normal constructor / copy constructor ...
							};
						

Basic building blocks


								static constexpr auto empty = finite_automaton<0,0>{};
							
								static constexpr auto epsilon = finite_automaton<0,1>{{}, {0}};
							
								template <char32_t C> static constexpr auto one_char = finite_automaton<1,1>{
									{transition(0, 1, C)},
									{1}
								};
						

Basic manipulation


								auto operator>>(const FiniteAutomaton auto & lhs, const FiniteAutomaton auto & rhs);
								
								auto operator|(const FiniteAutomaton auto & lhs, const FiniteAutomaton auto & rhs);
								
								auto star(const FiniteAutomaton auto & lhs);
						
Result of operation is dependent
not just on a input type but also on a input value.

Concatenation algorithm

Concatenation algorithm


							template <finite_automaton Lhs, finite_automaton Rhs> struct concat_two {
								
								constexpr static auto calculate() {
									constexpr size_t tr_count = Lhs.transitions.size() + Rhs.transitions.size();
									constexpr size_t tr_start = Rhs.count_transitions(0);
								
									constexpr int prefix = Lhs.get_max_state() + 1;
								
									finite_automaton<(tr_count + tr_start), Rhs.final_states.count()> out;
								
									copy(Lhs.transitions.begin(), Lhs.transitions.end(), out.transitions.begin());
							
									for (int f: Lhs.final_states) {
										Rhs.foreach_transition_from(0, [&](transition t){
											t.source = f;
											t.target += prefix;
											out.add_transition(t);
										});
									}
								
									for (transition t: Rhs.transitions) {
										t.source += prefix;
										t.target += prefix;
										out.add_transition(t);
									}
								
									for (int f: Rhs.final_states) out.mark_final(f + prefix);
									
									return out;
								}
								
								static constexpr auto value = calculate();
							};
						

Other algorithms

  • alternation
  • repeating (Kleene star)

Both other algorithms work the same,
pre-calculate the size and then populate the content.

Template style manipulation


								template <finite_automaton... Fas>
								static constexpr auto concat = multi_helper<concat_two, Fas...>::value;
								
								template <finite_automaton... Fas>
								static constexpr auto alternation = multi_helper<alternation_two, Fas...>::value;
								
								template <finite_automaton... Fas>
								static constexpr auto star = star_one<concat<Fas...>>::value;
						

Recursive application of a binary function


								template <
									template <finite_automaton, finite_automaton> Op,
									finite_automaton... Arg
								>
								struct multi_helper;
								
								
								
								// specialization for 2+
								template <
									template <finite_automaton, finite_automaton> Op,
									finite_automaton A,
									finite_automaton B,
									finite_automaton... Tail
								>
								struct multi_helper<Op, A, B, Tail...> {
									static constexpr auto value = multi_helper<Op, Op<A, B>::value, Tail...>::value;
								};
								
								
								
								// specialization for 1
								template <
									template <finite_automaton, finite_automaton> Op,
									finite_automaton One
								>
								struct multi_helper<Op, One> {
									static constexpr auto value = One;
								};
							

We have a class-value based
template meta-programming!

And if you use const auto & it's C++17 compatible!

constexpr auto fa = concat<one_char<'a'>, one_char<'b'>>;

  • I called it a "compile time" function with compile-time arguments.
  • Unlike a constexpr function, it can take constexpr variables as arguments
    and maintain their "constexpr-bility".
  • It makes sure the value is cached.
  • It looks nice :)

Minimization

Remove unnecessary states and links by merging the same states.

Minimization algorithm

Determinization

Remove nondeterministic transitions, can generate 2n new states.

Determinization algorithm

Determinization

Remove nondeterministic transitions, can generate 2n new states.

Constexpr implementation must be iterative,
as the whole output can't be calculated in one step.

It's useful to minimize the FA after the determinization.

We have a deterministic finite automaton

Matching
a Regular Expression

"hello" =~ /aloha|[a-z]+/

Matching and searching


							ctre::match<"aloha|[a-z]+">("aloha"sv);
					
							// search(x) is actually match(.*x.*)
							ctre::search<"aloha|[a-z]+">("...aloha..."sv);
						

Deterministic match wrapper


							template <fixed_string re> bool fast_match(const Range auto & rng) {
								static_assert(parser<pcre, re>::correct);
								using RE = parser<pcre, re>::type;
								
								constexpr auto dfa = fa::minimize<fa::determinize<nfa_from<RE>>>;
							
								return match_state<dfa>(rng.begin(), rng.end());
							}
						

Converting a RE's AST into an FA


							namespace fa {
								static constexpr auto empty = finite_automaton<0, 0>{};
							}
							
							template <typename RE>
							static constexpr auto nfa_from = convert(fa::empty, list<RE>{});
						

RE is a type, finite_automaton is a value.

Converting the empty RE


							auto convert(const FinAutomaton auto seed, list<>) {
								// for an empty RE seed is empty FA
								
								return seed;
								
							}
						

Converting epsilon (empty string)


							namespace fa {
								// epsilon is FA with a start state final
								static constexpr auto epsilon = finite_automaton<0, 1>{{}, {0}};
							}
							
							template <typename... Ts>
							auto convert(const FinAutomaton auto lhs, list<epsilon, Ts...>) {
								
								return convert(fa::concat<lhs, fa::epsilon>, list<Ts...>{});
								
							}
						

Converting a character


							namespace fa {
								// epsilon is FA with a start state final
								template <char32_t Term> static constexpr auto 
								character = finite_automaton<0, 1>{{transition{0, 1, Term}}, {1}};
							}
							
							template <char32_t Term, typename... Ts>
							auto convert(const FinAutomaton auto lhs, list<character<Term>, Ts...>) {
								
								return convert(fa::concat<lhs, fa::character<Term>>, list<Ts...>{});
								
							}
						

Converting a concat sequence


							template <typename... Content, typename... Ts>
							auto convert(const FinAutomaton auto lhs, list<concat<Content...>, Ts...>) {
								
								return convert(lhs, list<Content..., Ts...>{});
								
							}
						

Converting an alternation


							template <typename... Options, typename... Ts>
							auto convert(const FinAutomaton auto lhs, list<alt<Options...>, Ts...>) {
								
								constexpr auto inner = fa::alternative<convert(fa::empty, Options)...>;
								
								return convert(fa::concat<lhs, inner>, list<Ts...>{});
								
							}
						

Converting a Kleene star (cycle)


							template <typename Content, typename Ts...>
							auto convert(const FinAutomaton auto seed, list<star<Content>, Ts...>) {
								
								return convert(fa::concat<lhs, fa::star<Content>>, list<Ts...>{});
								
							}
						

We have a finite automaton,
now we need to do the matching.

Deterministic match wrapper


							template <fixed_string re> bool fast_match(const Range auto & rng) {
								static_assert(parser<pcre, re>::correct);
								using RE = parser<pcre, re>::type;
								
								constexpr auto dfa = fa::minimize<fa::determinize<nfa_from<RE>>>;
							
								return match_state<dfa>(rng.begin(), rng.end());
							}
						

Matching a state


							template <finite_automaton DFA, int State = 0, typename Iterator>
							constexpr bool match_state(Iterator it, Iterator end) noexcept {
								
								// we can end correctly only if this state is final
								if constexpr (DFA.is_final(State)) {
									if (end == it) return true;
								} else {
									if (end == it) return false;
								}
	
								// transitions are sorted by source state
								constexpr auto index = DFA.first_transition_index_of(State);

								// tail-recursion
								return choose_transition<DFA, index, State>(it, end);
							}
						

Matching a transition


							template <finite_automaton DFA, int Index, int State, typename Iterator>
							constexpr bool choose_transition(Iterator it, const Iterator end) noexcept {
								// check if we are still in same state
								if constexpr (Index >= DFA.transitions.size()
								  || DFA.transitions[Index].source != State) {
									
									return false;
								}
								
								if (DFA.transitions[Index].match(*it)) {
									// this transition is the one, go to next state (tail recursion)
									return match_state<DFA, DFA.transitions[Index].target>(std::next(it), end);
								}
								
								// try next one (also tail recursion)
								return choose_transition<DFA, Index+1, State>(std::next(it), end);
							}
						

The deterministic matching

  • O(n * t) time complexity
  • O(1) dynamic memory complexity

n = length of the input, t = average number of transitions per state

  • doesn't support capture tracking
  • theoretically the quickest approach
  • allows streamed data

Backtracking match wrapper


							template <fixed_string re> bool match(const Range auto & rng) {
								static_assert(parser<pcre, re>::correct);
								using RE = parser<pcre, re>::type;
							
								auto out = evaluate(rng.begin(), rng.end(), list<RE>{});
								
								return out.success && out.it == rng.end();
							}
						

Return type of evaluate(…)


							template <ForwardIterator It> struct result {
								bool success;
								ForwardIterator it;
							};
						

Matching epsilon (empty string)


							template <ForwardIterator It, typename... Ts>
							result<It> evaluate(It it, It end, list<epsilon, Ts...>) {
								// tail-recursion => no backtracking
								return evaluate(it, end, list<Ts...>{});
							}
						

Matching a character


							template <ForwardIterator It, auto C, typename... Ts>
							result<It> evaluate(It it, It end, list<ch<C>, Ts...>) {
								
								if (it == end) return {it, false};
								if (*it != C) return {it, false};
								
								// tail-recursion => no backtracking
								return evaluate(std::next(it), end, list<Ts...>{});
							}
						

Matching a concatenated sequence


							template <ForwardIterator It, typename... Seq, typename... Ts>
							result<It> evaluate(It it, It end, list<concat<Seq...>, Ts...>) {
								
								// tail-recursion => no backtracking
								return evaluate(it, end, list<Seq..., Ts...>{});
							
							}
						

Matching an alternate


							template <ForwardIterator It, typename Head, typename... Tail, typename... Ts>
							result<It> evaluate(It it, It end, list<alt<Head, Tail...>, Ts...>) {
								
								// recursion => backtracking point
								if (auto out = evaluate(it, end, list<Head, Ts...>{})) {
									return out;
								} else {
									// tail-recursion => no backtracking
									return evaluate(it, end, list<alt<Tail...>, Ts...>{}));
								}
							}
					
							template <ForwardIterator It, typename... Ts>
							 result<It> evaluate(It it, It end, list<alt<>, Ts...>) {
								// no option was successful
								return {it, false};
							}
						

Matching a Kleene star (cycle)


							template <ForwardIterator It, typename Star, typename... Ts>
							result<It> match(It it, It end, list<star<Star>, Ts...>) {
								
								for (;;) {
									// recursion => backtracking point
									if (auto out = match(it, end, list<Ts...>{})) {
										return out;
									} 
									
									// recursion => backtracking point
									if (auto inner = match(it, end, list<Star..., end_of_cycle>{})) {
										it = inner.it;
									} else {
										// inner cycle failed and outer too => whole cycle fail
										return {it, false};
									}
								}
							}
						

The backtracking matching

  • O(n * 2b) time complexity
  • O(b) dynamic memory complexity

b = number of backtracking points, n = length of the input

  • easiest to implement
  • supports capture tracking
  • slow worst case scenario
  • doesn't allow streamed data

NFA without backtracking

  • supports capture tracking
  • slower to match, but much quicker to build (than DFA)
  • allows streamed data

But I haven't implemented it yet (some future talk maybe)

Comparison

Let's read some assembly :)

Benchmarks

Measured code (grep-like utility)


							int main(int argc, char ** argv) {
								auto re = PREPARE("PATTERN");
								auto lines = load_file_by_lines(argv[1]);
								
								size_t count = 0;
								
								// some warm-up
								
								auto start = steady_clock::now();
								for (string_view line: lines) {
									if (re.SEARCH(line)) count++;
								}
								auto end = steady_clock::now();
								
								cout << count << "\n";
								
								cerr << duration_cast<milliseconds>(end - start).count();
							}
						

Measurement methodology

  • CSV file (1.3GiB) with 6.5 MLoC
  • Median of 3 measurements
  • MacBook Pro 13" 2016 i7 3.3Ghz
  • GCC 9.1 & clang 8 (-std=2a -O3)

Runtime Searching (Clang 8): [a-z0-9]+abc[0-9]

(higher is better)

Runtime Searching (GCC 9): [a-z0-9]+abc[0-9]

(higher is better)

Compilation (Clang 8): [a-z0-9]+?abc[0-9]

(lower is better)

Compilation (GCC 9): [a-z0-9]+?abc[0-9]

(lower is better)

Runtime Searching (Clang 8): ABCDE-[0-9]+

(higher is better)

Runtime Searching (GCC 9): ABCDE-[0-9]+

(higher is better)

Compilation (Clang 8): ABCDE-[0-9]+

(lower is better)

Compilation (GCC 9): ABCDE-[0-9]+

(lower is better)

Six things I showed you…

  • How regular expression engines work.
  • Connection between theory and code.
  • Parsing of a pattern during compilation.
  • Building of a finite automaton.
  • Matching a regex with various algorithms.
  • Comparison between different regex implementations.

And four things you should remember…

  • Compilers can optimize your code, if you help them.
  • Combining TMP and constexpr code can
    help you express more things more precisely.
  • Stop pretending that constexpr C++ is a compiled language.
  • Try the CTRE library, not just for speed but for code clarity.

Thank you!

You can find slides & code at compile-time.re