An fnmatch implementation using finite state machines and LLVM
For my amusement (and, I guess, education) I decided to implement a regular expression language on top of LLVM using a Ken Thompson style finite state machine algorithm. Instead of implementing classic POSIX regular expressions I chose something closer to POSIX fnmatch expressions, for two reasons. First, the fnmatch language is simpler to parse than regular expressions. Second, regexes as they are commonly understood and used are not true regular expressions: features such as backreferences can't be expressed as finite state machines.
My last experiment with LLVM and fnmatch was built in C++ but this time I chose Python. I'd been prototyping in Python and after I found the llvm-py module I couldn't bring myself to port it all to C++. I spent several days trying to work out the right incantations of STL to represent the Python structures I'd chosen correctly and efficiently in C++ before just embracing Python as the implementation language.
nfa.py
Following Thompson's technique (as explained by Russ Cox), I first converted the fnmatch rule (based on the SUSv3 documentation) to non-deterministic finite automaton (NFA) form. Building the NFA from the fnmatch pattern string is straightforward. Except for bracket expressions, each character in the string becomes one node in the NFA. Each bracket expression also becomes a single node.
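As a rough illustration of that one-node-per-pattern-element idea (this is not the code from nfa.py), here is a minimal sketch. The names `Node`, `expand`, `parse`, `closure` and `matches` are hypothetical, and the sketch makes simplifying assumptions: no backslash escapes, no special handling of leading periods or slashes, and only the simplest bracket forms (`[abc]`, `[a-c]`, `[!abc]`).

```python
from collections import namedtuple

# Hypothetical node type: kind is "star", "any" or "set";
# chars and negate only matter for "set" nodes.
Node = namedtuple("Node", "kind chars negate")


def expand(body):
    """Expand the inside of a bracket expression, e.g. 'a-cx' -> {a, b, c, x}."""
    chars, k = set(), 0
    while k < len(body):
        if k + 2 < len(body) and body[k + 1] == "-":
            chars.update(chr(x) for x in range(ord(body[k]), ord(body[k + 2]) + 1))
            k += 3
        else:
            chars.add(body[k])
            k += 1
    return frozenset(chars)


def parse(pattern):
    """One node per pattern character, except a bracket expression is one node."""
    nodes, i = [], 0
    while i < len(pattern):
        c = pattern[i]
        # Position of the closing bracket, or -1 (then '[' is taken literally).
        end = pattern.find("]", i + 2) if c == "[" else -1
        if c == "*":
            nodes.append(Node("star", frozenset(), False))
        elif c == "?":
            nodes.append(Node("any", frozenset(), False))
        elif end != -1:
            negate = pattern[i + 1] == "!"
            body = pattern[i + 2:end] if negate else pattern[i + 1:end]
            nodes.append(Node("set", expand(body), negate))
            i = end
        else:
            nodes.append(Node("set", frozenset(c), False))
        i += 1
    return nodes


def closure(nodes, positions):
    """A '*' node may match nothing, so being at it also means being past it."""
    out, stack = set(), list(positions)
    while stack:
        p = stack.pop()
        if p not in out:
            out.add(p)
            if p < len(nodes) and nodes[p].kind == "star":
                stack.append(p + 1)
    return out


def matches(pattern, text):
    """Simulate the NFA by tracking every position that could be active."""
    nodes = parse(pattern)
    current = closure(nodes, {0})
    for ch in text:
        nxt = set()
        for p in current:
            if p == len(nodes):
                continue  # already past the last node; extra input cannot match
            node = nodes[p]
            if node.kind == "star":
                nxt.add(p)  # '*' consumes the character and stays put
            elif node.kind == "any" or (ch in node.chars) != node.negate:
                nxt.add(p + 1)
        current = closure(nodes, nxt)
    return len(nodes) in current
```

With this sketch, `matches("*.py", "nfa.py")` is true and `matches("[!a-c]at", "bat")` is false. The set of active positions carried through the loop is exactly the kind of thing a DFA state stands for.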
dfa.py
Transforming the NFA to deterministic finite automaton (DFA) form is a little trickier, but Russ Cox's explanation of the technique made it pretty straightforward. Each DFA node corresponds to a set of one or more NFA nodes, so that for a given input string exactly one DFA node is reached.
Russ Cox's documentation doesn't cover character classes (i.e. wildcards and bracket expressions), and I found that it was a bit tricky to represent and track these when converting from NFA to DFA form. Eventually I came up with a CharacterSet class that represents a set of matching characters, either by inclusion (tracking which characters are in the set) or exclusion (tracking which characters are not in the set). This was handy for representing bracket expressions but invaluable for storing the transitions between DFA states. The CharacterSet class stores a set of characters and a boolean to remember whether it's tracking inclusion or exclusion. I built the set operations I needed on top of that, including containment, equality, union, difference and intersection.
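To make the inclusion/exclusion idea concrete, here is a minimal sketch of such a class. It is an illustration under my own assumptions, not the class from dfa.py: the method names are hypothetical, and the alphabet is treated as unbounded, so an exclusion set is never considered equal to an inclusion set.

```python
class CharacterSet:
    """A set of characters stored either by inclusion or by exclusion."""

    def __init__(self, chars="", inclusive=True):
        self.chars = frozenset(chars)
        self.inclusive = inclusive  # False means "everything except chars"

    def __contains__(self, ch):
        return (ch in self.chars) == self.inclusive

    def __eq__(self, other):
        if not isinstance(other, CharacterSet):
            return NotImplemented
        return (self.chars, self.inclusive) == (other.chars, other.inclusive)

    def __hash__(self):
        return hash((self.chars, self.inclusive))

    def __repr__(self):
        return "CharacterSet(%r, inclusive=%r)" % (sorted(self.chars), self.inclusive)

    def complement(self):
        return CharacterSet(self.chars, not self.inclusive)

    def intersection(self, other):
        if self.inclusive and other.inclusive:
            return CharacterSet(self.chars & other.chars)
        if self.inclusive:
            # Listed here, and not excluded by the other set.
            return CharacterSet(self.chars - other.chars)
        if other.inclusive:
            return CharacterSet(other.chars - self.chars)
        # Both exclude: the result excludes everything either one excludes.
        return CharacterSet(self.chars | other.chars, inclusive=False)

    def union(self, other):
        # De Morgan: A | B == not (not A & not B)
        return self.complement().intersection(other.complement()).complement()

    def difference(self, other):
        return self.intersection(other.complement())

    def is_empty(self):
        return self.inclusive and not self.chars
```

For example, `CharacterSet("aeiou").intersection(CharacterSet("a", inclusive=False))` gives `CharacterSet("eiou")`. The wildcard `?` is simply `CharacterSet("", inclusive=False)`, and `[!abc]` is `CharacterSet("abc", inclusive=False)`. Deriving union and difference from intersection and complement keeps the case analysis in one place.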
The part of the NFA to DFA transformation that I found trickiest was determining the set of DFA nodes that would be reached from each of the NFA nodes associated with a DFA node. For each NFA node there is a set of descendants that a particular character set maps to. In the NFA the character sets can overlap, but in a DFA we must make sure all of the character sets are disjoint, so that the state machine is deterministic. Additionally, since each DFA node is associated with multiple NFA nodes, we need to work out which set of NFA descendants can be reached by the same set of characters and should be treated as a single DFA node.
I wrote a function distinctCharacterSets that, for a set of CharacterSets, returns a set of disjoint CharacterSets such that each input CharacterSet can be expressed as the union of one or more output CharacterSets, the union of the input CharacterSets is equal to the union of the output CharacterSets, and there are no empty CharacterSets. On top of that I built a function to turn the list of descendants from all of the NFA nodes associated with a DFA node into DFA descendants.
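One way to get sets with those properties is to refine a partition one input set at a time. The sketch below is my own illustration rather than the code from dfa.py, and it makes a loud simplifying assumption: it uses plain Python `frozenset`s, so exclusion-style sets such as `[!a]` are not covered. The same refinement works with any set type that offers intersection, difference and an emptiness test. The names `distinct_character_sets` and `dfa_transitions` are hypothetical.

```python
def distinct_character_sets(sets):
    """Split possibly overlapping sets into disjoint, non-empty pieces."""
    pieces = []
    for new in sets:
        remaining = frozenset(new)
        refined = []
        for piece in pieces:
            overlap = piece & remaining
            if overlap:
                refined.append(overlap)        # part shared with the new set
                if piece - overlap:
                    refined.append(piece - overlap)  # part the new set lacks
                remaining -= overlap
            else:
                refined.append(piece)
        if remaining:
            refined.append(remaining)          # characters seen for the first time
        pieces = refined
    return set(pieces)


def dfa_transitions(nfa_edges):
    """Group NFA targets by disjoint piece.

    nfa_edges is a list of (character_set, nfa_target) pairs gathered from
    every NFA node behind one DFA node. Each resulting frozenset of targets
    identifies one DFA descendant.
    """
    pieces = distinct_character_sets(label for label, _ in nfa_edges)
    return {
        piece: frozenset(t for label, t in nfa_edges if piece <= frozenset(label))
        for piece in pieces
    }
```

Given edges labelled `{a, b, c}` (to NFA node 1) and `{b, c, d}` (to NFA node 2), the pieces are `{a}`, `{b, c}` and `{d}`, and the piece `{b, c}` leads to the DFA node standing for NFA nodes 1 and 2 together.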
I didn't see this approach discussed in simple descriptions of the Thompson regular expression method, and the implementations I tried reading were optimized beyond clarity. Unless I'm missing an obvious alternative, though, I'm sure similar structures and algorithms are used in finite automaton based regular expression engines.
compiler.py
While both the DFA and NFA can be interpreted, the whole point of the exercise for me was to compile the expression to native code. I found the llvm-py module, a set of Python bindings for LLVM. It's fairly good but incomplete. I've made some patches and have them up on Launchpad.net.
I generate an LLVM function for each DFA, with a basic block for each state. At each state an LLVM switch operation jumps to the next state, to a block that returns true if the input string ends on a terminal state, or to a block that returns false if there is no next state to match the input character. Instead of calling llvm-py's bindings for the LLVM optimizer I chose to call out to the command-line tool. It's easier and it seems fast enough. The generated code can be JITed or compiled statically to native code.
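The control flow of such a function can be mimicked in plain Python, which may help in picturing the generated code. This is only a sketch of the shape, not the compiler's output and not llvm-py code. The table layout, the `run_dfa` name and the hand-built DFA for the pattern `a*b` are all assumptions made for illustration. Each state has explicit cases plus a default, playing the role of the switch and its default branch, and a default of `None` stands in for the block that returns false.

```python
def run_dfa(states, accepting, text, start=0):
    """Walk the DFA: one 'switch' per input character."""
    state = start
    for ch in text:
        cases, default = states[state]
        state = cases.get(ch, default)  # explicit case, else the default branch
        if state is None:
            return False                # no next state for this character
    return state in accepting           # true only if we end on a terminal state


# Hand-built DFA for the pattern "a*b" (hypothetical, for illustration only).
A_STAR_B = {
    0: ({"a": 1}, None),  # must start with 'a'
    1: ({"b": 2}, 1),     # inside '*': 'b' might be the final character
    2: ({"b": 2}, 1),     # just saw 'b': accepting if the input ends here
}
```

Here `run_dfa(A_STAR_B, {2}, "axxb")` is true, while `run_dfa(A_STAR_B, {2}, "aba")` is false because the walk ends in state 1.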
The generated code looks decent. There's definitely room for improvement: for example, a * at the end of a pattern shouldn't require us to walk the whole string. The LLVM optimizer doesn't have much chance of catching cases like that, but they should be easy to special-case in the compiler.
test.py
The implementation works, often. After building a tool to compare the results of Python's built-in fnmatch, the NFA, DFA and LLVM implementations I found that after several compiles there are often problems. The problems manifest as an incorrect result, a segmentation fault or a Python error. I'm not sure if these are manifestations of the same problem and I'm not sure if the problem is a bug in LLVM, llvm-py or a mistake in my use of llvm-py.
So, the theory works nicely, but the implementation leaves something to be desired. Hopefully the failures are easy to track down and easy to overcome. There is also some very weird performance behaviour, but I'll discuss that in a later post. If you want to take a look, the source is on GitHub.


