Implementing text pattern matching languages in LLVM

Published:

We use pattern matching languages all day long. From shell filename matching rules (fnmatch) in our shells and shell utilities like find and locate to regular expression matching in programming languages, configuration files and shell utilities like grep and sed. These have typically been implemented by parsing the pattern into data structures and walking those data structures as input is processed. Obviously, these work fairly well - we couldn't live without them, but can we do better? I'm thinking particularly about patterns that are matched against many times like the input to GNU locate (matched against every path on your system) or the routing table of your web application (matched for every request)?

Pattern matching languages are programming languages. If we were looking to speed up long-running conventional programs an obvious approach would be to use a virtual machine with a JIT to end up with efficient native code at the cost of slower startup.

I've been experimenting with this and the LLVM compiler infrastructure project. LLVM is a set of libraries and UNIX utilities that provide an assembler, optimizer, interpreter, and native compiler for a high level SSA intermediate form. So far I've implemented a simplified subset of the POSIX fnmatch function's functionality as a proof of concept. It's pretty hacky, but it's up on github.

The performance isn't great, it's 20% slower than GLIBC in my trivial testcase, but so far all I've got is a really naiive implementation. I'm going to implement an NFA / DFA based algorithm which should be more efficient, and easier extend to full regular expressions. Hopefully it'll look more useful then.