Software and Opinions from Ian McKellar

15Jul/090

An fnmatch implementation using finite state machines and LLVM

For my amusement (and I guess education) I decided to implement a regular expression language on top of LLVM using a Ken Thompson style finite state machine algorithm. Instead of implementing classic POSIX regular expressions I chose to implement something closer to POSIX fnmatch expressions for a couple of reasons. The fnmatch language is simpler to parse than regular expressions and regexes as they are commonly understood and used are not true regular expressions and can't be expressed as finite state machines.

My last experiment with LLVM and fnmatch was built in C++ but this time I chose Python. I'd been prototyping in Python and after I found the llvm-py module I couldn't bring myself to port it all to C++. I spent several days trying to work out the right incantations of STL to represent the Python structures I'd chosen correctly and efficiently in C++ before just embracing Python as the implementation language.

nfa.py
Following Thompson's technique (as explained by Russ Cox) I first converted the fnmatch rule (based on SUSv3 documentation) to non-deterministic finite automaton (NFA) form. Building the NFA from the fnmatch pattern string is straight-forward. Except for bracket expressions each character in the string becomes one node in the NFA. Each bracket expression becomes one node.

nfa

NFA for *.txt

dfa.py
Transforming the NFA to deterministic finite automaton (DFA) form is a little trickier, but Russ Cox's explanation of the technique made it pretty straight-forward. Each DFA node maps to one or more NFA nodes so that for a given input string there is only one DFA node that would be reached.

Russ Cox's documentation doesn't cover character classes (ie: wildcards and bracket expressions) and I found that it was a bit tricky to represent and track these when converting from NFA to DFA form. Eventually I came up with a CharacterSet class that represents a set of matching characters, either by inclusion (tracking which characters are in the set) or exclusion (tracking which characters are not in the set). This was handy for representing bracket expressions but invaluable for storing the transitions between DFA states. The CharacterSet class stores a set of characters and a boolean to remember if it's tracking inclusion or exclusion. I built the set operations I needed on top of that including containment, equality, union, difference and intersection.

The part of NFA to DFA transformation that I found trickiest was determining the set of DFA nodes that would be reached from each of the NFA nodes associated with a DFA node. For each NFA node there are a set of descendants that a particular character set maps to. In the NFA the character sets can overlap, but in a DFA we must make sure all of the character sets are disjoint - so that the state machine is deterministic. Additionally, since each DFA node is associated with multiple NFA nodes we need to work out which set of NFA descendants can be reached by the same set of characters and should be treated as a single DFA node.

I wrote a function distinctCharacterSets that for a set of CharacterSets return a set of disjoint CharacterSets where each input CharacterSet can be expressed as the union of one or more output CharacterSets, the union of the input CharacterSets is equal to the union of the output CharacterSets and there are no empty CharacterSets. On top of that I built a function to turn the list of descendants from all of the NFA nodes associated with a DFA node into DFA descendants.

I didn't see this approach discussed in simple descriptions of the Thompson regular expression method and the implementations I tried reading were optimized beyond clarity but unless I'm missing an obvious alternative I'm sure similar structures and algorithms are used in finite automaton based regular expression engines.

DFA for *.txt

DFA for *.txt

compiler.py
While both the DFA and NFA can be interpreted the whole point of the exercise for me was to compile the expression to native code. I found the llvm-py module, a set of Python bindings for LLVM. It's fairly good but incomplete. I've made some patches and have them up on Launchpad.net.

I generate an LLVM function for each DFA with basic blocks for each state. At each state an LLVM switch operation jumps to the next state, to a block that returns true if the input string ends on a terminal state or to a block that returns false if there is no next state to match the input character. Instead of calling llvm-py's bindings for the LLVM optimizer I chose to call out to the command-line tool. It's easier and it seems fast enough. The generated code can be JITed or compiled statically to native code.

The generated code looks decent. There's definitely room for improvement, for example a * at the end of a pattern shouldn't require us to walk the whole string. The LLVM optimizer doesn't have much chance of catching code like that but it should be easy to catch things like that.

test.py
The implementation works, often. After building a tool to compare the results of Python's built-in fnmatch, the NFA, DFA and LLVM implementations I found that after several compiles there are often problems. The problems manifest as an incorrect result, a segmentation fault or a Python error. I'm not sure if these are manifestations of the same problem and I'm not sure if the problem is a bug in LLVM, llvm-py or a mistake in my use of llvm-py.

So, the theory works nicely, but the implementation leaves something to be desired. Hopefully the failures are easy to track down and easy to overcome. There is also very weird performance, but I'll discuss that in a later post. If you want to take a look at it the source is on github.

24Jun/091

Python generator fun

I know Python's iterators and generators aren't that new anymore, but at heart I'm still a Python 1.5 programmer. I've come to iterators and generators in Python from my experience in JavaScript.

This morning I wanted to generate a list of names (for nodes in my NFA/DFA pattern matching code) that looked like: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, AA, AB, AC, AD, AE, AF, AG, AH, AI, AJ, AK, AL, AM, AN, AO, AP, AQ, AR, AS, AT, AU, AV, AW, AX, AY, AZ, BA, BB, BC, BD, etc... A couple of minutes of fiddling I came up with:

def name_generator():
  from string import uppercase
  from itertools import chain
  for n in chain([''], name_generator()):
    for c in uppercase:
      yield n+c

It think the code is pretty simple, pretty readable, and pretty efficient. It's not rocket science, but I feel that iterators and generators have paid off well for Python.

Tagged as: 1 Comment
5Jun/091

Implementing text pattern matching languages in LLVM

We use pattern matching languages all day long. From shell filename matching rules (fnmatch) in our shells and shell utilities like find and locate to regular expression matching in programming languages, configuration files and shell utilities like grep and sed. These have typically been implemented by parsing the pattern into data structures and walking those data structures as input is processed. Obviously, these work fairly well - we couldn't live without them, but can we do better? I'm thinking particularly about patterns that are matched against many times like the input to GNU locate (matched against every path on your system) or the routing table of your web application (matched for every request)?

Pattern matching languages are programming languages. If we were looking to speed up long-running conventional programs an obvious approach would be to use a virtual machine with a JIT to end up with efficient native code at the cost of slower startup.

I've been experimenting with this and the LLVM compiler infrastructure project. LLVM is a set of libraries and UNIX utilities that provide an assembler, optimizer, interpreter, and native compiler for a high level SSA intermediate form. So far I've implemented a simplified subset of the POSIX fnmatch function's functionality as a proof of concept. It's pretty hacky, but it's up on github.

The performance isn't great, it's 20% slower than GLIBC in my trivial testcase, but so far all I've got is a really naiive implementation. I'm going to implement an NFA / DFA based algorithm which should be more efficient, and easier extend to full regular expressions. Hopefully it'll look more useful then.

Tagged as: , , 1 Comment
23Apr/096

Understanding the OAuth vulnerability

Last night's OAuth Security Advisory 2009.1 was a little light on the details. The blog post wasn't much better. I was peripherally involved in the OAuth spec development and I couldn't work out what the advisory meant without a bunch of thinking and spec reading so I thought I'd try to explain it in simpler terms here.

For my example I'll use the real service Twitter and a theoretical service Twitten that lets users post to to Twitter in LOL-speak and authenticates via OAuth. Alice and Bob will be my attacker and victim.

Alice's normal authentication process goes like this:

  1. Alice loads twitten.com/login
  2. Twitten creates a regular HTTP session for Alice
  3. Twitten asks Twitter for an unauthorized token
  4. Twitten redirects Alice to an URL on the Twitter servers that will allow her to authorize the token
  5. Alice clicks OK to authorize the token
  6. Twitter redirects Alice back to Twitten
  7. Twitten exchanges its unauthorized token for an access token (associated with Alice's account) with Twitter and stores it in Alice's session
  8. Alice makes inane posts on Twitter via Twitten

The vulnerability here is in step 4. If instead of going to the authorization URL Alice convinces Bob to go there and authorize Twitten she can gain access to his account. Like this:

  1. Alice loads twitten.com/login
  2. Twitten creates a regular HTTP session for Alice
  3. Twitten asks Twitter for an unauthorized token
  4. Twitten tells Alice what URL to go to to authorize the token, but she doesn't go there
  5. Alice tells Bob, "if you love Twitter and kittens try out Twitten - go to http://twitter.com/oauth/authorize/...." (the authorization URL from Twitten)
  6. Bob loads the authorization URL with his Twitter credentials and authorizes the token
  7. Twitten requests Twitter to exchange the unauthorized token for an access token (associated with Bob’s Twitter account) and stores it in Alice’s session
  8. Alice goes to twitten.com and posts "OMG PWNd" to Bob's twitter account

I'm not really sure how to address this issue. It's fundamentally hard to establish trust between three parties over insecure communications. Hopefully more experienced people than me will come up with clever answers.

Update: changed wording to match Eran's suggestion, his blog post on the subject is excellent reading.

Tagged as: , 6 Comments
11Mar/092

Restarting an AIR application

For reasons too complicated and secret to go into here I'm writing an Adobe AIR application that needs to restart itself occasionally. I didn't find any clear documents describing how to do this but after some reverse engineering and experimentation, here's what I came up with.

The air.swf movie that's used by web pages to install and launch AIR applications calls some internal, undocumented APIs to do this, and so can you!

namespace {
  import adobe.utils.ProductManager;
  import mx.core.Application;
  public class Restart {
    public static function restart() : void {
      // request that a new instance of the application be launched
      new ProductManager('airappinstaller').launch('-launch ' +
        Application.application.nativeApplication.applicationID + ' ' +
        Application.application.nativeApplication.publisherID);
      // exit the current instance
      Application.application.nativeApplication.exit(0);
    }
  }
}

I don't know if or when this will break but it's pretty straight-forward, once you work it out. It feels to me like there might be some kind of race condition, but it seems to work alright. Oh also, I've only tried this on Mac.

4Mar/098

Mozilla and WebKit, browser platform wars.

This post began as a comment on Matthew Gertner's blog post The Browser Platform Wars. It's a rant not an article, don't take it personally.

In my experience (8 years building Mozilla based products and playing with WebKit since it was first released as WebCore in 2003) there are a few clear technical and social differences that can make WebKit a more attractive platform for developers than Mozilla. There are plenty of reasons that Firefox is a better product than Safari (I definitely prefer Firefox over Safari on my Mac), but that's a different story.

The scale and complexity of the Mozilla codebase is daunting. Mozilla advocates will say that that's because Mozilla provides more functionality, but the reality is that even if you don't want all that functionality you still have to dig through and around it to get your work done. Much of the Mozilla platform is poorly documented, poorly understood and incomplete (the C++/JS binding security stuff was the most recent example I've looked at) while WebKit is smaller, simpler and newer. They use common c++ idioms instead of proprietary systems like XPCOM.

The scale of the Mozilla organization is also daunting. Mozilla's web presence is vast and is filled with inaccurate, outdated content. Their goals are vague and mostly irrelevant to developers. By contrast WebKit's web site is simple and straight-forward. Its audience is developers, it sets out goals that matter and make sense to developers, it explains clearly the process for participating and contributing in the project.

WebKit is designed for embedding. Within Apple there are several customers for the WebKit library - Desktop Safari, iPhone Safari, Dashboard, AppKit and more. Since WebKit already serves a variety of purposes it's likely to work for other applications which third party developers will want to build. By comparison the Mozilla platform really only has one first-class customer - Firefox.

The WebKit community has welcomed non-employee contributors. They've even welcomed contributors who work for Apple's competitors. There are WebKit reviewers from Google, Nokia and the open source community. By comparison, Songbird and Flock don't have any Mozilla committers or reviewers who weren't previously Mozilla Corporation employees even though they are two of the largest non-MoCo platform customers.

Perhaps I'm short-sighted, but I don't see a clear path forward for Mozilla in competing with WebKit as a platform for web content display. The long history of Mozilla have left them with a large, complicated codebase that's not getting smaller. The rapid growth and defensive attitude of the organization (probably brought on by the Netscape / IE wars) has left it without a culture that welcomes friendly competition. I think that Mozilla's focus on the product above the platform is the right decision for them. I'm just glad we have an alternative web content platform.

27Feb/090

Three Months of ActionScript

I've been working largely in ActionScript 3 for the past three months. After spending four years working primarily in JavaScript I didn't expect to encounter too many problems with her cousin. That expectation was pretty well borne out.

I was particularly excited at the chance to play with ActionScript's optional strict typing since that has been under consideration for future versions of JavaScript / ECMAScript. AS3's strict typing hasn't felt optional to me because I've been largely using the Flex Builder IDE and it excitedly warns me every time I fail to specify a type on a variable or function signature, so I put them everywhere. Instead of implementing encapsulation using conventions like I do in JS I use AS3's Interfaces. This felt good at first but I quickly realized that I couldn't do the kinds of tricks that I'd become accustomed to in JS. I can't create a typed object in a function and return it without defining a class for it in its own file. This has made the higher level structure of my software far closer to Java than to JavaScript.

On the other hand, inside my functions I'm pretty much writing JavaScript. All of the standard JS built-in objects (Object, Array, Math, etc) behave like modern JS implementations. I use anonymous functions to manipulate and filter arrays. I use Objects and Arrays as my primary data structures.

At this point I'm pretty used to AS3. I know how to design my code and I can be pretty productive, not needing to rewrite everything too many times. I'd like to see Adobe mix the Java-style classes and interfaces up with the JavaScript habit of declaring anonymous functions and objects as you go. I'd be thrilled if I could write something like:

var x : IMyInterface = new IMyInterface {
method: function() : void { ... },
field: 42
};

I see that they're extending the language to support limited generics support so hopefully that won't be the end of it. I'm looking forward to experimenting more on the platform.

5Jan/096

A Different Model For Web Services Authorization

In my last post I set out to describe how easy it is to extract private keys from desktop software. As I was concluding I stumbled on an alternative approach that might be more secure in some circumstances. I didn't really go into details, so here's an expansion of the idea.

Current API authentication mechanisms including Flickr Auth, OAuth, Yahoo BBAuth and Google AuthSub work by allowing users to grant an application the right to act for the user on the site. Some implementations allow the user to grant access to only a subset of the functionality — Flickr lets users grant just read access, Google lets users grant access to just one class of data (for example Google Docs but not Google Calendar). But still, the user doesn't know which specific operations they're allowing to be carried out. This wouldn't matter so much if the user can trust the application that they're granting access to, but there's no way to provide for that completely.

Another approach might be to give third-party applications the ability to submit sets of changes to be applied to a user's account, but require the user to approve them through the service provider's web site before they're applied to the the user's real data. For example, a Flickr upload might look like this:

  • Client makes API request flickr.transaction.create(total_files=20, total_data=20M)
  • Flickr responds with transaction id X and a maximum lifetime for the transaction
  • Flickr responds with an URL to direct the user's browser to to:
    • associate the transaction with an account, and
    • preview, permit and commit the changes

To a user this isn't going to look or feel to different from the current Flickr upload scenario where on completion they're presented with a set of uploaded photos and offered the opportunity to make metadata changes.

The key advantage here from a security standpoint is that users approve actions not applications. The challenge of expressing a set of changes to a user clearly is not small but unlike the challenge of identifying and trusting applications it's solvable.

This model probably only makes sense for write-oriented sessions, and for infrequent, large write sessions (such as Flickr uploads) rather than frequent, small writes (like Twitter posting).

These ideas aren't well thought out in my head, they're fresh from me trying to come up with a workable model for API authentication and authorization that doesn't depend on trusting the identity of clients. I'd really welcome feedback and ideas on how to take this forward and feedback implementers on the feasibility of adopting a model like this.

5Jan/096

No More Secrets

Using secret keys to identify applications communicating across the internet has become popular as people have copied the very successful Flickr authentication API. Unfortunately people trust that they can keep these keys secret from attackers, even as they distribute applications that contain the secret keys to other people. I decided to see how hard it would be to extract the secret key and API key from Flickr's new XULRunner based Uploadr.

While the Uploadr is nominally open source, they don't include the key in the source code. In the README they instruct anyone building it to acquire their own key and modify the source code to include that key:

You'll need your own API key and secret from Flickr to build Uploadr.
These can be obtained at http://flickr.com/services/api/.  The key
and secret must be placed in flKey.cpp as indicated.  This file is at:
  UPLOADR/MacUploadr.app/Contents/Resources/components/flKey.cpp

The API key is stored as a string.  The secret is stored as individual
characters so it is not easily readable from the binary.

Such noble goals.

Getting the API key was as simple as running tcpdump:

sudo tcpdump -i en1 -vvv -n -s 0 -w /tmp/DumpFile.dmp

running the Uploadr, and then looking through the dump file to find where the API key is passed. Look:

api_key=a5c4c07991a15c6b56b88005a76e7774

In theory getting the secret key is harder, but it turned out to be even easier. It's stored in the source code for the flKey::Sign method:

	// The shared secret
#ifdef XP_MACOSX
	bytes[0] = '-';
	bytes[1] = '-';
	bytes[2] = '-';
	bytes[3] = '-';
	bytes[4] = '-';
	bytes[5] = '-';
	bytes[6] = '-';
	bytes[7] = '-';
	bytes[8] = '-';
	bytes[9] = '-';
	bytes[10] = '-';
	bytes[11] = '-';
	bytes[12] = '-';
	bytes[13] = '-';
	bytes[14] = '-';
	bytes[15] = '-';
#endif

All I need to do is find that code, disassemble it and reassemble the string. I ran the Uploadr under gdb:

gdb '/Applications/Flickr Uploadr.app/Contents/MacOS/xulrunner'

and once it was up and running tried to find something to break on that would expose the secret key. There weren't symbols for the flKey::Sign method that's defined in the source file but I found I could break on MD5Init that's called from that function. So:

break MD5Init
kill
run

And now I stop in the middle of the signature function - and from the stack trace I can see where flKey::Sign begins:

Breakpoint 1, 0x149a086b in MD5Init ()
(gdb) where
#0  0x149a086b in MD5Init ()
#1  0x149a1588 in flKey::Sign ()
#2  0x017bb45d in NS_InvokeByIndex_P ()
...

All I have to do is disassemble flKey::Sign:

(gdb) disassemble 0x149a1588
Dump of assembler code for function _ZN5flKey4SignERK9nsAStringRS0_:
0x149a1498 <_ZN5flKey4SignERK9nsAStringRS0_+0>:	push   %ebp
0x149a1499 <_ZN5flKey4SignERK9nsAStringRS0_+1>:	mov    %esp,%ebp
0x149a149b <_ZN5flKey4SignERK9nsAStringRS0_+3>:	push   %edi
0x149a149c <_ZN5flKey4SignERK9nsAStringRS0_+4>:	push   %esi
...
0x149a153e <_ZN5flKey4SignERK9nsAStringRS0_+166>:	movb   $0xXX,(%eax)
0x149a1541 <_ZN5flKey4SignERK9nsAStringRS0_+169>:	movb   $0xXX,0x1(%eax)
0x149a1545 <_ZN5flKey4SignERK9nsAStringRS0_+173>:	movb   $0xXX,0x2(%eax)
0x149a1549 <_ZN5flKey4SignERK9nsAStringRS0_+177>:	movb   $0xXX,0x3(%eax)
0x149a154d <_ZN5flKey4SignERK9nsAStringRS0_+181>:	movb   $0xXX,0x4(%eax)
0x149a1551 <_ZN5flKey4SignERK9nsAStringRS0_+185>:	movb   $0xXX,0x5(%eax)
0x149a1555 <_ZN5flKey4SignERK9nsAStringRS0_+189>:	movb   $0xXX,0x6(%eax)
0x149a1559 <_ZN5flKey4SignERK9nsAStringRS0_+193>:	movb   $0xXX,0x7(%eax)
0x149a155d <_ZN5flKey4SignERK9nsAStringRS0_+197>:	movb   $0xXX,0x8(%eax)
0x149a1561 <_ZN5flKey4SignERK9nsAStringRS0_+201>:	movb   $0xXX,0x9(%eax)
0x149a1565 <_ZN5flKey4SignERK9nsAStringRS0_+205>:	movb   $0xXX,0xa(%eax)
0x149a1569 <_ZN5flKey4SignERK9nsAStringRS0_+209>:	movb   $0xXX,0xb(%eax)
0x149a156d <_ZN5flKey4SignERK9nsAStringRS0_+213>:	movb   $0xXX,0xc(%eax)
0x149a1571 <_ZN5flKey4SignERK9nsAStringRS0_+217>:	movb   $0xXX,0xd(%eax)
0x149a1575 <_ZN5flKey4SignERK9nsAStringRS0_+221>:	movb   $0xXX,0xe(%eax)
0x149a1579 <_ZN5flKey4SignERK9nsAStringRS0_+225>:	movb   $0xXX,0xf(%eax)
...
0x149a168e <_ZN5flKey4SignERK9nsAStringRS0_+502>:	pop    %esi
0x149a168f <_ZN5flKey4SignERK9nsAStringRS0_+503>:	pop    %edi
0x149a1690 <_ZN5flKey4SignERK9nsAStringRS0_+504>:	pop    %ebp
0x149a1691 <_ZN5flKey4SignERK9nsAStringRS0_+505>:	ret
End of assembler dump.

And sure enough if I reconstruct those bytes, the secret key is: REDACTEDREDACTED

That wasn't too hard was it? The fact that the code around the key was open source made it easier to find but even if it was closed source if I knew what kind of algorithm I was looking for it wouldn't be hard to find and examine. It's pretty easy to trace forward from user input or backward from network IO to find the signature algorithm, and the secret's always the input to that.

Unfortunately OAuth follows the same model as Flickr, placing some value in a consumer's "private" key. They do mention in Appendix B.7 that it's often possible for attackers to access a secret it's still a vital part of the protocol.

Since it's not possible to securely identify clients we should move to a model where users approve actions against their accounts rather than granting carte-blanche actions to clients. For example Flickr could put photos that are uploaded into a staging area and require approval on the web site before including them in a user's photo stream. This would allow users to make a informed decisions about third-party access to their accounts.

Updated: at Flickr's request I've removed the "secret" key.

Updated again: I made another post about this topic, expanding on the final paragraph of this post.

24Dec/083

Closed Source

So now that I'm working on something that's proprietary, closed source and in stealth mode, I'm finally doing stuff and learning how to do things that are really cool! Typical. Perhaps I should just start queuing up blog posts about the stuff I've discovered to push out live once we launch something other people could look at.

Tagged as: 3 Comments