Breaking into a debugger in Python

This is probably old news to most folks, but I only found out about this recently, more than twelve years into being a Python programmer.

The pdb (Python Debugger) module has a few useful functions. For years I’ve been using pdb.pm(), the postmortem debugger. It can be run from the Python shell to invoke a debugger after an exception. Traditionally I’ve done:

% python -i myscript.py
(some exception occurs)
>>> import pdb
>>> pdb.pm()
(Pdb)

But for this to work you need to be able to pass arguments to the interpreter (-i) and the code needs to throw an exception that isn’t caught.

Enter pdb.set_trace(). It begins the same interactive debugger that pdb.pm() provides upon request, from code. So in the middle of a function that’s confusing me I add:

from pdb import set_trace;set_trace()

and my code will stop, ready for me to tear apart its runtime state. I just wish it had a more descriptive name so that I’d have found it earlier.

Bonus: printing stack traces
Python’s stack traces in exceptions are pretty useful. It used to be difficult to get the current stack in Python, involving raising and catching an exception. Now the traceback module has traceback.print_stack() and traceback.extract_stack() methods.

Scripting Google Voice in Python

In the interest of getting some of the fragments of code I’ve written off my hard disk and out where someone might find them useful I’ve decided to start dumping them into git repos with some very minimal documentation. Minimal enough that I actually kick them out there.

The first is a simple Python library to access Google Voice’s calling and texting functionality:

% python
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from googlevoice import *
>>> gv = GoogleVoice('googleaccount', 'googlepassword')
>>> gv.sms('+1415MYNUMBER', 'this is a test')
>>> gv.call('+1415MYNUMBER', '18003569377')

The library depends on BeautifulSoup. It’s based on a bunch of work that I found on the internet. I don’t remember exactly whose techniques I ended up using, but there’s a bunch of example code out there, just not much that was simple to use and in Python.

Implementing Information Retrieval Systems

I believe that it’s valuable to implement software components that you use from scratch merely to gain a better understanding of the systems you use. I’ve never implemented an operating system from scratch or designed and implemented a programming language like many of my friends but I’d really like to one day. Part of the problem is that looking at a large, complicated, system from the outside is intimidating.

A couple of years ago I spent many months working with CLucene, the C++ port of the Lucene IR (full text search). Both CLucene and Lucene are huge, complicated and often appear to be buggy. While learning about the Lucenes I did a bunch of reading about full text search theory. It never seemed that complicated, but all of the implementations I came across clearly were.

A basic full text search engine works like this:

  • collect all of the words in the input documents
  • simplify the words so that similar ones are the same, so I can search for “look” and get “looked”, “looking” and “looker”
  • organize the words in a special tree data structure called a trie that lets you look up the words in linear time, the leaves of the trie for each word indicate which documents it’s present in
  • to search for a word you walk the trie and get the list of documents

Yesterday I spent a couple of hours walking (Tel Aviv public transport was failing me) and I worked through most of how to implement this in my head. I also thought about how to implement phrase matching. I’m not really sure how phrase matching works in other systems (like I said, they’re too complicated for me to understand easily) but I represent each instance of a word in a document and then link from one instance to the instance of the next word in the document. That way after walking down the trie to find the first word in the phrase I can walk along each word in the phrase, comparing it to the next word in the document.

Last night I implemented it in Python. I called it Tripe and it’s less than 250 lines of Python. It doesn’t have a stemmer worth talking about and it’s pretty space inefficient but it seems to generally work. There’s a library file called tripe.py and then some test utilities: tripe_add.py (add documents), tripe_search.py (search the index), tripe_dot.py (visualize the index with Graphviz). Take a look:

% echo "The cat sat on the mat." | ./tripe_add.py test.tripe 1
% ./tripe_dot.py test.tripe|dot -Tpng -o test1.png

% echo "The quick brown fox..." | ./tripe_add.py test.tripe 2
% ./tripe_dot.py test.tripe|dot -Tpng -o test2.png

% echo "There is a light that never goes out." | ./tripe_add.py test.tripe 3
% ./tripe_dot.py test.tripe|dot -Tpng -o test3.png

% ./tripe_search.py test.tripe the
matched in document 1 at 15
matched in document 1 at 0
matched in document 2 at 0
% ./tripe_search.py test.tripe cat
matched in document 1 at 4

I first implemented this as a tree in memory, but I found that implementing persistent storage was really easy too. I’ve always been hesitant to implement structured binary file formats, and I know that efficient implementation can be really complicated, but a naive implementation like I did was relatively straight-forward.

So now what? There are already a bunch of full-text search engines out there, but they all seem relatively complicated. I wonder if it’s worth tidying up what I have, extending it enough to be useful for site search applications and publishing it. In the meantime it’s on github.

Flipy, a new Python library for Flickr

In the past day or so I’ve written a new Python library for Flickr. It came from some frustration using other Python libraries. They’re all great, but none of them work quite how I want.

My goal was to have a library that feels like Python and the Flickr API at the same time. I think it’s worked out pretty well so far. You can make calls using the standard Flickr API calls as documented on the Flickr site, but the response objects feel like normal Python objects. For example you can do something like this:

from flipy import Flipy
flickr = Flipy(MY_API_KEY)
me = flickr.people.findByUsername(username='ianloic')
me_info = flickr.people.getInfo(user_id=me.nsid)
print 'My name is %s. I have %s photos at %s.' % (me_info.realname, me_info.photos.count, me_info.photosurl)

I’ve put more details about the mapping in the README.

Beyond simple mapping of methods to responses I’m working on decorating certain important response objects such as users and photos with more object oriented methods. For example right now if you have a user object you can call user.photos() and get a iterator for of a user’s photos. My code takes care of all of the paging behind the scenes.

Since I haven’t implemented authentication or uploading yet so right now it’s mostly useful for simple mashup-style applications, but I’ll get uploads and authentication complete when I get back from Jordan next week.

Check out the code on github and let me know what you think.

Python generator fun

I know Python’s iterators and generators aren’t that new anymore, but at heart I’m still a Python 1.5 programmer. I’ve come to iterators and generators in Python from my experience in JavaScript.

This morning I wanted to generate a list of names (for nodes in my NFA/DFA pattern matching code) that looked like: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, AA, AB, AC, AD, AE, AF, AG, AH, AI, AJ, AK, AL, AM, AN, AO, AP, AQ, AR, AS, AT, AU, AV, AW, AX, AY, AZ, BA, BB, BC, BD, etc… A couple of minutes of fiddling I came up with:

def name_generator():
  from string import uppercase
  from itertools import chain
  for n in chain([''], name_generator()):
    for c in uppercase:
      yield n+c

It think the code is pretty simple, pretty readable, and pretty efficient. It’s not rocket science, but I feel that iterators and generators have paid off well for Python.