Who is the most connected musician? Mathematics had Paul Erdős, film has Kevin Bacon, but who is the center of the musical world? At some point I flippantly suggested Brian Eno because of his long career as a producer. Rich Trott brought this up again recently when linking to his awesome Music Routes site.
Rich has now published the data that backs Music Routes. This let me write some code to calculate who the most connected musician is. The criterion I chose was to find the artist with the lowest average distance from all other musicians in the largest network of connected musicians. I started researching shortest-path algorithms and quickly discovered that the Floyd-Warshall algorithm will give me the shortest path between every pair of nodes in a graph in O(n³) time. And scipy has an implementation.
I ended up with an IPython notebook to calculate the most connected artist. Rich's data includes 10479 people on 4835 tracks. He admits that it ultimately reflects his tastes because he has entered most (perhaps all) of the data himself, but that's still a serious number of data points. The Floyd-Warshall step was the slowest part of the calculation. I left it running overnight and, once it was complete, saved the results in case I ever wanted to run this again.
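To make the idea concrete, here is a minimal sketch of that calculation on a tiny made-up data set. It is not the code from the notebook: the `tracks` dictionary and every name in it are hypothetical, and the Floyd-Warshall triple loop is written out by hand in plain Python so the example stands alone, where the real run used scipy's implementation. Two people are treated as being at distance 1 if they appear on the same track.

```python
from itertools import combinations
from math import inf

# Hypothetical toy data: each track lists the people credited on it.
tracks = {
    "track_a": ["Ann", "Bob"],
    "track_b": ["Bob", "Cat"],
    "track_c": ["Cat", "Dan"],
    "track_d": ["Eve", "Fay"],  # a separate, smaller network
}

people = sorted({name for credits in tracks.values() for name in credits})
index = {name: i for i, name in enumerate(people)}
n = len(people)

# Distance matrix: 0 on the diagonal, 1 between people who share a track,
# infinity everywhere else.
dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
for credits in tracks.values():
    for a, b in combinations(credits, 2):
        dist[index[a]][index[b]] = 1
        dist[index[b]][index[a]] = 1

# Floyd-Warshall: allow each node k in turn as an intermediate stop.
for k in range(n):
    for i in range(n):
        for j in range(n):
            if dist[i][k] + dist[k][j] < dist[i][j]:
                dist[i][j] = dist[i][k] + dist[k][j]

# Nodes with a finite distance between them belong to the same network;
# keep only the largest such group.
components = {
    frozenset(j for j in range(n) if dist[i][j] < inf) for i in range(n)
}
largest = max(components, key=len)


def average_distance(i):
    """Mean shortest-path length from node i to everyone else in `largest`."""
    others = [j for j in largest if j != i]
    return sum(dist[i][j] for j in others) / len(others)


# Lowest average distance first.
ranking = sorted((average_distance(i), people[i]) for i in largest)
for avg, name in ranking:
    print(f"{name}: {avg:.3f}")
```

The three nested loops are where the O(n³) cost comes from, which is why this step dominates the running time on ten thousand people.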
The result? Within the largest group of connected artists (10065 artists), the one with the shortest average distance to every other artist in the network was Jim Keltner, a session drummer, with an average distance of 3.226. He's followed closely by Paul McCartney (3.239), Bob Dylan (3.322) and Elvis Costello (3.333).
This is an alright result, but it really feels kind of like "small data". I downloaded the discogs.com database dump with 3.6M artists on 5.3M releases. I need to work out how to run an all-pairs shortest-path algorithm for that without needing terabytes of RAM.
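For a sense of scale: assuming a dense matrix of 8-byte floats, 3.6M × 3.6M entries comes to roughly 1.04 × 10¹⁴ bytes, or about 104 TB. One possible way around that (not something I have tried on the discogs data) is to swap Floyd-Warshall for a breadth-first search from each artist and keep only the running average, so the full matrix never exists in memory. The sketch below shows the idea on a hypothetical toy graph that is assumed to be already restricted to one connected network; the `graph` dictionary and the names in it are made up.

```python
from collections import deque

# Hypothetical toy adjacency list: each person maps to the people
# they have shared a track with.
graph = {
    "Ann": ["Bob"],
    "Bob": ["Ann", "Cat"],
    "Cat": ["Bob", "Dan"],
    "Dan": ["Cat"],
}


def average_distance(graph, source):
    """Breadth-first search from `source`; return the mean hop count
    to every other node it can reach."""
    seen = {source: 0}  # node -> distance from source
    queue = deque([source])
    total = 0
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                total += seen[neighbour]
                queue.append(neighbour)
    reached = len(seen) - 1
    return total / reached if reached else float("inf")


# Only one number per artist is kept, never the full distance matrix.
averages = {name: average_distance(graph, name) for name in graph}
best = min(averages, key=averages.get)
print(best, round(averages[best], 3))
```

Memory here grows with the number of artists and connections rather than with the square of the number of artists. The running time is still one full search per artist, so this trades the RAM problem for a lot of patience.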