The initial spider I wrote wasn’t bad; it averaged about 500 sites/hour at first, and after some tweaks I got it up to 1,000/hour. Indexing 24,000 sites each day seemed pretty good, but then I realized that in order to keep results accurate, I’ll need to re-index all of those pages on a fairly regular basis, say once every week or two. Since I’m shooting for 1,000,000 galleries, indexing 24,000/day just isn’t going to cut it: at that rate, a single pass over a million galleries would take about six weeks!
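To put rough numbers on that gap, here is a quick back-of-the-envelope sketch in Python (not the spider's own language). The figures come straight from the paragraph above; the function and constant names are made up for illustration.

```python
# Figures taken from the post; names are illustrative only.
TARGET_GALLERIES = 1_000_000
CURRENT_PER_HOUR = 1_000


def required_per_day(total, refresh_days):
    """Galleries that must be indexed per day to revisit `total` every `refresh_days`."""
    return total / refresh_days


current_per_day = CURRENT_PER_HOUR * 24  # 24,000 per day

for days in (7, 14):
    needed = required_per_day(TARGET_GALLERIES, days)
    print(
        f"refresh every {days} days: need {needed:,.0f}/day, "
        f"about {needed / current_per_day:.1f}x the current rate"
    )
```

So even the relaxed two-week refresh needs roughly three times the single-spider throughput, and a weekly refresh needs about six times.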
My first thought was the Perl threading module: since the spider spends most of its time blocking on network requests, it seemed an ideal fit for multi-threading. While I was looking into this, I realized I was horribly over-thinking the entire problem. Threading isn’t much more than parallel processes with shared memory, but since there’s a practically endless supply of galleries to index, the spiders rarely need to coordinate, so sharing memory isn’t really an issue. Plus, the spider is in constant contact with a database, so each process knows when a URL was recently indexed. Thus, with only a few minor tweaks, I was able to get several spiders running simultaneously. They’ve hit 30,000 galleries over the past nine hours (roughly 3,300/hour); not a bad improvement for about an hour’s worth of work :)
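For the curious, the "let the database do the coordinating" idea can be sketched roughly like this. This is a minimal illustration in Python with SQLite, not the actual Perl spider: the `galleries` table, the one-week interval, and the `open_db`, `try_claim` and `run_spider` names are all assumptions made up for the example.

```python
import sqlite3
import time

REINDEX_AFTER_SECONDS = 7 * 24 * 3600  # assumed refresh interval: one week


def open_db(path):
    """Open the shared database and make sure the (hypothetical) table exists."""
    # timeout: wait up to 30 s if another spider process holds the write lock
    conn = sqlite3.connect(path, timeout=30)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS galleries ("
        " url TEXT PRIMARY KEY,"
        " last_indexed REAL)"
    )
    conn.commit()
    return conn


def try_claim(conn, url, now=None):
    """Return True if this process should index `url` now.

    The timestamp is written at claim time (a simplification), so a second
    spider asking about the same URL sees it as recently indexed and skips it.
    """
    now = time.time() if now is None else now
    cutoff = now - REINDEX_AFTER_SECONDS
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO galleries (url, last_indexed) VALUES (?, NULL)",
            (url,),
        )
        cur = conn.execute(
            "UPDATE galleries SET last_indexed = ? "
            "WHERE url = ? AND (last_indexed IS NULL OR last_indexed < ?)",
            (now, url, cutoff),
        )
        return cur.rowcount == 1  # exactly one row updated means we won the claim


def run_spider(db_path, urls, fetch):
    """One spider process: index every URL nobody else has handled recently."""
    conn = open_db(db_path)
    indexed = []
    for url in urls:
        if try_claim(conn, url):
            fetch(url)  # the network-bound work; the caller supplies this function
            indexed.append(url)
    conn.close()
    return indexed
```

Several copies of `run_spider` pointed at the same database file can then run as ordinary separate processes. No shared memory is needed, because the only thing they have to agree on is which URLs were indexed recently, and the database already knows that.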