Caroline/0.4.1

March 14th, 2007 by aidan

Caroline is the name of the spider used to build the EveKnows.com database. Today I hacked together version 0.4, which moves to the threading model I mentioned earlier. Running multiple processes was a pain and relied on having several different starting points; now I can run the spider on a single URL and still get fast results. Shortly after it began to run, I realized that my host was going to get banned all over the place because requests were being sent to the same servers too quickly. Caroline/0.4.1 takes a step towards fixing this by shuffling the queue of sites a bit, so that consecutive requests are less likely to hit the same server.
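To make the queue idea concrete, here is a minimal sketch of one way to reorder a crawl queue so that back-to-back requests go to different hosts. It is written in Python for brevity and is not Caroline's actual Perl code; the function name `interleave_by_host` and the round-robin strategy are assumptions (a plain random shuffle would be a simpler alternative).

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Reorder urls so consecutive entries come from different hosts where possible.

    Hypothetical helper: groups URLs by host, then takes one URL from each
    host in turn (round-robin) until every group is empty.
    """
    buckets = OrderedDict()
    for url in urls:
        host = urlsplit(url).netloc.lower()
        buckets.setdefault(host, deque()).append(url)

    ordered = []
    while buckets:
        # Copy the keys so exhausted hosts can be removed during the pass.
        for host in list(buckets):
            ordered.append(buckets[host].popleft())
            if not buckets[host]:
                del buckets[host]
    return ordered
```

Reordering alone does not help once only one host is left in the queue; at that point the remaining URLs come out back to back, so a per-host pause is still needed.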

Now I’m researching how to respect robots.txt and implement longer pauses when repeatedly hitting the same server. Caroline uses Perl’s WWW::Mechanize module to automate HTTP requests, which means I can’t use the LWP::RobotUA module without some serious hacking. That might be the way to go in the long run, but I want to explore other options first.
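The two pieces, robots.txt checks and per-server pauses, can be sketched independently of any HTTP client. The following is a rough Python illustration using the standard library's `urllib.robotparser`, not Caroline's Perl code. The class name `PoliteGate`, its methods, and the 5-second default pause are all assumptions made for the example; `Crawl-delay` is a non-standard robots.txt directive that some sites provide.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, user_agent, min_delay=5.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay = min_delay  # assumed default pause between hits, in seconds
        self.clock = clock          # injectable so the logic can be tested
        self.rules = {}             # host -> parsed robots.txt
        self.last_hit = {}          # host -> time of the last request

    def load_robots(self, host, robots_txt):
        # The caller fetches robots.txt however it likes and passes the text in.
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        parser = self.rules.get(urlsplit(url).netloc.lower())
        # With no rules loaded for the host, this sketch allows the request.
        return parser is None or parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to sleep before requesting url (0.0 if none)."""
        host = urlsplit(url).netloc.lower()
        delay = self.min_delay
        parser = self.rules.get(host)
        if parser is not None:
            crawl_delay = parser.crawl_delay(self.user_agent)
            if crawl_delay is not None:
                delay = max(delay, float(crawl_delay))
        last = self.last_hit.get(host)
        if last is None:
            return 0.0
        return max(0.0, last + delay - self.clock())

    def record(self, url):
        # Call right after a request is sent to this URL's host.
        self.last_hit[urlsplit(url).netloc.lower()] = self.clock()
```

A worker thread would call `allowed(url)` first, sleep for `wait_time(url)`, send the request, and then call `record(url)`. With several threads, the `last_hit` bookkeeping would need a lock, which this sketch leaves out.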
