I’ve been hacking away on Caroline this past week and finally have the latest version ready. After watching memory usage skyrocket from 100MB to 1GB over a typical 12-hour run, I started searching for memory leaks. The first thing I did was nix the old system of storing all URLs in memory until they were spidered. Caroline essentially performs a depth-first search through TGPs, going deeper and deeper into the free porn sites until it hits a specified depth (usually 3 or 4 links), then working its way back up. Under the old system, that meant 100,000+ URLs sat in memory at any one time, plus each URL’s incoming link text and current depth. I changed the code so that only 1,000 links are kept in memory and the rest are dumped to disk. More links are fetched whenever the memory cache dips below 100. This made a decent improvement.
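For anyone curious what the capped cache looks like, here is a rough sketch of the idea. It is in Python rather than Caroline's actual code, and the class name `SpillQueue`, the tab-separated spill file, and the refill-from-disk behavior are all my assumptions for illustration. Only the 1,000 and 100 thresholds come from the description above, and the ordering here is only roughly depth-first.

```python
import os
import tempfile
from collections import deque


class SpillQueue:
    """Hypothetical link queue: a bounded in-memory cache backed by a spill file."""

    def __init__(self, max_mem=1000, low_water=100, path=None):
        self.max_mem = max_mem      # most links held in memory at once
        self.low_water = low_water  # refill from disk when memory dips below this
        self.mem = deque()
        if path is None:
            fd, path = tempfile.mkstemp(suffix=".links")
            os.close(fd)
        self.path = path
        open(self.path, "ab").close()  # make sure the spill file exists
        self.read_pos = 0              # byte offset of the next unread record

    def push(self, url, link_text, depth):
        """Keep the link in memory if there is room, otherwise append it to disk."""
        if len(self.mem) < self.max_mem:
            self.mem.append((url, link_text, depth))
            return
        # Tabs and newlines would corrupt the one-record-per-line format.
        clean = " ".join(link_text.split())
        with open(self.path, "ab") as f:
            f.write(f"{url}\t{clean}\t{depth}\n".encode("utf-8"))

    def _refill(self):
        """Read spilled records back until memory is full or the file runs out."""
        with open(self.path, "rb") as f:
            f.seek(self.read_pos)
            while len(self.mem) < self.max_mem:
                line = f.readline()
                if not line:
                    break
                url, text, depth = line.decode("utf-8").rstrip("\n").split("\t")
                self.mem.appendleft((url, text, int(depth)))
            self.read_pos = f.tell()

    def pop(self):
        """Return the next (url, link_text, depth) tuple, or None when empty."""
        if len(self.mem) < self.low_water:
            self._refill()
        return self.mem.pop() if self.mem else None
```

The point of the design is that memory stays bounded by `max_mem` records no matter how many links the crawl discovers; the spill file grows instead.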
There was still a small leak, though: somehow memory usage kept creeping up by about 1MB every five minutes. After much troubleshooting, I tracked the problem down to the WWW::Mechanize module. Replacing it with LWP and HTML::TokeParser seems to have resolved the problem, and it also sets up a clean framework for indexing movie galleries. Mechanize had a really easy interface for finding image links, but no easy way to extract movies. By writing the extraction code myself using HTML::TokeParser, I can finally get both. Expect to see porn movie indexing in the coming weeks! ;)
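To give a feel for the token-level extraction, here is a small sketch using Python's standard `html.parser` as a stand-in for HTML::TokeParser. This is not Caroline's code: the extension lists, the `GalleryLinkParser` class, and the `extract_media_links` helper are assumptions made up for the example. It walks the anchor tags once and sorts their targets into images and movies by file extension.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed extension lists; the real crawler may recognize different types.
IMAGE_EXTS = {".jpg", ".jpeg", ".gif", ".png"}
MOVIE_EXTS = {".mpg", ".mpeg", ".avi", ".wmv", ".mov"}


class GalleryLinkParser(HTMLParser):
    """Collect <a href> targets that point at image or movie files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.movies = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL, then classify by extension.
        url = urljoin(self.base_url, href)
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        if ext in IMAGE_EXTS:
            self.images.append(url)
        elif ext in MOVIE_EXTS:
            self.movies.append(url)


def extract_media_links(html, base_url):
    """Return (image_urls, movie_urls) found in anchor tags of the page."""
    parser = GalleryLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.images, parser.movies
```

Because both kinds of link come out of the same pass over the tags, adding another media type should only mean adding another extension set.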