Archive for April, 2007

Porn Video Search

April 29th, 2007

This weekend saw a major update to EveKnows.com. The biggest new feature is the ability to search for porn video galleries in addition to photo galleries. Search results note whether each gallery contains pictures or movies, along with how many. A future release will likely include the ability to restrict a search to photo galleries or video galleries only.

On the back end, some improvements have been made to the search algorithm, allowing faster results with a larger database. Caroline, the porn spider, has been running non-stop for a week and managed to accumulate over 250,000 new, searchable galleries. We’ll be monitoring performance as the database continues to grow; the goal remains to have 1 million current galleries in the engine at any given time.

Another new feature is an integrated dictionary for suggesting popular spellings of search terms that did not yield any results. For example, searching for ‘hot amatuere’ now returns the message “No galleries matched your search terms. Perhaps you meant ‘hot amateur’?” Let us know how accurate the suggestions are!
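For the curious, here is a minimal sketch of how this kind of suggestion could work. It is written in Python with the standard difflib module and is not the code EveKnows.com actually runs; the vocabulary list, the function name suggest_query, and the 0.75 similarity cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical vocabulary; a real engine would build this from indexed terms.
KNOWN_TERMS = ["amateur", "hot", "video", "photo", "gallery"]


def suggest_query(query, vocabulary=KNOWN_TERMS, cutoff=0.75):
    """Return a corrected query string, or None if no correction was found."""
    corrected = []
    changed = False
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # get_close_matches ranks candidates by similarity ratio and
        # drops anything scoring below the cutoff.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if matches:
            corrected.append(matches[0])
            changed = True
        else:
            corrected.append(word)
    return " ".join(corrected) if changed else None
```

With this vocabulary, suggest_query("hot amatuere") returns "hot amateur", while a query that is already spelled correctly returns None, so no suggestion is shown.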

We’ve also jumped on the trendy ‘tag cloud’ bandwagon, using a cloud to replace the Popular Searches and Recent Searches lists. Forty or so recent search terms appear in the cloud, with size indicating the popularity of the search. Not sure if this counts as Web 2.0 goodness or not, but it certainly looks cool.
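Sizing the terms in a cloud is mostly a matter of mapping search counts onto a font-size range. The Python sketch below shows one common way to do it, using log scaling; the function name cloud_sizes and the 11 to 28 pixel range are assumptions, not the values used on the site.

```python
import math


def cloud_sizes(counts, min_px=11, max_px=28):
    """Map each search term to a font size in pixels based on its count."""
    # Log scaling keeps one very popular term from dwarfing all the others.
    logs = {term: math.log(n) for term, n in counts.items() if n > 0}
    if not logs:
        return {}
    lo, hi = min(logs.values()), max(logs.values())
    span = hi - lo
    sizes = {}
    for term, value in logs.items():
        # When every count is equal, fall back to the smallest size.
        weight = (value - lo) / span if span else 0.0
        sizes[term] = round(min_px + weight * (max_px - min_px))
    return sizes
```

The least-searched term gets min_px, the most-searched gets max_px, and everything else lands in between.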

Caroline 0.90

April 15th, 2007

Today I finished work on the first beta release of the porn spider Caroline 1.0. This release adds support for video galleries, improves detection of shifty HTML or JavaScript in galleries (you know, the stuff that causes pop-ups, auto-bookmarkers, and the like), and introduces a much better filing system for image thumbnails. The addition of video galleries means that I must once again scrap the EveKnows.com database and begin from scratch, but I think the results will be worth the wait! I’m really excited about the new JavaScript detection as well; Caroline will be able to index pages with harmless JavaScript, like image rollovers, but will continue to block pages with pop-up ads or other intrusive script behaviors. I’m going to wait until we have around 250,000 galleries, then update EveKnows.com with the new database.
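To give a rough idea of what separating harmless from intrusive script can look like, here is a simplified Python sketch. It is not Caroline’s actual rule set: the three patterns and the function name is_intrusive are assumptions chosen for illustration, and a real filter would need many more rules and fewer false positives (window.open, for instance, has legitimate uses).

```python
import re

# Assumed patterns for illustration only; not Caroline's real rules.
INTRUSIVE_PATTERNS = [
    re.compile(r"window\.open\s*\(", re.I),          # pop-up windows
    re.compile(r"\.AddFavorite\s*\(", re.I),         # auto-bookmarking (old IE)
    re.compile(r"\bon(?:before)?unload\s*=", re.I),  # exit handlers
]


def is_intrusive(html):
    """Return True if the page source matches any intrusive-script pattern."""
    return any(pattern.search(html) for pattern in INTRUSIVE_PATTERNS)
```

A page whose only script is an onmouseover image rollover passes this check, while a page that calls window.open from an onunload handler is flagged.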

Database Updates

April 14th, 2007

Well, Caroline is off gathering as much free porn as she can find. I made the database adjustments I mentioned in my previous post, so the next EveKnows.com update won’t be for a few days. Once it’s ready, however, search speeds should be drastically faster. Right now, the first query for a particular search term takes between 3 and 10 seconds, though once it’s cached, the results are generated in about 0.01 seconds. This update should get everything down to the 1-second-or-less range.

Caroline 0.8

April 12th, 2007

The latest Caroline Spider update fixed a number of memory problems, some more thumbnailing issues, and, most importantly, included significant speed improvements. The spider is now indexing porn galleries two or three times faster than before, and using much less RAM to boot! She’s out scuttling around the Internets at this moment, cataloging all the best boobies and rolling them into the EveKnows.com database.

After the spider was well under control, I turned back to looking at optimization methods for EveKnows itself. First, I noticed that the server’s hard drive was performing abysmally. A quick ‘hdparm -m16 /dev/hda’ had things back to where they were before. Apparently Debian Etch’s config files for hdparm don’t work as expected, so the settings are lost after a reboot. I still need to look into why this is happening, but I can’t take the server down for a reboot very often, so it’s going to have to wait. I also found a way to significantly speed up my queries, but it would require a database schema update, which in turn would necessitate scrapping the current porn gallery database and starting from scratch. I think I’m going to do this (the changes result in an order-of-magnitude speed increase), but it may take some time. Don’t be worried if you don’t see new galleries appearing in the search results over the next few days; it just means that we’re gearing up for a faster, more thorough search engine.

Update: For what it’s worth, I figured out that the hdparm problem was due to a missing device block in /etc/hdparm.conf. In my case, all I needed was:

/dev/hda {
        mult_sect_io = 16
        write_cache = on
        keep_settings_over_reset = on
}

Caroline 0.7

April 1st, 2007

I’ve been hacking away on Caroline this past week and finally have the latest version ready. After watching memory usage skyrocket from 100MB to 1GB over a typical 12-hour run, I started searching for memory leaks. The first thing I did was to nix the old system of storing all URLs in memory until they were spidered. Caroline essentially performs a depth-first search through TGPs, going deeper and deeper into the free porn sites until it hits a specified depth (usually 3 or 4 links), then starts working its way back up. This resulted in 100,000+ URLs being in memory at any one time, plus each URL’s incoming link text and current depth. I changed the code so that 1,000 links are kept in memory, and the rest are dumped to disk. More links are read back from disk whenever the in-memory cache dips below 100, which made a decent improvement.
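The bounded queue described above can be sketched roughly as follows. This is a Python illustration of the idea, not Caroline’s code: the class name UrlFrontier, the JSON chunk files, and the choice to spill the oldest half of the stack are all assumptions. Only the limits of 1,000 in-memory links and a refill threshold of 100 come from the description above.

```python
import json
import os
import tempfile


class UrlFrontier:
    """LIFO stack of (url, link_text, depth) with a bounded in-memory part."""

    def __init__(self, max_in_memory=1000, refill_below=100, spill_dir=None):
        self.max_in_memory = max_in_memory
        self.refill_below = refill_below
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="frontier-")
        self.memory = []   # top of the stack is the end of the list
        self.chunks = []   # paths of spilled chunk files, oldest first

    def push(self, url, link_text, depth):
        self.memory.append((url, link_text, depth))
        if len(self.memory) > self.max_in_memory:
            self._spill()

    def pop(self):
        """Return the most recently pushed entry, or None when empty."""
        if len(self.memory) < self.refill_below and self.chunks:
            self._refill()
        return self.memory.pop() if self.memory else None

    def _spill(self):
        # Write the oldest half of the stack to disk; the newest (deepest)
        # entries stay in memory, which suits a depth-first crawl.
        cut = len(self.memory) // 2
        old, self.memory = self.memory[:cut], self.memory[cut:]
        name = "chunk-%06d.json" % len(self.chunks)
        path = os.path.join(self.spill_dir, name)
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(old, handle)
        self.chunks.append(path)

    def _refill(self):
        # Load the most recently spilled chunk back underneath whatever
        # is still in memory, so the pop order stays last-in, first-out.
        path = self.chunks.pop()
        with open(path, encoding="utf-8") as handle:
            restored = [tuple(item) for item in json.load(handle)]
        os.remove(path)
        self.memory = restored + self.memory
```

Because chunks are written and read back in stack order, entries come out in the same last-in, first-out order they would have with an unbounded in-memory list, as long as refill_below is no more than half of max_in_memory.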

There was still a small leak, though: somehow memory usage kept creeping up by about 1MB every five minutes. After much troubleshooting, I tracked the problem down to the WWW::Mechanize module. Replacing this with LWP and HTML::TokeParser seems to have resolved the problem, and it also set up a clean framework for indexing movie galleries. Mechanize had a really easy interface for finding image links, but no easy way to extract movies. By writing the extraction code myself using HTML::TokeParser, I can finally get both. Expect to see porn movie indexing coming in the following weeks! ;)
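As a rough illustration of hand-rolled link extraction, here is a sketch in Python using the standard html.parser module. It stands in for the Perl HTML::TokeParser code, which is not shown here; the extension lists and the names GalleryLinkParser and extract_gallery_links are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed extension lists; adjust to whatever formats should be indexed.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif")
MOVIE_EXTS = (".mpg", ".mpeg", ".avi", ".wmv", ".mov")


class GalleryLinkParser(HTMLParser):
    """Collect links to full-size images and movies from a gallery page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.movies = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        url = urljoin(self.base_url, href)
        # Ignore any query string when checking the file extension.
        path = url.split("?", 1)[0].lower()
        if path.endswith(IMAGE_EXTS):
            self.images.append(url)
        elif path.endswith(MOVIE_EXTS):
            self.movies.append(url)


def extract_gallery_links(html, base_url):
    """Return (image_urls, movie_urls) found in anchor tags of the page."""
    parser = GalleryLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.images, parser.movies
```

Walking the tags directly means images and movies come out of a single pass, and adding another media type is just one more extension tuple.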