Caroline 0.8

April 12th, 2007 by Aidan

The latest Caroline Spider update fixed a number of memory problems and some more thumbnailing issues and, most importantly, included significant speed improvements. The spider is now indexing porn galleries two or three times faster than before, and using much less RAM to boot! She’s out scuttling around the Internets at this very moment, cataloging all the best boobies and rolling them into the EveKnows.com database.

After the spider was well under control, I turned back to looking at optimization methods for EveKnows itself. First, I noticed that the server’s hard drive was performing abysmally. A quick ‘hdparm -m16 /dev/hda’ had things back to where they were before; apparently Debian Etch’s config files for hdparm don’t work as expected, so the settings are lost after a reboot. I still need to look into why this is happening, but I can’t take the server down for a reboot very often, so it’s going to have to wait. I also found a way to significantly speed up my queries, but it would require a database schema update, which in turn would necessitate scrapping the current porn gallery database and starting from scratch. I think I’m going to do this (the changes result in an order-of-magnitude speed increase), but it may take some time. Don’t be worried if you don’t see new galleries appearing in the search results over the next few days; it just means that we’re gearing up for a faster, more thorough search engine.

Update: For what it’s worth, I figured out that the hdparm problem was due to a missing device block in /etc/hdparm.conf. In my case, all I needed was:

/dev/hda {
        mult_sect_io = 16
        write_cache = on
        keep_settings_over_reset = on
}

Caroline 0.7

April 1st, 2007 by Aidan

I’ve been hacking away on Caroline this past week and finally have the latest version ready. After watching memory usage skyrocket over a typical 12-hour run from 100MB to 1GB, I started searching for memory leaks. The first thing I did was to nix the old system of storing all URLs in memory until they were spidered; Caroline essentially performs a depth-first search through TGPs, going deeper and deeper into the free porn sites until it hits a specified depth (usually 3 or 4 links), then starts working its way back up. This resulted in 100,000+ URLs being in memory at any one time, plus each URL’s incoming link text and current depth. I changed the code so that only 1,000 links are kept in memory and the rest are dumped to disk. More links are fetched from disk whenever the in-memory cache dips below 100. This made a decent improvement.
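As a rough illustration of the spill-to-disk idea, here is a minimal Python sketch (not Caroline’s actual Perl code). The class name SpillingStack, the pickle file format, and the choice to spill half of the cache at a time are all assumptions made for the example; only the 1,000 and 100 thresholds come from the post.

```python
import os
import pickle
import tempfile


class SpillingStack:
    """LIFO stack of (url, link_text, depth) tuples for a depth-first crawl.

    At most `high` entries stay in RAM; older entries are written to disk
    in batches and read back when the in-memory part drops below `low`.
    """

    def __init__(self, high=1000, low=100, spill_dir=None):
        self.high = high
        self.low = low
        self.batch = high // 2       # assumption: spill half the cache at once
        self.memory = []             # top of the stack is the end of the list
        self.spill_files = []        # batches on disk, most recent last
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="frontier-")

    def push(self, url, link_text, depth):
        self.memory.append((url, link_text, depth))
        if len(self.memory) > self.high:
            self._spill()

    def pop(self):
        if len(self.memory) < self.low and self.spill_files:
            self._refill()
        return self.memory.pop() if self.memory else None

    def _spill(self):
        # Write the oldest entries (bottom of the stack) to a new file.
        batch, self.memory = self.memory[:self.batch], self.memory[self.batch:]
        fd, path = tempfile.mkstemp(dir=self.spill_dir, suffix=".pkl")
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(batch, fh)
        self.spill_files.append(path)

    def _refill(self):
        # Load the most recent batch back underneath the in-memory entries.
        path = self.spill_files.pop()
        with open(path, "rb") as fh:
            batch = pickle.load(fh)
        os.remove(path)
        self.memory = batch + self.memory
```

Because the oldest entries are the ones spilled and they are loaded back underneath the newer ones, the pop order stays strictly last-in, first-out, which is what a depth-first crawl needs.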

There was still a small leak, though; somehow memory usage kept creeping up by about 1MB every five minutes. After much troubleshooting, I tracked the problem down to the WWW::Mechanize module. Replacing this with LWP and HTML::TokeParser seems to have resolved the problem, plus set up a clean framework for indexing movie galleries. Mechanize had a really easy interface for finding image links, but no easy way to extract movies. By writing the extraction code myself using HTML::TokeParser, I can finally get both. Expect to see porn movie indexing coming in the following weeks! ;)
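To give a feel for what hand-written link extraction looks like, here is a small Python sketch that uses the standard library’s html.parser as a stand-in for HTML::TokeParser. It is not Caroline’s code: the extension lists, the class name GalleryLinkParser, and the rule of classifying links purely by file extension are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed extension lists; a real spider would likely need more care.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif")
MOVIE_EXTS = (".mpg", ".mpeg", ".wmv", ".avi", ".mov")


class GalleryLinkParser(HTMLParser):
    """Collect <a href> targets that look like full-size images or movies."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.movies = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)        # resolve relative links
        path = url.split("?", 1)[0].lower()       # ignore any query string
        if path.endswith(IMAGE_EXTS):
            self.images.append(url)
        elif path.endswith(MOVIE_EXTS):
            self.movies.append(url)


def extract_links(html, base_url):
    """Return (image_urls, movie_urls) found in the given HTML."""
    parser = GalleryLinkParser(base_url)
    parser.feed(html)
    return parser.images, parser.movies
```

Since the tokens are walked by hand, adding another link type is just another branch in handle_starttag.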

Major Search Query Changes

March 24th, 2007 by Aidan

W00t! Finally finished the SQL optimizations I started working on a couple of days ago. Searching single terms is now twice as fast as it was; coupled with the previous change, this equates to a four-fold speed increase over the past week! EveKnows.com is now serving up the hot, naked chicks faster than ever.

While I was tweaking the code, I also improved the spider to be more stringent regarding “shifty” galleries. You know the type… auto-bookmarking scripts, pop-up ads, and other nastiness. If Caroline detects anything potentially malicious, the gallery won’t be indexed. I like clean porn galleries, and I’m sure you’ll all appreciate the fact that every gallery on EveKnows is clean and safe to visit, even if you’re not using a secure web browser.

Oh yeah, almost forgot! I noticed a number of galleries creeping into the database which were missing thumbnails. After some debugging, I found the problem in Caroline’s thumbnailing code and fixed it up. Freshly-indexed galleries should all have a thumbnail from now on.

Free Porn Search Modifiers

March 24th, 2007 by Aidan

I added a new modifier to the searching algorithm for EveKnows.com: site:
The site: modifier allows you to restrict your porn search to a particular website. For instance, searching Ariel Rebel site:myarielgalleries.com will only return Ariel Rebel galleries hosted on http://myarielgalleries.com. It’s a nifty little tool for finding similar porn once you’ve found a gallery you like.
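For the curious, splitting modifiers out of a query string can be sketched in a few lines of Python. This is only an illustration, not the EveKnows parser: the function name parse_query and the returned structure are made up for the example, and it recognizes the names site and src without assuming anything about how the search backend applies them.

```python
KNOWN_MODIFIERS = ("site", "src")


def parse_query(query):
    """Split a search string into plain terms and name:value modifiers."""
    terms, modifiers = [], {}
    for token in query.split():
        name, sep, value = token.partition(":")
        if sep and value and name.lower() in KNOWN_MODIFIERS:
            modifiers[name.lower()] = value.lower()
        else:
            terms.append(token)   # anything else is an ordinary search term
    return terms, modifiers
```

With this sketch, parse_query("Ariel Rebel site:myarielgalleries.com") yields the terms Ariel and Rebel plus a site modifier set to myarielgalleries.com.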

I also fixed a bug that was preventing the src: modifier from functioning properly. The TGP search box should now be working again, so feel free to add it to your TGPs, blogs, linked lists, whatever!

The Curse of Disk Access

March 22nd, 2007 by Aidan

As the EveKnows.com database grows ever larger, search times have been going up. Yesterday I tried some more SQL optimizations to alleviate this, and cut most search times in half. The problem seems to be disk access with large result sets, especially broad, single-word queries such as teen, babe, sex, etc. Finding the initial set of matching galleries isn’t much of a problem since the query uses fast indexing, but we then need to pull all of the details from, say, 30,000 galleries, which is a problem. Some testing with the UNIX utilities sar and iostat reveals an incredible number of disk reads for these broad queries, and it’s only going to get worse as the database grows larger.

Until a few moments ago I wasn’t quite sure how to handle this, but I think I’ve just hit on a solution. Currently EveKnows makes two queries: one without a LIMIT clause to figure out the total number of results, and then a second one with a LIMIT clause to pull, say, the first 30 results (or 31-60 if the user is on the second page, etc.). I’m wondering whether, if I adjust the first query to not pull the extra gallery data, it might be able to work using only indexes. If that turns out to be true, our second, LIMITed query would only need to fetch full results for 30 rows at a time, which could be a substantial speed-up for popular search terms. Hrmm… I’ll try to implement this tonight and let you know the results! ;)
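The two-query idea can be sketched with Python’s built-in sqlite3 module. This is an illustration under stated assumptions, not the EveKnows code: the real site runs on MySQL, and the galleries and keywords tables, the index name, and the helper names below are all hypothetical.

```python
import sqlite3


def make_demo_db(n_galleries=75, word="teen"):
    """Build a tiny in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE galleries (id INTEGER PRIMARY KEY, url TEXT, title TEXT);
        CREATE TABLE keywords (word TEXT, gallery_id INTEGER);
        CREATE INDEX idx_keywords_word ON keywords (word, gallery_id);
        """
    )
    for i in range(1, n_galleries + 1):
        conn.execute("INSERT INTO galleries VALUES (?, ?, ?)",
                     (i, f"http://example.com/g/{i}", f"Gallery {i}"))
        conn.execute("INSERT INTO keywords VALUES (?, ?)", (word, i))
    conn.commit()
    return conn


def search(conn, word, page=1, per_page=30):
    """Return (total_matches, rows_for_this_page)."""
    # Query 1: count only. It selects no gallery columns, so the database
    # can typically answer it from the keyword index alone.
    total = conn.execute(
        "SELECT COUNT(*) FROM keywords WHERE word = ?", (word,)
    ).fetchone()[0]
    # Query 2: fetch full gallery rows, but only for the requested page.
    rows = conn.execute(
        "SELECT g.id, g.url, g.title "
        "FROM keywords k JOIN galleries g ON g.id = k.gallery_id "
        "WHERE k.word = ? ORDER BY g.id LIMIT ? OFFSET ?",
        (word, per_page, (page - 1) * per_page),
    ).fetchall()
    return total, rows
```

The point of the split is that the expensive part, reading full gallery rows from disk, is bounded by the page size rather than by the size of the result set.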

Let Surfers Search Your Porn Galleries!

March 19th, 2007 by Aidan

I’ve added a new site search feature to EveKnows.com. Now TGP owners can include a search box on their site and have the search results only include galleries their TGP links to. It’s a simple way to let your surfers search through your gallery database!

The service is completely free: just go to http://eveknows.com/about.html to get the search box HTML, copy it to your site, and confirm with me that your site is being indexed. If it isn’t already in the EveKnows database, I’ll add it to the queue. This is a great way to get repeat traffic for your free porn galleries!

Search Result Weighting

March 18th, 2007 by Aidan

I just made an update to the search result weighting algorithm. To keep the newest galleries on top, sites which were first indexed within the past three days are now given some extra weight. I’m a little worried that the extra processing will slow down the searching algorithm too much, but we’ll try this out for a while and see how it works. If nothing else, it’ll keep new galleries at the top of the search results.

I also made some adjustments to penalize keyword spammers. After a word shows up in a gallery too many times, a negative value is applied to the gallery which will lower it in the search rankings. The more the same word appears, the lower the gallery will sink.
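Here is one way the two adjustments could look in code, as a Python sketch. Only the three-day window comes from the post; the bonus size, the repetition threshold, the per-occurrence penalty, and the function name adjust_score are invented for the example and are not the values EveKnows uses.

```python
from collections import Counter
from datetime import datetime, timedelta

NEW_GALLERY_WINDOW = timedelta(days=3)  # from the post
NEW_GALLERY_BONUS = 5.0                 # hypothetical value
SPAM_THRESHOLD = 10                     # hypothetical: repeats allowed per word
SPAM_PENALTY = 0.5                      # hypothetical: cost per extra repeat


def adjust_score(base_score, first_indexed, words, now):
    """Apply a freshness bonus and a keyword-repetition penalty to a score."""
    score = base_score
    if now - first_indexed <= NEW_GALLERY_WINDOW:
        score += NEW_GALLERY_BONUS
    for word, count in Counter(words).items():
        if count > SPAM_THRESHOLD:
            # The more often a word repeats past the threshold, the
            # further the gallery sinks.
            score -= SPAM_PENALTY * (count - SPAM_THRESHOLD)
    return score
```

A linear penalty is the simplest choice that makes the ranking drop further with every extra repeat; a steeper curve would work just as well.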

As always, check out the advances at the little porn search engine that could :)

Disk Optimizations

March 18th, 2007 by Aidan

So with Caroline running full-bore, I noticed a frightening thing: the EveKnows.com server was bottlenecked by disk IO. Between the spider downloading thumbnails and inserting galleries, the SQL server fetching them, and the Apache server handing out web pages, my machine was slowing to a crawl. Since the search database is far too large to fit into memory, each search ends up hitting the disk several times. Couple that with a constant stream of SQL INSERT statements and you get a MySQL daemon that’s constantly waiting for the hard disk to give it what it needs, and an idle CPU waiting for something to do.

After some researching, I discovered a few tricks to relieve the congestion. First, I used hdparm to enable multiple-sector I/O (16 sectors per transfer) and write-back caching on my disk. This alone made a huge improvement. Then, I tuned Caroline to make far fewer disk hits; it had been logging data, but I don’t really need the log so I turned that off, and tweaked the way that it opens the stop_list file to do it once per thread rather than once per gallery (which was something I should have done from the beginning, but hey, this is the first search engine I’ve ever written!). Finally, I set up two databases on the SQL server, one to use for the search engine and one for the spider to insert new galleries into. They are identical, save that the spider’s database doesn’t have any indexes, which creates far fewer disk writes when adding galleries. Then I set up a cron job to, once per day, replace the search database with the recently updated version from Caroline, add in the indexes needed to keep the searches efficient, and then go back to spidering.
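The two-database arrangement can be sketched in Python with sqlite3 files standing in for the MySQL databases. Everything here is illustrative: the one-table schema, the index name, the function names spider_insert and publish, and the file-swap mechanism are assumptions, and the real cron job works against MySQL.

```python
import os
import sqlite3
import tempfile

SCHEMA = ("CREATE TABLE IF NOT EXISTS galleries "
          "(id INTEGER PRIMARY KEY, url TEXT, title TEXT)")


def spider_insert(staging_path, rows):
    """Spider side: append (url, title) rows to the index-free staging DB."""
    conn = sqlite3.connect(staging_path)
    with conn:  # commits on success
        conn.execute(SCHEMA)
        conn.executemany(
            "INSERT INTO galleries (url, title) VALUES (?, ?)", rows)
    conn.close()


def publish(staging_path, live_path):
    """Daily job: copy staging, index the copy, then swap it into place."""
    tmp_path = live_path + ".new"
    if os.path.exists(tmp_path):
        os.remove(tmp_path)          # leftover from an interrupted run
    src = sqlite3.connect(staging_path)
    dst = sqlite3.connect(tmp_path)
    src.backup(dst)                  # full copy of the staging database
    src.close()
    with dst:
        # Indexes are built once on the copy, never on the staging DB.
        dst.execute("CREATE INDEX idx_galleries_title ON galleries (title)")
    dst.close()
    os.replace(tmp_path, live_path)  # swap the indexed copy in
```

Building the index once per day on a copy trades fresher search results for much cheaper inserts, which is the same trade-off described above.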

Thus far, the plan seems to be working well. ‘top’ no longer shows a WA% of 90, and searches are completing in less than one second again while Caroline is running. Woohoo!

WWW::Mechanize Memory Management

March 16th, 2007 by Aidan

With the new threaded model for Caroline, I started to notice memory usage getting out of control. Under the previous forking system, I could expect a typical run to go through 200 MB, but the new model was easily topping 1GB and then dying when the system ran out of physical memory. After some searching, I learned that Perl’s WWW::Mechanize module caches each page in memory for the lifetime of the object (this allows for the ‘back’ feature of Mechanize); cool if you’re trying to mimic a web browser, but totally unusable for a web robot like Caroline. Thankfully there is a stack_depth() method for WWW::Mechanize which controls the number of pages cached. By setting this to 2 I managed to get Caroline’s memory usage back under control. She’s off happily indexing more galleries as I type :)
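The effect of capping the page history can be shown with a loose Python analogy. This is not how WWW::Mechanize is implemented, and the Browser class below is hypothetical; it only demonstrates why a bounded history keeps memory flat no matter how many pages are visited.

```python
from collections import deque


class Browser:
    """Toy browser that remembers only the last `stack_depth` pages."""

    def __init__(self, stack_depth=2):
        # A deque with maxlen silently drops the oldest page when full.
        self.history = deque(maxlen=stack_depth)

    def visit(self, url, body):
        self.history.append((url, body))

    def back(self):
        """Drop the current page and return the previous one, if any."""
        if len(self.history) > 1:
            self.history.pop()
        return self.history[-1] if self.history else None
```

With an unbounded history, every page body ever fetched stays reachable, which is fine for a short browsing session but not for a robot that fetches pages for hours.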

Site Redesign

March 15th, 2007 by Aidan

Today I took a break from tweaking the engine database and gave the interface an overhaul. I know search engines these days are mostly simple affairs, but I figured a decent design couldn’t hurt :p

It’s all valid XHTML + CSS, and should degrade cleanly in older browsers. Let me know if anyone has trouble viewing the site in their browser of choice.