Anatomy of a Search Engine | Development of the EveKnows.com adult search engine


May/07

7

EveKnows 0.7

Development of the EveKnows.com porn search engine is progressing swiftly! Today sees the addition of sorting search results by either date or relevancy. Like most search engines, EveKnows assigns a score to each word on a webpage based on the number of times it occurs, the word’s location on the page, and other factors. During a search, these scores are compared and the galleries with the highest scores for the query’s words are displayed at the top of the search results. This is known as sorting by relevancy; theoretically, the most relevant results will be displayed first.
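For the curious, here is a rough Python sketch of the scoring idea. It is not the actual EveKnows code: the weights, the field names, and the sample data are all assumptions made up for illustration, and the real scoring takes "other factors" into account that aren't modeled here.

```python
from collections import Counter
from datetime import date

# Hypothetical weights: a word in the title counts more than one in the body.
TITLE_WEIGHT = 3.0
BODY_WEIGHT = 1.0


def word_scores(title, body):
    """Score each word by how often it occurs, weighted by where it occurs."""
    scores = Counter()
    for word in title.lower().split():
        scores[word] += TITLE_WEIGHT
    for word in body.lower().split():
        scores[word] += BODY_WEIGHT
    return scores


def search(galleries, query, newest_first=False):
    """Return matching galleries sorted by relevancy (default) or by date."""
    terms = query.lower().split()
    matches = []
    for gallery in galleries:
        scores = word_scores(gallery["title"], gallery["body"])
        total = sum(scores[term] for term in terms)
        if total > 0:
            matches.append((total, gallery))
    if newest_first:
        # Sort on the date the gallery was added, newest first.
        matches.sort(key=lambda match: match[1]["added"], reverse=True)
    else:
        # Sort on the summed word scores, highest first.
        matches.sort(key=lambda match: match[0], reverse=True)
    return [gallery for _, gallery in matches]


# Made-up sample data, purely for illustration.
galleries = [
    {"title": "beach photos", "body": "beach sunset beach", "added": date(2007, 5, 1)},
    {"title": "city photos", "body": "a day at the beach", "added": date(2007, 5, 6)},
]
```

With this sample data, a relevancy search for "beach" puts the first gallery on top, since the word appears in its title and twice in its body, while passing newest_first=True puts the more recently added second gallery on top instead.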

Sometimes, however, people are interested in recent galleries. To that end, I’ve added the ability to sort the search results with the newest galleries first. To make use of this, click the ‘Show newest galleries first’ link above the search results. If you want to get back to the previous view, click the ‘Show most relevant galleries first’ link. When sorted by date, the quality of the search results may be significantly lower (especially with broad search terms, like ‘naked teens’, which return thousands of results). The default sorting will still be based on relevancy, but the recent sort option is now available for those who are interested.


Today I updated EveKnows.com to support searching for video galleries, photo galleries, or both. This feature exists in most other porn search engines and helps bring EveKnows up to feature-parity with them. To make use of this new ability, check the desired options (either ‘Photo galleries’ or ‘Video galleries’) beneath the search box, then click ‘Ask Eve’ as usual. By default, EveKnows will search for both photo and video porn galleries.


Today I hacked a new feature into EveKnows.com: the ability to see where our galleries were found. Search results now include a line of the form “X Photos from thehun.net” or “Y videos from cutegirlsdaily.com”. This lets everyone know which site we used to find the gallery, and provides searchers with an easy link to the source for more porn they may enjoy. The links open in a new window, so don’t worry about following one and losing your search results.

I’ve also been tweaking the Suggestion Dictionary. Suggestions are more accurate than ever and are now offered for searches that return only a few results. For example, searching for ‘lezbian’ currently returns 8 results, plus a link saying, “Perhaps you meant ‘lesbian’?”. The suggestion dictionary is built from our own database of search results, so it will always suggest the words that appear most frequently in porn galleries.
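A minimal Python sketch of how a frequency-based suggestion dictionary like this could work. To be clear, this is a guess at the general technique rather than the real implementation: the threshold of 10 results, the similarity cutoff, and the use of Python's difflib for fuzzy matching are all assumptions.

```python
import difflib
from collections import Counter

SUGGEST_BELOW = 10  # hypothetical cut-off for what counts as a "small" result set


def build_dictionary(indexed_texts):
    """Count how often each word appears across the indexed text."""
    counts = Counter()
    for text in indexed_texts:
        counts.update(text.lower().split())
    return counts


def suggest(term, counts, result_count):
    """Suggest a similarly spelled but more common word for a sparse search."""
    if result_count >= SUGGEST_BELOW:
        return None  # plenty of results, no suggestion needed
    term = term.lower()
    # Find dictionary words whose spelling is close to the search term.
    candidates = difflib.get_close_matches(term, list(counts), n=5, cutoff=0.7)
    # Only suggest words that occur more often than the term itself.
    candidates = [word for word in candidates if counts[word] > counts[term]]
    if not candidates:
        return None
    return max(candidates, key=lambda word: counts[word])
```

Because the dictionary is just a word count over the indexed data, the suggestion leans toward whatever spelling is most common in the index rather than toward a general-purpose English word list.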


Apr/07

15

Caroline 0.90

Today I finished work on the first beta release for the porn spider Caroline 1.0. This release adds support for video galleries, improves detection of shifty HTML or JavaScript in galleries (you know, the stuff that causes pop-ups, auto-bookmarkers, and the like), and introduces a much better filing system for image thumbnails. The addition of video galleries means that I must once again scrap the EveKnows.com database and begin from scratch, but I think the results will be worth the wait! I’m really excited about the new JavaScript detection as well; Caroline will be able to index pages with harmless JavaScript, like image rollovers, but will continue to block pages with pop-up ads or other intrusive script behaviors. I’m going to wait until we have around 250,000 galleries, then update EveKnows.com with the new database.


Apr/07

14

Database Updates

Well, Caroline is off gathering as much free porn as she can find. I made the database adjustments I mentioned in my previous post, so the next EveKnows.com update won’t be for a few days. Once it’s ready, however, search speeds should be drastically faster. Right now, the first query for a particular search term takes between 3 and 10 seconds, though once it’s cached, the results are generated in about 0.01 seconds. This update should get everything down to the 1-second-or-less range.


Apr/07

12

Caroline 0.8

The latest Caroline Spider update fixes a number of memory problems and some more thumbnailing issues and, most importantly, includes significant speed improvements. The spider is now indexing porn galleries two or three times faster than before, and using much less RAM to boot! She’s out scuttling around the Internets at this moment, cataloging all the best boobies and rolling them into the EveKnows.com database.

After the spider was well under control, I turned back to looking at optimization methods for EveKnows itself. First, I noticed that the server’s hard drive was performing abysmally. A quick ‘hdparm -m16 /dev/hda’ had things back to where they were before. Apparently Debian Etch’s config files for hdparm don’t work as expected, so the settings are lost after a reboot. I still need to look into why this is happening, but I can’t take the server down for a reboot very often, so it’s going to have to wait. I also found a way to significantly speed up my queries, but it would require a database schema update, which in turn would necessitate scrapping the current porn gallery database and starting from scratch. I think I’m going to do this (the changes result in an order-of-magnitude speed increase), but it may take some time. Don’t be worried if you don’t see new galleries appearing in the search results over the next few days; it just means that we’re gearing up for a faster, more thorough search engine.

Update: For what it’s worth, I figured out that the hdparm problem was due to a missing device block in /etc/hdparm.conf. In my case, all I needed was:

/dev/hda {
        mult_sect_io = 16
        write_cache = on
        keep_settings_over_reset = on
}


Apr/07

1

Caroline 0.7

I’ve been hacking away on Caroline this past week and finally have the latest version ready. After watching memory usage skyrocket from 100MB to 1GB over a typical 12-hour run, I started searching for memory leaks. The first thing I did was to nix the old system of storing all URLs in memory until they were spidered; Caroline essentially performs a depth-first search through TGPs, going deeper and deeper into the free porn sites until it hits a specified depth (usually 3 or 4 links), then starts working its way back up. This resulted in 100,000+ URLs being in memory at any one time, plus each URL’s incoming link text and current depth. I changed the code so that 1,000 links are kept in memory, and the rest are dumped to disk. More links are fetched from disk if the memory cache dips below 100. This made a decent improvement.
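Here is a rough sketch of that memory-capped link queue. Caroline itself is written in Perl, so this Python version is only an illustration of the idea; the class name, the one-JSON-object-per-line spill file, and the refill strategy are assumptions, not the spider's real code.

```python
import json

MEMORY_LIMIT = 1000  # links held in memory, as described above
REFILL_BELOW = 100   # pull more links from disk when the cache dips below this


class UrlFrontier:
    """Depth-first link stack that spills overflow to a file on disk."""

    def __init__(self, spill_path, memory_limit=MEMORY_LIMIT, refill_below=REFILL_BELOW):
        self.spill_path = spill_path
        self.memory_limit = memory_limit
        self.refill_below = refill_below
        self.memory = []
        with open(spill_path, "w", encoding="utf-8"):
            pass  # start with an empty spill file

    def push(self, url, link_text, depth):
        """Queue a link, keeping it in memory only while there is room."""
        entry = {"url": url, "text": link_text, "depth": depth}
        if len(self.memory) < self.memory_limit:
            self.memory.append(entry)
        else:
            with open(self.spill_path, "a", encoding="utf-8") as spill:
                spill.write(json.dumps(entry) + "\n")

    def pop(self):
        """Return the most recently queued in-memory link, or None when done."""
        if len(self.memory) < self.refill_below:
            self._refill()
        return self.memory.pop() if self.memory else None

    def _refill(self):
        """Move links from the spill file into memory, up to the memory limit."""
        with open(self.spill_path, encoding="utf-8") as spill:
            lines = spill.readlines()
        if not lines:
            return
        room = self.memory_limit - len(self.memory)
        take, keep = lines[:room], lines[room:]
        # Loaded links go underneath the ones already in memory, so the
        # most recently queued in-memory links are still popped first.
        self.memory = [json.loads(line) for line in take] + self.memory
        with open(self.spill_path, "w", encoding="utf-8") as spill:
            spill.writelines(keep)
```

Rewriting the whole spill file on every refill is the simplest thing that works for a sketch; a spider with a very large backlog would probably want to track a read offset instead.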

There was still a small leak, though: somehow memory usage kept creeping up by about 1MB every five minutes. After much troubleshooting, I tracked the problem down to the WWW::Mechanize module. Replacing this with LWP and HTML::TokeParser seems to have resolved the problem, and it also set up a clean framework for indexing movie galleries. Mechanize had a really easy interface for finding image links, but no easy way to extract movies. By writing the extraction code myself using HTML::TokeParser, I can finally get both. Expect to see porn movie indexing coming in the following weeks! ;)


Mar/07

24

Major Search Query Changes

W00t! Finally finished the SQL optimizations I started working on a couple of days ago. Searching single terms is now twice as fast as it was; coupled with the previous change, this equates to a four-fold speed increase over the past week! EveKnows.com is now serving up the hot, naked chicks faster than ever.

While I was tweaking the code, I also improved the spider to be more stringent regarding “shifty” galleries. You know the type… auto-bookmarking scripts, pop-up ads, and other nastiness. If Caroline detects anything potentially malicious, the gallery won’t be indexed. I like clean porn galleries, and I’m sure you’ll all appreciate the fact that every gallery on EveKnows is clean and safe to visit, even if you’re not using a secure web browser.

Oh yeah, almost forgot! I noticed a number of galleries creeping into the database that were missing thumbnails. After some debugging, I found the problem in Caroline’s thumbnailing code and fixed it up. Freshly indexed galleries should all have a thumbnail from now on.


I added a new modifier to the searching algorithm for EveKnows.com: site:

The site: modifier allows you to restrict your porn search to a particular website. For instance, searching Ariel Rebel site:myarielgalleries.com will only return Ariel Rebel galleries hosted on http://myarielgalleries.com. It’s a nifty little tool for finding similar porn once you’ve found a gallery you like.
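For anyone wondering how a modifier like this can be handled, here is a small Python sketch. It is an assumption about the general approach, not the real EveKnows query parser: the function names are hypothetical, and treating subdomains such as www. as part of the same site is my own guess at sensible behavior.

```python
from urllib.parse import urlparse


def parse_query(query):
    """Split a raw query into plain search terms and an optional site: host."""
    terms = []
    site = None
    for token in query.split():
        if token.lower().startswith("site:") and len(token) > len("site:"):
            site = token[len("site:"):].lower()
        else:
            terms.append(token)
    return terms, site


def matches_site(gallery_url, site):
    """True if the gallery is hosted on the requested site (or none was given)."""
    if site is None:
        return True
    host = urlparse(gallery_url).hostname or ""
    # Assumption: subdomains such as www.example.com count as example.com.
    return host == site or host.endswith("." + site)
```

The plain terms go through the normal word search, and the host check is applied as an extra filter on the matching galleries.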

I also fixed a bug that was preventing the src: modifier from functioning properly. The TGP search box should now be working again, so feel free to add it to your TGPs, blogs, linked lists, whatever!


Mar/07

22

The Curse of Disk Access

As the EveKnows.com database grows ever larger, search times have been going up. Yesterday I tried some more SQL optimizations to alleviate this, and cut most search times in half. The problem seems to be disk access with large result sets, especially for broad, single-word queries such as teen, babe, sex, etc. Finding the initial set of matching galleries isn’t much of a problem since the query uses fast indexing, but we then need to pull all of the details from, say, 30,000 galleries, which is a problem. Some testing with the UNIX utilities sar and iostat reveals an incredible number of disk reads for these broad queries, and it’s only going to get worse as the database grows larger.

Until a few moments ago I wasn’t quite sure how to handle this, but I think I’ve just hit on a solution. Currently EveKnows makes two queries: one without a LIMIT clause to figure out the total number of results, and a second one with LIMIT to pull, say, the first 30 results (or 31-60 if the user is on the second page, etc.). I’m wondering whether, if I adjust the first query so that it doesn’t pull the extra gallery data, it might be able to work using only indexes. If that turns out to be true, our second, LIMITed query would only need to fetch full results for 30 rows at a time, which could be a substantial speed-up for popular search terms. Hrmm… I’ll try to implement this tonight and let you know the results! ;)
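To make the idea concrete, here is a sketch of the two-query pattern using Python's built-in sqlite3 module. The schema is entirely hypothetical (the table names, column names, and index are invented for the example, and the real database and its schema are not shown anywhere in this post), so treat it as an illustration of the technique rather than the actual queries.

```python
import sqlite3

PAGE_SIZE = 30  # results per page, as in the example above


def make_db():
    """Create an in-memory database with a made-up two-table schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE galleries (id INTEGER PRIMARY KEY, title TEXT, url TEXT);
        CREATE TABLE word_index (word TEXT, gallery_id INTEGER, score REAL);
        CREATE INDEX idx_word_score ON word_index (word, score, gallery_id);
    """)
    return db


def count_results(db, word):
    """Query 1: total hits. It never touches the galleries table, so in
    principle the database can answer it from the index alone."""
    row = db.execute(
        "SELECT COUNT(*) FROM word_index WHERE word = ?", (word,)
    ).fetchone()
    return row[0]


def fetch_page(db, word, page):
    """Query 2: full gallery details, but only for one page of results."""
    offset = (page - 1) * PAGE_SIZE
    return db.execute(
        """SELECT g.id, g.title, g.url
           FROM word_index AS w
           JOIN galleries AS g ON g.id = w.gallery_id
           WHERE w.word = ?
           ORDER BY w.score DESC, g.id
           LIMIT ? OFFSET ?""",
        (word, PAGE_SIZE, offset),
    ).fetchall()
```

The point of the split is that the expensive part, reading full gallery rows off disk, happens for at most 30 rows per request no matter how many galleries match the search term.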

