It’s been a while since my last post. I wanted to let everyone know that development of EveKnows.com is progressing at a break-neck pace–this past month has seen some tremendous improvements behind the scenes. Searches are running faster and with more accurate results than ever before, and I believe the site is finally ready to handle an increased load. The new engine will remain on the staging server for another week or two while I rebuild the search database. Stay tuned for some major changes coming in June…
After yesterday’s post about slow negative search terms and MySQL’s disregard for the EXCEPT operator, I came upon a decent solution for EveKnows.com’s problem. With some (slightly) clever use of LEFT JOINs, I was able to cut the running time of queries with a single negated term in half, and that run time drops by an order of magnitude for queries involving multiple negated terms. W00t! The trick was to build a temporary table of gallery IDs which contain the negated terms, then take the LEFT JOIN of galleries matching the desired terms with the temporary table. This gives us a resulting table with two columns, matches.gallery_id and neg_matches.gallery_id; any rows with a non-NULL value for neg_matches.gallery_id are then dropped, resulting in the proper set of matches. A fairly simple solution; I feel pretty dumb for not seeing it earlier.
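For anyone who wants to try the idea, here is a minimal sketch of the temporary-table-plus-LEFT-JOIN approach. It is not the production code: it uses Python's built-in sqlite3 module in place of MySQL, the ReverseIndex table is cut down to two columns, and the words and gallery IDs are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ReverseIndex (ri_location INTEGER, ri_word TEXT);
    INSERT INTO ReverseIndex VALUES
        (1, 'beach'), (1, 'sunset'),
        (2, 'beach'),
        (3, 'beach'), (3, 'sunset'),
        (4, 'sunset');

    -- Step 1: temporary table of gallery IDs that contain a negated term.
    CREATE TEMPORARY TABLE neg_matches AS
        SELECT DISTINCT ri_location AS gallery_id
        FROM ReverseIndex
        WHERE ri_word IN ('sunset');
""")

# Step 2: LEFT JOIN the wanted galleries against neg_matches and keep only
# the rows where no negated match was found (neg_matches.gallery_id IS NULL).
rows = conn.execute("""
    SELECT matches.gallery_id
    FROM (SELECT DISTINCT ri_location AS gallery_id
          FROM ReverseIndex
          WHERE ri_word IN ('beach')) AS matches
    LEFT JOIN neg_matches
           ON matches.gallery_id = neg_matches.gallery_id
    WHERE neg_matches.gallery_id IS NULL
    ORDER BY matches.gallery_id
""").fetchall()

gallery_ids = [row[0] for row in rows]
print(gallery_ids)
```

The query for 'beach -sunset' keeps only the galleries that never show up in the temporary table. The negated lookup runs once up front instead of once per candidate row, which is presumably where the speedup comes from.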
While I was working on this, I noticed that the existing src: and site: query modifiers were not functioning properly. Due to the new SQL database schema, a quick fix isn’t possible. I’ve dropped these modifiers for the time being, but intend to support them both at some point in the future.
Today I discovered a wonderful new SQL operator: EXCEPT. This neat operator allows one to join two tables, with the result being all of the rows in table1 which are not in table2. One of the slowest operations for EveKnows is handling queries with negated terms (such as ‘teens -blonde’ to search for non-blonde teen porn); this is because the SQL code includes a NOT EXISTS SQL subquery which gets run for each and every result, verifying that the galleries containing the word ‘teen’ do not also contain the word ‘blonde’. The exact code looks something like this, assuming ri1 is a table full of galleries matching the word ‘teen’:
NOT EXISTS (SELECT * FROM ReverseIndex AS ri2 WHERE ri1.ri_location=ri2.ri_location AND ri2.ri_word IN ('blonde'))
That subquery is fast, but if the search matches lots of rows (say, for example, 120,000), then its execution time begins to climb upward. Right now, popular searches with negated words take 3-5 seconds to process, compared to the 0.5-second average of other searches. Obviously, something needs to be done.
I thought I found the answer in the EXCEPT operator. It seems like it would be perfect; we take the IDs of rows matching our search terms in temp table1 and the IDs of rows to be negated in temp table2, then take table1 EXCEPT table2 and use the resulting list of IDs as the galleries to fetch. It turns out, however, that MySQL doesn’t support EXCEPT. The oft-suggested method for getting around this is doing an exclusion self-join, but I challenge anyone to self-join a 10-million-row table–it’s simply not practical.
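To show what EXCEPT would have bought us, here is a small sketch that runs both forms side by side. It uses SQLite (through Python's sqlite3 module), which does support EXCEPT; the two-column ReverseIndex table and its contents are invented for the example, and nothing here says anything about how either query performs on a real 10-million-row index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ReverseIndex (ri_location INTEGER, ri_word TEXT);
    INSERT INTO ReverseIndex VALUES
        (1, 'beach'), (1, 'sunset'),
        (2, 'beach'),
        (3, 'beach'), (3, 'sunset'),
        (4, 'sunset');
""")

# Set difference: IDs matching the wanted word, minus IDs matching the
# negated word.
except_ids = [row[0] for row in conn.execute("""
    SELECT ri_location FROM ReverseIndex WHERE ri_word IN ('beach')
    EXCEPT
    SELECT ri_location FROM ReverseIndex WHERE ri_word IN ('sunset')
    ORDER BY ri_location
""")]

# The correlated NOT EXISTS form: the subquery is evaluated for every
# candidate row in ri1.
not_exists_ids = [row[0] for row in conn.execute("""
    SELECT DISTINCT ri1.ri_location
    FROM ReverseIndex AS ri1
    WHERE ri1.ri_word IN ('beach')
      AND NOT EXISTS (SELECT * FROM ReverseIndex AS ri2
                      WHERE ri1.ri_location = ri2.ri_location
                        AND ri2.ri_word IN ('sunset'))
    ORDER BY ri1.ri_location
""")]

print(except_ids, not_exists_ids)
```

Both queries return the same gallery IDs; the difference is only in how the database gets there.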
So, it looks like I’m back to square one: still searching for a fast way to handle negated searches without migrating to a different RDBMS.
Development of the EveKnows.com porn search engine is progressing swiftly! Today sees the addition of sorting search results based on either date or relevancy. Like most search engines, EveKnows assigns a score for each word on a webpage based on the number of times it occurs, the word’s location on a page, and other factors. During a search, these scores are compared and the galleries with the highest scores for the query’s words get displayed at the top of the search results. This is known as sorting by relevancy; theoretically, the most relevant results will be displayed first.
Sometimes, however, people are interested in recent galleries. To that end, I’ve added the ability to sort the search results with the newest galleries first. To make use of this, click the ‘Show newest galleries first’ link above the search results. If you want to get back to the previous view, click the ‘Show most relevant galleries first’ link. When sorted by date, the quality of the search results may be significantly lower (especially with broad search terms, like ‘naked teens’, which return thousands of results). The default sorting will still be based on relevancy, but the recent sort option is now available for those who are interested.
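Here is a rough sketch of how the two sort orders differ. The real scoring formula has more factors than this; the location weights, the Gallery class, and the sample data below are all hypothetical and only meant to illustrate "occurrence count times location weight" versus "newest first".

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical weights: a word in the title counts more than one in the body.
LOCATION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


@dataclass
class Gallery:
    gallery_id: int
    added: date
    occurrences: dict  # maps (word, location) to an occurrence count


def score(gallery, query_words):
    """Sum weighted occurrence counts for the words in the query."""
    total = 0.0
    for (word, location), count in gallery.occurrences.items():
        if word in query_words:
            total += count * LOCATION_WEIGHTS.get(location, 1.0)
    return total


def sort_results(galleries, query_words, newest_first=False):
    """Return matching galleries, by score (default) or by date added."""
    matches = [g for g in galleries if score(g, query_words) > 0]
    if newest_first:
        return sorted(matches, key=lambda g: g.added, reverse=True)
    return sorted(matches, key=lambda g: score(g, query_words), reverse=True)


# Made-up sample data.
galleries = [
    Gallery(1, date(2007, 5, 1), {("beach", "title"): 1, ("beach", "body"): 4}),
    Gallery(2, date(2007, 5, 20), {("beach", "body"): 1}),
    Gallery(3, date(2007, 5, 10), {("sunset", "title"): 2}),
]

by_relevancy = [g.gallery_id for g in sort_results(galleries, {"beach"})]
by_date = [g.gallery_id for g in sort_results(galleries, {"beach"}, newest_first=True)]
print(by_relevancy, by_date)
```

Gallery 2 barely mentions the search word but is the newest, so it jumps to the top of the date sort. That is exactly why the date view can feel lower in quality for broad searches.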
Today I updated EveKnows.com to support searching for video galleries, photo galleries, or both. This feature exists in most other porn search engines and helps bring EveKnows up to feature-parity with them. To make use of this new ability, check the desired options (either ‘Photo galleries’ or ‘Video galleries’) beneath the search box, then click ‘Ask Eve’ as usual. By default, EveKnows will search for both photo and video porn galleries.
Today I hacked a new feature into EveKnows.com–the ability to see where our galleries were found. Search results now include a line of the form “X Photos from thehun.net” or “Y videos from cutegirlsdaily.com”. This lets everyone know the site which we used to find the gallery, and provides searchers with an easy link to the source for more porn they may enjoy. The links will open in a new window, so don’t worry about following one and losing your search results.
I’ve also been tweaking the Suggestion Dictionary. Suggestions are more accurate than ever and are now offered for small search results. For example, searching for ‘lezbian’ currently returns 8 results, plus a link saying, “Perhaps you meant ‘lesbian’?”. The suggestion dictionary is built from our own database of search results, so it will always suggest the words that appear most frequently in porn galleries.
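A minimal sketch of a frequency-backed suggestion dictionary follows. It is not the actual implementation: it leans on Python's difflib for the fuzzy matching, and the WORD_COUNTS table, the cutoff value, and the function names are assumptions made for the example.

```python
import difflib

# Hypothetical word -> number of galleries containing it.
WORD_COUNTS = {"amateur": 1200, "hot": 3000, "video": 800}


def suggest_word(word, cutoff=0.7):
    """Return the most frequent known word that is close to `word`."""
    if word in WORD_COUNTS:
        return word  # already a known word, nothing to fix
    candidates = difflib.get_close_matches(word, list(WORD_COUNTS), n=5, cutoff=cutoff)
    if not candidates:
        return word  # no plausible correction, leave the word alone
    return max(candidates, key=WORD_COUNTS.get)


def suggest_query(query):
    """Return a corrected query, or None when nothing would change."""
    original = query.lower()
    suggestion = " ".join(suggest_word(w) for w in original.split())
    return suggestion if suggestion != original else None


print(suggest_query("hot amatuere"))
```

Because the candidates come from words that actually appear in indexed galleries, a suggestion should always lead to at least some results.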
This weekend saw a major update to EveKnows.com. The biggest new feature is the ability to search for porn video galleries in addition to photo galleries. Search results note whether the gallery includes pictures or movies, along with the number of either. A future release will likely include the ability to only search photo or video galleries.
On the back-end, some improvements have been made to the search algorithm, allowing faster results with a larger database. Caroline, the porn spider, has been running non-stop for a week and managed to accumulate over 250,000 new, searchable galleries. We’ll be monitoring performance as the database continues to grow–the goal remains to have 1 million current galleries in the engine at any given time.
Another new feature is an integrated dictionary for suggesting popular spellings of search terms which did not yield any results. For example, searching for ‘hot amatuere’ now gives the message “No galleries matched your search terms. Perhaps you meant ‘hot amateur’?”. Let us know how accurate the suggestions are!
We’ve also jumped on the trendy ‘tag cloud’ bandwagon, using a cloud to replace the Popular Searches and Recent Searches lists. Forty or so recent search terms appear in the cloud, with size indicating the popularity of the search. Not sure if this counts as Web 2.0 goodness or not, but it certainly looks cool.
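Sizing the terms in a cloud comes down to mapping a popularity count onto a font-size range. Here is one way to sketch it; the logarithmic scaling, the pixel range, and the sample counts are my own assumptions for illustration rather than a description of what the site does.

```python
import math


def tag_cloud_sizes(search_counts, min_px=12, max_px=32):
    """Map each search term to a font size between min_px and max_px.

    Counts must be at least 1. Log scaling keeps one hugely popular term
    from squashing every other term down to the minimum size.
    """
    if not search_counts:
        return {}
    logs = {term: math.log(count) for term, count in search_counts.items()}
    lowest, highest = min(logs.values()), max(logs.values())
    span = highest - lowest
    sizes = {}
    for term, value in logs.items():
        # If every term is equally popular, put them all mid-range.
        fraction = (value - lowest) / span if span else 0.5
        sizes[term] = round(min_px + fraction * (max_px - min_px))
    return sizes


# Made-up search counts.
sizes = tag_cloud_sizes({"beach": 400, "sunset": 40, "forest": 4})
print(sizes)
```

Each size can then be dropped into an inline font-size style on the term's link.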
Today I finished work on the first beta release for the porn spider Caroline 1.0. This release adds support for video galleries, improves detection of shifty HTML or JavaScript in galleries (you know, the stuff that causes pop-ups, auto-bookmarkers, and the like), and introduces a much better filing system for image thumbnails. The addition of video galleries means that I must once again scrap the EveKnows.com database and begin from scratch, but I think the results will be worth the wait! I’m really excited about the new JavaScript detection as well; Caroline will be able to index pages with harmless JavaScript, like image-rollovers, but will continue to block pages with pop-up ads or other intrusive script behaviors. I’m going to wait until we have around 250,000 galleries, then update EveKnows.com with the new database.
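To give a flavor of the harmless-versus-intrusive distinction, here is a deliberately simplified sketch. Caroline's actual checks are not shown here; the pattern list, labels, and function name are hypothetical, and a handful of regular expressions like this would be easy to evade in practice.

```python
import re

# Hypothetical patterns for script behaviors a spider might treat as intrusive.
INTRUSIVE_PATTERNS = [
    ("pop-up window", re.compile(r"\bwindow\.open\s*\(", re.IGNORECASE)),
    ("auto-bookmarker", re.compile(r"\bAddFavorite\s*\(", re.IGNORECASE)),
    ("exit handler", re.compile(r"\bonunload\s*=", re.IGNORECASE)),
]


def find_intrusive_script(html):
    """Return the labels of all intrusive patterns found in the page source."""
    return [label for label, pattern in INTRUSIVE_PATTERNS if pattern.search(html)]


# An image rollover is plain JavaScript but matches none of the patterns.
rollover_page = '<img src="a.jpg" onmouseover="this.src=\'b.jpg\'">'
# An exit pop-up trips two of them.
popup_page = '<body onunload="window.open(\'http://example.com/\')">'

print(find_intrusive_script(rollover_page))
print(find_intrusive_script(popup_page))
```

A page with an empty result would be indexed as usual, while any hit would be grounds for skipping the gallery.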
Well, Caroline is off gathering as much free porn as she can find. I made the database adjustments I mentioned in my previous post, so the next EveKnows.com update won’t be for a few days. Once it’s ready, however, search speeds should be drastically faster. Right now, the first query for a particular search term takes between 3 and 10 seconds, though once it’s cached, the results are generated in about 0.01 seconds. This update should get everything down to the 1-second-or-less range.
The latest Caroline Spider update fixed a number of memory problems, resolved some more thumbnailing issues, and, most importantly, included significant speed improvements. The spider is now indexing porn galleries two or three times faster than before, and using much less RAM to boot! She’s out scuttling around the Internets at this moment, cataloging all the best boobies and rolling them into the EveKnows.com database.
After the spider was well under control, I turned back to looking at optimization methods for EveKnows itself. First, I noticed that the server’s hard drive was performing abysmally. A quick ‘hdparm -m16 /dev/hda’ had things back to where they were before–apparently Debian Etch’s config files for hdparm don’t work as expected, so the settings are lost after a reboot. Still need to look into why this is happening, but I can’t take the server down for a reboot very often, so it’s going to have to wait. I also found a way to significantly speed up my queries, but it would require a database schema update, which in turn would necessitate scrapping the current porn gallery database and starting from scratch. I think I’m going to do this (the changes result in an order-of-magnitude speed increase), but it may take some time. Don’t be worried if you don’t see new galleries appearing in the search results over the next few days; it just means that we’re gearing up for a faster, more thorough search engine.
Update: For what it’s worth, I figured out that the hdparm problem was due to a missing device block in /etc/hdparm.conf. In my case, all I needed was:
/dev/hda {
mult_sect_io = 16
write_cache = on
keep_settings_over_reset = on
}
