Anatomy of a Search Engine | Development of the EveKnows.com adult search engine

Archive for March 2007

Mar/07

11

Spidering Speed

The initial spider I wrote wasn’t bad; it tended to average about 500 sites/hour at first, and after some tweaks I managed to get it up to 1,000/hour. Indexing 24,000 sites each day seemed pretty good at first, but then I realized that in order to keep results accurate, I’ll need to be re-indexing all of those pages on a fairly regular basis, say once every week or two. Since I’m shooting for 1,000,000 galleries, indexing 24,000/day just isn’t going to cut it!
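To put rough numbers on that, here is a back-of-the-envelope sketch in Python (not the spider’s own code). The figures come from the paragraph above; the function names are made up for illustration.

```python
import math

TARGET_GALLERIES = 1_000_000   # goal stated in the post
CURRENT_PER_DAY = 1_000 * 24   # 1,000/hour, running around the clock


def days_for_full_pass(total, per_day):
    """Days needed to index every gallery once at a given daily rate."""
    return total / per_day


def required_per_day(total, refresh_days):
    """Daily rate needed to revisit every gallery within refresh_days."""
    return math.ceil(total / refresh_days)


print(round(days_for_full_pass(TARGET_GALLERIES, CURRENT_PER_DAY), 1))  # 41.7
print(required_per_day(TARGET_GALLERIES, 14))  # 71429
print(required_per_day(TARGET_GALLERIES, 7))   # 142858
```

So a single spider would need about six weeks per pass, while a one-to-two-week refresh cycle calls for roughly three to six times the current rate.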

My first thought was the Perl threading module: since the spider spends most of its time blocking on network requests, it seemed an ideal fit for multi-threading. While I was looking into this, I realized I was horribly over-thinking the entire problem. Threading isn’t much more than parallel processes with shared memory, but since this problem domain holds a practically endless supply of galleries to index, sharing memory isn’t really an issue. Plus, the spider is in constant contact with a database, so each process knows when a URL was recently indexed. Thus, with only a few minor tweaks, I was able to get several spiders running simultaneously. They’ve hit 30,000 galleries over the past nine hours; not a bad improvement for about an hour’s worth of work :)
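The idea of letting the database coordinate independent spider processes can be sketched roughly as follows. This is a minimal illustration in Python with SQLite, not the actual Perl spider; the `urls` table, its columns, the one-week refresh interval, and the `claim_next_url` helper are all assumptions made for the example.

```python
import sqlite3
import time

REINDEX_AFTER = 7 * 24 * 3600  # assumed refresh interval: one week, in seconds


def make_db():
    """Create a throwaway in-memory queue of URLs (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls ("
        "url TEXT PRIMARY KEY, "
        "last_indexed REAL NOT NULL DEFAULT 0)"
    )
    return conn


def claim_next_url(conn, now=None):
    """Pick one stale URL and stamp it so other spiders will skip it.

    Returns None if nothing is stale, or if another process claimed the
    chosen URL first (the caller can simply try again).
    """
    now = time.time() if now is None else now
    cutoff = now - REINDEX_AFTER
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT url FROM urls WHERE last_indexed < ? "
            "ORDER BY last_indexed, url LIMIT 1",
            (cutoff,),
        ).fetchone()
        if row is None:
            return None
        # Re-check staleness in the UPDATE so a URL is never claimed twice.
        cur = conn.execute(
            "UPDATE urls SET last_indexed = ? "
            "WHERE url = ? AND last_indexed < ?",
            (now, row[0], cutoff),
        )
        return row[0] if cur.rowcount == 1 else None
```

Each spider process would loop on `claim_next_url`, fetch the page, and move on; no memory needs to be shared because the timestamp in the database is the only coordination point.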

No tags Hide

Gah, spent the better part of this beautiful afternoon getting WordPress’s permalinking feature to play nice with Apache 2.2 on Debian Etch. Every time I turned on Permalinking, I’d get a 403 Forbidden error when trying to view the blog, and going back to the Permalinking options page resulted in a /wp-admin/options-permalink.php 403: Forbidden error as well. After some Googling and messing around with my Apache config files, I realized that WordPress was creating a .htaccess file with permissions of 600, when it needs to be 644 in order for the web server to read it (I run Apache with suphp, so PHP code is executed as my own user, not www-data, and a file readable only by its owner is invisible to Apache). A quick chmod 644 .htaccess later, and the EveKnows Blog is back up and running!
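For anyone who would rather script the check than remember the chmod, here is a small Python sketch of the same fix. It assumes a POSIX system, and the helper name `make_world_readable` is made up for this example.

```python
import os
import stat


def make_world_readable(path):
    """Add group/other read bits to a file, e.g. turning 600 into 644.

    Returns the resulting permission bits.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode | stat.S_IRGRP | stat.S_IROTH
    if new_mode != mode:
        os.chmod(path, new_mode)
    return new_mode
```

Calling `make_world_readable(".htaccess")` from the blog’s directory has the same effect as the chmod above when the file starts out at 600.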


Mar/07

11

Boolean Searches

Yesterday I implemented boolean searches for our little adult search engine. You may now prefix a word with a minus sign, such as -hardcore, and the results will be stripped of all galleries containing that word. There is no limit on how many negated words a query may contain. Hopefully this will help people attempting to narrow down search results.

In the same vein, I’ve also added quoted strings to the search functionality. So, searching for “Alison Angel” will now only return galleries matching the exact string Alison Angel. Of course, you are still free to leave off the quotation marks, which will result in a query fetching all galleries containing both the words Alison and Angel, though not necessarily in that order.
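Here is a rough sketch of how a query with negated words and quoted strings could be split into its parts. It is written in Python for illustration and is not the engine’s own parser; the `parse_query` name and the returned dictionary layout are assumptions.

```python
import re

# Either an optionally negated quoted phrase, or an optionally negated word.
TOKEN_RE = re.compile(r'(-?)"([^"]+)"|(-?)(\S+)')


def parse_query(query):
    """Split a search query into required terms, exact phrases, and exclusions."""
    # Treat curly quotes like straight ones.
    query = query.replace("\u201c", '"').replace("\u201d", '"')
    parsed = {"terms": [], "phrases": [], "excluded": []}
    for neg_quoted, phrase, neg_word, word in TOKEN_RE.findall(query):
        if neg_quoted or neg_word:
            parsed["excluded"].append((phrase or word).lower())
        elif phrase:
            parsed["phrases"].append(phrase.lower())
        else:
            parsed["terms"].append(word.lower())
    return parsed


print(parse_query('"Alison Angel" blonde -hardcore'))
# {'terms': ['blonde'], 'phrases': ['alison angel'], 'excluded': ['hardcore']}
```

The three lists can then be handled separately: terms are ANDed together, phrases are matched as exact strings, and anything in the excluded list filters galleries out of the results.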


Mar/07

10

Indexing Improvements

Today I fixed up the indexer in a few ways:

1) The thumbnail code now crops the image into a 100px square.  This helps align the search results and really improves the look of the site.  The cropper centers around the upper-central-third of the photo, so it should include the face and tits of most models.

2) The summary-extracting code has been vastly improved and tested on a slew of galleries; it now does a much better job of skipping ads and recip links and pulling the first sentence or two of the gallery description (assuming one exists).

3) I caught a bug in the indexer which was stripping numbers from the reverse index, so searching for “18 year old teens” would only find results for “year old teens”.  Oops!

4) The list of stop words has been extended to include the most common words which don’t really apply to gallery descriptions: stuff like the text of the 2257 links at the bottom of galleries, HTML link codes, etc. This keeps the reverse index leaner, which leads to much faster search times.
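Items 3 and 4 above can be illustrated with a small tokenizer sketch in Python. This is not the indexer’s real code; the stop-word list here is a tiny made-up sample and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical sample; the real stop-word list is much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "href", "http", "www"}


def tokenize(text):
    """Lowercase the text and split it into terms, dropping stop words.

    The pattern [a-z0-9]+ keeps digits, so numbers survive as terms
    instead of being stripped.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_reverse_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


print(tokenize("Gallery 42 of the 2007 set"))
# ['gallery', '42', '2007', 'set']
```

A pattern like `[a-z]+` would reproduce the bug from item 3, since every number would silently vanish from the index.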

With all of these changes, I’m going to clear out the existing database and start spidering from scratch.  Annoying, I know, but EveKnows is still in beta ;)

Also, for any gallery submitters out there, I’ve added a Submissions page so that you can queue up your own galleries to be indexed.  I’d appreciate a recip link on any galleries submitted, but it’s not required and it won’t affect your search ranking in any way.


Mar/07

7

Hardware Upgrade

Those of you testing out the initial release of EveKnows likely noticed the slow response times; the problem was not the web server but the SQL server. To keep costs low I was using a shared server, but five to ten seconds to generate results for a query on only a few thousand galleries was unacceptable. This afternoon I purchased a dedicated machine and found a few ways to optimize the queries, bringing search times down to the half-second range. I imagine there is still room for improvement, but this should do for the testing period.

Last night I integrated records of incoming links into EveKnows. The spider now tracks the text of links to galleries (or alt tags in the case of images), but with reduced weighting and a character limit. This makes keyword spamming far less productive, while giving galleries without much embedded text more accurate rankings. Sadly, spamming seems rampant in the porn industry, perhaps more so than in other Internet subcultures. I’m attempting to program the spider to detect natural English phrases and rate them higher than lists of keywords. Feedback on the quality of search results is most welcome, as are ideas on how to combat the constant problem of gallery spamming.
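One way the weighting and character limit could work is sketched below in Python. The post does not give the real numbers or formula, so the weight of 0.5, the 60-character cap, and the function names are all assumptions chosen only to show the idea.

```python
ANCHOR_WEIGHT = 0.5     # assumed: link text counts half as much as page text
MAX_ANCHOR_CHARS = 60   # assumed character limit per incoming link


def anchor_terms(anchor_text):
    """Clip the link text to the character limit, then split it into terms."""
    return anchor_text[:MAX_ANCHOR_CHARS].lower().split()


def score_terms(page_text, anchor_texts):
    """Score each term: 1.0 per on-page use, ANCHOR_WEIGHT per link-text use."""
    scores = {}
    for term in page_text.lower().split():
        scores[term] = scores.get(term, 0.0) + 1.0
    for text in anchor_texts:
        for term in anchor_terms(text):
            scores[term] = scores.get(term, 0.0) + ANCHOR_WEIGHT
    return scores


print(score_terms("red dress", ["red dress photos"]))
# {'red': 1.5, 'dress': 1.5, 'photos': 0.5}
```

With the cap in place, a link stuffed with a hundred repetitions of a keyword contributes no more than the first 60 characters’ worth, which is the point of the limit.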


Mar/07

5

Searching is Live!

Alright! The first beta database has been moved to the production server, so searching is officially live at http://eveknows.com! The initial dataset only includes 5,000 galleries, which is nothing terribly impressive, but it gives a decent indication of the search engine’s technology. Try it out and let me know what you like and what needs improvement.


Mar/07

4

Search Terms

After a day of messing around with SQL queries, I’ve finally got a handle on doing a logical AND search against the reverse index. Previously the search terms were OR’ed together, so searching for Liz Vicious would return any gallery that matched the word Liz and then every gallery containing the word Vicious. Turns out that there’s a lot of stuff with the word Vicious in it that has absolutely nothing to do with the sexy goth redhead I was looking for, so I knew the searching algorithm needed work. The trick is in SQL’s HAVING clause; the Eve engine does a COUNT(*) on the returned results, which are grouped by URL. The result of the COUNT(*) function is the number of matching terms; a quick HAVING COUNT(*)=$n_terms line in the SQL SELECT statement cleaned up the mess.
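The HAVING trick can be shown with a small self-contained sketch, written here in Python against an in-memory SQLite database. The `reverse_index` table, its columns, and the helper names are assumptions for illustration; the real schema is not described in this post.

```python
import sqlite3


def make_index(rows):
    """Build a tiny reverse index of (term, url) pairs (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # One row per (term, url), so COUNT(*) per URL equals the number
    # of distinct query terms that matched it.
    conn.execute(
        "CREATE TABLE reverse_index ("
        "term TEXT, url TEXT, PRIMARY KEY (term, url))"
    )
    conn.executemany("INSERT INTO reverse_index VALUES (?, ?)", rows)
    return conn


def search_all_terms(conn, terms):
    """Return URLs that contain every one of the given terms (logical AND)."""
    terms = sorted({t.lower() for t in terms})
    if not terms:
        return []
    placeholders = ", ".join("?" for _ in terms)
    sql = (
        f"SELECT url FROM reverse_index WHERE term IN ({placeholders}) "
        "GROUP BY url HAVING COUNT(*) = ? ORDER BY url"
    )
    return [row[0] for row in conn.execute(sql, (*terms, len(terms)))]
```

Dropping the HAVING clause turns the same query back into the old OR search, which returns every gallery matching either word. Note that duplicate terms in the query are removed before counting; otherwise the expected count would be too high and nothing would match.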

I also made some changes to the spider so that it pulls search terms from incoming links; this should help improve search quality, but the change means the existing gallery database is worthless. I’ve scrapped it and started spidering from scratch. In a day or two I’ll take whatever’s been spidered and move it to the production server at EveKnows.com. Stay tuned for some quality porn searches!


Mar/07

4

Indexing Progress

The spider component of EveKnows.com has been progressing swiftly. It’s become very reliable at examining a webpage and determining whether or not the page is a photo gallery. In the past 48 hours it has indexed over 15,000 galleries, which is a decent amount to begin testing the searching and sorting algorithms on.

The user interface has been uploaded to http://eveknows.com, but the database will not be put online till its performance has been tested. You’ll notice that, unlike just about every other search engine on the Web, there is no Advertise link. There will be no paid-placement galleries on EveKnows.com. Rest assured, every search result will be a genuine, organic gallery linked to by popular TGP sites. Feel free to comment on the look of the interface and include any features you’d like to see added to the search engine.

