Archive for the ‘Development’ category

Search Result Weighting

March 18th, 2007

I just made an update to the search result weighting algorithm. To keep the newest galleries on top, sites which were first indexed within the past three days are now given some extra weight. I’m a little worried that the extra processing will slow down the searching algorithm too much, but we’ll try this out for a while and see how it works. If nothing else, it’ll keep new galleries at the top of the search results.

I also made some adjustments to penalize keyword spammers. Once a word shows up in a gallery too many times, a negative value is applied to the gallery, which lowers it in the search rankings. The more often the same word appears, the lower the gallery will sink.
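
As a rough illustration of how the recency bonus and the spam penalty could combine, here is a minimal sketch in Python (not the Perl that actually runs the site). The three-day window comes from the post; the bonus size, the "too many times" threshold, the per-repeat penalty, and the function name are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

# Only the three-day window is from the post; the other values are assumed.
NEW_SITE_WINDOW = timedelta(days=3)
NEW_SITE_BONUS = 1.0           # assumed size of the extra weight
SPAM_THRESHOLD = 10            # assumed cutoff for "too many times"
SPAM_PENALTY_PER_REPEAT = 0.5  # assumed penalty per repeat past the cutoff


def adjusted_score(base_score, first_indexed, word_count, now):
    """Return base_score plus a recency bonus minus a keyword-spam penalty."""
    score = base_score
    # Galleries first indexed within the window get a fixed boost.
    if now - first_indexed <= NEW_SITE_WINDOW:
        score += NEW_SITE_BONUS
    # Every repeat beyond the threshold pushes the gallery further down.
    excess = max(0, word_count - SPAM_THRESHOLD)
    score -= excess * SPAM_PENALTY_PER_REPEAT
    return score
```

With a linear penalty like this, the score keeps dropping as the repeat count grows, which matches the "the more it appears, the lower it sinks" behaviour described above.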

As always, check out the advances at the little porn search engine that could :)

Disk Optimizations

March 18th, 2007

So with Caroline running full-bore, I noticed a frightening thing: the EveKnows.com server was bottlenecked by disk IO. Between the spider downloading thumbnails and inserting galleries, the SQL server fetching them, and the Apache server handing out web pages, my machine was slowing to a crawl. Since the search database is far too large to fit into memory, each search ends up hitting the disk several times. Couple that with a constant stream of SQL INSERT statements and you get a MySQL daemon that’s constantly waiting for the hard disk to give it what it needs, and an idle CPU waiting for something to do.

After some researching, I discovered a few tricks to relieve the congestion. First, I used hdparm to enable multiple sector counts (16) and write-back caching on my disk. This alone made a huge improvement. Then, I tuned Caroline to make far fewer disk hits; it had been logging data, but I don’t really need the log so I turned that off, and I tweaked the way that it opens the stop_list file to do it once per thread rather than once per gallery (which was something I should have done from the beginning, but hey, this is the first search engine I’ve ever written!). Finally, I set up two databases on the SQL server, one for the search engine and one for the spider to insert new galleries into. They are identical, save that the spider’s database doesn’t have any indexes, which means far fewer disk writes when adding galleries. Then I set up a cron job to, once per day, replace the search database with the recently updated version from Caroline, add in the indexes needed to keep the searches efficient, and then go back to spidering.
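
The "once per thread rather than once per gallery" change is essentially a per-thread cache. A minimal sketch of that idea in Python (Caroline itself is Perl, and the function name and file format here are assumptions: one stop word per line):

```python
import threading

# Each thread gets its own private attribute namespace on this object.
_thread_state = threading.local()


def get_stop_words(path):
    """Return the stop list at `path`, reading the file at most once per thread."""
    cache = getattr(_thread_state, "cache", None)
    if cache is None:
        cache = _thread_state.cache = {}
    if path not in cache:
        # Only the first call in each thread touches the disk.
        with open(path, encoding="utf-8") as handle:
            cache[path] = frozenset(
                line.strip() for line in handle if line.strip()
            )
    return cache[path]
```

Every gallery processed by a thread after the first one gets the in-memory set back, so the stop list costs one disk read per thread instead of one per gallery.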

Thus far, the plan seems to be working well. ‘top’ no longer shows an I/O wait (%wa) of 90, and searches are completing in less than one second again while Caroline is running. Woohoo!

WWW::Mechanize Memory Management

March 16th, 2007

With the new threaded model for Caroline, I started to notice memory usage getting out of control. Under the previous forking system, I could expect a typical run to go through 200 MB, but the new model was easily topping 1 GB and then dying when the system ran out of physical memory. After some searching, I learned that Perl’s WWW::Mechanize module caches each page in memory for the lifetime of the object (this allows for the ‘back’ feature of Mechanize); cool if you’re trying to mimic a web browser, but totally unusable for a web robot like Caroline. Thankfully there is a stack_depth() method for WWW::Mechanize which controls the number of pages cached. By setting this to 2 I managed to get Caroline’s memory usage back under control. She’s off happily indexing more galleries as I type :)
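
To show why capping the history depth bounds memory, here is a small Python analogy. It is not how WWW::Mechanize is implemented internally; the class and method names are invented for the example. The idea is simply a fixed-size history that silently drops the oldest page when a new one arrives.

```python
from collections import deque


class BoundedHistory:
    """Keep only the most recent `depth` pages, discarding older ones."""

    def __init__(self, depth):
        # A deque with maxlen evicts from the left once it is full.
        self._pages = deque(maxlen=depth)

    def visit(self, page):
        self._pages.append(page)

    def back(self):
        """Drop the current page and return the previous one, or None."""
        if len(self._pages) < 2:
            return None
        self._pages.pop()
        return self._pages[-1]

    def __len__(self):
        return len(self._pages)
```

With a depth of 2, a crawler can visit any number of pages while holding at most two in memory, at the cost of only being able to go "back" one step.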

Site Redesign

March 15th, 2007

Today I took a break from tweaking the engine database and gave the interface an overhaul. I know search engines these days are mostly simple affairs, but I figured a decent design couldn’t hurt :p

It’s all valid XHTML + CSS, and should degrade cleanly in older browsers. Let me know if anyone has trouble viewing the site in their browser of choice.

Caroline/0.4.1

March 14th, 2007

Caroline is the name of the spider used to build the EveKnows.com database. Today I hacked together version 0.4, which moved to the threading model I mentioned earlier. Running multiple processes was a pain and relied on having a few different starting-points; now I can run the spider on a single URL and still experience fast results. Shortly after it began to run, I realized that my host was going to get banned all over the place because requests were being sent too quickly. Caroline/0.4.1 took a step towards fixing this by shuffling up the queue of sites a bit.

Now I’m researching how to respect robots.txt and implement longer pauses when repeatedly hitting the same server. Caroline uses Perl’s WWW::Mechanize module to automate HTTP requests, which means I can’t use the LWP::RobotUA module without some serious hacking. This might be the way to go in the long run, but I want to explore other options first.
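
For illustration, the two pieces being researched here (honouring robots.txt and pausing between hits to the same server) might look like the following in Python. This is a sketch of the general technique, not Caroline’s Perl code: `HostThrottle`, `allowed_by_robots`, and the delay value are hypothetical names and numbers, while `urllib.robotparser` is Python’s standard-library robots.txt parser.

```python
import urllib.robotparser
from urllib.parse import urlsplit


class HostThrottle:
    """Track the earliest time each host may be contacted again."""

    def __init__(self, min_delay):
        self.min_delay = min_delay  # seconds between hits to one host
        self._next_allowed = {}

    def wait_time(self, url, now):
        """Seconds to wait before fetching url; 0.0 if it may go now."""
        host = urlsplit(url).netloc
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note that url's host was just hit at time `now`."""
        host = urlsplit(url).netloc
        self._next_allowed[host] = now + self.min_delay


def allowed_by_robots(robots_txt, user_agent, url):
    """Check url against the text of an already-downloaded robots.txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A spider thread would call `wait_time()` before each request and either sleep or pick a URL for a different host from the queue, which complements the queue shuffling already in 0.4.1.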

Subqueries and Left Joins

March 12th, 2007

Wow. Never underestimate the power of a left join. After watching the gallery database grow to 60,000 galleries with 2.5 million indexed words, SQL query times were getting pretty bad. A search for ‘Sexy Teens’ took around 10 seconds; if it’s that bad at 60,000 galleries, imagine how bad it would be at 1,000,000! So I did some research on query optimization today and found a way to rework the main search algorithm. But first, some background…

The EveKnows database consists of a few tables, but the two that matter right now are a table of galleries and a table of keywords. The tables are linked by the gallery_id key; each gallery has an entry in the keyword table for every word on the page, plus an associated score. For example, a gallery with the word ‘goth’ in the title, incoming link, and page text might have a word score of 4 while a gallery with ‘goth’ only appearing once in the page text might have a score of 0.25. When someone queries the search engine, we look at each word in their query, add up the scores for each matching gallery, and return the results sorted so that the highest-scoring galleries are at the top. Sounds simple enough, right?

Well, it is pretty straightforward, but my god it doesn’t scale well! Once the keyword table topped 1,000,000 entries, it began to get bad. After some research, I found that the slowdown was coming when I joined the keyword and gallery tables together: the resulting table was huge! So, I needed to reduce the size of the final table, but how? Thankfully, MySQL 5 supports subqueries, a method of executing one query inside of another and using the results in the parent query. Thus I was able to eliminate the natural join and use a subquery on the keyword table to pull each matching gallery ID and return them pre-sorted, since the scoring data was included in the subquery. This gives us a list of gallery IDs which we then left join to the full gallery table. Left joins are useful because they don’t take the product of two tables; they simply add the matching rows of the second table to the first. Thus, our result list stays sorted and has all of the gallery details, such as title, URL, summary, etc., added to it in a highly efficient manner.
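
The shape of that query can be sketched as follows. This is an approximation under several assumptions: it uses Python’s built-in SQLite as a stand-in for MySQL 5, and every table and column name except gallery_id is made up for the example. The subquery sums the scores per gallery from the keyword table alone, and only that small result is left joined to the gallery table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE galleries (gallery_id INTEGER PRIMARY KEY, title TEXT, url TEXT);
CREATE TABLE keywords (gallery_id INTEGER, word TEXT, score REAL);
CREATE INDEX keywords_word ON keywords (word);
INSERT INTO galleries VALUES (1, 'Goth Gallery', 'http://example.com/1'),
                             (2, 'Other Gallery', 'http://example.com/2'),
                             (3, 'Unrelated', 'http://example.com/3');
INSERT INTO keywords VALUES (1, 'goth', 4.0), (2, 'goth', 0.25),
                            (2, 'sexy', 1.0), (1, 'sexy', 0.5),
                            (3, 'punk', 2.0);
""")


def search(words):
    """Return (gallery_id, title, url, total) rows, highest total first."""
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    sql = f"""
        SELECT g.gallery_id, g.title, g.url, m.total
        FROM (SELECT gallery_id, SUM(score) AS total
              FROM keywords
              WHERE word IN ({placeholders})
              GROUP BY gallery_id) AS m
        LEFT JOIN galleries AS g ON g.gallery_id = m.gallery_id
        ORDER BY m.total DESC
    """
    return conn.execute(sql, words).fetchall()
```

One deliberate difference from the description above: the sketch sorts in the outer query rather than relying on the subquery’s order, because SQL engines are not generally required to preserve a subquery’s ordering through a join.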

Final result? That ‘Sexy teens’ search dropped from 10 seconds to less than half a second. W00t indeed!

Spidering Speed

March 11th, 2007

The initial spider I wrote wasn’t bad; it tended to average about 500 sites/hour at first, and after some tweaks I managed to get it up to 1,000/hour. Indexing 24,000 sites each day seemed pretty good at first, but then I realized that in order to keep results accurate, I’ll need to be re-indexing all of those pages on a fairly regular basis, say once every week or two. Since I’m shooting for 1,000,000 galleries, indexing 24,000/day just isn’t going to cut it!

My first thought was the Perl threading module: since the spider spends most of its time blocking on network requests, it seemed an ideal fit for multi-threading. While I was looking into this, I realized I was horribly over-thinking the entire problem. Threading isn’t much more than parallel processes with shared memory, but since this problem domain holds a practically unlimited number of galleries to index, sharing memory isn’t really an issue. Plus, the spider is in constant contact with a database, so each process knows when a URL was recently indexed. Thus, with only a few minor tweaks, I was able to get several spiders running simultaneously. They’ve hit 30,000 galleries over the past nine hours; not a bad improvement for about an hour’s worth of work :)

WordPress and 403 Forbidden Errors when Permalinking

March 11th, 2007

Gah, spent the better part of this beautiful afternoon getting WordPress’ permalinking feature to play nice with Apache 2.2 on Debian Etch. Every time I turned on Permalinking, I’d get a 403 Forbidden error when trying to view the blog, and going back to the Permalinking options page resulted in a /wp-admin/options-permalink.php 403: Forbidden error as well. After some Googling and messing around with my Apache config files, I realized that WordPress was creating a .htaccess file with permissions of 600, when they need to be 644 in order for the web server to see it (I run Apache with suphp, so PHP code is executed as my own user, not www-data). A quick chmod 644 .htaccess later, and the EveKnows Blog is back up and running!

Boolean Searches

March 11th, 2007

Yesterday I implemented boolean searches for our little adult search engine. You may now prefix a word with a minus sign, such as -hardcore, and the results will be stripped of all galleries containing that word. There are no limits on how many negated words a query may contain. Hopefully this will help people attempting to narrow down search results.

In the same vein, I’ve also added quoted strings to the search functionality. So, searching for “Alison Angel” will now only return galleries matching the exact string Alison Angel. Of course, you are still free to leave off the quotation marks, which will result in a query fetching all galleries containing both the words Alison and Angel, though not necessarily in that order.
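
A query with quoted strings and negated words first has to be split into its parts. The sketch below shows one way to do that in Python; it is an assumption about the approach, not the site’s actual parser, and `parse_query` and the returned keys are names invented for the example. It deliberately ignores edge cases such as unbalanced quotes or a minus sign in front of a quoted phrase.

```python
import re

# Either a double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def parse_query(query):
    """Split a query into exact phrases, required words, and excluded words."""
    phrases, required, excluded = [], [], []
    for phrase, word in TOKEN.findall(query):
        if phrase:
            phrases.append(phrase.lower())
        elif word.startswith("-"):
            # A bare "-" carries no word, so it is dropped.
            if word[1:]:
                excluded.append(word[1:].lower())
        else:
            required.append(word.lower())
    return {"phrases": phrases, "required": required, "excluded": excluded}
```

For example, `parse_query('"Alison Angel" outdoor -hardcore')` yields one phrase, one required word, and one excluded word, which the search code can then turn into the matching and filtering steps described above.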

Indexing Improvements

March 10th, 2007

Today I fixed up the indexer in a few ways:

1) The thumbnail code now crops the image into a 100px square.  This helps align the search results and really improves the look of the site.  The cropper centers around the upper-central-third of the photo, so it should include the face and tits of most models.

2) The summary-extracting code has been vastly improved and tested on a slew of galleries; it now does a much better job of skipping ads and recip links and pulling the first sentence or two of the gallery description (assuming one exists).

3) I caught a bug in the indexer which was stripping numbers from the reverse index, so searching for “18 year old teens” would only find results for “year old teens”.  Oops!

4) The list of stop words has been extended to include the most common words which don’t really apply to gallery descriptions, stuff like the 2257 links at the bottom of galleries, HTML link codes, etc.  This keeps the reverse index leaner, which leads to much faster search times.
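
The cropping in item 1 can be sketched as a small geometry calculation. This is one interpretation of "centers around the upper-central-third", not the indexer’s actual code: it assumes the largest square that fits, centered horizontally, with its vertical center one third of the way down the image and clamped to the image edges. The function name is hypothetical, and resizing the square to 100px is left to whatever image library is in use.

```python
def crop_box(width, height):
    """Return (left, top, right, bottom) of a square biased toward the upper third."""
    side = min(width, height)
    # Center the square horizontally.
    left = (width - side) // 2
    # Aim the square's center one third of the way down, then clamp
    # so the box never extends past the top or bottom of the image.
    top = height // 3 - side // 2
    top = max(0, min(top, height - side))
    return (left, top, left + side, top + side)
```

For a tall 300x600 photo this keeps rows 50 through 350, favouring the upper part of the frame; for a wide photo the square spans the full height and is simply centered left to right.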

With all of these changes, I’m going to clear out the existing database and start spidering from scratch.  Annoying, I know, but EveKnows is still in beta ;)

Also, for any gallery submitters out there, I’ve added a Submissions page so that you can queue up your own galleries to be indexed.  I’d appreciate a recip link on any galleries submitted, but it’s not required and it won’t affect your search ranking in any way.