Wow. Never underestimate the power of a left join. After watching the gallery database grow to 60,000 galleries with 2.5 million indexed words, SQL query times were getting pretty bad. A search for ‘Sexy Teens’ took around 10 seconds; if it’s that bad at 60,000 galleries, imagine how bad it would be at 1,000,000! So I did some research on query optimization today and found a way to rework the main search algorithm. But first, some background…
The EveKnows database consists of a few tables, but the two that matter right now are a table of galleries and a table of keywords. The tables are linked by the gallery_id key; each gallery has an entry in the keyword table for every word on the page, plus an associated score. For example, a gallery with the word ‘goth’ in the title, incoming link, and page text might have a word score of 4 while a gallery with ‘goth’ only appearing once in the page text might have a score of 0.25. When someone queries the search engine, we look at each word in their query, add up the scores for each matching gallery, and return the results sorted so that the highest-scoring galleries are at the top. Sounds simple enough, right?
Well, it is pretty straight-forward, but my god it doesn’t scale well! Once the keyword table topped 1,000,000 entries, it began to get bad. After some research, I found that the slowdown was coming when I joined the keyword and gallery tables together–the resulting table was huge! So, I needed to reduce the size of the final table, but how? Thankfully, MySQL 5 supports subqueries, a method of executing one query inside of another and using the results in the parent query. Thus I was able to eliminate the natural join and use a subquery on the keyword table to pull each matching gallery ID and return them pre-sorted, since the scoring data was included in the subquery. This gives us a list of gallery IDs which we then left join to the full gallery table. Left joins are useful because they don’t take the product of two tables, they simply add the matching rows of the second table to the first. Thus, our result list stays sorted and has all of the gallery details, such as title, URL, summary, etc., added to it in a highly efficient manner.
Final result? That ‘Sexy teens’ search dropped from 10 seconds to less than half a second. W00t indeed!