<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Anatomy of a Search Engine &#187; Linux</title>
	<atom:link href="http://blog.eveknows.com/category/linux/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.eveknows.com</link>
	<description>Development of the EveKnows.com adult search engine</description>
	<lastBuildDate>Wed, 16 Nov 2011 00:58:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Profiling and Debugging Linux Disk Access</title>
		<link>http://blog.eveknows.com/2007/09/27/profiling-and-debugging-linux-disk-access/</link>
		<comments>http://blog.eveknows.com/2007/09/27/profiling-and-debugging-linux-disk-access/#comments</comments>
		<pubDate>Thu, 27 Sep 2007 06:12:58 +0000</pubDate>
		<dc:creator>Aidan</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://blog.eveknows.com/2007/09/27/profiling-and-debugging-linux-disk-access/</guid>
		<description><![CDATA[EveKnows.com is 100% Linux powered. The free (as in speech!) system has proved to be absolutely perfect for our needs. It&#8217;s fast, stable, and customizable&#8211;exactly what you look for in a platform for running fresh, cutting-edge applications such as EveKnows. One of the harder tasks I&#8217;ve had is tuning disk access. The search engine is [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://eveknows.com">EveKnows.com</a> is 100% Linux powered.  The free (as in speech!) system has proved to be absolutely perfect for our needs.  It&#8217;s fast, stable, and customizable&#8211;exactly what you look for in a platform for running fresh, cutting-edge applications such as EveKnows.</p>
<p>One of the harder tasks I&#8217;ve had is tuning disk access.  The search engine is currently running on a Debian 4.0 system with SATA hard drives.  The UNIX utility <em>top</em> reports 10-20% IO wait (a good indicator of how much time is spent waiting on the disks) almost all the time.  When I turn on the Caroline search spider, that figure spikes to 50%.  At the moment this isn&#8217;t really a big deal, but as the site&#8217;s popularity continues to grow, it will eventually become a bottleneck and severely limit performance.</p>
<p>Thus, I&#8217;ve been trying to learn about profiling disk access on Linux systems.  Maybe I&#8217;ve just been looking in the wrong places, but I haven&#8217;t been able to find any tools which can show me which applications are causing the heavy IO load.  Some digging revealed that the kernel logs individual IO calls, readable with <em>dmesg</em>, when <em>/proc/sys/vm/block_dump</em> is set to <em>1</em>, but that raw log is far too verbose to be useful on its own.  So I wrote a small Perl script which totals the logged IO calls per process and displays a pretty table of results.  If anyone is interested in using it themselves, the code is below.</p>
<p><strong>Update:</strong> HTML tends to mangle Perl code, so copying and pasting the code below may not work; if you just want to download the script for your own use, you can <a href="/files/io_stats.pl.gz">find it here</a>.</p>
<pre>
#!/usr/bin/perl
#
# Copyright 2007 Aidan Trent &lt;aidan@eveknows.com&gt;
# Released under the terms of the GNU GPL

# Usage: SCRIPT_NAME &lt;time&gt;
# The optional &lt;time&gt; parameter tells the script how many
# minutes it should spend gathering IO statistics. The
# default is 5.

use strict;
use warnings;

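my $sleep_time = 60 * 5; # default: 5 minutes
if ($ARGV[0]) {
    $sleep_time = 60 * int ($ARGV[0]);
}
# Turn on block_dump so the kernel logs every IO call (needs root)
system ('echo 1 &gt; /proc/sys/vm/block_dump') == 0
    or die "Cannot enable block_dump (are you root?)\n";
sleep ($sleep_time); # gather statistics
system ('echo 0 &gt; /proc/sys/vm/block_dump');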

`dmesg &gt; /tmp/io_stats.temp`;
open (my $fd, '&lt;', '/tmp/io_stats.temp')
    or die "Cannot open /tmp/io_stats.temp: $!\n";
my (%total, %read, %write, %dirtied);
while (&lt;$fd&gt;) {
    # block_dump lines look like "name(pid): READ block ..." or
    # "name(pid): dirtied inode ..."
    if (/(.*)\(\d+\):\s+(dirtied|READ|WRITE)/i) {
        my $name = $1;
        my $type = lc ($2);
        # Make sure every counter exists so the table never prints undef
        $read{$name}    ||= 0;
        $write{$name}   ||= 0;
        $dirtied{$name} ||= 0;
        $total{$name}++;
        $read{$name}++    if ($type eq 'read');
        $write{$name}++   if ($type eq 'write');
        $dirtied{$name}++ if ($type eq 'dirtied');
    }
}
close ($fd);

print "Name\t\tTotal\tRead\tWrite\tDirtied\n";
foreach my $key (sort {$total{$b} &lt;=&gt; $total{$a}} keys %total) {
    my $tab = '';
    if (length ($key) &lt; 7) {
        $tab = "\t";
    }
    print "$key:$tab\t$total{$key}\t$read{$key}\t$write{$key}\t$dirtied{$key}\n";
}

unlink ('/tmp/io_stats.temp');
</pre>
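<p>If you want to try the counting logic without root access or touching <em>block_dump</em>, here is a minimal sketch that runs the same regular expression over a few sample lines.  The <em>tally_block_dump</em> helper and the sample log lines are made up for illustration; they only assume the "name(pid): READ/WRITE/dirtied ..." shape that the script above matches, and real kernel output may differ in detail.</p>
<pre>
```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally block_dump entries per process name.
# Takes a list of log lines and returns a hash ref shaped like:
#   { name => { total => N, read => N, write => N, dirtied => N } }
# (hypothetical helper, not part of the original script)
sub tally_block_dump {
    my @lines = @_;
    my %stats;
    foreach my $line (@lines) {
        # Same pattern as the script above: "name(pid): READ|WRITE|dirtied ..."
        next unless $line =~ /(.*)\(\d+\):\s+(dirtied|READ|WRITE)/i;
        my ($name, $type) = ($1, lc ($2));
        # Start every counter at zero so nothing prints as undef
        $stats{$name} ||= { total => 0, read => 0, write => 0, dirtied => 0 };
        $stats{$name}{total}++;
        $stats{$name}{$type}++;
    }
    return \%stats;
}

# Sample lines invented for illustration only
my @sample = (
    'kjournald(1012): WRITE block 524312 on sda1',
    'mysqld(2201): READ block 88120 on sda1',
    'mysqld(2201): READ block 88128 on sda1',
    'mysqld(2201): dirtied inode 40211 (ibdata1) on sda1',
);

my $stats = tally_block_dump (@sample);
# Busiest process first, as in the original table
foreach my $name (sort { $stats->{$b}{total} <=> $stats->{$a}{total} } keys %$stats) {
    my $s = $stats->{$name};
    printf "%-16s %5d %5d %5d %5d\n",
        $name, $s->{total}, $s->{read}, $s->{write}, $s->{dirtied};
}
```
</pre>
<p>With these sample lines, mysqld is listed first (three entries: two reads and one dirtied inode), followed by kjournald with a single write.  Keeping all four counters in one nested hash is just a convenience for the sketch; the script above uses four separate hashes and works the same way.</p>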
]]></content:encoded>
			<wfw:commentRss>http://blog.eveknows.com/2007/09/27/profiling-and-debugging-linux-disk-access/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

