Preston Hunt

FEB 24
2005

My server had its first denial of service "attack" today, and it came from the most unlikely of places... Google! It's not really their fault, as you can read below in my short story entitled "How I went head to head with the GoogleBot... and lost":

At around 4:00am, the GoogleBot latched on to my server and started pounding a couple of PHP scripts that are very resource-intensive (one of them resizes pictures, another is very database-intensive). For some reason that I'm still investigating, my web server (Apache) started gobbling up memory to try to keep up with the requests from the GoogleBot, eventually causing the machine to become completely unresponsive. See below for a graph of free system memory -- see if you can find the point where the GoogleBot found the server :-)

Now, Google follows proper web crawler etiquette and was only requesting a page every few seconds, but due to the resource-intensive nature of the pages being requested, that slow but relentless load was enough to bring my server to its knees. I'm to blame for not having a properly coded robots.txt file, and also for not employing sufficient caching of the CPU-heavy scripts.
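To make the caching point concrete, here is a minimal sketch of the idea in Python (my actual scripts are PHP, so this is an illustration and not the code running on the site). The function name `cached_render`, the cache directory, and the key format are all hypothetical. The expensive work runs once per distinct request, and later requests are served from disk.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_render(cache_dir: Path, key: str, render: Callable[[], bytes]) -> bytes:
    """Return the cached bytes for `key`, calling `render()` only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so arbitrary request parameters map to a safe file name.
    path = cache_dir / hashlib.sha256(key.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()
    data = render()  # the expensive step, e.g. resizing a picture
    path.write_bytes(data)
    return data
```

With something like this in front of the picture resizer, a crawler that requests the same thumbnail repeatedly would cost one resize plus cheap file reads. A real version would also need cache expiry and some protection against two requests rendering the same key at once.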

But I think Google has some culpability too, as they should be able to detect when they are crushing a server (based on a sudden rash of "404 not found" errors or other metrics) and they should back off for a while. They should also periodically check robots.txt for changes once they've started crawling a site (I quickly added a "Disallow:" line to protect my scripts, but the GoogleBot didn't seem to check it and kept on going.) The total GoogleBot access count according to my weblogs is currently 22,207 and climbing (they are still hitting my server every few seconds).
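For anyone wanting to sanity-check their own "Disallow:" lines before a crawler shows up, here is a small sketch using Python's standard `urllib.robotparser`. The script paths (`/resize.php`, `/gallery.php`) and the `example.com` host are made up for illustration; they are not the real paths on my server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are assumptions for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /resize.php
Disallow: /gallery.php
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether `user_agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling `is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/resize.php")` should report that the script is off limits, while an ordinary page such as `/index.html` stays crawlable. This only checks what the file says; it does not tell you when a crawler will re-read it.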

At first I suspected a hardware error since my server would crash at random intervals after being rebooted. It turns out that this random time was just the amount of time it took GoogleBot to re-latch on to my server after knocking it out of commission.

In debugging this, I swapped out every piece of hardware possible on the system before looking at the software installation as the possible culprit. Wondering if somebody had found a security exploit, I made sure that all of my Gentoo packages were up to date, and also upgraded to the latest Linux kernel and recompiled it. Recompiling the kernel turned out to be the pivotal action, as one of the new features was some sort of fail-safe that starts killing off processes if the system runs out of memory. This enabled my system to stay alive long enough for me to see the server logs, which revealed a gazillion requests coming in from 66.249.65.239 (which a reverse DNS lookup revealed to be a GoogleBot server).
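The log check described above can be sketched in a few lines of Python. This is a rough illustration, not the exact commands I ran: it assumes the access log is in the usual format where each line starts with the client address, and the helper names `top_clients` and `reverse_dns` are invented for this example.

```python
import re
import socket
from collections import Counter
from typing import Iterable, Optional

# Assumes each log line begins with the client address, as in Common Log Format.
CLIENT_RE = re.compile(r"^(\S+) ")


def top_clients(lines: Iterable[str], n: int = 5) -> list[tuple[str, int]]:
    """Count requests per client address and return the `n` busiest clients."""
    counts: Counter = Counter()
    for line in lines:
        match = CLIENT_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)


def reverse_dns(ip: str) -> Optional[str]:
    """Return the host name for `ip`, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None
```

Feeding the access log through `top_clients` should surface any single address that dominates the traffic, and `reverse_dns` on that address shows who it belongs to. Note that `reverse_dns` needs a working network connection and resolver.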

For now, I have simply blocked that IP address from accessing any pages on my server. Once things have settled down, I will re-allow access (don't want to be left off of the Google search index!). By then, my new robots.txt file should be in effect, and hopefully I won't have any more problems.

No hard feelings toward Google, of course. It's really my fault that this happened... amazing that it took this long for this sort of thing to occur!

tags: ph.com
permalink | comments | technorati

NOV 27
2004

I just realized that my RSS feed has been non-compliant up until now (just fixed it). I was advertising myself as an RSS 2.0 feed, but publishing an RSS 1.0 feed. RSS is very confusing because there are at least three different incompatible versions: One from Dave Winer (RSS 2.0), one from Netscape (RSS 0.9), and one that has been hacked up with all kinds of extensions to make it more usable (RSS 1.0).
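A quick way to catch this kind of mismatch is to look at the feed's root element: RSS 2.0 uses a plain `<rss version="2.0">` root, while RSS 1.0 is RDF-based and uses an `rdf:RDF` root with a channel in the `http://purl.org/rss/1.0/` namespace. The sketch below is a rough check in Python and not a validator; the function name `detect_feed_format` is made up, and the Atom test only recognizes the Atom 1.0 namespace, which was standardized after this post was written.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS10_NS = "http://purl.org/rss/1.0/"
ATOM10_NS = "http://www.w3.org/2005/Atom"  # Atom 1.0 only (an assumption of this sketch)


def detect_feed_format(xml_text: str) -> str:
    """Guess the feed format from the root element of the document."""
    root = ET.fromstring(xml_text)
    if root.tag == f"{{{RDF_NS}}}RDF":
        # RDF-based feeds: RSS 1.0 puts its channel in the RSS 1.0 namespace.
        if root.find(f"{{{RSS10_NS}}}channel") is not None:
            return "RSS 1.0"
        return "RDF-based RSS (not 1.0)"
    if root.tag == "rss":
        return f"RSS {root.get('version', 'unknown')}"
    if root.tag == f"{{{ATOM10_NS}}}feed":
        return "Atom 1.0"
    return "unknown"
```

Comparing the result with what the site advertises (for example in the feed link's label) would have flagged my feed as RSS 1.0 dressed up as RSS 2.0.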

Atom, by comparison, is much simpler, cleaner, and easier to code. Moving forward, I'm putting all my efforts into Atom, but I'll keep the RSS 2.0 feed running as well (bugfixes only, no new features over there).

tags: ph.com

SEP 15
2004

I've revamped my picture engine a bit: First, I fixed a problem that was causing the pictures and thumbnails to load quite slowly (thanks James for the bug report!). Second, I added a cool new random photos page (there's also a link under the random photo on the left side of the main page). Check it out... and if you don't see anything you like, hit reload for more random goodness :)

tags: ph.com

JAN 29
2004

I just finished adding RSS support to this site (check the "Miscellany" menu on the right for the RSS Feed link). If you don't know what RSS is, it means that you can get automatic updates whenever I add something new to my site. You can either add a module onto your My Yahoo! page or download a stand-alone RSS aggregator (my favorite is SharpReader but there are plenty of others).

tags: ph.com

JUN 13
2003

The Wayback Machine has snapshots of what prestonhunt.com used to look like (all the way back to May 2000).

tags: ph.com

APR 4
2003

I am in the process of adding licensing terms to all of my pages. For example, this page now has a license (see bottom right hand column).

tags: ph.com

JUL 10
2002

If you search on Google for my name, the first two results are links to this site. Exxxxcellent.

tags: ph.com
