msnbot's Fatal Flaw!

I've had a disproportionate number of visits from "msnbot" compared to any other User Agent. After a bit of procrastinating, I've found out why - Microsoft's new search facility has an obvious and problematic flaw!

The flaw means that until Microsoft fix the "bot" that trawls the web indexing the pages it finds, anyone can cause MSN a few headaches. The flaw doesn't directly affect the actual search service, but it does open the door to incredibly obvious hacks that exploit the database used for search results. You don't even need to do anything "dodgy" to completely spam the indexing 'bot.

To exploit the MSN indexer problem (at least on msnbot/0.3), construct a non-CGI page that references itself, with the addition of a random number in the query string. This number needs to be generated on the fly, so that each time the page is loaded it changes. Here's an example using Server Side Includes (SSI):

    <html>
    <head>
    <title>msnbot catcher</title>
    </head>
    <body>
    <!-- format DATE_LOCAL as digits only: day, minute, second -->
    <!--#config timefmt="%d%M%S" -->
    <h1>The msnbot Catcher</h1>
    <!-- link back to this same page with an ever-changing query string -->
    <a href="<!--#echo var="DOCUMENT_URI" -->?cdc=<!--#echo var="DATE_LOCAL" -->">[Reload]</a>
    </body>
    </html>

This code uses the SSI variable DATE_LOCAL (which the #config directive formats as purely numeric) and constructs URLs like these:

    http://yourserver/path/to/file?cdc=221105
    http://yourserver/path/to/file?cdc=221106

Each time the page is reloaded, the number will be different (provided the requests come at least one second apart). Note that you must get the spider to check this page, so don't cover it with an entry in your robots.txt file.
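For the avoidance of doubt, a robots.txt entry like this (the path is purely illustrative) would hide the catcher from msnbot and defeat the whole exercise:

    User-agent: msnbot
    Disallow: /catcher/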

This is an incredibly obvious flaw - just about any website with any form of dynamic content passes parameters in the query string. Often such applications can show the same page in a number of forms (e.g. normal, printer friendly, mobile, low graphics etc.). In such cases the spider will index all forms of the page, but is unlikely to "get stuck" because there is a finite number of possible pages.
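For example (these URLs are purely illustrative), such an application might serve:

    http://yourserver/article?id=42
    http://yourserver/article?id=42&view=print
    http://yourserver/article?id=42&view=mobile

Each article yields only a handful of variants, so the crawl eventually terminates.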

Following such URLs is highly risky for a web spider. As soon as a URL has a question mark (?) in it, the page is (probably) dynamically generated, so it might easily include infinite references, and in many cases will be duplicate content. Most web spiders ignore all URLs with question marks in them for exactly this reason. Some spiders may fetch a page with a question mark in its URL provided the link was found on a page whose own URL has no question mark.
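As a rough sketch of that conservative policy (in Python; the function and parameter names are my own, not anything msnbot actually uses):

    from urllib.parse import urlparse

    def should_fetch(url, referrer):
        """Fetch plain URLs freely; follow a query-string URL only
        if it was linked from a page with no query string of its own."""
        if not urlparse(url).query:
            return True  # no '?': presumed static, safe to crawl
        # Dynamic-looking URL: allow it only one "hop" from a static
        # page, so a self-referencing catcher can't trap the spider.
        return not urlparse(referrer).query

Under that rule the catcher page itself gets indexed, and its first ?cdc= link might be fetched, but the endless chain of links beyond it never is.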

So for the time being, anyone can construct an "msnbot catcher". Less scrupulous site owners might make each generated page contain random words or phrases that subvert the index in some way.

Submitted by coofercat on Mon, 2004-11-22 19:51