MSNBOT appears to be case sensitive when reading robots.txt

  • Question

  • Whether a DISALLOW directive is respected appears to depend on the case of the referral URL matching the case used in robots.txt.

     

    This seems very strange, but our web logs indicate that this is the case.

     

    Has anyone else seen this?

     

     

     

     

    Thursday, July 3, 2008 12:38 PM

Answers

All replies

  •  

    Hi,

     

    robots.txt is case sensitive, if that is what you are asking. Matching is based on the actual page names, not on links to the pages. It shouldn't matter whether the link is http://www.msn.com/?wl=true or http://www.Msn.com/?wl=true; as long as the page is listed with the correct case in your robots.txt, the crawler should obey your directives.
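
    A minimal sketch of what case-sensitive matching looks like, using Python's standard urllib.robotparser as a stand-in. This is only an illustration: it is not msnbot's parser, and the robots.txt content, the bot name "examplebot" and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /Private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Host case does not matter here: only the path is compared to the rule.
blocked_exact = not parser.can_fetch(
    "examplebot", "http://www.example.com/Private/page.aspx")
blocked_host_case = not parser.can_fetch(
    "examplebot", "http://www.Example.com/Private/page.aspx")

# Path case does matter to this parser: /private/ does not match /Private/.
blocked_lower_path = not parser.can_fetch(
    "examplebot", "http://www.example.com/private/page.aspx")

print(blocked_exact, blocked_host_case, blocked_lower_path)
```

    With this parser the first two URLs are blocked and the third is not, because the rule is compared with the requested path as a case-sensitive prefix.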

    Thursday, July 3, 2008 6:46 PM
  • Hi,

    I did not fully appreciate that robots.txt was case sensitive (oops), so there is some tidying up I need to do.

     

    The problem I seem to be having is that IIS is not case sensitive and msnbot is.

    The full case in IIS of one of our directories is /TransportDirect/..... but obviously IIS will serve any permutation of case.

     

    Therefore, to stop MSNBOT from crawling this, I must list every possible combination of case that someone may use in their referral to me, i.e. /TRANSPORTDIRECT/ ...to.... /TrAnSpOrTdIrEcT/ ...to... /transportdirect/
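
    To show why listing every combination is not practical, here is a rough Python sketch that enumerates the case variants of a single directory name. The helper name case_variants is hypothetical and only there for illustration.

```python
from itertools import product

def case_variants(segment):
    """Yield every upper/lower-case spelling of a path segment."""
    # Each letter contributes two choices; other characters contribute one.
    choices = [
        (ch.lower(), ch.upper()) if ch.isalpha() else (ch,)
        for ch in segment
    ]
    for combo in product(*choices):
        yield "".join(combo)

variants = set(case_variants("transportdirect"))
variant_count = len(variants)

# 15 letters, two cases each: 2 ** 15 spellings for this one folder name.
print(variant_count)  # 32768
```

    That is 32,768 Disallow lines for one folder name alone, before any sub-folder or file names are considered.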

     

    Our current robots.txt:

    User-agent: msnbot
    Disallow: /Web2/JourneyPlanning/JourneyPlannerInput.aspx

     

    User-agent: *
    Disallow: /Web2/JourneyPlanning/JPLanding.aspx
    Disallow: /Web2/JourneyPlanning/jplandingpage.aspx
    Disallow: /Web2/JourneyPlanning/JPLandingPage.aspx
    Disallow: /NaptanViewer/
    Disallow: /TDPWebservices/
    Disallow: /Transportdirect/
    Disallow: /Web/

    From our IIS logs the following can be seen:

    2008-07-02 11:01:03 65.55.105.10 - 10.93.108.200 80 GET /transportdirect/en/journeyplanning/jplandingpage.aspx id=traintaxi.co.uk&do=n&oo=n&o=&on=&d=9100LMTNTWN&dn=Lymington%20Town%20Station 302 0 411 219 www.transportdirect.info msnbot/1.1+(+http://search.msn.com/msnbot.htm) - -

     

    I have checked the referring site and they have the URL as shown in the log above.

     

    Regards,

     

    Friday, July 4, 2008 4:42 PM
  • ok,

     

    Use the actual directory or URL name, i.e. TransportDirect, not transportdirect. What matters is the actual name of the URL and how it is capitalized, not what someone may type into their browser to get to your site.

     

    Does this help?

    Wednesday, July 9, 2008 3:14 AM
  • I am sorry but I cannot see how this can be correct.

     

    Please bear in mind my original problem is with MSNBOT following referral URLs from external websites.

     

    How does MSNBOT know the actual case of my URL (as served by IIS) until it has crawled it?

    (If it can do this, it is pretty clever; can it pick next week's lottery numbers as well?)

     

    MSNBOT obviously cannot know this information beforehand, therefore you MUST be comparing the referral URL to robots.txt. How else can MSNBOT decide whether to crawl or not?

     

     

     

    Wednesday, July 9, 2008 1:52 PM
  •  

    Got it.

     

    Sorry, I think I was focusing on something else. You can use page-level meta tags such as <meta name="robots" content="noindex, nofollow">. We have posted an article on the different meta directives available: Robots Exclusion Protocol: Joining Together to Provide Better Documentation. You can also use the X-Robots-Tag HTTP header for dynamically served content, which for ASP looks something like:

     

    Response.AddHeader "X-Robots-Tag", "insert Meta tag directive here"
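
    For comparison, a minimal sketch of the same idea outside ASP, written as a tiny Python WSGI app. The names app and call_app are hypothetical and the directive value is just an example. Note that a crawler only sees this header once it has already requested the URL.

```python
def app(environ, start_response):
    """Tiny WSGI app that marks its response as noindex, nofollow."""
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        # Same directive values as the robots meta tag, sent as a header.
        ("X-Robots-Tag", "noindex, nofollow"),
    ]
    start_response("200 OK", headers)
    return [b"dynamic content\n"]

def call_app(path):
    """Call the app directly (no server) and return status, headers, body."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body

status, headers, body = call_app("/some/dynamic/page")
print(status, headers["X-Robots-Tag"])
```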

     

     

    Hope this helps.

    Wednesday, July 9, 2008 4:01 PM
  • Hi Brett,

     

    Thank you for your previous reply.

     

    I am obviously not communicating my issue clearly.... or possibly I don't understand how the answers being supplied are applicable to my situation.

     

    The high level problem I have is I don't want any bots to crawl a specific URL on my site. This particular URL has active content and bot visits are playing havoc with my MI.

     

    The specific problem I am having is that MSNBOT keeps coming back to this specific URL by following back links (referral URLs) from other websites, even though I keep trying to tell it not to in my robots.txt.

     

    The fault path as I see it is:

     

    1) This specific URL(s) is being added to an external website by their webmaster (this is OK for them to do, and I want them to do this)

    AND

    2) MSNBOT is finding this referral URL when crawling the external website and adding it to the MSNBOT crawl list.

    AND

    3) MSNBOT is then using this referral URL in a crawl of MY SITE

     

    This can be seen in the web server log:

    2008-07-02 11:01:03 65.55.105.10 - 10.93.108.200 80 GET /transportdirect/en/journeyplanning/jplandingpage.aspx id=traintaxi.co.uk&do=n&oo=n&o=&on=&d=9100LMTNTWN&dn=Lymington%20Town%20Station 302 0 411 219 www.transportdirect.info msnbot/1.1+(+http://search.msn.com/msnbot.htm) - -

     

    AND

    4) MSNBOT is clearly ignoring my robots.txt entries because they are in the wrong case when compared to the referral URL.

     

    BUT

     

    5) MSNBOT cannot know the actual true case of my URL without first crawling it. CATCH-22: you have just crawled the URL to find the case.......

     

     

    The only two things MSNBOT does know, or can find out, before actually crawling the referral URL it got from the external site is

    a) the referral URL itself    AND
    b) the entries in the robots.txt file

     

    As you are obviously crawling my site, the original assertion I made must be correct, namely that MSNBOT is comparing the CASE of the referral URL obtained from the EXTERNAL website against my robots.txt.

    If they don't match, the URL is OK to crawl.


    The problem I have is that I don't control how people create their referral URLs (back links) to me.........


    QED

    Whether a DISALLOW directive is respected appears to depend on the case of the referral URL matching the case used in robots.txt.


    This is BAD.

     

    Please can you frame your answer in the context of the problem I am seeing. If any part of my fault path is wrong, please point it out, as I will be able to tie it into the data I am seeing.

     

    Many thanks

     

     

     

     

     

    Wednesday, July 9, 2008 5:02 PM
  • Throwing this out here now...

     

    Hmm, how about using a canonicalization technique like Apache's RedirectMatch, so that any mis-cased URLs get redirected to the preferred path, which then has an X-Robots-Tag HTTP header specifying noindex, nofollow?
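
    A rough sketch of that idea in Python rather than Apache config, just to show the logic. The helper names canonical_form and respond are hypothetical, and the preferred casing /TransportDirect/ is assumed from the IIS directory name mentioned earlier in this thread.

```python
# Assumed preferred casing, taken from the IIS directory name in this thread.
CANONICAL_PREFIXES = ["/TransportDirect/"]

def canonical_form(path, prefixes=CANONICAL_PREFIXES):
    """Return path with a known prefix rewritten to its preferred case.

    Paths that match no known prefix are returned unchanged.
    """
    for prefix in prefixes:
        head, tail = path[:len(prefix)], path[len(prefix):]
        if head.lower() == prefix.lower():
            return prefix + tail
    return path

def respond(path):
    """Return (status, headers) for a request path (hypothetical helper)."""
    target = canonical_form(path)
    if target != path:
        # Mis-cased URL: send a permanent redirect to the preferred path.
        return 301, {"Location": target}
    if any(path.startswith(prefix) for prefix in CANONICAL_PREFIXES):
        # Preferred path: serve it, but ask crawlers not to index or follow.
        return 200, {"X-Robots-Tag": "noindex, nofollow"}
    return 200, {}

print(respond("/transportdirect/en/journeyplanning/jplandingpage.aspx"))
```

    The mis-cased request gets a 301 to the /TransportDirect/... spelling, and only that preferred spelling is served with the X-Robots-Tag header.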

    Wednesday, July 9, 2008 9:08 PM
  • My website cannot do anything until MSNBOT decides to crawl me, at which point it is too late - msnbot has just hit me.

    <THIS IS THE PROBLEM>

     

    MSNBOT should be carrying out the canonicalization of at least the folders, and ideally the files as well, that are mentioned in my robots.txt!!
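
    Roughly what I mean, as a Python sketch. This is simple prefix matching and NOT a claim about how msnbot actually works; the function name is_disallowed and its ignore_case switch are made up for illustration. The rules are the folder entries from my robots.txt above and the path is the one from my IIS log.

```python
def is_disallowed(path, disallow_rules, ignore_case=False):
    """Return True if path starts with any Disallow prefix."""
    if ignore_case:
        # Fold both sides to lower case before comparing.
        path = path.lower()
        disallow_rules = [rule.lower() for rule in disallow_rules]
    # Empty rules are skipped: an empty Disallow blocks nothing.
    return any(path.startswith(rule) for rule in disallow_rules if rule)

RULES = ["/NaptanViewer/", "/TDPWebservices/", "/Transportdirect/", "/Web/"]
REQUESTED = "/transportdirect/en/journeyplanning/jplandingpage.aspx"

strict = is_disallowed(REQUESTED, RULES)
relaxed = is_disallowed(REQUESTED, RULES, ignore_case=True)

print(strict, relaxed)
```

    With a case-sensitive comparison the URL slips through; with case folded on both sides it is blocked, which is the behaviour I was expecting.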

     

    The intention indicated in my robots.txt is clear - why can't msnbot understand this? No one else's bots cause this problem!!!

     

     

    Friday, July 11, 2008 1:54 PM
  • Does anyone have any suggestions about how to resolve the issue I am having, where MSNBOT appears to ignore my robots.txt?

     

    To give a sense of the scale of this problem: MSNBOT usually hits our site approx. 15,000 times a night. On one night this figure jumped to 30,000, with the extra traffic on a URL that is actually disallowed in robots.txt.

     

    Regards

    Steve

     

     

    Friday, August 1, 2008 4:02 PM
  • Answered on another thread.

     

    Monday, August 25, 2008 8:21 PM