Solution for "most" MSNBot crawl/indexing issues

  • General discussion


    Hello Fellow Webmasters:


    In the past few months I have done some tinkering and testing, and I believe I have found the solution to "most" of the MSNBot indexing woes many have been complaining about. The good news is that the fix is a very simple one, and it should have no bearing on any other search engine. With that being said, on to the matter at hand.


    Many WYSIWYG web page editors are actually encoding HTML/XHTML documents "wrong". Well, not wrong, but in a format that is not yet widely supported enough. This encoding is UTF-8.


    I have found that MSNBot seems to have problems with UTF-8 encoded pages. Also, if you write valid HTML/XHTML and validate your sites with the W3C page validator, you will find that a page encoded in UTF-8 will pass validation, but it will give you a "warning" stating that the validator had to do "guesswork" and there may still be problems.


    This is an important thing I overlooked at first. Three months ago, I decided to test it.


    I changed the encoding of 5 pages on my site from UTF-8 to ISO-8859-1. I changed nothing else on the pages.


    These pages, which had gone 2 months without being indexed, were indexed by MSNBot within 4 days.


    After changing all 24 pages of my website to the ISO-8859-1 encoding, MSNBot indexed my entire site within a week, where previously it had gone unindexed for 2 months.


    I don't know why, but for some reason, in my case at least, I have observed that MSNBot does not like UTF-8 encoded pages.




    The fix:


    Look in the head of your HTML document for this tag:


    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />


    And change it to:


    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
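
    One thing worth keeping in mind with the change above: editing the tag only changes the label, not the bytes saved in the file. Below is a minimal sketch in Python, assuming the page is currently saved as UTF-8 and uses exactly the meta tag form shown above. The function name switch_to_latin1 is hypothetical. It rewrites the declaration and re-encodes the page in one step, and it raises an error if the page contains a character that ISO-8859-1 cannot represent.

```python
import re

# Matches "charset=utf-8" as it appears in the Content-Type meta tag above
# (case-insensitive). Other declaration styles are NOT handled by this sketch.
CHARSET_RE = re.compile(r"(charset\s*=\s*)utf-8", re.IGNORECASE)


def switch_to_latin1(html_bytes):
    """Relabel a UTF-8 page as ISO-8859-1 and re-encode its bytes to match."""
    text = html_bytes.decode("utf-8")
    text = CHARSET_RE.sub(r"\g<1>iso-8859-1", text)
    # Strict encoding: raises UnicodeEncodeError if any character
    # has no ISO-8859-1 equivalent, instead of silently corrupting it.
    return text.encode("iso-8859-1")


page = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
print(switch_to_latin1(page))
```

    If the call raises UnicodeEncodeError, the page uses characters outside ISO-8859-1 and should probably be left as UTF-8.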


    After making this one simple change to all my pages, MSNBot picked them up immediately. I made no other changes. MSNBot now indexes all my new pages within 3 days with no problems.


    After making this change to your web pages, upload them and submit your new sitemap.


    Be sure to update the "date modified" in your sitemap before submitting it to Live.
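
    The "date modified" update can be scripted. This is a rough sketch, assuming the sitemap follows the sitemaps.org 0.9 protocol, where the modification date is the lastmod element. The function name touch_lastmod and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace used by sitemaps.org protocol 0.9 sitemaps.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def touch_lastmod(sitemap_xml, changed_urls, today=None):
    """Set <lastmod> to today's date for every <url> whose <loc> is in changed_urls."""
    ET.register_namespace("", NS)  # keep the output free of ns0: prefixes
    stamp = (today or date.today()).isoformat()
    root = ET.fromstring(sitemap_xml)
    for url in root.findall(f"{{{NS}}}url"):
        loc = url.find(f"{{{NS}}}loc")
        if loc is None or not loc.text or loc.text.strip() not in changed_urls:
            continue
        lastmod = url.find(f"{{{NS}}}lastmod")
        if lastmod is None:
            lastmod = ET.SubElement(url, f"{{{NS}}}lastmod")
        lastmod.text = stamp
    return ET.tostring(root, encoding="unicode")


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://www.example.com/a.html</loc><lastmod>2008-06-01</lastmod></url>"
    "<url><loc>http://www.example.com/b.html</loc></url>"
    "</urlset>"
)
print(touch_lastmod(sample, {"http://www.example.com/a.html"}))
```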


    Your site should now get indexed properly.


    Hope this tip helps some people out; it took me a while to figure out. Just because Google says UTF-8 is OK doesn't mean it is. Like I said, the W3C validator warns against using UTF-8 for this very reason: many apps don't support it. It is not widely supported enough yet, and many web bots don't support its use.


    Also, it seems the Yahoo bot and Googlebot love my pages even more since I changed this encoding. Give it a shot; I bet MSNBot will be more than happy to index you once you make this change.

    Thursday, September 4, 2008 4:36 PM

All replies


    This is certainly an interesting experiment, but it is dangerous advice.


    If your Web page uses any of the roughly 1 million Unicode code points that UTF-8 can encode but that are not in the (tiny) ISO-8859-1 character set (which contains only 191 "characters"), switching your charset attribute's value as suggested above will cause those characters to not render properly!
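
    One way to look before switching is to list the characters on a page that ISO-8859-1 cannot represent. A minimal sketch in Python follows; the function name chars_outside_latin1 and the sample string are made up for illustration. Note that Python's "iso-8859-1" codec accepts all 256 code points from U+0000 to U+00FF, control codes included, so this is a slightly looser check than the 191 printable characters.

```python
def chars_outside_latin1(text):
    """Return the distinct characters in text that ISO-8859-1 cannot encode,
    in order of first appearance."""
    missing = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


sample = "Café Widgets™ cost 5€"
print(chars_outside_latin1(sample))  # ['™', '€']
```

    An empty list means the page text would survive the switch; anything else names the characters that would break.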

    Thursday, September 4, 2008 10:35 PM
  • True cichanlx, but "most" folks are not using the characters you mention. I also have content="Us-Eng" declared in my pages as well as the ISO-8859-1 encoding.


    That being said, when making your web page you have to ask yourself, "Do I really need Unicode?"


    I mean, I could see it if you're using certain special characters or different languages and whatnot, but for a normal English-only website I see no need for Unicode. That's just my opinion.


    As I said, and you made a good point, though: make sure you're not using characters that are not included in the ISO format, obviously.


    The parts of ISO 8859

    Standard       Name of alphabet           Characterization
    ISO 8859-1     Latin alphabet No. 1       "Western", "West European"
    ISO 8859-2     Latin alphabet No. 2       "Central European", "East European"
    ISO 8859-3     Latin alphabet No. 3       "South European"; "Maltese & Esperanto"
    ISO 8859-4     Latin alphabet No. 4       "North European"
    ISO 8859-5     Latin/Cyrillic alphabet    (for Slavic languages)
    ISO 8859-6     Latin/Arabic alphabet      (for the Arabic language)
    ISO 8859-7     Latin/Greek alphabet       (for modern Greek)
    ISO 8859-8     Latin/Hebrew alphabet      (for Hebrew and Yiddish)
    ISO 8859-9     Latin alphabet No. 5       "Turkish"
    ISO 8859-10    Latin alphabet No. 6       "Nordic" (Sámi, Inuit, Icelandic)
    ISO 8859-11    Latin/Thai alphabet        (for the Thai language)
    (Part 12 has not been defined.)
    ISO 8859-13    Latin alphabet No. 7       Baltic Rim
    ISO 8859-14    Latin alphabet No. 8       Celtic
    ISO 8859-15    Latin alphabet No. 9       "euro"
    ISO 8859-16    Latin alphabet No. 10      for South-Eastern Europe


    For most websites out there, at least 95% of them really shouldn't be using UTF-8 or Unicode, simply because they are not going to make full use of the character sets. Most web pages I see encoded in UTF-8 are using "zero" characters that are "not" available in ISO-8859-1.


    It is a fact that UTF-8 is not "widely" supported yet on the web, not to the extent of ISO-8859-1.


    I mean, it does seem MSNBot and other web crawlers such as Yahoo's are more "friendly" to ISO encoded pages. I would assume they are easier for the bots to parse and read.


    I'm not saying a UTF-8 encoded page won't get indexed. It just seems, in my "observations", that ISO-8859-1 is a much better option. For example, Google Webmaster Tools says that if you "don't" specify an encoding for your page it "defaults" to ISO...


    I'm just saying ISO is better supported "right now". A few years down the road, though, as the web continues to change, UTF-8 and other Unicode encodings should receive much better support.


    Friday, September 5, 2008 2:19 AM
  • Your original advice said, "Leap". I am saying, "Look before you leap." :] In response, you claim that for 95% of those who might find your advice, it's okay to just leap. Your general point is a good one: some web authors are using character sets that include many characters they are not using. That does not make it any less dangerous to tell everyone to leap.


    By the way, 99.9999997899% of statements like your "95% of them really shouldn't be using utf-8" are made up.  95%, huh? 


    Should not be using UTF-8, you say? There is nothing inherently bad about using UTF-8, even if you are not using any of its many code points that are not in some other character set. The nice thing about using it, even if you don't need it, is that you won't have to change the charset in all your pages when you need to use one of the million characters available in UTF-8 that are not in ISO-8859-1. My site, like many others, uses ™, which is not in ISO-8859-1.
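
    To illustrate, here is a small Python sketch (the function name compare_encodings is made up for this example). It encodes the same text both ways. Plain ASCII text comes out as identical bytes under either charset, while ™ has a UTF-8 encoding and no ISO-8859-1 encoding at all.

```python
def compare_encodings(text):
    """Return (utf8_bytes, latin1_bytes); latin1_bytes is None when the
    text contains a character ISO-8859-1 cannot represent."""
    utf8 = text.encode("utf-8")
    try:
        latin1 = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        latin1 = None
    return utf8, latin1


print(compare_encodings("Plain English text"))
# (b'Plain English text', b'Plain English text')
print(compare_encodings("Widgets™"))
# (b'Widgets\xe2\x84\xa2', None)
```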


    Your original point about the Live bot not seeming to like UTF-8 is still interesting, though my site uses UTF-8 and I have had no apparent problems from it. My original reply, that your advice was dangerous, is still valid.




    Friday, September 5, 2008 4:38 PM