FAQ: How to get your site indexed by Microsoft Academic Search

  • Monday, September 27, 2010 9:35 AM
    Owner
     
     

    Microsoft Academic Search uses a focused crawler to fetch data from the Internet. The following tips for website administrators can help a website get indexed by Academic Search quickly and easily.

    1.       Crawler name and IP addresses

    The Microsoft Academic Search crawler is called “librabot”. The user agent string in its HTTP requests is “librabot/2.0 (+http://academic.research.microsoft.com/)”. The crawler’s IP ranges are 219.142.53.0/25 and 202.96.51.128/25, plus the single address 131.107.65.248. An HTTP request from the crawler looks like this:

    GET http://www.microsoft.com/ HTTP/1.0
    Host: www.microsoft.com
    User-Agent: librabot/2.0 (+http://academic.research.microsoft.com/)
    Accept: text/html, text/plain, text/xml, application/*
    Accept-Encoding: identity;q=1.0
    From: librabot@microsoft.com

     

    Please make sure that your website accepts requests from these IP addresses and from this user agent.
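
    If you want to check your server logs or firewall rules against the values above, a small helper along the following lines may be useful. This is only a sketch: the function name is_librabot is made up for illustration, and it assumes the IP ranges and user agent listed in this FAQ are still current.

```python
import ipaddress

# IP ranges and user-agent token as listed in this FAQ (may change over time).
LIBRABOT_NETWORKS = [
    ipaddress.ip_network("219.142.53.0/25"),
    ipaddress.ip_network("202.96.51.128/25"),
    ipaddress.ip_network("131.107.65.248/32"),  # single address
]
LIBRABOT_TOKEN = "librabot"


def is_librabot(remote_ip: str, user_agent: str) -> bool:
    """Return True if both the source IP and the user agent match librabot.

    Hypothetical helper: checks membership in the published ranges and
    looks for the crawler name in the User-Agent header.
    """
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Not a valid IP address string.
        return False
    ip_ok = any(addr in net for net in LIBRABOT_NETWORKS)
    ua_ok = LIBRABOT_TOKEN in user_agent.lower()
    return ip_ok and ua_ok
```

    For example, is_librabot("219.142.53.10", "librabot/2.0 (+http://academic.research.microsoft.com/)") matches both conditions, while the same user agent coming from an address outside the listed ranges does not.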

    Academic Search follows the robots.txt protocol. If your website has a robots.txt file, please make sure it does not block the crawler. If you want to control how often the crawler accesses your site, use the Crawl-delay directive, and keep its value small.
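
    One way to sanity-check a robots.txt file before publishing it is to run it through Python's standard urllib.robotparser. The sketch below is an assumption-laden illustration: the robots.txt content, the www.example.org URLs, the Crawl-delay value of 1 and the helper name check_robots are all hypothetical, and Python's parser may not interpret every rule exactly as librabot does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: librabot may crawl everything except /private/,
# with a small crawl delay; all other crawlers are blocked.
ROBOTS_TXT = """\
User-agent: librabot
Crawl-delay: 1
Disallow: /private/

User-agent: *
Disallow: /
"""


def check_robots(robots_txt: str, url: str, agent: str = "librabot"):
    """Return (allowed, crawl_delay) for the given agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


if __name__ == "__main__":
    for url in ("http://www.example.org/papers/p1.pdf",
                "http://www.example.org/private/draft.pdf"):
        allowed, delay = check_robots(ROBOTS_TXT, url)
        print(url, "allowed:", allowed, "crawl delay:", delay)
```

    If the first value comes back False for a page you want indexed, the file is blocking the crawler for that URL.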

    2.       Sitemap protocol

    Our crawler supports the Sitemap protocol. If you want a large amount of important content to be indexed quickly, list the URLs of that content in a sitemap, and be sure to update the sitemap whenever the content on your website changes.
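
    A sitemap can be generated from whatever list of URLs your site already maintains. The following is a minimal sketch using only the Python standard library; the function name build_sitemap, the example URLs and the dates are hypothetical, and it emits just the loc and lastmod elements of the Sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries) -> str:
    """Build a sitemap XML string.

    entries: iterable of (url, lastmod) pairs, where lastmod is a
    'YYYY-MM-DD' string or None if the modification date is unknown.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical paper URLs; regenerate the file when content changes.
    print(build_sitemap([
        ("http://www.example.org/papers/p1.pdf", "2010-09-01"),
        ("http://www.example.org/papers/p2.pdf", None),
    ]))
```

    Regenerating the file as part of your publishing process is one way to keep it in step with the site.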

    3.       Make your website easy to parse and explore

    Don’t rely on complex dynamic web technologies such as Flash or Ajax to deliver your important content. We suggest serving a simplified version of that content to the crawler and keeping the complex version for human visitors.

    Don’t make your important content hard to discover. If content can only be reached by submitting a query, our crawler cannot index it. We suggest creating a list page that links to the URLs of all your important content.

    We also suggest publishing papers in PDF or Word format rather than HTML.
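
    The list page suggested above can be a plain static HTML file with one link per document, so that the crawler can reach everything without issuing a query. Here is a rough sketch of generating such a page; the function name build_list_page, the page title and the URLs are hypothetical.

```python
from html import escape


def build_list_page(title: str, links) -> str:
    """Return a static HTML page listing every (url, text) pair as a link."""
    items = "\n".join(
        f'    <li><a href="{escape(url, quote=True)}">{escape(text)}</a></li>'
        for url, text in links
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        f"  <h1>{escape(title)}</h1>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Hypothetical documents that would otherwise only be reachable via search.
    print(build_list_page("All papers", [
        ("http://www.example.org/papers/p1.pdf", "Paper 1"),
        ("http://www.example.org/papers/p2.doc", "Paper 2"),
    ]))
```

    Linking this page from the home page, or listing it in the sitemap, gives the crawler a starting point for discovery.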

     

    4.       Please contact us directly if your website is not indexed. We will be happy to analyze the problem.

