FAQ: How to make your site indexed by Microsoft Academic Search

  • Monday, September 27, 2010 9:35 AM
    Owner
     
     

    Microsoft Academic Search uses a focused crawler to fetch data from the Internet. The following tips can help website administrators get their sites indexed by Academic Search easily and quickly.

    Crawler name and IP

    The Microsoft Academic Search crawler is called “librabot”. The user agent string in its HTTP requests is “librabot/2.0 (+http://academic.research.microsoft.com/)”. Our crawler’s IP ranges are 219.142.53.0/25, 202.96.51.128/25 and 131.107.65.248. An HTTP request from our crawler looks like this:

    GET http://www.microsoft.com/ HTTP/1.0
    Host: www.microsoft.com
    User-Agent: librabot/2.0 (+http://academic.research.microsoft.com/)
    Accept: text/html, text/plain, text/xml, application/*
    Accept-Encoding: identity;q=1.0
    From: librabot@microsoft.com

    Please make sure that your website accepts requests from these IP addresses and from this user agent.
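A minimal sketch of how an administrator might recognize the crawler in access logs or firewall rules, assuming only the IP ranges and user agent prefix quoted above. The function name `looks_like_librabot` is hypothetical, and the single address is treated as a one-host (/32) network.

```python
import ipaddress

# Ranges quoted in the post; the lone address is written as a /32 network.
LIBRABOT_NETWORKS = [
    ipaddress.ip_network("219.142.53.0/25"),
    ipaddress.ip_network("202.96.51.128/25"),
    ipaddress.ip_network("131.107.65.248/32"),
]


def looks_like_librabot(remote_ip: str, user_agent: str) -> bool:
    """Return True if both the source address and the User-Agent match."""
    address = ipaddress.ip_address(remote_ip)
    in_range = any(address in network for network in LIBRABOT_NETWORKS)
    return in_range and user_agent.lower().startswith("librabot/")


agent = "librabot/2.0 (+http://academic.research.microsoft.com/)"
print(looks_like_librabot("219.142.53.10", agent))
print(looks_like_librabot("219.142.53.200", agent))
```

The second address falls outside the /25 range (which covers .0 to .127), so a request from it would not be treated as coming from the crawler even though the user agent matches.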

    Academic Search follows the robots.txt protocol. If your website has a robots.txt file, please make sure it does not block the crawler. If you want to control how often the crawler accesses your site, set the Crawl-delay value, and keep it small.
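One way to check a robots.txt file before deploying it is Python's standard `urllib.robotparser` module. The sketch below is only an illustration: the robots.txt content, the example.org URLs and the helper `check_robots` are assumptions, and the result shows how Python's parser reads the file, which may differ in detail from how the crawler itself reads it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: librabot may crawl everything except /private/,
# with a 2-second delay; all other crawlers are blocked.
ROBOTS_TXT = """\
User-agent: librabot
Crawl-delay: 2
Disallow: /private/

User-agent: *
Disallow: /
"""


def check_robots(robots_text: str, url: str, agent: str = "librabot"):
    """Return (allowed, crawl_delay) for the given agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


print(check_robots(ROBOTS_TXT, "http://www.example.org/papers/1.pdf"))
print(check_robots(ROBOTS_TXT, "http://www.example.org/private/draft.pdf"))
```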

    Sitemap protocol

    Our crawler supports the Sitemap protocol. If you want your important content to be indexed quickly and in quantity, list the URLs of that content in your sitemap, and be sure to update the sitemap when the content of your website changes.
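As a rough illustration, a sitemap can be generated from a list of paper URLs with the Python standard library. The URLs, dates and the helper `build_sitemap` below are made up for the example; the XML namespace is the one defined by the Sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, last_modified) pairs (dates as YYYY-MM-DD)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical paper URLs on an example site.
sitemap_xml = build_sitemap([
    ("http://www.example.org/papers/2010-01.pdf", "2010-09-01"),
    ("http://www.example.org/papers/2010-02.pdf", "2010-09-20"),
])
print(sitemap_xml)
```

Regenerating the file whenever a paper is added or changed keeps the `lastmod` values current, which is the update step the post asks for.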

    Make the parsing and exploration of your website easier

    Don’t use complex dynamic web technologies such as Flash or Ajax for your important content. We suggest that you serve a simplified version of your important content to the crawler and keep the complex version for real users.

    Don’t make your important content hard to discover. If some content can only be found by issuing a query, our crawler can’t index it. We suggest you create a list page that links to the URLs of all your important content.
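A list page of this kind can be a plain static HTML file. The sketch below shows one possible way to generate it; the paper titles, URLs and the function `build_list_page` are hypothetical.

```python
from html import escape


def build_list_page(title, papers):
    """Return a static HTML page linking to every (name, url) pair."""
    items = "\n".join(
        f'  <li><a href="{escape(url, quote=True)}">{escape(name)}</a></li>'
        for name, url in papers
    )
    return (
        "<html><head><title>" + escape(title) + "</title></head>\n"
        "<body>\n<ul>\n" + items + "\n</ul>\n</body></html>\n"
    )


# Hypothetical papers that would otherwise be reachable only through a search form.
page = build_list_page("All papers", [
    ("Paper A & B", "http://www.example.org/papers/a.pdf"),
    ("Paper C", "http://www.example.org/papers/c.pdf"),
])
print(page)
```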

    We also suggest you put papers into PDF or Word format, rather than HTML.

    Contact us

    Please contact us directly if your website is not indexed. We will be happy to analyze the problem.




All Replies

  • Saturday, October 30, 2010 10:42 PM
     
     

    Hi,

    Does the Microsoft Academic Search crawler use any bibliographic metadata (that is generally embedded in HTML META tags)?

    Best regards,

    --Martin

  • Saturday, November 06, 2010 10:14 AM
    Moderator
     
     

    In the current version, our crawler doesn't read special HTML META tags.

    However, we support the OAI interface (Dublin Core format). If you have metadata for a large number of papers and provide it through an OAI interface, please tell us the address of that interface.
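For readers unfamiliar with OAI, the sketch below shows what a Dublin Core record list looks like from the harvesting side. It assumes a standard OAI-PMH endpoint; the base URL, the sample response and the helper names are invented for illustration, and nothing here describes how the Academic Search team actually processes such feeds.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up response with a single record, trimmed to the relevant elements.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Hypothetical Paper</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url: str) -> str:
    """Build a ListRecords request for Dublin Core metadata."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    return base_url + "?" + query


def extract_titles(xml_text: str):
    """Return the first dc:title of every record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    titles = []
    for record in root.iterfind(".//oai:record", NS):
        title = record.find(".//dc:title", NS)
        titles.append(title.text if title is not None else None)
    return titles


print(list_records_url("http://www.example.org/oai/request"))
print(extract_titles(SAMPLE_RESPONSE))
```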

  • Wednesday, December 01, 2010 2:54 AM
    Moderator
     
     

    The following table shows the sites from which we have crawled papers so far. Only the top 100 sites are listed.

     

    Site    Number of papers
    redalyc.uaemex.mx 35546
    hal.archives-ouvertes.fr 31267
    cancerres.aacrjournals.org 28334
    bloodjournal.hematologylibrary.org 27094
    www.aaai.org 19482
    jvi.asm.org 19130
    www.aclweb.org 17260
    nar.oxfordjournals.org 17096
    www.math.ethz.ch 16577
    www.emis.de 16417
    jb.asm.org 16132
    www.cs.cmu.edu 15722
    aem.asm.org 15703
    emis.maths.adelaide.edu.au 15653
    research.microsoft.com 15438
    www.maths.soton.ac.uk 15042
    emis.luc.ac.be 14839
    www.scielo.br 14780
    reference.kfupm.edu.sa 14631
    jas.fass.org 14572
    www.plantphysiol.org 14080
    jds.fass.org 13557
    www.maths.tcd.ie 13515
    iai.asm.org 13393
    jcm.asm.org 13378
    ageconsearch.umn.edu 12829
    www.jneurosci.org 12482
    www.jimmunol.org 12387
    mcb.asm.org 12347
    www.genetics.org 12220
    www.wseas.us 11588
    jcb.rupress.org 11456
    www.math.helsinki.fi 11446
    jn.nutrition.org 11401
    epaper.kek.jp 11302
    admin.xosn.com 11173
    www.ams.org 10626
    jem.rupress.org 10624
    circ.ahajournals.org 10612
    aac.asm.org 10171
    vir.sgmjournals.org 10089
    www.biomedcentral.com 9886
    www.clinchem.org 9885
    www.fs.fed.us 9701
    mic.sgmjournals.org 9592
    www.ajronline.org 9434
    accelconf.web.cern.ch 9277
    www.mat.ub.es 9236
    emis.maths.tcd.ie 8983
    jcem.endojournals.org 8763
    www.stanford.edu 8762
    emis.library.cornell.edu 8666
    www.iovs.org 8034
    acl.ldc.upenn.edu 8007
    emis.math.ca 7958
    intl.plantphysiol.org 7857
    www.emis.math.ca 7811
    www.ias.ac.in 7786
    www.nber.org 7640
    www.anesthesia-analgesia.org 7622
    www.univie.ac.at 7447
    web.mit.edu 7390
    www.clevelandfed.org 7308
    www.wjgnet.com 7190
    mathnet.preprints.org 7094
    endo.endojournals.org 7036
    www.akademik.unsri.ac.id 6910
    ams.confex.com 6805
    content.onlinejacc.org 6799
    humrep.oxfordjournals.org 6759
    ats.ctsnetjournals.org 6757
    stroke.ahajournals.org 6746
    www.emis.ams.org 6608
    www.jlr.org 6563
    content.nejm.org 6534
    www.molbiolcell.org 6515
    www.ajcn.org 6449
    ndt.oxfordjournals.org 6435
    emis.math.tifr.res.in 6349
    wing.comp.nus.edu.sg 6259
    www.academicjournals.org 6146
    reports-archive.adm.cs.cmu.edu 6146
    www.jleukbio.org 6048
    jeb.biologists.org 6001
    iahs.info 5979
    www.usenix.org 5961
    www.ars.usda.gov 5856
    bioinformatics.oxfordjournals.org 5827
    www.cc.gatech.edu 5724
    www.princeton.edu 5587
    infoscience.epfl.ch 5565
    circres.ahajournals.org 5556
    hyper.ahajournals.org 5541
    jnm.snmjournals.org 5470
    pediatrics.aappublications.org 5441
    halshs.archives-ouvertes.fr 5359
    jac.oxfordjournals.org 5203
    www.ajnr.org 5180
    ajrccm.atsjournals.org 5048
    www.slac.stanford.edu 5048

  • Sunday, December 19, 2010 10:01 AM
     
     
    Surprisingly, arxiv.org is not crawled (it hosts about 650k papers).
  • Monday, December 20, 2010 3:17 AM
     
     

    Hi Vincent,

    Thanks for your feedback! The table lists only the top 100 sites from which we crawled papers. arxiv.org is in fact crawled as well.

    For example, you can view this page: http://academic.research.microsoft.com/Paper/120150.aspx 

    "arxiv.org" is among the view and download links.

    Best wishes


    Microsoft Academic Search Team
  • Sunday, December 26, 2010 11:06 PM
     
     

    How can I give you the URL? Can you add the URL below to your crawler list?

    http://virtualbib.fgv.br/oai/request

    This is the URL of the OAI-PMH interface of Fundação Getulio Vargas' digital library:

    http://academic.research.microsoft.com/Organization/7003

    Cheers,

    Alexandre Rademaker

     

  • Tuesday, December 28, 2010 3:15 AM
     
     

    Hi arademaker,

    Thanks for your information! We have received your request and notified our team members so that they can make the necessary changes. Please check back later; because of our process, the change may not appear online right away. We appreciate your patience and continued support!

    Best wishes


    Microsoft Academic Search Team
  • Monday, January 03, 2011 12:15 AM
     
     

    Many thanks Caroline! I am a researcher at Getulio Vargas Foundation and also the project manager of our Digital Library. Let me know if there is anything we can do to help your crawler...

    Happy new year!

    Cheers,

    Alexandre

  • Monday, January 03, 2011 6:11 AM
     
     

    Happy New Year to you and your team, too.


    Microsoft Academic Search Team
  • Monday, January 03, 2011 10:36 AM
    Moderator
     
     

    Hi arademaker,

    Could you provide detailed information about your OAI service, such as the format and any user account needed? You can send mail directly to qingyu@microsoft.com

  • Wednesday, April 03, 2013 11:17 AM
     
     
    Respected Sir/Madam,
    Please help us include our journal in the Microsoft Academic database. Some of our manuscripts have already been added by the authors, but the journal name (International Journal of Research in Computer Science) is not available in the index. Please help us in this regard.
    Following is the link to our journal's OAI data.
    http://ijorcs.org/oai/oai2.php?verb=ListRecords&metadataPrefix=oai_dc
    • Edited by White Globe Wednesday, April 03, 2013 11:18 AM