Monday, September 27, 2010 9:35 AM Owner
Microsoft Academic Search uses a focused crawler to fetch data from the Internet. The following tips for website administrators can help a website be indexed by Academic Search easily and quickly.
Crawler name and IP
The Microsoft Academic Search crawler is called “librabot”. The user agent string in the HTTP request is “librabot/2.0 (+http://academic.research.microsoft.com/)”. Our crawler’s IP ranges are 18.104.22.168/25, 22.214.171.124/25, and 126.96.36.199. An HTTP request from our crawler looks like this:
GET http://www.microsoft.com/ HTTP/1.0
User-Agent: librabot/2.0 (+http://academic.research.microsoft.com/)
Accept: text/html, text/plain, text/xml, application/*
Please make sure that your website accepts requests from these IP addresses and from this user agent.
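One way to check the user agent part yourself is to send a request carrying the same headers as the sample above. The following is a minimal Python sketch, not an official tool: the function names and the www.example.org address are hypothetical, and the standard library sends HTTP/1.1 rather than the HTTP/1.0 shown in the sample. Because the request leaves from your own machine, it can reveal user agent filtering but not IP-based blocking.

```python
import urllib.request

# User agent string and Accept header copied from the sample request in this post.
LIBRABOT_UA = "librabot/2.0 (+http://academic.research.microsoft.com/)"
LIBRABOT_ACCEPT = "text/html, text/plain, text/xml, application/*"


def build_librabot_request(url):
    """Return a GET request carrying the same headers the crawler sends."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": LIBRABOT_UA, "Accept": LIBRABOT_ACCEPT},
        method="GET",
    )


def check_site(url, timeout=10):
    """Send the request and return the HTTP status code.

    urllib raises urllib.error.HTTPError when the server answers with an
    error status such as 403, which would suggest the user agent is blocked.
    """
    with urllib.request.urlopen(build_librabot_request(url), timeout=timeout) as resp:
        return resp.status


# Only builds the request; call check_site(...) to actually send it.
request = build_librabot_request("http://www.example.org/")  # hypothetical site
```

Calling check_site with your own home page URL should return 200 if the site does not filter on the user agent.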
Academic Search follows the robots.txt protocol. If your website has a robots.txt file, please make sure that it does not block the crawler. If you want to control how often the crawler accesses your site, set Crawl-delay, preferably to a small value.
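To see how a set of robots.txt rules treats the crawler, you can evaluate them locally with Python's standard urllib.robotparser. This is only a sketch: the rules and URLs below are invented for illustration, and Python's matching is not guaranteed to be identical to librabot's.

```python
from urllib.robotparser import RobotFileParser

LIBRABOT_UA = "librabot/2.0 (+http://academic.research.microsoft.com/)"

# Hypothetical robots.txt: librabot may crawl everything except /private/,
# with a 2-second delay; every other crawler is blocked.
SAMPLE_ROBOTS_TXT = """\
User-agent: librabot
Crawl-delay: 2
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

paper_allowed = parser.can_fetch(LIBRABOT_UA, "http://www.example.org/papers/1.pdf")
private_allowed = parser.can_fetch(LIBRABOT_UA, "http://www.example.org/private/draft.pdf")
other_allowed = parser.can_fetch("otherbot", "http://www.example.org/papers/1.pdf")
delay = parser.crawl_delay(LIBRABOT_UA)

print(paper_allowed, private_allowed, other_allowed, delay)
```

To test a live file instead, call parser.set_url(...) with the address of your robots.txt, then parser.read(), before can_fetch.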
Our crawler supports the Sitemap protocol. If you want your important content to be indexed quickly and in bulk, list its URLs in a sitemap, and be sure to update the sitemap whenever the content on your website changes.
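A sitemap can be generated with a few lines of script and regenerated whenever content changes. The sketch below uses Python's standard library and the sitemaps.org 0.9 namespace; the function name, URLs, and dates are made up for illustration, and which optional fields (such as lastmod) the crawler actually uses is not stated here.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod is optional in the protocol; W3C date format (YYYY-MM-DD).
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical paper URLs and modification dates.
sitemap_xml = build_sitemap([
    ("http://www.example.org/papers/1.pdf", date(2010, 9, 1)),
    ("http://www.example.org/papers/2.pdf", date(2010, 9, 20)),
])
print(sitemap_xml)
```

Save the output as sitemap.xml at the site root, or reference it from robots.txt with a Sitemap: line.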
Make the parsing and exploration of your website easier
Don’t use complex dynamic web technologies such as Flash or Ajax for your important content. We suggest that you serve a simplified version of that content to the crawler and keep the complex version for real users.
Don’t make your important content hard to discover. Content that can only be reached by issuing a query cannot be indexed by our crawler. We suggest that you create a list page that links to the URLs of all important content.
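A list page can be a plain static HTML file with one link per paper. As a rough sketch in Python (the function name, page title, and URLs are hypothetical), such a page could be generated like this:

```python
from html import escape


def build_list_page(title, paper_urls):
    """Return a static HTML page containing one plain link per paper URL."""
    items = "\n".join(
        f'    <li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in paper_urls
    )
    return (
        "<!DOCTYPE html>\n<html>\n"
        "<head><title>" + escape(title) + "</title></head>\n"
        "<body>\n  <ul>\n" + items + "\n  </ul>\n</body>\n</html>\n"
    )


# Hypothetical URLs; the second one would otherwise only be reachable via a query.
page = build_list_page("All papers", [
    "http://www.example.org/papers/1.pdf",
    "http://www.example.org/paper?id=2&format=pdf",
])
print(page)
```

Plain anchor links like these need no script execution or form submission, so a crawler can follow them directly.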
We also suggest you put papers into PDF or Word format, rather than HTML.
Please contact us directly if your website is not indexed. We will be happy to analyze the problem.
- Edited by Qing Yu (Microsoft Employee, Moderator) Wednesday, November 17, 2010 10:12 AM change crawler ip
- Changed Type Cherry CHE (Microsoft Employee, Owner) Tuesday, February 22, 2011 6:00 AM
- Edited by Cherry CHE (Microsoft Employee, Owner) Monday, December 12, 2011 6:07 AM
- Edited by Thomas, Academic Search Editor (Moderator) Monday, February 11, 2013 11:55 PM Made minor edits.
- Edited by Thomas, Academic Search Editor (Moderator) Monday, February 11, 2013 11:56 PM Made minor edits.
Saturday, October 30, 2010 10:42 PM
Saturday, November 06, 2010 10:14 AM Moderator
The current version of our crawler does not read special HTML META tags.
However, we support the OAI interface (Dublin Core format). If you have metadata for a large number of papers and provide it through an OAI interface, please tell us the address of that interface.
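For repository owners who want to check what their OAI interface exposes, here is a minimal Python sketch. It assumes the interface follows OAI-PMH 2.0 with the standard oai_dc metadata prefix; the base URL, the sample response, and the function names are invented for illustration and do not describe how our crawler is implemented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard namespaces for OAI-PMH 2.0 and Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url):
    """Build a ListRecords request asking for Dublin Core (oai_dc) metadata."""
    return base_url + "?" + urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})


def parse_records(xml_text):
    """Extract identifier, titles, and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [el.text for el in record.iterfind(".//dc:title", NS)],
            "creators": [el.text for el in record.iterfind(".//dc:creator", NS)],
        })
    return records


# Invented sample response, trimmed to the parts the parser reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Paper</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

request_url = list_records_url("http://repository.example.org/oai")  # hypothetical endpoint
records = parse_records(SAMPLE_RESPONSE)
print(request_url)
print(records)
```

If the records returned by your own endpoint parse cleanly like this, the address of that endpoint is what to send us.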
Wednesday, December 01, 2010 2:54 AM Moderator
The following table shows the sites from which we have crawled papers so far. Only the top 100 sites are listed.
Site PaperNumber
redalyc.uaemex.mx 35546
hal.archives-ouvertes.fr 31267
cancerres.aacrjournals.org 28334
bloodjournal.hematologylibrary.org 27094
www.aaai.org 19482
jvi.asm.org 19130
www.aclweb.org 17260
nar.oxfordjournals.org 17096
www.math.ethz.ch 16577
www.emis.de 16417
jb.asm.org 16132
www.cs.cmu.edu 15722
aem.asm.org 15703
emis.maths.adelaide.edu.au 15653
research.microsoft.com 15438
www.maths.soton.ac.uk 15042
emis.luc.ac.be 14839
www.scielo.br 14780
reference.kfupm.edu.sa 14631
jas.fass.org 14572
www.plantphysiol.org 14080
jds.fass.org 13557
www.maths.tcd.ie 13515
iai.asm.org 13393
jcm.asm.org 13378
ageconsearch.umn.edu 12829
www.jneurosci.org 12482
www.jimmunol.org 12387
mcb.asm.org 12347
www.genetics.org 12220
www.wseas.us 11588
jcb.rupress.org 11456
www.math.helsinki.fi 11446
jn.nutrition.org 11401
epaper.kek.jp 11302
admin.xosn.com 11173
www.ams.org 10626
jem.rupress.org 10624
circ.ahajournals.org 10612
aac.asm.org 10171
vir.sgmjournals.org 10089
www.biomedcentral.com 9886
www.clinchem.org 9885
www.fs.fed.us 9701
mic.sgmjournals.org 9592
www.ajronline.org 9434
accelconf.web.cern.ch 9277
www.mat.ub.es 9236
emis.maths.tcd.ie 8983
jcem.endojournals.org 8763
www.stanford.edu 8762
emis.library.cornell.edu 8666
www.iovs.org 8034
acl.ldc.upenn.edu 8007
emis.math.ca 7958
intl.plantphysiol.org 7857
www.emis.math.ca 7811
www.ias.ac.in 7786
www.nber.org 7640
www.anesthesia-analgesia.org 7622
www.univie.ac.at 7447
web.mit.edu 7390
www.clevelandfed.org 7308
www.wjgnet.com 7190
mathnet.preprints.org 7094
endo.endojournals.org 7036
www.akademik.unsri.ac.id 6910
ams.confex.com 6805
content.onlinejacc.org 6799
humrep.oxfordjournals.org 6759
ats.ctsnetjournals.org 6757
stroke.ahajournals.org 6746
www.emis.ams.org 6608
www.jlr.org 6563
content.nejm.org 6534
www.molbiolcell.org 6515
www.ajcn.org 6449
ndt.oxfordjournals.org 6435
emis.math.tifr.res.in 6349
wing.comp.nus.edu.sg 6259
www.academicjournals.org 6146
reports-archive.adm.cs.cmu.edu 6146
www.jleukbio.org 6048
jeb.biologists.org 6001
iahs.info 5979
www.usenix.org 5961
www.ars.usda.gov 5856
bioinformatics.oxfordjournals.org 5827
www.cc.gatech.edu 5724
www.princeton.edu 5587
infoscience.epfl.ch 5565
circres.ahajournals.org 5556
hyper.ahajournals.org 5541
jnm.snmjournals.org 5470
pediatrics.aappublications.org 5441
halshs.archives-ouvertes.fr 5359
jac.oxfordjournals.org 5203
www.ajnr.org 5180
ajrccm.atsjournals.org 5048
www.slac.stanford.edu 5048
Sunday, December 19, 2010 10:01 AM
Surprisingly, arxiv.org is not crawled (it hosts about 650k papers).
Monday, December 20, 2010 3:17 AM
Thanks for your feedback! The table lists only the top 100 sites from which we have crawled papers. arxiv.org is crawled as well.
For example, see this page: http://academic.research.microsoft.com/Paper/120150.aspx
"arxiv.org" appears among its view and download links.
Microsoft Academic Search Team
Sunday, December 26, 2010 11:06 PM
How can I submit a URL? Can you add the URL below to your crawler list?
This is the URL of the OAI-PMH interface of Fundação Getulio Vargas' digital library:
Tuesday, December 28, 2010 3:15 AM
Thanks for the information! We have received your request and notified our team so that they can make the necessary changes. Because of our processing schedule, the results may not appear online right away, so please check back later. We appreciate your patience and continued support!
Microsoft Academic Search Team
Monday, January 03, 2011 12:15 AM
Many thanks, Caroline! I am a researcher at the Getulio Vargas Foundation and also the project manager of our Digital Library. Let me know if there is anything we can do to help your crawler...
Happy new year!
Monday, January 03, 2011 6:11 AM
Happy New Year to you and your team, too.
Microsoft Academic Search Team
Monday, January 03, 2011 10:36 AM Moderator
Could you provide detailed information about your OAI service, such as the format and whether a user account is required? You can send mail directly to email@example.com
Wednesday, April 03, 2013 11:17 AM
Respected Sir/Madam,
Please help us include our journal in the Microsoft Academic database. Some of our manuscripts have already been added by their authors, but the journal name (International Journal of Research in Computer Science) is not available in the index. Please help us in this regard.
The following is the link to our journal's OAI data.
- Edited by White Globe Wednesday, April 03, 2013 11:18 AM