Web Site Robot Text File - Robots, Crawlers and Search Engines Oh my!

  • General discussion

  • Now that you have created new publicly accessible web sites that can be seen and scanned without a password any time your WHS is turned on, it might be a good idea to create a robots.txt file to deny search bots access to all of the web pages and photo album pages hosted on your WHS. Not that your WHS pages are sending a beacon out to the WWW or anything, but it is worth considering if you have private info on your public WHS pages.

    This robots file will prevent Google, MSN, Yahoo, the Internet Archive and many other robots from finding, crawling, indexing and archiving the web sites hosted on your WHS, which in my opinion is a good thing to prevent. (WHS already includes robots.txt files in the inetpub folders for the remote access web pages.)

    1. Create a new text file and name it robots.txt. In that file, copy and paste exactly:
    User-agent: *
    Disallow: /


    2. Click File > Save. Now paste this new robots.txt file into every new web site folder you create. This includes HTML sites and any photo album web sites you make. Not every sub-folder needs the file, just the main folder each web site lives in (though it won't hurt to put a copy in every folder if you are extra concerned). This will keep your web pages, images and comments from being indexed by search engines and help ensure that your WHS pages stay private... unless you don't care.

    I have some elaborate robots.txt files on my non-WHS hosted sites, but that is too much to go into on this thread...
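
    For anyone curious, the syntax does scale beyond a blanket block. Here is a purely hypothetical sketch (the folder names are made up, and this is not the file from my other sites) showing that rules can be scoped per crawler and per path; a bot obeys the most specific User-agent section that matches it:

    # Hypothetical example: Googlebot may crawl everything except the photo folders,
    # while every other bot is kept out entirely
    User-agent: Googlebot
    Disallow: /photos/
    Disallow: /albums/

    User-agent: *
    Disallow: /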

    For more info on robots.txt files, see the following links:

    Main site: http://www.robotstxt.org/wc/robots.html

    Deny Info: http://www.robotstxt.org/wc/faq.html#robotstxt
    Sunday, June 10, 2007 12:42 PM

All replies

  •  

    Please note that web crawlers will not be able to access content that requires authentication. Your files/photos hosted on WHS are safe from crawlers.

     

    We do have a bug tracking this. robots.txt should be placed under the root folder of the website. The purpose of robots.txt is primarily to protect a public website that does not require a log-in, for example if you set up your own blog on WHS.

    Monday, June 11, 2007 4:42 PM
  • Thanks Fan Z. I should have been more specific... The robots.txt file would not be needed for any of the remote web pages you log on to with a user name and password, and none of the user account pages would need it either. Actually, I noticed that the log-on pages already have robots.txt files in place, so there is no concern there...

    This really only applies to new public web pages created on WHS, either built manually with, say, FrontPage, or created with Andrew Grant's Whiist add-in for Windows Home Server. These new pages do not require a user name and password to view and are accessible just like any other page on the net as long as your WHS is turned on. I would simply like to keep Google, MSN, Yahoo and the many other bots from finding and scanning my new public WHS-hosted pages and having them show up in search engine results. That's all. No biggie, but some of us may want to prevent it, and a robots.txt file in the root folder of each newly created site will keep your weekend pictures of the family from showing up in Google searches... It's not that these public pages are out there screaming "scan me... scan me," but over time they could be.
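
    If you only want to shut out particular search engines rather than every crawler, robots.txt also lets you name them. A minimal sketch using the crawler names those engines publish (Googlebot for Google, msnbot for MSN, Slurp for Yahoo); note that any bot not listed would still be allowed, which is why the blanket User-agent: * version above is the safer default:

    # Hypothetical example: block only the big three search engine crawlers by name
    User-agent: Googlebot
    Disallow: /

    User-agent: msnbot
    Disallow: /

    User-agent: Slurp
    Disallow: /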

     

    Now, Andrew is planning to give his Whiist add-in the ability to create private web pages that do require a viewer to log in. A robots.txt file would not be necessary on those pages...

     

    Hope this clears things up...

    Wednesday, June 13, 2007 2:17 AM