Houston, you have a problem! Crawling, indexing hijacking problems

  • Question

  • Instead of indexing the web pages on websites, you're crawling and indexing affiliate links (Shareasale and Commission Junction) and also Google Adwords links, with www.google.com as the page title. The URLs being indexed are 302 redirects.

     

    In essence, you're hijacking the web pages the way spammers and proxies used to hijack listings at Google, until they fixed the problem.

     

    http://search.msn.com/results.aspx?q=holiday+sleepwear&FORM=MSNH

     

    If you want more, it'll take me just a few minutes to find them.

     

    What happens then is that the sites the hijacked pages are on get dropped from your index entirely in the process of MSNBot fetching pages anew. This has already happened to a few sites that I know of, which are now gone, and I'm seeing a third start to disappear as you crawl it and get it wrong.

    Tuesday, November 27, 2007 9:07 PM

Answers

  • Totally Agree.   Thanks for bringing this to our attention. I have notified the crawl team and I assure you they will address it.   These redirects can be tricky, but we will get it right. 

     

    Thanks

     

    Jeremiah

     

    Tuesday, November 27, 2007 10:17 PM

All replies

  • Jeremiah, it also looks like the URLs are being clustered (indented), with "more from this site" being the affiliate network's site, which is just the intermediary tracking/redirect URL. Here, I found this today:

    http://search.live.com/results.aspx?mkt=en-in&FORM=TOOLBR&q=bargain+children+clothing&FORM=TOOLBR

    Those are two completely different affiliate sites the links are on, and "more from this site" gives this:

    http://search.live.com/results.aspx?q=site:www.shareasale.com+bargain+children+clothing&lf=0&rf=0&FORM=MSRE2

    Those are ALL different affiliate sites the links are on. The u= tells what the affiliate ID is. And that's just for that particular search term/query: bargain childrens clothing.
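
    As a rough illustration of the u= point, here is a minimal Python sketch that pulls the affiliate ID out of a tracking link. The sample URLs, the /r.cfm path, and the other parameters are made up for the example; only the u= parameter comes from the results above.

```python
from urllib.parse import urlsplit, parse_qs


def affiliate_id(url):
    """Return the value of the u= query parameter, or None if it is absent."""
    params = parse_qs(urlsplit(url).query)
    values = params.get("u")
    return values[0] if values else None


# Hypothetical tracking URLs: the path and the b=/m= parameters are assumptions.
links = [
    "http://www.shareasale.com/r.cfm?b=1&u=111111&m=2222",
    "http://www.shareasale.com/r.cfm?b=1&u=333333&m=2222",
]

# Distinct u= values mean the links belong to different affiliates,
# even though every URL sits on the same tracking host.
ids = {affiliate_id(link) for link in links}
print(len(ids), "distinct affiliate IDs")
```

    Different u= values on the same host are what you'd expect if "more from this site" is lumping unrelated affiliates together.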

    You can easily replicate that phenomenon by replacing with different query strings:

    http://search.live.com/results.aspx?q=site%3Awww.shareasale.com+baby+gifts&go=Search&form=QBRE

    http://search.live.com/results.aspx?q=site%3Awww.shareasale.com+home+decor&go=Search&form=QBRE

    http://search.live.com/results.aspx?q=site%3Awww.shareasale.com+mens+apparel&go=Search&form=QBRE

    When it's taking sites out of the index, it's a serious issue on several counts. Not only that, but it's exposing a significant factor in some portion of the algo, so it's a layer of transparency that isn't necessarily the best thing.

    Thanks so much for looking into this!



    Friday, November 30, 2007 12:05 AM
  • Jeremiah, at first it looked like a Live Search issue, but that's not altogether so. It's happening at all the search engines. The difference, as far as I can tell at this point, is that the sites having their links hijacked are disappearing from the Live Search index.

    Google is actually showing around 1,600 pages returned. They only ever show a small sub-set, especially recently, and it fluctuates drastically; but here's a small sample search showing that they are indexing those. It could change from minute to minute.

    Google
    (only a small sampling with a specific inurl: search to show the URLs)

    Yahoo
    2,669,807 results returned

    Live Search
    3,170,000 results returned

    http://www.shareasale.com/robots.txt
    404 Page not returned.
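
    With no robots.txt there, crawlers generally treat the whole host as open to crawling. Purely as a sketch, this is how a rule covering the tracking path would be read by Python's standard robots.txt parser. The rules and the /r.cfm path below are hypothetical; nothing like them exists on that host as far as I can see.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the /r.cfm path is an assumption for illustration only.
rules = """\
User-agent: *
Disallow: /r.cfm
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A made-up tracking link on the affiliate network's host.
tracking = "http://www.shareasale.com/r.cfm?b=1&u=111111&m=2222"

# The tracking URL matches the Disallow prefix; the home page does not.
blocked = not parser.can_fetch("msnbot", tracking)
home_ok = parser.can_fetch("msnbot", "http://www.shareasale.com/")
print(blocked, home_ok)
```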

    My apologies! It isn't Live Search that's doing any hijacking. ;-)
    Wednesday, December 5, 2007 10:43 AM