Forum Moderators: Robert Charlton & goodroi

Why Google Might "Ignore" a robots.txt Disallow Rule

Thousands of pages show up in the Google cache! aka "here we go again"

         

g1smd

4:35 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Here's the skinny on this one:

This concerns a 750 000 URL forum that has had a large number of URLs disallowed in the robots.txt file for about 18 months. The forum has about 50 000 valid threads.

The disallowed URLs are those that, for a guest or a bot, show only an "Error. You are not logged in" message - URLs that would otherwise be used to reply to a thread, start a new thread, send a PM, show a "print-friendly" screen, edit the user profile, and so on. There is never any need for search engines to try to index these. Search engines only need to see the thread indexes and the message threads themselves.

Google has not been indexing the content at the disallowed URLs, but has shown a large number of them as URL-only entries for a long time. They are most easily seen in a site:domain.com listing. This is pretty much the normal operation. That part has been working OK. The disallowed URLs are listed in the User-agent: * part of the robots.txt file.
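
The actual generic section isn't reproduced here, but a sketch of it would look something like this (the script names are typical vBulletin ones, shown purely for illustration):

User-agent: *
Disallow: /forum/newreply.php
Disallow: /forum/newthread.php
Disallow: /forum/private.php
Disallow: /forum/sendmessage.php
Disallow: /forum/printthread.php
Disallow: /forum/profile.php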

.

As you may have seen before, I have written several times about how a forum can have duplicate content for each thread, because each thread in a vBulletin or PHPbb (and most other packages too) forum has multiple URLs that can reach the same content.

For a vBulletin forum each thread could show up as each of these URLs:

/forum/showthread.php?t=54321
/forum/showthread.php?t=54321&p=22446688
/forum/showthread.php?t=54321&page=2
/forum/showthread.php?mode=hybrid&t=54321
/forum/showthread.php?p=22446688&mode=linear#post22446688
/forum/showthread.php?p=22446688&mode=threaded#post22446688
/forum/showthread.php?t=34567&goto=nextnewest
/forum/showthread.php?t=87654&goto=nextoldest
/forum/showthread.php?goto=lastpost&t=54321
/forum/showpost.php?p=22446688
/forum/showpost.php?p=22446688&postcount=45
/forum/printthread.php?t=54321

and that is without introducing URLs that include the page parameter, for threads that run to more than one page, or the pp parameter, which changes the default number of posts per page; either or both of those can be appended to most of the URLs above as well.
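
For example (purely illustrative, extending the same thread number used above):

/forum/showthread.php?t=54321&page=3
/forum/showthread.php?t=54321&pp=40&page=2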

The robots.txt file had been set up long ago to exclude several of the URL patterns for thread duplicate content - but critically, not all combinations - and for the excluded URLs, Google had shown only URL-only entries, if anything at all.

.

In a vBulletin forum, the "next" and "previous" links cause massive duplicate content issues because they allow a thread like
/forum/showthread.php?t=54321 to be indexed as
/forum/showthread.php?t=34567&goto=nextnewest and as
/forum/showthread.php?t=87654&goto=nextoldest too.

Additionally, if any of the three threads is bumped, the indexed "next" and "previous" links no longer point to the same thread, because they contain the thread number of the thread they were ON (along with the goto parameter), not the real thread number of the thread they actually pointed to.

This is a major programming error by the people who designed the forum software. The link should either contain the true thread number of the thread that it points to, or else clicking the "next" and "previous" links should go via a 301 redirect to a URL that includes the true canonical thread number of the target thread.

Those duplicate content URLs have all been indexed before, but now the robots.txt file has been amended to disallow them. This is what was added to the robots.txt file just a few days ago:

User-Agent: Googlebot
Disallow: /*nextnewest
Disallow: /*nextoldest
Disallow: /*mode
Disallow: /*highlight

.

Here's the punchline:

The disallowed URLs in the User-Agent: * section of the robots.txt file are now being indexed and cached by Google. The cache time-stamps show dates and times just hours after the robots.txt file was amended with the additional Googlebot-specific section.

I would have assumed that Google would obey both sections: not index the URLs disallowed in the User-agent: Googlebot section, and not index the URLs disallowed in the User-agent: * section.

What appears to happen is that as soon as you add a User-agent: Googlebot section, Google starts indexing all URLs that are not mentioned in that section, even if they are mentioned in the User-agent: * section and supposedly disallowed for all user agents.

That is, if you have a User-agent: Googlebot section, then you also need to repeat all URLs found in the User-agent: * section in the Googlebot-specific section.
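
In other words, if that reading is correct, the Googlebot section would have to end up looking something like this (the entries are only a sketch; the real file lists many more):

User-Agent: Googlebot
Disallow: /forum/newreply.php
Disallow: /forum/newthread.php
Disallow: /forum/private.php
Disallow: /*nextnewest
Disallow: /*nextoldest
Disallow: /*mode
Disallow: /*highlight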

That, to me, is not how it should work.

.

Can someone from Google clarify whether Google is supposed to follow both the User-agent: * and User-agent: Googlebot sections if both are present, or whether it ignores User-agent: * when a User-agent: Googlebot section is present?

The latter is what appears to happen right now.

.

Side note: Looks like the other stuff at [webmasterworld.com...] is fixed, by the way.

[edited by: g1smd at 4:52 pm (utc) on Aug. 13, 2006]

tedster

4:48 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would say that googlebot definitely "should" follow both kinds of exclusion if both are present. However, as you've discovered, there can be a difference between what googlebot "should" do and what the little devil actually "does".

I have long been concerned about this, and several times I thought I saw the kind of spidering you mention. So I currently make it a standard practice to give a full set of Disallow instructions to googlebot in robots.txt, even at the expense of duplicating what was already in the wildcard instructions.

Sorry to hear about the indexing mess, g1smd. I wonder if this paragraph comes into play here -- it seems pretty cryptic to me:

Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do. For example, consider the following robots.txt file:
User-Agent: *
Allow: /
Disallow: /cgi-bin

It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do.

For more information, please see the Robots FAQ [robotstxt.org]. If there still seems to be a problem, please let us know [books.google.com].

[books.google.com...]

bird

4:57 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What appears to happen is that as soon as you add a User-agent: Googlebot section, Google starts indexing URLs that are not mentioned in that section, even if they are mentioned in the User-agent: * section.

In other words, Googlebot is working exactly as it should.

The robots.txt records for the different UAs are not cumulative, and the * record is not added on top of them either. This is necessary, because you may want to give one spider more privileges than the unnamed ones, and another one less than that.

So yes, if you have a User-agent: Googlebot section, AND you're using that only to impose additional restrictions on Googlebot, then you will have to repeat all of your User-agent: * stuff in there as well.

The reference is here [kollar.com], btw.

tedster

5:01 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for that, bird. I never paid close enough attention before. I think the relevant part is this distinction between "weakly specific" and "strongly specific":

For example, in the following /robots.txt file:

User-Agent: *
Disallow: /

User-Agent: Lycos
Disallow: /cgi-bin/ /tmp/

It can be observed that there are two logical records. The first record is weakly specific to all robots, and effectively disallows access to all files on the server. The second record is specific to a robot that would have the string "Lycos" (without regard to case) in the User-Agent field of the Hypertext Transfer Protocol Request that it would make of a server. Any robot parsing this file would choose the record that is most specific to it. The "Lycos robot" would choose the second record, all other robots would choose the first.

GoogleGuy

5:54 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Good example, tedster. In that example, if Lycos followed the less specific directive, it couldn't fetch any pages at all--which appears to be against the behavior that someone would reasonably expect in that robots.txt file.

If there's a weak specification and a specific specification for Googlebot, we'll go with the one for Googlebot. g1smd, if you include specific directions for Googlebot and also want Googlebot to obey the "generic" directives, you'd need to include allows/disallows from the generic section in the Googlebot section.

I believe most/all search engines interpret robots.txt this way--a more specific directive takes precedence over a weaker one.

mcavic

7:07 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What appears to happen is that as soon as you add a User-agent: Googlebot section, Google starts indexing URLs that are not mentioned in that section, even if they are mentioned in the User-agent: * section.

Yes, that's how it should be, because you might want to allow Google more access than all other bots.

walkman

7:36 pm on Aug 13, 2006 (gmt 0)



>> User-Agent: *
Disallow: /

If I did it this way, I would do it with the intention of NOT allowing any bot access to my site.
In my opinion,
User-Agent: *
Disallow: /
should stop all of them dead in their tracks; there's no need to go further down and read bot-specific info, as Lycos, MSNBOT, Googlebot and all the rest are, by definition, 'User-Agents.'

You might want to use an Apache rewrite, if applicable. This is what I have, courtesy of the Apache forum:
RewriteEngine On
# Match the bots anywhere in the UA string (an anchored ^Googlebot would miss
# UAs such as "Mozilla/5.0 (compatible; Googlebot/2.1; ...)")
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
# Only when the original request line itself asks for a .php URL
RewriteCond %{THE_REQUEST} \.php
# Answer those requests with 403 Forbidden
RewriteRule \.php - [F]

I have the URLs rewritten and do not want the bots to index .php files. For MSN and Google, which support wildcard extension patterns, I also have that in the robots.txt.
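
Something along these lines, as a sketch (the trailing $ end-anchor comes from the extended wildcard syntax that Google and MSN support; it is not part of the original standard):

User-agent: Googlebot
Disallow: /*.php$

User-agent: msnbot
Disallow: /*.php$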

[edited by: walkman at 7:39 pm (utc) on Aug. 13, 2006]

tedster

7:37 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I hadn't given Kollar's page about robots.txt [kollar.com] very detailed attention before now. Thanks again bird, it's worth a close study. Just be careful to note that some sections are proposals that have not yet been adopted. But it sure does a better job explaining things than the relatively sparse information at robotstxt.org -- even though the language is a bit technically dense at times.

In fact, through reading I've discovered that I have a few robots files to update immediately.

tedster

7:40 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I did it this way, I would do it with the intention of NOT allowing any bot access to my site.

But if you did it this way, why would you be including that second rule at all -- the rule for Lycos? If the intention was to block all bots, surely you would stop with the first rule, right?

walkman

7:44 pm on Aug 13, 2006 (gmt 0)



>> But if you did it this way, why would you be including that second rule at all -- the rule for Lycos? If the intention was to block all bots, surely you would stop with the first rule, right?

Or you just changed your mind, or you are working on something and for the time being you don't want them to index it. Instead of deleting it all, you leave the robots.txt as it is and just add a few lines on top (that's how I did it twice; not sure if it made an impact or not).

I think a case can be made for both readings: first you tell the SEs that you want no robots at all, but then you tell Google to index this. We all just need to adopt one standard and stick with it.

[edited by: walkman at 7:56 pm (utc) on Aug. 13, 2006]

SEOcritique

7:47 pm on Aug 13, 2006 (gmt 0)

10+ Year Member



GoogleGuy's response is quite interesting. Many web site administrators expect robots.txt to follow a top-down hierarchy or an inherited hierarchy like CSS. While understandable, this is obviously not the case. It looks like the robots.txt standard could use some added specification.

Perhaps something on the order of:

    User-Agent: *
    Disallow: / !important

[edited by: SEOcritique at 7:50 pm (utc) on Aug. 13, 2006]

graywolf

7:55 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What I find particularly "funny" about Googlebot is the way it shows up places it shouldn't.

For example, let's say you create a new AdWords landing page. This page has no external links to it at all; the only way someone is going to get to the page is by clicking on an AdWords ad. Shortly after you create the ad, you will see the standard Googlebot come and fetch the page with the user agent shown below.

(compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Now if this were the AdWords bot I wouldn't be so "concerned", but seeing how Google is now cloaking its bots ... I'll let you draw your own conclusions and pick out your own reflective silver headgear.

mcavic

7:57 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In my opinion,
User-Agent: *
Disallow: /
should stop all of them dead in their tracks; there's no need to go further down and read bot-specific info, as Lycos, MSNBOT, Googlebot and all the rest are, by definition, 'User-Agents.'

No, because if that were the case, there would be no way for you to specify "all bots except Googlebot".

g1smd

8:04 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



OK. I follow the logic, but I've read robotstxt.org many times, and never got the hint that it worked that way before.

I can't believe I have missed this important fact for 8 years, but this is the first time I have added a specific section for Googlebot to a robots.txt file. I much prefer using the robots meta tag on each applicable page.

We need to make sure the "new" information is widely disseminated, as I expect that many others have missed it too.

jdMorgan

8:46 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The number one mistake in interpreting robots.txt is to think that the robot examining it is programmed to be "smart" about discerning what you want. While it's certainly possible for today's robots to be smarter than they were back when the Standard was proposed, the robots.txt Standard was invented in much simpler times and for much simpler web sites.

Instead of thinking, "I'm sure they can handle this" and interpreting the Standard liberally, it's best to think in terms of "the robots will do (or recognize) this and no more." Keep it dirt-simple, in other words.

It is best to assume that a robot will accept the first record it finds that matches its user-agent token, and to interpret "matches" as meaning that the robot will accept either a "*" or its specific user-agent token, whichever it finds first. I know of several robots that will read past a "*" record to see if they can find a more specific record, but there are many, many that won't.

In other words, the "User-agent: *" record should always be the catch-all record at the end of your robots.txt file, and all robot-specific records should precede it.

This is because a given robot will accept directives from one robots.txt record and no more. And while a few robots support the (required) feature of specifying multiple user-agents in a single record, this causes many others to blow up completely and either go away or crawl the whole site. So even some of what is clearly defined in the standard is poorly supported. If that's not clear, I'm referring to this construct:

User-agent: googlebot
User-agent: Slurp
Disallow: /cgi-bin
Disallow: /admin


That construct, while clearly required* by the Standard, is not supported by many second-tier robots.
* From A Standard for Robot Exclusion: "The record starts with one or more User-agent lines" (emphasis added)

My conclusion is that the best approach is to serve a different robots.txt file to each robot -- test the user-agent in the request and serve an appropriate, separate file to each one that matters to your site, then serve a generic one to the 'bots that send you little or no traffic. Only in this way can you be sure that a proprietary directive intended for one robot won't cause a less sophisticated robot to reject the file as invalid and either spider the whole site or just go away without crawling at all.

When doing so, it's best to code the logic so that new variants of recognized robots won't be turned away. The big search companies are developing a penchant for releasing additional 'specialty' robots at an ever-increasing pace, and for changing their user-agent string arrangement for no good reason. Nevertheless, it wouldn't do to send an otherwise-welcome robot packing because you don't recognize the new version.
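
A rough mod_rewrite sketch of that approach (the file names robots-googlebot.txt and robots-slurp.txt are made up for the example; using a substring match rather than an anchored one keeps new variants such as "Googlebot-Image" covered):

RewriteEngine On
# Googlebot and its specialty variants get a tailored file
# (internal rewrite, so the robot still requests /robots.txt)
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^/?robots\.txt$ /robots-googlebot.txt [L]
# Yahoo's Slurp gets its own file
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
RewriteRule ^/?robots\.txt$ /robots-slurp.txt [L]
# All other user-agents fall through to the generic /robots.txt on disk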

vanessafox

9:41 pm on Aug 13, 2006 (gmt 0)

10+ Year Member



Hi Brett/moderators,

I hope it's OK to post these links as they describe how Googlebot handles robots.txt files.

This page provides links to information on robots.txt files as they pertain to Googlebot:
[google.com...]

And this page provides information on how Googlebot interprets the situation being discussed here:
[google.com...]

If you want to block access to all bots other than the Googlebot, you can use the following syntax:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

Googlebot follows the line directed at it, rather than the line directed at everyone.

You can always use the robots.txt analysis tool in our webmaster tools to see how Googlebot will interpret a robots.txt file.

g1smd

9:46 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Cool. Now we're talking.

Many thanks for such a rapid resolution to such an easy misunderstanding...

jdMorgan

10:03 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Googlebot is "smart" in that it uses record-specificity priority, rather than record-order priority.

For most other robots, the safest way to exclude all but the specified robots would be:

User-agent: The_Allowed_Bot
Disallow:

User-agent: Another_Allowed_Bot
Disallow:

User-agent: *
Disallow: /


Here the allowed 'bots find their specific record first, and accept it, while all other robots continue to the catch-all record at the end.

This should work for Googlebot as well, and is a much safer bet for 'dumber' robots.

Jim

[edited by: jdMorgan at 10:05 pm (utc) on Aug. 13, 2006]

GoogleGuy

12:47 am on Aug 14, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for hopping in, Vanessa. Personally, I always take a new robots.txt file out for a test drive before I put it live. There's a checker in the Google Webmaster Tools, and Brett offered one of the earliest ones I saw on the web.

The rule of thumb I always use is "the most specific directive applies." So if you say "Everyone in the room, leave. g1smd, please stay; we need to chat" then everyone but g1smd would mosey.

Although this is how we've done things for a long time (and I think every other major engine works this way), I agree it's good to get the word out, g1smd. It's on the front page of WebmasterWorld, so I think the word is out. :) My takeaway would be to find a good robots.txt checker and test out a new file before making it live.

[edited by: GoogleGuy at 12:49 am (utc) on Aug. 14, 2006]

bufferzone

12:11 pm on Aug 14, 2006 (gmt 0)

10+ Year Member



I would definitely recommend that you take a close look at Google Webmaster Tools. The robots.txt tool will do validation as well as test new entries. You also have the ability to run detailed tests for the many flavours of Googlebot, so that you can specify exactly which bot can do what.

Brett_Tabke

2:55 pm on Aug 14, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Also be aware that when you are using nonstandard robots.txt syntax, bad things can happen. Never use nonstandard syntax with wildcard agent names.


User-Agent: *
Allow: /
Disallow: /cgi-bin

That will disallow the entire site on some bots.

There is no "Allow" syntax in the original standard, and we have documented Slurp misinterpreting it in the past as a disallow line, which ultimately removed the entire site from Yahoo. That behavior has since been changed, but it is a clear-cut case of what can go wrong when using nonstandard (improper) syntax in robots.txt.
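
The standard-only way to express the same intent is simply to leave out the Allow line; anything not explicitly disallowed is crawlable by default:

User-Agent: *
Disallow: /cgi-bin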

SEOEgghead

7:36 pm on Aug 15, 2006 (gmt 0)

10+ Year Member



I think Google is only following the specification here.

This debate arises from a somewhat ambiguous statement in the robots.txt standard.

"the record describes the default access policy for any robot that has not matched any of the other records."

In other words, if a specific record is present, the "*" record is ignored. Order shouldn't matter.

So ... according to a close reading of the specification, the rules for a specific user agent _entirely override_ the "User-agent: *" rules. Therefore, any rule under "User-agent: *" that should also be applied to googlebot must be repeated under "User-agent: googlebot."

Am I wrong?

[edited by: SEOEgghead at 7:37 pm (utc) on Aug. 15, 2006]

mcavic

10:20 pm on Aug 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So ... according to a close reading of the specification, the rules for a specific user agent _entirely override_ the "User-agent: *" rules. Therefore, any rule under "User-agent: *" that should also be applied to googlebot must be repeated under "User-agent: googlebot."

Correct.

SEOEgghead

10:48 pm on Aug 15, 2006 (gmt 0)

10+ Year Member



Out of curiosity, does anyone know if Yahoo and MSN also follow this very literal interpretation of the specification? I would hope so.

I agree that it's wise to put "*" last regardless. I think that cures the ambiguity, right?

I think many programmers would tend to read it the _wrong_ way. I have in the past. It kind of reads like a "switch" statement in C -- where order counts, and where the default case conventionally goes last. It's a so-so analogy, but you see my point. If they had used the token "default" instead of "*", nobody would mistake it for a glob/regex like many have ... </rant>

tedster

12:05 am on Aug 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Up above, GoogleGuy said "I believe most/all search engines interpret robots.txt this way--a more specific directive takes precedence over a weaker one."

I tend to trust his experience on this one.

jdMorgan

12:18 am on Aug 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would recommend going with the Standard, and coding to the lowest common denominator to avoid problems. While our esteemed Googlers will undoubtedly provide authoritative answers for how Google does things, and Google's implementations are usually correct, compliant, and robust, extending those attributes to 'most robots' is rather a big stretch.

Jim

mcavic

1:22 pm on Aug 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



does anyone know if Yahoo and MSN also follow this very literal interpretation of the specification

They would have to. It's the only interpretation that works logically, isn't it? If a bot were to follow its own section AND the * section, then there would be no way to allow that bot more access than all the other bots (since Allow: is non-standard).

Brett_Tabke

2:10 pm on Aug 17, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



> It's the only interpretation that works logically, isn't it?

No. Some bots have in the past worked in increasing order of importance: what comes last in the robots.txt trumps what comes first. A sequential reading of robots.txt is how Infoseek and Scooter used to work.

[edited by: Brett_Tabke at 4:23 pm (utc) on Aug. 19, 2006]

g1smd

2:19 pm on Aug 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



New search results published to [gfe-eh.google.com ] today show that the modified robots.txt file is now being followed, and all the problems with indexed URLs are being cleared up. I don't yet see any change on most other DCs.

SEOEgghead

9:26 pm on Aug 17, 2006 (gmt 0)

10+ Year Member



Brett,

Who says those bots were logical, though? :) Anyway, my assessment is that you should place the "*" (default) rule last and repeat, in each bot-specific record, any rules from "*" that you also want applied to that bot. This covers most of the bases.

Make sense?
