Forum Moderators: Robert Charlton & goodroi
This concerns a 750 000 URL forum that has had a large number of URLs disallowed in the robots.txt file for about 18 months. The forum has about 50 000 valid threads.
The disallowed URLs are those that for a guest or a bot only show an "Error. You are not logged in" message - URLs that would otherwise be used to reply to a thread, start a new thread, send a PM, show a "print-friendly" screen, edit the user profile, and so on. There is never a need for search engines to try to index these. Search engines only need to see the thread indexes, and the message threads themselves.
Google has not been indexing the content at the disallowed URLs, but has shown a large number of them as URL-only entries for a long time. They are most easily seen in a site:domain.com listing. This is pretty much normal operation, and that part has been working OK. The disallowed URLs are listed in the User-agent: * part of the robots.txt file.
.
As you may have seen before, I have written several times about how a forum can have duplicate content for each thread, because each thread in a vBulletin or PHPbb (and most other packages too) forum has multiple URLs that can reach the same content.
For a vBulletin forum each thread could show up as each of these URLs:
/forum/showthread.php?t=54321
/forum/showthread.php?t=54321&p=22446688
/forum/showthread.php?t=54321&page=2
/forum/showthread.php?mode=hybrid&t=54321
/forum/showthread.php?p=22446688&mode=linear#post22446688
/forum/showthread.php?p=22446688&mode=threaded#post22446688
/forum/showthread.php?t=34567&goto=nextnewest
/forum/showthread.php?t=87654&goto=nextoldest
/forum/showthread.php?goto=lastpost&t=54321
/forum/showpost.php?p=22446688
/forum/showpost.php?p=22446688&postcount=45
/forum/printthread.php?t=54321
and that is without introducing the page parameter, for threads that are more than one page long, or the pp parameter, for changing the default number of posts per page; either or both of which can be added to most of the URLs above too.
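For housekeeping outside of robots.txt, variants like these can be collapsed programmatically. A minimal sketch in Python (the function name and the decision to keep only the t and page parameters are my own assumptions, not part of vBulletin; goto= and p-only URLs need a database lookup, so they return None here):

```python
from urllib.parse import urlsplit, parse_qs

def canonical_thread_url(url):
    """Collapse a vBulletin thread-URL variant to a canonical t= form.

    Sketch only: keeps just the t (thread id) and page parameters.
    goto= links carry the id of the thread they were ON, and p-only
    links need a post->thread lookup, so both return None here.
    """
    parts = urlsplit(url)
    qs = parse_qs(parts.query)
    if "goto" in qs or "t" not in qs:
        return None
    canon = "%s?t=%s" % (parts.path, qs["t"][0])
    if "page" in qs:
        canon += "&page=%s" % qs["page"][0]
    return canon

print(canonical_thread_url("/forum/showthread.php?mode=hybrid&t=54321"))
# -> /forum/showthread.php?t=54321
```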
The robots.txt file had been set up long ago to exclude several of the URL patterns for thread duplicate content - but critically, not all combinations - and for the excluded URLs, Google had only shown URL-only entries, if anything at all.
.
In a vBulletin forum, the "next" and "previous" links cause massive duplicate content issues because they allow a thread like
/forum/showthread.php?t=54321 to be indexed as
/forum/showthread.php?t=34567&goto=nextnewest and as
/forum/showthread.php?t=87654&goto=nextoldest too.
Additionally, if any of the three threads is bumped, the indexed "next" and "previous" links no longer point to the same thread, because they contain the thread number of the thread that they were ON (along with the goto parameter), not the thread number of the thread that they actually pointed to.
This is a major programming error by the people who designed the forum software. The link should either contain the true thread number of the thread that it points to, or else clicking the "next" and "previous" links should go via a 301 redirect to a URL that includes the true canonical thread number of the target thread.
Those duplicate content URLs have all been indexed before, but now the robots.txt file has been amended to disallow those. This is what was added to the robots.txt file just a few days ago:
User-Agent: Googlebot
Disallow: /*nextnewest
Disallow: /*nextoldest
Disallow: /*mode
Disallow: /*highlight
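These patterns depend on Google's wildcard extension to the original robots.txt standard. As a rough sketch of how such a pattern is evaluated (my own reconstruction for illustration, not Google's actual code): '*' matches any run of characters and '$' anchors the end of the URL.

```python
import re

def rule_matches(pattern, path):
    """Google-style robots.txt wildcard match (illustrative sketch)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' becomes '.*'; everything else is matched literally,
    # anchored at the start of the path
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/*nextnewest", "/forum/showthread.php?t=34567&goto=nextnewest"))
# -> True
```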
.
Here's the punchline:
The disallowed URLs in the User-Agent: * section of the robots.txt file are now being indexed and cached by Google. The cache time-stamps show dates and times just hours after the robots.txt file was amended with the additional Googlebot-specific section.
I would have assumed that Google would obey both sections: not index the URLs in the User-agent: Googlebot section, and not index the URLs in the User-agent: * section either.
What appears to happen is that as soon as you add a User-agent: Googlebot section, Google starts indexing all URLs that are not mentioned in that section, even if they are mentioned in the User-agent: * section and supposedly disallowed for all user agents.
That is, if you have a User-agent: Googlebot section, then you also need to repeat all URLs found in the User-agent: * section in the Googlebot-specific section.
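For example (the wildcard lines are the ones from this thread; the two generic Disallow paths are purely illustrative), the Googlebot section would have to carry a copy of everything in the generic section:

```
User-agent: *
Disallow: /forum/newreply.php
Disallow: /forum/printthread.php

User-agent: Googlebot
Disallow: /forum/newreply.php
Disallow: /forum/printthread.php
Disallow: /*nextnewest
Disallow: /*nextoldest
Disallow: /*mode
Disallow: /*highlight
```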
That, to me, is not how it should work.
.
Can someone from Google clarify whether Google is supposed to follow both the User-agent: * and User-agent: Googlebot sections when both are present, or whether it ignores User-agent: * when a User-agent: Googlebot section is present?
The latter is what appears to happen right now.
.
Side note: Looks like the other stuff at [webmasterworld.com...] is fixed, by the way.
[edited by: g1smd at 4:52 pm (utc) on Aug. 13, 2006]
I have long been concerned about this and several times I thought I saw the kind of spidering you mention. So I currently make it a standard practice to give a full set of Disallow instructions to googlebot in robots.txt, even at the expense of duplicating what was already in the wildcard instructions.
Sorry to hear about the indexing mess, g1smd. I wonder if this paragraph is playing in here -- it seems pretty cryptic to me:
Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do. For example, consider the following robots.txt file:
User-Agent: *
Allow: /
Disallow: /cgi-bin

It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do.
For more information, please see the Robots FAQ [robotstxt.org]. If there still seems to be a problem, please let us know [books.google.com].
[books.google.com...]
In other words, Googlebot is working exactly as it should.
The robots.txt entries for the different UAs are not cumulative, and neither is the * entry. This is necessary, because you may want to give one spider more privileges than the unnamed ones, and another one less than that.
So yes, if you have a User-agent: Googlebot section, AND you're using that only to impose additional restrictions on Googlebot, then you will have to repeat all of your User-agent: * stuff in there as well.
The reference is here [kollar.com], btw.
For example, in the following /robots.txt file:

User-Agent: *
Disallow: /

User-Agent: Lycos
Disallow: /cgi-bin/ /tmp/

It can be observed that there are two logical records. The first record is weakly specific to all robots, and effectively disallows access to all files on the server. The second record is specific to a robot that would have the string "Lycos" (without regard to case) in the User-Agent field of the Hypertext Transfer Protocol request that it would make of a server. Any robot parsing this file would choose the record that is most specific to it. The "Lycos robot" would choose the second record; all other robots would choose the first.
If there's a weak specification and a specific specification for Googlebot, we'll go with the one for Googlebot. g1smd, if you include specific directions for Googlebot and also want Googlebot to obey the "generic" directives, you'd need to include allows/disallows from the generic section in the Googlebot section.
I believe most/all search engines interpret robots.txt this way--a more specific directive takes precedence over a weaker one.
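That precedence rule can be sketched in a few lines of Python (my own illustration of the behaviour described above, not any engine's actual code): a group named for the bot wins outright, and groups are never merged.

```python
def select_group(groups, ua):
    """Pick the single robots.txt group a crawler obeys (sketch).

    A group whose user-agent token appears in the bot's UA string
    beats 'User-agent: *'; among several matches, the longest
    (most specific) token wins. Groups are NOT combined.
    """
    ua = ua.lower()
    specific = [g for g in groups
                if g["agent"] != "*" and g["agent"].lower() in ua]
    if specific:
        return max(specific, key=lambda g: len(g["agent"]))
    for g in groups:
        if g["agent"] == "*":
            return g
    return None

groups = [
    {"agent": "*", "rules": ["/private"]},
    {"agent": "Googlebot", "rules": ["/*nextnewest"]},
]
print(select_group(groups, "Googlebot/2.1")["rules"])
# the '*' rules are ignored entirely for Googlebot
```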
What appears to happen is that as soon as you add a User-agent: Googlebot section, that Google starts indexing URLs that are not mentioned in that section, even if they are mentioned in the User-agent: * section
If I did it this way, I would do it with the intention of NOT allowing any bot access to my site.
In my opinion,
User-Agent: *
Disallow: /
should stop all of them dead in their tracks; there's no need to go further down to read bot-specific info, as Lycos, MSNBOT, Googlebot and all, by definition, are 'User-Agents.'
You might want to use Apache rewrite, if applicable. This is what I have, courtesy of the Apache forum:
RewriteEngine On
# Match the major bots by User-Agent (Slurp is not anchored, because
# Yahoo's UA string does not begin with "Slurp")
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
# ...and only when the original request was for a .php URL
RewriteCond %{THE_REQUEST} \.php
# Return 403 Forbidden to those bots for .php requests
RewriteRule \.php - [F]
I have the URLs rewritten and do not want the bots to index .php files. For MSN and Google, which support wildcard extensions, I also have that in the robots.txt.
[edited by: walkman at 7:39 pm (utc) on Aug. 13, 2006]
In fact, through reading I've discovered that I have a few robots files to update immediately.
or, you just changed your mind, or you are working on something and for the time being you don't want them to index it. Instead of deleting it all, you leave the robots.txt as it is, just add a few lines on top (that's how I did it twice; not sure if it made an impact or what.)
I think a case can be made for both, first you tell SEs that you want no robots, but then you tell Google to index this. We all just need to adopt a standard and stick with it.
[edited by: walkman at 7:56 pm (utc) on Aug. 13, 2006]
Perhaps something on the order of:
[edited by: SEOcritique at 7:50 pm (utc) on Aug. 13, 2006]
For example, let's say you create a new AdWords landing page. This page has no external links to it at all; the only way someone is going to get to the page is by clicking on an AdWords ad. Shortly after you create the ad you will see the standard Googlebot come fetch the page with the user-agent shown below.
(compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Now if this was the Adwords bot I wouldn't be so "concerned", but seeing how google is now cloaking their bots ... I'll let you draw your own conclusions and pick out your own reflective silver headgear.
In my opinion,
User-Agent: *
Disallow: /
should stop all of them dead in their tracks; there's no need to go further down to read bot-specific info, as Lycos, MSNBOT, Googlebot and all, by definition, are 'User-Agents.'
I can't believe I have missed this important fact for 8 years, but this is the first time I have added a specific section for Googlebot to a robots.txt file. I much prefer using the robots meta tag on each applicable page.
We need to make sure the "new" information is widely disseminated, as I expect that many others have missed it too.
Instead of thinking, "I'm sure they can handle this" and interpreting the Standard liberally, it's best to think in terms of "the robots will do (or recognize) this and no more." Keep it dirt-simple, in other words.
It is best to assume that a robot will accept the first record that it finds that matches its user-agent token, and to interpret the definition of "matches" as meaning that the robot will accept either a "*" or its specific user-agent token whichever it finds first. I know of several robots that will read past a "*" record to see if they can find a more-specific record, but there are many, many that won't.
In other words, the "User-agent: *" record should always be the catch-all record at the end of your robots.txt file, and all robot-specific records should precede it.
This is because a given robot will accept directives from one robots.txt record and no more. And while a few robots support the (required) feature of specifying multiple user-agents in a single record, this causes many others to blow up completely and either go away or crawl the whole site. So even some of what is clearly defined in the standard is poorly supported. If that's not clear, I'm referring to this construct:
User-agent: googlebot
User-agent: Slurp
Disallow: /cgi-bin
Disallow: /admin
My conclusion is that the best approach is to serve a different robots.txt file to each robot -- test the user-agent in the request and serve an appropriate, separate file to each one that matters to your site, then serve a generic one to the 'bots that send you little or no traffic. Only in this way can you be sure that a proprietary directive intended for one robot won't cause a less-sophisticated robot to reject the file as invalid and either spider the whole site or just go away without crawling at all.
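A minimal mod_rewrite sketch of that approach (the per-bot filenames here are hypothetical; adjust the tokens and files for your own setup):

```
RewriteEngine On
# Hand Googlebot its own file
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^robots\.txt$ /robots-googlebot.txt [L]
# Hand Yahoo Slurp its own file
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
RewriteRule ^robots\.txt$ /robots-slurp.txt [L]
# Everything else gets the generic file
RewriteRule ^robots\.txt$ /robots-generic.txt [L]
```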
When doing so, it's best to code the logic so that new variants of recognized robots won't be turned away. The big search companies are developing a penchant for releasing additional 'specialty' robots at an ever-increasing pace, and for changing their user-agent string arrangement for no good reason. Nevertheless, it wouldn't do to send an otherwise-welcome robot packing because you don't recognize the new version.
I hope it's OK to post these links as they describe how Googlebot handles robots.txt files.
This page provides links to information on robots.txt files as they pertain to Googlebot:
[google.com...]
And this page provides information on how Googlebot interprets the situation being discussed here:
[google.com...]
If you want to block access to all bots other than the Googlebot, you can use the following syntax:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
Googlebot follows the line directed at it, rather than the line directed at everyone.
You can always use the robots.txt analysis tool in our webmaster tools to see how Googlebot will interpret a robots.txt file.
For most other robots, the safest way to exclude all but the specified robots would be:
User-agent: The_Allowed_Bot
Disallow:
User-agent: Another_Allowed_Bot
Disallow:
User-agent: *
Disallow: /
This should work for Googlebot as well, and is a much safer bet for 'dumber' robots.
Jim
[edited by: jdMorgan at 10:05 pm (utc) on Aug. 13, 2006]
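The allow-list layout above can be sanity-checked offline. For instance, Python's standard urllib.robotparser applies the same group-matching logic (The_Allowed_Bot is of course a placeholder name):

```python
import urllib.robotparser

# The "safest" layout from above: named bots allowed, everyone else blocked
rules = """\
User-agent: The_Allowed_Bot
Disallow:

User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("The_Allowed_Bot", "/forum/showthread.php?t=54321"))  # True
print(rp.can_fetch("SomeOtherBot", "/forum/showthread.php?t=54321"))     # False
```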
The rule of thumb I always use is "the most specific directive applies." So if you say "Everyone in the room, leave. g1smd, please stay; we need to chat" then everyone but g1smd would mosey.
Although this is how we've done things for a long time (and I think every other major engine works this way), I agree it's good to get the word out, g1smd. It's on the front page of WebmasterWorld, so I think the word is out. :) My takeaway would be to find a good robots.txt checker and test out a new file before making it live.
[edited by: GoogleGuy at 12:49 am (utc) on Aug. 14, 2006]
User-Agent: *
Allow: /
Disallow: /cgi-bin
That will disallow the entire site on some bots.
There is no "allow" syntax and we have documented Slurp misinterpreting it in the past as a disallow line which ultimately removed the entire site from Yahoo. That behavior has been changed, but it is a clear cut case of what can go wrong when using nonstandard (improper) syntax in robots.txt.
This debate arises from a somewhat ambiguous statement in the robots.txt standard.
"the record describes the default access policy for any robot that has not matched any of the other records."
In other words, if a specific rule is present the "*" rule is ignored. Order shouldn't matter.
So ... according to a close reading of the specification, the rules for a specific user agent _entirely override_ the "User-agent: *" rules. Therefore, any rule under "User-agent: *" that should also be applied to googlebot must be repeated under "User-agent: googlebot."
Am I wrong?
[edited by: SEOEgghead at 7:37 pm (utc) on Aug. 15, 2006]
I agree that it's wise to put "*" last regardless. I think that cures the ambiguity regardless, right?
I think many programmers would tend to read it the _wrong_ way. I have in the past. It kind of reads like a "switch" statement in C -- where order counts. Arguably, C requires that "*," the default, be last. It's a so-so analogy, but you see my point. If they had used the token "default" instead of "*" nobody would mistake it for a glob/regex like many have ... </rant>
Jim
Does anyone know if Yahoo and MSN also follow this very literal interpretation of the specification?
No. Some bots have worked in the past in increasing order of importance: what comes last in the robots.txt trumps what comes first. A sequential reading of robots.txt is how Infoseek and Scooter used to work.
[edited by: Brett_Tabke at 4:23 pm (utc) on Aug. 19, 2006]