Forum Moderators: Robert Charlton & goodroi

Why Google Might "Ignore" a robots.txt Disallow Rule

Thousands of pages show up in the Google cache! aka "here we go again"

         

g1smd

4:35 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Here's the skinny on this one:

This concerns a 750 000 URL forum that has had a large number of URLs disallowed in the robots.txt file for about 18 months. The forum has about 50 000 valid threads.

The disallowed URLs are those that, for a guest or a bot, show only an "Error. You are not logged in" message - URLs that would otherwise be used to reply to a thread, start a new thread, send a PM, show a "print-friendly" screen, edit the user profile, and so on. There is never any need for search engines to try to index these. Search engines only need to see the thread indexes and the message threads themselves.

Google has not been indexing the content at the disallowed URLs, but has shown a large number of them as URL-only entries for a long time. They are most easily seen in a site:domain.com listing. This is pretty much the normal operation. That part has been working OK. The disallowed URLs are listed in the User-agent: * part of the robots.txt file.
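
The actual generic section isn't reproduced here, but a sketch of it would look something like this (the script names are typical vBulletin ones, shown purely for illustration):

User-agent: *
Disallow: /forum/newreply.php
Disallow: /forum/newthread.php
Disallow: /forum/private.php
Disallow: /forum/sendmessage.php
Disallow: /forum/printthread.php
Disallow: /forum/profile.php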

.

As you may have seen before, I have written several times about how a forum can have duplicate content for each thread, because each thread in a vBulletin or PHPbb (and most other packages too) forum has multiple URLs that can reach the same content.

For a vBulletin forum each thread could show up as each of these URLs:

/forum/showthread.php?t=54321
/forum/showthread.php?t=54321&p=22446688
/forum/showthread.php?t=54321&page=2
/forum/showthread.php?mode=hybrid&t=54321
/forum/showthread.php?p=22446688&mode=linear#post22446688
/forum/showthread.php?p=22446688&mode=threaded#post22446688
/forum/showthread.php?t=34567&goto=nextnewest
/forum/showthread.php?t=87654&goto=nextoldest
/forum/showthread.php?goto=lastpost&t=54321
/forum/showpost.php?p=22446688
/forum/showpost.php?p=22446688&postcount=45
/forum/printthread.php?t=54321

and that is without introducing URLs that include the page parameter, for threads that run to more than one page, or the pp parameter, which changes the default number of posts per page; either or both of those can be appended to most of the URLs above as well.
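
For example (purely illustrative, extending the same thread number used above):

/forum/showthread.php?t=54321&page=3
/forum/showthread.php?t=54321&pp=40&page=2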

The robots.txt file had been set up long ago to exclude several of the URL patterns for thread duplicate content - but critically, not all combinations - and for the excluded URLs, Google had shown only URL-only entries, if anything at all.

.

In a vBulletin forum, the "next" and "previous" links cause massive duplicate content issues because they allow a thread like
/forum/showthread.php?t=54321 to be indexed as
/forum/showthread.php?t=34567&goto=nextnewest and as
/forum/showthread.php?t=87654&goto=nextoldest too.

Additionally, if any of the three threads is bumped, the indexed "next" and "previous" links no longer point to the same thread, because they contain the thread number of the thread they were ON (along with the goto parameter), not the real thread number of the thread they actually pointed to.

This is a major programming error by the people who designed the forum software. The link should either contain the true thread number of the thread that it points to, or else clicking the "next" and "previous" links should go via a 301 redirect to a URL that includes the true canonical thread number of the target thread.

Those duplicate content URLs have all been indexed before, but now the robots.txt file has been amended to disallow them. This is what was added to the robots.txt file just a few days ago:

User-Agent: Googlebot
Disallow: /*nextnewest
Disallow: /*nextoldest
Disallow: /*mode
Disallow: /*highlight

.

Here's the punchline:

The disallowed URLs in the User-Agent: * section of the robots.txt file are now being indexed and cached by Google. The cache time-stamps show dates and times just hours after the robots.txt file was amended with the additional Googlebot-specific section.

I would have assumed that Google would obey both sections: not index the URLs disallowed in the User-agent: Googlebot section, and not index the URLs disallowed in the User-agent: * section.

What appears to happen is that as soon as you add a User-agent: Googlebot section, Google starts indexing all URLs that are not mentioned in that section, even if they are mentioned in the User-agent: * section and supposedly disallowed for all user agents.

That is, if you have a User-agent: Googlebot section, then you also need to repeat all URLs found in the User-agent: * section in the Googlebot-specific section.
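
In other words, if that reading is correct, the Googlebot section would have to end up looking something like this (the entries are only a sketch; the real file lists many more):

User-Agent: Googlebot
Disallow: /forum/newreply.php
Disallow: /forum/newthread.php
Disallow: /forum/private.php
Disallow: /*nextnewest
Disallow: /*nextoldest
Disallow: /*mode
Disallow: /*highlight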

That, to me, is not how it should work.

.

Can someone from Google clarify whether Google is supposed to follow both the User-agent: * and User-agent: Googlebot sections if both are present, or whether it ignores User-agent: * when a User-agent: Googlebot section is present?

The latter is what appears to happen right now.

.

Side note: Looks like the other stuff at [webmasterworld.com...] is fixed, by the way.

[edited by: g1smd at 4:52 pm (utc) on Aug. 13, 2006]

tedster

4:48 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would say that googlebot definitely "should" follow both kinds of exclusion if both are present. However, as you've discovered, there can be a difference between what googlebot "should" do and what the little devil actually "does".

I have long been concerned about this, and several times I thought I saw the kind of spidering you mention. So I currently make it a standard practice to give a full set of Disallow instructions to googlebot in robots.txt, even at the expense of duplicating what was already in the wildcard instructions.

Sorry to hear about the indexing mess, g1smd. I wonder if this paragraph comes into play here -- it seems pretty cryptic to me:

Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do. For example, consider the following robots.txt file:
User-Agent: *
Allow: /
Disallow: /cgi-bin

It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do.

For more information, please see the Robots FAQ [robotstxt.org]. If there still seems to be a problem, please let us know [books.google.com].

[books.google.com...]

bird

4:57 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What appears to happen is that as soon as you add a User-agent: Googlebot section, Google starts indexing URLs that are not mentioned in that section, even if they are mentioned in the User-agent: * section.

In other words, Googlebot is working exactly as it should.

The robots.txt records for the different UAs are not cumulative, and the * record is not added on top of them either. This is necessary, because you may want to give one spider more privileges than the unnamed ones, and another one less than that.

So yes, if you have a User-agent: Googlebot section, AND you're using that only to impose additional restrictions on Googlebot, then you will have to repeat all of your User-agent: * stuff in there as well.

The reference is here [kollar.com], btw.

tedster

5:01 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for that, bird. I never paid close enough attention before. I think the relevant part is this distinction between "weakly specific" and "strongly specific":

For example, in the following /robots.txt file:

User-Agent: *
Disallow: /

User-Agent: Lycos
Disallow: /cgi-bin/ /tmp/

It can be observed that there are two logical records. The first record is weakly specific to all robots, and effectively disallows access to all files on the server. The second record is specific to a robot that would have the string "Lycos" (without regard to case) in the User-Agent field of the Hypertext Transfer Protocol Request that it would make of a server. Any robot parsing this file would choose the record that is most specific to it. The "Lycos robot" would choose the second record, all other robots would choose the first.

GoogleGuy

5:54 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Good example, tedster. In that example, if Lycos followed the less specific directive, it couldn't fetch any pages at all--which appears to be against the behavior that someone would reasonably expect in that robots.txt file.

If there's a weak specification and a specific specification for Googlebot, we'll go with the one for Googlebot. g1smd, if you include specific directions for Googlebot and also want Googlebot to obey the "generic" directives, you'd need to include allows/disallows from the generic section in the Googlebot section.

I believe most/all search engines interpret robots.txt this way--a more specific directive takes precedence over a weaker one.

mcavic

7:07 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What appears to happen is that as soon as you add a User-agent: Googlebot section, Google starts indexing URLs that are not mentioned in that section, even if they are mentioned in the User-agent: * section.

Yes, that's how it should be, because you might want to allow Google more access than all other bots.

walkman

7:36 pm on Aug 13, 2006 (gmt 0)



>> User-Agent: *
Disallow: /

If I did it this way, I would do it with the intention of NOT allowing any bot access to my site.
In my opinion,
User-Agent: *
Disallow: /
should stop all of them dead in their tracks; there's no need to go further down and read bot-specific info, as Lycos, MSNBOT, Googlebot and all the rest are, by definition, 'User-Agents.'

You might want to use an Apache rewrite, if applicable. This is what I have, courtesy of the Apache forum:
RewriteEngine On
# Match the bots anywhere in the UA string (an anchored ^Googlebot would miss
# UAs such as "Mozilla/5.0 (compatible; Googlebot/2.1; ...)")
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
# Only when the original request line itself asks for a .php URL
RewriteCond %{THE_REQUEST} \.php
# Answer those requests with 403 Forbidden
RewriteRule \.php - [F]

I have the URLs rewritten and do not want the bots to index .php files. For MSN and Google, which support wildcard extension patterns, I also have that in the robots.txt.
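
Something along these lines, as a sketch (the trailing $ end-anchor comes from the extended wildcard syntax that Google and MSN support; it is not part of the original standard):

User-agent: Googlebot
Disallow: /*.php$

User-agent: msnbot
Disallow: /*.php$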

[edited by: walkman at 7:39 pm (utc) on Aug. 13, 2006]

tedster

7:37 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I hadn't given Kollar's page about robots.txt [kollar.com] very detailed attention before now. Thanks again bird, it's worth a close study. Just be careful to note that some sections are proposals that have not yet been adopted. But it sure does a better job explaining things than the relatively sparse information at robotstxt.org -- even though the language is a bit technically dense at times.

In fact, through reading I've discovered that I have a few robots files to update immediately.

tedster

7:40 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I did it this way, I would do it with the intention of NOT allowing any bot access to my site.

But if you did it this way, why would you be including that second rule at all -- the rule for Lycos? If the intention was to block all bots, surely you would stop with the first rule, right?

walkman

7:44 pm on Aug 13, 2006 (gmt 0)



>> But if you did it this way, why would you be including that second rule at all -- the rule for Lycos? If the intention was to block all bots, surely you would stop with the first rule, right?

Or you just changed your mind, or you are working on something and for the time being you don't want them to index it. Instead of deleting it all, you leave the robots.txt as it is and just add a few lines on top (that's how I did it twice; not sure if it made an impact or not).

I think a case can be made for both readings: first you tell the SEs that you want no robots at all, but then you tell Google to index this. We all just need to adopt one standard and stick with it.

[edited by: walkman at 7:56 pm (utc) on Aug. 13, 2006]

SEOcritique

7:47 pm on Aug 13, 2006 (gmt 0)

10+ Year Member



GoogleGuy's response is quite interesting. Many web site administrators expect robots.txt to follow a top-down hierarchy or an inherited hierarchy like CSS. While understandable, this is obviously not the case. It looks like the robots.txt standard could use some added specification.

Perhaps something on the order of:

    User-Agent: *
    Disallow: / !important

[edited by: SEOcritique at 7:50 pm (utc) on Aug. 13, 2006]

graywolf

7:55 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What I find particularly "funny" about Googlebot is the way it shows up places it shouldn't.

For example, let's say you create a new AdWords landing page. This page has no external links to it at all; the only way someone is going to get to the page is by clicking on an AdWords ad. Shortly after you create the ad, you will see the standard Googlebot come and fetch the page with the user agent shown below.

(compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Now if this were the AdWords bot I wouldn't be so "concerned", but seeing how Google is now cloaking its bots ... I'll let you draw your own conclusions and pick out your own reflective silver headgear.

mcavic

7:57 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In my opinion,
User-Agent: *
Disallow: /
should stop all of them dead in their tracks; there's no need to go further down and read bot-specific info, as Lycos, MSNBOT, Googlebot and all the rest are, by definition, 'User-Agents.'

No, because if that were the case, there would be no way for you to specify "all bots except Googlebot".

g1smd

8:04 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



OK. I follow the logic, but I've read robotstxt.org many times, and never got the hint that it worked that way before.

I can't believe I have missed this important fact for 8 years, but this is the first time I have added a specific section for Googlebot to a robots.txt file. I much prefer using the robots meta tag on each applicable page.

We need to make sure the "new" information is widely disseminated, as I expect that many others have missed it too.

jdMorgan

8:46 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The number one mistake in interpreting robots.txt is to think that the robot examining it is programmed to be "smart" about discerning what you want. While it's certainly possible for today's robots to be smarter than they were back when the Standard was proposed, the robots.txt Standard was invented in much simpler times and for much simpler web sites.

Instead of thinking, "I'm sure they can handle this" and interpreting the Standard liberally, it's best to think in terms of "the robots will do (or recognize) this and no more." Keep it dirt-simple, in other words.

It is best to assume that a robot will accept the first record it finds that matches its user-agent token, and to interpret "matches" as meaning that the robot will accept either a "*" or its specific user-agent token, whichever it finds first. I know of several robots that will read past a "*" record to see if they can find a more specific record, but there are many, many that won't.

In other words, the "User-agent: *" record should always be the catch-all record at the end of your robots.txt file, and all robot-specific records should precede it.

This is because a given robot will accept directives from one robots.txt record and no more. And while a few robots support the (required) feature of specifying multiple user-agents in a single record, this causes many others to blow up completely and either go away or crawl the whole site. So even some of what is clearly defined in the standard is poorly supported. If that's not clear, I'm referring to this construct:

User-agent: googlebot
User-agent: Slurp
Disallow: /cgi-bin
Disallow: /admin


That construct, while clearly required* by the Standard, is not supported by many second-tier robots.
* From A Standard for Robot Exclusion: "The record starts with one or more User-agent lines" (emphasis added)

My conclusion is that the best approach is to serve a different robots.txt file to each robot -- test the user-agent in the request and serve an appropriate, separate file to each one that matters to your site, then serve a generic one to the 'bots that send you little or no traffic. Only in this way can you be sure that a proprietary directive intended for one robot won't cause a less sophisticated robot to reject the file as invalid and either spider the whole site or just go away without crawling at all.

When doing so, it's best to code the logic so that new variants of recognized robots won't be turned away. The big search companies are developing a penchant for releasing additional 'specialty' robots at an ever-increasing pace, and for changing their user-agent string arrangement for no good reason. Nevertheless, it wouldn't do to send an otherwise-welcome robot packing because you don't recognize the new version.
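
A rough mod_rewrite sketch of that approach (the file names robots-googlebot.txt and robots-slurp.txt are made up for the example; using a substring match rather than an anchored one keeps new variants such as "Googlebot-Image" covered):

RewriteEngine On
# Googlebot and its specialty variants get a tailored file
# (internal rewrite, so the robot still requests /robots.txt)
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^/?robots\.txt$ /robots-googlebot.txt [L]
# Yahoo's Slurp gets its own file
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
RewriteRule ^/?robots\.txt$ /robots-slurp.txt [L]
# All other user-agents fall through to the generic /robots.txt on disk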

vanessafox

9:41 pm on Aug 13, 2006 (gmt 0)

10+ Year Member



Hi Brett/moderators,

I hope it's OK to post these links as they describe how Googlebot handles robots.txt files.

This page provides links to information on robots.txt files as they pertain to Googlebot:
[google.com...]

And this page provides information on how Googlebot interprets the situation being discussed here:
[google.com...]

If you want to block access to all bots other than the Googlebot, you can use the following syntax:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

Googlebot follows the line directed at it, rather than the line directed at everyone.

You can always use the robots.txt analysis tool in our webmaster tools to see how Googlebot will interpret a robots.txt file.

g1smd

9:46 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Cool. Now we're talking.

Many thanks for such a rapid resolution to such an easy misunderstanding...

jdMorgan

10:03 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Googlebot is "smart" in that it uses record-specificity priority, rather than record-order priority.

For most other robots, the safest way to exclude all but the specified robots would be:

User-agent: The_Allowed_Bot
Disallow:

User-agent: Another_Allowed_Bot
Disallow:

User-agent: *
Disallow: /


Here the allowed 'bots find their specific record first, and accept it, while all other robots continue to the catch-all record at the end.

This should work for Googlebot as well, and is a much safer bet for 'dumber' robots.

Jim

[edited by: jdMorgan at 10:05 pm (utc) on Aug. 13, 2006]

GoogleGuy

12:47 am on Aug 14, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for hopping in, Vanessa. Personally, I always take a new robots.txt file out for a test drive before I put it live. There's a checker in the Google Webmaster Tools, and Brett offered one of the earliest ones I saw on the web.

The rule of thumb I always use is "the most specific directive applies." So if you say "Everyone in the room, leave. g1smd, please stay; we need to chat" then everyone but g1smd would mosey.

Although this is how we've done things for a long time (and I think every other major engine works this way), I agree it's good to get the word out, g1smd. It's on the front page of WebmasterWorld, so I think the word is out. :) My takeaway would be to find a good robots.txt checker and test out a new file before making it live.

[edited by: GoogleGuy at 12:49 am (utc) on Aug. 14, 2006]

bufferzone

12:11 pm on Aug 14, 2006 (gmt 0)

10+ Year Member



I would definitely recommend that you take a close look at Google Webmaster Tools. The robots.txt tool will do validation as well as test new entries. You also have the ability to run detailed tests for the many flavours of Googlebot, so that you can specify exactly which bot can do what.

Brett_Tabke

2:55 pm on Aug 14, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Also be aware that when you are using nonstandard robots.txt syntax, bad things can happen. Never use nonstandard syntax with wildcard agent names.


User-Agent: *
Allow: /
Disallow: /cgi-bin

That will disallow the entire site on some bots.

There is no "Allow" syntax in the original standard, and we have documented Slurp misinterpreting it in the past as a disallow line, which ultimately removed the entire site from Yahoo. That behavior has since been changed, but it is a clear-cut case of what can go wrong when using nonstandard (improper) syntax in robots.txt.
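
The standard-only way to express the same intent is simply to leave out the Allow line; anything not explicitly disallowed is crawlable by default:

User-Agent: *
Disallow: /cgi-bin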

SEOEgghead

7:36 pm on Aug 15, 2006 (gmt 0)

10+ Year Member



I think Google is only following the specification here.

This debate arises from a somewhat ambiguous statement in the robots.txt standard.

"the record describes the default access policy for any robot that has not matched any of the other records."

In other words, if a specific record is present, the "*" record is ignored. Order shouldn't matter.

So ... according to a close reading of the specification, the rules for a specific user agent _entirely override_ the "User-agent: *" rules. Therefore, any rule under "User-agent: *" that should also be applied to googlebot must be repeated under "User-agent: googlebot."

Am I wrong?

[edited by: SEOEgghead at 7:37 pm (utc) on Aug. 15, 2006]

mcavic

10:20 pm on Aug 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So ... according to a close reading of the specification, the rules for a specific user agent _entirely override_ the "User-agent: *" rules. Therefore, any rule under "User-agent: *" that should also be applied to googlebot must be repeated under "User-agent: googlebot."

Correct.

SEOEgghead

10:48 pm on Aug 15, 2006 (gmt 0)

10+ Year Member



Out of curiosity, does anyone know if Yahoo and MSN also follow this very literal interpretation of the specification? I would hope so.

I agree that it's wise to put "*" last regardless. I think that cures the ambiguity, right?

I think many programmers would tend to read it the _wrong_ way. I have in the past. It kind of reads like a "switch" statement in C -- where order counts, and where the default case conventionally goes last. It's a so-so analogy, but you see my point. If they had used the token "default" instead of "*", nobody would mistake it for a glob/regex like many have ... </rant>

tedster

12:05 am on Aug 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Up above, GoogleGuy said "I believe most/all search engines interpret robots.txt this way--a more specific directive takes precedence over a weaker one."

I tend to trust his experience on this one.

jdMorgan

12:18 am on Aug 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would recommend going with the Standard, and coding to the lowest common denominator to avoid problems. While our esteemed Googlers will undoubtedly provide authoritative answers for how Google does things, and Google's implementations are usually correct, compliant, and robust, extending those attributes to 'most robots' is rather a big stretch.

Jim

mcavic

1:22 pm on Aug 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



does anyone know if Yahoo and MSN also follow this very literal interpretation of the specification

They would have to. It's the only interpretation that works logically, isn't it? If a bot were to follow its own section AND the * section, then there would be no way to allow that bot more access than all the other bots (since Allow: is non-standard).

Brett_Tabke

2:10 pm on Aug 17, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



> It's the only interpretation that works logically, isn't it?

No. Some bots have in the past worked in increasing order of importance: what comes last in the robots.txt trumps what comes first. A sequential reading of robots.txt is how Infoseek and Scooter used to work.

[edited by: Brett_Tabke at 4:23 pm (utc) on Aug. 19, 2006]

g1smd

2:19 pm on Aug 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



New search results published to [gfe-eh.google.com ] today show that the modified robots.txt file is now being followed, and all the problems with indexed URLs are being cleared up. I don't yet see any change on most other DCs.

SEOEgghead

9:26 pm on Aug 17, 2006 (gmt 0)

10+ Year Member



Brett,

Who says those bots were logical, though? :) Anyway, my assessment is that you should place the "*" (default) rule last and repeat, in each bot-specific record, any rules from "*" that you also want applied to that bot. This covers most of the bases.

Make sense?
