You've probably heard before that your robots.txt file MUST be at example.com/robots.txt. The Robots Exclusion Protocol is 30 years old this year, and I'm here to tell you that what you heard on the internet is not entirely true (shocker).

Say you have a CDN and you have your main site. You have two robots.txt files: one at https://cdn.example.com/robots.txt and one at https://www.example.com/robots.txt. You could have just one central robots.txt with all the rules, say on your CDN, which might help you keep track of all the rules you need to manage. All you have to do is redirect https://www.example.com/robots.txt to https://cdn.example.com/robots.txt, and crawlers that comply with RFC 9309 will simply use the redirect target as the robots.txt file of https://www.example.com/. Weird or what.

Now I wonder if the parsed robots.txt file actually needs to be called robots.txt
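To make the mechanics concrete, here's a minimal sketch of how an RFC 9309-compliant crawler resolves a redirected robots.txt. The redirect map and the `resolve_robots_txt` helper are hypothetical illustrations (real crawlers follow actual HTTP 3xx responses); the five-hop cap reflects the RFC's guidance that crawlers should follow at least five consecutive redirects before they may treat the file as unavailable.

```python
MAX_REDIRECTS = 5  # RFC 9309: follow at least five consecutive redirects

def resolve_robots_txt(url, redirects):
    """Follow a redirect chain; the body at the final URL is treated
    as the robots.txt of the ORIGINAL host that was requested."""
    for _ in range(MAX_REDIRECTS):
        if url not in redirects:
            return url  # no further redirect: this body governs the original host
        url = redirects[url]
    # Beyond the cap, a crawler may treat robots.txt as unavailable
    raise RuntimeError("redirect limit exceeded")

# Hypothetical setup matching the post: www redirects its robots.txt to the CDN
redirect_map = {
    "https://www.example.com/robots.txt": "https://cdn.example.com/robots.txt",
}

final = resolve_robots_txt("https://www.example.com/robots.txt", redirect_map)
# The rules fetched from the CDN now apply to https://www.example.com/
```

The key point the sketch illustrates: the crawler associates the redirect *target's* contents with the *original* host, which is what makes a single central robots.txt workable.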
Gary Illyes Up to 5 redirects, I believe? 🤔
Hi Gary, does the same exact concept also apply to sitemap.xml as well? Need your word on this case, coz I'm having trouble with my developer now haha. Maybe John Mueller can enlighten me too, thanks in advance
The timing of this post is mysteriously on point. Cheers and Thank You for this simple PoA.
Robots.txt handles relative paths, and I haven't found any source saying robots.txt rules can work cross-domain. I'm afraid that with a redirection, the robots rules can become invalid for the location they were redirected from. At least that's how I understand Google's documentation on robots.txt: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
Interesting. I also encountered sites that had redirected their robots.txt file while analyzing the robots.txt files of over 80K websites as part of my "Universal Web Crawler Blocking Report" research (https://saeedkhosravi.ca/universal-web-crawler-blocking-report/). At the time, I figured the redirections were set as part of a complete 301 redirection of the site to a new domain, as I didn't know they might have redirected just the robots.txt file for the purpose of managing it from a central place. This is interesting, as in my future updates I can check how many of the sites in my pool actually redirect only their robots.txt files.
The current handling of special characters in robots.txt files, as outlined in RFC 9309, presents some ambiguities and potential inefficiencies. The specification requires crawlers to support special characters such as #, $, and *, and to use percent-encoding for these characters in URIs. This approach can lead to inconsistencies and confusion, especially when interpreting patterns in robots.txt files.
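To illustrate where those ambiguities come from, here's a naive sketch of the kind of pattern matching RFC 9309 describes: `*` matches any character sequence, a trailing `$` anchors the end of the path, and both sides are normalized to a consistent percent-encoding so that equivalent encodings (e.g. `/caf%C3%A9` and `/café`) compare equal. This is an illustrative toy, not a full or authoritative RFC 9309 matcher.

```python
import re
from urllib.parse import quote, unquote

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Naive sketch: does a robots.txt path pattern match a URI path?"""
    def normalize(p: str) -> str:
        # Decode then re-encode so equivalent percent-encodings compare
        # equal; keep '/' and the special characters '*' and '$' literal.
        return quote(unquote(p), safe="/*$")

    pattern = normalize(pattern)
    path = normalize(path)

    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]

    # Translate '*' wildcards into a regex, escaping everything else
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = "^" + regex + ("$" if anchored else "")
    return re.match(regex, path) is not None

# e.g. robots_pattern_matches("/*.php$", "/index.php") matches,
# while "/index.php" against "/admin$" does not.
```

Even this small sketch shows the pain points the comment describes: whether to compare encoded or decoded forms, and how `%2A` (an encoded `*`) should differ from a literal wildcard, are exactly the spots where implementations diverge.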
Both subdomains have the robots.txt file in the same place (the root) in your example: www.example.com/robots.txt & cdn.example.com/robots.txt. If you wanted to place the file in a folder like /robots/robots.txt, would you have to implement a redirect from /robots.txt to /robots/robots.txt or not? I got confused after your first paragraph. Thanks!
Kinda interesting. I thought the CDN and the main site should each have their own robots.txt since they are different domains
I wonder if there is a way of avoiding keyword stuffing when explaining technical stuff, probably not
If "what you heard on the internet is not entirely true," isn't it the responsibility of the company that creates the gateway to the web's traffic to create the source of truth or correct the record? It would be awesome to have a version of this for webmasters clearly available on Google's documentation for Robots dot text, so we don't have to hang on every word of Google employee social posts. Also, please rebrand "robots dot text" while you are at it. 😉