Twitter Robots.TXT
-
Hello Moz World,
So, I trying to wrap my head around all of the different robots.txt. I decided to dive into a site like Twitter, and look at their robot text. And now, I'm super confused. What are they telling the search engines with /hasttag/*src=. Why don't they just use:
Useragent: *
Disallow:
But, they address each search engine. Is there any benefit to this?
Thanks for all of the awesome responses!!!
B/R
Will H.
-
Thanks Martijn. That makes a lot of sense. I'm working with small websites, but hopefully I will be moving on to bigger fish
-
Thank you for the awesome response and taking the time to write this all out. It was very helpful!
-
To answer your question around why they would set-up different statements for different search engines. When huge sites become more complicated in their structure you also want to have a chance to see how different engines deal with pages and crawling some of them. By setting up the statements differently it creates a better overview in what is being crawled for a specific one and what isn't.
-
At a glance, I couldn't tell you what their motivation is to do so but it seems they're addressing individual search engines to show/block various things on a per-engine basis.
Being Twitter I'm sure they have their reasons for doing this but from the outside, it's beyond me what that motivation is!
What are they telling the search engines with /hasttag/*src=
The full line _Allow: /hashtag/*?src= _says to allow the respective engine to crawl the hashtag pages.
To better explain exactly what's going on here, let's take a look at a working example. If you click on a #SEO hashtag on Twitter (note, you have to click on one, not just search for one, that's a different string) you'll arrive at this URL:
https://twitter.com/hashtag/SEO?src=hash
A * is known as a wildcard and is essentially a variable so anything can go in that place and the statement still applies. In this particular example, it's /hashtag/SEO?src=hash. The bolded "SEO" could be replaced by any other hashtag name like the other examples below and the Allow statement would still apply.
/hashtag/Marketing?src=hash
/hashtag/SEM?src=hash
/hashtag/WebDesign?src=hash
/hashtag/Digital?src=hashAs a general rule, I'd suggest looking at more basic websites for a better example to follow - these big guys have to handle some issues that the rest of us don't so a normal Robots.txt is rarely more than 10 lines if the site is built correctly.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Large robots.txt file
We're looking at potentially creating a robots.txt with 1450 lines in it. This will remove 100k+ pages from the crawl that are all old pages (I know, the ideal would be to delete/noindex but not viable unfortunately) Now the issue i'm thinking is that a large robots.txt will either stop the robots.txt from being followed or will slow our crawl rate down. Does anybody have any experience with a robots.txt of that size?
Intermediate & Advanced SEO | | ThomasHarvey0 -
Robots txt is case senstive? Pls suggest
Hi i have seen few urls in the html improvements duplicate titles Can i disable one of the below url in the robots.txt? /store/Solar-Home-UPS-1KV-System/75652
Intermediate & Advanced SEO | | Rahim119
/store/solar-home-ups-1kv-system/75652 if i disable this Disallow: /store/Solar-Home-UPS-1KV-System/75652 will the Search engines scan this /store/solar-home-ups-1kv-system/75652 im little confused with case senstive.. Pls suggest go ahead or not in the robots.txt0 -
Meta robots or robot.txt file?
Hi Mozzers! For parametric URL's would you recommend meta robot or robot.txt file?
Intermediate & Advanced SEO | | eLab_London
For example: http://www.exmaple.com//category/product/cat no./quickView I want to stop indexing /quickView URLs. And what's the real difference between the two? Thanks again! Kay0 -
meta robots no follow on page for paid links
Hi I have a page containing paid links. i would like to add no follow attribute to these links
Intermediate & Advanced SEO | | Kung_fu_Panda
but from technical reasons, i can only place meta robots no follow on page level (
is that enough for telling Google that the links in this page are paid and and to prevent Google penlizling the sites that the page link to? Thanks!0 -
Robot.txt error
I currently have this under my robot txt file: User-agent: *
Intermediate & Advanced SEO | | Rubix
Disallow: /authenticated/
Disallow: /css/
Disallow: /images/
Disallow: /js/
Disallow: /PayPal/
Disallow: /Reporting/
Disallow: /RegistrationComplete.aspx WebMatrix 2.0 On webmaster > Health Check > Blocked URL I copy and paste above code then click on Test, everything looks ok but then logout and log back in then I see below code under Blocked URL: User-agent: * Disallow: / WebMatrix 2.0 Currently, Google doesn't index my domain and i don't understand why this happening. Any ideas? Thanks Seda0 -
MOZ crawl report says category pages blocked by meta robots but theyr'e not?
I've just run a SEOMOZ crawl report and it tells me that the category pages on my site such as http://www.top-10-dating-reviews.com/category/online-dating/ are blocked by meta robots and have the meta robots tag noindex,follow. This was the case a couple of days ago as I run wordpress and am using the SEO Category updater plugin. By default it appears it makes categories noindex, follow. Therefore I edited the plugin so that the default was index, follow as I want google to index the category pages so that I can build links to them. When I open the page in a browser and view source the tags show as index, follow which adds up. Why then is the SEOMOZ report telling me they are still noindex,follow? Presumably the crawl is in real time and should pick up the new follow tag or is it perhaps because its using data from an old crawl? As yet these pages aren't indexed by google. Any help is much appreciated! Thanks Sam.
Intermediate & Advanced SEO | | SamCUK0 -
Panda Updates - robots.txt or noindex?
Hi, I have a site that I believe has been impacted by the recent Panda updates. Assuming that Google has crawled and indexed several thousand pages that are essentially the same and the site has now passed the threshold to be picked out by the Panda update, what is the best way to proceed? Is it enough to block the pages from being crawled in the future using robots.txt, or would I need to remove the pages from the index using the meta noindex tag? Of course if I block the URLs with robots.txt then Googlebot won't be able to access the page in order to see the noindex tag. Anyone have and previous experiences of doing something similar? Thanks very much.
Intermediate & Advanced SEO | | ianmcintosh0 -
Googlebot Can't Access My Sites After I Repair My Robots File
Hello Mozzers, A colleague and I have been collectively managing about 12 brands for the past several months and we have recently received a number of messages in the sites' webmaster tools instructing us that 'Googlebot was not able to access our site due to some errors with our robots.txt file' My colleague and I, in turn, created new robots.txt files with the intention of preventing the spider from crawling our 'cgi-bin' directory as follows: User-agent: * Disallow: /cgi-bin/ After creating the robots and manually re-submitting it in Webmaster Tools (and receiving the green checkbox), I received the same message about Googlebot not being able to access the site, only difference being that this time it was for a different site that I manage. I repeated the process and everything, aesthetically looked correct, however, I continued receiving these messages for each of the other sites I manage on a daily-basis for roughly a 10-day period. Do any of you know why I may be receiving this error? is it not possible for me to block the Googlebot from crawling the 'cgi-bin'? Any and all advice/insight is very much welcome, I hope I'm being descriptive enough!
Intermediate & Advanced SEO | | NiallSmith1