Twitter Robots.txt
-
Hello Moz World,
So, I'm trying to wrap my head around all of the different robots.txt files out there. I decided to dive into a site like Twitter and look at their robots.txt, and now I'm super confused. What are they telling the search engines with /hashtag/*?src=? Why don't they just use:
User-agent: *
Disallow:
But instead, they address each search engine individually. Is there any benefit to this?
Thanks for all of the awesome responses!!!
B/R
Will H.
-
Thanks, Martijn. That makes a lot of sense. I'm working with small websites, but hopefully I'll be moving on to bigger fish.
-
Thank you for the awesome response and taking the time to write this all out. It was very helpful!
-
To answer your question about why they would set up different statements for different search engines: as huge sites become more complicated in their structure, you also want a way to see how the different engines deal with certain pages and with crawling some of them. Setting the statements up separately gives you a clearer overview of what is and isn't being crawled for each specific engine.
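For illustration, here's a hypothetical, stripped-down file along those lines. This is just a sketch, not Twitter's actual robots.txt - the engine names and paths are examples only:

```
# Hypothetical sketch, not Twitter's real file
User-agent: Googlebot
Allow: /hashtag/*?src=
Disallow: /search/

User-agent: Bingbot
Allow: /hashtag/*?src=

# Every crawler not named above
User-agent: *
Disallow: /search/
```

Each crawler follows only the group that addresses it most specifically, so you can open up or close off a section for one engine and then watch how that engine's crawl coverage changes on its own.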
-
At a glance, I couldn't tell you what their motivation is, but it seems they're addressing individual search engines to allow or block various things on a per-engine basis.
Being Twitter, I'm sure they have their reasons for doing this, but from the outside it's beyond me what that motivation is!
What are they telling the search engines with /hashtag/*?src=
The full line Allow: /hashtag/*?src= says to allow the respective engine to crawl the hashtag pages.
To better explain exactly what's going on here, let's take a look at a working example. If you click on a #SEO hashtag on Twitter (note, you have to click on one, not just search for one, that's a different string) you'll arrive at this URL:
https://twitter.com/hashtag/SEO?src=hash
A * is known as a wildcard and is essentially a variable: anything can go in that place and the statement still applies. In this particular example, it's /hashtag/SEO?src=hash. The "SEO" part could be replaced by any other hashtag name, like the examples below, and the Allow statement would still apply (there's a small sketch of the matching logic at the end of this answer).
/hashtag/Marketing?src=hash
/hashtag/SEM?src=hash
/hashtag/WebDesign?src=hash
/hashtag/Digital?src=hash

As a general rule, I'd suggest looking at more basic websites for a better example to follow - these big guys have to handle some issues that the rest of us don't, so a normal robots.txt is rarely more than 10 lines if the site is built correctly.
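To make the wildcard behaviour concrete, here's a minimal sketch of the matching logic in Python - my own rough approximation of Google's documented wildcard rules, not code from any search engine:

```python
import re

def google_style_match(pattern: str, path: str) -> bool:
    """Roughly emulate robots.txt wildcard matching as Google documents it:
    '*' matches any run of characters, '$' anchors the end of the URL,
    and everything else is a literal prefix match."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    # re.match anchors at the start of the string, which gives us
    # the "rule is a prefix" behaviour for free
    return re.match(regex, path) is not None

pattern = "/hashtag/*?src="  # the Allow rule discussed above

for path in [
    "/hashtag/SEO?src=hash",        # True  - "SEO" fills the wildcard
    "/hashtag/Marketing?src=hash",  # True  - so does "Marketing"
    "/search?q=%23SEO",             # False - doesn't start with /hashtag/
]:
    print(path, google_style_match(pattern, path))
```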
Related Questions
-
What happens to crawled URLs subsequently blocked by robots.txt?
We have a very large store with 278,146 individual product pages. Since these are all various sizes and packaging quantities of fewer than 200 product categories, my feeling is that Google would be better off making sure our category pages are indexed. I would like to block all product pages via robots.txt until we are sure all category pages are indexed, then unblock them. Our product pages rarely change and have no ratings or product reviews, so there is little reason for a search engine to revisit a product page. The sales team is afraid blocking a previously indexed product page will result in it being removed from the Google index and would prefer to submit the categories by hand, 10 per day, via requested crawling. Which is the better practice?
Intermediate & Advanced SEO | AspenFasteners
-
Robots.txt wildcards - the devs had a disagreement - which is correct?
Hi – the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? anywhere within this directory and all of its subdirectories. The second developer suggested that this wildcard would only block URLs featuring a ? that comes immediately after /shirts/ - for example: /shirts?minprice=10&maxprice=20 - but argued that this robots.txt directive would not block URLs featuring a ? in subdirectories, e.g. /shirts/blue?mprice=100&maxp=20. So which of the developers is correct? Beyond that, I assumed that the ? should feature a * on each side of it - for example, /*?* - to work as intended above? Am I correct in assuming that?
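Applying Google's documented wildcard rules - a rule is a literal prefix match, with * standing for any run of characters - here's a quick trace of that directive against the example URLs (a sketch of the logic, not output from any official parser):

```
User-agent: *
Disallow: /shirts/?*

# Tracing /shirts/?* as the literal prefix "/shirts/?" plus a wildcard:
# /shirts/?color=red              -> blocked (starts with /shirts/?)
# /shirts?minprice=10&maxprice=20 -> not blocked (no / between "shirts" and "?")
# /shirts/blue?mprice=100&maxp=20 -> not blocked (? is not directly after /shirts/)
# Note: the trailing * is redundant, since rules already match as prefixes.
```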
Intermediate & Advanced SEO | McTaggart
-
Robots.txt - Do I block bots from crawling the non-www version if I use www.site.com?
My site is set up at http://www.site.com and I have the non-www version redirected to the www version in the .htaccess file. My question is: what should my robots.txt file look like for the non-www site? Do you block robots from crawling the site like this? Or do you leave it blank? User-agent: * Disallow: / Sitemap: http://www.morganlindsayphotography.com/sitemap.xml Sitemap: http://www.morganlindsayphotography.com/video-sitemap.xml
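One thing worth noting: if the htaccess rule redirects every non-www URL, it covers /robots.txt too, so the non-www host never serves its own file - crawlers that request it just get bounced to the www copy, and Googlebot follows that redirect. A quick way to sanity-check what a crawler actually receives (a minimal sketch; site.com is the placeholder from the question):

```python
import requests

# Ask the non-www host for robots.txt without following redirects,
# to see exactly what a crawler is handed on the first request:
resp = requests.get("http://site.com/robots.txt", allow_redirects=False)
print(resp.status_code)              # expect 301 if the htaccess rule covers it
print(resp.headers.get("Location"))  # expect http://www.site.com/robots.txt
```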
Intermediate & Advanced SEO | morg45454
-
Now that Google will be indexing Twitter, are Twitter backlinks likely to affect website rank in the SERPs?
About a year (or two) ago, Matt Cutts said that Twitter and FB have no effect on website rank, in part because Google can't get to the content. Now that Google will be indexing Twitter (again), do we expect that links in Twitter posts will be useful backlinks for improving SERP rank?
Intermediate & Advanced SEO | Thriveworks-Counseling
-
How to handle a blog subdomain on the main sitemap and robots file?
Hi, I have some confusion about how our blog subdomain is handled in our sitemap. We have our main website, example.com, and our blog, blog.example.com. Should we list the blog subdomain URL in our main sitemap? In other words, is listing a subdomain allowed in the root sitemap? What does the final structure look like in terms of the sitemap and robots file? Specifically: example.com/sitemap.xml - would I include a link to our blog subdomain (blog.example.com)? example.com/robots.txt - would I include a link to BOTH our main sitemap and blog sitemap? blog.example.com/sitemap.xml - would I include a link to our main website URL (even though it's not a subdomain)? blog.example.com/robots.txt - does a subdomain need its own robots file? I'm a technical SEO and understand the mechanics of much of on-page SEO... but for some reason I never found an answer to this specific question and I am wondering how the pros do it. I appreciate your help with this.
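For reference, a sketch of one common arrangement (illustrative URLs only, and not the only valid layout): robots.txt is fetched per hostname, so the subdomain serves its own file, and each file points at that host's own sitemap.

```
# example.com/robots.txt
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml

# blog.example.com/robots.txt - the subdomain needs its own robots file,
# since crawlers request /robots.txt separately from every hostname:
User-agent: *
Disallow:
Sitemap: https://blog.example.com/sitemap.xml
```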
Intermediate & Advanced SEO | seo.owl
-
Should all pages on a site be included in either your sitemap or robots.txt?
I don't have any specific scenario here, but I'm just curious, as I fairly often come across sites that have, for example, 20,000 pages but only 1,000 in their sitemap. If they only want 1,000 of their URLs included in their sitemap and indexed, should the others be excluded using robots.txt or a page-level exclusion? Is there a point to having pages that are included in neither, leaving it up to Google to decide?
Intermediate & Advanced SEO | RossFruin
-
Robots.txt error message in Google Webmaster from a later date than the page was cached - how is that possible?
I have error messages in Google Webmaster that state that Googlebot encountered errors while attempting to access the robots.txt. The last date that this was reported was on December 25, 2012 (Merry Christmas), but the last cache date was November 16, 2012 (http://webcache.googleusercontent.com/search?q=cache%3Awww.etundra.com/robots.txt&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a). How could I get this error if the page hasn't been cached since November 16, 2012?
Intermediate & Advanced SEO | eTundra
-
Robots.txt: Can you put a /* wildcard in the middle of a URL?
We have noticed that Google is indexing the language/country directory versions of directories we have disallowed in our robots.txt. For example: Disallow: /images/ is blocked just fine. However, once you add our /en/uk/ directory in front of it, there are dozens of pages indexed. The question is: can I put a wildcard in the middle of the string, e.g. /en/*/images/, or do I need to list out every single country for every language in the robots file? Anyone know of any workarounds?
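For what it's worth, Google's documented wildcard support does allow * anywhere in the path, and the wildcard matches across slashes, so a single extra rule can cover every language/country prefix. A sketch using the directories from the question:

```
User-agent: *
Disallow: /images/
# * can sit in the middle of the path and matches across slashes,
# so this one line covers /en/uk/images/, /en/us/images/,
# /fr/fr/images/, and so on:
Disallow: /*/images/
```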
Intermediate & Advanced SEO | IHSwebsite