Twitter Robots.TXT
-
Hello Moz World,
So, I'm trying to wrap my head around all of the different robots.txt files. I decided to dive into a site like Twitter and look at their robots.txt. And now, I'm super confused. What are they telling the search engines with /hashtag/*?src=? Why don't they just use:
User-agent: *
Disallow:
But, they address each search engine. Is there any benefit to this?
Thanks for all of the awesome responses!!!
B/R
Will H.
-
Thanks Martijn. That makes a lot of sense. I'm working with small websites, but hopefully I will be moving on to bigger fish.
-
Thank you for the awesome response and taking the time to write this all out. It was very helpful!
-
To answer your question about why they would set up different statements for different search engines: when huge sites become more complicated in their structure, you also want a chance to see how different engines deal with pages and how they crawl some of them. Setting up the statements separately for each engine gives a better overview of what is being crawled by a specific engine and what isn't.
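As a rough illustration of per-engine groups, here is a small Python sketch using the standard library's urllib.robotparser. The robots.txt content, the /private/ path, and the example.com URLs are made up for this example and are not Twitter's actual rules. Note that this parser only does simple prefix matching, so the sketch avoids wildcards.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot gets its own group,
# every other crawler falls back to the "*" group.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text into a RobotFileParser without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot is only blocked from /private/.
googlebot_page = parser.can_fetch("Googlebot", "https://example.com/page")
googlebot_private = parser.can_fetch("Googlebot", "https://example.com/private/x")

# Any crawler without its own group is blocked from everything.
other_page = parser.can_fetch("SomeOtherBot", "https://example.com/page")

print(googlebot_page, googlebot_private, other_page)
```

With a file like this, each engine reads only the group that names it, which is what makes it possible to open or close parts of a site on a per-engine basis.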
-
At a glance, I couldn't tell you what their motivation is to do so but it seems they're addressing individual search engines to show/block various things on a per-engine basis.
Being Twitter I'm sure they have their reasons for doing this but from the outside, it's beyond me what that motivation is!
What are they telling the search engines with /hashtag/*?src=
The full line, Allow: /hashtag/*?src=, tells the respective engine that it is allowed to crawl the hashtag pages.
To better explain exactly what's going on here, let's take a look at a working example. If you click on a #SEO hashtag on Twitter (note, you have to click on one, not just search for one, that's a different string) you'll arrive at this URL:
https://twitter.com/hashtag/SEO?src=hash
A * is known as a wildcard and is essentially a variable: anything can go in that place and the statement still applies. In this particular example, the matching path is /hashtag/SEO?src=hash. The "SEO" part could be replaced by any other hashtag name, as in the examples below, and the Allow statement would still apply.
/hashtag/Marketing?src=hash
/hashtag/SEM?src=hash
/hashtag/WebDesign?src=hash
/hashtag/Digital?src=hash
As a general rule, I'd suggest looking at more basic websites for a better example to follow. These big guys have to handle some issues that the rest of us don't, so a normal robots.txt is rarely more than 10 lines if the site is built correctly.
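To make the wildcard behaviour described above a bit more concrete, here is a minimal Python sketch of how a crawler might test a path against a rule like /hashtag/*?src=. It assumes the common convention that * matches any run of characters, a trailing $ anchors the end of the URL, and everything else is a prefix match. The function names are my own, and this is not how any particular search engine actually implements it.

```python
import re


def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each "*" match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.compile(regex)


def rule_matches(pattern, path):
    """Return True if the rule pattern applies to the path (matched from its start)."""
    return pattern_to_regex(pattern).match(path) is not None


RULE = "/hashtag/*?src="

for path in (
    "/hashtag/SEO?src=hash",
    "/hashtag/Marketing?src=hash",
    "/hashtag/SEO",
    "/search?src=hash",
):
    print(path, rule_matches(RULE, path))
```

The first two paths match the rule because the hashtag name fills the wildcard. The last two do not: one has no ?src= part, and the other does not start with /hashtag/.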