Robots.txt on refinements
-
In dealing with Panda, do you think it is a good idea to put all refinements for category pages in the robots.txt file? We already have a lot set as noindex, follow, but I am wondering if it would be better to address this from a crawl perspective, as the pages are probably thin, duplicate content to Google.
-
Hi There
In general, you probably don't need to do that. Here's how I would normally deal with indexation in WordPress (assuming you're using WordPress):
- Categories - index
- Tags - noindex
- Date archives - noindex
- Author (single author blogs) - noindex
- Author (multi-author) - index
- Subpages - noindex
Basically all these settings are shown in my post here on setting up WordPress: http://moz.com/blog/setup-wordpress-for-seo-success
Yoast is the best plugin to do all this with!
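To make that concrete, a noindexed archive (tags, date archives, subpages) just ends up with a robots meta tag in its head. The exact markup depends on the plugin you use, but it looks roughly like this:
    <meta name="robots" content="noindex,follow" />
The "follow" part keeps link equity flowing through the archive even though the page itself stays out of the index.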
-
One of the most common mistakes I see in SEO: there's nothing about robots.txt that keeps pages from being indexed. In fact, it's just the opposite. If you have existing pages to which you've added no-index, but you also block them with robots.txt, the search crawler will never see them to pick up the no-index and therefore won't know it's supposed to remove them. So they would still count against you as thin content even though they're not being crawled. NOT the result you're looking for.
If you can no-index them, great. If not, at least use canonical tags to point them to the primary version of the category page. (Remember, no-index is a FAR stronger command to the search engines than canonical tags, which they treat as "suggestions".)
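To illustrate, a refinement page pointing back to its primary category would carry something like this in its head (the URL here is just a placeholder):
    <link rel="canonical" href="http://www.example.com/category/widgets/" />
Google treats that as a hint to consolidate the duplicates, whereas a no-index is an explicit instruction.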
The only time it's appropriate to block no-indexed pages with robots.txt is if you're absolutely certain the pages have never made it into the index in the first place. If they've never been indexed, you can no-index them as a safeguard, and then drop them behind the robots.txt block to save crawl budget.
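As a rough sketch of that last scenario, assuming the refinements live under a path or parameter that has never been indexed (both patterns below are hypothetical), the robots.txt entries would look something like:
    User-agent: *
    Disallow: /category/refinements/
    Disallow: /*?refine=
Again, only do this once you're certain those URLs never made it into the index; otherwise the no-index tag never gets recrawled.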
Hope that makes sense?
Paul
-
I don't know if you have those taco commercials where you live, the ones with the little girl who says "Why not both!", but you might do just that. It wouldn't hurt, and it would help you sleep better at night.
Oh, here is a link to the commercial, https://www.youtube.com/watch?v=vqgSO8_cRio
-
Related Questions
-
Robots.txt on http vs. https
We recently changed our domain from http to https. When a user enters any URL on http, there is a global 301 redirect to the same page on https. I cannot find instructions about what to do with robots.txt. Now that https is the canonical version, should I block the http version with robots.txt? Strangely, I cannot find a single resource about this...
-
Question about construction of our sitemap URL in robots.txt file
Hi all, this is a webmaster/SEO question. This is the sitemap URL currently in our robots.txt file: http://www.ccisolutions.com/sitemap.xml As you can see, it leads to a page with two URLs on it. Is this a problem? Wouldn't it be better to list both of those XML files as separate line items in the robots.txt file? Thanks! Dana
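(For reference, the alternative Dana describes would simply mean two Sitemap lines in robots.txt, something like the sketch below; the child filenames are placeholders, since the real names aren't shown above. Both this and a single sitemap index file that references the child sitemaps are standard setups.)
    Sitemap: http://www.ccisolutions.com/sitemap-products.xml
    Sitemap: http://www.ccisolutions.com/sitemap-pages.xml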
-
OK to block /js/ folder using robots.txt?
I know Matt Cutts suggests we allow bots to crawl CSS and JavaScript folders (http://www.youtube.com/watch?v=PNEipHjsEPU), but what if you have lots and lots of JS and you don't want to waste precious crawl resources? Also, as we update and improve the JavaScript on our site, we iterate the version number ?v=1.1... 1.2... 1.3... etc., and the legacy versions show up in Google Webmaster Tools as 404s. For example:
http://www.discoverafrica.com/js/global_functions.js?v=1.1
http://www.discoverafrica.com/js/jquery.cookie.js?v=1.1
http://www.discoverafrica.com/js/global.js?v=1.2
http://www.discoverafrica.com/js/jquery.validate.min.js?v=1.1
http://www.discoverafrica.com/js/json2.js?v=1.1
Wouldn't it just be easier to prevent Googlebot from crawling the js folder altogether? Isn't that what robots.txt was made for? Just to be clear - we are NOT doing any sneaky redirects or other dodgy JavaScript hacks. We're just trying to power our content and UX elegantly with JavaScript. What do you guys say: obey Matt, or run the JavaScript gauntlet?
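(For reference, blocking the folder described above would be a one-rule sketch in robots.txt, shown below. Whether that's wise is exactly the trade-off the question raises, since blocked JavaScript can never be fetched by the crawler at all.)
    User-agent: *
    Disallow: /js/
-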
How long does it take for traffic to bounce back from an accidental robots.txt disallow of root?
We accidentally uploaded a robots.txt disallowing root for all agents last Tuesday and did not catch the error until yesterday, so 6 days total of exposure. Organic traffic is down 20%. Google has since indexed the correct version of the robots.txt file. However, we're still seeing awful titles/descriptions in the SERPs, and traffic is not coming back. GWT shows that not many pages were actually removed from the index, but we're still seeing drastic rankings decreases. Anyone been through this? Any sort of timeline for a recovery? Much appreciated!
-
Mobile site: robots.txt best practices
If there are canonical tags pointing to the web version of each mobile page, what should a robots.txt file for a mobile site have?
-
Robots.txt
Hi everyone, I just want to check something. If you have this entered into your robots.txt file:
User-agent: *
Disallow: /fred/
This wouldn't block /fred-review/ from being crawled, would it? Thanks
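(For reference, Disallow rules match by URL path prefix, so a quick sketch of how two such paths compare against that rule:)
    Disallow: /fred/
    /fred/anything/    -> blocked (path starts with /fred/)
    /fred-review/      -> crawlable (prefix is /fred-, not /fred/)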