Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Blocking Dynamic URLs with Robots.txt
Background:
My e-commerce site uses a lot of layered navigation and sorting links. While this is great for users, it ends up in a lot of URL variations of the same page being crawled by Google. For example, a standard category page:
...which uses a "Price" layered navigation sidebar to filter products based on price also produces the following URLs which link to the same page:
http://www.mysite.com/widgets.html?price=1%2C250
http://www.mysite.com/widgets.html?price=2%2C250
http://www.mysite.com/widgets.html?price=3%2C250
As there are literally thousands of these URL variations being indexed, I'd like to use Robots.txt to disallow them.
Question:
- Is this a wise thing to do? Or does Google take layered navigation links into account by default, so that I don't need to worry?
- To implement, I was going to do the following in Robots.txt:
User-agent: *
Disallow: /*?
Disallow: /*=
...which would prevent any dynamic URL with a '?' or '=' from being indexed. Is there a better way to do this, or is this a good solution?
Thank you!
-
-
If you are happy with any URLs with query strings not being indexed, your robots.txt will work fine.
Do any of your URLs with question marks in them have links pointing to them? If so, you might want to be careful about blocking Google from indexing them. I would think you'd lose the benefits those links would pass to your site.
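To check which URLs the two Disallow patterns from the question would catch, here is a minimal Python sketch. It assumes the wildcard semantics that Google documents for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end of the URL) and deliberately ignores Allow rules and rule precedence. The helper names `rule_to_regex` and `is_disallowed` are made up for this illustration and are not part of any library.

```python
import re
from typing import Iterable
from urllib.parse import urlsplit


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end; everything else is treated as a literal prefix.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with '.*' for each wildcard.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(url: str, disallow_rules: Iterable[str]) -> bool:
    """Return True if the URL's path plus query string matches any rule."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # re.match anchors at the start, mirroring prefix matching in robots.txt.
    return any(rule_to_regex(rule).match(target) for rule in disallow_rules)


# The two patterns proposed in the question.
RULES = ["/*?", "/*="]

if __name__ == "__main__":
    for url in (
        "http://www.mysite.com/widgets.html",
        "http://www.mysite.com/widgets.html?price=1%2C250",
    ):
        print(url, "->", "blocked" if is_disallowed(url, RULES) else "allowed")
```

Under these assumptions the plain category URL stays crawlable while every `?price=` variation is blocked. For the example URLs, anything matched by `/*=` is already matched by `/*?`, so the second rule adds nothing there.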
-
Tait,
Thanks for the answer. I think the canonical tag would be ideal, but in terms of implementation, it would require some substantial modification to the site's PHP code, as I have a lot of categories, and adding this manually to each one would be very time-consuming.
Would preventing the spiders from indexing any URLs with a "?" or "&" (which would only be dynamic URL variations) cause any problems? Or is this just not an ideal best practice?
Thanks!
-
I don't know if there's a good solution with robots.txt given your URL structure. However, you could use the rel=canonical link tag in each page's <head> to tell Google to treat many of your URLs as the same page. This would help you avoid duplicate content penalties.
More on rel=canonical:
http://www.google.com/support/webmasters/bin/answer.py?answer=139394
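As a rough illustration of the rel=canonical approach, the tag can be generated once in a shared template instead of being added to each category by hand. The sketch below is in Python rather than the site's PHP, and it assumes that no query parameter changes which page should be treated as canonical (that would not hold for pagination, for example). The function names `canonical_url` and `canonical_link_tag` are hypothetical.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Drop the query string and fragment so filter variations share one URL.

    Assumption: every query parameter is a filter or sort option that
    does not define a distinct page.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> element to place in the page's <head>."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}" />'


if __name__ == "__main__":
    print(canonical_link_tag("http://www.mysite.com/widgets.html?price=1%2C250"))
```

Calling `canonical_link_tag` with the requested URL from a common header template would give all of the price-filtered variations the same canonical target, so the change lives in one place rather than in every category.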