Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Blocking Dynamic URLs with Robots.txt
-
Background:
My e-commerce site uses a lot of layered navigation and sorting links. While this is great for users, it ends up in a lot of URL variations of the same page being crawled by Google. For example, a standard category page:
...which uses a "Price" layered navigation sidebar to filter products based on price also produces the following URLs which link to the same page:
http://www.mysite.com/widgets.html?price=1%2C250
http://www.mysite.com/widgets.html?price=2%2C250
http://www.mysite.com/widgets.html?price=3%2C250
As there are literally thousands of these URL variations being indexed, so I'd like to use Robots.txt to disallow these variations.
Question:
-
Is this a wise thing to do? Or does Google take into account layered navigation links by default, and I don't need to worry.
-
To implement, I was going to do the following in Robots.txt:
User-agent: *
Disallow: /*?
Disallow: /*=
....which would prevent any dynamic URL with a '?" or '=' from being indexed. Is there a better way to do this, or is this a good solution?
Thank you!
-
-
If you are happy with any URLs with query strings not being indexed your robots.txt will work fine.
Do any or your URLs with question marks in them have links to them? If so you might want to be careful blocking google from indexing them. I would think you'd lose the benefits those links would pass to your site.
-
Tait,
Thanks for the answer. I think the canonical tag would be ideal, but in terms of implementation, it would require some substantial code modification to the site / PHP code as I have a lot of categories, and adding this manually to each one would be very time consuming.
Would preventing the spiders from indexing any URLs with a "?" or "&" (which would only be dynamic URLs variations) cause any problems? Or is this just not an ideal best practice?
Thanks!
-
I don't know if there's a good solution with robots.txt given your URL structure. However, you could use the rel=canonical link tag in the header to force google to treat many of your URLs the same way. This would help you avoid duplicate content penalties.
More on rel=canonical:
http://www.google.com/support/webmasters/bin/answer.py?answer=139394
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Does google ignore ? in url?
Hi Guys, Have a site which ends ?v=6cc98ba2045f for all its URLs. Example: https://domain.com/products/cashmere/robes/?v=6cc98ba2045f Just wondering does Google ignore what is after the ?. Also any ideas what that is? Cheers.
Intermediate & Advanced SEO | | CarolynSC0 -
What does Disallow: /french-wines/?* actually do - robots.txt
Hello Mozzers - Just wondering what this robots.txt instruction means: Disallow: /french-wines/?* Does it stop Googlebot crawling and indexing URLs in that "French Wines" folder - specifically the URLs that include a question mark? Would it stop the crawling of deeper folders - e.g. /french-wines/rhone-region/ that include a question mark in their URL? I think this has been done to block URLs containing query strings. Thanks, Luke
Intermediate & Advanced SEO | | McTaggart0 -
Help with facet URLs in Magento
Hi Guys, Wondering if I can get some technical help here... We have our site britishbraces.co.uk , built in Magento. As per eCommerce sites, we have paginated pages throughout. These have rel=next/prev implemented but not correctly ( as it is not in is it in ) - this fix is in process. Our canonicals are currently incorrect as far as I believe, as even when content is filtered, the canonical takes you back to the first page URL. For example, http://www.britishbraces.co.uk/braces/x-style.html?ajaxcatalog=true&brand=380&max=51.19&min=31.19 Canonical to... http://www.britishbraces.co.uk/braces/x-style.html Which I understand to be incorrect. As I want the coloured filtered pages to be indexed ( due to search volume for colour related queries ), but I don't want the price filtered pages to be indexed - I am unsure how to implement the solution? As I understand, because rel=next/prev implemented ( with no View All page ), the rel=canonical is not necessary as Google understands page 1 is the first page in the series. Therefore, once a user has filtered by colour, there should then be a canonical pointing to the coloured filter URL? ( e.g. /product/black ) But when a user filters by price, there should be noindex on those URLs ? Or can this be blocked in robots.txt prior? My head is a little confused here and I know we have an issue because our amount of indexed pages is increasing day by day but to no solution of the facet urls. Can anybody help - apologies in advance if I have confused the matter. Thanks
Intermediate & Advanced SEO | | HappyJackJr0 -
Large robots.txt file
We're looking at potentially creating a robots.txt with 1450 lines in it. This will remove 100k+ pages from the crawl that are all old pages (I know, the ideal would be to delete/noindex but not viable unfortunately) Now the issue i'm thinking is that a large robots.txt will either stop the robots.txt from being followed or will slow our crawl rate down. Does anybody have any experience with a robots.txt of that size?
Intermediate & Advanced SEO | | ThomasHarvey0 -
Should I include URLs that are 301'd or only include 200 status URLs in my sitemap.xml?
I'm not sure if I should be including old URLs (content) that are being redirected (301) to new URLs (content) in my sitemap.xml. Does anyone know if it is best to include or leave out 301ed URLs in a xml sitemap?
Intermediate & Advanced SEO | | Jonathan.Smith0 -
Dilemma about "images" folder in robots.txt
Hi, Hope you're doing well. I am sure, you guys must be aware that Google has updated their webmaster technical guidelines saying that users should allow access to their css files and java-scripts file if it's possible. Used to be that Google would render the web pages only text based. Now it claims that it can read the css and java-scripts. According to their own terms, not allowing access to the css files can result in sub-optimal rankings. "Disallowing crawling of Javascript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings."http://googlewebmastercentral.blogspot.com/2014/10/updating-our-technical-webmaster.htmlWe have allowed access to our CSS files. and Google bot, is seeing our webapges more like a normal user would do. (tested it in GWT)Anyhow, this is my dilemma. I am sure lot of other users might be facing the same situation. Like any other e commerce companies/websites.. we have lot of images. Used to be that our css files were inside our images folder, so I have allowed access to that. Here's the robots.txt --> http://www.modbargains.com/robots.txtRight now we are blocking images folder, as it is very huge, very heavy, and some of the images are very high res. The reason we are blocking that is because we feel that Google bot might spend almost all of its time trying to crawl that "images" folder only, that it might not have enough time to crawl other important pages. Not to mention, a very heavy server load on Google's and ours. we do have good high quality original pictures. We feel that we are losing potential rankings since we are blocking images. I was thinking to allow ONLY google-image bot, access to it. But I still feel that google might spend lot of time doing that. **I was wondering if Google makes a decision saying, hey let me spend 10 minutes for google image bot, and let me spend 20 minutes for google-mobile bot etc.. or something like that.. , or does it have separate "time spending" allocations for all of it's bot types. I want to unblock the images folder, for now only the google image bot, but at the same time, I fear that it might drastically hamper indexing of our important pages, as I mentioned before, because of having tons & tons of images, and Google spending enough time already just to crawl that folder.**Any advice? recommendations? suggestions? technical guidance? Plan of action? Pretty sure I answered my own question, but I need a confirmation from an Expert, if I am right, saying that allow only Google image access to my images folder. Sincerely,Shaleen Shah
Intermediate & Advanced SEO | | Modbargains1 -
Product or Shop in URL
What do you think is better for seo and for sale, I am using woo-ecommerce for health products website. websitename.com/product/keyword OR websitename.com/shop/keyword
Intermediate & Advanced SEO | | MasonBaker0 -
Robots.txt is blocking Wordpress Pages from Googlebot?
I have a robots.txt file on my server, which I did not develop, it was done by the web designer at the company before me. Then there is a word press plugin that generates a robots.txt file. How Do I unblock all the wordpress pages from googlebot?
Intermediate & Advanced SEO | | ENSO0