Moz Q&A is closed.
After more than 13 years and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we're not completely removing the content (many posts will still be viewable), we have locked both new posts and new replies.
How to stop URLs that include query strings from being indexed by Google
-
Hello Mozzers
Would you use rel=canonical, robots.txt, or Google Webmaster Tools to stop the search engines indexing URLs that include query strings/parameters, or perhaps a combination?
I guess it would be a good idea to stop the search engines crawling these URLs, because the content they display tends to be duplicate and of low value to users.
I would be tempted to use a combination of canonicalization and robots.txt for every page I do not want crawled or indexed, but perhaps Google Webmaster Tools is just as effective, or even the better way to go? And I suppose some people use meta robots tags too.
Does Google take a position on being blocked from web pages?
Thanks in advance, Luke
-
Without a specific example, there are a couple of options here. I am going to assume that you have an ecommerce site where parameters are being used for sort functions on search results or for different options on a given product.
I know you may not be able to do this, but using parameters in this case is just a bad idea to start with. If you can (and I know this can be difficult), find a way to rework things so that your site functions without the use of parameters.
You could use canonicals, but then Google would still crawl all of those pages and then go through the process of using the canonical link to work out which page is the canonical one. That is a big waste of Google's time. Why waste Googlebot's time crawling a bunch of pages that you do not want crawled anyway? I would rather Googlebot focus on crawling your most important pages.
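For reference, a canonical is just one line in the <head> of each parameterized URL pointing at the clean version; the domain and paths below are placeholders:

```html
<!-- Served on https://www.example.com/shoes?sort=price&colour=red -->
<!-- Tells Google that the clean category URL is the preferred version to index -->
<link rel="canonical" href="https://www.example.com/shoes/" />
```

Bear in mind that Google treats the canonical as a strong hint rather than a directive, which is part of why it still has to crawl those pages first.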
You can use the robots.txt file to stop Google from crawling sections of your site. The only issue with this is that if some of your pages with a bunch of parameters in them are ranking, then once you tell Google to stop crawling them, you would lose that traffic.
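As a rough illustration, a robots.txt rule that blocks crawling of any URL containing a query string could look like the sketch below; the * wildcard pattern is an extension that Googlebot supports, and the exact rules would depend on your own URL structure:

```
# robots.txt (hypothetical; adjust to your own URL patterns)
User-agent: *
# Block crawling of any URL that contains a query string
Disallow: /*?
```

Clean, parameter-free URLs are unaffected by a rule like this, so they remain crawlable.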
It is not that Google does not "like" being blocked by robots.txt, or that it does not "like" the use of the canonical tag. These are simply directives that Google follows in a certain way, so if they are implemented incorrectly or in the wrong sequence, they can cause negative results, because you have basically told Google to do something without fully understanding what will happen.
Here is what I would do. This is the long version, for long-term success:
1. Look at Google Analytics (or your other analytics) and the Moz tools to see which pages are ranking and sending you traffic. Make a note of your results.

2. Think of the simplest way you could organize your site that would be logical to your users and would allow Google to crawl every page you deem important. Creating a hierarchical sitemap is a good way to do this. How does this relate to what you found in #1?

3. Rework your URL structure to reflect what you found in #2 without using parameters. If you have to use parameters, make sure Google can crawl your basic sitemap without using any of them, then use robots.txt to block crawling of the parameterized URLs (along the lines of the robots.txt sketch above). You have now ensured that Google can crawl, and will rank, pages without parameters, and that you are not hiding any important pages or page information on a URL that uses parameters. There are other reasons not to use parameters (e.g. the URLs are easier for users to remember and tend to be shorter), so think about whether you want to get rid of them altogether.

4. 301 redirect all of your main traffic pages from the old URL structure to the new URL structure, and let all of the old pages, including the ones with parameters, return 404. That way the good pages move to the new URL structure and the bad ones go away. A sketch of what the redirects might look like follows this list.
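For step 4, the redirect mechanism depends on your server. Here is a hypothetical sketch for an Apache server; the paths and domain are made up, and you would need mod_alias and mod_rewrite available:

```apache
# .htaccess (Apache) - hypothetical paths and domain

# Simple case: an old clean path moves to a new one
Redirect 301 /old-category https://www.example.com/new-category/

# A parameterized URL needs mod_rewrite, since Redirect cannot match query strings
RewriteEngine On
RewriteCond %{QUERY_STRING} ^cat=shoes$
RewriteRule ^products\.php$ https://www.example.com/shoes/? [R=301,L]
```

The trailing "?" on the rewrite target drops the old query string from the destination URL.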
Now, if you are stuck using parameters, I would do a variant of the above. Still see if there are any important or well-ranked pages that use parameters, and consider whether there is a way to use the canonical on those pages to point Google at the page that should rank. For all the other pages, I would use the noindex directive to get them out of the Google index, and only later use robots.txt to block Google from crawling them. You want to do this in that sequence, because if you block Google first it will never see the noindex directive.
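To be concrete about the noindex step, it is a meta tag in the <head> of each page you want dropped (or the equivalent X-Robots-Tag HTTP header); a minimal example:

```html
<!-- On each parameterized page you want removed from Google's index -->
<!-- "follow" keeps link equity flowing while the page drops out -->
<meta name="robots" content="noindex, follow">
```

Only add the robots.txt block once those pages have actually dropped out of the index, for the sequencing reason above.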
Now, everything I said above is generally "correct", but depending on your situation, things may need to be tweaked. I hope this information helps you work out the best options for your site and your customers.
Good luck!