Is there a limit to how many URLs you can put in a robots.txt file?
-
We have a site that has way too many urls caused by our crawlable faceted navigation. We are trying to purge 90% of our urls from the indexes. We put no index tags on the url combinations that we do no want indexed anymore, but it is taking google way too long to find the no index tags. Meanwhile we are getting hit with excessive url warnings and have been it by Panda.
Would it help speed the process of purging urls if we added the urls to the robots.txt file? Could this cause any issues for us? Could it have the opposite effect and block the crawler from finding the urls, but not purge them from the index? The list could be in excess of 100MM urls.
-
Hi Kristen,
I did this recently and it worked. The important part is that you need to block the pages in robots.txt or add a noindex tag to the pages to stop them from being indexed again.
I hope this helps.
-
Hi all, Google Webmaster Tools has a great tool for this. If you go into WMT and select "Google index", then "remove URLs". You can use regex to remove a large batch of URLs then block them in robots.txt to make sure they stay out of the index.
I hope this helps.
-
Great thanks for the input. Per Kristen's post I am worried that it could just block the URLs altogether and they will never get purged from the index.
-
Yes, we have done that and are seeing traction on those urls, but we can't get rid of these old urls as fast as we would like.
Thanks for your input
-
Thanks Kristen, thats what I was afraid I would do. Other than Fetch is there a way to send Google these URLs in mass? There are over 100 million URLs so Fetch is not scalable. They are picking them up slowly, but at current pace it will take a few months and I would like to find a way to make it purge faster.
-
You could add them to the robots.txt but it you have to remember that Google will only read the first 500kb (source) - as far as I understand with the number of url's you want to block you'll pass this limit.
As Google bot is able to understand basic regex expressions it's probably better to use regex (you will probably be able to block all these url's with a few lines of code.
More info here & on Moz: https://moz.com/blog/interactive-guide-to-robots-txtDirk
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Submitted URL has crawl issue - Submitted URL seems to be a Soft 404 - but all looks fine
Google Search Console is showing some pages up as "Submitted URL has crawl issue" but they look fine to me. I have set them as fixed but after a month they were finally re-crawled and google states the issue persists. Examples are: https://www.rscpp.co.uk/counselling/175809/psychology-alcester-lanes-end.html
Technical SEO | | TommyNewmanCEO
https://www.rscpp.co.uk/browse/location-index/889/index-of-therapy-in-hanger-lane.html
https://www.rscpp.co.uk/counselling/274646/psychology-waltham-forest-sexual-problems.html There's also some "Submitted URL seems to be a Soft 404": https://www.rscpp.co.uk/counselling/112585/counselling-moseley-depression.html I also have more which are "pending", but again I couldn't see a problem with them in the first place. I'm at a bit of a loss as to what to do next. Any advice? Thanks in advance.0 -
Robots.txt on subdomains
Hi guys! I keep reading conflicting information on this and it's left me a little unsure. Am I right in thinking that a website with a subdomain of shop.sitetitle.com will share the same robots.txt file as the root domain?
Technical SEO | | Whittie0 -
Which URL would you choose?
1 – www.company.com/subfolder/subfolder/keyword-keyword-product (I’m able to keyword match with this url) or 2. www.company.com/subfolder/subfolder/product (no url keyword match) What would you choose? A url which is "short" but still relevant, or, a url which is more descriptive allowing “keyword” match? Be great to get your feedback guys. Many thanks Gary
Technical SEO | | GaryVictory0 -
Meta-robots Nofollow
I don't understand Meta-robots Nofollow. Wordpress has my homepage set to this according to SEOMoz tool. Is this really bad?
Technical SEO | | hopkinspat1 -
Overly Dynamic URLs
I have a site that I use to time fitness events and I like to post the results using query strings. I create a link to each event's results/gallery/etc. I don't need these pages crawled and I don't want them to hurt my seo. Can I put a "do not crawl" meta on them or will that hurt my overall positioning? What are my other options?
Technical SEO | | bobbabuoy0 -
Penalty for many domains pointing to the same URL?
I've searched around on the Google forums, and other sources (including the Q&A!), but haven't seen a solid answer on this one. I've recently discovered that throughout the years we've had several hundred domains pointed to our homepage. These are our domains and are related to our niche. I believe they were pointed for the purposes of attracting type-in traffic. Before last month I knew at least some existed, but I didn't realize the extent until last week. I know there isn't any positive SEO effect to doing this (except perhaps if any of the domains have links to them, and a few do), but is there any negative SEO effect? I realize that there are legitimate redirects for type-in traffic, like misspellings and such, but most of these are just exact-match-domains. It just screams unnatural to me, but perhaps I'm just a little paranoid. 🙂
Technical SEO | | tncomseo0 -
Duplicate pages, overly dynamic URL’s and long URL’s in Magento
Hi there, I’ve just completed the first crawl of my Magento site and SEOMOZ has picked up 1,000’s of duplicate pages, overly dynamic URL’s and long URL’s due to the sort function which appends URL’s with variables when sorting products (e.g. www.example.com?dir=asc&order=duration). I’m not particularly concerned that this will affect our rankings as Google has stated that they are familiar with the structure of popular CMS’s and Magento is pretty popular. However it completely dominates my crawl diagnostics so I can’t see if there are any real underlying issues. Does anyone know a way of preventing this? Cheers,
Technical SEO | | WendyWuTours
Al.1 -
Client accidently blocked entire site with robots.txt for a week
Our client was having a design firm do some website development work for them. The work was done on a staging server that was blocked with a robots.txt to prevent duplicate content issues. Unfortunately, when the design firm made the changes live, they also moved over the robots.txt file, which blocked the good, live site from search for a full week. We saw the error (!) as soon as the latest crawl report came in. The error has been corrected, but... Does anyone have any experience with a snafu like this? Any idea how long it will take for the damage to be reversed and the site to get back in the good graces of the search engines? Are there any steps we should take in the meantime that would help to rectify the situation more quickly? Thanks for all of your help.
Technical SEO | | pixelpointpress0