Is there a limit to how many URLs you can put in a robots.txt file?

kcb8178

We have a site with far too many URLs, caused by our crawlable faceted navigation. We are trying to purge 90% of our URLs from the indexes. We put noindex tags on the URL combinations that we do not want indexed anymore, but it is taking Google way too long to find the noindex tags. Meanwhile, we are getting hit with excessive URL warnings and have been hit by Panda.

Would it help speed up the process of purging URLs if we added the URLs to the robots.txt file? Could this cause any issues for us? Could it have the opposite effect and block the crawler from finding the URLs, but not purge them from the index? The list could be in excess of 100 million URLs.

CraigBradford

Hi Kristen,

I did this recently and it worked. The important part is that you need to block the pages in robots.txt or add a noindex tag to the pages to stop them from being indexed again.

I hope this helps.

CraigBradford

Hi all, Google Webmaster Tools has a great tool for this. Go into WMT and select "Google Index", then "Remove URLs". You can use regex to remove a large batch of URLs, then block them in robots.txt to make sure they stay out of the index.

I hope this helps.

kcb8178

Great, thanks for the input. Per Kristen's post, I am worried that it could just block the URLs altogether and they will never get purged from the index.

kcb8178

Yes, we have done that and are seeing traction on those URLs, but we can't get rid of these old URLs as fast as we would like.

Thanks for your input

kcb8178

Thanks Kristen, that's what I was afraid I would do. Other than Fetch, is there a way to send Google these URLs en masse? There are over 100 million URLs, so Fetch is not scalable. They are picking them up slowly, but at the current pace it will take a few months, and I would like to find a way to make it purge faster.

DirkC

You could add them to the robots.txt, but you have to remember that Google will only read the first 500 KB (source). As far as I understand, with the number of URLs you want to block you'll pass this limit.
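To put rough numbers on that limit, here is a back-of-the-envelope sketch in Python. The average path length of 60 bytes is purely an assumption for illustration, and the 500 KB figure is taken as 500 × 1024 bytes; adjust both to your own site.

```python
# Rough estimate of how large a robots.txt would be if it listed every URL
# on its own "Disallow:" line, compared with the 500 KB read limit.

LIMIT_BYTES = 500 * 1024    # 500 KB limit, taken here as 500 * 1024 bytes
URL_COUNT = 100_000_000     # "in excess of 100 million URLs" from the question
AVG_PATH_LEN = 60           # ASSUMPTION: average URL path length in bytes


def robots_size(url_count, avg_path_len):
    """Bytes needed for one 'Disallow: <path>' line per URL (plus newline)."""
    line_len = len("Disallow: ") + avg_path_len + 1
    return url_count * line_len


def max_rules(limit_bytes, avg_path_len):
    """How many such lines fit inside the limit."""
    line_len = len("Disallow: ") + avg_path_len + 1
    return limit_bytes // line_len


if __name__ == "__main__":
    size = robots_size(URL_COUNT, AVG_PATH_LEN)
    print(f"Full list: about {size / 1024**3:.1f} GiB")
    print(f"Lines that fit in the limit: {max_rules(LIMIT_BYTES, AVG_PATH_LEN)}")
```

Under these assumptions, only a few thousand individual rules fit, so a per-URL list of this size is several orders of magnitude over the limit.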

As Googlebot understands basic pattern matching (the * and $ wildcards, not full regex), it's probably better to use patterns (you will probably be able to block all these URLs with a few lines of code).
More info here & on Moz: https://moz.com/blog/interactive-guide-to-robots-txt
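
As a way to sanity-check pattern rules before deploying them, here is a minimal Python sketch of wildcard matching, where * matches any run of characters and a trailing $ anchors the end of the URL. The rules and paths below are hypothetical examples, not taken from the site in question, and the sketch ignores Allow rules and rule precedence, so treat it as an approximation of how a crawler evaluates Disallow patterns.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt-style path pattern into a compiled regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    end of the URL. Without '$', the rule is a prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(path, rules):
    """True if any Disallow pattern matches the path from its start."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


# HYPOTHETICAL rules for a faceted navigation that uses query parameters.
FACET_RULES = ["/*?color=", "/*&size=", "/*.pdf$"]

if __name__ == "__main__":
    for path in ["/shoes", "/shoes?color=red", "/shoes?brand=x&size=9",
                 "/docs/guide.pdf", "/docs/guide.pdf?download=1"]:
        print(path, "->", "blocked" if is_blocked(path, FACET_RULES) else "allowed")
```

A handful of rules like these can cover millions of facet combinations, which keeps the file far below the size limit.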

Dirk
