Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Is there a limit to how many URLs you can put in a robots.txt file?
-
We have a site that has way too many URLs caused by our crawlable faceted navigation. We are trying to purge 90% of our URLs from the indexes. We put noindex tags on the URL combinations that we do not want indexed anymore, but it is taking Google way too long to find the noindex tags. Meanwhile we are getting hit with excessive URL warnings and have been hit by Panda.
Would it help speed the process of purging urls if we added the urls to the robots.txt file? Could this cause any issues for us? Could it have the opposite effect and block the crawler from finding the urls, but not purge them from the index? The list could be in excess of 100MM urls.
-
Hi Kristen,
I did this recently and it worked. The important part is that you need to block the pages in robots.txt or add a noindex tag to the pages to stop them from being indexed again.
I hope this helps.
-
Hi all, Google Webmaster Tools has a great tool for this. Go into WMT and select "Google index", then "Remove URLs". You can use regex to remove a large batch of URLs, then block them in robots.txt to make sure they stay out of the index.
I hope this helps.
-
Great, thanks for the input. Per Kristen's post, I am worried that it could just block the URLs altogether and they will never get purged from the index.
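To make that worry concrete: a crawler that obeys robots.txt never downloads a disallowed page, so it never gets to read a noindex tag in that page's HTML. Below is a minimal sketch using Python's standard `urllib.robotparser`; the `/shop/filter/` path, the `example.com` URLs, and the `crawler_may_fetch` helper are made up for illustration and are not from the site in question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a faceted-navigation directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /shop/filter/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawler_may_fetch(url: str, agent: str = "Googlebot") -> bool:
    """True if a robots.txt-compliant crawler is allowed to download the URL."""
    return parser.can_fetch(agent, url)


# A blocked page is never downloaded, so any noindex tag on it goes unread.
for url in (
    "https://example.com/shop/filter/color-red",
    "https://example.com/shop/shoes",
):
    status = "fetchable" if crawler_may_fetch(url) else "blocked (noindex tag never seen)"
    print(url, "->", status)
```

Note that `urllib.robotparser` only does plain prefix matching, which is enough to show the effect here.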
-
Yes, we have done that and are seeing traction on those urls, but we can't get rid of these old urls as fast as we would like.
Thanks for your input
-
Thanks Kristen, that's what I was afraid I would do. Other than Fetch, is there a way to send Google these URLs en masse? There are over 100 million URLs, so Fetch is not scalable. They are picking them up slowly, but at the current pace it will take a few months and I would like to find a way to make it purge faster.
-
You could add them to the robots.txt file, but you have to remember that Google will only read the first 500 KB (source). As far as I understand, with the number of URLs you want to block you'll pass this limit.
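A rough back-of-the-envelope check of that limit, as a sketch only: the 60-byte average path length is an assumed figure rather than a measurement, and the 500 KB limit is taken as 500 × 1024 bytes.

```python
# Rough size estimate for listing every URL as its own Disallow line.
# The average path length is an assumption for illustration only.
URL_COUNT = 100_000_000      # "in excess of 100MM urls" from the question
AVG_PATH_BYTES = 60          # hypothetical average path length
LIMIT_BYTES = 500 * 1024     # the 500 KB limit, taken here as 500 KiB


def bytes_per_rule(avg_path_bytes: int) -> int:
    """Bytes used by one 'Disallow: <path>' line, including the newline."""
    return len("Disallow: ") + avg_path_bytes + 1


def robots_txt_size(url_count: int, avg_path_bytes: int) -> int:
    """Total bytes needed if every URL gets its own Disallow line."""
    return url_count * bytes_per_rule(avg_path_bytes)


def max_rules_within_limit(avg_path_bytes: int, limit_bytes: int = LIMIT_BYTES) -> int:
    """How many one-URL rules fit before the limit is reached."""
    return limit_bytes // bytes_per_rule(avg_path_bytes)


size = robots_txt_size(URL_COUNT, AVG_PATH_BYTES)
print(f"Needed: {size / 1024**3:.1f} GiB, limit: {LIMIT_BYTES / 1024:.0f} KiB")
print(f"Rules that fit under the limit: {max_rules_within_limit(AVG_PATH_BYTES):,}")
```

Under these assumptions only a few thousand one-URL rules fit, which is several orders of magnitude short of 100 million.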
As Googlebot understands basic pattern matching in robots.txt (the * and $ wildcards, not full regular expressions), it's probably better to use patterns: you will probably be able to block all these URLs with a few lines.
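As a sketch of how a few wildcard rules could cover a whole family of faceted URLs, the snippet below mimics Google-style matching, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The rules and paths are hypothetical, and the matcher is a simplification: it ignores Allow rules and rule precedence, so it is not Googlebot's actual implementation.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    end of the URL; otherwise the rule is a prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_rules: list) -> bool:
    """True if any Disallow pattern matches the path from its start."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)


# Hypothetical rules for faceted-navigation parameters.
RULES = ["/*?color=", "/*&size=", "/*/filter/"]

for path in ("/shoes?color=red", "/shoes?page=2&size=9", "/catalog/filter/red", "/shoes"):
    print(path, "->", "blocked" if is_blocked(path, RULES) else "allowed")
```

Three rules here stand in for every URL that carries those parameters, which is how the file stays far below the size limit.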
More info here & on Moz: https://moz.com/blog/interactive-guide-to-robots-txt
Dirk