Is there a limit to how many URLs you can put in a robots.txt file?
-
We have a site with far too many URLs generated by our crawlable faceted navigation, and we are trying to purge 90% of those URLs from the indexes. We put noindex tags on the URL combinations that we do not want indexed anymore, but it is taking Google far too long to find the noindex tags. Meanwhile we are getting hit with excessive-URL warnings and have been hit by Panda.
Would it help speed up the process of purging URLs if we added them to the robots.txt file? Could this cause any issues for us? Could it have the opposite effect and block the crawler from finding the URLs, but never purge them from the index? The list could be in excess of 100MM URLs.
-
Hi Kristen,
I did this recently and it worked. The important part is that you need to block the pages in robots.txt or add a noindex tag to the pages to stop them from being indexed again.
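For reference, this is what the tag itself looks like; a minimal sketch, with the caveat (raised later in this thread) that Google can only see it on pages that are *not* blocked in robots.txt:

```html
<!-- In the <head> of each faceted page that should drop out of the index.
     Note: if the URL is blocked in robots.txt, Googlebot cannot recrawl the
     page and will never see this tag. -->
<meta name="robots" content="noindex, follow">
```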
I hope this helps.
-
Hi all, Google Webmaster Tools has a great tool for this. If you go into WMT and select "Google Index" and then "Remove URLs", you can use regex to remove a large batch of URLs, then block them in robots.txt to make sure they stay out of the index.
I hope this helps.
-
Great, thanks for the input. Per Kristen's post, I am worried that it could just block the URLs altogether and they would never get purged from the index.
-
Yes, we have done that and are seeing traction on those URLs, but we can't get rid of these old URLs as fast as we would like.
Thanks for your input
-
Thanks Kristen, that's what I was afraid would happen. Other than Fetch, is there a way to send Google these URLs en masse? There are over 100 million URLs, so Fetch is not scalable. Google is picking them up slowly, but at the current pace it will take a few months, and I would like to find a way to purge them faster.
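For scale, a quick back-of-the-envelope check shows why listing each URL individually in robots.txt can't work at this size (the ~60-byte average line length is an assumption; Google reads roughly the first 500 KB of the file):

```python
# Back-of-the-envelope: can 100MM individual Disallow lines fit in robots.txt?
AVG_LINE_BYTES = 60                 # assumed average for "Disallow: /path?facet=...\n"
NUM_URLS = 100_000_000
GOOGLE_LIMIT_BYTES = 500 * 1024     # Google reads roughly the first 500 KB

total_bytes = NUM_URLS * AVG_LINE_BYTES
print(f"File size: ~{total_bytes / 1_000_000_000:.0f} GB")              # ~6 GB
print(f"Over Google's limit by: {total_bytes // GOOGLE_LIMIT_BYTES:,}x")
```

Which is why pattern-based rules, rather than per-URL rules, are the only realistic way to cover a list this large.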
-
You could add them to the robots.txt, but you have to remember that Google will only read the first 500 KB (source); as far as I understand, with the number of URLs you want to block, you'll pass this limit.
As Googlebot understands basic wildcard patterns (* and $) in robots.txt, it's probably better to use those; you will probably be able to block all of these URLs with a few lines.
More info here & on Moz: https://moz.com/blog/interactive-guide-to-robots-txt
Dirk
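As a sketch of that wildcard approach (the facet parameter names here are hypothetical; note robots.txt supports only the * and $ patterns, not full regex):

```
User-agent: *
# Block any URL containing these facet parameters, wherever they appear
Disallow: /*?color=
Disallow: /*&color=
Disallow: /*?size=
Disallow: /*&size=
# Or more aggressively: block any URL carrying two or more query parameters
Disallow: /*?*&
```

A handful of lines like these can cover millions of faceted combinations while staying far under the 500 KB limit.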
Related Questions
-
Robots.txt in subfolders and hreflang issues
Technical SEO | lauralou82
A client recently rolled out their UK business to the US. They decided to deploy with 2 WordPress installations:
UK site - https://www.clientname.com/uk/ - robots.txt location: https://www.clientname.com/uk/robots.txt
US site - https://www.clientname.com/us/ - robots.txt location: https://www.clientname.com/us/robots.txt
We've had various issues with /us/ pages being indexed in Google UK, and /uk/ pages being indexed in Google US. They have hreflang tags across all pages. We changed the x-default page to .com 2 weeks ago (we've tried both /uk/ and /us/ previously). Search Console says there are no hreflang tags at all. Additionally, we have a robots.txt file on each site which links to the corresponding sitemap files, but when viewing the robots.txt tester in Search Console, each property shows the robots.txt file for https://www.clientname.com only, even though when you actually navigate to this URL (https://www.clientname.com/robots.txt) you'll get redirected to either https://www.clientname.com/uk/robots.txt or https://www.clientname.com/us/robots.txt depending on your location. Any suggestions how we can remove UK listings from Google US and vice versa?
-
Why put rel=canonical pointing to the same URL?
Technical SEO | Tintanus
Hi all. I've heard that it's good to put the rel=canonical link in your header even when there is no other important or preferred version of that URL. If you take a look at moz.com and view the code, you'll see that they put <link rel="canonical" href="http://moz.com" /> pointing at the same URL! But if you go to http://moz.com/products/pricing, for example, they have no canonical there. Why? Thanks in advance!
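For reference, a self-referencing canonical is simply the page declaring its own URL as the preferred version, which guards against query-string and tracking-parameter duplicates; a sketch (URL hypothetical):

```html
<!-- On http://moz.com/products/pricing, a self-referencing canonical would be: -->
<link rel="canonical" href="http://moz.com/products/pricing" />
```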
-
Googlebot size limit
Technical SEO | SSFCU
Hi there, there is about 2.8 KB of JavaScript above the content of our homepage. I know it isn't desirable, but is this something I need to be concerned about? Thanks, Sarah
Update: It's fine. Ran a Fetch as Google and it's rendering as it should be. I would delete my question if I could figure out how!
-
When to file a Reconsideration Request
Technical SEO | KerryK
Hi all, I don't have any manual penalties from Google, but I did get an unnatural-links message from them back in 2012. We have removed some of the spammy links over the last 2 years, but we're now making a further effort and will use the disavow tool once we've done this. Will that be enough once I submit the file, or should I / can I submit a Reconsideration Request as well? Do I have to have a manual penalty item in my webmaster account to be able to submit a request? Thanks everyone!
-
BEST WordPress Robots.txt Sitemap Practice??
Technical SEO | joony2008
Alright, my question comes directly from this article by SEOmoz: http://www.seomoz.org/learn-seo/robotstxt. Yes, I have submitted the sitemap to Google's and Bing's webmaster tools, and I want to add the location of our site's sitemaps. Does that mean I erase everything in the robots.txt right now and replace it with:
User-agent: *
Disallow:
Sitemap: http://www.example.com/none-standard-location/sitemap.xml
??? Because WordPress comes with some default disallows like wp-admin, trackback, and plugins. I have also read other questions, but was wondering if this is the correct way to add a sitemap to a WordPress robots.txt:
http://www.seomoz.org/q/robots-txt-question-2
http://www.seomoz.org/q/quick-robots-txt-check
http://www.seomoz.org/q/xml-sitemap-instruction-in-robots-txt-worth-doing
I am using Multisite with the Yoast plugin, so I have more than one sitemap.xml to submit. Do I erase everything in robots.txt and replace it with what SEOmoz recommended? Hmm, that sounds not right. My current file is:
User-agent: *
Disallow:
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-login.php
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
ERASE EVERYTHING??? And change it to:
User-agent: *
Disallow:
Sitemap: http://www.example.com/sitemap_index.xml
Sitemap: http://www.example.com/sub/sitemap_index.xml
?????????
-
URL paths and keywords
Technical SEO | mikescotty
I'm recommending some on-page optimization for a home builder building in several new home communities. The site has been through some changes in the past few months and we're almost starting over. The current URL structure is http://homebuilder.com/oakwood/features, where:
homebuilder = builder name
Oakwood Estates = name of community
features = one of several sub-paths including site plan, elevations, floor plans, etc.
The most attainable keyword phrases include the words 'home' and 'townname'. I want to change the URL path to http://homebuilder.com/oakwood-estates-townname-homes/features. Is there any problem with doing this? It just seems to make a lot of sense. Any input would be appreciated.
-
File from godaddy.com
Technical SEO | seoug_2005
Hi, one of our clients has received a file from godaddy.com, where his site is hosted. Here is the message from the client: "I submitted my site for Search Engine Visibility, but they found some issues on the site that need to be fixed. I tried myself but could not fix it." The site in question is http://allkindofessays.com/. Is there any problem with the site? (The attached file is a binary plist web archive containing a third-party tracking-pixel response, not readable text, and is omitted here.)
-
Robots.txt
Technical SEO | PeterM22
Hi everyone, I just want to check something. If you have this entered in your robots.txt file:
User-agent: *
Disallow: /fred/
This wouldn't block /fred-review/ from being crawled, would it? Thanks
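A quick way to sanity-check this is with Python's standard urllib.robotparser (example domain and paths are hypothetical): `Disallow: /fred/` only matches paths that begin with `/fred/`, so `/fred-review/` stays crawlable.

```python
from urllib import robotparser

# Load the exact rules from the question and test both paths
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /fred/"])

print(rp.can_fetch("*", "https://example.com/fred/page.html"))  # False: blocked
print(rp.can_fetch("*", "https://example.com/fred-review/"))    # True: crawlable
```

To block both, you would need a second rule (`Disallow: /fred-review/`) or a broader prefix such as `Disallow: /fred`, which matches any path starting with `/fred`.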