New CMS system - 100,000 old URLs - use robots.txt to block?
-
Hello.
My website has recently switched to a new CMS system.
Over the last 10 years or so, we've used 3 different CMS systems on our current domain. As expected, this has resulted in lots of URLs.
Up until this most recent iteration, we were unable to 301 redirect or use any page-level indexation techniques like rel="canonical".
Using SEOmoz's tools and GWMT, I've been able to locate and redirect all pertinent, PageRank-bearing "older" URLs to their new counterparts. However, according to Google Webmaster Tools' 'Not Found' report, there are literally over 100,000 additional URLs out there it's trying to find.
My question is: is there an advantage to using robots.txt to stop search engines from looking for some of these older directories? Currently, we allow everything - only using page-level robots meta tags to disallow where necessary.
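For context, what I had in mind is something along these lines - the directory names below are just placeholders, not our real structure:
User-agent: *
# Legacy sections from the two previous CMS installs (placeholder paths)
Disallow: /old-cms-one/
Disallow: /old-cms-two/
# Wildcard patterns like this are honoured by Google and Bing, though not by every crawler
Disallow: /*.asp$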
Thanks!
-
Great stuff. Thanks again for your advice - much appreciated!
-
It can be really tough to gauge the impact - it depends on how suddenly the 404s popped up, how many you're seeing (webmaster tools, for Google and Bing, is probably the best place to check) and how that number compares to your overall index. In most cases, it's a temporary problem and the engines will sort it out and de-index the 404'ed pages.
I'd just make sure that all of these 404s are intentional and none are valuable pages or occurring because of issues with the new CMS itself. It's easy to overlook something when you're talking about 100K pages, and it could be more than just a big chunk of 404s.
-
Thanks for the advice! The previous website did have a robots.txt file with a few wildcards declared. A lot of the URLs I'm seeing are NOT indexed anymore and haven't been for many years.
So, I think the 'stop the bleeding' method will work, and I'll just have to proceed with investigating and applying 301s as necessary.
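In case it helps anyone else reading this, the 301s themselves can be very simple rules. This is only a rough sketch for an Apache-style setup (your server may differ), and the paths are made-up placeholders:
# .htaccess sketch - assumes Apache with mod_alias and mod_rewrite available
# One-to-one redirect for a known, link-bearing legacy page (placeholder paths)
Redirect 301 /old-cms/products/widget.asp /products/widget/
# Pattern redirect for a whole legacy section (placeholder paths)
RewriteEngine On
RewriteRule ^old-cms/news/(.*)$ /news/$1 [R=301,L]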
Any idea what kind of an impact this is having on our rankings? I submitted a valid sitemap, crawl paths are good, and major 301s are in place. We've been hit particularly hard in Bing.
Thanks!
-
I've honestly had mixed luck with using Robots.txt to block pages that have already been indexed. It tends to be unreliable at a large scale (good for prevention, poor for cures). I endorsed @Optimize, though, because if Robots.txt is your only option, it can help "stop the bleeding". Sometimes, you use the best you have.
It's a bit trickier with 404s ("Not Found"). Technically, there's nothing wrong with having 404s (it's a perfectly valid signal for search engines), but if you create 100,000 of them all at once, that can sometimes raise red flags with Google. Some kind of mass removal may prevent problems caused by Google crawling thousands of Not Founds all at once.
If these pages are isolated in a folder, then you can use Google Webmaster Tools to remove the entire folder (after you block it). This is MUCH faster than Robots.txt alone, but you need to make sure everything in the folder can be dumped out of the index.
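As a rough sketch (the folder name here is just an example), the block itself is one line:
User-agent: *
Disallow: /old-cms-folder/
Once that's in place, request removal of that same directory with the URL removal tool in Webmaster Tools, choosing to remove the whole folder rather than individual URLs.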
-
Absolutely. Not Founds and pages with no content are a concern, and cleaning them up will help your rankings.
-
Thanks a lot! I should have been a little more specific. My exact question would be: if I move the crawlers' attention away from these 'Not Found' pages, will that benefit the indexation of the now-valid pages? Are the 'Not Founds' really a concern? Will this help my indexation and/or ranking?
Thanks!
-
It's a loaded question without knowing exactly what you are doing, but let me offer this advice: stop the bleeding with robots.txt. It's the easiest way to quickly resolve that many "not founds".
Then you can slowly pick away at the issue and figure out whether some of the "not founds" really do have content and are simply pointing to the wrong place.
On a recent project we had over 200,000 additional URLs coming up "not found". We stopped the bleeding, then slowly, over the course of a month and a couple of hours a week, found another 5,000 pages of content that we redirected correctly and removed from the robots.txt block.
Good luck.