New CMS - 100,000 old URLs - use robots.txt to block?
-
Hello.
My website has recently switched to a new CMS.
Over the last 10 years or so, we've used three different CMS platforms on our current domain. As expected, this has left us with a lot of old URLs.
Up until this most recent iteration, we were unable to 301 redirect or use any page-level indexation techniques like rel='canonical'.
Using SEOmoz's tools and Google Webmaster Tools, I've been able to locate and redirect all of the pertinent, PageRank-bearing "older" URLs to their new counterparts. However, according to the Google Webmaster Tools 'Not Found' report, there are literally over 100,000 additional URLs out there that it's still trying to find.
My question is: is there an advantage to using robots.txt to stop search engines from looking for some of these older directories? Currently we allow everything, and only use page-level robots meta tags to disallow where necessary.
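For context, what I have in mind is something like the following sketch. The directory names are just placeholders for whatever the old platforms actually used, and the last line relies on Google's wildcard support:

    # Hypothetical robots.txt - /old-cms/ and /archive/ stand in for the
    # directories the previous CMS platforms actually used
    User-agent: *
    Disallow: /old-cms/
    Disallow: /archive/
    # Google also supports * wildcards, e.g. for old query-string URLs:
    Disallow: /*?id=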
Thanks!
-
Great stuff. Thanks again for your advice; much appreciated!
-
It can be really tough to gauge the impact - it depends on how suddenly the 404s popped up, how many you're seeing (webmaster tools, for Google and Bing, is probably the best place to check) and how that number compares to your overall index. In most cases, it's a temporary problem and the engines will sort it out and de-index the 404'ed pages.
I'd just make sure that all of these 404s are intentional and none are valuable pages or occurring because of issues with the new CMS itself. It's easy to overlook something when you're talking about 100K pages, and it could be more than just a big chunk of 404s.
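One quick way to do that sanity check is to export the 'Not Found' list from Webmaster Tools and spot-check a random sample to confirm the URLs genuinely 404. A minimal sketch, assuming the export has been saved as a one-URL-per-line file called not_found_urls.txt (the file name is just a placeholder):

    import random
    import urllib.error
    import urllib.request

    # Load the exported 'Not Found' URLs (file name is a placeholder).
    with open("not_found_urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    # Spot-check a random sample rather than hammering all 100K URLs.
    for url in random.sample(urls, min(50, len(urls))):
        try:
            # Note: urlopen follows redirects, so a URL that now 301s
            # will report the destination's status rather than a 404.
            resp = urllib.request.urlopen(url, timeout=10)
            print(resp.status, url)   # 200 = the "missing" page actually resolves
        except urllib.error.HTTPError as e:
            print(e.code, url)        # 404/410 is expected for retired pages
        except urllib.error.URLError as e:
            print("ERR", url, e.reason)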
-
Thanks for the advice! The previous website did have a robots.txt file with a few wildcards declared. A lot of the URLs I'm seeing are NOT indexed anymore and haven't been for many years.
So, I think the 'stop the bleeding' method will work, and I'll just have to proceed with investigating and applying 301s as necessary.
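If the remaining 301s follow a pattern, a couple of rules should cover a whole legacy directory at once. Roughly, assuming an Apache server (the /old-cms/ paths below are placeholders for our actual structure, and other servers have equivalents):

    # Hypothetical .htaccess rules using mod_alias
    # Preserve the trailing path where the old and new structures map 1:1:
    RedirectMatch 301 ^/old-cms/products/(.*)$ /products/$1
    # Send everything under a retired directory to a single landing page:
    RedirectMatch 301 ^/old-cms/press-releases/ /news/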
Any idea what kind of an impact this is having on our rankings? I submitted a valid sitemap, crawl paths are good, and major 301s are in place. We've been hit particularly hard in Bing.
Thanks!
-
I've honestly had mixed luck with using Robots.txt to block pages that have already been indexed. It tends to be unreliable at a large scale (good for prevention, poor for cures). I endorsed @Optimize, though, because if Robots.txt is your only option, it can help "stop the bleeding". Sometimes, you use the best you have.
It's a bit trickier with 404s ("Not Found"). Technically, there's nothing wrong with having 404s (a 404 is a perfectly valid signal for SEO), but if you create 100,000 of them all at once, that can sometimes raise red flags with Google. Some kind of mass removal may head off problems from Google crawling thousands of Not Founds all at once.
If these pages are isolated in a folder, then you can use Google Webmaster Tools to remove the entire folder (after you block it). This is MUCH faster than Robots.txt alone, but you need to make sure everything in the folder can be dumped out of the index.
-
Absolutely. Not Founds and no-content pages are a concern, and cleaning them up will help your rankings.
-
Thanks a lot! I should have been a little more specific, but my exact question is: if I move the crawlers' attention away from these 'Not Found' pages, will that benefit the indexation of the now-valid pages? Are the 'Not Founds' really a concern? Will this help my indexation and/or rankings?
Thanks!
-
It's a loaded question without knowing exactly what you are doing, but let me offer this advice: stop the bleeding with robots.txt. This is the easiest way to quickly resolve that many "Not Found" errors.
Then you can slowly pick away at the issue and figure out whether some of the "Not Founds" really do have content and are simply sending visitors to the wrong place.
On a recent project we had over 200,000 additional URLs reported as "Not Found". We stopped the bleeding, then slowly, over the course of a month and spending a couple of hours a week, we found another 5,000 pages of content that we redirected correctly and removed from the robots.txt block.
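If it helps, one rough way to pick away at a list that size is to match the old slugs against the new sitemap and flag likely redirect candidates for manual review. A quick sketch; the export file name and sitemap URL are just stand-ins for the real ones:

    import urllib.request
    import xml.etree.ElementTree as ET
    from urllib.parse import urlparse

    SITEMAP_URL = "http://www.example.com/sitemap.xml"  # placeholder

    def slug(url):
        # Last non-empty path segment, minus any file extension.
        path = urlparse(url).path.rstrip("/")
        return path.rsplit("/", 1)[-1].rsplit(".", 1)[0].lower()

    # Index every URL in the new sitemap by its slug.
    tree = ET.parse(urllib.request.urlopen(SITEMAP_URL))
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    new_by_slug = {slug(loc.text.strip()): loc.text.strip()
                   for loc in tree.findall(".//sm:loc", ns)}

    # Old 404 URLs whose slug still exists on the new site are the
    # likeliest candidates for a 301 rather than a robots.txt block.
    with open("not_found_urls.txt") as f:      # placeholder file name
        for line in f:
            old = line.strip()
            match = new_by_slug.get(slug(old)) if old else None
            if match:
                print(old, "->", match)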
Good luck.