New CMS - 100,000 old URLs - use robots.txt to block?
-
Hello.
My website recently switched to a new CMS.
Over the last 10 years or so, we've used 3 different CMS platforms on our current domain, and as expected that has left us with a lot of legacy URLs.
Until this most recent iteration, we were unable to 301 redirect or use any page-level indexation controls such as rel="canonical".
Using SEOmoz's tools and GWMT, I've been able to locate and redirect all of the pertinent, PageRank-bearing "older" URLs to their new counterparts. However, according to Google Webmaster Tools' 'Not Found' report, there are literally over 100,000 additional URLs out there that it's still trying to find.
My question is: is there an advantage to using robots.txt to stop search engines from looking for some of these older directories? Currently we allow everything, only using page-level robots tags to disallow where necessary.
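For illustration, the kind of rule I have in mind would look something like this (the directory names are just placeholders, not our real paths):

    User-agent: *
    Disallow: /old-cms-2004/
    Disallow: /old-cms-2008/

as opposed to the page-level tags we rely on today, e.g. <meta name="robots" content="noindex, follow">.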
Thanks!
-
Great stuff - thanks again for your advice, much appreciated!
-
It can be really tough to gauge the impact - it depends on how suddenly the 404s popped up, how many you're seeing (Webmaster Tools, for both Google and Bing, is probably the best place to check), and how that number compares to your overall index. In most cases it's a temporary problem, and the engines will sort it out and de-index the 404'ed pages.
I'd just make sure that all of these 404s are intentional and none are valuable pages or occurring because of issues with the new CMS itself. It's easy to overlook something when you're talking about 100K pages, and it could be more than just a big chunk of 404s.
-
Thanks for the advice! The previous website did have a robots.txt file with a few wildcards declared. A lot of the URLs I'm seeing are NOT indexed anymore and haven't been for many years.
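(For context, the old file used Googlebot-style wildcard patterns - not our exact rules, but roughly along these lines:

    User-agent: *
    Disallow: /*?sessionid=
    Disallow: /*.asp$

where * matches any string of characters and $ anchors the end of the URL.)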
So, I think the 'stop the bleeding' method will work, and I'll just have to proceed with investigating and applying 301s as necessary.
Any idea what kind of impact this is having on our rankings? I submitted a valid sitemap, crawl paths are good, and the major 301s are in place. We've been hit particularly hard in Bing.
Thanks!
-
I've honestly had mixed luck with using robots.txt to block pages that have already been indexed. It tends to be unreliable at a large scale (good for prevention, poor for cures). I endorsed @Optimize, though, because if robots.txt is your only option, it can help "stop the bleeding". Sometimes you use the best option you have.
It's a bit trickier with 404s ("Not Found"). Technically there's nothing wrong with having 404s (it's a perfectly valid signal), but if you create 100,000 of them all at once, that can sometimes raise red flags with Google. Some kind of mass removal may prevent problems from Google crawling thousands of Not Founds all at once.
If these pages are isolated in a folder, you can use Google Webmaster Tools to remove the entire folder (after you block it). This is MUCH faster than robots.txt alone, but you need to make sure everything in the folder can be dumped out of the index.
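As a rough sketch (the folder name here is hypothetical), the block that needs to be in place before you request the directory removal is just:

    User-agent: *
    Disallow: /legacy-folder/

Once that rule is live, a single directory-removal request in Webmaster Tools can cover everything under /legacy-folder/ - just be certain nothing in there should stay indexed.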
-
Absolutely. 'Not Found' errors and pages with no content are a concern, and cleaning them up will help your rankings.
-
Thanks a lot! I should have been a little more specific. My exact question is: if I move the crawlers' attention away from these 'Not Found' pages, will that benefit the indexation of the now-valid pages? Are the 'Not Founds' really a concern? Will this help my indexation and/or rankings?
Thanks!
-
Loaded question without knowing exactly what you are doing, but let me offer this advice: stop the bleeding with robots.txt. This is the easiest way to quickly resolve that many 'Not Found' errors.
Then you can slowly pick away at the issue and figure out whether some of the 'Not Founds' really do have content and are simply being sent to the wrong place.
On a recent project we had over 200,000 additional URLs reported as 'Not Found'. We stopped the bleeding, then slowly, over the course of a month, spending a couple of hours a week, we found another 5,000 pages of real content that we redirected correctly and removed from robots.txt.
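For illustration only (assuming Apache, with invented paths), each of those redirects is a one-liner in .htaccess along the lines of:

    Redirect 301 /old-cms/widget-page.html http://www.example.com/widgets/
    Redirect 301 /archive/pricing.asp http://www.example.com/pricing/

Once a section's redirects are in place, drop its Disallow line from robots.txt so the engines can actually see the 301s.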
Good luck.