Old pages STILL indexed...
-
Our new website has been live for around 3 months and the URL structure has completely changed. We weren't able to dynamically create 301 redirects for over 5,000 of our products because of how different the URLs were, so we've been redirecting them as and when.
3 months on and we're still getting hundreds of 404 errors daily in our Webmaster Tools account. I've checked the server logs and it looks like Bingbot still wants to crawl our old /product/ URLs. Also, if I perform a "site:example.co.uk/product" search on Google or Bing, lots of results are still returned, indicating that they both still haven't dropped them from their index.
Should I ignore the 404 errors and continue to wait for them to drop off or should I just block /product/ in my robots.txt? After 3 months I'd have thought they'd have naturally dropped off by now!
I'm half-debating this:
User-agent: *
Disallow: /some-directory-for-all/*

User-agent: Bingbot
User-agent: MSNBot
Disallow: /product/

Sitemap: http://www.example.co.uk/sitemap.xml
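A quick way to sanity-check rules like these before deploying them is to run them through a robots.txt parser. The sketch below uses Python's standard-library `urllib.robotparser`; the old product URL is a made-up example, and the function name `blocked_agents` is just for illustration. It only shows which crawlers would be stopped from fetching a URL. It says nothing about whether the URL stays in the index.

```python
from urllib.robotparser import RobotFileParser

# The rules being considered, written one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /some-directory-for-all/*

User-agent: Bingbot
User-agent: MSNBot
Disallow: /product/

Sitemap: http://www.example.co.uk/sitemap.xml
"""


def blocked_agents(robots_txt, url, agents):
    """Return the user agents from `agents` that may NOT fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical old product URL, used only for illustration.
old_url = "http://www.example.co.uk/product/some-old-item"
print(blocked_agents(ROBOTS_TXT, old_url, ["Bingbot", "MSNBot", "Googlebot"]))
```

With these rules only the Bing crawlers are kept away from /product/; Googlebot falls under the `*` group and can still crawl the old URLs. Note that this parser does simple prefix matching, so its verdict on wildcard rules such as `/some-directory-for-all/*` may differ from what real crawlers do.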
-
Yeah. If you cannot do it dynamically, it gets to be a real pain, and depending on how you set up the 301s, you may end up with an overstuffed .htaccess file that could cause problems.
If these pages were that new and did not have any link equity or rankings to start with, they are probably not worth 301ing.
One tool you may want to consider is URL Profiler (http://urlprofiler.com/). You could take all the old URLs and have URL Profiler pull in Google Analytics data (from when they were live on your site) and also pull in Open Site Explorer (OSE) link data from Moz. You can then filter them and see which pages got traffic and links. Take those select "top pages", make sure they 301 to the correct page in the new URL structure, and go from there. URL Profiler has a free 15-day trial that you could use to get this project done at no charge. After using it, though, you may find it handy enough to buy anyway.
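Once you have the exported data, the filtering step is simple enough to script. Here is a rough sketch in Python, assuming you have already loaded each old URL with its visit count and number of linking domains. The field names, thresholds, sample paths and the `new_paths` mapping are all hypothetical and not anything URL Profiler produces by itself; the output lines use Apache's `Redirect 301` directive from mod_alias.

```python
from dataclasses import dataclass


@dataclass
class OldUrl:
    path: str
    sessions: int         # visits while the page was live (hypothetical field)
    linking_domains: int  # external domains linking to it (hypothetical field)


def pick_redirect_candidates(pages, min_sessions=1, min_links=1):
    """Keep only old URLs that had traffic or external links, best first."""
    keep = [
        p for p in pages
        if p.sessions >= min_sessions or p.linking_domains >= min_links
    ]
    return sorted(keep, key=lambda p: (p.linking_domains, p.sessions), reverse=True)


def redirect_lines(candidates, new_paths):
    """Build one 'Redirect 301' line per candidate that has a known new path."""
    return [
        f"Redirect 301 {p.path} {new_paths[p.path]}"
        for p in candidates
        if p.path in new_paths
    ]


# Made-up sample data, for illustration only.
pages = [
    OldUrl("/product/blue-widget", sessions=120, linking_domains=3),
    OldUrl("/product/red-widget", sessions=0, linking_domains=0),
    OldUrl("/product/green-widget", sessions=0, linking_domains=2),
]
new_paths = {
    "/product/blue-widget": "/widgets/blue",
    "/product/green-widget": "/widgets/green",
}
for line in redirect_lines(pick_redirect_candidates(pages), new_paths):
    print(line)
```

Everything that falls below the thresholds is simply left to 404, which keeps the redirect list short.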
Ideally, you would have dynamically 301ed the old pages to the new ones, as that would have been the simplest method, but in your situation I think you are OK. Google is just trying to make sure you did not "mess up" and 404 those old pages by accident. It wants to give you the benefit of the doubt. It is crazy sometimes how long they keep things in the index.
I am monitoring a site that scraped one of my sites. They shut the entire site down after we threatened legal action. The site has been down for weeks and showing 404s, but I can still do a site: search and see them in the index. Meh.
-
Forgot to add this - just some free advice. You have your CSS inlined in your HTML. Ideally, you want to have that in an external CSS file. That way, once the user loads that external file, they do not have to download it multiple times so the experience is faster on subsequent pages.
If you were testing your page with Google's site speed tool and it flagged render-blocking CSS, and that is why you inlined your CSS, the solution is not to inline all of it, but to inline only what is needed above the fold and put the rest in an external file.
Hope that makes sense.
-
I suppose that's the problem. We've spent hours redirecting hundreds of 404 pages to new/relevant locations, but these pages don't receive organic traffic. It's mostly just Bingbot, MSNBot and Googlebot crawling them because they're still indexed.
I think I'm going to leave them as 404 rather than trying to keep on top of 301 redirecting them and I'll leave it in Google's hands to eventually drop them off!
Thanks!
Liam
-
General rule of thumb: if a page 404s and it is supposed to 404, don't worry about it. The Search Console 404 report does not mean that you are being penalized, although it can be useful for diagnostics. If you block the 404 pages in robots.txt, yes, it will take the 404 errors out of the Search Console report, but then Google never "deals" with those 404s, because it can no longer crawl the URLs to see that they are gone. It can take 3 months (maybe longer) to get things out of Search Console, and I have noticed it taking longer lately. What you need to do first is ask the following questions:
-
Do I still link internally to any of these /product/ URLs? If you do, Google may assume that you are 404ing those pages by mistake and leave them in the report longer: if you are still linking to them internally, they look like pages that are meant to exist.
-
Do any of these old URLs have value? Do they have links from external sites? Did they used to rank for a keyword? If so, you should probably 301 them to a semantically relevant page rather than 404ing them, so that you get some use out of them.
If either of the above applies, Google may continue to remind you of the 404s, as it thinks the pages might be valuable and wants to "help" you out.
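The first question is easy to check with a script. Below is a minimal sketch using Python's built-in `html.parser`: it scans a page's HTML for anchor links whose path starts with the old /product/ prefix. The class and function names are made up for illustration, and you would still need to feed it the HTML of each page on the new site (from a crawl or from your templates).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class OldLinkFinder(HTMLParser):
    """Collects href values of <a> tags whose path starts with old_prefix."""

    def __init__(self, old_prefix="/product/"):
        super().__init__()
        self.old_prefix = old_prefix
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Compare on the path only, so absolute and relative links both match.
            # Links to other sites with the same path prefix will match too.
            if name == "href" and value and urlparse(value).path.startswith(self.old_prefix):
                self.hits.append(value)


def find_old_links(html, old_prefix="/product/"):
    """Return every link in `html` that still points at the old URL structure."""
    finder = OldLinkFinder(old_prefix)
    finder.feed(html)
    return finder.hits


# Made-up page fragment, for illustration only.
page = '<a href="/product/old-item">Old</a> <a href="/widgets/new">New</a>'
print(find_old_links(page))
```

Any page that returns hits is still telling crawlers the old URLs matter, so those links are worth updating before you worry about the report.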
You mention 5,000 URLs that were indexed and that you then 404ed. You cannot assume that Search Console works in real time or that Google checks all 5,000 of these URLs at the same time. Google has a crawl budget for your site that governs how often it will crawl a given page. Some pages get crawled more often (the home page), some less often. Google then has to process those crawls once it gets the data back. What you will see in a situation like this is that if you 404 several thousand pages, several hundred will show up in your Search Console report first, then some more the next day, then some more, and so on. Over time the total will build, may peak, and will then gradually start to fall off. Google has to find the 404s, process them and then show them in the report. You may see 500 of your 404 pages today, but 3 months later there may be 500 other 404 pages in the report and those original 500 are gone. This, in addition to the examples I gave above, is why you might still be seeing 404 errors after 3 months.
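Since Search Console lags behind, your own server logs are a more direct way to watch this happen. The sketch below tallies 404 hits on /product/ URLs per day and per crawler. It assumes the Apache "combined" log format and identifies bots by a plain substring match on the user agent, so treat both as assumptions to adapt to your own setup; the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumes Apache "combined" log format; adjust the pattern if yours differs.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)
BOTS = ("googlebot", "bingbot", "msnbot")


def count_bot_404s(lines, prefix="/product/"):
    """Count 404 responses under `prefix`, keyed by (day, bot name)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match["status"] != "404":
            continue
        if not match["path"].startswith(prefix):
            continue
        agent = match["agent"].lower()
        bot = next((b for b in BOTS if b in agent), None)
        if bot:
            counts[(match["day"], bot)] += 1
    return counts


# Made-up log line, for illustration only.
sample = [
    '192.0.2.1 - - [01/Jan/2000:10:00:00 +0000] "GET /product/old-item HTTP/1.1" '
    '404 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
]
print(count_bot_404s(sample))
```

If the daily counts for each bot trend down over a few weeks, the crawlers are working through the old URLs and you can leave the 404s alone.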
It would be great if the process were faster and the data were cleaner. The report has a checkbox for "this is fixed", and that is great if you fixed something, but they need a checkbox for "this is supposed to 404" to help clear things out. If I have learned anything about Search Console, it is that it is helpful, but in many cases the data is not real time.
Good luck!