2.3 million 404s in GWT - learn to live with 'em?
-
So I’m working on optimizing a directory site. Total size: 12.5 million pages in the XML sitemap. This is orders of magnitude larger than any site I’ve ever worked on – heck, every other site I’ve ever worked on combined would be a rounding error compared to this.
Before I was hired, the company brought in an outside consultant to iron out some of the technical issues on the site. To his credit, he was worth the money: indexation and organic Google traffic have steadily increased over the last six months. However, some issues remain. The company has access to a quality (i.e. paid) source of data for directory listing pages, but the last time the data was refreshed some months back, it threw 1.8 million 404s in GWT. That number has kept growing since; we now have 2.3 million 404s in GWT.
From what I’ve been able to determine, links on this site generally break after a data refresh for one of two reasons: the page just doesn’t exist anymore (i.e. it wasn’t found in the data refresh, so the page was simply deleted), or the URL had to change due to some technical issue (the page still exists, just under a different link). With other sites I’ve worked on, 404s aren’t that big a deal: set up a 301 redirect in htaccess and problem solved. In this instance, setting up that many 301 redirects, even if it could somehow be automated, just isn’t an option due to the potential bloat in the htaccess file.
Based on what I’ve read here and here, 404s in and of themselves don’t really hurt the site indexation or ranking. And the more I consider it, the really big sites – the Amazons and eBays of the world – have to contend with broken links all the time due to product pages coming and going. Bottom line, it looks like if we really want to refresh the data on the site on a regular basis – and I believe that is priority one if we want the bot to come back more frequently – we’ll just have to put up with broken links on the site on a more regular basis.
So here’s where my thought process is leading:
- Go ahead and refresh the data. Make sure the XML sitemaps are refreshed as well – hopefully this will help the site stay current in the index.
- Keep an eye on broken links in GWT. Implement 301s for really important pages (i.e. content-rich stuff that is really mission-critical). Otherwise, just learn to live with a certain number of 404s being reported in GWT on more or less an ongoing basis.
- Watch the overall trend of 404s in GWT. At least make sure they don’t increase. Hopefully, if we can make sure that the sitemap is updated when we refresh the data, the 404s reported will decrease over time.
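As a rough sketch of the first bullet, regenerating the sitemaps after a refresh could look something like the following. It assumes the refresh can produce a plain list of live URLs; the function names and the example.com URLs are hypothetical and not taken from the actual site. The 50,000-URL cap per file comes from the sitemaps.org protocol.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemaps.org protocol


def chunk(urls, size=MAX_URLS_PER_FILE):
    """Yield successive lists of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap(urls):
    """Return one <urlset> document (bytes) for a single chunk of URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def build_index(sitemap_urls):
    """Return a <sitemapindex> document (bytes) listing each chunk file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = url
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


# Hypothetical usage: only URLs that survived the refresh go into the files.
live_urls = ["https://example.com/listing/a", "https://example.com/listing/b"]
files = [build_sitemap(part) for part in chunk(live_urls)]
index = build_index(
    f"https://example.com/sitemap-{n}.xml" for n in range(len(files))
)
```

At 50,000 URLs per file, 12.5 million pages works out to 250 sitemap files plus one index. Building the files from the refreshed data, rather than patching old ones, is what keeps deleted pages out of the sitemap.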
We do have an issue with the site creating some weird pages with content that lives within tabs on specific pages. Once we can clamp down on those and a few other technical issues, I think keeping the data refreshed should help with our indexation and crawl rates.
Thoughts? If you think I’m off base, please set me straight.
-
I was actually thinking about some type of wildcard rule in htaccess. This might actually do the trick! Thanks for the response!
-
Hi,
Sounds like you’ve taken on a massive job with 12.5 million pages, but I think you can implement a simple fix to get things started.
You’re right to think about that sitemap. Make sure it’s being updated dynamically as the data refreshes; otherwise it will be responsible for a lot of your 404s.
I understand you don’t want to add 2.3 million separate redirects to your htaccess, so what about a simple rule: if the request starts with /listing/ (one of your directory pages), is not a file and is not a dir, then redirect back to the homepage. Something like this:
# mod_rewrite has to be switched on before any rules run
RewriteEngine On
# does the request start with /listing/ or whatever structure you are using
RewriteCond %{REQUEST_URI} ^/listing/ [NC]
# is it NOT a file and NOT a dir
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# all true? Redirect
RewriteRule .* / [L,R=301]
This way you can specify a certain URL structure for the pages which tend to turn into 404s. Any 404s outside of that rule will still serve a 404 code and show your 404 page, and you can fix those manually, but the pages which tend to disappear can all be redirected back to the homepage if they’re not found.
You could still implement your 301s for important pages or simply recreate the page if it’s worth doing so, but you will have dealt with a large chunk of your non-existing pages.
I think it’s a big job and those missing pages are only part of it, but it should help you to sift through all of the data to get to the important bits – you can mark a lot of URLs as fixed and start giving your attention to the important pages which need some work.
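One caveat on the rule above: the !-f and !-d tests look at the filesystem, so it assumes listing pages exist as real files or directories on disk. If the listings are generated dynamically, the same decision would have to be made in the application instead. A minimal sketch of that logic follows; the function name, the live_paths set and the moved_paths mapping are all hypothetical, and only the /listing/ prefix comes from the rule above.

```python
def resolve_listing(path, live_paths, moved_paths):
    """Decide how to answer a request for `path`.

    live_paths:  set of paths present after the latest data refresh
    moved_paths: dict mapping an old listing path to its new path
    Returns (status_code, redirect_target_or_None).
    """
    if path in live_paths:
        return (200, None)
    # Outside the listing structure a missing page stays a plain 404,
    # so those can still be found and fixed by hand.
    if not path.lower().startswith("/listing/"):
        return (404, None)
    # The page still exists under a different URL: send visitors there.
    if path in moved_paths:
        return (301, moved_paths[path])
    # The listing is gone: fall back to the homepage, like the rule above.
    return (301, "/")


live = {"/listing/acme", "/about"}
moved = {"/listing/old-acme": "/listing/acme"}
print(resolve_listing("/listing/old-acme", live, moved))
```

Checking the moved mapping before falling back to the homepage covers the "page still exists under a different link" case without one htaccess line per URL.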
Hope that helps,
Tom