2.3 million 404s in GWT - learn to live with 'em?
-
So I’m working on optimizing a directory site. Total size: 12.5 million pages in the XML sitemap. This is orders of magnitude larger than any site I’ve worked on before – heck, every other site in my career combined would be a rounding error next to it.
Before I was hired, the company brought in an outside consultant to iron out some of the technical issues on the site. To his credit, he was worth the money: indexation and organic Google traffic have steadily increased over the last six months. However, some issues remain. The company has access to a quality (i.e. paid) source of data for directory listing pages, but the last time the data was refreshed, some months back, it threw 1.8 million 404s in GWT. That number has grown steadily since; we’re now at 2.3 million 404s in GWT.
From what I’ve been able to determine, listing URLs tied to the data feed generally break for one of two reasons: the page simply doesn’t exist anymore (i.e. it wasn’t found in the data refresh, so it was deleted), or the URL had to change due to some technical issue (the page still exists, just under a different link). On other sites I’ve worked on, 404s aren’t that big a deal: set up a 301 redirect in htaccess and problem solved. In this instance, setting up that many 301 redirects, even if it could somehow be automated, just isn’t an option due to the bloat it would create in the htaccess file.
Based on what I’ve read here and here, 404s in and of themselves don’t really hurt a site’s indexation or ranking. And the more I think about it, the really big sites – the Amazons and eBays of the world – have to contend with broken links all the time as product pages come and go. Bottom line: if we really want to refresh the data on a regular basis – and I believe that is priority one if we want the bot to come back more frequently – we’ll just have to put up with a certain volume of broken links on the site.
So here’s where my thought process is leading:
- Go ahead and refresh the data. Make sure the XML sitemaps are refreshed as well – hopefully this will help the site stay current in the index. (At this scale that means a sitemap index; see the sketch after this list.)
- Keep an eye on broken links in GWT. Implement 301s for the really important pages (i.e. content-rich, mission-critical stuff). Otherwise, learn to live with a certain number of 404s being reported in GWT on a more or less ongoing basis.
- Watch the overall trend of 404s in GWT and at least make sure they don’t increase. Hopefully, if the sitemap is updated whenever we refresh the data, the reported 404s will decrease over time.
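For context on scale: the sitemaps protocol caps each file at 50,000 URLs, so 12.5 million pages works out to roughly 250 sitemap files tied together by a sitemap index. A minimal sketch of what the regenerated index could look like (file names and dates here are hypothetical):
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemaps/listings-0001.xml.gz</loc>
    <lastmod>2013-08-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemaps/listings-0002.xml.gz</loc>
    <lastmod>2013-08-01</lastmod>
  </sitemap>
  <!-- ...one <sitemap> entry per 50,000-URL chunk; bump <lastmod> on every data refresh... -->
</sitemapindex>
Regenerating these chunks (and their lastmod dates) as part of the data refresh means deleted listings drop out of the sitemaps at the same moment they start 404ing.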
We do have an issue with the site generating some weird standalone pages from content that lives within tabs on certain pages. Once we can clamp down on those and a few other technical issues, I think keeping the data refreshed should help with our indexation and crawl rates.
Thoughts? If you think I’m off base, please set me straight.
-
I was actually thinking about some type of wildcard rule in htaccess – this might do the trick! Thanks for the response!
-
Hi,
Sounds like you’ve taken on a massive job with 12.5 million pages, but I think you can implement a simple fix to get things started.
You’re right to think about that sitemap: make sure it’s being dynamically updated as the data refreshes; otherwise it will be responsible for a lot of your 404s.
I understand you don’t want to add 2.3 million separate redirects to your htaccess, so what about a simple rule: if the request starts with /listing/ (one of your directory pages), is not a file, and is not a directory, then redirect back to the homepage. Something like this:
# Does the request start with /listing/ (or whatever structure you are using)?
RewriteCond %{REQUEST_URI} ^/listing/ [NC]
# Is it NOT a file and NOT a directory?
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# All conditions true? Redirect.
RewriteRule .* / [L,R=301]
This way you can target the specific URL structure of the pages which tend to turn into 404s. Any 404s outside of this rule will still serve a 404 code and show your 404 page, so you can fix those problems manually, while the pages which tend to disappear can all be redirected back to the homepage if they’re not found.
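One caveat on redirecting everything to the homepage: Google tends to treat a blanket redirect-to-home as a soft 404 anyway, so for listings that are gone for good you could serve a 410 (Gone) instead – it tells the crawler the removal is deliberate and the URL can be dropped from the index. A sketch using the same conditions (assuming the same /listing/ structure):
# Same conditions as above, but answer 410 Gone instead of redirecting
RewriteCond %{REQUEST_URI} ^/listing/ [NC]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* - [G,L]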
You could still implement your 301s for important pages, or simply recreate a page if it’s worth doing so, but you will have dealt with a large chunk of your non-existing pages.
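And if the list of page-specific 301s grows beyond what you’d want to hand-edit, Apache’s RewriteMap can handle bulk redirects without bloating the htaccess – the catch is that RewriteMap only works in the server config or virtual host context, not in htaccess itself. A rough sketch, assuming you can export old-path/new-URL pairs from the data feed (the map name and file paths are hypothetical):
# httpd.conf / virtual host context only – RewriteMap is not allowed in .htaccess
# Compile a text file of "old-path new-url" pairs into a DBM map for fast lookups:
#   httxt2dbm -i listing-redirects.txt -o listing-redirects.map
RewriteEngine On
RewriteMap listingmap dbm:/etc/apache2/listing-redirects.map

# If the requested path has an entry in the map, 301 it to its new URL
RewriteCond ${listingmap:%{REQUEST_URI}} !=""
RewriteRule ^/listing/ ${listingmap:%{REQUEST_URI}} [R=301,L]
With the lookups in a binary map, even a few hundred thousand moved URLs stay fast, and the catch-all homepage rule above can then mop up whatever the map doesn’t cover.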
I think it’s a big job and those missing pages are only part of it, but this should help you sift through all of the data to get to the important bits – you can mark a lot of URLs as fixed and give your attention to the important pages which need some work.
Hope that helps,
Tom