2.3 million 404s in GWT - learn to live with 'em?
-
So I’m working on optimizing a directory site. Total size: 12.5 million pages in the XML sitemap. This is orders of magnitude larger than any site I’ve ever worked on – heck, every other site I’ve ever worked on combined would be a rounding error compared to this.
Before I was hired, the company brought in an outside consultant to iron out some of the technical issues on the site. To his credit, he was worth the money: indexation and organic Google traffic have steadily increased over the last six months. However, some issues remain. The company has access to a quality (i.e. paid) source of data for directory listing pages, but the last time the data was refreshed, some months back, it threw 1.8 million 404s in GWT. That number has kept climbing; we’re now at 2.3 million 404s in GWT.
Based on what I’ve been able to determine, links tied to the data feed on this site generally break for one of two reasons: the page simply doesn’t exist anymore (i.e. it wasn’t found in the data refresh, so it was deleted), or the URL had to change due to some technical issue (the page still exists, just under a different link). With other sites I’ve worked on, 404s aren’t that big a deal: set up a 301 redirect in htaccess and problem solved. In this instance, setting up that many 301 redirects, even if it could somehow be automated, just isn’t an option because of the bloat it would add to the htaccess file.
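To illustrate the scale problem (these listing URLs are hypothetical, not our actual paths), the one-redirect-per-URL approach would look roughly like this in htaccess, repeated a couple of million times:

# one line per moved or deleted listing – fine for a handful of pages, unmanageable at this scale
Redirect 301 /listing/acme-plumbing /listing/acme-plumbing-dallas-tx
Redirect 301 /listing/old-widget-co /listing/new-widget-co

If we had access to the main server config, my understanding is that a RewriteMap backed by a dbm file could keep the lookup table out of htaccess entirely (RewriteMap can’t be declared in htaccess itself) – treat this as a sketch rather than something I’ve tested:

# in the server/vhost config, not htaccess (paths are hypothetical)
RewriteEngine On
RewriteMap listingmap "dbm:/var/www/maps/listing-redirects.map"
# redirect only when the map has an entry for the requested URL
RewriteCond ${listingmap:%{REQUEST_URI}|NOT_FOUND} !=NOT_FOUND
RewriteRule ^ ${listingmap:%{REQUEST_URI}} [R=301,L]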
Based on what I’ve read here and here, 404s in and of themselves don’t really hurt a site’s indexation or ranking. And the more I think about it, the really big sites – the Amazons and eBays of the world – have to contend with broken links all the time as product pages come and go. Bottom line: if we really want to refresh the data on the site on a regular basis – and I believe that is priority one if we want the bot to come back more frequently – it looks like we’ll just have to put up with a certain number of broken links on an ongoing basis.
So here’s where my thought process is leading:
- Go ahead and refresh the data. Make sure the XML sitemaps are refreshed as well – hopefully this will help the site stay current in the index.
- Keep an eye on broken links in GWT. Implement 301s for the really important pages (i.e. content-rich, mission-critical stuff). Otherwise, just learn to live with a certain number of 404s being reported in GWT on a more or less ongoing basis.
- Watch the overall trend of 404s in GWT. At least make sure they don’t increase. Hopefully, if we can make sure that the sitemap is updated when we refresh the data, the 404s reported will decrease over time.
We do have an issue with the site creating some weird pages with content that lives within tabs on specific pages. Once we can clamp down on those and a few other technical issues, I think keeping the data refreshed should help with our indexation and crawl rates.
Thoughts? If you think I’m off base, please set me straight.
-
I was actually thinking about some type of wildcard rule in htaccess. This might actually do the trick! Thanks for the response!
-
Hi,
Sounds like you’ve taken on a massive job with 12.5 million pages, but I think you can implement a simple fix to get things started.
You’re right to think about that sitemap – make sure it’s being dynamically updated as the data refreshes; otherwise it will be responsible for a lot of your 404s.
I understand you don’t want to add 2.3 million separate redirects to your htaccess, so what about a simple rule - if the request starts with ^/listing/ (one of your directory pages), is not a file and is not a dir, then redirect back to the homepage. Something like this:
# does the request start with /listing/ (or whatever structure you are using)?
RewriteCond %{REQUEST_URI} ^/listing/ [NC]
# is it NOT a file and NOT a dir?
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# all true? Redirect
RewriteRule .* / [L,R=301]

This way you can specify a certain URL structure for the pages which tend to turn into 404s. Any 404s outside of that rule will still serve a 404 code and show your 404 page, so you can fix those manually, but the pages which tend to disappear can all be redirected back to the homepage if they’re not found.
You could still implement your 301s for important pages, or simply recreate the page if it’s worth doing so, but you will have dealt with a large chunk of your non-existent pages.
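As a rough sketch of how those targeted 301s could sit alongside the catch-all (the listing URLs here are just made-up examples): mod_rewrite works through rules in order and the L flag stops at the first match, so the specific redirects need to come before the catch-all.

# hand-picked 301s for the important pages – keep these above the catch-all rule
RewriteRule ^listing/acme-plumbing$ /listing/acme-plumbing-dallas-tx [L,R=301]
RewriteRule ^listing/old-widget-co$ /listing/new-widget-co [L,R=301]
# ...then the general /listing/ catch-all from above

One thing to note: in a per-directory htaccess file the RewriteRule pattern is matched without the leading slash, which is why these patterns start with "listing/" rather than "/listing/".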
I think it’s a big job and those missing pages are only part of it, but this should help you sift through all of the data to get to the important bits – you can mark a lot of URLs as fixed and start giving your attention to the important pages which need some work.
Hope that helps,
Tom