2.3 million 404s in GWT - learn to live with 'em?
-
So I’m working on optimizing a directory site. Total size: 12.5 million pages in the XML sitemap. This is orders of magnitude larger than any site I’ve ever worked on – heck, every other site I’ve ever worked on combined would be a rounding error compared to this.
Before I was hired, the company brought in an outside consultant to iron out some of the site's technical issues. To his credit, he was worth the money: indexation and organic Google traffic have steadily increased over the last six months. However, some issues remain. The company has access to a quality (i.e. paid) data source for its directory listing pages, but the last data refresh, some months back, threw 1.8 million 404s in GWT. That number has climbed steadily since; we're now at 2.3 million 404s in GWT.
From what I've been able to determine, links tied to the data feed on this site break for one of two reasons: the page no longer exists (i.e. it wasn't found in the data refresh, so it was simply deleted), or the URL had to change due to some technical issue (the page still exists, just under a different link). On other sites I've worked on, 404s aren't a big deal: set up a 301 redirect in htaccess and problem solved. Here, setting up that many 301 redirects, even if it could somehow be automated, just isn't an option given the bloat it would add to the htaccess file.
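(For the "URL changed" case, one way around htaccess bloat might be a RewriteMap lookup. This is just a sketch, and it assumes access to the main server/vhost config, since RewriteMap is not permitted in .htaccess; the map file path and URL structure below are made up.)

```apache
# Server/vhost config only - RewriteMap cannot live in .htaccess.
# The map file holds one "old-path new-path" pair per line; for millions
# of entries, a dbm: map would scale better than txt:.
RewriteEngine On
RewriteMap listingmap "txt:/etc/apache2/listing-redirects.txt"

# If the requested path has an entry in the map, 301 to the new URL.
RewriteCond "${listingmap:%{REQUEST_URI}|NOT_FOUND}" "!=NOT_FOUND"
RewriteRule "^" "${listingmap:%{REQUEST_URI}}" [R=301,L]
```

A dbm map can be rebuilt from the text file with `httxt2dbm` after each data refresh, so the redirect set stays current without restarting Apache.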
Based on what I've read here and here, 404s in and of themselves don't really hurt a site's indexation or ranking. And the more I think about it, the really big sites – the Amazons and eBays of the world – have to contend with broken links all the time as product pages come and go. Bottom line: if we really want to refresh the data on a regular basis – and I believe that's priority one if we want the bot to come back more frequently – we'll just have to put up with broken links on a more regular basis.
So here’s where my thought process is leading:
- Go ahead and refresh the data. Make sure the XML sitemaps are refreshed as well – hopefully this will help the site stay current in the index.
- Keep an eye on broken links in GWT. Implement 301s for really important pages (i.e. content-rich, mission-critical stuff). Otherwise, learn to live with a certain number of 404s being reported in GWT on a more or less ongoing basis.
- Watch the overall trend of 404s in GWT. At least make sure they don’t increase. Hopefully, if we can make sure that the sitemap is updated when we refresh the data, the 404s reported will decrease over time.
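(As a hedged sketch of how the triage above could be automated: comparing the old and new feed snapshots can separate genuinely deleted listings from ones whose URL merely changed, and emit an "old new" mapping file that a server-side rewrite map or batch redirect import could consume. The CSV format and column names here are assumptions, not the actual feed schema.)

```python
# Hypothetical sketch: classify listings after a data refresh, assuming each
# feed row carries a stable listing ID and its URL.
import csv

def classify_listings(old_csv, new_csv):
    """Compare two feed snapshots; return (deleted_ids, moved) where
    moved maps old URL -> new URL for listings whose URL changed."""
    def load(path):
        with open(path, newline="") as f:
            return {row["id"]: row["url"] for row in csv.DictReader(f)}

    old, new = load(old_csv), load(new_csv)
    deleted = [lid for lid in old if lid not in new]
    moved = {old[lid]: new[lid] for lid in old
             if lid in new and old[lid] != new[lid]}
    return deleted, moved

def write_redirect_map(moved, path):
    # One "old-path new-path" pair per line, instead of millions of
    # individual Redirect lines in htaccess.
    with open(path, "w") as f:
        for old_url, new_url in sorted(moved.items()):
            f.write(f"{old_url} {new_url}\n")
```

Run against each refresh, the deleted list tells you which sitemap entries to drop, and the moved map covers the pages that still exist under new links.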
We do have an issue with the site creating some weird pages with content that lives within tabs on specific pages. Once we can clamp down on those and a few other technical issues, I think keeping the data refreshed should help with our indexation and crawl rates.
Thoughts? If you think I’m off base, please set me straight.
-
I was actually thinking about some type of wildcard rule in htaccess. This might actually do the trick! Thanks for the response!
-
Hi,
Sounds like you’ve taken on a massive job with 12.5 million pages, but I think you can implement a simple fix to get things started.
You're right to think about that sitemap: make sure it's being dynamically updated as the data refreshes; otherwise it will be responsible for a lot of your 404s.
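(As a rough illustration of keeping the sitemap in step with the data – everything here, including the domain and file names, is hypothetical – a site this size needs a sitemap index, since the sitemaps.org protocol caps each file at 50,000 URLs. A regeneration step on each refresh might look like:)

```python
# Sketch: rebuild sitemap files from the current URL list after each refresh.
from xml.sax.saxutils import escape

SITEMAP_LIMIT = 50_000  # per-file URL cap in the sitemaps.org protocol

def build_sitemaps(urls, base="https://example.com/sitemaps"):
    """Split URLs into <=50k chunks; return (index_xml, [chunk_xml, ...])."""
    chunks = [urls[i:i + SITEMAP_LIMIT]
              for i in range(0, len(urls), SITEMAP_LIMIT)]
    files = []
    for chunk in chunks:
        entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in chunk)
        files.append('<?xml version="1.0" encoding="UTF-8"?>'
                     '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
                     f"{entries}</urlset>")
    index_entries = "".join(
        f"<sitemap><loc>{base}/sitemap-{i}.xml</loc></sitemap>"
        for i in range(len(files)))
    index = ('<?xml version="1.0" encoding="UTF-8"?>'
             '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
             f"{index_entries}</sitemapindex>")
    return index, files
```

Since only live URLs from the refreshed feed go in, deleted listings drop out of the sitemap automatically, which is what keeps the 404 count from being inflated by the sitemap itself.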
I understand you don't want to add 2.3 million separate redirects to your htaccess, so what about a simple rule: if the request starts with ^/listing/ (one of your directory pages), is not a file, and is not a directory, then redirect back to the homepage. Something like this:
# Does the request start with /listing/ (or whatever structure you're using)?
RewriteCond %{REQUEST_URI} ^/listing/ [NC]
# Is it NOT a file and NOT a directory?
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# All true? Redirect to the homepage.
RewriteRule .* / [L,R=301]
This way you can target the URL structure of the pages that tend to turn into 404s. Any 404s outside that rule will still serve a 404 code and show your 404 page, so you can fix those manually, but the pages that tend to disappear can all be redirected back to the homepage when they're not found.
You could still implement your 301s for important pages, or simply recreate a page if it's worth doing, but you will have dealt with a large chunk of your non-existent pages.
I think it's a big job, and those missing pages are only part of it, but it should help you sift through all of the data to get to the important bits – you can mark a lot of URLs as fixed and start giving your attention to the important pages which need some work.
Hope that helps,
Tom