2.3 million 404s in GWT - learn to live with 'em?
-
So I’m working on optimizing a directory site. Total size: 12.5 million pages in the XML sitemap. This is orders of magnitude larger than any site I’ve ever worked on – heck, every other site I’ve ever worked on combined would be a rounding error compared to this.
Before I was hired, the company brought in an outside consultant to iron out some of the technical issues on the site. To his credit, he was worth the money: indexation and organic Google traffic have steadily increased over the last six months. However, some issues remain. The company has access to a quality (i.e. paid) source of data for directory listing pages, but the last time the data was refreshed some months back, it threw 1.8 million 404s in GWT. That number has grown steadily since; we’re now at 2.3 million 404s in GWT.
Based on what I’ve been able to determine, the broken links tied to this data feed generally come down to one of two causes: the page no longer exists (i.e. it wasn’t found in the data refresh, so it was simply deleted), or the URL had to change due to some technical issue (the page still exists, just under a different link). On other sites I’ve worked on, 404s aren’t a big deal: set up a 301 redirect in htaccess and problem solved. Here, setting up that many 301 redirects, even if it could somehow be automated, just isn’t an option due to the bloat it would add to the htaccess file.
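One possible way around the htaccess bloat, assuming you have access to the server/vhost config (Apache’s RewriteMap directive can’t be declared in .htaccess itself): keep the old-to-new URL pairs in a lookup file and let RewriteMap consult it at request time, so millions of redirects live in a data file rather than in rewrite rules. A minimal sketch of generating that map file from a hypothetical CSV export produced by the data refresh (the paths and CSV column names here are made up for illustration):

```python
import csv
import io

# Hypothetical input: a CSV of old_path,new_path pairs exported from the
# data refresh. We build a plain-text RewriteMap file ("key value" per
# line); Apache's httxt2dbm tool can then convert it to a dbm file for
# fast lookups. (RewriteMap must live in the server/vhost config, not
# in .htaccess.)
sample_csv = io.StringIO(
    "old_path,new_path\n"
    "/listing/acme-plumbing-123,/listing/acme-plumbing-456\n"
    "/listing/widget-co-old,/listing/widget-co-new\n"
)

lines = []
for row in csv.DictReader(sample_csv):
    lines.append(f"{row['old_path']} {row['new_path']}")

map_text = "\n".join(lines) + "\n"
print(map_text)
```

On the Apache side, a `RewriteMap` declared in the vhost config pointing at that file (converted to dbm with `httxt2dbm`), plus a single `RewriteRule` that looks each request up in the map, keeps lookups fast no matter how many entries the map holds, and htaccess never grows.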
Based on what I’ve read here and here, 404s in and of themselves don’t really hurt a site’s indexation or ranking. And the more I think about it, the really big sites – the Amazons and eBays of the world – contend with broken links all the time as product pages come and go. Bottom line: if we really want to refresh the data on the site on a regular basis – and I believe that is priority one if we want the bot to come back more frequently – it looks like we’ll just have to put up with a certain level of broken links on an ongoing basis.
So here’s where my thought process is leading:
- Go ahead and refresh the data. Make sure the XML sitemaps are refreshed as well – hopefully this will help the site stay current in the index.
- Keep an eye on broken links in GWT. Implement 301s for really important pages (i.e. content-rich stuff that is really mission-critical). Otherwise, just learn to live with a certain number of 404s being reported in GWT on more or less an ongoing basis.
- Watch the overall trend of 404s in GWT. At least make sure they don’t increase. Hopefully, if we can make sure that the sitemap is updated when we refresh the data, the 404s reported will decrease over time.
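On the monitoring point, the old and new sitemaps already tell you which URLs are about to start 404ing at each refresh: anything present in the previous sitemap but missing from the new one. Those dropped URLs are the natural candidates to triage for the few manual 301s. A rough sketch of that diff (assuming single-file sitemaps for illustration; a 12.5-million-page site would really have a sitemap index to walk first, and the example.com URLs are placeholders):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap_urls(xml_text):
    """Extract the set of <loc> URLs from a (single-file) XML sitemap."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc")}

# Hypothetical before/after sitemaps; real ones would be read from disk.
old_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/listing/a</loc></url>
  <url><loc>https://example.com/listing/b</loc></url>
</urlset>"""
new_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/listing/a</loc></url>
  <url><loc>https://example.com/listing/c</loc></url>
</urlset>"""

dropped = sitemap_urls(old_xml) - sitemap_urls(new_xml)  # will start 404ing
added = sitemap_urls(new_xml) - sitemap_urls(old_xml)    # new pages to crawl
print(sorted(dropped))
```

Ranking the dropped set by traffic or inbound links (from analytics or GWT exports) would then surface the “mission-critical” pages worth a manual 301.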
We do have an issue with the site creating some weird pages with content that lives within tabs on specific pages. Once we can clamp down on those and a few other technical issues, I think keeping the data refreshed should help with our indexation and crawl rates.
Thoughts? If you think I’m off base, please set me straight.
-
I was actually thinking about some type of wildcard rule in htaccess. This might do the trick! Thanks for the response!
-
Hi,
Sounds like you’ve taken on a massive job with 12.5 million pages, but I think you can implement a simple fix to get things started.
You’re right to think about that sitemap. Make sure it’s being dynamically updated as the data refreshes; otherwise it will be responsible for a lot of your 404s.
I understand you don’t want to add 2.3 million separate redirects to your htaccess, so what about a simple rule: if the request starts with /listing/ (one of your directory page paths), and it is not a file and not a directory, then redirect back to the homepage. Something like this:
# does the request start with /listing/ (or whatever structure you are using)?
RewriteCond %{REQUEST_URI} ^/listing/ [NC]
# is it NOT a file and NOT a dir?
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# all true? Redirect
RewriteRule .* / [L,R=301]
This way you can target the URL structure of the pages which tend to turn into 404s. Any 404s outside of that rule will still serve a 404 code and show your 404 page, so you can fix those manually, but the pages which tend to disappear can all be redirected back to the homepage if they’re not found.
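Before turning a blanket rule like that loose on a site this size, it can be worth dry-running the pattern against sample request paths, since a case-insensitive prefix match can catch more (or less) than expected. A quick sketch, where the /listing/ prefix is just a placeholder for the real URL structure:

```python
import re

# Mirrors the RewriteCond pattern above: a case-insensitive prefix match
# ([NC] in Apache terms). /listing/ is a hypothetical prefix.
pattern = re.compile(r"^/listing/", re.IGNORECASE)

samples = [
    "/listing/acme-plumbing",  # dead listing page: rule applies
    "/Listing/old-page",       # also matches, thanks to [NC]
    "/about/",                 # outside the prefix: normal 404 handling
    "/listings.csv",           # no slash after "listing": no match
]

for path in samples:
    hit = bool(pattern.match(path))
    print(f"{path:24} -> {'redirect to /' if hit else 'left alone'}")
```

Running a crawl-log sample through a check like this makes it easy to confirm the rule only touches the disappearing directory pages.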
You could still implement your 301s for important pages, or simply recreate a page if it’s worth doing, but you will have dealt with a large chunk of your non-existent pages.
I think it’s a big job and those missing pages are only part of it, but this should help you sift through all of the data to get to the important bits – you can mark a lot of URLs as fixed and give your attention to the important pages that need some work.
Hope that helps,
Tom