Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
How can I prevent duplicate pages being indexed because of load balancer (hosting)?
-
The site that I am optimising has a problem with duplicate pages being indexed as a result of the load balancer (which is required and set up by the hosting company).
The load balancer passes the site through to 2 different URLs:
Some how, Google have indexed 2 of the same URLs (which I was obviously hoping they wouldn't) - the first on www and the second on www2.
The hosting is a mirror image of each other (www and www2), meaning I can't upload a robots.txt to the root of www2.domain.com disallowing all. Also, I can't add a canonical script into the website header of www2.domain.com pointing the individual URLs through to www.domain.com etc.
Any suggestions as to how I can resolve this issue would be greatly appreciated!
-
There are two ways to handle load balancing, and it appears that your hosting company / server company chose to use the DNS round-robin routing option.
According to the Wikipedia page on load balancing:
http://en.wikipedia.org/wiki/Load_balancing_(computing)"Load balancing usually involves dedicated software or hardware, such as a multilayer switch or a Domain Name System server process."
Round Robin DNS Load Balancing: Basically you use the DNS routing system to handle requests. When someone visits your site, 50% of the people are routed to www.domain.com, and 50% are routed to ww1.domain.com. Both sites contain the same identical content; it's the URLs that are slightly different. Sometimes the domains are the same; but you have different IP addresses for www.domain.com.
Advantages: you don't need a dedicated load balancing piece of software or hardware, so it's less expensive.
Disadvantages: this technique exposes the individual web servers to the end user seeing the site. You can also suffer from duplicate content penalties, too. Finally, if you are relying on the round robin DNS system for load balancing, and a DNS server or one of the Web servers goes down, there's not an easy fail-over (as many DNS records are cached).More about Round Robin DNS: http://en.wikipedia.org/wiki/Round-robin_DNS
Hardware / Software Load Balancer:
In this case, your DNS zone file tells the end user to go to one IP address when they type in www.domain.com. The hardware or software load balancer then sees the request, and then hands off the content to one of the web servers in a cluster.Advantages: No duplicate content penalty; to the end user, they just see one web server and not individual sub-domains (www.domain.com and ww1.domain.com). A load balancer can also cache specific items like a CSS page, so the load on the Web server is even more minimal.
Disadvantages: You're introducing another piece of hardware or software (i.e. more cost); this piece could also be a single point of failure into the mix. You need someone to figure out how to set this up and make sure it all works.
More on this type of Load Balancing: http://en.wikipedia.org/wiki/Load_balancing_(computing)#Internet-based_services
Load balancing can get complicated as soon as you have databases involved, but with a good design, multiple front end Web servers can talk to one single backend database server. The goal would be to cache as much content as possible as "static" elements, using caching systems like Varnish, that essentially turn database-driven pages into static, old-school HTML pages. And then only when someone needs to save something from the database (i.e. making a purchase on an eCommerce site), the system then interacts with it.
My recommendation:
(1) Move from the Round Robin Robin DNS to a hardware or software load balancer.(2) If that isn't an easy solution, implement the Round Robin DNS solution to use identical A records for each server.
For example, you might have identical entries in your DNS zone files for both DNS servers:
www.domain.com A 69.94.15.10
NS2.domain.com:
www.domain.com A 75.64.18.12This should at least eliminate your duplicate content issue, but you still do have a few disadvantages (described above). This also could lead to server issues, as the servers might be confused if they are the authoritative ones.
And if both servers are sending email, pay special attention to your SPF record, to make sure that you are allowing both IP addresses to be able to send email. (This is often overlooked.)
Hope this is helpful!
-- Jeff
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
How can I avoid duplicate content for a new landing page which is the same as an old one?
Hello mozers! I have a question about duplicate content for you... One on my clients pages have been dropping in search volume for a while now, and I've discovered it's because the search term isn't as popular as it used to be. So... we need to create a new landing page using a more popular search term. The page which is losing traffic is based on the search query "Can I put a solid roof on my conservatory" this only gets 0-10 searches per month according to the keyword explorer tool. However, if we changed this to "replacing conservatory roof with solid roof" this gets up to 500 searches per month. Muuuuch better! The issue is, I don't want to close down and re-direct the old page because it's got a featured snippet and sits in position 1. So I'd like to create another page instead... however, as the two are effectively the same content, I would then land myself in a duplicate content issue. If I were to put a rel="canonical" tag in the original "can I put a solid roof...." page but say the master page is now the new one, would that get around the issue?
Intermediate & Advanced SEO | | Virginia-Girtz0 -
Can't generate a sitemap with all my pages
I am trying to generate a site map for my site nationalcurrencyvalues.com but all the tools I have tried don't get all my 70000 html pages... I have found that the one at check-domains.com crawls all my pages but when it writes the xml file most of them are gone... seemingly randomly. I have used this same site before and it worked without a problem. Can anyone help me understand why this is or point me to a utility that will map all of the pages? Kindly, Greg
Intermediate & Advanced SEO | | Banknotes0 -
Google indexing pages from chrome history ?
We have pages that are not linked from site yet they are indexed in Google. It could be possible if Google got these pages from browser. Does Google takes data from chrome?
Intermediate & Advanced SEO | | vivekrathore0 -
Better to 301 or de-index 403 pages
Google WMT recently found and called out a large number of old unpublished pages as access denied errors. The pages are tagged "noindex, follow." These old pages are in Google's index. At this point, would it better to 301 all these pages or submit an index removal request or what? Thanks... Darcy
Intermediate & Advanced SEO | | 945010 -
Date of page first indexed or age of a page?
Hi does anyone know any ways, tools to find when a page was first indexed/cached by Google? I remember a while back, around 2009 i had a firefox plugin which could check this, and gave you a exact date. Maybe this has changed since. I don't remember the plugin. Or any recommendations on finding the age of a page (not domain) for a website? This is for competitor research not my own website. Cheers, Paul
Intermediate & Advanced SEO | | MBASydney0 -
Should pages of old news articles be indexed?
My website published about 3 news articles a day and is set up so that old news articles can be accessed through a "back" button with articles going to page 2 then page 3 then page 4, etc... as new articles push them down. The pages include a link to the article and a short snippet. I was thinking I would want Google to index the first 3 pages of articles, but after that the pages are not worthwhile. Could these pages harm me and should they be noindexed and/or added as a canonical URL to the main news page - or is leaving them as is fine because they are so deep into the site that Google won't see them, but I also won't be penalized for having week content? Thanks for the help!
Intermediate & Advanced SEO | | theLotter0 -
Blocking Pages Via Robots, Can Images On Those Pages Be Included In Image Search
Hi! I have pages within my forum where visitors can upload photos. When they upload photos they provide a simple statement about the photo but no real information about the image,definitely not enough for the page to be deemed worthy of being indexed. The industry however is one that really leans on images and having the images in Google Image search is important to us. The url structure is like such: domain.com/community/photos/~username~/picture111111.aspx I wish to block the whole folder from Googlebot to prevent these low quality pages from being added to Google's main SERP results. This would be something like this: User-agent: googlebot Disallow: /community/photos/ Can I disallow Googlebot specifically rather than just using User-agent: * which would then allow googlebot-image to pick up the photos? I plan on configuring a way to add meaningful alt attributes and image names to assist in visibility, but the actual act of blocking the pages and getting the images picked up... Is this possible? Thanks! Leona
Intermediate & Advanced SEO | | HD_Leona0 -
Are duplicate links on same page alright?
If I have a homepage with category links, is it alright for those category links to appear in the footer as well, or should you never have duplicate links on one page? Can you please give a reason why as well? Thanks!
Intermediate & Advanced SEO | | dkamen0