Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
How can I prevent duplicate pages being indexed because of load balancer (hosting)?
The site that I am optimising has a problem with duplicate pages being indexed as a result of the load balancer (which is required and set up by the hosting company).
The load balancer passes the site through to 2 different URLs:
Somehow, Google have indexed two copies of the same URLs (which I was obviously hoping they wouldn't): the first on www and the second on www2.
The two hosts (www and www2) are mirror images of each other, meaning I can't upload a robots.txt to the root of www2.domain.com disallowing all. Also, I can't add a canonical tag to the website header of www2.domain.com pointing the individual URLs through to www.domain.com, etc.
Any suggestions as to how I can resolve this issue would be greatly appreciated!
-
There are two ways to handle load balancing, and it appears that your hosting company / server company chose to use the DNS round-robin routing option.
According to the Wikipedia page on load balancing (http://en.wikipedia.org/wiki/Load_balancing_(computing)):
"Load balancing usually involves dedicated software or hardware, such as a multilayer switch or a Domain Name System server process."
Round Robin DNS Load Balancing: Basically, you use the DNS system to distribute requests. When people visit your site, roughly half are routed to www.domain.com and the other half to ww1.domain.com. Both sites serve identical content; only the URLs differ slightly. In other setups the hostname stays the same, but www.domain.com resolves to several different IP addresses.
Advantages: you don't need a dedicated load balancing piece of software or hardware, so it's less expensive.
Disadvantages: this technique exposes the individual web servers to the end user. You can also suffer from duplicate content penalties. Finally, if you rely on round robin DNS for load balancing and a DNS server or one of the web servers goes down, there is no easy fail-over (as many DNS records are cached).
More about Round Robin DNS: http://en.wikipedia.org/wiki/Round-robin_DNS
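To make the rotation concrete, here is a minimal sketch of how a round robin DNS server might hand out addresses. It is a toy model, not a real resolver: the class name `RoundRobinResolver` is hypothetical, and the two addresses are simply the ones used in the zone file example further down.

```python
from collections import deque


class RoundRobinResolver:
    """Toy model of a DNS server rotating the A records for one hostname."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self):
        # Answer with the current ordering, then rotate so the next
        # query sees a different address in first position.
        answer = list(self._addresses)
        self._addresses.rotate(-1)
        return answer


resolver = RoundRobinResolver(["69.94.15.10", "75.64.18.12"])

# Most clients simply use the first address in the answer.
first_choices = [resolver.resolve()[0] for _ in range(4)]
print(first_choices)
```

Successive queries alternate between the two servers, which is also why a cached answer keeps pointing at a server after it has gone down.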
Hardware / Software Load Balancer:
In this case, your DNS zone file sends the end user to a single IP address when they type in www.domain.com. The hardware or software load balancer then receives the request and hands it off to one of the web servers in a cluster.
Advantages: No duplicate content penalty; the end user just sees one web server rather than individual sub-domains (www.domain.com and ww1.domain.com). A load balancer can also cache specific items like a CSS file, so the load on the web servers is reduced even further.
Disadvantages: You're introducing another piece of hardware or software (i.e. more cost), and this piece could itself become a single point of failure. You also need someone to set it up and make sure it all works.
More on this type of Load Balancing: http://en.wikipedia.org/wiki/Load_balancing_(computing)#Internet-based_services
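The key difference from round robin DNS is that the public hostname never changes; only the backend does. The sketch below illustrates that idea under simplifying assumptions: `LoadBalancer`, `route`, and the backend names `web1.internal` and `web2.internal` are all hypothetical, and a real balancer would also handle health checks and connection forwarding.

```python
import itertools


class LoadBalancer:
    """Toy balancer: one public hostname, several hidden backends."""

    def __init__(self, backends):
        # Cycle through the backends in order, forever.
        self._backends = itertools.cycle(backends)

    def route(self, host, path):
        backend = next(self._backends)
        # The visitor (and a search engine crawler) only ever sees
        # public_url; served_by stays internal to the cluster.
        return {"public_url": f"http://{host}{path}", "served_by": backend}


balancer = LoadBalancer(["web1.internal", "web2.internal"])
first = balancer.route("www.domain.com", "/products")
second = balancer.route("www.domain.com", "/products")
print(first)
print(second)
```

Both requests share the same public URL even though different servers answered them, so there is only one URL per page to index.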
Load balancing can get complicated as soon as you have databases involved, but with a good design, multiple front-end web servers can talk to a single back-end database server. The goal is to cache as much content as possible as "static" elements, using caching systems like Varnish that essentially turn database-driven pages into static, old-school HTML pages. The system then only interacts with the database when someone needs to save something to it (e.g. making a purchase on an eCommerce site).
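As a rough illustration of the caching idea (this is not how Varnish itself is configured, just an in-process sketch), a page cache can sit in front of the database and only re-render a page when its cached copy has expired or been invalidated. `PageCache` and `render_from_database` are hypothetical names, and the 60-second lifetime is an arbitrary assumption.

```python
import time


class PageCache:
    """Keep rendered HTML in memory so repeat views skip the database."""

    def __init__(self, render, ttl_seconds=60.0, clock=time.monotonic):
        self._render = render      # function that builds a page from the database
        self._ttl = ttl_seconds    # how long a cached page stays fresh
        self._clock = clock
        self._store = {}           # path -> (time cached, html)

    def get(self, path):
        now = self._clock()
        cached = self._store.get(path)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]       # still fresh: serve the static copy
        html = self._render(path)  # missing or stale: hit the database once
        self._store[path] = (now, html)
        return html

    def invalidate(self, path):
        # Call this after a write (e.g. a purchase) changes the page.
        self._store.pop(path, None)


db_hits = []


def render_from_database(path):
    # Stand-in for a real database query plus template rendering.
    db_hits.append(path)
    return f"<html><body>Product page for {path}</body></html>"


cache = PageCache(render_from_database, ttl_seconds=60.0)
cache.get("/widgets")
cache.get("/widgets")
print(len(db_hits))
```

The second request is served from memory, so the database is only queried once until the entry expires or `invalidate` is called.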
My recommendation:
(1) Move from Round Robin DNS to a hardware or software load balancer.
(2) If that isn't an easy solution, set up Round Robin DNS so that both servers answer to the same hostname in their A records.
For example, the zone files on your two DNS servers might each hold an A record for the same hostname. On the first DNS server:
www.domain.com A 69.94.15.10
NS2.domain.com:
www.domain.com A 75.64.18.12
This should at least eliminate your duplicate content issue, but you still have a few disadvantages (described above). It could also lead to server issues, as the servers might be confused about which of them is authoritative.
And if both servers are sending email, pay special attention to your SPF record to make sure that both IP addresses are allowed to send email. (This is often overlooked.)
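For the SPF point, a small helper can build the TXT record value so that neither address is forgotten. This is only a sketch: `build_spf_record` is a hypothetical helper, the two addresses are the ones from the example above, and the `~all` (soft fail) policy is an assumption you may want to tighten to `-all`.

```python
import ipaddress


def build_spf_record(sender_ips, policy="~all"):
    """Return an SPF TXT value that authorises every listed address."""
    mechanisms = []
    for ip in sender_ips:
        addr = ipaddress.ip_address(ip)  # raises ValueError on a bad address
        prefix = "ip4" if addr.version == 4 else "ip6"
        mechanisms.append(f"{prefix}:{addr}")
    return " ".join(["v=spf1", *mechanisms, policy])


spf = build_spf_record(["69.94.15.10", "75.64.18.12"])
print(spf)  # v=spf1 ip4:69.94.15.10 ip4:75.64.18.12 ~all
```

The resulting string would go into a TXT record for the domain that sends the mail.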
Hope this is helpful!
-- Jeff