Best way to handle indexed pages you don't want indexed
-
We've had a lot of pages indexed by Google that we didn't want indexed. They relate to an AJAX category filter module that works OK for front-end customers, but under the bonnet Google has been following all of the links.
I've put a rule in the robots.txt file to stop Google from following any dynamic pages (with a ?) and also any AJAX pages, but the pages are still indexed on Google.
At the moment there are over 5,000 indexed pages that I don't want on there, and I'm worried this is causing issues with my rankings.
Would a redirect rule work or could someone offer any advice?
-
Gavin, since you have added the noindex to the pages, the best way is to let Google crawl those pages, see the noindex and remove them. The other option is to keep everything as is and request removal of these parameter pages via your Google Webmaster Console.
Option 1: You never know how long it takes.
Option 2: This should happen relatively fast.
I would therefore suggest keeping everything as is and doing a removal request.
-
Right... We think we've been able to get the noindex code into the dodgy pages. The only way we could think of doing it without breaking the user interface was to put this rule into the PHP.
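// AJAX calls made by the site's own scripts send the X-Requested-With header
if (!empty($_SERVER['HTTP_X_REQUESTED_WITH']) && strtolower($_SERVER['HTTP_X_REQUESTED_WITH']) === 'xmlhttprequest') {
    // normal code
} else {
    // direct request (no AJAX header): output a stub page with the noindex tag and a 404 message
    echo '';
    echo '';
    echo '';
    echo '';
    echo '';
    echo '404';
    echo '';
    echo '';
}
It's rendering OK for us on the front end, if anyone would like to test... I'm just hopeful it will work for Google?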
http://www.outdoormegastore.co.uk/cycling/cycling-clothing/protective-clothing.html?ajax=1
One thing I am not sure about is how Google is going to revisit the said pages. I have put various rules in the robots.txt file, as well as in the URL parameter handling in Webmaster Tools, to prevent any future pages from being followed... Would these rules need to be removed?
-
The AJAX URLs are used by the site, though, right (for visitors)? If you 404 them, you may be breaking the functionality and not just impacting Google.
Another problem is that, if these pages are no longer crawlable, and you add a page-level directive (whether it's a 404, 301, canonical, NOINDEX, etc.), Google won't process those new instructions. So, they could get stuck in the index. If that's the case, it may actually be more effective to block the "ajax=" parameter with parameter handling in Google Webmaster Tools (there's a similar option in Bing).
If you know the path is cut and this isn't a recurrent problem, that could be the fastest short-term solution. You do need to monitor, though, as they can re-enter the index later.
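To make the "no longer crawlable" point concrete, here is a rough sketch of how a robots.txt Disallow pattern is matched against a URL. It assumes Google-style wildcards (`*` for any run of characters, a trailing `$` for end of URL). The two rules are hypothetical stand-ins for the ones described earlier in the thread, and the function names are made up for illustration.

```javascript
// Convert a robots.txt path pattern into a RegExp.
// '*' matches any run of characters; a trailing '$' anchors the end of the URL.
function patternToRegExp(pattern) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split('*')
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  // Rules match from the start of the path, so only the front is always anchored.
  return new RegExp('^' + escaped + (anchored ? '$' : ''));
}

// True when any Disallow pattern matches the path plus query string.
function isBlocked(pathAndQuery, disallowPatterns) {
  return disallowPatterns.some((p) => patternToRegExp(p).test(pathAndQuery));
}

// Hypothetical rules: block anything with a query string, and anything with "ajax=".
const disallow = ['/*?', '/*ajax='];

const ajaxPage = '/cycling/cycling-clothing/protective-clothing.html?ajax=1';
const cleanPage = '/cycling/cycling-clothing/protective-clothing.html';

console.log(isBlocked(ajaxPage, disallow));  // true
console.log(isBlocked(cleanPage, disallow)); // false
```

A URL that comes back blocked is never fetched by a compliant crawler, so any noindex, canonical or status code served on it goes unseen.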
-
Gavin, that's a more generic response. In this scenario, unless you can make a 404 happen, it won't work and therefore is not applicable. Noindex and / or the canonical tag are the choices and I would try and get those going if possible.
-
Thanks for all of the replies... My best option seems to be the meta noindex rule, but the pages that are getting indexed are just one long AJAX string with no access to the header area. I hope I have already 'prevented' Google from following the links in the future by adding the rules to robots.txt, but I'm now desperate to clean up (cure) the existing ones.
My next thought would be to put a rule in .htaccess and redirect anything with ajax in the URL to a 404 page?
I'm worried that this may have even worse side effects on rankings, but it's based on this article that Google publishes: https://support.google.com/webmasters/bin/answer.py?hl=en&answer=59819
"To remove a page or image, you must do one of the following:
- Make sure the content is no longer live on the web. Requests for the page must return an HTTP 404 (not found) or 410 status code."
What would your thoughts be on this?
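Since the AJAX responses have no head section to hold a meta tag, one alternative to a 404 is the `X-Robots-Tag: noindex` HTTP response header, which Google treats like a meta robots tag. The sketch below only shows the decision logic. It assumes the unwanted URLs are exactly those carrying an `ajax` query parameter, and `robotsHeadersFor` is a hypothetical helper, not part of any library.

```javascript
// Decide which extra response headers a URL should get.
// Assumption: the unwanted pages are exactly those with an "ajax" query parameter.
function robotsHeadersFor(requestUrl) {
  const url = new URL(requestUrl);
  if (url.searchParams.has('ajax')) {
    // Still serves the normal content to visitors, but asks crawlers not to index it.
    return { 'X-Robots-Tag': 'noindex' };
  }
  return {};
}

console.log(robotsHeadersFor('http://www.outdoormegastore.co.uk/cycling/cycling-clothing/protective-clothing.html?ajax=1'));
console.log(robotsHeadersFor('http://www.outdoormegastore.co.uk/cycling/cycling-clothing/protective-clothing.html'));
```

Unlike a blanket 404, this leaves the filter working for visitors. The crawler still has to be allowed to fetch the URL before it can see the header.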
-
Definitely review George's comment as you need to figure out why they're being crawled. As Andrea said, any solution takes time, I'm sorry to say. Robots.txt is not a good solution for getting pages removed that are already indexed, especially in bulk. It's better at prevention than cure.
META NOINDEX can be effective, or you could rel=canonical these pages to the appropriate non-AJAX URL - not sure exactly how the structure is set up. Those are probably the two fastest and most powerful approaches. Google parameter handling (in Webmaster Tools) is another option, but it's a bit unpredictable whether they honor it and how quickly.
You can only do mass removal if everything is in a folder, if I recall. There's no way to bulk remove unless all of the pages are structurally under one root URL.
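For the rel=canonical route, the target URL can usually be derived mechanically. This is only a sketch: it assumes that dropping the `ajax` parameter gives the page visitors should land on, which may not match how the site is really structured, and `canonicalFor` is a hypothetical helper name.

```javascript
// Derive the non-AJAX URL that a rel=canonical tag could point to.
// Assumption: removing the "ajax" parameter yields the real category page.
function canonicalFor(pageUrl) {
  const url = new URL(pageUrl);
  url.searchParams.delete('ajax');
  return url.toString();
}

const canonicalHref = canonicalFor(
  'http://www.outdoormegastore.co.uk/cycling/cycling-clothing/protective-clothing.html?ajax=1'
);
console.log('<link rel="canonical" href="' + canonicalHref + '">');
```

Any other query parameters are left in place, so a filtered URL would still need its own decision about which version counts as canonical.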
-
I'm not sure if you're aware or not, but I think I know why Google is indexing these pages.
Right now, you are outputting URLs into the source code of your page in the form of a JavaScript function call similar to the following:
I believe this is because your page (and this function call) is programmatically created. Instead of outputting the whole URL to the page, you could output only what needs to be there.
For example:
Then change the signature of the JavaScript function so that it accepts this new input and builds the URL from your inputs:
function initSlider(price, low, high, category, subcategory, product, store, ajax /* , ...any other inputs */) {
    // build the URL from its parts instead of printing it into the page
    var url = 'http://www.outdoormegastore.co.uk/' + category + '/' + subcategory + '/' + product + '.html?_' + store + '&' + ajax;
    // continue...
}
Right now, because that URL is being output to the page, I think Google sees it as a URL it should follow and index. If you build this URL with the function in an external JavaScript file, I don't think it will be indexed.
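A minimal, runnable sketch of that idea follows. The argument values, the `store=1` and `ajax=1` fragments, and the name `buildFilterUrl` are all hypothetical placeholders. Only the URL shape comes from the function above.

```javascript
// Build the filter URL inside the script, so the page source only carries the parts.
// All argument values used below are hypothetical placeholders.
function buildFilterUrl(category, subcategory, product, store, ajax) {
  const path = [category, subcategory, product].map(encodeURIComponent).join('/');
  return 'http://www.outdoormegastore.co.uk/' + path + '.html?_' + store + '&' + ajax;
}

console.log(buildFilterUrl('cycling', 'cycling-clothing', 'protective-clothing', 'store=1', 'ajax=1'));
// prints "http://www.outdoormegastore.co.uk/cycling/cycling-clothing/protective-clothing.html?_store=1&ajax=1"
```

With this shape, the page itself would only contain the short strings passed as arguments, not a complete URL for a crawler to pick up.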
Your developer(s) should know what I'm talking about.
Hope this helps!
-
If they are already indexed, it's going to take time for Google to recrawl, read the tag and get them to fall out, so patience will be key. It's not a quick thing to undo.
If the pages are all in one location, you can add a robots.txt disallow rule, or use Webmaster Tools, to prevent that entire folder from being indexed, but again, it's already done, so you are going to have to wait for all those pages to fall out.
-
Thanks for the quick reply! I'm desperate to get these removed as soon as possible now. I've got Webmaster Tools access, but requesting over 5,000 pages to be removed one by one will take too long. You can't do page removal in bulk, can you?
I'm going to work on the noindex option.
-
OMG, that does not look good. I completely understand. The best way in my opinion would be to add a noindex meta tag on these pages and let Google crawl them. Once they re-index them with the noindex, that should take care of the problem. However, be careful since you want to make sure that noindex tag does not appear on your real pages, just the AJAX ones.
Another option might be to consider the canonical tag, but then technically these pages are not duplicate pages; they just should not exist. Are you verified and using the Google Webmaster Console? If yes, see if you can get some of these pages excluded via the URL removal tool. The best way is to add the noindex tag, in my opinion.