How to fix a Google index filled with redundant parameters
-
Hi All
This follows on from a previous question (http://moz.com/community/q/how-to-fix-google-index-after-fixing-site-infected-with-malware) that, on further investigation, has become a much broader problem. I think this is an issue that may plague many sites following CMS upgrades.
First, a little history. A new customer wanted to improve their site ranking and SEO. We discovered the site was running an old version of Joomla and had been hacked. URLs such as http://domain.com/index.php?vc=427&Buy_Pinnacle_Studio_14_Ultimate redirected users to other sites, and the site was ranking for "buy adobe" or "buy microsoft". There was no notification in Webmaster Tools that the site had been hacked.
So an upgrade to a later version of Joomla was required, and we implemented SEF URLs at the same time. This fixed the hacking problem and gave us SEF URLs. We also fixed a lot of duplicate content and added new titles and descriptions.
The problem is that after a couple of months things aren't really improving. The site is still ranking for Adobe and Microsoft and a lot of other rubbish, and URLs like http://domain.com/index.php?vc=427&Buy_Pinnacle_Studio_14_Ultimate are still sending visitors, but to the home page, as are a lot of the old redundant URLs with parameters in them. I think it is default behavior for a lot of CMSs to ignore parameters they don't recognise, so http://domain.com/index.php?vc=427&Buy_Pinnacle_Studio_14_Ultimate displays the home page and gives a 200 response code.
My theory is that Google isn't removing these pages from the index because it's getting a 200 response code from the old URLs, and that it is possibly penalizing the site for duplicate content (which doesn't show up in Moz because there aren't any links on the site to these URLs). The index in Webmaster Tools shows over 1000 URLs indexed when there are only around 300 actual URLs. It also shows thousands of URLs for each parameter type, most of which aren't used.
So my question is how to fix this. I don't think 404s or similar are the answer, because there are so many URLs and trying to find each combination of parameters would be impossible. Webmaster Tools advises not to make changes to parameters, but even so, I don't think resetting or editing them individually is going to remove them. It would only change how Google indexes them (if anyone knows different, please let me know).
Appreciate any assistance and also any comments or discussion on this matter.
Regards, Ian
-
Thanks again Alan.
I've checked the site with Screaming Frog and it doesn't return any URLs with parameters, so at this stage I might be OK. I am getting a message in Webmaster Tools saying "severe health issues", but it doesn't appear to be affecting the URLs I want to keep. I'll likely remove the entry once things have cleared up some more.
Thanks Jeff
At the moment I'm stuck with Zeus web server (insert expletives here), so there's no .htaccess file, or I'd be in a better position. After messing around with it and its very limited documentation, I can only get the site operating with index.php in the URL, but with SEF URLs for the remainder of it. I'm investigating migration to an Apache server, so that might make it easier.
Regards
Ian
-
The ability to remove the index.php is built into the stock Joomla .htaccess file.
In the Joomla backend, go to Global Configuration / Site tab / SEO Settings and enable "Use URL rewriting".
-
I can see it fixed your problem, but it's an ugly fix. You may need to use parameters in the future, and you may already be using them without being aware of it.
-
OK, I might have a solution that would at least work for my situation.
Since implementing SEF URLs on the site, I have no real need for any URLs with parameters. Adding the following to robots.txt should prevent any indexing of old pages or pages with parameters.
Disallow: /index.php?*
I tested it in Webmaster Tools with some of the offending URLs and it seems to work. I'll wait until the next indexing and post back or mark it as answered.
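For anyone wanting to check a wildcard rule like this offline, here is a rough sketch of a matcher. It assumes the Google-style wildcard semantics ('*' matches any run of characters, a trailing '$' anchors the end of the URL) and ignores Allow rules and user-agent groups entirely. The function names `rule_to_regex` and `is_disallowed` are made up for this example and are not part of any tool.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule):
    """Translate one Disallow path rule into a compiled regex.

    Simplified assumption: '*' matches any run of characters and a
    trailing '$' anchors the end; everything else is matched literally.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(url, rules):
    """Return True if the URL's path plus query matches any rule from its start."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return any(rule_to_regex(rule).match(target) for rule in rules)


rules = ["/index.php?*"]
for url in (
    "http://domain.com/index.php?vc=427&Buy_Pinnacle_Studio_14_Ultimate",
    "http://domain.com/index.php",
    "http://domain.com/?type_anything_here",
):
    print(url, is_disallowed(url, rules))
```

Under these assumptions the rule catches the index.php URLs that carry a query string and leaves plain /index.php alone, but it does not match URLs of the form http://domain.com/?something, so those would need a rule of their own. Keep in mind that a Disallow rule stops crawling of the matching URLs; it is not by itself a removal request.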
-
Thanks for your input Alan
Therein lies my problem. The URLs don't exist but give a 200 response.
http://domain.com/index.php?vc=427&Buy_Pinnacle_Studio_14_Ultimate is the same as
http://domain.com/index.php which is the same as
http://domain.com/?type_anything_here, and it still gives a 200 response. Joomla seems to just ignore parameters from non-existent pages after the ?. I found a lot of people are having similar problems here: http://forum.joomla.org/viewtopic.php?f=618&t=699954.
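A quick way to confirm this behaviour across a list of old URLs is a small script along these lines. It is only a sketch: `http_status` and `soft_200s` are names invented for the example, the URLs are the placeholder ones from this thread, and connection errors are not handled.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the status code of the final response for a GET request.

    urlopen follows redirects, so this is the status after any redirect.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is what we want.
        return err.code


def soft_200s(junk_urls, fetch_status=http_status):
    """Return the junk URLs that answer 200 even though no such page exists."""
    return [url for url in junk_urls if fetch_status(url) == 200]


if __name__ == "__main__":
    # Placeholder URLs from the thread; replace with a real list of old URLs.
    candidates = [
        "http://domain.com/index.php?vc=427&Buy_Pinnacle_Studio_14_Ultimate",
        "http://domain.com/?type_anything_here",
    ]
    for url in soft_200s(candidates):
        print("still returns 200:", url)
```

The status lookup is passed in as a parameter so the filtering logic can be tried with a stub before pointing it at a live site.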
Once they are in Google's index, I can't see a way of getting rid of thousands of redundant entries. I have the added problem of the site being hosted on a Zeus Web Server, which isn't as well documented as Apache.
I'm currently looking into wildcards in robots.txt. It will be a slow process to get rid of them all, but it might finally help me clean up the index.
Ian
-
If the site is returning 200s, then that is where the problem lies, and you need to find out why.
I can't see any other fix. Removing the URLs is only a temporary fix; you must make them return 404s.
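One way to get 404s without listing every junk combination is to whitelist the parameter names the site really uses and reject everything else. The sketch below shows only the decision logic, in Python rather than in Joomla's own PHP; the whitelist contents and the name `status_for_query` are assumptions for illustration, and the real check would have to live wherever the CMS or web server handles the request.

```python
from urllib.parse import parse_qsl

# Hypothetical whitelist: the parameter names this particular site needs.
ALLOWED_PARAMS = {"option", "view", "id", "Itemid"}


def status_for_query(query_string, allowed=ALLOWED_PARAMS):
    """Return 404 if any parameter name is outside the whitelist, else 200."""
    # keep_blank_values=True so bare names such as "Buy_Something" are counted.
    names = {name for name, _ in parse_qsl(query_string, keep_blank_values=True)}
    return 200 if names <= set(allowed) else 404


print(status_for_query("vc=427&Buy_Pinnacle_Studio_14_Ultimate"))
print(status_for_query("option=com_content&view=article&id=5"))
print(status_for_query(""))
```

With a whitelist, any parameters the site does turn out to need later can be added in one place, while unknown ones stop producing a 200 copy of the home page.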