Best use of robots.txt for "garbage" links from Joomla!
-
I recently started out on Seomoz and is trying to make some cleanup according to the campaign report i received.
One of my biggest gripes is the point of "Dublicate Page Content".
Right now im having over 200 pages with dublicate page content.
Now.. This is triggerede because Seomoz have snagged up auto generated links from my site.
My site has a "send to freind" feature, and every time someone wants to send a article or a product to a friend via email a pop-up appears.
Now it seems like the pop-up pages has been snagged by the seomoz spider,however these pages is something i would never want to index in Google.
So i just want to get rid of them.
Now to my question
I guess the best solution is to make a general rule via robots.txt, so that these pages is not indexed and considered by google at all.
But, how do i do this? what should my syntax be?
A lof of the links looks like this, but has different id numbers according to the product that is being send:
http://mywebshop.dk/index.php?option=com_redshop&view=send_friend&pid=39&tmpl=component&Itemid=167
I guess i need a rule that grabs the following and makes google ignore links that contains this:
view=send_friend
-
Hi Henrik,
It can take up to a week for SEOmoz crawlers to process your site, which may be an issue if you recently added the tag. Did you remember to include all user agents in your first line?
User-agent: *
Be sure to test your robots.txt file in Google Webmaster Tools to ensure everything is correct.
Couple of other things you can do:
1. Add a rel="nofollow" on your send to friend links.
2. Add a meta robots "noindex" to the head of the popup html.
3. And/or add a canonical tag to the popup. Since I don't have a working example, I don't know what to canonical it too (whatever content it is duplicating) but this is also an option.
-
I just tried to add
Disallow: /view=send_friend
I removed the last /
however a crawl gave me the dublicate content problem again.
Is my syntax wrong?
-
The second one "Disallow: /*view=send_friend" will prevent googlebot from crawling any url with that string in it. So that should take care of your problem.
-
So my link example would look like this in robots.txt?
Disallow: /index.php?option=com_redshop&view=send_friend&pid=&tmpl=component&Itemid=
Or
Disallow: /view=send_friend/
-
Your right I would disallow via robots.txt & a wildcard (*) wherever a unique item id # could be generated.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Using http: shorthand inside canonical tag ("//" instead of "http:") can cause harm?
HI, I am planning to launch a new site, and shortly after to move to HTTPS. to save the need to change over 5,000 canonical tags in pages the webmaster suggested we implement inside the rel canonical "//" instead of the absolute path, would that do any damage or be a problem? oranges-south-dakota" />
Technical SEO | | Kung_fu_Panda0 -
Google is indexing blocked content in robots.txt
Hi,Google is indexing some URLs that i don't want to be indexed and also is indexing the same URLs with https. This URLs are blocked in the file robots.txt.I've tried to block this URLs through Google WebmasterTools but Google doesn't let me do it because this URL are httpsThe file robots.txt is correct so, what can i do to avoid this content to be indexed?
Technical SEO | | elisainteractive0 -
Block Domain in robots.txt
Hi. We had some URLs that were indexed in Google from a www1-subdomain. We have now disabled the URLs (returning a 404 - for other reasons we cannot do a redirect from www1 to www) and blocked via robots.txt. But the amount of indexed pages keeps increasing (for 2 weeks now). Unfortunately, I cannot install Webmaster Tools for this subdomain to tell Google to back off... Any ideas why this could be and whether it's normal? I can send you more domain infos by personal message if you want to have a look at it.
Technical SEO | | zeepartner0 -
Implementation of rel="next" & rel="prev"
Hi All, I'm looking to implement rel="next" & rel="prev", so I've been looking for examples. I looked at the source code for the MOZ.com forum, if anyone one is going to do it properly MOZ are. I noticed that the rel="next" & rel="prev" tags have been implemented in the a href tags that link to the previous and next pages rather than in the head. I'm assuming this is fine with Google but in their documentation they state to put the tags in the . Does it matter? Neil.
Technical SEO | | NDAY0 -
Using the word "FREE" in domain name
Hi, This may seem like a simple question but a new client of mine wishes to use a domain name with the word "free" in it. The website will offer free activity vouchers. I couldn't see this being a problem as there a lot of websites that do this although he was told it may present a problem with the search engines thinking the site was spammy. It won't be and will be offering information and vouchers on local sporting activities. I was wondering if anybody could clarify this please so I can give him a more definitive answer to his question. Thanks for your help.
Technical SEO | | malinkymedia0 -
Backlinks go to "example.com" our homepage is "example.com/default.html" am I losing internal link power?
Hey everyone! Thanks again for everybodies contributions to my questions over the last few months. As the title states, our homepage is at "example.com/default.html" but everybody that backlinks to us (as expected) to "example.com" does that mean that I am probably losing a lot of the power of my links??
Technical SEO | | TylerAbernethy0 -
Blocked by meta-robots but there is no robots file
OK, I'm a little frustred here. I've waited a week for the next weekly index to take place after changing the privacy setting in a wordpress website so Google can index, but I still got the same problem. Blocked by meta-robots, no index, no follow. But I do not see a robot file anywhere and the privacy setting in this Wordpress site is set to allow search engines to index this site. Website is www.marketalert.ca What am I missing here? Why can't I index the rest of the website and is there a faster way to test this rather than wait another week just to find out it didn't work again?
Technical SEO | | Twinbytes0 -
Robots.txt
Hello Everyone, The problem I'm having is not knowing where to have the robots.txt file on our server. We have our main domain (company.com) with a robots.txt file in the root of the site, but we also have our blog (company.com/blog) where were trying to disallow certain directories from being crawled for SEO purposes... Would having the blog in the sub-directory still need its own robots.txt? or can I reference the directories i don't want crawled within the blog using the root robots.txt file? Thanks for your insight on this matter.
Technical SEO | | BailHotline0