Wildcarding Robots.txt for Particular Word in URL
-
Hey All,
I know this isn't a standard robots.txt question. I'm aware of how to block or wildcard certain folders, but I'm wondering whether it's possible to block all URLs with a certain word in them.
We have a client that was hacked a year ago, and now they want us to help remove some of the pages that were autogenerated with the word "viagra" in them. I saw this article, https://builtvisible.com/wildcards-in-robots-txt/, and tried implementing it. It seems I've been able to remove some of the URLs (although I can't confirm that until I do a full pull of the SERPs on the domain). However, when I test certain URLs inside of WMT it still says they are allowed, which makes me think it's not working fully, or not working at all.
In this case, these are the lines I've added to the robots.txt:
Disallow: /*&viagra
Disallow: /*&Viagra
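For reference, Google's documentation describes * in robots.txt as matching any sequence of characters, and matching is case-sensitive, so as written my lines only catch the word when it directly follows an ampersand, which may explain the WMT results. A broader sketch (untested on my end) would be:

    User-agent: *
    # "*" matches any sequence of characters, so these lines match
    # "viagra" anywhere in the path or query string:
    Disallow: /*viagra
    Disallow: /*Viagra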
I know I have the option of individually requesting URLs to be removed from the index, but I want to see if anybody has ever had success wildcarding URLs with a certain word in their robots.txt. The individual URL route could be very tedious.
Thanks!
Jon
-
Hey Paul,
Great answer; for some reason it totally slipped my mind that robots.txt is a crawling directive and not an indexing one. Yes, the pages return a 404 in the headers. I've grabbed a copy of the complete SERPs and will now manually submit removal requests for them.
Thanks!
Jon
-
Thanks for the endorsement, Christy! Funny, I only just now saw Rand's recent WBF related to this topic, but I'm pleased to see my answer lines up exactly with his info.
P.
-
You need to be aware, Jonathan, that there is absolutely nothing about a robots.txt disallow that will help remove a URL from the search engine indexes. Robots.txt is a crawling directive, NOT an indexing directive. In fact, in most cases, blocking URLs in robots.txt will actually cause them to remain in the index even longer.
I'm assuming you have cleaned up the site so the actual spam URLs no longer resolve; those URLs should now result in a 404 error page, and you must confirm they are actually returning the correct 404 code in the headers. As long as that's the case, it's a matter of waiting while the search engines crawl the spam URLs often enough to recognise they are really gone and drop them from the index. The problem with adding them to robots.txt is that it tells the search engines NOT to crawl those URLs, so they are unlikely to discover the 404s, and hence may remain in the index even longer.
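If it helps, here's a minimal sketch of how you could check those response codes in bulk. It assumes Python is available, and the URLs shown are hypothetical placeholders for the real spam URLs pulled from the SERPs:

    # Confirm each spam URL actually returns a 404 in the response headers.
    import urllib.error
    import urllib.request

    spam_urls = [
        "https://www.example.com/?p=123&viagra=cheap",  # placeholder
        "https://www.example.com/buy-viagra-online/",   # placeholder
    ]

    for url in spam_urls:
        try:
            with urllib.request.urlopen(url) as resp:
                # Any 2xx/3xx status here means the spam page still resolves
                print(url, resp.status)
        except urllib.error.HTTPError as err:
            # A properly cleaned-up page should report 404 (or 410)
            print(url, err.code)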
Unfortunately, you can't use a noindex tag on the offending pages, because the pages should no longer exist on the site. I don't think even a careful implementation of an X-Robots-Tag noindex directive in .htaccess would work, because the URLs should be returning a 404.
Make certain the problem URLs return a clean 404, use the Google Search Console Remove URLs tool for as many of them as you can (for example, you can request removal of entire directories, if the spam happened to be built that way), and then be patient for the rest. But do NOT block them in robots.txt - you'll just prolong the agony and waste your time.
Hope that all makes sense?
Paul
-
Hi Jon,
Why not just: Disallow: /viagra
-
Jon,
I have never done it with robots.txt, but one easy way I think you could do it would be at the page level: add a noindex, nofollow meta tag to the page itself.
You could generate that automatically too, and have it fire depending on the URL by using a substring search on the URL, as sketched below. That would catch them all for sure.
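As a rough sketch of that idea (the function name and sample URL here are illustrative, not from any particular framework), a substring check like this Python could decide whether to emit the tag:

    # Emit a noindex/nofollow robots meta tag whenever the requested URL
    # contains the offending word, so autogenerated spam pages self-exclude.
    def robots_meta_for(url: str) -> str:
        """Return a robots meta tag for spam URLs, or an empty string."""
        if "viagra" in url.lower():  # case-insensitive substring search
            return '<meta name="robots" content="noindex, nofollow">'
        return ""

    # Example: inject the result into the page's <head> from the template.
    print(robots_meta_for("https://example.com/?p=1&Viagra=cheap"))
    # -> <meta name="robots" content="noindex, nofollow">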