Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Robots.txt & Disallow: /*? Question!
-
Hi,
I have a site where they have:
Disallow: /*?
Problem is we need the following indexed:
?utm_source=google_shopping
What would the best solution be? I have read:
User-agent: *
Allow: ?utm_source=google_shopping
Disallow: /*?Any ideas?
-
User-agent: * Disallow: /cgi-bin/ Disallow: /wp-admin/ Disallow: /archives/ Disallow: /? Allow: /comments/feed/ Disallow: /refer/ Disallow: /index.php Disallow: /wp-content/plugins/ Allow: /wp-admin/admin-ajax.php User-agent: Mediapartners-Google* Allow: / User-agent: Googlebot-Image Allow: /wp-content/uploads/ User-agent: Adsbot-Google Allow: / User-agent: Googlebot-Mobile Allow: / Sitemap: https://site.com/sitemap_index.xml
use this it will help you and your problem will solve
Regards
-
User-agent: * Disallow: /cgi-bin/ Disallow: /wp-admin/ Disallow: /archives/ Disallow: /? Allow: /comments/feed/ Disallow: /refer/ Disallow: /index.php Disallow: /wp-content/plugins/ Allow: /wp-admin/admin-ajax.php User-agent: Mediapartners-Google* Allow: / User-agent: Googlebot-Image Allow: /wp-content/uploads/ User-agent: Adsbot-Google Allow: / User-agent: Googlebot-Mobile Allow: / Sitemap: https://site.com/sitemap_index.xml
this will work ??
Regards
Sajad -
User-agent: * Disallow: /cgi-bin/ Disallow: /wp-admin/ Disallow: /archives/ Disallow: /*?* Allow: /comments/feed/ Disallow: /refer/ Disallow: /index.php Disallow: /wp-content/plugins/ Allow: /wp-admin/admin-ajax.php User-agent: Mediapartners-Google* Allow: / User-agent: Googlebot-Image Allow: /wp-content/uploads/ User-agent: Adsbot-Google Allow: / User-agent: Googlebot-Mobile Allow: / Sitemap: https://site.com/sitemap_index.xml use this it will help you Regards [Saad](https://clicktestworld.com/)
-
Hi Jeff,
Robots.txt tester as per the above link is definitely worth playing with and is the easiest route to achieving what you want.
Another reactive way of managing this is in some cases is to simply see the range of parameters Google has naturally crawled within Search Console.
You can see this in the old search console for now. So login and go to Crawl --> URL Parameters.
If Googlebot has encountered any ?=params it will list them. You'll then have an option how to manage them or exclude them from the index.
It can be a decent way of cleaning up a site with lot's of indexed pages (1,000+), although please be sure to read this documentation before using it: https://support.google.com/webmasters/answer/6080548?hl=en
-
With this kind of thing, it's really better to pick the specific parameters (or parameter combinations) which you'd like to exclude, e.g:
User-agent: *
Disallow: /shop/product/&size=*
Disallow: */shop/product/*?size=*
Disallow: /stockists?product=*
^ I just took the above from a robots.txt file which I have been working on, as these particular pages don't have 'pretty' URLs with unique content on. Very soon now that will change and the blocks will be lifted
If you are really 100% sure that there's only one param which you want to let through, then you'd go with:
User-agent: *
Disallow: /?
Allow: /?utm_source=google_shopping
Allow: /*&utm_source=google_shopping*
(or something pretty similar to that!)
Before you set anything live, get down a list of URLs which represent the blocks (and allows) which you want to achieve. Test it all with the Robots.txt tester (in Search Console) before you set anything live!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Robots.txt blocked internal resources Wordpress
Hi all, We've recently migrated a Wordpress website from staging to live, but the robots.txt was deleted. I've created the following new one: User-agent: *
Intermediate & Advanced SEO | | Mat_C
Allow: /
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Allow: /wp-admin/admin-ajax.php However, in the site audit on SemRush, I now get the mention that a lot of pages have issues with blocked internal resources in robots.txt file. These blocked internal resources are all cached and minified css elements: links, images and scripts. Does this mean that Google won't crawl some parts of these pages with blocked resources correctly and thus won't be able to follow these links and index the images? In other words, is this any cause for concern regarding SEO? Of course I can change the robots.txt again, but will urls like https://example.com/wp-content/cache/minify/df983.js end up in the index? Thanks for your thoughts!2 -
Medical / Health Content Authority - Content Mix Question
Greetings, I have an interesting challenge for you. Well, I suppose "interesting" is an understatement, but here goes. Our company is a women's health site. However, over the years our content mix has grown to nearly 50/50 between unique health / medical content and general lifestyle/DIY/well being content (non-health). Basically, there is a "great divide" between health and non-health content. As you can imagine, this has put a serious damper on gaining ground with our medical / health organic traffic. It's my understanding that Google does not see us as an authority site with regard to medical / health content since we "have two faces" in the eyes of Google. My recommendation is to create a new domain and separate the content entirely so that one domain is focused exclusively on health / medical while the other focuses on general lifestyle/DIY/well being. Because health / medical pages undergo an additional level of scrutiny per Google - YMYL pages - it seems to me the only way to make serious ground in this hyper-competitive vertical is to be laser targeted with our health/medical content. I see no other way. Am I thinking clearly here, or have I totally gone insane? Thanks in advance for any reply. Kind regards, Eric
Intermediate & Advanced SEO | | Eric_Lifescript0 -
Should I disallow all URL query strings/parameters in Robots.txt?
Webmaster Tools correctly identifies the query strings/parameters used in my URLs, but still reports duplicate title tags and meta descriptions for the original URL and the versions with parameters. For example, Webmaster Tools would report duplicates for the following URLs, despite it correctly identifying the "cat_id" and "kw" parameters: /Mulligan-Practitioner-CD-ROM
Intermediate & Advanced SEO | | jmorehouse
/Mulligan-Practitioner-CD-ROM?cat_id=87
/Mulligan-Practitioner-CD-ROM?kw=CROM Additionally, theses pages have self-referential canonical tags, so I would think I'd be covered, but I recently read that another Mozzer saw a great improvement after disallowing all query/parameter URLs, despite Webmaster Tools not reporting any errors. As I see it, I have two options: Manually tell Google that these parameters have no effect on page content via the URL Parameters section in Webmaster Tools (in case Google is unable to automatically detect this, and I am being penalized as a result). Add "Disallow: *?" to hide all query/parameter URLs from Google. My concern here is that most backlinks include the parameters, and in some cases these parameter URLs outrank the original. Any thoughts?0 -
Should I use meta noindex and robots.txt disallow?
Hi, we have an alternate "list view" version of every one of our search results pages The list view has its own URL, indicated by a URL parameter I'm concerned about wasting our crawl budget on all these list view pages, which effectively doubles the amount of pages that need crawling When they were first launched, I had the noindex meta tag be placed on all list view pages, but I'm concerned that they are still being crawled Should I therefore go ahead and also apply a robots.txt disallow on that parameter to ensure that no crawling occurs? Or, will Googlebot/Bingbot also stop crawling that page over time? I assume that noindex still means "crawl"... Thanks 🙂
Intermediate & Advanced SEO | | ntcma0 -
Block in robots.txt instead of using canonical?
When I use a canonical tag for pages that are variations of the same page, it basically means that I don't want Google to index this page. But at the same time, spiders will go ahead and crawl the page. Isn't this a waste of my crawl budget? Wouldn't it be better to just disallow the page in robots.txt and let Google focus on crawling the pages that I do want indexed? In other words, why should I ever use rel=canonical as opposed to simply disallowing in robots.txt?
Intermediate & Advanced SEO | | YairSpolter0 -
Citation/Business Directory Question...
A company I work for has two numbers... one for the std call centre and one for tracking SEO. Now, if local citation/business directory listings have the same address but different numbers, will this affect local/other SEO results? Any help is greatly appreciated! 🙂
Intermediate & Advanced SEO | | geniusenergyltd0 -
Robots.txt: Can you put a /* wildcard in the middle of a URL?
We have noticed that Google is indexing the language/country directory versions of directories we have disallowed in our robots.txt. For example: Disallow: /images/ is blocked just fine However, once you add our /en/uk/ directory in front of it, there are dozens of pages indexed. The question is: Can I put a wildcard in the middle of the string, ex. /en/*/images/, or do I need to list out every single country for every language in the robots file. Anyone know of any workarounds?
Intermediate & Advanced SEO | | IHSwebsite0 -
Using 2 wildcards in the robots.txt file
I have a URL string which I don't want to be indexed. it includes the characters _Q1 ni the middle of the string. So in the robots.txt can I use 2 wildcards in the string to take out all of the URLs with that in it? So something like /_Q1. Will that pickup and block every URL with those characters in the string? Also, this is not directly of the root, but in a secondary directory, so .com/.../_Q1. So do I have to format the robots.txt as //_Q1* as it will be in the second folder or just using /_Q1 will pickup everything no matter what folder it is on? Thanks.
Intermediate & Advanced SEO | | seo1234560