Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Robots.txt & Disallow: /*? Question!
-
Hi,
I have a site where they have:
Disallow: /*?
Problem is we need the following indexed:
?utm_source=google_shopping
What would the best solution be? I have read:
User-agent: *
Allow: ?utm_source=google_shopping
Disallow: /*?Any ideas?
-
User-agent: * Disallow: /cgi-bin/ Disallow: /wp-admin/ Disallow: /archives/ Disallow: /? Allow: /comments/feed/ Disallow: /refer/ Disallow: /index.php Disallow: /wp-content/plugins/ Allow: /wp-admin/admin-ajax.php User-agent: Mediapartners-Google* Allow: / User-agent: Googlebot-Image Allow: /wp-content/uploads/ User-agent: Adsbot-Google Allow: / User-agent: Googlebot-Mobile Allow: / Sitemap: https://site.com/sitemap_index.xml
use this it will help you and your problem will solve
Regards
-
User-agent: * Disallow: /cgi-bin/ Disallow: /wp-admin/ Disallow: /archives/ Disallow: /? Allow: /comments/feed/ Disallow: /refer/ Disallow: /index.php Disallow: /wp-content/plugins/ Allow: /wp-admin/admin-ajax.php User-agent: Mediapartners-Google* Allow: / User-agent: Googlebot-Image Allow: /wp-content/uploads/ User-agent: Adsbot-Google Allow: / User-agent: Googlebot-Mobile Allow: / Sitemap: https://site.com/sitemap_index.xml
this will work ??
Regards
Sajad -
User-agent: * Disallow: /cgi-bin/ Disallow: /wp-admin/ Disallow: /archives/ Disallow: /*?* Allow: /comments/feed/ Disallow: /refer/ Disallow: /index.php Disallow: /wp-content/plugins/ Allow: /wp-admin/admin-ajax.php User-agent: Mediapartners-Google* Allow: / User-agent: Googlebot-Image Allow: /wp-content/uploads/ User-agent: Adsbot-Google Allow: / User-agent: Googlebot-Mobile Allow: / Sitemap: https://site.com/sitemap_index.xml use this it will help you Regards [Saad](https://clicktestworld.com/)
-
Hi Jeff,
Robots.txt tester as per the above link is definitely worth playing with and is the easiest route to achieving what you want.
Another reactive way of managing this is in some cases is to simply see the range of parameters Google has naturally crawled within Search Console.
You can see this in the old search console for now. So login and go to Crawl --> URL Parameters.
If Googlebot has encountered any ?=params it will list them. You'll then have an option how to manage them or exclude them from the index.
It can be a decent way of cleaning up a site with lot's of indexed pages (1,000+), although please be sure to read this documentation before using it: https://support.google.com/webmasters/answer/6080548?hl=en
-
With this kind of thing, it's really better to pick the specific parameters (or parameter combinations) which you'd like to exclude, e.g:
User-agent: *
Disallow: /shop/product/&size=*
Disallow: */shop/product/*?size=*
Disallow: /stockists?product=*
^ I just took the above from a robots.txt file which I have been working on, as these particular pages don't have 'pretty' URLs with unique content on. Very soon now that will change and the blocks will be lifted
If you are really 100% sure that there's only one param which you want to let through, then you'd go with:
User-agent: *
Disallow: /?
Allow: /?utm_source=google_shopping
Allow: /*&utm_source=google_shopping*
(or something pretty similar to that!)
Before you set anything live, get down a list of URLs which represent the blocks (and allows) which you want to achieve. Test it all with the Robots.txt tester (in Search Console) before you set anything live!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Block session id URLs with robots.txt
Hi, I would like to block all URLs with the parameter '?filter=' from being crawled by including them in the robots.txt. Which directive should I use: User-agent: *
Intermediate & Advanced SEO | | Mat_C
Disallow: ?filter= or User-agent: *
Disallow: /?filter= In other words, is the forward slash in the beginning of the disallow directive necessary? Thanks!1 -
Category Pages & Content
Hi Does anyone have any great examples of an ecommerce site which has great content on category pages or product listing pages? Thanks!
Intermediate & Advanced SEO | | BeckyKey1 -
Avoiding Duplicate Content with Used Car Listings Database: Robots.txt vs Noindex vs Hash URLs (Help!)
Hi Guys, We have developed a plugin that allows us to display used vehicle listings from a centralized, third-party database. The functionality works similar to autotrader.com or cargurus.com, and there are two primary components: 1. Vehicle Listings Pages: this is the page where the user can use various filters to narrow the vehicle listings to find the vehicle they want.
Intermediate & Advanced SEO | | browndoginteractive
2. Vehicle Details Pages: this is the page where the user actually views the details about said vehicle. It is served up via Ajax, in a dialog box on the Vehicle Listings Pages. Example functionality: http://screencast.com/t/kArKm4tBo The Vehicle Listings pages (#1), we do want indexed and to rank. These pages have additional content besides the vehicle listings themselves, and those results are randomized or sliced/diced in different and unique ways. They're also updated twice per day. We do not want to index #2, the Vehicle Details pages, as these pages appear and disappear all of the time, based on dealer inventory, and don't have much value in the SERPs. Additionally, other sites such as autotrader.com, Yahoo Autos, and others draw from this same database, so we're worried about duplicate content. For instance, entering a snippet of dealer-provided content for one specific listing that Google indexed yielded 8,200+ results: Example Google query. We did not originally think that Google would even be able to index these pages, as they are served up via Ajax. However, it seems we were wrong, as Google has already begun indexing them. Not only is duplicate content an issue, but these pages are not meant for visitors to navigate to directly! If a user were to navigate to the url directly, from the SERPs, they would see a page that isn't styled right. Now we have to determine the right solution to keep these pages out of the index: robots.txt, noindex meta tags, or hash (#) internal links. Robots.txt Advantages: Super easy to implement Conserves crawl budget for large sites Ensures crawler doesn't get stuck. After all, if our website only has 500 pages that we really want indexed and ranked, and vehicle details pages constitute another 1,000,000,000 pages, it doesn't seem to make sense to make Googlebot crawl all of those pages. Robots.txt Disadvantages: Doesn't prevent pages from being indexed, as we've seen, probably because there are internal links to these pages. We could nofollow these internal links, thereby minimizing indexation, but this would lead to each 10-25 noindex internal links on each Vehicle Listings page (will Google think we're pagerank sculpting?) Noindex Advantages: Does prevent vehicle details pages from being indexed Allows ALL pages to be crawled (advantage?) Noindex Disadvantages: Difficult to implement (vehicle details pages are served using ajax, so they have no tag. Solution would have to involve X-Robots-Tag HTTP header and Apache, sending a noindex tag based on querystring variables, similar to this stackoverflow solution. This means the plugin functionality is no longer self-contained, and some hosts may not allow these types of Apache rewrites (as I understand it) Forces (or rather allows) Googlebot to crawl hundreds of thousands of noindex pages. I say "force" because of the crawl budget required. Crawler could get stuck/lost in so many pages, and my not like crawling a site with 1,000,000,000 pages, 99.9% of which are noindexed. Cannot be used in conjunction with robots.txt. After all, crawler never reads noindex meta tag if blocked by robots.txt Hash (#) URL Advantages: By using for links on Vehicle Listing pages to Vehicle Details pages (such as "Contact Seller" buttons), coupled with Javascript, crawler won't be able to follow/crawl these links. Best of both worlds: crawl budget isn't overtaxed by thousands of noindex pages, and internal links used to index robots.txt-disallowed pages are gone. Accomplishes same thing as "nofollowing" these links, but without looking like pagerank sculpting (?) Does not require complex Apache stuff Hash (#) URL Disdvantages: Is Google suspicious of sites with (some) internal links structured like this, since they can't crawl/follow them? Initially, we implemented robots.txt--the "sledgehammer solution." We figured that we'd have a happier crawler this way, as it wouldn't have to crawl zillions of partially duplicate vehicle details pages, and we wanted it to be like these pages didn't even exist. However, Google seems to be indexing many of these pages anyway, probably based on internal links pointing to them. We could nofollow the links pointing to these pages, but we don't want it to look like we're pagerank sculpting or something like that. If we implement noindex on these pages (and doing so is a difficult task itself), then we will be certain these pages aren't indexed. However, to do so we will have to remove the robots.txt disallowal, in order to let the crawler read the noindex tag on these pages. Intuitively, it doesn't make sense to me to make googlebot crawl zillions of vehicle details pages, all of which are noindexed, and it could easily get stuck/lost/etc. It seems like a waste of resources, and in some shadowy way bad for SEO. My developers are pushing for the third solution: using the hash URLs. This works on all hosts and keeps all functionality in the plugin self-contained (unlike noindex), and conserves crawl budget while keeping vehicle details page out of the index (unlike robots.txt). But I don't want Google to slap us 6-12 months from now because it doesn't like links like these (). Any thoughts or advice you guys have would be hugely appreciated, as I've been going in circles, circles, circles on this for a couple of days now. Also, I can provide a test site URL if you'd like to see the functionality in action.0 -
Meta NoIndex tag and Robots Disallow
Hi all, I hope you can spend some time to answer my first of a few questions 🙂 We are running a Magento site - layered/faceted navigation nightmare has created thousands of duplicate URLS! Anyway, during my process to tackle the issue, I disallowed in Robots.txt anything in the querystring that was not a p (allowed this for pagination). After checking some pages in Google, I did a site:www.mydomain.com/specificpage.html and a few duplicates came up along with the original with
Intermediate & Advanced SEO | | bjs2010
"There is no information about this page because it is blocked by robots.txt" So I had added in Meta Noindex, follow on all these duplicates also but I guess it wasnt being read because of Robots.txt. So coming to my question. Did robots.txt block access to these pages? If so, were these already in the index and after disallowing it with robots, Googlebot could not read Meta No index? Does Meta Noindex Follow on pages actually help Googlebot decide to remove these pages from index? I thought Robots would stop and prevent indexation? But I've read this:
"Noindex is a funny thing, it actually doesn’t mean “You can’t index this”, it means “You can’t show this in search results”. Robots.txt disallow means “You can’t index this” but it doesn’t mean “You can’t show it in the search results”. I'm a bit confused about how to use these in both preventing duplicate content in the first place and then helping to address dupe content once it's already in the index. Thanks! B0 -
301 redirect with /? in URL
For a Wordpress site that has the ending / in the URL with a ? after it... how can you do a 301 redirect to strip off anything after the / For example how to take this URL domain.com/article-name/?utm_source=feedburner and 301 to this URL domain.com/article-name/ Thank you for the help
Intermediate & Advanced SEO | | COEDMediaGroup0 -
Robots.txt: Can you put a /* wildcard in the middle of a URL?
We have noticed that Google is indexing the language/country directory versions of directories we have disallowed in our robots.txt. For example: Disallow: /images/ is blocked just fine However, once you add our /en/uk/ directory in front of it, there are dozens of pages indexed. The question is: Can I put a wildcard in the middle of the string, ex. /en/*/images/, or do I need to list out every single country for every language in the robots file. Anyone know of any workarounds?
Intermediate & Advanced SEO | | IHSwebsite0 -
Multiple stores & domains vs. One unified store (SEO pros / cons for E-Commerce)
Our company runs a number of individual online shops, specialised in particular products but all in the same genre of goods overall, with a specific and relevant domain name for each shop. At the moment the sites are separate, and not interlinked, i.e. Completely separate brands. An analogy could be something like clothing accessories (we are not in the clothing business): scarves.com, and silkties.com (our field is more niche than this) We are about to launch a related site, (e.g. handbags.com), in the same field again but without precisely overlapping products. We will produce this site on a newer, more flexible e-commerce platform, so now is a good time to consider whether we want to place all our sites together with one e-commerce system on the backend. Essentially, we need to know what the pros and cons would be of the various options facing us and how the SEO ranking is affected by the three possibilities. Option 1: continue with separate sites each with its own domains. Option 2: have multiple sites, each on their own domain, but on the same ecommerce system and visible linked together for the customer (with unified checkout) – on the top of each site could be a menu bar linking to each site: [Scarves.com] – [SilkTies.com] – [Handbags.com] The main question here is whether the multiple domains are mutually beneficial, particularly considerding how close to target keywords the individual domains are. If mutually benefitial, how does it compare to option 3: Option 3: Having recently acquired a domain name (e.g. accessories.com) which would cover the whole category together, we are presented with a third option: making one site selling all of these products in different categories. Our main concern here would be losing the ability to specifically target marketing, and losing the benefit of the domains with the key words in for what people are more likely to be searching for (e.g. 'silk tie') rather than 'accessories.' Is it worth taking the hit on losing these specific targeted domain names for the advantage of increased combined inbound links?
Intermediate & Advanced SEO | | Colage0 -
Blocking Dynamic URLs with Robots.txt
Background: My e-commerce site uses a lot of layered navigation and sorting links. While this is great for users, it ends up in a lot of URL variations of the same page being crawled by Google. For example, a standard category page: www.mysite.com/widgets.html ...which uses a "Price" layered navigation sidebar to filter products based on price also produces the following URLs which link to the same page: http://www.mysite.com/widgets.html?price=1%2C250 http://www.mysite.com/widgets.html?price=2%2C250 http://www.mysite.com/widgets.html?price=3%2C250 As there are literally thousands of these URL variations being indexed, so I'd like to use Robots.txt to disallow these variations. Question: Is this a wise thing to do? Or does Google take into account layered navigation links by default, and I don't need to worry. To implement, I was going to do the following in Robots.txt: User-agent: * Disallow: /*? Disallow: /*= ....which would prevent any dynamic URL with a '?" or '=' from being indexed. Is there a better way to do this, or is this a good solution? Thank you!
Intermediate & Advanced SEO | | AndrewY1