Using Robots.txt
-
I want to block pages from being accessed or indexed by Googlebot. Can you tell me whether Googlebot will NOT access any URL that begins with my domain name, followed by a question mark, followed by any string (sample URL: http://mydomain.com/?example) if I use the robots.txt below?
User-agent: Googlebot
Disallow: /?
-
Not sure if that would work, but you can test it by changing your robots.txt and running a test in GWT > Health > Blocked URLs.
You might also be interested in blocking specific URL parameters (e.g. for /?sort=name&order=asc you can block the sort and order parameters) from within GWT (Configuration > URL Parameters).
Learn more about parameters - https://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687
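For what it's worth, robots.txt Disallow rules are prefix matches against everything after the hostname, so a quick sketch of how that rule should behave (sample URLs assumed for illustration):

User-agent: Googlebot
Disallow: /?
# Blocked (path plus query string starts with "/?"):
#   http://mydomain.com/?example
#   http://mydomain.com/?sort=name&order=asc
# Not blocked (prefix does not match):
#   http://mydomain.com/
#   http://mydomain.com/page?sort=name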
Related Questions
-
Rank regional homepages using canonicals and hreflangs
Here's a situation I've been puzzling over for some time. The situation:
Technical SEO | dmduco
Please consider an international website targeting 3 regions (the real site has more regions, but I simplified the case for this question; see screenshot1.png). There is no default language, and the content for each regional version is meant for that region only. The website.eu page is dynamic: when there is no region cookie, the page is identical to website.eu/nl/ (because the Netherlands is the most important region); when there is a region cookie (set by a modal), there is a 302 redirect to the corresponding regional homepage. What we want:
We want regional Google to index the correct regional homepages (e.g. website.eu/nl/ on google.nl) instead of website.eu.
Why? Because visitors surfing to website.eu sometimes ignore the region modal and therefore browse the wrong version.
For this, I set up canonicals and hreflangs as described below (screenshot2.png). The problem:
It has been 40 days since the above hreflangs and canonicals were set up, but Google is still ranking website.eu instead of the regional homepages.
Search Console's report for website.eu: screenshot3.png. Any ideas why Google doesn't respect our canonical? Maybe I'm overlooking something in this setup (the combination of hreflangs and canonicals might be confusing)? Should I remove the hreflangs on the dynamic page, because there is no self-referencing hreflang? Or maybe website.eu has gathered a lot of backlinks over the years, whereas the regional homepages have far fewer, which might be why Google chooses to ignore the canonical signals? Or is it a matter of time and I just need to wait longer? Note: I'm aware the language subfolders (e.g. /be_nl) are not in line with Google's recommendations, but I've seen similar setups (like adobe.com and apple.com) where the regional homepage shows up fine. Any help appreciated!
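For readers following the setup, a minimal sketch of annotations of this kind on one regional homepage; the URLs are assumed from the question's examples, and the real hreflang set depends on the regions involved:

<!-- on website.eu/nl/; each regional homepage mirrors the pattern -->
<link rel="canonical" href="https://website.eu/nl/" />
<link rel="alternate" hreflang="nl-nl" href="https://website.eu/nl/" />
<link rel="alternate" hreflang="nl-be" href="https://website.eu/be_nl/" />

-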
Is it good practice to use hreflang on pages that have canonicals?
I have a page in English that has both English & Spanish translations on it. It is pulled in from a page generated on another site, and I am not able to adjust the CSS to display only one language. Until I can fix this, I have made the English page the canonical for both. Do I still want to use hreflang for the English & Spanish pages? And what if I do not have a Spanish page at all? I assume (from what I've read) that I should not have an hreflang on the English page. Is this correct? Thank you in advance.
Technical SEO | RoxBrock
-
Robots file set up
The robots file looks like it has been set up in a very messy way.
Technical SEO | mcwork
I understand that # comments out a line; does this mean the sitemap would not be picked up?
Disallow: /js/: should this be allowed instead, e.g. with Allow: /*.js$?
Disallow: /media/wysiwyg/ seems to be causing alerts in Webmaster Tools, as Google cannot access the images within.
Can anyone help me clean this up please?

#Sitemap: https://examplesite.com/sitemap.xml

# Crawlers Setup
User-agent: *
Crawl-delay: 10

# Allowable Index
# Mind that Allow is not an official standard
Allow: /index.php/blog/
Allow: /catalog/seo_sitemap/category/
Allow: /catalogsearch/result/
Allow: /media/catalog/

# Directories
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /errors/
Disallow: /includes/
Disallow: /js/
Disallow: /lib/
Disallow: /magento/
Disallow: /media/
Disallow: /media/captcha/
Disallow: /media/catalog/
#Disallow: /media/css/
#Disallow: /media/css_secure/
Disallow: /media/customer/
Disallow: /media/dhl/
Disallow: /media/downloadable/
Disallow: /media/import/
#Disallow: /media/js/
Disallow: /media/pdf/
Disallow: /media/sales/
Disallow: /media/tmp/
Disallow: /media/wysiwyg/
Disallow: /media/xmlconnect/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /scripts/
Disallow: /shell/
#Disallow: /skin/
Disallow: /stats/
Disallow: /var/

# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalog/product/gallery/
Disallow: */catalog/product/upload/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/

# Files
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt
Disallow: /get.php # Magento 1.5+

# Paths (no clean URLs)
#Disallow: /*.js$
#Disallow: /*.css$
Disallow: /*.php$
Disallow: /*?SID=
Disallow: /rss*
Disallow: /*PHPSESSID
Disallow: /:
Disallow: /:*

User-agent: Fatbot
Disallow: /

User-agent: TwengaBot-2.0
Disallow: /
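Not an answer to every line, but a sketch of the kind of cleanup being asked about, assuming the sitemap URL from the pasted file and that crawlers should be able to render pages:

# Uncomment the sitemap line so it is actually picked up:
Sitemap: https://examplesite.com/sitemap.xml

User-agent: *
# Let Googlebot fetch JS and CSS rather than blocking /js/:
Allow: /*.js$
Allow: /*.css$
# Dropping this line clears the Webmaster Tools alerts about inaccessible images:
# Disallow: /media/wysiwyg/

-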
Should I use canonicals? Best practice?
Hi there, I've been working on a pretty dated site. The product pages have tabs that separate the product information, e.g. a tab for specifications, a tab for system essentials, and an overview tab that is actually just a copy of the product page. Each tab is actually a link to a completely separate page, so product/main-page is split into product/main-page/specs, product/main-page/resources, etc. Would canonicals be appropriate in this situation? The information isn't necessarily duplicate (except for the overview tabs), but with each tab as a separate page, I would imagine that's diluting the value of the main page. The information all belongs to the main page, so shouldn't each tab be saying "I'm a version of the main page"?
Technical SEO | anneoaks
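If canonicals were used here, each tab page would carry a single link element pointing at the main page; a sketch using the paths from the question (domain assumed for illustration):

<!-- on product/main-page/specs, product/main-page/resources, etc. -->
<link rel="canonical" href="https://www.example.com/product/main-page" />

-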
Block Domain in robots.txt
Hi. We had some URLs from a www1 subdomain indexed in Google. We have now disabled the URLs (returning a 404; for other reasons we cannot redirect from www1 to www) and blocked them via robots.txt. But the number of indexed pages keeps increasing (for 2 weeks now). Unfortunately, I cannot set up Webmaster Tools for this subdomain to tell Google to back off... Any ideas why this could be and whether it's normal? I can send you more domain info by personal message if you want to have a look at it.
Technical SEO | zeepartner
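For context, robots.txt is per-host, so the block described would be a separate file served from the www1 root (hostname assumed):

# Served at http://www1.example.com/robots.txt; applies only to the www1 host
User-agent: *
Disallow: /

One side effect worth noting: a crawl block also keeps Googlebot from re-fetching the URLs and seeing the 404s, which can leave them indexed longer.

-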
Robots.txt
www.mywebsite.com/details/home-to-mome-4596
www.mywebsite.com/details/home-moving-4599
www.mywebsite.com/details/1-bedroom-apartment-4601
www.mywebsite.com/details/4-bedroom-apartment-4612
We have so many pages like this, and we do not want Google to crawl these pages, so we added the following code to robots.txt:
User-agent: Googlebot
Disallow: /details/
Is this code correct?
Technical SEO | iskq
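As a sanity check, a sketch of how that rule matches, using paths from the question plus two assumed counterexamples:

User-agent: Googlebot
Disallow: /details/
# Blocked:     /details/home-moving-4599, /details/1-bedroom-apartment-4601
# Not blocked: /detailspage, /other/details/x (the prefix must match from the start of the path)

-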
Is robots.txt a must-have for 150 page well-structured site?
By looking in my logs I see dozens of 404 errors each day from different bots trying to load robots.txt. I have a small site (150 pages) with clean navigation that allows the bots to index the whole site (which they are doing). There are no secret areas I don't want the bots to find (the secret areas are behind a Login so the bots won't see them). I have used rel=nofollow for internal links that point to my Login page. Is there any reason to include a generic robots.txt file that contains "user-agent: *"? I have a minor reason: to stop getting 404 errors and clean up my error logs so I can find other issues that may exist. But I'm wondering if not having a robots.txt file is the same as some default blank file (or 1-line file giving all bots all access)?
Technical SEO | scanlin
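For reference, the minimal "allow everything" file being described is just two lines; serving it returns a 200 and stops the 404 noise:

User-agent: *
Disallow:

An empty Disallow value permits all crawling, which is functionally the same as having no robots.txt at all.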