Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Block Moz (or any other robot) from crawling pages with specific URLs
-
Hello!
Moz reports that my site has around 380 duplicate page content. Most of them come from dynamic generated URLs that have some specific parameters. I have sorted this out for Google in webmaster tools (the new Google Search Console) by blocking the pages with these parameters. However, Moz is still reporting the same amount of duplicate content pages and, to stop it, I know I must use robots.txt. The trick is that, I don't want to block every page, but just the pages with specific parameters. I want to do this because among these 380 pages there are some other pages with no parameters (or different parameters) that I need to take care of. Basically, I need to clean this list to be able to use the feature properly in the future.
I have read through Moz forums and found a few topics related to this, but there is no clear answer on how to block only pages with specific URLs. Therefore, I have done my research and come up with these lines for robots.txt:
User-agent: dotbot
Disallow: /*numberOfStars=0User-agent: rogerbot
Disallow: /*numberOfStars=0My questions:
1. Are the above lines correct and would block Moz (dotbot and rogerbot) from crawling only pages that have numberOfStars=0 parameter in their URLs, leaving other pages intact?
2. Do I need to have an empty line between the two groups? (I mean between "Disallow: /*numberOfStars=0" and "User-agent: rogerbot")? (or does it even matter?)
I think this would help many people as there is no clear answer on how to block crawling only pages with specific URLs. Moreover, this should be valid for any robot out there.
Thank you for your help!
-
Hello!
Thanks a lot for your feedback and clearing this out! It worked well.
The robots.txt tester is a good tip!
Thanks!
-
Hi,
What you have there will work absolutely fine with a little tweak. And no need to leave spaces between lines.
Disallow: /numberOfStars=0
However, no need to add the wildcard at the end if there is nothing more after that.
The best way to test what works, is before you go and add it to live, use the Robots.txt test tool in Search Console (Webmaster Tools), add in the lines above and then check to make sure none of your other pages are blocked. They won't be, but it's a great way to test before going live.
I hope this helps
-Andy
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
URL Length Issue
MOZ is telling me the URLs are too long. I did a little research and I found out that the length of the URLs is not really a serious problem. In fact, others recommend ignoring the situation. Even on their blog I found this explanation: "Shorter URLs are generally preferable. You do not need to take this to the extreme, and if your URL is already less than 50-60 characters, do not worry about it at all. But if you have URLs pushing 100+ characters, there's probably an opportunity to rewrite them and gain value. This is not a direct problem with Google or Bing - the search engines can process long URLs without much trouble. The issue, instead, lies with usability and user experience. Shorter URLs are easier to parse, copy and paste, share on social media, and embed, and while these may all add up to a fractional improvement in sharing or amplification, every tweet, like, share, pin, email, and link matters (either directly or, often, indirectly)." And yet, I have these questions: In this case, why do I get this error telling me that the urls are too long, and what are the best practices to get this out? Thank You
Moz Pro | | Cart_generation1 -
Is one page with long content better than multiple pages with shorter content?
(Note, the site links are from a sandbox site and has very low DA or PA) If you look at this page, you will see at the bottom a lengthy article detailing all of the properties of the product categories in the links above. http://www.aspensecurityfasteners.com/Screws-s/432.htm My question is, is there more SEO value in having the one long article in the general product category page, or in breaking up the content and moving the sub-topics as content to the more specific sub-category pages? e.g. http://www.aspensecurityfasteners.com/Screws-Button-Head-Socket-s/1579.htm
Moz Pro | | AspenFasteners
http://www.aspensecurityfasteners.com/Screws-Cap-Screws-s/331.htm
http://www.aspensecurityfasteners.com/Screws-Captive-Panel-Scre-s/1559.htm0 -
What to do with a site of >50,000 pages vs. crawl limit?
What happens if you have a site in your Moz Pro campaign that has more than 50,000 pages? Would it be better to choose a sub-folder of the site to get a thorough look at that sub-folder? I have a few different large government websites that I'm tracking to see how they are fairing in rankings and SEO. They are not my own websites. I want to see how these agencies are doing compared to what the public searches for on technical topics and social issues that the agencies manage. I'm an academic looking at science communication. I am in the process of re-setting up my campaigns to get better data than I have been getting -- I am a newbie to SEO and the campaigns I slapped together a few months ago need to be set up better, such as all on the same day, making sure I've set it to include www or not for what ranks, refining my keywords, etc. I am stumped on what to do about the agency websites being really huge, and what all the options are to get good data in light of the 50,000 page crawl limit. Here is an example of what I mean: To see how EPA is doing in searches related to air quality, ideally I'd track all of EPA's web presence. www.epa.gov has 560,000 pages -- if I put in www.epa.gov for a campaign, what happens with the site having so many more pages than the 50,000 crawl limit? What do I miss out on? Can I "trust" what I get? www.epa.gov/air has only 1450 pages, so if I choose this for what I track in a campaign, the crawl will cover that subfolder completely, and I am getting a complete picture of this air-focused sub-folder ... but (1) I'll miss out on air-related pages in other sub-folders of www.epa.gov, and (2) it seems like I have so much of the 50,000-page crawl limit that I'm not using and could be using. (However, maybe that's not quite true - I'd also be tracking other sites as competitors - e.g. non-profits that advocate in air quality, industry air quality sites - and maybe those competitors count towards the 50,000-page crawl limit and would get me up to the limit? How do the competitors you choose figure into the crawl limit?) Any opinions on which I should do in general on this kind of situation? The small sub-folder vs. the full humongous site vs. is there some other way to go here that I'm not thinking of?
Moz Pro | | scienceisrad0 -
Special Characters in URL & Google Search Engine (Index & Crawl)
G'd everyone, I need help with understanding how special characters impact SEO. Eg. é , ë ô in words Does anyone have good insights or reference material regarding the treatment of Special Characters by Google Search Engine? how Page Title / Meta Desc with Special Chars are being index & Crawl Best Practices when it comes to URLs - uses of Unicode, HTML entity references - when are where? any disadvantage using special characters Does special characters in URL have any impact on SEO performance & User search, experience. Thanks heaps, Amy
Moz Pro | | LabeliumUSA0 -
Moz & Xenu Link Sleuth unable to crawl a website (403 error)
It could be that I am missing something really obvious however we are getting the following error when we try to use the Moz tool on a client website. (I have read through a few posts on 403 errors but none that appear to be the same problem as this) Moz Result Title 403 : Error Meta Description 403 Forbidden Meta Robots_Not present/empty_ Meta Refresh_Not present/empty_ Xenu Link Sleuth Result Broken links, ordered by link: error code: 403 (forbidden request), linked from page(s): Thanks in advance!
Moz Pro | | ZaddleMarketing0 -
Moz WordPress Plugin?
WordPress is currently 18% of the Internet. Given its huge footprint, wouldn't it make sense for Moz to develop a WP plugin that can not only report site metrics, but help fix and optimize site structure directly from within the site? Just curious - I can't be the only one who wonders if I'm implementing Moz findings/recommendations correctly given the myriad of WP SEO plugins, authors, implementations.
Moz Pro | | twelvetwo.net5 -
How to increase page authority
I wonder how to increase the page authority or the domain authority to begin with. It seems you are putting a lot of weight on this in your analysis.
Moz Pro | | wcsinc0 -
How to resolve Duplicate Content crawl errors for Magento Login Page
I am using the Magento shopping cart, and 99% of my duplicate content errors come from the login page. The URL looks like: http://www.site.com/customer/account/login/referer/aHR0cDovL3d3dy5tbW1zcGVjaW9zYS5jb20vcmV2aWV3L3Byb2R1Y3QvbGlzdC9pZC8xOTYvY2F0ZWdvcnkvNC8jcmV2aWV3LWZvcm0%2C/ Or, the same url but with the long string different from the one above. This link is available at the top of every page in my site, but I have made sure to add "rel=nofollow" as an attribute to the link in every case (it is done easily by modifying the header links template). Is there something else I should be doing? Do I need to try to add canonical to the login page? If so, does anyone know how to do it using XML?
Moz Pro | | kdl01