Moz Crawler URL paramaters & duplicate content
-
Hi all, this is my first post on Moz Q&A
Questions:
- Does the Moz Crawler take into account rel="canonical" for search results pages with sorting / filtering URL parameters?
- How much time does it take for an issue to disappear from the issues list after it's been corrected? Does it come op in the next weekly report?
I'm asking because the crawler is reporting 50k+ pages crawled, when in reality, this number should be closer to 1000. All pages with query parameters have the correct canonical tag pointing to the root URL, so I'm wondering whether I need to noindex the other pages for the crawler to report correct data?:
Original (canonical URL): DOMAIN.COM/charters/search/mx/BS?search_location=cabo-san-lucas
Filter active URL: DOMAIN.COM/charters/search/mx/BS?search_location=cabo-san-lucas&booking_date=&booking_days=1&booking_persons=1&priceFilter%5B%5D=0%2C500&includedPriceFilter%5B%5D=drinks-soft
Also, if noindex is the only solution, will it impact the ranking of the pages involved?
Note: Google and Bing are semi-successful in reporting index page count, each reporting around 2.5k result pages when using the site:DOMAIN.com query. The rel canonical tag was missing for a short period of time about 4 weeks ago, but since fixing the issue these pages still haven't been deindexed.
Appreciate any suggestions regarding Moz Crawler & Google / Bing index count!
-
Happy to help!
We crawled roughly 49k pages because there were that many links on the site that we could find. 50k is also the new standard crawl limit for campaigns in Standard and Medium subscriptions. Adding a rel=canonical to a page doesn't mean it won't get crawled by our campaign crawler, only that the crawler is to refer to the canonicalized link for reporting purposes.
Without going into too specific of URL details, these pages are considered duplicates because their canonical tags point to different URLs. For example,
is considered a duplicate of
DOMAIN.COM/charters/search/mx/QR?booking_date=&booking_days=&booking_persons=limit%252525253D20
because the canonical tag for the first page is
DOMAIN.COM/charters/search/mx/QR?offset=20
while the canonical for the second URL is
DOMAIN.COM/charters/search/mx/QR
Since the canonical tags point to different pages it is assumed that DOMAIN.COM/charters/search/mx/QR?offset=20 and DOMAIN.COM/charters/search/mx/QR are likely to be duplicates themselves.
Here is how our system interprets duplicate content vs. rel=canonical:
Assuming A, B, C, and D are all duplicates,
If A references B as the canonical, then they are not considered duplicates
If A and B both reference C as canonical, A and B are not considered duplicates of each other
If A references C as a canonical, A and B are considered duplicated
If A references C as canonical, B references D, then A and B are considered duplicatesThe above example from your campaign actually falls into the fourth example I've listed above. Hope this helps clear things up
-
Thanks Sam!
I've read the post and checked my canonical tags but still can't seem to find what's causing the canonicalized pages to be indexed by RogerBot. The same page shows up in Moz's crawl test 100 times with slightly different parameters.
I'll keep investigating but some specific feedback from Moz staff would be appreciated
-
Hi!
I'm going to leave the strategy discussion open to the community but from a technical standpoint, we will count rel=canonical on dynamic urls as long as they are implemented correctly. Dr. Pete has a great post where he talks about canonicals that might be helpful as well. Updates to campaigns happen on a weekly basis depending on when the campaign was created. So if it was created on a Tuesday, you'll see updated campaign data every Tuesday after. You can run a crawl test (accessible from Research Tools) to get 3k page crawls in between your updates though. Hope this helps!
-
Thanks for the info searchbuzz. So if I understand correctly, new pages are crawled and kept in the index (up to the campaign limit), but issues on indexed pages are reported separately.
My issue is that due to the dynamic URLs used in search filters on my site I actually have 49k issues detected (over 95% are duplicate content and long URL issues because the crawler is indexing the same page many times for each URL parameter combination). The crawl test can't index the entire site because it generates a huge amount of pages.
It's a travel-related website with listings in 233 cities and multiple filter functionality, so each unique 'page' of results is indexed more than 100 times, even though there's a rel="canonical" tag pointing to the non-parametrized URL of that page.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
How To Stop Moz Crawl From Prepending /blog/ on all our site urls that it crawls
Hello, At some time in the past our WP site had urls like this: www.oursite.com/blog/post-title-pretty-link The site has not used that url structure for quite some time, but Moz crawl is still hitting every post with /blog/prepended and as a result is generating thousands of 404s. When the /blog/ is removed from the url, then the urls work fine. Where are those old urls being stored and how can we update them? How do we address this issue? Any assistance will be appreciated. Thanks!
Moz Bar | | dbcooper1 -
How do can the crawler not access my robots.txt file but have 0 crawler issues?
So I'm getting this errorOur crawler was not able to access the robots.txt file on your site. This often occurs because of a server error from the robots.txt. Although this may have been caused by a temporary outage, we recommend making sure your robots.txt file is accessible and that your network and server are working correctly. Typically errors like this should be investigated and fixed by the site webmaster.https://www.evernote.com/l/ADOmJ5AG3A1OPZZ2wr_ETiU2dDrejywnZ8kHowever, Moz is saying I have 0 Crawler issues. Have I hit an edge case? What can I do to rectify this situation? I'm looking at my robots.txt file here: http://www.dateideas.net/robots.txt however, I don't see anything that woudl specifically get in the way.I'm trying to build a helpful resource from this domain, and getting zero organic traffic, and I have a sinking suspicion this might be the main culprit.I appreciate your help!Thanks! 🙂
Moz Bar | | will_l0 -
Are we actually getting accurate data on keyword volumes from Moz (or other sources)?
I have a client who does patio furniture repair and restoration. When performing keyword research in Moz for terms like "patio furniture repair" I see that only 11-50 people in the entire US are searching for this term according to the Moz data. However, running an Adwords campaign currently and our top keyword is the phrase match for "patio furniture repair" which has generated over 100 clicks in just a couple of months in ONE county. Is there a better way to research more accurate results on search volume estimates? This makes organic SEO and keyword targeting hard! Thanks, Ricky
Moz Bar | | RickyShockley1 -
Duplicate Page Titles detected, no relevant links shown
Moz is reporting duplicate page titles, but the relevant pages aren't being shown. The downloaded report doesn't show them either. Screencast: http://screencast.com/t/gkyRds6u
Moz Bar | | ElykInnovation0 -
Idea for new feature on Moz Q&A
Hi Moz Can we please have a new feature to moz when answering questions please, the ability to save as a draft. I am currently on the train from London back home and I was trying to answer a few questions and give my opinion and after writing a really long and detailed response to an area I actually have quite a lot to say on, my internet froze and the page refreshed losing all the content. It doesn’t help that my Mac I am working on isn’t the fastest, but my new one isn’t quite ready yet. But if there was a simple save draft feature, it would be really great as I wasn’t ready to post my answer as it wasn’t finished and without checking wouldn’t have made sense. Yes I know I could of done it in word and pasted it across, but usually when I am travelling and answering questions I do it on my Ipad which isn’t great for multitasking. While I understand you don’t want lots of ‘draft answers’ all over the place as this would take storage space up on your servers and doesn’t actually help the person answering the question, your system could purge all draft answers which have not been edited in the last 24 hours. Just a little suggestion to help make this great forum even better. Thanks Andy
Moz Bar | | Andy-Halliday4 -
I am not able to perform crawl test in moz tools
it is throwing there is some problem in domain when i try testing the crawl test for my domains
Moz Bar | | IBEE-Hosting0 -
Pro.moz.com referrals and transactions
Our Google Analytics account reports a significant number of visits and transactions as referrals from pro.moz.com. That is nice and we appreciate this but where is the traffic generated from ? (my first MOZ question)
Moz Bar | | silverpewter0 -
Path of Moz Crawler Report?
Hi, I'm finding path of moz crawler report. Please tell me where I can find this tool?. Thanks
Moz Bar | | KLLC0