Captcha wall to access content and cloaking sanction
-
Hello, to protect our website against scraping, visitors are redirected to a reCAPTCHA page after visiting 2 pages.
But for SEO purposes, Googlebot is not subject to that restriction, so it could be seen as cloaking.
What is the SEO best practice to avoid a cloaking penalty in that case?
I'm thinking about adding the NewsArticle paywall JSON schema, but the content is accessible for free, so it's not really a paywall - more of a captcha protection wall. What do you recommend?
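For reference, the kind of NewsArticle paywall markup I have in mind is roughly the sketch below - the headline and the `.decision-body` selector are placeholders, and generating the JSON-LD from Python is just for illustration:

```python
import json

# Sketch of schema.org paywalled-content markup (placeholder values, not our real ones).
paywall_markup = {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "headline": "Example justice decision",
    "isAccessibleForFree": False,
    "hasPart": {
        "@type": "WebPageElement",
        "isAccessibleForFree": False,
        "cssSelector": ".decision-body",  # hypothetical selector for the gated part of the page
    },
}

# Rendered into the page head as a JSON-LD script tag.
script_tag = '<script type="application/ld+json">%s</script>' % json.dumps(paywall_markup)
print(script_tag)
```

My doubt is that this markup is meant for content that genuinely sits behind a paywall, whereas ours is free once the captcha is passed.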
Thanks!
-
In general, Google only cares about cloaking in the sense of treating its own crawler differently from human visitors - it's not a problem to treat Googlebot differently from other crawlers.
So: if you are tracking the "2 pages visited" using cookies (which I assume you must be - there's no other reliable way to know the 2nd request is from the same user), then you can treat Googlebot exactly the same as human users: cookieless requests are effectively stateless first visits, and since Googlebot doesn't carry cookies it will be able to crawl freely. You can then treat non-Googlebot scrapers more strictly, and rate limit / throttle / deny them as you wish.
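To make that concrete, here is a minimal sketch of what I mean, assuming a Flask-style app and a simple cookie counter (the cookie names and threshold are placeholders) - requests that carry no cookie never accumulate a count, so a stateless crawler is never redirected:

```python
from flask import Flask, redirect, request

app = Flask(__name__)
PAGE_LIMIT = 2  # pages a cookie-carrying visitor can view before the captcha

@app.before_request
def captcha_gate():
    if request.path.startswith("/captcha"):
        return None  # never gate the captcha page itself
    pages_seen = int(request.cookies.get("pages_seen", 0))
    solved = request.cookies.get("captcha_ok") == "1"
    if pages_seen >= PAGE_LIMIT and not solved:
        return redirect("/captcha?next=" + request.path)
    return None  # fall through to the normal page

@app.after_request
def count_page(response):
    # Only clients that keep cookies build up a count; cookieless requests
    # (including Googlebot's) always look like a first visit.
    pages_seen = int(request.cookies.get("pages_seen", 0))
    response.set_cookie("pages_seen", str(pages_seen + 1))
    return response
```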
I think that if real human users get at least one "free" visit, then you are probably OK - but you may want to consider not showing the recaptcha to real human users coming from Google (though you could find yourself in an arms race with scrapers pretending to be human visitors from Google).
In general, I would expect that if it's a recaptcha ("prove you are human") step rather than a paywall / registration wall, you will likely be OK in the situation where (roughly the policy sketched after this list):
- Googlebot is never shown the recaptcha
- Other scrapers are aggressively blocked
- Human visitors get at least one page without a recaptcha wall
- Human visitors can visit more pages after completing a recaptcha (but without paying / registering)
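Pulling those four points together, the request-handling policy would look something like this sketch - `Request`, `is_verified_googlebot` and `looks_like_scraper` are hypothetical stand-ins for whatever your stack actually uses:

```python
from dataclasses import dataclass

@dataclass
class Request:          # minimal stand-in for a real request object
    ip: str
    user_agent: str
    pages_seen: int
    captcha_solved: bool

def is_verified_googlebot(ip: str, user_agent: str) -> bool:
    # Stub - in practice this is the reverse-DNS style check discussed later in the thread.
    return "Googlebot" in user_agent

def looks_like_scraper(req: Request) -> bool:
    # Stub - e.g. rate limits, IP reputation, missing browser headers.
    return False

def route(req: Request) -> str:
    """Googlebot crawls freely, scrapers are blocked, humans get at least one
    free page and can keep going after solving the captcha (no payment needed)."""
    if is_verified_googlebot(req.ip, req.user_agent):
        return "serve_content"          # Googlebot is never shown the recaptcha
    if looks_like_scraper(req):
        return "block"                  # other scrapers aggressively blocked
    if req.pages_seen < 1 or req.captcha_solved:
        return "serve_content"
    return "redirect_to_captcha"        # free to continue once the captcha is solved
```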
Hope that all helps. Good luck!
-
Well, I'm not saying that there's no risk in what you are doing, just that I perceive it to be lower than the risk of the alternatives. I think a change as fundamental as paywalling would be moderately to highly likely to have a big impact on results (maybe a 65% likelihood of a 50% impact). Being incorrectly accused of cloaking would be a much lower chance (IMO) but with potentially higher impact (maybe a 5% or less chance of an 85% impact). Crudely multiplying those out, that's an expected impact of roughly 0.65 × 0.50 ≈ 0.33 for the paywall route versus 0.05 × 0.85 ≈ 0.04 for the cloaking route. Weighing those two things up, I subjectively conclude that I'd rather make the cloaking less 'cloaky' in any way I could, and leave everything outside of a paywall. That's how I'd personally weigh it up.
Personally I'd treat Google as a paid user. If you DID have a 'full' paywall, this would be really sketchy, but since it's only partial - and the content can still be accessed for FREE after completing the recaptcha - that's the option I'd go for.
Again, I'm not saying there is no risk, just that each set of dice you have at your disposal is... not great? And this is the set of dice I'd personally choose to roll with.
The only thing to keep in mind is that the algorithms Googlebot returns data to are pretty smart - but they're not human-smart, and a quirk in an algo could cause a big problem. Really though, the chances of that (IMO, if all you have said is accurate) are minimal. It's the lesser of two evils from my current perspective.
-
Yes, our DA is good and we have lots of gouv, edu and media backlinks.
Paid users don't go through the recaptcha, so treating Google as a paid user could indeed be a good solution.
So you don't recommend using a paywall?
Today the recaptcha is only used on decision pages.
But we need those pages to be indexed for our business, because all of our paid users find us while searching for a justice decision on Google. So we have 2 solutions:
- Change nothing and treat Google as a paid user
- Use a hard paywall and tell Google via the JSON schema markup, but we risk seeing lots of pages deindexed
In addition, we could go from 2 pages visited before the captcha to something less intrusive, like 6 pages before the captcha.
Also, the captcha page includes a form to start a free trial, so visitors can either complete the captcha and keep browsing, or create a free account and get unlimited access for 7 days. To conclude, if I understand your opinion correctly, we don't have to stress about being penalized for cloaking, because Googlebot is smart enough to understand why we use the captcha and our DA helps Google trust us. So I think the best solution is the first one: change nothing and treat Google as a paid user.
Thanks a lot for your time and your help!
It's a complicated subject and it's hard to find people able to answer my question, but you did it!
-
Well if you have a partnership with the Court of Justice I'd assume your trust and authority metrics would be pretty high with them linking to you on occasion. If that is true then I think in this instance Google would give you the benefit of the doubt, as you're not just some random tech start-up (maybe a start-up, but one which matters and is trusted)
It makes sense that in your scenario your data protection has to be iron-clad. Do paid users have to go through the recaptcha? If they don't, would there be a way to treat Google as a paid user rather than a free user?
Yeah, putting down a hard paywall could have significant consequences for you. Some huge publishers (paywalled news sites) manage to still get indexed, but not many, and their performance deteriorates over time IMO.
Here's a question for you. So you have some pages you really want indexed, and you have a load of data you don't want scraped or taken / stolen - right? Is it possible to ONLY apply the recaptcha to the pages which contain the data you don't want stolen, and never trigger the recaptcha (at all) in other areas? Just trying to think if there is some wiggle room in the middle - a way to make it obvious to Google that you are doing everything you possibly can to keep Google's view and the user's view the same.
-
Hi effectdigital, thanks a lot for that answer. I agree with you that a captcha is not the best UX, but our content is sensitive: we are a legal tech company indexing French justice decisions. We have a unique partnership with the Court of Justice because we have a unique technology to anonymize data in justice decisions, so we don't want our competitors to scrape our data (and trust me, they try, every day..). This is why we use the recaptcha protection. For Googlebot we use Google's reverse DNS plus the user agent, so even a great scraper can't bypass our security.
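Roughly, our Googlebot check works like the sketch below (simplified, and the function name is just for illustration): reverse DNS on the requesting IP, confirm the host sits under googlebot.com or google.com, then forward-confirm that the hostname resolves back to the same IP before trusting the Googlebot user agent.

```python
import socket

def is_verified_googlebot(ip: str, user_agent: str) -> bool:
    """Trust the Googlebot UA only when reverse + forward DNS confirm the IP is Google's."""
    if "Googlebot" not in user_agent:
        return False
    try:
        host, _, _ = socket.gethostbyaddr(ip)                 # reverse DNS lookup
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        _, _, forward_ips = socket.gethostbyname_ex(host)     # forward-confirm the hostname
        return ip in forward_ips
    except OSError:
        return False
```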
Then we have a paid option: people can create an account and pay a monthly subscription for unlimited access to the content. This is why I'm thinking about a paywall. We could replace the captcha page with a paywall page (with a free trial, of course), but I'm not sure Google will index millions of pages hiding behind a metered paywall.
As you said, I think there is no good answer..
And again, thanks a lot for taking the time to answer my question!
-
Unless you have previously experienced heavy scraping which you cannot solve any other way, this seems a little excessive. Most websites don't have such strong anti-scraping measures and they cope just fine without them.
I would say that it would be better to embed the recaptcha on the page and just block users from proceeding further (or accessing the content) until the recaptcha is completed. Unfortunately, on its own that would be a weak solution, as scrapers would still be able to scrape the page, so I guess redirecting to the captcha is your only option. Remember that if you are letting Googlebot through (probably with a user-agent toggle), then as long as scraper builders program their scripts to send the Googlebot UA, they can get past your recaptcha redirects and simply ignore them. Even users can alter their browser's UA to avoid the redirects.
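For what it's worth, if you did embed the widget rather than redirect, the server-side part is just a POST of the user's token to Google's siteverify endpoint - roughly like this sketch (assuming the `requests` library; the secret is a placeholder):

```python
import requests

RECAPTCHA_SECRET = "your-secret-key-here"  # placeholder - use your real secret key

def recaptcha_passed(token: str, remote_ip: str) -> bool:
    """Check the g-recaptcha-response token the browser posted back to us."""
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token, "remoteip": remote_ip},
        timeout=5,
    )
    return bool(resp.json().get("success"))
```

Only release the full content once that check returns true - but as above, a scraper that never runs your JavaScript can simply skip the widget, which is why the redirect approach exists in the first place.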
There are a number of situations where Google doesn't consider letting its crawler through a redirect to be cloaking. One big one is regional redirects, since Google needs to crawl the whole multilingual site instead of being redirected. I would think that in this situation Google wouldn't take too much issue with what you are doing, but you can never be certain (algorithms work in weird and wonderful ways).
I don't think any schema can really help you here. Google will want to know that you are using technology that could annoy users so they can lower your UX score(s) accordingly, but unfortunately letting them see this would stop your site from being properly crawled, so I don't know what the right answer is. Surely there must be some less nuclear, less obstructive technology you could integrate instead? Or just keep on top of your block lists (IP ranges, user agents) and monitor your site (don't make users suffer).
If you are already letting Googlebot through your redirects, why not just use a user-agent-based allow list instead of a block list, which is harder to manage? Find the UAs of the most common mobile / desktop browsers (Chrome, Safari, Firefox, Edge, Opera, whatever) and allow those UAs plus Googlebot. Anyone who does get through and scrapes you, deal with them on a case-by-case basis.
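Something as simple as the sketch below would cover it - the substrings are illustrative rather than exhaustive, and remember a UA string is trivially spoofed, so this is only a first filter on top of the stricter checks discussed elsewhere in the thread:

```python
ALLOWED_UA_TOKENS = (
    "Chrome", "Safari", "Firefox", "Edg/", "OPR/",  # common browsers (illustrative list)
    "Googlebot",
)

def ua_allowed(user_agent: str) -> bool:
    """Allow common browsers plus Googlebot; anything else gets the strict treatment."""
    return any(token in user_agent for token in ALLOWED_UA_TOKENS)
```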