Captcha wall to access content and cloaking penalty
-
Hello, to protect our website against scraping, visitors are redirected to a reCAPTCHA page after visiting 2 pages.
But for SEO purposes Googlebot is excluded from that restriction, so it could be seen as cloaking.
What is the SEO best practice to avoid a cloaking penalty in that case?
I thought about adding the NewsArticle JSON-LD paywall schema, but the content is accessible for free, so it's not really a paywall, more a captcha protection wall. What do you recommend?
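For reference, here is roughly the kind of NewsArticle paywall markup I have in mind (a minimal sketch generated via Python purely for illustration; the headline and CSS selector are placeholder assumptions, not our real values):

```python
import json

# Illustrative NewsArticle markup using schema.org's "isAccessibleForFree";
# the headline and ".paywalled-content" selector are placeholder assumptions.
news_article = {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "headline": "Example justice decision",
    "isAccessibleForFree": False,
    "hasPart": {
        "@type": "WebPageElement",
        "isAccessibleForFree": False,
        "cssSelector": ".paywalled-content",
    },
}

print(f'<script type="application/ld+json">{json.dumps(news_article)}</script>')
```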
Thanks!
-
In general, Google cares only about cloaking in the sense of treating their crawler differently to human visitors - it's not a problem to treat them differently to other crawlers.
So: if you are tracking the "2 pages visited" using cookies (which I assume you must be, since there's no other reliable way to know the second request comes from the same user), then you can treat Googlebot exactly the same as human users: its requests are stateless (no cookies), so Googlebot will be able to crawl. You can then treat non-Googlebot scrapers more strictly and rate limit, throttle, or deny them as you wish.
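To make that concrete, here is a minimal sketch of the idea, assuming a Flask-style app purely for illustration (the cookie name, route, page limit, and helper are hypothetical, not your actual setup):

```python
from flask import Flask, make_response, redirect, request

app = Flask(__name__)
PAGE_LIMIT = 2  # pages a human visitor may view before the reCAPTCHA wall


def looks_like_googlebot(req) -> bool:
    # Simplistic UA check for illustration only; in production you would also
    # verify the IP via reverse/forward DNS (sketched later in this thread).
    return "Googlebot" in req.headers.get("User-Agent", "")


def render_decision(decision_id: str) -> str:
    return f"<html><body>Decision {decision_id}</body></html>"  # placeholder


@app.route("/decision/<decision_id>")
def decision(decision_id: str):
    # Googlebot sends no cookies, so its requests are stateless anyway;
    # a recognised crawler is simply never counted or redirected.
    if looks_like_googlebot(request):
        return render_decision(decision_id)

    pages_seen = int(request.cookies.get("pages_seen", "0"))
    if pages_seen >= PAGE_LIMIT:
        return redirect("/captcha")  # human visitors hit the wall here

    resp = make_response(render_decision(decision_id))
    resp.set_cookie("pages_seen", str(pages_seen + 1))
    return resp
```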
I think that if real human users get at least one "free" visit, then you are probably OK - but you may want to consider not showing the recaptcha to real human users coming from Google (though you could find yourself in an arms race with scrapers pretending to be human visitors from Google).
In general, I would expect that if it's a recaptcha ("prove you are human") step rather than a paywall / registration wall, you will likely be OK in the situation where:
- Googlebot is never shown the recaptcha
- Other scrapers are aggressively blocked
- Human visitors get at least one page without a recaptcha wall
- Human visitors can visit more pages after completing a recaptcha (but without paying / registering)
Hope that all helps. Good luck!
-
Well, I'm not saying that there's no risk in what you are doing, just that I perceive it as less risky than the alternatives. I think a fundamental change like paywalling would be moderately to highly likely to have a high impact on results (maybe a 65% likelihood of a 50% impact). Being incorrectly accused of cloaking would be a much lower chance (IMO) but with potentially higher impact (maybe a 5% or less chance of an 85% impact). Weighing those two things up, I subjectively conclude that I'd rather make the cloaking less 'cloaky' in any way I could, and leave everything outside of a paywall. That's how I'd personally weigh it up.
Personally, I'd treat Google as a paid user. If you DID have a 'full' paywall, this would be really sketchy, but since it's only partial and the data can still be accessed for FREE after completing the recaptcha, that's the option I'd go for.
Again, I'm not saying there is no risk, just that every set of dice you have at your disposal is ... not great? And this is the set of dice I'd personally choose to roll with.
The only thing to keep in mind is that the algorithms which Googlebot returns data to are pretty smart. But they're not human-smart; a quirk in an algo could cause a big problem. Really though, the chances of that IMO (if all you have said is accurate) are minimal. It's the lesser of two evils from my current perspective.
-
Yes, our DA is good and we have lots of .gouv, .edu, and media backlinks.
Paid users do not go through the recaptcha, so treating Google as a paid user could indeed be a good solution.
So you would not recommend using a paywall?
Today the recaptcha is only used on decision pages.
But we need those pages to be indexed for our business, because all of our paid users find us while searching for a justice decision on Google. So we have 2 options:
- Change nothing and treat Google as a paid user
- Use a hard paywall and tell Google via the JSON-LD schema markup, but then we risk seeing a lot of pages deindexed
In addition, we could go from 2 pages visited before the captcha to something less intrusive, like 6 pages before the captcha.
Also, on the captcha page there is a form to start a free trial, so visitors can complete the captcha and keep navigating, or create a free account and get unlimited access for 7 days. To conclude, if I understand your opinion correctly, we don't have to stress about being penalized for cloaking, because Googlebot is smart enough to understand why we use the captcha, and our DA helps Google trust us. So I think the best solution is option 1: change nothing and treat Google as a paid user.
Thanks a lot for your time and your help!
It's a complicated subject and it's hard to find people able to answer my question, but you did it.
-
Well if you have a partnership with the Court of Justice I'd assume your trust and authority metrics would be pretty high with them linking to you on occasion. If that is true then I think in this instance Google would give you the benefit of the doubt, as you're not just some random tech start-up (maybe a start-up, but one which matters and is trusted)
It makes sense that in your scenario your data protection has to be iron-clad. Do paid users have to go through the recaptcha? If they don't, would there be a way to treat Google as a paid user rather than a free user?
Yeah putting down a hard paywall could have significant consequences for you. Some huge publishers manage to still get indexed (pay-walled news sites), but not many and their performance deteriorates over time IMO
Here's a question for you. So you have some pages you really want indexed, and you have a load of data you don't want scraped or stolen, right? Is it possible to ONLY apply the recaptcha to the pages which contain the data you don't want stolen, and never trigger the recaptcha (at all) in other areas? Just trying to think if there is some wiggle room in the middle, to make it obvious to Google that you are doing all you possibly can to keep Google's view and the user view the same.
-
Hi effectdigital, thanks a lot for that answer. I agree with you that a captcha is not the best UX idea, but our content is sensitive: we are a legal tech company indexing French justice decisions. We have a unique partnership with the Court of Justice because we have a unique technology to anonymize data in justice decisions, so we don't want our competitors to scrape our data (and trust me, they try, every day...). This is why we use the recaptcha protection. For Googlebot we use Google reverse DNS plus the user agent, so even a great scraper can't bypass our security.
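For reference, that verification is roughly a reverse DNS lookup on the requesting IP followed by a forward lookup to confirm it, along these lines (a simplified standard-library Python sketch, not our actual implementation; the example IP and UA are illustrative only):

```python
import socket


def is_verified_googlebot(ip: str, user_agent: str) -> bool:
    """Verify a claimed Googlebot: check the UA, then confirm the IP really
    belongs to Google via a reverse DNS lookup plus a confirming forward lookup."""
    if "Googlebot" not in user_agent:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup: IP -> hostname
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward lookup: the hostname must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:  # covers socket.herror / socket.gaierror on failed lookups
        return False


# Example call (illustrative values):
# is_verified_googlebot("66.249.66.1", "Mozilla/5.0 (compatible; Googlebot/2.1)")
```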
Then we have a paid option: people can create an account and pay a monthly subscription for unlimited access to the content. This is why I'm thinking about a paywall. We could replace the captcha page with a paywall page (with a free trial, of course), but I'm not sure Google will index millions of pages hiding behind a metered paywall.
As you said, I think there is no good answer...
And again, thanks a lot for taking the time to answer my question!
-
Unless you have previously experienced heavy scraping which you cannot solve any other way, this seems a little excessive. Most websites don't have such strong anti-spam measures and they cope just fine without them
I would say it would be better to embed the recaptcha on the page and just block users from proceeding further (or accessing the content) until the recaptcha is completed. Unfortunately this would be a bad solution, as scrapers would still be able to scrape the page, so I guess redirecting to the captcha is your only option. Remember that if you are letting Googlebot through (probably with a user agent toggle), then as long as scraper builders program their scripts to send the Googlebot UA, they can get past your recaptcha redirects and simply ignore them. Even human users can alter their browser's UA to avoid the redirects.
There are a number of situations where Google doesn't consider bypassing a redirect to be cloaking. One big one is regional redirects, as Google needs to crawl a whole multilingual site instead of being redirected. I would think that in this situation Google wouldn't take too much issue with what you are doing, but you can never be certain (algorithms work in weird and wonderful ways).
I don't think any schema can really help you. Google will want to know that you are using technology that could annoy users so they can lower your UX score(s) accordingly, but unfortunately letting them see this will stop your site being properly crawled so I don't know what the right answer is. Surely there must be some less nuclear, obstructive technology you could integrate instead? Or just keep on top of your block lists (IP ranges, user agents) and monitor your site (don't make users suffer)
If you are already letting Googlebot through your redirects, why not just have a user-agent-based allow list instead of a block list, which is harder to manage? Find the UAs of the most common mobile / desktop browsers (Chrome, Safari, Firefox, Edge, Opera, whatever) and allow those UAs plus Googlebot. Anyone who does get through and scrapes, deal with them on a case-by-case basis.
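Something along these lines would do it (a rough sketch; the UA tokens and helper name are illustrative assumptions, and of course UAs can be spoofed, which is why the case-by-case follow-up still matters):

```python
# Hypothetical allow list: common desktop/mobile browser UA tokens plus Googlebot.
ALLOWED_UA_TOKENS = ("Chrome", "Safari", "Firefox", "Edg", "OPR", "Googlebot")


def is_allowed_user_agent(user_agent: str) -> bool:
    """Return True if the UA contains any allowed token; anything else
    (blank UAs, headless clients, unknown scrapers) gets the strict path."""
    return any(token in user_agent for token in ALLOWED_UA_TOKENS)


# Example usage:
print(is_allowed_user_agent("Mozilla/5.0 ... Chrome/120.0 Safari/537.36"))  # True
print(is_allowed_user_agent("python-requests/2.31"))                        # False
```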