Hi Greg,
Awesome information there from Ryan!
Implementing the authorship markup is important in that it basically "outs" anyone who has already stolen your content by telling Google that they are not the original author. With authorship markup properly implemented, it really doesn't matter how many duplicates are out there; Google will always see those sites as imposters, since no one else has the ability to verify their authorship with a link back from your Google profile.
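For anyone who still needs to set it up, the on-page half of that verification is normally just a rel="author" link from each article to your Google profile (the profile ID below is a placeholder), matched by a "Contributor to" link on the profile pointing back at the site:

  <a href="https://plus.google.com/112233445566778899000" rel="author">Your Name</a>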
It is possible to block scrapers from your server (blacklist them) using IP address or User Agent if you are able to identify them. Identification is not very difficult if you have access to your server logs, as there will be a number of clues in the log data. These include excessive hits, bandwidth used, unusual patterns in requests for JavaScript and CSS files (real browsers fetch these, most scrapers don't), and high numbers of 401 (Unauthorized) and 403 (Forbidden) HTTP error codes.
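If it helps, a quick-and-dirty way to pull the worst offenders out of the raw log is a few lines of PHP like these (the log path and combined log format are assumptions, so adjust to suit your server):

  <?php
  // Count requests per IP in an Apache access log to spot excessive hitters
  $counts = array();
  foreach (file('/var/log/apache2/access.log') as $line) {
      $ip = strtok($line, ' ');   // client IP is the first field in combined format
      if ($ip !== false) {
          $counts[$ip] = isset($counts[$ip]) ? $counts[$ip] + 1 : 1;
      }
  }
  arsort($counts);                               // busiest IPs first
  print_r(array_slice($counts, 0, 10, true));    // top 10 suspects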
Some scrapers are also easily identifiable by their User Agent name. Once the IP address or User Agent is known, the server can be instructed to block it and, if you wish, to serve content which will identify the site as having been scraped.
If you are not able to specifically identify the bot(s) responsible, it is also possible to take the opposite approach and whitelist the bots that you know are OK. This needs to be handled carefully, as omissions from the whitelist could mean that you have actually banned bots that you do want crawling the site.
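A rough sketch of the whitelist idea, done in .htaccess (which I come back to below); the assumption is that anything announcing itself as a bot, crawler or spider gets blocked unless it is on your known-good list:

  RewriteEngine On
  # Looks like a bot of some kind...
  RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider) [NC]
  # ...and is not one we have whitelisted
  RewriteCond %{HTTP_USER_AGENT} !(googlebot|bingbot|slurp) [NC]
  RewriteRule .* - [F,L]

Leave a legitimate crawler off that whitelist line and you have just blocked it, which is exactly the risk I mention further down.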
If you are using a LAMP setup (Apache server), the blocking instructions are added to the .htaccess file as Apache directives, usually paired with a small PHP script (like the bot-response.php below) if you want to serve alternative content. On a Windows server, you would use a database or text file together with the FileSystemObject to redirect them to a dead-end page. Ours is a LAMP shop, so I am much more familiar with the .htaccess method.
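On the Apache side, the basic blacklist entries are only a few lines (the IP address and bot name here are placeholders):

  # Block a known scraper by IP address
  Order Allow,Deny
  Allow from all
  Deny from 203.0.113.45

  # Block by User Agent name
  SetEnvIfNoCase User-Agent "BadScraperBot" bad_bot
  Deny from env=bad_bot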
If using .htaccess, you have the choice of returning a 403 (Forbidden) HTTP error, or using the bot-response.php script to serve an image which identifies the site as scraped.
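Either option is just a rewrite rule or two (the bot name is a placeholder again):

  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} BadScraperBot [NC]
  # Option 1: return 403 Forbidden
  RewriteRule .* - [F,L]
  # Option 2: send the request to bot-response.php instead (use in place of Option 1)
  # RewriteRule .* /bot-response.php [L]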
If using bot-response.php, the GIF image should be made large enough to break the layout of the scraped site if they serve your content somewhere else. Usually a very large GIF that reads something like "Content on this page has been scraped from yoursite.com. If you are the webmaster, please stop trying to steal our content." will do the job.
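A minimal sketch of what bot-response.php itself can look like (the image filename is a placeholder for whatever you call your warning graphic):

  <?php
  // Serve the oversized warning image in place of the real page content
  header('Content-Type: image/gif');
  readfile('scraped-warning.gif');
  exit;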
There is one VERY BIG note of caution if you are thinking of blocking bots from your server. You really need to be an experienced tech to do this. It is NOT something that should be attempted if you don't understand exactly what you are doing and what precautions need to be taken beforehand. There are two major things to consider:
- You can accidentally block the bots that you want to crawl your site. (Major search engines use many different crawlers to do different jobs, and they do not always appear as googlebot, slurp, etc.)
- It is possible for people to create fake bots that appear to be legitimate. If you don't identify these, you will not solve the scraping problem.
The authenticity of bots can be verified using round-trip DNS lookups (reverse, then forward) and WhoIs lookups to check the originating domain and IP address range.
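For example, a round-trip check in PHP for something claiming to be Googlebot looks roughly like this (the IP is a placeholder; genuine Googlebot hosts resolve back to googlebot.com or google.com):

  <?php
  $ip   = '66.249.66.1';           // placeholder: the IP you want to verify
  $host = gethostbyaddr($ip);      // reverse lookup: IP -> hostname
  $back = gethostbyname($host);    // forward lookup: hostname -> IP
  $isGenuine = ($back === $ip) && preg_match('/\.(googlebot|google)\.com$/i', $host);

If $isGenuine comes back false, your "Googlebot" is a fake, and the WhoIs lookup on the IP range gives you a second opinion.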
It is possible to add a disallow statement for "bad bots" to your robots.txt file, but scrapers will generally ignore robots.txt by default, so this method is not recommended.
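For completeness, that disallow statement is just (bot name is a placeholder):

  User-agent: BadScraperBot
  Disallow: /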
Phew! Think that's everything covered.
Hope it helps,
Sha