Hi Ryan,
The major problem is that any experienced programmer can easily write their own script to scrape a site. So there could be thousands of "bad bots" out there that have not been seen before.
There are a few recurring themes amongst suspicious User Agents that are easy to spot - generally anything with a name including words like grabber, siphon, leech, downloader, extractor, stripper, sucker, or any name with a bad connotation like reaper, vampire, widow etc. Some of these guys just can't help themselves!
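If you want to script a quick check for that, here's a rough Python sketch - the keyword list and function name are just illustrative starting points, so tune them to what actually turns up in your own logs:

```
# Rough sketch: flag a User-Agent string if it contains a suspicious
# keyword. The keyword list and function name are only illustrative -
# adjust them to match what you see in your own server logs.
SUSPICIOUS_KEYWORDS = [
    "grabber", "siphon", "leech", "downloader",
    "extractor", "stripper", "sucker", "reaper", "vampire", "widow",
]

def looks_suspicious(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(keyword in ua for keyword in SUSPICIOUS_KEYWORDS)

print(looks_suspicious("SiteSucker/2.3"))                  # True
print(looks_suspicious("Mozilla/5.0 (Windows NT 10.0)"))   # False
```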
The most important thing, though, is to properly identify the ones that are actually giving you a problem by checking your server logs and tracing where they originate from, using roundtrip DNS and WhoIs lookups.
Matt Cutts wrote a post a long time ago on how to verify Googlebot, and the method applies to other search engines as well: do a reverse DNS lookup on the IP from your logs, check that the hostname belongs to googlebot.com or google.com, then do a forward DNS lookup on that hostname and confirm it resolves back to the same IP. The double-check is to then use WhoIs to verify that the IP address falls within the range assigned to Google (or whichever search engine you are checking).
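If you'd rather automate the roundtrip part than do it by hand, here's a minimal Python sketch using only the standard library - the example IP is just illustrative, and treat the whole thing as a starting point rather than anything definitive:

```
# Minimal sketch of the roundtrip DNS check for Googlebot.
# Assumes the client IP comes straight from your server logs;
# error handling is kept to the bare minimum.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        # Step 1: reverse DNS - genuine Googlebot hostnames end in
        # googlebot.com or google.com.
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 2: forward DNS - the hostname must resolve back to the
        # same IP, otherwise the reverse record could be spoofed.
        return socket.gethostbyname(hostname) == ip
    except (socket.herror, socket.gaierror):
        # No reverse record, or the hostname won't resolve - not verified.
        return False

print(is_real_googlebot("66.249.66.1"))  # example IP - use one from your logs
```

A more thorough version would compare against every address returned for the hostname (e.g. via socket.gethostbyname_ex), since some hosts resolve to more than one IP.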
If you are experienced at reading server logs it becomes fairly easy to spot spikes in hits, bandwidth, etc., which will alert you to bots. Depending on which server stats package you are using, some or all of the bots may already be highlighted for you; some packages do a much better job than others, and some provide only a limited list.
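If your stats package doesn't surface them, a rough script like this will show you which IPs are hammering the site hardest - the log path and the assumption that the IP is the first field on each line are just illustrative, so adjust for your own setup:

```
# Rough sketch: count hits per client IP in an Apache/nginx-style
# access log to spot obvious spikes. Log path and format are assumed.
from collections import Counter

def top_talkers(log_path: str, limit: int = 10):
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            ip = line.split(" ", 1)[0]   # first field = client IP in common log formats
            hits[ip] += 1
    return hits.most_common(limit)

for ip, count in top_talkers("/var/log/apache2/access.log"):
    print(f"{count:8d}  {ip}")
```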
If you have access to a programmer who is easy to get along with, the best way to get your head around this is to sit down with them for an hour and walk through the process.
Hope that helps,
Sha
PS - I'm starting to think you sleep less than I do!