Rest assured that I don't scrape/hammer so hard it would knock your site down for any period. I usually throttle it back to one thread at two URIs per second. If I forget to configure it, the default is five threads at two URIs per second. So yeah, maybe a bit of the Moz effect.
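For what it's worth, that throttle works out to something like the single-threaded loop below. This is just a minimal sketch, and the URLs are placeholders rather than the actual site:

```python
import time
import requests

# A rough sketch of the throttle described above: one thread,
# roughly two URIs per second. Placeholder URLs, not the real site.
urls = ["https://example.com/", "https://example.com/about"]

for url in urls:
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    time.sleep(0.5)  # one request every 0.5s ~= 2 URIs per second
```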
Chrome Incognito Settings:
Just the typical/vanilla/default incognito settings. It should accept cookies, but they generally wouldn't persist after the session ends.
I didn't receive a message regarding cookies prior to the block notification.
On a side note, I don't allow plugins/extensions while using incognito.
Fun w/ Screaming Frog:
It's hard to say whether the instance 8.5 hours later was my instance of Screaming Frog. If it was mine, the IP address would probably show the traffic coming out of San Antonio. I didn't record the IP at the time, but I remember that much about it. Otherwise it's back in the pool.
Normally Screaming Frog would display notifications, but in this instance the connection just timed out for the requested URLs. It didn't appear to be a connectivity issue on my end, so... yeah...
Fun w/ Scraping and/or Spoofing:
Screaming Frog will normally crawl CSS and JS links found in the source code, so I found it a little odd that it didn't here.
I also ran the domain through the Google PageSpeed tool for giggles, since that would be traffic from Googlebot. It failed to fetch the resources necessary to run the test. Cached versions of pages seemed to render fine, though, with the exception of broken images in some cases. I suspect that may have something to do with the lazy-load script in indexinit.js, but I didn't do much more than read the code comments there.
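If anyone wants to poke at the lazy-load theory, here's a generic way to spot the usual pattern. This assumes the common data-src convention; I'm not claiming that's how indexinit.js actually works, and the URL is a placeholder:

```python
import requests
from bs4 import BeautifulSoup

# Generic check for the lazy-load pattern that can leave broken images
# in cached copies: the real image URL sits in data-src while src holds
# a placeholder. Assumes the common data-src convention, not the actual
# contents of indexinit.js.
html = requests.get("https://example.com/", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    src = img.get("src") or ""
    data_src = img.get("data-src")
    if data_src and data_src != src:
        print("lazy-loaded image:", data_src)
```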
In regard to the crawler settings, I had it set to allow cookies. The user-agent was Googlebot, but the requests wouldn't have come from the typical Googlebot IPs. Basically, I was just trying to get around the user-agent and cookie problem with an IP that hadn't been blocked. You know, quick, dirty, and likely stupid.
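In plainer terms, each request looked roughly like this. Just a sketch: the Googlebot UA string is the standard published one, and the URL is a placeholder:

```python
import requests

# Sketch of the spoofed setup: Googlebot user-agent, cookies allowed.
# A Session persists cookies across requests; placeholder URL.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                  "+http://www.google.com/bot.html)"
})

resp = session.get("https://example.com/", timeout=10)
print(resp.status_code, session.cookies.get_dict())
```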
Fun w/ Meta Robots Directives:
A few of the pages that had noindex directives appeared to lack genuine content, in line with the purpose of the site, so I left that avenue alone and figured it was intentional. The noarchive directive should prevent a cache link. I was just wondering if one or more had somehow made it into the mix, for added zest. Apparently not.
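For reference, here's a quick way to check a given page for those directives. A hedged sketch with a placeholder URL, nothing site-specific:

```python
import requests
from bs4 import BeautifulSoup

# Check a page for noindex/noarchive in the meta robots tag.
# noarchive is the directive that suppresses the cache link.
html = requests.get("https://example.com/some-page", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

meta = soup.find("meta", attrs={"name": "robots"})
directives = [d.strip().lower()
              for d in (meta.get("content", "") if meta else "").split(",")]
print("noindex:", "noindex" in directives)
print("noarchive:", "noarchive" in directives)
```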
While I'm running off in an almost totally unrelated direction, I thought this was interesting. Apparently Bingbot can be cheeky at times.
Fun w/ The OP:
It looks like Ryan had your answer, and now you have an entirely new potential problem which is interesting. I think I'm just going to take up masonry and carpentry. Feel free to come along if you're interested.