Baidu Spider appearing on robots.txt
-
Hi, I'm not too sure what to do about this or what to think of it.
This magically appeared in my company's robots.txt file (it literally just appeared; the text is below):
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /
I know that Baidu is the Google of China, but I'm not sure why this would appear in our robots.txt all of a sudden. Should I be worried about a hack? Also, would I want to disallow Baidu from crawling my company's website?
Thanks for your help,
-Reed -
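As a quick sanity check on what the rule group quoted above does, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It assumes `Disallow: /` sits on its own line, and `https://example.com/products` is a placeholder URL, not the poster's site. Python's parser is only one interpretation of robots.txt, so Baidu's own crawler may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The rule group from the question, with "Disallow: /" on its own line.
ROBOTS_TXT = """\
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host used only for illustration.
    for agent in ("Baiduspider", "Baiduspider-image", "Googlebot"):
        print(agent, allowed(agent, "https://example.com/products"))
```

Under these assumptions the three Baidu agents are refused every path, while agents not named in the group are unaffected because the file has no other rules.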
Thanks for your help Travis, that was a really solid answer.
-
There's a possibility someone in your company saw suspicious traffic from an actor spoofing the Baidu user agent. Such traffic can get so aggressive that it eventually bogs down your response time through the sheer number of requests. But the problem is that the same actor, or someone else with malicious intent, can simply spoof another user agent or IP.
But the main problem is that the site is straight e-commerce. It could get international business, so why take such a ham-fisted approach? Even if blocking Baidu gave the desired result, the dev/admin would still have to block individual IP ranges as they come in. It would make more sense to invest in server resources so the site can handle the load, or look into DDoS mitigation.
So yeah, it's strange. Though it's more likely a lack of understanding than anything malicious.
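Since the answer above turns on traffic that only claims to be Baidu, one way to tell a spoofed user agent from the real crawler is a forward-confirmed reverse DNS lookup. The following is a minimal sketch, not a vetted implementation. The function name `is_genuine_baiduspider` is hypothetical, and the trusted suffixes `.baidu.com` and `.baidu.jp` are an assumption that should be confirmed against Baidu's current webmaster documentation. The resolver functions are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Sequence

# Assumption: genuine Baiduspider hosts reverse-resolve to names under these
# domains. Verify this against Baidu's own documentation before relying on it.
TRUSTED_SUFFIXES = (".baidu.com", ".baidu.jp")


def _reverse(ip: str) -> str:
    """Reverse DNS: IP address -> primary host name."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> Sequence[str]:
    """Forward DNS: host name -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_genuine_baiduspider(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], Sequence[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check for a client IP (hypothetical helper)."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(TRUSTED_SUFFIXES):
            return False  # reverse name is not under a trusted domain
        # The name must resolve back to the same IP, otherwise the PTR
        # record could have been set by whoever controls the address.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return False
```

A request that fails this check but still sends a Baiduspider user agent is a candidate for rate limiting or blocking at the server or firewall level, which robots.txt cannot do because compliance with it is voluntary.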
Related Questions
-
I have a same paragraph appearing on two webpages of my site!
Is it going to affect rankings? If so, what should be done? Thanks.
Intermediate & Advanced SEO | Sam09schulz
Block subdomain directory in robots.txt
Instead of blocking an entire subdomain (fr.sitegeek.com) with robots.txt, we would like to block one directory (fr.sitegeek.com/blog).
'fr.sitegeek.com/blog' and 'www.sitegeek.com/blog' contain the same articles in one language; only the labels are changed for the 'fr' version, and we suppose that the duplicate content causes problems for SEO. We would like the 'www.sitegeek.com/blog' articles to be crawled and indexed, not 'fr.sitegeek.com/blog'. So, please suggest how to block a single subdomain directory (fr.sitegeek.com/blog) with robots.txt. This is only for the blog directory of the 'fr' version; all other directories and pages of the 'fr' version should still be crawled and indexed. Thanks,
Rajiv
Intermediate & Advanced SEO | gamesecure
Robots.txt Syntax
I have been having a hard time finding any decent information on robots.txt syntax that has been written in the last few years, and I just want to verify some things as a review for myself. I often need to block particular directories in the URL, parameters, and parameter values. I want to make sure that I am doing this in the most efficient way possible and thought you guys could help.
So let's say I want to block a particular directory called "this", and these would be example URLs:
www.domain.com/folder1/folder2/this/file.html
or
www.domain.com/folder1/this/folder2/file.html
In order to block any URL that contains this folder anywhere in the URL, I would use:
User-agent: *
Disallow: /this/
Now let's say I have a parameter "that" I want to block, and sometimes it is the first parameter and sometimes it isn't when it shows up in the URL. Would it look like this?
User-agent: *
Disallow: ?that=
Disallow: &that=
What if there is only one value I want to block for "that", and the value is "NotThisGuy"?
User-agent: *
Disallow: ?that=NotThisGuy
Disallow: &that=NotThisGuy
My big questions here are: what are the most efficient ways to block a particular parameter and to block a particular parameter value? Is there a more efficient way to deal with ? and & for when the parameter and value are either first or later? Secondly, is there a list somewhere that will tell me all of the syntax and meaning that can be used in a robots.txt file? Thanks!
Intermediate & Advanced SEO | DRSearchEngOpt
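To make the matching questions above concrete, here is a minimal sketch of how a single Disallow pattern can be tested against a URL path, assuming the wildcard semantics described in RFC 9309 and in Google's robots.txt documentation: matching starts at the beginning of the path, `*` matches any run of characters, and a trailing `$` anchors the end. The helper `rule_matches` is hypothetical and is not any crawler's real matcher. It ignores percent-encoding and Allow/Disallow precedence, and crawlers that do not support wildcards will behave differently.

```python
import re


def rule_matches(pattern: str, target: str) -> bool:
    """True if a robots.txt path pattern matches `target` (path plus query).

    Hypothetical helper for illustration only.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" so that
    # every '*' in the pattern matches any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only matches at the start of the string, which mirrors the
    # prefix matching that robots.txt rules use.
    return re.match(regex, target) is not None


if __name__ == "__main__":
    checks = [
        ("/this/", "/this/file.html"),
        ("/this/", "/folder1/this/folder2/file.html"),
        ("/*/this/", "/folder1/this/folder2/file.html"),
        ("?that=", "/page.html?that=1"),
        ("/*?that=", "/page.html?that=1"),
        ("/*&that=", "/page.html?other=1&that=2"),
    ]
    for pattern, target in checks:
        print(f"{pattern!r:12} {target!r:40} {rule_matches(pattern, target)}")
```

Under these assumptions, a pattern without a leading `/` or `*` never matches, because every path begins with `/`, and `/this/` only matches when the folder is the first path segment.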
Our Robots.txt and Reconsideration Request Journey and Success
We have asked a few questions related to this process on Moz and wanted to give a breakdown of our journey, as it will likely be helpful to others!
A couple of months ago, we updated our robots.txt file with several pages that we did not want to be indexed. At the time, we weren't checking WMT as regularly as we should have been, and within a few weeks we found that one of the files we were blocking in robots.txt was apparently a dynamic file that led to the blocking of over 950,000 of our pages, according to Webmaster Tools. Which page was causing this is still a mystery, but we quickly removed all of the entries.
From research, most people say that things normalize in a few weeks, so we waited. A few weeks passed and things did not normalize. We searched, we asked, and the number of "blocked" pages in WMT, which had increased at a rate of a few hundred thousand a week, was decreasing at a rate of a thousand a week. At this rate it would be a year or more before the pages were unblocked. This did not change. Two months later, we were still at 840,000 pages blocked. We posted on the Google Webmaster Forum and one of the mods there said that it would just take a long time to normalize. Very frustrating indeed, considering how quickly the pages had been blocked.
We found a few places on the interwebs that suggested that if you have an issue or mistake with robots.txt, you can submit a reconsideration request. This seemed to be our only hope. So, we put together a detailed reconsideration request asking for help with our blocked pages issue. A few days later, to our horror, we did not get a message offering help with our robots.txt problem. Instead, we received a message saying that we had received a penalty for inbound links that violate Google's terms of use. Major backfire. We used an SEO company years ago that posted a hundred or so blog posts for us. To our knowledge, the links didn't even exist anymore. They did... So, we signed up for an account with removeem.com.
We quickly found many of the links posted by the SEO firm, as they were easily recognizable via the anchor text. We began the process of using removeem to contact the owners of the blogs. To our surprise, we got a number of removals right away! Others we had to contact a second time, and many did not respond at all. For those we could not find an email for, we tried posting comments on the blog. Once we felt we had removed as many as possible, we added the rest to a disavow list and uploaded it using the disavow tool in WMT. Then we waited...
A few days later, we already had a response. DENIED. In our request, we specifically asked that, if the request were to be denied, Google provide some example links. When they denied our request, they sent us an email including a sample link. It was an interesting example. We actually already had this blog in removeem. The issue in this case was that our version was a domain name, i.e. www.domainname.com, and the version Google had was a WordPress subdomain, i.e. www.subdomain.wordpress.com.
So, we went back to the drawing board. This time we signed up for Majestic SEO and tied it in with removeem. That added a few more links. We also had records from the old SEO company that we were able to go through to locate a number of new links. We repeated the previous process, contacting site owners and keeping track of our progress. We also went through the "sample links" in WMT as best we could (we have a lot of them) to try to pinpoint any other potentials. We removed what we could and, again, disavowed the rest.
A few days later, we had a message in WMT. DENIED AGAIN! This time it was very discouraging, as it just didn't seem there were any more links to remove. The difference this time was that there was NOT an email from Google. Only a message in WMT. So, while we didn't know if we would receive a response, we responded to the original email asking for more example links, so we could better understand what the issue was.
Several days passed, and we received an email back saying that THE PENALTY HAD BEEN LIFTED! This was of course very good news, and it appeared that our email to Google was reviewed and received well.
So, the final hurdle was the reason that we originally contacted Google: our robots.txt issue. We did not receive any information from Google related to the robots.txt issue we originally filed the reconsideration request for. We didn't know if it had just been ignored, or if there was something that might be done about it. So, as a last-ditch effort, we responded to the email once again and requested help with the robots.txt issue, as we had the other times. The weekend passed and on Monday we checked WMT again. The number of blocked pages had dropped over the weekend from 840,000 to 440,000! Success! We are still waiting and hoping that number will continue downward back to zero.
So, some thoughts:
1. Was our site manually penalized from the beginning, yet without a message in WMT? Or, when we filed the reconsideration request, did the reviewer take a closer look at our site, see the old paid links and add the penalty at that time? If the latter is the case then...
2. Did our reconsideration request backfire? Or was it ultimately for the best?
3. When asking for reconsideration, make your requests known. If you want example links, ask for them. It never hurts to ask! If you want to be connected with Google via email, ask to be!
4. If you receive an email from Google, don't be afraid to respond to it. I wouldn't overdo this or spam them. Keep it to the bare minimum and don't pester them, but if you have something pertinent to say that you have not already said, then don't be afraid to ask.
Hopefully our journey might help others who have similar issues, and feel free to ask any further questions. Thanks for reading!
TheCraig
Intermediate & Advanced SEO | TheCraig
Panda Updates - robots.txt or noindex?
Hi, I have a site that I believe has been impacted by the recent Panda updates. Assuming that Google has crawled and indexed several thousand pages that are essentially the same and the site has now passed the threshold to be picked out by the Panda update, what is the best way to proceed? Is it enough to block the pages from being crawled in the future using robots.txt, or would I need to remove the pages from the index using the meta noindex tag? Of course, if I block the URLs with robots.txt then Googlebot won't be able to access the page in order to see the noindex tag. Does anyone have any previous experience of doing something similar? Thanks very much.
Intermediate & Advanced SEO | ianmcintosh
Why should I add URL parameters where Meta Robots NOINDEX available?
Today, I checked Bing Webmaster Tools and learned about the Ignore URL Parameters feature. Bing Webmaster Tools shows me certain parameters for URLs where I have added META Robots with NOINDEX, FOLLOW. I can see the canopy_fabric_search parameter in the suggested section. It's due to the following kind of URLs:
http://www.vistastores.com/patio-umbrellas?canopy_fabric_search=1728
http://www.vistastores.com/patio-umbrellas?canopy_fabric_search=1729
http://www.vistastores.com/patio-umbrellas?canopy_fabric_search=1730
http://www.vistastores.com/patio-umbrellas?canopy_fabric_search=2239
But I have added META Robots NOINDEX, FOLLOW to disallow crawling. So, why should it happen?
Intermediate & Advanced SEO | CommercePundit
Subdomains - duplicate content - robots.txt
Our corporate site provides MLS data to users, with the end goal of generating leads. Each registered lead is assigned to an agent, essentially in a round-robin fashion. However, we also give each agent a domain of their choosing that points to our corporate website. The domain can be whatever they want, but upon loading it is immediately directed to a subdomain. For example, www.agentsmith.com would be redirected to agentsmith.corporatedomain.com. Finally, any leads generated from agentsmith.easystreetrealty-indy.com are always assigned to Agent Smith instead of the agent pool (by parsing the current host name).
In order to avoid being penalized for duplicate content, any page that is viewed on one of the agent subdomains always has a canonical link pointing to the corporate host name (www.corporatedomain.com). The only content difference between our corporate site and an agent subdomain is the phone number and contact email address where applicable.
Two questions:
1. Can/should we use robots.txt or robots meta tags to tell crawlers to ignore these subdomains, but obviously not the corporate domain?
2. If the answer to question 1 is yes, would it be better for SEO to do that, or leave it how it is?
Intermediate & Advanced SEO | EasyStreet
Blocking Dynamic URLs with Robots.txt
Background: My e-commerce site uses a lot of layered navigation and sorting links. While this is great for users, it results in a lot of URL variations of the same page being crawled by Google. For example, a standard category page:
www.mysite.com/widgets.html
...which uses a "Price" layered navigation sidebar to filter products based on price, also produces the following URLs, which link to the same page:
http://www.mysite.com/widgets.html?price=1%2C250
http://www.mysite.com/widgets.html?price=2%2C250
http://www.mysite.com/widgets.html?price=3%2C250
As there are literally thousands of these URL variations being indexed, I'd like to use robots.txt to disallow these variations.
Question: Is this a wise thing to do? Or does Google take into account layered navigation links by default, so I don't need to worry? To implement it, I was going to add the following to robots.txt:
User-agent: *
Disallow: /*?
Disallow: /*=
...which would prevent any dynamic URL with a '?' or '=' from being indexed. Is there a better way to do this, or is this a good solution? Thank you!
Intermediate & Advanced SEO | AndrewY