Manipulate Googlebot
-
**Problem: I have found something weird in the server log, shown below. Googlebot is visiting folders and files that do not exist at all. There is no /photo folder on the server, yet Googlebot requests files inside it and gets 404 errors.**
I wonder whether these are SEO hacking attempts, and how someone could manage to manipulate Googlebot.
==================================================
66.249.71.200 - - [22/Aug/2012:02:31:53 -0400] "GET /robots.txt HTTP/1.0" 200 2255 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.25 - - [22/Aug/2012:02:36:55 -0400] "GET /photo/pic24.html HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.26 - - [22/Aug/2012:02:37:03 -0400] "GET /photo/pic20.html HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.200 - - [22/Aug/2012:02:37:11 -0400] "GET /photo/pic22.html HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.200 - - [22/Aug/2012:02:37:28 -0400] "GET /photo/pic19.html HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.26 - - [22/Aug/2012:02:37:36 -0400] "GET /photo/pic17.html HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.200 - - [22/Aug/2012:02:37:44 -0400] "GET /photo/pic21.html HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
-
Hi
This is a valid concern.
As Mat correctly stated, Googlebot is not easily manipulated.
Having said that, Googlebot impersonation is a sad fact. Recently we released a Fake Googlebot study in which we found that 21% of all Googlebot visits are made by impersonators: fairly "innocent" SEO tools used for competitor check-ups, various spammers, and even malicious scanners that use the Googlebot user-agent to slip through the cracks and lay the path for a more serious attack to come (DDoS, IRA, etc.).
To identify your visitors, you can use Botopedia's "IP check tool"; it will cross-verify the IP and help reveal most fake bots.
(I've already searched for 66.249.71.25 and it's legit; see attached image.) Still, IPs can be spoofed.
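If you want to double-check an IP yourself without any third-party tool, Google's documented verification method is a reverse DNS lookup followed by a forward lookup. Here is a minimal Python sketch of that check (illustrative only; the IP comes from the log above):

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Reverse/forward DNS check: genuine Googlebot IPs resolve to a
    *.googlebot.com or *.google.com hostname, and that hostname resolves
    back to the same IP. Anything else is an impersonator."""
    try:
        host = socket.gethostbyaddr(ip)[0]        # reverse DNS lookup
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return socket.gethostbyname(host) == ip   # forward-confirm the hostname
    except OSError:
        return False

# IP taken from the access log in the question above
print(is_real_googlebot("66.249.71.25"))
```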
So, if in doubt, I would take a "better safe than sorry" approach and advise you to look into free bad-bot protection services (there are several good ones). GL
-
If anyone did manage to get control of Googlebot, they could find better uses to put it to than that.
Much more likely is that there are links to those URLs somewhere; they may well be on someone else's site. Google is following each link to see what is there, then finding nothing. However, it works on a file-by-file basis rather than by directory, so it can happen quite a bit.
If you want to stop it clogging up your error logs (and ensure that Googlebot's crawl cycles are spent indexing better stuff), just block that directory in your robots.txt file.
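For example, a minimal robots.txt rule that keeps compliant crawlers out of the phantom folder would look like this (adjust the path to whatever actually shows up in your logs):

```
User-agent: *
Disallow: /photo/
```

This only stops the crawling; the stray links on other sites will still exist, but well-behaved bots such as Googlebot will no longer follow them into the missing directory.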
Related Questions
-
Crawl and Indexation Error - Googlebot can't/doesn't access specific folders on microsites
Hi, my first time posting here. I am just looking for some feedback on an indexation issue we have with a client, and on possible next steps or items I may have overlooked. To give some background, our client operates a website for the core brand and also a number of microsites based on specific business units, so you have corewebsite.com along with bu1.corewebsite.com, bu2.corewebsite.com. The content structure isn't ideal, as each microsite follows a structure of bu1.corewebsite.com/bu1/home.aspx, bu2.corewebsite.com/bu2/home.aspx and so on. In addition, each microsite has duplicate folders from the other microsites, so bu1.corewebsite.com has the indexable folder bu1.corewebsite.com/bu1/home.aspx but also bu1.corewebsite.com/bu2/home.aspx; the same with bu2.corewebsite.com, which has bu2.corewebsite.com/bu2/home.aspx but also bu2.corewebsite.com/bu1/home.aspx. There are 5 different business units, so you have this duplicate content scenario across all microsites. This situation is being addressed in the medium-term development roadmap and will be rectified in the next iteration of the site, but that is still a ways out.

The issue: About 6 weeks ago we noticed a drop-off in search rankings for two of our microsites (bu1.corewebsite.com and bu2.corewebsite.com). Over a period of 2-3 weeks pretty much all our terms dropped out of the rankings and search visibility dropped to essentially 0. I can see that pages from the websites are still indexed, but oddly it is the duplicate content pages, so bu1.corewebsite.com/bu3/home.aspx or bu1.corewebsite.com/bu4/home.aspx is still indexed; similarly, on the bu2.corewebsite microsite, bu2.corewebsite.com/bu3/home.aspx and bu4.corewebsite.com/bu3/home.aspx are indexed, but no pages from the BU1 or BU2 content directories seem to be indexed under their own microsites. Logging into webmaster tools I can see there is a "Google couldn't crawl your site because we were unable to access your site's robots.txt file." error. This was a bit odd as there was no robots.txt in the root directory, but I got some weird results when I checked the BU1/BU2 microsites in technicalseo.com's robots.txt tool. Also, because there is a redirect from bu1.corewebsite.com/ to bu1.corewebsite.com/bu4.aspx, I thought maybe there could be something there, so we removed the redirect and added a basic robots.txt to the root directory for both microsites. After this we saw a small pickup in site visibility; a few terms popped into our Moz campaign rankings but dropped out again pretty quickly. Also, the error message in GSC persisted.

Steps taken so far after that:
1. In Google Search Console, I confirmed there are no manual actions against the microsites.
2. Confirmed there are no instances of noindex on any of the pages for BU1/BU2.
3. A number of the main links from the root domain to microsites BU1/BU2 have a rel="noopener noreferrer" attribute, but we looked into this and found it has no impact on indexation.
4. Looking into this issue we saw some people had similar issues when using Cloudflare, but our client doesn't use this service.
5. Using a response/redirect header checker tool, we noticed a timeout when trying to mimic Googlebot accessing the site.
6. Following on from point 5, we got hold of a week of server logs from the client, and I can see Googlebot successfully pinging the site and not getting 500 response codes from the server... but couldn't see any instance of it trying to index microsite BU1/BU2 content.

So it seems to me that the issue could be something server-side, but I'm at a bit of a loss for next steps to take. Any advice at all is much appreciated!
Intermediate & Advanced SEO | ImpericMedia
-
How do I know if I am correctly solving an uppercase url issue that may be affecting Googlebot?
We have a large e-commerce site (10k+ SKUs), https://www.flagandbanner.com. As I have begun analyzing how to improve it, I have discovered that we have thousands of URLs containing uppercase characters, for instance: https://www.flagandbanner.com/Products/patriotic-paper-lanterns-string-lights.asp. This is inconsistently applied throughout the site. I directed our website vendor to fix the issue and they placed 301 redirects via a rule in the web.config file. Any URL that contains an uppercase character now redirects to a lowercase version. However, as I use Screaming Frog to monitor our site, I see all these 301 redirects, thousands of them. The XML sitemap still shows the uppercase versions. We have had indexing issues as well. So I'm wondering, what is the most effective way to make sure that I'm not placing an extra burden on Googlebot when it indexes our site? Should I have just not cared about the uppercase issue and left it alone?
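For reference, the kind of web.config rule a vendor typically adds for this is the standard IIS URL Rewrite lowercase pattern. The sketch below is illustrative only, not the vendor's actual rule:

```xml
<!-- Illustrative sketch: common IIS URL Rewrite rule for forcing lowercase URLs -->
<system.webServer>
  <rewrite>
    <rules>
      <rule name="Redirect to lowercase" stopProcessing="true">
        <!-- Match any requested path that contains an uppercase letter -->
        <match url=".*[A-Z].*" ignoreCase="false" />
        <!-- 301-redirect to the lowercased equivalent of the matched path -->
        <action type="Redirect" url="{ToLower:{R:0}}" redirectType="Permanent" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>
```

{ToLower:} is a built-in function of the IIS URL Rewrite module, and {R:0} is the full matched path.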
Intermediate & Advanced SEO | webrocket
-
Https Homepage Redirect & Issue with Googlebot Access
Hi all, I have a question about Google correctly accessing a site that has a 301 redirect to https on the homepage. Here's an overview of the situation, and I'd really appreciate any insight from the community on what the issue might be.

Background info: My homepage is set up as a 301 redirect to an https version of the homepage (some users log in, so we need the SSL). Only 2 pages on the site are under SSL and the rest of the site is http. We switched to the SSL in July but have not seen any change in our rankings despite efforts increasing backlinks and output of content. Even though Google has indexed the SSL page of the site, it appears that it is not linking up the SSL page with the rest of the site in its search and tracking.

Why do we think this is the case? The diagnosis:

1) When we do a Google Fetch on our http homepage, it appears that Google is only reading the 301 redirect instructions (as shown below) and is not finding its way over to the SSL page, which has all the correct page title and meta information.

```
HTTP/1.1 301 Moved Permanently
Date: Fri, 08 Nov 2013 17:26:24 GMT
Server: Apache/2.2.16 (Debian)
Location: https://mysite.com/
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 242
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

<title>301 Moved Permanently</title>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://mysite.com/">here</a>.</p>
<hr>
<address>Apache/2.2.16 (Debian) Server at mysite.com</address>
```

2) When we view a list of external backlinks to our homepage, it appears that the backlinks built after we switched to the SSL homepage have been separated from the backlinks built before the SSL. Even on Open Site, we are only seeing the backlinks that were achieved before we switched to the SSL, and we are not able to track any backlinks that have been added after the SSL switch. This leads us to believe that the new links are not adding any value to our search rankings.

3) When viewing Google Webmaster Tools, we receive no information about our homepage, only all the non-https pages. I added an https account to Google Webmaster Tools, and in that version we ONLY receive the information about our homepage (and the other SSL page on the site).

What is the problem? My concern is that we need to do something specific with our sitemap or with the 301 redirect itself in order for Google to read the whole site as one entity and attribute the reporting/backlinks to one site. Again, Google is indexing all of our pages, but it seems to be doing so in a disjointed way that is breaking down the link juice and value being built up by our SSL homepage. Can anybody help? Thank you for any advice or input you might be able to offer. -Greg
Intermediate & Advanced SEO | G.Anderson
-
Received "Googlebot found an extremely high number of URLs on your site:" but most of the example URLs are noindexed.
An example URL can be found here: http://symptom.healthline.com/symptomsearch?addterm=Neck%20pain&addterm=Face&addterm=Fatigue&addterm=Shortness%20Of%20Breath A couple of questions: Why is Google reporting an issue with these URLs if they are marked as noindex? What is the best way to fix the issue? Thanks in advance.
Intermediate & Advanced SEO | nicole.healthline
-
Best way to view Global Navigation bar from GoogleBot's perspective
Hi, links in the global navigation bar of our website do not show up when we look at the Google cache (text-only version) of the page. These links use style="display:none;" when we look at the HTML source. But if I use the "User Agent Switcher" add-on in Firefox and set it to Googlebot, the links in the global nav are displayed. I am wondering what is the best way to find out whether Google can or cannot see the links. Thanks for the help! Supriya.
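One low-tech way to compare what the server returns to a normal browser versus a Googlebot user-agent is to fetch the page with each user-agent string and compare the HTML. A minimal Python sketch follows (the URL is a placeholder, and this only shows the raw HTML the server sends, not how Google renders or executes it):

```python
import urllib.request

URL = "https://www.example.com/"  # placeholder: the page you want to test

UA_GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
UA_BROWSER = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

def fetch(url: str, user_agent: str) -> str:
    """Return the raw HTML the server sends for a given User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

as_googlebot = fetch(URL, UA_GOOGLEBOT)
as_browser = fetch(URL, UA_BROWSER)

# If the nav markup appears in one response but not the other, the server is
# serving different HTML per user-agent, which is worth investigating.
print("display:none in Googlebot response:", "display:none" in as_googlebot)
print("display:none in browser response:  ", "display:none" in as_browser)
```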
Intermediate & Advanced SEO | SShiyekar
-
Googlebot on paywall made with cookies and local storage
My question is about paywalls made with cookies and local storage. We are changing a website with free content to an open paywall with a 5-article weekly view limit. The paywall is made to work with cookies and local storage. The article views are stored in local storage, but you have to have cookies enabled so that you can read the free articles. If you don't have cookies enabled, we serve an error page (otherwise the paywall would be easy to bypass). Can you say how this affects SEO? We would still like Google to index all the article pages that it does now. Would it be cloaking if we treated Googlebot differently, so that when it does not have cookies enabled it would still be able to index the page?
Intermediate & Advanced SEO | OPU
-
Googlebot found an extremely high number of URLs on your site
I keep getting the "Googlebot found an extremely high number of URLs on your site" message in GWMT for one of the sites that I manage. The error is as below:

"Googlebot encountered problems while crawling your site. Googlebot encountered extremely large numbers of links on your site. This may indicate a problem with your site's URL structure. Googlebot may unnecessarily be crawling a large number of distinct URLs that point to identical or similar content, or crawling parts of your site that are not intended to be crawled by Googlebot. As a result Googlebot may consume much more bandwidth than necessary, or may be unable to completely index all of the content on your site."

I understand the nature of the message; the site uses faceted navigation and is genuinely generating a lot of duplicate pages. However, in order to stop this from becoming an issue we do the following (examples of both are shown below):

- No-index a large number of pages using the on-page meta tag.
- Use a canonical tag where it is appropriate.

But we still get the error, and a lot of the example pages that Google suggests are affected by the issue are actually pages with the no-index tag. So my question is: how do I address this problem? I'm thinking that as it's a crawling issue, the solution might involve the no-follow meta tag. Any suggestions appreciated.
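For reference, the two measures described above typically look like this in a page's head (the URL is a placeholder):

```html
<!-- Illustrative only: the noindex meta tag and canonical link described above -->
<head>
  <!-- Tells crawlers not to index this faceted page (they can still crawl it) -->
  <meta name="robots" content="noindex, follow">
  <!-- Points duplicate variations at the preferred version of the page -->
  <link rel="canonical" href="https://www.example.com/category/widgets/">
</head>
```

Note that a crawler has to fetch a page before it can see its noindex tag, so the tag by itself does not reduce how many URLs Googlebot discovers and crawls.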
Intermediate & Advanced SEO | BenFox
-
How to find what Googlebot actually sees on a page?
1. When I disable JavaScript in Firefox and load our home page, it is missing the entire middle section.
2. Also, the global nav dropdown menu does not display at all (with JavaScript disabled). I believe this is not good.
3. But when I type the website name into Google search, click on the cached version of the home page, and then click on the text-only version, it displays the global nav links fine.
4. When I switch the user agent to Googlebot (using the Firefox plugin "User Agent Switcher"), the home page and global nav display fine.

Should I be worried about #1 and #2 then? How do I find what Googlebot actually sees on a page? (I have tried "Fetch as Googlebot" from GWT. It displays source code.) Thanks for the help! Supriya.
Intermediate & Advanced SEO | Amjath