GWT False Reporting or GoogleBot has weird crawling ability?
-
Hi, I hope someone can help me.
I have launched a new website and am trying hard to make everything perfect. I have been using Google Webmaster Tools (GWT) to check that everything is as it should be, but the crawl errors it reports do not match my site. I mark them as fixed, and when I check again the next day it reports the same or similar errors.
Example:
http://www.mydomain.com/category/article/ (this would be a correct structure for the site).
GWT reports:
http://www.mydomain.com/category/article/category/article/ 404 (it does not exist, never has and never will). I have been to the pages listed as linking to this URL and they do not contain links in this form. I have checked the page source, and all links on those pages use the correct structure; I have not been able to replicate this type of crawl.
This happens across most of the site. I have a few hundred pages, all ending in a trailing slash, and most of them are reported in this manner, making it look like I have close to 1,000 404 errors, even though I cannot replicate this crawl using many different methods.
The site uses an .htaccess file with redirects and a rewrite condition.
Rewrite Condition:
# Need to redirect when no trailing slash
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !\.(html|shtml)$
RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*)$ /$1/ [L,R=301]
The above condition forces the trailing slash on folders.
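To see which paths this rule would touch, here is a rough Python sketch of the same logic. It is only an approximation: it assumes the patterns were originally `(.*)` and `\.(html|shtml)$`, it tests the URL path rather than the real `REQUEST_FILENAME`, and the function name `trailing_slash_redirect` and its `is_file` flag (standing in for the `!-f` filesystem check) are made up for illustration.

```python
import re
from typing import Optional


def trailing_slash_redirect(path: str, is_file: bool = False) -> Optional[str]:
    """Return the 301 target for `path`, or None if the rule would not fire.

    `is_file` is a hypothetical stand-in for the `!-f` test, which needs
    the real filesystem on the server.
    """
    # RewriteCond %{REQUEST_FILENAME} !-f
    if is_file:
        return None
    # RewriteCond %{REQUEST_FILENAME} !\.(html|shtml)$
    if re.search(r"\.(html|shtml)$", path):
        return None
    # RewriteCond %{REQUEST_URI} !(.*)/$
    if path.endswith("/"):
        return None
    # RewriteRule ^(.*)$ /$1/ [L,R=301]
    # In .htaccess the leading slash is stripped before matching,
    # so the target is the same path with one slash appended.
    return "/" + path.lstrip("/") + "/"


if __name__ == "__main__":
    for p in ["/category/article", "/category/article/", "/article.html"]:
        print(p, "->", trailing_slash_redirect(p))
```

Under these assumptions the rule only ever appends a single slash to a path that lacks one; it never repeats path segments, so on its own it would not turn /category/article/ into /category/article/category/article/.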
Then we are using redirects in this manner:
Redirect 301 /article.html http://www.domain.com/article/
In addition to the above, we had a development site while I was building the new site, at http://dev.slimandsave.co.uk. It was spidered without my knowledge, and by the time I found out it was too late. So when I put the site live, I left the development domain in place (http://dev.domain.com) and redirected it like so:
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteRule ^ - [E=protossl]
RewriteCond %{HTTPS} on
RewriteRule ^ - [E=protossl:s]
RewriteRule ^ http%{ENV:protossl}://www.domain.com%{REQUEST_URI} [L,R=301]
</IfModule>
Is there anything I have done that would cause this type of redirect 'loop'?
Any help greatly appreciated.
-
Yeah - do this!
-
Anyone any thoughts on this?
-
Sorry, I should also add that the URL structure Google generates is like this:
http://www.domain.com/category/article/
http://www.domain.com/category/article/same-category/differentarticle/
http://www.domain.com/category/article/same-category/another-different-article/
http://www.domain.com/category/article/another-different-category/differentarticle/
etc. It is as if it gets to a category article, then moves sideways and somehow appends the new path onto the current URL instead of replacing the current path.
-
It doesn't sound like GWT is reporting falsely. You may want to check your trailing-slash URL rewrite. There seems to be an issue there: what you describe sounds like URLs being written incorrectly, which generates the bad URLs that then show up in GWT.
Your 301 looks OK. If the dev site was spidered and indexed, you should just add the site to GWT, use the URL removal tool to remove it from the index, and then remove the site and redirect.
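One thing worth ruling out while checking how the URLs are written (this is a guess, nothing in the question confirms it): a link whose href has no leading slash is resolved by crawlers relative to the current page's "folder". The short Python sketch below uses the standard library's `urljoin` to show the difference; the helper name `resolve` is just for illustration, and the paths are taken from the examples in the question.

```python
from urllib.parse import urljoin


def resolve(page_url: str, href: str) -> str:
    """Resolve an href the way a browser or crawler would, relative to page_url."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = "http://www.domain.com/category/article/"

    # Root-relative href (leading slash): resolved against the site root.
    print(resolve(page, "/same-category/differentarticle/"))

    # Document-relative href (no leading slash): resolved against the
    # current path, so the new segments are appended to /category/article/.
    print(resolve(page, "same-category/differentarticle/"))
```

The second call yields http://www.domain.com/category/article/same-category/differentarticle/, which has the same shape as the URLs reported in GWT. If that is the cause, the hrefs could be anywhere Googlebot reads them, for example in templates, scripts, or a sitemap, so it may be worth searching for links that lack a leading slash.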