Google Crawler Error / restricting crawling
-
Hi
On a Magento instance we manage there is an advanced search. As part of the ongoing enhancement of the instance, we altered the advanced search options so that there are fewer, more relevant ones.
The issue is that Google has crawled and catalogued the advanced search with the now-removed options in the query string. Google keeps crawling these out-of-date advanced searches, and each stale search now returns a 500 error.
Currently Google is attempting to crawl these pages twice a day.
I have implemented the following to stop this:
1. Requested that the URL be removed via Webmaster Tools, selecting the directory option with the URI:
http://www.domian.com/catalogsearch/advanced/result/
2. Added Disallow rules to robots.txt:
Disallow: /catalogsearch/advanced/result/*
Disallow: /catalogsearch/advanced/result/
3. Added rel="nofollow" to the links on the site that point to the advanced search.
Below is a list of the links it is crawling or attempting to crawl: 12 links, each crawled twice a day and each resulting in a 500 status.
Can anything else be done?
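The robots.txt rule from step 2 can be sanity-checked offline. The following is only a minimal sketch under stated assumptions, not part of the original post: it assumes the rule sits under a `User-agent: *` group (the post does not show one), uses www.example.com as a placeholder host, and the helper name `is_blocked` is hypothetical. It relies on Python's standard `urllib.robotparser`, which does plain prefix matching and does not interpret the `*` wildcard, so only the non-wildcard rule is included. Because rules match by prefix, that one line should already cover every query-string variant.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the User-agent line is an assumption,
# the Disallow path is taken from the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /catalogsearch/advanced/result/
"""


def is_blocked(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text disallows `url` for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Placeholder URLs on a hypothetical host.
    stale = "http://www.example.com/catalogsearch/advanced/result/?name=foo"
    form = "http://www.example.com/catalogsearch/advanced/"
    print(is_blocked(ROBOTS_TXT, stale))
    print(is_blocked(ROBOTS_TXT, form))
```

Note that a robots.txt block only tells a compliant crawler not to fetch the URL; it does not by itself remove a URL that is already in the index.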
-
Seems like you've done everything right. You could also add a meta robots "NOINDEX, FOLLOW" tag to those pages.
I'd also double-check the "Linked from" referrers in Webmaster Tools, just to make sure you haven't missed any live followed links pointing to those pages.
When did you submit the removal request, and what is the status? (approved, denied, pending?) Another question, are those pages in Google's index?
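To confirm that a page really carries the suggested meta robots tag, its HTML can be inspected. The snippet below is a rough sketch using only Python's standard `html.parser`; the class and function names (`RobotsMetaParser`, `robots_directives`) and the sample markup are illustrative assumptions, not anything from the site in question. Keep in mind that a crawler can only read this tag if it is allowed to fetch the page and the page renders normally, so it will not be seen on a URL that is blocked in robots.txt or that returns a 500.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma separated and case insensitive.
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html: str) -> set:
    """Return the set of lower-cased robots directives declared in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    # Hypothetical page markup for illustration.
    sample = '<html><head><meta name="robots" content="NOINDEX, FOLLOW"></head></html>'
    print("noindex" in robots_directives(sample))
```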
Related Questions
-
Home Page Being Indexed / Referral URLs /
I have a few questions related to home page URLs being indexed, canonicalization, and GA reporting...
1. I can view the home page by typing in domain.com, domain.com/ and domain.com/index.htm. There are no redirects and it's canonicalized to point to domain.com/index.htm. How important is it to have redirects? I don't want unnecessary redirects or canonical tags, but I noticed the trailing slash can sometimes be typed in manually on other pages, sometimes not.
2. When I do a site search (site:domain.com), sometimes the home page shows up as "domain.com/", never "domain.com/index.htm" or "domain.com", and sometimes the home page doesn't show up at all. This seems to change several times a day, sometimes within 15 minutes. I have no idea what is causing it and I don't know if it has anything to do with #1. In a perfect world, I would ask for the /index.htm to be dropped and redirected to .com/, and the canonical to point to .com/
3. I've noticed in GA that / , /index.htm, and a weird Google referral URL (/index.htm?referrer=https://www.google.com/) all show up as top pages. I think the / and /index.htm are there because I haven't set up a default URL in GA, but I'm not sure what would cause the referrer. I tracked back to when the referrer URL started to show up in the top pages, and it was right around the time they moved over to https://, so I'm not sure what the best option is to remove that.
I know this is a lot - I appreciate any insight anyone can provide.
Technical SEO | | DigMS0 -
Website Redesign / Switching CMS / .aspx and .html extensions question
Hello everyone, We're currently preparing a website redesign for one of our important websites. It is our most important website, with good rankings and a lot of visitors from search engines, so we want to be really careful with the redesign. Our strategy is to keep as much in place as possible. At first, we are only changing the styling of the website; we will keep the content, the structure, and as many of the URLs the same as possible. However, we are switching from a custom-built CMS which created URLs like www.homepage.com/default-en.aspx
Now we would like to keep this URL the same, but our new CMS does not support this kind of URL. The same goes for, for instance, the URL: www.homepage.com/products.html
We're not able to recreate this URL in our new CMS. What would be the best strategy for SEO? Keep the URLs like this:
www.homepage.com/default-en
www.homepage.com/products
Or doesn't it really matter, since Google will view these as completely different URLs? And what would the impact of these URL changes be? Thanks a lot in advance! Best Regards, Jorg
Technical SEO | | NielsB1 -
Google not pulling my favicon
Several sites use Google's favicon service to load favicons instead of loading them from the website itself. Our favicon is not being pulled from our site correctly; instead it shows the default "world" image. https://plus.google.com/_/favicon?domain=www.example.com is the address used to pull a favicon. When I post on G+ or see other sites that use that service to pull favicons, ours isn't displayed, even though it shows up in Chrome, Firefox, IE, etc. and we have the correct meta in all pages of our site. Any idea why this is happening? Or how to "ping" Google to update it?
Technical SEO | | FedeEinhorn0 -
Google local listings
I'm working with a gutter installation company, and we're ranking for all the top keywords in Google. The only thing we're not ranking for is the map results for the keyword "gutter ma". Since we're located in Springfield, MA, I think Google favors certain areas around Boston because it's more the center of Massachusetts. What can I do to improve my rankings in Maps for this keyword? I know it won't work with a PO box, since I need to confirm an address. Thanks
Technical SEO | | vladraush990 -
Why is an error page showing when searching our website using Google "site:" search function?
When I search our company website using the Google site search function "site:jwsuretybonds.com", a 400 Bad Request page is at the top of the listed pages. I had someone else at our company do the same site search and the 400 Bad Request did not appear. Is there a reason this is happening, and are there any ramifications to it?
Technical SEO | | TheDude0 -
Having a massive amount of duplicate crawl errors
I'm getting over 400 crawl errors for duplicate content that look like this: http://www.mydomain.com/index.php?task=login&prevpage=http%3A%2F%2Fwww.mydomain.com%2Ftag%2Fmahjon http://www.mydomain.com/index.php?task=login&prevpage=http%3A%2F%2Fwww.mydomain.com%2Findex.php%3F etc. So there seems to be something in my login script that is not working. Does anyone know how to fix this? Thanks
Technical SEO | | stanken0 -
Google Webmaster tools error?
I am trying to set the URL preference in Google Webmaster Tools for my site. However, when I try to save it, it tells me to verify that I own the site. I have already done this, so where exactly can I go to verify that I own the site? Maybe I am wrong and have not done this already, but even on the homepage of Webmaster Tools I don't see an option to "verify".
Technical SEO | | ENSO0 -
Google causing Magento Errors
I have an online shop run using Magento. I recently upgraded to version 1.4 and installed an extension called Lightspeed, a caching module which makes tremendous improvements to Magento's performance. Unfortunately, a configuration problem meant that I had to disable the module, because it was generating session-related errors if you entered the site from any page other than the home page. The site is now working as expected.
I have Magento's error notification set to email, and I've not received emails for errors generated by visitors. However, over a 72-hour period I received a deluge of error emails which were being caused by Googlebot. It was generating an error in a file called lightspeed.php. Here is an example:
URL: http://www.jacksgardenstore.com/tahiti-vulcano-hammock
IP Address: 66.249.66.186
Time: 2011-06-11 17:02:26 GMT
Error: Cannot send headers; headers already sent in /home/jack/jacksgardenstore.com/user/jack_1.4/htdocs/lightspeed.php, line 444
Several things of note:
I deleted lightspeed.php from the server before any of these error messages began to arrive.
lightspeed.php was never exposed in the URL at any time. It was referred to in a mod_rewrite rule in .htaccess, which I also commented out.
If you clicked on the URL in the error message, it loaded in the browser as expected, with no error messages.
It appears that Google has cached a version of the page which briefly existed whilst Lightspeed was enabled. But I thought that Google cached generated HTML. Since when does it cache a server-side PHP file?
I've just used the Fetch as Googlebot facility in Webmaster Tools for the URL in the above error message, and it returns the page as expected, with no errors. I've had no errors at all in the last 48 hours, so I'm hoping it has just sorted itself out. However, I'm concerned about any Google-related implications. Any insights would be greatly appreciated. Thanks Ben
Technical SEO | | atticus70