Spider Indexed Disallowed URLs
-
Hi there,
To reduce the huge amount of duplicate content and duplicate titles for a client, we disallowed all spiders for some areas of the site in August via the robots.txt file. This was followed by a huge decrease in errors in our SEOmoz crawl report, which, of course, made us happy.
Since then, we haven't changed anything in the back end, the robots.txt file, the FTP setup, or the website itself. But when our crawl report came in this November, all of a sudden all the errors were back. We checked the errors and noticed URLs that are definitely disallowed. That these URLs are disallowed is also confirmed by Google Webmaster Tools and other robots.txt checkers, and when we search for a disallowed URL in Google, it says the page is blocked for spiders. Where did these errors come from? Did the SEOmoz spider ignore our disallow rules, or something like that? You can see the drop and the subsequent increase in errors in the attached image.
Thanks in advance.
[Attached image: LAAFj.jpg]
-
This was what I was looking for! The pages are indexed by Google, yes, but they aren't being crawled by Googlebot (as Webmaster Tools and the Matt Cutts video tell me); they are probably just crawled occasionally by Rogerbot (not every month). Thank you very much!
-
Yes, canonicalization or a meta noindex tag would of course be better for passing any link juice, but we aren't worried about that. I was worried Google would still see the pages as duplicates (I couldn't really distill that from the article, although it was useful!). Barry Smith answered that last issue in the answer below, but I do want to thank you for your insight.
-
The directives issued in a robots.txt file are just a suggestion to bots, though one that Google does follow.
Malicious bots will ignore them, and occasionally even bots that normally follow the directives can slip up (which is probably what happened here).
Google may also index pages that you've blocked if it has found them via a link, as explained here - http://www.youtube.com/watch?v=KBdEwpRQRD0 - and for an overview of what Google does with robots.txt files you can read this - http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
I'd suggest looking at other ways of fixing the problem than just blocking 1,500 pages, but I see you've already considered what it would take to fix the issues without removing the pages from the crawl and decided the value isn't there.
If WMT is telling you the pages are blocked from being crawled, I'd believe it.
Try searching Google for a URL that should be blocked and see if it's indexed, or do a site:http://yoursitehere.com search and see if blocked pages come up.
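If you want to double-check the rules yourself outside of WMT, here's a rough sketch using Python's built-in robots.txt parser (the domain and paths below are placeholders, not your real URLs):

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (placeholder domain)
parser = RobotFileParser()
parser.set_url("http://yoursitehere.com/robots.txt")
parser.read()

# Check a couple of example URLs against the rules for different user agents
for agent in ("Googlebot", "rogerbot", "*"):
    for url in ("http://yoursitehere.com/blocked-area/page.html",
                "http://yoursitehere.com/normal-page.html"):
        status = "allowed" if parser.can_fetch(agent, url) else "blocked"
        print(agent, status, url)

If that (or any other checker) reports the URLs as blocked for "*", the rules themselves are fine and the stray errors come from the crawler, not your robots.txt.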
-
Your assumptions about what robots.txt does may not match reality. Crawling a page isn't the same thing as indexing its content so it appears in the SERPs, and even with robots.txt in place your pages can still be crawled.
http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions
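As a rough illustration of the difference: if the goal were to keep a page out of the results entirely, the usual approach is a meta robots noindex tag on a page that remains crawlable, so the directive can actually be seen, e.g.:

<!-- placed in the <head> of a page that is NOT blocked in robots.txt -->
<meta name="robots" content="noindex, follow">

The same directive can also be sent as an HTTP header (X-Robots-Tag: noindex) for non-HTML files. Robots.txt, by contrast, only asks well-behaved bots not to crawl; it doesn't stop a URL from being indexed if Google finds links pointing to it.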
-
Thanks, Mr. Goyal. Of course we have thought about ways to do this and worked out some options, but implementing those solutions would be disastrous from a time and cost perspective. The pages we have blocked from the spiders aren't needed for visibility in the search engines and don't carry much link juice; they are only there for visitors, so we decided we don't really need them for our SEO efforts in a positive way. But if these pages do get crawled and the engines notice the huge amount of duplicates, I reckon that would have a negative influence on our site as a whole.
So our problem really comes down to doubts about the legitimacy of the report. If SEOmoz can crawl these pages, Googlebot probably could too, right? After all, we've used: User-agent: *
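For reference, the relevant block looks roughly like this (the directory names are placeholders for illustration, not our real paths):

User-agent: *
Disallow: /duplicate-area-one/
Disallow: /duplicate-area-two/

So any crawler that honours the wildcard user-agent line should be skipping those areas entirely.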
-
Mark
Are you blocking all bots from spidering these erroneous URLs? Is there a way for you to fix them so that they either no longer exist or are no longer duplicates?
I'd recommend looking at it from that perspective as well, not just with the intent of making those errors disappear from the SEOmoz report.
I hope this helps.