How is Google crawling and indexing this directory listing?
-
We have three Directory Listing pages that are being indexed by Google:
http://www.ccisolutions.com/StoreFront/jsp/
http://www.ccisolutions.com/StoreFront/jsp/html/
http://www.ccisolutions.com/StoreFront/jsp/pdf/
How and why is Googlebot crawling and indexing these pages? Nothing else links to them (although /jsp/html/ and /jsp/pdf/ both link back to /jsp/). They aren't disallowed in our robots.txt file and I understand that this could be why.
If we disallow them in our robots.txt file, will this prevent Googlebot from crawling and indexing those Directory Listing pages without preventing it from crawling and indexing the content that resides there, which is used to populate pages on our site?
Having these pages indexed in Google is causing a myriad of issues, not the least of which is duplicate content.
For example, this file CCI-SALES-STAFF.HTML (which appears on this Directory Listing referenced above - http://www.ccisolutions.com/StoreFront/jsp/html/) clicks through to this Web page:
http://www.ccisolutions.com/StoreFront/jsp/html/CCI-SALES-STAFF.HTML
This page is indexed in Google and we don't want it to be. But so is the actual page where we intended the content contained in that file to display: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff
As you can see, this results in duplicate content problems.
Is there a way to disallow Googlebot from crawling that Directory Listing page, and, provided that we have this URL in our sitemap: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff, solve the duplicate content issue as a result?
For example:
Disallow: /StoreFront/jsp/
Disallow: /StoreFront/jsp/html/
Disallow: /StoreFront/jsp/pdf/
Can we do this without risking blocking Googlebot from content we do want crawled and indexed?
Many thanks in advance for any and all help on this one!
-
Thanks so much to you all. This has gotten us closer to an answer. We are consulting with the folks who developed the Web store to make sure that these solutions won't break other things if implemented, particularly something mentioned to me by our IT Director called "Sim links" - I'll keep you posted!
-
I am referring to Web users. If a user or search engine tries to view those directory listing pages, they will get a Forbidden message, which is what you want to happen. The content in those directories will still be accessible to the pages on the site, since the files still exist in those directories, but the pages listing the files in those directories won't be accessible in the browser to users/search engines. In other words, turning off the Directory Indexes will not affect any of the content on the site.
-
He's got the right idea: you shouldn't be serving these pages (unless you have a specific reason to). The problem is that these index pages are returning a status code of 200 OK, so Google assumes it's fine to index them. These pages should come back with either a 404 or a 403 (Forbidden), and users then wouldn't be able to browse your site through these directory pages.
Disallowing in robots.txt may not immediately remove these from search results. You may get that lovely description underneath the results that says, "A description for this result is not available because of this site's robots.txt".
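For anyone who wants to check this themselves, here is a rough Python sketch, not a definitive tool. It assumes the server is Apache and that its auto-generated listings use the default "Index of /..." page title; the function names are made up for illustration.

```python
import re
import urllib.error
import urllib.request

# Assumption: Apache's default auto-generated listing pages are titled "Index of /path".
LISTING_TITLE = re.compile(r"<title>\s*Index of\s+/", re.IGNORECASE)


def looks_like_open_listing(status: int, body: str) -> bool:
    """True when a directory listing is served with 200 OK (and so is indexable)."""
    return status == 200 and bool(LISTING_TITLE.search(body))


def fetch(url: str, timeout: float = 10.0) -> tuple[int, str]:
    """Return (status code, first 4 KB of the body) for a URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read(4096).decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 403/404 responses raise HTTPError; the status code is what matters here.
        return err.code, ""
```

Calling `looks_like_open_listing(*fetch(url))` on each of the three directory URLs should return False once the listings come back as 403 or 404.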
-
Thanks much to you both for jumping in. (thumbs up!)
Streamline, I understand your suggestion regarding .htaccess, however, as I mentioned, the content in these directories is being used to populate content on our pages. In your response you mentioned that users/search engines wouldn't be able to access them. When you say "users," are you referring to Web visitors, and not site admins?
-
There are numerous ways Google could have found those pages and added them to the index, but there's really no way to determine exactly what caused it in the first place. All it takes is one visit by Google for a page to be crawled and indexed.
If you don't want these pages indexed, blocking those directories/pages in robots.txt on its own would not be the solution, because it only prevents Google from accessing those pages going forward. These pages are already in Google's index, and the robots.txt file just tells Google not to visit them from now on, so they will remain in the index. A better solution would be to add noindex and noarchive robots meta tags to those pages so that the next time Google accesses them, it will know to remove them from the index.
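To check whether a page actually carries a noindex directive, something like this Python sketch could help. It looks in both the robots meta tag and the X-Robots-Tag response header, which Google also honors; the header is the option for auto-generated directory listings, where there is no HTML template to edit. The class and function names are hypothetical, and the header parsing is deliberately simplified.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html, headers=None):
    """Return the set of robots directives found in the HTML and response headers."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    directives = set(parser.directives)
    for name, value in (headers or {}).items():
        # Simplification: ignores the optional "googlebot:" user-agent prefix.
        if name.lower() == "x-robots-tag":
            directives.update(p.strip().lower() for p in value.split(",") if p.strip())
    return directives
```

A page is only dropped this way if Google is still allowed to crawl it and see the directive, so this check pairs with leaving the URL out of robots.txt.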
And now that I've read through your post again, I realize you are talking about file directories rather than normal webpages. What I wrote above mainly still applies, but I think the quick and easy fix would be to turn off Directory Indexes altogether (unless you need them for some reason?). All you have to do is add the following code to your .htaccess file:
Options -Indexes
This will turn off these directory listings so users/search engines can't access them and they should eventually fall out of the Google index.
-
You can use robots.txt to disallow Google from even crawling those pages, while the meta noindex still allows crawling but prevents indexing of those pages.
If you have any sensitive data that you don't want Google to read, then go ahead and use the robots.txt directives you wrote above. However, if you just want them deindexed, I'd suggest going with the meta noindex, as it will allow other (linked) pages to be indexed but leave that particular page out.
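To see what the Disallow rules from the question would cover, here is a small sketch using Python's standard `urllib.robotparser`. The `User-agent: *` line is an assumption (the rules in the question don't show one), and Python's parser does simple prefix matching, which may differ from Googlebot's own matcher in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: the first Disallow rule from the question under a wildcard agent.
ROBOTS_TXT = """\
User-agent: *
Disallow: /StoreFront/jsp/
"""

BASE = "http://www.ccisolutions.com"


def allowed(path, user_agent="Googlebot"):
    """True if the assumed robots.txt lets user_agent crawl BASE + path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, BASE + path)


for path in (
    "/StoreFront/jsp/",
    "/StoreFront/jsp/html/CCI-SALES-STAFF.HTML",
    "/StoreFront/category/meet-our-sales-staff",
):
    print(path, "allowed" if allowed(path) else "blocked")
```

Under these assumptions the single `/StoreFront/jsp/` rule already covers /html/ and /pdf/ and every file inside them, so the extra two Disallow lines are redundant. CCI-SALES-STAFF.HTML would no longer be crawlable at its /jsp/html/ URL, while the category page stays crawlable.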