Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
How is Google crawling and indexing this directory listing?
-
We have three Directory Listing pages that are being indexed by Google:
http://www.ccisolutions.com/StoreFront/jsp/
http://www.ccisolutions.com/StoreFront/jsp/html/
http://www.ccisolutions.com/StoreFront/jsp/pdf/
How and why is Googlebot crawling and indexing these pages? Nothing else links to them (although the /jsp.html/ and /jsp/pdf/ both link back to /jsp/). They aren't disallowed in our robots.txt file and I understand that this could be why.
If we add them to our robots.txt file and disallow, will this prevent Googlebot from crawling and indexing those Directory Listing pages without prohibiting them from crawling and indexing the content that resides there which is used to populate pages on our site?
Having these pages indexed in Google is causing a myriad of issues, not the least of which is duplicate content.
For example, this file <tt>CCI-SALES-STAFF.HTML</tt> (which appears on this Directory Listing referenced above - http://www.ccisolutions.com/StoreFront/jsp/html/) clicks through to this Web page:
http://www.ccisolutions.com/StoreFront/jsp/html/CCI-SALES-STAFF.HTML
This page is indexed in Google and we don't want it to be. But so is the actual page where we intended the content contained in that file to display: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff
As you can see, this results in duplicate content problems.
Is there a way to disallow Googlebot from crawling that Directory Listing page, and, provided that we have this URL in our sitemap: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff, solve the duplicate content issue as a result?
For example:
Disallow: /StoreFront/jsp/
Disallow: /StoreFront/jsp/html/
Disallow: /StoreFront/jsp/pdf/
Can we do this without risking blocking Googlebot from content we do want crawled and indexed?
Many thanks in advance for any and all help on this one!
-
Thanks so much to you all. This has gotten us closer to an answer. We are consulting with the folks who developed the Web store to make sure that these solutions won't break other things if implemented, particularly something mentioned to me by our IT Director called "Sim links" - I'll keep you posted!
-
I am referring to Web users. If a user or search engine tried to view those directory listing pages, they will get a Forbidden message, which is what you want to happen. The content in those directories will still be accessible by the pages on the site since the files still exist in those directories, but the pages listing the files in those directories won't be accessible in the browser to users/search engines. In other words, turning off the Directory indexes will not affect any of the content on the site.
-
He's got the right idea, you shouldn't be serving these pages (unless you have a specific reason to). The problem is these index pages are returning with a status code of 200 OK, so Google assumes it's fine to index them. These pages should either come back with a 404 or a 403 (forbidden), and users then wouldn't be able to browse your site with these directory pages.
Disallowing in robots.txt may not immediately remove these from search results, you may get that lovely description underneath the results that says, "A description for this result is not available because of this site's robots.txt".
-
Thanks much to you both for jumping in. (thumbs up!)
Streamline, I understand your suggestion regarding .htaccess, however, as I mentioned, the content in these directories is being used to populate content on our pages. In your response you mentioned that users/search engines wouldn't be able to access them. When you say "users," are you referring to Web visitors, and not site admins?
-
There's numerous ways Google could have found those pages and added them to the index, but there's really no way to determine exactly what caused it in the first place. All it takes is for one visit by Google for a page to be crawled and indexed.
If you don't want these pages indexed, then blocking those directories/pages in robots.txt would not be the solution because you would prevent Google from accessing those pages at all going forward. But the problem is that these pages are already in Google's index and by simply using the robots.txt file, you are just telling Google not to visit those pages from now on and thus your pages will remain in the index. A better solution would be to add the no-index, no-cache tags to those pages so the next time Google accesses those pages, they will know to remove those pages from the index.
And now that I've read through your post again, I am now realizing you are talking about file directories rather than normal webpages. What I've wrote above mainly still applies, but I think the quick and easy fix would be to turn off Directory Indexes all together (unless you need them for some reason?). All you have to do is add the following code to your .htaccess file -
Options -Indexes
This will turn off these directory listings so users/search engines can't access them and they should eventually fall out of the Google index.
-
You can use robots to disallow google from even crawling those pages, while the meta noindex still allows the crawling but prevents the indexing of those pages.
If you have any sensitive data that you don't want Google to read, then go ahead and use the robots directives you wrote above. However, if you just want them deindexed I'll suggest to go with the meta noindex, as it will allow other pages (linked) to be indexed but leave that particular page out.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Staging website got indexed by google
Our staging website got indexed by google and now MOZ is showing all inbound links from staging site, how should i remove those links and make it no index. Note- we already added Meta NOINDEX in head tag
Intermediate & Advanced SEO | | Asmi-Ta0 -
My product category pages are not being indexed on google can someone help?
My website has been indexed on google and all of its pages can be found on google except for the product category pages - which are where we want our traffic heading to, so this is a big problem for us. Our website is www.skirtinguk.com And an example of a page that isn't being indexed is https://www.skirtinguk.com/product-category/mdf-skirting-board/
Intermediate & Advanced SEO | | chelseaskirtinguk0 -
Prevent Google from crawling Ajax
With Google figuring out how to make Ajax and JS more searchable/indexable, I am curious on thoughts or techniques to prevent this. Here's my Situation, we have a page that we do not ever want to be indexed/crawled or other. Currently we have the nofollow/noindex command, but due to technical changes for our site the method in which this information is being implemented if it is ever displayed it will not have the ability to block the content from search. It is also the decision of the business to not list the file in robots.txt due to the sensitivity of the content. Basically, this content doesn't exist unless something super important happens, and even if something super important happens, we do not want Google to know of its existence. Since the Dev team is planning on using Ajax/JS to pull in this content if the business turns it on, the concern is that it will be on the homepage and Google could index it. So the questions that I was asked; if Google can/does index, how long would that piece of content potentially appear in the SERPs? Can we block Google from caring about and indexing this section of content on the homepage? Sorry for the vagueness of this question, it's very sensitive in nature and I am trying to avoid too many specifics. I am able to discuss this in a more private way if necessary. Thanks!
Intermediate & Advanced SEO | | Shawn_Huber0 -
Why are bit.ly links being indexed and ranked by Google?
I did a quick search for "site:bit.ly" and it returns more than 10 million results. Given that bit.ly links are 301 redirects, why are they being indexed in Google and ranked according to their destination? I'm working on a similar project to bit.ly and I want to make sure I don't run into the same problem.
Intermediate & Advanced SEO | | JDatSB1 -
Wordpress blog in a subdirectory not being indexed by Google
HI MozzersIn my websites sitemap.xml, pages are listed, such as /blog/ and /blog/textile-fact-or-fiction-egyptian-cotton-explained/These pages are visible when you visit them in a browser and when you use the Google Webmaster tool - Fetch as Google to view them (see attachment), however they aren't being indexed in Google, not even the root directory for the blog (/blog/) is being indexed, and when we query:site: www.hilden.co.uk/blog/ It returns 0 results in Google.Also note that:The Wordpress installation is located at /blog/ which is a subdirectory of the main root directory which is managed by Magento. I'm wondering if this causing the problem.Any help on this would be greatly appreciated!AnthonyToTOHuj.png?1
Intermediate & Advanced SEO | | Tone_Agency0 -
Best practice for removing indexed internal search pages from Google?
Hi Mozzers I know that it’s best practice to block Google from indexing internal search pages, but what’s best practice when “the damage is done”? I have a project where a substantial part of our visitors and income lands on an internal search page, because Google has indexed them (about 3 %). I would like to block Google from indexing the search pages via the meta noindex,follow tag because: Google Guidelines: “Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines.” http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769 Bad user experience The search pages are (probably) stealing rankings from our real landing pages Webmaster Notification: “Googlebot found an extremely high number of URLs on your site” with links to our internal search results I want to use the meta tag to keep the link juice flowing. Do you recommend using the robots.txt instead? If yes, why? Should we just go dark on the internal search pages, or how shall we proceed with blocking them? I’m looking forward to your answer! Edit: Google have currently indexed several million of our internal search pages.
Intermediate & Advanced SEO | | HrThomsen0 -
How to prevent Google from crawling our product filter?
Hi All, We have a crawler problem on one of our sites www.sneakerskoopjeonline.nl. On this site, visitors can specify criteria to filter available products. These filters are passed as http/get arguments. The number of possible filter urls is virtually limitless. In order to prevent duplicate content, or an insane amount of pages in the search indices, our software automatically adds noindex, nofollow and noarchive directives to these filter result pages. However, we’re unable to explain to crawlers (Google in particular) to ignore these urls. We’ve already changed the on page filter html to javascript, hoping this would cause the crawler to ignore it. However, it seems that Googlebot executes the javascript and crawls the generated urls anyway. What can we do to prevent Google from crawling all the filter options? Thanks in advance for the help. Kind regards, Gerwin
Intermediate & Advanced SEO | | footsteps0