How is Google crawling and indexing this directory listing?
-
We have three Directory Listing pages that are being indexed by Google:
http://www.ccisolutions.com/StoreFront/jsp/
http://www.ccisolutions.com/StoreFront/jsp/html/
http://www.ccisolutions.com/StoreFront/jsp/pdf/
How and why is Googlebot crawling and indexing these pages? Nothing else links to them (although the /jsp.html/ and /jsp/pdf/ both link back to /jsp/). They aren't disallowed in our robots.txt file and I understand that this could be why.
If we add them to our robots.txt file and disallow, will this prevent Googlebot from crawling and indexing those Directory Listing pages without prohibiting them from crawling and indexing the content that resides there which is used to populate pages on our site?
Having these pages indexed in Google is causing a myriad of issues, not the least of which is duplicate content.
For example, this file <tt>CCI-SALES-STAFF.HTML</tt> (which appears on this Directory Listing referenced above - http://www.ccisolutions.com/StoreFront/jsp/html/) clicks through to this Web page:
http://www.ccisolutions.com/StoreFront/jsp/html/CCI-SALES-STAFF.HTML
This page is indexed in Google and we don't want it to be. But so is the actual page where we intended the content contained in that file to display: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff
As you can see, this results in duplicate content problems.
Is there a way to disallow Googlebot from crawling that Directory Listing page, and, provided that we have this URL in our sitemap: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff, solve the duplicate content issue as a result?
For example:
Disallow: /StoreFront/jsp/
Disallow: /StoreFront/jsp/html/
Disallow: /StoreFront/jsp/pdf/
Can we do this without risking blocking Googlebot from content we do want crawled and indexed?
Many thanks in advance for any and all help on this one!
-
Thanks so much to you all. This has gotten us closer to an answer. We are consulting with the folks who developed the Web store to make sure that these solutions won't break other things if implemented, particularly something mentioned to me by our IT Director called "Sim links" - I'll keep you posted!
-
I am referring to Web users. If a user or search engine tried to view those directory listing pages, they will get a Forbidden message, which is what you want to happen. The content in those directories will still be accessible by the pages on the site since the files still exist in those directories, but the pages listing the files in those directories won't be accessible in the browser to users/search engines. In other words, turning off the Directory indexes will not affect any of the content on the site.
-
He's got the right idea, you shouldn't be serving these pages (unless you have a specific reason to). The problem is these index pages are returning with a status code of 200 OK, so Google assumes it's fine to index them. These pages should either come back with a 404 or a 403 (forbidden), and users then wouldn't be able to browse your site with these directory pages.
Disallowing in robots.txt may not immediately remove these from search results, you may get that lovely description underneath the results that says, "A description for this result is not available because of this site's robots.txt".
-
Thanks much to you both for jumping in. (thumbs up!)
Streamline, I understand your suggestion regarding .htaccess, however, as I mentioned, the content in these directories is being used to populate content on our pages. In your response you mentioned that users/search engines wouldn't be able to access them. When you say "users," are you referring to Web visitors, and not site admins?
-
There's numerous ways Google could have found those pages and added them to the index, but there's really no way to determine exactly what caused it in the first place. All it takes is for one visit by Google for a page to be crawled and indexed.
If you don't want these pages indexed, then blocking those directories/pages in robots.txt would not be the solution because you would prevent Google from accessing those pages at all going forward. But the problem is that these pages are already in Google's index and by simply using the robots.txt file, you are just telling Google not to visit those pages from now on and thus your pages will remain in the index. A better solution would be to add the no-index, no-cache tags to those pages so the next time Google accesses those pages, they will know to remove those pages from the index.
And now that I've read through your post again, I am now realizing you are talking about file directories rather than normal webpages. What I've wrote above mainly still applies, but I think the quick and easy fix would be to turn off Directory Indexes all together (unless you need them for some reason?). All you have to do is add the following code to your .htaccess file -
Options -Indexes
This will turn off these directory listings so users/search engines can't access them and they should eventually fall out of the Google index.
-
You can use robots to disallow google from even crawling those pages, while the meta noindex still allows the crawling but prevents the indexing of those pages.
If you have any sensitive data that you don't want Google to read, then go ahead and use the robots directives you wrote above. However, if you just want them deindexed I'll suggest to go with the meta noindex, as it will allow other pages (linked) to be indexed but leave that particular page out.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Any way to force a URL out of Google index?
As far as I know, there is no way to truly FORCE a URL to be removed from Google's index. We have a page that is being stubborn. Even after it was 301 redirected to an internal secure page months ago and a noindex tag was placed on it in the backend, it still remains in the Google index. I also submitted a request through the remove outdated content tool https://www.google.com/webmasters/tools/removals and it said the content has been removed. My understanding though is that this only updates the cache to be consistent with the current index. So if it's still in the index, this will not remove it. Just asking for confirmation - is there truly any way to force a URL out of the index? Or to even suggest more strongly that it be removed? It's the first listing in this search https://www.google.com/search?q=hcahranswers&rlz=1C1GGRV_enUS753US755&oq=hcahr&aqs=chrome.0.69i59j69i57j69i60j0l3.1700j0j8&sourceid=chrome&ie=UTF-8
Intermediate & Advanced SEO | | MJTrevens0 -
Removing massive number of no index follow page that are not crawled
Hi, We have stackable filters on some of our pages (ie: ?filter1=a&filter2=b&etc.). Those stacked filters pages are "noindex, follow". They were created in order to facilitate the indexation of the item listed in them. After analysing the logs we know that the search engines do not crawl those stacked filter pages. Does blocking those pages (by loading their link in AJAX for example) would help our crawl rate or not? In order words does removing links that are already not crawled help the crawl rate of the rest of our pages? My assumption here is that SE see those links but discard them because those pages are too deep in our architecture and by removing them we would help SE focus on the rest of our page. We don't want to waste our efforts removing those links if there will be no impact. Thanks
Intermediate & Advanced SEO | | Digitics0 -
Does anyone know how to appear with snippet that says something like: Jobs 1-10 of 80 in the beginning of the description on Google? e.g. like on: https://www.google.co.za/#q=pickers+and+packers
Does anyone know how to appear with snippet that says something like: Jobs 1-10 of 80 in the beginning of the description on Google? e.g. like on: https://www.google.co.za/#q=pickers+and+packers Any markup that could be used to be listed like this. Why is some sites listed like this and some not. Why is the adzuna.co.za page listed with Results 1-10 while some other with Jobs 1-10 ?
Intermediate & Advanced SEO | | classifiedtech0 -
Google isn't seeing the content but it is still indexing the webpage
When I fetch my website page using GWT this is what I receive. HTTP/1.1 301 Moved Permanently
Intermediate & Advanced SEO | | jacobfy
X-Pantheon-Styx-Hostname: styx1560bba9.chios.panth.io
server: nginx
content-type: text/html
location: https://www.inscopix.com/
x-pantheon-endpoint: 4ac0249e-9a7a-4fd6-81fc-a7170812c4d6
Cache-Control: public, max-age=86400
Content-Length: 0
Accept-Ranges: bytes
Date: Fri, 14 Mar 2014 16:29:38 GMT
X-Varnish: 2640682369 2640432361
Age: 326
Via: 1.1 varnish
Connection: keep-alive What I used to get is this: HTTP/1.1 200 OK
Date: Thu, 11 Apr 2013 16:00:24 GMT
Server: Apache/2.2.23 (Amazon)
X-Powered-By: PHP/5.3.18
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Last-Modified: Thu, 11 Apr 2013 16:00:24 +0000
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1365696024"
Content-Language: en
Link: ; rel="canonical",; rel="shortlink"
X-Generator: Drupal 7 (http://drupal.org)
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8 xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:og="http://ogp.me/ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:sioc="http://rdfs.org/sioc/ns#"
xmlns:sioct="http://rdfs.org/sioc/types#"
xmlns:skos="http://www.w3.org/2004/02/skos/core#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"> <title>Inscopix | In vivo rodent brain imaging</title>0 -
Google crawled my rich snippet pages and then excluded them
Hi guysWe have added schema.org mark up a few months ago and it all looked well and showed up then suddenly last month all the crawled pages disappeared from Webmaster tools Structured data (see the screenshot attached). This happened to another site of mine and I cannot figure out what causes it. Nothing has been changed on the pages and you can see by yourself in the HTML code. Any ideas to why this might happened this way?wenR89I.png?1
Intermediate & Advanced SEO | | Walltopia0 -
Wordpress blog in a subdirectory not being indexed by Google
HI MozzersIn my websites sitemap.xml, pages are listed, such as /blog/ and /blog/textile-fact-or-fiction-egyptian-cotton-explained/These pages are visible when you visit them in a browser and when you use the Google Webmaster tool - Fetch as Google to view them (see attachment), however they aren't being indexed in Google, not even the root directory for the blog (/blog/) is being indexed, and when we query:site: www.hilden.co.uk/blog/ It returns 0 results in Google.Also note that:The Wordpress installation is located at /blog/ which is a subdirectory of the main root directory which is managed by Magento. I'm wondering if this causing the problem.Any help on this would be greatly appreciated!AnthonyToTOHuj.png?1
Intermediate & Advanced SEO | | Tone_Agency0 -
SEOMoz Paid Directories List.... Are paid links always bad?
We all know paid links are bad.... so it begs the question.... why does the SEOMoz directory list have so many directories that only submit businesses for a fee.... i.e. a paid link. Are they worth paying for or not? Interested to hear peoples thoughts and experiences with some of the directories listed on the SEOMoz directory list....
Intermediate & Advanced SEO | | JohnW-UK0 -
Why are new pages not being indexed, and old pages (now in robots.txt) remain in the index?
I currently have a site that was recently restructured, causing much of its content to be reposted, creating new URL's for each page. To avoid duplicates, all of the existing pages were added to the robots file. That said, it has now been over a week - I know Google has recrawled the site - and when I search for term X, it is stil the old page that is ranking, with the new one nowhere to be seen. I'm assuming it's a cached version, but why are so many of the old pages still appearing in the index? Furthermore, all "tags" pages (it's a Q&A site, like this one) were also added to the robots a few months ago, yet I think they are all still appearing in the index. Anyone got any ideas about why this is happening, and how I can get my new pages indexed?
Intermediate & Advanced SEO | | corp08030