The Moz Q&A Forum


    Moz Q&A is closed.

    After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.

    How is Google crawling and indexing this directory listing?

    Intermediate & Advanced SEO
    • danatanseo
      danatanseo last edited by

      We have three Directory Listing pages that are being indexed by Google:

      http://www.ccisolutions.com/StoreFront/jsp/

      http://www.ccisolutions.com/StoreFront/jsp/html/

      http://www.ccisolutions.com/StoreFront/jsp/pdf/

      How and why is Googlebot crawling and indexing these pages? Nothing else links to them (although /jsp/html/ and /jsp/pdf/ both link back to /jsp/). They aren't disallowed in our robots.txt file, and I understand that this could be why.

      If we add them to our robots.txt file as disallowed, will this prevent Googlebot from crawling and indexing those Directory Listing pages without prohibiting it from crawling and indexing the content that resides there, which is used to populate pages on our site?

      Having these pages indexed in Google is causing a myriad of issues, not the least of which is duplicate content.

      For example, this file, CCI-SALES-STAFF.HTML (which appears on the Directory Listing referenced above, http://www.ccisolutions.com/StoreFront/jsp/html/), clicks through to this Web page:

      http://www.ccisolutions.com/StoreFront/jsp/html/CCI-SALES-STAFF.HTML

      This page is indexed in Google and we don't want it to be. But so is the actual page where we intended the content contained in that file to display: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff

      As you can see, this results in duplicate content problems.

      Is there a way to disallow Googlebot from crawling that Directory Listing page, and, provided that we have this URL in our sitemap: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff, solve the duplicate content issue as a result?

      For example:

      Disallow: /StoreFront/jsp/

      Disallow: /StoreFront/jsp/html/

      Disallow: /StoreFront/jsp/pdf/

      Can we do this without risking blocking Googlebot from content we do want crawled and indexed?

      Many thanks in advance for any and all help on this one!
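One way to sanity-check the proposed Disallow rules before deploying them is to run them through Python's standard-library `urllib.robotparser`. The following is a minimal sketch, not a model of Googlebot itself: the `User-agent: *` line is an assumption (the post shows only the Disallow lines), `is_blocked` is a hypothetical helper name, and Google's own parser may behave differently in edge cases. It illustrates that Disallow rules match by path prefix, so the first rule on its own already covers both subdirectories and every file inside them, while the category page is left alone.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: the "User-agent: *" line is not in the original post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /StoreFront/jsp/
Disallow: /StoreFront/jsp/html/
Disallow: /StoreFront/jsp/pdf/
"""


def is_blocked(url, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if the robots.txt text disallows `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


BASE = "http://www.ccisolutions.com"
for path in (
    "/StoreFront/jsp/",
    "/StoreFront/jsp/html/CCI-SALES-STAFF.HTML",
    "/StoreFront/category/meet-our-sales-staff",
):
    print(path, "blocked" if is_blocked(BASE + path) else "allowed")
```

Under these assumptions, the directory listing and the raw HTML file are both reported as blocked, and the category page as allowed. Whether blocking the raw files matters depends on how the storefront assembles its pages, which the thread does not say.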

      • danatanseo
        danatanseo last edited by

        Thanks so much to you all. This has gotten us closer to an answer. We are consulting with the folks who developed the Web store to make sure that these solutions won't break other things if implemented, particularly something mentioned to me by our IT Director called "Sim links" - I'll keep you posted!

        • StreamlineMetrics
          StreamlineMetrics @danatanseo last edited by

          I am referring to Web users. If a user or search engine tries to view those directory listing pages, they will get a Forbidden message, which is what you want to happen. The content in those directories will still be accessible to the pages on the site since the files still exist in those directories, but the pages listing the files in those directories won't be accessible in the browser to users or search engines. In other words, turning off the directory indexes will not affect any of the content on the site.

          • john4math
            john4math @StreamlineMetrics last edited by

            He's got the right idea: you shouldn't be serving these pages (unless you have a specific reason to). The problem is that these index pages return a 200 OK status code, so Google assumes it's fine to index them. They should come back with either a 404 or a 403 (Forbidden); users then wouldn't be able to browse your site through these directory pages either.

            Disallowing in robots.txt may not immediately remove these from search results. You may instead get that lovely description underneath the result that says, "A description for this result is not available because of this site's robots.txt".
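The status codes discussed here can be checked with a short script. This is a sketch under stated assumptions, using only Python's standard library: `fetch_status` and `listing_verdict` are hypothetical helper names, the verdict labels are illustrative rather than Google terminology, and the server is assumed to answer HEAD requests the same way it answers GET.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request to `url`.

    Redirects are followed, so the code is that of the final URL.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is still what we want.
        return err.code


def listing_verdict(status):
    """Classify the status code of a directory-listing URL (labels are illustrative)."""
    if status == 200:
        return "served: eligible for indexing"
    if status in (403, 404):
        return "not served: should drop out of the index over time"
    return "other: inspect manually"


# Example (performs a real network request, so it is left commented out):
# print(listing_verdict(fetch_status("http://www.ccisolutions.com/StoreFront/jsp/")))
```

Running the same check against one of the files inside the directory would show whether the content itself is still being served after the listing is turned off.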

            • danatanseo
              danatanseo last edited by

              Thanks much to you both for jumping in. (thumbs up!)

              Streamline, I understand your suggestion regarding .htaccess. However, as I mentioned, the content in these directories is being used to populate content on our pages. In your response you mentioned that users/search engines wouldn't be able to access them. When you say "users," are you referring to Web visitors, and not site admins?

              • StreamlineMetrics
                StreamlineMetrics last edited by

                There are numerous ways Google could have found those pages and added them to the index, but there's really no way to determine exactly what caused it in the first place. All it takes is one visit by Google for a page to be crawled and indexed.

                If you don't want these pages indexed, blocking those directories/pages in robots.txt alone is not the solution. A Disallow rule only tells Google not to visit those pages from now on; because the pages are already in Google's index, they will remain there. A better solution would be to add noindex (and, to drop the cached copy, noarchive) robots meta tags to those pages so that the next time Google accesses them, it knows to remove them from the index.

                And now that I've read through your post again, I realize you are talking about file directories rather than normal webpages. What I've written above mostly still applies, but I think the quick and easy fix would be to turn off directory indexes altogether (unless you need them for some reason?). All you have to do is add the following line to your .htaccess file:

                Options -Indexes

                This will turn off these directory listings so users/search engines can't access them and they should eventually fall out of the Google index.

                • FedeEinhorn
                  FedeEinhorn last edited by

                  You can use robots.txt to disallow Google from even crawling those pages, whereas the meta noindex still allows crawling but prevents those pages from being indexed.

                  If you have any sensitive data that you don't want Google to read, then go ahead and use the robots.txt directives you wrote above. However, if you just want the pages deindexed, I'd suggest going with the meta noindex, as it will allow other (linked) pages to be indexed but leave that particular page out.
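To confirm that a page actually carries the meta noindex, its HTML can be inspected with Python's standard-library `html.parser`. This is a minimal sketch: `RobotsMetaParser` and `has_noindex` are hypothetical names, the sample markup is made up for illustration, and the sketch does not look at HTTP headers. Keep in mind that a crawler can only see the tag if it is allowed to fetch the page, so pairing noindex with a robots.txt Disallow for the same URL would hide the tag from Google.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values may be None when an attribute has no value.
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Made-up sample page, not taken from the site discussed in the thread.
sample = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(sample))  # True
```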
