Google: How to See URLs Blocked by Robots?
-
Google Webmaster Tools says we have 17K out of 34K URLs that are blocked by our Robots.txt file.
How can I see the URLs that are being blocked?
Here's our Robots.txt file.
User-agent: *
Disallow: /swish.cgi
Disallow: /demo
Disallow: /reviews/review.php/new/
Disallow: /cgi-audiobooksonline/sb/order.cgi
Disallow: /cgi-audiobooksonline/sb/productsearch.cgi
Disallow: /cgi-audiobooksonline/sb/billing.cgi
Disallow: /cgi-audiobooksonline/sb/inv.cgi
Disallow: /cgi-audiobooksonline/sb/new_options.cgi
Disallow: /cgi-audiobooksonline/sb/registration.cgi
Disallow: /cgi-audiobooksonline/sb/tellfriend.cgi
Disallow: /*?gdftrk
-
It seems you might be asking two different questions here, Larry.
You ask which URLs are blocked by your robots file, but you then answer your own question by listing the entries in that file; those entries are exactly the URL patterns being blocked.
If in fact what you want to know is which pages exist on your website but are not currently indexed, that's a much bigger question and requires a lot more work to answer.
There is no way Webmaster Tools can give you that answer, because if it were aware of a URL, it would already be indexing it.
HOWEVER! It is possible to do it if you are willing to do some of the work on your own to collect and manipulate data using several tools. Essentially, you have to do it in three steps:
- Create a list of all the URLs that Google says are indexed. (This info comes from Google's SERPs.)
- Then create a separate list of all of the URLs that actually exist on your website. (This must come from a 3rd-party tool you run against your site yourself.)
- From there, you can use Excel to subtract the indexed URLs from the known URLs, leaving a list of non-indexed URLs, which is what you asked for. (A scripted version of this subtraction is sketched below.)
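If spreadsheets aren't your thing, here's a minimal sketch of that last step in Python instead of Excel. It assumes you've saved each list as a plain-text file with one URL per line; the file names are made up for illustration:

```python
# Minimal sketch: subtract Google's indexed URLs from the full crawl list
# to get the URLs that are NOT indexed. File names are hypothetical.

def load_urls(path):
    """Read a one-URL-per-line file into a set, normalizing trailing slashes."""
    with open(path) as f:
        return {line.strip().rstrip("/") for line in f if line.strip()}

crawled = load_urls("crawled_urls.txt")   # every URL your crawler found
indexed = load_urls("indexed_urls.txt")   # every URL pulled from Google's SERPs

# Set difference: URLs that exist on the site but aren't in Google's index
for url in sorted(crawled - indexed):
    print(url)
```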
I actually laid out this process step-by-step in response to an earlier question, so you can read the full walkthrough there: http://www.seomoz.org/q/how-to-determine-which-pages-are-not-indexed
Is that what you were looking for?
Paul
-
Okay, well, the robots.txt will only exclude robots from the folders and URLs specified, and as I say, there's no way to download a list of all the URLs Google is not indexing from Webmaster Tools.
If you have exact URLs in mind which you think might be getting excluded, you can test individual URLs in Google Webmaster Tools in:
Health > Blocked URLs. Specify the URLs and user-agents to test against.
Beyond this, if you want to know whether there are URLs that shouldn't be excluded in the folders you have specified, I would run a crawl of your website using SEOmoz's crawl test or Screaming Frog. Then sort the URLs alphabetically and make sure that all of the URLs in the folders you have excluded via robots.txt are ones that you want to exclude; a scripted version of that check is sketched below.
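If the crawl export is large, a short script can do the sorting and checking for you. Here's a rough sketch using Python's standard-library robotparser; the crawl file name is hypothetical, and you'd export it from whichever crawler you use, one URL per line:

```python
# Rough sketch: flag every crawled URL that the live robots.txt blocks,
# so you can confirm each one is blocked on purpose. Note that the
# stdlib parser does not implement Google's wildcard extension, so a
# rule like "Disallow: /*?gdftrk" still needs a manual check.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.audiobooksonline.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt

with open("crawl_export.txt") as f:  # hypothetical crawler export
    for line in f:
        url = line.strip()
        if url and not rp.can_fetch("*", url):
            print("Blocked:", url)
```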
-
I want to make sure that Google is indexing all of the pages we want it to, i.e., that every URL it is NOT indexing is one we intentionally excluded.
-
Hi Larry
Just for my understanding, why do you want to find those URLs? Are you concerned that the robots.txt is blocking URLs it shouldn't be?
As for downloading a list of URLs which aren't indexed from Google Webmaster Tools, which is what I think you would really like, this isn't possible at the moment.
-
Liz: Perhaps my post was unclear, or I am misunderstanding your answer.
I want to find out the specific URLs that Google says it isn't indexing because of our Robots.txt file.
-
If you want to see if Google has indexed individual pages which are supposed to be excluded, you can check the URLs in your robots.txt using the site: command.
E.g. type the following into Google:
site:http://www.audiobooksonline.com/swish.cgi
site:http://www.audiobooksonline.com/reviews/review.php/new/
...continue for all the URLs in your robots.txt.
Just from searching on the last example above (site:http://www.audiobooksonline.com/reviews/review.php/new/) I can see that you have results indexed. This is probably because you added the robots.txt rule after those pages were already indexed.
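Checking every rule by hand gets tedious, so here's a hypothetical little helper that prints a site: query for each Disallow path in a locally saved copy of your robots.txt; you'd paste each query into Google yourself:

```python
# Hypothetical helper: turn each Disallow path in a saved robots.txt
# into a Google "site:" query for manual spot-checking.
DOMAIN = "http://www.audiobooksonline.com"

with open("robots.txt") as f:
    for line in f:
        rule = line.strip()
        if rule.lower().startswith("disallow:"):
            path = rule.split(":", 1)[1].strip()
            if path and "*" not in path:  # wildcard rules need a manual look
                print("site:" + DOMAIN + path)
```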
To get rid of these results you need to take the culprit line out of the robots.txt, add the robots meta tag set to noindex to all pages you want removed, submit a URL removal request via Webmaster Tools, and check the pages have been deindexed; then you can add the line back into the robots.txt.
This is the tag: <meta name="robots" content="noindex">
I hope that makes sense and is useful!