Pages getting into Google Index, blocked by Robots.txt??
-
Hi all,
So yesterday we set up to Remove URL's that got into the Google index that were not supposed to be there, due to faceted navigation... We searched for the URL's by using this in Google Search.
site:www.sekretza.com inurl:price=
site:www.sekretza.com inurl:artists=So it brings up a list of "duplicate" pages, and they have the usual: "A description for this result is not available because of this site's robots.txt – learn more."
So we removed them all, and google removed them all, every single one.
This morning I do a check, and I find that more are creeping in - If i take one of the suspecting dupes to the Robots.txt tester, Google tells me it's Blocked. - and yet it's appearing in their index??
I'm confused as to why a path that is blocked is able to get into the index?? I'm thinking of lifting the Robots block so that Google can see that these pages also have a Meta NOINDEX,FOLLOW tag on - but surely that will waste my crawl budget on unnecessary pages?
Any ideas?
thanks.
-
Oh, ok. If that's the case, pls don't worry about those in the index. You can get them removed using remove URL feature in webmaster tools account.
-
It doesn't show any result for the "blocked page" when I do that in Google.
-
Hi,
Please try this and let us know the results:
Suppose this is one of the pages in discussion:
http://www.yourdomain.com/blocked-page.html
Go to Google, type the following along with double quotes. Replace with the actual page:
"yourdomain.com/blocked-page.html" -site:yourdomain.com
-
Hi!
From what I could tell, it wasn't that many pages already in the index, so it could be worth trying to lift the block, at least for a short while, to see if it will have an impact.
In addition - how about configuring how GoogleBot should threat your URLs via the URL parameter tool in Google Webmaster Tools. Here's what Google has to say about this. https://support.google.com/webmasters/answer/1235687
Best regards,Anders
-
Hi Devanur.
What I'm guessing is the problem here, is that as of now, GoogleBot is restricted from accessing the pages (because of robots.txt), leading to it never going into the page and updateing its index regarding the "noindex, follow" declaration in the that seems to be in place.
One other thing that could be considered, is to add "rel=nofollow" to all the faceted navigation links on the left.
Fully agreeing with you on the "crawl budget" part
Anders
-
Hi guys,
Appreciate your replies, but as far as I checked last time, if the URL is blocked by a Robots.txt file, it cannot read the Meta Noindex, Follow tag within the page.
There are no external references to these URL's, so Google is finding them within the site itself.
In essence, what you are recommending is that I lift the robots block and let google crawl these pages (which could be infinite as it is faceted navigation).
This will waste my crawl budget.
Any other ideas?
-
Anderss has pointed out to the right article. With robots.txt blocking, Google bot will not do the crawl (link discovery) from within the website but what if references to these blocked pages are found else where on third-party websites? This is the case you have been into. So to fully block Google from doing the link discovery and indexing these blocked pages, you should go in for the page-level meta robots tag to block these pages. Once this is in place, this issue will fade away.
This issue has been addressed many times here on Moz.
Coming to your concern about the crawl budget. There is nothing to worry about this as Google will not crawl those blocked pages while its on your website as these are already been blocked using robots.txt file.
Hope it helps my friend.
Best regards,
Devanur Rafi
-
Hi!
It could be that that pages has already been indexed before you added the directives to robots.txt.
I see that you have added the rel=canonical for the pages and that you now have noindex,follow. Is that recently added? If so, it could be wise to actually let GoogleBot access and crawl the pages again - and then they'll go away after a while. Then you could add the directive again later. See https://support.google.com/webmasters/answer/93710?hl=en&ref_topic=4598466 for more about this.
Hope this helps!
Anders -
For example:
http://www.sekretza.com/eng/best-sellers-sekretza-products.html?price=1%2C1000Is blocked by using:
Disallow: /*price=.... ?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
How to get visibility in Google Discover?
Hey everyone, I run a website that publish articles about pets. I have read some great things about Google Discover and the potential traffic it can bring to publishers (Condé Nast reported up to 20% of traffic coming from Discover in the US, at a certain point). I am currently trying to get indexed and after reading Google guidelines and a Ahrefs guide, I have made many optimizations to my site: structured data, creating an author page, fixing image size and publishing date... so far, it's not working. I feel the lack of a knowledge graph for my business may affect my chances. I'm currently building a GMB page to fix this. Do you have other recommendations or success stories of your own experiments with Discover? An example of an article I tried to get indexed was https://www.lebernard.ca/teletravail-chien-guide-survie/. Obviously, I'm not expecting feedback on the quality of the content since it's in French, but I'm curious if you see anything from a technical perspective that doesn't work. Thanks a lot for your help! Charles
Intermediate & Advanced SEO | | Cheebee1240 -
Search Results Pages Blocked in Robots.txt?
Hi I am reviewing our robots.txt file. I wondered if search results pages should be blocked from crawling? We currently have this in the file /searchterm* Is it a good thing for SEO?
Intermediate & Advanced SEO | | BeckyKey0 -
Organic Listings showing Google Tag Manager + Google Page Title...?
I'm a bit stumped with this. I optimise all my titles etc for Australia - and now the organic liatings are showing something strange. For example ( we sell health supplements ) Meta title = "My Product , Buy Online Australia" If I type "My Product" - the title in the organic listings says "My Product - My Company Limited" - and the only place I can see it getting that from is a combination of Meta Data used in Google Tag Manager + the Name on my Google places page. This is much more obvious for categories.. but it's a pain in the butt. If I type "My Product Australia" Then the original "My Product , Buy Online Australia" comes up. Any ideas on policy etc? I have taken the "Limited" off the Google business page - so hopefully this will change over time - but I can't find any information on why google would do something like this. If you had shed any light on this - would be much appreciated.
Intermediate & Advanced SEO | | s_EOgi_Bear0 -
Removing pages from index
My client is running 4 websites on ModX CMS and using the same database for all the sites. Roger has discovered that one of the sites has 2050 302 redirects pointing to the clients other sites. The Sitemap for the site in question includes 860 pages. Google Webmaster Tools has indexed 540 pages. Roger has discovered 5200 pages and a Site: query of Google reveals 7200 pages. Diving into the SERP results many of the pages indexed are pointing to the other 3 sites. I believe there is a configuration problem with the site because the other sites when crawled do not have a huge volume of redirects. My concern is how can we remove from Google's index the 2050 pages that are redirecting to the other sites via a 302 redirect?
Intermediate & Advanced SEO | | tinbum0 -
Google and PDF indexing
It was recently brought to my attention that one of the PDFs on our site wasn't showing up when looking for a particular phrase within the document. The user was trying to search only within our site. Once I removed the site restriction - I noticed that there was another site using the exact same PDF. It appears Google is indexing that PDF but not ours. The name, title, and content are the same. Is there any way to get around this? I find it interesting as we use GSA and within GSA it shows up for the phrase. I have to imagine Google is saying that it already has the PDF and therefore is ignoring our PDF. Any tricks to get around this? BTW - both sites rightfully should have the PDF. One is a client site and they are allowed to host the PDFs created for them. However, I'd like Mathematica to also be listed. Query: no site restriction (notice: Teach for america comes up #1 and Mathematica is not listed). https://www.google.com/search?as_q=&as_epq=HSAC_final_rpt_9_2013.pdf&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=pdf&as_rights=&gws_rd=ssl#q=HSAC_final_rpt_9_2013.pdf+"Teach+charlotte"+filetype:pdf&as_qdr=all&filter=0 Query: site restriction (notice that it doesn't find the phrase and redirects to any of the words) https://www.google.com/search?as_q=&as_epq=HSAC_final_rpt_9_2013.pdf&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=pdf&as_rights=&gws_rd=ssl#as_qdr=all&q="Teach+charlotte"+site:www.mathematica-mpr.com+filetype:pdf
Intermediate & Advanced SEO | | jpfleiderer0 -
How long does google take to show the results in SERP once the pages are indexed ?
Hi...I am a newbie & trying to optimize the website www.peprismine.com. I have 3 questions - A little background about this : Initially, close to 150 pages were indexed by google. However, we decided to remove close to 100 URLs (as they were quite similar). After the changes, we submitted the NEW sitemap (with close to 50 pages) & google has indexed those URLs in sitemap. 1. My pages were indexed by google few days back. How long does google take to display the URL in SERP once the pages get indexed ? 2. Does google give more preference to websites with more number of pages than those with lesser number of pages to display results in SERP (I have just 50 pages). Does the NUMBER of pages really matter ? 3. Does removal / change of URLs have any negative effect on ranking ? (Many of these URLs were not shown on the 1st page) An answer from SEO experts will be highly appreciated. Thnx !
Intermediate & Advanced SEO | | PepMozBot0 -
Website is not getting indexed in Google! Not sure why?
I just came up with my new blog, its not live yet but the 1<sup>st</sup> landing page is ready, up and running… all is fine but here is the only problem is its not getting indexed in Google and I am not really sure why? .xml sitemap is there Google webmaster and analytics are there Website contain at least that much real social shares that it should get indexed in Google Few Links may be coming from Famous Bloggers and SEOmoz (both sites are very authentic in their respective domains) It’s the 4 day the website is up I don’t think website is not getting indexed in Google just because it contains 1 landing page and a thank you page! Any clue or help will be appreciated. www.setalks.com is the domain
Intermediate & Advanced SEO | | MoosaHemani0 -
Getting 260,000 pages re-indexed?
Hey there guys, I was recently hired to do SEO for a big forum to move the site to a new domain and to get them back up to their ranks after this move. This all went quite well, except for the fact that we lost about 1/3rd of our traffic. Although I expected some traffic to drop, this is quite a lot and I'm wondering what it is. The big keywords are still pulling the same traffic but I feel that a lot of the small threads on the forums have been de-indexed. Now, with a site with 260,000 threads, do I just take my loss and focus on new keywords? Or is there something I can do to get all these threads re-indexed? Thanks!
Intermediate & Advanced SEO | | StefanJDorresteijn0