How do internal search results get indexed by Google?
-
Hi all,
Most of the URLs that are created by using the internal search function of a website/web shop shouldn't be indexed since they create duplicate content or waste crawl budget.
The standard approach is to set 'noindex, follow' on these pages, or sometimes to use robots.txt to disallow crawling them.
The first question I have is how these pages would actually get indexed in the first place if you didn't use one of the options above. Crawlers follow links to index a website's pages. If a random visitor comes to your site and uses the search function, this creates a URL. There are no links leading to this URL, it is not in a sitemap, and it can't be found by navigating the website. So how can search engines index these URLs that were generated by using an internal search function?
Second question: let's say somebody embeds a link on his website pointing to a URL from your website that was created by an internal search. Now let's assume you used robots.txt to make sure these URLs weren't indexed. This means Google won't even crawl those pages. Is it possible then that the link that was used on another website will show an empty page after a while, since Google doesn't even crawl this page?
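To make the robots.txt option concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The `/search` path, the rule itself and the example URLs are assumptions for illustration, not taken from any real site. It only shows which URLs a compliant crawler may fetch; it says nothing about whether a URL can end up indexed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks crawling of an assumed /search path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
search_url = "https://example.com/search?q=black+laptops"  # assumed internal search URL
category_url = "https://example.com/laptops"

print(may_crawl(parser, search_url))
print(may_crawl(parser, category_url))
```

A visitor's browser never consults robots.txt, so the search URL still renders normally for anyone who follows a link to it.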
Thanks for your thoughts guys.
-
Firstly (and I think you understand this, but for the benefit of others who find this page later): any user landing on the actual page will see its full content - robots.txt has no effect on their experience.
What I think you're asking about here is what happens if Google has previously indexed a page properly, by crawling it and discovering its content, and you then block it in robots.txt: what will it look like in the SERPs?
My expectation is that:
- It will appear in the SERPs as it used to - with meta information / title etc - at least until Google would have recrawled it anyway, and possibly for a bit longer given some failure of Google to recrawl it after the robots.txt is updated
- Eventually, it will either drop out of the index or remain, but with the "no information" message that shows up when a page is blocked in robots.txt from the outset yet is indexed anyway
-
Hi Will,
Thanks for the clear answer. Both solutions do have pros and cons.
The only question left is if it would be possible that somebody gets an empty page (so without any content on it) after a while when following an external link to one of your internal search URLs when this URL would be blocked by robots.txt. Search engines wouldn't crawl these pages but still would be able to index them because they follow the link. Or does a URL and its content stay available and visible once it is generated, no matter if it is not crawlable or not indexable? This is maybe a bit out there and it would surprise me, but in this short article that I came across John Mueller says:
"One thing maybe to keep in mind here is that if these pages are blocked by robots.txt, then it could theoretically happen that someone randomly links to one of these pages. And if they do that then it could happen that we index this URL without any content because its blocked by robots.txt. So we wouldn’t know that you don’t want to have these pages actually indexed."
This could be in theory then the case for all URLs that are blocked by robots.txt but get external links.
What's your view on this?
-
I think you could legitimately take either approach to be honest. There isn't a perfect solution that avoids all possible problems so I guess it's a combination of picking which risk you are more worried about (pages getting indexed when you don't want them to, or crawl budget -- probably depends on the size of your site) and possibly considering difficulty of implementation etc.
In light of the fact that we heard about noindex,follow becoming equivalent to noindex,nofollow eventually, that does dampen the benefits of that approach, but doesn't entirely negate it.
I'm not totally sold on the phrasing in the Yoast article - I wouldn't call it Google "ignoring" robots.txt - it just serves a different purpose. Google is respecting the "do not crawl" directive, but that has never guaranteed that they wouldn't index a page if it got external links.
I personally might lean towards the robots.txt solution on larger sites if crawl budget were the primary concern - just because it wouldn't be the end of the world if (some of) these pages got indexed if they had external links. The only reason we were trying to keep them out was for Google's benefit, so if they want to index despite the robots block, it wouldn't keep me awake at night.
Whatever route you go down, good luck!
-
Thanks for the good answers guys, really helpful! It's very clear now how these internal search URLs end up being indexed.
So 'noindex, follow' for URLs generated by internal searches is always the best solution? Even when this uses crawl budget, and blocking by robots.txt doesn't?
You could say that the biggest advantage would be the preservation of link juice when using 'noindex, follow', but John Mueller states that Google treats 'noindex, follow' the same as 'noindex, nofollow' after a while (see this article).
According to this article from Yoast, the most important reason to use 'noindex, follow' is because Google mostly takes this into account, and sometimes ignores the robots.txt.
Maybe this interesting article gives the real reason. If I understand this correctly, it would be possible that somebody gets an empty page after a while when following a link on another website to one of these internal search URLs when this URL would be blocked by robots.txt. Search engines wouldn't crawl these pages but still would be able to index them because they follow the link. Or does a URL and its content stay available and visible once it is generated, no matter if it is not crawlable or not indexable?
And an additional remark: I came across some big webshops that add a canonical tag on a search result page, pointing to the category URL to which the specific search is related. So if you search for example for 'black laptops', the canonical version of the search result page would be example.com/laptops. If you don't index the search result pages and the links will eventually be 'nofollow', then these pages don't create any value, so what is the point of using canonical tags? On top of that, using canonicals and 'noindex' together should be avoided, according to John Mueller. Google will mostly pick rel=canonical over 'noindex', so this could be an extra reason for internal search URLs being indexed, even when they have the 'noindex' robots tag.
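A rough way to spot the canonical-plus-noindex combination described above is to parse a page's head and flag pages that carry both signals. This is only a sketch with Python's standard `html.parser`; the class name, function name and sample markup are made up for illustration, and it does not claim to reproduce how Google weighs the two tags.

```python
from html.parser import HTMLParser


class RobotsSignalParser(HTMLParser):
    """Collect the meta robots noindex flag and the rel=canonical target."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True
        elif tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href") or None


def has_mixed_signals(html: str, page_url: str) -> bool:
    """True if the page is noindex AND canonicalises to a different URL."""
    parser = RobotsSignalParser()
    parser.feed(html)
    return parser.noindex and parser.canonical is not None and parser.canonical != page_url


# Hypothetical search result page for 'black laptops'.
SAMPLE = (
    '<html><head>'
    '<meta name="robots" content="noindex, follow">'
    '<link rel="canonical" href="https://example.com/laptops">'
    '</head><body></body></html>'
)

print(has_mixed_signals(SAMPLE, "https://example.com/search?q=black+laptops"))
```

A self-referencing canonical on a noindex page is not flagged here, since the concern raised above is about canonicals pointing elsewhere.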
Thanks!
-
These are great additions. I am particularly interested in point #1. I had always suspected Google might try to predict, visit or penetrate URLs in other ways, but I didn't know any of the specifics.
-
This is a good answer. I'd add two small additional notes:
- Google is voracious in URL discovery. Even without any links to a page or any of the other mechanisms described here, we have seen instances of URLs being discovered from other sources (think: Chrome usage data, crawling of common path patterns etc)
- On the description of robots.txt at the end of the answer: I wouldn't describe it as Google "ignoring" the no-crawl directives - they will still obey them and won't crawl the page - it's just that they can index pages that they haven't crawled. Note that this is why you shouldn't combine a robots.txt block with noindex tags - Google won't be able to crawl the page to discover the tags and so may still index it.
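The second point can be turned into a simple audit check: if a URL is disallowed in robots.txt, any noindex tag on that page is invisible to a crawler that obeys the block. Below is a minimal sketch using the standard `urllib.robotparser`; the function name, the `/search` rule and the URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def noindex_is_hidden(robots_txt: str, url: str, meta_robots: str,
                      user_agent: str = "Googlebot") -> bool:
    """True when the page carries noindex but the crawler may not fetch it,
    so the tag can never be read."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = not parser.can_fetch(user_agent, url)
    directives = {d.strip().lower() for d in meta_robots.split(",")}
    return blocked and "noindex" in directives


# Hypothetical rule blocking an assumed /search path.
RULES = "User-agent: *\nDisallow: /search\n"

print(noindex_is_hidden(RULES, "https://example.com/search?q=black+laptops", "noindex, follow"))
print(noindex_is_hidden(RULES, "https://example.com/laptops", "noindex, follow"))
```

The check takes the meta robots value as an argument instead of fetching the page, so it can be run against data exported from any crawl tool.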
-
Actually quite often there are links to pages of search results. Sometimes webmasters link to them when there's no decent, official page available for a series of products which they wish to promote internally (so they just write a query that captures what they want and link to that instead, from CTA buttons and promotional pop-outs and stuff)
Even when that's not the case, users often share search results with each other on forums and stuff like that. Quite often, even when you think there are 'no links' (internally or externally) to a search results page, you can end up being wrong
Sometimes you also have stuff like related search results hidden in the coding of a web-page, which don't 'activate' until a user begins typing (instant search facilities and the like). If coded badly, sometimes even when the user has entered nothing, a cloaked default list of related searches will appear in the source code or modified source code (after scripts have run) and occasionally Google can get caught up there too
Another problem that can occur is certain search results pages accidentally ending up in the XML sitemap, but that's another kettle of fish entirely
Sometimes you can have lateral indexation tags (canonical tags, hreflangs) going rogue too. Sometimes if a page exists in one language but not another, the site is programmed to 'do something clever' to find relevant content. In some cases these tags can be re-pointed to search result URLs to 'mask' the error of non-uniform multilingual deployment. Custom 404 pages can sometimes try and 'be helpful' by attempting to find similar content for end users and in some cases, end up linking to search results (which means that if Google follows a 404 and ends up at the custom 404 URL, Googlebot can sometimes enter the /search area of a website)
You'd be surprised at the number of search results URLs which are linked to on the web, internally or externally
Remember: robots.txt doesn't control indexation, it only controls crawl accessibility. If Google believes a URL is popular (link signals) then they may ignore the no-crawl directive and index the URL anyway. Robots.txt isn't really the type of defense which you can '100% rely upon'
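For the sitemap problem mentioned above, a quick scan of the XML sitemap can show whether search result URLs have slipped in. This is a sketch only: the `/search` path prefix and the `s` and `q` query parameters are assumed conventions that will differ per site, and the sample sitemap is invented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def find_search_urls(sitemap_xml: str, search_path: str = "/search",
                     query_params: tuple = ("s", "q")) -> list:
    """Return sitemap URLs that look like internal search results."""
    root = ET.fromstring(sitemap_xml)
    flagged = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        parts = urlparse(url)
        params = parse_qs(parts.query)
        if parts.path.startswith(search_path) or any(p in params for p in query_params):
            flagged.append(url)
    return flagged


# Hypothetical sitemap containing two search result URLs.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/laptops</loc></url>
  <url><loc>https://example.com/search?q=black+laptops</loc></url>
  <url><loc>https://example.com/?s=kids+wall</loc></url>
</urlset>
"""

for url in find_search_urls(SAMPLE_SITEMAP):
    print(url)
```

Any URL this flags is one that the sitemap is actively asking search engines to visit, whatever the robots.txt or meta robots setup says.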