Does using robots.txt to block pages decrease search traffic?
-
I know you can use robots.txt to tell search engines not to spend their resources crawling certain pages.
So, if you have a section of your website that is good content but is never updated, and you want the search engines to index new content faster, would it work to block the good, unchanged content with robots.txt? Would this content lose any search traffic if it were blocked by robots.txt? Does anyone have any case studies available?
-
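If you block the pages from being crawled, you are effectively also telling the search engines not to rank them: they can't see the content, and they don't want to include something they haven't looked at. (A blocked URL can still show up in the index if other pages link to it, but typically without a description, since its content was never read.) So yes, the traffic numbers from organic search will change if you block the pages in robots.txt.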
-
Agreed, that is a better solution, but I am still wondering: if you block something with robots.txt, will that lead to a decrease in traffic? What if we have some duplicate content that is highly trafficked? If we block it with robots.txt, will the traffic numbers change?
-
You certainly don't want to block this content!
One thing I'd consider is the If-Modified-Since header, or other headers. Here are two articles that explain more about the concept of using headers to tell the search engines "this hasn't changed, don't bother crawling it". I haven't personally used this, but I have read about it in many places.
http://www.feedthebot.com/ifmodified.html
http://searchengineland.com/how-to-improve-crawl-efficiency-with-cache-control-headers-88824
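To make the header idea concrete, here is a minimal sketch of the decision a server makes when a crawler sends If-Modified-Since. It is not taken from either article; the function name `conditional_response` and its arguments are assumptions for illustration, and a real site would normally let the web server or framework handle this.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET on a page last changed at `last_modified`.

    `last_modified` must be a timezone-aware UTC datetime. `if_modified_since`
    is the raw request header value, or None if the client did not send one.
    (Hypothetical helper for illustration only.)
    """
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable header: fall back to a full response.
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            return 304, headers  # Not Modified: no body needs to be sent.

    return 200, headers
```

A crawler that already has the page sends back the Last-Modified date it saw earlier. If the page has not changed since then, the answer is a 304 with no body, which is the "don't bother crawling it" signal described above. The page stays crawlable and indexed, unlike with a robots.txt block.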
Related Questions
-
Robots.txt wildcards - the devs had a disagreement - which is correct?
Hi – the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and within all the subdirectories of this directory that included a “?”. The second developer suggested that this wildcard would only block URLs featuring a ? that comes immediately after /shirts/ - for example: /shirts?minprice=10&maxprice=20 - BUT argued that this robots.txt directive would not block URLs featuring a ? in subdirectories - e.g. /shirts/blue?mprice=100&maxp=20. So which of the developers is correct? Beyond that, I assumed that the ? should feature a * on each side of it - for example, /*?* - to work as intended above. Am I correct in assuming that?
Intermediate & Advanced SEO | | McTaggart0 -
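One way to check a disagreement like this is to test the pattern against sample URLs. The sketch below assumes a crawler that treats `*` as "any sequence of characters" and a trailing `$` as "end of URL", which is how Google documents its wildcard handling; other crawlers may differ. The helper names `robots_pattern_to_regex` and `is_blocked` are made up for this example.

```python
import re


def robots_pattern_to_regex(pattern):
    """Compile a robots.txt path pattern using `*` and `$` wildcard semantics."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally (including `?`), then turn `*` into "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern, url_path):
    """True if `url_path` (path plus query string) matches the Disallow pattern."""
    # Rules match from the start of the path, hence re.match rather than re.search.
    return robots_pattern_to_regex(pattern).match(url_path) is not None


for candidate in (
    "/shirts/?minprice=10&maxprice=20",
    "/shirts?minprice=10&maxprice=20",
    "/shirts/blue?mprice=100&maxp=20",
):
    print(candidate, is_blocked("/shirts/?*", candidate), is_blocked("/shirts/*?", candidate))
```

Under these assumptions, `/shirts/?*` only matches URLs where the ? comes directly after `/shirts/`, so it misses both `/shirts?...` (no slash) and `/shirts/blue?...`. A pattern with the `*` before the `?`, such as `/shirts/*?`, also catches the subdirectory URL. The trailing `*` adds nothing, because rules are prefix matches anyway.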
Avoiding Duplicate Content with Used Car Listings Database: Robots.txt vs Noindex vs Hash URLs (Help!)
Hi Guys, We have developed a plugin that allows us to display used vehicle listings from a centralized, third-party database. The functionality works similarly to autotrader.com or cargurus.com, and there are two primary components:
1. Vehicle Listings Pages: this is the page where the user can use various filters to narrow the vehicle listings to find the vehicle they want.
2. Vehicle Details Pages: this is the page where the user actually views the details about said vehicle. It is served up via Ajax, in a dialog box on the Vehicle Listings Pages. Example functionality: http://screencast.com/t/kArKm4tBo
The Vehicle Listings pages (#1) we do want indexed and ranking. These pages have additional content besides the vehicle listings themselves, and those results are randomized or sliced/diced in different and unique ways. They're also updated twice per day.
We do not want to index #2, the Vehicle Details pages, as these pages appear and disappear all of the time, based on dealer inventory, and don't have much value in the SERPs. Additionally, other sites such as autotrader.com, Yahoo Autos, and others draw from this same database, so we're worried about duplicate content. For instance, entering a snippet of dealer-provided content for one specific listing that Google indexed yielded 8,200+ results (example Google query).
We did not originally think that Google would even be able to index these pages, as they are served up via Ajax. However, it seems we were wrong, as Google has already begun indexing them. Not only is duplicate content an issue, but these pages are not meant for visitors to navigate to directly! If a user were to navigate to the URL directly, from the SERPs, they would see a page that isn't styled right.
Now we have to determine the right solution to keep these pages out of the index: robots.txt, noindex meta tags, or hash (#) internal links.
Robots.txt Advantages:
Super easy to implement.
Conserves crawl budget for large sites.
Ensures crawler doesn't get stuck. After all, if our website only has 500 pages that we really want indexed and ranked, and vehicle details pages constitute another 1,000,000,000 pages, it doesn't seem to make sense to make Googlebot crawl all of those pages.
Robots.txt Disadvantages:
Doesn't prevent pages from being indexed, as we've seen, probably because there are internal links to these pages. We could nofollow these internal links, thereby minimizing indexation, but this would lead to 10-25 nofollowed internal links on each Vehicle Listings page (will Google think we're pagerank sculpting?).
Noindex Advantages:
Does prevent vehicle details pages from being indexed.
Allows ALL pages to be crawled (advantage?).
Noindex Disadvantages:
Difficult to implement (vehicle details pages are served using Ajax, so they have no <head> tag). Solution would have to involve the X-Robots-Tag HTTP header and Apache, sending a noindex tag based on querystring variables, similar to this stackoverflow solution. This means the plugin functionality is no longer self-contained, and some hosts may not allow these types of Apache rewrites (as I understand it).
Forces (or rather allows) Googlebot to crawl hundreds of thousands of noindex pages. I say "force" because of the crawl budget required. Crawler could get stuck/lost in so many pages, and may not like crawling a site with 1,000,000,000 pages, 99.9% of which are noindexed.
Cannot be used in conjunction with robots.txt. After all, crawler never reads the noindex meta tag if blocked by robots.txt.
Hash (#) URL Advantages:
By using hash (#) URLs for links on Vehicle Listing pages to Vehicle Details pages (such as "Contact Seller" buttons), coupled with Javascript, crawler won't be able to follow/crawl these links.
Best of both worlds: crawl budget isn't overtaxed by thousands of noindex pages, and internal links used to index robots.txt-disallowed pages are gone.
Accomplishes same thing as "nofollowing" these links, but without looking like pagerank sculpting (?).
Does not require complex Apache stuff.
Hash (#) URL Disadvantages:
Is Google suspicious of sites with (some) internal links structured like this, since they can't crawl/follow them?
Initially, we implemented robots.txt, the "sledgehammer solution." We figured that we'd have a happier crawler this way, as it wouldn't have to crawl zillions of partially duplicate vehicle details pages, and we wanted it to be like these pages didn't even exist. However, Google seems to be indexing many of these pages anyway, probably based on internal links pointing to them. We could nofollow the links pointing to these pages, but we don't want it to look like we're pagerank sculpting or something like that.
If we implement noindex on these pages (and doing so is a difficult task itself), then we will be certain these pages aren't indexed. However, to do so we will have to remove the robots.txt disallow rule, in order to let the crawler read the noindex tag on these pages. Intuitively, it doesn't make sense to me to make Googlebot crawl zillions of vehicle details pages, all of which are noindexed, and it could easily get stuck/lost/etc. It seems like a waste of resources, and in some shadowy way bad for SEO.
My developers are pushing for the third solution: using the hash URLs. This works on all hosts and keeps all functionality in the plugin self-contained (unlike noindex), and conserves crawl budget while keeping vehicle details pages out of the index (unlike robots.txt). But I don't want Google to slap us 6-12 months from now because it doesn't like links like these.
Any thoughts or advice you guys have would be hugely appreciated, as I've been going in circles, circles, circles on this for a couple of days now. Also, I can provide a test site URL if you'd like to see the functionality in action.
Intermediate & Advanced SEO | | browndoginteractive0 -
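The X-Robots-Tag option mentioned in this question can be sketched at the application level instead of in Apache rewrites. This is only an illustration of the idea, not the plugin's actual code: the query-string parameter `vehicle_id` and the function `robots_headers` are hypothetical names.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical query-string parameter that identifies a Vehicle Details request.
DETAILS_PARAM = "vehicle_id"


def robots_headers(url):
    """Return extra response headers for `url`; noindex only Vehicle Details requests."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if DETAILS_PARAM in query:
        # X-Robots-Tag carries the same directives as a robots meta tag,
        # so it also works for Ajax responses that have no HTML head section.
        return {"X-Robots-Tag": "noindex"}
    return {}
```

As the question itself notes, this only works if the details URLs are not disallowed in robots.txt, because a crawler has to fetch a response before it can see the header.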
Canonical use when dynamically placing items on "all products" page
Hi all, We're trying to get our canonical situation straightened out. We have a section of our site with 100 product pages in it (in our case a city with hotels that we've reviewed), and we have a single page where we list them all out--an "all products" page called "all.html." However, because we have 100 and that's a lot for a user to see at once, we plan to first show only 50 on "all.html." When the user scrolls down to the bottom, we use AJAX to place another 50 on the page (these come from another page called "more.html" and are placed onto "all.html"). So, as you scroll down from the front end, you see "all.html" with 100 listings. We have other listings pages that are sorted and filtered subsets of this list with little or no unique content. Thus, we want to place a canonical on those pages. Question: Should the canonical point to "all.html"? Would spiders get confused, because they see that all.html is only half the listings? Is it dangerous to dynamically place content on a page that's used as a canonical? Is this a non-issue? Thanks, Tom
Intermediate & Advanced SEO | | TomNYC0 -
Blocking out specific URLs with robots.txt
I've been trying to block out a few URLs using robots.txt, but I can't seem to block only the specific one I want. Here is an example. I'm trying to block something.com/cats but not block something.com/cats-and-dogs. It seems that if I set up my robots.txt like so: Disallow: /cats it blocks both URLs. When I crawl the site with Screaming Frog, that Disallow is causing both URLs to be blocked. How can I set up my robots.txt to specifically block /cats? I thought the way I was doing it was right, but that doesn't seem to solve it. Any help is much appreciated, thanks in advance.
Intermediate & Advanced SEO | | Whebb0 -
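The behaviour described here follows from Disallow rules being prefix matches. The sketch below reproduces it with Python's standard-library robots.txt parser and shows one possible workaround, an explicit Allow line for the longer path. The variable names and the Allow-based fix are this example's assumptions, not something proposed in the thread.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Original rule: a plain prefix match, so it also catches /cats-and-dogs.
prefix_only = build_parser("User-agent: *\nDisallow: /cats\n")

# One possible fix: explicitly allow the longer path ahead of the Disallow rule.
with_allow = build_parser(
    "User-agent: *\n"
    "Allow: /cats-and-dogs\n"
    "Disallow: /cats\n"
)

for path in ("/cats", "/cats-and-dogs"):
    print(path, prefix_only.can_fetch("*", path), with_allow.can_fetch("*", path))
```

With the original rule both paths are refused. With the Allow line, only `/cats` is refused. Crawlers that support the `$` end-of-URL anchor (Google documents this) would also accept `Disallow: /cats$`, but the standard-library parser used here does not implement wildcards, so that variant is not shown.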
Why Does Ebay Allow Internal Search Result Pages to be Indexed?
Click this Google query: https://www.google.com/search?q=les+paul+studio Notice how Google has a rich snippet for Ebay saying that it has 229 results for Ebay's internal search result page: http://screencast.com/t/SLpopIvhl69z Notice how Sam Ash's internal search result page also ranks on page 1 of Google. I've always followed the best practice of setting internal search result pages to "noindex." Previously, our company's many Magento eCommerce stores had the internal search result pages set to be "index," and Google indexed over 20,000 internal search result URLs for every single site. I advised that we change these to "noindex," and impressions from Search Queries (reported in Google Webmaster Tools) shot up on 7/24 with the Panda update on that date. Traffic didn't necessarily shoot up...but it appeared that Google liked that we got rid of all this thin/duplicate content and ranked us more (deeper than page 1, however). Even Dr. Pete advises no-indexing internal search results here: http://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world So, why is Google rewarding Ebay and Sam Ash with page 1 rankings for their internal search result pages? Is it their domain authority that lets them get away with it? Could it be that noindexing internal search result pages is NOT best practice? Is the game different for eCommerce sites? Very curious what my fellow professionals think. Thanks, Dan
Intermediate & Advanced SEO | | M_D_Golden_Peak0 -
Effect duration of robots.txt file.
My website has a demo site in a demo folder, and it is also indexed in Google, but I don't need it any more. So I created a robots.txt file and uploaded it to the server yesterday. The demo folder contains some HTML files, and I want to remove all of them from Google. But Webmaster Tools is still showing:
User-agent: *
Disallow: /demo/
How long will this take to remove the pages from Google? And is there an alternative way of doing that?
Intermediate & Advanced SEO | | innofidelity0 -
Robots
I have just noticed this in my code: <meta name="robots" content="noindex"> I have also noticed that some of my keywords have dropped. Could this be the reason?
Intermediate & Advanced SEO | | Paul780 -
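A stray noindex tag like this is easy to miss, so it can help to scan templates or rendered pages for it. The following is a minimal sketch using only the standard library. The class `RobotsMetaFinder` and the function `has_noindex` are names invented for this example, and it only looks at robots meta tags, not at HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated and case-insensitive.
            self.directives.update(part.strip().lower() for part in content.split(","))


def has_noindex(html):
    """True if the HTML contains a robots meta tag with a noindex directive."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives
```

Calling `has_noindex(page_source)` on each important page's HTML would flag pages that are asking search engines to drop them.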
Are there any negative effects to using a 301 redirect from a page to another internal page?
For example, from http://www.dog.com/toys to http://www.dog.com/chew-toys. In my situation, the main purpose of the 301 redirect is to replace the page with a new internal page that has a better optimized URL. This will be executed across multiple pages (about 20). None of these pages hold any search rankings but do carry a decent amount of page authority.
Intermediate & Advanced SEO | | Visually0