Massive URL blockage by robots.txt
-
Hello people,
In May there was a dramatic increase in URLs blocked by robots.txt, even though we don't have that many URLs or crawl errors. You can view the attachment to see how it went up. The thing is, the company hasn't touched the robots.txt file since 2012. What might be causing the problem? Can this result in any penalties? Can indexation be lowered because of this?
-
Even though there are fewer pages indexed than blocked, you still have a significant increase in indexed pages as well. That is a good thing! You technically have more pages indexed than before. It looks like you possibly relaunched the site or something? More pages blocked could be an indexing problem, or it might be a good thing - it all depends on which pages are being blocked.
If you relaunched the site and used this great new whiz-bang CMS that created an online catalog that gave your users 54 ways to sort your product catalog, then the number of "pages" could increase with each sort. Just imagine: sort your widgets by color, or by size, or by price, or by price and size, or by size and color, or by color and price - you get the idea. Very quickly you have a bunch of duplicate versions of a single page. If your SEO was on his or her toes, they would account for this using a canonical approach, or possibly a meta noindex, or by changing the robots.txt, etc. That would be good, as you are not going to confuse Google with all the different versions of the same page.
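To make the "canonical approach" a bit more concrete, here is a rough sketch in Python of how sorted versions of a catalog page can collapse back to one canonical URL. The parameter names (sort, order, dir) and the example.com URLs are made up for illustration; your CMS will have its own, so treat this as an assumption-laden example and not how any particular CMS does it.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical sort-related query parameters; a real CMS will use its own names.
SORT_PARAMS = {"sort", "order", "dir"}

def canonical_url(url):
    """Return the URL with sort-only query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in SORT_PARAMS
    ]
    # Drop the fragment as well; it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

# Three "pages" produced by sorting the same widget catalog (made-up URLs).
variants = [
    "http://www.example.com/widgets?sort=color",
    "http://www.example.com/widgets?sort=size&dir=desc",
    "http://www.example.com/widgets?sort=price&page=2",
]

# The set holds only the distinct canonical targets.
canonicals = {canonical_url(u) for u in variants}
for url in sorted(canonicals):
    print(url)
```

The value returned by canonical_url is what you would put in the rel="canonical" link of each sorted variant, so the sort-only versions all point at the same page while real pagination stays separate.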
Ultimately, Shailendra has the approach that you need to take. Look in robots.txt, look at the code on your pages. What happened around 5/26/2013? All those things need to be looked at to try and answer your question.
-
Le Fras,
The robots.txt file doesn't have to change for Google to report that more URLs are being blocked by it. The robots.txt file tells the search engines not to crawl the given URLs, but they may still keep those URLs in the index and display them in the search results.
So the search engines do know of the URLs that are being blocked and they are able to indicate that more are being blocked as you add pages to your site that are restricted by the robots.txt file.
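Here is a small sketch of that effect using Python's standard urllib.robotparser module. The robots.txt rule and the example.com URLs are invented for illustration (I haven't seen your actual file), but it shows how the same unchanged rule blocks more URLs once the site starts generating more pages under the disallowed path.

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt that never changes between the two snapshots below.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""

def count_blocked(urls, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Count how many of the given URLs the robots.txt rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return sum(1 for url in urls if not parser.can_fetch(agent, url))

# Made-up snapshot of the site before new pages were added.
urls_before = [
    "http://www.example.com/",
    "http://www.example.com/search/?q=widgets",
]

# Same site later: new sorted result pages appear under the blocked path.
urls_after = urls_before + [
    f"http://www.example.com/search/?q=widgets&sort={key}"
    for key in ("color", "size", "price")
]

print(count_blocked(urls_before), "blocked before")
print(count_blocked(urls_after), "blocked after")
```

Note that urllib.robotparser only does plain prefix matching, so it won't behave like Google for rules that use wildcards. It is fine for a quick sanity check, though.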
-
Check your robots.txt file. Are there entries that block crawling? If you can give the URL, that would be helpful.
Regards