Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Crawled page count in Search Console
-
Hi Guys,
I'm working on a project (premium-hookahs.nl) where I stumble upon a situation I can’t address. Attached is a screenshot of the crawled pages in Search Console.
History:
Due to technical difficulties, this webshop didn’t always noindex its filter pages, resulting in thousands of duplicate pages. In reality this webshop has fewer than 1,000 individual pages. So far we have taken the following steps to resolve this:
- Noindex the filter pages.
- Exclude those filter pages in Search Console and robots.txt.
- Canonicalize the filter pages to the relevant category pages.
This, however, didn’t result in Google crawling fewer pages. Although the implementation wasn’t always sound (technical problems during updates), I’m sure this setup has been the same for the last two weeks. Personally I expected a drop in crawled pages, but they are still sky high. I can’t imagine Google visits this site 40 times a day.
To complicate the situation:
We’re running an experiment to gain positions on around 250 long-tail searches. A few filters will be indexed (size, color, number of hoses and flavors) and three of them can be combined. This results in around 250 extra pages. Their meta titles, descriptions, h1s and texts are unique as well.
Questions:
- Excluding in robots.txt should result in Google not crawling those pages, right?
- Is this number of crawled pages normal for a website with around 1,000 unique pages?
- What am I missing?
-
Ben,
I doubt that crawlers are going to access the robots.txt file for each request, but they still have to validate any url they find against the list of the blocked ones.
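As a rough illustration of that point, the sketch below loads a set of rules once and then checks every discovered URL against them, using Python's standard `urllib.robotparser`. The `Disallow: /filter/` rule and the URL paths are made-up assumptions for the example; the site's real robots.txt and filter URL structure are not shown in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the real robots.txt of the site is not quoted in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /filter/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the rules once, the way a crawler would cache them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawlable(parser: RobotFileParser, urls, agent: str = "Googlebot"):
    """Validate every discovered URL against the cached rules."""
    return [url for url in urls if parser.can_fetch(agent, url)]


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs a crawler might find in the site's internal links.
discovered = [
    "http://premium-hookahs.nl/",
    "http://premium-hookahs.nl/filter/color-red",
    "http://premium-hookahs.nl/hookahs",
]

allowed = crawlable(parser, discovered)
print(allowed)
```

The rules are fetched and parsed once, but each URL found in a link still has to be matched against them before it is requested or skipped.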
Glad to help,
Don
-
Hi Don,
Thanks for the clear explanation. I always thought a disallow in robots.txt would give Google a sort of map (at the start of a site crawl) of the pages on the site that shouldn’t be crawled, so that it wouldn’t have to “check the locked cars”.
If I understand you correctly, Google checks the robots.txt with every single page load?
That could definitely explain the high number of crawled pages per day.
Thanks a lot!
-
Hi Bob,
About the nofollow vs blocked. In the end I suppose you have the same results, but in practice it works a little differently. When you nofollow a link it tells the crawler as soon as it encounters the link not to request or follow that link path. When you block it via robots the crawler still attempts to access the url only to find it not accessible.
Imagine if I said go to the parking lot and collect all the loose change in all the unlocked cars. Now imagine how much easier that task would be if all the locked cars had a sign in the window that said "Locked", you could easily ignore the locked cars and go directly to the unlocked ones. Without the sign you would have to physically go check each car to see if it will open.
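The "sign in the window" idea can be sketched in a few lines of Python: while parsing a page, links marked `rel="nofollow"` are set aside immediately and never queued. This only mirrors the behaviour described above and is not a claim about how Googlebot is implemented; the HTML snippet and the filter URLs in it are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect links from a page, separating rel="nofollow" links from the rest."""

    def __init__(self):
        super().__init__()
        self.follow = []    # links a crawler would queue
        self.nofollow = []  # links carrying the "Locked" sign

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel can hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.follow.append(href)


# Hypothetical category page with two filter links.
HTML = """
<a href="/hookahs">All hookahs</a>
<a href="/hookahs?color=red" rel="nofollow">Red</a>
<a href="/hookahs?size=large" rel="nofollow noopener">Large</a>
"""

collector = LinkCollector()
collector.feed(HTML)
print(collector.follow)
print(collector.nofollow)
```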
About link juice: if you have a link, juice will be passed regardless of the type of link. (You used to be able to use nofollow to preserve link juice, but no longer.) This is a bit unfortunate for sites that use search filters, because they are such a valuable tool for the users.
Don
-
Hi Don,
You're right about the sitemap, noted it on the to do list!
Your point about nofollow is interesting. Isn't excluding in robots.txt giving the same result?
Before we went ahead with the robots.txt we didn't implement nofollow because we didn't want any link juice to be lost. Now that we have the robots.txt in place, I assume this doesn’t matter anymore, since Google won’t crawl those pages anyway.
Best regards,
Bob
-
Hi Bob,
You can "suggest" a crawl rate to Google by logging into Google Webmaster Tools and adjusting it there.
As for indexing pages: I looked at your robots.txt and site. It really looks like you need to employ nofollow on some of your internal linking, specifically on the product page filters. That alone could reduce the total number of URLs that the crawlers even attempt to look at.
Additionally your sitemap http://premium-hookahs.nl/sitemap.xml shows a change frequency of daily, and probably should be broken out between Pages / Images so you end up using two sitemaps one for images and one for pages. You may also want to review what is in there. Using ScreamingFrog (free) the sitemap I made (link) only shows about 100 urls.
Hope it helps,
Don
-
Hi Don,
Just wanted to add a quick note: your input made me go through the indexation state of the website again, which was worse than I thought it was. I will take some steps to get this resolved, thanks!
Would love to hear your input about the number of crawled pages.
Best regards,
Bob
-
Hello Don,
Thanks for your advice. What would your advice be if the main goal were the reduction of crawled pages per day? I think we have the right pages in the index and the old duplicates are mostly deindexed. At this point I’m mostly worried about Google spending its crawl budget on the right pages. Somehow it still crawls 40,000 pages per day while we only have around 1,000 pages that should be crawled. Looking at the current setup (with almost everything excluded through robots.txt) I can’t think of which pages it crawls to reach the 40k. And 40 times a day sounds like way too many crawls for a normal webshop.
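One way to find out which pages make up those 40k requests is to count Googlebot hits in the server's access log. The following is a minimal sketch, assuming an Apache/Nginx "combined" log format and assuming filter pages can be recognised by a query string; both are assumptions, and the sample log lines are invented for the example.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and user agent of a
# "combined" format log line (assumed format, adjust to the real server).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count Googlebot requests, split by URLs with and without a query string."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        bucket = "with query string" if "?" in match.group("path") else "without query string"
        counts[bucket] += 1
    return counts


# Made-up sample lines standing in for a real access log.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Jan/2017:06:25:24 +0100] "GET /hookahs?color=red HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Jan/2017:06:25:30 +0100] "GET /hookahs HTTP/1.1" 200 7310 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Jan/2017:06:26:02 +0100] "GET /hookahs?size=large HTTP/1.1" 200 4988 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

hits = googlebot_hits(SAMPLE_LOG)
print(hits)
```

Run against a full day of logs, the two buckets give a first indication of whether the crawl volume goes to the roughly 1,000 real pages or to filter URLs. User agent strings can be spoofed, so this is only a rough count.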
Hope to hear from you!
-
Hello Bob,
Here is some food for thought. If you disallow a page in robots.txt, Google, for example, will not crawl that page. That does not, however, mean they will remove it from the index if it had previously been crawled. It simply treats it as inaccessible and moves on. It will take some time, possibly months, before Google finally says: we have no fresh crawls of page X, it's time to remove it from the index.
On the other hand if you specifically allow Google to crawl those pages and show a no-index tag on it, Google now has a new directive it can act upon immediately.
So my evaluation of the situation would be to do 1 of 2 things.
1. Remove the disallow from robots and allow Google to crawl the pages again. However, this time use no-index, no-follow tags.
2. Remove the disallow from robots and allow Google to crawl the pages again, but use canonical tags to the main "filter" page to prevent further indexing the specific filter pages.
Which option is best depends on the number of URLs being indexed. For a few thousand, canonical would be my choice. For a few hundred thousand, noindex would make more sense.
Whichever option you choose, you will have to ensure Google re-crawls the pages, and then allow time for them to be re-indexed appropriately. Not a quick fix, but a fix nonetheless.
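The two options above differ only in the tag placed in the `<head>` of each filter page. A minimal sketch of both, assuming (hypothetically) that a filter page is the category URL plus a query string; the helper names and the example URL are illustrative and not taken from the site:

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def noindex_tag() -> str:
    """Option 1: ask crawlers not to index the page or follow its links."""
    return '<meta name="robots" content="noindex, nofollow">'


def canonical_tag(filter_url: str) -> str:
    """Option 2: point the filter page at its category page.

    Assumes the category URL is the filter URL without its query string.
    """
    parts = urlsplit(filter_url)
    category_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{escape(category_url, quote=True)}">'


# Hypothetical filter URL.
example = "http://premium-hookahs.nl/hookahs?color=red"
print(noindex_tag())
print(canonical_tag(example))
```

Either tag only takes effect once the robots.txt disallow is removed, because a crawler has to fetch the page to see it.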
My thoughts and I hope it makes sense,
Don