Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
What to do with a site of >50,000 pages vs. crawl limit?
-
What happens if you have a site in your Moz Pro campaign that has more than 50,000 pages?
Would it be better to choose a sub-folder of the site to get a thorough look at that sub-folder?
I have a few different large government websites that I'm tracking to see how they are faring in rankings and SEO. They are not my own websites. I want to see how these agencies are doing compared to what the public searches for on technical topics and social issues that the agencies manage. I'm an academic looking at science communication. I am in the process of setting up my campaigns again to get better data than I have been getting -- I am a newbie to SEO, and the campaigns I slapped together a few months ago need to be set up better: starting them all on the same day, making sure I've set whether to include www for what ranks, refining my keywords, etc.
I am stumped on what to do about the agency websites being really huge, and what all the options are to get good data in light of the 50,000 page crawl limit. Here is an example of what I mean:
To see how EPA is doing in searches related to air quality, ideally I'd track all of EPA's web presence.
www.epa.gov has 560,000 pages -- if I put in www.epa.gov for a campaign, what happens with the site having so many more pages than the 50,000 crawl limit? What do I miss out on? Can I "trust" what I get?
www.epa.gov/air has only 1450 pages, so if I choose this for what I track in a campaign, the crawl will cover that subfolder completely, and I am getting a complete picture of this air-focused sub-folder ... but (1) I'll miss out on air-related pages in other sub-folders of www.epa.gov, and (2) it seems like I have so much of the 50,000-page crawl limit that I'm not using and could be using. (However, maybe that's not quite true - I'd also be tracking other sites as competitors - e.g. non-profits that advocate in air quality, industry air quality sites - and maybe those competitors count towards the 50,000-page crawl limit and would get me up to the limit? How do the competitors you choose figure into the crawl limit?)
Any opinions on which I should do in general on this kind of situation? The small sub-folder vs. the full humongous site vs. is there some other way to go here that I'm not thinking of?
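For a sense of scale, the two options above can be compared against the limit with some quick arithmetic. This is only a rough sketch using the page counts quoted in this post. It assumes that only the main site counts toward the 50,000-page limit, which the thread does not establish, and the function names are made up for illustration.

```python
CRAWL_LIMIT = 50_000  # crawl limit as quoted in this thread


def coverage(site_pages, limit=CRAWL_LIMIT):
    """Fraction of a site's pages that fit under the crawl limit."""
    return min(site_pages, limit) / site_pages


def limit_used(site_pages, limit=CRAWL_LIMIT):
    """Fraction of the crawl limit that the site would consume."""
    return min(site_pages, limit) / limit


# Page counts taken from the post; competitors are ignored (assumption).
for name, pages in [("www.epa.gov", 560_000), ("www.epa.gov/air", 1_450)]:
    print(f"{name}: {coverage(pages):.1%} of pages crawled, "
          f"{limit_used(pages):.1%} of the limit used")
```

On these numbers the full site would be less than one-tenth crawled, while the sub-folder would be fully crawled and use only a small share of the limit.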
-
Hi Sean -- Can you clarify for me how competitors in a campaign figure into the 50,000-page limit? Does the main site in the campaign get thoroughly crawled first, and then competitors are crawled up to the limit?
Some examples:
If the main site is 100 pages, and I pick 2 competitors that are 100 to 1000 pages and a 3rd gargantuan competitor of 300,000 pages, what happens? Does it matter in what order I enter competitors in this situation as to whether the 100-page and 1000-page competitors get crawled vs. whether the limit maxes out on the 300K competitor before crawling the smaller competitors?
If the main site is 300,000 pages, do any competitors in the campaign just not get crawled at all because the 50,000 limit gets all used up on the main site?
What if the main site is 20,000 pages and a competitor is 45,000 pages? Thorough crawl of main site and then partial crawl of competitor?
I feel like I have a direction to go in based on our previous discussion for the main site in the campaign, but now I'm still a little stumped and confused about how competitors operate within the crawl limit.
-
Hi There,
Thanks for writing in. This is a tricky one because it is difficult to say whether there is an objectively right answer.
In this case, your best bet would be to set up the campaign on a sub-folder that is under the standard subscription campaign limit, and attempt to pick up what you miss using the other research tools. Although our research tools are predominantly designed for one-off interactions, you could probably use them to capture information that is a bit outside of the campaign's purview. Here is a link to our research tools for your reference: moz.com/researchtools/ose/
If you do decide to enter a website that far surpasses the crawl limit, then what gets cut off is determined by the existing site structure.
Our crawler starts from the link provided and follows the existing link structure, and it keeps crawling the site until it reaches the crawl limit or runs into a dead end.
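To illustrate why the site structure decides what gets cut off, here is a minimal sketch of a link-following crawl with a page limit. It is not the actual Moz crawler: the breadth-first order, the `crawl` function, and the toy `site` link graph are all assumptions made for illustration.

```python
from collections import deque


def crawl(links, start, page_limit):
    """Follow links breadth-first from `start`, stopping at `page_limit`
    pages or when no unvisited links remain (a dead end)."""
    seen = {start}
    queue = deque([start])
    crawled = []
    while queue and len(crawled) < page_limit:
        page = queue.popleft()
        crawled.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return crawled


# Hypothetical toy site: each page maps to the pages it links to.
site = {
    "/": ["/air", "/water"],
    "/air": ["/air/ozone", "/air/pm"],
    "/water": ["/water/lakes"],
    "/air/ozone": [],
    "/air/pm": [],
    "/water/lakes": ["/air/smog"],
    "/air/smog": [],
}

full = crawl(site, "/", page_limit=4)     # hits the limit first
sub = crawl(site, "/air", page_limit=4)   # hits a dead end first
print(full)
print(sub)
```

In this toy example the crawl from the home page stops at the limit before reaching the deeper pages, while the crawl from `/air` finishes early but never sees `/air/smog`, which is only linked from another section.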
Both approaches may present issues so it will be more of a judgement call. One thing that I will say is that we have a much easier time crawling fewer pages so that may be something to keep in mind.
Hope this helps and if you have any questions for me please let me know.
Have a fantastic day!
-
Thanks Patrick for the tip about ScreamingFrog! I checked out the link you shared, and it looks like a powerful tool. I'm going to put it on my list of additional tools I need to get going on using.
In the meantime, though, I still need a strategy for what to do in Moz. Any opinions on whether I should set my Moz campaigns to the smaller sub-folders of a few thousand pages vs. the humongous full sites of 100,000+ pages? I guess I'm leaning towards setting them to the smaller sub-folders. Or maybe I should do a small sub-folder for one of the huge sites and do the full site for another campaign, and see what kind of results I get.
-
Hi there
I would look into ScreamingFrog. You can crawl 500 URIs for free; with a license, you can crawl as many pages as you'd like.
Let me know if this helps! Good luck!