Automated XML Sitemap for a BIG site
-
Hi,
I would like to build an automated sitemap for my site, which has more than a million pages. It would need to be a sitemap index split by section of the site (e.g. news, video), and I'll want a news sitemap and a video sitemap as well (of course). Does anyone have a recommended way of building this, and how often would you recommend updating it? For news and video, I would like updates to be pretty much immediate if possible, but the static pages don't need to be updated as often.
Thanks!
-
Another good reference:
http://googlewebmastercentral.blogspot.com/2014/10/best-practices-for-xml-sitemaps-rssatom.html
that points to how to ping:
http://www.sitemaps.org/protocol.html#submit_ping
specific search engine examples:
-
Excellent. Thank you! How would you ping google when a sitemap is updated?
-
Yes, split them out. You will need an index sitemap, which is a sitemap that links to other sitemaps.
https://support.google.com/webmasters/answer/75712?vid=1-635768989722115177-4024498483&rd=1
In any given sitemap you can have up to 50,000 URLs listed in it and it can be no larger than 50MB uncompressed.
https://support.google.com/webmasters/answer/35738?hl=en&vid=1-635768989722115177-4024498483
Therefore, you could have an index sitemap that links to up to 50,000 other sitemaps, and each of those sitemaps could contain links to up to 50,000 URLs on your site.
If my math is right, that would be a max of 2,500,000,000 URLs if you have 50,000 sitemaps of 50,000 URLs each.
(Interesting side note: Google allows up to 500 index sitemaps, so 2,500,000,000 pages x 500 = 1,250,000,000,000 URLs that you can submit to Google via sitemaps.)
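To double-check the arithmetic above, and to see what it means for a site of roughly a million pages, here is a small sketch. The constants are simply the limits quoted in this post (the 500 figure in particular is taken from the post, not independently checked), and `sitemaps_needed` is a hypothetical helper name.

```python
# Limits as quoted in this thread; treat them as assumptions to re-check.
MAX_URLS_PER_SITEMAP = 50_000
MAX_SITEMAPS_PER_INDEX = 50_000
MAX_INDEX_SITEMAPS = 500

# URLs reachable through a single index sitemap.
urls_per_index = MAX_URLS_PER_SITEMAP * MAX_SITEMAPS_PER_INDEX

# URLs reachable through the maximum number of index sitemaps.
total_urls = urls_per_index * MAX_INDEX_SITEMAPS


def sitemaps_needed(page_count: int) -> int:
    """Smallest number of sitemap files that can hold page_count URLs."""
    return (page_count + MAX_URLS_PER_SITEMAP - 1) // MAX_URLS_PER_SITEMAP


print(urls_per_index)              # 2500000000
print(total_urls)                  # 1250000000000
print(sitemaps_needed(1_000_000))  # 20
```

So a site with about 1M pages needs at least 20 sitemap files if each is filled to the URL limit, which fits comfortably inside one index sitemap.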
How you divide your content into sitemaps depends on how you organize the pages on your site, so you are on the right track in breaking out the sitemaps by type of content. Depending on how big any one section of the site is, you may need more than one sitemap for that type, e.g. articlesitemap1.xml, articlesitemap2.xml, etc. You get the idea.
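A rough sketch of that layout, assuming you can already get each section's URLs as a Python list: it splits a section into numbered sitemap files and builds an index that links to them. The function names, the `example.com` URLs, and the `<section>sitemap<N>.xml` naming are illustrative assumptions; only the XML element names and namespace come from the sitemaps.org protocol. It does not cover the extra news or video tags.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunk(urls: List[str], size: int = MAX_URLS_PER_SITEMAP) -> List[List[str]]:
    """Split a list of URLs into consecutive groups of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_urlset(urls: Iterable[str]) -> str:
    """Return the XML for one sitemap file listing the given page URLs."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_urls: Iterable[str]) -> str:
    """Return the XML for an index sitemap linking to other sitemaps."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def plan_section(section: str, urls: List[str], base: str,
                 size: int = MAX_URLS_PER_SITEMAP) -> Dict[str, str]:
    """Map each sitemap URL (e.g. .../articlesitemap1.xml) to its XML."""
    return {
        f"{base}/{section}sitemap{n}.xml": build_urlset(part)
        for n, part in enumerate(chunk(urls, size), start=1)
    }


if __name__ == "__main__":
    pages = [f"https://www.example.com/articles/{i}" for i in range(5)]
    files = plan_section("article", pages, "https://www.example.com", size=2)
    print(list(files))         # three sitemap URLs for five pages
    print(build_index(files))  # index linking to those three files
```

In practice you would write each value in `files` to disk (or serve it dynamically) and regenerate only the sections that changed.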
It is recommended that you ping Google every time a page in a sitemap is updated so that Google comes back and recrawls the sitemap. I don't run any sites with 1M URLs, but I do run several in the tens of thousands. We break them up by type and ping whenever we update a page in that group. You also need to consider your crawl budget: Google may not crawl all 1M pages in your sitemaps very often. So for a group such as articlesitemap1.xml, articlesitemap2.xml, and articlesitemap3.xml, consider always adding your newest URLs to the most recently created sitemap (i.e. articlesitemap3.xml). That way you are generally pinging Google about an update to a single sitemap in the group rather than all three.
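Here is one way the "always add to the newest sitemap, then ping only that one" idea might look. This is a sketch under assumptions: `add_to_newest` and `build_ping_url` are hypothetical helpers, the search engine base URL is a placeholder, and the ping URL simply follows the `<searchengine_URL>/ping?sitemap=<sitemap_url>` format described on sitemaps.org. Whether a given search engine currently accepts pings is something to confirm in its own documentation.

```python
from typing import List
from urllib.parse import quote

MAX_URLS_PER_SITEMAP = 50_000


def add_to_newest(sitemaps: List[List[str]], new_url: str,
                  limit: int = MAX_URLS_PER_SITEMAP) -> int:
    """Append new_url to the last sitemap in the group.

    Starts a new sitemap when the last one is full. Returns the
    zero-based position of the sitemap that changed, which is the
    only one in the group that needs a ping. Note that starting a
    new file also changes the index sitemap.
    """
    if not sitemaps or len(sitemaps[-1]) >= limit:
        sitemaps.append([])
    sitemaps[-1].append(new_url)
    return len(sitemaps) - 1


def build_ping_url(engine_base: str, sitemap_url: str) -> str:
    """Build a ping URL in the sitemaps.org format.

    The sitemap URL is percent-encoded because it is passed as a
    query-string value.
    """
    return f"{engine_base.rstrip('/')}/ping?sitemap={quote(sitemap_url, safe='')}"


if __name__ == "__main__":
    group = [["https://www.example.com/a/1"], ["https://www.example.com/a/2"]]
    changed = add_to_newest(group, "https://www.example.com/a/3")
    sitemap = f"https://www.example.com/articlesitemap{changed + 1}.xml"
    # Placeholder engine URL; an HTTP GET to the result would be the ping.
    print(build_ping_url("https://searchengine.example", sitemap))
```

Sending the ping is then a plain HTTP GET to the printed URL, for example with `urllib.request.urlopen`.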
My other thought is that, in addition to pinging Google only about the sitemaps that you have updated, you return a 304 server response for every sitemap that has not been updated. 304 means "Not Modified" since the last visit. One of your challenges will be your crawl budget with Google, so why make them recrawl a sitemap they have already crawled? You may want to consider returning a 304 for any URL on your site that has not changed since the last time Google visited.
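The 304 decision can be sketched as a small function, assuming your server can tell you when a sitemap (or page) last changed. `conditional_status` is a hypothetical helper; the `If-Modified-Since` request header and the 200/304 status codes are standard HTTP. Most web servers and frameworks already do this for static files, so this only illustrates the logic.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(if_modified_since: Optional[str],
                       last_modified: datetime) -> int:
    """Return 304 if the resource is unchanged since the client's copy.

    if_modified_since is the raw If-Modified-Since header value (or
    None when the header is absent). last_modified is a timezone-aware
    datetime for the resource's last change.
    """
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparseable header: ignore it and send the full response.
        return 200
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200


if __name__ == "__main__":
    changed_at = datetime(2015, 9, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(conditional_status("Tue, 01 Sep 2015 12:00:00 GMT", changed_at))
    print(conditional_status("Mon, 31 Aug 2015 12:00:00 GMT", changed_at))
```

For this to work, the full (200) response also needs to send a `Last-Modified` header so the crawler has a date to send back on its next visit.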
All of that said, as I mentioned above, I have not worked at the scale of 1M+ pages and would defer to others on the best way to approach it. The general thought process would be the same, though: figure out the best way to use your sitemaps to manage your crawl budget with Google. Small side note: if you have 1M+ pages and any of those come from things like sorting parameters, duplicate content, or printer-friendly pages, you may want to just noindex them regardless, leave them out of the sitemap, and not allow Google to crawl them to start with.