Automated XML Sitemap for a BIG site
-
Hi,
I would like to build an automated sitemap for my site, but it has more than a million pages. It would need to be a sitemap index split by section of the site (e.g. news, video), and I'll also want a news sitemap and a video sitemap (of course). Does anyone have a recommended way of building this, and how often would you recommend updating it? For news and video, I would like updates to be pretty much immediate if possible, but the static pages don't need to be updated as often.
Thanks!
-
Another good reference:
http://googlewebmastercentral.blogspot.com/2014/10/best-practices-for-xml-sitemaps-rssatom.html
that points to how to ping:
http://www.sitemaps.org/protocol.html#submit_ping
specific search engine examples:
-
Excellent. Thank you! How would you ping Google when a sitemap is updated?
-
Yes, split them out. You will need an index sitemap, that is, a sitemap that links to other sitemaps:
https://support.google.com/webmasters/answer/75712?vid=1-635768989722115177-4024498483&rd=1
Any given sitemap can list up to 50,000 URLs, and the file can be no larger than 50MB uncompressed.
https://support.google.com/webmasters/answer/35738?hl=en&vid=1-635768989722115177-4024498483
Therefore, you could have an index sitemap that links to up to 50,000 other sitemaps, each of which could contain links to up to 50,000 URLs on your site.
If my math is right, that would be a max of 2,500,000,000 URLs if you have 50,000 sitemaps of 50,000 URLs each.
(Interesting side note: Google allows up to 500 index sitemaps, so 2,500,000,000 URLs x 500 = 1,250,000,000,000 URLs that you can submit to Google via sitemaps.)
How you divide your content into sitemaps depends on how you organize the pages on your site, so you are on the right track in breaking out the sitemaps by type of content. Depending on how big any one section of the site is, you may need more than one sitemap for that type, e.g. articlesitemap1.xml, articlesitemap2.xml, etc. You get the idea.
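To make the splitting concrete, here is a rough Python sketch of chunking one section's URLs into files of at most 50,000 and building the index that links to them. The function names, the example.com URLs, and the "article" section name are placeholders I made up for illustration; only the 50,000 limit and the articlesitemapN.xml naming come from the discussion above. Treat it as a starting point, not a finished generator (it ignores the 50MB size limit and optional tags such as lastmod).

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit quoted above
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive groups of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def sitemap_filenames(section, chunk_count):
    """Name files like articlesitemap1.xml, articlesitemap2.xml, ..."""
    return [f"{section}sitemap{n}.xml" for n in range(1, chunk_count + 1)]


def build_sitemap(urls):
    """Render one <urlset> document for a single chunk of URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n'
        f"{entries}</urlset>\n"
    )


def build_sitemap_index(base_url, filenames):
    """Render the <sitemapindex> document that links to each child sitemap."""
    entries = "".join(
        f"  <sitemap><loc>{escape(base_url + name)}</loc></sitemap>\n"
        for name in filenames
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n'
        f"{entries}</sitemapindex>\n"
    )


if __name__ == "__main__":
    # Hypothetical section with 120,000 article URLs.
    urls = [f"https://www.example.com/article/{i}" for i in range(120_000)]
    names = sitemap_filenames("article", len(chunk_urls(urls)))
    print(names)
```

With 120,000 URLs this yields three files (two full ones and one with 20,000 URLs), which is the articlesitemap1.xml, articlesitemap2.xml, articlesitemap3.xml layout described above.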
It is recommended that you ping Google every time a page in a sitemap is updated so that Google will come back and recrawl the sitemap. I don't run any sites with 1M URLs, but I do run several in the tens of thousands. We break them up by type and ping whenever we update a page in that group. You need to consider your crawl budget with Google: it may not crawl all 1M pages in your sitemaps very often. So for a group of pages split across articlesitemap1.xml, articlesitemap2.xml, and articlesitemap3.xml, consider always adding your newest URLs to the most recently created sitemap (i.e. articlesitemap3.xml). That way you are generally pinging Google about an update to a single sitemap in the group rather than all three.
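A small Python sketch of that "newest URLs go in the newest sitemap, then ping only that one" idea. The ping URL follows the pattern on the sitemaps.org protocol page linked elsewhere in this thread (search engine URL + /ping?sitemap= + the URL-encoded sitemap location). The function names, the in-memory list-of-lists representation, and the searchengine.example host are my own placeholders, and search engines have changed their support for the ping endpoint over time, so check each engine's current documentation before relying on it.

```python
from urllib.parse import quote

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit quoted above


def add_url(sitemaps, url, limit=MAX_URLS_PER_SITEMAP):
    """Append `url` to the newest sitemap, starting a new one when it is full.

    `sitemaps` is a list of URL lists (index 0 = articlesitemap1.xml, ...).
    Returns the index of the one sitemap that changed.
    """
    if not sitemaps or len(sitemaps[-1]) >= limit:
        sitemaps.append([])
    sitemaps[-1].append(url)
    return len(sitemaps) - 1


def build_ping_url(search_engine_url, sitemap_url):
    """Build <searchengine_URL>/ping?sitemap=<url-encoded sitemap URL>."""
    base = search_engine_url.rstrip("/")
    return f"{base}/ping?sitemap={quote(sitemap_url, safe='')}"


if __name__ == "__main__":
    # Two full sitemaps and one partly filled one (tiny limit for the demo).
    sitemaps = [["/a", "/b"], ["/c", "/d"], ["/e"]]
    changed = add_url(sitemaps, "/f", limit=2)
    sitemap_url = f"https://www.example.com/articlesitemap{changed + 1}.xml"
    # searchengine.example is a placeholder, not a real endpoint.
    print(build_ping_url("https://searchengine.example", sitemap_url))
```

Only the sitemap whose index is returned needs to be regenerated and pinged; the older, full sitemaps stay untouched.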
My other thought is that, in addition to pinging Google only about the sitemaps that you have updated, you return a 304 server response for all sitemaps that have not been updated. 304 means "not modified" since the last visit. One of your challenges will be your crawl budget with Google, so why make them recrawl a sitemap they have already crawled? You may want to consider a 304 for any URL on your site that has not changed since the last time Google visited.
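For anyone wondering what the 304 decision looks like in code, here is a minimal, framework-independent Python sketch. It assumes you know when each sitemap file last changed and that the crawler sends an If-Modified-Since request header; the function names are made up for illustration, and most web servers and frameworks already handle this for static files, so you may not need to write it yourself.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def last_modified_header(last_modified):
    """Format a UTC datetime as an HTTP date for the Last-Modified header."""
    return format_datetime(last_modified, usegmt=True)


def conditional_status(last_modified, if_modified_since):
    """Return 304 if the crawler's copy is still current, otherwise 200.

    `last_modified` must be a timezone-aware datetime; `if_modified_since`
    is the raw request header value, or None if the header was not sent.
    """
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: fall back to a full response
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    return 304 if last_modified.replace(microsecond=0) <= since else 200


if __name__ == "__main__":
    changed_at = datetime(2015, 10, 21, 7, 28, 0, tzinfo=timezone.utc)
    header = last_modified_header(changed_at)
    print(header, conditional_status(changed_at, header))
```

A 304 response carries no body, so the crawler spends almost nothing on sitemaps that have not changed since its last visit.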
All of that said, as I mentioned above, I have not worked at the scale of 1M+ pages and would defer to others on the best approach. The general thought process would be the same, though: figure out the best way to use your sitemaps to manage your crawl budget with Google. Small side note: if you have 1M+ pages and any of them come from things like sorting parameters, duplicate content, or printer-friendly pages, you may want to just noindex them regardless, leave them out of the sitemap, and not allow Google to crawl them to start with.