Automated XML Sitemap for a BIG site
-
Hi,
I would like to build an automated sitemap for my site, but it has more than a million pages. It would need to be a sitemap index split by section of the site (e.g. news, video), and I'll want a news sitemap and a video sitemap as well (of course). Does anyone have a recommended way of building this, and how often would you recommend updating it? For news and video, I would like updates to be pretty immediate if possible, but the static pages don't need to be updated as often.
Thanks!
-
Another good reference:
http://googlewebmastercentral.blogspot.com/2014/10/best-practices-for-xml-sitemaps-rssatom.html
that points to how to ping:
http://www.sitemaps.org/protocol.html#submit_ping
specific search engine examples:
-
Excellent. Thank you! How would you ping Google when a sitemap is updated?
-
Yes, split them out. You will need an index sitemap, that is, a sitemap that links to other sitemaps:
https://support.google.com/webmasters/answer/75712?vid=1-635768989722115177-4024498483&rd=1
In any given sitemap you can have up to 50,000 URLs listed in it and it can be no larger than 50MB uncompressed.
https://support.google.com/webmasters/answer/35738?hl=en&vid=1-635768989722115177-4024498483
Therefore, you could have an index sitemap with links to up to 50,000 other sitemaps. Each of those sitemaps could contain links to up to 50,000 URLs on your site.
If my math is right, that would be a max of 2,500,000,000 URLs if you have 50,000 sitemaps of 50,000 URLs each.
(Interesting side note: Google allows up to 500 index sitemaps, so 2,500,000,000 URLs x 500 = 1,250,000,000,000 URLs that you can submit to Google via sitemaps.)
How you divide up your content into sitemaps depends on how you organize the pages on your site, so you are on the right track in breaking out the sitemaps by type of content. Depending on how big any one section of the site is, you may need more than one sitemap for that type, e.g. articlesitemap1.xml, articlesitemap2.xml, etc. You get the idea.
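To make the splitting concrete, here is a minimal sketch in Python using only the standard library. It assumes you already have the list of URLs for one section; the function names, the `example.com` URLs, and the `articlesitemap` file prefix are illustrative, not anything Google or Moz prescribes. It only enforces the 50,000-URL limit, not the file-size limit.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit quoted above


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive groups of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap(urls):
    """Return the XML text of one <urlset> sitemap listing `urls`."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_urls):
    """Return the XML text of a <sitemapindex> pointing at `sitemap_urls`."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = sitemap_url
    return ET.tostring(root, encoding="unicode")


def build_section(section, urls, size=MAX_URLS_PER_SITEMAP):
    """Return {file name: XML text} for one section, numbered from 1.

    The "<section>sitemap<N>.xml" naming is a hypothetical convention
    borrowed from the articlesitemap1.xml example above.
    """
    files = {}
    for n, group in enumerate(chunk_urls(urls, size), start=1):
        files[f"{section}sitemap{n}.xml"] = build_sitemap(group)
    return files
```

You would call `build_section` once per content type, write each file to disk, and then pass the public URLs of all the files to `build_index`. In a real deployment you would probably also prepend an XML declaration and add `lastmod` values, which this sketch leaves out.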
It is recommended that you ping Google every time a page in a sitemap is updated so Google will come back and recrawl the sitemap. I don't run any sites with 1M URLs, but I do run several in the tens of thousands. We break them up by type and ping whenever we update a page in that group. You need to consider your crawl budget with Google, in that it may not crawl all 1M pages in your sitemaps very often. So for a group of pages split across articlesitemap1.xml, articlesitemap2.xml, and articlesitemap3.xml, you may consider always adding your newest URLs to the most recently created sitemap (here, articlesitemap3.xml). That way you are generally pinging Google about an update to a single sitemap out of the group rather than all three.
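The "always add to the newest sitemap, then ping only that one" idea can be sketched roughly as follows. This is an illustration under assumptions: `add_url` and `ping_url` are hypothetical helper names, the sitemap groups are held in memory as plain lists, and the ping URL follows the generic `<searchengine_URL>/ping?sitemap=...` pattern from the sitemaps.org protocol page linked in this thread. The code only builds the URL and does not send a request. Google has since announced that it is retiring its sitemap ping endpoint, so check each search engine's current documentation before relying on pings.

```python
from urllib.parse import quote

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit quoted above


def add_url(groups, url, prefix="articlesitemap", limit=MAX_URLS_PER_SITEMAP):
    """Append `url` to the newest sitemap, starting a new one when it is full.

    `groups` is a list of URL lists, oldest sitemap first. Returns the file
    name of the only sitemap that changed (hypothetical naming scheme).
    """
    if not groups or len(groups[-1]) >= limit:
        groups.append([])
    groups[-1].append(url)
    return f"{prefix}{len(groups)}.xml"


def ping_url(engine_base, sitemap_url):
    """Build <searchengine_URL>/ping?sitemap=<URL-encoded sitemap URL>."""
    base = engine_base.rstrip("/")
    return f"{base}/ping?sitemap={quote(sitemap_url, safe='')}"
```

After publishing a new article you would call `add_url`, regenerate just the file it names, and request the address returned by `ping_url` for that one file. The older sitemaps in the group stay untouched.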
My other thought is that, in addition to pinging Google only about the sitemaps that you have updated, you return a 304 server response for all sitemaps that have not been updated. 304 means "not modified" since the last visit. One of your challenges will be your crawl budget with Google, so why make them recrawl a sitemap they have already crawled? You may want to consider a 304 on any URL on your site that has not changed since the last time Google visited.
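As a rough sketch of the 304 logic, the function below decides between 200 and 304 by comparing the sitemap's last modification time with the `If-Modified-Since` header a crawler may send. It assumes you track a timezone-aware UTC modification time for each sitemap and that your web framework lets you set the status code and headers yourself; `conditional_response` is an illustrative name, not part of any library. Many web servers already do this for static files, in which case you may not need custom code at all.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a request for one sitemap.

    `last_modified` must be a timezone-aware datetime in UTC.
    `if_modified_since` is the raw request header value, or None.
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers  # not modified: send headers only, no body
    return 200, headers  # modified or first visit: send the full sitemap
```

A handler would call this with the stored modification time and the incoming header, then send the sitemap body only when the status is 200.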
All of that said, as I mentioned above, I have not worked at the scale of 1M+ pages and would defer to others on the best way to approach it. The general thought process would be the same, though: figure out the best way to use your sitemaps to manage your crawl budget with Google. Small side note: if you have 1M+ pages and any of them come from things like sorting parameters, duplicate content, or printer-friendly pages, you may want to just noindex them regardless, leave them out of the sitemap, and not allow Google to crawl them to start with.