Automated XML Sitemap for a BIG site
-
Hi,
I would like to set up an automated sitemap for my site, but it has more than a million pages. It would need to be a sitemap index split across the different parts of the site (i.e. news, video), and I'll want a news sitemap and a video sitemap as well (of course). Does anyone have a recommended way of building this, and how often would you recommend updating it? For news and video, I would like updates to be pretty immediate if possible, but the static pages don't need to be updated as often.
Thanks!
-
Another good reference:
http://googlewebmastercentral.blogspot.com/2014/10/best-practices-for-xml-sitemaps-rssatom.html
which points to how to ping:
http://www.sitemaps.org/protocol.html#submit_ping
along with specific search engine examples.
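The sitemaps.org protocol linked above describes notifying an engine via an HTTP GET to its ping endpoint with the sitemap URL passed (URL-encoded) as a parameter. Here is a minimal sketch of that; the endpoint URLs below follow the `/ping?sitemap=` convention those docs describe, but verify the current endpoint for each engine before relying on it:

```python
# Sketch of the sitemaps.org ping convention: GET <engine>/ping?sitemap=<encoded sitemap URL>.
# Endpoints listed here are illustrative; confirm them against each engine's docs.
from urllib.parse import urlencode
import urllib.request

PING_ENDPOINTS = [
    "http://www.google.com/ping",
    "http://www.bing.com/ping",
]

def build_ping_url(endpoint, sitemap_url):
    """Construct the ping URL; urlencode percent-encodes the sitemap URL."""
    return endpoint + "?" + urlencode({"sitemap": sitemap_url})

def ping_sitemap(sitemap_url):
    """Notify each engine that sitemap_url has changed; returns status per endpoint."""
    results = {}
    for endpoint in PING_ENDPOINTS:
        try:
            with urllib.request.urlopen(build_ping_url(endpoint, sitemap_url), timeout=10) as resp:
                results[endpoint] = resp.status  # 200 means the ping was received
        except OSError as exc:
            results[endpoint] = str(exc)
    return results
```

You would call `ping_sitemap("http://example.com/newssitemap.xml")` from whatever job regenerates that sitemap, so the ping only fires when something actually changed.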
-
Excellent. Thank you! How would you ping Google when a sitemap is updated?
-
Yes, split them out. You will need an index sitemap: a sitemap that links to other sitemaps.
https://support.google.com/webmasters/answer/75712?vid=1-635768989722115177-4024498483&rd=1
Any given sitemap can list up to 50,000 URLs and can be no larger than 50MB uncompressed.
https://support.google.com/webmasters/answer/35738?hl=en&vid=1-635768989722115177-4024498483
Therefore, you could have an index sitemap that links to up to 50,000 other sitemaps. Each of those sitemaps could, in turn, contain links to 50,000 URLs on your site.
If my math is right, that is a maximum of 2,500,000,000 URLs: 50,000 sitemaps of 50,000 URLs each.
(Interesting side note: Google allows up to 500 index sitemaps, so 2,500,000,000 URLs x 500 = 1,250,000,000,000 URLs that you could submit to Google via sitemaps.)
How you divide up your content into sitemaps should relate to how you organize the pages on your site, so you are on the right track in breaking out the sitemaps by type of content. Depending on how big any one section of the site is, you may need more than one sitemap for that type, i.e. articlesitemap1.xml, articlesitemap2.xml, etc. You get the idea.
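The chunking above can be sketched in a few lines. This is a hypothetical example, not anyone's production code: it splits one content type's URLs into files of at most 50,000 entries and builds an index that references them, following the sitemaps.org schema. The `articlesitemap` naming just mirrors the example names used above:

```python
# Minimal sketch: split URLs into 50,000-entry sitemap files plus an index
# referencing them, per the sitemaps.org protocol limits discussed above.

MAX_URLS_PER_SITEMAP = 50_000  # protocol limit per sitemap file

def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield successive slices of at most `size` URLs."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]

def build_sitemap(urls):
    """Render one <urlset> sitemap document for a chunk of URLs."""
    entries = "".join(f"<url><loc>{u}</loc></url>" for u in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            f"{entries}</urlset>")

def build_index(sitemap_urls, lastmod):
    """Render the <sitemapindex> document that points at the child sitemaps."""
    entries = "".join(
        f"<sitemap><loc>{u}</loc><lastmod>{lastmod}</lastmod></sitemap>"
        for u in sitemap_urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            f"{entries}</sitemapindex>")
```

In practice you would run this per content type (articles, news, video), write each chunk to `articlesitemap1.xml`, `articlesitemap2.xml`, and so on, and escape any URLs containing XML-special characters.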
It is recommended that you ping Google every time a page in a sitemap is updated so Google will come back and recrawl that sitemap. I don't run any sites with 1M URLs, but I do run several in the tens of thousands; we break them up by type and ping whenever we update a page in that group. You also need to consider your crawl budget with Google: it may not crawl all 1M pages in your sitemaps very often. So for a group like articlesitemap1.xml, articlesitemap2.xml, articlesitemap3.xml, consider always adding your newest URLs to the most recently created sitemap (i.e. articlesitemap3.xml). That way you are generally pinging Google about the update of a single sitemap out of the group vs. all three.
My other thought is that, in addition to pinging Google only for the sitemaps you have updated, you return a 304 server response for all sitemaps that have not been updated. 304 means "not modified" since the last visit. One of your challenges will be your crawl budget with Google, so why make them recrawl a sitemap they have already crawled? You may even want to consider a 304 on any URL on your site that has not changed since the last time Google visited.
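The 304 idea boils down to comparing the `If-Modified-Since` header the crawler sends against the sitemap's actual last change. Here is a hedged sketch of just that decision logic (framework wiring and the `Last-Modified` response header are omitted); the function name is my own, not a standard API:

```python
# Sketch of conditional-request logic: answer 304 Not Modified when the
# crawler's cached copy is still current, saving crawl budget on unchanged files.
from email.utils import parsedate_to_datetime

def response_status(if_modified_since, last_modified):
    """Return 304 if unchanged since the client's copy, else 200.

    if_modified_since: raw header value (HTTP date string) or None.
    last_modified: timezone-aware datetime of the file's last change.
    """
    if if_modified_since is None:
        return 200  # unconditional request: serve the full response
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparsable header: fall back to a full response
    return 304 if last_modified <= client_time else 200
```

Any mainstream web server can also do this for you automatically for static files, which is often simpler than hand-rolling it; the sketch just shows what happens under the hood.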
All of that said, as I mentioned above, I have not worked at the scale of 1M+ pages and would defer to others on the best way to approach it. The general thought process would be the same, though: figure out the best way to use your sitemaps to manage your crawl budget with Google. Small side note: if you have 1M+ pages and any of them come from things like sorting parameters, duplicate content, or printer-friendly pages, you may want to just noindex them, leave them out of the sitemap, and not allow Google to crawl them in the first place.