How Do I Generate a Sitemap for a Large WordPress Site?
-
Hello Everyone!
I am working with a WordPress site that is in Google News (i.e. every day we have about 30 new URLs to add to our sitemap). The site has years of articles, resulting in about 200,000 pages. Our strategy so far has been to use a sitemap plugin that only generates the last few months of posts, but we want to improve our SEO and submit all of the URLs on our site to search engines.
The issue is that the plugins we've looked at generate the sitemap on the fly, i.e. when you request the sitemap, the plugin builds it dynamically at that moment. Our site is so large that even a single request for our sitemap.xml ties up a huge amount of server resources and takes an extremely long time (if the request doesn't time out in the process).
Does anyone have a solution?
Thanks,
Aaron
-
In my case, xml-sitemaps works extremely well. I fully understand that a DB-based solution would avoid the need to crawl, but the features that I get from xml-sitemaps are worth it.
I am running my website on a powerful dedicated server with SSDs, so perhaps that's why I'm not having any problems. I also set limits on the generator's memory consumption and activated the feature that saves temp files in case the generation fails.
-
My concern with recommending xml-sitemaps was that I've always had problems getting good, complete maps of extremely large sites. An internal CMS-based tool grabs pages straight from the database instead of having to crawl for them.
You've found that it gets you a pretty complete crawl of your 5K-page site, Federico?
-
I would go with the paid solution of xml-sitemaps.
You can cap the resources you want it to have available, and it stores temp files to avoid excessive memory consumption.
It also offers settings to create large sitemaps using a sitemap index, and there are plugins that create the news sitemap automatically by looking for changes since the last sitemap generation.
I have it running on my site with 5K pages (excluding tag pages) and it takes 10 minutes to crawl.
Then you also have plugins that create the sitemaps dynamically, like SEO by Yoast, Google XML Sitemaps, etc.
-
I think the solution to your server resource issue is to create multiple sitemaps, Aaron. Given that the sitemap protocol only allows a maximum of 50,000 URLs per sitemap, and Google News sitemaps can't contain more than 1,000 URLs, this was going to be a necessity anyway, so you may as well use these limitations to your advantage.
The sitemap protocol includes a feature called a sitemap index. It's basically a file that lists all the sitemap.xml files you've created, so the search engines can find and index them. You put it at the root of the site and then link to it in robots.txt just like a regular sitemap (you can also submit it in GWT). In fact, Yoast's SEO plugin and others already use exactly this functionality for their News add-ons.
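Just to make this concrete, a sitemap index is a small XML file along these lines (the domain and file names here are made-up placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap-news.xml</loc>
    <lastmod>2013-05-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap-archive-1.xml</loc>
  </sitemap>
</sitemapindex>
```

Then robots.txt just needs one extra line pointing at it: Sitemap: http://www.example.com/sitemap_index.xml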
In your case, you could build the News sitemap dynamically to meet its special requirements (a maximum of 1,000 URLs, and Google will only pick up posts from the last 2 days) and to ensure it's up-to-the-minute accurate, as is critical for news sites.
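For reference, each entry in a News sitemap carries an extra news: block on top of the standard sitemap tags - roughly like this (the URL and publication name are invented for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>http://www.example.com/2013/05/breaking-story/</loc>
    <news:news>
      <news:publication>
        <news:name>Example Daily News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2013-05-01T12:00:00Z</news:publication_date>
      <news:title>Breaking Story Headline</news:title>
    </news:news>
  </url>
</urlset>
```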
Then, separately, you would build additional segmented sitemaps for the existing 200,000 pages. Since these are historical pages, you could easily serve them from static files, as they wouldn't need to update once created. With static files there's no server load to serve them each time - the only ongoing load is generating the current News sitemap. (I'd actually recommend keeping each static sitemap to around 25,000 URLs to ensure search engines can crawl them easily.)
This approach would involve a bit of fiddling to set up initially, as you'd need to generate the "archive" sitemaps and then convert them to static versions. Once it's set up, though, the News sitemap would take care of itself, and once a month (or whatever interval you decide) you'd add the pages "expiring" from the News sitemap to the most recent "archive" segment. A smart programmer might even be able to automate that process - see the sketch below.
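To show what I mean, here's a rough sketch in Python of that automation - it pulls published post URLs straight from the WordPress database and writes static, segmented sitemap files plus the index. The credentials, the wp_posts table name, and the ?p=<ID> permalinks are all placeholders you'd adapt to your own install:

```python
#!/usr/bin/env python3
"""Sketch: dump published WordPress posts into static, segmented sitemaps.

Assumptions to adapt: MySQL credentials, the default wp_posts table,
and ?p=<ID> permalinks (swap in your real permalink structure).
"""
import math

import pymysql  # pip install pymysql

SITE = "http://www.example.com"  # placeholder domain
CHUNK = 25000                    # ~25K URLs per segment, as suggested above

conn = pymysql.connect(host="localhost", user="wp", password="secret", db="wordpress")
with conn.cursor() as cur:
    cur.execute(
        "SELECT ID, post_modified FROM wp_posts "
        "WHERE post_status = 'publish' AND post_type = 'post' "
        "ORDER BY post_date"
    )
    rows = cur.fetchall()
conn.close()

segment_files = []
for n in range(math.ceil(len(rows) / CHUNK)):
    chunk = rows[n * CHUNK:(n + 1) * CHUNK]
    fname = "sitemap-archive-%d.xml" % (n + 1)
    with open(fname, "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for post_id, modified in chunk:
            # Swap ?p=<ID> for your real permalink structure.
            f.write("  <url><loc>%s/?p=%d</loc><lastmod>%s</lastmod></url>\n"
                    % (SITE, post_id, modified.strftime("%Y-%m-%d")))
        f.write("</urlset>\n")
    segment_files.append(fname)

# Tie the segments together with a sitemap index.
with open("sitemap_index.xml", "w") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for fname in segment_files:
        f.write("  <sitemap><loc>%s/%s</loc></sitemap>\n" % (SITE, fname))
    f.write("</sitemapindex>\n")
```

Run something like this from cron once a month (writing the files into the web root), and the "archive" side stays static while only the News sitemap stays dynamic.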
Does this approach sound like it might solve your problem?
Paul
P.S. Since you'd already have the sitemap index capability, you could also add video and image sitemaps to your site if appropriate.
-
Have you ever tried using a web-based sitemap generator? Not sure how it would respond to your site, but at least it would be running on someone else's server, right?
Not sure what else to say honestly.
Related Questions
-
Site migration / CMS / domain / site structure change - no access to Search Console
Hi everyone, We are migrating an old site under a bigger umbrella (our main domain). As mentioned in the title, we'll perform a CMS migration, domain change, and site structure change. The major problem is that we can't get into Google Search Console for the old site. The site still has the old GA code, so Search Console verification using this method is not possible, and there is no way developers will be able to add GTM or edit DNS settings (not to bother you with the reason why). My dilemma is: 1. Do we need access to the old site's Search Console to notify Google about the domain name change, or can this be done from the Search Console of our main site (which the old site will become a part of)? 2. We are setting up 301 redirects from the old to the new domain (not a perfect 1:1 redirect). Once migration is done, does anything else need to be done with the old domain (it will become obsolete)? 3. The main site's sitemap: should I create a new sitemap with the newly added pages or update the current one? 4. If you have anything else, please add :) Thank you!
Intermediate & Advanced SEO | bgvsiteadmin
-
Can't generate a sitemap with all my pages
I am trying to generate a sitemap for my site nationalcurrencyvalues.com, but all the tools I have tried don't get all of my 70,000 HTML pages... I have found that the one at check-domains.com crawls all my pages, but when it writes the XML file most of them are gone... seemingly at random. I have used this same site before and it worked without a problem. Can anyone help me understand why this is, or point me to a utility that will map all of the pages? Kindly, Greg
Intermediate & Advanced SEO | Banknotes
-
Google does not index image sitemap
Hi, we put an image sitemap in Search Console/Webmaster Tools: http://www.sillasdepaseo.es/sillasdepaseo/sitemap-images.xml. It contains only the indexed products and all images on those pages. We also claimed the CDN in Search Console: http://media.sillasdepaseo.es/. It has been 2 weeks now; Google indexes the pages, but not the images. What can we do? Thanks in advance. Dieter Lang
Intermediate & Advanced SEO | Storesco
-
On 1 of our sites we have our company name in the H1; on our other site we have the page title in the H1 - does anyone have any advice about the best information to have in the H1, H2, and page title?
We have 2 sites that have been set up slightly differently. On 1 site we have the company name in the H1 and the product name in the page title and H2. On the other site we have the product name in the H1 and no H2. Does anyone have any advice about the best information to have in the H1 and H2?
Intermediate & Advanced SEO | CostumeD
-
Submitting XML Sitemap for large website: how big?
Hi there, I'm currently researching how I can generate an XML sitemap for a large website we run. We think that Google is having problems indexing the URLs, based on some of the messages we have been receiving in Webmaster Tools, which also shows a large drop in the total number of indexed pages. Content on this site can be accessed in two ways. On the home page, the content appears as a list of posts. Users can search for previous posts, all the way back to the first posts that were submitted. Posts are also categorised using tags, and these tags can also currently be crawled by search engines. Users can then click on tags to see articles covering similar subjects. A post could have multiple tags (e.g. SEO, inbound marketing, Technical SEO) and so can be reached in multiple ways by users, creating a large number of URLs to index. Finally, my questions are: How big should a sitemap be? What proportion of the URLs of a website should it cover? What are the best tools for creating the sitemaps of large websites? How often should a sitemap be updated? Thanks 🙂
Intermediate & Advanced SEO | RG_SEO
-
Recovering from a site migration
Hi. I've been working on http://www.alwayshobbies.com/ for a number of months. All was fine, but then we had a site migration that involved a huge number of redirects. There have been a couple of similar moves in the past. As a result, rankings have plummeted. To resolve this, we're considering letting all the old pages 404 by turning off the redirects, and removing all links to them where we can. Some key pages could have canonicals added, but basically we're looking to purge as much as possible. Does this sound like a reasonable tactic?
Intermediate & Advanced SEO | neooptic
-
Link masking in WordPress
In WordPress, I want to block Google from crawling my site through the primary navigation. I want to use anchor text links in the body and custom menus in the sidebar to take maximum advantage of the "first link counts" rule. In short, I want to obfuscate all of the links in my primary navigation without using the dreaded nofollow. I do not want to block other links to the pages - body text, custom menus, etc. This would be site-wide. I'd rather not use Ajax or any type of programming unless it's part of a plugin. Can anyone make a simple, Google-friendly suggestion?
Intermediate & Advanced SEO | CsmBill
-
WordPress site architecture conflicts and how to go about fixing them
I am attempting to figure something out with a site I'm trying to fix. The problem is that I've got two categories that are basically related keywords. I set this up when I first started doing this work and didn't know what I was doing. At one time the site was ranking on the first page for a specific term (example: 'project manager salary', posted in the category 'project manager salary'). But then we added 'project manager salary in Vermont' and similar posts for all 50 states in a different category called 'project manager salaries and benefits'. So my question is this: Would this cause some kind of keyword rank cannibalization? How do I fix this properly? Thanks! Michael
Intermediate & Advanced SEO | mtking.us_gmail.com