Sitemap Help!
-
Hi Guys,
Quick question regarding sitemaps. I am currently working on a huge site that has masses of pages.
I am looking to create a sitemap. How would you guys do this? I have looked at some tools, but they say they will only handle up to roughly 30,000 pages. The site is so large that it would be impossible to do this myself... any suggestions?
Also, how do I find out how many of my site's pages are actually indexed and how many are not?
Thank You all
Wayne
-
The problem I have with CMS-side sitemap generators is that they often build entries from whatever pages currently exist or are linked. If you have links pointing to pages that are no longer there, as is often the case with dynamic content, you'll be imposing 404s on yourself like crazy.
Just something to watch out for but it's probably your best solution.
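One way to guard against that, if you end up scripting the sitemap yourself, is to check each candidate URL before it goes into the file and drop anything that 404s. A rough PHP sketch of the idea (the URL list and the checkUrl() helper are just illustrative, not part of any particular CMS):
=== CODE STARTS HERE ===
<?php
// Sketch: keep only URLs that still respond with 200 before they go into a sitemap.
function checkUrl($url) {
    $headers = @get_headers($url);                 // first element looks like "HTTP/1.1 200 OK"
    return $headers && strpos($headers[0], '200') !== false;
}

$candidateUrls = array(                            // illustrative list, e.g. pulled from your CMS
    'http://www.example.com/page-one/',
    'http://www.example.com/removed-dynamic-page/',
);

$liveUrls = array_filter($candidateUrls, 'checkUrl');

foreach ($liveUrls as $url) {
    echo $url . "\n";                              // feed these into your sitemap generator
}
?>
=== CODE ENDS HERE ===
On a huge site you would not want to hit every URL on every run, but spot-checking the sections that change most often can save you a lot of self-inflicted 404s.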
-
Hi! With this file, you can create a Google-friendly sitemap for any given folder almost automatically. There are no limits on the number of files. Please note that the code is courtesy of @frkandris, who generously helped me out when I had a similar problem. I hope it will be as helpful to you as it was to me.
- Copy / paste the code below into a text editor.
- Edit the beginning of the file: where you see seomoz.com, put your own domain name there
- Save the file as getsitemap.php and ftp it to the appropriate folder.
- Write the full URL in your browser: http://www.yourdomain.com/getsitemap.php
- The moment you do it, a sitemap.xml will be generated in your folder
- Refresh your ftp client and download the sitemap. Make further changes to it if you wish.
=== CODE STARTS HERE ===
<?php
define('DIRBASE', './');
define('URLBASE', 'http://www.seomoz.com/');

$isoLastModifiedSite = "";
$newLine = "\n";
$indent = "  ";
if (!isset($rootUrl)) $rootUrl = "http://www.seomoz.com";

$xmlHeader = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>$newLine";
$urlsetOpen = "<urlset xmlns=\"http://www.google.com/schemas/sitemap/0.84\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.google.com/schemas/sitemap/0.84 http://www.google.com/schemas/sitemap/0.84/sitemap.xsd\">$newLine";
$urlsetValue = "";
$urlsetClose = "</urlset>$newLine";

// Escape a URL so it is safe to place inside the XML.
function makeUrlString($urlString) {
    return htmlentities($urlString, ENT_QUOTES, 'UTF-8');
}

// Convert a "Y-m-d H:i:s" date into an ISO 8601 timestamp for <lastmod>.
function makeIso8601TimeStamp($dateTime) {
    if (!$dateTime) {
        $dateTime = date('Y-m-d H:i:s');
    }
    if (is_numeric(substr($dateTime, 11, 1))) {
        $isoTS = substr($dateTime, 0, 10) . "T" . substr($dateTime, 11, 8) . "+00:00";
    } else {
        $isoTS = substr($dateTime, 0, 10);
    }
    return $isoTS;
}

// Build one <url> entry with loc, lastmod, changefreq and priority.
function makeUrlTag($url, $modifiedDateTime, $changeFrequency, $priority) {
    GLOBAL $newLine;
    GLOBAL $indent;
    GLOBAL $isoLastModifiedSite;

    $urlOpen         = "$indent<url>$newLine";
    $urlValue        = "";
    $urlClose        = "$indent</url>$newLine";
    $locOpen         = "$indent$indent<loc>";
    $locClose        = "</loc>$newLine";
    $lastmodOpen     = "$indent$indent<lastmod>";
    $lastmodClose    = "</lastmod>$newLine";
    $changefreqOpen  = "$indent$indent<changefreq>";
    $changefreqClose = "</changefreq>$newLine";
    $priorityOpen    = "$indent$indent<priority>";
    $priorityClose   = "</priority>$newLine";

    $urlTag = $urlOpen;
    $urlValue = $locOpen . makeUrlString("$url") . $locClose;
    if ($modifiedDateTime) {
        $urlValue .= $lastmodOpen . makeIso8601TimeStamp($modifiedDateTime) . $lastmodClose;
        if (!$isoLastModifiedSite) {
            // last modification of the web site as a whole
            $isoLastModifiedSite = makeIso8601TimeStamp($modifiedDateTime);
        }
    }
    if ($changeFrequency) {
        $urlValue .= $changefreqOpen . $changeFrequency . $changefreqClose;
    }
    if ($priority) {
        $urlValue .= $priorityOpen . $priority . $priorityClose;
    }
    $urlTag .= $urlValue;
    $urlTag .= $urlClose;
    return $urlTag;
}

// Recursively scan $base and return every file and folder path found beneath it.
function rscandir($base = '', &$data = array()) {
    $array = array_diff(scandir($base), array('.', '..')); // remove . and .. from the listing
    foreach ($array as $value) {                 // loop through the entries at this level
        if (is_dir($base . $value)) {            // if this is a directory
            $data[] = $base . $value . '/';      // add it to the $data array
            $data = rscandir($base . $value . '/', $data); // recurse into it
        } elseif (is_file($base . $value)) {     // else if it is a file
            $data[] = $base . $value;            // just add it to the $data array
        }
    }
    return $data;
}

// Turn a local file path into its public URL.
function kill_base($t) {
    return (URLBASE . substr($t, strlen(DIRBASE)));
}

$dir = rscandir(DIRBASE);
$a = array_map("kill_base", $dir);

foreach ($a as $key => $pageUrl) {
    $pageLastModified    = date("Y-m-d", filemtime($dir[$key]));
    $pageChangeFrequency = "monthly";
    $pagePriority        = 0.8;
    $urlsetValue .= makeUrlTag($pageUrl, $pageLastModified, $pageChangeFrequency, $pagePriority);
}

$current = "$xmlHeader$urlsetOpen$urlsetValue$urlsetClose";
file_put_contents('sitemap.xml', $current);
?>
=== CODE ENDS HERE ===
-
HTML sitemaps are good for users; having 100,000 links on a page though, not so much.
If you can do video and image sitemaps (and with a site this large you certainly should), you'll help Google get around your site.
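If you do go down the image sitemap route, the entries are just ordinary sitemap URLs with an extra namespace and an image block. A very rough PHP sketch of a single entry, assuming the standard sitemap 0.9 and Google image 1.1 namespaces (the page and image URLs are made up):
=== CODE STARTS HERE ===
<?php
// Rough sketch of one image-sitemap entry; the page and image URLs are made up.
$entry  = "  <url>\n";
$entry .= "    <loc>http://www.example.com/some-page/</loc>\n";
$entry .= "    <image:image>\n";
$entry .= "      <image:loc>http://www.example.com/images/photo.jpg</image:loc>\n";
$entry .= "    </image:image>\n";
$entry .= "  </url>\n";

// The surrounding <urlset> needs the extra namespace declared, e.g.
// <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
//         xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
echo $entry;
?>
=== CODE ENDS HERE ===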
-
Is there any way i can see pages that have not been indexed?
Not that I can tell, and using site: isn't really feasible on a site this large, I guess.
Is it more beneficial to include various site maps or just the one?
Well, the maximum for a single sitemap file is 50,000 URLs or 10MB uncompressed (you can gzip them), so if you have more than 50,000 URLs you'll have to split them up.
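If a single file starts creeping towards that 10MB limit, gzipping it is a one-liner in PHP. A minimal sketch, assuming the zlib extension is available and sitemap.xml already exists:
=== CODE STARTS HERE ===
<?php
// Minimal sketch: compress an existing sitemap.xml to sitemap.xml.gz (needs the zlib extension).
$xml = file_get_contents('sitemap.xml');
file_put_contents('sitemap.xml.gz', gzencode($xml, 9));   // 9 = maximum compression
?>
=== CODE ENDS HERE ===
You can then reference sitemap.xml.gz in your sitemap index or robots.txt just like an uncompressed file.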
-
Is there any way i can see pages that have not been indexed?
Is it more beneficial to include various site maps or just the one?
Thanks for your help!!
-
Thanks for your help
Do you feel it is important to have HTML and video sitemaps as well? How does this make a difference?
-
How big we talking?
Probably best to grab something server-side if your CMS can't do it. Check out http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators - I know Google says they haven't tested any (and neither have I), but they must have looked at them at some point.
Secondly you'll need to know how to submit multiple sitemap parts and how to break them up.
Looking at it, Amazon seems to cap theirs at 50,000 URLs per file and eBay at 40,000, so I think you should be fine with numbers around there.
Here's how to set up multiple sitemaps in the same directory - http://googlewebmastercentral.blogspot.com/2006/10/multiple-sitemaps-in-same-directory.html
Once you've submitted your sitemaps Webmaster Tools will tell you how many URLs you've submitted vs. how many they've indexed.
-
Hey,
I'm assuming you mean XML sitemaps here: you can create a sitemap index file, which essentially lists a number of sitemaps in one file (a sitemap of sitemap files, if that makes sense). See http://www.google.com/support/webmasters/bin/answer.py?answer=71453
There are automatic sitemap generators out there - if your site has categories with thousands of pages, I'd split them up and have a sitemap per category.
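A rough PHP sketch of that splitting idea, assuming you already have a flat array of URLs (the chunk size, file names and domain are illustrative):
=== CODE STARTS HERE ===
<?php
// Sketch: split a flat URL list into sitemap files of at most 50,000 URLs each,
// then write a sitemap index that points at them. Names and paths are illustrative.
$allUrls  = array(/* ... your full URL list, e.g. one array per category ... */);
$base     = 'http://www.example.com/';
$chunks   = array_chunk($allUrls, 50000);

$indexXml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
$indexXml .= "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";

foreach ($chunks as $i => $chunk) {
    $file = 'sitemap-' . ($i + 1) . '.xml';
    $xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    $xml .= "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
    foreach ($chunk as $url) {
        $xml .= "  <url><loc>" . htmlentities($url, ENT_QUOTES, 'UTF-8') . "</loc></url>\n";
    }
    $xml .= "</urlset>\n";
    file_put_contents($file, $xml);                          // one sitemap per 50,000 URLs

    $indexXml .= "  <sitemap><loc>" . $base . $file . "</loc></sitemap>\n";
}

$indexXml .= "</sitemapindex>\n";
file_put_contents('sitemap_index.xml', $indexXml);           // submit this one index file
?>
=== CODE ENDS HERE ===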
DD
-
To extract URLs, you can use Xenu's Link Sleuth. Then you must make a hierarchy of sitemaps so that all of them are crawled efficiently by Google.