Sitemap Help!
-
Hi Guys,
Quick question regarding sitemaps. I am currently working on a huge site that has masses of pages.
I am looking to create a sitemap. How would you guys do this? I have looked at some tools, but they say they will only handle up to 30,000 pages or so. The site is so large it would be impossible to do this myself... any suggestions?
Also, how do I find out how many of my site's pages are actually indexed and how many are not?
Thank You all
Wayne
-
The problem I have with CMS-side sitemap generators is that they often build entries from whatever pages they find linked. If links point to pages that no longer exist, as is often the case with dynamic content, you'll be imposing 404s on yourself like crazy.
Just something to watch out for but it's probably your best solution.
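If you do use a generator, a cheap safety net is to re-check every URL before it goes into the file. Here's a rough PHP sketch; the $urls list is a placeholder, and in practice you'd pull it from your generator's output:
=== CODE STARTS HERE ===
<?php
// Sketch: re-check each URL before it goes into the sitemap and drop
// anything that no longer resolves. $urls is a placeholder list.
$urls = array(
    'http://www.yourdomain.com/page1',
    'http://www.yourdomain.com/old-page',
);

$live = array();
foreach ($urls as $url) {
    $headers = get_headers($url);            // first element is the status line
    $status  = $headers ? $headers[0] : 'no response';
    if ($headers && strpos($status, ' 200') !== false) {
        $live[] = $url;                      // page resolves, keep it
    } else {
        echo "Dropping: $url ($status)\n";   // would have been a 404 in the sitemap
    }
}
// $live now holds only URLs that actually resolve
?>
=== CODE ENDS HERE ===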
-
Hi! With this file, you can create a Google-friendly sitemap for any given folder almost automatically, with no limits on the number of files. Please note that the code is courtesy of @frkandris, who generously helped me out when I had a similar problem. I hope it will be as helpful to you as it was to me.
- Copy / paste the code below into a text editor.
- Edit the beginning of the file: where you see seomoz.com, put your own domain name there
- Save the file as getsitemap.php and ftp it to the appropriate folder.
- Write the full URL in your browser: http://www.yourdomain.com/getsitemap.php
- The moment you do it, a sitemap.xml will be generated in your folder
- Refresh your ftp client and download the sitemap. Make further changes to it if you wish.
=== CODE STARTS HERE ===
<?php
define('DIRBASE', './');
define('URLBASE', 'http://www.seomoz.com/');

$isoLastModifiedSite = "";
$newLine = "\n";
$indent = "  ";

$xmlHeader = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>$newLine";
$urlsetOpen = "<urlset xmlns=\"http://www.google.com/schemas/sitemap/0.84\" "
    . "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" "
    . "xsi:schemaLocation=\"http://www.google.com/schemas/sitemap/0.84 "
    . "http://www.google.com/schemas/sitemap/0.84/sitemap.xsd\">$newLine";
$urlsetValue = "";
$urlsetClose = "</urlset>$newLine";

// Escape a URL for safe inclusion in XML
function makeUrlString($urlString) {
    return htmlentities($urlString, ENT_QUOTES, 'UTF-8');
}

// Convert "Y-m-d H:i:s" (or a bare date) to an ISO 8601 timestamp
function makeIso8601TimeStamp($dateTime) {
    if (!$dateTime) {
        $dateTime = date('Y-m-d H:i:s');
    }
    if (is_numeric(substr($dateTime, 11, 1))) {
        // date and time given
        $isoTS = substr($dateTime, 0, 10) . "T" . substr($dateTime, 11, 8) . "+00:00";
    } else {
        // date only
        $isoTS = substr($dateTime, 0, 10);
    }
    return $isoTS;
}

// Build one <url> entry for the sitemap
function makeUrlTag($url, $modifiedDateTime, $changeFrequency, $priority) {
    global $newLine, $indent, $isoLastModifiedSite;

    $urlTag  = "$indent<url>$newLine";
    $urlTag .= "$indent$indent<loc>" . makeUrlString($url) . "</loc>$newLine";
    if ($modifiedDateTime) {
        $urlTag .= "$indent$indent<lastmod>" . makeIso8601TimeStamp($modifiedDateTime) . "</lastmod>$newLine";
        if (!$isoLastModifiedSite) { // remember the last modification of the whole site
            $isoLastModifiedSite = makeIso8601TimeStamp($modifiedDateTime);
        }
    }
    if ($changeFrequency) {
        $urlTag .= "$indent$indent<changefreq>" . $changeFrequency . "</changefreq>$newLine";
    }
    if ($priority) {
        $urlTag .= "$indent$indent<priority>" . $priority . "</priority>$newLine";
    }
    $urlTag .= "$indent</url>$newLine";
    return $urlTag;
}

// Recursively scan a directory, collecting every file and subdirectory
function rscandir($base = '', &$data = array()) {
    $array = array_diff(scandir($base), array('.', '..')); // drop . and ..
    foreach ($array as $value) {
        if (is_dir($base . $value)) {
            $data[] = $base . $value . '/';
            $data = rscandir($base . $value . '/', $data);  // recurse into subdirectory
        } elseif (is_file($base . $value)) {
            $data[] = $base . $value;
        }
    }
    return $data;
}

// Turn a local file path into a public URL
function kill_base($t) {
    return URLBASE . substr($t, strlen(DIRBASE));
}

$dir = rscandir(DIRBASE);
$a = array_map("kill_base", $dir);

foreach ($a as $key => $pageUrl) {
    $pageLastModified    = date("Y-m-d", filemtime($dir[$key]));
    $pageChangeFrequency = "monthly";
    $pagePriority        = 0.8;
    $urlsetValue .= makeUrlTag($pageUrl, $pageLastModified, $pageChangeFrequency, $pagePriority);
}

file_put_contents('sitemap.xml', $xmlHeader . $urlsetOpen . $urlsetValue . $urlsetClose);
?>
=== CODE ENDS HERE ===
-
HTML sitemaps are good for users; having 100,000 links on a page though, not so much.
If you can do video and image sitemaps as well (and certainly with a site this large), you'll help Google get around your site.
-
"Is there any way I can see pages that have not been indexed?"
Not that I can tell, and using the site: operator isn't going to be feasible on a site this large, I guess.
"Is it more beneficial to include various sitemaps or just the one?"
Well, the maximum per sitemap file is 50,000 URLs or 10MB uncompressed (you can gzip them for transfer), so if you have more than 50,000 URLs you'll have to split them.
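If you do have to split, it's easy to script. Here's a rough sketch in plain PHP; the $allUrls array and the output filenames are placeholders, so adapt them to however you collect your URLs:
=== CODE STARTS HERE ===
<?php
// Rough sketch: split a large URL list into gzipped sitemap parts of
// at most 50,000 URLs each. $allUrls is a placeholder -- fill it from
// your CMS, database, or a crawl.
$allUrls = array(/* ... your full list of page URLs ... */);

$chunks = array_chunk($allUrls, 50000); // max 50,000 URLs per sitemap file
foreach ($chunks as $i => $chunk) {
    $xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    $xml .= "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
    foreach ($chunk as $url) {
        $xml .= "  <url><loc>" . htmlentities($url, ENT_QUOTES, 'UTF-8') . "</loc></url>\n";
    }
    $xml .= "</urlset>\n";
    // gzip each part for transfer; note the 10MB limit applies to the uncompressed size
    file_put_contents('sitemap' . ($i + 1) . '.xml.gz', gzencode($xml, 9));
}
?>
=== CODE ENDS HERE ===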
-
Is there any way I can see pages that have not been indexed?
Is it more beneficial to include various sitemaps or just the one?
Thanks for your help!!
-
Thanks for your help
Do you feel it is important to have HTML and video sitemaps as well? What difference does this make?
-
How big we talking?
Probably best grabbing something server-side if your CMS can't do it. Check out http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators - I know Google says they've not tested any (and neither have I), but they must have looked at them at some point.
Secondly you'll need to know how to submit multiple sitemap parts and how to break them up.
Looking at it, Amazon seems to cap theirs at 50,000 and eBay at 40,000, so I think you should be fine with numbers around there.
Here's how to set up multiple sitemaps in the same directory - http://googlewebmastercentral.blogspot.com/2006/10/multiple-sitemaps-in-same-directory.html
Once you've submitted your sitemaps, Webmaster Tools will tell you how many URLs you've submitted vs. how many they've indexed.
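Once the parts are uploaded, you can also ping Google directly rather than waiting for a recrawl. A quick sketch with placeholder sitemap URLs:
=== CODE STARTS HERE ===
<?php
// Notify Google of each sitemap part via its sitemap ping endpoint.
// The URLs below are placeholders -- substitute your real sitemap URLs.
$sitemaps = array(
    'http://www.yourdomain.com/sitemap1.xml.gz',
    'http://www.yourdomain.com/sitemap2.xml.gz',
);
foreach ($sitemaps as $sitemapUrl) {
    $ping = 'http://www.google.com/webmasters/tools/ping?sitemap=' . urlencode($sitemapUrl);
    file_get_contents($ping); // a 200 response means the ping was received
}
?>
=== CODE ENDS HERE ===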
-
Hey,
I'm assuming you mean XML sitemaps here: you can create a sitemap index file, which essentially lists a number of sitemaps in one file (a sitemap of sitemap files, if that makes sense). See http://www.google.com/support/webmasters/bin/answer.py?answer=71453
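To make the "sitemap of sitemaps" idea concrete, here's a minimal sketch that writes one; the part filenames and domain are placeholders:
=== CODE STARTS HERE ===
<?php
// Minimal sketch: build a sitemap index that points at the individual
// sitemap parts. Filenames and domain below are placeholders.
$parts = array('sitemap-category1.xml', 'sitemap-category2.xml');
$today = date('Y-m-d');

$xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
$xml .= "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
foreach ($parts as $part) {
    $xml .= "  <sitemap>\n";
    $xml .= "    <loc>http://www.yourdomain.com/" . $part . "</loc>\n";
    $xml .= "    <lastmod>" . $today . "</lastmod>\n";
    $xml .= "  </sitemap>\n";
}
$xml .= "</sitemapindex>\n";
file_put_contents('sitemap_index.xml', $xml);
?>
=== CODE ENDS HERE ===
You then submit just the index file, and the search engines discover each part from it.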
There are automatic sitemap generators out there - if your site has categories with thousands of pages, I'd split them up and have a sitemap per category.
DD
-
To extract URLs, you can use Xenu's Link Sleuth. Then you must make a hierarchy of sitemaps so that all sitemaps are efficiently crawled by Google.