XML Sitemap Questions For Big Site
-
Hey guys,
I have a few questions about XML sitemaps.
-
For a social site where users will be creating personal accounts, what is the best way to get those profiles indexed? Looking at how the big players handle profiles, I found that Twitter (https://twitter.com/i/directory/profiles) and Facebook (https://www.facebook.com/find-friends?ref=pf) use directory pages, while Google Plus uses an XML sitemap index (http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml).
-
If we go the XML route, how would we automatically add new profiles to the sitemap? Or is the only option to keep regenerating the sitemap with third-party software (e.g., SitemapWriter)?
-
If a user chooses not to have their profile indexed (by default it will be indexable), how do we go about deindexing that profile? Is there an automatic way of doing this?
-
Lastly, has anyone dabbled with Google Sitemap Generator (https://code.google.com/p/googlesitemapgenerator/)? If so, do you recommend it?
Thank you!
-
-
Thanks for the input, guys!
I believe Twitter and Facebook don't run sitemaps for their profiles; what they have is a directory of all their profiles (Twitter: https://twitter.com/i/directory/profiles, Facebook: https://www.facebook.com/find-friends?ref=pf), and they use that to get their profiles crawled. However, I feel the best approach is XML sitemaps, and Google Plus actually does this with their profiles (http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml). Quite frankly, I would rather follow Google than FB or Twitter... I'm just now wondering how the hell they keep that monster up to date! Do they start a new sitemap file every time one hits 50k URLs? How often do they update their sitemaps: daily, weekly, or monthly, and how?
One other question I have is whether there are any penalties for getting a lot of pages crawled at once. Meaning one day we have 10 pages and the next we have 10,000 or 50,000 pages...
Thanks again, guys!
-
I guess the way I was explaining it was for scalability on a large site. You have to remember that a site like Facebook or Twitter, with hundreds of millions of users, still has the limitation of only 50,000 URLs per sitemap file. So if they are running sitemaps, they have thousands of them (100 million profiles at 50,000 URLs per file is already 2,000 files, before counting the rest of the site).
-
I'm not a web developer, so this may be wrong, but I feel like it might be easier to just add every user to the XML sitemap and then add a noindex robots meta tag to the pages of users who don't want their profiles indexed.
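If it helps, here is a minimal sketch of what that conditional tag could look like in a PHP profile template. The $user object and its 'indexable' flag are made-up placeholders for whatever your own user model exposes:

```php
<?php
// Hypothetical profile-template fragment. $user and its
// 'indexable' flag stand in for your own user model.
if (!$user->indexable) {
    // User opted out of indexing: ask crawlers to drop this page.
    echo '<meta name="robots" content="noindex">' . "\n";
}
?>
```

One caveat: long-term, a URL probably shouldn't sit in the sitemap while carrying noindex (mixed signals), although leaving it there briefly can help crawlers re-fetch the page and see the noindex sooner.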
-
If someone asked me to design a system like that, I would build it in a few parts.
First, I would create an application that handles the sitemap minus profiles: just your TOS, sign-up pages, and whatever other static pages like that.
Then I would design a system that handles the actual profiles. It would get pretty complex and resource-intensive as the site grew, but the main idea flows like this:
Start generation, grab the user record with ID 1 from the database, check whether it is indexable (skip to the next record if not), see what pages are connected to it, write them to the XML file, then loop back and repeat with record #2, and so on (see the sketch below).
There is one concession you have to make: a single sitemap file can only hold 50,000 records, so you need to keep count of the records written to the current file and start a new file before you cross that limit.
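To make that concrete, here is a rough PHP sketch of the generator loop with the 50k rollover built in. The connection details, the table and column names ('users', 'indexable', 'username'), and the URL pattern are all placeholders for your own schema; a production version would also need to respect the protocol's file-size cap and handle errors:

```php
<?php
// Hypothetical sketch of the profile-sitemap generator described
// above. Connection details and table/column names are placeholders.
$db = new PDO('mysql:host=localhost;dbname=social', 'dbuser', 'dbpass');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$maxPerFile = 50000;  // the sitemap protocol's per-file URL limit
$fileIndex  = 1;
$count      = 0;

function openSitemap($index) {
    $f = fopen("sitemap-profiles-{$index}.xml", 'w');
    fwrite($f, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    fwrite($f, "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
    return $f;
}

function closeSitemap($f) {
    fwrite($f, "</urlset>\n");
    fclose($f);
}

$out = openSitemap($fileIndex);

// Walk every user record; profiles that opted out of indexing are
// excluded in the query rather than checked one by one in the loop.
$stmt = $db->query('SELECT id, username FROM users WHERE indexable = 1 ORDER BY id');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    if ($count >= $maxPerFile) {
        // Current file is full: close it and roll over to a new one.
        closeSitemap($out);
        $fileIndex++;
        $count = 0;
        $out = openSitemap($fileIndex);
    }
    $url = 'https://www.example.com/profile/' . rawurlencode($row['username']);
    fwrite($out, '  <url><loc>' . htmlspecialchars($url) . "</loc></url>\n");
    $count++;
}

closeSitemap($out);
```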
The way I would handle the process in total for a large site is this: sync the required tables via a daily or weekly cron to another instance (server). Call the PHP script (because that is what I use) that creates the first sitemap for the normal site-wide pages; at the end of that sitemap, put the location of the user-profile sitemap, then at the end of the script, execute the user-profile sitemap generating script. At the end of each sitemap, put the location of the next sitemap file, because as you grow it might take anywhere from 2 to 10,000 sitemap files.
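For what it's worth, the standard way the sitemap protocol ties many files together is a sitemap index file that lists every child sitemap; that is what the Google+ profiles-sitemap.xml linked above actually is, and it is the single URL you would submit to Webmaster Tools. A minimal example (the filenames are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-static.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-profiles-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-profiles-2.xml</loc>
  </sitemap>
</sitemapindex>
```

A sitemap index can itself list up to 50,000 sitemap files, which comfortably covers the 2-10,000 file range above.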
One thing I would make sure to do is get a list of crawler IP addresses and set up an allow/deny rule in your .htaccess. That way you can make the sitemaps visible only to the search engines.
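Something along these lines in Apache, as a sketch. The IP ranges below are documentation placeholders, not real crawler addresses, and since crawler IPs change over time, a reverse-DNS check on the bot's IP is the more robust way to verify Googlebot:

```apacheconf
# Hypothetical .htaccess rule: only the listed IPs may fetch the
# sitemap files. 192.0.2.0/24 and 198.51.100.0/24 are placeholder
# ranges -- swap in real, current crawler ranges and keep them updated.
<FilesMatch "^sitemap.*\.xml$">
    Order deny,allow
    Deny from all
    Allow from 192.0.2.0/24
    Allow from 198.51.100.0/24
</FilesMatch>
```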