Question about construction of our sitemap URL in robots.txt file
-
Hi all,
This is a Webmaster/SEO question. This is the sitemap URL currently in our robots.txt file:
http://www.ccisolutions.com/sitemap.xml
As you can see it leads to a page with two URLs on it. Is this a problem? Wouldn't it be better to list both of those XML files as separate line items in the robots.txt file?
Thanks!
Dana
-
Hi Jarno,
Thanks so very much! I have to say I am really liking the A1 generator. How awesome of you to follow up. I really appreciate that. Yes, if you want to send me the complete sitemap via PM that would be awesome. I certainly hope I can return the favor Happy Holidays!
Dana
-
Yes, we definitely use XENU, but I think I like Screaming Frog a bit better (although our IT Director swears it's broken).
-
Hi Christopher,
Thanks for the update. Yes, I looked at it too and other than it not being "pretty" XML, the data seemed to be okay. The one thing the A! generator did that we couldn't do was assign the values for importance and frequency specific pages are modified. If that data is accurate, that's pretty cool. I'm just not sure, although it seems it did identify pages that are modified more frequently correctly. I have 30 days to play with the free trial, but so far I think I like it a lot.
Dana
-
Dana,
It just finished scanning here are the results:
Internal Sitemap URL's:
- Listed found: 5248
- Listed deduced: 5301
- Analyzed content: 3110
- Analyzed references: 3176
External URL's:
- Listed found: 700
When i look at the overview of the result i see a number of 301 redirects, canonical redirects (when tested again the get code 200 OK). But I see a lot op pages.
When i build the sitemap it generates one file (no idea why not more then one) with all the links in the document. Google's sitemap protocol states it should be like the schema at sitemaps.org which it does. The entire protocol of sitemap.org states that a sitemap can not hold over 50,000 links and should be smaller then 10 MB in filesize.
The one I just build for you is only 1 MB and contains less url's then 50,000 and thus is it allowed by Google.
http://www.sitemaps.org/protocol.html
I can send you the entire version of the sitemap if you'd like in a personal message or through e-mail?
Hope this helps you further.
kind regards
Jarno
-
i started the scan and it's still busy:
2500 analyzed references so far.
Let you know how it turns out.
Jarno
-
Thanks Jarno. I really appreciate that. Yes, I had it selected to just scan for images (as prompted when I attempted to create an image sitemap). Let me know what you see? I am wondering if it is going around in circles?
Dana
-
Dana,
sometimes that happens. Are you scanning for images or are you scanning the site?
i will check your site tomorrow with my full version and see what it does.
Sometimes with some websites you'll get things like this but it can be loads of things. 3500 pages should not take 2 hours but only a couple of minutes. I'll check it first thing tomorrow. A1 is not installed on my laptop..
Let you know tomorrow.
Kind regards
Jarno
-
A1 Sitemap does 2 things:
1 ) It builds a file names sitemap.xml which contains all files on the website (not conform the google requirements
-
It builds a number of files listed in sitemap-index.xml for every 100 pages in one sitemap. So if you're website contains 2800 pages You'll get loads of files: 28 sitemap-1.xml etc and 1 sitemap-index.xml file. Which does meet the Google standards. Afterwards you can do 2 things in Google webmasters:
-
enter the sitemap-index.xml file as a sitemap -> Google will follow everything and come to the grand total of 2800 pages.
-
Enter each sitemap separately.-> same result but you can pinpoint better where you have a 100 pages and google only indexes fewer (can happen).
Hope this helps
-
-
Hi again Jarno,
Is it normal for A1's sitemap generator's "Scan website" function for images to take over two hours? Our site is about 3,500 URLs. So far it has under "Internal 'sitemap' URLs" Listed found: 82076 (and climbing every few seconds).
I am wondering if there isn't something wrong? (I don't have any frame of reference since I've never used it before). Thanks!
Dana
-
I'm not familiar with the A1 Sitemap generator, but regarding the sitemap protocol, there is a limit on the size of a single sitemap.xml file, so for large sites, the sitemap must be split into multiple sitemap.xml files. And, the protocol has a method for indexing these multiple sitemap.xml files. It's sort of like an index to an index. None of my sites exceed the sitemap file limit, so I don't know which sitemap generators use this approach, but I would guess many of them do.
Sitemap generators I have used include DMXZone which is a Dreamweaver plugin, and xml-sitemaps.com which includes a video sitemap generator.
Best,
ChristopherEDIT: PS: Your current sitemap looks fine to me.
-
Thanks Christopher,
Your answer took a noment to sink in, but I think I get it (I think I am coffee deprived this morning).
So, if I am using the A1 Sitemap generator that Jarno suggested, this sitemap index should automatically be generated based on the size of my generated sitemap. Is that correct?
-
Thanks Jarno,
I have downloaded and am trying the 30-day free trial of the A1 Sitemap Generator right now. Thanks for the tip. Can you comment on Christopher's remark below concerning sitemap indexes for larger sitemaps?
Can either you or Christopher give me more clarification on that. Is this what our IT director has attempted to do with the sitemap in our robots.txt file? If so, has it been done correctly?
Thanks!
-
There is a limit on the size of a sitemap and to allow for large sitemaps to be split into smaller sitemaps, the sitemap protocol includes a sitemapindex. See "Using Sitemap index files (to group multiple sitemap files)" here http://www.sitemaps.org/protocol.html. Of course, it's also possible to include the multiple sitemaps in the robot.txt file, but automated sitemap generators will likely use the sitemapindex feature so that the robots.txt file does not have to be modified as the size of the site changes.
Best,
Christopher -
Another tool to help generate a sitemap and even check broken links is called Xenu (weird logo, but good free product).
-
Dana,
the buildup of your sitemap.xml is very strange to me. I use an external program to build my sitemap.xml for me entire website.
You now have a link in your robots.txt file pointing to a sitemap which contains 2 files (both .xml) with een map of the site?
Why not use a program (free or paid like Microsys A1 (the one I use)) to build 1 sitemap.xml en point to this file from your robots.txt?
hope this helps
if you do have any questions, please let me know.
kind regards
Jarno
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Best way to create robots.txt for my website
How I can create robots.txt file for my website guitarcontrol.com ? It is having login and Guitar lessons.
Technical SEO | | zoe.wilson170 -
Will a Robots.txt 'disallow' of a directory, keep Google from seeing 301 redirects for pages/files within the directory?
Hi- I have a client that had thousands of dynamic php pages indexed by Google that shouldn't have been. He has since blocked these php pages via robots.txt disallow. Unfortunately, many of those php pages were linked to by high quality sites mulitiple times (instead of the static urls) before he put up the php 'disallow'. If we create 301 redirects for some of these php URLs that area still showing high value backlinks and send them to the correct static URLs, will Google even see these 301 redirects and pass link value to the proper static URLs? Or will the robots.txt keep Google away and we lose all these high quality backlinks? I guess the same question applies if we use the canonical tag instead of the 301. Will the robots.txt keep Google from seeing the canonical tags on the php pages? Thanks very much, V
Technical SEO | | Voodak0 -
Should I change or redirect this URL?
Happy Friday everyone! I just noticed that one of our Attorney Profile's url's is wrong. We used to have someone named "Dana Fortugno" as our Family Law attorney, but when he left, (over two years ago) we hired "Scott Finelli." The person who setup the site, just changed the information on the page not url. So instead of it saying "http://www.kempruge.com/scott-finelli-jd-llm/;" it says "http://www.kempruge.com/dana-fortugno-jd-llm/." I'm considering taking all the content on the page with the wrong url, copying it to a new page with the correct URL and 301 redirecting (what would now be a blank page) to the new page with the correct URL. Is this the best way to handle this? Also, I don't believe there are many SEO concerns regarding the pages specifically. The profile pages aren't what we rank for in any of our Family Law related keywords. I am worried about having a completely blank page that just 301 redirects as looking bad to google, but not sure if it would? As always, thank you for your time and any assistance you can provide. Ruben
Technical SEO | | KempRugeLawGroup0 -
Google is indexing blocked content in robots.txt
Hi,Google is indexing some URLs that i don't want to be indexed and also is indexing the same URLs with https. This URLs are blocked in the file robots.txt.I've tried to block this URLs through Google WebmasterTools but Google doesn't let me do it because this URL are httpsThe file robots.txt is correct so, what can i do to avoid this content to be indexed?
Technical SEO | | elisainteractive0 -
IIS 7.5 - Duplicate Content and Totally Wrong robot.txt
Well here goes! My very first post to SEOmoz. I have two clients that are hosted by the same hosting company. Both sites have major duplicate content issues and appear to have no internal links. I have checked this both here with our awesome SEOmoz Tools and with the IIS SEO Tool Kit. After much waiting I have heard back from the hosting company and they say that they have "implemented redirects in IIS7.5 to avoid duplicate content" based on the following article: http://blog.whitesites.com/How-to-setup-301-Redirects-in-IIS-7-for-good-SEO__634569104292703828_blog.htm. In my mind this article covers things better: www.seomoz.org/blog/what-every-seo-should-know-about-iis. What do you guys think? Next issue, both clients (as well as other sites hosted by this company) have a robot.txt file that is not their own. It appears that they have taken one client's robot.txt file and used it as a template for other client sites. I could be wrong but I believe this is causing the internal links to not be indexed. There is also a site map, again not for each client, but rather for the client that the original robot.txt file was created for. Again any input on this would be great. I have asked that the files just be deleted but that has not occurred yet. Sorry for the messy post...I'm at the hospital waiting to pick up my bro and could be called to get him any minute. Thanks so much, Tiff
Technical SEO | | TiffenyPapuc0 -
Changed URL of all web pages to a new updated one - Keywords still pick the old URL
A month ago we updated our website and with that we created new URLs for each page. Under "On-Page", the keywords we put to check ranking on are still giving information on the old urls of our websites. Slowly, some new URLs are popping up. I'm wondering if there's a way I can manually make the keywords feedback information from the new urls.
Technical SEO | | Champions0 -
OK to block /js/ folder using robots.txt?
I know Matt Cutts suggestions we allow bots to crawl css and javascript folders (http://www.youtube.com/watch?v=PNEipHjsEPU) But what if you have lots and lots of JS and you dont want to waste precious crawl resources? Also, as we update and improve the javascript on our site, we iterate the version number ?v=1.1... 1.2... 1.3... etc. And the legacy versions show up in Google Webmaster Tools as 404s. For example: http://www.discoverafrica.com/js/global_functions.js?v=1.1
Technical SEO | | AndreVanKets
http://www.discoverafrica.com/js/jquery.cookie.js?v=1.1
http://www.discoverafrica.com/js/global.js?v=1.2
http://www.discoverafrica.com/js/jquery.validate.min.js?v=1.1
http://www.discoverafrica.com/js/json2.js?v=1.1 Wouldn't it just be easier to prevent Googlebot from crawling the js folder altogether? Isn't that what robots.txt was made for? Just to be clear - we are NOT doing any sneaky redirects or other dodgy javascript hacks. We're just trying to power our content and UX elegantly with javascript. What do you guys say: Obey Matt? Or run the javascript gauntlet?0 -
Help needed with robots.txt regarding wordpress!
Here is my robots.txt from google webmaster tools. These are the pages that are being blocked and I am not sure which of these to get rid of in order to unblock blog posts from being searched. http://ensoplastics.com/theblog/?cat=743 http://ensoplastics.com/theblog/?p=240 These category pages and blog posts are blocked so do I delete the /? ...I am new to SEO and web development so I am not sure why the developer of this robots.txt file would block pages and posts in wordpress. It seems to me like that is the reason why someone has a blog so it can be searched and get more exposure for SEO purposes. IS there a reason I should block any pages contained in wodrpress? Sitemap: http://www.ensobottles.com/blog/sitemap.xml User-agent: Googlebot Disallow: /*/trackback Disallow: /*/feed Disallow: /*/comments Disallow: /? Disallow: /*? Disallow: /page/
Technical SEO | | ENSO
User-agent: * Disallow: /cgi-bin/ Disallow: /wp-admin/ Disallow: /wp-includes/ Disallow: /wp-content/plugins/ Disallow: /wp-content/themes/ Disallow: /trackback Disallow: /commentsDisallow: /feed0