Sitemap generator partially finding list of website URLs
-
Hi everyone,
When creating my XML sitemap here, the generator only detects a portion of the website. I am missing at least 20 URLs (blog pages + newly created resource pages). I have checked those missing URLs: all of them are indexed and none are blocked by robots.txt.
Any idea why this is happening? I need to make sure all wanted URLs are included in the XML sitemap.
Thanks!
-
Gaston,
Interestingly enough, by default the generator located only half of the URLs. I hope that one of those two fields will do the trick.
-
Hi Taysir,
I've never used that service. I suspect that the section you refer to should do the trick.
I believe you know how many URLs there are on the whole site, so you can compare how many pro-sitemaps.com finds against your own numbers.
Best of luck!
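The comparison suggested above can be sketched in a few lines of Python. This is only an illustration, not something the service provides: the `example.com` URLs, the `expected` list, and the helper names `sitemap_urls` and `missing_urls` are all made up for the example. It assumes you already have your own list of wanted URLs and the generated sitemap as text.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the set of <loc> values found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def missing_urls(expected, xml_text):
    """Return the expected URLs that the sitemap does not list, sorted."""
    return sorted(set(expected) - sitemap_urls(xml_text))


# Hypothetical data: a generated sitemap and the URLs you expect to see in it.
example_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/blog/</loc></url>
</urlset>"""

expected = [
    "https://www.example.com/",
    "https://www.example.com/blog/",
    "https://www.example.com/resources/new-page/",
]

for url in missing_urls(expected, example_sitemap):
    print("missing from sitemap:", url)
```

Note that the comparison is an exact string match, so a trailing slash or an http/https difference will show up as a missing URL.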
GR
-
Thanks for your response, Gaston. These pages are definitely not blocked by the robots.txt file. I think it is an internal linking problem. I actually subscribed to pro-sitemaps.com and was wondering if I should use this section to add the remaining URLs that are missing from the sitemap: https://cl.ly/0k0t093f0Y1T
Do you think this would do the trick?
-
Google provides a basic template, so you could build the sitemap manually if you wished, and this link has Google's list of several dozen open source sitemap generators.
If Google Webmaster Tools can't fully read the one you generated, then an alternate generator should fix that for you. Good luck!
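If you go the manual route, a minimal sketch in Python could look like the following. It assumes you already have the full list of URLs you want included; the `example.com` URLs and the `build_sitemap` helper are hypothetical, and only the required `<loc>` element of the sitemap protocol is written.

```python
import xml.etree.ElementTree as ET


def build_sitemap(urls):
    """Build a minimal XML sitemap string from an iterable of URLs."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    # dict.fromkeys drops duplicate URLs while keeping the original order.
    for url in dict.fromkeys(urls):
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & when serializing.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical URL list, including one duplicate.
pages = [
    "https://www.example.com/",
    "https://www.example.com/blog/?a=1&b=2",
    "https://www.example.com/",
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Save the result as a UTF-8 file (for example `sitemap.xml`) so that it matches the encoding named in the XML declaration.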
-
Hi Taysir!
Have you tried any other crawler to check whether those pages can be found?
I'd strongly suggest the Screaming Frog spider; the free version lets you crawl up to 500 URLs. It also has a feature to create sitemaps from the crawled URLs, though I don't know whether that is available in the free version.
Here is some info about that feature: XML sitemap generator - Screaming Frog
The usual reasons pages can't be found are:
- Poor internal linking
- Not having a sitemap (which is what you are working on)
- Blocked resources in robots.txt
- Blocked pages with robots meta tag
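The last two items on that list can be checked offline with the Python standard library. The snippet below is a rough sketch under stated assumptions: the robots.txt rules, the HTML, the `example.com` URLs, and the helper names `allowed_by_robots` and `has_noindex` are all invented for illustration, and in practice you would feed in your site's real robots.txt and page source.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html):
    """Return True if the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical robots.txt and page source.
robots_txt = "User-agent: *\nDisallow: /private/\n"
page_html = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

print(allowed_by_robots(robots_txt, "https://www.example.com/blog/"))
print(allowed_by_robots(robots_txt, "https://www.example.com/private/page"))
print(has_noindex(page_html))
```

This only covers the meta tag; a noindex sent through an `X-Robots-Tag` HTTP header would need a separate check on the response headers.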
That being said, it's completely normal for Google to have indexed pages that you can't find in an ad hoc crawl, because Googlebot could have found those pages through external links.
Also keep in mind that blocking pages in robots.txt will not by itself prevent them from being indexed, nor will it deindex them if you add blocking rules later; a noindex robots meta tag only takes effect if crawlers can still fetch the page.
Hope it helps.
Best luck
GR