Having issues crawling a website
-
We looked to use the Screaming Frog Tool to crawl this website and get a list of all meta-titles from the site, however, it only resulted with the one result - the homepage.
We then sought to obtain a list of the URLs of the site by creating a sitemap using https://www.xml-sitemaps.com/. Once again however, we just go the one result - the homepage.
There is something that seems to be restricting these tools from crawling all pages. If you anyone can shed some light as to what this could be, we'd be most appreciative.
-
That robots.txt should be fine.. its not blocking anything.
The reason the crawl is stopping on the homepage is this code:
<meta name="<a class="attribute-value">robots</a>" content="<a class="attribute-value">nofollow</a>">
Which tells bots to not follow any links on the page. Remove that and you should be good.
-
Hi,
I think it is your robots.txt file that is causing the issue. At the moment you have the following:
**User-agent: ***
Disallow:
I would recommend updating it to the following:
**User-agent: ***
Allow: /
Moz also has a good post about what else you can include in your robots.txt file for best practices etc. :
https://moz.com/learn/seo/robotstxt
Hope that helps
Thanks
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Issue with AMP pages
Hello, We have implemented AMP on our blog pages, but now some of the Web pages are also being shown like AMP pages. ( no footer and no navigation ) What could have gone wrong ? Thanks
Intermediate & Advanced SEO | | Johnroger0 -
Why is our website not generating organic traffic?
Our website is struggling to generate organic traffic over the past few months and we're lost as to why. It's a relatively new site. It's mobile friendly, all products have meta descriptions and product images have alt tags. Content is added regularly. Site speed could be better and alt tags need to be added to images on the blog but other than that we're stumped. Does anyone have any suggestions?
Intermediate & Advanced SEO | | Johnny_AppleSeed0 -
Can't crawl website with Screaming frog... what is wrong?
Hello all - I've just been trying to crawl a site with Screaming Frog and can't get beyond the homepage - have done the usual stuff (turn off JS and so on) and no problems there with nav and so on- the site's other pages have indexed in Google btw. Now I'm wondering whether there's a problem with this robots.txt file, which I think may be auto-generated by Joomla (I'm not familiar with Joomla...) - are there any issues here? [just checked... and there isn't!] If the Joomla site is installed within a folder such as at e.g. www.example.com/joomla/ the robots.txt file MUST be moved to the site root at e.g. www.example.com/robots.txt AND the joomla folder name MUST be prefixed to the disallowed path, e.g. the Disallow rule for the /administrator/ folder MUST be changed to read Disallow: /joomla/administrator/ For more information about the robots.txt standard, see: http://www.robotstxt.org/orig.html For syntax checking, see: http://tool.motoricerca.info/robots-checker.phtml User-agent: *
Intermediate & Advanced SEO | | McTaggart
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/0 -
Website not being indexed after relocation
I have a scenario where a 'draft' website was built using Google Sites, and published using a Google Sites sub domain. Consequently, the 'same' website was rebuilt and published on its own domain. So effectively there were two sites, both more or less identical, with identical content. The first website was thoroughly indexed by Google. The second website has not been indexed at all - I am assuming for the obvious reasons ie. that Google is viewing it as an obvious rip-off of the first site / duplicate content etc. I was reluctant to take down the first website until I had found an effective way to resolve this issue long-term => ensuring that in future Google would index the second 'proper' site. A permanent 301 redirect was put forward as a solution - however, believe it or not, the Google Sites platform has no facility for implementing this. For lack of an alternative solution I have gone ahead and taken down the first site. I understand that this may take some time to drop out of Google's index, however, and I am merely hoping that eventually the second site will be picked up in the index. I would sincerely appreciate an advice or recommendations on the best course of action - if any! - I can take from here. Many thanks! Matt.
Intermediate & Advanced SEO | | collectedrunning0 -
Index, Nofollow Issue
We are having on our site a couple of pages that we want the page to be indexed, however, we don't want the links on the page to be followed. For example url: http://www.printez.com/animal-personal-checks.html. We have added in our code: . Bing Webmaster Tools, is telling us the following: The pages uses a meta robots tag. Review the value of the tag to see if you are not unintentionally blocking the page from being indexed (NOINDEX). Question is, is the page using the right code as of now or do we need to do any changes in the code, if so, what should we use for them to index the page, but not to follow the links on the page? Please advise, Morris
Intermediate & Advanced SEO | | PrintEZ0 -
How would the rich snippets be treated in AJAX website?
Hi guys We have started to rewrite our website http://www.edamam.com on AJAX, and the idea is to have all the website on AJAX in the next few months. Although it would probably be difficult to index even with the Google Crawling protocol, and some other issues might appear, the engineers insist that from technology point of view this is the best way to go. We have already rewritten the internal search result pages, e.g. http://www.edamam.com/recipes/pasta and last week we set the Google Crawling protocol for AJAX to some of the individual recipe pages to test it. I'd like to ask for you opinion on whether the rich snippets we have in the search results will be affected by this change? Are there specific actions we need to take to preserve them? What other hot tips you have for dealing with AJAX on any level of the website? Thanks in advance Lily
Intermediate & Advanced SEO | | wspwsp0 -
HTML5 Nav Tag Issue - Be Aware
In checking my internal links with GWT, it is apparent that links within the nav tag in HTML5 are discounted by Google as "internal links" This could have major repercussions for designing your internal link structure for SEO purposes. I was surprised to see this result, as I have never seen it discussed. Anyone else notice this, or have any alternative views?
Intermediate & Advanced SEO | | veezer0 -
Why would the PageRank for all of our websites show the same?
The last time I checked (early this year), the PageRank on the sites I manage varied, with the highest showing as 6. It made sense as the PR6 site has loads of links and has been around for a long time, whereas the other sites hadn't. Now all of our websites are showing the same PageRank - 6, even one that has recently launched and another that has barely any links/traffic or anything to it. I didn't check the PR of that one last time (I'd be surprised if it was 2), but the sites now showing as 6 ranged from PR3 to PR6 back then. We changed server in February...so could this issue be something to do with all of the sites being stored on the same server? It doesn't seem right but it's the only thing I can think of. At the moment, the Domain Authority for these six websites ranges from 27 to 62.
Intermediate & Advanced SEO | | Alex-Harford0