Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Can't crawl website with Screaming frog... what is wrong?
-
Hello all - I've just been trying to crawl a site with Screaming Frog and can't get beyond the homepage - have done the usual stuff (turn off JS and so on) and no problems there with nav and so on- the site's other pages have indexed in Google btw.
Now I'm wondering whether there's a problem with this robots.txt file, which I think may be auto-generated by Joomla (I'm not familiar with Joomla...) - are there any issues here? [just checked... and there isn't!]
If the Joomla site is installed within a folder such as at
e.g. www.example.com/joomla/ the robots.txt file MUST be
moved to the site root at e.g. www.example.com/robots.txt
AND the joomla folder name MUST be prefixed to the disallowed
path, e.g. the Disallow rule for the /administrator/ folder
MUST be changed to read Disallow: /joomla/administrator/
For more information about the robots.txt standard, see:
http://www.robotstxt.org/orig.html
For syntax checking, see:
http://tool.motoricerca.info/robots-checker.phtml
User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/ -
For anyone wondering; The answer above by Ecommerce Site (odd name btw) works - 21-Nov-2016.
-
This is the best I could find to so someone who had a similar problem with Joomla-
"In the premium version you can slow down the crawl rate under 'speed' in the configuration. In the free lite version, you can crawl the site and then right click on any URLs with a 403 response and press 're-spider'. The server will generally then allow you to crawl these pages (and return a 200 ok response) as you're not requesting too many at once, so you might have to re-spider them individually."
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Changed all external links to 'NoFollow' to fix manual action penalty. How do we get back?
I have a blog that received a Webmaster Tools message about a guidelines violation because of "unnatural outbound links" back in August. We added a plugin to make all external links 'NoFollow' links and Google removed the penalty fairly quickly. My question, how do we start changing links to 'follow' again? Or at least being able to add 'follow' links in posts going forward? I'm confused by the penalty because the blog has literally never done anything SEO-related, they have done everything via social and email. I only started working with them recently to help with their organic presence. We don't want them to hurt themselves at all, but 'follow' links are more NATURAL than having everything as 'NoFollow' links, and it helps with their own SEO by having clean external 'follow' links. Not sure if there is a perfect answer to this question because it is Google we're dealing with here, but I'm hoping someone else has some tips that I may not have thought about. Thanks!
Intermediate & Advanced SEO | | HashtagJeff0 -
Duplicate Content through 'Gclid'
Hello, We've had the known problem of duplicate content through the gclid parameter caused by Google Adwords. As per Google's recommendation - we added the canonical tag to every page on our site so when the bot came to each page they would go 'Ah-ha, this is the original page'. We also added the paramter to the URL parameters in Google Wemaster Tools. However, now it seems as though a canonical is automatically been given to these newly created gclid pages; below https://www.google.com.au/search?espv=2&q=site%3Awww.mypetwarehouse.com.au+inurl%3Agclid&oq=site%3A&gs_l=serp.3.0.35i39l2j0i67l4j0i10j0i67j0j0i131.58677.61871.0.63823.11.8.3.0.0.0.208.930.0j3j2.5.0....0...1c.1.64.serp..8.3.419.nUJod6dYZmI Therefore these new pages are now being indexed, causing duplicate content. Does anyone have any idea about what to do in this situation? Thanks, Stephen.
Intermediate & Advanced SEO | | MyPetWarehouse0 -
Help! The website ranks fine but one of my web pages simply won't rank on Google!!!
One of our web pages will not rank on Google. The website as a whole ranks fine except just one section...We have tested and it looks fine...Google can crawl the page no problem. There are no spurious redirects in place. The content is fine. There is no duplicate page content issue. The page has a dozen product images (photos) but the load time of the page is absolutely fine. We have the submitted the page via webmaster and its fine. It gets listed but then a few hours later disappears!!! The site has not been penalised as we get good rankings with other pages. Can anyone help? Know about this problem?
Intermediate & Advanced SEO | | CayenneRed890 -
Would you rate-control Googlebot? How much crawling is too much crawling?
One of our sites is very large - over 500M pages. Google has indexed 1/8th of the site - and they tend to crawl between 800k and 1M pages per day. A few times a year, Google will significantly increase their crawl rate - overnight hitting 2M pages per day or more. This creates big problems for us, because at 1M pages per day Google is consuming 70% of our API capacity, and the API overall is at 90% capacity. At 2M pages per day, 20% of our page requests are 500 errors. I've lobbied for an investment / overhaul of the API configuration to allow for more Google bandwidth without compromising user experience. My tech team counters that it's a wasted investment - as Google will crawl to our capacity whatever that capacity is. Questions to Enterprise SEOs: *Is there any validity to the tech team's claim? I thought Google's crawl rate was based on a combination of PageRank and the frequency of page updates. This indicates there is some upper limit - which we perhaps haven't reached - but which would stabilize once reached. *We've asked Google to rate-limit our crawl rate in the past. Is that harmful? I've always looked at a robust crawl rate as a good problem to have. Is 1.5M Googlebot API calls a day desirable, or something any reasonable Enterprise SEO would seek to throttle back? *What about setting a longer refresh rate in the sitemaps? Would that reduce the daily crawl demand? We could set increase it to a month, but at 500M pages Google could still have a ball at the 2M pages/day rate. Thanks
Intermediate & Advanced SEO | | lzhao0 -
Can we have 2 websites with same business name and same business address?
I have 2 websites with same business name and same business address, and obvious 2 different domain names. I am providing the same services from 2 websites. Is this is a problem?
Intermediate & Advanced SEO | | AlexanderWhite0 -
How important is the optional <priority>tag in an XML sitemap of your website? Can this help search engines understand the hierarchy of a website?</priority>
Can the <priority>tag be used to tell search engines the hierarchy of a site or should it be used to let search engines know which priority to we want pages to be indexed in?</priority>
Intermediate & Advanced SEO | | mycity4kids0 -
Can I dissavow links on a 301'd website?
So we are performing link removal for a client on his old website (A), which is being 301 redirected to his new website (B). We have identified toxic links on site A and are removing, once complete we will undo the current 301, confirm a new GWT account for website A, and then submit the disavow report. We would then like to reapply the 301 redirect to site B while we are waiting for Google to process the disavow report, the logic being we can retain some current rankings on site B while waiting for the disavow to process on site A. Has anyone had experience with this method? I foresee some potential issues here but am interested to here from others on this. Thanks!
Intermediate & Advanced SEO | | SEOdub1 -
There's a website I'm working with that has a .php extension. All the pages do. What's the best practice to remove the .php extension across all pages?
Client wishes to drop the .php extension on all their pages (they've got around 2k pages). I assured them that wasn't necessary. However, in the event that I do end up doing this what's the best practices way (and easiest way) to do this? This is also a WordPress site. Thanks.
Intermediate & Advanced SEO | | digisavvy0