Exclude status codes in Screaming Frog
-
I have a very large ecommerce site that I'm trying to spider with Screaming Frog. The problem is that the crawl keeps hanging, even though I have turned off the high memory safeguard under Configuration.
The site has approximately 190,000 pages according to the results of a Google site: command.
- The site architecture is almost completely flat. Limiting the crawl by depth is a possibility, but it would take quite a bit of manual labor, as there are literally hundreds of directories one level below the root.
- There are many, many duplicate pages. I've been able to exclude some of them from being crawled using the exclude configuration parameters.
- There are thousands of redirects. I haven't been able to exclude those from the spider because they don't have a distinguishing character string in their URLs.
Does anyone know how to exclude files using status codes? I know that would help.
If it helps, the site is kodylighting.com.
Thanks in advance for any guidance you can provide.
-
Thanks for your help. It turned out that the exclude simply had to be set before the crawl began and could not be changed during the crawl. Hopefully this gets changed, because during a crawl you sometimes find things you want to exclude that you did not know existed beforehand.
-
Are you sure it's just on Mac? Have you tried on PC? Do you have any other rules in Include, or perhaps a conflicting rule in Exclude? Try running a single exclude rule, and also try it on another small site to test.
Also from support if failing on all fronts:
- Mac version, please make sure you have the most up to date version of the OS which will update Java.
- Please uninstall, then reinstall the spider ensuring you are using the latest version and try again.
To be sure - http://www.youtube.com/watch?v=eOQ1DC0CBNs
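Since the exclude field takes regular expressions, it can help to sanity-check a rule outside the spider before starting a long crawl. Here is a minimal Python sketch of that idea. The patterns and URLs are made up for illustration, the `excluded` helper is hypothetical, and it assumes a pattern has to match the whole URL. Python's regex flavour is also not identical to Java's, so treat this as a rough check only.

```python
import re

def excluded(url, patterns):
    """Return True if any pattern matches the entire URL (full-match is an assumption)."""
    return any(re.fullmatch(p, url) for p in patterns)

# Hypothetical exclude rules and URLs, for illustration only.
patterns = [
    r"http://www\.example\.com/clearance/.*",  # everything under one directory
    r".*\?sort=.*",                            # any URL with a sort parameter
]
urls = [
    "http://www.example.com/clearance/lamp-123",
    "http://www.example.com/lamps?sort=price",
    "http://www.example.com/lamps/lamp-456",
]

for url in urls:
    label = "EXCLUDE" if excluded(url, patterns) else "CRAWL"
    print(label, url)
```

If a rule that looks right here still does nothing in the spider, that points back at the other suggestions above (conflicting include rules, or an outdated version).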
-
Does the exclude function work on Mac? I have tried every possible way to exclude folders and have not been successful while running an analysis.
-
That's exactly the problem: the redirects are dispersed randomly throughout the site. Although the job is still running, it now appears as though there's almost a one-to-one correlation between pages and redirects on the site.
I also heard from Dan Sharp via Twitter. He said "You can't, as we'd have to crawl a URL to see the status code You can right click and remove after though!"
Thanks again, Michael. Your thoroughness and follow-through are appreciated.
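For anyone who would rather drop the redirects after the crawl than right-click them away, one option is to filter the exported CSV. This is only a sketch: the `drop_redirects` helper is hypothetical, the column names `Address` and `Status Code` are assumed and should be checked against your own export, and the inline sample stands in for a real file.

```python
import csv
import io

def drop_redirects(rows, status_column="Status Code"):
    """Yield only the rows whose status code is not in the 3xx range."""
    for row in rows:
        code = (row.get(status_column) or "").strip()
        if not code.startswith("3"):
            yield row

# Tiny inline sample standing in for an exported crawl; column names are assumed.
sample = io.StringIO(
    "Address,Status Code\n"
    "http://www.example.com/,200\n"
    "http://www.example.com/old-page,301\n"
    "http://www.example.com/missing,404\n"
)

kept = list(drop_redirects(csv.DictReader(sample)))
for row in kept:
    print(row["Address"], row["Status Code"])
```

With a real export you would pass an opened file to `csv.DictReader` instead of the sample. Note this does not make the crawl itself any lighter, since the spider still has to request each URL to learn its status code.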
-
I took another look, checked the documentation and searched online, and I don't see any way to exclude URLs from a crawl based on response codes. As I see it, you would only want to exclude by name or directory, as response codes are likely to be scattered randomly throughout a site, and excluding on them would impede a thorough crawl.
-
Thank you Michael.
You're right. I was on a 64-bit machine running a 32-bit version of Java. I updated it and the scan has been running for more than 24 hours now without hanging. So thank you.
If anyone else knows of a way to exclude files using status codes I'd still like to learn about it. So far the scan is showing me 20,000 redirected files which I'd just as soon not inventory.
-
I don't think you can filter out on response codes.
However, first I would ensure you are running the right version of Java if you are on a 64-bit machine. The 32-bit version works, but you cannot increase the memory allocation, which is why you could be running into problems. Take a look at http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ under Memory.
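A quick way to see which Java you have is to run `java -version` and look for "64-Bit" in the VM line. The Python sketch below wraps that check. The two sample outputs are illustrative text rather than output from the poster's machine, the helper names are hypothetical, and the string test is a heuristic that assumes a HotSpot-style banner.

```python
import subprocess

def is_64bit_java(version_output):
    """Heuristic: 64-bit HotSpot builds mention '64-Bit' in the VM line."""
    return "64-bit" in version_output.lower()

def installed_java_is_64bit():
    """Run `java -version` (requires Java on the PATH) and apply the heuristic."""
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return is_64bit_java(result.stderr + result.stdout)

# Illustrative banners only; real output varies by vendor and version.
sample_64 = (
    'java version "1.7.0_45"\n'
    "Java(TM) SE Runtime Environment (build 1.7.0_45-b18)\n"
    "Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)\n"
)
sample_32 = (
    'java version "1.7.0_45"\n'
    "Java(TM) SE Runtime Environment (build 1.7.0_45-b18)\n"
    "Java HotSpot(TM) Client VM (build 24.45-b08, mixed mode, sharing)\n"
)

print(is_64bit_java(sample_64))
print(is_64bit_java(sample_32))
```

Call `installed_java_is_64bit()` on the machine that runs the spider. If it comes back false on a 64-bit OS, installing the 64-bit Java is the first thing to try.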