Exclude status codes in Screaming Frog
-
I have a very large ecommerce site I'm trying to spider using Screaming Frog. The problem is that it keeps hanging, even though I have turned off the high memory safeguard under Configuration.
The site has approximately 190,000 pages according to the results of a Google site: command.
- The site architecture is almost completely flat. Limiting the crawl by depth is a possibility, but it will take quite a bit of manual labor, as there are literally hundreds of directories one level below the root.
- There are many, many duplicate pages. I've been able to exclude some of them from being crawled using the exclude configuration parameters (example patterns after this list).
- There are thousands of redirects. I haven't been able to exclude those from the spider because they don't have a distinguishing character string in their URLs.
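For context, the exclude configuration takes one regular expression per line; the patterns I've used are along these lines (illustrative and annotated, not our real paths):

    # exclude an entire duplicate section of the site
    http://www.example.com/duplicate-section/.*
    # exclude any URL carrying a query string
    .*\?.*
    # exclude URLs containing a session parameter anywhere
    .*sessionid.*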
Does anyone know how to exclude files using status codes? I know that would help.
If it helps, the site is kodylighting.com.
Thanks in advance for any guidance you can provide.
-
Thanks for your help. It was simply that the exclusion had to be set up before the crawl began and could not be changed during the crawl. Hopefully this changes in a future version, because during a crawl you sometimes find things you want to exclude that you didn't know existed beforehand.
-
Are you sure it's just on Mac? Have you tried on PC? Do you have any other rules in Include, or perhaps a conflicting rule in Exclude? Try running a single exclude rule, and also test it on another small site.
Also, from support, if it's failing on all fronts:
- On the Mac version, please make sure you have the most up-to-date version of the OS, which will update Java.
- Please uninstall, then reinstall the spider, ensuring you are using the latest version, and try again.
To be sure - http://www.youtube.com/watch?v=eOQ1DC0CBNs
-
Does the exclude function work on Mac? I have tried every possible way to exclude folders and have not been successful while running an analysis.
-
That's exactly the problem: the redirects are dispersed randomly throughout the site. Although the job's still running, it now appears there's almost a one-to-one correlation between pages and redirects on the site.
I also heard from Dan Sharp via Twitter. He said "You can't, as we'd have to crawl a URL to see the status code. You can right click and remove after though!"
Thanks again Michael. Your thoroughness and follow-through are appreciated.
-
Took another look, and also checked the documentation and other sources online, and I don't see any way to exclude URLs from a crawl based on response codes. As I see it, you would only want to exclude by name or directory anyway, as response codes are likely to be scattered randomly throughout a site, and excluding on them would impede a thorough crawl.
-
Thank you Michael.
You're right. I was on a 64-bit machine running a 32-bit version of Java. I updated it, and the scan has been running for more than 24 hours now without hanging. So thank you.
If anyone else knows of a way to exclude files using status codes, I'd still like to learn about it. So far the scan is showing me 20,000 redirected files, which I'd just as soon not inventory.
-
I don't think you can filter out on response codes.
However, first I would ensure you are running the right version of Java if you are on a 64-bit machine. The 32-bit version functions, but you cannot increase the memory allocation, which is why you could be running into problems. Take a look at http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ under Memory.
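For reference, the memory increase that guide describes is a JVM heap setting. On Windows it is typically a one-line change to the ScreamingFrogSEOSpider.l4j.ini file in the install folder (the Mac build sets the equivalent in its Info.plist); treat the exact file name and values below as assumptions to verify against the guide for your version, and note that anything above roughly 1 GB requires 64-bit Java (java -version prints "64-Bit" in its output when a 64-bit JVM is installed). The # lines below are annotation, not part of the file:

    # before -- the shipped default heap cap:
    -Xmx512M
    # after -- allow the spider up to 4 GB of heap:
    -Xmx4g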
Related Questions
-
Redirect chain error free htaccess code for website
I want to redirect my domain, example.com, to https://www.example.com. Can anyone help me with htaccess code that is free of redirect chain errors? I implemented this htaccess code on the website, and Moz now shows a redirect chain error for my site:

    RewriteCond %{HTTP_HOST} !=""
    RewriteCond %{THE_REQUEST} ^[A-Z]+\s//+(.*)\sHTTP/[0-9.]+$ [OR]
    RewriteCond %{THE_REQUEST} ^[A-Z]+\s(.*/)/+\sHTTP/[0-9.]+$
    RewriteRule .* http://%{HTTP_HOST}/%1 [R=301,L]

Technical SEO | truehab
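For what the question asks (example.com straight to https://www.example.com in one hop), a commonly used pattern looks like the sketch below — assuming Apache with mod_rewrite enabled and an SSL certificate that covers both hostnames:

    RewriteEngine On
    # send http://, https:// non-www, and http://www traffic
    # to https://www.example.com in a single 301 hop
    RewriteCond %{HTTPS} off [OR]
    RewriteCond %{HTTP_HOST} !^www\. [NC]
    RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

-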
Is my knowledge graph code wrong?
I inserted the Knowledge Graph code on our site last week and am still not seeing the knowledge graph in our search results. Is something incorrect with my code?

    <script type="application/ld+json">
    {
      "@context" : "http://schema.org",
      "@type" : "Organization",
      "name" : "IssueTrak",
      "url" : "http://www.issuetrak.com/",
      "sameAs" : [
        "http://www.facebook.com/issuetrak",
        "http://www.twitter.com/issuetrak",
        "http://plus.google.com/google.com/+Issuetrak"
      ]
    }
    </script>

I suspect it is the alignment of the "{" and "}", but others in the company say that doesn't matter. Any other explanations for why the KG isn't showing in the results? Thanks. I did test it with Google's Structured Data Testing Tool and got the "all's good."
Technical SEO | Nobody1596916721222
-
Site Wide Text to Code Ratio Tool
Does anyone know of a free or paid tool which provides the text to code ratio for all pages on a site? Something like Screaming Frog but with all the ratios for each page. At the moment we are checking key landing pages individually.
Technical SEO | Dave_Schulhof
-
Content too buried in source code?
Our team is working on a refresh/redesign, and I'm wondering if there's a quantifiable way of determining how high our meta data, H1, and paragraph copy should appear in the source code, or even whether I should be concerned with that. Our navigation will likely have dozens of links (we're going to keep it under 100), and this doesn't even factor in the design elements. I am concerned about the content being buried. Are these the kind of concerns I should be having? Is there a measurable way to avoid it?
Technical SEO | SSFCU
-
Google Schema Code for Organisation
I've created the Google Schema code for an organisation. Should this go in the template HTML so it would be shown on all pages or just on the home page?
Technical SEO | CharlBritton
-
Non-www to www code not working in htaccess
I use the same rewrite code on every site to consolidate the non-www and www versions. All sites are in Joomla, on Linux hosting. The code is as follows:

    RewriteEngine On
    rewritecond %{http_host} ^site.com/
    rewriteRule ^(.*) http://www.site.com/$1 [R=301,L]

Immediately following this code, I also rewrite /index.php to /. The thing is, I can get index.php to rewrite correctly, but the non-www won't rewrite to www. I use the same code on every site, but for some reason it's not working here. Are there common issues that interfere with rewriting non-www to www in htaccess that could be interfering with the code I'm using above?
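One common gotcha with this exact snippet: %{HTTP_HOST} contains only the hostname, never a path, so the trailing slash in ^site.com/ can never match. A minimal working sketch, assuming Apache with mod_rewrite enabled:

    RewriteEngine On
    # HTTP_HOST is just the hostname -- no trailing slash to match
    RewriteCond %{HTTP_HOST} ^site\.com$ [NC]
    RewriteRule ^(.*)$ http://www.site.com/$1 [R=301,L]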
Technical SEO | Caleone
-
URL with tracking code
Hi there,
At the company I am currently working for, we have a problem with shortcut URLs that have tracking in them. We send a lot of brochures with a shortcut URL that redirects to the event's page with tagging. For example:
The real URL is: http://www.sbo.nl/cursussen/schoolleider-primair-onderwijs/
The URL in the brochure is: www.sbo.nl/schoolleiderpo
This then redirects to: http://www.sbo.nl/cursussen/schoolleider-primair-onderwijs/?utm_source=direct&utm_medium=shortcut&utm_campaign=schoolleiderpo
Now we can measure the effect of the brochure on online traffic and conversion. This is great, but a lot of websites link to that shortcut URL, and if the event is taken offline, the links to it generate a 404. We now have about 800 backlinks that generate this 404, and I want to fix it. Another big problem, I think, is the possibility that Google will index this URL with the tagging. I have two options:
1. Look at all the URLs with that 404 and redirect them with a 301 to the best page (as sketched below).
2. Create the shortcut on the most suitable page, but then I will get the tagging in the URL, and I guess Google will see this as duplicate content.
It is possible that the shortcut URL will be used again in the future. What would you suggest as the best solution?
Technical SEO | RuudHeijnen
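For option 1, each retired shortcut can be mapped with a plain mod_alias rule, one line per dead URL — a sketch using the URLs from the question, assuming Apache:

    # 301 the retired shortcut to the most relevant live page
    Redirect 301 /schoolleiderpo http://www.sbo.nl/cursussen/schoolleider-primair-onderwijs/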