Exclude status codes in Screaming Frog
-
I have a very large ecommerce site I'm trying to spider using Screaming Frog. The problem is that the crawl keeps hanging, even though I have turned off the high memory safeguard under Configuration.
The site has approximately 190,000 pages according to the results of a Google site: command.
- The site architecture is almost completely flat. Limiting the crawl by depth is a possibility, but it would take quite a bit of manual labor, as there are literally hundreds of directories one level below the root.
- There are many, many duplicate pages. I've been able to exclude some of them from being crawled using the exclude configuration parameters.
- There are thousands of redirects. I haven't been able to exclude those from the spider because they don't have a distinguishing character string in their URLs.
Does anyone know how to exclude files using status codes? I know that would help.
If it helps, the site is kodylighting.com.
Thanks in advance for any guidance you can provide.
-
Thanks for your help. It was simply that the exclude had to be set before the crawl began and could not be changed during the crawl. Hopefully this changes, because sometimes during a crawl you find things you want to exclude that you didn't know existed beforehand.
-
Are you sure it's just on Mac? Have you tried on PC? Do you have any other rules in include, or perhaps a conflicting rule in exclude? Try running a single exclude rule, and also try another small site as a test.
Also from support if failing on all fronts:
- Mac version, please make sure you have the most up to date version of the OS which will update Java.
- Please uninstall, then reinstall the spider ensuring you are using the latest version and try again.
To be sure - http://www.youtube.com/watch?v=eOQ1DC0CBNs
-
Does the exclude function work on Mac? I have tried every possible way to exclude folders and have not been successful while running an analysis.
-
That's exactly the problem: the redirects are dispersed randomly throughout the site. Although the job's still running, it now appears as though there's almost a 1-to-1 correlation between pages and redirects on the site.
I also heard from Dan Sharp via Twitter. He said "You can't, as we'd have to crawl a URL to see the status code You can right click and remove after though!"
Thanks again, Michael. Your thoroughness and follow-through are appreciated.
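
Since status codes are only known after a URL has been crawled, one workaround is to drop the redirects from the exported results afterwards. Below is a minimal Python sketch of that idea. It assumes the crawl was exported to CSV with `Address` and `Status Code` columns; the column names, the sample rows, and the `drop_redirects` helper are illustrative assumptions, not anything confirmed in this thread.

```python
import csv
import io


def drop_redirects(rows, status_field="Status Code"):
    """Yield only the rows whose status code is outside the 3xx range."""
    for row in rows:
        code = (row.get(status_field) or "").strip()
        if code.isdigit() and 300 <= int(code) < 400:
            continue  # skip redirected URLs
        yield row


# Hypothetical sample standing in for a real crawl export.
sample_export = """Address,Status Code
http://www.example.com/,200
http://www.example.com/old-page,301
http://www.example.com/lamp,200
http://www.example.com/moved,302
"""

kept = list(drop_redirects(csv.DictReader(io.StringIO(sample_export))))
for row in kept:
    print(row["Address"], row["Status Code"])
```

For a real export, replace the `io.StringIO(sample_export)` part with an opened file and adjust `status_field` to whatever header the export actually uses.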
-
I took another look, and also checked the documentation and searched online, and I don't see any way to exclude URLs from the crawl based on response codes. As I see it, you would only want to exclude by name or directory, since response codes are likely to be scattered randomly throughout a site, and excluding on them would impede a thorough crawl.
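
Excluding by name or directory comes down to writing regular expressions that match the URLs you want skipped. Before starting a long crawl, it can help to check candidate patterns against a few known URLs. The sketch below does that in Python, assuming each pattern must match the whole URL. The patterns, the URLs, and the `is_excluded` helper are hypothetical, and Python's `re` module is only an approximation of the regex engine the spider itself uses.

```python
import re

# Hypothetical exclude rules: one directory and one query parameter.
exclude_patterns = [
    r"http://www\.example\.com/clearance/.*",
    r".*\?sort=.*",
]


def is_excluded(url, patterns=exclude_patterns):
    """Return True if any pattern matches the entire URL."""
    return any(re.fullmatch(pattern, url) for pattern in patterns)


# Hypothetical URLs to check the rules against.
test_urls = [
    "http://www.example.com/clearance/floor-lamp.html",
    "http://www.example.com/lighting/?sort=price",
    "http://www.example.com/lighting/pendant.html",
]

for url in test_urls:
    print("exclude" if is_excluded(url) else "crawl", url)
```

Testing one rule at a time this way makes it easier to spot a pattern that matches nothing, or one that matches far more than intended.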
-
Thank you Michael.
You're right. I was on a 64-bit machine running a 32-bit version of Java. I updated it, and the scan has now been running for more than 24 hours without hanging. So thank you.
If anyone else knows of a way to exclude files using status codes I'd still like to learn about it. So far the scan is showing me 20,000 redirected files which I'd just as soon not inventory.
-
I don't think you can filter out on response codes.
However, first I would make sure you are running the right version of Java if you are on a 64-bit machine. The 32-bit version works, but you cannot increase the memory allocation with it, which could be why you are running into problems. Take a look at http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ under Memory.
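
One quick way to tell which Java is installed is to look at the output of `java -version`, which mentions "64-Bit" in the VM line for a 64-bit install. The Python sketch below wraps that check. The helper names and the two sample strings are illustrative assumptions, and the exact wording of the version output can vary by vendor and release.

```python
import subprocess


def is_64_bit_java(version_output):
    """Return True if `java -version` output reports a 64-bit VM."""
    return "64-bit" in version_output.lower()


def installed_java_version_output():
    """Run `java -version` and return its text (Java prints it to stderr)."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True
    )
    return result.stderr + result.stdout


# Illustrative VM lines, not captured from any machine in this thread.
sample_64 = "Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)"
sample_32 = "Java HotSpot(TM) Client VM (build 24.51-b03, mixed mode, sharing)"

print(is_64_bit_java(sample_64))
print(is_64_bit_java(sample_32))
```

On a machine with Java on the path, `is_64_bit_java(installed_java_version_output())` performs the same check against the real install.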