Site Spider / Crawler / Scraper Software
-
Short of coding up my own web crawler, does anyone know of, or have experience with, a good piece of software for crawling all the pages on a single domain?
(And potentially on linked domains one hop away...)
This could be either server or desktop based.
Useful capabilities would include:
- Scraping (XPath parameters)
- Number of clicks from homepage (site architecture)
- HTTP headers
- Multi-threading
- Use of proxies
- Robots.txt compliance option
- CSV output
- Anything else you can think of...
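Of the capabilities above, the "clicks from homepage" metric is just breadth-first search depth over the site's link graph. A minimal sketch of that crawl logic in plain Python (standard library only; the function name and the in-memory link graph are illustrative stand-ins for real HTTP fetching and link extraction):

```python
from collections import deque
from urllib.parse import urlparse

def crawl_depths(link_graph, start_url, allowed_hosts):
    """Breadth-first crawl: depth = number of clicks from the homepage.

    link_graph maps each URL to the URLs it links to; in a real crawler
    this lookup would be an HTTP fetch plus link extraction.
    """
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in link_graph.get(url, []):
            host = urlparse(link).netloc
            if host not in allowed_hosts:   # stay on the target domain(s)
                continue
            if link not in depths:          # first visit = shortest click path
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths

# Example: homepage links to /a and /b; /a links to /c and an external site.
graph = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c", "http://other.com/x"],
}
print(crawl_depths(graph, "http://example.com/", {"example.com"}))
# {'http://example.com/': 0, 'http://example.com/a': 1, 'http://example.com/b': 1, 'http://example.com/c': 2}
```

Because it's breadth-first, each page's recorded depth is guaranteed to be the shortest click path from the homepage, which is what you want for site-architecture reporting. Widening `allowed_hosts` to include linked domains would handle the "1 hop away" case.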
Perhaps an opportunity for an additional SEOmoz tool here, since they do it already!
Cheers!
Note:
I've had a look at:
- Nutch: http://nutch.apache.org/
- Heritrix: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
- Scrapy: http://doc.scrapy.org/en/latest/intro/overview.html
- Mozenda (does scraping but doesn't appear extensible...)
Any experience/ preferences with these or others?
-
Hey Alex,
Screaming Frog is hands-down the best desktop crawling software, and it has most of what you're looking for.
-Mike