Site Spider/ Crawler/ Scraper Software
Short of coding up your own web crawler, does anyone know of, or have experience with, a good piece of software that runs through all the pages on a single domain?
(And potentially on linked domains 1 hop away...)
This could be either server or desktop based.
Useful capabilities would include:
- Scraping (XPath parameters)
- Number of clicks from homepage (site architecture)
- HTTP headers
- Multithreading
- Use of proxies
- Robots.txt compliance option
- CSV output
- Anything else you can think of...
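For anyone weighing up the roll-your-own route, here is a rough sketch of how a few of the items above fit together: a breadth-first crawl of one domain that records clicks from the homepage, one HTTP header, robots.txt compliance, and CSV output. It uses only the Python standard library. The names `crawl`, `to_csv`, `LinkParser`, and the `fetch` hook are made up for illustration, and XPath scraping, multithreading, and proxies are left out (a proxy-aware HTTP client could be plugged in through `fetch`). Treat it as a starting point under those assumptions, not a finished tool.

```python
import csv
import io
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, robots_txt="", user_agent="*", max_pages=100):
    """Breadth-first crawl of a single domain.

    fetch(url) is a hypothetical hook returning (status, headers, html),
    so any HTTP client (with or without a proxy) can be swapped in.
    Returns rows of (url, clicks_from_home, status, content_type).
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    domain = urlparse(start_url).netloc

    depth = {start_url: 0}  # URL -> number of clicks from the start page
    queue = deque([start_url])
    rows = []
    while queue and len(rows) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # robots.txt compliance: skip disallowed paths
        status, headers, html = fetch(url)
        rows.append((url, depth[url], status, headers.get("Content-Type", "")))

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link = urldefrag(urljoin(url, href)).url
            # Stay on the same domain and visit each URL once.
            if urlparse(link).netloc == domain and link not in depth:
                depth[link] = depth[url] + 1
                queue.append(link)
    return rows


def to_csv(rows):
    """Serialises crawl rows to a CSV string with a header line."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["url", "clicks_from_home", "status", "content_type"])
    writer.writerows(rows)
    return out.getvalue()
```

Because the crawl is breadth-first, the first time a URL is discovered is also the shortest click path to it, which is what makes the depth figure meaningful for site architecture.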
Perhaps an opportunity for an additional SEOmoz tool here, since they do it already!
Cheers!
Note: I've had a look at:
- Nutch: http://nutch.apache.org/
- Heritrix: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
- Scrapy: http://doc.scrapy.org/en/latest/intro/overview.html
- Mozenda (does scraping but doesn't appear extensible)
Any experience/ preferences with these or others?
-
Hey Alex,
Screaming Frog is hands down the best desktop crawling software and it has most of what you are looking for.
-Mike