Site Spider/ Crawler/ Scraper Software
Short of coding up my own web crawler, does anyone know of, or have any experience with, a good piece of software that runs through all the pages on a single domain (and potentially on linked domains one hop away)?
This could be either server or desktop based.
Useful capabilities would include:
- Scraping (XPath parameters)
- Number of clicks from homepage (site architecture)
- HTTP headers
- Multithreading
- Use of proxies
- Robots.txt compliance option
- CSV output
- Anything else you can think of...
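For anyone who does end up rolling their own, here is a minimal sketch of a few items on that list (single-domain crawl, clicks from the homepage, HTTP status and headers, an optional robots.txt-style filter, CSV output) using only the Python standard library. The names `crawl`, `to_csv`, `http_fetch` and the `allowed` hook are made up for illustration. It leaves out XPath scraping, multithreading and proxies, so treat it as a starting point rather than a replacement for a real tool.

```python
import csv
import io
import urllib.error
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def http_fetch(url):
    """Illustrative fetcher: returns (status, headers, html) for a URL."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status, resp.headers, body
    except urllib.error.HTTPError as err:
        # Error responses still carry a status code and headers.
        return err.code, err.headers, ""


def crawl(start_url, fetch=http_fetch, allowed=lambda url: True, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    Breadth-first order means the depth recorded for a page is the
    smallest number of clicks needed to reach it from start_url.
    `allowed` is a hypothetical hook, e.g. a robots.txt check built
    on urllib.robotparser.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    rows = []
    while queue and len(rows) < max_pages:
        url, depth = queue.popleft()
        status, headers, html = fetch(url)
        rows.append({
            "url": url,
            "depth": depth,
            "status": status,
            "content_type": headers.get("Content-Type", ""),
        })
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc != host or link in seen:
                continue
            if allowed(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return rows


def to_csv(rows):
    """Serialises crawl rows to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["url", "depth", "status", "content_type"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Calling `to_csv(crawl("http://example.com/"))` would crawl that host with the default fetcher. Passing your own `fetch` function is the place to add proxies, custom headers or rate limiting.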
Perhaps an opportunity for an additional SEOmoz tool here, since they do it already!
Cheers!
Note: I've had a look at:
- Nutch: http://nutch.apache.org/
- Heritrix: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
- Scrapy: http://doc.scrapy.org/en/latest/intro/overview.html
- Mozenda (does scraping but doesn't appear extensible)
Any experience/ preferences with these or others?
-
Hey Alex,
Screaming Frog is hands down the best desktop crawling software, and it has most of what you are looking for.
-Mike