Site Spider/ Crawler/ Scraper Software
-
Short of coding up your own web crawler, does anyone know of, or have any experience with, a good bit of software to run through all the pages on a single domain?
(And potentially on linked domains 1 hop away...)
This could be either server or desktop based.
Useful capabilities would include:
- Scraping (XPath parameters)
- Number of clicks from homepage (site architecture)
- HTTP headers
- Multi-threading
- Use of proxies
- Robots.txt compliance option
- CSV output
- Anything else you can think of...
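For anyone weighing up the "code your own" route, here is a rough sketch of how a few of the items above fit together: a breadth-first crawl that stays on one host, records clicks from the homepage, checks robots.txt rules, and writes CSV. It uses only the Python standard library. The names `crawl`, `to_csv`, `LinkParser` and the caller-supplied `fetch` function are made up for illustration, and threading, proxies and XPath extraction are left out (a library such as lxml would be one option for XPath).

```python
import csv
import io
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlsplit
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, robots_txt="", agent="*", max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    fetch(url) is a hypothetical caller-supplied function returning
    (status_code, headers_dict, html_text), so the HTTP layer
    (proxies, retries, threading) stays outside this sketch.
    Returns rows of (url, clicks_from_home, status, content_type).
    """
    host = urlsplit(start_url).netloc
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())  # rules passed in as text

    depth = {start_url: 0}  # URL -> clicks from the start page
    queue = deque([start_url])
    rows = []
    while queue and len(rows) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # robots.txt compliance: skip disallowed paths
        status, headers, html = fetch(url)
        rows.append((url, depth[url], status, headers.get("Content-Type", "")))

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link = urldefrag(urljoin(url, href)).url
            # BFS reaches each page first via its shortest click path.
            if urlsplit(link).netloc == host and link not in depth:
                depth[link] = depth[url] + 1
                queue.append(link)
    return rows


def to_csv(rows):
    """Serialises crawl rows to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["url", "clicks_from_home", "status", "content_type"])
    writer.writerows(rows)
    return out.getvalue()
```

Because the queue is processed breadth-first, the depth recorded for each URL is the minimum number of clicks from the start page. Passing `fetch` in as a parameter is just a design convenience: it lets you swap in a real HTTP client, a proxy-aware one, or a fake for testing without touching the crawl logic.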
Perhaps an opportunity for an additional SEOmoz tool here, since they do it already!
Cheers!
Note:
I've had a look at:
- Nutch: http://nutch.apache.org/
- Heritrix: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
- Scrapy: http://doc.scrapy.org/en/latest/intro/overview.html
- Mozenda (does scraping but doesn't appear extensible)
Any experience/ preferences with these or others?
-
Hey Alex,
Screaming Frog is hands down the best desktop crawling software and it has most of what you are looking for.
-Mike