Site Spider/ Crawler/ Scraper Software
Short of coding up your own web crawler, does anyone know of, or have any experience with, a good piece of software to run through all the pages on a single domain?
(And potentially on linked domains 1 hop away...)
This could be either server or desktop based.
Useful capabilities would include:
- Scraping (XPath parameters)
- Number of clicks from homepage (site architecture)
- HTTP headers
- Multi-threading
- Use of proxies
- Robots.txt compliance option
- CSV output
- Anything else you can think of...
Perhaps an opportunity for an additional SEOmoz tool here, since they do it already!
Cheers!
Note:
I've had a look at:
- Nutch: http://nutch.apache.org/
- Heritrix: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
- Scrapy: http://doc.scrapy.org/en/latest/intro/overview.html
- Mozenda (does scraping but doesn't appear extensible)
Any experience/ preferences with these or others?
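For comparison, here is roughly what the "code your own" route might look like for a few of the items in the list above: a breadth-first crawl limited to one domain, click depth from the homepage, one HTTP header, and CSV output. This is only a minimal sketch using the Python standard library. The names `crawl`, `to_csv`, `fetch`, `allowed`, `SITE` and `fake_fetch` are all made up for illustration, and `fake_fetch` serves a tiny in-memory site so the example runs without touching the network. Multi-threading, proxies and XPath scraping are deliberately left out.

```python
import csv
import io
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100, allowed=lambda url: True):
    """Breadth-first crawl of a single domain.

    fetch(url) is a hypothetical callable returning (status, headers, html).
    allowed(url) is a hypothetical hook for a robots.txt check.
    Because the crawl is breadth-first, the recorded depth is the minimum
    number of clicks from start_url.
    """
    domain = urlparse(start_url).netloc
    depth = {start_url: 0}
    queue = deque([start_url])
    rows = []
    while queue and len(rows) < max_pages:
        url = queue.popleft()
        if not allowed(url):
            continue
        status, headers, html = fetch(url)
        rows.append((url, depth[url], status, headers.get("Content-Type", "")))
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == domain and link not in depth:
                depth[link] = depth[url] + 1
                queue.append(link)
    return rows


def to_csv(rows):
    """Serialise crawl rows to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["url", "clicks_from_home", "status", "content_type"])
    writer.writerows(rows)
    return out.getvalue()


# Stand-in for a real site (assumption, so the sketch runs offline).
SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="http://other.com/">x</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": "",
}


def fake_fetch(url):
    return 200, {"Content-Type": "text/html"}, SITE.get(url, "")


if __name__ == "__main__":
    print(to_csv(crawl("http://example.com/", fake_fetch)))
```

In real use, `fetch` would wrap an HTTP client, and `allowed` could wrap `urllib.robotparser.RobotFileParser.can_fetch` to make robots.txt compliance optional. Even this much leaves out most of the wish list, which is a fair argument for an off-the-shelf tool.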
-
Hey Alex,
Screaming Frog is hands down the best desktop crawling software and it has most of what you are looking for.
-Mike