Site Spider/ Crawler/ Scraper Software
-
Short of coding up your own web crawler - does anyone know of / have any experience with a good piece of software to run through all the pages on a single domain?
(And potentially on linked domains one hop away...)
This could be either server or desktop based.
Useful capabilities would include:
- Scraping (XPath parameters)
- Number of clicks from homepage (site architecture)
- HTTP headers
- Multi-threading
- Use of proxies
- Robots.txt compliance option
- CSV output
- Anything else you can think of...
Perhaps an opportunity for an additional SEOmoz tool here, since they do crawling already!
Cheers!
Note:
I've had a look at:
- Nutch: http://nutch.apache.org/
- Heritrix: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
- Scrapy: http://doc.scrapy.org/en/latest/intro/overview.html
- Mozenda (does scraping but doesn't appear extensible...)
Any experience/ preferences with these or others?
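To give an idea of what I'm after, here's a rough, untested Scrapy sketch of the kind of crawl I mean; the domain, XPath selectors and output fields are just placeholders, and I'd much rather find existing software that does this for me:

```python
# Rough sketch (untested) of a single-domain crawl with Scrapy.
# The domain, XPath selectors and output fields are placeholders.
import scrapy
from scrapy.crawler import CrawlerProcess


class SiteSpider(scrapy.Spider):
    name = "site_spider"
    allowed_domains = ["example.com"]            # stay within a single domain
    start_urls = ["http://www.example.com/"]

    custom_settings = {
        "ROBOTSTXT_OBEY": True,                  # robots.txt compliance option
        "CONCURRENT_REQUESTS": 16,               # concurrent fetching
        "FEEDS": {"crawl.csv": {"format": "csv"}},  # CSV output
        # A proxy can be set per-request via request.meta["proxy"] if needed.
    }

    def parse(self, response):
        yield {
            "url": response.url,
            "status": response.status,                          # HTTP status code
            "content_type": response.headers.get("Content-Type", b"").decode(),
            "title": response.xpath("//title/text()").get(),    # XPath scraping
            "depth": response.meta.get("depth", 0),             # clicks from homepage
        }
        # Follow internal links; allowed_domains keeps the crawl on-site.
        for href in response.xpath("//a/@href").getall():
            yield response.follow(href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(SiteSpider)
    process.start()
```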
-
Hey Alex,
Screaming Frog is hands down the best desktop crawling software and it has most of what you are looking for.
-Mike
Related Questions
-
Wordpress SEO / Ecommerce Site with Multiple Domains (International) & Canonical URLs
Hi, I have an ecommerce site with an integrated WordPress instance. I want to have one WordPress site that outputs exactly the same content to two domains (NZ & Australia sites), but one will have the canonical URL. So: would I use rel="alternate" hreflang="en-nz"? I want the same content to rank well for each country and not be penalised for duplicate content. Ideas?
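For concreteness, this is the kind of reciprocal markup I'm picturing in the head of each page on both domains; the domains and path below are placeholders, and the snippet only sketches the pattern rather than how WordPress would actually output it:

```python
# Rough sketch (placeholder domains/path): the reciprocal hreflang tags each
# page would carry on BOTH the NZ and AU domains, one entry per alternate
# version including itself. How WordPress actually emits them (plugin, theme
# hook, etc.) is a separate question.
alternates = {
    "en-nz": "https://www.example.co.nz",
    "en-au": "https://www.example.com.au",
}

def hreflang_block(path: str) -> str:
    """Build the <link rel="alternate" hreflang="..."> block for one page path."""
    return "\n".join(
        f'<link rel="alternate" hreflang="{lang}" href="{domain}{path}" />'
        for lang, domain in alternates.items()
    )

print(hreflang_block("/some-product/"))
# <link rel="alternate" hreflang="en-nz" href="https://www.example.co.nz/some-product/" />
# <link rel="alternate" hreflang="en-au" href="https://www.example.com.au/some-product/" />
```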
International SEO | s_EOgi_Bear
-
Direct traffic is up 2100% (due to a bot/crawler I believe)
Hi, the direct traffic to the website www.webgain.dk has increased by over 2100% recently. I can see that most of it is from the US (my target audience is in Denmark and the website is in Danish). What can I do about this? All this traffic gives my website a bounce rate of 99.91% for direct traffic. I believe it is some sort of bot/crawler. (Attached: 2100percentboost.png)
International SEO | WebGain
-
Using Javascript to alter ONE or TWO keywords in International Site
Hi, what is the best way to target a language that has slight variations in it without actually targeting specific countries? Scenario: an ecommerce site that sells mobile phones in Spanish, initially created to target Spanish from Spain. We call a mobile phone a "movil". Now we want to target Latin American users, who also use Spanish with variations, the most notable being that a mobile phone is called a "celular". We don't want to create specific sites via new ccTLDs, nor subdomains, nor directories for each new country, and we want to avoid having two sites (one for Spain, one for Latin America) given that the only major difference is we say MOVIL in Spain and CELULAR in Latin America. What is Google's take if we simply decide to modify THAT specific keyword on each page where it is mentioned? Either by: a) Server-based IP detection, that is, render the page with either one or the other term; b) Javascript-based, i.e. have BOTH terms on all pages but use Javascript to show/hide according to user preferences; or c) Display the keywords with different font sizes/emphasis, depending on the visitor. Any ideas?
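In code terms, option (a) would boil down to something like the lookup below; a rough sketch where the country list and term names are placeholders, and the GeoIP detection itself is assumed to happen upstream:

```python
# Hypothetical sketch of option (a): pick the regional spelling server-side
# from an already-detected country code and pass it into the page template.
# The country set and term keys are placeholders, not a definitive mapping.
REGIONAL_TERMS = {
    "mobile_phone": {"es-ES": "movil", "es-419": "celular"},
}

LATAM_COUNTRIES = {
    "AR", "BO", "CL", "CO", "CR", "DO", "EC", "GT", "HN",
    "MX", "NI", "PA", "PE", "PY", "SV", "UY", "VE",
}

def term_for(key: str, country_code: str) -> str:
    """Return the regional variant of a term for the visitor's country."""
    variant = "es-419" if country_code.upper() in LATAM_COUNTRIES else "es-ES"
    return REGIONAL_TERMS[key][variant]

# e.g. term_for("mobile_phone", "MX") -> "celular"
#      term_for("mobile_phone", "ES") -> "movil"
```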
International SEO | doctorSIM
-
Impact of Japanese .jp site duplicate content?
Our main website is at http://www.traxnyc.com and we just launched a Japanese version of the site at the http://www.traxnyc.jp domain. However, all the images used on the .jp site are linked from the .com site. Would this hurt me in Google at all for hotlinking images? Also, there is quite a bit of duplicate content on the .jp site at the moment: only a few things have been translated to Japanese, and for the most part the layouts and words are exactly the same (in English). Would this hurt my Google rankings in the US at all? Thanks for all your help.
International SEO | DiamondJewelryEmpire
-
Http://us.burberry.com/: Big traffic change for top URL (error 593f1ceb2d67)
Please forgive duplicating this question on the SEOmoz & Webmaster Tools forums, but I'm hoping to hit both audiences with this question... A few days ago I noticed that our US homepage (us.burberry.com) had dropped from PR5 to PR0, and the page has been deindexed by Google. After checking Webmaster Tools I also received the following message: "http://us.burberry.com/: Big traffic change for top URL. April 2, 2012. Search results clicks for http://us.burberry.com/ have decreased significantly. Message ID: 593f1ceb2d67." We're not doing any link building at all (we've enough on-site issues to deal with). The only changes I have made are adding Google Analytics to the website, uploading sitemaps via Webmaster Tools (it's not linked to from robots.txt yet), and setting the burberry.com and www.burberry.com geo-location settings to 'unlisted' (we want uk.burberry.com appearing in the UK results, us.burberry.com appearing in the US results etc. rather than www.burberry.com). I've reversed the geo-location settings, but I doubt this would have caused this. We've duplicate copies of our homepage (such as us.burberry.com/store//) from typos in inbound links (and bad programming that allows them to work rather than 404'ing), but I don't think any of this is new. What I don't understand is (a) why this is happening now and (b) why is this just affecting our US homepage? We've ~40 different duplicates of the homepage (us, uk, ca, pt, ro, sk etc.), so why is the US site being affected and not the others? Does anyone know if this is due to an algorithm change by Google or something else altogether? Background: Our website www.burberry.com has 46 subdomains such as uk.burberry.com, ca.burberry.com and us.burberry.com. There is a lot of duplicate content on each subdomain (including basic things like tracking parameters in URLs) and across subdomains (uk.burberry.com/store & us.burberry.com/store are exactly the same), there's very little text on the site (it's nearly all images), as well as poor redirects, inaccessible content (AJAX/Flash) and a whole host of basic SEO things that aren't being done correctly. I've joined the company in the last few months and have started addressing these issues, but I've got a LOT of work to do yet. One thing that we have in our favour is a link profile that is as clean and natural as they come: there was only ever one link building campaign performed (which was before my time), and I had all of those links removed as soon as I joined the company. Any help would be greatly appreciated! Thanks for your time, Dean Rowe. Edit: us.burberry.com 301 redirects to us.burberry.com/store/ as explained on the Webmaster Tools forum, but I don't believe this is the cause as it's the same across all subdomains.
International SEO | FashionLux
-
Does it matter whether you use /en vs /uk
I have a global site targeting many countries including the UK, which is the only English-language site. Does it matter whether I use /en or /uk for the UK sub-folder? If I already have /en in place, but my Google UK listings are struggling, will it benefit me to switch to /uk? I honestly don't think it matters too much, but given the choice I would've gone for /uk. I'm trying to weigh up whether it is worth the effort of changing it.
International SEO | Red_Mud_Rookie
-
Site structure for multi-lingual hotel website (subfolder names)
Hi there superMozers! I've read quite a few questions about multi-lingual sites but none answered my doubt / idea, so here it is: I'm re-designing an old website for a hotel in 4 different languages, which are all **hosted on the same .com domain** as follows:
- example.com/english/ for English
- example.com/espanol/ for **Spanish**
- example.com/francais/ for French
- example.com/portugues/ for Portuguese
While doing keyword research, I have noticed that many travel agencies separate geographical areas by folders; therefore an **agency promoting beach hotels in South America** will have a structure as follows:
- travelagency.com/argentina-beach-hotels/
- travelagency.com/peru-beach-hotels/
and they list hotels in each folder, thereby benefiting from those keywords to rank ahead of many independent hotel sites from those areas. What **I would like to** do, rather than just naming those folders with the traditional /en/ for English or /fr/ for French etc., is take advantage of this extra language subfolder to 'include' important keywords in the names of the subfolders in the following way (supposing we have a beach hotel in Argentina):
- example.com/argentina-beach-hotel/ for English
- example.com/hotel-playa-argentina/ for **Spanish**
- example.com/hotel-plage-argentine/ for French
- example.com/hotel-praia-argentina/ for Portuguese
Note that the same keywords are used in the name of the folder, but translated into the language the subfolders are in. In order to make things clear for the search engines I would specify the language in the HTML for each page. My doubt is whether Google or other search engines may consider this 'stuffing', although most travel agencies do it in their site structure. Do any Mozers have experience with this, any idea of how search engines may react, or whether they could penalise the site? Thanks in advance!
International SEO | underground
-
Google Webmaster Tools - International SEO Geo-Targeting site with Worldwide rankings
I have a client who already has rankings in the US & internationally. The site is broken down like this:
- url.com (main site with USA & international rankings)
- url.com/de
- url.com/de-english
- url.com/ng
- url.com/au
- url.com/ch
- url.com/ch-french
- url.com/etc
Each folder has its own sitemap & relevant content for its respective country. I am reading in Google Webmaster Tools > Site config > Settings the option under 'Learn more': "If you don't want your site associated with any location, select Unlisted." If I want to keep my client's international rankings the way they currently are on url.com, should I NOT geo-target it to the United States? So I select Unlisted, right? Would I use geo-targeting on url.com/de, url.com/de-english, url.com/ng, url.com/au and so on?
International SEO | Francisco_Meza