Good alternatives to Xenu's Link Sleuth and AuditMyPc.com Sitemap Generator
-
I am working on scraping title tags from websites with 1-5 million pages. Xenu's Link Sleuth seems to be the best option for this at this point. Sitemap Generator from AuditMyPc.com seems to work too, but it starts hanging when the sitemap file it is building becomes too large, so it looks like it won't be good for websites of this size. I know that Scrapebox can scrape title tags from a list of URLs, but that is not needed, since both of the above tools already collect titles.
I know about DeepCrawl.com as well, but it is paid, and it would be very expensive for this number of pages and websites (5 million URLs is $1750 per month; I could get a better deal on multiple websites, but this obviously does not make sense for me, as it needs to be more or less free). SEO Spider from Screaming Frog is not good for large websites.
So, in general, what is the best and most time-efficient way to work on something like this? Are there any other options?
Thanks.
-
Try import.io; it's free.
-
Another idea I have is to look for the sitemaps of these websites. There may be a way to get a list of all the URLs right away, without crawling: look at /robots.txt and /sitemap.xml, search for the sitemap in Google, things like that. If the URLs are there, the title tags can be scraped with Scrapebox, and according to their website it can be done relatively fast.
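For what it's worth, the sitemap idea can be sketched in a few lines of Python. This is only a rough sketch under assumptions: the function names are made up for illustration, it assumes the site actually declares its sitemaps in /robots.txt, and downloading the files (or unpacking gzipped sitemaps) is left out.

```python
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared in the text of a robots.txt file."""
    found = []
    for line in robots_txt.splitlines():
        # Split only on the first colon so the "https://..." part stays intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def locs_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    locs = []
    for element in root.iter():
        # Tags are namespaced, e.g. "{http://www.sitemaps.org/...}loc".
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            locs.append(element.text.strip())
    return locs
```

On a sitemap index, `locs_from_sitemap` returns the child sitemap URLs, so the same function can be run again on each of those files to get to the page URLs.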
Edit:
Somebody suggested http://inspyder.com to me; it costs around $40 and has a free trial. It may be a good option too.
-
So there is probably no way to tell whether I have all the URLs of the site, or what percentage I have. I could have 80 percent of the site or even less and not know it, I would assume. This is one part of working on sites that I have never needed before (I am working on something like this now), and there seems to be no good tool for the job.
I have a website with 33,500,000 pages. I've been running the tool for close to 5 hours, and I have around 125,000 URLs so far. At that rate it would take 1340 hours to do the entire site, which is close to two months of running the program 24 hours a day, and that does not make sense. Besides that, I was planning to do it on up to 100 sites. That is definitely not doable this way, and I would say it should be possible, software-wise.
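The estimate works out as follows, written as a quick Python check. It assumes the crawl rate stays constant for the whole site, which may well not hold in practice; the variable names are only for illustration.

```python
total_pages = 33_500_000   # size of the site
urls_so_far = 125_000      # URLs collected so far
hours_elapsed = 5          # running time so far

rate_per_hour = urls_so_far / hours_elapsed   # URLs per hour
hours_needed = total_pages / rate_per_hour    # hours for the whole site
days_needed = hours_needed / 24               # running 24 hours a day

print(f"{rate_per_hour:,.0f} URLs/hour, {hours_needed:,.0f} hours, {days_needed:.1f} days")
```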
I will try your method and see what I get. I don't have much time for experimenting with it, though. I need to work and generate results...
Edit:
I will know how the number of URLs compares to the 33,500,000 figure, obviously, but what's indexed in Google is not necessarily the complete website either. The method you are suggesting is not perfect, but I don't have two months to wait either, obviously...
-
You will crawl some of the same URLs - that's why you remove duplicates at the end. There's no way to keep it from re-crawling some of the URLs, as far as I know.
But yes, get it to recognize 600-800k URLs and then split the file. (Export, put the links in as an html file and start over.) Let me break it down the best I can:
-
Crawl your main (seed) URL until you've recognized 800k.
-
Pause/stop and then export the results.
-
Create an html file with the URLs from the export - separated 50k to 100k at a time.
-
Recrawl those files in Xenu with the "file" option.
-
Build them back up to 800k or so recognized URLs again and repeat.
After a few (4-6) iterations of this, you'll have most URLs crawled on most sites no matter how large. Doing it this way, I think you could expect to crawl about 2-3 million URLs a day. If you really paid attention to it and created smaller files but ran them more frequently, you could get 4-5 million, I think. I've crawled close to that in a day for a scrape once.
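The "create an html file" step above can be scripted. Here is a minimal sketch in Python, assuming the export has already been reduced to a plain list of URLs (the export's actual column layout is not covered here) and that a page of ordinary links is enough for the "file" option to pick them up. The function names and the file naming are hypothetical.

```python
from html import escape
from pathlib import Path


def html_chunks(urls, chunk_size=100_000):
    """Return a list of HTML documents, each linking to up to chunk_size URLs."""
    pages = []
    for start in range(0, len(urls), chunk_size):
        chunk = urls[start:start + chunk_size]
        # One plain anchor per URL; escape so "&" in query strings stays valid.
        links = "\n".join(
            f'<a href="{escape(u, quote=True)}">{escape(u)}</a><br>' for u in chunk
        )
        pages.append(f"<html><body>\n{links}\n</body></html>\n")
    return pages


def write_html_chunks(urls, out_dir, chunk_size=100_000):
    """Write the chunks to out_dir as chunk_001.html, chunk_002.html, ..."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, page in enumerate(html_chunks(urls, chunk_size), start=1):
        path = out / f"chunk_{number:03d}.html"
        path.write_text(page, encoding="utf-8")
        paths.append(path)
    return paths
```

With 800k recognized URLs and the default chunk size, this would produce 8 files to load back in, one per run.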
-
-
Thanks. It is good to hear that there is a way to do what I am trying to do, especially on 50 or more large sites.
I've been running Xenu on a 33,500,000-page site for a little over 4 hours and 15 minutes, and this is what I have so far:
Close to 500,000 URLs recognized, and only 115,000 processed, it looks like. I am manually saving it to a file every now and then, as there is no way to auto-save as far as I could see (there could be, I am not sure; there are not too many options there).
Based on your advice, I am not sure how I could speed up this process. Should I wait from this point, then stop the program, divide the file into 8 separate files, and load them into the program separately? Will the program then recognize these separate files as one and continue crawling for new URLs? If possible, please give more detail on how this needs to be done, as I don't fully understand. I also don't see how this could do this large website in one day, or let's say even five days...
Edit:
I think I now understand what you mean: get 8 separate files (it can be 6 or, let's say, 10) and run them all at the same time. But still, how will the program know not to crawl and download the same URLs across all the files? In general, I would like to ask for a better explanation of how this needs to be done.
Thanks.
-
Let Xenu crawl until you have about 800k links. Then export the file and add it back as 8 x 100k lists of URLs. You can then run it again and repeat the process. By the time you have split it 4-5 times, you can then export everything, put it into one file and remove duplicates.
Xenu, done this way, with 100 threads, is probably the fastest way to do the whole thing. I think you could get the 5M results in under 1 day of work this way.
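For the last step (put everything into one file and remove duplicates), something like the Python below would do. It is a sketch under assumptions: the URLs from all the exports are assumed to be already loaded into one list, and treating "page" and "page#section" as the same URL is my own choice here, not something the tool requires. The function name is made up.

```python
from urllib.parse import urldefrag


def dedupe_urls(urls):
    """Return the URLs in first-seen order with duplicates and blanks removed."""
    seen = set()
    unique = []
    for raw in urls:
        # Strip whitespace and drop any "#fragment", which points at the same page.
        url = urldefrag(raw.strip()).url
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

Keeping first-seen order means the combined list still roughly follows the order in which the crawl found the pages.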
-
Ok. So it looks like Screaming Frog may be a good way to go too, if not better. Xenu is free, which is a big plus. On top of that, Screaming Frog's SEO Spider is based on a yearly subscription, not a one-time fee. For those who don't know, there is a version of Xenu for large sites, which can be found on their website. They also have a support group at groups.yahoo.com (find it through there); I am not sure if it is still active.
Xenu upgraded to the version for larger sites may be the best way to go, since it is free. I've been testing AuditMyPc.com Sitemap Creator and the large-site version of Xenu, and the first one has already hung (I stopped using it). They were both collecting the info at about the same speed, but Xenu is working better (it does not hang and looks like it should be good). Either way, this will take quite a lot of time, as previously mentioned.
-
I agree with Moosa and Danny. I use Screaming Frog (full paid version) on a stripped-down Windows machine with an SSD and 16GB of performance RAM. I have also downloaded the 64-bit version of Java and increased the memory allocation for Screaming Frog to 12GB (the default limit is 512MB). Here's how: http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ (look at the section Increasing Memory on Windows 32 & 64-bit)
I did this because I was having issues crawling a large site. Since I put this system in place it has eaten any site I have thrown at it, so it works well for me personally. In terms of crawl speed, large sites such as the ones you mention will still take a while. You can set the crawl speed in Screaming Frog, but you need to be careful, as you can overload the server of the site you are crawling and cause issues...
Another option would be to buy a server and configure it for Screaming Frog and any other tools you may use, which gives you room to grow the system as your needs grow. It all depends on budget and how often you crawl large sites. Obviously, buying a server such as a Windows instance on Amazon EC2 will cost more in the long run, but it takes the strain away from your own systems and networks, and you should effectively never hit capacity on the server, as you can just upgrade. It will also allow you to remote desktop in from whatever system you use, yes, even a Mac.
Hope this helps
-
I believe that when you are talking about 1 to 5 million URLs, it is going to take time no matter what tool you use. But if you ask me, Screaming Frog is the better tool, and with the paid version you can still crawl websites with a few million URLs.
Xenu is not a bad choice either, but it's kind of confusing and there is a possibility that it can break.
Hope this helps!
-
I was facing a similar issue with huge sites that have hundreds of thousands of pages. But ever since I upgraded my computer with more RAM and an SSD, it runs way better on huge sites as well. I tried several scrapers, and I still believe Xenu is the best one and the one most recommended by SEO experts. You might also want to check this post on the Moz Blog about Xenu:
http://moz.com/blog/xenu-link-sleuth-more-than-just-a-broken-links-finder
Good luck!