How to extract URLs from a site (without bringing the server down!)
-
Hi everybody.
One of my clients is migrating to a new ecommerce platform, and we need to get a list of urls from the existing site to start mapping out the 301 redirects. Usually, I'd use a tool like Xenu or Integrity to crawl and output a list.
However, the database and server setup is so bad that it can't handle the requests from these tools and it sends the site down. This, unsurprisingly, is one of the reasons for the migration.
Does anybody know of a way to get a full list of urls without having to make a bunch of http requests which will kill the site? Any advice would be much appreciated!
-
Just a follow-up to my endorsement. It looks like Screaming Frog will let you control the number of pages crawled per second, but to do a full crawl you'll need to get the paid version (the free version only crawls 500 URLs):
http://www.screamingfrog.co.uk/seo-spider/
It's a good tool, and nice to have around, IMO.
-
Copy the site, set it up on a staging server and run http://www.xml-sitemaps.com/ on it?
-
why not find the links to the site, becauase you will only need to 301 the urls with extenal links. let teh rest 404. i use Bing WMT as it has a most complete collection IMO. they also export to a csv
-
Thanks Yannick, I don't know why I didn't think of using a scraper! Can you recommend any good code (PHP perhaps)?
-
-
Scrape Google?
-
Make your own scraper and keep the requests per second really low ?
-
Maybe the site has an automated sitemap somewhere ?
-
Google webmaster tools -> download "internal links" table
-
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Hi! I first wrote an article on my medium blog but am now launching my site. a) how can I get a canonical tag on medium without importing and b) any issue with claiming blog is original when medium was posted first?
Hi! As above, I wrote this article on my medium blog but am now launching my site, UnderstandingJiuJitsu.com. I have the post saved as a draft because I don't want to get pinged by google. a) how can I get a canonical tag on medium without importing and b) any issue with claiming the UJJ.com post is original when medium was posted first? Thanks and health, Elliott
Technical SEO | | OpenMat0 -
If a URL canonically points to another link, is that URL indexed?
Hi, I have two URL both talking about keyword phrase 'counting aggregated cells' The first URL has canonical link pointing to the second URL, but if one searches for 'counting aggregated cells' both URLs are shown in the results. The first URL is the pdf, and i need only second URL (the landing page) to be shown in the search results. The canonical links should tell Google which URL to index, i don't understand why both URLs are present in search results? Is 'noindex' for the first URL only solution? I am using Yoast SEO for my website. Thank you for the answers.
Technical SEO | | Chemometec0 -
How to delete specific url?
I just ran drawl diagnostics and trying to delete pages such as "oops that page can't be found" or "404 (not found_ error response pages. Can anyone help?
Technical SEO | | sawedding0 -
Then why my site is not ranking
My website's DA and PAs are good compare with my competitors. Then why my site is not ranking.
Technical SEO | | Somanathan0 -
Https Cached Site
Hi there, I recently switch my site to a new ecommerce platform which hosts the SSL certificate on their end so my site no longer has the HTTPS status unless a user is going through the checkout. Google has cached the HTTPS version of the site so in search it comes up sometimes which leads to a nasty warning that the site may not be what they are looking for. Is there a way to tell google NOT to look at the https version of the site anymore? Thanks! Bianca
Technical SEO | | TheBatesMillStore0 -
Young site trying hard, but banging head against the wall -- Site Review
Hi All New to PRO but we're seriously committed to getting this working. And firstly thank you to anyone who offers any useful thoughts and insights. We've launched a new site, unfortunately late to the market for the season and are really struggling to get search engine recognition. Site: http://www.ignitehats.co.uk/ We're continuously adding new content, slowly gathering more links and working hard to promote socially. But even on our clearest search terms like "Ignite hats" we're down on page 4. Both GWT and the Seomoz tools highlight no big problems (a few titles that are too long) but otherwise nothing. Maybe wrongly we requested that the Google spam team review our site incase it was being penalised, but got a template response saying the site was not in their spam system (phew, there wasn't a reason it should be we believe). We're wondering if this is just that our site is just too young? It's been live for 6 weeks. But worry maybe this is not the case. We've had success with another site we run much sooner than this. Any help or pointers would be really appreciated. Similar stories and what others have done, at least to give us some confidence to carry on would be great. Thanks for reading.
Technical SEO | | JHill0 -
Frequent server changes
Hey guys. I have a server related question. One of our websites is hosted with a nasty slow company, and we want to make a change. The problem we have is that the site is 6 months old so it started on one server, the client then moved it to this slow host about 2 months ago, we now want to move it again. Will this negatively affect search engine rankings? As ever, thanks in advance 🙂
Technical SEO | | Nextman0 -
The course of action to move my macro site to some mini sites- justin if you can help
We have a site that we want to break up into mini sites but keep the old site for the major brands. Empirecovers.com is the major and we want to break it off into Empire Truck Covers and Empire Boat covers. What I am thinking of doing is linking from the home to Empiretruckcovers.com instead of a mini page on the site and 301 redirect the mini page to empiretruckcovers.com. Than (there wont be duplicate content) making a small page for truck covers on empire just so people do not get confused. Is this the best way to go or what do you suggest? We are doing this because I feel there is seo value in having mini sites and also the user experience will be cleaner and people will trust it a lot more than inside a big site. The other problem is I have some great rankings on the pages so I want to do it so there is as little damage as possible. I guess once I start I will do all the free directories, yahoo directory and try to get links as fast as I can. Any suggestions would be great. I am going to do a/b testing to see if my adwords convert better on mini site or on the big site for certain keywords too
Technical SEO | | goldjake17880