Can you use Screaming Frog to find all instances of relative or absolute linking?
-
My client wants to pull every instance of an absolute URL on their site so that they can update them for an upcoming migration to HTTPS (the majority of the site uses relative linking). Is there a way to use the extraction tool in Screaming Frog to crawl one page at a time and extract every occurrence of href="http://"?
I have gone back and forth between an XPath extractor and a regex and have had no luck with either.
Ex. XPath: //*[starts-with(@href, "http://")][1]
Ex. Regex: href=\"//
-
This only works if you have downloaded all the HTML files to your local computer. That said, it works quite well! I am betting this is a database-driven site, so it would not work in the same way.
-
Regex: href=("|'|)http:(?:/{1,3}|[a-z0-9%])|[a-z0-9.-]+.
This allows for your link to have a " or ' or nothing between the = and the http. If you have any other TLDs, you can just keep expanding on the |.
I modified this from a gist on GitHub: https://gist.github.com/gruber/8891611
You can play with tools like http://regexpal.com/ to test your regex against example text.
I assumed you would want the full URL and that this was the issue you were running into.
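If it helps to test the idea outside of Screaming Frog, here is a minimal Python sketch of the same approach. The pattern below is my own simplified assumption, not the gist's regex and not anything Screaming Frog ships: it looks for href=, an optional " or ', and then captures everything from http:// up to the next quote, space, or >. The function name and the sample HTML are hypothetical.

```python
import re

# Assumed, simplified pattern (not the gist's regex): href=, an optional
# quote, then an http:// URL captured up to the next quote, whitespace, or '>'.
ABSOLUTE_HREF = re.compile(
    r"""href\s*=\s*["']?(http://[^"'\s>]+)""",
    re.IGNORECASE,
)


def find_absolute_links(html):
    """Return every http:// URL found in an href attribute, in document order."""
    return ABSOLUTE_HREF.findall(html)


# Hypothetical sample markup covering double-quoted, single-quoted,
# unquoted, and already-HTTPS links.
sample = (
    '<a href="http://www.example.com/a">A</a>'
    "<a href='/relative/b'>B</a>"
    "<a href=http://www.example.com/c>C</a>"
    '<a href="https://www.example.com/d">D</a>'
)

print(find_absolute_links(sample))
```

Relative links and links that are already https:// are skipped, so what comes back should be just the URLs that still need updating.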
As another solution, why not just fix the HTTPS links in the main navigation etc.? Then, once you have the staging/testing site set up, run Screaming Frog on that site, find all the 301 redirects or 404s, and use that report to find all the URLs to fix.
I would also ping Screaming Frog; this is not the first time they have been asked this question. They may have a better regex and/or solution than what I have suggested.
-
Depending on how you've coded everything, you could try to set up a Custom Search under Configuration. This will scan the HTML of each page, so if the coding was consistent you could put something like href="http://www.yourdomain.com" as the string it's looking for, and the Custom tab will show you all the resulting pages that match the string.
That's the only way I can think of to get Screaming Frog to pull it, but I'm looking forward to anyone else's thoughts.
-
If you have access to all the website's files, you could try finding all instances in the directory using something like Notepad++. You could even use find and replace.
This is how I tend to locate those one-liners among hundreds of files.
Good luck!
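If you'd rather script the find-in-files step than run it through an editor, something like the following Python sketch could work. It assumes the site's files are on disk and are plain text; the function name, the list of file extensions, and the regex are all my own assumptions, so adjust them to however the site is actually built.

```python
import re
from pathlib import Path

# Assumed pattern: href=, an optional quote, then an http:// URL.
ABSOLUTE_HREF = re.compile(r"""href\s*=\s*["']?(http://[^"'\s>]+)""", re.IGNORECASE)


def scan_directory(root, suffixes=(".html", ".htm", ".php")):
    """Return (file path, line number, URL) for each http:// href under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and any file type not in the assumed extension list.
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            for url in ABSOLUTE_HREF.findall(line):
                hits.append((str(path), line_number, url))
    return hits
```

Calling scan_directory on the folder that holds the site's files gives you a list you can print or write to a CSV, one row per absolute link, which is roughly what a find-in-files results pane would show.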