Can you use Screaming Frog to find all instances of relative or absolute linking?
-
My client wants to pull every instance of an absolute URL on their site so that they can update them for an upcoming migration to HTTPS (the majority of the site uses relative linking). Is there a way to use the extraction tool in Screaming Frog to crawl one page at a time and extract every occurrence of href="http://"?
I have gone back and forth between using an XPath extractor and a regex, and have had no luck with either.
Ex. XPath: //*[starts-with(@href, "http://")][1]
Ex. Regex: href=\"//
-
This only works if you have downloaded all the HTML files to your local computer. That said, it works quite well! I am betting this is a database-driven site, so it would not work in the same way.
-
Regex: href=("|'|)http:(?:/{1,3}|[a-z0-9%])|[a-z0-9.-]+.
This allows your link to have a ", a ', or nothing between the = and the http. If you have any other TLDs, you can just keep expanding on the |.
I modified this from a gist on GitHub: https://gist.github.com/gruber/8891611
You can use tools like http://regexpal.com/ to test your regexp against example text.
I assumed you would want the full URL and that was the issue you were running into.
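If you end up running a pattern outside Screaming Frog, here is a minimal Python sketch of the same idea. Note that the pattern below is a simplified one I am assuming for illustration, not the regex above: it captures the full URL of any href that starts with http://, whether the value is wrapped in double quotes, single quotes, or nothing. The sample HTML and the function name are made up.

```python
import re

# Assumed, simplified pattern (not the regex from the post above):
# capture href values that start with http://, with or without quotes.
ABSOLUTE_HREF = re.compile(
    r"""href\s*=\s*["']?(http://[^"'\s>]+)""",
    re.IGNORECASE,
)


def find_absolute_links(html: str) -> list[str]:
    """Return every http:// URL that appears as an href value in the HTML."""
    return ABSOLUTE_HREF.findall(html)


# Made-up sample markup covering the quoting styles mentioned above.
sample = """
<a href="http://www.example.com/page">absolute</a>
<a href='/relative/page'>relative</a>
<a href=http://example.com/unquoted>unquoted</a>
<a href="https://www.example.com/secure">already https</a>
"""

print(find_absolute_links(sample))
```

Relative links and links that are already https:// are skipped, so what comes back is only the list of URLs that still need updating.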
As another solution, why not just fix the HTTPS links in the main navigation, etc.? Then, once you get the staging/testing site set up, run Screaming Frog on that site, find all the 301 redirects or 404s, and use that report to find all the URLs to fix.
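As a rough sketch of that last step, assuming you export the crawl to CSV: the snippet below keeps only the rows with a 301 or 404 status. The column names "Address" and "Status Code" are my assumption about the export headers, and the staging URLs are invented, so adjust both to match your actual file.

```python
import csv
import io


def urls_needing_fixes(csv_text: str) -> list[tuple[str, int]]:
    """Return (address, status) pairs for rows whose status is 301 or 404."""
    flagged = []
    # "Address" and "Status Code" are assumed column names.
    for row in csv.DictReader(io.StringIO(csv_text)):
        status = int(row["Status Code"])
        if status in (301, 404):
            flagged.append((row["Address"], status))
    return flagged


# Invented example rows standing in for a crawl export.
export = """Address,Status Code
https://staging.example.com/,200
http://staging.example.com/about,301
https://staging.example.com/old-page,404
"""

for address, status in urls_needing_fixes(export):
    print(status, address)
```

For a real export you would read the file from disk instead of a string, then look up which pages link to each flagged URL.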
I would also ping Screaming Frog - this is not the first time they have been asked this question. They may have a better regexp and/or solution than what I have suggested.
-
Depending on how you've coded everything, you could try setting up a Custom Search under Configuration. This scans the HTML of each page, so if the coding is consistent you could enter something like href="http://www.yourdomain.com" as the string to look for, and the Custom tab will show you all the pages that match it.
That's the only way I can think of to get Screaming Frog to pull it, but I'm looking forward to anyone else's thoughts.
-
If you have access to all the website's files, you could try finding all instances in the directory using something like Notepad++. You could even use find and replace.
This is how I tend to locate those one-liners among hundreds of files.
Good luck!
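If you would rather script the search than use an editor, the same "find in files" idea can be sketched in Python. This assumes the site's HTML files sit in a local folder; the function name, the *.htm* file filter, and the search string are illustrative choices, not anything specific to Notepad++.

```python
from pathlib import Path


def find_absolute_hrefs(root: str, needle: str = 'href="http://') -> list[tuple[str, int, str]]:
    """Return (file path, line number, line text) for each line containing needle."""
    hits = []
    # Walk every .htm/.html file under root, in a stable order.
    for path in sorted(Path(root).rglob("*.htm*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Search the current directory; swap in the folder that holds the site files.
    for file_path, line_number, line in find_absolute_hrefs("."):
        print(f"{file_path}:{line_number}: {line}")
```

This only reports matches. It is a plain substring search, so single-quoted or unquoted hrefs would need a second pass with a different search string.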