Can you use Screaming Frog to find all instances of relative or absolute linking?
-
My client wants to pull every instance of an absolute URL on their site so that they can update them for an upcoming migration to HTTPS (the majority of the site uses relative linking). Is there a way to use the extraction tool in Screaming Frog to crawl one page at a time and extract every occurrence of href="http://"?
I have gone back and forth between using an x-path extractor as well as a regex and have had no luck with either.
Ex. XPath: //*[starts-with(@href, "http://")][1]
Ex. Regex: href=\"//
-
This only works if you have downloaded all the HTML files to your local computer. That said, it works quite well! I am betting this is a database driven site and so would not work in the same way.
-
Regex: href=("|'|)(?:http:(?:/{1,3}|[a-z0-9%])|[a-z0-9.-]+[.](?:com|net|org))
This allows the link to have ", ', or nothing between the = and the http. The second alternative catches links written as bare domains; if you need other TLDs, you can just keep expanding the | list.
I modified this from a Gist on GitHub: https://gist.github.com/gruber/8891611
You can use a tool like http://regexpal.com/ to test your regex against example text.
I assumed you would want the full URL and that was the issue you were running into.
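As a quick sanity check before pasting the pattern into Screaming Frog, you can test it in Python (a minimal sketch; the sample HTML snippets are made up for illustration):

```python
import re

# href= followed by an optional quote, then either an http: URL
# or a bare domain ending in one of the listed TLDs.
PATTERN = re.compile(
    r"""href=("|'|)(?:http:(?:/{1,3}|[a-z0-9%])|[a-z0-9.-]+[.](?:com|net|org))""",
    re.IGNORECASE,
)

sample = (
    '<a href="http://www.example.com/page">absolute</a>\n'
    "<a href='/about'>relative</a>\n"
    '<a href="https://secure.example.com/">already https</a>\n'
)

# group(0) is the full matched text, not just the captured quote
hits = [m.group(0) for m in PATTERN.finditer(sample)]
print(hits)  # only the absolute http:// link matches
```

Note the relative and https links fall through, which is exactly what you want for this audit.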
As another solution: why not just fix the links to https in the main navigation and other templates first? Then, once the staging/testing site is set up, run Screaming Frog on it, pull the report of 301 redirects and 404s, and use that to find the remaining URLs to fix.
I would also ping Screaming Frog - this isn't the first time they've been asked this question, and they may have a better regex or solution than what I've suggested.
-
Depending on how everything is coded, you could try setting up a Custom Search under Configuration. This scans the HTML of each page, so if the coding is consistent you could enter a string like href="http://www.yourdomain.com", and the Custom tab will then show you every crawled page that matches it.
That's the only way I can think of to get Screaming Frog to pull it but looking forward to anyone else's thoughts.
-
If you have access to all the website's files, you could try finding all instances across the directory with something like Notepad++'s Find in Files. You could even use Find and Replace.
This is how I tend to locate those one-liners among hundreds of files.
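If you'd rather script the search than click through an editor, a few lines of Python can do the same directory-wide scan (a hedged sketch: the pattern only covers double-quoted http:// links, the file extensions and root path are placeholders):

```python
import os
import re

ABSOLUTE_LINK = re.compile(r'href="http://[^"]*"', re.IGNORECASE)

def find_absolute_links(root):
    """Yield (path, line number, matched link) for every absolute href."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Adjust the extensions to whatever your site's files use.
            if not name.endswith((".html", ".htm", ".php")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                for lineno, line in enumerate(fh, start=1):
                    for match in ABSOLUTE_LINK.finditer(line):
                        yield path, lineno, match.group(0)

if __name__ == "__main__":
    # "./site-files" is a placeholder for your local copy of the site.
    for path, lineno, link in find_absolute_links("./site-files"):
        print(f"{path}:{lineno}: {link}")
```

The output gives you file and line number for each absolute link, which makes the subsequent find-and-replace pass much less risky.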
Good luck!