This only works if you have downloaded all the HTML files to your local computer. That said, it works quite well! I am betting this is a database-driven site, though, so it would not work in the same way.
Posts made by CleverPhD
-
RE: Can you use Screaming Frog to find all instances of relative or absolute linking?
Regex: href=("|'|)(?:http:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+\.(?:com|net|org))
This allows your link to have a ", a ', or nothing between the = and the http. If you need to catch any other TLDs, you can just keep expanding on the | alternation at the end.
I modified this from a posting in github https://gist.github.com/gruber/8891611
You can play with tools like http://regexpal.com/ to test your regexp against example text
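If you want to sanity check it outside of a browser tool, here is a minimal Python sketch (the sample HTML and the short TLD list are just placeholders) showing what the pattern does and does not pick up:

```python
import re

# Rough sketch: quote, apostrophe, or nothing between = and http,
# plus a small TLD alternation you can keep expanding.
pattern = re.compile(
    r"""href=("|'|)(?:http:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+\.(?:com|net|org))""",
    re.IGNORECASE,
)

# Hypothetical example HTML to test against
sample_html = """
<a href="http://www.example.com/page">absolute, double quotes</a>
<a href='http://example.net/page'>absolute, single quotes</a>
<a href=/about-us/>relative link, should not match</a>
"""

for match in pattern.finditer(sample_html):
    print(match.group(0))
```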
I assumed you would want the full URL and that was the issue you were running into.
As another solution, why not just fix the https links in the main navigation etc. first, then once you get the staging/testing site set up, run Screaming Frog on that site, find all the 301 redirects or 404s, and use that report to find all the URLs to fix.
I would also ping Screaming Frog - this is not the first time they have been asked this question. They may have a better regexp and/or solution than what I have suggested.
-
RE: Researching search volume drop
If you want to look at relative search volume, you can look at Google Trends https://www.google.com/trends/. I would also see if you notice any trends in Google Search Console under Search Traffic > Search Analytics > Impressions.
What your graph has me wondering is whether this is an attribution issue with GA. On the grey line, Moz is simply taking your GA traffic that is tagged as organic and showing it in the graph. If you have an attribution issue in GA, organic traffic may be showing up as direct traffic; if there is anything wonky in the traffic attribution, GA will put it as Direct. There is a classic article about Groupon that is a good example of how organic can be attributed incorrectly: http://searchengineland.com/60-direct-traffic-actually-seo-195415
Look at your overall traffic in GA and then add a segment for organic traffic and then direct traffic. If your overall traffic is constant and you see organic going down while direct traffic is going up, you have your answer. As I understand it, this phenomenon is due to browser issues, so see if you have had more traffic recently from a given browser and that may give you another clue.
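If it is easier to eyeball this outside the GA interface, here is a rough sketch of the comparison I mean, assuming a hypothetical CSV export with date, channel, and sessions columns (the file name and column names are made up, so adjust to whatever your export looks like):

```python
import pandas as pd

# Hypothetical CSV exported from GA with columns "date", "channel"
# (e.g. Organic Search, Direct) and "sessions".
df = pd.read_csv("ga_sessions_by_channel.csv", parse_dates=["date"])

# Pivot so each channel becomes a column, indexed by date
pivot = df.pivot_table(index="date", columns="channel", values="sessions", aggfunc="sum")

# If organic trends down while direct trends up and the total stays flat,
# that points to an attribution issue rather than a real traffic drop.
pivot["total"] = pivot.sum(axis=1)
print(pivot[["Organic Search", "Direct", "total"]].resample("W").sum())
```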
Another thing to check, you should be able to look at your organic traffic in GA and see if it is the same as Moz, or not. If not, ping the Moz folks to make sure your data from GA is coming in properly. May be some data import issues there.
My other guess here is that your ranking is ok, but your click rate has been jacked. Google Search Console will show you CTR over time, and that may help. Look and see: did you change meta descriptions? Did you change up your schema markup so that previously you had rich snippets in the SERP, but now you do not? You could potentially keep ranking, but lose CTR.
These are all things I would look at, but at this point, your guess is as good as mine. Looking through the above will probably prompt you to check other things that might give you an answer.
Good luck!
-
RE: Duplicate Content - Bulk analysis tool?
I have not used this tool in this way, but have used it for other crawler projects related to content clean up and it is rock solid. They have been very responsive to me on questions related to use of the software. http://urlprofiler.com/
Duplicate content search is the next project on my list; here is how they do it.
http://urlprofiler.com/blog/duplicate-content-checker/
You let URL Profiler crawl the section of your site that is most likely to be copied (say your blog) and you tell URL Profiler what section of your HTML to compare against (i.e. the content section vs the header or footer). URL Profiler then uses proxies (you have to buy the proxies) to perform Google searches on sentences from your content. It crawls those results to see if there is a site in the Google SERPs that has sentences from your content word for word (or pretty close).
I have played with Copyscape, but my markets are too niche for it to work for me. The logic from URL Profiler is that you are searching the database that matters most: Google.
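Just to make the sentence-sampling idea concrete (this is not URL Profiler's actual code, only a rough sketch with a made-up URL and content selector), the first step looks something like this:

```python
import re
import requests
from bs4 import BeautifulSoup

# Hypothetical blog post URL and a hypothetical content container selector.
url = "https://www.example.com/blog/some-post"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
content = soup.select_one("div.post-content") or soup.body  # adjust to your template

# Split the body copy into sentences and keep the longer, more distinctive ones.
text = content.get_text(" ", strip=True)
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if len(s.split()) >= 12]

# These are the strings you would search for (in quotes) to spot copies in the SERPs.
for sentence in sentences[:5]:
    print(f'"{sentence}"')
```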
Good luck!
-
RE: Anyways to pull anchor text?
Thanks Jay. If I look on the backlinks side, they all seem to have the same subdomain in some form or another. You would just need to set up the regex in Screaming Frog to look for just that keyword in the subdomain so it matches all the variants of it.
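For what it is worth, the kind of pattern I had in mind is just a keyword match anywhere in the hostname, something like this rough sketch (the brand keyword and sample links are hypothetical):

```python
import re

# Match any href whose hostname contains a hypothetical brand keyword,
# so subdomain variants (shop.brand.com, blog.brand.co.uk, etc.) are all caught.
pattern = re.compile(r'href=["\']?https?://[^/"\'\s]*brand[^/"\'\s]*', re.IGNORECASE)

samples = [
    '<a href="https://shop.brand.com/page">shop</a>',
    "<a href='http://blog.brand.co.uk/post'>blog</a>",
    '<a href="https://unrelated.example.com/">other</a>',
]

for s in samples:
    print(bool(pattern.search(s)), s)
```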
That said, ignore everything I just posted. I was thinking earlier, "Surely there is scraper software out there that does this already." I did not take the time to look. Your mention of Scrapebox reminded me of that.
Scrapebox has a separate add-on that does this:
http://www.scrapebox.com/anchor-text-checker
The ScrapeBox Anchor Text Checker allows you to enter your domain and then load a list of URLs that contain your backlink. It will scan all the URLs containing your link and extract the anchor text used by the websites that link to you.
-
RE: Anyways to pull anchor text?
Ok. Can you be more specific about what you are trying to accomplish with this data? That would help me understand what you need.
-
RE: Deindexed from Google images Sep17th
Bummer. This smells of a technical change that occurred on your site.
Check: robots.txt - are you blocking access to images? You can also look in Search Console and under Crawl use the Robots.txt tester and see if your image URLs fail there. It will show you where the issue is.
Check whether, for example, all your images were moved to a CDN and no 301 redirects from the old image URLs were put in place.
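A quick way to spot check both of those is a little script like this (rough sketch, the domain and image URLs are placeholders) that runs old image URLs through robotparser and looks at their status codes:

```python
import urllib.robotparser
import requests

# Replace with your real domain and a sample of your old image URLs.
site = "https://www.example.com"
old_image_urls = [
    "https://www.example.com/images/product-1.jpg",
    "https://www.example.com/images/product-2.jpg",
]

rp = urllib.robotparser.RobotFileParser()
rp.set_url(site + "/robots.txt")
rp.read()

for url in old_image_urls:
    allowed = rp.can_fetch("Googlebot-Image", url)
    status = requests.head(url, allow_redirects=False, timeout=10).status_code
    # allowed=False means robots.txt is blocking the image crawler;
    # a 404/410 with no redirect means the old URL was never 301'd to the new location.
    print(url, "allowed:", allowed, "status:", status)
```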
Talk to your dev and look at every ticket prior to Sept 17th and see if there is anything else that was changed.
The good news is that if this is something technical and you fix it quickly, you should recover.
Good luck!
-
RE: Website Redesign, 301 Redirects, and Link Juice
Great answer. A good tool for testing the 301s in bulk is Screaming Frog. Save a CSV list of your old URLs before you migrate. When you update the site, run Screaming Frog in list mode and it will show you where all the old URLs 301 to. Makes it really easy to test.
If you do have any sort of staging site to do this with, that would be optimal before you go live. If you do go live, I would make this the first thing you do to check those 301s. Screaming Frog will quickly check a ton of them and give you some peace of mind.
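If you want a second opinion alongside the Frog, a rough script like the one below (it assumes a hypothetical old-urls.csv with one old URL per row) will tell you where each old URL ends up:

```python
import csv
import requests

# Hypothetical CSV with one old URL per row.
with open("old-urls.csv", newline="") as f:
    old_urls = [row[0] for row in csv.reader(f) if row]

for url in old_urls:
    resp = requests.head(url, allow_redirects=True, timeout=10)
    hops = len(resp.history)
    # A clean migration shows one 301 hop to the new URL and a final 200;
    # 404s and long redirect chains are the ones to go fix.
    print(url, "->", resp.url, f"({hops} redirect(s), final status {resp.status_code})")
```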
Side note: the only way link juice is lost in a 301 is if you 301 to a page that does not have semantically related content to the original page, e.g. if you have a page on Red Widgets and you 301 it to a page on Blue Bangles, Google will not pass the juice as it sees you trying to manipulate the link juice. As you are using a 301 redirect to a new URL with the exact same content, you should be fine, assuming the other points that Dirk mentions.
-
RE: Anyways to pull anchor text?
Screaming Frog can do this with custom extraction and list mode. If I am reading your question correctly, you have a list of URLs and know which pages on your site they link to.
You would upload the list of URLs into Screaming Frog so it knows what pages to scan, and run it in list mode:
http://www.screamingfrog.co.uk/seo-spider/user-guide/configuration/#15
You would then use the custom extraction tool to grab the a href code that has a link to your domain:
http://www.screamingfrog.co.uk/web-scraper/
You would need to plug in a regular expression to look for your domain (or versions of it) and then include the rest of the HTML tag that contains the anchor text, all the way through the closing </a>.
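Here is a rough sketch of the kind of pattern I mean, outside of the Frog and with a made-up domain, just so you can see the anchor text being captured; you will almost certainly need to loosen or tighten it for how other sites write their links:

```python
import re

# Capture the anchor text of any link pointing at a hypothetical domain.
# The pattern is deliberately loose; tweak it for the markup you actually see.
pattern = re.compile(
    r'<a[^>]+href=["\']?[^"\'>]*yourdomain\.com[^"\'>]*["\']?[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

sample_html = """
<p>Read more at <a href="https://www.yourdomain.com/guide">this great guide</a>.</p>
<p>Also see <a href='http://blog.yourdomain.com/post' rel="nofollow">their blog post</a>.</p>
<p>Unrelated: <a href="https://other.example.org/">some other site</a>.</p>
"""

for anchor_text in pattern.findall(sample_html):
    print(anchor_text.strip())
```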
You should then be able to import that data into a spreadsheet and use text to columns to split the anchor text into its own column.
It is a little tricky as the regular expression may have to be tweaked depending on how other sites link to your site. Run the Frog on a test group of 10 or so to make sure it works. If you have a bunch of errors, take the error examples and tweak the regular expression based on those.
-
RE: How to de-index old URLs after redesigning the website?
I respectfully disagree with all of the above. Please repeat after me: 404s are not bad, they are diagnostic. 404s are not bad, they are diagnostic. 404s are not bad, they are diagnostic.
> After redesigning my website (5 months ago) in my crawl reports (Moz, Search Console) I still get tons of 404 pages which all seem to be the URLs from my previous website (same root domain).
**Part 1: Internal links that 404 in the Moz crawl.** The 404s that show up in the Moz crawl are only going to come from internal links on your website. The Moz crawl only looks at internal links, not links from other websites. In other words, if you see 404s in your Moz crawl, that means that somewhere you are linking to those pages, and that is why the 404s are showing up. Download the CSV and you will find them in your Moz crawl. Other tools such as Screaming Frog, Botify, and DeepCrawl will show you a similar analysis.
Simple solution. Go through your code and remove the internal links on your site that direct the Moz crawler to those pages and the 404s will go away. (FYI this same approach will work for any internal 301s) These 404 errors in the Moz report are great diagnostic signals on where to fix your site. It is bad for users to click on a link within your website and get sent to a page that does not exist.
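If you would rather hunt these down with a script than a crawler report, a rough sketch like this (hypothetical start URL, internal links only) flags the links you need to remove or update:

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# Check one page of a hypothetical site for internal links that return 404.
page = "https://www.example.com/some-page/"
site_host = urlparse(page).netloc

html = requests.get(page, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a", href=True):
    target = urljoin(page, a["href"])
    if urlparse(target).netloc != site_host:
        continue  # only interested in internal links
    status = requests.head(target, allow_redirects=True, timeout=10).status_code
    if status == 404:
        # This is the link you would remove or update in your templates/content.
        print(f"{page} links to {target} which returns 404 (anchor: {a.get_text(strip=True)!r})")
```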
**Part 2: External links from Search Console.** The 404s that show up in Search Console can come from internal links on your site AND external links from other sites. Google will keep trying to crawl these links due to other sites linking to pages on your site and your own internal links. For internal link fixing, see the suggestion above. For external links you need a different approach.
Look at the external links: where are they coming from? Are they from quality websites? Do they go to formerly important pages on your website (i.e. pages that were good converters)? If so, then use a 301 redirect to send them to the correct replacement page (and this is not always the home page). You get users to the correct page, any link equity is passed along as well, and this can help with your site rankings. If the link goes to a former page on your site that was not any good to start with and the links that come into it are poor quality, then you just let the page 404. Tools such as Moz Open Site Explorer, Ahrefs, or Majestic can help with this assessment - but usually you can just look at a site linking to you and tell if it is crap or not.
You need to consider the above regardless of whether you want to get the 404ing pages in question out of the Google index: even if you get Google to remove a page from the index, it will then see the internal link on your site and find the 404 again. If you have removed the links to the 404 pages on your site, eventually Google will stop crawling them and they will drop out of the index.
Important note regarding the use of robots.txt: blocking Google from crawling the 404s will not remove the pages from the index, Google will just stop crawling them. Google has to be able to crawl the URL to see the 404, recognize that it is a bad page, and then remove the page from the index. Blocking with robots.txt stops Google from doing that. As soon as you take the page out of robots.txt, Google will recrawl it and the 404 shows up again. Robots.txt treats a symptom and is a red herring; allowing the 404 to occur takes care of the issue permanently.
Dead pages are a natural part of the web. Let Google see the 404 (if it truly is a page that should 404 and has no link equity that should be passed along with a 301). Google will crawl the 404 several times and you will see it in Search Console several times. It is ok. You are not penalized for X number of 404s. You may lose ranking if you 404 a page that Google used to rank well, but this is just because Google will not keep a page highly ranked that does not exist :-). Help Google out by cleaning up your internal link structure so that when it sees you do not link to the page any more, that is a signal that the page should 404. Google knows that, due to the nature of the web, pages will time out on occasion and show an error, so it will continue to recrawl a page just to make sure; it wants to give you the benefit of the doubt. Therefore, you have to give clear directives by not linking to dead pages so that after Google double- and triple-checks the page, it will finally drop it. You will see the 404 in your Search Console for several months and then it will eventually go away.
Hope that makes sense. Good luck!