I can't crawl the archive of this website with Screaming Frog
-
Hi
I'm trying to crawl this website (http://zeri.info/) with Screaming Frog, but because of some technical issue with their site (I can't find what is causing it) I'm only able to crawl the first page of each category (e.g. http://zeri.info/sport/). After that it goes on to crawl each page of their archive (hundreds of thousands of pages), but it won't crawl the links inside these pages.
Thanks a lot!
-
I think the issue comes from the way you handle the pagination and/or the way you render archived pages.
Example: first archive page of Aktuale: http://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale
Clicking on page 2 adds the date
http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st till today.
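To see what the page-2 link actually asks the archive for, the query string can be split into its parameters. This is only a rough sketch using Python's standard library; reading `from`/`until` as a date window and `faqe` as the page number is an assumption based on the URL above, not something confirmed by the site's developers.

```python
from urllib.parse import urlsplit, parse_qs

# Page-2 archive URL quoted in the post above.
page_two = (
    "http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale"
    "&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results"
)

# parse_qs returns a list per key; keep the first value of each parameter.
params = {k: v[0] for k, v in parse_qs(urlsplit(page_two).query).items()}

# Assumed meaning: from/until = date window, faqe = page number.
print(params["from"], params["until"], params["faqe"])
```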
If I check all the different sections and the number of articles listed in each archive, I get approx. 1200 pages. Add some additional pages linked from these pages and you get to the 2K pages you mentioned.
There seems to be no way to reach the previously published content without executing a search, which Screaming Frog can't do. It's quite possible that this is causing issues for Googlebot as well, so I would try to fix this.
If you really want to crawl the full site in the meantime, add another rule in URL rewriting, this time selecting 'regex replace':
regex: from=2016-06-01
replace: from=2010-01-01 (use the earliest date of publishing)
This way, the system will call the URL http://zeri.info/arkiva/?from=2010-01-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one, listing all the articles instead of only the June articles.
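For anyone who wants to check the effect of such a rule outside Screaming Frog, the rewrite can be approximated in a few lines of Python. This is a sketch under assumptions: the helper name `widen_archive_window` is made up for illustration, 2010-01-01 is just the placeholder earliest date from the post, and the pattern matches any start date rather than only the literal 2016-06-01.

```python
import re

def widen_archive_window(url: str, earliest: str = "2010-01-01") -> str:
    """Hypothetical helper: swap the archive start date for an earlier one."""
    # Match from=YYYY-MM-DD and substitute the assumed earliest publishing date.
    return re.sub(r"from=\d{4}-\d{2}-\d{2}", f"from={earliest}", url)

url = (
    "http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16"
    "&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2"
)
print(widen_archive_window(url))
```

Only the `from` value changes; every other parameter is passed through untouched.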
Hope this helps.
Dirk
-
I can't make it work. After removing the 'formkey' parameter I was able to crawl 1.7k pages and it stopped there. The site has more than 400k pages, so something must be wrong.
I want to crawl only the root domain without subdomains, and all I can crawl is around 2k pages.
Do you have any idea what might be happening?
-
Great, it worked. Just a small note: if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is also having these issues. You could consider excluding the formkey parameter in Search Console (Crawl > URL parameters).
Dirk
-
Dirk, thanks a lot.
I just added "formkey" to be removed as a parameter and it seems to be working. I have crawled 1k pages so far and I'm going to monitor how it goes.
The site has more than 400k pages, so crawling them all will take time (and I'm going to have to crawl each section so I can create sitemaps for them).
Thanks again
Gjergj
-
In the menu 'url rewriting' you can simply put the parameters the site should ignore (like date, formkey,..). I removed the formkey parameter and I checked the pages of the archive in Screaming Frog.
It is clearly able to detect all the internal links on the page - so will crawl them.
How are you certain that the pages below are not crawled - could you give a specific example of page that should be crawled but isn't?
Dirk
-
I've tried changing the settings to respect noindex & canonical. It stops crawling the archive pages, but it still won't crawl the links inside those pages. (I've added NOINDEX, FOLLOW to all archive pagination pages.)
What do you mean by rewriting the URL to ignore the formkey? How do you think it should be set up?
Gjergji
-
I think Screaming Frog is going nuts on the formkey value in the URL, which changes every time you move to another page.
Could you modify the settings of the spider to respect noindex & respect canonical - looks like this is solving the issue.
Alternatively you could rewrite the url to ignore the formkey (remove parameter)
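As a rough illustration of what "remove parameter" does to these URLs, here is a small Python sketch using only the standard library. The function name `strip_params` is hypothetical, and this only mimics the idea; it is not how Screaming Frog implements its URL rewriting.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_params(url: str, drop=("formkey",)) -> str:
    """Hypothetical helper: drop the named query parameters from a URL."""
    parts = urlsplit(url)
    # Keep every parameter except the ones listed in `drop`.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

first = "http://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale"
again = "http://zeri.info/arkiva/?formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&acid=aktuale"

# Both variants collapse to the same address once formkey is gone.
print(strip_params(first))
print(strip_params(first) == strip_params(again))
```

With the changing formkey removed, the same archive page is no longer seen as a new URL on every visit.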
Dirk
-
Hi Logan
I've tried going back to the default configuration but it didn't help. Still, I don't believe Screaming Frog is to blame. I think there is something wrong with the way the site has been developed (they are using a custom CMS), but I can't find the reason why this is happening. As soon as I find the cause, I can ask the guys who developed this site to make the necessary changes.
Thanks a lot.
-
Hi Dirk
Thanks a lot for replying back. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside these pages.
The hierarchy of the site goes like this:
Homepage
- Categories (with about 20 articles in them)
- Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)
Screaming Frog will crawl the homepage and categories ... but after it goes to the archive it won't crawl the articles inside the archive; instead it will only crawl the pages (pagination) of that archive.
Thanks again.
-
Try going to File > Default Config > Clear Default Configuration. This happens to me sometimes as well, as I've edited settings over time. Clearing it out and going back to default settings is usually quicker than clicking through the settings to identify which one is causing the problem.
-
Did you put in some special filters? I just tried to crawl the site and it seems to work just fine.