I can't crawl the archive of this website with Screaming Frog
-
Hi
I'm trying to crawl this website (http://zeri.info/) with Screaming Frog but because of some technical issue with their site (i can't find what is causing it) i'm able to crawl only the first page of each category (ex. http://zeri.info/sport/) and then it will go to crawl each page of their archive (hundreds of thousands of pages) but it won't crawl the links inside these pages.
Thanks a lot!
-
I think the issue comes from the way you handle the pagination and or the way your render archived pages.
Example: First archive page of Aktualehttp://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale
Clicking on page 2 adds the date
http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st till today.
If I check all the different section & the number of articles listed in each archive I get approx. 1200 pages - add some additional pages linked on these pages and you get to the 2K pages you mentioned.
There seems to be no possibility to reach the previously published content without executing a search - which Screaming Frog can't do. It's quite possible that this is causing issues for Google bot as well so I would try to fix this.
If you really want to crawl the full site in the mean time - add another rule in url rewriting - this time selecting 'regex replace' -
add regex: from=2016-06-01
replace regex from=2010-01-01 (replace by the earliest date of publishing)This way - the system will call url http://zeri.info/arkiva/?from=2010-06-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one - listing all the articles instead of only the june articles.
Hope this helps.
Dirk
-
I can't make it work. After removing 'fromkey' parameter i was able to crawl 1.7k and it stopped there. The site has more than 400k pages so .. something must be wrong
I want to crawl only the root domain without subdomains and all i can crawl is around 2k pages.
Do you have any idea what might be happening?
-
Great it worked. Just a small note - if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is also having these issues. You could consider to exclude the formkey parameter in the Search Console (Crawl > URL parameters)
DIrk
-
Dirk, thanks a lot.
I just added "formkey" to be removed as a parameter and it seems to be working. I crawled 1k pages until now and i'm going to monitor how it goes.
The site has more than 400k pages so the process to crawl them all will take time (and i'm going to have to crawl each sector so i can create sitemaps for them).
Thanks again
Gjergj -
In the menu 'url rewriting' you can simply put the parameters the site should ignore (like date, formkey,..). I removed the formkey parameter and I checked the pages of the archive in Screaming Frog.
It is clearly able to detect all the internal links on the page - so will crawl them.
How are you certain that the pages below are not crawled - could you give a specific example of page that should be crawled but isn't?
Dirk
-
I've tried changing settings to respect noindex & canonical .. it will stop crawling the archive pages but still it won't crawl the links inside those pages. (i've added NOINDEX, FOLLOW in all archive pagination pages)
What do you mean by rewriting the url to ignore the formkey? How do you think it should be.
Gjergji
-
It think Screaming Frog is going nuts on the formkey value in the url which is constantly changing when changing pages.
Could you modify the settings of the spider to respect noindex & respect canonical - looks like this is solving the issue.
Alternatively you could rewrite the url to ignore the formkey (remove parameter)
Dirk
-
Hi Logan
I've tried going back to default configuration but it didn't help .. still i don't believe Screaming Frog is to blame, i think there is something wrong with the way the site has been developed (they are using a custom CMS) .. but i can't find the reason why this is happening. As soon as i find the solution then i can ask the guys who developed this site to make the necessary changes.
Thanks a lot.
-
Hi Dirk
Thanks a lot for replying back. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside these pages.
The hierarchy of the site goes like this:
Homepage
- Categories (with about 20 articles in them)
- Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)Screaming Frog will crawl the homepage and categories ... but after it goes to the archive it won't crawl the articles inside archive, instead it will only crawl the pages (pagination) of that archive.
Thanks again.
-
Try going to File > Default Conif > Clear Default Configuration. This happens to me sometimes as well as I've edited settings over time. Clearing it out and going back to default settings is usually quicker than clicking through the settings to identify which one is causing the problem.
-
Did you put in some special filters - just tried to crawl the site & it seems to work just fine?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Website crawl error
Hi all, When I try to crawl a website, I got next error message: "java.lang.IllegalArgumentException: Illegal cookie name" For the moment, I found next explanation: The errors indicate that one of the web servers within the same cookie domain as the server is setting a cookie for your domain with the name "path", as well as another cookie with the name "domain" Does anyone has experience with this problem, knows what it means and knows how to solve it? Thanks in advance! Jens
Technical SEO | | WeAreDigital_BE0 -
Falling rankings - can't figure out why
I am still fairly green on in depth SEO, however I have a good grasp on making a site SEO friendly, as my skills are more down to website construction than technical SEO, however, I am working on a site at the moment which just continues to lose rankings and is slipping further and further. Keywords are dropping week on week in rankings Search visibility is also dropping week on week On site sales have fallen massively in the last quarter We have made huge improvements on the following; Moved the site to a faster stand alone cloud vps server - taken page rank scores from 54 to 87%. Added caching (WP Rocket) & CDN support. Improved URL structure (Woocommerce) removed /product and or /product-category from URLS to give more accurate & relevant structures. Added canonical URLs to all product categories (We use Yoast Premium) Amended on page structures to include correct H tags. Improved Facebook activity with a huge increase in engagements These are just some of the improvements we have made, yet we're still seeing huge drops in traffic and rankings. One insight I have noted which may be a big pointer, is we have 56 backlinks.... which I know is not good and we are about to address this. I suspect this is the reason for the poor performance, but should I be looking at anything else? Is there anything else we should be looking at? As I said, I'm no SEO specialist, but I don't think there's been any Penguin penalty, but my expertise is not sufficient enough to dig deeper. Can anyone offer any constructive advice at this stage? I'm thinking things to look at that could be hurting us that isn't immediately obvious? The site is www.glassesonspec.co.uk Thanks in advance Bob
Technical SEO | | SushiUK0 -
WebMaster Tools keeps showing old 404 error but doesn't show a "Linked From" url. Why is that?
Hello Moz Community. I have a question about 404 crawl errors in WebmasterTools, a while ago we had an internal linking problem regarding some links formed in a wrong way (a loop was making links on the fly), this error was identified and fixed back then but before it was fixed google got to index lots of those malformed pages. Recently we see in our WebMaster account that some of this links still appearing as 404 but we currently don't have that issue or any internal link pointing to any of those URLs and what confuses us even more is that WebMaster doesn't show anything in the "Linked From" tab where it usually does for this type of errors, so we are wondering what this means, could be that they still in google's cache or memory? we are not really sure. If anyone has an idea of what this errors showing up now means we would really appreciate the help. Thanks. jZVh7zt.png
Technical SEO | | revimedia1 -
Can't for the life of me figure out how this is possible !! Any ideas ?
I would imagine it's not all that easy to rank on 1st page ( not going for 1st position here ) for https://www.google.com.au/search?q=credi+cards. I am looking at the AU market. For some reason which I can't figure out Everyday Money Credit Card ( https://www.woolworthsmoney.com.au/ ) ranks number 4. The home page redirect to https://www.woolworthsmoney.com.au/wowm/wps/wcm/connect/wowmoney/wowmoney/home/home/ Why have your homepage in this format ? I would love to hear any theories you guys might have. It does not look like they have a strong link profile , I could not figure out how old the domain was or what other possible reason there is for the site to rank .
Technical SEO | | RuchirP0 -
Why aren't certain links showing in SEOMOZ?
Hi, I have been trying to understand our page rank and domains that are linking to us. When I look at the list of linking domains, I see some bigger ones are missing and I don't know why. For example, we are in the Yahoo Directory with a link to trophycentral.com, but SEOMOZ is not showing the link. If SEOMOZ is not seeing it, my guess is Google is not either, which concerns me. There are several onther high page rank domains also not showing. Anyone have any idea why? Thanks! BTW, our domain is trophycentral.com
Technical SEO | | trophycentraltrophiesandawards0 -
Can JavaScrip affect Google's index/ranking?
We have changed our website template about a month ago and since then we experienced a huge drop in rankings, especially with our home page. We kept the same url structure on entire website, pretty much the same content and the same on-page seo. We kind of knew we will have a rank drop but not that huge. We used to rank with the homepage on the top of the second page, and now we lost about 20-25 positions. What we changed is that we made a new homepage structure, more user-friendly and with much more organized information, we also have a slider presenting our main services. 80% of our content on the homepage is included inside the slideshow and 3 tabs, but all these elements are JavaScript. The content is unique and is seo optimized but when I am disabling the JavaScript, it becomes completely unavailable. Could this be the reason for the huge rank drop? I used the Webmaster Tolls' Fetch as Googlebot tool and it looks like Google reads perfectly what's inside the JavaScrip slideshow so I did not worried until now when I found this on SEOMoz: "Try to avoid ... using javascript ... since the search engines will ... not indexed them ... " One more weird thing is that although we have no duplicate content and the entire website has been cached, for a few pages (including the homepage), the picture snipet is from the old website. All main urls are the same, we removed some old ones that we don't need anymore, so we kept all the inbound links. The 301 redirects are properly set. But still, we have a huge rank drop. Also, (not sure if this important or not), the robots.txt file is disallowing some folders like: images, modules, templates... (Joomla components). We still have some html errors and warnings but way less than we had with the old website. Any advice would be much appreciated, thank you!
Technical SEO | | echo10 -
Just relaunched a website - why did it fall in Google's SERPs?
I work for a marketing agency that just redesigned, rewrote and relaunched a client's website. They used to rank #4 on Google for the company's name (which is a fairly common one, for what it's worth). Now they're at #10 and want to know why. I'd like to explain to them what happened but don't know myself. Can someone explain it to me? And can I tell them if/when their ranking might go back up? In case this matters, I can tell you that it looks like Google hasn't yet crawled the new site. Anyway, thanks in advance for any help you can provide.
Technical SEO | | matt-145670 -
What's the website that analyzes all local business submissions?
I was recently looking at a blog post here or a webinar and it showed a website where you could see all of the local sites (yelp, Google places) where your business has been submitted. It was an automated tool. Does anyone remember the name of the site?
Technical SEO | | beeneeb0