I can't crawl the archive of this website with Screaming Frog
-
Hi
I'm trying to crawl this website (http://zeri.info/) with Screaming Frog, but because of some technical issue with the site (I can't find what is causing it) I'm only able to crawl the first page of each category (e.g. http://zeri.info/sport/). After that, it goes on to crawl each page of the archive (hundreds of thousands of pages), but it won't crawl the links inside those pages.
Thanks a lot!
-
I think the issue comes from the way you handle the pagination and/or the way you render archived pages.
Example: first archive page of Aktuale: http://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale
Clicking on page 2 adds the date:
http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st until today.
If I check all the different sections and the number of articles listed in each archive, I get approx. 1200 pages. Add some additional pages linked from these pages and you get to the 2K pages you mentioned.
There seems to be no way to reach the previously published content without executing a search, which Screaming Frog can't do. It's quite possible that this is causing issues for Googlebot as well, so I would try to fix this.
If you really want to crawl the full site in the meantime, add another rule in URL rewriting, this time selecting 'regex replace':
add regex: from=2016-06-01
replace regex: from=2010-01-01 (replace with the earliest date of publishing)
This way, the system will call the URL http://zeri.info/arkiva/?from=2010-01-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one, listing all the articles instead of only the June articles.
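To illustrate what a rule like this does to an archive URL, here is a minimal Python sketch. It is not how Screaming Frog implements the rule (the tool applies it internally); the function name `widen_archive_range` and the constant `EARLIEST_DATE` are made up for this example, and 2010-01-01 is only the placeholder date used in this post. Unlike a literal rule for one date, the pattern here matches any from=YYYY-MM-DD value.

```python
import re

# Assumption: placeholder for the site's real earliest publishing date.
EARLIEST_DATE = "2010-01-01"


def widen_archive_range(url: str, earliest: str = EARLIEST_DATE) -> str:
    """Return `url` with the value of its `from` query parameter set to `earliest`."""
    # Only match `from=` at the start of a query parameter (right after ? or &),
    # so a parameter whose name merely ends in "from" is left alone.
    return re.sub(r"(?<=[?&])from=\d{4}-\d{2}-\d{2}", f"from={earliest}", url)


original = (
    "http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16"
    "&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2"
)
print(widen_archive_range(original))
```

Running the pattern against a handful of real archive URLs before starting a long crawl is a cheap way to confirm that the rule only touches the `from` date.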
Hope this helps.
Dirk
-
I can't make it work. After removing the 'formkey' parameter I was able to crawl 1.7k pages and then it stopped. The site has more than 400k pages, so something must be wrong.
I want to crawl only the root domain without subdomains, and all I can crawl is around 2k pages.
Do you have any idea what might be happening?
-
Great, it worked. Just a small note: if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is also having these issues. You could consider excluding the formkey parameter in Search Console (Crawl > URL parameters).
Dirk
-
Dirk, thanks a lot.
I just added "formkey" as a parameter to be removed and it seems to be working. I've crawled 1k pages so far and I'm going to monitor how it goes.
The site has more than 400k pages, so crawling them all will take time (and I'm going to have to crawl each section so I can create sitemaps for them).
Thanks again
Gjergj
-
In the 'URL rewriting' menu you can simply list the parameters the crawler should ignore (like date, formkey, ...). I removed the formkey parameter and checked the archive pages in Screaming Frog.
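As a rough illustration of what removing a parameter means for the crawler, the Python sketch below strips `formkey` from a URL so that two archive URLs differing only in that value collapse into one. This is only a sketch of the idea, not Screaming Frog's code; `strip_params` and `IGNORED_PARAMS` are hypothetical names, and the second URL is a made-up variant of the first with a different formkey.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the parameters the crawler should ignore; here only formkey.
IGNORED_PARAMS = {"formkey"}


def strip_params(url: str, ignored=IGNORED_PARAMS) -> str:
    """Return `url` without the query parameters named in `ignored`."""
    parts = urlsplit(url)
    # Keep every (name, value) pair whose name is not ignored, preserving order.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


first = "http://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale"
second = "http://zeri.info/arkiva/?formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&acid=aktuale"

# Both URLs normalize to the same address once formkey is dropped.
print(strip_params(first))
print(strip_params(second))
```

With the changing formkey gone, the crawler no longer treats every page view as a brand-new URL.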
It is clearly able to detect all the internal links on the page, so it will crawl them.
How are you certain that the pages below are not crawled? Could you give a specific example of a page that should be crawled but isn't?
Dirk
-
I've tried changing the settings to respect noindex & canonical. It stops crawling the archive pages, but it still won't crawl the links inside those pages. (I've added NOINDEX, FOLLOW to all archive pagination pages.)
What do you mean by rewriting the URL to ignore the formkey? How do you think it should be done?
Gjergji
-
I think Screaming Frog is going nuts over the formkey value in the URL, which changes every time you move to another page.
Could you modify the spider settings to respect noindex & respect canonical? It looks like this solves the issue.
Alternatively, you could rewrite the URL to ignore the formkey (remove parameter).
Dirk
-
Hi Logan
I've tried going back to the default configuration but it didn't help. Still, I don't believe Screaming Frog is to blame; I think there is something wrong with the way the site has been developed (they are using a custom CMS), but I can't find the reason why this is happening. As soon as I find the cause, I can ask the guys who developed the site to make the necessary changes.
Thanks a lot.
-
Hi Dirk
Thanks a lot for replying back. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside these pages.
The hierarchy of the site goes like this:
Homepage
- Categories (with about 20 articles in them)
- Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)
Screaming Frog will crawl the homepage and the categories, but once it reaches the archive it won't crawl the articles inside it; instead it only crawls the pagination pages of that archive.
Thanks again.
-
Try going to File > Default Config > Clear Default Configuration. This happens to me sometimes as well, since I've edited settings over time. Clearing it out and going back to default settings is usually quicker than clicking through the settings to identify which one is causing the problem.
-
Did you put in some special filters? I just tried to crawl the site and it seems to work just fine.