I can't crawl the archive of this website with Screaming Frog
-
Hi
I'm trying to crawl this website (http://zeri.info/) with Screaming Frog but because of some technical issue with their site (i can't find what is causing it) i'm able to crawl only the first page of each category (ex. http://zeri.info/sport/) and then it will go to crawl each page of their archive (hundreds of thousands of pages) but it won't crawl the links inside these pages.
Thanks a lot!
-
I think the issue comes from the way you handle the pagination and or the way your render archived pages.
Example: First archive page of Aktualehttp://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale
Clicking on page 2 adds the date
http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st till today.
If I check all the different section & the number of articles listed in each archive I get approx. 1200 pages - add some additional pages linked on these pages and you get to the 2K pages you mentioned.
There seems to be no possibility to reach the previously published content without executing a search - which Screaming Frog can't do. It's quite possible that this is causing issues for Google bot as well so I would try to fix this.
If you really want to crawl the full site in the mean time - add another rule in url rewriting - this time selecting 'regex replace' -
add regex: from=2016-06-01
replace regex from=2010-01-01 (replace by the earliest date of publishing)This way - the system will call url http://zeri.info/arkiva/?from=2010-06-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one - listing all the articles instead of only the june articles.
Hope this helps.
Dirk
-
I can't make it work. After removing 'fromkey' parameter i was able to crawl 1.7k and it stopped there. The site has more than 400k pages so .. something must be wrong
I want to crawl only the root domain without subdomains and all i can crawl is around 2k pages.
Do you have any idea what might be happening?
-
Great it worked. Just a small note - if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is also having these issues. You could consider to exclude the formkey parameter in the Search Console (Crawl > URL parameters)
DIrk
-
Dirk, thanks a lot.
I just added "formkey" to be removed as a parameter and it seems to be working. I crawled 1k pages until now and i'm going to monitor how it goes.
The site has more than 400k pages so the process to crawl them all will take time (and i'm going to have to crawl each sector so i can create sitemaps for them).
Thanks again
Gjergj -
In the menu 'url rewriting' you can simply put the parameters the site should ignore (like date, formkey,..). I removed the formkey parameter and I checked the pages of the archive in Screaming Frog.
It is clearly able to detect all the internal links on the page - so will crawl them.
How are you certain that the pages below are not crawled - could you give a specific example of page that should be crawled but isn't?
Dirk
-
I've tried changing settings to respect noindex & canonical .. it will stop crawling the archive pages but still it won't crawl the links inside those pages. (i've added NOINDEX, FOLLOW in all archive pagination pages)
What do you mean by rewriting the url to ignore the formkey? How do you think it should be.
Gjergji
-
It think Screaming Frog is going nuts on the formkey value in the url which is constantly changing when changing pages.
Could you modify the settings of the spider to respect noindex & respect canonical - looks like this is solving the issue.
Alternatively you could rewrite the url to ignore the formkey (remove parameter)
Dirk
-
Hi Logan
I've tried going back to default configuration but it didn't help .. still i don't believe Screaming Frog is to blame, i think there is something wrong with the way the site has been developed (they are using a custom CMS) .. but i can't find the reason why this is happening. As soon as i find the solution then i can ask the guys who developed this site to make the necessary changes.
Thanks a lot.
-
Hi Dirk
Thanks a lot for replying back. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside these pages.
The hierarchy of the site goes like this:
Homepage
- Categories (with about 20 articles in them)
- Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)Screaming Frog will crawl the homepage and categories ... but after it goes to the archive it won't crawl the articles inside archive, instead it will only crawl the pages (pagination) of that archive.
Thanks again.
-
Try going to File > Default Conif > Clear Default Configuration. This happens to me sometimes as well as I've edited settings over time. Clearing it out and going back to default settings is usually quicker than clicking through the settings to identify which one is causing the problem.
-
Did you put in some special filters - just tried to crawl the site & it seems to work just fine?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Can we re rank our Penalyzed website in Google?
Hello This is Maqbul, from India. I have a jobs portal blog [ bharatrecruit.com]. It was getting around 50K to 100K Views a Day and made me $100 a day. But after a few months, my competitor made negative SEO with 12,000 Spammy backlinks. Suddenly my site was hit by Google and now it is getting 200 to 300 Pageviews a day. So the question is I did not disavow bad links for a long time like 3 to 4 months. Now I disavow all the bad links but the website is not ranking. Can we re-rank this site or create another website. Please reply must. None of the bloggers can answer this. Thanks, Regards Maqbul
Technical SEO | | vinaso960 -
I have made my new website live. But while checking in Google it is not showing in search result ( site: www.oomfr.com ). Can anybody please advice.
Hi Team, I have made my new website live. But while checking in Google it is not showing in search result ( site: www.oomfr.com ). Can anybody please advice.
Technical SEO | | nlogix0 -
Getting high priority issue for our xxx.com and xxx.com/home as duplicate pages and duplicate page titles can't seem to find anything that needs to be corrected, what might I be missing?
I am getting high priority issue for our xxx.com and xxx.com/home as reporting both duplicate pages and duplicate page titles on crawl results, I can't seem to find anything that needs to be corrected, what am I be missing? Has anyone else had a similar issue, how was it corrected?
Technical SEO | | tgwebmaster0 -
Can you regain any SERPs / link juice of links that have 404'd?
We have a client whose 301 redirects disappeared and have been gone for about 6 months now. We are going to be putting the 301 redirects back in place. Will we be able to regain any of the previous SERPs or link juice from old links or is all lost? Thanks in advance!
Technical SEO | | SavvyPanda0 -
New Website, New URL, New Content - What do we do with the old site? Are 301's the only option?
We've just built a new site for a client. They were adamant on changing the url. The new site is entirely new content, however the subject mater is the same. Some pages are even titled very similarly. Is is advisable to keep the old site running, and link it to the new site? Permanently, or temporarily? Do we simply place redirects from the old site the new? Old site was 30 pages, new site is 80 pages. So redirects won't be available to all the new pages. It seems a shame to trash the old site, it is getting some good traffic, and the content - although outdated is unique and of a high quality. Old url is 4+ yrs old, the new url is new. Some enlightened opinions would be greatly welcomed. Thanks
Technical SEO | | MarketsOnline0 -
WMT - Googlebot can't access your site
Hi On our new website which is just a few weeks old upon logging into Webmaster tools I am getting the following message Googlebot can't access your site - The overall error rate for DNS queries is 50% What do I need to do to resolve this, I have never had this problem before with any of the sites - where the domains are with Fasthosts (UK) and hosting is with Dreamhosts. What is the recommended course of action Google mention contacting your host in my case Dreamhost - but what do you need to ask them in a support ticket. When doing a fetch in WMT the fetch status is a success?
Technical SEO | | ocelot0 -
Google is somehow linking my two sites that aren't linked! HELP
Good Morning... In my Google webmaster account it is showing an increase of backlinks between one site i own to the other.... This should not happen, as there are no links from one site to the other. I have thoroughly checked many pages on the new site to see if i can find a backlink, but i can't. Does anyone know why this is showing like this (google now shows 50,000 links from one site to the other).. Can someone please take a look and see if you can find any link from one to the other... original site : http://goo.gl/JgK1e new site : http://goo.gl/Jb4ng Please let me know why you guys think this is happening or if you were actually able to find a link on the new site pointing back to the old site... thanks a lot
Technical SEO | | Prime850 -
Crawl Tool Producing Random URL's
For some reason SEOmoz's crawl tool is returning duplicate content URL's that don't exist on my website. It is returning pages like "mydomain.com/pages/pages/pages/pages/pages/pricing" Nothing like that exists as a URL on my website. Has anyone experienced something similar to this, know what's causing it, or know how I can fix it?
Technical SEO | | MyNet0