I can't crawl the archive of this website with Screaming Frog
-
Hi
I'm trying to crawl this website (http://zeri.info/) with Screaming Frog, but because of some technical issue with their site (I can't find what is causing it) I'm only able to crawl the first page of each category (e.g. http://zeri.info/sport/). After that it goes on to crawl each page of their archive (hundreds of thousands of pages), but it won't crawl the links inside those pages.
Thanks a lot!
-
I think the issue comes from the way you handle the pagination and/or the way you render archived pages.
Example: the first archive page of Aktuale: http://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale
Clicking on page 2 adds date parameters:
http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st until today.
If I check all the different sections & the number of articles listed in each archive, I get approx. 1,200 pages - add some additional pages linked on these pages and you get to the 2K pages you mentioned.
There seems to be no way to reach the previously published content without executing a search - which Screaming Frog can't do. It's quite possible that this is causing issues for Googlebot as well, so I would try to fix this.
If you really want to crawl the full site in the meantime, add another rule in URL Rewriting, this time selecting 'Regex Replace':
add regex: from=2016-06-01
replace regex: from=2010-01-01 (replace with the earliest date of publishing)
This way, the system will call the URL http://zeri.info/arkiva/?from=2010-01-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one, listing all the articles instead of only the June articles.
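To make that concrete, here is a minimal Python sketch of the substitution the 'Regex Replace' rule performs - this is an illustration of the rewrite, not Screaming Frog's own code, and the replacement date is just the example from above:

```python
import re

# The substitution the 'Regex Replace' rule applies to every URL it sees.
# Swap 2010-01-01 for the site's actual earliest publishing date.
PATTERN = r"from=2016-06-01"
REPLACEMENT = "from=2010-01-01"

def widen_archive_window(url: str) -> str:
    """Rewrite the archive date window so every article gets listed."""
    return re.sub(PATTERN, REPLACEMENT, url)

original = ("http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16"
            "&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2")
print(widen_archive_window(original))
# -> http://zeri.info/arkiva/?from=2010-01-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2
```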
Hope this helps.
Dirk
-
I can't make it work. After removing the 'formkey' parameter I was able to crawl 1.7k pages and it stopped there. The site has more than 400k pages, so something must be wrong.
I want to crawl only the root domain without subdomains, and all I can crawl is around 2k pages.
Do you have any idea what might be happening?
-
Great, it worked! Just a small note: if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is also having these issues. You could consider excluding the formkey parameter in Search Console (Crawl > URL Parameters).
Dirk
-
Dirk, thanks a lot.
I just added "formkey" to be removed as a parameter and it seems to be working. I've crawled 1k pages so far and I'm going to monitor how it goes.
The site has more than 400k pages, so crawling them all will take time (and I'm going to have to crawl each section so I can create sitemaps for them).
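For reference, generating one sitemap per section from an exported URL list is straightforward - a minimal Python sketch (the URLs and filename are placeholders; a sitemap file is capped at 50,000 URLs, so a 400k-page site will need several files plus a sitemap index):

```python
from xml.sax.saxutils import escape

def write_sitemap(urls, path):
    """Write a minimal XML sitemap for one section's crawled URLs."""
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        f.write("</urlset>\n")

# Hypothetical usage: URLs exported from the Screaming Frog crawl of one section.
sport_urls = ["http://zeri.info/sport/", "http://zeri.info/sport/some-article/"]
write_sitemap(sport_urls, "sitemap-sport.xml")
```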
Thanks again
Gjergj
-
In the URL Rewriting menu you can simply put the parameters the site should ignore (like date, formkey, ...). I removed the formkey parameter and checked the archive pages in Screaming Frog.
It is clearly able to detect all the internal links on the page - so it will crawl them.
How are you certain that the pages below are not crawled? Could you give a specific example of a page that should be crawled but isn't?
Dirk
-
I've tried changing the settings to respect noindex & canonical. It stops crawling the archive pages, but it still won't crawl the links inside those pages. (I've added NOINDEX, FOLLOW to all archive pagination pages.)
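As a sanity check, here's a quick way to confirm what the robots meta tag on a given archive page actually says - a minimal Python sketch using only the standard library; the archive URL is a placeholder, and the regex assumes the name attribute appears before content:

```python
import re
import urllib.request

def robots_meta(url: str) -> str:
    """Fetch a page and return the content of its robots meta tag, if any."""
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    # Assumes <meta name="robots" ... content="..."> attribute order.
    match = re.search(
        r'<meta[^>]*name=["\']robots["\'][^>]*content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    return match.group(1) if match else "(no robots meta tag found)"

# Placeholder archive URL; expect something like "NOINDEX, FOLLOW".
print(robots_meta("http://zeri.info/arkiva/?acid=aktuale"))
```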
What do you mean by rewriting the URL to ignore the formkey? How do you think it should be done?
Gjergji
-
I think Screaming Frog is going nuts on the formkey value in the URL, which changes constantly as you move between pages.
Could you modify the settings of the spider to respect noindex & respect canonical? It looks like this solves the issue.
Alternatively, you could rewrite the URL to ignore the formkey (remove the parameter).
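To make the 'remove parameter' idea concrete, here is a minimal Python sketch of the normalization - stripping formkey so that URLs differing only in that token collapse into one. This mimics what the URL Rewriting option does; it is not Screaming Frog's own code:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def strip_formkey(url: str) -> str:
    """Drop the ever-changing formkey parameter so URLs pointing to the
    same archive page are treated as one URL instead of many."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "formkey"]
    return urlunparse(parts._replace(query=urlencode(query)))

url = ("http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16"
       "&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2")
print(strip_formkey(url))
# -> http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&faqe=2
```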
Dirk
-
Hi Logan
I've tried going back to the default configuration but it didn't help. Still, I don't believe Screaming Frog is to blame; I think there is something wrong with the way the site has been developed (they are using a custom CMS), but I can't find the reason why this is happening. As soon as I find the solution, I can ask the guys who developed this site to make the necessary changes.
Thanks a lot.
-
Hi Dirk
Thanks a lot for replying. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside those pages.
The hierarchy of the site goes like this:
Homepage
- Categories (with about 20 articles in them)
- Archive of that category (with all the remaining articles, which in this case means thousands, since they are a news website)
Screaming Frog will crawl the homepage and the categories, but once it gets to the archive it won't crawl the articles inside the archive; instead it will only crawl the pages (pagination) of that archive.
Thanks again.
-
Try going to File > Default Config > Clear Default Configuration. This happens to me sometimes as well, since I've edited settings over time. Clearing it out and going back to default settings is usually quicker than clicking through the settings to identify which one is causing the problem.
-
Did you put in some special filters? I just tried to crawl the site and it seems to work just fine.