I can't crawl the archive of this website with Screaming Frog
-
Hi
I'm trying to crawl this website (http://zeri.info/) with Screaming Frog but because of some technical issue with their site (i can't find what is causing it) i'm able to crawl only the first page of each category (ex. http://zeri.info/sport/) and then it will go to crawl each page of their archive (hundreds of thousands of pages) but it won't crawl the links inside these pages.
Thanks a lot!
-
I think the issue comes from the way you handle the pagination and or the way your render archived pages.
Example: First archive page of Aktualehttp://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale
Clicking on page 2 adds the date
http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st till today.
If I check all the different section & the number of articles listed in each archive I get approx. 1200 pages - add some additional pages linked on these pages and you get to the 2K pages you mentioned.
There seems to be no possibility to reach the previously published content without executing a search - which Screaming Frog can't do. It's quite possible that this is causing issues for Google bot as well so I would try to fix this.
If you really want to crawl the full site in the mean time - add another rule in url rewriting - this time selecting 'regex replace' -
add regex: from=2016-06-01
replace regex from=2010-01-01 (replace by the earliest date of publishing)This way - the system will call url http://zeri.info/arkiva/?from=2010-06-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one - listing all the articles instead of only the june articles.
Hope this helps.
Dirk
-
I can't make it work. After removing 'fromkey' parameter i was able to crawl 1.7k and it stopped there. The site has more than 400k pages so .. something must be wrong
I want to crawl only the root domain without subdomains and all i can crawl is around 2k pages.
Do you have any idea what might be happening?
-
Great it worked. Just a small note - if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is also having these issues. You could consider to exclude the formkey parameter in the Search Console (Crawl > URL parameters)
DIrk
-
Dirk, thanks a lot.
I just added "formkey" to be removed as a parameter and it seems to be working. I crawled 1k pages until now and i'm going to monitor how it goes.
The site has more than 400k pages so the process to crawl them all will take time (and i'm going to have to crawl each sector so i can create sitemaps for them).
Thanks again
Gjergj -
In the menu 'url rewriting' you can simply put the parameters the site should ignore (like date, formkey,..). I removed the formkey parameter and I checked the pages of the archive in Screaming Frog.
It is clearly able to detect all the internal links on the page - so will crawl them.
How are you certain that the pages below are not crawled - could you give a specific example of page that should be crawled but isn't?
Dirk
-
I've tried changing settings to respect noindex & canonical .. it will stop crawling the archive pages but still it won't crawl the links inside those pages. (i've added NOINDEX, FOLLOW in all archive pagination pages)
What do you mean by rewriting the url to ignore the formkey? How do you think it should be.
Gjergji
-
It think Screaming Frog is going nuts on the formkey value in the url which is constantly changing when changing pages.
Could you modify the settings of the spider to respect noindex & respect canonical - looks like this is solving the issue.
Alternatively you could rewrite the url to ignore the formkey (remove parameter)
Dirk
-
Hi Logan
I've tried going back to default configuration but it didn't help .. still i don't believe Screaming Frog is to blame, i think there is something wrong with the way the site has been developed (they are using a custom CMS) .. but i can't find the reason why this is happening. As soon as i find the solution then i can ask the guys who developed this site to make the necessary changes.
Thanks a lot.
-
Hi Dirk
Thanks a lot for replying back. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside these pages.
The hierarchy of the site goes like this:
Homepage
- Categories (with about 20 articles in them)
- Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)Screaming Frog will crawl the homepage and categories ... but after it goes to the archive it won't crawl the articles inside archive, instead it will only crawl the pages (pagination) of that archive.
Thanks again.
-
Try going to File > Default Conif > Clear Default Configuration. This happens to me sometimes as well as I've edited settings over time. Clearing it out and going back to default settings is usually quicker than clicking through the settings to identify which one is causing the problem.
-
Did you put in some special filters - just tried to crawl the site & it seems to work just fine?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Another client copies everything to blogspot. Is that what keeps her site from ranking? Or what? Appears to be a penalty somewhere but can't find it.
This client has a brand new site: http://www.susannoyesandersonpoems.com Her previous site was really bad for SEO, yet at one time she actually ranked on the first page for "LDS poems." She came to me because she lost rank. I checked things out and found some shoddy SEO work by a very popular Wordpress webhoste that I will leave unnamed. If you do a backlink analysis you can see the articles and backlinks they created. But there are so few, so I'm not sure if that was it, or it just was because of the fact that her site was so poorly optimized and Google made a change, and down she fell. Here's the only page she had on the LDS poems topic in her old site: https://web.archive.org/web/20130820161529/http://susannoyesandersonpoems.com/category/lds-poetry/ Even the links in the nav were bad as they were all images. And that ranked in position 2 I think she said. Even with her new site, she continues to decline. In fact she is nowhere to be found for main keywords making me think there is a penalty. To try and build rank for categories, I'm allowing google to index the category landing pages and had her write category descriptions that included keywords. We are also listing the categories on the left and linking to those category pages. Maybe those pages are watered down by the poem excerpts?? Here's an example of a page we want to rank: http://susannoyesandersonpoems.com/category/lds-poetry/ Any help from the peanut gallery?
Technical SEO | | katandmouse0 -
How can I index several systems used for my website?
My site is built on PHP, but has a help.website.com page based on a helpdesk platform. I also have a wordpress blog. So, these are three "different systems" under the same domain. When I crawl my site, neither the blog nor the help page show up. How can I make them show up? Thanks!
Technical SEO | | rodelmo880 -
Ecommerce website: Product page setup & SKU's
I manage an E-commerce website and we are looking to make some changes to our product pages to try and optimise them for search purposes and to try and improve the customer buying experience. This is where my head starts to hurt! Now, let's say I am selling a T shirt that comes in 4 sizes and 6 different colours. At the moment my website would have 24 products, each with pretty much the same content (maybe differing references to the colour & size). My idea is to change this and have 1 main product page for the T-shirt, but to have 24 product SKU's/variations that exist to give the exact product details. Some different ways I have been considering to do this: a) have drop-down fields on the product page that ask the customer to select their Tshirt size and colour. The image & price then changes on the page. b) All product 24 product SKUs sre listed under the main product with the 'Add to Cart' open next to each one. Each one would be clickable so a page it its own right. Would I need to set up a canonical links for each SKU that point to the top level product page? I'm obviously looking to minimise duplicate content but Im not exactly sure on how to set this up - its a big decision so I need to be 100% clear before signing off on anything. . Any other tips on how to do this or examples of good e-commerce websites that use product SKus well? Kind regards Tom
Technical SEO | | DHS_SH0 -
Error msg 'Duplicate Page Content', how to fix?
Hey guys, I'm new to SEO and have the following error msg 'Duplicate Page Content'. Of course I know what it means, but my question is how do you delete the old pages that has duplicate content? I use to run my website through Joomla! but have since moved to Shopify. I see that the duplicated site content is still from the old Joomla! site and I would like to learn how to delete this content (or best practice in this situation). Any advice would be very helpful! Cheers, Peter
Technical SEO | | pjuszczynski0 -
What can we do to improve our site
Hi. I am hoping that some of you can help me with the in2town site www.in2town.co.uk The site is a news/lifestyle magazine site. The site is a cross between, huffington post, digital spy, female first and the sun newspaper. Basically the site is a news site as well as covering showbiz news, travel news, health news and advice etc What i would like is for people to look at the site and let me know what they feel i should do to improve the site to make it better for our readers and to gain more readership. I would also like to hear from people on how they find moving around the site as well as the speed of the site. At the moment the site is with an american hosting company and i am in the process of talking to UK hosting companies to move the site. The site is currently on a dedicated server. It would mean a lot if people could give me their advice on how to improve the site and make it a beter experience for our readers while at the same time being able to generate income with the site. Just a quick note, all content is original and we have a number of people who write for the site. many thanks
Technical SEO | | ClaireH-1848860 -
I am getting an error message from Google Webmaster Tools and I don't know what to do to correct the problem
The message is:
Technical SEO | | whitegyr
"Dear site owner or webmaster of http://www.whitegyr.com/, We've detected that some of your site's pages may be using techniques that are outside Google's Webmaster Guidelines. If you have any questions about how to resolve this issue, please see our Webmaster Help Forum for support. Sincerely, Google Search Quality Team" I have always tried to follow Google's guidelines and don't know what I am doing wrong, I have eight different websites all getting this warning and I don't know what is wrong, is there anyone you know that will look at my sites and advise me what I need to do to correct the problem? Website with this warning:
artistalaska.com
cosmeticshandbook.com
homewindpower.ws
montanalandsale.com
outdoorpizzaoven.net
shoes-place.com
silverstatepost.com
www.whitegyr.com0 -
Are these 'not found' errors a concern?
Our webmaster report is showing thousands of 'not found' errors for links that show up in javascript code. Is this something we should be concerned about? Especially since there are so many?
Technical SEO | | nicole.healthline0 -
Destination URL in SERPs keeps changing and I can't work out why.. Help.
I am befuddled as to why our destination URL in SERPs keeps changing oak furniture was nicely returning http://www.thefurnituremarket.co.uk/oakfurniture.asp then I changed something yesterday I did 2 things. published a link to that on facebook as part of a competition. redirected dynamic pages to the static URL for oak furniture.. Now for oak furniture the SERPs in GG UK is returning our home page as the most relevant landing page.. Any Idea why? I'm leaning to an onpage issue than posting on FB.. Thoughts?
Technical SEO | | robertrRSwalters0