I can't crawl the archive of this website with Screaming Frog
-
Hi
I'm trying to crawl this website (http://zeri.info/) with Screaming Frog but because of some technical issue with their site (i can't find what is causing it) i'm able to crawl only the first page of each category (ex. http://zeri.info/sport/) and then it will go to crawl each page of their archive (hundreds of thousands of pages) but it won't crawl the links inside these pages.
Thanks a lot!
-
I think the issue comes from the way you handle the pagination and or the way your render archived pages.
Example: First archive page of Aktualehttp://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale
Clicking on page 2 adds the date
http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st till today.
If I check all the different section & the number of articles listed in each archive I get approx. 1200 pages - add some additional pages linked on these pages and you get to the 2K pages you mentioned.
There seems to be no possibility to reach the previously published content without executing a search - which Screaming Frog can't do. It's quite possible that this is causing issues for Google bot as well so I would try to fix this.
If you really want to crawl the full site in the mean time - add another rule in url rewriting - this time selecting 'regex replace' -
add regex: from=2016-06-01
replace regex from=2010-01-01 (replace by the earliest date of publishing)This way - the system will call url http://zeri.info/arkiva/?from=2010-06-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one - listing all the articles instead of only the june articles.
Hope this helps.
Dirk
-
I can't make it work. After removing 'fromkey' parameter i was able to crawl 1.7k and it stopped there. The site has more than 400k pages so .. something must be wrong
I want to crawl only the root domain without subdomains and all i can crawl is around 2k pages.
Do you have any idea what might be happening?
-
Great it worked. Just a small note - if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is also having these issues. You could consider to exclude the formkey parameter in the Search Console (Crawl > URL parameters)
DIrk
-
Dirk, thanks a lot.
I just added "formkey" to be removed as a parameter and it seems to be working. I crawled 1k pages until now and i'm going to monitor how it goes.
The site has more than 400k pages so the process to crawl them all will take time (and i'm going to have to crawl each sector so i can create sitemaps for them).
Thanks again
Gjergj -
In the menu 'url rewriting' you can simply put the parameters the site should ignore (like date, formkey,..). I removed the formkey parameter and I checked the pages of the archive in Screaming Frog.
It is clearly able to detect all the internal links on the page - so will crawl them.
How are you certain that the pages below are not crawled - could you give a specific example of page that should be crawled but isn't?
Dirk
-
I've tried changing settings to respect noindex & canonical .. it will stop crawling the archive pages but still it won't crawl the links inside those pages. (i've added NOINDEX, FOLLOW in all archive pagination pages)
What do you mean by rewriting the url to ignore the formkey? How do you think it should be.
Gjergji
-
It think Screaming Frog is going nuts on the formkey value in the url which is constantly changing when changing pages.
Could you modify the settings of the spider to respect noindex & respect canonical - looks like this is solving the issue.
Alternatively you could rewrite the url to ignore the formkey (remove parameter)
Dirk
-
Hi Logan
I've tried going back to default configuration but it didn't help .. still i don't believe Screaming Frog is to blame, i think there is something wrong with the way the site has been developed (they are using a custom CMS) .. but i can't find the reason why this is happening. As soon as i find the solution then i can ask the guys who developed this site to make the necessary changes.
Thanks a lot.
-
Hi Dirk
Thanks a lot for replying back. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside these pages.
The hierarchy of the site goes like this:
Homepage
- Categories (with about 20 articles in them)
- Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)Screaming Frog will crawl the homepage and categories ... but after it goes to the archive it won't crawl the articles inside archive, instead it will only crawl the pages (pagination) of that archive.
Thanks again.
-
Try going to File > Default Conif > Clear Default Configuration. This happens to me sometimes as well as I've edited settings over time. Clearing it out and going back to default settings is usually quicker than clicking through the settings to identify which one is causing the problem.
-
Did you put in some special filters - just tried to crawl the site & it seems to work just fine?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Can bad html code hurt your website from ranking ?
Hello,For example if I search for “ Bike Tours in France” I am looking for a page with a list of tours in France.Does it mean that if my html doesn’t have list * in the code but only that apparently doesn’t have any semantic meaning for a search engine my page won’t rank because of that ?Example on this page : https://bit.ly/2C6hGUn According to W3schools: "A semantic element clearly describes its meaning to both the browser and the developer. Examples of non-semantic elements: <div> and - Tells nothing about its content. Examples of semanticelements: <form>, , and- Clearly defines its content."Has anyone any experience with something similar ?Thank you, </form>
Technical SEO | | seoanalytics0 -
My Homepage Won't Load if Javascript is Disabled. Is this an SEO/Indexation issue?
Hi everyone, I'm working with a client who recently had their site redesigned. I'm just going through to do an initial audit to make sure everything looks good. Part of my initial indexation audit goes through questions about how the site functions when you disable, javascript, cookies, and/or css. I use the Web Developer extension for Chrome to do this. I know, more recently, people have said that content loaded by Javascript will be indexed. I just want to make sure it's not hurting my clients SEO. http://americasinstantsigns.com/ Is it as simple as looking at Google's Cached URL? The URL is definitely being indexed and when looking at the text-only version everything appears to be in order. This may be an outdated question, but I just want to be sure! Thank you so much!
Technical SEO | | ccox10 -
E-commerce website
Hi, I have to do a SEO optimization on a huge e-commerce website, but i don´t know if i have to focus in Schema or meta tags ( page title, meta description) which of them is more important? how can I optimize the website? Thanks
Technical SEO | | AbacoDigital1 -
404 error - but I can't find any broken links on the referrer pages
Hi, My crawl has diagnosed a client's site with eight 404 errors. In my CSV download of the crawl, I have checked the source code of the 'referrer' pages, but can't find where the link to the 404 error page is. Could there be another reason for getting 404 errors? Thanks for your help. Katharine.
Technical SEO | | PooleyK0 -
We registered with Yahoo Directory. Why won't this show up as a a linking root domain in our link analysis??
Recently checked our link analysis report for 2 of our campaigns who are registered in the dir.yahoo.com (yahoo directory). For some reason, we don't see this being a domain that shows up as linking to our website - why is this?
Technical SEO | | MMP0 -
The 'On Page' section of SEOMOZ
How does SEOMOZ choose a keyword for a page, for example it has ranked one of my pages for a search term which does not really appear on that page and then given it an F - how do I change the key word association? Secondly, when I first started using SEOMOZ I could change the page and then click the button 'Grade my on-page optimization' and it would show an immediate update - does anyone know why this has been stopped, as it is very useful to know you have got the page right away to an A for example.
Technical SEO | | bowravenseo0 -
Can I format my H1 to be smaller than H2's and H3's on the same page?
I would like to create a web design with 12px H1 and for sub headings on the page to be more like 24px. Will search engines see this and dislike it? The reason for doing it is that I want to put a generic page title in the banner, and more poetic headings above the main body. Example: Small H1: Wholesale coffee, online coffee shop and London roastery Large h2: Respect the bean... Thanks
Technical SEO | | Crumpled_Dog
Scott0 -
Site 'filtered' by Google in early July.... and still filtered!
Hi, Our site got demoted by Google all of a sudden back in early July. You can view the site here: http://alturl.com/4pfrj and you may read the discussions I posted in Google's forums here: http://www.google.com/support/forum/p/Webmasters/thread?tid=6e8f9aab7e384d88&hl=en http://www.google.com/support/forum/p/Webmasters/thread?tid=276dc6687317641b&hl=en Those discussions chronicle what happened, and what we've done since. I don't want to make this a long post by retyping it all here, hence the links. However, we've made various changes (as detailed), such as getting rid of duplicate content (use of noindex on various pages etc), and ensuring there is no hidden text (we made an unintentional blunder there through use of a 3rd party control which used CSS hidden text to store certain data). We have also filed reconsideration requests with Google and been told that no manual penalty has been applied. So the problem is down to algorithmic filters which are being applied. So... my reason for posting here is simply to see if anyone here can help us discover if there is anything we have missed? I'd hope that we've addressed the main issues and that eventually our Google ranking will recover (ie. filter removed.... it isn't that we 'rank' poorly, but that a filter is bumping us down, to, for example, page 50).... but after three months it sure is taking a while! It appears that a 30 day penalty was originally applied, as our ranking recovered in early August. But a few days later it dived down again (so presumably Google analysed the site again, found a problem and applied another penalty/filter). I'd hope that might have been 30 or 60 days, but 60 days have now passed.... so perhaps we have a 90 day penalty now. OR.... perhaps there is no time frame this time, simply the need to 'fix' whatever is constantly triggering the filter (that said, I 'feel' like a time frame is there, especially given what happened after 30 days). Of course the other aspect that can always be worked on (and oft-mentioned) is the need for more and more original content. However, we've done a lot to increase this and think our Guide pages are pretty useful now. I've looked at many competitive sites which list in Google and they really don't offer anything more than we do..... so if that is the issue it sure is puzzling if we're filtered and they aren't. Anyway, I'm getting wordy now, so I'll pause. I'm just asking if anyone would like to have a quick look at the site and see what they can deduce? We have of course run it through SEOMoz's tools and made use of the suggestions. Our target pages generally rate as an A for SEO in the reports. Thanks!
Technical SEO | | Go2Holidays0