Google News not indexing .index.html pages

H-FARM

Hi all,

We've been asked by a blog to help improve its indexing and ranking in Google News (the site is already included in Google News, but with poor results).

The blog had a chronic URL duplication problem, with each post existing at 3 different URLs:

#1) www.domain.com/post.html (currently set to noindex by editorial choice, as it shows all the comments)

#2) www.domain.com/post/index.html (currently indexed showing only top comments)

#3) www.domain.com/post/ (very same as #2)

We've chosen URL #2 (/index.html) as the canonical URL, and added a rel=canonical tag on URL #3 (/) pointing to URL #2.
Yesterday we also submitted a Google News sitemap that consistently lists the #2 URLs from the last 48h. The sitemap has been properly "digested" by Google and shows that all URLs have been sent and indexed.
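For anyone reproducing this setup, here is a minimal sketch of how such a sitemap could be generated, assuming Python and the standard Google News sitemap namespace. The function name `build_news_sitemap`, the article dictionaries, and the publication name are hypothetical and only for illustration; this is not the blog's actual tooling. It lists only the /index.html URLs and drops anything older than 48 hours.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def build_news_sitemap(articles, publication_name, language,
                       now=None, max_age_hours=48):
    """Return a News sitemap (as a string) for recent articles only.

    `articles` is a list of dicts with hypothetical keys:
    "loc" (canonical URL), "title", and "published" (timezone-aware datetime).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)

    # Serialize the sitemap namespace as the default and the news one as "news:".
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("news", NEWS_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for article in articles:
        if article["published"] < cutoff:
            continue  # older than the 48h window, leave it out
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = article["loc"]
        news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
        publication = ET.SubElement(news, f"{{{NEWS_NS}}}publication")
        ET.SubElement(publication, f"{{{NEWS_NS}}}name").text = publication_name
        ET.SubElement(publication, f"{{{NEWS_NS}}}language").text = language
        ET.SubElement(news, f"{{{NEWS_NS}}}publication_date").text = (
            article["published"].isoformat()
        )
        ET.SubElement(news, f"{{{NEWS_NS}}}title").text = article["title"]
    return ET.tostring(urlset, encoding="unicode")
```

Passing only the URL #2 form as "loc" keeps the sitemap consistent with the rel=canonical tag, so Google receives one signal rather than two competing ones.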

However, if we use the site:domain.com command on Google News we see something completely different: Google News has actually indexed only some of the news, and specifically only the #3 type of URL (ending with the trailing slash instead of /index.html). Why? What's wrong?

a) Does the Google News bot have problems indexing URLs ending with /index.html? While figuring out what's wrong we found that http://news.google.it/news/search?aq=f&pz=1&cf=all&ned=us&hl=en&q=inurl%3Aindex.html gives no results... it seems that the Google News index as a whole does not include any URLs ending with /index.html.

b) Does the Google News bot recognise the rel=canonical tag?

c) Is it just a matter of time before Google News picks up the right URLs (/index.html), and/or should we notify the Google News team of the changes?

d) Any suggestions? Or should we do it the other way around, meaning make URL #3 the canonical one?

While Google News is showing these problems, Google Web search has actually picked up the changes well, so we don't know what to do.

Thanks for your help,

Matteo

H-FARM

To follow up on this.

Look what I've found in the Google News Forum:

http://www.google.com/support/forum/p/news/thread?tid=248ef4e6fe372e91&hl=en

The problem is almost the same: Google News is not indexing URLs with a trailing index.html.

The only person who answered was a Top Contributor, who suggested contacting the Google News team directly.

BlinkWeb

Hmmm, that is strange! Check a cached version of one of your URLs to make sure the new version is in the index. If it is, maybe you should switch to option 3 (making URL #3 the canonical one).
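One way to run that check without eyeballing the source is to save the HTML of the cached copy and pull out whatever rel=canonical it declares. The sketch below assumes Python and uses only the standard library; the class and function names (`CanonicalFinder`, `find_canonical`) are hypothetical, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values, so split before testing.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def find_canonical(html):
    """Return the list of canonical URLs declared in an HTML string."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


# Made-up markup standing in for the saved copy of URL #3.
sample = (
    '<html><head>'
    '<link rel="canonical" href="http://www.domain.com/post/index.html" />'
    '</head><body></body></html>'
)
print(find_canonical(sample))
```

An empty list means the copy you are looking at predates the tag; more than one entry means the page declares conflicting canonicals, which is worth fixing before drawing any conclusions.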

I am not sure what the implications, if any, would be of leaving it the way you have it.

Since these are 2 different areas of search, I am not sure that duplicate content issues would apply if you were to just leave it be.

H-FARM

hey Roger,

Look, CNN seems to have exactly the same "problem" as we do.

http://www.google.com/#q=Obama+makes+stop+in+Los+Angeles%2C+wraps+up+campaign+swing+&fp=3986f88f9d6402d3&hl=en

They have the "/" article indexed in Google News and the index.html version in the regular (non-News) Google index. They did exactly what we did, putting a rel=canonical on the "/" version pointing to the "index.html" one. Despite this, the "/" version is still the only one showing up in Google News.

Here is the screenshot just in case

https://skitch.com/matsutton/r5swm/obama-makes-stop-in-los-angeles-wraps-up-campaign-swing-google-search

and here the two versions of the same article:

- http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/

- http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/index.html

BlinkWeb

They seem to meet these requirements. The only one that is a problem is requirement #3, but it clearly states that it is waived with News sitemaps, which Matteo said they submitted.

With that said I do like Matteo's option #1 better than the naming convention they chose to go with.

BlinkWeb

It does sound weird, but I am not sure that search operator works in Google News.

Here is a simple test. Search Google News for "Google"

The second story I see is http://phandroid.com/2011/04/22/will-spotify-be-google-musics-savior/

However a Google News search for "inurl:will-spotify-be-google-musics-savior" returns no results.

Clearly the story is indexed!

KeriMorgret

My hunch, and it's only a hunch, is that it relates to their URL requirement that each URL has to be dedicated to a single article. An index.html page is usually not a page that would be dedicated to one individual news story. See http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=68323 for their URL requirements.

H-FARM

Hi Roger, and thx for the very insightful answer!

What about the fact that not a single URL ending with index.html is indexed in Google News?

http://news.google.it/news/search?aq=f&pz=1&cf=all&ned=us&hl=en&q=inurl%3Aindex.html

Compare that with the normal Google index:

http://www.google.it/search?q=inurl%3Aindex.html&hl=en&ned=us&tab=nw

Doesn't that sound weird to you?

matteo

BlinkWeb

I had another thought too. Just because Google WMT says the pages are indexed doesn't mean the new content, including the new canonical tags, has been crawled or added to the index yet.

I recently did a similar project adding canonical tags to an ecommerce site. The new URLs are only showing up correctly in the search results maybe 10% of the time, even for pages I know have been crawled and I submitted a week ago. The important thing is that more URLs are updated each day.

I don't believe they throw out their index the first time they crawl an established page and something has changed. I believe the index changes gradually: as they continue to crawl, they compare versions and index data based on aggregates of multiple crawls, especially for existing pages that have been in the index for a while. In other words, if they compare 20 recent crawls and only see 1 version as being different, they may not throw out the old version right away, but wait until they have crawled it multiple times and seen the new version in, say, 5 or 10 of the most recent 20 crawls. BTW, I don't have any data to back that up; it's just my personal observation/theory.

BlinkWeb

If you used the rel=canonical tag properly and only submitted the sitemap yesterday, it's just a waiting game. You will get crawled and indexed properly soon.
