Google News not indexing /index.html pages
-
Hi all,
we've been asked by a blog to help improve its indexing and ranking on Google News (the site is already included in Google News, but with poor results).
The blog had a chronic URL duplication problem, with each post existing at 3 different URLs:
#1) www.domain.com/post.html (currently noindexed for editorial reasons, as it shows all the comments)
#2) www.domain.com/post/index.html (currently indexed, showing only top comments)
#3) www.domain.com/post/ (identical to #2)
We've chosen URL #2 (/index.html) as the canonical URL and added a rel=canonical tag on URL #3 (/) pointing to URL #2.
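For anyone who wants to double-check a setup like this, here is a minimal Python sketch that extracts the rel=canonical target from a page's HTML using only the standard library. The sample markup and the `http://` scheme are assumptions for illustration; they are not taken from the actual site.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def find_canonicals(html_text):
    """Return the list of canonical URLs declared in html_text."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonicals


# Hypothetical markup for URL #3 (/post/), pointing at URL #2.
SAMPLE_PAGE = """
<html><head>
<title>Post</title>
<link rel="canonical" href="http://www.domain.com/post/index.html">
</head><body>...</body></html>
"""

if __name__ == "__main__":
    print(find_canonicals(SAMPLE_PAGE))
```

A page should declare exactly one canonical URL, so a result list with zero or several entries is worth investigating.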
Yesterday we also submitted a Google News sitemap that consistently lists the type #2 URLs from the last 48 hours. Google has properly "digested" the sitemap and reports that all URLs have been submitted and indexed. However, if we use the site:domain.com command on Google News we see something completely different: Google News has actually indexed only some of the news items, and specifically only the type #3 URLs (ending with the trailing slash instead of /index.html). Why? What's wrong?
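As a rough illustration of the kind of sitemap described here, the following Python sketch builds a Google News sitemap that keeps only articles from the last 48 hours. The article data, publication name, and language are hypothetical placeholders, and the blog's real sitemap may be generated quite differently.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def build_news_sitemap(articles, publication_name, language, now=None):
    """Return News sitemap XML for articles published in the last 48 hours.

    `articles` is a list of dicts with keys "url", "title" and "published"
    (a timezone-aware datetime). All values here are illustrative.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=48)

    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("news", NEWS_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")

    for article in articles:
        if article["published"] < cutoff:
            continue  # older than 48 hours: leave it out
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = article["url"]
        news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
        publication = ET.SubElement(news, f"{{{NEWS_NS}}}publication")
        ET.SubElement(publication, f"{{{NEWS_NS}}}name").text = publication_name
        ET.SubElement(publication, f"{{{NEWS_NS}}}language").text = language
        published = article["published"].isoformat()
        ET.SubElement(news, f"{{{NEWS_NS}}}publication_date").text = published
        ET.SubElement(news, f"{{{NEWS_NS}}}title").text = article["title"]

    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    sample = [{
        "url": "http://www.domain.com/post/index.html",  # type #2 URL
        "title": "Hypothetical post title",
        "published": datetime.now(timezone.utc) - timedelta(hours=3),
    }]
    print(build_news_sitemap(sample, "Example Blog", "en"))
```

Whatever generates the sitemap, the point is that every `<loc>` uses the same URL type as the rel=canonical target, so the two signals do not contradict each other.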
a) Does the Google News bot have problems indexing URLs ending with /index.html? While figuring out what's wrong, we found that http://news.google.it/news/search?aq=f&pz=1&cf=all&ned=us&hl=en&q=inurl%3Aindex.html gives no results. It seems that the Google News index does not include any URLs ending with /index.html.
b) Does the Google News bot recognise the rel=canonical tag?
c) Is it just a matter of time before Google News picks up the right URLs (/index.html), and/or should we notify the Google News team of the changes?
d) Any suggestions? Or should we do it the other way around, meaning make URL #3 the canonical one?
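To make the three URL types and the switch proposed in d) concrete, here is a small Python sketch that classifies a URL as type #1, #2 or #3 and rewrites a type #2 URL into type #3 or back. The function names and the `http://` scheme are assumptions for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def classify_variant(url):
    """Return 1, 2 or 3 for the URL types listed in the question, else None."""
    path = urlsplit(url).path
    if path.endswith("/index.html"):
        return 2
    if path.endswith("/"):
        return 3
    if path.endswith(".html"):
        return 1
    return None


def to_variant(url, variant):
    """Rewrite a type #2 or #3 URL into the requested type (2 or 3)."""
    if variant not in (2, 3):
        raise ValueError("variant must be 2 or 3")
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/index.html"):
        base = path[: -len("index.html")]  # keep the trailing slash
    elif path.endswith("/"):
        base = path
    else:
        raise ValueError("not a type #2 or #3 URL: " + url)
    new_path = base + "index.html" if variant == 2 else base
    return urlunsplit(parts._replace(path=new_path))


if __name__ == "__main__":
    for candidate in (
        "http://www.domain.com/post.html",
        "http://www.domain.com/post/index.html",
        "http://www.domain.com/post/",
    ):
        print(candidate, classify_variant(candidate))
    print(to_variant("http://www.domain.com/post/index.html", 3))
```

A helper like this could be run over the sitemap entries and over the URLs that show up in Google News to count how many of each type are actually indexed.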
While Google News is showing these problems, Google Web search has actually handled the changes well, so we don't know what to do.
Thanks for your help,
Matteo
-
To follow up on this.
Look what I've found in the Google News Forum:
http://www.google.com/support/forum/p/news/thread?tid=248ef4e6fe372e91&hl=en
The problem is almost the same: Google News is not indexing URLs with a trailing index.html.
The only person who answered was a Top Contributor, who suggested contacting the Google News team directly.
-
Hmmm, that is strange! Check a cached version of one of your URLs to make sure the new version is in the index. If it is, maybe you should switch to option 3.
I am not sure what, if any, the implications would be of leaving it the way you have it.
Since it is in 2 different areas of search, I am not sure that duplicate content issues would apply if you were to just leave it be.
-
hey Roger,
Look, CNN seems to have exactly the same "problem" as we do.
They have the "/" version of the article indexed in Google News and the index.html version in the regular (non-News) Google index. They did exactly what we did, putting a rel=canonical on the "/" version pointing to the "index.html" one. Despite this, the "/" version is still the only one showing up in Google News.
Here is the screenshot just in case
and here the two versions of the same article:
- http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/
- http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/index.html
-
They seem to meet these requirements. The only one that is a problem is requirement #3, but it clearly states that it is waived for News sitemaps, which Matteo said they submitted.
With that said, I do like Matteo's option #1 better than the naming convention they chose to go with.
-
It does sound weird, but I am not sure that search operator works in Google News.
Here is a simple test. Search Google News for "Google"
The second story I see is http://phandroid.com/2011/04/22/will-spotify-be-google-musics-savior/
However a Google News search for "inurl:will-spotify-be-google-musics-savior" returns no results.
Clearly the story is indexed!
-
My hunch, and it's only a hunch, is that it relates to their URL requirements, which say the URL has to be dedicated to an article. An index.html page is usually not a page that would be dedicated to one individual news story. See http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=68323 for their URL requirements.
-
Hi Roger, and thx for the very insightful answer!
What about the fact that not a single URL ending with index.html is indexed in Google News?
http://news.google.it/news/search?aq=f&pz=1&cf=all&ned=us&hl=en&q=inurl%3Aindex.html
Compare that with the normal Google index:
http://www.google.it/search?q=inurl%3Aindex.html&hl=en&ned=us&tab=nw
Doesn't that sound weird to you?
matteo
-
I had another thought too. Just because Google WMT says the pages are indexed doesn't mean the new content, including the new canonical tags, has been crawled or added to the index yet.
I recently did a similar project adding canonical tags to an ecommerce site. The new URLs are only showing up correctly in the search results maybe 10% of the time, even for pages that I submitted a week ago and know have been crawled. The important thing is that more URLs are updated each day.
I don't believe they throw out what is in their index the first time they crawl an established page and something has changed. I believe the index changes as they continue to crawl: they compare versions and index data based on aggregates of multiple crawls, especially for existing pages that have been in the index for a while. In other words, if they compare 20 recent crawls and see only 1 version that is different, they may not throw out the old version right away, but wait until they have crawled it multiple times and seen the new version in, say, 5 or 10 of the most recent 20 crawls. BTW, I don't have any data to back that up; it's just my personal observation/theory.
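Purely to illustrate that theory, here is a toy Python model of the "N of the last 20 crawls" idea. It is not a description of how Google actually works; the function name, the window of 20 and the threshold of 5 are assumptions lifted from the guess above.

```python
def indexed_version(crawl_history, current, window=20, threshold=5):
    """Toy model: keep `current` in the index until the most recently seen
    version appears in at least `threshold` of the last `window` crawls."""
    recent = list(crawl_history)[-window:]
    if not recent:
        return current
    latest = recent[-1]
    if latest == current:
        return current
    # Count how often the newly seen version shows up in the window.
    seen = sum(1 for version in recent if version == latest)
    return latest if seen >= threshold else current


if __name__ == "__main__":
    one_new = ["/post/"] * 19 + ["/post/index.html"]
    five_new = ["/post/"] * 15 + ["/post/index.html"] * 5
    print(indexed_version(one_new, "/post/"))
    print(indexed_version(five_new, "/post/"))
```

Under this model a single crawl of the new canonical would not be enough to change what shows up, which would match a slow, gradual switchover.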
-
If you used the rel=canonical tag properly and only submitted the sitemap yesterday, it's just a waiting game. You will get crawled and indexed properly soon.