Google News not indexing /index.html pages
-
Hi all,
we've been asked by a blog to help improve its indexing and ranking on Google News (the site is already included in Google News, but with poor results).
The blog had a chronic URL duplication problem, with each post existing under 3 different URLs:
#1) www.domain.com/post.html (currently set to noindex by editorial choice, as it shows all the comments)
#2) www.domain.com/post/index.html (currently indexed, showing only the top comments)
#3) www.domain.com/post/ (exactly the same as #2)
We've chosen URL #2 (/index.html) as the canonical URL and added a rel=canonical tag on URL #3 (/) pointing to URL #2.
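For anyone who wants to double-check a setup like this, a minimal sketch of reading which canonical URL a page declares could look like the following. It uses only Python's standard html.parser; the class name, the function name and the sample markup are hypothetical illustrations, not taken from the blog's real pages.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical one is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        rel_tokens = attr_map.get("rel", "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href") or None


def find_canonical(html: str) -> Optional[str]:
    """Return the declared canonical URL, or None if the page declares none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical markup for URL #3 (/post/), pointing at URL #2 (/post/index.html).
sample_page = """
<html><head>
<title>Post</title>
<link rel="canonical" href="http://www.domain.com/post/index.html">
</head><body></body></html>
"""

print(find_canonical(sample_page))
```

Running the same check against the HTML actually served for each of the three URL variants would show whether the tag is present where it is expected.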
Yesterday we also submitted a Google News sitemap that consistently lists the #2 URLs from the last 48h. Google has properly "digested" the sitemap and reports that all URLs have been submitted and indexed. However, if we use the site:domain.com command on Google News, we see something completely different: Google News has actually indexed only some of the articles, and specifically only the #3 URLs (ending with the trailing slash instead of /index.html). Why? What's wrong?
a) Does the Google News bot have problems indexing URLs ending with /index.html? While figuring out what's wrong, we found that http://news.google.it/news/search?aq=f&pz=1&cf=all&ned=us&hl=en&q=inurl%3Aindex.html gives no results. It seems that the Google News index as a whole does not include any URLs ending with /index.html.
b) Does the Google News bot recognise the rel=canonical tag?
c) Is it just a matter of time before Google News picks up the right URLs (/index.html), and/or should we notify the Google News team of the changes?
d) Any suggestions? Or should we do it the other way around, meaning make URL #3 the canonical one?
While Google News is showing these problems, Google Web search has handled the changes well, so we don't know what to do.
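For reference, a News sitemap restricted to the last 48h, like the one described above, could be generated roughly as follows. This is only a sketch: the function name, the publication name, the language and the sample posts are placeholder assumptions, and the element names follow the Google News sitemap schema as I understand it.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def build_news_sitemap(posts, now, publication="Example Blog", language="en"):
    """Build a News sitemap string from (url, title, published) tuples.

    Only posts published within the 48 hours before `now` are included.
    `publication` and `language` are placeholder values.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("news", NEWS_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    cutoff = now - timedelta(hours=48)
    for url, title, published in posts:
        if published < cutoff:
            continue  # older than 48h, leave it out
        url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = url
        news_el = ET.SubElement(url_el, f"{{{NEWS_NS}}}news")
        pub_el = ET.SubElement(news_el, f"{{{NEWS_NS}}}publication")
        ET.SubElement(pub_el, f"{{{NEWS_NS}}}name").text = publication
        ET.SubElement(pub_el, f"{{{NEWS_NS}}}language").text = language
        ET.SubElement(news_el, f"{{{NEWS_NS}}}publication_date").text = published.isoformat()
        ET.SubElement(news_el, f"{{{NEWS_NS}}}title").text = title
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical posts, always listed with the #2 (/index.html) form of the URL.
now = datetime(2011, 4, 23, 12, 0, tzinfo=timezone.utc)
posts = [
    ("http://www.domain.com/new-post/index.html", "New post", now - timedelta(hours=5)),
    ("http://www.domain.com/old-post/index.html", "Old post", now - timedelta(hours=72)),
]
print(build_news_sitemap(posts, now))
```

The point of the sketch is consistency: every loc entry uses the same URL form that the rel=canonical tag points to.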
Thanks for your help,
Matteo
-
To follow up on this.
Look what I've found in the Google News Forum:
http://www.google.com/support/forum/p/news/thread?tid=248ef4e6fe372e91&hl=en
The problem is almost the same: Google News is not indexing URLs with a trailing index.html.
The only person who answered was a Top Contributor, who suggested contacting the Google News team directly.
-
Hmmm, that is strange! Check a cached version of one of your URLs to make sure the new version is in the index. If it is, maybe you should switch to option 3.
I am not sure what, if any, the implications would be of leaving it the way you have it.
Since these are 2 different areas of search, I am not sure that duplicate content issues would apply if you were to just leave it be.
-
hey Roger,
Look, CNN seems to have exactly the same "problem" as we do.
They have the "/" version of the article indexed in Google News and the index.html version in the regular (non-News) Google index. They did exactly what we did, putting a rel=canonical on the "/" version pointing to the "index.html" one. Despite this, the "/" version is still the only one showing up on Google News.
Here is the screenshot just in case
and here the two versions of the same article:
- http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/
- http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/index.html
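To spot this kind of pair in a list of URLs, the "/" and "/index.html" variants can be collapsed to a common key. The helper below is a small sketch using Python's urllib.parse; the name variant_key is hypothetical, and it treats "/index.html" as equivalent to "/" purely for comparison purposes.

```python
from urllib.parse import urlsplit, urlunsplit


def variant_key(url: str) -> str:
    """Map the '/' and '/index.html' forms of a URL to the same key."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/index.html"):
        # Drop the file name but keep the trailing slash.
        path = path[: -len("index.html")]
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


urls = [
    "http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/",
    "http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/index.html",
]
# Both variants collapse to a single key, so they are flagged as duplicates.
print({variant_key(u) for u in urls})
```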
-
They seem to meet these requirements. The only one that is a problem is requirement #3, but it clearly states that it is waived for News sitemaps, which Matteo said they submitted.
With that said, I do like Matteo's option #1 better than the naming convention they chose to go with.
-
It does sound weird, but I am not sure that search operator works in Google News.
Here is a simple test. Search Google News for "Google"
The second story I see is http://phandroid.com/2011/04/22/will-spotify-be-google-musics-savior/
However a Google News search for "inurl:will-spotify-be-google-musics-savior" returns no results.
Clearly the story is indexed!
-
My hunch, and it's only a hunch, is that it relates to their URL requirement that each URL has to be dedicated to a single article. An index.html page is usually not a page dedicated to one individual news story. See http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=68323 for their URL requirements.
-
Hi Roger, and thanks for the very insightful answer!
What about the fact that not a single URL ending with index.html is indexed in Google News?
http://news.google.it/news/search?aq=f&pz=1&cf=all&ned=us&hl=en&q=inurl%3Aindex.html
Compare that with the normal Google index:
http://www.google.it/search?q=inurl%3Aindex.html&hl=en&ned=us&tab=nw
Doesn't that sound weird to you?
matteo
-
I had another thought too. Just because Google WMT says the pages are indexed doesn't mean the new content, including the new canonical tags, has been crawled or added to the index yet.
I recently did a similar project adding canonical tags to an ecommerce site. The new URLs are only showing up correctly in the search results maybe 10% of the time, even for pages I know have been crawled and that I submitted a week ago. The important thing is that more URLs are updated each day.
I don't believe they throw out their index the first time they crawl an established page and find that something has changed. I believe the index changes gradually: as they continue to crawl, they compare versions and index data based on aggregates of multiple crawls, especially for existing pages that have been in the index for a while. In other words, if they compare 20 recent crawls and only 1 shows a different version, they may not throw out the old version right away. They may wait until they have crawled the page multiple times and seen the new version in, say, 5 or 10 of the most recent 20 crawls. BTW, I don't have any data to back that up; it's just my personal observation/theory.
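To make that theory concrete, here is a toy model of it. It is purely an illustration of the guess described above, not a description of how Google actually works: the function name is made up, and the window of 20 crawls and threshold of 5 are just the example numbers from this post.

```python
def adopted_version(crawl_history, window=20, threshold=5):
    """Toy model: which version the index would show under this theory.

    crawl_history is a list of "old"/"new" observations, oldest first.
    The new version is adopted only once it was seen in at least
    `threshold` of the most recent `window` crawls.
    """
    recent = list(crawl_history)[-window:]
    new_count = sum(1 for seen in recent if seen == "new")
    return "new" if new_count >= threshold else "old"


# One changed crawl out of 20: the old version would still be shown.
print(adopted_version(["old"] * 19 + ["new"]))
# Five changed crawls out of the last 20: the new version would take over.
print(adopted_version(["old"] * 15 + ["new"] * 5))
```

Under this model a freshly added canonical tag would take several crawls to show up in results, which matches what I have observed but proves nothing about the real mechanism.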
-
If you used the rel=canonical tag properly and only submitted the sitemap yesterday, it's just a waiting game. You should get crawled and indexed properly soon.