Google News not indexing .index.html pages
-
Hi all,
we've been asked by a blog to help them better indexing and ranking on Google News (with the site being already included in Google News with poor results)
The blog had a chronicle URL duplication problem with each post existing with 3 different URLs:
#1) www.domain.com/post.html (currently in noindex for editorial choices as showing all the comments)
#2) www.domain.com/post/index.html (currently indexed showing only top comments)
#3) www.domain.com/post/ (very same as #2)
We've chosen URL #2 (/index.html) as canonical URL, and included a rel=canonical tag on URL #3 (/) linking to URL #2.
Also we've submitted yesterday a Google News sitemap including consistently the list of URLs #2 from the last 48h . The sitemap has been properly "digested" by Google and shows that all URLs have been sent and indexed.However if we use the site:domain.com command on Google News we see something completely different: Google News has indexed actually only some news and more specifically only the URLs #3 type (ending with the trailing slash instead of /index.html). Why ? What's wrong ?
a) Does Google News bot have problems indexing URLs ending with .index.html ? While figuring out what's wrong we've found out that http://news.google.it/news/search?aq=f&pz=1&cf=all&ned=us&hl=en&q=inurl%3Aindex.html gives no results...it seems that Google News index overall does not include any URLs ending with /index.html
b) Does Google News bot recognise rel=canonical tag ?
c) Is it just a matter of time and then Google News will pick up the right URLs (/index.html) and/or shall we communicate Google News team any changes ?
d) Any suggestions ? OR Shall we do the other way around. meaning make URL #3 the canonical one ?
While Google News is showing these problems, Google Web search has actually well received the changes, so we don't know what to do.
Thanks for your help,
Matteo
-
To follow up on this.
Look what I've found in the Google News Forum:
http://www.google.com/support/forum/p/news/thread?tid=248ef4e6fe372e91&hl=en
The problem is almost the same. Google News not indexing URLs with the trailing index.html.
The only person who answered was a Top Contributor suggesting to contact directly Google News team.
-
Hmmm, that is strange! Check a cached version of one of your URLs to make sure they new version is in the index. If it is, maybe you should switch to option 3.
I am not sure what if any the implications would be of leaving it the way you have it.
Since it is in 2 different areas of search I am not sure that duplicate content issues apply if you were to just leave it be.
-
hey Roger,
Look the CNN seems to have exactly the same "problem" as we do.
They have the "/" article indexed in google news and the index.html version on the non-google news index. They did exavtly what we did, putting a rel=canonical on the "/" version to the "index.html" one. Despite this the "/" version is still the only one showing up on google news
Here is the screenshot just in case
and here the two versions of the same article:
- http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/
- http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/index.html
-
They seem to meet these requirements. The only one that is a problem is requirement #3, but it clearly states that is waived with News sitemaps which Matteo said they submitted.
With that said I do like Matteo's option #1 better than the naming convention they chose to go with.
-
It does sound weird, but I am not sure that search operator works in Google News.
Here is a simple test. Search Google News for "Google"
The second story I see is http://phandroid.com/2011/04/22/will-spotify-be-google-musics-savior/
However a Google News search for "inurl:will-spotify-be-google-musics-savior" returns no results.
Clearly the story is indexed!
-
My hunch, and it's only a hunch, is that it relates to their URL requirements that the URL has to be dedicate to an article. An index.html page is usually not a page that would be dedicated to one individual news story. See http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=68323 for their URL requirements.
-
Hi roger and thx for the very insightful answer !
what about the fact that not a single URL ending with index.html is indexed in Google News ?
http://news.google.it/news/search?aq=f&pz=1&cf=all&ned=us&hl=en&q=inurl%3Aindex.html
compare that with the normal google index
http://www.google.it/search?q=inurl%3Aindex.html&hl=en&ned=us&tab=nw
doesn't that sound weird to you ?
matteo
-
I had another thought too. Just because the pages say they are indexed in Google WMT, doesn't mean the new content including the new canonical tags have been crawled or added to the index yet.
I recently did a similar project adding canonical tags to an ecommerce site. The new URLs are only showing up correctly in the search results maybe 10% of the time, even for pages I know have been crawled and I submitted a week ago. The important thing is that more URLs are updated each day.
I dont believe they throw out their index the first time they crawl an established page and something has changed. I believe the index gets changed as they continue to crawl they compare versions and index data based on multiple crawl agregates, especially if it is for existing pages that have been in the index for a while. So in other words, if they compare 20 recent crawls and only see 1 version as being different, they may not throw out the old version right away until they crawl it multiple times and see that the the new version exists, say 5 or 10 of the most recent 20 crawls. BTW I don't have any data to back that up just my personal observation/theory.
-
If you used the rel canonical tag properly and only submitted sitemap yesterday, its just a waiting game. You will get crawled and indexed properly soon.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
My WP website got attack by malware & now my website site:www.example.ca shows about 43000 indexed page in google.
Hi All My wordpress website got attack by malware last week. It affected my index page in google badly. my typical site:example.ca shows about 130 indexed pages on google. Now it shows about 43000 indexed pages. I had my server company tech support scan my site and clean the malware yesterday. But it still shows the same number of indexed page on google. Does anybody had ever experience such situation and how did you fixed it. Looking for help. Thanks FILE HIT LIST:
Technical SEO | | Chophel
{YARA}Spam_PHP_WPVCD_ContentInjection : /home/example/public_html/wp-includes/wp-tmp.php
{YARA}Backdoor_PHP_WPVCD_Deployer : /home/example/public_html/wp-includes/wp-vcd.php
{YARA}Backdoor_PHP_WPVCD_Deployer : /home/example/public_html/wp-content/themes/oceanwp.zip
{YARA}webshell_webshell_cnseay02_1 : /home/example2/public_html/content.php
{YARA}eval_post : /home/example2/public_html/wp-includes/63292236.php
{YARA}webshell_webshell_cnseay02_1 : /home/example3/public_html/content.php
{YARA}eval_post : /home/example4/public_html/wp-admin/28855846.php
{HEX}php.generic.malware.442 : /home/example5/public_html/wp-22.php
{HEX}php.generic.cav7.421 : /home/example5/public_html/SEUN.php
{HEX}php.generic.malware.442 : /home/example5/public_html/Webhook.php0 -
How can I get a photo album indexed by Google?
We have a lot of photos on our website. Unfortunately most of them don't seem to be indexed by Google. We run a party website. One of the things we do, is take pictures at events and put them on the site. An event page with a photo album, can have anywhere between 100 and 750 photo's. For each foto's there is a thumbnail on the page. The thumbnails are lazy loaded by showing a placeholder and loading the picture right before it comes onscreen. There is no pagination of infinite scrolling. Thumbnails don't have an alt text. Each thumbnail links to a picture page. This page only shows the base HTML structure (menu, etc), the image and a close button. The image has a src attribute with full size image, a srcset with several sizes for responsive design and an alt text. There is no real textual content on an image page. (Note that when a user clicks on the thumbnail, the large image is loaded using JavaScript and we mimic the page change. I think it doesn't matter, but am unsure.) I'd like that full size images should be indexed by Google and found with Google image search. Thumbnails should not be indexed (or ignored). Unfortunately most pictures aren't found or their thumbnail is shown. Moz is giving telling me that all the picture pages are duplicate content (19,521 issues), as they are all the same with the exception of the image. The page title isn't the same but similar for all images of an album. Example: On the "A day at the park" event page, we have 136 pictures. A site search on "a day at the park" foto, only reveals two photo's of the albums. 3QolbbI.png QTQVxqY.jpg mwEG90S.jpg
Technical SEO | | jasny0 -
My some pages are not showing cached in Google, WHY?
I have website http://www.vipcollisionlv.com/ and when i check the cache status with tags **site:http:vipcollisionlv.com, **some page has no cache status.. you can see this in image. How to resolve this issue. please help me.
Technical SEO | | 1akal0 -
Page disappeared from Google index. Google cache shows page is being redirected.
My URL is: http://shop.nordstrom.com/c/converse Hi. The week before last, my top Converse page went missing from the Google index. When I "fetch as Googlebot" I am able to get the page and "submit" it to the index. I have done this several times and still cannot get the page to show up. When I look at the Google cache of the page, it comes up with a different page. http://webcache.googleusercontent.com/search?q=cache:http://shop.nordstrom.com/c/converse shows: http://shop.nordstrom.com/c/pop-in-olivia-kim Back story: As far as I know we have never redirected the Converse page to the Pop-In page. However the reverse may be true. We ran a Converse based Pop-In campaign but that used the Converse page and not the regular Pop-In page. Though the page comes back with a 200 status, it looks like Google thinks the page is being redirected. We were ranking #4 for "converse" - monthly searches = 550,000. My SEO traffic for the page has tanked since it has gone missing. Any help would be much appreciated. Stephan
Technical SEO | | shop.nordstrom0 -
Dev Site Was Indexed By Google
Two of our dev sites(subdomains) were indexed by Google. They have since been made private once we found the problem. Should we take another step to remove the subdomain through robots.txt or just let it ride out? From what I understand, to remove the subdomain from Google we would verify the subdomain on GWT, then give the subdomain it's own robots.txt and disallow everything. Any advice is welcome, I just wanted to discuss this before making a decision.
Technical SEO | | ntsupply0 -
Google not indexing my website
Hi guys, We have this website http://www.m-health-expo.nl/ but it is not indexed by google. In webmaster tools google says that it can not fetch the site due to the robots.txt but i do not see any faults in it. http://www.m-health-expo.nl/robots.txt Do you see something strange, it really bothers me.
Technical SEO | | RuudHeijnen0 -
Best way to handle indexed pages you don't want indexed
We've had a lot of pages indexed by google which we didn't want indexed. They relate to a ajax category filter module that works ok for front end customers but under the bonnet google has been following all of the links. I've put a rule in the robots.txt file to stop google from following any dynamic pages (with a ?) and also any ajax pages but the pages are still indexed on google. At the moment there is over 5000 pages which have been indexed which I don't want on there and I'm worried is causing issues with my rankings. Would a redirect rule work or could someone offer any advice? https://www.google.co.uk/search?q=site:outdoormegastore.co.uk+inurl:default&num=100&hl=en&safe=off&prmd=imvnsl&filter=0&biw=1600&bih=809#hl=en&safe=off&sclient=psy-ab&q=site:outdoormegastore.co.uk+inurl%3Aajax&oq=site:outdoormegastore.co.uk+inurl%3Aajax&gs_l=serp.3...194108.194626.0.194891.4.4.0.0.0.0.100.305.3j1.4.0.les%3B..0.0...1c.1.SDhuslImrLY&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.&fp=ff301ef4d48490c5&biw=1920&bih=860
Technical SEO | | gavinhoman0 -
Site being indexed by Google before it has launched
We are currently coming towards the end of migrating one of our retail sites over to magento. To our horror, we find out today that some pages are already being indexed by Google, and we have started receiving orders through new site. Do you have any suggestions for what may have caused this? Or similarly, what the best solution would be to de-index ourselves? We most recently excluded anything with a certain parameter from robots.txt - could this being implemented incorrectly have caused this issue? Thanks
Technical SEO | | Sayers0