How does Google index pagination variables in Ajax snapshots? We're seeing random huge variables.

sitestrux

We're using the Google snapshot method to index dynamic Ajax content. Some of this content is from tables using pagination. The pagination is tracked with a var in the hash, something like:

#!home/?view_3_page=1

We're seeing all sorts of calls from Google now with huge numbers for these URL variables that we are not generating with our snapshots. Like this:

#!home/?view_3_page=10099089

These aren't trivial since each snapshot represents a server load, so we'd like these vars to only represent what's returned by the snapshots.

Is Google generating random numbers to go fishing for content? If so, is this something we can control or minimize?

sitestrux

Thanks for the great replies all. Just to clarify, this is the page we're referencing:

http://www.knackhq.com/business-directory-user-demo/?escaped_fragment=

You can see the one pagination var "next" that points here:

http://www.knackhq.com/business-directory-user-demo/?escaped_fragment=home/?view_3_page=2

As you can see this is pretty simple. There's only one potential variable (the "prev" and "next" links) for introducing these huge numbers and that's pretty limited. We tested the Google URLs up and down the app and haven't seen anything that would send it fishing for larger numbers. But Google keeps hammering us with:

GET /business-directory-user-demo/?escaped_fragment=home/?view_3_page=1000251

For now we're responding to those with 404s and hoping they eventually die off.

Unfortunately we can't avoid hashbangs.

Carson-Ward

Google seems to do this only for parameters that it has decided "changes, re-orders, or narrows content." It may also crawl things that look like URLs in JavaScript even when they're part of a function, but it doesn't seem like that's what's happening in this case.

Depending on the setup of the site, you can either manually configure the variable in Webmaster Tools (WMT) (don't do this if the parameter is material), write a clever robots.txt rule (e.g. to block the parameter once its value runs past a certain number of digits), or (the best solution) re-work the system to generate URLs that don't rely on parameters.

I'm not sure I understand why the server is rendering a page if the URL isn't supposed to exist. Depending on your server config, you may also be able to return a 404 and make a rule for which (valid) pages to render. From there you can just ignore the 404 errors until Google figures it out.
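To make the "rule for which (valid) pages to render" idea concrete, here is a rough Python sketch, not a drop-in fix. It assumes the snapshot server can look up how many pages the table really has (`total_pages` is a hypothetical value your app would supply), and `snapshot_status` is a made-up helper name. It takes the fragment part of the request, e.g. `home/?view_3_page=2`, and decides between a 200 and a 404 before any snapshot is rendered.

```python
import re

# Assumption: the pagination parameter is named view_3_page, as in the URLs above.
PAGE_PARAM = re.compile(r"(?:^|[?&])view_3_page=([^&]*)")


def snapshot_status(fragment: str, total_pages: int) -> int:
    """Return 200 if the fragment points at a page that exists, else 404.

    total_pages is hypothetical: the real app would compute it from the
    number of records in the table being paginated.
    """
    match = PAGE_PARAM.search(fragment)
    if match is None:
        # No pagination parameter at all: treat it as the default first page.
        return 200

    value = match.group(1)
    # Reject empty, non-numeric, or absurdly long values without rendering.
    if not re.fullmatch(r"[0-9]{1,9}", value):
        return 404

    page = int(value)
    return 200 if 1 <= page <= total_pages else 404


if __name__ == "__main__":
    print(snapshot_status("home/?view_3_page=2", total_pages=5))
    print(snapshot_status("home/?view_3_page=1000251", total_pages=5))
```

The point of checking before rendering is that the bogus requests never cost you a snapshot; they only cost a cheap string check and a 404.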

I think that's the best I can do without seeing the site.

randfish

I agree with Federico. I've seen Google go fishing with URL parameters (?param=xyz) and I've seen it with AJAX and hashbangs as well. How far they take this and when they choose to apply it doesn't seem to follow a consistent pattern. You can see some folks on StackExchange discussing this, too: http://webmasters.stackexchange.com/questions/25560/does-the-google-crawler-really-guess-url-patterns-and-index-pages-that-were-neve

sitestrux

Awesome, thanks for looking into it. We've gotten nowhere with any kind of answer.

evolvingSEO

Hi There

I'm an associate here at Moz, and have asked the other associates if they might know the answer, as this one's a little outside of my experience. Please follow up and let us know if you don't hear from anyone.

Thanks!

-Dan

FedeEinhorn

We also noticed some weird crawls last year using random numbers at the end of the URL. Checking in Google Webmaster Tools, we saw that most of those URLs were reported as not found. When we checked where the links came from, Google listed some of our own URLs, but those pages didn't have any links to the URLs Google was trying to fetch. After 2 or 3 months those crawls stopped. We never found out where Google got those URLs...

sitestrux

Hi Federico, thanks for the response.

Unfortunately this is an SEO solution for a third-party JavaScript product, so removing the hash isn't an option.

I'm still interested in knowing if this is a formal Google practice and if there's some way to control or mitigate this.

FedeEinhorn

I think you are right: Google is fishing for content. I would look for a way to make those URLs friendly by removing the hash and using URL rewriting and pushState to paginate that content instead.

Here's a previous question that may help: http://moz.com/community/q/best-way-to-break-down-paginated-content
