How did Google's crawler cache orphan pages and a directory?

darshit21

I made some changes to a live website and uploaded them to a recently created "demo" directory for client approval.

So my demo link is www.test.com/demo/

I am not doing any link building or any other activity that would pass a referral link to www.test.com/demo/

So how did Google's crawler find it and cache some pages, or the entire directory?

Thanks

KeriMorgret

Try putting the URL into Google and see if you find any pages linking to it.

I knew a company that created a test site that was a copy of a live site (made with a specific hosted CMS). They didn't exclude the test site in robots.txt because "we all know we won't link to it so it'll be ok". The site got indexed, and it was because a person at the company was having problems with the implementation of the test site, went to the help forum (which that person didn't think would be indexed) and posted the URL of the test site.

I found the above just by putting the URL of the test site into Google, and I saw the post in the help forum. You might try the same to see if somehow there is a rogue link.

darshit21

Is Google crawling our emails?

Is it possible?

StalkerB

Yup, correct.

I was certain I'd replied to this :\

Anyway, you ever notice how the ads in Gmail are always relevant to the content of your emails? Google are totally reading them.

ChrisMacNaughton

The <conspiracy hat> side of things was him commenting that Google is sometimes accused of processing everything in Gmail and could possibly have pulled your link to the demo directory from that.

darshit21

Hi Barry,

Yes, we used Gmail for reporting.

Does that make any sense?

StalkerB

<conspiracy hat>

Did either you or your client use Gmail when you sent him the demo link?

Regardless, Dan's advice to noindex and block the directory from spiders is the way to go in future when doing development work.

darshit21

Hi JoelHit,

No, there is not a single referral link to the "demo" directory anywhere on the website or on any third-party website.

I am aware of how Google's crawling and indexing systems work.

Thanks.

darshit21

Hi Thetjo,

I know about it.

My question is: how did Google crawl it without any referral link?

Thanks.

darshit21

Hi Dan,

No, I have not excluded the "demo" directory in robots.txt for any search engine.

I am not using WordPress; it's a simple static HTML website (no CMS of any kind).

Theo-NL

Did this actually happen or are we talking about a hypothetical situation here? It could be that there is a link to the demo directory you've overlooked? Has the /demo folder perhaps been used in the past and there were still old links to it?

As a meta-solution to this problem: prevent crawlers and nosy people from accessing the content by adding a .htpasswd login to the area used for client approval.
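One way to confirm that a login wall like this is really in place is to request the demo URL without credentials and check for an HTTP 401 response. The following is only a rough sketch: the function name `demands_login` is made up for illustration, and it assumes the server answers anonymous requests with 401, which is what HTTP Basic authentication normally does.

```python
import urllib.error
import urllib.request


def demands_login(url: str, timeout: float = 10.0) -> bool:
    """Return True if an anonymous GET for `url` is rejected with HTTP 401.

    Hypothetical helper for illustration; it sends no credentials at all.
    """
    request = urllib.request.Request(url, method="GET")
    try:
        # A normal response means the page is open to anyone,
        # including crawlers.
        with urllib.request.urlopen(request, timeout=timeout):
            return False
    except urllib.error.HTTPError as err:
        # 401 Unauthorized is the usual answer from a password-protected area.
        return err.code == 401


# Example call (placeholder URL taken from the question):
# demands_login("http://www.test.com/demo/")
```

If this returns False for the demo directory, the content is reachable by anyone who learns the URL.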

DanDeceuster

Did you block the /demo/ directory in your robots.txt file? This is step number one to try to ensure the pages don't get crawled. Also, are you using WordPress? If so, WordPress automatically pings search engines when you add a post, and if you use the common sitemap plugin, it submits the sitemap to Google automatically when it creates it, so that's another way Google could have found it.
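As a minimal sketch of the robots.txt step, here is what a rule blocking the /demo/ directory could look like, together with a quick check using Python's standard `urllib.robotparser`. The rule text and the helper name `is_crawl_allowed` are assumptions made for illustration, not the actual file from www.test.com. Note that a Disallow rule only asks compliant crawlers not to fetch the pages; a URL that is linked from somewhere else can still end up in the index, so it is not a substitute for password protection.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: block every crawler from the demo directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /demo/
"""


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules.

    Hypothetical helper: parses the rules from a string, no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://www.test.com/demo/index.html",
                "http://www.test.com/index.html"):
        print(url, is_crawl_allowed(ROBOTS_TXT, url))
```

Checking the rules locally like this catches typos in the path before the file goes live.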
