How to get a large number of urls out of Google's Index when there are no pages to noindex tag?
-
Hi,
I'm working with a site that has created a large group of urls (150,000) that have crept into Google's index. If these urls actually existed as pages, which they don't, I'd just noindex tag them and over time the number would drift down.
The thing is, they created them through a complicated internal linking arrangement that adds affiliate code to the links and forwards them to the affiliate. GoogleBot would crawl a link that looks like it's to the client's same domain and wind up on Amazon or somewhere else with some affiiiate code. GoogleBot would then grab the original link on the clients domain and index it... even though the page served is on Amazon or somewhere else. Ergo, I don't have a page to noindex tag.
I have to get this 150K block of cruft out of Google's index, but without actual pages to noindex tag, it's a bit of a puzzler.
Any ideas? Thanks! Best... Michael
P.S.,
All 150K urls seem to share the same url pattern... exmpledomain.com/item/... so /item/ is common to all of them, if that helps.
-
If no pages which support web coding actually exist for the URLs you want to remove from Google's index, I'd probably use the HTTP header instead. Maybe use the X-Robots directives:
- https://yoast.com/x-robots-tag-play/
- https://www.searchenginejournal.com/x-robots-tag-simple-alternate-robots-txt-meta-tag/67138/
Even if you have no page with web-code, you can always have a HTTP Header. A HTTP header simply allows a client and / or server to fire additional information through 'requests' (post / get etc).
This is the only thing I can think of which would really help. Some people might suggest robots.txt wildcards, but robots.txt handles crawling and not indexation (so those answers wouldn't really be worth anything to you)
The other thing you could do (maybe combine this with the X-Robots stuff) is to get all of those URLs to serve status code 410 (gone) instead of 404 (temporarily gone, but coming back)
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Is it good or bad to add noindex for empty pages, which will get content dynamically after some days
We have followers, following, friends, etc pages for each user who creates account on our website. so when new user sign up, he may have 0 followers, 0 following and 0 friends, but over period of time he can get those lists go up. we have different pages for followers, following and friends which are allowed for google to index. When user don't have any followers/following/friends, those pages looks empty and we get issue of duplicate content and description too short. so is it better that we add noindex for those pages temporarily and remove noindex tag when there are at least 2 or more people on those pages. What are side effects of adding noindex when there is no data on those page or benefits of it?
Intermediate & Advanced SEO | | swapnil120 -
Should I use noindex or robots to remove pages from the Google index?
I have a Magento site and just realized we have about 800 review pages indexed. The /review directory is disallowed in robots.txt but the pages are still indexed. From my understanding robots means it will not crawl the pages BUT if the pages are still indexed if they are linked from somewhere else. I can add the noindex tag to the review pages but they wont be crawled. https://www.seroundtable.com/google-do-not-use-noindex-in-robots-txt-20873.html Should I remove the robots.txt and add the noindex? Or just add the noindex to what I already have?
Intermediate & Advanced SEO | | Tylerj0 -
Getting Pages Requiring Login Indexed
Somehow certain newspapers' webpages show up in the index but require login. My client has a whole section of the site that requires a login (registration is free), and we'd love to get that content indexed. The developer offered to remove the login requirement for specific user agents (eg Googlebot, et al.). I am afraid this might get us penalized. Any insight?
Intermediate & Advanced SEO | | TheEspresseo0 -
Sudden increase in number of indexed URLs. How ca I know what URLs these are?
We saw a spike in the total number of indexed URLs (17,000 to 165,000)--what would be the most efficient way to find out what the newly indexed URLs are?
Intermediate & Advanced SEO | | nicole.healthline0 -
Getting Google in index but display "parent" pages..
Greetings esteemed SEO experts - I'm hunting for advice: We operate an accommodation listings website. We monetize by listing position in search results, i.e. you pay more to get higher placing in the page. Because of this, while we want individual detailed listing pages to be indexed to get the value of the content, we don't really want them appearing in Google search results. We ideally want the "content value" to be attributed to the parent page - and google to display this as the link in the search results instead of the individual listing. Any ideas on how to achieve this?
Intermediate & Advanced SEO | | AABAB0 -
Does Google index url with hashtags?
We are setting up some Jquery tabs in a page that will produce the same url with hashtags. For example: index.php#aboutus, index.php#ourguarantee, etc. We don't want that content to be crawled as we'd like to prevent duplicate content. Does Google normally crawl such urls or does it just ignore them? Thanks in advance.
Intermediate & Advanced SEO | | seoppc20120 -
Why is google automatically showing my competitor's result even when customer types in our brand name on the query?
This is little weird. We run a website specific to mobile phones called as 91mobiles.com. The site has gained lot of user interest and trust in the last 2 years [pre-dominantly indian users]. Lot of our users type mobile phone model and 91mobiles as the query. Example "Sony Xperia P 91mobiles". Google is showing gsmarena.com results on top and then shows our own results below! What's annoying is the fact that google also bolds the term gsmarena [denoting that it's a synonym]. Any idea why this is happening? We are very sure that we are not doing anything wrong..We have worked really hard for the last 2 years to reach where we are..and it's kind of hard to see gsmarena siphoning away our traffic for no reason at all [even when customer types in 91mobiles a part of the query to quality it]...Can some experts here demystify this? Thanks.
Intermediate & Advanced SEO | | Gaadi0 -
Why is my site's 'Rich Snippets' information not being displayed in SERPs?
We added hRecipe microformats data to our site in April and then migrated to the Schema.org Recipe format in July, but our content is still not being displayed as Rich Snippets in search engine results. Our pages validate okay in the Google Rich Snippets Testing Tool. Any idea why they are not being displayed in SERP's? Thanks.
Intermediate & Advanced SEO | | Techboy0