Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Google is indexing bad URLs
-
Hi All,
The site I am working on is built on WordPress. The Revolution Slider plugin was installed and, while no longer in use, it remained on the site for some time. This plugin began creating hundreds of URLs whose pages contain nothing but code. I noticed these URLs were being indexed by Google. The URLs follow the structure: www.mysite.com/wp-content/uploads/revslider/templates/this-part-changes/
I have done the following to prevent these URLs from being created & indexed:
1. Added a directive in my .htaccess to 404 all of these URLs
2. Blocked /wp-content/uploads/revslider/ in my robots.txt
3. Manually de-indexed each URL using the GSC removal tool
4. Deleted the plugin
However, new URLs still appear in Google's index, despite being blocked by robots.txt and resolving to a 404. Can anyone suggest any next steps?
Thanks!
-
All of the plugins I can find allow the tag to be deployed on pages, posts etc. You pick from a pre-defined list of existing content, instead of just whacking in a URL and having it inserted (annoying!)
If you put an index.php at that location (the location of the 404), you could put whatever you wanted in it. Might work (maybe test with one). Would resolve a 200 so you'd then need to force a 410 over the top. Not very scalable though...
-
I do agree, I may have to pass this off to someone with more backend experience than myself. In terms of plugins, are you aware of any that will allow you to add noindex tags to an entire folder?
Thanks!
-
Hmm, that's interesting - it should work just as you say! This is the point where you need a developer's help rather than an SEO analyst's :') sorry!
Google will revisit 410s if it believes there is a legitimate reason to do so, but it's much less likely to revisit them than it is with 404s (which don't say whether the content is gone for good, so Google tends to keep checking back).
Plugins are your friends. Too many will overload a site and make it run pretty slowly (especially as PHP has no multi-threading support!) - but this plugin, you would only need it temporarily anyway.
You might have to start using something like phpMyAdmin to browse your SQL databases. It's possible that the uninstall didn't work properly and there are still databases at work, generating fresh URLs. You can quash them at the database level if required; however, I'd say go to a web developer, as manual DB edits can be pretty hazardous to a non-expert.
-
Thank you for all your help. I added a directive to 410 the pages in my .htaccess like so: Redirect 410 /revslider*/. However, it does not seem to work.
Currently, I am using Options All -Indexes to 404 the URLs. I am still worried, though: even if Google would not revisit a 410, could it still index the URL initially? This seems to be the case with my 404 pages - Google is actively indexing the new 404 pages that the broken plugin is producing.
As I cannot seem to locate the directory in cPanel, adding a noindex to them has been tough. I will look for a plugin that can dynamically add it based on folder structure, because the URLs are still actively being created.
The ongoing creation of the URLs is the ultimate source of the issue. I expected that deleting the plugin would have resolved it, but that does not seem to be the case.
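One possible reason the Redirect 410 /revslider*/ directive has no effect: Apache's Redirect compares a literal URL-path prefix and does not expand "*", whereas RedirectMatch takes a regular expression. The Python sketch below only imitates those two matching styles to show the difference. The helper names are invented for the illustration, the prefix check is a simplification, and none of this is Apache's own code.

```python
import re

# Example path following the URL structure described in the question.
URL_PATH = "/wp-content/uploads/revslider/templates/this-part-changes/"


def prefix_style_match(directive_path: str, url_path: str) -> bool:
    """Imitate literal prefix matching: '*' is just an ordinary character."""
    return url_path.startswith(directive_path)


def regex_style_match(pattern: str, url_path: str) -> bool:
    """Imitate regular-expression matching anywhere in the path."""
    return re.search(pattern, url_path) is not None


# The directive path neither starts the URL nor treats '*' as a wildcard.
print(prefix_style_match("/revslider*/", URL_PATH))
# A literal prefix that really does start the URL path.
print(prefix_style_match("/wp-content/uploads/revslider/", URL_PATH))
# A regex that looks for the folder name anywhere in the path.
print(regex_style_match(r"/revslider/", URL_PATH))
```

If that is the cause, something along the lines of RedirectMatch 410 /revslider/ in .htaccess may be worth trying on a single URL first, then checking the response code before relying on it.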
-
Just remember, the only regex character which is supported is "*". Others like "" and "?" are not supported! So it's still very limited. Changing the response from 404 to 410 should really help, but be prepared to give Google a week or two to digest your changes
Yes, it would be tricky to inject those URLs with Meta no index tags, but it wouldn't be impossible. You could create an index.php file at the directory of each page which contained a Meta no-index directive, or use a plugin to inject the tag onto specific URLs. There will be ways, don't give up too early! That being said, this part probably won't add much more than the 410s will
It wouldn't be a bad idea to inject the no-index tags, but do it for 410s and not for 404s (doing it for 404s could cause you BIG problems further down the line). Remember, 404 - "not found, might come back", 410 - "gone - never coming back". Really all 410s should be served with no-index tags. Google can read dynamically generated content, but is less likely to do so and crawls it less often. Still - it would at least make the problem begin shrinking over time. It would be better to get the tags into the non-modified source code (server side rendering).
By the way, you can send a no-index directive in the HTTP header if you are really stuck!
https://sitebulb.com/hints/indexability/robots-hints/noindex-in-html-and-http-header/
The above post is quite helpful: it shows no-index directives in HTML but also in the HTTP header.
In contrast to that example, you'd be serving 410 (gone) not 200 (ok)
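To make the "410 plus no-index in the HTTP header" idea concrete, here is a minimal Python sketch using only the standard library. It is a stand-in for whatever the real server would do (on a WordPress site this would normally live in the Apache or PHP layer), and the names gone_app and probe are made up for the example. The header itself, X-Robots-Tag: noindex, is the HTTP counterpart of the Meta no-index tag.

```python
from wsgiref.util import setup_testing_defaults


def gone_app(environ, start_response):
    """Tiny WSGI app: 410 + noindex header for revslider URLs, 404 otherwise."""
    path = environ.get("PATH_INFO", "")
    if "/revslider/" in path:
        status = "410 Gone"
        headers = [
            ("Content-Type", "text/plain; charset=utf-8"),
            # HTTP-header equivalent of the Meta no-index tag.
            ("X-Robots-Tag", "noindex"),
        ]
        body = b"Gone\n"
    else:
        status = "404 Not Found"
        headers = [("Content-Type", "text/plain; charset=utf-8")]
        body = b"Not found\n"
    start_response(status, headers)
    return [body]


def probe(path):
    """Call the app once without a real server; return (status, headers)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    b"".join(gone_app(environ, start_response))
    return captured["status"], captured["headers"]


print(probe("/wp-content/uploads/revslider/templates/this-part-changes/"))
```

The probe helper is only there so the response can be inspected without starting a server; against a live site, the same check would be done by looking at the response headers of one of the bad URLs.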
-
Thank you for your response! I will certainly use the regex in my robots.txt and try to change my .htaccess directive to 410 the pages.
However, the issue is that a defunct plugin is randomly creating hundreds of these URLs without my knowledge, and I cannot seem to access them. As this is the case, I can't add a no-index tag to them.
This is why I manually de-indexed each page using the GSC removal tool and then blocked them in my robots.txt. My hope was that after doing so, Google would no longer be able to find the bad URLs.
Despite this, Google is still actively crawling & indexing new URLs following this path, even though they are blocked by my robots.txt (validated). I am unsure how these URLs even continue to be created, as I deleted the plugin.
I had the idea to try to write a program with JavaScript that would take the status code and insert a no-index tag if the header returned a 404, but I don't believe this would even be recognized by Google, as it would be inserted dynamically. Ultimately, I would like to find a way to get the plugin to stop creating these URLs, so that I can simply de-index them manually again.
Thanks,
-
You have taken some good measures there, but it does take Google time to revisit URLs and re-index them (or remove them from the index!)
Did you know that a 404 only says the URL was not found, without saying whether it is gone for good, so Google may treat it as something that could come back? The status code you are looking to serve is 410 (gone), which is a harder signal.
Robots.txt (for Google) does in fact support wildcards. It's not full regex; in fact the only wildcard supported is "*" (asterisk: matching any character or string of characters), alongside "$" to mark the end of a URL. You could supplement with a rule like this:
User-agent: *
Disallow: /*revslider*
That should, theoretically, stop Google from crawling any URL that contains the string "revslider". Be sure to **validate** any new robots.txt rules using Google Search Console to check they are working right! Remember that robots.txt affects crawling and **not indexation!**
To give Google a directive not to index a URL, you should use the Meta no-index tag: [https://support.google.com/webmasters/answer/93710?hl=en](https://support.google.com/webmasters/answer/93710?hl=en)
**The steps are:**
- Remove your existing robots.txt rule (which would stop Google crawling the URL and thus stop them seeing a Meta no-index tag or any change in status code)
- Apply status 410 to those pages instead of 404
- Apply Meta no-index tags to the 410'ing URLs
- Wait for Google to digest and remove the pages from its index
- Put your robots.txt rule back to prevent it ever happening again
- Supplement with an additional wildcard rule
- Done!
Hope that helps!
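As a rough illustration of the wildcard rule above, here is a minimal Python sketch of how a Google-style robots.txt pattern could be matched against a URL path. The function name robots_pattern_matches is made up for this example, and it only models "*" plus the "$" end anchor that Google also documents. It is not Google's actual matcher, so real rules should still be validated in Google Search Console.

```python
import re


def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Hypothetical helper: models '*' as "any run of characters" and a
    trailing '$' as "end of URL". Rules match from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


bad_url = "/wp-content/uploads/revslider/templates/this-part-changes/"

# Wildcard rule from the reply: matches any path containing "revslider".
print(robots_pattern_matches("/*revslider*", bad_url))
# Plain prefix rule from the original question: also matches this path.
print(robots_pattern_matches("/wp-content/uploads/revslider/", bad_url))
# An unrelated page is left alone by the wildcard rule.
print(robots_pattern_matches("/*revslider*", "/blog/some-post/"))
```

The wildcard version is broader than the folder rule: it would also catch "revslider" appearing elsewhere in a path, which is the point of adding it as a supplement.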