Robots.txt to disallow /index.php/ path
-
Hi SEOmoz,
I have a problem with my Joomla site (yeah - me too!). I get a large number of /index.php/ URLs despite using a program to handle these issues. The URLs cause indexation errors (404s) in Google. I fixed this issue once before, but the problem persists. So I thought, instead of wasting more time, couldn't I just disallow all paths containing /index.php/?
I don't use that extension, but would it cause me any problems from an SEO perspective?
How do I disallow all index.php's? Is it as simple as: Disallow: /index.php/
-
Hi Cyrus,
Thanks for your reply!
Unfortunately the problem is yet to be fixed. I hope that my disallow will work shortly.
It seems that most of the index.php URLs link to each other internally (and from old /index.php/ pages that no longer exist), which is super weird. How Google found them does not make any sense to me.
I don't believe that external sources are linking to these pages either - I mean, how would they find these links anyway?
-
Hi Mikkel,
Like Chris, I spidered your site and couldn't find any links to /index.php files, which probably indicates one of three things:
- You've fixed the problem - Yay!
- Google is finding those links from external sources
- Or Google found those links at one time in the past, and is still trying to crawl them.
In the Crawl Errors report in Google Webmaster Tools, if you click on the link of each 404, there's often a "linked from" source where you can see where Google discovered the broken link. This is really helpful in rooting out the cause.
Regardless, I'm going to go with #1 and optimistically believe that you were able to fix the problem.
-
If I spider your site I'm not seeing any /index.php URLs. Does that mean you did get Joomla to cooperate with your rewriting?
Or was your problem that you'd previously had URLs indexed with /index.php/ paths and you needed to remove them?
-
Hi Mikkel, I have checked your robots.txt and it looks perfect. If you redirect /index.php to the home page, either in your .htaccess file or with a Joomla plugin, that would be great for you. It's also a permanent solution.
-
Well, I tried the sensible solution of redirecting to the correct URL instead. However, the SEF program is quite limited and keeps on creating new URLs regardless of my modifications. I'm looking for a more permanent solution, and the disallow seems a bit simpler, as I'm not a super programmer.
By the way - thanks for the quick replies, kudos to both of you!
-
Sure, the website in question is www.vauni.dk
I don't think that there are any inbound links to the index.php pages. They are not easily found.
-
Couldn't you rewrite those /index.php/ URLs to remove the /index.php/?
Like this in .htaccess:
RewriteRule ^(.*)$ /index.php/$1 [L]
I've only used Joomla once, but there must be a way to configure Joomla to just use "/" instead of "/index.php/".
Update:
Here's a solution to your /index.php/ issue:
http://www.eprcreations.com/remove-index-php-from-joomla-urls/
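Once you've updated that, and have your URLs working properly without the /index.php/, you could add a redirect that runs in the opposite direction from the rewrite rule above, so that all your old /index.php/ URLs are 301'd to your new ones. The RewriteCond restricts it to URLs the visitor actually requested, so it doesn't loop with the internal rewrite:
RewriteCond %{THE_REQUEST} \s/index\.php/
RewriteRule ^index\.php/(.*)$ /$1 [R=301,L]
Put it underneath the RewriteBase / line they describe in that post.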
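To make the intended mapping concrete, here is a rough Python sketch of what the 301 is meant to do: take an old /index.php/ path and return the clean path it should redirect to. This is only an illustration of the idea, not something Joomla or Apache runs; the function name redirect_target and the sample path are made up for the example.

```python
import re
from typing import Optional

# Matches legacy paths such as /index.php/some/page and captures the rest.
# The pattern is an assumption chosen for this illustration.
LEGACY_PREFIX = re.compile(r"^/index\.php/(.*)$")


def redirect_target(path: str) -> Optional[str]:
    """Return the clean path a legacy /index.php/ path should 301 to.

    Returns None when the path is already clean and needs no redirect.
    """
    match = LEGACY_PREFIX.match(path)
    if match is None:
        return None
    return "/" + match.group(1)


if __name__ == "__main__":
    # "/index.php/produkter" is a hypothetical legacy URL path.
    print(redirect_target("/index.php/produkter"))
    print(redirect_target("/produkter"))
```

Paths that are already clean come back as None, which is the case where no redirect should fire.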
-
Hi Mikkel,
Do you have inbound links pointing to your index.php pages? If so, it might affect your SEO. Disallow: /index.php/ is perfect, but after implementing it, don't interlink those index.php pages. Can you share your website URL so that I can show you how to do it with an example?
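If you want to sanity-check the rule before relying on it, a small Python sketch with the standard library's urllib.robotparser can show which paths it blocks. The robots.txt body below is an assumption (only the Disallow line comes from this thread), and the sample paths are hypothetical. Note that the rule is a prefix match, so it covers paths that begin with /index.php/ and nothing else, and that this parser does plain prefix matching only, so it won't tell you how a crawler treats wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; only the Disallow line is taken from the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /index.php/
"""


def is_allowed(path: str, user_agent: str = "*") -> bool:
    """Return True if the given path may be crawled under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths used purely for illustration.
    for path in ("/index.php/produkter", "/index.php", "/produkter"):
        print(path, is_allowed(path))
```

The bare /index.php stays crawlable under this rule, because it lacks the trailing slash that the Disallow prefix requires.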