Kill your .htaccess file, take the risk to learn a little
-
Last week I was browsing Google's index with a "site:www.mydomain.com" search, wanting to see what Google had indexed for my site. I came across a URL that had been mistakenly indexed. It went something like this:
www.mydomain.com/link1/link2/link1/link4/link3
I didn't understand why Google had indexed a page of mine like that, since the "link" pages were site-wide links on my main navigation bar. The URL seemed to be looping infinitely. So I started checking how many of these Google had indexed and found about 20 pages. I went through the process of removing the URLs in Webmaster Tools, but then I wanted to know why it was happening. I discovered that I had mistakenly placed some links in my site's header in this manner:
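Something like this, where the link names are placeholders but the mistake is real: none of the hrefs start with a "/".

<!-- header navigation: relative hrefs, missing the leading "/" (link names are placeholders) -->
<a href="link1/">Link 1</a>
<a href="link2/">Link 2</a>
<a href="link3/">Link 3</a>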
If you know HTML you will realize that by not placing the "/" at the front of the link, I was telling the browser (and Googlebot) to append that link to the URL of the page it was currently on. What this did was create an infinite loop of links, which is not good.
Basically, when Google went to www.mydomain.com/link1/ it found the other links, which told it to append their paths to the existing URL and then crawl the result.
Something like: www.mydomain.com/link1/link2/...
When you do not add the "/" in front of the directory you are linking to, the link resolves relative to the current page, and this is what happens. The "/" refers to the root, so if you place it in front of the directory you are linking to, the browser always starts from the root and appends the path from there.
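A quick worked example using the same placeholder names. Starting from the page www.mydomain.com/link1/link2/:

href="link3/" resolves to www.mydomain.com/link1/link2/link3/ (relative, so every crawl digs one level deeper)
href="/link3/" resolves to www.mydomain.com/link3/ (root-relative, so it always lands in the same place)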
So what did I do?
Even though I was able to find about 20 URLs using the "site:" search method, there had to be more out there. I kept searching but could not find any more, and I was not convinced that was all of them.
The light bulb went on at this point.
My .htaccess file contained many 301 redirects from my attempts to point those pages at a real page, even though there were no truly relevant pages to redirect to. So how could I find out everything Google had indexed for me, since Webmaster Tools only reports the top 1,000 links?
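For context, the rules were one-off redirects along these lines (the paths are placeholders; Redirect 301 is Apache's standard mod_alias directive):

# one-off 301s trying to paper over the looping URLs (placeholder paths)
Redirect 301 /link1/link2/link1/ http://www.mydomain.com/link1/
Redirect 301 /link1/link3/link1/ http://www.mydomain.com/link1/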
I decided to kill my .htaccess file. Knowing that Google is "forgiving" when major changes happen to a site, I knew it would not simply kill my site the moment the file was removed.
I waited 3 days and then, BOOM! Webmaster Tools reported that it had found a ton of 404s on my site. I looked at the Crawl Errors report and there they were: all those infinite-loop links that I knew had to be out there, now visible.
How many were there?
In the first crawl Google found over 5,000 of them. OMG! Can you imagine the "low quality" score I was getting on those pages? Seeing all those links, I was able to pick out about four patterns: everything looped through /link1/link2/, /link1/link3/, /link1/link4/, or /link1/link5/.
Now my issue was that I wanted to keep all the URLs pointing to www.mydomain.com/link1, but anything after that needed to go. I went into my robots.txt file and added this:
User-agent: *
Disallow: /link1/link2/
Disallow: /link1/link3/
Disallow: /link1/link4/
Disallow: /link1/link5/
There were many more pages indexed that went deeper into those paths, but I knew I wanted anything after the second directory gone, since that is where the loop I detected started. With that, those rules covered at least 5,000 links that I knew of, if not more.
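As a side note, Googlebot (though not every crawler) understands the * wildcard in robots.txt, so a single pattern could have covered every looping branch at once, something like:

User-agent: Googlebot
# block anything nested below /link1/ while leaving /link1/ itself crawlable
Disallow: /link1/*/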
What did I learn from this?
Kill your .htaccess file for a few days and see what comes back in your reports. You might learn something.
After doing this I simply put my .htaccess file back, and now I am on my way to removing a ton of "low quality" links I didn't even know I had.
-
Interesting post. Yeah, the .htaccess file is the most important file out there and it is easy to mess up (as I am sure most everyone has at one time or another).
Related Questions
-
Client wants to repackage in-depth content as PowerPoint files and embed on site. SEO implications?
Hi, I've a client who is planning to build out "courses" for their site. Their ultimate goal is to have videos (which will have transcriptions), but since the videos are not yet ready they want to launch with the content in PowerPoint format instead. Thing is, the pages they have now are really good, in-depth content. In short, it seems videos are Phase 2, so their Phase 1 preference is to take all their course content, put it into PowerPoint slides, and add those to their website. While I understand standalone files like PDFs and PPTs can be indexable, my recollection is that embedded slides are not (like SlideShare). Is that correct? My worry is that taking this content and reformatting it into PowerPoints will hurt their site instead of helping. Any insight is appreciated!
Technical SEO | CR-SEO
-
Where does rel=canonical go? One file that manages sort order, view, filters, etc...
Where do I put the rel=canonical when the search.cfm (using URL rewrite) page is the one and only page, just using URL parameters to control sort, filter, view, etc.? Do I just put the rel=canonical at the top of the search.cfm page? The duplicate content issues I am getting are: https://www.domain.com/tx/austin/ https://www.domain.com/tx/austin/?d=25&h=&s=r&t=&v=l&a= Just want to be clear, since Moz Pro is picking up both URLs but it's only really one file, search.cfm. Thanks in advance for your help.
Technical SEO | ErnieB
-
Is there a limit to how many URLs you can put in a robots.txt file?
We have a site that has way too many URLs caused by our crawlable faceted navigation. We are trying to purge 90% of our URLs from the indexes. We put noindex tags on the URL combinations that we do not want indexed anymore, but it is taking Google way too long to find the noindex tags. Meanwhile we are getting hit with excessive-URL warnings and have been hit by Panda. Would it help speed the process of purging URLs if we added the URLs to the robots.txt file? Could this cause any issues for us? Could it have the opposite effect and block the crawler from finding the URLs, but not purge them from the index? The list could be in excess of 100MM URLs.
Technical SEO | kcb8178
-
Getting Google to index a large PDF file
Hello! We have a 100+ MB PDF with multiple pages that we want Google to fully index on our server/website. First of all, is it even possible for Google to index a PDF file of this size? It's been up on our server for a few days, and my colleague did a Googlebot fetch via Webmaster Tools, but it still hasn't happened yet. My theories as to why this may not work: A) We have no actual link(s) to the PDF anywhere on our website. B) This PDF is approx 130 MB and very slow to load. I added some compression to it, but that only got it down to 105 MB. Any tips or suggestions on getting this thing indexed in Google would be appreciated. Thanks!
Technical SEO | BBEXNinja
-
Take a good number of existing landing pages offline because of low traffic, cannibalization, and thin content
Hello Guys, I decided to take about 20% of my existing landing pages offline (about 50 of 250, which were launched about 8 months ago). Reasons are: These pages sent no organic traffic at all in these 8 months. Often really similar landing pages exist (just minor keyword-targeting differences, and I would call the content "thin"). Moreover, I had some Panda issues in October: basically I ranked with multiple landing pages for the same keyword in the top ten, and in October many of these pages dropped out of the top 50. I also realized that for some keywords the landing page dropped out of the top 50 while another landing page climbed from 50 to the top 10 in the same week; the next week the new landing page dropped to 30, the next week out of the top 50, and the old landing page came back to the top 20, but not to the top ten... This all happened in October. Did anyone observe such things as well? Those are the reasons why I came to the conclusion to take these pages offline and integrate some of the good content into the other similar pages, to target more broadly with one page instead of two. And I hope to benefit from this with my remaining landing pages. I hope all agree? Now to the real question: Should I redirect all the pages I take offline? Basically they get no traffic at all and none of them should have external links, so I will not give away any link juice. Or should I just remove the URLs in Google Webmaster Tools and then take them offline? Like I said, the pages are basically dead and personally I see no reason for these 50 redirects. Cheers, Heiko
Technical SEO | _Heiko_
-
Where Is This Being Appended to Our Page File Names?
I have worked over the last several months to eliminate duplicate page titles at our site. Below is one situation that I need your advice on. Google Webmaster Tools is reporting several of our pages with duplicate titles, such as this one: This is a valid page at our Web store: http://www.audiobooksonline.com/159179126X.html This is an invalid page that Google says is a duplicate of the one above: http://www.audiobooksonline.com/159179126X.html?gdftrk=gdfV2138_a_7c177_a_7c432_a_7c9781591791263 Where might the code ?gdftrk=.... be coming from? How to get rid of it?
Technical SEO | lbohen
-
How can I make Google Webmaster Tools see the robots.txt file when I am doing a .htaccess redirect?
We are moving a site to a new domain. I have set up an .htaccess file and it is working fine. My problem is that Google Webmaster Tools now says it cannot access the robots.txt file on the old site. How can I make it still see the robots.txt file when the .htaccess is doing a full-site redirect? .htaccess currently has:

Options +FollowSymLinks -MultiViews
RewriteEngine on
# match the old host, with or without www
RewriteCond %{HTTP_HOST} ^(www\.)?michaelswilderhr\.com$ [NC]
# send everything to the new domain with a 301
RewriteRule ^ http://www.s2esolutions.com/ [R=301,L]

Google Webmaster Tools is reporting: "Over the last 24 hours, Googlebot encountered 1 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 100.0%."
Technical SEO | RalphinAZ
-
Ensuring Assets (PDFs, PowerPoint Files, Word Docs, etc.) are Indexable on Site
Hi there - I'm working on an educational site in which users will be able to search our repository of PDF articles, PowerPoint files, and so on through an on-site search engine. What is the best way to ensure each of these documents/assets is indexable by Google, since they technically don't reside on an HTML page; they are just pulled up if the user searches for them? The site itself is just a few pages, but the files, articles, and videos in the repository number in the hundreds. Should I just name and tag them properly and make sure they're all included in an XML sitemap? Anything else suggested? Thanks very much!
Technical SEO | MedThinkCommunications