Robots.txt help
-
Hi,
We have a blog that is killing our SEO.
We need to disallow:
Disallow: /Blog/?tag*
Disallow: /Blog/?page*
Disallow: /Blog/category/*
Disallow: /Blog/author/*
Disallow: /Blog/archive/*
Disallow: /Blog/Account/.
Disallow: /Blog/search*
Disallow: /Blog/search.aspx
Disallow: /Blog/error404.aspx
Disallow: /Blog/archive*
Disallow: /Blog/archive.aspx
Disallow: /Blog/sitemap.axd
Disallow: /Blog/post.aspx
But Allow everything below /Blog/Post
The disallow list seems to keep growing as we find issues. So rather than adding every area to disallow to our robots.txt, is there an easy way to just say Allow /Blog/Post and ignore the rest? How do we do that in robots.txt?
Thanks
-
These: http://screencast.com/t/p120RbUhCT
They appear on every page I looked at and take up the entire area "above the fold", so the content ends up "below the fold".
-Dan
-
Thanks Dan, but what grey areas, what url are you looking at?
-
Ahh. I see. You just need to "noindex" the pages you don't want in the index. As for how to do that with BlogEngine, I am not sure, as I have never used it before.
But I think a bigger issue is the giant box areas at the top of every page. They are pushing your content way down. That's definitely hurting UX and making the site a little confusing. I'd suggest improving that as well.
-Dan
-
Hi Dan, Yes sorry that's the one!
-
Hi There... that address does not seem to work for me. Should it be .net? http://www.dotnetblogengine.net/
-Dan
-
Hi
The blog is www.dotnetblogengine.com
The content only exists once on the blog; it can just be accessed in lots of different ways.
-
Andrew
I doubt that one thing made your rankings drop so much. Also, what type of CMS are you on? Duplicate content like that should be controlled through indexation for the most part, but I am not recognizing that type of URL structure as any particular CMS?
Are just the title tags duplicate or the entire page content? Essentially, I would either change the content of the pages so they are not duplicate, or if that doesn't make sense I would just "noindex" them.
-Dan
-
Hi Dan,
I am getting duplicate content errors in WMT like
This is because tag=ABC and page=1 are both different ways to get to www.mysite.com/Blog/Post/My-Blog-Post.aspx
To fix this I have removed the URLs www.mysite.com/Blog/?tag=ABC and www.mysite.com/Blog/?Page=1 from GWMT, and set robots.txt up like this:
User-agent: *
Disallow: /Blog/
Allow: /Blog/post
Allow: /Blog/Post
I hope this solves the duplicate content issue and stops it happening again.
Since doing this my rankings in the SERPs have dropped massively. Is what I have done wrong or bad? How would I fix it?
Hope this makes sense. Thanks for your help on this, it's appreciated.
Andrew
-
Hi There
Where are they appearing in WMT? In crawl errors?
You can also control crawling of parameters within webmaster tools - but I am still not quite sure if you are trying to remove these from the index or just prevent crawling (and if preventing crawling, for what reason?) or both?
-Dan
-
Hi Dan,
The issue is my blog had tagging switched on, and it caused canonicalization mayhem.
I switched it off, but the tags still appear in Google Webmaster Tools (GWMT). I removed the URLs via GWMT, but they are still appearing. This has also caused me to plummet down the SERPs! I am hoping this is why my rankings dropped, anyway! I am now trying to get to a point where Google sees just my blog posts and not the ?Tag or ?Author or any other parameter that is going to cause me canonicalization pain. In the meantime I am sitting here waiting for Google to bring me back up the SERPs when things settle down, but it has been 2 weeks now, so maybe something else is up?
-
I'm wondering why you want to block crawling of these URLs - I think what you're going for is to not index them, yes? If you block them from being crawled, they'll remain in the index. I would suggest considering robots meta noindex tags - unless you can describe in a little more detail what the issue is?
-Dan
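To check whether a page really carries the robots meta noindex tag, you can parse its HTML. The Python sketch below is only an illustration under assumptions: the class name, function name and sample HTML are made up for the example and are not BlogEngine output.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them to "".
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical sample pages: a tag listing with noindex, and a normal post.
tag_page = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
post_page = "<html><head><title>My Blog Post</title></head><body></body></html>"

print(has_noindex(tag_page))   # True
print(has_noindex(post_page))  # False
```

Keep in mind that a crawler can only see this tag if robots.txt does not block the page, which is why noindex and Disallow work against each other on the same URL.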
-
Ok then you should be all set if your tests on GWMT did not indicate any errors.
-
Thanks it goes straight to www.mysite.com/Blog
-
Yup, I understand that you want to see your main site. This is why I recommended blocking only /Blog and not / (your root domain).
However, many blogs have a landing page. Does yours? In other words, when you click on your blog link, does it take you straight to Blog/posts or is there another page in between, eg /Blog/welcome?
If it does not go straight into Blog/posts you would want to also allow the landing page.
Does that make sense?
-
The structure is:
www.mysite.com - want to see everything at this level and below it
www.mysite.com/Blog - want to BLOCK everything at this level
www.mysite.com/Blog/posts - want to see everything at this level and below it
-
Well what Martijn (sorry, I spelled his name wrong before) and I were saying was not to forget to allow the landing page of your blog - otherwise this will not be indexed as you are disallowing the main blog directory.
Do you have a specific landing page for your blog or does it go straight into the /posts directory?
I'd say there's nothing wrong with allowing both Blog/Post and Blog/post just to be on the safe side...honestly not sure about case sensitivity in this instance.
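On the case question: Google documents robots.txt path values as case-sensitive, so there is no switch to make a rule case-insensitive, and listing both spellings is the practical workaround. A minimal Python sketch of the idea, assuming plain prefix matching with no wildcards; the function name and the sample post URLs are made up for the example:

```python
def rule_matches(rule_path: str, url_path: str) -> bool:
    """Plain prefix match; str.startswith compares case-sensitively."""
    return url_path.startswith(rule_path)


# The lowercase rule only covers the lowercase URL form.
print(rule_matches("/Blog/post", "/Blog/post/my-blog-post.aspx"))  # True
print(rule_matches("/Blog/post", "/Blog/Post/My-Blog-Post.aspx"))  # False

# Adding the second spelling covers the other form.
print(rule_matches("/Blog/Post", "/Blog/Post/My-Blog-Post.aspx"))  # True
```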
-
"We're getting closer David, but after reading the question again I think we both miss an essential point ;-)" What was the essential point you missed? Sorry, I don't understand. I don't want to make a mistake in my robots.txt, so I would like to be 100% sure of what you are saying.
-
Thanks guys so I have
User-agent: *
Disallow: /Blog/
Allow: /Blog/post
Allow: /Blog/Post
That works. My home page also works. Is there anything wrong with including both uppercase "Post" and lowercase "post"? It is lowercase on the site, but I want the uppercase "P" just in case. Is there a way to make the entry case-insensitive?
Thanks
-
Correct, Martijin. Good catch!
-
There was a reason that I said he should test this!
We're getting closer David, but after reading the question again I think we both miss an essential point ;-). As it stands, we now also exclude the robots from crawling the 'homepage' of the blog. If you have such a homepage, don't forget to also Allow it.
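One way to allow only the blog homepage is the `$` end-of-URL anchor, which Google and RFC 9309 support alongside the `*` wildcard. The Python sketch below shows roughly how such patterns match. The rule `Allow: /Blog/$` is an assumption of mine and not something from this thread, and the helper names are hypothetical.

```python
import re


def pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern ('*' wildcard, optional trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """True if the pattern matches from the start of the path (prefix semantics)."""
    return pattern_to_regex(pattern).match(path) is not None


# Hypothetical rule "Allow: /Blog/$" matches the blog homepage only.
print(matches("/Blog/$", "/Blog/"))          # True
print(matches("/Blog/$", "/Blog/?tag=ABC"))  # False

# A trailing '*' adds nothing, because rules are prefix matches anyway.
print(matches("/Blog/?tag*", "/Blog/?tag=ABC"))  # True
print(matches("/Blog/?tag", "/Blog/?tag=ABC"))   # True

# Without the trailing slash, /Blog is not covered by a /Blog/ rule at all.
print(matches("/Blog/", "/Blog"))  # False
```

The last line matters if the blog link goes to www.mysite.com/Blog with no trailing slash: a `Disallow: /Blog/` rule does not match that URL in the first place.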
-
Well, no point in a blog that hurts your seo
I respectfully disagree with Martijin; I believe what you would want to do is disallow the Blog directory itself, not the whole site. It would seem if you Disallow: / and Allow: /Blog/Post that you are telling SEs not to index anything on your site except for /Blog/Post.
I'd recommend:
User-agent: *
Disallow: /Blog/
Allow: /Blog/Post
This should block off the entire Blog directory except for your post subdirectory. As Maritijin stated; always test before you make real changes to your robots.txt.
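To see how the two rule sets differ, here is a rough Python sketch of the precedence Google documents: the longest matching rule wins, and Allow wins a tie. It assumes plain prefix rules with no wildcards; the function name and the sample URLs (including /products.aspx) are made up for the example.

```python
def is_allowed(rules, path):
    """rules: list of (directive, rule_path). Longest matching rule wins; Allow wins ties."""
    best = None  # (rule length, is_allow) of the strongest match so far
    for directive, rule_path in rules:
        if path.startswith(rule_path):
            candidate = (len(rule_path), directive.lower() == "allow")
            # Tuples compare by length first, then True (Allow) beats False (Disallow).
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the URL may be crawled.
    return True if best is None else best[1]


blog_only = [("Disallow", "/Blog/"), ("Allow", "/Blog/Post")]
whole_site = [("Disallow", "/"), ("Allow", "/Blog/Post")]

for path in ["/", "/products.aspx", "/Blog/?tag=ABC", "/Blog/Post/My-Blog-Post.aspx"]:
    print(path, is_allowed(blog_only, path), is_allowed(whole_site, path))
```

With the first rule set the homepage and other site pages stay crawlable and only the non-post blog URLs are blocked. With `Disallow: /` everything except /Blog/Post is blocked. A crawler that applies the first matching rule instead of the longest one would block /Blog/Post with this ordering, which is one more reason to test.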
-
That would be something like the rules below. Please test this within Google Webmaster Tools to check that it works, because I don't want to screw up your whole site. What this does is disallow your complete site and allow just the /Blog/Post URLs.
User-agent: *
Disallow: /
Allow: /Blog/Post