Robots.txt and Magento
-
Hi,
I am working on getting my robots.txt up and running, and I'm having lots of problems with the robots.txt my developers generated: www.plasticplace.com/robots.txt
I ran the robots.txt through a syntax-checking tool (http://www.sxw.org.uk/computing/robots/check.html), and this is what it came back with: http://www.dcs.ed.ac.uk/cgi/sxw/parserobots.pl?site=plasticplace.com. There seem to be many errors in the file.
Additionally, I looked at our robots.txt in WMT, and it said the crawl was postponed because the robots.txt is inaccessible. What does that mean?
A few questions:
1. Is there a need for all the lines that start with "#"? I don't think they're necessary, but correct me if I'm wrong.
2. Furthermore, why are we blocking so many things on our website? The robots can't get past anything that requires a password anyhow, but again, correct me if I'm wrong.
3. Is there a reason it can't just look like this:
User-agent: *
Disallow: /onepagecheckout/
Disallow: /checkout/cart/
I do understand that Magento has certain folders you don't want crawled, but is all of this necessary, and why are there so many errors?
-
Yes, your short robots.txt idea would create a huge problem.
In your Magento admin, go to Catalog > URL Rewrite Management in the menu.
That is the Magento feature that creates all the "pretty" URLs, and on that page you will see a table. If you take a value from the Target Path column and paste it after your site's domain, for example domain.com/value_in_target_path...
...you'll see that the page loads fine. You don't want Google to rank those pages under the "messy" URL, so that's why you need all that stuff in your robots.txt.
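For illustration only, the Magento-specific part of a robots.txt usually looks something like the lines below. These are example paths commonly blocked on Magento stores, not a copy of your actual file, so check your own rewrites rather than copying this:

User-agent: *
Disallow: /index.php/
Disallow: /catalog/product/view/
Disallow: /catalog/category/view/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /customer/account/

Each of those is an internal "real" path that Magento can still serve even after a pretty URL rewrite exists for the same page.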
-
I am a bit confused. Are you saying that technically my Magento site has two different URLs that can both be indexed: one with a (messy) URL and another with a vanity URL? That would create major duplicate content issues! Robots.txt would not solve such a complex issue.
Am I missing something?
-
My developer said they custom-configured it to block the files that needed blocking for Magento.
Do you think I can simply make it look like this:
User-agent: *
Disallow: /onepagecheckout/
Disallow: /checkout/cart/
and then handle the rest of the blocking in WMT?
-
"3. Is there a reason it can't just look like this?"
Yes, it would generate a lot of duplicate content issues. For example, your robots.txt has the following line:
Disallow: /catalog/category/view/ -> That's the "real" category URL; in Magento you can access any category either via /catalog/category/view/id or via the "pretty" URL. Because you disallow the "real" URL, only the pretty URL remains visible to search engines. The same rule applies to many other parts of the robots.txt.
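To make that concrete, here is a hypothetical example (the category ID and the rewrite path are invented) of the two URLs Magento can serve for one and the same category page:

http://www.plasticplace.com/catalog/category/view/id/15 (the "real" internal URL)
http://www.plasticplace.com/trash-bags.html (the "pretty" rewritten URL)

Disallowing /catalog/category/view/ leaves only the second version for search engines to crawl, so the duplicate never gets indexed.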
-
I assume this is a robots.txt that has been automatically created by Magento? Or has it been created by a developer?
I ran it through a tool and it showed 1 error and 10 warnings, so I would say you definitely need to do something about it.
The reason for all those disallows is to try to stop search engines indexing those pages (whether they would even find them to index if the rules were not there is debatable).
What you could do is set up robots.txt as you have suggested and then stop the search engines indexing the directories or pages you don't want via the appropriate webmaster tools.
I don't like displaying a lot of 'don't index' paths in the robots.txt, as it is pretty much telling any hacker or nasty spider where your weak points may be.
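As an example of that alternative, instead of advertising a sensitive or duplicate path in a public robots.txt, you can let the page be crawled and put a meta robots tag in its head section, something like:

<meta name="robots" content="noindex, follow" />

The page stays out of the index, but its location is never listed in a public file. (If your Magento version is 1.4 or newer, there should also be a canonical link meta tag setting under System > Configuration > Catalog, which is the more direct fix for the duplicate URL concern raised above.)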