Welcome to the Q&A Forum

ShaMenz

Hi Zippy-Bungle,

To understand first why the 803 error was reported:

When a page is called, the web server sends header details of what's to be displayed. You can see a complete list of these HTTP header fields here.

One of the headers sent by the web server is Content-Length, which indicates how many bytes the body of the response is going to contain. So let's say for example that Content-Length is 100 bytes but the server only sends 74 bytes (it may be valid HTML, but the length does not match the Content-Length indicated).

Since the web server only sent 74 bytes and the crawler expected 100 bytes, the crawler sees the TCP connection close while it is still trying to read the number of bytes that the web server said it was going to send. So you get an 803 error.
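If you want to see what that check looks like, here is a minimal Python sketch. It is not how Roger actually does it (I don't know his internals); the function name and the sample response are made up for illustration, and it assumes a plain response with no chunked transfer encoding or compression.

```python
def content_length_mismatch(raw_response: bytes):
    """Return (declared, actual) byte counts when they differ, else None.

    Assumes a raw HTTP/1.x response: headers, a blank line, then the body.
    """
    head, _, body = raw_response.partition(b"\r\n\r\n")
    declared = None
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())
    if declared is None or declared == len(body):
        return None
    return declared, len(body)


# Hypothetical response: the header promises 100 bytes, only 74 arrive.
raw = b"HTTP/1.1 200 OK\r\nContent-Length: 100\r\n\r\n" + b"x" * 74
print(content_length_mismatch(raw))  # (100, 74)
```

A crawler that reads until it has the declared 100 bytes would hit the closed connection at byte 74, which is the situation described above.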

Now browsers don't care when a mismatch like this happens because they are generally tolerant of an incorrect Content-Length, but Roger Mozbot (the Moz crawler, identified in your logs as RogerBot) is on a mission to show you any errors that might be occurring. So Roger is configured to detect and report such errors.

The degree to which an 803 error will adversely affect crawl efficiency for search engine bots such as Googlebot, Bingbot and others will vary, but the fundamental problem with all 8xx errors is that they result from violations of the underlying HTTP or HTTPS protocol. The crawler expects all responses it receives to conform to the HTTP protocol and will typically throw an exception when encountering a protocol-violating response.

As with 4xx and 5xx errors, 8xx errors generally indicate a badly misconfigured site, so fixing them should be a priority to ensure that the site can be crawled effectively. It is worth noting here that Bingbot is well known for being highly sensitive to technical errors.

So what makes the mismatch happen?

The problem could be originating from the website itself (page code) or from the server (the machine or the web server software). There are two broad sources:

Crappy code
Buggy server

I'm afraid you will need to get a tech who understands this type of problem to work through each of these possibilities to isolate and resolve the root cause.

The Moz Resource Guide on HTTP Errors in Crawl Reports is also worth a read in case Roger encounters any other infrequently seen errors.

Hope that helps,

Sha

ShaMenz

Hi Charles,

I would say

Keep an eye on #Mozcon and connect with some people before you get to Seattle
Stay over and take a Tour of the Mozplex
Spend some real time checking out the Speaker photos on this page so you know who they are and what they look like...there is a real chance you will find yourself standing next to them in the lunch queue, sitting down to breakfast or lining up to sing Karaoke with them!
Be at the conference early every day and try to sit with new people as well as catch up with friends you have already made.
Read this post from Zeph, hit him up on Twitter and get him to hook you up for a spot at the informal Meetup on the night before the conference
Make a list of awesome people that you would like to meet while there and keep it handy
Make a donation to the Myrtle Point Library in honor of Michael Cottam's parents at the conference and help send love to Michael and his family from everyone in the Moz "family"
Get some tips on where to find the Best coffee in Seattle from Jonathon Coleman
Arm yourself with Rand's Restaurant and Bar Guide to Seattle and The Not-so-Short Shortlist of Moz's Top Seattle Restaurants, Bars, and Activities for MozCon 2013
Bring business cards and make sure you make notes on the cards people give you so you can easily put names to faces and shared experiences when you get home.

Oh...and you might consider joining a PosSEO, becoming a PosSEM or something like that, but whatever you do, say G'day!

Sha

ShaMenz

Hi Rob,

First of all ...this is not a domain redirect.

What they are actually doing is pulling the content of your site into their own using an iframe.

They are not able to do anything through your site by doing this as the content is rendered by the browser and not their server. So, the question is why they would do it.
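If you want to confirm this for yourself, a rough Python sketch like the one below lists the iframe sources in a page's HTML. The class and function names are my own, and the sample markup is hypothetical; you would feed it the source of the page on their domain.

```python
from html.parser import HTMLParser


class IframeFinder(HTMLParser):
    """Collects the src attribute of every iframe in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html: str):
    finder = IframeFinder()
    finder.feed(html)
    return finder.sources


# Hypothetical page that does nothing except frame another site.
page = '<html><body><iframe src="http://www.example.com/" width="100%"></iframe></body></html>'
print(iframe_sources(page))  # ['http://www.example.com/']
```

If your own URL shows up in that list, the page is framing your site rather than redirecting to it.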

Best guess: This could be someone who is planning to set up some kind of low quality site (possibly full of ads), but wants to build up backlinks for the domain. They can go to blogs, forums etc and leave comments & posts with their URL. The webmaster checks the URL and sees your site, so approves the comment or post...after a few months of doing this, BAM! they remove the iframe and let loose their real content.

hmmm...

Sha

ShaMenz

Hi waspman,

No, you are absolutely not doing anything wrong.

The simple answer is that you are acquiring links primarily to the sub-domains and not the root domains.

To explain:

"mozRank is SEOmoz's global link popularity score. It compares the relative link value (ranking power) between URLs on the Internet. It is similar in premise to Google's original PageRank metric but is updated more frequently and offers greater precision."

You can read more about all of the SEOmoz Metrics at the Open Site Explorer explanatory page.

So, if you go to Open Site Explorer and run a query for your sub-domain you will see the number of external links to that URL. If you then remove the www and run the query for the root domain you will see that there are 0 links to that URL.

Since mozRank = link popularity - it should be higher for the sub-domain because that is where the links are pointing, and of course you want to ensure that your links are always consolidated to just one domain. So, things are exactly as they should be.

Hope that helps,

Sha

ShaMenz

Hi Rick,

If you wish to use the robots.txt method to disallow all or part of your site's https protocol, you simply need to load two separate robots.txt files.

The http and https protocols are basically viewed by bots as if they were two completely separate root domains (which I guess you already know as you have mentioned the fact that port 443 is used for the secure protocol).

Google's advice is that to use this method, you should have a separate robots.txt file for each protocol with code as follows:

For your http protocol (http://www.startuploans.org/robots.txt)

User-agent: *
Allow: /

For the https protocol (https://www.startuploans.org/robots.txt)

User-agent: *
Disallow: /
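To sanity check the two files before you upload them, you could run something like this Python sketch. It uses the standard library's urllib.robotparser; the helper name and the /some-page path are just placeholders I made up.

```python
from urllib.robotparser import RobotFileParser

HTTP_ROBOTS = "User-agent: *\nAllow: /"
HTTPS_ROBOTS = "User-agent: *\nDisallow: /"


def allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether the agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Each protocol is judged against its own robots.txt file.
print(allowed(HTTP_ROBOTS, "http://www.startuploans.org/some-page"))    # True
print(allowed(HTTPS_ROBOTS, "https://www.startuploans.org/some-page"))  # False
```

This only tells you what a well-behaved bot is asked to do. It says nothing about whether the page ends up in the index.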

However, blocking crawlers with robots.txt is not the most reliable method for excluding pages from search engines. The reason for this is that the page can still be indexed if it happens to be found via a link from another page. Basically, the robots.txt is the sign on the front door that says "Please stay out of our house", but it is never seen by the people who enter via the back door or climb in a window!

The most reliable method of excluding pages is to add the noindex meta tag as suggested by MagentoWebDeveloper and Alan. When a bot encounters the noindex meta tag it will send a signal to the search engine to de-index the page and there is no further problem.

I would generally use noindex, follow rather than noindex, nofollow as the nofollow tag will stop the flow of link value through your site. In most cases, as long as the noindex is in place, there is no reason to be worried about the links on the pages being followed.
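As a quick way to confirm which directives a page is actually serving, here is a small Python sketch. The class and function names are hypothetical, and it only looks at the robots meta tag, not at any X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_directives(html: str):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<head><meta name="robots" content="noindex, follow" /></head>'
found = robots_directives(page)
print("noindex" in found, "nofollow" in found)  # True False
```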

You should NEVER use both methods at the same time. If robots.txt blocks a page, the bot never crawls it, so it never sees the noindex tag.

Hope that helps,

Sha

ShaMenz

Hmmm...

Hi echo 1,

My view of Press Releases is quite different. The sole aim of a press release is to provide writers/bloggers/journalists with the key information they need to research and write stories or articles of their own which will promote your site (or your business, products & services).

A Press Release that is likely to be successful in achieving its aim is by necessity quite structured, following a pre-determined format and containing particular facts and information ordered quite specifically to make the prospective writer's job easier. So, by its very nature, a good Press Release is not what I would consider suitable page content for my site. For this reason I would never consider adding releases to my on site content.

Of course, the other natural conclusion from this is that if my Press Release reads like great page content for my site, then it does not follow the basic principles of a well constructed release and should be looked at closely before it is submitted.

If you are submitting to the major online Release sites, they all have resources which will basically teach you the format that you should use for your releases.

Hope that helps,

Sha

ShaMenz

Hi Robert,

I think I've picked up on all of the questions here (there's a lot going on!) and have borrowed some awesomeness from my Tech Wizard (Boss) to fill in the exciting bits, so here goes:

I'll start with the easy one first... well actually, none of them are that hard

As a part of our ongoing SEO we check page load speed for our clients. A few days ago a client who has their site hosted by the same company was running very slow (about 8 seconds to load without cache). We ran every check we could and could not find a reason on our end. The client called the host and were told they needed to be on some other type of server (with the host) at a fee increase of roughly $10 per month.

OK, basically the answer to this one would be that your client's site was being throttled back by the host because it was using more bandwidth than was allowed under their existing plan. By moving them to the next plan (the extra $10 per month) the problem is resolved and the site speed returns to normal. Throttling it back gets the client to call... 8(

OK, 1 down and 2 to go...

About 4 months ago we realized we had a group of sites down thanks to monitoring alerts and checked it out. All were on the same IP address and the sites on the other IP address were still up and functioning well. When we contacted the support at first we were stonewalled, but eventually they said there was a problem and it was resolved within about 2 hours. Up until recently we had no problems.

and also

Yesterday, we noticed one group of sites on our server was down and, again, it was one IP address with about 8 sites on it.

OK, you know already that there can be up to 8 IPs on a box and at times something in the network will go bad. There are some variables here as to what is wrong. If you are on a Class C Network and one IP goes down then it means that the Switch or Router has gone bad (whether it is a Switch or a Router is determined by how the host has their hardware set up). If you are on a Class D Network and one IP goes down, then the problem is one of 3 things, the Card, the port, or the cable connecting the two, related to that IP.

The trick is that the person on the phone needs to realise what they are dealing with and escalate it to get the hardware issue resolved. (A recent interaction with that particular host for one of our clients indicated to me that the realising part might be a little hit and miss, so it is good to have an understanding of what might be happening if it happens again.)

Phew! Nearly there, last of all...

On chat with support, they kept saying it was our ISP. (We speed tested on multiple computers and were 22MB down and 9MB up +/-2MB). We ran a trace on the IP address and it went through without a problem on three occasions over about ten minutes. After about 30 minutes the sites were back up.

Here's the twist: we had a couple of people in the building who were on other ISPs try and the sites came up and loaded on their machines. Does anyone have any idea as to what the issue is?

OK this one is all about DNS caching. That particular host (the one that likes lady racing drivers) has a fail-over system in place. This means that if an IP goes down, the domains on that IP will automatically fail-over to another box.

So, if you have looked at those domains on your machine, the DNS lookup (the IP address for each domain) will be cached. When you go back to check the site, your machine is still being sent to the old IP address. The other people in the building are coming to the domain fresh and through a different ISP, so they see those domains because they are back up on the new box.

When the host reps were telling you that it was your ISP, what they really meant was that it had failed-over to a new box and you were still seeing the cached DNS location.
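To make the caching effect concrete, here is a toy Python simulation. It is not a real resolver: the class, the one hour TTL, the domain and the IP addresses (taken from the documentation range) are all assumptions for illustration.

```python
class CachingResolver:
    """Toy stand-in for a machine's or ISP's DNS cache (not a real resolver)."""

    def __init__(self, authoritative, ttl_seconds):
        self.authoritative = authoritative  # dict playing the host's DNS records
        self.ttl = ttl_seconds
        self.cache = {}  # domain -> (ip, expires_at)

    def resolve(self, domain, now):
        cached = self.cache.get(domain)
        if cached and now < cached[1]:
            return cached[0]  # still inside the TTL: answer from cache
        ip = self.authoritative[domain]
        self.cache[domain] = (ip, now + self.ttl)
        return ip


records = {"example.com": "192.0.2.10"}  # original box
office = CachingResolver(records, ttl_seconds=3600)
other_isp = CachingResolver(records, ttl_seconds=3600)

office.resolve("example.com", now=0)      # office looks the site up and caches it
records["example.com"] = "192.0.2.20"     # IP goes down, domain fails over

stale = office.resolve("example.com", now=600)        # 192.0.2.10 (dead box)
fresh = other_isp.resolve("example.com", now=600)     # 192.0.2.20 (new box)
refreshed = office.resolve("example.com", now=3600)   # 192.0.2.20 once the TTL expires
print(stale, fresh, refreshed)
```

That matches what you saw: the people coming in fresh through another ISP reached the new box straight away, and your own machines caught up once the cached entry expired.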

OK, think I covered it all so....that's all Folks!

Have a great holiday weekend!

Sha

ShaMenz

Hi Jampaper,

Just to preface, I spend my days wading through the unnatural links sewer looking at the mess people have gotten themselves into because they thought they were smarter than Google or had that "how would Google ever know" thought in their heads.

EGOL is spot on with his response.

The criterion for undesirable links is not "how would Google ever know it's unnatural?", but "is it unnatural?"

On the "How", here are some things to consider:

Google's reach and ability to mine and interpret data (accurately or not) is so far outside our comprehension that it is probably better we don't even think about it.
Reviewers have a habit of unintentionally sharing information or creating patterns in the way they do things that are a clear red flag for orchestrated reviews.
"These reviews always point to inner pages" ...Ooops! There's a pattern
"We're obviously targeting authoritative sites which do do reviews" ...Ooops! another pattern
Unnatural links on "Authoritative sites" would be more likely to enrage me if I were a member of the Webspam team than those on less influential sites. Let's face it, nobody ever sent me an email suggesting they could sell me links on a crap site
(and this you should take as very tongue in cheek, but perhaps give some thought to implications)
This site has upwards of 400,000 community members. One of them is a guy who is currently on leave from his job at G, but occasionally comments on Moz blog posts that interest him (that's the tongue in cheek part as while it is possible, I seriously doubt he or any of the other Googlers who might be members spend time combing through this site looking for extra work!)

However, it doesn't take much imagination to think there may be other people out there who could be made aware and if they were a certain kind of person might be likely to look into a backlink profile and perhaps lodge a report. Once the manual review process comes into play, the cleverness of the algorithm is irrelevant.

When you have a great product your customers will always be your best sales force! Do things that make THEM want to tell people how THEY feel about you. If you do that enough, even those Authoritative sites will be checking you out for themselves and gifting you natural links

Hope that helps,

Sha

ShaMenz

Hi Jason,

There is obviously something going on with this that is affecting what some crawlers are seeing on your pages.

I ran the Screaming Frog Tool and it shows that the majority of your pages have empty Titles even though I can see that there are Titles loading in the browser.

On checking your code I see that you are using the pragma directive meta element, but it actually appears below the Title element in the code.

Example from your code:

<head>
<title>Are You Socially Awkward? | Branding Blog | The Bullet</title>
**<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />**

So I ran the page through the W3C Markup Validation Service and it also indicates that it sees no character encoding declaration:

Info No Character encoding declared at document level

No character encoding information was found within the document, either in an HTML meta element or an XML declaration.

So, I believe the issue here may be related to the fact that the pragma directive should appear as close as possible to the top of the head element, i.e. before the Title element.

The following is from the W3.org documentation on declaring character encoding. You will see that there is specific reference to the fact that the use of the pragma directive is required in the case of XHTML 1.x documents as yours is:

For XHTML syntax, you should, of course, have " />" after the content attribute, rather than just ">".

The encoding of the document is specified just after charset=. In this case the specified encoding is the Unicode encoding, UTF-8.

The pragma directive should be used for pages written in HTML 4.01. It should also be used for XHTML 1.x documents served as HTML, since the HTML parser will not pick up encoding information from the XML declaration.

In HTML5 you can either use this approach for declaring the encoding, or the newly specified meta charset attribute, but not both in the same page. The encoding declaration should also fit within the first 1024 bytes of the document, so you should generally put it immediately after the opening tag of the head element.
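If you want to test a page against those two points (declaration inside the first 1024 bytes, and ahead of the Title element), a rough Python sketch could look like this. The function name is my own and the regular expressions are deliberately simple, so treat it as a quick check rather than a validator.

```python
import re

# Matches both <meta http-equiv=... content="...; charset=..."> and <meta charset=...>.
META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=", re.IGNORECASE)
TITLE = re.compile(rb"<title\b", re.IGNORECASE)


def check_charset_declaration(document: bytes):
    """Report whether and where the encoding declaration appears."""
    meta = META_CHARSET.search(document)
    title = TITLE.search(document)
    return {
        "declared": meta is not None,
        "within_first_1024_bytes": meta is not None and meta.end() <= 1024,
        "before_title": meta is not None
        and (title is None or meta.start() < title.start()),
    }


# Same ordering as the example from your code: Title first, then the pragma.
doc = (
    b"<head>\n<title>Example</title>\n"
    b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
)
print(check_charset_declaration(doc)["before_title"])  # False
```

Moving the meta element to the line directly after the opening head tag should flip before_title to True. Whether that also clears the empty Titles in Screaming Frog is still my guess until you try it.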

Hope that helps,

Sha

ShaMenz

OK Robert,

First I'm going to tip my hat to Ryan, who has perfectly explained the fact that some of what you see in your site: search can be because the 301's have not yet been recognized by the search engine.

Second, an apology to Alan as I went right to the LAMP solution because of prior knowledge from a previous thread or two that you were going to be talking about .htaccess

Now...I will spell out a couple of things because I have a feeling that you are likely to come across them again in the future and quick recognition can often mean a lot of time saved.

So here goes.

When I first read your question, my little web developer antennae suddenly started twitching! When I hear that there are multiple versions of a file with different file names deployed on a server I generally suspect one of two things:

The site has been developed from a standard Template package, or
There has just been a little "untidiness" taking place in the development process.

In your example, the /contact.php was the original file deployed live to the server, then the /contact-us.php file was created to replace it (presumably for SEO purposes - debatable, but that is a whole other conversation). As I'm sure you can imagine, /contact is pretty common in template packages, although the biggest template producer out there is much easier to spot, as the pages in their templates are always in the format /index-1.htm etc. It may just be that the developer creates their own standard template from an original design and rather than pre-planning and creating the file names to maximize SEO, they create standard page names and change them later.

While there is nothing really wrong with either of these things (unless you are charging the client for an original design and buying a pre-designed template at a fraction of the cost), both methods do open up the way for mistakes and errors to occur. As a result, there are a few things to keep in mind if you are working this way -

It is a much better idea to build on a development server so that none of the files that will become obsolete during the process will be indexed by search engines in the meantime. Tidy architecture, remove the obsolete files, test, then push to production.
When changing file names it is ALWAYS better to re-name the existing file and do a global update of links rather than create a duplicate with a different name. As soon as you create two files, you open up the possibility of accidentally linking both files within the site. You could have /contact.php linked from the home page and contact-us.php linked from the footer for example. There is a danger here that should you decide to delete the unwanted file, you create broken links without knowing it, or you have duplicate content. Either way, you have to recognize the problem and either fix it, or put a 301 in place to catch it.
NEVER hard code your links, because as soon as you change the name of the directory you placed your files in, you create a broken link! If you use relative links, the change of directory name will not matter.
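To illustrate the last point, here is a small Python sketch using urljoin from the standard library. The domain, directory names and file name are made up, and I am reading "hard coded" as a link that spells out the directory path.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Resolve each href the way a browser would from the given page."""
    return [urljoin(page_url, href) for href in hrefs]


links = [
    "brochure.pdf",        # relative to the page's own directory
    "/pdfs/brochure.pdf",  # hard coded to the /pdfs directory
]

# Before the rename the page and the file both live in /pdfs.
before = resolve_links("http://www.example.com/pdfs/index.php", links)
# After the directory is renamed to /downloads.
after = resolve_links("http://www.example.com/downloads/index.php", links)

print(after[0])  # http://www.example.com/downloads/brochure.pdf
print(after[1])  # http://www.example.com/pdfs/brochure.pdf
```

Before the rename both links resolve to the same URL. Afterwards the relative link follows the directory, while the hard coded one still points at /pdfs, which no longer exists, so it returns a 404 unless a redirect catches it.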

I can see from Screaming Frog that some of the URL's for the pdf files have 301's in place, but it appears that the Redirect URL may also be hard coded to the /pdfs directory. The fact that they all return a 404 when the directory name is changed to match that section makes it purely a guess as to what is happening here. It seems both www and non www pdf's are returning 404's in the browser.

The picture is muddied a little by the fact that there appear to be internal URL rewrites in the mix as well (to produce those pretty URL's with trailing slashes). So, there are a few options as to why the pdf's are not accessible:

They are not actually on the server at all (unlikely)
The names of the pdf's themselves have been changed, so even if the URL rewrite is sending the request to the new directory, the file requested does not exist.
The /pdfs directory has been named something completely different and the hard coding is the problem
The /pdfs directory has been moved to another location within the site architecture

I tried guessing a couple dozen of the obvious options, but no luck I'm afraid

There is one other possibility, in that the internal URL rewrites and 301 redirects could be creating a problem for each other. I am not clever enough to identify whether this is the case without a hint from the code, but will ask the God of All Things Code (my Boss) if he can answer that for me when daytime arrives 8D

OK....this is now so long that I really need to read the whole thread back to see if I have forgotten anything! If I find something I have missed, or can find anything else when help arrives, I'll be back!

Hope it makes some sort of sense and ultimately helps,

Sha

ShaMenz

Hi Jesse,

Basically, the only significant search volume is for the shorter two word term "minneapolis massage", so there is no benefit in going for the longer term (which also has a negative correlation for rankings).

So, since Ryan advised that the .net version of minneapolismassage is available, if you are wanting an exact match domain that might be a better option.

Since the return you are wanting is quite modest, then the traffic afforded by that term and the little extra you might attract from other terms should work for you if you can get some reasonable rankings.

I would suggest that you put some good effort into local search optimization if you decide to go with that domain.

Hope that is clearer,

Sha


Moz Q&A is closed.

ShaMenz

@ShaMenz

Best posts made by ShaMenz
