Monday, September 17, 2012

Do 404 / 410 Errors Affect Website Ranking in Google?

So there you are, minding your own business, using Webmaster Tools to check out how awesome your site is... but, wait! The Crawl errors page is full of 404 (Not found) errors! Is disaster imminent?


Fear not, my young padawan. Let’s take a look at 404s and how they do (or do not) affect your site:

Q: Do the 404 errors reported in Webmaster Tools affect my site’s ranking?
A: 404s are a perfectly normal part of the web; the Internet is always changing, new content is born, old content dies, and when it dies it (ideally) returns a 404 HTTP response code. Search engines are aware of this; we have 404 errors on our own sites, and we find them all over the web. In fact, we actually prefer that, when you get rid of a page on your site, you make sure that it returns a proper 404 or 410 response code (rather than a “soft 404”). Keep in mind that in order for our crawler to see the HTTP response code of a URL, it has to be able to crawl that URL—if the URL is blocked by your robots.txt file we won’t be able to crawl it and see its response code. The fact that some URLs on your site no longer exist / return 404s does not affect how your site’s other URLs (the ones that return 200 (Successful)) perform in our search results.

Q: So 404s don’t hurt my website at all?
A: If some URLs on your site 404, this fact alone does not hurt you or count against you in Google’s search results. However, there may be other reasons that you’d want to address certain types of 404s. For example, if some of the pages that 404 are pages you actually care about, you should look into why we’re seeing 404s when we crawl them! If you see a misspelling of a legitimate URL (www.example.com/awsome instead of www.example.com/awesome), it’s likely that someone intended to link to you and simply made a typo. Instead of returning a 404, you could 301 redirect the misspelled URL to the correct URL and capture the intended traffic from that link. You can also make sure that, when users do land on a 404 page on your site, you help them find what they were looking for rather than just saying “404 Not found.”
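As a rough sketch of such a redirect (assuming an Apache server with mod_alias enabled, which the post itself doesn't specify), a single line in an .htaccess file could catch the typo above and send visitors to the correct URL:
Redirect 301 /awsome http://www.example.com/awesome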

Q: Tell me more about “soft 404s.”
A: A soft 404 is when a web server returns a response code other than 404 (or 410) for a URL that doesn’t exist. A common example is when a site owner wants to return a pretty 404 page with helpful information for users, and thinks that in order to serve content to users they have to return a 200 response code. Not so! You can return a 404 response code while serving whatever content you want. Another example is when a site redirects any unknown URLs to its homepage instead of returning 404s. Both of these cases can have negative effects on our understanding and indexing of your site, so we recommend making sure your server returns the proper response codes for nonexistent content. Keep in mind that just because a page says “404 Not Found,” that doesn’t mean it’s actually returning a 404 HTTP response code—use the Fetch as Googlebot feature in Webmaster Tools to double-check. If you don’t know how to configure your server to return the right response codes, check out your web host’s help documentation.
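For illustration (again assuming Apache; other servers have equivalent settings), an ErrorDocument directive serves a custom, helpful error page while still returning a real 404 status code, as long as the error page is referenced by a local path rather than a full URL:
ErrorDocument 404 /404.html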

Q: How do I know whether a URL should 404, or 301, or 410?
A: When you remove a page from your site, think about whether that content is moving somewhere else, or whether you no longer plan to have that type of content on your site. If you’re moving that content to a new URL, you should 301 redirect the old URL to the new URL—that way when users come to the old URL looking for that content, they’ll be automatically redirected to something relevant to what they were looking for. If you’re getting rid of that content entirely and don’t have anything on your site that would fill the same user need, then the old URL should return a 404 or 410. Currently Google treats 410s (Gone) the same as 404s (Not found), so it’s immaterial to us whether you return one or the other.
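As a sketch of both options on an Apache server (the paths below are purely hypothetical), mod_alias can issue the 301 for moved content and return 410 (Gone) for content that has been removed for good:
Redirect 301 /old-guide.html http://www.example.com/new-guide.html
Redirect gone /discontinued-product.html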

Q: Most of my 404s are for bizarro URLs that never existed on my site. What’s up with that? Where did they come from?
A: If Google finds a link somewhere on the web that points to a URL on your domain, it may try to crawl that link whether any content actually exists there or not; when it does, your server should return a 404 if there’s nothing there to find. These links could be caused by someone making a typo when linking to you, by some type of misconfiguration (if the links are automatically generated, e.g. by a CMS), or by Google’s increased efforts to recognize and crawl links embedded in JavaScript or other embedded content; or they may be part of a quick check from our side to see how your server handles unknown URLs, to name just a few possibilities. If you see 404s reported in Webmaster Tools for URLs that don’t exist on your site, you can safely ignore them. We don’t know which URLs are important to you vs. which are supposed to 404, so we show you all the 404s we found on your site and let you decide which, if any, require your attention.

Q: Someone has scraped my site and caused a bunch of 404s in the process. They’re all “real” URLs with other code tacked on, like http://www.example.com/images/kittens.jpg" width="100" height="300" alt="kittens"/></a... Will this hurt my site?
A: Generally you don’t need to worry about “broken links” like this hurting your site. We understand that site owners have little to no control over people who scrape their site, or who link to them in strange ways. If you’re a whiz with regular expressions, you could consider redirecting these URLs as described here, but generally it’s not worth worrying about. Remember that you can also file a takedown request if you believe someone is stealing original content from your website.

Q: Last week I fixed all the 404s that Webmaster Tools reported, but they’re still listed in my account. Does this mean I didn’t fix them correctly? How long will it take for them to disappear?
A: Take a look at the ‘Detected’ column on the Crawl errors page—this is the most recent date on which we detected each error. If the date(s) in that column are from before the time you fixed the errors, that means we haven’t encountered these errors since that date. If the dates are more recent, it means we’re continuing to see these 404s when we crawl.

After implementing a fix, you can check whether our crawler is seeing the new response code by using Fetch as Googlebot. Test a few URLs and, if they look good, these errors should soon start to disappear from your list of Crawl errors.
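If you prefer the command line, you can also double-check the response code yourself with curl; this only shows what your server returns to a generic client, which is usually, though not always, what Googlebot sees (the URL below is hypothetical):
curl -I http://www.example.com/removed-page.html
Once the fix is in place, the first line of the output should read something like “HTTP/1.1 404 Not Found”.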

Q: Can I use Google’s URL removal tool to make 404 errors disappear from my account faster?
A: No; the URL removal tool removes URLs from Google’s search results, not from your Webmaster Tools account. It’s designed for urgent removal requests only, and using it isn’t necessary when a URL already returns a 404, as such a URL will drop out of our search results naturally over time. See the bottom half of this blog post for more details on what the URL removal tool can and can’t do for you.

Still want to know more about 404s? Check out 404 week from our blog, or drop by our Webmaster Help Forum.

Source: http://googlewebmastercentral.blogspot.in/2011/05/do-404s-hurt-my-site.html

Structured Data Dashboard in Google Webmaster Tools

Structured data is becoming an increasingly important part of the web ecosystem. Google makes use of structured data in a number of ways including rich snippets which allow websites to highlight specific types of content in search results. Websites participate by marking up their content using industry-standard formats and schemas.

To provide webmasters with greater visibility into the structured data that Google knows about for their website, we’re introducing a new feature in Webmaster Tools today: the Structured Data Dashboard. The Structured Data Dashboard has three views: site, item type, and page-level.

Site-level view 
At the top level, the Structured Data Dashboard, which is under Optimization, aggregates this data by root item type and vocabulary schema. A root item type is an item that is not itself an attribute of another item on the same page. For example, the site below has about 2 million schema.org annotations for Books (“http://schema.org/Book”).
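For readers who haven’t seen the underlying markup, a Book annotation in schema.org microdata might look like the following illustrative snippet (not taken from any real site):
<div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">An Example Book</span>
  by <span itemprop="author">A. N. Author</span>
</div>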


Itemtype-level view 
The dashboard also provides per-page details for each item type, as seen below:


Google parses and stores a fixed number of pages for each site and item type, ordered with the most recently crawled pages first, and we keep all of their structured data markup. For certain item types we also provide specialized preview columns, as seen in the example below (e.g. “Name” is specific to schema.org Product).


The default sort order makes it easy to inspect the most recently added structured data.

Page-level view 
Last but not least, we have a details page showing all attributes of every item type on the given page (as well as a link to the Rich Snippet testing tool for the page in question). 


Webmasters can use the Structured Data Dashboard to verify that Google is picking up new markup, as well as to detect problems with existing markup, for example by monitoring potential changes in instance counts during site redesigns.

Source: http://googlewebmastercentral.blogspot.in/2012/07/introducing-structured-data-dashboard.html

Keyword Alerts in Google Webmaster Tools

Many of you check Webmaster Tools daily (thank you!), but not everybody has the time to monitor the health of their site 24/7. It can be time-consuming to analyze all the data and identify the most important issues. To make it a little bit easier, we’ve been incorporating alerts into Webmaster Tools. We process the data for your site and try to detect the events that could be most interesting for you. Recently we rolled out alerts for Crawl Errors, and today we’re introducing alerts for Search Queries data.

The Search Queries feature in Webmaster Tools shows, among other things, impressions and clicks for your top pages over time. For most sites, these numbers follow regular patterns, so when sudden spikes or drops occur, it can make sense to look into what caused them. Some changes are due to differing demand for your content, other times they may be due to technical issues that need to be resolved, such as broken redirects. For example, a steady stream of clicks which suddenly drops to zero is probably worth investigating.

The alerts look like this:




We’re still working on the sensitivity threshold of the messages and welcome your feedback in our help forums. We hope the new alerts will be useful. Don’t forget to sign up for email forwarding to receive them in your inbox.

Source: http://googlewebmastercentral.blogspot.in/2012/08/search-queries-alerts-in-webmaster-tools.html

GEO Content Similarity & SEO - rel="alternate" hreflang="x"


Many websites serve users from around the world, with content that's translated or targeted to users in a certain region. The rel="alternate" hreflang="x" annotations help Google serve the correct language or regional URL to searchers. For more information, see Google's documentation on multi-regional and multilingual sites.
Some example scenarios where rel="alternate" hreflang="x" is recommended:
  • You translate only the template of your page, such as the navigation and footer, and keep the main content in a single language. This is common on pages that feature user-generated content, like a forum post.
  • Your pages have broadly similar content within a single language, but the content has small regional variations. For example, you might have English-language content targeted at readers in the US, GB, and Ireland.
  • Your site content is fully translated. For example, you have both German and English versions of each page.

Using rel="alternate" hreflang="x"

Imagine you have an English language page hosted at http://www.example.com/, with a Spanish alternative at http://es.example.com/. You can indicate to Google that the Spanish URL is the Spanish-language equivalent of the English page in one of three ways:
  • HTML link element. In the HTML <head> section of http://www.example.com/, add a link element pointing to the Spanish version of that webpage at http://es.example.com/, like this:
    <link rel="alternate" hreflang="es" href="http://es.example.com/" />
  • HTTP header. If you publish non-HTML files (like PDFs), you can use an HTTP header to indicate a different language version of a URL:
    Link: <http://es.example.com/>; rel="alternate"; hreflang="es"
  • Sitemap. Instead of using markup, you can submit language version information in a Sitemap; see the example after this list.
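As a sketch of the Sitemap option, each <url> entry lists every language version, including itself, using xhtml:link elements (reusing the example URLs above):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://www.example.com/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/" />
    <xhtml:link rel="alternate" hreflang="es" href="http://es.example.com/" />
  </url>
  <url>
    <loc>http://es.example.com/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/" />
    <xhtml:link rel="alternate" hreflang="es" href="http://es.example.com/" />
  </url>
</urlset>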
If you have multiple language versions of a URL, each language page in the set must use rel="alternate" hreflang="x" to identify the other language versions. For example, if your site provides content in French, English, and Spanish, the Spanish version must include a rel="alternate" hreflang="x" link to both the English and the French versions, and the English and French versions must each include a similar link pointing to each other and to the Spanish site.
If you have several alternate URLs targeted at users with the same language but in different locales, it's a good idea to provide a generic URL for geographically unspecified users. For example, you may have specific URLs for English speakers in Ireland (en-ie), Canada (en-ca), and Australia (en-au), but want all other English speakers to see your generic English (en) page. In this case you should specify the generic English-language (en) page for searchers in, say, the UK.

hreflang supported values

The value of the hreflang attribute identifies the language (in ISO 639-1 format) and optionally the region (in ISO 3166-1 Alpha 2 format) of an alternate URL. For example:
  • de: German content, independent of region
  • en-GB: English content, for GB users
  • de-ES: German content, for users in Spain
For language script variations, the proper script is derived from the country. For example, when using zh-TW for users in Taiwan, the language script is automatically derived (in this example: Chinese Traditional). You can also specify the script itself explicitly using ISO 15924, like this:
  • zh-Hant: Chinese (Traditional)
  • zh-Hans: Chinese (Simplified)
Alternatively, you can also specify a combination of script and region—for example, use zh-Hans-TW to specify Chinese (Simplified) for Taiwanese users.

Example configuration: rel="alternate" hreflang="x" in action

Example Widgets, Inc has a website that serves users in the USA, Great Britain, and Germany. The following URLs contain substantially the same content, but with regional variations:
  • http://www.example.com/page.html English-language homepage. Contains information about fees for shipping internationally from the USA.
  • http://en-gb.example.com/page.html English-language; displays prices in pounds sterling.
  • http://en-us.example.com/page.html English-language; displays prices in US dollars.
  • http://de.example.com/seite.html German-language version of the content
rel="alternate" hreflang="x" is applied at the page level, not the site level, and you need to mark up each set of pages, including the home page, as appropriate. You can specify as many content variations and language/regional clusters as you need.
To indicate to Google that you want the German version of the page to be served to searchers using Google in German, the en-us version to searchers using google.com in English, and the en-gb version to searchers using google.co.uk in English, use rel="alternate" hreflang="x" to identify alternate language versions.
Update the HTML of each URL in the set by adding a set of rel="alternate" hreflang="x" link elements. Include a rel="alternate" hreflang="x" link for every URL in the set, like this:
<link rel="alternate" hreflang="en" href="http://www.example.com/page.html" />
<link rel="alternate" hreflang="en-gb" href="http://en-gb.example.com/page.html" />
<link rel="alternate" hreflang="en-us" href="http://en-us.example.com/page.html" />
<link rel="alternate" hreflang="de" href="http://de.example.com/seite.html" />
This markup tells Google's algorithm to consider all of these pages as alternate versions of each other.

Source: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=189077

Blocking Pages from being Crawled and Indexed in Google - The Methods


A robots.txt file restricts access to your site by search engine robots that crawl the web. These bots are automated, and before they access pages of a site, they check to see if a robots.txt file exists that prevents them from accessing certain pages. (All respectable robots will respect the directives in a robots.txt file, although some may interpret them differently. However, a robots.txt file is not enforceable, and some spammers and other troublemakers may ignore it. For this reason, we recommend password-protecting confidential information.)
To see which URLs on your site Google has been blocked from crawling, visit the Blocked URLs page under the Health section of Webmaster Tools.

You need a robots.txt file only if your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file (not even an empty one).
While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
In order to use a robots.txt file, you'll need to have access to the root of your domain (if you're not sure, check with your web hoster). If you don't have access to the root of a domain, you can restrict access using the robots meta tag.

To entirely prevent a page's contents from being listed in the Google web index even if other sites link to it, use a noindex meta tag or x-robots-tag. As long as Googlebot fetches the page, it will see the noindex meta tag and prevent that page from showing up in the web index. The x-robots-tag HTTP header is particularly useful if you wish to limit indexing of non-HTML files like graphics or other kinds of documents.
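For example, the meta tag goes in the page's <head>, and the equivalent HTTP header can be sent for non-HTML files:
<meta name="robots" content="noindex">
X-Robots-Tag: noindex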

Create a robots.txt file

The simplest robots.txt file uses two rules:
  • User-agent: the robot the following rule applies to
  • Disallow: the URL you want to block
These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
Each section in the robots.txt file is separate and does not build upon previous sections. For example:
User-agent: *
Disallow: /folder1/

User-Agent: Googlebot
Disallow: /folder2/
In this example only the URLs matching /folder2/ would be disallowed for Googlebot.

User-agents and bots

A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:
User-agent: *
Google uses several different bots (user-agents). The bot we use for our web search is Googlebot. Our other bots like Googlebot-Mobile and Googlebot-Image follow rules you set up for Googlebot, but you can set up specific rules for these specific bots as well.

Blocking user-agents

The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).
  • To block the entire site, use a forward slash.
    Disallow: /
  • To block a directory and everything in it, follow the directory name with a forward slash.
    Disallow: /junk-directory/
  • To block a page, list the page.
    Disallow: /private_file.html
  • To remove a specific image from Google Images, add the following:
    User-agent: Googlebot-Image
    Disallow: /images/dogs.jpg 
  • To remove all images on your site from Google Images:
    User-agent: Googlebot-Image
    Disallow: / 
  • To block files of a specific file type (for example, .gif), use the following:
    User-agent: Googlebot
    Disallow: /*.gif$
  • To prevent pages on your site from being crawled, while still displaying AdSense ads on those pages, disallow all bots other than Mediapartners-Google. This keeps the pages from appearing in search results, but allows the Mediapartners-Google robot to analyze the pages to determine the ads to show. The Mediapartners-Google robot doesn't share pages with the other Google user-agents. For example:
    User-agent: *
    Disallow: /
    
    User-agent: Mediapartners-Google
    Allow: /
Note that directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore whitespace (in particular, empty lines) and unknown directives in the robots.txt file.
Googlebot supports submission of Sitemap files through the robots.txt file.
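For example, adding a line like this to robots.txt (the path is just an example) tells crawlers where to find your Sitemap:
Sitemap: http://www.example.com/sitemap.xml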

Pattern matching

Googlebot (but not all search engines) respects some pattern matching.
  • To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
    User-agent: Googlebot
    Disallow: /private*/
  • To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
    User-agent: Googlebot
    Disallow: /*?
  • To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
    User-agent: Googlebot 
    Disallow: /*.xls$
    You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
    User-agent: *
    Allow: /*?$
    Disallow: /*?
     The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
    The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
Save your robots.txt file by downloading the file or copying the contents to a text file and saving as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.

Test a robots.txt file

The Test robots.txt tool will show you if your robots.txt file is accidentally blocking Googlebot from a file or directory on your site, or if it's permitting Googlebot to crawl files that should not appear on the web. When you enter the text of a proposed robots.txt file, the tool reads it in the same way Googlebot does, and lists the effects of the file and any problems found.

Test a site's robots.txt file:

  1. On the Webmaster Tools Home page, click the site you want.
  2. Under Health, click Blocked URLs.
  3. If it's not already selected, click the Test robots.txt tab.
  4. Copy the content of your robots.txt file, and paste it into the first box.
  5. In the URLs box, list the site to test against.
  6. In the User-agents list, select the user-agents you want.
Any changes you make in this tool will not be saved. To save any changes, you'll need to copy the contents and paste them into your robots.txt file.
This tool provides results only for Google user-agents (such as Googlebot). Other bots may not interpret the robots.txt file in the same way. For instance, Googlebot supports an extended definition of the standard robots.txt protocol. It understands Allow: directives, as well as some pattern matching. So while the tool shows lines that include these extensions as understood, remember that this applies only to Googlebot and not necessarily to other bots that may crawl your site.

Source: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449