Posts Tagged ‘Robots.txt’

Nofollow and Noindex on Robots.txt and Meta Tags

Monday, December 28th, 2009

I know a lot of you are wondering what the difference is between “nofollow” and “noindex” in the robots.txt file and the ones declared in the meta tags. Some search engine optimization specialists claim that there is no difference, while others believe that one method is less valuable than the other. You can ask about it on several webmaster/SEO forums, but you’ll just get mixed responses. Here, I will give you the REAL answer! According to Eric Enge’s interview with Google software engineer Matt Cutts back in 2007, there are differences in using NoIndex, NoFollow, and robots.txt.

In The Robots.txt File

Even if you keep Google’s spider from crawling certain pages of your website, those pages can still accumulate PageRank. For example, if you disallow the crawling of your “About Us” page yet your homepage links to it, PR juice is still passed on. In addition, websites that disallow crawling can still accumulate PageRank and be visible in search results. Why? If another website links to such a site, PR juice will be passed on. Google can sometimes use the information of a website submitted to ODP (also known as DMOZ) to show it on its results page, or it can show the site when another website links to it.

In The Meta Tag

Let’s look at what Nofollow and Noindex really mean. Nofollow is usually used on outgoing links; when declared in the meta tag, it means “do not follow any of the links on this page.” With this, you are also telling the spider not to pass PR juice. But the nofollow in the meta tag also applies to the links that point to the other pages of your own website, which means that you are cutting off the flow of PR juice to some of your pages. Noindex, on the other hand, means “do not index this page.” Pages with noindex can still accumulate PageRank if there is a dofollow link pointing to them.

Note: Dofollow is the opposite of Nofollow. A link with no “nofollow” attribute assigned is a dofollow link. It is a term used by webmasters and SEO experts.

To sum it all up:
* NoFollow means you’re telling the search spider not to follow a link and also not to pass PR juice to that link.
* NoIndex means you’re telling the search spider not to index your page and not to show it on the SERP.
* The site or page behind a Nofollow link can still gain PR if another site links to it without the nofollow attribute.
* Pages with the Noindex tag can still gain PR if another site links to them without a nofollow attribute.
* Pages disallowed in robots.txt can still be visible on the SERP, using the information from ODP (DMOZ) or when another website links to them.
* Nofollow in the meta tag applies to all the links on a webpage.
* The nofollow attribute on a link applies only to that link.

So whether you use robots.txt or the meta tag, pages can still accumulate PageRank through links that point to them. The reason why robots.txt is frequently used is because “it is the fundamental method of putting up an electronic no trespassing sign that people have used since 1996,” and it’s much simpler to declare which pages of your website not to crawl than to place the NoIndex code manually in the meta tags of individual pages. Commonly, SEO services provide a complete analysis of a website as well as recommendations. So when you are looking for a good search engine optimization service, never forget to ask for a website analysis first; don’t just jump in and hire somebody you do not know.
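To make the meta tag side concrete, here is a minimal Python sketch of how a spider might read the robots meta tag of a page. The class name, the function name and the sample page are made up for illustration; real search engines do not publish their parsers, so treat this only as an approximation of the idea.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for item in attr_map.get("content", "").split(","):
                item = item.strip().lower()
                if item:
                    self.directives.add(item)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for this example.
SAMPLE_PAGE = """
<html><head>
<meta name="robots" content="noindex, nofollow">
</head><body><a href="/about-us.html">About Us</a></body></html>
"""

directives = robots_directives(SAMPLE_PAGE)
print("index this page:", "noindex" not in directives)
print("follow its links:", "nofollow" not in directives)
```

A page with no robots meta tag yields an empty set, which matches the default of indexing the page and following its links.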

The Robots.txt protocol

Sunday, December 27th, 2009

The Robots.txt protocol, also called the “robots exclusion standard”, is designed to keep web spiders from accessing part of a website. It is meant as a privacy measure, the equivalent of hanging a “Keep Out” sign on your door.

This protocol is used by web site administrators when there are sections or files that they would rather not have accessed by the rest of the world. This could include employee lists, or files that they are circulating internally. For example, the White House website uses robots.txt to block any inquiries on speeches by the Vice President, a photo essay of the First Lady, and profiles of the 9/11 victims.

How does the protocol work? The administrator lists the files that shouldn’t be scanned in a file and places it in the top-level directory of the website. The robots.txt protocol was produced by consensus in June 1994 by members of the robots mailing list (robots-request@nexor.co.uk). There is no official standards body or RFC for the protocol, so it’s hard to legislate or mandate that the protocol be followed. In fact, the file is treated as strictly advisory, and there is no absolute guarantee that those contents won’t be read.

In effect, robots.txt requires cooperation by the web spider and even the reader, since anything that is uploaded to the internet becomes publicly available. You aren’t locking them out of those pages, you are just making it harder for them to get in. But it takes very little for them to ignore these instructions. Computer hackers can also easily get at the files and retrieve information. So the rule of thumb is: if it’s that sensitive, it shouldn’t be on your website to start with.

Care, however, should be taken to ensure that the robots.txt file doesn’t block the robots from other areas of the website. This will dramatically affect your search engine ranking, as the search engines rely on the robots to count the keywords, review metatags, titles and crossheads, and even register the hyperlinks.

One misplaced hyphen or dash can have catastrophic effects. For example, the robots.txt patterns are matched by simple substring comparisons, so care should be taken to make sure that patterns matching directories have the final ‘/’ character appended: otherwise all files with names starting with that substring will match, rather than just those in the directory intended.

To avoid these problems, consider submitting your site to a search engine spider simulator, also called a search engine robot simulator. These simulators, which can be bought or downloaded from the internet, use the same processes and strategies as different search engines and give you a “dry run” of how they will read your site. They will tell you which pages are skipped, which links are ignored, and which errors are encountered. Since the simulators will also reenact how the bots will follow your hyperlinks, you’ll see if your robots.txt file is interfering with the search engine’s ability to read through all the necessary pages.

It’s also important to review your robots.txt files, which will enable you to spot any problems and correct them before you submit your site to real search engines.
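The trailing slash pitfall is easy to try out for yourself. The sketch below uses the robots.txt parser from the Python standard library (urllib.robotparser), which matches rules by prefix. The helper name, the “ExampleBot” agent and the paths are made up for illustration, and other crawlers may match rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The same rule, once without and once with the final '/' character.
WITHOUT_SLASH = "User-agent: *\nDisallow: /private\n"
WITH_SLASH = "User-agent: *\nDisallow: /private/\n"

for path in ("/private/report.html", "/private-offers.html", "/index.html"):
    print(path, allowed(WITHOUT_SLASH, path), allowed(WITH_SLASH, path))
```

Without the slash, the rule also blocks /private-offers.html, a file that merely starts with the same characters. With the slash, only files inside the /private/ directory are blocked.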

Robots.txt to Keep Those Crawlers at Bay

Friday, December 25th, 2009

The world of the internet is a crazy mix of tools, technologies and terminologies whose apparent complexity can have you going around in circles. Much of that complexity hangs over the arena of website marketing or promotion, in which lots of tools and strategies are at play. Making your website more visible to its audience means making it appear more often on search engines. The more often your website shows up on search engines when people search for relevant services or products, the more traffic you can generate. This is possible through a process known as search engine optimization, or SEO for short. A search engine optimization firm analyses the potential of your website and prepares a strategy for its promotion. For this purpose, a web promotion firm engages in activities like keyword research, article submissions, link building and directory submissions. With a lot of such firms around today, you can find really affordable search engine optimization services.

In the midst of this entire process, a web promotion specialist may face certain obstacles. One such obstacle is variously known as robots, crawlers or spiders, which visit your website regularly to scan its contents and form an impression of it. This can be either advantageous or detrimental for you, with the latter being the common scenario, since you would not be updating your website content every day and the robots may visit less often after some time. To avoid this, a file known as robots.txt, which implements the robots exclusion protocol, has been developed to limit the access robots have to your website. So, the next time you seek out an affordable web promotion service, be sure that you specifically ask the SEO specialists to include this file for your website. The fewer hassles you face in your web promotion process, the better your chances of success.

Robots.txt File: How to Benefit From it Most

Thursday, December 24th, 2009

On the web, you will find many types of websites from which you can get the information that you need. You can simply use a search engine and get a list of web pages that may match your needs. Anyone can also have their own site and place on it all the content that they want to share with other people. But there are a number of people who make some web pages that they do not intend others to find through a search. Thus, they use the robots.txt file.

You may have some web pages that you do not want others to view because they are not yet finished or because they hold information that is irrelevant to most people. In that case, you can use the file to keep search engines from leading others to those pages.

A search engine crawler generally follows the robots exclusion protocol, or robots.txt file, if it is present on the server. The main use of this file is to determine which pages of a website may be accessed by the search engine and which may not. This will keep the Web robots from crawling certain pages that may have sensitive content not intended for other viewers. But this file only prevents the crawler from accessing the web pages; it does not keep the site from being indexed.

There are some people who tend to have problems when their sites are not listed in the search engine. When this is the case, the incorrect use of the robots.txt file is most often to blame. A misconfigured file is a common reason why certain sites are not listed in the search engines or why crawlers cannot access the whole site. Once you have fixed the problem with the file, the site will be indexed and will soon have better traffic.

When a person does not use the file correctly or writes its rules the wrong way, the result will be the opposite of what was intended. Thus, you should know what the file is for and how it should work according to your needs. People who want their whole website to be viewed by others need not block anything with the robots.txt file. But for those who want certain unfinished or private pages not to be indexed, the proper use of the file will be a huge help.

There may be a number of problems with the robots.txt file, since there is no way to stop other sites from linking to your site. Still, the robots.txt file helps you keep the search engines from gaining access to some web pages or to your whole website. The site remains available, only not through the search engines. A careful use of the file is needed so that the results are what you want.

Search Engine Spiders And Your Robots.txt File

Wednesday, December 23rd, 2009

In this article we will discuss search engine spiders and what they do. You will also learn how to make a robots.txt file and why you might need one.
Search engine spiders are automated software programs that crawl the Web looking for pages to feed to search engines. They are also called crawlers, robots and bots. Spiders are one of the most useful programs on the internet. They are a key part in how the search engines operate. Spiders allow your site to be found by the millions of people who use search engines. Feed the spiders right and they will tell the search engines about your site.
How Spiders Work
A search engine is an index to the Internet; it points to relevant web sites depending on your search. Search engines need a tool that is able to visit websites, navigate them, decide what each website is about and add that data to the search engine.
Spiders are essentially programs that “crawl” sites and report back to their boss their findings. Their purpose in life is to make it simple for your site to get listed in search engines.
Spiders work by finding links to web sites, visiting those web sites, going through the content of a web site and then reporting the content of the site back to the database of the search engine they work for. From there, the information is added to the search engine, and the site then shows up in search results.
The robots.txt file
By defining a few rules, you can tell robots not to crawl certain directories or files within your site. Web sites do not necessarily have to have a robots.txt file; they can get along just fine without one. Most spiders look for a robots.txt file as soon as they arrive on your site. Take a look at your site statistics. If your statistics have a “files not found” section, you may see many entries where spiders failed to find the file on your site.
The default behavior is to allow all unless you have a Disallow for that resource. If you wish to exclude some of your pages from search engine indexing, this is the tool approved by the search engines. Making a robots.txt file that guides spiders is simple.
If you want to allow the spiders to crawl your site but exclude directories of your choice, copy and paste the following into a blank txt file:
User-agent: *
Disallow: /directory1/
Disallow: /directory2/
Disallow: /directory3/
To exclude files of your choice, type in the path to the files you want to exclude:
User-agent: *
Disallow: /directory1/page1.html
Disallow: /directory2/page2.html
Disallow: /directory3/page3.html
To exclude all the search engine spiders from your entire web site, copy and paste the following into the txt file:
User-agent: *
Disallow: /
This will keep a specific search engine spider from indexing your site:
User-agent: Name_of_Robot
Disallow: /
To allow a single robot and exclude all other robots:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
There can only be one robots.txt on a site, and you may not have blank lines within a record; a blank line marks the end of one record and the start of the next. Once you have it the way you want, save the file as “robots” and as a .txt file. Upload the file to the root directory of your site, that is, the directory where your home page or index page is. Place the robots.txt file right alongside the index file.
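If you want to check what such rules mean for a given robot before uploading the file, the Python standard library includes a robots.txt parser (urllib.robotparser). The sketch below feeds it the “allow a single robot” example from above. The helper name may_crawl and the robot name “SomeOtherBot” are made up for illustration, and the parser only approximates how a real spider reads the file.

```python
from urllib.robotparser import RobotFileParser

# The "allow a single robot and exclude all other robots" example,
# with a blank line separating the two records.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def may_crawl(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if the robot named `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(may_crawl("Googlebot", "/directory1/page1.html"))     # allowed by its own record
print(may_crawl("SomeOtherBot", "/directory1/page1.html"))  # falls under the * record
```

A robot that finds a record with its own name follows that record; every other robot falls back to the record marked with the asterisk.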

Learn to Protect Your Site by Communicating in the Language of Robots.txt

Tuesday, December 22nd, 2009

If you are a website owner, you know the reasoning behind that title. No, we are not talking about physical robots in general, but rather the language of robots. Anyone that is familiar with the well-known Google robot, Googlebot, knows how important it can be to understand the language of robots to help protect your website. Not everyone, though, is savvy in the art of speaking robot.

It can be intimidating to some website owners to think they have to learn to use the language effectively, but there are tools available to help the less robot-savvy communicators. Most of us have probably employed the services of Googlebot to protect sections and parts of our websites that we don’t want invaded. Those that are familiar with using the robots.txt language can simply fire off a file to him and he will always deliver what we need. But if you are unsure of your abilities in the art of speaking robot, there is a tool that can help you.

There is a new Webmaster tool available that acts as a translator for robots.txt files. It helps you build the file to use, and all you have to do is enter the areas you do not want robots to crawl through. You can also make it very specific, blocking only certain types of robots from certain types of files. After you use the generator tool, you can take the result for a test drive by using the analysis tool. After you have seen that your test file is ready to go, you can simply save the new file in the root directory of your website and sit back. When making and using the robots files, you should consider the following two tips:

1. Robots.txt files are not always supported on all search engines. Googlebot and some other robots can understand the files, but other robots may not be able to understand the generated files.
2. Keep in mind that robots.txt files are only a method of asking that your site be protected from robots crawling. You simply generate the file, but robots that are not as scrupulous as others can choose to ignore the file and get in. Make sure you use password protection for the files you need blocked.

This can be a great tool for those who are not as confident in their robot language skills, and can make a safe haven for the files on your website you need protected from unsavory robots. It can substantially help you in your quest to protect your website and the files within by helping you generate the file in the right format for the robot. As always, there are options out there if you need further guidance: you can check out the help center for Webmaster tools or seek answers from a help group of Webmasters.

The Easy Guide to Making a Robots.txt File

Monday, December 21st, 2009

If you have a website you really need to have a robots.txt file. It gives search engine spiders specific commands and it is easy to use and easy to maintain. Here is an easy guide to a robots.txt file in five minutes.

There are times when you don’t want a search engine to index a page or a folder on your website. Maybe you have some information you just don’t want to have show up in Google. This may include your statistics page, a page of notes, or a dynamic page. And, importantly, if you use Google AdSense and the search tool that displays search results on your website, Google requires you to exclude this page from search engines, which means they require you to have a robots.txt file.

A robots.txt file is a simple document named robots.txt and saved in the root folder of your website. Search engines see this and follow any commands it contains. Make a simple text document using a plain-text editor like Notepad and place these two lines in it:

User-agent: *

Disallow:

The first line tells all spiders to listen up because the commands that follow are for them. The second line lists what the spiders should skip; left empty, as it is here, it blocks nothing. It is here that you place the url of any page you don’t want spidered. So if you wanted the spiders to skip your private page it looks like this:

Disallow: /privatepage.htm

If you want the spiders to skip a whole folder you place the url of that folder with a trailing slash like this:

Disallow: /privatefolder/

Simply place this text file in the root folder of your website and you are done. In the future you can add and remove commands easily.
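If you would rather generate the file than type it by hand, a few lines of Python are enough. This is only a sketch: the function name build_robots_txt is made up for illustration, and the paths are the example ones from this guide.

```python
def build_robots_txt(disallowed_paths, agent="*"):
    """Return robots.txt text holding one record for `agent`."""
    lines = [f"User-agent: {agent}"]
    if disallowed_paths:
        lines.extend(f"Disallow: {path}" for path in disallowed_paths)
    else:
        # An empty Disallow value means that nothing is blocked.
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"


text = build_robots_txt(["/privatepage.htm", "/privatefolder/"])
print(text)
# To publish it, save `text` as robots.txt in the root folder of your website.
```

Calling the function with an empty list gives you the two-line starter file shown earlier in this guide.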

The robots.txt file is a very simple file to write and maintain, and it is a very powerful tool that will help you interact successfully with search engines. The disallow command is the simplest and most used command, but there are other commands you can use. If you have a website, it is well worth your time to have a robots.txt file and even to research it a bit further.

For more interesting insights into being a creative webmaster and making your website work for you, visit the author’s site at: The Creative Webmaster – Forging the Iron of Creativity on the Anvil of a Website

Robots.txt, An Online Marketers Friend Or Foe?

Monday, December 21st, 2009

robots.txt is possibly the most misunderstood file that a website can contain.
Many people think that by using a robots.txt file on their website they are protecting pages and folders from thieves and hackers. In fact it is quite the opposite! robots.txt opens up an enormous security hole that hackers and thieves will use to easily gain access to the parts of your website that you don’t want them to reach.
What is robots.txt?
robots.txt is a file that you make and upload to your website’s root directory. It is used by search engine spiders to determine which parts of your website they should index and which folders/pages you, the website owner, don’t want listed in search engine indexes.
Why would you not want pages indexed?
There are many reasons why you might not want search engines to index pages on your website, such as private membership pages, exclusive training pages and the like.
If you are an Internet Marketer selling your own ebook or other digital product, you wouldn’t want your thank you pages indexed either!
And this is where the misunderstanding comes to the fore, and robots.txt becomes your foe.
Many online marketers who provide ebooks or other digital products for instant download will list their download thank you pages in the robots.txt file because they obviously don’t want those pages indexed in search engines.
By using robots.txt this way, though, you will be opening up your product to anyone who has a slight bit of knowledge about how the file works.
robots.txt is easily readable by any human who opens a browser and types in http://www.yourdomain.com/robots.txt, and if you have listed your thank you pages, all they have to do is go to that url and take your product(s)!
It’s that simple!
And I’m living proof that this works, as this is exactly what happened to me. I had listed my thank you pages in robots.txt and thought that they were safe from hackers and thieves. Then one day I was checking my web site stats and BAM, someone had been to every single thank you page and taken everything.
The moral is, don’t list any URL in robots.txt that you don’t want humans to have free access to. Use robots.txt with great caution and secure your thank you pages using dedicated software.
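To see how little effort this takes, here is a short Python sketch that pulls every disallowed path out of the text of a robots.txt file. The function name and the sample file, including the thank you page path, are made up for illustration. The point is only that the file is plain text that anyone can read.

```python
def listed_paths(robots_txt):
    """Return every path named on a Disallow line, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt of a marketer who lists a download page.
SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /thank-you/download.html
"""

print(listed_paths(SAMPLE))
```

Anything that shows up in that list is, in effect, advertised to every visitor who cares to look.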

Importance of the Robots.txt File

Monday, December 21st, 2009

Despite the importance of the robots.txt file in getting your website indexed with the major search engines, many webmasters don’t put one on their site. What is the robots.txt file, you ask? If you don’t know, you are far from alone. The robots.txt file is a simple text file (no html) that is placed in your website’s root directory in order to tell the search engines which pages to index and which to skip.

When a search engine sends its webcrawler to your site, one of the first things the webcrawler will do is search the root directory for the robots.txt file. A correctly formatted robots.txt file will consist of one or more records, each providing instructions for a particular search-bot. A record will generally consist of two components. The first is the user-agent line, where the name of the search-bot is listed. The second consists of one or more “disallow” lines. These lines tell the webcrawler which files or folders should not be indexed (e.g. a cgi-bin folder).

If you currently have a website and do not have a robots.txt file, you can make one easily. As mentioned earlier, the files are plain text, so just open up Notepad and save the file as robots.txt. Most webmasters can use one record that will apply to all of the search engine crawlers. Once you have opened Notepad, enter the following:

User-agent: *
Disallow:

The “*” applies this rule to all bots. In this example, there is nothing listed in the disallow line. This tells the robot to index the entire site. You can also enter a folder path here such as “/confidential” if there is a folder that shouldn’t be indexed. This can be very useful if you are still testing a part of your website or if a section is still under construction.

Now that you know what should go into your robots.txt file, there are several common mistakes people make when making these files. Be careful with notes or comments: the standard allows comments that start with “#”, but stray text without it can confuse the webcrawler. Also, the format should always be the user-agent on the first line, followed by the disallow(s). Do not reverse the order. Another common mistake involves using the incorrect case. If the disallowed folder is /confidential, make sure your robots.txt file does not list the folder as /Confidential. It seems like a very minor issue, but it will produce problems if done incorrectly. Finally, the original standard has no Allow command. Some search engines accept one as an extension, but you cannot rely on every webcrawler to honor it, so it is safest to tell the webcrawler only what not to look at.
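The case mistake is easy to reproduce. The sketch below uses the robots.txt parser from the Python standard library (urllib.robotparser); the “ExampleBot” agent name and the file names are made up for illustration, and other crawlers may not behave in exactly the same way.

```python
from urllib.robotparser import RobotFileParser

# A rule written with the wrong case for a folder that is really /confidential/.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /Confidential/"])


def blocked(path, agent="ExampleBot"):
    """Return True if the parsed rules block `agent` from fetching `path`."""
    return not parser.can_fetch(agent, path)


print(blocked("/Confidential/plan.html"))  # matches the rule as written
print(blocked("/confidential/plan.html"))  # the real folder is not covered
```

Because paths are compared character by character, the rule with the capital letter leaves the real folder open to crawling.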

If you are still curious about the robots.txt file you can find many more complex examples online. Just try one of your favorite websites and look for their robots.txt file. For example you can go to http://www.cnn.com/robots.txt. If you need help making a robots.txt file for your site, there are plenty of places online that will make the file for you for free. One example is http://www.seochat.com/seo-tools/robots-generator/. Despite its apparent simplicity, this file can make or break your site’s chances with the search engines. Make sure you have your robots.txt file in place and correctly formatted today.

Using Robots.txt to Control Search Engines

Monday, December 21st, 2009

Robots.txt is a text file you place on your site to tell search robots which pages you would like them not to visit. Robots.txt implements the Robots Exclusion Protocol, which allows you, as a web manager, to define what parts of your site are off-limits to search engine crawlers. For example, Web managers can disallow access to .cgi, private and temporary directories and other areas with pages they do not want accessed or indexed.

The robots.txt file is made up of two parts, the User-agent and the Disallow. The User-agent specifies which robots a record applies to, and the Disallow specifies which directories those robots should not crawl. The robots.txt file is a gentleman’s agreement, and some crawlers may simply ignore it. Even Google may still list a page that is disallowed from crawling if other sites link to it.

The structure of a robots.txt file is pretty simple. This example allows all robots to visit all files:

User-agent: *
Disallow:

Example of a robots.txt file that blocks crawling of the scripts and images directories:

User-agent: *
Disallow: /scripts/
Disallow: /images/

If you have a particular robot in mind, such as the Google image search robot, which collects images on your site for the Google Image search engine, you may include lines like the following: 

User-agent: Googlebot-Image

Disallow: /

This means that the Google image search robot should not try to access any file in the root directory or any of its subdirectories.

You can make the robots.txt file manually, using any text editor. It should be an ASCII-encoded text file, not an HTML file, and the filename should be lowercase. Place the robots.txt file in your server’s root directory. This is standard web management practice. It must be in the main directory because otherwise user agents (search engines) will not be able to find it; they do not search the whole site for a file named robots.txt. Instead, they look first in the main directory and if they don’t find it there, they simply assume that this site does not have a robots.txt file and therefore they index everything they find along the way.
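Since the file always sits in the main directory, a crawler can work out where to look from any page address on the site. Here is a minimal Python sketch of that step; the function name robots_url and the example address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/articles/2009/seo.html"))
```

Whatever page the crawler starts from, the result always points at the same file at the top of the site.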

All search engines, or at least all the important ones, now look for a robots.txt file as soon as their spiders visit your web site. So, even if you currently do not need to exclude the spiders from any part of your site, having a robots.txt file is still a good idea; it can act as a sort of invitation into your site.