Posts Tagged ‘Spiders’

Search Engine Spiders and Their Purpose

Thursday, January 7th, 2010

Search engine spiders are by far one of the most useful things to come around in the last 10 years of the internet. They are useful not only to the web sites (Google and many others) that use them, but also to people who are searching for a particular site and those who run web sites. Spiders allow your site to be seen by the millions of people who use search engines every day. In this newsletter, we will discuss what search engine spiders do, how they work, and how to set up a robots.txt file and upload that to your site to keep spiders from visiting your site.

What are spiders and what purpose do they serve?

Spiders are essentially programs that “crawl” sites and report back to their superior (Google or whatever search engine they were produced for) what their findings are. Their purpose is to make it simple for sites to get listed in search engines.

You might be wondering, what does it mean to “crawl” a site?

Well, it means to visit a site and copy the information found there.

How do spiders work?

Spiders work by finding links to web sites, visiting those web sites, going through the content of a web site and then reporting the content of the site back to the database of the site which they are working for. Google spiders, thus, crawl sites and report the information back to Google’s database. From there, the information is added to Google’s search engine, and the site then shows up in Google search results. Much the same process happens with any other search engine spider.
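The link-finding step described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not how Google or any other search engine actually builds its spiders; the class name, function name, sample HTML and URLs are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> found on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the list of links a spider would queue up after reading a page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content, used only for illustration.
sample_html = '<p><a href="/news/">News</a> <a href="http://example.org/">Other</a></p>'
found = extract_links("http://yoursite.com/index.html", sample_html)
print(found)  # ['http://yoursite.com/news/', 'http://example.org/']
```

A real spider would download each page it finds and repeat this step on every new link.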

How can I keep spiders from visiting my site?

You might be thinking, “Why would I want to keep such a useful thing from visiting my site?” Well, the short answer is that sometimes site owners don’t want the spider to crawl a particular part of their site. Some site owners don’t want spiders to crawl their site at all. The reasons for not wanting a spider to crawl a site or a particular part of a site vary, though often it is because the site, or a page or two of it, is spam.

If you’re one of those site owners, then you’ll want to make and upload a little something called a robots.txt file. We will briefly go over how to do this.

A robots.txt file

The whole purpose of a robots.txt file is to tell a search engine spider which parts of a site, if any, it should not crawl.

Making the file

Making a robots.txt file that blocks out spiders is simple.

First, open up notepad. Then, copy and paste the following:

User-agent: *
Disallow: /

Once you’ve done that, save the file as “robots” and as a .txt file.

Uploading the file

Next, upload the file to the root directory of your site, the folder that holds your index page. Spiders only look for robots.txt at the top level (yoursite.com/robots.txt), so a copy placed in a subfolder such as yoursite.com/news/ is ignored. The file shown above keeps spiders away from your entire site. If you only want to keep them out of yoursite.com/news/, change the last line to Disallow: /news/ and still upload the file to the root. That’s all there is to it.
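Crawlers request robots.txt from the top level of a host, not from subfolders. The short Python sketch below shows where a spider would look for the file given any page address, and how a Disallow: /news/ rule in that root file affects the news folder. The function name, the bot name "AnyBot" and the page addresses are hypothetical; the parser is Python's standard urllib.robotparser, which may differ in small ways from the parsers real search engines use.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Return the robots.txt address a crawler would check for a given page."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://yoursite.com/news/today.html"))
# http://yoursite.com/robots.txt

# A root-level file that keeps spiders out of the news folder only.
news_rules = RobotFileParser()
news_rules.parse(["User-agent: *", "Disallow: /news/"])

news_ok = news_rules.can_fetch("AnyBot", "http://yoursite.com/news/today.html")
about_ok = news_rules.can_fetch("AnyBot", "http://yoursite.com/about.html")
print(news_ok, about_ok)  # False True
```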

Using the robots.txt file to make sure search engine spiders DO visit your site

Believe it or not, the robots.txt file can be used to both disallow and allow search engine spiders to crawl your site. Here’s how to make and upload such a file.

Making the file

Open up notepad and copy and paste in the following:

User-agent: *
Disallow:

You’ll notice that the only difference between this and the earlier example is that Disallow: is not followed with /. If it were, that would tell spiders to go away. Once again, save the file as robots.txt.
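If you would like to see the difference between the two files for yourself, the sketch below feeds both versions to Python's standard urllib.robotparser module and asks whether a page may be crawled. The helper name load_rules, the bot name "ExampleBot" and the page address are made up for the example, and real spiders may use their own parsers.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text):
    """Parse the text of a robots.txt file and return the parser object."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The two files from this newsletter, written out as plain text.
block_all = load_rules("User-agent: *\nDisallow: /")
allow_all = load_rules("User-agent: *\nDisallow:")

page = "http://yoursite.com/index.html"
print(block_all.can_fetch("ExampleBot", page))  # False: spiders are told to stay away
print(allow_all.can_fetch("ExampleBot", page))  # True: spiders may crawl everything
```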

Uploading the file

Upload the robots.txt file to the root directory of your site, right alongside the index file. Spiders do not look for robots.txt anywhere else, so this is the right place whether you want the robot to see the whole site or only part of it. And you’re done.

Making and uploading a robots.txt file to help make sure spiders don’t miss your site is quick and simple. So what are you waiting for? Make and upload that file now!

How exactly do search engine spiders & robots work

Friday, January 1st, 2010

Some internet surfers still hold on to the mistaken belief that actual people visit each and every website and then enter it into the search engine’s database. Imagine if that were true! With billions of web pages available on the internet, and with a majority of sites offering fresh content, it would take thousands of people to do the work done by search engine spiders and robots, and even then they would not be as well organized or as thorough.

Search engine spiders and robots are pieces of code or software that have only one aim: to seek out content on the internet and within each individual web page. These tools play a vital role in how effectively search engines operate.

Search engine spiders and robots visit websites, gather the information needed to determine the nature and content of each site, and add that data to the search engine’s index. They follow links from one website to another so that they can keep gathering information continuously. The ultimate goal is to compile a comprehensive and valuable database that can deliver the most relevant results for visitors’ search queries.

But how exactly do search engine spiders and robots work?

The whole process starts when a web page is submitted to a search engine. The submitted URL is added to the queue of websites that the search engine spider will visit. Submission is optional, though, because most spiders can find the content of a web page if other websites link to it. This is why it is a good idea to build reciprocal links with other websites: doing so improves the link popularity of your website and earns links from other sites that cover the same topic.

When the search engine spider visits the website, it checks whether there is an existing robots.txt file. The file tells the robot which areas of the site are off limits to it, such as directories that are of no use to search engines. Most well-behaved search engine bots look for this text file, so it is a good idea to place one even if it is blank.

The robots list and store all of the links found on a page, and they follow each link to its destination website or page. The robots then submit all of this information to the search engine, which in turn compiles the data received from all the bots and builds the search engine database. This part of the process involves search engine engineers, who write the algorithms used to evaluate and score the information the bots have compiled. Once all of the information has been added to the search engine database, it is available to visitors making search queries in the search engine.
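The process described above (a queue of URLs, a robots.txt check, then following links) can be sketched roughly as follows. This is a toy model under loud assumptions: the "web" is a small hypothetical dictionary instead of real pages, everything lives on one made-up host (example.com), and the bot name is invented. Real search engine spiders are far more elaborate.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": each URL maps to the links found on that page.
PAGES = {
    "http://example.com/": [
        "http://example.com/about",
        "http://example.com/private/notes",
    ],
    "http://example.com/about": ["http://example.com/"],
    "http://example.com/private/notes": ["http://example.com/secret"],
}


def crawl(start_url, robots_lines, agent="ExampleBot"):
    """Visit pages breadth-first, skipping anything robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_lines)

    queue = deque([start_url])  # URLs waiting to be visited
    seen = {start_url}          # URLs already queued, to avoid loops
    index = []                  # pages "reported back" to the search engine

    while queue:
        url = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue  # off limits, so the page and its links are never read
        index.append(url)
        for link in PAGES.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


visited = crawl("http://example.com/", ["User-agent: *", "Disallow: /private/"])
print(visited)  # ['http://example.com/', 'http://example.com/about']
```

Because the page under /private/ is never read, the link it contains is never discovered either.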

Did you find this article useful? For more useful tips, hints, techniques, and insights pertaining to Internet business, please browse our websites: http://www.allhottips.com and http://www.bookstoretoday.com

Search Engine Spiders And Your Robots.txt File

Wednesday, December 23rd, 2009

In this article we will discuss search engine spiders and what they do. You will also learn how to make a robots.txt file and why you might need one.
Search engine spiders are automated software programs that crawl the Web looking for pages to feed to search engines. They are also called crawlers, robots and bots. Spiders are one of the most useful programs on the internet. They are a key part in how the search engines operate. Spiders allow your site to be found by the millions of people who use search engines. Feed the spiders right and they will tell the search engines about your site.
How Spiders Work
A search engine is an index to the Internet: it points to relevant web sites depending on your search. Search engines need a tool that is able to visit websites, navigate them, decide what each website is about and add that data to the search engine.
Spiders are essentially programs that “crawl” sites and report back to their boss their findings. Their purpose in life is to make it simple for your site to get listed in search engines.
Spiders work by finding links to web sites, visiting those web sites, going through the content of a web site and then reporting the content of the site back to the database of the search engine they work for. From there, the information is added to the search engine, and the site then shows up in search results.
The robots.txt file
By defining a few rules, you can tell robots not to crawl certain directories or files within your site. Web sites do not absolutely have to have a robots.txt file; they can get along just fine without one. Most spiders look for a robots.txt file as soon as they arrive on your site. Take a look at your site statistics. If your statistics have a “files not found” section, you may see many entries where spiders failed to find the file on your site.
The default behavior is to allow all unless you have a Disallow for that resource. If you wish to exclude some of your pages from search engine indexing, this is the tool approved by the search engines. Making a robots.txt file that guides spiders is simple.
If you want to allow the spiders to crawl your site but exclude directories of your choice, copy and paste the following into a blank txt file:
User-agent: *
Disallow: /directory1/
Disallow: /directory2/
Disallow: /directory3/
To exclude files of your choice, type in the path to the files you want to exclude:
User-agent: *
Disallow: /directory1/page1.html
Disallow: /directory2/page2.html
Disallow: /directory3/page3.html
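
To check rules like these before uploading them, you can run them through Python's standard urllib.robotparser module. The sketch below mixes one directory rule and one file rule; the site address, the extra page names and the bot name "ExampleBot" are made up for the example, and a real spider's parser may behave slightly differently.

```python
from urllib.robotparser import RobotFileParser

path_rules = RobotFileParser()
path_rules.parse([
    "User-agent: *",
    "Disallow: /directory1/",            # a whole directory
    "Disallow: /directory2/page2.html",  # a single file
])

site = "http://yoursite.com"  # hypothetical site address
paths = [
    "/directory1/page1.html",
    "/directory2/page2.html",
    "/directory2/other.html",
    "/index.html",
]

# Map each path to True (may be crawled) or False (excluded).
results = {path: path_rules.can_fetch("ExampleBot", site + path) for path in paths}
for path, allowed in results.items():
    print(path, "allowed" if allowed else "excluded")
```

Only the first two paths come back as excluded; anything without a matching Disallow line is allowed by default.
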
To exclude all the search engine spiders from your entire web site, copy and paste the following into the txt file:
User-agent: *
Disallow: /
This will keep a specific search engine spider from indexing your site:
User-agent: Name_of_Robot
Disallow: /
To allow a single robot and exclude all other robots:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
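
As a quick check of the single-robot rules, the sketch below parses them with Python's standard urllib.robotparser module, with a blank line separating the two records. The page address and the name "SomeOtherBot" are hypothetical, and real spiders may parse the file somewhat differently.

```python
from urllib.robotparser import RobotFileParser

only_googlebot = RobotFileParser()
only_googlebot.parse([
    "User-agent: Googlebot",
    "Disallow:",
    "",  # blank line between the two records
    "User-agent: *",
    "Disallow: /",
])

test_page = "http://yoursite.com/index.html"  # hypothetical page
print(only_googlebot.can_fetch("Googlebot", test_page))     # True
print(only_googlebot.can_fetch("SomeOtherBot", test_page))  # False
```
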
There can only be one robots.txt on a site, and you may not have blank lines within a record. Once you have it the way you want, save the file as “robots” and as a .txt file. Upload the file to the root directory of your site, that is, the directory where your home page or index page is, so that robots.txt sits right alongside the index file.