HOW TO WRITE A ROBOTS.TXT TEXT FILE
A Website Crawler (also known as a Web Spider or Web Robot) is a program or automated script, normally run by a search
engine, that scans your website (your public_html folder and the folders and files within it), collecting and analysing information about your website
and its web pages (e.g. number of web pages, language used, keywords and phrases used) and their content (e.g. number of links, code information,
email addresses, whether audio/video is used).
Some of that collected information is used by the search engine to give your website/web pages a ranking, position and subject matter (listing) within
its search engine results, while the rest is used for caching (storage) and database purposes. Search engine companies such as Google,
Yahoo and Microsoft may also choose to pass on their collected information to third parties (e.g. other search engine companies). Generally, then,
a website crawler (web robot) is a good thing. It examines your website for popularity, links, well-written articles and other content in order to make
your website/web pages reachable to as many people as possible via a search engine.
The downside to a website crawler, though, is that it tends to scan every web page and folder it can reach within your public_html folder (your website). This is not
good if you are using one of your folders as a "Members Only" area of your website, because the website crawler will collect information from the
web pages inside that "Members Only" folder and, more precisely, keep a record of their location. In turn, the search
engine will list those "Members Only" web pages (as links). So far I have talked about one website crawler, because I was giving you the
definition of a website crawler, but most search engines use their own website crawler these days, which means you could have many
search engines listing your "Members Only" web pages (as links). For example, the Robotstxt Database - Web Page Listing
(or Robotstxt Database - Text List) lists over 300 website crawlers (web robots),
each with its own unique User Agent name (e.g. GoogleBot, MSNBot, Slurp, AskJeeves).
Fortunately, there is an answer to this problem, but not a complete solution. It is called the robots.txt text file: a simple .txt (text) file,
created with a text editor such as Notepad, that allows you to give instructions to a website crawler, in particular a DISALLOW instruction.
The DISALLOW instruction tells a website crawler not to scan certain folders and files within your website (public_html folder). So you could
disallow a website crawler from scanning your "Members Only" folder and the web pages within it, for example. However, the robots.txt text
file is not a complete solution because it is not governed by any laws, meaning website crawlers can ignore your robots.txt text file altogether and
therefore ignore your DISALLOW instruction. Once the robots.txt text file is created you upload (transfer) it to your public_html folder. A website
crawler will read that robots.txt text file, if it wants to, before it scans your website (the content of your public_html folder).
To create a robots.txt text file, begin by opening your favourite text editor (e.g. Notepad or Wordpad) and then type your instructions for the website
crawler to obey (see below). Then save the text file with the filename robots.txt, using Notepad's (or Wordpad's) SAVE AS menu-item
(Fig 1.1), before uploading (transferring) the text file to your public_html folder.
Fig 1.0 shows the instruction User-agent: with a parameter of * (asterisk), followed by the instruction Disallow: with a parameter of /MembersOnly/. Together they tell all User Agents (website crawlers) not to scan the content (folders and files) of the folder called MembersOnly.
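As a sketch, a robots.txt text file containing the two instructions just described would look like this. It is reconstructed from the description in the text, not copied from the original figure. The lines starting with # are optional comments, which website crawlers ignore.

```text
# Applies to every website crawler (user agent)
User-agent: *
# Do not scan anything inside the MembersOnly folder
Disallow: /MembersOnly/
```

The parameter is matched against the start of the address after your domain name, so /MembersOnly/ covers the folder and everything inside it. Paths are case-sensitive, so the spelling must match the folder name exactly.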
So far you have learnt that a website crawler is also known as a Web Robot, or Web Spider, and has a unique name known as a User Agent (otherwise known
as a robot name or spider name). For example, the website crawler (web spider / web robot) with the user agent (robot name / spider name) of GoogleBot
is the website crawler Google uses to scan your website (public_html folder) and return search engine results for the general public, based on information
it scans from your website. A website crawler can, if it wants to, ignore your robots.txt text file. That said, the major website crawlers
do obey the instructions inside your robots.txt text file.
When a web page is shown in a search engine result as a link, that web page is known as an Indexed web page. Its content (e.g. keywords and email
addresses) has been scanned as normal, but its path name and file name (full path name) have also been indexed as a link specifically for search engine
results. Not all website crawlers belong to a search engine. Those that do not will only scan, and will not index for search engine results, although they may index for their own
purposes (e.g. a links database).
The User-agent: instruction expects a parameter naming the user agent that the instructions following it apply to. For example, the instruction User-agent: GoogleBot,
followed by the instruction Disallow: /MembersOnly/, would tell the website crawler called GoogleBot that when it reads the robots.txt text file
it should not scan the content of the folder called MembersOnly and should not index its content (e.g. web pages) as links for search engine results. All
other website crawlers would be allowed to scan and index the content of the MembersOnly folder.
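Written out, the GoogleBot-only rule just described might look like the sketch below. A website crawler with any other user agent finds no instructions addressed to it, so nothing is disallowed for it.

```text
# Applies only to the website crawler whose user agent is GoogleBot
User-agent: GoogleBot
Disallow: /MembersOnly/
```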
At this point you may be asking "What is the point of stopping one website crawler from scanning and indexing when the other
website crawlers can do so?". Provided that you do not have a private area on your website (e.g. a MembersOnly folder),
one reason is that a certain website crawler might be scanning and indexing your website too frequently, innocently using up
your bandwidth in the process. Remember that a website crawler also caches (makes a copy of) your web pages so the general public
can view them when your web hosting provider's server (computer) is not working and your website is therefore not live
(offline).
Another reason could be to stop a bad (malware) website crawler from scanning and indexing your website. Malware
(Malicious Software) website crawlers scan and index your website looking for private information and products (membership
areas, private documents, software you sell and so on) in order to retrieve passwords, product numbers, database details
and so on. In these cases you may want to stop all website crawlers by using the User-agent: * and Disallow: /
instructions together, although a malware website crawler would most likely ignore your robots.txt text file!
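The "stop all website crawlers" pair of instructions mentioned above can be sketched as follows. A lone / as the Disallow parameter covers the whole website.

```text
# Every website crawler ...
User-agent: *
# ... is asked to stay out of the entire website
Disallow: /
```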
If you want to disallow all website crawlers from scanning and indexing a certain web page (e.g. a web page called membersnews.htm inside the MembersOnly folder), you would use the User-agent: * instruction followed by the Disallow: /MembersOnly/membersnews.htm instruction.
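A sketch of that single-page rule, using the example file name from the text:

```text
# All website crawlers: skip this one web page only
User-agent: *
Disallow: /MembersOnly/membersnews.htm
```

The other web pages in the MembersOnly folder are not covered by this rule and can still be scanned.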
To do the same thing but disallow only the GoogleBot website crawler, you would use the User-agent: GoogleBot instruction followed by the Disallow: /MembersOnly/membersnews.htm instruction.
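The GoogleBot-only version of the single-page rule differs in just the User-agent: line, as this sketch shows:

```text
# GoogleBot only: skip this one web page
User-agent: GoogleBot
Disallow: /MembersOnly/membersnews.htm
```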
If you want to add more than one folder to your disallow list, simply put another Disallow: /FolderName/ instruction into your robots.txt text file. For example, to disallow all website crawlers from scanning and indexing your cgi-bin folder, your images folder and your MembersOnly folder, you would list three Disallow: instructions (one each for /cgi-bin/, /images/ and /MembersOnly/) under a single User-agent: * instruction.
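A robots.txt text file for that three-folder example might look like this sketch (reconstructed from the description, with one Disallow: instruction per folder):

```text
# All website crawlers: stay out of these three folders
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /MembersOnly/
```

Each folder needs its own Disallow: line. Listing several folders on one line is not part of the format.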
To specifically disallow GoogleBot from scanning and indexing your image files, wherever they are located in your public_html folder, you can use its image user agent, called GoogleBot-Image, with the Disallow: / instruction.
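As a sketch, the image-only rule could be written like this. It assumes, as the text does, that GoogleBot-Image is the user agent Google's image crawler answers to.

```text
# Google's image crawler only: do not scan images anywhere on the website
User-agent: GoogleBot-Image
Disallow: /
```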
With the GoogleBot-Image user agent you can use Disallow: /images/, instead of Disallow: /, if you want to disallow your images folder
only, or you can stick to using User-agent: GoogleBot with Disallow: /images/. One reason for wanting to
disallow images, apart from having private/family photos you do not want the general public to see, is
bandwidth theft.
Suppose you have a photograph of a car on your website. Many websites might link to that car photograph on your website
(e.g. http://www.yourwebsite.com/car.jpg) instead of displaying it from their own images folder, because they do
not want people using their bandwidth when the car photograph is downloaded and/or because they do not
have ownership of the car picture. They would rather have people click on their CAR link, which points to the car photograph
on your website, so that your bandwidth and not theirs is used to display your car photograph
from your images folder.
To allow a certain website crawler (e.g. GoogleBot) to scan and index your website's content but disallow all other website
crawlers, you would use two pairs of instructions: one pair for GoogleBot and one pair for all other user agents. The empty line between the two pairs acts as
a user agent separator, which allows you to build up a combination of user agent instructions.
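The two pairs of instructions might be written as in the sketch below. A Disallow: instruction with no parameter disallows nothing, which is how GoogleBot is given access to the whole website.

```text
# GoogleBot: nothing is disallowed
User-agent: GoogleBot
Disallow:

# Every other website crawler: the entire website is disallowed
User-agent: *
Disallow: /
```

A website crawler uses the pair that names it specifically in preference to the * pair, so GoogleBot follows the first pair and ignores the second.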
Although there are a few newer instructions out there (namely ALLOW, SITEMAP and CRAWL-DELAY), as well as wildcards (i.e. the use of * and $), this section has explained the basics of the robots.txt instructions that most website beginners will need. If you wish to know more about robots.txt, consider visiting the RobotsTxt website and the Search Tools website.
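For the curious, the newer instructions named above look roughly like the sketch below. Support varies between website crawlers (Crawl-delay in particular is ignored by some major ones), and the file names welcome.htm and sitemap.xml are made up for illustration.

```text
User-agent: *
# Keep crawlers out of the folder, except for one page
Disallow: /MembersOnly/
Allow: /MembersOnly/welcome.htm
# Wildcards: * matches any characters, $ marks the end of the address
Disallow: /*.pdf$
# Ask for a pause (in seconds) between requests; not honoured by every crawler
Crawl-delay: 10

# Tell crawlers where a list of your web pages can be found (hypothetical address)
Sitemap: http://www.yourwebsite.com/sitemap.xml
```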
All HTM, CSS, PHP and MySQL files in the websitecreationhelp.com folder and its sub-folders are (c) John White, 2010. All Rights Reserved.