How To Write A Robots.txt Text File
Protect Your Website And Its Bandwidth
Normally, a Website Crawler (also known as a Web Spider or Web Robot) is a program or automated script, usually run by a search
engine, that scans your website (your public_html folder and the folders and files within it), collecting and analysing information about your website
and its web pages (e.g. number of web pages, language used, keywords and phrases used) and their content (e.g. number of links, code information,
e-mail addresses, whether audio/video is used).
Some of that collected information is used by the search engine to give your website/web pages a ranking, position and subject matter (listing) within
its search engine results, while the rest is used for caching (storage) and database purposes.
Search engine companies such as Google, Yahoo and Microsoft may also choose to pass on their collected information to third parties (e.g. other search
engine companies). So a website crawler (web robot) is a good thing, in general. It searches your website for popularity, links, well-written articles
and other content in order to make your website/web pages viewable to as many people as possible via a search engine.
The downside of a website crawler, though, is that it tends to scan every web page and folder within your public_html folder (your website). This is not
good if you are using one of your folders as a "Members Only" area of your website, simply because the website crawler will collect information from the
web pages inside that "Members Only" folder and, more precisely, keep a record of their location - inside your "Members Only" folder.
In turn, the search engine will list those "Members Only" web pages (as links). So far I have described one website crawler, but most search engines
run their own website crawler these days, which means you may have many search engines listing your "Members Only" web pages (as links). For example, the
Robotstxt Database - Web Page Listing (or Robotstxt Database - Text List) lists over 300 website crawlers (web robots), each with its own unique
User Agent name (e.g. GoogleBot, MSNBot, Slurp, AskJeeves).
Fortunately, there is an answer to this problem, but not a complete solution. It is called the robots.txt text file - a simple .txt (text) file,
which can be created with a text editor such as Notepad, that allows you to give instructions to a website crawler. In particular, a DISALLOW instruction.
The DISALLOW instruction tells a website crawler not to scan certain folders and files within your website (public_html folder). So you could
disallow a website crawler from scanning your "Members Only" folder and the web pages within it, for example. However, the robots.txt text
file is not a complete solution because nothing enforces it: it is a voluntary convention, not a law or a lock. Website crawlers can ignore your
robots.txt text file altogether and therefore ignore your DISALLOW instruction.
Once the robots.txt text file is created you upload (transfer) it to your public_html folder itself (not a sub-folder), so that it can be found at
/robots.txt on your website. A website crawler will read that robots.txt text file, if it wants to, before it scans your website (the content of your
public_html folder).
To create a robots.txt text file, begin by opening your favourite text editor (e.g. Notepad or Wordpad) and type your instructions for the website
crawler to obey (see below). Then save the text file with the filename robots.txt, using Notepad's (or Wordpad's) SAVE AS menu-item
(Fig 1.1), before uploading (transferring) the text file to your public_html folder.
Fig 1.0 Open Notepad and then type your instructions for the website crawler to obey
Fig 1.1 Click on Notepad's FILE menu and select the SAVE AS menu-item to continue
Fig 1.2 Save the text as a text (.txt) file with the name: robots.txt
Fig 1.3 The Robots.txt text file when it has been uploaded to the public_html folder
In the above example Fig 1.0 shows the instruction User-agent: with a parameter of * (asterisk). This is followed by the instruction Disallow: with a parameter of /MembersOnly/. Together they are telling all User Agents (website crawlers) not to scan the content (folders and files) of the folder called MembersOnly.
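For reference, the two instructions described above can be written out and checked with a short Python sketch. It is only an illustration: it uses Python's standard urllib.robotparser module to play the part of a well-behaved website crawler, and the web page names (news.htm, index.htm) are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content described for Fig 1.0.
ROBOTS_TXT = """\
User-agent: *
Disallow: /MembersOnly/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Any obedient crawler is kept out of MembersOnly but may scan the rest.
members_allowed = parser.can_fetch("GoogleBot", "/MembersOnly/news.htm")
home_allowed = parser.can_fetch("GoogleBot", "/index.htm")
print(members_allowed, home_allowed)
```

Note that folder names are matched exactly as typed, so /MembersOnly/ and /membersonly/ are treated as different folders.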
USER AGENTS AND ROBOTSTXT INSTRUCTIONS
So far you have learnt that a website crawler is also known as a Web Robot, or Web Spider, and has a unique name known as a User Agent (otherwise known
as a robot name or spider name). For example, the website crawler (web spider / web robot) with the user agent (robot name / spider name) of GoogleBot
is the website crawler Google uses to scan your website (public_html folder) and return search engine results for the general public, based on information
it scans from your website. A website crawler can, if it wants to, ignore your robots.txt text file. That said, the major website crawlers
do obey the instructions inside your robots.txt text file.
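To picture what "obeying" means from the website crawler's side, here is a minimal Python sketch. It assumes a crawler that checks each web page against robots.txt before scanning it; the function name pages_to_scan and the page list are hypothetical, and real website crawlers are of course far more involved.

```python
from urllib.robotparser import RobotFileParser


def pages_to_scan(robots_txt, user_agent, pages):
    """Return the pages a crawler that obeys robots.txt would scan."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [page for page in pages if parser.can_fetch(user_agent, page)]


ROBOTS_TXT = "User-agent: *\nDisallow: /MembersOnly/\n"
PAGES = ["/index.htm", "/MembersOnly/news.htm", "/contact.htm"]

# An obedient crawler filters its list first; a crawler that ignores
# robots.txt would simply scan everything in PAGES.
obedient = pages_to_scan(ROBOTS_TXT, "GoogleBot", PAGES)
print(obedient)
```

Nothing in this check is enforced by your website: the crawler decides for itself whether to run it.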
When a web page is shown in a search engine result as a link, that web page is known as an Indexed web page. Its content (e.g. keywords and email
addresses) has been scanned as normal, but its path name and file name (full path name) have also been indexed as a link specifically for search engine
results. Not all website crawlers belong to a search engine; those that do not will scan your website but not index it for search engine results. They
may index it for their own (e.g. links database) purposes though.
The User-agent: instruction expects a parameter that names the user agent (website crawler) the instructions after it apply to. For example, the instruction User-agent: GoogleBot, followed by the instruction Disallow: /MembersOnly/, tells the website crawler called GoogleBot that when it reads the robots.txt text file it should not scan the content of the folder called MembersOnly and should not index its content (e.g. web pages) as links for search engine results. All other website crawlers would be allowed to scan and index the content of the MembersOnly folder.
Fig 1.4 Only disallow the GoogleBot website crawler from scanning and indexing MembersOnly
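As a rough check of the GoogleBot-only rule, the Python sketch below writes the two instructions out and asks Python's standard urllib.robotparser module how two differently named website crawlers would be treated. MSNBot stands in for "any other crawler" and news.htm is a made-up web page.

```python
from urllib.robotparser import RobotFileParser

# Only GoogleBot is named, so only GoogleBot is restricted.
ROBOTS_TXT = """\
User-agent: GoogleBot
Disallow: /MembersOnly/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_allowed = parser.can_fetch("GoogleBot", "/MembersOnly/news.htm")
msnbot_allowed = parser.can_fetch("MSNBot", "/MembersOnly/news.htm")
print(googlebot_allowed, msnbot_allowed)
```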
At this point you may be asking "What is the point of stopping one website crawler from scanning and indexing when the other website crawlers can do so?".
Well, provided that you do not have a private area on your website (e.g. a MembersOnly folder), one reason is that a certain website crawler might be
scanning and indexing your website too frequently, innocently using up your bandwidth in the process.
Remember, a website crawler also caches (makes a copy of) your web pages so the general public can view them when your web hosting provider's server
(computer) is not working and your website is therefore offline (not live).
Another reason could be to stop a bad (malware) website crawler from scanning and indexing your website. Malware (Malicious Software) website crawlers
scan and index your website looking for private information and products (membership areas, private documents, software you sell and so on) in order
to retrieve passwords, product numbers, database details and so on. In these cases you may want to stop all website crawlers by using the
User-agent: * and Disallow: / instructions together. Bear in mind, though, that a malware website crawler would simply ignore your robots.txt text file!
Fig 1.5 Disallow all website crawlers from scanning and indexing your website's content
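Written out, the "stop everything" pair is only two lines long. The Python sketch below is an illustration using the standard urllib.robotparser module; the crawler names and web pages tested are just examples, and it can only show how an obedient crawler would react.

```python
from urllib.robotparser import RobotFileParser

# "/" matches every path on the website, and "*" matches every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

results = [
    parser.can_fetch(agent, page)
    for agent in ("GoogleBot", "MSNBot", "Slurp")
    for page in ("/", "/index.htm", "/images/car.jpg")
]
print(any(results))
```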
If you want to disallow all website crawlers from scanning and indexing a certain web page (e.g. a web page called news.htm inside the MembersOnly folder) you would use the User-agent: * instruction followed by the Disallow: /MembersOnly/news.htm instruction.
Fig 1.6 Disallow all website crawlers from scanning and indexing the news.htm web page
To do the same thing but disallow only the GoogleBot web crawler you would use the User-Agent: GoogleBot instruction followed by the Disallow: /MembersOnly/news.htm instruction.
Fig 1.7 Only disallow the GoogleBot website crawler from scanning and indexing news.htm
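The two single-page examples (Fig 1.6 and Fig 1.7) can be compared side by side in a small Python sketch. It is a hedged illustration using the standard urllib.robotparser module; the helper name allowed and the second web page, index.htm, are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ALL_CRAWLERS = "User-agent: *\nDisallow: /MembersOnly/news.htm\n"
GOOGLEBOT_ONLY = "User-agent: GoogleBot\nDisallow: /MembersOnly/news.htm\n"


def allowed(robots_txt, user_agent, page):
    """Ask whether an obedient crawler may scan the given page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page)


# Fig 1.6: news.htm is closed to everyone, the rest of the folder is not.
slurp_news = allowed(ALL_CRAWLERS, "Slurp", "/MembersOnly/news.htm")
slurp_index = allowed(ALL_CRAWLERS, "Slurp", "/MembersOnly/index.htm")

# Fig 1.7: news.htm is closed to GoogleBot only.
google_news = allowed(GOOGLEBOT_ONLY, "GoogleBot", "/MembersOnly/news.htm")
other_news = allowed(GOOGLEBOT_ONLY, "Slurp", "/MembersOnly/news.htm")
print(slurp_news, slurp_index, google_news, other_news)
```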
If you want to add more than one folder to your disallow list, simply put another instruction of Disallow: /FolderName/ into your robots.txt text file. For example, to disallow all website crawlers from scanning and indexing your cgi-bin folder, your images folder and your MembersOnly folder you would have the following robots.txt text file:
Fig 1.8 Disallow all website crawlers from scanning and indexing more than one folder
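A robots.txt text file with three Disallow: lines under one User-agent: line might look like the string in this Python sketch, which again leans on the standard urllib.robotparser module to act as an obedient crawler. The file names inside each folder are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# One Disallow: line per folder, all under the same User-agent: line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /MembersOnly/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = [
    page
    for page in ("/cgi-bin/form.cgi", "/images/car.jpg",
                 "/MembersOnly/news.htm", "/index.htm")
    if not parser.can_fetch("GoogleBot", page)
]
print(blocked)
```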
To specifically disallow GoogleBot from scanning and indexing your image files, wherever they are located in your public_html folder, you can use its image user agent called GoogleBot-Image instead.
Fig 1.9 Disallow the website crawler GoogleBot from scanning and indexing any image files
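The image user agent rule can be sketched in Python as follows. This only shows how Python's standard urllib.robotparser module reads the two instructions; how Google's own crawlers behave is up to Google, and car.jpg is just an example file.

```python
from urllib.robotparser import RobotFileParser

# Only the image crawler is named, and it is shut out of the whole website.
ROBOTS_TXT = """\
User-agent: GoogleBot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

image_bot_allowed = parser.can_fetch("GoogleBot-Image", "/images/car.jpg")
web_bot_allowed = parser.can_fetch("GoogleBot", "/images/car.jpg")
print(image_bot_allowed, web_bot_allowed)
```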
You can use Disallow: /images/ above, instead of Disallow: /, if you want to disallow your images folder only, or you can stick to using
User-agent: GoogleBot with Disallow: /images/. One reason for wanting to disallow images, apart from having private/family photos you
do not want the general public to see, is bandwidth theft.
Suppose you have a photo of a car on your website. Many websites might link to that car photo (e.g. http://www.yourwebsite.com/car.jpg) instead of
displaying it from their own website's images folder, because they do not want people using their bandwidth to download the car photo
from their website and/or because they do not own the car photo.
They would rather have people clicking on their CAR Link, which links to the car photo on your website, so that your bandwidth, not theirs, is used
to display your car photo on their website from your images folder.
To allow a certain website crawler (e.g. GoogleBot) to scan and index your website's content but disallow all other website crawlers, you would use the
following instruction pairs. The empty line between the two pairs of instructions acts as a user agent separator, which allows you to
build up a combination of user agent instructions.
Fig 1.10 Allow GoogleBot to scan and index your website's content but not other website crawlers
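One common way to write these two pairs is sketched below in Python. It is an assumption that the original figure used an empty Disallow: line for GoogleBot (an empty Disallow: means "nothing is disallowed"); the check itself uses the standard urllib.robotparser module and MSNBot stands in for any other crawler.

```python
from urllib.robotparser import RobotFileParser

# First pair: GoogleBot may scan everything (empty Disallow:).
# The empty line separates it from the second pair, which blocks the rest.
ROBOTS_TXT = """\
User-agent: GoogleBot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_allowed = parser.can_fetch("GoogleBot", "/index.htm")
msnbot_allowed = parser.can_fetch("MSNBot", "/index.htm")
print(googlebot_allowed, msnbot_allowed)
```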
Although there are a few newer instructions out there (namely ALLOW, SITEMAP and CRAWL-DELAY), as well as wildcards (e.g. the use of * and $), this section has explained the basics of the robots.txt instructions that most website beginners will need. If you wish to know more about robots.txt, consider visiting the RobotsTxt website and this Search Tools website.