HOW TO WRITE A ROBOTS.TXT TEXT FILE

A website crawler (also known as a web spider or web robot) is a program or automated script, normally run by a search engine, that scans your website (your public_html folder and the folders and files within it). It collects and analyses information about your website and its web pages (e.g. the number of web pages, the language used, and the keywords and phrases used) and about their content (e.g. the number of links, code information, email addresses, and whether audio or video is used).

The search engine uses some of that information to give your website and its web pages a ranking, a position and a subject listing within its search results, while the rest is used for caching (storage) and database purposes. Search engine companies such as Google, Yahoo and Microsoft may also choose to pass the information they collect on to third parties (e.g. other search engine companies). Generally, then, a website crawler (web robot) is a good thing. It examines your website for popularity, links, well-written articles and other content in order to make your website and web pages reachable by as many people as possible via a search engine.

The downside of a website crawler, though, is that it tends to scan every web page and folder it can reach within your public_html folder (your website). This is not good if you are using one of your folders as a "Members Only" area of your website, because the website crawler will collect information from the web pages inside that "Members Only" folder and, more precisely, keep a record of their location. In turn, the search engine will list those "Members Only" web pages (as links). So far I have described a single website crawler, but most search engines run their own, which means many search engines could end up listing your "Members Only" web pages (as links). For example, the Robotstxt Database - Web Page Listing (or Robotstxt Database - Text List) lists over 300 website crawlers (web robots), each with its own unique User Agent name (e.g. GoogleBot, MSNBot, Slurp, AskJeeves).

Fortunately, there is an answer to this problem, although it is not a complete solution. It is called the robots.txt text file: a simple .txt (text) file, which can be created with a text editor such as Notepad, that allows you to give instructions to a website crawler, in particular a DISALLOW instruction.

The DISALLOW instruction tells a website crawler not to scan certain folders and files within your website (public_html folder). So you could, for example, disallow a website crawler from scanning your "Members Only" folder and the web pages within it. However, the robots.txt text file is not a complete solution because it is not governed by any laws, meaning website crawlers can ignore your robots.txt text file altogether and therefore ignore your DISALLOW instruction. Once the robots.txt text file is created, you upload (transfer) it to your public_html folder. A website crawler will then read that robots.txt text file, if it wants to, before it scans your website (the content of your public_html folder).

To create a robots.txt text file, begin by opening your favourite text editor (e.g. Notepad or Wordpad) and type your instructions for the website crawler to obey (Fig 1.0). From there, save the text file with the filename robots.txt, using Notepad's (or Wordpad's) SAVE AS menu-item (Fig 1.1 and Fig 1.2), before uploading (transferring) the text file to your public_html folder (Fig 1.3).



Fig 1.0  Open Notepad and then type your instructions for the website crawler to obey




Fig 1.1  Click on Notepad's FILE menu and select the SAVE AS menu-item to continue




Fig 1.2  Save the text as a text (.txt) file with the name: robots.txt




Fig 1.3  The robots.txt text file when it has been uploaded (transferred) to your public_html folder

In the above example, Fig 1.0 shows the instruction User-agent: with a parameter of * (asterisk), followed by the instruction Disallow: with a parameter of /MembersOnly/. Together they tell all User Agents (website crawlers) not to scan the content (folders and files) of the folder called MembersOnly.
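
As a rough sketch, the text typed into Notepad for Fig 1.0 would look something like the following. It is reconstructed from the description above rather than copied from the figure. Lines beginning with # are comments, which website crawlers skip.

```text
# Reconstructed from the description of Fig 1.0 (an assumption, not the original image)
# Applies to every website crawler
User-agent: *
# Do not scan anything inside the MembersOnly folder
Disallow: /MembersOnly/
```

The folder name is matched exactly as typed, so /MembersOnly/ and /membersonly/ are treated as different folders.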

USER AGENTS AND ROBOTSTXT INSTRUCTIONS

So far you have learnt that a website crawler is also known as a web robot or web spider, and that it has a unique name known as a User Agent (otherwise known as a robot name or spider name). For example, the website crawler with the user agent of GoogleBot is the one Google uses to scan your website (public_html folder) and return search engine results for the general public, based on the information it scans from your website. A website crawler can, if it wants to, ignore your robots.txt text file. That said, the major website crawlers do obey the instructions inside your robots.txt text file.

When a web page is shown in a search engine result as a link, that web page is known as an indexed web page. Its content (e.g. keywords and email addresses) has been scanned as normal, but its full path name (path name and file name) has also been indexed as a link specifically for search engine results. Not all website crawlers belong to a search engine; those that do not only scan and do not index for search engine results, although they may index for their own purposes (e.g. a links database).

The User-agent: instruction expects a parameter that names the user agent (website crawler) that the instructions following it apply to. For example, the instruction User-agent: GoogleBot, followed by the instruction Disallow: /MembersOnly/, would tell the website crawler called GoogleBot that it should not scan the content of the folder called MembersOnly and should not index its content (e.g. web pages) as links for search engine results. All other website crawlers would be allowed to scan and index the content of the MembersOnly folder.



Fig 1.4  Only disallow the website crawler called GoogleBot from scanning and indexing the content of the MembersOnly folder
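
Going by the description of Fig 1.4, the robots.txt text file would probably read as follows. This is a reconstruction from the text, not the original figure.

```text
# Reconstructed from the description of Fig 1.4 (an assumption, not the original image)
# Applies only to the website crawler called GoogleBot
User-agent: GoogleBot
# GoogleBot should stay out of the MembersOnly folder
Disallow: /MembersOnly/
```

Because no other user agent is named, every other website crawler is left unrestricted.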

At this point you may be asking "What is the point of stopping one website crawler from scanning and indexing when the other website crawlers can do so?". Provided that you do not have a private area on your website (e.g. a MembersOnly folder), one reason is that a certain website crawler might be scanning and indexing your website too frequently, innocently using up your bandwidth in the process. Remember that a website crawler also caches (makes a copy of) your web pages so the general public can view them when your web hosting provider's server (computer) is not working and your website is therefore offline.

Another reason could be to stop a bad (malware) website crawler from scanning and indexing your website. Malware (malicious software) website crawlers scan and index your website looking for private information and products (membership areas, private documents, software you sell and so on) in order to retrieve passwords, product numbers, database details and so on. In these cases you may want to stop all website crawlers by using the User-agent: * and Disallow: / instructions together, although a malware website crawler would simply ignore your robots.txt text file!



Fig 1.5  Disallow all website crawlers from scanning and indexing your website's content
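
A minimal sketch of the file that Fig 1.5 describes, assuming it contains nothing but the two instructions named in the text:

```text
# Reconstructed from the description of Fig 1.5 (an assumption, not the original image)
# Applies to every website crawler
User-agent: *
# A single / stands for the whole website, so nothing may be scanned
Disallow: /
```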

If you want to disallow all website crawlers from scanning and indexing a certain web page (e.g. a web page called membersnews.htm inside the MembersOnly folder), you would use the User-agent: * instruction followed by the Disallow: /MembersOnly/membersnews.htm instruction.



Fig 1.6  Disallow all website crawlers from scanning and indexing the membersnews.htm web page
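
Written out, the file for Fig 1.6 would look roughly like this (reconstructed from the instructions quoted in the text):

```text
# Reconstructed from the description of Fig 1.6 (an assumption, not the original image)
# Applies to every website crawler
User-agent: *
# Only this one web page is off limits; the rest of MembersOnly is not covered
Disallow: /MembersOnly/membersnews.htm
```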

To do the same thing but disallow only the GoogleBot website crawler you would use the User-Agent: GoogleBot instruction followed by the Disallow: /MembersOnly/membersnews.htm instruction.



Fig 1.7  Only disallow the website crawler called GoogleBot from scanning and indexing the membersnews.htm web page
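
The GoogleBot-only version for Fig 1.7 would presumably differ from the previous file in its User-agent: line alone:

```text
# Reconstructed from the description of Fig 1.7 (an assumption, not the original image)
# Applies only to the website crawler called GoogleBot
User-agent: GoogleBot
# GoogleBot should not scan this one web page
Disallow: /MembersOnly/membersnews.htm
```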

If you want to add more than one folder to your disallow list, simply put another Disallow: /FolderName/ instruction into your robots.txt text file. For example, to disallow all website crawlers from scanning and indexing your cgi-bin folder, your images folder and your MembersOnly folder, you would have the following robots.txt text file:



Fig 1.8  Disallow all website crawlers from scanning and indexing more than one folder
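
For the three folders named in the text, the file in Fig 1.8 would look something like the sketch below. The folder names come from the text and their order in the file is an assumption.

```text
# Reconstructed from the description of Fig 1.8 (an assumption, not the original image)
# Applies to every website crawler
User-agent: *
# One Disallow: line per folder
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /MembersOnly/
```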

To specifically disallow GoogleBot from scanning and indexing your image files, wherever they are located in your public_html folder, you can use its image user agent called GoogleBot-Image instead.



Fig 1.9  Disallow GoogleBot from scanning and indexing your image files, regardless of where they are in your public_html folder
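
A sketch of the file that Fig 1.9 appears to describe, using the GoogleBot-Image user agent named in the text:

```text
# Reconstructed from the description of Fig 1.9 (an assumption, not the original image)
# Applies only to Google's image crawler
User-agent: GoogleBot-Image
# A single / covers every folder, so no image anywhere on the website is scanned
Disallow: /
```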

You can use Disallow: /images/ in Fig 1.9, instead of Disallow: /, if you want to disallow your images folder only, or you can stick to using User-agent: GoogleBot with Disallow: /images/. One reason for wanting to disallow images, apart from having private or family photos you do not want the general public to see, is bandwidth theft.

Suppose you have a photograph of a car on your website. Many websites might link to that car photograph on your website (e.g. http://www.yourwebsite.com/car.jpg) instead of displaying it from their own images folder, either because they do not want their own bandwidth used each time the photograph is downloaded or because they do not own the picture. They would rather have people click on their CAR link, which points to the car photograph in your images folder, so that your bandwidth is used and not theirs.

To allow a certain website crawler (e.g. GoogleBot) to scan and index your website's content but disallow all other website crawlers, you would use the two instruction pairs shown in Fig 1.10. The empty line between the two pairs of instructions acts as a user agent separator, meaning it allows you to build up a combination of user agent instructions.



Fig 1.10  Allow GoogleBot to scan and index your website's content but not other website crawlers
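
The two instruction pairs for Fig 1.10 are not spelled out in the text, so the following is only a plausible reconstruction. It relies on the convention that a Disallow: instruction with no parameter disallows nothing.

```text
# Reconstructed for Fig 1.10 (an assumption, not the original image)
# First pair: GoogleBot may scan everything, because the Disallow: parameter is empty
User-agent: GoogleBot
Disallow:

# Second pair: every other website crawler is disallowed from the whole website
User-agent: *
Disallow: /
```

A website crawler follows the pair that names it most specifically, so GoogleBot uses the first pair and ignores the second.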

Although there are a few newer instructions out there (namely ALLOW, SITEMAP and CRAWL-DELAY), as well as wildcards (e.g. the use of * and $), this section has explained the basics of the robots.txt instructions that most website beginners would need. However, if you wish to know more about robots.txt, consider visiting the RobotsTxt website and this Search Tools website.
