The Security Value of the robots.txt file

DISCLAIMER

This tutorial is for educational purposes only. Please don’t use these kinds of attacks for unethical purposes.

The “robots.txt” file is one of the primary ways of telling a search engine where it can and can’t go on your site. This convention is called the robots exclusion protocol.

The robots.txt file is read by search engine spiders. The first thing a spider such as Googlebot does when it visits a site is check the robots.txt file.

It’s very important that your “robots.txt” file is really called “robots.txt”. The name is case sensitive, so any other spelling will simply not work. The file should always sit at the root of your domain: if your domain is http://www.ethicalHacker7.com, it should be found at http://www.ethicalHacker7.com/robots.txt
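The "always at the root" rule can be sketched in a few lines of Python. This is only an illustration: the helper name robots_url is made up for this post, and it simply swaps the path of any page URL for /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    robots_url is a hypothetical helper name used only for this sketch.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.ethicalHacker7.com/welcome.html"))
# prints http://www.ethicalHacker7.com/robots.txt
```

Whatever page a robot is about to visit, the file it looks for is the same one at the top of the domain.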

The spider checks this file because it needs to know whether it has permission to access a page or file. If the robots.txt file says that it can enter the page, the spider continues on to the page files. If your website does not have a robots.txt file, the robot feels free to visit all your web pages and content, because that is what it is programmed to do in this situation.

This is simply how it works:

A robot wants to visit a website URL, say http://www.ethicalHacker7.com/welcome.html. Before it does so, it first checks for http://www.ethicalHacker7.com/robots.txt, and finds:

User-agent: *
Disallow: /
The “User-agent: *” means this section applies to all robots. The “Disallow: /” tells the robot that it should not visit any pages on the site.

The “Disallow” keyword tells the robots which folders they should not access.

For example, in a file that contains “User-agent: *” followed by “Disallow: /photos”, the “User-agent: *” line says “this applies to all robots” and the “Disallow: /photos” line says “don’t visit my photos folder”.
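To see how a well-behaved robot interprets these rules, here is a small sketch using Python’s standard urllib.robotparser module. The rules are parsed from strings, so nothing is fetched over the network. The helper is_allowed, the agent name ExampleBot and the path /photos/cat.jpg are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Check a URL against robots.txt rules supplied as a string.

    is_allowed and the agent name "ExampleBot" are hypothetical.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_PHOTOS = "User-agent: *\nDisallow: /photos\n"
SITE = "http://www.ethicalHacker7.com"

print(is_allowed(BLOCK_ALL, SITE + "/welcome.html"))        # False
print(is_allowed(BLOCK_PHOTOS, SITE + "/welcome.html"))     # True
print(is_allowed(BLOCK_PHOTOS, SITE + "/photos/cat.jpg"))   # False
```

Keep in mind that this only models a robot that chooses to obey the file. Nothing in robots.txt stops a client that ignores it.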

The Security Value…

During the reconnaissance stage of web application testing, attackers use the robots.txt file to gather valuable information.

Some websites do not have a robots.txt file at all. This could be justified as a security measure, since “Disallow” entries can reveal hidden folders. But not declaring which files a web crawler can and can’t crawl is a bad security measure too, because content such as an uploads folder can then end up in search results. A simple declaration such as the following would have asked crawlers to stay out of it.

User-agent: *
Disallow: /wp-content/uploads/

Of course, if we declare all the confidential paths on our site, such as http://www.ethicalHacker7.com/admin, an attacker will have a nice and easy job finding them. But if an attacker has been using Google to actively find confidential files, keeping those files out of the index removes that threat to the website, and that is an advantage.

The solution is to use robots.txt as a defense in depth measure.

Layer 1 – Access Control
Layer 2 – No Directory Listing
Layer 3 – Meta Robots
Layer 4 – Robots.txt

So we start out from the perimeter defense mechanism, the robots.txt. This file should be used to declare areas of the site that we don’t want to get indexed, but not the really sensitive folders. To protect our really sensitive files we can use the following declaration in the HTML page.

<meta name="robots" content="noindex, nofollow" />

This serves two purposes: search engines will not index the page, and the attacker will not have a ready list of sensitive directories to attack. We then strengthen our security infrastructure by disabling directory listing and adding access controls.
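If you want to check that a page really carries this tag, a rough sketch with Python’s standard html.parser module could look like the following. The class RobotsMetaParser and the function robots_directives are names invented for this example. Note that the directives have to be spelled “noindex” and “nofollow” without spaces, so a value like “no index” does not match.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for item in attr.get("content", "").split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


good = '<html><head><meta name="robots" content="noindex, nofollow" /></head></html>'
bad = '<html><head><meta name="robots" content="no index, no follow" /></head></html>'

print("noindex" in robots_directives(good))  # True
print("noindex" in robots_directives(bad))   # False
```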

Normally, when you pen test a website, you use a web proxy like Burp Suite and spider the site. Most probably you’ll be able to find the robots.txt file that way, or you can simply add “/robots.txt” to the end of the URL of the targeted website.
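Once you have the contents of a robots.txt file that you are authorized to review, listing what it gives away is straightforward. The sketch below works on a string rather than a live site. The function name disallowed_paths and the sample entries are made up for the illustration (the /admin line is a hypothetical entry).

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return the unique Disallow paths in a robots.txt string, in file order.

    disallowed_paths is a hypothetical helper name used only for this sketch.
    """
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty Disallow value means "nothing is disallowed", so skip it.
        if field.strip().lower() == "disallow" and value and value not in paths:
            paths.append(value)
    return paths


SAMPLE = """\
User-agent: *
Disallow: /wp-content/uploads/
Disallow: /admin   # hypothetical entry
Disallow:
"""

print(disallowed_paths(SAMPLE))
# prints ['/wp-content/uploads/', '/admin']
```

Running this against your own file is a quick way to see which paths you are advertising to anyone who asks.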

And you’ll be able to find some interesting information about the targeted website 😉 The rest is up to you… 😉

If you are clever enough, you’ll be able to get the admin privileges of the targeted website using some techniques… 😉

Stay Ethical… Use Protection… 

One thought on “The Security Value of the robots.txt file”

  1. Hi,

    thanks for the article. However, just as attackers may use the robots.txt, it may also be used as a honeypot. Disallowing wp-admin or wp-login from being indexed while hiding or relocating the backend login has proved a very feasible approach for securing the websites I am responsible for.

    Cheers
    Mike
