Robots.txt Isn’t the Best Way to Keep Your Site From Being Spidered
Honestly, I’m not entirely sure why some people would want to keep their site out of the search engines. I can only assume that they have content that they don’t want the search engines to see. Anyway, that wish gave rise to a myth: that robots.txt is the best way to keep the spiders away from your site.
The basic idea is a fairly good one, except that it doesn’t work in all cases. Robots.txt is a request, not a lock: well-behaved spiders honor it, but nothing forces a crawler to obey, and a blocked page can still turn up in search results if another website links to it.
Say, for instance, you set your robots.txt to tell the spiders to ignore the landing page. If there’s a backlink somewhere that points to that page, the spiders can discover it through that link, and it may still be listed. You can, theoretically, add a rule for every one of your pages, but there are more elegant ways of getting the robots to ignore your site, or at least a selection of pages.
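To make that concrete, here is a minimal Python sketch of how a well-behaved spider reads a robots.txt rule, using the standard library’s urllib.robotparser. The file contents, the page names, and the ExampleBot user agent are made up for illustration. Notice that the check happens on the crawler’s side: the file only states a request, and it’s up to the spider to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to skip one page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /landing.html
"""


def is_allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the rules in robots_txt permit `agent` to fetch `path`."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for page in ("/landing.html", "/about.html"):
        print(page, "allowed" if is_allowed(ROBOTS_TXT, page) else "disallowed")
```

A polite spider runs a check like this before each request and skips whatever is disallowed. A spider that never runs the check can fetch the page anyway, which is exactly why robots.txt on its own is a weak way to hide content.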
First of all, putting the pages you want ignored behind a login screen is much easier than fiddling with your robots.txt file. If you have to log in to see it, the spiders can’t touch it; it’s that simple.
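Here is a rough sketch of that idea in Python. It isn’t tied to any particular web framework: the /members/ prefix and the function name are hypothetical, and a real site would look at a session or cookie instead of a simple flag. The point is only that the server decides who gets the page, rather than asking the visitor to behave.

```python
# Hypothetical URL prefixes that require a login.
PRIVATE_PREFIXES = ("/members/",)


def response_status(path: str, logged_in: bool) -> int:
    """Return the HTTP status the server would send for this request."""
    if path.startswith(PRIVATE_PREFIXES) and not logged_in:
        # Anonymous visitors, spiders included, never receive the content.
        return 401
    return 200


if __name__ == "__main__":
    # A spider arrives without a login, just like any anonymous visitor.
    print(response_status("/members/report.html", logged_in=False))
    print(response_status("/index.html", logged_in=False))
    print(response_status("/members/report.html", logged_in=True))
```

Unlike a robots.txt rule, this works no matter how the spider found the link, because the protected page is never sent to it in the first place.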
You can see this at work on blog sites like LiveJournal, where you can set entries to be private or public. The public pages can be viewed without a login, so they can be spidered and will appear in Google searches. However, if you set a post to private, the spiders can’t read it, and so it doesn’t show up in Google searches.
The other great thing about login access is that you can monetize your content this way. Keep your landing page accessible to spiders, charge for login access, and there you go. That is, unless you REALLY have something to hide, in which case it’s best not to have a website at all.













