How to Create a Robots.txt File (Complete Guide)

Published on February 10, 2026

What Is robots.txt?

A robots.txt file is a plain text file placed at the root of a website that tells search engine crawlers and other web robots which pages or sections of the site they are allowed or not allowed to access. It follows the Robots Exclusion Protocol, a standard that has been in use since 1994 and was formally documented as RFC 9309 in 2022.

When a crawler like Googlebot visits your site, it first checks https://yoursite.com/robots.txt before crawling any other page. If the file exists, the crawler reads the directives and follows them (assuming the crawler respects the standard). If the file does not exist, the crawler assumes it can access all pages.

It is important to understand that robots.txt is a request, not an enforcement mechanism. Well-behaved crawlers like Googlebot, Bingbot, and most legitimate bots honor the file. However, malicious scrapers and spam bots may ignore it entirely. If you need to truly block access, use server-side authentication, IP filtering, or other access control mechanisms.

Where to Place robots.txt

The file must be named exactly robots.txt (lowercase) and placed in the root directory of your domain. For example:

https://www.example.com/robots.txt        ✓ Correct
https://www.example.com/pages/robots.txt  ✗ Wrong location
https://www.example.com/Robots.txt        ✗ Wrong case

Each subdomain needs its own robots.txt. A file at www.example.com/robots.txt does not apply to blog.example.com. If you host a blog on a separate subdomain, create a separate robots.txt for it.

Robots.txt Syntax and Directives

The file consists of one or more rule groups. Each group starts with a User-agent line that specifies which crawler the rules apply to, followed by Allow and Disallow directives.

User-agent

The User-agent directive identifies the crawler. Use * (asterisk) as a wildcard to match all crawlers, or specify a particular bot by name:

User-agent: *          # All crawlers
User-agent: Googlebot  # Google's crawler
User-agent: Bingbot    # Bing's crawler
User-agent: GPTBot     # OpenAI's crawler

Disallow

The Disallow directive tells a crawler not to access a specific path. An empty Disallow means nothing is blocked (everything is allowed):

Disallow: /admin/       # Block the /admin/ directory
Disallow: /private.html # Block a specific page
Disallow: /tmp/         # Block the /tmp/ directory
Disallow:               # Allow everything (empty value)
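You can check how rules like these are interpreted using Python's standard-library urllib.robotparser. A minimal sketch (the example.com paths are placeholders):

```python
from urllib.robotparser import RobotFileParser

# parse() accepts an iterable of robots.txt lines.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /private.html",
    "Disallow: /tmp/",
])

print(rp.can_fetch("*", "https://www.example.com/admin/users"))   # False (blocked)
print(rp.can_fetch("*", "https://www.example.com/private.html"))  # False (blocked)
print(rp.can_fetch("*", "https://www.example.com/index.html"))    # True  (allowed)
```

In production you would call set_url() and read() to fetch the live file instead of parsing hard-coded lines.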

Allow

The Allow directive explicitly permits access to a path, even if a parent directory is disallowed. This is useful for overriding broad Disallow rules:

User-agent: *
Disallow: /private/
Allow: /private/public-page.html

In this example, the entire /private/ directory is blocked except for public-page.html. When Allow and Disallow rules conflict, Google applies the most specific (longest) matching rule, so the more specific Allow wins here.
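The same rules can be checked with Python's urllib.robotparser. One caveat in this sketch: Python's parser applies the first matching rule rather than the most specific one, so the Allow line is listed before the Disallow line here:

```python
from urllib.robotparser import RobotFileParser

# Python's parser is order-sensitive (first match wins), unlike Google's
# most-specific-rule matching, so Allow comes first in this rule list.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /private/public-page.html",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "/private/public-page.html"))  # True  (explicitly allowed)
print(rp.can_fetch("*", "/private/secret.html"))       # False (blocked by /private/)
```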

Sitemap

The Sitemap directive tells crawlers where to find your XML sitemap. It can appear anywhere in the file and is not tied to a specific User-agent block:

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-blog.xml
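Sitemap directives can be read programmatically too. A small sketch using urllib.robotparser's site_maps() method (available since Python 3.8):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "Sitemap: https://www.example.com/sitemap.xml",
    "Sitemap: https://www.example.com/sitemap-blog.xml",
])

# Returns the sitemap URLs in file order, or None if none were declared.
print(rp.site_maps())
```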

Crawl-delay

The Crawl-delay directive asks crawlers to wait a specified number of seconds between requests. Note that Google does not honor Crawl-delay — Googlebot sets its own crawl rate based on how your server responds. Bing and some other crawlers do respect it:

User-agent: Bingbot
Crawl-delay: 10  # Wait 10 seconds between requests
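urllib.robotparser exposes this value through crawl_delay() (available since Python 3.6). A sketch:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Bingbot",
    "Crawl-delay: 10",
])

# Returns the delay for a matching group, or None if no group matches.
print(rp.crawl_delay("Bingbot"))    # 10
print(rp.crawl_delay("Googlebot"))  # None
```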

Common Configurations

Allow All Crawlers

To allow all crawlers full access to your site, use an empty Disallow:

User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml

Block All Crawlers

To block all crawlers from your entire site (common on staging environments):

User-agent: *
Disallow: /

Be extremely careful with this rule. If it is accidentally deployed to production, search engines stop crawling your site and your pages will gradually drop out of search results.

Block AI Crawlers

With the rise of AI training crawlers, many site owners want to block bots like GPTBot (OpenAI), Google-Extended (Gemini training), and CCBot (Common Crawl) while still allowing normal search indexing:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://www.example.com/sitemap.xml
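A quick offline check with Python's urllib.robotparser confirms the per-agent behavior of a file like this (a trimmed sketch with placeholder paths):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "/blog/post"))     # False - AI crawler blocked everywhere
print(rp.can_fetch("Googlebot", "/blog/post"))  # True  - search crawling still allowed
print(rp.can_fetch("Googlebot", "/admin/"))     # False - admin blocked for everyone
```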

Standard Website Configuration

A typical configuration for a production website blocks admin areas, API endpoints, and internal search while allowing everything else:

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /cart/
Disallow: /account/
Disallow: /*.json$
Allow: /

Sitemap: https://www.example.com/sitemap.xml

Wildcard Patterns

Google and Bing support two wildcard characters in robots.txt paths:

  • * matches any sequence of characters. For example, Disallow: /folder*/page blocks /folder1/page, /folder-abc/page, etc.
  • $ anchors the match to the end of the URL. For example, Disallow: /*.pdf$ blocks all URLs ending in .pdf but not /pdf-guide/.

# Block all PDF files
Disallow: /*.pdf$

# Block all URLs with query parameters
Disallow: /*?

# Block specific file types
Disallow: /*.doc$
Disallow: /*.xls$
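Python's built-in urllib.robotparser predates these wildcards and does not understand them, but the matching behavior is easy to sketch with the re module. This is an illustrative helper (rule_to_regex is a hypothetical name, not a library function), not a full implementation of Google's matcher:

```python
import re

def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule using * and $ into a compiled
    regex that matches URL paths from the start (a sketch of the
    Google/Bing wildcard semantics)."""
    pattern = re.escape(rule)               # escape regex metacharacters
    pattern = pattern.replace(r"\*", ".*")  # * matches any character sequence
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"        # trailing $ anchors the end of the URL
    return re.compile(pattern)

pdf_rule = rule_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))  # True  -> blocked
print(bool(pdf_rule.match("/pdf-guide/")))        # False -> allowed
```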

SEO Implications

Your robots.txt file has a direct impact on how search engines crawl and index your site. Here are key SEO considerations:

  • Blocking important pages hurts rankings. If you accidentally disallow pages you want indexed, they will disappear from search results. Always verify your rules carefully.
  • Disallow does not mean noindex. A disallowed page can still appear in search results if other pages link to it — Google just cannot crawl its content. Use a noindex meta tag to truly prevent indexing.
  • Crawl budget matters for large sites. For sites with millions of pages, blocking low-value pages (filters, session URLs, duplicate content) helps search engines focus their crawl budget on important content.
  • Include your sitemap. Adding a Sitemap directive helps crawlers discover all your pages, especially new or deeply nested ones.
  • Test before deploying. A misconfigured robots.txt can silently de-index your site. Always test changes using a validator tool before pushing to production.

Testing Your robots.txt

Before deploying your robots.txt, test it to make sure your rules work as intended:

  1. Google Search Console — The robots.txt report (which replaced the older robots.txt Tester) shows the robots.txt files Google found for your site, when they were last crawled, and any parsing errors or warnings.
  2. Bing Webmaster Tools — Bing offers a similar testing tool in its webmaster dashboard.
  3. PulpMiner Robots.txt Tool — Use our free Robots.txt Generator to build and validate your file with a visual interface. It generates correct syntax automatically and lets you preview how different crawlers will interpret your rules.

Common Mistakes to Avoid

  • Forgetting a trailing slash. Disallow: /admin blocks /admin, /admin/, and /admin-panel. Disallow: /admin/ only blocks paths starting with /admin/.
  • Blocking CSS and JavaScript files. Google needs access to CSS and JS files to render your pages. Blocking them can negatively affect how Google understands your content.
  • Using robots.txt for sensitive content. Anyone can read your robots.txt file. Do not list secret paths or directories — you are essentially telling the world they exist.
  • Leaving staging rules in production. A Disallow: / that was correct on staging will destroy your SEO if deployed to production.
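The trailing-slash pitfall is easy to demonstrate with urllib.robotparser (a sketch using placeholder paths):

```python
from urllib.robotparser import RobotFileParser

# Without the trailing slash, /admin-panel is blocked too.
rp_no_slash = RobotFileParser()
rp_no_slash.parse(["User-agent: *", "Disallow: /admin"])
print(rp_no_slash.can_fetch("*", "/admin-panel"))  # False

# With the trailing slash, only paths under /admin/ are blocked.
rp_slash = RobotFileParser()
rp_slash.parse(["User-agent: *", "Disallow: /admin/"])
print(rp_slash.can_fetch("*", "/admin-panel"))  # True
print(rp_slash.can_fetch("*", "/admin/users"))  # False
```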

Need to extract data from websites?

PulpMiner turns any webpage into structured JSON data. No scraping code needed.
