What Is robots.txt?
A robots.txt file is a plain text file placed at the root of a website that tells search engine crawlers and other web robots which pages or sections of the site they are allowed or not allowed to access. It follows the Robots Exclusion Protocol, a standard that has been in use since 1994 and was formally documented as RFC 9309 in 2022.
When a crawler like Googlebot visits your site, it first checks https://yoursite.com/robots.txt before crawling any other page. If the file exists, the crawler reads the directives and follows them (assuming the crawler respects the standard). If the file does not exist, the crawler assumes it can access all pages.
It is important to understand that robots.txt is a request, not an enforcement mechanism. Well-behaved crawlers like Googlebot, Bingbot, and most legitimate bots honor the file. However, malicious scrapers and spam bots may ignore it entirely. If you need to truly block access, use server-side authentication, IP filtering, or other access control mechanisms.
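A well-behaved client can honor robots.txt with a few lines of code. Python's standard library ships urllib.robotparser for exactly this; the sketch below parses inline rules (the bot name and example.com URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules the way a polite crawler would.
# (Here we parse inline text; RobotFileParser.set_url() + read()
# would instead fetch https://example.com/robots.txt over the network.)
rules = """\
User-agent: *
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Check each path before fetching it.
print(parser.can_fetch("MyBot", "https://example.com/admin/settings"))  # False
print(parser.can_fetch("MyBot", "https://example.com/blog/post-1"))     # True
```

Nothing forces a crawler to run a check like this, which is why robots.txt alone is never a security boundary.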
Where to Place robots.txt
The file must be named exactly robots.txt (lowercase) and placed in the root directory of your domain. For example:
https://www.example.com/robots.txt ✓ Correct
https://www.example.com/pages/robots.txt ✗ Wrong location
https://www.example.com/Robots.txt ✗ Wrong case

Each subdomain needs its own robots.txt. A file at www.example.com/robots.txt does not apply to blog.example.com. If you host a blog on a separate subdomain, create a separate robots.txt for it.
Robots.txt Syntax and Directives
The file consists of one or more rule groups. Each group starts with a User-agent line that specifies which crawler the rules apply to, followed by Allow and Disallow directives.
User-agent
The User-agent directive identifies the crawler. Use * (asterisk) as a wildcard to match all crawlers, or specify a particular bot by name:
User-agent: * # All crawlers
User-agent: Googlebot # Google's crawler
User-agent: Bingbot # Bing's crawler
User-agent: GPTBot # OpenAI's crawler

Disallow
The Disallow directive tells a crawler not to access a specific path. An empty Disallow means nothing is blocked (everything is allowed):
Disallow: /admin/ # Block the /admin/ directory
Disallow: /private.html # Block a specific page
Disallow: /tmp/ # Block the /tmp/ directory
Disallow: # Allow everything (empty value)

Allow
The Allow directive explicitly permits access to a path, even if a parent directory is disallowed. This is useful for overriding broad Disallow rules:
User-agent: *
Disallow: /private/
Allow: /private/public-page.html

In this example, the entire /private/ directory is blocked except for public-page.html.
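When Allow and Disallow rules overlap like this, Google applies the most specific (longest) matching rule, with Allow winning ties. A toy Python sketch of that precedence, simplified to literal prefix matching with no wildcards:

```python
def is_allowed(rules, path):
    """rules: list of ("allow" | "disallow", prefix) pairs.
    Longest matching prefix wins; Allow wins ties (Google's rule)."""
    best = ("allow", "")  # no matching rule at all means allowed
    for kind, prefix in rules:
        if path.startswith(prefix) and (
            len(prefix) > len(best[1])
            or (len(prefix) == len(best[1]) and kind == "allow")
        ):
            best = (kind, prefix)
    return best[0] == "allow"

rules = [("disallow", "/private/"), ("allow", "/private/public-page.html")]
print(is_allowed(rules, "/private/public-page.html"))  # True
print(is_allowed(rules, "/private/secret.html"))       # False
```

The Allow rule wins for public-page.html because its prefix is longer, which is exactly why it can carve an exception out of a broader Disallow.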
Sitemap
The Sitemap directive tells crawlers where to find your XML sitemap. It can appear anywhere in the file and is not tied to a specific User-agent block:
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-blog.xml

Crawl-delay
The Crawl-delay directive asks crawlers to wait a specified number of seconds between requests. Note that Google does not honor Crawl-delay — use Google Search Console to adjust crawl rate for Googlebot instead. Bing and some other crawlers do respect it:
User-agent: Bingbot
Crawl-delay: 10 # Wait 10 seconds between requests

Common Configurations
Allow All Crawlers
To allow all crawlers full access to your site, use an empty Disallow:
User-agent: *
Disallow:
Sitemap: https://www.example.com/sitemap.xml

Block All Crawlers
To block all crawlers from your entire site (common on staging environments):
User-agent: *
Disallow: /

Be extremely careful with this rule. If accidentally deployed to production, it will de-index your entire site from search engines over time.
Block AI Crawlers
With the rise of AI training crawlers, many site owners want to block bots like GPTBot (OpenAI), Google-Extended (Gemini training), and CCBot (Common Crawl) while still allowing normal search indexing:
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: *
Disallow: /admin/
Allow: /
Sitemap: https://www.example.com/sitemap.xml

Standard Website Configuration
A typical configuration for a production website blocks admin areas, API endpoints, and internal search while allowing everything else:
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /cart/
Disallow: /account/
Disallow: /*.json$
Allow: /
Sitemap: https://www.example.com/sitemap.xml

Wildcard Patterns
Google and Bing support two wildcard characters in robots.txt paths:
- * matches any sequence of characters. For example, Disallow: /folder*/page blocks /folder1/page, /folder-abc/page, etc.
- $ anchors the match to the end of the URL. For example, Disallow: /*.pdf$ blocks all URLs ending in .pdf but not /pdf-guide/.
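One way to see how these two wildcards behave is to translate a pattern into a regular expression. The helper below is purely illustrative (it is not part of any robots.txt library, and real crawlers have their own matchers):

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex:
    '*' matches any run of characters, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(regex + ("$" if anchored else ""))

pdf_rule = pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))  # True
print(bool(pdf_rule.match("/pdf-guide/")))        # False
```

Without the trailing $, the same rule would also block /pdf-guide/, since the pattern would match anywhere the prefix fits.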
# Block all PDF files
Disallow: /*.pdf$
# Block all URLs with query parameters
Disallow: /*?
# Block specific file types
Disallow: /*.doc$
Disallow: /*.xls$

SEO Implications
Your robots.txt file has a direct impact on how search engines crawl and index your site. Here are key SEO considerations:
- Blocking important pages hurts rankings. If you accidentally disallow pages you want indexed, they will disappear from search results. Always verify your rules carefully.
- Disallow does not mean noindex. A disallowed page can still appear in search results if other pages link to it; Google just cannot crawl its content. Use a noindex meta tag to truly prevent indexing.
- Crawl budget matters for large sites. For sites with millions of pages, blocking low-value pages (filters, session URLs, duplicate content) helps search engines focus their crawl budget on important content.
- Include your sitemap. Adding a Sitemap directive helps crawlers discover all your pages, especially new or deeply nested ones.
- Test before deploying. A misconfigured robots.txt can silently de-index your site. Always test changes using a validator tool before pushing to production.
Testing Your robots.txt
Before deploying your robots.txt, test it to make sure your rules work as intended:
- Google Search Console — Check the robots.txt report to confirm Google fetched your file without errors, and use the URL Inspection tool to see whether a specific URL is blocked for Googlebot. (The older standalone robots.txt Tester has been retired.)
- Bing Webmaster Tools — Bing offers a similar testing tool in its webmaster dashboard.
- PulpMiner Robots.txt Tool — Use our free Robots.txt Generator to build and validate your file with a visual interface. It generates correct syntax automatically and lets you preview how different crawlers will interpret your rules.
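You can also smoke-test a candidate file locally before deploying it, again using Python's standard urllib.robotparser. The rules and URLs below are placeholders; substitute your own critical pages:

```python
from urllib.robotparser import RobotFileParser

candidate = """\
User-agent: *
Disallow: /admin/
Allow: /

User-agent: Bingbot
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(candidate.splitlines())

# Pages that must stay crawlable: fail loudly if a rule blocks them.
for url in ["https://www.example.com/", "https://www.example.com/blog/"]:
    assert rp.can_fetch("Googlebot", url), f"{url} is blocked!"

# Pages that should be blocked.
assert not rp.can_fetch("Googlebot", "https://www.example.com/admin/users")

print(rp.crawl_delay("Bingbot"))  # 10
```

Running a check like this in CI catches the classic mistake of shipping a staging Disallow: / to production.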
Common Mistakes to Avoid
- Forgetting a trailing slash. Disallow: /admin blocks /admin, /admin/, and /admin-panel. Disallow: /admin/ only blocks paths starting with /admin/.
- Blocking CSS and JavaScript files. Google needs access to CSS and JS files to render your pages. Blocking them can negatively affect how Google understands your content.
- Using robots.txt for sensitive content. Anyone can read your robots.txt file. Do not list secret paths or directories; you are essentially telling the world they exist.
- Leaving staging rules in production. A Disallow: / that was correct on staging will destroy your SEO if deployed to production.
