Which AI bots should I look for in my logs?

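The common ones are GPTBot and ChatGPT-User (OpenAI), ClaudeBot and anthropic-ai (Anthropic), PerplexityBot (Perplexity), CCBot (Common Crawl, which feeds many models), Bytespider (ByteDance), Amazonbot and Meta-ExternalAgent. These identify themselves in the User-Agent header, so they are easy to match by name. Google-Extended and Applebot-Extended are different: they are robots.txt tokens that control whether content fetched by Google's and Apple's regular crawlers may be used for AI training. You set rules for them in robots.txt, but they never appear as a User-Agent in your logs.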

How do I tell a training crawler from a live fetch?

Some agents scrape pages in bulk to train models (GPTBot, ClaudeBot, CCBot), while others fetch a single page in real time because a user asked the assistant about it (ChatGPT-User, Perplexity-User). Both appear in your access log by User-Agent, but the live-fetch agents tend to arrive one request at a time, whereas training crawlers work through many URLs on the site.

Can I block AI bots from my site?

Yes. Well-behaved crawlers honour robots.txt, so a Disallow rule for their user agent stops them. For ones that ignore it, you can match the User-Agent in Nginx and return 403. Remember that blocking training crawlers can also remove your site from the answers those assistants give, so it is a trade-off, not a pure win.

Do AI bots actually obey robots.txt?

The major, named crawlers from OpenAI, Anthropic, Google and Perplexity document their user agents and state that they honour robots.txt. User-triggered fetchers may be treated differently by their operators, and some scrapers ignore robots.txt entirely. That is why robots.txt is your first line of control but not your only one: log everything, and enforce hard blocks in the web server for agents that misbehave.

Track AI bots visiting your website

Know the user agents
Find them in your access log
Log AI bots to their own file
Control what they can do
Check any site from the outside

AI crawlers now account for a noticeable share of web traffic. Some scrape pages in bulk to train models; others fetch a single page in real time because someone asked an assistant about it. Either way, you can usually see who is visiting, because these agents announce themselves in the User-Agent header. Here's how to track them.

Know the user agents

The major AI companies publish their crawler names and say they respect robots.txt rules addressed to those names. The ones worth watching for:

User agent	Operator	Purpose
`GPTBot`	OpenAI	Training crawl
`ChatGPT-User`	OpenAI	Live fetch for a user prompt
`ClaudeBot`, `anthropic-ai`	Anthropic	Training crawl
`Google-Extended`	Google	robots.txt token controlling AI training use (Gemini); not sent as a User-Agent
`PerplexityBot`	Perplexity	Search index
`CCBot`	Common Crawl	Open dataset many models train on
`Bytespider`	ByteDance	Training crawl
`Amazonbot`	Amazon	AI training
`Applebot-Extended`	Apple	robots.txt token controlling AI training use; not sent as a User-Agent
`Meta-ExternalAgent`	Meta	AI training

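Most of these appear by name in the User-Agent header, which is what makes tracking them straightforward. The exceptions are Google-Extended and Applebot-Extended: they are robots.txt control tokens only. Google and Apple crawl with their regular user agents, so those two names never show up in an access log.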

Find them in your access log

The quickest check needs nothing but the log you already have. Grep the access log for the names:

# Google-Extended is a robots.txt token, not a User-Agent, so it is not matched here.
grep -iE "GPTBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider" \
  /var/log/nginx/access.log

To see which AI bots hit you most over the current log, pull out the user agent and count:

# -o prints only the matched name, so uniq -c can count hits per bot.
grep -ioE "GPTBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider" \
  /var/log/nginx/access.log | sort | uniq -c | sort -rn

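That one line gives a first indication of whether you're being crawled for training (lots of GPTBot or ClaudeBot hits) or fetched live (occasional ChatGPT-User hits). The count alone does not show how many distinct URLs each bot requested, which is the clearer sign of a bulk crawl.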
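To separate a bulk crawl from occasional live fetches, it helps to compare total hits with the number of distinct URLs per bot. The following is a minimal sketch, assuming the default combined log format (request in the second quoted field, user agent in the last one). The function name ai_bot_spread is hypothetical, not part of any tool.

```shell
#!/bin/sh
# ai_bot_spread LOGFILE  (hypothetical helper name)
# Prints "<agent> hits=<n> urls=<n>" for each AI bot seen in LOGFILE.
# Assumes the combined log format. Google-Extended is omitted because it
# is a robots.txt token, not a User-Agent.
ai_bot_spread() {
    awk -F'"' '
        BEGIN {
            n = split("GPTBot ChatGPT-User ClaudeBot anthropic-ai PerplexityBot CCBot Bytespider", names, " ")
        }
        {
            ua = tolower($6)            # sixth quote-delimited field: User-Agent
            split($2, req, " ")         # second field: "METHOD /path HTTP/x"
            for (i = 1; i <= n; i++) {
                if (index(ua, tolower(names[i])) > 0) {
                    hits[names[i]]++
                    key = names[i] SUBSEP req[2]
                    if (!(key in seen)) { seen[key] = 1; urls[names[i]]++ }
                    break
                }
            }
        }
        END {
            # Report in the fixed order of the names list.
            for (i = 1; i <= n; i++)
                if (names[i] in hits)
                    printf "%s hits=%d urls=%d\n", names[i], hits[names[i]], urls[names[i]]
        }
    ' "$1"
}
```

Called as ai_bot_spread /var/log/nginx/access.log, a bot with many hits spread over many URLs looks like a training crawl, while a handful of hits on one or two URLs looks like live fetching.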

Log AI bots to their own file

Grepping is fine for a spot check, but for ongoing tracking it's cleaner to route AI-bot requests into a dedicated log. Use an Nginx map (it belongs in the http context) to flag the user agent, then log conditionally:

map $http_user_agent $ai_bot {
    default          "";
    "~*GPTBot"        "openai";
    "~*ChatGPT-User"  "openai";
    "~*ClaudeBot"     "anthropic";
    "~*anthropic-ai"  "anthropic";
    # Google-Extended is a robots.txt token only; it never appears as a User-Agent.
    "~*PerplexityBot" "perplexity";
    "~*CCBot"         "commoncrawl";
    "~*Bytespider"    "bytedance";
}

server {
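    # Keep the main log: an access_log here replaces any inherited from http.
    access_log /var/log/nginx/access.log combined;
    # Write AI-bot hits to their own file, and only those.
    access_log /var/log/nginx/ai-bots.log combined if=$ai_bot;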
}

Reload Nginx and the new file fills up with AI-bot traffic only:

nginx -t && systemctl reload nginx
tail -f /var/log/nginx/ai-bots.log

This is the same conditional-logging technique covered in log bot requests in nginx and detect Googlebot in nginx, narrowed to the AI crawlers.
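Once the dedicated log exists, a per-day hit count gives a quick view of the trend. This is a rough sketch, assuming ai-bots.log is written in the combined format as configured above, so the timestamp is the fourth space-separated field. The function name ai_bot_daily is made up for illustration.

```shell
#!/bin/sh
# ai_bot_daily LOGFILE  (hypothetical helper name)
# Prints "<dd/Mon/yyyy> <hits>" for each day, in the order days first appear.
ai_bot_daily() {
    awk '
        {
            day = substr($4, 2, 11)     # "[01/Jan/2025:00:00:01" -> "01/Jan/2025"
            if (!(day in count)) days[++n] = day
            count[day]++
        }
        END {
            for (i = 1; i <= n; i++) printf "%s %d\n", days[i], count[days[i]]
        }
    ' "$1"
}
```

Run it as ai_bot_daily /var/log/nginx/ai-bots.log. Because log lines are written chronologically, the output is in date order without an extra sort.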

Control what they can do

Once you can see them, you can decide. Well-behaved crawlers honour robots.txt, so you allow or disallow each by name:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Allow: /

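For agents that ignore robots.txt, enforce the block at the web server and return a 403. The $ai_bot variable you already defined is non-empty for every agent in the map, so the rule below blocks all of them. To block only some, compare against a label instead, for example if ($ai_bot = "bytedance"):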

if ($ai_bot) { return 403; }
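After enabling a hard block, the status codes in the log show whether it is being applied. The sketch below assumes the combined log format, where the status follows the quoted request. The helper name ai_bot_statuses is an assumption for illustration.

```shell
#!/bin/sh
# ai_bot_statuses LOGFILE  (hypothetical helper name)
# Prints "<count> <status> <agent>" per combination, highest count first.
ai_bot_statuses() {
    awk -F'"' '
        BEGIN {
            n = split("GPTBot ChatGPT-User ClaudeBot anthropic-ai PerplexityBot CCBot Bytespider", names, " ")
        }
        {
            ua = tolower($6)            # sixth quote-delimited field: User-Agent
            split($3, f, " ")           # third field: " <status> <bytes> "
            for (i = 1; i <= n; i++)
                if (index(ua, tolower(names[i])) > 0) { print f[1], names[i]; break }
        }
    ' "$1" | sort | uniq -c | sort -rn
}
```

A blocked agent should show 403 for its requests after the reload. If it still shows 200, the if rule is probably not in the server or location block that handles those requests.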

Bear in mind the trade-off: blocking training crawlers can also drop your site out of the answers those assistants give. Many sites choose to allow AI bots precisely so they get cited, and just track the volume. If you want to point crawlers at a curated, machine-readable summary of your site instead, an llms.txt file is a proposed convention for that and is starting to see adoption.

Check any site from the outside

To see how a site treats AI crawlers without digging through logs — which agents it allows, what its robots.txt and llms.txt say — run it through the AI scan tool. It fetches the same signals an AI bot would and reports how the site responds, which is a fast way to audit your own setup or a competitor's.
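For a quick manual check of the robots.txt side, the rules can be read straight from the file. This is a simplified sketch, not a full robots.txt parser: it only looks for a site-wide "Disallow: /" per group and ignores path-specific rules. The function name robots_ai_rules is hypothetical, and example.com in the usage note is a placeholder.

```shell
#!/bin/sh
# robots_ai_rules  (hypothetical helper name)
# Reads a robots.txt on stdin and reports, per AI token, whether a
# site-wide "Disallow: /" applies. Path-specific rules are ignored.
robots_ai_rules() {
    awk '
        BEGIN {
            n = split("GPTBot ChatGPT-User ClaudeBot anthropic-ai Google-Extended PerplexityBot CCBot Bytespider", names, " ")
        }
        {
            sub(/#.*/, ""); sub(/\r$/, "")          # drop comments and CR
            colon = index($0, ":")
            if (colon == 0) next
            field = tolower(substr($0, 1, colon - 1))
            value = substr($0, colon + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", field)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (field == "user-agent") {
                # A User-agent line after rules starts a new group.
                if (in_rules) { g = 0; in_rules = 0 }
                group[++g] = tolower(value)
                named[tolower(value)] = 1
            } else if (field == "allow" || field == "disallow") {
                in_rules = 1
                if (field == "disallow" && value == "/")
                    for (i = 1; i <= g; i++) blocked[group[i]] = 1
            }
        }
        END {
            for (i = 1; i <= n; i++) {
                key = tolower(names[i])
                if (key in blocked)       state = "disallow /"
                else if (key in named)    state = "named, no site-wide disallow"
                else if ("*" in blocked)  state = "disallow / via *"
                else                      state = "not named"
                print names[i] ": " state
            }
        }
    '
}
```

Typical use would be curl -s https://example.com/robots.txt | robots_ai_rules. Google-Extended is included here because robots.txt is the one place that token has any effect.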
