Track AI bots visiting your website

AI crawlers like GPTBot, ClaudeBot and PerplexityBot scrape sites to train models and answer questions. Here is how to spot them in your logs, log them to their own file, and decide what to allow.

Published by Mark van Eijk on July 1, 2026
Updated on July 1, 2026 · 2 minute read

  1. Know the user agents
  2. Find them in your access log
  3. Log AI bots to their own file
  4. Control what they can do
  5. Check any site from the outside

AI crawlers now make up a real slice of traffic. Some scrape pages in bulk to train models, others fetch a single page in real time because someone asked an assistant about it. Either way, the well-behaved ones announce themselves in the User-Agent header, so you can see who's visiting. Here's how to track them.

Know the user agents

The major AI companies publish the names their crawlers use. The ones worth watching for:

| User agent | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training crawl |
| ChatGPT-User | OpenAI | Live fetch for a user prompt |
| ClaudeBot, anthropic-ai | Anthropic | Training crawl |
| Google-Extended | Google | AI training (Gemini) |
| PerplexityBot | Perplexity | Search index |
| CCBot | Common Crawl | Open dataset many models train on |
| Bytespider | ByteDance | Training crawl |
| Amazonbot, Applebot-Extended, Meta-ExternalAgent | Amazon, Apple, Meta | AI training |

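Most of them match cleanly by name in the User-Agent header, which is what makes tracking them straightforward. The exceptions are Google-Extended and Applebot-Extended: these are robots.txt control tokens rather than separate crawlers. The requests themselves arrive under Google's and Apple's regular crawler user agents, so those two names will not show up in your access log.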
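
As a minimal sketch of matching by name, the function below maps a User-Agent string to an operator label. The name `classify_ai_bot` and the labels are illustrative assumptions, and the list covers only the crawlers that send their name in the request header.

```shell
# classify_ai_bot (hypothetical helper): print an operator label for a
# User-Agent string, or return 1 if no known AI crawler name appears in it.
classify_ai_bot() {
  local ua
  # Lower-case the input so the match is case-insensitive.
  ua=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$ua" in
    *gptbot*|*chatgpt-user*)    echo "openai" ;;
    *claudebot*|*anthropic-ai*) echo "anthropic" ;;
    *perplexitybot*)            echo "perplexity" ;;
    *ccbot*)                    echo "commoncrawl" ;;
    *bytespider*)               echo "bytedance" ;;
    *)                          return 1 ;;
  esac
}
```

Calling `classify_ai_bot "Mozilla/5.0 (compatible; GPTBot/1.0)"` with that made-up sample string prints `openai`. A substring match is enough here because the crawler names are distinctive.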

Find them in your access log

The quickest check needs nothing but the log you already have. Grep the access log for the names:

grep -iE "GPTBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider" \
  /var/log/nginx/access.log

To see which AI bots hit you most over the current log, pull out the user agent and count:

grep -ioE "GPTBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider" \
  /var/log/nginx/access.log | sort | uniq -c | sort -rn

That one line gives a first indication of whether you're being crawled for training (high GPTBot or ClaudeBot counts) or fetched live (occasional ChatGPT-User hits).
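
Hit counts alone don't show how widely a bot ranges over the site. The sketch below also counts the distinct URLs each bot requested. It assumes the log uses Nginx's standard `combined` format; `ai_bot_url_spread` is a hypothetical helper name, and the bot list is the same one used in the grep above.

```shell
# ai_bot_url_spread (hypothetical helper): read combined-format log lines on
# stdin and print "<bot> <hits> <distinct_urls>" per crawler, busiest first.
ai_bot_url_spread() {
  awk -F'"' '
    BEGIN {
      n = split("GPTBot ChatGPT-User ClaudeBot anthropic-ai PerplexityBot CCBot Bytespider", bots, " ")
    }
    {
      ua = tolower($6)       # 6th quote-delimited field is the User-Agent
      split($2, req, " ")    # 2nd field is the request: METHOD /path HTTP/x
      for (i = 1; i <= n; i++) {
        if (index(ua, tolower(bots[i]))) {
          hits[bots[i]]++
          # Count each (bot, path) pair only the first time it is seen.
          if (!seen[bots[i], req[2]]++) urls[bots[i]]++
        }
      }
    }
    END { for (b in hits) print b, hits[b], urls[b] }
  ' | sort -k2,2nr
}
```

Run it as `ai_bot_url_spread < /var/log/nginx/access.log`. Many hits spread over many distinct URLs suggests a bulk crawl, while a few hits on one or two URLs looks more like live fetches.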

Log AI bots to their own file

Grepping is fine for a spot check, but for ongoing tracking it's cleaner to route AI-bot requests into a dedicated log. Use an Nginx map to flag the user agent, then log conditionally:

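# The map block belongs in the http context, outside any server block.
map $http_user_agent $ai_bot {
    default           "";
    "~*GPTBot"        "openai";
    "~*ChatGPT-User"  "openai";
    "~*ClaudeBot"     "anthropic";
    "~*anthropic-ai"  "anthropic";
    # Google-Extended is a robots.txt token, not a request user agent,
    # so there is nothing to match for it here.
    "~*PerplexityBot" "perplexity";
    "~*CCBot"         "commoncrawl";
    "~*Bytespider"    "bytedance";
}

server {
    # An access_log at this level replaces any inherited from the http block,
    # so repeat the regular log here if you still want it.
    access_log /var/log/nginx/access.log combined;
    # Write AI-bot hits to their own file, and only those.
    access_log /var/log/nginx/ai-bots.log combined if=$ai_bot;
}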

Reload Nginx and the new file fills up with AI-bot traffic only:

nginx -t && systemctl reload nginx
tail -f /var/log/nginx/ai-bots.log

This is the same conditional-logging technique covered in log bot requests in nginx and detect Googlebot in nginx, narrowed to the AI crawlers.
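
Once the dedicated log exists, tracking volume over time is a short awk job. The sketch below counts hits per day, assuming the `combined` log format shown above; `ai_bot_hits_per_day` is a hypothetical helper name.

```shell
# ai_bot_hits_per_day (hypothetical helper): read combined-format log lines
# on stdin and print "<day> <hits>" in the order the days first appear.
ai_bot_hits_per_day() {
  awk '
    {
      # $4 looks like "[01/Jul/2026:10:00:00"; keep only the date part.
      split($4, t, ":")
      day = substr(t[1], 2)
      if (!(day in count)) order[++n] = day
      count[day]++
    }
    END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
  '
}
```

Run it as `ai_bot_hits_per_day < /var/log/nginx/ai-bots.log`. Because the log is written chronologically, first-seen order is also date order.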

Control what they can do

Once you can see them, you can decide what to allow. Well-behaved crawlers honour robots.txt, so you can allow or disallow each one by name:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Allow: /
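
To confirm what a robots.txt file says for one crawler, a rough check like the one below can help. It is a deliberately simplified sketch, not a full robots.txt parser: it looks only at groups that name the agent exactly, ignores the `*` wildcard group and any Allow overrides, and treats only a bare `Disallow: /` as a block. `robots_blocks_agent` is a hypothetical helper name.

```shell
# robots_blocks_agent (hypothetical helper): read robots.txt on stdin and
# exit 0 if a group naming the given agent contains "Disallow: /", else 1.
robots_blocks_agent() {
  awk -v agent="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" '
    # Strip comments and surrounding whitespace (including CR) from each line.
    { sub(/#.*/, ""); gsub(/^[ \t]+|[ \t\r]+$/, "") }
    tolower($0) ~ /^user-agent:/ {
      name = tolower($0)
      sub(/^user-agent:[ \t]*/, "", name)
      # A User-agent line after a rule line starts a new group.
      if (!in_header) match_group = 0
      if (name == agent) match_group = 1
      in_header = 1
      next
    }
    /./ { in_header = 0 }
    match_group && tolower($0) ~ /^disallow:[ \t]*\/$/ { blocked = 1 }
    END { exit blocked ? 0 : 1 }
  '
}
```

For example, `robots_blocks_agent GPTBot < robots.txt && echo "GPTBot is disallowed"` prints the message for the file shown above, while the same check for PerplexityBot prints nothing.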

For agents that ignore robots.txt, enforce the block at the web server instead. Inside the server block, return a 403 using the $ai_bot variable you already defined:

if ($ai_bot) { return 403; }

Bear in mind the trade-off: blocking training crawlers can also drop your site out of the answers those assistants give. Many sites choose to allow AI bots precisely so they get cited, and just track the volume. If you want to point crawlers at a curated, machine-readable summary of your site instead, an llms.txt file is the emerging convention for that.

Check any site from the outside

To see how a site treats AI crawlers without digging through logs — which agents it allows, what its robots.txt and llms.txt say — run it through the AI scan tool. It fetches the same signals an AI bot would and reports how the site responds, which is a fast way to audit your own setup or a competitor's.

Frequently asked questions

Which AI bots should I look for in my logs?
The common ones are GPTBot and ChatGPT-User (OpenAI), ClaudeBot and anthropic-ai (Anthropic), PerplexityBot (Perplexity), CCBot (Common Crawl, which feeds many models), Bytespider (ByteDance), Amazonbot and Meta-ExternalAgent. They identify themselves in the User-Agent header, so they are easy to match by name. Google-Extended and Applebot-Extended are robots.txt tokens that control AI training use; the crawling itself happens under Google's and Apple's regular user agents, so those two names do not appear in logs.

How do I tell a training crawler from a live fetch?
Some agents scrape pages in bulk to train models (GPTBot, ClaudeBot, CCBot), while others fetch a single page in real time because a user asked the assistant about it (ChatGPT-User, Perplexity-User). Both appear in your access log by User-Agent, but the live-fetch ones tend to arrive one request at a time rather than crawling your whole sitemap.

Can I block AI bots from my site?
Yes. Well-behaved crawlers honour robots.txt, so a Disallow rule for their user agent stops them. For ones that ignore it, you can block by User-Agent in nginx and return 403. Remember that blocking training crawlers can also remove your site from the answers those assistants give, so it is a trade-off, not a pure win.

Do AI bots actually obey robots.txt?
The major, named crawlers from OpenAI, Anthropic, Google and Perplexity document their user agents and state that they honour robots.txt. Some scrapers ignore it entirely. That is why robots.txt is your first line of control but not your only one: log everything, and enforce hard blocks in the web server for agents that misbehave.