AI Bot Abuse: Our Evolving Strategy for Taming Performance Nightmares on Drupal Faceted Search Pages
If you're managing a Drupal website that's struggling under the weight of AI bots making uncached requests, you're not alone. We've been wrestling with this exact problem for months across several websites we manage for our clients. Finally, we found peace.
Key Points
- AI bots are hammering search pages with facets that are built as links
- Cloudflare and Pantheon both have solutions to help mitigate that traffic
- Rebuilding facets as true parts of the search form using Facets 3 is a promising path forward
I first posted about this issue on LinkedIn in March. The post was very popular, and I discovered a community of folks in the same boat, desperately seeking answers. Through our continuous improvement efforts, and a fair amount of trial and error, we learned a lot, and I added those learnings as comments on that post. This article aggregates our initial findings with subsequent discoveries, offering solutions and insights to help you mitigate these frustrating performance crises. We’ll update this article as we learn more.
The common thread among the affected websites is the presence of search pages with facets. Every facet combination generates a unique, cache-breaking request that requires a trip to the origin database. Because the Facets module renders facet blocks as lists of links, the AI bots see each facet as just another link to visit. Facets are inherently expensive, not just because they adaptively display taxonomy terms but also because they provide node counts for each term given the currently selected facet combination. Add the fact that the AI bots will visit every possible combination of terms, that each of those requests is uncached, and your database is getting crushed.
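To make "every combination is a unique URL" concrete, here is the shape of those requests (the path, facet names, and term IDs are hypothetical; `f%5B0%5D` is the URL-encoded form of the `f[0]` facet query parameter):

```
/search?f%5B0%5D=topic%3A12
/search?f%5B0%5D=topic%3A12&f%5B1%5D=year%3A2024
/search?f%5B0%5D=topic%3A15&f%5B1%5D=year%3A2023&f%5B2%5D=type%3A7
```

With even a handful of vocabularies, the number of distinct URLs explodes combinatorially, and none of them are served from cache.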

Web Application Firewall: Blocking the traffic at the edge
The best way to protect the application and its infrastructure is to prevent the traffic from getting there in the first place. Web Application Firewalls (WAFs) intercept the request from the client to the server, inspect the nature of the traffic, determine whether it is valid or malicious, and take the appropriate action. We want to block the abusive AI bots traversing facets and we want to allow the humans to go about their business. Determining the difference between the two is, well, difficult.
Before diving into solutions, it's crucial to understand that the days of blocking singular IP addresses are over. These sophisticated crawlers are utilizing a vast array of IP addresses, making traditional blocking tactics ineffective.
Cloudflare WAF rules
Some of the websites we manage utilize Cloudflare. Their self-help WAF administration model is great for being in control of tuning what traffic is blocked and allowed. With great power comes great responsibility: approach WAF administration with caution. I highly urge you to define a change control protocol that includes a RACI matrix (or similar) and change validation.
Cloudflare’s pricing puts advanced traffic mitigation techniques (such as JA3 fingerprinting) out of reach for most organizations, but the Free, Pro, and Business tiers all come with powerful, albeit cruder, WAF solutions to give your website relief. We settled on returning a managed challenge (an interstitial page that says "Please wait, we are checking your browser...") once per session when a visitor uses a facet. And this is what I mean by crude: we’d prefer there be no intrusion on the user experience at all.
Our WAF rule definition has evolved; here’s the latest configuration, with the equivalent filter expression shown after the steps. It responds to facet usage anywhere on the website.
- Navigate to Security > WAF > Custom rules > Create rule.
- Add a new condition:
- Field: URI Query String
- Operator: wildcard
- Value: *f%5B0%5D*
- Set Choose action to "Managed Challenge."
- Click Deploy.
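If you prefer Cloudflare's Expression Editor over the rule builder, the equivalent expression looks roughly like this (the `wildcard` operator is available on current Cloudflare plans; on configurations without it, a `contains "f%5B0%5D"` match achieves the same effect):

```
(http.request.uri.query wildcard "*f%5B0%5D*")
```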
This rule immediately proved effective against new swarms of bots. I checked in on the analytics after implementing the rule. The orange line represents the traffic that triggered the rule - that’s a ton of traffic!

The impact on the visits metric on the Pantheon dashboard is staggering. The first full day after enabling this WAF rule, we’re down to ~4.5K visits per day (which is the correct level of traffic for this website).

And the corresponding impact on the number of uncached requests is indicated in this graph. As a rule of thumb, I always want to see a cache hit ratio well above 90%.

Pantheon's AGCDN + WAF
For those on Pantheon with access to their AGCDN + WAF service, this is a compelling route to take. It’s a fully managed solution that avoids many of the hazards of Cloudflare’s self-help model and adds the benefits of expertise. The team at Pantheon takes care of analyzing the traffic and applying WAF rules. That said, I still recommend the change control process discussed above.
There is no equivalent of Cloudflare’s managed challenge response (Pantheon uses Fastly). Instead, the approach at Pantheon is to use rate limiting and to block traffic based on JA3 fingerprinting; a deeper discussion of these two approaches follows below.
We have a client who was on Pantheon but did not have Cloudflare out in front, and we thought we were stuck. But after some quick work and one or two approvals, we were in business. Just like the impact seen above with Cloudflare’s WAF rule implementation, you can see the dramatic drop in daily visits, back down to human-centric levels:

And the corresponding impact on the number of uncached requests is indicated in this graph. A stunning reversal.

Rate limiting
The default rate limiting setting at Pantheon is rather high; it isn’t tuned for dealing with the headache of facet traversal, which tends to get in under the default threshold. We have experimented with applying more aggressive rate limiting on pages with facets, with mixed results. Sure, it mitigates the threat of a negative performance impact, but the visit count (quota utilization) is still way too high. If you’re on a smaller plan (Basic, Performance Small, Performance Medium), you could easily overrun your utilization allocation. It's worth noting that Pantheon doesn't currently charge for these specific overages, but it's still something to monitor.
Blocking traffic
Let me start by saying this: blocking traffic is a reactive posture. It means that there has been an incident and you’re now scrambling to respond. It means that performance and/or utilization is suffering. These moments are inherently unplanned and therefore disruptive and stressful. It’s a bad moment all around.
That said, Pantheon does have the ability to block requests based on JA3 fingerprinting, which is magical. We create a ticket with Pantheon, the AGCDN team analyzes the traffic, and they apply the WAF rule. In the end, the impact on our team and the client's team is smaller because Pantheon handles the hard part of figuring out which tactic to deploy.
A Crucial Pantheon Basic Plan Gotcha: If you have a website on a Pantheon "Basic" plan, then you don’t have Redis. This means your website stores its page cache in the cache_page database table of your Drupal application. That table can grow extremely large under facet traversal, with a 15MB database quickly ballooning to 200MB or more until an hourly cron job cleans it up. If your website is under relentless facet traversal, your cron job might exceed its 120-second timeout and fail to complete. That prevents the database cleanup, leaving a continuously growing database. So, keep a close eye on your database size via the backups page on the Pantheon dashboard if you're on a Basic plan.
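A quick way to watch for this, assuming a standard MySQL/MariaDB database (reachable via `drush sql:cli`, or `terminus drush <site>.<env> -- sql:cli` on Pantheon), is to list the cache table sizes directly:

```sql
-- List cache tables by size, largest first (sizes in MB).
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name LIKE 'cache%'
ORDER BY size_mb DESC;
```

If cache_page dominates, a `drush cache:rebuild` will buy you breathing room in a pinch while you address the underlying traffic.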
Resiliency by design: Remove the possibility of facet abuse
In this section we’ll cover how better application architecture can mitigate, and even eliminate, the threat of facet traversal.
Limit the number of facets that can be used
The author of the Facet Bot Blocker module, John Brandenburg at Forum One, joined the LinkedIn discussion. This module is quite clever and can be very effective for configurations where the number of possible facet parameters far exceeds what a human would reasonably use. If this describes your use case, give it a try. With a well-tuned "Facet parameter limit" you can significantly reduce database load and make it home in time for dinner.
More information about our evaluation is here. It wasn't a perfect fit for our needs because we have a low number of facets and a high likelihood of humans using all of them on their user journey.
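To illustrate the underlying idea (this is a sketch, not Facet Bot Blocker's actual code; the function name and limit are hypothetical): count the facet parameters on the incoming request and refuse to build the expensive search page when the count exceeds what a human would plausibly select.

```php
<?php

use Symfony\Component\HttpFoundation\Request;

/**
 * Sketch of a facet parameter limit: returns TRUE when the request
 * carries more f[] facet parameters than a human would plausibly use.
 */
function _mysite_facet_limit_exceeded(Request $request, int $limit = 3): bool {
  // e.g. ?f[0]=topic:12&f[1]=year:2024 yields ['topic:12', 'year:2024'].
  $facets = $request->query->all('f');
  return count($facets) > $limit;
}
```

When the check trips, you can return a cheap 403 before the view and its expensive count queries ever execute.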
Remove the links: An experiment in refactoring facets
The underlying problem here is that facets are rendered as links, and AI bots love links. The AI bot sees a link and makes a GET request.
What if the facets were rendered as something bots don’t like to request, say a form action?
Enter the Facets Exposed Filters sub-module, part of the Facets 3.0 release from January of this year.
Facets implemented this way are true form fields, bundled into the Views exposed filter form, something bots don’t like to submit. (This could of course change in the future, if the last 18 months of web traffic patterns are any indication.)
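To make the markup change concrete, here's a simplified before/after (illustrative only; the real output of both modules carries more classes and wrapper markup):

```html
<!-- Before: facet block rendered as links; every combination is a crawlable URL. -->
<ul class="facets">
  <li><a href="/search?f%5B0%5D=topic%3A12">Topic A (34)</a></li>
  <li><a href="/search?f%5B0%5D=topic%3A15">Topic B (8)</a></li>
</ul>

<!-- After: facets as checkboxes inside the Views exposed filter form;
     changing the result set now requires a form submission. -->
<form action="/search" method="get">
  <label><input type="checkbox" name="f[0]" value="topic:12"> Topic A (34)</label>
  <label><input type="checkbox" name="f[1]" value="topic:15"> Topic B (8)</label>
  <button type="submit">Apply</button>
</form>
```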
And if that wasn’t promising enough, now that we're dealing with a form, it opens up other bot-fighting options such as attaching CAPTCHAs or protecting it with Antibot, which is not only good at what it does but also doesn’t introduce any user experience hazards. (We did have some difficulty integrating Antibot and put that effort on hold; we’ll circle back and update this article.)
Bonus: we no longer have to go through the difficulty of integrating facet blocks with the form elements produced by Views (keyword search, results per page).
So we got to work refactoring our facets for one of our clients.
We had to rewire our front-end code to accommodate the HTML markup change from links to checkbox fields via the Better Exposed Filters module (which is a dependency). On the surface this sounds tedious but not difficult. It turns out the Better Exposed Filters theme rendering abstraction, while impressive, is difficult to follow, so good luck with that!
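As one small, hedged example of the kind of rewiring involved (the theme name, view ID, and asset library here are hypothetical), attaching front-end assets now targets the exposed filter form rather than a facet block:

```php
<?php

use Drupal\Core\Form\FormStateInterface;

/**
 * Implements hook_form_alter().
 *
 * Attach our checkbox styling/behaviors to the search view's exposed
 * filter form, which now contains the facets.
 */
function mytheme_form_alter(array &$form, FormStateInterface $form_state, $form_id) {
  // All Views exposed filter forms share this $form_id; the #id
  // distinguishes the specific view and display.
  if ($form_id === 'views_exposed_form' && ($form['#id'] ?? '') === 'views-exposed-form-search-page-1') {
    $form['#attached']['library'][] = 'mytheme/facet-checkboxes';
  }
}
```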
And the results?
We deployed the refactored facets for one of our clients in July. The impact is clear to see in the graphs below – visits dropped to a steady level representative of human and modest bot traffic and the cache hit ratio is now well north of 90%. Perfect!

And the corresponding impact on the number of uncached requests is indicated in this graph. I love this!

While the initial metrics are positive, it is premature to celebrate and declare success. To be clear, I don’t see the operators of AI bots deliberately changing their crawlers to traverse form-based facets, but it could happen by accident, in the same way that link-based facet traversal became a thing. We have periodic review tasks to audit the visits and cache hit ratio metrics, and we will provide updates measuring success (or regression!) over time.
Another lesson learned: robots.txt tuning
In a surprising discovery, we found one of our websites was suffering from facet traversal by Googlebot. While Googlebot is typically considered a good bot, our robots.txt wasn’t telling it not to traverse the facets, so it did; the file needed some tuning. In the spirit of our Cloudflare WAF configuration, we implemented a global rule using the RobotsTXT module to protect all facet requests, shown below.
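For reference, the directive as it appears in our robots.txt (the `*` wildcards are honored by Googlebot; `f%5B0%5D` is the encoded `f[0]` facet parameter):

```
User-agent: *
Disallow: /*f%5B0%5D*
```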
This had an immediate effect on load and we worked closely with the client’s Google Analytics manager to be sure it didn’t have a negative effect on SEO.
Broader Perspectives: Strategic Industry-Wide Solutions
While we've focused on tactical ways of dealing with facet traversal, it's important to acknowledge that more strategic, industry-wide solutions are emerging.
Cloudflare's announcement of Pay-per-Crawl and Dries Buytaert's suggestion for a content licensing marketplace are far more strategic. If implemented broadly, these types of initiatives could fundamentally solve the downstream facet traversal problem at its source, offering a more sustainable and equitable way for content providers to interact with AI companies and other large-scale crawlers.
Of course there is a much simpler way: AI bots should respect the directives in the robots.txt file.