Spam Traffic Analysis
Bad-bot traffic collapses onto one request-time signature — Windows 10 + Chrome 146, from datacenter IPs (Portland OR, LA), scraping the /agency/* directory — which alone is 52% of all spam page-views at ~88% precision. Every table below carries a spam-% of segment column so you can challenge the high-precision fingerprints (reCAPTCHA / WAF) without country-blocking real US/India users or touching the ~87%-legitimate AI Assistant channel.
1. User-agent fingerprint
Spam is a page-view phenomenon: 0.30% of sessions (≈4,000) produce 24.5% of all page-views, because each bot session loads dozens to hundreds of pages. The dev team asked which requests are responsible, and the user-agent answers that almost by itself.
- The bots pin an outdated / mismatched UA: Windows 10 with Chrome 146, and older builds (Chrome 116, 120, 124) that no real user base still runs at volume. The stale or mismatched strings carry high precision too: Chrome 124 (83%), Chrome 116 on Linux (75%), Chrome 150 (83%).
- Read the last column as false-positive risk. At 88% spam, blocking Windows 10 · Chrome 146 outright would also block the ~12% of its page-views that come from real users: too high for a hard block, ideal for a reCAPTCHA challenge (humans pass, headless bots don't). The sub-segments above 95% are safe to deny outright.
Windows 10 alone accounts for 76% of all spam and 47% of its own page-views are spam — the cleanest coarse filter if a per-version rule is too granular to maintain.
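The false-positive reading above can be sketched as a small helper. This is a minimal sketch, not the team's actual rule engine: the function names are hypothetical, the 95% deny cutoff and 75% challenge floor are the bands quoted in this report, and treating everything between them as a challenge is an assumption.

```python
def false_positive_pct(spam_pct: float) -> float:
    """Share of a segment's page-views that come from real users."""
    return 100.0 - spam_pct


def action_for_segment(spam_pct: float,
                       deny_above: float = 95.0,
                       challenge_from: float = 75.0) -> str:
    """Map a segment's spam-% (its precision) to a suggested action.

    Assumption: anything from challenge_from up to deny_above is challenged.
    """
    if not 0.0 <= spam_pct <= 100.0:
        raise ValueError("spam_pct must be between 0 and 100")
    if spam_pct > deny_above:
        return "deny"
    if spam_pct >= challenge_from:
        return "challenge"
    return "allow"


# Spam-% figures quoted in this report.
segments = [
    ("Windows 10 / Chrome 146", 88.0),
    ("Chrome 124", 83.0),
    ("Portland OR", 98.0),
]
for name, pct in segments:
    print(f"{name}: {action_for_segment(pct)} "
          f"(real-user share {false_positive_pct(pct):.0f}%)")
```

The cutoffs are parameters so the bands can be tightened or loosened without touching the logic.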
Screen resolution
Screen resolution isn't in the GA4 → BigQuery export, so this one table is pulled from the GA4 Data API (screenResolution dimension). The API can't express our "≥50 page-views per session" threshold, but it returns page-views and sessions per resolution — so page-views per session stands in as the bot-flood density (a real user runs ~1–3; the flood bot runs 100+):
- 1600x900 is the resolution fingerprint: 304k page-views across just 2,591 sessions, or 117 per session. Every ordinary resolution (1920x1080, 3840x2160) sits at ~1. High volume alone is a red herring: 800x600 has 217k page-views but only ~1.2 per session, so it isn't the flood bot.
- That 1600x900 total (303,965 page-views, 303,626 of them on Chrome) matches the Windows 10 · Chrome 146 signature's 304,083 spam page-views almost exactly. It's the same bot, now pinned on a fourth independent axis: Windows 10 + Chrome 146 + 1600x900 + datacenter IP, scraping /agency/*.
- A couple of oddball high-density resolutions (1366x900, 2732x1536) run 30–40 page-views per session on tiny session counts: smaller bot variants worth a challenge too.
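The density proxy used for this table is just page-views divided by sessions. A minimal sketch of that arithmetic follows, using the 1600x900 totals quoted above. The function names and the cutoff of 30 page-views per session are assumptions for illustration (the cutoff is taken from the lower edge of the "oddball" variants), not a threshold defined by the report.

```python
def pageviews_per_session(page_views: int, sessions: int) -> float:
    """Bot-flood density proxy: average page-views per session."""
    if sessions <= 0:
        raise ValueError("sessions must be positive")
    return page_views / sessions


def looks_like_flood(page_views: int, sessions: int,
                     min_density: float = 30.0) -> bool:
    """Flag a resolution whose density is at or above the assumed cutoff."""
    return pageviews_per_session(page_views, sessions) >= min_density


# 1600x900 figures from this section: 303,965 page-views over 2,591 sessions.
density = pageviews_per_session(303_965, 2_591)
print(round(density))                      # 117
print(looks_like_flood(303_965, 2_591))    # True
```

Because the check is a ratio, a high-volume resolution with ordinary density (such as the ~1.2 per session seen on 800x600) is not flagged.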
2. Origin
The original brief suspected Chinese traffic. The data doesn't support that (Chinese-language traffic is 1.2% spam; China isn't a top origin). More importantly, the country view is a trap: the US (56%) and India (15%) lead only because they're our biggest real markets and where cloud regions sit. Drop to city and the datacenter mask falls off:
- Portland OR (98% spam) and Los Angeles (89%) are 43% of all spam. These are cloud/hosting locations (Portland OR is the AWS us-west-2 region), not consumer populations: a session count in the low hundreds throwing 250k page-views.
- Council Bluffs, Iowa (75%) is a Google Cloud region; Singapore, Milpitas, Strasbourg, Midrand and Czechia all run 77–98% off tiny real bases. Anything above ~95% here is a safe IP-range / ASN block; the 75–90% band is a challenge target.
- This is why a country block is the wrong tool: it would hit millions of real US and Indian users to catch bots that are actually confined to a handful of hosting cities.
3. Target
The bots don't behave like lost visitors. They skip the homepage (/ is only 6% spam) and go straight at the product — the agency listings and profiles:
- Category listings, such as /agency/email-marketing (94%), /agency/digital-marketing/us (84%) and /agency/software-development (76%), are near-pure spam.
- The tell-tale for scraping: single sessions that deep-crawl one /agency/profile/<name> page for thousands of page-views at 100% spam (e.g. one session, ~7,700 views of a single profile). Someone is systematically harvesting the agency database.
- Actionable because it's a request-time signal too: the /agency/* tree can carry a stricter challenge/rate-limit than the rest of the site, where real users and legitimate AI crawlers spend their time.
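One way a stricter limit on the /agency/ tree could look is a per-client sliding-window counter with a smaller budget for those paths. This is a sketch under stated assumptions, not a description of the site's WAF: the class name, the 60-second window and both budgets (30 and 120 requests) are hypothetical placeholders to be tuned against real traffic.

```python
import time
from collections import defaultdict, deque
from typing import Optional

WINDOW_SECONDS = 60.0   # hypothetical window length
AGENCY_LIMIT = 30       # hypothetical budget for /agency/ paths
DEFAULT_LIMIT = 120     # hypothetical budget for everything else


def is_agency_path(path: str) -> bool:
    """True for anything under the /agency/ tree."""
    return path.startswith("/agency/")


class SlidingWindowLimiter:
    """Keeps recent request timestamps per (client, path class)."""

    def __init__(self) -> None:
        self._hits = defaultdict(deque)

    def allow(self, client_id: str, path: str,
              now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        agency = is_agency_path(path)
        limit = AGENCY_LIMIT if agency else DEFAULT_LIMIT
        hits = self._hits[(client_id, agency)]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= WINDOW_SECONDS:
            hits.popleft()
        if len(hits) >= limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
allowed = sum(limiter.allow("client-1", "/agency/email-marketing", now=0.0)
              for _ in range(100))
print(allowed)  # 30: the rest of the burst is rejected
```

Keeping separate counters per path class means a client that exhausts its /agency/ budget can still load the rest of the site.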
4. Scale and timing
By page-view share, spam runs ~10–17% at baseline with two clear floods on top:
- Late Mar – early Apr: spam jumps to ~47–48% of the week.
- June: a sustained 22–28% band.
The spikes don't track real demand — total page-views rise only because the spam layer is added on top. The same UA/city/URL fingerprint dominates both incidents, so one rule set covers the recurring pattern.
5. What it distorts
Because spam is so few sessions, it barely moves the session engagement rate: strip it out and the weekly rate is essentially unchanged.
But every page-view-weighted metric is hit. Engagement time per page-view, a fair proxy for the per-page quality signals AI search weighs, dips exactly in the spike weeks, because bots add page-views with almost no engagement behind them.
So "spam plummets our engagement metrics" is true for page-view-weighted metrics, not session-rate metrics, and in an incident week up to ~48% of the reported page-views are themselves spam.
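To make the inflation concrete, read the ~48% as spam's share of reported page-views (which is how section 4 states it). If spam is a share s of the reported total, human page-views are (1 − s) of it, and reported counts exceed human counts by s / (1 − s). A minimal sketch of that arithmetic, with hypothetical function names:

```python
def human_page_views(reported: float, spam_share: float) -> float:
    """Page-views left after removing the spam share of the reported total."""
    if not 0.0 <= spam_share < 1.0:
        raise ValueError("spam_share must be in [0, 1)")
    return reported * (1.0 - spam_share)


def inflation_over_human(spam_share: float) -> float:
    """How far reported page-views exceed human page-views, as a ratio."""
    if not 0.0 <= spam_share < 1.0:
        raise ValueError("spam_share must be in [0, 1)")
    return spam_share / (1.0 - spam_share)


# Spam shares quoted in section 4: baseline band and the spring peak.
for share in (0.10, 0.17, 0.48):
    print(f"spam share {share:.0%} -> reported is "
          f"{inflation_over_human(share):.0%} above human traffic")
```

At a 48% spam share the reported count is roughly 92% above the human count, which is why incident weeks can look like demand swings when they are not.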
Recommendations
The signatures overlap (it's largely the same bots), so a small rule set covers most of the load. Each table gives spam-% of segment as the false-positive gauge — block the near-pure segments, challenge the rest.
- Challenge the UA fingerprint. Serve a reCAPTCHA / JS challenge to Windows 10 + Chrome 146 and the stale-Chrome family (116/120/124/150). ~52% of all spam sits in the first string alone, at 88% precision — challenge rather than hard-block so the ~12% real users pass. A 1600×900 screen resolution is a corroborating client-side signal for the same bot (117 page-views/session vs ~1 for real users).
- Block the datacenter origins by IP/ASN. The >95% cities (Portland OR, Milpitas, Strasbourg, Midrand, Czechia) are safe to deny; challenge the 75–90% band (LA, Singapore, Council Bluffs). This captures the US "volume" without a country ban.
- Harden the /agency/* tree. Rate-limit and challenge agency listings and profile pages more strictly than the rest of the site; that's what the bots are scraping.
- Don't country-block, and spare the AI Assistant channel. US/India are our real markets; the AI Assistant channel is ~87% legitimate and strategically important, so exclude it from any rule.
- Report human-only for page-view metrics. Session engagement rate is already robust; page-view counts and engagement-per-page-view should be reported spam-excluded so incidents don't read as demand swings.
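The recommendations above could be combined into one request-time decision, roughly as sketched below. Everything here is illustrative: the function names, the `from_denied_range` and `is_ai_assistant` flags (how those are determined is outside this report), and the order of the checks are assumptions. One known limitation is that a User-Agent string cannot separate Windows 10 from Windows 11, since both report `Windows NT 10.0`, and Chromium-based browsers such as Edge also carry a `Chrome/` token.

```python
import re
from typing import Optional

CHROME_RE = re.compile(r"Chrome/(\d+)\.")
# Chrome majors this report names as the stale / mismatched family.
STALE_CHROME_MAJORS = {116, 120, 124, 150}


def chrome_major(user_agent: str) -> Optional[int]:
    """Return the Chrome major version from a UA string, if present."""
    match = CHROME_RE.search(user_agent)
    return int(match.group(1)) if match else None


def decide(user_agent: str, path: str,
           from_denied_range: bool = False,
           is_ai_assistant: bool = False) -> str:
    """Return 'allow', 'deny', 'challenge' or 'rate_limit' for a request."""
    if is_ai_assistant:           # spare the AI Assistant channel
        return "allow"
    if from_denied_range:         # >95% datacenter origins
        return "deny"
    major = chrome_major(user_agent)
    if major == 146 and "Windows NT 10.0" in user_agent:
        return "challenge"        # main fingerprint: challenge, not block
    if major in STALE_CHROME_MAJORS:
        return "challenge"
    if path.startswith("/agency/"):
        return "rate_limit"       # stricter handling for the scraped tree
    return "allow"


ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36")
print(decide(ua, "/agency/email-marketing"))  # challenge
```

The exemption is checked first so that no later rule can catch the AI Assistant channel by accident.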
Notes & caveats
- Spam label (from the task): page-views in Direct sessions (event-param medium in the direct set) with ≥50 page-views OR ≥600 events. The fingerprints above are measured on that population; the dev team applies them at request time, before a session accumulates page-views.
- Screen resolution isn't in the GA4 → BigQuery export, so §1's resolution table comes from the GA4 Data API (screenResolution). The Data API is pre-aggregated and can't apply our per-session spam threshold, so that table uses page-views-per-session as the bot-density proxy rather than the exact ≥50-page-view rule; its counts may differ slightly from the BigQuery figures.
- Device model is empty because spam is ~98% desktop; language is non-discriminating (spam is en-us, like real traffic).
- Geo is IP-based and masked for datacenter/VPN bots; cloud attributions above are inferred from city + volume, not confirmed ASNs.
- Window: 23 Mar – 28 Jun 2026, full weeks — the export doesn't reach earlier, so pre-March incidents (including older China spikes) are out of view.
- Rankings are out of scope for GA4; correlating bot hits with ranking dips needs Search Console (a natural next step).
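The spam label in the first note can be written as a single predicate. This is a minimal sketch of the rule as stated, not the query actually used: the function name is hypothetical, and the contents of `DIRECT_MEDIUMS` are a placeholder, since the report does not list which medium values make up "the direct set".

```python
# Placeholder values: the report does not enumerate the direct set.
DIRECT_MEDIUMS = {"(none)", "(direct)"}

MIN_PAGE_VIEWS = 50   # threshold from the spam label
MIN_EVENTS = 600      # threshold from the spam label


def is_spam_session(medium: str, page_views: int, events: int) -> bool:
    """Direct session with >=50 page-views OR >=600 events."""
    is_direct = medium in DIRECT_MEDIUMS
    is_heavy = page_views >= MIN_PAGE_VIEWS or events >= MIN_EVENTS
    return is_direct and is_heavy


print(is_spam_session("(none)", page_views=117, events=240))    # True
print(is_spam_session("organic", page_views=117, events=240))   # False
```

The label needs a whole session's totals, which is why it works for measurement after the fact but not as a request-time rule.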
