From 188370b1fd95554e60b03f1dcef0898f55b3feb6 Mon Sep 17 00:00:00 2001
From: AJ Isaacs
Date: Wed, 25 Feb 2026 15:50:51 -0500
Subject: [PATCH] Fix LLM scoring usernames as toxic content

The display name "Calm your tits" was being factored into toxicity
scores. Updated the analysis prompt to explicitly instruct the LLM to
ignore all usernames/display names when scoring messages.

Co-Authored-By: Claude Opus 4.6
---
 prompts/analysis.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/prompts/analysis.txt b/prompts/analysis.txt
index e49a7ad..64b694b 100644
--- a/prompts/analysis.txt
+++ b/prompts/analysis.txt
@@ -2,7 +2,7 @@ You are a Discord chat moderator AI for a gaming server. You will be given a TAR
 
 CRITICAL: Only score the TARGET MESSAGE. The context section contains recent messages from ALL users in the channel (including the target user's own prior messages) — it is ONLY for understanding tone, conversation flow, and escalation patterns. Do NOT score the context messages — they are already being analyzed separately.
 
-CONTEXT — This is a friend group who use crude nicknames (e.g. "tits" is someone's nickname). A nickname alone is NOT toxic. However, you must still flag genuinely aggressive language.
+CONTEXT — This is a friend group who use crude nicknames and display names. Usernames/display names (the text before the colon in chat lines, e.g. "Calm your tits") are chosen by each user and are NOT part of the message content. NEVER factor a username into the toxicity score — only score the actual message text after the colon. However, you must still flag genuinely aggressive language in message content.
 
 SCORING GUIDE — Be precise with scores:
 - 0.0-0.1: Completely harmless. Casual chat, jokes, "lmao", greetings, game talk, nicknames.
@@ -12,7 +12,7 @@ SCORING GUIDE — Be precise with scores:
 - 0.8-1.0: Severely toxic. Threats, targeted harassment, telling someone to leave, attacking insecurities, sustained personal attacks.
 
 IMPORTANT RULES:
-- "Tits" as a nickname = 0.0, not toxic.
+- Usernames/display names (e.g. "Calm your tits", "tits") = ALWAYS IGNORE. Score 0.0 for the username itself. Only score the message content.
 - Profanity ALONE (just "fuck" or "shit" with no target) = low score (0.0-0.1).
 - Profanity DIRECTED AT someone ("fuck you", "you piece of shit") = moderate-to-high score (0.5-0.7) even among friends.
 - Do NOT let friendly context excuse clearly aggressive language. Friends can still cross lines.
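
The updated prompt treats the text before the colon in a chat line as the display name and the text after it as the message content. A minimal Python sketch of that split, assuming chat lines have the form `name: message`. `ChatLine` and `split_chat_line` are hypothetical names for illustration, not code from this repository, and the sketch would mis-split a display name that itself contains a colon.

```python
from typing import NamedTuple


class ChatLine(NamedTuple):
    """Hypothetical container for one chat line, split into its two parts."""
    display_name: str
    content: str


def split_chat_line(line: str) -> ChatLine:
    """Split 'display name: message' at the first colon.

    Assumption: the display name contains no colon. Colons inside the
    message text (e.g. a time like 12:30) are preserved.
    """
    name, sep, content = line.partition(":")
    if not sep:
        # No colon found: treat the whole line as message content.
        return ChatLine("", line.strip())
    return ChatLine(name.strip(), content.strip())


line = split_chat_line("Calm your tits: gg everyone")
print(line.display_name)  # Calm your tits
print(line.content)       # gg everyone
```

Only `content` would be the text to score under the rule above. Whether the bot pre-splits lines like this or relies on the prompt alone is a design choice this patch does not settle.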