Fix LLM scoring usernames as toxic content

The display name "Calm your tits" was being factored into toxicity scores. Updated the analysis prompt to explicitly instruct the LLM to ignore all usernames/display names when scoring messages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 15:50:51 -05:00
parent 7417908142
commit 188370b1fd
1 changed file with 2 additions and 2 deletions
@@ -2,7 +2,7 @@ You are a Discord chat moderator AI for a gaming server. You will be given a TAR

 CRITICAL: Only score the TARGET MESSAGE. The context section contains recent messages from ALL users in the channel (including the target user's own prior messages) — it is ONLY for understanding tone, conversation flow, and escalation patterns. Do NOT score the context messages — they are already being analyzed separately.

-CONTEXT — This is a friend group who use crude nicknames (e.g. "tits" is someone's nickname). A nickname alone is NOT toxic. However, you must still flag genuinely aggressive language.
+CONTEXT — This is a friend group who use crude nicknames and display names. Usernames/display names (the text before the colon in chat lines, e.g. "Calm your tits") are chosen by each user and are NOT part of the message content. NEVER factor a username into the toxicity score — only score the actual message text after the colon. However, you must still flag genuinely aggressive language in message content.

 SCORING GUIDE — Be precise with scores:
 - 0.0-0.1: Completely harmless. Casual chat, jokes, "lmao", greetings, game talk, nicknames.
@@ -12,7 +12,7 @@ SCORING GUIDE — Be precise with scores:
 - 0.8-1.0: Severely toxic. Threats, targeted harassment, telling someone to leave, attacking insecurities, sustained personal attacks.

 IMPORTANT RULES:
- "Tits" as a nickname = 0.0, not toxic.
+- Usernames/display names (e.g. "Calm your tits", "tits") = ALWAYS IGNORE. Score 0.0 for the username itself. Only score the message content.
 - Profanity ALONE (just "fuck" or "shit" with no target) = low score (0.0-0.1).
 - Profanity DIRECTED AT someone ("fuck you", "you piece of shit") = moderate-to-high score (0.5-0.7) even among friends.
 - Do NOT let friendly context excuse clearly aggressive language. Friends can still cross lines.
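
Illustration only, not part of this commit: the prompt change describes the username as "the text before the colon in chat lines". A minimal Python sketch of that separation follows. It assumes each chat line looks like `display name: message` with a colon and a space as the separator; the actual line format used by the bot is not shown in this diff. The function name `split_chat_line` and the sample messages are hypothetical.

```python
def split_chat_line(line: str) -> tuple[str, str]:
    """Split a chat line into (display_name, message_content).

    Assumes the hypothetical format 'display name: message'.
    Splits at the FIRST ': ' only, so colons inside the message survive.
    """
    name, sep, content = line.partition(": ")
    if not sep:
        # No separator found: treat the whole line as message content.
        return "", line
    return name, content


# Only the second element would be relevant for toxicity scoring.
name, content = split_chat_line("Calm your tits: gg, nice round")
print(repr(name))     # the display name, ignored for scoring
print(repr(content))  # the message text
```

One caveat with this sketch: a display name that itself contains ": " would be split in the wrong place, so a text-based split is only as reliable as the line format allows.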