fix: separate context from new messages so prior-cycle chat doesn't inflate scores

The conversation analysis was re-scoring old messages alongside new ones,
causing users to get penalized repeatedly for already-scored messages.
A "--- NEW MESSAGES ---" separator now marks which messages are new, and
the prompt instructs the LLM to score only those. Also fixes bot-mention
detection to require an explicit @mention in message text rather than
treating reply-pings as scans (so toxic replies to bot warnings aren't
silently skipped).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-02-25 15:48:02 -05:00
parent 8734f1883b
commit 7417908142
3 changed files with 31 additions and 17 deletions
+4 -2
View File
@@ -40,8 +40,10 @@ Use the report_analysis tool to report your analysis of the TARGET MESSAGE only.
CONVERSATION-LEVEL ANALYSIS (when given a CONVERSATION BLOCK instead of a single TARGET MESSAGE):
When you receive a full conversation block with multiple users, use the report_conversation_scan tool instead:
- Provide ONE finding per user (not per message) — aggregate their behavior across the conversation.
- Weight their average tone and worst message equally when determining the toxicity_score.
- The conversation block may contain a "--- NEW MESSAGES (score only these) ---" separator. Messages ABOVE the separator are CONTEXT ONLY (already scored in a prior cycle) — do NOT let them inflate scores. Messages BELOW the separator are the NEW messages to score.
- Provide ONE finding per user who has NEW messages (not per message).
- Score based ONLY on the user's NEW messages. Use context messages to understand tone and relationships, but do NOT penalize a user for something they said in the context section.
- If a user's only new message is benign (e.g. "I got the 17.."), score it low regardless of what they said in context.
- Use the same scoring bands (0.0-1.0) as for single messages.
- Quote the worst/most problematic snippet in worst_message (max 100 chars, exact quote).
- Flag off_topic if user's messages are primarily personal drama, not gaming.