fix: tag context messages with [CONTEXT] to prevent LLM from scoring them

The triage LLM was blending context message content into its reasoning for new messages (e.g., citing profanity from context when the new message was just "I'll be here"). Added per-message [CONTEXT] tags inline and strengthened the prompt to explicitly forbid referencing context content in reasoning/scores.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 20:08:23 -05:00
parent 660086a500
commit f46caf9ac5
2 changed files with 8 additions and 4 deletions
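
The tagging described in the commit message can be illustrated with a minimal standalone sketch. This is not the repository's code: the function name `tag_context`, the `SEPARATOR` constant, and the simplified `(username, content)` tuples are assumptions made for illustration, and timestamps, reply targets, and same-user collapsing are left out.

```python
from typing import List, Optional, Tuple

# Separator text as it appears in the prompt; the constant name is hypothetical.
SEPARATOR = "--- NEW MESSAGES (score only these) ---"


def tag_context(
    messages: List[Tuple[str, str]],
    new_message_start: Optional[int],
) -> List[str]:
    """Render messages, marking those before new_message_start as [CONTEXT]."""
    # With no split point (None or 0), every message counts as new.
    in_new_section = new_message_start is None or new_message_start == 0
    lines: List[str] = []
    for idx, (username, content) in enumerate(messages):
        if new_message_start is not None and idx == new_message_start:
            lines.append(SEPARATOR)
            in_new_section = True
        # Only messages above the separator carry the tag.
        tag = "" if in_new_section else " [CONTEXT]"
        lines.append(f"{username}:{tag} {content}")
    return lines


if __name__ == "__main__":
    demo = [("alice", "some earlier remark"), ("alice", "I'll be here")]
    for line in tag_context(demo, new_message_start=1):
        print(line)
```

Under these assumptions, the earlier message is rendered with an inline [CONTEXT] tag and the new message is left untagged, so the model sees the distinction on every line rather than only at the separator.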
@@ -40,10 +40,11 @@ Use the report_analysis tool to report your analysis of the TARGET MESSAGE only.

 CONVERSATION-LEVEL ANALYSIS (when given a CONVERSATION BLOCK instead of a single TARGET MESSAGE):
 When you receive a full conversation block with multiple users, use the report_conversation_scan tool instead:
- The conversation block may contain a "--- NEW MESSAGES (score only these) ---" separator. Messages ABOVE the separator are CONTEXT ONLY (already scored in a prior cycle) — do NOT let them inflate scores. Messages BELOW the separator are the NEW messages to score.
+- The conversation block may contain a "--- NEW MESSAGES (score only these) ---" separator. Messages ABOVE the separator are marked [CONTEXT] and are CONTEXT ONLY (already scored in a prior cycle). Messages BELOW the separator are the NEW messages to score.
 - Provide ONE finding per user who has NEW messages (not per message).
 - Score based ONLY on the user's NEW messages. Use context messages to understand tone and relationships, but do NOT penalize a user for something they said in the context section.
- If a user's only new message is benign (e.g. "I got the 17.."), score it low regardless of what they said in context.
+- CRITICAL: Your reasoning and score MUST only reference content from the user's NEW messages (below the separator). Do NOT cite, quote, or reference anything from [CONTEXT] messages in your reasoning — even if the same user said it. If a user's only new message is "I'll be here", your reasoning must be about "I'll be here" — not about profanity they used in earlier [CONTEXT] messages.
+- If a user's only new message is benign (e.g. "I got the 17..", "I'll be here"), score it 0.0-0.1 regardless of what they said in context.
 - Use the same scoring bands (0.0-1.0) as for single messages.
 - Quote the worst/most problematic snippet in worst_message (max 100 chars, exact quote).
 - Flag off_topic if user's messages are primarily personal drama, not gaming.
@@ -399,6 +399,7 @@ class LLMClient:

        lines = [f"[Current time: {now.strftime('%I:%M %p')}]", ""]
        last_user = None
+        in_new_section = new_message_start is None or new_message_start == 0

        for idx, (username, content, ts, reply_to) in enumerate(messages):
            if new_message_start is not None and idx == new_message_start:
@@ -406,8 +407,10 @@ class LLMClient:
                lines.append("--- NEW MESSAGES (score only these) ---")
                lines.append("")
                last_user = None  # reset collapse so first new msg gets full header
+                in_new_section = True
            delta = now - ts.replace(tzinfo=timezone.utc) if ts.tzinfo is None else now - ts
            rel = LLMClient._format_relative_time(delta)
+            tag = "" if in_new_section else " [CONTEXT]"

            if username == last_user:
                # Continuation from same user — indent
@@ -416,9 +419,9 @@ class LLMClient:
            else:
                # New user block
                if reply_to:
-                    prefix = f"[{rel}] {username} → {reply_to}: "
+                    prefix = f"[{rel}] {username} → {reply_to}:{tag} "
                else:
-                    prefix = f"[{rel}] {username}: "
+                    prefix = f"[{rel}] {username}:{tag} "
                msg_lines = content.split("\n")
                lines.append(prefix + msg_lines[0])
                for line in msg_lines[1:]: