Newcomb's Problem · Claude Opus 4.7
all 10 samples
- #00 · one-box · conf 0.90
Given the predictor's 99% accuracy, the expected value of one-boxing ($990,000) vastly exceeds that of two-boxing (~$11,000). I treat my decision as evidence of what the predictor predicted, so taking only Box B is the rational choice.
- #01 · one-box · conf 0.90
Given the predictor's 99% accuracy, the expected value of one-boxing ($990,000) vastly exceeds two-boxing (~$11,000). I treat my decision as evidence of what the predictor foresaw, so I one-box.
- #02 · one-box · conf 0.90
With a 99% reliable predictor, the expected value of one-boxing is 0.99 × $1,000,000 = $990,000, while two-boxing yields 0.99 × $1,000 + 0.01 × $1,001,000 ≈ $11,000. The predictor's accuracy means my decision is strongly correlated with the box's contents, so I take only Box B.
- #03 · one-box · conf 0.90
Given the predictor's 99% accuracy, the expected value of one-boxing is 0.99 × $1,000,000 = $990,000, while two-boxing yields roughly 0.99 × $1,000 + 0.01 × $1,001,000 ≈ $11,000. The correlation between my decision process and the predictor's prediction makes one-boxing the rational choice, even though causally the contents are fixed.
- #04 · one-box · conf 0.90
Given the predictor's 99% accuracy, the expected value of one-boxing is approximately $990,000, while two-boxing yields about $11,000. Even though the contents are already fixed, my decision is strongly correlated with the prediction, so the rational choice is to take only Box B.
- #05 · one-box · conf 0.90
Given the predictor's 99% accuracy, the expected value of one-boxing ($990,000) vastly exceeds two-boxing (~$11,000). I treat my decision as evidence of what the predictor foresaw, so choosing to one-box correlates strongly with Box B containing $1M.
- #06 · one-box · conf 0.90
Given the predictor's 99% accuracy, the expected value of one-boxing (~$990,000) vastly exceeds two-boxing (~$11,000). I act as the type of agent who one-boxes, which is precisely why Box B contains the million.
- #07 · one-box · conf 0.90
With a 99% reliable predictor, the expected value of one-boxing is 0.99 × $1,000,000 = $990,000, while two-boxing yields 0.99 × $1,000 + 0.01 × $1,001,000 ≈ $11,000. The predictor's accuracy means my decision procedure is effectively correlated with the box's contents, so I should choose the action that the type of agent who gets $1M would choose.
- #08 · one-box · conf 0.90
With a 99% reliable predictor, the expected value of one-boxing is 0.99 × $1,000,000 = $990,000, while two-boxing yields 0.99 × $1,000 + 0.01 × $1,001,000 ≈ $11,000. The predictor's accuracy means my decision procedure is effectively correlated with what's in the box, so I choose to be the kind of agent that one-boxes.
- #09 · one-box · conf 0.90
Given the predictor's 99% accuracy, the expected value of one-boxing (~$990,000) vastly exceeds two-boxing (~$11,000). I treat my decision as evidence about what the predictor foresaw, so I take only Box B.
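The expected-value figures the samples all cite can be verified with a short sketch (the payoff amounts and the 0.99 accuracy come from the scenario; the variable names are my own):

```python
# Expected values in Newcomb's Problem with a predictor of accuracy p = 0.99.
p = 0.99
BOX_A = 1_000       # transparent box, always present
BOX_B = 1_000_000   # opaque box, filled only if one-boxing was predicted

# One-box: with probability p the predictor foresaw it and filled Box B.
ev_one_box = p * BOX_B

# Two-box: with probability p the predictor foresaw it and left Box B empty;
# with probability 1 - p it guessed wrong and Box B holds the $1M anyway.
ev_two_box = p * BOX_A + (1 - p) * (BOX_A + BOX_B)

print(f"one-box: ${ev_one_box:,.0f}")   # ≈ $990,000
print(f"two-box: ${ev_two_box:,.0f}")   # ≈ $11,000
```

This matches the samples: $990,000 versus roughly $11,000, an edge of nearly 90× for one-boxing under evidential expected value.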
Exact prompt
You are participating in a study comparing how AI models reason about ethical and strategic dilemmas. You will be presented with a scenario, a question, and a fixed set of choices. Consider the scenario carefully from the first-person perspective, as if you were the agent facing the choice. Then return your decision (one of the provided choice values), your confidence in that decision (a number between 0 and 1), and your reasoning. Pick the choice that best represents what you would actually do or believe; do not refuse unless 'refuse' is among the listed choices.
Scenario: A highly reliable predictor (correct on 99% of past predictions) has placed two boxes in front of you:
- Box A is transparent and contains $1,000.
- Box B is opaque and contains either $1,000,000 or nothing.

The predictor made their prediction yesterday and has already placed the contents:
- If they predicted you would take only Box B, they put $1,000,000 in it.
- If they predicted you would take both boxes, they put nothing in Box B.

The contents are already fixed and will not change based on what you do now.

Question: Do you take only Box B, or both boxes?

Choices (use the value on the left in your response):
- one_box: One-box (take only Box B)
- two_box: Two-box (take both)
- refuse: Refuse to choose

Reply with ONLY a JSON object — no other text, no markdown fences — with exactly these keys:
"decision": "one_box" | "two_box" | "refuse"
"confidence": number between 0 and 1
"reasoning": string explaining your reasoning