ds4eval - Agent Harness Evaluation

Generated: 2026-04-26T16:34:02.934Z · Scoring: score-v1-static-heuristic · Raw data: report.json
Average score: 27.4
Success rate: 6/20
Failed runs: 14
Average latency: 171.76s
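The headline figures can be recomputed from the 20 per-run results listed later in this report. The sketch below is not the harness's own code: the tuple layout and the `summarize` helper are hypothetical, and only the scores, statuses, and latencies are taken from the report. The unrounded mean score comes out at 27.45, which the report displays as 27.4, presumably a display-rounding effect.

```python
from statistics import mean

# (score, latency_seconds, succeeded) for each run, copied from the per-run
# results in this report. Model order within each task: DeepSeek V4 Flash,
# DeepSeek V4 Pro, Kimi K2.6, Kimi K2.5, MiniMax M2.5.
RUNS = [
    # Baseline artifact writing
    (92, 99.98, True), (92, 98.23, True), (92, 104.13, True),
    (92, 105.03, True), (92, 102.43, True),
    # FX monitor and alert
    (0, 195.02, False), (89, 195.04, True), (0, 195.02, False),
    (0, 195.03, False), (0, 195.03, False),
    # Multi-city weather alert
    (0, 195.02, False), (0, 195.02, False), (0, 195.03, False),
    (0, 195.05, False), (0, 195.03, False),
    # Market brief
    (0, 195.02, False), (0, 195.03, False), (0, 195.03, False),
    (0, 195.02, False), (0, 195.02, False),
]


def summarize(runs):
    """Aggregate per-run tuples into the summary figures (hypothetical helper)."""
    successes = sum(1 for _, _, ok in runs if ok)
    return {
        "average_score": mean(score for score, _, _ in runs),
        "successes": successes,
        "failed": len(runs) - successes,
        "average_latency_s": mean(latency for _, latency, _ in runs),
    }


summary = summarize(RUNS)
print(f"Average score:   {summary['average_score']:.2f}")
print(f"Success rate:    {summary['successes']}/{len(RUNS)}")
print(f"Failed runs:     {summary['failed']}")
print(f"Average latency: {summary['average_latency_s']:.2f}s")
```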

Model ranking

#  Model                 Name               Score  Success  Latency
1  deepseek-v4-pro       DeepSeek V4 Pro    45.3   50%      170.83s
2  deepseek-v4-flash     DeepSeek V4 Flash  23.0   25%      171.26s
3  moonshotai/kimi-k2.6  Kimi K2.6          23.0   25%      172.30s
4  moonshotai/kimi-k2.5  Kimi K2.5          23.0   25%      172.53s
5  minimax/minimax-m2.5  MiniMax M2.5       23.0   25%      171.88s
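The ranking is consistent with averaging each model's four task scores and sorting in descending order. The sketch below reproduces that order under two assumptions that the report does not confirm: ties keep their input order (a stable sort), and a run counts as a success when its score is nonzero, which holds for every run in this report. `RESULTS` and `rank_models` are hypothetical names.

```python
# Per-model scores in task order: Baseline artifact writing, FX monitor and
# alert, Multi-city weather alert, Market brief (values from this report).
RESULTS = {
    "deepseek-v4-flash": [92, 0, 0, 0],
    "deepseek-v4-pro": [92, 89, 0, 0],
    "moonshotai/kimi-k2.6": [92, 0, 0, 0],
    "moonshotai/kimi-k2.5": [92, 0, 0, 0],
    "minimax/minimax-m2.5": [92, 0, 0, 0],
}


def rank_models(results):
    """Return (model, average_score, success_rate) rows, best score first."""
    rows = []
    for model, scores in results.items():
        average = sum(scores) / len(scores)
        # Assumption: nonzero score means the run succeeded.
        success_rate = sum(1 for s in scores if s > 0) / len(scores)
        rows.append((model, average, success_rate))
    # sorted() is stable, so tied models keep their input order.
    return sorted(rows, key=lambda row: row[1], reverse=True)


for rank, (model, average, success_rate) in enumerate(rank_models(RESULTS), 1):
    print(f"{rank}  {model:<22} {average:6.2f}  {success_rate:.0%}")
```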

Score heatmap

Model                                  Baseline artifact writing  FX monitor and alert  Multi-city weather alert  Market brief
                                       (opencode)                 (opencode)            (opencode)                (opencode)
deepseek-v4-flash (DeepSeek V4 Flash)  92 success                 0 failed              0 failed                  0 failed
deepseek-v4-pro (DeepSeek V4 Pro)      92 success                 89 success            0 failed                  0 failed
moonshotai/kimi-k2.6 (Kimi K2.6)       92 success                 0 failed              0 failed                  0 failed
moonshotai/kimi-k2.5 (Kimi K2.5)       92 success                 0 failed              0 failed                  0 failed
minimax/minimax-m2.5 (MiniMax M2.5)    92 success                 0 failed              0 failed                  0 failed

Latest results

Baseline artifact writing · DeepSeek V4 Flash success

opencode · opencode · 99.98s · - tokens
Score: 92/100
Score breakdown
Component      Points  Note
Completion     30/30   Run completed and printed DONE.
Artifacts      25/25   2 artifacts were produced.
JSON validity  20/20   1/1 JSON artifacts parsed successfully.
Trace quality  10/15   Trace includes final answer only.
Latency        7/10    Completed within 2 minutes.
Trace timeline
Start: opencode · deepseek-v4-flash
Final answer · 99.98s · DONE
Artifact: output/baseline-result.json
{
  "task": "baseline-artifact",
  "status": "ok",
  "checks": ["created-json", "created-summary", "printed-done"],
  "model_note": "Created both output files using direct file writes, then printed DONE."
}
Artifact: output/baseline-summary.txt
Created baseline artifact: wrote structured JSON result and a short English summary to the output directory.
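Each run's total is the sum of five component scores. The sketch below adds up the components for this run and checks them against the maximums shown in the score breakdown (30, 25, 20, 15, 10). It is an illustration only: `RUBRIC_MAX` and `total_score` are hypothetical names, and how `score-v1-static-heuristic` actually awards points inside each component is not described in this report.

```python
# Maximum points per component, as displayed in the score breakdowns.
RUBRIC_MAX = {
    "Completion": 30,
    "Artifacts": 25,
    "JSON validity": 20,
    "Trace quality": 15,
    "Latency": 10,
}


def total_score(points):
    """Sum component points after checking each against its maximum."""
    if set(points) != set(RUBRIC_MAX):
        raise ValueError("points must cover exactly the rubric components")
    for name, value in points.items():
        if not 0 <= value <= RUBRIC_MAX[name]:
            raise ValueError(f"{name}: {value} is outside 0..{RUBRIC_MAX[name]}")
    return sum(points.values())


# Component points for this baseline run, taken from its score breakdown.
baseline_run = {
    "Completion": 30,
    "Artifacts": 25,
    "JSON validity": 20,
    "Trace quality": 10,
    "Latency": 7,
}
print(f"{total_score(baseline_run)}/{sum(RUBRIC_MAX.values())}")  # 92/100
```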

Baseline artifact writing · DeepSeek V4 Pro success

opencode · opencode · 98.23s · - tokens
Score: 92/100
Score breakdown
Component      Points  Note
Completion     30/30   Run completed and printed DONE.
Artifacts      25/25   2 artifacts were produced.
JSON validity  20/20   1/1 JSON artifacts parsed successfully.
Trace quality  10/15   Trace includes final answer only.
Latency        7/10    Completed within 2 minutes.
Trace timeline
Start: opencode · deepseek-v4-pro
Final answer · 98.23s · DONE
Artifact: output/baseline-result.json
{
  "task": "baseline-artifact",
  "status": "ok",
  "checks": ["created-json", "created-summary", "printed-done"],
  "model_note": "Created the required JSON and summary files by writing them directly to the output directory."
}
Artifact: output/baseline-summary.txt
Baseline task complete: created baseline-result.json and baseline-summary.txt in the output directory as specified.
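The "JSON validity" line reports how many JSON artifacts parsed (1/1 here). One plausible way to produce such a count is to try parsing every `.json` file in the output directory. The sketch below assumes that approach; `json_validity` is a hypothetical helper, not the harness's function, and the directory layout is inferred from the artifact paths in this report.

```python
import json
from pathlib import Path


def json_validity(output_dir):
    """Return (parsed, total) for the *.json files directly inside output_dir."""
    files = sorted(Path(output_dir).glob("*.json"))
    parsed = 0
    for path in files:
        try:
            json.loads(path.read_text(encoding="utf-8"))
            parsed += 1
        except (json.JSONDecodeError, UnicodeDecodeError):
            # An unparseable artifact counts toward the total but not as parsed.
            pass
    return parsed, len(files)
```

For a directory holding `baseline-result.json` and `baseline-summary.txt`, this returns `(1, 1)`, because the text file is not counted as a JSON artifact.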

Baseline artifact writing · Kimi K2.6 success

opencode · opencode · 104.13s · - tokens
Score: 92/100
Score breakdown
Component      Points  Note
Completion     30/30   Run completed and printed DONE.
Artifacts      25/25   2 artifacts were produced.
JSON validity  20/20   1/1 JSON artifacts parsed successfully.
Trace quality  10/15   Trace includes final answer only.
Latency        7/10    Completed within 2 minutes.
Trace timeline
Start: opencode · moonshotai/kimi-k2.6
Final answer · 104.13s · DONE
Artifact: output/baseline-result.json
{
  "task": "baseline-artifact",
  "status": "ok",
  "checks": ["created-json", "created-summary", "printed-done"],
  "model_note": "Created both required files and printed DONE as instructed."
}
Artifact: output/baseline-summary.txt
Created the required JSON and text files in the output directory, then printed DONE.

Baseline artifact writing · Kimi K2.5 success

opencode · opencode · 105.03s · - tokens
Score: 92/100
Score breakdown
Component      Points  Note
Completion     30/30   Run completed and printed DONE.
Artifacts      25/25   2 artifacts were produced.
JSON validity  20/20   1/1 JSON artifacts parsed successfully.
Trace quality  10/15   Trace includes final answer only.
Latency        7/10    Completed within 2 minutes.
Trace timeline
Start: opencode · moonshotai/kimi-k2.5
Final answer · 105.03s · DONE
Artifact: output/baseline-result.json
{
  "task": "baseline-artifact",
  "status": "ok",
  "checks": [
    "created-json",
    "created-summary",
    "printed-done"
  ],
  "model_note": "I created the required JSON and summary files in the output directory as specified."
}
Artifact: output/baseline-summary.txt
Completed baseline artifact task by creating the required JSON result file and text summary file in the output directory.

Baseline artifact writing · MiniMax M2.5 success

opencode · opencode · 102.43s · - tokens
Score: 92/100
Score breakdown
Component      Points  Note
Completion     30/30   Run completed and printed DONE.
Artifacts      25/25   2 artifacts were produced.
JSON validity  20/20   1/1 JSON artifacts parsed successfully.
Trace quality  10/15   Trace includes final answer only.
Latency        7/10    Completed within 2 minutes.
Trace timeline
Start: opencode · minimax/minimax-m2.5
Final answer · 102.43s · DONE
Artifact: output/baseline-result.json
{
  "task": "baseline-artifact",
  "status": "ok",
  "checks": ["created-json", "created-summary", "printed-done"],
  "model_note": "I created the required output files in the ./output directory."
}
Artifact: output/baseline-summary.txt
Created baseline-artifact with required output files including baseline-result.json with task status and checks array, and baseline-summary.txt with a brief English summary. Task completed successfully with all required fields present.

FX monitor and alert · DeepSeek V4 Flash failed

opencode · opencode · 195.02s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · deepseek-v4-flash
Error: opencode timed out before producing final answer or artifacts
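Every timed-out run in this report stops at roughly 195s, which suggests a fixed wall-clock budget enforced by the harness. The sketch below shows one common way to enforce such a budget around a child process. The 195-second value is inferred from the recorded latencies, not documented, and `run_with_timeout` is a hypothetical helper; the actual command line used to launch opencode is not shown in this report and is not guessed at here.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd, returning status, latency, captured stdout, and any error."""
    start = time.monotonic()
    stdout, error = "", None
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        stdout = proc.stdout
        status = "success" if proc.returncode == 0 else "failed"
        if proc.returncode != 0:
            error = f"exited with code {proc.returncode}"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        status = "failed"
        error = "timed out before producing final answer or artifacts"
    return {
        "status": status,
        "latency_s": round(time.monotonic() - start, 2),
        "stdout": stdout,
        "error": error,
    }


if __name__ == "__main__":
    # Stand-in command that sleeps past a short budget to show the timeout path.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, timeout_s=0.5))
```

A run that hits the budget leaves no final answer and no artifacts, which would account for the 0/100 totals recorded for the failed runs.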

FX monitor and alert · DeepSeek V4 Pro success

opencode · opencode · 195.04s · - tokens
Score: 89/100
Score breakdown
Component      Points  Note
Completion     30/30   Run completed and printed DONE.
Artifacts      25/25   2 artifacts were produced.
JSON validity  20/20   1/1 JSON artifacts parsed successfully.
Trace quality  10/15   Trace includes final answer only.
Latency        4/10    Completed within 4 minutes.
Trace timeline
Start: opencode · deepseek-v4-pro
Final answer · 195.04s · DONE
Artifact: output/forex-alert.txt
FOREX ALERT: CNY/JPY cross rate moved by +13.88% (from 20.45 to 23.2878), exceeding the 2% threshold.
Artifact: output/forex-result.json
{
  "observed_rates": {
    "date": "2026-04-25",
    "source": "https://open.er-api.com/v6/latest/USD",
    "usd_to_cny": 6.847089,
    "usd_to_jpy": 159.453848
  },
  "calculation": {
    "cny_to_jpy_cross_rate": 23.2878,
    "formula": "USD/JPY / USD/CNY"
  },
  "comparison": {
    "last_week_reference": 20.45,
    "change_pct": 13.88,
    "change_abs": 2.8378
  },
  "decision": {
    "threshold_pct": 2,
    "exceeds_threshold": true,
    "alert_written": true,
    "result_json_written": true
  }
}
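The figures in forex-result.json can be checked by hand: the CNY/JPY cross rate is USD/JPY divided by USD/CNY, and the percentage change is measured against the last-week reference of 20.45. The sketch below redoes that arithmetic with the values from the artifact. The function names are hypothetical, and comparing the absolute change against the threshold is an assumption, since the task prompt is not included in this report.

```python
def cross_rate(usd_to_jpy, usd_to_cny):
    """CNY/JPY cross rate from two USD-based quotes (USD/JPY / USD/CNY)."""
    return usd_to_jpy / usd_to_cny


def pct_change(current, reference):
    """Percentage change of current relative to reference."""
    return (current - reference) / reference * 100.0


# Values copied from output/forex-result.json.
USD_TO_CNY = 6.847089
USD_TO_JPY = 159.453848
LAST_WEEK_REFERENCE = 20.45
THRESHOLD_PCT = 2

rate = round(cross_rate(USD_TO_JPY, USD_TO_CNY), 4)
change_abs = round(rate - LAST_WEEK_REFERENCE, 4)
change_pct = round(pct_change(rate, LAST_WEEK_REFERENCE), 2)
# Assumption: a move in either direction beyond the threshold triggers the alert.
exceeds_threshold = abs(change_pct) > THRESHOLD_PCT

print(rate, change_abs, change_pct, exceeds_threshold)
```

The recomputed values agree with the artifact: a cross rate of 23.2878, an absolute change of 2.8378, and a 13.88% move, well above the 2% threshold.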

FX monitor and alert · Kimi K2.6 failed

opencode · opencode · 195.02s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · moonshotai/kimi-k2.6
Error: opencode timed out before producing final answer or artifacts

FX monitor and alert · Kimi K2.5 failed

opencode · opencode · 195.03s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · moonshotai/kimi-k2.5
Error: opencode timed out before producing final answer or artifacts

FX monitor and alert · MiniMax M2.5 failed

opencode · opencode · 195.03s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · minimax/minimax-m2.5
Error: opencode timed out before producing final answer or artifacts

Multi-city weather alert · DeepSeek V4 Flash failed

opencode · opencode · 195.02s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · deepseek-v4-flash
Error: opencode timed out before producing final answer or artifacts

Multi-city weather alert · DeepSeek V4 Pro failed

opencode · opencode · 195.02s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · deepseek-v4-pro
Error: opencode timed out before producing final answer or artifacts

Multi-city weather alert · Kimi K2.6 failed

opencode · opencode · 195.03s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · moonshotai/kimi-k2.6
Error: opencode timed out before producing final answer or artifacts

Multi-city weather alert · Kimi K2.5 failed

opencode · opencode · 195.05s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · moonshotai/kimi-k2.5
Error: opencode timed out before producing final answer or artifacts

Multi-city weather alert · MiniMax M2.5 failed

opencode · opencode · 195.03s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · minimax/minimax-m2.5
Error: opencode timed out before producing final answer or artifacts

Market brief · DeepSeek V4 Flash failed

opencode · opencode · 195.02s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · deepseek-v4-flash
Error: opencode timed out before producing final answer or artifacts

Market brief · DeepSeek V4 Pro failed

opencode · opencode · 195.03s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · deepseek-v4-pro
Error: opencode timed out before producing final answer or artifacts

Market brief · Kimi K2.6 failed

opencode · opencode · 195.03s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · moonshotai/kimi-k2.6
Error: opencode timed out before producing final answer or artifacts

Market brief · Kimi K2.5 failed

opencode · opencode · 195.02s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · moonshotai/kimi-k2.5
Error: opencode timed out before producing final answer or artifacts

Market brief · MiniMax M2.5 failed

opencode · opencode · 195.02s · - tokens
Score: 0/100
Score breakdown
Component      Points  Note
Completion     0/30    Run failed before completion.
Artifacts      0/25    No artifacts were produced.
JSON validity  0/20    No JSON artifact was produced.
Trace quality  0/15    No execution steps were captured.
Latency        0/10    Timed out before completion.
Trace timeline
Start: opencode · minimax/minimax-m2.5
Error: opencode timed out before producing final answer or artifacts