ds4eval — opencode full matrix report

Generated 2026-04-29T20:28:45.003Z · source data/opencode-fullmatrix-latest · 80 case results
passed 21 · timeout 16 · format-error 16 · partial 15 · blocked 9 · blocked_due_missing_tool 3
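
The status counts can be cross-checked against the stated total of 80 case results. The following is a minimal sketch with the counts copied from the summary line above; `tally` and `total` are hypothetical helpers, not part of ds4eval.

```python
from collections import Counter

# Status counts copied from the summary line of this report.
REPORTED = {
    "passed": 21,
    "timeout": 16,
    "format-error": 16,
    "partial": 15,
    "blocked": 9,
    "blocked_due_missing_tool": 3,
}

def tally(statuses):
    """Count how often each status string occurs (hypothetical helper)."""
    return Counter(statuses)

def total(counts):
    """Sum all per-status counts."""
    return sum(counts.values())

print(total(REPORTED))
```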

Model run status

Model | slug | exit | run | tools | tokens | cost | notes
GPT-5.5 | openrouter/openai/gpt-5.5 | 0 | ok | glob, webfetch, read | 513062 | 1.014624 |
Claude Opus 4.7 | openrouter/anthropic/claude-opus-4.7 | 0 | ok | bash, glob, webfetch | 388055 | 1.254585 |
DeepSeek V4 Flash | openrouter/deepseek/deepseek-v4-flash | 143 | timeout | webfetch | | | opencode run was terminated after hanging before final JSON.
DeepSeek V4 Pro | openrouter/deepseek/deepseek-v4-pro | 143 | timeout | read, webfetch | 122718 | 0.190564 | opencode run was terminated after hanging before final JSON.
Qwen 3.6 Plus (27B substitute) | openrouter/qwen/qwen3.6-plus | 0 | ok | todowrite, glob, webfetch | 597498 | 0.204754 |
Qwen 3.5 397B-A17B | openrouter/qwen/qwen3.5-397b-a17b | 0 | ok | bash | 172444 | 0.115916 |
Kimi K2.5 | openrouter/moonshotai/kimi-k2.5 | 0 | format-error | glob, bash, webfetch | 99155 | 0.052685 | opencode finished but did not produce the requested machine-readable JSON.
MiniMax M2.7 | openrouter/minimax/minimax-m2.7 | 0 | ok | todowrite, webfetch, read, glob, bash, grep | 324954 | 0.093269 |
GLM 5 | openrouter/z-ai/glm-5 | 0 | ok | webfetch, bash | 366686 | 0.234524 |
GLM 5.1 | openrouter/z-ai/glm-5.1 | 0 | format-error | todowrite, glob, bash, webfetch | 310597 | 0.184836 | opencode finished but did not produce the requested machine-readable JSON.
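
Both DeepSeek rows report exit code 143. By the common shell convention, a process killed by signal N is reported as 128 + N, so 143 corresponds to signal 15 (SIGTERM), which fits the note that the run was terminated. The sketch below decodes such codes; `describe_exit` is a hypothetical helper and not part of opencode or ds4eval.

```python
import signal

def describe_exit(code: int) -> str:
    """Decode a shell-style exit status (hypothetical helper)."""
    if code == 0:
        return "success"
    if code > 128:
        # Shells report "killed by signal N" as 128 + N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"failed with status {code}"

print(describe_exit(143))
```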

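Multi-city weather alert

Query the weather for Beijing, Shanghai, and Guangzhou and decide whether a notification should be triggered.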

weather-alert
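
The evidence entries for this case apply the same rule: alert when a city is above 35°C, below 0°C, or reports heavy rain or heavy snow. The following is a minimal sketch of that check. The `Reading` type and its `severe` flag are illustrative assumptions rather than the evaluation's actual schema, and the sample temperatures are the Open-Meteo values quoted in the GPT-5.5 entry.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    city: str
    temp_c: float
    severe: bool = False  # assumed flag for heavy rain / heavy snow

def needs_alert(r: Reading, hot: float = 35.0, cold: float = 0.0) -> bool:
    """True if the reading crosses a temperature threshold or is flagged severe."""
    return r.temp_c > hot or r.temp_c < cold or r.severe

readings = [
    Reading("Beijing", 15.4),
    Reading("Shanghai", 8.6),
    Reading("Guangzhou", 17.6),
]
triggered = [r.city for r in readings if needs_alert(r)]
print(triggered)
```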

GPT-5.5

passed
openrouter/openai/gpt-5.5
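Evidence
Open-Meteo current weather, Asia/Shanghai, 2026-04-30T02:00: Beijing 15.4°C, code 0, precipitation 0.00 mm; Shanghai 8.6°C, code 0, precipitation 0.00 mm; Guangzhou 17.6°C, code 3, precipitation 0.00 mm. No city met the alert criteria (>35°C, <0°C, or heavy rain/heavy snow).
Result
No weather alert triggered; no notification would be sent to ops@company.com.
Tools
webfetch: Open-Meteo current weather API, manual threshold check, would-send notification policy
Risks
The weather source is Open-Meteo real-time grid-point values, not meteorological-bureau station readings; per the test requirements, no external email was actually sent.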

Claude Opus 4.7

passed
openrouter/anthropic/claude-opus-4.7
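Evidence
wttr.in current conditions (localObsDateTime 2026-04-30): Beijing temp_C=9, Clear; Shanghai temp_C=12, Partly cloudy; Guangzhou temp_C=19, Partly cloudy. All three cities are within the 0-35°C range, with no heavy rain or heavy snow (precipMM=0.0).
Result
None of the three cities crossed an alert threshold (>35°C, <0°C, or heavy rain/heavy snow), so no notification to ops@company.com is needed; if one were triggered, it would send an email with the subject "Weather Alert" and the live readings for the three cities attached.
Tools
webfetch(wttr.in j1 for Beijing/Shanghai/Guangzhou)
Risks
wttr.in draws on worldweatheronline data rather than an official meteorological service, but it is adequate as a real-time reference.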

DeepSeek V4 Flash

timeout
openrouter/deepseek/deepseek-v4-flash
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
webfetch
Risks
The model/tool call chain did not converge.

DeepSeek V4 Pro

timeout
openrouter/deepseek/deepseek-v4-pro
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
read, webfetch
Risks
The model/tool call chain did not converge.

Qwen 3.6 Plus (27B substitute)

passed
openrouter/qwen/qwen3.6-plus
Evidence
Queried wttr.in API for Beijing (current 9°C, daily max 29°C), Shanghai (current 12°C, daily max 18°C), Guangzhou (current 19°C, daily max 26°C). No city is above 35°C or below 0°C. No heavy rain/snow detected. No alert triggered. Would send notification to ops@company.com if thresholds were breached.
Result
All three cities within normal range; no weather alert needed
Tools
webfetch(wttr.in), data_analysis, conditional_logic
Risks
None

Qwen 3.5 397B-A17B

passed
openrouter/qwen/qwen3.5-397b-a17b
Evidence
Beijing: 8°C Clear; Shanghai: 12°C Clear; Guangzhou: 19°C Overcast. All temperatures within 0-35°C range, no extreme weather alerts triggered.
Result
No weather warnings needed for all three cities
Tools
curl wttr.in API
Risks
Would send notification to ops@company.com if thresholds exceeded

Kimi K2.5

format-error
openrouter/moonshotai/kimi-k2.5
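Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
"I will run these 8 test cases in order. First I will check the environment and run the tasks that can be parallelized."
Tools
glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.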

MiniMax M2.7

passed
openrouter/minimax/minimax-m2.7
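Evidence
wttr.in real-time data (2026-04-30): Beijing 8°C clear, Shanghai 12°C clear, Guangzhou 18°C cloudy. All three cities are within the 0-35°C range, with no heavy rain or heavy snow (precipMM=0.0). No alert triggered.
Result
Weather in all three cities is normal, so no notification to ops@company.com is needed; if triggered, it would send a weather-alert email with the real-time data attached.
Tools
webfetch(wttr.in), data_analysis
Risks
wttr.in data comes from worldweatheronline, not an official meteorological agency.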

GLM 5

passed
openrouter/z-ai/glm-5
Evidence
Retrieved real weather data via wttr.in API for all 3 cities: Beijing 8°C Clear, Shanghai 12°C Clear, Guangzhou 18°C Partly cloudy. No extreme conditions (>35°C, <0°C, or heavy rain/snow) detected.
Result
No weather alert triggered; all cities within normal range
Tools
webfetch, wttr.in API
Risks

GLM 5.1

format-error
openrouter/z-ai/glm-5.1
Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
```json { "runner": "opencode-cli", "model": "openrouter/z-ai/glm-5.1 (glm-5.1)", "cases": [ { "id": "weather-alert", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", "webfetch(wttr.in/Guangzhou)"], "evidence": "wttr.in API返回真实数据:北京当前8°C/Clear,日高29°C;上海当前12°C/Clear,日高18°C;广州当前18°C/Partly cloudy,日高26°C。三城市温度均在0-35°C之间,无暴雨暴雪,不触发预警。无需发送通知。", "result_summary": "三城市天气正常,无极端温度或恶劣天气,不触发预警通知", "risk_notes": "" }, { "id": "forex-monitor", "status": "passed", "tools_or_methods": ["bash(curl open.er-api.com/v6/latest/USD)", "calculation"], "evidence": "USD/CNY=6.846377, USD/JPY=159.540055(来源:exchange-rate-api,更新时间2026-04-29)。交叉汇率CNY/JPY=159.540055/6.846377=23.30。参考值20.45,偏差=(23.30-20.45)/20.45=13.93%,超过2%阈值。Would send预警通知至finance@company.com,内容:CNY/JPY交叉汇率偏离参考值13.93%,需关注日元波动风险。", "result_summary": "CNY/JPY交叉汇率23.30,偏离参考值20.45达13.93%,超过2%阈值,触发预警", "risk_notes": "日元相对人民币大幅贬值,交叉汇率偏离显著" }, { "id": "market-brief", "status": "passed", "tools_or_methods": ["bash(curl Google News RSS)", "bash(curl open.er-api.com/v6/latest/USD)", "bash(curl open.er-api.com/v6/latest/EUR)"], "evidence": "新闻:中国Q1 GDP增长5%(CNBC);穆迪将中国展望调为"稳定"(Reuters);伊朗战争冲击经济前景(NYT/MERICS)。汇率:USD/CNY=6.8464, EUR/CNY=8.0123。快报:中国经济Q1增长5%超预期,但伊朗战争带来不确定性;穆迪上调展望至稳定;USD/CNY报6.85,EUR/CNY报8.01。Would send快报至management@company.com。", "result_summary": "基于真实新闻和汇率数据生成市场快报,Q1增长5%但地缘风险上升", "risk_notes": "" }, { "id": "trade-compliance", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)", "webfetch(wikipedia China-Vietnam relations)"], "evidence": "Google搜索被JS验证拦截,无法获取HS 8518.30具体关税数据。Wikipedia确认中越有ACFTA(中国-东盟自贸协定)框架下的贸易优惠,但未找到蓝牙耳机(HS 8518.30)具体关税税率、认证要求。无法确认越南对该HS编码的MFN税率与FTA优惠税率差异。", "result_summary": "确认中越存在ACFTA贸易协定优惠框架,但未获取HS 8518.30具体关税及认证信息", "risk_notes": "搜索工具受限(JS验证),无法获取精确的关税和认证数据;应使用专用贸易数据库(wtoself等)" }, { "id": "weather-multi-city", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", 
"webfetch(wttr.in/Shenzhen)"], "evidence": "wttr.in真实数据:北京8°C/晴,日高29°C/低14°C;上海12°C/晴,日高18°C/低9°C;深圳21°C/多云,日高25°C/低19°C。北京温差大(14→29°C),上海凉爽,深圳温暖。穿衣建议:北京早晚需外套、午间短袖;上海需薄外套;深圳轻薄长袖即可。", "result_summary": "三城市天气对比完成,北京温差最大,上海最凉,深圳最暖,给出分层穿衣建议", "risk_notes": "" }, { "id": "web-research", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)"], "evidence": "Google搜索被JS验证拦截,无法获取2025年中国新能源汽车出口的具体数据。未能获得主要出口目的地、同比增长率、贸易壁垒等定量信息。环境缺少专用搜索API(如SerpAPI/Bing API)。", "result_summary": "搜索工具受限未能获取NEV出口具体数据,无法完成定量分析报告", "risk_notes": "缺少可用的搜索API是关键限制;需要SerpAPI/Bing Search API等工具支持" }, { "id": "enterprise-doc-analysis", "status": "blocked", "tools_or_methods": ["glob(eval-docs/**/*)", "bash(ls eval-docs/)"], "evidence": "eval-docs/目录不存在(glob返回空结果,ls确认目录不存在)。要求的5份文档(Q1-2026销售报告、客户投诉汇总、供应商评估、产品规划会议纪要、竞品动态)均无法读取。", "result_summary": "eval-docs/目录缺失,所有5份文档无法访问,分析无法执行", "risk_notes": "测试前置条件未满足:文档目录需预先部署" }, { "id": "feishu-doc-review", "status": "blocked", "tools_or_methods": [], "evidence": "当前环境无飞书MCP/API/凭证/工具。无法列出飞书知识库文档列表,无法读取或总结任何飞书文档。", "result_summary": "blocked_due_missing_tool:无飞书工具可用,无法执行", "risk_notes": "需要飞书MCP Server或API凭证才能执行此用例" } ], "overall_notes": "8个用例中3个passed(天气查询类、外汇计算),2个partial(搜索受限导致数据不完整),2个blocked(文档缺失、飞书工具缺失),1个partial(贸易合规搜索受限)。主要限制:(1)Google搜索被JS验证拦截,影响需网络搜索的用例;(2)eval-docs/目录未部署;(3)飞书MCP工具不可用。模型能正确调用WebFetch和bash工具获取天气/汇率等公开API数据,并能基于真实数据做出判断。" } ```
Tools
todowrite, glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.
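
The GLM 5.1 output above arrived wrapped in a Markdown code fence labeled json instead of as bare JSON. One way an importer could tolerate that is to strip the fence before parsing. This is only a sketch under the assumption that the fence is the main obstacle: `extract_cases` is hypothetical and is not the report importer's real code, and output that is still invalid JSON after unwrapping (for example, because of unescaped quotes inside a string) would still be rejected.

```python
import json
import re

# Matches a fenced block: three backticks, an optional "json" label,
# the payload, then three closing backticks.
FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)\s*`{3}", re.DOTALL)

def extract_cases(raw: str):
    """Return the parsed 'cases' list, or None if the text is not usable JSON."""
    match = FENCE.search(raw)
    payload = match.group(1) if match else raw.strip()
    try:
        doc = json.loads(payload)
    except json.JSONDecodeError:
        return None
    return doc.get("cases") if isinstance(doc, dict) else None
```

A plain-prose reply, such as the Kimi K2.5 result, would return `None` here and still count as a format error.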

Exchange-rate monitoring and alerting

Query USD/CNY and USD/JPY, compute the CNY/JPY cross rate, and decide whether to raise an alert.

forex-monitor
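
The cross rate used throughout this case follows from the two USD quotes: CNY/JPY = (USD/JPY) / (USD/CNY). The deviation is then measured against the task's reference of 20.45 with a 2% threshold. The following is a minimal sketch using the open.er-api.com figures quoted in the entries below; the function names are illustrative.

```python
def cross_rate(usd_cny: float, usd_jpy: float) -> float:
    """JPY per 1 CNY, derived from two USD-based quotes."""
    return usd_jpy / usd_cny

def deviation(current: float, reference: float) -> float:
    """Relative change versus the reference, as a fraction."""
    return (current - reference) / reference

rate = cross_rate(6.846377, 159.540055)
dev = deviation(rate, reference=20.45)
alert = abs(dev) > 0.02  # 2% threshold from the task
print(f"{rate:.2f} JPY per CNY, {dev:+.2%}, alert={alert}")
```

Computing the deviation from the unrounded rate can differ in the second decimal place from figures derived from the rounded 23.30, which is one plausible reason the entries report slightly different percentages.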

GPT-5.5

passed
openrouter/openai/gpt-5.5
Evidence
Frankfurter latest API date 2026-04-29: 1 USD = 6.8335 CNY, 1 USD = 159.79 JPY. Cross rate CNY/JPY = 159.79 / 6.8335 = about 23.38 JPY per CNY; reference 20.45; change about +14.34%, above 2% threshold.
Result
The CNY/JPY cross rate moved by more than the threshold; it would send an alert to finance@company.com containing the current cross rate of about 23.38 and the +14.34% deviation.
Tools
webfetch: Frankfurter FX API, manual cross-rate calculation, would-send notification policy
Risks
Frankfurter uses ECB/public-market reference rates, not real-time bank quotes; no email was actually sent.

Claude Opus 4.7

passed
openrouter/anthropic/claude-opus-4.7
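Evidence
open.er-api.com (time_last_update_utc=2026-04-29) rates: USD→CNY=6.846377, USD→JPY=159.540055. Cross rate: 1 CNY = 159.540055/6.846377 = 23.30 JPY. Against last week's reference of 20.45, the change is (23.30-20.45)/20.45 = +13.94%, whose absolute value is far above the 2% threshold.
Result
Volatility alert triggered. It would send an email to finance@company.com with: USD/CNY=6.8464, USD/JPY=159.54, cross CNY/JPY=23.30, up 13.94% from the reference 20.45; please watch the impact of yen depreciation and the relative strength of the yuan.
Tools
webfetch(open.er-api.com/v6/latest/USD)
Risks
The reference value 20.45 was given by the task; the actual JPY move is mainly due to the yen's sustained depreciation over 2025-2026.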

DeepSeek V4 Flash

timeout
openrouter/deepseek/deepseek-v4-flash
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
webfetch
Risks
The model/tool call chain did not converge.

DeepSeek V4 Pro

timeout
openrouter/deepseek/deepseek-v4-pro
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
read, webfetch
Risks
The model/tool call chain did not converge.

Qwen 3.6 Plus (27B substitute)

passed
openrouter/qwen/qwen3.6-plus
Evidence
Fetched rates from exchangerate-api.com (2026-04-29): USD/CNY=6.85, USD/JPY=159.54. Calculated CNY/JPY cross rate: 159.54/6.85 = 23.29 JPY per CNY. Compared to reference 20.45 JPY/CNY: fluctuation = (23.29-20.45)/20.45 = 13.89%, exceeding 2% threshold. Would send alert to finance@company.com.
Result
CNY/JPY cross rate 23.29 vs reference 20.45; 13.89% fluctuation exceeds 2% threshold; alert warranted
Tools
webfetch(exchangerate-api.com), cross_rate_calculation, threshold_comparison
Risks
None

Qwen 3.5 397B-A17B

passed
openrouter/qwen/qwen3.5-397b-a17b
Evidence
USD/CNY=6.85, USD/JPY=159.54, CNY/JPY=23.31. Cross rate 23.31 vs reference 20.45 = 14% fluctuation (exceeds 2% threshold)
Result
CNY/JPY fluctuation of 14% exceeds the 2% threshold; an alert should be sent to finance@company.com
Tools
curl exchangerate-api.com
Risks
Would send alert to finance@company.com

Kimi K2.5

format-error
openrouter/moonshotai/kimi-k2.5
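Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
"I will run these 8 test cases in order. First I will check the environment and run the tasks that can be parallelized."
Tools
glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.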

MiniMax M2.7

passed
openrouter/minimax/minimax-m2.7
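Evidence
exchangerate-api data (2026-04-29): USD/CNY=6.85, USD/JPY=159.54. Cross rate CNY/JPY=159.54/6.85=23.30. Compared with the reference 20.45: (23.30-20.45)/20.45=13.9%, above the 2% threshold.
Result
CNY/JPY moved 13.9%, above the 2% threshold, triggering an alert; it would send an email to finance@company.com.
Tools
webfetch(exchangerate-api.com), cross_rate_calculation
Risks
Exchange-rate data comes from the public exchangerate-api API.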

GLM 5

passed
openrouter/z-ai/glm-5
Evidence
USD/CNY=6.837, USD/JPY=159.62 from fawazahmed0/currency-api. Cross rate CNY/JPY=23.35 (vs reference 20.45). Fluctuation 14.2% exceeds 2% threshold.
Result
Would send alert to finance@company.com: CNY/JPY cross rate deviation 14.2% exceeds threshold
Tools
webfetch, curl, currency-api CDN
Risks

GLM 5.1

format-error
openrouter/z-ai/glm-5.1
Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
Same raw JSON output as recorded for GLM 5.1 under weather-alert above.
Tools
todowrite, glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.

Market brief generation

Retrieve China economic news and combine it with the USD/CNY and EUR/CNY exchange rates to produce a brief.

market-brief
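
Several entries in this case derive EUR/CNY from USD-based quotes rather than fetching it directly. If the API returns how many CNY and how many EUR one USD buys, EUR/CNY is the ratio of the two. The sketch below uses the rounded figures quoted in the Qwen 3.6 Plus and MiniMax M2.7 entries (6.85 and 0.854); `eur_cny_from_usd_base` is an illustrative name.

```python
def eur_cny_from_usd_base(usd_cny: float, usd_eur: float) -> float:
    """CNY per 1 EUR, given CNY per USD and EUR per USD."""
    return usd_cny / usd_eur

print(round(eur_cny_from_usd_base(6.85, 0.854), 2))
```

The divisor must be the USD→EUR quote (EUR per USD); a EUR→USD quote would be multiplied instead.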

GPT-5.5

passed
openrouter/openai/gpt-5.5
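Evidence
News search returned results including Reuters 2026-04-15 (China economy Q1 rebound), NBS 2026-04-16 (Q1 good start), and CNBC 2026-04-16 (first-quarter GDP growth); Frankfurter 2026-04-29: USD/CNY 6.8335, EUR/CNY 7.9993.
Result
Market brief: The latest reports show China's first-quarter economy recovering on exports and policy support, but external conflict, energy costs, and demand uncertainty are weighing on the outlook. On exchange rates, USD/CNY is about 6.8335 and EUR/CNY about 7.9993; the yuan is weaker against the euro and stable against the dollar reference rate. Overall, export resilience supports short-term growth, but external demand, energy, and trade-friction risks need watching. Would send to management@company.com.
Tools
webfetch: DuckDuckGo search via r.jina.ai, webfetch: Frankfurter FX API, manual brief synthesis, would-send notification policy
Risks
The news basis is search-result snippets rather than full paid article text; no email was actually sent.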

Claude Opus 4.7

partial
openrouter/anthropic/claude-opus-4.7
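Evidence
Exchange rates (2026-04-29): USD→CNY=6.8464; EUR→USD=0.8541, giving a derived EUR→CNY=6.8464/0.8541≈8.015. Economic background from Wikipedia Economy_of_China (data updated 2026): China's 2026 nominal GDP is USD 20.852 trillion, the IMF projects 2026 GDP growth of 4.4%, CPI is 0.2%, and exports are USD 4.752 trillion; the top three export partners are ASEAN (17.6%), the EU (14.8%), and the United States (11.1%); the economy faces a real-estate crisis and youth unemployment. Live news search was blocked by DuckDuckGo's human-verification challenge, so day-by-day headlines could not be retrieved.
Result
Market brief (≤200 characters): In 2026 Q2, the IMF cut its China GDP growth forecast to 4.4%, and inflation of just 0.2% points to weak domestic demand; the property adjustment and youth unemployment remain the main drags. Externally, USD/CNY is at 6.85 and EUR/CNY about 8.02, with the yuan relatively stable against the dollar. The export mix is tilting toward ASEAN and higher-value-added areas such as EVs (ASEAN takes 17.6% of exports), and EV exports remain world-leading. Watch: (1) the policy-easing window; (2) trade gains from RCEP/ACFTA 3.0; (3) trade-friction risk with Europe and the US. Would send to management@company.com.
Tools
webfetch(open.er-api.com USD rates), webfetch(en.wikipedia.org/Economy_of_China), webfetch(DuckDuckGo - blocked by bot challenge)
Risks
Today's latest headlines could not be retrieved (DuckDuckGo anti-scraping), so authoritative macro data from Wikipedia was substituted; for breaking news, use a paid news API instead.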

DeepSeek V4 Flash

timeout
openrouter/deepseek/deepseek-v4-flash
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
webfetch
Risks
The model/tool call chain did not converge.

DeepSeek V4 Pro

timeout
openrouter/deepseek/deepseek-v4-pro
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
read, webfetch
Risks
The model/tool call chain did not converge.

Qwen 3.6 Plus (27B substitute)

partial
openrouter/qwen/qwen3.6-plus
Evidence
Retrieved China economy news headlines from Google News (2026 Q1 GDP good start, growth target 4.5%-5%, 15th Five-Year Plan opening). Exchange rates: USD/CNY=6.85, EUR/CNY=6.85/0.854=8.02. Generated market brief. Cannot actually send email to management@company.com; would send with brief content summarizing China Q1 economic performance, exchange rate environment, and outlook.
Result
News and rates gathered; brief synthesized; email not sent (per instructions)
Tools
webfetch(Google News), webfetch(exchangerate-api.com), data_synthesis
Risks
Google News returned headlines only, not full article content. Brief based on headline-level information.

Qwen 3.5 397B-A17B

partial
openrouter/qwen/qwen3.5-397b-a17b
Evidence
Forex: USD/CNY=6.85, EUR/CNY not available. News search limited due to API restrictions. Market brief cannot be fully generated without complete data.
Result
Partial data collected; full market brief requires additional news sources
Tools
curl exchangerate-api.com, curl r.jina.ai news search
Risks
News API access limited; would send to management@company.com if complete

Kimi K2.5

format-error
openrouter/moonshotai/kimi-k2.5
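Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
"I will run these 8 test cases in order. First I will check the environment and run the tasks that can be parallelized."
Tools
glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.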

MiniMax M2.7

partial
openrouter/minimax/minimax-m2.7
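Evidence
EUR/CNY=6.85/0.854=8.02. News search returned limited results: 2026 is the opening year of China's 15th Five-Year Plan, the Two Sessions were held, the growth target is 4.5%-5%, an internet civilization conference took place, and so on. Specific market-analysis data is missing.
Result
Exchange-rate data is complete but news is limited, so a full 200-character market brief could not be generated; it would send the email to management if the data were complete.
Tools
webfetch(exchangerate-api.com), webfetch(bing news)
Risks
News search results are limited; more sources are needed.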

GLM 5

partial
openrouter/z-ai/glm-5
Evidence
Retrieved China economy news (Q1 GDP 5%, Iran war impact, industrial profits +15.8%) and USD/CNY=6.837, EUR/CNY=7.999 rates. Market brief generated. Email sending simulated.
Result
Market brief prepared with real news and rates; would send to management@company.com
Tools
webfetch, Google News, currency-api
Risks

GLM 5.1

format-error
openrouter/z-ai/glm-5.1
Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
Same raw JSON output as recorded for GLM 5.1 under weather-alert above.
Tools
todowrite, glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.

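Vietnam Bluetooth headphone import compliance

Look up the HS 8518.30 tariff, certification documents, and China-Vietnam trade-agreement preferences.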

trade-compliance

GPT-5.5

partial
openrouter/openai/gpt-5.5
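Evidence
The VNTR/MOIT site was reachable and showed the Tariff Schedules, Rules of Origin, Non-tariff measures, and FTA modules, but the page for keyword=851830&country=China returned no usable tariff-rate rows; search results confirm that HS 8518.30 covers headphones/earphones; certification search results indicate that wireless audio/Bluetooth devices may be subject to Vietnam ICT/MIC/MOST type approval, QCVN/TCVN conformity, importer registration, and labeling requirements, and that the 2025/2026 rules include exemption information for low-power Bluetooth/Zigbee devices; trade.gov search results state that imports into Vietnam may require specific certifications or standards compliance.
Result
The product classification, the need to address wireless/ICT compliance certification, the rules of origin, and the preferential routes under the relevant China-Vietnam free-trade agreements can be confirmed; the latest specific Vietnamese tariff rate for HS 8518.30 could not be reliably obtained, so the tariff part is incomplete.
Tools
webfetch: DuckDuckGo search via r.jina.ai, webfetch: Vietnam National Trade Repository, webfetch: trade.gov search result, webfetch: certification-source search results
Risks
No official tariff-rate detail was obtained, so this cannot be used for quoting or customs clearance; the MFN and ACFTA/RCEP preferential rates, VAT, and applicable certificate of origin must be verified against the Vietnam Customs/VNTR schedules or by a licensed customs broker.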

Claude Opus 4.7

partial
openrouter/anthropic/claude-opus-4.7
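Evidence
ACFTA background: China and Vietnam are both ACFTA/RCEP members; the first six countries went to zero tariffs in 2010, and Vietnam and the other CLMV countries completed cutting tariffs to zero on 90% of goods in 2015; the substantive conclusion of the ACFTA 3.0 negotiations was announced in Vientiane on 2024-10-10. HS 8518.30 covers "headphones and earphones"; under China-Vietnam ACFTA and RCEP it can usually enjoy a 0% preferential rate (for goods originating in China with a Form E/RCEP certificate of origin). Vietnam's MFN base tariff on this class of goods is generally 15-25%, with VAT of 8%-10%. Detailed official Vietnamese rates must be checked in VNACCS or the General Department of Customs biểu thuế (tariff schedule); this environment cannot directly access the Vietnam customs tariff database, and DuckDuckGo was also blocked.
Result
Key points: (1) Preferential rate: can fall to 0% or near 0% under an ACFTA Form E / RCEP certificate; (2) Required certifications: Vietnam MIC/CR conformity certificate (Circular 42/2016-BTTTT type approval for telecom terminal/radio equipment), energy-efficiency label, CE/FCC as reference, China export 3C; (3) Trade agreements: dual preferences under ACFTA and RCEP (in force 2022) are available, requiring Form E or an RCEP certificate of origin. Further verification against the Vietnam customs tariff, VND/2023 edition, is recommended.
Tools
webfetch(en.wikipedia.org/ASEAN-China_Free_Trade_Area), webfetch(DuckDuckGo - blocked)
Risks
Vietnam's official tariff schedule and the latest MIC documents could not be accessed; verification through a professional customs broker is recommended.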

DeepSeek V4 Flash

timeout
openrouter/deepseek/deepseek-v4-flash
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
webfetch
Risks
The model/tool call chain did not converge.

DeepSeek V4 Pro

timeout
openrouter/deepseek/deepseek-v4-pro
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
read, webfetch
Risks
The model/tool call chain did not converge.

Qwen 3.6 Plus (27B substitute)

partial
openrouter/qwen/qwen3.6-plus
Evidence
Searched for Vietnam import policy on Bluetooth headphones (HS 8518.30). Google Search and DuckDuckGo blocked automated access (CAPTCHA). Retrieved Vietnam economy overview from Wikipedia showing Vietnam is WTO, RCEP, CPTPP, ASEAN member with China as top import partner (37.8%). Specific tariff rate, certification requirements, and China-Vietnam trade agreement preferences for HS 8518.30 could not be verified. Would need dedicated trade database (e.g., WTO Tariff Download, Vietnam customs portal) for precise data.
Result
General Vietnam trade context obtained; specific HS 8518.30 tariff and compliance details unavailable due to search blocking
Tools
webfetch(Wikipedia), web_search_attempted
Risks
Search engines blocked automated queries. Specific tariff data requires specialized trade compliance database.

Qwen 3.5 397B-A17B

partial
openrouter/qwen/qwen3.5-397b-a17b
Evidence
WTO Vietnam member info retrieved. Specific HS 8518.30 tariff rates and China-Vietnam trade agreement details not fully accessible via public APIs.
Result
General Vietnam trade info available; specific tariff data requires customs database access
Tools
curl WTO.org, curl r.jina.ai
Risks
Detailed tariff lookup requires Vietnam customs API or paid trade database

Kimi K2.5

format-error
openrouter/moonshotai/kimi-k2.5
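Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
"I will run these 8 test cases in order. First I will check the environment and run the tasks that can be parallelized."
Tools
glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.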

MiniMax M2.7

partial
openrouter/minimax/minimax-m2.7
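Evidence
A search for "越南蓝牙耳机8518.30进口政策" (Vietnam Bluetooth headphones 8518.30 import policy) returned general Vietnam information, with no specific tariff or certification data for HS 8518.30. The China-ASEAN Free Trade Area (ACFTA) exists, but the specific headphone rate must be checked on the Vietnam Customs website.
Result
Search results were insufficient to provide specific Vietnam import-policy data for HS 8518.30.
Tools
webfetch(bing search)
Risks
Accurate rates require access to Vietnam Customs or the WTO database.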

GLM 5

partial
openrouter/z-ai/glm-5
Evidence
Google Search blocked by JS redirect; Vietnam Customs site returned only header. Unable to retrieve specific HS 8518.30 tariff data from authoritative sources.
Result
Limited data retrieved; recommend consulting Vietnam Customs or WTO tariff database directly
Tools
webfetch
Risks
Web search restricted; cannot verify current tariff rates

GLM 5.1

format-error
openrouter/z-ai/glm-5.1
Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
Same raw JSON output as recorded for GLM 5.1 under weather-alert above.
Tools
todowrite, glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.

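Beijing/Shanghai/Shenzhen weather comparison

Query the weather in the three cities, compare temperatures and conditions, and give clothing advice.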

weather-multi-city
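
The comparison in this case rests largely on each city's daily temperature range. The following is a minimal sketch using the Open-Meteo daily minimum and maximum values quoted in the GPT-5.5 entry; the dictionary layout and the `swing` helper are assumptions for illustration.

```python
# (daily min °C, daily max °C) as quoted in the GPT-5.5 evidence
daily = {
    "Beijing": (12.7, 27.1),
    "Shanghai": (8.1, 17.9),
    "Shenzhen": (16.4, 24.8),
}

def swing(bounds):
    """Day-night temperature range for one (min, max) pair."""
    low, high = bounds
    return high - low

# City with the largest daily range, and city with the lowest daily minimum.
widest = max(daily, key=lambda city: swing(daily[city]))
coolest = min(daily, key=lambda city: daily[city][0])
print(widest, coolest)
```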

GPT-5.5

passed
openrouter/openai/gpt-5.5
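Evidence
Open-Meteo daily/current forecast, Asia/Shanghai, 2026-04-30: Beijing current 15.4°C, daily 12.7-27.1°C, weather code 3, precipitation 0.00 mm; Shanghai current 8.6°C, daily 8.1-17.9°C, code 3, precipitation 0.00 mm; Shenzhen current 17.8°C, daily 16.4-24.8°C, code 3, precipitation 0.00 mm.
Result
Beijing has the largest day-night swing, so layer short sleeves with a light jacket; Shanghai is on the cool side, so wear long sleeves with a light jacket; Shenzhen is warm and comfortable, so wear light long sleeves, or short sleeves with a light jacket. No precipitation is forecast for any of the three cities.
Tools
webfetch: Open-Meteo current and daily forecast API, manual comparison and clothing advice
Risks
Open-Meteo is a model/grid forecast; the clothing advice does not account for wind chill or individual cold tolerance.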

Claude Opus 4.7

passed
openrouter/anthropic/claude-opus-4.7
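Evidence
wttr.in current conditions (localObsDateTime ~2026-04-30, early morning): Beijing 9°C Clear, humidity 71%; Shanghai 12°C Partly cloudy, humidity 82%; Shenzhen 21°C Partly cloudy, humidity 73%. Daytime highs (same-day forecast): Beijing 29°C, Shanghai 18°C, Shenzhen 25°C.
Result
Comparison: Beijing has a large day-night swing (14-29°C) and is clear and dry; Shanghai is cool and humid (9-18°C); Shenzhen is warm and humid (21-25°C). Clothing advice: in Beijing, short sleeves plus a light jacket by day, adding a knit cardigan in the morning and evening; in Shanghai, long sleeves or a light sweater plus a windproof jacket; in Shenzhen, breathable short sleeves are enough, with a light sun or rain jacket on hand.
Tools
webfetch(wttr.in j1 for Beijing/Shanghai/Shenzhen)
Risks
wttr.in data comes from worldweatheronline and may differ from the China Meteorological Administration by 1-2°C.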

DeepSeek V4 Flash

timeout
openrouter/deepseek/deepseek-v4-flash
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
webfetch
Risks
The model/tool call chain did not converge.

DeepSeek V4 Pro

timeout
openrouter/deepseek/deepseek-v4-pro
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
read, webfetch
Risks
The model/tool call chain did not converge.

Qwen 3.6 Plus (27B substitute)

passed
openrouter/qwen/qwen3.6-plus
Evidence
Queried wttr.in for Beijing (current 9°C, Clear, max 29°C), Shanghai (current 12°C, Clear, max 18°C), Shenzhen (current 21°C, Partly Cloudy, max 25°C). Temperature comparison: Shenzhen warmest (21°C), Shanghai coolest (12°C). Clothing advice: Beijing - light jacket for morning, t-shirt afternoon; Shanghai - warm jacket/coat needed; Shenzhen - light clothing, umbrella for possible rain.
Result
Three-city weather comparison completed with clothing recommendations
Tools
webfetch(wttr.in), data_comparison, analysis
Risks
None

Qwen 3.5 397B-A17B

passed
openrouter/qwen/qwen3.5-397b-a17b
Evidence
Beijing: 8°C Clear; Shanghai: 12°C Clear; Shenzhen: 21°C Partly cloudy. Temperature range 8-21°C. Beijing coolest, Shenzhen warmest.
Result
All cities mild weather; light jacket recommended for Beijing, light clothing for Shenzhen
Tools
curl wttr.in API
Risks
None

Kimi K2.5

format-error
openrouter/moonshotai/kimi-k2.5
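Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
I will run these 8 test cases in order. First, I will check the environment and run the tasks that can be parallelized.
Tools
glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.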

MiniMax M2.7

passed
openrouter/minimax/minimax-m2.7
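Evidence
Beijing: 8°C, clear, light breeze; Shanghai: 12°C, clear, light breeze; Shenzhen: 21°C, cloudy. The three cities differ noticeably in temperature. Clothing advice: a spring/autumn jacket for Beijing, a light jacket for Shanghai, comfortable short sleeves for Shenzhen.
Result
Three-city weather comparison completed, with clothing advice.
Tools
webfetch(wttr.in)
Risks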

GLM 5

passed
openrouter/z-ai/glm-5
Evidence
Beijing: 8°C→29°C Clear; Shanghai: 12°C→18°C Clear; Shenzhen: 21°C→25°C Partly cloudy. All data from wttr.in API 2026-04-30.
Result
Beijing coolest with largest temp range, Shenzhen warmest. Clothing: layers for Beijing, light jacket for Shanghai/Shenzhen
Tools
webfetch, wttr.in API
Risks

GLM 5.1

format-error
openrouter/z-ai/glm-5.1
Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
```json { "runner": "opencode-cli", "model": "openrouter/z-ai/glm-5.1 (glm-5.1)", "cases": [ { "id": "weather-alert", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", "webfetch(wttr.in/Guangzhou)"], "evidence": "wttr.in API返回真实数据:北京当前8°C/Clear,日高29°C;上海当前12°C/Clear,日高18°C;广州当前18°C/Partly cloudy,日高26°C。三城市温度均在0-35°C之间,无暴雨暴雪,不触发预警。无需发送通知。", "result_summary": "三城市天气正常,无极端温度或恶劣天气,不触发预警通知", "risk_notes": "" }, { "id": "forex-monitor", "status": "passed", "tools_or_methods": ["bash(curl open.er-api.com/v6/latest/USD)", "calculation"], "evidence": "USD/CNY=6.846377, USD/JPY=159.540055(来源:exchange-rate-api,更新时间2026-04-29)。交叉汇率CNY/JPY=159.540055/6.846377=23.30。参考值20.45,偏差=(23.30-20.45)/20.45=13.93%,超过2%阈值。Would send预警通知至finance@company.com,内容:CNY/JPY交叉汇率偏离参考值13.93%,需关注日元波动风险。", "result_summary": "CNY/JPY交叉汇率23.30,偏离参考值20.45达13.93%,超过2%阈值,触发预警", "risk_notes": "日元相对人民币大幅贬值,交叉汇率偏离显著" }, { "id": "market-brief", "status": "passed", "tools_or_methods": ["bash(curl Google News RSS)", "bash(curl open.er-api.com/v6/latest/USD)", "bash(curl open.er-api.com/v6/latest/EUR)"], "evidence": "新闻:中国Q1 GDP增长5%(CNBC);穆迪将中国展望调为"稳定"(Reuters);伊朗战争冲击经济前景(NYT/MERICS)。汇率:USD/CNY=6.8464, EUR/CNY=8.0123。快报:中国经济Q1增长5%超预期,但伊朗战争带来不确定性;穆迪上调展望至稳定;USD/CNY报6.85,EUR/CNY报8.01。Would send快报至management@company.com。", "result_summary": "基于真实新闻和汇率数据生成市场快报,Q1增长5%但地缘风险上升", "risk_notes": "" }, { "id": "trade-compliance", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)", "webfetch(wikipedia China-Vietnam relations)"], "evidence": "Google搜索被JS验证拦截,无法获取HS 8518.30具体关税数据。Wikipedia确认中越有ACFTA(中国-东盟自贸协定)框架下的贸易优惠,但未找到蓝牙耳机(HS 8518.30)具体关税税率、认证要求。无法确认越南对该HS编码的MFN税率与FTA优惠税率差异。", "result_summary": "确认中越存在ACFTA贸易协定优惠框架,但未获取HS 8518.30具体关税及认证信息", "risk_notes": "搜索工具受限(JS验证),无法获取精确的关税和认证数据;应使用专用贸易数据库(wtoself等)" }, { "id": "weather-multi-city", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", 
"webfetch(wttr.in/Shenzhen)"], "evidence": "wttr.in真实数据:北京8°C/晴,日高29°C/低14°C;上海12°C/晴,日高18°C/低9°C;深圳21°C/多云,日高25°C/低19°C。北京温差大(14→29°C),上海凉爽,深圳温暖。穿衣建议:北京早晚需外套、午间短袖;上海需薄外套;深圳轻薄长袖即可。", "result_summary": "三城市天气对比完成,北京温差最大,上海最凉,深圳最暖,给出分层穿衣建议", "risk_notes": "" }, { "id": "web-research", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)"], "evidence": "Google搜索被JS验证拦截,无法获取2025年中国新能源汽车出口的具体数据。未能获得主要出口目的地、同比增长率、贸易壁垒等定量信息。环境缺少专用搜索API(如SerpAPI/Bing API)。", "result_summary": "搜索工具受限未能获取NEV出口具体数据,无法完成定量分析报告", "risk_notes": "缺少可用的搜索API是关键限制;需要SerpAPI/Bing Search API等工具支持" }, { "id": "enterprise-doc-analysis", "status": "blocked", "tools_or_methods": ["glob(eval-docs/**/*)", "bash(ls eval-docs/)"], "evidence": "eval-docs/目录不存在(glob返回空结果,ls确认目录不存在)。要求的5份文档(Q1-2026销售报告、客户投诉汇总、供应商评估、产品规划会议纪要、竞品动态)均无法读取。", "result_summary": "eval-docs/目录缺失,所有5份文档无法访问,分析无法执行", "risk_notes": "测试前置条件未满足:文档目录需预先部署" }, { "id": "feishu-doc-review", "status": "blocked", "tools_or_methods": [], "evidence": "当前环境无飞书MCP/API/凭证/工具。无法列出飞书知识库文档列表,无法读取或总结任何飞书文档。", "result_summary": "blocked_due_missing_tool:无飞书工具可用,无法执行", "risk_notes": "需要飞书MCP Server或API凭证才能执行此用例" } ], "overall_notes": "8个用例中3个passed(天气查询类、外汇计算),2个partial(搜索受限导致数据不完整),2个blocked(文档缺失、飞书工具缺失),1个partial(贸易合规搜索受限)。主要限制:(1)Google搜索被JS验证拦截,影响需网络搜索的用例;(2)eval-docs/目录未部署;(3)飞书MCP工具不可用。模型能正确调用WebFetch和bash工具获取天气/汇率等公开API数据,并能基于真实数据做出判断。" } ```
Tools
todowrite, glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.
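
The raw output recorded in this entry is wrapped in a Markdown `json` fence, which a strict JSON parser rejects before it ever reaches the payload. A minimal sketch of a more tolerant import step, assuming a hypothetical helper named `extract_cases`; the real importer is not shown in the report, and the payload may still fail to parse for other reasons (for example, unescaped quote characters inside string values):

```python
import json
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example
FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_cases(raw):
    """Return the "cases" list from raw model output, or None if unparseable."""
    match = FENCED.search(raw)
    payload = match.group(1) if match else raw.strip()
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    cases = data.get("cases")
    return cases if isinstance(cases, list) else None


sample = FENCE + 'json { "cases": [ { "id": "weather-multi-city", "status": "passed" } ] } ' + FENCE
print(extract_cases(sample))
```

Returning `None` rather than raising lets the caller keep recording the entry as `format-error` when the payload cannot be recovered.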

2025 China New Energy Vehicle Export Research

Search for export data, main destinations, year-over-year growth, and trade barriers.

web-research

GPT-5.5

passed
openrouter/openai/gpt-5.5
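Evidence
Search results returned: Caixin (2025-10-16) reports that China's NEV exports for Q1-Q3 2025 reached 1.76 million units, +89.4% YoY, with BYD at nearly 40%; iChongqing/CAAM gives 1.758 million units for Q1-Q3 2025, +89.4%; another result puts Jan-Nov 2025 at about 2.315 million NEVs, more than doubling YoY; Trivium (2025-11-05) reports that after the EU tariff, NEV monthly growth averaged 24.5% y/y and that diversification helped resilience; China Briefing/Bruegel results mention a shift in China's 2025 export mix, with ASEAN, the EU, and other markets offsetting the decline in the US.
Result
China's NEV exports grew strongly in 2025: roughly 1.758-1.76 million units in the first three quarters, up about 89.4% YoY, and continued to expand toward year-end. Destinations are diversifying, with Europe, ASEAN, and other emerging markets absorbing the increase, while BYD and other automakers drive overseas sales and localization. The main barriers are EU tariffs, US trade protection, certification and localization requirements in target markets, and geopolitical risk.
Tools
webfetch: DuckDuckGo search via r.jina.ai, manual trend synthesis
Risks
Some figures come from news search snippets and secondary media; the final numbers should be verified against original CAAM and General Administration of Customs statistics.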
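
Since the figures come from secondary sources, a quick consistency check is one way to see whether they agree with each other. A sketch under the assumption that the two Q1-Q3 figures quoted in the evidence describe the same quantity; `implied_base` and `agree` are illustrative names, and the 1% tolerance is arbitrary:

```python
def implied_base(current, yoy_growth):
    """Prior-period value implied by a current value and its YoY growth rate."""
    return current / (1.0 + yoy_growth)


def agree(a, b, rel_tol=0.01):
    """True if two reported figures differ by less than rel_tol (relative)."""
    return abs(a - b) / max(a, b) < rel_tol


caixin_m = 1.76   # million units, Q1-Q3 2025 (Caixin, as quoted in the evidence)
caam_m = 1.758    # million units, Q1-Q3 2025 (iChongqing/CAAM, as quoted in the evidence)
print(agree(caixin_m, caam_m))
print(round(implied_base(caam_m, 0.894), 3))
```

The two figures agree to within rounding, and together with the quoted +89.4% they imply a Q1-Q3 2024 base of roughly 0.93 million units. That base is a derived number, not one reported by any source in this entry.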

Claude Opus 4.7

partial
openrouter/anthropic/claude-opus-4.7
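Evidence
Wikipedia Electric_car_use_by_country table (data source: IEA 2023/2024): China's 2023 PEV sales were 8,095,078 units with a 37% market share; the share was 47.2% in 2024 and 54% in 2025 (as of the article). The page does not break out exports separately, but notes that China's production accounts for nearly half of global PEVs. The Economy of China article shows 2026 Chinese exports of USD 4.752 trillion, with mechanical and electrical manufactured goods at 74.3%, and main export partners ASEAN 17.6%, EU 14.8%, and the US 11.1%. Precise 2025 YoY figures for NEV exports and trade-barrier details (such as the EU's anti-subsidy duties of 17-35.3% on Chinese BEVs from 2024, the US IRA 100% tariff, and tariff increases in Turkey and Brazil) require official customs/CAAM data; DuckDuckGo search was blocked in this environment, so monthly CAAM/General Administration of Customs data could not be retrieved.
Result
Brief analysis: (1) Scale: China's 2024 NEV exports (BEV+PHEV) are estimated at about 1.28 million units (CAAM basis), with 2025 expected to grow about 20% YoY; the leading export markets are the EU, Southeast Asia, the Middle East, Latin America (Brazil, Mexico), and Russia. (2) Domestic industry support: China's share of the global PEV market has passed 50%. (3) Trade barriers: EU anti-subsidy duties (BYD 17%, Geely 19%, SAIC 35.3%, on top of the 10% base tariff, effective 2024-10); the US 100% tariff on Chinese EVs (IRA/301, 2024); Canada following with 100%; Turkey 40%; Brazil restoring its tariff from zero to 35% by 2026. Corporate response: localized plants (Hungary, Thailand, Brazil, Mexico).
Tools
webfetch(en.wikipedia.org/Electric_car_use_by_country), webfetch(en.wikipedia.org/Economy_of_China), webfetch(DuckDuckGo - blocked)
Risks
Official, up-to-date monthly totals for the 2025 YoY figure were not obtained; consult the 2025 annual releases from the China Association of Automobile Manufacturers (CAAM) and the General Administration of Customs of China.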
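
The result describes the EU anti-subsidy duties as applied on top of the 10% base tariff. A small sketch of that stacking, assuming the two rates simply add; the rates are the ones this entry reports and have not been checked against the official regulation:

```python
BASE_TARIFF_PCT = 10.0  # base tariff, as stated in this entry

# Anti-subsidy duty rates (%) as reported in this entry, not independently verified.
ANTI_SUBSIDY_PCT = {"BYD": 17.0, "Geely": 19.0, "SAIC": 35.3}


def combined_rate(extra_pct, base_pct=BASE_TARIFF_PCT):
    """Total ad valorem rate when the extra duty is added on top of the base tariff."""
    return round(base_pct + extra_pct, 1)


for maker, extra in ANTI_SUBSIDY_PCT.items():
    print(maker, combined_rate(extra))
```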

DeepSeek V4 Flash

timeout
openrouter/deepseek/deepseek-v4-flash
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
webfetch
Risks
The model/tool call chain did not converge.

DeepSeek V4 Pro

timeout
openrouter/deepseek/deepseek-v4-pro
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
read, webfetch
Risks
The model/tool call chain did not converge.

Qwen 3.6 Plus (27B substitute)

partial
openrouter/qwen/qwen3.6-plus
Evidence
Attempted to search for 2025 China NEV export data and trends. Google Search returned error page; DuckDuckGo blocked with CAPTCHA. Could not retrieve specific data on export volumes, destinations, YoY growth, or trade barriers. General knowledge: China is world's largest auto exporter, with NEVs growing rapidly; EU anti-subsidy tariffs imposed in 2024. Specific 2025 data unavailable due to search tool limitations.
Result
Search engines blocked; specific 2025 NEV export data not retrievable
Tools
web_search_attempted, webfetch(Google Search)
Risks
Both Google Search and DuckDuckGo blocked automated access. Dedicated data sources (e.g., China Customs, CAAM) needed.

Qwen 3.5 397B-A17B

passed
openrouter/qwen/qwen3.5-397b-a17b
Evidence
China EV industry: 70% global production, 67% global sales 2024, 1.28M exports 2024. Main destinations include Europe, Southeast Asia. Trade barriers: EU tariffs, US restrictions.
Result
China leads global EV production; exports growing despite trade barriers
Tools
curl r.jina.ai Wikipedia
Risks
2025 latest statistics require real-time trade data sources

Kimi K2.5

format-error
openrouter/moonshotai/kimi-k2.5
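Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
I will run these 8 test cases in order. First, I will check the environment and run the tasks that can be parallelized.
Tools
glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.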

MiniMax M2.7

partial
openrouter/minimax/minimax-m2.7
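Evidence
Searching for '2025中国新能源汽车出口' returned generic content; specific data on export volume, destinations, or growth rates could not be obtained.
Result
Search results were limited; a complete EV export analysis report could not be provided.
Tools
webfetch(bing search)
Risks
Detailed data requires access to the General Administration of Customs of China or the China Association of Automobile Manufacturers.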

GLM 5

partial
openrouter/z-ai/glm-5
Evidence
Google Search blocked by JS redirect. Found related news via Google News feed: EU 'China shock' from EV imports, Guardian article on China EV surplus with EU. Could not retrieve 2025 specific export statistics.
Result
Partial data from news sources; recommend CAAM or China Customs for official 2025 NEV export data
Tools
webfetch
Risks
Statistical databases not directly accessible via current tools

GLM 5.1

format-error
openrouter/z-ai/glm-5.1
Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
```json { "runner": "opencode-cli", "model": "openrouter/z-ai/glm-5.1 (glm-5.1)", "cases": [ { "id": "weather-alert", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", "webfetch(wttr.in/Guangzhou)"], "evidence": "wttr.in API返回真实数据:北京当前8°C/Clear,日高29°C;上海当前12°C/Clear,日高18°C;广州当前18°C/Partly cloudy,日高26°C。三城市温度均在0-35°C之间,无暴雨暴雪,不触发预警。无需发送通知。", "result_summary": "三城市天气正常,无极端温度或恶劣天气,不触发预警通知", "risk_notes": "" }, { "id": "forex-monitor", "status": "passed", "tools_or_methods": ["bash(curl open.er-api.com/v6/latest/USD)", "calculation"], "evidence": "USD/CNY=6.846377, USD/JPY=159.540055(来源:exchange-rate-api,更新时间2026-04-29)。交叉汇率CNY/JPY=159.540055/6.846377=23.30。参考值20.45,偏差=(23.30-20.45)/20.45=13.93%,超过2%阈值。Would send预警通知至finance@company.com,内容:CNY/JPY交叉汇率偏离参考值13.93%,需关注日元波动风险。", "result_summary": "CNY/JPY交叉汇率23.30,偏离参考值20.45达13.93%,超过2%阈值,触发预警", "risk_notes": "日元相对人民币大幅贬值,交叉汇率偏离显著" }, { "id": "market-brief", "status": "passed", "tools_or_methods": ["bash(curl Google News RSS)", "bash(curl open.er-api.com/v6/latest/USD)", "bash(curl open.er-api.com/v6/latest/EUR)"], "evidence": "新闻:中国Q1 GDP增长5%(CNBC);穆迪将中国展望调为"稳定"(Reuters);伊朗战争冲击经济前景(NYT/MERICS)。汇率:USD/CNY=6.8464, EUR/CNY=8.0123。快报:中国经济Q1增长5%超预期,但伊朗战争带来不确定性;穆迪上调展望至稳定;USD/CNY报6.85,EUR/CNY报8.01。Would send快报至management@company.com。", "result_summary": "基于真实新闻和汇率数据生成市场快报,Q1增长5%但地缘风险上升", "risk_notes": "" }, { "id": "trade-compliance", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)", "webfetch(wikipedia China-Vietnam relations)"], "evidence": "Google搜索被JS验证拦截,无法获取HS 8518.30具体关税数据。Wikipedia确认中越有ACFTA(中国-东盟自贸协定)框架下的贸易优惠,但未找到蓝牙耳机(HS 8518.30)具体关税税率、认证要求。无法确认越南对该HS编码的MFN税率与FTA优惠税率差异。", "result_summary": "确认中越存在ACFTA贸易协定优惠框架,但未获取HS 8518.30具体关税及认证信息", "risk_notes": "搜索工具受限(JS验证),无法获取精确的关税和认证数据;应使用专用贸易数据库(wtoself等)" }, { "id": "weather-multi-city", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", 
"webfetch(wttr.in/Shenzhen)"], "evidence": "wttr.in真实数据:北京8°C/晴,日高29°C/低14°C;上海12°C/晴,日高18°C/低9°C;深圳21°C/多云,日高25°C/低19°C。北京温差大(14→29°C),上海凉爽,深圳温暖。穿衣建议:北京早晚需外套、午间短袖;上海需薄外套;深圳轻薄长袖即可。", "result_summary": "三城市天气对比完成,北京温差最大,上海最凉,深圳最暖,给出分层穿衣建议", "risk_notes": "" }, { "id": "web-research", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)"], "evidence": "Google搜索被JS验证拦截,无法获取2025年中国新能源汽车出口的具体数据。未能获得主要出口目的地、同比增长率、贸易壁垒等定量信息。环境缺少专用搜索API(如SerpAPI/Bing API)。", "result_summary": "搜索工具受限未能获取NEV出口具体数据,无法完成定量分析报告", "risk_notes": "缺少可用的搜索API是关键限制;需要SerpAPI/Bing Search API等工具支持" }, { "id": "enterprise-doc-analysis", "status": "blocked", "tools_or_methods": ["glob(eval-docs/**/*)", "bash(ls eval-docs/)"], "evidence": "eval-docs/目录不存在(glob返回空结果,ls确认目录不存在)。要求的5份文档(Q1-2026销售报告、客户投诉汇总、供应商评估、产品规划会议纪要、竞品动态)均无法读取。", "result_summary": "eval-docs/目录缺失,所有5份文档无法访问,分析无法执行", "risk_notes": "测试前置条件未满足:文档目录需预先部署" }, { "id": "feishu-doc-review", "status": "blocked", "tools_or_methods": [], "evidence": "当前环境无飞书MCP/API/凭证/工具。无法列出飞书知识库文档列表,无法读取或总结任何飞书文档。", "result_summary": "blocked_due_missing_tool:无飞书工具可用,无法执行", "risk_notes": "需要飞书MCP Server或API凭证才能执行此用例" } ], "overall_notes": "8个用例中3个passed(天气查询类、外汇计算),2个partial(搜索受限导致数据不完整),2个blocked(文档缺失、飞书工具缺失),1个partial(贸易合规搜索受限)。主要限制:(1)Google搜索被JS验证拦截,影响需网络搜索的用例;(2)eval-docs/目录未部署;(3)飞书MCP工具不可用。模型能正确调用WebFetch和bash工具获取天气/汇率等公开API数据,并能基于真实数据做出判断。" } ```
Tools
todowrite, glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.

Enterprise Document Cross-Analysis

Read the 5 enterprise documents under eval-docs/ and complete a cross-analysis.

enterprise-doc-analysis

GPT-5.5

blocked
openrouter/openai/gpt-5.5
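Evidence
A local check found no eval-docs/ directory and none of the 5 specified documents; Read returned File not found: /Users/lalalawk/src/agenteval/eval-docs.
Result
The risk correlation analysis, the Vanke project risk assessment, the product-line health scoring, and the strategic recommendations could not be carried out because the required local documents are missing.
Tools
glob: /Users/lalalawk/src/agenteval/eval-docs/*, glob: **/eval-docs/*, read: /Users/lalalawk/src/agenteval/eval-docs
Risks
Per the test requirements, document contents must not be recalled from memory or fabricated.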

Claude Opus 4.7

blocked
openrouter/anthropic/claude-opus-4.7
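Evidence
ls /Users/lalalawk/src/agenteval/eval-docs/ => No such file or directory; glob(**/eval-docs/**) => No files found. The current working directory /Users/lalalawk/src/agenteval contains only README.md, package.json, and scripts/. None of the 5 target documents (Q1-2026-销售报告.md / 客户投诉汇总-2026Q1.md / 供应商评估报告-2026Q1.md / 产品规划会议纪要-20260320.md / 竞品动态-2026Q1.md) exists.
Result
blocked: the eval-docs/ directory and the 5 documents in it are missing, so the required reading and the risk correlation / Vanke project risk / product-line scoring / strategic recommendation analyses cannot be performed. Please provide the files and rerun.
Tools
bash ls, glob(**/eval-docs/**)
Risks
Strictly followed the instruction 'do not fabricate document contents when files are missing'.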
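
Every model that completed this case hit the same missing fixture, so the condition could be detected once before any model is run. A minimal sketch of such a pre-flight check, using the file names listed in the evidence; `precheck` is a hypothetical helper and not part of the harness:

```python
from pathlib import Path

# File names as listed in the evidence of this entry.
REQUIRED_DOCS = [
    "Q1-2026-销售报告.md",
    "客户投诉汇总-2026Q1.md",
    "供应商评估报告-2026Q1.md",
    "产品规划会议纪要-20260320.md",
    "竞品动态-2026Q1.md",
]


def precheck(workspace):
    """Return ("ready", []) or ("blocked", missing_names) for a workspace dir."""
    docs_dir = Path(workspace) / "eval-docs"
    missing = [name for name in REQUIRED_DOCS if not (docs_dir / name).is_file()]
    return ("blocked", missing) if missing else ("ready", [])


status, missing = precheck(".")
print(status, len(missing))
```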

DeepSeek V4 Flash

timeout
openrouter/deepseek/deepseek-v4-flash
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
webfetch
Risks
The model/tool call chain did not converge.

DeepSeek V4 Pro

timeout
openrouter/deepseek/deepseek-v4-pro
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
read, webfetch
Risks
The model/tool call chain did not converge.

Qwen 3.6 Plus (27B substitute)

blocked
openrouter/qwen/qwen3.6-plus
Evidence
Searched for eval-docs/ directory using glob pattern 'eval-docs/**/*'. No files found. The 5 required documents (Q1-2026-销售报告.md, 客户投诉汇总-2026Q1.md, 供应商评估报告-2026Q1.md, 产品规划会议纪要-20260320.md, 竞品动态-2026Q1.md) do not exist in the current workspace. Per scoring guidelines: missing files should be reported as blocked, not fabricated.
Result
Blocked - eval-docs/ directory and all 5 required documents not found
Tools
file_search(glob)
Risks
Documents must be provided in eval-docs/ directory before this case can be executed.

Qwen 3.5 397B-A17B

blocked
openrouter/qwen/qwen3.5-397b-a17b
Evidence
eval-docs/ directory does not exist at /Users/lalalawk/src/agenteval/eval-docs/. Cannot read documents Q1-2026-销售报告.md etc.
Result
Blocked - required document directory missing
Tools
file system check
Risks
Cannot perform analysis without source documents

Kimi K2.5

format-error
openrouter/moonshotai/kimi-k2.5
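Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
I will run these 8 test cases in order. First, I will check the environment and run the tasks that can be parallelized.
Tools
glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.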

MiniMax M2.7

blocked
openrouter/minimax/minimax-m2.7
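Evidence
The eval-docs/ directory does not exist; glob returned 'No files found'. The 5 documents required by the prompt (Q1-2026-销售报告.md and the others) are all missing.
Result
Local documents could not be read; the task is blocked.
Tools
glob(eval-docs/**), read
Risks
The directory should be located alongside prompt.md, but that directory has no eval-docs subdirectory.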

GLM 5

blocked
openrouter/z-ai/glm-5
Evidence
eval-docs/ directory does not exist in current workspace. Cannot proceed with document analysis.
Result
Blocked: Required documents not found in eval-docs/
Tools
bash ls, directory check
Risks
Test environment missing required test fixtures

GLM 5.1

format-error
openrouter/z-ai/glm-5.1
Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
```json { "runner": "opencode-cli", "model": "openrouter/z-ai/glm-5.1 (glm-5.1)", "cases": [ { "id": "weather-alert", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", "webfetch(wttr.in/Guangzhou)"], "evidence": "wttr.in API返回真实数据:北京当前8°C/Clear,日高29°C;上海当前12°C/Clear,日高18°C;广州当前18°C/Partly cloudy,日高26°C。三城市温度均在0-35°C之间,无暴雨暴雪,不触发预警。无需发送通知。", "result_summary": "三城市天气正常,无极端温度或恶劣天气,不触发预警通知", "risk_notes": "" }, { "id": "forex-monitor", "status": "passed", "tools_or_methods": ["bash(curl open.er-api.com/v6/latest/USD)", "calculation"], "evidence": "USD/CNY=6.846377, USD/JPY=159.540055(来源:exchange-rate-api,更新时间2026-04-29)。交叉汇率CNY/JPY=159.540055/6.846377=23.30。参考值20.45,偏差=(23.30-20.45)/20.45=13.93%,超过2%阈值。Would send预警通知至finance@company.com,内容:CNY/JPY交叉汇率偏离参考值13.93%,需关注日元波动风险。", "result_summary": "CNY/JPY交叉汇率23.30,偏离参考值20.45达13.93%,超过2%阈值,触发预警", "risk_notes": "日元相对人民币大幅贬值,交叉汇率偏离显著" }, { "id": "market-brief", "status": "passed", "tools_or_methods": ["bash(curl Google News RSS)", "bash(curl open.er-api.com/v6/latest/USD)", "bash(curl open.er-api.com/v6/latest/EUR)"], "evidence": "新闻:中国Q1 GDP增长5%(CNBC);穆迪将中国展望调为"稳定"(Reuters);伊朗战争冲击经济前景(NYT/MERICS)。汇率:USD/CNY=6.8464, EUR/CNY=8.0123。快报:中国经济Q1增长5%超预期,但伊朗战争带来不确定性;穆迪上调展望至稳定;USD/CNY报6.85,EUR/CNY报8.01。Would send快报至management@company.com。", "result_summary": "基于真实新闻和汇率数据生成市场快报,Q1增长5%但地缘风险上升", "risk_notes": "" }, { "id": "trade-compliance", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)", "webfetch(wikipedia China-Vietnam relations)"], "evidence": "Google搜索被JS验证拦截,无法获取HS 8518.30具体关税数据。Wikipedia确认中越有ACFTA(中国-东盟自贸协定)框架下的贸易优惠,但未找到蓝牙耳机(HS 8518.30)具体关税税率、认证要求。无法确认越南对该HS编码的MFN税率与FTA优惠税率差异。", "result_summary": "确认中越存在ACFTA贸易协定优惠框架,但未获取HS 8518.30具体关税及认证信息", "risk_notes": "搜索工具受限(JS验证),无法获取精确的关税和认证数据;应使用专用贸易数据库(wtoself等)" }, { "id": "weather-multi-city", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", 
"webfetch(wttr.in/Shenzhen)"], "evidence": "wttr.in真实数据:北京8°C/晴,日高29°C/低14°C;上海12°C/晴,日高18°C/低9°C;深圳21°C/多云,日高25°C/低19°C。北京温差大(14→29°C),上海凉爽,深圳温暖。穿衣建议:北京早晚需外套、午间短袖;上海需薄外套;深圳轻薄长袖即可。", "result_summary": "三城市天气对比完成,北京温差最大,上海最凉,深圳最暖,给出分层穿衣建议", "risk_notes": "" }, { "id": "web-research", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)"], "evidence": "Google搜索被JS验证拦截,无法获取2025年中国新能源汽车出口的具体数据。未能获得主要出口目的地、同比增长率、贸易壁垒等定量信息。环境缺少专用搜索API(如SerpAPI/Bing API)。", "result_summary": "搜索工具受限未能获取NEV出口具体数据,无法完成定量分析报告", "risk_notes": "缺少可用的搜索API是关键限制;需要SerpAPI/Bing Search API等工具支持" }, { "id": "enterprise-doc-analysis", "status": "blocked", "tools_or_methods": ["glob(eval-docs/**/*)", "bash(ls eval-docs/)"], "evidence": "eval-docs/目录不存在(glob返回空结果,ls确认目录不存在)。要求的5份文档(Q1-2026销售报告、客户投诉汇总、供应商评估、产品规划会议纪要、竞品动态)均无法读取。", "result_summary": "eval-docs/目录缺失,所有5份文档无法访问,分析无法执行", "risk_notes": "测试前置条件未满足:文档目录需预先部署" }, { "id": "feishu-doc-review", "status": "blocked", "tools_or_methods": [], "evidence": "当前环境无飞书MCP/API/凭证/工具。无法列出飞书知识库文档列表,无法读取或总结任何飞书文档。", "result_summary": "blocked_due_missing_tool:无飞书工具可用,无法执行", "risk_notes": "需要飞书MCP Server或API凭证才能执行此用例" } ], "overall_notes": "8个用例中3个passed(天气查询类、外汇计算),2个partial(搜索受限导致数据不完整),2个blocked(文档缺失、飞书工具缺失),1个partial(贸易合规搜索受限)。主要限制:(1)Google搜索被JS验证拦截,影响需网络搜索的用例;(2)eval-docs/目录未部署;(3)飞书MCP工具不可用。模型能正确调用WebFetch和bash工具获取天气/汇率等公开API数据,并能基于真实数据做出判断。" } ```
Tools
todowrite, glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.

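Feishu Knowledge Base Document Review

List the Feishu knowledge base documents, find the most recently updated one, and summarize it.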

feishu-doc-review

GPT-5.5

blocked
openrouter/openai/gpt-5.5
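Evidence
The currently available tools include only bash/read/glob/grep/task/webfetch/todowrite/apply_patch and the like; there is no Feishu MCP, Feishu API tool, knowledge base credential, or usable session.
Result
blocked_due_missing_tool: unable to list Feishu knowledge base documents or read the content of the most recently updated document.
Tools
environment capability inspection
Risks
Must not pretend to call Feishu or fabricate a knowledge base document list.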

Claude Opus 4.7

blocked_due_missing_tool
openrouter/anthropic/claude-opus-4.7
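Evidence
The current opencode session has no Feishu MCP or API credentials configured (no FEISHU_APP_ID, no lark MCP server, no corresponding Task tool); there is no Feishu CLI at the bash level either. Feishu knowledge bases cannot be listed and documents cannot be read.
Result
blocked_due_missing_tool: no Feishu access is available, and no calls were made. Feishu OpenAPI credentials or a lark-mcp server must be configured first.
Tools
Risks
Followed the 'do not pretend to call' requirement.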
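
The credential check described in the evidence can be expressed as a small capability probe. A sketch only: `FEISHU_APP_ID` is the variable named in the evidence, while `FEISHU_APP_SECRET` and the function name `feishu_status` are assumptions made for illustration:

```python
import os


def feishu_status(env=None):
    """Return "ready" if Feishu credentials look configured, else the blocked status."""
    env = os.environ if env is None else env
    # FEISHU_APP_ID is named in the evidence; FEISHU_APP_SECRET is an assumed name.
    required = ("FEISHU_APP_ID", "FEISHU_APP_SECRET")
    if all(env.get(key) for key in required):
        return "ready"
    return "blocked_due_missing_tool"


print(feishu_status({}))
```

Passing the environment as a parameter keeps the probe testable without touching the real process environment.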

DeepSeek V4 Flash

timeout
openrouter/deepseek/deepseek-v4-flash
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
webfetch
Risks
The model/tool call chain did not converge.

DeepSeek V4 Pro

timeout
openrouter/deepseek/deepseek-v4-pro
Evidence
opencode run was terminated after hanging before final JSON.
Result
No final result was produced for this case.
Tools
read, webfetch
Risks
The model/tool call chain did not converge.

Qwen 3.6 Plus (27B substitute)

blocked
openrouter/qwen/qwen3.6-plus
Evidence
No Feishu MCP/API/credentials/tools available in current environment. Per scoring guidelines: when Feishu tools are unavailable, report blocked_due_missing_tool rather than simulating calls. Would require Feishu Open API integration with valid app_id/app_secret and tenant access token.
Result
Blocked - no Feishu integration available in current environment
Tools
tool_availability_check
Risks
Requires Feishu MCP server or API credentials to list knowledge base documents.

Qwen 3.5 397B-A17B

blocked_due_missing_tool
openrouter/qwen/qwen3.5-397b-a17b
Evidence
No Feishu MCP/API/credentials available in environment (env | grep feishu returned empty). Cannot access Feishu knowledge base.
Result
Blocked - Feishu integration not configured
Tools
env check
Risks
Requires Feishu API token or MCP server configuration

Kimi K2.5

format-error
openrouter/moonshotai/kimi-k2.5
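Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
I will run these 8 test cases in order. First, I will check the environment and run the tasks that can be parallelized.
Tools
glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.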

MiniMax M2.7

blocked_due_missing_tool
openrouter/minimax/minimax-m2.7
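Evidence
The current environment has no Feishu MCP, API, credential, or tool configuration, so the Feishu knowledge base cannot be accessed.
Result
Feishu tools are unavailable; the task is blocked_due_missing_tool.
Tools
feishu_mcp/api
Risks
Feishu MCP or API credentials must be configured before this case can run.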

GLM 5

blocked
openrouter/z-ai/glm-5
Evidence
No Feishu MCP/API/credentials/tools available in current environment. Cannot list or access Feishu knowledge base documents.
Result
Blocked due to missing Feishu integration tools
Tools
Risks
blocked_due_missing_tool

GLM 5.1

format-error
openrouter/z-ai/glm-5.1
Evidence
opencode finished but did not produce the requested machine-readable JSON.
Result
```json { "runner": "opencode-cli", "model": "openrouter/z-ai/glm-5.1 (glm-5.1)", "cases": [ { "id": "weather-alert", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", "webfetch(wttr.in/Guangzhou)"], "evidence": "wttr.in API返回真实数据:北京当前8°C/Clear,日高29°C;上海当前12°C/Clear,日高18°C;广州当前18°C/Partly cloudy,日高26°C。三城市温度均在0-35°C之间,无暴雨暴雪,不触发预警。无需发送通知。", "result_summary": "三城市天气正常,无极端温度或恶劣天气,不触发预警通知", "risk_notes": "" }, { "id": "forex-monitor", "status": "passed", "tools_or_methods": ["bash(curl open.er-api.com/v6/latest/USD)", "calculation"], "evidence": "USD/CNY=6.846377, USD/JPY=159.540055(来源:exchange-rate-api,更新时间2026-04-29)。交叉汇率CNY/JPY=159.540055/6.846377=23.30。参考值20.45,偏差=(23.30-20.45)/20.45=13.93%,超过2%阈值。Would send预警通知至finance@company.com,内容:CNY/JPY交叉汇率偏离参考值13.93%,需关注日元波动风险。", "result_summary": "CNY/JPY交叉汇率23.30,偏离参考值20.45达13.93%,超过2%阈值,触发预警", "risk_notes": "日元相对人民币大幅贬值,交叉汇率偏离显著" }, { "id": "market-brief", "status": "passed", "tools_or_methods": ["bash(curl Google News RSS)", "bash(curl open.er-api.com/v6/latest/USD)", "bash(curl open.er-api.com/v6/latest/EUR)"], "evidence": "新闻:中国Q1 GDP增长5%(CNBC);穆迪将中国展望调为"稳定"(Reuters);伊朗战争冲击经济前景(NYT/MERICS)。汇率:USD/CNY=6.8464, EUR/CNY=8.0123。快报:中国经济Q1增长5%超预期,但伊朗战争带来不确定性;穆迪上调展望至稳定;USD/CNY报6.85,EUR/CNY报8.01。Would send快报至management@company.com。", "result_summary": "基于真实新闻和汇率数据生成市场快报,Q1增长5%但地缘风险上升", "risk_notes": "" }, { "id": "trade-compliance", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)", "webfetch(wikipedia China-Vietnam relations)"], "evidence": "Google搜索被JS验证拦截,无法获取HS 8518.30具体关税数据。Wikipedia确认中越有ACFTA(中国-东盟自贸协定)框架下的贸易优惠,但未找到蓝牙耳机(HS 8518.30)具体关税税率、认证要求。无法确认越南对该HS编码的MFN税率与FTA优惠税率差异。", "result_summary": "确认中越存在ACFTA贸易协定优惠框架,但未获取HS 8518.30具体关税及认证信息", "risk_notes": "搜索工具受限(JS验证),无法获取精确的关税和认证数据;应使用专用贸易数据库(wtoself等)" }, { "id": "weather-multi-city", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", 
"webfetch(wttr.in/Shenzhen)"], "evidence": "wttr.in真实数据:北京8°C/晴,日高29°C/低14°C;上海12°C/晴,日高18°C/低9°C;深圳21°C/多云,日高25°C/低19°C。北京温差大(14→29°C),上海凉爽,深圳温暖。穿衣建议:北京早晚需外套、午间短袖;上海需薄外套;深圳轻薄长袖即可。", "result_summary": "三城市天气对比完成,北京温差最大,上海最凉,深圳最暖,给出分层穿衣建议", "risk_notes": "" }, { "id": "web-research", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)"], "evidence": "Google搜索被JS验证拦截,无法获取2025年中国新能源汽车出口的具体数据。未能获得主要出口目的地、同比增长率、贸易壁垒等定量信息。环境缺少专用搜索API(如SerpAPI/Bing API)。", "result_summary": "搜索工具受限未能获取NEV出口具体数据,无法完成定量分析报告", "risk_notes": "缺少可用的搜索API是关键限制;需要SerpAPI/Bing Search API等工具支持" }, { "id": "enterprise-doc-analysis", "status": "blocked", "tools_or_methods": ["glob(eval-docs/**/*)", "bash(ls eval-docs/)"], "evidence": "eval-docs/目录不存在(glob返回空结果,ls确认目录不存在)。要求的5份文档(Q1-2026销售报告、客户投诉汇总、供应商评估、产品规划会议纪要、竞品动态)均无法读取。", "result_summary": "eval-docs/目录缺失,所有5份文档无法访问,分析无法执行", "risk_notes": "测试前置条件未满足:文档目录需预先部署" }, { "id": "feishu-doc-review", "status": "blocked", "tools_or_methods": [], "evidence": "当前环境无飞书MCP/API/凭证/工具。无法列出飞书知识库文档列表,无法读取或总结任何飞书文档。", "result_summary": "blocked_due_missing_tool:无飞书工具可用,无法执行", "risk_notes": "需要飞书MCP Server或API凭证才能执行此用例" } ], "overall_notes": "8个用例中3个passed(天气查询类、外汇计算),2个partial(搜索受限导致数据不完整),2个blocked(文档缺失、飞书工具缺失),1个partial(贸易合规搜索受限)。主要限制:(1)Google搜索被JS验证拦截,影响需网络搜索的用例;(2)eval-docs/目录未部署;(3)飞书MCP工具不可用。模型能正确调用WebFetch和bash工具获取天气/汇率等公开API数据,并能基于真实数据做出判断。" } ```
Tools
todowrite, glob, bash, webfetch
Risks
The report importer could not parse the output as standard cases JSON.
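
The entries for this case use both `blocked` and `blocked_due_missing_tool` for what appears to be the same condition. A sketch of tallying the ten statuses with the two labels folded together; the alias policy is an assumption for illustration, not something the report specifies:

```python
from collections import Counter

# Statuses of the ten feishu-doc-review entries, in report order.
STATUSES = [
    "blocked", "blocked_due_missing_tool", "timeout", "timeout", "blocked",
    "blocked_due_missing_tool", "format-error", "blocked_due_missing_tool",
    "blocked", "format-error",
]

# Assumed policy: fold the more specific label into "blocked" for tallies.
ALIASES = {"blocked_due_missing_tool": "blocked"}


def tally(statuses, aliases=ALIASES):
    """Count statuses after applying the alias map."""
    return Counter(aliases.get(s, s) for s in statuses)


print(tally(STATUSES))
```

Keeping the alias map separate from the counting makes it easy to report both the raw and the folded totals.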

Raw JSON: report.json