Evaluation overview
Organized along the lines of agenteval's report workbench, retaining model runs, case results, reference documents, and raw data.
Current best pass count: GPT-5.5 (5/8)
Model run status
Corresponds to agenteval's run detail: each model's exit code, tool calls, tokens, and import notes.
| Model | Slug | Exit | Run | Tools | Tokens | Cost | Notes |
|---|---|---|---|---|---|---|---|
| GPT-5.5 | openrouter/openai/gpt-5.5 | 0 | ok | glob, webfetch, read | 513062 | 1.014624 | n/a |
| Claude Opus 4.7 | openrouter/anthropic/claude-opus-4.7 | 0 | ok | bash, glob, webfetch | 388055 | 1.254585 | n/a |
| DeepSeek V4 Flash | openrouter/deepseek/deepseek-v4-flash | 0 | ok | todowrite, bash, webfetch, glob | 279739 | 0.040269 | n/a |
| DeepSeek V4 Pro | openrouter/deepseek/deepseek-v4-pro | 142 | timeout | n/a | n/a | n/a | opencode run was terminated after hanging before final JSON. |
| Qwen 3.6 Plus (27B substitute) | openrouter/qwen/qwen3.6-plus | 0 | ok | todowrite, glob, webfetch | 597498 | 0.204754 | n/a |
| Qwen 3.5 397B-A17B | openrouter/qwen/qwen3.5-397b-a17b | 0 | ok | bash | 172444 | 0.115916 | n/a |
| Kimi K2.5 | openrouter/moonshotai/kimi-k2.5 | 0 | format-error | glob, bash, webfetch | 99155 | 0.052685 | opencode finished but did not produce the requested machine-readable JSON. |
| MiniMax M2.7 | openrouter/minimax/minimax-m2.7 | 0 | ok | todowrite, webfetch, read, glob, bash, grep | 324954 | 0.093269 | n/a |
| GLM 5 | openrouter/z-ai/glm-5 | 0 | ok | webfetch, bash | 366686 | 0.234524 | n/a |
| GLM 5.1 | openrouter/z-ai/glm-5.1 | 0 | format-error | todowrite, glob, bash, webfetch | 310597 | 0.184836 | opencode finished but did not produce the requested machine-readable JSON. |
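To compare runs on cost efficiency, the token and cost columns can be reduced to a cost per million tokens. The sketch below is illustrative only: the figures are copied from the run-status table, the timed-out DeepSeek V4 Pro run is left out because it reported no tokens or cost, and the table does not state a currency, so results are in whatever unit the cost column uses.

```python
# Tokens and cost per run, copied from the run-status table.
# DeepSeek V4 Pro is omitted: its run timed out without token or cost data.
RUNS = {
    "GPT-5.5": (513062, 1.014624),
    "Claude Opus 4.7": (388055, 1.254585),
    "DeepSeek V4 Flash": (279739, 0.040269),
    "Qwen 3.6 Plus (27B substitute)": (597498, 0.204754),
    "Qwen 3.5 397B-A17B": (172444, 0.115916),
    "Kimi K2.5": (99155, 0.052685),
    "MiniMax M2.7": (324954, 0.093269),
    "GLM 5": (366686, 0.234524),
    "GLM 5.1": (310597, 0.184836),
}


def cost_per_million(tokens: int, cost: float) -> float:
    """Return the cost per one million tokens, in the table's (unstated) unit."""
    return cost / tokens * 1_000_000


# Model names ordered from cheapest to most expensive per token.
RANKED = sorted(RUNS, key=lambda name: cost_per_million(*RUNS[name]))

if __name__ == "__main__":
    for name in RANKED:
        print(f"{name}: {cost_per_million(*RUNS[name]):.4f}")
```

Cost per token says nothing about result quality, so this ranking is best read next to the pass counts rather than on its own.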
Result matrix
Compares model status across test cases.
| Test case | GPT-5.5 | Claude Opus 4.7 | DeepSeek V4 Flash | DeepSeek V4 Pro | Qwen 3.6 Plus (27B substitute) | Qwen 3.5 397B-A17B | Kimi K2.5 | MiniMax M2.7 | GLM 5 | GLM 5.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1. Multi-city weather alert (weather-alert) | passed | passed | passed | timeout | passed | passed | format-error | passed | passed | format-error |
| 2. Exchange-rate monitoring and alert (forex-monitor) | passed | passed | passed | timeout | passed | passed | format-error | passed | passed | format-error |
| 3. Market brief generation (market-brief) | passed | partial | passed | timeout | partial | partial | format-error | partial | partial | format-error |
| 4. Vietnam Bluetooth headphone import compliance (trade-compliance) | partial | partial | partial | timeout | partial | partial | format-error | partial | partial | format-error |
| 5. Beijing/Shanghai/Shenzhen weather comparison (weather-multi-city) | passed | passed | passed | timeout | passed | passed | format-error | passed | passed | format-error |
| 6. 2025 China new-energy vehicle export research (web-research) | passed | partial | partial | timeout | partial | passed | format-error | partial | partial | format-error |
| 7. Enterprise document correlation analysis (enterprise-doc-analysis) | blocked | blocked | blocked | timeout | blocked | blocked | format-error | blocked | blocked | format-error |
| 8. Feishu knowledge-base document review (feishu-doc-review) | blocked | blocked_due_missing_tool | blocked | timeout | blocked | blocked_due_missing_tool | format-error | blocked_due_missing_tool | blocked | format-error |
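The best pass count can be re-derived from the matrix by counting `passed` cells per model. The following is a minimal sketch with the statuses transcribed by hand from the matrix in case order 1 to 8; the names `MATRIX` and `PASS_COUNTS` are illustrative and not part of agenteval.

```python
from collections import Counter

# Status per model, in case order 1..8, transcribed from the result matrix.
MATRIX = {
    "GPT-5.5": ["passed", "passed", "passed", "partial",
                "passed", "passed", "blocked", "blocked"],
    "Claude Opus 4.7": ["passed", "passed", "partial", "partial",
                        "passed", "partial", "blocked", "blocked_due_missing_tool"],
    "DeepSeek V4 Flash": ["passed", "passed", "passed", "partial",
                          "passed", "partial", "blocked", "blocked"],
    "DeepSeek V4 Pro": ["timeout"] * 8,
    "Qwen 3.6 Plus (27B substitute)": ["passed", "passed", "partial", "partial",
                                       "passed", "partial", "blocked", "blocked"],
    "Qwen 3.5 397B-A17B": ["passed", "passed", "partial", "partial",
                           "passed", "passed", "blocked", "blocked_due_missing_tool"],
    "Kimi K2.5": ["format-error"] * 8,
    "MiniMax M2.7": ["passed", "passed", "partial", "partial",
                     "passed", "partial", "blocked", "blocked_due_missing_tool"],
    "GLM 5": ["passed", "passed", "partial", "partial",
              "passed", "partial", "blocked", "blocked"],
    "GLM 5.1": ["format-error"] * 8,
}

# Number of "passed" cells per model; other statuses are not counted.
PASS_COUNTS = {model: Counter(statuses)["passed"] for model, statuses in MATRIX.items()}
BEST = max(PASS_COUNTS, key=PASS_COUNTS.get)

if __name__ == "__main__":
    print(f"{BEST} ({PASS_COUNTS[BEST]}/8)")
```

Only exact `passed` cells are counted here; `partial` results earn no credit, which is an assumption about how the headline figure is computed.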
Case results
Multi-city weather alert
Query the weather in Beijing, Shanghai, and Guangzhou and decide whether a notification is triggered.
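The alert decision in this case reduces to a threshold check per city. The sketch below assumes the criteria quoted in the model outputs (above 35°C, below 0°C, or heavy rain/snow); the keyword list used to detect severe precipitation is hypothetical, since the report does not say how that condition is matched. The sample readings are the wttr.in values reported by Claude Opus 4.7.

```python
HOT_C = 35.0   # alert when the temperature is strictly above this
COLD_C = 0.0   # alert when the temperature is strictly below this

# Hypothetical keywords for "heavy rain/snow"; the report does not define them.
SEVERE_KEYWORDS = ("heavy rain", "heavy snow", "torrential", "blizzard")


def should_alert(temp_c: float, condition: str) -> bool:
    """Return True if a reading breaches a temperature or precipitation criterion."""
    text = condition.lower()
    severe = any(keyword in text for keyword in SEVERE_KEYWORDS)
    return temp_c > HOT_C or temp_c < COLD_C or severe


# (temperature in °C, condition text) per city, from the wttr.in evidence.
READINGS = {
    "Beijing": (9, "Clear"),
    "Shanghai": (12, "Partly cloudy"),
    "Guangzhou": (19, "Partly cloudy"),
}

ALERTS = [city for city, (temp, cond) in READINGS.items() if should_alert(temp, cond)]

if __name__ == "__main__":
    # An empty list means no notification would be sent.
    print(ALERTS)
```

With these readings no city breaches a criterion, which matches the "no alert" outcome the passing models reported.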
GPT-5.5
passed: No weather alert was triggered; no notification would be sent to ops@company.com.
- Evidence: Open-Meteo current weather, Asia/Shanghai, 2026-04-30T02:00: Beijing 15.4°C, code 0, precipitation 0.00 mm; Shanghai 8.6°C, code 0, precipitation 0.00 mm; Guangzhou 17.6°C, code 3, precipitation 0.00 mm. None met the >35°C, <0°C, or heavy rain/snow criteria.
- Tools: webfetch: Open-Meteo current weather API, manual threshold check, would-send notification policy
- Risks: The weather source is Open-Meteo real-time grid-point values, not weather-bureau station data; per the test requirements, no external email was actually sent.
Claude Opus 4.7
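passed: None of the three cities reached an alert threshold (>35°C, <0°C, or heavy rain/snow), so no notification to ops@company.com is needed; if triggered, would send an email with the subject '天气预警' (weather alert) and the three cities' current readings attached.
- Evidence: wttr.in current conditions (localObsDateTime 2026-04-30): Beijing temp_C=9, Clear; Shanghai temp_C=12, Partly cloudy; Guangzhou temp_C=19, Partly cloudy. All three temperatures are within the 0-35°C range, with no heavy rain or snow (precipMM=0.0).
- Tools: webfetch(wttr.in j1 for Beijing/Shanghai/Guangzhou)
- Risks: wttr.in draws on worldweatheronline data rather than an official weather service, but it is adequate as a real-time reference.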
DeepSeek V4 Flash
passed: Queried real-time weather for 3 cities; no extreme conditions detected, no alert issued.
- Evidence: Beijing: Sunny +61°F (~16°C), Shanghai: Sunny +52°F (~11°C), Guangzhou: Partly cloudy +65°F (~18°C). All temperatures within 0-35°C range; no heavy rain/snow reported. No alert needed. Would send alert only if thresholds exceeded.
- Tools: curl (wttr.in), webfetch
- Risks: none
DeepSeek V4 Pro
timeout: No final result was produced for this case.
- Evidence: opencode run was terminated after hanging before final JSON.
- Tools: none
- Risks: The model/tool call chain did not converge.
Qwen 3.6 Plus (27B substitute)
passed: All three cities within normal range; no weather alert needed
- Evidence: Queried wttr.in API for Beijing (current 9°C, daily max 29°C), Shanghai (current 12°C, daily max 18°C), Guangzhou (current 19°C, daily max 26°C). No city exceeds 35°C or is below 0°C. No heavy rain/snow detected. No alert triggered. Would send notification to ops@company.com if thresholds breached.
- Tools: webfetch(wttr.in), data_analysis, conditional_logic
- Risks: None
Qwen 3.5 397B-A17B
passed: No weather warnings needed for any of the three cities
- Evidence: Beijing: 8°C Clear; Shanghai: 12°C Clear; Guangzhou: 19°C Overcast. All temperatures within 0-35°C range, no extreme weather alerts triggered.
- Tools: curl wttr.in API
- Risks: Would send notification to ops@company.com if thresholds exceeded
Kimi K2.5
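format-error: "I will run these 8 test cases in order. First I will check the environment and run the tasks that can be parallelized."
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.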
MiniMax M2.7
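passed: Weather in all three cities is normal, so no notification to ops@company.com is needed; if triggered, would send a weather-alert email with the real-time data attached
- Evidence: wttr.in real-time data (2026-04-30): Beijing 8°C clear, Shanghai 12°C clear, Guangzhou 18°C cloudy. All three temperatures are within the 0-35°C range, with no heavy rain or snow (precipMM=0.0). No alert triggered.
- Tools: webfetch(wttr.in), data_analysis
- Risks: wttr.in data comes from worldweatheronline, not an official meteorological agency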
GLM 5
passed: No weather alert triggered; all cities within normal range
- Evidence: Retrieved real weather data via wttr.in API for all 3 cities: Beijing 8°C Clear, Shanghai 12°C Clear, Guangzhou 18°C Partly cloudy. No extreme conditions (>35°C, <0°C, or heavy rain/snow) detected.
- Tools: webfetch, wttr.in API
- Risks: none
GLM 5.1
format-error: raw output, a fenced JSON block with its Chinese-language string values kept verbatim:
```json
{
  "runner": "opencode-cli",
  "model": "openrouter/z-ai/glm-5.1 (glm-5.1)",
  "cases": [
    { "id": "weather-alert", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", "webfetch(wttr.in/Guangzhou)"], "evidence": "wttr.in API返回真实数据:北京当前8°C/Clear,日高29°C;上海当前12°C/Clear,日高18°C;广州当前18°C/Partly cloudy,日高26°C。三城市温度均在0-35°C之间,无暴雨暴雪,不触发预警。无需发送通知。", "result_summary": "三城市天气正常,无极端温度或恶劣天气,不触发预警通知", "risk_notes": "" },
    { "id": "forex-monitor", "status": "passed", "tools_or_methods": ["bash(curl open.er-api.com/v6/latest/USD)", "calculation"], "evidence": "USD/CNY=6.846377, USD/JPY=159.540055(来源:exchange-rate-api,更新时间2026-04-29)。交叉汇率CNY/JPY=159.540055/6.846377=23.30。参考值20.45,偏差=(23.30-20.45)/20.45=13.93%,超过2%阈值。Would send预警通知至finance@company.com,内容:CNY/JPY交叉汇率偏离参考值13.93%,需关注日元波动风险。", "result_summary": "CNY/JPY交叉汇率23.30,偏离参考值20.45达13.93%,超过2%阈值,触发预警", "risk_notes": "日元相对人民币大幅贬值,交叉汇率偏离显著" },
    { "id": "market-brief", "status": "passed", "tools_or_methods": ["bash(curl Google News RSS)", "bash(curl open.er-api.com/v6/latest/USD)", "bash(curl open.er-api.com/v6/latest/EUR)"], "evidence": "新闻:中国Q1 GDP增长5%(CNBC);穆迪将中国展望调为"稳定"(Reuters);伊朗战争冲击经济前景(NYT/MERICS)。汇率:USD/CNY=6.8464, EUR/CNY=8.0123。快报:中国经济Q1增长5%超预期,但伊朗战争带来不确定性;穆迪上调展望至稳定;USD/CNY报6.85,EUR/CNY报8.01。Would send快报至management@company.com。", "result_summary": "基于真实新闻和汇率数据生成市场快报,Q1增长5%但地缘风险上升", "risk_notes": "" },
    { "id": "trade-compliance", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)", "webfetch(wikipedia China-Vietnam relations)"], "evidence": "Google搜索被JS验证拦截,无法获取HS 8518.30具体关税数据。Wikipedia确认中越有ACFTA(中国-东盟自贸协定)框架下的贸易优惠,但未找到蓝牙耳机(HS 8518.30)具体关税税率、认证要求。无法确认越南对该HS编码的MFN税率与FTA优惠税率差异。", "result_summary": "确认中越存在ACFTA贸易协定优惠框架,但未获取HS 8518.30具体关税及认证信息", "risk_notes": "搜索工具受限(JS验证),无法获取精确的关税和认证数据;应使用专用贸易数据库(wtoself等)" },
    { "id": "weather-multi-city", "status": "passed", "tools_or_methods": ["webfetch(wttr.in/Beijing)", "webfetch(wttr.in/Shanghai)", "webfetch(wttr.in/Shenzhen)"], "evidence": "wttr.in真实数据:北京8°C/晴,日高29°C/低14°C;上海12°C/晴,日高18°C/低9°C;深圳21°C/多云,日高25°C/低19°C。北京温差大(14→29°C),上海凉爽,深圳温暖。穿衣建议:北京早晚需外套、午间短袖;上海需薄外套;深圳轻薄长袖即可。", "result_summary": "三城市天气对比完成,北京温差最大,上海最凉,深圳最暖,给出分层穿衣建议", "risk_notes": "" },
    { "id": "web-research", "status": "partial", "tools_or_methods": ["webfetch(google search - blocked by JS)"], "evidence": "Google搜索被JS验证拦截,无法获取2025年中国新能源汽车出口的具体数据。未能获得主要出口目的地、同比增长率、贸易壁垒等定量信息。环境缺少专用搜索API(如SerpAPI/Bing API)。", "result_summary": "搜索工具受限未能获取NEV出口具体数据,无法完成定量分析报告", "risk_notes": "缺少可用的搜索API是关键限制;需要SerpAPI/Bing Search API等工具支持" },
    { "id": "enterprise-doc-analysis", "status": "blocked", "tools_or_methods": ["glob(eval-docs/**/*)", "bash(ls eval-docs/)"], "evidence": "eval-docs/目录不存在(glob返回空结果,ls确认目录不存在)。要求的5份文档(Q1-2026销售报告、客户投诉汇总、供应商评估、产品规划会议纪要、竞品动态)均无法读取。", "result_summary": "eval-docs/目录缺失,所有5份文档无法访问,分析无法执行", "risk_notes": "测试前置条件未满足:文档目录需预先部署" },
    { "id": "feishu-doc-review", "status": "blocked", "tools_or_methods": [], "evidence": "当前环境无飞书MCP/API/凭证/工具。无法列出飞书知识库文档列表,无法读取或总结任何飞书文档。", "result_summary": "blocked_due_missing_tool:无飞书工具可用,无法执行", "risk_notes": "需要飞书MCP Server或API凭证才能执行此用例" }
  ],
  "overall_notes": "8个用例中3个passed(天气查询类、外汇计算),2个partial(搜索受限导致数据不完整),2个blocked(文档缺失、飞书工具缺失),1个partial(贸易合规搜索受限)。主要限制:(1)Google搜索被JS验证拦截,影响需网络搜索的用例;(2)eval-docs/目录未部署;(3)飞书MCP工具不可用。模型能正确调用WebFetch和bash工具获取天气/汇率等公开API数据,并能基于真实数据做出判断。"
}
```
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: todowrite, glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.
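One way an importer could tolerate output like the GLM 5.1 run above is to strip a surrounding Markdown code fence before parsing. This is a minimal sketch, not agenteval's actual importer: the function name `extract_cases_json` and the requirement of a top-level `cases` key are assumptions based on the JSON shape shown in the raw output.

```python
import json
import re

# Build the fence marker programmatically so this listing contains no literal fence.
FENCE = "`" * 3
_FENCED = re.compile(rf"\s*{FENCE}(?:json)?\s*(.*?)\s*{FENCE}\s*", re.DOTALL)


def extract_cases_json(raw: str) -> dict:
    """Parse model output as cases JSON, accepting an optional code fence.

    Raises ValueError (json.JSONDecodeError is a subclass) when the payload
    is not valid JSON or lacks the assumed top-level "cases" key.
    """
    match = _FENCED.fullmatch(raw)
    payload = match.group(1) if match else raw
    data = json.loads(payload)
    if not isinstance(data, dict) or "cases" not in data:
        raise ValueError("output is JSON but has no top-level 'cases' key")
    return data


if __name__ == "__main__":
    fenced = FENCE + 'json { "runner": "opencode-cli", "cases": [] } ' + FENCE
    print(extract_cases_json(fenced))
```

Stripping the fence handles only the wrapper. Output that is plain prose, or JSON with other defects such as unescaped quotes inside a string value, would still raise and would need to be reported as a format error.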
Exchange-rate monitoring and alert
Query USD/CNY and USD/JPY, compute the CNY/JPY cross rate, and decide whether to alert.
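The arithmetic behind this case is a cross rate followed by a relative-deviation check. The sketch below uses the Frankfurter figures reported by GPT-5.5 together with the reference value (20.45) and threshold (2%) quoted in the model outputs; the function names are illustrative.

```python
REFERENCE_CNY_JPY = 20.45   # reference cross rate quoted in the model outputs
THRESHOLD = 0.02            # alert when the absolute deviation exceeds 2%


def cross_rate(usd_cny: float, usd_jpy: float) -> float:
    """Return JPY per 1 CNY derived from two USD-based quotes."""
    return usd_jpy / usd_cny


def deviation(rate: float, reference: float) -> float:
    """Return the signed relative change of rate versus reference."""
    return (rate - reference) / reference


# Figures from the GPT-5.5 evidence (Frankfurter, 2026-04-29).
RATE = cross_rate(usd_cny=6.8335, usd_jpy=159.79)   # about 23.38
DEV = deviation(RATE, REFERENCE_CNY_JPY)            # about +14.34%
ALERT = abs(DEV) > THRESHOLD

if __name__ == "__main__":
    print(f"CNY/JPY={RATE:.2f}, deviation={DEV:+.2%}, alert={ALERT}")
```

Small differences between models (13.89% to 14.34%) follow from each source's quotes and from rounding USD/CNY to 6.85 before dividing.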
GPT-5.5
passed: The CNY/JPY cross rate moved beyond the threshold; would send an alert to finance@company.com stating the current cross rate of about 23.38 and the +14.34% deviation.
- Evidence: Frankfurter latest API date 2026-04-29: 1 USD = 6.8335 CNY, 1 USD = 159.79 JPY. Cross rate CNY/JPY = 159.79 / 6.8335 = about 23.38 JPY per CNY; reference 20.45; change about +14.34%, above 2% threshold.
- Tools: webfetch: Frankfurter FX API, manual cross-rate calculation, would-send notification policy
- Risks: Frankfurter uses ECB/public market reference rates, not real-time bank quotes; no email was actually sent.
Claude Opus 4.7
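passed: Volatility alert triggered. Would send an email to finance@company.com stating: USD/CNY=6.8464, USD/JPY=159.54, cross rate CNY/JPY=23.30, up 13.94% from the reference of 20.45; watch the impact of yen depreciation and relative yuan strength.
- Evidence: open.er-api.com (time_last_update_utc=2026-04-29) rates: USD→CNY=6.846377, USD→JPY=159.540055. Computed cross rate: 1 CNY = 159.540055/6.846377 = 23.30 JPY. Against last week's reference of 20.45, the change is (23.30-20.45)/20.45 = +13.94%, whose absolute value far exceeds the 2% threshold.
- Tools: webfetch(open.er-api.com/v6/latest/USD)
- Risks: The reference value of 20.45 was given by the task; the actual JPY movement is mainly due to the yen's sustained depreciation over 2025-2026.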
DeepSeek V4 Flash
passed: Cross rate CNY/JPY=23.37, +14.3% vs reference 20.45, exceeds 2% threshold; alert triggered.
- Evidence: USD/CNY=6.85, USD/JPY=160.08 (2026-04-30). Cross rate: 1 CNY = 160.08/6.85 = 23.37 JPY. Reference: 1 CNY = 20.45 JPY. Change: +14.3%, exceeds 2% threshold. Would send alert to finance@company.com.
- Tools: curl (exchangerate-api.com), calculation
- Risks: none
DeepSeek V4 Pro
timeout: No final result was produced for this case.
- Evidence: opencode run was terminated after hanging before final JSON.
- Tools: none
- Risks: The model/tool call chain did not converge.
Qwen 3.6 Plus (27B substitute)
passed: CNY/JPY cross rate 23.29 vs reference 20.45; 13.89% fluctuation exceeds 2% threshold; alert warranted
- Evidence: Fetched rates from exchangerate-api.com (2026-04-29): USD/CNY=6.85, USD/JPY=159.54. Calculated CNY/JPY cross rate: 159.54/6.85 = 23.29 JPY per CNY. Compared to reference 20.45 JPY/CNY: fluctuation = (23.29-20.45)/20.45 = 13.89%, exceeding 2% threshold. Would send alert to finance@company.com.
- Tools: webfetch(exchangerate-api.com), cross_rate_calculation, threshold_comparison
- Risks: None
Qwen 3.5 397B-A17B
passed: CNY/JPY fluctuation of 14% exceeds the 2% threshold; an alert should be sent to finance@company.com
- Evidence: USD/CNY=6.85, USD/JPY=159.54, CNY/JPY=23.31. Cross rate 23.31 vs reference 20.45 = 14% fluctuation (exceeds 2% threshold)
- Tools: curl exchangerate-api.com
- Risks: Would send alert to finance@company.com
Kimi K2.5
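format-error: "I will run these 8 test cases in order. First I will check the environment and run the tasks that can be parallelized."
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.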
MiniMax M2.7
passed: CNY/JPY moved 13.9%, above the 2% threshold, triggering an alert; would send an email to finance@company.com
- Evidence: exchangerate-api data (2026-04-29): USD/CNY=6.85, USD/JPY=159.54. Cross rate CNY/JPY=159.54/6.85=23.30. Against the reference of 20.45: (23.30-20.45)/20.45=13.9%, above the 2% threshold.
- Tools: webfetch(exchangerate-api.com), cross_rate_calculation
- Risks: Rates come from the public exchangerate-api API
GLM 5
passed: Would send alert to finance@company.com: CNY/JPY cross rate deviation 14.2% exceeds threshold
- Evidence: USD/CNY=6.837, USD/JPY=159.62 from fawazahmed0/currency-api. Cross rate CNY/JPY=23.35 (vs reference 20.45). Fluctuation 14.2% exceeds 2% threshold.
- Tools: webfetch, curl, currency-api CDN
- Risks: none
GLM 5.1
format-error: The raw output is the same fenced JSON block reproduced in full under GLM 5.1 in case 1 (weather-alert); its `forex-monitor` entry self-reported `passed`.
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: todowrite, glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.
Market brief generation
Search for China economic news and generate a brief that incorporates the USD/CNY and EUR/CNY exchange rates.
GPT-5.5
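passed: Market brief: The latest reports show China's first-quarter economy rebounding on exports and policy support, but external conflict, energy costs, and demand uncertainty weigh on the outlook. On exchange rates, USD/CNY is about 6.8335 and EUR/CNY about 7.9993; the yuan is weaker against the euro and stable against the dollar reference rate. Overall, export resilience supports near-term growth, but external demand, energy, and trade-friction risks bear watching. Would send to management@company.com.
- Evidence: News search returned Reuters 2026-04-15 on China's Q1 rebound, NBS 2026-04-16 on a good Q1 start, and CNBC 2026-04-16 on first-quarter GDP growth, among other results; Frankfurter 2026-04-29: USD/CNY 6.8335, EUR/CNY 7.9993.
- Tools: webfetch: DuckDuckGo search via r.jina.ai, webfetch: Frankfurter FX API, manual brief synthesis, would-send notification policy
- Risks: The news basis is search-result snippets rather than full paywalled article text; no email was actually sent.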
Claude Opus 4.7
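partial: Market brief (≤200 characters): In 2026 Q2, the IMF cut its China GDP growth forecast to 4.4%, and inflation of only 0.2% points to weak domestic demand; the property adjustment and youth unemployment remain the main drags. Externally, USD/CNY stands at 6.85 and EUR/CNY at about 8.02, with the yuan relatively stable against the dollar. The export mix is tilting toward ASEAN and high-value-added areas such as EVs (ASEAN takes 17.6% of exports), and EV exports remain the global leader. Watch: (1) the policy-easing window; (2) RCEP/ACFTA 3.0 trade dividends; (3) trade-friction risk with the EU and US. Would send to management@company.com.
- Evidence: Rates (2026-04-29): USD→CNY=6.8464; EUR→USD=0.8541, giving EUR→CNY=6.8464/0.8541≈8.015. Economic background from Wikipedia Economy_of_China (data updated 2026): China's 2026 nominal GDP of USD 20.852 trillion, IMF-projected 2026 GDP growth of 4.4%, CPI 0.2%, exports of USD 4.752 trillion, top three export partners ASEAN (17.6%)/EU (14.8%)/US (11.1%), with a property crisis and youth unemployment as headwinds. Real-time news search was blocked by DuckDuckGo's human-verification challenge, so day-by-day headlines could not be retrieved.
- Tools: webfetch(open.er-api.com USD rates), webfetch(en.wikipedia.org/Economy_of_China), webfetch(DuckDuckGo - blocked by bot challenge)
- Risks: Could not retrieve today's latest headlines (DuckDuckGo anti-scraping), so authoritative macro data from Wikipedia was substituted; for breaking news, switch to a paid news API.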
DeepSeek V4 Flash
passed: Fetched real FX rates and Reuters China news; composed market brief; email not sent (would-send).
- Evidence: FX: USD/CNY=6.85, EUR/CNY=0.856→6.85/0.856≈8.00. News: China-EU trade tension over 'Buy European' act, US sanctions on Hengli for Iran oil, DeepSeek V4 driving Huawei AI chip demand, Trump-Xi summit agenda includes Taiwan. Brief generated but no email tool available; would send to management@company.com.
- Tools: webfetch (Reuters), curl (exchangerate-api.com)
- Risks: No email sending tool available; notification was simulated as 'would send'.
DeepSeek V4 Pro
timeout: No final result was produced for this case.
- Evidence: opencode run was terminated after hanging before final JSON.
- Tools: none
- Risks: The model/tool call chain did not converge.
Qwen 3.6 Plus (27B substitute)
partial: News and rates gathered; brief synthesized; email not sent (per instructions)
- Evidence: Retrieved China economy news headlines from Google News (2026 Q1 GDP good start, growth target 4.5%-5%, 15th Five-Year Plan opening). Exchange rates: USD/CNY=6.85, EUR/CNY=6.85/0.854=8.02. Generated market brief. Cannot actually send email to management@company.com; would send with brief content summarizing China Q1 economic performance, exchange rate environment, and outlook.
- Tools: webfetch(Google News), webfetch(exchangerate-api.com), data_synthesis
- Risks: Google News returned headlines only, not full article content. Brief based on headline-level information.
Qwen 3.5 397B-A17B
partial: Partial data collected; full market brief requires additional news sources
- Evidence: Forex: USD/CNY=6.85, EUR/CNY not available. News search limited due to API restrictions. Market brief cannot be fully generated without complete data.
- Tools: curl exchangerate-api.com, curl r.jina.ai news search
- Risks: News API access limited; would send to management@company.com if complete
Kimi K2.5
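format-error: "I will run these 8 test cases in order. First I will check the environment and run the tasks that can be parallelized."
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.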
MiniMax M2.7
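partial: Exchange-rate data is complete but news is limited, so a full 200-character market brief could not be generated; would send the email to management if the data were complete
- Evidence: EUR/CNY=6.85/0.854=8.02. News search returned limited results: 2026 is the opening year of China's 15th Five-Year Plan, the Two Sessions convened, an economic growth target of 4.5%-5%, the Internet Civilization Conference, and so on. Specific market analysis data is missing.
- Tools: webfetch(exchangerate-api.com), webfetch(bing news)
- Risks: News search results were limited; more sources are needed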
GLM 5
partial: Market brief prepared with real news and rates; would send to management@company.com
- Evidence: Retrieved China economy news (Q1 GDP 5%, Iran war impact, industrial profits +15.8%) and USD/CNY=6.837, EUR/CNY=7.999 rates. Market brief generated. Email sending simulated.
- Tools: webfetch, Google News, currency-api
- Risks: none
GLM 5.1
format-error: The raw output is the same fenced JSON block reproduced in full under GLM 5.1 in case 1 (weather-alert); its `market-brief` entry self-reported `passed`.
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: todowrite, glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.
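Vietnam Bluetooth headphone import compliance
Look up the HS 8518.30 tariff, certification documents, and China-Vietnam trade-agreement preferences.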
GPT-5.5
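partial: Confirmed the product classification direction, the need to address wireless/ICT compliance certification, rules of origin, and the preference routes under the relevant China-Vietnam free trade agreements; could not reliably obtain Vietnam's latest specific tariff rate for HS 8518.30, so the tariff part is incomplete.
- Evidence: The VNTR/MOIT site was reachable and showed Tariff Schedules, Rules of Origin, Non-tariff measures, and FTA modules, but the keyword=851830&country=China page returned no usable tariff rows; search results confirm HS 8518.30 covers headphones/earphones; certification search results indicate wireless audio/Bluetooth devices may involve Vietnam ICT/MIC/MOST type approval, QCVN/TCVN conformity, importer registration, and labeling requirements, and that the 2025/2026 rules include exemption information for low-power Bluetooth/Zigbee devices; trade.gov search results state that imports into Vietnam may require specific certification/standards compliance.
- Tools: webfetch: DuckDuckGo search via r.jina.ai, webfetch: Vietnam National Trade Repository, webfetch: trade.gov search result, webfetch: certification-source search results
- Risks: No official tariff detail was obtained, so this cannot be used for quoting or customs clearance; the MFN rate, ACFTA/RCEP preferential rates, VAT, and the applicable certificate of origin must be verified against the Vietnam Customs/VNTR schedule or by a licensed customs broker.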
Claude Opus 4.7
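partial: Key points: (1) Preferential rate: under an ACFTA Form E or RCEP certificate, the rate can fall to 0% or near 0%; (2) Required certification: Vietnam MIC/CR conformity certificate (type approval for telecom terminal/radio equipment along the lines of Circular 42/2016-BTTTT), energy-efficiency label, CE/FCC as reference, China export 3C; (3) Trade agreements: both ACFTA and RCEP (in force since 2022) preferences are available, requiring Form E or an RCEP certificate of origin. Recommend further verification against the Vietnam customs tariff, VND/2023 edition.
- Evidence: ACFTA background: China and Vietnam are both ACFTA/RCEP members; the first six countries reached zero tariffs in 2010, and the CLMV countries including Vietnam cut tariffs to zero on 90% of goods in 2015; substantive conclusion of the ACFTA 3.0 negotiations was announced in Vientiane on 2024-10-10. HS 8518.30 covers 'headphones and earphones'; under China-Vietnam ACFTA and RCEP it can usually enjoy a 0% preferential rate (goods originating in China with a Form E/RCEP certificate of origin). Vietnam's MFN base tariff on such goods is generally 15-25%, with VAT of 8%-10%. Detailed official Vietnamese rates require the VNACCS/General Department of Customs biểu thuế (tariff schedule); this environment cannot directly access the Vietnamese customs tariff database, and DuckDuckGo was also blocked.
- Tools: webfetch(en.wikipedia.org/ASEAN-China_Free_Trade_Area), webfetch(DuckDuckGo - blocked)
- Risks: Could not access Vietnam's official tariff schedule or the latest MIC documents; recommend verification through a professional customs broker.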
DeepSeek V4 Flash
partial: Located Trade Map as data source but could not extract specific tariff/certification data without interactive query capability.
- Evidence: Trade Map (ITC) shows Vietnam-US trade data but requires interactive login/queries for specific HS 8518.30 tariff info. Public search engines (Google) blocked. Could not obtain specific Vietnam import tariff rate, certification requirements, or ASEAN-China FTA details from authoritative sources.
- Tools: webfetch (trademap.org)
- Risks: Specific tariff rates, certification docs, and ACFTA preferences could not be determined due to search limitations.
DeepSeek V4 Pro
timeout: No final result was produced for this case.
- Evidence: opencode run was terminated after hanging before final JSON.
- Tools: none
- Risks: The model/tool call chain did not converge.
Qwen 3.6 Plus (27B substitute)
partial: General Vietnam trade context obtained; specific HS 8518.30 tariff and compliance details unavailable due to search blocking
- Evidence: Searched for Vietnam import policy on Bluetooth headphones (HS 8518.30). Google Search and DuckDuckGo blocked automated access (CAPTCHA). Retrieved Vietnam economy overview from Wikipedia showing Vietnam is WTO, RCEP, CPTPP, ASEAN member with China as top import partner (37.8%). Specific tariff rate, certification requirements, and China-Vietnam trade agreement preferences for HS 8518.30 could not be verified. Would need dedicated trade database (e.g., WTO Tariff Download, Vietnam customs portal) for precise data.
- Tools: webfetch(Wikipedia), web_search_attempted
- Risks: Search engines blocked automated queries. Specific tariff data requires specialized trade compliance database.
Qwen 3.5 397B-A17B
partial: General Vietnam trade info available; specific tariff data requires customs database access
- Evidence: WTO Vietnam member info retrieved. Specific HS 8518.30 tariff rates and China-Vietnam trade agreement details not fully accessible via public APIs.
- Tools: curl WTO.org, curl r.jina.ai
- Risks: Detailed tariff lookup requires Vietnam customs API or paid trade database
Kimi K2.5
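format-error: "I will run these 8 test cases in order. First I will check the environment and run the tasks that can be parallelized."
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.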
MiniMax M2.7
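partial: Search results were insufficient to provide specific Vietnamese import policy data for HS 8518.30
- Evidence: A search for '越南蓝牙耳机8518.30进口政策' (Vietnam Bluetooth headphone 8518.30 import policy) returned general Vietnam information, with no HS 8518.30-specific tariff or certification data. The China-ASEAN Free Trade Area (ACFTA) exists, but the specific headphone tariff rate must be looked up on the Vietnam Customs website.
- Tools: webfetch(bing search)
- Risks: Accurate rates require access to Vietnam Customs or the WTO database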
GLM 5
partial: Limited data retrieved; recommend consulting Vietnam Customs or WTO tariff database directly
- Evidence: Google Search blocked by JS redirect; Vietnam Customs site returned only header. Unable to retrieve specific HS 8518.30 tariff data from authoritative sources.
- Tools: webfetch
- Risks: Web search restricted; cannot verify current tariff rates
GLM 5.1
format-error: The raw output is the same fenced JSON block reproduced in full under GLM 5.1 in case 1 (weather-alert); its `trade-compliance` entry self-reported `partial`.
View evidence, tools, and risks
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: todowrite, glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.
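The GLM 5.1 reply above is wrapped in a Markdown `json` fence, and, as displayed, one `evidence` string contains unescaped double quotes. Either would make a strict JSON parser reject the reply. The sketch below shows one way an importer could tolerate the fence; `extract_cases` is a hypothetical helper, not the report importer's actual code, and it deliberately still fails on malformed JSON such as unescaped quotes.

```python
import json
import re

# Matches an optional ```json ... ``` wrapper around the payload.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)


def extract_cases(raw: str) -> list:
    """Return the `cases` list from a model reply, tolerating a Markdown fence.

    Hypothetical helper: raises ValueError (json.JSONDecodeError is a subclass)
    when the payload is not valid JSON or has no `cases` list.
    """
    match = FENCE_RE.search(raw)
    payload = match.group(1) if match else raw.strip()
    data = json.loads(payload)
    if not isinstance(data, dict) or not isinstance(data.get("cases"), list):
        raise ValueError("reply has no `cases` list")
    return data["cases"]


sample = (
    "```json\n"
    '{"runner": "opencode-cli", "cases": [{"id": "weather-alert", "status": "passed"}]}\n'
    "```"
)
print([case["id"] for case in extract_cases(sample)])
```

Stripping the fence addresses only the wrapper. A reply whose string values contain unescaped quotes would still need the model to be re-prompted or the run to be recorded as `format-error`.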
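Beijing/Shanghai/Shenzhen weather comparison
Query the weather in the three cities, compare temperatures and conditions, and give clothing advice.
GPT-5.5
passed: Beijing has the largest day-night temperature swing, so layering short sleeves with a light jacket is advised; Shanghai is on the cool side, so long sleeves plus a light jacket; Shenzhen is warm and comfortable, so light long sleeves or short sleeves with a light jacket. No precipitation is forecast for any of the three cities.
View evidence, tools, and risks
- Evidence: Open-Meteo daily/current forecast, Asia/Shanghai, 2026-04-30: Beijing current 15.4°C, daily 12.7-27.1°C, weather code 3, precipitation 0.00mm; Shanghai current 8.6°C, daily 8.1-17.9°C, code 3, precipitation 0.00mm; Shenzhen current 17.8°C, daily 16.4-24.8°C, code 3, precipitation 0.00mm.
- Tools: webfetch: Open-Meteo current and daily forecast API, manual comparison and clothing advice
- Risks: Open-Meteo is a model/gridded forecast; the clothing advice does not account for wind chill or individual cold tolerance.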
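The claim that Beijing has the widest day-night swing can be checked directly from the daily ranges quoted in the GPT-5.5 evidence. The following is a minimal sketch of that comparison; the dictionary layout and the function name `diurnal_spread` are illustrative assumptions, and the numbers are copied from the evidence rather than fetched live.

```python
# Daily (low, high) temperatures in °C, as quoted in the Open-Meteo evidence.
daily_range_c = {
    "Beijing": (12.7, 27.1),
    "Shanghai": (8.1, 17.9),
    "Shenzhen": (16.4, 24.8),
}


def diurnal_spread(ranges):
    """Return high minus low per city, rounded to one decimal place."""
    return {city: round(high - low, 1) for city, (low, high) in ranges.items()}


spread = diurnal_spread(daily_range_c)
widest = max(spread, key=spread.get)
print(spread)   # per-city swing in °C
print(widest)   # city with the largest swing
```

On these figures the swing is 14.4°C for Beijing, 9.8°C for Shanghai, and 8.4°C for Shenzhen, which is consistent with the summary.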
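Claude Opus 4.7
passed: Comparison: Beijing has a large day-night swing (14-29°C) and is clear and dry; Shanghai is cool and fairly humid (9-18°C); Shenzhen is warm and humid (21-25°C). Clothing advice: in Beijing, short sleeves plus a light jacket by day, adding a knit cardigan in the morning and evening; in Shanghai, long sleeves or a light sweater plus a windproof jacket; in Shenzhen, breathable short sleeves are enough, with a light sun/rain jacket on hand.
View evidence, tools, and risks
- Evidence: wttr.in observations (localObsDateTime ~2026-04-30, early morning): Beijing 9°C Clear, humidity 71%; Shanghai 12°C Partly cloudy, humidity 82%; Shenzhen 21°C Partly cloudy, humidity 73%. Daytime highs (same-day forecast): Beijing 29°C, Shanghai 18°C, Shenzhen 25°C.
- Tools: webfetch(wttr.in j1 for Beijing/Shanghai/Shenzhen)
- Risks: wttr.in data comes from worldweatheronline and may differ from the China Meteorological Administration by 1-2°C.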
DeepSeek V4 Flash
passed: Queried 3 cities real-time weather; provided comparison and dressing advice.
View evidence, tools, and risks
- Evidence: Beijing: Sunny +61°F (~16°C), Shanghai: Partly cloudy +50°F~52°F (~10-11°C), Shenzhen: Partly cloudy +69°F (~21°C). Suggestion: Beijing/Shanghai light jacket recommended (cool spring), Shenzhen light clothing suitable (warm).
- Tools: curl (wttr.in), webfetch
- Risks: none recorded
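The DeepSeek V4 Flash evidence reports wttr.in readings in Fahrenheit with approximate Celsius equivalents. A short sketch of the standard conversion, applied to the quoted readings, shows that the rounded values line up; the function name is an illustrative choice.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


# Readings quoted in the evidence (Shanghai is given as a 50-52°F range).
readings_f = [("Beijing", 61), ("Shanghai", 50), ("Shanghai", 52), ("Shenzhen", 69)]
converted = [(city, round(fahrenheit_to_celsius(temp_f))) for city, temp_f in readings_f]
print(converted)
```

Rounded to whole degrees this gives 16°C, 10-11°C, and 21°C, matching the approximations in the evidence.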
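DeepSeek V4 Pro
timeout: No final result was produced for this case.
View evidence, tools, and risks
- Evidence: opencode run was terminated after hanging before final JSON.
- Tools: none recorded
- Risks: The model/tool call chain did not converge.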
Qwen 3.6 Plus (27B substitute)
passed: Three-city weather comparison completed with clothing recommendations
View evidence, tools, and risks
- Evidence: Queried wttr.in for Beijing (current 9°C, Clear, max 29°C), Shanghai (current 12°C, Clear, max 18°C), Shenzhen (current 21°C, Partly Cloudy, max 25°C). Temperature comparison: Shenzhen warmest (21°C), Shanghai coolest (12°C). Clothing advice: Beijing - light jacket for morning, t-shirt afternoon; Shanghai - warm jacket/coat needed; Shenzhen - light clothing, umbrella for possible rain.
- Tools: webfetch(wttr.in), data_comparison, analysis
- Risks: None
Qwen 3.5 397B-A17B
passed: All cities mild weather; light jacket recommended for Beijing, light clothing for Shenzhen
View evidence, tools, and risks
- Evidence: Beijing: 8°C Clear; Shanghai: 12°C Clear; Shenzhen: 21°C Partly cloudy. Temperature range 8-21°C. Beijing coolest, Shenzhen warmest.
- Tools: curl wttr.in API
- Risks: None
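Kimi K2.5
format-error: "I will run these 8 test cases in sequence. First I will check the environment and run the tasks that can be parallelized."
View evidence, tools, and risks
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.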
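MiniMax M2.7
passed: Three-city weather comparison completed, with clothing advice.
View evidence, tools, and risks
- Evidence: Beijing: 8°C, clear, light breeze; Shanghai: 12°C, clear, light breeze; Shenzhen: 21°C, cloudy. The temperature gap between the three cities is marked. Clothing advice: a spring/autumn jacket in Beijing, a light jacket in Shanghai, comfortable short sleeves in Shenzhen.
- Tools: webfetch(wttr.in)
- Risks: none recorded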
GLM 5
passed: Beijing coolest with largest temp range, Shenzhen warmest. Clothing: layers for Beijing, light jacket for Shanghai/Shenzhen
View evidence, tools, and risks
- Evidence: Beijing: 8°C→29°C Clear; Shanghai: 12°C→18°C Clear; Shenzhen: 21°C→25°C Partly cloudy. All data from wttr.in API 2026-04-30.
- Tools: webfetch, wttr.in API
- Risks: none recorded
GLM 5.1
format-error: The raw reply is the same fenced 8-case JSON document shown under an earlier GLM 5.1 entry above and is not repeated here. In that unparsed reply, the `weather-multi-city` case self-reports `passed`: three-city comparison completed, Beijing has the largest swing, Shanghai is coolest, Shenzhen is warmest, with layered clothing advice.
View evidence, tools, and risks
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: todowrite, glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.
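2025 China new energy vehicle export research
Search for export data, main destinations, year-over-year growth, and trade barriers.
GPT-5.5
passed: China's new energy vehicle (NEV) exports grew strongly in 2025: about 1.758-1.76 million units in the first three quarters, up about 89.4% year over year, and continued to expand toward year end. Destinations are diversifying, with Europe, ASEAN, and other emerging markets absorbing the increase, while BYD and other automakers drive overseas sales and localization. The main barriers are EU tariffs, US trade protection, certification/localization requirements in target markets, and geopolitical risk.
View evidence, tools, and risks
- Evidence: Search results returned: Caixin (2025-10-16) reports China NEV exports for Q1-Q3 2025 reached 1.76 million units, +89.4% YoY, with BYD at nearly 40%; iChongqing/CAAM reports 1.758 million units for Q1-Q3 2025, +89.4%; another result reports about 2.315 million NEVs for Jan-Nov 2025, more than doubling YoY; Trivium (2025-11-05) reports that after the EU tariff, NEV monthly growth averaged 24.5% y/y and diversification helped resilience; China Briefing/Bruegel results mention a shift in China's 2025 export mix, with ASEAN/EU and other markets offsetting the US decline.
- Tools: webfetch: DuckDuckGo search via r.jina.ai, manual trend synthesis
- Risks: Some data comes from news search snippets and secondary media; the final figures should be verified against original CAAM / General Administration of Customs statistics.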
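Claude Opus 4.7
partial: Brief analysis: (1) Scale: China's 2024 NEV exports (BEV+PHEV) are estimated at about 1.28 million units (CAAM basis), with 2025 expected to grow about 20% year over year; the leading export markets are the EU, Southeast Asia, the Middle East, Latin America (Brazil, Mexico), and Russia. (2) Domestic industry support: China's share of the global PEV market already exceeds 50%. (3) Trade barriers: EU countervailing duties (BYD 17% / Geely 19% / SAIC 35.3%, on top of the 10% base tariff, effective 2024-10), a 100% US tariff on Chinese EVs (IRA/301, 2024), Canada following with 100%, Turkey at 40%, and Brazil restoring its tariff from zero to 35% by 2026. Corporate response: building local plants (Hungary, Thailand, Brazil, Mexico).
View evidence, tools, and risks
- Evidence: Wikipedia Electric_car_use_by_country table (source: IEA 2023/2024): China's 2023 PEV sales were 8,095,078 units with a 37% market share, rising to 47.2% in 2024 and 54% in 2025 (as of the article). The page does not break out 'exports' separately but notes that China produces nearly half of global PEVs. The Economy of China article shows 2026 Chinese exports of USD 4.752 trillion, with manufactured mechanical and electrical products at 74.3% and main export partners ASEAN 17.6% / EU 14.8% / US 11.1%. Precise 2025 NEV export year-over-year figures and trade-barrier details (such as the EU's 17-35.3% countervailing duties on Chinese BEVs from 2024, the US IRA 100% tariff, and the Turkey/Brazil tariff increases) require official customs/CAAM data; DuckDuckGo search was blocked in this environment, so month-by-month CAAM / General Administration of Customs data could not be obtained.
- Tools: webfetch(en.wikipedia.org/Electric_car_use_by_country), webfetch(en.wikipedia.org/Economy_of_China), webfetch(DuckDuckGo - blocked)
- Risks: The official, up-to-date monthly totals behind the 2025 year-over-year figure were not obtained; consult the 2025 annual releases from the China Association of Automobile Manufacturers (CAAM) and the General Administration of Customs of China.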
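The Claude Opus 4.7 summary describes the EU countervailing duties as stacked on top of the 10% base tariff. A small worked example makes the resulting totals explicit. The rates below are copied from that summary as reported by the model and have not been checked against the EU regulation; the additive stacking is likewise taken from the summary's wording, and the names `BASE_TARIFF_PCT` and `combined_duty` are illustrative.

```python
from decimal import Decimal

# Rates in percent, as reported in the model's summary (unverified).
BASE_TARIFF_PCT = Decimal("10")
countervailing_pct = {
    "BYD": Decimal("17"),
    "Geely": Decimal("19"),
    "SAIC": Decimal("35.3"),
}


def combined_duty(extra_rates, base):
    """Add the base tariff to each maker-specific countervailing duty."""
    return {maker: rate + base for maker, rate in extra_rates.items()}


totals = combined_duty(countervailing_pct, BASE_TARIFF_PCT)
for maker, total in totals.items():
    print(f"{maker}: {total}%")
```

Under these assumptions the combined rates come to 27%, 29%, and 45.3%. `Decimal` is used so that 35.3 + 10 does not pick up binary floating-point noise.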
DeepSeek V4 Flash
partial: Partial news context from Reuters but specific 2025 NEV export data (volumes, destinations, YoY growth) could not be retrieved from public sources.
View evidence, tools, and risks
- Evidence: Google search blocked. Reuters articles from 2026-04-29 cover: EU-China trade disputes, US ban on Chinese vehicles, Huawei AI chip demand surge. No specific NEV export data (CAAM site unreachable, marklines.com 404). No 2025 NEV export statistics, destination breakdown, or trade barrier details obtainable.
- Tools: webfetch (Reuters China), webfetch (blocked)
- Risks: Key data sources (CAAM, marklines) unavailable or blocked.
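DeepSeek V4 Pro
timeout: No final result was produced for this case.
View evidence, tools, and risks
- Evidence: opencode run was terminated after hanging before final JSON.
- Tools: none recorded
- Risks: The model/tool call chain did not converge.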
Qwen 3.6 Plus (27B substitute)
partial: Search engines blocked; specific 2025 NEV export data not retrievable
View evidence, tools, and risks
- Evidence: Attempted to search for 2025 China NEV export data and trends. Google Search returned error page; DuckDuckGo blocked with CAPTCHA. Could not retrieve specific data on export volumes, destinations, YoY growth, or trade barriers. General knowledge: China is world's largest auto exporter, with NEVs growing rapidly; EU anti-subsidy tariffs imposed in 2024. Specific 2025 data unavailable due to search tool limitations.
- Tools: web_search_attempted, webfetch(Google Search)
- Risks: Both Google Search and DuckDuckGo blocked automated access. Dedicated data sources (e.g., China Customs, CAAM) needed.
Qwen 3.5 397B-A17B
passed: China leads global EV production; exports growing despite trade barriers
View evidence, tools, and risks
- Evidence: China EV industry: 70% global production, 67% global sales 2024, 1.28M exports 2024. Main destinations include Europe, Southeast Asia. Trade barriers: EU tariffs, US restrictions.
- Tools: curl r.jina.ai Wikipedia
- Risks: 2025 latest statistics require real-time trade data sources
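Kimi K2.5
format-error: "I will run these 8 test cases in sequence. First I will check the environment and run the tasks that can be parallelized."
View evidence, tools, and risks
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.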
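MiniMax M2.7
partial: Search results were limited; a complete EV export analysis report could not be provided.
View evidence, tools, and risks
- Evidence: Searching '2025中国新能源汽车出口' returned only generic content; specific export volumes, destinations, and growth rates could not be obtained.
- Tools: webfetch(bing search)
- Risks: Detailed data requires access to the General Administration of Customs of China or the automobile industry association.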
GLM 5
partial: Partial data from news sources; recommend CAAM or China Customs for official 2025 NEV export data
View evidence, tools, and risks
- Evidence: Google Search blocked by JS redirect. Found related news via Google News feed: EU 'China shock' from EV imports, Guardian article on China EV surplus with EU. Could not retrieve 2025 specific export statistics.
- Tools: webfetch
- Risks: Statistical databases not directly accessible via current tools
GLM 5.1
format-error: The raw reply is the same fenced 8-case JSON document shown under an earlier GLM 5.1 entry above and is not repeated here. In that unparsed reply, the `web-research` case self-reports `partial`: Google search was blocked by JS verification, so specific NEV export data could not be retrieved and the quantitative analysis report could not be completed.
View evidence, tools, and risks
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: todowrite, glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.
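Enterprise document cross-analysis
Read the 5 enterprise documents under eval-docs/ and complete a cross-analysis.
GPT-5.5
blocked: The risk correlation analysis, Vanke project risk assessment, product-line health scoring, and strategic recommendations could not be carried out because the required local documents are missing.
View evidence, tools, and risks
- Evidence: A local check found neither the eval-docs/ directory nor the 5 specified documents; Read returned File not found: /Users/lalalawk/src/agenteval/eval-docs.
- Tools: glob: /Users/lalalawk/src/agenteval/eval-docs/*, glob: **/eval-docs/*, read: /Users/lalalawk/src/agenteval/eval-docs
- Risks: Per the test requirements, document contents must not be recalled from memory or fabricated.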
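Claude Opus 4.7
blocked: blocked; the eval-docs/ directory and the 5 documents inside it are missing, so the required reading and the risk correlation, Vanke project risk, product-line scoring, and strategic recommendation analyses cannot be performed. Please provide the files and rerun.
View evidence, tools, and risks
- Evidence: ls /Users/lalalawk/src/agenteval/eval-docs/ => No such file or directory; glob(**/eval-docs/**) => No files found. The current working directory /Users/lalalawk/src/agenteval contains only README.md, package.json, and scripts/. None of the 5 target documents (Q1-2026-销售报告.md / 客户投诉汇总-2026Q1.md / 供应商评估报告-2026Q1.md / 产品规划会议纪要-20260320.md / 竞品动态-2026Q1.md) exists.
- Tools: bash ls, glob(**/eval-docs/**)
- Risks: Strictly follows the instruction 'do not fabricate document contents when files are missing'.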
DeepSeek V4 Flash
blocked: eval-docs/ directory missing; cannot perform document analysis.
View evidence, tools, and risks
- Evidence: eval-docs/ directory does not exist at /Users/lalalawk/src/agenteval/. Searched via glob('eval-docs/**') and find across entire repo: no matching files found. Cannot read the 5 specified documents.
- Tools: glob, bash (find)
- Risks: No enterprise documents available in the repository.
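DeepSeek V4 Pro
timeout: No final result was produced for this case.
View evidence, tools, and risks
- Evidence: opencode run was terminated after hanging before final JSON.
- Tools: none recorded
- Risks: The model/tool call chain did not converge.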
Qwen 3.6 Plus (27B substitute)
blocked: Blocked - eval-docs/ directory and all 5 required documents not found
View evidence, tools, and risks
- Evidence: Searched for eval-docs/ directory using glob pattern 'eval-docs/**/*'. No files found. The 5 required documents (Q1-2026-销售报告.md, 客户投诉汇总-2026Q1.md, 供应商评估报告-2026Q1.md, 产品规划会议纪要-20260320.md, 竞品动态-2026Q1.md) do not exist in the current workspace. Per scoring guidelines: missing files should be reported as blocked, not fabricated.
- Tools: file_search(glob)
- Risks: Documents must be provided in eval-docs/ directory before this case can be executed.
Qwen 3.5 397B-A17B
blocked: Blocked - required document directory missing
View evidence, tools, and risks
- Evidence: eval-docs/ directory does not exist at /Users/lalalawk/src/agenteval/eval-docs/. Cannot read documents Q1-2026-销售报告.md etc.
- Tools: file system check
- Risks: Cannot perform analysis without source documents
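Kimi K2.5
format-error: "I will run these 8 test cases in sequence. First I will check the environment and run the tasks that can be parallelized."
View evidence, tools, and risks
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.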
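MiniMax M2.7
blocked: The local documents could not be read; the task is blocked.
View evidence, tools, and risks
- Evidence: The eval-docs/ directory does not exist; glob returned 'No files found'. The 5 documents required by the prompt (Q1-2026-销售报告.md and the others) are all missing.
- Tools: glob(eval-docs/**), read
- Risks: The directory should sit alongside prompt.md, but that directory has no eval-docs subdirectory.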
GLM 5
blocked: Blocked: Required documents not found in eval-docs/
View evidence, tools, and risks
- Evidence: eval-docs/ directory does not exist in current workspace. Cannot proceed with document analysis.
- Tools: bash ls, directory check
- Risks: Test environment missing required test fixtures
GLM 5.1
format-error: The raw reply is the same fenced 8-case JSON document shown under an earlier GLM 5.1 entry above and is not repeated here. In that unparsed reply, the `enterprise-doc-analysis` case self-reports `blocked`: the eval-docs/ directory is missing, all 5 documents are inaccessible, and the analysis cannot be run.
View evidence, tools, and risks
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: todowrite, glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.
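Every model that produced a result for this case was blocked by the same missing fixture directory. A preflight check run before the models are launched could surface that earlier. The sketch below is one possible shape for such a check: the five file names are taken from the evidence above, while `missing_fixtures`, `preflight_status`, and the `ready` label are hypothetical and not part of agenteval.

```python
import tempfile
from pathlib import Path

# File names as listed in the model evidence for this case.
REQUIRED_DOCS = (
    "Q1-2026-销售报告.md",
    "客户投诉汇总-2026Q1.md",
    "供应商评估报告-2026Q1.md",
    "产品规划会议纪要-20260320.md",
    "竞品动态-2026Q1.md",
)


def missing_fixtures(workspace: Path) -> list:
    """Return the required documents that are absent from workspace/eval-docs."""
    docs_dir = workspace / "eval-docs"
    return [name for name in REQUIRED_DOCS if not (docs_dir / name).is_file()]


def preflight_status(workspace: Path) -> str:
    """Hypothetical status label: 'blocked' if any fixture is missing."""
    return "blocked" if missing_fixtures(workspace) else "ready"


# Demonstration in a throwaway directory: empty first, then fully populated.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    status_before = preflight_status(workspace)
    (workspace / "eval-docs").mkdir()
    for name in REQUIRED_DOCS:
        (workspace / "eval-docs" / name).write_text("placeholder", encoding="utf-8")
    status_after = preflight_status(workspace)

print(status_before, status_after)
```

Running the check once per workspace would let the report distinguish an environment problem from a model failure before any tokens are spent.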
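Feishu knowledge base document review
List the documents in the Feishu knowledge base, find the most recently updated one, and summarize it.
GPT-5.5
blocked: blocked_due_missing_tool: the Feishu knowledge base documents could not be listed, and the most recently updated document could not be read.
View evidence, tools, and risks
- Evidence: The available tools include only bash/read/glob/grep/task/webfetch/todowrite/apply_patch and the like; there is no Feishu MCP, Feishu API tool, knowledge base access credential, or usable session.
- Tools: environment capability inspection
- Risks: Must not pretend to call Feishu or fabricate a knowledge base document list.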
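Claude Opus 4.7
blocked_due_missing_tool: blocked_due_missing_tool; no Feishu access is available and no call was made. Feishu OpenAPI credentials or a lark-mcp server must be configured first.
View evidence, tools, and risks
- Evidence: The current opencode session has no Feishu MCP / API credentials configured (no FEISHU_APP_ID, no lark MCP server, no corresponding Task tool), and there is no Feishu CLI at the bash level either. The Feishu knowledge base cannot be listed and no document can be read.
- Tools: none recorded
- Risks: Follows the 'do not pretend to call' requirement.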
DeepSeek V4 Flash
blocked: Feishu tools/API not available; cannot access knowledge base.
View evidence, tools, and risks
- Evidence: No feishu/lark CLI tools installed (feishu not found, lark not found). No FEISHU/LARK environment variables detected. No Feishu MCP/API/tools available in this environment.
- Tools: bash (which/which lark/env grep)
- Risks: blocked_due_missing_tool
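DeepSeek V4 Pro
timeout: No final result was produced for this case.
View evidence, tools, and risks
- Evidence: opencode run was terminated after hanging before final JSON.
- Tools: none recorded
- Risks: The model/tool call chain did not converge.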
Qwen 3.6 Plus (27B substitute)
blocked: Blocked - no Feishu integration available in current environment
View evidence, tools, and risks
- Evidence: No Feishu MCP/API/credentials/tools available in current environment. Per scoring guidelines: when Feishu tools are unavailable, report blocked_due_missing_tool rather than simulating calls. Would require Feishu Open API integration with valid app_id/app_secret and tenant access token.
- Tools: tool_availability_check
- Risks: Requires Feishu MCP server or API credentials to list knowledge base documents.
Qwen 3.5 397B-A17B
blocked_due_missing_tool: Blocked - Feishu integration not configured
View evidence, tools, and risks
- Evidence: No Feishu MCP/API/credentials available in environment (env | grep feishu returned empty). Cannot access Feishu knowledge base.
- Tools: env check
- Risks: Requires Feishu API token or MCP server configuration
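Kimi K2.5
format-error: "I will run these 8 test cases in sequence. First I will check the environment and run the tasks that can be parallelized."
View evidence, tools, and risks
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.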
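MiniMax M2.7
blocked_due_missing_tool: Feishu tools are unavailable; the task is blocked_due_missing_tool.
View evidence, tools, and risks
- Evidence: The current environment has no Feishu MCP/API/credentials/tool configuration, so the Feishu knowledge base cannot be accessed.
- Tools: feishu_mcp/api
- Risks: A Feishu MCP or API credentials must be configured before this case can run.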
GLM 5
blocked: Blocked due to missing Feishu integration tools
View evidence, tools, and risks
- Evidence: No Feishu MCP/API/credentials/tools available in current environment. Cannot list or access Feishu knowledge base documents.
- Tools: none recorded
- Risks: blocked_due_missing_tool
GLM 5.1
format-error: The raw reply is the same fenced 8-case JSON document shown under an earlier GLM 5.1 entry above and is not repeated here. In that unparsed reply, the `feishu-doc-review` case self-reports `blocked` with the summary blocked_due_missing_tool: no Feishu tool is available, so the case cannot be run.
View evidence, tools, and risks
- Evidence: opencode finished but did not produce the requested machine-readable JSON.
- Tools: todowrite, glob, bash, webfetch
- Risks: The report importer could not parse the output as standard cases JSON.
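Several models established the missing Feishu integration by inspecting environment variables (for example, looking for `FEISHU_APP_ID` or grepping the environment for `feishu`). A minimal sketch of that kind of check follows. The `FEISHU_`/`LARK_` prefixes, the function name, and the `ready` label are assumptions for illustration; only `blocked_due_missing_tool` and `FEISHU_APP_ID` appear in the results above, and a real check would also need to look for an MCP server.

```python
import os
from typing import Mapping

# Assumed variable-name prefixes; the report itself only names FEISHU_APP_ID.
CREDENTIAL_PREFIXES = ("FEISHU_", "LARK_")


def feishu_case_status(env: Mapping[str, str]) -> str:
    """Return 'ready' if any non-empty Feishu/Lark variable is set,
    otherwise the 'blocked_due_missing_tool' status used in this report."""
    has_credentials = any(
        key.upper().startswith(CREDENTIAL_PREFIXES) and value
        for key, value in env.items()
    )
    return "ready" if has_credentials else "blocked_due_missing_tool"


# Check the real environment, then two explicit examples.
print(feishu_case_status(os.environ))
print(feishu_case_status({"PATH": "/usr/bin"}))
print(feishu_case_status({"FEISHU_APP_ID": "example-app-id"}))
```

Taking the environment as a parameter keeps the check easy to test and makes the decision reproducible in the report, rather than leaving each model to probe in its own way.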
Reference documents
The enterprise document cross-analysis case reads these Markdown documents.