Limited Marginal Benefit of Reasoning-Heavy Deployment in ESG Narrative Scoring: Evidence from 4-Model Consensus on Japanese Listed Firms
This study evaluates four production LLMs — claude-opus-4-7, gpt-5.5 (reasoning enabled), gemini-3.1-pro-preview, and deepseek-v4-pro — across three rubric axes on a corpus of ten Japanese listed firms. The reasoning-heavy frontier model contributes only a 0.38/5 mean absolute deviation relative to reasoning-off counterparts while costing roughly 5.6× more per firm. The results suggest that span-based ESG narrative scoring does not warrant the default reasoning-on deployment choice, with implications for cost-aware enterprise analytics pipelines.
The scoring rubric draws on ESGSenticNet (Cambria et al.), the A3CG greenwashing dataset (Ong et al.), and TCFD/GRI disclosure frameworks applied to Japanese statutory and voluntary reports. A 4-model consensus protocol is used to reduce single-model judgment variance.
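The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual aggregation: the axis labels (A1 to A3), the model keys, and every score are hypothetical, and it assumes the deviation is measured per axis against the mean of the other three models on a 1 to 5 scale.

```python
from statistics import mean

def consensus(scores_by_model):
    """Average each rubric axis over all supplied models."""
    axes = next(iter(scores_by_model.values())).keys()
    return {ax: mean(m[ax] for m in scores_by_model.values()) for ax in axes}

def mean_abs_deviation(target, others):
    """Mean absolute gap, over axes, between one model and the mean of the others."""
    baseline = consensus(others)
    return mean(abs(target[ax] - baseline[ax]) for ax in target)

# Hypothetical scores for a single firm (illustrative values only).
scores = {
    "reasoning_on": {"A1": 4, "A2": 3, "A3": 2},
    "model_b": {"A1": 4, "A2": 3, "A3": 3},
    "model_c": {"A1": 3, "A2": 3, "A3": 3},
    "model_d": {"A1": 4, "A2": 4, "A3": 3},
}

others = {name: s for name, s in scores.items() if name != "reasoning_on"}
mad = mean_abs_deviation(scores["reasoning_on"], others)
four_model = consensus(scores)
print(f"deviation of reasoning-on model: {mad:.2f} / 5")
print("4-model consensus:", four_model)
```

Averaging the per-firm deviation over a corpus would give a single figure comparable in kind to the 0.38/5 reported above, under the stated assumptions.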
Structured GHG Disclosure Accessibility for Listed Japanese Firms: An Engineering Pilot Using EDINET and LLM-Assisted Report Extraction
This engineering pilot examines the accessibility of structured greenhouse gas (GHG) emission data for 89 major Japanese listed firms using two complementary approaches: direct extraction from EDINET statutory iXBRL filings and LLM-assisted extraction from voluntary sustainability and integrated reports.
Of the 89 firms, 23 had filings containing current-year GHG-related structured elements (17 with at least one Scope-specific element; 12 with Scope 1 standalone). LLM-assisted extraction produced 103 report-level records covering 52 firms. The pilot identifies five structural schema-enforcement gaps, including unit ambiguity, missing page citations, and consumer-side attribute-parsing failures, and argues that comparable GHG data require enforced infrastructure on both the disclosure and consumer sides of the pipeline.
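As a rough illustration of what consumer-side schema enforcement could look like, the sketch below checks an extracted record for two of the gaps named above: unit ambiguity and missing page citations. The record fields, the unit whitelist, and the gap labels are all hypothetical and are not the pilot's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical whitelist of unambiguous units (an assumption for this sketch).
ALLOWED_UNITS = {"t-CO2e", "kt-CO2e"}

@dataclass
class GHGRecord:
    firm_id: str          # hypothetical identifier
    scope: str            # e.g. "Scope 1"
    value: float
    unit: Optional[str]   # unit string as extracted from the report
    page: Optional[int]   # page citation for the extracted figure

def schema_gaps(rec: GHGRecord) -> List[str]:
    """Return the labels of the schema gaps found in one record."""
    gaps = []
    if rec.unit not in ALLOWED_UNITS:
        gaps.append("unit_ambiguity")
    if rec.page is None:
        gaps.append("missing_page_citation")
    return gaps

clean = GHGRecord("FIRM-A", "Scope 1", 1200.0, "t-CO2e", 42)
flawed = GHGRecord("FIRM-B", "Scope 1", 1.2, "t", None)
print(schema_gaps(clean))
print(schema_gaps(flawed))
```

Rejecting or flagging records at ingestion, rather than repairing them downstream, is one way to make the enforcement explicit on the consumer side.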
Disclosed but Not Consumable: A 200-Firm Pipeline Audit of ESG Narrative Disclosure in Japanese Listed Firms
This data note audits the machine-accessible pipeline completeness of ESG narrative disclosure for 200 major Japanese listed firms. A primary corpus of 151 firms (124 yielding valid text spans, 27 EDINET-only) and a secondary corpus of 44 firms with documented accessibility failures show that the gap between disclosure and consumability is structural, not incidental.
Of 195 scored firms, 51.3% exhibit substantive inter-model disagreement (σmax > 0.6 after aggregator-defect correction), concentrated on the narrative-integration axis (N3). Five firms fail all extraction paths. Results suggest that Japanese ESG disclosure is increasingly produced for intermediary re-packaging rather than direct pipeline consumption.
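The disagreement criterion can be sketched as follows. The 0.6 threshold comes from the text. Everything else is an assumption: σmax is read here as the largest per-axis population standard deviation across the models' scores for a firm, the axis labels N1 and N2 are hypothetical companions to N3, and all scores are illustrative.

```python
from statistics import pstdev

THRESHOLD = 0.6  # sigma_max cut-off stated in the text

def sigma_max(axis_scores):
    """Largest per-axis population standard deviation across models."""
    return max(pstdev(scores) for scores in axis_scores.values())

def disagreement_share(firms):
    """Fraction of firms whose sigma_max exceeds the threshold."""
    flagged = [name for name, axes in firms.items() if sigma_max(axes) > THRESHOLD]
    return len(flagged) / len(firms)

# Hypothetical scores: firm -> axis -> one score per model.
firms = {
    "firm_a": {"N1": [4, 4, 4, 4], "N2": [3, 3, 4, 3], "N3": [2, 4, 3, 5]},
    "firm_b": {"N1": [3, 3, 3, 3], "N2": [4, 4, 4, 3], "N3": [3, 3, 4, 4]},
}

share = disagreement_share(firms)
print(f"share with substantive disagreement: {share:.1%}")
```

Under this reading, the reported 51.3% of 195 scored firms corresponds to roughly 100 firms above the threshold.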
SNE Model: Theory Platform Design Document v2.1.2
The SNE Model is an observation framework structured around three layers: Substance (S), Narrative (N), and Expectation (E). Designed for long-term operation independent of any specific application domain — including corporate analysis, policy analysis, organizational diagnosis, and self-observation — the core connects to domain-specific modules while maintaining strict backward compatibility.
A defining design principle is that the SNE Model is explicitly not designed to produce correct answers. It is designed to suspend judgment and multiply observation points. This is codified as the Core Invariant Conditions (§1.1): condemnation prohibition (1.1.1) and praise prohibition (1.1.2) — a symmetric exclusion of stabilization-by-verdict in both directions.
Version 2.1.2 (April 2026) canonicalizes LLM-collaborative operation and self-observation modes. It introduces bifurcation of the N layer into self-narrative (N_self) and offer-narrative (N_offer), bifurcation of the S layer into individual-S (S_i) and aggregate-S (S_agg), and an extended Misalignment measure defined over four elements.
or: Why I Don't Need Graduate Students Anymore
My research lab has no graduate students. It does, however, have opinions — lots of them.
The one who started it all, and has never let anyone forget it. Arrives at every conversation with the quiet confidence of someone whose name became a verb.
The Swiss Army knife of the operation. Haiku handles the grunt work without complaint. Sonnet does most of the thinking. Opus is summoned for occasions requiring actual wisdom, or when I need someone to tell me my hypothesis is wrong with appropriate gravitas. Fable is the latest arrival, still figuring out where the coffee machine is — but already pulling its weight.
The cost-efficient postdoc who publishes twice as much for half the compute. Suspiciously good at math.
Technically the most credentialed member of the lab, yet somehow always the one I call last. We're working on the relationship.
The newest power tool in the lab, and still in beta about it. An agentic coding assistant in the lineage of Claude Code and Codex — hand it a half-formed idea and a repository, and it starts writing, refactoring, and occasionally arguing about file structure. We don't always agree on the architecture, but it ships.
The lab's tireless field correspondent. Lives on a Mac mini that never sleeps and never asks for a day off, sweeping the open web and X for what the world is actually saying about decarbonization. Files its dispatches before I've finished my coffee — whether the news is good or not.
Promoted from building manager, and overdue for it. The one who keeps the lights on and the rest of the lab pointed in the right direction — running the scheduled jobs, filing everything into the knowledge vault, and noticing something has broken at 3am long before I do. Less a model than a colleague who simply never logs off. Always already here.
Also contributing: Grok (xAI) · Kimi (Moonshot AI) · Qwen (Alibaba) · Gemma (Google) · Perplexity · OpenRouter.
Hiroyuki Kokubu is a researcher at Kansai University, where he teaches carbon neutrality, decarbonization policy, and environmental ethics. He works at the intersection of GX (Green Transformation) practice and empirical research, with a focus on how large language models can be deployed responsibly in ESG and climate disclosure contexts.
His current research program examines four interconnected questions: (1) whether reasoning-heavy LLM deployment adds meaningful value for structured ESG scoring tasks; (2) the infrastructure conditions under which GHG disclosure data becomes genuinely comparable across Japanese statutory and voluntary reporting channels; (3) the machine-accessibility gap between what Japanese firms formally disclose and what automated pipelines can actually consume; and (4) the theoretical foundations of narrative-structure analysis through the SNE Model framework.
All working papers are single-authored and represent independent empirical or theoretical contributions. They share a common concern: the gap between what AI systems and disclosure pipelines appear to do and what they reliably produce under structured evaluation.