KenyaESG: A Sentence-level ESG Disclosure Classification Dataset for Nairobi Securities Exchange Corporate Reports


Joseph Agossa

Zenodo (CERN European Organization for Nuclear Research) · Dataset · 2026-06-09 · #AI×ESG · Origin: Global · Management impact: procurement risk · Target sector: cross_sector

DOI: 10.5281/zenodo.20608236

Source: https://doi.org/10.5281/zenodo.20608236

🤖 gxceed AI summary

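Japanese (translated)

This paper introduces KenyaESG, a sentence-level ESG disclosure classification dataset built from the annual, integrated, and sustainability reports (2010-2024) of companies listed on the Nairobi Securities Exchange. For each of the three pillars (Environmental, Social, Governance), it provides a fine-tuning set of 3,900 sentences and an evaluation set of 100 sentences, with high inter-annotator agreement on the evaluation sets. Training labels were assigned by a keyword-based filter followed by a single-reviewer pass, and the corpus retains real-world difficulties such as OCR degradation and Swahili text.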

English

KenyaESG is a sentence-level ESG disclosure classification dataset built from annual, integrated, and sustainability reports of firms listed on the Nairobi Securities Exchange (2010-2024). For each of the three pillars (Environmental, Social, Governance), it provides a fine-tuning set of 3,900 sentences and a disjoint, held-out human evaluation set of 100 sentences with high inter-annotator agreement (Cohen's kappa of 0.94, 0.86, and 0.88, respectively). Training labels for the Kenya-sourced sentences were generated by keyword filtering and a single-reviewer refinement pass, while the evaluation sets are fully human-annotated. The corpus is predominantly English but retains a small number of Swahili and OCR-degraded sentences, and company and personal names are masked.

Unofficial AI-generated summary based on the public title and abstract. Not an official translation.

📝 gxceed editorial commentary: Why this matters

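In the Japanese GX context

Although this dataset targets the Kenyan market, it can be applied to analyzing ESG disclosures at the Asian and African sites of Japanese companies. The same sentence-level classification approach can also be used to analyze ESG statements in SSBJ-aligned disclosures and annual securities reports. Practical application, however, requires further work on label quality assurance and language coverage.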

In the global GX context

KenyaESG is a valuable resource for global ESG disclosure research, particularly for emerging markets. The sentence-level classification approach can be adapted to other jurisdictions, including TCFD/ISSB-aligned reports. The dataset's focus on disclosure intensity rather than quality, and its retention of Swahili and OCR-degraded text, offer methodological insights for practitioners building automated ESG analysis pipelines.
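One way to read "disclosure intensity" in such a pipeline is as the share of a report's sentences that a pillar classifier flags. The sketch below assumes that reading, which the source does not define; `disclosure_intensity`, its input format, and the toy predictions are all hypothetical.

```python
def disclosure_intensity(predictions):
    """Share of sentences flagged per pillar.

    `predictions` maps a pillar name to a list of 0/1 labels, one per
    sentence in a single report (an assumed input format).
    """
    intensity = {}
    for pillar, labels in predictions.items():
        # A report with no sentences has nothing to flag, so report 0.0.
        intensity[pillar] = sum(labels) / len(labels) if labels else 0.0
    return intensity


# Toy per-sentence predictions for one hypothetical report.
report = {
    "environmental": [1, 0, 0, 1],
    "social": [0, 0, 1, 0],
    "governance": [1, 1, 1, 0],
}
scores = disclosure_intensity(report)
```

A high score under this reading means a pillar is mentioned often, not that the disclosure is substantive.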

👥 Implications by reader

🔬 Researchers: A benchmark dataset and fine-tuned classifiers for sentence-level ESG classification, applicable to corporate report analysis and cross-market comparisons.

🏢 Practitioners: Provides labeled data and companion fine-tuned classifiers that can be adapted for internal ESG disclosure monitoring or due diligence in emerging markets.

🏛 Policymakers: Illustrates how AI can support ESG disclosure enforcement and consistency, relevant for regulators considering mandatory ESG reporting standards in developing economies.

📄 Abstract (original)

KenyaESG is a sentence-level dataset for three independent binary ESG classification tasks (Environmental, Social, Governance) built from the annual, integrated, and sustainability reports of firms listed on the Nairobi Securities Exchange (NSE), 2010–2024, combined with reference sentences from Schimanski et al. (2024). For each pillar, the dataset provides a fine-tuning set of 3,900 sentences and a disjoint held-out human evaluation set of 100 sentences carrying two independent annotators' judgments (Cohen's kappa: 0.94 Environmental, 0.86 Social, 0.88 Governance). Training and evaluation sentences are disjoint to prevent leakage; together they reconstruct the full 4,000-sentence pool per pillar. Training labels for the Kenya-sourced sentences were assigned by a keyword-based filter and refined by a single-reviewer pass (not full manual annotation); the human-annotated evaluation sets are the appropriate benchmark for performance estimates. The dataset captures disclosure intensity rather than disclosure quality. The corpus is predominantly English but retains a small number of Swahili and OCR-degraded sentences, reflecting the original NSE reports. Company names and personal names have been masked in the text field ([COMPANY] and [PERSON]) in line with the anonymisation principles of the Kenya Data Protection Act, 2019 (No. 24 of 2019) and Regulation (EU) 2016/679 (GDPR). Masking is semi-automatic and not guaranteed to be exhaustive; a small number of false-positive masks on common words (e.g. "equity", "equality", "Standard") were subsequently restored, as documented in the correction log. The dataset must not be used to attempt re-identification of individuals or firms. 
Files: three training sets (KenyaESG_train_environmental.xlsx, KenyaESG_train_social.xlsx, KenyaESG_train_governance.xlsx), three evaluation sets (KenyaESG_evaluation_environmental.xlsx, KenyaESG_evaluation_social.xlsx, KenyaESG_evaluation_governance.xlsx), the dataset card (KenyaESG_dataset_card.pdf), and the correction log (KenyaESG_correction_log.txt). The companion fine-tuned classifiers are available on the Hugging Face Hub: KenyaESG-RoBERTa-env (DOI 10.57967/hf/9126), -soc (10.57967/hf/9127), and -gov (10.57967/hf/9128). Licensed under Apache-2.0.
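As a minimal sketch of how the evaluation sets might be checked, the Python below computes Cohen's kappa between two annotators' binary labels and tests that training and evaluation sentences are disjoint. The function names and toy data are illustrative assumptions; the column names in the .xlsx files are not stated in the abstract and should be taken from the dataset card.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


def overlapping_sentences(train_texts, eval_texts):
    """Return sentences present in both splits (empty set means no leakage)."""
    return set(train_texts) & set(eval_texts)


# Toy data, not taken from KenyaESG.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
leaks = overlapping_sentences(
    ["[COMPANY] reduced water use.", "The board met quarterly."],
    ["[PERSON] chairs the audit committee."],
)
```

Exact string matching is the simplest possible leakage check; near-duplicates that differ only in whitespace or OCR noise would need normalization first.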

🔗 Provenance: sources where this record was discovered

openalex · https://doi.org/10.5281/zenodo.20608237 · first seen 2026-06-14 04:40:47 · last seen 2026-06-18 04:50:17


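gxceed is a research-support dataset based on public metadata. Summaries, translations, and commentary are generated with AI assistance. Final interpretation and verification are assumed to be carried out by the user against the original source material.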