KenyaESG: A Sentence-level ESG Disclosure Classification Dataset for Nairobi Securities Exchange Corporate Reports
Joseph Agossa
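🤖 gxceed AI summary
Japanese (translated)
A sentence-level ESG classification dataset built from the annual, integrated, and sustainability reports (2010–2024) of companies listed on the Nairobi Securities Exchange. For each of the Environmental, Social, and Governance pillars, it provides a fine-tuning set of 3,900 sentences and an evaluation set of 100 sentences, and classifiers are trained with RoBERTa models. Personal and company names are masked in line with Kenya's data protection law.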
English
KenyaESG is a sentence-level dataset for binary ESG classification (E, S, G) built from NSE-listed firms' reports (2010–2024), with 3,900 training and 100 human-evaluated sentences per pillar. Fine-tuned RoBERTa classifiers are provided. The dataset masks company and personal names per Kenyan and EU data protection laws.
Unofficial AI-generated summary based on the public title and abstract. Not an official translation.
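📝 gxceed editorial commentary: Why this matters
In the Japanese GX context
In Japan, ESG disclosure data is being built up in preparation for the SSBJ standards, but comparable efforts for emerging markets are rare. This dataset can serve as a benchmark for ESG disclosure analysis in African markets and may help Japanese companies with global operations understand local disclosure practices.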
In the global GX context
While global ESG disclosure datasets mostly cover developed markets, this Kenyan dataset fills a gap for emerging economies. It provides a benchmark for NLP-based ESG analysis in Africa, which is relevant for multinational investors and companies operating in the region.
👥 Implications by reader
🔬 Researchers: A valuable resource for NLP and ESG research, especially for understudied African markets; the dataset enables reproducibility and cross-market comparisons.
🏢 Practitioners: Can be used to build automated ESG screening tools for African portfolios or to benchmark disclosure quality against global standards.
🏛 Policymakers: Highlights the need for structured ESG data in emerging markets and may inform disclosure regulation design.
📄 Abstract (original)
KenyaESG is a sentence-level dataset for three independent binary ESG classification tasks (Environmental, Social, Governance) built from the annual, integrated, and sustainability reports of firms listed on the Nairobi Securities Exchange (NSE), 2010–2024, combined with reference sentences from Schimanski et al. (2024). For each pillar, the dataset provides a fine-tuning set of 3,900 sentences and a disjoint held-out human evaluation set of 100 sentences carrying two independent annotators' judgments (Cohen's kappa: 0.94 Environmental, 0.86 Social, 0.88 Governance). Training and evaluation sentences are disjoint to prevent leakage; together they reconstruct the full 4,000-sentence pool per pillar. Training labels for the Kenya-sourced sentences were assigned by a keyword-based filter and refined by a single-reviewer pass (not full manual annotation); the human-annotated evaluation sets are the appropriate benchmark for performance estimates. The dataset captures disclosure intensity rather than disclosure quality. The corpus is predominantly English but retains a small number of Swahili and OCR-degraded sentences, reflecting the original NSE reports. Company names and personal names have been masked in the text field ([COMPANY] and [PERSON]) in line with the anonymisation principles of the Kenya Data Protection Act, 2019 (No. 24 of 2019) and Regulation (EU) 2016/679 (GDPR). Masking is semi-automatic and not guaranteed to be exhaustive; a small number of false-positive masks on common words (e.g. "equity", "equality", "Standard") were subsequently restored, as documented in the correction log. The dataset must not be used to attempt re-identification of individuals or firms. 
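The inter-annotator agreement figures quoted in the abstract are Cohen's kappa values. As a minimal sketch of how such a figure is computed from two annotators' labels, the example below implements the standard formula (observed agreement corrected for chance agreement). The function name `cohen_kappa` and the ten toy labels are illustrative assumptions, not data from KenyaESG, and this is not necessarily the exact procedure the authors used.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy binary labels (1 = sentence belongs to the pillar); invented for illustration.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In the toy data the annotators agree on 9 of 10 items and chance agreement is 0.5, so kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8.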
Files: three training sets (KenyaESG_train_environmental.xlsx, KenyaESG_train_social.xlsx, KenyaESG_train_governance.xlsx), three evaluation sets (KenyaESG_evaluation_environmental.xlsx, KenyaESG_evaluation_social.xlsx, KenyaESG_evaluation_governance.xlsx), the dataset card (KenyaESG_dataset_card.pdf), and the correction log (KenyaESG_correction_log.txt). The companion fine-tuned classifiers are available on the Hugging Face Hub: KenyaESG-RoBERTa-env (DOI 10.57967/hf/9126), -soc (10.57967/hf/9127), and -gov (10.57967/hf/9128). Licensed under Apache-2.0.
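The abstract states that, for each pillar, the training and evaluation sentences are disjoint and together make up a 4,000-sentence pool. A minimal sketch of how a user might verify this after loading a training file and its matching evaluation file (for example with pandas `read_excel`) is shown below. The helper names `normalise` and `check_split`, the whitespace and case normalisation, and the toy sentences are all assumptions made for illustration; the actual column names in the spreadsheets should be taken from the dataset card.

```python
def normalise(sentence):
    # Collapse whitespace and lowercase so trivial variants count as duplicates.
    # This normalisation is an assumption, not the authors' documented method.
    return " ".join(sentence.split()).lower()


def check_split(train_sentences, eval_sentences):
    """Report split sizes and any sentences shared between the two sets."""
    train_set = {normalise(s) for s in train_sentences}
    eval_set = {normalise(s) for s in eval_sentences}
    return {
        "n_train": len(train_sentences),
        "n_eval": len(eval_sentences),
        "n_pool": len(train_sentences) + len(eval_sentences),
        "overlap": sorted(train_set & eval_set),
    }


# Toy sentences invented for illustration; real input would be the text
# column of one training file and of the matching evaluation file.
toy_train = ["[COMPANY] reduced water use.", "The board met four times."]
toy_eval = ["[PERSON] chairs the audit committee."]
report = check_split(toy_train, toy_eval)
```

On the real files, the abstract implies `n_train` of 3,900, `n_eval` of 100, `n_pool` of 4,000, and an empty `overlap` list for each pillar.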
🔗 Provenance: sources where this record was found
- openalex https://doi.org/10.5281/zenodo.20608237 · first seen 2026-06-14 04:40:56 · last seen 2026-06-14 04:41:29
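gxceed is a research-support dataset based on public metadata. Summaries, translations, and commentary are generated with AI assistance. Users are expected to carry out final interpretation and verification against the original source materials.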