Extracting Product Carbon Footprint in PDF Documents using Question Answering Framework

Japanese title (AI translation): 質問応答フレームワークを用いたPDF文書からの製品カーボンフットプリント抽出

Kaiwen Zhao, Bharathan Balaji, Stephen Lee

📚 Peer-reviewed / Journal | 2026-06-16 | #AI×ESG | Origin: US | Business impact: procurement risk | Target sector: cross_sector

Source: https://doi.org/10.1145/3744255.3798128

🤖 gxceed AI Summary

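Japanese (translated)

This paper proposes a question-answering framework for extracting carbon footprint information from product sustainability reports in PDF format. It builds CarbonPDF-QA, an open dataset covering 1,735 reports, and shows that CarbonPDF, a method obtained by fine-tuning Llama 3, outperforms existing methods. The method addresses GPT-4o's weakness on questions that involve data inconsistencies.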

English

This paper proposes an LLM-based question-answering framework for extracting product carbon footprint information from sustainability reports in PDF format. It introduces CarbonPDF-QA, an open-source dataset of question-answer pairs for 1,735 product report documents, with human-annotated answers. CarbonPDF, built by fine-tuning Llama 3, outperforms current state-of-the-art techniques, including QA systems fine-tuned on table and text data, and targets the questions involving data inconsistencies that GPT-4o struggles to answer.

Unofficial AI-generated summary based on the public title and abstract. Not an official translation.

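📝 gxceed Editorial Commentary: Why this matters

In the Japanese GX context

Japan is in the process of introducing the SSBJ disclosure standards, and automated extraction of product-level carbon footprint information contributes directly to streamlining corporate disclosure work. This method enables low-cost information extraction from unstructured PDFs and can also be applied to tracking emissions across the entire supply chain (Scope 3).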

In the global GX context

Globally, as sustainability reporting frameworks (TCFD, ISSB, CSRD) demand product-level carbon data, this method automates extraction from varied PDF formats. It addresses a key pain point for companies struggling with inconsistent reporting standards and large volumes of documents, and the open-source dataset facilitates further research.

👥 Implications by reader

🔬 Researchers: Provides a new benchmark dataset (CarbonPDF-QA) and a fine-tuned LLM approach for carbon footprint extraction from unstructured PDFs, enabling further work in automated ESG data extraction.

🏢 Practitioners: Describes a technique that could automate the extraction of product carbon footprint data from sustainability reports, reducing manual effort in disclosure preparation.

🏛 Policymakers: Highlights the need for standardized digital reporting formats and the potential of AI to improve data accessibility and comparability in sustainability disclosures.

📄 Abstract (original)

Product sustainability reports provide valuable insights into the environmental impacts of a product and are often distributed in PDF format. These reports often include a combination of tables and text, which complicates their analysis. The lack of standardization and the variability in reporting formats further exacerbate the difficulty of extracting and interpreting relevant information from large volumes of documents. In this paper, we tackle the challenge of answering questions related to carbon footprints within sustainability reports available in PDF format. Unlike previous approaches, our focus is on addressing the difficulties posed by the unstructured and inconsistent nature of text extracted from PDF parsing. To facilitate this analysis, we introduce CarbonPDF-QA, an open-source dataset containing question-answer pairs for 1,735 product report documents, along with human-annotated answers. Our analysis shows that GPT-4o struggles to answer questions with data inconsistencies. To address this limitation, we propose CarbonPDF, an LLM-based technique specifically designed to answer carbon footprint questions on such datasets. We develop CarbonPDF by fine-tuning Llama 3 with our training data. Our results show that our technique outperforms current state-of-the-art techniques, including question-answering (QA) systems finetuned on table and text data.
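Illustrative sketch (not from the paper): the abstract frames the task as answering carbon footprint questions over noisy text produced by PDF parsing, with a model fine-tuned on question-answer pairs. The Python below shows one way such a record might be turned into a prompt and a supervised training string. The `QAExample` fields, the prompt template, and the sample report text are all hypothetical; the abstract does not describe the actual CarbonPDF-QA schema or the prompts used to fine-tune Llama 3.

```python
import re
from dataclasses import dataclass


@dataclass
class QAExample:
    """Hypothetical question-answer record over text parsed from a PDF report."""
    context: str   # raw text as a PDF parser might emit it
    question: str  # carbon footprint question about the report
    answer: str    # human-annotated answer, kept as a string


def normalize_whitespace(text: str) -> str:
    """Collapse the stray line breaks and runs of spaces that PDF parsing leaves behind."""
    return re.sub(r"\s+", " ", text).strip()


def build_prompt(example: QAExample) -> str:
    """Combine report text and question into one prompt (template is an assumption)."""
    return (
        "Answer the question using only the report text.\n\n"
        f"Report text:\n{normalize_whitespace(example.context)}\n\n"
        f"Question: {example.question}\n"
        "Answer:"
    )


def to_training_text(example: QAExample) -> str:
    """Append the annotated answer to the prompt, as a supervised fine-tuning target."""
    return f"{build_prompt(example)} {example.answer}"


# Invented report text for illustration only; not taken from any real report.
example = QAExample(
    context="Total   carbon footprint\n 120 kg CO2e\nManufacturing  70%\nUse   30%",
    question="What is the total carbon footprint of the product?",
    answer="120 kg CO2e",
)

print(to_training_text(example))
```

At inference time only `build_prompt` would be used, with the model generating the text that follows "Answer:". Whitespace normalization is the only cleanup applied here; it does not resolve the data inconsistencies the abstract mentions.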

🔗 Provenance: sources where this record was found

openalex | https://doi.org/10.1145/3744255.3798128 | first seen 2026-06-23 05:40:07


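gxceed is a research-support dataset based on public metadata. Summaries, translations, and commentary are generated with AI assistance. Users are expected to base final interpretation and verification on the original source material.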