How gxceed builds and classifies the GX research corpus
This page documents the data sources, ingestion approach, SNE-axis classification logic, AI pipeline, and known limitations of the gxceed GX paper corpus. It is intended for researchers who want to assess reproducibility, coverage, and methodological transparency.
gxceed is designed as a low-cost open metadata pipeline for the GX research field. Rather than building a proprietary database, it aggregates metadata from open scholarly infrastructure — primarily DOI-based, OAI-PMH, and open REST APIs — and adds AI-assisted relevance scoring, bilingual summaries, and SNE-axis classification.
The core design principle: use what is already open, classify what matters for the research–implementation gap, and make the distribution of GX knowledge production legible without requiring access to paywalled databases.
gxceed collects from 13 open scholarly metadata sources. Each source has a dedicated resolver that handles authentication, pagination, and normalization into the internal paper schema.
| Source | Type | Coverage / Notes |
|---|---|---|
| arXiv | Preprint (EN) | cs.AI, eess.SY, physics — GX-adjacent fields |
| Jxiv | Preprint (JA) | Operated by JST (Japan Science and Technology Agency). Primary source for Japanese-origin GX preprints |
| Zenodo | Multi-type (EN) | CERN-operated repository. Includes datasets, reports, and preprints |
| SSRN | Preprint (EN) | Social science preprints. ESG investing, green finance, climate policy |
| EarthArXiv | Preprint (EN) | Earth sciences preprint server. Climate change, geothermal energy, environmental science |
| J-STAGE | Peer-reviewed (JA/EN) | JST — Japanese peer-reviewed journals. High-quality but often no English abstract |
| CiNii Research | Academic graph (JA) | NII (National Institute of Informatics). Japanese academic papers and university bulletins |
| Research Square | Preprint (EN) | Natural science and engineering preprints |
| OpenAlex | Metadata graph (EN) | Open academic knowledge graph. Used as a cross-source supplement and for CN affiliation detection |
| IEA | Policy report (EN) | International Energy Agency. World Energy Outlook, energy statistics |
| Carbon Brief | Policy media (EN) | Climate policy analysis and data journalism |
| Nature Energy | Peer-reviewed (EN) | High-impact energy research. Metadata only (abstract from OpenAlex where available) |
| ChinaRxiv | Preprint (ZH/EN) | Chinese Academy of Sciences preprint server. ~23,000 papers. REST API accessible from Japan |
GX relevance filtering is applied at collection time using keyword matching on title and abstract. This reduces ingest volume before AI scoring.
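A minimal sketch of such a collection-time filter follows. The keyword list and the function name `is_gx_candidate` are hypothetical; the actual gxceed keyword set and matching rules are not published, so treat this only as an illustration of substring matching on title and abstract.

```python
from typing import Optional, Sequence

# Hypothetical keyword list for illustration; not the real gxceed keyword set.
GX_KEYWORDS = (
    "green transformation",
    "decarbonization",
    "carbon neutral",
    "renewable energy",
    "脱炭素",
)


def is_gx_candidate(
    title: str,
    abstract: Optional[str],
    keywords: Sequence[str] = GX_KEYWORDS,
) -> bool:
    """Return True if any keyword occurs in the title or abstract.

    Matching is a case-insensitive substring test. A missing abstract
    is treated as empty text rather than as a reason to reject.
    """
    text = f"{title} {abstract or ''}".casefold()
    return any(kw.casefold() in text for kw in keywords)
```

Records that fail this check would be dropped before AI scoring, which is how the filter reduces ingest volume.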
Each paper is assigned a canonical_key — typically doi:10.xxx/yyy or arxiv:XXXX.XXXXX — which is used as the deduplication anchor across sources. Papers are not duplicated even when they appear in multiple sources (e.g., a paper on both arXiv and OpenAlex).
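The deduplication step can be sketched as follows. The normalization rules (lowercasing DOIs, stripping URL prefixes and arXiv version suffixes) and the record field names `doi` and `arxiv_id` are assumptions made for illustration, not the documented gxceed behavior.

```python
import re
from typing import Dict, Iterable, List, Optional


def canonical_key(doi: Optional[str] = None, arxiv_id: Optional[str] = None) -> Optional[str]:
    """Build a key of the form doi:... or arxiv:...; DOI is preferred when present."""
    if doi:
        d = doi.strip().lower()  # DOIs are case-insensitive
        for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
            if d.startswith(prefix):
                d = d[len(prefix):]
        return f"doi:{d}"
    if arxiv_id:
        a = re.sub(r"^arxiv:", "", arxiv_id.strip(), flags=re.IGNORECASE)
        a = re.sub(r"v\d+$", "", a)  # drop a trailing version suffix such as v2
        return f"arxiv:{a}"
    return None


def deduplicate(records: Iterable[dict]) -> List[dict]:
    """Keep the first record seen for each canonical key.

    Records with no usable identifier are kept as-is (assumption).
    """
    kept: Dict[str, dict] = {}
    unkeyed: List[dict] = []
    for rec in records:
        key = canonical_key(rec.get("doi"), rec.get("arxiv_id"))
        if key is None:
            unkeyed.append(rec)
        elif key not in kept:
            kept[key] = rec
    return list(kept.values()) + unkeyed
```

Note that this sketch does not reconcile a record known only by its arXiv ID with the same paper known only by its DOI; how gxceed handles that case is not stated in the source.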
The internal schema captures:
| Field | Description |
|---|---|
| title | Original title in source language |
| title_ja / title_en | AI-translated title (opposite language from original) |
| abstract | Original abstract from source (may be truncated by source) |
| ai_summary_ja / ai_summary_en | AI-generated bilingual editorial digest (200–400 characters each) |
| context_note_ja / context_note_en | Editorial framing for Japanese or global context |
| authors | Author names as provided by source (structured where available) |
| source | Originating resolver (arxiv, jxiv, zenodo, ssrn, etc.) |
| published_at | Publication or preprint date from source |
| collected_at | Timestamp when gxceed first retrieved the record |
| origin_country | Inferred country of origin (JP / US / EU / CN / Global / Unknown) |
| draft_score | GX relevance score 0–100 assigned by AI pipeline (published if ≥ 85) |

Papers with draft_score ≥ 85 are published to the public corpus. Papers below threshold are retained internally for potential manual review or threshold adjustment.
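As a rough illustration, the schema and the publication rule could be modeled like this. The field names come from the schema above; the Python types, the defaults, and the names `Paper`, `PUBLISH_THRESHOLD`, and `is_published` are assumptions, since the internal representation is not disclosed.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

PUBLISH_THRESHOLD = 85  # stated threshold: published if draft_score >= 85


@dataclass
class Paper:
    # Types and defaults below are illustrative assumptions.
    canonical_key: str
    title: str
    source: str
    collected_at: datetime
    title_ja: Optional[str] = None
    title_en: Optional[str] = None
    abstract: Optional[str] = None
    ai_summary_ja: Optional[str] = None
    ai_summary_en: Optional[str] = None
    context_note_ja: Optional[str] = None
    context_note_en: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    published_at: Optional[datetime] = None
    origin_country: str = "Unknown"
    draft_score: Optional[int] = None  # 0-100, None until scored

    def is_published(self) -> bool:
        """True only for scored papers at or above the threshold."""
        return self.draft_score is not None and self.draft_score >= PUBLISH_THRESHOLD
```

A record with draft_score 84 stays in the internal store under this rule, while one with 85 goes to the public corpus.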
Each published paper is assigned a set of binary SNE-axis flags based on AI-assisted classification of its title and abstract. The flags map to the five SNE axes:
| Flag(s) | SNE axis | Focus |
|---|---|---|
| is_measurement_focused | S₁ | Scientific measurement, empirical evidence, quantitative methods |
| is_policy_narrative_focused | N | Policy framing, regulatory targets, disclosure frameworks |
| is_outcome_focused | E | Outcome claims, impact projections, external expectations |
| is_implementation_focused + is_scalability_focused | S₂ | Implementation process, industrial scalability, deployment context |
| is_industrial_adoption_focused + is_verification_focused | W | Adoption evidence, third-party verification, validation in field |

From these flags, a derived sne_profile_hint is assigned:
| Profile hint | Description |
|---|---|
| S2_capable | Implementation-ready papers with strong S₂ and/or W signals |
| N_heavy_S2_weak | Policy-narrative-dominant papers with little implementation substance |
| S1_heavy | Measurement-focused papers with minimal narrative or implementation signals |
| S1_S2_mixed | Papers combining scientific measurement and implementation context |
| unclassified | Papers that do not clearly map to any single profile |

Classification is performed in a single AI pass per paper. SNE classification is not adversarially reviewed and should be treated as indicative rather than definitive. The model version is SNE v2.1.2 (Hiroyuki Kokubu / snecompass.com).
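The relationship between flags, axes, and profile hints can be sketched in code. The flag and profile names are taken from the definitions above, but the decision rules here are entirely assumed: the source does not say whether paired flags combine with OR or AND, nor in what order profiles are tested, and in gxceed the assignment is made by an AI pass rather than fixed rules.

```python
from typing import Dict


def axes_from_flags(flags: Dict[str, bool]) -> Dict[str, bool]:
    """Collapse the binary flags into the five SNE axes.

    Assumption: an axis backed by two flags is active if either flag is set.
    """
    def on(name: str) -> bool:
        return bool(flags.get(name, False))

    return {
        "S1": on("is_measurement_focused"),
        "N": on("is_policy_narrative_focused"),
        "E": on("is_outcome_focused"),
        "S2": on("is_implementation_focused") or on("is_scalability_focused"),
        "W": on("is_industrial_adoption_focused") or on("is_verification_focused"),
    }


def profile_hint(flags: Dict[str, bool]) -> str:
    """Derive a profile hint using hypothetical, ordered rules."""
    ax = axes_from_flags(flags)
    if ax["S1"] and ax["S2"]:
        return "S1_S2_mixed"
    if ax["S2"] or ax["W"]:
        return "S2_capable"
    if ax["N"]:
        return "N_heavy_S2_weak"
    if ax["S1"]:
        return "S1_heavy"
    return "unclassified"
```

Under these assumed rules, a paper flagged only as is_verification_focused would receive S2_capable, and a paper with no flags set would be unclassified.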
As of 2026, a dedicated lane for Chinese-origin GX research has been added, recognizing that Chinese institutional research represents a significant and underrepresented share of global GX output.
ChinaRxiv is accessed via its REST API (chinarxiv.org), which is reachable from Japanese IP addresses. It covers approximately 23,000 preprints from Chinese institutions. Detection method: source_native_chinarxiv.
The ChinaXiv OAI-PMH endpoint is geo-blocked from outside China. The resolver fails gracefully: it logs the block without raising an error. Coverage from this source is currently unavailable.
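One way to express this kind of graceful failure is a wrapper that turns a blocked request into an empty result plus a log entry. The names `collect_safely` and `SourceBlockedError`, the logger name, and the choice of caught exceptions are hypothetical; the real resolver code is not public.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("gxceed.resolvers")  # hypothetical logger name


class SourceBlockedError(Exception):
    """Hypothetical error a fetcher raises when the endpoint refuses access."""


def collect_safely(source: str, fetch: Callable[[], Iterable[dict]]) -> List[dict]:
    """Run a fetcher and return its records, or an empty list if blocked.

    The failure is logged as a warning and never propagated, so one
    unreachable source does not abort the whole collection run.
    """
    try:
        # list() forces lazy iterators so mid-stream failures are caught too.
        return list(fetch())
    except (SourceBlockedError, ConnectionError, TimeoutError) as exc:
        logger.warning("source %s unavailable, skipping: %s", source, exc)
        return []
```

With this pattern, a collection run that includes the blocked endpoint still completes, and the block is visible in the logs.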
Papers with Chinese institutional affiliations are identified via OpenAlex metadata. This supplements ChinaRxiv coverage with peer-reviewed and journal-published Chinese GX research.
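A sketch of affiliation-based detection is shown below. It assumes the work record follows the OpenAlex layout in which each entry under authorships lists institutions carrying a two-letter country_code; the function name `has_cn_affiliation` is hypothetical, and the exact rule gxceed applies is not documented.

```python
def has_cn_affiliation(work: dict) -> bool:
    """Return True if any listed institution has country code CN.

    Missing or null fields are tolerated, since source metadata is
    often incomplete.
    """
    for authorship in work.get("authorships") or []:
        for institution in authorship.get("institutions") or []:
            if (institution.get("country_code") or "").upper() == "CN":
                return True
    return False
```

Because the check looks only at institution country, it cannot identify Chinese authors affiliated with institutions outside China.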
Chinese-language abstracts and titles are translated to English and Japanese via the DeepSeek AI pipeline. Original Chinese text is preserved in the internal record; translated fields are explicitly marked as AI-generated.
This lane is considered experimental (as of May 2026). Translation quality for specialized technical Chinese should be treated with additional caution.
All AI-assisted classification and generation uses the following pipeline:
| Model | Role |
|---|---|
| DeepSeek (primary) | Relevance scoring (0–100), bilingual title translation, English/Japanese editorial summaries, context notes, Japan↔Global axis scoring, SNE-axis flag assignment |
| Claude Opus (editorial review) | Spot-check review of selected papers. Not applied to every paper; used for quality monitoring and edge-case resolution |

AI outputs are generated from title and abstract only. Full-text PDF processing is not performed. This means classification and summaries reflect what is signaled in the abstract, which may not capture implementation details buried in the paper body.
AI outputs on gxceed are positioned as a supplementary classification layer — not as a factual source. Readers should always refer to the original paper for authoritative conclusions.
The following aspects of the pipeline are disclosed and can be used for independent assessment:
- Paper schema fields and their definitions (see About and llms-full.txt)
- SNE classification logic and axis definitions
- Data source list and access methods (Section 2 above)
- Relevance threshold (≥ 85 of 100)
- AI models used (DeepSeek primary, Claude for review)
The following are not disclosed for operational security reasons:
- API keys and authentication credentials
- Cron schedule details and internal endpoint paths
- Deployment infrastructure specifics
- Classification prompt text (disclosed in general terms only)
The source code for gxceed is not currently open-source. If you are a researcher who needs additional methodological detail for a specific use case, please contact [email protected].
The corpus is weighted toward preprint servers (arXiv, Jxiv, Zenodo, SSRN, EarthArXiv, ChinaRxiv) and open-access journals. Paywalled journal papers are largely absent unless indexed by OpenAlex with accessible metadata. This may under-represent peer-reviewed industrial implementation research published in subscription journals.
CiNii Research records often have incomplete or absent English abstracts. IEA and Carbon Brief records are policy reports rather than academic papers and may receive different SNE profiles than comparable academic work. Abstract truncation by source APIs affects classification quality.
SNE flags are assigned from title and abstract only. Papers that describe implementation work primarily in their methods or results sections — not in the abstract — may be underclassified on S₂ or W axes.
Machine-translated titles and summaries, especially for Chinese and specialized Japanese technical content, may contain errors. Transliterated terms (e.g., specific chemical names, proprietary technologies) are particularly prone to mistranslation.
The ≥ 85 threshold was set conservatively to prioritize precision over recall. Boundary papers (score 70–84) may be relevant but are not published. The threshold is subject to periodic review.
ChinaRxiv coverage began in 2026 and is not yet systematically evaluated. OpenAlex CN affiliation detection may miss papers from Chinese authors at non-Chinese institutions. ChinaXiv (OAI-PMH) is currently blocked and unavailable.
gxceed is operated by a single independent researcher. Editorial decisions, threshold adjustments, and resolver maintenance reflect individual judgment rather than institutional peer review.