Methodology

How gxceed builds and classifies the GX research corpus

This page documents the data sources, ingestion approach, SNE-axis classification logic, AI pipeline, and known limitations of the gxceed GX paper corpus. It is intended for researchers who want to assess reproducibility, coverage, and methodological transparency.

1 · Overview

gxceed is designed as a low-cost open metadata pipeline for the GX research field. Rather than building a proprietary database, it aggregates metadata from open scholarly infrastructure (primarily DOI-based services, OAI-PMH endpoints, and open REST APIs) and adds AI-assisted relevance scoring, bilingual summaries, and SNE-axis classification.

The core design principle: use what is already open, classify what matters for the research–implementation gap, and make the distribution of GX knowledge production legible without requiring access to paywalled databases.

2 · Data Sources

gxceed collects from 13 open scholarly metadata sources. Each source has a dedicated resolver that handles authentication, pagination, and normalization into the internal paper schema.

Source	Type	Coverage / Notes
arXiv	Preprint (EN)	cs.AI, eess.SY, physics — GX-adjacent fields
Jxiv	Preprint (JA)	JST Japan Science and Technology Agency. Primary source for Japanese-origin GX preprints
Zenodo	Multi-type (EN)	CERN-operated repository. Includes datasets, reports, and preprints
SSRN	Preprint (EN)	Social science preprints. ESG investing, green finance, climate policy
EarthArXiv	Preprint (EN)	Earth sciences preprint server. Climate change, geothermal, environmental
J-STAGE	Peer-reviewed (JA/EN)	Operated by JST. Japanese peer-reviewed journals; high quality, but records often lack an English abstract
CiNii Research	Academic graph (JA)	NII (National Institute of Informatics). Japanese academic papers and university bulletins
Research Square	Preprint (EN)	Natural science and engineering preprints
OpenAlex	Metadata graph (EN)	Open academic knowledge graph. Used as a cross-source supplement and for CN affiliation detection
IEA	Policy report (EN)	International Energy Agency. World Energy Outlook, energy statistics
Carbon Brief	Policy media (EN)	Climate policy analysis and data journalism
Nature Energy	Peer-reviewed (EN)	High-impact energy research. Metadata only (abstract from OpenAlex where available)
ChinaRxiv	Preprint (ZH/EN)	Chinese Academy of Sciences preprint server. ~23,000 papers. REST API accessible from Japan

GX relevance filtering is applied at collection time using keyword matching on title and abstract. This reduces ingest volume before AI scoring.
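The collection-time filter can be pictured as a simple substring match over title and abstract. The sketch below is illustrative only: the keyword list and the function name `passes_gx_prefilter` are assumptions, since the actual gxceed keyword set is not published.

```python
from typing import Optional, Sequence

# Hypothetical keyword list for illustration; the real gxceed list is not disclosed.
GX_KEYWORDS = (
    "green transformation",
    "decarbonization",
    "carbon pricing",
    "hydrogen",
    "脱炭素",
)


def passes_gx_prefilter(
    title: str,
    abstract: Optional[str],
    keywords: Sequence[str] = GX_KEYWORDS,
) -> bool:
    """Return True when any keyword occurs in the title or abstract.

    Matching is case-insensitive; a missing abstract is treated as empty.
    """
    text = f"{title} {abstract or ''}".casefold()
    return any(kw.casefold() in text for kw in keywords)
```

A record that fails this check would be dropped before any AI scoring call is made, which is where the cost saving comes from.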

3 · Ingestion and Normalization

Each paper is assigned a canonical_key — typically doi:10.xxx/yyy or arxiv:XXXX.XXXXX — which is used as the deduplication anchor across sources. Papers are not duplicated even when they appear in multiple sources (e.g., a paper on both arXiv and OpenAlex).
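As a rough sketch of how such a key could anchor deduplication: the normalization rules below (lowercasing DOIs, stripping resolver prefixes and arXiv version suffixes, preferring a DOI over an arXiv ID, keeping the first record seen) are assumptions for illustration, not the documented gxceed behavior. The record field names `doi` and `arxiv_id` are likewise hypothetical.

```python
import re
from typing import Dict, Iterable, Optional


def canonical_key(doi: Optional[str] = None, arxiv_id: Optional[str] = None) -> Optional[str]:
    """Build a key such as doi:10.xxx/yyy or arxiv:XXXX.XXXXX.

    Assumption: a DOI wins over an arXiv ID when both are present.
    """
    if doi:
        cleaned = doi.strip().lower()
        # Strip a resolver URL or a "doi:" prefix if the source included one.
        cleaned = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", cleaned)
        return f"doi:{cleaned}"
    if arxiv_id:
        cleaned = re.sub(r"^arxiv:", "", arxiv_id.strip(), flags=re.IGNORECASE)
        cleaned = re.sub(r"v\d+$", "", cleaned)  # drop a version suffix such as v2
        return f"arxiv:{cleaned}"
    return None


def deduplicate(records: Iterable[dict]) -> Dict[str, dict]:
    """Keep the first record seen for each canonical key."""
    seen: Dict[str, dict] = {}
    for rec in records:
        key = canonical_key(rec.get("doi"), rec.get("arxiv_id"))
        if key is None:
            continue  # assumption: records without an identifier are handled elsewhere
        seen.setdefault(key, rec)
    return seen
```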

The internal schema captures:

title: Original title in source language
title_ja / title_en: AI-translated title (opposite language from original)
abstract: Original abstract from source (may be truncated by source)
ai_summary_ja / ai_summary_en: AI-generated bilingual editorial digest (200–400 characters each)
context_note_ja / context_note_en: Editorial framing for Japanese or global context
authors: Author names as provided by source (structured where available)
source: Originating resolver (arxiv, jxiv, zenodo, ssrn, etc.)
published_at: Publication or preprint date from source
collected_at: Timestamp when gxceed first retrieved the record
origin_country: Inferred country of origin (JP / US / EU / CN / Global / Unknown)
draft_score: GX relevance score 0–100 assigned by AI pipeline (published if ≥ 80)

Papers with draft_score ≥ 80 are published to the public corpus. Papers below threshold are retained internally for potential manual review or threshold adjustment.
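The publication rule above amounts to a single comparison against the threshold. A minimal sketch, using a reduced `Paper` record that carries only a few of the schema fields (the class and function names here are illustrative, not taken from the gxceed codebase):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

PUBLISH_THRESHOLD = 80  # from the text: published if draft_score >= 80


@dataclass
class Paper:
    """Reduced record for illustration; the real schema has more fields."""
    canonical_key: str
    title: str
    source: str
    draft_score: int
    abstract: Optional[str] = None
    origin_country: str = "Unknown"


def partition_by_threshold(
    papers: List[Paper], threshold: int = PUBLISH_THRESHOLD
) -> Tuple[List[Paper], List[Paper]]:
    """Split papers into (published, retained drafts) by draft_score."""
    published = [p for p in papers if p.draft_score >= threshold]
    drafts = [p for p in papers if p.draft_score < threshold]
    return published, drafts
```

Keeping the drafts rather than discarding them is what makes a later threshold adjustment possible without re-collecting.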

4 · SNE Classification

Each published paper is assigned a set of binary SNE-axis flags based on AI-assisted classification of its title and abstract. The flags map to the five SNE axes:

is_measurement_focused: S₁ (scientific measurement, empirical evidence, quantitative methods)
is_policy_narrative_focused: N (policy framing, regulatory targets, disclosure frameworks)
is_outcome_focused: E (outcome claims, impact projections, external expectations)
is_implementation_focused + is_scalability_focused: S₂ (implementation process, industrial scalability, deployment context)
is_industrial_adoption_focused + is_verification_focused: W (adoption evidence, third-party verification, validation in field)

From these flags, a derived sne_profile_hint is assigned:

S2_capable: Implementation-ready papers with strong S₂ and/or W signals
N_heavy_S2_weak: Policy-narrative-dominant papers with little implementation substance
S1_heavy: Measurement-focused papers with minimal narrative or implementation signals
S1_S2_mixed: Papers combining scientific measurement and implementation context
unclassified: Papers that do not clearly map to any single profile
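The exact derivation rules are not published, so the following is only one plausible reading of the profile descriptions above. The precedence order and the treatment of combined flags are assumptions made for this sketch; only the flag names and profile labels come from the text.

```python
from typing import Mapping


def sne_profile_hint(flags: Mapping[str, bool]) -> str:
    """Derive a profile label from binary SNE flags (assumed rules, not gxceed's)."""
    s1 = flags.get("is_measurement_focused", False)
    n = flags.get("is_policy_narrative_focused", False)
    # S2 and W are each backed by two flags; either flag counts as a signal here.
    s2 = flags.get("is_implementation_focused", False) or flags.get(
        "is_scalability_focused", False
    )
    w = flags.get("is_industrial_adoption_focused", False) or flags.get(
        "is_verification_focused", False
    )

    if s1 and s2:
        return "S1_S2_mixed"
    if s2 or w:
        return "S2_capable"
    if n and not s1:
        return "N_heavy_S2_weak"
    if s1 and not n:
        return "S1_heavy"
    return "unclassified"
```

Under these assumed rules, a paper flagged for both measurement and policy narrative but nothing else falls through to unclassified, which matches the description of papers that do not map to a single profile.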

Classification is performed in a single AI pass per paper. SNE classification is not adversarially reviewed and should be treated as indicative rather than definitive. The model version is SNE v2.1.2 (Hiroyuki Kokubu / snecompass.com).

5 · Chinese Metadata Lane

As of 2026, a dedicated lane for Chinese-origin GX research has been added, recognizing that Chinese institutional research represents a significant and underrepresented share of global GX output.

ChinaRxiv (direct API)

Accessed via the ChinaRxiv REST API (chinarxiv.org), which is reachable from Japanese IPs. Covers approximately 23,000 preprints from Chinese institutions. Detection method: source_native_chinarxiv.

ChinaXiv (OAI-PMH)

The ChinaXiv OAI-PMH endpoint is geo-blocked from outside China. The resolver performs a graceful fail and logs the block without raising an error. Coverage from this source is currently unavailable.

OpenAlex (CN affiliation)

Papers with Chinese institutional affiliations are identified via OpenAlex metadata. This supplements ChinaRxiv coverage with peer-reviewed and journal-published Chinese GX research.
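For the OpenAlex lane, affiliation detection can be sketched as a scan over a work record's authorships. The nesting used here (`authorships`, then `institutions`, then `country_code`) follows the OpenAlex work object; the function name and the decision to flag a paper on any single CN institution are assumptions for illustration.

```python
def has_cn_affiliation(work: dict) -> bool:
    """Return True if any listed institution on the work has country_code CN."""
    for authorship in work.get("authorships") or []:
        for institution in authorship.get("institutions") or []:
            if (institution.get("country_code") or "").upper() == "CN":
                return True
    return False
```

Because the check looks only at institutional country codes, a Chinese author affiliated solely with a non-Chinese institution would not be detected by a rule of this kind.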

Chinese-language abstracts and titles are translated to English and Japanese via the DeepSeek AI pipeline. Original Chinese text is preserved in the internal record; translated fields are explicitly marked as AI-generated.

This lane is considered experimental (as of May 2026). Translation quality for specialized technical Chinese should be treated with additional caution.
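The graceful failure described for the geo-blocked ChinaXiv OAI-PMH endpoint can be sketched as a fetch that logs and returns nothing instead of raising. The function name, logger name, and timeout are hypothetical; the actual resolver code is not published.

```python
import logging
import urllib.request
from typing import Optional

logger = logging.getLogger("gxceed.resolver")  # hypothetical logger name


def fetch_or_skip(url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Fetch a URL; on any network-level failure, log a warning and return None.

    urllib.error.URLError, HTTPError, socket timeouts, and connection errors
    are all OSError subclasses, so one except clause covers a blocked endpoint.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        logger.warning("source unavailable, skipping: %s (%s)", url, exc)
        return None
```

A caller can then treat `None` as "no records from this source in this run" and continue with the remaining resolvers.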

6 · AI Pipeline

All AI-assisted classification and generation uses the following pipeline:

DeepSeek (primary): Relevance scoring (0–100), bilingual title translation, English/Japanese editorial summaries, context notes, Japan↔Global axis scoring, SNE-axis flag assignment

Claude Opus (editorial review): Spot-check review of selected papers. Not applied to every paper; used for quality monitoring and edge-case resolution

AI outputs are generated from title and abstract only. Full-text PDF processing is not performed. This means classification and summaries reflect what is signaled in the abstract — which may not capture implementation details buried in the paper body.

AI outputs on gxceed are positioned as a supplementary classification layer — not as a factual source. Readers should always refer to the original paper for authoritative conclusions.

7 · Vector Embedding and Semantic Search

All published papers in the gxceed corpus are embedded into a Cloudflare Vectorize index for semantic (vector) search. This enables discovery of papers on abstract concepts — such as "transition justice", "sentient fish and fishery bans", or "GX knowledge gaps in supply chains" — that keyword search cannot handle.

Model: @cf/baai/bge-m3 (Cloudflare Workers AI); multilingual, 1,024-dim; supports Japanese, English, and Chinese
Distance metric: Cosine similarity
Index: Cloudflare Vectorize, gxceed-papers (1,024-dim, cosine)
Coverage: All 11,000+ published papers (June 2026). New papers are embedded automatically at ingest.

The embedding text per paper is constructed as a pipe-delimited concatenation of the most semantically rich fields (max ~2,000 characters):

title | title_ja | title_en | ai_summary_ja[:500] | ai_summary_en[:500] | abstract[:500] | primary_topic | tags
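A minimal sketch of that concatenation follows. The field order, the 500-character slices, and the overall cap come from the template above; skipping empty fields, joining tags with commas, and using exactly 2,000 as the cap are assumptions.

```python
MAX_EMBED_CHARS = 2000  # the text says "max ~2,000 characters"; exact cap assumed

FIELD_ORDER = (
    "title", "title_ja", "title_en",
    "ai_summary_ja", "ai_summary_en", "abstract",
    "primary_topic", "tags",
)
# Per-field truncation limits taken from the template.
FIELD_LIMITS = {"ai_summary_ja": 500, "ai_summary_en": 500, "abstract": 500}


def build_embedding_text(paper: dict) -> str:
    """Join the semantically rich fields with ' | ' and cap the total length."""
    parts = []
    for name in FIELD_ORDER:
        value = paper.get(name)
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)  # assumption: tags are a list
        if not value:
            continue  # assumption: empty fields are skipped, not emitted as blanks
        text = str(value)
        limit = FIELD_LIMITS.get(name)
        parts.append(text[:limit] if limit else text)
    return " | ".join(parts)[:MAX_EMBED_CHARS]
```

Placing the titles first means they survive intact even when the final cap trims the tail of the string.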

Vector search is available via the papers search API at GET /api/papers/search?mode=vector. Results include a vector_score field (0–1 cosine similarity). Metadata filters (topic, shelf, lang, min_score) are applied post-retrieval on D1 results.
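To make the retrieve-then-filter order concrete, here is a small in-memory sketch: rank by cosine similarity, keep the top k, then apply metadata filters to the retrieved rows. It stands in for Vectorize and D1 with plain dictionaries, and the function names and the `topic` / `lang` row fields are assumptions; the semantics of `min_score` are not specified in the text, so that filter is left out.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],  # stand-in for the vector index
    rows: Dict[str, dict],              # stand-in for the metadata store
    top_k: int = 10,
    topic: Optional[str] = None,
    lang: Optional[str] = None,
) -> List[dict]:
    """Retrieve the top_k nearest vectors, then filter on metadata."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), key) for key, vec in index.items()),
        reverse=True,
    )[:top_k]
    results = []
    for score, key in scored:
        row = rows.get(key)
        if row is None:
            continue
        if topic is not None and row.get("topic") != topic:
            continue
        if lang is not None and row.get("lang") != lang:
            continue
        results.append({**row, "canonical_key": key, "vector_score": score})
    return results
```

One consequence of filtering after retrieval is that a filtered query can return fewer than `top_k` results even when more matching papers exist further down the ranking.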

Known limitation: the abstract-only constraint applies to embeddings as well, so implementation details buried in paper bodies are not captured. bge-m3 is multilingual, but retrieval quality may differ across languages for highly specialized GX terminology.

8 · Reproducibility and Security

The following aspects of the pipeline are disclosed and can be used for independent assessment:

Paper schema fields and their definitions (see About and llms-full.txt)
SNE classification logic and axis definitions
Data source list and access methods (Section 2 above)
Relevance threshold (≥ 80 of 100)
AI models used (DeepSeek primary, Claude for review)

The following are not disclosed for operational security reasons:

API keys and authentication credentials
Cron schedule details and internal endpoint paths
Deployment infrastructure specifics
Classification prompt text (disclosed in general terms only)

The source code for gxceed is not currently open-source. If you are a researcher who needs additional methodological detail for a specific use case, please contact [email protected].

9 · Known Limitations

Source bias toward preprints and open-access

The corpus is weighted toward preprint servers (arXiv, Jxiv, Zenodo, SSRN, EarthArXiv, ChinaRxiv) and open-access journals. Paywalled journal papers are largely absent unless indexed by OpenAlex with accessible metadata. This may under-represent peer-reviewed industrial implementation research published in subscription journals.

Metadata quality varies by source

CiNii Research records often have incomplete or absent English abstracts. IEA and Carbon Brief records are policy reports rather than academic papers and may receive different SNE profiles than comparable academic work. Abstract truncation by source APIs affects classification quality.

Abstract-only classification

SNE flags are assigned from title and abstract only. Papers that describe implementation work primarily in their methods or results sections — not in the abstract — may be underclassified on S₂ or W axes.

AI translation errors

Machine-translated titles and summaries, especially for Chinese and specialized Japanese technical content, may contain errors. Transliterated terms (e.g., specific chemical names, proprietary technologies) are particularly prone to mistranslation.

Relevance threshold conservatism

The ≥ 80 threshold prioritizes precision over recall. Strong-GX boundary papers (score 65–79, e.g. broader energy transition / CCUS / hydrogen / EV empirical work) are retained as drafts and reviewed manually rather than auto-published. The threshold is subject to periodic review based on corpus growth and reader feedback.

Chinese lane is experimental

ChinaRxiv coverage began in 2026 and is not yet systematically evaluated. OpenAlex CN affiliation detection may miss papers from Chinese authors at non-Chinese institutions. ChinaXiv (OAI-PMH) is currently blocked and unavailable.

Single-operator platform

gxceed is operated by a single independent researcher. Editorial decisions, threshold adjustments, and resolver maintenance reflect individual judgment rather than institutional peer review.
