🤖 Agentic AI · Data Extraction · ML-Ready

GDD — GlycoDataDigest

Automatic Paper Download &
Data Extraction Framework

GDDAI automates scientific literature discovery, intelligent PDF extraction, and fact-verified ML-ready dataset generation — end to end.

Explore Pipeline → View Dataset

Three-Phase Pipeline

From Query to Clean Data

Each phase is handled by a specialised agent with dedicated tools.

Phase 1 · Acquisition Prep

🔍 Paper Search Agent

Searches Google Scholar and PubMed using the user's research keyword, ranks results by semantic similarity, and downloads full PDF articles.

Phase 1 · Acquisition Prep

📥 Downloader & Conversion

Bulk-downloads PDFs, enriches each with author metadata and molecular structure info, then converts every PDF to clean Markdown for downstream agents.

Phase 2 · Intelligent Extraction

🤖 Multi-Tool Extraction Agent

An LLM agent equipped with 5 different search tools spawns keyword-driven extraction passes, pulling numerical values into structured tabular rows.

Phase 3 · Audit Verification

✅ Verification Agent

A dedicated auditor agent cross-checks every extracted value against the source document — validating numerical accuracy, units, and citation details before export.

Core Capabilities

What GDDAI Can Do

Four tightly-integrated capabilities that cover the full research-to-data lifecycle.

🔎

Automated Literature Acquisition

Enter any research keyword and GDDAI automatically searches Google Scholar and PubMed, identifies relevant papers, and downloads their PDF full-texts — no manual browsing required.

Google Scholar PubMed Automated PDF Download

📊

Semantic PDF Ranking

Retrieved papers are ranked by cosine similarity against the search query embedding, ensuring the most relevant literature reaches the extraction stage first and irrelevant documents are deprioritised automatically.

Similarity Scoring Embedding Ranking Smart Prioritisation

⚡

Agentic Data Extraction

An LLM agent with tool-calling capabilities deploys five specialised search strategies over the converted Markdown text — keyword spawning, table parsing, unit normalisation — and outputs ML-ready tabular data without manual annotation.

Tool Calling LLM Agent Tabular Output 5 Search Tools

🛡️

Fact & Citation Verification

A separate Verification Agent independently audits every extracted data point — checking numerical context, pattern matching against source sentences, validating units, and confirming full citation details before the dataset is finalised.

Fact Checking Numerical Validation Citation Audit

Output Dataset

Glycan Properties — ML-Ready Data

A curated, numerically-verified dataset of physicochemical properties extracted from peer-reviewed glycan literature.

🍬 Glycan Physicochemical Database

Automatically assembled from primary literature and verified at the source-sentence level. Structured for direct use in regression models, property-prediction benchmarks, and QSPR / QSAR workflows.

2000+

Verified Data Points

💧Solubility 🌊Viscosity ⚡Activation Energy 🔀Diffusion Coefficient 💦Hygroscopicity 🌈IR Spectra 🔄Optical Rotation 🌡️Glass Transition Temp. 🧫Polymer–Carbohydrate Conjugates 🧩Self-Assembly ➕+ More Properties

Sample Schema

compound	property	value	unit	temperature	source_doi	verified
Sucrose	solubility	2000	g/L	25 °C	10.1002/...	✓
Trehalose	glass_transition_temp	115	°C	—	10.1021/...	✓
Maltose	viscosity	1.84	mPa·s	20 °C	10.1039/...	✓

Developer Access

REST API — Programmatic Data Access

All GlycoDataDigest datasets are accessible via a public REST API built with Django REST Framework. Query any property endpoint directly into your Python, R, or workflow scripts.

📡 GlycoData REST API

Base URL: https://glycodata.org/api/gdd/

View Full API Docs →

Resource	Endpoint	Method
💧 Solubilities	/api/gdd/solubilities/	GET
🌊 Viscosities	/api/gdd/viscosities/	GET
⚡ Activation Energies	/api/gdd/activation-energies/	GET
🔀 Diffusion Constants (T1)	/api/gdd/diffusion-constants-table1/	GET
🔀 Diffusion Constants (T2)	/api/gdd/diffusion-constant-table2/	GET
💦 Hygroscopicity	/api/gdd/hygroscopicity/	GET
🔄 Optical Rotations	/api/gdd/optical-rotations/	GET
🌡️ Glass Transition Temps.	/api/gdd/glass-transition-temperatures/	GET
🌈 IR Spectra	/api/gdd/ir-spectra/	GET

Usage Example

bash & python

# ── cURL ─────────────────────────────────────────────
curl -X GET "https://glycodata.org/api/gdd/solubilities/" \
     -H "Accept: application/json"

# ── Python (requests) ────────────────────────────────
import requests

base = "https://glycodata.org/api/gdd/"

# Fetch all solubility data (paginated, 20 per page)
resp = requests.get(base + "solubilities/")
data = resp.json()

# Paginate through results
while data["next"]:
    resp = requests.get(data["next"])
    data = resp.json()

Automatic Paper Download &Data Extraction Framework

The GDDAI Architecture