Ontology in the Loop

A Framework for AI-Assisted Knowledge Graph Evolution

Niran Pravithana

III. System Architecture

This section presents the overall system architecture, introducing the two-layer LLM approach and domain-based schema slicing.

3.1 Architecture Overview

The system comprises five main components working together to process documents, detect gaps, generate proposals, and apply approved changes:

System Architecture
Fig. 4. Self-Evolution Flow: from document ingestion through human review to schema updates.

Component Responsibilities

  • LLM Layer 1 — Domain detection: determines which ontology domains are relevant to a document
  • Schema Slicer — Extracts only the relevant portion of the ontology
  • LLM Layer 2 — Extraction and gap detection: attempts to extract data, identifies schema gaps
  • Proposal Store — Holds pending proposals for human review
  • Human Review — Approves, rejects, or modifies proposals

3.2 Two-Layer LLM Architecture

A naive approach would provide the entire ontology to a single LLM. This fails in practice because production ontologies often exceed context limits. Our solution splits the task across two specialized layers.

Why Two Layers?

Consider a production ontology with 50 entity types, 100 relationship types, and 200 pending proposals. Serialized, this might consume 35,000+ tokens, which exceeds the context window of many models and makes every call expensive.

The two-layer approach cuts the context required per document to roughly 10,000 tokens across both calls:

Layer 1: Domain Selection
  Input:  Document + Domain list (~500 tokens)
  Output: Selected domains (1-3)
  Total:  ~3,000 tokens

Layer 2: Extract + Analyze
  Input:  Document + Sliced schema (~2,000)
        + Relevant proposals (~3,000)
  Output: Triples + Gaps + Proposals
  Total:  ~7,000 tokens
      

Total: ~10,000 tokens across both layers vs. ~35,000 for the single-layer approach, a reduction of roughly 70%.
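
The arithmetic behind this comparison can be checked with a short sketch. The figures are the approximate estimates quoted above, not measurements, and the constant names are illustrative only.

```python
# Approximate per-call totals quoted in the text (estimates, not measurements).
LAYER1_TOTAL = 3_000         # document + domain list
LAYER2_TOTAL = 7_000         # document + sliced schema + relevant proposals
SINGLE_LAYER_TOTAL = 35_000  # full ontology with all pending proposals, serialized

two_layer_total = LAYER1_TOTAL + LAYER2_TOTAL
savings = 1 - two_layer_total / SINGLE_LAYER_TOTAL

print(f"two-layer total: ~{two_layer_total:,} tokens")
print(f"reduction vs. single layer: {savings:.0%}")
```

Note that the document itself is sent twice, once per layer, and the two-layer total still comes out well below the single-layer figure.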

Layer 1: Domain Detection

The first layer receives the document and a lightweight domain listing, and determines which domains are relevant. The prompt has the following structure:

Available Domains:

SUPPLY_CHAIN
  Use when: suppliers, vendors, sourcing
  Covers: supply chain, manufacturing

OWNERSHIP
  Use when: shareholders, investors
  Covers: ownership stakes, holdings

PARTNERSHIP
  Use when: joint ventures, alliances
  Covers: business partnerships

---
Document: [document text]

Select 1-3 domains. Output JSON:
{"domains": [...], "reasoning": "..."}
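
As a minimal sketch of how this prompt might be assembled and its reply checked, consider the following. The dictionary layout and the function names `build_layer1_prompt` and `parse_layer1_reply` are assumptions made for illustration; the call to the model itself is deliberately left out.

```python
import json

# Domain listing copied from the prompt above; the dict layout is an assumption.
DOMAINS = {
    "SUPPLY_CHAIN": ("suppliers, vendors, sourcing", "supply chain, manufacturing"),
    "OWNERSHIP": ("shareholders, investors", "ownership stakes, holdings"),
    "PARTNERSHIP": ("joint ventures, alliances", "business partnerships"),
}


def build_layer1_prompt(document: str) -> str:
    """Render the domain listing and the document into a single prompt string."""
    lines = ["Available Domains:", ""]
    for name, (use_when, covers) in DOMAINS.items():
        lines += [name, f"  Use when: {use_when}", f"  Covers: {covers}", ""]
    lines += [
        "---",
        f"Document: {document}",
        "",
        "Select 1-3 domains. Output JSON:",
        '{"domains": [...], "reasoning": "..."}',
    ]
    return "\n".join(lines)


def parse_layer1_reply(raw: str) -> list[str]:
    """Return the selected domains, or raise ValueError if the reply is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("reply is not valid JSON") from exc
    selected = data.get("domains") if isinstance(data, dict) else None
    if not isinstance(selected, list) or not 1 <= len(selected) <= 3:
        raise ValueError("expected a list of 1-3 domains")
    unknown = [d for d in selected if not isinstance(d, str) or d not in DOMAINS]
    if unknown:
        raise ValueError(f"unknown domains: {unknown}")
    return selected
```

Rejecting unknown domain names at this point keeps a hallucinated domain from reaching the slicer.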
      

Layer 2: Extraction and Analysis

The second layer receives the document together with only the sliced schema and the pending proposals for the selected domains. It attempts extraction and reports any gaps:

Schema (CORE + OWNERSHIP domains):

Entities:
  COMPANY - Public or private company
  COUNTRY - Nation
  SHAREHOLDER - Entity holding shares

Relationships:
  HEADQUARTERS_IN: COMPANY -> COUNTRY
  OWNS_SHARES: SHAREHOLDER -> COMPANY

Pending Proposals:
  PROP-042: Add ORGANIZATION entity

---
Document: [document text]

Extract triples using this schema.
Report any concepts that don't fit.
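
One way to double-check the model's output, independently of the gaps it reports itself, is to test each candidate triple against the slice. The sketch below assumes a triple is a `(subject_type, relationship, object_type)` tuple; that shape and the name `classify_triples` are hypothetical, since the output format of Layer 2 is not specified here.

```python
# Relationship signatures from the sliced schema in the example prompt above.
RELATIONSHIPS = {
    "HEADQUARTERS_IN": ("COMPANY", "COUNTRY"),
    "OWNS_SHARES": ("SHAREHOLDER", "COMPANY"),
}


def classify_triples(triples):
    """Split candidate triples into schema-conformant ones and gaps.

    A triple conforms when its relationship exists in the slice and its
    subject and object types match the relationship's signature.
    """
    valid, gaps = [], []
    for triple in triples:
        subject_type, relationship, object_type = triple
        if RELATIONSHIPS.get(relationship) == (subject_type, object_type):
            valid.append(triple)
        else:
            gaps.append(triple)
    return valid, gaps


candidates = [
    ("SHAREHOLDER", "OWNS_SHARES", "COMPANY"),
    ("COUNTRY", "IS_MEMBER_OF", "ORGANIZATION"),  # not in the slice
]
valid, gaps = classify_triples(candidates)
```

Here the second triple falls through to `gaps`, which is the kind of concept a proposal would then cover.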
      

3.3 Domain-Based Schema Organization

The ontology is partitioned into semantic domains—groups of related entity and relationship types.

Domain Structure

Each domain contains:

  • A name and description
  • Usage hints (when to select this domain)
  • Associated entity types
  • Associated relationship types
  • Flags (e.g., always_include)

Example domains for a financial knowledge graph:

CORE (always included)
  Entities: COMPANY, COUNTRY, REGION
  Rels: HEADQUARTERS_IN, MEMBER_OF

OWNERSHIP
  Entities: SHAREHOLDER, FUND
  Rels: OWNS_SHARES, MANAGES

SUPPLY_CHAIN
  Entities: FACILITY, PRODUCT
  Rels: SOURCES_FROM, MANUFACTURES_AT

PARTNERSHIP
  Entities: JOINT_VENTURE
  Rels: PARTNERS_WITH, FORMS_JV
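
The domain structure listed above might be represented as follows. This is a sketch only: the class name `Domain` and its field names are assumptions, and the CORE description and usage hint are placeholders, while the other values are taken from the examples in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    description: str
    usage_hint: str                 # when Layer 1 should select this domain
    entity_types: frozenset
    relationship_types: frozenset
    always_include: bool = False    # flag: add to every slice


DOMAINS = {
    d.name: d
    for d in [
        Domain("CORE", "foundational types", "always included",
               frozenset({"COMPANY", "COUNTRY", "REGION"}),
               frozenset({"HEADQUARTERS_IN", "MEMBER_OF"}),
               always_include=True),
        Domain("OWNERSHIP", "ownership stakes, holdings", "shareholders, investors",
               frozenset({"SHAREHOLDER", "FUND"}),
               frozenset({"OWNS_SHARES", "MANAGES"})),
        Domain("SUPPLY_CHAIN", "supply chain, manufacturing", "suppliers, vendors, sourcing",
               frozenset({"FACILITY", "PRODUCT"}),
               frozenset({"SOURCES_FROM", "MANUFACTURES_AT"})),
        Domain("PARTNERSHIP", "business partnerships", "joint ventures, alliances",
               frozenset({"JOINT_VENTURE"}),
               frozenset({"PARTNERS_WITH", "FORMS_JV"})),
    ]
}

# Domains flagged always_include are added to every slice.
always_on = [d.name for d in DOMAINS.values() if d.always_include]
```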
      

Core Domain

The CORE domain contains foundational types (e.g., COMPANY, COUNTRY) that are relevant to most documents. It is automatically included in every slice, so basic entity types are always available.

Slicing Algorithm

Given the selected domains, the slicer retrieves:

  1. All entity types belonging to selected domains + CORE
  2. All relationship types belonging to selected domains + CORE
  3. Pending proposals tagged with these domains

Formally, given a set of selected domains $D_{sel}$, the schema slice is:

$$S_{slice} = S_{CORE} \cup \bigcup_{d \in D_{sel}} S_d$$

where $S_d$ denotes all schema elements (entities, relationships) belonging to domain $d$. The context reduction ratio is:

$$r = 1 - \frac{|S_{slice}|}{|S_{full}|}$$

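Here $S_{full}$ is the complete schema and $|\cdot|$ counts schema elements, which serves as a proxy for prompt size. In practice, $r$ typically falls between 0.6 and 0.8, i.e., a 60-80% context reduction.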
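
The slice and the reduction ratio can be illustrated on the four example domains from this section. The sketch below flattens each domain into a single set of element names; that representation, and the function names, are assumptions for illustration.

```python
# Example domains from this section; entities and relationships are pooled
# into one set per domain (an assumed, simplified representation).
SCHEMA = {
    "CORE": {"COMPANY", "COUNTRY", "REGION", "HEADQUARTERS_IN", "MEMBER_OF"},
    "OWNERSHIP": {"SHAREHOLDER", "FUND", "OWNS_SHARES", "MANAGES"},
    "SUPPLY_CHAIN": {"FACILITY", "PRODUCT", "SOURCES_FROM", "MANUFACTURES_AT"},
    "PARTNERSHIP": {"JOINT_VENTURE", "PARTNERS_WITH", "FORMS_JV"},
}


def slice_schema(selected):
    """Union of CORE and every selected domain (S_slice in the text)."""
    result = set(SCHEMA["CORE"])
    for domain in selected:
        result |= SCHEMA[domain]
    return result


def slice_proposals(selected, proposals):
    """Keep pending proposals tagged with CORE or any selected domain."""
    wanted = {"CORE", *selected}
    return [p for p in proposals if wanted & set(p["domains"])]


def reduction_ratio(schema_slice):
    """r = 1 - |S_slice| / |S_full|, counting schema elements."""
    full = set().union(*SCHEMA.values())
    return 1 - len(schema_slice) / len(full)


schema_slice = slice_schema(["OWNERSHIP"])
r = reduction_ratio(schema_slice)
print(len(schema_slice), r)  # 9 of 16 elements kept, so r = 0.4375
```

On this toy schema the ratio is lower than the 0.6-0.8 range quoted above, because CORE makes up a large share of only sixteen elements; the ratio grows as more domains are added to the full ontology.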

3.4 Proposal Structure

When LLM Layer 2 detects a gap, it generates a structured proposal such as the following:

{
  "id": "PROP-042",
  "title": "Add ORGANIZATION entity",
  "status": "pending",
  "domains": ["CORE"],

  "ontology_changes": [
    {
      "type": "add_entity",
      "name": "ORGANIZATION",
      "description": "International org",
      "properties": ["name", "org_type"]
    },
    {
      "type": "add_relationship",
      "name": "IS_MEMBER_OF",
      "from": "COUNTRY",
      "to": "ORGANIZATION"
    }
  ],

  "data_enrichment": {
    "description": "OPEC + 13 members",
    "sample_triples": [...]
  },

  "source": {
    "document": "OPEC Report 2024",
    "confidence": 0.92
  }
}
      

Proposal Components

  • Ontology changes — Schema modifications (add entity, add relationship, etc.)
  • Data enrichment — Instance data to insert once approved
  • Source — Provenance linking to the originating document
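
A structural check on incoming proposals could look like the sketch below. The set of required keys, the restriction to the two change types shown in the example, and the [0, 1] range for `confidence` are all assumptions; the text implies that further change types exist, so the table would need extending.

```python
REQUIRED_KEYS = {"id", "title", "status", "domains", "ontology_changes", "source"}
# Only the two change types shown in the example above (assumed field sets).
REQUIRED_CHANGE_FIELDS = {
    "add_entity": {"name"},
    "add_relationship": {"name", "from", "to"},
}


def validate_proposal(proposal: dict) -> list:
    """Return a list of problems; an empty list means the proposal looks well-formed."""
    errors = []
    missing = REQUIRED_KEYS - set(proposal)
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    for i, change in enumerate(proposal.get("ontology_changes", [])):
        required = REQUIRED_CHANGE_FIELDS.get(change.get("type"))
        if required is None:
            errors.append(f"change {i}: unsupported type {change.get('type')!r}")
        elif required - set(change):
            errors.append(f"change {i}: missing {sorted(required - set(change))}")
    confidence = proposal.get("source", {}).get("confidence")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        errors.append("source.confidence outside [0, 1]")
    return errors


example = {
    "id": "PROP-042",
    "title": "Add ORGANIZATION entity",
    "status": "pending",
    "domains": ["CORE"],
    "ontology_changes": [
        {"type": "add_entity", "name": "ORGANIZATION",
         "description": "International org", "properties": ["name", "org_type"]},
        {"type": "add_relationship", "name": "IS_MEMBER_OF",
         "from": "COUNTRY", "to": "ORGANIZATION"},
    ],
    "data_enrichment": {"description": "OPEC + 13 members", "sample_triples": []},  # triples elided in the source
    "source": {"document": "OPEC Report 2024", "confidence": 0.92},
}
print(validate_proposal(example))  # prints []
```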

3.5 Storage Architecture

The system uses separate storage for different data types:

Neo4j (Graph Database)
  - Entity instances
  - Relationship instances
  - Meta-schema (domain definitions)
  - APPROVED data only

Postgres (Relational)
  - Proposal records
  - Status tracking
  - Discussion history
  - Pending extractions

pgvector (Vector Extension)
  - Ontology embeddings
  - Proposal embeddings
  - For similarity search
      

Separation Rationale

Keeping proposals separate from production data ensures:

  • Production graph contains only approved, validated data
  • Pending proposals don't pollute queries
  • Clear audit trail of what's approved vs. pending

3.6 Data Flow Summary

The complete flow for processing a document:

  1. Ingest — Document enters pipeline
  2. Domain detect — LLM 1 selects relevant domains
  3. Slice — System retrieves relevant schema subset
  4. Extract — LLM 2 extracts with sliced context
  5. Classify — Separate successful extractions from gaps
  6. Store — Successful triples → Neo4j; Gaps → Proposals
  7. Validate — Check proposals for duplicates
  8. Queue — Valid proposals enter review queue
  9. Review — Human approves/rejects
  10. Apply — Approved changes applied atomically
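
Steps 2 through 8 can be sketched as a single orchestration function with each stage passed in as a callable. Everything here is hypothetical scaffolding: the function name, the parameter names, and the use of plain lists in place of Neo4j and the proposal store are assumptions, and the duplicate check is stubbed by comparing IDs. Review and apply (steps 9 and 10) happen later and are not modeled.

```python
def process_document(document, *, detect_domains, slice_schema, extract,
                     is_duplicate, graph, review_queue):
    """Run steps 2-8 for one document; return (triples stored, proposals queued)."""
    domains = detect_domains(document)              # step 2: LLM 1
    schema = slice_schema(domains)                  # step 3: schema slice
    triples, proposals = extract(document, schema)  # steps 4-5: LLM 2 + classify
    graph.extend(triples)                           # step 6: successful triples
    queued = 0
    for proposal in proposals:                      # step 7: duplicate check
        if not is_duplicate(proposal, review_queue):
            review_queue.append(proposal)           # step 8: enter review queue
            queued += 1
    return len(triples), queued


# Usage with stand-in stages (no model or database calls are made).
graph, review_queue = [], []
stored, queued = process_document(
    "OPEC Report 2024",
    detect_domains=lambda doc: ["OWNERSHIP"],
    slice_schema=lambda domains: {"CORE", *domains},
    extract=lambda doc, schema: (
        [("SHAREHOLDER", "OWNS_SHARES", "COMPANY")],
        [{"id": "PROP-042"}, {"id": "PROP-042"}],
    ),
    is_duplicate=lambda p, pending: any(q["id"] == p["id"] for q in pending),
    graph=graph,
    review_queue=review_queue,
)
print(stored, queued)  # prints 1 1
```

Passing the stages in as arguments keeps the control flow testable without a live model or database.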

The following sections detail gap detection (Section IV) and the human review workflow (Section V).