III. System Architecture

This section presents the overall system architecture, introducing the two-layer LLM approach and domain-based schema slicing.

3.1 Architecture Overview

The system comprises five main components working together to process documents, detect gaps, generate proposals, and apply approved changes:

Fig. 4. Self-Evolution Flow: from document ingestion through human review to schema updates.

Component Responsibilities

LLM Layer 1 — Domain detection: determines which ontology domains are relevant to a document
Schema Slicer — Extracts only the relevant portion of the ontology
LLM Layer 2 — Extraction and gap detection: attempts to extract data, identifies schema gaps
Proposal Store — Holds pending proposals for human review
Human Review — Approves, rejects, or modifies proposals

3.2 Two-Layer LLM Architecture

A naive approach would provide the entire ontology to a single LLM. This fails in practice because a serialized production ontology, together with its pending proposals, can exceed a model's context limit and is costly to send even when it fits. Our solution splits the task across two specialized layers.

Why Two Layers?

Consider a production ontology with 50 entity types, 100 relationship types, and 200 pending proposals. Serialized, this might consume 35,000+ tokens—exceeding many model limits and incurring high costs.

The two-layer approach cuts the context requirement for this example by roughly 70%:

Layer 1: Domain Selection
  Input:  Document + Domain list (~500 tokens)
  Output: Selected domains (1-3)
  Total:  ~3,000 tokens

Layer 2: Extract + Analyze
  Input:  Document + Sliced schema (~2,000)
        + Relevant proposals (~3,000)
  Output: Triples + Gaps + Proposals
  Total:  ~7,000 tokens

Total: ~10,000 tokens vs. ~35,000 for the single-layer approach.
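
The arithmetic behind this comparison can be checked directly. The sketch below is illustrative only: it uses the approximate budgets quoted above, and the function name `token_savings` is hypothetical.

```python
# Approximate token budgets quoted in the text above.
SINGLE_LAYER_TOKENS = 35_000
LAYER_1_TOTAL = 3_000
LAYER_2_TOTAL = 7_000


def token_savings(single_layer: int, *layer_totals: int) -> tuple[int, float]:
    """Return (combined multi-layer tokens, fractional reduction vs. single-layer)."""
    combined = sum(layer_totals)
    return combined, 1 - combined / single_layer


combined, reduction = token_savings(SINGLE_LAYER_TOKENS, LAYER_1_TOTAL, LAYER_2_TOTAL)
print(f"{combined} tokens, {reduction:.0%} fewer than single-layer")
# prints "10000 tokens, 71% fewer than single-layer"
```

The saving holds even though the document itself is sent to both layers.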

Layer 1: Domain Detection

The first layer receives a lightweight domain listing and determines which domains are relevant. The prompt is structured as follows:

Available Domains:

SUPPLY_CHAIN
  Use when: suppliers, vendors, sourcing
  Covers: supply chain, manufacturing

OWNERSHIP
  Use when: shareholders, investors
  Covers: ownership stakes, holdings

PARTNERSHIP
  Use when: joint ventures, alliances
  Covers: business partnerships

---
Document: [document text]

Select 1-3 domains. Output JSON:
{"domains": [...], "reasoning": "..."}
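
As a rough illustration, the prompt above could be assembled and its reply validated as follows. This is a minimal sketch, not the system's actual code: the `DOMAINS` dictionary layout and the function names `build_layer1_prompt` and `parse_layer1_response` are assumptions, and the call to the model itself is left out.

```python
import json

# Hypothetical in-memory form of the domain listing shown in the prompt above.
DOMAINS = {
    "SUPPLY_CHAIN": {"use_when": "suppliers, vendors, sourcing",
                     "covers": "supply chain, manufacturing"},
    "OWNERSHIP": {"use_when": "shareholders, investors",
                  "covers": "ownership stakes, holdings"},
    "PARTNERSHIP": {"use_when": "joint ventures, alliances",
                    "covers": "business partnerships"},
}


def build_layer1_prompt(domains: dict, document: str) -> str:
    """Render the domain listing and the document into a single prompt string."""
    lines = ["Available Domains:", ""]
    for name, info in domains.items():
        lines += [name,
                  f"  Use when: {info['use_when']}",
                  f"  Covers: {info['covers']}",
                  ""]
    lines += ["---",
              f"Document: {document}",
              "",
              "Select 1-3 domains. Output JSON:",
              '{"domains": [...], "reasoning": "..."}']
    return "\n".join(lines)


def parse_layer1_response(raw: str, domains: dict) -> list[str]:
    """Parse the model's JSON reply; raise ValueError if it is malformed."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    selected = data.get("domains")
    if not isinstance(selected, list) or not 1 <= len(selected) <= 3:
        raise ValueError("expected a list of 1-3 domains")
    unknown = [d for d in selected if not isinstance(d, str) or d not in domains]
    if unknown:
        raise ValueError(f"unknown domains: {unknown}")
    return selected
```

Rejecting unknown domain names at this point keeps a hallucinated domain from reaching the slicer.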

Layer 2: Extraction and Analysis

The second layer receives only the sliced schema plus pending proposals for those domains. It attempts extraction and reports gaps:

Schema (CORE + OWNERSHIP domains):

Entities:
  COMPANY - Public or private company
  COUNTRY - Nation
  SHAREHOLDER - Entity holding shares

Relationships:
  HEADQUARTERS_IN: Company -> Country
  OWNS_SHARES: Shareholder -> Company

Pending Proposals:
  PROP-042: Add ORGANIZATION entity

---
Document: [document text]

Extract triples using this schema.
Report any concepts that don't fit.
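
The output format of Layer 2 is not specified here. As a sketch under assumptions, suppose each extracted triple carries its entity types. The relationship signatures of the sliced schema can then be used to separate triples that fit from those that point to a gap. The `Triple` class and both function names are hypothetical, and entity type names are written in upper case throughout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Assumed shape of one extracted fact, with entity types attached."""
    subject: str
    subject_type: str
    relation: str
    obj: str
    obj_type: str


# Relationship signatures from the sliced schema shown above.
RELATIONSHIPS = {
    "HEADQUARTERS_IN": ("COMPANY", "COUNTRY"),
    "OWNS_SHARES": ("SHAREHOLDER", "COMPANY"),
}


def fits_schema(t: Triple, relationships: dict[str, tuple[str, str]]) -> bool:
    """True if the relation exists and its endpoint types match the schema."""
    return relationships.get(t.relation) == (t.subject_type, t.obj_type)


def split_by_schema(triples, relationships):
    """Return (triples that fit the schema, triples that indicate a gap)."""
    valid = [t for t in triples if fits_schema(t, relationships)]
    gaps = [t for t in triples if not fits_schema(t, relationships)]
    return valid, gaps


triples = [
    Triple("Acme", "COMPANY", "HEADQUARTERS_IN", "France", "COUNTRY"),
    Triple("Saudi Arabia", "COUNTRY", "IS_MEMBER_OF", "OPEC", "ORGANIZATION"),
]
valid, gaps = split_by_schema(triples, RELATIONSHIPS)
```

In this example the second triple lands in `gaps` because neither IS_MEMBER_OF nor ORGANIZATION exists in the slice, which is the situation a proposal such as PROP-042 addresses.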

3.3 Domain-Based Schema Organization

The ontology is partitioned into semantic domains—groups of related entity and relationship types.

Domain Structure

Each domain contains:

A name and description
Usage hints (when to select this domain)
Associated entity types
Associated relationship types
Flags (e.g., always_include)

Example domains for a financial knowledge graph:

CORE (always included)
  Entities: COMPANY, COUNTRY, REGION
  Rels: HEADQUARTERS_IN, MEMBER_OF

OWNERSHIP
  Entities: SHAREHOLDER, FUND
  Rels: OWNS_SHARES, MANAGES

SUPPLY_CHAIN
  Entities: FACILITY, PRODUCT
  Rels: SOURCES_FROM, MANUFACTURES_AT

PARTNERSHIP
  Entities: JOINT_VENTURE
  Rels: PARTNERS_WITH, FORMS_JV
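
One possible in-memory representation of these domains is sketched below. The `Domain` class, its field names, and `selectable_domains` are illustrative assumptions. The descriptions and usage hints are taken from the Layer 1 prompt, and the CORE hint is left empty because none is given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    description: str
    usage_hint: str
    entity_types: frozenset[str]
    relationship_types: frozenset[str]
    always_include: bool = False  # flag from the domain structure above


REGISTRY = {
    d.name: d
    for d in [
        Domain("CORE", "foundational types", "",
               frozenset({"COMPANY", "COUNTRY", "REGION"}),
               frozenset({"HEADQUARTERS_IN", "MEMBER_OF"}),
               always_include=True),
        Domain("OWNERSHIP", "ownership stakes, holdings", "shareholders, investors",
               frozenset({"SHAREHOLDER", "FUND"}),
               frozenset({"OWNS_SHARES", "MANAGES"})),
        Domain("SUPPLY_CHAIN", "supply chain, manufacturing",
               "suppliers, vendors, sourcing",
               frozenset({"FACILITY", "PRODUCT"}),
               frozenset({"SOURCES_FROM", "MANUFACTURES_AT"})),
        Domain("PARTNERSHIP", "business partnerships", "joint ventures, alliances",
               frozenset({"JOINT_VENTURE"}),
               frozenset({"PARTNERS_WITH", "FORMS_JV"})),
    ]
}


def selectable_domains(registry: dict[str, Domain]) -> list[str]:
    """Domains offered to Layer 1; always-included domains need no selection."""
    return [name for name, d in registry.items() if not d.always_include]
```

Here `selectable_domains(REGISTRY)` returns the three non-CORE domains, matching the listing in the Layer 1 prompt.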

Core Domain

The CORE domain contains foundational types (such as COMPANY and COUNTRY) that are relevant to most documents. It is automatically included in every slice, so basic entity types are always available.

Slicing Algorithm

Given selected domains, the slicer retrieves:

All entity types belonging to selected domains + CORE
All relationship types belonging to selected domains + CORE
Pending proposals tagged with these domains

Formally, given a set of selected domains $D_{sel}$, the schema slice is:

$$S_{slice} = S_{CORE} \cup \bigcup_{d \in D_{sel}} S_d$$

where $S_d$ denotes the set of schema elements (entity and relationship types) belonging to domain $d$. Writing $S_{full}$ for the complete ontology and $|\cdot|$ for schema size, the context reduction ratio is:

$$r = 1 - \frac{|S_{slice}|}{|S_{full}|}$$

In practice, $r$ typically falls between 0.6 and 0.8, meaning the slice is 60-80% smaller than the full schema.
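
The two formulas can be illustrated on the example domains from this section. The sketch below is a toy instance under two assumptions: schema size is measured by element count (token count would be an equally valid measure), and the dictionary layout and function names are hypothetical. Pending proposals are omitted.

```python
# Schema elements (entity and relationship types) per domain, from the example above.
SCHEMA = {
    "CORE": {"COMPANY", "COUNTRY", "REGION", "HEADQUARTERS_IN", "MEMBER_OF"},
    "OWNERSHIP": {"SHAREHOLDER", "FUND", "OWNS_SHARES", "MANAGES"},
    "SUPPLY_CHAIN": {"FACILITY", "PRODUCT", "SOURCES_FROM", "MANUFACTURES_AT"},
    "PARTNERSHIP": {"JOINT_VENTURE", "PARTNERS_WITH", "FORMS_JV"},
}


def slice_schema(schema: dict[str, set[str]], selected: list[str]) -> set[str]:
    """S_slice = S_CORE united with S_d for every selected domain d."""
    result = set(schema["CORE"])
    for domain in selected:
        result |= schema[domain]
    return result


def reduction_ratio(schema: dict[str, set[str]], selected: list[str]) -> float:
    """r = 1 - |S_slice| / |S_full|, with size measured as element count."""
    full = set().union(*schema.values())
    return 1 - len(slice_schema(schema, selected)) / len(full)


print(reduction_ratio(SCHEMA, ["OWNERSHIP"]))  # prints 0.4375
```

The toy schema has only 16 elements, 5 of them in CORE, so $r$ comes out below the 0.6-0.8 range reported above. The ratio grows as the number of domains increases relative to the size of CORE.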

3.4 Proposal Structure

When LLM Layer 2 detects a gap, it generates a structured proposal. An example proposal:

{
  "id": "PROP-042",
  "title": "Add ORGANIZATION entity",
  "status": "pending",
  "domains": ["CORE"],

  "ontology_changes": [
    {
      "type": "add_entity",
      "name": "ORGANIZATION",
      "description": "International org",
      "properties": ["name", "org_type"]
    },
    {
      "type": "add_relationship",
      "name": "IS_MEMBER_OF",
      "from": "COUNTRY",
      "to": "ORGANIZATION"
    }
  ],

  "data_enrichment": {
    "description": "OPEC + 13 members",
    "sample_triples": [...]
  },

  "source": {
    "document": "OPEC Report 2024",
    "confidence": 0.92
  }
}

Proposal Components

Ontology changes — Schema modifications (add entity, add relationship, etc.)
Data enrichment — Instance data to insert once approved
Source — Provenance linking to the originating document
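
To make the `ontology_changes` list concrete, the sketch below applies such a list to a schema held in memory. It is an assumption-laden illustration: the schema layout (`entities` and `relationships` dictionaries) and the function name are hypothetical, and only the two change types shown in the example are handled. All changes are applied to a copy, so a failure partway through leaves the input schema untouched.

```python
import copy


def apply_ontology_changes(schema: dict, changes: list[dict]) -> dict:
    """Return a new schema with every change applied, or raise ValueError."""
    updated = copy.deepcopy(schema)  # work on a copy: all-or-nothing
    for change in changes:
        kind, name = change["type"], change["name"]
        if kind == "add_entity":
            if name in updated["entities"]:
                raise ValueError(f"entity already exists: {name}")
            updated["entities"][name] = {
                "description": change.get("description", ""),
                "properties": list(change.get("properties", [])),
            }
        elif kind == "add_relationship":
            if name in updated["relationships"]:
                raise ValueError(f"relationship already exists: {name}")
            for endpoint in (change["from"], change["to"]):
                if endpoint not in updated["entities"]:
                    raise ValueError(f"unknown entity type: {endpoint}")
            updated["relationships"][name] = (change["from"], change["to"])
        else:
            raise ValueError(f"unsupported change type: {kind}")
    return updated


schema = {
    "entities": {"COMPANY": {}, "COUNTRY": {}},
    "relationships": {"HEADQUARTERS_IN": ("COMPANY", "COUNTRY")},
}
changes = [
    {"type": "add_entity", "name": "ORGANIZATION",
     "description": "International org", "properties": ["name", "org_type"]},
    {"type": "add_relationship", "name": "IS_MEMBER_OF",
     "from": "COUNTRY", "to": "ORGANIZATION"},
]
new_schema = apply_ontology_changes(schema, changes)
```

Order matters in this sketch: the relationship is accepted only because the ORGANIZATION entity is added before it in the same list.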

3.5 Storage Architecture

The system uses separate storage for different data types:

Neo4j (Graph Database)
  - Entity instances
  - Relationship instances
  - Meta-schema (domain definitions)
  - APPROVED data only

Postgres (Relational)
  - Proposal records
  - Status tracking
  - Discussion history
  - Pending extractions

pgvector (Vector Extension)
  - Ontology embeddings
  - Proposal embeddings
  - For similarity search
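
The similarity search over proposal embeddings can be pictured with a small in-memory example. This is only a sketch of the idea in plain Python, not how pgvector is queried: the two-dimensional vectors, the 0.9 threshold, the identifier PROP-007, and the function names are all made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar(new_vec: list[float],
                 existing: dict[str, list[float]],
                 threshold: float = 0.9) -> list[tuple[str, float]]:
    """Return (proposal id, similarity) pairs at or above the threshold, best first."""
    scored = [(pid, cosine_similarity(new_vec, vec)) for pid, vec in existing.items()]
    hits = [(pid, score) for pid, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy embeddings; real embeddings would have hundreds of dimensions.
existing = {"PROP-042": [1.0, 0.0], "PROP-007": [0.0, 1.0]}
matches = find_similar([0.9, 0.1], existing)
```

Here only PROP-042 clears the threshold, so a new proposal with this embedding would be flagged as a likely duplicate of it.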

Separation Rationale

Keeping proposals separate from production data ensures:

Production graph contains only approved, validated data
Pending proposals don't pollute queries
Clear audit trail of what's approved vs. pending

3.6 Data Flow Summary

The complete flow for processing a document:

Ingest — Document enters pipeline
Domain detect — LLM 1 selects relevant domains
Slice — System retrieves relevant schema subset
Extract — LLM 2 extracts with sliced context
Classify — Separate successful extractions from gaps
Store — Successful triples → Neo4j; Gaps → Proposals
Validate — Check proposals for duplicates
Queue — Valid proposals enter review queue
Review — Human approves/rejects
Apply — Approved changes applied atomically
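
The automated part of this flow, from domain detection through queueing, can be sketched as a single function that takes each stage as a callable. Everything here is hypothetical scaffolding: the function name, the parameter names, and the stub stages in the usage example stand in for the real LLM calls and stores, and the human review and apply steps are left out because they happen later.

```python
from typing import Any, Callable


def process_document(
    document: str,
    detect_domains: Callable[[str], list[str]],
    slice_schema: Callable[[list[str]], Any],
    extract: Callable[[str, Any], tuple[list, list]],
    is_duplicate: Callable[[Any], bool],
) -> dict[str, list]:
    """Run the automated stages and return what goes to the graph and to review."""
    domains = detect_domains(document)                  # Domain detect (LLM 1)
    schema = slice_schema(domains)                      # Slice
    triples, gaps = extract(document, schema)           # Extract + classify (LLM 2)
    queued = [g for g in gaps if not is_duplicate(g)]   # Validate, then queue
    return {"triples": triples, "review_queue": queued}


# Usage with stub stages in place of the real components.
result = process_document(
    "OPEC Report 2024",
    detect_domains=lambda doc: ["OWNERSHIP"],
    slice_schema=lambda domains: {"domains": ["CORE", *domains]},
    extract=lambda doc, schema: ([("Acme", "HEADQUARTERS_IN", "France")],
                                 ["ORGANIZATION", "ORGANIZATION"]),
    is_duplicate=lambda gap: False,
)
```

Passing the stages in as callables keeps the orchestration testable without a model or database; a real `is_duplicate` would also need to compare a gap against proposals already in the store.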

The following sections detail gap detection (Section IV) and the human review workflow (Section V).