Ontology in the Loop

A Framework for AI-Assisted Knowledge Graph Evolution

Niran Pravithana

VI. Implementation

This section discusses practical implementation considerations, including technology choices, storage design, and operational concerns.

6.1 Technology Stack

The framework can be implemented with various technologies. One viable stack:

  • Knowledge Graph — Neo4j (native graph database)
  • Proposal Storage — PostgreSQL (relational with JSON support)
  • Vector Search — pgvector extension (embedding similarity)
  • LLM — OpenAI GPT-4 or Anthropic Claude
  • Embeddings — OpenAI text-embedding-ada-002
  • Application — Rust or Python

6.2 Storage Design

Neo4j Meta-Schema

Ontology definitions (domains, entity types, and relation types) are stored as graph nodes:

(:_Domain {
  name: "CORE",
  description: "...",
  always_include: true
})

(:_EntityType {
  name: "COMPANY",
  description: "...",
  key_properties: ["name", "ticker"]
})

(:_RelationType {
  name: "HEADQUARTERS_IN",
  from_entity: "COMPANY",
  to_entity: "COUNTRY"
})

(_Domain)-[:INCLUDES_ENTITY]->(_EntityType)
(_Domain)-[:INCLUDES_RELATION]->(_RelationType)
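With this meta-schema in place, retrieving the schema slice for a set of domains is a single traversal. A sketch in Cypher, assuming the labels and properties shown above and a list parameter `$domains`:

```cypher
// Fetch the entity types visible to the selected domains.
// always_include pulls in CORE regardless of the selection.
MATCH (d:_Domain)-[:INCLUDES_ENTITY]->(e:_EntityType)
WHERE d.name IN $domains OR d.always_include = true
RETURN DISTINCT e.name, e.description, e.key_properties
```

The analogous query over :INCLUDES_RELATION yields the relation types for the slice.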
      

PostgreSQL Schema

Proposals and pending extractions are stored relationally:

CREATE TABLE proposals (
  id TEXT PRIMARY KEY,
  title TEXT NOT NULL,
  status TEXT DEFAULT 'pending',
  domains TEXT[],
  ontology_changes JSONB,
  data_enrichment JSONB,
  source JSONB,
  created_at TIMESTAMPTZ,
  approved_at TIMESTAMPTZ
);

CREATE TABLE pending_extractions (
  id SERIAL PRIMARY KEY,
  document_id TEXT,
  raw_triples JSONB,
  waiting_for TEXT REFERENCES proposals(id)
);
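An illustrative insert against this table; the id scheme and the JSON shape of ontology_changes are assumptions here, not a fixed format:

```sql
-- Store a new proposal awaiting review.
INSERT INTO proposals (id, title, status, domains, ontology_changes, created_at)
VALUES (
  'prop-001',
  'Add EXCHANGE entity type',
  'pending',
  ARRAY['MARKETS'],
  '{"add_entity_types": [{"name": "EXCHANGE"}]}'::jsonb,
  now()
);
```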
      

Embedding Index

pgvector stores ontology and proposal embeddings:

CREATE TABLE ontology_embeddings (
  id SERIAL PRIMARY KEY,
  element_type TEXT,  -- 'entity' or 'relationship'
  name TEXT,
  description TEXT,
  embedding vector(1536)
);

CREATE INDEX ON ontology_embeddings
  USING ivfflat (embedding vector_cosine_ops);
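Duplicate detection then reduces to a nearest-neighbour query. A sketch, where $1 is the query embedding and <=> is pgvector's cosine-distance operator (matching the vector_cosine_ops index above):

```sql
-- Top-5 ontology elements most similar to the query embedding.
SELECT name, description, 1 - (embedding <=> $1) AS similarity
FROM ontology_embeddings
ORDER BY embedding <=> $1
LIMIT 5;
```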
      

6.3 API Design

Key APIs the system exposes:

get_domain_list() -> String
  Returns formatted domain descriptions
  for LLM Layer 1 prompt

slice_schema(domains: [String]) -> Schema
  Returns ontology subset for selected
  domains (auto-includes CORE)

create_proposal(proposal: JSON) -> ID
  Stores new proposal, computes embedding

get_pending_proposals(domains: [String]) -> [Proposal]
  Returns pending proposals for domains

validate_proposal(id: ID) -> Validation
  Checks for duplicates/conflicts

apply_proposals(ids: [ID]) -> Result
  Atomically applies approved proposals
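The slicing logic behind slice_schema is small enough to sketch in full. The following Python version works over an in-memory ontology; the data shapes (dicts mapping domain names to type names) are illustrative, not a fixed interface:

```python
# Sketch of slice_schema over an in-memory ontology.

CORE = "CORE"  # the always-included domain

def slice_schema(domains, domain_entities, domain_relations):
    """Return the ontology subset for the selected domains,
    auto-including CORE as the API contract requires."""
    selected = set(domains) | {CORE}
    entity_types, relation_types = set(), set()
    for d in selected:
        entity_types |= set(domain_entities.get(d, ()))
        relation_types |= set(domain_relations.get(d, ()))
    return {"entity_types": sorted(entity_types),
            "relation_types": sorted(relation_types)}
```

A production version would back this with the meta-schema queries from 6.2, but the contract (selected domains in, CORE-augmented subset out) is the same.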
      

6.4 LLM Prompting

Effective prompts are critical. Key principles:

  • Structured output — Request JSON for parsing
  • Clear constraints — Specify naming conventions
  • Examples — Include few-shot examples
  • Reasoning — Ask for explanation with answer

Example domain selection prompt structure:

You are a domain selector for a
financial knowledge graph.

Available domains:
[domain list with descriptions]

Document:
[document text]

Select 1-3 most relevant domains.

Output JSON:
{
  "domains": ["DOMAIN_1", ...],
  "reasoning": "..."
}
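Requesting JSON is only half the story; the response still has to be validated before use. A hedged Python sketch of the receiving side for this prompt (the function name and return shape are illustrative):

```python
import json

def parse_domain_selection(raw, allowed_domains, max_domains=3):
    """Validate a Layer-1 response: must be JSON with 1-3 known
    domains. Returns None on malformed output so the caller can
    re-prompt rather than crash."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    domains = data.get("domains")
    if (not isinstance(domains, list)
            or not 1 <= len(domains) <= max_domains
            or not set(domains) <= set(allowed_domains)):
        return None
    return {"domains": domains, "reasoning": data.get("reasoning", "")}
```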
      

6.5 Error Handling

The system should handle common failure modes:

  • LLM failures — Retry with backoff; fallback to manual review
  • Parse errors — Validate JSON output; request retry if malformed
  • Transaction failures — Rollback and notify; maintain consistency
  • Embedding failures — Queue for retry; don't block the pipeline
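The retry-with-backoff pattern for LLM failures can be sketched generically; the helper below is an illustration, not part of the framework's API:

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Run fn, retrying with exponential backoff (1s, 2s, 4s, ...).
    Re-raises after the final attempt so the caller can route the
    item to manual review instead of silently dropping it."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The injectable sleep argument keeps the helper testable without real delays.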

6.6 Performance Considerations

Token Costs

The two-layer architecture reduces token costs significantly:

Single layer: ~35K tokens/doc
Two layer:    ~10K tokens/doc
Savings:      ~70%
      

For high-volume processing, this difference is substantial.
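The arithmetic behind these figures is straightforward; the monthly volume below is a hypothetical for illustration, not a benchmark:

```python
# Back-of-envelope token cost, using the per-document figures above.
single_layer = 35_000   # full schema in every extraction prompt
two_layer = 10_000      # small Layer-1 prompt + sliced Layer-2 schema
savings = 1 - two_layer / single_layer   # about 0.71, i.e. ~70%

# At a hypothetical 100K documents/month, the gap is 2.5B tokens.
docs_per_month = 100_000
tokens_saved = (single_layer - two_layer) * docs_per_month
```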

Latency

Sequential LLM calls add latency. Mitigations:

  • Batch documents when possible
  • Cache domain selections for similar documents
  • Use faster models for Layer 1 (less complex task)
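A minimal caching sketch for the second mitigation, keyed by document hash. Note this only catches exact re-ingested duplicates; true "similar document" caching would need an embedding lookup. The wrapper shape is an assumption, not part of the framework's API:

```python
import hashlib

def make_cached_selector(select_domains):
    """Wrap the Layer-1 domain-selection call with an
    exact-duplicate cache keyed by SHA-256 of the document text."""
    cache = {}
    def select(text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = select_domains(text)
        return cache[key]
    return select
```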

Embedding Queries

Vector similarity search should be fast:

  • Use approximate nearest neighbor (IVFFlat)
  • Limit search to relevant domains
  • Cache frequently accessed embeddings
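Limiting search to relevant domains could look like the following, assuming ontology_embeddings gains a domains TEXT[] column (not shown in the schema of 6.2); && is PostgreSQL's array-overlap operator:

```sql
-- Domain-limited similarity search against a query embedding ($1).
SELECT name
FROM ontology_embeddings
WHERE domains && ARRAY['CORE', 'MARKETS']
ORDER BY embedding <=> $1
LIMIT 10;
```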

6.7 Monitoring

Key metrics to track:

  • Gap detection rate — Gaps per document
  • Proposal approval rate — Quality indicator
  • Review backlog — Queue health
  • Token usage — Cost tracking
  • Duplicate detection — Validation effectiveness