VI. Implementation

This section covers practical implementation considerations, including technology choices, storage design, and operational concerns.

6.1 Technology Stack

The framework can be implemented with a variety of technologies. One possible stack:

Knowledge Graph: Neo4j (native graph database)
Proposal Storage: PostgreSQL (relational, with JSON support)
Vector Search: pgvector extension (embedding similarity)
LLM: OpenAI GPT-4 or Anthropic Claude
Embeddings: OpenAI text-embedding-ada-002
Application: Rust or Python

6.2 Storage Design

Neo4j Meta-Schema

Domain definitions are stored as graph nodes:

(:_Domain {
  name: "CORE",
  description: "...",
  always_include: true
})

(:_EntityType {
  name: "COMPANY",
  description: "...",
  key_properties: ["name", "ticker"]
})

(:_RelationType {
  name: "HEADQUARTERS_IN",
  from_entity: "COMPANY",
  to_entity: "COUNTRY"
})

(:_Domain)-[:INCLUDES_ENTITY]->(:_EntityType)
(:_Domain)-[:INCLUDES_RELATION]->(:_RelationType)
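
As a rough illustration of how the meta-schema might be populated, the sketch below builds a parameterized Cypher statement that registers an entity type under a domain. The function name `entity_type_upsert` and the sample description are hypothetical. The code only produces the query text and its parameters; executing it would require a Neo4j driver session, which is not shown.

```python
from typing import Any


def entity_type_upsert(
    domain: str, name: str, description: str, key_properties: list[str]
) -> tuple[str, dict[str, Any]]:
    """Build a Cypher statement linking an entity type to a domain.

    MERGE makes the statement idempotent: running it twice does not
    create duplicate _Domain or _EntityType nodes.
    """
    query = (
        "MERGE (d:_Domain {name: $domain}) "
        "MERGE (e:_EntityType {name: $name}) "
        "SET e.description = $description, "
        "e.key_properties = $key_properties "
        "MERGE (d)-[:INCLUDES_ENTITY]->(e)"
    )
    params = {
        "domain": domain,
        "name": name,
        "description": description,
        "key_properties": key_properties,
    }
    return query, params


# Example values are illustrative only.
query, params = entity_type_upsert(
    "CORE", "COMPANY", "A legal business entity", ["name", "ticker"]
)
```

Passing values as parameters rather than formatting them into the query string avoids injection problems and lets the database reuse the query plan.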

PostgreSQL Schema

Proposals and pending data are stored relationally:

CREATE TABLE proposals (
  id TEXT PRIMARY KEY,
  title TEXT NOT NULL,
  status TEXT DEFAULT 'pending',
  domains TEXT[],
  ontology_changes JSONB,
  data_enrichment JSONB,
  source JSONB,
  created_at TIMESTAMPTZ,
  approved_at TIMESTAMPTZ
);

CREATE TABLE pending_extractions (
  id SERIAL PRIMARY KEY,
  document_id TEXT,
  raw_triples JSONB,
  waiting_for TEXT REFERENCES proposals(id)
);

Embedding Index

pgvector stores ontology and proposal embeddings:

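-- The vector type is provided by the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE ontology_embeddings (
  id SERIAL PRIMARY KEY,
  element_type TEXT,  -- 'entity' or 'relationship'
  name TEXT,
  description TEXT,
  embedding vector(1536)  -- output size of text-embedding-ada-002
);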

CREATE INDEX ON ontology_embeddings
  USING ivfflat (embedding vector_cosine_ops);
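
A similarity lookup against this table might look like the sketch below. `<=>` is pgvector's cosine distance operator, which matches the `vector_cosine_ops` index above. The helper names and the `%s` placeholder style (as used by psycopg-like drivers) are assumptions, and no database connection is opened here.

```python
from typing import Sequence

# Orders rows by cosine distance to the query embedding; smaller is closer.
NEAREST_ONTOLOGY_SQL = """
SELECT name, element_type
FROM ontology_embeddings
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def to_vector_literal(values: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def nearest_ontology_query(
    embedding: Sequence[float], limit: int = 5, expected_dim: int = 1536
) -> tuple[str, tuple[str, int]]:
    """Return the SQL text and parameters for a nearest-neighbor lookup."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(embedding)}"
        )
    return NEAREST_ONTOLOGY_SQL, (to_vector_literal(embedding), limit)
```

Checking the dimension before the query is sent gives a clearer error than the one the database would raise for a mismatched vector.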

6.3 API Design

The core APIs the system exposes:

get_domain_list() -> String
  Returns formatted domain descriptions
  for LLM Layer 1 prompt

slice_schema(domains: [String]) -> Schema
  Returns ontology subset for selected
  domains (auto-includes CORE)

create_proposal(proposal: JSON) -> ID
  Stores new proposal, computes embedding

get_pending_proposals(domains: [String]) -> [Proposal]
  Returns pending proposals for domains

validate_proposal(id: ID) -> Validation
  Checks for duplicates/conflicts

apply_proposals(ids: [ID]) -> Result
  Atomically applies approved proposals
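
To make the `slice_schema` contract concrete, here is a minimal in-memory sketch. The `Schema` dataclass, the `ONTOLOGY` dictionary, and the `MARKETS` and `PEOPLE` domains with their contents are hypothetical stand-ins; a real implementation would presumably read these from the Neo4j meta-schema.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    entity_types: set[str] = field(default_factory=set)
    relation_types: set[str] = field(default_factory=set)


# Hypothetical ontology used only for illustration.
ONTOLOGY = {
    "CORE": Schema({"COMPANY", "COUNTRY"}, {"HEADQUARTERS_IN"}),
    "MARKETS": Schema({"EXCHANGE"}, {"LISTED_ON"}),
    "PEOPLE": Schema({"PERSON"}, {"CEO_OF"}),
}
ALWAYS_INCLUDE = {"CORE"}  # mirrors always_include: true on the CORE domain


def slice_schema(domains: list[str]) -> Schema:
    """Return the ontology subset for the selected domains plus CORE."""
    selected = set(domains) | ALWAYS_INCLUDE
    unknown = selected - set(ONTOLOGY)
    if unknown:
        raise KeyError(f"unknown domains: {sorted(unknown)}")
    result = Schema()
    for domain in selected:
        # Union into fresh sets so the stored ontology is never mutated.
        result.entity_types |= ONTOLOGY[domain].entity_types
        result.relation_types |= ONTOLOGY[domain].relation_types
    return result
```

Rejecting unknown domain names early is one possible design choice; it surfaces a Layer 1 hallucination before an incomplete schema reaches the extraction prompt.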

6.4 LLM Prompting

Effective prompts matter. Key principles:

Structured output: request JSON so the response can be parsed
Clear constraints: specify naming conventions
Examples: include few-shot examples
Reasoning: ask for an explanation alongside the answer

Example structure of a domain selection prompt:

You are a domain selector for a
financial knowledge graph.

Available domains:
[domain list with descriptions]

Document:
[document text]

Select 1-3 most relevant domains.

Output JSON:
{
  "domains": ["DOMAIN_1", ...],
  "reasoning": "..."
}
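
The prompt structure above could be filled in and its reply validated roughly as follows. The function names `build_prompt` and `parse_selection` are hypothetical, and the LLM call itself is left out; the sketch only shows template filling and defensive parsing of the JSON reply.

```python
import json

PROMPT_TEMPLATE = """You are a domain selector for a
financial knowledge graph.

Available domains:
{domain_list}

Document:
{document}

Select 1-3 most relevant domains.

Output JSON:
{{
  "domains": ["DOMAIN_1", ...],
  "reasoning": "..."
}}"""


def build_prompt(domain_list: str, document: str) -> str:
    """Fill the template; doubled braces keep the JSON example literal."""
    return PROMPT_TEMPLATE.format(domain_list=domain_list, document=document)


def parse_selection(raw: str, known_domains: set[str]) -> list[str]:
    """Validate the model reply; raise ValueError so the caller can retry."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    domains = data.get("domains") if isinstance(data, dict) else None
    if not isinstance(domains, list) or not 1 <= len(domains) <= 3:
        raise ValueError("expected a list of 1-3 domains")
    unknown = [
        d for d in domains
        if not isinstance(d, str) or d not in known_domains
    ]
    if unknown:
        raise ValueError(f"unknown domains: {unknown}")
    return domains
```

Checking the selected names against the known domain list catches replies that are well-formed JSON but name domains that do not exist.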

6.5 Error Handling

The system should handle common failure modes:

LLM failure: retry with backoff; fall back to manual review
Parse error: validate the JSON output; request a retry if it is malformed
Transaction failure: roll back and notify; preserve consistency
Embedding failure: queue for retry; do not block the pipeline
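
The retry-then-fallback policy for LLM failures can be sketched as below. The helper name `call_with_backoff` and its defaults are assumptions; returning `None` stands in for "route the document to manual review". A production version would likely catch only transient error types rather than every exception.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Retry `call` with exponential backoff.

    Returns the call's result, or None once every attempt has failed so
    the caller can fall back to manual review.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                return None
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

The `sleep` parameter is injectable so the policy can be exercised in tests without actually waiting.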

6.6 Performance Considerations

Token Cost

The two-layer architecture substantially reduces token cost:

Single layer: ~35K tokens/doc
Two layer:    ~10K tokens/doc
Savings:      ~70%

For high-volume processing this difference is significant: at these estimates, every 1,000 documents saves roughly 25M tokens.
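
The arithmetic behind the savings figure is simple enough to check directly. The per-document token counts below are the estimates quoted above; the function names are illustrative only.

```python
SINGLE_LAYER_TOKENS = 35_000  # approximate tokens per document, single layer
TWO_LAYER_TOKENS = 10_000     # approximate tokens per document, two layers


def savings_fraction(before: int, after: int) -> float:
    """Fraction of tokens saved when moving from `before` to `after`."""
    return (before - after) / before


def tokens_saved(num_docs: int) -> int:
    """Total tokens saved across a batch of documents."""
    return num_docs * (SINGLE_LAYER_TOKENS - TWO_LAYER_TOKENS)


fraction = savings_fraction(SINGLE_LAYER_TOKENS, TWO_LAYER_TOKENS)
print(f"{fraction:.0%}")      # 25K of 35K tokens, i.e. about 71%
print(tokens_saved(1_000))    # 25000000
```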

Latency

Sequential LLM calls add latency. Mitigations:

Batch documents where possible
Cache domain selections for similar documents
Use a faster model for Layer 1 (a less complex task)
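
One simple way to cache domain selections is to key on a hash of the normalized document text, as sketched below. This is a deliberate simplification: it only matches documents that are identical after case and whitespace normalization, whereas matching merely similar documents would need near-duplicate detection or embedding comparison, which is not shown. The class name `DomainSelectionCache` is hypothetical.

```python
import hashlib
from typing import Callable


class DomainSelectionCache:
    """Memoize Layer 1 results keyed by a hash of normalized text."""

    def __init__(self, select: Callable[[str], list[str]]) -> None:
        self._select = select  # the (expensive) Layer 1 LLM call
        self._cache: dict[str, list[str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        # Lowercase and collapse whitespace so trivial variants collide.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[str]:
        key = self._key(text)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._select(text)
        # Return a copy so callers cannot mutate the cached list.
        return list(self._cache[key])
```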

Embedding Query

Vector similarity search should be fast:

Use approximate nearest neighbor search (IVFFlat)
Restrict the search to the relevant domains
Cache frequently accessed embeddings

6.7 Monitoring

Key metrics to track:

Gap detection rate: gaps per document
Proposal approval rate: a quality indicator
Review backlog: queue health
Token usage: cost tracking
Duplicate detection: validation effectiveness
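
These metrics could be derived from a handful of counters, as in the sketch below. The `PipelineMetrics` class, its field names, and the exact definitions (for example, measuring duplicates against all submitted proposals) are assumptions made for illustration, not definitions taken from the framework.

```python
from dataclasses import dataclass


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that reports 0.0 when nothing has been counted yet."""
    return numerator / denominator if denominator else 0.0


@dataclass
class PipelineMetrics:
    documents: int = 0
    gaps_detected: int = 0
    proposals_approved: int = 0
    proposals_rejected: int = 0
    proposals_pending: int = 0
    duplicates_flagged: int = 0
    tokens_used: int = 0

    def gap_detection_rate(self) -> float:
        return _ratio(self.gaps_detected, self.documents)

    def approval_rate(self) -> float:
        # Only reviewed proposals count toward the approval rate.
        reviewed = self.proposals_approved + self.proposals_rejected
        return _ratio(self.proposals_approved, reviewed)

    def review_backlog(self) -> int:
        return self.proposals_pending

    def tokens_per_document(self) -> float:
        return _ratio(self.tokens_used, self.documents)

    def duplicate_rate(self) -> float:
        submitted = (
            self.proposals_approved
            + self.proposals_rejected
            + self.proposals_pending
        )
        return _ratio(self.duplicates_flagged, submitted)
```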