Skip to content

UCCA Engine Refactor Plan v1

Purpose: Complete refactor plan to transform ucca-engine from a TGA-specific course generator into a corpus-agnostic CCO processing engine. Design decisions and rationale only — no code.

Inputs: - engine-blueprint-v1.md — codebase audit (2026-03-09) - triumvirate-adapter-design-v1.md — Triumvirate schema and Corpus Adapter interface design (2026-03-09) - ucca-db-dump-20260309.sql — PostgreSQL 16 production database dump - ucca CLI wrapper — bash entrypoint with 9 commands - UCCA Gate whitepaper — architectural vision for CCO framework

Date: 2026-03-09 Status: Design document — decisions and rationale. No code.


Table of Contents

  1. Database Migration Plan (PostgreSQL 16 → Cloudflare D1/SQLite)
  2. Schema Separation (Core Engine vs TGA Adapter)
  3. Triumvirate Schema Design (Generalizing framework_* Tables)
  4. Corpus Adapter Interface
  5. Refactor Phases

1. Database Migration Plan

1.1 Current State

The production database is PostgreSQL 16.11 (Debian), accessed via SQLAlchemy ORM. The dump reveals:

19 tables (excluding alembic_version):

Table PK Type Row Estimate PG-Specific Features
course serial int ~10 timestamp with time zone, DEFAULT now()
course_artifacts bigserial ~100 jsonb (none used here, but format is varchar)
course_module serial int ~50 jsonb (generator_extras)
course_source_map uuid (gen_random_uuid()) ~10 pgcrypto extension, uuid PK, CHECK constraint with ANY(ARRAY[...])
course_unit_map bigserial ~10 Partial unique index (WHERE role = 'primary')
framework serial int ~2
framework_unit serial int ~10
framework_element serial int ~50
framework_criterion serial int ~200
module_element_map serial int ~50
source_unit uuid (gen_random_uuid()) ~10 jsonb (source_payload), uuid PK
source_unit_dependency uuid (gen_random_uuid()) ~20 uuid PK, compound defaults
training_packages varchar(10) PK 54
training_package_domain_map varchar PK ~20
ucca_domain text PK 22
ucca_unit bigserial ~10
ucca_unit_source_map bigserial ~10 Partial unique index (WHERE relation_type = 'current')
ucca_code_sequence composite (domain, level) ~10

Total estimated rows: ~650. This is a small dataset.

1.2 PostgreSQL → SQLite Compatibility Issues

Issue 1: pgcrypto extension and gen_random_uuid() - Used by: course_source_map, source_unit, source_unit_dependency - SQLite equivalent: None built-in. Must generate UUIDs in application code before insert. - Decision: Generate UUIDs in Python (uuid.uuid4()) and pass as TEXT. D1 stores UUIDs as TEXT natively. This is cleaner anyway — the database should not be responsible for identity generation in a deterministic engine.

Issue 2: jsonb columns - Used by: course_module.generator_extras, source_unit.source_payload - SQLite equivalent: Store as TEXT, parse as JSON in application code. SQLite has json_extract() for queries but no indexing on JSON paths. - Decision: Store as TEXT. Both fields are opaque blobs read/written as whole objects — no partial JSON queries exist in the codebase. No functionality loss.

Issue 3: timestamp with time zone - Used by: every created_at, updated_at, imported_at column - SQLite equivalent: Store as TEXT in ISO 8601 UTC format. D1 has no native timestamp type. - Decision: Store as TEXT (YYYY-MM-DDTHH:MM:SSZ). All timestamps are UTC anyway. Application code normalizes on read.

Issue 4: Partial unique indexes - Used by: ux_course_unit_map_one_primary (WHERE role = 'primary'), ux_course_source_map_one_primary_per_source (WHERE mapping_type = 'primary'), ux_uusm_one_current_per_unit (WHERE relation_type = 'current') - SQLite supports partial indexes (since 3.8.0, 2013). D1 is SQLite 3.x. These should work as-is. - Decision: Keep partial unique indexes. Test during migration to confirm D1 compatibility.

Issue 5: serial / bigserial auto-increment - Used by: most integer PKs - SQLite equivalent: INTEGER PRIMARY KEY AUTOINCREMENT - Decision: Use INTEGER PRIMARY KEY AUTOINCREMENT. Direct mapping.

Issue 6: CHECK constraints with ANY(ARRAY[...]) - Used by: course_source_map (mapping_type = ANY(ARRAY['primary', 'supplementary', 'prerequisite'])) - SQLite equivalent: CHECK (mapping_type IN ('primary', 'supplementary', 'prerequisite')) - Decision: Rewrite to IN (...) syntax. Semantically identical.

Issue 7: Sequences (nextval) - Used by: ucca_code_sequence (manual sequence table for UCCA code allocation) - SQLite equivalent: ucca_code_sequence is a manual table, not a PG sequence — it's just a row with next_seq that gets incremented. Works as-is in SQLite. - Decision: No change needed. The application-managed sequence pattern is DB-agnostic.

Issue 8: ON DELETE CASCADE / ON DELETE RESTRICT / ON DELETE SET NULL - SQLite supports foreign key actions, but they must be enabled per-connection with PRAGMA foreign_keys = ON. - Decision: Enable PRAGMA foreign_keys = ON at connection init. D1 respects this pragma.

1.3 Migration Strategy

Approach: Schema-first, data-second.

The dataset is small (~650 rows). This is not a terabyte migration problem. The strategy is:

  1. Write a D1-compatible DDL script — hand-translate the 19 CREATE TABLE statements from PG to SQLite syntax, resolving the 8 issues above.
  2. Write a data export script — Python script that reads from PG via SQLAlchemy and writes INSERT statements for D1 (or uses D1 REST API to batch-insert).
  3. Validate — row counts match, all foreign key references resolve, all unique constraints hold.
  4. Switch the engine — replace db.py (SQLAlchemy + PG connection) with a D1 access pattern. The engine runs on Cloudflare Workers, so D1 is the natural target.

What NOT to do: - Do not attempt pg_dump → SQLite conversion. The syntax differences are too many and the dataset is too small to justify automated tooling. - Do not use an ORM abstraction to paper over the differences. The engine should speak D1's SQL dialect directly. SQLAlchemy is being removed.

1.4 SQLAlchemy Removal

SQLAlchemy is the current ORM. In the D1 world, it has no purpose:

  • D1 is accessed via HTTP REST API (from Workers) or via wrangler d1 execute locally.
  • The Python engine code will use D1's HTTP API for database operations, or a lightweight SQLite client if running locally.
  • All ORM model definitions become DDL + application-level data classes (dataclasses or TypedDicts).

Decision: Remove SQLAlchemy entirely. Replace with: - DDL scripts for schema management (versioned SQL files, not Alembic migrations) - Dataclasses for in-memory data structures - Raw SQL for queries (parameterized, never string-interpolated)

Rationale: The ORM added complexity without value for this use case. The engine's queries are simple (CRUD + a few joins). The dataset is small. The schema is stable. An ORM's migration system (Alembic) is overkill for a 19-table database with < 1000 rows.

1.5 Database Placement Decision

Two deployment options for the engine database:

Option A: Cloudflare D1 - Pros: Same platform as surfaces, edge-local reads, no separate infra to manage - Cons: SQLite limitations (no stored procedures, limited concurrent writes), D1 is still relatively young - Fits: Small dataset, simple queries, read-heavy workload

Option B: Separate hosted PostgreSQL (e.g., Neon, Supabase) - Pros: Full PG feature set, no migration effort for schema - Cons: Additional infra, latency from Workers to external DB, cost, operational burden - Fits: Large datasets, complex queries, high write concurrency

Decision: Option A — Cloudflare D1.

Rationale: - The dataset is ~650 rows. D1's limitations are irrelevant at this scale. - The engine is a batch processor, not a high-concurrency OLTP system. Write conflicts won't happen. - The surfaces already run on Cloudflare. Keeping the engine database on the same platform eliminates an entire infrastructure dependency. - D1 databases are independently freezable/exportable/deletable — this aligns with the world isolation requirement. - If the engine grows beyond D1's capabilities, migration to external PG is straightforward — the DDL is simple, the data volume is small, and the queries are standard SQL.

1.6 Database Naming

Following the established two-tier pattern: - engine-db — the engine's core schema (Triumvirate structures, CCO records, adapter registrations, audit trails) - Per-world databases remain as they are (rtopacks-db, etc.) — owned by the world, not the engine

The engine database is a platform resource bound to the engine worker, not to any client world. Client-specific data (adapter state, audit logs) lives in the world's database, not in engine-db.


2. Schema Separation

2.1 Current Table Classification

Every table in the current schema falls into one of three categories:

Category 1: Core Engine Tables — generic, not TGA-specific, reusable as-is or with minor generalization.

Table Assessment Action
ucca_domain Generic domain taxonomy. 22 domains, no TGA reference. Keep as-is
ucca_unit Canonical unit identity. Domain code, status, no source-system dependency. Keep as-is
ucca_unit_source_map Links UCCAUnit to SourceUnit. Relation types (current/superseded). Keep as-is
ucca_code_sequence Code allocation sequences per domain+level. Keep as-is
course Course record. UCCACode, domain, level, generation metadata. Keep as-is
course_module Module within a course. Index, title, content. Keep as-is
course_artifacts Artifact storage with type/format/intent/audience classification. Keep as-is
course_unit_map Links Course to UCCAUnit. Keep as-is
course_source_map Links Course to SourceUnit. Keep as-is
module_element_map Links CourseModule to FrameworkElement. Generalize — rename to module_outcome_map when framework tables become Triumvirate tables

Category 2: Embryonic Triumvirate Tables — currently TGA-shaped but structurally close to the Triumvirate's Outcome Specification. Will be generalized.

Table Assessment Action
framework Named "framework" — already generic. Single row per framework. Evolve into Triumvirate envelope metadata
framework_unit source_code + source_release — unit within a framework. Evolve into instrument reference
framework_element Elements within a unit. Element number + title. Evolve into outcome (Triumvirate Instrument 1)
framework_criterion Criteria within an element. Number + text. Evolve into criterion (Triumvirate Instrument 1)

Category 3: TGA-Specific Tables — entirely TGA. Must move out of core engine into the TGA adapter's scope.

Table Assessment Action
training_packages TGA training package registry. Fields: tga_name, nrt_flag, tp_developer, regulator. Move to TGA adapter
training_package_domain_map Maps TGA package codes to UCCA domains. Move to TGA adapter
source_unit Structurally generic (source_system, source_code) but defaults and lifecycle fields are TGA-specific. Generalize — keep in core, remove TGA defaults
source_unit_dependency depends_on_system defaults to "TGA", source defaults to "tga". Generalize — keep in core, remove TGA defaults

2.2 Target Schema Layout

After separation, the engine database has two distinct zones:

engine-db (D1)
├── Core Engine Schema
│   ├── ucca_domain
│   ├── ucca_unit
│   ├── ucca_unit_source_map
│   ├── ucca_code_sequence
│   ├── source_unit              (generalized — no TGA defaults)
│   ├── source_unit_dependency   (generalized — no TGA defaults)
│   ├── course
│   ├── course_module
│   ├── course_artifacts
│   ├── course_unit_map
│   ├── course_source_map
│   └── module_outcome_map       (renamed from module_element_map)
├── Triumvirate Schema (NEW — replaces framework_* tables)
│   ├── triumvirate_envelope
│   ├── triumvirate_instrument
│   ├── triumvirate_outcome
│   ├── triumvirate_criterion
│   ├── triumvirate_evidence_requirement
│   ├── triumvirate_compliance_rule     (Instrument 2 — NEW, unmodeled today)
│   ├── triumvirate_credential          (Instrument 3 — NEW, unmodeled today)
│   └── triumvirate_cross_reference
└── Adapter Registry (NEW)
    ├── adapter_registration
    └── adapter_audit_log

Per-world databases (e.g., rtopacks-db) continue to hold world-specific content. The TGA adapter tables (training_packages, training_package_domain_map) move into the TGA adapter's codebase — they are adapter-internal lookup tables, not engine schema.

2.3 source_unit Generalization

The source_unit table is structurally sound — it has source_system and source_code which are generic. But:

  • source_system defaults to 'TGA' — remove the default, make it required
  • package — TGA concept. Rename to corpus_group (generic: a grouping concept within any source system)
  • version, status, supersedes, superseded_by — these are lifecycle fields that exist in every regulatory system, not just TGA. Keep the columns, but document them as generic lifecycle fields
  • source_payload (jsonb → TEXT) — remains as an opaque blob of adapter-specific raw data. The engine never parses this; the adapter does
  • application_text — TGA-specific name for what is generically "scope statement" or "purpose statement". Rename to scope_text

source_unit after generalization:

CREATE TABLE source_unit (
    id TEXT PRIMARY KEY,                    -- UUID as TEXT
    source_system TEXT NOT NULL,            -- no default
    source_code TEXT NOT NULL,
    corpus_group TEXT,                      -- was "package"
    title TEXT NOT NULL,
    version TEXT,
    status TEXT,
    source_url TEXT,
    scope_text TEXT,                        -- was "application_text"
    source_payload TEXT,                    -- was jsonb, now TEXT (JSON blob)
    created_at TEXT NOT NULL,               -- ISO 8601
    updated_at TEXT NOT NULL,
    imported_at TEXT NOT NULL,
    supersedes TEXT,
    superseded_by TEXT,
    release_date TEXT,                      -- was DATE, now TEXT (YYYY-MM-DD)
    UNIQUE (source_system, source_code, version)
);

2.4 source_unit_dependency Generalization

CREATE TABLE source_unit_dependency (
    id TEXT PRIMARY KEY,                    -- UUID as TEXT
    source_unit_id TEXT NOT NULL REFERENCES source_unit(id) ON DELETE CASCADE,
    depends_on_system TEXT NOT NULL,        -- no default
    depends_on_code TEXT NOT NULL,
    relation_type TEXT NOT NULL,
    condition_text TEXT,
    source TEXT NOT NULL,                   -- no default — must be explicit
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    UNIQUE (source_unit_id, depends_on_system, depends_on_code, relation_type)
);

2.5 What the TGA Adapter Owns

The TGA adapter's internal schema (stored wherever the adapter stores state — could be a separate D1, could be files, could be in-memory):

  • training_packages — the 54 TGA package definitions with tga_name, nrt_flag, etc.
  • training_package_domain_map — the TGA package → UCCA domain mapping
  • All code in tga_packages.py, tga_scraper.py, tga_pdf_processor.py
  • All code in bootstrap_from_tga_unit.py
  • The regex _TGA_UNIT_CODE_RE
  • The URL patterns for training.gov.au
  • The PDF revision resolution logic (R5→R1)
  • The LLM extraction prompt template for TGA PDFs
  • The INDUSTRY_TERMINOLOGY tables (Australian→US)
  • The AQF level extraction logic

The engine never sees any of this. The adapter translates it all into Triumvirate format before the engine touches it.


3. Triumvirate Schema Design

3.1 Evolution from framework_* Tables

The current framework_* tables are the embryonic Triumvirate. Here is the exact mapping:

CURRENT (PG)                    →    TARGET (D1)
─────────────────────────────        ──────────────────────────────
framework                       →    triumvirate_envelope
  id, name, description               triumvirate_id, schema_version,
                                       adapter_id, client_id,
                                       corpus_hash, created_at

framework_unit                  →    triumvirate_instrument
  id, framework_id,                    id, triumvirate_id,
  source_code, source_release,         instrument_type (1/2/3),
  title                                source_system, instrument_id,
                                       instrument_title, version,
                                       effective_date, source_uri,
                                       jurisdiction

framework_element               →    triumvirate_outcome
  id, unit_id,                         id, instrument_id,
  element_number, title                outcome_id_local, title,
                                       description, source_ref,
                                       sort_order

framework_criterion             →    triumvirate_criterion
  id, element_id,                      id, outcome_id,
  criterion_number, text               criterion_id_local,
                                       description, source_ref,
                                       evidence_type

(not modeled)                   →    triumvirate_evidence_requirement
                                       id, instrument_id,
                                       requirement_type, statement,
                                       source_ref

(not modeled)                   →    triumvirate_compliance_rule
                                       id, instrument_id,
                                       rule_id_local, clause_ref,
                                       title, requirement_text,
                                       source_ref, obligation_type

(not modeled)                   →    triumvirate_credential
                                       id, instrument_id,
                                       credential_id_local, title,
                                       credential_type, level,
                                       source_ref, scope_description

(not modeled)                   →    triumvirate_cross_reference
                                       id, source_instrument_id,
                                       source_element_id,
                                       target_instrument_id,
                                       target_element_id,
                                       relationship_type

3.2 D1 DDL for Triumvirate Tables

-- Triumvirate Envelope: the container for all three instruments
CREATE TABLE triumvirate_envelope (
    triumvirate_id TEXT PRIMARY KEY,
    schema_version TEXT NOT NULL DEFAULT '1.0',
    adapter_id TEXT NOT NULL,
    adapter_version TEXT NOT NULL,
    client_id TEXT NOT NULL,
    corpus_id TEXT,
    corpus_title TEXT,
    corpus_version TEXT,
    corpus_hash TEXT,
    structural_valid INTEGER NOT NULL DEFAULT 0,     -- boolean
    semantic_valid INTEGER NOT NULL DEFAULT 0,        -- boolean
    created_at TEXT NOT NULL,
    validation_timestamp TEXT,
    adapter_signature TEXT
);

-- Triumvirate Instrument: one row per instrument (1, 2, or 3) within a Triumvirate
CREATE TABLE triumvirate_instrument (
    id TEXT PRIMARY KEY,
    triumvirate_id TEXT NOT NULL REFERENCES triumvirate_envelope(triumvirate_id) ON DELETE CASCADE,
    instrument_type INTEGER NOT NULL CHECK (instrument_type IN (1, 2, 3)),
    source_system TEXT NOT NULL,
    instrument_id TEXT NOT NULL,
    instrument_title TEXT NOT NULL,
    version TEXT,
    effective_date TEXT,
    source_uri TEXT,
    jurisdiction TEXT,
    regulator TEXT,                                   -- Instrument 2 only
    UNIQUE (triumvirate_id, instrument_type)
);

-- Triumvirate Outcome: outcomes within Instrument 1 (Outcome Specification)
CREATE TABLE triumvirate_outcome (
    id TEXT PRIMARY KEY,
    instrument_id TEXT NOT NULL REFERENCES triumvirate_instrument(id) ON DELETE CASCADE,
    outcome_id_local TEXT NOT NULL,                   -- unique within instrument
    title TEXT NOT NULL,
    description TEXT,
    source_ref TEXT,
    sort_order INTEGER NOT NULL DEFAULT 0,
    UNIQUE (instrument_id, outcome_id_local)
);

-- Triumvirate Criterion: criteria within an outcome
CREATE TABLE triumvirate_criterion (
    id TEXT PRIMARY KEY,
    outcome_id TEXT NOT NULL REFERENCES triumvirate_outcome(id) ON DELETE CASCADE,
    criterion_id_local TEXT NOT NULL,                 -- unique within outcome
    description TEXT NOT NULL,
    source_ref TEXT,
    evidence_type TEXT CHECK (evidence_type IN ('performance', 'knowledge', 'product')),
    sort_order INTEGER NOT NULL DEFAULT 0,
    UNIQUE (outcome_id, criterion_id_local)
);

-- Triumvirate Evidence Requirement: evidence requirements at the instrument level
CREATE TABLE triumvirate_evidence_requirement (
    id TEXT PRIMARY KEY,
    instrument_id TEXT NOT NULL REFERENCES triumvirate_instrument(id) ON DELETE CASCADE,
    requirement_type TEXT NOT NULL
        CHECK (requirement_type IN ('knowledge', 'performance', 'condition')),
    statement TEXT NOT NULL,
    source_ref TEXT,
    sort_order INTEGER NOT NULL DEFAULT 0
);

-- Triumvirate Compliance Rule: rules within Instrument 2 (Compliance Ruleset)
CREATE TABLE triumvirate_compliance_rule (
    id TEXT PRIMARY KEY,
    instrument_id TEXT NOT NULL REFERENCES triumvirate_instrument(id) ON DELETE CASCADE,
    rule_id_local TEXT NOT NULL,
    clause_ref TEXT,
    title TEXT,
    requirement_text TEXT NOT NULL,
    source_ref TEXT,
    obligation_type TEXT CHECK (obligation_type IN ('mandatory', 'conditional', 'guidance')),
    sort_order INTEGER NOT NULL DEFAULT 0,
    UNIQUE (instrument_id, rule_id_local)
);

-- Triumvirate Credential: credentials within Instrument 3 (Credential Map)
CREATE TABLE triumvirate_credential (
    id TEXT PRIMARY KEY,
    instrument_id TEXT NOT NULL REFERENCES triumvirate_instrument(id) ON DELETE CASCADE,
    credential_id_local TEXT NOT NULL,
    title TEXT NOT NULL,
    credential_type TEXT CHECK (credential_type IN (
        'qualification', 'licence', 'endorsement',
        'unit_credential', 'skill_set', 'rating'
    )),
    level TEXT,
    source_ref TEXT,
    scope_description TEXT,
    validity_period TEXT,                             -- ISO 8601 duration
    sort_order INTEGER NOT NULL DEFAULT 0,
    UNIQUE (instrument_id, credential_id_local)
);

-- Triumvirate Cross-Reference: links between elements across instruments
CREATE TABLE triumvirate_cross_reference (
    id TEXT PRIMARY KEY,
    triumvirate_id TEXT NOT NULL REFERENCES triumvirate_envelope(triumvirate_id) ON DELETE CASCADE,
    source_table TEXT NOT NULL,                       -- 'triumvirate_outcome', 'triumvirate_compliance_rule', etc.
    source_element_id TEXT NOT NULL,
    target_table TEXT NOT NULL,
    target_element_id TEXT NOT NULL,
    relationship_type TEXT NOT NULL
        CHECK (relationship_type IN (
            'constrains', 'requires', 'modifies',
            'prerequisite', 'corequisite', 'supersedes'
        ))
);

3.3 Design Decisions

D1: instrument_type is an integer (1/2/3), not a string. Rationale: The three instruments are a fixed, ordered set. Using integers prevents typos and enables natural ordering. The Triumvirate is always three instruments — no more, no less. The integer maps to: 1 = Outcome Specification, 2 = Compliance Ruleset, 3 = Credential Map.

D2: Cross-references use a generic reference table, not per-instrument foreign keys. Rationale: The cross-reference graph connects elements across all three instruments in multiple directions (outcomes↔rules, outcomes↔credentials, rules↔credentials). A generic reference table is simpler than three separate junction tables and handles future relationship types without schema changes. The tradeoff is that referential integrity on source_element_id / target_element_id must be enforced in application code, not via foreign keys.

D3: All IDs are TEXT (UUIDs generated in application code). Rationale: D1 has no native UUID type. TEXT is the natural representation. Generating UUIDs in application code (not in the database) aligns with the engine's deterministic processing principle — the engine controls identity generation, not the database.

D4: *_local ID columns for within-instrument addressing. Fields like outcome_id_local, criterion_id_local, rule_id_local, credential_id_local are the identifiers meaningful within the source instrument (e.g., "Element 1", "Clause 1.1"). They are distinct from the table's primary key (id) which is the engine's internal UUID. This allows the Triumvirate to reference elements using the source document's numbering system while maintaining internal referential integrity.

D5: Compliance rules and credentials are stored flat, not as nested JSON. Rationale: The triumvirate-adapter-design-v1.md schema shows nested structures (applies_to[], evidence_of_compliance, requirements, lifecycle, packaging_rules). These are valid at the API/interchange level but should be stored as separate tables or flattened columns in D1. Nested JSON in SQLite loses queryability.

Decision: For v1, store the nested structures as JSON TEXT columns on the parent row (e.g., triumvirate_compliance_rule.applies_to_json, triumvirate_credential.requirements_json, triumvirate_credential.lifecycle_json). This preserves the full structure from the adapter without requiring 10+ additional junction tables. When query patterns emerge that need relational access to nested data, extract specific tables then.

3.4 module_element_map → module_outcome_map

The existing module_element_map links course modules to framework elements. In the new schema, framework elements become Triumvirate outcomes. The table rename reflects this:

CREATE TABLE module_outcome_map (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    module_id INTEGER NOT NULL REFERENCES course_module(id),
    outcome_id TEXT NOT NULL REFERENCES triumvirate_outcome(id),
    UNIQUE (module_id, outcome_id)
);

This is the only structural bridge between the course layer (output) and the Triumvirate layer (input). It records which Triumvirate outcome each course module addresses — the provenance chain from output back to legislative source.


4. Corpus Adapter Interface

4.1 Relationship to triumvirate-adapter-design-v1.md

The triumvirate-adapter-design-v1.md defines the Corpus Adapter interface at the API/contract level. This section addresses the implementation architecture — where adapters live in the codebase, how they're registered, how they interact with the database, and how the TGA adapter is extracted from the current codebase.

4.2 Directory Structure

ucca-engine/
├── engine/                          ← Core engine (corpus-agnostic)
│   ├── triumvirate/                       ← Triumvirate schema handling
│   │   ├── schema.py                ← Triumvirate dataclasses + validation
│   │   ├── store.py                 ← D1 read/write for Triumvirate tables
│   │   └── integrity.py             ← Cross-reference validation
│   ├── cco/                         ← CCO processing (future)
│   ├── models.py                    ← Core dataclasses (Course, Module, etc.)
│   ├── db.py                        ← D1 access layer
│   └── code_allocator.py            ← UCCA code allocation
├── adapters/                        ← Corpus adapters (one per source system)
│   ├── base.py                      ← Abstract CorpusAdapter base class
│   ├── registry.py                  ← Adapter registration + lookup
│   └── tga/                         ← TGA adapter (first implementation)
│       ├── adapter.py               ← TGAAdapter(CorpusAdapter)
│       ├── scraper.py               ← ← generator/tga_scraper.py
│       ├── pdf_processor.py         ← ← generator/tga_pdf_processor.py
│       ├── packages.py              ← ← generator/tga_packages.py
│       ├── domain_map.py            ← ← backend/app/utils/ucca_domains.py (TGA parts)
│       └── data/                    ← TGA-specific lookup data
│           ├── packages.json        ← ← training_packages table (exported)
│           └── domain_map.json      ← ← training_package_domain_map (exported)
├── generator/                       ← Content generation (refactored)
│   ├── course_generator.py          ← ← us_course_generator.py (generalized)
│   ├── artifact_generators/
│   │   ├── marketing.py             ← ← learnworlds_generator.py + pictory_generator.py
│   │   └── workbook.py
│   └── config.py                    ← ← ucca_config.py (TGA-specific parts removed)
├── cli/                             ← CLI layer
│   └── ucca                         ← ← ucca bash wrapper (unchanged interface)
└── scripts/                         ← ← backend/scripts/ (refactored)

4.3 TGA Adapter Extraction

The TGA adapter is constructed from existing code. Here's the file-by-file extraction map:

Current Location Target What Moves
generator/tga_scraper.py (230 lines) adapters/tga/scraper.py Entire file — unchanged logic, new namespace
generator/tga_pdf_processor.py (444 lines) adapters/tga/pdf_processor.py Entire file — unchanged logic, new namespace
generator/tga_packages.py (731 lines) adapters/tga/packages.py Entire file — the 54-package registry
backend/app/utils/ucca_domains.py lines 198–256 adapters/tga/domain_map.py get_domain_for_tga_package() and get_level_from_tga_code() — TGA-specific resolution functions
backend/app/utils/ucca_domains.py lines 1–196 engine/domains.py The 22-domain taxonomy definition — generic, stays in engine
backend/app/models/training_package.py (63 lines) adapters/tga/data/ TrainingPackage model → exported JSON lookup. The DB table is eliminated; the adapter uses local JSON.
backend/scripts/bootstrap_from_tga_unit.py (1025 lines) adapters/tga/adapter.py Refactored into the TGA adapter's identify(), ingest(), translate() methods
backend/scripts/sync_training_packages.py adapters/tga/ Becomes an adapter-internal data maintenance script
backend/scripts/sync_training_package_domain_map.py adapters/tga/ Same
backend/scripts/sync_training_package_us_meta.py adapters/tga/ Same
generator/ucca_config.py lines 82–156 adapters/tga/config.py INDUSTRY_TERMINOLOGY, US_CONTEXT, MARKETING — TGA/VET-specific content
generator/ucca_config.py lines 1–80 generator/config.py Branding, pricing, AI model config — generic, stays in generator

4.4 Adapter Registration

CREATE TABLE adapter_registration (
    adapter_id TEXT PRIMARY KEY,
    adapter_type TEXT NOT NULL,              -- e.g. "tga", "faa"
    adapter_version TEXT NOT NULL,
    source_system TEXT NOT NULL,
    supported_corpus_types TEXT NOT NULL,    -- JSON array as TEXT
    triumvirate_schema_version TEXT NOT NULL,
    registered_at TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active'
        CHECK (status IN ('active', 'suspended', 'retired'))
);

CREATE TABLE adapter_audit_log (
    id TEXT PRIMARY KEY,
    adapter_id TEXT NOT NULL REFERENCES adapter_registration(adapter_id),
    client_id TEXT NOT NULL,
    event_type TEXT NOT NULL,
    event_data TEXT,                         -- JSON
    created_at TEXT NOT NULL
);

Decision: Adapter registration is in engine-db, not in the adapter's own storage. The engine needs to know which adapters exist and what they support. The audit log is per-adapter, per-client — the isolation boundary is at the adapter_id + client_id level.

4.5 How the TGA Adapter Implements the Interface

Mapping the six abstract methods from triumvirate-adapter-design-v1.md to the existing TGA codebase:

Method Current Implementation Refactored Implementation
identify() Line 67 of bootstrap_from_tga_unit.py: regex _TGA_UNIT_CODE_RE validates format, infers package from 3-letter prefix Same logic, wrapped in CorpusIdentity return type
ingest() Lines 400–500 of bootstrap_from_tga_unit.py: HTML scrape + PDF download from training.gov.au Same logic, returns IngestionResult with acquisition log
translate() tga_pdf_processor.py LLM extraction + bootstrap_from_tga_unit.py DB bootstrap LLM extraction produces raw_corpus, then a new mapping function builds Triumvirate from TGA JSON structure
validate() No explicit validation today — data is trusted from TGA New: structural validation against Triumvirate schema + semantic validation (every element has ≥1 criterion, etc.)
audit_entry() Not implemented — no audit trail today New: writes to adapter_audit_log table
version() Not implemented New: returns adapter identity (tga, version, schema version, supported types)

The translate() method is the core work. It maps:

TGA elements[]                → triumvirate_outcome rows
TGA performance_criteria[]    → triumvirate_criterion rows (evidence_type = 'performance')
TGA knowledge_evidence[]      → triumvirate_evidence_requirement rows (requirement_type = 'knowledge')
TGA performance_evidence[]    → triumvirate_evidence_requirement rows (requirement_type = 'performance')

This mapping was already proven in triumvirate-adapter-design-v1.md Section 4 (TGA Mapping Proof).


5. Refactor Phases

5.1 Phase Design Principles

P1: Canary suite must pass at every phase boundary. No phase is complete until the canary suite runs green. If the canary breaks, the phase is not done.

P2: Each phase produces a working engine. No phase leaves the engine in an intermediate state where it can't process a TGA unit end-to-end. The refactor is a series of working states, not a tear-down-and-rebuild.

P3: No big-bang switches. Each phase changes one architectural layer while the others hold steady. Database migration doesn't happen simultaneously with schema generalization.

P4: The CLI interface (./ucca) does not change. The user's interface to the engine is the CLI. All 9 commands continue to work identically throughout the refactor. Internal reorganization is invisible from the CLI.

5.2 Phase 1: Schema Separation (Core vs TGA)

Goal: Separate TGA-specific code from core engine code in the directory structure, without changing any behavior.

What happens: 1. Create adapters/tga/ directory 2. Move TGA-specific files into adapters/tga/ (per the extraction map in §4.3) 3. Update all import paths 4. Add adapters/base.py with the abstract CorpusAdapter class (empty methods, not yet called) 5. TGA adapter implements the interface but internally still calls the same code paths 6. No database changes. No schema changes. No behavior changes.

Why first: This is a pure code reorganization. It establishes the adapter boundary in the codebase without changing any logic. If the canary suite passes after this, the separation is correct.

Canary checkpoint: All existing commands (./ucca run, ./ucca validate, ./ucca canary) produce identical results.

Risk: Import path breakage. Mitigation: check_import_roots.py already exists for exactly this purpose.

5.3 Phase 2: Schema Generalization (framework_* → Triumvirate)

Goal: Evolve the framework_* tables and models into the Triumvirate schema, while TGA data continues to flow through them.

What happens: 1. Add Triumvirate DDL to the database (new tables coexist with framework_* tables) 2. Write migration code that populates Triumvirate tables from framework_* tables for existing data 3. Update ingest_course.py to write to both framework_* and Triumvirate tables (dual-write) 4. Update read paths to read from Triumvirate tables 5. Once all reads come from Triumvirate tables, stop writing to framework_* tables 6. Drop framework_* tables 7. Rename module_element_mapmodule_outcome_map

Why second: The schema must be generalized before the adapter can produce Triumvirate output. But the adapter is already separated (Phase 1), so this change is isolated to the storage layer.

Canary checkpoint: Course generation still works. Run bundle validation still passes. Generated content is byte-identical.

Risk: Data loss during migration. Mitigation: The dataset is ~650 rows. Full backup before migration. Side-by-side comparison of old and new table contents.

5.4 Phase 3: Database Migration (PostgreSQL → D1)

Goal: Replace PostgreSQL with Cloudflare D1 as the engine's database.

What happens: 1. Write D1-compatible DDL script (all tables including Triumvirate schema from Phase 2) 2. Create engine-db in Cloudflare D1 3. Write data export script (PG → D1 INSERT statements) 4. Replace backend/app/db.py (SQLAlchemy session) with a D1 access layer 5. Replace all SQLAlchemy model usage with dataclasses + raw SQL 6. Remove SQLAlchemy, Alembic, psycopg2 from requirements.txt 7. Remove docker-compose.yml (no longer need local PG) 8. Update .env configuration (remove DATABASE_URL, add D1 binding config)

Why third: By Phase 3, the schema is already generalized (Phase 2) and the code is separated (Phase 1). The database migration is a pure infrastructure change — the schema shape is settled, only the access layer changes.

Canary checkpoint: All commands work against D1. Row counts match. Generated content is byte-identical.

Risk: SQLite behavioral differences (type affinity, NULL handling, string comparison). Mitigation: The canary suite catches any behavioral divergence. Run the full suite against D1 before decommissioning PG.

5.5 Phase 4: Adapter Protocol Activation

Goal: Make the TGA adapter implement the full Corpus Adapter interface, with the engine consuming Triumvirate objects from the adapter instead of directly from TGA-specific code paths.

What happens: 1. TGA adapter's identify(), ingest(), translate(), validate() methods become the real entry points 2. bootstrap_from_tga_unit.py refactored: its orchestration logic moves to the adapter's method implementations, the script becomes a thin wrapper that calls adapter.identify() → adapter.ingest() → adapter.translate() → adapter.validate() → engine.process(triumvirate) 3. The engine's process(triumvirate) method receives a validated Triumvirate object and produces output — it never calls TGA-specific code 4. Adapter audit logging activated (adapter_audit_log table) 5. Adapter registration populated for TGA adapter

Why fourth: By Phase 4, the code separation (Phase 1), schema generalization (Phase 2), and database migration (Phase 3) are complete. The adapter protocol is the final wiring — connecting the separated adapter to the generalized engine through the Triumvirate contract.

Canary checkpoint: End-to-end pipeline works via adapter protocol. Audit log records every adapter action. Triumvirate validation gate correctly rejects malformed input.

Risk: Behavioral drift during the orchestration refactor. Mitigation: Phase 4 is explicitly about changing the call graph, not the logic. Each adapter method wraps existing, tested code. The canary suite verifies output equivalence.

5.6 Phase 5: Generator Generalization

Goal: Make the content generation layer (USCourseGenerator) work from Triumvirate data rather than TGA-specific data structures.

What happens: 1. USCourseGenerator refactored to CourseGenerator — reads from Triumvirate outcomes and criteria, not from TGA elements and performance_criteria 2. Authority header in LLM prompts generalized: "Source: {instrument_ref}" instead of "TGA Unit Code: {code}" 3. Package context injection generalized: the adapter provides domain context, not tga_packages.py 4. CONTRACT_V0_FORMATTING_RULES reviewed and generalized if TGA-specific 5. generate_complete_course.py updated to accept a Triumvirate object, not a TGA JSON 6. create_course_spec.py updated: source.tga_unit_codesource.instrument_ref 7. folder_manager.py generalized: folder names use UCCA codes, not TGA codes 8. CQ-01/CQ-02A spikes generalized or removed if UCCA-CARE-100-002-specific

Why fifth: The generator layer is downstream of the adapter and the Triumvirate. By Phase 5, the generator receives Triumvirate data from the engine, not raw TGA data from the pipeline. The refactor changes the generator's input interface, not its output quality.

Canary checkpoint: Generated courses from TGA units are functionally equivalent (same structure, same quality) whether generated via old pipeline or new Triumvirate pipeline. Content may differ slightly due to prompt changes, but structural contracts are preserved.

Risk: LLM output quality changes when prompts are modified. Mitigation: Run a comparison suite — generate the same courses via both paths and diff the structural output (module count, artifact count, learning outcome count). Content text will differ (LLM non-determinism), but structure must be identical.

5.7 Phase 6: Cleanup and Validation

Goal: Remove all dead code, legacy paths, and TGA assumptions from the core engine. Verify the system is truly corpus-agnostic.

What happens: 1. Delete framework_* table references (already dropped in Phase 2, but check for stale imports) 2. Delete backend/app/models/training_package.py (if not already removed) 3. Delete TGA-specific code remaining in core engine paths 4. Remove all source_system = "TGA" defaults and hardcoded references from core code 5. Run a full grep for "tga", "TGA", "training.gov", "training_package" in engine/ — zero results expected 6. Run a full grep for "tga", "TGA" in adapters/tga/ — expected, this is where TGA knowledge lives 7. Update run bundle validators: inputs/tga_unit.jsoninputs/source_unit.json or inputs/triumvirate.json 8. Update run_meta.json contract: tga_codesource_code (or instrument_ref) 9. Full canary suite run 10. OSCAL compliance loop: document the refactor as a CM (Configuration Management) event

Canary checkpoint: Full suite green. No TGA references in core engine. All existing run bundles still validate (backward compatibility mode).

Risk: Breaking existing run bundles. Mitigation: Add a version field to run_meta.json. v1 bundles expect tga_code. v2 bundles expect source_code. The validator handles both. Old bundles are not invalidated.

5.8 Phase Summary

Phase Name Changes Database Risk
1 Schema Separation Code reorganization only None Low (import paths)
2 Schema Generalization New Triumvirate tables, evolve framework_* Add tables, then drop old Medium (data migration)
3 Database Migration PG → D1, SQLAlchemy removal Full migration Medium (behavioral differences)
4 Adapter Protocol Wire adapter interface Adapter tables Low (wrapping existing code)
5 Generator Generalization Triumvirate-based generation None Medium (LLM output quality)
6 Cleanup Dead code removal, validation Schema cleanup Low (deletion only)

Total estimated scope: ~3,500 lines of code changed or moved. ~500 lines of new code (Triumvirate schema, adapter base class, D1 access layer). ~2,000 lines of TGA code relocated to adapters/tga/. ~500 lines deleted (dead code, SQLAlchemy boilerplate).

5.9 Phase Dependencies

Phase 1 (Separation)
Phase 2 (Schema Generalization)
    ├──────────────────┐
    ▼                  ▼
Phase 3 (DB Migration)   Phase 4 (Adapter Protocol)
    │                  │
    └──────┬───────────┘
    Phase 5 (Generator)
    Phase 6 (Cleanup)

Phases 3 and 4 can run in parallel after Phase 2 completes — they change different layers (infrastructure vs application protocol). Phase 5 requires both to be complete (needs D1 + adapter protocol). Phase 6 is strictly sequential after Phase 5.


Appendix A: PostgreSQL → D1 Column Type Mapping

PostgreSQL Type D1/SQLite Type Notes
integer / serial INTEGER Direct mapping
bigint / bigserial INTEGER SQLite integers are 64-bit
character varying(N) TEXT SQLite has no length enforcement on TEXT
text TEXT Direct mapping
boolean INTEGER 0 = false, 1 = true
uuid TEXT Generated in application code
jsonb TEXT Stored as JSON string, parsed in application
timestamp with time zone TEXT ISO 8601 UTC string
date TEXT YYYY-MM-DD string

Appendix B: Tables Retained, Evolved, Moved, and Dropped

Table Disposition Phase
ucca_domain Retained — core engine 1
ucca_unit Retained — core engine 1
ucca_unit_source_map Retained — core engine 1
ucca_code_sequence Retained — core engine 1
source_unit Generalized — remove TGA defaults, rename columns 2
source_unit_dependency Generalized — remove TGA defaults 2
course Retained — core engine 1
course_module Retained — core engine 1
course_artifacts Retained — core engine 1
course_unit_map Retained — core engine 1
course_source_map Retained — core engine 1
module_element_map Evolved — renamed to module_outcome_map, FK target changes 2
framework Dropped — replaced by triumvirate_envelope 2
framework_unit Dropped — replaced by triumvirate_instrument 2
framework_element Dropped — replaced by triumvirate_outcome 2
framework_criterion Dropped — replaced by triumvirate_criterion 2
training_packages Moved — to TGA adapter internal storage 1
training_package_domain_map Moved — to TGA adapter internal storage 1
alembic_version Dropped — Alembic removed with SQLAlchemy 3

Appendix C: New Tables

Table Purpose Phase
triumvirate_envelope Triumvirate container metadata 2
triumvirate_instrument Instrument reference (1 per instrument type per Triumvirate) 2
triumvirate_outcome Outcomes within Instrument 1 2
triumvirate_criterion Criteria within outcomes 2
triumvirate_evidence_requirement Evidence requirements at instrument level 2
triumvirate_compliance_rule Rules within Instrument 2 2
triumvirate_credential Credentials within Instrument 3 2
triumvirate_cross_reference Cross-instrument references 2
adapter_registration Registered adapters and their capabilities 4
adapter_audit_log Per-adapter, per-client audit trail 4

Version History

Version Date Change Author
1.0 2026-03-09 Initial creation Claude Code