Designing the design stage: a comprehensive synthesis for an LLM-driven architecture plugin
A practical playbook for the architecture/design stage of an LLM-driven SDLC plugin: what makes design artifacts good, a file-per-artifact schema that mirrors the requirements layer, a methodology-agnostic core (ISO 42010 + arc42 + C4 + MADR + SEI), and a multi-agent pipeline with deterministic validation gates.
Bottom line up front. The architecture/design stage of an LLM-driven SDLC plugin should consume the requirements artifact and produce a directory of atomic, ID'd Markdown-with-YAML-frontmatter files that mirror the requirements layer one-for-one (ADRs, components, interfaces, views, quality scenarios, threats, fitness functions), bound together by bidirectional traces_from/traces_to graphs. The methodology-agnostic core is ISO/IEC/IEEE 42010:2022 (entities, viewpoints, views, decisions, correspondences) wrapped in the arc42 twelve-section narrative and the C4 four-level visual hierarchy, with MADR 4.0 ADRs and SEI quality attribute scenarios + tactics as the connective tissue between NFRs and architectural choices. The generator should be a multi-agent pipeline (orchestrator → quality-attribute specialists → synthesizer → critic loop → human gate), governed by deterministic linters (schema, cross-reference, diagram, hallucination, drift) that play the same role the ISO/IEC/IEEE 29148 rubric played for requirements. Design is more open-ended than requirements, so the validation strategy must shift from "is this requirement well-formed?" to "is this decision falsifiable, traceable, and bounded by an executable fitness function?"
The rest of this report is organized as a practical playbook: (1) what makes design artifacts good, (2) the artifact schema, (3) the methodology-agnostic core, (4) NFR-driven pattern selection, (5) the multi-agent pipeline, (6) quality gates and validation rules, (7) downstream connections, and (8) handling the open-ended nature of design.
1. What makes an architecture/design artifact good
The community converges on eleven measurable qualities for a design artifact, and almost all of them can be turned into automated checks — this is what makes the design stage tractable for an LLM tool rather than merely a creative one.
Clarity, completeness, correctness, consistency are the table stakes — one canonical reading per element, every in-scope requirement has a corresponding decision and component, semantics align with what the code will actually do, and views don't contradict each other. Traceability runs in two directions: every architectural element must justify itself with at least one upstream requirement, quality scenario, or constraint, and every element must point downstream at a code path, test, or fitness function it will materialize as. Evolvability and testability are properties of where the architecture places its seams — the design should hide likely-to-change decisions behind stable interfaces (Parnas's foundational 1972 insight, which is the root of OCP, ISP, DIP, Protected Variations, and Bounded Contexts), and every interface should be observable enough to test. Right-sizing (Just Enough Up-Front Design, JEUFD) means artifact effort is proportional to scope, novelty, and irreversibility — using Bezos's Type 1 (one-way door) vs. Type 2 (two-way door) distinction as the practical filter. Communication value means multiple audiences (dev, ops, security, PM) extract what they need without slogging through a 200-box "kitchen sink" diagram — this is the deepest argument for C4's audience-graded levels. Finally, falsifiability is the single most LLM-relevant property: the artifact must assert something checkable. "System will be scalable" is noise; "p95 latency ≤ 200 ms at 500 RPS, verified by k6 script tests/perf/checkout.k6.js" is signal.
The good-vs-bad litmus test for components and interactions. A bad component diagram has an amorphous "Backend" box connected by an unlabeled line to "Database" — no responsibility, no protocol, no failure modes. A good one names each component, places labeled ports on its boundary, marks protocol on each edge (HTTPS/JSON, AMQP), distinguishes synchronous from asynchronous arrows, shows error-propagation paths, and is paired with a sequence diagram for the dominant interaction that includes timeouts, idempotency keys, and explicit failure branches. The same gap exists for interface contracts: a bad one says "the API returns the order"; a good one specifies preconditions, postconditions, invariants, an enumerated error taxonomy with retryability flags, idempotency semantics, latency SLAs at p50/p95/p99, and a versioning policy.
Right-sizing decision table. A bug fix touching one module needs at most an ADR if it changes a public contract. A new endpoint needs an OpenAPI delta, one sequence diagram, and an updated error taxonomy. A new service needs C4 container+component, API contract, bounded-context placement, three or more ADRs for major trade-offs, a data model, an updated deployment view, fitness functions for new architectural characteristics, and an error taxonomy. A new system or platform needs all of that plus a context view, integration map, consistency model, multi-region story, threat model, and explicit "sacrificial architecture" plan. The plugin should pick the artifact set automatically from the requirements scope rather than always emit the full kit.
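The decision table above can be sketched as a lookup keyed by change scope. This is a minimal illustration, not the plugin's actual logic: the `Scope` enum, the artifact labels, and the `required_artifacts` helper are all hypothetical names introduced here.

```python
from enum import Enum


class Scope(Enum):
    BUG_FIX = "bug_fix"
    NEW_ENDPOINT = "new_endpoint"
    NEW_SERVICE = "new_service"
    NEW_SYSTEM = "new_system"


# Artifact labels are illustrative shorthand for the items in the table.
SERVICE_KIT = [
    "c4_container_and_component", "api_contract", "bounded_context_placement",
    "adrs_for_major_tradeoffs", "data_model", "deployment_view",
    "fitness_functions", "error_taxonomy",
]

ARTIFACTS_BY_SCOPE = {
    Scope.BUG_FIX: [],
    Scope.NEW_ENDPOINT: ["openapi_delta", "sequence_diagram", "error_taxonomy"],
    Scope.NEW_SERVICE: SERVICE_KIT,
    # A new system needs everything a new service needs, plus the system-level views.
    Scope.NEW_SYSTEM: SERVICE_KIT + [
        "context_view", "integration_map", "consistency_model",
        "multi_region_story", "threat_model", "sacrificial_architecture_plan",
    ],
}


def required_artifacts(scope, changes_public_contract=False):
    """Return the artifact set for a change of the given scope."""
    artifacts = list(ARTIFACTS_BY_SCOPE[scope])
    # A bug fix needs an ADR only when it changes a public contract.
    if scope is Scope.BUG_FIX and changes_public_contract:
        artifacts.append("adr")
    return artifacts
```

Keeping the table as data rather than branching logic makes it easy to review and to override per project.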
2. Recommended artifact schema (mirrors the requirements layer)
The single most important design choice for the plugin is to make the design artifact schema isomorphic to the requirements artifact schema: one file per atomic concept, Markdown body for human/LLM rationale, YAML frontmatter for machine metadata, stable ID per file, traces_from and traces_to fields on every artifact. This is the consensus pattern across Kiro, Spec Kit, BMAD, Google's DESIGN.md, Backstage TechDocs, and Living Documentation, and it lets downstream test/code/DoD generators iterate file-by-file just as the requirements generator does.
File and folder layout. The plugin should emit a docs/architecture/ tree alongside the existing docs/requirements/:
docs/architecture/
├── README.md # arc42 §1–4 narrative; links the rest
├── decisions/ # ADR-NNNN MADR 4.0 files
├── components/ # CMP-NNN one-per-file
├── interfaces/ # INT-NNN one-per-file
├── views/ # VIEW-NNN diagrams + narrative (Mermaid/C4-PlantUML)
├── quality-scenarios/ # QAS-NNN six-part SEI scenarios
├── threats/ # THR-NNN STRIDE/LINDDUN entries
├── fitness-functions/ # FF-NNN executable architecture tests
├── data-model/ # conceptual.md, logical.dbml, physical/
├── apis/ # order-api.openapi.yaml, order-events.asyncapi.yaml, *.proto
├── workspace.dsl # Structurizr/LikeC4 root model (single source of truth)
└── glossary.md # arc42 §12, ubiquitous language
ID scheme mirrors requirements: ADR-0001 (four-digit per MADR convention), CMP-001, INT-001, VIEW-001, QAS-001, THR-001, FF-001. IDs are stable across renames; the filename slug is just a human-readable helper.
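A linter can enforce this ID scheme with two regular expressions. The sketch below is an assumption about how strict the check should be: it also accepts an optional uppercase category infix (as in FF-LAYER-001), which the scheme sentence above does not spell out; `artifact_kind` is a hypothetical helper name.

```python
import re

# ADRs use four digits; every other artifact kind uses three.
ADR_ID = re.compile(r"^ADR-\d{4}$")
# Assumption: an optional uppercase category infix (e.g. FF-LAYER-001) is allowed.
OTHER_ID = re.compile(r"^(CMP|INT|VIEW|QAS|THR|FF)-(?:[A-Z]+-)?\d{3}$")


def artifact_kind(artifact_id):
    """Return the artifact prefix for a well-formed ID, or None if malformed."""
    if ADR_ID.match(artifact_id):
        return "ADR"
    match = OTHER_ID.match(artifact_id)
    return match.group(1) if match else None
```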
Canonical YAML frontmatter shapes. Six artifact types, each with traces_from/traces_to. The component file is the most representative:
---
id: CMP-014
name: Order API
kind: container   # system | container | component | module
parent: SYS-Storefront
technology: "Node 20 / Fastify"
responsibility: >
  Sole authority for the Order aggregate lifecycle (create, pay, cancel, ship).
provides: [INT-007, INT-008]
requires: [INT-022, INT-031]
data_owned: [ENT-Order, ENT-OrderLine]
nfr_commitments:
  - id: NFR-003
    target: "p95 latency < 200ms at 500 rps"
realized_tactics:   # SEI quality tactics actually applied
  - tactic: bounded_queue
    addresses: [QAS-PERF-007]
  - tactic: authenticate_actors
    addresses: [QAS-SEC-007]
error_handling:
  retry_policy: "exponential backoff, max 3"
  dead_letter: "kafka.dlq.orders"
location: "src/order/api"
related_adrs: [ADR-0001, ADR-0007]
related_views: [VIEW-002, VIEW-005]
threats: [THR-018]
traces_from: [FR-014, FR-015, FR-016, NFR-003]
traces_to:
  code: ["src/order/api", "src/order/domain"]
  tests:
    unit: "src/order/**/*_test.ts"
    integration: "tests/integration/order/"
    contract: "pacts/web_order.json"
    performance: "tests/perf/checkout.k6.js"
    security: "tests/security/order_stride.py"
  fitness_functions: [FF-LAYER-001, FF-NAMING-002]
  iac: ["terraform/modules/order/", "k8s/charts/order/"]
  slos: [SLO-ORDER-LAT-001]
  dod_items: [DOD-OAS-001, DOD-HEALTH-001]
status: accepted
---
The ADR file follows MADR 4.0 with status, date, decision-makers, consulted, informed, supersedes/superseded-by/related-to link types, plus traces_from (requirements/threats it answers) and traces_to (components/code it affects) and a confirmation block specifying how compliance is verified (review, fitness function, architecture test). The interface file carries operation signature or schema reference, preconditions, postconditions, invariants, enumerated errors with retryability, idempotency semantics, side effects, an SLA at p50/p95/p99, versioning policy, authn/authz scopes, and rate limits: all the fields a downstream test generator needs to produce contract, property-based, and performance tests without further prompting. The quality scenario file is the SEI six-part shape verbatim (source, stimulus, artifact, environment, response, response measure) plus pointers to the components and tactics that realize it and the SLO that monitors it; this lets a single requirements-level NFR fan out cleanly into a scenario, a tactic, a component obligation, an SLO, a fitness function, and a load test. The threat file follows STRIDE per element/interaction with likelihood/impact/risk and a mitigation list, each mitigation pointing to a component and a security test. The view file embeds a Mermaid or C4-PlantUML diagram in a fenced block and tags itself with C4 level and the components it includes.
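A first structural gate on these files can run before any YAML parsing. The sketch below uses only the standard library to split the frontmatter from the Markdown body and to check that a few top-level keys are present; the `REQUIRED_KEYS` set and the function names are assumptions for illustration, and a real pipeline would hand the frontmatter text to a proper YAML parser and a typed schema.

```python
import re

# Assumed minimum key set; the real schema would be typed per artifact kind.
REQUIRED_KEYS = {"id", "traces_from", "traces_to", "status"}


def split_frontmatter(text):
    """Split a Markdown file into (frontmatter_text, body_text)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with a frontmatter delimiter")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    raise ValueError("frontmatter is not closed")


def top_level_keys(frontmatter):
    """Collect keys that start at column 0 (nested keys are indented)."""
    keys = set()
    for line in frontmatter.splitlines():
        match = re.match(r"^([A-Za-z_][\w-]*):", line)
        if match:
            keys.add(match.group(1))
    return keys


def missing_required_keys(text):
    """Return the required top-level keys that the frontmatter lacks, sorted."""
    frontmatter, _body = split_frontmatter(text)
    return sorted(REQUIRED_KEYS - top_level_keys(frontmatter))
```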
Why this works for LLMs as well as humans. Markdown is the highest-fidelity format in LLM training corpora. YAML frontmatter gives free structured metadata. Atomic files match LLM context-window economics — load interfaces/INT-007.md to work on one endpoint rather than scrolling a 50-page document. Stable IDs make cross-references mechanically checkable. Mermaid and C4-PlantUML diagrams are text, diffable, and LLM-editable. And the schema is the same shape used by the requirements artifact, so a single traceability graph spans the SDLC.
3. The methodology-agnostic core: ISO 42010 + arc42 + C4 + MADR + SEI
The frameworks landscape is wide, but a small core composes coherently and survives agile/waterfall/regulated contexts equally.
ISO/IEC/IEEE 42010:2022 is the meta-standard. It defines the conceptual model the plugin's schema should reflect: an Architecture Description identifies an Entity of Interest, lists stakeholders and their concerns, specifies viewpoints (each framing one or more concerns with model kinds and notations), instantiates views governed by those viewpoints, records architecture decisions with rationale, and ties everything together with correspondences — the standard's first-class traceability mechanism. The 2022 edition renamed "system of interest" to "Entity of Interest" for enterprise applicability, replaced "Architecture Model" with "View Component", elevated "Architecture Description Framework" as a distinct concept (arc42, TOGAF, V&B are ADFs), and dropped the UML metamodel in favor of MOF-agnostic concepts. The plugin's schema (Section 2) maps cleanly: stakeholders/concerns are referenced from requirements, viewpoints are C4 levels and arc42 sections, view components are individual diagrams, decisions are ADRs, and correspondences are the traces_from/traces_to graph.
arc42 supplies the narrative structure. Its twelve sections — Introduction & Goals, Constraints, Context & Scope, Solution Strategy, Building Block View, Runtime View, Deployment View, Crosscutting Concepts, Architecture Decisions, Quality Requirements, Risks & Technical Debt, Glossary — are stable, numbered, and writable in Markdown. They map directly onto the plugin's folder layout: §5 → components/, §6 → views/ (dynamic), §7 → views/ (deployment), §9 → decisions/, §10 → quality-scenarios/, §11 → threats/, §12 → glossary.md. The README.md at the architecture root holds §1–4 as a navigable overview.
C4 supplies the visual hierarchy. System Context for non-technical audiences, Container for architects and SREs (the most useful level — every project should have one), Component for developers working inside a container, Code only when auto-generated. C4 maps to arc42: §3 ↔ Context, §5 levels 1–3 ↔ Container/Component, §6 ↔ Dynamic, §7 ↔ Deployment. For diagrams the plugin should emit Mermaid by default (GitHub renders it natively in PRs, issues, and Markdown) and C4-PlantUML for richer C4 diagrams, with Structurizr DSL or LikeC4 as the canonical single-source-of-truth model when the architecture is large enough that "one model, many views" matters. D2 is a strong fallback for aesthetic-leaning sketches but has a smaller LLM training base than Mermaid/PlantUML, so the LLM is more likely to emit broken syntax — relevant when generation is the bottleneck.
MADR 4.0 is the ADR format to standardize on. It carries Nygard's original spirit (Status, Context, Decision, Consequences) but adds YAML frontmatter, explicit decision-makers/consulted/informed roles, decision drivers, considered options, pros and cons per option, and a Confirmation section that ties the decision to a verification mechanism. Y-Statements (Zimmermann's single-sentence form — "In the context of X, facing concern Y, we decided for Z to achieve W, accepting V") can ride inside the More Information section as a TL;DR. Accepted ADRs are immutable — they're superseded by new ADRs rather than edited; this is the one place the plugin must resist LLM "helpfulness" that would edit history.
SEI quality attribute scenarios and tactics are the bridge between NFRs and structure. The six-part scenario shape is already what the requirements artifact captures for NFRs; the plugin's job is to arrange the high-priority scenarios into a utility tree (root: utility; level 2: ISO 25010 characteristics; leaves: scenarios with importance/difficulty H/M/L), select an architectural style that scores well against the prioritized scenarios, layer secondary patterns (CQRS, Saga, BFF, API Gateway, Strangler) for specific scenarios, and emit a tactic per scenario from the SEI catalog. The catalog is comprehensive and stable: availability tactics (detect/recover/reintroduce/prevent), performance tactics (control resource demand / manage resources), security tactics (detect/resist/react/recover from attacks), modifiability tactics (reduce size / increase cohesion / reduce coupling / defer binding), testability tactics, usability tactics, interoperability tactics, and, added in the 4th edition of Software Architecture in Practice, safety tactics (avoid/detect/contain/recover from hazards).
Domain-Driven Design supplies the decomposition vocabulary. Strategic patterns — bounded context, ubiquitous language, core/supporting/generic subdomains, and context-mapping relationships (Shared Kernel, Customer/Supplier, Conformist, Anticorruption Layer, Open Host Service, Published Language, Separate Ways) — are how the plugin should justify where to draw component boundaries. Tactical patterns — Aggregate with single Aggregate Root, Entity, Value Object, Domain Event, Repository, Domain Service, Application Service, Factory, Module — populate the interior of each bounded context. The Hexagonal / Clean / Onion family (essentially three vocabularies for the same dependency-rule pattern) is the natural fit for inside a bounded context that has rich logic.
4. Architecture pattern selection driven by NFRs
The plugin should drive style selection deterministically from prioritized SEI scenarios and constraints rather than from LLM intuition (which is biased toward microservices, popular stacks, and over-abstraction). The algorithm is:
- Build the utility tree from the NFRs the requirements artifact already carries. Each leaf is an SEI six-part scenario with importance and difficulty.
- Apply hard constraints as filters. Team under 10 engineers eliminates full microservices. Sub-100ms p99 synchronous latency eliminates cold-start serverless. On-prem-only eliminates managed serverless. Regulated audit required nudges toward event sourcing or strong logs. HIPAA/PCI scope requires service/data isolation boundaries.
- Score candidate styles by a dot product of QA priorities and a published impact matrix. Microservices score ++ on modifiability/flexibility/team scale-out and -- on functional suitability under ACID, latency, and operational cost. CQRS scores ++ on read performance and -- on complexity/eventual consistency. Event sourcing scores ++ on audit and -- on schema evolution. Serverless scores ++ on scalability at low load and -- on cold-start latency.
- Apply architect-bias tiebreakers. When two styles score within 5%, prefer the simpler one ("boring architecture wins"). When Performance-latency is top-2 and Maintainability is bottom-3, prefer in-process styles. When Maintainability and Flexibility are both top-3 and the organization has at least three teams, microservices become viable.
- Layer secondary patterns for specific scenarios — Saga (orchestration if many participants or compensating logic; choreography if few and event-native), BFF for diverse clients, API Gateway for cross-cutting concerns at the edge, Strangler Fig for legacy migration over 12–18 months.
- Emit tactic checklists per remaining scenario from the SEI catalog, plus resilience patterns from Nygard's Release It! and Microsoft Cloud Patterns — always pair retry with idempotency and circuit breaker, always pair bulkhead with timeout and a saturation alert, always pair caches with invalidation strategy.
- Produce an ATAM-style risk register — sensitivity points (decisions strongly affecting one QA), tradeoff points (decisions affecting multiple QAs in opposite directions), risks, non-risks — comparing the chosen style against the top scenarios. Empirical studies (arXiv 2506.00150, 2603.28914) show LLMs identify more tradeoffs than student teams when prompted with the ATAM scaffold, so this is a clear win for the multi-agent pipeline.
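The filter, score, and tiebreak steps of the algorithm above can be sketched in a few lines. Everything numeric here is a placeholder: the `IMPACT` matrix values (+2 for "++", -2 for "--"), the `COMPLEXITY_RANK` ordering, and the constraint keys are hypothetical, and only three candidate styles are included to keep the example short.

```python
import math

# Hypothetical impact matrix: +2 stands for "++", -2 for "--", 0 for neutral.
IMPACT = {
    "modular_monolith": {"modifiability": 1, "latency": 1, "operational_cost": 1},
    "microservices": {"modifiability": 2, "latency": -2, "operational_cost": -2},
    "serverless": {"scalability": 2, "latency": -2, "operational_cost": 1},
}
# Lower rank means simpler; used only to break near-ties.
COMPLEXITY_RANK = {"modular_monolith": 1, "serverless": 2, "microservices": 3}


def eliminated(style, constraints):
    """Hard constraints act as filters before any scoring happens."""
    if style == "microservices" and constraints.get("engineers", math.inf) < 10:
        return True
    if style == "serverless":
        if constraints.get("p99_sync_latency_ms", math.inf) < 100:
            return True
        if constraints.get("on_prem_only", False):
            return True
    return False


def score(style, priorities):
    """Dot product of QA priority weights and the style's impact row."""
    return sum(w * IMPACT[style].get(qa, 0) for qa, w in priorities.items())


def select_style(priorities, constraints, tie_margin=0.05):
    candidates = [s for s in IMPACT if not eliminated(s, constraints)]
    if not candidates:
        raise ValueError("every candidate style violates a hard constraint")
    scores = {s: score(s, priorities) for s in candidates}
    best = max(scores.values())
    # Styles within 5% of the best score count as tied; the simpler one wins.
    tied = [s for s, v in scores.items() if best - v <= tie_margin * abs(best)]
    return min(tied, key=COMPLEXITY_RANK.__getitem__)
```

Because the matrix is plain data, the same scores can be printed into the ADR's trade-off table so the reader sees why a style won.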
Guardrails for the LLM. Never recommend microservices without a Conway's-Law fit (team structure) and an explicit deployability/scalability scenario. Never recommend event sourcing without an audit-trail or replay scenario. Never invent abstractions without two concrete instances justifying them (Rule of Three). Always surface what each recommendation sacrifices — every tactic should name the QA it improves and the QA it costs.
5. The multi-agent pipeline
Single-pass LLM generation of full architectures plateaus well short of human expert quality and exhibits architectural erosion even when explicitly instructed otherwise (Slater 2025 measured Claude 4.5 Sonnet, GPT-5.1, and Llama-3-8B violating hexagonal boundaries despite explicit constraints). The empirical consensus is a five-stage orchestrator-worker-critic-human topology:
Requirements artifact + Constitution/Steering
                │
                ▼
┌──────────────────────────────┐
│ ORCHESTRATOR (Architect lead)│
│ parses reqs, decomposes work │
└──────┬─────────┬─────────┬───┘
       ▼         ▼         ▼
┌──────────┐ ┌─────────┐ ┌────────────────────┐
│Structure │ │Data     │ │QA Specialists:     │
│Specialist│ │Modeller │ │ Performance        │
│(patterns,│ │         │ │ Security (STRIDE)  │
│ context  │ │         │ │ SRE / Observability│
│ map)     │ │         │ │ Cost / FinOps      │
└─────┬────┘ └────┬────┘ └─────────┬──────────┘
      └─────┬─────┴────────────────┘
            ▼
┌──────────────────────────────────┐
│ SYNTHESIZER                      │
│ merges into design.md + ADRs +   │
│ component/interface/view files   │
└──────────────┬───────────────────┘
               ▼
┌──────────────────────────────────┐
│ CRITIC LOOP (Evaluator-Optimizer)│
│ • Adversarial Reviewer agent     │
│ • Deterministic linter           │
│ • Cross-artifact consistency     │
│ ↓ Findings Table                 │
└──────────────┬───────────────────┘
             pass? ── no → back to Synthesizer
               │ yes
               ▼
┌──────────────────────────────────┐
│ HUMAN GATE (approve/edit/reject) │
└──────────────┬───────────────────┘
               ▼
downstream: tasks, code, tests, DoD
Why this topology beats single-agent. The orchestrator-worker pattern (Anthropic's "Building Effective Agents") dominates for design tasks because sub-tasks aren't pre-known. Specialist-per-quality-attribute matches BMAD's expansion-pack pattern and ATAM, and empirical ATAM-LLM studies show parallel specialists find more tradeoffs than monolithic agents. A separate synthesizer avoids each specialist over-weighting its own concern. The critic loop should include deterministic linters, not just an LLM reviewer — LLMs disagree with themselves; multi-LLM ensembles (arXiv 2602.07609) improve precision but don't reach 1.0 without hard checks. The human gate at the end is non-negotiable: Kiro, Spec Kit, and BMAD all converge on it, and Fowler/Böckeler's observation that "your role isn't just to steer — it's to verify" applies most acutely to architecture.
When not to multi-agent. Single-agent flows win for small, well-scoped tasks (Kiro's Quick Plan, BMAD Quick Flow). Trigger the full pipeline only when at least two quality attributes are constrained, at least two architectural patterns are candidates, or the change touches a brownfield with non-trivial integration surface. Otherwise the orchestration overhead and cross-agent context loss degrade quality.
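The trigger rule reduces to a single predicate. A minimal sketch, assuming the three inputs have already been extracted from the requirements artifact (the function and parameter names are hypothetical):

```python
def needs_full_pipeline(constrained_qas, candidate_patterns, brownfield_integration):
    """Return True when the multi-agent pipeline is worth its overhead.

    Any one of the three conditions is sufficient; otherwise a
    single-agent flow is the better fit.
    """
    return (
        constrained_qas >= 2
        or candidate_patterns >= 2
        or brownfield_integration
    )
```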
Persona conditioning matters. Distinct system prompts per specialist measurably reduce hallucination cascades (MetaGPT evidence). Architect: "calm, pragmatic, prefers boring tech, connects every decision to business value." Security: "STRIDE-aware, threat-models every component." SRE: "obsessed with failure modes; demands runbooks." Cost: "always names the cheaper alternative."
Retrieval matters more. Gupta 2026 showed retrieval over historical ADRs is the strongest context strategy for ADR generation. The plugin should index the project's own ADRs and feed the nearest neighbors as few-shot examples for every new decision — and tag them with the project's stack, scale, and constraint profile so retrieval matches the kind of decision being made.
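To make the profile-tagging idea concrete, the sketch below ranks historical ADRs by Jaccard overlap between tag sets. This is a deliberately simple stand-in, not the retrieval method of the cited work, which may well use embeddings; the index shape and the `nearest_adrs` helper are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two tag collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_adrs(query_tags, adr_index, k=3):
    """Return up to k ADR IDs whose tags overlap the query, best match first.

    adr_index maps ADR ID -> set of stack/scale/constraint tags.
    Ties are broken by ADR ID so the result is deterministic.
    """
    ranked = sorted(
        adr_index.items(),
        key=lambda item: (-jaccard(query_tags, item[1]), item[0]),
    )
    return [adr_id for adr_id, tags in ranked[:k] if jaccard(query_tags, tags) > 0]
```

The returned IDs would then be loaded as few-shot examples in the prompt for the new decision.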
6. Quality gates and validation rules
Requirements engineering had the ISO/IEC/IEEE 29148 nine-quality rubric (necessary, appropriate, unambiguous, complete, singular, feasible, verifiable, correct, conforming). Design has no equivalent single rubric, but synthesizing across SEI Views & Beyond, MADR, ATAM, Building Evolutionary Architectures, and recent LLM-arch empirical work yields a layered validation strategy: five families of deterministic checks, followed by an LLM-as-judge pass for the soft properties.
Schema and structural checks. Every ADR matches the MADR template (heading regex). YAML frontmatter parses against the typed schemas in Section 2. Every required artifact section exists (design output has all of: components, at least one view, ADRs for any Type 1 decisions, quality scenarios for top NFRs, threat model if external surface). All NEEDS CLARIFICATION sentinels (Spec Kit pattern) are resolved before "Accepted" status. No TBD/TODO/? remains in Accepted artifacts.
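The sentinel rule is the easiest of these to automate. A minimal sketch, assuming the artifact's status and body text have already been extracted; it scans for NEEDS CLARIFICATION, TBD, and TODO only, since a bare "?" is too ambiguous to match safely with a regular expression:

```python
import re

SENTINEL = re.compile(r"NEEDS CLARIFICATION|\bTBD\b|\bTODO\b")


def unresolved_sentinels(status, text):
    """Return (line_number, line) pairs that block an Accepted artifact.

    Draft or proposed artifacts may still carry sentinels, so they
    produce no findings.
    """
    if status.lower() != "accepted":
        return []
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if SENTINEL.search(line)
    ]
```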
Cross-artifact consistency. Every component in prose appears in the architecture diagram and vice versa. Every requirement ID referenced in design exists in requirements/. Every ADR traces to at least one requirement, constraint, or quality scenario. Every data entity in the diagram has a definition in the data model. Every API in apis/ is consumed by at least one component. Every NFR has at least one tactic assigned (the ATAM coverage check). Every threat has at least one mitigation linked to a component and a test. Every accepted ADR has at least one component impact. Orphan detection: requirements with no traces_to, components with no traces_from, ADRs with no enforcing fitness function.
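Two of these checks, dangling requirement references and orphan detection, need nothing more than set arithmetic over the parsed frontmatter. The sketch below assumes the artifacts are already loaded into a dict and derives orphan requirements from the inverse of traces_from rather than from the requirement files' own traces_to; the function name and finding strings are hypothetical.

```python
def consistency_findings(requirements, artifacts):
    """Return (id, problem) findings for broken or missing traces.

    requirements: set of requirement IDs that exist in requirements/.
    artifacts: dict of design artifact ID -> {"traces_from": [...]}.
    """
    findings = []
    referenced = set()
    for artifact_id, meta in sorted(artifacts.items()):
        sources = meta.get("traces_from", [])
        if not sources:
            findings.append((artifact_id, "no traces_from"))
        for requirement in sources:
            if requirement in requirements:
                referenced.add(requirement)
            else:
                findings.append((artifact_id, f"unknown requirement {requirement}"))
    # Requirements that no design artifact answers are orphans.
    for requirement in sorted(requirements - referenced):
        findings.append((requirement, "orphan requirement"))
    return findings
```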
Diagram lint. Mermaid/PlantUML round-trips through the renderer with no syntax errors (auto-repair loop if it fails). No orphan components unless explicitly marked external. No circular dependencies in the DDD context map or module graph (Tarjan SCC). Layer-boundary edges respect allowed directions (hexagonal: domain has no outbound edges to adapters). Diagram node names appear verbatim in prose (catches case/typo drift).
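The cycle check named above can be implemented directly with Tarjan's strongly-connected-components algorithm. This is a compact recursive sketch over an adjacency-list dict (module or context name mapped to the names it depends on); for very deep graphs an iterative version would avoid Python's recursion limit.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps node -> iterable of successor nodes."""
    index_of, lowlink = {}, {}
    stack, on_stack, components = [], set(), []
    counter = 0

    def visit(node):
        nonlocal counter
        index_of[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for successor in graph.get(node, ()):
            if successor not in index_of:
                visit(successor)
                lowlink[node] = min(lowlink[node], lowlink[successor])
            elif successor in on_stack:
                lowlink[node] = min(lowlink[node], index_of[successor])
        # A node whose lowlink equals its own index is the root of a component.
        if lowlink[node] == index_of[node]:
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in list(graph):
        if node not in index_of:
            visit(node)
    return components


def dependency_cycles(graph):
    """Return each cycle as a sorted list of node names."""
    return [
        sorted(component)
        for component in strongly_connected_components(graph)
        # Multi-node components are cycles; so is a node with a self-loop.
        if len(component) > 1 or component[0] in graph.get(component[0], ())
    ]
```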
Hallucination and content checks. Mentioned libraries and packages resolve in npm/pypi/Maven (catches the common LLM failure of fabricating package names). Mentioned API endpoints match a schema in apis/. Each architectural decision presents at least two alternatives or an explicit "single forced option" justification (the most common LLM failure mode is single-option commits). Constitution violations have explicit justifications in a Complexity Tracking section (Spec Kit pattern — penalizes deviations rather than forbidding them). Word-count budget per section forces brevity and reduces filler.
Principles-as-fitness-functions. This is the layer that turns design principles into automatable gates. SRP: classes with more than ~7 public methods spanning unrelated semantic clusters get flagged. DIP: layer-dependency rules in ArchUnit (or NetArchTest/ts-arch/arch-go/import-linter), for example domain packages cannot import infrastructure. Cohesion: LCOM4 threshold. Coupling: Martin's Instability metric I = Ce/(Ca+Ce), where Ce counts outgoing (efferent) dependencies and Ca counts incoming (afferent) ones. Connascence of Position: lint on methods with more than three positional parameters. Connascence of Meaning: AST scan for literals in branch predicates, suggest enums. Connascence of Algorithm: token-based clone detection across services. Connascence of Execution: detect "must call init first" patterns; suggest typestate or builder. Law of Demeter: AST chain depth. Cyclomatic complexity: flag methods above 10. Cycles: none allowed, enforced with ArchUnit's slices().matching(...).should().beFreeOfCycles(). Anemic domain model: ratio of behavior methods on entities vs. service classes. Distributed monolith: cross-service synchronous chains, shared DB schemas, coupled releases (trace analysis from Jaeger/Zipkin plus DB-ownership audit).
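As one worked example of these gates, Martin's Instability metric can be computed from a module dependency map. The sketch assumes the map has already been extracted from imports; returning 0.0 for a module with no dependencies in either direction is a convention chosen here, since the ratio is undefined in that case.

```python
def instability(dependencies):
    """Compute I = Ce / (Ca + Ce) for every module.

    dependencies maps module -> set of modules it depends on.
    Ce is the number of outgoing dependencies, Ca the number of incoming ones.
    """
    modules = set(dependencies)
    for targets in dependencies.values():
        modules |= set(targets)
    result = {}
    for module in modules:
        efferent = len(set(dependencies.get(module, ())) - {module})
        afferent = sum(
            1
            for other, targets in dependencies.items()
            if other != module and module in targets
        )
        total = afferent + efferent
        # Convention: an isolated module is reported as fully stable.
        result[module] = efferent / total if total else 0.0
    return result
```

A fitness function would then assert, for instance, that domain modules stay below a chosen instability threshold while adapter modules are allowed to sit near 1.0.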
LLM-as-judge for the soft properties. A second-pass evaluator agent scores clarity, completeness vs. requirements, trade-off coverage, constraint adherence, and testability on 1–5 scales. Use three independent rater LLMs and average (empirically supported by the multi-LLM validator paper). Below threshold triggers re-prompt with explicit failure feedback. Mandatory N-findings ("zero findings is a failure of the review") prevents rubber-stamping — BMAD's adversarial review uses this directly.
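The aggregation logic around the judges is deterministic even though the judges are not. A minimal sketch, assuming each rater returns a dict of 1 to 5 scores keyed by criterion; the 3.5 threshold, the verdict labels, and the `review_verdict` name are hypothetical choices:

```python
from statistics import mean

CRITERIA = (
    "clarity", "completeness", "tradeoff_coverage",
    "constraint_adherence", "testability",
)


def review_verdict(ratings, findings, threshold=3.5, min_findings=1):
    """Average independent rater scores and decide what happens next.

    ratings: list of dicts, one per rater, mapping criterion -> score (1-5).
    findings: list of review findings produced by the evaluator.
    Returns (verdict, failing_criteria).
    """
    averages = {c: mean(r[c] for r in ratings) for c in CRITERIA}
    failing = sorted(c for c, avg in averages.items() if avg < threshold)
    # Zero findings is treated as a failure of the review itself.
    if len(findings) < min_findings:
        return "reject_review", failing
    return ("re_prompt" if failing else "pass"), failing
```

The failing criteria list doubles as the explicit failure feedback for the re-prompt.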
Together, the five deterministic families and the LLM-as-judge pass let an LLM pipeline self-validate before consuming a human review cycle. They also explain why design is harder to validate than requirements: requirements either are or are not well-formed; design is well-formed relative to its requirements and constraints, so every check is relational.
7. Connecting architecture to downstream artifacts
The point of the design stage is to make the next three stages (implementation, tests, Definition of Done) mechanically generatable. The traces_to graph is the conduit.
Architecture → implementation. Components map to package/module directories with README, OWNERS, and an ADR index. REST interfaces become OpenAPI 3.1 stubs plus typed client SDKs plus Pact consumer stubs. gRPC interfaces become .proto files with generated stubs. Async interfaces become AsyncAPI specs with producer/consumer scaffolds. Deployment views become Dockerfiles, Helm chart skeletons, and Terraform modules. Observability specs become OpenTelemetry SDK initialization, structured-logger config, Prometheus metrics, and SLO YAML. Tooling: Backstage Software Templates for enterprise golden paths, Cookiecutter/Yeoman/Plop/Hygen for lighter scaffolding, JHipster for JVM monoliths, Spring Initializr-style for single-language bootstrap. The plugin should treat the architecture spec as the source of truth and make re-generation idempotent, with warnings on drift.
Architecture → tests. The test pyramid aligns with architectural boundaries: unit tests at the class/function level, integration tests at the component level with Testcontainers, contract tests at every interface boundary with Pact (consumer-driven, with can-i-deploy as a deploy gate), E2E for high-priority quality-scenario happy paths only, performance tests derived from QAS response measures (the plugin emits k6 thresholds directly from response_measure), security tests one per threat with category-specific tooling, chaos tests one per resilience tactic. Test-file traceability is mechanically enforced: every test must declare a traces_from entry pointing at one or more FR/NFR/THR IDs, and the trace can be auto-derived from filename conventions (test_FR_012_login.py).
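Deriving the trace from the filename convention is a one-regex job. The sketch below assumes the pattern shown in the example filename (test_, then the requirement kind, then the number, then a slug); `trace_from_filename` is a hypothetical helper name.

```python
import re
from pathlib import PurePosixPath

# Matches names such as test_FR_012_login.py or test_THR_018_replay.py.
TRACE_IN_NAME = re.compile(r"^test_(FR|NFR|THR)_(\d+)_")


def trace_from_filename(path):
    """Return the requirement ID encoded in a test filename, or None."""
    match = TRACE_IN_NAME.match(PurePosixPath(path).name)
    if not match:
        return None
    return f"{match.group(1)}-{match.group(2)}"
```

A CI gate can then fail any test file for which this returns None and no explicit traces_from annotation exists.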
Architecture → CI/CD pipeline. Pipeline structure derives from the deployment topology plus the threat model plus the NFRs: build → unit tests → fitness functions → SAST/IaC scan → contract tests (consumer side) → security scans (one per H/Critical threat) → integration tests → performance regression gate (from NFR thresholds) → contract verification (provider side) plus can-i-deploy → canary deploy → SLO burn-rate watch plus smoke tests plus automated rollback → progressive rollout → chaos exercises per release. SBOM and Cosign/SLSA attestation slot in per NIST SP 800-218 SSDF practices.
Architecture → infrastructure-as-code. Logical services become Helm charts. Stateful stores become Terraform modules. Network zones become VPC/subnet/security-group definitions tied to trust boundaries from the threat model. Cross-region needs become Terraform workspaces per region plus Route53/Traffic Manager modules. Secrets management ties to External Secrets Operator and Vault/SSM modules. Observability becomes an OTel Collector chart. The mapping is one-to-many — a single component fans out to compute, storage, network, security, and observability modules — and should be expressed explicitly in traces_to.iac as a typed dictionary, not a flat list.
Architecture → Definition of Done. DoD evolves as a quality-gate ladder. Story-level DoD pulls from per-change traceability: code has @Traces(FR-xxx) annotation, ≥80% line coverage on changed files, integration tests for changed interfaces. Sprint-level DoD pulls from interface contracts: every new endpoint has OpenAPI, contract tests pass, can-i-deploy returns ok, every new endpoint emits a canonical wide-event with the required observability fields. Release-level DoD pulls from operational readiness: /healthz and /readyz exist, SLO defined and dashboard exists, runbook per alert, threat model reviewed with all H/Crit mitigations done, PRR checklist signed off. Per-component DoD additions come from ADRs themselves — an ADR mandating libsodium for crypto generates a DOD-CRYPTO-001 item enforced by a fitness function.
The bidirectional traceability graph. Once traces_from and traces_to are populated on every artifact, the plugin can compile the whole corpus into a graph (NetworkX, Neo4j, or DuckDB tables) and run SQL-like queries: every FR/NFR must have at least one traces_to (no orphan requirements), every component must have at least one traces_from (no speculative architecture), every H/Crit threat must trace to both a component and a test, every measurable NFR must have an SLO, every accepted ADR must reference at least one component, every test file must trace to at least one FR/NFR/THR. The fitness functions that enforce these invariants in CI are themselves the most concrete deliverable of the design stage: they are what make traceability survive contact with reality.
Avoiding trace decay. Store traceability as YAML alongside source, reviewable via PRs. Validate at build time. Auto-derive traces from filename and annotation conventions where possible. Track trace-freshness and surface stale traces in release readiness. Generate downstream artifacts rather than hand-write them so traces never drift. Treat the architecture YAML as code with the same review and test gates.
8. Handling the open-ended nature of design
Design is harder than requirements because requirements have a roughly closed shape (FR/NFR/CON/BR with EARS notation, six-part scenarios, INCOSE rubric) while design is generative — the same requirements can yield many defensible architectures. Three strategies tame this without forcing premature closure.
Make every decision falsifiable. The single highest-leverage rule. Unfalsifiable claims ("the system will be scalable") are noise; falsifiable ones ("p95 latency ≤ 200 ms at 500 RPS, verified by tests/perf/checkout.k6.js") are the only legitimate design content. The plugin should refuse to mark any artifact as Accepted if it asserts a quality without binding it to a measurable predicate and a fitness function. This is the Building Evolutionary Architectures discipline restated as a generator invariant.
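The refusal rule can be expressed as a small gate. This is a sketch under assumptions: the `QualityClaim` fields are hypothetical frontmatter keys, and "measurable predicate" is reduced here to a named metric plus a numeric threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QualityClaim:
    """A quality assertion made by an artifact (hypothetical field names)."""
    statement: str
    metric: Optional[str] = None            # e.g. "p95_latency_ms"
    threshold: Optional[float] = None       # e.g. 200.0
    fitness_function: Optional[str] = None  # path to the executable check


def falsifiability_gaps(claim: QualityClaim) -> list[str]:
    """Name the fields a claim still needs before it can be falsified."""
    gaps = []
    if not claim.metric:
        gaps.append("metric")
    if claim.threshold is None:
        gaps.append("threshold")
    if not claim.fitness_function:
        gaps.append("fitness_function")
    return gaps


def can_accept(claims: list[QualityClaim]) -> bool:
    """An artifact may be marked Accepted only if every claim is falsifiable."""
    return all(not falsifiability_gaps(c) for c in claims)
```

Applied to the two examples above, "the system will be scalable" reports all three gaps, while the p95 latency claim bound to tests/perf/checkout.k6.js passes. An artifact with no quality claims also passes, since there is nothing to falsify.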
Generate alternatives, not just choices. Every Type 1 decision (one-way door — primary datastore, public API contract, security model, bounded-context boundaries) must produce an ADR with at least two alternatives and an explicit trade-off table. Type 2 decisions (two-way door — internal module boundaries, internal library choice) get lightweight ADRs or none at all. The plugin should classify each decision's reversibility and tune the artifact rigor accordingly — this is the JEUFD principle made operational. ATAM-style sensitivity-and-tradeoff registers replace the single-option commits that LLMs tend toward.
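One way to tune rigor by reversibility is a lint that applies only to one-way-door decisions. In the sketch below the `Decision` shape, the "one-way"/"two-way" labels, and the representation of the trade-off table as a mapping from option to per-attribute assessment are all assumptions for illustration. Whether the chosen option counts toward the two alternatives is also a convention the plugin would need to fix.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    """An ADR reduced to the fields this lint needs (hypothetical shape)."""
    id: str
    reversibility: str  # assumed labels: "one-way" (Type 1), "two-way" (Type 2)
    alternatives: list[str] = field(default_factory=list)
    # tradeoffs[option][quality_attribute] = assessment, e.g. "+", "-", "0"
    tradeoffs: dict[str, dict[str, str]] = field(default_factory=dict)


def rigor_violations(decision: Decision) -> list[str]:
    """Type 1 decisions need >= 2 alternatives and a full trade-off table."""
    if decision.reversibility != "one-way":
        return []  # Type 2: a lightweight ADR, or none, is acceptable
    problems = []
    if len(decision.alternatives) < 2:
        problems.append(
            f"needs at least two alternatives, found {len(decision.alternatives)}"
        )
    uncovered = [a for a in decision.alternatives if a not in decision.tradeoffs]
    if not decision.tradeoffs:
        problems.append("trade-off table is missing")
    elif uncovered:
        problems.append(f"trade-off table lacks rows for: {', '.join(uncovered)}")
    return problems
```

Classifying reversibility itself remains a judgment call for the LLM and the human reviewer; the lint only enforces the consequences of that classification.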
Treat the design as sacrificial. Fowler's "sacrificial architecture" framing — the first design will be replaced — flips the optimization target from "correctness" to "reversibility under fitness-function-guarded evolution." Combine with "last responsible moment" decisions (delay until further delay would eliminate options, not just as late as possible) and the Rule of Three before extracting abstractions. This is the Evolutionary Architecture stance: the design isn't an answer, it's a starting point with guardrails that automate course-correction. For an LLM tool, this is liberating: the plugin doesn't need to produce a perfect architecture, only a coherent and instrumented one.
Where the field disagrees, surface the disagreement. The plugin should be epistemically honest: SOLID has critics (Dan North's CUPID, Hickey's "Simple Made Easy"); the Anemic Domain Model is "anti-pattern" to Fowler and "legitimate FP separation" to others; microservices have powerful skeptics (Fowler's "MonolithFirst", Newman, Taibi's anti-pattern taxonomy); the Three Pillars of Observability is industry advocacy with active counter-proposals (Charity Majors' Observability 2.0, Hazel Weakly's 3.0); the GoF patterns have Norvig's critique that many dissolve in expressive languages. The plugin shouldn't pretend these are settled — when it makes a contested choice, the ADR should name the opposing view and the reason for the chosen side. This is the difference between an architecture tool and an architecture opinion.
9. Conclusion: what this means for the plugin
The architecture stage of an LLM-driven SDLC plugin is not an open-loop creative phase; it is a constrained translation from a structured requirements corpus into a structured design corpus, with every element traceable upstream and downstream and every quality assertion bound to a fitness function. The schema, the pipeline, and the validation rules together do most of the work. The LLM's contribution is best confined to (a) proposing alternatives within the constrained space, (b) articulating trade-offs against the prioritized quality attributes, (c) drafting ADRs and component specs that conform to the templates, and (d) self-critiquing under adversarial prompts. The deterministic linters and human gates do the rest.
The five highest-leverage choices are: (1) standardize on MADR 4.0 ADRs with YAML frontmatter and a Confirmation block; (2) adopt an isomorphic file-per-artifact schema that mirrors the requirements layer, with traces_from/traces_to on every file; (3) use SEI quality attribute scenarios and tactics as the bridge between NFRs and components; (4) run a multi-agent pipeline of orchestrator, quality-attribute specialists, and critic loop; and (5) enforce a deterministic validation suite that combines schema, cross-reference, diagram, hallucination, and principles-as-fitness-functions checks. Every other choice in this report flows from these five.
The most important insight is that "good architecture" for an LLM-driven generator is not the absence of mistakes — LLMs measurably make architectural-erosion mistakes even when explicitly instructed otherwise — but the presence of mechanical guardrails that make mistakes visible early and cheap to fix. The fitness functions are the architecture, not the diagrams. The traceability graph is the architecture, not the prose. The ADRs are the architecture, not the patterns named within them. Design as a stage exists to manufacture those guardrails. An LLM that produces them well — even if its prose and diagrams are merely competent — has produced a useful architecture; an LLM that produces beautiful prose without them has produced a hallucination dressed as design.