Applying the risk-based software categorization framework to a technology landscape GAMP 5 was never designed for
GAMP 5 (second edition, 2022) provides the pharmaceutical industry's most widely accepted framework for risk-based computer system validation. Its software category model — Categories 1, 3, 4, and 5 — was designed to reflect the degree of configurability and customization in a system, and to scale validation effort accordingly. A commercial off-the-shelf operating system (Category 1) requires less validation effort than a configured ERP (Category 4), which requires less than fully bespoke custom code (Category 5).
AI agent systems do not fit cleanly into this model. A single AI agent deployment may contain components spanning all four categories simultaneously. Worse, some AI components exhibit properties — non-determinism, emergent behavior, continuous learning — that the GAMP 5 framework did not anticipate and for which its standard V-model testing approach is insufficient on its own.
This article does two things. First, it maps every component of a clinical AI agent stack to the most appropriate GAMP 5 category, with justification. Second, and more importantly, it explains where the standard category-based validation approach must be supplemented with AI-specific validation strategies to achieve genuine fitness for regulated use — not just documentation compliance.
The goal is a defensible, risk-proportionate validation approach that a competent authority inspector would find credible.
For alignment, the four active categories in GAMP 5 Second Edition are:
Category 1 — Infrastructure Software: Operating systems, database engines, middleware, network software, and virtual machine platforms. Vendor-supplied, not configured by the user for GxP function. Validation effort focuses on Installation Qualification (IQ) — confirming correct installation and configuration — and on leveraging vendor-supplied documentation. No Operational Qualification (OQ) or Performance Qualification (PQ) is typically required at the Category 1 layer.
Category 3 — Non-Configured Products: Commercial off-the-shelf (COTS) software used as-is, without configuration or customization for GxP function. Examples include off-the-shelf tools used with their default, vendor-supplied configuration. Validation focuses on IQ and basic OQ to confirm the product works as the vendor specifies.
Category 4 — Configured Products: COTS software configured by the user to meet specific GxP requirements. This is the most common category in clinical systems — EDC systems, CTMS, safety databases, and LIMS all typically fall here. Validation effort includes IQ, OQ (testing configured functionality against specifications), and PQ (confirming the system performs correctly in its intended operational environment).
Category 5 — Custom Software (Bespoke and Custom Applications): Software developed specifically for the regulated user, including custom scripts, bespoke integrations, and internally developed applications. It carries the highest validation burden: full software development lifecycle (SDLC) documentation, code review, comprehensive IQ/OQ/PQ, and ongoing change control. GAMP 5 Second Edition explicitly includes custom algorithms and analytical software in this category.
A production AI agent for clinical operations is not a single system. It is a composition of components, each with a different GAMP 5 classification, a different owner, and a different validation approach. The validation lead's first task is to decompose the agent architecture into its constituent components and classify each independently.
The table below provides the classification framework. Detailed justification for each follows.
| Component | GAMP 5 Category | Rationale | Primary Validation Activities |
|---|---|---|---|
| Cloud infrastructure (AWS/Azure compute, networking, storage) | 1 | Vendor-managed, no GxP configuration by user | IQ using vendor GxP qualification packages |
| Container orchestration platform (EKS, AKS) | 1 | Infrastructure software; GxP function not dependent on Kubernetes internals | IQ; configuration specification |
| Database engines (PostgreSQL, underlying QLDB engine) | 1 | Vendor-supplied database software | IQ |
| LLM API (Azure OpenAI, AWS Bedrock) — infrastructure layer | 1 | Compute and API infrastructure; vendor-managed | IQ; Data Processing Agreement review |
| LangSmith Self-Hosted — platform layer | 3→4 | COTS observability product; configured for clinical use | IQ + OQ of configured trace capture and retention |
| Weaviate (vector database) — platform layer | 3→4 | COTS database product configured for clinical RAG | IQ + OQ of schema, access controls, query logging |
| HashiCorp Vault | 4 | COTS product configured for secrets management in GxP context | IQ + OQ of policies, audit log, rotation schedules |
| Prefect (workflow orchestration) | 4 | COTS product configured to implement validated clinical workflows | IQ + OQ of flow definitions, retry logic, alerting |
| LLM model weights (GPT-4o, Claude, Llama) | 4* | Vendor-supplied ML model; configured via prompt and parameters | *Special handling — see discussion below |
| LangGraph agent graph definition | 5 | Custom-developed workflow logic specific to clinical use case | Full SDLC: URS, FS, CS, IQ/OQ/PQ, code review |
| Guardrails AI validators (custom rules) | 5 | Custom-developed business logic | Full SDLC; OQ per validator rule |
| Pydantic output schemas | 5 | Custom-developed data specifications | FS documentation; OQ schema validation tests |
| RAG retrieval corpus (document content + chunking logic) | 5 | Custom data and processing logic for specific clinical context | Data qualification; corpus change control |
| System prompts and prompt templates | 5 | Custom-authored instructions that determine agent behavior | Prompt version control; OQ regression testing |
| Custom integrations (EDC API, safety DB connectors) | 5 | Bespoke integration code | Full SDLC; interface testing |
| OPA Gatekeeper policies | 5 | Custom policy code enforcing deployment controls | Policy specification; OQ |
| Agent reasoning audit log schema + writer | 5 | Custom-developed audit infrastructure | Full SDLC; audit trail OQ |
AWS, Azure, and GCP each publish GxP infrastructure qualification packages that cover their physical data centers, hypervisors, and core managed services. These packages provide IQ evidence for the infrastructure layer and are available under NDA from each provider's compliance team. For EKS and AKS, the Kubernetes control plane is vendor-managed; the user is responsible only for cluster configuration (node pools, network policies, RBAC), which is documented in a Configuration Specification and confirmed during IQ.
The key validation activity at this layer is confirming that the infrastructure configuration matches the approved specification and is deployed within the correct region and compliance boundary. This is an IQ activity, not an OQ or PQ activity.
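In practice, this IQ comparison can be automated. A minimal sketch, assuming a hypothetical approved Configuration Specification expressed as key-value pairs; the settings, values, and `get`-style lookup are illustrative, not real provider APIs:

```python
# Automated IQ check: compare deployed infrastructure settings against the
# approved Configuration Specification. All keys and values are illustrative.

APPROVED_SPEC = {
    "region": "eu-west-1",          # compliance boundary from the Config Spec
    "encryption_at_rest": True,
    "node_pool_size": 3,
    "rbac_enabled": True,
}

def iq_check(deployed: dict, approved: dict) -> list[str]:
    """Return a list of deviations; an empty list means the IQ check passes."""
    deviations = []
    for key, expected in approved.items():
        actual = deployed.get(key)
        if actual != expected:
            deviations.append(f"{key}: expected {expected!r}, found {actual!r}")
    return deviations

deployed = {"region": "eu-west-1", "encryption_at_rest": True,
            "node_pool_size": 3, "rbac_enabled": True}
assert iq_check(deployed, APPROVED_SPEC) == []
```

Any non-empty deviation list is documented as an IQ discrepancy and resolved before the layer is considered qualified.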
Both LangSmith and Weaviate are COTS products that ship with defined functionality. When deployed without GxP-specific configuration, they are Category 3. The moment you configure them for regulated use — defining retention policies in LangSmith, creating a clinical document schema in Weaviate, configuring access controls that restrict agent query permissions — they become Category 4.
For LangSmith, OQ test cases should confirm: traces are captured for every LLM call and tool invocation; traces are retained for the defined period; access to trace data is restricted to authorized roles; and trace export functions work correctly for inspection response scenarios.
For Weaviate, OQ test cases should confirm: the document schema enforces required fields; queries are logged with agent identity and timestamp; unauthorized agents cannot query restricted collections; and hybrid search returns the expected results for defined test queries.
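The deny-by-default access decision behind the third Weaviate OQ case can be sketched as follows. The ACL table, collection names, agent identities, and `authorize()` helper are illustrative stand-ins for the deployed configuration, not the Weaviate client API:

```python
# Sketch of the access-control requirement: unauthorized agents must not be
# able to query restricted collections. All names below are hypothetical.

COLLECTION_ACL = {
    "ProtocolDocuments": {"sae-triage-agent", "query-drafting-agent"},
    "SafetyNarratives": {"sae-triage-agent"},
}

def authorize(agent_id: str, collection: str) -> bool:
    """Deny by default: unknown collections and unlisted agents are refused."""
    return agent_id in COLLECTION_ACL.get(collection, set())

# One OQ test case per branch of the access decision
assert authorize("sae-triage-agent", "SafetyNarratives")
assert not authorize("query-drafting-agent", "SafetyNarratives")
assert not authorize("sae-triage-agent", "UnknownCollection")
```

The deny-by-default shape matters for OQ design: the negative cases (unlisted agent, unknown collection) are the ones that demonstrate the control actually restricts access.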
The Category 4 classification of LLM model weights is the most contested in the industry, and it warrants careful explanation.
LLM model weights are vendor-supplied and not modifiable by the user (in the standard hosted deployment scenario). The user "configures" the model's behavior through the system prompt, temperature, and other inference parameters — analogous to configuring a COTS product. This makes Category 4 the closest fit.
However, standard Category 4 validation — which relies on testing configured behavior against a specification — is fundamentally insufficient for LLMs because of non-determinism. You cannot write an OQ test case that says "given this input, the output shall be exactly this string" and expect it to pass reliably. The model's output distribution, not any single output, is what you are validating.
The correct approach supplements Category 4 validation with three AI-specific activities:
Performance baseline establishment. Rather than testing specific outputs, test output distributions. Define a curated evaluation dataset of representative inputs with gold-standard expected outputs (created by clinical SMEs). Measure the model's performance against this dataset using appropriate metrics (ROUGE for narrative quality, F1 for classification, expert human review scores for complex judgment tasks). This performance baseline becomes the validated specification for the model's behavior.
Sensitivity analysis. Document how the model's outputs vary with changes in the system prompt, temperature, and retrieved context. This analysis informs the risk assessment for prompt change control events.
Ongoing monitoring against baseline. Post-deployment, run the evaluation dataset on a defined schedule (monthly, or triggered by any model version change). Statistically significant degradation against the baseline constitutes a change control event, even if no intentional change was made. LLM providers do update models; your validation must detect when those updates affect regulated outputs.
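The baseline comparison for a classification-style task can be sketched in a few lines. The gold labels, predictions, baseline value, and tolerance below are illustrative; real baselines come from the SME-curated evaluation dataset and the approved monitoring plan:

```python
# Sketch of performance-baseline comparison for a binary classification task
# (e.g., SAE vs. non-SAE triage). All data and thresholds are illustrative.

def f1_score(gold: list[str], pred: list[str], positive: str) -> float:
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["sae", "non_sae", "sae", "non_sae", "sae"]   # SME gold standard
pred = ["sae", "non_sae", "non_sae", "non_sae", "sae"]  # model outputs

score = f1_score(gold, pred, positive="sae")   # 0.80 for the toy data above
BASELINE_F1 = 0.80        # established during validation
TOLERANCE = 0.10          # degradation threshold from the monitoring plan

if score < BASELINE_F1 - TOLERANCE:
    raise RuntimeError(f"F1 {score:.2f} below baseline: open a change control")
```

The same pattern applies with ROUGE or human-review scores substituted for F1; what matters is that the threshold comparison, not any single output string, is the pass/fail criterion.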
When the model is fine-tuned by the organization — even on a small clinical dataset — the classification shifts toward Category 5, because the model weights are now partially custom. Fine-tuning should be treated as a custom software development activity with full SDLC documentation, including a training data qualification package.
The LangGraph graph definition is the heart of the agent's logic. It specifies which tools the agent can call, in what order, under what conditions, and when human checkpoints are mandatory. This is entirely custom-developed code with direct GxP impact. It receives the full Category 5 treatment.
User Requirements Specification (URS): Written by clinical operations and quality, not by engineers. Specifies what the agent must do in business terms: "The agent shall always route an SAE case for medical review before writing to the safety database." These are testable requirements.
Functional Specification (FS): Translates URS requirements into system behavior descriptions. For each LangGraph node, the FS specifies inputs, processing logic, outputs, and error handling. For each conditional edge, the FS specifies the exact conditions that determine routing.
Configuration/Design Specification (CS/DS): Documents the LangGraph implementation: the Python class structure, the state schema, the tool definitions, the interrupt configuration. This is the technical implementation document that a code reviewer and a validator can work from simultaneously.
Code review: All LangGraph graph code, tool implementations, and custom node functions undergo peer code review. Code review records are validation documentation. Pay particular attention to conditional edge logic — these are the deterministic guardrails that contain the non-deterministic LLM behavior, and errors here are the highest-risk failure mode.
OQ test suite: Every node is tested independently (unit tests). Every conditional edge is tested with inputs designed to exercise each branch (branch coverage). Every mandatory human checkpoint is tested to confirm the agent cannot bypass it. Every error handling path is tested. Test cases are documented with expected results and actual results; deviations are investigated and resolved before validation sign-off.
PQ: Executed in the production environment (or a validated production-equivalent environment) using realistic clinical scenarios. PQ should include at least one end-to-end run of each major workflow the agent supports, reviewed by clinical SMEs who confirm that outputs are clinically appropriate, not just technically correct.
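The branch-coverage idea behind the OQ suite can be sketched with a toy conditional-edge function. `route_case()`, the node names, and the confidence threshold are hypothetical, not the production graph; a real LangGraph conditional edge would be a callable of the same shape:

```python
# Sketch of branch-coverage OQ tests for a conditional edge. The routing
# function and node names are illustrative.

def route_case(state: dict) -> str:
    """Deterministic guardrail: SAE cases must always go to medical review."""
    if state.get("is_sae"):
        return "medical_review"       # mandatory human checkpoint
    if state.get("confidence", 0.0) < 0.9:
        return "human_triage"
    return "auto_process"

# One OQ test case per branch (branch coverage)
assert route_case({"is_sae": True, "confidence": 0.99}) == "medical_review"
assert route_case({"is_sae": False, "confidence": 0.5}) == "human_triage"
assert route_case({"is_sae": False, "confidence": 0.95}) == "auto_process"
# Bypass attempt: missing fields must fail safe, not skip human review
assert route_case({}) == "human_triage"
```

The final case is the important one: a malformed state must route to a human, never silently through the automated path.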
Classifying prompts as Category 5 surprises many engineers, who think of them as configuration text rather than code. In a regulated context, system prompts are functional specifications written in natural language — they directly determine the agent's behavior in regulated workflows. A change to the system prompt is a change to the system's behavior and must be managed as such.
The validation implications are significant:
Every prompt must be versioned in source control (LangSmith Hub or a Git repository) with the same rigor as code. The production prompt version must be locked and tied to the validation package. Changes to a prompt — even minor wording changes — require a change control assessment. Changes that could affect regulated outputs require re-execution of relevant OQ test cases.
The OQ test suite for each prompt should include adversarial test cases: inputs designed to probe edge cases, elicit off-specification outputs, or test the prompt's robustness to ambiguous inputs. These adversarial cases are as important as the happy-path cases, because they define the boundary of the validated operational envelope.
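Prompt version locking is straightforward to sketch. Recording a SHA-256 fingerprint at validation sign-off and checking it at deployment is one way to implement the lock; the prompt text and helper names below are illustrative, not a specific LangSmith Hub feature:

```python
# Sketch of prompt version locking: the production prompt's hash must match
# the hash recorded in the validation package. All values are illustrative.

import hashlib

def prompt_fingerprint(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

VALIDATED_PROMPT = ("You are a clinical triage assistant. "
                    "Always route SAE cases for medical review.")
VALIDATED_HASH = prompt_fingerprint(VALIDATED_PROMPT)  # recorded at sign-off

def check_production_prompt(prompt: str) -> None:
    """Halt deployment if the live prompt differs from the validated version."""
    if prompt_fingerprint(prompt) != VALIDATED_HASH:
        raise RuntimeError("Prompt differs from validated version: "
                           "change control assessment required")

check_production_prompt(VALIDATED_PROMPT)   # passes for the locked version
# Even a one-word edit changes the fingerprint and triggers change control:
edited = VALIDATED_PROMPT.replace("Always", "Usually")
assert prompt_fingerprint(edited) != VALIDATED_HASH
```

The one-word edit is the point: "Always" versus "Usually" is exactly the kind of change that looks cosmetic in a diff but materially alters regulated behavior.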
The retrieval corpus — the set of documents the agent can retrieve to inform its outputs — is a regulated data artifact. Its contents directly affect agent outputs, which directly affect regulated records. It must be managed with corresponding rigor.
Document qualification: Every document ingested into the corpus must be approved for use (the same document that exists in your validated document management system, at the same version). A document registry tracks each document's ID, version, ingestion date, embedding model used, and the agent(s) authorized to retrieve it.
Corpus change control: Adding, updating, or removing a document from the corpus is a change control event. The impact assessment must consider: does this document change the information available to the agent in ways that could affect regulated outputs? If the updated protocol changes an inclusion criterion, does the corpus update need to be accompanied by prompt updates, evaluation dataset updates, and OQ regression testing?
Embedding model versioning: If the embedding model used to encode documents changes (e.g., upgrading from text-embedding-ada-002 to text-embedding-3-large), the entire corpus must be re-embedded and the retrieval performance re-validated. This is a major change control event.
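The document registry described above can be sketched as a simple structure. The document IDs, versions, model names, and agent identities are illustrative:

```python
# Sketch of a corpus document registry with the fields described above.
# All entries are illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class CorpusEntry:
    doc_id: str
    dms_version: str          # version in the validated document system
    embedding_model: str
    ingested_on: str
    authorized_agents: frozenset

registry = [
    CorpusEntry("PROT-001", "4.0", "text-embedding-ada-002", "2025-06-01",
                frozenset({"sae-triage-agent"})),
    CorpusEntry("SOP-017", "2.1", "text-embedding-ada-002", "2025-06-01",
                frozenset({"sae-triage-agent", "query-drafting-agent"})),
]

def entries_requiring_reembedding(registry, current_model: str):
    """An embedding model upgrade flags every stale entry for re-embedding."""
    return [e.doc_id for e in registry if e.embedding_model != current_model]

# After an embedding model upgrade, the entire corpus is stale:
stale = entries_requiring_reembedding(registry, "text-embedding-3-large")
assert stale == ["PROT-001", "SOP-017"]
```

Tracking the embedding model per entry is what makes the "re-embed everything" change control event detectable rather than a matter of memory.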
The audit trail infrastructure itself is custom-developed and GxP-critical. Its validation is non-negotiable and should be treated with the same rigor as the agent's core logic.
OQ test cases for the audit trail writer must confirm: every required field in the audit log schema is populated for every agent action; timestamp accuracy is verified against a trusted time source; records written to QLDB are not modifiable after writing; the cryptographic verification function confirms hash chain integrity; and records are queryable by all required search criteria (agent ID, date range, target record ID, workflow step).
A specific OQ test that is often overlooked: simulate an audit trail failure (e.g., QLDB write fails due to a network error). Confirm that the agent halts rather than proceeding with an unlogged action. An agent that can take regulated actions without logging them is non-compliant regardless of how good the rest of the validation is.
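The hash-chain integrity check can be sketched in a few lines. The record layout is illustrative, and a real QLDB deployment would use the ledger's own cryptographic verification API rather than this hand-rolled chain:

```python
# Sketch of hash-chain integrity verification for an audit log. Each record
# carries a hash linking it to its predecessor; tampering with any record
# breaks the chain. Record fields are illustrative.

import hashlib
import json

def chain(records: list[dict]) -> list[dict]:
    """Append each record with a hash linking it to its predecessor."""
    out, prev = [], "0" * 64
    for rec in records:
        digest = hashlib.sha256(
            (prev + json.dumps(rec, sort_keys=True)).encode()).hexdigest()
        out.append({**rec, "prev_hash": prev, "hash": digest})
        prev = digest
    return out

def verify(chained: list[dict]) -> bool:
    """Recompute every link; any mismatch means the log was altered."""
    prev = "0" * 64
    for rec in chained:
        body = {k: v for k, v in rec.items() if k not in ("prev_hash", "hash")}
        digest = hashlib.sha256(
            (prev + json.dumps(body, sort_keys=True)).encode()).hexdigest()
        if rec["prev_hash"] != prev or rec["hash"] != digest:
            return False
        prev = digest
    return True

log = chain([{"agent": "sae-triage", "action": "classify", "record": "CASE-9"},
             {"agent": "sae-triage", "action": "route", "record": "CASE-9"}])
assert verify(log)
log[0]["action"] = "delete"        # tampering breaks the chain
assert not verify(log)
```

An OQ test for this control does exactly what the last two lines do: verify a clean chain, tamper with one record, and confirm verification fails.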
Classifying and validating individual components is necessary but not sufficient. The composite system — the agent as it operates in the clinical environment — must also be validated as a whole. This is where GAMP 5's category model reaches its limit, and where clinical judgment must supplement the framework.
Integration testing confirms that data flows correctly between components: that the LangGraph graph correctly invokes Weaviate for retrieval, that Guardrails AI validators correctly intercept LLM outputs before they reach the downstream system, that the audit log writer correctly captures the full reasoning chain including retrieved document IDs, and that human review interrupts correctly pause execution and resume after electronic signature.
End-to-end scenario testing executes complete clinical workflows — not individual component tests — with realistic data. Each scenario should be designed with clinical SMEs and reviewed by both the validation team and the quality team. Scenarios should include normal operation, edge cases, and failure modes.
Failure mode and effects analysis (FMEA) at the system level identifies the failure modes of the composite system that have the highest risk impact. For each high-risk failure mode (e.g., "agent writes incorrect data to safety database without triggering human review"), the FMEA documents the detection control, the prevention control, and the residual risk. The FMEA drives both the validation test case prioritization and the ongoing monitoring plan.
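FMEA scoring reduces to a simple calculation: each failure mode is rated for severity, occurrence, and detection (conventionally on a 1-10 scale), and their product is the Risk Priority Number (RPN) used for prioritization. The failure modes and ratings below are illustrative:

```python
# Sketch of system-level FMEA scoring. Ratings are 1-10; RPN = S * O * D.
# All failure modes and ratings are illustrative.

failure_modes = [
    {"mode": "Incorrect write to safety DB without human review",
     "severity": 10, "occurrence": 2, "detection": 4},
    {"mode": "Retrieval returns outdated protocol version",
     "severity": 7, "occurrence": 3, "detection": 5},
    {"mode": "Audit log field left unpopulated",
     "severity": 6, "occurrence": 2, "detection": 2},
]

for fm in failure_modes:
    fm["rpn"] = fm["severity"] * fm["occurrence"] * fm["detection"]

# Highest-RPN modes drive OQ test prioritization and the monitoring plan
ranked = sorted(failure_modes, key=lambda fm: fm["rpn"], reverse=True)
assert ranked[0]["rpn"] == 105
```

Note that the highest-severity mode is not necessarily the highest-RPN mode: a moderately severe failure that occurs more often and is harder to detect can outrank it, which is precisely why FMEA, not severity alone, drives prioritization.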
GAMP 5 Second Edition introduced the concept of continuous validation — the idea that validation is not a one-time event but an ongoing lifecycle activity. For AI agents, this is not a philosophical aspiration but a practical necessity.
Implement a formal performance monitoring program that runs the evaluation dataset against the production agent on a defined schedule. Define statistical thresholds that trigger a change control assessment. Document the monitoring results in a Periodic Review report (annually at minimum, more frequently for high-risk agents). The Periodic Review assesses: has the agent's performance changed? Have the clinical workflows it supports changed? Have the predicate rules that govern its records changed? Has the threat landscape changed in ways that affect data integrity controls?
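The statistical trigger can be sketched as a one-sided two-proportion z-test on evaluation-set pass rates. The counts and the 1.645 threshold (roughly 95% one-sided confidence) are illustrative choices that a real monitoring plan would specify and justify:

```python
# Sketch of a periodic-review drift check: is the drop in evaluation-set
# pass rate statistically significant? Counts and threshold are illustrative.

import math

def drift_detected(baseline_pass: int, baseline_n: int,
                   current_pass: int, current_n: int,
                   z_threshold: float = 1.645) -> bool:
    """One-sided test: flag if the current pass rate is significantly lower."""
    p1 = baseline_pass / baseline_n
    p2 = current_pass / current_n
    pooled = (baseline_pass + current_pass) / (baseline_n + current_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / current_n))
    z = (p1 - p2) / se
    return z > z_threshold

# 188/200 at validation vs. 170/200 this month -> triggers change control
assert drift_detected(188, 200, 170, 200)
# 186/200 is within normal variation -> no trigger
assert not drift_detected(188, 200, 186, 200)
```

The second case is as important as the first: the threshold exists to distinguish genuine degradation from ordinary run-to-run variation, so that change control is triggered by signal, not noise.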
ISPE's GAMP 5 Special Interest Group on Data Integrity and Artificial Intelligence published a discussion document in 2024 addressing the gap between the GAMP 5 category model and AI/ML systems. While not yet codified as formal GAMP guidance, that document — along with FDA's 2023 framework for AI in drug development and the EU GMP Annex 11 revision that is anticipated to address AI systems — signals where formal regulatory guidance is heading. Organizations building AI agent validation frameworks today should design them to be compatible with that anticipated guidance, not just the current state of GAMP 5.
The core message emerging from all of these sources is consistent with the framework presented here: classify components by their nature and risk, validate deterministic components with deterministic tests, and supplement AI-specific components with performance-based validation, continuous monitoring, and robust change control.
The following principles distill the framework into guidance for validation leads approaching a new AI agent project:
Decompose before you classify. An AI agent is not a single system. Map every component before assigning categories. A single misclassification — treating the LangGraph graph as Category 4 because it runs on a COTS framework — can result in a validation package that fails to document the highest-risk custom logic in the system.
Custom logic is always Category 5. If your organization wrote it, configured it, or authored it — graph definitions, validators, prompts, corpus processing logic, audit log writers — it is Category 5 regardless of what framework it runs on.
LLMs are Category 4 with mandatory supplements. Standard Category 4 OQ testing is insufficient for non-deterministic models. Performance baseline establishment, sensitivity analysis, and ongoing monitoring are required additions, not optional enhancements.
The composite system requires its own validation. Component-level validation is necessary but not sufficient. End-to-end scenario testing, system-level FMEA, and integration testing are required at the system level regardless of how well individual components are validated.
Prompts are code. Version them, review them, test them, and control them with the same rigor as the LangGraph graph they inform.
Continuous validation is not optional for AI. Model drift, corpus changes, and infrastructure updates mean that an AI agent's validated state can change without any intentional action by the organization. Scheduled performance monitoring and a defined drift response protocol are foundational requirements of an AI agent validation strategy.
This article reflects current industry interpretation of GAMP 5 Second Edition (2022) as applied to AI agent systems, as of early 2026. Organizations should monitor evolving ISPE guidance, FDA regulatory frameworks, and ICH Q10 interpretations as formal guidance on AI system validation continues to develop.