Conversations That Matter
No deviation’s takeaways from the 2025 ISPE France Conference – Validation of AI in the pharmaceutical industry.
Presentation of the approach currently being developed at Servier to structure a methodology for validating AI systems within a GxP framework.
Speakers: Léo Martirien and Florien Barrero – Servier
Integrating artificial intelligence (AI) into GMP (Good Manufacturing Practice) and wider GxP environments is transforming the pharmaceutical industry, particularly in critical areas such as quality control. Adopting these solutions, however, requires rigorous validation. According to regulatory guidance, including the draft EU GMP Annex 22 on artificial intelligence, this validation must be risk-based and proportionate to the system’s criticality.
But how can we ensure that a complex model—often perceived as a “black box”—meets the requirements for traceability, quality, and patient safety? Below we explore the key challenges of validating AI in a GxP context, using visual inspection as an example.
Ground Truthing: Data Quality at the Heart of AI
An AI model is only as strong as the training data behind it. Validation therefore begins long before final testing, with what is known as ground truthing.
This stage requires subject matter experts (SMEs) to label the data. In visual inspection, for example, experts must determine whether or not a vial shows a defect (such as a chip or a crack). The task must be entrusted to SMEs because only they can ensure that the labels are accurate, and a double-check (an independent second review of each label) is recommended.
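The double-check can be sketched as follows, assuming two SMEs label the same images independently with hypothetical "defect"/"ok" labels. Images on which they disagree are routed to adjudication, and Cohen’s kappa (our choice of agreement metric; the speakers did not name one) summarizes how consistently the two experts label.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers on the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both SMEs.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each SME labeled at random with their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both SMEs used one identical label for every item
    return (observed - expected) / (1 - expected)

def items_for_adjudication(image_ids, labels_a, labels_b):
    """Image IDs on which the two SMEs disagree (hypothetical workflow step)."""
    return [img for img, a, b in zip(image_ids, labels_a, labels_b) if a != b]

ids = ["v1", "v2", "v3", "v4", "v5", "v6"]
sme_1 = ["defect", "ok", "ok", "defect", "ok", "ok"]
sme_2 = ["defect", "ok", "defect", "defect", "ok", "ok"]
print(items_for_adjudication(ids, sme_1, sme_2))  # ['v3']
print(round(cohens_kappa(sme_1, sme_2), 3))       # 0.667
```

A low kappa suggests the defect definitions themselves need clarifying before the labels are used for training.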
Another often overlooked aspect is staff motivation. If operators fear being replaced by AI, they might deliberately mislabel images to compromise the model. These human biases must be taken into account in the system’s design.
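Deliberate or unconscious mislabeling can be surfaced by comparing each labeler against adjudicated reference labels. The sketch below, with hypothetical record fields and an illustrative tolerance, computes each labeler’s miss rate (reference defects they marked "ok"). A high rate does not prove intent, but it flags who needs follow-up.

```python
from collections import defaultdict

def miss_rate_by_labeler(records):
    """records: iterable of (labeler_id, given_label, reference_label).

    Returns, per labeler, the share of reference defects they labeled "ok".
    """
    defects_seen = defaultdict(int)
    defects_missed = defaultdict(int)
    for labeler, given, reference in records:
        if reference == "defect":
            defects_seen[labeler] += 1
            if given == "ok":
                defects_missed[labeler] += 1
    return {lab: defects_missed[lab] / n for lab, n in defects_seen.items()}

def flag_labelers(records, max_miss_rate=0.05):
    """Labelers whose miss rate exceeds a hypothetical tolerance, for QA follow-up."""
    rates = miss_rate_by_labeler(records)
    return sorted(lab for lab, rate in rates.items() if rate > max_miss_rate)

records = [
    ("op1", "defect", "defect"),
    ("op1", "ok", "ok"),
    ("op1", "defect", "defect"),
    ("op2", "ok", "defect"),
    ("op2", "defect", "defect"),
    ("op2", "ok", "ok"),
]
print(miss_rate_by_labeler(records))  # {'op1': 0.0, 'op2': 0.5}
print(flag_labelers(records))         # ['op2']
```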
AI in Visual Inspection: The Risk–Benefit Trade-off
Visual inspection of sterile products (ampoules, vials, cartridges) is a typical AI use case. The goal is often to replace or assist human operators (traditionally women, reputedly for their concentration) in detecting foreign particles or container breaches.
This inevitably creates a trade-off between two types of errors:
- False Negative – A defective product is accepted as compliant. This is the most critical risk because it can endanger patient safety. To avoid it, the model must be tuned so that no defective product is missed, even at the cost of more false rejections (for example, by setting a conservative decision threshold).
- False Positive – A compliant product is rejected. This is mainly a business risk, as it reduces yield—especially costly when products are expensive to produce.
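One common way to bias an inspection model against false negatives is to set its reject threshold on a labeled validation set so that every known defect is rejected, then measure what that costs in false rejections. A minimal sketch, assuming a hypothetical model that outputs a defect score in [0, 1] and rejects a unit when its score is at or above the threshold:

```python
def threshold_for_zero_false_negatives(scores, is_defective):
    """Highest reject threshold that still rejects every known defect.

    A unit is rejected when its defect score >= threshold.
    """
    defect_scores = [s for s, d in zip(scores, is_defective) if d]
    if not defect_scores:
        raise ValueError("validation set contains no defective units")
    return min(defect_scores)  # the least obvious defect sets the bar

def false_positive_rate(scores, is_defective, threshold):
    """Share of compliant units that would be rejected at this threshold (yield loss)."""
    good_scores = [s for s, d in zip(scores, is_defective) if not d]
    if not good_scores:
        return 0.0
    return sum(s >= threshold for s in good_scores) / len(good_scores)

scores = [0.95, 0.40, 0.72, 0.10, 0.55, 0.30, 0.05]
defective = [True, True, True, False, False, False, False]
t = threshold_for_zero_false_negatives(scores, defective)
print(t)                                          # 0.4
print(false_positive_rate(scores, defective, t))  # 0.25
```

Zero misses on a validation set does not guarantee zero misses in production; one may lower the threshold further as a safety margin, which in turn raises the false-rejection rate.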
In a pharmaceutical setting, the AI model’s performance must be at least equivalent to that of trained human inspectors, whose detection reliability often exceeds 95%. Demonstrating this human-level comparability is a significant challenge.
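Because a test set contains a limited number of defects, an observed detection rate above 95% does not by itself show that the model matches humans. One way to account for this, sketched here under our own assumptions (a Wilson score lower bound at about 95% one-sided confidence, and the 95% human benchmark quoted above), is to compare the lower confidence bound of the model’s detection rate with the benchmark:

```python
import math

def wilson_lower_bound(successes, trials, z=1.645):
    """One-sided lower confidence bound (Wilson score) for a proportion.

    z = 1.645 corresponds to roughly 95% one-sided confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom

def at_least_human_level(detected, defects_in_set, human_rate=0.95):
    """True if the model's detection rate is credibly >= the human benchmark."""
    return wilson_lower_bound(detected, defects_in_set) >= human_rate

print(at_least_human_level(298, 300))  # True
print(at_least_human_level(290, 300))  # False: 96.7% observed, lower bound ~94.5%
```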
Transparency and Explainability: Opening the “Black Box”
Annex 22 emphasizes explainability: the ability to understand how the AI reached a decision. This requirement must be addressed in the documentation, typically in a dedicated AI Explainability Document (AED).
Techniques such as heat maps (feature attribution maps) show which areas of an image influenced the AI’s decision. Reviewing these maps during testing confirms that the AI identified the actual defect rather than an unrelated visual cue (as in the well-known “husky vs. wolf” example, where the model relied on snow in the background rather than on the animal itself).
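Occlusion sensitivity is one of the simplest ways to build such a heat map (gradient-based methods such as Grad-CAM are common alternatives): mask each patch of the image and record how much the defect score drops. The sketch below also measures how much of the attribution falls inside the SME-annotated defect box. The 4x4 image, the box, and both toy "models" are hypothetical stand-ins for a real classifier.

```python
def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Occlusion sensitivity: score drop when each patch is masked.

    image: 2-D list of floats (rows x cols); score_fn: image -> defect score.
    Returns a 2-D list with one value per patch (larger = more influential).
    """
    rows, cols = len(image), len(image[0])
    base = score_fn(image)
    heat = []
    for r0 in range(0, rows, patch):
        heat_row = []
        for c0 in range(0, cols, patch):
            occluded = [row[:] for row in image]
            for r in range(r0, min(r0 + patch, rows)):
                for c in range(c0, min(c0 + patch, cols)):
                    occluded[r][c] = fill
            heat_row.append(base - score_fn(occluded))
        heat.append(heat_row)
    return heat

def share_inside_box(heat, box, patch=2):
    """Fraction of positive attribution on patches overlapping the defect box.

    box = (row_start, row_end, col_start, col_end) in pixels, end-exclusive.
    """
    r_start, r_end, c_start, c_end = box
    total = inside = 0.0
    for i, heat_row in enumerate(heat):
        for j, value in enumerate(heat_row):
            if value <= 0:
                continue  # only count patches whose masking lowered the score
            total += value
            r, c = i * patch, j * patch  # top-left pixel of this patch
            if r < r_end and r + patch > r_start and c < c_end and c + patch > c_start:
                inside += value
    return inside / total if total else 0.0

# Toy 4x4 image: bright glare in the top-left, a real defect at the bottom-right.
image = [[0.0] * 4 for _ in range(4)]
image[0][0] = image[0][1] = 0.9  # irrelevant cue (the "snow")
image[3][3] = 1.0                # the actual defect
defect_box = (2, 4, 2, 4)        # SME annotation: rows 2-3, cols 2-3

def model_on_defect(img):  # hypothetical model that looks at the defect
    return img[3][3]

def model_on_glare(img):   # hypothetical model that learned the glare
    return (img[0][0] + img[0][1]) / 2

for fn in (model_on_defect, model_on_glare):
    print(fn.__name__, share_inside_box(occlusion_map(image, fn), defect_box))
# model_on_defect 1.0
# model_on_glare 0.0
```

A share close to zero on defective test images is the "snow" warning sign that reviewers look for.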
A major challenge arises when choosing AI vendors, who often refuse to disclose their training details for intellectual property reasons. Yet without knowledge of how a model was pre-trained, a pharmaceutical company cannot meet its regulatory obligation to explain AI decisions. This makes it difficult to accept pre-trained models that are not open source and transparent.
Maintaining the Validated State: Continuous Monitoring
Unlike traditional IT systems, dynamic AI (which continues to learn in production) or static AI operating in a changing environment requires continuous oversight. AI validation cannot be treated as a simple “fire and forget” project.
Annex 22 requires ongoing monitoring of model performance: not every three months, but continuously. In practice, this means daily oversight of the system. If gradual model drift is detected, corrective action must be taken immediately rather than waiting for the next scheduled review.
To maintain the validated state, monitoring criteria must be defined, such as:
- Statistical performance analyses (accuracy, recall, etc.).
- Data quality checks (to confirm that production data remain consistent with the training data, for example via a two-sample Kolmogorov–Smirnov test).
- Predefined alert thresholds requiring mitigation actions, such as human intervention or immediate system shutdown.
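The three criteria above can be wired together roughly as follows. This is a minimal sketch under stated assumptions: recall is measured on a labeled audit sample, drift is checked on a single scalar image feature (for example mean brightness, a hypothetical choice), and the action thresholds are purely illustrative. The drift decision uses the standard large-sample KS critical value c(α)·√((n+m)/(nm)); in practice a statistics library such as SciPy’s `scipy.stats.ks_2samp` would typically be used instead.

```python
import math

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk both sorted samples, evaluating the two empirical CDFs at each value.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_drift_detected(reference, production, alpha=0.05):
    """Large-sample KS test: flag drift if D exceeds the critical value."""
    n, m = len(reference), len(production)
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)  # about 1.358 for alpha = 0.05
    return ks_statistic(reference, production) > c_alpha * math.sqrt((n + m) / (n * m))

def monitoring_action(recall, drift, recall_alert=0.99, recall_stop=0.97):
    """Map monitoring results to a predefined action (thresholds are illustrative)."""
    if recall < recall_stop:
        return "STOP_LINE"     # immediate system shutdown
    if recall < recall_alert or drift:
        return "HUMAN_REVIEW"  # human intervention and investigation
    return "OK"

reference = list(range(100))            # stand-in for the training-time feature values
production = [x + 50 for x in range(100)]
drift = ks_drift_detected(reference, production)
print(drift)                            # True
print(monitoring_action(0.995, drift))  # HUMAN_REVIEW
```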
The operational effort required for this continuous monitoring is often underestimated by AI advocates, even though AI demands significant resources well beyond the initial project phase.
While AI holds enormous potential to improve efficiency and even enhance quality (for example, by leveraging imaging beyond the human-visible spectrum), integrating it into a GxP environment is a complex governance and quality challenge. It requires close collaboration between IT, data scientists, quality assurance, and subject matter experts, and, above all, constant vigilance from end users. Their process knowledge and on-the-ground expertise make them the necessary final checkpoint alongside AI, and their judgment is key to ensuring safe and reliable product release decisions.

