Improved performance of biomanufacturing processes using data sciences – Part 01

The recent pandemic, together with unprecedented innovation in advanced therapies, has highlighted how important it is for the pharmaceutical industry to design and deploy manufacturing processes quickly, with high levels of quality and performance. The large volume of data produced and archived during manufacturing activities is a tremendous source of potential knowledge.

Processing those data with modern Artificial Intelligence algorithms may generate new insights into the manufacturing processes and suggest new process configurations that lead to better industrial performance. In this series of three blog articles, we illustrate a consistent methodology for generating business value from manufacturing data. This first article focuses on the data themselves and the challenges of using them; the second will shed light on how they can contribute to value creation in everyday manufacturing operations; the last will highlight how the same data can contribute to process improvement.

Part 1: Data: where to find them, how to use them?

The primary challenge in biomanufacturing: accessing the process data

Any effective data sciences project has to start…from the data!

In a biomanufacturing context, access to data faces multiple challenges. First, the data related to a given process may be distributed across multiple platforms: characteristics and certificates of analysis (CoA) of raw materials documented in the ERP system, quality controls stored in LIMS software, automation and regulation data points archived in a SCADA database, and, very frequently, many process operations documented and signed off on isolated worksheets.

Obviously, the data must first exist somewhere in digital form. Depending on the digital maturity of the company, this is not always the case, or only partially. Data capture systems must then be put in place to fill the digitalization gaps.

Second, production operations are by nature business-critical for the organization: disrupting operations, or requiring staff to be trained and qualified on new tools and methods, is generally ruled out.

Finally, biopharmaceutical production is highly regulated: hardware and software technologies must therefore meet strict regulatory requirements, and substituting or adding technologies in a registered process is generally discouraged.

As a result, routine biopharmaceutical operations are typically implemented using a collection of independent systems that are disconnected from each other. To enable a more cross-cutting view of the data, we promote the setup of a secondary database, which may or may not be isolated from the critical operations depending on the way it will be used (as will be discussed in the next two articles). This database serves as the input for all data sciences initiatives. It is structured along a data model inspired by the ISA-88 standard and fit for representing almost any biomanufacturing process.

Ultimately, all data attached to a given process (production, quality, operations, automation) are consolidated into a unique, rich and contextualized data model, which is further complemented by meta-information attached to the process (measure boundaries inherited from design-of-experiment efforts, target values as documented in the registered process, and numerous other elements making the database “process-aware”…). Routine operations and instructions given to the staff on duty remain untouched: we simply deploy dedicated software connectors that automatically retrieve the relevant data from the different IT systems and populate our data lake. Any type of data is eligible to enter this data lake: numerical, textual, categorical (binary or more), time series, spectra, and more.
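To make the consolidation idea more concrete, here is a minimal Python sketch of what a connector could do. It is an illustration only and not a description of any actual product: the `BatchRecord` schema, the `consolidate` function, the field names and the sample values are all hypothetical, and a real ISA-88-inspired model would be considerably richer.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class BatchRecord:
    """One batch, consolidated from several source systems (hypothetical schema)."""
    batch_id: str
    raw_materials: dict[str, Any] = field(default_factory=dict)      # e.g. from the ERP / CoA
    quality_controls: dict[str, float] = field(default_factory=dict)  # e.g. from the LIMS
    time_series: dict[str, list[tuple[float, float]]] = field(default_factory=dict)  # e.g. from SCADA
    targets: dict[str, tuple[float, float]] = field(default_factory=dict)  # "process-aware" (low, high) ranges

    def out_of_range(self) -> list[str]:
        """Names of quality controls whose value falls outside its target range."""
        flagged = []
        for name, value in self.quality_controls.items():
            if name in self.targets:
                low, high = self.targets[name]
                if not low <= value <= high:
                    flagged.append(name)
        return flagged


def consolidate(erp_rows, lims_rows, scada_rows, targets):
    """Group rows exported from each system by batch identifier."""
    records: dict[str, BatchRecord] = {}

    def record_for(batch_id: str) -> BatchRecord:
        if batch_id not in records:
            records[batch_id] = BatchRecord(batch_id, targets=dict(targets))
        return records[batch_id]

    for row in erp_rows:
        record_for(row["batch_id"]).raw_materials[row["material"]] = row["lot"]
    for row in lims_rows:
        record_for(row["batch_id"]).quality_controls[row["test"]] = row["value"]
    for row in scada_rows:
        series = record_for(row["batch_id"]).time_series.setdefault(row["tag"], [])
        series.append((row["t"], row["value"]))
    return records


# Made-up sample rows, standing in for exports from the different systems.
erp = [{"batch_id": "B001", "material": "medium", "lot": "M-17"}]
lims = [
    {"batch_id": "B001", "test": "titer_g_per_L", "value": 2.1},
    {"batch_id": "B001", "test": "pH", "value": 7.6},
]
scada = [
    {"batch_id": "B001", "tag": "temperature_C", "t": 0.0, "value": 36.9},
    {"batch_id": "B001", "tag": "temperature_C", "t": 1.0, "value": 37.1},
]
targets = {"pH": (6.8, 7.4), "titer_g_per_L": (1.5, 3.0)}

records = consolidate(erp, lims, scada, targets)
print(records["B001"].out_of_range())
```

The point of the sketch is that the source systems are only read, never modified: each connector contributes its rows, and the batch identifier provides the context that ties them together.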


Once this valuable dataset has been assembled, different scenarios for exploiting the data can be proposed: in routine operations, or for process improvements. This will be the subject of the next articles. But before getting there, let us recall a few of the challenges that will have to be addressed along the way.

A biomanufacturing process is by nature highly multivariate, while humans have difficulty comprehending more than 2-3 variables simultaneously. This can make it hard for process experts to assess the recommendations produced by modeling and optimization initiatives. It calls for appropriate training, so that the multidisciplinary teams of biotech and pharma companies can make the most of the results of these initiatives (change management). It also calls for visualizations, in particular interactive ones, of the models, their predictions and their recommendations. Modern developments in AI push towards models able to justify their recommendations (the concept of “trustworthy AI” or “transparent AI”).

AI or data sciences are not bullshit, but are not magic either. Data sciences and process experts should walk hand in hand. The former can process huge amounts of data in a systematic way, allowing a team to make some “beliefs” more objective, and is also a very powerful tool for generating hypotheses. But the latter should sort the recommendations and hypotheses based on their precious knowledge. In practice, iterating between data scientists and process experts is the most beneficial strategy… provided that process experts have the basic notions needed to assess the data-driven results, and that data scientists working in this field have a fair knowledge of biomanufacturing processes.

Curse of dimensionality: even in large pharma corporations, the number of batches produced over several years will be in the hundreds (in the best cases), while the number of measures that can describe each batch will be in the thousands or tens of thousands. This setting constitutes an “empty space”, an under-constrained scenario for modeling endeavors. Specific modeling and analysis techniques must be implemented to avoid “overfitting”, i.e. generating models that fit well (even perfectly) on data from past batches but completely fail to capture the actual concepts hidden in the data, and hence fail to model anything useful in future batches. This also requires techniques for correctly assessing the performance of a model in this particular context.
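The overfitting risk can be illustrated with a small synthetic experiment. The sketch below is a toy example on simulated noise, not real process data: the batch and measure counts are arbitrary assumptions, and the outcome is unrelated to the measures by construction, so any apparent fit is spurious.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_measures = 100, 100, 5000  # far more measures than batches

# Pure noise: the outcome has no relationship with any measure.
X_train = rng.normal(size=(n_train, n_measures))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, n_measures))
y_test = rng.normal(size=n_test)

# Minimum-norm least-squares fit. With more measures than batches,
# the linear system is under-constrained and an exact fit exists.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 is a perfect fit, 0 is no better than the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


train_r2 = r_squared(y_train, X_train @ coef)
test_r2 = r_squared(y_test, X_test @ coef)
print(f"R2 on past batches:   {train_r2:.3f}")
print(f"R2 on future batches: {test_r2:.3f}")
```

The model reproduces the past batches essentially perfectly, yet it has no predictive value on new batches. This is why performance must be assessed on data the model has not seen.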

Provided these challenges can be addressed properly, the benefits of data sciences for biomanufacturing can be numerous:

Improved efficiency: By analyzing large amounts of data, data scientists can help identify trends and patterns that can be used to optimize production and reduce waste, resulting in cost savings and improved efficiency.

Enhanced accuracy: Data science can be used to develop predictive models that can help companies anticipate and plan for potential issues, reducing the risk of errors and improving the accuracy of routine operations.

Improved product quality: By using data science to monitor and optimize production processes, companies can improve the quality of their products, which is important for maintaining customer satisfaction and loyalty.

Increased competitiveness: By using data science to drive operational efficiencies and improve product quality, companies can stay competitive in an increasingly complex and fast-paced market.

Take-home message

In this first article of our blog series, we recalled the rationale for initiating a modeling effort in a biomanufacturing context. Generally speaking, modeling improves the understanding and knowledge attached to a complex activity; it also brings significant savings in time and money, as a number of experimental investigations may then be replaced by numerical explorations.

Given the complexity and dimensionality of biomanufacturing, the fundamentals of a modeling effort based on Machine Learning techniques have been highlighted:

(i) Training of models based on the available retrospective data

(ii) Capacity of these models to extract the measures with dominant influence – a technique known as feature selection, and

(iii) Qualification of the model (including selected parameters) through the performance indicators that are best suited to the problem.
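The three steps above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: correlation ranking stands in for feature selection, ordinary least squares for the model, and cross-validated R² for the performance indicator. All function names are hypothetical, and a real project would choose methods and indicators suited to the problem.

```python
import numpy as np


def select_top_k(X, y, k):
    """Indices of the k measures most correlated (in absolute value) with the outcome."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:k]


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]


def cross_validated_r2(X, y, k_features, n_folds, rng):
    """R2 computed on held-out batches only, with selection redone in every fold."""
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_pred = np.empty_like(y)
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        # Selection only sees the training batches, so the held-out
        # batches stay truly unseen when the model is evaluated.
        cols = select_top_k(X[train], y[train], k_features)
        coef = fit_linear(X[train][:, cols], y[train])
        y_pred[held_out] = predict_linear(coef, X[held_out][:, cols])
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic retrospective data: 150 batches, 1000 measures,
# of which only the first three actually drive the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 1000))
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(size=150)

selected = select_top_k(X, y, k=5)
cv_r2 = cross_validated_r2(X, y, k_features=5, n_folds=5, rng=rng)
print("Selected measures:", sorted(selected.tolist()))
print(f"Cross-validated R2: {cv_r2:.2f}")
```

Repeating the selection inside each fold matters: selecting measures on all batches first and cross-validating afterwards would leak information from the held-out batches and make the model look better than it is.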

As a prerequisite for such a modeling exercise, the construction of a comprehensive dataset has been discussed. In particular, a practical scenario for building this dataset without impacting routine, regulated operations has been proposed: it takes the form of an ancillary database that collects the primary data from the different systems in place and shapes those data into a rich, unique, contextualized data model. This dataset is the fundamental ingredient of the different Data Sciences initiatives that will be described in the next two articles.

DNAlytics is a Belgian company specialized in the applications of Data Sciences and Artificial Intelligence in the healthcare sector. DNAlytics offers HERCULE, a versatile software platform for the monitoring and enhancement of biomanufacturing processes. Through a modular set of analysis libraries, HERCULE accelerates the process development effort, facilitates process qualification and quality reviews, and provides practical recommendations for enhancing the performance of routine production.

The implementation of HERCULE in biopharmaceutical companies in Asia is offered through an exclusive partnership with No deviation. For more information, do not hesitate to reach out to
