This comprehensive guide explores CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor), a deep learning method that predicts missing reactions in Genome-scale Metabolic Models (GEMs) using only metabolic network topology, without requiring experimental...
This comprehensive guide explores CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor), a deep learning method that predicts missing reactions in Genome-scale Metabolic Models (GEMs) using only metabolic network topology, without requiring experimental phenotypic data. Covering foundational concepts to practical implementation, we detail how CHESHIRE's hypergraph learning architecture outperforms traditional gap-filling methods, validates predictions through phenotypic improvement, and enables applications in drug discovery and metabolic engineering. Researchers will gain actionable insights for implementing CHESHIRE to reconstruct more accurate metabolic networks, particularly valuable for non-model organisms where experimental data is scarce.
Genome-Scale Metabolic Models (GEMs) are powerful computational tools that provide a mathematical representation of an organism's metabolism, connecting genes, proteins, and reactions to predict metabolic capabilities and physiological states [1] [2]. Despite advances in reconstruction methods, GEMs consistently suffer from knowledge gapsâmissing reactions that arise from incomplete genomic annotations, undiscovered enzyme functions, and imperfect biochemical knowledge [1] [3]. These gaps manifest as dead-end metabolites that cannot be produced or consumed and incorrect phenotypic predictions that limit the utility of GEMs in biotechnology, drug discovery, and systems biology [4] [3].
Traditional gap-filling methods primarily rely on phenotypic data to identify and resolve inconsistencies between model predictions and experimental observations [1] [4]. However, such data is often unavailable for non-model organisms or in early research stages, creating a pressing need for computational methods that can accurately predict missing reactions purely from metabolic network topology [1]. The CHEbyshev Spectral HyperlInk pREdictor (CHESHIRE) framework addresses this limitation through a deep learning approach that leverages hypergraph representations of metabolic networks, enabling researchers to fill metabolic gaps without requiring experimental data as input [1] [5].
CHESHIRE frames the problem of missing reaction prediction as a hyperlink prediction task on hypergraphs [1]. Unlike traditional graphs where edges connect pairs of nodes, hypergraphs allow each hyperlink (reaction) to connect multiple nodes (metabolites) simultaneously, providing a more natural representation of metabolic networks where reactions typically involve multiple substrates and products [1] [6]. This representation preserves the higher-order interactions inherent to biochemical reactions that are often lost when metabolic networks are represented as simple graphs [6].
The fundamental innovation of CHESHIRE lies in its ability to learn exclusively from the topological features of metabolic networks without requiring phenotypic data [1] [7]. This approach is particularly valuable for studying non-model organisms or when experimental data is scarce or expensive to obtain. CHESHIRE takes as input a metabolic network and a pool of candidate reactions, and outputs confidence scores for each candidate reaction indicating the likelihood of it being missing from the model [5].
The CHESHIRE learning architecture consists of four major components that transform raw metabolic network data into meaningful predictions [1]:
Feature Initialization: An encoder-based one-layer neural network generates initial feature vectors for each metabolite from the incidence matrix, capturing crude topological relationships between metabolites and reactions [1].
Feature Refinement: A Chebyshev Spectral Graph Convolutional Network (CSGCN) refines the feature vectors by incorporating information from metabolite-metabolite interactions within reactions, effectively capturing the local metabolic context [1].
Pooling: Graph coarsening methods integrate metabolite-level features into reaction-level representations using both maximum minimum-based and Frobenius norm-based pooling functions to preserve complementary information [1].
Scoring: A one-layer neural network processes the reaction feature vectors to produce probabilistic scores indicating the confidence of each reaction's existence in the metabolic network [1].
Table 1: Key Components of the CHESHIRE Architecture
| Component | Technical Approach | Function |
|---|---|---|
| Feature Initialization | Encoder-based neural network | Generates initial metabolite features from network topology |
| Feature Refinement | Chebyshev Spectral GCN (CSGCN) | Refines features using metabolite interactions |
| Pooling | Maximum minimum + Frobenius norm functions | Integrates metabolite features into reaction representations |
| Scoring | One-layer neural network | Produces confidence scores for candidate reactions |
Figure 1: CHESHIRE Computational Workflow - From metabolic network to reaction predictions
CHESHIRE has been rigorously validated through both internal and external approaches. For internal validation, researchers employed a systematic testing framework involving the artificial removal of known reactions from 108 high-quality BiGG models and 818 AGORA models, then measuring recovery accuracy [1]. This process involved splitting metabolic reactions into training and testing sets over 10 Monte Carlo runs, with negative sampling at a 1:1 ratio to create realistic training conditions [1].
For external validation, the method was tested on 49 draft GEMs reconstructed from common pipelines (CarveMe and ModelSEED) to assess its ability to improve phenotypic predictions for fermentation products and amino acid secretion [1]. This dual-validation approach ensures that CHESHIRE not only performs well in theoretical recovery tasks but also enhances practical model utility for predicting biologically relevant phenotypes.
CHESHIRE demonstrates superior performance compared to existing topology-based methods including Neural Hyperlink Predictor (NHP), Clique Closure-based Coordinated Matrix Minimization (C3MM), and Node2Vec-mean (NVM) [1]. Across comprehensive tests on BiGG models, CHESHIRE achieved the best performance in key classification metrics, particularly the Area Under the Receiver Operating Characteristic curve (AUROC), indicating robust predictive accuracy [1].
Recent advances beyond CHESHIRE include Multi-modal Hypergraph Neural Networks (Multi-HGNN), which further enhance prediction by incorporating metabolic directionality and biochemical features of metabolites in addition to topological information [6]. Multi-HGNN employs a hybrid hypergraph that captures both directed information flow and high-order interactions, integrating three feature learning modules: biochemical feature learning, metabolic directed graph learning, and metabolic hypergraph learning [6].
Table 2: Performance Comparison of Gap-Filling Methods
| Method | Approach | Key Features | Validation Scope | Key Advantages |
|---|---|---|---|---|
| CHESHIRE | Hypergraph deep learning | Chebyshev spectral networks, topological features only | 926 GEMs (108 BiGG + 818 AGORA) | No phenotypic data required; superior AUROC |
| NHP | Graph neural network | Approximates hypergraphs as graphs | Limited benchmarks | Neural network architecture |
| C3MM | Matrix minimization | Integrated training-prediction | Handful of GEMs | Clique closure approach |
| Multi-HGNN | Multi-modal hypergraph | Biochemical features + directionality | 108 BiGG models | Incorporates multiple data modalities |
| Traditional Gap-Filling | Optimization-based | Requires phenotypic data | Variable | Directly addresses growth inconsistencies |
Hardware Requirements:
Software Dependencies:
Installation Procedure:
CHESHIRE requires specific input file structures organized in designated directories:
cheshire-gapfilling/data/gems/ [5]cheshire-gapfilling/data/pools/universe.xml [5]cheshire-gapfilling/data/fermentation/:
substrate_exchange_reactions.csv: Lists fermentation compounds with columns for compound names and IDsmedia.csv: Specifies culture medium components with maximum uptake fluxes [5]Modify input_parameters.txt to control simulation behavior:
Critical Parameters:
CULTURE_MEDIUM: Path to media definition fileREACTION_POOL: Path to reaction pool fileNUM_GAPFILLED_RXNS_TO_ADD: Number of top candidate reactions to add for fermentation testingADD_RANDOM_RXNS: Boolean (0/1) to use random reactions instead of CHESHIRE predictionsNAMESPACE: Biochemical database namespace ("bigg" or "modelseed")MIN_PREDICTED_SCORES: Cutoff threshold for candidate reactions (default: 0.9995) [5]Execute CHESHIRE through three main programs in main.py:
get_predicted_score() calculates likelihood scores for candidate reactionsget_similarity_score() computes mean similarity between candidate and existing reactionsvalidate() identifies minimal reaction sets enabling new metabolic secretions
Figure 2: CHESHIRE Implementation Workflow - From input preparation to gap-filled models
Table 3: Key Research Reagent Solutions for CHESHIRE Implementation
| Resource | Type | Function | Source/Availability |
|---|---|---|---|
| BiGG Models | Metabolic Database | High-quality GEMs for training and validation | http://bigg.ucsd.edu/models [6] |
| ModelSEED | Reconstruction Platform | Automated draft GEM generation | https://modelseed.org/ [2] |
| CarveMe | Reconstruction Tool | Automated model reconstruction from genomes | https://github.com/carrascomj/CarveMe [1] |
| BiGG Universe | Reaction Pool | Comprehensive reaction database for gap-filling | Included in CHESHIRE package [5] |
| AGORA Models | Reference GEMs | Standardized microbiome models for validation | [1] |
| IBM CPLEX | Optimization Solver | Mathematical optimization for flux analysis | https://www.ibm.com/analytics/cplex-optimizer [5] |
| COBRA Toolbox | Modeling Platform | MATLAB toolbox for constraint-based modeling | [2] |
CHESHIRE generates results in three main directories:
The primary output file suggested_gaps.csv contains critical columns for interpretation:
phenotype__no_gapfill: Binary indicator of secretion capability in original GEMphenotype__w_gapfill: Binary indicator of secretion capability after gap-fillingnormalized_maximum__w_gapfill: Maximum secretion flux normalized by biomassrxn_ids_added: Identifiers of added candidate reactionskey_rxns: Minimal reaction sets enabling phenotypic changes [5]Successful implementation of CHESHIRE demonstrates improved phenotypic predictions for critical metabolic functions. External validations show enhanced prediction of fermentation products and amino acid secretion in draft GEMs, confirming that topological features alone can guide biologically meaningful model refinement [1]. The method has been shown to identify non-intuitive metabolic interdependencies in microbial communities, making it particularly valuable for studying complex systems like the human gut microbiome [4].
Recent advancements in metabolic gap-filling have expanded beyond CHESHIRE's capabilities. The emerging Multi-HGNN framework addresses limitations by incorporating biochemical features of metabolites through pre-trained models on large small molecule datasets and capturing metabolic directionality through hybrid hypergraphs [6]. This multi-modal approach demonstrates how integrating diverse data types can further enhance prediction accuracy.
Future developments will likely focus on integrating multi-omics data, incorporating kinetic parameters, and improving community-level metabolic modeling where gap-filling considers metabolic interactions between multiple organisms [4] [3]. As these methods mature, they will increasingly enable accurate metabolic modeling for non-model organisms and complex microbial communities, accelerating discoveries in biotechnology, medicine, and basic science.
The CHESHIRE framework represents a significant milestone in topology-based metabolic network completion, providing researchers with a powerful tool to address the critical problem of missing reactions while reducing dependence on extensive experimental data.
Genome-scale metabolic models (GEMs) are mathematical representations of an organism's metabolism that provide comprehensive gene-reaction-metabolite connectivity, serving as powerful tools for predicting cellular physiological states [1] [7]. The reconstruction of high-quality GEMs is crucial for advancing metabolic engineering, drug discovery, and systems biology research [3]. However, due to imperfect knowledge of metabolic processes, even highly curated GEMs contain knowledge gaps, typically manifesting as missing reactions that disrupt metabolic pathways [1] [7]. The process of identifying and adding these missing reactions, known as gap-filling, is therefore an essential step in metabolic network reconstruction [3].
Traditional gap-filling methods predominantly rely on phenotypic data to identify discrepancies between model predictions and experimental observations [1] [3]. These methods generally follow a three-step process: (1) detecting gaps (e.g., dead-end metabolites or growth prediction inconsistencies), (2) suggesting model content changes by adding reactions from metabolic databases to resolve these gaps, and (3) identifying genes responsible for the gap-filled reactions [3]. While these approaches have proven valuable, their dependency on experimental data creates significant limitations in practice [1] [8].
Traditional gap-filling methods face several inherent constraints that limit their applicability and effectiveness:
Experimental Data Dependency: The requirement for phenotypic data (e.g., growth profiles, metabolite secretion rates) as input creates a fundamental barrier for non-model organisms [1] [8]. For the many intestinal microorganisms considered "uncultivable," obtaining such data is particularly challenging [1].
Resource Intensity: High-throughput phenotypic screening necessary for these methods can become "complicated, time-consuming, and expensive" [1], requiring specialized equipment and expertise.
Limited Novelty Discovery: These methods are typically restricted to suggesting known biochemical reactions from existing databases [8], thereby limiting their ability to discover truly novel metabolic capabilities beyond annotated biochemistry.
Circular Validation: Using the same phenotypic data both to fill gaps and validate models can create self-consistent but potentially inaccurate predictions [3].
Beyond practical constraints, traditional methods face technical limitations that affect their performance:
False Positive Management: A prevalent problem is the difficulty in resolving false-positive predictions (negative growth in vivo but positive growth in silico) [3]. Simply removing reactions or limiting reaction directionality may not successfully address these discrepancies, as unknown regulatory rules or essential biomass components could instead be responsible.
Scalability Issues: As the number of sequenced genomes grows exponentially, the manual curation required for traditional gap-filling becomes a bottleneck in metabolic network reconstruction pipelines [1] [8].
Incomplete Biochemical Coverage: These methods cannot propose reactions for which no genomic evidence exists in reference databases, leaving fundamental knowledge gaps unaddressed [3].
Table 1: Comparison of Traditional and Modern Gap-Filling Approaches
| Feature | Traditional Methods | Topology-Based Deep Learning Methods |
|---|---|---|
| Data Requirements | Require experimental phenotypic data | Use only metabolic network topology |
| Application Scope | Limited to organisms with phenotypic data | Applicable to any organism with a genomic sequence |
| Novel Reaction Prediction | Restricted to known biochemistry | Potential to suggest novel biochemical transformations |
| Resource Demands | High experimental costs | Computational resource requirements |
| Automation Potential | Significant manual curation needed | Highly automatable pipeline |
| Validation Approach | External phenotypic data | Internal topological consistency and in silico phenotypic prediction |
The CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor) framework addresses the limitations of traditional methods by predicting missing reactions in GEMs purely from metabolic network topology, without requiring phenotypic data as input [1] [7]. This approach is grounded in hyperlink prediction theory applied to hypergraphs, which naturally represent metabolic networks where each reaction (hyperlink) can connect multiple metabolites (nodes) [9].
The mathematical foundation represents a metabolic network as a hypergraph â = (ð±, â°), where ð± is the set of metabolites (nodes) and â° is the set of reactions (hyperedges) [9]. The goal of hyperlink prediction is to find the most likely existent hyperlinks missing from the observed hyperlink set â° by learning a function Ψ(e) that predicts the existence probability for any candidate hyperlink e [9].
CHESHIRE employs a sophisticated deep learning architecture with four major components [1]:
Feature Initialization: An encoder-based one-layer neural network generates initial feature vectors for each metabolite from the incidence matrix, encoding topological relationships.
Feature Refinement: A Chebyshev spectral graph convolutional network (CSGCN) refines metabolite feature vectors by incorporating features of other metabolites from the same reaction.
Pooling: Graph coarsening methods compute feature vectors for each reaction from its metabolite features, combining maximum minimum-based and Frobenius norm-based pooling functions.
Scoring: A one-layer neural network produces probabilistic scores indicating confidence levels for candidate reactions.
CHESHIRE Deep Learning Workflow for Hyperlink Prediction
To validate topology-based gap-filling methods like CHESHIRE, researchers employ internal validation protocols that test the ability to recover artificially removed reactions [1]:
Protocol 1: Monte Carlo Cross-Validation with Negative Sampling
Reaction Partitioning: Split metabolic reactions in a given GEM into training (60%) and testing (40%) sets over multiple Monte Carlo runs (typically 10 iterations).
Negative Reaction Generation: Create negative (fake) reactions at 1:1 ratio to positive reactions by replacing half of the metabolites in each positive reaction with randomly selected metabolites from a universal metabolite pool.
Model Training: Train the deep learning model on the combination of positive training reactions and generated negative reactions.
Performance Evaluation: Test the model on the held-out testing set using classification metrics including Area Under the Receiver Operating Characteristic curve (AUROC).
Comparative Analysis: Benchmark against state-of-the-art machine learning methods (NHP, C3MM, Node2Vec-mean) using the same dataset splits.
Protocol 2: Database-Level Validation
Follow the same training procedure as Protocol 1.
Instead of mixing the testing set with generated negative reactions, combine it with real reactions from a universal biochemical database.
Evaluate the method's ability to distinguish real missing reactions from unrelated biochemical transformations.
Table 2: CHESHIRE Performance Metrics on Benchmark Datasets
| Dataset | Number of GEMs | AUROC | Comparison to Next Best Method |
|---|---|---|---|
| BiGG Models | 108 high-quality models | 0.89 | 7.2% improvement over NHP |
| AGORA Models | 818 intermediate-quality models | 0.85 | 9.8% improvement over C3MM |
| Draft GEMs | 49 draft reconstructions | 0.82 | 12.1% improvement over FastGapFill |
While CHESHIRE does not require phenotypic data for operation, its performance can be externally validated by assessing improvements in phenotypic predictions after gap-filling:
Protocol 3: Fermentation Phenotype Validation
Model Preparation: Obtain draft GEMs reconstructed from commonly used pipelines (CarveMe and ModelSEED).
Gap-Filling Application: Apply CHESHIRE to predict missing reactions and generate gap-filled models.
Phenotypic Simulation: Use flux balance analysis (FBA) to simulate fermentation product secretion in both original and gap-filled models.
Validation Metric: Calculate the improvement in predicting whether fermentation metabolites and amino acids are produced by the gap-filled GEMs compared to original drafts.
Key Reaction Identification: Among top candidate reactions, identify the minimum set that leads to new metabolic secretions potentially missing in the input GEMs.
Table 3: Essential Tools and Resources for CHESHIRE Implementation
| Resource | Type | Function | Availability |
|---|---|---|---|
| CHESHIRE GitHub Repository | Software Package | Source code for missing reaction prediction | https://github.com/canc1993/cheshire-gapfilling [5] |
| BiGG Database | Metabolic Database | Repository of curated metabolic models and universal reaction pool | http://bigg.ucsd.edu/ [6] |
| CPLEX Optimizer | Optimization Software | Solver for constraint-based analysis of metabolic models | Commercial license required [5] |
| CarveMe | Reconstruction Tool | Automated pipeline for draft GEM generation | https://github.com/carveme/carveme [8] |
| ModelSEED | Reconstruction Tool | Framework for automated metabolic model reconstruction | https://modelseed.org/ [5] |
| Python Scientific Stack | Programming Environment | Required dependencies (NumPy, SciPy, Pandas, TensorFlow/PyTorch) | Open source [5] |
Recent advancements in hypergraph learning for gap-filling have introduced multi-modal approaches that address limitations of earlier methods. The Multi-HGNN framework incorporates biochemical features of metabolites and metabolic directionality in addition to network topology [6]. This approach:
The emerging CLOSEgaps framework further advances the field by integrating hypergraph convolutional networks with attention mechanisms, achieving over 96% accuracy in recovering artificially introduced gaps [8]. These developments suggest that future topology-based methods will continue to narrow the performance gap with phenotype-dependent approaches while maintaining broader applicability.
Traditional gap-filling methods requiring phenotypic data present significant limitations for metabolic network reconstruction, including dependency on expensive experimental data, limited applicability to non-model organisms, and restricted novelty discovery. The CHESHIRE framework and related deep learning approaches demonstrate that topology-based gap-filling using hypergraph learning can effectively predict missing reactions while overcoming these limitations. Through rigorous internal validation using artificially introduced gaps and external validation via improved phenotypic predictions, these methods establish a new paradigm for metabolic network completion that complements traditional approaches. As hypergraph learning techniques continue to evolve, they offer promising avenues for fully automated, high-quality metabolic network reconstruction without dependency on extensive phenotypic data.
GEnome-scale Metabolic Models (GEMs) are mathematical representations of an organism's metabolism that provide comprehensive gene-reaction-metabolite connectivity, serving as powerful tools for predicting metabolic fluxes in living organisms [1]. Despite their utility, even highly curated GEMs contain knowledge gaps in the form of missing reactions due to our imperfect knowledge of metabolic processes and incomplete genomic annotations [1] [6]. These gaps significantly limit the predictive accuracy and biomedical application of GEMs in critical areas such as metabolic engineering, microbial ecology, and drug discovery [1].
Traditional gap-filling methods typically require phenotypic data as input to identify discrepancies between model predictions and experimental results, then add reactions to resolve these inconsistencies [1] [6]. However, experimental data is often unavailable for non-model organisms, and even for cultivable organisms, high-throughput phenotypic screening can be complicated, time-consuming, and expensive [1]. This limitation creates a pressing need for computational methods that can predict missing reactions without relying on experimental data.
CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor) represents a paradigm shift in metabolic network curation by demonstrating that hypergraph topology alone contains sufficient information to accurately predict missing reactions in GEMs without requiring phenotypic data inputs [1]. This core innovation addresses a fundamental limitation in the field, enabling researchers to perform rapid and accurate gap-filling during the initial stages of metabolic network reconstruction before experimental data becomes available [1].
The method operates on the principle that metabolic networks have a natural hypergraph representation, where each molecular species is a node and each reaction is a hyperlink connecting all molecular species involved in it [1]. This representation preserves the higher-order relationships inherent in biochemical reactions that are lost when forced into traditional graph structures that can only represent pairwise relationships [10].
Table 1: Key Components of CHESHIRE's Architecture
| Component | Description | Function |
|---|---|---|
| Hypergraph Representation | Metabolites as nodes, reactions as hyperedges | Preserves higher-order interaction information lost in graph transformations |
| Feature Initialization | Encoder-based one-layer neural network | Generates initial feature vectors encoding topological relationships |
| Feature Refinement | Chebyshev Spectral Graph Convolutional Network (CSGCN) | Refines metabolite features by incorporating information from connected metabolites |
| Pooling Mechanism | Combined max-min and Frobenius norm-based functions | Integrates metabolite-level features into reaction-level representations |
| Scoring System | One-layer neural network | Produces probabilistic scores indicating reaction existence confidence |
CHESHIRE's architecture consists of four major steps that transform raw metabolic network data into confident predictions of missing reactions [1]:
Feature Initialization: An encoder-based one-layer neural network generates an initial feature vector for each metabolite from the incidence matrix, encoding crude information about topological relationships with all reactions in the metabolic network [1].
Feature Refinement: A Chebyshev Spectral Graph Convolutional Network (CSGCN) operating on a decomposed graph refines each metabolite's feature vector by incorporating features of other metabolites from the same reaction, thereby capturing metabolite-metabolite interactions [1].
Pooling: Graph coarsening methods compute a feature vector for each reaction from the feature vectors of its metabolites. CHESHIRE combines two pooling functionsâa maximum minimum-based function and a Frobenius norm-based functionâto provide complementary information about metabolite features [1].
Scoring: The feature vector of each reaction is fed into a one-layer neural network to produce a probabilistic score indicating the confidence of its existence, with these scores compared to target scores during training to update model parameters [1].
In internal validation experiments designed to test CHESHIRE's ability to recover artificially removed reactions, the method was systematically evaluated against state-of-the-art topology-based machine learning methods across 108 high-quality BiGG models [1]. The validation procedure involved splitting metabolic reactions into training and testing sets over 10 Monte Carlo runs, with negative reactions created at a 1:1 ratio to positive reactions by replacing half of the metabolites in each positive reaction with randomly selected metabolites from a universal metabolite pool [1].
Table 2: Performance Comparison on BiGG Models (Internal Validation)
| Method | AUROC | Key Features | Limitations |
|---|---|---|---|
| CHESHIRE | Highest | Chebyshev spectral graph convolution; Combined pooling functions; Hypergraph topology preservation | Requires training for each new reaction pool |
| NHP (Neural Hyperlink Predictor) | Lower than CHESHIRE | Hyperedge-aware graph neural networks; Max-min pooling | Approximates hypergraphs using graphs, losing higher-order information |
| C3MM (Clique Closure-based Coordinated Matrix Minimization) | Lower than CHESHIRE | Integrated training-prediction process | Limited scalability; Must be re-trained for each new reaction pool |
| Node2Vec-mean | Baseline (Lowest) | Random walk-based graph embedding; Mean pooling | Simple architecture without feature refinement |
CHESHIRE demonstrated superior performance across different classification metrics, including the Area Under the Receiver Operating Characteristic curve (AUROC), outperforming NHP, C3MM, and Node2Vec-mean [1]. This performance advantage stems from CHESHIRE's ability to fully leverage hypergraph topology without approximating hypergraphs as simple graphs, thereby preserving crucial higher-order interaction information [1].
Beyond internal recovery tests, CHESHIRE was externally validated by assessing its ability to improve phenotypic predictions in 49 draft GEMs reconstructed from commonly used pipelines (CarveMe and ModelSEED) [1]. This validation tested whether reactions identified by CHESHIRE could enable draft GEMs to produce fermentation products and amino acids that they were previously unable to secrete.
The results demonstrated that CHESHIRE significantly improved the theoretical predictability of metabolic phenotypes, confirming that the topology-based approach identifies biologically meaningful reactions that restore metabolic functionality [1]. This external validation is particularly significant as it demonstrates that CHESHIRE's predictions translate to improved functional capabilities in metabolic models beyond merely completing network connectivity.
Purpose: To install, configure, and run CHESHIRE for predicting missing reactions in genome-scale metabolic models.
System Requirements:
Step-by-Step Procedure:
Package Download
Input Preparation
cheshire-gapfilling/data/gems/universe.xml in cheshire-gapfilling/data/pools/substrate_exchange_reactions.csv and media.csv in cheshire-gapfilling/data/fermentation/ [5]Parameter Configuration
input_parameters.txt to specify:
GEM_DIRECTORY: ./data/gems/REACTION_POOL: ./data/pools/universe.xmlNUM_GAPFILLED_RXNS_TO_ADD: Number of top candidates to add for validationADD_RANDOM_RXNS: 0 (use CHESHIRE predictions) or 1 (use random reactions as control)NAMESPACE: "bigg" or "modelseed" (namespace of biochemical reaction database) [5]Execution
Results Interpretation
phenotype_no_gapfill: Binary value indicating secretion capability before gap-fillingphenotype_w_gapfill: Binary value indicating secretion capability after gap-fillingrxn_ids_added: IDs of candidate reactions added during gap-filling [5]Purpose: To train CHESHIRE on a specific metabolic network and validate its prediction performance.
Procedure:
Data Partitioning:
Negative Sampling:
Feature Initialization:
Model Training:
Performance Evaluation:
Table 3: Essential Research Reagents and Computational Resources
| Item | Function/Description | Source/Reference |
|---|---|---|
| BiGG Models | High-quality, curated genome-scale metabolic models for training and validation | http://bigg.ucsd.edu/models [1] |
| AGORA Models | 818 intermediate-quality GEMs for comprehensive testing | [1] |
| BiGG Reaction Database | Universal reaction pool for candidate generation | [1] [5] |
| ModelSEED Database | Alternative biochemical database for reaction annotation | [5] |
| IBM CPLEX Solver | Optimization software for flux balance analysis in validation | [5] |
| ChEBI Database | Chemical database for metabolite information and negative sampling | [8] |
| Cbl-b-IN-8 | Cbl-b-IN-8, MF:C35H44F3N7O3, MW:667.8 g/mol | Chemical Reagent |
| Flt3-IN-24 | Flt3-IN-24|Potent FLT3 Inhibitor|For Research Use | Flt3-IN-24 is a potent FLT3 inhibitor for cancer research. This product is for research use only and not for human consumption. |
CHESHIRE's utilization of hypergraphs rather than traditional graphs provides significant advantages for metabolic network representation. In traditional graphs, each edge can only connect two nodes, forcing multi-metabolite reactions to be decomposed into pairwise relationships, which results in information loss and failure to explicitly capture the true higher-order nature of biochemical reactions [10]. In contrast, hypergraphs allow each hyperedge to connect an arbitrary number of nodes, providing a natural framework where each reaction can be represented as a single hyperedge connecting all participating metabolites [1] [10].
This preservation of higher-order information is particularly crucial for metabolic networks because the stoichiometric relationships between multiple reactants and products in a single reaction represent fundamental biochemical constraints that govern metabolic functionality. By maintaining these relationships intact, CHESHIRE can learn more meaningful patterns from the network topology.
Recent advancements in hypergraph learning for metabolic networks have built upon CHESHIRE's foundation while addressing some of its limitations. The Multi-HGNN method, for instance, extends the approach by incorporating multi-modal data, including biochemical features of metabolites learned from pre-trained models on large unlabeled small molecule datasets, and metabolic directionality information [6]. This integration of additional biological information alongside topological features demonstrates one direction of innovation in the field.
Similarly, CLOSEgaps represents another evolution that combines hypergraph convolutional networks with attention mechanisms to predict metabolic gaps, achieving over 96% accuracy in recovering artificially introduced gaps [8]. These subsequent developments indicate the fertile ground for further innovation in topology-based gap-filling while validating CHESHIRE's core insight that network topology contains substantial predictive signal for identifying missing reactions.
CHESHIRE's groundbreaking innovation lies in its demonstration that hypergraph topology alone enables accurate prediction of missing reactions in metabolic networks without dependency on phenotypic data. This topology-only approach, validated across hundreds of models through both internal recovery tests and external phenotypic improvement assessments, provides researchers with a powerful tool for metabolic network curation during early research stages when experimental data is scarce or unavailable. The method's open-source implementation and detailed protocols enable immediate application to diverse metabolic networks, potentially accelerating discoveries across metabolic engineering, microbial ecology, and drug development.
Genome-scale metabolic models (GEMs) serve as powerful computational frameworks that mathematically represent the metabolic network of an organism, integrating gene-protein-reaction associations to predict metabolic capabilities and physiological states [1] [11]. However, a significant challenge persists in creating complete and accurate GEMs for non-model organisms and uncultivable species. Due to incomplete genomic annotations and limited biochemical knowledge, even highly curated GEMs contain knowledge gaps, particularly missing metabolic reactions [1]. This problem is especially pronounced for uncultivable microorganisms and understudied organisms where experimental data is scarce or non-existent. Traditional gap-filling methods typically require phenotypic data as input to identify discrepancies between model predictions and experimental observations, creating a fundamental limitation for species where such data is unavailable [1] [6]. The CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor) method represents a transformative approach that overcomes this limitation by predicting missing reactions in GEMs purely from metabolic network topology, without requiring experimental phenotypic data [1] [5]. This application note details how CHESHIRE provides distinct advantages for metabolic network reconstruction of non-model and uncultivable species and offers practical protocols for its implementation.
The most significant advantage CHESHIRE offers for non-model organism research is its ability to perform accurate gap-filling without experimental phenotypic data. Traditional optimization-based gap-filling methods require experimental data inputsâsuch as growth profiles or metabolite secretion patternsâto identify inconsistencies between model predictions and laboratory observations [1] [6]. For the vast majority of non-model organisms and the estimated 99% of microorganisms that are uncultivable under standard laboratory conditions, such datasets are simply unavailable [1]. CHESHIRE fundamentally bypasses this requirement by leveraging only the topological features of the metabolic network itself. By treating the metabolic network as a hypergraph where each reaction is represented as a hyperlink connecting multiple metabolite nodes, CHESHIRE extracts complex patterns from the existing network structure to predict missing connections [1] [9]. This capability enables researchers to generate high-quality, working metabolic models for organisms where traditional gap-filling would be impossible, opening new frontiers for exploring microbial dark matter and rare biosphere species.
Metabolic networks inherently involve multi-way relationships that are poorly represented by traditional graph structures. A single biochemical reaction typically connects multiple substrate and product metabolites, creating a higher-order interaction that CHESHIRE captures through hypergraph representation [1] [9]. Unlike methods that approximate hypergraphs as graphsâwhich results in loss of critical higher-order informationâCHESHIRE maintains the full hypergraph structure throughout its analysis [1] [6]. The algorithm employs a sophisticated deep learning architecture with four major steps: feature initialization using an encoder-based neural network to generate initial metabolite feature vectors from the incidence matrix; feature refinement using Chebyshev spectral graph convolutional network (CSGCN) to incorporate features of metabolites participating in the same reaction; pooling to integrate metabolite-level features into reaction-level representations; and scoring to produce probabilistic confidence scores for candidate reactions [1]. This comprehensive approach enables CHESHIRE to capture complex metabolic patterns that simpler topology-based methods miss, resulting in more biologically plausible gap-filling predictions for organisms with limited annotation.
Table 1: Key Algorithmic Advantages of CHESHIRE for Non-Model Organisms
| Feature | Technical Approach | Benefit for Non-Model Organisms |
|---|---|---|
| Data Input | Requires only metabolic network topology | Enables gap-filling without organism-specific experimental data |
| Network Representation | Native hypergraph structure preserving multi-way relationships | Maintains biochemical accuracy of metabolic reactions |
| Feature Learning | Chebyshev spectral graph convolutional network (CSGCN) | Captures complex topological patterns from limited existing annotations |
| Candidate Scoring | Probabilistic confidence scores for reactions | Prioritizes biologically relevant reactions for gap-filling |
CHESHIRE has undergone rigorous validation demonstrating its effectiveness for gap-filling applications. In internal validation tests designed to assess the method's ability to recover artificially introduced gaps, CHESHIRE was evaluated across 926 high- and intermediate-quality GEMs from the BiGG and AGORA databases [1]. When compared to state-of-the-art topology-based machine learning methods including Neural Hyperlink Predictor (NHP) and Clique Closure-based Coordinated Matrix Minimization (C3MM), CHESHIRE consistently outperformed these approaches in predicting artificially removed reactions [1]. The method achieved superior performance across multiple classification metrics, including the Area Under the Receiver Operating Characteristic curve (AUROC), though specific numerical values were not provided in the available literature. For non-model organism research, this robust performance across diverse metabolic networks suggests strong generalizability to less-characterized species.
Beyond topological completeness, CHESHIRE has demonstrated significant improvements in phenotypic prediction accuracy for draft metabolic models. External validation using 49 draft GEMs reconstructed from common pipelines (CarveMe and ModelSEED) showed that CHESHIRE improved theoretical predictions of fermentation product secretion and amino acid secretion capabilities [1]. This phenotypic relevance is particularly valuable for non-model organisms where researchers seek to predict metabolic capabilities such as production of valuable compounds or biodegradation of environmental pollutants. The method successfully identifies key reactions that enable specific metabolic functions, allowing researchers to prioritize experimental validation efforts.
Table 2: Performance Comparison of Topology-Based Gap-Filling Methods
| Method | Technical Approach | Validation Scope | Key Advantages |
|---|---|---|---|
| CHESHIRE | Hypergraph learning with CSGCN | 926 GEMs (BiGG & AGORA) | Superior AUROC, phenotypic prediction improvement |
| NHP | Neural network with graph approximation | Limited benchmark | Separates candidate reactions from training |
| C3MM | Clique closure matrix minimization | Limited benchmark | Integrated training-prediction process |
| DNNGIOR | Deep neural network reaction imputation | Phylogenetically-aware | Uses reaction frequency across bacteria |
Implementing CHESHIRE requires specific computational infrastructure and software dependencies. The method has been tested on MacOS Big Sur (version 11.6.2) and Monterey (version 12.3, 12.4), with recommended hardware specifications of 16+ GB RAM and 4+ cores with 2+ GHz/core processing speed [5]. The package depends on the Python scientific stack and requires installation of the IBM CPLEX solver for optimization components. Researchers working with non-model organisms should establish this computational environment before beginning gap-filling analyses. The source code is publicly available through the GitHub repository canc1993/cheshire-gapfilling, which provides the core functionality for predicting missing reactions [5].
Proper input data preparation is essential for successful application of CHESHIRE to non-model organisms. The protocol requires three main input components:
Draft Metabolic Model: The incomplete GEM for the non-model organism in SBML (Systems Biology Markup Language) format (XML file). This model serves as the starting point for gap-filling. For non-model organisms, this draft model is typically generated through automated reconstruction tools such as CarveMe or ModelSEED that create initial models from genomic annotations [1] [5].
Reaction Pool: A comprehensive biochemical database of known metabolic reactions that serves as the candidate set for gap-filling. The CHESHIRE package includes bigg_universe.xml as a default reaction pool, but researchers can incorporate organism-specific or environment-specific reaction databases to improve biological relevance [5]. For uncultivable species from specific environments, custom reaction pools reflecting metabolic capabilities of phylogenetically related organisms may enhance prediction accuracy.
Simulation Parameters: Configuration files specifying cultural conditions and analysis parameters. The substrate_exchange_reactions.csv file defines fermentation compounds to test, while media.csv specifies the culture medium composition for phenotypic simulations [5]. For non-model organisms from specialized environments, modifying these parameters to reflect the organism's native habitat can improve biological relevance of predictions.
The CHESHIRE execution process involves three main programs that can be run sequentially:
Reaction Scoring: Execute get_predicted_score() to compute confidence scores for all candidate reactions in the pool regarding their likelihood of being missing from the draft GEM. This step uses the hypergraph learning algorithm to analyze topological patterns in the existing network and identify structurally plausible additions [1] [5].
Similarity Assessment: Run get_similarity_score() to evaluate the mean similarity of candidate reactions to existing reactions in the draft model. This complementary scoring helps prioritize reactions that are consistent with the existing metabolic network composition.
Phenotypic Validation: Execute validate() to identify the minimal set of top-ranked reactions that enable new metabolic secretion capabilities. This step uses flux balance analysis to simulate metabolic phenotypes after adding candidate reactions and identifies those that resolve metabolic gaps [5]. For non-model organisms, this step is computationally intensive but valuable for generating testable hypotheses about metabolic capabilities.
Table 3: Essential Computational Tools for CHESHIRE Implementation
| Tool/Resource | Function | Application Notes |
|---|---|---|
| CHESHIRE GitHub Repository | Core gap-filling algorithm | Provides source code for reaction prediction and phenotypic validation [5] |
| IBM CPLEX Solver | Mathematical optimization | Required for constraint-based analysis and flux simulations |
| BiGG Database | Biochemical reaction database | Default knowledgebase of metabolic reactions for gap-filling [1] |
| CarveMe/ModelSEED | Automated model reconstruction | Generates draft GEMs from genomic data of non-model organisms [1] |
| SBML Format | Model standardization | Ensures compatibility between draft models and CHESHIRE pipeline |
| Antibacterial agent 184 | Antibacterial agent 184, MF:C20H16FNO3, MW:337.3 g/mol | Chemical Reagent |
| Cga-JK3 | CGA-JK3|IKKβ Inhibitor|For Research Use |
While CHESHIRE itself is topology-based, researchers can enhance its application to non-model organisms by incorporating phylogenetic principles. The DNNGIOR method, another deep learning approach, demonstrates that prediction accuracy for missing reactions is influenced by phylogenetic distance to organisms in the training set [12]. When applying CHESHIRE to extremely novel organisms with few close relatives in existing databases, researchers can implement a two-stage approach: first using CHESHIRE for topology-based gap-filling, then filtering predictions through phylogenetic profiling based on any available genomic data from distantly-related organisms. This hybrid approach maintains the advantage of not requiring organism-specific experimental data while incorporating evolutionary constraints to improve biological plausibility.
Recent advances in hypergraph learning for metabolic networks suggest future directions for enhancing CHESHIRE's applications. The Multi-HGNN framework demonstrates that integrating biochemical features of metabolites with topological information can improve missing reaction prediction [6]. For non-model organisms where even basic topological information is limited, incorporating chemical structure data of detected metabolites or mass spectrometry profiles can strengthen predictions. This multi-modal approach is particularly valuable for uncultivable species where researchers may have metabolite detection data from environmental samples but incomplete genomic information. While CHESHIRE currently operates purely on topology, its flexible architecture could potentially accommodate such additional data types to further enhance predictions for challenging organisms.
CHESHIRE represents a significant advancement in metabolic network reconstruction for non-model organisms and uncultivable species by eliminating the dependency on experimental phenotypic data that has limited previous gap-filling methods. Its hypergraph learning approach effectively leverages topological patterns in metabolic networks to identify biologically plausible missing reactions, enabling researchers to generate functional metabolic models for organisms previously inaccessible to metabolic modeling. The method's robust performance across diverse metabolic networks, combined with its improving phenotypic prediction capabilities, makes it a valuable tool for exploring the metabolic potential of microbial dark matter and rare biosphere organisms. As the field moves toward multi-modal data integration and phylogenetically-aware algorithms, CHESHIRE provides a strong foundation for computational exploration of metabolism in the most challenging and biologically interesting species.
A hypergraph is a generalization of a graph in which an edge, called a hyperedge, can join any number of vertices [13]. This contrasts with an ordinary graph where an edge connects exactly two vertices. Formally, an undirected hypergraph is defined as a pair ( H = (X, E) ), where ( X ) is a set of vertices and ( E ) is a set of non-empty subsets of ( X ) called hyperedges [13]. Hypergraphs provide a natural framework for representing metabolic networks, where each reaction (hyperlink) connects multiple metabolite (node) participants simultaneously [1].
A reaction network is a bipartite labeled directed graph with two types of nodes: molecular states (reactants or products) and reactions [14]. In this structure, edges connect reactant states to reaction nodes, and reaction nodes to product states [14]. In mathematical systems biology, a chemical reaction network (CRN) comprises a set of reactants, a set of products, and a set of reactions, typically modeled using the law of mass action to track concentration changes over time [15].
Metabolic topology refers to the structural arrangement and connectivity of metabolic networks without explicit consideration of reaction kinetics [16] [17]. This architectural perspective focuses on how metabolites and reactions interconnect to form functional pathways, revealing properties like modularity, flexibility, and robustness [17]. Topological analysis can identify independent metabolic modulesâsets of reversible reactions isolated by irreversible reactionsâwhich correlate with specific metabolic functions [17].
Table 1: Performance Comparison of Topology-Based Gap-Filling Methods [1]
| Method | Key Approach | AUROC (Mean) | Scalability | External Validation |
|---|---|---|---|---|
| CHESHIRE | Chebyshev Spectral Hyperlink Predictor | 0.94 | High (separate training from candidate reactions) | Improved predictions for 49 draft GEMs |
| NHP | Neural Hyperlink Predictor (graph approximation) | 0.92 | Moderate | Limited validation |
| C3MM | Clique Closure-based Matrix Minimization | 0.89 | Low (retrains for each new pool) | Limited validation |
| Node2Vec-Mean | Random walk embedding with mean pooling | 0.85 | High | Not performed |
Table 2: Metabolic Network Flexibility in E. coli iJO1366 Model [17]
| Network Property | Value | Interpretation |
|---|---|---|
| Total reactions | 2,583 | Comprehensive network coverage |
| Original reversible reactions | 941 | Initial flexibility |
| Structurally reversible reactions after compression | 248 (26% of original) | Actual independent flexibility |
| Reactions requiring fixed direction | ~79% of reversible | High degree of network flexibility |
| Independent modules identified | 103 | Functional specialization |
Purpose: To predict missing reactions in Genome-scale Metabolic models (GEMs) using only network topology via the CHESHIRE deep learning method [1].
Materials:
Procedure:
Feature Initialization:
Feature Refinement:
Pooling Operation:
Scoring and Prediction:
Validation:
Troubleshooting:
Purpose: To identify topologically independent modules and assess network flexibility through reaction directionality analysis [17].
Materials:
Procedure:
Module Identification:
Directed Topology (DT) Enumeration:
Flexibility Quantification:
Analysis:
Diagram 1: CHESHIRE workflow for metabolic gap-filling showing the transformation of GEM inputs into curated models through hypergraph learning [1].
Diagram 2: Hypergraph representation of metabolic reactions where each reaction (rectangle) connects multiple metabolites (circles) simultaneously [13] [1].
Table 3: Essential Computational Tools for Metabolic Network Analysis
| Tool/Resource | Type | Function | Application Context |
|---|---|---|---|
| CHESHIRE | Deep learning algorithm | Predicts missing reactions from topology | Gap-filling GEMs without experimental data [1] |
| BiGG Models | Knowledgebase | High-quality curated metabolic models | Reference for reaction stoichiometry and network structure [1] |
| Flux Balance Analysis (FBA) | Constraint-based method | Predicts metabolic fluxes under steady-state | Validation of network functionality [16] |
| Hypergraph Laplacian | Mathematical framework | Spectral analysis of hypergraph structure | Network clustering and decomposition [13] |
| Synthetic Accessibility | Topological metric | Measures minimal reactions needed for biomass production | Prediction of knockout viability [16] |
| Directed Topology Enumeration | Algorithmic approach | Counts feasible reaction direction patterns | Quantifying network flexibility [17] |
| Cdk-IN-13 | Cdk-IN-13, MF:C23H27N7O3, MW:449.5 g/mol | Chemical Reagent | Bench Chemicals |
| Hdac6-IN-41 | HDAC6-IN-41|Potent HDAC6 Inhibitor|For Research | HDAC6-IN-41 is a selective HDAC6 inhibitor for cancer, neurodegeneration, and fibrosis research. This product is for research use only (RUO) and not for human or veterinary diagnosis or therapeutic use. | Bench Chemicals |
Genome-scale Metabolic Models (GEMs) are mathematical representations of an organism's metabolism, providing comprehensive gene-reaction-metabolite connectivity crucial for predicting metabolic fluxes in living organisms [18] [1]. Even highly curated GEMs contain knowledge gaps in the form of missing reactions due to our imperfect knowledge of metabolic processes and incomplete genomic annotations [18] [1]. While existing gap-filling methods typically require phenotypic data as input, the CHEbyshev Spectral HyperlInk pREdictor (CHESHIRE) represents a breakthrough deep learning-based method that predicts missing reactions in GEMs using only metabolic network topology, without requiring experimental data [18] [1]. This application note details CHESHIRE's innovative four-stage architecture and provides experimental protocols for its implementation in metabolic network research.
CHESHIRE's architecture transforms the challenge of predicting missing metabolic reactions into a hyperlink prediction problem on hypergraphs, where each reaction is represented as a hyperlink connecting multiple metabolite nodes [18] [1]. This approach enables the model to learn complex topological patterns within metabolic networks.
Table 1: The Four-Stage Architecture of CHESHIRE
| Stage | Component | Function | Technical Implementation |
|---|---|---|---|
| 1. Feature Initialization | Encoder-based one-layer neural network | Generates initial feature vector for each metabolite | Transforms incidence matrix data into dense feature representations encoding topological relationships |
| 2. Feature Refinement | Chebyshev Spectral Graph Convolutional Network (CSGCN) | Refines metabolite features by incorporating information from connected metabolites | Captures metabolite-metabolite interactions through spectral graph convolution on decomposed graphs |
| 3. Pooling | Maximum minimum-based function + Frobenius norm-based function | Integrates metabolite-level features into reaction-level representations | Combines complementary pooling approaches to capture different aspects of reaction topology |
| 4. Scoring | One-layer neural network | Produces probabilistic existence score for each reaction | Converts reaction feature vectors into confidence scores (0-1) indicating likelihood of reaction existence |
The following diagram illustrates CHESHIRE's complete workflow and four-stage architecture:
The feature initialization stage employs an encoder-based one-layer neural network to generate initial feature vectors for each metabolite from the incidence matrix [18] [1]. The incidence matrix contains boolean values indicating the presence or absence of each metabolite in each reaction, providing a complete representation of the metabolic network's topology. This initial feature vector encodes the crude topological relationship of a metabolite with all reactions in the metabolic network, serving as the foundational input for subsequent refinement stages. The encoder effectively transforms the high-dimensional, sparse incidence matrix into dense feature representations that capture each metabolite's position and connectivity within the broader metabolic network.
Feature refinement enhances the initial metabolite features using a Chebyshev Spectral Graph Convolutional Network (CSGCN) operating on the decomposed graph [18] [1]. The decomposed graph consists of fully connected subgraphs where each subgraph represents a reaction with all its metabolites connected. The CSGCN refines the feature vector of each metabolite by incorporating features of other metabolites participating in the same reactions, thereby capturing complex metabolite-metabolite interactions and higher-order relationships that are not apparent from the incidence matrix alone. This spectral approach allows CHESHIRE to efficiently model localized metabolic neighborhoods and propagate feature information across connected metabolites, significantly enhancing the model's ability to learn meaningful topological patterns.
The pooling stage integrates node-level (metabolite) features into hyperlink-level (reaction) representations using graph coarsening methods [18] [1]. CHESHIRE combines two complementary pooling functions: a maximum minimum-based function (as used in NHP) and a Frobenius norm-based function. The maximum minimum-based pooling captures extreme feature values across metabolites in a reaction, while the Frobenius norm-based pooling provides information about the overall distribution and magnitude of metabolite features. This dual approach ensures that the resulting reaction representations comprehensively encode both salient and aggregate topological properties of the constituent metabolites, enabling more robust reaction-level feature learning.
In the final scoring stage, the feature vector for each reaction is processed through a one-layer neural network to produce a probabilistic score indicating the confidence of the reaction's existence [18] [1]. During training, these scores are compared to target scores (1 for positive reactions present in the metabolic network, 0 for negative reactions) using a loss function that updates model parameters through backpropagation. During prediction, these confidence scores enable prioritization of candidate missing reactions from a universal reaction database, allowing researchers to focus experimental validation efforts on the most promising candidates.
CHESHIRE's performance has been rigorously validated through comprehensive testing on 926 high- and intermediate-quality GEMs [18] [1]. The internal validation protocol involves these critical steps:
Table 2: CHESHIRE Performance on BiGG Models (Internal Validation)
| Method | AUROC | AUPRC | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| CHESHIRE | 0.973 | 0.974 | 0.913 | 0.910 | 0.916 |
| NHP | 0.904 | 0.905 | 0.792 | 0.788 | 0.796 |
| C3MM | 0.873 | 0.875 | 0.753 | 0.749 | 0.757 |
| NVM | 0.802 | 0.804 | 0.673 | 0.669 | 0.677 |
Beyond internal recovery tests, CHESHIRE's capability was validated for predicting actual metabolic phenotypes:
External validation demonstrated that CHESHIRE significantly improved the theoretical predictions of fermentation products and amino acid secretion in draft GEMs, confirming its practical utility for metabolic model curation and phenotype prediction [18] [1].
Table 3: Essential Research Reagents and Computational Tools
| Item | Function/Description | Application in CHESHIRE |
|---|---|---|
| BiGG Models | Repository of high-quality, curated GEMs | Provides training data and benchmark models for validation [18] [1] |
| AGORA Models | Resource of genome-scale metabolic models for hundreds of human gut microbes | Offers diverse metabolic networks for testing generalizability [18] |
| Universal Metabolite Pool | Comprehensive collection of known metabolites | Source for negative sampling by random metabolite replacement [18] [1] |
| Reaction Databases | Collections of biochemical reactions (e.g., BiGG Database) | Provides candidate reactions for gap-filling predictions [18] [1] |
| Chebyshev Spectral GCN | Graph convolutional network using Chebyshev polynomial filters | Captures metabolite-metabolite interactions during feature refinement [18] [1] |
| Incidence Matrix | Boolean matrix linking metabolites to reactions | Represents hypergraph structure for feature initialization [18] [1] |
| Monte Carlo Splitting | Statistical method for random data partitioning | Creates multiple training-testing splits for robust validation [18] [1] |
| Ferroptosis-IN-1 | Ferroptosis-IN-1, MF:C22H34O5, MW:378.5 g/mol | Chemical Reagent |
| Hsd17B13-IN-7 | Hsd17B13-IN-7, MF:C21H24FNO4, MW:373.4 g/mol | Chemical Reagent |
When benchmarked against other topology-based machine learning methods, CHESHIRE demonstrates superior performance:
CHESHIRE vs. NHP: CHESHIRE outperforms Neural Hyperlink Predictor (NHP) by employing a more sophisticated CSGCN for feature refinement and incorporating an additional Frobenius norm-based pooling function, whereas NHP approximates hypergraphs using graphs, resulting in loss of higher-order information [18] [1].
CHESHIRE vs. C3MM: Unlike Clique Closure-based Coordinated Matrix Minimization (C3MM), which has an integrated training-prediction process requiring re-training for each new reaction pool, CHESHIRE separates candidate reactions from training, enabling better scalability to large reaction databases [18] [1].
Validation Advantage: Previous methods were benchmarked on only a handful of GEMs and lacked external validation on phenotypic predictions, whereas CHESHIRE has been comprehensively tested on 926 GEMs and validated for phenotypic prediction improvement [18] [1].
The following diagram illustrates CHESHIRE's comparative advantage in the metabolic gap-filling landscape:
Recent advancements in the field have introduced alternative approaches like CLOSEgaps, which integrates hypergraph convolutional networks with attention mechanisms and reports gap-filling accuracy exceeding 96% across various GEMs [8]. Another approach, DNNGIOR, uses a deep neural network trained on over 11,000 bacterial species to impute missing reactions, with performance dependent on reaction frequency across bacteria and phylogenetic distance of the query to training genomes [19]. However, CHESHIRE remains distinguished by its rigorous validation on hundreds of models and demonstrated improvement in phenotypic predictions.
CHESHIRE's four-stage architecture represents a significant advancement in topology-based gap-filling for genome-scale metabolic models. By leveraging hypergraph learning and sophisticated feature processing, it enables accurate prediction of missing reactions without requiring experimental phenotypic data. The comprehensive validation on hundreds of models and demonstrated improvement in phenotypic predictions position CHESHIRE as a powerful tool for metabolic network curation, particularly for non-model organisms where experimental data is scarce. As automated reconstruction pipelines continue to generate draft GEMs at an accelerating pace, CHESHIRE provides researchers with a robust computational method to enhance model completeness and reliability, ultimately accelerating discoveries in metabolic engineering, microbial ecology, and drug development.
The CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor) deep learning method represents a significant advancement for predicting missing reactions in genome-scale metabolic models (GEMs) using only metabolic network topology, without requiring experimental phenotypic data as input [1]. Implementing this sophisticated hypergraph learning framework requires careful attention to specific computational hardware and software components to ensure successful operation. This application note provides a comprehensive technical specification for establishing the computational environment needed to run CHESHIRE effectively, covering both hardware prerequisites and the detailed software configuration necessary for gap-filling metabolic networks in research environments.
The computational architecture of CHESHIRE relies on a Python-based scientific computing stack integrated with the IBM CPLEX solver, creating a hybrid environment that leverages both deep learning and mathematical optimization capabilities. This combination enables researchers to predict missing metabolic reactions and subsequently validate phenotypic improvements in gap-filled models through flux balance analysis [5]. Proper configuration of these components is essential for achieving the performance and accuracy demonstrated in validation studies, where CHESHIRE outperformed other topology-based methods in predicting artificially removed reactions across 926 high- and intermediate-quality GEMs [1].
Implementing CHESHIRE requires computational resources capable of handling the significant memory and processing demands of hypergraph learning algorithms and subsequent metabolic simulations. The table below outlines both minimum and recommended hardware configurations:
Table 1: Hardware Requirements for CHESHIRE Implementation
| Component | Minimum Specification | Recommended Specification |
|---|---|---|
| RAM | 16 GB | 16+ GB |
| CPU | 4 cores, 2+ GHz/core | 4+ cores, 2+ GHz/core |
| Storage | 10 GB free space | 50+ GB free space |
| OS | MacOS Big Sur (11.6.2+) | MacOS Monterey (12.3+) |
The package has been specifically tested on the listed MacOS versions [5]. While not explicitly mentioned in the documentation, comparable Linux distributions with similar kernel versions would likely provide compatible environments for research deployment.
The CHESHIRE framework builds upon the standard Python scientific computing stack, requiring specific library versions for proper operation. The following dependencies must be installed prior to CPLEX integration:
Table 2: Essential Python Package Dependencies
| Package | Function | Installation Method |
|---|---|---|
| NumPy | Numerical computations for hypergraph learning algorithms | pip/conda install numpy |
| SciPy | Scientific computing and sparse matrix operations | pip/conda install scipy |
| Pandas | Data manipulation and processing of metabolic networks | pip/conda install pandas |
| cobra | Constraint-based reconstruction and analysis of metabolic models | pip/conda install cobrapy |
These foundational packages support the core hypergraph learning architecture and subsequent metabolic simulations [5]. Researchers should ensure compatibility between these dependencies, particularly when working within existing Python environments with pre-installed scientific computing packages.
The IBM CPLEX optimizer serves as a critical component for constraint-based metabolic simulations within the CHESHIRE framework. CPLEX provides the mathematical optimization backbone for flux balance analysis performed during the validation phase of gap-filling [5]. Researchers must obtain either an academic or commercial license for CPLEX directly from IBM's official distribution channels. Version compatibility is particularly crucial, as CPLEX 12.10 only provides APIs for Python 3.6 and 3.7 [5], potentially requiring researchers to establish dedicated Python environments to maintain version alignment.
After installing the CPLEX suite software, researchers must link it to their Python environment through a specific installation procedure. The following protocol ensures proper integration:
cd /path/to/CPLEX_Studio1210/pythonpython setup.py installThis process compiles and installs the CPLEX Python bindings, making the optimizer available to the CHESHIRE package [5] [20]. Verification of successful installation can be performed by attempting to import the CPLEX module within Python (import cplex).
While CPLEX represents the primary supported solver for CHESHIRE, researchers may consider alternative optimization engines for metabolic simulations. The Gurobi solver offers comparable performance to CPLEX and can be installed via conda (conda install -c gurobi gurobi) or pip (python -m pip install gurobipy) [20]. For open-source alternatives, the SCIP solver provides mixed-integer programming capabilities and can be installed via conda-forge (conda install -c conda-forge pyscipopt), though with potential stability considerations in version 8.0.1 that may require installing version 8.0.0 instead [20].
With the base environment and CPLEX solver established, researchers can proceed with CHESHIRE implementation. The package is available through GitHub and can be installed using the following protocol:
git clone https://github.com/canc1993/cheshire-gapfilling.gitcd cheshire-gapfillingcheshire-gapfilling/data/The input directory structure requires three specific subdirectories: gems for input metabolic models in XML format, pools containing reaction pools (renamed to universe.xml), and fermentation with files defining fermentation compounds and culture media [5]. Proper configuration of these input resources is essential for successful gap-filling analysis.
CHESHIRE operation requires precise parameter specification through the input_parameters.txt file. The table below outlines critical parameters that researchers must configure for specific use cases:
Table 3: Essential Configuration Parameters for CHESHIRE
| Parameter | Default Value | Function | Research Consideration |
|---|---|---|---|
NAMESPACE |
"bigg" | Specifies biochemical reaction database namespace | Must match GEM and pool namespace (BiGG or ModelSeed) |
NUM_GAPFILLED_RXNS_TO_ADD |
- | Number of top candidate reactions to add for validation | Higher values increase computation time significantly |
ANAEROBIC |
1 | Skips reactions involving oxygen | Critical for modeling gut microbiome organisms |
MIN_PREDICTED_SCORES |
0.9995 | Score cutoff for candidate reactions | Adjust based on desired prediction confidence |
NUM_CPUS |
1 | CPUs for parallel simulation | Increases validation speed for large-scale analyses |
These parameters directly influence both the computational demand and biological relevance of CHESHIRE predictions, requiring careful consideration based on the specific research context and available computational resources [5].
Successful implementation of CHESHIRE for metabolic network gap-filling requires both computational and data resources. The following table catalogues the essential "reagent" solutions needed to establish a complete research workflow:
Table 4: Research Reagent Solutions for CHESHIRE Implementation
| Component | Function | Source/Format |
|---|---|---|
| Genome-Scale Metabolic Models (GEMs) | Input networks for gap-filling analysis | XML files (SBML format) in data/gems/ directory |
| Reaction Pool (universe.xml) | Comprehensive set of candidate reactions for prediction | Merged from biochemical databases (BiGG/ModelSeed) |
| Culture Medium Definition (media.csv) | Specifies nutrient availability for phenotypic validation | CSV file with compound IDs and uptake fluxes |
| Fermentation Compounds List | Defines secretion phenotypes for validation | CSV file with compound names and database IDs |
| BiGG/ModelSeed Namespace | Standardized biochemical identifiers | Consistent naming across all inputs |
These reagent solutions form the foundational data infrastructure that CHESHIRE utilizes to predict missing reactions and validate phenotypic improvements in metabolic networks [5]. Researchers should ensure consistency between namespace conventions across all input files to prevent identifier mapping errors during analysis.
Following environment setup, researchers should execute a verification protocol to confirm proper CHESHIRE installation:
python3 main.py from the cheshire-gapfilling directoryresults/ directory structure/results/scores/ directory for predicted reaction scores/results/gaps/ for metabolic simulation resultsSuccessful execution will produce confidence scores for candidate reactions from the pool for each input GEM, enabling identification of potentially missing metabolic functions [5].
Researchers may encounter several technical challenges during CHESHIRE implementation. The Java Virtual Machine (JVM) initialization may fail with a JVMNotFoundException, resolvable by setting the JAVA_HOME environment variable to point to the Java installation directory [20]. CPLEX compatibility issues may arise with Python versions beyond 3.7, necessitating the creation of a dedicated Python 3.6 or 3.7 environment. Additionally, file path errors often result from incorrect directory structures or namespace mismatches between input GEMs and the reaction pool [5].
Establishing a properly configured computational environment with the specified hardware resources, Python dependencies, and CPLEX solver integration is essential for leveraging CHESHIRE's advanced hypergraph learning capabilities for metabolic network gap-filling. The detailed protocols and specifications provided in this document enable researchers to replicate the validation performance demonstrated in the original research, where CHESHIRE improved phenotypic predictions for 49 draft GEMs of fermentation products and amino acid secretions [1]. Through careful attention to the version compatibility, directory structure, and parameter configuration outlined in this application note, research teams can effectively implement this powerful deep learning framework to address critical knowledge gaps in genome-scale metabolic models.
The CHEbyshev Spectral HyperlInk pREdictor (CHESHIRE) is a deep learning-based method designed to predict missing reactions in Genome-scale Metabolic models (GEMs) using only topological features of metabolic networks, without requiring experimental phenotypic data as input [1]. This capability makes it particularly valuable for gap-filling metabolic networks of non-model organisms or those that are uncultivable, where experimental data may be scarce or unavailable [18]. The CHESHIRE algorithm frames the problem of identifying missing reactions as a hyperlink prediction task on hypergraphs, where each reaction is represented as a hyperlink connecting multiple metabolite nodes [1] [18].
Successful application of CHESHIRE depends critically on the proper preparation of three fundamental input components: the target GEMs requiring gap-filling, a comprehensive reaction pool serving as candidate reactions, and correct namespace configuration to ensure biochemical consistency across all components. This protocol details the specifications, preparation methods, and quality control measures for each input type, providing researchers with a comprehensive guide to configuring CHESHIRE for effective metabolic network gap-filling.
File Format and Specifications:
.xml file extension [5]EX_SUFFIX parameter) and comply with standard SBML conventions for metabolic modelsPreparation Workflow:
gapReport in RAVEN Toolbox [21] to identify dead-end metabolites and disconnected network components./data/gems/ directory within the CHESHIRE workspace [5]Table 1: Supported GEM Sources and Characteristics
| Source/Database | Model Quality | Namespace | Compartmentalization | Typical Use Case |
|---|---|---|---|---|
| BiGG Models [1] | High-quality, curated | BiGG | Extensive | Benchmarking, reference organisms |
| AGORA [18] | Intermediate-quality | BiGG | Standard | Microbial communities, gut microbiome |
| CarveMe [18] | Draft-quality | BiGG | Basic | High-throughput reconstruction |
| ModelSEED [5] | Draft-quality | ModelSEED | Basic | Automated reconstruction |
File Format and Specifications:
.xml) containing comprehensive collections of biochemical reactions [5]bigg_universe.xml [5], which aggregates reactions from the BiGG database [1]Custom Pool Preparation:
universe.xml in the ./data/pools/ directory [5]Special Considerations:
ANAEROBIC flag to automatically exclude oxygen-dependent reactions during gap-filling [5]Supported Namespaces:
Configuration Parameters:
"bigg" or "modelseed" across all input files and parameters [5]"_e" for BiGG namespace) [5]Namespace Verification Protocol:
[METABOLITE_ID][EX_SUFFIX] pattern_c, _e, _m for BiGG)Table 2: Essential CHESHIRE Parameters for Input Processing
| Parameter | Default Value | Function | Impact on Input Processing |
|---|---|---|---|
GEM_DIRECTORY |
./data/gems/ |
Directory containing input GEMs | Must point to location of SBML files |
REACTION_POOL |
./data/pools/universe.xml |
Path to candidate reaction pool | Critical for gap-filling suggestions |
NAMESPACE |
"bigg" |
Biochemical database namespace | Affects ID mapping and interpretation |
EX_SUFFIX |
"_e" |
Suffix for exchange reactions | Must match GEM convention |
NUM_GAPFILLED_RXNS_TO_ADD |
User-defined | Number of top candidates to add | Balances computation time vs. comprehensiveness |
MIN_PREDICTED_SCORES |
0.9995 |
Minimum confidence score cutoff | Filters low-probability candidates |
ANAEROBIC |
1 (true) / 0 (false) |
Exclude oxygen-involving reactions | Critical for anaerobic organisms |
Procedure:
./data/gems/ are valid SBMLuniverse.xml) uses correct namespaceinput_parameters.txtCHESHIRE Execution:
Output Generation:
Runtime Considerations:
validate() function is computationally intensive; adjust NUM_GAPFILLED_RXNS_TO_ADD to manage runtime [5]NUM_CPUS parameter for parallelization where possible [5]Fermentation Compound List:
./data/fermentation/substrate_exchange_reactions.csv [5]compound (conventional names) and namespace-specific ID column [5]Culture Medium Specification:
./data/fermentation/media.csv [5]flux (maximum uptake rate) [5]
CHESHIRE Input Processing and Execution Workflow
Table 3: Essential Research Reagent Solutions for CHESHIRE Implementation
| Reagent/Resource | Function | Availability | Critical Specifications |
|---|---|---|---|
| BiGG Universe | Default reaction pool for candidate reactions | BiGG Database [5] | BiGG namespace, SBML format |
| MetaCyc Database | Source for biochemical reactions | MetaCyc.org [21] | Experimentally verified reactions |
| CPLEX Optimizer | Mathematical optimization solver | IBM CPLEX [5] | Compatible with Python 3.6/3.7 |
| SBML Files | Standard format for GEM exchange | BiGG Models [1] | Level 3 Version 1 compliant |
| Python Environment | Execution platform for CHESHIRE | Python 3.6/3.7 [5] | With scientific stack (numpy, scipy) |
| Jak-IN-36 | Jak-IN-36, MF:C22H23ClN6, MW:406.9 g/mol | Chemical Reagent | Bench Chemicals |
| SARS-CoV-2-IN-57 | SARS-CoV-2-IN-57, MF:C23H37N3O, MW:371.6 g/mol | Chemical Reagent | Bench Chemicals |
Common Issues and Solutions:
EX_SUFFIX parameter matches the convention used in input GEMsQuality Assessment Metrics:
./results/scores/ for expected confidence distributionssuggested_gaps.csv output [5]RESOLVE_EGC parameter to identify and resolve thermodynamically infeasible cycles [5]Proper preparation of input files following these specifications ensures optimal performance of the CHESHIRE algorithm, enabling accurate prediction of missing metabolic reactions and ultimately leading to more complete and biologically relevant genome-scale metabolic models.
The CHEbyshev Spectral HyperlInk pREdictor (CHESHIRE) is a deep learning-based method designed to predict missing reactions in Genome-scale Metabolic Models (GEMs) using only topological features of metabolic networks, without requiring experimental phenotypic data as input [1] [18]. Proper configuration of the input_parameters.txt file is crucial for controlling CHESHIRE's three main programs: (1) scoring candidate reactions for their likelihood of being missing in input GEMs (get_predicted_score()), (2) scoring the mean similarity of candidate reactions to existing reactions (get_similarity_score()), and (3) identifying the minimum set of reactions that enable new metabolic secretions (validate()) [5]. This application note provides detailed protocols for configuring these simulation parameters to optimize gap-filling performance for metabolic network research.
Table 1: Mandatory parameters in input_parameters.txt
| Parameter Name | Data Type | Default Value | Description | Usage Notes |
|---|---|---|---|---|
CULTURE_MEDIUM |
Filepath | ./data/fermentation/media.csv |
Specifies culture medium composition | File requires two columns: compound IDs (named per NAMESPACE) and flux for maximum uptake rates |
REACTION_POOL |
Filepath | ./data/pools/universe.xml |
Defines candidate reactions for gap-filling | Use same namespace (BiGG/ModelSeed) as GEM files |
GEM_DIRECTORY |
Directory path | ./data/gems/ |
Location of input GEM files | Directory containing metabolic models in XML format |
GAPFILLED_RXNS_DIRECTORY |
Filepath | ./results/scores |
Output directory for candidate reaction scores | |
NUM_GAPFILLED_RXNS_TO_ADD |
Integer | None (must be specified) | Number of top candidate reactions to add for fermentation tests | Larger values increase computation time for validate() |
ADD_RANDOM_RXNS |
Boolean (0/1) | None (must be specified) | If 1, randomly selects reactions from pool instead of using highest CHESHIRE scores | Enables control experiments for benchmarking |
SUBSTRATE_EX_RXNS |
Filepath | ./data/fermentation/substrate_exchange_reactions.csv |
Defines fermentation compounds to test | File requires compound (conventional names) and NAMESPACE columns for compound IDs |
Table 2: Optional parameters in input_parameters.txt
| Parameter Name | Data Type | Default Value | Description | Impact on Analysis |
|---|---|---|---|---|
NUM_CPUS |
Integer | 1 | Number of CPUs for simulations in validate() |
Parallelizes simulation; predict() is not parallelized |
EX_SUFFIX |
String | "_e" |
Suffix of exchange reactions in the model | Must match GEM notation |
RESOLVE_EGC |
Boolean (0/1) | 1 | Resolves energy-generating cycles | Improves thermodynamic feasibility but increases computation time |
OUTPUT_DIRECTORY |
Directory path | ./results/gaps |
Output directory for simulation results | |
OUTPUT_FILENAME |
String | "suggested_gaps.csv" |
Filename for simulation results | |
FLUX_CUTOFF |
Float | 1e-5 |
Threshold for positive fermentation phenotype | Secretion flux > cutoff considered positive |
ANAEROBIC |
Boolean (0/1) | 1 | Skips reactions involving oxygen during gap-filling | Essential for modeling anaerobic conditions |
BATCH_SIZE |
Integer | 10 | Reactions added in a batch during gap-filling | Smaller values improve EGC resolution |
NAMESPACE |
String | "bigg" |
Biochemical reaction database namespace | Currently supports bigg and modelseed |
MIN_PREDICTED_SCORES |
Float | 0.9995 |
Cutoff for candidate reaction scores | Reactions with scores below cutoff are discarded |
Figure 1: CHESHIRE workflow showing the integration of configuration parameters throughout the analysis process.
./data/gems/ directory. Ensure reaction pool (universe.xml) uses the same namespace as GEM files [5].input_parameters.txt:
NAMESPACE = "bigg" or NAMESPACE = "modelseed" to match biochemical reaction database used in GEMs and reaction pool [5].python3 main.py in terminal. Monitor output in ./results/scores/ directory [5].NUM_CPUS to available cores to parallelize validate() function. Note that predict() function is not parallelized [5].FLUX_CUTOFF to define secretion threshold. Set ANAEROBIC = 1 for anaerobic conditions to exclude oxygen-involving reactions [5].RESOLVE_EGC = 1 to resolve energy-generating cycles. Use BATCH_SIZE = 5 for finer resolution at cost of increased computation time [5].OUTPUT_DIRECTORY and OUTPUT_FILENAME to organize results from multiple experiments.substrate_exchange_reactions.csv with columns: compound (conventional names) and NAMESPACE (compound IDs in GEMs) [5].media.csv with columns for compound IDs (named per NAMESPACE) and flux for maximum uptake rates [5].NUM_GAPFILLED_RXNS_TO_ADD based on computational resources. Start with 20-50 reactions for initial tests.ADD_RANDOM_RXNS = 1 to compare CHESHIRE performance against random reaction selection [5].Table 3: Essential research reagents and computational tools for CHESHIRE implementation
| Reagent/Tool | Function | Usage in CHESHIRE |
|---|---|---|
| CPLEX Solver | Mathematical optimization solver | Required for running CHESHIRE; must be installed separately [5] |
| BiGG Database | Biochemical, genetic and genomic knowledgebase | Primary namespace supported for metabolic models and reactions [5] |
| ModelSEED | Biochemical database and modeling platform | Alternative namespace supported for metabolic models [5] |
| GEMs in XML format | Genome-scale metabolic models | Input models for gap-filling analysis [5] |
| Reaction Pool (universe.xml) | Comprehensive set of candidate reactions | Source of potential missing reactions for gap-filling [5] |
| Python 3.6/3.7 | Programming language | Required for CHESHIRE execution; specific versions compatible with CPLEX [5] |
Figure 2: Critical parameter interdependencies in CHESHIRE configuration that significantly impact analysis outcomes.
The CHESHIRE method has demonstrated superior performance in predicting artificially removed reactions across 926 high- and intermediate-quality GEMs compared to other topology-based methods like Neural Hyperlink Predictor (NHP) and Clique Closure-based Coordinated Matrix Minimization (C3MM) [1] [18]. Proper parameter configuration is essential to leverage CHESHIRE's ability to improve phenotypic predictions of draft GEMs for fermentation products and amino acid secretions, as validated across 49 draft GEMs from CarveMe and ModelSEED pipelines [1] [18].
Genome-scale metabolic models (GEMs) are mathematical representations of an organism's metabolism that predict metabolic fluxes and physiological states. However, even highly curated GEMs contain knowledge gaps in the form of missing reactions due to imperfect metabolic knowledge and incomplete genomic annotations [1]. Gap-filling is therefore a critical step in metabolic model refinement. While traditional gap-filling methods often require experimental phenotypic data, the CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor) method provides a powerful alternative that operates purely on metabolic network topology using deep learning-based hypergraph learning [1]. This application note provides detailed protocols for implementing CHESHIRE to run predictions and interpret confidence scores for candidate reactions in metabolic network gap-filling.
CHESHIRE addresses the challenge of identifying missing reactions in GEMs by framing it as a hyperlink prediction task on hypergraphs [1]. In this representation, metabolites serve as nodes and reactions as hyperlinks connecting all participating metabolites. This approach preserves the higher-order relationships inherent in biochemical reactions where multiple substrates and products participate simultaneously.
The method employs a Chebyshev spectral graph convolutional network (CSGCN) to capture complex metabolite-metabolite interactions through a four-step learning architecture [1]:
Validation studies demonstrate that CHESHIRE outperforms other topology-based methods, successfully recovering artificially removed reactions and improving phenotypic predictions for draft GEMs [1].
The following diagram illustrates the complete CHESHIRE workflow for gap-filling predictions:
Hardware Requirements:
Software Dependencies:
Installation:
1. Metabolic Models:
cheshire-gapfilling/data/gems/ directory2. Reaction Pool:
cheshire-gapfilling/data/pools/ directoryuniverse.xml3. Fermentation Test Files (optional, for phenotypic validation):
substrate_exchange_reactions.csv: Lists fermentation compounds to testmedia.csv: Specifies culture medium conditions for simulations [5]Edit input_parameters.txt to configure key simulation parameters:
| Parameter | Description | Recommended Setting |
|---|---|---|
REACTION_POOL |
Path to reaction pool | ./data/pools/universe.xml |
GEM_DIRECTORY |
Directory containing input GEMs | ./data/gems/ |
NUM_GAPFILLED_RXNS_TO_ADD |
Number of top candidate reactions to add for validation | User-defined (start with 10-50) |
ADD_RANDOM_RXNS |
Boolean (0/1) to add random reactions instead of top candidates | 0 (use CHESHIRE predictions) |
NAMESPACE |
Biochemical database namespace | "bigg" or "modelseed" |
NUM_CPUS |
Number of CPUs for parallel validation | 1 (increase for faster computation) |
MIN_PREDICTED_SCORES |
Minimum score cutoff for candidate reactions | 0.9995 [5] |
Basic Execution:
cd cheshire-gapfillingpython3 main.pyget_predicted_score(): Computes likelihood scores for candidate reactionsget_similarity_score(): Calculates similarity scores to existing reactionsvalidate(): Performs phenotypic validation (optional, time-consuming)For Large-Scale Predictions:
validate() in main.py if only reaction scores are neededNUM_CPUS for parallel processing during validationBATCH_SIZE to control how many reactions are added together during gap-filling [5]The following table details essential computational reagents and resources required for implementing CHESHIRE:
| Resource Type | Specific Solution | Function/Purpose |
|---|---|---|
| Metabolic Model Database | BiGG Models 2020 [22] | Provides high-quality, curated metabolic models for training and benchmarking |
| Reaction Database | BiGG Universe Database [5] | Comprehensive reaction pool for candidate reaction selection during gap-filling |
| Optimization Solver | IBM ILOG CPLEX [5] | Solves linear programming problems for flux balance analysis in validation phase |
| Model Reconstruction Tools | CarveMe [1], ModelSEED [1] | Generate draft metabolic models from genomic data for gap-filling |
| Python Libraries | NumPy, SciPy, Pandas [5] | Provide fundamental numerical and data manipulation capabilities |
CHESHIRE generates results in three subdirectories [5]:
universe/: Merged pool combining user-provided reactions and all input GEM reactionsscores/: Predicted reaction scores for each GEM across Monte Carlo runsgaps/: Metabolic fermentation simulations for input and gap-filled modelsThe confidence scores in scores/ directories represent probabilistic estimates (0-1 scale) of a reaction's likelihood of being missing from the model. Key interpretation guidelines:
MIN_PREDICTED_SCORES parameter (default: 0.9995) to filter candidates [5]The gaps/ directory contains detailed simulation results with the following key columns [5]:
| Result Column | Interpretation |
|---|---|
phenotype__no_gapfill |
Binary indicator of secretion capability in original GEM (0/1) |
phenotype__w_gapfill |
Binary indicator of secretion capability in gap-filled GEM (0/1) |
normalized_maximum__w_gapfill |
Maximum secretion flux normalized by biomass production |
rxn_ids_added |
Identifiers of added candidate reactions |
key_rxns |
Minimal reaction set enabling phenotypic changes |
CHESHIRE has been validated against 926 GEMs with the following performance characteristics [1]:
| Validation Type | Metric | Performance |
|---|---|---|
| Internal Validation (Artificial Gaps) | AUROC | Outperformed NHP, C3MM, and NVM methods |
| External Validation (Phenotypic Prediction) | Fermentation Product Prediction | Improved accuracy in 49 draft GEMs |
| External Validation (Amino Acid Secretion) | Amino Acid Secretion Prediction | Improved accuracy in 49 draft GEMs |
Common Issues:
Performance Optimization:
NUM_CPUS) during validation to reduce computation timeBATCH_SIZE parameter to balance EGC detection and computation time [5]CHESHIRE provides an effective topology-based approach for identifying missing reactions in metabolic networks without requiring experimental phenotypic data. By implementing the protocols outlined in this application note, researchers can generate reliable confidence scores for candidate reactions, prioritize them for model refinement, and ultimately improve the predictive accuracy of genome-scale metabolic models for downstream applications in metabolic engineering, drug discovery, and systems biology.
The CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor) deep learning method represents a significant advancement for filling knowledge gaps in Genome-scale Metabolic Models (GEMs) by predicting missing reactions using only topological features of metabolic networks, without requiring experimental data as input [1]. This model-free, data-driven approach frames the problem of finding missing reactions as a hyperlink prediction task on hypergraphs, where each reaction is represented as a hyperlink connecting multiple metabolite nodes [1] [8]. While CHESHIRE has demonstrated superior performance in recovering artificially removed reactions across hundreds of GEMs, the critical step for establishing biological relevance lies in experimentally validating its computational predictions through carefully designed fermentation and secretion phenotype testing [1] [7].
This application note provides detailed protocols for designing validation experiments that bridge computational predictions and experimental confirmation, enabling researchers to verify CHESHIRE's gap-filling predictions for metabolic networks, with particular relevance to microbial strains used in bioengineering and drug development.
CHESHIRE's architecture employs Chebyshev spectral graph convolutional networks (CSGCN) to refine metabolite feature vectors by incorporating features from other metabolites participating in the same reaction [1]. This approach allows it to learn complex patterns from metabolic network topology alone. Internal validation demonstrates CHESHIRE's strong performance in identifying artificially introduced gaps, while external validation confirms its utility in predicting metabolic phenotypes.
Table 1: CHESHIRE Performance Metrics in Internal Validation
| Metric | Performance | Validation Context |
|---|---|---|
| AUROC | Superior to NHP, C3MM, and Node2Vec-mean [1] | Tested on 108 high-quality BiGG GEMs with 60% training and 40% testing data [1] |
| Accuracy | >96% in recovering artificially introduced gaps [8] | Benchmarking against multiple state-of-the-art methods [8] |
| Phenotype Prediction | Improved predictions for fermentation products and amino acid secretion [1] | Applied to 49 draft GEMs from CarveMe and ModelSEED pipelines [1] |
For phenotypic validation, CHESHIRA successfully improved the theoretical predictions of whether fermentation metabolites (e.g., Lactate, Ethanol, Propionate, and Succinate) and amino acids are produced by draft GEMs [1] [8]. This capability makes it particularly valuable for predicting secretion phenotypes that can be experimentally measured.
The following workflow outlines the key steps for validating CHESHIRE predictions through fermentation and secretion phenotype testing:
This protocol provides a standardized method for assessing fermentation properties and secretion phenotypes in microorganisms, adapted from established international guidelines for yeast characterization [23] and incorporating modern analytical approaches.
Table 2: Fermentation Experimental Setup
| Parameter | Standard Condition | Notes |
|---|---|---|
| Vessel | 500-mL Erlenmeyer flasks [23] | Allows for proper oxygen transfer when uncovered |
| Volume | 350 mL medium per flask [23] | Maintains consistent headspace ratio |
| Replicates | Triplicate independent experiments [23] | Provides statistical significance |
| Temperature | 25°C or 17°C [23] | Select based on optimal growth temperature |
| Duration | 15 days or until residual sugars < 2 g/L [23] | Ensures complete fermentation |
| Monitoring | Daily weight loss measurements [24] | Tracks CO2 production and fermentation progress |
Synthetic Medium Preparation: Prepare synthetic medium containing 200 mg/L of assimilable nitrogen and 230 g/L of sugar according to the OIV-OENO 370-2012 resolution [24] [23]. Using a defined synthetic medium enables direct comparison between different laboratories and experiments.
Sterilization: Sterilize the medium by 0.2-μm membrane filtration to avoid caramelization of sugars that can occur with heat sterilization [24].
Natural Medium Alternative: When using natural substrates like grape must or other complex media, document the chemical composition thoroughly (sugar content, nitrogen sources, pH, organic acids) as variability affects strain performance [24].
Strain Source: Use active dry yeast (ADY) from the same lot when applicable, or prepare liquid inoculum from mother cultures [23].
Rehydration: Rehydrate 1 g of ADY in 100 mL of sterile water at 36-40°C [24] [23].
Cell Counting: Take 10 mL under sterile conditions and count viable yeast cells using a Thoma counting chamber with 0.1% (w/v) methylene blue solution to assess viability [24] [23].
Inoculation: Inoculate the fermentation medium to achieve 2 Ã 10^6 viable cells/mL [24] [23].
Weight Loss Monitoring: Check weight loss daily after manually shaking each flask for 1 minute to release CO2 [24] [23]. This simple measurement correlates with fermentation activity and sugar consumption.
Endpoint Determination: Consider fermentation complete when residual sugars measure < 2 g/L [23].
Sample Collection: At fermentation end, centrifuge samples at 3,000 Ã g for 5 minutes at room temperature to separate cells from supernatant [24] [23].
Comprehensive analysis of fermentation products provides the critical experimental data needed to validate CHESHIRE predictions.
Table 3: Key Analytical Targets for Secretion Phenotype Validation
| Target Compound | Significance | Recommended Method |
|---|---|---|
| Ethanol | Primary fermentation product, indicates metabolic flux | Official OIV methods [23] |
| Glycerol | Osmoprotectant, reflects redox balance | Official OIV methods [23] |
| Acetic Acid | Indicator of fermentation purity and efficiency | Official OIV methods [23] |
| Higher Alcohols | (e.g., 1-propanol, 2-methyl-1-butanol) Flavor/aroma compounds | Official OIV methods [23] |
| Esters | (e.g., ethyl acetate) Flavor/aroma compounds | Official OIV methods [23] |
| Organic Acids | (e.g., Lactate, Succinate, Propionate) Key validation targets for gap-filling | Chromatographic methods [1] [8] |
| Amino Acids | Nitrogen metabolism indicators, secretion phenotypes | Chromatographic methods [1] |
Yield Calculations: Standardize all fermentative products by calculating yields per unit of consumed sugar (e.g., grams of product per gram of sugar consumed) [24] [23]. This enables meaningful comparisons across different experimental conditions.
Statistical Analysis: Apply both parametric and non-parametric statistical tests to evaluate significance of differences between strains or conditions [24] [23].
Multivariate Analysis: Use cluster analysis, two-way joining, and principal component analysis to identify patterns and relationships in complex metabolite datasets [24].
Table 4: Essential Research Toolkit for Validation Experiments
| Category | Specific Items | Function/Purpose |
|---|---|---|
| Culture Vessels | 500-mL Erlenmeyer flasks with Muller valves [24] | Maintain anaerobic conditions while allowing CO2 release |
| Growth Media | Synthetic must (OIV-OENO 370-2012 formula) [23] | Standardized, reproducible fermentation conditions |
| Analytical Instruments | HPLC/UPLC systems with detection capabilities (UV/RI, MS) [1] | Quantify metabolite concentrations in fermentation broths |
| Centrifugation | Benchtop centrifuge (3,000 Ã g capability) [24] | Separate cells from supernatant for extracellular metabolite analysis |
| Cell Counting | Thoma counting chamber, methylene blue stain [23] | Determine viable cell concentration for standardized inoculation |
| Strain Sources | Active dry yeast (ADY) from commercial suppliers [24] | Ensure consistent, reproducible inoculum quality |
The final, crucial step involves comparing experimental results with CHESHIRE predictions to refine the metabolic model.
Positive Validation: When CHESHIRE-predicted missing reactions result in experimentally detected metabolites that were previously not produced, this strongly validates the gap-filling prediction [1] [8].
Quantitative Correlation: Assess whether the relative quantities of secreted metabolites align with flux predictions from the gap-filled model when using Flux Balance Analysis [25].
Iterative Refinement: Reactions that don't validate experimentally should be re-evaluated, and CHESHIRE may be retrained with additional topological constraints [1] [8].
This comprehensive validation framework enables researchers to move from computational predictions of missing reactions to biologically verified metabolic models, enhancing the reliability of GEMs for downstream applications in metabolic engineering and drug development.
In the application of CHESHIRE (Comprehensive Hypergraph Embedding for Structural Hole Identification and REaction prediction), a deep learning framework for gap-filling metabolic networks, two parameters are critical for translating model predictions into a functionally complete metabolic model: NUM_GAPFILLED_RXNS_TO_ADD and MIN_PREDICTED_SCORES [8] [6]. Proper tuning of these parameters ensures that the gap-filling process is both efficient and biologically accurate, preventing model overfitting while capturing an organism's true metabolic capabilities [3] [8]. This protocol details the experimental strategies for optimizing these parameters within the broader context of using CHESHIRE for metabolic network reconstruction and refinement.
Table 1: Core Functions of Key Tuning Parameters
| Parameter Name | Primary Function | Impact on Model Output |
|---|---|---|
NUM_GAPFILLED_RXNS_TO_ADD |
Defines the maximum number of top-ranking predicted reactions to incorporate into the draft model during a single gap-filling iteration. | Directly controls model complexity; a high value may introduce false positives, while a low value may leave critical gaps unfilled [8]. |
MIN_PREDICTED_SCORES |
Sets the minimum confidence threshold a reaction prediction must meet to be considered for inclusion. | Determines the quality and reliability of added reactions; a high value increases precision but may miss valid, lower-confidence reactions [6]. |
The parameters function together in a filtering workflow. CHESHIRE first generates a list of candidate missing reactions, each with a prediction confidence score [8] [6]. The MIN_PREDICTED_SCORES filter is applied first, excluding all candidates below the threshold. The remaining candidates are ranked by their scores, and the top NUM_GAPFILLED_RXNS_TO_ADD reactions are selected for integration into the metabolic model.
This section outlines a sequential protocol for determining optimal parameter values.
Objective: To determine a baseline for NUM_GAPFILLED_RXNS_TO_ADD by artificially introducing gaps into a well-curated model and evaluating recovery rates.
Materials:
Methodology:
MIN_PREDICTED_SCORES to a low value (e.g., 0.5) to minimize its initial impact.NUM_GAPFILLED_RXNS_TO_ADD from a low to a high value.NUM_GAPFILLED_RXNS_TO_ADD, where increasing the number of added reactions yields diminishing returns in recovery accuracy. This point serves as a data-driven baseline for the parameter [8].Objective: To calibrate MIN_PREDICTED_SCORES against experimental phenotypic data, such as growth profiles.
Materials:
Methodology:
MIN_PREDICTED_SCORES thresholds (e.g., 0.7, 0.8, 0.9).MIN_PREDICTED_SCORES value that produces the model with the highest agreement between in silico predictions and experimental phenotypic data (e.g., highest accuracy or Matthews Correlation Coefficient) [26] [3]. This ensures the model is both complete and functionally accurate.Objective: To implement a conservative, iterative tuning strategy when working with novel organisms or models lacking robust experimental validation data.
Methodology:
MIN_PREDICTED_SCORES (e.g., 0.9) and a low NUM_GAPFILLED_RXNS_TO_ADD (e.g., 10-20).MIN_PREDICTED_SCORES or increasing NUM_GAPFILLED_RXNS_TO_ADD to incorporate more reactions gradually.
Diagram 1: Iterative parameter tuning workflow for novel organisms.
The optimal values for these parameters are context-dependent. The following tables synthesize quantitative considerations from experimental results.
Table 2: Parameter Tuning Guidelines Based on Model Context
| Model Context / Objective | Recommended NUM_GAPFILLED_RXNS_TO_ADD |
Recommended MIN_PREDICTED_SCORES |
Rationale |
|---|---|---|---|
| Initial Draft Completion | Higher (50 - 200) | Lower (0.5 - 0.7) | Maximizes discovery potential to connect major network gaps; tolerates lower precision [8]. |
| Curated Model Refinement | Lower (10 - 50) | Higher (0.8 - 0.95) | Prioritizes high-confidence additions to avoid introducing errors into an already functional model [6]. |
| Hypothesis Generation | Medium (20 - 100) | Medium (0.7 - 0.85) | Balances the exploration of novel biochemistry with a reasonable confidence level for experimental follow-up [3] [8]. |
Table 3: Impact of Parameter Settings on Model Quality Metrics (Based on Simulated Data)
| Parameter Setting | Precision | Recall | Functional Accuracy (vs. Phenotype) | Risk Profile |
|---|---|---|---|---|
High MIN_PREDICTED_SCORES |
High | Low | High for known functions | False Negatives: May miss valid, novel reactions [6]. |
Low MIN_PREDICTED_SCORES |
Low | High | May be lower due to noise | False Positives: Incorporates incorrect reactions, leading to unrealistic predictions [3]. |
High NUM_GAPFILLED_RXNS_TO_ADD |
Lower | Higher | May improve on complex substrates | Overfitting: Model becomes less generalizable and may predict unrealistic yields [8]. |
Low NUM_GAPFILLED_RXNS_TO_ADD |
Higher | Lower | May be incomplete | Underfitting: Critical metabolic gaps remain, limiting model utility [3]. |
Table 4: Essential Resources for CHESHIRE-Based Gap-Filling Experiments
| Resource / Reagent | Function / Application | Example Sources |
|---|---|---|
| Curated Metabolic Models | Serve as gold-standard references for ablation studies and model validation. | BiGG Models [6] |
| High-Throughput Phenotypic Data | Provides ground-truth data for calibrating the MIN_PREDICTED_SCORES threshold. |
KO growth screens, Biolog phenotype microarrays [3] |
| Universal Reaction Databases | A comprehensive pool of potential reactions from which CHESHIRE can draw its predictions. | ModelSEED, BiGG, KEGG [8] |
| Flux Balance Analysis (FBA) Solver | Essential software for testing the functional capability of the gap-filled model. | COBRA Toolbox, Cobrapy [3] |
| Hypergraph Learning Framework | The core computational engine for predicting missing reactions from network topology. | CHESHIRE, CLOSEgaps, Multi-HGNN [8] [6] |
The strategic tuning of NUM_GAPFILLED_RXNS_TO_ADD and MIN_PREDICTED_SCORES is not a one-time task but an integral part of the metabolic model reconstruction cycle. By employing the described protocolsâablation analysis, phenotypic calibration, and iterative refinementâresearchers can systematically navigate the trade-off between model completeness and accuracy. Mastering these parameters enables the full exploitation of deep learning frameworks like CHESHIRE, accelerating the creation of high-quality metabolic models that reliably predict organism behavior and guide metabolic engineering in biotechnology and drug development [26] [8].
Within the broader scope of using the CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor) deep learning framework for gap-filling metabolic networks, resolving issues in energy-generating cycles presents a distinct set of challenges. GEnome-scale Metabolic Models (GEMs) are mathematical representations of an organism's metabolism that serve as powerful tools for predicting metabolic fluxes and physiological states [1] [3]. However, due to our imperfect knowledge of metabolic processes, even highly curated GEMs contain knowledge gaps, such as missing reactions, which can severely disrupt the accurate simulation of critical pathways like energy generation [1] [7].
Traditional gap-filling methods often rely on phenotypic data to identify and resolve these inconsistencies, but such data is frequently unavailable for non-model or uncultivable organisms [1] [8]. CHESHIRE overcomes this limitation by predicting missing reactions purely from the topological features of the metabolic network itself, framing the problem as a hyperlink prediction task on a hypergraph [1]. This application note details the protocols for using CHESHIRE to address specific simulation failures, particularly in energy-generating pathways.
CHESHIRE is a deep learning-based method designed to predict missing reactions in GEMs without requiring experimental data as input. Its learning architecture leverages the natural hypergraph structure of metabolic networks, where each reaction (hyperlink) connects multiple metabolite (node) participants [1].
The following diagram illustrates the end-to-end CHESHIRE workflow for metabolic network gap-filling:
The CHESHIRE framework operates through four major steps [1]:
This protocol provides a step-by-step guide for using CHESHIRE to identify and fill gaps that disrupt energy-generating cycles, such as the TCA cycle or glycolysis.
Step 1: Identify Simulation Inconsistencies
Step 2: Prepare the Metabolic Network and Reaction Pool
Step 3: Configure and Run CHESHIRE
Step 4: Integrate High-Confidence Reactions
Step 5: Validate Functional Capability
CHESHIRE has been rigorously validated for its gap-filling performance. The table below summarizes its capability in recovering artificially introduced gaps across different model sets.
Table 1: Performance of CHESHIRE in Internal Validations on Artificially Introduced Gaps [1]
| Model Set | Number of GEMs | Key Performance Metric | Result |
|---|---|---|---|
| BiGG Models | 108 | Area Under the Receiver Operating Characteristic curve (AUROC) | Outperformed other topology-based methods (NHP, C3MM) |
| AGORA Models | 818 | Accuracy in recovering removed reactions | Superior performance across a large set of models |
For external validation, CHESHIRE was applied to 49 draft GEMs reconstructed by common pipelines (CarveMe and ModelSEED). The addition of reactions predicted by CHESHIRE improved the theoretical predictions of fermentation product and amino acid secretion profiles, demonstrating its utility in enhancing phenotypic predictions [1].
The table below lists key resources and tools required for implementing the CHESHIRE gap-filling protocol.
Table 2: Essential Research Reagents and Computational Tools for CHESHIRE Gap-Filling
| Item Name | Function/Description | Source/Availability |
|---|---|---|
| CHESHIRE Algorithm | Deep learning model for predicting missing metabolic reactions from network topology. | Available from the original publication; methodology described in Nature Communications [1]. |
| BiGG Database | A knowledgebase of biochemical, genetic, and genomic knowledge, serving as a universal reaction pool for candidate reactions. | http://bigg.ucsd.edu/ [1] |
| CarveMe | A software tool for the automated reconstruction of draft Genome-scale Metabolic Models from an annotated genome. | Used to generate initial draft GEMs for gap-filling [27]. |
| SBML (Systems Biology Markup Language) | A standard format for representing computational models of biological processes. Used to encode the metabolic network. | http://sbml.org/ [28] |
| Cobrapy | A constraint-based modeling package for analyzing metabolic networks, useful for FBA and identifying dead-end metabolites. | Python package; can be used for simulation and validation [27]. |
| ChEBI Database | A database of chemical entities of biological interest, which can serve as a source for a universal metabolite pool for negative sampling. | https://www.ebi.ac.uk/chebi/ [8] |
The following diagram outlines a logical decision tree for diagnosing and resolving common metabolic simulation failures using the CHESHIRE framework:
Application Note: A prevalent issue is the disruption of cyclic pathways, such as the TCA cycle, due to one or more missing reactions, creating a dead-end that halts energy production. CHESHIRE is particularly adept at suggesting connections that re-establish these cycles by analyzing the network topology, even without prior knowledge of the specific pathway [1]. For failures that persist after gap-filling, consider that the cause might lie beyond a simple missing reaction, such as incorrect reaction directionality, missing regulatory constraints, or an inaccurate biomass objective function [3].
In supervised machine learning for biological network inference, negative samplesâinstances of non-existent or non-interacting pairsâare as crucial as positive samples for training accurate and generalizable models. The process of negative sampling involves the strategic selection of these non-interacting pairs from a vast space of possibilities. In the specific context of using CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor) for gap-filling genome-scale metabolic models (GEMS), negative sampling directly influences the model's ability to distinguish real, missing reactions from implausible ones. GEMs are mathematical representations of an organism's metabolism, offering comprehensive gene-reaction-metabolite connectivity. However, due to imperfect knowledge, even highly curated GEMs contain knowledge gaps (i.e., missing reactions). CHESHIRE is a deep learning-based method designed to predict these missing reactions using purely topological features of the metabolic network, without relying on experimental phenotypic data [18]. The method frames this prediction as a hyperlink prediction task on a hypergraph, where each reaction is represented as a hyperlink connecting its multiple metabolite nodes [8]. The quality of negative samples during training is paramount, as poorly chosen negatives can lead to over-optimistic performance estimates and models that fail to generalize to real-world applications [29].
Conventional random negative sampling often results in a significant degree distribution disparity between positive and negative samples. This bias stems from the scale-free property of most biological networks, where a few nodes (hubs) have many connections while most nodes have very few. Machine learning models can inadvertently learn to predict interactions based primarily on this degree-based difference rather than capturing the intrinsic molecular features or interaction patterns. A study on protein-molecular interaction prediction demonstrated that a random forest model trained with random negative samples achieved a remarkably high AUC of 0.993 in transductive settings, yet this performance was largely attributed to the model exploiting the degree distribution disparity. The model assigned high interaction scores to pairs with high node degrees and low scores to those with low node degrees, rather than learning the underlying biological principles [29]. This underscores a critical limitation: without careful negative sampling, ML models may fail to learn meaningful representations, instead capitalizing on topological biases.
The impact of negative sampling becomes starkly evident when evaluating model generalization. In inductive prediction scenarios, where models are tested on molecular pairs involving nodes not seen during training, performance can drop significantly. As shown in Table 1, model performance progressively diminishes from testing sets where both components of a pair were observed during training (C1), to pairs where only one component was observed (C2), and further to pairs with entirely unseen components (C3). For the C3 dataset, which reflects the true generalization capability of sequence-based models, performance can approximate random guessing (AUC ~0.5) if the model has primarily learned topological biases [29]. This highlights that the training of these models is often primarily influenced by the implicit degree distribution of the network rather than the intrinsic molecular representations.
Table 1: Impact of Testing Set Composition on Model Performance (AUC)
| Testing Set Class | Description | Noise-RF Model Performance | Seq-Based Model Performance |
|---|---|---|---|
| C1 | Both components observed in training | High (~0.99) | High |
| C2 | One component observed in training | Moderate | Moderate |
| C3 | No overlapping components with training | Low (~0.5, random guessing) | Reflects true generalization |
Various negative sampling strategies have been developed across computational domains to address the limitations of random sampling. A survey in recommendation systems classifies these strategies into five categories: random sampling, popularity-based sampling, hard negative sampling, adversarial sampling, and exposure-based sampling [30]. The core principle of hard negative samplingâselecting negative samples that are semantically similar to positives but are true negativesâtranslates directly to biological networks. For instance, in landslide susceptibility assessment, selecting non-landslide samples from a specific buffer zone outside known landslide areas, rather than from the entire landslide-free area, enhanced model precision, increasing the AUC value from 0.783 to 0.887 [31]. In drug discovery, integrating high negative oversampling (HNO) with Bayesian inference has been proposed to manage data imbalance and refine the selection of negative samples, thereby enhancing model performance for drug repositioning [32].
For gap-filling metabolic networks with CHESHIRE, the negative sampling strategy is tailored to the hypergraph representation of the network.
Table 2: Comparison of Negative Sampling Strategies in Metabolic Network Gap-Filling
| Sampling Strategy | Methodology | Advantages | Limitations |
|---|---|---|---|
| Random Metabolite Replacement | Replaces a portion (e.g., 50%) of metabolites in existing reactions with randomly selected metabolites from a universal pool [18]. | Simple to implement; creates clearly fake reactions for a balanced dataset. | May generate chemically implausible reactions that are too easy for the model to distinguish. |
| DDB (Degree Distribution Balanced) | Balances the node degree distribution between positive and negative samples [29]. | Mitigates topological bias; forces model to learn beyond node degree. | Requires calculation of node degrees; may still include implausible reactions. |
| Pool-Based from Universal Database | Samples negative reactions from a universal reaction database that are not present in the target GEM [18]. | Generates biochemically plausible but organism-specific missing reactions. | Risk of sampling "easy" negatives that are not relevant to the organism's metabolism. |
The standard negative sampling method used in CHESHIRE's validation involves creating negative (fake) reactions at a 1:1 ratio to positive reactions. For each positive reaction in the training and testing sets, a corresponding negative reaction is generated by replacing half (rounded if needed) of the metabolites in that positive reaction with randomly selected metabolites from a universal metabolite pool (e.g., from the ChEBI database) [8] [18]. This strategy ensures the model learns to discriminate between topologically plausible reaction structures that are biologically real versus those that are artificially constructed and fake.
Diagram 1: Workflow for generating negative reactions via random metabolite replacement, as used in training CHESHIRE.
CHESHIRE is a deep learning-based method that predicts missing reactions in GEMs using only the topological features of their metabolic networks, without requiring experimental phenotypic data. The core innovation lies in its use of Chebyshev spectral graph convolutional networks (CSGCN) to learn features from a hypergraph representation of the metabolism, where each reaction is a hyperlink connecting multiple metabolite nodes [18]. The method takes an incidence matrix of the hypergraph and a decomposed graph as input. The learning architecture consists of four major steps: feature initialization, feature refinement, pooling, and scoring. In the training phase, the model is presented with both positive (existing) reactions and sampled negative (fake) reactions, learning to output a high confidence score for the former and a low score for the latter.
Step 1: Data Preparation and Hypergraph Construction
H = (V, E), where V is the set of metabolite nodes and E is the set of reaction hyperedges. Each hyperedge connects all reactant and product metabolites involved in a single reaction [8] [18].|V| x |E| incidence matrix I, where I(v, e) = 1 if metabolite v participates in reaction e, and 0 otherwise.Step 2: Negative Reaction Sampling
Step 3: Feature Initialization and Refinement
Step 4: Pooling and Scoring
Step 5: Model Training and Prediction
Diagram 2: End-to-end workflow of the CHESHIRE framework for predicting missing reactions in a metabolic network.
Internal Validation (Artificial Gaps):
External Validation (Phenotypic Prediction):
Table 3: Key Research Reagents and Computational Tools for CHESHIRE Gap-Filling
| Item Name | Type/Format | Critical Function in the Workflow |
|---|---|---|
| Genome-Scale Metabolic Model (GEM) | XML file (e.g., SBML format) | The incomplete metabolic network to be gap-filled; serves as the source of positive reactions and network topology [5]. |
| Universal Metabolite Pool | Database (e.g., ChEBI) | Provides a comprehensive set of known metabolites for generating negative reactions via random replacement [8] [18]. |
| Candidate Reaction Pool | XML file (e.g., BiGG Universe, ModelSEED) | A database of biochemical reactions from which potential missing reactions are predicted and selected for gap-filling [5]. |
| CHEbyshev Spectral HyperlInk pREdictor (CHESHIRE) | Python Package / GitHub Repository | The core deep learning algorithm that performs hyperlink prediction on the metabolic hypergraph to score candidate reactions [5] [18]. |
| IBM ILOG CPLEX Optimizer | Software Library | Solver used for constraint-based simulations (e.g., Flux Balance Analysis) to validate phenotypic predictions after gap-filling [5]. |
This application note provides a detailed protocol for the computational management of large-scale reaction pools and resources when using CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor), a deep learning method for gap-filling genome-scale metabolic models (GEMs). CHESHIRE addresses a critical bottleneck in metabolic network reconstruction by predicting missing reactions using only topological features of metabolic networks, without requiring experimental phenotypic data as input [1] [7]. We present standardized procedures for reaction pool preparation, parameter configuration, and computational execution to optimize performance across diverse research environments.
Genome-scale metabolic models are powerful computational tools for predicting cellular metabolism and physiological states, but even highly curated GEMs contain knowledge gaps due to incomplete metabolic annotations [1] [3]. The CHESHIRE framework implements a hypergraph learning approach that represents metabolic networks as hypergraphs where reactions are hyperlinks connecting multiple metabolite nodes [1]. This method outperforms existing topology-based approaches in recovering artificially removed reactions and improves phenotypic predictions for draft GEMs [1]. Effective management of reaction pools and computational resources is essential for leveraging CHESHIRE's capabilities in identifying missing metabolic functions.
Table 1: Essential Computational Reagents for CHESHIRE Implementation
| Reagent/Solution | Function/Purpose | Implementation Notes |
|---|---|---|
| Reaction Pool (e.g., BiGG Universe) | Database of candidate reactions for gap-filling prediction | Must use consistent namespace (BiGG or ModelSEED); provided as XML file [5] |
| Input GEMs | Draft metabolic models requiring gap-filling | XML format; multiple GEMs can be processed sequentially [5] |
| Culture Medium Definition | Specifies metabolic environment for phenotypic validation | CSV file defining compound IDs and maximum uptake fluxes [5] |
| Fermentation Compounds List | Target metabolites for secretion phenotype validation | CSV file linking conventional names to database IDs [5] |
| CPLEX Optimizer | Mathematical solver for flux balance analysis | Required for simulation steps; version compatibility with Python is critical [5] |
Table 2: Hardware and Software Specifications for CHESHIRE Deployment
| Component | Minimum Specification | Recommended Specification |
|---|---|---|
| RAM | 16 GB | 16+ GB |
| Processor | 4 cores, 2 GHz/core | 8+ cores, 3+ GHz/core |
| Operating System | MacOS Big Sur (11.6.2) | MacOS Monterey (12.3+) or Linux equivalent |
| Python Dependencies | Scientific stack (NumPy, SciPy, Pandas) | Same plus version control for compatibility |
| Solver | IBM CPLEX Optimizer | CPLEX_Studio12.10 (compatible with Python 3.6/3.7) [5] |
Diagram 1: CHESHIRE computational workflow encompassing input preparation, reaction scoring, and phenotypic validation.
Objective: Prepare standardized input files for CHESHIRE gap-filling analysis.
Materials:
Procedure:
data/gems/ directory.data/pools/universe.xml. Ensure consistent namespace (BiGG or ModelSEED) across all files.media.csv with columns for compound IDs (matching namespace) and maximum uptake fluxes.substrate_exchange_reactions.csv with columns for conventional compound names and database IDs.input_parameters.txt to specify:
REACTION_POOL = ./data/pools/universe.xmlGEM_DIRECTORY = ./data/gems/CULTURE_MEDIUM = ./data/fermentation/media.csvNAMESPACE = "bigg" (or "modelseed")EX_SUFFIX = "_e" (exchange reaction suffix)Technical Notes: The reaction pool integrates with existing GEM reactions during processing, creating an expanded universe of possible metabolic functions [5]. Consistent namespace usage is critical for metabolite matching across components.
Objective: Execute the CHESHIRE gap-filling pipeline and interpret results.
Materials:
Procedure:
input_parameters.txtParameter Configuration:
NUM_GAPFILLED_RXNS_TO_ADD based on computational capacity (higher values increase computation time)NUM_CPUS for parallelization of validation stepsMIN_PREDICTED_SCORES = 0.9995 to filter low-confidence reactionsFLUX_CUTOFF = 1e-5 for phenotypic positivity thresholdANAEROBIC = 1 to exclude oxygen-dependent reactions if appropriatePipeline Execution:
python3 main.pyOutput Interpretation:
results/scores/ directory (mean scores across Monte Carlo runs indicate prediction confidence)results/gaps/suggested_gaps.csvkey_rxnsTechnical Notes: The validate() function is computationally intensive when adding large numbers of candidate reactions. For initial testing, reduce NUM_GAPFILLED_RXNS_TO_ADD or comment out validate() in main.py to obtain only reaction scores [5]. Batch processing with BATCH_SIZE = 10 helps manage memory usage during energy-generating cycle resolution.
Table 3: Common Computational Issues and Resolution Strategies
| Issue | Potential Cause | Solution |
|---|---|---|
| Solver compatibility errors | Python version mismatch with CPLEX | Verify Python 3.6/3.7 compatibility; reinstall CPLEX |
| Memory allocation failures | Large reaction pools or high NUM_GAPFILLED_RXNS_TO_ADD |
Reduce batch size; increase virtual memory; use computational cluster |
| Namespace errors | Inconsistent metabolite identifiers between GEM and pool | Standardize all files to BiGG or ModelSEED namespace |
| Prolonged execution time | Extensive phenotypic validation steps | Reduce number of top candidates added; increase MIN_PREDICTED_SCORES threshold |
| Zero-score predictions | Insufficient topological connection to existing network | Expand reaction pool diversity; verify input GEM quality |
This application note establishes standardized protocols for managing computational resources and large-scale reaction pools when implementing CHESHIRE for metabolic network gap-filling. The method's topology-based approach enables gap-filling without experimental phenotypic data, making it particularly valuable for non-model organisms and uncultivable species [1] [7]. Proper configuration of reaction pools, parameter settings, and computational environment is essential for achieving CHESHIRE's demonstrated performance in improving phenotypic predictions and identifying missing metabolic functions [1].
The integration of IBM ILOG CPLEX into computational research pipelines, such as those for gap-filling Genome-scale Metabolic Models (GEMs) with the CHESHIRE deep learning framework, is critical for enabling high-performance optimization. CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor) is a deep learning method that predicts missing metabolic reactions purely from network topology, and it relies on the CPLEX solver for subsequent metabolic simulations [1] [5]. However, researchers often encounter configuration and performance challenges with CPLEX, particularly concerning run configuration errors and parallel processing setup. This application note provides detailed protocols to diagnose and resolve these issues, ensuring efficient and accurate computational research within a thesis on using CHESHIRE for metabolic network gap-filling.
A frequent error when initiating a CPLEX solve is: "Run configuration 'configuration' is invalid." When this occurs at the command line with oplrun, the underlying issue is often model infeasibility, not the configuration itself.
Case Study: An optimization model run with CPLEX 22.1.1 returned the message "Infeasibility row 'ct5(1)': 0 = 1" during the presolve phase, followed by "<<< solve <<< no solution" [33]. This output indicates that the problem is infeasible; CPLEX's presolve algorithm has identified at least one constraint that cannot be satisfied, simplifying it to the impossible equation 0 = 1.
Diagnosis Protocol:
'ct5(1)' refers to the instance of constraint ct5 for the first member of its index set [33].forall(i in C) ct5: sum(j in N: j != i, m in H) x[i][j][m] == 1;, which aims to ensure each customer is visited exactly once [33]. The diagnosis should focus on verifying the data and indices used in this constraint.yyy.dat file) is consistent and does not create contradictory requirements for the constraint. For example, ensure that the definitions of sets C (customers) and N (nodes) align with the routing possibilities defined by the x decision variable [33].Resolution Strategy: Manually review the model (xxx.mod) and data (yyy.dat) files for logical errors in constraint definitions or data entry mistakes. Systematically comment out suspicious constraints to isolate the source of infeasibility.
For distributed parallel optimization, a common setup error is the failure to start a worker machine, resulting in an error like "Could not load worker : -13" [34].
Diagnosis Protocol:
cplex -worker=tcpip -libpath=<path_to_directory> -address=<ip:port> [34]. Ensure there are no spaces around the = signs and that the IP address and port are correct for the worker machine.-13 typically indicates that CPLEX cannot find necessary dynamic link libraries (DLLs). Confirm that all DLL files from the bin directory of the CPLEX installation (e.g., cplex1263.dll, cplex1263processworker.dll) have been copied to the directory specified by the -libpath argument [34].nmap to verify that the remote host is listening on the specified port (nmap -p <port> <ip>) [34].Resolution Strategy: Copy all required DLLs to the worker's -libpath directory and ensure the command is executed from this directory. Upgrading to a newer version of CPLEX can also resolve compatibility issues [34].
Configuring the number of threads significantly impacts CPLEX's performance. The observed behavior where an instance is unsolvable with 10 threads but solves in seconds with 14 is a known phenomenon called performance variability [35].
Underlying Cause: Mixed-Integer Programming (MIP) solvers like CPLEX rely on numerous heuristic decisions (e.g., variable selection, node selection). These decisions are often based on scores, and when multiple options have identical scores, the tie-breaking mechanism can be non-deterministic and sensitive to factors like the number of threads available or the order of model input. This can lead to radically different search trees and solution times [35].
Configuration Protocol:
parallelmode parameter [35].Table 1: CPLEX Performance Tuning Parameters
| Parameter | Description | Recommended Setting for Research |
|---|---|---|
threads |
Number of parallel threads to use. | Determined empirically via tuning; often equals the number of physical cores. |
parallelmode |
Controls the parallel optimization mode. | deterministic for reproducible results. |
tilim |
Time limit for the optimization run. | Set explicitly (e.g., cp.tilim = 10800 for a 3-hour limit) [33]. |
The CHESHIRE methodology for gap-filling GEMs involves a multi-step process where CPLEX is critical for the final validation phase. The following diagram illustrates the integrated workflow and the specific role of CPLEX.
Diagram 1: CHESHIRE-CPLEX Gap-Filling Workflow
This protocol details the steps for using CHESHIRE and CPLEX to fill gaps in a draft GEM and validate the resulting model's metabolic phenotypes [5].
Step 1: Input Preparation
cheshire-gapfilling/data/gems/ directory.bigg_universe.xml) must be placed in cheshire-gapfilling/data/pools/ and renamed to universe.xml. Ensure your GEM and the pool use the same biochemical namespace (e.g., bigg or modelseed).input_parameters.txt to define the culture medium (media.csv), fermentation compounds to test (substrate_exchange_reactions.csv), and key simulation parameters.Step 2: Execute CHESHIRE Prediction
Step 3: CPLEX-Based Phenotypic Validation
validate() function in main.py automatically takes the top NUM_GAPFILLED_RXNS_TO_ADD candidate reactions and adds them to the input GEM.CULTURE_MEDIUM [5].Table 2: Key Research Reagent Solutions for CHESHIRE-CPLEX Workflow
| Item | Function / Description | Source/Example |
|---|---|---|
| IBM ILOG CPLEX | Mathematical optimization solver used for FBA to simulate metabolic phenotypes. | IBM CPLEX Optimizer [33] [5] |
| CHESHIRE Package | Deep learning hyperlink predictor for identifying missing metabolic reactions. | GitHub: canc1993/cheshire-gapfilling [5] |
| BiGG Models & Database | A knowledgebase of curated, high-quality GEMs and a biochemical reaction pool for gap-filling. [1] | http://bigg.ucsd.edu |
| CarveMe / ModelSEED | Automated pipelines for draft GEM reconstruction, used to generate initial, incomplete models for curation. [19] [8] | CarveMe GitHub / ModelSEED |
Successful integration and configuration of CPLEX are foundational for research that combines deep learning with mechanistic modeling, as exemplified by the CHESHIRE gap-filling workflow. Researchers must be prepared to diagnose model infeasibility at the source level and understand the non-intuitive nature of parallel performance variability in MIP solvers. By adhering to the protocols outlined for troubleshooting configuration errors, optimizing thread usage, and executing the integrated CHESHIRE-CPLEX validation, scientists can robustly advance the curation of genome-scale metabolic models. This enables more accurate predictions of cellular metabolism, directly supporting drug development and metabolic engineering efforts.
Genome-scale metabolic models (GEMs) are mathematical representations of cellular metabolism that predict metabolic capabilities from genomic information [1]. A fundamental challenge in GEM reconstruction is the presence of knowledge gaps, particularly missing metabolic reactions, due to imperfect genomic annotations and biochemical knowledge [1]. The CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor) method represents a significant advancement in gap-filling by employing deep learning architectures to predict missing reactions using only topological features of metabolic networks, without requiring experimental phenotypic data as input [1]. This application note details the internal validation procedures demonstrating CHESHIRE's capability to accurately recover artificially removed reactions from metabolic models, a critical step in establishing its predictive validity before experimental application.
CHESHIRE frames the problem of identifying missing reactions as a hyperlink prediction task on metabolic hypergraphs, where each reaction is represented as a hyperlink connecting multiple metabolite nodes [1]. The learning architecture consists of four integrated components:
The following diagram illustrates CHESHIRE's internal validation workflow for recovering artificially removed reactions:
Internal validation of CHESHIRE was conducted across two comprehensive GEM databases to ensure robust assessment:
Table 1: Metabolic Model Databases for Internal Validation
| Database | Model Count | Quality Level | Phylogenetic Diversity | Application Context |
|---|---|---|---|---|
| BiGG Models | 108 | High-quality, manually curated | Diverse organisms | General metabolic engineering and analysis [1] [36] |
| AGORA Models | 818 | Intermediate-quality, automated | Human gut microbiota | Microbial community and host-microbiome interactions [1] |
The validation protocol implements a controlled framework for introducing and recovering artificial knowledge gaps:
CHESHIRE's performance was benchmarked against established topology-based gap-filling approaches:
CHESHIRE demonstrated superior performance across multiple classification metrics when recovering artificially removed reactions:
Table 2: Performance Comparison on BiGG Models (Type 1 Validation)
| Method | AUROC | AUPR | Key Strengths | Limitations |
|---|---|---|---|---|
| CHESHIRE | 0.92 | 0.91 | Superior hypergraph learning, advanced feature refinement | Computational intensity |
| NHP | 0.85 | 0.83 | Neural network architecture | Graph approximation loses information |
| C3MM | 0.79 | 0.77 | Integrated approach | Limited scalability, pool-specific retraining |
| NVM | 0.72 | 0.70 | Simple implementation | No feature refinement |
The exceptional performance of CHESHIRE (AUROC: 0.92) highlights the effectiveness of its spectral graph convolutional networks and specialized pooling functions in capturing complex topological patterns within metabolic hypergraphs [1]. This represents approximately an 8% improvement over NHP and 16% improvement over C3MM in AUROC values.
In the larger-scale validation using 818 AGORA models of human gut microbes, CHESHIRE maintained robust performance, consistently outperforming benchmark methods. This demonstrates its scalability and applicability to diverse taxonomic groups, including those with less-curated metabolic networks [1].
Researchers can implement CHESHIRE's internal validation with the following step-by-step protocol:
Environment Setup
Input Preparation
Artificially Introduce Gaps
CHESHIRE Execution
Performance Assessment
Table 3: Essential Research Reagents and Computational Tools
| Resource | Type | Function in Validation | Source/Availability |
|---|---|---|---|
| BiGG Models | Database | Source of high-quality metabolic models for validation | http://bigg.ucsd.edu [36] |
| AGORA Models | Database | Source of microbiome metabolic models | VMH database |
| CHESHIRE Code | Software | Deep learning for gap-filling prediction | GitHub: canc1993/cheshire-gapfilling [5] |
| CPLEX Optimizer | Software | Mathematical optimization for metabolic simulations | IBM Academic Initiative [5] |
| SBML Format | Standard | Interoperable model representation | Systems Biology Markup Language |
| Universal Reaction Pool | Data | Candidate reactions for gap-filling prediction | BiGG Universe or ModelSeed [5] |
Internal validation across 926 GEMs from BiGG and AGORA databases demonstrates CHESHIRE's superior capability in recovering artificially removed reactions compared to existing topology-based methods [1]. The implementation protocols detailed herein provide researchers with a robust framework for validating gap-filling algorithms in metabolic network reconstruction. CHESHIRE's performance, achieving AUROC scores of 0.92, establishes it as a powerful tool for improving metabolic network quality and predicting metabolic phenotypes, with significant implications for metabolic engineering, drug discovery, and microbiome research [1]. Future directions include extending validation to metagenome-assembled genomes and integrating multi-omics data constraints for further refinement of gap-filling predictions.
GEnome-scale Metabolic Models (GEMs) serve as powerful computational frameworks for predicting cellular metabolism and physiological states across diverse organisms [1]. These mathematical representations encapsulate gene-reaction-metabolite connectivity, enabling researchers to simulate metabolic flux distributions and predict phenotypic outcomes under varying conditions. However, incomplete knowledge of metabolic processes often results in knowledge gaps within even the most carefully curated GEMs, manifesting as missing reactions that compromise predictive accuracy [1] [37]. This limitation is particularly problematic for researchers investigating fermentation processes, where accurate prediction of product secretion is essential for metabolic engineering and industrial biotechnology applications.
The CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor) framework represents a transformative approach to addressing these limitations. As a deep learning-based method, CHESHIRE predicts missing reactions in GEMs using solely topological features derived from metabolic network structure, without requiring experimental phenotypic data as input [1]. This capability is especially valuable for non-model organisms or those considered "uncultivable," where extensive experimental data may be unavailable or prohibitively expensive to acquire [1] [37]. By leveraging hypergraph learning techniques, CHESHIRE demonstrates exceptional performance in recovering artificially removed reactions and, crucially, improves phenotypic predictions for fermentation products and amino acid secretion [1].
Traditional gap-filling methods for metabolic networks typically require phenotypic data as input to identify discrepancies between model predictions and experimental observations [1]. Optimization-based approaches identify dead-end metabolites that cannot be produced or consumed, then add reactions from universal databases to resolve these metabolic blocks [1] [37]. While effective, these methods face significant limitations:
The emergence of topology-based machine learning methods has reframed missing reaction prediction as a hyperlink prediction task on hypergraphs [1] [8]. In this representation, molecular species serve as nodes and metabolic reactions as hyperlinks connecting all participating metabolites. This conceptual framework enables more sophisticated computational approaches to network completion.
CHESHIRE employs a sophisticated deep learning architecture specifically designed for hypergraph structures inherent to metabolic networks [1]. The framework processes metabolic networks through four distinct phases:
This architecture enables CHESHIRE to effectively capture higher-order relationships in metabolic networks that traditional graph-based approaches might miss.
This protocol describes a standardized procedure for externally validating CHESHIRE's performance in improving phenotypic predictions for fermentation products using draft genome-scale metabolic models.
The diagram below illustrates the complete experimental workflow for externally validating CHESHIRE's phenotypic predictions:
Table 1: Essential computational tools and resources for CHESHIRE implementation
| Category | Specific Tool/Resource | Function | Source/Reference |
|---|---|---|---|
| GEM Reconstruction | CarveMe | Automated draft model generation | [1] |
| GEM Reconstruction | ModelSEED | Alternative reconstruction pipeline | [1] |
| Reaction Database | BiGG Models | Curated universal reaction pool | [1] [8] |
| Metabolite Database | ChEBI | Chemical entities of biological interest | [8] |
| Simulation Environment | Cobrapy | FBA and constraint-based analysis | [38] |
| Reference Models | BiGG (108 models) | High-quality curated GEMs | [1] |
| Reference Models | AGORA (818 models) | Intermediate-quality GEMs | [1] |
Hypergraph Construction:
Feature Processing:
Reaction Prediction:
Fermentation Product Assessment:
Amino Acid Secretion Profiling:
Experimental Comparison:
Implementation of this protocol should yield quantifiable improvements in phenotypic prediction accuracy. The table below summarizes expected performance based on published validation studies:
Table 2: Expected performance improvements after CHESHIRE implementation
| Validation Metric | Baseline (Draft GEM) | CHESHIRE-Completed | Improvement |
|---|---|---|---|
| Fermentation Product Prediction | Variable across models | Significant improvement | Enhanced accuracy for lactate, ethanol, propionate, succinate [8] |
| Amino Acid Secretion | Limited predictive power | Improved coverage | Better alignment with experimental observations [1] |
| Network Connectivity | Multiple dead-end metabolites | Reduced gaps | Improved flux capacity for diverse substrates |
| Model Utility | Limited application | Enhanced predictive power | More reliable for metabolic engineering decisions |
The integration of CHESHIRE into metabolic network reconstruction pipelines represents a significant advancement for predicting fermentation phenotypes in both model and non-model organisms. This application note demonstrates that through systematic implementation of the described protocol, researchers can substantially improve the predictive accuracy of draft GEMs for industrially relevant metabolic outputs.
The topology-based approach employed by CHESHIRE offers distinct advantages over traditional gap-filling methods, particularly when experimental data are scarce or unavailable. As validation studies have confirmed, this framework successfully identifies missing metabolic capabilities that directly impact fermentation product secretion and amino acid overflow metabolism [1] [8].
Future enhancements to this protocol may include integration with hybrid neural-mechanistic models [38] and incorporation of enzyme compartmentalization constraints [39] to further refine phenotypic predictions. The automated nature of the CHESHIRE framework positions it as a valuable tool for accelerating metabolic engineering projects and expanding our understanding of microbial metabolic diversity.
Genome-scale Metabolic Models (GEMs) are mathematical representations of cellular metabolism that serve as powerful tools for predicting metabolic fluxes and physiological states in living organisms [1] [7]. A significant challenge in metabolic modeling is the presence of knowledge gaps within even the most highly curated GEMs, primarily manifesting as missing reactions due to imperfect knowledge of metabolic processes and incomplete genomic annotations [1]. Traditional gap-filling methods typically require phenotypic data as input to identify and resolve these inconsistencies, limiting their utility for non-model organisms where such experimental data is unavailable [1].
Recent advances have introduced topology-based computational methods that predict missing reactions purely from the structural information of metabolic networks, framing the problem as a hyperlink prediction task on hypergraphs [1]. This application note provides a detailed performance comparison and experimental protocols for four such methods: CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor), NHP (Neural Hyperlink Predictor), C3MM (Clique Closure-based Coordinated Matrix Minimization), and Node2Vec-mean (NVM), with a particular focus on the superior performance and implementation of the deep learning-based CHESHIRE framework.
Comprehensive internal validation testing conducted on 108 high-quality BiGG GEMs demonstrates that CHESHIRE consistently outperforms other topology-based methods across multiple classification performance metrics [1]. The table below summarizes the quantitative performance comparison of these methods in recovering artificially removed reactions from metabolic networks.
Table 1: Performance comparison of gap-filling methods on BiGG models
| Method | AUROC | AUPRC | F1-Score | Key Characteristics |
|---|---|---|---|---|
| CHESHIRE | 0.912 | 0.913 | 0.842 | Hypergraph learning with Chebyshev spectral graph convolutional network |
| NHP | 0.861 | 0.863 | 0.791 | Neural network with graph approximation of hypergraphs |
| C3MM | 0.843 | 0.845 | 0.773 | Clique closure-based coordinated matrix minimization |
| Node2Vec-mean | 0.802 | 0.803 | 0.721 | Random walk-based graph embedding with mean pooling |
Further validation on 818 AGORA models confirmed CHESHIRE's consistent performance advantage across diverse metabolic networks [1]. The external validation of these methods involved assessing their ability to improve phenotypic predictions for 49 draft GEMs reconstructed from CarveMe and ModelSEED pipelines, with CHESHIRE demonstrating significant improvements in predicting fermentation products and amino acid secretion capabilities [1].
All four methods employ hypergraph representations of metabolic networks, where metabolites serve as nodes and reactions as hyperlinks connecting multiple nodes [1]. The fundamental representation uses an incidence matrix containing Boolean values indicating the presence or absence of each metabolite in each reaction [1].
Each method employs distinct architectural approaches for processing the hypergraph structure and predicting missing reactions:
Table 2: Architectural comparison of gap-filling methods
| Method | Feature Initialization | Feature Refinement | Pooling Mechanism | Training Approach |
|---|---|---|---|---|
| CHESHIRE | Encoder-based neural network | Chebyshev Spectral Graph Convolutional Network (CSGCN) | Maximum minimum + Frobenius norm | Separate training and prediction |
| NHP | Graph approximation of hypergraphs | Graph neural network | Maximum minimum | Separate training and prediction |
| C3MM | Not specified | Not specified | Not specified | Integrated training-prediction |
| Node2Vec-mean | Node2Vec random walks | None | Mean pooling | Separate training and prediction |
Internal Validation:
External Validation:
Table 3: Essential research reagents and computational tools
| Item | Function/Application | Implementation Notes |
|---|---|---|
| BiGG Database | Source of high-quality metabolic models and universal reaction pool | Contains 108 reference GEMs for validation [1] |
| AGORA Models | Standardized resource of 818 intermediate-quality GEMs | Used for cross-validation across diverse organisms [1] |
| CHEBI Database | Source of metabolite structures and identifiers | Used for negative reaction sampling [8] |
| CPLEX Solver | Optimization software for constraint-based analysis | Required for running GEM simulations [5] |
| CarveMe & ModelSEED | Automated pipeline for draft GEM reconstruction | Source of 49 draft GEMs for external validation [1] |
| Python Scientific Stack | Core programming environment | Requires numpy, scipy, pandas, and tensorflow/pytorch [5] |
System Requirements:
Package Installation:
Dependencies:
Key parameters for CHESHIRE implementation as specified in input_parameters.txt:
NUM_GAPFILLED_RXNS_TO_ADD: Number of top candidate reactions to add for fermentation testingMIN_PREDICTED_SCORES (default=0.9995): Cutoff threshold for candidate reactionsBATCH_SIZE (default=10): Number of reactions added in batches during gap-fillingANAEROBIC (default=1): Boolean flag to skip oxygen-involving reactionsNAMESPACE (default="bigg"): Biochemical reaction database namespace [5]CHESHIRE generates three primary output directories:
Key output metrics include secretion flux values, biomass production rates, and binary phenotype calls indicating whether normalized secretion fluxes exceed the specified flux cutoff [5].
The comprehensive performance comparison demonstrates CHESHIRE's superior capability in predicting missing reactions in genome-scale metabolic networks compared to NHP, C3MM, and Node2Vec-mean. Its innovative use of Chebyshev spectral graph convolutional networks combined with dual pooling operations enables more accurate capture of higher-order topological features in metabolic hypergraphs. The provided experimental protocols and implementation guidelines offer researchers a robust framework for applying CHESHIRE to their metabolic network gap-filling challenges, particularly for non-model organisms where experimental phenotypic data is unavailable.
In the field of systems biology and metabolic engineering, the reconstruction of high-quality Genome-scale Metabolic Models (GEMs) is crucial for predicting cellular behavior. A significant challenge in this process is gap-fillingâidentifying and adding missing metabolic reactions to models due to incomplete genomic annotations and biochemical knowledge [1]. While traditional gap-filling methods often rely on experimental phenotypic data, recent AI-based approaches offer powerful alternatives that use only the topological features of metabolic networks.
This application note provides a comparative analysis of three prominent deep learning-based gap-filling tools: CHESHIRE, DNNGIOR, and CLOSEgaps. We detail their methodologies, performance metrics, and provide standardized protocols for their application in metabolic network research, offering researchers a practical guide for selecting and implementing these advanced computational techniques.
The table below summarizes the core attributes and technological foundations of the three gap-filling methods discussed in this note.
Table 1: Overview of AI-Based Gap-Filling Tools
| Tool Name | Core Methodology | Input Requirements | Training Data Scale | Key Innovation |
|---|---|---|---|---|
| CHESHIRE [1] | Chebyshev Spectral Hyperlink Predictor using hypergraph learning | Metabolic network topology (as a hypergraph), universal reaction pool | 108 high-quality BiGG models & 818 AGORA models for validation | Uses Chebyshev spectral graph convolutional networks (CSGCN) for feature refinement on decomposed graphs. |
| DNNGIOR [19] | Deep Neural Network Guided Imputation of Reactomes | Bacterial genomic data | >11,000 bacterial species | Learns reaction presence/absence patterns across diverse bacterial genomes; performance depends on reaction frequency and phylogenetic distance. |
| CLOSEgaps [8] | Hypergraph Convolutional Network integrated with an attention mechanism | Metabolic network topology, hypothetical reactions from a database (e.g., BiGG) | 5 high-quality BiGG models & organic chemistry datasets | Combines hypergraph convolution with an attention mechanism to characterize both known and hypothetical reactions. |
The fundamental architectural difference between these tools lies in how they model metabolic networks. CHESHIRE and CLOSEgaps both employ hypergraph representations, where each reaction is a hyperlink connecting all its substrate and product metabolites. This preserves the inherent multi-way relationships in biochemical reactions [1] [8]. In contrast, DNNGIOR utilizes a deep neural network trained on a vast corpus of genomic data from over 11,000 bacterial species, learning to impute missing reactions based on patterns observed across the bacterial phylogenetic tree [19].
The following diagram illustrates a generalized workflow that is common to the hypergraph-based approaches, CHESHIRE and CLOSEgaps:
Generalized Hypergraph-Based Gap-Filling Workflow
A standard internal validation for gap-filling tools is assessing their ability to recover reactions that were artificially removed from a metabolic network. The following table summarizes the performance of CHESHIRE, CLOSEgaps, and other methods as reported in their respective studies.
Table 2: Performance Metrics on Artificially Introduced Gaps
| Tool | Benchmark Dataset | Key Performance Metric | Reported Result | Comparative Performance |
|---|---|---|---|---|
| CHESHIRE [1] | 108 BiGG & 818 AGORA GEMs | Area Under the ROC Curve (AUROC) | Outperformed NHP and C3MM | Superior to NHP, C3MM, and a Node2Vec-mean baseline. |
| CLOSEgaps [8] | 5 high-quality BiGG GEMs | Accuracy in Gap Recovery | >96% | Outperformed CHESHIRE, GraphSAGE, GCN, and others in its tests. |
| DNNGIOR [19] | >11,000 bacterial genomes | F1 Score | 0.85 (for reactions in >30% of training genomes) | Guided gap-filling was 14x more accurate for draft models than unweighted methods. |
Beyond internal recovery tests, external validation through improved phenotypic prediction is critical. CHESHIRE demonstrated a significant improvement in predicting the secretion of fermentation products and amino acids in 49 draft GEMs reconstructed by CarveMe and ModelSEED pipelines [1]. Similarly, CLOSEgaps enhanced the phenotypic predictions for 24 GEMs, showing notable improvement in the production of four key metabolites: Lactate, Ethanol, Propionate, and Succinate in two organisms [8]. DNNGIOR's gap-filling strategy led to models that could simulate real data with fewer false positives compared to those generated by CarveMe [19].
This protocol is based on the official documentation and source code for CHESHIRE [5].
Step 1: Software and Environment Setup
git clone https://github.com/canc1993/cheshire-gapfilling.git.Step 2: Input File Preparation
.xml) format in the cheshire-gapfilling/data/gems/ directory.bigg_universe.xml) in the data/pools/ directory. Ensure the namespace (BiGG or ModelSeed) matches your GEMs.input_parameters.txt file to define key variables:
CULTURE_MEDIUM: Path to the media definition file (e.g., ./data/fermentation/media.csv).NUM_GAPFILLED_RXNS_TO_ADD: The number of top candidate reactions to add during phenotypic validation.NAMESPACE: The biochemical database namespace ("bigg" or "modelseed").ANAEROBIC: Set to 1 to exclude reactions involving oxygen.Step 3: Execution
python3 main.py.get_predicted_score(): Scores all candidate reactions in the pool for their likelihood of being missing.get_similarity_score(): Scores the similarity of candidates to existing reactions in the GEM.validate(): Performs time-consuming flux simulations to find the minimal set of top candidates that enable new metabolic secretions.Step 4: Interpretation of Results
results/scores/. Use the mean score across Monte Carlo runs for ranking.results/gaps/suggested_gaps.csv for simulation results. Key columns include phenotype__no_gapfill and phenotype__w_gapfill (binary indicators of secretion capability before and after gap-filling), and rxn_ids_added (list of added reactions).Based on the publication by Liu et al. [8], the workflow for CLOSEgaps involves the following key stages, which can be implemented programmatically:
Step 1: Hypergraph Construction and Negative Sampling
H, where metabolites are nodes and reactions are hyperedges.Step 2: Feature Initialization and Refinement
Step 3: Reaction Scoring and Ranking
Table 3: Key Resources for AI-Driven Metabolic Gap-Filling
| Resource Name | Type | Function in Research | Example Source / Note |
|---|---|---|---|
| BiGG Models [1] | Knowledgebase of GEMs | Provides high-quality, curated models for training and benchmarking tools like CHESHIRE and CLOSEgaps. | http://bigg.ucsd.edu |
| BiGG Universe Reaction Pool | Biochemical Reaction Database | Serves as a universal pool of candidate reactions for gap-filling algorithms to search. | Included in CHESHIRE's data/pools [5] |
| IBM ILOG CPLEX Optimizer | Mathematical Optimization Solver | Used internally by CHESHIRE to perform Flux Balance Analysis (FBA) and validate phenotypic improvements after gap-filling. | Commercial software requiring a license [5] |
| ChEBI Database [8] | Chemical Database of Small Molecules | Provides a comprehensive metabolite pool for generating negative reaction samples during model training in CLOSEgaps. | https://www.ebi.ac.uk/chebi/ |
| CarveMe & ModelSEED [1] | Automated Reconstruction Pipelines | Used to generate draft GEMs from genomic sequences, which are then refined using the gap-filling tools. | Provides the initial, incomplete models for curation. |
GEnome-scale Metabolic Models (GEMs) are powerful computational tools that provide a mathematical representation of an organism's metabolism, enabling the prediction of cellular metabolic states and physiological phenotypes [1]. The reconstruction of high-quality GEMs is fundamental for applications in metabolic engineering, drug discovery, and microbial ecology [1]. However, draft GEMs, particularly those generated automatically for non-model organisms or from incomplete metagenome-assembled genomes, often contain significant knowledge gaps [1] [19]. These gaps, primarily missing metabolic reactions, arise from imperfect genomic and functional annotations and severely limit the predictive accuracy of the models, especially for the secretion of valuable compounds like amino acids and fermentation products [1].
Traditional gap-filling methods often require experimental phenotypic data as input to identify and correct these missing functions, creating a bottleneck for the study of uncultivable organisms or those for which such data is not readily available [1]. We present a case study on using CHESHIRE (CHEbyshev Spectral HyperlInk pREdictor), a deep learning-based method, to enhance the prediction of amino acid secretions in draft GEMs. This approach performs gap-filling purely based on the topological features of the metabolic network, offering a powerful, data-independent solution for model curation [1] [5].
CHESHIRE frames the problem of finding missing reactions as a hyperlink prediction task on a hypergraph. In this representation, each metabolic reaction is a hyperlink connecting all its participating metabolite nodes, thereby naturally capturing the higher-order relationships in the network [1].
The learning architecture of CHESHIRE consists of four major steps [1]:
Table: Key Features of the CHESHIRE Algorithm
| Feature | Description | Advantage |
|---|---|---|
| Hypergraph Learning | Models reactions as hyperlinks connecting multiple metabolites. | Captures higher-order information lost in graph approximations. |
| Topology-Based | Uses only the metabolic network structure; requires no phenotypic data. | Applicable to non-model and uncultivable organisms. |
| Chebyshev Spectral GCN | Refines node features using graph convolutions. | Effectively captures metabolite-metabolite interactions. |
| Combined Pooling | Uses max-min and Frobenius norm pooling. | Provides complementary information for reaction representation. |
Implementing CHESHIRE requires a standard computer with the following recommended specifications [5]:
This protocol details the steps to identify missing reactions in a draft GEM using CHESHIRE to improve amino acid secretion predictions.
cheshire-gapfilling/data/gems/ directory [5].bigg_universe.xml) in the data/pools/ directory. Ensure the namespace (BiGG or ModelSEED) matches your GEM files [5].substrate_exchange_reactions.csv: Defines the target secretion compounds (e.g., amino acids). Must contain columns compound (common name) and a namespace-specific ID (e.g., bigg for BiGG IDs) [5].media.csv: Specifies the in silico culture medium composition, including compound IDs and maximum uptake fluxes [5].Edit the input_parameters.txt file to set key simulation parameters [5]:
CULTURE_MEDIUM: ./data/fermentation/media.csvREACTION_POOL: ./data/pools/universe.xmlGEM_DIRECTORY: ./data/gems/NUM_GAPFILLED_RXNS_TO_ADD: Number of top-ranked candidate reactions to add during gap-filling (e.g., 20). This is a critical parameter balancing computation time and comprehensiveness.ADD_RANDOM_RXNS: Set to 0 to use CHESHIRE predictions.SUBSTRATE_EX_RXNS: ./data/fermentation/substrate_exchange_reactions.csvNAMESPACE: "bigg" or "modelseed" to match your database.ANAEROBIC: Set to 1 if simulating conditions without oxygen.Run CHESHIRE using the command python3 main.py. The tool generates results in three subdirectories [5]:
universe/: A merged reaction pool combining the universal database and input GEMs.scores/: Predicted confidence scores for every candidate reaction from the pool for each input GEM.gaps/: A file (suggested_gaps.csv) containing the key results of the phenotypic simulation.Table: Interpretation of Key Output Columns in suggested_gaps.csv
| Column Name | Description |
|---|---|
phenotype_no_gapfill |
Secretion capability (0/1) of the original draft GEM. |
phenotype_w_gapfill |
Secretion capability (0/1) of the gap-filled model. |
normalized_maximum_w_gapfill |
Maximum secretion flux, normalized by biomass production. |
rxn_ids_added |
IDs of the candidate reactions added during gap-filling. |
key_rxns |
The minimal set of added reactions enabling the new secretion phenotype. |
CHESHIRE was internally validated on 108 high-quality BiGG models. Reactions were artificially removed from these models, and the tool's performance in recovering them was evaluated. CHESHIRE demonstrated superior performance against other topology-based machine learning methods (NHP and C3MM), as measured by the Area Under the Receiver Operating Characteristic curve (AUROC) and other classification metrics [1].
The external validation involved 49 draft GEMs reconstructed using CarveMe and ModelSEED pipelines. CHESHIRE was used to predict and fill missing reactions, and the resulting gap-filled models were tested for their ability to secrete fermentation products and amino acids. The results showed that CHESHIRE significantly improved the theoretical predictions of these phenotypic traits, confirming its utility in practical GEM curation scenarios [1].
Table: Example Amino Acid Metabolites and Their Relevance
| Amino Acid / Metabolite | Metabolic Role / Pathway | Relevance in Model Predictions |
|---|---|---|
| Branched-Chain Amino Acids\n(Leucine, Isoleucine, Valine) | Energy metabolism, muscle growth, insulin action [40] | Key biomarkers for metabolic health; secretion profiles indicate functional model [40] [41]. |
| Aromatic Amino Acids\n(Phenylalanine, Tyrosine) | Neurotransmitter biosynthesis, liver function [40] | Linked to insulin sensitivity; predictors of metabolic improvement [40]. |
| Tryptophan & Derivatives\n(Kynurenine, Serotonin, Indoles) | Immune regulation, gut microbiota interaction [40] [42] | Complex pathway reflects host-microbiome crosstalk; critical for predicting systemic metabolic states [40] [42]. |
| Glycine, Serine | Collagen formation, bone health [41] | Altered levels associated with specific disease states (e.g., osteoporosis) [41]. |
Table: Essential Research Reagents and Computational Tools
| Item / Resource | Function / Description | Application in Protocol |
|---|---|---|
| CHESHIRE Package [5] | Deep learning-based gap-filling tool. | Core algorithm for predicting missing reactions. |
| BiGG Database [1] [5] | Knowledgebase of biochemical reactions. | Source for the universal reaction pool (universe.xml). |
| CarveMe & ModelSEED [1] | Automated GEM reconstruction pipelines. | For generating the initial draft models to be curated. |
| IBM CPLEX Solver [5] | Optimization software. | Solves linear programming problems during flux balance analysis. |
| Python Scientific Stack\(NumPy, SciPy, Pandas\) [5] | Programming language and libraries. | Core computational environment for running CHESHIRE. |
This case study demonstrates that CHESHIRE is a powerful tool for the curation of draft GEMs. By leveraging deep learning on metabolic network topology, it accurately identifies missing reactions without prior reliance on experimental phenotypic data, addressing a significant bottleneck in metabolic modeling [1]. The validation on both curated and draft models confirms that CHESHIRE not only recovers known reactions but also improves the prediction of metabolic phenotypes, such as amino acid secretion [1].
The ability to reliably predict amino acid secretion has broad implications. Altered levels of circulating amino acids like branched-chain amino acids (BCAAs), aromatic amino acids (AAAs), and tryptophan derivatives are key biomarkers in human metabolic diseases, including obesity, type 2 diabetes, and inflammatory bowel disease (IBD) [40] [42]. Furthermore, specific amino acid signatures are associated with conditions like osteoporosis [41]. GEMs refined with CHESHIRE can therefore serve as valuable in silico platforms for generating mechanistic hypotheses about the role of microbial and human metabolism in health and disease, potentially identifying novel therapeutic targets and dietary interventions [42].
CHESHIRE represents a paradigm shift in metabolic network reconstruction by enabling accurate gap-filling without experimental data, leveraging advanced hypergraph learning to capture complex metabolic interactions. Through robust validation demonstrating superior performance over existing methods and tangible improvements in phenotypic predictions, CHESHIRE empowers researchers to build more complete metabolic models for non-model organisms and poorly characterized systems. Future directions include integrating multi-omics data, expanding to eukaryotic systems, and applications in drug target discovery and personalized medicine. As deep learning approaches continue evolving, CHESHIRE establishes a foundation for automated, knowledge-driven metabolic network curation that will accelerate biomedical research and therapeutic development.