This article details NP-PRESS, an innovative two-stage mass spectrometry pipeline designed to overcome the critical bottleneck of dereplication in natural product research. Tailored for researchers and drug development professionals, it explores the strategy's foundational need in filtering complex metabolomes, its methodological core involving the FUNEL and simRank algorithms, practical guidance for troubleshooting and optimization, and a comparative analysis against traditional and emerging methods. By synthesizing proof-of-concept successes in discovering novel bioactive compounds, the article demonstrates how NP-PRESS refines metabolomic data to prioritize truly novel features, thereby reducing costly and fruitless isolation efforts and streamlining the path to new drug leads.
The discovery of novel bioactive natural products (NPs) remains a cornerstone of pharmaceutical development, yet the process is fundamentally hindered by the persistent challenge of dereplication—the early and accurate identification of known compounds within complex biological extracts [1]. Modern high-resolution mass spectrometry (MS) and liquid chromatography-mass spectrometry (LC-MS) generate vast metabolomic datasets, but the true signals of novel, often low-abundance secondary metabolites are frequently obscured by an overwhelming background of interfering features [1]. These interferences originate not only from abiotic sources but, more problematically, from biotic processes, including microbial degradation products and media components, which are chemically analogous to target NPs and thus exceptionally difficult to filter out using conventional methods [1].
This challenge frames the critical need for advanced strategies that move beyond simple database matching. Effective dereplication must prioritize novelty by systematically removing both known compounds and irrelevant biological noise, thereby focusing precious research resources on the most promising, unidentified features. This article details the application notes and protocols for a modern solution to this challenge: the NP-PRESS (Natural Product PRIoritization via Elimination of Spectral Signatures) strategy, a two-stage MS feature dereplication framework. NP-PRESS integrates novel algorithmic filters to highlight new NPs by thoroughly eliminating overwhelming irrelevant features, enabling the discovery of novel chemical entities from diverse and underexplored bacterial sources [1].
The NP-PRESS strategy is a methodical, two-tiered computational workflow designed to process LC-MS/MS data for the specific purpose of novel natural product discovery. Its core innovation lies in two custom algorithms, FUNEL (for MS1 data) and simRank (for MS2 data), which work in concert to remove irrelevant features [1].
Objective: To prioritize LC-MS features corresponding to putative novel natural products by eliminating signals from known compounds and biotic interference.
Materials & Input Data:
Experimental Procedure:
Sample Preparation & LC-MS/MS Acquisition:
Stage 1: MS1 Filtering with the FUNEL Algorithm:
Stage 2: MS2 Dereplication with the simRank Algorithm:
Priority List Generation & Downstream Analysis:
The efficacy of the NP-PRESS strategy has been validated through the discovery of new NPs. Applied to Streptomyces albus J1074, it guided the discovery of new surugamide analogs. More significantly, its application to the unusual anaerobic bacterium Wukongibacter baidiensis M2B1 led to the discovery of the baidienmycins, a new family of depsipeptides with potent antimicrobial and anticancer activities [1]. These cases underscore its utility in mining complex datasets from diverse bacteria, particularly extremophiles.
Table 1: Key Performance Metrics of the NP-PRESS Dereplication Strategy.
| Metric | Description | Outcome in Validation Studies |
|---|---|---|
| Feature Reduction Rate | Percentage of initial LC-MS features filtered out in Stage 1 (FUNEL). | Dramatically reduces feature count, eliminating >50% of irrelevant biotic/abiotic interferences [1]. |
| Novel Compound Prioritization | Ability to rank unknown features leading to successful isolation of new NPs. | Successfully prioritized signals leading to discovery of baidienmycins and new surugamide analogs [1]. |
| Dereplication Accuracy | Specificity in correctly identifying known compounds via simRank MS2 matching. | High-confidence dereplication of knowns, minimizing false negatives for novel compounds [1]. |
| Application Scope | Suitability for diverse microbial taxa, including challenging cultures. | Proven effective for standard actinomycetes (Streptomyces) and unusual anaerobic bacteria [1]. |
Diagram 1: NP-PRESS Two-Stage Dereplication Workflow
Beyond NP-PRESS, contemporary dereplication is a multi-faceted process integrating chemical and genetic data to maximize the efficiency of novel compound discovery.
Objective: To rationally build a natural product library with broad metabolite diversity by linking phylogenetic clades with chemical feature accumulation [2].
Procedure:
Isolate Collection & Barcoding:
Metabolomic Profiling:
Bifunctional Data Analysis:
Actionable Library Design:
Table 2: Quantitative Insights from Genetic Barcoding-Metabolomics Integration in Alternaria Fungi [2].
| Analysis Parameter | Quantitative Finding | Implication for Library Design |
|---|---|---|
| Isolates Needed for ~99% Coverage | 195 isolates | Provides a quantitative target for library size to efficiently capture genus-level diversity. |
| Proportion of "Singleton" Features | 17.9% of features appeared in only one isolate. | Indicates high chemical rarity; very deep sampling is required to capture full diversity. |
| Clade-Chemistry Correlation | Non-equivalent levels of chemical diversity across different ITS clades. | Enables targeted sampling of genetically distinct, chemically rich clades. |
| Key Tool | Feature accumulation curves. | Allows real-time monitoring and prediction of chemical diversity coverage during library building. |
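The feature accumulation curves listed in Table 2 can be illustrated with a short sketch. It assumes each isolate is represented as a set of detected feature identifiers; the function names (`accumulation_curve`, `isolates_for_coverage`, `singleton_fraction`) and the toy data are hypothetical and are not taken from [2]. Coverage here is measured against the features observed in the sampled isolates, not against the true richness of the genus.

```python
import random


def accumulation_curve(isolate_features, n_permutations=100, seed=0):
    """Mean count of unique features after sampling 1..N isolates in random order."""
    rng = random.Random(seed)
    n = len(isolate_features)
    totals = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        seen = set()
        for position, idx in enumerate(order):
            seen |= isolate_features[idx]      # add this isolate's features
            totals[position] += len(seen)      # cumulative unique features so far
    return [t / n_permutations for t in totals]


def isolates_for_coverage(curve, fraction):
    """Smallest number of isolates whose mean feature count reaches the target fraction."""
    target = fraction * curve[-1]
    for k, value in enumerate(curve, start=1):
        if value >= target:
            return k
    return len(curve)


def singleton_fraction(isolate_features):
    """Fraction of features that were detected in exactly one isolate."""
    counts = {}
    for features in isolate_features:
        for feature in features:
            counts[feature] = counts.get(feature, 0) + 1
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Toy data: four isolates, six distinct features (illustrative only).
isolates = [
    {"f1", "f2", "f3"},
    {"f2", "f3", "f4"},
    {"f1", "f5"},
    {"f3", "f6"},
]
curve = accumulation_curve(isolates)
print([round(v, 2) for v in curve])
print(isolates_for_coverage(curve, 0.8))
print(singleton_fraction(isolates))
```

Averaging over random sampling orders smooths the curve so that its flattening, rather than the order in which isolates happened to be collected, indicates diminishing returns from further sampling.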
Machine learning (ML) models are increasingly deployed to predict compound class, bioactivity, or structural novelty directly from MS or spectral data, adding a predictive layer to dereplication [3].
Protocol Outline: ML Model Training for Spectral Classification
Data Curation & Preprocessing:
Model Selection & Training:
Model Deployment in Dereplication:
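As an illustration of the preprocessing step in this outline, the sketch below converts an MS2 peak list into a fixed-length, unit-normalized vector that a classifier could consume. Fixed-width binning is one common choice and is an assumption here, not necessarily the representation used in [3]; the function name `bin_spectrum` and the default m/z range and bin width are hypothetical.

```python
import math


def bin_spectrum(peaks, mz_min=50.0, mz_max=1500.0, bin_width=1.0):
    """Turn a list of (m/z, intensity) peaks into a unit-length binned vector."""
    n_bins = int(math.ceil((mz_max - mz_min) / bin_width))
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:                       # ignore out-of-range peaks
            vector[int((mz - mz_min) / bin_width)] += intensity
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        return vector                                   # empty spectrum stays all-zero
    return [v / norm for v in vector]


# Toy spectrum: two peaks fall in the same 1 Da bin, one in another.
example_peaks = [(100.2, 3.0), (100.7, 1.0), (250.0, 3.0)]
example_vector = bin_spectrum(example_peaks)
print(len(example_vector), example_vector[50], example_vector[200])
```

Vectors produced this way all have the same length, so spectra from different features can be stacked into a matrix for any standard classifier.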
Diagram 2: Integrated Dereplication Strategy Roadmap
Table 3: Key Reagents, Materials, and Software for Advanced Dereplication.
| Tool/Reagent | Function in Dereplication | Application Notes |
|---|---|---|
| High-Resolution LC-MS System (e.g., UPLC-QTOF, UPLC-Orbitrap) | Generates high-fidelity MS1 and MS2 spectral data for feature detection and algorithmic processing. | Essential for initial data acquisition. Requires precise calibration for accurate mass measurements. |
| FUNEL & simRank Algorithms (NP-PRESS) | Core computational filters for removing biotic interference and dereplicating known compounds via MS2 matching [1]. | Custom software; critical for executing the two-stage NP-PRESS strategy. |
| Global Natural Products Social (GNPS) Library | A public, crowd-sourced database of MS2 spectra for known natural products. | Serves as a key reference database for the simRank algorithm and standard spectral library searches. |
| ITS/16S rRNA PCR Primers & Sequencing Kits | Enables genetic barcoding of fungal/bacterial isolates to establish phylogenetic clades [2]. | Allows correlation of chemical diversity with genetic diversity for rationalized library design. |
| Machine Learning Platforms (e.g., Python scikit-learn, TensorFlow) | Provides environment to build, train, and deploy models for spectral classification and novelty prediction [3]. | Used to add a predictive layer to dereplication, moving from identification to forecasting compound properties. |
| Solid Phase Extraction (SPE) Cartridges (C18, HLB) | Pre-fractionates complex crude extracts to reduce complexity prior to LC-MS analysis. | Can lower ion suppression and simplify chromatograms, improving feature detection. |
The dereplication challenge in modern natural product research is no longer a simple task of database lookup. It is a sophisticated data triage process that requires integrated strategies to separate novel bioactive compound signals from a dense background of chemical noise. The NP-PRESS two-stage strategy, employing the FUNEL and simRank algorithms, provides a robust, specialized protocol to directly address the critical problem of biotic interference, effectively prioritizing novel chemical entities [1]. This core methodology is powerfully augmented by quantitative library-building approaches that use genetic barcoding and feature accumulation curves to optimize source selection [2], and by machine learning models that add predictive power to the analysis of spectral data [3]. Together, these protocols form a comprehensive, modern dereplication pipeline that transforms the discovery process from one of serendipity to a rational, data-driven endeavor, significantly accelerating the identification of new natural products for drug development.
In untargeted metabolomics, particularly within natural product (NP) discovery, the true signal of interest is often buried in noise. This noise originates from a complex background of irrelevant chemical features generated by both biotic processes (e.g., microbial degradation of cellular components and media) and abiotic processes (e.g., spontaneous chemical reactions and environmental contaminants) [1] [4]. The consequence is a high cost in research efficiency: significant resources are wasted on the fruitless isolation and structural elucidation of these interfering compounds, diverting effort from genuine bioactive metabolites [4].
The scale of this problem is immense. While a single plant species may contain up to an estimated 5,000 metabolites, a typical mass spectrometry (MS) run may detect tens of thousands of molecular features, most of which are not the target secondary metabolites [5]. This creates a classic "needle in a haystack" scenario. The NP-PRESS (Natural Products Prioritization and Refinement by Elimination of Spurious Signals) research framework addresses this directly. It is a two-stage MS feature dereplication strategy designed to systematically prioritize novel natural products by thoroughly removing overwhelming irrelevant features, thereby refining the metabolome for efficient analysis [1] [4].
The NP-PRESS pipeline is built upon two novel computational algorithms applied sequentially to LC-MS/MS data to filter out irrelevant signals and highlight putative novel natural products [4].
Stage 1: FUNEL (Filtering Using Neutral Loss) – This stage operates on MS1-level data. FUNEL identifies and removes features originating from expected biochemical noise, such as known media components, common cellular building blocks, and their predictable derivatives (e.g., adducts, fragments, and neutral loss patterns). It functions as a high-stringency filter to drastically reduce dataset complexity before more detailed analysis [4].
Stage 2: simRank (Spectral Similarity Ranking) – This stage analyzes MS2 fragmentation spectra. simRank evaluates the spectral similarity of remaining features against comprehensive databases of known natural products and their derivatives. Features with high similarity to known compounds are dereplicated, while those with novel or unusual fragmentation patterns are prioritized for further investigation [4].
The effectiveness of this strategy is demonstrated in its application to microbial strains. For instance, when applied to Streptomyces albus J1074, NP-PRESS facilitated the identification of new surugamide analogs. More notably, its use on the anaerobic bacterium Wukongibacter baidiensis M2B1 led to the discovery of the baidienmycins, a new family of depsipeptides with potent bioactivity [4].
Table: Key Performance Outcomes of the NP-PRESS Strategy
| Microbial Strain | NP-PRESS Application Outcome | Significance |
|---|---|---|
| Streptomyces albus J1074 | Identification of new surugamide analogs | Validated pipeline on a model streptomycete [4] |
| Wukongibacter baidiensis M2B1 | Discovery of baidienmycins (new depsipeptides) | Uncovered novel chemistry from an unusual anaerobe; compounds show potent antimicrobial and anticancer activities [1] [4] |
This protocol is designed to characterize the sources of irrelevant signals in microbial fermentation samples, a prerequisite for effective dereplication.
1. Cultivation and Sample Generation:
2. Metabolite Quenching and Extraction:
3. LC-MS/MS Data Acquisition:
4. Interference Analysis with NP-PRESS:
Effective visualization is critical for evaluating data quality and the impact of interference at each processing step [8].
1. Pre-processing and Quality Control Visualization:
2. Post-Dereplication Evaluation Visualization:
Diagram: A two-stage workflow showing the sequential application of FUNEL and simRank algorithms to filter and prioritize mass spectrometry data for natural product discovery.
Interpreting metabolomic data requires contextualizing metabolites within their biological pathways and understanding how stress signaling can generate interference [10].
Diagram: Illustrates how biotic and abiotic stresses trigger both target metabolic responses (primary and secondary metabolites) and the generation of interfering chemical byproducts that co-elute in mass spectrometry analysis.
Table: Key Research Reagent Solutions for Interference-Aware Metabolomics
| Reagent/Material | Function in Protocol | Rationale & Consideration |
|---|---|---|
| Stable Isotope-Labeled Media (e.g., U-¹³C glucose) | Cultivation of microbes in controls and experimental samples. | Enables tracking of true microbial metabolites vs. abiotic carryover from media via isotope patterns; crucial for validating FUNEL filters [5]. |
| Biphasic Extraction Solvents (Methanol/Chloroform/Water) | Comprehensive metabolite extraction from cell pellets and broth. | Provides broad recovery of both polar and non-polar metabolites, ensuring the detected "interference" profile is representative of the total chemical space [5]. |
| Solid-Phase Extraction (SPE) Cartridges (C18, HLB, mixed-mode) | Clean-up and fractionation of crude extracts prior to LC-MS. | Removes salts and highly polar media components that cause ion suppression and column degradation, reducing a major source of abiotic interference and improving sensitivity for target NPs. |
| Quality Control (QC) Reference Mix | Injection at regular intervals during LC-MS sequence. | Monitors instrument stability; data from QC samples is used for signal correction and to distinguish technical drift from biological variation [6]. |
| MS Spectral Databases (GNPS, NIST, In-house libraries) | Reference for simRank algorithm and manual validation. | Essential for the dereplication stage. The comprehensiveness of the database directly impacts the false-positive rate for novelty claims [5] [4]. |
| Metabolomics Analysis Software (MetaboAnalyst [7], MZmine, GNPS) | Data processing, statistical analysis, and visualization. | Platforms like MetaboAnalyst integrate multiple visualization strategies (PCA, volcano plots, heatmaps) critical for diagnosing interference and interpreting NP-PRESS output [9] [8] [7]. |
The high cost of irrelevant signals in metabolomics—measured in wasted time, resources, and missed discoveries—is a critical bottleneck in natural product research and metabolomics broadly. The NP-PRESS two-stage dereplication strategy provides a robust, algorithmic framework to address this by systematically subtracting biotic and abiotic interference. By integrating careful experimental design with controlled samples, followed by sequential FUNEL (MS1) and simRank (MS2) filtering, researchers can transform a complex, noisy metabolome into a refined list of high-priority candidates. This strategy, supported by diagnostic visualizations and a dedicated toolkit, directly enhances the probability of discovering novel, bioactive natural products by ensuring that analytical effort is focused on true signals of biological and chemical novelty.
The discovery of novel natural products (NPs) from microbial metabolomes represents a cornerstone of pharmaceutical development, yielding compounds with potent antimicrobial, anticancer, and various other therapeutic activities [1]. However, the analytical landscape is dominated by a critical bottleneck: the sheer complexity of metabolomic data. In a typical liquid chromatography-tandem mass spectrometry (LC-MS/MS) experiment, the signals of bioactive secondary metabolites are obscured by an overwhelming majority of irrelevant chemical features originating from abiotic processes, culture media, and cellular degradation products [4]. Traditional one-stage dereplication approaches, which often rely on direct database matching of MS/MS spectra, struggle to differentiate these interfering biotic features from true NPs, leading to high rates of false positives and fruitless isolation campaigns [1].
This document details a transformative two-stage metabolome refining pipeline known as NP-PRESS (Natural Product Prioritization and Refinement via Elimination of Signal Surplus). Framed within a broader thesis on advanced dereplication strategies, NP-PRESS introduces a conceptual shift from simple feature filtering to a systematic, two-stage data refinement process. This strategy employs two novel algorithms—FUNEL for MS1-level filtering and simRank for MS2-level prioritization—to sequentially remove irrelevant features and highlight putative novel NPs [1]. The following application notes and protocols provide a comprehensive guide to implementing this strategy, complete with experimental workflows, data visualization standards, and a toolkit for researchers.
The NP-PRESS pipeline is engineered to deconvolute complex metabolomes by sequentially applying distinct data refinement steps at the MS1 and MS2 levels. This staged approach ensures a thorough removal of non-relevant features before committing resources to the detailed analysis of prioritized candidates [4].
Stage 1: MS1-Level Filtering with FUNEL
The first stage addresses the "signal surplus" from biotic and abiotic interferences. The FUNEL (Filtering of Uninteresting Nuisance Elements) algorithm operates on MS1 spectral data. Its core function is to perform comparative metabolomics between the sample of interest (e.g., a bacterial fermentation) and a set of control samples. These controls are meticulously designed to capture the chemical background, including sterile media, spent media from non-producing strains, or cells harvested during non-productive growth phases. FUNEL identifies and subtracts MS1 features that are statistically non-significant or are consistently present in these control samples. This step drastically reduces the dataset's complexity by eliminating up to 70-90% of total features attributed to media components and primary metabolic debris [1] [4].
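The control-based subtraction described for Stage 1 can be sketched as follows. This is a minimal illustration and not the published FUNEL implementation: it replaces the statistical test with a simple fold-change criterion, and the function name `subtract_background`, the thresholds, and the toy intensities are all assumptions.

```python
from statistics import mean


def subtract_background(features, min_fold=5.0, max_control_rate=0.5,
                        detection_limit=1000.0):
    """Keep features enriched in samples and not consistently detected in controls."""
    kept = []
    for feat in features:
        sample_mean = mean(feat["samples"])
        control_mean = mean(feat["controls"])
        # Fraction of control injections in which the feature is detected.
        detected = sum(1 for v in feat["controls"] if v >= detection_limit)
        control_rate = detected / len(feat["controls"])
        # The detection limit acts as a floor to avoid division by zero.
        fold = (sample_mean + detection_limit) / (control_mean + detection_limit)
        if fold >= min_fold and control_rate <= max_control_rate:
            kept.append(feat["id"])
    return kept


# Toy MS1 feature table (peak areas; illustrative values only).
table = [
    {"id": "F1", "samples": [5e5, 6e5, 5.5e5], "controls": [5e5, 5.2e5, 4.9e5]},
    {"id": "F2", "samples": [8e4, 9e4, 1e5], "controls": [0.0, 0.0, 0.0]},
    {"id": "F3", "samples": [2e4, 3e4, 2.5e4], "controls": [1500.0, 1200.0, 2000.0]},
    {"id": "F4", "samples": [4e4, 5e4, 4.5e4], "controls": [0.0, 1200.0, 0.0]},
]
kept_ids = subtract_background(table)
print(kept_ids)
```

In this toy table, F1 behaves like a media component (similar in samples and controls), F3 is enriched but present in every control, and only F2 and F4 pass both criteria.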
Stage 2: MS2-Level Prioritization with simRank
The second stage focuses on the remaining, refined feature set. The simRank algorithm analyzes the MS/MS fragmentation spectra of these features. Instead of relying solely on direct library matches, which are often incomplete for novel compounds, simRank calculates spectral similarity networks. It prioritizes features that exhibit moderate spectral relatedness to known natural product families or structural classes within databases like GNPS (Global Natural Products Social Molecular Networking) but are not direct matches. This prioritizes "scaffold-relative" novelty: compounds that are structurally new but may share biosynthetic logic with known families, making them prime candidates for discovery [1]. The final output is a shortlist of high-priority m/z-RT features accompanied by annotated putative structural classes and novelty scores.
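To make the idea of "moderate spectral relatedness" concrete, the sketch below scores a query spectrum against library spectra with a binned cosine similarity and sorts the result into three bands. The actual simRank scoring is not specified in this text, so the cosine measure, the function names (`cosine_similarity`, `triage`), and both thresholds are assumptions chosen for illustration.

```python
import math


def cosine_similarity(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity of two (m/z, intensity) peak lists after m/z binning."""
    def binned(spec):
        out = {}
        for mz, intensity in spec:
            key = round(mz / bin_width)
            out[key] = out.get(key, 0.0) + intensity
        return out

    a, b = binned(spec_a), binned(spec_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def triage(query, library, match_threshold=0.9, related_threshold=0.5):
    """Classify a query by its best similarity to any library spectrum."""
    best = max((cosine_similarity(query, ref) for ref in library), default=0.0)
    if best >= match_threshold:
        return "known"                    # direct match: dereplicate
    if best >= related_threshold:
        return "related, not matched"     # candidate for scaffold-relative novelty
    return "unrelated"


# Toy spectra (illustrative values only).
library = [[(100.0, 3.0), (200.0, 4.0)]]
print(triage([(100.0, 3.0), (200.0, 4.0)], library))
print(triage([(200.0, 4.0), (300.0, 3.0)], library))
print(triage([(150.0, 1.0)], library))
```

Simple binning can split a peak that sits near a bin edge across two bins; tolerance-based peak matching avoids this at the cost of more code.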
Table: Core Algorithmic Functions in the NP-PRESS Pipeline
| Algorithm | Stage | Primary Data Input | Core Function | Key Outcome |
|---|---|---|---|---|
| FUNEL | 1 | MS1 (Precursor Ion) | Comparative analysis against control samples to subtract background features. | Removal of biotic/abiotic interference; drastic reduction of feature list. |
| simRank | 2 | MS2 (Fragmentation Spectra) | Spectral similarity networking and ranking against known NP libraries. | Prioritization of features with scaffold-relative novelty. |
Principle: Generate paired experimental and control samples to feed the FUNEL algorithm. The goal is to produce a metabolome enriched for secondary metabolites while simultaneously capturing the chemical background from all non-producing sources [4].
Materials:
Procedure:
Principle: Acquire high-resolution MS1 and MS2 data suitable for both FUNEL (MS1 comparison) and simRank (MS2 networking) analysis [11].
Materials:
Chromatography Conditions (Example):
Mass Spectrometry Parameters (Example for DDA in positive mode):
Acquisition Sequence: Inject the QC sample at the beginning (3-5 times for system equilibration) and repeatedly throughout the batch (after every 4-6 experimental samples). Randomize the injection order of all experimental and control samples to mitigate instrument drift.
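The acquisition sequence described above can be generated programmatically. The following sketch randomizes the sample order and interleaves QC injections; the function name `build_sequence`, the sample labels, and the closing QC injection at the end of the batch are assumptions added for illustration.

```python
import random


def build_sequence(samples, n_equilibration=3, qc_every=5, seed=42):
    """Randomize sample order and insert QC injections at regular intervals."""
    rng = random.Random(seed)          # fixed seed keeps the run order reproducible
    order = list(samples)
    rng.shuffle(order)
    sequence = ["QC"] * n_equilibration            # system equilibration injections
    for position, name in enumerate(order, start=1):
        sequence.append(name)
        if position % qc_every == 0:               # periodic QC within the batch
            sequence.append("QC")
    if sequence[-1] != "QC":                       # assumption: close the batch with a QC
        sequence.append("QC")
    return sequence


# Hypothetical batch of 12 experimental and control samples.
samples = [f"S{i}" for i in range(1, 13)]
sequence = build_sequence(samples)
print(sequence)
```

Recording the seed alongside the batch makes it possible to regenerate the exact injection order later when diagnosing drift.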
Principle: Process raw LC-MS/MS files through the two-stage NP-PRESS workflow to generate a prioritized list of novel natural product candidates [1] [11].
Software & Platforms:
Procedure:
Table: Key Parameters for LC-MS/MS Data Pre-processing
| Processing Step | Software (Example) | Critical Parameters | Purpose |
|---|---|---|---|
| Peak Picking | MZmine | Noise level, m/z tolerance (e.g., 0.005 Da), min peak duration | Detect individual ion signals from raw data. |
| Chromatogram Deconvolution | MZmine | MS1 & MS2 m/z tolerance, RT span | Resolve co-eluting peaks and link MS2 spectra to MS1 features. |
| Alignment | MZmine | m/z tolerance (0.01 Da), RT tolerance (0.1 min) | Match identical features across multiple sample runs. |
| Isotope Grouping | MZmine | m/z tolerance, RT tolerance | Group adducts and isotopes belonging to the same molecule. |
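The alignment tolerances in the table can be read as a simple matching rule: two features from different runs are treated as the same feature when both their m/z and retention time differences fall within tolerance. The sketch below illustrates that rule with a greedy nearest-m/z match; it is not how MZmine implements alignment, and the function name `align_features` and the toy features are hypothetical.

```python
def align_features(reference, other, mz_tol=0.01, rt_tol=0.1):
    """Greedily pair (m/z, RT) features from two runs that agree within tolerance."""
    pairs = []
    used = set()
    for i, (ref_mz, ref_rt) in enumerate(reference):
        best_j = None
        best_delta = None
        for j, (mz, rt) in enumerate(other):
            if j in used:
                continue                     # each feature may be matched only once
            delta_mz = abs(mz - ref_mz)
            if delta_mz <= mz_tol and abs(rt - ref_rt) <= rt_tol:
                if best_delta is None or delta_mz < best_delta:
                    best_j, best_delta = j, delta_mz
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


# Toy feature lists as (m/z in Da, retention time in min); illustrative only.
run_a = [(301.1410, 5.20), (455.2903, 8.75), (512.3000, 10.00)]
run_b = [(455.2950, 8.80), (301.1432, 5.25), (301.1410, 7.00)]
matched = align_features(run_a, run_b)
print(matched)
```

The third feature in `run_b` shares an m/z with the first feature in `run_a` but elutes 1.8 min later, so it is not merged; this is why both tolerances are needed.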
Effective visualization is critical for interpreting the complex, multi-dimensional data generated by the NP-PRESS pipeline and for communicating results [8]. The following standards should be applied.
1. Molecular Network Visualization (simRank Output): The primary output of simRank is a molecular network typically visualized using Cytoscape.
2. Feature Abundance Plots (FUNEL Output): Visualize the effect of Stage 1 filtering.
3. Chromatographic and Spectral Visualization: Essential for manual validation of priority candidates.
Table: Key Reagent Solutions for NP-PRESS Workflow Implementation
| Category | Item / Reagent | Specification / Function | Critical Notes |
|---|---|---|---|
| Chromatography | Mobile Phase Additive | Formic Acid (0.1%) or Ammonium Acetate (5-10 mM). | Enhances ionization in positive or negative ESI mode and improves peak shape. Must be LC-MS grade. |
| Mass Spectrometry | Calibration Solution | Manufacturer-specific ESI-L low concentration tuning mix. | Enables accurate mass measurement (< 5 ppm error). Must be infused pre-run for high-resolution instruments. |
| Sample Preparation | Extraction Solvent System | Methanol:Ethyl Acetate:Acetic Acid (50:50:1, v/v/v). | Broad-spectrum solvent for secondary metabolites of varying polarity from biomass [11]. |
| Sample Preparation | Reconstitution Solvent | Methanol:Water (1:1, v/v). | Ensures solubility of a wide polarity range of compounds and compatibility with reversed-phase LC gradients. |
| Data Processing | Internal Standard | Deuterated or non-native compound (e.g., chloramphenicol-d5). | Spiked into all samples pre-extraction to monitor and correct for extraction efficiency and instrument variability. |
| Cultivation | Control Media | Identical, sterile production media. | Serves as the abiotic control for FUNEL to subtract all media-derived chemical features [4]. |
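Two of the table entries correspond to simple calculations: the mass-accuracy criterion (< 5 ppm) and correction with a spiked internal standard. The sketch below shows both under stated assumptions; the function names and the numeric values are hypothetical, and dividing by the internal standard signal is only one of several possible normalization schemes.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Mass error in parts per million relative to the theoretical m/z."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def normalize_to_internal_standard(intensities, standard_intensity):
    """Express feature intensities relative to the internal standard signal."""
    if standard_intensity <= 0:
        raise ValueError("internal standard was not detected in this sample")
    return {fid: value / standard_intensity for fid, value in intensities.items()}


# Illustrative numbers only.
error = ppm_error(500.0025, 500.0)
print(round(error, 3), abs(error) < 5.0 + 1e-6)
normalized = normalize_to_internal_standard({"F1": 2000.0, "F2": 500.0}, 1000.0)
print(normalized)
```

Raising an error when the standard is missing is a deliberate choice in this sketch: a sample without a detectable internal standard usually indicates an extraction or injection problem that should be inspected, not silently corrected.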
The efficacy of the NP-PRESS strategy is demonstrated by its application to diverse bacterial strains, leading to the discovery of novel bioactive compounds [1] [4].
Case Study: Streptomyces albus J1074
Case Study: Wukongibacter baidiensis M2B1 (Anaerobic Bacterium)
These case studies indicate that moving from single-pass feature filtering to systematic two-stage refinement addresses the core challenge of metabolome complexity, turning high-dimensional MS data into a targeted shortlist of candidate novel natural products.
The NP-PRESS (Natural Products Prioritization and Refinement via Elimination of Spectral Similarity) pipeline is founded on a core philosophical shift in natural product (NP) discovery: moving from a detection-centric to a refinement-centric paradigm. Conventional mass spectrometry (MS) workflows are excellent at detecting thousands of chemical features but struggle to distinguish true, biosynthetically relevant natural products from the overwhelming background of biotic interference—such as media components, cellular degradation products, and horizontally acquired metabolites. NP-PRESS posits that the key to unlocking novel chemistry lies not in more sensitive detection, but in more intelligent, context-aware filtration.
Its primary objective is to serve as a decisive two-stage filter that aggressively removes irrelevant MS features while preserving and prioritizing those with high biosynthetic potential. This is achieved by integrating orthogonal data analysis strategies at the MS1 and MS2 levels, effectively mimicking the logical deduction of an experienced natural products chemist. The pipeline is designed to be especially effective for challenging microbial sources, such as extremophiles or strains with sparse metabolomic profiles, where signal-to-noise ratios are notoriously low and high-value metabolites are easily missed [1].
The following table quantifies the paradigm shift introduced by NP-PRESS, contrasting its strategic approach and outcomes with conventional dereplication methods.
Table 1: Strategic and Outcome Comparison: NP-PRESS vs. Conventional Dereplication
| Aspect | Conventional Dereplication | NP-PRESS Strategy | Impact |
|---|---|---|---|
| Core Focus | Identity matching against known compound libraries. | Prioritization of unknown features via background subtraction. | Shifts focus from known to unknown chemical space. |
| Primary Data Used | Predominantly MS2 spectral matching. | Integrated MS1 feature behavior and MS2 network analysis. | Uses orthogonal data layers for robust decision-making. |
| Handling of Biotic Interference | Often unaddressed; treated as part of the sample background. | Actively modeled and subtracted using the FUNEL algorithm. | Dramatically reduces feature list size, enhancing clarity. |
| Key Algorithmic Engine | Spectral similarity scoring (e.g., cosine score). | Two-stage: FUNEL (MS1) and simRank (MS2). | Enables prioritization based on biosynthetic logic. |
| Typical Outcome | List of known compounds and unresolved "unknown" features. | A ranked shortlist of features most likely to be novel NPs. | Directs purification efforts efficiently to high-priority targets. |
| Demonstrated Novel Discovery | Can rediscover known compounds efficiently. | Enabled discovery of baidienmycins and new surugamide analogs [1]. | Proven efficacy in de novo structure family identification. |
The NP-PRESS pipeline implements a sequential, two-stage refinement process. The workflow diagram below illustrates the logical flow from raw data to prioritized compound discovery.
NP-PRESS Two-Stage MS Dereplication Workflow
The analytical power of NP-PRESS is driven by two specialized algorithms, each operating on a different level of MS data. Their functions are detailed below.
Table 2: Core Algorithms of the NP-PRESS Pipeline
| Algorithm | Stage | Function | Key Mechanism |
|---|---|---|---|
| FUNEL (FUll and NEgative feature anaLysis) | 1 (MS1) | Elimination of biotic interference. | Compares feature profiles between experimental (full) and control (negative) cultures. Features not significantly enriched in the experimental group are flagged as non-biosynthetic background and removed [1]. |
| simRank | 2 (MS2) | Prioritization of novel NP clusters. | Analyzes spectral similarity networks of the refined features. Features that form tight clusters (high connectivity) with known NPs are deprioritized. Novel, structurally unique features that form distinct clusters or singletons are prioritized for investigation [1]. |
The following diagram details the decision logic within the critical second stage of the pipeline.
Stage 2 Priority Logic via simRank Analysis
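The priority logic summarized in Table 2 (clusters containing known NPs are deprioritized; distinct clusters and singletons are prioritized) can be sketched with a connected-components pass over a similarity network. This is an illustration of that rule only, not the simRank implementation; the function names, node labels, and the assumption that edges have already been thresholded by spectral similarity are hypothetical.

```python
def connected_components(nodes, edges):
    """Return the clusters (connected components) of an undirected network."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                          # depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def prioritize(nodes, edges, known):
    """Label each feature by whether its cluster contains a library-annotated node."""
    labels = {}
    for component in connected_components(nodes, edges):
        label = "deprioritized" if component & known else "prioritized"
        for node in component:
            labels[node] = label
    return labels


# Toy network: edges are assumed to connect spectra above a similarity cutoff.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("D", "E")]
labels = prioritize(nodes, edges, known={"C"})
print(labels)
```

Here A and B are deprioritized because they cluster with the annotated node C, while the D-E cluster and the singleton F carry no known annotation and remain candidates.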
This protocol is optimized to generate the paired "full" and "negative" culture datasets required for the FUNEL algorithm.
1.1 Materials Preparation
1.2 Procedure
This protocol ensures consistent, high-quality data suitable for both FUNEL and simRank analysis.
2.1 LC Conditions
2.2 MS Conditions (Q-TOF or Orbitrap recommended)
2.3 Data File Organization
Maintain a strict file naming convention to pair samples for FUNEL (e.g., StrainA_Rep1_Full.mzML, StrainA_Rep1_Neg.mzML). Acquire all samples in a randomized order within a continuous sequence to minimize instrumental drift.
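A naming convention like this can be checked automatically before analysis. The sketch below parses file names of the form shown above and pairs each "Full" file with its "Neg" counterpart; the regular expression and the function name `pair_files` are assumptions built around the two example names in the text.

```python
import re
from collections import defaultdict

# Pattern assumed from the example names, e.g. StrainA_Rep1_Full.mzML
PATTERN = re.compile(r"^(?P<strain>.+)_Rep(?P<rep>\d+)_(?P<cond>Full|Neg)\.mzML$")


def pair_files(filenames):
    """Group files by (strain, replicate) and return complete pairs and leftovers."""
    groups = defaultdict(dict)
    for name in filenames:
        match = PATTERN.match(name)
        if match is None:
            raise ValueError(f"file name does not follow the convention: {name}")
        key = (match["strain"], int(match["rep"]))
        groups[key][match["cond"]] = name
    pairs, unpaired = {}, []
    for key, conditions in sorted(groups.items()):
        if "Full" in conditions and "Neg" in conditions:
            pairs[key] = (conditions["Full"], conditions["Neg"])
        else:
            unpaired.extend(conditions.values())
    return pairs, unpaired


files = [
    "StrainA_Rep1_Full.mzML",
    "StrainA_Rep1_Neg.mzML",
    "StrainA_Rep2_Full.mzML",
]
pairs, unpaired = pair_files(files)
print(pairs)
print(unpaired)
```

Reporting unpaired files up front is useful because a missing negative control would otherwise only surface later as an apparent excess of sample-specific features.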
As a proof of concept, NP-PRESS was applied to the analysis of Streptomyces albus J1074 and the anaerobic bacterium Wukongibacter baidiensis M2B1 [1].
Table 3: Key Research Reagent Solutions for NP-PRESS Implementation
| Item Category | Specific Example/Description | Function in NP-PRESS Workflow |
|---|---|---|
| Biological Materials | Wild-type and biosynthetically "knock-out" mutant bacterial strains. | Provides the paired "Full" vs. "Negative" culture essential for the FUNEL algorithm's background subtraction [1]. |
| Chromatography | HPLC-grade solvents (MeCN, MeOH, H₂O) with 0.1% formic acid. | Forms the mobile phase for high-resolution LC separation, impacting feature detection and peak shape. |
| Mass Spectrometry | Tuning and calibration solution for MS (e.g., sodium formate cluster). | Ensures mass accuracy and reproducibility across the analytical sequence, critical for reliable feature alignment. |
| Software & Databases | MS-Dial, MZmine, or similar for feature finding; GNPS for spectral networking; In-house NP spectral library. | Used for initial data processing, public spectral matching, and custom spectral comparisons for simRank analysis. |
| Data Analysis | Python/R environment with packages for statistical comparison (e.g., for FUNEL) and graph-based clustering (e.g., for simRank). | Enables execution of the core computational algorithms that define the NP-PRESS pipeline [1]. |
The discovery of novel natural products (NPs) from microbial and plant sources remains a cornerstone of pharmaceutical development, yielding critical leads for antimicrobial, anticancer, and other therapeutic agents [1]. However, a fundamental bottleneck in this pipeline is the initial dereplication step—the rapid identification of known compounds to prioritize truly novel entities for costly and time-consuming isolation and structure elucidation [13]. Modern high-resolution mass spectrometry (HRMS) generates vast datasets of metabolic features, but the signals of potential NPs are often obscured by an overwhelming background of irrelevant ions originating from abiotic sources (e.g., solvents, plastics) and, more challengingly, from biotic processes (e.g., cellular degradation products, media components) [4].
This document details the first stage of NP-PRESS (Natural Products - Prioritization and Refinement by Enhanced Spectrometry Strategies), a novel two-stage MS feature dereplication strategy framed within a broader thesis on accelerating natural product discovery [1]. NP-PRESS introduces two new computational algorithms to systematically filter this complex metabolomic data. The first stage, the focus of this protocol, employs the FUNEL (Filtering of Uninteresting Non-Product-Related Features by Elution Profile) algorithm to perform rigorous filtering at the MS1 level [4]. By removing non-product-related features, FUNEL dramatically reduces dataset complexity and false leads before the more resource-intensive MS2 analysis. This initial refinement is critical for the success of the second stage, which utilizes the simRank algorithm for spectral similarity networking to highlight novel compound families [1]. The integrated NP-PRESS pipeline has proven effective, guiding the discovery of new surugamide analogs from Streptomyces albus and a new family of depsipeptides, the baidienmycins, from the anaerobic bacterium Wukongibacter baidiensis [4].
The FUNEL algorithm is designed to address a key limitation in current metabolomics: the inability to distinguish MS1 features originating from true secondary metabolites (natural products) from those generated by the routine metabolic turnover of the producing organism or its growth medium [4]. While background chemical noise can be partially subtracted using blank injections, biotic background from processed media and cellular debris is sample-inherent and has been historically difficult to filter out.
FUNEL operates on the principle that true natural products are typically synthesized de novo during a cultivation period. In contrast, compounds derived from the biotransformation of media components (e.g., peptides from hydrolyzed yeast extract) are present from the start of cultivation and are gradually consumed or transformed over time [4]. The algorithm exploits this difference in temporal profiles.
Logical Workflow of the NP-PRESS Strategy with FUNEL
Diagram Title: The Two-Stage NP-PRESS Dereplication Strategy
The algorithm requires LC-HRMS data collected from two key sample sets:
FUNEL processes the aligned MS1 feature table (containing m/z, retention time, and intensity across all samples) by applying two sequential filters [4]:
This two-pronged approach ensures that only features showing biological production by the organism during cultivation are passed to the next stage.
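The temporal logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published FUNEL implementation: the function name `funel_like_filter`, the dictionary keys, the fold-change thresholds, and the choice of the two filters (a comparison against uninoculated medium, then a test for accumulation over time) are all hypothetical.

```python
from statistics import mean


def funel_like_filter(features, min_fold_vs_medium=5.0,
                      min_fold_over_time=3.0, floor=1.0):
    """Return IDs of features that look produced de novo during cultivation.

    features: list of dicts with hypothetical keys
      'id'     - feature identifier
      'medium' - intensities in uninoculated medium controls
      'early'  - intensities in early culture samples
      'late'   - intensities in late culture samples
    All thresholds are illustrative assumptions, not published values.
    """
    kept = []
    for feat in features:
        # A small floor avoids division by zero for absent features.
        medium = max(mean(feat["medium"]), floor)
        early = max(mean(feat["early"]), floor)
        late = mean(feat["late"])
        # Filter 1 (assumed): the feature must clearly exceed the medium background.
        if late / medium < min_fold_vs_medium:
            continue
        # Filter 2 (assumed): the feature must accumulate over the cultivation period.
        if late / early < min_fold_over_time:
            continue
        kept.append(feat["id"])
    return kept


example = [
    # Absent in medium, rises during growth: candidate natural product.
    {"id": "F1", "medium": [0, 0], "early": [200, 180], "late": [9000, 11000]},
    # Abundant in medium and consumed over time: media-derived.
    {"id": "F2", "medium": [8000, 8200], "early": [7000, 7500], "late": [900, 1100]},
    # Above medium background but flat over time: rejected by filter 2.
    {"id": "F3", "medium": [50, 60], "early": [5000, 5200], "late": [5100, 4900]},
]
kept_ids = funel_like_filter(example)
```

In practice the comparison would more likely rest on replicate-aware statistics than on raw fold changes of means; the ratio test is used here only to keep the sketch short.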
The following protocol is adapted from successful applications in bacterial natural product discovery [4] and aligns with standard practices for microbial metabolomics [11].
Materials & Growth:
Procedure:
Raw LC-HRMS data must be converted into an aligned feature table.
The resulting aligned feature table, exported as a .csv file, is the primary input for the FUNEL algorithm.
The logic of the FUNEL filter is implemented through feature intensity comparisons.
Algorithmic Steps:
Diagram Title: Decision Workflow of the FUNEL Filtering Algorithm
The efficacy of the FUNEL algorithm within the NP-PRESS pipeline is demonstrated by its application in real discovery campaigns. The filtering drastically reduces dataset complexity, allowing downstream resources to focus on promising leads.
Table 1: Performance of FUNEL Filtering in NP-PRESS Case Studies [1] [4]
| Producer Organism | Initial MS1 Features | Features After FUNEL | Reduction (%) | Key Discovery Enabled |
|---|---|---|---|---|
| Streptomyces albus J1074 | ~15,000 | ~3,000 | 80% | New surugamide analogs |
| Wukongibacter baidiensis M2B1 | ~12,000 | ~2,500 | 79% | Baidienmycins (new depsipeptides) |
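The reduction percentages in Table 1 follow directly from the feature counts. The short check below recomputes them; the counts are the approximate values from the table, and the helper name `percent_reduction` is introduced only for illustration.

```python
def percent_reduction(before, after):
    """Percentage of features removed by filtering."""
    return 100.0 * (before - after) / before


# Approximate feature counts taken from Table 1.
table_1 = {
    "S. albus J1074": (15000, 3000),
    "W. baidiensis M2B1": (12000, 2500),
}
reductions = {strain: round(percent_reduction(before, after))
              for strain, (before, after) in table_1.items()}
```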
The utility of FUNEL is further underscored when compared to other state-of-the-art feature prioritization methods. While tools like MassQL provide a powerful, flexible language for querying specific patterns (e.g., isotopes, neutral losses) in public MS data repositories [14], FUNEL is specifically designed for a different problem: distinguishing biologically synthesized products from complex biotic background in controlled cultivation experiments.
Table 2: Comparison of MS1 Feature Prioritization Strategies
| Method / Algorithm | Primary Function | Key Advantage | Limitation in NP Discovery Context |
|---|---|---|---|
| FUNEL (NP-PRESS) | Filters based on temporal cultivation profile. | Removes sample-inherent biotic background; highly specific for de novo synthesis. | Requires carefully designed time-course experiment. |
| Blank Subtraction (Standard) | Subtracts features found in process blanks. | Removes abiotic contamination (solvents, tubing). | Cannot remove background from processed media components. |
| MassQL [14] | Query language for MS data patterns (MS1 & MS2). | Extremely flexible for finding known chemical motifs; vendor-agnostic. | Does not prioritize based on biological origin; requires pattern definition. |
| MBR/PIP [15] [16] | Transfers IDs between runs using RT, m/z, IM. | Increases feature identification sensitivity across samples. | Can propagate errors; requires high-quality reference library; not a filter for biotic noise [16]. |
The filtered output from FUNEL is the essential input for the second stage of NP-PRESS, which employs the simRank algorithm. simRank performs modified molecular networking on MS2 spectral data but is applied only to the precursors that passed the FUNEL filter. This focused analysis increases the chance that spectral similarity clusters represent true families of secondary metabolites rather than background compounds [1] [4].
For comprehensive dereplication, the FUNEL-simRank pipeline can be integrated with other established tools. The refined feature list can be queried against natural product databases using molecular formula or exact mass. Furthermore, the accurate MS1 features (with m/z and RT) can be used as high-fidelity targets for Match Between Runs (MBR) or Peptide-Identity-Propagation (PIP) in subsequent analyses of new strains or conditions, though such transfers require rigorous false-discovery rate control [15] [16]. Advanced software platforms like MaxQuant, which now support ion mobility dimensions, can enhance the accuracy of such alignments by using collision cross section (CCS) as an additional coordinate [15].
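An exact-mass query against a compound list can be sketched as follows. This is a simplified illustration, assuming that every feature is a singly protonated ion ([M+H]+) and that a 5 ppm tolerance is appropriate; the database entries `compound_X` and `compound_Y` are placeholders, not real natural products.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_by_exact_mass(feature_mz, database, tol_ppm=5.0):
    """Return names of database entries whose [M+H]+ m/z matches the feature.

    database: dict mapping compound name -> neutral monoisotopic mass (Da).
    Only the [M+H]+ adduct is considered (an assumption of this sketch).
    """
    hits = []
    for name, neutral_mass in database.items():
        theoretical_mz = neutral_mass + PROTON_MASS
        if abs(ppm_error(feature_mz, theoretical_mz)) <= tol_ppm:
            hits.append(name)
    return hits


# Placeholder entries for illustration only.
hypothetical_db = {"compound_X": 500.0000, "compound_Y": 500.0100}
hits = match_by_exact_mass(501.0073, hypothetical_db)
```

A real query would also consider other adducts (for example sodium or ammonium) and in-source fragments, which this sketch omits.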
Table 3: Key Research Reagent Solutions and Software for FUNEL/NP-PRESS Implementation
| Item | Function/Description | Application in Protocol |
|---|---|---|
| Methanol/Water/Formic Acid (49:49:2) | Extraction solvent for intracellular and extracellular metabolites. Provides good recovery of a wide polarity range of NPs [11]. | Sample preparation, metabolite extraction. |
| LC-MS Grade Water & Acetonitrile (with 0.1% FA) | Mobile phases for reversed-phase LC-HRMS. High purity minimizes background chemical noise in MS1 spectra. | LC-HRMS separation during data acquisition. |
| MZmine 3 | Open-source software for mass spectrometry data processing. Performs chromatogram building, deisotoping, alignment, and feature table export [11]. | Data preprocessing before FUNEL analysis. |
| MaxQuant | Comprehensive software suite for quantitative proteomics (and metabolomics). Its advanced "Match Between Runs" (MBR) algorithm can utilize multiple dimensions (RT, m/z, CCS) for high-confidence feature alignment [15]. | Optional for advanced feature alignment and integration with ion mobility data. |
| R Script/Python Environment | Custom computational environment for implementing the FUNEL logic (statistical tests, thresholding, filtering). | Execution of the core FUNEL algorithm. |
| GNPS / MassIVE | Public repository and ecosystem for mass spectrometry data. Used for spectral library matching, molecular networking, and sharing raw data [14]. | Downstream analysis after FUNEL filtering (e.g., with simRank) and data deposition. |
The discovery of novel, bioactive natural products (NPs) from microbial metabolomes is persistently challenged by the overwhelming chemical background of non-relevant metabolites. These include primary metabolites, cellular degradation products, and components from growth media, whose signals in mass spectrometry (MS) analyses can obscure the often lower-abundance secondary metabolites of interest. The NP-PRESS pipeline is a two-stage metabolome-refining strategy designed to overcome this hurdle [1] [4].
This pipeline systematically removes irrelevant chemical features to highlight NPs with higher potential for novelty and bioactivity. Stage 1 employs the FUNEL algorithm. FUNEL operates on MS1 data to filter out features originating from biotic processes by comparing experimental samples against a comprehensive database of control samples (e.g., spent media, host organism extracts). It does this by evaluating mass defects, isotopic patterns, and retention time shifts indicative of common biochemical transformations [1] [4].
Stage 2, which is the focus of these application notes, utilizes the simRank algorithm to analyze MS2 (tandem mass spectrometry) data. While Stage 1 effectively reduces dataset complexity, Stage 2 provides a higher-order, structural similarity-based filter. It prioritizes NP candidates by identifying MS2 spectra in the experimental samples that are dissimilar to all spectra found in control samples, thereby flagging compounds with potentially novel chemical scaffolds [1] [4].
The simRank algorithm, in its general form, is a graph-theoretic measure of structural-context similarity. Its core principle is: "two objects are considered similar if they are related to similar objects" [17]. In the context of web page analysis, this translates to pages being similar if they are linked to by similar pages.
For MS2 spectral analysis within NP-PRESS, this concept is adapted. Here, "objects" are precursor ions (detected features from MS1). The "relationship" is defined by their fragment ions (the MS2 spectrum). The adapted simRank principle for NP discovery becomes: Two precursor ions are considered to have similar chemical structures if their fragmentation spectra (the ions they are "related to") are similar [17] [18].
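The adapted principle can be illustrated with the general bipartite form of SimRank, in which precursors and fragments form the two node sets of a graph. The sketch below is not the NP-PRESS scoring code: fragments are reduced to rounded m/z values, intensities are ignored, and the decay factor and iteration count are assumptions. The resulting scores lie between 0 and 1 and are not on the same scale as the simRank scores reported elsewhere in these notes.

```python
from itertools import product


def bipartite_simrank(spectra, decay=0.8, iterations=5):
    """Generic bipartite SimRank between precursors.

    spectra: dict mapping precursor ID -> set of fragment keys
             (here, m/z values rounded to a fixed precision).
    Returns a dict mapping (precursor_a, precursor_b) -> similarity.
    """
    precursors = list(spectra)
    fragments = sorted({f for frags in spectra.values() for f in frags})
    # For each fragment, the precursors whose spectra contain it.
    owners = {f: [p for p in precursors if f in spectra[p]] for f in fragments}

    # Start from the identity: each node is similar only to itself.
    sim_p = {(a, b): float(a == b) for a, b in product(precursors, repeat=2)}
    sim_f = {(x, y): float(x == y) for x, y in product(fragments, repeat=2)}

    for _ in range(iterations):
        new_p = {}
        for a, b in product(precursors, repeat=2):
            if a == b:
                new_p[(a, b)] = 1.0
            elif not spectra[a] or not spectra[b]:
                new_p[(a, b)] = 0.0
            else:
                # Precursors are similar if their fragments are similar.
                total = sum(sim_f[(x, y)] for x in spectra[a] for y in spectra[b])
                new_p[(a, b)] = decay * total / (len(spectra[a]) * len(spectra[b]))
        new_f = {}
        for x, y in product(fragments, repeat=2):
            if x == y:
                new_f[(x, y)] = 1.0
            else:
                # Fragments are similar if they occur in similar precursors.
                total = sum(sim_p[(a, b)] for a in owners[x] for b in owners[y])
                new_f[(x, y)] = decay * total / (len(owners[x]) * len(owners[y]))
        sim_p, sim_f = new_p, new_f
    return sim_p


# Toy spectra: A and B share two fragments, C shares none with either.
toy_spectra = {
    "A": {100.05, 150.10, 200.20},
    "B": {100.05, 150.10, 250.30},
    "C": {300.10, 310.20},
}
sim = bipartite_simrank(toy_spectra)
```

Here `sim[("A", "B")]` is positive because A and B share fragments, while `sim[("A", "C")]` stays at zero because no path in the graph connects them.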
The simRank stage is not a standalone process but a critical, refining component of the sequential NP-PRESS pipeline. The following diagram illustrates the complete two-stage workflow and the specific role of the simRank module.
Diagram 1: The Two-Stage NP-PRESS Dereplication and Prioritization Pipeline.
This protocol assumes the completion of Stage 1 (FUNEL) processing and the availability of raw LC-MS/MS data files (.mzML or .mzXML format) for both experimental and control samples.
A list of mz and rt (retention time in seconds) values can be supplied to restrict analysis to precursor ions of interest, such as those passing Stage 1 [18].
Before similarity calculation, MS2 spectra undergo merging and cleaning. Key configurable parameters in platforms like simRank-Filter include [18]:
Table 1: Key Pre-processing and Algorithm Parameters for simRank Analysis
| Parameter | Default Value | Function & Impact on Analysis |
|---|---|---|
| Fragment Intensity Threshold | 1% | Fragments with normalized intensity below this value are excluded from comparison, reducing noise [18]. |
| Retention Time Merge Window (ΔRT) | 30 sec | MS2 spectra from the same precursor ion within this RT window are merged to create a consensus spectrum [18]. |
| Precursor Alignment Tolerance | 20 ppm, 0.01 Da | Maximum m/z difference to align precursor ions across runs for control library building [18]. |
| Fragment Alignment Tolerance | 0.01 Da | Maximum m/z difference to consider two fragment ions as identical during spectrum comparison [18]. |
| Minimum Fragments per Spectrum | 5 | Spectra with fewer fragments are considered low-quality and excluded from analysis [18]. |
| Remove Precursor Ion Window | Enabled (17 Da) | Removes fragments close to the precursor m/z (e.g., water/ammonia losses), which are often non-informative [18]. |
| simRank Similarity Threshold | 15 | Critical. Sample spectra with a similarity score below this value against all control spectra are output as novel candidates [18]. |
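Three of the cleaning parameters in the table above (fragment intensity threshold, precursor ion window, minimum fragments per spectrum) can be sketched as a single function. The function name, the order of operations, and the reading of the 17 Da window as "within 17 Da of the precursor m/z" are assumptions of this sketch rather than documented behaviour of any simRank implementation.

```python
def clean_spectrum(peaks, precursor_mz, min_rel_intensity=0.01,
                   precursor_window=17.0, min_fragments=5):
    """Clean one MS2 spectrum; return None if it is too sparse.

    peaks: list of (mz, intensity) tuples.
    Defaults mirror the table: 1% intensity, 17 Da window, 5 fragments.
    """
    # Drop fragments close to the precursor m/z (assumed symmetric window).
    kept = [(mz, inten) for mz, inten in peaks
            if abs(mz - precursor_mz) > precursor_window]
    if not kept:
        return None
    base_peak = max(inten for _, inten in kept)
    if base_peak <= 0:
        return None
    # Normalize to the base peak and drop low-intensity fragments.
    normalized = [(mz, inten / base_peak) for mz, inten in kept
                  if inten / base_peak >= min_rel_intensity]
    # Discard spectra with too few informative fragments.
    if len(normalized) < min_fragments:
        return None
    return sorted(normalized)


toy_peaks = [(100.0, 50), (150.0, 1000), (200.0, 5), (250.0, 300),
             (300.0, 120), (350.0, 80), (395.0, 900)]
cleaned = clean_spectrum(toy_peaks, precursor_mz=400.0)
```

In the toy spectrum, the peak at m/z 395.0 is removed by the precursor window and the peak at m/z 200.0 falls below the 1% threshold, leaving five fragments.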
The core algorithm follows a defined computational workflow, as detailed below.
Diagram 2: The simRank Spectral Comparison and Prioritization Logic.
The primary output is a table of filtered features. The most critical column is the simRank similarity score. Features with scores below the applied threshold represent MS2 spectra not found in the control background and are high-priority targets for downstream isolation and structure elucidation [1] [18].
Table 2: Exemplar Output from NP-PRESS simRank Analysis
| Precursor m/z | Retention Time (s) | simRank Score (vs. Controls) | Status | Proposed Action |
|---|---|---|---|---|
| 487.2564 | 654 | 5.2 | Novel | High Priority: Proceed to isolation |
| 322.1541 | 432 | 78.9 | Known/Dereplicated | Low priority, likely from media/biotic process |
| 601.2987 | 721 | 12.1 | Novel | High Priority: Proceed to isolation |
| 455.2302 | 589 | 92.3 | Known/Dereplicated | Deprioritize |
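The status column in Table 2 follows from applying the default threshold of 15 to each score. A minimal sketch, using the exemplar rows above and a hypothetical helper named `classify`:

```python
SIMRANK_THRESHOLD = 15.0  # default value from the parameter table


def classify(score, threshold=SIMRANK_THRESHOLD):
    """Label a feature by its similarity score against control spectra."""
    return "Novel" if score < threshold else "Known/Dereplicated"


# (precursor m/z, retention time in s, simRank score) from Table 2.
rows = [
    (487.2564, 654, 5.2),
    (322.1541, 432, 78.9),
    (601.2987, 721, 12.1),
    (455.2302, 589, 92.3),
]
statuses = [classify(score) for _, _, score in rows]
```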
The efficacy of the integrated NP-PRESS pipeline, culminating in the simRank filter, has been demonstrated in multiple discovery campaigns [1] [4].
Case Study 1: Streptomyces albus J1074
Application of NP-PRESS guided the discovery of previously overlooked surugamide analogs. The simRank stage was critical in distinguishing their MS2 signatures from the complex metabolic background [1] [4].
Case Study 2: Wukongibacter baidiensis M2B1 (Anaerobic Bacterium)
NP-PRESS analysis led to the discovery of an entirely new family of depsipeptides, named baidienmycins. These compounds exhibited potent antimicrobial and anticancer activities in bioassays. This success underscores the pipeline's power in uncovering novel NPs from underexplored and extremophilic microorganisms [1] [4].
Table 3: Key Reagents and Materials for NP-PRESS simRank Protocol
| Item | Function in Protocol | Specifications & Notes |
|---|---|---|
| LC-MS Grade Solvents | Mobile phase for chromatography. | Acetonitrile, Methanol, Water (with 0.1% Formic Acid). Essential for reproducible retention times and high MS sensitivity. |
| Microbial Growth Media | Culturing experimental and control samples. | Use chemically defined media if possible to simplify the control background. Document all components for reference. |
| Standard QA/QC Compounds | System suitability and calibration. | A mix of known compounds to verify LC-MS performance and mass accuracy before analytical runs. |
| Data Processing Software | Raw data conversion and peak picking. | e.g., MSConvert (ProteoWizard) to generate .mzML files from vendor formats [19]. |
| simRank Implementation Platform | Executing the Stage 2 algorithm. | e.g., simRank-Filter web module or custom Python/R scripts implementing the algorithm [17] [18]. |
| Dereplication Databases | Contextualizing simRank results. | Public (GNPS, NP Analyst [19]) or commercial spectral libraries for additional validation of novelty. |
The discovery of novel natural products (NPs) from microbial sources remains a cornerstone of pharmaceutical development, yet is challenged by the high rate of compound rediscovery and the obscurity of low-abundance metabolites within complex biological extracts [20]. This application note details a targeted methodology for the discovery of new surugamide analogs, a family of bioactive cyclic nonribosomal peptides (NRPs), from Streptomyces species. The protocol is explicitly framed within the methodological context of NP-PRESS, a two-stage MS feature dereplication pipeline [1] [4].
The NP-PRESS strategy addresses a critical gap in metabolomics by systematically removing irrelevant MS features originating from abiotic processes and, more challengingly, biotic processes such as media components and cellular degradation products [1]. By integrating two specialized algorithms—FUNEL for MS1-level feature refinement and simRank for MS2-spectral similarity scoring—NP-PRESS refines crude metabolomes to highlight genuine secondary metabolites [4]. As a proof-of-concept, this pipeline was successfully applied to Streptomyces albus J1074, facilitating the identification of previously overlooked surugamide analogs [1] [4]. This document translates that research into a standardized, detailed protocol for researchers aiming to discover novel derivatives within known natural product families.
Surugamides are a growing family of peptides produced by Streptomyces, primarily characterized by an eight-amino-acid macrocyclic core structure that includes multiple D-amino acid residues [21]. They are biosynthesized by a unique non-ribosomal peptide synthetase (NRPS) gene cluster (surABCD) and cyclized by a dedicated penicillin-binding protein-like thioesterase (PBP-like TE), SurE [22] [23] [24].
The NP-PRESS pipeline is designed to prioritize NP-derived MS signals by removing interfering features in two sequential stages [1] [4].
Stage 1: MS1-Level Filtering with FUNEL
The FUNEL algorithm processes untargeted LC-HRMS data to remove mass features associated with cultivation media, primary metabolites, and common laboratory contaminants. It employs blank subtraction, isotopic pattern recognition, and heuristic rules based on the typical physicochemical properties of secondary metabolites to drastically reduce dataset complexity before MS2 analysis [1] [4].
Stage 2: MS2-Level Prioritization with simRank
The simRank algorithm analyzes the MS/MS spectra of the refined feature list. It computes spectral similarity scores against curated databases of known natural product spectra (e.g., GNPS). Critically, it prioritizes features that show high similarity to a known NP family (indicating structural relatedness) but are not exact matches, thereby flagging potential novel analogs like new surugamides for isolation [1] [4].
This protocol combines strain selection, culture elicitation, NP-PRESS-based analysis, and targeted isolation.
Objective: To activate the silent sur BGC and maximize surugamide analog production [26] [27].
Objective: Generate high-quality MS1 and MS2 data for dereplication.
Objective: Dereplicate known compounds and prioritize novel surugamide analogs.
Objective: Physically isolate and determine the structure of prioritized analogs.
Table 1: Recently Discovered Surugamide Analogs and Their Properties
| Analog Name | Producing Strain | Molecular Formula | Key Structural Feature | Reported Bioactivity (IC50/EC50) | Citation |
|---|---|---|---|---|---|
| Acyl-Surugamide A2 | S. albidoflavus RKJM-0023 | C₅₀H₈₃N₉O₉ | N-ε-acetyl-L-lysine residue | Antifungal (data pending) | [21] |
| Acyl-Surugamide A3 | Streptomyces sp. CMB-M0112 | Not specified | Acylated lysine derivative | Anthelmintic vs. D. immitis: 3.3 µg/mL | [25] |
| Surugamide K | Streptomyces sp. CMB-MRB032 | Not specified | N-methylated analog | Inactive vs. D. immitis (>25 µg/mL) | [25] |
| Acyl-Surugamide AS3 (semi-synthetic) | Derivatized from Surugamide A | Not specified | Synthetic acylation | Anthelmintic vs. D. immitis: 3.4 µg/mL | [25] |
Table 2: Elicitation Effect on Surugamide Production in S. albus J1074
| Cultivation Condition | Relative Production of Surugamides (vs. Control) | Key Elicitor/Medium | Citation |
|---|---|---|---|
| Standard Medium (TSB) | Low/Basal (Repressed) | N/A | [26] |
| YD Medium | >13-fold increase (in marine strain SM17) | Marine strain vs. terrestrial J1074 | [26] |
| Chemical Elicitation (Ivermectin) | Up to 5-fold induction of sur BGC expression | HiTES screening | [27] |
| Chemical Elicitation (Etoposide) | Up to 5-fold induction of sur BGC expression | HiTES screening | [27] |
The biosynthesis of surugamides involves a unique NRPS assembly line and a dedicated cyclase.
Table 3: Key Research Reagent Solutions for Surugamide Discovery
| Item | Function/Description | Example/Application in Protocol |
|---|---|---|
| BFM15m / SYP-NaCl Media | Cultivation media that enhance surugamide production in marine Streptomyces strains [21] [26]. | Used in OSMAC cultivation (Protocol 4.1). |
| Ivermectin & Etoposide Elicitors | Chemical inducers of silent biosynthetic gene clusters. Act via pathway-specific repression relief/SOS response [27]. | Used in chemical elicitation (Protocol 4.1). |
| Ethyl Acetate (EtOAc) | Organic solvent for broad-spectrum metabolite extraction from acidified culture broth. | Used in harvest and extraction (Protocol 4.1). |
| C18 Reversed-Phase HPLC Columns | Standard for peptide separation based on hydrophobicity. Critical for analytical profiling and purification. | Used in LC-HRMS (4.2) and Targeted Purification (4.4). |
| Marfey's Reagent (FDAA) | Chiral derivatizing agent for determining the absolute configuration (L/D) of amino acids after hydrolysis. | Used in Structural Characterization (Protocol 4.4). |
| N-Acetylcysteamine (SNAC) Thioester | Synthetic mimic of the peptidyl carrier protein (PCP)-bound thioester intermediate. | Used in in vitro studies of SurE cyclase activity [22]. |
| NP-PRESS Software Pipeline | Custom algorithms (FUNEL, simRank) for two-stage MS data dereplication and novel NP prioritization [1] [4]. | Core of data processing (Protocol 4.3). |
| GNPS Molecular Networking Platform | Public web-based platform for MS/MS spectral similarity networking and database comparison [21] [20]. | Used for orthogonal validation (Protocol 4.3). |
This application note details the successful integration of the NP-PRESS two-stage MS dereplication strategy for the discovery of novel depsipeptides from the anaerobic, extremophilic bacterium Wukongibacter baidiensis. The NP-PRESS strategy utilizes newly developed MS1 (FUNEL) and MS2 (simRank) algorithms to effectively remove interfering signals from abiotic and biotic processes, enabling the prioritization of low-yield, hard-to-detect natural products. As a proof-of-concept, this approach guided the isolation and characterization of the baidienmycin family, a new class of depsipeptides exhibiting potent antimicrobial and anticancer activities [1]. This study underscores the efficacy of targeted dereplication in unlocking the bioactive potential of under-explored extremophilic bacteria within the context of modern natural product drug discovery.
The rediscovery of known compounds remains a critical bottleneck in natural product (NP)-based drug discovery. Mass spectrometry (MS) is a powerful discovery tool, but its utility is often hampered by the overwhelming complexity of microbial extracts, where signals from true secondary metabolites are obscured by a background of interfering features derived from culture media, cellular degradation products, and other biotic processes [1]. This challenge is particularly acute when investigating unusual or extremophilic bacteria, such as Wukongibacter baidiensis, which thrive in harsh environments like hydrothermal vents and are promising sources of novel chemistry [28] [29].
This application note is framed within the broader thesis research on the NP-PRESS two-stage MS feature dereplication strategy. The core thesis posits that a systematic, algorithm-driven filtering of MS data can dramatically improve the efficiency of novel NP discovery. NP-PRESS operationalizes this by implementing two sequential filtering stages: first, the FUNEL algorithm cleans MS1 data by removing features associated with common biochemical building blocks and known noise patterns; second, the simRank algorithm analyzes MS2 fragmentation spectra to cluster and rank features based on structural novelty compared to dereplication libraries [1]. The case study presented here—the discovery of baidienmycins from W. baidiensis—serves as a critical validation of this thesis, demonstrating its practical application and effectiveness in a real-world discovery pipeline targeting depsipeptides, a class of compounds with proven therapeutic potential [30] [31] [32].
Wukongibacter baidiensis is an anaerobic, Gram-stain-positive, spore-forming bacterium first isolated from mixed hydrothermal sulfide samples collected from a deep-sea vent [28].
The NP-PRESS strategy is designed to overcome the signal-to-noise problem in LC-MS-based metabolomics. Its two-stage workflow is summarized in the table below and visualized in Figure 1.
Table 1: The Two-Stage NP-PRESS Dereplication Workflow
| Stage | Algorithm | Data Level | Primary Function | Key Action |
|---|---|---|---|---|
| Stage 1 | FUNEL | MS1 (Precursor Ion) | Filtering & Cleanup | Removes mass features corresponding to ubiquitous biochemical building blocks, media components, and known biotic interference patterns. |
| Stage 2 | simRank | MS2 (Fragmentation) | Dereplication & Prioritization | Compresses MS2 spectra into loss/decomposition vectors, clusters them via similarity ranking, and flags clusters with no match to known compound libraries as high-priority for novel NPs. |
Figure 1: The NP-PRESS Two-Stage MS Dereplication Workflow. This diagram illustrates the sequential application of the FUNEL (Stage 1) and simRank (Stage 2) algorithms to filter complex LC-MS data and prioritize novel natural product candidates [1].
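The "loss vector" idea mentioned for Stage 2 in Table 1 can be illustrated by converting a fragment list into neutral losses relative to the precursor. This is one plausible reading of that description, offered as an assumption; the function name and the two-decimal rounding are hypothetical and are not taken from the NP-PRESS source.

```python
def neutral_loss_vector(precursor_mz, fragment_mzs, decimals=2):
    """Return sorted, de-duplicated neutral losses (precursor minus fragment).

    Fragments at or above the precursor m/z are ignored.
    """
    losses = {round(precursor_mz - mz, decimals)
              for mz in fragment_mzs if mz < precursor_mz}
    return sorted(losses)


# Toy values for illustration only.
losses = neutral_loss_vector(500.30, [482.29, 387.22, 500.30, 120.08])
```

Two spectra with different precursor masses but the same set of losses would then look alike in loss space, which is one way structurally related analogs can be grouped.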
4.1 Experimental Workflow
4.2 Key Outcomes & Biological Activities
The application of NP-PRESS led directly to the efficient discovery of the baidienmycin family. Preliminary biological evaluation revealed significant activities, as summarized below.
Table 2: Biological Activity Profile of Baidienmycins from W. baidiensis
| Activity Assay | Target / Cell Line | Reported Potency | Significance |
|---|---|---|---|
| Antimicrobial | Panel of bacterial pathogens | Potent activity | Indicates potential as a new antibiotic scaffold, crucial in the AMR crisis [1] [30]. |
| Anticancer | Panel of human cancer cell lines | Potent activity | Suggests potential for development as anticancer agents [1] [31]. |
5.1 Protocol 1: NP-PRESS Data Analysis for Depsipeptide Prioritization
5.2 Protocol 2: Fermentation & Targeted Fractionation of W. baidiensis
Table 3: Essential Materials for Depsipeptide Discovery from Unusual Bacteria
| Item | Function & Specification | Application in Protocol |
|---|---|---|
| Specialized Culture Media | Anaerobic broth (e.g., Marine Broth 2216), pre-reduced, with specific salinity (30-40 g/L sea salts) and pH 8.0 buffer. | Cultivation of fastidious anaerobic extremophiles like W. baidiensis [28]. |
| Anaerobic Chamber or Jars | System to maintain an oxygen-free atmosphere (e.g., N₂:CO₂:H₂, 80:10:10). | Essential for inoculating, transferring, and growing strict anaerobes. |
| High-Resolution LC-MS/MS System | Instrument capable of data-dependent acquisition (DDA) or data-independent acquisition (DIA), e.g., UPLC-QTOF or UPLC-Orbitrap. | Generation of high-quality MS1 and MS2 spectra for NP-PRESS analysis [1]. |
| Dereplication Libraries | Digital spectral databases: Public (GNPS) and in-house curated libraries of known NPs and depsipeptides. | Reference for simRank algorithm to identify known compounds and highlight novelty [1] [30]. |
| Preparative HPLC System | System with C18 column, UV-Vis/DAD detector, and automated fraction collector. | Isolation of target compounds, guided by NP-PRESS output, in quantities sufficient for structure elucidation and bioassays. |
| NMR Solvents (Deuterated) | High-purity solvents: DMSO-d6, Methanol-d4, CDCl3. | Structure elucidation of purified novel depsipeptides. |
Depsipeptides like baidienmycins are typically biosynthesized by multi-modular enzymatic complexes known as non-ribosomal peptide synthetases (NRPS), often with hybrid polyketide synthase (PKS) components [30] [31]. A generalized NRPS/PKS pathway is illustrated below.
Figure 2: Generalized NRPS/PKS Biosynthetic Pathway for Depsipeptides. This diagram outlines the key enzymatic steps in assembling cyclic depsipeptides, involving substrate activation, sequential condensation, and final macrocyclization [30] [31].
This case study validates the NP-PRESS two-stage MS dereplication strategy as a powerful framework for thesis research and applied natural product discovery. By systematically eliminating analytical noise and prioritizing true novelty, it enables researchers to efficiently probe "difficult" sources like extremophilic bacteria. The discovery of the bioactive baidienmycins from Wukongibacter baidiensis serves as a compelling model for future efforts aimed at mining the unique chemical space encoded by unusual microorganisms, accelerating the identification of novel depsipeptides and other lead compounds for therapeutic development.
The discovery of novel natural products (NPs) from microbial sources is pivotal for pharmaceutical development, yet it is hampered by the high complexity of microbial metabolomes and the resource-intensive nature of traditional bioassay-guided isolation [33] [4]. A significant challenge lies in the mass spectrometry (MS) data, where signals from novel NPs are often obscured by a vast number of interfering features originating from abiotic sources, culture media, and microbial processed products [1] [4]. This "chemical noise" leads to inefficient resource allocation and missed discoveries.
This application note is framed within the broader research thesis on the two-stage MS feature dereplication strategy, NP-PRESS [1] [4]. NP-PRESS addresses the core dereplication challenge by implementing a metabolome-refining pipeline designed to systematically remove irrelevant chemical features and prioritize those most likely to be novel secondary metabolites. The strategy employs two key algorithms: FUNEL for MS1-level filtering of non-NP features and simRank for MS2-level spectral networking and novelty scoring [4].
Here, we detail the extension and successful application of this NP-PRESS workflow to the mining of NPs from mangrove-derived Streptomyces, an exceptionally promising but metabolomically complex source. Mangrove ecosystems are biodiversity hotspots, and their unique environmental pressures (e.g., high salinity, low oxygen) drive microbes like Streptomyces to produce diverse and bioactive secondary metabolites [34]. Demonstrating efficacy in this challenging context validates NP-PRESS as a robust strategy for accelerating NP discovery from complex environmental microbiomes.
2.1. Proof-of-Concept and Validation
The NP-PRESS pipeline was initially validated using the model strain Streptomyces albus J1074, where it successfully facilitated the identification of new surugamide analogs [1] [4]. Its performance was further demonstrated on the unusual anaerobic bacterium Wukongibacter baidiensis M2B1, leading to the discovery of the new, bioactive depsipeptide family, baidienmycins [1]. These successes established NP-PRESS's capability to uncover novel metabolites from diverse bacterial sources by effectively differentiating NP signals from complex background interference.
2.2. Direct Application: Mining Mangrove Streptomyces speibonae W307
The NP-PRESS strategy was directly applied to Streptomyces speibonae W307, isolated from a mangrove environment [33]. The two-stage dereplication process was critical for managing the metabolomic complexity of this strain.
This targeted analysis guided the isolation efforts toward a specific cluster of unknown features, culminating in the identification of three new natural products, strepyrazinones A, B, and C [33]. Structural elucidation via HR-MS and NMR, coupled with ECD calculations for configuration determination, confirmed that two of these compounds possess entirely new skeletons [33]. This case study concretely demonstrates how NP-PRESS extends the discovery workflow by providing a rational, data-driven prioritization scheme that directly leads to the isolation of novel chemical entities.
2.3. Corroborative Genomics-Metabolomics Workflow
Complementary studies on mangrove-derived Streptomyces highlight the synergy of genomics with metabolomics, a philosophy aligned with NP-PRESS's data-centric approach. For instance, whole-genome sequencing of Streptomyces murinus THV12 revealed a significant biosynthetic potential, with 47 secondary metabolite biosynthetic gene clusters (smBGCs), representing 17.9% of its 8.3 Mb genome [35]. Concurrent LC-HR-MS/MS metabolomics under OSMAC (One Strain Many Compounds) cultivation conditions detected a range of metabolites, including actinomycin D and cinnabaramide A, validating the expression of these genomic potentials [35]. This combined strategy mirrors the preparatory and investigative steps that make NP-PRESS application effective, by first identifying a strain of high potential and then applying focused metabolomic dereplication.
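The genome figures quoted for S. murinus THV12 can be turned into absolute sizes with simple arithmetic. The values below are the ones stated in the text; the derived numbers are rough estimates that assume the 17.9% figure refers to total sequence length occupied by the 47 clusters.

```python
genome_size_mb = 8.3     # genome size from the text, in Mb
bgc_fraction = 0.179     # share of the genome occupied by smBGCs
bgc_count = 47           # number of predicted smBGCs

# Total sequence assigned to smBGCs, in Mb.
bgc_total_mb = genome_size_mb * bgc_fraction
# Average cluster length, in kb (1 Mb = 1000 kb).
bgc_mean_kb = bgc_total_mb * 1000 / bgc_count
```

This gives roughly 1.49 Mb of cluster sequence and an average cluster length of about 31.6 kb.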
Table 1: Summary of NP Discovery from Mangrove-Derived Streptomyces Using Advanced Strategies
| Strain | Source | Key Strategy | Major Findings | Reference |
|---|---|---|---|---|
| Streptomyces speibonae W307 | Mangrove environment | NP-PRESS dereplication pipeline | Isolation of three strepyrazinones (A-C), two with new structures. | [33] |
| Streptomyces murinus THV12 | Mangrove sediment | Combined genomics & metabolomics | Genome harbors 47 smBGCs. Metabolomics detected actinomycin D, pentamycin, etc. | [35] |
| Streptomyces sp. (Various) | Mangrove sediments (Review) | Traditional bioassay-guided fractionation | Catalog of 519 NPs (70% bioactive), including piericidins, azalomycins, etc. | [34] |
3.1. Protocol 1: NP-PRESS Dereplication Workflow for LC-MS/MS Data
This protocol details the computational steps for implementing the two-stage NP-PRESS strategy [1] [4].
3.2. Protocol 2: Integrated Genomics & Metabolomics for Strain Prioritization
This protocol outlines a complementary approach to identify high-potential mangrove Streptomyces strains for NP-PRESS analysis [35].
3.3. Protocol 3: Isolation and Characterization of Prioritized Compounds
Diagram 1: The NP-PRESS Two-Stage Dereplication Pipeline [1] [4]
Diagram 2: Integrated Strategy for Strain Prioritization and Discovery
Table 2: Key Research Reagents and Solutions for Mangrove Streptomyces NP Mining
| Item/Category | Function/Application | Example/Note |
|---|---|---|
| Selective Isolation Media | Favors growth of actinomycetes from complex mangrove sediment. | Actinomycetes Isolation Agar (AIA), ISP media supplemented with nalidixic acid and cycloheximide [35]. |
| OSMAC Elicitors | To activate silent biosynthetic gene clusters by varying cultivation parameters. | Different carbon/nitrogen sources, salts, enzyme inhibitors, or co-culture with other microbes [35]. |
| LC-HR-MS/MS System | High-resolution metabolomic profiling for dereplication and compound detection. | Systems like UPLC coupled to Q-TOF or Orbitrap mass spectrometers are standard [35] [36]. |
| Genome Mining Software | In silico prediction of secondary metabolite potential from genome sequence. | antiSMASH: Primary tool for BGC identification and analysis [35]. |
| Dereplication Platforms | Computational analysis of MS data for rapid compound identification. | GNPS (Global Natural Products Social): For molecular networking and library matching [36]. NP-PRESS: For specialized two-stage MS feature filtering [1]. |
| Chromatography Resins | Fractionation and purification of target metabolites from crude extract. | Solid-phase extraction (SPE) cartridges, and preparative HPLC columns (C18, silica gel). |
| NMR Solvents | Solubilizing purified compounds for structural elucidation. | Deuterated solvents (e.g., DMSO-d6, CDCl3, CD3OD). |
Abstract
This application note details the critical parameter tuning of the FUNEL and simRank algorithms within the NP-PRESS pipeline. NP-PRESS is a two-stage mass spectrometry (MS) feature dereplication strategy designed to uncover novel natural products (NPs) by removing overwhelming irrelevant features from microbial metabolomes, particularly those originating from biotic processes [1] [4]. The core innovation lies in the stepwise application of FUNEL for MS1-level feature refinement and simRank for MS2-level spectral prioritization [4]. Precise calibration of these algorithms is paramount, as it governs the essential trade-off between sensitivity (discovering true novel NPs) and specificity (rejecting known or irrelevant compounds). This document provides a structured framework, experimental protocols, and practical guidelines for researchers to optimize these parameters, thereby maximizing the efficacy of novel bioactive compound discovery in projects such as the study of Streptomyces albus J1074 and the anaerobic bacterium Wukongibacter baidiensis M2B1 [1].
The discovery of novel natural products (NPs) from microbial sources is a cornerstone of pharmaceutical development. However, a major bottleneck is the sheer complexity of metabolomic data, where signals from novel, often low-abundance NPs are obscured by a vast excess of features from culture media, cellular degradation products, and known metabolites [1] [4]. Traditional dereplication methods struggle to differentiate true NP signals from this biotic interference, leading to costly and fruitless isolation efforts [4].
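The NP-PRESS pipeline addresses this by implementing a rigorous two-stage filtering strategy [4]: FUNEL first refines features at the MS1 level, and simRank then prioritizes the surviving features by their MS2 spectra.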
The sequential application of FUNEL and simRank creates a powerful gating mechanism. The performance of the entire NP-PRESS pipeline is critically dependent on the parameter settings for each stage, which directly control the balance between sensitivity and specificity.
Optimal performance of NP-PRESS is achieved not by maximizing either sensitivity or specificity in isolation, but by tuning parameters to find an optimal balance suitable for the research goal. The following table summarizes the key tunable parameters for each algorithm and their effect on the discovery workflow.
Table 1: Critical Parameters for FUNEL and simRank Algorithms in NP-PRESS
| Algorithm | Core Parameter | Effect on SENSITIVITY | Effect on SPECIFICITY | Recommended Tuning Strategy |
|---|---|---|---|---|
| FUNEL (MS1) | Mass Tolerance Window | Increases: Wider windows retain more true NPs with slight m/z deviations. | Decreases: Wider windows admit more unrelated interfering features. | Start with instrument accuracy (e.g., ±5 ppm). Widen slightly for complex samples or unknown adducts. |
| FUNEL (MS1) | Retention Time Tolerance | Increases: Liberal RT windows accommodate shifts from matrix effects. | Decreases: Liberal RT windows increase chance of co-eluting interference. | Define based on chromatographic reproducibility (e.g., ±0.1 min). Tighten for high-resolution separations. |
| FUNEL (MS1) | Blank Subtraction Threshold | Increases: Lower fold-change cutoffs let more features survive subtraction, including true NPs that also appear at trace levels in blanks. | Decreases: Lower fold-change cutoffs also let more background features through. | Use fold-change (e.g., ≥10x intensity in sample vs. blank) and visually inspect EICs for key features. |
| simRank (MS2) | Spectral Similarity Score Cutoff | Increases: Lower score thresholds retain more spectra, including weak matches to knowns. | Decreases: Lower thresholds populate networks with false connections, diluting novel clusters. | Set initial cutoff at 0.7 (cosine score). Adjust based on library quality; increase for cleaner networks. |
| simRank (MS2) | Minimum Matched Fragment Ions | Increases: Lower minimum count retains spectra with poor fragmentation. | Decreases: Lower count increases false-positive spectral matches. | Require ≥4-6 matched fragment ions for high-confidence dereplication. |
| simRank (MS2) | Maximum Cluster Size | Increases: Larger clusters group more related analogs, capturing diversity. | Decreases: Very large clusters can become noisy, obscuring novel scaffold outliers. | Monitor cluster distribution; break apart clusters exceeding 20-30 nodes for manual review. |
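To make the three FUNEL-stage parameters in the table concrete, the sketch below applies a mass tolerance, a retention time tolerance, and a blank fold-change cutoff to a toy feature list. It is an illustration only and not the published FUNEL algorithm: the `Feature` class, the `filter_against_blank` function, and all example values are hypothetical, and the defaults simply reuse the starting points suggested in the table (±5 ppm, ±0.1 min, ≥10x).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    mz: float         # precursor m/z
    rt: float         # retention time in minutes
    intensity: float  # peak height or area


def ppm_error(mz_a: float, mz_b: float) -> float:
    """Signed mass difference of mz_a relative to mz_b, in ppm."""
    return (mz_a - mz_b) / mz_b * 1e6


def filter_against_blank(sample, blank, ppm_tol=5.0, rt_tol=0.1, min_fold=10.0):
    """Keep sample features that have no matching blank feature, or that are
    at least `min_fold` times more intense than every matching blank feature."""
    kept = []
    for f in sample:
        matches = [
            b for b in blank
            if abs(ppm_error(f.mz, b.mz)) <= ppm_tol and abs(f.rt - b.rt) <= rt_tol
        ]
        # all() over an empty list is True, so unmatched features are kept.
        if all(f.intensity >= min_fold * b.intensity for b in matches):
            kept.append(f)
    return kept


# Hypothetical example data (not from the source).
sample = [
    Feature(500.2500, 5.00, 1e6),  # 20x above its blank match: kept
    Feature(301.1410, 3.20, 2e5),  # only 2x above its blank match: removed
    Feature(750.4000, 8.10, 5e4),  # no blank match: kept
]
blank = [Feature(500.2510, 5.05, 5e4), Feature(301.1412, 3.22, 1e5)]

kept = filter_against_blank(sample, blank)
print([f.mz for f in kept])
```

Widening `ppm_tol` or `rt_tol` makes more blank features count as matches, which removes more sample features; lowering `min_fold` lets more features survive. Both directions mirror the sensitivity and specificity effects listed in the table.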
This protocol outlines the prerequisite steps for generating high-quality data suitable for FUNEL and simRank analysis [4] [11].
Sample Preparation & LC-MS/MS Acquisition:
Data Pre-processing:
This protocol uses characterized samples to establish baseline FUNEL parameters before analyzing novel strains.
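One way to read "establishing baseline parameters with characterized samples" is as a small parameter sweep scored against features whose identity is already known. The sketch below computes sensitivity and specificity for several blank fold-change cutoffs. The function names, feature IDs, and fold-change values are hypothetical, and the approach is an assumption about how such a calibration could be organized rather than a documented NP-PRESS procedure.

```python
def sensitivity_specificity(retained, true_nps, all_features):
    """All arguments are sets of feature IDs.

    Sensitivity = retained true NPs / all true NPs.
    Specificity = rejected non-NP features / all non-NP features.
    """
    tp = len(retained & true_nps)
    fn = len(true_nps - retained)
    negatives = all_features - true_nps
    tn = len(negatives - retained)
    fp = len(negatives & retained)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def sweep_fold_change(fold_by_feature, true_nps, cutoffs):
    """For each cutoff, retain features whose sample/blank fold change meets it."""
    all_features = set(fold_by_feature)
    results = {}
    for cutoff in cutoffs:
        retained = {fid for fid, fold in fold_by_feature.items() if fold >= cutoff}
        results[cutoff] = sensitivity_specificity(retained, true_nps, all_features)
    return results


# Hypothetical calibration set: two spiked standards and three background features.
fold_by_feature = {"std1": 50.0, "std2": 8.0, "bg1": 12.0, "bg2": 1.5, "bg3": 3.0}
true_nps = {"std1", "std2"}

results = sweep_fold_change(fold_by_feature, true_nps, cutoffs=[2, 10, 20])
for cutoff, (sens, spec) in results.items():
    print(f"cutoff {cutoff:>2}x  sensitivity {sens:.2f}  specificity {spec:.2f}")
```

In this toy set the strictest cutoff reaches full specificity but loses one of the two standards, which is the trade-off the tuning table describes.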
This protocol focuses on tuning simRank to highlight spectral novelty after FUNEL pre-filtering.
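The tuning table refers to a cosine score cutoff and a minimum number of matched fragment ions. The sketch below shows how those two parameters interact using a plain greedy cosine score between two centroided spectra. It is a generic illustration, not the simRank scoring scheme itself; the function names, the 0.02 Da fragment tolerance, and the example spectra are assumptions.

```python
import math


def cosine_score(spec_a, spec_b, frag_tol=0.02):
    """Greedy cosine similarity between two spectra given as [(mz, intensity), ...].

    Returns (score, n_matched). Each peak is used in at most one match.
    """
    # Collect every candidate peak pair whose m/z values agree within tolerance.
    pairs = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            if abs(mz_a - mz_b) <= frag_tol:
                pairs.append((int_a * int_b, i, j))
    # Assign the pairs with the largest intensity product first.
    pairs.sort(reverse=True)
    used_a, used_b = set(), set()
    dot = 0.0
    for product, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            dot += product
    norm_a = math.sqrt(sum(inten * inten for _, inten in spec_a))
    norm_b = math.sqrt(sum(inten * inten for _, inten in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    return dot / (norm_a * norm_b), len(used_a)


def is_library_match(score, n_matched, score_cutoff=0.7, min_matched=4):
    """A match must pass both the score cutoff and the matched-ion minimum."""
    return score >= score_cutoff and n_matched >= min_matched


# Hypothetical spectra (not from the source).
library = [(100.0, 10.0), (150.0, 50.0), (200.0, 100.0), (250.0, 20.0)]
query = [(100.01, 12.0), (150.0, 45.0), (200.01, 100.0), (250.0, 25.0), (300.0, 5.0)]

score, n_matched = cosine_score(query, library)
print(round(score, 3), n_matched, is_library_match(score, n_matched))
```

Raising `min_matched` to 6 would reject this pair despite its high score, which shows why the two thresholds are tuned together.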
Table 2: Essential Research Reagents and Solutions
| Item | Specification / Recommended Product | Function in NP-PRESS Workflow |
|---|---|---|
| UHPLC-HRMS System | Q-TOF or Orbitrap mass spectrometer with nanoflow or conventional UHPLC. | Generates high-resolution MS1 and MS2 data essential for accurate feature detection and spectral matching [11]. |
| Chromatography Column | Reversed-phase C18 column (e.g., 2.1 x 150 mm, 1.8 μm). | Provides the compound separation necessary for resolving complex metabolomes and obtaining pure MS2 spectra [11]. |
| Data Processing Software | MZmine, MS-DIAL, or similar open-source platforms. | Performs feature detection, alignment, blank subtraction, and exports data in formats compatible with FUNEL/simRank [11]. |
| Molecular Networking Platform | Global Natural Products Social Molecular Networking (GNPS). | Provides the computational environment and public spectral libraries to execute and visualize simRank-based molecular networks [11]. |
| Chemical Standards | Authentic standards of expected metabolite classes. | Serves as positive controls for validating LC-MS performance and tuning FUNEL parameters (Protocol B). |
The NP-PRESS strategy, powered by the sequential application of FUNEL and simRank, represents a significant advance in metabolomic dereplication by systematically removing biotic interference [1] [4]. Its success is intrinsically linked to the deliberate tuning of algorithm parameters, a process that governs the critical sensitivity-specificity equilibrium. The frameworks and protocols provided here offer a practical roadmap for researchers to calibrate the NP-PRESS pipeline for their specific systems. By methodically applying Protocol B to establish robust FUNEL filters and Protocol C to optimize simRank for novelty detection, drug discovery professionals can significantly enhance their probability of isolating previously undiscovered natural products with potent biological activities, as demonstrated by the discovery of the baidienmycins [4].
The discovery of novel Natural Products (NPs) remains a cornerstone of pharmaceutical development, particularly in the urgent fight against antimicrobial resistance (AMR) and cancer [39]. However, researchers face a dual challenge of biological and analytical complexity. Biologically, the most promising NPs are often produced under specific conditions: by extremophiles thriving in unique geological niches or from silent biosynthetic gene clusters (BGCs) that are not expressed in standard laboratory settings [40] [41]. Analytically, the mass spectrometry (MS) data from such experiments is immensely complex, filled with interfering signals from media, cellular debris, and primary metabolism that can obscure the target NPs [1].
This article presents integrated Application Notes and Protocols framed within the context of the NP-PRESS research thesis [1]. We detail how the NP-PRESS two-stage MS dereplication strategy directly addresses data complexity, enabling researchers to confidently prioritize novel features from challenging samples like extremophile extracts or activated silent BGCs. The following sections provide a detailed workflow, from experimental design and sample preparation to data analysis and compound prioritization, equipping scientists with a robust framework for next-generation NP discovery.
The NP-PRESS strategy is engineered to filter out overwhelming irrelevant MS features and highlight true, novel natural products [1]. It operates through two sequential computational stages applied to LC-MS/MS data.
Stage 1: FUNEL (Filtering Using Neutral Loss) Algorithm
Stage 2: simRank Similarity Scoring Algorithm
Table 1: NP-PRESS Performance Metrics in Proof-of-Concept Studies
| Microbial Strain | Sample Type / Challenge | Key NP-PRESS Outcome | Identified Novel Compounds |
|---|---|---|---|
| Streptomyces albus J1074 [1] | Model actinobacterium; complex metabolite background. | Prioritized features from the silent sur BGC after elicitation. | New surugamide analogs [1]. |
| Wukongibacter baidiensis M2B1 [1] | Unusual anaerobic extremophile; high interference. | Enabled discovery of a new compound family from a prioritized, unknown feature. | Baidienmycins (new depsipeptides with antimicrobial & anticancer activity) [1]. |
Diagram 1: The NP-PRESS Two-Stage MS Dereplication Workflow.
3.1 Rationale and Hypothesis
Extreme environments (deep-sea vents, acid mine lakes, hypersaline basins) exert intense geochemical pressures that drive the evolution of unique microbial secondary metabolism [40]. The "Extremophile Hypothesis" posits that these conditions foster biochemical novelty, making extremophiles prime sources for new antibiotic and anticancer scaffolds [40]. However, their extracts are analytically challenging due to high salt content, unusual media, and potent primary metabolites that interfere with MS detection of rare NPs.
3.2 Protocol: From Sample to Prioritized Compound
Step 1: Sample Collection & Strain Isolation
Step 2: Small-Scale Fermentation & Elicitation
Step 3: Metabolite Extraction for LC-MS
Step 4: LC-HRMS/MS Data Acquisition & NP-PRESS Analysis
Table 2: Bioactive Compound Specialization in Extreme Environments [40]
| Geological Niche | Dominant Stressors | Adaptive Strategy | Associated Bioactive Compound Classes |
|---|---|---|---|
| Deep-Sea Hydrothermal Vents | High pressure, temperature gradients, heavy metals. | Thermostable/protective molecules, metal chelators. | Potent antimicrobial peptides (e.g., Marthiapeptide A), anticancer polyketides [40]. |
| Acid Mine Drainage Lakes | Extreme acidity (pH <3), toxic metal ions (As, Cu). | Metal efflux pumps, intracellular pH buffering. | Novel meroterpenoids, lactones with anti-inflammatory activity (e.g., Berkeleylactone A) [40] [42]. |
| Hypersaline Lakes | High osmotic pressure, ionic stress. | "Salt-in" or compatible solute synthesis. | Bacterioruberin carotenoids, extremolytes, halocins (antimicrobial peptides) [40]. |
Diagram 2: Linking Geological Stress to NP Specialization in Extremophiles.
4.1 The Silent Cluster Problem
Genomic sequencing reveals that prolific microbes harbor 5-10 times more Biosynthetic Gene Clusters (BGCs) than they produce compounds under lab conditions [41]. These "silent" or "cryptic" clusters represent the greatest untapped reservoir of NP diversity. Activation is required, followed by efficient dereplication to identify novel products amidst newly expressed metabolites.
4.2 Protocol: Activation and Targeted Analysis
Step 1: Genetic Activation via Promoter Engineering
Step 2: Chemogenetic Activation via HiTES (High-Throughput Elicitor Screening)
Step 3: Heterologous Expression via FAC (Fungal Artificial Chromosome)
Step 4: Metabolite Analysis with NP-PRESS
Table 3: Strategies for Activating Silent Biosynthetic Gene Clusters
| Strategy | Core Principle | Key Advantage | Example Outcome |
|---|---|---|---|
| CRISPR-Cas9 Promoter Insertion [41] | Replace native promoter with a strong, constitutive one. | Precise, genetic; leads to consistent, high-level production. | Production of alteramides, FR-900098, and novel pigments in Streptomyces [41]. |
| HiTES (High-Throughput Elicitor Screening) [41] | Screen small molecules for ability to induce a BGC reporter. | Uncovers natural ecological signals; no genetic modification needed. | Identified ivermectin/etoposide as elicitors of the sur cluster, yielding novel surugamides [41]. |
| FAC Heterologous Expression [42] | Capture & express entire BGC in a tractable surrogate host. | Bypasses host regulatory networks; ideal for non-model/ extremophile fungi. | Activated 10 BGCs from Penicillium spp., yielding 14 compounds including novel citreohybriddional [42]. |
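For the HiTES row above, the screening idea (test many small molecules and look for induction of a BGC reporter) can be sketched as a simple hit-calling step. The code below scores each compound well against vehicle-only control wells with a z-score. This is a generic screening convention assumed for illustration; the function name, the cutoff of 3, the compound labels, and all signal values are hypothetical and are not taken from the HiTES literature.

```python
from statistics import mean, stdev


def call_elicitor_hits(reporter_signal, vehicle_wells, z_cutoff=3.0):
    """Return {compound: z-score} for compounds whose reporter signal exceeds
    the vehicle-control mean by at least `z_cutoff` standard deviations."""
    mu = mean(vehicle_wells)
    sd = stdev(vehicle_wells)
    if sd == 0:
        raise ValueError("vehicle controls show no variance; z-score undefined")
    hits = {}
    for compound, signal in reporter_signal.items():
        z = (signal - mu) / sd
        if z >= z_cutoff:
            hits[compound] = round(z, 2)
    return hits


# Hypothetical plate data: reporter signal in arbitrary units.
vehicle_wells = [100, 110, 90, 105, 95]
reporter_signal = {"cpd1": 104, "cpd2": 180, "cpd3": 95}

hits = call_elicitor_hits(reporter_signal, vehicle_wells)
print(hits)
```

Extracts from wells called as hits would then be the ones carried forward to LC-MS/MS and NP-PRESS analysis in Step 4.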
Table 4: Key Research Reagent Solutions for NP Discovery from Complex Sources
| Reagent / Material | Function / Purpose | Application Context |
|---|---|---|
| HP-20 / XAD Resins | Hydrophobic adsorption resin for in situ capture of metabolites from fermentation broth. | Pre-concentrates NPs; removes salts & water-soluble interferents—critical for extremophile broths [40]. |
| Elicitor Library (e.g., NCI Diversity Set) | A collection of structurally diverse small molecules used to probe BGC induction. | HiTES screening for silent BGC activation in reporter strains [41]. |
| CRISPR-Cas9 Plasmid System for Actinomycetes | All-in-one vector for sgRNA expression and Cas9 protein production in Streptomyces. | Genetic activation of silent BGCs via promoter knock-in [41]. |
| FAC (Fungal Artificial Chromosome) Vector | High-capacity cloning vector (100-300 kb) for capturing entire fungal BGCs. | Heterologous expression of cryptic BGCs from non-model or extremophile fungi [42]. |
| Modified A. nidulans (FAC-AnHH) | Engineered fungal host strain optimized for FAC integration and secondary metabolism. | Heterologous production host for FACs, often activating silent clusters [42]. |
| Curated In-House MS/MS Spectral Library | A local database of MS2 spectra from known NPs relevant to the research focus. | Crucial for accurate simRank scoring in NP-PRESS; improves novelty assessment vs. public DBs. |
This protocol integrates the above strategies for a targeted campaign on an extremophile bacterium with a bioinformatically identified silent BGC.
The discovery of novel natural products (NPs) through mass spectrometry is fundamentally hampered by the problem of irreproducibility. Variability in sample preparation and liquid chromatography-mass spectrometry (LC-MS) conditions generates inconsistent feature sets, making biological comparisons unreliable and obscuring the detection of rare, low-abundance metabolites. This irreproducibility stems from multiple sources: the inherent complexity of biological matrices, the chemical diversity of NPs, and the sensitivity of MS detection to subtle changes in experimental parameters.
The NP-PRESS strategy provides a critical framework for addressing this challenge [1]. This two-stage dereplication strategy is not merely an informatics solution but necessitates rigorous, reproducible upstream analytical chemistry to function correctly. Its first stage employs the FUNEL algorithm to filter out abiotic background and noise from MS1 data, while the second stage uses the simRank algorithm to differentiate true natural products from biotic interference (e.g., media components, degradation products) in MS2 data [1]. The efficacy of this prioritization is entirely dependent on the consistency of the feature lists input into the system. Therefore, establishing standardized, robust protocols for sample preparation and LC-MS analysis is the essential foundation upon which advanced dereplication strategies are built. This article details the best practices required to ensure reproducible data generation for NP discovery pipelines.
Selecting an appropriate sample preparation method is the first and most critical determinant of reproducibility. The choice must balance the desired level of sample cleanliness with the need to capture the broadest possible metabolome, including non-polar and polar secondary metabolites. The core principle is that cleaner samples drive better and more consistent assay performance [43].
Table 1: Comparison of Common LC-MS Sample Preparation Methods for Natural Product Workflows
| Method | Key Principle | Best For | Advantages for Reproducibility | Limitations |
|---|---|---|---|---|
| Dilute-and-Shoot [43] | Minimal processing; sample dilution in MS-compatible solvent. | Relatively clean matrices (e.g., microbial culture supernatant, plant sap). | Low handling minimizes human error; very fast; high recovery of a wide analyte range. | High matrix effects; prone to ion suppression; not suitable for complex, protein-rich samples. |
| Protein Precipitation (PPT) [43] [44] | Denaturation and pelleting of proteins using organic solvent (e.g., methanol, acetonitrile). | Protein-rich samples (e.g., fermentation broths, cell lysates). | Simple, rapid, and effective at removing proteins; uses common lab reagents. | Limited selectivity; phospholipids and salts remain; can precipitate some metabolites of interest. |
| Liquid-Liquid Extraction (LLE) [43] [44] | Partitioning of analytes based on solubility in two immiscible solvents (aqueous vs. organic). | Extraction of non-polar to moderately polar compounds from aqueous matrices. | Excellent cleanup; effective removal of salts and polar matrix interferences; can concentrate analytes. | Labor-intensive; difficult to automate fully; emulsion formation can cause variability. |
| Solid-Phase Extraction (SPE) [43] [44] | Selective adsorption of analytes onto a functionalized sorbent, followed by washing and elution. | High-purity extraction and concentration of analytes from complex matrices; targeted or untargeted work. | High selectivity and cleanliness; reduces matrix effects significantly; compatible with automation for high reproducibility. | More complex protocol; requires method development (sorbent, solvent selection); can be costly. |
For NP-PRESS, which aims to detect low-abundance features, methods that reduce matrix interference are paramount. Solid-Phase Extraction (SPE) is often the best choice for achieving reproducible, high-quality data from complex bacterial cultures [43]. The use of mixed-mode or selective sorbents can help fractionate samples, simplifying the chromatogram and reducing ion suppression, which in turn yields more consistent feature detection across replicates.
Automation is a key enabler of reproducibility. Automated liquid handlers can execute SPE, LLE, and PPT protocols with superior precision and consistency compared to manual pipetting, reducing human error and inter-operator variability [43] [44]. One documented implementation cut hands-on analyst time from 3 hours to 10 minutes while standardizing the process [43].
This protocol is optimized for the extraction of a broad range of secondary metabolites from Streptomyces or similar bacterial culture filtrates, suitable for input into the NP-PRESS pipeline.
Materials: Culture supernatant (acidified to pH ~3 with formic acid if targeting acidic compounds), SPE cartridges or 96-well plates (e.g., mixed-mode reversed-phase/cation exchange, 30 mg sorbent), Conditioning Solvent (Methanol), Equilibration Solvent (Water with 0.1% Formic Acid), Wash Solvent 1 (Water with 0.1% Formic Acid), Wash Solvent 2 (Methanol:Water 20:80 v/v), Elution Solvent (Methanol with 2% Ammonium Hydroxide or acetonitrile/methanol with 0.1% formic acid for reversed-phase), vacuum manifold or positive pressure processor, collection tubes/plates.
Procedure:
NP-PRESS Context: This SPE protocol significantly reduces the salts, sugars, and primary metabolites that make up the "biotic interference" which stage two of NP-PRESS (simRank) is designed to filter computationally [1]. A clean extract improves chromatographic peak shape and MS/MS spectral quality, increasing the confidence of both FUNEL and simRank algorithmic evaluations.
Chromatographic and mass spectrometric parameters must be locked down to ensure the same features are detected in every run. This is non-negotiable for longitudinal studies or multi-batch analyses common in NP discovery.
Table 2: Key LC-MS Parameters Requiring Standardization for Reproducible Untargeted Analysis
| System Component | Critical Parameters | Recommended Practice for Reproducibility | Impact on NP-PRESS |
|---|---|---|---|
| Liquid Chromatography | Column (make, chemistry, lot, age), Gradient profile, Flow rate, Column Temperature, Injection volume, Mobile Phase (brand, additives, pH). | Use the same column brand/chemistry; document lot numbers; utilize pre-set, validated gradient tables; prepare mobile phases in large, consistent batches. | Directly affects retention time (RT) stability, a critical metric for FUNEL's alignment and background subtraction [1]. |
| Mass Spectrometry (MS1 Survey Scan) | Resolution, Scan Range, AGC Target, Maximum Injection Time, Polarity Switching Dwell Time. | Use consistent resolution settings (e.g., 60,000-120,000 at m/z 200); calibrate instrument daily; use automatic gain control (AGC) to maintain consistent ion populations. | Determines the mass accuracy and peak shape of precursor ions, which are essential for accurate molecular formula prediction and adduct deconvolution. |
| Tandem MS (MS/MS Data-Dependent Acquisition) | Isolation Width, Collision Energy (fixed, ramped, or stepped), AGC Target for MS2, Top N for fragmentation, Dynamic Exclusion. | Apply normalized collision energy (e.g., 20-35% for HCD); use a consistent dynamic exclusion window (e.g., 15s). | Directly governs the quality and consistency of MS/MS spectra, which are the sole input for the simRank algorithm's structural similarity comparisons [1]. |
| System Suitability & QC | Reference standard mixture, Pooled QC sample injection frequency. | Inject a standardized mixture of NPs at beginning of sequence; inject a pooled QC sample every 5-10 experimental samples to monitor system drift. | Allows for post-acquisition correction of minor RT or intensity drift, ensuring features are comparable across the entire batch analyzed by NP-PRESS. |
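The injection cadence in the last table row (a standard mixture at the start, then a pooled QC every 5-10 samples) can be written down as a small sequence builder. The sketch below is a hypothetical helper; the labels `SST_mix` and `pooled_QC`, the opening QC injection, and the closing QC injection are assumptions added for illustration rather than requirements stated in the table.

```python
def build_injection_sequence(samples, qc_every=5, suitability="SST_mix", qc="pooled_QC"):
    """Return an ordered injection list: suitability mix, an opening QC,
    then samples with a pooled QC after every `qc_every` samples,
    and a closing QC if the run does not already end on one."""
    if qc_every < 1:
        raise ValueError("qc_every must be at least 1")
    sequence = [suitability, qc]
    for n, sample in enumerate(samples, start=1):
        sequence.append(sample)
        if n % qc_every == 0:
            sequence.append(qc)
    if sequence[-1] != qc:
        sequence.append(qc)
    return sequence


# Hypothetical batch of seven extracts.
samples = [f"extract_{i}" for i in range(1, 8)]
sequence = build_injection_sequence(samples, qc_every=5)
print(sequence)
```

Bracketing the batch with QC injections gives the drift check a reference point at both ends of the run.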
A pooled Quality Control (QC) sample, created by combining a small aliquot of every experimental sample, is indispensable. It is injected repeatedly throughout the acquisition batch. Consistency in the total ion chromatogram and feature detection of the QC samples indicates a stable system. Significant drift necessitates investigation and potentially re-calibration or column re-equilibration before proceeding.
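One common way to quantify the "consistency" of pooled QC injections is the coefficient of variation (CV) of each feature's intensity across those injections. The sketch below flags features whose CV exceeds a limit. The 30% limit, the function names, and the example intensities are assumptions chosen for illustration; the source does not specify a numerical acceptance criterion.

```python
from statistics import mean, stdev


def qc_cv_percent(intensities):
    """Coefficient of variation (sample standard deviation / mean) in percent."""
    m = mean(intensities)
    if m == 0:
        return float("inf")
    return stdev(intensities) / m * 100.0


def flag_unstable_features(qc_table, max_cv=30.0):
    """qc_table maps feature ID -> intensities across pooled QC injections.
    Returns the sorted IDs of features whose CV exceeds `max_cv`.
    Features with fewer than two QC measurements are skipped."""
    return sorted(
        fid
        for fid, values in qc_table.items()
        if len(values) >= 2 and qc_cv_percent(values) > max_cv
    )


# Hypothetical QC intensities from four pooled QC injections.
qc_table = {
    "F1": [100, 102, 98, 101],   # stable
    "F2": [100, 160, 60, 120],   # drifting
}

unstable = flag_unstable_features(qc_table)
print(unstable)
```

Features flagged this way are candidates for exclusion or for closer inspection before the feature table is passed to dereplication.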
Diagram: QC-Driven Workflow for LC-MS System Stability. A pooled QC sample, analyzed at intervals, provides feedback on system performance. Drift triggers corrective action before valuable experimental samples are compromised.
Reproducibility extends into the digital domain. Consistent, documented data processing workflows are required to transform raw files into the feature lists used by NP-PRESS.
Name raw files with a consistent convention (e.g., Project_Strain_Date_Replicate.ext). Store all raw data, methods, and audit trails from the LC-MS system.
This processed, clean feature table is the optimal input for the NP-PRESS two-stage dereplication. The FUNEL algorithm can more effectively filter abiotic noise when the input data itself is free from technical artifacts generated by inconsistent sample prep or instrument drift [1].
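A file naming pattern such as Project_Strain_Date_Replicate.ext is easiest to enforce when it is checked automatically before processing. The sketch below validates names with a regular expression. The specific field formats (an eight-digit date and a replicate written as `R` plus a number) and the example file names are assumptions, since the source gives only the general pattern.

```python
import re

# Assumed field formats: date as YYYYMMDD, replicate as R<number>.
NAME_RE = re.compile(
    r"^(?P<project>[A-Za-z0-9-]+)_"
    r"(?P<strain>[A-Za-z0-9-]+)_"
    r"(?P<date>\d{8})_"
    r"(?P<replicate>R\d+)"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_raw_filename(name):
    """Return the parsed fields as a dict, or None if the name does not conform."""
    match = NAME_RE.match(name)
    return match.groupdict() if match else None


# Hypothetical file names.
good = parse_raw_filename("NPPRESS_W307_20240115_R1.mzML")
bad = parse_raw_filename("sample final v2.raw")
print(good)
print(bad)
```

Rejecting non-conforming names at import time keeps strain, date, and replicate metadata recoverable from the file name alone.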
A reproducible workflow is a validated workflow. Implement these final guardrails:
Diagram: Integrated Reproducible Workflow for NP-PRESS. From sample to discovery, each stage is controlled and monitored, with the pooled QC sample providing a feedback loop to ensure data quality before processing and algorithmic dereplication.
Table 3: Key Reagents and Materials for Reproducible NP Sample Preparation and LC-MS
| Item | Function & Role in Reproducibility |
|---|---|
| Mixed-Mode SPE Sorbents (e.g., Reverse-Phase/Cation Exchange) | Provides selective, robust cleanup of complex culture broths; removing salts and primary metabolites reduces matrix effects and improves LC column lifetime, leading to more consistent retention times [43] [44]. |
| LC-MS Grade Solvents (Water, Acetonitrile, Methanol) | Ultra-pure solvents minimize background chemical noise and ion suppression in the MS, ensuring consistent baseline and detection sensitivity across batches and vendors. |
| Volatile Mobile Phase Additives (Formic Acid, Ammonium Formate, Ammonium Hydroxide) | Provides consistent pH control for reproducible chromatographic separation (especially for ionic compounds) and efficient positive/negative electrospray ionization. |
| Stable, High-Purity LC Column (e.g., C18, 1.7-1.9 µm, 100-150 mm length) | The core component for separation; using the same brand, chemistry, and lot number is critical for replicating the exact retention time landscape of metabolites across studies. |
| System Suitability & QC Standard Mix | A cocktail of known natural products or metabolites covering a range of polarities. Injected at the start of each batch, it verifies column performance, MS sensitivity, and mass accuracy are within specified limits. |
| Automated Liquid Handling System | Automates pipetting, SPE, and plate transfers, drastically reducing human error and variability in sample prep, especially for 96-well formats, enhancing throughput and reproducibility [43] [44]. |
Reproducibility in NP discovery is not a single step but a holistic discipline encompassing wet-lab bench work, instrumental analysis, and data science. By implementing standardized, robust protocols for sample preparation—prioritizing selective cleanup like SPE—and rigidly controlling LC-MS conditions, researchers generate the high-fidelity, consistent data required by advanced dereplication frameworks like NP-PRESS. This integrated approach, from controlled extraction to validated data processing, transforms LC-MS from a variable screening tool into a reliable engine for the reproducible discovery of novel natural products. The ultimate goal is a seamless pipeline where technical variability is minimized, allowing biological and chemical novelty to be revealed with confidence.
The discovery of novel natural products (NPs) with pharmaceutical potential is fundamentally hampered by the complexity of biological metabolomes. In traditional workflows, the majority of chromatographic and spectroscopic effort is spent on the re-isolation of known compounds or the pursuit of analytical artifacts, a process that is both time-consuming and resource-intensive [45]. The core challenge lies in accurately interpreting mass spectrometry (MS) data to differentiate between signals representing genuine novel metabolites, those belonging to known but undocumented compounds (database gaps), and those arising from non-biological processes or analytical artifacts [1]. This distinction is critical for directing isolation efforts toward true novelty.
The NP-PRESS strategy presents a targeted solution within this framework [1] [4]. It is a two-stage mass spectrometry feature dereplication pipeline designed to systematically remove irrelevant chemical features from complex metabolomic data, thereby refining the analysis to prioritize signals with a higher probability of representing novel NPs. This application note details the protocols for implementing NP-PRESS and contextualizes its performance against established dereplication methodologies, providing researchers with a structured approach to interpret MS results and confidently identify true novelty.
The NP-PRESS strategy is engineered to address the specific problem of biotic interference, that is, signals originating from microbially processed cellular degradation products and culture media components, which often overwhelm the true NP metabolome in bacterial extracts [4]. The pipeline employs two sequentially applied, novel algorithmic filters to process LC-MS/MS data.
Stage 1: FUNEL (FUll-NOise-ELimination) Algorithm. This initial stage operates on MS1 (precursor ion) data. FUNEL is designed to perform a comprehensive subtraction of irrelevant features by comparing the metabolomic profile of the target sample against a rigorously constructed control model. This model encapsulates the "baseline" chemical noise derived from abiotic and biotic processes not associated with specialized metabolite production [1] [4]. The output is a significantly refined metabolome dataset depleted of ubiquitous background interference.
Stage 2: simRank Algorithm. The refined feature list from FUNEL is then analyzed using the simRank algorithm at the MS2 (fragmentation) level. simRank calculates spectral similarity but incorporates a scoring system that prioritizes features with dissimilarity to known compounds in public spectral libraries (e.g., GNPS). Crucially, it also identifies and groups features with high similarity to each other but low similarity to known entries, effectively highlighting potential new compound families or analogs [1]. The final output is a shortlist of prioritized MS features that are both unique to the producing organism and structurally distinct from previously characterized metabolites.
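The grouping behaviour described for Stage 2 (features that resemble each other but not the library) can be illustrated with a small graph exercise. The sketch below takes precomputed similarity scores, keeps features whose best library similarity is below a cutoff, and groups them into connected components by mutual similarity. It is a simplified stand-in and not the simRank algorithm; the function name, both 0.7 cutoffs, and the example scores are assumptions.

```python
def group_novel_features(max_library_sim, pairwise_sim, library_cutoff=0.7, group_cutoff=0.7):
    """max_library_sim: feature ID -> best similarity to any library spectrum.
    pairwise_sim: {(feature_a, feature_b): similarity} between sample features.
    Returns groups (sorted lists of IDs) of library-dissimilar features that are
    linked by mutual similarity, with larger groups first."""
    novel = {f for f, sim in max_library_sim.items() if sim < library_cutoff}
    neighbors = {f: set() for f in novel}
    for (a, b), sim in pairwise_sim.items():
        if a in novel and b in novel and sim >= group_cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)
    groups, seen = [], set()
    for start in sorted(novel):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        groups.append(sorted(component))
    groups.sort(key=lambda g: (-len(g), g))
    return groups


# Hypothetical scores: C matches the library; A and B resemble each other.
max_library_sim = {"A": 0.2, "B": 0.3, "C": 0.9, "D": 0.1}
pairwise_sim = {("A", "B"): 0.85, ("A", "C"): 0.8, ("B", "D"): 0.3}

groups = group_novel_features(max_library_sim, pairwise_sim)
print(groups)
```

In this toy case A and B form a candidate new compound family, D is a library-dissimilar singleton, and C is excluded as a likely known compound.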
While NP-PRESS offers a specialized, algorithm-driven approach, dereplication employs a spectrum of methodologies. The table below summarizes key strategies, their technological basis, and their primary strengths and limitations.
Table 1: Comparison of Dereplication Strategies for Natural Products
| Strategy / Protocol Name | Core Technology | Key Mechanism for Identifying Novelty | Typical Application Context | Key Strength | Major Limitation |
|---|---|---|---|---|---|
| NP-PRESS [1] [4] | LC-HRMS/MS, FUNEL (MS1) & simRank (MS2) algorithms | Stepwise removal of biotic/abiotic interference; prioritization of features dissimilar to known libraries. | Microbial extracts, especially extremophiles or complex backgrounds. | High specificity in removing non-NP signals; discovers low-abundance metabolites. | Requires carefully constructed control models; algorithm-dependent. |
| MS/MS Library Matching [45] | LC-ESI-MS/MS with in-house or public spectral libraries | Direct matching of precursor m/z, retention time, and fragmentation pattern against library entries. | Rapid screening of plant extracts for known bioactive phytochemicals. | Fast, high-confidence identification of known compounds. | Cannot identify novel compounds absent from the library; prone to false negatives from library gaps. |
| PLANTA Protocol [46] | NMR-HetCA, HPTLC, Chemometrics (e.g., STOCSY, SH-SCY) | Statistical correlation of spectral/chromatographic features with bioactivity; orthogonal data integration. | Pre-isolation identification of bioactive constituents in complex plant extracts. | Direct link to bioactivity; non-destructive; orthogonal validation. | Lower sensitivity than MS; requires larger sample amounts; complex data analysis. |
| Molecular Networking [47] | LC-MS/MS with spectral similarity networking (e.g., GNPS) | Visualization of related molecules as clusters; novelty inferred from cluster location relative to knowns. | Untargeted exploration of metabolite families in diverse samples. | Visualizes chemical families; good for analog discovery. | Relies on ionization efficiency; weak for structurally unique singletons. |
The performance of a dereplication strategy can be quantified. For example, the PLANTA protocol, when applied to an artificial mixture of 59 compounds, demonstrated a detection rate of 89.5% for active metabolites and a correct identification rate of 73.7% [46]. NP-PRESS has proven effective in real discovery campaigns, leading to the identification of new surugamide analogs from Streptomyces albus and the discovery of an entirely new family of depsipeptides, the baidienmycins, from Wukongibacter baidiensis, which exhibited potent antimicrobial and anticancer activities [1] [4].
Objective: To prioritize novel natural product features from a complex microbial extract by removing interfering signals from biotic and abiotic processes [4].
I. Sample Preparation & LC-MS/MS Data Acquisition
II. Data Processing with FUNEL Algorithm (Stage 1)
III. Data Processing with simRank Algorithm (Stage 2)
IV. Downstream Validation
Objective: To provide orthogonal, activity-guided identification of bioactive compounds in a complex extract prior to isolation, complementing MS-based dereplication [46].
Table 2: Key Reagent Solutions for NP Dereplication Workflows
| Item | Function in Dereplication | Example/Note |
|---|---|---|
| High-Resolution Mass Spectrometer | Provides accurate mass measurement for elemental composition determination and generates MS/MS spectra for structural comparison. | Q-TOF or Orbitrap instruments are standard [1] [45]. |
| UPLC/HPLC System with C18 Column | Separates complex mixtures to reduce ion suppression and provide retention time as a key identification parameter. | Sub-2µm particle columns are recommended for UPLC [45]. |
| MS-Grade Solvents & Additives | Ensure reproducibility, minimize background noise, and promote consistent ionization in ESI-MS. | Methanol, Acetonitrile, Water, Formic Acid [45]. |
| Reference Standard Compound Libraries | Essential for building in-house MS/MS libraries and validating identifications. | Commercial suppliers (e.g., Sigma-Aldrich) or purified isolates [45]. |
| NMR Solvents & Tubes | Required for structural elucidation of isolated compounds to confirm novelty. | Deuterated solvents (e.g., methanol-d4), 5 mm NMR tubes [46]. |
| Specialized Culture Media | Used to cultivate target microorganisms, often influencing secondary metabolite production. | ISP-2, R2A, or other defined media for bacteria [1] [4]. |
| Data Analysis Software | For processing raw MS data, running specialized algorithms, and visualizing molecular networks. | MZmine, MS-DIAL, GNPS, proprietary scripts for FUNEL/simRank [1] [4]. |
| Chromatography Resins for Fractionation | For activity-guided isolation after dereplication pinpoints a target. | Solid-phase extraction (SPE) cartridges, Sephadex LH-20, preparative C18 silica [46] [47]. |
The following diagrams outline the logical flow of the NP-PRESS strategy and the general decision process for interpreting MS features.
Diagram 1: The NP-PRESS Two-Stage Dereplication Workflow.
Diagram 2: Decision Logic for Interpreting MS Features.
Natural Products (NPs) remain an invaluable source for pharmaceutical development, yet their discovery is hampered by the overwhelming chemical complexity of biological extracts [48]. Traditional dereplication methods aim to rapidly identify known compounds to avoid redundant isolation work. However, these methods often fail to distinguish true secondary metabolites—the target bioactive NPs—from the vast background of interfering features originating from biotic processes (e.g., media components, cellular degradation products) and abiotic sources [4]. This limitation leads to wasted resources on fruitless isolations and obscures potentially novel, low-abundance metabolites.
This analysis benchmarks NP-PRESS, a two-stage MS feature dereplication strategy, against established methods [1]. NP-PRESS systematically removes irrelevant chemical features before annotation, thereby refining the metabolome to highlight genuine NPs. This document presents application notes and experimental protocols to validate and implement this strategy and to show how it prioritizes novel bioactive compounds for drug discovery pipelines.
The performance of any dereplication strategy is measured by its accuracy, efficiency, and success in guiding the discovery of novel chemical entities. The following table compares the NP-PRESS pipeline with three established approaches: spectral library matching (exemplified by in-house libraries), molecular networking via GNPS, and DEREPLICATOR, an algorithm specific to peptide natural products (PNPs) [49] [45] [11].
Table 1: Comparative Performance of Dereplication Strategies
| Performance Metric | Traditional Library Matching [45] | GNPS Molecular Networking [11] | DEREPLICATOR (for PNPs) [49] | NP-PRESS Pipeline [4] [1] |
|---|---|---|---|---|
| Core Strategy | Match MS/MS spectra to curated reference libraries. | Cluster MS/MS spectra by similarity to visualize chemical families. | Hybrid search combining spectral matching with genomic insights for peptides. | Two-stage MS1/MS2 analysis to subtract irrelevant features before annotation. |
| Key Advantage | High confidence for known compounds present in the library. | Powerful for identifying structural analogs and new members of known families. | High-throughput, accurate identification of peptide natural products (PNPs). | Dramatically reduces dataset complexity, exposing low-abundance and novel NPs. |
| Primary Limitation | Limited to compounds in the library; fails for novel scaffolds. | Requires sufficient spectral similarity; can miss unique or highly modified compounds. | Specialized for PNPs (NRPs/RiPPs); not generalizable to all NP classes. | Requires paired experimental design (e.g., with/without culture media). |
| Reported Output (Case Study) | Dereplication of 31 compounds from plant extracts [45]. | Annotation of 51 compounds from Sophora flavescens [11]. | Identification of 100s of variant PNPs from GNPS datasets [49]. | Discovery of new surugamide analogs and the novel baidienmycin family [4]. |
| Novel Compound Discovery | Low (aimed at avoiding rediscovery). | Moderate (enables analog discovery). | High for PNPs (specifically designed for variants). | Very High (actively prioritizes unknown features post-subtraction). |
Analysis of Benchmarking Data: The comparison highlights NP-PRESS's strategic differentiation. While traditional methods excel at cataloging known compounds, NP-PRESS is engineered for novelty discovery. Its two-stage filtering process, employing the FUNEL (MS1) and simRank (MS2) algorithms, removes up to 90% of irrelevant features in complex microbial metabolomes, which are typically the major noise obscuring target NPs [4]. This pre-filtering step directly addresses the core weakness of methods like GNPS, which processes all acquired features regardless of origin. Consequently, NP-PRESS successfully identified new antibacterial and anticancer depsipeptides (baidienmycins) from an anaerobic bacterium, a task where conventional dereplication would likely have failed due to metabolic interference [1].
This protocol outlines the application of NP-PRESS for prioritizing novel natural products from microbial cultures [4] [1].
I. Experimental Design & Sample Preparation
II. LC-HRMS/MS Data Acquisition
III. Data Processing with NP-PRESS
This protocol details the creation and use of an in-house spectral library for rapid dereplication of common phytochemicals, as validated in recent work [45].
I. Construction of an In-House Tandem MS Library
… .mgf format).
II. Dereplication of Unknown Extracts
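The individual steps of this part of the protocol are not preserved here. As a rough sketch of the first matching step only, the code below screens an unknown feature against an in-house library by precursor m/z and retention time; fragment-spectrum comparison is left out. The `LibraryEntry` class, the function names, the tolerances (5 ppm, 0.2 min), and the example entries are hypothetical and are not taken from [45].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LibraryEntry:
    name: str
    precursor_mz: float
    rt: float  # retention time in minutes, measured on the same LC method


def ppm_error(observed_mz, reference_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def match_library(mz, rt, library, ppm_tol=5.0, rt_tol=0.2):
    """Return library entries matching the feature, smallest mass error first."""
    hits = [
        entry for entry in library
        if abs(ppm_error(mz, entry.precursor_mz)) <= ppm_tol
        and abs(rt - entry.rt) <= rt_tol
    ]
    return sorted(hits, key=lambda entry: abs(ppm_error(mz, entry.precursor_mz)))
```

A feature that returns no hits is not necessarily novel; it only means the in-house library cannot explain it, so it would move on to public-library searching or networking.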
The following diagrams illustrate the conceptual and procedural differences between the dereplication strategies.
Diagram 1: Two-Stage NP-PRESS Workflow for Novel NP Prioritization [4] [1]
Diagram 2: Comparative Dereplication Strategy Decision Flow
Successful dereplication requires precise materials and analytical resources. The following table lists key solutions for implementing the protocols described.
Table 2: Essential Reagents and Materials for Dereplication Experiments
| Item Name | Specification / Example | Primary Function in Dereplication |
|---|---|---|
| LC-MS Grade Solvents | Methanol, Acetonitrile, Water (with 0.1% Formic Acid) | Mobile phase components for high-sensitivity, reproducible chromatographic separation [45] [11]. |
| Authentic Chemical Standards | >97% purity compounds (e.g., flavonoids, alkaloids) | Essential for constructing and validating in-house tandem MS libraries for confident peak annotation [45]. |
| Solid Phase Extraction (SPE) Cartridges | C18, HLB, or Mixed-Mode phases | Pre-analytical cleanup of crude extracts to reduce matrix effects and instrument fouling. |
| High-Resolution Mass Spectrometer | Q-TOF, Orbitrap, or FT-ICR MS systems | Provides accurate mass measurement (<5 ppm error) for formula prediction and high-quality MS/MS spectra for structural matching [48] [45]. |
| Chromatography Column | Reversed-Phase C18 (e.g., 2.1 x 150 mm, 1.8 μm) | Core component for separating complex mixtures of natural products prior to mass spectrometric detection [11]. |
| Data Analysis Software | MZmine, MS-DIAL, Compound Discoverer, GNPS | Platforms for feature detection, alignment, spectral deconvolution, and database searching, enabling the processing of large metabolomics datasets [49] [11]. |
| Public Spectral Databases | GNPS, MassBank, NIST, AntiMarin | Critical reference repositories for spectral matching and molecular networking to annotate known compounds and their analogs [49] [45]. |
The discovery of novel natural products (NPs) is persistently hampered by the high rate of compound rediscovery, making dereplication (the early identification of known compounds) a critical first step [50]. This document details a proof-of-concept validation of the NP-PRESS strategy [1]. NP-PRESS addresses a key bottleneck: the inability of conventional methods to distinguish signals of novel secondary metabolites from the overwhelming background of "biotic processed" features, such as microbial degradation products and media components [1].
This application note provides detailed protocols for transitioning from a prioritized MS feature to an isolated compound with validated bioactivity, using the discovery of the baidienmycins from Wukongibacter baidiensis M2B1 as a case study [1]. The workflow integrates advanced mass spectrometry, innovative bioinformatics, and classical natural product chemistry to accelerate the targeted discovery of novel bioactive entities.
The NP-PRESS strategy is built upon two novel algorithms, FUNEL and simRank, which operate on MS1 and MS2 data, respectively [1].
The synergy of these stages ensures that only the most promising, novel MS features are carried forward for isolation.
Diagram 1: NP-PRESS Two-Stage Dereplication and Prioritization Workflow
The NP-PRESS strategy was validated on the anaerobic bacterium Wukongibacter baidiensis M2B1. Application of the two-stage filter successfully prioritized a cluster of MS features that were distinct from known compounds in databases. Targeted isolation guided by these features led to the discovery of a new family of depsipeptides, named baidienmycins [1].
Table 1: Key Data for Baidienmycins Discovery via NP-PRESS
| Parameter | Details & Quantitative Results |
|---|---|
| Source Organism | Wukongibacter baidiensis M2B1 (anaerobic bacterium) [1] |
| NP-PRESS Outcome | Prioritization of a novel molecular family distinct from database entries [1]. |
| Discovered Compounds | Baidienmycins (a new family of depsipeptides) [1]. |
| Key Bioactivity | Potent antimicrobial and anticancer activities reported [1]. |
| Validation Method | Comparison of MS2 spectra and features against public GNPS libraries confirmed novelty [1]. |
| Role of Dereplication | NP-PRESS eliminated >90% of interfering MS1 features from media/primary metabolism, allowing focus on novel secondary metabolites [1]. |
Objective: Generate high-quality MS1 and MS2 data from microbial crude extracts suitable for analysis with the FUNEL and simRank algorithms.
Materials:
Procedure:
Objective: Process acquired LC-MS/MS data through the NP-PRESS pipeline to identify novel compound features.
Software/Tools: NP-PRESS algorithms (FUNEL, simRank), MZmine2 or similar for initial feature finding, GNPS for comparative analysis [1] [13].
Procedure:
Objective: Isolate milligram quantities of the target compound from a scaled-up microbial culture.
Materials: Fermentation broth (e.g., 10-50 L), solid-phase extraction (SPE) cartridges (C18, DIAION HP20), preparative HPLC system, Sephadex LH-20, analytical TLC/HPLC supplies [51].
Procedure:
Objective: Confirm the biological activity of the isolated compound.
Materials: Isolated compound, 96-well microtiter plates, appropriate cell lines (e.g., HeLa, MCF-7 for cancer; Bacillus subtilis, Staphylococcus aureus for bacteria; Candida albicans for fungi), cell culture media, alamarBlue or MTT reagent, spectrophotometer/plate reader [51] [52] [53].
Procedure for Anticancer Activity (Cytotoxicity Assay):
Procedure for Antimicrobial Activity (MIC Determination):
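The individual MIC steps are not preserved in this section. As a minimal sketch of the readout only, assuming a broth microdilution with a two-fold dilution series and blank-subtracted optical density readings, the MIC can be taken as the lowest concentration at which no growth is detected while every higher concentration is also clear. The function names and the OD cutoff of 0.1 are illustrative assumptions, not values from the source or from CLSI.

```python
def two_fold_series(top_concentration, n_wells):
    """Concentrations of a two-fold serial dilution, highest first."""
    return [top_concentration / (2 ** i) for i in range(n_wells)]


def mic(readings, growth_cutoff=0.1):
    """readings: {concentration: blank-subtracted OD after incubation}.
    Returns the lowest concentration with no growth such that all higher
    concentrations are also clear, or None if the top concentration grows."""
    result = None
    for concentration in sorted(readings, reverse=True):
        if readings[concentration] >= growth_cutoff:
            break  # first well (from the top) that shows growth
        result = concentration
    return result
```

If `mic` returns `None`, the compound was inactive over the tested range and the MIC is reported as greater than the highest concentration tested.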
Diagram 2: Comprehensive Validation Workflow from Feature to Bioactivity
Table 2: Key Reagents and Materials for NP Dereplication and Isolation
| Item | Function & Application in Workflow |
|---|---|
| Ethyl Acetate (EtOAc) | A standard medium-polarity solvent for liquid-liquid extraction of fermented broth, effective for extracting a wide range of secondary metabolites [51]. |
| Solid-Phase Extraction (SPE) Cartridges (C18, HP20) | For rapid desalting and initial fractionation of crude extracts based on polarity, reducing complexity before HPLC [51]. |
| Sephadex LH-20 | Gel filtration resin for size-based separation and removal of salts/pigments; used with organic solvents like methanol or CH₂Cl₂/MeOH [51]. |
| Preparative C18 HPLC Column | The cornerstone of final purification. Allows high-resolution separation of compounds using gradients of water and acetonitrile/methanol [51]. |
| Deuterated Solvents (CDCl₃, DMSO-d₆, CD₃OD) | Essential for nuclear magnetic resonance (NMR) spectroscopy, the primary technique for determining planar structure and stereochemistry of isolated compounds [50]. |
| LC-MS Grade Solvents (MeOH, ACN, H₂O + 0.1% FA) | Essential for all LC-MS steps to prevent ion suppression, background noise, and column degradation, ensuring high-quality data for dereplication [13] [53]. |
| AlamarBlue/MTT Reagent | Cell viability indicators used in cytotoxicity and antiproliferative bioassays. Metabolic reduction by living cells produces a measurable colorimetric/fluorometric change [51] [52]. |
| Mueller-Hinton Broth (MHB) | A standardized, low-protein medium recommended by CLSI for determining Minimum Inhibitory Concentrations (MICs) of antimicrobial compounds [52]. |
| Database Access (GNPS, DEREPLICATOR+) | Critical bioinformatics resources. GNPS allows molecular networking and spectral library matching [54] [55]. DEREPLICATOR+ enables dereplication against extensive databases of peptides, polyketides, and other NP classes [56]. |
The discovery of novel natural products (NPs) is persistently challenged by the high rate of compound rediscovery and the difficulty in detecting low-abundance or conditionally expressed metabolites [50]. While genomics reveals a vast reservoir of biosynthetic potential, with microbial genomes harboring dozens of uncharacterized Biosynthetic Gene Clusters (BGCs), the majority remain as "orphan" clusters with no identified molecular product [57]. Conversely, mass spectrometry (MS)-based metabolomics detects a plethora of compounds, but efficiently distinguishing novel NP signals from complex biological noise is a formidable task [1].
This application note addresses this critical gap by detailing a synergistic strategy that correlates output from the novel two-stage MS feature dereplication platform, NP-PRESS, with genomic BGC data [1]. NP-PRESS employs specialized algorithms (FUNEL and simRank) to aggressively filter MS1 and MS2 data, removing interfering features from biotic processes and media to prioritize ions most likely to represent novel NPs [1]. When these prioritized MS features are integrated with genomic evidence of BGC activation (for example, from transcriptomics or proteomics), researchers gain a hypothesis-driven framework for discovery. This integrated approach streamlines the journey from genetic potential to novel chemical entity, accelerating lead identification for drug development [50].
The core synergy lies in forming a data-driven bridge between a confidently filtered mass spectrometric signal and a transcriptionally active genetic locus. The process transforms parallel data streams into a coherent discovery pipeline.
Table 1: Key Data Types for Correlation and Their Sources
| Data Type | Description | Common Source/Method | Role in Correlation |
|---|---|---|---|
| Prioritized MS1 Feature | m/z, RT, intensity of a metabolite ion. | NP-PRESS processed LC-MS data [1]. | The target chemical entity requiring a genetic origin. |
| MS2 Fragmentation Spectrum | Molecular fingerprint from collision-induced dissociation. | LC-MS/MS analysis. | Used for structural similarity networking (e.g., via GNPS) and database dereplication [50]. |
| BGC Sequence | Genomic locus encoding biosynthetic enzymes. | Genome mining (antiSMASH, PRISM) [57] [58]. | Predicts chemical class (e.g., NRPS, PKS) and potential structure of the product. |
| Transcriptomic Read Counts (RPKM) | Quantitative gene expression levels. | RNA-Seq of producing vs. non-producing conditions [57]. | Evidences activation of the BGC coincident with metabolite detection. |
| Proteomic Data | Expression levels of biosynthetic enzymes. | LC-MS/MS proteomics. | Confirms translation of BGC genes into functional machinery. |
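One simple way to relate a prioritized MS feature to candidate BGCs is to correlate the feature's intensity with each BGC's expression across the same samples or time points. The sketch below uses a Pearson coefficient for this; it illustrates the idea and is not the procedure of [1] or [57]. The function names and BGC labels are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.
    Returns 0.0 when either sequence has no variance."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_bgcs(feature_intensity, bgc_expression):
    """feature_intensity: intensities of one MS feature across samples.
    bgc_expression: {bgc_id: expression values across the same samples}.
    Returns (bgc_id, r) pairs, most positively correlated first."""
    scores = {
        bgc_id: pearson(feature_intensity, values)
        for bgc_id, values in bgc_expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A high coefficient from a handful of samples is only suggestive; the BGC's predicted chemical class should also be compatible with the feature's mass and fragmentation before the pair is pursued.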
Integrated NP-PRESS to BGC Correlation Workflow
This protocol refines raw LC-MS data to generate a shortlist of MS features most likely to correspond to novel natural products.
Experimental Workflow:
This protocol identifies which BGCs in a genome are transcriptionally activated under conditions that induce new metabolite production.
Experimental Workflow:
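The workflow steps themselves are not preserved here. For orientation, the expression measure named above, RPKM (reads per kilobase of transcript per million mapped reads), and a simple induced-versus-control comparison can be sketched as follows. The function names and the pseudocount of 1 are illustrative assumptions; a dedicated differential-expression package would normally replace the fold-change step.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_fold_change(induced, control, pseudocount=1.0):
    """log2 ratio of induced to control expression; the pseudocount
    avoids division by zero for genes that are silent in the control."""
    return math.log2((induced + pseudocount) / (control + pseudocount))
```

Averaging the per-gene values over the core biosynthetic genes of a cluster gives one number per BGC and condition, which makes activated clusters easier to shortlist.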
Table 2: NP-PRESS & Genomics Correlation Performance Metrics
| Metric | NP-PRESS Performance | Genomic BGC Correlation | Integrated Workflow Advantage |
|---|---|---|---|
| Feature Reduction | Removes >90% of interfering MS1 features from biotic background [1]. | Narrows 10,000+ BGCs in genomes to a few activated targets [57]. | Focuses analytical resources on a shortlist of high-priority metabolite-BGC pairs. |
| Novelty Confidence | High confidence that prioritized features are not common media/biotic components [1]. | High confidence that activated BGCs encode underexplored chemistry. | Convergent evidence strongly increases probability of true novelty. |
| Case Study Yield | Guided discovery of new surugamide analogs and the new baidienmycin family [1]. | Enabled linking keyicin production to its specific BGC and interspecies signaling [57]. | Provides both the novel structure (MS) and its genetic blueprint (genomics). |
| Key Analytical Tool | FUNEL and simRank algorithms [1]. | RNA-Seq differential expression analysis (e.g., edgeR) [57]. | Temporal correlation analysis and molecular networking. |
Experimental Protocol for NP-PRESS & BGC Correlation
The discovery of keyicin provides a seminal example of this integrative approach [57]. While not using NP-PRESS specifically, the methodology mirrors its principles.
Table 3: Key Reagents and Materials for Integrated NP-PRESS & BGC Studies
| Item | Function/Description | Application Notes |
|---|---|---|
| Specialized Growth Media | Media formulations designed to elicit secondary metabolism (e.g., R2A, ISP2) [1]. | Essential for activating silent BGCs. Co-culture setups require compatible media for both organisms [57]. |
| RNA Stabilization Reagent (e.g., TRIzol) | Immediately stabilizes cellular RNA to preserve accurate transcriptional profiles. | Critical for transcriptomics. Samples must be taken from the exact same cultures and time points used for metabolomics. |
| Next-Generation Sequencing Kit | For preparation of RNA-Seq libraries (e.g., Illumina TruSeq). | Enables genome-wide quantification of BGC gene expression. Poly-A selection is not suitable for bacterial RNA. |
| LC-MS Grade Solvents | High-purity acetonitrile, methanol, and water for metabolite extraction and chromatography. | Minimizes chemical background noise, improving NP-PRESS filtering performance [1]. |
| Database Subscriptions/Access | Access to genomic (MIBiG) [58], metabolomic (GNPS) [50], and chemical (PubChem) databases. | Required for BGC annotation, molecular networking, and final dereplication of putative novel compounds. |
Natural product (NP) discovery is a cornerstone of pharmaceutical development, yet it is persistently challenged by the low abundance of novel compounds and the complex interference from biological matrices. Traditional mass spectrometry (MS) workflows struggle to differentiate signals of novel NPs from a background of microbial degradation products and media components [1]. The NP-PRESS strategy addresses this by implementing a two-stage MS feature dereplication process, using algorithms like FUNEL and simRank to prioritize truly novel features by thoroughly removing irrelevant biotic and abiotic signals [1]. However, a critical bottleneck remains: the confident identification of novel peptide or peptide-like natural products once they are prioritized.
This is where AI-driven de novo peptide sequencing, exemplified by InstaNovo, creates a transformative synergy. InstaNovo is a transformer-based deep learning model that translates fragment ion peaks (MS/MS spectra) into peptide sequences without relying on pre-existing protein databases [59] [60]. Its diffusion-model counterpart, InstaNovo+, further refines predictions through iterative processes [59]. By integrating InstaNovo into the NP-PRESS pipeline, researchers can transition from merely prioritizing unknown features to directly sequencing and identifying them. This integration is particularly powerful for discovering novel ribosomal and non-ribosomal peptides, cyclized peptides, and other natural product derivatives that are absent from conventional databases, thereby "illuminating the dark proteome" of microbial producers [59] [60].
Table 1: Comparative Core Functions of NP-PRESS and InstaNovo
| Aspect | NP-PRESS Strategy | InstaNovo/AI-Driven Sequencing | Integrated Advantage |
|---|---|---|---|
| Primary Goal | Dereplication; prioritize novel MS features by removing known/irrelevant signals [1]. | Database-free identification; determine the amino acid sequence from MS/MS spectra [59] [60]. | From prioritization to identification: Converts a list of "interesting unknowns" into definitive sequences. |
| Key Innovation | Two-stage algorithm (FUNEL, simRank) to subtract features from biotic processes and media [1]. | Transformer (InstaNovo) and diffusion (InstaNovo+) models for direct spectrum-to-sequence translation [59]. | Creates a closed-loop discovery engine: Filter, then sequence. Reduces search space for AI, increasing its efficiency. |
| Data Input | MS1 (precursor) and MS2 (fragment) data from complex biological extracts [1]. | MS/MS (MS2) spectrum (peak lists with m/z and intensity) [61]. | NP-PRESS pre-filters the most promising, novel spectra for computationally intensive de novo analysis. |
| Output | A prioritized list of LC-MS features highly likely to represent novel natural products [1]. | Predicted peptide sequences with associated confidence scores (log probabilities) [61]. | Annotated novel natural product sequences with structural hypotheses ready for synthesis and validation. |
| Role in Workflow | Supplies both dereplication stages (FUNEL on MS1, simRank on MS2) and handles complex mixtures. | Adds a downstream identification step that resolves the sequence of the prioritized unknowns. | Extends the two-stage MS feature dereplication strategy toward end-to-end novel NP discovery. |
Table 2: Performance Metrics Demonstrating Integration Potential
| Metric | NP-PRESS (Concept Proof) | InstaNovo/InstaNovo+ (Reported Performance) | Interpretation for Integration |
|---|---|---|---|
| Discovery Yield | Guided discovery of new surugamide analogs and the new depsipeptide family baidienmycins from bacteria [1]. | Identified 1,338 previously undetected protein fragments in well-studied HeLa cell samples [60]. | Suggests high potential to identify novel peptides from NP-PRESS-prioritized features in microbial extracts. |
| Precision Gain | Excels at removing interfering features to highlight true NP signals [1]. | InstaNovo+ identified 32.71% more PSMs at 5% FDR than Casanovo, the prior state of the art, on a yeast dataset [59]. | Filtering before sequencing is expected to compound both gains, making final identifications more likely to be both novel and accurate. |
| Handling Modifications | Not explicitly addressed for PTMs. | InstaNovo-P model fine-tuned for phosphorylation; v1.1 natively supports common PTMs (Oxidation, Deamidation, etc.) [62] [61]. | Enables discovery of modified NP analogues, a common source of bioactivity diversity. |
| Algorithmic Complement | FUNEL and simRank algorithms for feature comparison and filtering [1]. | Transformer neural network with knapsack beam search and diffusion-based refinement [59] [61]. | Filtering (NP-PRESS) reduces noise for the sequence inference model (InstaNovo), optimizing overall computational efficiency. |
Objective: To discover and sequence novel peptide-based natural products from a microbial culture extract.
Workflow Diagram:
(Diagram: Integrated NP Discovery and Sequencing Workflow)
Steps:
- Data Filtering: Filter the predictions in novel_peptide_predictions.csv based on confidence. Key columns include log_probabilities (higher is better) and delta_mass_ppm (absolute value closer to zero indicates better mass accuracy). Retain sequences with log probability > -3 and |Δ mass| < 10 ppm for further analysis [61].
- Downstream Validation: The resulting candidate sequences form testable structural hypotheses. They can be chemically synthesized for standard co-injection to confirm LC-MS retention time and fragmentation pattern, or used for heterologous expression and biological activity testing [1].
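The confidence filter described in the Data Filtering step can be sketched with the standard library alone. The column names `log_probabilities` and `delta_mass_ppm` and the thresholds come from the text above; the function name, the `spectrum_id` column, and the example rows are made up for illustration, and a real output file may contain further columns.

```python
import csv
import io

# Made-up example rows; a real run would read novel_peptide_predictions.csv.
EXAMPLE_CSV = """spectrum_id,log_probabilities,delta_mass_ppm
s1,-0.8,2.1
s2,-4.5,1.0
s3,-1.2,-15.3
s4,-2.9,-9.5
s5,n/a,0.5
"""


def filter_predictions(handle, min_log_prob=-3.0, max_abs_ppm=10.0):
    """Keep rows with log probability > min_log_prob and |delta mass| < max_abs_ppm.
    Rows whose values cannot be parsed as numbers are skipped."""
    kept = []
    for row in csv.DictReader(handle):
        try:
            log_prob = float(row["log_probabilities"])
            ppm = float(row["delta_mass_ppm"])
        except (ValueError, TypeError):
            continue  # missing or non-numeric value
        if log_prob > min_log_prob and abs(ppm) < max_abs_ppm:
            kept.append(row)
    return kept


confident = filter_predictions(io.StringIO(EXAMPLE_CSV))
```

For a file on disk, the same function can be called as `filter_predictions(open("novel_peptide_predictions.csv", newline=""))`.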
Protocol 2: Targeted Validation via Parallel Reaction Monitoring (PRM)
Objective: To confirm the presence and expression of a candidate novel peptide discovered via Protocol 1.
Workflow Diagram:
(Diagram: Targeted PRM Validation for Candidate Sequences)
Steps:
- Spectral Library Generation: For the candidate peptide sequence (e.g., "ALPYTPKK"), use software like Skyline or the Python pyteomics library to generate a theoretical MS2 spectrum. Include predicted fragment ions (b- and y-ions) and their expected intensities if using advanced tools.
- PRM Assay Development: Calculate the precursor m/z for the candidate peptide in common charge states (+2, +3). Design a PRM method on your tandem mass spectrometer (e.g., Q-Exactive, TripleTOF) to target these precursor ions with an appropriate isolation width (e.g., 2 m/z) and schedule the acquisition around the expected LC retention time.
- Sample Analysis & Confirmation:
- Synthetic Standard: Analyze a chemically synthesized version of the candidate peptide. Successful co-elution and a high match between the observed PRM spectrum and the theoretical library confirm the InstaNovo-predicted structure.
- Biological Replicates: Re-analyze new extracts from the putative producing microbial strain and relevant controls using the PRM method. Detection of the peptide specifically in the producing strain, with an MS2 spectrum matching the library, provides orthogonal biological validation [62].
- Quantification (Optional): If a synthetic standard is available, a calibration curve can be built to estimate the native concentration of the novel natural product in the culture.
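The precursor m/z values and theoretical b- and y-ions needed for the PRM assay can be computed by hand, as in the sketch below. It assumes a linear, unmodified peptide built from the 20 standard amino acids, which will not hold for cyclic or non-ribosomal products, and it uses standard monoisotopic residue masses rather than any particular library's API. The function names are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added for the free N- and C-termini
PROTON = 1.007276   # mass of one charge-carrying proton


def precursor_mz(sequence, charge):
    """m/z of the [M + zH]z+ ion of a linear, unmodified peptide."""
    neutral = sum(MONO[aa] for aa in sequence) + WATER
    return (neutral + charge * PROTON) / charge


def by_ions(sequence):
    """Singly charged b- and y-ion m/z values, each list ordered from
    fragment length 1 to len(sequence) - 1."""
    b_ions, y_ions = [], []
    for i in range(1, len(sequence)):
        b_ions.append(sum(MONO[aa] for aa in sequence[:i]) + PROTON)
        y_ions.append(sum(MONO[aa] for aa in sequence[-i:]) + WATER + PROTON)
    return b_ions, y_ions


mz_2plus = precursor_mz("ALPYTPKK", 2)
mz_3plus = precursor_mz("ALPYTPKK", 3)
b_series, y_series = by_ions("ALPYTPKK")
```

The two precursor values give the PRM targets for the +2 and +3 charge states, and the b/y lists serve as the transitions to compare against the spectrum of the synthetic standard.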
The Scientist's Toolkit
Table 3: Essential Research Reagent and Software Solutions
| Category | Item/Software | Function in Integrated Workflow | Key Notes/Specifications |
|---|---|---|---|
| MS Data Generation | High-Resolution LC-MS/MS System (e.g., Q-Exactive, timsTOF) | Generates high-quality MS1 and MS2 spectral data for both discovery (DDA) and validation (PRM). | Essential for achieving the mass accuracy and resolution needed for NP-PRESS filtering and InstaNovo sequencing. |
| Computational Environment | Workstation with NVIDIA GPU (e.g., RTX 4090, A100) | Accelerates the training and inference of deep learning models like InstaNovo+. | Critical for practical turnaround times when processing hundreds to thousands of prioritized spectra [61]. |
| Core Analysis Software | NP-PRESS Algorithms (FUNEL, simRank) | Performs the initial critical dereplication and feature prioritization from complex LC-MS data [1]. | The foundational first stage that filters the data for AI analysis. |
| Core Analysis Software | InstaNovo Python Package (instanovo) | Executes the de novo peptide sequencing via command line or Python API [61]. | Version 1.1.0 natively supports key post-translational modifications relevant to NPs (e.g., oxidation, deamidation) [61]. |
| Data Format Bridge | MS Convert (ProteoWizard) / pymzml | Converts raw spectrometer files (.raw, .d) to open formats (.mzML, .mgf) for NP-PRESS and InstaNovo. | Ensures compatibility between instrument vendor software and open-source analysis pipelines. |
| Validation & Design | Skyline or Pyteomics Library | Creates theoretical spectral libraries for PRM assay design and provides a platform for targeted data analysis. | Enables the transition from in silico prediction to empirical, hypothesis-driven validation [62]. |
| Chemical Validation | Custom Peptide Synthesis Service | Provides synthetic analytical standards for definitive structural confirmation via co-elution experiments. | Final, definitive step to confirm the predicted structure of a novel natural product. |
The NP-PRESS two-stage dereplication strategy represents a significant methodological advance for natural product discovery, directly addressing the persistent challenge of metabolome complexity. By systematically removing irrelevant biotic and abiotic features through its FUNEL and simRank algorithms, the pipeline efficiently prioritizes novel secondary metabolites, as validated by the discovery of new surugamides and the baidienmycin family. This approach not only reduces the resource-intensive risk of erroneous isolations but also proves particularly powerful for mining unconventional microbial sources. Looking forward, the integration of NP-PRESS with rapidly evolving genomic mining tools and deep learning-based de novo sequencing models, such as InstaNovo, promises to create a more holistic, in-silico-first discovery framework. This convergence will further accelerate the identification and structural elucidation of novel bioactive compounds, reinvigorating the natural product pipeline for biomedical and clinical research in the face of emerging diseases and antibiotic resistance.