NP-PRESS: A Two-Stage MS Dereplication Strategy to Accelerate Natural Product Discovery

Hannah Simmons, Jan 09, 2026

This article details NP-PRESS, an innovative two-stage mass spectrometry pipeline designed to overcome the critical bottleneck of dereplication in natural product research.

Abstract

This article details NP-PRESS, an innovative two-stage mass spectrometry pipeline designed to overcome the critical bottleneck of dereplication in natural product research. Tailored for researchers and drug development professionals, it explores the strategy's foundational need in filtering complex metabolomes, its methodological core involving the FUNEL and simRank algorithms, practical guidance for troubleshooting and optimization, and a comparative analysis against traditional and emerging methods. By synthesizing proof-of-concept successes in discovering novel bioactive compounds, the article demonstrates how NP-PRESS refines metabolomic data to prioritize truly novel features, thereby reducing costly and fruitless isolation efforts and streamlining the path to new drug leads.

The Dereplication Bottleneck: Why a Two-Stage MS Strategy is a Game-Changer for NP Discovery

Defining the Dereplication Challenge in Modern Natural Product Research

The discovery of novel bioactive natural products (NPs) remains a cornerstone of pharmaceutical development, yet the process is fundamentally hindered by the persistent challenge of dereplication—the early and accurate identification of known compounds within complex biological extracts [1]. Modern high-resolution mass spectrometry (MS) and liquid chromatography-mass spectrometry (LC-MS) generate vast metabolomic datasets, but the true signals of novel, often low-abundance secondary metabolites are frequently obscured by an overwhelming background of interfering features [1]. These interferences originate not only from abiotic sources but, more problematically, from biotic processes, including microbial degradation products and media components, which are chemically analogous to target NPs and thus exceptionally difficult to filter out using conventional methods [1].

This challenge frames the critical need for advanced strategies that move beyond simple database matching. Effective dereplication must prioritize novelty by systematically removing both known compounds and irrelevant biological noise, thereby focusing precious research resources on the most promising, unidentified features. This article details the application notes and protocols for a modern solution to this challenge: the NP-PRESS (Natural Product PRIoritization via Elimination of Spectral Signatures) strategy, a two-stage MS feature dereplication framework. NP-PRESS integrates novel algorithmic filters to highlight new NPs by thoroughly eliminating overwhelming irrelevant features, enabling the discovery of novel chemical entities from diverse and underexplored bacterial sources [1].

Detailed Protocol: The NP-PRESS Two-Stage Dereplication Workflow

The NP-PRESS strategy is a methodical, two-tiered computational workflow designed to process LC-MS/MS data for the specific purpose of novel natural product discovery. Its core innovation lies in two custom algorithms, FUNEL (for MS1 data) and simRank (for MS2 data), which work in concert to remove irrelevant features [1].

Protocol: Application of the NP-PRESS Strategy

Objective: To prioritize LC-MS features corresponding to putative novel natural products by eliminating signals from known compounds and biotic interference.

Materials & Input Data:

  • Raw LC-MS/MS Data: High-resolution MS1 and data-dependent MS2 spectra from bacterial extract analysis.
  • Compound Databases: Local or public spectral libraries (e.g., GNPS, internal libraries).
  • NP-PRESS Software Pipeline: Implementing the FUNEL and simRank algorithms [1].
  • Bioinformatic Tools: For genomic analysis if correlating with BGCs.

Experimental Procedure:

  • Sample Preparation & LC-MS/MS Acquisition:

    • Prepare organic extracts from bacterial culture (e.g., Streptomyces albus J1074, Wukongibacter baidiensis M2B1) [1].
    • Analyze extracts using a reversed-phase LC gradient coupled to a high-resolution tandem mass spectrometer (e.g., Q-TOF, Orbitrap).
    • Acquire data in data-dependent acquisition (DDA) mode, collecting full-scan MS1 and subsequent MS2 fragmentation spectra for top ions.
  • Stage 1: MS1 Filtering with the FUNEL Algorithm:

    • Process: Submit the raw MS1 feature table (containing m/z, retention time, and intensity) to the FUNEL algorithm.
    • Action: FUNEL performs a blank subtraction and filters features based on isotopic patterns and retention time behaviors characteristic of non-secondary metabolites (e.g., lipids, peptides from media, common cellular metabolites).
    • Outcome: A significantly reduced list of MS1 features that are enriched for secondary metabolite-like compounds.
  • Stage 2: MS2 Dereplication with the simRank Algorithm:

    • Process: For the filtered features from Stage 1, compile their corresponding MS2 fragmentation spectra.
    • Action: The simRank algorithm compares each experimental MS2 spectrum against a curated reference database of known natural product spectra.
    • Discrimination: Unlike conventional spectral matching, simRank is tuned to discriminate against biotic interference by applying a stricter similarity threshold and accounting for fragmentation patterns of common biological background.
    • Outcome: Features with significant matches to known compounds are flagged for dereplication. The remaining features with no high-confidence database match are prioritized as putative novel compounds.
  • Priority List Generation & Downstream Analysis:

    • Generate a final ranked list of prioritized m/z-RT features.
    • Subject high-priority features to further purification (e.g., preparative HPLC) and structural elucidation (NMR).
    • Optionally, correlate prioritized features with genomic predictions from Biosynthetic Gene Clusters (BGCs) for enhanced validation.
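
The blank-subtraction part of Stage 1 can be illustrated with a minimal Python sketch. This is not the FUNEL implementation, whose internals are not specified here; the `Feature` type, the tolerances (`ppm_tol`, `rt_tol`), and the `fold` cutoff are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    mz: float         # mass-to-charge ratio
    rt: float         # retention time in minutes
    intensity: float  # peak area or height


def matches(a: Feature, b: Feature, ppm_tol: float, rt_tol: float) -> bool:
    """True when two features agree in m/z (ppm) and retention time (min)."""
    ppm_error = abs(a.mz - b.mz) / b.mz * 1e6
    return ppm_error <= ppm_tol and abs(a.rt - b.rt) <= rt_tol


def blank_subtract(sample, blank, ppm_tol=10.0, rt_tol=0.2, fold=5.0):
    """Keep sample features that are absent from the blank, or at least
    `fold` times more intense than every matching blank feature."""
    kept = []
    for f in sample:
        blank_hits = [b for b in blank if matches(f, b, ppm_tol, rt_tol)]
        if all(f.intensity >= fold * b.intensity for b in blank_hits):
            kept.append(f)
    return kept


# Synthetic example values (hypothetical, not from the study)
sample = [
    Feature(301.1410, 5.20, 8.0e5),   # also in blank at similar intensity
    Feature(912.6280, 11.40, 2.5e6),  # absent from blank
    Feature(455.2900, 7.85, 9.0e6),   # in blank, but 30x more intense here
]
blank = [
    Feature(301.1412, 5.22, 7.5e5),
    Feature(455.2903, 7.80, 3.0e5),
]

for f in blank_subtract(sample, blank):
    print(f"{f.mz:.4f} @ {f.rt:.2f} min")
```

A fold-change rule rather than strict presence/absence is used so that a genuine metabolite is not discarded because of low-level carryover in the blank; the appropriate cutoff depends on the instrument and matrix.
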

Performance Metrics & Validation

The efficacy of the NP-PRESS strategy has been validated through the discovery of new NPs. Applied to Streptomyces albus J1074, it guided the discovery of new surugamide analogs. More significantly, its application to the unusual anaerobic bacterium Wukongibacter baidiensis M2B1 led to the discovery of the baidienmycins, a new family of depsipeptides with potent antimicrobial and anticancer activities [1]. These cases underscore its utility in mining complex datasets from diverse bacteria, particularly extremophiles.

Table 1: Key Performance Metrics of the NP-PRESS Dereplication Strategy.

| Metric | Description | Outcome in Validation Studies |
| --- | --- | --- |
| Feature Reduction Rate | Percentage of initial LC-MS features filtered out in Stage 1 (FUNEL). | Dramatically reduces feature count, eliminating >50% of irrelevant biotic/abiotic interferences [1]. |
| Novel Compound Prioritization | Ability to rank unknown features leading to successful isolation of new NPs. | Successfully prioritized signals leading to discovery of baidienmycins and new surugamide analogs [1]. |
| Dereplication Accuracy | Specificity in correctly identifying known compounds via simRank MS2 matching. | High-confidence dereplication of knowns, minimizing false negatives for novel compounds [1]. |
| Application Scope | Suitability for diverse microbial taxa, including challenging cultures. | Proven effective for standard actinomycetes (Streptomyces) and unusual anaerobic bacteria [1]. |

Diagram 1: NP-PRESS Two-Stage Dereplication Workflow. Raw LC-MS/MS data (MS1 and MS2 features) → Stage 1: MS1 filtering (FUNEL algorithm), which removes biotic/abiotic noise → filtered MS1 feature list (enriched for secondary metabolites) → Stage 2: MS2 dereplication (simRank algorithm), matched against a spectral database of known NPs → either a known compound (high-similarity match, dereplicated) or prioritized novel features (no confident match) → purification and structural elucidation → novel natural product identified.

Modern Dereplication Strategies & Protocols

Beyond NP-PRESS, contemporary dereplication is a multi-faceted process integrating chemical and genetic data to maximize the efficiency of novel compound discovery.

Protocol: Integrating Genetic Barcoding with Metabolomics for Library Enhancement

Objective: To rationally build a natural product library with broad metabolite diversity by linking phylogenetic clades with chemical feature accumulation [2].

Procedure:

  • Isolate Collection & Barcoding:

    • Obtain microbial isolates (e.g., fungal strains from soil).
    • Extract genomic DNA and sequence a phylogenetic barcode locus (e.g., Internal Transcribed Spacer (ITS) for fungi).
    • Cluster isolates into sequence-based clades based on a defined similarity threshold (e.g., >90% ITS similarity) [2].
  • Metabolomic Profiling:

    • Culture all isolates under standardized metabolite-production conditions.
    • Prepare extracts and analyze by LC-MS under uniform parameters.
    • Process data to detect chemical features (unique m/z-RT pairs).
  • Bifunctional Data Analysis:

    • Perform Principal Coordinate Analysis (PCoA) on chemical feature data to identify metabolomic clusters [2].
    • Correlate sequence-based clades with metabolomic clusters. Identify clades that are chemical "hotspots" or "deserts."
    • Construct a feature accumulation curve: Plot the cumulative number of unique chemical features detected against the number of isolates analyzed [2].
  • Actionable Library Design:

    • Use the curve to determine the point of diminishing returns (e.g., where adding 10 new isolates yields <1% new features).
    • Focus collection efforts on undersampled phylogenetic clades that contribute disproportionately to new chemical diversity.
    • The goal is to achieve a predetermined coverage target (e.g., 99% of estimated chemical features) with an optimal, rationalized number of isolates [2].
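
The feature accumulation curve described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the isolate names and feature IDs are invented, the permutation count is arbitrary, and coverage is measured against the features observed in the dataset rather than against an estimated total richness, which is a simplification of the approach in [2].

```python
import random


def accumulation_curve(isolate_features, n_permutations=200, seed=0):
    """Mean cumulative count of unique features versus isolates sampled,
    averaged over random orderings of the isolates."""
    rng = random.Random(seed)
    isolates = list(isolate_features)
    totals = [0.0] * len(isolates)
    for _ in range(n_permutations):
        rng.shuffle(isolates)
        seen = set()
        for i, name in enumerate(isolates):
            seen |= isolate_features[name]
            totals[i] += len(seen)
    return [t / n_permutations for t in totals]


def isolates_for_coverage(curve, target=0.99):
    """Smallest number of isolates whose mean feature count reaches
    `target` times the total observed in the dataset."""
    total = curve[-1]
    for n, count in enumerate(curve, start=1):
        if count >= target * total:
            return n
    return len(curve)


# Hypothetical presence/absence data: isolate -> set of m/z-RT feature IDs
isolate_features = {
    "iso1": {"f1", "f2", "f3"},
    "iso2": {"f2", "f3", "f4"},
    "iso3": {"f1", "f5"},
    "iso4": {"f3", "f4"},
    "iso5": {"f6"},
}

curve = accumulation_curve(isolate_features)
print([round(c, 2) for c in curve])
print(isolates_for_coverage(curve, target=0.99))
```

Averaging over random orderings removes the dependence on the order in which isolates happened to be collected, so the slope at the right-hand end of the curve reflects how much new chemistry an additional isolate is expected to add.
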

Table 2: Quantitative Insights from Genetic Barcoding-Metabolomics Integration in Alternaria Fungi [2].

| Analysis Parameter | Quantitative Finding | Implication for Library Design |
| --- | --- | --- |
| Isolates Needed for ~99% Coverage | 195 isolates | Provides a quantitative target for library size to efficiently capture genus-level diversity. |
| Proportion of "Singleton" Features | 17.9% of features appeared in only one isolate. | Indicates high chemical rarity; very deep sampling is required to capture full diversity. |
| Clade-Chemistry Correlation | Non-equivalent levels of chemical diversity across different ITS clades. | Enables targeted sampling of genetically distinct, chemically rich clades. |
| Key Tool | Feature accumulation curves. | Allows real-time monitoring and prediction of chemical diversity coverage during library building. |

Advanced Strategy: Machine Learning-Enhanced Dereplication

Machine learning (ML) models are increasingly deployed to predict compound class, bioactivity, or structural novelty directly from MS or spectral data, adding a predictive layer to dereplication [3].

Protocol Outline: ML Model Training for Spectral Classification

  • Data Curation & Preprocessing:

    • Assemble Training Set: Collect a large dataset of MS2 spectra labeled with compound classes (e.g., peptide, polyketide, terpene).
    • Preprocess Spectra: Apply baseline correction, noise reduction, and normalization. Savitzky-Golay smoothing can be used for spectral data [3].
    • Feature Engineering: Convert spectra into feature vectors (e.g., using binning of m/z values, intensity thresholds, or molecular fingerprints).
  • Model Selection & Training:

    • For classification tasks, train models like Support Vector Machines (SVM), Random Forest, or neural networks [3].
    • For dimensionality reduction and visualization, use Principal Component Analysis (PCA) or Partial Least Squares (PLS) methods [3].
    • Split data into training, validation, and test sets.
  • Model Deployment in Dereplication:

    • Input the MS2 spectrum of an unknown, prioritized feature from the NP-PRESS pipeline.
    • The ML model predicts its most likely compound class or structural scaffold.
    • This prediction guides downstream analysis—for instance, suggesting specific NMR experiments or database search parameters.
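
The feature-engineering and classification steps in this outline can be sketched as follows. To stay dependency-free, the sketch bins each MS2 spectrum into a fixed-length vector and uses a nearest-centroid classifier as a simple stand-in for the SVM, Random Forest, or neural network models named above; the bin width, m/z range, class labels, and spectra are all synthetic assumptions.

```python
import math


def bin_spectrum(peaks, mz_min=50.0, mz_max=1500.0, bin_width=1.0):
    """Convert (m/z, intensity) peaks to a fixed-length, unit-norm vector."""
    n_bins = int(math.ceil((mz_max - mz_min) / bin_width))
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            vec[int((mz - mz_min) / bin_width)] += intensity
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def train_centroids(spectra, labels):
    """Average the binned vectors of each class into one centroid."""
    sums, counts = {}, {}
    for peaks, label in zip(spectra, labels):
        vec = bin_spectrum(peaks)
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def predict(centroids, peaks):
    """Return the class whose centroid has the largest dot product
    with the binned query spectrum."""
    vec = bin_spectrum(peaks)
    return max(
        centroids,
        key=lambda lab: sum(a * b for a, b in zip(centroids[lab], vec)),
    )


# Synthetic training spectra (illustrative only)
spectra = [
    [(70.07, 40.0), (86.10, 100.0), (120.08, 60.0)],
    [(70.07, 80.0), (86.10, 50.0), (120.08, 30.0), (200.10, 10.0)],
    [(153.05, 100.0), (171.06, 70.0), (189.07, 20.0)],
    [(153.05, 60.0), (171.06, 100.0), (207.08, 30.0)],
]
labels = ["peptide", "peptide", "polyketide", "polyketide"]

centroids = train_centroids(spectra, labels)
unknown = [(86.10, 100.0), (120.08, 50.0), (300.20, 20.0)]
print(predict(centroids, unknown))
```

With a real training set, the same binned vectors could be passed to a library classifier, and the held-out validation and test splits mentioned in the outline would be needed to estimate how well the predictions generalize.
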

Diagram 2: Integrated Dereplication Strategy Roadmap. Microbial isolate → genetic barcoding (e.g., ITS sequencing) → phylogenetic clade assignment, which informs the sampling strategy of the NP-PRESS pipeline. In parallel: microbial isolate → standardized metabolite production → LC-MS/MS analysis → chemical feature detection → NP-PRESS pipeline (dereplication and prioritization) → either a dereplicated known compound, or a prioritized unknown feature → ML prediction (class/activity) → high-confidence novel target.

The Scientist's Toolkit: Essential Research Reagent Solutions

Table 3: Key Reagents, Materials, and Software for Advanced Dereplication.

| Tool/Reagent | Function in Dereplication | Application Notes |
| --- | --- | --- |
| High-Resolution LC-MS System (e.g., UPLC-QTOF, UPLC-Orbitrap) | Generates high-fidelity MS1 and MS2 spectral data for feature detection and algorithmic processing. | Essential for initial data acquisition. Requires precise calibration for accurate mass measurements. |
| FUNEL & simRank Algorithms (NP-PRESS) | Core computational filters for removing biotic interference and dereplicating known compounds via MS2 matching [1]. | Custom software; critical for executing the two-stage NP-PRESS strategy. |
| Global Natural Products Social Molecular Networking (GNPS) Library | A public, crowd-sourced database of MS2 spectra for known natural products. | Serves as a key reference database for the simRank algorithm and standard spectral library searches. |
| ITS/16S rRNA PCR Primers & Sequencing Kits | Enables genetic barcoding of fungal/bacterial isolates to establish phylogenetic clades [2]. | Allows correlation of chemical diversity with genetic diversity for rationalized library design. |
| Machine Learning Platforms (e.g., Python scikit-learn, TensorFlow) | Provides environment to build, train, and deploy models for spectral classification and novelty prediction [3]. | Used to add a predictive layer to dereplication, moving from identification to forecasting compound properties. |
| Solid Phase Extraction (SPE) Cartridges (C18, HLB) | Pre-fractionates complex crude extracts to reduce complexity prior to LC-MS analysis. | Can lower ion suppression and simplify chromatograms, improving feature detection. |

The dereplication challenge in modern natural product research is no longer a simple task of database lookup. It is a sophisticated data triage process that requires integrated strategies to separate novel bioactive compound signals from a dense background of chemical noise. The NP-PRESS two-stage strategy, employing the FUNEL and simRank algorithms, provides a robust, specialized protocol to directly address the critical problem of biotic interference, effectively prioritizing novel chemical entities [1]. This core methodology is powerfully augmented by quantitative library-building approaches that use genetic barcoding and feature accumulation curves to optimize source selection [2], and by machine learning models that add predictive power to the analysis of spectral data [3]. Together, these protocols form a comprehensive, modern dereplication pipeline that transforms the discovery process from one of serendipity to a rational, data-driven endeavor, significantly accelerating the identification of new natural products for drug development.

In untargeted metabolomics, particularly within natural product (NP) discovery, the true signal of interest is often buried in noise. This noise originates from a complex background of irrelevant chemical features generated by both biotic processes (e.g., microbial degradation of cellular components and media) and abiotic processes (e.g., spontaneous chemical reactions and environmental contaminants) [1] [4]. The consequence is a high cost in research efficiency: significant resources are wasted on the fruitless isolation and structural elucidation of these interfering compounds, diverting effort from genuine bioactive metabolites [4].

The scale of this problem is immense. While a single plant species may contain up to an estimated 5,000 metabolites, a typical mass spectrometry (MS) run may detect tens of thousands of molecular features, most of which are not the target secondary metabolites [5]. This creates a classic "needle in a haystack" scenario. The NP-PRESS framework addresses this directly. It is a two-stage MS feature dereplication strategy designed to systematically prioritize novel natural products by thoroughly removing overwhelming irrelevant features, thereby refining the metabolome for efficient analysis [1] [4].

Core Methodology: The NP-PRESS Two-Stage Dereplication Strategy

The NP-PRESS pipeline is built upon two novel computational algorithms applied sequentially to LC-MS/MS data to filter out irrelevant signals and highlight putative novel natural products [4].

Stage 1: FUNEL (Filtering Using Neutral Loss) – This stage operates on MS1-level data. FUNEL identifies and removes features originating from expected biochemical noise, such as known media components, common cellular building blocks, and their predictable derivatives (e.g., adducts, fragments, and neutral loss patterns). It functions as a high-stringency filter to drastically reduce dataset complexity before more detailed analysis [4].
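
One way to picture the "predictable derivatives" mentioned above is to look for co-eluting MS1 features separated by a known mass difference. The sketch below is an illustration of that idea only, not the FUNEL algorithm: the mass differences are standard monoisotopic values for singly charged positive-mode ions, while the tolerances, the feature list, and the assumption that the reference feature is the [M+H]+ ion are hypothetical.

```python
# Monoisotopic m/z differences relative to [M+H]+ (singly charged, positive mode)
KNOWN_DELTAS = {
    "[M+NH4]+": 17.026549,
    "[M+Na]+": 21.981944,
    "[M+K]+": 37.955881,
    "[M+H-H2O]+": -18.010565,
}


def annotate_related(features, mz_tol=0.005, rt_tol=0.05):
    """Return {index_of_derived_feature: (index_of_parent, label)} for
    co-eluting feature pairs whose m/z difference matches a known delta.
    Each parent is assumed (hypothetically) to be the [M+H]+ ion."""
    related = {}
    for i, (mz_i, rt_i) in enumerate(features):
        for j, (mz_j, rt_j) in enumerate(features):
            if i == j or abs(rt_i - rt_j) > rt_tol:
                continue
            for label, delta in KNOWN_DELTAS.items():
                if abs((mz_j - mz_i) - delta) <= mz_tol:
                    related[j] = (i, label)
    return related


# Synthetic (m/z, retention time in min) features
features = [
    (500.3000, 6.10),  # putative [M+H]+
    (522.2819, 6.11),  # sodium adduct of feature 0
    (482.2894, 6.10),  # water loss from feature 0
    (500.3000, 9.40),  # same m/z, different RT: unrelated
]

for child, (parent, label) in sorted(annotate_related(features).items()):
    print(f"feature {child} is {label} of feature {parent}")
```

Collapsing such related signals to one representative per compound reduces the feature count without discarding any chemistry, which is why this kind of grouping is usually done before any similarity scoring.
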

Stage 2: simRank (Spectral Similarity Ranking) – This stage analyzes MS2 fragmentation spectra. simRank evaluates the spectral similarity of remaining features against comprehensive databases of known natural products and their derivatives. Features with high similarity to known compounds are dereplicated, while those with novel or unusual fragmentation patterns are prioritized for further investigation [4].
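
The general idea of scoring an experimental MS2 spectrum against a reference library can be sketched with a plain cosine similarity. This is a generic spectral-matching sketch, not the simRank scoring function, which is not specified here; the m/z tolerance, the 0.7 threshold, the library names, and all peak lists are assumptions.

```python
import math


def cosine_score(spec_a, spec_b, mz_tol=0.02):
    """Greedy cosine similarity between two peak lists of (m/z, intensity).
    Each peak is matched at most once, strongest products first."""
    candidates = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            if abs(mz_a - mz_b) <= mz_tol:
                candidates.append((int_a * int_b, i, j))
    candidates.sort(reverse=True)
    used_a, used_b, dot = set(), set(), 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            dot += product
    norm_a = math.sqrt(sum(x * x for _, x in spec_a))
    norm_b = math.sqrt(sum(x * x for _, x in spec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dereplicate(query, library, threshold=0.7):
    """Return (best_name, best_score, is_known) for one query spectrum."""
    best_name, best_score = None, 0.0
    for name, ref in library.items():
        score = cosine_score(query, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score, best_score >= threshold


# Hypothetical reference library and queries
library = {
    "known_A": [(100.05, 50.0), (200.10, 100.0), (300.15, 25.0)],
    "known_B": [(120.08, 100.0), (250.20, 40.0)],
}
query_known = [(100.051, 45.0), (200.101, 100.0), (300.149, 30.0)]
query_novel = [(133.07, 100.0), (415.30, 60.0)]

print(dereplicate(query_known, library))
print(dereplicate(query_novel, library))
```

A query that shares no fragments with any reference scores zero and is carried forward as a candidate; in practice a minimum number of matched peaks is often required as well, so that a single shared fragment cannot produce a spurious match.
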

The effectiveness of this strategy is demonstrated in its application to microbial strains. For instance, when applied to Streptomyces albus J1074, NP-PRESS facilitated the identification of new surugamide analogs. More notably, its use on the anaerobic bacterium Wukongibacter baidiensis M2B1 led to the discovery of the baidienmycins, a new family of depsipeptides with potent bioactivity [4].

Table: Key Performance Outcomes of the NP-PRESS Strategy

| Microbial Strain | NP-PRESS Application Outcome | Significance |
| --- | --- | --- |
| Streptomyces albus J1074 | Identification of new surugamide analogs | Validated pipeline on a model streptomycete [4] |
| Wukongibacter baidiensis M2B1 | Discovery of baidienmycins (new depsipeptides) | Uncovered novel chemistry from an unusual anaerobe; compounds show potent antimicrobial and anticancer activities [1] [4] |

Detailed Experimental Protocols

Protocol 1: Assessing Biotic and Abiotic Interference in Microbial Metabolomes

This protocol is designed to characterize the sources of irrelevant signals in microbial fermentation samples, a prerequisite for effective dereplication.

1. Cultivation and Sample Generation:

  • Cultivate the target microbe (e.g., Streptomyces sp.) in suitable liquid media in triplicate [6].
  • Prepare two critical control samples in parallel: 1) Spent Media Control: Inoculate media with a sterile inoculum (e.g., heat-killed cells) and incubate under identical conditions to capture abiotic degradation products of the media. 2) Non-Inoculated Media Control: Incubate sterile media alone to establish the baseline chemical profile [4].
  • Harvest cells and culture broth at the target stationary phase. Separate cells from supernatant via centrifugation (e.g., 8,000 × g, 15 min, 4°C).

2. Metabolite Quenching and Extraction:

  • Rapid Quenching: Immediately quench metabolism by plunging cell pellets into a pre-cooled extraction solvent like methanol:water (4:1, v/v) at -20°C [6].
  • Comprehensive Extraction: For intracellular metabolites, homogenize quenched cells (e.g., bead beating) in the extraction solvent. For extracellular metabolites, mix supernatant with an equal volume of extraction solvent. A biphasic extraction (e.g., using methanol, chloroform, and water) can be employed for broader metabolite coverage [5].
  • Centrifuge extracts (13,000 × g, 10 min, 4°C), collect supernatant, and dry under a gentle stream of nitrogen. Reconstitute dried extracts in a solvent compatible with LC-MS injection (e.g., 100 µL of 80% methanol).

3. LC-MS/MS Data Acquisition:

  • Chromatography: Use Reversed-Phase UPLC (e.g., C18 column) with a water/acetonitrile gradient, both containing 0.1% formic acid, for broad secondary metabolite separation [5].
  • Mass Spectrometry: Employ a high-resolution mass spectrometer (e.g., Q-TOF or Orbitrap) capable of data-dependent acquisition (DDA). Acquire full-scan MS1 data (e.g., m/z 100-1500) and automatically trigger MS2 scans for the top N most intense ions in each cycle.
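
The top-N precursor selection performed in each DDA cycle can be sketched as below. This is a simplified model for illustration: real instrument methods add further rules (for example exclusion lists and charge-state filters), and the `min_intensity` cutoff and scan values are assumptions.

```python
def select_precursors(ms1_scan, top_n=5, min_intensity=1e4):
    """Pick the `top_n` most intense ions above `min_intensity`
    from one MS1 scan given as (m/z, intensity) pairs."""
    eligible = [(mz, inten) for mz, inten in ms1_scan if inten >= min_intensity]
    eligible.sort(key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in eligible[:top_n]]


# Synthetic MS1 scan
scan = [
    (150.1, 5.0e5),
    (301.2, 2.0e6),
    (455.3, 8.0e3),  # below the intensity cutoff
    (622.4, 9.0e5),
    (801.5, 3.0e4),
]

print(select_precursors(scan, top_n=3))
```

Ions that fall outside the top N in a cycle receive no MS2 spectrum in that cycle, so low-abundance features may reach the MS2 stage without fragmentation data; this is worth checking before interpreting an absent library match as novelty.
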

4. Interference Analysis with NP-PRESS:

  • Process raw data (peak picking, alignment) using software like MZmine or the MetaboAnalyst LC-MS module [7].
  • Import the aligned feature table (with MS1 and MS2 data) into the NP-PRESS pipeline.
  • Apply the FUNEL algorithm to filter features common to the spent media and non-inoculated controls, removing abiotic and media-derived biotic interference.
  • Apply the simRank algorithm to the remaining features against natural product libraries (e.g., GNPS) to dereplicate known compounds.
  • The final output is a prioritized list of features unique to the live culture and dissimilar to known compounds, representing high-priority targets for novel NP discovery [4].
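
The control-based filtering in step 4 can be pictured on an aligned feature table with replicate live cultures and both control types. The sketch below is a hedged illustration rather than the pipeline's own logic: the column names (`killed` for the heat-killed inoculum control, `media` for the non-inoculated control), the `min_fold` cutoff, the intensity floor, and the requirement of detection in every replicate are all assumptions.

```python
from statistics import mean


def filter_against_controls(table, live_cols, control_cols,
                            min_fold=10.0, floor=1.0):
    """Keep feature IDs detected in every live replicate whose mean live
    intensity is at least `min_fold` times the highest control intensity.
    `floor` replaces zero control intensities to avoid division by zero."""
    kept = []
    for feature_id, row in table.items():
        live = [row[c] for c in live_cols]
        if any(v <= 0 for v in live):
            continue  # not reproducibly detected in the live cultures
        control = max(max(row[c] for c in control_cols), floor)
        if mean(live) / control >= min_fold:
            kept.append(feature_id)
    return kept


# Synthetic aligned feature table: feature ID -> intensity per sample
table = {
    "F1": {"live1": 5.0e6, "live2": 4.0e6, "live3": 6.0e6, "killed": 0.0, "media": 0.0},
    "F2": {"live1": 2.0e6, "live2": 2.2e6, "live3": 1.9e6, "killed": 1.8e6, "media": 2.1e6},
    "F3": {"live1": 3.0e5, "live2": 0.0, "live3": 0.0, "killed": 0.0, "media": 0.0},
}

print(filter_against_controls(
    table,
    live_cols=["live1", "live2", "live3"],
    control_cols=["killed", "media"],
))
```

Here F2 is removed as media-derived and F3 as irreproducible, leaving F1 as the only feature attributable to the live culture.
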

Protocol 2: Visual Diagnostics for Interference in Untargeted Workflows

Effective visualization is critical for evaluating data quality and the impact of interference at each processing step [8].

1. Pre-processing and Quality Control Visualization:

  • Total Ion Chromatogram (TIC) Overlay: Align and overlay TICs from all samples (biological replicates, spent media controls, blank injections). This visual check identifies major shifts in retention time, dramatic intensity variations, or large contaminant peaks in controls [9].
  • Principal Component Analysis (PCA) on Raw Features: Perform PCA on the pre-normalized feature intensity table. Plot PC1 vs. PC2. Expected Outcome: QC samples (if available) should cluster tightly, and biological controls (spent media) should separate distinctly from true biological samples, visually demonstrating the magnitude of interference [9] [8].

2. Post-Dereplication Evaluation Visualization:

  • Venn Diagram or UpSet Plot: Visually summarize the number of molecular features detected in: A) Live Culture Samples, B) Spent Media Controls, C) Non-Inoculated Media. The unique features in group A represent the putative true metabolome after accounting for abiotic/media interference [8].
  • Volcano Plot Post-NP-PRESS: After applying FUNEL and simRank, create a volcano plot comparing feature intensity in the live culture versus the pooled controls. The x-axis represents log2(fold-change), calculated as live culture over controls, and the y-axis represents -log10(p-value). High-priority, unique NPs should appear as statistically significant, high fold-change outliers in the upper-right quadrant; features in the upper-left quadrant are enriched in the controls and are not candidates [9].
  • Hierarchical Clustering Heatmap: Generate a heatmap of the top N prioritized features (z-score normalized intensities) across all sample types. This confirms that the prioritized signals are abundant in live cultures and absent in controls, and can reveal co-expression patterns suggestive of related biosynthetic pathways [9] [8].
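
Two of the summaries above reduce to simple computations before any plotting: the region counts for the Venn or UpSet plot, and the per-feature z-scores used for the heatmap. The sketch below shows both under stated assumptions; the feature IDs and intensities are synthetic, and the population standard deviation is an arbitrary choice for the z-score.

```python
from statistics import mean, pstdev


def venn_regions(live, spent, media):
    """Partition feature IDs by where they were detected."""
    controls = spent | media
    return {
        "live_only": live - controls,
        "shared_with_controls": live & controls,
        "controls_only": controls - live,
    }


def zscore_row(values):
    """Z-score one feature across samples; a constant row maps to zeros."""
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


# Synthetic feature sets per sample group
live = {"F1", "F2", "F3"}
spent = {"F2", "F4"}
media = {"F2", "F3", "F5"}

regions = venn_regions(live, spent, media)
print({name: sorted(ids) for name, ids in regions.items()})

# One feature across [live1, live2, control1, control2]
print([round(z, 2) for z in zscore_row([9.0e5, 1.1e6, 1.0e4, 2.0e4])])
```

The `live_only` set corresponds to the unique features in group A of the Venn diagram, and the z-scored rows are what the heatmap colour scale would display.
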

[Diagram: NP-PRESS Two-Stage Dereplication Workflow. Stage 1, MS1 filtering (FUNEL): aligned MS1 feature table plus metadata → apply FUNEL algorithm (neutral loss/adduct rules) → remove features matching media and control profiles → filtered feature table (~50-80% reduction). Stage 2, MS2 prioritization (simRank): MS2 spectra for filtered features → compute simRank scores vs. NP spectral database → dereplicate knowns, prioritize novel spectra → prioritized target list for novel NPs.]

Diagram: A two-stage workflow showing the sequential application of FUNEL and simRank algorithms to filter and prioritize mass spectrometry data for natural product discovery.

Visualization of Signaling and Interference Pathways

Interpreting metabolomic data requires contextualizing metabolites within their biological pathways and understanding how stress signaling can generate interference [10].

[Diagram: Plant Stress Signaling & Metabolomic Interference. Abiotic stress (e.g., drought, salt) → ROS burst and signal cascades; biotic stress (e.g., pathogen) → hormone signaling (JA, SA, ABA); both stress types → cellular and media component degradation. ROS and hormone signaling → activation of transcription factors → metabolic network reprogramming → accumulation of primary metabolites (proline, GABA, BCAAs) and induction of secondary metabolites (phenolics, alkaloids). Primary metabolite turnover and degradation → non-target catabolic byproducts. Primary metabolites, secondary metabolites, and byproducts → detected MS signal pool (target metabolites + interference).]

Diagram: Illustrates how biotic and abiotic stresses trigger both target metabolic responses (primary and secondary metabolites) and the generation of interfering chemical byproducts that co-elute in mass spectrometry analysis.

Table: Key Research Reagent Solutions for Interference-Aware Metabolomics

| Reagent/Material | Function in Protocol | Rationale & Consideration |
| --- | --- | --- |
| Stable Isotope-Labeled Media (e.g., U-¹³C glucose) | Cultivation of microbes in controls and experimental samples. | Enables tracking of true microbial metabolites vs. abiotic carryover from media via isotope patterns; crucial for validating FUNEL filters [5]. |
| Biphasic Extraction Solvents (Methanol/Chloroform/Water) | Comprehensive metabolite extraction from cell pellets and broth. | Provides broad recovery of both polar and non-polar metabolites, ensuring the detected "interference" profile is representative of the total chemical space [5]. |
| Solid-Phase Extraction (SPE) Cartridges (C18, HLB, mixed-mode) | Clean-up and fractionation of crude extracts prior to LC-MS. | Removes salts and highly polar media components that cause ion suppression and column degradation, reducing a major source of abiotic interference and improving sensitivity for target NPs. |
| Quality Control (QC) Reference Mix | Injection at regular intervals during LC-MS sequence. | Monitors instrument stability; data from QC samples is used for signal correction and to distinguish technical drift from biological variation [6]. |
| MS Spectral Databases (GNPS, NIST, In-house libraries) | Reference for simRank algorithm and manual validation. | Essential for the dereplication stage. The comprehensiveness of the database directly impacts the false-positive rate for novelty claims [5] [4]. |
| Metabolomics Analysis Software (MetaboAnalyst [7], MZmine, GNPS) | Data processing, statistical analysis, and visualization. | Platforms like MetaboAnalyst integrate multiple visualization strategies (PCA, volcano plots, heatmaps) critical for diagnosing interference and interpreting NP-PRESS output [9] [8] [7]. |

The high cost of irrelevant signals in metabolomics—measured in wasted time, resources, and missed discoveries—is a critical bottleneck in natural product research and metabolomics broadly. The NP-PRESS two-stage dereplication strategy provides a robust, algorithmic framework to address this by systematically subtracting biotic and abiotic interference. By integrating careful experimental design with controlled samples, followed by sequential FUNEL (MS1) and simRank (MS2) filtering, researchers can transform a complex, noisy metabolome into a refined list of high-priority candidates. This strategy, supported by diagnostic visualizations and a dedicated toolkit, directly enhances the probability of discovering novel, bioactive natural products by ensuring that analytical effort is focused on true signals of biological and chemical novelty.

The discovery of novel natural products (NPs) from microbial metabolomes represents a cornerstone of pharmaceutical development, yielding compounds with potent antimicrobial, anticancer, and various other therapeutic activities [1]. However, the analytical landscape is dominated by a critical bottleneck: the sheer complexity of metabolomic data. In a typical liquid chromatography-tandem mass spectrometry (LC-MS/MS) experiment, the signals of bioactive secondary metabolites are obscured by an overwhelming majority of irrelevant chemical features originating from abiotic processes, culture media, and cellular degradation products [4]. Traditional one-stage dereplication approaches, which often rely on direct database matching of MS/MS spectra, struggle to differentiate these interfering biotic features from true NPs, leading to high rates of false positives and fruitless isolation campaigns [1].

This document details a two-stage metabolome refining pipeline known as NP-PRESS. Framed within a broader thesis on advanced dereplication strategies, NP-PRESS introduces a conceptual shift from simple feature filtering to a systematic, two-stage data refinement process. This strategy employs two novel algorithms (FUNEL for MS1-level filtering and simRank for MS2-level prioritization) to sequentially remove irrelevant features and highlight putative novel NPs [1]. The following application notes and protocols provide a comprehensive guide to implementing this strategy, complete with experimental workflows, data visualization standards, and a toolkit for researchers.

The NP-PRESS Methodology: A Two-Stage Refinement Pipeline

The NP-PRESS pipeline is engineered to deconvolute complex metabolomes by sequentially applying distinct data refinement steps at the MS1 and MS2 levels. This staged approach ensures a thorough removal of non-relevant features before committing resources to the detailed analysis of prioritized candidates [4].

Stage 1: MS1-Level Filtering with FUNEL

The first stage addresses the "signal surplus" from biotic and abiotic interferences. The FUNEL algorithm operates on MS1 spectral data. Its core function is to perform comparative metabolomics between the sample of interest (e.g., a bacterial fermentation) and a set of control samples. These controls are designed to capture the chemical background and include sterile media, spent media from non-producing strains, or cells harvested during non-productive growth phases. FUNEL identifies and subtracts MS1 features that are statistically non-significant or are consistently present in these control samples. This step drastically reduces the dataset's complexity by eliminating the 70-90% of total features that are attributed to media components and primary metabolic debris [1] [4].

Stage 2: MS2-Level Prioritization with simRank

The second stage focuses on the remaining, refined feature set. The simRank algorithm analyzes the MS/MS fragmentation spectra of these features. Instead of relying solely on direct matches against spectral libraries, which are often incomplete for novel compounds, simRank calculates spectral similarity networks. It prioritizes features that exhibit moderate spectral relatedness to known natural product families or structural classes within databases like GNPS (Global Natural Products Social Molecular Networking) but are not direct matches. This favors "scaffold-relative" novelty: compounds that are structurally new but may share biosynthetic logic with known families, making them prime candidates for discovery [1]. The final output is a shortlist of high-priority m/z-RT features accompanied by annotated putative structural classes and novelty scores.

Table: Core Algorithmic Functions in the NP-PRESS Pipeline

| Algorithm | Stage | Primary Data Input | Core Function | Key Outcome |
|---|---|---|---|---|
| FUNEL | 1 | MS1 (Precursor Ion) | Comparative analysis against control samples to subtract background features. | Removal of biotic/abiotic interference; drastic reduction of feature list. |
| simRank | 2 | MS2 (Fragmentation Spectra) | Spectral similarity networking and ranking against known NP libraries. | Prioritization of features with scaffold-relative novelty. |

[Workflow diagram. Stage 1 (MS1-level refinement): raw LC-MS/MS data (10,000+ features) and control metabolomes (sterile media, null strains) feed the FUNEL algorithm (comparative filtering), which outputs a refined feature list (~1,000-2,000 features) and a set of removed features (media, degradation products). Stage 2 (MS2-level prioritization): MS2 spectra extracted for the refined list are passed, together with reference NP libraries (e.g., GNPS), to the simRank algorithm (spectral networking), which outputs high-priority NP candidates (~10-50 features) and annotated knowns.]

Detailed Experimental Protocols

Protocol 1: Bacterial Cultivation and Metabolome Extraction for NP-PRESS

Principle: Generate paired experimental and control samples to feed the FUNEL algorithm. The goal is to produce a metabolome enriched for secondary metabolites while simultaneously capturing the chemical background from all non-producing sources [4].

Materials:

  • Microbial Strain: Target strain (e.g., Streptomyces albus J1074) and an appropriate null control (non-producing mutant or closely related strain).
  • Growth Media: Appropriate liquid production medium (e.g., ISP2, R5A for actinomycetes) and matching sterile control media.
  • Extraction Solvents: HPLC-grade methanol, acetonitrile, ethyl acetate, and water. Acid (e.g., 1% formic acid) or base may be added for ion pairing.
  • Equipment: Centrifuge, vacuum concentrator, ultrasonic bath, lyophilizer, and 0.22 µm PTFE syringe filters.

Procedure:

  • Cultivation: Inoculate the target strain and the null control strain into triplicate flasks of production medium. Include triplicate flasks of sterile medium as an abiotic control. Incubate under optimal conditions (e.g., 28°C, 200 rpm) for the required time (e.g., 5-7 days for actinomycetes).
  • Harvest: Separate the biomass from the culture broth by centrifugation (e.g., 8,000 × g, 20 min, 4°C).
  • Extraction:
    • Biomass Pellet: Lyophilize and weigh the cell pellet. Extract using a solvent mixture of methanol:ethyl acetate:acetic acid (50:50:1, v/v/v) via sonication (30 min). Centrifuge and collect the supernatant.
    • Culture Supernatant: Acidify the supernatant to pH ~3 with formic acid. Partition against an equal volume of ethyl acetate three times. Pool the organic layers.
  • Sample Preparation: Combine the biomass and supernatant extracts for each biological replicate. Dry under a gentle stream of nitrogen or by vacuum concentration. Reconstitute the dried extract in a 1:1 mixture of methanol and water to a final concentration of 1 mg/mL. Filter through a 0.22 µm PTFE membrane prior to LC-MS injection.
  • QC Pool: Create a quality control (QC) sample by pooling equal volumes from all reconstituted experimental samples.

Protocol 2: LC-MS/MS Data Acquisition for Untargeted Analysis

Principle: Acquire high-resolution MS1 and MS2 data suitable for both FUNEL (MS1 comparison) and simRank (MS2 networking) analysis [11].

Materials:

  • LC System: UHPLC system with a C18 reversed-phase column (e.g., 2.1 x 150 mm, 1.8 µm).
  • MS System: High-resolution tandem mass spectrometer capable of data-dependent acquisition (DDA) (e.g., Q-TOF, Orbitrap).

Chromatography Conditions (Example):

  • Mobile Phase A: Water with 0.1% formic acid.
  • Mobile Phase B: Acetonitrile with 0.1% formic acid.
  • Gradient: 5% B to 100% B over 25-30 minutes.
  • Flow Rate: 0.3 mL/min.
  • Column Temperature: 40°C.
  • Injection Volume: 2-5 µL.

Mass Spectrometry Parameters (Example for DDA in positive mode):

  • MS1 Scan: Range m/z 100-1500; Resolution > 30,000.
  • MS2 Acquisition: DDA mode: Top 10 most intense ions per cycle; dynamic exclusion enabled.
  • Fragmentation: Collision-induced dissociation (CID) or higher-energy collisional dissociation (HCD) with stepped collision energies (e.g., 20, 40, 60 eV).
  • MS2 Scan Resolution: > 15,000.

Acquisition Sequence: Inject the QC sample at the beginning (3-5 times for system equilibration) and repeatedly throughout the batch (after every 4-6 experimental samples). Randomize the injection order of all experimental and control samples to mitigate instrument drift.
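The acquisition sequence described above can be drafted programmatically before it is typed into the instrument software. The following is a minimal sketch, not part of the published workflow: the function name, the sample labels, the fixed random seed, and the choice to close the batch with a final QC injection are all illustrative assumptions.

```python
import random

def build_injection_sequence(samples, qc_label="QC_pool",
                             n_equilibration=3, qc_every=5, seed=42):
    """Return a run order: QC equilibration injections first, then the
    samples in random order with a QC injection after every `qc_every`."""
    rng = random.Random(seed)      # fixed seed so the order can be reproduced
    shuffled = list(samples)
    rng.shuffle(shuffled)

    sequence = [qc_label] * n_equilibration
    for position, name in enumerate(shuffled, start=1):
        sequence.append(name)
        if position % qc_every == 0:
            sequence.append(qc_label)
    if sequence[-1] != qc_label:   # assumption: always end the batch on a QC
        sequence.append(qc_label)
    return sequence

# Hypothetical sample labels: three replicates each of experimental,
# null-strain, and sterile-media extracts.
samples = [f"{group}_rep{i}" for group in ("exp", "null", "media") for i in (1, 2, 3)]
for slot, name in enumerate(build_injection_sequence(samples), start=1):
    print(slot, name)
```

Keeping the seed alongside the batch record makes it possible to regenerate the exact order later when inspecting drift in the QC injections.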

Protocol 3: Computational Analysis with NP-PRESS

Principle: Process raw LC-MS/MS files through the two-stage NP-PRESS workflow to generate a prioritized list of novel natural product candidates [1] [11].

Software & Platforms:

  • Raw Data Conversion: MSConvert (ProteoWizard).
  • Feature Detection & Alignment: MZmine, MS-DIAL, or XCMS.
  • NP-PRESS Implementation: Custom scripts for FUNEL and simRank (implementation details may be accessed via supplementary materials of primary literature [1]).
  • Molecular Networking: GNPS platform.

Procedure:

  • Data Pre-processing: Convert raw files to an open format (.mzML). Use MZmine or similar to perform peak picking, chromatogram deconvolution, isotope grouping, and alignment across all samples (experimental, null controls, sterile media). Create a feature table with m/z, RT, and intensity across samples.
  • Stage 1 - FUNEL Execution: Input the feature table and sample metadata into the FUNEL algorithm. The algorithm performs statistical testing (e.g., ANOVA, fold-change) to identify features significantly enriched in the experimental samples compared to all controls. Export a refined feature list and associated MS2 spectra.
  • Stage 2 - simRank Execution:
    • Submit the refined feature list's MS2 spectra to the simRank algorithm.
    • simRank performs spectral similarity matching against a curated NP spectral library and constructs a similarity network.
    • It assigns a "novelty priority score" based on connectivity patterns—features that form new clusters adjacent to known compound clusters are highly prioritized.
  • Validation & Annotation: Review the top-ranked candidates. Examine their MS/MS spectra, search against public databases (GNPS, MassBank), and predict molecular formulas. Perform manual validation of chromatographic peaks and fragmentation patterns.
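To make the Stage 1 step in the procedure above more concrete, here is a minimal sketch of a fold-change filter over a feature table. It is not the published FUNEL implementation, which is available only through the primary literature [1]. It covers only the fold-change part of the test (no ANOVA), and the function name, data layout, thresholds, and feature identifiers are hypothetical.

```python
from statistics import mean

def enriched_features(table, groups, min_fold=5.0, floor=1.0):
    """Keep features whose mean experimental intensity is at least
    `min_fold` times the highest mean seen in any control group.

    table  : {feature_id: {sample_name: intensity}}
    groups : {"experimental": [...], "<control name>": [...], ...}
    floor  : small constant that avoids division by zero for absent features
    """
    control_groups = [name for name in groups if name != "experimental"]
    kept = {}
    for feature_id, intensities in table.items():
        exp_mean = mean(intensities.get(s, 0.0) for s in groups["experimental"])
        worst_control = max(
            mean(intensities.get(s, 0.0) for s in groups[name])
            for name in control_groups
        )
        fold = (exp_mean + floor) / (worst_control + floor)
        if fold >= min_fold:
            kept[feature_id] = round(fold, 2)
    return kept

groups = {
    "experimental": ["exp1", "exp2", "exp3"],
    "null": ["null1", "null2", "null3"],
    "media": ["med1", "med2", "med3"],
}
table = {
    # Only present in the producing strain
    "F1": {"exp1": 9000, "exp2": 11000, "exp3": 10000},
    # Media component, also present in sterile medium
    "F2": {"exp1": 5000, "exp2": 5200, "exp3": 4800,
           "med1": 6000, "med2": 6100, "med3": 5900},
    # Primary metabolite shared with the null strain
    "F3": {"exp1": 8000, "exp2": 8100, "exp3": 7900,
           "null1": 7000, "null2": 7100, "null3": 6900},
}
print(enriched_features(table, groups))
```

Comparing against the worst-case control, rather than a pooled control mean, is a deliberately conservative choice: a feature must stand out from every background source to survive.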

Table: Key Parameters for LC-MS/MS Data Pre-processing

| Processing Step | Software (Example) | Critical Parameters | Purpose |
|---|---|---|---|
| Peak Picking | MZmine | Noise level, m/z tolerance (e.g., 0.005 Da), min peak duration | Detect individual ion signals from raw data. |
| Chromatogram Deconvolution | MZmine | MS1 & MS2 m/z tolerance, RT span | Resolve co-eluting peaks and link MS2 spectra to MS1 features. |
| Alignment | MZmine | m/z tolerance (0.01 Da), RT tolerance (0.1 min) | Match identical features across multiple sample runs. |
| Isotope Grouping | MZmine | m/z tolerance, RT tolerance | Group adducts and isotopes belonging to the same molecule. |

Data Visualization & Interpretation

Effective visualization is critical for interpreting the complex, multi-dimensional data generated by the NP-PRESS pipeline and for communicating results [8]. The following standards should be applied.

1. Molecular Network Visualization (simRank Output): The primary output of simRank is a molecular network typically visualized using Cytoscape.

  • Nodes: Represent individual MS/MS features. Color nodes by sample group (e.g., blue for experimental, red for controls) or by novelty score (gradient from yellow to red).
  • Edges: Connect nodes with a cosine similarity score above a threshold (e.g., >0.7). Edge thickness should be proportional to the similarity score.
  • Clusters: Highlight clusters containing known library compounds (annotated) and adjacent "orphan" clusters of unknown, prioritized features.
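The edge rule above (connect two nodes when their cosine similarity exceeds a threshold) can be illustrated with a small sketch. It uses a plain binned cosine score, which is simpler than the modified cosine used in GNPS molecular networking, and it is not the simRank scoring itself. The bin width, spectra, and function names are assumptions chosen for illustration.

```python
import math
from itertools import combinations

def bin_spectrum(peaks, bin_width=0.01):
    """Collapse (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def cosine_similarity(peaks_a, peaks_b, bin_width=0.01):
    """Cosine of the angle between two binned intensity vectors."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_edges(spectra, threshold=0.7):
    """Return (node_a, node_b, score) for every pair scoring above threshold."""
    edges = []
    for (id_a, peaks_a), (id_b, peaks_b) in combinations(spectra.items(), 2):
        score = cosine_similarity(peaks_a, peaks_b)
        if score > threshold:
            edges.append((id_a, id_b, round(score, 3)))
    return edges

# Toy fragment lists: A and B share both fragments, C shares none.
spectra = {
    "A": [(100.05, 50.0), (200.10, 100.0)],
    "B": [(100.05, 40.0), (200.10, 100.0)],
    "C": [(150.00, 100.0)],
}
print(build_edges(spectra))
```

The resulting edge list can be written to a simple table and imported into Cytoscape, with the score column mapped to edge thickness.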

2. Feature Abundance Plots (FUNEL Output): Visualize the effect of Stage 1 filtering.

  • Volcano Plots: Display fold-change (Experimental vs. Control) versus statistical significance (-log10 p-value) for all MS1 features. Features passing FUNEL thresholds can be highlighted in a distinct color [8].
  • Venn Diagrams: Illustrate the overlap and unique features between experimental, null control, and sterile media samples.

3. Chromatographic and Spectral Visualization: Essential for manual validation of priority candidates.

  • Extracted Ion Chromatograms (XICs): Overlay the XIC of a candidate's exact mass across experimental and control samples to confirm its unique presence [12].
  • Mirrored MS/MS Spectra: Display the experimental MS/MS spectrum of a candidate mirrored against the spectrum of its closest known library match or a predicted in-silico spectrum to illustrate similarities and key differences [12].
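For the XIC overlay, the extraction window around a candidate's exact mass is usually expressed in ppm. The sketch below shows the arithmetic on a toy scan list; the 5 ppm default, the scan data structure, and the function names are illustrative assumptions rather than settings prescribed by the pipeline.

```python
def ppm_window(target_mz, tolerance_ppm=5.0):
    """Return the (low, high) m/z bounds for a symmetric ppm tolerance."""
    delta = target_mz * tolerance_ppm / 1e6
    return target_mz - delta, target_mz + delta

def extract_xic(scans, target_mz, tolerance_ppm=5.0):
    """Sum the intensity inside the ppm window for each scan.

    scans : list of (retention_time_min, [(mz, intensity), ...])
    """
    low, high = ppm_window(target_mz, tolerance_ppm)
    return [
        (rt, sum(intensity for mz, intensity in peaks if low <= mz <= high))
        for rt, peaks in scans
    ]

# Toy MS1 scans; only peaks within 5 ppm of m/z 500.0 contribute.
scans = [
    (1.0, [(500.001, 100.0), (500.010, 50.0)]),
    (1.1, [(499.999, 200.0)]),
    (1.2, [(501.000, 10.0)]),
]
print(ppm_window(500.0))
print(extract_xic(scans, 500.0))
```

Running the same extraction on experimental and control files and plotting the traces together gives the overlay used to confirm that a candidate is absent from the controls.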

[Workflow diagram: LC-MS/MS raw data → data pre-processing (feature detection, alignment) → aligned feature table (m/z, RT, intensity). The feature table feeds three visualizations: a volcano plot of FUNEL filtering (filtered vs. removed features), a simRank molecular network colored by novelty, and XIC plus mirrored MS/MS plots confirming a candidate's unique pattern. All three inform the decision to proceed to isolation and structure elucidation.]

The Scientist's Toolkit: Essential Research Reagents & Materials

Table: Key Reagent Solutions for NP-PRESS Workflow Implementation

Category Item / Reagent Specification / Function Critical Notes
Chromatography Mobile Phase Additive Formic Acid (0.1%) or Ammonium Acetate (5-10 mM). Enhances ionization in positive or negative ESI mode and improves peak shape. Must be LC-MS grade.
Mass Spectrometry Calibration Solution Manufacturer-specific ESI-L low concentration tuning mix. Enables accurate mass measurement (< 5 ppm error). Must be infused pre-run for high-resolution instruments.
Sample Preparation Extraction Solvent System Methanol:Ethyl Acetate:Acetic Acid (50:50:1, v/v/v). Broad-spectrum solvent for secondary metabolites of varying polarity from biomass [11].
Sample Preparation Reconstitution Solvent Methanol:Water (1:1, v/v). Ensures solubility of a wide polarity range of compounds and compatibility with reversed-phase LC gradients.
Data Processing Internal Standard Deuterated or non-native compound (e.g., chloramphenicol-d5). Spiked into all samples pre-extraction to monitor and correct for extraction efficiency and instrument variability.
Cultivation Control Media Identical, sterile production media. Serves as the abiotic control for FUNEL to subtract all media-derived chemical features [4].

Applications & Case Studies

The efficacy of the NP-PRESS strategy is demonstrated by its application to diverse bacterial strains, leading to the discovery of novel bioactive compounds [1] [4].

  • Case Study: Streptomyces albus J1074

    • Challenge: This model strain has a sequenced genome rich in biosynthetic gene clusters (BGCs), but its metabolome is complex and interspersed with many common metabolites.
    • NP-PRESS Application: Application of the two-stage pipeline successfully filtered the background and prioritized signals from the surugamide non-ribosomal peptide synthetase (NRPS) cluster.
    • Outcome: Discovery of new surugamide A analogs with variations in the peptide sequence, validating the pipeline's ability to highlight structural variants within known families.
  • Case Study: Wukongibacter baidiensis M2B1 (Anaerobic Bacterium)

    • Challenge: Extremophilic and anaerobic bacteria are difficult to culture and their metabolomes are poorly understood, increasing the risk of missing novel chemistry.
    • NP-PRESS Application: The strategy was crucial for differentiating true secondary metabolites from the unique and complex background of an anaerobic culture system.
    • Outcome: Discovery of an entirely new family of depsipeptides named baidienmycins. These compounds exhibited potent antimicrobial and anticancer activities, underscoring the pipeline's power to unveil novel scaffolds with therapeutic potential from challenging sources.

These case studies confirm that the two-stage conceptual shift from simple analysis to systematic refinement effectively addresses the core challenge of metabolome complexity, turning high-dimensional MS data into a targeted discovery engine for novel natural products.

Core Philosophy and Strategic Objectives

The NP-PRESS (Natural Products Prioritization and Refinement via Elimination of Spectral Similarity) pipeline is founded on a core philosophical shift in natural product (NP) discovery: moving from a detection-centric to a refinement-centric paradigm. Conventional mass spectrometry (MS) workflows are excellent at detecting thousands of chemical features but struggle to distinguish true, biosynthetically relevant natural products from the overwhelming background of biotic interference—such as media components, cellular degradation products, and horizontally acquired metabolites. NP-PRESS posits that the key to unlocking novel chemistry lies not in more sensitive detection, but in more intelligent, context-aware filtration.

Its primary objective is to serve as a decisive two-stage filter that aggressively removes irrelevant MS features while preserving and prioritizing those with high biosynthetic potential. This is achieved by integrating orthogonal data analysis strategies at the MS1 and MS2 levels, effectively mimicking the logical deduction of an experienced natural products chemist. The pipeline is designed to be especially effective for challenging microbial sources, such as extremophiles or strains with sparse metabolomic profiles, where signal-to-noise ratios are notoriously low and high-value metabolites are easily missed [1].

Comparative Analytical Framework: NP-PRESS vs. Conventional Dereplication

The following table summarizes the paradigm shift introduced by NP-PRESS, contrasting its strategic approach and outcomes with conventional dereplication methods.

Table 1: Strategic and Outcome Comparison: NP-PRESS vs. Conventional Dereplication

| Aspect | Conventional Dereplication | NP-PRESS Strategy | Impact |
|---|---|---|---|
| Core Focus | Identity matching against known compound libraries. | Prioritization of unknown features via background subtraction. | Shifts focus from known to unknown chemical space. |
| Primary Data Used | Predominantly MS2 spectral matching. | Integrated MS1 feature behavior and MS2 network analysis. | Uses orthogonal data layers for robust decision-making. |
| Handling of Biotic Interference | Often unaddressed; treated as part of the sample background. | Actively modeled and subtracted using the FUNEL algorithm. | Dramatically reduces feature list size, enhancing clarity. |
| Key Algorithmic Engine | Spectral similarity scoring (e.g., cosine score). | Two-stage: FUNEL (MS1) and simRank (MS2). | Enables prioritization based on biosynthetic logic. |
| Typical Outcome | List of known compounds and unresolved "unknown" features. | A ranked shortlist of features most likely to be novel NPs. | Directs purification efforts efficiently to high-priority targets. |
| Demonstrated Novel Discovery | Can rediscover known compounds efficiently. | Enabled discovery of baidienmycins and new surugamide analogs [1]. | Proven efficacy in de novo structure family identification. |

Pipeline Architecture and Workflow

The NP-PRESS pipeline implements a sequential, two-stage refinement process. The workflow diagram below illustrates the logical flow from raw data to prioritized compound discovery.

[Workflow diagram: raw LC-MS/MS data → Stage 1: MS1 processing and FUNEL analysis → refined feature list (biotic noise reduced) → Stage 2: MS2 network analysis via simRank → prioritized targets for novel natural products → targeted isolation and structural elucidation → novel bioactive compound.]

NP-PRESS Two-Stage MS Dereplication Workflow

Algorithmic Foundations: FUNEL and simRank

The analytical power of NP-PRESS is driven by two specialized algorithms, each operating on a different level of MS data. Their functions are detailed below.

Table 2: Core Algorithms of the NP-PRESS Pipeline

| Algorithm | Stage | Function | Key Mechanism |
|---|---|---|---|
| FUNEL (FUll and NEgative feature anaLysis) | 1 (MS1) | Elimination of biotic interference. | Compares feature profiles between experimental (full) and control (negative) cultures. Features not significantly enriched in the experimental group are flagged as non-biosynthetic background and removed [1]. |
| simRank | 2 (MS2) | Prioritization of novel NP clusters. | Analyzes spectral similarity networks of the refined features. Features that form tight clusters (high connectivity) with known NPs are deprioritized. Novel, structurally unique features that form distinct clusters or singletons are prioritized for investigation [1]. |

The following diagram details the decision logic within the critical second stage of the pipeline.

[Decision diagram: refined MS2 spectra from the FUNEL stage are first checked for clustering with known NPs. If yes, the feature is deprioritized (likely known analog). If no, it is checked for membership in a novel, well-connected cluster: yes gives a high-priority target (potential new scaffold); no gives medium priority (requires validation).]

Stage 2 Priority Logic via simRank Analysis
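The two-question decision logic above translates directly into a small function. This is a sketch of the decision tree only, not of simRank's scoring. It inspects direct network neighbors rather than whole clusters, and the `min_neighbors` cut-off for "well-connected" is a hypothetical parameter, as are all node names.

```python
def assign_priority(feature_id, edges, known_ids, min_neighbors=2):
    """Classify one feature from its direct neighbors in a similarity network.

    edges     : iterable of (node_a, node_b) pairs
    known_ids : set of node ids annotated as known natural products
    """
    neighbors = {
        b if a == feature_id else a
        for a, b in edges
        if feature_id in (a, b)
    }
    if neighbors & known_ids:             # Check 1: clusters with known NPs
        return "deprioritize"
    if len(neighbors) >= min_neighbors:   # Check 2: novel, well-connected
        return "high"
    return "medium"                       # singleton or weakly connected

edges = [("u1", "k1"), ("u2", "u3"), ("u2", "u4"), ("u3", "u4")]
known = {"k1"}
for node in ("u1", "u2", "u5"):
    print(node, assign_priority(node, edges, known))
```

A fuller version would walk the connected component rather than direct neighbors, so that a feature two steps away from a known compound is treated as part of the same cluster.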

Detailed Experimental Protocols

Protocol 1: Bacterial Cultivation and Metabolite Extraction for NP-PRESS Analysis

This protocol is optimized to generate the paired "full" and "negative" culture datasets required for the FUNEL algorithm.

1.1 Materials Preparation

  • Bacterial Strains: Target strain (e.g., Streptomyces albus J1074) and a biosynthetically "null" control (e.g., a mutant lacking core biosynthetic machinery, or a sterile media control) [1].
  • Growth Media: Appropriate liquid culture medium (e.g., ISP2, R5A for actinomycetes). Prepare identical media for experimental and control cultures.
  • Extraction Solvents: HPLC-grade methanol, acetonitrile, and ethyl acetate.
  • Equipment: Sterile shaker-incubator, centrifuge, sonic dismembrator, speed vacuum concentrator.

1.2 Procedure

  • Parallel Cultivation: Inoculate the target strain into experimental culture flasks and prepare control flasks (null mutant or sterile media) in biological triplicate. Incubate under identical conditions (temperature, shaking, duration).
  • Metabolite Harvest: At stationary phase, pellet cells by centrifugation (4,000 x g, 20 min, 4°C). Separate the supernatant (containing excreted metabolites) from the cell pellet.
  • Dual-Phase Extraction:
    • Supernatant: Extract with an equal volume of ethyl acetate, vortex vigorously for 2 minutes, and separate phases by centrifugation. Collect the organic (ethyl acetate) layer. Repeat twice. Pool organic layers.
    • Cell Pellet: Resuspend pellet in a 1:1 mixture of methanol:acetonitrile. Sonicate on ice (3 cycles of 30 sec pulse, 30 sec rest). Centrifuge (15,000 x g, 15 min, 4°C) and collect the supernatant.
  • Sample Combination & Concentration: For each replicate, combine the processed supernatant extract and cell pellet extract. Dry under a gentle stream of nitrogen or using a speed vacuum concentrator.
  • LC-MS Resuspension: Reconstitute the dried extract in 200 µL of methanol containing 0.1% formic acid. Centrifuge (15,000 x g, 10 min) to pellet insoluble debris. Transfer the clear supernatant to an LC-MS vial for analysis.

Protocol 2: LC-MS/MS Data Acquisition for Dereplication

This protocol ensures consistent, high-quality data suitable for both FUNEL and simRank analysis.

2.1 LC Conditions

  • Column: Reversed-phase C18 column (e.g., 2.1 x 100 mm, 1.7 µm particle size).
  • Mobile Phase: A: Water with 0.1% formic acid; B: Acetonitrile with 0.1% formic acid.
  • Gradient: Optimize for broad metabolite elution (e.g., 5% B to 100% B over 20-25 minutes).
  • Flow Rate: 0.3 mL/min. Column Temperature: 40°C. Injection Volume: 5-10 µL.

2.2 MS Conditions (Q-TOF or Orbitrap recommended)

  • Ionization Mode: Electrospray Ionization (ESI), positive and negative modes acquired separately.
  • MS1 Survey Scan: Resolution > 30,000 (FWHM at m/z 200); Scan range: m/z 100-1500.
  • Data-Dependent MS2 Acquisition: Top 10-15 most intense ions per cycle. Isolation width: 2 m/z. Fragmentation: Collision-induced dissociation (CID) or higher-energy collisional dissociation (HCD) with stepped normalized collision energies (e.g., 20, 35, 50).

2.3 Data File Organization

Maintain a strict file naming convention to pair samples for FUNEL (e.g., StrainA_Rep1_Full.mzML, StrainA_Rep1_Neg.mzML). Acquire all samples in a randomized order within a continuous sequence to minimize instrumental drift.
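A naming convention is only useful if it is checked before analysis. The sketch below parses the example pattern given above (Strain_RepN_Full/Neg.mzML) and reports which replicates have both members of the pair. The regular expression and function name are assumptions built around that one example and would need adjusting for any other convention.

```python
import re
from collections import defaultdict

# Assumed pattern, derived from the example names in the text.
FILENAME_PATTERN = re.compile(
    r"^(?P<strain>[^_]+)_Rep(?P<rep>\d+)_(?P<condition>Full|Neg)\.mzML$"
)

def pair_runs(filenames):
    """Group files by (strain, replicate) and separate complete pairs
    from incomplete ones and from names that do not follow the pattern."""
    pairs = defaultdict(dict)
    unmatched = []
    for name in filenames:
        match = FILENAME_PATTERN.match(name)
        if match is None:
            unmatched.append(name)
            continue
        key = (match["strain"], int(match["rep"]))
        pairs[key][match["condition"]] = name
    complete = {k: v for k, v in pairs.items() if {"Full", "Neg"} <= v.keys()}
    incomplete = sorted(k for k in pairs if k not in complete)
    return complete, incomplete, unmatched

files = [
    "StrainA_Rep1_Full.mzML",
    "StrainA_Rep1_Neg.mzML",
    "StrainA_Rep2_Full.mzML",   # its Neg partner is missing
    "blank.mzML",               # does not follow the convention
]
complete, incomplete, unmatched = pair_runs(files)
print(complete)
print(incomplete)
print(unmatched)
```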

Application Notes: Case Study and Performance

As a proof of concept, NP-PRESS was applied to the analysis of Streptomyces albus J1074 and the anaerobic bacterium Wukongibacter baidiensis M2B1 [1].

  • Process: The FUNEL stage significantly reduced the initial thousands of MS features by subtracting background. Subsequent simRank analysis of the remaining features highlighted spectral clusters distinct from known compounds.
  • Outcome: This directly led to the targeted isolation of new surugamide analogs from S. albus and the discovery of an entirely new family of depsipeptides, the baidienmycins, from W. baidiensis. Baidienmycins exhibited potent antimicrobial and anticancer activities, validating the pipeline's ability to prioritize bioactive novel NPs [1].

The Scientist's Toolkit: Essential Research Reagents and Materials

Table 3: Key Research Reagent Solutions for NP-PRESS Implementation

| Item Category | Specific Example/Description | Function in NP-PRESS Workflow |
|---|---|---|
| Biological Materials | Wild-type and biosynthetically "knock-out" mutant bacterial strains. | Provides the paired "Full" vs. "Negative" culture essential for the FUNEL algorithm's background subtraction [1]. |
| Chromatography | HPLC-grade solvents (MeCN, MeOH, H₂O) with 0.1% formic acid. | Forms the mobile phase for high-resolution LC separation, impacting feature detection and peak shape. |
| Mass Spectrometry | Tuning and calibration solution for MS (e.g., sodium formate cluster). | Ensures mass accuracy and reproducibility across the analytical sequence, critical for reliable feature alignment. |
| Software & Databases | MS-DIAL, MZmine, or similar for feature finding; GNPS for spectral networking; in-house NP spectral library. | Used for initial data processing, public spectral matching, and custom spectral comparisons for simRank analysis. |
| Data Analysis | Python/R environment with packages for statistical comparison (e.g., for FUNEL) and graph-based clustering (e.g., for simRank). | Enables execution of the core computational algorithms that define the NP-PRESS pipeline [1]. |

Inside NP-PRESS: A Step-by-Step Guide to the Two-Stage Workflow and Its Applications

The discovery of novel natural products (NPs) from microbial and plant sources remains a cornerstone of pharmaceutical development, yielding critical leads for antimicrobial, anticancer, and other therapeutic agents [1]. However, a fundamental bottleneck in this pipeline is the initial dereplication step—the rapid identification of known compounds to prioritize truly novel entities for costly and time-consuming isolation and structure elucidation [13]. Modern high-resolution mass spectrometry (HRMS) generates vast datasets of metabolic features, but the signals of potential NPs are often obscured by an overwhelming background of irrelevant ions originating from abiotic sources (e.g., solvents, plastics) and, more challengingly, from biotic processes (e.g., cellular degradation products, media components) [4].

This document details the first stage of NP-PRESS (Natural Products - Prioritization and Refinement by Enhanced Spectrometry Strategies), a novel two-stage MS feature dereplication strategy framed within a broader thesis on accelerating natural product discovery [1]. NP-PRESS introduces two new computational algorithms to systematically filter this complex metabolomic data. The first stage, the focus of this protocol, employs the FUNEL (Filtering of Uninteresting Non-Product-Related Features by Elution Profile) algorithm to perform rigorous filtering at the MS1 level [4]. By removing non-product-related features, FUNEL dramatically reduces dataset complexity and false leads before the more resource-intensive MS2 analysis. This initial refinement is critical for the success of the second stage, which utilizes the simRank algorithm for spectral similarity networking to highlight novel compound families [1]. The integrated NP-PRESS pipeline has proven effective, guiding the discovery of new surugamide analogs from Streptomyces albus and a new family of depsipeptides, the baidienmycins, from the anaerobic bacterium Wukongibacter baidiensis [4].

Core Principles of the FUNEL Algorithm

The FUNEL algorithm is designed to address a key limitation in current metabolomics: the inability to distinguish MS1 features originating from true secondary metabolites (natural products) from those generated by the routine metabolic turnover of the producing organism or its growth medium [4]. While background chemical noise can be partially subtracted using blank injections, biotic background from processed media and cellular debris is sample-inherent and has been historically difficult to filter out.

FUNEL operates on the principle that true natural products are typically synthesized de novo during a cultivation period. In contrast, compounds derived from the biotransformation of media components (e.g., peptides from hydrolyzed yeast extract) are present from the start of cultivation and are gradually consumed or transformed over time [4]. The algorithm exploits this difference in temporal profiles.

Logical Workflow of the NP-PRESS Strategy with FUNEL

[Workflow diagram: raw MS1 feature table (m/z, RT, intensity) → Stage 1: FUNEL algorithm, which applies the temporal profile filter and the blank subtraction filter → refined MS1 feature table (NP-enriched) → Stage 2: simRank MS2 analysis → prioritized list of novel NP candidates.]

Diagram Title: The Two-Stage NP-PRESS Dereplication Strategy

The algorithm requires LC-HRMS data collected from two key sample sets:

  • Time-Point Cultivation Samples: The producer organism is cultivated, and samples are harvested at multiple time points (e.g., early, mid, and late fermentation).
  • Spent Media Controls: Sterile culture media is incubated under the same conditions but without inoculation. Samples are taken at matching time points.

FUNEL processes the aligned MS1 feature table (containing m/z, retention time, and intensity across all samples) by applying two sequential filters [4]:

  • Temporal Profile Filter: Features are retained only if their intensity profile across the cultivation time series shows a pattern of accumulation (increasing over time). Features with a flat or depletion profile are considered background from processed media and are filtered out.
  • Blank Subtraction Filter: Features that are also detected in the spent media control samples are considered abiotic or media-derived background and are removed.

This two-pronged approach ensures that only features showing biological production by the organism during cultivation are passed to the next stage.

Detailed Experimental Protocol

Sample Preparation and LC-HRMS Data Acquisition

The following protocol is adapted from successful applications in bacterial natural product discovery [4] and aligns with standard practices for microbial metabolomics [11].

Materials & Growth:

  • Producer Strain: e.g., Streptomyces albus J1074 or other bacterium of interest.
  • Growth Medium: Appropriate liquid culture medium (e.g., ISP2, R5A for actinomycetes).
  • Sample Quenching: 60% aqueous methanol at -40°C.
  • Extraction Solvent: Methanol/Water/Formic acid (e.g., 49:49:2, v/v/v) [11].
  • LC-MS Solvents: LC-MS grade water (A) and acetonitrile (B), both with 0.1% formic acid.

Procedure:

  • Inoculation and Cultivation: Inoculate the producer strain into multiple flasks containing fresh medium. Simultaneously, prepare an equal number of flasks with sterile medium only (spent media controls).
  • Time-Point Sampling: For both cultivation and control flasks, aseptically remove aliquots (e.g., 1 mL) at defined time points (e.g., 24h, 48h, 72h, 96h). Perform biological replicates (n=3-5).
  • Metabolite Quenching & Extraction: Immediately mix each sample aliquot with cold quenching solution to halt metabolism. Centrifuge to pellet cells. For intracellular metabolite analysis, extract the cell pellet with extraction solvent via sonication or bead-beating. For extracellular analysis, extract the supernatant. Combine extracts, dry under nitrogen or vacuum, and reconstitute in a water/acetonitrile mix suitable for LC-MS injection [11].
  • LC-HRMS Analysis:
    • Column: Reversed-phase C18 column (e.g., 2.1 x 150 mm, 1.8 μm).
    • Gradient: Use a water/acetonitrile gradient optimized for broad metabolite separation (e.g., 5% to 98% B over 20-30 minutes) [11].
    • Mass Spectrometer: High-resolution mass spectrometer (Q-TOF, Orbitrap) capable of data-dependent acquisition (DDA).
    • Acquisition Mode: Use DDA to collect both high-resolution MS1 spectra and subsequent MS2 fragmentation spectra for top ions. This generates the data needed for both FUNEL (MS1) and subsequent simRank (MS2) analysis [11]. A typical setup includes a full scan from m/z 100-2000 at 60-120k resolution, followed by MS2 scans on the most intense precursors.

Data Preprocessing for FUNEL Analysis

Raw LC-HRMS data must be converted into an aligned feature table.

  • Feature Detection and Alignment: Use open-source software like MZmine 3 or commercial packages to process raw data files [11].
    • Perform peak picking (chromatogram building) on the MS1 data.
    • Deconvolute isotopes and adducts to group features belonging to the same compound.
    • Align features across all sample runs (all time points and controls) based on accurate mass and retention time (RT), using a tolerance of ±0.01 Da and ±0.1 min.
  • Result: Generate a consensus feature table where each row is a unique metabolite feature (defined by m/z and RT) and each column is the integrated peak intensity (or area) for that feature in a specific sample file. This table, typically exported as a .csv file, is the primary input for the FUNEL algorithm.
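The alignment tolerances quoted above (±0.01 Da and ±0.1 min) amount to a simple matching rule between a feature in one run and the features of a reference run. The sketch below shows that rule in isolation; it is not how MZmine performs alignment internally, and the tie-breaking score (the sum of the two tolerance-normalized deviations) is an illustrative assumption.

```python
def match_feature(feature, reference_features, mz_tol=0.01, rt_tol=0.1):
    """Return the index of the closest reference feature that lies within
    both tolerances, or None when nothing matches.

    feature            : (mz, rt_min)
    reference_features : list of (mz, rt_min)
    """
    mz, rt = feature
    best_index, best_distance = None, None
    for index, (ref_mz, ref_rt) in enumerate(reference_features):
        d_mz, d_rt = abs(mz - ref_mz), abs(rt - ref_rt)
        if d_mz <= mz_tol and d_rt <= rt_tol:
            # Assumed tie-breaker: normalized deviation in both dimensions
            distance = d_mz / mz_tol + d_rt / rt_tol
            if best_distance is None or distance < best_distance:
                best_index, best_distance = index, distance
    return best_index

reference = [(455.210, 12.30), (455.215, 12.80), (301.100, 5.00)]
print(match_feature((455.214, 12.35), reference))  # same m/z region, early RT
print(match_feature((455.214, 12.75), reference))  # same m/z region, late RT
print(match_feature((600.000, 1.00), reference))   # nothing within tolerance
```

The first two queries share nearly the same m/z but resolve to different reference features because of retention time, which is why both dimensions are needed to separate isomers.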

FUNEL Algorithm Execution

The logic of the FUNEL filter is implemented through feature intensity comparisons.

Algorithmic Steps:

[Diagram: Aligned MS1 feature table (all samples) → Q1: is the feature present in spent media controls? Yes → discard feature (media background). No → Q2: does the intensity profile show accumulation? No (flat/decreasing) → discard feature (consumed media component). Yes (increasing) → retain feature (potential natural product) → filtered feature table.]

Diagram Title: Decision Workflow of the FUNEL Filtering Algorithm

  • Input: Load the aligned feature intensity table.
  • Blank Subtraction Filter:
    • For each feature, calculate the average intensity in the spent media control samples.
    • If the average control intensity exceeds a defined threshold (e.g., more than 5% of the average intensity in the cultivation samples, or not significantly lower than the cultivation intensity by a t-test), flag the feature as media-derived.
    • Result: All flagged features are removed from the candidate list.
  • Temporal Profile Filter:
    • For each remaining feature, analyze its intensity across the cultivation time points (e.g., T1, T2, T3, T4).
    • Apply a statistical test (e.g., Spearman rank correlation) to assess if the intensity has a significant positive correlation with time.
    • Alternatively, set a simpler rule: intensity at T4 > intensity at T1 by a defined fold-change (e.g., >1.5x) and the intermediate time points show a generally increasing trend.
    • Result: Features failing to show a clear accumulation profile are removed.
  • Output: The final output is a refined MS1 feature table containing only features that passed both filters. This table is significantly reduced in size and enriched for compounds genuinely produced by the organism during fermentation.
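The two filters described above can be sketched as follows. This is an illustrative reading of the steps in this section, not the published FUNEL code: the record layout (`id`, `controls`, `timecourse`), the function names, and the `min_rho` cutoff of 0.8 are assumptions, while the 5% control ratio and the 1.5-fold change are the example values from the text. Spearman correlation is computed by hand so the sketch has no dependencies.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def funel_like_filter(features, control_ratio=0.05, min_fold=1.5, min_rho=0.8):
    """Return ids of features that pass the blank and temporal filters."""
    kept = []
    for feat in features:
        culture = feat["timecourse"]          # intensities at T1..Tn
        controls = feat["controls"]           # intensities in spent-media controls
        mean_culture = sum(culture) / len(culture)
        mean_control = sum(controls) / len(controls)
        # Blank subtraction filter: drop media-derived features.
        if mean_culture == 0 or mean_control > control_ratio * mean_culture:
            continue
        # Temporal profile filter: require fold-change and an increasing trend.
        rho = spearman_rho(list(range(len(culture))), culture)
        if culture[-1] > min_fold * culture[0] and rho >= min_rho:
            kept.append(feat["id"])
    return kept


features = [
    {"id": "np_like", "controls": [0, 0], "timecourse": [100, 400, 900, 2000]},
    {"id": "media", "controls": [5000, 5200], "timecourse": [5100, 4000, 2000, 500]},
    {"id": "flat", "controls": [0, 10], "timecourse": [1000, 1020, 990, 1005]},
]
kept = funel_like_filter(features)
```

With only four time points, a correlation p-value has little power, which is why the sketch combines a rank-correlation cutoff with the simpler fold-change rule instead of testing significance.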

Performance and Validation

The efficacy of the FUNEL algorithm within the NP-PRESS pipeline is demonstrated by its application in real discovery campaigns. The filtering drastically reduces dataset complexity, allowing downstream resources to focus on promising leads.

Table 1: Performance of FUNEL Filtering in NP-PRESS Case Studies [1] [4]

| Producer Organism | Initial MS1 Features | Features After FUNEL | Reduction (%) | Key Discovery Enabled |
| --- | --- | --- | --- | --- |
| Streptomyces albus J1074 | ~15,000 | ~3,000 | 80% | New surugamide analogs |
| Wukongibacter baidiensis M2B1 | ~12,000 | ~2,500 | 79% | Baidienmycins (new depsipeptides) |
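The reduction percentages in Table 1 follow from the approximate feature counts. The short check below makes the arithmetic explicit; the function name is an illustrative choice, and the counts are the rounded values from the table.

```python
def reduction_percent(initial, remaining):
    """Percentage of features removed by filtering."""
    return 100.0 * (initial - remaining) / initial


albus = reduction_percent(15_000, 3_000)        # S. albus J1074
baidiensis = reduction_percent(12_000, 2_500)   # W. baidiensis M2B1
print(round(albus), round(baidiensis))
```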

The utility of FUNEL is further underscored when compared to other state-of-the-art feature prioritization methods. While tools like MassQL provide a powerful, flexible language for querying specific patterns (e.g., isotopes, neutral losses) in public MS data repositories [14], FUNEL is specifically designed for a different problem: distinguishing biologically synthesized products from complex biotic background in controlled cultivation experiments.

Table 2: Comparison of MS1 Feature Prioritization Strategies

| Method / Algorithm | Primary Function | Key Advantage | Limitation in NP Discovery Context |
| --- | --- | --- | --- |
| FUNEL (NP-PRESS) | Filters based on temporal cultivation profile. | Removes sample-inherent biotic background; highly specific for de novo synthesis. | Requires carefully designed time-course experiment. |
| Blank Subtraction (Standard) | Subtracts features found in process blanks. | Removes abiotic contamination (solvents, tubing). | Cannot remove background from processed media components. |
| MassQL [14] | Query language for MS data patterns (MS1 & MS2). | Extremely flexible for finding known chemical motifs; vendor-agnostic. | Does not prioritize based on biological origin; requires pattern definition. |
| MBR/PIP [15] [16] | Transfers IDs between runs using RT, m/z, IM. | Increases feature identification sensitivity across samples. | Can propagate errors; requires high-quality reference library; not a filter for biotic noise [16]. |

Integration with Downstream NP-PRESS Stage and Broader Workflows

The filtered output from FUNEL is the essential input for the second stage of NP-PRESS, which employs the simRank algorithm. simRank performs modified molecular networking on MS2 spectral data but is applied only to the precursors that passed the FUNEL filter. This focused analysis increases the chance that spectral similarity clusters represent true families of secondary metabolites rather than background compounds [1] [4].

For comprehensive dereplication, the FUNEL-simRank pipeline can be integrated with other established tools. The refined feature list can be queried against natural product databases using molecular formula or exact mass. Furthermore, the accurate MS1 features (with m/z and RT) can be used as high-fidelity targets for Match Between Runs (MBR) or Peptide-Identity-Propagation (PIP) in subsequent analyses of new strains or conditions, though such transfers require rigorous false-discovery rate control [15] [16]. Advanced software platforms like MaxQuant, which now support ion mobility dimensions, can enhance the accuracy of such alignments by using collision cross section (CCS) as an additional coordinate [15].

The Scientist's Toolkit: Essential Reagents and Software

Table 3: Key Research Reagent Solutions and Software for FUNEL/NP-PRESS Implementation

| Item | Function/Description | Application in Protocol |
| --- | --- | --- |
| Methanol/Water/Formic Acid (49:49:2) | Extraction solvent for intracellular and extracellular metabolites. Provides good recovery of a wide polarity range of NPs [11]. | Sample preparation, metabolite extraction. |
| LC-MS Grade Water & Acetonitrile (with 0.1% FA) | Mobile phases for reversed-phase LC-HRMS. High purity minimizes background chemical noise in MS1 spectra. | LC-HRMS separation during data acquisition. |
| MZmine 3 | Open-source software for mass spectrometry data processing. Performs chromatogram building, deisotoping, alignment, and feature table export [11]. | Data preprocessing before FUNEL analysis. |
| MaxQuant | Comprehensive software suite for quantitative proteomics (and metabolomics). Its advanced "Match Between Runs" (MBR) algorithm can utilize multiple dimensions (RT, m/z, CCS) for high-confidence feature alignment [15]. | Optional for advanced feature alignment and integration with ion mobility data. |
| R Script/Python Environment | Custom computational environment for implementing the FUNEL logic (statistical tests, thresholding, filtering). | Execution of the core FUNEL algorithm. |
| GNPS / MassIVE | Public repository and ecosystem for mass spectrometry data. Used for spectral library matching, molecular networking, and sharing raw data [14]. | Downstream analysis after FUNEL filtering (e.g., with simRank) and data deposition. |

The discovery of novel, bioactive natural products (NPs) from microbial metabolomes is persistently challenged by the overwhelming chemical background of non-relevant metabolites. These include primary metabolites, cellular degradation products, and components from growth media, whose signals in mass spectrometry (MS) analyses can obscure the often lower-abundance secondary metabolites of interest. The NP-PRESS (Natural Products - Prioritization and Refinement by Elimination of Spectral Signatures) pipeline is a novel, two-stage metabolome refining strategy designed to overcome this hurdle [1] [4].

This pipeline systematically removes irrelevant chemical features to highlight NPs with higher potential for novelty and bioactivity. Stage 1 employs the FUNEL (FUnctional-group guided comparisoN of Extracted ion chromatogram and MS1 spectra for List) algorithm. FUNEL operates on MS1 data to filter out features originating from "biotic processes" by comparing experimental samples against a comprehensive database of control samples (e.g., spent media, host organism extracts). It does this by evaluating mass defects, isotopic patterns, and retention time shifts indicative of common biochemical transformations [1] [4].

Stage 2, which is the focus of these application notes, utilizes the simRank algorithm to analyze MS2 (tandem mass spectrometry) data. While Stage 1 effectively reduces dataset complexity, Stage 2 provides a higher-order, structural similarity-based filter. It prioritizes NP candidates by identifying MS2 spectra in the experimental samples that are dissimilar to all spectra found in control samples, thereby flagging compounds with potentially novel chemical scaffolds [1] [4].

Theoretical Foundation of the simRank Algorithm

The simRank algorithm, in its general form, is a graph-theoretic measure of structural-context similarity. Its core principle is: "two objects are considered similar if they are related to similar objects" [17]. In the context of web page analysis, this translates to pages being similar if they are linked to by similar pages.
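To make the general principle concrete, the sketch below implements the classic iterative SimRank recurrence on a small link graph: a node is fully similar to itself, and two distinct nodes are similar in proportion to the average similarity of their in-neighbors, scaled by a decay factor. The toy graph, the node names, and the decay value of 0.8 are illustrative assumptions, and this is the generic graph algorithm, not the MS2 adaptation used in NP-PRESS.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Iterative SimRank on a graph given as {node: [in-neighbor, ...]}.

    Every in-neighbor must itself be a key of `in_neighbors`.
    """
    nodes = list(in_neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        updated = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    updated[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # Nodes without in-neighbors have no evidence of similarity.
                    updated[a][b] = 0.0
                    continue
                total = sum(sim[i][j] for i in ia for j in ib)
                updated[a][b] = decay * total / (len(ia) * len(ib))
        sim = updated
    return sim


# Toy web graph: "hub" links to page_a and page_b; page_a links to page_c.
graph = {"hub": [], "page_a": ["hub"], "page_b": ["hub"], "page_c": ["page_a"]}
scores = simrank(graph)
```

Here `page_a` and `page_b` come out as similar because the same page links to both, whereas `page_a` and `page_c` share no similar in-neighbors and score zero.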

For MS2 spectral analysis within NP-PRESS, this concept is adapted. Here, "objects" are precursor ions (detected features from MS1). The "relationship" is defined by their fragment ions (the MS2 spectrum). The adapted simRank principle for NP discovery becomes: Two precursor ions are considered to have similar chemical structures if their fragmentation spectra (the ions they are "related to") are similar [17] [18].

  • Calculation: The algorithm performs pairwise comparisons between all MS2 spectra from the sample of interest and a pooled set of MS2 spectra from control samples.
  • Scoring: A similarity score is calculated for each pair, typically based on the alignment and intensity correlation of their fragment ions.
  • Filtering: A user-defined similarity score threshold is applied. A sample spectrum that scores below this threshold when compared against all control spectra is deemed novel and prioritized for further investigation.
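The three steps above can be sketched as follows. The source does not give the exact simRank scoring formula, so this example substitutes a plain cosine similarity over greedily matched fragments, scaled to 0-100, as a stand-in. The function names, the greedy matching rule, the 0.01 Da tolerance, and the threshold of 15 are assumptions chosen for illustration.

```python
import math


def cosine_score(spec_a, spec_b, tol=0.01):
    """Stand-in similarity (0-100) between two spectra of (mz, intensity) pairs."""
    used = set()
    dot = 0.0
    for mz_a, int_a in spec_a:
        best = None
        for idx, (mz_b, int_b) in enumerate(spec_b):
            if idx in used or abs(mz_a - mz_b) > tol:
                continue
            # Among unmatched fragments within tolerance, prefer the most intense.
            if best is None or int_b > spec_b[best][1]:
                best = idx
        if best is not None:
            used.add(best)
            dot += int_a * spec_b[best][1]
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return 100.0 * dot / (norm_a * norm_b)


def classify(sample_spec, control_specs, threshold=15.0):
    """Return (is_novel, max_score) for one sample spectrum against all controls."""
    best = max((cosine_score(sample_spec, c) for c in control_specs), default=0.0)
    return best < threshold, best


controls = [[(100.0, 50.0), (150.05, 100.0), (200.1, 30.0)]]
background_like = [(100.002, 50.0), (150.052, 100.0), (200.102, 30.0)]
unmatched = [(111.1, 80.0), (222.2, 100.0), (333.3, 40.0)]

known_result = classify(background_like, controls)
novel_result = classify(unmatched, controls)
```

Any other pairwise spectral score could be dropped into `classify` in place of `cosine_score`; the filtering logic only depends on the maximum score over all control spectra.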

Integration and Workflow within NP-PRESS

The simRank stage is not a standalone process but a critical, refining component of the sequential NP-PRESS pipeline. The following diagram illustrates the complete two-stage workflow and the specific role of the simRank module.

[Diagram: Raw LC-MS/MS data (experimental and control samples) → Stage 1, MS1 data refinement (FUNEL algorithm): MS1 feature extraction and alignment → biotic process filtering (vs. control database) → refined feature list (potential NPs). The refined list supplies the precursors to target in Stage 2, MS2 data prioritization (simRank algorithm): MS2 spectrum extraction and merging → pairwise spectral similarity analysis, with the control sample MS2 spectral library as the reference for the similarity check → simRank threshold filter → prioritized NP candidates (novel scaffolds) → targeted isolation and structural elucidation → biological activity assessment.]

Diagram 1: The Two-Stage NP-PRESS Dereplication and Prioritization Pipeline.

Detailed Experimental Protocol for simRank Analysis

This protocol assumes the completion of Stage 1 (FUNEL) processing and the availability of raw LC-MS/MS data files (.mzML or .mzXML format) for both experimental and control samples.

Input Data Preparation

  • Sample Files: Provide the MS2 data file for your experimental sample of interest [18].
  • Control Files: Provide one or more MS2 data files from your control samples (e.g., spent media, non-producing strain). These will be pooled to create the reference spectral library [18].
  • Optional Target List: A CSV file with columns mz and rt (retention time in seconds) can be supplied to restrict analysis only to precursor ions of interest, such as those passing Stage 1 [18].
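A target list of this kind can be read and applied with the standard library alone. The sketch below is an assumption about how such a file might be consumed, not the simRank-Filter implementation: the column names `mz` and `rt` follow the text, while the 20 ppm and 30 s matching tolerances and the function names are illustrative choices.

```python
import csv
import io


def load_targets(csv_text):
    """Parse a target list with `mz` and `rt` (seconds) columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(row["mz"]), float(row["rt"])) for row in reader]


def matches_target(mz, rt_s, targets, ppm=20.0, rt_tol_s=30.0):
    """True when a precursor falls within both tolerances of any target."""
    return any(
        abs(mz - t_mz) / t_mz * 1e6 <= ppm and abs(rt_s - t_rt) <= rt_tol_s
        for t_mz, t_rt in targets
    )


# In practice the text would come from a file, e.g. open(path).read().
target_csv = "mz,rt\n487.2564,654\n601.2987,721\n"
targets = load_targets(target_csv)

on_list = matches_target(487.2570, 660.0, targets)
off_list = matches_target(322.1541, 432.0, targets)
```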

Spectral Pre-processing & Parameter Configuration

Before similarity calculation, MS2 spectra undergo merging and cleaning. Key configurable parameters in platforms like simRank-Filter include [18]:

Table 1: Key Pre-processing and Algorithm Parameters for simRank Analysis

| Parameter | Default Value | Function & Impact on Analysis |
| --- | --- | --- |
| Fragment Intensity Threshold | 1% | Fragments with normalized intensity below this value are excluded from comparison, reducing noise [18]. |
| Retention Time Merge Window (ΔRT) | 30 sec | MS2 spectra from the same precursor ion within this RT window are merged to create a consensus spectrum [18]. |
| Precursor Alignment Tolerance | 20 ppm, 0.01 Da | Maximum m/z difference to align precursor ions across runs for control library building [18]. |
| Fragment Alignment Tolerance | 0.01 Da | Maximum m/z difference to consider two fragment ions as identical during spectrum comparison [18]. |
| Minimum Fragments per Spectrum | 5 | Spectra with fewer fragments are considered low-quality and excluded from analysis [18]. |
| Remove Precursor Ion Window | Enabled (17 Da) | Removes fragments close to the precursor m/z (e.g., water/ammonia losses), which are often non-informative [18]. |
| simRank Similarity Threshold | 15 | Critical. Sample spectra with a similarity score below this value against all control spectra are output as novel candidates [18]. |
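The cleaning parameters in the table can be illustrated with a short sketch. It assumes that the precursor window is symmetric around the precursor m/z, that intensities are normalized to the base peak after the window is removed, and that the function and argument names are free choices; none of these details are confirmed by the source.

```python
def clean_spectrum(peaks, precursor_mz, min_rel_intensity=0.01,
                   precursor_window=17.0, min_fragments=5):
    """Clean one MS2 spectrum given as (mz, intensity) pairs.

    Returns base-peak-normalized fragments, or None when the spectrum
    has too few fragments left to be compared.
    """
    # Drop fragments within the precursor window (assumed symmetric).
    kept = [(mz, inten) for mz, inten in peaks
            if abs(mz - precursor_mz) > precursor_window]
    if not kept:
        return None
    base = max(inten for _, inten in kept)
    if base <= 0:
        return None
    # Normalize to the base peak and apply the relative intensity threshold.
    kept = [(mz, inten / base) for mz, inten in kept
            if inten / base >= min_rel_intensity]
    return kept if len(kept) >= min_fragments else None


raw = [(100.0, 1000.0), (120.0, 500.0), (150.0, 5.0), (200.0, 300.0),
       (250.0, 200.0), (300.0, 100.0), (480.0, 5000.0)]
cleaned = clean_spectrum(raw, precursor_mz=487.2564)
too_sparse = clean_spectrum(raw[:4], precursor_mz=487.2564)
```

In this example the 480.0 fragment is removed by the precursor window and the 150.0 fragment by the 1% threshold, which leaves exactly five fragments.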

Execution of simRank Comparison

The core algorithm follows a defined computational workflow, as detailed below.

[Diagram: For each merged MS2 spectrum from the sample of interest, run a pairwise comparison against every spectrum in the pooled library of merged control spectra: (1) align fragment ions within the m/z tolerance, (2) calculate the simRank similarity score, (3) track the maximum similarity score, then move to the next control spectrum. Once all controls are compared, test the maximum score against the set threshold: a maximum score below the threshold gives a novel candidate (priority for isolation); a maximum score at or above the threshold gives a known/dereplicated feature (lower priority). Continue with the next sample spectrum.]

Diagram 2: The simRank Spectral Comparison and Prioritization Logic.

Output Interpretation

The primary output is a table of filtered features. The most critical column is the simRank similarity score. Features with scores below the applied threshold represent MS2 spectra not found in the control background and are high-priority targets for downstream isolation and structure elucidation [1] [18].

Table 2: Exemplar Output from NP-PRESS simRank Analysis

| Precursor m/z | Retention Time (s) | simRank Score (vs. Controls) | Status | Proposed Action |
| --- | --- | --- | --- | --- |
| 487.2564 | 654 | 5.2 | Novel | High Priority: Proceed to isolation |
| 322.1541 | 432 | 78.9 | Known/Dereplicated | Low priority, likely from media/biotic process |
| 601.2987 | 721 | 12.1 | Novel | High Priority: Proceed to isolation |
| 455.2302 | 589 | 92.3 | Known/Dereplicated | Deprioritize |

Validation & Case Studies

The efficacy of the integrated NP-PRESS pipeline, culminating in the simRank filter, has been demonstrated in multiple discovery campaigns [1] [4].

  • Case Study 1: Streptomyces albus J1074 Application of NP-PRESS guided the discovery of previously overlooked surugamide analogs. The simRank stage was critical in distinguishing their MS2 signatures from the complex metabolic background [1] [4].

  • Case Study 2: Wukongibacter baidiensis M2B1 (Anaerobic Bacterium) NP-PRESS analysis led to the discovery of an entirely new family of depsipeptides, named baidienmycins. These compounds exhibited potent antimicrobial and anticancer activities in bioassays. This success underscores the pipeline's power in uncovering novel NPs from underexplored and extremophilic microorganisms [1] [4].

The Scientist's Toolkit: Essential Research Reagents & Materials

Table 3: Key Reagents and Materials for NP-PRESS simRank Protocol

| Item | Function in Protocol | Specifications & Notes |
| --- | --- | --- |
| LC-MS Grade Solvents | Mobile phase for chromatography. | Acetonitrile, Methanol, Water (with 0.1% Formic Acid). Essential for reproducible retention times and high MS sensitivity. |
| Microbial Growth Media | Culturing experimental and control samples. | Use chemically defined media if possible to simplify the control background. Document all components for reference. |
| Standard QA/QC Compounds | System suitability and calibration. | A mix of known compounds to verify LC-MS performance and mass accuracy before analytical runs. |
| Data Processing Software | Raw data conversion and peak picking. | e.g., MSConvert (ProteoWizard) to generate .mzML files from vendor formats [19]. |
| simRank Implementation Platform | Executing the Stage 2 algorithm. | e.g., simRank-Filter web module or custom Python/R scripts implementing the algorithm [17] [18]. |
| Dereplication Databases | Contextualizing simRank results. | Public (GNPS, NP Analyst [19]) or commercial spectral libraries for additional validation of novelty. |

The discovery of novel natural products (NPs) from microbial sources remains a cornerstone of pharmaceutical development, yet is challenged by the high rate of compound rediscovery and the obscurity of low-abundance metabolites within complex biological extracts [20]. This application note details a targeted methodology for the discovery of new surugamide analogs, a family of bioactive cyclic nonribosomal peptides (NRPs), from Streptomyces species. The protocol is explicitly framed within the methodological context of the NP-PRESS (Natural Product Prioritization and Refinement by mass Spectrometry Strategy) research, a novel two-stage MS feature dereplication pipeline [1] [4].

The NP-PRESS strategy addresses a critical gap in metabolomics by systematically removing irrelevant MS features originating from abiotic processes and, more challengingly, biotic processes such as media components and cellular degradation products [1]. By integrating two specialized algorithms—FUNEL for MS1-level feature refinement and simRank for MS2-spectral similarity scoring—NP-PRESS refines crude metabolomes to highlight genuine secondary metabolites [4]. As a proof-of-concept, this pipeline was successfully applied to Streptomyces albus J1074, facilitating the identification of previously overlooked surugamide analogs [1] [4]. This document translates that research into a standardized, detailed protocol for researchers aiming to discover novel derivatives within known natural product families.

Foundational Knowledge: The Surugamide Family

Surugamides are a growing family of peptides produced by Streptomyces, primarily characterized by an eight-amino-acid macrocyclic core structure that includes multiple D-amino acid residues [21]. They are biosynthesized by a unique non-ribosomal peptide synthetase (NRPS) gene cluster (surABCD) and cyclized by a dedicated penicillin-binding protein-like thioesterase (PBP-like TE), SurE [22] [23] [24].

  • Bioactivity Profile: Surugamides exhibit diverse pharmacological potential, including cathepsin B inhibition (implicated in cancer metastasis), antifungal activity, and recently, selective anthelmintic activity against Dirofilaria immitis (heartworm) [21] [25].
  • Structural Diversity: The family encompasses cyclic octapeptides (e.g., surugamides A–E), linear decapeptides (e.g., surugamide F), and acylated analogs (e.g., acyl-surugamide A) [21] [24]. Modifications, particularly acylation of a lysine residue, have been shown to be critical for specific biological activities, such as anthelmintic action [25].
  • Discovery Challenge: In model strains like S. albus J1074, the sur biosynthetic gene cluster (BGC) is often silent or poorly expressed under standard laboratory conditions, requiring specific cultivation or elicitation strategies for detectable production [26] [27].

The NP-PRESS Dereplication Strategy: Core Workflow

The NP-PRESS pipeline is designed to prioritize NP-derived MS signals by removing interfering features in two sequential stages [1] [4].

[Diagram: NP-PRESS two-stage MS dereplication workflow. Stage 1, MS1-level refinement (FUNEL): raw LC-HRMS/MS data (full-scan MS1 and DDA-MS2) → FUNEL algorithm filters non-NP features → NP-enriched feature list. Stage 2, MS2-level prioritization (simRank): MS2 spectra of refined features → simRank algorithm scores them against reference NP spectral libraries (e.g., GNPS) → prioritized, ranked list of novel NP candidates.]

Stage 1: MS1-Level Filtering with FUNEL The FUNEL algorithm processes untargeted LC-HRMS data to remove mass features associated with cultivation media, primary metabolites, and common laboratory contaminants. It employs blank subtraction, isotopic pattern recognition, and heuristic rules based on the typical physicochemical properties of secondary metabolites to drastically reduce dataset complexity before MS2 analysis [1] [4].

Stage 2: MS2-Level Prioritization with simRank The simRank algorithm analyzes the MS/MS spectra of the refined feature list. It computes spectral similarity scores against curated databases of known natural product spectra (e.g., GNPS). Critically, it prioritizes features that show high similarity to a known NP family (indicating structural relatedness) but are not exact matches, thereby flagging potential novel analogs like new surugamides for isolation [1] [4].

Integrated Experimental Protocol for Surugamide Discovery

This protocol combines strain selection, culture elicitation, NP-PRESS-based analysis, and targeted isolation.

Protocol 4.1: Strain Cultivation, Elicitation, and Extraction

Objective: To activate the silent sur BGC and maximize surugamide analog production [26] [27].

  • Strain Selection: Select Streptomyces strains predicted to harbor the sur BGC via genome mining (e.g., using antiSMASH). Marine-derived strains (e.g., S. albidoflavus RKJM-0023) often show higher constitutive production [21] [26].
  • OSMAC Cultivation:
    • Inoculate strain into multiple media (e.g., BFM15m, SYP-NaCl, YD, ISP2) [21] [26].
    • Incubate at 28°C with shaking (180 rpm) for 5-7 days.
  • Chemical Elicitation (Optional but Recommended for Silent Clusters):
    • Prepare stock solutions of elicitors: Ivermectin (5 µg/mL in DMSO) and Etoposide (10 µg/mL in DMSO) [27].
    • At 24-48 hours post-inoculation, add elicitor to culture to achieve a final sub-inhibitory concentration (e.g., 0.1-1 µg/mL). Use a DMSO vehicle control.
  • Harvest and Extraction:
    • Acidify the culture broth to pH ~3.0 using dilute HCl.
    • Extract twice with an equal volume of ethyl acetate (EtOAc).
    • Combine organic layers, dry over anhydrous Na₂SO₄, and concentrate in vacuo to obtain the crude extract.

Protocol 4.2: LC-HRMS/MS Data Acquisition for NP-PRESS

Objective: Generate high-quality MS1 and MS2 data for dereplication.

  • Instrumentation: Use UHPLC coupled to a high-resolution mass spectrometer (e.g., Q-TOF or Orbitrap) equipped with an ESI source.
  • Chromatography:
    • Column: C18 reversed-phase column (e.g., 2.1 x 100 mm, 1.7 µm).
    • Mobile Phase: (A) H₂O + 0.1% Formic acid; (B) Acetonitrile + 0.1% Formic acid.
    • Gradient: 5% B to 100% B over 20-25 minutes.
    • Flow Rate: 0.3 mL/min.
  • Mass Spectrometry:
    • Ionization: ESI positive mode.
    • MS1 Scan: m/z 150-1500, resolution > 30,000.
    • MS2 (DDA): Top 10-15 most intense ions per cycle. Use a stepped normalized collision energy (e.g., NCE 20, 35, 50).

Protocol 4.3: Data Processing via NP-PRESS and Molecular Networking

Objective: Dereplicate known compounds and prioritize novel surugamide analogs.

  • Data Conversion: Convert raw files to open formats (.mzML or .mzXML).
  • Execute NP-PRESS Pipeline:
    • Apply the FUNEL algorithm to the MS1 data to filter out non-NP features [1] [4].
    • Submit the filtered MS2 data to the simRank algorithm against a custom database containing surugamide A-E, F, and acyl-surugamide spectra [1] [4].
    • The output is a ranked list of features. High-priority candidates are those with (a) high simRank similarity to the surugamide spectral family and (b) a precursor mass not matching a known surugamide.
  • Orthogonal Validation with GNPS:
    • Upload the same dataset to the Global Natural Products Social Molecular Networking (GNPS) platform [21] [20].
    • Create a molecular network using the standard feature-based workflow.
    • Annotate the cluster containing known surugamide standards. Novel analogs prioritized by NP-PRESS should appear as distinct nodes within this same cluster [21] [25].
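The two prioritization criteria in this protocol (high similarity to the surugamide spectral family, and a precursor mass that matches no known surugamide) can be sketched as a simple filter. Everything numeric here is a placeholder: the precursor m/z values are invented for illustration and are not real surugamide masses, and the score cutoff of 70, the 20 ppm tolerance, and the field names are assumptions.

```python
def prioritize_analogs(candidates, known_mzs, min_family_score=70.0, ppm=20.0):
    """Keep family-like features whose precursor m/z matches no known member."""
    def is_known(mz):
        return any(abs(mz - k) / k * 1e6 <= ppm for k in known_mzs)

    hits = [c for c in candidates
            if c["family_score"] >= min_family_score and not is_known(c["mz"])]
    # Most family-like candidates first.
    return sorted(hits, key=lambda c: c["family_score"], reverse=True)


# Placeholder precursor m/z values for already-known family members.
known_mzs = [900.0, 914.0]

candidates = [
    {"mz": 900.001, "family_score": 95.0},  # matches a known member
    {"mz": 942.5, "family_score": 88.0},    # family-like, unknown mass
    {"mz": 610.3, "family_score": 12.0},    # not family-like
    {"mz": 928.2, "family_score": 91.0},    # family-like, unknown mass
]
ranked = prioritize_analogs(candidates, known_mzs)
```

A real run would replace `known_mzs` with the measured or calculated adduct masses of the surugamides in the custom database.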

Protocol 4.4: Targeted Isolation and Structural Elucidation

Objective: Physically isolate and determine the structure of prioritized analogs.

  • Scale-up Fermentation: Perform large-scale cultivation (e.g., 10 x 1 L) of the most productive medium/condition identified in Protocol 4.1 [21].
  • Primary Fractionation: Fractionate the crude extract using normal-phase or size-exclusion flash chromatography.
  • Targeted Purification: Use semi-preparative reversed-phase HPLC to purify the specific m/z target. Monitor fractions by LC-MS.
  • Structural Characterization:
    • HRMS: Confirm molecular formula.
    • NMR Spectroscopy: Acquire 1D (¹H, ¹³C) and 2D (COSY, HSQC, HMBC, TOCSY) NMR spectra. Key NMR signals for surugamides include multiple amide NH protons (δH 7.1-8.5) and characteristic amino acid side-chain signals [21].
    • Amino Acid Analysis: Perform acid hydrolysis followed by Marfey's reagent derivatization to determine the stereochemistry (L/D) of constituent amino acids.
    • Biosynthetic Correlation: Amplify and sequence the surE cyclase gene from the producing strain to confirm the genetic basis for macrocyclization [23].

Key Data and Bioactivity of Recent Surugamide Analogs

Table 1: Recently Discovered Surugamide Analogs and Their Properties

| Analog Name | Producing Strain | Molecular Formula | Key Structural Feature | Reported Bioactivity (IC50/EC50) | Citation |
| --- | --- | --- | --- | --- | --- |
| Acyl-Surugamide A2 | S. albidoflavus RKJM-0023 | C₅₀H₈₃N₉O₉ | N-ε-acetyl-L-lysine residue | Antifungal (data pending) | [21] |
| Acyl-Surugamide A3 | Streptomyces sp. CMB-M0112 | Not specified | Acylated lysine derivative | Anthelmintic vs. D. immitis: 3.3 µg/mL | [25] |
| Surugamide K | Streptomyces sp. CMB-MRB032 | Not specified | N-methylated analog | Inactive vs. D. immitis (>25 µg/mL) | [25] |
| Acyl-Surugamide AS3 (semi-synthetic) | Derivatized from Surugamide A | Not specified | Synthetic acylation | Anthelmintic vs. D. immitis: 3.4 µg/mL | [25] |

Table 2: Elicitation Effect on Surugamide Production in S. albus J1074

| Cultivation Condition | Relative Production of Surugamides (vs. Control) | Key Elicitor/Medium | Citation |
| --- | --- | --- | --- |
| Standard Medium (TSB) | Low/Basal (Repressed) | N/A | [26] |
| YD Medium | >13-fold increase (in marine strain SM17) | Marine strain vs. terrestrial J1074 | [26] |
| Chemical Elicitation (Ivermectin) | Up to 5-fold induction of sur BGC expression | HiTES screening | [27] |
| Chemical Elicitation (Etoposide) | Up to 5-fold induction of sur BGC expression | HiTES screening | [27] |

The Surugamide Biosynthetic Pathway and Key Enzyme

The biosynthesis of surugamides involves a unique NRPS assembly line and a dedicated cyclase.

[Diagram: Surugamide biosynthesis and SurE cyclization. NRPS assembly line (surA and surD for octapeptides; L-Ile1 loading, then chain extension) → linear peptidyl-thioester with terminal D-Leu8 → cyclase SurE (PBP-like TE), which recognizes L-Ile1 at the N-terminus and D-Leu8 at the C-terminus → heterochiral coupling → head-to-tail cyclic octapeptide (e.g., surugamide A) → post-cyclization modification (e.g., lysine acylation). SurE catalytic mechanism: an oxyanion hole stabilizes the C-terminal D-residue; a hydrogen-bond network (Y154, K66, N156) recognizes the N-terminal L-residue; residue R446 anchors the substrate for macrocyclization.]

  • NRPS Assembly: The octapeptide core is assembled by two NRPS proteins, SurA and SurD, with integrated epimerization (E) domains introducing D-amino acids [22] [24].
  • Cyclization by SurE: The linear peptide, tethered as a thioester, is offloaded and macrocyclized by SurE. This enzyme is a PBP-like thioesterase that uniquely catalyzes heterochiral coupling between an N-terminal L-amino acid and a C-terminal D-amino acid [23].
  • Catalytic Mechanism: Computational studies reveal SurE uses an oxyanion hole to stabilize the C-terminal D-residue and a critical hydrogen-bond network (involving Y154, K66, N156) to recognize the N-terminal L-residue, enabling selective cyclization with an energy barrier of ~19.4 kcal/mol [23].
  • Post-Cyclization Modification: Analogs like acyl-surugamides are formed by enzymatic acylation of the cyclic core, a modification directly linked to enhanced anthelmintic activity [25].

The Scientist's Toolkit: Essential Reagents and Materials

Table 3: Key Research Reagent Solutions for Surugamide Discovery

| Item | Function/Description | Example/Application in Protocol |
| --- | --- | --- |
| BFM15m / SYP-NaCl Media | Cultivation media that enhance surugamide production in marine Streptomyces strains [21] [26]. | Used in OSMAC cultivation (Protocol 4.1). |
| Ivermectin & Etoposide Elicitors | Chemical inducers of silent biosynthetic gene clusters. Act via pathway-specific repression relief/SOS response [27]. | Used in chemical elicitation (Protocol 4.1). |
| Ethyl Acetate (EtOAc) | Organic solvent for broad-spectrum metabolite extraction from acidified culture broth. | Used in harvest and extraction (Protocol 4.1). |
| C18 Reversed-Phase HPLC Columns | Standard for peptide separation based on hydrophobicity. Critical for analytical profiling and purification. | Used in LC-HRMS (4.2) and Targeted Purification (4.4). |
| Marfey's Reagent (FDAA) | Chiral derivatizing agent for determining the absolute configuration (L/D) of amino acids after hydrolysis. | Used in Structural Characterization (Protocol 4.4). |
| N-Acetylcysteamine (SNAC) Thioester | Synthetic mimic of the peptidyl carrier protein (PCP)-bound thioester intermediate. | Used in in vitro studies of SurE cyclase activity [22]. |
| NP-PRESS Software Pipeline | Custom algorithms (FUNEL, simRank) for two-stage MS data dereplication and novel NP prioritization [1] [4]. | Core of data processing (Protocol 4.3). |
| GNPS Molecular Networking Platform | Public web-based platform for MS/MS spectral similarity networking and database comparison [21] [20]. | Used for orthogonal validation (Protocol 4.3). |

This application note details the successful integration of the NP-PRESS (Natural Products-Prioritization and Evaluation by Stage-wise Screening) two-stage MS dereplication strategy for the discovery of novel depsipeptides from the anaerobic, extremophilic bacterium Wukongibacter baidiensis. The NP-PRESS strategy utilizes newly developed MS1 (FUNEL) and MS2 (simRank) algorithms to effectively remove interfering signals from abiotic and biotic processes, enabling the prioritization of low-yield, hard-to-detect natural products. As a proof-of-concept, this approach guided the isolation and characterization of the baidienmycin family, a new class of depsipeptides exhibiting potent antimicrobial and anticancer activities [1]. This study underscores the efficacy of targeted dereplication in unlocking the bioactive potential of under-explored extremophilic bacteria within the context of modern natural product drug discovery.

The rediscovery of known compounds remains a critical bottleneck in natural product (NP)-based drug discovery. Mass spectrometry (MS) is a powerful discovery tool, but its utility is often hampered by the overwhelming complexity of microbial extracts, where signals from true secondary metabolites are obscured by a background of interfering features derived from culture media, cellular degradation products, and other biotic processes [1]. This challenge is particularly acute when investigating unusual or extremophilic bacteria, such as Wukongibacter baidiensis, which thrive in harsh environments like hydrothermal vents and are promising sources of novel chemistry [28] [29].

This application note is framed within the broader thesis research on the NP-PRESS two-stage MS feature dereplication strategy. The core thesis posits that a systematic, algorithm-driven filtering of MS data can dramatically improve the efficiency of novel NP discovery. NP-PRESS operationalizes this by implementing two sequential filtering stages: first, the FUNEL algorithm cleans MS1 data by removing features associated with common biochemical building blocks and known noise patterns; second, the simRank algorithm analyzes MS2 fragmentation spectra to cluster and rank features based on structural novelty compared to dereplication libraries [1]. The case study presented here—the discovery of baidienmycins from W. baidiensis—serves as a critical validation of this thesis, demonstrating its practical application and effectiveness in a real-world discovery pipeline targeting depsipeptides, a class of compounds with proven therapeutic potential [30] [31] [32].

Target Bacterium: Wukongibacter baidiensis

Wukongibacter baidiensis is an anaerobic, Gram-stain-positive, spore-forming bacterium first isolated from mixed hydrothermal sulfide samples collected from a deep-sea vent [28].

  • Taxonomy & Phylogeny: It belongs to the family Peptostreptococcaceae. Phylogenetic analysis of its 16S rRNA gene shows it forms a distinct genus, with its closest relatives being Clostridium halophilum and Clostridium caminithermale (now reclassified) [28].
  • Extremophile Physiology: The strain grows optimally at 30°C, pH 8.0, and in a high-salinity environment (30-40 g/L sea salts), reflecting its adaptation to the hydrothermal vent niche [28]. Such extreme environments are recognized as reservoirs for microbial lineages with unique genomic and biosynthetic capabilities [29].
  • Chemotaxonomy: The major fatty acids are C14:0 and summed feature 1 (iso H-C15:1 / C13:0 3-OH). Predominant polar lipids include diphosphatidylglycerol, phosphatidylcholine, and phosphatidylethanolamine. Its genomic DNA G+C content is 33.4 mol% [28].

Core Technology: The NP-PRESS Dereplication Strategy

The NP-PRESS strategy is designed to overcome the signal-to-noise problem in LC-MS-based metabolomics. Its two-stage workflow is summarized in the table below and visualized in Figure 1.

Table 1: The Two-Stage NP-PRESS Dereplication Workflow

| Stage | Algorithm | Data Level | Primary Function | Key Action |
|---|---|---|---|---|
| Stage 1 | FUNEL | MS1 (Precursor Ion) | Filtering & Cleanup | Removes mass features corresponding to ubiquitous biochemical building blocks, media components, and known biotic interference patterns. |
| Stage 2 | simRank | MS2 (Fragmentation) | Dereplication & Prioritization | Compresses MS2 spectra into loss/decomposition vectors, clusters them via similarity ranking, and flags clusters with no match to known compound libraries as high-priority for novel NPs. |

Workflow: Raw LC-MS/MS Data (Complex Feature Set) → Stage 1: FUNEL Algorithm (MS1 Data Filtering; removes abiotic/biotic background) → Filtered Feature List (Potential NP Candidates) → Stage 2: simRank Algorithm (MS2 Similarity Clustering) → Dereplicated & Prioritized List → either High-Priority Novel NP Candidates (no match to reference library) or Known or Common Compounds (matches reference library).

Figure 1: The NP-PRESS Two-Stage MS Dereplication Workflow. This diagram illustrates the sequential application of the FUNEL (Stage 1) and simRank (Stage 2) algorithms to filter complex LC-MS data and prioritize novel natural product candidates [1].

Application Case Study: Discovery of Baidienmycins

4.1 Experimental Workflow

  • Fermentation & Extraction: W. baidiensis M2B1 was cultured under optimized anaerobic conditions. The broth was extracted with organic solvent to obtain a crude natural product extract.
  • LC-MS/MS Analysis: The extract was analyzed using high-resolution LC-MS/MS, generating a dataset of MS1 precursor ions and associated MS2 fragmentation spectra.
  • NP-PRESS Analysis: The raw data was processed using the NP-PRESS pipeline. FUNEL significantly reduced the dataset by removing >50% of interfering MS1 features. The remaining features were analyzed by simRank, which clustered MS2 spectra and compared them against public (e.g., GNPS) and proprietary dereplication libraries.
  • Priority Identification: A distinct cluster of features showing no spectral match to known compounds was flagged as high-priority. These features shared a core fragmentation pattern suggestive of a related compound family.
  • Bioactivity-Guided Fractionation: Following the NP-PRESS priority list, targeted fractionation via preparative HPLC was performed. Fractions were screened for antimicrobial and cytotoxic activities.
  • Structure Elucidation: Active fractions containing the prioritized features were subjected to advanced purification. The structures of the novel compounds, named baidienmycins, were determined using a combination of NMR spectroscopy, Marfey's analysis, and further MS sequencing, confirming them as a new family of depsipeptides [1].

4.2 Key Outcomes & Biological Activities

The application of NP-PRESS led directly to the efficient discovery of the baidienmycin family. Preliminary biological evaluation revealed significant activities, as summarized below.

Table 2: Biological Activity Profile of Baidienmycins from W. baidiensis

| Activity Assay | Target / Cell Line | Reported Potency | Significance |
|---|---|---|---|
| Antimicrobial | Panel of bacterial pathogens | Potent activity | Indicates potential as a new antibiotic scaffold, crucial in the AMR crisis [1] [30]. |
| Anticancer | Panel of human cancer cell lines | Potent activity | Suggests potential for development as anticancer agents [1] [31]. |

Detailed Experimental Protocols

5.1 Protocol 1: NP-PRESS Data Analysis for Depsipeptide Prioritization

  • Software & Input: Process raw .RAW or .mzML files from HR-LC-MS/MS (e.g., Q-Exactive, timsTOF) using the NP-PRESS software suite. A sample blank and a medium control are required.
  • Step 1 – MS1 Processing with FUNEL: Convert raw data to a feature table (m/z, RT, intensity). Apply the FUNEL filter to remove features matching: (a) common metabolite building block masses (±5 ppm), (b) features present in the blank/medium control, (c) isotopes and adducts.
  • Step 2 – MS2 Processing with simRank: For remaining features, extract MS2 spectra. Use simRank to convert spectra to binary loss/decomposition vectors. Perform pairwise similarity comparison (cosine similarity >0.7) to cluster related spectra.
  • Step 3 – Dereplication: Query each spectral cluster against the GNPS molecular library and an in-house depsipeptide library. Flag clusters with no library match, or with matches below a confidence threshold (e.g., cosine score <0.5), as high-priority targets.
  • Output: A ranked list of precursor m/z values and retention times corresponding to putative novel depsipeptides, ready to guide fraction collection.
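Step 2 above can be illustrated with a minimal sketch. It is not the published simRank implementation: it assumes that a "binary loss vector" is the set of binned neutral losses (precursor m/z minus fragment m/z) and that two spectra are linked when the cosine similarity of those vectors exceeds 0.7. The function names, the 0.01 Da bin width, and the example m/z values are all hypothetical.

```python
import math
from itertools import combinations

def loss_vector(precursor_mz, fragment_mzs, bin_width=0.01):
    """Binary vector stored as a set of active bins, one per neutral loss."""
    return {round((precursor_mz - f) / bin_width)
            for f in fragment_mzs if f < precursor_mz}

def binary_cosine(a, b):
    """Cosine similarity of two binary vectors represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def linked_pairs(spectra, cutoff=0.7):
    """Return spectrum pairs whose loss-vector similarity exceeds the cutoff."""
    vectors = {name: loss_vector(mz, frags) for name, (mz, frags) in spectra.items()}
    return [(x, y) for x, y in combinations(sorted(vectors), 2)
            if binary_cosine(vectors[x], vectors[y]) > cutoff]

# Hypothetical spectra: A and B share the same neutral losses, C does not.
spectra = {
    "A": (800.40, [782.39, 687.32, 574.24, 443.20]),
    "B": (814.42, [796.41, 701.34, 588.26, 457.22]),
    "C": (500.25, [455.20, 300.10]),
}
print(linked_pairs(spectra))
```

Comparing neutral losses rather than raw fragment m/z values lets analogs with shifted precursor masses (A and B here) still link, which is one plausible reason for the loss-based representation.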

5.2 Protocol 2: Fermentation & Targeted Fractionation of W. baidiensis

  • Culture Conditions: Inoculate W. baidiensis M2B1 into anaerobic broth (e.g., supplemented marine broth). Incubate at 30°C under anaerobic conditions (N₂/CO₂/H₂ atmosphere) for 7-14 days with shaking [28].
  • Extraction: Adjust the broth to pH 3.0 with HCl. Extract twice with an equal volume of ethyl acetate. Combine the organic layers, dry over anhydrous Na₂SO₄, and concentrate in vacuo to yield the crude extract.
  • Targeted Fractionation: Reconstitute crude extract in methanol. Perform analytical-scale LC-MS using conditions identical to the discovery run. Use the NP-PRESS output list to program a preparative HPLC system to collect time-based fractions corresponding to the high-priority m/z-RT pairs.
  • Bioassay: Screen all collected fractions for bioactivity (e.g., antibacterial vs. S. aureus, cytotoxicity vs. HepG2 cells). Pool active fractions containing the target ions for further purification.
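The time-based fraction collection in the targeted fractionation step can be planned from the NP-PRESS output list. The sketch below is one possible approach, not part of the NP-PRESS software: it assumes each target is an (m/z, RT in minutes) pair, opens a collection window of ±0.25 min around each RT, and merges overlapping windows. The window half-width and the target values are hypothetical.

```python
def fraction_windows(targets, half_width=0.25):
    """Build merged collection windows (start, end, [m/z, ...]) from (m/z, RT) targets."""
    windows = []
    for mz, rt in sorted(targets, key=lambda t: t[1]):
        start, end = rt - half_width, rt + half_width
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it and record the extra ion.
            prev_start, prev_end, mzs = windows[-1]
            windows[-1] = (prev_start, max(prev_end, end), mzs + [mz])
        else:
            windows.append((start, end, [mz]))
    return windows

# Hypothetical high-priority m/z-RT pairs.
targets = [(812.45, 12.4), (826.47, 12.7), (640.30, 18.0)]
for start, end, mzs in fraction_windows(targets):
    print(f"collect {start:.2f}-{end:.2f} min for m/z {mzs}")
```

Merging overlapping windows keeps co-eluting analogs in one fraction, which suits a compound family that shares a core structure.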

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Depsipeptide Discovery from Unusual Bacteria

| Item | Function & Specification | Application in Protocol |
|---|---|---|
| Specialized Culture Media | Anaerobic broth (e.g., Marine Broth 2216), pre-reduced, with specific salinity (30-40 g/L sea salts) and pH 8.0 buffer. | Cultivation of fastidious anaerobic extremophiles like W. baidiensis [28]. |
| Anaerobic Chamber or Jars | System to maintain an oxygen-free atmosphere (e.g., N₂:CO₂:H₂, 80:10:10). | Essential for inoculating, transferring, and growing strict anaerobes. |
| High-Resolution LC-MS/MS System | Instrument capable of data-dependent acquisition (DDA) or data-independent acquisition (DIA), e.g., UPLC-QTOF or UPLC-Orbitrap. | Generation of high-quality MS1 and MS2 spectra for NP-PRESS analysis [1]. |
| Dereplication Libraries | Digital spectral databases: public (GNPS) and in-house curated libraries of known NPs and depsipeptides. | Reference for the simRank algorithm to identify known compounds and highlight novelty [1] [30]. |
| Preparative HPLC System | System with C18 column, UV-Vis/DAD detector, and automated fraction collector. | Isolation of gram-scale quantities of target compounds guided by NP-PRESS output. |
| NMR Solvents (Deuterated) | High-purity solvents: DMSO-d6, Methanol-d4, CDCl3. | Structure elucidation of purified novel depsipeptides. |

Biosynthesis & Mechanism Visualization

Depsipeptides like baidienmycins are typically biosynthesized by multi-modular enzymatic complexes known as non-ribosomal peptide synthetases (NRPS), often with hybrid polyketide synthase (PKS) components [30] [31]. A generalized NRPS/PKS pathway is illustrated below.

Pathway: Amino Acid & Hydroxy Acid Substrates → (activated by A domain) → NRPS/PKS Mega-Complex (Adenylation (A), Thiolation (T), Condensation (C), Epimerization (E), and Thioesterase (TE) domains) → (chain elongation via C domain) → Linear Peptidyl Intermediate (tethered to T domain) → (cleavage and cyclization by TE domain) → TE Domain-Catalyzed Macrocyclization & Release → Cyclic Depsipeptide (e.g., Baidienmycin).

Figure 2: Generalized NRPS/PKS Biosynthetic Pathway for Depsipeptides. This diagram outlines the key enzymatic steps in assembling cyclic depsipeptides, involving substrate activation, sequential condensation, and final macrocyclization [30] [31].

Concluding Remarks

This case study validates the NP-PRESS two-stage MS dereplication strategy as a powerful framework for thesis research and applied natural product discovery. By systematically eliminating analytical noise and prioritizing true novelty, it enables researchers to efficiently probe "difficult" sources like extremophilic bacteria. The discovery of the bioactive baidienmycins from Wukongibacter baidiensis serves as a compelling model for future efforts aimed at mining the unique chemical space encoded by unusual microorganisms, accelerating the identification of novel depsipeptides and other lead compounds for therapeutic development.

The discovery of novel natural products (NPs) from microbial sources is pivotal for pharmaceutical development, yet it is hampered by the high complexity of microbial metabolomes and the resource-intensive nature of traditional bioassay-guided isolation [33] [4]. A significant challenge lies in the mass spectrometry (MS) data, where signals from novel NPs are often obscured by a vast number of interfering features originating from abiotic sources, culture media, and products of microbial processing [1] [4]. This "chemical noise" leads to inefficient resource allocation and missed discoveries.

This application note is framed within the broader research thesis on the two-stage MS feature dereplication strategy, NP-PRESS (Natural Product Prioritization and Evaluation by Semi-Supervised Scoring) [1] [4]. NP-PRESS addresses the core dereplication challenge by implementing a metabolome-refining pipeline designed to systematically remove irrelevant chemical features and prioritize those most likely to be novel secondary metabolites. The strategy employs two key algorithms: FUNEL for MS1-level filtering of non-NP features and simRank for MS2-level spectral networking and novelty scoring [4].

Here, we detail the extension and successful application of this NP-PRESS workflow to the mining of NPs from mangrove-derived Streptomyces, an exceptionally promising but metabolomically complex source. Mangrove ecosystems are biodiversity hotspots, and their unique environmental pressures (e.g., high salinity, low oxygen) drive microbes like Streptomyces to produce diverse and bioactive secondary metabolites [34]. Demonstrating efficacy in this challenging context validates NP-PRESS as a robust strategy for accelerating NP discovery from complex environmental microbiomes.

Application Notes: NP-PRESS in Action

2.1. Proof-of-Concept and Validation

The NP-PRESS pipeline was initially validated using the model strain Streptomyces albus J1074, where it successfully facilitated the identification of new surugamide analogs [1] [4]. Its performance was further demonstrated on the unusual anaerobic bacterium Wukongibacter baidiensis M2B1, leading to the discovery of the new, bioactive depsipeptide family, baidienmycins [1]. These successes established NP-PRESS's capability to uncover novel metabolites from diverse bacterial sources by effectively differentiating NP signals from complex background interference.

2.2. Direct Application: Mining Mangrove Streptomyces speibonae W307

The NP-PRESS strategy was directly applied to Streptomyces speibonae W307, isolated from a mangrove environment [33]. The two-stage dereplication process was critical for managing the metabolomic complexity of this strain.

  • Stage 1 (FUNEL): The MS1 data was processed to filter out features associated with known media components and common microbial processing products, drastically reducing the dataset's complexity.
  • Stage 2 (simRank): The remaining MS2 spectra were analyzed to cluster similar molecules and score their novelty against spectral databases.

This targeted analysis guided the isolation efforts toward a specific cluster of unknown features, culminating in the identification of three new natural products, strepyrazinones A, B, and C [33]. Structural elucidation via HR-MS and NMR, coupled with ECD calculations for configuration determination, confirmed that two of these compounds possess entirely new skeletons [33]. This case study concretely demonstrates how NP-PRESS extends the discovery workflow by providing a rational, data-driven prioritization scheme that directly leads to the isolation of novel chemical entities.

2.3. Corroborative Genomics-Metabolomics Workflow

Complementary studies on mangrove-derived Streptomyces highlight the synergy of genomics with metabolomics, a philosophy aligned with NP-PRESS's data-centric approach. For instance, whole-genome sequencing of Streptomyces murinus THV12 revealed a significant biosynthetic potential, with 47 secondary metabolite biosynthetic gene clusters (smBGCs), representing 17.9% of its 8.3 Mb genome [35]. Concurrent LC-HR-MS/MS metabolomics under OSMAC (One Strain Many Compounds) cultivation conditions detected a range of metabolites, including actinomycin D and cinnabaramide A, validating the expression of these genomic potentials [35]. This combined strategy mirrors the preparatory and investigative steps that make NP-PRESS application effective, by first identifying a strain of high potential and then applying focused metabolomic dereplication.
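To put the S. murinus THV12 figures in scale, the short calculation below converts the reported 17.9% of an 8.3 Mb genome into an absolute size and a mean size per cluster. It assumes 1 Mb = 1,000 kb and treats the 47 smBGCs as non-overlapping, an assumption the source does not state. The function name is hypothetical.

```python
def bgc_summary(genome_mb, bgc_fraction, n_clusters):
    """Return (total smBGC size in Mb, mean cluster size in kb)."""
    bgc_mb = genome_mb * bgc_fraction
    mean_kb = bgc_mb / n_clusters * 1000
    return bgc_mb, mean_kb

bgc_mb, mean_kb = bgc_summary(8.3, 0.179, 47)
print(f"{bgc_mb:.2f} Mb in smBGCs, ~{mean_kb:.1f} kb per cluster")
# prints "1.49 Mb in smBGCs, ~31.6 kb per cluster"
```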

Table 1: Summary of NP Discovery from Mangrove-Derived Streptomyces Using Advanced Strategies

| Strain | Source | Key Strategy | Major Findings | Reference |
|---|---|---|---|---|
| Streptomyces speibonae W307 | Mangrove environment | NP-PRESS dereplication pipeline | Isolation of three strepyrazinones (A-C), two with new structures. | [33] |
| Streptomyces murinus THV12 | Mangrove sediment | Combined genomics & metabolomics | Genome harbors 47 smBGCs. Metabolomics detected actinomycin D, pentamycin, etc. | [35] |
| Streptomyces sp. (Various) | Mangrove sediments (Review) | Traditional bioassay-guided fractionation | Catalog of 519 NPs (70% bioactive), including piericidins, azalomycins, etc. | [34] |

Detailed Experimental Protocols

3.1. Protocol 1: NP-PRESS Dereplication Workflow for LC-MS/MS Data

This protocol details the computational steps for implementing the two-stage NP-PRESS strategy [1] [4].

  • Step 1: LC-MS/MS Data Acquisition.
    • Acquire high-resolution LC-MS/MS data from crude microbial extracts in both positive and negative ionization modes. Use data-dependent acquisition (DDA) methods to obtain MS2 spectra [36].
  • Step 2: MS1 Data Processing with FUNEL Algorithm.
    • Convert raw data to an open format (e.g., mzML). Process files with feature detection software (e.g., MZmine [36]) to obtain a list of aligned features (m/z, RT, intensity).
    • Apply the FUNEL filter. This algorithm compares features against curated in-house databases of common media components, solvents, and known microbial degradation products. Features matching these "biotic process" compounds with high confidence are flagged for removal.
    • Output: A refined feature table containing only "NP-candidate" features for MS2 analysis.
  • Step 3: MS2 Data Analysis with simRank Algorithm.
    • Extract MS2 spectra associated with the refined feature list.
    • Submit spectra to the simRank algorithm. This tool performs molecular networking, clustering spectra based on similarity (cosine score). Crucially, it scores each cluster's novelty by comparing it to public spectral libraries (e.g., GNPS). Clusters with low similarity to known compounds receive high novelty scores [4].
    • Output: A molecular network with nodes (compounds) color-coded or ranked by novelty score. High-priority, novel clusters are visually identified for targeted isolation.
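Steps 2 and 3 can be sketched in a few lines. This is an illustration under stated assumptions, not the FUNEL or simRank code: it assumes the background database reduces to a list of reference m/z values matched within a ppm tolerance, and that a cluster's novelty score is 1 minus its best cosine score against any library spectrum. The function names, the 5 ppm tolerance, and all numeric values are hypothetical.

```python
def ppm_error(observed_mz, reference_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - reference_mz) / reference_mz * 1e6

def remove_background(features, background_mzs, tol_ppm=5.0):
    """Drop features whose m/z matches any background entry within tol_ppm."""
    return [f for f in features
            if not any(abs(ppm_error(f["mz"], b)) <= tol_ppm for b in background_mzs)]

def rank_by_novelty(best_library_cosine):
    """Rank clusters by novelty = 1 - best library cosine score (highest first)."""
    novelty = {c: 1.0 - s for c, s in best_library_cosine.items()}
    return sorted(novelty, key=novelty.get, reverse=True)

# Hypothetical inputs.
features = [{"mz": 132.1021, "rt": 1.2}, {"mz": 812.4512, "rt": 12.4}]
background = [132.1019, 180.0634]
print(remove_background(features, background))

best_cosine = {"cluster_1": 0.92, "cluster_2": 0.31, "cluster_3": 0.55}
print(rank_by_novelty(best_cosine))
```

A real background database would usually also constrain retention time and adduct type. Matching on m/z alone, as here, is the most aggressive form of the filter.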

3.2. Protocol 2: Integrated Genomics & Metabolomics for Strain Prioritization

This protocol outlines a complementary approach to identify high-potential mangrove Streptomyces strains for NP-PRESS analysis [35].

  • Step 1: Genome Sequencing and Mining.
    • Extract high-quality genomic DNA from the target Streptomyces strain.
    • Perform whole-genome sequencing (e.g., Illumina HiSeq). Assemble reads and annotate the genome.
    • Use the antiSMASH software pipeline to identify and annotate secondary metabolite biosynthetic gene clusters (smBGCs). Analyze the number, type (e.g., PKS, NRPS), and novelty of clusters [35].
  • Step 2: OSMAC Cultivation and Metabolite Profiling.
    • Cultivate the strain under several different conditions (varying media, salinity, co-cultures) to elicit diverse metabolite production (OSMAC approach) [35].
    • Extract metabolites from each fermentation and analyze by LC-HR-MS/MS.
    • Process metabolomic data to link expressed metabolites (from Step 2) to predicted BGCs (from Step 1), creating a strain-specific "chemical-genomic" map.
  • Step 3: Target Selection for NP-PRESS.
    • Prioritize strains that show a high density of novel or "silent" smBGCs in their genome and produce a complex metabolome with unknown features under OSMAC conditions. Such a strain is the optimal input for the NP-PRESS dereplication protocol (3.1).

3.3. Protocol 3: Isolation and Characterization of Prioritized Compounds

  • Step 1: Targeted Isolation.
    • Scale up fermentation of the prioritized strain under the condition that induced the target NP-PRESS feature(s).
    • Use guided fractionation (e.g., HPLC) based on the exact m/z and RT of the high-priority feature(s) identified by NP-PRESS to isolate the pure compound.
  • Step 2: Structural Elucidation.
    • Acquire high-resolution mass spectrometry (HR-MS) data to determine molecular formula.
    • Perform comprehensive 1D and 2D NMR experiments (e.g., ¹H, ¹³C, COSY, HSQC, HMBC) to determine planar structure.
    • Determine absolute configuration via methods such as electronic circular dichroism (ECD) calculations or chemical derivatization [33].
  • Step 3: Bioactivity Assessment.
    • Test the pure compound in relevant bioassays (e.g., antimicrobial, cytotoxic). Determine minimum inhibitory concentrations (MIC) using standard microdilution methods [35].
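For the microdilution readout in Step 3, the MIC is conventionally the lowest tested concentration that shows no visible growth. The sketch below applies that definition to a two-fold dilution series. It assumes the readout is a simple growth/no-growth call per well, and it reports the lowest concentration at which that well and all higher concentrations are clear. The function name and the example plate are hypothetical.

```python
def mic(wells):
    """Return the MIC from {concentration: grew?}, or None if no well is inhibited.

    Walks from the highest concentration downward and stops at the first
    well that shows growth, so an isolated clear well below a turbid one
    is not mistaken for the MIC.
    """
    mic_value = None
    for conc, grew in sorted(wells.items(), reverse=True):
        if grew:
            break
        mic_value = conc
    return mic_value

# Hypothetical two-fold series in µg/mL: True means visible growth.
plate = {64: False, 32: False, 16: False, 8: False, 4: True, 2: True}
print(mic(plate))  # prints 8
```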

Visualization of Workflows and Strategies

Workflow: Crude Microbial Extract → LC-HR-MS/MS Analysis (DDA Mode) → Raw MS1 & MS2 Data → STAGE 1: FUNEL Filter (remove non-NP MS1 features) → Refined Feature List (NP Candidates) → STAGE 2: simRank Analysis (MS2 Networking & Novelty Scoring) → Molecular Network with Novelty Score → High-Priority Target(s) for Isolation.

Diagram 1: The NP-PRESS Two-Stage Dereplication Pipeline [1] [4]

Genomics Workflow: Strain Isolation (Mangrove Sediment) → Genome Sequencing & Assembly → BGC Prediction & Analysis (antiSMASH) → Assessment of Biosynthetic Potential → (genomic data) → High-Priority Strain Selection.

Metabolomics Workflow: OSMAC Cultivation (Varied Conditions) → LC-HR-MS/MS Metabolite Profiling → NP-PRESS Dereplication (Protocol 3.1) → Assessment of Metabolite Expression → (metabolomic data) → High-Priority Strain Selection.

Both workflows converge on High-Priority Strain Selection → Targeted Isolation & Novel NP Discovery.

Diagram 2: Integrated Strategy for Strain Prioritization and Discovery

Table 2: Key Research Reagents and Solutions for Mangrove Streptomyces NP Mining

| Item/Category | Function/Application | Example/Note |
|---|---|---|
| Selective Isolation Media | Favors growth of actinomycetes from complex mangrove sediment. | Actinomycetes Isolation Agar (AIA), ISP media supplemented with nalidixic acid and cycloheximide [35]. |
| OSMAC Elicitors | Activate silent biosynthetic gene clusters by varying cultivation parameters. | Different carbon/nitrogen sources, salts, enzyme inhibitors, or co-culture with other microbes [35]. |
| LC-HR-MS/MS System | High-resolution metabolomic profiling for dereplication and compound detection. | Systems such as UPLC coupled to Q-TOF or Orbitrap mass spectrometers are standard [35] [36]. |
| Genome Mining Software | In silico prediction of secondary metabolite potential from genome sequence. | antiSMASH: primary tool for BGC identification and analysis [35]. |
| Dereplication Platforms | Computational analysis of MS data for rapid compound identification. | GNPS (Global Natural Products Social Molecular Networking): for molecular networking and library matching [36]. NP-PRESS: for specialized two-stage MS feature filtering [1]. |
| Chromatography Resins | Fractionation and purification of target metabolites from crude extract. | Solid-phase extraction (SPE) cartridges and preparative HPLC columns (C18, silica gel). |
| NMR Solvents | Solubilizing purified compounds for structural elucidation. | Deuterated solvents (e.g., DMSO-d6, CDCl3, CD3OD). |

Optimizing NP-PRESS: Practical Solutions for Common Challenges and Data Pitfalls

Abstract

This application note details the critical parameter tuning of the FUNEL and simRank algorithms within the NP-PRESS (Natural Products Prioritization and Refinement Strategy) pipeline. NP-PRESS is a two-stage mass spectrometry (MS) feature dereplication strategy designed to uncover novel natural products (NPs) by removing overwhelming irrelevant features from microbial metabolomes, particularly those originating from biotic processes [1] [4]. The core innovation lies in the stepwise application of FUNEL for MS1-level feature refinement and simRank for MS2-level spectral prioritization [4]. Precise calibration of these algorithms is paramount, as it governs the essential trade-off between sensitivity (discovering true novel NPs) and specificity (rejecting known or irrelevant compounds). This document provides a structured framework, experimental protocols, and practical guidelines for researchers to optimize these parameters, thereby maximizing the efficacy of novel bioactive compound discovery in projects such as the study of Streptomyces albus J1074 and the anaerobic bacterium Wukongibacter baidiensis M2B1 [1].

The discovery of novel natural products (NPs) from microbial sources is a cornerstone of pharmaceutical development. However, a major bottleneck is the sheer complexity of metabolomic data, where signals from novel, often low-abundance NPs are obscured by a vast excess of features from culture media, cellular degradation products, and known metabolites [1] [4]. Traditional dereplication methods struggle to differentiate true NP signals from this biotic interference, leading to costly and fruitless isolation efforts [4].

The NP-PRESS pipeline addresses this by implementing a rigorous two-stage filtering strategy [4]:

  • Stage 1 - FUNEL (MS1 Feature Refinement): This algorithm operates on the MS1 level to perform an initial, coarse filtering. It leverages principles of funnel optimization to strategically reduce the dimensionality of the metabolomic dataset [37]. By applying constraints on physicochemical parameters (e.g., retention time windows, mass defect, isotopic patterns), FUNEL removes a large portion of irrelevant abiotic and biotic background features, creating a refined subset of candidate ions for further MS/MS analysis.
  • Stage 2 - simRank (MS2 Spectral Prioritization): This algorithm operates on the MS2 fragmentation data. It employs network-based similarity ranking to compare the tandem mass spectra of candidate features against spectral libraries of known compounds [38]. Unlike binary matching, simRank assesses spectral similarity within a network context, effectively clustering analogs and highlighting outliers that may represent novel chemical scaffolds [4].

The sequential application of FUNEL and simRank creates a powerful gating mechanism. The performance of the entire NP-PRESS pipeline is critically dependent on the parameter settings for each stage, which directly control the balance between sensitivity and specificity.

Parameter Tuning Framework: Sensitivity vs. Specificity

Optimal performance of NP-PRESS is achieved not by maximizing either sensitivity or specificity in isolation, but by tuning parameters to find an optimal balance suitable for the research goal. The following table summarizes the key tunable parameters for each algorithm and their effect on the discovery workflow.

Table 1: Critical Parameters for FUNEL and simRank Algorithms in NP-PRESS

| Algorithm | Core Parameter | Effect on SENSITIVITY | Effect on SPECIFICITY | Recommended Tuning Strategy |
|---|---|---|---|---|
| FUNEL (MS1) | Mass Tolerance Window | Increases: Wider windows retain more true NPs with slight m/z deviations. | Decreases: Wider windows admit more unrelated interfering features. | Start with instrument accuracy (e.g., ±5 ppm). Widen slightly for complex samples or unknown adducts. |
| FUNEL (MS1) | Retention Time Tolerance | Increases: Liberal RT windows accommodate shifts from matrix effects. | Decreases: Liberal RT windows increase the chance of co-eluting interference. | Define based on chromatographic reproducibility (e.g., ±0.1 min). Tighten for high-resolution separations. |
| FUNEL (MS1) | Blank Subtraction Threshold | Increases: Lower fold-change thresholds retain more features, including true NPs that also appear at low levels in blanks. | Decreases: Lower thresholds let more background and media features through. | Use fold-change (e.g., ≥10x intensity in sample vs. blank) and visually inspect EICs for key features. |
| simRank (MS2) | Spectral Similarity Score Cutoff | Increases: Lower score thresholds retain more spectra, including weak matches to knowns. | Decreases: Lower thresholds populate networks with false connections, diluting novel clusters. | Set initial cutoff at 0.7 (cosine score). Adjust based on library quality; increase for cleaner networks. |
| simRank (MS2) | Minimum Matched Fragment Ions | Increases: Lower minimum count retains spectra with poor fragmentation. | Decreases: Lower count increases false-positive spectral matches. | Require ≥4-6 matched fragment ions for high-confidence dereplication. |
| simRank (MS2) | Maximum Cluster Size | Increases: Larger clusters group more related analogs, capturing diversity. | Decreases: Very large clusters can become noisy, obscuring novel scaffold outliers. | Monitor cluster distribution; break apart clusters exceeding 20-30 nodes for manual review. |
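The blank subtraction row can be made concrete with a small sketch of a fold-change filter. It assumes that FUNEL's blank subtraction can be approximated by a sample-to-blank intensity ratio, and it shows how the number of retained features falls as the fold-change threshold rises. The function name and intensities are hypothetical.

```python
def passes_blank_filter(sample_intensity, blank_intensity, min_fold=10.0):
    """Keep a feature if it is at least min_fold times more intense than in the blank."""
    if blank_intensity <= 0:
        # Absent from the blank: keep whenever it is detected in the sample.
        return sample_intensity > 0
    return sample_intensity / blank_intensity >= min_fold

# Hypothetical (sample, blank) intensity pairs for four features.
features = [(1e6, 0.0), (5e5, 4e4), (2e5, 5e4), (1e5, 9e4)]

for fold in (3.0, 10.0, 30.0):
    kept = sum(passes_blank_filter(s, b, fold) for s, b in features)
    print(f"min_fold={fold:>4}: {kept} of {len(features)} features retained")
```

With these numbers the permissive threshold (3x) retains three features, the suggested 10x retains two, and a stringent 30x retains only the feature that is absent from the blank. That is the sensitivity-specificity trade-off in miniature.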

Detailed Application Protocols

Protocol A: Initial Setup and Data Acquisition for NP-PRESS

This protocol outlines the prerequisite steps for generating high-quality data suitable for FUNEL and simRank analysis [4] [11].

  • Sample Preparation & LC-MS/MS Acquisition:

    • Prepare microbial extracts and appropriate solvent blank controls in triplicate.
    • Utilize Ultra-High Performance Liquid Chromatography (UHPLC) coupled to a high-resolution mass spectrometer (e.g., Q-TOF, Orbitrap) [11].
    • Acquire data in data-dependent acquisition (DDA) mode to obtain MS2 spectra for prioritized ions [11]. For comprehensive coverage, data-independent acquisition (DIA/SWATH) can be employed in parallel [11].
    • Critical Step: Maintain consistent chromatographic and ionization conditions across all samples and blanks to ensure robust comparative analysis.
  • Data Pre-processing:

    • Convert raw data to an open format (e.g., mzML) using tools like MSConvert [11].
    • Process files through feature detection and alignment software (e.g., MZmine, MS-DIAL) [11].
    • Generate a consolidated feature table containing accurate m/z, retention time, intensity across samples, and associated MS2 spectra for downstream analysis.

Protocol B: Calibrating FUNEL Parameters with Control Samples

This protocol uses characterized samples to establish baseline FUNEL parameters before analyzing novel strains.

  • Use a Control Strain: Analyze a well-studied model organism (e.g., a Streptomyces species known to produce a specific metabolite family).
  • Iterative Parameter Testing:
    • Run FUNEL on the control dataset with default parameters.
    • Assess output: Does the refined feature list contain the known target metabolites (check sensitivity)? Does it exclude most features present in the solvent blank (check specificity)?
    • Adjust parameters from Table 1 sequentially. For example, if a known metabolite is missing, slightly widen the mass tolerance. If too many blank features remain, increase the blank subtraction threshold.
  • Validation: The tuned set of FUNEL parameters is validated when they successfully prioritize the known metabolites of the control strain while removing >90% of the total initial features.
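The two checks in Protocol B (are the known metabolites retained, and how much of the feature list was removed) can be scored automatically on each tuning iteration. The sketch below assumes features are tracked by hashable IDs and uses the >90% removal criterion from the validation step. The function names and the example numbers are hypothetical.

```python
def evaluate_filter(initial_ids, retained_ids, known_target_ids):
    """Return (recall of known targets, fraction of initial features removed)."""
    recall = len(known_target_ids & retained_ids) / len(known_target_ids)
    removed_fraction = 1 - len(retained_ids) / len(initial_ids)
    return recall, removed_fraction

def is_validated(recall, removed_fraction, min_removed=0.90):
    """Validated when every known target survives and >90% of features are removed."""
    return recall == 1.0 and removed_fraction > min_removed

# Hypothetical control run: 1000 initial features, 80 retained, 3 known targets.
initial = set(range(1000))
retained = set(range(80))
known = {3, 17, 42}

recall, removed = evaluate_filter(initial, retained, known)
print(f"recall={recall:.2f}, removed={removed:.0%}, validated={is_validated(recall, removed)}")
```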

Protocol C: Optimizing simRank for Novel Cluster Detection

This protocol focuses on tuning simRank to highlight spectral novelty after FUNEL pre-filtering.

  • Build a Custom Spectral Library: Curate a project-specific library from public databases (e.g., GNPS) and in-house standards relevant to the studied taxa [11].
  • Run simRank with Varied Cutoffs:
    • Execute the simRank algorithm on the FUNEL-refined MS2 data using the custom library.
    • Perform runs with spectral similarity cutoffs of 0.6, 0.7, and 0.8.
  • Analyze Molecular Networks: Visualize the resulting molecular networks (e.g., in Cytoscape).
    • A low cutoff (0.6) will produce a large, connected network. Identify major clusters of known compounds.
    • A high cutoff (0.8) will produce sparse networks. Identify small, disconnected nodes or clusters that represent high-priority novel candidates.
  • Determine the Optimal Balance: Select the cutoff value that yields a manageable number of high-priority novel clusters (e.g., 5-20) for subsequent isolation, while still providing meaningful analog connections for known compound families.
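The cutoff sweep in Protocol C can be previewed before opening a network viewer by counting connected components at each cutoff. The sketch below is not simRank itself: it assumes pairwise similarity scores are already available and links two spectra when their score meets the cutoff, using a small union-find. The function name, node labels, and scores are hypothetical.

```python
def components(nodes, similarities, cutoff):
    """Group nodes into connected components using edges with score >= cutoff."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the root, compressing the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), score in similarities.items():
        if score >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

# Hypothetical pairwise cosine scores between five MS2 spectra.
nodes = ["a", "b", "c", "d", "e"]
sims = {("a", "b"): 0.85, ("b", "c"): 0.72, ("c", "d"): 0.65}

for cutoff in (0.6, 0.7, 0.8):
    groups = components(nodes, sims, cutoff)
    singletons = sum(len(g) == 1 for g in groups)
    print(f"cutoff={cutoff}: {len(groups)} clusters, {singletons} singletons")
```

Here the network fragments from two clusters at 0.6 to four at 0.8, mirroring the progression from a large connected network to sparse, disconnected nodes described above.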

The Scientist's Toolkit for NP-PRESS Implementation

Table 2: Essential Research Reagents and Solutions

Item Specification / Recommended Product Function in NP-PRESS Workflow
UHPLC-HRMS System Q-TOF or Orbitrap mass spectrometer with nanoflow or conventional UHPLC. Generates high-resolution MS1 and MS2 data essential for accurate feature detection and spectral matching [11].
Chromatography Column Reversed-phase C18 column (e.g., 2.1 x 150 mm, 1.8 μm). Provides the compound separation necessary for resolving complex metabolomes and obtaining pure MS2 spectra [11].
Data Processing Software MZmine, MS-DIAL, or similar open-source platforms. Performs feature detection, alignment, blank subtraction, and exports data in formats compatible with FUNEL/simRank [11].
Molecular Networking Platform Global Natural Products Social Molecular Networking (GNPS). Provides the computational environment and public spectral libraries to execute and visualize simRank-based molecular networks [11].
Chemical Standards Authentic standards of expected metabolite classes. Serves as positive controls for validating LC-MS performance and tuning FUNEL parameters (Protocol B).

The NP-PRESS strategy, powered by the sequential application of FUNEL and simRank, represents a significant advance in metabolomic dereplication by systematically removing biotic interference [1] [4]. Its success is intrinsically linked to the deliberate tuning of algorithm parameters, a process that governs the critical sensitivity-specificity equilibrium. The frameworks and protocols provided here offer a practical roadmap for researchers to calibrate the NP-PRESS pipeline for their specific systems. By methodically applying Protocol B to establish robust FUNEL filters and Protocol C to optimize simRank for novelty detection, drug discovery professionals can significantly enhance their probability of isolating previously undiscovered natural products with potent biological activities, as demonstrated by the discovery of the baidienmycins [4].

Diagram: NP-PRESS Two-Stage Dereplication Workflow. Raw LC-MS/MS data (complex metabolome) enter Stage 1 (FUNEL, MS1 feature refinement), which is governed by critical parameter tuning (mass tolerance, RT window, blank subtraction) and yields a refined feature list of reduced complexity. That list passes to Stage 2 (simRank, MS2 spectral prioritization), which draws on a spectral library of known compounds. The resulting molecular network and cluster analysis deliver the high-priority novel NP candidates.

Diagram: Parameter Tuning: Sensitivity-Specificity Trade-off. Wider FUNEL tolerances and a low simRank similarity cutoff push toward high sensitivity (find more NPs), with the consequence of more novel clusters plus more noise. Stringent FUNEL subtraction and a high simRank similarity cutoff push toward high specificity (reduce false leads), with the consequence of fewer, purer clusters and a risk of NP loss. Optimal NP-PRESS performance lies in balancing the two.

The discovery of novel Natural Products (NPs) remains a cornerstone of pharmaceutical development, particularly in the urgent fight against antimicrobial resistance (AMR) and cancer [39]. However, researchers face a dual challenge of biological and analytical complexity. Biologically, the most promising NPs are often produced under specific conditions: by extremophiles thriving in unique geological niches or from silent biosynthetic gene clusters (BGCs) that are not expressed in standard laboratory settings [40] [41]. Analytically, the mass spectrometry (MS) data from such experiments is immensely complex, filled with interfering signals from media, cellular debris, and primary metabolism that can obscure the target NPs [1].

This article presents integrated Application Notes and Protocols framed within the context of the NP-PRESS (Natural Product Prioritization and Evaluation by Sequential Scoring) research thesis [1]. We detail how the NP-PRESS two-stage MS dereplication strategy directly addresses data complexity, enabling researchers to confidently prioritize novel features from challenging samples like extremophile extracts or activated silent BGCs. The following sections provide a detailed workflow, from experimental design and sample preparation to data analysis and compound prioritization, equipping scientists with a robust framework for next-generation NP discovery.

The NP-PRESS Two-Stage Dereplication Strategy: Core Workflow

The NP-PRESS strategy is engineered to filter out overwhelming irrelevant MS features and highlight true, novel natural products [1]. It operates through two sequential computational stages applied to LC-MS/MS data.

Stage 1: FUNEL (Filtering Using Neutral Loss) Algorithm

  • Objective: To remove ubiquitous MS1 features originating from common biotic processes (e.g., lipids, peptides, media components).
  • Mechanism: FUNEL identifies and filters features based on characteristic neutral losses (e.g., -H₂O, -CO₂, -NH₃) and mass defects associated with known, non-interesting biochemical compound classes.
  • Outcome: A significant reduction (typically >50%) in the total feature set, drastically simplifying the dataset for subsequent analysis [1].
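As a rough illustration of the neutral-loss idea described in this section, the sketch below flags co-eluting MS1 features whose m/z values differ by the monoisotopic mass of H₂O, CO₂, or NH₃. It is not the published FUNEL implementation: the function name `flag_neutral_loss_partners`, the tolerances, and the toy feature list are assumptions made for the example, and the mass-defect component is left out.

```python
# Monoisotopic masses (Da) of the neutral losses named in the text.
NEUTRAL_LOSSES = {"H2O": 18.010565, "CO2": 43.989829, "NH3": 17.026549}


def flag_neutral_loss_partners(features, mz_tol=0.005, rt_tol=0.05):
    """Map feature id -> (parent id, loss name) for co-eluting neutral-loss pairs.

    features: iterable of (feature_id, mz, rt_min). Tolerances are hypothetical.
    """
    flagged = {}
    for fid, mz, rt in features:
        for pid, parent_mz, parent_rt in features:
            if pid == fid or abs(parent_rt - rt) > rt_tol:
                continue  # only compare distinct features that co-elute
            for name, mass in NEUTRAL_LOSSES.items():
                if abs((parent_mz - mz) - mass) <= mz_tol:
                    flagged[fid] = (pid, name)
    return flagged


# Toy feature list (hypothetical values).
features = [
    ("F1", 500.3000, 6.20),
    ("F2", 482.2894, 6.21),  # F1 minus H2O, co-eluting
    ("F3", 456.3102, 6.20),  # F1 minus CO2, co-eluting
    ("F4", 482.2894, 9.00),  # same m/z as F2 but elutes much later
]
flagged = flag_neutral_loss_partners(features)
print(flagged)
```

Requiring co-elution is a design choice of this sketch: it keeps an unrelated compound that merely shares a mass difference (F4 here) from being flagged.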

Stage 2: simRank Similarity Scoring Algorithm

  • Objective: To evaluate the novelty of the remaining MS2 spectra.
  • Mechanism: simRank compares the MS2 spectrum of each unknown feature against curated databases of known NP MS2 spectra (e.g., GNPS, NIST MS/MS libraries). It provides a similarity score.
  • Prioritization: Features with low simRank scores (indicating low similarity to known compounds) are flagged as high-priority candidates for novel NPs. This is complemented by evaluating MS1 characteristics like isotopic patterns and retention time.
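The prioritization logic of Stage 2 can be illustrated with a deliberately simple stand-in. The sketch below uses a plain binned cosine similarity, not the simRank scoring itself, to compare each feature's MS2 spectrum against a small library and then ranks features from least to most similar. All names (`bin_spectrum`, `cosine_similarity`, `best_library_score`), the bin width, and the spectra are hypothetical.

```python
import math


def bin_spectrum(peaks, bin_width=0.01):
    """Sum intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=0.01):
    """Cosine of the angle between two binned spectra (0 = no shared bins)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_library_score(query, library):
    """Highest similarity of the query spectrum to any library spectrum."""
    return max((cosine_similarity(query, ref) for ref in library.values()), default=0.0)


# Toy library and features (hypothetical spectra).
library = {"known_A": [(100.0, 50.0), (150.0, 100.0), (200.0, 25.0)]}
features = {
    "feat_1": [(100.0, 50.0), (150.0, 100.0), (200.0, 25.0)],  # matches known_A
    "feat_2": [(120.0, 80.0), (310.0, 100.0)],                 # shares no peaks
}
scores = {fid: best_library_score(peaks, library) for fid, peaks in features.items()}
priority = sorted(scores, key=scores.get)  # lowest similarity first
print(priority, scores)
```

Features at the front of `priority` are the ones with the least resemblance to the library and would be inspected first, alongside the MS1 checks (isotopic pattern, retention time) mentioned above.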

Table 1: NP-PRESS Performance Metrics in Proof-of-Concept Studies

Microbial Strain | Sample Type / Challenge | Key NP-PRESS Outcome | Identified Novel Compounds
Streptomyces albus J1074 [1] | Model actinobacterium; complex metabolite background. | Prioritized features from the silent sur BGC after elicitation. | New surugamide analogs [1].
Wukongibacter baidiensis M2B1 [1] | Unusual anaerobic extremophile; high interference. | Enabled discovery of a new compound family from a prioritized, unknown feature. | Baidienmycins (new depsipeptides with antimicrobial & anticancer activity) [1].

LC-MS/MS analysis of crude extract → MS1 feature detection (all LC peaks) → Stage 1: FUNEL filter (removes >50% biotic interference [1]) → filtered, NP-enriched feature set → MS2 spectral data acquisition → Stage 2: simRank scoring against a known NP spectral database → priority list of low-similarity features (novel spectra).

Diagram 1: The NP-PRESS Two-Stage MS Dereplication Workflow.

Application Note I: Bioprospecting Extremophile Microbiomes with NP-PRESS

3.1 Rationale and Hypothesis

Extreme environments (deep-sea vents, acid mine lakes, hypersaline basins) exert intense geochemical pressures that drive the evolution of unique microbial secondary metabolism [40]. The "Extremophile Hypothesis" posits that these conditions foster biochemical novelty, making extremophiles prime sources for new antibiotic and anticancer scaffolds [40]. However, their extracts are analytically challenging due to high salt content, unusual media, and potent primary metabolites that interfere with MS detection of rare NPs.

3.2 Protocol: From Sample to Prioritized Compound

  • Step 1: Sample Collection & Strain Isolation

    • Materials: Sterile samplers (Niskin bottles, corers); in situ fixation reagents (if needed); diverse culture media (R2A, marine agar, ATCC media for anaerobes) mimicking native physicochemical conditions (pH, salinity, temperature) [40].
    • Procedure: Collect environmental samples (sediment, water, biofilm) aseptically. Employ a cultivation strategy using a gradient of conditions (pH 2-10, salinity 1-30%, temperatures 4-80°C) to capture diversity. For the uncultivable majority, consider direct metagenomic DNA extraction [40].
  • Step 2: Small-Scale Fermentation & Elicitation

    • Materials: 24-well deep-well plates or 50mL tube fermenters; orbital shaker/incubator; elicitor libraries (e.g., NCI diversity set, natural product library) [41].
    • Procedure: Inoculate purified isolates in 1-5 mL of appropriate broth. Include elicitation conditions: add small-molecule elicitors (e.g., ivermectin or etoposide at 10-50 µM, both of which have shown efficacy [41]) or co-culture with other isolates [42]. Incubate under optimal growth conditions for 3-14 days.
  • Step 3: Metabolite Extraction for LC-MS

    • Materials: Resin (XAD-16 or HP20) for in situ adsorption; organic solvents (EtOAc, MeOH, CH₂Cl₂); centrifuge; speedvac concentrator.
    • Procedure: For broth cultures, add 5% (w/v) resin 24 h before harvest. Separate the resin by filtration and elute the metabolites with MeOH or acetone. Alternatively, perform liquid-liquid extraction of the whole broth with an equal volume of EtOAc. Concentrate the organic phase to dryness and reconstitute in 100 µL of DMSO or MeOH for MS analysis.
  • Step 4: LC-HRMS/MS Data Acquisition & NP-PRESS Analysis

    • Instrumentation: Reversed-phase UHPLC (C18 column) coupled to high-resolution tandem mass spectrometer (Q-TOF, Orbitrap).
    • MS Method: Data-Dependent Acquisition (DDA) mode. MS1 scan range m/z 150-2000 at resolution >35,000. Top 10-20 most intense ions per cycle selected for fragmentation (HCD or CID).
    • Data Processing: Convert raw data (.raw, .d) to .mzML format. Use MZmine3 or similar for feature detection (MS1). Apply the FUNEL algorithm (Stage 1) to filter features. Submit filtered feature list and associated MS2 spectra for simRank analysis (Stage 2) against public (GNPS) and in-house NP libraries. Prioritize features with low similarity scores and interesting isotopic patterns.

Table 2: Bioactive Compound Specialization in Extreme Environments [40]

Geological Niche | Dominant Stressors | Adaptive Strategy | Associated Bioactive Compound Classes
Deep-Sea Hydrothermal Vents | High pressure, temperature gradients, heavy metals. | Thermostable/protective molecules, metal chelators. | Potent antimicrobial peptides (e.g., Marthiapeptide A), anticancer polyketides [40].
Acid Mine Drainage Lakes | Extreme acidity (pH <3), toxic metal ions (As, Cu). | Metal efflux pumps, intracellular pH buffering. | Novel meroterpenoids, lactones with anti-inflammatory activity (e.g., Berkeleylactone A) [40] [42].
Hypersaline Lakes | High osmotic pressure, ionic stress. | "Salt-in" or compatible solute synthesis. | Bacterioruberin carotenoids, extremolytes, halocins (antimicrobial peptides) [40].

Diagram 2: Linking Geological Stress to NP Specialization in Extremophiles.

Application Note II: Activating and Dereplicating Silent Biosynthetic Clusters

4.1 The Silent Cluster Problem

Genomic sequencing reveals that prolific microbes harbor 5-10 times more Biosynthetic Gene Clusters (BGCs) than the number of compounds they are observed to produce under lab conditions [41]. These "silent" or "cryptic" clusters represent the greatest untapped reservoir of NP diversity. Activation is required, followed by efficient dereplication to identify novel products amidst newly expressed metabolites.

4.2 Protocol: Activation and Targeted Analysis

  • Step 1: Genetic Activation via Promoter Engineering

    • Method: CRISPR-Cas9-mediated promoter knock-in [41].
    • Procedure: Design sgRNAs targeting the region upstream of the target BGC. Clone a strong constitutive promoter (e.g., ermEp) into a CRISPR-Cas9 delivery plasmid. Transform the host strain (e.g., Streptomyces). Screen for successful recombinants via PCR. This method directly drives expression of the cluster [41].
  • Step 2: Chemogenetic Activation via HiTES (High-Throughput Elicitor Screening)

    • Method: Use of a reporter strain to screen for small molecule inducers [41].
    • Procedure: Fuse a fluorescent reporter (eGFP) to the native promoter of the silent BGC and integrate into the host chromosome. Grow the reporter strain in 96-well plates with a library of ~500 potential elicitors (e.g., approved drugs, natural products). Monitor fluorescence as a proxy for BGC activation. Identify hits (e.g., ivermectin) [41]. Apply the hit elicitor to the wild-type strain for metabolite production.
  • Step 3: Heterologous Expression via FAC (Fungal Artificial Chromosome)

    • Method: For genetically intractable fungi, especially extremophiles [42].
    • Procedure: Isolate genomic DNA and create a FAC library. Screen for clones carrying the target BGC. Transfer the entire BGC-FAC into a heterologous host like Aspergillus nidulans (FAC-AnHH) [42]. Culture the engineered host under standard conditions; the heterologous environment often activates the cluster.
  • Step 4: Metabolite Analysis with NP-PRESS

    • Post-Activation: Extract metabolites from activated strains (genetically modified, elicited, or heterologous host) and control strains (wild-type or empty vector).
    • NP-PRESS Application: Acquire LC-HRMS/MS data for all samples. Process data together. Use NP-PRESS to filter universal background. Critically, the simRank stage will now highlight features unique to the activated sample that also show low similarity to known compounds, directly pinpointing the novel products of the target BGC.
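The comparison between activated and control samples described in Step 4 can be sketched as a simple feature-matching filter. This is a minimal illustration, not part of the NP-PRESS code: the function name `features_unique_to_activated`, the ppm and retention-time tolerances, the fold-change threshold, and the toy feature lists are all assumptions.

```python
def features_unique_to_activated(activated, control, mz_ppm=10.0, rt_tol=0.2, min_fold=10.0):
    """Keep activated-sample features that are absent from, or far weaker in, the control.

    activated, control: lists of (mz, rt_min, intensity). Thresholds are hypothetical.
    """
    unique = []
    for mz, rt, intensity in activated:
        matched = [
            c_int
            for c_mz, c_rt, c_int in control
            if abs(c_mz - mz) / mz * 1e6 <= mz_ppm and abs(c_rt - rt) <= rt_tol
        ]
        control_max = max(matched, default=0.0)
        if control_max == 0.0 or intensity / control_max >= min_fold:
            unique.append((mz, rt, intensity))
    return unique


# Toy feature lists (hypothetical values).
activated = [
    (912.5301, 7.42, 4.0e6),  # no counterpart in the control
    (455.2902, 3.10, 2.0e6),  # present at similar intensity in the control
    (301.1410, 5.00, 9.0e5),  # present in the control, but 18-fold weaker there
]
control = [(455.2903, 3.12, 1.8e6), (301.1411, 5.05, 5.0e4)]

hits = features_unique_to_activated(activated, control)
print(hits)
```

Only the features returned here would then be passed to the similarity scoring step, so that low-similarity spectra are also ones that appear on activation.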

Table 3: Strategies for Activating Silent Biosynthetic Gene Clusters

Strategy | Core Principle | Key Advantage | Example Outcome
CRISPR-Cas9 Promoter Insertion [41] | Replace native promoter with a strong, constitutive one. | Precise, genetic; leads to consistent, high-level production. | Production of alteramides, FR-900098, and novel pigments in Streptomyces [41].
HiTES (High-Throughput Elicitor Screening) [41] | Screen small molecules for ability to induce a BGC reporter. | Uncovers natural ecological signals; no genetic modification needed. | Identified ivermectin/etoposide as elicitors of the sur cluster, yielding novel surugamides [41].
FAC Heterologous Expression [42] | Capture & express entire BGC in a tractable surrogate host. | Bypasses host regulatory networks; ideal for non-model/extremophile fungi. | Activated 10 BGCs from Penicillium spp., yielding 14 compounds including novel citreohybriddional [42].

The Scientist’s Toolkit: Essential Reagents and Materials

Table 4: Key Research Reagent Solutions for NP Discovery from Complex Sources

Reagent / Material | Function / Purpose | Application Context
HP-20 / XAD Resins | Hydrophobic adsorption resin for in situ capture of metabolites from fermentation broth. | Pre-concentrates NPs; removes salts & water-soluble interferents, which is critical for extremophile broths [40].
Elicitor Library (e.g., NCI Diversity Set) | A collection of structurally diverse small molecules used to probe BGC induction. | HiTES screening for silent BGC activation in reporter strains [41].
CRISPR-Cas9 Plasmid System for Actinomycetes | All-in-one vector for sgRNA expression and Cas9 protein production in Streptomyces. | Genetic activation of silent BGCs via promoter knock-in [41].
FAC (Fungal Artificial Chromosome) Vector | High-capacity cloning vector (100-300 kb) for capturing entire fungal BGCs. | Heterologous expression of cryptic BGCs from non-model or extremophile fungi [42].
Modified A. nidulans (FAC-AnHH) | Engineered fungal host strain optimized for FAC integration and secondary metabolism. | Heterologous production host for FACs, often activating silent clusters [42].
Curated In-House MS/MS Spectral Library | A local database of MS2 spectra from known NPs relevant to the research focus. | Crucial for accurate simRank scoring in NP-PRESS; improves novelty assessment vs. public DBs.

Concluding Protocol: Integrated Workflow for a Targeted Discovery Campaign

This protocol integrates the above strategies for a targeted campaign on an extremophile bacterium with a bioinformatically identified silent BGC.

  • Genome Mining & Selection: Sequence the extremophile isolate. Use antiSMASH to identify a silent, high-priority BGC (e.g., a novel NRPS or PKS cluster).
  • Activation Attempt (Parallel Tracks):
    • A. Genetic Activation: If tractable, design CRISPR-Cas9 to insert a strong promoter upstream of the target BGC [41].
    • B. Elicitation: Perform HiTES using a reporter strain or directly screen elicitors (including hits like ivermectin) on the wild type [41].
    • C. Heterologous Expression: If intractable, attempt to capture the BGC in a BAC/FAC and express in a surrogate host [42].
  • Fermentation & Extraction: Culture activated strains alongside controls. Use resin-based extraction for clean metabolite recovery.
  • LC-HRMS/MS Analysis & NP-PRESS Dereplication: Analyze all extracts. Process data through the NP-PRESS pipeline. FUNEL removes common interference; simRank identifies features unique to activated samples with low database similarity.
  • Priority Compound Isolation & Validation: Scale-up fermentation for the top 1-2 prioritized features. Isolate compounds using preparative HPLC. Elucidate structure using NMR. Confirm bioactivity through antimicrobial or cytotoxicity assays.

The discovery of novel natural products (NPs) through mass spectrometry is fundamentally hampered by the problem of irreproducibility. Variability in sample preparation and liquid chromatography-mass spectrometry (LC-MS) conditions generates inconsistent feature sets, making biological comparisons unreliable and obscuring the detection of rare, low-abundance metabolites. This irreproducibility stems from multiple sources: the inherent complexity of biological matrices, the chemical diversity of NPs, and the sensitivity of MS detection to subtle changes in experimental parameters.

The NP-PRESS (Natural Product Prioritization and Evaluation with a Two-Stage Strategy) research provides a critical framework for addressing this challenge [1]. This two-stage dereplication strategy is not merely an informatics solution but necessitates rigorous, reproducible upstream analytical chemistry to function correctly. Its first stage employs the FUNEL algorithm to filter out abiotic background and noise from MS1 data, while the second stage uses the simRank algorithm to differentiate true natural products from biotic interference (e.g., media components, degradation products) in MS2 data [1]. The efficacy of this prioritization is entirely dependent on the consistency of the feature lists input into the system. Therefore, establishing standardized, robust protocols for sample preparation and LC-MS analysis is the essential foundation upon which advanced dereplication strategies are built. This article details the best practices required to ensure reproducible data generation for NP discovery pipelines.

Systematic Sample Preparation Strategies for Complex Matrices

Selecting an appropriate sample preparation method is the first and most critical determinant of reproducibility. The choice must balance the desired level of sample cleanliness with the need to capture the broadest possible metabolome, including non-polar and polar secondary metabolites. The core principle is that cleaner samples drive better and more consistent assay performance [43].

Table 1: Comparison of Common LC-MS Sample Preparation Methods for Natural Product Workflows

Method | Key Principle | Best For | Advantages for Reproducibility | Limitations
Dilute-and-Shoot [43] | Minimal processing; sample dilution in MS-compatible solvent. | Relatively clean matrices (e.g., microbial culture supernatant, plant sap). | Low handling minimizes human error; very fast; high recovery of a wide analyte range. | High matrix effects; prone to ion suppression; not suitable for complex, protein-rich samples.
Protein Precipitation (PPT) [43] [44] | Denaturation and pelleting of proteins using organic solvent (e.g., methanol, acetonitrile). | Protein-rich samples (e.g., fermentation broths, cell lysates). | Simple, rapid, and effective at removing proteins; uses common lab reagents. | Limited selectivity; phospholipids and salts remain; can precipitate some metabolites of interest.
Liquid-Liquid Extraction (LLE) [43] [44] | Partitioning of analytes based on solubility in two immiscible solvents (aqueous vs. organic). | Extraction of non-polar to moderately polar compounds from aqueous matrices. | Excellent cleanup; effective removal of salts and polar matrix interferences; can concentrate analytes. | Labor-intensive; difficult to automate fully; emulsion formation can cause variability.
Solid-Phase Extraction (SPE) [43] [44] | Selective adsorption of analytes onto a functionalized sorbent, followed by washing and elution. | High-purity extraction and concentration of analytes from complex matrices; targeted or untargeted work. | High selectivity and cleanliness; reduces matrix effects significantly; compatible with automation for high reproducibility. | More complex protocol; requires method development (sorbent, solvent selection); can be costly.

For NP-PRESS, which aims to detect low-abundance features, methods that reduce matrix interference are paramount. Solid-Phase Extraction (SPE) is often the best choice for achieving reproducible, high-quality data from complex bacterial cultures [43]. The use of mixed-mode or selective sorbents can help fractionate samples, simplifying the chromatogram and reducing ion suppression, which in turn yields more consistent feature detection across replicates.

Automation is a key enabler of reproducibility. Automated liquid handlers can execute SPE, LLE, and PPT protocols with superior precision and consistency compared to manual pipetting, reducing human error and inter-operator variability [43] [44]. One documented implementation cut hands-on analyst time from 3 hours to 10 minutes while standardizing the process [43].

Detailed Protocol: Solid-Phase Extraction for Microbial Metabolites

This protocol is optimized for the extraction of a broad range of secondary metabolites from Streptomyces or similar bacterial culture filtrates, suitable for input into the NP-PRESS pipeline.

Materials:

  • Culture supernatant (acidified to pH ~3 with formic acid if targeting acidic compounds)
  • SPE cartridges or 96-well plates (e.g., mixed-mode reversed-phase/cation exchange, 30 mg sorbent)
  • Conditioning Solvent: Methanol
  • Equilibration Solvent: Water with 0.1% Formic Acid
  • Wash Solvent 1: Water with 0.1% Formic Acid
  • Wash Solvent 2: Methanol:Water 20:80 v/v
  • Elution Solvent: Methanol with 2% Ammonium Hydroxide, or acetonitrile/methanol with 0.1% formic acid for reversed-phase
  • Vacuum manifold or positive pressure processor
  • Collection tubes/plates

Procedure:

  • Conditioning: Pass 1 mL of Methanol through the sorbent bed. Do not allow the bed to dry completely.
  • Equilibration: Pass 1 mL of Equilibration Solvent (Water with 0.1% FA) through the bed.
  • Sample Loading: Load the acidified culture supernatant (e.g., 1-3 mL) onto the cartridge at a slow, dropwise rate (~1 mL/min).
  • Washing: Sequentially wash with 1 mL of Wash Solvent 1 and then 1 mL of Wash Solvent 2. Discard all flow-through.
  • Elution: Elute adsorbed metabolites into a clean collection tube with 1-2 mL of Elution Solvent. Apply solvent slowly and let it soak the bed for 30 seconds before applying full vacuum/pressure.
  • Reconstitution: Evaporate the eluate to dryness under a gentle stream of nitrogen or by centrifugal vacuum concentration. Reconstitute the dried extract in 100 µL of initial LC-MS mobile phase (e.g., 95% Water, 5% Acetonitrile, 0.1% Formic Acid), vortex thoroughly, and centrifuge before vialing.

NP-PRESS Context: This SPE protocol significantly reduces the salts, sugars, and primary metabolites that make up the "biotic interference" which stage two of NP-PRESS (simRank) is designed to filter computationally [1]. A clean extract improves chromatographic peak shape and MS/MS spectral quality, increasing confidence in both the FUNEL and simRank evaluations.

Standardization of LC-MS Conditions for Reproducible Feature Detection

Chromatographic and mass spectrometric parameters must be locked down to ensure the same features are detected in every run. This is non-negotiable for longitudinal studies or multi-batch analyses common in NP discovery.

Table 2: Key LC-MS Parameters Requiring Standardization for Reproducible Untargeted Analysis

System Component | Critical Parameters | Recommended Practice for Reproducibility | Impact on NP-PRESS
Liquid Chromatography | Column (make, chemistry, lot, age), Gradient profile, Flow rate, Column Temperature, Injection volume, Mobile Phase (brand, additives, pH). | Use the same column brand/chemistry; document lot numbers; utilize pre-set, validated gradient tables; prepare mobile phases in large, consistent batches. | Directly affects retention time (RT) stability, a critical metric for FUNEL's alignment and background subtraction [1].
Mass Spectrometry (MS1 Survey Scan) | Resolution, Scan Range, AGC Target, Maximum Injection Time, Polarity Switching Dwell Time. | Use consistent resolution settings (e.g., 60,000-120,000 at m/z 200); calibrate instrument daily; use automatic gain control (AGC) to maintain consistent ion populations. | Determines the mass accuracy and peak shape of precursor ions, which are essential for accurate molecular formula prediction and adduct deconvolution.
Tandem MS (MS/MS Data-Dependent Acquisition) | Isolation Width, Collision Energy (fixed, ramped, or stepped), AGC Target for MS2, Top N for fragmentation, Dynamic Exclusion. | Apply normalized collision energy (e.g., 20-35% for HCD); use a consistent dynamic exclusion window (e.g., 15 s). | Directly governs the quality and consistency of MS/MS spectra, which are the sole input for the simRank algorithm's structural similarity comparisons [1].
System Suitability & QC | Reference standard mixture, Pooled QC sample injection frequency. | Inject a standardized mixture of NPs at the beginning of the sequence; inject a pooled QC sample every 5-10 experimental samples to monitor system drift. | Allows for post-acquisition correction of minor RT or intensity drift, ensuring features are comparable across the entire batch analyzed by NP-PRESS.

A pooled Quality Control (QC) sample, created by combining a small aliquot of every experimental sample, is indispensable. It is injected repeatedly throughout the acquisition batch. Consistency in the total ion chromatogram and feature detection of the QC samples indicates a stable system. Significant drift necessitates investigation and potentially re-calibration or column re-equilibration before proceeding.
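A small helper can make the QC schedule explicit before the sequence is typed into the instrument software. The sketch below interleaves a pooled QC injection after every fifth sample and at the end of the batch. The function name `build_injection_sequence`, the labels, and the two lead-in QC injections at the start are assumptions for illustration, not requirements stated in this article.

```python
def build_injection_sequence(samples, qc_every=5, qc_label="PooledQC", lead_in_qcs=2):
    """Return an injection order with pooled QC runs at regular intervals.

    lead_in_qcs (an assumption of this sketch) places QC injections at the start
    of the batch; a final QC always closes the sequence.
    """
    sequence = [f"{qc_label}_{i + 1:02d}" for i in range(lead_in_qcs)]
    qc_count = lead_in_qcs
    for index, sample in enumerate(samples, start=1):
        sequence.append(sample)
        if index % qc_every == 0 or index == len(samples):
            qc_count += 1
            sequence.append(f"{qc_label}_{qc_count:02d}")
    return sequence


# Toy batch of 12 samples (hypothetical names).
samples = [f"S{i:02d}" for i in range(1, 13)]
sequence = build_injection_sequence(samples, qc_every=5)
print(sequence)
```

Generating the order programmatically also gives a record of which QC injection brackets which samples, which helps when drift has to be traced back later.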

Pooled QC sample → LC-MS system (stable conditions) → raw data batch (with QC injections) → monitor RT and intensity stability → either stable, or unstable/drift → corrective action (re-calibrate, re-equilibrate) → back to the LC-MS system.

Diagram: QC-Driven Workflow for LC-MS System Stability. A pooled QC sample, analyzed at intervals, provides feedback on system performance. Drift triggers corrective action before valuable experimental samples are compromised.

Data Handling and Processing for Reproducible Bioinformatics

Reproducibility extends into the digital domain. Consistent, documented data processing workflows are required to transform raw files into the feature lists used by NP-PRESS.

  • File Management & Naming: Use a consistent, informative naming convention (e.g., Project_Strain_Date_Replicate.ext). Store all raw data, methods, and audit trails from the LC-MS system.
  • Processing Software & Parameters: Use the same software version (e.g., Compound Discoverer, MZmine, XCMS) with a locked parameter set for peak picking, alignment, and gap filling. Key parameters include:
    • Peak Picking: Signal-to-Noise threshold, minimum peak intensity, peak width range.
    • Alignment: Maximum RT shift tolerance, mass tolerance.
    • Gap Filling: Intensity threshold, mass/RT tolerance.
  • Blank Subtraction: Systematically subtract features present in procedural blank injections from the experimental samples to remove background contaminants.
  • Feature Annotation Consistency: Use standardized databases and adduct rules for preliminary annotation. The output should be a consensus feature list with columns for m/z, RT, peak area/intensity across all samples, and putative annotations.

This processed, clean feature table is the optimal input for the NP-PRESS two-stage dereplication. The FUNEL algorithm can more effectively filter abiotic noise when the input data itself is free from technical artifacts generated by inconsistent sample prep or instrument drift [1].
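One lightweight way to enforce the "locked parameter set" mentioned above is to store the processing parameters as structured data and record a checksum with every processed batch. The sketch below does this with the Python standard library. The parameter names and values are placeholders chosen for illustration (they are not recommended settings or any tool's actual option names), and `parameter_fingerprint` and `assert_unchanged` are hypothetical helpers.

```python
import hashlib
import json

# Placeholder parameter set; names and values are illustrative assumptions only.
PROCESSING_PARAMS = {
    "software": {"name": "MZmine", "version": "record-the-exact-version-here"},
    "peak_picking": {"signal_to_noise": 3.0, "min_peak_intensity": 1.0e4,
                     "peak_width_min": [0.05, 1.0]},
    "alignment": {"max_rt_shift_min": 0.2, "mz_tolerance_ppm": 10.0},
    "gap_filling": {"intensity_threshold": 1.0e3, "mz_tolerance_ppm": 10.0,
                    "rt_tolerance_min": 0.2},
}


def parameter_fingerprint(params):
    """SHA-256 of a canonical JSON dump, independent of key order."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_unchanged(params, expected_fingerprint):
    """Raise if the parameters differ from the locked set."""
    actual = parameter_fingerprint(params)
    if actual != expected_fingerprint:
        raise ValueError(f"processing parameters changed: {actual} != {expected_fingerprint}")


locked = parameter_fingerprint(PROCESSING_PARAMS)
assert_unchanged(PROCESSING_PARAMS, locked)  # passes silently while nothing has changed
print(locked)
```

Storing the fingerprint next to each exported feature table makes it easy to confirm later that two batches were processed with identical settings.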

Validation and Reporting: The Final Guardrails

A reproducible workflow is a validated workflow. Implement these final guardrails:

  • SOP Documentation: Every step—from cell harvesting and extraction to LC-MS method and data processing—must be captured in a detailed Standard Operating Procedure (SOP).
  • Reagent Tracking: Maintain a log of critical reagents (columns, solvents, SPE sorbent lots) used for each batch of samples.
  • Full Metadata Capture: Experimental metadata (strain, growth conditions, harvest time, extraction yield, injection order) must be inextricably linked to the raw data file.
  • Performance Metrics: Report key reproducibility metrics, such as the percentage of features with a relative standard deviation (RSD) < 30% in the pooled QC samples, and the RT shift of internal standards across the batch.
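The QC metric named in the last item can be computed directly from the pooled QC columns of the feature table. The sketch below is a minimal example using the sample standard deviation. The function names, the dictionary layout of `qc_table`, and the peak areas are hypothetical.

```python
from statistics import mean, stdev


def percent_rsd(values):
    """Relative standard deviation (%) using the sample standard deviation."""
    avg = mean(values)
    return stdev(values) / avg * 100.0 if avg else float("inf")


def percent_features_below_rsd(qc_table, threshold=30.0):
    """Share (%) of features whose RSD across pooled QC injections is below threshold."""
    rsds = {fid: percent_rsd(areas) for fid, areas in qc_table.items()}
    passing = [fid for fid, rsd in rsds.items() if rsd < threshold]
    return len(passing) / len(rsds) * 100.0, rsds


# Toy peak areas from three pooled QC injections (hypothetical values).
qc_table = {
    "F1": [100.0, 110.0, 90.0],
    "F2": [100.0, 200.0, 300.0],
    "F3": [50.0, 50.0, 50.0],
    "F4": [10.0, 30.0, 20.0],
}
pct_passing, rsds = percent_features_below_rsd(qc_table)
print(pct_passing, rsds)
```

With only three QC injections per feature the estimate is coarse; in practice the RSD would be computed over all pooled QC injections in the batch.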

Sample (complex matrix) → standardized sample prep (SPE protocol) → stabilized LC separation → calibrated MS detection (MS1 & MS2) → parameter-locked data processing → reproducible feature table → NP-PRESS pipeline (Stage 1: FUNEL; Stage 2: simRank) → prioritized NP candidates. A pooled QC derived from the prepared samples enters the LC alongside them and feeds system monitoring, which reports back to both the LC and MS stages.

Diagram: Integrated Reproducible Workflow for NP-PRESS. From sample to discovery, each stage is controlled and monitored, with the pooled QC sample providing a feedback loop to ensure data quality before processing and algorithmic dereplication.

The Scientist's Toolkit: Essential Research Reagent Solutions

Table 3: Key Reagents and Materials for Reproducible NP Sample Preparation and LC-MS

Item | Function & Role in Reproducibility
Mixed-Mode SPE Sorbents (e.g., Reverse-Phase/Cation Exchange) | Provides selective, robust cleanup of complex culture broths; removing salts and primary metabolites reduces matrix effects and improves LC column lifetime, leading to more consistent retention times [43] [44].
LC-MS Grade Solvents (Water, Acetonitrile, Methanol) | Ultra-pure solvents minimize background chemical noise and ion suppression in the MS, ensuring consistent baseline and detection sensitivity across batches and vendors.
Volatile Mobile Phase Additives (Formic Acid, Ammonium Formate, Ammonium Hydroxide) | Provides consistent pH control for reproducible chromatographic separation (especially for ionic compounds) and efficient positive/negative electrospray ionization.
Stable, High-Purity LC Column (e.g., C18, 1.7-1.9 µm, 100-150 mm length) | The core component for separation; using the same brand, chemistry, and lot number is critical for replicating the exact retention time landscape of metabolites across studies.
System Suitability & QC Standard Mix | A cocktail of known natural products or metabolites covering a range of polarities. Injected at the start of each batch, it verifies column performance, MS sensitivity, and mass accuracy are within specified limits.
Automated Liquid Handling System | Automates pipetting, SPE, and plate transfers, drastically reducing human error and variability in sample prep, especially for 96-well formats, enhancing throughput and reproducibility [43] [44].

Reproducibility in NP discovery is not a single step but a holistic discipline encompassing wet-lab bench work, instrumental analysis, and data science. By implementing standardized, robust protocols for sample preparation—prioritizing selective cleanup like SPE—and rigidly controlling LC-MS conditions, researchers generate the high-fidelity, consistent data required by advanced dereplication frameworks like NP-PRESS. This integrated approach, from controlled extraction to validated data processing, transforms LC-MS from a variable screening tool into a reliable engine for the reproducible discovery of novel natural products. The ultimate goal is a seamless pipeline where technical variability is minimized, allowing biological and chemical novelty to be revealed with confidence.

The discovery of novel natural products (NPs) with pharmaceutical potential is fundamentally hampered by the complexity of biological metabolomes. In traditional workflows, the majority of chromatographic and spectroscopic effort is spent on the re-isolation of known compounds or the pursuit of analytical artifacts, a process that is both time-consuming and resource-intensive [45]. The core challenge lies in accurately interpreting mass spectrometry (MS) data to differentiate between signals representing genuine novel metabolites, those belonging to known but undocumented compounds (database gaps), and those arising from non-biological processes or analytical artifacts [1]. This distinction is critical for directing isolation efforts toward true novelty.

The NP-PRESS (Natural Products – Prioritization and Refinement by Enhanced Spectral Screening) strategy presents a targeted solution within this framework [1] [4]. It is a two-stage mass spectrometry feature dereplication pipeline designed to systematically remove irrelevant chemical features from complex metabolomic data, thereby refining the analysis to prioritize signals with a higher probability of representing novel NPs. This application note details the protocols for implementing NP-PRESS and contextualizes its performance against established dereplication methodologies, providing researchers with a structured approach to interpret MS results and confidently identify true novelty.

The NP-PRESS strategy is engineered to address the specific problem of biotic interference—signals originating from microbial-processed cellular degradation products and culture media components—which often overwhelm the true NP metabolome in bacterial extracts [4]. The pipeline employs two sequentially applied, novel algorithmic filters to process LC-MS/MS data.

Stage 1: FUNEL (FUll-NOise-ELimination) Algorithm. This initial stage operates on MS1 (precursor ion) data. FUNEL is designed to perform a comprehensive subtraction of irrelevant features by comparing the metabolomic profile of the target sample against a rigorously constructed control model. This model encapsulates the "baseline" chemical noise derived from abiotic and biotic processes not associated with specialized metabolite production [1] [4]. The output is a significantly refined metabolome dataset depleted of ubiquitous background interference.

Stage 2: simRank Algorithm. The refined feature list from FUNEL is then analyzed using the simRank algorithm at the MS2 (fragmentation) level. simRank calculates spectral similarity but incorporates a scoring system that prioritizes features with dissimilarity to known compounds in public spectral libraries (e.g., GNPS). Crucially, it also identifies and groups features with high similarity to each other but low similarity to known entries, effectively highlighting potential new compound families or analogs [1]. The final output is a shortlist of prioritized MS features that are both unique to the producing organism and structurally distinct from previously characterized metabolites.

Comparative Dereplication Strategies and Performance Metrics

While NP-PRESS offers a specialized, algorithm-driven approach, dereplication employs a spectrum of methodologies. The table below summarizes key strategies, their technological basis, and their primary strengths and limitations.

Table 1: Comparison of Dereplication Strategies for Natural Products

| Strategy / Protocol Name | Core Technology | Key Mechanism for Identifying Novelty | Typical Application Context | Key Strength | Major Limitation |
|---|---|---|---|---|---|
| NP-PRESS [1] [4] | LC-HRMS/MS, FUNEL (MS1) & simRank (MS2) algorithms | Stepwise removal of biotic/abiotic interference; prioritization of features dissimilar to known libraries. | Microbial extracts, especially extremophiles or complex backgrounds. | High specificity in removing non-NP signals; discovers low-abundance metabolites. | Requires carefully constructed control models; algorithm-dependent. |
| MS/MS Library Matching [45] | LC-ESI-MS/MS with in-house or public spectral libraries | Direct matching of precursor m/z, retention time, and fragmentation pattern against library entries. | Rapid screening of plant extracts for known bioactive phytochemicals. | Fast, high-confidence identification of known compounds. | Cannot identify novel compounds absent from the library; prone to false negatives from library gaps. |
| PLANTA Protocol [46] | NMR-HetCA, HPTLC, Chemometrics (e.g., STOCSY, SH-SCY) | Statistical correlation of spectral/chromatographic features with bioactivity; orthogonal data integration. | Pre-isolation identification of bioactive constituents in complex plant extracts. | Direct link to bioactivity; non-destructive; orthogonal validation. | Lower sensitivity than MS; requires larger sample amounts; complex data analysis. |
| Molecular Networking [47] | LC-MS/MS with spectral similarity networking (e.g., GNPS) | Visualization of related molecules as clusters; novelty inferred from cluster location relative to knowns. | Untargeted exploration of metabolite families in diverse samples. | Visualizes chemical families; good for analog discovery. | Relies on ionization efficiency; weak for structurally unique singletons. |

The performance of a dereplication strategy can be quantified. For example, the PLANTA protocol, when applied to an artificial mixture of 59 compounds, demonstrated a detection rate of 89.5% for active metabolites and a correct identification rate of 73.7% [46]. NP-PRESS has proven effective in real discovery campaigns, leading to the identification of new surugamide analogs from Streptomyces albus and the discovery of an entirely new family of depsipeptides, the baidienmycins, from Wukongibacter baidiensis, which exhibited potent antimicrobial and anticancer activities [1] [4].

Detailed Experimental Protocols

Protocol: NP-PRESS Pipeline for Microbial Extracts

Objective: To prioritize novel natural product features from a complex microbial extract by removing interfering signals from biotic and abiotic processes [4].

I. Sample Preparation & LC-MS/MS Data Acquisition

  • Culture & Extraction: Grow the target microorganism (e.g., Streptomyces albus J1074) and an appropriate control (e.g., medium inoculated with a non-producing strain, or sterile medium taken through the same process) under identical conditions [4]. Extract metabolites from cell pellets and supernatant using a standardized solvent system (e.g., 1:1:1 Methanol:Acetonitrile:Water).
  • LC-MS/MS Analysis:
    • Instrumentation: Use a high-resolution LC-MS/MS system (e.g., Q-TOF or Orbitrap).
    • Chromatography: Employ a reversed-phase C18 column (e.g., 2.1 x 100 mm, 1.7 µm). Use a binary gradient: (A) Water with 0.1% Formic Acid; (B) Acetonitrile with 0.1% Formic Acid. A typical gradient runs from 5% B to 95% B over 20-30 minutes.
    • Mass Spectrometry: Acquire data in data-dependent acquisition (DDA) mode. Collect MS1 spectra at a resolution > 35,000. Select the top 10-20 most intense ions per cycle for fragmentation (MS2) at a collision energy of, e.g., 25-35 eV on a Q-TOF, or a normalized collision energy (NCE, a unitless instrument setting) of 25-35 on an Orbitrap. Analyze both target and control samples in technical triplicate.

II. Data Processing with FUNEL Algorithm (Stage 1)

  • Feature Detection: Process raw MS files using software like MZmine 3 or MS-DIAL for peak picking, alignment, and gap filling. Generate a feature table containing m/z, retention time (RT), and intensity for all detected ions across all samples.
  • Control Model Application: Input the aligned feature table into the FUNEL algorithm. The algorithm performs a comparative analysis, statistically identifying and subtracting features that are ubiquitously present in the control model or show non-relevant fluctuation patterns [4].
  • Output: A refined feature table containing only features significantly enriched in the target microorganism sample relative to the constructed noise model.
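
The subtraction idea in Stage 1 can be illustrated with a deliberately simple stand-in. The sketch below is not the FUNEL algorithm, whose internals are not specified in this protocol. It assumes a feature table keyed by (m/z, RT) with replicate intensities, and keeps a feature only when its mean target intensity exceeds its mean control intensity by a fold-change threshold. The function name `subtract_control`, the `min_fold` and `floor` parameters, and all intensity values are hypothetical.

```python
from statistics import mean


def subtract_control(features, min_fold=10.0, floor=1.0):
    """Keep features enriched in the target relative to the control.

    features: dict mapping (mz, rt) -> {"target": [...], "control": [...]}
    min_fold: hypothetical enrichment threshold (an assumption, not a FUNEL setting)
    floor:    small constant added to both means to avoid division by zero
    """
    kept = {}
    for key, groups in features.items():
        target_mean = mean(groups["target"])
        control_mean = mean(groups["control"])
        fold = (target_mean + floor) / (control_mean + floor)
        if fold >= min_fold:
            kept[key] = fold
    return kept


# Made-up triplicate intensities for three aligned features.
feature_table = {
    (912.62, 14.2): {"target": [8.1e5, 7.9e5, 8.4e5], "control": [0.0, 0.0, 0.0]},
    (203.05, 1.1): {"target": [2.0e6, 2.1e6, 1.9e6], "control": [1.8e6, 2.2e6, 2.0e6]},
    (455.29, 9.7): {"target": [3.0e4, 2.8e4, 3.2e4], "control": [2.0e3, 3.0e3, 2.5e3]},
}

refined = subtract_control(feature_table)
print(sorted(refined))
```

With these made-up values, the feature at m/z 203.05 is dropped because it is equally abundant in the control, while the two target-enriched features are retained.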

III. Data Processing with simRank Algorithm (Stage 2)

  • Spectral Library Query: Export the MS2 spectra associated with the refined feature list from Step II. Search these against public spectral libraries (e.g., GNPS) using standard cosine similarity scoring.
  • simRank Analysis: Input the library search results into the simRank algorithm. simRank re-evaluates matches, applying a penalty for high similarity to known compounds and a premium for features that are novel or form unique clusters with other novel features [1].
  • Prioritization: The algorithm outputs a ranked list of features. Top-priority features are those with high-quality MS2 spectra that are both absent from the control model (FUNEL-filtered) and show low similarity or novel clustering patterns compared to public databases.
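
To make the Stage 2 idea concrete, the following sketch ranks query spectra by how dissimilar they are to a reference library. It is an illustration only and not the published simRank scoring: it assumes a plain cosine similarity on binned fragment m/z values and defines a hypothetical novelty score as one minus the best library match. The helper names (`binned`, `cosine`, `novelty_rank`) and the example spectra are invented for this sketch.

```python
import math


def binned(spectrum, decimals=1):
    """Collapse (mz, intensity) peaks into fixed-width m/z bins."""
    bins = {}
    for mz, intensity in spectrum:
        key = round(mz, decimals)
        bins[key] = bins.get(key, 0.0) + intensity
    return bins


def cosine(spec_a, spec_b):
    """Cosine similarity between two binned MS2 spectra (0 = no shared peaks)."""
    a, b = binned(spec_a), binned(spec_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def novelty_rank(query_spectra, library):
    """Rank features by 1 - (best cosine match to any library spectrum)."""
    scored = []
    for name, spec in query_spectra.items():
        best = max((cosine(spec, ref) for ref in library.values()), default=0.0)
        scored.append((name, 1.0 - best))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical spectra: feat_1 matches the library entry, feat_2 shares no peaks.
library = {"known_A": [(100.1, 50.0), (200.2, 100.0)]}
queries = {
    "feat_1": [(100.1, 50.0), (200.2, 100.0)],
    "feat_2": [(150.3, 80.0), (310.4, 100.0)],
}

ranked = novelty_rank(queries, library)
for name, score in ranked:
    print(name, round(score, 3))
```

Real spectral comparison usually also accounts for precursor mass differences and peak-matching tolerances; the fixed binning here is chosen only to keep the example short.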

IV. Downstream Validation

  • Targeted Isolation: Use the m/z and RT of the top-ranked features to guide semi-preparative or preparative HPLC isolation.
  • Structural Elucidation: Subject purified compounds to NMR spectroscopy (1H, 13C, 2D experiments) for definitive structural characterization and confirmation of novelty.

Protocol: Orthogonal Validation via the PLANTA Workflow

Objective: To provide orthogonal, activity-guided identification of bioactive compounds in a complex extract prior to isolation, complementing MS-based dereplication [46].

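  • Fractionation & Bioassay: Fractionate the crude extract (e.g., using flash chromatography or fast centrifugal partition chromatography, FCPC). Test all fractions for the desired biological activity (e.g., DPPH radical scavenging).
  • 1H NMR Profiling: Acquire quantitative 1H NMR spectra for every fraction tested, active and inactive alike, because the covariance in the next step is computed across the full set of fractions.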
  • NMR-HetCA Analysis: Apply the NMR HeteroCovariance Approach. This statistical method calculates the covariance between the 1H NMR spectral data and the bioactivity data across fractions, generating a "pseudospectrum" that highlights NMR signals correlating with activity [46].
  • STOCSY-Guided Depletion: For correlated signals, use Statistical Total Correlation Spectroscopy (STOCSY) to identify all NMR peaks from the same molecule. Subtract ("deplete") peaks matching known, non-active compounds to generate a simplified, database-compatible spectrum of the active constituent [46].
  • HPTLC & Cross-Correlation: Analyze fractions by High-Performance Thin-Layer Chromatography (HPTLC). Use the SH-SCY (Statistical Heterocovariance–SpectroChromatographY) method to correlate active NMR peaks with specific bands on the HPTLC plate, providing a visual and chromatographic anchor for the target compound [46].
  • Dereplication: Query the depleted NMR spectrum against NMR databases. The combined evidence from bioactivity correlation, chromatographic behavior (RT/HPTLC Rf), and spectral matching provides high-confidence identification or flags a compound for novel isolation.
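
The covariance step at the heart of NMR-HetCA can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the PLANTA implementation: it assumes the 1H NMR spectra have already been binned into a matrix (rows are fractions, columns are chemical-shift bins) and computes the sample covariance of each bin with the bioactivity values. The function name `hetca_pseudospectrum` and all numbers are hypothetical.

```python
from statistics import mean


def hetca_pseudospectrum(nmr_matrix, activity):
    """Covariance of each NMR bin with bioactivity across fractions.

    nmr_matrix: list of rows, one per fraction; each row holds binned intensities
    activity:   one bioactivity value per fraction (same order as the rows)
    """
    n = len(activity)
    act_mean = mean(activity)
    pseudo = []
    for j in range(len(nmr_matrix[0])):
        column = [row[j] for row in nmr_matrix]
        col_mean = mean(column)
        cov = sum(
            (x - col_mean) * (a - act_mean) for x, a in zip(column, activity)
        ) / (n - 1)
        pseudo.append(cov)
    return pseudo


# Four hypothetical fractions, three chemical-shift bins.
activity = [0.0, 10.0, 50.0, 90.0]  # e.g., % inhibition per fraction
nmr_matrix = [
    [0.1, 1.0, 3.0],
    [0.5, 1.0, 2.0],
    [2.0, 1.0, 1.0],
    [4.0, 1.0, 0.0],
]

pseudo = hetca_pseudospectrum(nmr_matrix, activity)
print([round(value, 2) for value in pseudo])
```

In this toy data set the first bin rises with activity and gives a positive covariance, the constant second bin gives zero, and the third bin, which falls as activity rises, gives a negative value. Positive signals in the pseudospectrum are the ones carried forward to STOCSY.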

The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 2: Key Reagent Solutions for NP Dereplication Workflows

| Item | Function in Dereplication | Example/Note |
|---|---|---|
| High-Resolution Mass Spectrometer | Provides accurate mass measurement for elemental composition determination and generates MS/MS spectra for structural comparison. | Q-TOF or Orbitrap instruments are standard [1] [45]. |
| UPLC/HPLC System with C18 Column | Separates complex mixtures to reduce ion suppression and provide retention time as a key identification parameter. | Sub-2 µm particle columns are recommended for UPLC [45]. |
| MS-Grade Solvents & Additives | Ensure reproducibility, minimize background noise, and promote consistent ionization in ESI-MS. | Methanol, Acetonitrile, Water, Formic Acid [45]. |
| Reference Standard Compound Libraries | Essential for building in-house MS/MS libraries and validating identifications. | Commercial suppliers (e.g., Sigma-Aldrich) or purified isolates [45]. |
| NMR Solvents & Tubes | Required for structural elucidation of isolated compounds to confirm novelty. | Deuterated solvents (e.g., methanol-d4), 5 mm NMR tubes [46]. |
| Specialized Culture Media | Used to cultivate target microorganisms, often influencing secondary metabolite production. | ISP-2, R2A, or other defined media for bacteria [1] [4]. |
| Data Analysis Software | For processing raw MS data, running specialized algorithms, and visualizing molecular networks. | MZmine, MS-DIAL, GNPS, proprietary scripts for FUNEL/simRank [1] [4]. |
| Chromatography Resins for Fractionation | For activity-guided isolation after dereplication pinpoints a target. | Solid-phase extraction (SPE) cartridges, Sephadex LH-20, preparative C18 silica [46] [47]. |

Logical Workflow and Decision Pathways

The following diagrams outline the logical flow of the NP-PRESS strategy and the general decision process for interpreting MS features.

Start: Complex Microbial Extract
  → LC-HRMS/MS Data Acquisition (Target & Control)
  → MS1 Feature Table (m/z, RT, Intensity)
  → Stage 1: FUNEL Algorithm (MS1 Noise Subtraction)
  → Refined Feature Table (Biotic Noise Removed)
  → Stage 2: simRank Algorithm (MS2 Novelty Prioritization)
  → Prioritized List of Novel NP Candidates
  → Targeted Isolation & Structural Elucidation

Diagram 1: The NP-PRESS Two-Stage Dereplication Workflow.

  • D1. Detected MS feature in sample? No → likely analytical or biotic artifact. Yes → D2.
  • D2. Also present in control (FUNEL)? Yes → likely analytical or biotic artifact. No → D3.
  • D3. MS/MS match in spectral library? Strong match → known natural product (dereplication complete). Weak/no match → D4.
  • D4. High simRank score for novelty/new cluster? Yes → high-priority novel NP candidate. No → database gap (known compound, unrecorded spectrum).

Diagram 2: Decision Logic for Interpreting MS Features.
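
The decision logic in Diagram 2 is small enough to express as a single function. The sketch below simply transcribes the four decision points; the function name `classify_feature`, its argument names, and the string labels are hypothetical choices made for illustration, and in practice each input would come from the FUNEL output, the library search, and the simRank ranking.

```python
def classify_feature(detected, in_control, library_match, novel_by_simrank):
    """Walk one MS feature through the four decisions of Diagram 2.

    detected:         True if the feature is detected in the sample (D1)
    in_control:       True if it is also present in the control model (D2)
    library_match:    "strong", "weak", or "none" from the spectral library (D3)
    novel_by_simrank: True if simRank scores it as novel or in a new cluster (D4)
    """
    if not detected or in_control:
        return "artifact"
    if library_match == "strong":
        return "known NP"
    if novel_by_simrank:
        return "novel NP candidate"
    return "database gap"


examples = [
    classify_feature(True, True, "none", True),
    classify_feature(True, False, "strong", False),
    classify_feature(True, False, "weak", True),
    classify_feature(True, False, "none", False),
]
print(examples)
```

Note that a feature found in the control is classed as an artifact before its MS/MS spectrum is ever consulted, which mirrors the order of the two NP-PRESS stages.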

Validating Success: How NP-PRESS Compares and Integrates with Modern Dereplication Tools

Natural products (NPs) remain an invaluable source for pharmaceutical development, yet their discovery is hampered by the overwhelming chemical complexity of biological extracts [48]. Traditional dereplication methods aim to rapidly identify known compounds to avoid redundant isolation work. However, these methods often fail to distinguish true secondary metabolites (the target bioactive NPs) from the vast background of interfering features originating from biotic processes (e.g., media components, cellular degradation products) and abiotic sources [4]. This limitation leads to wasted resources on fruitless isolations and obscures potentially novel, low-abundance metabolites.

This analysis benchmarks the two-stage MS feature dereplication strategy NP-PRESS against established methods [1]. NP-PRESS systematically removes irrelevant chemical features before annotation, refining the metabolome to highlight genuine NPs. This document presents application notes and experimental protocols to validate and implement the strategy, and shows how it prioritizes novel bioactive compounds for drug discovery pipelines.

Quantitative Benchmarking: NP-PRESS vs. Established Dereplication Platforms

The performance of any dereplication strategy is measured by its accuracy, efficiency, and success in guiding the discovery of novel chemical entities. The following table compares the NP-PRESS pipeline with three established approaches: spectral library matching (exemplified by in-house libraries), molecular networking via GNPS, and DEREPLICATOR, an algorithm specific to peptidic natural products (PNPs) [49] [45] [11].

Table 1: Comparative Performance of Dereplication Strategies

| Performance Metric | Traditional Library Matching [45] | GNPS Molecular Networking [11] | DEREPLICATOR (for PNPs) [49] | NP-PRESS Pipeline [4] [1] |
|---|---|---|---|---|
| Core Strategy | Match MS/MS spectra to curated reference libraries. | Cluster MS/MS spectra by similarity to visualize chemical families. | Database search matching MS/MS spectra against in silico fragmentation of known peptide structures. | Two-stage MS1/MS2 analysis to subtract irrelevant features before annotation. |
| Key Advantage | High confidence for known compounds present in the library. | Powerful for identifying structural analogs and new members of known families. | High-throughput, accurate identification of peptide natural products (PNPs). | Dramatically reduces dataset complexity, exposing low-abundance and novel NPs. |
| Primary Limitation | Limited to compounds in the library; fails for novel scaffolds. | Requires sufficient spectral similarity; can miss unique or highly modified compounds. | Specialized for PNPs (NRPs/RiPPs); not generalizable to all NP classes. | Requires paired experimental design (e.g., target culture vs. sterile-medium control). |
| Reported Output (Case Study) | Dereplication of 31 compounds from plant extracts [45]. | Annotation of 51 compounds from Sophora flavescens [11]. | Identification of hundreds of variant PNPs from GNPS datasets [49]. | Discovery of new surugamide analogs and the novel baidienmycin family [4]. |
| Novel Compound Discovery | Low (aimed at avoiding rediscovery). | Moderate (enables analog discovery). | High for PNPs (specifically designed for variants). | Very High (actively prioritizes unknown features post-subtraction). |

Analysis of Benchmarking Data: The comparison shows where NP-PRESS differs from the other strategies. Traditional methods excel at cataloging known compounds, whereas NP-PRESS is engineered for novelty discovery. Its two-stage filtering process, employing the FUNEL (MS1) and simRank (MS2) algorithms, removes up to 90% of the irrelevant features in complex microbial metabolomes, which are typically the main source of noise obscuring target NPs [4]. This pre-filtering step addresses a core weakness of methods like GNPS molecular networking, which process all acquired features regardless of origin. As a result, NP-PRESS identified new antibacterial and anticancer depsipeptides (baidienmycins) from an anaerobic bacterium, a task where conventional dereplication would likely have failed because of metabolic interference [1].

Detailed Experimental Protocols

Protocol A: Implementing the NP-PRESS Pipeline for Microbial Metabolomes

This protocol outlines the application of NP-PRESS for prioritizing novel natural products from microbial cultures [4] [1].

I. Experimental Design & Sample Preparation

  • Strains and Cultivation: Culture the target microorganism (e.g., Streptomyces albus J1074) and a non-producing control (e.g., sterile media or a genetically silenced mutant) under identical conditions.
  • Metabolite Extraction: At the target growth phase, quench metabolism rapidly (e.g., using cold methanol). Extract intracellular metabolites via sonication in a 2:2:1 (v/v/v) mixture of Methanol:Ethyl Acetate:Water. Extract extracellular metabolites from the culture broth by liquid-liquid partition with ethyl acetate.
  • Sample Cleanup: Pool intracellular and extracellular extracts. Dry under reduced pressure and reconstitute in LC-MS grade methanol for analysis.

II. LC-HRMS/MS Data Acquisition

  • Chromatography: Use a reversed-phase C18 column (e.g., 2.1 x 150 mm, 1.8 μm). Employ a binary gradient: mobile phase A (0.1% formic acid in water), B (0.1% formic acid in acetonitrile). Use a 20-30 minute linear gradient from 5% to 100% B.
  • Mass Spectrometry: Operate a high-resolution mass spectrometer (Q-TOF, Orbitrap) in positive and/or negative electrospray ionization mode.
    • MS1 Survey Scans: Acquire at a resolution > 35,000 (at m/z 200) across m/z 100-1500.
    • MS2 Data-Dependent Acquisition (DDA): Fragment the top 10-15 most intense ions per cycle using stepped collision energies (e.g., 20, 40, 60 eV on a Q-TOF, or a stepped normalized collision energy, NCE, of 20/40/60 on an Orbitrap; NCE is a unitless instrument setting).

III. Data Processing with NP-PRESS

  • Feature Detection & Alignment: Process raw files (test and control samples) with MZmine or similar software. Perform peak picking, alignment, and gap filling to create a feature table with m/z, retention time (RT), and intensity.
  • Stage 1: FUNEL Analysis (MS1 Level Subtraction):
    • Input the aligned feature table.
    • The FUNEL algorithm performs unsupervised feature elimination, statistically comparing feature intensities between the target and control groups.
    • Features that are not significantly more abundant in the target sample (e.g., FDR-adjusted p-value > 0.05) are tagged as "background" and subtracted. This removes media- and process-related interferents.
  • Stage 2: simRank Analysis (MS2 Level Prioritization):
    • Input the MS/MS spectra associated with the remaining features from FUNEL.
    • simRank calculates pairwise spectral similarities but down-weights common fragment ions that are ubiquitous in background features.
    • Features with unique or rare fragmentation patterns receive high simRank scores, prioritizing them for novel NP discovery.
  • Prioritization & Annotation: Rank the filtered features by their simRank score. Subject top-ranked features to molecular networking on GNPS and database searches (e.g., AntiMarin, NIST) for structural hypotheses.
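
The "p-value after FDR correction" criterion in Stage 1 can be illustrated with the Benjamini-Hochberg procedure. This is an assumed choice of FDR method; the protocol does not state which correction FUNEL applies. The sketch takes per-feature p-values that would come from a target-versus-control comparison (the values below are invented) and returns adjusted p-values, which are then compared with the 0.05 cutoff.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five features (target vs. control).
names = ["f1", "f2", "f3", "f4", "f5"]
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]

adjusted = benjamini_hochberg(raw_p)
kept = [name for name, q in zip(names, adjusted) if q <= 0.05]
background = [name for name, q in zip(names, adjusted) if q > 0.05]
print(kept, background)
```

In this toy example, features f3 and f4 have raw p-values below 0.05 but are tagged as background once the correction is applied, which is why the adjusted value rather than the raw value is the one to threshold.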

Protocol B: Traditional Dereplication via LC-MS/MS Library Construction and Matching

This protocol details the creation and use of an in-house spectral library for rapid dereplication of common phytochemicals, as validated in recent work [45].

I. Construction of an In-House Tandem MS Library

  • Standard Pooling Strategy: Select and pool authentic standards based on log P values and exact masses to minimize co-elution and isobaric interference. Prepare 2-3 pooled mixtures.
  • LC-MS/MS Analysis of Pools:
    • Chromatography: Use a C18 column with a gradient elution (e.g., 5-95% acetonitrile in water over 20 min).
    • MS Analysis: Acquire high-resolution MS1 and MS2 data in both positive and negative ionization modes. For MS2, acquire spectra at multiple collision energies (e.g., 10, 20, 30, 40 eV) to capture comprehensive fragmentation patterns.
  • Library Curation: For each standard, compile its molecular formula, exact mass (<5 ppm error), retention time, precursor ion ([M+H]⁺, [M+Na]⁺, etc.), and all associated MS/MS spectra into a database file compatible with your analysis software (e.g., .mgf format).

II. Dereplication of Unknown Extracts

  • Data Acquisition: Analyze the complex plant or microbial extract under the identical LC-MS/MS conditions used for library construction.
  • Database Search: Process the data using software (e.g., MS-DIAL, Compound Discoverer) capable of performing multidimensional matching. Key search parameters include:
    • Mass Tolerance: MS1 tolerance ≤ 5 ppm, MS2 tolerance ≤ 0.01 Da.
    • Retention Time Index: Use relative RT or a calibration curve for improved confidence.
    • Spectral Match Scoring: Use a composite score (e.g., dot product, reverse dot product, purity) with a threshold (e.g., > 0.7) for positive identification.
  • Validation: Where possible, confirm identities by spiking the authentic standard into the sample and monitoring co-elution and consistent spectral matching.
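
The MS1 mass-tolerance check used in the database search can be written out explicitly. The sketch below computes the mass error in parts per million and returns library entries within a 5 ppm window. It covers only the precursor-mass dimension; retention time and spectral scoring would be applied on top. The function names and the small library are hypothetical, and the [M+H]+ values are approximate monoisotopic masses included for illustration.

```python
def ppm_error(observed_mz, reference_mz):
    """Signed mass error of an observed m/z relative to a reference, in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def candidates(observed_mz, library, tol_ppm=5.0):
    """Names of library entries whose precursor m/z lies within tol_ppm."""
    return [
        name
        for name, ref_mz in library.items()
        if abs(ppm_error(observed_mz, ref_mz)) <= tol_ppm
    ]


# Approximate [M+H]+ values for three flavonoid standards (illustrative only).
in_house_library = {
    "quercetin": 303.0499,
    "kaempferol": 287.0550,
    "rutin": 611.1607,
}

hit = candidates(303.0511, in_house_library)
miss = candidates(303.0530, in_house_library)
print(hit, miss)
```

The first query is about 4 ppm from the quercetin entry and is accepted; the second is about 10 ppm away and is rejected. Isomers share the same precursor mass, so a mass match alone is never sufficient for identification.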

Visualizing Strategies and Workflows

The following diagrams illustrate the conceptual and procedural differences between the dereplication strategies.

Diagram 1: Two-Stage NP-PRESS Workflow for Novel NP Prioritization [4] [1]

Start: Complex Microbial Extract
  → STAGE 1: MS1-Level Feature Subtraction (FUNEL)
      Raw MS1 Feature Table (Test & Control Samples)
      → Statistical Comparison & Unsupervised Elimination
      → Refined Feature Table (Background-Subtracted)
  → STAGE 2: MS2-Level Novelty Scoring (simRank)
      MS2 Spectra of Refined Features
      → Spectral Similarity Analysis with Common Fragment Down-weighting
      → Prioritized Ranked List of Novel NP Candidates
  → End: Targeted Isolation of Novel NPs

Diagram 2: Comparative Dereplication Strategy Decision Flow

Start: Research Goal Definition
  • Q1. Is the target compound class known or suspected? Yes (Known Compounds) → Use Traditional Library Matching. No (Novel Scaffold Hunt) → Q2.
  • Q2. Is the sample extremely complex with high background interference? Yes (e.g., crude extract) → Use NP-PRESS Pipeline. No → Q3.
  • Q3. Is the focus on Peptidic Natural Products (PNPs)? Yes → Use DEREPLICATOR Algorithm. No (General NP Classes) → Use GNPS Molecular Networking.

The Scientist's Toolkit: Essential Research Reagent Solutions

Successful dereplication requires precise materials and analytical resources. The following table lists key solutions for implementing the protocols described.

Table 2: Essential Reagents and Materials for Dereplication Experiments

| Item Name | Specification / Example | Primary Function in Dereplication |
|---|---|---|
| LC-MS Grade Solvents | Methanol, Acetonitrile, Water (with 0.1% Formic Acid) | Mobile phase components for high-sensitivity, reproducible chromatographic separation [45] [11]. |
| Authentic Chemical Standards | >97% purity compounds (e.g., flavonoids, alkaloids) | Essential for constructing and validating in-house tandem MS libraries for confident peak annotation [45]. |
| Solid Phase Extraction (SPE) Cartridges | C18, HLB, or Mixed-Mode phases | Pre-analytical cleanup of crude extracts to reduce matrix effects and instrument fouling. |
| High-Resolution Mass Spectrometer | Q-TOF, Orbitrap, or FT-ICR MS systems | Provides accurate mass measurement (<5 ppm error) for formula prediction and high-quality MS/MS spectra for structural matching [48] [45]. |
| Chromatography Column | Reversed-Phase C18 (e.g., 2.1 x 150 mm, 1.8 μm) | Core component for separating complex mixtures of natural products prior to mass spectrometric detection [11]. |
| Data Analysis Software | MZmine, MS-DIAL, Compound Discoverer, GNPS | Platforms for feature detection, alignment, spectral deconvolution, and database searching, enabling the processing of large metabolomics datasets [49] [11]. |
| Public Spectral Databases | GNPS, MassBank, NIST, AntiMarin | Critical reference repositories for spectral matching and molecular networking to annotate known compounds and their analogs [49] [45]. |

The discovery of novel natural products (NPs) is persistently hampered by the high rate of compound rediscovery, making dereplication, the early identification of known compounds, a critical first step [50]. This document details a proof-of-concept validation of the two-stage MS feature dereplication strategy NP-PRESS [1]. NP-PRESS addresses a key bottleneck: the inability of conventional methods to differentiate signals from novel secondary metabolites from the overwhelming background of biotically processed features, such as microbial degradation products and media components [1].

This application note provides detailed protocols for transitioning from a prioritized MS feature to an isolated compound with validated bioactivity, using the discovery of the baidienmycins from Wukongibacter baidiensis M2B1 as a case study [1]. The workflow integrates advanced mass spectrometry, innovative bioinformatics, and classical natural product chemistry to accelerate the targeted discovery of novel bioactive entities.

Core Strategy: The NP-PRESS Two-Stage Dereplication Workflow

The NP-PRESS strategy is built upon two novel algorithms, FUNEL and simRank, which operate on MS1 and MS2 data, respectively [1].

  • Stage 1 (MS1 - FUNEL Algorithm): This stage focuses on filtering out uninteresting "self-signals" derived from the organism's primary metabolism and the culture medium. By analyzing isotopic patterns and chromatographic profiles, FUNEL identifies and removes ubiquitous background features, significantly reducing dataset complexity and highlighting potential secondary metabolites.
  • Stage 2 (MS2 - simRank Algorithm): The remaining features are subjected to MS/MS analysis. The simRank algorithm calculates spectral similarities but is designed to prioritize molecular families that are dissimilar to known compounds in public databases (e.g., GNPS). This positive selection for spectral novelty directly guides the isolation efforts toward structurally unique compounds [1].

The synergy of these stages ensures that only the most promising, novel MS features are carried forward for isolation.

Raw MS1 Data (Complex Extract)
  → FUNEL Algorithm (Primary Metabolite/Media Filter)
  → [removes biotic self-signals] Filtered Feature List (Potential NPs)
  → MS/MS Acquisition on Filtered Features
  → simRank Algorithm (Novelty Prioritization), which also receives a similarity query from Reference NP Databases (e.g., GNPS, AntiMarin)
  → [highlights spectral novelty] Prioritized Novel Features (e.g., Baidienmycins)
  → Targeted Isolation Workflow

Diagram 1: NP-PRESS Two-Stage Dereplication and Prioritization Workflow

Proof-of-Concept: Discovery of the Baidienmycins

The NP-PRESS strategy was validated on the anaerobic bacterium Wukongibacter baidiensis M2B1. Application of the two-stage filter successfully prioritized a cluster of MS features that were distinct from known compounds in databases. Targeted isolation guided by these features led to the discovery of a new family of depsipeptides, named baidienmycins [1].

Table 1: Key Data for Baidienmycins Discovery via NP-PRESS

| Parameter | Details & Quantitative Results |
|---|---|
| Source Organism | Wukongibacter baidiensis M2B1 (anaerobic bacterium) [1] |
| NP-PRESS Outcome | Prioritization of a novel molecular family distinct from database entries [1]. |
| Discovered Compounds | Baidienmycins (a new family of depsipeptides) [1]. |
| Key Bioactivity | Potent antimicrobial and anticancer activities reported [1]. |
| Validation Method | Comparison of MS2 spectra and features against public GNPS libraries confirmed novelty [1]. |
| Role of Dereplication | NP-PRESS eliminated >90% of interfering MS1 features from media/primary metabolism, allowing focus on novel secondary metabolites [1]. |

Detailed Experimental Protocols

Protocol 1: LC-HRMS/MS Data Acquisition for NP-PRESS Input

Objective: Generate high-quality MS1 and MS2 data from microbial crude extracts suitable for analysis with the FUNEL and simRank algorithms.

Materials:

  • Crude ethyl acetate or butanol extract of fermented microbial broth [51].
  • LC-MS grade solvents (MeOH, ACN, H₂O with 0.1% formic acid).
  • UHPLC system coupled to a high-resolution tandem mass spectrometer (e.g., Q-TOF or Orbitrap).

Procedure:

  • Sample Preparation: Reconstitute dried crude extract in methanol to a concentration of 1 mg/mL. Centrifuge at 14,000 × g for 10 minutes to remove particulates.
  • Chromatography: Inject 5-10 µL onto a reversed-phase C18 column (e.g., 2.1 x 100 mm, 1.7 µm). Use a gradient from 5% to 100% acetonitrile in water (both with 0.1% formic acid) over 20-30 minutes.
  • Mass Spectrometry:
    • MS1 Survey Scan: Acquire data in positive and/or negative ion mode with a resolution >35,000 (at m/z 200) and a scan range of m/z 100-1500.
    • Data-Dependent MS2 (dd-MS2): Select the top 10-15 most intense ions from each MS1 scan for fragmentation. Use stepped collision energies (e.g., 20, 40, 60 eV on a Q-TOF, or a stepped normalized collision energy, NCE, of 20/40/60 on an Orbitrap; NCE is a unitless instrument setting) to generate rich fragmentation spectra [13]. Employ dynamic exclusion to spread acquisition across less abundant features.
  • Data Export: Convert raw data files (.d, .raw) to open formats (.mzML, .mzXML) using tools like ProteoWizard MSConvert for downstream processing [13].
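
The effect of dynamic exclusion on data-dependent acquisition can be seen in a small simulation. The sketch below is a conceptual model, not instrument control code: it assumes each MS1 scan is a list of (m/z, intensity) peaks, picks the `top_n` most intense ions, and then ignores any ion selected within the previous `exclusion_s` seconds. All names, tolerances, and peak values are hypothetical.

```python
def select_precursors(ms1_scans, top_n=2, exclusion_s=15.0, mz_tol=0.01):
    """Simulate top-N precursor selection with dynamic exclusion.

    ms1_scans: list of (time_s, [(mz, intensity), ...]) in acquisition order
    Returns a list of (time_s, [selected m/z values]).
    """
    excluded = []  # (mz, release_time) pairs for recently fragmented ions
    selections = []
    for time_s, peaks in ms1_scans:
        # Drop exclusion entries whose window has expired.
        excluded = [(mz, until) for mz, until in excluded if until > time_s]
        picked = []
        for mz, _intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
            if len(picked) == top_n:
                break
            if any(abs(mz - ex_mz) <= mz_tol for ex_mz, _ in excluded):
                continue
            picked.append(mz)
            excluded.append((mz, time_s + exclusion_s))
        selections.append((time_s, picked))
    return selections


# The same three ions appear in every scan; only the timing changes.
peaks = [(500.30, 1e6), (301.14, 5e5), (912.62, 1e4)]
scans = [(0.0, peaks), (1.0, peaks), (20.0, peaks)]

selections = select_precursors(scans)
for time_s, picked in selections:
    print(time_s, picked)
```

In the second scan the two dominant ions are still excluded, so the low-abundance ion at m/z 912.62 is fragmented instead. Without exclusion it would never be selected, which is the reason the procedure recommends this setting for low-abundance metabolites.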

Protocol 2: Applying NP-PRESS for Feature Prioritization

Objective: Process acquired LC-MS/MS data through the NP-PRESS pipeline to identify novel compound features.

Software/Tools: NP-PRESS algorithms (FUNEL, simRank), MZmine2 or similar for initial feature finding, GNPS for comparative analysis [1] [13].

Procedure:

  • Feature Detection (Pre-processing): Import .mzML files into MZmine2. Perform chromatogram building, mass detection, deconvolution, isotopic peak grouping, and alignment to create a feature table with associated MS2 spectra.
  • Stage 1 - FUNEL Filter: Submit the MS1 feature table (containing m/z, RT, and intensity) to the FUNEL algorithm. FUNEL will analyze the dataset against control samples (e.g., sterile media) and flag features with patterns consistent with primary metabolites or media components for removal [1].
  • Stage 2 - simRank Analysis: Export the MS2 spectra associated with the remaining FUNEL-filtered features. Submit these spectra to the simRank algorithm, which will compare them against a curated spectral library (e.g., from GNPS). simRank will rank features based on their dissimilarity to known compounds, generating a prioritized list of novel molecular families [1].
  • Visualization & Decision: Load the simRank results into molecular networking software (e.g., Cytoscape via GNPS). Clusters of nodes (spectra) with low connectivity to known compound clusters represent prioritized targets for isolation.
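
The FUNEL and simRank implementations themselves are not reproduced here. As a rough sketch of the two-stage idea only, the code below substitutes a simple fold-change filter against a media control for Stage 1 and a binned cosine similarity against a reference library for Stage 2. Every name, threshold, and data value is hypothetical and stands in for the real algorithms.

```python
import math


def stage1_filter(features, min_fold=10.0):
    """Keep features far more intense in the culture than in the media control."""
    kept = []
    for f in features:
        control = max(f["control_intensity"], 1.0)  # floor avoids division by zero
        if f["sample_intensity"] / control >= min_fold:
            kept.append(f)
    return kept


def _binned(spectrum, bin_width):
    """Sum intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    bins = {}
    for mz, intensity in spectrum:
        key = round(mz / bin_width)
        bins[key] = bins.get(key, 0.0) + intensity
    return bins


def cosine(spec_a, spec_b, bin_width=0.02):
    """Cosine similarity between two binned MS2 spectra (0 = unrelated, 1 = identical)."""
    a, b = _binned(spec_a, bin_width), _binned(spec_b, bin_width)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def stage2_rank(features, library):
    """Rank features by 1 - best library similarity, most dissimilar first."""
    scored = []
    for f in features:
        best = max((cosine(f["ms2"], ref) for ref in library), default=0.0)
        scored.append((1.0 - best, f["id"]))
    return sorted(scored, reverse=True)


# Made-up example data.
library = [[(100.0, 1.0), (200.0, 0.5)]]
features = [
    {"id": "F1", "sample_intensity": 1e6, "control_intensity": 1e3,
     "ms2": [(100.0, 1.0), (200.0, 0.5)]},   # matches a library spectrum
    {"id": "F2", "sample_intensity": 1e6, "control_intensity": 9e5,
     "ms2": [(120.0, 1.0)]},                 # also present in sterile media
    {"id": "F3", "sample_intensity": 5e5, "control_intensity": 0.0,
     "ms2": [(150.0, 1.0), (250.0, 1.0)]},   # no library match
]
ranked = stage2_rank(stage1_filter(features), library)
```

Here F2 is dropped in Stage 1 as a media-like feature, and F3 is ranked above F1 because none of its fragments match the library.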

Protocol 3: Targeted Isolation Guided by Prioritized Features

Objective: Isolate milligram quantities of the target compound from a scaled-up microbial culture.

Materials: Fermentation broth (e.g., 10-50 L), solid-phase extraction (SPE) cartridges (C18, DIAION HP20), preparative HPLC system, Sephadex LH-20, analytical TLC/HPLC supplies [51].

Procedure:

  • Scale-Up & Extraction: Culture the source microorganism under optimized conditions in large-scale fermentation. Separate biomass from broth by centrifugation. Extract the broth with a suitable organic solvent (e.g., ethyl acetate, 3 x volume). Extract the biomass separately with acetone or methanol. Combine extracts based on TLC and analytical HPLC analysis against the target feature's UV and MS signature [51].
  • Fractionation: Subject the combined crude extract to vacuum liquid chromatography (VLC) or coarse SPE using a stepped solvent gradient (e.g., H₂O/MeOH/EtOAc). Monitor fractions by analytical HPLC-MS for the presence of the target ion ([M+H]⁺/[M-H]⁻).
  • Purification: Pool fractions containing the target. Further purify using a combination of techniques:
    • Gel Filtration: Use Sephadex LH-20 with isocratic methanol or methanol:dichloromethane (1:1) to desalt and perform initial separation by molecular size.
    • Preparative HPLC: Perform final purification on a preparative C18 column. Use a shallow gradient of acetonitrile in water (with 0.1% TFA or formic acid) to separate the target compound from closely eluting impurities. Collect peaks based on UV and evaporate to dryness.
  • Purity Assessment: Verify the purity (>95%) of the isolated compound using analytical HPLC with both diode array and mass spectrometric detection. The MS1 mass and retention time must match the original prioritized feature.
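
Checking that a purified compound matches the originally prioritized feature comes down to a mass-error and retention-time comparison. The sketch below shows one way to express that check; the 5 ppm and 0.2 min tolerances and the function names are assumptions for illustration and should be set according to the instrument and chromatographic method in use.

```python
PROTON = 1.007276  # mass of a proton in Da


def adduct_mz(neutral_mass, mode="positive"):
    """m/z of the singly charged [M+H]+ (positive) or [M-H]- (negative) ion."""
    return neutral_mass + PROTON if mode == "positive" else neutral_mass - PROTON


def ppm_error(observed_mz, reference_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def matches_feature(observed_mz, observed_rt, target_mz, target_rt,
                    ppm_tol=5.0, rt_tol=0.2):
    """True if the isolated compound agrees with the prioritized feature."""
    return (abs(ppm_error(observed_mz, target_mz)) <= ppm_tol
            and abs(observed_rt - target_rt) <= rt_tol)


# Hypothetical values: target feature at m/z 500.0000, RT 12.35 min.
ok = matches_feature(500.001, 12.30, 500.0, 12.35)    # about 2 ppm off
bad = matches_feature(500.010, 12.30, 500.0, 12.35)   # about 20 ppm off
```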

Protocol 4: Bioactivity Validation Assays

Objective: Confirm the biological activity of the isolated compound.

Materials: Isolated compound, 96-well microtiter plates, appropriate cell lines (e.g., HeLa, MCF-7 for cancer; Bacillus subtilis, Staphylococcus aureus for bacteria; Candida albicans for fungi), cell culture media, alamarBlue or MTT reagent, spectrophotometer/plate reader [51] [52] [53].

Procedure for Anticancer Activity (Cytotoxicity Assay):

  • Seed cancer cells (e.g., HeLa) in a 96-well plate at a density of 5,000 cells/well and incubate for 24 h.
  • Prepare serial dilutions of the test compound in DMSO and further dilute in cell culture medium (final DMSO <0.5%).
  • Treat cells with compound dilutions for 48-72 hours.
  • Add 10 µL of alamarBlue reagent per well and incubate for 2-4 hours.
  • Measure fluorescence (Ex 560 nm / Em 590 nm) or absorbance (570 nm). Calculate cell viability relative to DMSO-treated controls and determine the IC₅₀ value using nonlinear regression analysis [51] [52].
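
The viability calculation in the final step can be illustrated with a short sketch. Note that the IC₅₀ estimate below uses linear interpolation on log-concentration between the two bracketing points, which is a deliberate simplification swapped in for the nonlinear regression named in the protocol; a four-parameter logistic fit is the usual choice for reported values. Function names and the example readings are hypothetical.

```python
import math


def percent_viability(signal, blank, vehicle_control):
    """Viability relative to DMSO-treated control wells, after blank subtraction."""
    return 100.0 * (signal - blank) / (vehicle_control - blank)


def ic50_by_interpolation(concs, viabilities):
    """Estimate IC50 by interpolating on log10(concentration).

    Returns None if the curve never crosses 50% viability.
    """
    points = sorted(zip(concs, viabilities))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 50.0 >= v_hi and v_lo != v_hi:
            frac = (v_lo - 50.0) / (v_lo - v_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None


# Made-up dose-response data (concentrations in µM, viability in %).
concs = [0.1, 1.0, 10.0, 100.0]
viab = [95.0, 75.0, 25.0, 5.0]
estimate = ic50_by_interpolation(concs, viab)
```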

Procedure for Antimicrobial Activity (MIC Determination):

  • Prepare a suspension of the test microorganism (e.g., Bacillus subtilis) in Mueller-Hinton broth to a 0.5 McFarland standard.
  • In a 96-well plate, perform two-fold serial dilutions of the compound in broth.
  • Inoculate each well with the bacterial suspension, diluted so that the final inoculum is approximately 5 × 10⁵ CFU/mL. Include growth and sterility controls.
  • Incubate the plate at 37°C for 16-20 hours.
  • The Minimum Inhibitory Concentration (MIC) is the lowest concentration of compound that completely inhibits visible growth [51] [52].
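
The dilution series and the MIC read-out can be expressed in a few lines. This is a minimal sketch with hypothetical function names and made-up plate readings; it treats the MIC as the lowest concentration without visible growth for which every higher concentration is also clear.

```python
def twofold_dilutions(top_conc, n_wells):
    """Concentrations for a two-fold serial dilution, highest first."""
    return [top_conc / (2 ** i) for i in range(n_wells)]


def read_mic(concs, growth):
    """Return the MIC given per-well visible growth (True = growth).

    Walks from the highest concentration downward and stops at the first
    well that shows growth. Returns None if even the top well grew.
    """
    mic = None
    for conc, grew in sorted(zip(concs, growth), reverse=True):
        if grew:
            break
        mic = conc
    return mic


# Hypothetical plate row: 64 µg/mL down to 0.5 µg/mL.
concs = twofold_dilutions(64.0, 8)
growth = [False, False, False, False, True, True, True, True]
mic = read_mic(concs, growth)
```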

Prioritized MS Feature from NP-PRESS → Scale-Up Fermentation (10-50 L Culture) → Broth & Biomass Extraction (e.g., EtOAc, MeOH) → Fractionation (VLC, SPE) → Purification (Prep. HPLC, Sephadex) → Structure Elucidation (NMR, HRMS) and Bioactivity Assays (MIC, Cytotoxicity) → Validated Bioactive Natural Product

Diagram 2: Comprehensive Validation Workflow from Feature to Bioactivity

The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 2: Key Reagents and Materials for NP Dereplication and Isolation

Item | Function & Application in Workflow
Ethyl Acetate (EtOAc) | A standard medium-polarity solvent for liquid-liquid extraction of fermented broth, effective for extracting a wide range of secondary metabolites [51].
Solid-Phase Extraction (SPE) Cartridges (C18, HP20) | For rapid desalting and initial fractionation of crude extracts based on polarity, reducing complexity before HPLC [51].
Sephadex LH-20 | Gel filtration resin for size-based separation and removal of salts/pigments; used with organic solvents like methanol or CH₂Cl₂/MeOH [51].
Preparative C18 HPLC Column | The cornerstone of final purification. Allows high-resolution separation of compounds using gradients of water and acetonitrile/methanol [51].
Deuterated Solvents (CDCl₃, DMSO-d₆, CD₃OD) | Essential for nuclear magnetic resonance (NMR) spectroscopy, the primary technique for determining planar structure and stereochemistry of isolated compounds [50].
LC-MS Grade Solvents (MeOH, ACN, H₂O + 0.1% FA) | Essential for all LC-MS steps to prevent ion suppression, background noise, and column degradation, ensuring high-quality data for dereplication [13] [53].
AlamarBlue/MTT Reagent | Cell viability indicators used in cytotoxicity and antiproliferative bioassays. Metabolic reduction by living cells produces a measurable colorimetric/fluorometric change [51] [52].
Mueller-Hinton Broth (MHB) | A standardized, low-protein medium recommended by CLSI for determining Minimum Inhibitory Concentrations (MICs) of antimicrobial compounds [52].
Database Access (GNPS, DEREPLICATOR+) | Critical bioinformatics resources. GNPS allows molecular networking and spectral library matching [54] [55]. DEREPLICATOR+ enables dereplication against extensive databases of peptides, polyketides, and other NP classes [56].

The discovery of novel natural products (NPs) is persistently challenged by the high rate of compound rediscovery and the difficulty in detecting low-abundance or conditionally expressed metabolites [50]. While genomics reveals a vast reservoir of biosynthetic potential, with microbial genomes harboring dozens of uncharacterized Biosynthetic Gene Clusters (BGCs), the majority remain as "orphan" clusters with no identified molecular product [57]. Conversely, mass spectrometry (MS)-based metabolomics detects a plethora of compounds, but efficiently distinguishing novel NP signals from complex biological noise is a formidable task [1].

This application note addresses this critical gap by detailing a synergistic strategy that correlates output from the novel two-stage MS feature dereplication platform, NP-PRESS, with genomic BGC data [1]. NP-PRESS employs specialized algorithms (FUNEL and simRank) to aggressively filter MS1 and MS2 data, removing interfering features from biotic processes and media to prioritize ions most likely to represent novel NPs [1]. When these prioritized MS features are integrated with genomic evidence of BGC activation—such as from transcriptomics or proteomics—researchers gain a powerful, hypothesis-driven framework for discovery. This integrated approach, framed within a broader thesis on advanced dereplication strategies, streamlines the journey from genetic potential to novel chemical entity, dramatically accelerating lead identification for drug development [50].

Conceptual Framework: Bridging MS Features and Genetic Loci

The core synergy lies in forming a data-driven bridge between a confidently filtered mass spectrometric signal and a transcriptionally active genetic locus. The process transforms parallel data streams into a coherent discovery pipeline.

  • NP-PRESS Output: The primary output is a prioritized list of LC-MS features (characterized by m/z, retention time, and MS/MS fragmentation). These features have been computationally vetted to subtract common biotic interferences, thereby enriching for signals originating from specialized secondary metabolism [1].
  • BGC Data Input: This encompasses mined BGC sequences (e.g., from AntiSMASH), their transcriptional activity (RNA-Seq RPKM values), and proteomic expression data [57] [58]. The key is to identify BGCs that are "silent" in control conditions but activated under specific elicitation, such as co-culture [57].
  • Correlation Logic: The link is established through temporal or conditional congruence. The upregulation of a specific BGC (evidenced by transcriptomics) must coincide with the appearance of a unique NP-PRESS-prioritized MS feature in the metabolome. Subsequent isotopic labeling or detailed MS/MS analysis of the putative product can provide further confirmatory evidence linking the molecule to the cluster [57].
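
The conditional congruence described under Correlation Logic can be approximated by correlating each prioritized feature's intensity with each BGC's expression across the same samples. The sketch below uses a plain Pearson correlation for this; the 0.9 cutoff, the identifiers, and the data are illustrative assumptions, and a real study would need replicates and appropriate statistics before drawing conclusions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def link_features_to_bgcs(feature_intensity, bgc_expression, min_r=0.9):
    """Pair features and BGCs whose profiles rise and fall together."""
    links = []
    for feat, intensities in feature_intensity.items():
        for bgc, expression in bgc_expression.items():
            r = pearson(intensities, expression)
            if r >= min_r:
                links.append((feat, bgc, r))
    return sorted(links, key=lambda item: item[2], reverse=True)


# Sample order: monoculture t1, monoculture t2, co-culture t1, co-culture t2.
feature_intensity = {"F_812.4": [0.0, 0.0, 5e5, 9e5]}   # hypothetical feature
bgc_expression = {
    "bgc_07": [1.0, 2.0, 150.0, 300.0],   # silent in monoculture, induced in co-culture
    "bgc_12": [50.0, 55.0, 48.0, 52.0],   # constitutively expressed
}
links = link_features_to_bgcs(feature_intensity, bgc_expression)
```

Only the induced cluster is linked to the feature in this toy data set; the constitutively expressed cluster shows no positive correlation and is ignored.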

Table 1: Key Data Types for Correlation and Their Sources

Data Type | Description | Common Source/Method | Role in Correlation
Prioritized MS1 Feature | m/z, RT, intensity of a metabolite ion. | NP-PRESS processed LC-MS data [1]. | The target chemical entity requiring a genetic origin.
MS2 Fragmentation Spectrum | Molecular fingerprint from collision-induced dissociation. | LC-MS/MS analysis. | Used for structural similarity networking (e.g., via GNPS) and database dereplication [50].
BGC Sequence | Genomic locus encoding biosynthetic enzymes. | Genome mining (AntiSMASH, PRISM) [57] [58]. | Predicts chemical class (e.g., NRPS, PKS) and potential structure of the product.
Transcriptomic Read Counts (RPKM) | Quantitative gene expression levels. | RNA-Seq of producing vs. non-producing conditions [57]. | Evidences activation of the BGC coincident with metabolite detection.
Proteomic Data | Expression levels of biosynthetic enzymes. | LC-MS/MS proteomics. | Confirms translation of BGC genes into functional machinery.

MS-Based Metabolomics (NP-PRESS): Raw LC-MS Data → NP-PRESS Processing (FUNEL & simRank) → Prioritized NP Features (m/z, RT, MS/MS)
Genomics & Multi-Omics: Bacterial Genome → BGC Mining & Annotation → Candidate BGCs → Transcriptomics/Proteomics of Elicited Conditions → Activated BGC
Prioritized NP Features and Activated BGC → Integrative Correlation Analysis (Temporal/Induction Congruence) → Novel Natural Product with Genomic Validation

Integrated NP-PRESS to BGC Correlation Workflow

Detailed Application Protocols

Protocol A: NP-PRESS-Assisted Dereplication for Feature Prioritization

This protocol refines raw LC-MS data to generate a shortlist of MS features most likely to correspond to novel natural products.

Experimental Workflow:

  • Sample Preparation & LC-MS/MS Analysis:
    • Culture the microbe of interest under standard and BGC-eliciting conditions (e.g., co-culture, specific media, chemical elicitors) [57].
    • Extract metabolites using appropriate solvents. Perform LC-MS/MS analysis on a high-resolution instrument (e.g., Q-TOF, Orbitrap) in data-dependent acquisition (DDA) mode.
  • Data Processing with NP-PRESS:
    • Convert raw data to an open format (e.g., mzML).
    • Process data through the NP-PRESS pipeline [1]:
      • FUNEL Algorithm (MS1): Filters ions based on isotopic patterns and retention time behavior characteristic of peptides and other common biotic molecules, subtracting them from the dataset.
      • simRank Algorithm (MS2): Compresses MS/MS spectra by removing fragment ions common to solvent and media backgrounds, preserving unique spectral signatures.
    • Output: A table of filtered, prioritized MS features with associated MS/MS spectra, ready for downstream analysis.

Protocol B: Genomic BGC Activation Profiling

This protocol identifies which BGCs in a genome are transcriptionally activated under conditions that induce new metabolite production.

Experimental Workflow:

  • Genome Sequencing, Mining, and Annotation:
    • Sequence the producer strain’s genome. Assemble and annotate using standard tools.
    • Mine for BGCs using AntiSMASH [57] or similar. Annotate cluster types (PKS, NRPS, etc.) and core biosynthetic genes.
  • Multi-Omic Profiling of Elicited Conditions:
    • Transcriptomics: Isolate RNA from the same biological replicates used for metabolomics (under standard and elicited conditions). Perform RNA-Seq. Map reads to the genome and quantify expression (e.g., in RPKM) for all genes, particularly those within BGCs [57].
    • Differential Expression Analysis: Use tools like EdgeR [57] to identify BGCs with statistically significant upregulation under the eliciting condition.
    • Proteomics (Optional but Confirmatory): Perform a global proteomic analysis on the same samples to detect expression of the biosynthetic enzymes predicted by the activated BGC [57].
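
For orientation, the RPKM normalization and a per-cluster activation summary can be written out directly. The sketch below is illustrative only: it is not a substitute for the statistical testing performed by a differential expression package, and the pseudocount, the median summary, and the function names are assumptions made for the example.

```python
import math
from statistics import median


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_fold_change(elicited, control, pseudocount=1.0):
    """log2 ratio of elicited to control expression; the pseudocount avoids log(0)."""
    return math.log2((elicited + pseudocount) / (control + pseudocount))


def bgc_activation(genes):
    """Median log2 fold change over the genes of one BGC.

    genes: list of (rpkm_control, rpkm_elicited) pairs, one per gene.
    """
    return median(log2_fold_change(e, c) for c, e in genes)


# Hypothetical numbers: 500 reads on a 2 kb gene in a 10 M read library.
example_rpkm = rpkm(500, 2000, 10_000_000)
# Hypothetical three-gene cluster, (control, elicited) RPKM per gene.
activation = bgc_activation([(0.0, 255.0), (1.0, 63.0), (3.0, 3.0)])
```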

Table 2: NP-PRESS & Genomics Correlation Performance Metrics

Metric | NP-PRESS Performance | Genomic BGC Correlation | Integrated Workflow Advantage
Feature Reduction | Removes >90% of interfering MS1 features from biotic background [1]. | Narrows the dozens of BGCs in a genome to a few activated targets [57]. | Focuses analytical resources on a shortlist of high-priority metabolite-BGC pairs.
Novelty Confidence | High confidence that prioritized features are not common media/biotic components [1]. | High confidence that activated BGCs encode underexplored chemistry. | Convergent evidence strongly increases probability of true novelty.
Case Study Yield | Guided discovery of new surugamide and baidienmycin families [1]. | Enabled linking keyicin production to its specific BGC and interspecies signaling [57]. | Provides both the novel structure (MS) and its genetic blueprint (genomics).
Key Analytical Tool | FUNEL and simRank algorithms [1]. | RNA-Seq differential expression analysis (e.g., EdgeR) [57]. | Temporal correlation analysis and molecular networking.

Parallel Sample Preparation: Initiate Study with Producer Microbe → Culture under Standard Conditions and Culture under Eliciting Conditions (e.g., Co-culture)
Multi-Omic Data Generation: both cultures → LC-MS/MS Metabolomics and RNA Extraction & Sequencing
Integrated Data Analysis: LC-MS/MS Metabolomics → NP-PRESS Processing (Feature Prioritization); RNA Extraction & Sequencing → Transcriptomic Analysis (BGC Activation Profile); both → Correlate Active BGCs with Prioritized Metabolite Features → Validated Target: Novel NP from Activated BGC

Experimental Protocol for NP-PRESS & BGC Correlation

Case Study: Application in Keyicin Discovery

The discovery of keyicin provides a seminal example of this integrative approach [57]. While not using NP-PRESS specifically, the methodology mirrors its principles.

  • Observation & Prioritization: Co-culture of Micromonospora and Rhodococcus produced a new antimicrobial activity. Metabolite profiling (analogous to NP-PRESS output) detected a unique compound (keyicin) only in co-culture.
  • Genomic Mining: Genome sequencing of Micromonospora revealed an orphan type II PKS BGC (kyc cluster) predicted to synthesize an anthracycline.
  • Correlation via Transcriptomics: RNA-Seq showed the kyc BGC was transcriptionally silent in Micromonospora monoculture but highly upregulated during co-culture. This activation was triggered by a small molecule signal from Rhodococcus, linking a specific metabolite peak to the specific activation of its cognate BGC [57].
  • Validation: This correlation guided the targeted isolation and structural elucidation of keyicin, confirming the prediction of the BGC.

The Scientist's Toolkit: Essential Research Reagent Solutions

Table 3: Key Reagents and Materials for Integrated NP-PRESS & BGC Studies

Item | Function/Description | Application Notes
Specialized Growth Media | Media formulations designed to elicit secondary metabolism (e.g., R2A, ISP2) [1]. | Essential for activating silent BGCs. Co-culture setups require compatible media for both organisms [57].
RNA Stabilization Reagent (e.g., TRIzol) | Immediately stabilizes cellular RNA to preserve accurate transcriptional profiles. | Critical for transcriptomics. Samples must be taken from the exact same cultures and time points used for metabolomics.
Next-Generation Sequencing Kit | For preparation of RNA-Seq libraries (e.g., Illumina TruSeq). | Enables genome-wide quantification of BGC gene expression. Poly-A selection is not suitable for bacterial RNA; use rRNA depletion instead.
LC-MS Grade Solvents | High-purity acetonitrile, methanol, and water for metabolite extraction and chromatography. | Minimizes chemical background noise, improving NP-PRESS filtering performance [1].
Database Subscriptions/Access | Access to genomic (MIBiG) [58], metabolomic (GNPS) [50], and chemical (PubChem) databases. | Required for BGC annotation, molecular networking, and final dereplication of putative novel compounds.

Natural product (NP) discovery is a cornerstone of pharmaceutical development, yet it is persistently challenged by the low abundance of novel compounds and the complex interference from biological matrices. Traditional mass spectrometry (MS) workflows struggle to differentiate signals of novel NPs from a background of microbial degradation products and media components [1]. The NP-PRESS strategy addresses this by implementing a two-stage MS feature dereplication process, using algorithms like FUNEL and simRank to prioritize truly novel features by thoroughly removing irrelevant biotic and abiotic signals [1]. However, a critical bottleneck remains: the confident identification of novel peptide or peptide-like natural products once they are prioritized.

This is where AI-driven de novo peptide sequencing, exemplified by InstaNovo, creates a transformative synergy. InstaNovo is a transformer-based deep learning model that translates fragment ion peaks (MS/MS spectra) into peptide sequences without relying on pre-existing protein databases [59] [60]. Its diffusion-model counterpart, InstaNovo+, further refines predictions through iterative processes [59]. By integrating InstaNovo into the NP-PRESS pipeline, researchers can transition from merely prioritizing unknown features to directly sequencing and identifying them. This integration is particularly powerful for discovering novel ribosomal and non-ribosomal peptides, cyclized peptides, and other natural product derivatives that are absent from conventional databases, thereby "illuminating the dark proteome" of microbial producers [59] [60].

Comparative Analysis: NP-PRESS and InstaNovo

Core Function and Strategic Fit

Table 1: Comparative Core Functions of NP-PRESS and InstaNovo

Aspect | NP-PRESS Strategy | InstaNovo/AI-Driven Sequencing | Integrated Advantage
Primary Goal | Dereplication; prioritize novel MS features by removing known/irrelevant signals [1]. | Database-free identification; determine the amino acid sequence from MS/MS spectra [59] [60]. | From prioritization to identification: Converts a list of "interesting unknowns" into definitive sequences.
Key Innovation | Two-stage algorithm (FUNEL, simRank) to subtract features from biotic processes and media [1]. | Transformer (InstaNovo) and diffusion (InstaNovo+) models for direct spectrum-to-sequence translation [59]. | Creates a closed-loop discovery engine: Filter, then sequence. Reduces search space for AI, increasing its efficiency.
Data Input | MS1 (precursor) and MS2 (fragment) data from complex biological extracts [1]. | MS/MS (MS2) spectrum (peak lists with m/z and intensity) [61]. | NP-PRESS pre-filters the most promising, novel spectra for computationally intensive de novo analysis.
Output | A prioritized list of LC-MS features highly likely to represent novel natural products [1]. | Predicted peptide sequences with associated confidence scores (log probabilities) [61]. | Annotated novel natural product sequences with structural hypotheses ready for synthesis and validation.
Thesis Context | Provides the essential first stage of the two-stage dereplication strategy by handling complex mixtures. | Provides the decisive second stage by solving the identity of the prioritized unknowns. | Completes the two-stage MS feature dereplication thesis, enabling end-to-end novel NP discovery.

Performance Metrics and Quantitative Synergy

Table 2: Performance Metrics Demonstrating Integration Potential

Metric | NP-PRESS (Concept Proof) | InstaNovo/InstaNovo+ (Reported Performance) | Interpretation for Integration
Discovery Yield | Guided discovery of new surugamide analogs and the new depsipeptide family baidienmycins from bacteria [1]. | Identified 1,338 previously undetected protein fragments in well-studied HeLa cell samples [60]. | Suggests high potential to identify novel peptides from NP-PRESS-prioritized features in microbial extracts.
Precision Gain | Excels at removing interfering features to highlight true NP signals [1]. | InstaNovo+ identified 32.71% more peptide-spectrum matches (PSMs) at 5% FDR than Casanovo, the prior state of the art, on a yeast dataset [59]. | The two tools are complementary: filtering followed by sequencing helps ensure that final identifications are both novel and accurate.
Handling Modifications | Not explicitly addressed for PTMs. | InstaNovo-P model fine-tuned for phosphorylation; v1.1 natively supports common PTMs (Oxidation, Deamidation, etc.) [62] [61]. | Enables discovery of modified NP analogues, a common source of bioactivity diversity.
Algorithmic Complement | FUNEL and simRank algorithms for feature comparison and filtering [1]. | Transformer neural network with knapsack beam search and diffusion-based refinement [59] [61]. | Filtering (NP-PRESS) reduces noise for the sequence inference model (InstaNovo), optimizing overall computational efficiency.

Integrated Experimental Protocols

Protocol 1: Integrated NP Discovery Workflow

Objective: To discover and sequence novel peptide-based natural products from a microbial culture extract.

Workflow Diagram:

Microbial Culture Extraction → LC-MS/MS Analysis (Data-Dependent Acquisition) → NP-PRESS Processing (FUNEL & simRank) → Output: Prioritized List of 'Novel' MS2 Spectra (.mgf) → InstaNovo Prediction (Transformer + Diffusion) → Confidence Filtering (Log Prob., Δ mass ppm) → Output: Candidate Novel Peptide Sequences → Validation (Synthesis, Bioassay)

(Diagram: Integrated NP Discovery and Sequencing Workflow)

Steps:

  • Sample Preparation & LC-MS/MS: Prepare extracts from target microbial strains (e.g., Streptomyces or anaerobic bacteria). Perform reversed-phase LC-MS/MS using data-dependent acquisition (DDA) to collect MS1 and MS2 spectra [1].
  • NP-PRESS Dereplication: Process the raw MS data through the NP-PRESS pipeline. This applies the FUNEL and simRank algorithms to compare features against control samples (media, spent broth) and subtract signals from known biotic processes, resulting in a curated list of MS2 spectra (.mgf or .mzML format) corresponding to high-priority, potentially novel features [1].
  • InstaNovo Sequencing: Use the command-line interface to run de novo sequencing on the prioritized spectra, writing the predictions to novel_peptide_predictions.csv.
  • Data Filtering: Filter the predictions in novel_peptide_predictions.csv based on confidence. Key columns include log_probabilities (higher is better) and delta_mass_ppm (absolute value closer to zero indicates better mass accuracy). Retain sequences with log probability > -3 and |Δ mass| < 10 ppm for further analysis [61].
  • Downstream Validation: The resulting candidate sequences form testable structural hypotheses. They can be chemically synthesized for standard co-injection to confirm LC-MS retention time and fragmentation pattern, or used for heterologous expression and biological activity testing [1].
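
The confidence filter from the Data Filtering step can be applied with the Python standard library alone. In the sketch below, the `log_probabilities` and `delta_mass_ppm` column names and the two thresholds come from the step above; the `spectrum_id` and `predictions` columns and all sample rows are assumptions made up for the example, so the real output file should be inspected before reusing this.

```python
import csv
import io


def filter_predictions(rows, min_log_prob=-3.0, max_abs_ppm=10.0):
    """Keep rows with log probability > min_log_prob and |delta mass| < max_abs_ppm."""
    kept = []
    for row in rows:
        log_prob = float(row["log_probabilities"])
        ppm = float(row["delta_mass_ppm"])
        if log_prob > min_log_prob and abs(ppm) < max_abs_ppm:
            kept.append(row)
    return kept


# Made-up stand-in for novel_peptide_predictions.csv.
SAMPLE_CSV = """spectrum_id,predictions,log_probabilities,delta_mass_ppm
s1,ALPYTPKK,-0.8,2.1
s2,GGSTVLK,-5.2,1.0
s3,PEPTLDEK,-1.5,-14.3
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept = filter_predictions(rows)
kept_ids = [row["spectrum_id"] for row in kept]
```

To run this on a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle for the predictions CSV.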

Protocol 2: Targeted Validation via Parallel Reaction Monitoring (PRM)

Objective: To confirm the presence and expression of a candidate novel peptide discovered via Protocol 1.

Workflow Diagram:

Input: Candidate Sequence from Integrated Workflow → Theoretical Spectral Library Generation (Skyline/Pyteomics) → PRM Assay Design (Define precursor m/z & isolation window) → LC-PRM/MS Analysis of New Microbial Cultures → Data Analysis (Peak detection, library matching)
Data Analysis → (Synthesized Standard) Confirmation: Detection of synthesized peptide standard
Data Analysis → (Culture Extracts) Discovery: Variable detection in producing strain cultures

(Diagram: Targeted PRM Validation for Candidate Sequences)

Steps:

  • Spectral Library Generation: For the candidate peptide sequence (e.g., "ALPYTPKK"), use software like Skyline or the Python pyteomics library to generate a theoretical MS2 spectrum. Include predicted fragment ions (b- and y-ions) and their expected intensities if using advanced tools.
  • PRM Assay Development: Calculate the precursor m/z for the candidate peptide in common charge states (+2, +3). Design a PRM method on your tandem mass spectrometer (e.g., Q-Exactive, TripleTOF) to target these precursor ions with an appropriate isolation width (e.g., 2 m/z) and schedule the acquisition around the expected LC retention time.
  • Sample Analysis & Confirmation:
    • Synthetic Standard: Analyze a chemically synthesized version of the candidate peptide. Successful co-elution and a high match between the observed PRM spectrum and the theoretical library confirm the InstaNovo-predicted structure.
    • Biological Replicates: Re-analyze new extracts from the putative producing microbial strain and relevant controls using the PRM method. Detection of the peptide specifically in the producing strain, with a MS2 spectrum matching the library, provides orthogonal biological validation [62].
  • Quantification (Optional): If a synthetic standard is available, a calibration curve can be built to estimate the native concentration of the novel natural product in the culture.
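
The precursor and fragment calculations behind the PRM assay design can be sketched without any external package. The code below assumes an unmodified, linear peptide built from the 20 proteinogenic amino acids, using standard monoisotopic residue masses; cyclic peptides, non-proteinogenic residues, and modifications common in natural products are outside its scope. Function names are hypothetical, and tools such as Skyline or pyteomics remain the practical choice for real assays.

```python
# Monoisotopic residue masses (Da) for the 20 proteinogenic amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def precursor_mz(sequence, charge):
    """m/z of the [M + nH]n+ ion of a linear, unmodified peptide."""
    neutral = sum(MONO[aa] for aa in sequence) + WATER
    return (neutral + charge * PROTON) / charge


def by_ions(sequence):
    """Singly charged b1..b(n-1) and y1..y(n-1) fragment m/z values."""
    b_ions, y_ions = [], []
    running = 0.0
    for aa in sequence[:-1]:              # N-terminal fragments
        running += MONO[aa]
        b_ions.append(running + PROTON)
    running = WATER
    for aa in reversed(sequence[1:]):     # C-terminal fragments
        running += MONO[aa]
        y_ions.append(running + PROTON)
    return b_ions, y_ions


def isolation_window(mz, width=2.0):
    """Lower and upper bounds of a PRM isolation window centred on mz."""
    return (mz - width / 2, mz + width / 2)


candidate = "ALPYTPKK"
targets = {z: precursor_mz(candidate, z) for z in (2, 3)}
b_ions, y_ions = by_ions(candidate)
window = isolation_window(targets[2])
```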

The Scientist's Toolkit

Table 3: Essential Research Reagent and Software Solutions

Category | Item/Software | Function in Integrated Workflow | Key Notes/Specifications
MS Data Generation | High-Resolution LC-MS/MS System (e.g., Q-Exactive, timsTOF) | Generates high-quality MS1 and MS2 spectral data for both discovery (DDA) and validation (PRM). | Essential for achieving the mass accuracy and resolution needed for NP-PRESS filtering and InstaNovo sequencing.
Computational Environment | Workstation with NVIDIA GPU (e.g., RTX 4090, A100) | Accelerates the training and inference of deep learning models like InstaNovo+. | Critical for practical turnaround times when processing hundreds to thousands of prioritized spectra [61].
Core Analysis Software | NP-PRESS Algorithms (FUNEL, simRank) | Performs the initial critical dereplication and feature prioritization from complex LC-MS data [1]. | The foundational first stage that filters the data for AI analysis.
Core Analysis Software | InstaNovo Python Package (instanovo) | Executes the de novo peptide sequencing via command line or Python API [61]. | Version 1.1.0 natively supports key post-translational modifications relevant to NPs (e.g., oxidation, deamidation) [61].
Data Format Bridge | MSConvert (ProteoWizard) / pymzml | Converts raw spectrometer files (.raw, .d) to open formats (.mzML, .mgf) for NP-PRESS and InstaNovo. | Ensures compatibility between instrument vendor software and open-source analysis pipelines.
Validation & Design | Skyline or Pyteomics Library | Creates theoretical spectral libraries for PRM assay design and provides a platform for targeted data analysis. | Enables the transition from in silico prediction to empirical, hypothesis-driven validation [62].
Chemical Validation | Custom Peptide Synthesis Service | Provides synthetic analytical standards for definitive structural confirmation via co-elution experiments. | Final, definitive step to confirm the predicted structure of a novel natural product.

Conclusion

The NP-PRESS two-stage dereplication strategy represents a significant methodological advance for natural product discovery, directly addressing the persistent challenge of metabolome complexity. By systematically removing irrelevant biotic and abiotic features through its FUNEL and simRank algorithms, the pipeline efficiently prioritizes novel secondary metabolites, as validated by the discovery of new surugamides and the baidienmycin family. This approach not only reduces the resource-intensive risk of erroneous isolations but also proves particularly powerful for mining unconventional microbial sources. Looking forward, the integration of NP-PRESS with rapidly evolving genomic mining tools and deep learning-based de novo sequencing models, such as InstaNovo, promises to create a more holistic, in-silico-first discovery framework. This convergence will further accelerate the identification and structural elucidation of novel bioactive compounds, reinvigorating the natural product pipeline for biomedical and clinical research in the face of emerging diseases and antibiotic resistance.

References