Synergy in Spectrometry: A Comprehensive Guide to LC-HRMS and NMR Data Fusion for Advanced Food Classification

Andrew West Dec 02, 2025

Abstract

This article explores the powerful combination of Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) spectroscopy through data fusion strategies for food classification and authentication. Aimed at researchers, scientists, and drug development professionals, it covers the foundational principles of these complementary analytical techniques, delves into multi-level data fusion methodologies (low-, mid-, and high-level), and presents cutting-edge applications from wine and nut traceability to pharmaceutical quality control. The content further addresses critical troubleshooting for data integration challenges, provides frameworks for model validation and performance comparison, and discusses the translational potential of these robust analytical frameworks for biomedical and clinical research, including metabolomic profiling and quality attribute prediction.

Understanding the Core Synergy: Why LC-HRMS and NMR are a Powerful Combination for Food Profiling

The Inherent Strengths and Weaknesses of LC-HRMS and NMR Spectroscopy

In the evolving landscape of analytical chemistry, Liquid Chromatography-High-Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) Spectroscopy have emerged as two pivotal techniques for metabolomics and food classification research. While each method possesses distinct capabilities, their integration through data fusion strategies presents a powerful approach for comprehensive sample characterization. This application note delineates the inherent strengths and limitations of both platforms, providing detailed experimental protocols and contextualizing their application within food authentication research. The complementary nature of LC-HRMS and NMR enables researchers to leverage the high sensitivity of the former with the quantitative robustness and structural elucidation power of the latter, creating a synergistic workflow that surpasses the capabilities of either technique used independently [1] [2].

The growing need for food authenticity verification, particularly for high-value products like coffee, wine, and honey, has driven the development of sophisticated analytical methodologies that can detect adulteration and verify geographical origin. Within this framework, understanding the technical advantages and constraints of LC-HRMS and NMR becomes imperative for designing effective classification models that integrate data from both platforms [3] [4].

Technical Comparison: LC-HRMS versus NMR Spectroscopy

The selection between LC-HRMS and NMR spectroscopy requires careful consideration of their fundamental operational principles and performance characteristics. The following section provides a detailed technical comparison to guide researchers in selecting the appropriate technology for their specific application needs.

Table 1: Comprehensive Comparison of Technical Specifications between LC-HRMS and NMR Spectroscopy

| Parameter | LC-HRMS | NMR Spectroscopy |
| --- | --- | --- |
| Sensitivity | Very high (can detect compounds at ng/mL or lower levels) [5] | Moderate to low; limited by insufficient sample concentrations [6] [2] |
| Sample Preparation | Requires extraction, often complex; protein precipitation for serum [5] | Minimal; typically just dissolution in deuterated solvent [6] [7] |
| Destructive Nature | Destructive technique | Non-destructive; sample can be recovered [6] [7] |
| Quantitation | Requires standards; susceptible to matrix effects | Inherently quantitative without need for calibration curves [2] |
| Structural Elucidation | Provides molecular formula via exact mass; fragmentation patterns | Provides definitive 3D structural information, including stereochemistry [6] |
| Throughput | Moderate (chromatographic separation required) | Rapid once sample is loaded |
| Reproducibility | Subject to retention time shifts, requiring alignment algorithms [3] | Excellent; highly reproducible across instruments and laboratories [3] |
| Molecular Size Limitation | Suitable for a wide range, but can be challenged by very large molecules | Difficulty with higher molecular weight molecules due to spectral complexity [6] |
| Key Detectable Nuclei | Not applicable (mass-based) | 1H, 13C, 15N, 31P, 23Na, 19F [6] |
| Operational Costs | High (instrumentation, maintenance) | Very high (cryogenic liquids, powerful magnets) [6] |

In-Depth Analysis of Strengths and Weaknesses

LC-HRMS excels in sensitivity and specificity, capable of detecting thousands of metabolite features in a single analysis through untargeted acquisition [5] [4]. This technique provides specific identifications based on monoisotopic mass, retention time, isotopic patterns, and fragmentation spectra, enabling the creation of extensive shared spectral libraries [8]. However, limitations persist, particularly for low-concentration compounds or low-abundance ion fragments, where obtaining sufficient fragmentation for complete identification becomes challenging [8]. Additionally, LC-HRMS generates massive datasets that require sophisticated software tools and impose significant demands on data storage and processing infrastructure [8] [5]. A notable technical challenge is the lack of long-term robustness, where variations in column age and instrument contamination can lead to retention time shifts, complicating data comparison across different batches and laboratories [3].

NMR spectroscopy offers distinct advantages in structural elucidation at the atomic level, providing comprehensive information about molecular structure, dynamics, and interactions within the natural environment while preserving sample integrity [6]. As a non-destructive technique, NMR allows sample recovery for subsequent analyses, and its inherently quantitative nature enables precise concentration measurements without requiring external standards [2] [7]. The exceptional reproducibility of NMR data facilitates direct comparison across different instruments and time periods, a crucial advantage for long-term studies [3]. Primary limitations include relatively low sensitivity compared to mass spectrometry and high instrumentation and maintenance costs due to requirements for powerful superconducting magnets and cryogenic cooling systems [6] [2]. Furthermore, NMR faces challenges in analyzing large molecular weight compounds due to increased spectral complexity and is restricted to studying nuclei with magnetic moments [6].

Experimental Protocols for Food Classification Research

Protocol: LC-HRMS Analysis for Coffee Authentication

This protocol is adapted from a published methodology for the classification of Arabica and Robusta coffee samples from different geographical origins [4].

1. Sample Preparation:

  • Materials: Green or roasted coffee beans, methanol, acetonitrile, formic acid, ultrapure water.
  • Procedure:
    • Grind coffee beans to a consistent particle size using a laboratory mill.
    • Weigh 0.5 g of ground coffee into a conical flask.
    • Add 50 mL of hot water (92-96°C) and brew for 4-6 minutes, mimicking standard preparation.
    • Cool the brew to room temperature and filter through a 0.22 μm PVDF syringe filter.
    • Transfer 100 μL of filtrate to an LC vial for analysis.

2. LC-HRMS Analysis:

  • Instrumentation: UHPLC system coupled to Q-Exactive Orbitrap or similar high-resolution mass spectrometer.
  • Chromatographic Conditions:
    • Column: C18 reversed-phase column (e.g., 100 × 2.1 mm, 1.8 μm)
    • Mobile Phase A: Water with 0.1% formic acid
    • Mobile Phase B: Methanol with 0.1% formic acid
    • Gradient: 5% B to 95% B over 25 minutes, hold for 5 minutes
    • Flow Rate: 0.3 mL/min
    • Column Temperature: 40°C
    • Injection Volume: 5 μL
  • Mass Spectrometric Conditions:
    • Ionization: Electrospray Ionization (ESI) in both positive and negative modes
    • Scan Range: m/z 100-1500
    • Resolution: >70,000 (at m/z 200)
    • Capillary Temperature: 320°C
    • Sheath Gas Flow: 40 arbitrary units
    • Aux Gas Flow: 15 arbitrary units

3. Data Processing:

  • Convert raw files to mzML format using MSConvert.
  • Process data using XCMS or similar software for peak detection, retention time alignment, and feature quantification.
  • Perform statistical analysis using Principal Component Analysis (PCA) and Partial Least Squares-Discriminant Analysis (PLS-DA) to identify discriminant features.

Protocol: NMR Analysis for Wine Metabolomic Fingerprinting

This protocol is adapted from a study on the classification of Amarone wines based on grape withering time and yeast strain using 1H NMR [1].

1. Sample Preparation:

  • Materials: Wine samples, deuterated water (D2O), sodium azide, phosphate buffer, 3-(trimethylsilyl) propionic acid-d4 sodium salt (TSP).
  • Procedure:
    • Mix 300 μL of wine with 300 μL of phosphate buffer (0.1 M, pD 7.4) in D2O.
    • Add 0.25 mM TSP as an internal chemical shift reference and 0.05% sodium azide to inhibit microbial growth.
    • Centrifuge at 13,000 rpm for 10 minutes to remove any particulate matter.
    • Transfer 550 μL of the supernatant to a 5 mm NMR tube.

2. NMR Spectroscopy:

  • Instrumentation: 600 MHz NMR spectrometer with a cryoprobe for enhanced sensitivity.
  • Acquisition Parameters:
    • Experiment Type: 1D NOESY-presat (noesygppr1d) for water suppression
    • Temperature: 298 K
    • Spectral Width: 20 ppm
    • Relaxation Delay: 4 seconds
    • Mixing Time: 10 ms
    • Number of Scans: 64
    • Acquisition Time: 3 seconds

3. Data Processing:

  • Process Free Induction Decays (FIDs) by applying exponential multiplication (line broadening of 0.3 Hz) followed by Fourier transformation.
  • Manually phase and baseline correct all spectra.
  • Calibrate spectra to TSP signal at δ 0.0 ppm.
  • Reduce data by segmenting spectra into regions (bucketing) of equal width (δ 0.04 ppm), excluding the water region (δ 4.7-5.0 ppm).
  • Normalize the integrated bucket regions to total intensity for multivariate statistical analysis.
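
The bucketing and normalization steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the cited study's implementation: the function name `bucket_spectrum`, the 0 to 10 ppm range, and the synthetic two-peak spectrum are assumptions made for the example; only the 0.04 ppm bucket width, the excluded water region, and total-intensity normalization come from the protocol.

```python
import numpy as np


def bucket_spectrum(ppm, intensity, width=0.04, exclude=(4.7, 5.0),
                    ppm_range=(0.0, 10.0)):
    """Sum a 1D spectrum into equal-width buckets, drop the water region,
    and normalize the remaining buckets to unit total intensity."""
    ppm = np.asarray(ppm, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    low, high = ppm_range
    n_buckets = int(round((high - low) / width))
    edges = low + width * np.arange(n_buckets + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Assign every data point to a bucket and sum intensities per bucket
    idx = np.digitize(ppm, edges) - 1
    inside = (idx >= 0) & (idx < n_buckets)
    buckets = np.bincount(idx[inside], weights=intensity[inside],
                          minlength=n_buckets)

    # Keep only buckets that do not overlap the excluded (water) region
    keep = (edges[1:] <= exclude[0]) | (edges[:-1] >= exclude[1])
    centers, buckets = centers[keep], buckets[keep]

    total = buckets.sum()
    if total > 0:
        buckets = buckets / total
    return centers, buckets


# Synthetic spectrum (hypothetical): one metabolite peak plus residual water
ppm_axis = np.linspace(10.0, 0.0, 5000)
spectrum = (np.exp(-((ppm_axis - 1.2) ** 2) / 0.001)
            + 5.0 * np.exp(-((ppm_axis - 4.8) ** 2) / 0.001))
centers, buckets = bucket_spectrum(ppm_axis, spectrum)
print(len(centers), round(float(buckets.sum()), 6))
```

Each row of the resulting bucket table (one row per sample) can then be stacked into the matrix used for multivariate analysis.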

Data Fusion Strategy for Enhanced Classification

Integrating data from LC-HRMS and NMR platforms enhances classification accuracy by capturing complementary aspects of the sample metabolome. The following workflow outlines a standardized procedure for mid-level data fusion, which has demonstrated superior performance in food classification tasks [1] [9].

[Workflow diagram. Data Acquisition: LC-HRMS and NMR → Data Processing Pipeline: data preprocessing → feature selection → data fusion → Model Building & Output: multivariate analysis → classification model → result.]

Diagram 1: Data fusion workflow for LC-HRMS and NMR integration in food classification.
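
A mid-level fusion step of this kind could look roughly like the sketch below, which extracts a few PCA scores from each platform and concatenates them before classification. The use of PCA for feature extraction, linear discriminant analysis as the classifier, the helper name `mid_level_fuse`, and the synthetic data are all assumptions for illustration; the workflow itself does not prescribe them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler


def mid_level_fuse(X_lcms, X_nmr, n_components=5):
    """Autoscale each block, reduce it to PCA scores, and concatenate
    the scores sample-wise into one fused feature matrix."""
    score_blocks = []
    for X in (X_lcms, X_nmr):
        X_scaled = StandardScaler().fit_transform(X)
        score_blocks.append(PCA(n_components=n_components).fit_transform(X_scaled))
    return np.hstack(score_blocks)


# Synthetic two-class example (hypothetical data, 40 samples)
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X_lcms = rng.normal(size=(40, 300)) + 0.8 * y[:, None]
X_nmr = rng.normal(size=(40, 120)) + 0.8 * y[:, None]

fused = mid_level_fuse(X_lcms, X_nmr)
clf = LinearDiscriminantAnalysis().fit(fused, y)
print(fused.shape, clf.score(fused, y))
```

For an honest performance estimate, the scaler and PCA would need to be fitted on training samples only and then applied to held-out samples; the sketch fits them on all samples purely to keep the example short.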

Research Reagent Solutions and Essential Materials

Successful implementation of LC-HRMS and NMR methodologies requires specific reagents and materials optimized for each platform. The following table catalogues essential items for researchers establishing these techniques in their laboratories.

Table 2: Essential Research Reagents and Materials for LC-HRMS and NMR Experiments

| Item | Function/Application | Technical Specifications | Example Use Case |
| --- | --- | --- | --- |
| Deuterated Solvents (e.g., CD3OD, D2O, CDCl3) | NMR solvent; provides deuterium lock signal | 99.8% deuterium enrichment; NMR tubes (5 mm) | Dissolving samples for NMR analysis without interfering proton signals [7] |
| Chemical Shift Reference (e.g., TSP) | Internal chemical shift standard for NMR | δ 0.0 ppm for ¹H NMR; soluble in water | Referencing NMR spectra in aqueous solutions [7] |
| C18 LC Columns | Reversed-phase separation for LC-HRMS | 100-150 mm length; 2.1 mm ID; 1.7-1.8 μm particle size | Separating complex metabolite mixtures in coffee, wine [4] |
| Mass Calibration Solution | Daily mass accuracy calibration for HRMS | Covers broad m/z range; compatible with ionization mode | Ensuring sub-ppm mass accuracy during untargeted screening |
| Deuterated Mobile Phase Additives | LC-NMR hyphenation; minimal interference | Deutero-acetonitrile, deutero-methanol, D2O with buffers | Online LC-NMR applications for structural ID [10] |
| Solid Phase Extraction (SPE) Cartridges | Sample clean-up and metabolite concentration | C18, HILIC, or mixed-mode chemistries; 30-100 mg bed weight | Pre-concentrating dilute food samples prior to LC-HRMS |

Decision Framework for Technique Selection

The choice between LC-HRMS and NMR, or the decision to implement both, depends on specific research goals, sample characteristics, and available resources. The following diagram provides a systematic approach for technique selection based on key experimental requirements.

[Decision flow diagram. Start → Sensitivity: high required → LC-HRMS recommended; moderate acceptable → Structure. Structure: definitive required → NMR recommended; molecular formula sufficient → Quantitation. Quantitation: absolute without standards → NMR recommended; relative acceptable → Throughput. Throughput: higher preferred → LC-HRMS recommended; moderate acceptable → Resources. Resources: ample resources → data fusion recommended; limited budget → consider NMR. The diagram also contains an unconnected "consider LC-HRMS" outcome.]

Diagram 2: Decision framework for selecting between LC-HRMS and NMR based on research requirements.
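
Read as a sequence of checks, the decision framework can be written down as a small function. This is only one reading of the diagram: the function name, the boolean parameters, and the returned strings are hypothetical, and the diagram's unconnected "consider LC-HRMS" outcome is not reachable here.

```python
def recommend_technique(high_sensitivity_required: bool,
                        definitive_structure_required: bool,
                        absolute_quantitation_required: bool,
                        higher_throughput_preferred: bool,
                        ample_resources: bool) -> str:
    """Walk the decision flow in the order: sensitivity, structure,
    quantitation, throughput, resources."""
    if high_sensitivity_required:
        return "LC-HRMS"
    if definitive_structure_required:
        return "NMR"
    if absolute_quantitation_required:
        return "NMR"
    if higher_throughput_preferred:
        return "LC-HRMS"
    if ample_resources:
        return "LC-HRMS + NMR data fusion"
    # Limited budget branch of the diagram
    return "Consider NMR"


# Example: moderate sensitivity is fine, but absolute quantitation is needed
print(recommend_technique(False, False, True, False, True))
```

Because the checks are ordered, an early requirement such as high sensitivity decides the outcome before later criteria are considered, which mirrors the top-to-bottom layout of the flow.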

LC-HRMS and NMR spectroscopy represent complementary analytical pillars within modern food classification research. LC-HRMS delivers exceptional sensitivity and broad metabolome coverage, while NMR provides unparalleled structural elucidation capabilities and quantitative robustness. The strategic integration of these platforms through data fusion approaches, as demonstrated in wine and coffee authentication studies, creates a synergistic analytical framework that significantly enhances classification accuracy and metabolic insight. As food authentication challenges grow increasingly complex, leveraging the combined strengths of LC-HRMS and NMR will be essential for developing robust classification models that protect consumers and ensure product integrity within global food supply chains.

Data fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source [11]. In scientific research, particularly in fields requiring high-precision classification such as food authenticity testing, data fusion provides a powerful framework for combining complementary analytical techniques. The core principle involves merging diverse data streams—such as Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) spectroscopy—to create a unified, comprehensive profile that surpasses the capabilities of any single method [3] [12].

The fundamental model for understanding data fusion processes is the JDL/DFIG model, which categorizes fusion into increasingly refined levels from Source Preprocessing (Level 0) to Mission Refinement (Level 6) [11]. For analytical chemistry applications, this translates to a workflow that progresses from raw instrumental data acquisition through feature extraction, multidimensional integration, and finally to classification and decision-making, enabling researchers to transform disconnected data points into actionable knowledge about geographical origin, adulteration, and quality of food products.

Data Fusion in Food Classification Research

The application of data fusion in food classification addresses significant challenges in modern food authentication. As global trade expands, so does the scope for food fraud, creating an urgent need for analytical methods that can detect increasingly sophisticated adulteration practices [3]. Traditional targeted analysis approaches, which focus on specific known markers, struggle to identify novel adulterants as fraudsters continuously adapt their methods. Data fusion enables a comprehensive untargeted screening approach that can detect deviations from authentic profiles even before specific adulteration methods are defined.

For complex classification problems such as determining the geographical origin of honey, data fusion is particularly valuable. Samples within a single country can be highly diverse due to different varieties, regional variations, and changing weather conditions between harvest years [3]. By combining complementary analytical techniques—typically LC-HRMS for sensitive detection of numerous compounds and NMR for robust, reproducible fingerprinting—researchers can build more accurate and robust classification models that capture the multifaceted nature of food authenticity.

Table 1: Analytical Techniques in Food Classification Data Fusion

| Technique | Key Advantages | Role in Data Fusion | Typical Data Output |
| --- | --- | --- | --- |
| LC-HRMS | High sensitivity; detects numerous compounds | Provides detailed compositional data | Chromatographic peaks with mass/charge ratios and intensities |
| NMR | High robustness and reproducibility; quantitative | Creates stable spectral fingerprint | Spectral bins with intensity values |
| Sensory Analysis | Direct quality assessment | Adds consumer-relevant attributes | Numerical scores from trained panels |
| Stable Isotope Analysis | Geographic discrimination | Provides origin traceability | Isotopic ratio values |

Experimental Protocols for LC-HRMS and NMR Data Fusion

LC-HRMS Analysis Protocol

Sample Preparation:

  • Honey Extraction: Dilute 1 g of honey with 10 mL of acetonitrile:water (50:50, v/v) mixture.
  • Centrifugation: Centrifuge at 10,000 × g for 10 minutes to remove particulate matter.
  • Filtration: Pass supernatant through a 0.22 μm nylon membrane filter prior to analysis.

Instrumental Parameters:

  • Chromatography: Utilize two complementary LC approaches: Hydrophilic Interaction Liquid Chromatography (HILIC, Accucore-150-Amide-HILIC, 150 × 2.1 mm) in negative ion mode for polar compounds, and Reversed-Phase chromatography (RP, Hypersil Gold C18, 150 × 2.1 mm) in positive ion mode for non-polar compounds.
  • Mobile Phase: For both methods, use water and acetonitrile, with acetic acid as modifier, under gradient elution.
  • Mass Spectrometry: Employ electrospray ionization (ESI) and acquire data in profile mode using variable Data-Independent Acquisition (vDIA) with a full scan range of m/z 100–1500 (MS1) and six fragmentation windows (MS2) [3].

Quality Control:

  • Include procedural blanks to identify contamination.
  • Use quality control samples (pooled from all samples) every 10 injections to monitor instrument stability.
  • Add an internal standard (e.g., sorbic acid in an acetonitrile-water mixture) to support normalization.

NMR Analysis Protocol

Sample Preparation:

  • Buffer Preparation: Prepare 0.2 M sodium phosphate buffer in D₂O (pH 6.0) containing 0.025% trimethylsilylpropanoic acid (TSP) as chemical shift reference.
  • Honey Solution: Mix 40 mg of honey with 600 μL of NMR buffer.
  • Centrifugation: Centrifuge at 12,000 × g for 10 minutes to remove any insoluble material.
  • Transfer: Pipette 550 μL of supernatant into 5 mm NMR tubes.

Data Acquisition:

  • Instrument: 600 MHz NMR spectrometer with cryoprobe.
  • Temperature: Maintain at 298 K.
  • Standard Protocol: Utilize one-dimensional NOESY-presat sequence for water suppression with 256 scans, 4s relaxation delay, and 100 ms mixing time.
  • Spectral Parameters: Acquire 64k data points with spectral width of 20 ppm.

Processing Parameters:

  • Apply exponential line broadening of 0.3 Hz before Fourier transformation.
  • Manually correct phase and baseline.
  • Calibrate spectrum to TSP at 0.0 ppm.
  • Perform binning (bucketing) of the spectra into 0.04 ppm regions for multivariate analysis.

Data Fusion and Analysis Protocol

The BOULS (Bucketing of Untargeted LCMS Spectra) approach provides a specialized workflow for fusing LC-HRMS data from different devices and timepoints, addressing the critical challenge of combining disparate datasets in routine analysis [3].

Data Preprocessing:

  • LC-HRMS Processing: Convert raw files to mzML format using MSConvert, applying the peak-picking filter to centroid the profile-mode data. Import the files into R using the Bioconductor package mzR.
  • NMR Processing: Process NMR data with established protocols including phase correction, baseline correction, and reference alignment.
  • Retention Time Alignment: For LC-HRMS data, align retention times using a centrally maintained reference spectrum to enable comparison across different analytical batches.
  • Three-Dimensional Bucketing: Divide LC-HRMS data into buckets across retention time, mass-to-charge ratio (m/z), and feature intensity dimensions. For NMR data, employ traditional spectral binning (typically 0.04 ppm regions).

Feature Integration and Model Building:

  • Data Fusion: Combine LC-HRMS buckets with NMR bins to create integrated feature matrix.
  • Normalization: Apply probabilistic quotient normalization to account for overall concentration differences.
  • Multivariate Analysis: Utilize Random Forest algorithm with 1000 trees for classification modeling, using out-of-bag error estimation for internal validation.
  • Model Validation: Employ independent test set validation (typically 70:30 training:test split) and cross-validation to assess model performance.
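
The feature integration and model building steps can be sketched with NumPy and scikit-learn as follows. This is a hedged illustration on synthetic data, not the BOULS implementation: the helper `pqn_normalize`, the simulated bucket tables, and the class effects are assumptions. The 1000 trees, out-of-bag estimate, and 70:30 split follow the list above, and PQN is applied to the fused matrix in the order the steps are listed (normalizing each block separately before concatenation is a plausible alternative).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def pqn_normalize(X):
    """Probabilistic quotient normalization of a positive intensity matrix."""
    X = np.asarray(X, dtype=float)
    # Step 1: total-intensity normalization of every sample (row)
    X = X / X.sum(axis=1, keepdims=True)
    # Step 2: reference profile = feature-wise median across samples
    reference = np.median(X, axis=0)
    usable = reference > 0
    # Step 3: median quotient per sample, interpreted as its dilution factor
    dilution = np.median(X[:, usable] / reference[usable], axis=1, keepdims=True)
    return X / dilution


# Synthetic stand-ins for LC-HRMS buckets and NMR bins (hypothetical data)
rng = np.random.default_rng(0)
n_per_class = 30
labels = np.repeat([0, 1], n_per_class)
lcms_buckets = rng.lognormal(mean=2.0, sigma=0.3, size=(2 * n_per_class, 150))
nmr_bins = rng.lognormal(mean=1.0, sigma=0.3, size=(2 * n_per_class, 60))
lcms_buckets[labels == 1, :10] *= 2.0  # class-dependent LC-HRMS features
nmr_bins[labels == 1, :5] *= 1.5       # class-dependent NMR features

# Fusion by column-wise concatenation, then PQN on the fused matrix
raw_fused = np.hstack([lcms_buckets, nmr_bins])
fused = pqn_normalize(raw_fused)

# 70:30 stratified split; Random Forest with 1000 trees and OOB estimate
X_train, X_test, y_train, y_test = train_test_split(
    fused, labels, test_size=0.3, stratify=labels, random_state=0
)
rf = RandomForestClassifier(n_estimators=1000, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print(f"OOB accuracy:  {rf.oob_score_:.2f}")
print(f"Test accuracy: {rf.score(X_test, y_test):.2f}")
```

The out-of-bag score gives an internal estimate from the training samples alone, while the held-out 30% provides the independent check described in the validation step.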

[Workflow diagram. Sample collection (honey) → LC-HRMS analysis and NMR analysis → data preprocessing (retention time alignment, 3D bucketing for LC-HRMS, spectral binning for NMR) → data fusion (feature matrix integration) → machine learning (Random Forest classification) → classification result (geographical origin).]

Data Fusion Workflow for Food Authentication

Mathematical Foundations of Data Fusion

Data fusion methodologies are underpinned by sophisticated mathematical frameworks that enable the integration of heterogeneous data sources. Two prominent approaches for fusing diverse datasets are Collective Matrix Factorization (CMF) and Coupled Matrix and Tensor Factorizations (CMTF) [12].

Collective Matrix Factorization (CMF)

CMF is a powerful data fusion technique based on joint matrix decomposition that simultaneously analyzes multiple datasets from diverse sources. The core concept involves factorizing multiple relation matrices that share one or more common modes, thereby revealing hidden or latent associations that might not be apparent when analyzing individual datasets separately [12].

Given two matrices \( X \in \mathbb{R}^{I \times J} \) and \( Y \in \mathbb{R}^{I \times K} \) that are coupled through a common mode, the CMF can be formally represented as:

\[ \min_{A,B,C} f(A,B,C) = \| X - AB^T \|_F^2 + \| Y - AC^T \|_F^2 \]

Where:

  • \( A \in \mathbb{R}^{I \times R} \) is the shared factor matrix of both \( X \) and \( Y \)
  • \( B \in \mathbb{R}^{J \times R} \) and \( C \in \mathbb{R}^{K \times R} \) are factor matrices specific to \( X \) and \( Y \), respectively
  • \( \| \cdot \|_F \) denotes the Frobenius norm
  • \( R \) represents the number of latent components

This formulation enables the transfer of information through the common mode between different matrices, with fused multiple data sources achieving higher accuracy than single data sources [12].
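
One simple way to minimize the CMF objective is alternating least squares, where each factor matrix is updated in closed form while the others are held fixed. The sketch below is an assumption-laden illustration, not the algorithm used in the cited work: the function name `cmf_als`, the small ridge term added for numerical stability, and the synthetic low-rank data are all introduced here.

```python
import numpy as np


def cmf_als(X, Y, rank, n_iter=50, ridge=1e-8, seed=0):
    """Alternating least squares for min ||X - A B^T||_F^2 + ||Y - A C^T||_F^2."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    A = rng.normal(size=(n_samples, rank))
    reg = ridge * np.eye(rank)
    for _ in range(n_iter):
        # With A fixed, B and C are independent least-squares problems
        gram_inv = np.linalg.inv(A.T @ A + reg)
        B = X.T @ A @ gram_inv
        C = Y.T @ A @ gram_inv
        # With B and C fixed: A (B^T B + C^T C) = X B + Y C
        A = (X @ B + Y @ C) @ np.linalg.inv(B.T @ B + C.T @ C + reg)
    return A, B, C


# Synthetic coupled matrices sharing the sample mode (hypothetical data)
rng = np.random.default_rng(1)
A_true = rng.normal(size=(30, 3))
X = A_true @ rng.normal(size=(50, 3)).T   # e.g. samples x LC-HRMS features
Y = A_true @ rng.normal(size=(20, 3)).T   # e.g. samples x NMR bins

A, B, C = cmf_als(X, Y, rank=3)
rel_err = (np.linalg.norm(X - A @ B.T) + np.linalg.norm(Y - A @ C.T)) / (
    np.linalg.norm(X) + np.linalg.norm(Y))
print(f"relative reconstruction error: {rel_err:.2e}")
```

The rows of the shared factor matrix A act as fused sample scores and could be passed to a classifier; on real, noisy data the two residual terms are often weighted so that neither platform dominates the fit.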

Multi-Kernel Learning for Heterogeneous Data Fusion

For heterogeneous data types that cannot be directly combined, multi-kernel learning schemes provide an effective fusion approach by transforming disparate data into a homogeneous kernel space where similarities can be meaningfully compared and combined [13].

The kernel transformation of information from each modality \( \phi_m \) results in a corresponding kernel Gram matrix \( K_m \). These may then be combined in a weighted manner as:

\[ \tilde{K}(i,j) = \sum_{m=1}^{M} \gamma_m K_m(i,j) \]

Where \( \gamma_m \) represents the weight assigned to modality \( m \), which can be optimized based on its relative importance or discriminative power. This kernel combination approach, particularly the Semi-Supervised Multi-Kernel (SeSMiK) method, has demonstrated superior performance in integrating imaging and non-imaging data for biomedical applications, and shows significant promise for food authentication challenges [13].
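
The weighted kernel sum itself is straightforward to compute. The sketch below illustrates only that combination step, not SeSMiK: the RBF kernel, the default bandwidth, the fixed weights, the rescaling of weights to sum to one, and the function names are assumptions chosen for the example.

```python
import numpy as np


def rbf_kernel_matrix(X, gamma=None):
    """Gram matrix K(i, j) = exp(-gamma * ||x_i - x_j||^2) for one modality."""
    X = np.asarray(X, dtype=float)
    if gamma is None:
        gamma = 1.0 / X.shape[1]
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def combine_kernels(kernels, weights):
    """Weighted sum of Gram matrices; non-negative weights keep the result
    positive semi-definite. Weights are rescaled to sum to one (a convention
    assumed here, not required by the formula)."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("kernel weights must be non-negative")
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))


# Hypothetical data: the same 25 samples measured on two platforms
rng = np.random.default_rng(0)
K_lcms = rbf_kernel_matrix(rng.normal(size=(25, 200)))
K_nmr = rbf_kernel_matrix(rng.normal(size=(25, 80)))
K_fused = combine_kernels([K_lcms, K_nmr], weights=[0.6, 0.4])
print(K_fused.shape)
```

The fused Gram matrix can be handed to any kernel method that accepts a precomputed kernel; in practice the weights would be tuned, for example by cross-validation.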

Table 2: Data Fusion Methodologies and Applications

| Fusion Method | Mathematical Basis | Advantages | Application in Food Analysis |
| --- | --- | --- | --- |
| Collective Matrix Factorization (CMF) | Joint matrix decomposition of coupled matrices | Information transfer through common modes; reveals latent associations | Integrating LC-HRMS data with sensory evaluation scores |
| Coupled Matrix Tensor Factorization (CMTF) | Joint analysis of matrices and tensors | Handles heterogeneous data orders; natural extension of CMF | Fusing multi-dimensional NMR data with compositional tables |
| Multi-Kernel Learning | Kernel space projections and combinations | Handles diverse data representations; optimal weighting | Combining spectral data with chromatographic fingerprints |
| Consensus Embedding | Ensemble of embeddings from feature subsets | Robust to noise and parameter selection | Geographic origin classification using multiple analytical techniques |

The Scientist's Toolkit: Essential Research Reagents and Materials

Successful implementation of data fusion strategies for food classification requires carefully selected reagents, materials, and computational tools. The following table details essential components for LC-HRMS and NMR data fusion experiments.

Table 3: Research Reagent Solutions for LC-HRMS NMR Data Fusion

| Item | Specification/Type | Function in Protocol |
| --- | --- | --- |
| Chromatography Columns | HILIC (Accucore-150-Amide-HILIC) and RP (Hypersil Gold C18) | Separation of polar and non-polar compounds respectively |
| LC-MS Grade Solvents | Acetonitrile, Water, Methanol with 0.1% Formic Acid | Mobile phase preparation; sample extraction |
| Internal Standards | Sorbic Acid, TSP (Trimethylsilylpropanoic acid) | Retention time alignment (LC-MS); chemical shift reference (NMR) |
| NMR Buffer | 0.2 M Sodium Phosphate Buffer in D₂O, pH 6.0 | Provides consistent pH and deuterium lock for NMR |
| Quality Control Materials | Certified Reference Materials (CRMs), Pooled Quality Control Samples | Monitoring instrument performance; batch-to-batch normalization |
| Data Processing Software | R packages (xcms, mzR), Python (scikit-learn), MATLAB | Data preprocessing, feature extraction, and model building |

[Classification diagram. Data fusion methods divide into matrix-based methods (Collective Matrix Factorization, CMF: homogeneous data integration), tensor-based methods (Coupled Matrix & Tensor Factorization, CMTF: heterogeneous data fusion), and kernel methods (Multi-Kernel Learning, SeSMiK: imaging and non-imaging data combination).]

Data Fusion Method Classification

Data fusion represents a paradigm shift in analytical chemistry for food classification, moving beyond single-technique approaches to integrated methodologies that leverage the complementary strengths of multiple analytical platforms. The fusion of LC-HRMS and NMR data, supported by robust mathematical frameworks such as collective matrix factorization and multi-kernel learning, creates synergistic effects that enhance classification accuracy, enable detection of novel adulteration patterns, and provide comprehensive product authentication.

For researchers and drug development professionals, implementing the protocols and methodologies outlined in this article requires careful attention to experimental design, data preprocessing consistency, and appropriate selection of fusion algorithms based on the specific characteristics of the data being integrated. As the field advances, the development of standardized data fusion workflows and validation frameworks will be crucial for widespread adoption in regulatory and quality control environments, ultimately strengthening global food supply chains and protecting consumer interests through more sophisticated authentication capabilities.

Untargeted metabolomics has emerged as a powerful analytical strategy for comprehensive food fingerprinting, enabling the simultaneous analysis of a wide range of small-molecule metabolites to verify authenticity, ensure quality, and detect adulteration [14]. This approach provides a snapshot of the metabolic activity in food products, reflecting factors such as geographical origin, raw material composition, and processing techniques [14] [15]. Within food authentication, untargeted metabolomics supports consumer protection by enabling strict inspection and enforcement of food labeling and by detecting deliberate adulteration that compromises food quality and safety [14].

The core principle of untargeted metabolomics lies in its ability to perform global analysis of all detectable analytes in a sample without prior knowledge of which metabolites will be detected [14]. This extensive nature makes it particularly valuable for uncovering emerging fraudulent practices in the food industry, as it can reveal unexpected compositional differences without targeting specific compounds [14]. When integrated with chemometric techniques, untargeted metabolomics can identify subtle patterns in complex data that serve as characteristic fingerprints for authentic products [3] [15].

Core Analytical Techniques and Instrumentation

Fundamental Analytical Platforms

The application of untargeted metabolomics in food fingerprinting primarily relies on two analytical platforms: mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectroscopy. Each platform offers distinct advantages that contribute complementary information to food authentication studies [16].

Liquid Chromatography-Mass Spectrometry (LC-MS), particularly high-resolution mass spectrometry (HRMS), provides exceptional sensitivity and a wide dynamic range for detecting metabolites at various concentration levels [3] [16]. The coupling with chromatography (liquid or gas) enables the separation of complex matrices, facilitating the detection and quantification of trace metabolites in food samples [16]. Common configurations include reverse-phase (RP) chromatography for non-polar compounds and hydrophilic interaction liquid chromatography (HILIC) for polar compounds, often analyzed in positive and negative ion modes respectively to maximize metabolite coverage [3].

NMR spectroscopy, while less sensitive than MS, offers significant advantages as a non-destructive technique that provides valuable structural elucidation and enables precise metabolite quantification without extensive sample preparation [16]. Proton NMR (1H-NMR) is particularly valuable for profiling major metabolites in food samples and generates highly reproducible data that can be compared across different instruments and laboratories over time [3].

Data Fusion Strategies for Enhanced Classification

The integration of data from multiple analytical platforms through data fusion (DF) strategies significantly enhances the classification power of untargeted metabolomics for food fingerprinting [1] [16]. Data fusion methodologies combine the complementary strengths of different techniques, such as LC-HRMS and 1H-NMR, to provide a more comprehensive view of the food metabolome than any single technique can achieve alone [16].

Table 1: Data Fusion Strategies in Untargeted Metabolomics

| Fusion Level | Description | Methodologies | Advantages | Limitations |
| --- | --- | --- | --- | --- |
| Low-Level | Direct concatenation of raw or pre-processed data matrices | PCA, PLS | Preserves all original information | High dimensionality; requires careful data scaling |
| Mid-Level | Concatenation of features extracted from individual datasets | PCA, PARAFAC, MCR-ALS | Reduces dimensionality; highlights relevant features | Potential loss of information during feature extraction |
| High-Level | Combination of model outputs or decisions | Bayesian consensus, majority voting | Flexible; can integrate heterogeneous models | Complex interpretation; may not exploit variable interactions |

Research demonstrates that data fusion approaches significantly improve predictive accuracy in food classification. A study on Amarone wine authentication achieved a lower classification error rate (7.52%) when using LC-HRMS and 1H NMR data fusion compared to individual techniques, with notable variations in amino acids, monosaccharides, and polyphenolic compounds during the withering process [1]. The limited correlation between datasets (RV-score = 16.4%) highlighted their complementarity, confirming the value of multi-platform approaches [1].
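The RV coefficient mentioned above measures how much of the sample-to-sample structure two data blocks share: a value near 0 means the blocks carry largely independent information, and 1 means identical configurations. The following is a minimal NumPy sketch of the standard RV definition. The matrices `lcms` and `nmr` are random placeholders, and whether the cited study computed its RV-score exactly this way is an assumption.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two blocks measured on the same samples (rows)."""
    # Column-center each block.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Sample-by-sample configuration matrices (n x n).
    Sx = Xc @ Xc.T
    Sy = Yc @ Yc.T
    # RV = tr(Sx Sy) / sqrt(tr(Sx Sx) * tr(Sy Sy)).
    # For symmetric matrices, tr(A B) equals the sum of elementwise products.
    numerator = np.sum(Sx * Sy)
    denominator = np.sqrt(np.sum(Sx * Sx) * np.sum(Sy * Sy))
    return numerator / denominator

# Placeholder data: 20 samples, two unrelated blocks (hypothetical sizes).
rng = np.random.default_rng(0)
lcms = rng.normal(size=(20, 50))
nmr = rng.normal(size=(20, 30))

print(f"RV(lcms, nmr)  = {rv_coefficient(lcms, nmr):.3f}")
print(f"RV(lcms, lcms) = {rv_coefficient(lcms, lcms):.3f}")
```

A low RV between platforms is not a defect: it indicates that fusing the blocks adds information rather than duplicating it.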

Experimental Protocols for Food Fingerprinting

Sample Preparation and Analysis

Proper sample preparation is critical for obtaining reliable metabolomic data. While specific protocols vary depending on the food matrix and analytical platform, the following general procedures apply to most food authentication studies:

Sample Extraction and Metabolite Isolation:

  • Homogenization of food samples to ensure representative sampling
  • Metabolite extraction using appropriate solvent systems (e.g., methanol-water or chloroform-methanol mixtures) to cover diverse metabolite classes
  • Centrifugation and filtration to remove particulate matter and macromolecules
  • Concentration or dilution to appropriate levels for analytical detection

LC-HRMS Analysis:

  • Chromatographic separation: Utilize both RP and HILIC methods to cover polar and non-polar metabolites [3]
  • Mass spectrometric detection: Employ high-resolution instruments such as Orbitrap or Q-TOF mass analyzers for accurate mass measurements [3] [15]
  • Data acquisition: Use both full-scan MS and data-dependent MS/MS acquisition to enable compound identification [3]
  • Quality control: Include pooled quality control samples (from all samples) throughout the sequence to monitor instrument stability

1H-NMR Analysis:

  • Sample preparation: Mix food extracts with deuterated solvent (e.g., D₂O or CD₃OD) containing a reference standard (e.g., TSP or DSS) for chemical shift calibration [16]
  • Data acquisition: Acquire spectra with sufficient scans to achieve adequate signal-to-noise ratio, using standard pulse sequences (e.g., NOESY-presat for water suppression) [16]
  • Temperature control: Maintain constant temperature during analysis to ensure chemical shift reproducibility [16]
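When deciding what "sufficient scans" means in practice, it helps to recall that signal-to-noise in signal-averaged NMR grows with the square root of the number of scans. The helper below is a small sketch built on that relationship; the function name and the example numbers are illustrative only.

```python
import math

def scans_for_target_snr(current_snr, current_scans, target_snr):
    """Scans needed to reach target_snr, assuming S/N scales with sqrt(scans)."""
    ratio = target_snr / current_snr
    # Required scans scale with the square of the desired S/N improvement.
    return math.ceil(current_scans * ratio ** 2)

# Doubling S/N requires four times as many scans.
print(scans_for_target_snr(current_snr=50, current_scans=64, target_snr=100))
```

Because acquisition time rises quadratically with the desired improvement, concentrating the extract or using a more sensitive probe is often more practical than simply adding scans.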

Data Processing Workflows

Raw data from analytical instruments require extensive processing to extract meaningful biological information. The workflow typically involves multiple steps:

Table 2: Key Data Preprocessing Steps in Untargeted Metabolomics

| Processing Step | Description | Common Tools/Approaches |
| --- | --- | --- |
| Feature Detection | Identification of chromatographic peaks and spectral features | XCMS, MS-DIAL, Progenesis QI |
| Retention Time Alignment | Correction of retention time shifts between samples | XCMS, MZmine |
| Missing Value Imputation | Handling of missing data points | K-nearest neighbors, minimum value replacement |
| Data Pretreatment | Scaling and transformation to enhance biological information | Pareto scaling, autoscaling, log transformation [17] |
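The imputation and pretreatment steps in the table can be sketched in a few lines of NumPy. This is a minimal illustration rather than a replacement for the listed tools: the function names, the `fraction` parameter, and the tiny example matrix are assumptions made for the example.

```python
import numpy as np

def impute_min(X, fraction=1.0):
    """Replace NaNs in each column with a fraction of that column's minimum."""
    X = np.array(X, dtype=float)          # copy, so the input is not modified
    col_min = np.nanmin(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fraction * col_min[cols]
    return X

def autoscale(X):
    """Mean-center and divide by the standard deviation (unit variance)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Mean-center and divide by the square root of the standard deviation."""
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

# Rows are samples, columns are features; one intensity is missing.
X_raw = np.array([[100.0, 4.0],
                  [np.nan, 6.0],
                  [400.0, 8.0]])

X = impute_min(X_raw)        # minimum value replacement
X_log = np.log10(X)          # log transformation
X_auto = autoscale(X_log)
X_par = pareto_scale(X_log)
print(X_auto.round(3))
print(X_par.round(3))
```

Autoscaling gives every feature equal weight, whereas Pareto scaling keeps part of the original intensity ranking, which can be preferable when low-abundance features are mostly noise.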

For LC-HRMS data, the BOULS (Bucketing of Untargeted LCMS Spectra) approach enables analysis of data obtained from different devices and times without reprocessing entire datasets [3]. This method uses a central spectrum for retention time alignment and implements three-dimensional bucketing (retention time, m/z, and intensity), allowing newly acquired spectra to be classified and added to training datasets efficiently [3].
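The idea of bucketing onto a fixed grid can be illustrated as follows. This is not the published BOULS implementation: it is a simplified sketch that sums peak intensities into fixed retention time and m/z buckets, and the bucket edges, peak values, and function name are all assumptions.

```python
import numpy as np

def bucket_spectrum(rt, mz, intensity, rt_edges, mz_edges):
    """Sum peak intensities into a fixed (retention time x m/z) grid, flattened."""
    grid, _, _ = np.histogram2d(rt, mz, bins=[rt_edges, mz_edges], weights=intensity)
    return grid.ravel()

# Fixed bucket edges shared by every run (hypothetical widths).
rt_edges = np.linspace(0.0, 10.0, 11)       # 10 buckets of 1 min
mz_edges = np.linspace(100.0, 1000.0, 10)   # 9 buckets of 100 m/z units

# Three detected peaks from one hypothetical run.
rt = np.array([1.2, 1.4, 7.9])
mz = np.array([150.05, 180.06, 455.29])
intensity = np.array([1.0e4, 2.0e4, 5.0e3])

vector = bucket_spectrum(rt, mz, intensity, rt_edges, mz_edges)
print(vector.shape, vector.sum())
```

Because the edges are fixed in advance, a spectrum acquired later maps onto the same feature vector layout and can be appended to an existing training set without re-running peak alignment across all samples.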

Statistical Analysis and Model Building

Chemometric analysis is essential for interpreting complex metabolomic data and building classification models:

  • Unsupervised methods such as Principal Component Analysis (PCA) explore natural clustering patterns without prior knowledge of sample classes [15]
  • Supervised methods including Partial Least Squares-Discriminant Analysis (PLS-DA) and Random Forest (RF) build predictive models using known class information to maximize separation between groups [3] [15]

Random Forest is particularly effective for food authentication, as it handles high-dimensional data well and provides variable importance measures for identifying discriminatory metabolites [3]. In honey authentication studies, RF models based on LC-HRMS data achieved 94% classification accuracy for 126 test samples from six different countries [3].

Experimental Workflow and Data Fusion Strategy

The following diagrams illustrate the core workflows and relationships in untargeted metabolomics for food fingerprinting.

[Workflow diagram: Food sample → sample preparation (homogenization, metabolite extraction, quality control) → analytical platforms (LC-HRMS analysis, NMR analysis) → data processing (raw data collection, feature detection, retention time alignment, data normalization, data pretreatment) → data fusion and modeling (data fusion strategies, multivariate analysis, model validation, biomarker identification).]

Untargeted Metabolomics Workflow for Food Fingerprinting

[Diagram: Data sources feed three parallel routes. Low-level fusion: matrix concatenation → inter-block scaling → multiblock PCA/PLS. Mid-level fusion: feature extraction → score concatenation → model building. High-level fusion: individual models → decision fusion → final prediction. All three routes end in classification results.]

Data Fusion Strategies for Enhanced Classification

Essential Research Reagents and Materials

Successful implementation of untargeted metabolomics for food fingerprinting requires specific reagents and analytical materials. The following table outlines key solutions and their functions:

Table 3: Essential Research Reagent Solutions for Untargeted Metabolomics

| Reagent/Material | Function | Application Notes |
| --- | --- | --- |
| LC-MS Grade Solvents (acetonitrile, methanol, water) | Mobile phase preparation; sample extraction | High purity minimizes background interference and ion suppression |
| Deuterated NMR Solvents (D₂O, CD₃OD) | NMR sample preparation; field-frequency locking | Provides deuterium signal for instrument locking; minimizes solvent background |
| Internal Standards (stable isotope-labeled compounds) | Quality control; quantification | Corrects for instrument variation; enables semi-quantitative analysis |
| Chemical Shift References (TSP, DSS) | NMR chemical shift calibration | Provides reference point (0 ppm) for spectral alignment |
| Ionization Additives (formic acid, ammonium acetate) | LC-MS mobile phase modifiers | Enhances ionization efficiency in positive/negative MS modes |
| Metabolite Extraction Solvents (chloroform, methanol, water) | Comprehensive metabolite extraction | Biphasic system extracts both polar and non-polar metabolites |

Applications in Food Authentication

Untargeted metabolomics has demonstrated significant utility across various food authentication applications:

Geographical Origin Verification

The geographical authentication of traditional foods represents a major application of untargeted metabolomics. Studies on products such as Pempek (traditional Indonesian fish cake) have successfully identified region-specific metabolite markers using HRMS-based approaches [15]. Similarly, research on honey authentication achieved high classification accuracy (94%) for geographical origin using LC-HRMS profiling combined with machine learning [3]. These approaches detect subtle variations in metabolic profiles resulting from differences in raw materials, soil composition, climate, and traditional production methods unique to each region [15].

Adulteration Detection

Untargeted metabolomics effectively identifies food adulteration through unexpected metabolic patterns. Common issues detected include:

  • Undeclared addition of water, sugar, acid, pulp wash, or peel extracts to fruit juices [14]
  • Oil blending without declaration, particularly addition of lower-quality oils to extra virgin olive oil [14]
  • Species substitution in meat and fish products [14]
  • Mislabelling of conventional products as organic [14]

The non-targeted nature of this approach is particularly valuable for detecting novel adulterants that may not be identified through targeted methods [3].

Processing Method Verification

Metabolomic fingerprints can distinguish processing techniques such as:

  • Withering time in Amarone wine production [1]
  • Yeast strain differentiation in fermentation processes [1]
  • Thermal processing methods (irradiation, freezing, microwave heating) [14]
  • Cheese production methods (raw vs. heat-treated milk) [14]

Untargeted metabolomics represents a powerful framework for comprehensive food fingerprinting, offering unprecedented capability to verify authenticity, detect adulteration, and ensure food quality. The integration of multiple analytical platforms through data fusion strategies significantly enhances classification power beyond the capabilities of individual techniques [1] [16]. As food fraud methods evolve, the untargeted nature of this approach provides a critical advantage in identifying emerging fraudulent practices without prior knowledge of specific adulterants [3].

The successful implementation of untargeted metabolomics requires careful attention to experimental design, sample preparation, data processing, and statistical modeling to generate robust classification models. With proper validation, these approaches can achieve high classification accuracy exceeding 90% for complex authentication challenges such as geographical origin determination [3]. As databases of authentic food fingerprints expand and analytical technologies advance, untargeted metabolomics is poised to play an increasingly vital role in global food authentication systems.

Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) spectroscopy are two cornerstone analytical techniques in modern foodomics. Their integration through data fusion strategies provides a powerful framework for addressing complex challenges in food authenticity, origin traceability, and quality control [1] [18] [19]. LC-HRMS offers exceptional sensitivity, enabling the detection and identification of numerous metabolites at low concentration levels, while NMR provides highly reproducible, quantitative data and unparalleled structural elucidation capabilities, distinguishing between isomers that are often indistinguishable by MS alone [18] [3]. The synergy created by fusing datasets from these platforms delivers a more comprehensive metabolic fingerprint of a food product than any single technique can achieve, significantly enhancing the accuracy of classification models [1] [20] [19].

Table 1: Comparison of LC-HRMS and NMR Spectroscopy in Food Metabolomics

| Feature | LC-HRMS | ¹H NMR |
| --- | --- | --- |
| Sensitivity | High (femtomole level) [18] | Low (microgram level) [18] |
| Selectivity | High | Moderate |
| Structural Information | Molecular formula, fragmentation patterns [18] | Direct information on functional groups and connectivity [18] |
| Quantitation | Semi-quantitative, suffers from matrix effects [18] | Inherently quantitative [18] |
| Sample Throughput | Moderate | High |
| Robustness & Reproducibility | Requires careful standardization [3] | Highly robust and reproducible [3] |
| Key Metabolites | Polyphenols, lipids, semi-polar compounds [1] | Amino acids, sugars, organic acids, polar metabolites [1] |

Application Note 1: Classification of Amarone Wine

Experimental Protocol

Objective: To classify Amarone wine samples based on grape withering time and yeast strain using a multi-omics data fusion approach [1].

Sample Preparation:

  • Wine Samples: Analyze 80 Amarone wine samples representing different withering times and yeast strains [1].
  • LC-HRMS Analysis:
    • Instrumentation: Utilize a UHPLC system coupled to a Q-Exactive Orbitrap mass spectrometer or equivalent.
    • Chromatography: Employ a reversed-phase C18 column (e.g., 2.1 x 100 mm, 1.7 µm). The mobile phase consists of (A) water and (B) acetonitrile, both with 0.1% formic acid.
    • Gradient: Use a linear gradient from 5% to 95% B over 25 minutes.
    • MS Parameters: Acquire data in both positive and negative ionization modes with a mass resolution of >70,000 full width at half maximum (FWHM) [3].
  • NMR Analysis:
    • Sample Preparation: Mix 300 µL of wine with 300 µL of phosphate buffer (pH 7.4) in D₂O containing 0.1% TSP (sodium 3-(trimethylsilyl)propionate-2,2,3,3-d₄) as an internal chemical shift reference [19].
    • Instrumentation: Conduct 1D ¹H NMR experiments on a 600 MHz spectrometer equipped with a cryoprobe.
    • Acquisition: Collect spectra at 25°C with a sufficient number of scans to achieve a high signal-to-noise ratio [1].

Data Processing and Fusion:

  • LC-HRMS Data: Pre-process raw data using software like XCMS for peak picking, alignment, and integration. Perform compound annotation using accurate mass and MS/MS spectra against databases [3].
  • NMR Data: Process free induction decays (FIDs) by applying Fourier transformation, phase correction, and baseline correction. Align spectra to TSP (δ 0.0 ppm) and bin the data (e.g., δ 0.04 ppm buckets) to reduce dimensionality [19].
  • Mid-Level Data Fusion: Export the normalized and scaled peak tables from LC-HRMS and NMR as separate data blocks. Fuse the datasets by concatenating the significant features (Variable Importance in Projection, VIP > 1.0) identified from preliminary multivariate analysis of each block [1] [21].
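The NMR binning step described above can be sketched with NumPy. The 0.04 ppm bucket width comes from the protocol. The spectral window (0.5 to 9.5 ppm), the total-area normalization, and the synthetic single-peak spectrum are assumptions made for illustration; real workflows usually also exclude the residual water region.

```python
import numpy as np

def bin_spectrum(ppm, intensity, width=0.04, ppm_min=0.5, ppm_max=9.5):
    """Integrate a 1D spectrum into fixed-width chemical-shift buckets."""
    n_bins = int(round((ppm_max - ppm_min) / width))
    edges = np.linspace(ppm_min, ppm_max, n_bins + 1)
    # Weighted histogram sums the intensities of all points inside each bucket.
    sums, _ = np.histogram(ppm, bins=edges, weights=intensity)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, sums

def normalize_total_area(binned):
    """Scale buckets so they sum to 1 (compensates for dilution differences)."""
    return binned / binned.sum()

# Synthetic spectrum: one Lorentzian peak at 3.0 ppm (placeholder data).
ppm = np.linspace(10.0, 0.0, 20001)
intensity = 1.0 / (1.0 + ((ppm - 3.0) / 0.005) ** 2)

centers, binned = bin_spectrum(ppm, intensity)
normalized = normalize_total_area(binned)
print(len(centers), centers[np.argmax(binned)].round(2))
```

Binning trades resolution for robustness: small chemical shift drifts between samples stay inside one bucket, and the variable count drops from tens of thousands of points to a few hundred.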

Multivariate Data Analysis:

  • Unsupervised Analysis: Perform Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) on the fused data matrix to explore natural clustering without a priori sample knowledge [1].
  • Supervised Analysis: Develop a classification model using Sparse Partial Least Squares-Discriminant Analysis (sPLS-DA) on the fused dataset to discriminate samples based on withering time and yeast strain. Validate the model using cross-validation and permutation tests [1].
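The validation logic (cross-validation followed by a permutation test) can be sketched independently of the classifier. In the example below a simple nearest-centroid classifier is swapped in for sPLS-DA to keep the code self-contained; the synthetic data, fold count, and number of permutations are assumptions, and the folds are not stratified.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the closest training centroid."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    dist = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def cv_accuracy(X, y, n_folds=5, seed=0):
    """k-fold cross-validated accuracy with shuffled (unstratified) folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    correct = 0
    for test_idx in np.array_split(idx, n_folds):
        train_idx = np.setdiff1d(idx, test_idx)
        pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
        correct += np.sum(pred == y[test_idx])
    return correct / len(y)

def permutation_p_value(X, y, n_perm=200, seed=0):
    """Compare observed CV accuracy with accuracies obtained on shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = cv_accuracy(X, y)
    null = np.array([cv_accuracy(X, rng.permutation(y)) for _ in range(n_perm)])
    p = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p

# Synthetic fused matrix: 2 classes, 10 features, 3 of them informative.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 10))
X[y == 1, :3] += 3.0

observed, p_value = permutation_p_value(X, y)
print(f"CV accuracy = {observed:.2f}, permutation p = {p_value:.3f}")
```

A model whose cross-validated accuracy is not clearly better than the shuffled-label distribution is likely fitting noise, which is a real risk when variables far outnumber samples.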

Key Findings and Significance

The data fusion approach successfully classified Amarone wines with a lower classification error rate (7.52%) compared to models built with individual techniques [1]. The multi-omics pseudo-eigenvalue space revealed a limited correlation between the LC-HRMS and NMR datasets (RV-score = 16.4%), underscoring their complementarity [1]. Significant variations in amino acids, monosaccharides, and polyphenolic compounds were identified as key discriminators for the withering time, providing a broader characterization of the wine metabolome [1].

[Workflow diagram: 80 Amarone wine samples → sample preparation (LC-HRMS: filter and dilute; NMR: mix with D₂O buffer) → data acquisition (LC-HRMS, reversed-phase, Orbitrap; ¹H NMR, 600 MHz, cryoprobe) → data processing (LC-HRMS: peak picking with XCMS and annotation; NMR: Fourier transform, binning, and alignment) → mid-level data fusion (concatenation of VIP features) → multivariate analysis (unsupervised MCR-ALS; supervised sPLS-DA classification model).]

Figure 1: Experimental workflow for the multi-omics analysis of Amarone wine, from sample preparation to data analysis [1] [19].

Application Note 2: Geographical Origin Authentication of Salmon

Experimental Protocol

Objective: To determine the geographical origin and production method (wild vs. farmed) of salmon using a mid-level data fusion strategy [20].

Sample Preparation:

  • Salmon Samples: Collect muscle tissue from 522 salmon samples of known provenance (e.g., Alaska, Norway, Iceland, Scotland) and production method [20].
  • REIMS Analysis (Lipidomics):
    • Instrumentation: Use a REIMS source coupled to a high-resolution mass spectrometer.
    • Procedure: Apply a bipolar radiofrequency to the salmon tissue to generate an aerosol of charged ions. This is typically done in conjunction with an electrosurgical knife for rapid analysis.
    • MS Parameters: Acquire mass spectra in negative ion mode over a mass range of m/z 150–1500 [20].
  • ICP-MS Analysis (Elemental Profile):
    • Sample Digestion: Accurately weigh ~0.2 g of dried salmon tissue. Digest with 5 mL of concentrated nitric acid using a closed-vessel microwave system.
    • Instrumentation: Analyze the digestate using an ICP-MS.
    • Parameters: Monitor relevant isotopes (e.g., ⁵⁵Mn, ⁷⁵As, ¹¹¹Cd, ²⁰⁸Pb) and use internal standards (e.g., Sc, Ge, Rh) for quantification [20].

Data Processing and Fusion:

  • REIMS Data: Normalize the spectral data to the total ion count. Perform peak picking and alignment.
  • ICP-MS Data: Normalize the elemental concentration data. Log-transform the data if necessary to achieve a normal distribution.
  • Mid-Level Data Fusion: From the REIMS data, select the top 18 significant lipid features (e.g., fatty acids, phospholipids). From the ICP-MS data, select the top 9 significant elemental markers. Combine these selected features into a single fused data matrix for subsequent modeling [20].
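The selection-then-concatenation step can be sketched as follows. The source does not state how the 18 lipid and 9 elemental features were ranked, so the one-way ANOVA F statistic used here is an assumption, as are the synthetic blocks `reims` and `icpms` and all function names.

```python
import numpy as np

def anova_f(X, y):
    """One-way ANOVA F statistic for every column of X across the classes in y."""
    classes = np.unique(y)
    n, k = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def top_k(X, y, k):
    """Column indices of the k features with the largest F statistic."""
    return np.sort(np.argsort(anova_f(X, y))[::-1][:k])

def fuse_selected(blocks, y, ks):
    """Select the top features from each block and concatenate them column-wise."""
    selected = [top_k(X, y, k) for X, k in zip(blocks, ks)]
    fused = np.hstack([X[:, idx] for X, idx in zip(blocks, selected)])
    return fused, selected

# Synthetic example: 4 origins, 10 samples each (placeholder data).
rng = np.random.default_rng(1)
y = np.repeat(np.arange(4), 10)
reims = rng.normal(size=(40, 200))
icpms = rng.normal(size=(40, 25))
reims[:, :5] += 3.0 * y[:, None]   # plant 5 informative lipid features
icpms[:, :3] += 3.0 * y[:, None]   # plant 3 informative elements

fused, selected = fuse_selected([reims, icpms], y, ks=[18, 9])
print(fused.shape)
```

To avoid optimistic bias, the selection should be repeated inside each cross-validation fold using training samples only; selecting on the full dataset first lets class information leak into the test folds.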

Multivariate Data Analysis:

  • Exploratory Analysis: Use Principal Component Analysis (PCA) on both individual and fused datasets to visualize natural clustering and identify outliers.
  • Supervised Modeling: Construct an Orthogonal Partial Least Squares-Discriminant Analysis (OPLS-DA) model on the fused dataset to classify salmon origin. Validate the model with cross-validation and a separate test set (e.g., n=17) [20].

Table 2: Key Metabolite and Elemental Markers for Salmon Authenticity

| Analytical Platform | Marker Class | Example Compounds/Markers | Role in Discrimination |
| --- | --- | --- | --- |
| REIMS (Lipidomics) | Unsaturated Fatty Acids [20] | C7H12O2 (m/z 127.0759), C15H28O2 (m/z 239.2011) | Differentiate regional diets & metabolism |
| REIMS (Lipidomics) | Diacylglycerophosphocholines [20] | GP0101 species | Indicate farming conditions & species |
| REIMS (Lipidomics) | Triacylglycerols [20] | GL0301 species | Reflect energy storage & nutritional status |
| ICP-MS (Elemental) | Trace Elements [22] [20] | Mn, As, Cd, Pb | Fingerprint of water & sediment geology |

Key Findings and Significance

The mid-level data fusion of REIMS and ICP-MS data achieved a cross-validation classification accuracy of 100% for salmon origin, a performance not attainable with single-platform methods [20]. All independent test samples (n=17) were correctly assigned to their geographical origin. The study identified 18 robust lipid markers and 9 elemental markers that provided strong evidence for the provenance of the salmon, demonstrating the power of this fused omics approach for high-stakes authenticity control in complex food supply chains [20].

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents and Materials for LC-HRMS and NMR Foodomics

| Item | Function/Application | Example/Note |
| --- | --- | --- |
| LC-HRMS Grade Solvents | Mobile phase preparation; minimizes background noise & ion suppression [18] | Acetonitrile, Methanol, Water (all LC-MS grade) |
| Deuterated NMR Solvents | Provide the deuterium field-frequency lock and minimize the solvent ¹H background; crucial for quantitative analysis [18] | D₂O, Deuterated Chloroform (CDCl₃), Deuterated Methanol (CD₃OD) |
| Internal Standards | Data normalization; correction for instrumental drift [18] [3] | Sorbic acid (for LC-HRMS) [3], TSP (for NMR) [18] |
| Solid Phase Extraction (SPE) Cartridges | Sample clean-up; metabolite enrichment/purification prior to analysis | C18, HLB, Ion Exchange phases |
| Chemical Reference Standards | Metabolite identification and confirmation; required for definitive annotation [19] | Forsythiaside A, Phillyrin [21] |

Standardized Protocol for LC-HRMS/NMR Data Fusion

Sample Preparation Workflow

  • Homogenization: For solid food matrices, homogenize the sample to a fine powder under liquid nitrogen to preserve metabolite integrity.
  • Metabolite Extraction:
    • Weigh an aliquot (e.g., 100 mg) of the homogenized sample.
    • Add a mixture of a polar solvent (e.g., methanol:water, 4:1 v/v) and a non-polar solvent (e.g., chloroform) for comprehensive metabolite extraction. Include appropriate internal standards at this stage.
    • Vortex vigorously, sonicate in an ice bath, and centrifuge to separate phases.
    • Collect the polar (upper) phase for LC-HRMS and NMR analysis, and the non-polar (lower) phase for lipidomics [19].
  • Preparation for LC-HRMS: Dilute the polar extract with LC-MS grade water, if necessary, and filter through a 0.22 µm membrane into an LC vial [3].
  • Preparation for NMR: Mix an aliquot of the polar extract with a phosphate buffer in D₂O (pH 7.4) containing TSP. Transfer to a standard 5 mm NMR tube [19].

Data Integration and Analysis Workflow

The logical flow for processing and integrating data from the two analytical platforms is outlined below. This workflow ensures that the complementary data are effectively combined for a robust classification model.

[Workflow diagram: Food sample → parallel analytical platforms (LC-HRMS, NMR) → raw spectra → LC-HRMS peak table (aligned, normalized) and NMR binned data (normalized) → mid-level data fusion (feature concatenation) → multivariate model (sPLS-DA, OPLS-DA) → classification and marker identification.]

Figure 2: Generic data processing and fusion workflow for food classification using LC-HRMS and NMR data [1] [19] [21].

Critical Steps and Troubleshooting

  • Mobile Phase Compatibility: For online LC-NMR-MS, the use of deuterated mobile phases is ideal but costly. A practical compromise is using deuterated water (D₂O) in the aqueous mobile phase to reduce solvent interference in NMR, while accepting a slight retention time shift compared to all-protonated LC-MS methods [18].
  • Sensitivity Balance: The inherent sensitivity difference between LC-HRMS and NMR is a major challenge. To mitigate this, consider using a cryoprobe-equipped NMR spectrometer or microcoil probes, or collect LC fractions off-line and concentrate the eluent before NMR analysis of low-abundance metabolites [18].
  • Data Quality Control: For LC-HRMS, analyze a pooled Quality Control (QC) sample throughout the run to monitor instrument stability and for data normalization [3]. For NMR, ensure consistent sample pH and temperature during acquisition to prevent chemical shift drift [19].
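One common way to use pooled QC injections is to compute the relative standard deviation (RSD) of every feature across them and discard features that are not reproducible. The sketch below assumes a 30% RSD cut-off, a widely used convention that is not specified in the source; the function names and the small matrix are illustrative.

```python
import numpy as np

def qc_rsd(qc_intensities):
    """Percent RSD of each feature (column) across pooled-QC injections (rows)."""
    mean = qc_intensities.mean(axis=0)
    sd = qc_intensities.std(axis=0, ddof=1)
    return 100.0 * sd / mean

def stable_features(qc_intensities, max_rsd=30.0):
    """Boolean mask of features whose QC RSD is at or below the threshold."""
    return qc_rsd(qc_intensities) <= max_rsd

# Three QC injections, two features (placeholder intensities).
qc = np.array([[100.0, 50.0],
               [102.0, 90.0],
               [98.0, 10.0]])

rsd = qc_rsd(qc)
mask = stable_features(qc)
print(rsd.round(1), mask)
```

The resulting mask can be applied to the sample peak table before fusion, so that unstable LC-HRMS features do not add noise to the combined matrix.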

From Theory to Practice: Implementing Data Fusion Strategies and Analytical Workflows

In the field of food authenticity and metabolomics, no single analytical technique can comprehensively capture the full complexity of a sample's chemical composition. Data fusion has emerged as a powerful multidisciplinary strategy that integrates datasets obtained from various independent analytical techniques to provide insights that surpass those achievable from any single approach [16] [23]. This integrated approach is particularly valuable in food classification research, where combining complementary data from techniques such as Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) spectroscopy provides a more holistic characterization of food samples [1] [24].

The fundamental principle behind data fusion is that different analytical techniques offer unique yet complementary information. For instance, LC-HRMS provides high sensitivity for detecting trace metabolites, while NMR offers robust structural elucidation and precise quantification capabilities [16]. When combined, these techniques enable researchers to build more robust classification models for determining geographical origin, production methods, and authenticity of various food products [1] [25] [26]. Studies have demonstrated that data fusion significantly enhances classification accuracy, with one review noting positive effects in 81% of food authenticity applications [23] [27].

Table 1: Comparison of Data Fusion Levels in Analytical Chemistry

| Fusion Level | Data Handling Approach | Key Advantages | Common Algorithms |
| --- | --- | --- | --- |
| Low-Level | Direct concatenation of raw or pre-processed data matrices | Preserves all original information; simple implementation | PCA, PLS [16] [23] |
| Mid-Level | Integration of extracted features from each dataset | Reduces dimensionality; removes noise | PCA, PLS, PARAFAC, MCR-ALS [16] [23] |
| High-Level | Combination of model outputs or decisions | Handles heterogeneous data well; reduces uncertainty | Bayesian methods, voting schemes, fuzzy aggregation [16] [23] |

The Three-Level Hierarchy of Data Fusion

Low-Level Data Fusion (LLDF)

Low-Level Data Fusion, also referred to as block concatenation, represents the most straightforward approach to data integration [16]. This method involves the direct concatenation of two or more data matrices originating from different analytical sources into a single, combined matrix for subsequent analysis [23] [27]. The fusion occurs at the most basic level of data representation, typically using raw or minimally pre-processed data from each technique.

The implementation of LLDF requires careful pre-processing to ensure meaningful integration, which can be divided into three critical stages: (1) correction of signal acquisition artefacts for each individual analytical platform; (2) equalization of contributions from different data blocks using methods such as mean centering or unit variance scaling; and (3) adjustment of weights assigned to each data block to account for differences in variance and dimensionality [16]. Without proper inter-block equalization, the analysis tends to be dominated by the data block with the greatest variance, potentially obscuring valuable information from other sources [16].
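Stages (2) and (3) can be sketched in NumPy as below. Stage (1) is platform-specific and is not shown. The block weighting used here, dividing each autoscaled block by the square root of its number of variables, is one common choice and an assumption for this example, as are the synthetic `lcms` and `nmr` matrices.

```python
import numpy as np

def autoscale(X):
    """Mean-center each variable and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def block_scale(X):
    """Autoscale, then divide by sqrt(n_variables) so the block's total variance is 1."""
    Z = autoscale(X)
    return Z / np.sqrt(Z.shape[1])

def low_level_fusion(blocks):
    """Concatenate equally weighted blocks column-wise (same samples in every block)."""
    return np.hstack([block_scale(X) for X in blocks])

# Placeholder blocks with very different scales and sizes.
rng = np.random.default_rng(0)
lcms = rng.normal(scale=1e5, size=(20, 500))   # many high-intensity features
nmr = rng.normal(scale=1.0, size=(20, 60))     # fewer, low-magnitude buckets

fused = low_level_fusion([lcms, nmr])
print(fused.shape)
print(fused[:, :500].var(axis=0, ddof=1).sum(), fused[:, 500:].var(axis=0, ddof=1).sum())
```

Without the block weighting, the 500-variable LC-HRMS block would contribute roughly eight times the variance of the 60-variable NMR block even after autoscaling, and would dominate any PCA or PLS model built on the fused matrix.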

The primary advantage of LLDF is its simplicity and preservation of all original data, making it particularly useful when the relationships between variables from different sources are important to maintain. However, this approach faces significant challenges, especially when dealing with high-dimensional data. The concatenated matrix often contains a vast number of variables, frequently exceeding the number of observations, which can lead to computational inefficiencies and model overfitting [16] [23]. Additionally, LLDF can amplify noise and include redundant information, potentially diluting the relevant chemical signals [23] [27].

[Diagram: LC-HRMS raw data and NMR raw data → data pre-processing → block scaling/normalization → concatenated data matrix → multivariate analysis → fused model.]

Mid-Level Data Fusion (MLDF)

Mid-Level Data Fusion addresses several limitations of low-level fusion by incorporating a feature extraction step before data integration [16] [23]. This two-stage methodology first reduces the dimensionality of each individual data matrix to extract the most meaningful features, then concatenates these selected features into a single matrix for subsequent analysis [16]. This approach significantly decreases data complexity while preserving the most relevant information from each analytical technique.

The feature extraction process typically employs dimensionality reduction techniques such as Principal Component Analysis (PCA), which transforms the original variables into a smaller set of principal components that capture the maximum variance in the data [16] [23]. For higher-order data structures, alternative factorization methods may be employed, including Parallel Factor Analysis (PARAFAC), Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS), or more recently developed approaches like Multimodal Multitask Matrix Factorization [16]. These techniques effectively distill the essential information from each data block while filtering out noise and irrelevant variables.
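PCA-based feature extraction followed by score concatenation can be sketched with a singular value decomposition. The number of components kept per block and the synthetic matrices are assumptions; in practice the component count is usually chosen from explained variance or cross-validation.

```python
import numpy as np

def pca_scores(X, n_components):
    """Return PCA scores and explained-variance ratios for a mean-centered block."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

def mid_level_fusion(blocks, n_components):
    """Concatenate the leading PCA scores of each block into one feature matrix."""
    return np.hstack([pca_scores(X, k)[0] for X, k in zip(blocks, n_components)])

# Placeholder blocks: 30 samples measured on two platforms.
rng = np.random.default_rng(0)
lcms = rng.normal(size=(30, 400))
nmr = rng.normal(size=(30, 80))

scores, explained = pca_scores(lcms, 3)
fused = mid_level_fusion([lcms, nmr], n_components=[3, 2])
print(fused.shape, explained.round(3))
```

The fused matrix here has only five columns instead of 480, which is the dimensionality reduction that makes mid-level fusion less prone to overfitting than direct concatenation.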

The advantages of MLDF are particularly evident in food classification applications. For example, in a study distinguishing green and ripe Forsythiae Fructus, mid-level fusion of LC-MS and HS-GC-MS data produced an OPLS-DA model with superior performance (R²Y = 0.986, Q² = 0.974) compared to models built from either technique alone [21]. Similarly, research on salmon authenticity demonstrated that mid-level fusion of REIMS and ICP-MS data achieved 100% classification accuracy for geographical origin and production method, a feat not possible with single-platform methods [25] [28]. This fusion approach also identified 18 lipid markers and 9 elemental markers that provided robust evidence of salmon provenance [25].

[Diagram: LC-HRMS data and NMR data → feature extraction (per block) → selected features → fused feature matrix → multivariate analysis → classification model.]

High-Level Data Fusion (HLDF)

High-Level Data Fusion, also known as decision-level fusion, represents the most complex approach in the data fusion hierarchy [16] [23]. Rather than integrating raw data or extracted features, HLDF combines the outputs or decisions from multiple models built on individual data blocks [23] [27]. This approach operates at the highest level of abstraction, aggregating predictions, classifications, or statistical measures from separate analyses to produce a consensus outcome with reduced uncertainty [16].

The implementation of HLDF involves building independent classification or regression models for each analytical technique and then combining their outputs using aggregation strategies. These may include heuristic rules, Bayesian consensus methods, fuzzy aggregation strategies, or majority voting schemes [16] [23]. A relevant example in food authenticity is the multiblock DD-SIMCA method, which combines full distances from individual models into a single cumulative metric known as the Cumulative Analytical Signal [16]. This strategy preserves interpretability while allowing the contribution of each data block to be traced in the final classification.
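Two of the simpler aggregation strategies, majority voting and averaging of class probabilities, can be sketched as follows. This is not the DD-SIMCA Cumulative Analytical Signal; the per-model predictions and probabilities are made-up placeholders, and the tie-breaking rule is an assumption.

```python
from collections import Counter

import numpy as np

def majority_vote(predictions):
    """Fuse per-model label lists; ties go to the earliest-listed model's label."""
    fused = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        best = max(counts.values())
        fused.append(next(v for v in votes if counts[v] == best))
    return fused

def average_probabilities(prob_blocks, classes):
    """Average class-probability matrices across models and pick the top class."""
    mean_prob = np.mean(prob_blocks, axis=0)
    labels = [classes[i] for i in mean_prob.argmax(axis=1)]
    return labels, mean_prob

# Hypothetical predictions for four samples from three single-platform models.
lcms_pred = ["A", "B", "B", "A"]
nmr_pred = ["A", "A", "B", "B"]
isotope_pred = ["A", "B", "A", "B"]
voted = majority_vote([lcms_pred, nmr_pred, isotope_pred])

# Hypothetical class probabilities (columns: A, B) for two samples.
lcms_prob = np.array([[0.9, 0.1], [0.4, 0.6]])
nmr_prob = np.array([[0.6, 0.4], [0.7, 0.3]])
soft_labels, mean_prob = average_probabilities([lcms_prob, nmr_prob], ["A", "B"])

print(voted, soft_labels)
```

Probability averaging keeps information about how confident each model is, so a strongly confident platform can outweigh a marginal one, whereas plain voting treats every model's decision equally.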

The primary advantage of HLDF is its ability to handle highly heterogeneous data from disparate analytical platforms that may differ in dimensionality, scale, and pre-processing requirements [16]. Since each data block is modeled separately, platform-specific characteristics can be addressed without compromising the integrity of individual analyses. Additionally, the final fusion step of HLDF typically requires fewer computational resources than low- and mid-level approaches [23]. However, this method may not fully exploit potential interactions between variables from different sources, and the fused decision can be harder to interpret than a single model [16].
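The aggregation step described above can be illustrated with a minimal sketch. It assumes that one classifier per platform has already produced either a class label or a set of class probabilities for a sample; the function names, the class labels, and the weights (for example, each model's cross-validated accuracy) are hypothetical and are not taken from the cited studies.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties resolve to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def weighted_probability_fusion(prob_blocks, weights):
    """Fuse per-model class probabilities with a weighted average.

    prob_blocks: list of dicts (one per model) mapping class label -> probability
    weights:     one non-negative weight per model (hypothetical, e.g. CV accuracy)
    """
    total = sum(weights)
    classes = prob_blocks[0].keys()
    fused = {
        c: sum(w * p[c] for w, p in zip(weights, prob_blocks)) / total
        for c in classes
    }
    # The consensus prediction is the class with the highest fused probability
    return max(fused, key=fused.get), fused


# Hypothetical outputs of two independent models for one sample
lcms_probs = {"origin_A": 0.70, "origin_B": 0.30}
nmr_probs = {"origin_A": 0.40, "origin_B": 0.60}

label, fused = weighted_probability_fusion([lcms_probs, nmr_probs], weights=[0.9, 0.8])
vote = majority_vote(["origin_A", "origin_B", "origin_A"])
```

Because the fused probability is a weighted sum of per-block terms, the contribution of each platform to the consensus remains traceable, which is the property the text attributes to decision-level schemes.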

Table 2: Data Fusion Applications in Food Authenticity Research

| Application Area | Analytical Techniques | Fusion Level | Key Findings | Reference |
|---|---|---|---|---|
| Amarone Wine Classification | LC-HRMS, ¹H NMR | Multi-omics integration | Improved predictive accuracy of wine fingerprint; identified amino acids, monosaccharides, polyphenolics | [1] |
| Salmon Origin Authentication | REIMS, ICP-MS | Mid-level | 100% classification accuracy; identified 18 lipid and 9 elemental markers | [25] [28] |
| Hazelnut Geographical Origin | ¹H NMR, LC-HRMS, BSIA | Supervised multivariate (DIABLO) | Minimum error rate for origin and cultivar classification | [26] |
| Forsythiae Fructus Maturity | LC-MS, HS-GC-MS | Mid-level | Superior model performance (R²Y = 0.986, Q² = 0.974) vs single techniques | [21] |

[Figure: High-level data fusion workflow. One individual model is built on the LC-HRMS data and another on the NMR data; the two model outputs are combined by decision fusion into a consensus prediction.]

Experimental Protocols for LC-HRMS and NMR Data Fusion

Sample Preparation Protocol

Materials and Reagents:

  • Methanol (HPLC grade)
  • Chloroform (HPLC grade)
  • Deuterated solvent (e.g., D₂O, CD₃OD)
  • Reference standard compounds (e.g., forsythiaside A, phillyrin)
  • NMR reference compound (e.g., TSP, DSS)

Sample Extraction Procedure:

  • Homogenization: Grind food samples (e.g., hazelnuts, salmon, grapes) to a fine powder using a laboratory mill [26].
  • Dual Extraction: Weigh 100 mg of homogenized material and perform parallel extractions:
    • For LC-HRMS: Extract with 1 mL methanol:water (80:20, v/v) by vortexing for 30 seconds, followed by sonication for 15 minutes at room temperature [26] [21].
    • For NMR: Prepare a separate aliquot extracted with 1 mL deuterated methanol:deuterated water buffer (80:20, v/v) containing 0.01% TSP as internal reference [1].
  • Centrifugation: Centrifuge both extracts at 14,000 × g for 10 minutes at 4°C to pellet insoluble material.
  • Storage: Transfer supernatants to clean vials and store at -80°C until analysis.

Instrumental Analysis Parameters

LC-HRMS Analysis:

  • Instrument Configuration: UPLC system coupled to Q/Orbitrap mass spectrometer [21]
  • Chromatography Column: C18 column (100 × 2.1 mm, 1.7 μm)
  • Mobile Phase:
    • A: 0.1% formic acid in water
    • B: 0.1% formic acid in acetonitrile
  • Gradient Program: 5% B to 95% B over 25 minutes, hold 5 minutes, re-equilibrate [21]
  • Mass Spectrometry:
    • Ionization Mode: Electrospray ionization (ESI) in positive and negative modes
    • Resolution: 70,000 full width at half maximum (FWHM)
    • Mass Range: m/z 100-1500
    • Collision Energies: Stepped energy (20, 40, 60 eV) for MS/MS fragmentation

NMR Spectroscopy:

  • Instrument Configuration: 600 MHz NMR spectrometer with cryoprobe [1]
  • Temperature: 298 K
  • Pulse Sequence: NOESYPR1D with water presaturation [1]
  • Spectral Width: 12 ppm
  • Relaxation Delay: 4 seconds
  • Acquisition Time: 2.5 seconds per scan
  • Number of Scans: 64-128 depending on sample concentration
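As a rough planning aid, the acquisition parameters listed above can be converted into a spectral width in Hz and an approximate experiment time. This is a sketch under simplifying assumptions: it counts only the relaxation delay and acquisition time per scan and ignores the NOESY mixing time, dummy scans, and instrument overhead, so real run times will be somewhat longer. The function names are illustrative.

```python
def spectral_width_hz(width_ppm, spectrometer_mhz):
    """Convert a 1H spectral width from ppm to Hz (1 ppm = spectrometer_mhz Hz)."""
    return width_ppm * spectrometer_mhz


def experiment_minutes(n_scans, acquisition_s, relaxation_delay_s):
    """Approximate run time: scans x (relaxation delay + acquisition time)."""
    return n_scans * (acquisition_s + relaxation_delay_s) / 60.0


sw_hz = spectral_width_hz(12, 600)        # 12 ppm at 600 MHz
t_64 = experiment_minutes(64, 2.5, 4.0)   # lower end of the scan range
t_128 = experiment_minutes(128, 2.5, 4.0) # upper end of the scan range

print(f"Spectral width: {sw_hz} Hz")
print(f"Approximate time: {t_64:.1f} min (64 scans), {t_128:.1f} min (128 scans)")
```

With these parameters each sample takes roughly 7 to 14 minutes of acquisition, which is useful when estimating instrument time for a full sample set.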

Data Pre-processing Workflow

LC-HRMS Data Processing:

  • Peak Detection and Alignment: Use software (e.g., XCMS, Progenesis QI) for peak picking, alignment, and integration [21].
  • Compound Identification: Match accurate mass and fragmentation patterns against databases (e.g., HMDB, LipidMaps) with Tier 1-4 confidence levels [25] [16].
  • Data Matrix Construction: Create a matrix with samples as rows and normalized peak intensities as columns.
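The accurate-mass matching step can be sketched as a parts-per-million (ppm) error calculation against a list of candidate ions. The 5 ppm tolerance, the candidate names, and the m/z values below are illustrative assumptions only; they are not taken from the cited protocols, and a real workflow would also compare MS/MS fragmentation patterns before assigning a confidence level.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def match_candidates(measured_mz, candidates, tol_ppm=5.0):
    """Return (name, ppm error) pairs within tolerance, smallest error first.

    candidates: dict mapping a candidate name to its theoretical ion m/z
    tol_ppm:    assumed tolerance; choose it from the instrument's mass accuracy
    """
    hits = [(name, ppm_error(measured_mz, mz)) for name, mz in candidates.items()]
    hits = [(name, err) for name, err in hits if abs(err) <= tol_ppm]
    return sorted(hits, key=lambda hit: abs(hit[1]))


# Hypothetical database entries (theoretical m/z of the ion, not the neutral mass)
database = {"candidate_A": 181.0707, "candidate_B": 181.0860}

hits = match_candidates(181.0710, database)
```

Here only `candidate_A` falls inside the tolerance window; `candidate_B` differs by tens of ppm and is rejected.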

NMR Data Processing:

  • Fourier Transformation: Process Free Induction Decay (FID) signals with exponential line broadening (0.3 Hz).
  • Phase and Baseline Correction: Apply manual phase correction and automated baseline correction.
  • Spectral Alignment: Reference spectra to internal standard (TSP at 0.0 ppm).
  • Bucketing: Divide spectra into regions (0.04 ppm buckets) and integrate area [1].
  • Data Matrix Construction: Create a matrix with samples as rows and bucket intensities as columns.
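The bucketing step can be illustrated with a short NumPy sketch. It assumes a processed spectrum is available as two arrays (chemical shift in ppm and intensity) and sums the intensities falling into each 0.04 ppm bucket, which approximates the integrated area on a uniformly sampled axis. The function name and the 0 to 10 ppm range are assumptions for illustration.

```python
import numpy as np


def bucket_spectrum(ppm, intensity, width=0.04, ppm_min=0.0, ppm_max=10.0):
    """Sum spectral intensities into equal-width chemical shift buckets."""
    ppm = np.asarray(ppm, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    n_buckets = int(round((ppm_max - ppm_min) / width))
    edges = np.linspace(ppm_min, ppm_max, n_buckets + 1)
    # Bucket index of each point; points outside [ppm_min, ppm_max) are dropped
    idx = np.digitize(ppm, edges) - 1
    inside = (idx >= 0) & (idx < n_buckets)
    buckets = np.bincount(idx[inside], weights=intensity[inside], minlength=n_buckets)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, buckets


# Tiny synthetic example: four data points instead of a full spectrum
ppm_axis = [0.01, 0.02, 0.05, 9.99]
signal = [1.0, 2.0, 3.0, 4.0]
centers, buckets = bucket_spectrum(ppm_axis, signal)
```

Stacking the `buckets` vectors of all samples row by row gives the samples-by-buckets matrix described in the last step. Regions such as residual solvent signals would normally be excluded before modeling.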

The Scientist's Toolkit: Essential Research Reagents and Materials

Table 3: Essential Research Reagents and Materials for LC-HRMS/NMR Data Fusion

| Category | Item | Specification | Function in Protocol |
|---|---|---|---|
| Chromatography | C18 UPLC Column | 100 × 2.1 mm, 1.7 μm particle size | Separation of complex metabolite mixtures prior to MS detection |
| MS Calibration | Reference Mass Solution | ESI-L Low Concentration Tuning Mix | Daily mass accuracy calibration of HRMS instrument |
| NMR Standards | Deuterated Solvents | D₂O, CD₃OD (99.9% deuterated) | Provides locking signal for NMR spectrometer; dissolution medium |
| NMR Reference | TSP (Trimethylsilylpropanoic acid) | Sodium salt, 98% purity | Chemical shift reference (0.0 ppm) and quantification standard |
| Extraction Solvents | Methanol, Chloroform | HPLC grade, ≥99.9% purity | Extraction of broad range of metabolites (polar to non-polar) |
| Mobile Phase | Formic Acid | LC-MS grade, ≥99.8% purity | Modifier for mobile phase to enhance ionization in ESI-MS |
| Quality Control | Reference Compounds | Forsythiaside A, phillyrin (>95% purity) | System suitability testing and quality control of analyses |

The hierarchical framework of data fusion—comprising low-level, mid-level, and high-level approaches—provides food scientists with a systematic methodology for integrating complementary data from LC-HRMS and NMR platforms. As demonstrated across numerous food authenticity applications, including wine classification [1], salmon origin verification [25], and hazelnut geographical tracing [26], data fusion consistently enhances classification accuracy and provides a more comprehensive chemical characterization than single-technique approaches. The selection of an appropriate fusion level depends on multiple factors, including data dimensionality, computational resources, and the specific research objectives. As the field continues to evolve, data fusion strategies will play an increasingly vital role in addressing complex challenges in food authenticity, quality control, and metabolomics research.

This application note provides a detailed protocol for generating and processing data from Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) spectroscopy for food classification studies. The integration of these two analytical techniques, a strategy known as data fusion, significantly enhances the comprehensiveness of food metabolome coverage and improves the accuracy of classifying samples based on attributes like geographical origin, production method, or processing techniques [1] [25]. This protocol is framed within a broader research context focused on authenticating Amarone wine, though the principles are applicable to a wide range of food commodities [1].

The workflow is structured into three critical phases: sample preparation, data acquisition, and data pre-processing. Standardized procedures in each phase are crucial for ensuring data quality, reproducibility, and the successful integration of the two complementary data streams. LC-HRMS offers high sensitivity for detecting a wide array of metabolites, while ¹H NMR provides a highly reproducible and quantitative overview of the main components [1]. Their fusion creates a powerful tool for food authenticity analysis.

Sample Preparation Protocols

Consistent and correct sample preparation is the foundation for obtaining high-quality analytical data. The following protocols are tailored for wine analysis but can be adapted for other liquid food matrices.

General Sample Handling

  • Initial Handling: For wine samples, ensure they are stored at 4°C prior to analysis. Allow samples to reach room temperature and mix by inversion to ensure homogeneity.
  • Filtration: Filter all samples through a 0.22 µm syringe filter (e.g., nylon or PVDF) to remove any particulate matter that could damage instrumentation or cause spectral distortions [29] [30].

LC-HRMS Sample Preparation

The goal for LC-HRMS is to remove non-volatile salts and proteins while concentrating metabolites.

  • Solid Phase Extraction (SPE):
    • Use reversed-phase SPE cartridges (e.g., C18, 100 mg/3 mL bed volume).
    • Conditioning: Sequentially pass 3 mL of methanol and 3 mL of acidified water (0.1% formic acid) through the cartridge.
    • Loading: Load 1 mL of the filtered wine sample.
    • Washing: Wash with 3 mL of acidified water (0.1% formic acid) to remove sugars, acids, and salts.
    • Elution: Elute metabolites using 2 mL of a methanol/acetonitrile (80:20, v/v) solution.
  • Post-Extraction Processing:
    • Evaporate the eluent to dryness under a gentle stream of nitrogen gas at 40°C.
    • Reconstitute the dry residue in 100 µL of a methanol/water (50:50, v/v) solution.
    • Transfer to a low-volume LC vial with insert for analysis [1].

¹H NMR Sample Preparation

The goal for NMR is to prepare a perfectly clear, particulate-free solution in a deuterated solvent in a high-quality NMR tube.

  • Sample Preparation:
    • Mix 400 µL of filtered wine with 200 µL of deuterated phosphate buffer (100 mM, pD 7.4). The buffer should contain 0.1 mM of TSP (3-(trimethylsilyl)-propionic acid-d4 sodium salt) as an internal chemical shift reference (δ = 0.0 ppm) for aqueous samples [29] [30].
    • Vortex the mixture for 30 seconds.
  • NMR Tube Preparation:
    • Use high-quality 5 mm NMR tubes (e.g., from Wilmad or Norell) suitable for at least 400 MHz frequency. Avoid disposable or scratched tubes [29] [30].
    • Transfer the 600 µL sample solution to the NMR tube using a Pasteur pipette, ensuring the filling height is approximately 4 cm.
    • Cap the tube securely and label it with a permanent marker directly on the tube or cap.
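The mixing ratio above fixes the final concentrations in the NMR tube, which is worth checking when the TSP signal is also used for quantification. The short calculation below assumes that the stated 0.1 mM TSP and 100 mM phosphate refer to the buffer stock before mixing; the function name is illustrative.

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Concentration after dilution: C_final = C_stock * V_stock / V_total."""
    return stock_conc * stock_volume_ul / total_volume_ul


wine_ul, buffer_ul = 400, 200
total_ul = wine_ul + buffer_ul                            # 600 µL in the tube

tsp_mm = final_concentration(0.1, buffer_ul, total_ul)       # TSP, mM
phosphate_mm = final_concentration(100, buffer_ul, total_ul)  # phosphate, mM
wine_fraction = wine_ul / total_ul                            # dilution of the wine

print(f"TSP: {tsp_mm:.3f} mM, phosphate: {phosphate_mm:.1f} mM, "
      f"wine fraction: {wine_fraction:.2f}")
```

Under this assumption the tube contains about 0.033 mM TSP and 33 mM phosphate, and the wine is diluted to two thirds of its original concentration.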

Table 1: Key Research Reagent Solutions for LC-HRMS/NMR Workflow

| Item | Function/Brief Explanation |
|---|---|
| C18 SPE Cartridges | To isolate and concentrate semi-polar and non-polar metabolites from the aqueous wine matrix for LC-HRMS analysis. |
| Deuterated Solvent (e.g., D₂O) | Provides a signal for the spectrometer's lock system and shimming, and creates an "invisible" background for ¹H NMR. |
| Internal Standard (TSP) | Provides a precise internal reference point (0.0 ppm) for chemical shift calibration in ¹H NMR spectra of aqueous solutions. |
| Deuterated Buffer (pD 7.4) | Maintains a constant pD for all samples, ensuring reproducibility of chemical shifts in NMR spectra. |
| High-Quality NMR Tubes | Precision tubes ensure magnetic field homogeneity, which is critical for achieving high-resolution NMR spectra. |

Data Acquisition Parameters

Standardized acquisition methods are vital for generating consistent and comparable datasets.

LC-HRMS Data Acquisition

  • Chromatography:
    • Column: Reversed-phase C18 column (e.g., 150 × 2.1 mm, 1.7 µm).
    • Mobile Phase A: Water with 0.1% formic acid.
    • Mobile Phase B: Acetonitrile with 0.1% formic acid.
    • Gradient: Use a linear gradient from 2% to 98% B over 25 minutes.
    • Flow Rate: 0.3 mL/min.
    • Column Temperature: 40°C.
    • Injection Volume: 5 µL.
  • Mass Spectrometry:
    • Ionization: Electrospray Ionization (ESI) in both positive and negative modes.
    • Mass Resolution: > 50,000 Full Width at Half Maximum (FWHM).
    • Mass Range: m/z 50-1200.
    • Collision Energy: Use data-dependent acquisition (DDA) with a stepped collision energy (e.g., 10, 30, 50 eV) to generate MS/MS spectra for metabolite annotation [1] [31].

¹H NMR Data Acquisition

  • Spectrometer: A 600 MHz NMR spectrometer is recommended for optimal spectral dispersion and sensitivity.
  • Probe: A triple-resonance inverse detection cryoprobe for enhanced sensitivity.
  • Key Acquisition Parameters:
    • Pulse Sequence: 1D NOESY-presat for water suppression.
    • Spectral Width: 14 ppm.
    • Relaxation Delay (D1): 4 seconds.
    • Number of Scans: 128.
    • Temperature: 298 K (25°C) [1].

Data Pre-processing Workflows

Pre-processing converts raw instrumental data into a structured data matrix suitable for statistical analysis and data fusion.

LC-HRMS Data Pre-processing

LC-HRMS data processing aims to detect metabolic features (defined by m/z and retention time (RT)) and align them across all samples.

  • Software: Tools like XCMS (in R), MS-DIAL, or MZmine 2 can be used [31].
  • Key Steps:
    • Format Conversion: Convert raw data files to an open format like .mzML.
    • Peak Picking: Identify chromatographic peaks with a signal-to-noise ratio above a set threshold.
    • Retention Time Alignment: Correct for minor RT shifts between runs (e.g., using Obiwarp or LOESS).
    • Grouping: Group corresponding peaks across samples.
    • Gap Filling: Re-integrate missing peaks in samples where they were initially undetected.
    • Annotation: Use in-house or public databases (e.g., HMDB, LipidMaps) to annotate features based on exact mass and MS/MS spectra [31].
  • Output: A peak intensity table (data matrix) where rows are samples, columns are metabolic features (m/z/RT pairs), and values are peak intensities.

¹H NMR Data Pre-processing

NMR data processing focuses on extracting quantitative spectral information.

  • Software: TopSpin (Bruker), Chenomx NMR Suite, or in-house scripts in MATLAB/R.
  • Key Steps:
    • Fourier Transformation: Convert the raw time-domain (FID) data to a frequency-domain spectrum.
    • Phasing & Baseline Correction: Ensure a flat baseline and purely absorptive peaks.
    • Referencing: Calibrate the chemical shift scale to the TSP peak at 0.0 ppm.
    • Spectral Bucketing (Binning): Reduce the complexity of the data by integrating the spectral intensity into consecutive, small regions (e.g., 0.04 ppm buckets). This minimizes the effects of small pH-induced shifts.
    • Normalization: Apply probabilistic quotient normalization (PQN) to correct for overall concentration differences between samples [1].
  • Output: A data matrix where rows are samples, columns are spectral buckets (chemical shift regions), and values are the integrated spectral intensities.
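The normalization step can be sketched in NumPy. This is a minimal version of probabilistic quotient normalization that assumes a non-negative samples-by-buckets matrix and uses the median spectrum as reference; the reference choice and the function name are assumptions, and the cited study may have used a different reference or software implementation.

```python
import numpy as np


def pqn_normalize(X, reference=None):
    """Probabilistic quotient normalization of a samples x buckets matrix."""
    X = np.asarray(X, dtype=float)
    # Step 1: total-area normalization of each spectrum
    X_tot = X / X.sum(axis=1, keepdims=True)
    # Step 2: reference spectrum (assumed here: median across samples)
    if reference is None:
        reference = np.median(X_tot, axis=0)
    # Step 3: quotients against the reference, skipping empty reference buckets
    keep = reference > 0
    quotients = X_tot[:, keep] / reference[keep]
    # Step 4: the median quotient per sample is its estimated dilution factor
    factors = np.median(quotients, axis=1)
    return X_tot / factors[:, None], factors


# Rows 1 and 2 differ only by dilution; row 3 has one strongly elevated bucket
X = np.array([
    [1.0, 2.0, 3.0, 4.0, 10.0],
    [2.0, 4.0, 6.0, 8.0, 20.0],
    [1.0, 2.0, 3.0, 4.0, 30.0],
])
X_norm, factors = pqn_normalize(X)
```

In this toy example total-area normalization alone would shrink the first four buckets of the third sample because of its single large peak; the median quotient corrects for that, so those buckets again match the first sample after normalization.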

Table 2: Summary of Pre-processing Steps and Objectives for LC-HRMS and ¹H NMR

| Technique | Key Pre-processing Steps | Primary Objective |
|---|---|---|
| LC-HRMS | Peak picking, RT alignment, gap filling, annotation. | Generate a comprehensive table of metabolite features (m/z, RT) and their relative abundances across all samples. |
| ¹H NMR | Phasing, baseline correction, chemical shift referencing, bucketing, normalization. | Generate a quantitative profile of the main components, resolved by chemical shift, that is comparable across all samples. |

Integrated Data Fusion and Analysis Workflow

The power of this approach lies in the fusion of the two distinct but complementary data blocks.

  • Data Fusion Strategy: A mid-level data fusion approach is recommended. In this strategy, the pre-processed data from LC-HRMS (peak table) and NMR (bucketed spectra) are first subjected to feature selection or extraction (e.g., using Multiblock Partial Least Squares - Discriminant Analysis, MB-PLS-DA). The most discriminant variables from each block are then concatenated into a single, fused data matrix [1] [25].
  • Multivariate Analysis: The fused matrix is analyzed using supervised methods like sparse PLS-DA (sPLS-DA) to build a classification model that predicts the class of unknown samples (e.g., short vs. long withering time for grapes). This approach has been shown to provide a broader characterization of the food metabolome and a lower classification error rate compared to using either technique alone [1].

Figure: Workflow for LC-HRMS/NMR Food Analysis. Starting from the food sample (e.g., wine), two parallel branches are followed. LC-HRMS branch: sample preparation (filtration, SPE concentration), acquisition, and pre-processing (peak picking, RT alignment, gap filling), yielding the LC-HRMS data block. NMR branch: sample preparation (buffer/TSP addition, filtration), acquisition, and pre-processing (phasing/baseline correction, bucketing, normalization), yielding the NMR data block. Both blocks enter mid-level data fusion and feature selection, followed by multivariate analysis (sPLS-DA model) and the final result: sample classification and biomarker identification.

Figure: Mid-Level Data Fusion Concept. The LC-HRMS data (peak intensity table) and the NMR data (bucketed spectra) each undergo feature selection (e.g., VIP from sPLS-DA). The selected LC-HRMS features and selected NMR features are combined into a fused data matrix, on which the final classification model (e.g., sPLS-DA on fused data) is built.

Chemometric and Machine Learning Models for Integrated Data (e.g., sPLS-DA, OPLS-DA, RF, DIABLO)

The pursuit of food authentication and quality control has entered a new era with the adoption of advanced analytical techniques such as Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) spectroscopy. These platforms generate complementary data profiles that, when integrated, provide a more comprehensive view of the food metabolome. LC-HRMS offers high sensitivity and is capable of detecting thousands of metabolomic features in complex matrices, while NMR provides structural elucidation capabilities and precise quantification despite its lower sensitivity [16] [31]. The synergy between these techniques has created unprecedented opportunities for food classification research, particularly when coupled with sophisticated chemometric and machine learning models designed to handle multi-platform datasets.

The challenge of integrating these diverse data streams has catalyzed the development of specialized computational approaches. Data fusion strategies, which systematically combine information from multiple analytical sources, have emerged as a powerful framework for leveraging the complementary strengths of LC-HRMS and NMR [16]. These strategies operate at different levels of abstraction—from raw data concatenation to model-level integration—each with distinct advantages for specific research contexts. Simultaneously, machine learning algorithms ranging from traditional partial least squares discriminant analysis to more advanced ensemble and deep learning methods have been adapted to handle the high-dimensional, multi-block data structures characteristic of integrated food metabolomics studies [32] [33]. This application note provides a comprehensive overview of these methodologies, with detailed protocols for implementing integrated data analysis pipelines in food classification research.

Analytical Platforms and Data Characteristics

LC-HRMS and NMR: Complementary Platforms

The integration of LC-HRMS and NMR data begins with recognizing their fundamental complementarity. LC-HRMS is particularly valued for its high sensitivity, capable of detecting and quantifying trace metabolites in complex food matrices. When coupled with chromatography, it enables the separation and analysis of thousands of compounds in a single run [16]. However, MS is inherently destructive, offers limited structural information, and can suffer from ionization suppression effects and limited reproducibility across platforms. Conversely, NMR spectroscopy is non-destructive, provides rich structural elucidation capabilities, and enables absolute quantification without the need for identical standards [16]. Its main limitation lies in relatively lower sensitivity compared to MS, typically restricting detection to the most abundant metabolites in a sample.

Recent applications demonstrate the power of combining these platforms. In one study, LC-HRMS and ¹H NMR profiling were applied to 80 Amarone wine samples to classify them based on grape withering time and yeast strain [1]. The multi-omics data fusion approach provided a much broader characterization of the wine metabolome than either technique alone, successfully identifying significant variations in amino acids, monosaccharides, and polyphenolic compounds throughout the withering process. The complementarity of the assays was evidenced by the limited correlation between the datasets (RV-score = 16.4%), suggesting each technique captured distinct aspects of the metabolic profile [1].
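For readers unfamiliar with the RV coefficient, the sketch below shows one common way to compute it between two data blocks measured on the same samples. The source does not state how its RV score was calculated, so this is the standard textbook definition applied to column-centered blocks, and the random matrices are placeholders for real LC-HRMS and NMR data.

```python
import numpy as np


def rv_coefficient(X, Y):
    """RV coefficient between two blocks with the same samples in the same row order."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xc = X - X.mean(axis=0)          # column-center each block
    Yc = Y - Y.mean(axis=0)
    Sx = Xc @ Xc.T                   # sample-by-sample configuration matrices
    Sy = Yc @ Yc.T
    numerator = np.trace(Sx @ Sy)
    denominator = np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
    return numerator / denominator


rng = np.random.default_rng(0)
lcms_block = rng.normal(size=(20, 50))   # placeholder: 20 samples x 50 features
nmr_block = rng.normal(size=(20, 8))     # placeholder: 20 samples x 8 buckets

rv_self = rv_coefficient(lcms_block, lcms_block)
rv_cross = rv_coefficient(lcms_block, nmr_block)
```

The coefficient ranges from 0 (no shared sample structure) to 1 (identical configurations up to rotation and scaling), so a low value between two platforms indicates that they describe largely different aspects of the samples.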

Data Preprocessing and Feature Extraction

Effective data integration requires careful preprocessing of both LC-HRMS and NMR data to ensure compatibility and maximize informational content. For LC-HRMS data, preprocessing typically involves converting vendor-specific raw data files to open formats, followed by peak detection, retention time alignment, and metabolite annotation [31]. Feature extraction methods vary in their approach, with comparative studies showing that methods like Region of Interest-Multivariate Curve Resolution (ROI-MCR) can provide more streamlined and interpretable datasets compared to traditional software like Compound Discoverer [34].

For NMR data, standard preprocessing includes Fourier transformation, phase and baseline correction, chemical shift alignment, and spectral binning to reduce dimensionality. Normalization and scaling are critical for both platforms to account for technical variations and make features comparable across samples and platforms [16]. The FAIR (Findability, Accessibility, Interoperability, and Reusability) principles provide a valuable framework for ensuring that data processing software and resulting datasets support reproducible research [31].

Table 1: Comparison of Analytical Platforms for Food Metabolomics

| Parameter | LC-HRMS | NMR |
|---|---|---|
| Sensitivity | High (detects trace metabolites) | Moderate (limited to abundant metabolites) |
| Structural Information | Limited (requires standards for confirmation) | Extensive (enables structural elucidation) |
| Quantitation | Relative (requires standards) | Absolute (without need for identical standards) |
| Destructive | Yes | No |
| Reproducibility | Platform-dependent, can be variable | High |
| Sample Throughput | Moderate (chromatography required) | High (minimal sample preparation) |
| Key Applications | Comprehensive metabolite profiling, biomarker discovery | Metabolic pathway analysis, structural validation |

Data Fusion Strategies for Integrated Analysis

Levels of Data Fusion

Data fusion strategies for integrating LC-HRMS and NMR data are typically classified into three distinct levels based on the stage at which integration occurs [16]. Each approach offers different trade-offs between informational content, complexity, and interpretability.

Low-level data fusion (LLDF) represents the most straightforward approach, involving the concatenation of raw or preprocessed data matrices from different analytical sources before model building [16]. This strategy requires careful pre-processing to correct for acquisition artefacts and equalize the contributions of each data block through techniques like mean centering or unit variance scaling. While LLDF preserves the maximum amount of original information, it can result in datasets where the number of variables far exceeds the number of observations, creating computational challenges and potentially diluting important signals with high-dimensional noise.
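A minimal sketch of low-level fusion is given below. It autoscales each variable (mean centering plus unit variance scaling) and then divides each block by the square root of its number of variables so that a block with thousands of LC-HRMS features does not dominate a block with a few hundred NMR buckets. This block weighting is one common convention chosen for illustration, not a step prescribed by the source, and the random matrices stand in for real data.

```python
import numpy as np


def autoscale(X):
    """Mean-center each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0              # leave constant columns at zero after centering
    return (X - X.mean(axis=0)) / std


def low_level_fusion(blocks):
    """Concatenate autoscaled blocks, each weighted to the same total variance."""
    scaled = []
    for X in blocks:
        Z = autoscale(X)
        scaled.append(Z / np.sqrt(Z.shape[1]))   # block weight 1/sqrt(n_variables)
    return np.hstack(scaled)


rng = np.random.default_rng(1)
lcms = rng.lognormal(size=(12, 300))   # placeholder peak table: 12 samples x 300 features
nmr = rng.normal(size=(12, 40))        # placeholder bucket table: 12 samples x 40 buckets

fused = low_level_fusion([lcms, nmr])  # shape (12, 340)
```

The fused matrix has far more columns than rows, which is exactly the situation the paragraph above warns about; any model fitted to it needs regularization or variable selection and careful validation.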

Mid-level data fusion (MLDF) addresses the dimensionality challenge by applying feature extraction to each data block separately before concatenating the reduced representations [16]. This two-step methodology first extracts the most informative features from each analytical platform using techniques like Principal Component Analysis (PCA), parallel factor analysis (PARAFAC), or multivariate curve resolution-alternating least squares (MCR-ALS), then combines these features into a single matrix for subsequent modeling. MLDF typically yields more robust and interpretable models while reducing computational complexity.
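The two-step procedure can be sketched with PCA as the feature extraction method. The example computes PCA scores for each block via singular value decomposition and concatenates them; the number of components per block is a hypothetical choice, the random matrices are placeholders, and in practice each block would first be pre-processed and scaled as described earlier.

```python
import numpy as np


def pca_scores(X, n_components):
    """Scores of the first n_components principal components (via SVD)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                         # PCA requires centered data
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def mid_level_fusion(blocks, n_components):
    """Extract PCA scores per block, then concatenate them column-wise."""
    return np.hstack([pca_scores(X, k) for X, k in zip(blocks, n_components)])


rng = np.random.default_rng(2)
lcms = rng.normal(size=(12, 300))   # placeholder LC-HRMS block
nmr = rng.normal(size=(12, 40))     # placeholder NMR block

fused_scores = mid_level_fusion([lcms, nmr], n_components=[3, 2])  # shape (12, 5)
```

The fused matrix has only five columns instead of 340, which is the dimensionality reduction that makes mid-level fusion attractive. Scores from different blocks can differ in scale, so they are often rescaled before the final model is built.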

High-level data fusion (HLDF), also known as decision-level fusion, represents the most complex approach, building separate models on each data platform and then combining their predictions [16]. This strategy can employ heuristic rules, Bayesian consensus methods, or fuzzy aggregation strategies to integrate model outputs. HLDF is particularly advantageous when integrating highly heterogeneous data types, as it allows each platform to be modeled using optimal algorithms and preprocessing strategies before integration.

Practical Implementation Considerations

The choice of fusion level depends on specific research objectives, data characteristics, and computational resources. LLDF is generally preferred when the goal is to fully exploit potential synergies between platforms and when sufficient samples are available relative to the total number of variables. MLDF offers a practical compromise that maintains much of the informational content while addressing dimensionality challenges. HLDF provides maximum flexibility for handling diverse data types and is particularly valuable in contexts where different analytical platforms naturally lend themselves to different modeling approaches.

In food authentication studies, the data fusion approach has demonstrated significant practical utility. For example, when classifying Amarone wines based on withering time and yeast strain, the fusion of LC-HRMS and ¹H NMR data provided a more comprehensive metabolic fingerprint than either technique alone, resulting in a lower classification error rate (7.52%) [1]. The supervised multi-block sPLS-DA model successfully handled the fused data structure and identified key discriminatory metabolites, including amino acids, monosaccharides, and polyphenolic compounds.

Machine Learning and Chemometric Models

The analysis of integrated LC-HRMS and NMR data employs diverse mathematical approaches that can be broadly categorized into traditional chemometric methods and modern machine learning algorithms. Each category offers distinct strengths for handling the high-dimensional, multi-block data structures characteristic of fused metabolomic datasets.

Traditional chemometric methods include techniques like Partial Least Squares-Discriminant Analysis (PLS-DA) and its variants, which are specifically designed to handle collinear variables and situations where the number of predictors exceeds the number of observations [35] [36]. These methods project data into latent variable spaces that maximize covariance between predictor blocks and response variables, making them particularly suitable for integrated data analysis.

Modern machine learning algorithms encompass a broader range of techniques including ensemble methods (Random Forests, XGBoost), support vector machines, and specialized neural architectures [32] [33]. These approaches typically offer greater flexibility in capturing complex nonlinear relationships but may require more careful tuning and validation to prevent overfitting.

Key Model Architectures

PLS-DA and sPLS-DA: PLS-DA is perhaps the most widely used chemometric technique in metabolomics, connecting two data matrices (raw data X and class membership Y) to optimize separation between sample classes [35]. Its sparse variant (sPLS-DA) incorporates feature selection to improve model interpretability and performance in high-dimensional settings. While powerful, PLS-DA can lead to overfitting when the number of variables significantly exceeds the number of samples, and may require a larger number of variables to achieve good prediction accuracy when only a few variables are truly responsible for class separation [35].

DIABLO: DIABLO (Data Integration Analysis for Biomarker discovery using Latent cOmponents) is a generalized multi-block extension of sPLS-DA specifically designed for integrated omics data analysis. It identifies correlated variables across multiple data platforms while maximizing discrimination between predefined classes, making it particularly valuable for food authentication studies requiring data fusion.

Random Forests (RF) and XGBoost: These ensemble methods create multiple decision trees and aggregate their predictions, offering robust performance for high-dimensional and nonlinear datasets [33]. Studies have demonstrated that XGBoost can achieve 100% classification accuracy on test sets when applied to features extracted using advanced preprocessing methods like KPIC2 [33]. The main disadvantages include computational complexity, reduced interpretability compared to linear methods, and potential overfitting with noisy datasets.

Novel Hybrid Approaches: Emerging methods like Primal-Dual for Classification with Rejection (PD-CR) represent innovative approaches that simultaneously optimize feature selection and prediction accuracy while providing a confidence score for each prediction [35]. This capability for "classification with rejection" is particularly valuable in food authentication contexts where reducing false positives and false negatives is critical.

Table 2: Comparison of Machine Learning Models for Integrated Data

| Model | Key Characteristics | Advantages | Limitations | Best Suited Applications |
|---|---|---|---|---|
| PLS-DA/sPLS-DA | Linear projection maximizing covariance between X and Y | Handles multicollinearity; provides interpretable loadings | Prone to overfitting with many irrelevant variables; requires careful validation | Initial exploratory analysis; linear discrimination problems |
| DIABLO | Multi-block extension of sPLS-DA | Identifies correlated variables across platforms; designed for data integration | Complex parameter tuning; requires balanced block structure | Multi-platform data integration; biomarker discovery |
| Random Forests | Ensemble of decision trees using bagging | Robust to outliers; handles nonlinear relationships | Low interpretability; computationally intensive with many trees | Complex classification tasks; variable importance assessment |
| XGBoost | Gradient boosting with regularization | High accuracy; built-in feature selection | Extensive parameter tuning; can overfit without proper validation | High-performance classification; large datasets |
| PD-CR | Primal-dual with rejection option | Confidence scores for predictions; reduces false discoveries | Emerging method with limited implementation | Clinical decision support; high-stakes classification |

Application Notes and Protocols

Integrated Workflow for Food Classification

The following protocol outlines a comprehensive workflow for food classification using integrated LC-HRMS and NMR data, incorporating data fusion and machine learning analysis.

Diagram 1: Integrated analysis workflow for food classification. Sample collection (food matrices) is followed by parallel LC-HRMS analysis and NMR analysis; both data streams then pass through data preprocessing, feature extraction, data fusion, model building, and finally validation and interpretation.

Protocol 1: Food Authentication Using Multi-Block Data Fusion

Objective: To authenticate the geographical origin of apples using integrated LC-HRMS and NMR data through multi-block data fusion and classification models.

Materials and Reagents:

  • Apple samples from different geographical regions
  • LC-MS grade solvents (methanol, acetonitrile, water)
  • NMR buffer (e.g., phosphate buffer in D₂O)
  • Internal standards for quantification

Experimental Procedure:

  • Sample Preparation:

    • Homogenize apple samples using a frozen milling procedure
    • For LC-HRMS: Extract metabolites using 80% methanol, centrifuge, and collect supernatant
    • For NMR: Prepare extracts in NMR buffer with TSP as internal reference
  • Data Acquisition:

    • LC-HRMS Parameters:
      • Column: C18 column (100 × 2.1 mm, 1.8 μm)
      • Mobile phase: Water (A) and acetonitrile (B) with 0.1% formic acid
      • Gradient: 5-95% B over 20 minutes
      • Mass spectrometer: High-resolution instrument (Q-TOF or Orbitrap) in positive and negative ionization modes
    • NMR Parameters:
      • Spectrometer: High-field NMR (≥600 MHz)
      • Experiment: ¹H NMR with water suppression
      • Temperature: 298 K
      • Scans: 64-128
  • Data Preprocessing:

    • LC-HRMS Data:
      • Convert raw files to open formats (mzML, mzXML)
      • Perform peak picking, retention time alignment, and gap filling using software such as XCMS, MZmine, or KPIC2
      • Annotate metabolites using in-house databases and computational prediction
    • NMR Data:
      • Apply Fourier transformation, phase and baseline correction
      • Perform chemical shift alignment and reference to TSP (0.0 ppm)
      • Reduce dimensionality through spectral binning (0.01-0.04 ppm buckets)
      • Remove regions containing solvent residues
  • Data Fusion and Model Building:

    • Normalize datasets using appropriate methods (probabilistic quotient normalization, total area normalization)
    • Apply mid-level data fusion by extracting principal components from each platform and concatenating scores
    • Build classification models using sPLS-DA, DIABLO, or Random Forests
    • Implement cross-validation to optimize model parameters and prevent overfitting
    • Validate models using external validation sets or double cross-validation
  • Interpretation:

    • Identify key discriminatory features through model loadings and variable importance measures
    • Conduct metabolic pathway analysis to interpret biological significance
    • Validate putative biomarkers using authentic standards where available
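
The normalization and mid-level fusion steps of this protocol can be sketched in a few lines. The following is a minimal illustration, assuming NumPy is available; the function names, the synthetic blocks, and the choice of three components per platform are hypothetical and not taken from the cited studies.

```python
import numpy as np

def pqn_normalize(X):
    """Probabilistic quotient normalization (PQN) of a samples x features block."""
    X = np.asarray(X, dtype=float)
    X = X / X.sum(axis=1, keepdims=True)        # total-area normalization first
    reference = np.median(X, axis=0)            # median profile as reference
    keep = reference > 0                        # skip features with a zero reference
    quotients = X[:, keep] / reference[keep]
    dilution = np.median(quotients, axis=1, keepdims=True)
    return X / dilution

def pca_scores(X, n_components):
    """Scores of the leading principal components of a mean-centered block."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def mid_level_fuse(blocks, n_components=3):
    """Concatenate per-block PCA scores into one fused matrix."""
    return np.hstack([pca_scores(pqn_normalize(b), n_components) for b in blocks])

# Hypothetical data: 12 samples, 200 LC-HRMS features, 50 NMR buckets.
rng = np.random.default_rng(0)
lcms_block = rng.random((12, 200)) + 0.1
nmr_block = rng.random((12, 50)) + 0.1
fused = mid_level_fuse([lcms_block, nmr_block], n_components=3)
print(fused.shape)  # (12, 6)
```

For brevity the sketch fits the PQN reference and the PCA on all samples at once. In a cross-validated workflow both should be estimated on the training folds only and then applied to the held-out samples, otherwise the error estimate is optimistic.
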
Protocol 2: Wine Classification Based on Production Parameters

Objective: To classify Amarone wines based on grape withering time and yeast strain using LC-HRMS and ¹H NMR data fusion [1].

Specific Modifications to General Protocol:

  • Experimental Design:

    • Include multiple batches representing different withering times (e.g., short, medium, long)
    • Incorporate different yeast strains used in fermentation
    • Maintain consistent sample preparation across all conditions
  • Data Analysis:

    • Apply both unsupervised (Multiple Co-inertia Analysis - MCIA) and supervised (sPLS-DA) multi-omics data fusion approaches
    • Use the multi-omics pseudo-eigenvalue space to assess correlation between datasets
    • Focus on identifying variations in amino acids, monosaccharides, and polyphenolic compounds
  • Validation:

    • Assess classification error rate using cross-validation
    • Compare performance against models built on individual platforms
    • Verify biological relevance of identified metabolite markers through literature mining

The Scientist's Toolkit

Essential Research Reagents and Software Solutions

Table 3: Essential Resources for Integrated LC-HRMS and NMR Analysis

Category | Item | Function/Purpose | Examples/Alternatives
Analytical Instruments | LC-HRMS System | High-resolution metabolite separation and detection | Q-TOF, Orbitrap systems
Analytical Instruments | NMR Spectrometer | Structural elucidation and quantification | High-field (≥600 MHz) with cryoprobes
Data Processing Software | XCMS | LC-MS data preprocessing and feature detection | Open-source R package
Data Processing Software | MZmine | Modular LC-MS data processing platform | Open-source with visualization tools
Data Processing Software | NMR Processing Software | Spectral processing and analysis | TopSpin, Chenomx, MNova
Statistical Analysis | R/Python | Statistical computing and machine learning | Extensive package ecosystems (mixOmics, scikit-learn)
Statistical Analysis | SIMCA-P | Commercial chemometrics software | User-friendly PLS-DA, OPLS-DA implementation
Statistical Analysis | MetaboAnalyst | Web-based metabolomics analysis platform | Accessible for non-programmers
Data Fusion Tools | mixOmics | Multi-omics data integration | Implements DIABLO, sPLS-DA, multiblock analyses
Data Fusion Tools | ROIMCR | Feature extraction for LC-MS data | Region of Interest approach for data compression

The integration of LC-HRMS and NMR data through advanced chemometric and machine learning models represents a powerful paradigm for food classification research. Data fusion strategies at multiple levels enable researchers to leverage the complementary strengths of these analytical platforms, providing more comprehensive metabolic fingerprints than either approach alone. The continuous development of specialized algorithms like DIABLO and sPLS-DA, coupled with robust validation frameworks, addresses the unique challenges of high-dimensional, multi-block data structures.

Future directions in this field will likely focus on several key areas. Explainable artificial intelligence (XAI) tools such as Shapley Additive Explanations (SHAP) are increasingly important for improving model transparency by linking predictions to underlying features [32]. Dynamic flavor modulation through innovations like attention mechanisms, graph neural networks, and digital twins represents another frontier, particularly for food quality applications [32]. Additionally, the adoption of FAIR principles for research software promises to enhance reproducibility and transparency in metabolomics data processing [31]. As these computational approaches mature alongside analytical technologies, integrated data analysis will continue to transform food authentication, quality control, and metabolic phenotyping research.

The authentication and classification of high-value wines like Amarone della Valpolicella are critical challenges in food chemistry. This case study details the application of a multi-omics data fusion approach, integrating Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) and Proton Nuclear Magnetic Resonance (¹H NMR) metabolomics, to classify Amarone wines based on two key production parameters: grape withering time and yeast strain used in fermentation [1] [37]. The research was framed within a broader thesis on leveraging complementary analytical techniques to achieve superior food classification. By fusing data from these two platforms, the study characterized the wine metabolome more comprehensively than either technique alone and achieved a classification error rate of 7.52% [1] [37].

Experimental Design & Workflow

The experimental design employed an untargeted metabolomics strategy to profile 80 Amarone wine samples differing in grape withering time and commercial yeast strain [1] [38]. The workflow encompassed sample preparation, instrumental analysis via two complementary techniques, and data integration and analysis.

Table: Key Experimental Factors for Amarone Wine Classification

Factor | Description | Role in Classification
Withering Time | Post-harvest drying process for grapes | A key process affecting sugar and amino acid concentration, significantly altering the metabolic profile of the final wine [1] [39].
Yeast Strain | Commercial Saccharomyces cerevisiae strains | Different strains modulate fermentation, leading to distinct profiles of amino acids, monosaccharides, and polyphenolic compounds [1] [37].

Workflow: 80 Amarone Wine Samples → Sample Preparation → LC-HRMS Analysis and ¹H NMR Analysis (in parallel) → Data Pre-processing → Data Fusion (MCIA) → Supervised Modeling (sPLS-DA) → Classification & Marker ID

Detailed Methodologies & Protocols

Sample Preparation Protocol

A. LC-HRMS Analysis:

  • Extraction: Wine samples were thawed and extracted in triplicate. An 850 μL aliquot of wine was mixed with 850 μL of acetonitrile acidified with 1% (v/v) formic acid [38].
  • Processing: The mixture was sonicated for 10 minutes and then centrifuged at 12,000 × g at 4°C for 15 minutes. The supernatant was filtered through 0.22 μm cellulose syringe filters into amber vials for analysis [38].
  • Quality Control (QC): A QC sample was prepared by pooling 10 μL from each of the 80 wine samples. This QC was injected at the beginning of the sequence, after every ten samples, and at the end to monitor instrument stability [38].
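
The QC injection pattern described above (start of sequence, after every ten samples, end of sequence) can be generated programmatically. This is a minimal sketch; the function name and sample labels are hypothetical, and the run order is taken as given, whereas in practice the sample order would usually be randomized first.

```python
def build_injection_sequence(sample_ids, qc_every=10, qc_label="QC"):
    """Interleave pooled-QC injections with samples."""
    sequence = [qc_label]                      # QC at the start of the run
    for i, sample in enumerate(sample_ids, start=1):
        sequence.append(sample)
        if i % qc_every == 0:
            sequence.append(qc_label)          # QC after every block of samples
    if sequence[-1] != qc_label:
        sequence.append(qc_label)              # closing QC if the last block was partial
    return sequence

# Hypothetical labels for the 80 wine samples.
samples = [f"wine_{i:02d}" for i in range(1, 81)]
sequence = build_injection_sequence(samples)
print(len(sequence), sequence.count("QC"))  # 89 9
```

With 80 samples the QC after the eighth block doubles as the end-of-sequence QC, so nine QC injections are scheduled in total.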

B. 1H NMR Analysis:

  • Preparation: Wine samples were filtered or centrifuged to remove solid residues. A buffer solution (e.g., phosphate buffer, pH 3.0-3.2) was added to minimize pH-induced chemical shift variations between samples [40].
  • Internal Standard: A reference compound, typically 3-(trimethylsilyl)propionic-2,2,3,3-d4 acid sodium salt (TSP) or 2,2-dimethyl-2-silapentane-5-sulfonate-d4 sodium salt (DSS), was added for chemical shift calibration and, in some protocols, quantification [40].
  • Deuteration: The sample was mixed with at least 10% deuterated solvent (D₂O) to provide a field frequency lock for the NMR spectrometer [40].

Instrumental Analysis Parameters

Table: Instrumental Configuration for LC-HRMS and 1H NMR

Parameter | LC-HRMS Protocol | ¹H NMR Protocol
Instrumentation | UHPLC pump (Vanquish) coupled to Q-Exactive Orbitrap MS [38] | 400-600 MHz NMR spectrometer [40]
Chromatography / Acquisition | BEH C18 column (2.1 × 100 mm, 1.7 μm); Gradient: 6% to 94% B in 32 min; Flow: 200 μL/min [38] | Standard single-pulse ¹H NMR sequence with water suppression [40]
Mass Detection / Spectral Width | Full scan MS (m/z 100-1200); Resolution: 70,000 FWHM; Polarity: ESI+ [38] | Spectral width typically 10-12 ppm [40]
Data Pre-processing | MS-DIAL software (v.4.80) for peak picking, alignment, and annotation using FooDB [38] | Phasing, baseline correction, chemical shift alignment (e.g., to TSP at 0 ppm), and binning (e.g., 0.04 ppm) [40]

Data Fusion and Statistical Analysis

  • Multi-Block Data Integration: The processed datasets from LC-HRMS and 1H NMR were integrated using Multiple Co-inertia Analysis (MCIA), an unsupervised method that explores the joint structure of multiple datasets [1] [37].
  • Supervised Classification: Sparse Partial Least Squares-Discriminant Analysis (sPLS-DA) was applied to the fused data to build a predictive model for classifying wines based on the defined factors (withering time and yeast strain) [1] [37]. This method also identifies key variables (metabolites) that most contribute to the classification.
  • Model Validation: The performance of the sPLS-DA model was evaluated based on its classification error rate, which was reported to be 7.52% [1] [37].

Key Findings and Data Interpretation

The data fusion strategy proved highly effective in classifying the Amarone wine samples and providing a broad characterization of their metabolome.

  • Complementary Techniques: The MCIA revealed a limited correlation between the LC-HRMS and 1H NMR datasets (RV-score = 16.4%), confirming their complementarity. Each technique captured unique aspects of the wine's chemical composition, and their integration provided a more complete fingerprint [1] [37].
  • Classification Performance: The sPLS-DA model successfully classified wines according to both withering time and yeast strain. The data fusion approach yielded a lower classification error rate (7.52%) compared to models built on individual datasets, enhancing predictive accuracy [1] [37].
  • Discriminatory Metabolites: The study identified several key metabolite classes that were significantly altered by the withering process and yeast activity, including amino acids, monosaccharides, and polyphenolic compounds [1] [37].
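
The RV coefficient quoted above measures how similar the sample configurations of two data blocks are, on a scale from 0 (no shared structure) to 1 (identical configuration up to rotation and scaling). The sketch below uses the textbook definition on column-centered blocks and assumes NumPy; the data are synthetic, and MCIA software may apply its own table normalization before computing RV, so values need not match a given package exactly.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two blocks measured on the same samples (rows)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sx = Xc @ Xc.T                      # sample-by-sample cross-product matrices
    Sy = Yc @ Yc.T
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

# Hypothetical blocks: 20 samples, 100 LC-HRMS features and 30 NMR buckets.
rng = np.random.default_rng(1)
lcms_block = rng.normal(size=(20, 100))
nmr_block = rng.normal(size=(20, 30))
print(round(rv_coefficient(lcms_block, lcms_block), 3))  # 1.0
rv = rv_coefficient(lcms_block, nmr_block)               # between 0 and 1
```

A low RV between platforms, as reported for the wine data, indicates that the blocks carry largely non-redundant information, which is the situation in which fusion is most likely to help.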

Table: Key Metabolite Changes Driving Amarone Wine Classification

Metabolite Class | Influence of Withering | Influence of Yeast Strain | Potential Role in Wine Profile
Amino Acids | Concentration increased due to water loss [39]. | Production and consumption varied with strain [1]. | Precursors to aroma compounds; contribute to taste (umami, sweetness) [1] [39].
Monosaccharides | Concentration increased due to water loss [1]. | Residual levels affected by fermentation efficiency [1]. | Impact perceived sweetness and body of the wine [1].
Polyphenolic Compounds | Concentration and profile altered [1]. | Strain-dependent extraction and modification [1]. | Influence color, astringency, bitterness, and oxidative stability [1].

The Scientist's Toolkit: Research Reagent Solutions

Table: Essential Reagents and Materials for LC-HRMS/NMR Wine Metabolomics

Item | Function/Application
Acetonitrile (LC-MS Grade) | Organic mobile phase for UHPLC separation; extraction solvent for protein precipitation [38].
Formic Acid (≥95%) | Mobile phase modifier (0.1% v/v) to improve chromatographic separation and ionization efficiency in LC-HRMS [38].
Deuterated Water (D₂O) | Solvent for NMR analysis, provides a field frequency lock [40].
NMR Reference Standards | TSP or DSS for chemical shift calibration and quantification in ¹H NMR [40].
Buffer Salts (e.g., Phosphate) | Used to standardize pH across NMR samples, ensuring reproducible chemical shifts [40].
Solid-Phase Extraction (SPE) Cartridges | For pre-concentration of specific metabolite classes (e.g., polyphenols) prior to analysis [40].
FooDB / MassBank of North America (MoNA) | Public metabolomics databases for metabolite annotation based on accurate mass and MS/MS spectra [38].

This application note demonstrates that the integration of LC-HRMS and 1H NMR metabolomics via data fusion strategies is a powerful approach for the detailed classification of Amarone wines. The protocol successfully distinguished wines based on withering time and yeast strain with high accuracy, underscoring the complementarity of the two analytical techniques. The multi-omics data fusion framework outlined here provides a robust and transferable model for enhancing food authentication, traceability, and quality control in complex matrices, validating its significance within the broader thesis of advanced food classification research.

The authentication of high-value food products is a critical challenge in food chemistry, particularly for ingredients susceptible to fraudulent substitution. The "Tonda Gentile Trilobata" (TGT) hazelnut cultivar from Piedmont, Italy, represents a prime example of a premium agricultural product requiring robust traceability methods. This cultivar is recognized for its superior sensory characteristics and technological properties, particularly the easy detachment of the pellicle from the roasted seed [41]. However, its higher market value and Italy's status as a net hazelnut importer create economic incentives for adulteration with lower-quality varieties [41] [42].

Traditional morphological identification methods are insufficient for processed products or when geographical origin verification is required. This application note details an integrated multi-omics approach, fusing Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) data with advanced chemometrics to authenticate TGT hazelnuts throughout the production chain. The methodology aligns with the broader thesis that multi-platform data fusion significantly enhances classification accuracy in food authentication research [1] [9].

Background and Significance

The common hazel (Corylus avellana L.) is cultivated across Europe, with Italy being the second-largest global producer [41]. The TGT cultivar, protected by the PGI designation "Nocciola Piemonte IGP," commands a premium due to its quality. Traceability becomes complex when hazelnuts are processed into products like roasted kernels, pastes, and creams, where physical identification is impossible [42].

Multi-omics strategies address this challenge by providing complementary data layers. "Chemotype" analysis (e.g., metabolomics, elemental profiling) reflects the influence of geography and environment, while "genotype" analysis confirms the cultivar [41]. Data fusion integrates these layers, creating a comprehensive biochemical fingerprint that is resistant to manipulation and applicable to complex matrices, including final consumer products [1] [42] [3].

Multi-Omics Data Acquisition and Workflow

The authentication of TGT hazelnuts employs a sequential multi-omics pipeline, from sample preparation to final classification. The integrated workflow ensures that both compositional and genetic markers are captured and synergistically analyzed.

The diagram below illustrates the complete analytical workflow for the multi-omics tracing of hazelnut origin.

Workflow: Hazelnut Samples (Raw & Processed) → Sample Preparation → LC-HRMS Analysis, NMR Analysis, ICP-MS/OES Analysis, and Genetic Analysis (RAPD) in parallel → Mid-Level Data Fusion → Machine Learning (Random Forest, sPLS-DA) → Authentication & Origin Classification → Verified TGT Origin

Sample Preparation Protocol

3.2.1 Reagents and Materials:

  • Liquid Nitrogen for cryogenic grinding
  • Organic Solvents: HPLC-grade methanol, acetonitrile, chloroform
  • Deuterated Solvent: Deuterium oxide (D₂O) with 0.05% TSP for NMR
  • Digestion Acids: Suprapur nitric acid (HNO₃) and hydrogen peroxide (H₂O₂) for ICP-MS
  • DNA Extraction Kits: Commercially available kits optimized for nut seeds (e.g., CTAB-based methods) [41]

3.2.2 Procedure:

  • Homogenization: Freeze raw or processed hazelnut samples with liquid nitrogen and pulverize to a fine, homogeneous powder using a ball mill.
  • Metabolite Extraction (for LC-HRMS/NMR): Weigh 100 mg of powder. Add 1 mL of methanol:water (80:20, v/v) mixture. Vortex for 1 minute, sonicate for 15 minutes in an ice bath, and centrifuge at 14,000 × g for 10 minutes. Transfer the supernatant for analysis [3].
  • Elemental Analysis (for ICP-MS/OES): Weigh 500 mg of powder into digestion vessels. Add 6 mL of concentrated HNO₃ and 2 mL of H₂O₂. Perform microwave-assisted digestion. Cool and dilute to 50 mL with ultra-pure water [42].
  • DNA Extraction (for RAPD): Follow the manufacturer's protocol for the selected DNA extraction kit. Assess DNA quality and quantity via spectrophotometry and gel electrophoresis [41].
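
For the elemental analysis step, the concentration measured in the diluted digest has to be converted back to the original powder. The helper below is a small worked example under the stated protocol values (500 mg digested, diluted to 50 mL); the function name and the optional blank subtraction are illustrative assumptions.

```python
def digest_to_sample_conc(conc_ug_per_l, final_volume_ml=50.0,
                          sample_mass_mg=500.0, blank_ug_per_l=0.0):
    """Convert a digest concentration (µg/L) to a sample concentration (mg/kg)."""
    volume_l = final_volume_ml / 1000.0
    mass_kg = sample_mass_mg / 1_000_000.0
    element_ug = (conc_ug_per_l - blank_ug_per_l) * volume_l   # µg of element in the digest
    return element_ug / 1000.0 / mass_kg                       # µg -> mg, then per kg of sample

# 12 µg/L measured in the digest corresponds to 1.2 mg/kg in the hazelnut powder.
print(round(digest_to_sample_conc(12.0), 3))  # 1.2
```

With these volumes and masses each µg/L in the digest equals 0.1 mg/kg in the sample.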

Analytical Platforms and Key Biomarkers

The power of this approach lies in the complementary data provided by each analytical platform. The table below summarizes the key biomarkers and their utility in authenticating TGT hazelnuts.

Table 1: Key Analytical Platforms and Biomarkers for TGT Hazelnut Authentication

Analytical Platform | Measured Features | Key Discriminative Biomarkers for TGT | Primary Utility
LC-HRMS | Phenolic compounds, lipids, polar metabolites | Specific polyphenol fingerprints; clovamide derivatives [41] | Chemotyping; differentiation by geography and processing
¹H NMR | Major metabolites (sugars, amino acids, organic acids) | Sucrose/glucose ratio; amino acid profile [1] | Chemotyping; rapid metabolic profiling
ICP-MS/OES | Trace elements & minerals | Molybdenum (Mo), Nickel (Ni), Cesium (Cs), Rubidium (Rb) [42] | Geographic origin traceability (soil fingerprint)
RAPD-PCR | Genomic DNA polymorphisms | Unique DNA banding patterns [41] | Cultivar identification (genotyping)

LC-HRMS Metabolomic Profiling

4.1.1 Protocol:

  • Chromatography: Reversed-phase C18 column (e.g., 150 × 2.1 mm, 1.9 µm). Mobile phase A: 0.1% formic acid in water; B: 0.1% formic acid in acetonitrile. Gradient: 5% B to 95% B over 25 minutes.
  • Mass Spectrometry: Q-Exactive Orbitrap or similar HRMS. Data acquired in both positive and negative ESI mode with vDIA (variable Data-Independent Acquisition) for comprehensive fragmentation data [3].
  • Data Processing: Use software like XCMS for peak picking, alignment, and integration. Apply the BOULS (Bucketing Of Untargeted LCMS Spectra) approach to enable long-term data comparability across different instruments and batches [3].

¹H NMR Spectroscopy

4.2.1 Protocol:

  • Sample Preparation: Mix 100 µL of metabolite extract with 400 µL of D₂O containing 0.05% TSP (sodium 3-(trimethylsilyl)propionate-2,2,3,3-d4) as an internal chemical shift reference.
  • Acquisition Parameters: 600 MHz NMR spectrometer. Number of scans: 64. Temperature: 298 K. Pulse sequence: NOESYPRESAT for water suppression [1].
  • Data Processing: Fourier transformation, phase and baseline correction. Reference TSP signal to δ 0.00 ppm. Segment the spectrum into buckets (e.g., 0.04 ppm width) for multivariate analysis.
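
The bucketing step can be illustrated with a short NumPy sketch. It sums intensities within fixed-width buckets and drops buckets whose centers fall in an excluded region. The function name, the 0.5-9.5 ppm window, and the 4.7-5.0 ppm residual-water exclusion are assumptions chosen for illustration, not values taken from the cited protocol.

```python
import numpy as np

def bucket_spectrum(ppm, intensity, width=0.04, ppm_range=(0.5, 9.5),
                    exclude=((4.7, 5.0),)):
    """Sum intensities into fixed-width buckets and drop excluded regions."""
    ppm = np.asarray(ppm, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    lo, hi = ppm_range
    n_buckets = int(round((hi - lo) / width))
    edges = lo + width * np.arange(n_buckets + 1)
    sums, _ = np.histogram(ppm, bins=edges, weights=intensity)  # axis order does not matter
    centers = edges[:-1] + width / 2
    keep = np.ones(n_buckets, dtype=bool)
    for start, stop in exclude:
        keep &= ~((centers >= start) & (centers <= stop))
    return centers[keep], sums[keep]

# Hypothetical spectrum on a descending chemical shift axis.
rng = np.random.default_rng(2)
ppm_axis = np.linspace(9.4, 0.6, 8801)
spectrum = rng.random(ppm_axis.size)
centers, buckets = bucket_spectrum(ppm_axis, spectrum)
```

Summing points assumes a uniformly sampled axis; for non-uniform sampling, integrating each bucket would be the safer choice.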

Elemental Fingerprinting via ICP-MS/OES

4.3.1 Protocol:

  • Analysis: Use ICP-OES for major elements and ICP-MS for trace and ultra-trace elements. Employ collision/reaction cell technology in ICP-MS to mitigate polyatomic interferences.
  • Quality Control: Include procedural blanks, certified reference materials (CRMs) of similar matrix (e.g., NIST 1547 Peach Leaves), and internal standards (e.g., Sc, Y, In, Bi) for drift correction [42].
  • Key Markers: As shown in Table 1, Mo and Ni are identified as particularly powerful markers for distinguishing TGT from other cultivars like "Tonda Gentile Romana" and "Mortarella," even in processed products [42].
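
Internal-standard drift correction, mentioned in the quality control step, amounts to rescaling each analyte signal by the recovery of the internal standard in the same injection. The sketch below shows the arithmetic only; the function name and count values are hypothetical, and real workflows typically match each analyte to an internal standard of similar mass and ionization behavior.

```python
def internal_standard_correct(analyte_counts, is_counts, is_reference):
    """Rescale analyte signals by the internal-standard recovery of each injection."""
    return [a * is_reference / s for a, s in zip(analyte_counts, is_counts)]

# Hypothetical run in which sensitivity drifts down by 5% and then 10%.
analyte = [1000.0, 950.0, 900.0]
internal_std = [5000.0, 4750.0, 4500.0]
corrected = internal_standard_correct(analyte, internal_std, is_reference=5000.0)
print(corrected)  # [1000.0, 1000.0, 1000.0]
```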

Genetic Authentication via RAPD

4.4.1 Protocol:

  • PCR Amplification: 25 µL reaction containing 20 ng DNA, 1X buffer, 2.5 mM MgCl₂, 0.2 mM dNTPs, 1 U Taq polymerase, and 0.4 µM of a single random primer (e.g., OPA-09).
  • Cycling Conditions: Initial denaturation: 94°C for 5 min; 45 cycles of 94°C for 1 min, 36°C for 1 min, 72°C for 2 min; final extension: 72°C for 7 min.
  • Analysis: Separate PCR products by agarose gel electrophoresis and score the presence/absence of DNA bands. RAPD robustly discriminates between different hazelnut cultivars [41].
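
Once bands are scored as present (1) or absent (0), profiles can be compared numerically. The sketch below uses the Jaccard index, a common choice for dominant markers such as RAPD, although the source does not state which similarity measure was used; the band patterns shown are invented for illustration.

```python
def jaccard_similarity(bands_a, bands_b):
    """Shared bands divided by bands present in at least one profile."""
    shared = sum(1 for a, b in zip(bands_a, bands_b) if a and b)
    either = sum(1 for a, b in zip(bands_a, bands_b) if a or b)
    return shared / either if either else 1.0

# Hypothetical presence/absence scores at six band positions.
profiles = {
    "TGT_reference": [1, 0, 1, 1, 0, 1],
    "unknown_1":     [1, 0, 1, 1, 0, 1],
    "unknown_2":     [0, 1, 1, 0, 1, 1],
}
reference = profiles["TGT_reference"]
for name in ("unknown_1", "unknown_2"):
    print(name, round(jaccard_similarity(reference, profiles[name]), 3))
# unknown_1 1.0
# unknown_2 0.333
```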

Data Fusion and Chemometric Analysis

The core of this strategy is the fusion of multiple omics datasets to build a superior classification model.

Data Preprocessing and Fusion Strategy

Each data block (LC-HRMS buckets, NMR buckets, elemental concentrations) is preprocessed individually (mean-centering, Pareto scaling). A mid-level data fusion (MLDF) approach is recommended. In MLDF, features from each platform are first selected and then concatenated into a single composite matrix before modeling [9].
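
Pareto scaling divides each mean-centered variable by the square root of its standard deviation, which damps large peaks less aggressively than unit-variance scaling. A minimal NumPy sketch follows; the function name and the synthetic block are hypothetical.

```python
import numpy as np

def pareto_scale(X):
    """Mean-center each column and divide by the square root of its standard deviation."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                     # leave constant columns at zero after centering
    return centered / np.sqrt(sd)

# Hypothetical block with columns on very different intensity scales.
rng = np.random.default_rng(5)
block = rng.normal(size=(15, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
scaled = pareto_scale(block)
```

After scaling, a column whose original standard deviation was s has standard deviation sqrt(s), so high-intensity variables still carry more weight than low-intensity ones, but far less than in the raw data.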

Table 2: Chemometric Models for Hazelnut Classification

Model | Type | Application in TGT Tracing | Key Advantage
Principal Component Analysis (PCA) | Unsupervised | Exploratory data analysis; outlier detection | Visualizes natural clustering and data structure
Partial Least Squares-Discriminant Analysis (PLS-DA) | Supervised | Builds a predictive model for cultivar classification | Handles collinear variables; good for noisy data
Random Forest (RF) | Supervised (Ensemble) | Final classification of origin and cultivar [3] | High accuracy; handles non-linear relationships; provides variable importance

5.1.2 Protocol for Mid-Level Data Fusion with Random Forest:

  • Feature Selection: For each omics platform, select the top 20-50 most discriminative features based on Variable Importance in Projection (VIP) scores from PLS-DA or built-in feature importance in a preliminary RF model.
  • Data Concatenation: Combine the selected features from all platforms into a single fused data matrix (samples × combined features).
  • Model Training: Train a final Random Forest model on the fused matrix. Use out-of-bag (OOB) error for internal validation.
  • Model Validation: Validate the model's performance using an independent test set of samples not used in model training. Report accuracy, precision, and recall for TGT classification.
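
The select, concatenate, train, and validate sequence above can be sketched end to end. To keep the example dependency-light, two substitutions are made and should be read as stand-ins: a standardized class-mean difference replaces VIP or Random Forest importance for feature ranking, and a nearest-centroid classifier replaces the Random Forest. All data, names, and the choice of five features per block are hypothetical. Feature selection uses the training samples only, so the test set stays independent.

```python
import numpy as np

def top_k_features(X, y, k):
    """Indices of the k features with the largest standardized class-mean difference."""
    pos, neg = X[y == 1], X[y == 0]
    score = np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / (
        pos.std(axis=0) + neg.std(axis=0) + 1e-12)
    return np.argsort(score)[::-1][:k]

def fuse(blocks, selections):
    """Concatenate the selected columns of each block (samples x combined features)."""
    return np.hstack([X[:, idx] for X, idx in zip(blocks, selections)])

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for the positive (TGT) class."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = float(np.mean(y_true == y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Synthetic two-block data with one planted informative feature per block.
rng = np.random.default_rng(42)
n = 60
y = np.array([1, 0] * (n // 2))            # 1 = TGT, 0 = other cultivar
lcms = rng.normal(size=(n, 40))
nmr = rng.normal(size=(n, 15))
lcms[:, 3] += 4.0 * y
nmr[:, 7] += 4.0 * y
train, test = np.arange(0, 40), np.arange(40, n)

# Select on training data only, then apply the same columns to the test set.
train_blocks = [lcms[train], nmr[train]]
selections = [top_k_features(X, y[train], k=5) for X in train_blocks]
fused_train = fuse(train_blocks, selections)
fused_test = fuse([lcms[test], nmr[test]], selections)

# Nearest-centroid stand-in for the Random Forest classifier.
centroids = [fused_train[y[train] == c].mean(axis=0) for c in (0, 1)]
distances = np.stack([np.linalg.norm(fused_test - c, axis=1) for c in centroids], axis=1)
y_pred = distances.argmin(axis=1)
accuracy, precision, recall = binary_metrics(y[test], y_pred)
print(accuracy, precision, recall)
```

Swapping in a Random Forest would replace only the centroid step; the selection and concatenation logic, and the rule that selection must not see the test samples, stay the same.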

Logical Workflow for Data Integration

The following diagram outlines the logical sequence and decision points in the data fusion and classification process.

Workflow: Multi-Omics Raw Data (LC-HRMS, NMR, ICP-MS, RAPD) → Data Preprocessing (Normalization, Scaling, Bucketing) → Feature Selection (VIP, RF Importance) → Mid-Level Data Fusion → Train Classifier (Random Forest) → Model Evaluation (Cross-Validation, Test Set) → Decision: Classification Accurate? If yes: TGT Authentication Successful. If no: Refine Model/Features and return to Mid-Level Data Fusion.

The Scientist's Toolkit

Successful implementation of this multi-omics traceability strategy requires specific reagents and analytical resources. The following table details the essential components of the research toolkit.

Table 3: Essential Research Reagent Solutions and Materials

Item | Function / Application | Specific Examples / Notes
LC-HRMS Solvents | Mobile phase preparation for chromatographic separation | HPLC-MS grade water, acetonitrile, methanol with 0.1% formic acid modifier
NMR Solvents | Sample preparation for metabolomic profiling | Deuterium Oxide (D₂O) with TSP internal standard for chemical shift referencing
ICP-MS Acids | Sample digestion for elemental analysis | Suprapur grade Nitric Acid (HNO₃) and Hydrogen Peroxide (H₂O₂) to minimize background contamination
DNA Extraction Kit | Isolation of high-quality genomic DNA from hazelnut kernels | Kits using CTAB-based lysis buffers, optimized for challenging matrices high in fats and polyphenols [41]
Certified Reference Materials (CRMs) | Quality control and validation of quantitative data | NIST 1547 (Peach Leaves) for ICP-MS; certified metabolite mixtures for LC-HRMS/NMR
Stable Isotope Standards | Internal standardization for quantitative LC-HRMS | Isotopically labeled compounds for key metabolite classes to correct for ionization suppression

Concluding Remarks

This application note demonstrates that the fusion of LC-HRMS, NMR, ICP-MS, and genomic data creates a powerful, multi-layered authentication system for Tonda Gentile Trilobata hazelnuts. The mid-level data fusion strategy, coupled with Random Forest modeling, successfully integrates complementary information from the chemotype (influenced by geography and processing) and genotype, achieving a classification accuracy exceeding 90% in validated models [1] [3].

This protocol provides a robust framework that can be adapted to the traceability of other high-value agricultural products. The continuous learning aspect of the BOULS approach for LC-HRMS data ensures that the classification models can be dynamically updated, providing a sustainable solution to the evolving challenge of food fraud.

The global demand for seafood, particularly salmon, has surged, making it one of the world's most popular fish species. Concurrently, consumer awareness and concern regarding the authenticity, geographical origin, and production method (wild-caught vs. farmed) of their seafood have significantly increased [25]. The complex, international supply chains for salmon create opportunities for unintentional mislabeling or fraudulent substitution, a problem evidenced by a reported 25% mislabeling rate in some markets [25]. Since the eating quality and price of salmon are highly influenced by its growing environment, diet, and production method, verifying its provenance is crucial for consumer trust, safety, and fair trade [25].

Single-platform analytical methods, such as those based solely on mass spectrometry or spectroscopy, often focus on specific aspects of food composition and may lack the resolution for definitive provenance determination [24] [25]. To overcome this limitation, data fusion strategies that combine information from complementary analytical techniques have emerged as a powerful solution. This case study details a landmark experiment that achieved 100% classification accuracy for salmon geographical origin and production method by applying a mid-level data fusion approach to dual-platform mass spectrometry data [25].

Experimental Design and Workflow

Sample Collection and Preparation

A total of 522 salmon samples of known provenance were collected for model development and validation. These samples represented a diverse set of origins and production methods, crucial for building a robust classification model [25].

  • Geographical Origins: Alaska, Norway, Iceland, and Scotland.
  • Production Methods: Both farmed and wild-caught.
  • Test Set: An additional 17 commercial samples were purchased from UK supermarkets to independently evaluate the model's performance.

Muscle tissue samples were analyzed directly without any chemical pre-treatment for the REIMS analysis, while samples for ICP-MS analysis underwent a standardized closed-vessel microwave digestion to prepare them for elemental profiling [25].

Analytical Techniques and Instrumentation

Two complementary mass spectrometry platforms were employed to capture distinct biochemical profiles of the salmon samples.

Table 1: Key Analytical Techniques and Their Roles

Technique | Acronym | Profiling Type | Key Advantages | Role in the Study
Rapid Evaporative Ionization Mass Spectrometry | REIMS | Lipidomic | Real-time, in-situ analysis; no sample preparation [25] | Provides rapid lipid profiles for differentiation
Inductively Coupled Plasma Mass Spectrometry | ICP-MS | Elemental | Powerful for trace element analysis; geographical fingerprinting [25] | Provides elemental composition for origin traceability

Data Acquisition and Pre-processing

  • REIMS Data: Spectral data were collected in real-time and pre-processed to reduce noise and correct for baseline drift before being subjected to multivariate analysis [25].
  • ICP-MS Data: Elemental concentration data were acquired, normalized, and scaled to ensure variables from different techniques contributed equally to the fused model.
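
One common way to make blocks of very different size contribute equally is to autoscale every variable and then divide each block by the square root of its number of variables, so that each block carries the same total variance. The source does not specify which scaling was used in the salmon study, so the following NumPy sketch is an assumption-labeled illustration with synthetic data.

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0
    return (X - X.mean(axis=0)) / sd

def block_scale(X):
    """Autoscale, then weight the block so its total variance equals one."""
    Z = autoscale(X)
    return Z / np.sqrt(Z.shape[1])

# Hypothetical blocks: 300 REIMS spectral bins versus 9 ICP-MS elements.
rng = np.random.default_rng(7)
reims_block = rng.normal(size=(10, 300))
icp_block = rng.normal(size=(10, 9)) * 50.0
fused_blocks = np.hstack([block_scale(reims_block), block_scale(icp_block)])
print(fused_blocks.shape)  # (10, 309)
```

Without the block weight, the 300 REIMS variables would dominate a fused model simply by outnumbering the nine elemental variables.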

Data Fusion Strategy and Multivariate Analysis

The Principle of Mid-Level Data Fusion

Data fusion is categorized into three levels: low, mid, and high. Mid-level data fusion (MLDF) was identified as the most suitable strategy for this application [25] [43]. In MLDF:

  • Feature Extraction: The most discriminative variables or features are independently extracted from each analytical platform's dataset.
  • Feature Concatenation: These selected features are then combined into a single, consolidated data matrix.
  • Model Building: A final classification or regression model is built using this fused matrix.

This approach effectively reduces data dimensionality and noise by eliminating irrelevant variables, often leading to superior model performance compared to using single techniques or low-level fusion [21] [44].

Chemometric Workflow for Salmon Classification

The data analysis followed a structured workflow combining unsupervised and supervised chemometric methods.

Workflow: REIMS Data (Lipidomic Profile) and ICP-MS Data (Elemental Profile) → Data Pre-processing (Baseline correction, normalization) → Feature Extraction (PCA, S-plots, VIP) → Mid-Level Data Fusion (Feature Concatenation) → Multivariate Analysis (OPLS-DA, PLS-DA, PCA-LDA) → Model Validation (Cross-validation, Test Set) → 100% Classification Accuracy

Diagram 1: Chemometric workflow for salmon classification using mid-level data fusion.

The process began with pre-processing data from both REIMS and ICP-MS platforms. Key features were then extracted from each dataset; for REIMS data, this involved using PCA and S-plots from OPLS-DA models to identify significant lipid ions [25]. These selected features were fused into a single matrix, which was used to build supervised classification models, including OPLS-DA, PLS-DA, and PCA-LDA. The final model was rigorously validated through cross-validation and an independent test set [25].
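
An S-plot places each variable by its covariance with the predictive score vector (magnitude of contribution) against its correlation with that vector (reliability). The sketch below computes both coordinates with NumPy for a given score vector. The score vector here is simulated rather than taken from a fitted OPLS-DA model, and the 0.8 correlation cutoff is an illustrative assumption.

```python
import numpy as np

def s_plot_coordinates(X, t):
    """Covariance and correlation of every variable with a score vector t."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    Xc = X - X.mean(axis=0)
    tc = t - t.mean()
    cov = Xc.T @ tc / (X.shape[0] - 1)
    corr = cov / (Xc.std(axis=0, ddof=1) * tc.std(ddof=1))
    return cov, corr

# Hypothetical data: the score vector tracks variable 2 closely.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 8))
t = 2.0 * X[:, 2] + rng.normal(scale=0.1, size=30)
cov, corr = s_plot_coordinates(X, t)
candidates = np.where(np.abs(corr) > 0.8)[0]   # assumed reliability cutoff
```

Variables that sit far from the origin on both axes are the usual marker candidates; the sketch assumes no constant columns, which would give a zero standard deviation.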

Key Experimental Results

Performance of Single-Technique Models

Initial analysis using only REIMS data showed some ability to distinguish between salmon from different geographical regions using Principal Component Analysis (PCA). However, the REIMS-only model did not classify every sample correctly, particularly samples from similar origins [25]. Models based on a single technique therefore fell short of the accuracy required for definitive provenance verification.

Achieving 100% Accuracy with Mid-Level Fusion

The power of the mid-level data fusion approach was demonstrated by its flawless performance.

Table 2: Classification Results of the Mid-Level Data Fusion Model

| Validation Method | Number of Samples | Classification Accuracy | Notes |
| --- | --- | --- | --- |
| Cross-Validation | 522 | 100% | Internal validation on the main sample set [25] |
| Independent Test Set | 17 | 100% | All supermarket samples had origins correctly identified [25] |

This result signifies that the fused model, leveraging the complementary information from lipidomics and elementomics, possessed a discriminative power unattainable by either single-platform method.

Identification of Discriminative Biomarkers

The OPLS-DA modeling and subsequent S-plot analysis enabled the identification of key biomarkers responsible for differentiating the salmon groups. In total, 18 robust lipid markers and 9 elemental markers were discovered, providing strong chemical evidence for provenance verification [25].

The lipid markers were tentatively identified and belonged to various classes, including:

  • Unsaturated fatty acids (e.g., FA 18:3, FA 20:5, FA 22:6)
  • Diacylglycerophosphocholines (GP0101)
  • Diacylglycerophosphoglycerols (GP0401)
  • Triradylglycerols (GL0301) [25]

The relative intensities of these compounds varied significantly between salmon from different origins and production methods, forming a unique chemical "fingerprint" for each group.

The Scientist's Toolkit: Research Reagents and Materials

Table 3: Essential Research Materials and Reagents

| Item | Function / Role in the Experiment |
| --- | --- |
| Salmon Muscle Tissue | Biological matrix for direct analysis using REIMS and ICP-MS [25]. |
| ICP-MS Calibration Standards | Essential for quantifying elemental concentrations and ensuring accuracy in ICP-MS analysis. |
| LipidMaps Database | Used for the tentative identification of lipid biomarkers based on high-resolution m/z values [25]. |
| Microwave Digestion System | Used for the closed-vessel digestion of salmon tissue prior to ICP-MS analysis, ensuring complete sample dissolution [25]. |
| Chemometric Software | Software platforms capable of performing PCA, OPLS-DA, and other multivariate analyses, and implementing data fusion strategies. |

Detailed Experimental Protocol

Sample Analysis via REIMS

  • Sample Presentation: Present a small portion (approx. 1-2 mg) of fresh or thawed salmon muscle tissue to the REIMS source.
  • Data Acquisition: Use an electrosurgical knife or similar probe coupled to the mass spectrometer. The tissue is rapidly heated and the aerosolized particles are aspirated into the MS inlet.
  • Mass Spectrometry: Acquire mass spectra in negative ion mode, typically over a mass range of m/z 100-1200. Collect data for 5-10 seconds per sample spot.
  • Data Export: Export the averaged, centroid mass spectra for each sample for pre-processing.

Sample Analysis via ICP-MS

  • Digestion: Accurately weigh ~0.2 g of homogenized salmon tissue into a microwave digestion vessel. Add 5 mL of high-purity concentrated nitric acid.
  • Microwave Program: Run a standardized digestion program (e.g., ramp to 180°C over 15 minutes, hold for 15 minutes). Allow vessels to cool before opening.
  • Dilution: Transfer the digestate to a volumetric flask and dilute to volume with ultrapure water (e.g., 50 mL). Further dilute if necessary to match the calibration range.
  • ICP-MS Analysis: Analyze samples alongside a blank and a series of multi-element calibration standards. Monitor a panel of relevant isotopes (e.g., Li, B, Na, Mg, K, Ca, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, As, Se, Sr, Mo, Cd, Ba, Pb, U).

Data Pre-processing and Fusion

  • REIMS Data: Pre-process the spectra by applying baseline correction, mass axis alignment, and total ion current (TIC) normalization.
  • ICP-MS Data: Perform quantitative analysis to obtain concentration data. Log-transform and apply Pareto scaling to the data.
  • Feature Selection: For REIMS data, use OPLS-DA models and S-plots to select the most significant lipid ions (biomarkers). For ICP-MS, select elements with high variable importance in projection (VIP) scores.
  • Data Fusion: Concatenate the selected REIMS biomarker intensities and the selected ICP-MS elemental concentrations into a single data matrix (samples × combined features).
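
The pre-processing and fusion steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic numbers, not the pipeline used in the cited study: the array sizes, the variable names, and the indices of the "selected" features are all hypothetical placeholders for the output of the OPLS-DA/S-plot and VIP selection steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (hypothetical sizes): 6 samples, 50 REIMS m/z bins, 8 elements.
reims = rng.gamma(shape=2.0, scale=100.0, size=(6, 50))
icpms = rng.lognormal(mean=1.0, sigma=1.0, size=(6, 8))


def tic_normalize(spectra):
    """Divide each spectrum (row) by its total ion current so that rows sum to 1."""
    return spectra / spectra.sum(axis=1, keepdims=True)


def pareto_scale(x):
    """Mean-center each column and divide by the square root of its standard deviation."""
    centered = x - x.mean(axis=0)
    return centered / np.sqrt(x.std(axis=0, ddof=1))


reims_norm = tic_normalize(reims)
icpms_scaled = pareto_scale(np.log10(icpms))  # synthetic concentrations are strictly positive

# Hypothetical outcome of feature selection: column indices kept from each block.
reims_keep = [3, 7, 19]
icpms_keep = [0, 5]

# Mid-level fusion: samples x (selected REIMS features + selected ICP-MS features).
fused = np.hstack([reims_norm[:, reims_keep], icpms_scaled[:, icpms_keep]])
```

In a real workflow the selection indices would come from models fitted on training samples only, so that the fused matrix does not leak information from the test set.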

Model Building and Validation

  • Training: Use the fused data matrix to build a supervised classification model, such as an OPLS-DA or PLS-DA model, using the known sample origins and production methods as the Y-variable.
  • Cross-Validation: Validate the model using a rigorous cross-validation method (e.g., 7-fold cross-validation) to determine its robustness and avoid overfitting.
  • External Validation: Apply the finalized model to the independent test set of 17 commercially purchased salmon samples to evaluate its real-world predictive accuracy.

This case study demonstrates that a mid-level data fusion strategy, combining lipidomic data from REIMS with elemental data from ICP-MS, achieved a level of classification performance that the single-platform models did not reach. The 100% accuracy obtained for the geographical origin and production method of salmon provides a science-based tool for combating food fraud. The experimental protocol and chemometric workflow outlined here serve as a template that can be adapted to authenticity testing of other high-value food commodities.

Navigating Analytical Challenges: Strategies for Robust and Reproducible Data Integration

In food classification research, the integration of multiple analytical platforms, such as Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) spectroscopy, generates multi-block data where each block contains unique yet complementary information about the food sample. This data heterogeneity presents significant challenges for analysis, as each platform produces data with different scales, dimensionalities, and statistical properties. Recent studies have demonstrated that multi-omics data fusion approaches can successfully classify food products based on their intrinsic characteristics. For instance, research on Amarone wine classification effectively integrated LC-HRMS and 1H NMR datasets, revealing a limited correlation between the datasets (RV-score = 16.4%), which highlighted their complementarity and the necessity of proper data scaling before fusion [1] [37].

The fundamental challenge in analyzing multi-block data stems from the technical heterogeneity inherent in different analytical platforms. In a typical multi-omics workflow, transcriptomics can generate thousands of transcripts and proteomics a few thousand proteins, while metabolomics may identify only a few hundred metabolites. This disparity creates an imbalance in which the largest data type can easily overshadow more actionable discoveries from smaller blocks if the data are not properly normalized [45]. Furthermore, data wrangling (transformation, scaling, normalization, and mapping) becomes critical for successful multi-omics integration, because identifiers rarely map one-to-one across omics layers [45].

Theoretical Foundations of Scaling and Normalization

The Necessity of Preprocessing in Multi-Block Analysis

In multi-block analysis, data normalization means rescaling or transforming measured variables so that they lie on comparable scales; it should not be confused with database normalization, which decomposes tables to remove redundancy [46]. In the context of multi-block data analysis for food classification, normalization ensures that each data block contributes meaningfully to the integrated model rather than dominating due to its inherent scale or magnitude. This is particularly important for algorithms that use distance measures or gradient descent optimization, which are sensitive to feature scales [46] [47].

The objectives of normalization in multi-block data analysis include:

  • Enhancing model performance: Algorithms, especially in machine learning, often perform better or converge faster when features are on a consistent scale [46].
  • Ensuring comparability: Normalization enables different features to be compared on a common scale, preventing one feature from overshadowing others due to its inherent scale [46].
  • Improving interpretability: Normalized datasets are easier to understand and analyze, with relationships between variables more clearly discernible [46].
  • Enabling data fusion: Properly scaled data blocks can be effectively integrated through various fusion strategies to extract complementary information [48] [25].

Scaling and Normalization Techniques

Table 1: Fundamental Scaling and Normalization Techniques for Multi-Block Data

| Technique | Mathematical Formula | Key Characteristics | Best Use Cases |
| --- | --- | --- | --- |
| Z-score Standardization | \( x' = \frac{x - \mu}{\sigma} \) | Centers data around zero (mean = 0), scales by standard deviation (std = 1) | Gaussian-distributed data; algorithms assuming standard normal distribution [46] [47] |
| Min-Max Scaling | \( x' = \frac{x - \min(x)}{\max(x) - \min(x)} \) | Rescales data to a specific range (typically [0, 1]) | Data with known boundaries; algorithms requiring bounded input [46] [47] |
| Robust Scaling | \( x' = \frac{x - \mathrm{median}(x)}{\mathrm{IQR}(x)} \) | Uses median and interquartile range; robust to outliers | Data with significant outliers or non-Gaussian distribution [46] [47] |
| L2 Normalization | \( x' = \frac{x}{\lVert x \rVert_2} \) | Scales samples to have unit Euclidean norm | Distance-based algorithms; vector space models [46] |
| Max-Absolute Scaling | \( x' = \frac{x}{\max(\lvert x \rvert)} \) | Scales data to the [-1, 1] range by maximum absolute value | Data centered around zero without breaking sparsity [47] |
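
The five formulas in Table 1 translate directly into NumPy. The sketch below applies them to a small synthetic matrix; the function names are illustrative, and four of the transforms are applied per feature (column) while L2 normalization is applied per sample (row), matching the table.

```python
import numpy as np

# Toy block (synthetic): 5 samples x 2 features on very different scales.
# The first column contains an outlier in the last sample.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0],
              [100.0, 1000.0]])


def zscore(x):
    """(x - mean) / std, per column."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


def min_max(x):
    """(x - min) / (max - min), per column; result lies in [0, 1]."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)


def robust(x):
    """(x - median) / IQR, per column; the outlier barely affects the scale."""
    q1, med, q3 = np.percentile(x, [25, 50, 75], axis=0)
    return (x - med) / (q3 - q1)


def l2_normalize(x):
    """x / ||x||_2, per row; every sample gets unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def max_abs(x):
    """x / max(|x|), per column; result lies in [-1, 1]."""
    return x / np.abs(x).max(axis=0)
```

Comparing `zscore(X)[:, 0]` with `robust(X)[:, 0]` shows why the outlier matters: the z-score compresses the four ordinary samples into a narrow band, whereas robust scaling keeps them evenly spaced around zero.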

Experimental Protocols for Multi-Block Data Normalization

Protocol 1: Data Preprocessing Workflow for LC-HRMS and NMR Data Fusion

Purpose: To establish a standardized workflow for preprocessing LC-HRMS and NMR data before multi-block analysis in food classification research.

Materials and Equipment:

  • Raw data files from LC-HRMS and NMR analyses
  • Computing environment with Python/R and necessary libraries (scikit-learn, pandas, numpy)
  • Metadata table with sample information and experimental conditions

Procedure:

  • Data Parsing and Initial Quality Control
    • Convert raw instrument data to open formats (mzML for LC-HRMS, JCAMP-DX for NMR)
    • Perform initial quality assessment using Principal Component Analysis (PCA) to identify outliers and batch effects
    • Document any missing values or experimental artifacts
  • Platform-Specific Preprocessing

    • For LC-HRMS data:

      • Perform peak picking, alignment, and gap filling using XCMS or similar tools
      • Apply total ion count or probabilistic quotient normalization to correct for sample concentration variations
      • Log-transform the data to address heteroscedasticity
    • For NMR data:

      • Apply phase and baseline correction
      • Perform chemical shift alignment using reference compounds
      • Use probabilistic quotient normalization or total area normalization
      • Integrate regions of interest or apply bucketing to reduce dimensionality
  • Data Scaling and Normalization

    • Assess data distribution using histograms and Q-Q plots for each block
    • Select appropriate normalization technique based on data characteristics (refer to Table 1)
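    • Apply the chosen normalization method to each block independently, estimating the scaling parameters separately for the LC-HRMS and NMR data
    • Where outliers are present, prefer robust scaling (median and interquartile range) over z-score standardization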

  • Validation of Normalization

    • Verify that normalized datasets have comparable scales across blocks
    • Confirm preservation of biological patterns through PCA visualization
    • Document all parameters and transformations for reproducibility
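
Probabilistic quotient normalization (PQN), named in the platform-specific step for both LC-HRMS and NMR data, can be sketched as follows. This is a minimal version that assumes a complete, strictly positive intensity matrix and uses the median spectrum as the reference; the function name and the toy data are illustrative.

```python
import numpy as np


def pqn_normalize(X, reference=None):
    """Probabilistic quotient normalization of a samples x features matrix.

    Each sample is divided by the median of its feature-wise quotients
    against a reference spectrum (default: the median spectrum).
    Assumes no zeros in the reference; filter or impute such features first.
    """
    X = np.asarray(X, dtype=float)
    if reference is None:
        reference = np.median(X, axis=0)
    quotients = X / reference
    dilution = np.median(quotients, axis=1, keepdims=True)
    return X / dilution, dilution.ravel()


# Toy check: one profile measured at three dilutions collapses onto a single profile.
profile = np.array([10.0, 20.0, 5.0, 40.0])
X = np.vstack([1.0 * profile, 2.0 * profile, 0.5 * profile])
X_pqn, factors = pqn_normalize(X)
```

PQN is commonly applied after an initial total-area or total-ion-count normalization; that preliminary step is omitted here for brevity.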

Troubleshooting Tips:

  • If normalization introduces artifacts, revisit quality control steps and check for outlier samples
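  • If inter-block disparities persist, consider block-scaling approaches in which each block is rescaled to the same total variance (for example, by dividing the block by the square root of its total sum of squares), so that a block with many variables does not dominate the fused model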
  • For datasets with extensive missing values, consider imputation methods before normalization
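
The scaling step of Protocol 1 and the block-scaling tip can be sketched with scikit-learn's `RobustScaler`. The block sizes and variable names below are hypothetical, and dividing by the Frobenius norm is only one common variant of block scaling, chosen here as an assumption.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)

# Synthetic, already platform-preprocessed blocks for the same 20 samples (hypothetical sizes).
lcms_block = rng.lognormal(mean=8.0, sigma=1.5, size=(20, 300))  # LC-HRMS feature intensities
nmr_block = rng.normal(loc=50.0, scale=5.0, size=(20, 40))       # NMR bucket integrals


def scale_block(block):
    """Robust-scale every variable, then give the whole block unit sum of squares."""
    scaled = RobustScaler().fit_transform(block)  # (x - median) / IQR per column
    return scaled / np.linalg.norm(scaled)        # Frobenius norm, so blocks weigh equally


lcms_scaled = scale_block(np.log10(lcms_block))  # log-transform LC-HRMS intensities first
nmr_scaled = scale_block(nmr_block)
```

When a model is validated on held-out samples, the scaler and the block norm should be estimated on the training samples only and then applied unchanged to the test samples.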

Protocol 2: Mid-Level Data Fusion with Normalized Multi-Block Data

Purpose: To integrate normalized LC-HRMS and NMR datasets using mid-level data fusion for enhanced food classification accuracy.

Materials and Equipment:

  • Normalized LC-HRMS and NMR data matrices from Protocol 1
  • Sample metadata including class labels (e.g., geographical origin, processing method)
  • Multivariate analysis software (Python with scikit-learn, SIMCA, or similar)

Procedure:

  • Feature Selection and Block Preparation
    • Perform feature selection on each normalized data block to identify discriminative variables
    • For LC-HRMS data: Use Variable Importance in Projection (VIP) scores or ANOVA to select significant features
    • For NMR data: Identify spectral regions with highest class discrimination power
    • Create reduced data blocks containing only selected features
  • Data Fusion and Model Building

    • Concatenate selected features from all blocks into a fused data matrix
    • Apply supervised multivariate methods to build classification models:
      • sPLS-DA (Sparse Partial Least Squares Discriminant Analysis): Effective for high-dimensional data and variable selection
      • DIABLO (Data Integration Analysis for Biomarker discovery using Latent cOmponents): Specifically designed for multi-block classification
    • Optimize model parameters through cross-validation
  • Model Validation and Interpretation

    • Assess model performance using k-fold cross-validation and independent test sets
    • Calculate classification accuracy, sensitivity, and specificity
    • Identify key features contributing to classification across blocks
    • Validate biological relevance of selected markers through database queries and literature mining
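
The procedure above can be outlined with scikit-learn. sPLS-DA and DIABLO are implemented in the R package mixOmics rather than in scikit-learn, so this sketch substitutes ANOVA-based feature selection per block followed by linear discriminant analysis; it illustrates the mid-level fusion pattern, not those specific algorithms. All data are synthetic and the numbers of retained features are arbitrary assumptions.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)

n_samples, n_lcms, n_nmr = 60, 200, 30
y = np.repeat([0, 1, 2], n_samples // 3)

# Synthetic normalized blocks with a few class-dependent features in each.
lcms = rng.normal(size=(n_samples, n_lcms))
nmr = rng.normal(size=(n_samples, n_nmr))
lcms[:, :5] += 2.0 * y[:, None]           # 5 informative LC-HRMS features
nmr[:, :3] += 2.0 * (y[:, None] == 1)     # 3 informative NMR features

# Blocks are stored side by side; slices tell the pipeline which columns belong to which block.
X = np.hstack([lcms, nmr])
select_and_fuse = ColumnTransformer([
    ("lcms", SelectKBest(f_classif, k=10), slice(0, n_lcms)),
    ("nmr", SelectKBest(f_classif, k=5), slice(n_lcms, n_lcms + n_nmr)),
])

# Selection happens inside the pipeline, so each CV fold selects features on its own training data.
model = make_pipeline(select_and_fuse, LinearDiscriminantAnalysis())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
```

Keeping feature selection inside the cross-validated pipeline is the main design choice here: selecting features on the full dataset before cross-validation would inflate the estimated accuracy.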

Expected Outcomes:

  • Improved classification accuracy compared to single-block models
  • Identification of complementary biomarkers from different analytical platforms
  • Robust model with minimal overfitting, suitable for application to new samples

Case Study: Wine Classification Using Normalized LC-HRMS and NMR Data

Application in Amarone Wine Authentication

A recent study demonstrates the practical application of multi-block data normalization in food classification. Researchers integrated LC-HRMS and 1H NMR data to classify Amarone wines based on grape withering time and yeast strain [1] [37]. The workflow involved:

Experimental Design:

  • Samples: 80 Amarone wine samples with varying withering times and yeast strains
  • Analytical Platforms: LC-HRMS and 1H NMR for comprehensive metabolomic profiling
  • Data Processing: Individual normalization of each data block followed by unsupervised data exploration (Multiple Co-Inertia Analysis - MCIA) and supervised statistical analysis (sPLS-DA)

Normalization Strategy:

  • Platform-specific preprocessing followed by unit variance scaling
  • Retention of complementary information while minimizing technical variance
  • The limited correlation between datasets (RV-score = 16.4%) confirmed their complementarity
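
The RV score quoted above measures how similar two data blocks are when both describe the same samples. A minimal sketch of the standard RV coefficient, \( \mathrm{RV}(X, Y) = \frac{\mathrm{tr}(XX^{\top}YY^{\top})}{\sqrt{\mathrm{tr}\big((XX^{\top})^{2}\big)\,\mathrm{tr}\big((YY^{\top})^{2}\big)}} \) for column-centered blocks, is shown below. Whether the cited study computed its value in exactly this way is not stated in the text, so treat this as the textbook definition rather than the authors' procedure.

```python
import numpy as np


def rv_coefficient(X, Y):
    """RV coefficient between two blocks measured on the same samples (rows).

    Returns a value in [0, 1]; 1 means the sample configurations are
    identical up to rotation and scaling, 0 means they are unrelated.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sx = Xc @ Xc.T  # n x n cross-product (sample configuration) matrices
    Sy = Yc @ Yc.T
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))


rng = np.random.default_rng(0)
block_a = rng.normal(size=(15, 40))   # synthetic block, 15 samples
block_b = rng.normal(size=(15, 8))    # unrelated synthetic block, same samples

rv_same = rv_coefficient(block_a, 3.0 * block_a + 5.0)  # rescaled copy of the same block
rv_unrelated = rv_coefficient(block_a, block_b)
```

A low RV value, such as the 16.4% reported for the wine datasets, indicates that the two platforms capture largely different structure among the samples.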

Results:

  • The sPLS-DA models classified the wine samples with a significantly lower error rate (7.52%) than single-platform approaches
  • Key metabolites discriminating withering time included amino acids, monosaccharides, and polyphenolic compounds
  • Data fusion provided a broader characterization of wine metabolome than individual techniques

Table 2: Key Metabolites Identified Through Normalized Multi-Block Analysis of Amarone Wines

| Metabolite Class | Specific Compounds | Analytical Platform | Discriminatory Power |
| --- | --- | --- | --- |
| Amino Acids | Proline, Alanine, GABA | 1H NMR | High discrimination of withering time [1] |
| Monosaccharides | Glucose, Fructose | LC-HRMS | Distinct patterns across yeast strains [1] |
| Polyphenolic Compounds | Anthocyanins, Flavonols | LC-HRMS | Strong correlation with grape origin [1] |
| Organic Acids | Tartaric acid, Malic acid | 1H NMR | Fermentation monitoring and classification [1] |

Comparison of Data Fusion Strategies

Table 3: Data Fusion Strategies for Multi-Block Food Data Analysis

| Fusion Level | Description | Advantages | Limitations | Application Examples |
| --- | --- | --- | --- | --- |
| Low-Level | Concatenation of raw or minimally preprocessed data blocks | Maximum information retention | Requires the same samples in every block and careful block scaling; high dimensionality amplifies technical noise | Limited use in omics studies [48] |
| Mid-Level | Fusion of features after normalization and selection | Balances information content and noise reduction; allows platform-specific preprocessing | Requires careful feature selection to maintain relevance | Salmon origin authentication [25]; Wine classification [1] |
| High-Level | Fusion of model outputs or decisions | Flexible to heterogeneous data; robust to missing blocks | May lose synergistic information between blocks | Multi-platform food authentication [48] |
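
To make the high-level row of Table 3 concrete, the sketch below fuses the class-probability outputs of two separately trained single-block classifiers by (optionally weighted) averaging. The probability values are invented for illustration and the function name is hypothetical; majority voting or Bayesian combination are equally valid alternatives.

```python
import numpy as np

# Hypothetical class probabilities from two single-block models: 4 samples x 3 classes.
p_lcms = np.array([[0.7, 0.2, 0.1],
                   [0.4, 0.5, 0.1],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1]])
p_nmr = np.array([[0.6, 0.3, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.7, 0.1]])


def fuse_decisions(prob_blocks, weights=None):
    """Average per-block class probabilities and return fused probabilities and labels."""
    stacked = np.stack(prob_blocks)                       # (n_blocks, n_samples, n_classes)
    fused = np.average(stacked, axis=0, weights=weights)  # weights=None gives a plain mean
    return fused, fused.argmax(axis=1)


fused_prob, fused_label = fuse_decisions([p_lcms, p_nmr])
```

In the last sample the two models disagree (class 0 versus class 1); the averaged probabilities favor class 1 because the NMR model is more confident, which is the kind of arbitration that decision-level fusion provides.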

Table 4: Essential Research Reagent Solutions for Multi-Block Data Analysis

| Resource Category | Specific Tools/Solutions | Function/Purpose | Application Context |
| --- | --- | --- | --- |
| Data Preprocessing | scikit-learn Preprocessing (StandardScaler, RobustScaler, MinMaxScaler) | Implements various normalization and scaling techniques | Python-based data preprocessing pipeline [47] |
| Multivariate Analysis | SIMCA, mixOmics (R), PLS Toolbox | Provides specialized algorithms for multi-block data analysis | sPLS-DA, DIABLO, MB-PLS implementation [48] [1] |
| Metabolite Databases | LipidMaps, HMDB, METLIN | Compound identification and annotation | Marker verification in food authenticity studies [25] |
| Multi-Block Methods | MOFA, MEFISTO, Regularized CCA | Statistical frameworks for multi-omics integration | Identifying shared and specific patterns across data blocks [48] [45] |
| Visualization Tools | ggplot2, plotly, Matplotlib | Creation of publication-quality figures | Visualizing multi-block integration results and classification performance [48] |

Workflow Visualization

[Workflow diagram] Raw multi-block data (LC-HRMS, NMR) → Platform-specific preprocessing (peak picking, alignment, baseline correction) → Assess data distribution (histograms, Q-Q plots) → Select normalization method (based on data characteristics) → Apply normalization (Z-score, Min-Max, Robust) → Validate normalization (PCA, scale verification) → Feature selection (VIP, ANOVA, correlation analysis) → Data fusion (mid-level concatenation) → Multivariate analysis (sPLS-DA, DIABLO, PCA-LDA) → Model interpretation and validation (biomarker identification, classification accuracy)

Multi-Block Data Normalization and Fusion Workflow

[Decision tree] Significant outliers present? Yes → Robust Scaling (IQR-based). No → Data follows Gaussian distribution? Yes → Z-score Standardization (mean = 0, std = 1). No → Bounded input required? Yes → Min-Max Scaling (range [0, 1]). No → Distance-based algorithm? Yes → L2 Normalization (unit norm). No → Z-score Standardization.

Normalization Method Selection Decision Tree
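
The decision tree can be written as a small function, which makes the order of the checks explicit. The function and argument names are hypothetical; deciding whether a block "has outliers" or "is Gaussian" still requires the histogram and Q-Q plot inspection described in Protocol 1.

```python
def choose_normalization(has_outliers, is_gaussian, needs_bounded_input, distance_based):
    """Return the normalization method suggested by the decision tree.

    The checks are evaluated in the same order as the tree: outliers first,
    then distribution shape, then bounded-input and distance-based requirements.
    """
    if has_outliers:
        return "robust scaling"
    if is_gaussian:
        return "z-score standardization"
    if needs_bounded_input:
        return "min-max scaling"
    if distance_based:
        return "l2 normalization"
    return "z-score standardization"  # default leaf of the tree


# Example: a skewed LC-HRMS block with a few extreme intensities.
lcms_choice = choose_normalization(
    has_outliers=True, is_gaussian=False, needs_bounded_input=False, distance_based=True
)
```

Because outliers are checked first, a block with extreme values is routed to robust scaling even if the downstream algorithm is distance-based.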

Addressing data heterogeneity through appropriate scaling and normalization techniques is fundamental to successful multi-block data analysis in food classification research. The protocols and case studies presented demonstrate that proper data preprocessing enables effective integration of complementary analytical platforms, leading to enhanced classification accuracy and biomarker discovery. As food authenticity challenges continue to evolve, the systematic approach to multi-block data normalization outlined here provides researchers with a robust framework for leveraging the full potential of multi-platform metabolomic data. Future developments in this field will likely focus on automated normalization selection, handling of missing data blocks, and real-time normalization for quality control in food production environments.

Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) has become indispensable in modern analytical science, particularly for non-targeted screening (NTS) in complex matrices ranging from environmental samples to food products. However, the sophisticated power of this technology is tempered by a significant challenge: technical variance arising from analysis across different devices, laboratories, and time periods. This variability presents a critical barrier to reproducibility, data comparison, and the implementation of robust long-term screening programs.

The issue is particularly acute in food authenticity research, where reliable classification models depend on consistent data generation over extended periods. Studies have demonstrated that different data processing workflows alone can yield notably low levels of feature agreement, potentially leading to divergent scientific interpretations [49] [50]. Furthermore, the lack of robustness, mainly caused by varying instrument conditions and column performance over time, currently hinders the application of untargeted LC-HRMS in commercial laboratories requiring consistent long-term analysis [3].

This application note details practical strategies and protocols to overcome these challenges, with a specific focus on ensuring data reliability within broader research frameworks such as LC-HRMS and NMR data fusion for food classification.

Understanding the specific sources and magnitude of technical variance is the first step in mitigating its effects. The following table summarizes the key factors and their demonstrated impact on data quality.

Table 1: Key Sources of Technical Variance in LC-HRMS and Their Documented Impacts

| Source of Variance | Documented Impact on Data | Evidence from Literature |
| --- | --- | --- |
| Data Processing Algorithms | Low agreement (as low as ~10% overlap) in detected features between different software packages [50]. | Comparison of MZmine3, XCMS, OpenMS, and SIRIUS showed low consensus, affecting downstream multivariate analysis [50]. |
| Instrument & Column Condition | Shifts in retention time and signal intensity due to column aging and instrument contamination [3]. | Slightly different interaction between compounds and columns over time leads to measurement-based differences, hindering long-term data comparability [3]. |
| Mobile Phase Preparation & Delivery | Significant retention time shifts and increased baseline noise [51]. | A 1% change in organic solvent concentration caused a 2-minute retention time shift in a reversed-phase separation; premixing mobile phases reduced noise five-fold [51]. |
| Long-Term Temporal Drift | Inability to directly compare untargeted LC-HRMS data processed at different times without complex batch correction [3]. | A study on honey origin classification highlighted the difficulty of analyzing spectra obtained from different devices and at different points in time without a unified processing strategy [3]. |

Established Strategies for Robust LC-HRMS Analysis

Wet-Lab and Instrumental Protocols

1. Mobile Phase Management: The accuracy of mobile phase composition is critical. A documented case showed that a 1% change in acetonitrile concentration induced a 2-minute retention time shift [51].

  • Protocol: Mobile Phase Premixing
    • Principle: Instead of using pure solvents (e.g., water in line A and acetonitrile in line B), prepare pre-mixed solutions to reduce the proportioning range demanded from the LC pump, thereby increasing effective accuracy.
    • Procedure: For a method requiring a 19-24% acetonitrile gradient, premix the A-reservoir with 10% acetonitrile and the B-reservoir with 30% acetonitrile. The delivered acetonitrile content is then 10% + 0.20 × (%B), so reprogram the gradient to 45-70% B to achieve the same effective composition. A 1% error in pump proportioning now changes the delivered acetonitrile by only 0.2%, a five-fold reduction in the proportioning accuracy demanded from the pump [51].
    • Validation: Confirm retention time stability over multiple runs (<0.1 min variation). The method should allow for minor adjustments (e.g., 0.1% B increments) to fine-tune retention without reformulating the bulk solvents.
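
The premixing arithmetic is easy to check, assuming ideal linear mixing of the two reservoirs by volume. The helper names below are illustrative; the reservoir compositions and the 19-24% target are the ones given in the procedure.

```python
def effective_organic(percent_b, a_organic, b_organic):
    """Organic content (%) delivered when the pump blends percent_b % of reservoir B with A."""
    return a_organic + (b_organic - a_organic) * percent_b / 100.0


def required_percent_b(target_organic, a_organic, b_organic):
    """Pump setting (% B) needed to deliver a target organic content (%)."""
    return 100.0 * (target_organic - a_organic) / (b_organic - a_organic)


# Reservoir A = 10 % acetonitrile, reservoir B = 30 % acetonitrile.
start_b = required_percent_b(19.0, 10.0, 30.0)  # pump setting at the start of the gradient
end_b = required_percent_b(24.0, 10.0, 30.0)    # pump setting at the end of the gradient

# Effect of a 1 % B proportioning error on the delivered acetonitrile content.
error_premixed = effective_organic(1.0, 10.0, 30.0) - 10.0  # premixed reservoirs
error_pure = effective_organic(1.0, 0.0, 100.0) - 0.0       # pure water / pure acetonitrile
```

With these reservoirs the target gradient corresponds to 45-70% B, and a 1% B error shifts the delivered acetonitrile by 0.2% instead of 1%.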

2. System Conditioning and Priming:

  • Protocol: Column Priming for Stable Retention
    • Principle: The complex surface chemistry of stationary phases can contain sites with different equilibration rates. "Priming" saturates these sites to prevent gradual retention time drift during initial injections.
    • Procedure: Inject several high-concentration samples of the analyte(s) of interest one after another. The total mass of analyte injected is more critical than the time of exposure, so this is faster than performing multiple complete analytical runs to achieve stability [51].
    • Validation: Retention times for target compounds should be stable (e.g., <0.1 min drift) after the priming injections.

Data Processing and Analysis Protocols

1. Implementing a Robust, Standardized Data Processing Workflow:

  • Protocol: The BOULS (Bucketing of Untargeted LCMS Spectra) Approach
    • Principle: This method enables the analysis of untargeted LC-HRMS data from different devices and time points without reprocessing the entire dataset, a common limitation of standard workflows [3].
    • Procedure:
      • Data Conversion and Import: Convert raw files to mzML format and import into a processing environment (e.g., R).
      • Centralized Retention Time Alignment: Align all new samples to a single, central reference spectrum. This is a key difference from typical workflows where all samples are aligned to each other in a single batch.
      • Three-Dimensional Bucketing: Divide the aligned data into buckets based on retention time and m/z, summing the signal intensity within each bucket. This creates a consistent feature set across all batches.
      • Model Building and Prediction: Use the bucketed data to build a classification model (e.g., Random Forest). New samples can be aligned to the central reference, bucketed, and classified without reprocessing historical data [3].
    • Application: This protocol was successfully used to determine the geographical origin of honey with 94% accuracy, using data from 835 samples analyzed on different devices over time [3].
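
The bucketing step can be illustrated with a fixed retention-time × m/z grid in which intensities are summed. This is only a sketch of the idea under stated assumptions: the bucket widths (0.5 min and 10 m/z units), the function name, and the toy data points are hypothetical, and the retention-time alignment to a central reference that precedes bucketing in the BOULS workflow is not reproduced here.

```python
import numpy as np


def bucket_sample(rt, mz, intensity, rt_edges, mz_edges):
    """Sum intensities of (retention time, m/z) points on a fixed grid.

    Returns a flat feature vector whose length depends only on the grid,
    so every sample from every batch yields the same feature set.
    """
    grid, _, _ = np.histogram2d(rt, mz, bins=[rt_edges, mz_edges], weights=intensity)
    return grid.ravel()


# Fixed grid shared by all batches (hypothetical bucket sizes).
rt_edges = np.arange(0.0, 10.0 + 0.5, 0.5)        # 20 retention-time buckets
mz_edges = np.arange(100.0, 1000.0 + 10.0, 10.0)  # 90 m/z buckets

# Toy aligned data points for one sample.
rt = np.array([1.02, 1.10, 5.40, 9.90])
mz = np.array([301.14, 301.15, 455.29, 812.50])
intensity = np.array([1000.0, 500.0, 250.0, 80.0])

features = bucket_sample(rt, mz, intensity, rt_edges, mz_edges)
```

Because the grid is defined once, a new sample can be bucketed and passed to an existing classifier without touching previously processed data, which is the property the protocol relies on.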

2. Utilizing Sparse Models for Improved Feature Prioritization:

  • Protocol: Sparse Principal Component Analysis (SPCA) for Time-Series Data
    • Principle: Standard PCA can be difficult to interpret in high-dimensional data as each component is a combination of all variables. SPCA incorporates regularization to select a sparse set of informative features, enhancing interpretability and suppressing noise [50].
    • Procedure:
      • Feature Extraction: Obtain a feature table using a chosen tool (e.g., MZmine3, XCMS).
      • SPCA Modeling: Apply SPCA to the feature table. The model will yield components driven by a limited number of non-zero loading features.
      • Trend Reliability Assessment: Combine SPCA with stratified bootstrapping (SBS-SPCA) to assess the reliability of selected features and the trends they represent across multiple resampled datasets [50].
    • Validation: In a study of industrial wastewater, this method robustly detected persistent temporal trends, with key markers being selected with >70% frequency upon bootstrapping [50].
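
The combination of sparse PCA with stratified bootstrapping can be sketched using scikit-learn's `SparsePCA`. The data are synthetic, and the penalty `alpha`, the number of bootstrap rounds, and the helper name are arbitrary assumptions; the published SBS-SPCA procedure may differ in its SPCA formulation and resampling details.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(3)

# Synthetic feature table: 30 samples in 3 time strata, 40 features.
# Features 0-4 carry a temporal trend; the rest are noise.
strata = np.repeat([0, 1, 2], 10)
X = rng.normal(size=(30, 40))
X[:, :5] += 3.0 * strata[:, None]


def stratified_bootstrap_indices(strata, rng):
    """Resample with replacement within each stratum, preserving stratum sizes."""
    return np.concatenate([
        rng.choice(np.flatnonzero(strata == s), size=int(np.sum(strata == s)), replace=True)
        for s in np.unique(strata)
    ])


n_boot = 20
selected = np.zeros(X.shape[1])
for _ in range(n_boot):
    idx = stratified_bootstrap_indices(strata, rng)
    Xb = X[idx] - X[idx].mean(axis=0)
    spca = SparsePCA(n_components=1, alpha=2.0, random_state=0).fit(Xb)
    selected += (spca.components_[0] != 0)  # count features with non-zero loading

# Fraction of bootstrap rounds in which each feature was selected.
selection_frequency = selected / n_boot
```

Features whose selection frequency exceeds a chosen threshold (the text mentions more than 70%) would be reported as reliable markers, while features selected only occasionally are treated as noise.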

The Scientist's Toolkit: Essential Reagents and Materials

Table 2: Key Research Reagent Solutions for Robust LC-HRMS Analysis

| Item | Function/Application | Specific Example from Literature |
| --- | --- | --- |
| High-Purity Type B Silica Columns | Minimizes surface metal content and acidic silanols, reducing peak tailing and slow equilibration issues, especially for basic compounds [51]. | Standard in modern LC-HRMS methods for metabolomics and exposomics [52]. |
| LC-MS Grade Solvents & Additives | Minimizes baseline noise and ion suppression caused by impurities; ensures consistent chromatographic performance. | Used in all cited methodologies; water and acetonitrile with 0.1% formic acid is a common mobile phase system [52] [53]. |
| Stable Isotope-Labeled Internal Standards | Monitors and corrects for signal drift, ionization efficiency variations, and sample preparation inconsistencies during sample analysis. | Used in robustness testing of peak-picking tools to evaluate area-concentration linearity and instrument performance [50]. |
| Quality Control (QC) Samples | A pooled sample from all test samples injected repeatedly throughout the analytical batch to monitor system stability, perform retention time alignment, and correct for signal drift. | Critical in untargeted metabolomics for monitoring instrument performance and validating data processing parameters [49] [54]. |
| Absorptive Matrices for Non-Standard Sampling | Enables non-invasive, reproducible, and concentrated collection of complex biological fluids for subsequent LC-HRMS analysis. | Leukosorb medium was used to collect nasal epithelial lining fluid (NELF) for untargeted analysis, demonstrating application in novel exposomics studies [52]. |

Integrated Workflow for Robust Multi-Platform Data Fusion

Overcoming technical variance is the foundation for successful data fusion strategies, such as combining LC-HRMS with NMR data. The following diagram illustrates a robust, integrated workflow that can be applied to food classification research, synthesizing the protocols outlined above.

[Workflow diagram] Sample collection → Sample preparation (internal standards added) → two parallel branches. LC-HRMS branch: LC-HRMS analysis (premixed mobile phases, column priming, QC samples) → Data processing (BOULS workflow for LC-HRMS) → Stable LC-HRMS feature table. NMR branch: NMR analysis → Data processing (standard NMR preprocessing) → NMR feature table. Both feature tables → Mid-level data fusion and multivariate analysis → Robust classification model (e.g., Random Forest)

Figure 1: Integrated Workflow for Robust LC-HRMS and NMR Data Fusion

This workflow emphasizes critical steps for ensuring robustness:

  • Standardized Sample Preparation: Incorporation of internal standards from the very beginning.
  • Controlled Instrumental Analysis: Implementation of premixed mobile phases and column priming during LC-HRMS runs, accompanied by frequent QC samples.
  • Consistent Data Processing: Use of the BOULS approach or similar strategies for LC-HRMS data to create a stable feature table over time, alongside standard NMR processing.
  • Data Fusion and Modeling: Integration of the stabilized feature tables from both platforms via mid-level data fusion, followed by the creation of a classification model that benefits from the complementary information of both techniques, as demonstrated in food authenticity studies [1] [25].

Technical variance in LC-HRMS is a formidable but surmountable challenge. A systematic approach that combines meticulous instrumental practices—such as mobile phase premixing and column priming—with advanced data processing strategies—such as the BOULS workflow and sparse multivariate models—can significantly enhance data robustness across devices and time. By implementing these protocols, researchers can build more reliable, reproducible, and transferable non-targeted screening methods. This robustness is paramount for achieving the ultimate goal of fusing LC-HRMS data with other rich data streams, like NMR, to create powerful and trustworthy classification models for food authenticity and beyond.

Dimensionality Reduction and Feature Selection to Minimize Data Noise

In food science, the fusion of Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) data generates rich, high-dimensional datasets crucial for authentication, origin traceability, and quality assessment. However, this high dimensionality introduces significant challenges, including data noise, redundancy, and the "curse of dimensionality," which can severely compromise model performance and interpretability [3] [55]. Effective data processing strategies are essential to distill meaningful biological information from complex analytical fingerprints.

Dimensionality reduction and feature selection serve as critical preprocessing steps to mitigate these issues. Dimensionality reduction techniques, such as Principal Component Analysis (PCA), transform the original high-dimensional data into a lower-dimensional space while preserving essential information [56] [57]. In contrast, feature selection techniques identify and retain the most relevant features from the original dataset, eliminating irrelevant or redundant variables [58] [56]. Within the context of LC-HRMS/NMR data fusion for food classification, these techniques are indispensable for enhancing model accuracy, robustness, and generalizability by minimizing the impact of data noise.
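As a rough illustration of the dimensionality reduction idea described above, the following sketch computes PCA scores through a singular value decomposition. The matrix `X` is synthetic and the function name `pca_scores` is hypothetical; a real fused LC-HRMS/NMR table would be scaled first.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples onto the first principal components via SVD."""
    Xc = X - X.mean(axis=0)                            # mean-center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]    # sample coordinates
    loadings = Vt[:n_components].T                     # feature weights per component
    explained = (S ** 2) / np.sum(S ** 2)              # variance ratio per component
    return scores, loadings, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic stand-in for a fused matrix: 20 samples x 500 features
X = rng.normal(size=(20, 500))
scores, loadings, explained = pca_scores(X, n_components=2)
print(scores.shape, loadings.shape)
```

The 500 original variables are summarized by two orthogonal components, which is the sense in which PCA "transforms" rather than "selects" features.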

Technical Background

The Curse of Dimensionality in Analytical Food Chemistry

High-dimensional data, characterized by a vast number of features (p) relative to observations (n), presents the "small-n-large-p" problem. In LC-HRMS and NMR workflows, a single sample can yield thousands of features, such as spectral peaks or chromatographic coordinates [3] [59]. Without mitigation, this leads to model overfitting, where a model memorizes training data noise rather than learning underlying patterns, resulting in poor predictive performance on new data [55].

Data Noise in LC-HRMS and NMR

Data noise originates from multiple sources:

  • Technical Noise: In LC-HRMS, variations in column performance, instrument calibration, and contamination can cause retention time shifts and intensity fluctuations [3]. In NMR, magnetic field inhomogeneities or sample impurities contribute to spectral noise [59].
  • Biological and Environmental Noise: Natural variations in food composition due to botanical origin, seasonal effects, or processing methods introduce inherent data variability [3].
  • Label Noise: Incorrect class assignments in training data, such as mislabeled geographical origins, can mislead feature selection and model training [58].

Dimensionality Reduction and Feature Selection Techniques

This section details specific methods applicable to LC-HRMS and NMR data fusion, summarizing their properties for easy comparison.

Table 1: Comparison of Dimensionality Reduction and Feature Selection Methods

| Method | Type | Key Principle | Advantages | Limitations | Suitability for LC-HRMS/NMR |
| --- | --- | --- | --- | --- | --- |
| Principal Component Analysis (PCA) | Unsupervised Dimensionality Reduction | Finds orthogonal components that maximize variance in the data. | Simplifies data structure, reduces noise, and is computationally efficient [56] [59]. | Linear assumptions; components may not be biologically relevant [57]. | High for exploratory analysis of NMR data [59]. |
| Independent Component Analysis (ICA) | Unsupervised Dimensionality Reduction | Separates mixed signals into statistically independent components [60]. | Can distinguish sources like endogenous compounds vs. dietary intake [60]. | Computationally intensive; results can be sensitive to algorithm parameters [60]. | Promising for identifying distinct dietary biomarkers from fused data. |
| Random Forest Feature Selection | Supervised Feature Selection | Uses ensemble of decision trees; feature importance is based on node impurity decrease [61]. | Handles high-dimensional data well and models non-linear relationships [61] [3]. | Risk of overfitting without proper validation; can be biased towards variables with more categories [61]. | High; effective with LC-HRMS data for geographical origin classification [61] [3]. |
| Sparse Partial Least Squares (SPLS) | Supervised Dimensionality Reduction | Creates components maximizing covariance with response, with built-in variable selection [62]. | Produces interpretable patterns with a sparse subset of features [62]. | Performance depends on correct tuning of sparsity and number of components [62]. | High for creating interpretable, diet-related patterns from many food measurements [62]. |
| t-SNE / UMAP | Non-linear Dimensionality Reduction | Preserves local neighborhood structures or manifold in low-dimensional embedding [57]. | Excellent for visualization and revealing complex clusters. | Computationally expensive; results can be sensitive to parameters; primarily for visualization [57]. | Moderate for initial data exploration and quality control of clustered samples. |

Protocol: Implementing Feature Selection and Dimensionality Reduction for LC-HRMS/NMR Data Fusion

Objective: To reduce data dimensionality and select discriminative features from fused LC-HRMS and NMR datasets for improved food origin classification accuracy.

Materials and Reagents:

  • LC-HRMS System: Thermo Scientific UltiMate 3000 system coupled to Q Exactive Hybrid Quadrupole-Orbitrap Mass Spectrometer [3].
  • NMR Spectrometer: High-field NMR spectrometer (e.g., 600 MHz) for high-resolution profiling [59].
  • Data Processing Software: R or Python with essential packages (xcms, sklearn, mixOmics).
  • Samples: Authentic food samples (e.g., honey) with verified geographical and botanical origins.

Procedure:

Step 1: Data Preprocessing and Fusion

  • LC-HRMS Processing: Convert raw files to mzML format. Process using the BOULS (Bucketing of Untargeted LCMS Spectra) approach, which involves retention time alignment against a central reference and 3D bucketing (retention time, m/z, intensity) to handle data from different instruments or time periods without reprocessing [3].
  • NMR Processing: Process Free Induction Decay (FID) signals. Apply Fourier transformation, phase and baseline correction, and chemical shift referencing. Integrate spectra into fixed buckets or select distinctive regions of interest [59].
  • Data Fusion: Combine the processed LC-HRMS bucket table and NMR bucket table into a single data matrix. Rows represent samples, and columns represent the combined features from both platforms. Scale the fused data matrix (e.g., Pareto scaling, which divides each mean-centered variable by the square root of its standard deviation) to account for the different measurement scales of the two platforms.
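The fusion step in Step 1 can be sketched as follows. This is a minimal example on synthetic numbers, not the BOULS implementation; the function names are hypothetical, and the optional equal-weighting of blocks by their Frobenius norm is an added assumption that is not part of the protocol above.

```python
import numpy as np

def pareto_scale(X):
    """Mean-center each column and divide by the square root of its standard deviation."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                      # constant columns stay at zero instead of dividing by zero
    return centered / np.sqrt(sd)

def fuse_blocks(blocks, equal_block_weight=True):
    """Pareto-scale each block and concatenate column-wise (samples must be in the same row order)."""
    scaled = []
    for X in blocks:
        Xs = pareto_scale(X)
        if equal_block_weight:             # assumption: give each platform the same total weight
            norm = np.linalg.norm(Xs)      # Frobenius norm of the block
            if norm > 0:
                Xs = Xs / norm
        scaled.append(Xs)
    return np.hstack(scaled)

rng = np.random.default_rng(0)
lcms = rng.lognormal(mean=10, sigma=1, size=(6, 4))   # synthetic LC-HRMS bucket table
nmr = rng.lognormal(mean=2, sigma=0.5, size=(6, 3))   # synthetic NMR bucket table
fused = fuse_blocks([lcms, nmr])
print(fused.shape)
```

Without some form of block weighting, the platform with more columns or larger residual variance tends to dominate the fused matrix, which is why the weighting flag is included here.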

Step 2: Feature Selection Using Random Forest

  • Model Training: Split the fused, normalized dataset into training and testing sets (e.g., 70/30). Train a Random Forest classifier on the training set using the known class labels (e.g., geographical origin) [61] [3].
  • Importance Calculation: Extract feature importance scores, typically measured by the mean decrease in Gini impurity or mean decrease in accuracy [61].
  • Feature Subset Selection: Rank all features by their importance scores. Select the top k most important features for downstream analysis, where k can be determined by cross-validation performance or a predefined threshold.
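A possible scikit-learn sketch of Step 2 is given below. The data are synthetic (only the first five columns carry class information), and the values of `n_estimators` and `k` are arbitrary assumptions rather than recommendations from the cited studies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_samples, n_features = 120, 200
X = rng.normal(size=(n_samples, n_features))    # synthetic fused, scaled matrix
y = np.repeat([0, 1], n_samples // 2)           # two hypothetical origins
X[y == 1, :5] += 2.0                            # only the first five features are informative

# 70/30 split, stratified so both classes appear in each subset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X_train, y_train)                        # fit on the training set only

importances = rf.feature_importances_           # impurity-based (mean decrease in Gini)
k = 10
top_k = np.argsort(importances)[::-1][:k]       # indices of the k most important features

X_train_sel = X_train[:, top_k]
X_test_sel = X_test[:, top_k]                   # same columns applied to the held-out set
print(sorted(top_k.tolist()))
```

Ranking features on the training portion only keeps the test set untouched, so the later performance estimate is not biased by the selection step.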

Step 3: Dimensionality Reduction with Sparse PLS-DA

  • Apply SPLS-DA: Input the selected feature subset from Step 2 into a Sparse Partial Least Squares - Discriminant Analysis (SPLS-DA) model. SPLS-DA is particularly suited for supervised classification and identifying key discriminative features [62].
  • Parameter Tuning: Optimize the number of components and the sparsity parameter (eta in the R spls package; the number of retained variables, keepX, in mixOmics) via cross-validation to maximize classification accuracy and obtain a sparse set of feature weights.
  • Pattern Interpretation: Use the resulting components and their corresponding feature loadings to identify the specific LC-HRMS and NMR biomarkers most responsible for class separation.

Step 4: Model Validation

  • Predictive Assessment: Use the trained SPLS-DA model to predict the class labels of the held-out test set. Calculate performance metrics such as overall accuracy, precision, and recall.
  • Robustness Validation: Validate the model's robustness by injecting proportional random label noise into the training data and observing the stability of the selected feature subset and model performance [58].
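The label-noise check in Step 4 could be set up along the following lines. This is one plausible reading of "proportional random label noise" (the same fraction of labels is flipped within each class); the function name and the 10% rate are assumptions.

```python
import numpy as np

def inject_label_noise(y, noise_rate, rng):
    """Flip a fixed fraction of labels within each class to a different, randomly chosen class."""
    y = np.asarray(y)
    classes = np.unique(y)
    y_noisy = y.copy()
    for c in classes:
        idx = np.flatnonzero(y == c)                    # positions of class c in the clean labels
        n_flip = int(round(noise_rate * idx.size))
        flip = rng.choice(idx, size=n_flip, replace=False)
        others = classes[classes != c]
        y_noisy[flip] = rng.choice(others, size=n_flip) # assign a wrong class at random
    return y_noisy

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 20)                            # three hypothetical origins, 20 samples each
y_noisy = inject_label_noise(y, noise_rate=0.10, rng=rng)
print(int((y != y_noisy).sum()), "labels changed")
```

Re-running feature selection on `y_noisy` at increasing noise rates and comparing the selected subsets with those from the clean labels gives a simple stability curve.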

Table 2: Research Reagent Solutions and Essential Materials

| Item | Function/Application | Example/Specification |
| --- | --- | --- |
| Q Exactive Mass Spectrometer | High-resolution accurate mass analysis for untargeted profiling of food metabolites [3]. | Thermo Scientific Hybrid Quadrupole-Orbitrap. |
| NMR Spectrometer | Provides quantitative data on food composition and structure; non-destructive [59]. | High-field spectrometer (e.g., 600 MHz) with HRMAS probe for semi-solids. |
| xcms Package (R) | Bioinformatic tool for processing, peak detection, and alignment of LC-HRMS data [3]. | Used within the BOULS workflow for initial peak picking. |
| BOULS Workflow | Enables analysis of LC-HRMS data from different devices/times via 3D bucketing [3]. | Custom R-based workflow, available on GitHub. |
| mixOmics Package (R) | Provides implementations of SPLS-DA and other multivariate methods for integrative data analysis [57]. | Essential for performing supervised dimensionality reduction on fused data. |

The following workflow diagram synthesizes the key steps from data acquisition to final classification, highlighting the critical roles of feature selection and dimensionality reduction.

[Workflow diagram: LC-HRMS Data and NMR Data → Data Preprocessing & Fusion → Fused Data Matrix → Random Forest Feature Selection → Selected Feature Subset → SPLS-DA Dimensionality Reduction → Classification Model & Validation → Final Classification & Biomarkers]

Figure 1: Integrated Workflow for LC-HRMS and NMR Data Fusion and Analysis. This protocol outlines the sequential steps from raw data acquisition to final classification, emphasizing the critical stages of data fusion, feature selection, and dimensionality reduction to minimize noise and enhance model performance.

Application Notes and Case Studies

Case Study: Geographical Origin Classification of Honey

A practical application involved using LC-HRMS data from 835 honey samples to classify geographical origin. The initial high-dimensional data was processed using the BOULS bucketing method to ensure comparability across devices. A Random Forest classifier was then applied, achieving a final classification accuracy of 94% on a test set of 126 samples from six different countries [3]. This demonstrates that effective data processing and feature selection can yield highly accurate models suitable for routine application in commercial laboratories, directly addressing food fraud.

Mitigating the Impact of Class Label Noise

The robustness of feature selection is critical when class labels in the training data may be unreliable. A methodology to evaluate this involves injecting artificial label noise into the training data in a proportional, random manner. Studies comparing feature selection methods under such conditions have found that multivariate methods like Random Forest generally demonstrate greater robustness to class noise compared to some univariate filter methods, producing more stable feature subsets and maintaining better model performance [58]. This is a vital consideration when building models from data where expert labeling might be subjective or error-prone.

Dimensionality reduction and feature selection are not merely optional steps but fundamental components of a robust analytical pipeline for LC-HRMS and NMR data fusion in food classification. By effectively minimizing data noise and reducing dimensionality, these techniques empower researchers to build models that are both highly accurate and interpretable. The presented protocols and case studies provide a concrete framework for implementing these methods, enabling the transition of complex analytical data into reliable tools for ensuring food authenticity and quality. As the field progresses, the integration of more robust, non-linear, and compositionally-aware methods will further enhance our ability to extract meaningful signals from the complex noise inherent in modern food metabolomics data.

Optimizing Model Parameters and Avoiding Overfitting in Multi-Omics Models

Integrating Liquid Chromatography-High-Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (¹H NMR) data through multi-omics modeling presents powerful opportunities for advanced food classification and authentication. However, these high-dimensional datasets create significant challenges in model parameter optimization and overfitting risks. This protocol details systematic approaches for building robust, generalizable multi-omics classification models using data fusion strategies, with specific application to food authenticity verification. We provide comprehensive guidelines for hyperparameter tuning, validation frameworks, and implementation of supervised integration methods that have demonstrated success in discriminating food products based on geographical origin, processing methods, and botanical varieties.

Multi-omics data integration, particularly LC-HRMS and ¹H NMR fusion, has emerged as a transformative approach for food classification research. The complementary nature of these analytical techniques—with LC-HRMS offering high sensitivity for diverse metabolite detection and ¹H NMR providing robust quantitative analysis—creates a comprehensive metabolomic profile ideal for food authentication [1]. However, the high-dimensional nature of multi-omics datasets, where the number of features (variables) often vastly exceeds the number of samples, creates substantial risk of model overfitting. This occurs when models memorize noise and idiosyncrasies in the training data rather than learning generalizable patterns, resulting in poor performance on new datasets.

The integration of LC-HRMS and ¹H NMR is particularly valuable for food classification as it captures a broad spectrum of chemical compounds, from primary metabolites to specialized secondary metabolites, providing a chemical "fingerprint" that can distinguish subtle differences in food products based on processing, origin, or cultivar [26]. Successful implementation requires careful optimization of model parameters and rigorous validation strategies to ensure analytical robustness. This protocol addresses these challenges through systematic methodologies tested in real-world food authentication case studies.

Key Multi-Omics Integration Methods

Various computational approaches have been developed for integrating multiple omics datasets, each with distinct strengths and parameter optimization requirements. The table below summarizes the primary methods used in food classification research:

Table 1: Multi-Omics Data Integration Methods for Food Classification

| Method | Key Characteristics | Primary Parameters to Optimize | Strengths | Limitations |
| --- | --- | --- | --- | --- |
| DIABLO | Supervised multivariate integration; identifies correlated variables across datasets; maximizes class discrimination [26] | Number of components, number of variables to select per component, design matrix | Excellent classification performance; identifies key biomarkers across platforms; handles paired data structure | Requires careful cross-validation; potential overfitting with small sample sizes |
| sPLS-DA | Supervised extension of partial least squares; performs variable selection and classification simultaneously [1] | Number of components, number of variables to select | Reduces dimensionality; produces interpretable models; handles multicollinearity | Sensitive to parameter tuning; may miss complex interactions |
| MOFA | Factor analysis approach; identifies latent factors describing variation across and within data types [63] | Number of factors, sparsity parameters, convergence tolerance | Handles missing data; identifies shared and specific variation; probabilistic framework | Requires substantial data for stability; complex interpretation |
| AJIVE | Decomposes variation into joint, individual, and noise components; uses angle-based decomposition [63] | Signal ranks for each data type, joint rank | Stable with small sample sizes; clear variance partitioning | Subjective rank selection; limited supervised application |
| Sparse mCCA | Extension of canonical correlation analysis; finds correlated dimensions across multiple datasets with sparsity constraints [63] | Sparsity parameters, number of components | Identifies cross-omics correlations; produces sparse, interpretable components | Computationally intensive; requires permutation testing |

Experimental Protocols for LC-HRMS and ¹H NMR Data Fusion

Sample Preparation and Data Acquisition

Materials and Reagents:

  • LC-HRMS Grade Solvents (acetonitrile, methanol, water with 0.1% formic acid): Ensure high purity to minimize background noise and ion suppression [1]
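  • NMR Reference Standards (TSP or DSS): Provide the chemical shift reference and quantification standard [26]; sodium azide, where added, serves as a bacteriostatic agent rather than a reference
  • Deuterated Solvent (D₂O, CD₃OD, or DMSO-d6): Provides the deuterium signal for field-frequency locking and shimming, and reduces the solvent proton signal that would otherwise have to be suppressed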
  • Quality Control Samples (pooled sample aliquots): Monitors instrument performance and data quality throughout acquisition [26]
  • Internal Standards (stable isotope-labeled compounds): Corrects for analytical variation in LC-HRMS

Protocol 1: LC-HRMS Metabolomic Profiling

  • Sample Extraction: Homogenize food samples (100 mg) with 1 mL of cold methanol:acetonitrile:water (2:2:1, v/v/v) at 4°C
  • Centrifugation: Centrifuge at 14,000 × g for 15 minutes at 4°C to pellet insoluble material
  • Analysis:
    • LC Conditions: Reverse-phase C18 column (100 × 2.1 mm, 1.8 μm); mobile phase A (water + 0.1% formic acid) and B (acetonitrile + 0.1% formic acid)
    • Gradient: 5-100% B over 15 minutes, 5-minute re-equilibration
    • MS Detection: High-resolution mass spectrometer (Orbitrap or Q-TOF) in both positive and negative ionization modes
    • Quality Control: Inject pooled QC samples every 6-10 samples to monitor stability

Protocol 2: ¹H NMR Metabolomic Profiling

  • Sample Preparation: Mix 100 μL of food extract with 400 μL of deuterated phosphate buffer (pH 7.4) containing 0.25 mM TSP as chemical shift reference [1]
  • Data Acquisition:
    • Instrument: High-field NMR spectrometer (≥600 MHz)
    • Pulse Sequence: NOESY-presat for water suppression
    • Parameters: 64 scans, 4 s relaxation delay, 100 ms mixing time, 2 s acquisition time
    • Temperature: Maintain at 298 K throughout analysis
  • Processing:
    • Fourier transformation with 0.3 Hz line broadening
    • Phase and baseline correction
    • Referencing to TSP at 0.0 ppm
    • Spectral binning (0.04 ppm buckets) to reduce dimensionality
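The final binning step can be illustrated with a short NumPy sketch. The spectrum here is a synthetic array of ones, the function name is hypothetical, and the 0 to 10 ppm window is an assumption; in practice the residual water region is usually excluded before modeling.

```python
import numpy as np

def bin_spectrum(ppm, intensity, width=0.04, lo=0.0, hi=10.0):
    """Sum intensities into fixed-width chemical shift buckets covering [lo, hi)."""
    edges = np.arange(lo, hi + width / 2, width)       # bucket boundaries
    n_bins = len(edges) - 1
    idx = np.digitize(ppm, edges) - 1                  # bucket index of every data point
    valid = (idx >= 0) & (idx < n_bins)                # drop points outside the window
    buckets = np.bincount(idx[valid], weights=intensity[valid], minlength=n_bins)
    centers = edges[:-1] + width / 2
    return centers, buckets

# Synthetic spectrum: 5000 points of unit intensity between 0 and 10 ppm
ppm = np.linspace(0.0, 9.999, 5000)
intensity = np.ones_like(ppm)
centers, buckets = bin_spectrum(ppm, intensity)
print(len(buckets), buckets.sum())
```

With 0.04 ppm buckets, a 10 ppm window collapses thousands of spectral points into 250 variables while preserving the total signal inside the window.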

Data Preprocessing and Feature Extraction

LC-HRMS Data Processing:

  • Peak Picking and Alignment: Use XCMS, MS-DIAL, or Progenesis QI for peak detection, retention time alignment, and integration
  • Annotation: Search accurate mass and MS/MS spectra against databases (HMDB, MassBank, LipidMaps)
  • Filtering: Remove features with >30% missing values in QC samples or >20% relative standard deviation in technical replicates
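The filtering rule above might be applied as in the following sketch. It assumes that the pooled QC injections double as the technical replicates for the RSD calculation and that missing values are stored as NaN; both are assumptions, and the function name is hypothetical.

```python
import numpy as np

def filter_features(qc, max_missing=0.30, max_rsd=0.20):
    """Boolean mask of features that pass the QC missing-value and RSD thresholds.

    qc: array of shape (n_qc_injections, n_features); NaN marks a missing value.
    """
    missing_frac = np.isnan(qc).mean(axis=0)
    mean = np.nanmean(qc, axis=0)
    sd = np.nanstd(qc, axis=0, ddof=1)
    # Relative standard deviation; features with a non-positive mean are rejected
    rsd = np.divide(sd, mean, out=np.full_like(mean, np.inf), where=mean > 0)
    return (missing_frac <= max_missing) & (rsd <= max_rsd)

qc = np.array([
    [100.0, 100.0, 50.0],
    [102.0, 150.0, np.nan],
    [ 98.0,  60.0, 51.0],
    [101.0, 130.0, np.nan],
    [ 99.0,  80.0, 49.0],
])
mask = filter_features(qc)
print(mask)   # feature 0 is stable, feature 1 is too variable, feature 2 has 40% missing
```

The mask is computed from QC injections only and then applied to the columns of the full sample table.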

¹H NMR Data Processing:

  • Spectral Alignment: Correct for minor pH-induced chemical shift variations using recursive segment-wise peak alignment
  • Normalization: Apply probabilistic quotient normalization to correct for dilution effects
  • Metabolite Identification: Assign peaks using reference databases (BMRB, HMDB) and spiking experiments
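Probabilistic quotient normalization can be sketched in a few lines. This simplified version uses the median spectrum as the reference and omits the preliminary total-area normalization that is often applied first; the function name and the synthetic spectra are assumptions.

```python
import numpy as np

def pqn_normalize(X, reference=None):
    """Divide each spectrum by the median of its feature-wise quotients against a reference."""
    X = np.asarray(X, dtype=float)
    if reference is None:
        reference = np.median(X, axis=0)        # median spectrum as reference
    keep = reference > 0                        # avoid dividing by empty reference bins
    quotients = X[:, keep] / reference[keep]
    dilution = np.median(quotients, axis=1)     # most probable dilution factor per sample
    return X / dilution[:, None], dilution

base = np.array([5.0, 1.0, 3.0, 8.0, 2.0])      # synthetic binned spectrum
factors = np.array([1.0, 2.0, 0.5])             # simulated dilution of three samples
X = np.outer(factors, base)
X_norm, dilution = pqn_normalize(X)
print(dilution)
```

Because the dilution estimate is a median over all bins, a few metabolites that truly differ between samples have little influence on it, unlike total-area normalization.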

Table 2: Quantitative Parameters for Data Quality Assessment

| Quality Metric | LC-HRMS Threshold | ¹H NMR Threshold | Corrective Action if Failed |
| --- | --- | --- | --- |
| Retention Time Drift | < 0.2 min in QCs | N/A | Recalibrate alignment or re-run sequence |
| Mass Accuracy | < 5 ppm | N/A | Recalibrate mass spectrometer |
| Spectral Linewidth | N/A | < 1.0 Hz | Reshim magnet and check tuning |
| PCA QC Clustering | RSD < 30% in pooled QCs | RSD < 20% in pooled QCs | Remove outlier samples or batches |
| Signal Intensity Drift | < 30% in QCs | < 20% in QCs | Normalize to internal standards |
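The mass accuracy criterion is a simple ratio, shown here as a short worked example. The m/z values are illustrative (an approximate [M+H]+ value for caffeine and an invented measured value), and the function name is hypothetical.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Mass error in parts per million relative to the theoretical m/z."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6

theoretical = 195.0877     # approximate [M+H]+ of caffeine, used only as an illustration
measured = 195.0882        # invented instrument reading
error = ppm_error(measured, theoretical)
within_tolerance = abs(error) < 5.0    # 5 ppm threshold from the table
print(round(error, 2), within_tolerance)
```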

Model Optimization Framework

Hyperparameter Optimization Strategies

Effective model optimization requires systematic approaches to hyperparameter tuning, particularly for multi-omics integration methods:

Bayesian Optimization with Hyperband (BOHB):

  • Principle: Combines the efficiency of Bayesian optimization with the resource allocation of Hyperband [64]
  • Implementation:
    • Define search space for key parameters (number of components, sparsity parameters, learning rates)
    • Initialize with random configurations using small resource budget (number of iterations)
    • Progressively allocate more resources to promising configurations
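    • Use a surrogate model to guide selection (BOHB itself fits kernel density estimators in the style of the tree-structured Parzen estimator; Gaussian process and random forest surrogates are used by other Bayesian optimizers)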
  • Advantage: Efficiently balances exploration and exploitation while managing computational budget
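The resource-allocation idea behind Hyperband can be sketched with plain successive halving. This is deliberately not BOHB: candidate configurations come from a fixed grid instead of a fitted density model, and `toy_loss` is an invented stand-in for a cross-validated error that would normally depend on the data.

```python
from itertools import product

def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations at each rung while multiplying the budget by eta."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))   # lower loss is better
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta                                                   # spend more on fewer candidates
    return survivors[0]

def toy_loss(config, budget):
    """Hypothetical loss surface with its minimum at 3 components and 40 kept variables."""
    return ((config["n_components"] - 3) ** 2
            + ((config["keep_x"] - 40) / 10) ** 2
            + 1.0 / budget)                       # loss estimate improves with a larger budget

grid = [{"n_components": n, "keep_x": k}
        for n, k in product(range(2, 7), [10, 20, 40, 80])]
best = successive_halving(grid, toy_loss)
print(best)
```

In a full BOHB run, several such brackets with different starting budgets are executed, and new configurations are proposed by the model fitted to earlier results rather than taken from a grid.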

Cross-Validation Framework:

  • Stratified k-fold: Maintains class distribution in each fold (k=5-10 based on sample size)
  • Nested Cross-Validation: Outer loop for performance estimation, inner loop for parameter tuning
  • Monte Carlo Cross-Validation: Repeated random splits (70/30) to assess stability
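Nested cross-validation can be assembled from standard scikit-learn components, as in the sketch below. Logistic regression is used only as a stand-in classifier (scikit-learn does not provide sPLS-DA or DIABLO), and the data and parameter grid are synthetic assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 300))        # synthetic data with far more features than samples
y = np.repeat([0, 1], 30)
X[y == 1, :10] += 1.5                 # ten informative features

# Scaling lives inside the pipeline so it is refit on every training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = {"logisticregression__C": [0.01, 0.1, 1.0]}

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # parameter tuning
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # performance estimation

search = GridSearchCV(pipe, grid, cv=inner, scoring="balanced_accuracy")
outer_scores = cross_val_score(search, X, y, cv=outer, scoring="balanced_accuracy")
print(outer_scores.round(2), outer_scores.mean().round(2))
```

Each outer test fold is never seen during tuning, so the mean outer score is a less optimistic estimate than the best inner cross-validation score.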

DIABLO Implementation for Food Classification

The DIABLO framework has demonstrated exceptional performance in food authentication studies, achieving classification error rates as low as 7.52% in Amarone wine authentication based on withering time and yeast strain [1]. The following workflow outlines the optimization process:

[Workflow diagram: Data Preprocessing → Define Parameter Grid → Nested Cross-Validation → Model Training → Performance Evaluation → Final Model Selection. Key parameters to optimize in the grid: number of components, variables per dataset (keepX), and design matrix.]

DIABLO Parameter Optimization Protocol:

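  • Define Parameter Grid:
    • Number of components: Test range 2-10 components
    • keepX parameters: Test sparse selection from 10-100 features per dataset and component
    • Design matrix: Test off-diagonal values from 0.1-0.9 to balance integration and discrimination (lower values prioritize class discrimination, higher values prioritize correlation between datasets)
  • Tune Using Repeated Cross-Validation:
    • Evaluate each parameter combination in the grid with repeated stratified cross-validation
  • Select Optimal Parameters: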
    • Choose parameters that minimize balanced error rate across cross-validation folds
    • Ensure stability through repeated cross-validation (nrepeat ≥ 10)
    • Validate with permutation testing (nperm ≥ 100) to confirm significance
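The selection criterion named above, the balanced error rate, is the mean of the per-class error rates. The sketch below computes it and enumerates a parameter grid of the kind described; the grid values are illustrative assumptions, and the actual DIABLO tuning is performed in the R mixOmics package rather than with this code.

```python
from itertools import product
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Average of the per-class misclassification rates."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] != c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Imbalanced toy case: a model that always predicts the majority class
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
ber = balanced_error_rate(y_true, y_pred)
print(ber)   # overall error is 0.2, but the minority class is always wrong

# Hypothetical grid: components x kept variables x off-diagonal design value
param_grid = list(product(range(2, 11), [10, 25, 50, 100], [0.1, 0.5, 0.9]))
print(len(param_grid), "parameter combinations")
```

The balanced error rate penalizes models that ignore small classes, which matters when some origins or cultivars are represented by only a few samples.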

Overfitting Prevention Strategies

Consistency Evaluation Framework: Research demonstrates that multi-omics methods can show excellent consistency in large-sample settings but may become unstable with smaller sample sizes [63]. Implement these evaluation strategies:

  • Cross-Validation Consistency: Measure similarity of selected features across cross-validation folds using Jaccard index
  • Permutation Null Testing: Compare performance against null models with permuted class labels
  • External Validation: Reserve completely independent test set (20-30% of samples) for final evaluation
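Two of the checks above reduce to short calculations, sketched here in plain Python. The feature sets and null scores are invented for illustration, and the function names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two feature sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def mean_pairwise_jaccard(feature_sets):
    """Average Jaccard index over all pairs of cross-validation folds."""
    pairs = list(combinations(feature_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def permutation_p_value(observed, null_scores):
    """Fraction of permuted-label scores at least as good as the observed one (with +1 correction)."""
    exceed = sum(s >= observed for s in null_scores)
    return (exceed + 1) / (len(null_scores) + 1)

# Features selected in three folds (invented identifiers)
folds = [{"f1", "f2", "f3"}, {"f1", "f2", "f4"}, {"f1", "f2", "f3"}]
stability = mean_pairwise_jaccard(folds)

# Observed accuracy against four permuted-label accuracies (invented numbers)
p_value = permutation_p_value(0.90, [0.50, 0.55, 0.60, 0.92])
print(round(stability, 3), p_value)
```

A stability value near 1 means the same features are selected regardless of the fold; with only four permutations the smallest attainable p-value is 0.2, which is why at least 100 permutations are generally used.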

Dimensionality Control:

  • Feature Pre-filtering: Remove non-informative variables using interquartile range or variance filters before integration
  • Sparsity Constraints: Implement L1 regularization to force zero coefficients for irrelevant features
  • Component Selection: Use elbow method in scree plots or permutation testing to determine optimal number of components
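Feature pre-filtering by interquartile range might look like the following sketch. The fraction of features to keep is an arbitrary assumption, the data are synthetic, and the filter should be applied before any variance-equalizing scaling, since scaling removes the spread differences it relies on.

```python
import numpy as np

def iqr_filter(X, keep_fraction=0.5):
    """Keep the fraction of features with the largest interquartile range."""
    q75, q25 = np.percentile(X, [75, 25], axis=0)
    iqr = q75 - q25
    n_keep = max(1, int(round(keep_fraction * X.shape[1])))
    keep = np.sort(np.argsort(iqr)[::-1][:n_keep])    # column indices, in original order
    return X[:, keep], keep

rng = np.random.default_rng(0)
# Synthetic table: columns 1, 3, 5 vary strongly, columns 0, 2, 4 are nearly constant
X = rng.normal(size=(30, 6)) * np.array([0.1, 5.0, 0.1, 5.0, 0.1, 5.0])
X_filtered, kept = iqr_filter(X, keep_fraction=0.5)
print(kept)
```

Because this filter ignores the class labels, it can be applied to the whole dataset without leaking information into the cross-validation.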

Case Study: Hazelnut Authentication Using Multi-Omics Fusion

A recent study on Tonda Gentile Trilobata hazelnuts from Piedmont demonstrates the practical application of these optimization principles [26]. The research integrated ¹H NMR and LC-HRMS datasets using DIABLO to discriminate geographical origin and cultivars across multiple harvest years.

Experimental Parameters:

  • Sample Size: Multiple hazelnut samples across two consecutive years
  • Data Types: ¹H NMR spectra and LC-HRMS positive/negative mode data
  • Integration Method: DIABLO framework with supervised integration
  • Key Findings: Data fusion significantly enhanced robustness and information extraction compared to individual techniques

Optimization Outcomes:

  • Successful classification of geographical origin and cultivar with minimal error rate
  • Identification of correlated variables across ¹H NMR and LC-HRMS datasets
  • Confirmation that annual variability has a hierarchically stronger effect than origin or cultivar
  • Demonstration that merged data provided superior classification to individual analytical approaches

The Scientist's Toolkit: Essential Research Reagents and Materials

Table 3: Essential Research Reagents and Materials for LC-HRMS/NMR Food Classification

| Category | Specific Items | Function/Purpose | Technical Specifications |
| --- | --- | --- | --- |
| Chromatography | UHPLC-grade acetonitrile, methanol, water; 0.1% formic acid | Mobile phase for LC separation; enables electrospray ionization | ≥99.9% purity, LC-MS compatible, low background noise |
| Mass Spectrometry | Leucine enkephalin, sodium formate, ESI tuning mix | Mass calibration and instrument performance verification | Certified reference materials for accurate mass calibration |
| NMR Spectroscopy | Deuterated solvents (D₂O, CD₃OD); TSP, DSS; sodium azide | Field frequency locking; chemical shift reference; bacteriostatic agent | 99.8% deuterium minimum; NMR-grade purity |
| Quality Control | Pooled quality control samples; NIST SRM 1950 | Monitoring instrument stability; inter-laboratory comparison | Representative matrix-matched materials |
| Sample Preparation | Solid-phase extraction cartridges; microfuge filters; isotope-labeled internal standards | Sample cleanup; protein removal; quantification normalization | C18, HILIC, or mixed-mode phases; 3 kDa molecular weight cutoff |
| Data Analysis | CDP, XCMS, CAMERA, MetaboAnalyst, mixOmics | Spectral processing, peak alignment, statistical analysis, multi-omics integration | R/Python packages with multi-omics capabilities |

Optimizing model parameters and preventing overfitting in multi-omics models requires a systematic approach spanning experimental design, data preprocessing, computational integration, and rigorous validation. The integration of LC-HRMS and ¹H NMR provides particularly powerful capabilities for food classification research when implemented with appropriate safeguards against overfitting. The DIABLO framework has demonstrated excellent performance in practical food authentication applications, achieving robust classification of food products based on geographical origin, processing methods, and botanical characteristics. By adhering to the protocols outlined in this application note—including careful hyperparameter optimization through Bayesian methods, rigorous cross-validation, and comprehensive consistency testing—researchers can build multi-omics models that generalize well to new samples and provide reliable classification for food authentication applications.

Software and Computational Considerations for Handling Large, Fused Datasets

The integration of multiple analytical techniques, known as data fusion, has emerged as a powerful strategy for enhancing the classification and characterization of complex samples. In food science and metabolomics, combining complementary datasets provides a more comprehensive understanding of food quality attributes than any single technique can deliver independently [65]. This approach is particularly valuable for high-value products like Amarone wine, where strict production parameters collectively contribute to a complex combination of thousands of compounds that define its character [65]. The rationale behind this methodology is that complementary and synergistic effects arise from combining multi-source information, ultimately improving discriminant performance for sample classification [65].

The handling of these large, fused datasets presents significant computational challenges that require specialized software solutions and strategic data management approaches. This document outlines the key software considerations, visualization tools, and experimental protocols for successfully implementing data fusion strategies in food classification research, with specific application to LC-HRMS and ¹H NMR metabolomics data.

Essential Software and Computational Tools

Data Visualization and Analysis Platforms

Handling large, fused datasets requires robust software tools capable of processing, analyzing, and visualizing complex multi-omics data. The table below summarizes key computational tools relevant for metabolomics data fusion research.

Table 1: Software Tools for Large-Scale Data Analysis and Visualization

| Tool Name | Primary Use Case | Key Features | Cost Considerations |
| --- | --- | --- | --- |
| Power BI [66] [67] | Business intelligence & reporting | Data connectivity, transformation with Power Query, automated data refresh, AI-powered insights | $10-$20/user/month |
| Tableau [66] [67] [68] | Advanced analytics & complex visualizations | Dynamic dashboards, drill-down capabilities, extensive customization | $70+/user/month |
| Qlik Sense [66] [67] [68] | Self-service analytics | Associative data model, AI-powered insights, alerting and automation | $30+/user/month |
| Apache Spark [68] | Big data processing | Stream data processing, in-memory computing, supports SQL queries, machine learning libraries | Open source |
| Plotly [66] | Web-based data visualization | Interactive charts, Dash framework for web applications, support for Python, R, JavaScript | Free tier available |
| D3.js [67] | Custom visualizations | JavaScript library for highly customized, interactive data visualizations | Open source |

For researchers working with fused LC-HRMS and NMR data, tools like Tableau and Plotly offer the interactive visualization capabilities necessary to explore relationships between different data modalities, while Apache Spark provides the computational foundation for processing large-scale datasets.

Data Preparation Challenges

A critical consideration often overlooked in data fusion workflows is the significant time investment required for data preparation. Industry analysis indicates that data teams spend 80-90% of their time on data preparation tasks rather than actual analysis and visualization [67]. This challenge is particularly pronounced when working with fused datasets from different analytical platforms, where data standardization, cleaning inconsistent formatting, handling missing data, and creating business rules for integration can create substantial bottlenecks in the research pipeline [67].

Experimental Protocol: LC-HRMS and ¹H NMR Data Fusion

Reagents and Materials

The following research reagents and solutions are essential for implementing the LC-HRMS and ¹H NMR metabolomics data fusion protocol for food classification:

Table 2: Essential Research Reagents and Materials for Metabolomics Data Fusion

| Item | Specification | Function/Application |
| --- | --- | --- |
| LC-MS Grade Water [65] | Sigma-Aldrich (Madison, CA, USA) | Mobile phase component for LC-HRMS analysis |
| LC-MS Grade Acetonitrile [65] | Sigma-Aldrich (Madison, CA, USA) | Organic solvent for LC-HRMS analysis |
| Deuterium Oxide (D₂O) [65] | VWR International BVBA (99.86% D) | Solvent for ¹H NMR spectroscopy |
| TSP Standard [65] | Sigma-Aldrich (CAS No. 24493-21-8) | Chemical shift reference standard for NMR |
| Oxalic Acid [65] | Sigma-Aldrich (CAS No. 144-62-7, 98%) | Sample preparation for NMR analysis |
| Sodium Oxalate [65] | Sigma-Aldrich (CAS No. 62-76-0, ≥99.5%) | Sample preparation for NMR analysis |
| Formic Acid [65] | Not specified in detail | Acidification of extraction solvent for LC-HRMS |
| Amarone Wine Samples [65] | 80 distinct samples from Valpolicella terroirs | Experimental material for metabolomic profiling |

Sample Preparation Protocol

LC-HRMS sample preparation:

  • Thaw wine samples at room temperature and extract in triplicate
  • Combine 850 μL of wine sample with 850 μL of acetonitrile acidified with 1% (v/v) formic acid
  • Process the mixture with ultrasound for 10 minutes
  • Centrifuge the samples using an Eppendorf 5810R centrifuge to separate phases
  • Collect the supernatant for LC-HRMS analysis

1H NMR sample preparation:

  • Prepare wine samples using deuterium oxide (D₂O) as the solvent
  • Add TSP (3-(Trimethylsilyl)-2,2,3,3-tetradeutero-propionic acid sodium salt) as an internal chemical shift reference standard
  • Utilize sodium oxalate and oxalic acid as part of the sample preparation process
  • Centrifuge samples as needed to remove particulates before NMR analysis

Instrumental Analysis Parameters

LC-HRMS:

  • Technique: Liquid Chromatography coupled with High-Resolution Mass Spectrometry
  • Application: Provides complex fingerprints and accurate quantification of numerous compounds in a single analysis step without requiring specific sample preparation
  • Metabolite Coverage: Identifies polyphenols (stilbenes, flavonols, anthocyanins) and aroma precursors (monoterpene glycosides)

1H NMR:

  • Technique: Nuclear Magnetic Resonance spectroscopy
  • Instrumentation: 400 MHz NMR spectrometer-based automated quality control system (Bruker)
  • Status: Officially recognized by the International Organization of Vine and Wine (OIV) with ISO/IEC 17025 accreditation
  • Application: Characterizes origin, variety, and winemaking practices of wine

Data Processing and Fusion Workflow

The integration of LC-HRMS and 1H NMR datasets requires a systematic computational workflow that encompasses both unsupervised and supervised multivariate statistical approaches. The following diagram illustrates the complete experimental and computational workflow for metabolomics data fusion:

[Diagram: 80 Amarone wine samples → sample preparation → LC-HRMS analysis and 1H NMR analysis (in parallel) → data processing and feature extraction → data fusion → unsupervised analysis (MCIA) and supervised analysis (sPLS-DA modeling) → sample classification (withering time, yeast strain) → model validation → key metabolite identification]

Workflow for Metabolomics Data Fusion

Multivariate Statistical Analysis
  • Technique: Multiple Co-inertia Analysis (MCIA)
  • Purpose: Explore inherent data structure without prior knowledge of sample classes
  • Outcome: Revealed limited correlation between LC-HRMS and 1H NMR datasets (RV-score = 16.4%), suggesting complementarity of the analytical assays
  • Technique: Sparse Partial Least Squares Discriminant Analysis (sPLS-DA)
  • Purpose: Build predictive models for classifying wine samples based on withering time and yeast strains
  • Performance: Achieved lower error rate in sample classification (7.52%) compared to individual techniques
  • Advantage: Provided much broader characterization of wine metabolome than individual techniques
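
The RV score quoted for MCIA is a matrix-level correlation between two data blocks measured on the same samples. The following is a minimal NumPy sketch of the standard RV coefficient, assuming both blocks list the samples in the same row order. The function name `rv_coefficient` and the random toy blocks are illustrative only; they are not the study's data or code, and MCIA software may apply its own preprocessing before reporting a score.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two blocks that share the same samples (rows)."""
    Xc = X - X.mean(axis=0)          # column-center each block
    Yc = Y - Y.mean(axis=0)
    Sx = Xc @ Xc.T                   # n x n sample configuration matrices
    Sy = Yc @ Yc.T
    numerator = np.trace(Sx @ Sy)
    denominator = np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
    return numerator / denominator

# Toy blocks (hypothetical sizes): 20 samples, 50 "LC-HRMS" and 30 "NMR" variables
rng = np.random.default_rng(0)
lcms_block = rng.normal(size=(20, 50))
nmr_block = rng.normal(size=(20, 30))

rv_self = rv_coefficient(lcms_block, lcms_block)   # a block compared with itself
rv_cross = rv_coefficient(lcms_block, nmr_block)   # two unrelated blocks
```

The coefficient lies between 0 and 1. With few samples and many variables, even unrelated blocks can give a sizeable value, so a low score such as the one reported above is best read against a permutation baseline.
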
Metabolite Identification and Validation

The data fusion approach enabled identification of significant variations in key metabolite classes throughout the withering process, including [1] [65]:

  • Amino acids
  • Monosaccharides
  • Polyphenolic compounds

Data Presentation and Visualization Guidelines

Effective Table Construction

Well-structured tables are essential for presenting complex fused datasets in scientific research. The table below outlines the key anatomical components of an effective data table and their respective functions:

Table 3: Anatomical Components of Effective Research Tables

| Table Component | Function | Formatting Guidelines |
|---|---|---|
| Title [69] | Concise summary of data purpose and context | Use prominent background or font color; bold typeface; differentiate from rest of table |
| Subtitle [69] | Additional descriptive text providing context | Include time period, methodology, units of measurement; appear below title |
| Column Headers [69] | Identify type/category of data in each column | Bold typeface; different background color; clear labeling |
| Row Headers [69] | Label each row; identify categories associated with rows | Leftmost position; distinct formatting |
| Data Cells [69] | Contain individual data values at row/column intersections | Appropriate alignment; numeric data right-aligned; text left-aligned |
| Totals Row/Column [69] | Display summary statistics or aggregated values | Position at bottom or rightmost side; differentiate visually |
| Key/Legend [69] | Explain symbols, abbreviations, or color coding | Position for easy reference; ensure accurate interpretation |

Color Contrast and Accessibility

When creating visualizations for fused datasets, adherence to accessibility standards is essential for clear communication. The following table summarizes WCAG 2.1 contrast requirements for data visualization components:

Table 4: Color Contrast Requirements for Data Visualization [70] [71]

| Content Type | Minimum Ratio (AA) | Enhanced Ratio (AAA) | Notes |
|---|---|---|---|
| Body Text | 4.5:1 | 7:1 | Applies to most text elements in visualizations |
| Large Text | 3:1 | 4.5:1 | 14pt bold or 18pt+ regular |
| Graphical Objects | 3:1 | Not defined | Charts, graphs, UI components, form input borders |
| Incidental Text | Exempt | Exempt | Inactive controls, logotypes, purely decorative text |

Tools such as WebAIM's Contrast Checker provide automated verification of color contrast ratios and should be incorporated into the visualization workflow [70] [71].
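
For scripted figure generation, the contrast check can also be run in code. The sketch below follows the WCAG 2.x definitions of relative luminance and contrast ratio for 8-bit sRGB colors. The helper names and the example gray are illustrative choices, not part of any cited tool.

```python
def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel to its linear-light value (WCAG 2.x definition)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(rgb_a, rgb_b):
    """Contrast ratio (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21."""
    la, lb = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)

def passes_aa_body_text(rgb_text, rgb_background):
    return contrast_ratio(rgb_text, rgb_background) >= 4.5

black_on_white = contrast_ratio((0, 0, 0), (255, 255, 255))
gray_on_white = contrast_ratio((118, 118, 118), (255, 255, 255))  # hex #767676
```

A check like `passes_aa_body_text` can be applied to axis labels and legend text, while the 3:1 threshold from the table applies to lines, markers, and other graphical objects.
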

The integration of LC-HRMS and 1H NMR metabolomics data through fusion approaches demonstrates significant promise for advancing food classification research. The supervised multi-omics data fusion strategy outlined in this protocol provides a robust framework for handling large, complex datasets, enabling researchers to achieve more accurate sample classification and identify key discriminant metabolites. The computational tools, experimental protocols, and visualization guidelines presented here offer a comprehensive foundation for implementing these approaches in food chemistry and metabolomics research, particularly for high-value products like Amarone wine where quality authentication is paramount.

Measuring Success: Model Validation, Performance Benchmarks, and Comparative Analysis

In food classification research using LC-HRMS and NMR data fusion, establishing robust validation protocols is paramount to developing models that generalize well to new, unseen samples. Validation strategies serve as a critical safeguard against overfitting, where a model learns patterns specific to the training data but fails to perform reliably in practical applications. The fundamental challenge in analytical food science is the high-dimensional nature of omics data, where the number of measured variables (mass spectral features, NMR peaks) far exceeds the number of samples, creating a high risk of models capturing noise rather than true biological signals.

The complementary nature of LC-HRMS and NMR data makes rigorous validation especially important. LC-HRMS offers high sensitivity for detecting numerous compounds, while NMR provides robust structural information and quantitative capabilities [16] [72]. When fused, these datasets create complex models that require careful validation to ensure the integrated biological information is genuinely predictive rather than coincidental. This protocol outlines comprehensive validation approaches specifically designed for LC-HRMS and NMR data fusion workflows in food authentication and classification studies.

Theoretical Foundation: Cross-Validation and Test Sets

The Problem of Model Selection

In machine learning for food classification, we face multiple candidate models with different hyperparameters. The core challenge is selecting a model that will perform well on future unknown samples, not just on available data. Without proper validation, there is a significant risk of selecting models that exhibit overfitting, where they memorize training data patterns but fail to generalize [73]. This is particularly problematic in foodomics, where sample collection can be expensive and time-consuming, resulting in limited datasets.

The solution lies in implementing a structured approach to data usage through training-validation-test splits and cross-validation. These techniques provide realistic estimates of model performance on new data by repeatedly testing models on subsets of data not used during training [74]. This practice prevents data leakage, where information from the test set inadvertently influences model training, leading to overly optimistic performance estimates.

Cross-Validation Fundamentals

Cross-validation (CV) systematically partitions available data to simulate how models will perform on unseen samples. In k-fold cross-validation, the training set is divided into k smaller sets (folds). The model is trained on k-1 folds and validated on the remaining fold, repeating this process k times with each fold serving as the validation set once [74]. The performance metric reported is the average across all iterations, providing a more stable estimate of model performance than a single train-test split.

For smaller datasets common in foodomics research, stratified k-fold cross-validation is particularly valuable as it preserves the percentage of samples for each class in every fold, maintaining representative class distributions across splits. More complex variations like shuffle-split cross-validation and nested cross-validation offer additional flexibility for optimizing hyperparameters while maintaining unbiased performance estimation [74].
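
Nested cross-validation can be sketched with scikit-learn by wrapping a hyperparameter search (inner loop) inside an outer stratified split. The synthetic data, the logistic regression classifier, and the grid of `C` values are assumptions made for illustration; a real study would substitute its fused feature matrix and its own model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a fused matrix: few samples, many variables
X, y = make_classification(n_samples=80, n_features=200, n_informative=10,
                           random_state=0)

# Scaling sits inside the pipeline, so it is refit on every training split
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)  # estimation

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="accuracy")
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
mean_accuracy = outer_scores.mean()
```

Because the inner search only ever sees the outer training folds, the outer scores are not biased by the hyperparameter choice.
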

The Role of Independent Test Sets

While cross-validation provides robust performance estimation during model development, an independent test set is essential for final model evaluation. This set of data is held back from the entire model selection and training process, serving as a completely unseen dataset to evaluate the finalized model [73]. This practice provides the best approximation of how the model will perform when deployed for real-world food authentication tasks.

The independent test set acts as the "final exam" for the model, with the crucial requirement that these exam questions must not be seen by the model during its training [73]. In food classification research, this means the test samples must undergo the exact same pre-processing and analysis protocols as the training samples but must be completely excluded from feature selection, model training, and parameter optimization steps.

Table 1: Key Terminology in Validation Protocols

| Term | Definition | Role in Model Validation |
|---|---|---|
| Training Set | Data used to fit model parameters | Enables model learning from known data |
| Validation Set | Data used for model selection and hyperparameter tuning | Provides unbiased evaluation during model development |
| Test Set | Data held back for final model evaluation | Estimates real-world performance on completely unseen data |
| k-Fold Cross-Validation | Resampling method that uses k different folds as validation sets | Reduces variance in performance estimation |
| Stratified Sampling | Approach that maintains class distribution in splits | Preserves representative class ratios in validation |

Experimental Design and Data Collection Framework

Sample Preparation and Analytical Sequencing

Proper experimental design begins long before data analysis. For food classification studies using LC-HRMS and NMR, sample randomization across analytical batches is critical to avoid confounding technical variability with biological signals. For example, in a study classifying Amarone wines based on grape withering time and yeast strain, 80 samples were analyzed using both LC-HRMS and NMR in randomized sequences to prevent batch effects [1]. Similarly, in honey origin authentication, 835 samples were analyzed with careful attention to analytical sequence to enable robust classification models [75].

Quality control samples should be integrated throughout the analytical sequence to monitor instrument stability. For LC-HRMS, this typically involves pooled quality control samples analyzed at regular intervals, while for NMR, standard reference samples verify instrument performance. These quality controls are essential for identifying technical artifacts and ensuring data quality throughout acquisition.
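
A randomized injection sequence with interleaved pooled QC samples can be generated in a few lines. This is a sketch under stated assumptions: the QC interval of 8 injections, the seed, and the sample labels are hypothetical and should be adapted to the laboratory's own SOP.

```python
import random

def build_run_order(sample_ids, qc_every=8, seed=42, qc_label="QC_pool"):
    """Shuffle samples and insert a pooled QC at the start, at regular intervals, and at the end."""
    rng = random.Random(seed)          # fixed seed keeps the sequence reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)

    sequence = [qc_label]              # open the batch with a QC injection
    for position, sample_id in enumerate(shuffled, start=1):
        sequence.append(sample_id)
        if position % qc_every == 0:
            sequence.append(qc_label)
    if sequence[-1] != qc_label:       # always close the batch with a QC
        sequence.append(qc_label)
    return sequence

samples = [f"wine_{i:02d}" for i in range(1, 81)]   # 80 hypothetical sample IDs
run_order = build_run_order(samples)
```

Recording the seed alongside the sequence makes the randomization auditable when batch effects are investigated later.
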

Data Preprocessing and Feature Extraction

Both LC-HRMS and NMR data require extensive preprocessing before fusion and modeling. For LC-HRMS data, the BOULS approach (bucketing of untargeted LCMS spectra) provides a robust framework for processing data from different instruments and timepoints. This method uses a central spectrum for retention time alignment and implements a bucketing step that sums signal intensities within three-dimensional buckets (retention time, m/z, and intensity), enabling analysis of data not processed simultaneously [75].

NMR data typically undergoes exponential line broadening, Fourier transformation, phase correction, and baseline correction before spectral alignment and bucketing. For both techniques, data normalization is critical to remove unwanted technical variation. Common approaches include probabilistic quotient normalization, total area normalization, or reference-based normalization using internal standards.
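
Probabilistic quotient normalization can be illustrated with a short NumPy sketch. It assumes a samples-by-variables intensity matrix with strictly positive row sums and uses the median spectrum as the reference, which is one common choice; the function name and the toy data are illustrative.

```python
import numpy as np

def pqn_normalize(X, eps=1e-12):
    """Probabilistic quotient normalization of a samples x variables matrix."""
    X = np.asarray(X, dtype=float)
    # Step 1: total-area normalization
    X_total = X / X.sum(axis=1, keepdims=True)
    # Step 2: reference spectrum (median across samples)
    reference = np.median(X_total, axis=0)
    usable = reference > eps                      # skip variables with an empty reference
    # Step 3: per-sample quotients against the reference
    quotients = X_total[:, usable] / reference[usable]
    # Step 4: divide each sample by its most probable (median) quotient
    dilution = np.median(quotients, axis=1, keepdims=True)
    return X_total / dilution

# Toy data: a base profile, a 2x concentrated copy, and a diluted copy with one elevated variable
base = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
spiked = 0.5 * np.array([10.0, 20.0, 30.0, 40.0, 250.0])
X = np.vstack([base, 2.0 * base, spiked])
X_pqn = pqn_normalize(X)
```

In this toy case, total-area normalization alone would shrink the four unchanged variables of the third sample because one variable dominates its sum. The median quotient restores them to the same level as the other samples and leaves only the elevated variable different.
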

Table 2: Data Preprocessing Steps for LC-HRMS and NMR

| Processing Step | LC-HRMS Protocol | NMR Protocol |
|---|---|---|
| Signal Correction | Retention time alignment, mass calibration | Phase correction, baseline correction |
| Feature Detection | Peak picking, deisotoping, adduct identification | Spectral binning (e.g., 0.04 ppm buckets) |
| Normalization | Total intensity, quality control-based, or probabilistic quotient normalization | Total spectral area or reference compound normalization |
| Scaling | Pareto, unit variance, or range scaling | Pareto, unit variance, or range scaling |
| Data Fusion Level | Low-level: concatenated raw data; Mid-level: selected features; High-level: model decisions | Same fusion levels applied compatibly |
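
The spectral binning step listed for NMR can be sketched as follows. The 0.04 ppm bucket width comes from the table; the δ 0.5-10.0 window, the δ 4.7-5.0 water exclusion, and the function name are assumptions for illustration. Summed intensities stand in for bucket integrals, which holds up to a constant factor on a uniformly sampled axis.

```python
import numpy as np

def bucket_spectrum(ppm, intensity, width=0.04, lo=0.5, hi=10.0, exclude=(4.7, 5.0)):
    """Sum intensities into fixed-width ppm buckets and drop buckets overlapping `exclude`."""
    ppm = np.asarray(ppm, dtype=float)
    intensity = np.asarray(intensity, dtype=float)

    # Whole buckets only; a trailing partial bucket below `hi` is discarded
    n_buckets = int(np.floor((hi - lo) / width + 1e-9))
    edges = lo + width * np.arange(n_buckets + 1)
    sums, _ = np.histogram(ppm, bins=edges, weights=intensity)

    tol = 1e-9  # guards the edge comparisons against floating-point rounding
    keep = (edges[1:] <= exclude[0] + tol) | (edges[:-1] >= exclude[1] - tol)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[keep], sums[keep]

# Synthetic spectrum on a descending ppm axis: one small peak plus a large water signal
ppm_axis = np.linspace(12.0, 0.0, 1201)
signal = np.zeros_like(ppm_axis)
signal[np.argmin(np.abs(ppm_axis - 1.33))] = 5.0
signal[np.argmin(np.abs(ppm_axis - 4.80))] = 100.0
centers, buckets = bucket_spectrum(ppm_axis, signal)
```

Applying the same edges to every spectrum yields one row per sample, ready for normalization and scaling.
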

Implementation Protocols for Cross-Validation

K-Fold Cross-Validation Implementation

The following protocol outlines the step-by-step procedure for implementing k-fold cross-validation in LC-HRMS and NMR data fusion studies:

  • Data Preparation: Begin with preprocessed and fused LC-HRMS and NMR data matrices. Ensure proper scaling has been applied within each analytical block to equalize contributions from both platforms. Intra-block normalization using Pareto scaling (dividing each mean-centered variable by √σ, the square root of its standard deviation) has been shown to be effective for LC-HRMS and NMR data fusion [16].

  • Stratified Splitting: Partition the fused data into k folds (typically k=5 or k=10) using stratified sampling to maintain class proportions in each fold. For food classification tasks with limited samples, k=5 provides a reasonable balance between bias and variance.

  • Iterative Training and Validation: For each fold iteration:

    • Use k-1 folds as the training set
    • Hold out the remaining fold as the validation set
    • Train the model on the training set
    • Predict the held-out validation set and calculate performance metrics
    • Store performance metrics and model parameters
  • Performance Aggregation: Calculate mean and standard deviation of performance metrics (accuracy, F1-score, etc.) across all k iterations. The standard deviation indicates model stability across different data subsets.

  • Model Selection: Use the aggregated performance metrics to compare different algorithms or hyperparameter settings, selecting the best-performing configuration for final model training.
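
The protocol steps above can be sketched with an explicit scikit-learn loop. The synthetic three-class data and the logistic regression classifier are placeholders chosen for illustration; they are not the models used in the cited studies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a fused LC-HRMS + NMR matrix with three classes
X, y = make_classification(n_samples=90, n_features=300, n_informative=15,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracy, fold_f1 = [], []

for train_idx, val_idx in skf.split(X, y):
    # A fresh pipeline per fold: scaling parameters come from the training folds only
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[val_idx])
    fold_accuracy.append(accuracy_score(y[val_idx], predictions))
    fold_f1.append(f1_score(y[val_idx], predictions, average="macro"))

summary = {
    "accuracy_mean": float(np.mean(fold_accuracy)),
    "accuracy_sd": float(np.std(fold_accuracy, ddof=1)),
    "f1_mean": float(np.mean(fold_f1)),
    "f1_sd": float(np.std(fold_f1, ddof=1)),
}
```

Comparing `summary` dictionaries across candidate configurations gives the basis for the model selection step; a large standard deviation relative to the mean signals an unstable model.
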

In the Amarone wine classification study, researchers employed cross-validation within their supervised statistical analysis (sPLS-DA) to evaluate classification performance based on withering time and yeast strains, achieving robust classification with minimal error rates [1].

Cross-Validation with Data Fusion Techniques

The cross-validation process must be carefully adapted for different data fusion strategies:

For low-level data fusion (concatenating raw or preprocessed data matrices), apply cross-validation after data concatenation but ensure that preprocessing parameters are learned only from the training folds to prevent data leakage. In practice, this means that scaling parameters, alignment references, and feature selection must be recalculated for each training fold.
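
A minimal NumPy sketch of leakage-free low-level fusion for a single fold is shown below. Pareto scaling is used as the example preprocessing step; the block sizes, index split, and function names are hypothetical.

```python
import numpy as np

def fit_pareto(X_train):
    """Learn centering and Pareto scaling parameters from the training rows only."""
    mean = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                 # constant variables: avoid division by zero
    return mean, np.sqrt(sd)          # Pareto scaling divides by sqrt(standard deviation)

def apply_scaling(X, mean, scale):
    return (X - mean) / scale

rng = np.random.default_rng(0)
lcms = rng.lognormal(size=(40, 120))   # toy LC-HRMS block
nmr = rng.lognormal(size=(40, 60))     # toy NMR block, same 40 samples
train_idx = np.arange(0, 32)           # one fold's training rows
val_idx = np.arange(32, 40)            # the held-out rows

train_blocks, val_blocks = [], []
for block in (lcms, nmr):
    mean, scale = fit_pareto(block[train_idx])            # parameters from training rows
    train_blocks.append(apply_scaling(block[train_idx], mean, scale))
    val_blocks.append(apply_scaling(block[val_idx], mean, scale))  # reused, never refit

fused_train = np.hstack(train_blocks)   # low-level fusion: column-wise concatenation
fused_val = np.hstack(val_blocks)
```

The same pattern applies to any learned preprocessing parameter: it is estimated on the training rows and only applied to the held-out rows.
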

For mid-level data fusion (combining selected features from each platform), perform feature selection independently within each cross-validation fold. For example, in a study on emodin hepatotoxic metabolomics, researchers applied random forest feature selection separately to LC-MS and NMR data before fusion, followed by cross-validation of the fused model [9].

For high-level data fusion (combining model decisions), implement cross-validation separately for each platform's model, then fuse the predictions from each fold. This approach was used in a hazelnut traceability study, where the DIABLO framework integrated 1H-NMR and LC-HRMS data to classify geographical origin and cultivar with a minimal error rate [26].

[Diagram: preprocessed LC-HRMS and NMR data → stratified k-fold splitting → for each fold 1 to K, train the model on the remaining K-1 folds and evaluate it on the held-out fold → aggregate performance metrics across folds → train final model on the full training set → evaluate on the independent test set]

Diagram 1: Comprehensive cross-validation workflow for LC-HRMS and NMR data fusion studies. This structured approach ensures robust model evaluation while preventing data leakage.

Independent Test Set Validation Protocol

Test Set Selection and Stratification

The independent test set serves as the ultimate benchmark for model performance. Implement the following protocol for test set validation:

  • Initial Data Partitioning: Before any analysis or model development, randomly hold back a portion of the dataset (typically 20-30%) as the test set. In the honey authentication study, researchers used 126 samples (from 835 total) as an independent test set, achieving 94% classification accuracy for geographical origin [75].

  • Stratified Sampling: Ensure the test set maintains the same class distribution as the full dataset. For food classification tasks with multiple categories (e.g., different geographical origins, varieties, or processing methods), proportional representation of each class in the test set is critical for unbiased evaluation.

  • Complete Isolation: Maintain strict separation between training and test sets throughout the entire analytical workflow. The test set should not influence any aspect of data preprocessing, feature selection, or model training.
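
A stratified hold-out split can be sketched with scikit-learn's `train_test_split`. The class labels, class sizes, and 20% test fraction below are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical origin labels for 100 samples
labels = np.array(["Italy"] * 40 + ["Spain"] * 30 + ["Greece"] * 30)
sample_idx = np.arange(labels.size)

# Split indices once, before any preprocessing, and store them with the raw data
train_idx, test_idx = train_test_split(
    sample_idx, test_size=0.2, stratify=labels, random_state=0
)

test_counts = {name: int(np.sum(labels[test_idx] == name)) for name in np.unique(labels)}
```

Splitting indices rather than data matrices makes it straightforward to apply the identical partition to both the LC-HRMS and the NMR block.
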

Final Model Evaluation Protocol

Once model development and selection are complete using cross-validation, proceed with final evaluation:

  • Final Model Training: Train the selected model configuration using the entire training set (including previously used validation folds) to maximize learning from all available data.

  • Test Set Prediction: Apply the final model to the held-out test set, ensuring all preprocessing steps are applied consistently using parameters derived only from the training set.

  • Comprehensive Performance Assessment: Calculate multiple performance metrics to fully characterize model behavior. For classification tasks, include accuracy, precision, recall, F1-score, and confusion matrix analysis. For regression tasks, report R², MSE, RMSE, and MAE.

  • Comparison with Cross-Validation Results: Compare test set performance with cross-validation results. Significant performance degradation on the test set may indicate overfitting or data leakage during model development.
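
The classification metrics named above all derive from the confusion matrix. The following NumPy sketch shows the arithmetic; the function names and the toy labels are illustrative, and integer-coded classes are assumed.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_metrics(cm):
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)     # column sums: how often each class was predicted
    actual = cm.sum(axis=1)        # row sums: how often each class truly occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {"accuracy": tp.sum() / cm.sum(), "precision": precision,
            "recall": recall, "f1": f1}

# Toy test-set predictions for three classes
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
cm = confusion_counts(y_true, y_pred, n_classes=3)
metrics = per_class_metrics(cm)
```

Reporting per-class precision and recall alongside overall accuracy exposes classes that are systematically confused, which a single accuracy figure can hide.
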

In the Forsythiae Fructus classification study, researchers validated their mid-level data fusion model with an independent test set after cross-validation, confirming the model's ability to distinguish green and ripe fruits based on fused LC-MS and HS-GC-MS data [21].

Data Fusion-Specific Validation Considerations

Validation for Different Data Fusion Strategies

Each data fusion level requires specific validation considerations:

Low-level data fusion validation must account for the high dimensionality of concatenated LC-HRMS and NMR data. The validation process should monitor for dominance of one platform due to higher dimensionality or variance. In such cases, block scaling that weights each platform by the reciprocal of its summed standard deviations, 1/(Σσ) computed over the variables of the block, can balance contributions [16].
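
One way to read the block scaling described here is to divide every centered block by the sum of its column standard deviations, so that each platform contributes the same total spread. The sketch below implements that reading with NumPy; the block sizes and variances are toy values, and the function name is illustrative.

```python
import numpy as np

def block_scale(block):
    """Center a block and divide it by the sum of its column standard deviations."""
    centered = block - block.mean(axis=0)
    total_sd = centered.std(axis=0, ddof=1).sum()
    return centered / total_sd

rng = np.random.default_rng(0)
lcms = rng.normal(scale=5.0, size=(30, 500))   # many high-variance variables
nmr = rng.normal(scale=1.0, size=(30, 50))     # fewer, lower-variance variables

fused = np.hstack([block_scale(lcms), block_scale(nmr)])

# After scaling, each block carries the same summed standard deviation
column_sd = fused.std(axis=0, ddof=1)
lcms_share = column_sd[:500].sum()
nmr_share = column_sd[500:].sum()
```

Inside cross-validation, the mean and the summed standard deviation would be estimated on the training rows of each fold and then applied to the held-out rows.
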

Mid-level data fusion requires careful validation of feature selection to prevent information leakage. Feature importance should be evaluated independently within each cross-validation fold. The emodin hepatotoxicity study demonstrated this approach, where random forest feature selection was applied separately to LC-MS and NMR data within each validation fold before fusion [9].
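
Per-platform feature selection that is refit in every fold can be sketched by placing the selectors inside a scikit-learn pipeline. Note the substitution: the cited study used random forest feature selection, whereas this example uses an ANOVA F-test (`SelectKBest` with `f_classif`) to stay short. The column ranges assigned to each platform, the number of selected features, and the synthetic data are all assumptions.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic matrix: columns 0-299 play the LC-HRMS block, 300-399 the NMR block
X, y = make_classification(n_samples=80, n_features=400, n_informative=12,
                           random_state=0)

per_block_selection = ColumnTransformer([
    ("lcms", SelectKBest(f_classif, k=20), slice(0, 300)),
    ("nmr", SelectKBest(f_classif, k=10), slice(300, 400)),
])  # the selected columns of both blocks are concatenated (mid-level fusion)

model = make_pipeline(StandardScaler(), per_block_selection,
                      LogisticRegression(max_iter=2000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)   # selection is refit inside every fold
```

Selecting features on the full dataset first and cross-validating afterwards would leak class information into the validation folds and inflate the scores.
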

High-level data fusion involves validating the decision integration mechanism. When combining predictions from separate LC-HRMS and NMR models, the fusion mechanism itself (e.g., weighted voting, meta-classifier) must be validated using nested cross-validation to avoid overfitting the fusion parameters.
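
Weighted soft voting is one simple decision fusion mechanism. The sketch below assumes each platform's model has already produced class probabilities for the same samples (ideally out-of-fold predictions); the probability values and the weight are made up for illustration.

```python
import numpy as np

def fuse_probabilities(p_lcms, p_nmr, w_lcms=0.5):
    """Weighted average of two class-probability matrices (samples x classes)."""
    fused = w_lcms * p_lcms + (1.0 - w_lcms) * p_nmr
    return fused, fused.argmax(axis=1)

# Hypothetical predicted probabilities for three samples and two classes
p_lcms = np.array([[0.90, 0.10],
                   [0.40, 0.60],
                   [0.55, 0.45]])
p_nmr = np.array([[0.80, 0.20],
                  [0.30, 0.70],
                  [0.20, 0.80]])

fused, labels = fuse_probabilities(p_lcms, p_nmr, w_lcms=0.5)
```

The weight `w_lcms` is itself a tunable parameter. If it is chosen to maximize performance, that choice belongs in the inner loop of a nested cross-validation, with the outer loop reserved for estimating the performance of the fused decision.
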

Multi-Study Validation Framework

For maximum robustness, implement a validation framework that incorporates multiple studies:

  • Internal Validation: Use cross-validation and hold-out test sets within your study.
  • External Validation: Validate models on data collected in different batches, by different operators, or using slightly different protocols.
  • Temporal Validation: For food classification, test models on samples from different harvest years or production seasons to assess temporal robustness.

The hazelnut traceability study highlighted the importance of temporal validation, noting that "annual variability is a relevant parameter for proper interpretation of results" [26]. Their data fusion approach successfully classified samples despite this variability when models accounted for temporal effects.
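
Temporal validation can be set up as a leave-one-year-out split, in which every harvest year is held out once. The NumPy sketch below uses hypothetical year labels; the generator name is illustrative.

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each unique group."""
    groups = np.asarray(groups)
    for group in np.unique(groups):
        test_idx = np.flatnonzero(groups == group)
        train_idx = np.flatnonzero(groups != group)
        yield group, train_idx, test_idx

# Hypothetical harvest years for 12 samples
harvest_year = np.array([2019] * 4 + [2020] * 3 + [2021] * 5)
splits = list(leave_one_group_out(harvest_year))
```

A marked drop in accuracy on a held-out year, compared with random k-fold results, indicates that the model has partly learned year-specific rather than class-specific patterns.
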

[Diagram: LC-HRMS data and NMR data feed three parallel routes. Low-level data fusion: preprocess and concatenate → train single model on combined data → validate with block scaling. Mid-level data fusion: select features from each platform → combine selected features → train model on fused features → validate feature selection process. High-level data fusion: train LC-HRMS model and train NMR model → fuse model decisions → validate decision fusion mechanism. All three routes end in independent test set validation.]

Diagram 2: Validation approaches for different data fusion strategies in LC-HRMS and NMR studies. Each fusion level requires specific validation considerations to ensure robust performance.

Research Reagent Solutions and Computational Tools

Table 3: Essential Research Tools for LC-HRMS and NMR Data Fusion Validation

| Tool Category | Specific Solutions | Application in Validation Protocols |
|---|---|---|
| Data Preprocessing | XCMS (LC-HRMS), ACD/Spectrus (NMR), BOULS workflow | Retention time alignment, spectral bucketing, feature detection, and data quality assessment |
| Statistical Analysis | SIMCA, ROCCET, MetaboAnalyst | Multivariate analysis, cross-validation implementation, performance metrics calculation |
| Machine Learning | scikit-learn, Random Forest, sPLS-DA | Model training, hyperparameter optimization, cross-validation, feature selection |
| Data Fusion Platforms | DIABLO, mixOmics, MMTMF | Multi-block data integration, cross-platform correlation analysis, fused model validation |
| Visualization | ggplot2, Plotly, Cytoscape | Performance metric visualization, model interpretation, results communication |

Robust validation protocols combining cross-validation and independent test sets are essential for developing reliable LC-HRMS and NMR data fusion models in food classification research. The complementary nature of these analytical platforms creates powerful integrated models that must be rigorously validated to ensure real-world applicability. Through proper experimental design, careful data partitioning, appropriate fusion strategies, and comprehensive performance assessment, researchers can build models that genuinely capture biologically meaningful patterns rather than analytical artifacts or random noise. The protocols outlined here provide a framework for establishing validation workflows that instill confidence in model predictions and support the use of data fusion approaches in critical food authentication applications.

Within the field of food authenticity and classification, the limitations of single analytical techniques are increasingly apparent. While methods like Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) spectroscopy provide valuable data, each often offers only a partial view of a sample's complex metabolome. Data fusion overcomes these limitations by integrating complementary data from multiple analytical platforms. This approach provides a more comprehensive chemical profile, leading to enhanced classification accuracy and more robust model development for food origin and authenticity verification [1] [25]. This Application Note details a standardized protocol for implementing mid-level data fusion of LC-HRMS and 1H NMR data and for quantifying the resulting performance gain, providing researchers with a framework for superior food classification models.

Results & Discussion

Quantitative Performance Comparison of Single-Technique vs. Fused Models

The superior predictive capability of models built on fused data is demonstrated across diverse food matrices. The following table summarizes quantitative results from key studies, directly comparing the performance of single-technique models against mid-level data fusion models.

Table 1: Quantitative Comparison of Model Performance for Food Classification

| Food Product | Classification Aim | Analytical Techniques | Single-Technique Model Performance | Data Fusion Model Performance | Key Metabolites Influencing Classification |
|---|---|---|---|---|---|
| Amarone Wine [1] | Withering time & Yeast strain | LC-HRMS & 1H NMR | LC-HRMS and 1H NMR each provided separate classifications. | Fused model achieved a lower classification error rate of 7.52%. | Amino acids, monosaccharides, polyphenolic compounds. |
| Salmon [25] | Geographical origin & Production method | REIMS & ICP-MS | Single-platform methods could not achieve perfect classification. | Fused model achieved 100% cross-validation classification accuracy. | 18 lipid markers (e.g., FA 15:1, FA 18:3) and 9 elemental markers. |
| Forsythiae Fructus [21] | Green vs. Ripe fruit | UPLC-Q/Orbitrap MS & HS-GC-MS | HS-GC-MS OPLS-DA: R2Y=0.968, Q2=0.930. | Fused OPLS-DA: R2Y=0.986, Q2=0.974. | 30 differential compounds selected from initial 61, reducing noise. |
| Honey [75] | Geographical origin | LC-HRMS (HILIC & RP methods) | N/A (Single technique used with a novel processing approach). | Final Random Forest model accuracy of 94% for 6 countries using the BOULS processing method. | Comprehensive fingerprint of polar and non-polar compounds. |

The consistency of results across different foods and analytical platforms is striking. In the case of Amarone wine, the data fusion approach not only improved predictive accuracy but also highlighted the complementarity of the two datasets, as evidenced by a limited correlation (RV-score = 16.4%) in the multi-omics pseudo-eigenvalue space [1]. For salmon authentication, the fusion of lipidomic (REIMS) and elementomic (ICP-MS) data was necessary to achieve a level of classification accuracy that was impossible with either single-platform method [25]. Similarly, for Forsythiae Fructus, the mid-level data fusion model provided a more robust model with higher explained variance (R2Y) and predictive ability (Q2) while effectively reducing data noise [21].

Visualizing the Data Fusion Workflow

The following diagram illustrates the logical workflow for a mid-level data fusion strategy, from raw data acquisition to the final fused classification model.

[Diagram: sample set → LC-HRMS analysis and 1H NMR analysis → one feature table per technique → data pre-processing (normalization, scaling) → feature selection (VIP > 1, p < 0.05) → fused data matrix → supervised model (e.g., sPLS-DA, RF) → enhanced classification]

Diagram 1: Mid-Level Data Fusion Workflow. This workflow involves generating feature tables from multiple analytical techniques, pre-processing the data, selecting discriminative features, and combining them into a single matrix for supervised modeling.

Experimental Protocol

This protocol outlines the key steps for conducting a mid-level data fusion study for food classification, based on established methodologies [1] [75] [21].

Sample Preparation and Instrumentation

  • Sample Collection: Secure a sufficient number of samples (e.g., n=80-500) representing all classes of interest (e.g., geographical origin, production method). Ensure metadata (e.g., origin, harvest year) is meticulously recorded.
  • LC-HRMS Analysis (for non-volatiles):
    • Extraction: Weigh 1.0 g of homogenized sample (e.g., wine, honey, ground herb). Extract with 10 mL of a methanol:water (80:20, v/v) solution containing 0.1% formic acid. Vortex for 1 min, sonicate for 15 min, and centrifuge at 15,000 × g for 10 min. Transfer the supernatant to an LC vial [75] [21].
    • Chromatography: Use a reversed-phase C18 column (e.g., 150 × 2.1 mm, 1.9 µm). Mobile phase A: Water with 0.1% formic acid; B: Acetonitrile with 0.1% formic acid. Apply a linear gradient from 5% B to 95% B over 25 min. Flow rate: 0.3 mL/min; column temperature: 40°C [75].
    • Mass Spectrometry: Operate the HRMS (e.g., Q-Exactive Orbitrap) in data-dependent acquisition (DDA) or data-independent acquisition (DIA) mode in both positive and negative electrospray ionization (ESI) modes. Resolution: >70,000; mass range: 100-1500 m/z [75].
  • 1H NMR Analysis (for broad-spectrum metabolomics):
    • Preparation: Mix 300 µL of sample with 300 µL of phosphate buffer (0.1 M, pD 7.4) in D2O containing 0.5 mM TSP (sodium 3-(trimethylsilyl)propionate-2,2,3,3-d4) as an internal chemical shift reference and for quantification [1].
    • Acquisition: Acquire 1H NMR spectra on a 600 MHz spectrometer equipped with a cryoprobe. Use a standard 1D NOESY-presaturation pulse sequence (noesygppr1d) to suppress the water signal. Accumulate 128 scans into 64k data points over a spectral width of 20 ppm [1].

Data Processing and Fusion Protocol

  • LC-HRMS Data Processing:
    • Use software (e.g., XCMS, Compound Discoverer) for peak picking, retention time alignment, and gap filling to generate a feature table with m/z, retention time, and intensity [75].
    • For long-term studies, consider the BOULS (Bucketing Of Untargeted LCMS Spectra) approach, which uses a central spectrum for alignment and a 3D bucketing step (retention time, m/z, intensity) to allow the integration of data from different devices and time points without reprocessing the entire dataset [75].
  • 1H NMR Data Processing:
    • Process FIDs by applying exponential line broadening (0.3 Hz), zero-filling to 128k points, and Fourier transformation. Manually correct phase and baseline.
    • Reference the spectrum to the TSP peak at δ 0.0 ppm.
    • Bin the spectrum into consecutive regions (e.g., δ 0.04 ppm buckets) across the region δ 0.5-10.0, excluding the water region (δ 4.7-5.0). Integrate the area under each bucket to create a data matrix.
  • Mid-Level Data Fusion:
    • Pre-processing: Normalize the feature tables from each platform (e.g., to total intensity, PQN) and scale the data (e.g., Pareto or Unit Variance scaling).
    • Feature Selection: Apply univariate (e.g., ANOVA with p < 0.05) and multivariate (e.g., Variable Importance in Projection (VIP) > 1 from a PLS-DA model) methods to each dataset independently to select the most discriminative features [21].
    • Data Concatenation: Horizontally concatenate the selected, pre-processed feature blocks from LC-HRMS and 1H NMR into a single fused data matrix (samples x [LC-HRMS features + NMR features]) [1] [21].
    • Model Building: Build a supervised classification model (e.g., sPLS-DA, OPLS-DA, Random Forest) on the fused matrix. Validate the model using cross-validation and an independent test set.
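
The scaling, selection, and concatenation steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the cited studies' implementation: the function names, the toy intensities, and the use of a one-way ANOVA F-statistic ranking (a stand-in for the p < 0.05 and VIP > 1 filters) are all hypothetical choices made for the example.

```python
from math import sqrt
from statistics import mean, stdev


def pareto_scale(block):
    """Mean-center each column and divide by the square root of its standard deviation."""
    scaled_cols = []
    for col in zip(*block):
        mu, sd = mean(col), stdev(col)
        denom = sqrt(sd) if sd > 0 else 1.0  # constant columns become all zeros
        scaled_cols.append([(v - mu) / denom for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of a single feature across the classes."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def select_top_features(block, labels, k):
    """Indices of the k columns with the largest F-statistic (assumed selection rule)."""
    columns = list(zip(*block))
    ranked = sorted(range(len(columns)),
                    key=lambda j: f_statistic(columns[j], labels),
                    reverse=True)
    return sorted(ranked[:k])


def fuse_blocks(blocks, labels, k_per_block):
    """Scale and select each block independently, then concatenate horizontally."""
    fused = [[] for _ in labels]
    for block, k in zip(blocks, k_per_block):
        scaled = pareto_scale(block)
        keep = select_top_features(scaled, labels, k)
        for i, row in enumerate(scaled):
            fused[i].extend(row[j] for j in keep)
    return fused


# Toy data: 4 samples in 2 classes; all values are invented for illustration.
labels = ["A", "A", "B", "B"]
lcms = [[10.0, 5.0, 1.0], [12.0, 5.5, 1.1], [30.0, 5.2, 0.9], [28.0, 4.8, 1.0]]
nmr = [[0.2, 3.0], [0.3, 3.1], [0.25, 6.0], [0.22, 6.2]]
fused = fuse_blocks([lcms, nmr], labels, k_per_block=[2, 1])
print(len(fused), "samples x", len(fused[0]), "fused features")
```

In a real study the selection step would be repeated inside each cross-validation fold, since choosing features on the full dataset before validation leaks class information into the error estimate.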

The Scientist's Toolkit

Table 2: Essential Research Reagents and Materials for LC-HRMS/NMR Metabolomics

| Item | Function / Application |
| --- | --- |
| U/HPLC-grade Solvents (Water, Acetonitrile, Methanol) | Mobile phase preparation and sample extraction, ensuring minimal background interference. |
| Deuterated Solvent (D2O) with internal standard (TSP) | Solvent for NMR analysis; TSP provides a chemical shift reference and quantification standard. |
| Acid Additives (Formic Acid, Acetic Acid) | LC-MS mobile phase modifier to enhance ionization efficiency in ESI. |
| Standard reversed-phase U/HPLC column (e.g., C18, 100-150 mm × 2.1 mm, sub-2 µm) | Separation of complex non-polar to mid-polar metabolite mixtures in LC-HRMS. |
| HILIC U/HPLC column | Complementary separation of polar metabolites not retained on C18 phases. |
| Quality Control (QC) Sample (pooled from all study samples) | Injected repeatedly throughout the analytical run to monitor instrument stability and for data normalization. |
| Chemical Reference Standards (e.g., Forsythiaside A, Phillyrin) | Used for unambiguous identification and confirmation of key metabolites by matching retention time and MS/MS spectrum. |

This application note provides a detailed protocol for identifying robust biomarkers by integrating liquid chromatography-high resolution mass spectrometry (LC-HRMS) and nuclear magnetic resonance (NMR) data. Framed within food classification research, we present a complete workflow from sample preparation and multi-platform data acquisition to statistical data fusion and biological validation. The methodologies described herein enable researchers to distinguish food products based on processing characteristics and biological relevance with high confidence, leveraging the complementary strengths of LC-HRMS and NMR platforms.

The identification of robust biomarkers requires moving beyond mere statistical significance to demonstrate biological relevance and practical utility. In food classification research, this challenge is particularly acute when dealing with complex matrices and subtle compositional differences. Data fusion approaches that combine multiple analytical platforms have emerged as powerful tools for addressing these challenges, providing complementary molecular coverage that single-platform approaches cannot achieve.

Recent advances in multi-omics integration have demonstrated that combining untargeted metabolomics approaches significantly enhances classification accuracy and provides broader characterization of food metabolomes [37]. These approaches employ both unsupervised data exploration and supervised statistical analysis to reveal significant variations in key compound classes including amino acids, monosaccharides, and polyphenolic compounds. The integration of LC-HRMS and 1H NMR has proven particularly effective, with studies showing a limited correlation between datasets (RV-score = 16.4%), highlighting their complementarity for comprehensive metabolome coverage [37].
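
To make the RV-score mentioned above concrete, the following sketch computes the RV coefficient between two data blocks measured on the same samples. It is an illustrative implementation of the standard definition (a matrix analogue of a squared correlation, ranging from 0 to 1), not the code used in the cited study; the function names and the toy matrices are assumptions.

```python
from math import sqrt


def center_columns(matrix):
    """Subtract the column mean from every entry."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def gram(matrix):
    """Sample-by-sample cross-product matrix X X^T."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]


def rv_coefficient(x, y):
    """RV = tr(Sx Sy) / sqrt(tr(Sx Sx) tr(Sy Sy)) with Sx = Xc Xc^T and Sy = Yc Yc^T."""
    sx, sy = gram(center_columns(x)), gram(center_columns(y))
    # For symmetric matrices, tr(Sx Sy) equals the sum of element-wise products.
    num = sum(a * b for rx, ry in zip(sx, sy) for a, b in zip(rx, ry))
    den_x = sum(a * a for row in sx for a in row)
    den_y = sum(b * b for row in sy for b in row)
    return num / sqrt(den_x * den_y)


# Toy blocks (rows = the same 4 samples); values are invented.
block_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
block_b = [[0.5], [0.1], [0.9], [0.3]]
print(round(rv_coefficient(block_a, block_b), 3))
```

A value near 1 means the two platforms arrange the samples in almost the same way, whereas a low value such as the reported 16.4% indicates that each platform contributes largely independent information.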

For biomarker discovery to transition from statistically significant to biologically relevant, a rigorous workflow encompassing discovery, qualification, and validation phases is essential [76]. This structured approach ensures that identified markers not only demonstrate statistical differences between groups but also possess the specificity, sensitivity, and biological plausibility required for practical application in food authentication and quality control.

Experimental Design and Workflow

The following diagram illustrates the integrated experimental workflow for robust biomarker identification using LC-HRMS and NMR data fusion:

[Workflow diagram: Sample Collection and Preparation → LC-HRMS Analysis (preprocessing: peak picking, alignment, normalization) and 1H NMR Analysis (preprocessing: phasing, baseline correction, spectral alignment) → Data Fusion and Multi-Block Analysis → Statistical Analysis (MCIA, sPLS-DA) → Biomarker Validation and Identification → Biological Interpretation]

Sample Preparation Protocol

Materials and Reagents
  • Samples: Food products (e.g., wine, olive oil, honey) representing different classes, varieties, or processing conditions
  • Solvents: HPLC-grade methanol, acetonitrile, chloroform, deuterated solvents (D₂O, CD₃OD)
  • Chemicals: Potassium dihydrogen phosphate, sodium azide, 3-(trimethylsilyl) propionic-2,2,3,3-d4 acid sodium salt (TSP)
  • Equipment: Refrigerated centrifuge, vortex mixer, ultrasonic bath, precision balances, pH meter
  • Consumables: Syringe filters (0.22 μm PTFE), NMR tubes (5 mm), LC vials, micropipettes

Extraction Procedure
  • Homogenization: Homogenize 1.0 g of food sample with 5 mL of extraction solvent (methanol:water, 4:1, v/v) using a vortex mixer for 2 minutes.
  • Sonication: Sonicate the mixture in an ice bath for 15 minutes.
  • Centrifugation: Centrifuge at 14,000 × g for 15 minutes at 4°C.
  • Separation: Transfer the supernatant to a new tube.
  • Concentration (for LC-HRMS): Evaporate 2 mL of supernatant to dryness under a nitrogen stream and reconstitute in 200 μL of the initial mobile phase.
  • Buffer Preparation (for NMR): Mix 500 μL of remaining supernatant with 100 μL of phosphate buffer (1.5 M KH₂PO₄ in D₂O, pH 7.4, 0.1% NaN₃, 0.005% TSP).
  • Filtration: Filter all samples through 0.22 μm syringe filters.
  • Storage: Store LC samples at 4°C until analysis (within 24 hours) and transfer NMR samples to 5 mm NMR tubes.

Table 1: Key Research Reagent Solutions

| Reagent/Equipment | Function/Application | Specifications |
| --- | --- | --- |
| HPLC-grade Methanol | Extraction solvent | Purity ≥99.9%, LC-MS grade |
| Deuterated Water (D₂O) | NMR solvent | 99.9% D, contains 0.05% TSP |
| Phosphate Buffer | pH control for NMR samples | 1.5 M KH₂PO₄ in D₂O, pH 7.4 |
| TSP (3-(trimethylsilyl)propionic-2,2,3,3-d4 acid sodium salt) | NMR internal standard | 0.005% in D₂O, for chemical shift referencing |
| PTFE Syringe Filters | Sample clarification | 0.22 μm pore size, removes particulates |

Data Acquisition Protocols

LC-HRMS Analysis

Instrumentation and Parameters
  • LC System: UHPLC system with binary pump, autosampler (maintained at 10°C), and column oven
  • Mass Spectrometer: High-resolution mass spectrometer (Orbitrap or Q-TOF) with electrospray ionization (ESI) source
  • Chromatographic Conditions:
    • Column: C18 column (100 × 2.1 mm, 1.7 μm)
    • Mobile Phase: A: 0.1% formic acid in water; B: 0.1% formic acid in acetonitrile
    • Gradient: 5% B (0-1 min), 5-95% B (1-20 min), 95% B (20-23 min), 95-5% B (23-24 min), 5% B (24-30 min)
    • Flow Rate: 0.3 mL/min
    • Injection Volume: 5 μL
    • Column Temperature: 40°C
  • MS Parameters:
    • Ionization Mode: Positive and negative ESI (separate runs)
    • Mass Range: m/z 50-1200
    • Resolution: ≥70,000 (at m/z 200)
    • Spray Voltage: 3.5 kV (positive), 3.0 kV (negative)
    • Capillary Temperature: 320°C
    • Sheath Gas Pressure: 40 arb
    • Aux Gas Pressure: 10 arb
  • Quality Control: Pooled quality control (QC) samples injected every 6-8 samples to monitor system stability
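
The QC injection scheme above can be turned into an acquisition sequence with a few lines of Python. This is a sketch under assumptions: the function name, the fixed interval of 6 (the lower end of the stated 6-8 range), the randomized sample order, and the three conditioning QC injections at the start are illustrative choices rather than part of the protocol.

```python
import random


def build_run_order(sample_ids, qc_interval=6, n_conditioning_qc=3, seed=0):
    """Randomize study samples and interleave pooled QC injections."""
    rng = random.Random(seed)          # fixed seed so the sequence is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)              # randomization guards against order effects
    order = ["QC"] * n_conditioning_qc  # assumed column-conditioning injections
    for position, sample in enumerate(shuffled, start=1):
        order.append(sample)
        if position % qc_interval == 0:
            order.append("QC")
    if order[-1] != "QC":
        order.append("QC")             # always close the batch with a QC
    return order


samples = [f"S{i:02d}" for i in range(1, 21)]  # hypothetical sample IDs
run_order = build_run_order(samples)
print(run_order)
```

Fixing the seed keeps the sequence reproducible for the batch record; the same function can be called once per ionization mode, since positive and negative ESI are acquired in separate runs.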

1H NMR Analysis

Instrumentation and Parameters
  • NMR Spectrometer: High-field NMR spectrometer (≥600 MHz) with cryoprobe
  • Acquisition Parameters:
    • Temperature: 298 K
    • Pulse Sequence: NOESYPRESAT (noesygppr1d) for water suppression
    • Spectral Width: 12 ppm
    • Relaxation Delay: 4 s
    • Mixing Time: 10 ms
    • Acquisition Time: 2.5 s
    • Number of Scans: 64
    • Dummy Scans: 4
  • Data Processing:
    • Exponential Line Broadening: 0.3 Hz
    • Zero Filling: 64k
    • Fourier Transformation: Automatic after zero filling
    • Phase and Baseline Correction: Manual adjustment
    • Chemical Shift Referencing: TSP at δ 0.0 ppm

Data Processing and Statistical Analysis

Data Preprocessing Workflow

The following diagram outlines the data processing workflow from raw data to fused dataset:

[Workflow diagram: LC-HRMS Raw Data → Peak Picking and Alignment → Normalization and QC Correction → Pareto Scaling; 1H NMR Raw Data → Phasing, Baseline, and Referencing → Normalization and Bucketing → Pareto Scaling; both branches → Fused Data Matrix]

Data Fusion and Multivariate Analysis

Data Preprocessing
  • LC-HRMS Data: Use software (e.g., XCMS, MS-DIAL) for peak picking, alignment, and gap filling. Apply quality control-based correction using pooled QC samples.
  • NMR Data: Process using software (e.g., MNova, Chenomx). Apply bucketing (0.04 ppm buckets) excluding water (δ 4.7-5.0 ppm) and urea regions.
  • Normalization: Apply probabilistic quotient normalization to correct for dilution effects.
  • Scaling: Apply Pareto scaling (mean-centering each variable and dividing by the square root of its standard deviation), a compromise between no scaling and unit variance scaling.
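
The bucketing and normalization steps above can be illustrated as follows. This is a simplified sketch, not the behaviour of MNova or Chenomx: the function names are hypothetical, the spectrum is a toy list of (ppm, intensity) points, and the probabilistic quotient normalization uses the median spectrum of the dataset as its reference, which is one common but not the only choice.

```python
from statistics import median


def bucket_spectrum(ppm, intensity, start=0.5, stop=10.0, width=0.04,
                    exclude=(4.7, 5.0)):
    """Sum intensities into fixed-width ppm buckets, skipping the excluded region."""
    sums = {}
    for shift, value in zip(ppm, intensity):
        if not (start <= shift < stop) or exclude[0] <= shift <= exclude[1]:
            continue
        index = int((shift - start) / width)  # bucket index counted from `start`
        sums[index] = sums.get(index, 0.0) + value
    return sums


def pqn_normalize(rows):
    """Probabilistic quotient normalization against the median spectrum."""
    # Step 1: total-area normalization of each sample.
    rows = [[v / sum(r) for v in r] for r in rows]
    # Step 2: reference spectrum = per-bucket median across samples.
    reference = [median(col) for col in zip(*rows)]
    normalized = []
    for r in rows:
        # Step 3: quotient of each bucket to the reference; the median quotient
        # is taken as the most probable dilution factor.
        quotients = [v / ref for v, ref in zip(r, reference) if v > 0 and ref > 0]
        factor = median(quotients)
        normalized.append([v / factor for v in r])
    return normalized


# Toy spectrum: the point at 4.80 ppm falls in the water region and is dropped.
buckets = bucket_spectrum([0.50, 0.53, 0.55, 4.80, 9.99],
                          [1.0, 2.0, 4.0, 100.0, 3.0])

# Toy bucket table: sample 2 is a 2x "dilution" of sample 1; sample 3 has one
# genuinely elevated bucket.
table = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 6.0, 4.0]]
normalized = pqn_normalize(table)
```

In the toy table, the third sample keeps its elevated bucket after normalization while its other buckets line up with the first two samples, which is the behaviour that distinguishes PQN from plain total-area normalization.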
Statistical Data Fusion
  • Multiple Co-inertia Analysis (MCIA): Perform unsupervised multi-block analysis to explore relationships between samples and variables across both datasets.
  • Sparse Partial Least Squares-Discriminant Analysis (sPLS-DA): Apply supervised classification to identify features most relevant for class separation.
  • Model Validation: Use cross-validation and permutation testing to detect overfitting, and report classification error rates.

Table 2: Key Parameters for Statistical Analysis

| Analysis Method | Key Parameters | Implementation |
| --- | --- | --- |
| Multiple Co-inertia Analysis (MCIA) | Number of components, RV coefficient | R package: omicade4 |
| Sparse PLS-Discriminant Analysis | Number of components, KeepX parameters | R package: mixOmics |
| Cross-Validation | Number of folds (n=7), Number of repeats (n=50) | R package: caret |
| Permutation Testing | Number of permutations (n=1000) | Custom R script |
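
The table lists R implementations; the logic of repeated cross-validation and permutation testing can nevertheless be sketched in Python to show what the parameters control. The example below swaps in a simple nearest-centroid classifier in place of sPLS-DA, uses invented two-class data, and runs only 200 permutations to keep the toy run short. All function names are hypothetical.

```python
import random


def nearest_centroid_predict(train_x, train_y, test_x):
    """Assign each test sample to the class with the closest training centroid."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def closest(x):
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))

    return [closest(x) for x in test_x]


def cv_error(x, y, n_folds, rng):
    """Misclassification rate from one round of k-fold cross-validation."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    errors = 0
    for fold in (idx[i::n_folds] for i in range(n_folds)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        pred = nearest_centroid_predict([x[i] for i in train],
                                        [y[i] for i in train],
                                        [x[i] for i in fold])
        errors += sum(p != y[i] for p, i in zip(pred, fold))
    return errors / len(y)


def repeated_cv_error(x, y, n_folds=7, n_repeats=50, seed=0):
    """Average error over repeated k-fold cross-validation."""
    rng = random.Random(seed)
    return sum(cv_error(x, y, n_folds, rng) for _ in range(n_repeats)) / n_repeats


def permutation_p_value(x, y, n_permutations=1000, n_folds=7, seed=0):
    """Fraction of label permutations that do at least as well as the true labels."""
    rng = random.Random(seed)
    observed = cv_error(x, y, n_folds, rng)
    as_good = 0
    for _ in range(n_permutations):
        shuffled = list(y)
        rng.shuffle(shuffled)
        if cv_error(x, shuffled, n_folds, rng) <= observed:
            as_good += 1
    return (as_good + 1) / (n_permutations + 1)


# Toy data: 14 samples, two well-separated classes (values are invented).
base = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1], [0.2, 0.2]]
x = base + [[a + 5.0, b + 5.0] for a, b in base]
y = ["A"] * 7 + ["B"] * 7
error = repeated_cv_error(x, y)
p_value = permutation_p_value(x, y, n_permutations=200)
print(error, p_value)
```

A model whose cross-validated error is no better than what shuffled labels achieve yields a large p-value, which is the signal that apparent class separation is an artefact of overfitting.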

Biomarker Validation and Identification

Validation Workflow

The biomarker validation process extends beyond statistical identification to establish biological relevance:

[Workflow diagram: Statistically Significant Features → Confident Compound Identification → Biological Context Assessment → Robustness Evaluation → Validated Biomarker Panel]

Compound Identification and Validation Protocol

LC-HRMS Metabolite Identification
  • Accurate Mass Matching: Search against databases (HMDB, FooDB, KEGG) with mass tolerance <5 ppm.
  • Fragmentation Pattern Analysis: Compare MS/MS spectra with reference standards or spectral libraries (MassBank, GNPS).
  • Retention Time Prediction: Use in-house databases or quantitative structure-retention relationship models.
  • Confidence Levels (following the Metabolomics Standards Initiative reporting levels):
    • Level 1: Identified by reference standard (RT and MS/MS match)
    • Level 2: Putatively annotated (MS/MS spectral match to library)
    • Level 3: Putatively characterized (based on chemical class)
    • Level 4: Unknown (distinct molecular feature only)
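
The accurate-mass matching step can be sketched as a ppm-error filter. This is a minimal illustration, assuming a small in-memory candidate list rather than a real HMDB or FooDB query; the function names, the observed m/z, and the "hypothetical near-isobar" entry are invented for the example.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_candidates(observed_mz, candidates, tolerance_ppm=5.0):
    """Return (name, ppm error) for candidates within the tolerance, best first."""
    hits = []
    for name, theoretical_mz in candidates:
        error = ppm_error(observed_mz, theoretical_mz)
        if abs(error) < tolerance_ppm:
            hits.append((name, round(error, 2)))
    return sorted(hits, key=lambda hit: abs(hit[1]))


# Illustrative candidate list; theoretical m/z values should be verified
# against the database actually used.
candidates = [
    ("proline [M+H]+", 116.0706),
    ("hypothetical near-isobar", 116.0750),
]
hits = match_candidates(116.0709, candidates)
print(hits)
```

A match within 5 ppm only narrows the candidate list to a molecular formula; moving from a Level 3 or 4 annotation to Level 1 or 2 still requires the MS/MS and retention time evidence described above.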
NMR Metabolite Identification
  • Spectral Comparison: Compare chemical shifts and coupling patterns with reference databases (BMRB, HMDB).
  • Spiking Experiments: Add authentic standards to confirm assignments.
  • 2D NMR Experiments: Perform 1H-1H COSY, 1H-13C HSQC, and HMBC for structural elucidation of unknown compounds.

Biological Relevance Assessment
  • Pathway Analysis: Map identified biomarkers to metabolic pathways using KEGG, MetaCyc, or PlantCyc databases.
  • Temporal Dynamics: Assess biomarker behavior across different processing stages or storage conditions.
  • Dose-Response Relationship: Evaluate correlation between biomarker levels and relevant food characteristics.

Table 3: Biomarker Validation Criteria Assessment

| Validation Criterion | Assessment Method | Acceptance Threshold |
| --- | --- | --- |
| Statistical Significance | p-value (ANOVA), VIP (sPLS-DA) | p < 0.05, VIP > 1.5 |
| Identification Confidence | MS/MS, RT matching, NMR | Level 1 or 2 identification |
| Biological Plausibility | Pathway mapping, literature evidence | Established metabolic role |
| Analytical Performance | CV in QC samples, signal stability | CV < 20% in pooled QCs |
| Classification Power | AUC, classification error | AUC > 0.8, error rate < 10% |
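
The quantitative thresholds in Table 3 lend themselves to an automated screen. The sketch below is one possible encoding, not a prescribed tool: the `Candidate` record, the function names, and every numeric value in the example are hypothetical. Biological plausibility is left out because it rests on expert judgement rather than a number.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Candidate:
    name: str
    p_value: float         # ANOVA p-value
    vip: float             # VIP score from the supervised model
    id_level: int          # identification confidence level (1-4)
    qc_intensities: list   # intensities measured in the pooled QC injections


def qc_cv_percent(values):
    """Coefficient of variation (%) across pooled QC injections."""
    return stdev(values) / mean(values) * 100


def feature_checks(candidate):
    """Evaluate the per-feature criteria of Table 3."""
    return {
        "statistical": candidate.p_value < 0.05 and candidate.vip > 1.5,
        "identification": candidate.id_level in (1, 2),
        "analytical": qc_cv_percent(candidate.qc_intensities) < 20,
    }


def panel_passes(auc, error_rate):
    """Evaluate the panel-level classification criteria of Table 3."""
    return auc > 0.8 and error_rate < 0.10


# Hypothetical candidates; all numbers are invented for illustration.
good = Candidate("feature_001", 0.003, 2.1, 1, [100.0, 110.0, 90.0, 105.0, 95.0])
weak = Candidate("feature_002", 0.030, 1.2, 3, [100.0, 160.0, 60.0, 140.0, 70.0])
for c in (good, weak):
    print(c.name, feature_checks(c))
```

Returning the individual checks rather than a single pass/fail flag makes it clear which criterion excluded a feature, which is useful when deciding whether further identification work is worthwhile.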

Application Example: Amarone Wine Classification

Case Study Implementation

A recent study applied this workflow to classify Amarone wines by grape withering time and yeast strain [37]. The integrated LC-HRMS and 1H NMR approach classified wine samples with significantly higher accuracy than either technique alone, achieving a classification error rate of 7.52%.

Key biomarkers identified included:

  • Amino Acids: Significant variations in proline, arginine, and gamma-aminobutyric acid levels across withering times
  • Monosaccharides: Differential accumulation of glucose, fructose, and their derivatives
  • Polyphenolic Compounds: Distinct profiles of anthocyanins, flavonols, and phenolic acids

The data fusion approach revealed complementary information, with LC-HRMS providing sensitive detection of low-abundance polyphenols while NMR enabled absolute quantification of major metabolites and identification of unknown compounds.

This application note presents a comprehensive protocol for identifying robust biomarkers through LC-HRMS and NMR data fusion. By integrating complementary analytical platforms with advanced statistical methods and rigorous validation protocols, researchers can advance from merely identifying statistically significant features to discovering biologically relevant biomarkers with practical utility in food classification and authentication.

The structured workflow encompassing sample preparation, multi-platform data acquisition, statistical data fusion, and biological validation provides a template for generating high-quality, reproducible results that advance our understanding of food composition and quality determinants.

Assessing Model Generalizability and Long-Term Stability for Routine Application

The fusion of Liquid Chromatography-High-Resolution Mass Spectrometry (LC-HRMS) and Nuclear Magnetic Resonance (NMR) data represents a powerful multi-omics approach for food classification and authentication [1]. However, the transition from research-grade models to robust, routine analytical tools requires rigorous assessment of model generalizability and long-term stability. This protocol provides a comprehensive framework for evaluating these critical aspects, ensuring that classification models maintain performance across different instruments, over extended time periods, and with varying sample populations—a fundamental requirement for applications in food safety, authenticity, and regulatory compliance [3] [77].

The challenge of model robustness is particularly acute in untargeted LC-HRMS analyses, where slight variations in instrument performance, column age, and environmental conditions can introduce measurement-based differences that compromise model performance if not properly addressed [3]. Meanwhile, the complementary nature of LC-HRMS and NMR data fusion—as demonstrated by a limited correlation (RV-score = 16.4%) between datasets in Amarone wine classification—creates both opportunities and challenges for maintaining stable, generalizable models [1].

Theoretical Framework for Generalizability Assessment

Defining Generalizability in Analytical Models

Generalizability, or external validity, refers to a model's ability to maintain predictive performance when applied to data collected under different conditions than the original development dataset [77]. In the context of clinical prediction models, whose concepts transfer directly to analytical chemistry, this involves evaluating models on data "collected as part of an exercise separate from the development of the original model" [77]. For LC-HRMS/NMR fusion models, this encompasses several critical dimensions:

  • Temporal generalizability: Performance consistency across different time periods
  • Geographic generalizability: Performance across different laboratories and instruments
  • Population generalizability: Performance across different sample populations and varieties

Dataset Shift and Its Implications

The machine learning literature conceptualizes generalizability challenges through the framework of dataset shift, where the joint distribution of inputs and outputs differs between training and deployment environments [77]. Several types of dataset shift are particularly relevant to LC-HRMS/NMR fusion models:

  • Covariate shift: Changes in the distribution of predictor variables (e.g., metabolite concentrations)
  • Label shift: Changes in the distribution of class labels
  • Concept drift: Changes in the relationship between predictors and outcomes

Table 1: Types of Dataset Shift in LC-HRMS/NMR Fusion Models

| Shift Type | Definition | Example in Food Classification |
| --- | --- | --- |
| Covariate Shift | Change in distribution of input features (X) | Seasonal variation in metabolite profiles due to climate differences |
| Label Shift | Change in distribution of output classes (Y) | Different prevalence of food adulteration types across regions |
| Concept Drift | Change in relationship between X and Y | Evolving adulteration methods that change chemical signature relationships |
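
Covariate shift is the only one of the three shift types that can be screened without class labels, because it concerns the input features alone. The following sketch flags shifted features with a two-sample Kolmogorov-Smirnov statistic; the function names, the flagging threshold of 0.5, and the toy feature tables are assumptions made for illustration.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    largest_gap = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def flag_shifted_features(train_rows, new_rows, threshold=0.5):
    """Indices of features whose distribution differs between the two datasets."""
    flagged = []
    for j, (train_col, new_col) in enumerate(zip(zip(*train_rows), zip(*new_rows))):
        if ks_statistic(train_col, new_col) > threshold:
            flagged.append(j)
    return flagged


# Toy feature tables (rows = samples): feature 1 has drifted upward in the new batch.
train = [[1.0, 10.0], [1.1, 11.0], [0.9, 9.0], [1.0, 10.5]]
new = [[1.05, 20.0], [0.95, 21.0], [1.0, 19.0], [1.1, 22.0]]
print(flag_shifted_features(train, new))
```

Detecting label shift or concept drift requires samples of known class from the deployment setting, which is one reason the validation protocols in this section rely on reference samples.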

Experimental Protocols for Generalizability Assessment

Multi-Site Validation Protocol

Purpose: To evaluate model performance across different laboratory environments, instruments, and operators.

Materials:

  • Reference samples with known classification (minimum n=20 per class)
  • Participating laboratories (minimum n=3) with different LC-HRMS and NMR systems
  • Standardized sample preparation protocols
  • Data quality control materials (system suitability tests)

Procedure:

  • Distribute identical reference samples to all participating laboratories
  • Each laboratory performs sample preparation and analysis using their standard operating procedures
  • Collect raw data and processed features from all sites
  • Apply the classification model to each dataset without retraining
  • Calculate performance metrics (accuracy, precision, recall, F1-score) for each site
  • Compare performance across sites using statistical tests (e.g., ANOVA)

Acceptance Criterion: Model performance should not degrade by more than 15% compared with development performance at any participating site.
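
The per-site evaluation can be sketched as below. The code assumes a two-class problem and reads the 15% criterion as a relative drop from development accuracy, which is one interpretation of the acceptance criterion; the function names, labels, and accuracies are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def sites_within_tolerance(dev_accuracy, site_accuracy, max_relative_drop=0.15):
    """True for each site whose accuracy stays within the allowed relative drop."""
    floor = dev_accuracy * (1 - max_relative_drop)
    return {site: accuracy >= floor for site, accuracy in site_accuracy.items()}


# Hypothetical predictions from one site (model applied without retraining).
truth = ["authentic", "authentic", "adulterated", "adulterated"]
predicted = ["authentic", "adulterated", "adulterated", "adulterated"]
metrics = binary_metrics(truth, predicted, positive="adulterated")

# Hypothetical accuracies per laboratory, compared with a development accuracy of 0.95.
verdict = sites_within_tolerance(0.95, {"lab_A": 0.90, "lab_B": 0.85, "lab_C": 0.78})
print(metrics, verdict)
```

For problems with more than two classes, the same metrics would be computed per class and averaged, and the averaging scheme should be fixed before the sites report their data.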

Temporal Stability Assessment Protocol

Purpose: To evaluate model performance over extended time periods, accounting for instrument drift, reagent lot variations, and environmental changes.

Materials:

  • Stability sample set (minimum n=15)
  • Quality control samples for monitoring instrument performance
  • Data tracking system for recording instrument parameters and environmental conditions

Procedure:

  • Analyze stability samples at regular intervals (e.g., weekly) over an extended period (minimum 6 months)
  • With each analysis batch, include quality control samples to monitor instrument performance
  • Record all relevant instrument parameters (column lot, mobile phase composition, temperature, etc.)
  • Apply the classification model to stability sample data collected at each time point
  • Calculate performance metrics over time
  • Perform trend analysis to identify performance degradation

Data Analysis:

  • Create control charts for model performance metrics
  • Calculate Pearson correlation between performance and time
  • Establish alert and action limits for performance degradation
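
The data analysis steps above can be sketched with the standard library. The choice of alert and action limits at 2 and 3 standard deviations below the baseline mean follows common control-chart practice and is an assumption here, as are the function names and the weekly accuracy values.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = sqrt(sum((b - my) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def control_limits(baseline, alert_k=2.0, action_k=3.0):
    """Lower alert/action limits derived from an in-control baseline period."""
    center, spread = mean(baseline), stdev(baseline)
    return {"center": center,
            "alert": center - alert_k * spread,
            "action": center - action_k * spread}


def classify_points(values, limits):
    """Label each new performance value as ok, alert, or action."""
    labels = []
    for value in values:
        if value < limits["action"]:
            labels.append("action")
        elif value < limits["alert"]:
            labels.append("alert")
        else:
            labels.append("ok")
    return labels


# Hypothetical weekly accuracies: a stable baseline, then three later weeks.
baseline = [0.95, 0.94, 0.96, 0.95, 0.95]
limits = control_limits(baseline)
status = classify_points([0.95, 0.93, 0.90], limits)

# Trend check over a hypothetical monitoring window (week number vs. accuracy).
trend = pearson_r([1, 2, 3, 4], [0.95, 0.94, 0.93, 0.92])
print(status, round(trend, 3))
```

A strongly negative correlation with time points to gradual drift, such as column ageing, whereas isolated points below the action limit are more typical of a single faulty batch.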

Table 2: Performance Metrics for Generalizability Assessment

| Metric | Calculation | Acceptance Threshold |
| --- | --- | --- |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | ≥80% of development accuracy |
| Precision | TP / (TP + FP) | ≥75% of development precision |
| Recall | TP / (TP + FN) | ≥75% of development recall |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | ≥75% of development F1-score |
| Calibration Slope | Slope of observed vs. predicted probabilities | 0.85-1.15 |

Data Fusion and Preprocessing Protocol for Enhanced Generalizability

Purpose: To implement data processing strategies that enhance model stability across different analytical conditions, leveraging the BOULS (Bucketing of Untargeted LCMS Spectra) approach for LC-HRMS data [3].

LC-HRMS Data Processing:

  • Convert raw files to mzML format using MSConvert (ProteoWizard)
  • Perform retention time alignment using a central reference spectrum
  • Apply three-dimensional bucketing (retention time × m/z × intensity)
  • Use probabilistic quotient normalization for intensity correction
  • Implement quality control-based batch correction
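
The idea behind fixed-grid bucketing can be illustrated with a simplified sketch. This is not the BOULS implementation: it only shows how assigning peaks to predefined retention time and m/z bins, with intensities summed per bin, gives every sample the same feature identifiers, so that new batches can be appended without reprocessing earlier ones. The bin widths, function names, and peak lists are hypothetical.

```python
from collections import defaultdict


def bucket_features(peaks, rt_width=0.5, mz_width=0.01):
    """Map (rt_min, mz, intensity) peaks to summed intensities per (rt_bin, mz_bin)."""
    table = defaultdict(float)
    for rt, mz, intensity in peaks:
        key = (int(rt // rt_width), int(mz // mz_width))
        table[key] += intensity
    return dict(table)


def to_matrix(bucketed_samples):
    """Build a samples x buckets matrix; buckets absent from a sample are set to 0."""
    keys = sorted(set().union(*(sample.keys() for sample in bucketed_samples)))
    rows = [[sample.get(key, 0.0) for key in keys] for sample in bucketed_samples]
    return keys, rows


# Two hypothetical samples measured at different times: the first peak is slightly
# shifted in retention time and m/z but still falls into the same bucket.
sample_1 = bucket_features([(5.12, 116.0706, 1.0e6), (12.30, 303.0499, 2.0e5)])
sample_2 = bucket_features([(5.20, 116.0709, 8.0e5)])
keys, matrix = to_matrix([sample_1, sample_2])
print(keys)
print(matrix)
```

Because the grid is fixed in advance, a peak close to a bin edge can land in different buckets across runs, which is why retention time alignment to a reference precedes bucketing in the steps listed above.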

NMR Data Processing:

  • Apply Fourier transformation and phase correction
  • Perform chemical shift alignment using reference compounds
  • Use probabilistic quotient normalization for spectral normalization
  • Integrate regions of interest (bucketing) comparable to LC-HRMS approach

Data Fusion:

  • Employ Multiple Co-Inertia Analysis (MCIA) for unsupervised integration [1]
  • Apply sparse Partial Least Squares-Discriminant Analysis (sPLS-DA) for supervised integration [1]
  • Implement mid-level fusion by concatenating selected features from both modalities

Visualization of Generalizability Assessment Workflow

[Workflow diagram: Model Development Phase → Multi-Omics Data Collection (LC-HRMS + NMR) → Data Preprocessing & Fusion (Alignment, Normalization, Bucketing) → Model Training & Optimization (sPLS-DA, Random Forest) → Generalizability Assessment Phase → Multi-Site Validation (Geographic Generalizability), Temporal Stability Assessment (Long-term Performance), and Dataset Shift Analysis (Covariate, Label, Concept Drift) → Performance Evaluation & Decision Point → Model Approved for Routine Application (meets all criteria) or Model Requires Retraining/Optimization (fails any criterion)]

Generalizability Assessment Workflow for LC-HRMS/NMR Fusion Models

The Scientist's Toolkit: Essential Research Reagents and Materials

Table 3: Essential Research Reagents and Materials for Generalizability Assessment

| Item | Specifications | Function in Protocol |
| --- | --- | --- |
| Reference Standard Mixture | Certified metabolites covering key chemical classes (amino acids, sugars, phenolics) | System suitability testing and retention time calibration |
| Quality Control Pool | Representative sample pool from all classes of interest | Monitoring instrument performance and data quality |
| Deuterated Solvents | D₂O, CD₃OD, with TSP (aqueous samples) or TMS (organic solvents) as reference standard | NMR spectroscopy for locking, shimming, and chemical shift referencing |
| LC-MS Grade Solvents | Acetonitrile, methanol, water with ammonium formate/acetate additives | Mobile phase preparation for LC-HRMS analysis |
| HILIC and RP Columns | Accucore-150-Amide-HILIC, Hypersil Gold C18 (150 × 2.1 mm) | Complementary chromatographic separation for comprehensive metabolite coverage [3] |
| Stable Isotope Standards | ¹³C/¹⁵N-labeled amino acids, organic acids | Quality control for quantification and recovery assessment |
| Sample Preparation Kits | Solid-phase extraction cartridges (C18, polymer-based) | Standardized sample cleanup and metabolite enrichment |

Implementation Framework for Routine Application

Continuous Performance Monitoring

Implement a continuous monitoring system that tracks model performance metrics in real time and raises automated alerts when performance degrades. Key components include:

  • Dashboard visualization of performance metrics over time
  • Automated statistical process control with established control limits
  • Root cause analysis protocols for investigating performance deviations

Model Maintenance and Update Protocol

Establish a systematic approach to model maintenance that includes:

  • Regular performance reviews (quarterly)
  • Trigger-based retraining when performance degrades beyond established thresholds
  • Expansion of training data with new sample types and varieties
  • Version control for model iterations and performance history

Documentation and Reporting Standards

Maintain comprehensive documentation for model generalizability assessment, including:

  • Model passport with development parameters and performance benchmarks
  • Validation reports for each assessment phase
  • Change control records for all model modifications
  • Performance degradation incidents and corrective actions

This comprehensive framework for assessing model generalizability and long-term stability provides researchers with a rigorous methodology for transitioning LC-HRMS/NMR fusion models from research tools to reliable routine applications in food classification and authentication.

Conclusion

The fusion of LC-HRMS and NMR data represents a paradigm shift in food classification, moving beyond the limitations of single-platform analyses to create robust, information-rich models. As demonstrated across diverse applications from wine and hazelnuts to salmon, this synergistic approach consistently enhances classification accuracy, provides a more comprehensive characterization of the metabolome, and identifies key discriminant biomarkers. The methodologies and validation frameworks established in food science offer a direct and valuable pipeline for translation into biomedical and clinical research. Future directions should focus on standardizing data fusion protocols, developing more advanced AI-driven integration tools, and expanding applications into complex biological systems for patient stratification, therapeutic monitoring, and the discovery of novel clinical biomarkers, ultimately bridging the gap between food metabolomics and precision medicine.

References