ML & Computer Vision | Completed | 2026

Rapid Soil Water & Nitrogen Prediction via NIR Spectroscopy and ML

Non-invasive dual-model ML pipeline that delivers soil predictions in under 1 minute, compared with 24-48 hours for traditional lab assays. It combines SVR for water content (R²=0.844, MAE=1.55%) with a Random Forest for nitrogen classification (98.1% balanced accuracy). Key finding: PCA cut the feature count by 90% while raising nitrogen balanced accuracy from 88.1% to 98.1%. Includes OOD detection and confidence thresholding for production use, and fits in 1.8 MB for edge deployment.

Problem

Traditional soil sensing is a bottleneck in precision agriculture: gravimetric water measurement requires 24-hour oven drying at 105°C, and laboratory nitrogen assays rely on destructive wet chemistry with 24-48 hour turnaround. These delays rule out real-time decisions for variable-rate irrigation and fertilization. Commercial NIR sensors exist but lack safety mechanisms: they output predictions even for anomalous data outside the training distribution, which undermines trust and hinders adoption. Precision agriculture needs a production-ready system that delivers reliable predictions in under 1 minute and handles real-world data uncertainty gracefully.

Approach

Built an end-to-end ML pipeline with three architectural layers:

(1) Dual-model architecture: Support Vector Regression with an RBF kernel (C=100, epsilon=0.1, gamma=auto) for continuous water content prediction, and a Random Forest classifier (300 trees, balanced class weights, max depth=20) for 4-class nitrogen categorization.

(2) Dimensionality reduction via PCA: the original 512 spectral features caused severe overfitting. Compressing them to 50 principal components (98% variance retained) reduced the feature space by 90%, shrank the model from 12.7 MB to 1.8 MB, and, counterintuitively, raised nitrogen balanced accuracy from 88.1% to 98.1% by discarding noisy, redundant wavelengths. Cross-validation standard deviation dropped from ±7.27% to ±2.40%, indicating more stable generalization.

(3) Production safety layer: an Isolation Forest performs out-of-distribution detection, flagging samples that fall outside the training manifold, and confidence thresholding (predictions below 0.60 probability are rejected) makes the system output 'uncertain' rather than guess.

Experimental protocol: a NIRQuest NQ5500316 spectrometer (898-2514 nm range) recorded 343 standardized measurements across 17 controlled water-nitrogen treatment combinations (water: 0/5/15/25/35%, nitrogen: 0/75/150/300 mg N kg⁻¹).
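The three layers can be sketched in scikit-learn roughly as follows. This is a minimal illustration, not the project's actual code: the spectra are random stand-in data, the `StandardScaler` step and the Isolation Forest defaults are assumptions, and `reducer` and `guarded_predict` are hypothetical helper names. Only the SVR, Random Forest, PCA, and 0.60 threshold settings come from the description above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real spectra (assumption): 200 samples x 512 wavelengths.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))
water = rng.uniform(0, 35, size=200)                 # water content, %
nitrogen = rng.choice([0, 75, 150, 300], size=200)   # nitrogen class, mg N/kg


def reducer():
    """Scale, then project 512 wavelengths onto 50 principal components.

    StandardScaler is an assumption; the write-up does not name a scaler.
    """
    return make_pipeline(StandardScaler(), PCA(n_components=50, random_state=0))


# Hyperparameters below are the ones quoted in the text.
water_model = make_pipeline(
    reducer(), SVR(kernel="rbf", C=100, epsilon=0.1, gamma="auto")
)
nitrogen_model = make_pipeline(
    reducer(),
    RandomForestClassifier(
        n_estimators=300, max_depth=20, class_weight="balanced", random_state=0
    ),
)
# Isolation Forest settings are scikit-learn defaults (assumption).
ood_detector = make_pipeline(reducer(), IsolationForest(random_state=0))

water_model.fit(X, water)
nitrogen_model.fit(X, nitrogen)
ood_detector.fit(X)


def guarded_predict(spectrum, min_confidence=0.60):
    """Return predictions only for in-distribution, confident samples."""
    x = np.asarray(spectrum, dtype=float).reshape(1, -1)

    # IsolationForest.predict returns -1 for outliers and +1 for inliers.
    if ood_detector.predict(x)[0] == -1:
        return {"status": "uncertain", "reason": "out_of_distribution"}

    proba = nitrogen_model.predict_proba(x)[0]
    best = int(np.argmax(proba))
    if proba[best] < min_confidence:
        return {"status": "uncertain", "reason": "low_confidence"}

    return {
        "status": "ok",
        "water_pct": float(water_model.predict(x)[0]),
        "nitrogen_class": int(nitrogen_model[-1].classes_[best]),
        "confidence": float(proba[best]),
    }


result = guarded_predict(X[0])
print(result["status"])
```

Putting the scaler and PCA inside each pipeline means the projection is fitted only on training data and reapplied identically at inference time, which also keeps the serialized artifact self-contained.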

Data

343 controlled laboratory measurements with standardized sample preparation: ~500 g of agricultural soil per container, gravimetric moisture adjustment with deionized water, aqueous ammonium nitrate (NH₄NO₃) for the nitrogen treatments, and a 10 kg compaction load for uniform density. The NIRQuest spectrometer captured 512-wavelength spectra spanning 898-2514 nm, the near-infrared range that covers O-H and N-H bond absorptions. Experimental design: full factorial with 20+ spectral replicates per treatment combination for statistical power. A stratified 70/15/15 train/validation/test split preserved class balance across both the water and nitrogen dimensions. Baseline soil nitrogen: 0.06% (lab-confirmed via Kjeldahl digestion).
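One way to stratify on both dimensions at once is to build a combined water-nitrogen label and split in two stages. The sketch below assumes that approach; the label layout is a synthetic stand-in (every water level crossed with every nitrogen level, 20 replicates each), not the actual 343-sample, 17-combination dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels (assumption): 5 water x 4 nitrogen levels x 20 replicates.
water_levels = [0, 5, 15, 25, 35]
nitrogen_levels = [0, 75, 150, 300]
replicates = 20

water = np.repeat(water_levels, len(nitrogen_levels) * replicates)
nitrogen = np.tile(np.repeat(nitrogen_levels, replicates), len(water_levels))

# One stratum per treatment combination, e.g. "15_75".
strata = np.array([f"{w}_{n}" for w, n in zip(water, nitrogen)])
indices = np.arange(len(strata))

# Stage 1: hold out 30% of the samples, stratified by treatment combination.
train_idx, rest_idx = train_test_split(
    indices, test_size=0.30, stratify=strata, random_state=0
)
# Stage 2: split the held-out 30% evenly into validation and test sets.
val_idx, test_idx = train_test_split(
    rest_idx, test_size=0.50, stratify=strata[rest_idx], random_state=0
)

print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting index arrays rather than the spectra themselves lets the same partition be reused for both the water and nitrogen models.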

Validation

Multi-metric evaluation framework: regression metrics (R², MAE, RMSE, ±5% tolerance accuracy) for water, and classification metrics (confusion matrix, per-class precision/recall, balanced accuracy) for nitrogen. Wavelength importance analysis via permutation feature importance on the raw spectral data pointed to chemically meaningful learning: the top-ranked wavelengths at 2400-2500 nm correspond to N-H stretching vibrations and C-N bond absorptions in soil organic matter, which suggests the model learned chemistry rather than noise. Cross-validation showed PCA's regularization benefit: validation standard deviation dropped from ±7.27% (overfitting) to ±2.40% (stable generalization). Out-of-distribution detection was validated on the test set: 98.1% true inlier rate with 3.2% false positive rejection, an acceptable trade-off for agricultural deployment, where conservative prediction is preferred.
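The listed metrics can be computed as in the following sketch. The arrays are toy values chosen for illustration, not project results, and `tolerance_accuracy` is a hypothetical helper that assumes the ±5% tolerance means an absolute error of at most 5 percentage points of water content.

```python
import numpy as np
from sklearn.metrics import (
    balanced_accuracy_score,
    confusion_matrix,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    recall_score,
)

# Toy values for illustration only (not project data).
water_true = np.array([0.0, 5.0, 15.0, 25.0, 35.0])
water_pred = np.array([1.0, 4.0, 16.0, 31.0, 33.0])


def tolerance_accuracy(y_true, y_pred, tol=5.0):
    """Fraction of predictions whose absolute error is at most `tol`."""
    return float(np.mean(np.abs(y_pred - y_true) <= tol))


r2 = r2_score(water_true, water_pred)
mae = mean_absolute_error(water_true, water_pred)
rmse = float(np.sqrt(mean_squared_error(water_true, water_pred)))
within_tol = tolerance_accuracy(water_true, water_pred)

# Nitrogen classes in mg N/kg; two samples are deliberately misclassified.
classes = [0, 75, 150, 300]
n_true = np.array([0, 0, 75, 75, 150, 150, 300, 300])
n_pred = np.array([0, 0, 75, 150, 150, 150, 300, 75])

cm = confusion_matrix(n_true, n_pred, labels=classes)
per_class_recall = recall_score(n_true, n_pred, labels=classes, average=None)
# Balanced accuracy is the unweighted mean of per-class recall.
balanced_acc = balanced_accuracy_score(n_true, n_pred)

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  within ±5: {within_tol:.0%}")
print(f"per-class recall={per_class_recall}  balanced accuracy={balanced_acc:.2f}")
```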

Results

Water Content Model: R²=0.844, Mean Absolute Error=1.55%, 92.3% of predictions within ±5% tolerance (industry standard for agricultural sensors). Nitrogen Classification: 98.1% balanced accuracy with PCA vs. 88.1% baseline, a 10 percentage point gain from dimensionality reduction alone. Per-class recall: 93%/83%/83%/92% across nitrogen levels 0/75/150/300 mg kg⁻¹; the 83% figures reflect worst-case confusion between the two middle classes, which is expected for adjacent categories. Combined pipeline size: 1.8 MB total (SVR + Random Forest + PCA transformer + Isolation Forest), small enough for a Raspberry Pi Zero or embedded agricultural hardware. Critical deployment constraint: nitrogen accuracy degrades to 67% at 25% water content because moisture interferes with the N-H absorption bands, so the model should be restricted to ≤20% moisture in field deployment. Prediction latency: under 1 minute including spectral acquisition and preprocessing, a speedup of at least 1440-2880× over traditional gravimetric and lab methods.
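The headline ratios follow directly from the figures quoted in this write-up. A short arithmetic check, treating the 1-minute latency as an upper bound (so the speedups are lower bounds):

```python
# All inputs are figures quoted in the write-up.
lab_minutes_fast = 24 * 60      # 24-hour turnaround, in minutes
lab_minutes_slow = 48 * 60      # 48-hour turnaround, in minutes
pipeline_minutes = 1            # upper bound on prediction latency

speedup_low = lab_minutes_fast / pipeline_minutes
speedup_high = lab_minutes_slow / pipeline_minutes

# 512 wavelengths compressed to 50 principal components.
feature_reduction_pct = (1 - 50 / 512) * 100
# Model size reduced from 12.7 MB to 1.8 MB.
size_reduction_pct = (1 - 1.8 / 12.7) * 100

print(f"speedup: {speedup_low:.0f}x to {speedup_high:.0f}x")
print(f"feature reduction: {feature_reduction_pct:.1f}%")
print(f"size reduction: {size_reduction_pct:.1f}%")
```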

My Role

ML Engineer (Solo Project). End-to-end execution: designed the full-factorial experimental protocol, collected 343 spectral measurements with NIRQuest hardware at the University of Bonn, implemented the dual-model training pipeline in scikit-learn, engineered the PCA dimensionality reduction that yielded a 10-point accuracy improvement, integrated the production safety layer (Isolation Forest + confidence filtering), performed validation including wavelength importance analysis, and identified the moisture-interference constraint that requires operational limits in deployment.

Next Steps

Run a field validation campaign with a portable NIR probe on commercial farms to quantify the lab-to-field transfer gap and calibrate domain adaptation. Investigate advanced spectral preprocessing (Savitzky-Golay 2nd derivative, Standard Normal Variate, Multiplicative Scatter Correction) to mitigate moisture interference beyond 20%. Expand the training dataset with diverse soil types (clay, loam, sandy loam) and organic matter ranges to improve generalization. Implement an edge inference server with an MQTT broker for real-time integration into John Deere/AGCO variable-rate applicators.
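The three preprocessing candidates named above are standard chemometric transforms. A minimal sketch of each, assuming spectra are stored as rows of a NumPy array; the function names, the window length of 11, and the demo spectra are illustrative choices, not project settings.

```python
import numpy as np
from scipy.signal import savgol_filter


def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum.

    Each spectrum is regressed on the reference (the mean spectrum by default),
    then the fitted offset and slope are removed.
    """
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        slope, intercept = np.polyfit(ref, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected


def sg_second_derivative(spectra, window_length=11, polyorder=2):
    """Savitzky-Golay smoothed second derivative along the wavelength axis."""
    return savgol_filter(
        spectra, window_length=window_length, polyorder=polyorder, deriv=2, axis=-1
    )


# Demo: three spectra that differ only by a scatter-like gain and offset.
base = np.sin(np.linspace(0.0, 3.0, 512))
spectra = np.array([1.0 * base + 0.1, 1.5 * base - 0.2, 0.8 * base + 0.3])

snv_out = snv(spectra)
msc_out = msc(spectra)
d2_out = sg_second_derivative(spectra)
print(snv_out.shape, msc_out.shape, d2_out.shape)
```

In this demo both SNV and MSC collapse the three spectra onto a single curve, because they differ only by gain and offset; real soil spectra would retain the chemical differences.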

Key Outcomes

98.1% nitrogen balanced accuracy (up from 88.1% via PCA)
R²=0.844 water prediction, MAE=1.55%
1.8 MB edge-ready; at least 1440× faster than lab methods
Safety guardrails: OOD detection + confidence filtering
Physically meaningful learning (2400-2500 nm N-H bonds)

Tech Stack

Python, scikit-learn, SVR, Random Forest, PCA, Isolation Forest, NIRQuest