Abstract:
Objective As emerging pollutants continue to appear and proliferate in the environment, their potential ecological toxicity, bioaccumulation, and environmental persistence pose severe and complex threats to natural ecosystems and human health. Emerging pollutants typically lack historical reference data, which makes it difficult to generate the labels required for supervised learning. Because conventional detection approaches rely heavily on large labeled datasets for model training, this lack of prior data severely limits their applicability, and they often perform poorly when identifying unknown new pollutants.
Methods To address this problem, a method is proposed that constructs a dynamic feature code library by coupling three-dimensional fluorescence spectroscopy with long short-term memory (LSTM) networks and incremental learning, with the aim of accurately identifying uncharacterized new pollutants. The method first reconstructs the three-dimensional fluorescence spectroscopy data into a quasi-time series ordered by excitation wavelength, yielding a sequence with intrinsic order and contextual correlations. This sequence carries rich chemical information, including the species and concentration of the target analytes. An LSTM network then reads the emission spectrum corresponding to each excitation wavelength in turn, capturing the dependencies between different wavelengths. After all emission spectra have been read, the output of the last time step is taken as the feature code of the sample's fluorescence spectrum. The resulting feature codes are discriminative, summarize the overall spectral information, and form the basis for subsequent pollutant identification. An incremental learning mechanism is then introduced. It builds a dynamically expandable feature code library from the derived feature codes, identifies unknown pollutants automatically by means of a calculated similarity threshold, and adds the feature codes of these newly identified pollutants to the library to support subsequent detection tasks.
This design allows the model to keep learning the features of new pollutants while maintaining robust recognition of existing contaminants, addressing the inability of traditional recognition methods to identify pollutants that are absent from their training data.
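The pipeline described in the Methods can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the LSTM weights here are random and untrained, biases are omitted, cosine similarity is assumed as the similarity measure, and the threshold of 0.9 and the names `lstm_feature_code`, `FeatureCodeLibrary`, and `unknown_N` are hypothetical choices made only for illustration.

```python
import math
import random


def lstm_feature_code(eem, hidden_size=8, seed=0):
    """Return the last LSTM hidden state after reading an EEM row by row.

    eem: list of emission spectra, one per excitation wavelength, in order.
    Weights are random and untrained, and biases are omitted (assumptions).
    """
    rng = random.Random(seed)
    width = len(eem[0]) + hidden_size
    # One weight matrix per gate: input, forget, candidate, output.
    w_i, w_f, w_g, w_o = (
        [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(hidden_size)]
        for _ in range(4)
    )

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    def pre(w, k, x):
        return sum(a * b for a, b in zip(w[k], x))

    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for emission in eem:  # one time step per excitation wavelength
        x = list(emission) + h
        new_h, new_c = [], []
        for k in range(hidden_size):
            i = sigmoid(pre(w_i, k, x))
            f = sigmoid(pre(w_f, k, x))
            g = math.tanh(pre(w_g, k, x))
            o = sigmoid(pre(w_o, k, x))
            cell = f * c[k] + i * g
            new_c.append(cell)
            new_h.append(o * math.tanh(cell))
        h, c = new_h, new_c
    return h  # feature code = hidden state at the last time step


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(v * v for v in a))
    nb = math.sqrt(sum(v * v for v in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


class FeatureCodeLibrary:
    """Expandable library: label -> list of stored feature codes."""

    def __init__(self, threshold=0.9):  # threshold value is an assumption
        self.threshold = threshold
        self.codes = {}
        self._unknown_count = 0

    def add(self, label, code):
        self.codes.setdefault(label, []).append(list(code))

    def identify(self, code):
        """Return (label, best similarity); register a new entry if unknown."""
        best_label, best_sim = None, -1.0
        for label, stored in self.codes.items():
            for s in stored:
                sim = cosine(code, s)
                if sim > best_sim:
                    best_label, best_sim = label, sim
        if best_label is not None and best_sim >= self.threshold:
            return best_label, best_sim
        # Below threshold: treat as a new pollutant and extend the library.
        self._unknown_count += 1
        new_label = f"unknown_{self._unknown_count}"
        self.add(new_label, code)
        return new_label, best_sim
```

In this sketch, existing entries are never overwritten when a new label is registered, which is one simple way to keep recognition of known pollutants intact while the library grows. A trained LSTM and a data-driven threshold would replace the placeholder weights and the fixed 0.9 in practice.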
Results and Discussions To verify whether the method can continuously learn from untrained pollutant samples, three rounds of tests were designed. Test samples under different pH conditions were included to simulate the variability of real-world environments. The method achieved an identification accuracy of 93.3% for known pollutant categories and 91.7% for unknown new pollutant categories. These results indicate that the method both mitigates catastrophic forgetting and adapts to new categories, supporting its capacity for continuous learning and incremental identification. To further evaluate the method, simulated pollution experiments were conducted by adding five typical contaminants to river water samples. The method achieved an identification accuracy of 93.3% for single pollutants, outperforming principal component analysis, parallel factor analysis, residual neural networks, and the incremental learning benchmark iCaRL. This is attributed to its ability to capture the complex dependencies in pollutant sample data along the excitation wavelength dimension, which yields more discriminative feature representations and greater robustness against background fluorescence interference from the river water matrix. For mixed pollutants, the method achieved an accuracy of 70.8% in the complete identification of all components. Although this is lower than the accuracy for single-pollutant identification, it still surpasses the other methods compared. The limitation stems from the fact that the fluorescence peaks of two specific pollutants lie relatively close together.
When mixed, these two components tend to cause feature-level confusion, so the model occasionally misclassifies them as a single substance, which limits the overall identification performance for mixed samples. Even so, the advantage of the proposed method over the comparison methods indicates strong potential for mixed pollutant identification.
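One plausible reading of "complete identification of all components" is an exact-match criterion: a mixed sample counts as correct only when the predicted set of pollutants equals the true set. The sketch below illustrates that reading under this assumption. The function name `exact_match_accuracy` and the placeholder labels "A", "B", and "C" are hypothetical and are not taken from the study.

```python
def exact_match_accuracy(true_sets, predicted_sets):
    """Fraction of samples whose predicted component set equals the true set."""
    if len(true_sets) != len(predicted_sets):
        raise ValueError("true_sets and predicted_sets must have equal length")
    if not true_sets:
        return 0.0
    hits = sum(1 for t, p in zip(true_sets, predicted_sets) if set(t) == set(p))
    return hits / len(true_sets)


# Toy data only: the second sample merges two components into one label,
# and the fourth reports an extra component, so both count as wrong.
truth = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]
predicted = [{"A", "B"}, {"A"}, {"B", "C"}, {"A", "B", "C"}]
score = exact_match_accuracy(truth, predicted)
```

Under this criterion, merging two components with nearby fluorescence peaks into one substance makes the whole sample count as an error, which is consistent with mixed-sample accuracy being lower than single-pollutant accuracy.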
Conclusions In summary, by capturing the dependencies across different wavelengths of three-dimensional fluorescence spectroscopy data to extract feature codes, and by integrating an incremental learning mechanism, the proposed method provides a robust solution for identifying unknown new pollutants. The method identifies known pollutants effectively, learns from unknown new pollutants, and transfers the acquired feature information to subsequent detection tasks, demonstrating knowledge transfer, continuous learning, scalability, and adaptability. As additional contaminant feature codes are incorporated, the method can gradually expand its recognition scope, offering a reliable technical solution for the long-term monitoring and identification of emerging contaminants.