Advanced data mining and machine learning techniques for chemometric modeling

  1. Cernuda García, Carlos
Supervised by:
  1. Erich Peter Klement (Supervisor)
  2. Ricard Boqué Martí (Supervisor)

Defense university: Johannes Kepler Universität Linz

Defense date: 1 August 2014

Type: Thesis

Abstract

This thesis develops data mining and machine learning techniques for modeling in chemometrics, the branch of analytical chemistry that uses mathematical and statistical methods to relate measurements made on a chemical system to the current state of that system. More concretely, the work focuses on multivariate calibration, in which the measurements are spectral data and the state of the system is the concentration of some chemical compound. It is organized in five chapters.
The first chapter describes the concepts of chemometrics and multivariate calibration. It also demonstrates the need for new advanced techniques that keep pace with the continuous technical advances and enhancements in data acquisition and storage in chemometrics. Moreover, the problem statement is presented by describing the parts of the modeling process and identifying at which points, and with which tools, improvements over the state of the art can be achieved, as shown in Chapter 4. The last part of the chapter describes some state-of-the-art methods, which also serve as benchmarks in the practical applications of the new techniques presented in this thesis.
The second chapter deals with batch off-line modeling, in which all the data is available from the very beginning. Off-line modeling can be used, for instance, to extract a posteriori knowledge about the system or as an initial step of an on-line modeling process. The presented techniques encompass all three parts of the modeling process: i) data preprocessing, in the form of outlier detection algorithms; ii) model calculation, with some techniques adapted from other fields, such as fuzzy inference systems, and some brand-new ones, such as a genetic hybridization of optimization techniques for variable selection; and iii) validation, by means of several types of confidence intervals.
The third chapter covers incremental on-line modeling, in which the data appears as a potentially infinite data stream: the samples are not available from the beginning but are recorded continuously, one by one or in small batches. In this work the data is assumed to be processed in a single-pass manner, meaning that once new data arrives, the previous data is no longer available. The chapter includes on-line incremental versions of some of the outlier detection and model calculation techniques. It goes beyond merely extending the second chapter, however, because it presents an on-line incremental version of the partial least squares regression algorithm, as well as a retraining procedure, based on a sliding window concept, for self-adaptive models. Moreover, two active learning stages for cost optimization in dynamic processes are developed: an incremental stage that selects informative new samples for inclusion, and a decremental stage that selects outdated, previously used samples for exclusion.
The fourth chapter shows the successful application of the off-line and on-line methods from the two previous chapters in real-world scenarios, consisting of data sets from three chemical production systems: polyether acrylate, melamine resin and viscose fiber production.
The last chapter summarizes conclusions and remarks and points out new lines for potential further research.
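
As an illustration of the sliding-window retraining idea mentioned for the third chapter, the following is a minimal sketch and not the procedure developed in the thesis. It assumes a ridge-regularized least-squares model as a simple stand-in for partial least squares, a fixed window length, and a simulated three-channel "spectrum" whose relation to the concentration changes abruptly halfway through the stream. The class name `SlidingWindowCalibrator`, the helper `solve_linear`, and all parameter values are hypothetical. Note that keeping a small buffer of recent samples relaxes the strict single-pass setting described above.

```python
from collections import deque
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


class SlidingWindowCalibrator:
    """Linear calibration model refit on the most recent samples only.

    Assumption: ridge least squares replaces PLS to keep the sketch short.
    """

    def __init__(self, window_size=30, ridge=1e-6):
        # A deque with maxlen discards the oldest sample automatically.
        self.window = deque(maxlen=window_size)
        self.ridge = ridge
        self.coef = []
        self.intercept = 0.0

    def update(self, spectrum, concentration):
        """Add one (spectrum, concentration) pair and retrain on the window."""
        self.window.append((list(spectrum), float(concentration)))
        self._refit()

    def _refit(self):
        n = len(self.window)
        p = len(self.window[0][0])
        x_mean = [sum(x[j] for x, _ in self.window) / n for j in range(p)]
        y_mean = sum(y for _, y in self.window) / n
        # Normal equations on centered data: (Xc'Xc + ridge*I) coef = Xc'yc
        A = [[self.ridge if i == j else 0.0 for j in range(p)] for i in range(p)]
        b = [0.0] * p
        for x, y in self.window:
            xc = [x[j] - x_mean[j] for j in range(p)]
            for i in range(p):
                b[i] += xc[i] * (y - y_mean)
                for j in range(p):
                    A[i][j] += xc[i] * xc[j]
        self.coef = solve_linear(A, b)
        self.intercept = y_mean - sum(m * c for m, c in zip(x_mean, self.coef))

    def predict(self, spectrum):
        return self.intercept + sum(v * c for v, c in zip(spectrum, self.coef))


# Simulated stream (hypothetical data): the process changes at t = 100.
rng = random.Random(0)
model = SlidingWindowCalibrator(window_size=30)
true_coef = [1.0, -2.0, 0.5]
for t in range(200):
    if t == 100:
        true_coef = [-1.0, 2.0, 0.5]
    x = [rng.gauss(0, 1) for _ in range(3)]
    y = sum(a * c for a, c in zip(x, true_coef)) + rng.gauss(0, 0.01)
    model.update(x, y)
```

Because the window holds only the 30 most recent samples, the model at the end of the stream has been refit entirely on data recorded after the change and no longer reflects the earlier regime. Refitting from scratch at every step is chosen here for clarity. An incremental update of the model, as the abstract describes for partial least squares, would avoid that cost.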