Enhancement of ensemble data mining techniques via soft computing
- Elsayed, Amgad Monir Mohamed
- Enrique Onieva Caracuel (Supervisor)
- Michał Woźniak (Supervisor)
Defending university: Universidad de Deusto
Date of defense: 23 March 2021
- Manuel Graña Romay (Committee chair)
- Antonio David Masegosa Arredondo (Committee secretary)
- Alberto Cano (Committee member)
Type: Doctoral thesis
Abstract
Machine learning (ML) is the area of study that gives computers the ability to learn without being explicitly programmed. Sometimes this reveals unsuspected correlations and leads to a deeper understanding of the problem. The key idea is to learn from data, as we are surrounded by data everywhere (user logs, financial data, production data, medical records, etc.). Machine learning is well suited to complex problems for which no good conventional solution exists, and it is also suitable for fluctuating environments, since models can adapt to new data. Data mining is a related field that aims to discover patterns that are not immediately apparent. Two important factors drive this area: the use of effective models that capture complex data, and the design of scalable learning systems that learn from massive datasets. It has been extensively reported in the literature that pooling learning models together is a desirable strategy for constructing robust data mining systems; this approach is known as ensemble data mining. Ensemble systems for pattern classification have been studied in the literature under the name of multiple classifier systems (MCS). Classification tasks pose various challenges, e.g., in terms of the data size, the number of classes, the dimensionality of the feature space, the overlap between instances, the balance between class categories, and the nonlinear complexity of the true unknown hypotheses. These challenges make perfect solutions difficult to obtain. A promising approach is to train a set of diverse and accurate base classifiers and to combine them. A primary drawback of classifier ensembles, despite their remarkable performance, is that a large number of classifiers must be combined to ensure that the error converges to its asymptotic value. This entails high computational requirements, including the cost of training, the storage needed, and the time for a prediction.
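As context, the strategy described above, combining many diverse base classifiers, can be sketched as a bagging ensemble with majority voting. The library, synthetic dataset, and ensemble size below are illustrative assumptions, not the experimental setup of the thesis:

```python
# Minimal sketch: a bagging ensemble of decision trees combined by majority
# vote (scikit-learn; dataset and sizes are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Many base learners are needed for the error to approach its asymptotic
# value -- exactly the computational drawback discussed above.
ensemble = BaggingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(ensemble.score(X_te, y_te))
```

Storing and querying all 100 trees is what drives the training, storage, and prediction-time costs that the strategies below aim to reduce.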
In addition, when classifiers are spread over a network, substantial communication costs are incurred. To alleviate these drawbacks, this thesis proposes various strategies, in particular ways in which soft computing techniques can be incorporated into MCS. Soft computing methods are computing paradigms that parallel the extraordinary ability of the human mind to reason and learn. Soft computing methods, also known as computational intelligence, use approximate calculations to provide imprecise but usable solutions to problems that are unsolvable or simply too time-consuming to solve exactly. In the MCS literature, soft computing methods have mostly been proposed either to optimize the classifiers' combination function or to select a subset of classifiers instead of aggregating all of them. However, the efficiency and efficacy of MCS can still be improved through the contributions of this thesis. The efficiency of MCS concerns fast training, lower storage requirements, higher classification speed, and lower communication cost between distributed models. Two directions were followed to achieve this. First, at the data level, we apply instance selection (IS) methods as a preprocessing mechanism to decrease the size of the training data. This speeds up the training of MCS, and the accuracy of the models can be increased by focusing on informative samples. Related to this part, we evaluate the interconnection between IS and MCS. Second, at the classifier level, ensemble pruning is a strategy by which a subset of classifiers can be selected while maintaining, or even improving, the performance of the original ensemble. To this end, we propose a guided-search pruning method that combines multiple pruning metrics while retaining their performance. In addition, the simultaneous effect of downsizing the number of samples and downsizing the number of classifiers is analyzed. Furthermore, we analyze recent reordering-based MCS pruning metrics, which are recognized as accurate and fast strategies for identifying a subset of classifiers.
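One concrete instance of reordering-based pruning is greedy reduce-error pruning: classifiers are reordered by the validation accuracy of the growing sub-ensemble, and only a short prefix is kept. The sketch below is a minimal illustration under assumed data and ensemble sizes; it is not the guided-search method proposed in the thesis:

```python
# Hedged sketch of reordering-based (reduce-error) ensemble pruning:
# greedily reorder classifiers by the majority-vote accuracy of the growing
# sub-ensemble on a validation set, then keep a small prefix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1)

bag = BaggingClassifier(n_estimators=51, random_state=1).fit(X_tr, y_tr)
# predictions of every base classifier on the validation set: (n_est, n_va)
preds = np.array([est.predict(X_va) for est in bag.estimators_])

selected, remaining = [], list(range(len(bag.estimators_)))
while remaining:
    # add the classifier that maximizes the sub-ensemble's vote accuracy
    def acc_with(i):
        votes = preds[selected + [i]].mean(axis=0) >= 0.5
        return (votes == y_va).mean()
    best = max(remaining, key=acc_with)
    selected.append(best)
    remaining.remove(best)

pruned = selected[:11]  # keep only a short prefix of the reordered ensemble
votes = preds[pruned].mean(axis=0) >= 0.5
print((votes == y_va).mean())
```

The pruned sub-ensemble of 11 classifiers needs roughly a fifth of the storage and prediction time of the full ensemble, which is the efficiency gain this line of work targets.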
The efficacy of MCS concerns the predictive performance: going beyond what can be achieved by state-of-the-art ensemble algorithms. Related to this part, we propose swarm intelligence (SI) algorithms, as soft computing techniques, to integrate multiple classifier decisions. In connection with that, a framework is proposed that combines three computational intelligence paradigms: IS, MCS, and SI algorithms. The objective is to build a more diverse and highly accurate MCS from only a reduced portion of the available data. In summary, this research introduces novel and improved strategies to increase the efficiency and the efficacy of MCS. Soft computing is applied to optimize the integration of classifiers and to identify the best classifier subsets. The results obtained throughout the thesis show that the performance of ensemble systems can be boosted by applying IS methods as a data preprocessing technique. The application of SI algorithms, or hybrid versions thereof, is a promising way to effectively integrate individual classifier decisions. Furthermore, small ensembles trained on fewer samples can significantly outperform large ensembles that use the whole training data. Finally, an analysis of recent heuristic metrics for pruning bagging ensembles has been conducted.
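One possible instantiation of SI-based decision integration is particle swarm optimization (PSO) of the weights in a weighted soft vote over heterogeneous classifiers. The PSO variant, parameters, and base classifiers below are illustrative assumptions, not the algorithms developed in the thesis:

```python
# Hedged sketch: a basic PSO searches for vote weights that maximize the
# weighted-soft-vote accuracy of three heterogeneous classifiers on a
# validation set (all settings are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X, y = make_classification(n_samples=600, n_features=20, random_state=2)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=2)

models = [DecisionTreeClassifier(random_state=2), GaussianNB(),
          LogisticRegression(max_iter=1000)]
# class-1 probability of each base classifier on the validation set
probs = np.array([m.fit(X_tr, y_tr).predict_proba(X_va)[:, 1] for m in models])

def fitness(w):
    # accuracy of the weighted soft vote (negative weights clipped to zero)
    w = np.clip(w, 0, None)
    score = w @ probs / max(w.sum(), 1e-9)
    return ((score >= 0.5) == y_va).mean()

# standard PSO loop: inertia plus cognitive and social attraction terms
n_particles, dims = 20, len(models)
pos = rng.random((n_particles, dims))
vel = np.zeros_like(pos)
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()
for _ in range(30):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print(fitness(gbest))
```

The swarm treats the weight vector as a particle position, so any differentiability requirements are avoided; the same scheme extends to selecting classifier subsets by thresholding the learned weights.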