Multi-objective evolutionary optimization for dimensionality reduction of texts represented by synsets

Vélez de Mendizabal, Iñaki; Basto-Fernandes, Vitor; Ezpeleta, Enaitz; Méndez, José R.; Gómez-Meire, Silvana; Zurutuza, Urko

doi:10.7717/PEERJ-CS.1240

Vélez de Mendizabal, Iñaki ¹⁵
Basto-Fernandes, Vitor ¹
Ezpeleta, Enaitz ⁵
Méndez, José R. ²³⁴
Gómez-Meire, Silvana ⁴
Zurutuza, Urko ⁵

1 University Institute of Lisbon ISTAR-IUL, Instituto Universitário de Lisboa (ISCTE-IUL), Lisboa, Portugal
2 Galicia Sur Health Research Institute (IIS Galicia Sur), Hospital Álvaro Cunqueiro, Bloque técnico, SING Research Group, Vigo, Pontevedra, Spain
3 CINBIO-Biomedical Research Centre, Lagoas-Marcosende, Vigo, Pontevedra, Spain
4 Department of Computer Science Universidade de Vigo, Ourense, Spain
5 Universidad de Mondragón/Mondragon Unibertsitatea

Mondragón, Spain

ROR https://ror.org/00wvqgd19

Journal:

PeerJ Computer Science

ISSN: 2376-5992

Year of publication: 2023

Volume: 9

Pages: e1240

Type: Article

DOI: 10.7717/PEERJ-CS.1240

Abstract

Despite new developments in machine learning classification techniques, improving the accuracy of spam filtering remains difficult because of linguistic phenomena that limit its effectiveness. In particular, we highlight polysemy, synonymy, the usage of hypernyms/hyponyms, and the presence of irrelevant/confusing words. These problems should be solved at the pre-processing stage to avoid using inconsistent information when building classification models. Previous studies have suggested that synset-based representation strategies can solve synonymy and polysemy problems. In addition, hyponymy/hypernymy relations can be exploited to implement dimensionality reduction strategies. These strategies could unify textual terms to model the intentions of the document without losing any information (e.g., bringing together the synsets “viagra”, “cialis”, “levitra” and others representing similar drugs by using “virility drug”, which is a hypernym of all of them). These feature reduction schemes are known as lossless strategies, as the information is not removed but only generalised. However, in some types of text classification problems (such as spam filtering) it may not be worthwhile to keep all the information, and it may be preferable to let dimensionality reduction algorithms discard information that may be irrelevant or confusing. In this work, we formulate feature reduction as a multi-objective optimisation problem to be solved using a Multi-Objective Evolutionary Algorithm (MOEA). With minor modifications, our algorithm can implement lossless (using only semantic-based synset grouping), low-loss (discarding irrelevant information and using semantic-based synset grouping) or lossy (discarding only irrelevant information) strategies.
The contribution of this study is two-fold: (i) to introduce different dimensionality reduction methods (lossless, low-loss and lossy) as an optimization problem that can be solved using a MOEA and (ii) to provide an experimental comparison of lossless and low-loss schemes for text representation. The results obtained support the usefulness of the low-loss method to improve the efficiency of classifiers.
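To make the three strategies concrete, the following Python sketch illustrates one way they could be encoded. It is not the authors' implementation: the toy hypernym table, the per-synset action codes (KEEP, GENERALISE, DISCARD), the function names and the two minimised objectives in the usage example (number of features and error rate) are all assumptions made for illustration, since the abstract does not specify the chromosome encoding or the objectives.

```python
from collections import Counter

# Hypothetical toy taxonomy: synset -> hypernym (not taken from WordNet or BabelNet).
HYPERNYM = {
    "viagra": "virility_drug",
    "cialis": "virility_drug",
    "levitra": "virility_drug",
}

# Assumed per-synset actions; a candidate solution maps synsets to actions.
KEEP, GENERALISE, DISCARD = 0, 1, 2

# Actions each strategy may use, following the definitions in the abstract.
STRATEGY_ACTIONS = {
    "lossless": {KEEP, GENERALISE},           # semantic grouping only
    "low-loss": {KEEP, GENERALISE, DISCARD},  # grouping plus discarding
    "lossy": {KEEP, DISCARD},                 # discarding only
}


def reduce_document(doc, genes):
    """Apply one action per synset to a bag of synsets (a Counter)."""
    reduced = Counter()
    for synset, count in doc.items():
        action = genes.get(synset, KEEP)
        if action == DISCARD:
            continue  # the feature is dropped, so information is lost
        if action == GENERALISE and synset in HYPERNYM:
            synset = HYPERNYM[synset]  # merge into the hypernym, counts preserved
        reduced[synset] += count
    return reduced


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


if __name__ == "__main__":
    doc = Counter({"viagra": 2, "cialis": 1, "meeting": 1, "the": 3})
    lossless = {s: GENERALISE for s in HYPERNYM}
    low_loss = {**lossless, "the": DISCARD}
    print(reduce_document(doc, lossless))
    print(reduce_document(doc, low_loss))
    # Assumed objective vectors: (number of features, error rate), both minimised.
    print(pareto_front([(3, 0.10), (2, 0.12), (4, 0.10), (2, 0.15)]))
```

In this sketch a lossless candidate only merges synsets into their hypernym, so term counts are preserved, whereas a low-loss candidate may additionally drop features. An evolutionary algorithm would search over such action assignments and keep the non-dominated trade-offs.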

Funding information

Funders

SMEIC, SRA and ERDF
- TIN2017-84658-C2-1-R and TIN2017-84658-C2-2-R
Conselleria de Cultura, Educación e Universidade of Xunta de Galicia
- ED431C 2022/03-GRC
Universities and Research of the Basque Country
- IT1676-22
FCT
- UIDB/04466/2020 and UIDP/04466/2020

