The aid of machine learning to overcome the classification of real health discharge reports written in Spanish
- Alicia Pérez
- Arantza Casillas
- Koldo Gojenola
- Maite Oronoz
- Nerea Aguirre
- Estibaliz Amillano
ISSN: 1135-5948
Year of publication: 2014
Issue: 53
Pages: 77-84
Type: Article
More publications in: Procesamiento del lenguaje natural
Abstract
The network of hospitals that makes up the Spanish health system uses the International Classification of Diseases, Clinical Modification (ICD9-CM) to code hospital discharge reports. Today, this work is done by hand by experts. This article addresses the problem of automatically classifying real hospital discharge reports written in Spanish according to the ICD9-CM standard. The challenge lies in the fact that the reports are written in spontaneous language. We experimented with several machine learning systems to solve this classification problem. Random Forest is the most competitive of the algorithms tested, achieving an F-measure of 0.876.
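As a reading aid for the reported score, the following minimal sketch shows how an F-measure can be computed from gold and predicted codes. The abstract does not state how the 0.876 figure was averaged across codes, so the frequency-weighted average used here is an assumption. The function names and the six toy reports are hypothetical and are not taken from the article's data.

```python
from collections import Counter


def f_measure(gold, predicted, label):
    """Return (precision, recall, F-measure) for a single label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


def weighted_f_measure(gold, predicted):
    """Average the per-label F-measures, weighted by label frequency in gold.

    Assumption: the averaging scheme is illustrative only; the source
    abstract does not specify one.
    """
    support = Counter(gold)
    total = len(gold)
    return sum(
        f_measure(gold, predicted, label)[2] * count / total
        for label, count in support.items()
    )


# Toy example (hypothetical): six reports, each with one code as a string.
gold = ["250.00", "250.00", "250.00", "401.9", "401.9", "486"]
predicted = ["250.00", "250.00", "401.9", "401.9", "401.9", "486"]

per_code = {code: f_measure(gold, predicted, code) for code in sorted(set(gold))}
overall = weighted_f_measure(gold, predicted)
```

Here one report is miscoded, which lowers recall for the first code and precision for the second, and both effects feed into the overall score.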
Bibliographic References
- Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.
- Chang, C. C. and C. J. Lin. 2001. LIBSVM: a library for support vector machines.
- Ferrao, J. C., M. D. Oliveira, F. Janela, and H.M.G. Martins. 2012. Clinical coding support based on structured data stored in electronic health records. In Bioinformatics and Biomedicine Workshops, 2012 IEEE International Conference on, pages 790-797.
- Hall, M., E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. 2009. The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10-18.
- Lang, D. 2007. Natural language processing in the health care industry. Consultant report, Cincinnati Children's Hospital Medical Center.
- Mitchell, T. 1997. Machine Learning. McGraw Hill.
- Peng, H., C. Gates, B. Sarma, N. Li, Y. Qi, R. Potharaju, C. Nita-Rotaru, and I. Molloy. 2012. Using probabilistic generative models for ranking risks of Android apps. In Proceedings of the 2012 ACM conference on Computer and communications security, pages 241-252. ACM.
- Pestian, J. P., C. Brew, P. Matykiewicz, D. J. Hovermale, N. Johnson, K. Bretonnel Cohen, and W. Duch. 2007. A shared task involving multi-label classification of clinical free text. In Biological, translational, and clinical language processing, pages 97-104. Association for Computational Linguistics.
- Platt, J. C. 1999. Fast training of support vector machines using sequential minimal optimization. MIT Press.
- Quinlan, R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
- Rodríguez, J. D., A. Pérez, D. Arteta, D. Tejedor, and J. A. Lozano. 2012. Using multidimensional Bayesian network classifiers to assist the treatment of multiple sclerosis. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 42(6):1705-1715.
- Soni, J., U. Ansari, D. Sharma, and S. Soni. 2011. Predictive data mining for medical diagnosis: An overview of heart disease prediction. International Journal of Computer Applications, 17.
- Sriram, B., D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. 2010. Short text classification in Twitter to improve information filtering. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 841-842. ACM.