Optimal symbol alignment distance: A new distance for sequences of symbols

Javier Herranz, J. Nin, Marc Solé

Research output: Article in indexed journal › Article › peer-reviewed

21 Citations (Scopus)

Abstract

Comparison functions for sequences (of symbols) are important components of many applications, for example, clustering, data cleansing, and integration. For years, many efforts have been made to improve the performance of such comparison functions. Improvements have been achieved either at the cost of reducing the accuracy of the comparison, or by compromising certain basic characteristics of the functions, such as the triangle inequality. In this paper, we propose a new distance for sequences of symbols (or strings) called Optimal Symbol Alignment distance (OSA distance, for short). This distance has a very low cost in practice, which makes it a suitable candidate for computing distances in applications with large amounts of (very long) sequences. After providing a mathematical proof that the OSA distance is a real distance, we present some experiments for different scenarios (DNA sequences, record linkage, etc.), showing that the proposed distance outperforms, in terms of execution time and/or accuracy, other well-known comparison functions such as the Edit or Jaro-Winkler distances.

Original language: English
Article number: 5601718
Pages (from-to): 1541-1554
Number of pages: 14
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 23
Issue number: 10
State: Published - 2011
Externally published
