TY - JOUR

T1 - Optimal symbol alignment distance

T2 - A new distance for sequences of symbols

AU - Herranz, Javier

AU - Nin, Jordi

AU - Solé, Marc

N1 - Funding Information:
The authors want to thank the anonymous reviewers of the IEEE Transactions on Knowledge and Data Engineering for their helpful comments on this paper. Partial support by the Spanish MEC Ministry (projects ARES—CONSOLIDER INGENIO 2010 CSD2007-00004—and eAEGIS— TSI2007-65406-C03-02) to the work of the three authors is also acknowledged. Javier Herranz enjoys a Ramón y Cajal grant, partially funded by the European Social Fund (ESF), from Spanish MICINN Ministry. His research is also supported by project MTM2009-07694 of the same MICINN Ministry. The work of Marc Solè is supported by project TIN2007-63927 of the Spanish MEC Ministry.

PY - 2011

Y1 - 2011

N2 - Comparison functions for sequences (of symbols) are important components of many applications, for example, clustering, data cleansing, and integration. For years, many efforts have been made to improve the performance of such comparison functions. Improvements have been done either at the cost of reducing the accuracy of the comparison, or by compromising certain basic characteristics of the functions, such as the triangular inequality. In this paper, we propose a new distance for sequences of symbols (or strings) called Optimal Symbol Alignment distance (OSA distance, for short). This distance has a very low cost in practice, which makes it a suitable candidate for computing distances in applications with large amounts of (very long) sequences. After providing a mathematical proof that the OSA distance is a real distance, we present some experiments for different scenarios (DNA sequences, record linkage, etc.), showing that the proposed distance outperforms, in terms of execution time and/or accuracy, other well-known comparison functions such as the Edit or Jaro-Winkler distances.

AB - Comparison functions for sequences (of symbols) are important components of many applications, for example, clustering, data cleansing, and integration. For years, many efforts have been made to improve the performance of such comparison functions. Improvements have been done either at the cost of reducing the accuracy of the comparison, or by compromising certain basic characteristics of the functions, such as the triangular inequality. In this paper, we propose a new distance for sequences of symbols (or strings) called Optimal Symbol Alignment distance (OSA distance, for short). This distance has a very low cost in practice, which makes it a suitable candidate for computing distances in applications with large amounts of (very long) sequences. After providing a mathematical proof that the OSA distance is a real distance, we present some experiments for different scenarios (DNA sequences, record linkage, etc.), showing that the proposed distance outperforms, in terms of execution time and/or accuracy, other well-known comparison functions such as the Edit or Jaro-Winkler distances.

KW - Sequences of symbols

KW - string distances

KW - triangular inequality

UR - http://www.scopus.com/inward/record.url?scp=80051708685&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2010.190

DO - 10.1109/TKDE.2010.190

M3 - Article

AN - SCOPUS:80051708685

SN - 1041-4347

VL - 23

SP - 1541

EP - 1554

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

IS - 10

M1 - 5601718

ER -