TY - JOUR

T1 - Optimal Symbol Alignment Distance

T2 - A New Distance for Sequences of Symbols

AU - Herranz, Javier

AU - Nin, Jordi

AU - Sole, Marc

PY - 1991/12

Y1 - 1991/12

N2 - Comparison functions for sequences (of symbols) are important components of many applications, for example clustering, data cleansing and integration. For years, many efforts have been made to improve the performance of such comparison functions. Improvements have been done either at the cost of reducing the accuracy of the comparison, or by compromising certain basic characteristics of the functions, such as the triangular inequality. In this paper, we propose a new distance for sequences of symbols (or strings) called Optimal Symbol Alignment distance (OSA distance, for short). This distance has a very low cost in practice, which makes it a suitable candidate for computing distances in applications with large amounts of (very long) sequences. After providing a mathematical proof that the OSA distance is a real distance, we present some experiments for different scenarios (DNA sequences, record linkage, …), showing that the proposed distance outperforms, in terms of execution time and/or accuracy, other welLKnown comparison functions such as the Edit or Jaro-Winkler distances.

AB - Comparison functions for sequences (of symbols) are important components of many applications, for example clustering, data cleansing and integration. For years, many efforts have been made to improve the performance of such comparison functions. Improvements have been done either at the cost of reducing the accuracy of the comparison, or by compromising certain basic characteristics of the functions, such as the triangular inequality. In this paper, we propose a new distance for sequences of symbols (or strings) called Optimal Symbol Alignment distance (OSA distance, for short). This distance has a very low cost in practice, which makes it a suitable candidate for computing distances in applications with large amounts of (very long) sequences. After providing a mathematical proof that the OSA distance is a real distance, we present some experiments for different scenarios (DNA sequences, record linkage, …), showing that the proposed distance outperforms, in terms of execution time and/or accuracy, other welLKnown comparison functions such as the Edit or Jaro-Winkler distances.

UR - http://www.scopus.com/inward/record.url?scp=84941528434&partnerID=8YFLogxK

U2 - 10.1109/26.120163

DO - 10.1109/26.120163

M3 - Article

AN - SCOPUS:84941528434

SN - 0090-6778

VL - 39

SP - 1762

EP - 1775

JO - IEEE Transactions on Communications

JF - IEEE Transactions on Communications

IS - 12

ER -