TY - JOUR
T1 - Kd-trees and the real disclosure risks of large statistical databases
AU - Herranz, Javier
AU - Nin, Jordi
AU - Solé, Marc
N1 - Funding Information:
The authors want to thank the reviewers of Information Fusion for their helpful comments on the first version of this paper. Partial support by the Spanish MEC Ministry (Projects ARES – CONSOLIDER INGENIO 2010 CSD2007-00004 – and eAEGIS – TSI2007-65406-C03-02) to the work of Javier Herranz and Jordi Nin is also acknowledged. Javier Herranz enjoys a Ramón y Cajal grant, partially funded by the European Social Fund (ESF), from Spanish MICINN Ministry. His research is also supported by Project MTM2009-07694 of the same MICINN Ministry. The work of Marc Solé is supported by Project TIN2007-63927 of the Spanish MEC Ministry.
PY - 2012/10
Y1 - 2012/10
N2 - Estimating the disclosure risk of a Statistical Disclosure Control (SDC) protection method by means of (distance-based) record linkage techniques is a popular approach to analyzing the privacy level offered by such a method. When databases are very large, record linkage techniques such as blocking or partitioning are usually applied to make this process reasonably efficient. However, in this case the record linkage process is not exact, which means that the disclosure risk of an SDC protection method may be underestimated. In this paper we propose the use of kd-tree techniques to apply exact yet very efficient record linkage when (protected) datasets are very large. We describe some experiments showing that this approach achieves better results, in terms of both accuracy and running time, than more classical approaches such as record linkage based on a sliding window. We also discuss and experiment on the use of these techniques not to link a whole protected record with its original one, but just to guess the value of some confidential attribute(s) of the record(s). This leads to concepts such as k-neighbor l-diversity and k-neighbor p-sensitivity, generalizations (to any SDC protection method) of l-diversity and p-sensitivity, which have been defined for SDC protection methods ensuring k-anonymity, such as microaggregation.
AB - Estimating the disclosure risk of a Statistical Disclosure Control (SDC) protection method by means of (distance-based) record linkage techniques is a popular approach to analyzing the privacy level offered by such a method. When databases are very large, record linkage techniques such as blocking or partitioning are usually applied to make this process reasonably efficient. However, in this case the record linkage process is not exact, which means that the disclosure risk of an SDC protection method may be underestimated. In this paper we propose the use of kd-tree techniques to apply exact yet very efficient record linkage when (protected) datasets are very large. We describe some experiments showing that this approach achieves better results, in terms of both accuracy and running time, than more classical approaches such as record linkage based on a sliding window. We also discuss and experiment on the use of these techniques not to link a whole protected record with its original one, but just to guess the value of some confidential attribute(s) of the record(s). This leads to concepts such as k-neighbor l-diversity and k-neighbor p-sensitivity, generalizations (to any SDC protection method) of l-diversity and p-sensitivity, which have been defined for SDC protection methods ensuring k-anonymity, such as microaggregation.
KW - Attribute disclosure
KW - Kd-trees
KW - Real disclosure risk
KW - Record linkage
KW - Statistical Disclosure Control
UR - http://www.scopus.com/inward/record.url?scp=84861636225&partnerID=8YFLogxK
U2 - 10.1016/j.inffus.2011.03.001
DO - 10.1016/j.inffus.2011.03.001
M3 - Article
AN - SCOPUS:84861636225
SN - 1566-2535
VL - 13
SP - 260
EP - 273
JO - Information Fusion
JF - Information Fusion
IS - 4
ER -