TY - GEN
T1 - Parallelizing record linkage for disclosure risk assessment
AU - Guisado-Gámez, Joan
AU - Prat-Pérez, Arnau
AU - Nin, J.
AU - Muntés-Mulero, Victor
AU - Larriba-Pey, Josep Ll
PY - 2008
Y1 - 2008
N2 - Handling very large volumes of confidential data is becoming common practice in many organizations, such as statistical agencies. This calls for protection methods that have to be validated in terms of the quality they provide. With Record Linkage (RL) it is possible to compute the disclosure risk, which gives a measure of the quality of a data protection method. However, the RL methods proposed in the literature are computationally costly, which poses difficulties when frequent RL processes have to be executed on large data sets. Here, we propose a distributed computing technique to improve the performance of an RL process. We show that our technique not only significantly reduces the computing time of an RL process, but is also scalable in a distributed environment. We also show that distributed computation can be complemented with SMP-based parallelization in each node, increasing the final speedup.
AB - Handling very large volumes of confidential data is becoming common practice in many organizations, such as statistical agencies. This calls for protection methods that have to be validated in terms of the quality they provide. With Record Linkage (RL) it is possible to compute the disclosure risk, which gives a measure of the quality of a data protection method. However, the RL methods proposed in the literature are computationally costly, which poses difficulties when frequent RL processes have to be executed on large data sets. Here, we propose a distributed computing technique to improve the performance of an RL process. We show that our technique not only significantly reduces the computing time of an RL process, but is also scalable in a distributed environment. We also show that distributed computation can be complemented with SMP-based parallelization in each node, increasing the final speedup.
KW - Disclosure risk evaluation
KW - Distributed computing
KW - Parallel computing
KW - Record linkage
UR - http://www.scopus.com/inward/record.url?scp=56849107086&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-87471-3_16
DO - 10.1007/978-3-540-87471-3_16
M3 - Conference contribution
AN - SCOPUS:56849107086
SN - 3540874704
SN - 9783540874706
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 190
EP - 202
BT - Privacy in Statistical Databases - UNESCO Chair in Data Privacy International Conference, PSD 2008, Proceedings
PB - Springer Verlag
T2 - International Conference on Privacy in Statistical Databases, PSD 2008
Y2 - 24 September 2008 through 26 September 2008
ER -