Using entropy as a measure of privacy loss in statistical databases
Although the Internet is a vast source of information for individuals, it is also a major source of information about individuals. Data collection through surveys, registration pages, and user forms has made more personal information available than ever before. Organizations such as the Census Bureau, insurance companies, hospitals, and universities maintain databases containing valuable information, raising concerns about individual privacy. The privacy problem in such databases is to protect information specific to an individual while still releasing aggregate data for research purposes. Although there are several approaches to privacy preservation, definitions of privacy loss are either missing or tailored to a particular approach. In order to compare different approaches, we need a definition of privacy that not only determines whether privacy loss occurs, but also measures its amount. In this thesis, we propose a definition of privacy based on the concept of entropy from information theory. Entropy measures the amount of information in a signal in terms of probabilities. We apply this notion of information to the records in a database: the privacy loss caused by a statistical release is defined as the difference between the entropy of a record before and after the release. We argue that this notion of privacy loss is intuitive. In particular, we quantify the privacy loss incurred by releasing the average of a randomly generated one-dimensional database as the size of the database increases.
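The entropy-difference definition can be illustrated with a small sketch. The toy model below is an assumption for illustration, not the thesis's exact setup: a database of n independent, uniformly random binary records, from which the average is released. The attacker's uncertainty about one record is 1 bit beforehand; the expected posterior entropy after seeing the average is computed by exhaustive enumeration, and the privacy loss is the difference.

```python
import itertools
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def privacy_loss_from_average(n):
    """Expected drop in entropy about one binary record after the
    database average is released (i.i.d. uniform 0/1 records)."""
    # Prior: record 0 is uniform on {0, 1}, i.e. 1 bit of entropy.
    prior = entropy([0.5, 0.5])

    # Group all equally likely databases by the released average,
    # tracking the value of record 0 in each group.
    groups = {}
    for db in itertools.product([0, 1], repeat=n):
        avg = sum(db) / n
        groups.setdefault(avg, []).append(db[0])

    total = 2 ** n
    expected_posterior = 0.0
    for first_bits in groups.values():
        p_release = len(first_bits) / total
        p_one = sum(first_bits) / len(first_bits)
        expected_posterior += p_release * entropy([p_one, 1 - p_one])

    # Privacy loss = entropy before release minus expected entropy after.
    return prior - expected_posterior
```

For n = 1 the release reveals the record exactly (loss of 1 bit); as n grows, the average discloses less about any single record, so the loss shrinks — matching the behavior the abstract describes.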
Chirayath, Vinod, "Using entropy as a measure of privacy loss in statistical databases" (2004). ETD Collection for University of Texas, El Paso. AAI1423717.