Data anonymization protects privacy by encoding or deleting private or sensitive information within a database. This process safeguards personal data against unauthorized use or disclosure via cyber-attacks or other security breaches.
A single data breach can impact the personal data of hundreds to millions of people. For example, a Michigan-based bank experienced a data breach in June 2022 that compromised the social security numbers of 1.5 million customers. In that same month, a Massachusetts healthcare group also reported a breach in which hackers stole records containing names, social security numbers, and other sensitive information of up to 2 million people.
If a dataset contains private or sensitive information, anonymization techniques render such information anonymous so that it cannot be linked back to individuals. Data anonymization is also known as data masking or obfuscation.
Anonymized data can be used in a variety of applications, such as test data for quality assurance (QA), development, and training purposes outside the production environment. It also preserves private or confidential information within datasets that are stored or shared with third parties.
There are many types of sensitive information that need protection:
Personally identifiable information (PII). Any data that could be linked to a specific person is considered PII. Examples include full names, addresses, social security numbers, fingerprints, and dates of birth.
Protected health information (PHI). When used as research data, PHI must be anonymized before it can be released. Medical histories, insurance information, and lab results are just a few examples of PHI.
Payment card information (PCI). The Payment Card Industry Data Security Standard (PCI DSS) requires organizations to protect cardholder data from unauthorized exposure. Examples of PCI data include the cardholder’s name, PIN, and Primary Account Number (PAN).
Intellectual property (IP). IP refers to creations of the mind, such as inventions, designs, and creative works. Trademarks, patents, and copyrights are all examples of intellectual property.
For example, a dataset describing the buying habits of shoppers based on age range does not need to include participants’ names or exact ages. Likewise, PII is not necessary to train effective AI systems, so it should be scrubbed from the data to avoid unintended disclosures.
In this article, we will explore various data anonymization techniques and best practices. We will also look at what differentiates data de-identification from data anonymization for business security and how tools like WinZip® Enterprise help protect enterprise data with data anonymization.
Data anonymization techniques
There are several ways to remove identifiable information from a dataset. The technique that will work best depends on the associated use case.
For example, some data anonymization techniques are best suited to test data management, while others are appropriate when sharing data with third parties.
Enterprise organizations create, share, and store a large volume of diverse data, so it is impractical to use a single anonymization technique across all datasets.
There are two primary approaches to anonymization: randomization and generalization. Randomization techniques alter the dataset’s attributes to break the link between the data and the individual. Generalization techniques dilute individual attributes by modifying their scale or order of magnitude—for example, replacing an exact age with an age range.
Here are the various ways you can anonymize data through randomization and generalization:
Substitution. Data substitution masks the original information by replacing it with another value. For example, you can mask customer names with a random lookup file that preserves the data’s original look and feel.
Shuffling. The shuffling method randomly shuffles data within an attribute or set of attributes. For example, you could shuffle employee names across multiple employee records to eliminate the links between data columns and hide personal information.
Number and date variance. The number and date variance technique randomizes each value in a column so that it cannot be traced back to its original form. For example, you can apply a variance of +/- 10% to monthly sales figures or employee salaries.
Scrambling. Scrambling characters and numbers hides the original content and protects personally identifiable data. For example, you can scramble account numbers to maintain the appearance of accurate data, such as changing #85241 to #42815.
Masking out. When sharing data with users who are not authorized to see the full values, businesses mask out parts of the original data with random characters or other data. Masking credit card numbers so that only the last four digits are visible is a common example of this technique.
Nulling out. A null value can be used to replace sensitive information, ensuring that unauthorized users cannot see actual data. For example, nulling out middle names in a dataset reduces the risk of individual identification.
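As a rough illustration, the randomization and generalization techniques above can be sketched in Python. All records, lookup values, and function names below are hypothetical, and the standard `random` module is used for simplicity rather than a cryptographically secure source; a production tool would also need to handle formats, referential integrity, and repeatability.

```python
import random

# Hypothetical customer records used only for demonstration.
records = [
    {"name": "Alice Smith", "card": "4111111111111111", "salary": 52000, "middle": "May"},
    {"name": "Bob Jones",   "card": "5500005555555559", "salary": 48000, "middle": "Lee"},
]

SUBSTITUTES = ["Customer A", "Customer B"]  # lookup values for substitution

def substitute_names(rows):
    """Substitution: replace each name with a value from a lookup list."""
    for row, alias in zip(rows, SUBSTITUTES):
        row["name"] = alias

def shuffle_column(rows, key):
    """Shuffling: randomly reorder one attribute across all records."""
    values = [r[key] for r in rows]
    random.shuffle(values)
    for r, v in zip(rows, values):
        r[key] = v

def vary_number(value, pct=0.10):
    """Number variance: apply a random change of up to +/- pct."""
    return round(value * (1 + random.uniform(-pct, pct)))

def scramble_digits(value):
    """Scrambling: reorder the characters of an identifier."""
    chars = list(value)
    random.shuffle(chars)
    return "".join(chars)

def mask_card(card):
    """Masking out: show only the last four digits."""
    return "*" * (len(card) - 4) + card[-4:]

def null_out(rows, key):
    """Nulling out: replace a sensitive attribute with None."""
    for r in rows:
        r[key] = None

def generalize_age(age, width=10):
    """Generalization: replace an exact age with an age range."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

substitute_names(records)
shuffle_column(records, "salary")
null_out(records, "middle")
for r in records:
    r["card"] = mask_card(r["card"])

print(records[0]["card"])  # ************1111
```

Note that substitution and masking preserve the look and feel of the original data, while variance and shuffling preserve aggregate statistics—which technique fits depends on what the downstream use case needs to remain realistic.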
Anonymized data vs. de-identification
Data de-identification is the process of removing identifying information from a dataset. Anonymized data is free of any identifiable information as well as all quasi-identifiable information that, if combined with other data, could be used to re-identify an individual.
While both de-identification and anonymization remove direct and quasi-identifiers, de-identified data can be reconnected to the original information. With anonymized data, however, there is no way to link it back to identifiable information.
It is important to understand the subtle differences between de-identification and anonymization for compliance with various data privacy regulations. For example, the General Data Protection Regulation (GDPR) has three criteria for anonymization techniques.
Individualization. Any data that can provide context to single out an individual within a dataset must be anonymized. For example, if a dataset contains the height of various individuals and only one person is 4’10”, that individual is singled out because it is a unique value.
Correlation. Linking quasi-identifiers from separate sources makes it easy for bad actors to identify an individual. For example, demographic studies suggest that around 87% of the US population is identifiable using just three attributes—gender, date of birth, and ZIP code.
Inference. Inference is the ability to guess or estimate the value of an attribute using other available information. For example, a dataset with statistics on levels of seniority and salaries within a department does not directly identify individuals, but inferences can be drawn between the two pieces of information, allowing an individual to be identified.
An appropriate anonymization solution should prevent the individualization, correlation, and inference of data that would allow an individual to be traced within the dataset.
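The individualization criterion in particular lends itself to a simple automated check: flag any combination of quasi-identifiers that appears only once in the dataset. The sketch below uses invented records and a hypothetical `singled_out` helper; real tools generalize this into k-anonymity checks.

```python
from collections import Counter

# Hypothetical records of quasi-identifiers: (gender, birth year, ZIP code).
rows = [
    ("F", 1980, "48104"),
    ("F", 1980, "48104"),
    ("M", 1975, "02139"),  # unique combination -> singled out
]

def singled_out(rows):
    """Return quasi-identifier combinations that appear exactly once.

    A combination seen only once singles out one individual (the
    individualization criterion); such rows should be generalized or
    suppressed before the dataset is released.
    """
    counts = Counter(rows)
    return [combo for combo, n in counts.items() if n == 1]

print(singled_out(rows))  # [('M', 1975, '02139')]
```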
GDPR requirements for anonymization are stricter than similar data privacy provisions. For example, the California Consumer Privacy Act (CCPA) requires companies to make reasonable efforts to remove identifying data, while GDPR requires that identifying information be irreversibly removed so the data can no longer be linked to an individual.
The Health Insurance Portability and Accountability Act (HIPAA) governs how health data can be stored, used, and shared. Before health data can be used in research and assessments, it must be de-identified to reduce privacy risks to individuals. HIPAA-compliant de-identification does not eliminate the risk of individual identification entirely, which means it is not the same as data anonymization.
Best practices to keep data safe
With both the number and type of cyberattacks increasing, it is more important than ever to protect private, sensitive, and confidential information. In 2021, the average total cost of a data breach was $4.24 million, a 10% increase from 2020. Personal and sensitive information was included in 44% of breaches, costing companies an average of $180 per lost or stolen record.
Protecting private information starts by assessing and classifying all your organizational data. Classify data according to its sensitivity so that you can apply the most appropriate data anonymization technique.
Data security is never a one-and-done process. It is important to regularly re-evaluate databases to identify new risks and assess the performance of the controls, policies, and procedures meant to protect data.
Also consider whether a linkage attack could connect any other datasets with the anonymized data. For example, an anonymized dataset containing the gender, date of birth, and postal code of individuals could be cross-referenced with a public voter registry which contains the same information but also includes the individuals’ names.
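A linkage attack like this takes only a few lines of code to mount. The sketch below joins an "anonymized" dataset to a public registry on shared quasi-identifiers; both datasets and all names in them are invented for illustration.

```python
# Hypothetical "anonymized" release: no names, but quasi-identifiers remain.
anonymized = [
    {"gender": "F", "dob": "1980-03-02", "zip": "48104", "diagnosis": "asthma"},
    {"gender": "M", "dob": "1990-07-15", "zip": "10001", "diagnosis": "flu"},
]

# Hypothetical public voter registry containing the same quasi-identifiers
# plus names.
voter_registry = [
    {"name": "Jane Doe", "gender": "F", "dob": "1980-03-02", "zip": "48104"},
]

QUASI = ("gender", "dob", "zip")

def link(anon, public):
    """Re-identify anonymized rows whose quasi-identifiers match a public record."""
    index = {tuple(p[k] for k in QUASI): p["name"] for p in public}
    matches = []
    for row in anon:
        key = tuple(row[k] for k in QUASI)
        if key in index:
            matches.append({**row, "name": index[key]})
    return matches

print(link(anonymized, voter_registry))
```

In this toy example the first record is re-identified as "Jane Doe" along with her diagnosis, which is exactly why quasi-identifiers must be generalized or suppressed, not merely stripped of direct names.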
How WinZip Enterprise helps you keep data anonymous
Data anonymization and encryption are effective, powerful methods of protecting sensitive data from unauthorized access. Encryption transforms data into a coded format using encryption algorithms. The data is unintelligible unless the end user has the cryptographic key needed to decrypt the information.
While data anonymization protects datasets that are in active use, encryption protects data in transit and at rest. By leveraging both techniques, organizations can comprehensively protect personal and sensitive information against unauthorized access and use.
WinZip Enterprise is a comprehensive solution for safeguarding critical data. It features a complete set of enterprise-grade tools for unsurpassed protection everywhere your data resides.
Centralized IT controls make it easy to customize WinZip Enterprise to your specific needs, such as removing unnecessary features and setting and enforcing company-wide security policies.