What Is Pseudonymization? How It Protects Data Privacy

May 31, 2022
by Sagar Joshi

Having access to personal data means companies can tailor products and services to their customers’ needs and interests. But that access comes with great responsibility.

Organizations must maintain data privacy and confidentiality to comply with regulations such as the General Data Protection Regulation (GDPR).

Pseudonymization plays a crucial role in ensuring data protection. Many organizations use data de-identification and pseudonymity software to comply with privacy and data protection laws and reduce their risk of compromising personally identifiable information.

Personal information can be anything related to an identifiable natural person. Among other markers, this includes name, location, and identification number. It can comprise any combination of physical, physiological, social, economic, or psychological characteristics related to a person.

Pseudonymization is a part of the data management and de-identification process. It replaces personally identifiable information (PII) with one or more pseudonyms or artificial identifiers. Companies can restore pseudonymized data to its original state by using additional information that supports the re-identification process.

It’s one way to comply with the European Union’s General Data Protection Regulation (GDPR), which mandates secure storage of personal data. When implemented effectively, pseudonymization can also justify relaxing some of the obligations placed on data controllers.

A risk-based pseudonymization technique considers utility and scalability factors while offering protection. Implementation of risk-based pseudonymization is possible when data controllers and processors have access to information supplied by product owners, service managers or application owners. 

Regulators need to give granular and practical steps to assess risks while promoting risk-based prioritization and its best practices. This allows for data protection at scale and helps companies secure large volumes of personal data.

How does pseudonymization work?

In the pseudonymization process, identifiers such as name, phone number, or email address are mapped to pseudonyms: any arbitrary number, character, or a sequence of both. For example, if two identifiers, A and B, are mapped to pseudonyms PS1 and PS2, the pseudonymization function must ensure that PS1 differs from PS2. Otherwise, recovering identifiers could become ambiguous.

It’s possible to map a single identifier to multiple pseudonyms as long as the actual identifier can be recovered. For each pseudonym, there is an additional secret, also known as a pseudonymization secret, which helps recover the original identifier. A pseudonymization table that maps identifiers to a pseudonym can be a simple example of secret or additional information.
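The mapping described above can be sketched as a small lookup table. This is a minimal illustration, not a production design: the class name `PseudonymTable`, the `PS` prefix, and the example email addresses are all hypothetical, and the table is kept in memory only for brevity.

```python
import secrets


class PseudonymTable:
    """Toy pseudonymization table; the table itself is the pseudonymization secret."""

    def __init__(self):
        # pseudonym -> identifier, used to recover the original value
        self._by_pseudonym = {}

    def pseudonymize(self, identifier: str) -> str:
        # Draw a fresh pseudonym on every call and retry if it is already
        # taken, so two identifiers can never share one pseudonym.
        while True:
            pseudonym = "PS" + secrets.token_hex(4)
            if pseudonym not in self._by_pseudonym:
                self._by_pseudonym[pseudonym] = identifier
                return pseudonym

    def recover(self, pseudonym: str) -> str:
        # Only someone holding the table can perform this step.
        return self._by_pseudonym[pseudonym]


table = PseudonymTable()
ps1 = table.pseudonymize("alice@example.com")
ps2 = table.pseudonymize("alice@example.com")  # same identifier, second pseudonym
ps3 = table.pseudonymize("bob@example.com")
```

Here one identifier receives two pseudonyms, yet each pseudonym still resolves to exactly one identifier, so recovery stays unambiguous.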

Anonymization vs. Pseudonymization

The anonymization process makes data completely unreadable or anonymous: the original data cannot be recovered later. Let’s take a simple example. If you anonymize data such as the name Scott, its output can be XXXXX, which prevents recovery of the actual name from the anonymized data.

Conversely, with the help of additional information or a pseudonymization secret, the pseudonymous data can be transformed into the original identifier.

The anonymization process ensures data privacy but isn’t always practical. In some cases, such as healthcare data, analysts can still draw meaningful conclusions from anonymized data without compromising a patient’s identity.

When anonymization can’t fully ensure data privacy, encryption and other security measures might be necessary. These situations arise when anonymized data is combined with other data sets in ways that make it possible to trace the information to a specific person.

Pseudonymization techniques

Below are some basic pseudonymization techniques that teams can use to protect personal data.

Counter

The counter technique substitutes each identifier with a number chosen by a monotonic counter. It avoids ambiguity by ensuring no repetition in monotonic counter values. This technique is easy to implement for small, simple datasets.

Example of pseudonymization with counter technique

Name      Pseudonym (counter generator)
Fisher    10
Mark      11
Twain     12

Simplicity is an advantage of the counter technique. However, larger and more sophisticated datasets can raise implementation and scalability issues because the entire pseudonymization table must be stored.
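A minimal sketch of the counter technique follows. The function name `make_counter_pseudonymizer` and the starting value of 10 (chosen to match the example table above) are assumptions for illustration.

```python
from itertools import count


def make_counter_pseudonymizer(start: int = 10):
    """Return a pseudonymize function backed by a monotonic counter."""
    counter = count(start)  # yields start, start + 1, start + 2, ...
    table = {}              # identifier -> pseudonym (the stored table)

    def pseudonymize(identifier: str) -> int:
        # A new identifier takes the next counter value; a known one
        # reuses the value already recorded in the table.
        if identifier not in table:
            table[identifier] = next(counter)
        return table[identifier]

    return pseudonymize, table


pseudonymize, table = make_counter_pseudonymizer(start=10)
pseudonyms = [pseudonymize(name) for name in ["Fisher", "Mark", "Twain", "Fisher"]]
# pseudonyms → [10, 11, 12, 10]
```

Because the counter never repeats a value, no two identifiers can collide, but the whole table has to be kept to pseudonymize consistently and to recover identifiers.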

Random number generator (RNG)

A random number generator produces unpredictable values, each with an equal probability of being chosen from the total population. It provides stronger data protection than the counter technique because it’s difficult to pinpoint the actual identifier as long as the pseudonymization table isn’t compromised.

Example of pseudonymization with RNG

Name      Pseudonym (RNG)
Fisher    342
Mark      984
Twain     410

One caveat: RNG can produce collisions. A collision is a scenario where the function assigns the same pseudonym to two different identifiers. Scalability can also be a challenge as you work on larger, more sophisticated datasets since this technique also requires storing the pseudonymization table.
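One way to handle collisions is to redraw whenever a value is already in use. The sketch below assumes a pseudonym space of 0 to 999 to mirror the three-digit values in the example table; the function name `rng_pseudonymize` and the `space` parameter are hypothetical.

```python
import secrets


def rng_pseudonymize(identifiers, space: int = 1000):
    """Assign each distinct identifier a unique random pseudonym in [0, space)."""
    table = {}    # identifier -> pseudonym
    used = set()  # pseudonyms already handed out
    for identifier in identifiers:
        if identifier in table:
            continue
        if len(used) >= space:
            raise ValueError("pseudonym space exhausted")
        # Redraw until the value is unused; this is the collision check.
        while True:
            candidate = secrets.randbelow(space)
            if candidate not in used:
                break
        used.add(candidate)
        table[identifier] = candidate
    return table


table = rng_pseudonymize(["Fisher", "Mark", "Twain"])
```

The redraw loop gets slower as the space fills up, which is one reason the size of the pseudonym space matters for larger datasets.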

Cryptographic hash function

A cryptographic hash function maps input strings of arbitrary length to fixed-length outputs. It is designed so that finding any input that generates a specific output string is computationally infeasible. It is also collision-resistant: collisions exist in principle, but finding two inputs that produce the same output is computationally infeasible. For example, Alice and Fisher, after pseudonymization using a cryptographic hash function, might generate 24fsa35gersw439 and 43ase98shekc021 as pseudonyms.

Although cryptographic hash functions address some challenges of pseudonymization, such as collisions, they’re prone to brute-force and dictionary attacks when the set of possible identifiers is small or guessable.
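The sketch below illustrates both points: hashing an identifier, and why an unkeyed hash is exposed to a dictionary attack. SHA-256 is an assumption here (the article doesn't name a specific hash function), and the helper names and candidate list are hypothetical.

```python
import hashlib


def hash_pseudonym(identifier: str) -> str:
    """Pseudonymize an identifier with an unkeyed hash (SHA-256 assumed)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def dictionary_attack(pseudonym: str, candidates):
    """Hash every candidate and return the one that matches, if any."""
    for candidate in candidates:
        if hash_pseudonym(candidate) == pseudonym:
            return candidate
    return None


leaked = hash_pseudonym("Fisher")
# Anyone can run the same public function over a list of likely names.
guess = dictionary_attack(leaked, ["Alice", "Mark", "Fisher", "Twain"])
# guess → "Fisher"
```

No secret is involved, so an attacker with a plausible list of identifiers needs only to hash each one and compare.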

Message authentication code (MAC)

A message authentication code is similar to a cryptographic hash function, but it uses a secret key to generate a pseudonym. As long as this key isn’t compromised, it’s infeasible to recover the actual identifier from the pseudonym.

MAC is seen as a robust pseudonymization technique. Its variations apply to different scalability and utility requirements of the pseudonymization entity. MAC can be applied in internet-based display advertising where an advertiser can attach a unique pseudonym for each individual without revealing their identities.

You can also apply MAC in separate sub-parts of an identifier and use the same secret key. For example, in the case of XYZ@abc.op and PNR@abc.op, you can assign the same secret key to the domain abc.op and generate the same sub-pseudonym.
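A keyed pseudonym of this kind can be sketched with HMAC from Python's standard library. HMAC-SHA256 is an assumption (the article doesn't specify a MAC construction), the helper names are hypothetical, and the output is truncated to 16 hex characters only to keep the example readable; truncation raises the chance of collisions.

```python
import hashlib
import hmac
import secrets


def mac_pseudonym(value: str, key: bytes) -> str:
    """Keyed pseudonym: HMAC-SHA256 of the value, truncated for display."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def pseudonymize_email(email: str, key: bytes) -> str:
    """Pseudonymize the local part and the domain separately with one key."""
    local, domain = email.rsplit("@", 1)
    return mac_pseudonym(local, key) + "@" + mac_pseudonym(domain, key)


key = secrets.token_bytes(32)  # the pseudonymization secret
a = pseudonymize_email("XYZ@abc.op", key)
b = pseudonymize_email("PNR@abc.op", key)
```

Both results end in the same domain sub-pseudonym, so records can still be grouped by domain, while the local parts differ and can't be recomputed without the key.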

Encryption

Symmetric encryption, especially block ciphers such as Advanced Encryption Standard (AES), encrypts an identifier with a secret key. This key serves as a pseudonymization secret and recovery secret. The block size can be smaller or larger than an identifier in this technique. The method includes padding if the identifier’s size is smaller than the block size. 

On the other hand, if the identifier’s size is bigger than the block size, either the identifier is compressed into a size smaller than the block size, or a mode of operation such as counter mode (CTR) is used. Encryption is a strong pseudonymization technique.
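The padding step can be illustrated without a cipher. The sketch below assumes PKCS#7-style padding, which is one common scheme (the article doesn't say which one is meant), and a 16-byte block, which is the AES block size. It shows only the padding; the encryption itself is out of scope here.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """PKCS#7-style padding: append n bytes, each of value n."""
    n = block_size - (len(data) % block_size)  # always between 1 and block_size
    return data + bytes([n]) * n


def unpad(padded: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Strip the padding after checking that it is well formed."""
    if not padded or len(padded) % block_size != 0:
        raise ValueError("invalid padded length")
    n = padded[-1]
    if n < 1 or n > block_size or padded[-n:] != bytes([n]) * n:
        raise ValueError("invalid padding")
    return padded[:-n]


identifier = b"XYZ@jkl.com"  # 11 bytes, shorter than one block
padded = pad(identifier)     # 16 bytes: the identifier plus five 0x05 bytes
```

An identifier that already fills a whole block still gets a full extra block of padding, so the padding can always be removed unambiguously.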

In cases where the data controller needs to preserve the format without revealing the original identifier, format-preserving encryption (FPE) is used instead of conventional cryptography. For example, during the pseudonymization of XYZ@jkl.com, FPE can produce wqi@abc.kxr, and conventional cryptography can generate hui sa0 2ser @ aqw xde bgt miu cvf erw 56t as pseudonyms.

Below are a few advanced pseudonymization methods used in comparatively complex data sets. 

  • Asymmetric encryption involves two different entities in the pseudonymization process. The public key creates a pseudonym; the private key resolves it to determine the identifier.
  • Hash chains depend on repeatedly hashing the hash value to produce an output that requires multiple inversions to determine the original identifier.
  • Secret sharing schemes split confidential information into multiple parts. These schemes are also known as (k, n) threshold schemes. 
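As an illustration of the hash chain idea from the list above, the sketch below feeds each digest back into the hash function a fixed number of times. SHA-256, the function name `hash_chain`, and the `rounds` parameter are assumptions for illustration.

```python
import hashlib


def hash_chain(identifier: str, rounds: int) -> str:
    """Hash the identifier, then repeatedly hash the resulting digest."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    digest = identifier.encode("utf-8")
    for _ in range(rounds):
        # Each round hashes the previous round's raw digest bytes.
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


one_round = hash_chain("Fisher", 1)
three_rounds = hash_chain("Fisher", 3)
```

Going from `three_rounds` back to the identifier would require inverting the hash three times in a row, which is the property the bullet describes.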

Pseudonymization policies

There are three standard policies of pseudonymization vital to its practical implementation. Let’s consider an identifier A that appears in databases X and Y. After pseudonymization, A gets a pseudonym according to one of the following policies. 

Deterministic pseudonymization

In deterministic pseudonymization, whenever an identifier appears multiple times in different databases, it’s always replaced with the same pseudonym. For example, if A appears in both X and Y databases, it would be replaced with a pseudonym PS.

During this policy's implementation, all unique identifiers are replaced with their corresponding pseudonyms.

Document-randomized pseudonymization

Document-randomized pseudonymization substitutes multiple instances of an identifier with different pseudonyms. For example, if an identifier A appears two times in a database, it will get replaced with pseudonyms PS1 and PS2 for their respective occurrences. However, the pseudonymization is consistent between different databases in this policy.

Implementation of document-randomized pseudonymization requires a list of all identifiers and treats all occurrences independently.

Fully-randomized pseudonymization

Fully-randomized pseudonymization replaces multiple instances of an identifier with different pseudonyms whenever it occurs in any database. When working on a single database, it’s similar to document-randomized pseudonymization. However, if datasets are pseudonymized two times using fully-randomized pseudonymization, the output would be different from that of document-randomized pseudonymization. The latter would generate the same outcome twice.

The randomness is selective in the case of document-randomized pseudonymization, whereas it’s global for fully-randomized pseudonymization.
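The three policies can be compared side by side in a short sketch. This is one possible reading, not a reference implementation: it assumes HMAC-SHA256 as the underlying technique, makes the document-randomized pseudonym depend on the identifier plus its occurrence number within a database, and uses fresh random values for the fully-randomized case. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets
from collections import Counter


def _mac(key: bytes, text: str) -> str:
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def deterministic(db, key):
    # Same identifier -> same pseudonym, in every database.
    return [_mac(key, ident) for ident in db]


def document_randomized(db, key):
    # Each occurrence within a database gets its own pseudonym, but the
    # result is repeatable across databases and across runs.
    seen = Counter()
    out = []
    for ident in db:
        seen[ident] += 1
        out.append(_mac(key, f"{ident}#{seen[ident]}"))
    return out


def fully_randomized(db):
    # Every occurrence gets a fresh random pseudonym on every run.
    # (Recovery would need a stored table, omitted here.)
    return [secrets.token_hex(8) for _ in db]


key = secrets.token_bytes(32)
db_x = ["A", "B", "A"]
db_y = ["A"]

det_x = deterministic(db_x, key)
doc_x = document_randomized(db_x, key)
full_first = fully_randomized(db_x)
full_second = fully_randomized(db_x)
```

In this sketch, `det_x` repeats a pseudonym for the two occurrences of A, `doc_x` does not but comes out identical if the function is run again, and the two fully-randomized runs differ from each other.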

How to choose a pseudonymization technique and policy

While selecting a pseudonymization technique, you need to determine the data protection level and utility requirements you wish to achieve after implementation. RNG, encryption, and message authentication code are vital to ensure robust data protection. However, you might choose a combination or variation of techniques mentioned above based on utility requirements.

Similarly, your choice of pseudonymization policy depends on the data protection level you need and on whether you must compare different databases. For example, fully-randomized pseudonymization offers the best data protection level but might not be suitable if you wish to compare different databases. Document-randomized and deterministic pseudonymization offer more utility but allow records about the same data subject to be linked.

Complexity and scalability also play a significant role in your choice. Except for some encryption variations, most techniques apply to identifiers of varying sizes. Since hash functions, random number generators, and message authentication codes can produce collisions, you need to choose the size of a pseudonym carefully.
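To get a feel for how pseudonym size affects collisions, the standard birthday approximation can help: for n identifiers and N possible pseudonyms, the probability of at least one collision is roughly 1 - exp(-n(n - 1) / (2N)). The sketch below assumes pseudonyms are drawn uniformly at random from a space of 2^bits values; the function name is hypothetical.

```python
import math


def collision_probability(num_identifiers: int, pseudonym_bits: int) -> float:
    """Birthday approximation of the chance that any two pseudonyms collide."""
    space = 2 ** pseudonym_bits
    exponent = num_identifiers * (num_identifiers - 1) / (2 * space)
    # -expm1(-x) computes 1 - exp(-x) accurately even for very small x.
    return -math.expm1(-exponent)


p_32 = collision_probability(10**6, 32)  # one million identifiers, 32-bit pseudonyms
p_64 = collision_probability(10**6, 64)  # same dataset, 64-bit pseudonyms
```

For one million identifiers, a 32-bit pseudonym makes a collision close to certain under this approximation, while a 64-bit pseudonym makes it very unlikely.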

Pseudonymization use cases

A combination of different pseudonymization approaches can offer unique advantages in real-world applications. Below are industries that commonly implement pseudonymization.

Healthcare

Pseudonymization protects sensitive data in medical records against accidental or intentional access by any unauthorized party. Medical records contain substantial data regarding a patient's medical condition, diagnosis, financial aspect, and medical history. Doctors can use these records to assess a patient's medical condition and provide treatment. 

On the other hand, insurance companies can use financial data. Similarly, research agencies can leverage medical records to access binary information such as whether a patient was treated. 

All the scenarios mentioned above suggest that any party would access information that’s relevant to them. But medical records hold detailed information about all aspects of a patient’s healthcare. Pseudonymization plays a vital role here and prevents parties from accessing data that isn’t relevant to their purpose. 

For example, research institutions need access to symptoms, duration, and treatment data to perform statistical modeling and analysis. Pseudonymization lets medical institutions provide this data to researchers so that it can’t be tied to any patient.

Medical institutions can use pseudonymization to protect patients’ privacy while processing medical data. It helps comply with standard regulations in healthcare and protect patients' data against unauthorized access. 

Cybersecurity

Modern cybersecurity technologies no longer depend on static or signature-based protection. Instead, they correlate suspicious events that reveal the existence of advanced threats and train machine learning systems to detect them. These technologies also focus on building behavioral threat models and establishing reputation-based protection.

These technologies process personal data to provide security analytics, and pseudonymization plays a vital role in protecting sensitive information. With the web growing exponentially, it becomes increasingly challenging to track and block bad domains, URLs or bad actors. Modern security systems use behavioral analytics and train their systems after correlating field-collected data known as security telemetry. These telemetry analytics do not require user identification, and any data related to actual use can be pseudonymized to ensure privacy.

Many machine learning systems leverage the “wisdom of the crowd” to understand the behavior of a vast population, like downloaded files and URLs. Reputation systems assign a reputation score based on the collected telemetry. These models succeed when large samples of both benign and malicious data are analyzed, helping models understand the distinction between both. Correlating such data wouldn’t require user identification of benign users, but at some point would need to identify malicious users. 

Pseudonymization helps protect sensitive user information in such scenarios while it’s sent to the pseudonymization entity for analysis. Organizations use pseudonymization tools, also known as data de-identification and pseudonymity software, to remove any correlation with an actual human identity.

Top 5 data de-identification and pseudonymity software

Data de-identification and pseudonymity software substitutes confidential information in datasets with artificial identifiers or pseudonyms. These tools help companies pseudonymize (or tokenize) sensitive data, minimize the risk of storing personal information, and comply with data privacy and protection standards.

To qualify for inclusion in the data de-identification and pseudonymity software category, a product must:

  • Substitute personal data with pseudonyms
  • Protect data against re-identification
  • Match GDPR standards for pseudonymization under the Data Protection by Design and by Default requirements
  • Meet the California Consumer Privacy Act’s (CCPA) requirements

This data was pulled from G2 on May 12, 2022. Some reviews may be edited for clarity.

1. VGS Platform

Very Good Security (VGS) Platform offers a faster way to achieve business outcomes through a zero-data approach that decouples the business value of sensitive data from the related security and compliance risks. It helps customers achieve compliance sixteen times faster, speeds up the audit process by 70%, improves customer experience, and reduces costs while supporting constant innovation.

What users like:

“It took me so little time to understand how VGS works and change our workflow to be proxied through VGS. Support has always been a great experience, especially via chat.”

- VGS Platform Review, Vu K.

What users dislike:

“Advanced use cases can be complex, especially in the secure file transfer protocol (SFTP) filtering space. It seems like the SFTP product is not as mature as the HTTP proxy, which makes sense since I think most use of VGS will be in the HTTP proxy.

The dashboard is friendly for onboarding new users, but eventually, they become challenging to manage. It would be nice if I could edit a filter's YAML directly in the dashboard instead of having to export/re-import the whole YAML.”

- VGS Platform Review, Leejay H.

2. Cloud Compliance for Salesforce

Cloud Compliance for Salesforce provides teams and leadership with complete data security and compliance with privacy laws (GDPR, CCPA), industry regulations (Health Insurance Portability and Accountability Act, Payment Card Industry security standards), and InfoSec policies. It helps companies mitigate the risk of non-compliance with a standardized and error-free solution.

What users have said:

“It stays up to date with the latest details in compliance measures like GDPR etc. It also has a fast click-based UI that minimizes the time to set up.

Data retention policies could be customized to suit the specific needs if one needs to keep historical data for an extended time period.”

- Cloud Compliance for Salesforce Review, Nitin S.

*As of May 23, 2022, Cloud Compliance for Salesforce had one review on G2.

3. D-ID

D-ID’s identity protection makes organizations’ photos and videos unrecognizable to facial recognition tools. It safeguards facial biometric data and prevents any bad actor from using pictures and videos to access any information.

What users have said:

“Ease of use is the main thing for me. I would buy it all over again. I liked the whole app, no complaints.”

- D-ID Review, Billy A.

*As of May 23, 2022, D-ID had one review on G2.

4. Immuta

Immuta provides unified data access to analytical datasets in the cloud to engineering and operations teams. It speeds time to data, facilitates secure data sharing with more users, and mitigates data breaches and leaks.

What users like:

“Immuta is a cloud data access control platform that is adaptive & scalable based on the dynamic nature of our data sources. It provisions all source-target integration seamlessly so that we can facilitate data transition from our on-premise to cloud infrastructure.

Since it’s an automated platform hosted in the cloud, we save a lot of time as it doesn't require any job parsing or agent installations. Essential datasets are registered accurately in its catalog, and we can also enable custom preferences while performing data analysis.”

- Immuta Review, Nikitha S.

What users dislike:

“Whenever I have to add a new table from a data warehouse, which is already known to Immuta, I have to type the connection details again and again (host, username, etc.)”

- Immuta Review, Igor C.

5. Informatica Dynamic Data Masking

Informatica Dynamic Data Masking prevents unauthorized users from accessing sensitive information with real-time de-identification and de-sensitization. It safeguards personal and sensitive information while supporting offshoring, outsourcing, and cloud-based initiatives.

What users have said: 

“Informatica DDM gives the convenience and reliability of having data protection with its extensive DDM feature. It covers the security aspect of unauthorized access and prevents data corruption throughout its lifecycle. Its end-user privacy compliance includes various key elements such as data encryption, hashing, tokenization, etc.

Informatica DDM is great for data governance, integrity, and security considerations. It's suitable from my organization's standpoint, and I like the product.”

- Informatica Dynamic Data Masking Review, Sabapathi G.

*As of May 23, 2022, Informatica Dynamic Data Masking had one review on G2.

Prove compliance through reliance

Choose a data de-identification and pseudonymity software that best fits your data protection needs and rely on it to prove compliance. With software, you can derive value from datasets without compromising the privacy of the data subjects in a given dataset.

If you need to use an alternative version of datasets for demos or training purposes while ensuring the protection of sensitive data, data masking can better support your requirements.

Learn more about data masking and how it facilitates secure data sharing.

Sagar Joshi

Sagar Joshi is a former content marketing specialist at G2 in India. He is an engineer with a keen interest in data analytics and cybersecurity. He writes about topics related to them. You can find him reading books, learning a new language, or playing pool in his free time.