The Rise of Shared-Data Honeypots: How to Avoid Them and Protect Your Customers’ Data

One thing we know about today’s cyber attackers is that, for the right target, they can be patient. Very patient.

According to FireEye’s M-Trends 2018 report, attackers were present on breached networks for a median period of 101 days before being discovered.[1] And these attackers may well have spent months or even years working toward the initial breach.

Many of these breaches were caused by advanced persistent threats (APTs) and other highly targeted attacks. They’re typically carried out by highly organized teams, who use a variety of tools and techniques to gain high-level access to systems.

These sophisticated attacks require advanced protection and response measures. But one fundamental change that organizations can make is to avoid being a target in the first place by not offering a ‘honeypot’ of data to tempt hackers.

Anywhere that valuable or sensitive data is centralized—whether on-premises, in a data center, or in the cloud—is a potentially attractive target for attackers.

This applies even more so when two or more organizations share their data. As we’ve explained elsewhere, there is a new business imperative to share data with other organizations so companies can gain new insights and innovate.

The problem is that most data-sharing solutions require data to be centralized before it is exchanged. And often that data includes customers’ personally identifiable information (PII), such as names, contact details, and dates of birth.

Any breach of PII can have devastating effects on businesses. They can fall foul of tough new privacy laws such as the European Union’s General Data Protection Regulation (GDPR), suffer financial losses, and lose the trust of customers.

Flawed Data-Sharing Solutions

Sure, there’s a range of technologies available that can help organizations protect PII when sharing data. But each of these technologies has its pitfalls. For example, many need to be applied in bespoke do-it-yourself solutions, which can be complex and time-consuming to manage.

By contrast, third-party data-matching services can make sharing data considerably easier. Each participating organization uploads its dataset to the service, which provides tools to de‑identify PII and match data. Better services also offer bank-grade data protection.

However, these services are like most other data-sharing solutions in that they generally require datasets to be centralized. So that data is at risk of becoming a honeypot, even if advanced security techniques are used.

Senate Matching’s Decentralized Solution

Data Republic’s Senate Matching technology is designed to mitigate this risk by using a decentralized architecture rather than storing data all in one place. It also employs a combination of data-protection techniques, such as:

  • Anonymizing datasets by randomly assigning each record a token that has no correlation with the original PII
  • Using a cryptographic technique called hashing to mask PII fields
  • Adding a random alphanumeric string to each PII field (known as ‘salting’) to make hashing even more secure
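
As a rough illustration of how these three techniques fit together, here is a minimal Python sketch. It is not Senate Matching’s actual code: the function name `protect_records`, the use of SHA-256, the lowercasing of values, and the idea of a single salt shared by the matching parties are all assumptions made for the example.

```python
import hashlib
import secrets


def protect_records(records, pii_fields, salt):
    """Replace PII with a random token and salted SHA-256 hashes.

    Assumption: `salt` is shared by the organizations taking part in a
    match, so that equal PII values produce equal hashes on both sides.
    """
    protected = []
    for record in records:
        # The token is drawn at random, so it carries no information about the PII.
        token = secrets.token_hex(16)
        hashes = {}
        for field in pii_fields:
            # Normalizing case and whitespace is an assumption of this sketch.
            value = record[field].strip().lower()
            hashes[field] = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
        protected.append({"token": token, "hashes": hashes})
    return protected


salt = "x7Qp2LmZ"  # hypothetical salt value
rows = protect_records(
    [{"name": "Jane Citizen", "email": "jane@example.com"}],
    pii_fields=["name", "email"],
    salt=salt,
)
```

Each entry in `rows` holds only a random token and hashed fields, so the raw name and email address never appear in the output.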

In addition, Senate Matching integrates with Data Republic’s Senate data-exchange platform, which offers several data-governance controls, such as data auditing and user access permissions.

However, a key differentiator in Senate Matching’s approach is its distributed architecture, which features three types of virtual machines, or ‘nodes.’ These nodes are segregated and protected using best-practice security measures.

1. The Contributor Node

To share a dataset, the contributing organization first uploads it to a Contributor Node. This virtual machine runs inside the organization’s IT environment and outside the Senate platform. No other parties (including Data Republic) can access the data inside this node.

Salting, hashing, and tokenization occur when the data is uploaded to the Contributor Node. But Senate Matching then goes further by dividing each hashed PII field into several ‘slices.’
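
One simple way to picture slicing is to cut the hash into equal, ordered pieces. The sketch below does exactly that; the function name `slice_hash`, the choice of four slices, and the equal-length split are assumptions for illustration, since the way Senate Matching actually divides a hash is not described here.

```python
import hashlib


def slice_hash(hex_digest, num_slices=4):
    """Split a hex digest into `num_slices` equal, ordered pieces."""
    if num_slices < 1 or len(hex_digest) % num_slices != 0:
        raise ValueError("digest length must be divisible by num_slices")
    size = len(hex_digest) // num_slices
    return [hex_digest[i * size:(i + 1) * size] for i in range(num_slices)]


# Hypothetical salt and PII value, hashed before slicing.
digest = hashlib.sha256(b"example-salt" + b"jane@example.com").hexdigest()
slices = slice_hash(digest)  # four 16-character pieces of a 64-character digest
```

No single slice is enough to recover the full hash, which is what makes it useful to hold the pieces in separate places.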

2. Matcher Nodes

These slices are distributed to a network of Matcher Nodes on Data Republic’s platform.

Each Matcher Node can receive multiple slices from the same dataset, but never from the same field. Because the hashed PII is distributed across the network of nodes, no one (not even Data Republic) can reconstruct the original customer data, even if they know the values used for salting.
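
The distribution rule can be sketched as follows. This is an assumed scheme, not the documented one: slice number N of every field goes to Matcher Node number N, which guarantees that a node never holds two slices of the same field. The function `assign_slices` and the node names are hypothetical.

```python
from collections import defaultdict


def assign_slices(token, field_slices, matcher_nodes):
    """Map each slice of each field to a Matcher Node by slice position.

    field_slices: {field_name: [slice_0, slice_1, ...]}
    Returns {node_id: [(token, field_name, position, slice_value), ...]}.
    """
    assignments = defaultdict(list)
    for field, slices in field_slices.items():
        if len(slices) > len(matcher_nodes):
            raise ValueError("need at least one Matcher Node per slice")
        for position, piece in enumerate(slices):
            # Slices of one field always land on different nodes.
            node = matcher_nodes[position]
            assignments[node].append((token, field, position, piece))
    return dict(assignments)


plan = assign_slices(
    token="t-001",
    field_slices={"name": ["aa", "bb", "cc"], "email": ["dd", "ee", "ff"]},
    matcher_nodes=["matcher-1", "matcher-2", "matcher-3"],
)
```

In this example `matcher-1` holds the first slice of the name hash and the first slice of the email hash, but never two slices of either field.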

3. The Aggregator Node

An Aggregator Node is created for each data match. This virtual machine includes the list of Matcher Nodes and their network addresses. The Aggregator Node connects to the Matcher Nodes using encrypted HTTPS and a private certificate. It also sends randomly generated globally unique identifiers (GUIDs) for the datasets to be matched.

During the matching process, each Matcher Node compares its slices and sends a list of potentially matching (encrypted) tokens to the Aggregator Node. These lists include a number of false positives by design, making reidentification even more difficult for potential hackers.
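
To make the comparison step concrete, here is a simplified sketch of what a single Matcher Node might do for one field and one slice position. The function `candidate_pairs` is hypothetical, and the sketch leaves out token encryption and any deliberately added false positives; the only false positives it can produce are cases where two different hashes happen to share the same slice.

```python
from collections import defaultdict


def candidate_pairs(slices_a, slices_b):
    """Pair tokens from two datasets whose slice values are equal.

    slices_a, slices_b: lists of (token, slice_value) for the same field
    and the same slice position.
    """
    by_value = defaultdict(list)
    for token_b, piece in slices_b:
        by_value[piece].append(token_b)
    return [
        (token_a, token_b)
        for token_a, piece in slices_a
        for token_b in by_value.get(piece, [])
    ]


pairs = candidate_pairs(
    [("a1", "3f9c"), ("a2", "77b0")],
    [("b1", "3f9c"), ("b2", "0d41"), ("b3", "3f9c")],
)
# pairs == [("a1", "b1"), ("a1", "b3")]; at most one of these is a true match
```

Because a node sees only one slice of each field, it cannot tell which candidates are genuine, so that decision has to be made elsewhere.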

The Aggregator Node filters out the false positives based on the number of ‘votes’ received from the Matcher Nodes, and returns the list of matching token pairs. These pairs are then used to assemble a matched dataset comprising the data from both organizations—without any PII.
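
The voting idea can be sketched in a few lines of Python. The function `filter_by_votes` and the `min_votes` threshold are assumptions for illustration; how Senate Matching actually counts and weighs votes is not described here.

```python
from collections import Counter


def filter_by_votes(candidate_lists, min_votes):
    """Keep token pairs reported by at least `min_votes` Matcher Nodes.

    candidate_lists: one list of (token_a, token_b) pairs per Matcher Node.
    """
    votes = Counter()
    for candidates in candidate_lists:
        # A node gets one vote per pair, however often it reports the pair.
        votes.update(set(candidates))
    return sorted(pair for pair, count in votes.items() if count >= min_votes)


matches = filter_by_votes(
    [
        [("a1", "b1"), ("a2", "b9")],  # reported by a first Matcher Node
        [("a1", "b1"), ("a3", "b2")],  # reported by a second Matcher Node
        [("a1", "b1"), ("a2", "b9")],  # reported by a third Matcher Node
    ],
    min_votes=3,
)
# matches == [("a1", "b1")]
```

Only the pair that every node agrees on survives, while pairs backed by one or two nodes are treated as false positives and dropped.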

Governance Controls for Greater Protection

The final matched dataset can be subject to further governance controls. For example, it can be restricted so that analysis can only be performed in highly secure virtual machines on the Senate platform.

These advanced techniques ensure that Senate and Senate Matching combine to provide a flexible but tightly governed and secure environment for matching and sharing data.

For more details on how Senate Matching works, download our whitepaper.

[1] M-Trends 2018, FireEye, https://www.fireeye.com/content/dam/collateral/en/mtrends-2018.pdf.