Garbage in, garbage out

Preamble

We previously wrote about the value blockchain technology brings to different types of data, namely, how strongly or weakly blockchain is able to guarantee data provenance, immutability, and veracity for each type. In this article, we address another often (intentionally) overlooked aspect of how data interacts with blockchain: like any other system, blockchain suffers from the classic "garbage in, garbage out" problem.

Lying to the Blockchain

As our earlier article on data showed, a blockchain system cannot make any veracity guarantees for data that was neither natively generated on-chain nor publicly available, which unfortunately describes the vast majority of the data in this world. Hence, if someone (or some device) commits fraudulent data to the blockchain, there's no way to ascertain its veracity, and you end up with fraudulent data permanently committed to the blockchain's history. If you put garbage into the blockchain, you get garbage out of the blockchain.

Purported applications that ignore this problem run rampant today, often with additional layers of technology to give the facade of correctness. Here are a few examples:

Decentralized data market: where companies are incentivized with tokens to put their data up for sale - how do you know the data being bought is real?
Privacy-preserving queries: a service that counts a bank's high-net-worth individuals via a zero-knowledge range proof, so you'd get the count without the bank revealing any of its customers' information - how do you know the bank isn't fabricating its entire customer database?

For publicly available data, you can design a game whereby players with financial stakes at risk challenge one another on the veracity of the data provided, as ChainLink does. But as pointed out in our previous article, the VAST majority of the world's data are not publicly available.

So what's to be done? The key is to secure the data at the source.

Securing the Source

Securing data at the source before anything can corrupt it

If data is acquired not at the source but through a third-party intermediary, its veracity can no longer be trusted without also trusting that intermediary. The more intermediaries handle the data, the more parties you have to trust, until at some point so many are involved that the data might as well have come from a random number generator.
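To make the compounding concrete, here is a small illustrative sketch. It assumes, purely for illustration, that each intermediary independently passes the data along unaltered with the same probability. That model, the function name chance_data_intact, and the 95% figure are our assumptions, not measurements.

```python
def chance_data_intact(per_hop_reliability: float, intermediaries: int) -> float:
    """Probability the data survives every hop unaltered.

    Assumes (for illustration only) that hops are independent and equally reliable.
    """
    if not 0.0 <= per_hop_reliability <= 1.0:
        raise ValueError("per_hop_reliability must be between 0 and 1")
    if intermediaries < 0:
        raise ValueError("intermediaries cannot be negative")
    return per_hop_reliability ** intermediaries


if __name__ == "__main__":
    # Hypothetical figure: every intermediary is 95% trustworthy.
    for hops in (0, 1, 5, 20, 50):
        print(f"{hops:>2} intermediaries -> {chance_data_intact(0.95, hops):.1%} chance the data is intact")
```

Under these assumptions, even fairly reliable intermediaries erode trust quickly as the chain grows, which is why the source is the place to capture data.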

The goal is then to capture the data as close to the source as possible. For example,

Instead of obtaining sales data from a retailer's database, get it at the point of sale hardware;
Instead of subscribing to a feed from a weather website, get it from weather sensors that collected the data;
Instead of reading a PDF report from a bridge operations company, try to get raw data from video cameras and sensors installed on the bridge.

But how do you secure data from the source? Since most data in this world are either generated or captured by devices, let's describe how to secure device-generated data. Here we face three (3) potential points of failure:

Identity: how do you know what is generating the data? Is it from a temperature sensor like you expected, or a random number generator from a malicious player?
Processing & Transmission: even if the data source is real and identifiable, how do you know the data wasn't altered, corrupted, or just outright switched during processing and transmission on the device - e.g., while moving from the sensor into the communication module?
Digital / Analog Interface: even if identity, processing, and transmission are secured, how do you prevent someone from altering the way the device collects data by physically feeding it a fake input signal?

Let's tackle these one by one and see what can be done.

A Practical Approach

Identity

To establish a data-generating device's identity, a public/private key pair can be embedded in the device. Publishing the public key and allowing on-site inspection of the actual hardware's output are practical, widely practiced ways to confirm that the hardware is what it says it is. But that's the easy part.

The tricky part is making sure that this identity cannot be stolen, meaning the private key is known only to the device. You can use something called a secure element (SE), which is a highly tamper-resistant piece of hardware that can generate public/private key pairs within the chip. An SE typically does just one thing: it signs messages, which is a fancy way of saying it provides proof of identity. If you've ever owned a credit card or a modern smartphone, you've benefited from the functionality of a secure element.
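The sign-and-verify flow can be sketched in software. The sketch below uses a Lamport one-time signature built from SHA-256, chosen only because it needs nothing beyond Python's standard library; it is a stand-in for whatever scheme a real secure element implements in hardware (commonly an elliptic-curve scheme such as ECDSA). The class ToySecureElement and the function verify are hypothetical names for illustration, and a Python object offers none of the tamper resistance of a real chip.

```python
import hashlib
import secrets

BITS = 256  # one pair of secrets per bit of a SHA-256 digest


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _digest_bits(message: bytes) -> list:
    """Return the 256 bits of SHA-256(message), most significant bit first."""
    digest = _h(message)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(BITS)]


class ToySecureElement:
    """Hypothetical software stand-in for an SE: only the public key is exposed."""

    def __init__(self) -> None:
        # Private key: two random secrets per digest bit, generated "inside the chip".
        self._private = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(BITS)]
        # Public key: the hashes of those secrets, safe to publish.
        self.public_key = [(_h(a), _h(b)) for a, b in self._private]
        self._used = False

    def sign(self, message: bytes) -> list:
        # A Lamport key may sign only one message; reuse would leak private material.
        if self._used:
            raise RuntimeError("one-time key already used")
        self._used = True
        return [self._private[i][bit] for i, bit in enumerate(_digest_bits(message))]


def verify(public_key, message: bytes, signature) -> bool:
    """Check a signature using only the published public key."""
    if len(signature) != BITS:
        return False
    bits = _digest_bits(message)
    return all(_h(signature[i]) == public_key[i][bits[i]] for i in range(BITS))


if __name__ == "__main__":
    device = ToySecureElement()
    reading = b"temperature=2.5C"
    signature = device.sign(reading)
    print(verify(device.public_key, reading, signature))              # genuine reading
    print(verify(device.public_key, b"temperature=9.9C", signature))  # altered reading
```

Anyone holding the published public key can check that a reading came from the device, while nothing outside the device ever needs to see the private key. A production device would use a reusable signature scheme rather than a one-time key.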

Processing & Transmission

To ensure that the data processing & transmission logic is secure, we make use of a microcontroller (MCU) with secure boot (SB). You can think of a microcontroller as a very simple computer.

SB ensures that only an entity with the right private key is able to load applications into the MCU. The application logic and associated checksums can be shared ahead of time with relevant stakeholders, or simply open-sourced, so that they can be verified after loading.
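The stakeholder-side checksum verification might look like the sketch below. It assumes the checksum is a SHA-256 hash published as a hex string (the text above does not name an algorithm), and the function names are hypothetical. It illustrates only the after-the-fact check by stakeholders, not the signature check that secure boot itself performs inside the MCU.

```python
import hashlib
import hmac


def firmware_digest(image: bytes) -> str:
    """SHA-256 of the firmware image as a lowercase hex string (algorithm is an assumption)."""
    return hashlib.sha256(image).hexdigest()


def matches_published_checksum(image: bytes, published_hex: str) -> bool:
    """True only if the image hashes to the checksum that was shared ahead of time."""
    expected = published_hex.strip().lower().encode("utf-8")
    actual = firmware_digest(image).encode("utf-8")
    # compare_digest takes time independent of where the first mismatch occurs.
    return hmac.compare_digest(actual, expected)


if __name__ == "__main__":
    audited_image = b"\x7fFIRMWARE v1.0 (placeholder bytes)"  # stand-in for a real binary
    published = firmware_digest(audited_image)                # what stakeholders receive
    print(matches_published_checksum(audited_image, published))
    print(matches_published_checksum(audited_image + b"\x00", published))
```

Even a single changed byte in the loaded image produces a different digest, so a mismatch tells stakeholders that the device is not running the logic they reviewed.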

More critically, once the application has been thoroughly tested, we need to disable all modification functionality in both the application and the MCU, including firmware upgrades. This ensures that the application logic is now absolutely immutable, not even changeable by the manufacturer.

This creates obvious disadvantages, such as the fact that the application can no longer be upgraded. But in return, we have gained true device independence (in conjunction with the SE) from outside interference, with perfectly deterministic and unalterable behavior that could be trusted.

Digital / Analog Interface

This problem is tricky, and cannot be solved using hardware embedded on the data collection and relay device. Creative mechanisms must often be devised to ensure that the interface is not tampered with, and they are highly application-specific. Let's use an example.

Suppose you have a refrigeration truck that's part of a fleet from a cold chain logistics company, tasked with delivering fresh fish to local supermarkets. To ensure quality, the fish must remain within a certain temperature range. If the temperature is too high, the fish could spoil. If the temperature is too low, the fish could end up with inferior taste and texture. To ensure that the logistics company adheres to the contractual temperature range, the supermarket puts a temperature sensor in the truck.

But what if the truck driver takes the sensor and puts it inside an ice cooler in the front of the truck, while he dials up the temperature in the refrigeration unit to save on energy costs? The sensor has no idea it has been moved, and keeps collecting and reporting data that's within the contractually agreed-upon temperature range. The sensor has been duped.

One way to mitigate this risk is to hardwire the sensor into the refrigeration unit so it is nearly impossible to remove. But this method could still be circumvented by, say, wrapping a bag of ice around the sensor while keeping the rest of the truck above the contractual temperature range.

Another, potentially better (but far more expensive) way is to put a tamper-resistant seal on each package of fish, with a separate temperature sensor in every package. If the driver tries to take out a temperature sensor, they would need to break the seal, something that's easily detectable and likely to violate key terms of the contract.
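On the receiving end, the supermarket's check of the per-package approach could be sketched roughly as follows. Everything here is hypothetical: the PackageLog structure, the function names, and the -1 to 4 degree Celsius range are illustrative assumptions, not terms from any actual contract or product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PackageLog:
    package_id: str
    seal_intact: bool
    readings_c: List[float]  # temperatures recorded inside the package, in Celsius


def package_violations(log: PackageLog, min_c: float, max_c: float) -> List[str]:
    """List every way this package falls short of the agreed terms."""
    problems = []
    if not log.seal_intact:
        problems.append("seal broken")
    if not log.readings_c:
        problems.append("no readings recorded")
    else:
        if min(log.readings_c) < min_c:
            problems.append("too cold")
        if max(log.readings_c) > max_c:
            problems.append("too warm")
    return problems


def delivery_report(logs: List[PackageLog], min_c: float, max_c: float) -> Dict[str, List[str]]:
    """Map package_id to its violations, listing only packages that have any."""
    report = {}
    for log in logs:
        problems = package_violations(log, min_c, max_c)
        if problems:
            report[log.package_id] = problems
    return report


if __name__ == "__main__":
    logs = [
        PackageLog("A", True, [1.0, 2.0, 3.5]),
        PackageLog("B", False, [1.5, 2.0]),   # readings look fine, but the seal was broken
        PackageLog("C", True, [2.0, 6.5]),    # seal intact, but the package got too warm
    ]
    print(delivery_report(logs, min_c=-1.0, max_c=4.0))
```

The point of the design is that in-range readings alone are not accepted as proof: a broken seal flags the package no matter what its sensor reported.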

As mentioned above, solving the digital / analog interface problem takes a lot of creativity, and the solutions tend to be highly application-specific.

At Taraxa, we are serving our clients with hardware modules which we custom-designed to exhibit the characteristic of true device independence, to ensure that the data generated from these devices can be trusted to be completely free from outside influence. We'll keep everyone updated on our progress!

Stay tuned.