# Data Valuation with the help of the Shapley Value

According to a study by the telecommunications company Cisco, there will be 1.8 Internet-based data connections between machines (M2M connections) for every person on earth in 2022. The linking of physical and virtual objects with each other, the so-called Internet of Things (IoT), will continue to progress and so will the automated exchange of data between machines.

Two professors of the Massachusetts Institute of Technology (MIT), Andrew McAfee and Erik Brynjolfsson, see the technology of machine learning as the most important basic technology of our time. Machine learning (ML) is the ability of a machine or software to learn and solve certain tasks independently on the basis of experience (data).

Since machines independently exchange data with each other and use artificial intelligence algorithms, data valuation needs to be automated.

One approach is the Shapley Value. It belongs to cooperative game theory and rests on a simple basic idea: every player receives a payout that depends on their contributions to all possible coalitions. These contributions, in turn, consist of the increase in value that the player generates. The Shapley Value, named after Lloyd Shapley, who was awarded the 2012 Nobel Prize in Economics, answers the key question of how winnings should be distributed among players when the value creation process is multiplicative. Since an algorithm, such as a computer program, usually processes data from several data suppliers (e.g. machines) and a larger amount of learning data usually increases the quality of the output, the approach, developed in the 1950s, has been rediscovered as a way to divide value-added contributions in digital business models.

A question stakeholders ask is how the turnover generated by an ML system can be distributed among the parties involved. I will illustrate a possible solution with the help of the Shapley Value.

Example:

Let’s assume that there are two data suppliers and one algorithm that processes the data further, i.e. a total of three parties.

The Shapley Value (p) of a network participant (i) is the average added value (v) that this participant brings when joining a coalition (S). In the Shapley approach, four steps have to be performed one after the other:

First, all possible sequences in which those involved in value creation can join must be determined. For our three participants, there are 1 x 2 x 3 = 6 possible sequences. For n players, there are 1 x 2 x … x n = n! (n factorial) sequences. Here n is the total number of network participants (in our case three), and S is a coalition, i.e. a subset of these participants.

This can be represented as follows: 
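A common textbook way to write the average over all sequences is given below as a sketch. The notation is an assumption made here: φ_i stands for the Shapley Value of participant i (chosen to keep it apart from the value unit p used in the example), π runs over all n! sequences, and S_{π,i} is the coalition of participants that come before i in sequence π.

```latex
\varphi_i(v) \;=\; \frac{1}{n!} \sum_{\pi}
  \Bigl[\, v\bigl(S_{\pi,i} \cup \{i\}\bigr) - v\bigl(S_{\pi,i}\bigr) \Bigr]
```

The bracket is the marginal contribution of participant i in one sequence; the sum and the division by n! correspond to adding up all marginal contributions and averaging them.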

As a second step, the marginal contribution of each participant is determined for each of these sequences. Let us assume a turnover of 10 million euros and a fictitious total value added of 2p, so that p corresponds to 5 million euros. In a third step, all marginal contributions are added up, and in a fourth step this sum is divided by the number of possible sequences. The final result assigns each player a share of the added value that corresponds to the player's average marginal contribution:

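| Coalition (order of joining) | Algorithm Z | Data supplier X | Data supplier Y |
| --- | --- | --- | --- |
| {Z,X,Y} | 0 | p | p |
| {Z,Y,X} | 0 | p | p |
| {X,Z,Y} | p | 0 | p |
| {X,Y,Z} | 2p | 0 | 0 |
| {Y,Z,X} | p | p | 0 |
| {Y,X,Z} | 2p | 0 | 0 |
| Sum | 6p | 3p | 3p |
| Shapley Value | p | p/2 | p/2 |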

Interpretation: The resulting distribution between the two data providers and the algorithm represents their relative contributions. Without data, the algorithm would not have been able to work: if the algorithm is the only participant in the coalition, its marginal value-added contribution is zero, since no additional value is assumed to be generated without data providers. By assumption, the algorithm generates half of the revenue (p) together with a single data provider; a data provider that has not joined the coalition contributes nothing and is therefore not entitled to a share. If X joins the coalition with algorithm Z, the marginal value-added contribution of this new network participant is p, as is the marginal value-added contribution of data provider Y. Averaged over the six sequences, the algorithm is assigned 6p/6 = p and each data supplier 3p/6 = p/2.

Half of the value added is thus attributed to the algorithm and 25% to each data provider. Of the 10 million euros in turnover, the two data suppliers therefore each receive 2.5 million euros and the algorithm 5 million euros.
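The four steps of the example can be retraced with a short Python sketch. The function `value` is an assumption that encodes the example's rule (each data supplier adds p, but only if algorithm Z is already present); the names `PLAYERS`, `value`, `shapley_values`, `shares` and `payouts` are chosen here for illustration only.

```python
from fractions import Fraction
from itertools import permutations

PLAYERS = ("Z", "X", "Y")  # algorithm Z, data suppliers X and Y


def value(coalition):
    """Assumed value of a coalition, in units of p.

    Each data supplier adds p, but only when algorithm Z is present.
    """
    members = set(coalition)
    if "Z" not in members:
        return 0
    return len(members & {"X", "Y"})


def shapley_values(players, v):
    """Average each player's marginal contribution over all join orders."""
    orderings = list(permutations(players))          # step 1: all n! sequences
    totals = {player: Fraction(0) for player in players}
    for ordering in orderings:
        coalition = []
        for player in ordering:
            before = v(coalition)
            coalition.append(player)
            totals[player] += v(coalition) - before  # steps 2 and 3
    # step 4: divide the summed contributions by the number of sequences
    return {player: total / len(orderings) for player, total in totals.items()}


shares = shapley_values(PLAYERS, value)  # in units of p
TURNOVER = 10_000_000                    # euros, assumed equal to 2p
payouts = {player: share * TURNOVER / 2 for player, share in shares.items()}

for player in PLAYERS:
    print(f"{player}: {shares[player]} p = {float(payouts[player]):,.0f} euros")
```

Under these assumptions the sketch assigns p to the algorithm and p/2 to each data supplier, which matches the split of 5 million and 2.5 million euros described above. Exact fractions are used so that the averages are not blurred by floating-point rounding.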

Problem:

In practice, the example constellation and this result will probably never occur. In Industry 4.0, there are millions, even billions, of data suppliers, depending on the sector, so enumerating all n! sequences is no longer feasible. The current method for determining the Shapley Value for an unknown utility function is therefore based on Monte Carlo simulations. In addition, it is realistic to assume that only a few of the data providers contribute significantly to value creation.
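To give an idea of what a Monte Carlo estimate can look like, here is a minimal sketch of one common variant, permutation sampling: instead of enumerating all sequences, it averages marginal contributions over randomly drawn ones. It is not taken from the papers cited in this article; the function `monte_carlo_shapley`, its parameters `num_samples` and `seed`, and the reuse of the three-party example as the value function are assumptions made for illustration.

```python
import random


def value(coalition, p=1.0):
    """Assumed example rule: each data supplier adds p if algorithm Z is present."""
    members = set(coalition)
    if "Z" not in members:
        return 0.0
    return p * len(members & {"X", "Y"})


def monte_carlo_shapley(players, v, num_samples=10_000, seed=0):
    """Estimate Shapley values by averaging over randomly sampled join orders."""
    rng = random.Random(seed)        # fixed seed so the sketch is reproducible
    totals = {player: 0.0 for player in players}
    order = list(players)
    for _ in range(num_samples):
        rng.shuffle(order)           # draw one random sequence
        coalition = []
        previous = v(coalition)
        for player in order:
            coalition.append(player)
            current = v(coalition)
            totals[player] += current - previous  # marginal contribution
            previous = current
    return {player: total / num_samples for player, total in totals.items()}


estimates = monte_carlo_shapley(("Z", "X", "Y"), value)
print(estimates)
```

For three parties the estimate is unnecessary, since the exact result is known, but the same loop works unchanged for many more players: the cost grows with the number of sampled sequences rather than with n!. The estimates are approximate and depend on the number of samples.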

Why does it make sense to build advanced data valuation models based on the Shapley Value? The advantage is that the Shapley Value is a unique scheme for sharing value contributions, and many of its properties match the characteristics of digital business models (https://data-valuation.com/the-effects-that-shape-the-value-of-data/).

These include:

1. Group rationality: The value of the entire data set is fully distributed among all users.

2. Fairness: Two users who are identical in terms of what they contribute to the utility of a data set should have the same value.

3. Additivity: The data values obtained under several utility functions add up to the data value obtained under the sum of these utility functions.
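The three properties can be checked numerically on the small example from above. The following sketch is only an illustration: `v1` encodes the example's assumed rule in units of p, while `v2` is a purely hypothetical second utility function introduced here to test additivity.

```python
from fractions import Fraction
from itertools import permutations

PLAYERS = ("Z", "X", "Y")


def shapley_values(players, v):
    """Exact Shapley values: average marginal contribution over all join orders."""
    orderings = list(permutations(players))
    totals = {player: Fraction(0) for player in players}
    for ordering in orderings:
        coalition = []
        for player in ordering:
            before = v(coalition)
            coalition.append(player)
            totals[player] += v(coalition) - before
    return {player: total / len(orderings) for player, total in totals.items()}


def v1(coalition):
    """Example rule (assumed): each data supplier adds p if algorithm Z is present."""
    members = set(coalition)
    return len(members & {"X", "Y"}) if "Z" in members else 0


def v2(coalition):
    """Hypothetical second utility, used only to illustrate additivity."""
    return len(set(coalition)) ** 2


def v_sum(coalition):
    """Utility that is the sum of v1 and v2."""
    return v1(coalition) + v2(coalition)


phi1 = shapley_values(PLAYERS, v1)
phi2 = shapley_values(PLAYERS, v2)
phi_sum = shapley_values(PLAYERS, v_sum)

# 1. Group rationality: the full value v1(PLAYERS) is distributed completely.
group_rational = sum(phi1.values()) == v1(PLAYERS)
# 2. Fairness: X and Y contribute identically and receive the same value.
fair = phi1["X"] == phi1["Y"]
# 3. Additivity: values under v1 and v2 add up to the values under v1 + v2.
additive = all(phi_sum[i] == phi1[i] + phi2[i] for i in PLAYERS)

print(group_rational, fair, additive)
```

All three checks hold for this small game; this illustrates the properties on one example and is of course not a proof.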

Very little work has been done to determine the value of the data in self-learning systems. The main problem here is the lack of methods for efficiently determining the Shapley Value for a very large number of data suppliers.

At the end of February 2019, scientists from various universities, including ETH Zurich, presented an approach that attempts to tackle this problem (Towards Efficient Data Valuation Based on the Shapley Value, available at https://arxiv.org/abs/1902.10275). The authors hope for further research at the intersection of machine learning and cooperative game theory, as well as for solutions to real data collection and data valuation problems.

Furthermore, one has to keep in mind that a data set that is not valuable in one context may well be valuable in another. A paper published in April 2019 (Data Shapley: Equitable Valuation of Data for Machine Learning, available at https://arxiv.org/abs/1904.02868) views this as an interesting field of research. It remains exciting.