Data Valuation using the Data Shapley Value has gained further scientific support in recent months. The Shapley Value assigns each individual data provider a share of the value created, based on that provider's contribution (for more on the theoretical foundations of the Shapley Value, see my article from May 2019).
In a machine learning project, more than half of the time is spent on data preparation and data transformation.
The steps are structured as follows:
Let’s take the example of a medical technology company that wants to use patient data to detect breast cancer at an early stage (step 1).
For this purpose, the company receives high-resolution images from various hospital operators where the manufacturer’s devices are located (step 2). The manufacturer then builds a database schema and database connections (step 3) and visualizes the data using Tableau (step 4). It analyzes the data with Python (step 5). Finally, the result of the entire process is evaluated with the F1 score (step 6).
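To make step 6 concrete, here is a minimal sketch of how an F1 score can be computed for a binary classifier. The labels and predictions are invented for illustration (1 = cancer detected, 0 = no finding) and do not come from any real dataset; the function name `f1_score` is simply the name chosen for this sketch.

```python
def f1_score(y_true, y_pred):
    """F1 score for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive prediction: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and model predictions for eight images.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

score = f1_score(y_true, y_pred)
print(f"F1 score: {score:.2f}")
```

In this toy case there are three true positives, one false positive and one false negative, so precision and recall are both 0.75 and so is the F1 score.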
So, there are data suppliers, an algorithm and a result, which is valuated. In each of the steps, assumptions are made, different programs are used, and the results are valuated differently.
Accordingly, each of the steps can lead to a different Data Valuation when the Shapley Value is applied.
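As a rough illustration of how such a valuation is computed, the following sketch calculates exact Shapley Values for two data providers. It assumes that the F1 score of the model is already known for every possible combination of providers; the provider names and the F1 numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from itertools import combinations
from math import factorial


def shapley_values(providers, utility):
    """Exact Shapley Value per provider, given a utility per coalition."""
    n = len(providers)
    values = {}
    for provider in providers:
        others = [p for p in providers if p != provider]
        total = 0.0
        # Iterate over all coalitions that do not contain the provider.
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without = frozenset(coalition)
                with_provider = without | {provider}
                # Marginal contribution of the provider to this coalition.
                total += weight * (utility(with_provider) - utility(without))
        values[provider] = total
    return values


# Hypothetical F1 scores of the model trained on each combination of data.
F1_BY_COALITION = {
    frozenset(): 0.00,
    frozenset({"hospital_a"}): 0.60,
    frozenset({"hospital_b"}): 0.50,
    frozenset({"hospital_a", "hospital_b"}): 0.80,
}

values = shapley_values(
    ["hospital_a", "hospital_b"],
    lambda coalition: F1_BY_COALITION[coalition],
)
for name, value in values.items():
    print(f"{name}: {value:.2f}")
```

With these numbers hospital_a receives 0.45 and hospital_b 0.35, which together add up to the F1 score of 0.80 reached with all data. If any of the assumed F1 scores changes, for example because a different preparation step or a different metric is used, the valuation changes with it. Note also that the exact calculation needs a utility for every subset of providers, so the number of model trainings doubles with each additional provider.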
In particular, the following points should be questioned:
In order to be able to perform a valuation using the Shapley Value, it is assumed that the person performing the valuation has access to all datasets that are used for the algorithm. In practice, however, access restrictions may exist, for example, due to data protection requirements.
Once the algorithm is built, the data can be manipulated to obtain a higher computed Shapley Value. However, this does not necessarily lead to a better result; there may be a discrepancy between a high Shapley Value and a high Data Value.
Another interesting question is: Should data providers whose data is valued with a negative Shapley Value be penalized?
Let’s take the example above: in step 2, different hospital operators transmit their images. One hospital operator transmits data that is of low quality. Should the medical technology company receive monetary compensation from that hospital operator for transmitting the relatively poorer-quality data?
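The question can be made tangible with a small sketch. It assumes two hypothetical hospital operators, invented F1 scores, and an invented budget of 1000 monetary units; neither the numbers nor the two payout rules are taken from a real contract or from the literature. The Shapley Value is computed here as the average marginal contribution over all orderings of the providers.

```python
from itertools import permutations


def shapley_by_permutation(providers, utility):
    """Shapley Value as the average marginal contribution over all orderings."""
    values = {p: 0.0 for p in providers}
    orderings = list(permutations(providers))
    for ordering in orderings:
        coalition = frozenset()
        for provider in ordering:
            extended = coalition | {provider}
            values[provider] += utility(extended) - utility(coalition)
            coalition = extended
    return {p: v / len(orderings) for p, v in values.items()}


# Hypothetical F1 scores: the noisy data alone is useless and lowers
# the score when it is combined with the good data.
F1 = {
    frozenset(): 0.00,
    frozenset({"good_hospital"}): 0.80,
    frozenset({"noisy_hospital"}): 0.00,
    frozenset({"good_hospital", "noisy_hospital"}): 0.70,
}

values = shapley_by_permutation(
    ["good_hospital", "noisy_hospital"], lambda s: F1[s]
)

BUDGET = 1000.0  # hypothetical amount to distribute

# Rule 1 (assumption): payouts follow the signed Shapley Value,
# so a negative value becomes a payment owed by the provider.
total = sum(values.values())
signed_payout = {p: BUDGET * v / total for p, v in values.items()}

# Rule 2 (assumption): negative values are floored at zero, no penalty.
floored = {p: max(v, 0.0) for p, v in values.items()}
floored_total = sum(floored.values())
floored_payout = {p: BUDGET * v / floored_total for p, v in floored.items()}

print("Shapley Values:", values)
print("Signed payout: ", signed_payout)
print("Floored payout:", floored_payout)
```

Here the noisy hospital receives a Shapley Value of about -0.05. Under the first rule it would owe money to the company, while under the second rule it would simply receive nothing. Which of the two is appropriate is exactly the open question.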