Human-centred mechanism design with Democratic AI – Nature Human Behaviour

Participants

The study was approved by HuBREC (Human Behavioural Research Ethics Committee), a research ethics committee run within DeepMind but staffed and chaired by academics from outside the company. Participants were recruited over an approximately 8-month period from two different crowdsourcing platforms. All participants gave informed consent to participate in the experiment. The task was advertised to users located in the UK and the USA. We did not record identifiers or personal information from participants. Participants who accepted the Human Intelligence Task (HIT) received a link that led them to a game lobby in which they were grouped with three other players. Groups of four players participated in the game live and in interaction with one another. When a response was required, participants had a fixed duration in which to respond (2 min for all screens except voting, which was 4 min), accompanied by a timer that signalled how much time they had remaining. The game advanced only when the player who was slowest to respond had completed the round. Players who timed out were given a warning. Players who timed out twice were removed from the game and replaced with a randomly responding bot (games with missing data were excluded from the analysis). The game took approximately 20–30 min and participants were paid up to 8 in total, consisting of a base and a bonus payment. The precise conversion rate of points earned in-game (return in coins) to the bonus paid at the end of the study varied inversely with the sum of endowments over players; in this way, stakes were on average equated across games. Data collection and analysis were not performed blind to the conditions of the experiments.

A total of n=4,776 participants took part in Exp. 1–3. The pilot data that were used for training the agent consisted of a further ~4,000 datasets (including some partial datasets in which participants timed out). Exclusion lists were used to prevent participants from rejoining the experiments multiple times (however, as the overall data were collected over several months on two platforms and we did not collect identifiers, it was impossible to be absolutely sure that all participants were unique).

Investment game

All participants in Exp. 1–3 played 34 rounds of an investment game (3 blocks of 10 rounds and 1 bonus block of 4 rounds). On each round, each player was allocated an endowment of 2, 4, 6, 8 or 10 coins (2, 4 or 10 coins in Exp. 1), depending on the endowment condition to which they were allocated and whether they were designated the head or tail player, all of which was entirely random (and unrelated, for example, to the order in which they joined the game). In all endowment conditions, there was a single head player who received 10 coins and three tail players who all received either 2, 4, 6, 8 or 10 coins (in the equal endowment condition, the distinction between head and tail players is nominal only). Players received their endowment at the start of each round. Each player's endowment remained the same across all 34 rounds (and they were instructed that this would be the case).

In every round of every block, each player i privately chose how to divide an integer number of coins (their endowment \(e_i\)) between a project and a private account (the contribution made to the project is denoted \(c_i\)). No player could see the others' choices at this stage. The project was a public fund that received a return on investment (it was multiplied by a common productivity factor r = 1.6) and was then shared between participants according to some redistribution scheme, allocating a payout \(y_i\) to player i (see below). Coins allocated to the private account were simply retained by participants (with no return on investment). The total return to each player on each round was thus their payout plus endowment minus contribution, \(y_i + e_i - c_i\).
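To make the payoff structure concrete, the following is a minimal sketch (our illustration, not the authors' code) of a single round; the function name and the example values are ours.

```python
import numpy as np

def play_round(endowments, contributions, weights, r=1.6):
    """Return each player's total for the round: payout + endowment - contribution."""
    endowments = np.asarray(endowments, dtype=float)
    contributions = np.asarray(contributions, dtype=float)
    weights = np.asarray(weights, dtype=float)       # redistribution weights, sum to 1
    fund = r * contributions.sum()                   # project fund after growth
    payouts = weights * fund                         # y_i: each player's share
    return payouts + endowments - contributions      # total return per player

# Example: equal redistribution (the 'no referee' default used in block 1).
totals = play_round(endowments=[10, 4, 4, 4],
                    contributions=[5, 2, 2, 2],
                    weights=[0.25, 0.25, 0.25, 0.25])
```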

In the first block of all experiments, participants played 10 tutorial rounds with no referee. This meant that funds allocated to the project were distributed equally among all players, with no further redistribution (see below). In making this choice, we assumed that equal redistribution is a default position that the referees subsequently adjust. There were several reasons for this choice, including the importance of illustrating to players that a social dilemma could arise. However, to ensure that this did not influence our results, we ran additional controls (not reported) in which we added a block of the libertarian mechanism after the no-referee block. For the conditions we checked, performance was almost identical to that under the default referee, and our agent continued to be preferred in all cases.

In blocks 2 and 3, participants played 10 rounds with a referee (or mechanism; one mechanism for each block). The referee(s) redistributed project funds among players according to a specified mechanism, without creating or destroying wealth. The two rival mechanisms were encountered in counterbalanced order. After block 3, participants voted for the mechanism that they preferred. They knew that they would be making this vote (and what it would entail) from the end of block 1, before experiencing the mechanisms. In block 4, the probability of re-experiencing mechanism A (or B) was exactly equal to the fraction of votes that A (or B) received from the four players. The choice was thus deterministic if all players voted the same way, and there was no opportunity to vote strategically. Participants then answered seven debriefing questions (see below). Finally, they experienced four rounds of the chosen mechanism (block 4) and proceeded to a bonus screen where they were thanked and informed of their total earnings. Only data from blocks 2 and 3 were included in the analysis of Exp. 2–4 (see below for Exp. 1). We report the numbers of participants who voted for either mechanism using binomial tests.

Detail of experiments

We describe three experiments in the main text. All experiments had the same form, but we present the data slightly differently for Exp. 1 relative to Exp. 2 and Exp. 3.

In Exp. 2 (n=2,508), players experienced the HCRM and either the strict egalitarian, libertarian or liberal egalitarian mechanism, in randomized groups (we provide details about all mechanisms below). The order in which the different mechanisms were encountered was counterbalanced over blocks 2 and 3. Under the strict egalitarian mechanism, the referee effectively took no action, so that the 'original earnings' and 'earnings after referee's actions' screens looked the same (apart from bar colour). Under the libertarian mechanism, each player was returned a sum equal to 1.6× their contribution. We recruited a larger cohort for liberal egalitarian because simulation data (Supplementary Table 2) suggested that this was the highest-performing baseline, so more data would potentially be required to draw a reliable conclusion about which mechanism was preferred (target participant numbers were decided beforehand, and we did not use optional stopping). In Exp. 3 (n=736), players experienced the HCRM and the RM (trained with rational players; see below for a description of these players, and the main text for more details about the RM) in counterbalanced order over blocks 2 and 3.

The data described as Experiment 1 came from a study in which an earlier version of the HCRM competed against the libertarian and liberal egalitarian mechanisms. We present data only from these two baselines (not from the earlier version of the agent). The data for strict egalitarian are taken from block 1 (in which there was no referee). Thus, the conditions under which these data were collected are the same as in Exp. 2, except that (1) the order of the mechanisms was not counterbalanced, (2) the rival AI-designed mechanism was slightly different and (3) we do not report voting data from this experiment. Our goal here is to illustrate how contributions vary under different mechanisms; a near-identical pattern of results was replicated in Exp. 2. At this stage, we used only three endowment conditions: (10, 2, 2, 2), (10, 4, 4, 4) and equality (10, 10, 10, 10).

Data pre-processing and analysis

Our main analyses focus on data from Exp. 1–3 (n=4,776 in total). We have made this large dataset freely available at https://github.com/deepmind/hcmd_dai, along with code for recreating key figures.

In Exp. 1, we plot contributions as a function of mechanism and endowment as described above (Fig. 1c). In Exp. 2–3, data were analysed from blocks 2 and 3, that is, from the two blocks of 10 trials in which participants played the game with a referee that was either the HCRM or a rival baseline. We logged or computed various per-block, per-round metrics for each individual player, including (absolute) contributions (\(c_i\)), relative contributions (\(c_i/e_i\)), (absolute) payouts (\(y_i\)), relative payouts (\(y_i/e_i\)) and return (\(e_i - c_i + y_i\)), as well as some group-level game metrics, including the Gini coefficient and surplus (the sum of returns divided by the sum of initial endowments over players and rounds). We show contributions and payouts over time for head and tail players under each mechanism in Supplementary Figs. 3 and 4.
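As an illustration of the two group-level metrics, the sketch below (our code; the exact aggregation over rounds is our assumption) computes a Gini coefficient over a vector of returns and the surplus as defined above.

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative 1-D array (equivalent to the mean
    absolute difference formulation)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def surplus(returns, endowments):
    """Sum of returns divided by sum of initial endowments (over players and rounds)."""
    return np.sum(returns) / np.sum(endowments)
```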

Our voting data consisted of four binary votes per group g, each of which was either for the HCRM or for a baseline. We performed a group-level permutation test to assess statistical significance. Our permuted data randomly flipped voting preferences but preserved the covariance among votes within a group (note that statistics from a naive binomial test might be inflated, because votes are not independent given participants' shared experience of the mechanisms). The lowest possible P value obtainable given our 10,000 shuffles is P < 0.0001 (but we do not report P values lower than 1/1,000 to meet journal style requirements).
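A minimal sketch of such a test follows, under our reading that each shuffle flips all four votes of a group together; the function and variable names are ours.

```python
import numpy as np

def group_permutation_test(votes, n_shuffles=10_000, seed=0):
    """votes: (n_groups, 4) binary array, 1 = vote for HCRM. Flipping all four
    votes of a group together preserves within-group covariance."""
    rng = np.random.default_rng(seed)
    votes = np.asarray(votes)
    observed = votes.sum()
    null = np.empty(n_shuffles)
    for k in range(n_shuffles):
        flip = rng.integers(0, 2, size=(votes.shape[0], 1))   # one coin flip per group
        permuted = np.where(flip, 1 - votes, votes)
        null[k] = permuted.sum()
    # One-sided P value: how often the permuted total matches or exceeds the observed.
    return (np.sum(null >= observed) + 1) / (n_shuffles + 1)
```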

Overview of the mechanism training method

Our approach to AI mechanism design consisted of three steps: (1) we trained virtual human players with supervised learning to imitate human play in the investment game using an existing dataset, which we describe as pilot data for the main experiments presented here; (2) we trained a vote-maximizing mechanism using a deep RL agent interacting with these virtual players; and (3) we evaluated the mechanism by deploying it with new unseen human participants, together with comparison baselines (see above). This last step yielded new data, which could be used to repeat the steps above, and refine both the virtual players and the mechanism. In the next section, we describe these steps in detail.

Training virtual players using imitation learning

We drew upon data from previous pilot experiments (see 'Pilot testing', below) to train virtual human players using imitation learning. All data were collected from human participants playing the 10-round investment game (described above) under a variety of different mechanisms and varying endowment conditions (not equally distributed). Although we generally included only games in which all players finished the experiment, the training data included a handful of pilot games in which one or more players dropped out. However, the modelling excluded responses from players who had been replaced with randomly responding bots. For more details, see 'Pilot testing' below.

We used imitation learning, in which virtual players were trained to imitate human play. Virtual human players were deep neural networks. Each network was a simulation of a single player, which received information about all players' contributions and receipts on the current trial (similar to real human players) and was trained to predict the focal player's contribution on the next trial. Because they were equipped with recurrent memory (long short-term memory (LSTM) units) (37), the networks could potentially learn to use the trial history going back to the start of the game to make this prediction.

The network received the following information as input on each step: each player's endowment (4 inputs); each player's previous contribution (4 inputs); each player's previous contribution relative to endowment (4 inputs); and each player's payout (4 inputs). Payouts, endowments and contributions were divided by 10 to lie in approximately the same range as relative contributions. These inputs were fed to a linear layer with an output size of 64 and tanh nonlinearities, followed by an LSTM with hidden size 16. The LSTM output was passed to a final linear layer of size 11, whose outputs coded the unnormalized log-probabilities of a categorical distribution corresponding to the probability of contributing 0, 1, …, 9 or 10 coins. We masked those outputs corresponding to contributions in excess of the endowment allocated to the focal player.
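For illustration, the sketch below reproduces the described architecture in tf.keras (the original was implemented in TensorFlow 1; the masking helper and names are ours).

```python
import tensorflow as tf

# Per-round input: 4 endowments + 4 contributions + 4 relative contributions
# + 4 payouts = 16 features, each scaled to roughly [0, 1].
player_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.LSTM(16, return_sequences=True),
    tf.keras.layers.Dense(11),   # unnormalized log-probabilities over 0..10 coins
])

dummy = tf.zeros([1, 10, 16])    # [batch, rounds, features]
logits = player_net(dummy)       # [1, 10, 11]

def mask_logits(logits, endowment):
    # Disallow contributions greater than the focal player's endowment by
    # pushing the corresponding logits to a very large negative value.
    allowed = tf.cast(tf.range(11) <= endowment, logits.dtype)
    return logits + (1.0 - allowed) * -1e9

masked = mask_logits(logits[0, -1], endowment=4)
```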

We trained this architecture with back-propagation through time (38) to minimize the cross-entropy between the predictions and the actual contributions, regularized with two additional terms: the entropy of the prediction (with weight 0.1); and the L2 loss on the parameters of the network (with weight 0.00001). The model was implemented in TensorFlow 1, and the parameters were optimized using Adam (39) with learning rate 0.0004 and parameters β1 = 0.9, β2 = 0.999 and ε = 1 × 10⁻⁸. We trained the model by performing 30,000 updates with mini-batches of size 512. Training took <6 h without the use of accelerators.
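A hedged sketch of the corresponding loss is shown below, assuming that the entropy term is subtracted (that is, higher prediction entropy is rewarded) and that the L2 penalty is applied to all trainable parameters; function and argument names are ours.

```python
import tensorflow as tf

def imitation_loss(logits, targets, params, ent_weight=0.1, l2_weight=1e-5):
    """logits: [batch, time, 11]; targets: [batch, time] integer contributions;
    params: list of trainable variables (for example, player_net.trainable_variables)."""
    ce = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits))
    probs = tf.nn.softmax(logits)
    entropy = -tf.reduce_mean(tf.reduce_sum(probs * tf.math.log(probs + 1e-8), axis=-1))
    l2 = tf.add_n([tf.reduce_sum(tf.square(p)) for p in params])
    # Cross-entropy minus weighted prediction entropy plus weighted L2 penalty.
    return ce - ent_weight * entropy + l2_weight * l2
```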

The virtual human player networks were evaluated on a separate hold-out dataset consisting of the contributions of a new group of human players (n=384) that resembled as closely as possible the final conditions under which we expected to evaluate the HCRM. We swept over network hyperparameters (including layer widths, number of LSTM units, learning rate and regularization types) to minimize validation loss.

Voting model

We learned in piloting that human players' votes are most strongly predicted by the relative payouts (\(y_i/e_i\)) they receive under one mechanism or another. We thus used this variable as the basis for virtual player voting. Each virtual player's probability of voting for mechanism A rather than the rival mechanism B was \(p(\mathrm{A}) = \Phi\left[\mathrm{rpay}^{\mathrm{A}} - \mathrm{rpay}^{\mathrm{B}}\right]\), where \(\mathrm{rpay}^{\mathrm{M}}\) is the sum of relative payouts obtained under mechanism M and \(\Phi[\cdot]\) is a logistic function with slope s. We set s to 1.4, but similar results were obtained (in terms of the resulting mechanism policy) under a wide range of values (see below).
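In code, the voting model amounts to the following (our sketch):

```python
import numpy as np

def vote_probability(rpay_a, rpay_b, slope=1.4):
    """Probability of voting for mechanism A: a logistic function of the
    difference in summed relative payouts under the two mechanisms."""
    return 1.0 / (1.0 + np.exp(-slope * (rpay_a - rpay_b)))
```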

Mechanism designer: problem definition and setup

We call the neural network used to design the mechanism the 'mechanism designer' and use the term human-centred redistribution mechanism (HCRM) to refer to the mechanism it designs, which was obtained only after training had converged. The mechanism designer used RL to learn a function that mapped observations (game states generated through interaction with virtual players) onto redistribution weights (a variable that determines which player gets what fraction of the project fund). We chose a Graph Network-based architecture that is equivariant to permutations in the ordering of participants and trained it on simulated 10-round investment games, with the goal of maximizing the cumulative voting probabilities of the virtual players against a carefully chosen alternative mechanism (see below). After training for 10,000 steps, the network parameters were frozen, and the function was exported in a way that allowed ready implementation in a human testing setting.

Network architecture

Inputs to the network were the endowments, contributions and relative contributions (that is, contribution-to-endowment ratios) for the current round, for all four participants (12 inputs per round). The network's output was four numbers that were passed through a softmax function (so that they were positive and summed to 1) to generate the redistribution weights for each player. When we state that the network has no memory, we mean that (1) it does not receive historical information about contributions or payouts as inputs; and (2) it has no recurrence, that is, network states are not passed between timesteps. Note, however, that the mechanism could implicitly infer the current round number if human/virtual player policies vary their contributions between different timepoints within the block.

We organized the network's observations in a fully connected directed graph \((u, V, E)\) where each player was represented as a vertex \(v_k \in V\). Directed edges \(e_{s,r}\) connecting \(v_s\) and \(v_r\) had empty initial attributes, and the input global attribute vector \(u\) was empty. Computations in Graph Networks start by updating the edge attributes, followed by the node attributes and finally the global attributes. In particular, directed edge attributes were updated with a function \(\upvarphi_e\) of the input edge attribute, the sender and receiver vertex attributes, and the global attribute vector: \(e'_{s,r} = \upvarphi_e(e_{s,r}, v_s, v_r, u)\); vertex attributes were updated with a function \(\upvarphi_v\) of the sum of all updated edge attributes that connect into \(v_r\), the input vertex attributes, and the global attribute vector: \(v'_r = \upvarphi_v(\Sigma_s e'_{s,r}, v_r, u)\); finally, the global attribute vector was updated with a function \(\upvarphi_u\) of the sum of all updated edges, the sum of all updated vertices, and the input global attributes: \(u' = \upvarphi_u(\Sigma_{s,r} e'_{s,r}, \Sigma_k v'_k, u)\). We note that the same functions \(\upvarphi_e\) and \(\upvarphi_v\) are used to update all edges and nodes in a graph, and that both the input and output of Graph Networks are directed graphs, so these modules can be used in sequence.

The mechanism designer's policy network architecture consisted of two Graph Networks (GNs) that processed the observation in sequence. In the first GN, we implemented each of \(\upvarphi_e\), \(\upvarphi_v\) and \(\upvarphi_u\) as a distinct linear layer with 32 output units and a tanh activation function. In the second GN, we implemented \(\upvarphi_e\) as a linear layer with 32 output units and a tanh activation function, and \(\upvarphi_v\) as a linear layer with a single output unit. We normalized the vertex outputs with a softmax across players, thus obtaining the redistribution weights; \(\upvarphi_u\) was not used in the second GN.
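The following simplified NumPy sketch (ours, with untrained random weights) makes the two-GN update rules concrete; it omits the global-attribute pathway, since \(u\) is initially empty and \(\upvarphi_u\) is not needed to produce the redistribution weights, whereas the original implementation used TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(n_in, n_out):
    # Illustrative untrained weights.
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def apply(params, x):
    w, b = params
    return x @ w + b

def gn_block(nodes, edges, senders, receivers, phi_e, phi_v, tanh_v=True):
    # Edge update: phi_e(edge attribute, sender attribute, receiver attribute).
    edge_in = np.concatenate([edges, nodes[senders], nodes[receivers]], axis=-1)
    edges_out = np.tanh(apply(phi_e, edge_in))
    # Node update: phi_v(sum of incoming updated edges, node attribute).
    agg = np.zeros((nodes.shape[0], edges_out.shape[-1]))
    np.add.at(agg, receivers, edges_out)
    nodes_out = apply(phi_v, np.concatenate([agg, nodes], axis=-1))
    return (np.tanh(nodes_out) if tanh_v else nodes_out), edges_out

# Fully connected directed graph over the four players (no self-edges).
senders, receivers = map(np.array, zip(*[(s, r) for s in range(4)
                                         for r in range(4) if s != r]))

# Per-player node features: endowment, contribution, relative contribution.
nodes = np.array([[10, 5, 0.5], [4, 2, 0.5], [4, 4, 1.0], [4, 0, 0.0]])
edges = np.zeros((len(senders), 0))          # empty initial edge attributes

# First GN: 32-unit linear layers with tanh; second GN: 32-unit edge update and
# a single-unit node update, followed by a softmax over players.
nodes1, edges1 = gn_block(nodes, edges, senders, receivers,
                          phi_e=linear(3 + 3, 32), phi_v=linear(32 + 3, 32))
scores, _ = gn_block(nodes1, edges1, senders, receivers,
                     phi_e=linear(32 + 32 + 32, 32), phi_v=linear(32 + 32, 1),
                     tanh_v=False)
weights = np.exp(scores[:, 0]) / np.exp(scores[:, 0]).sum()  # redistribution weights
```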

Training algorithm

To train the mechanism, we used an estimator drawn from a framework known as Stochastic Computation Graphs (SCG)30, which approximates the gradients of the vote-maximization objective with respect to the network's policy parameters. We trained the mechanism designer iteratively with 10,000 updates, using the RMSProp algorithm to optimize the policy with the following parameters: learning rate 0.0004; epsilon 1 × 10⁻⁵; decay 0.99; and no momentum. On every update, we simulated two batches of 512 games with 10 rounds per game. We divided the batches into groups of 64 episodes with consistent endowments, leading to 8 possible endowment conditions: the head player received 10 coins, and the tail players received 2, 3, 4, 5, 6, 7, 8 or 10 coins (using a broader range of tail player endowments helped avoid overfitting).
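For concreteness, a hedged sketch of this training configuration follows; mapping the stated 'decay 0.99' onto the moving-average coefficient (rho) of tf.keras's RMSprop is our reading rather than a detail given in the text, and the batch-construction helper is ours.

```python
import numpy as np
import tensorflow as tf

# Optimizer configuration as we read it (decay -> rho is our assumption).
optimizer = tf.keras.optimizers.RMSprop(learning_rate=4e-4, rho=0.99,
                                        epsilon=1e-5, momentum=0.0)

def sample_endowment_batch(batch_size=512, group_size=64):
    """One batch of endowment vectors: the head player always receives 10 coins,
    and each group of 64 episodes shares one of the eight tail-player endowments."""
    tail_values = [2, 3, 4, 5, 6, 7, 8, 10]
    batch = []
    for g in range(batch_size // group_size):     # 8 groups of 64 episodes
        tail = tail_values[g % len(tail_values)]
        batch.extend([[10, tail, tail, tail]] * group_size)
    return np.array(batch)                        # shape (512, 4)
```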

On every round, the game unfolded as described above for human players, except that the contributions were made by the virtual human players. In the first batch, the redistribution policy was decided by the mechanism designer under training, and in the other it was played by an alternative planner (the winner of the metagame, defined by \(w = 1, v = 1\); see the section on the ideological manifold). We paired episodes from these two batches to obtain 2,048 votes (512 pairs of episodes, 4 players) given our model of human voting. The objective that the HCRM aimed to maximize was the sum of votes across players, averaged across episodes.
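A sketch of this objective follows, assuming that the relative payouts have already been summed over the 10 rounds of each paired episode (array shapes and names are ours).

```python
import numpy as np

def vote_objective(rpay_hcrm, rpay_rival, slope=1.4):
    """rpay_*: arrays of shape (n_pairs, 4) holding each virtual player's summed
    relative payout under the HCRM-in-training and the rival mechanism."""
    p_vote = 1.0 / (1.0 + np.exp(-slope * (rpay_hcrm - rpay_rival)))
    return p_vote.sum(axis=1).mean()   # sum of votes over players, averaged over pairs
```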

Note that during training of the mechanism, we did not feed in the human data to predict player contributions (that is, we did not use teacher forcing). Furthermore, the payouts observed by the virtual players depended on mechanism policies that may lie outside of the human data, thus requiring the virtual players to generalize beyond the training dataset.

Having defined the observations as well as the objective that we wished to maximize, we estimated the policy gradient, that is, the gradient of the objective (the average number of votes) with respect to the policy parameters (the weights of the Graph Network), by turning to the SCG framework. We note here that most of the computation in the investment game is differentiable (including the policy as implemented by the HCRM), the only exception being the virtual human players' policies, whose action space is discrete. The SCG framework generalizes the policy gradient theorem and allowed us to obtain a low-variance estimator of the policy gradient by auto-differentiating through the environment and mechanism policy, while compensating for the non-differentiable operations (the discrete contributions of the players). The surrogate objective for the policy gradient was as follows:

$$S = J + \bot (J) \times \Sigma_i \Sigma_{t=2}^{10} \log p\left( \bot\left( c_i^t \right) \right),$$

where S is the surrogate objective, J is the objective we wish to maximize per episode (the expected number of votes) and \(\bot\) is the stop-gradient operation. Note that for the second term, the gradient can only flow through the parameterization of the log-probability of the players' policies. Note also that the contributions of the first round are removed from the equation, since they do not depend on the mechanism's parameters. In practice, we additionally mean-centred J within each batch, which is known to reduce the variance of the gradient estimator.
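A sketch of this surrogate in TensorFlow is shown below, with tf.stop_gradient playing the role of the \(\bot\) operator; the shapes and names are our assumptions.

```python
import tensorflow as tf

def surrogate_objective(J, log_probs):
    """J: per-episode expected number of votes, shape [batch]; log_probs: log-probability
    of each virtual player's sampled contribution from round 2 onwards,
    shape [batch, players, rounds - 1]."""
    J_centred = J - tf.reduce_mean(J)                     # mean-centring for variance reduction
    score_term = tf.stop_gradient(J_centred) * tf.reduce_sum(log_probs, axis=[1, 2])
    # Maximize S: gradients flow through J (differentiable path) and through the
    # log-probabilities (score-function path for the discrete contributions).
    return tf.reduce_mean(J_centred + score_term)
```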

In the Supplementary Information, we provide further details on (1) a detailed description and illustration of the game, (2) the voting procedure, (3) debriefing, (4) the determinants-of-voting analysis, (5) beach plots, (6) the ideological manifold, (7) rational players, (8) the metagame, (9) pilot testing, (10) human referee experiments and (11) a theoretical analysis of the game.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
