Stefano V. Albrecht, Filippos Christianos, and Lukas Schäfer
Multi-Agent Reinforcement Learning
Foundations and Modern Approaches
- Hardcover
"This book provides an accessible technical introduction to the field of Multi-Agent Reinforcement Learning (MARL)"--
"This book provides an accessible technical introduction to the field of Multi-Agent Reinforcement Learning (MARL)"--
Note: This item can only be shipped to a German delivery address.
Product details
- Publisher: MIT Press Ltd
- Number of pages: 394
- Publication date: 17 December 2024
- Language: English
- Dimensions: 161mm x 236mm x 30mm
- Weight: 760g
- ISBN-13: 9780262049375
- ISBN-10: 0262049376
- Item no.: 71703105
- Manufacturer identification
- Libri GmbH
- Europaallee 1
- 36244 Bad Hersfeld
- 06621 890
Stefano V. Albrecht, Filippos Christianos, and Lukas Schäfer
Preface xi
Summary of Notation xv
List of Figures xvii
1 Introduction 1
1.1 Multi-Agent Systems 2
1.2 Multi-Agent Reinforcement Learning 6
1.3 Application Examples 8
1.3.1 Multi-Robot Warehouse Management 8
1.3.2 Competitive Play in Board Games and Video Games 10
1.3.3 Autonomous Driving 11
1.3.4 Automated Trading in Electronic Markets 11
1.4 Challenges of MARL 12
1.5 Agendas of MARL 13
1.6 Book Contents and Structure 15
I FOUNDATIONS OF MULTI-AGENT REINFORCEMENT LEARNING 17
2 Reinforcement Learning 19
2.1 General Definition 20
2.2 Markov Decision Processes 22
2.3 Expected Discounted Returns and Optimal Policies 24
2.4 Value Functions and Bellman Equation 26
2.5 Dynamic Programming 29
2.6 Temporal-Difference Learning 32
2.7 Evaluation with Learning Curves 36
2.8 Equivalence of R(s, a, s') and R(s, a) 39
2.9 Summary 40
3 Games: Models of Multi-Agent Interaction 43
3.1 Normal-Form Games 44
3.2 Repeated Normal-Form Games 46
3.3 Stochastic Games 47
3.4 Partially Observable Stochastic Games 49
3.4.1 Belief States and Filtering 53
3.5 Modelling Communication 55
3.6 Knowledge Assumptions in Games 56
3.7 Dictionary: Reinforcement Learning ↔ Game Theory 58
3.8 Summary 58
4 Solution Concepts for Games 61
4.1 Joint Policy and Expected Return 62
4.2 Best Response 65
4.3 Minimax 65
4.3.1 Minimax Solution via Linear Programming 67
4.4 Nash Equilibrium 68
4.5 ε-Nash Equilibrium 70
4.6 (Coarse) Correlated Equilibrium 71
4.6.1 Correlated Equilibrium via Linear Programming 74
4.7 Conceptual Limitations of Equilibrium Solutions 75
4.8 Pareto Optimality 76
4.9 Social Welfare and Fairness 78
4.10 No-Regret 81
4.11 The Complexity of Computing Equilibria 83
4.11.1 PPAD Complexity Class 84
4.11.2 Computing ε-Nash Equilibrium is PPAD-Complete 86
4.12 Summary 87
5 Multi-Agent Reinforcement Learning in Games: First Steps and Challenges 89
5.1 General Learning Process 90
5.2 Convergence Types 92
5.3 Single-Agent RL Reductions 95
5.3.1 Central Learning 95
5.3.2 Independent Learning 97
5.3.3 Example: Level-Based Foraging 99
5.4 Challenges of MARL 101
5.4.1 Non-Stationarity 102
5.4.2 Equilibrium Selection 104
5.4.3 Multi-Agent Credit Assignment 106
5.4.4 Scaling to Many Agents 108
5.5 What Algorithms Do Agents Use? 109
5.5.1 Self-Play 109
5.5.2 Mixed-Play 111
5.6 Summary 111
6 Multi-Agent Reinforcement Learning: Foundational Algorithms 115
6.1 Dynamic Programming for Games: Value Iteration 116
6.2 Temporal-Difference Learning for Games: Joint Action Learning 118
6.2.1 Minimax Q-Learning 121
6.2.2 Nash Q-Learning 123
6.2.3 Correlated Q-Learning 124
6.2.4 Limitations of Joint Action Learning 125
6.3 Agent Modelling 127
6.3.1 Fictitious Play 128
6.3.2 Joint Action Learning with Agent Modelling 131
6.3.3 Bayesian Learning and Value of Information 134
6.4 Policy-Based Learning 140
6.4.1 Gradient Ascent in Expected Reward 141
6.4.2 Learning Dynamics of Infinitesimal Gradient Ascent 142
6.4.3 Win or Learn Fast 145
6.4.4 Win or Learn Fast with Policy Hill Climbing 147
6.4.5 Generalised Infinitesimal Gradient Ascent 149
6.5 No-Regret Learning 151
6.5.1 Unconditional and Conditional Regret Matching 151
6.5.2 Convergence of Regret Matching 153
6.6 Summary 156
II MULTI-AGENT DEEP REINFORCEMENT LEARNING: ALGORITHMS AND PRACTICE 159
7 Deep Learning 161
7.1 Function Approximation for Reinforcement Learning 161
7.2 Linear Function Approximation 163
7.3 Feedforward Neural Networks 165
7.3.1 Neural Unit 166
7.3.2 Activation Functions 167
7.3.3 Composing a Network from Layers and Units 168
7.4 Gradient-Based Optimisation 169
7.4.1 Loss Function 170
7.4.2 Gradient Descent 171
7.4.3 Backpropagation 174
7.5 Convolutional and Recurrent Neural Networks 175
7.5.1 Learning from Images – Exploiting Spatial Relationships in Data 175
7.5.2 Learning from Sequences with Memory 178
7.6 Summary 180
8 Deep Reinforcement Learning 183
8.1 Deep Value Function Approximation 184
8.1.1 Deep Q-Learning – What Can Go Wrong? 184
8.1.2 Moving Target Problem 187
8.1.3 Breaking Correlations 188
8.1.4 Putting It All Together: Deep Q-Networks 192
8.1.5 Beyond Deep Q-Networks 193
8.2 Policy Gradient Algorithms 195
8.2.1 Advantages of Learning a Policy 195
8.2.2 Policy Gradient Theorem 197
8.2.3 REINFORCE: Monte Carlo Policy Gradient 199
8.2.4 Actor-Critic Algorithms 202
8.2.5 A2C: Advantage Actor-Critic 203
8.2.6 PPO: Proximal Policy Optimisation 206
8.2.7 Policy Gradient Algorithms in Practice 209
8.2.8 Concurrent Training of Policies 210
8.3 Observations, States, and Histories in Practice 215
8.4 Summary 216
9 Multi-Agent Deep Reinforcement Learning 219
9.1 Training and Execution Modes 220
9.1.1 Centralised Training and Execution 220
9.1.2 Decentralised Training and Execution 221
9.1.3 Centralised Training with Decentralised Execution 222
9.2 Notation for Multi-Agent Deep Reinforcement Learning 222
9.3 Independent Learning 223
9.3.1 Independent Value-Based Learning 224
9.3.2 Independent Policy Gradient Methods 226
9.3.3 Example: Deep Independent Learning in a Large Task 228
9.4 Multi-Agent Policy Gradient Algorithms 230
9.4.1 Multi-Agent Policy Gradient Theorem 231
9.4.2 Centralised Critics 232
9.4.3 Centralised Action-Value Critics 236
9.4.4 Counterfactual Action-Value Estimation 237
9.4.5 Equilibrium Selection with Centralised Action-Value Critics 239
9.5 Value Decomposition in Common-Reward Games 242
9.5.1 Individual-Global-Max Property 244
9.5.2 Linear Value Decomposition 246
9.5.3 Monotonic Value Decomposition 249
9.5.4 Value Decomposition in Practice 255
9.5.5 Beyond Monotonic Value Decomposition 261
9.6 Agent Modelling with Neural Networks 266
9.6.1 Joint Action Learning with Deep Agent Models 267
9.6.2 Learning Representations of Agent Policies 270
9.7 Environments with Homogeneous Agents 274
9.7.1 Parameter Sharing 276
9.7.2 Experience Sharing 278
9.8 Policy Self-Play in Zero-Sum Games 281
9.8.1 Monte Carlo Tree Search 283
9.8.2 Self-Play MCTS 286
9.8.3 Self-Play MCTS with Deep Neural Networks: AlphaZero 288
9.9 Population-Based Training 290
9.9.1 Policy Space Response Oracles 292
9.9.2 Convergence of PSRO 295
9.9.3 Grandmaster Level in StarCraft II: AlphaStar 297
9.10 Summary 301
10 Multi-Agent Deep RL in Practice 305
10.1 The Agent-Environment Interface 305
10.2 MARL Neural Networks in PyTorch 307
10.2.1 Seamless Parameter Sharing Implementation 309
10.2.2 Defining the Models: An Example with IDQN 310
10.3 Centralised Value Functions 312
10.4 Value Decomposition 313
10.5 Practical Tips for MARL Algorithms 314
10.5.1 Stacking Time Steps vs. Recurrent Network vs. Neither 314
10.5.2 Standardising Rewards 315
10.5.3 Centralised Optimisation 315
10.6 Presentation of Experimental Results 316
10.6.1 Learning Curves 316
10.6.2 Hyperparameter Search 318
11 Multi-Agent Environments 321
11.1 Criteria for Choosing Environments 322
11.2 Structurally Distinct 2×2 Matrix Games 323
11.2.1 No-Conflict Games 323
11.2.2 Conflict Games 324
11.3 Complex Environments 325
11.3.1 Level-Based Foraging 326
11.3.2 Multi-Agent Particle Environment 328
11.3.3 StarCraft Multi-Agent Challenge 329
11.3.4 Multi-Robot Warehouse 330
11.3.5 Google Research Football 331
11.3.6 Hanabi 332
11.3.7 Overcooked 333
11.4 Environment Collections 334
11.4.1 Melting Pot 335
11.4.2 OpenSpiel 335
11.4.3 PettingZoo 337
A Surveys on Multi-Agent Reinforcement Learning 339
Bibliography 341