Preface

Original documentation link

Code link

Reference blogs

OpenAI Gym: single-agent environments only

PettingZoo: focused on multi-agent environments

1 Basic Usage

1.1 Initializing Environments

Using environments in PettingZoo is very similar to using them in Gymnasium. You initialize an environment via:

from pettingzoo.butterfly import pistonball_v6
env = pistonball_v6.env()

Environments are generally highly configurable via arguments at creation, i.e.:

cooperative_pong.env(ball_speed=18, left_paddle_speed=25,
right_paddle_speed=25, is_cake_paddle=True, max_cycles=900, bounce_randomness=False)

1.2 Interacting With Environments

Environments can be interacted with using a similar interface to Gymnasium:

env.reset()
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    action = policy(observation, agent)
    env.step(action)
  • agent: this is actually the agent's name (a string in the code)

The commonly used methods are:

agent_iter(max_iter=2**63)

returns an iterator that yields the current agent of the environment. It terminates when all agents in the environment are done or when max_iter steps have been executed.

  • In practice, agent_iter stops iterating if and only if the agents list (i.e. agents, which is simply a list of strings) is empty.

That is, in a custom environment, if you want the game to be over (whether because all agents are done or because max_iter timesteps have been executed), you must empty the agents list.

In fact, the truncation and termination flags by themselves cannot stop the iterator. When writing a custom environment, if you want these two flags to affect agent_iter so that the game ends properly, you must call self._was_dead_step(action) inside step().
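To make this concrete, here is a minimal sketch of how the start of a custom step() typically looks (it mirrors the example environment in section 2.1; _was_dead_step() is inherited from AECEnv):

def step(self, action):
    if (
        self.terminations[self.agent_selection]
        or self.truncations[self.agent_selection]
    ):
        # the already-dead agent must be stepped with action=None;
        # _was_dead_step() then removes it from self.agents, which is
        # what eventually allows agent_iter to stop
        self._was_dead_step(action)
        return
    # ... normal transition logic for the live agent goes here ...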


last(observe=True)

returns observation, reward, termination, truncation, and info for the agent currently able to act. The returned reward is the cumulative reward that the agent has received since it last acted. If observe is set to False, the observation will not be computed, and None will be returned in its place. Note that a single agent being done does not imply the environment is done.

  • Roughly the counterpart of step() in a single-agent Gym environment, except that last() takes no action as input and only returns (observation, reward, termination, truncation, info)

reset()

resets the environment and sets it up for use when called the first time.

step(action)

takes and executes the action of the agent in the environment, automatically switches control to the next agent.

  • [ ] Q: The sample code has only one for loop, and it only iterates over agent_iter; how does this produce a loop over timesteps?

  • [x] A: agent_iter keeps cycling: it goes from the 0th agent to the nth agent, then starts again from the 0th agent, until the agents list is empty.

The key to the timestep loop is step(), which works differently from step() in a single-agent Gym environment.

step() in PettingZoo only takes the action as input; it does not return (observation, reward, done, info).

In other words:

  • Gym's step() == PettingZoo's last() + step()

In practice:

  • You can update the (environment) state right after each agent submits its action, e.g. the first and second player taking turns in a board game.
  • Or you can update the state only after every agent has submitted its action for the round (i.e. after a complete timestep has finished), e.g. rock paper scissors; this approach suits the Parallel API.

The code shown here uses the Agent Environment Cycle (AEC) API; doing this in a parallel fashion is more awkward, but still possible: a complete timestep can be delimited with if self._agent_selector.is_last(): (true when agent_iter has reached the last agent, i.e. the timestep is over), as sketched below.
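A rough sketch of that pattern inside an AEC step(); the two helper methods are hypothetical placeholders, not part of the PettingZoo API:

def step(self, action):
    agent = self.agent_selection
    self.state[agent] = action              # cache this agent's action
    if self._agent_selector.is_last():      # every agent has acted once: the timestep is complete
        self._apply_cached_actions()        # hypothetical helper: update the global state
        self._fill_rewards()                # hypothetical helper: write self.rewards for all agents
    else:
        self._clear_rewards()               # no rewards until the whole timestep has finished
    self.agent_selection = self._agent_selector.next()
    self._accumulate_rewards()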

1.3 Additional Environment API

PettingZoo models games as Agent Environment Cycle (AEC) games, and thus can support any game multi-agent RL can consider, allowing for fantastically weird cases. Because of this, our API includes lower level functions and attributes that you probably won’t need but are very important when you do. Their functionality is used to implement the high-level functions above though, so including them is just a matter of code factoring.

agents

A list of the names of all current agents, typically integers. These may be changed as an environment progresses (i.e. agents can be added or removed).

  • Note that the agents list only stores the agents' names, not the agent objects themselves (i.e. not the class instances), and not even the agents' IDs (their indices in the list).

num_agents

The length of the agents list.

agent_selection

an attribute of the environment corresponding to the currently selected agent that an action can be taken for.

  • [ ] Still unclear; revisit this together with the code.

observation_space(agent)

a function that retrieves the observation space for a particular agent. This space should never change for a particular agent ID.

terminations

A dict of the termination state of every current agent at the time called, keyed by name. last() accesses this attribute. Note that agents can be added or removed from this dict. The returned dict looks like:

terminations = {0:[first agent's termination state], 1:[second agent's termination state] ... n-1:[nth agent's termination state]}
  • The keys of the terminations dict are agent names, and the values are True / False
  • That is, when an agent is done, its value is set to True

truncations

A dict of the truncation state of every current agent at the time called, keyed by name. last() accesses this attribute. Note that agents can be added or removed from this dict. The returned dict looks like:

truncations = {0:[first agent's truncation state], 1:[second agent's truncation state] ... n-1:[nth agent's truncation state]}
  • [ ] Q: Differences between terminations and truncations
  • [x] A: truncations is set to True when the number of executed timesteps exceeds the configured maximum; terminations is set to True for an agent when that agent is done.

In practice, if some agent's terminations or truncations entry is True but the action passed in is not None, an error is raised. This is simply because step() calls self._was_dead_step(action).

In fact, terminations and truncations cannot stop agent_iter by themselves. In a custom environment, if you want terminations and truncations to affect agent_iter and thereby end the game, you should call the API's self._was_dead_step(action) function inside step().


infos

A dict of info for each current agent, keyed by name. Each agent’s info is also a dict. Note that agents can be added or removed from this attribute. last() accesses this attribute. The returned dict looks like:

infos = {0:[first agent's info], 1:[second agent's info] ... n-1:[nth agent's info]}
  • The infos here play much the same role as the infos returned by Gym's step(): they are mainly used by the RL algorithm for tuning, and are not particularly important when writing a custom environment.

observe(agent)

Returns the observation an agent currently can make. last() calls this function.


rewards

A dict of the rewards of every current agent at the time called, keyed by name. These are the instantaneous rewards generated after the last step. Note that agents can be added or removed from this attribute. last() does not directly access this attribute, rather the returned reward is stored in an internal variable. The rewards structure looks like:

{0:[first agent's reward], 1:[second agent's reward] ... n-1:[nth agent's reward]}
  • [ ] Q: Which internal variable does last() store the returned reward in?
  • [ ] A: In a custom environment, you declare a self.rewards variable yourself to store the rewards (see the sketch below).
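A sketch of that bookkeeping, using the attribute and helper names that the example environment in section 2.1 relies on (self.rewards, self._cumulative_rewards, _clear_rewards() and _accumulate_rewards() all come from AECEnv):

# in reset():
self.rewards = {agent: 0 for agent in self.agents}              # per-step rewards, written in step()
self._cumulative_rewards = {agent: 0 for agent in self.agents}  # what last() actually reports

# at the end of step():
self._accumulate_rewards()   # folds self.rewards into self._cumulative_rewards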

seed(seed=None)

Reseeds the environment. reset() must be called after seed(), and before step().

  • Call order: seed() $\to$ reset() $\to$ step(), as in the snippet below.
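A minimal usage sketch of that ordering (note that newer PettingZoo versions drop seed() and pass the seed via reset(seed=...) instead):

from pettingzoo.butterfly import pistonball_v6

env = pistonball_v6.env()
env.seed(42)   # older API: seed first ...
env.reset()    # ... then reset ...
# ... and only then call env.step(action) inside the agent loop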

render()

Returns a rendered frame from the environment using the render mode specified at initialization. If the render mode is 'rgb_array', it returns a numpy array, while 'ansi' returns the strings printed. There is no need to call render() when using 'human' mode.

  • That is, if the render mode specified at initialization is "human", there is no need to call render() explicitly: the render window is displayed even without calling it.
  • In a custom environment, you need to call it yourself inside step(), for example:
if self.render_mode == "human":
    self.render()

close()

Closes the rendering window.

  • [ ] Q: Strictly speaking, does this close the rendering window, or close the whole game environment env?
  • [x] A: The comment in the official code says close "should release any graphical displays, subprocesses, network connections". In practice, when writing a custom environment that has no graphical window, subprocesses, or network connections, the body of this function can simply be pass; there is no need to deliberately "close the game environment env" after using it.

1.4 Optional API Components

While not required by the base API, most downstream wrappers and utilities depend on the following attributes and methods, and they should be added to new environments except in special circumstances where adding one or more is not possible.

possible_agents

A list of all possible_agents the environment could generate. Equivalent to the list of agents in the observation and action spaces. This cannot be changed through play or resetting.

  • [ ] Q: What is it used for?
  • [x] A: It is the list of allowed/legal agent names; the actual agents list generated when the environment is initialized can only be drawn from it.

max_num_agents

The length of the possible_agents list.

observation_spaces

A dict of the observation spaces of every agent, keyed by name. This cannot be changed through play or resetting.

action_spaces

A dict of the action spaces of every agent, keyed by name. This cannot be changed through play or resetting.

  • "Play or resetting":
    • play: while the game is in progress
    • resetting: reset()

state()

Returns a global observation of the current state of the environment. Not all environments will support this feature.

state_space

The space of a global observation of the environment. Not all environments will support this feature.


1.5 Notable Idioms

1.5.1 Checking if the entire environment is done

When an agent is terminated or truncated, it's removed from agents, so when the environment is done, agents will be an empty list. This means not env.agents is a simple condition for the environment being done.

  • In other words, you can use not env.agents to check for game over (see the sketch below).
  • In practice, the behavior "when an agent is terminated or truncated, it is removed from agents" has to be implemented by calling self._was_dead_step(action).
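A small sketch of this idiom, assuming env is any AEC environment and using random actions in place of a policy:

env.reset()
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    action = None if (termination or truncation) else env.action_space(agent).sample()
    env.step(action)
# agent_iter only stops once every agent has been removed, so at this point:
assert not env.agents   # True here means the whole environment is done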

1.5.2 Unwrapping an environment

If you have a wrapped environment, and you want to get the unwrapped environment underneath all the layers of wrappers (so that you can manually call a function or change some underlying aspect of the environment), you can use the .unwrapped attribute. If the environment is already a base environment, the .unwrapped attribute will just return itself.

base_env = knights_archers_zombies_v10.env().unwrapped
  • [ ] Q: Why wrap an environment in the first place? How?
  • [x] A: Not that important; sometimes you need a wrapper, e.g. to convert a parallel environment into a sequential (AEC) one.

1.5.3 Variable Numbers of Agents (Death)

Agents can die and generate during the course of an environment. If an agent dies, then its entry in the terminations dictionary is set to True, it becomes the next selected agent (or after another agent that is also terminated or truncated), and the action it takes is required to be None. After this vacuous step is taken, the agent will be removed from agents and other changeable attributes. Agent generation can just be done with appending it to agents and the other changeable attributes (with it already being in the possible agents and action/observation spaces), and transitioning to it at some point with agent_iter.

  • The rule here: after an agent dies, its entry in the terminations dict should be set to True, and the action it then takes must be None.

If the action passed to the env after an agent dies is not None, an error is raised. This comes from the self._was_dead_step(action) function.

  • Key point: "After this vacuous step is taken, the agent will be removed from agents and the other changeable attributes."

What if the rules allow agents to respawn after death? Then removing them outright can end the game prematurely: for example, at some timestep all agents are dead but all waiting to respawn, and yet the game is not actually over.

Solution: if the game allows respawning, do not set the agent's entry in the terminations dict to True when it dies.
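A hedged sketch of the two options above; allow_respawn and waiting_to_respawn are illustrative attribute names, not part of the PettingZoo API:

def handle_death(self, agent):
    if self.allow_respawn:
        # keep terminations[agent] == False, otherwise the agent is removed from
        # self.agents and an "everyone dead but waiting to respawn" state would
        # end the game prematurely
        self.waiting_to_respawn.add(agent)
    else:
        # the next time this agent is selected it must be stepped with action=None,
        # after which _was_dead_step() removes it from the agents list
        self.terminations[agent] = True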

1.5.4 Environment as an Agent

In certain cases, separating agent from environment actions is helpful for studying. This can be done by treating the environment as an agent. We encourage calling the environment actor env in env.agents, and having it take None as an action.

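A rough sketch of how that could look; the "env" actor and the _apply_environment_dynamics() helper below are purely illustrative, not part of any PettingZoo API:

from pettingzoo import AECEnv

class my_env(AECEnv):
    def __init__(self):
        # the environment itself is listed as an actor alongside the real agents
        self.possible_agents = ["player_0", "player_1", "env"]
        ...

    def step(self, action):
        if self.agent_selection == "env":
            assert action is None                # the env actor always takes None
            self._apply_environment_dynamics()   # hypothetical helper: chance events, physics, etc.
        else:
            ...                                   # normal agent transition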

1.6 Raw Environment

Environments are by default wrapped in a handful of lightweight wrappers that handle error messages and ensure reasonable behavior given incorrect usage (i.e. playing illegal moves or stepping before resetting). However, these add a very small amount of overhead. If you want to create an environment without them, you can do so by using the raw_env() constructor contained within each module:

env = knights_archers_zombies_v10.raw_env(<environment parameters>)

2 Environment Creation

This documentation overviews creating new environments and relevant useful wrappers, utilities and tests included in PettingZoo designed for the creation of new environments.


2.1 Example Custom Environment

This is a carefully commented version of the PettingZoo rock paper scissors environment.

  • The environment code is in Example_Custom_Environment.py
  • The code that uses this environment is in t_ece.py

'''
This is a carefully commented version of the PettingZoo rock paper scissors environment.
(AEC version, not the Parallel one.)
https://pettingzoo.farama.org/content/environment_creation/
'''
import functools

import gymnasium
import numpy as np
from gymnasium.spaces import Discrete

from pettingzoo import AECEnv
from pettingzoo.utils import agent_selector, wrappers

ROCK = 0
PAPER = 1
SCISSORS = 2
NONE = 3
MOVES = ["ROCK", "PAPER", "SCISSORS", "None"]
NUM_ITERS = 5  # maximum number of timesteps; originally 100
REWARD_MAP = {
    (ROCK, ROCK): (0, 0),
    (ROCK, PAPER): (-1, 1),
    (ROCK, SCISSORS): (1, -1),
    (PAPER, ROCK): (1, -1),
    (PAPER, PAPER): (0, 0),
    (PAPER, SCISSORS): (-1, 1),
    (SCISSORS, ROCK): (-1, 1),
    (SCISSORS, PAPER): (1, -1),
    (SCISSORS, SCISSORS): (0, 0),
}


def env(render_mode=None):
    """
    The env function often wraps the environment in wrappers by default.
    You can find full documentation for these methods
    elsewhere in the developer documentation.
    """
    internal_render_mode = render_mode if render_mode != "ansi" else "human"
    # if render_mode is "ansi", internal_render_mode becomes "human"
    # if render_mode is "human", internal_render_mode stays "human"
    # otherwise internal_render_mode = render_mode

    env = raw_env(render_mode=internal_render_mode)
    # This wrapper is only for environments which print results to the terminal
    if render_mode == "ansi":
        env = wrappers.CaptureStdoutWrapper(env)
    # this wrapper helps error handling for discrete action spaces
    env = wrappers.AssertOutOfBoundsWrapper(env)
    # Provides a wide variety of helpful user errors
    # Strongly recommended
    env = wrappers.OrderEnforcingWrapper(env)
    return env


class raw_env(AECEnv):
    """
    The metadata holds environment constants. From gymnasium, we inherit the "render_modes"
    metadata which specifies which modes can be put into the render() method.
    At least human mode should be supported.
    The "name" metadata allows the environment to be pretty printed.
    """

    metadata = {"render_modes": ["human"], "name": "rps_v2"}

    def __init__(self, render_mode=None):
        """
        The init method takes in environment arguments and
        should define the following attributes:
        - possible_agents
        - action_spaces
        - observation_spaces
        These attributes should not be changed after initialization.
        """
        self.possible_agents = ["player_" + str(r) for r in range(2)]
        # possible_agents holds the allowed agent names

        self.agent_name_mapping = dict(
            zip(self.possible_agents, list(range(len(self.possible_agents))))
        )
        # zip() pairs the names with indices to build a dict:
        # self.agent_name_mapping is {'player_0': 0, 'player_1': 1}

        # gymnasium spaces are defined and documented here: https://gymnasium.farama.org/api/spaces/
        self._action_spaces = {agent: Discrete(3) for agent in self.possible_agents}
        # self._action_spaces is {'player_0': Discrete(3), 'player_1': Discrete(3)}
        self._observation_spaces = {
            agent: Discrete(4) for agent in self.possible_agents
        }
        self.render_mode = render_mode

    # this cache ensures that same space object is returned for the same agent
    # allows action space seeding to work as expected
    @functools.lru_cache(maxsize=None)
    def observation_space(self, agent):
        # gymnasium spaces are defined and documented here: https://gymnasium.farama.org/api/spaces/
        return Discrete(4)

    @functools.lru_cache(maxsize=None)
    def action_space(self, agent):
        return Discrete(3)

    def render(self):
        """
        Renders the environment. In human mode, it can print to terminal, open
        up a graphical window, or open up some other display that a human can see and understand.
        """
        if self.render_mode is None:
            gymnasium.logger.warn(
                "You are calling render method without specifying any render mode."
            )
            return

        if len(self.agents) == 2:
            string = "Current state: Agent1: {} , Agent2: {}".format(
                MOVES[self.state[self.agents[0]]], MOVES[self.state[self.agents[1]]]
            )
        else:
            string = "Game over"
        print(string)

    def observe(self, agent):
        """
        Observe should return the observation of the specified agent. This function
        should return a sane observation (though not necessarily the most up to date possible)
        at any time after reset() is called.
        """
        # observation of one agent is the previous state of the other
        return np.array(self.observations[agent])

    def close(self):
        """
        Close should release any graphical displays, subprocesses, network connections
        or any other environment data which should not be kept around after the
        user is no longer using the environment.
        """
        pass

    def reset(self, seed=None, return_info=False, options=None):
        """
        Reset needs to initialize the following attributes
        - agents
        - rewards
        - _cumulative_rewards
        - terminations
        - truncations
        - infos
        - agent_selection
        And must set up the environment so that render(), step(), and observe()
        can be called without issues.
        Here it sets up the state dictionary which is used by step() and the observations dictionary which is used by step() and observe()
        """
        self.agents = self.possible_agents[:]
        # the [:] makes a copy instead of a reference to the original object, so that
        # modifying agents does not affect the original possible_agents
        self.rewards = {agent: 0 for agent in self.agents}
        self._cumulative_rewards = {agent: 0 for agent in self.agents}
        self.terminations = {agent: False for agent in self.agents}
        self.truncations = {agent: False for agent in self.agents}
        self.infos = {agent: {} for agent in self.agents}
        self.state = {agent: NONE for agent in self.agents}
        self.observations = {agent: NONE for agent in self.agents}
        self.num_moves = 0  # timestep counter
        """
        Our agent_selector utility allows easy cyclic stepping through the agents list.
        """
        self._agent_selector = agent_selector(self.agents)
        self.agent_selection = self._agent_selector.next()
        # I think this first next() is equivalent to reset(). What is the difference, and why next()?
        # My understanding: next() cycles through the agents list;
        # self._agent_selector is the iterator, while self.agent_selection is the "current agent"
        # (deduced from the second half of this code).
        # At initialization, self.agent_selection uses next() to point at agent 0;
        # reset() would also point at agent 0.

    def step(self, action):
        """
        step(action) takes in an action for the current agent (specified by
        agent_selection) and needs to update
        - rewards
        - _cumulative_rewards (accumulating the rewards)
        - terminations
        - truncations
        - infos
        - agent_selection (to the next agent)
        And any internal state used by observe() or render()
        """
        if (
            self.terminations[self.agent_selection]
            or self.truncations[self.agent_selection]
        ):
            # handles stepping an agent which is already dead
            # accepts a None action for the one agent, and moves the agent_selection to
            # the next dead agent, or if there are no more dead agents, to the next live agent
            self._was_dead_step(action)  # where is this defined? In the parent class (AECEnv)
            # this function removes the truncated or terminated agent from the agent list;
            # the corresponding code is roughly:
            '''
            del self.terminations[agent]
            del self.truncations[agent]
            del self.rewards[agent]
            del self._cumulative_rewards[agent]
            del self.infos[agent]
            self.agents.remove(agent)
            '''

            # Since the agent is dead anyway, one could simply set action to None right here.
            # However, this code never explicitly marks the game as over, so it is left as is.
            # In effect, self._was_dead_step(action) can be viewed as the function that ends the game.
            # Note that this code never sets terminations to True anywhere.
            return

        agent = self.agent_selection  # is this the next agent? No, it is the current agent.
        # This line could be dropped; it simply aliases self.agent_selection.

        # the agent which stepped last had its _cumulative_rewards accounted for
        # (because it was returned by last()), so the _cumulative_rewards for this
        # agent should start again at 0
        # (not entirely clear to me; last() is not written in this file, it comes from the parent class)
        self._cumulative_rewards[agent] = 0

        # stores action of current agent
        self.state[self.agent_selection] = action

        # collect reward if it is the last agent to act
        if self._agent_selector.is_last():  # the current agent is the last one, so update this timestep's rewards
            # in other words, this is what delimits one complete timestep
            # rewards for all agents are placed in the .rewards dictionary
            self.rewards[self.agents[0]], self.rewards[self.agents[1]] = REWARD_MAP[
                (self.state[self.agents[0]], self.state[self.agents[1]])
            ]

            self.num_moves += 1  # timestep counter
            # The truncations dictionary must be updated for all players.
            self.truncations = {
                agent: self.num_moves >= NUM_ITERS for agent in self.agents
            }
            # an agent is truncated when the number of moves reaches the maximum timestep;
            # i.e. truncation is what enforces the timestep limit here

            # observe the current state
            for i in self.agents:
                self.observations[i] = self.state[
                    self.agents[1 - self.agent_name_mapping[i]]
                ]
        else:  # the timestep is not finished yet
            # necessary so that observe() returns a reasonable observation at all times.
            self.state[self.agents[1 - self.agent_name_mapping[agent]]] = NONE
            # no rewards are allocated until both players give an action
            self._clear_rewards()

        # selects the next agent.
        self.agent_selection = self._agent_selector.next()
        # Adds .rewards to ._cumulative_rewards
        self._accumulate_rewards()

        if self.render_mode == "human":
            self.render()

# Test Example_Custom_Environment
# t_ece.py

import Example_Custom_Environment as ece

env = ece.env(render_mode='human')
env.reset()
# the number passed to agent_iter is the maximum total number of agent steps
for agent in env.agent_iter(1000):
    observation, reward, termination, truncation, info = env.last()
    # random sampling here; when plugging in an RL algorithm, replace this with the policy
    action = env.action_space(agent).sample()
    if truncation:
        action = None
    env.step(action)

2.2 Example Custom Parallel Environment

  • The environment code is in Example_Parallel_Environment.py
  • The code that uses this environment is in t_epe.py
'''
Parallel-environment version of the PettingZoo rock paper scissors example.
'''
import functools

import gymnasium
from gymnasium.spaces import Discrete

from pettingzoo import ParallelEnv
from pettingzoo.utils import parallel_to_aec, wrappers

ROCK = 0
PAPER = 1
SCISSORS = 2
NONE = 3
MOVES = ["ROCK", "PAPER", "SCISSORS", "None"]
NUM_ITERS = 10  # maximum number of timesteps; originally 100
REWARD_MAP = {
    (ROCK, ROCK): (0, 0),
    (ROCK, PAPER): (-1, 1),
    (ROCK, SCISSORS): (1, -1),
    (PAPER, ROCK): (1, -1),
    (PAPER, PAPER): (0, 0),
    (PAPER, SCISSORS): (-1, 1),
    (SCISSORS, ROCK): (-1, 1),
    (SCISSORS, PAPER): (1, -1),
    (SCISSORS, SCISSORS): (0, 0),
}


def env(render_mode=None):
    """
    The env function often wraps the environment in wrappers by default.
    You can find full documentation for these methods
    elsewhere in the developer documentation.
    """
    internal_render_mode = render_mode if render_mode != "ansi" else "human"
    env = raw_env(render_mode=internal_render_mode)
    # This wrapper is only for environments which print results to the terminal
    if render_mode == "ansi":
        env = wrappers.CaptureStdoutWrapper(env)
    # this wrapper helps error handling for discrete action spaces
    env = wrappers.AssertOutOfBoundsWrapper(env)
    # Provides a wide variety of helpful user errors
    # Strongly recommended
    env = wrappers.OrderEnforcingWrapper(env)
    return env


def raw_env(render_mode=None):
    """
    To support the AEC API, the raw_env() function just uses the from_parallel
    function to convert from a ParallelEnv to an AEC env
    """
    env = parallel_env(render_mode=render_mode)
    env = parallel_to_aec(env)
    return env


class parallel_env(ParallelEnv):
    metadata = {"render_modes": ["human"], "name": "rps_v2"}

    def __init__(self, render_mode=None):
        """
        The init method takes in environment arguments and should define the following attributes:
        - possible_agents
        - action_spaces
        - observation_spaces
        These attributes should not be changed after initialization.
        """
        self.possible_agents = ["player_" + str(r) for r in range(2)]
        self.agent_name_mapping = dict(
            zip(self.possible_agents, list(range(len(self.possible_agents))))
        )
        self.render_mode = render_mode

    # this cache ensures that same space object is returned for the same agent
    # allows action space seeding to work as expected
    @functools.lru_cache(maxsize=None)
    def observation_space(self, agent):
        # gymnasium spaces are defined and documented here: https://gymnasium.farama.org/api/spaces/
        return Discrete(4)

    @functools.lru_cache(maxsize=None)
    def action_space(self, agent):
        return Discrete(3)

    def render(self):
        """
        Renders the environment. In human mode, it can print to terminal, open
        up a graphical window, or open up some other display that a human can see and understand.
        """
        if self.render_mode is None:
            gymnasium.logger.warn(
                "You are calling render method without specifying any render mode."
            )
            return

        if len(self.agents) == 2:
            # the original example forgot to define self.state and assign to it;
            # that is fixed below in reset() and step()
            string = "Current state: Agent1: {} , Agent2: {}".format(
                MOVES[self.state[self.agents[0]]], MOVES[self.state[self.agents[1]]]
            )
        else:
            string = "Game over"
            # this only prints "Game over"; it does not actually close the game.
            # How do we exit the agent iterator? By removing every agent from the agents list,
            # and an agent can only be removed once it is truncated or terminated.
            # So when you want the game to end, set every agent's termination state to True.
        print(string)

    def close(self):
        """
        Close should release any graphical displays, subprocesses, network connections
        or any other environment data which should not be kept around after the
        user is no longer using the environment.
        """
        pass

    def reset(self, seed=None, return_info=False, options=None):
        """
        Reset needs to initialize the `agents` attribute and must set up the
        environment so that render(), and step() can be called without issues.
        Here it initializes the `num_moves` variable which counts the number of
        hands that are played.
        Returns the observations for each agent
        """
        self.agents = self.possible_agents[:]
        self.num_moves = 0
        observations = {agent: NONE for agent in self.agents}

        # the original example omitted state; add it here:
        self.state = {agent: NONE for agent in self.agents}

        if not return_info:
            return observations
        else:
            infos = {agent: {} for agent in self.agents}
            return observations, infos

    def step(self, actions):
        """
        step(action) takes in an action for each agent and should return the
        - observations
        - rewards
        - terminations
        - truncations
        - infos
        dicts where each dict looks like {agent_1: item_1, agent_2: item_2}
        """
        # If a user passes in actions with no agents, then just return empty observations, etc.
        if not actions:
            self.agents = []
            return {}, {}, {}, {}, {}

        # the original example omitted state; add it here:
        self.state = actions

        # rewards for all agents are placed in the rewards dictionary to be returned
        rewards = {}
        rewards[self.agents[0]], rewards[self.agents[1]] = REWARD_MAP[
            (actions[self.agents[0]], actions[self.agents[1]])
        ]

        terminations = {agent: False for agent in self.agents}

        self.num_moves += 1
        env_truncation = self.num_moves >= NUM_ITERS
        truncations = {agent: env_truncation for agent in self.agents}

        # current observation is just the other player's most recent action
        observations = {
            self.agents[i]: int(actions[self.agents[1 - i]])
            for i in range(len(self.agents))
        }

        # typically there won't be any information in the infos, but there must
        # still be an entry for each agent
        infos = {agent: {} for agent in self.agents}

        if env_truncation:
            self.agents = []

        if self.render_mode == "human":
            self.render()

        return observations, rewards, terminations, truncations, infos

# Test the parallel environment Example_Parallel_Environment
# t_epe.py

import Example_Parallel_Environment as epe

parallel_env = epe.parallel_env(render_mode='human')
observations = parallel_env.reset()

while parallel_env.agents:
    # this is where you would insert your policy
    actions = {agent: parallel_env.action_space(agent).sample() for agent in parallel_env.agents}
    observations, rewards, terminations, truncations, infos = parallel_env.step(actions)

2.3 Using Wrappers

A wrapper is an environment transformation that takes in an environment as input, and outputs a new environment that is similar to the input environment, but with some transformation or validation applied. PettingZoo provides wrappers to convert environments back and forth between the AEC API and the Parallel API, and a set of simple utility wrappers which provide input validation and other convenient reusable logic. PettingZoo also includes wrappers via the SuperSuit companion package (pip install supersuit).

  • SuperSuit is rarely needed in practice. The AEC/Parallel conversion wrappers are sketched below.
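For reference, the conversion wrappers mentioned above can be used roughly as follows (depending on your PettingZoo version they are importable from pettingzoo.utils or pettingzoo.utils.conversions; converting AEC to Parallel assumes every agent acts once per cycle):

from pettingzoo.utils import aec_to_parallel, parallel_to_aec
from pettingzoo.butterfly import pistonball_v6

aec_env = parallel_to_aec(pistonball_v6.parallel_env())   # Parallel -> AEC
par_env = aec_to_parallel(pistonball_v6.env())            # AEC -> Parallel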

2.4 Developer Utils

The utils directory contains a few functions which are helpful for debugging environments. These are documented in the API docs.

The utils directory also contain some classes which are only helpful for developing new environments. These are documented below.


2.4.1 Agent selector

The agent_selector class steps through agents in a cycle

It can be used as follows to cycle through the list of agents:

from pettingzoo.utils import agent_selector
agents = ["agent_1", "agent_2", "agent_3"]
selector = agent_selector(agents)
agent_selection = selector.reset()
# agent_selection will be "agent_1"
for i in range(100):
    agent_selection = selector.next()
    # will select "agent_2", "agent_3", "agent_1", "agent_2", "agent_3", ...
  • The agent_selector class keeps cycling through the agents list indefinitely.

2.4.2 Deprecated Module

The DeprecatedModule is used in PettingZoo to help guide the user away from old obsolete environment versions and toward new ones. If you wish to create a similar versioning system, this may be helpful.

For example, when the user tries to import the knights_archers_zombies_v0 environment, they import the following variable (defined in pettingzoo/butterfly/__init__.py):

from pettingzoo.utils.deprecated_module import DeprecatedModule
knights_archers_zombies_v0 = DeprecatedModule("knights_archers_zombies", "v0", "v10")

This declaration tells the user that knights_archers_zombies_v0 is deprecated and knights_archers_zombies_v10 should be used instead. In particular, it gives the following error:

from pettingzoo.butterfly import knights_archers_zombies_v0
knights_archers_zombies_v0.env()
# pettingzoo.utils.deprecated_module.DeprecatedEnv: knights_archers_zombies_v0 is now deprecated, use knights_archers_zombies_v10 instead

3 Testing Environment

PettingZoo has a number of compliance tests for environments. If you are adding a new environment, we encourage you to run these tests on your own environment.


3.1 API Test

PettingZoo’s API has a number of features and requirements. To make sure your environment is consistent with the API, we have the api_test. Below is an example:

from pettingzoo.test import api_test
from pettingzoo.butterfly import pistonball_v6
env = pistonball_v6.env()
api_test(env, num_cycles=1000, verbose_progress=False)

As you can tell, you simply pass an environment to the test. The test will assert or give some other error on an API issue, and will return normally if it passes.

The optional arguments are:

num_cycles

runs the environment for that many cycles and checks that the output is consistent with the API.

verbose_progress

Prints out messages to indicate partial completion of the test. Useful for debugging environments.

  • The point of this API test: after writing a custom environment, run api_test() on it to check that the environment conforms to the API specification.

3.3 Seed Test

To have a properly reproducible environment that utilizes randomness, you need to be able to make it deterministic during evaluation by setting a seed for the random number generator that defines the random behavior. The seed test checks that calling the seed() method with a constant actually makes the environment deterministic.

The seed test takes in a function that creates a pettingzoo environment. For example

from pettingzoo.test import seed_test, parallel_seed_test
from pettingzoo.butterfly import pistonball_v6
env_fn = pistonball_v6.env
seed_test(env_fn, num_cycles=10, test_kept_state=True)
# or for parallel environments
parallel_env_fn = pistonball_v6.parallel_env
parallel_seed_test(parallel_env_fn, num_cycles=10, test_kept_state=True)

Internally, there are two separate tests.

1.Do two separate environments give the same result after the environment is seeded?

2.Does a single environment give the same result after seed() then reset() is called?

The first optional argument, num_cycles, indicates how long the environment will be run to check for determinism. Some environments only fail the test long after initialization.

The second optional argument, test_kept_state, allows the user to disable the second test. Some physics-based environments fail this test due to barely detectable differences caused by caches etc., which are not important enough to matter.

  • Still in doubt; to be verified.

3.4 Max Cycles Test

The max cycles test tests that the max_cycles environment argument exists and the resulting environment actually runs for the correct number of cycles. If your environment does not take a max_cycles argument, you should not run this test. The reason this test exists is that many off-by-one errors are possible when implementing max_cycles. An example test usage looks like:

from pettingzoo.test import max_cycles_test
from pettingzoo.butterfly import pistonball_v6
env = pistonball_v6.env()
max_cycles_test(env)

3.5 Render Test

The render test checks that rendering 1) does not crash and 2) produces output of the correct type when given a mode (only supports ‘human’ , ‘ansi’, and ‘rgb_array’ modes).

from pettingzoo.test import render_test
from pettingzoo.butterfly import pistonball_v6
env = pistonball_v6.env()
render_test(env)

The render test method takes in an optional argument custom_tests that allows for additional tests in non-standard modes.

custom_tests = {
    "svg": lambda render_result: isinstance(render_result, str)
}
render_test(env, custom_tests=custom_tests)
  • "Non-standard modes" refers to render modes other than 'human', 'ansi', and 'rgb_array'.

3.6 Performance Benchmark Test

To make sure we do not have performance regressions, we have the performance benchmark test. This test simply prints out the number of steps and cycles that the environment takes in 5 seconds. This test requires manual inspection of its outputs:

from pettingzoo.test import performance_benchmark
from pettingzoo.butterfly import pistonball_v6
env = pistonball_v6.env()
performance_benchmark(env)
  • This should be the number of timesteps that can be executed in 5 seconds; more means better performance.
  • The actual output is timesteps per second and cycles per second.

3.7 Save Observation Test

The save observation test is to visually inspect the observations of games with graphical observations to make sure they are what is intended. We have found that observations are a huge source of bugs in environments, so it is good to manually check them when possible. This test just tries to save the observations of all the agents. If it fails, then it just prints a warning. The output needs to be visually inspected for correctness.

from pettingzoo.test import test_save_obs
from pettingzoo.butterfly import pistonball_v6
env = pistonball_v6.env()
test_save_obs(env)
  • In practice this only saves a single timestep (one frame); each image corresponds to a single agent's observation.