Preface
OpenAI Gym: only supports single-agent environments
PettingZoo: focuses on multi-agent environments
1 Basic Usage
1.1 Initializing Environments
Using environments in PettingZoo is very similar to using them in Gymnasium. You initialize an environment via:
```python
from pettingzoo.butterfly import pistonball_v6

env = pistonball_v6.env()
```
Environments are generally highly configurable via arguments at creation, i.e.:
```python
cooperative_pong.env(ball_speed=18, left_paddle_speed=25)
```
1.2 Interacting With Environments
Environments can be interacted with using a similar interface to Gymnasium:
```python
env.reset()
```
- agent is actually the agent's name (a string in the code): the loop variable yielded by agent_iter() in the full interaction loop shown below.
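For reference, the basic interaction loop from the PettingZoo documentation looks roughly like this (the random action sampling is just a stand-in for a real policy):

```python
from pettingzoo.butterfly import pistonball_v6

env = pistonball_v6.env(render_mode="human")
env.reset(seed=42)

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()

    if termination or truncation:
        action = None  # a done agent may only submit None
    else:
        action = env.action_space(agent).sample()  # insert your policy here

    env.step(action)

env.close()
```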
The commonly used methods are:
`agent_iter(max_iter=2**63)`
returns an iterator that yields the current agent of the environment. It terminates when all agents in the environment are done or when max_iter steps have been executed.
- In practice, the iterator stops if and only if the agents list (i.e. agents, which is really just a list of strings) is empty.
In other words, in a custom environment, if you want the game to be over (whether because all agents are done or because max_iter timesteps have been executed), you must empty the agents list.
In fact, the truncation and termination variables by themselves do not stop the iterator. In a custom environment, if you want these two variables to affect the iterator so that the game ends properly, you must call self._was_dead_step(action) inside step(), as sketched below.
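A minimal sketch of that pattern inside a custom step(), assuming the standard AECEnv attributes (terminations, truncations, agent_selection):

```python
def step(self, action):
    agent = self.agent_selection
    if self.terminations[agent] or self.truncations[agent]:
        # the agent is already done: the only legal action is None;
        # _was_dead_step() removes it from self.agents and advances agent_selection
        self._was_dead_step(action)
        return
    # ... normal transition logic for live agents goes here ...
```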
`last(observe=True)`
returns the observation, reward, termination, truncation, and info for the agent currently able to act. The returned reward is the cumulative reward that the agent has received since it last acted. If observe is set to False, the observation will not be computed, and None will be returned in its place. Note that a single agent being done does not imply the environment is done.
- Equivalent to step() in a single-agent Gym environment, except that last() does not take an action as input; it only returns (observation, reward, termination, truncation, info).
`reset()`
resets the environment and sets it up for use when called the first time.
`step(action)`
accepts and executes the action of the current agent in the environment, then automatically switches control to the next agent.
[ ] Q: The example code contains only a single for loop, over agent_iter — how does that implement the loop over timesteps?
[x] A: agent_iter keeps cycling: it iterates from the 0th agent to the nth agent and then starts over from the 0th agent, until the agents list is empty.
The key to the timestep loop is step(), which is quite different from step() in a single-agent Gym environment.
In PettingZoo, step() only takes an action as input and does not return (observation, reward, termination, truncation, info).
In other words:
- Gym's step() == PettingZoo's last() + step()
In actual coding:
- You can update the (environment) state right after each agent submits its action, e.g. for turn-based games such as chess (first player, then second player).
- Or you can update the state only after all agents have submitted their actions for the round (i.e. after one complete timestep), e.g. rock-paper-scissors; this approach suits the Parallel API.
The code given here uses the Agent Environment Cycle (AEC) model; handling per-timestep updates with AEC is more cumbersome than with the Parallel API, but still possible: use if self._agent_selector.is_last(): (true when agent_iter has reached the last agent, i.e. the end of one timestep) to delimit a complete timestep, as sketched below.
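A rough sketch of that is_last() pattern (the _resolve_round() helper is hypothetical; the rewards bookkeeping follows the AECEnv conventions):

```python
def step(self, action):
    agent = self.agent_selection
    self._cumulative_rewards[agent] = 0  # this agent is about to act again
    self.state[agent] = action           # remember this agent's move

    if self._agent_selector.is_last():
        # the last agent of the cycle just acted: one full timestep is complete,
        # so resolve the joint action and hand out rewards to everyone
        self.rewards = self._resolve_round(self.state)  # hypothetical helper
        self.num_moves += 1
    else:
        # mid-timestep: no rewards yet
        self._clear_rewards()

    self.agent_selection = self._agent_selector.next()
    self._accumulate_rewards()
```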
1.3 Additional Environment API
PettingZoo models games as Agent Environment Cycle (AEC) games, and thus can support any game multi-agent RL can consider, allowing for fantastically weird cases. Because of this, our API includes lower level functions and attributes that you probably won’t need but are very important when you do. Their functionality is used to implement the high-level functions above though, so including them is just a matter of code factoring.
`agents`
A list of the names of all current agents, typically integers. These may be changed as an environment progresses (i.e. agents can be added or removed).
- Note: the agents list only stores the agents' names, not the agent objects themselves (i.e. not class instances), and not even the agents' IDs (their indices in the agents list).
`num_agents`
The length of the agents list.
`agent_selection`
an attribute of the environment corresponding to the currently selected agent that an action can be taken for.
- [ ] ==Still unclear — revisit together with the code.==
`observation_space(agent)`
a function that retrieves the observation space for a particular agent. This space should never change for a particular agent ID.
`terminations`
A dict of the termination state of every current agent at the time called, keyed by name. last() accesses this attribute. Note that agents can be added to or removed from this dict. The returned dict looks like:
```
terminations = {0: [first agent's termination state], 1: [second agent's termination state], ..., n-1: [nth agent's termination state]}
```
- The keys of the terminations dict are the agent names and the values are True/False; when an agent is done, its value is set to True.
`truncations`
A dict of the truncation state of every current agent at the time called, keyed by name. last() accesses this attribute. Note that agents can be added or removed from this dict. The returned dict looks like:
```
truncations = {0: [first agent's truncation state], 1: [second agent's truncation state], ..., n-1: [nth agent's truncation state]}
```
- [ ] Q: What is the difference between terminations and truncations?
- [x] A: truncations is set to True when the number of executed timesteps exceeds the configured limit; terminations is set to True when the corresponding agent is done.
In actual coding, if an agent's terminations or truncations entry is True but the action passed in is not None, an error is raised — simply because step() calls self._was_dead_step(action).
In fact, terminations and truncations by themselves cannot stop agent_iter. In a custom environment, if you want terminations and truncations to affect agent_iter and thus end the game properly, you should call the API's self._was_dead_step(action) function inside step().
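For completeness, a sketch of how a custom step() typically fills these two dicts (max_cycles and the game_over flag are assumptions of the sketch):

```python
# at the end of a full timestep inside step():
self.num_moves += 1
env_truncation = self.num_moves >= self.max_cycles            # time limit reached
self.truncations = {agent: env_truncation for agent in self.agents}
if game_over:                                                  # hypothetical game-specific condition
    self.terminations = {agent: True for agent in self.agents}
```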
`infos`
A dict of info for each current agent, keyed by name. Each agent's info is also a dict. Note that agents can be added to or removed from this attribute. last() accesses this attribute. The returned dict looks like:
```
infos = {0: [first agent's info], 1: [second agent's info], ..., n-1: [nth agent's info]}
```
- The infos here plays a similar role to the infos returned by Gym's step(): it carries auxiliary information used for tuning/debugging RL training, and it is not very important for a custom environment.
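In a custom environment it is usually just initialized to empty dicts, with any diagnostics you like added later (the key below is purely illustrative):

```python
self.infos = {agent: {} for agent in self.agents}
self.infos[self.agent_selection]["round"] = self.num_moves  # illustrative diagnostic entry
```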
`observe(agent)`
Returns the observation the agent can currently make. last() calls this function.
`rewards`
A dict of the rewards of every current agent at the time called, keyed by name. Contains the instantaneous reward generated after the last step. Note that agents can be added to or removed from this attribute. last() does not directly access this attribute; rather, the returned reward is stored in an internal variable. The rewards structure looks like:
```
{0: [first agent's reward], 1: [second agent's reward], ..., n-1: [nth agent's reward]}
```
- [ ] Q: last() stores the returned reward in an internal variable — which internal variable?
- [ ] A: In a custom environment you declare a self.rewards variable yourself to store the instantaneous rewards; the internal variable that last() actually reads is the base class's _cumulative_rewards dict, which _accumulate_rewards() fills from self.rewards.
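A sketch of how the two fit together inside a custom step() (the reward value is illustrative):

```python
self.rewards[agent] = 1        # instantaneous reward for this round
self._accumulate_rewards()     # folds self.rewards into self._cumulative_rewards
# env.last() then returns the accumulated value from self._cumulative_rewards[agent]
```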
`seed(seed=None)`
Reseeds the environment. reset() must be called after seed(), and before step().
- Call order: seed() $\to$ reset() $\to$ step()
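A minimal usage sketch of that ordering (newer PettingZoo versions fold seeding into reset(seed=...), but this matches the order described above):

```python
env.seed(42)      # reseed the environment's RNG first
env.reset()       # then reset
env.step(action)  # and only then start stepping (action comes from your policy)
```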
`render()`
Returns a rendered frame from the environment, using the render mode specified at initialization. If the render mode is 'rgb_array', it returns a numpy array, while with 'ansi' it returns the strings printed. There is no need to call render() in human mode.
- That is: if the render mode specified at initialization is human, there is no need to call render() separately, because the render window is displayed anyway in human mode.
- In a custom environment, you need to call it yourself inside step(), e.g.:
```python
if self.render_mode == "human":
    self.render()
```
`close()`
Closes the rendering window.
- [ ] Q: Strictly speaking, does this close the rendering window, or close the whole game environment env?
- [x] A: The comments in the official code say that close "should release any graphical displays, subprocesses, network connections". In practice, when writing a custom environment, if there is no display window, subprocess, or network connection, the body of this function can simply be pass; there is no need to deliberately "close the game environment env" after using it.
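A sketch of such a no-op close() for a simple custom environment:

```python
def close(self):
    # nothing was opened by this environment, so there is nothing to release
    pass
```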
1.4 Optional API Components
While not required by the base API, most downstream wrappers and utilities depend on the following attributes and methods, and they should be added to new environments except in special circumstances where adding one or more is not possible.
`possible_agents`
A list of all possible agents the environment could generate. Equivalent to the list of agents in the observation and action spaces. This cannot be changed through play or resetting.
- [ ] Q: What is it for?
- [x] A: It is the set of allowed/legal agent names; the actual agents list generated when the environment is initialized can only be drawn from it.
`max_num_agents`
The length of the possible_agents list.
`observation_spaces`
A dict of the observation spaces of every agent, keyed by name. This cannot be changed through play or resetting.
`action_spaces`
A dict of the action spaces of every agent, keyed by name. This cannot be changed through play or resetting.
- "Play or resetting":
  - play: while the game is in progress
  - resetting: reset()
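A typical way to declare these in a custom environment's __init__() (the agent names and space sizes are illustrative):

```python
from gymnasium.spaces import Discrete

def __init__(self):
    self.possible_agents = ["player_0", "player_1"]   # legal agent names
    # per-agent spaces; these must stay fixed for the lifetime of the environment
    self.observation_spaces = {a: Discrete(4) for a in self.possible_agents}
    self.action_spaces = {a: Discrete(3) for a in self.possible_agents}
```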
`state()`
Returns a global observation of the current state of the environment. Not all environments will support this feature.
`state_space`
The space of a global observation of the environment. Not all environments will support this feature.
1.5 Notable Idioms
1.5.1 Checking if the entire environment is done
When an agent is terminated or truncated, it is removed from agents, so when the environment is done, agents will be an empty list. This means not env.agents is a simple condition for the environment being done.
- That is: you can use not env.agents to detect game over.
- In actual coding, the behavior "when an agent is terminated or truncated, it is removed from agents" is provided by the self._was_dead_step(action) function.
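For example, when driving the environment manually instead of via agent_iter(), the end of the game can be detected like this (sketch):

```python
env.reset()
while env.agents:  # an empty agents list means the environment is done
    agent = env.agent_selection
    observation, reward, termination, truncation, info = env.last()
    action = None if termination or truncation else env.action_space(agent).sample()
    env.step(action)
```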
1.5.2 Unwrapping an environment
If you have a wrapped environment and you want to get the unwrapped environment underneath all the layers of wrappers (so that you can manually call a function or change some underlying aspect of the environment), you can use the .unwrapped attribute. If the environment is already a base environment, the .unwrapped attribute will just return itself.
```python
base_env = knights_archers_zombies_v10.env().unwrapped
```
- [ ] Q: Why wrap an env in the first place? How is it wrapped?
- [x] A: Not that important; sometimes you need to turn a parallel environment into a sequential (AEC) one, and that requires wrapping.
1.5.3 Variable Numbers of Agents (Death)
Agents can die and be generated during the course of an environment. If an agent dies, its entry in the terminations dictionary is set to True, it becomes the next selected agent (or the next after other agents that are also terminated or truncated), and the action it takes is required to be None. After this vacuous step is taken, the agent is removed from agents and the other changeable attributes. Agent generation can be done by simply appending it to agents and the other changeable attributes (it must already be in possible_agents and the action/observation spaces), and transitioning to it at some point with agent_iter.
- The rule here: after an agent dies, its entry in the terminations dictionary should be set to True, and the action it submits must be None.
If a dead agent passes an action other than None to the env, an error is raised — this comes from the self._was_dead_step(action) function.
- Key point: "after this vacuous step is taken, the agent is removed from agents and the other changeable attributes".
What if the game allows respawning after death? Removing the agent immediately could end the game prematurely — for example, at some timestep all agents are dead but all of them are waiting to respawn, and the game is not actually over.
Solution: if the game allows respawning, do not set the agent's terminations entry to True when it dies.
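A sketch of spawning (or respawning) an agent mid-episode, following the bookkeeping described above (the agent name is illustrative and must already be in possible_agents):

```python
new_agent = "player_2"                      # illustrative name from self.possible_agents
self.agents.append(new_agent)
self.terminations[new_agent] = False
self.truncations[new_agent] = False
self.rewards[new_agent] = 0
self._cumulative_rewards[new_agent] = 0
self.infos[new_agent] = {}
# agent_iter hands control to the new agent once agent_selection reaches it
```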
1.5.4 Environment as an Agent
In certain cases, separating agent from environment actions is helpful for studying. This can be done by treating the environment as an agent. We encourage calling the environment actor env in env.agents, and having it take None as an action.
1.6 Raw Environment
Environments are by default wrapped in a handful of lightweight wrappers that handle error messages and ensure reasonable behavior given incorrect usage (i.e. playing illegal moves or stepping before resetting). However, these add a very small amount of overhead. If you want to create an environment without them, you can do so by using the raw_env() constructor contained within each module:
```
env = knights_archers_zombies_v10.raw_env(<environment parameters>)
```
2 Environment Creation
This documentation overviews creating new environments and relevant useful wrappers, utilities and tests included in PettingZoo designed for the creation of new environments.
2.1 Example Custom Environment
This is a carefully commented version of the PettingZoo rock paper scissors environment.
- The environment code is in Example_Custom_Environment.py
- The code that uses this environment is in t_ece.py
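Since the full file is not reproduced here, the following is a compact sketch of what such an AEC environment looks like, modeled on the documented rock-paper-scissors example (the module/class names, payoff logic, NO_MOVE value, and 10-round limit are illustrative; it pulls together the step()/reset() fragments sketched earlier):

```python
import functools

import numpy as np
from gymnasium.spaces import Discrete

from pettingzoo import AECEnv
from pettingzoo.utils import agent_selector, wrappers

NO_MOVE = 3  # observation value meaning "the other agent has not moved yet"


def env(render_mode=None):
    # wrap the raw environment in the standard utility wrappers
    e = raw_env(render_mode=render_mode)
    e = wrappers.AssertOutOfBoundsWrapper(e)
    e = wrappers.OrderEnforcingWrapper(e)
    return e


class raw_env(AECEnv):
    metadata = {"render_modes": ["human"], "name": "rps_sketch_v0"}

    def __init__(self, render_mode=None):
        self.possible_agents = ["player_0", "player_1"]
        self.render_mode = render_mode

    @functools.lru_cache(maxsize=None)
    def observation_space(self, agent):
        return Discrete(4)  # 0-2: the opponent's last move, 3: no move yet

    @functools.lru_cache(maxsize=None)
    def action_space(self, agent):
        return Discrete(3)  # rock, paper, scissors

    def observe(self, agent):
        return np.array(self.observations[agent])

    def reset(self, seed=None, options=None):
        self.agents = self.possible_agents[:]
        self.rewards = {a: 0 for a in self.agents}
        self._cumulative_rewards = {a: 0 for a in self.agents}
        self.terminations = {a: False for a in self.agents}
        self.truncations = {a: False for a in self.agents}
        self.infos = {a: {} for a in self.agents}
        self.state = {a: NO_MOVE for a in self.agents}
        self.observations = {a: NO_MOVE for a in self.agents}
        self.num_moves = 0
        self._agent_selector = agent_selector(self.agents)
        self.agent_selection = self._agent_selector.next()

    def step(self, action):
        if self.terminations[self.agent_selection] or self.truncations[self.agent_selection]:
            self._was_dead_step(action)  # dead agents must pass None
            return
        agent = self.agent_selection
        self._cumulative_rewards[agent] = 0
        self.state[agent] = action

        if self._agent_selector.is_last():
            # both moves are in: score the round (zero-sum rock-paper-scissors payoff)
            a0, a1 = self.state[self.agents[0]], self.state[self.agents[1]]
            r0 = 0 if a0 == a1 else (1 if (a0 - a1) % 3 == 1 else -1)
            self.rewards[self.agents[0]], self.rewards[self.agents[1]] = r0, -r0
            self.num_moves += 1
            self.truncations = {a: self.num_moves >= 10 for a in self.agents}
            # each agent observes the move the other agent just played
            self.observations = {
                self.agents[i]: self.state[self.agents[1 - i]] for i in range(2)
            }
        else:
            self._clear_rewards()

        self.agent_selection = self._agent_selector.next()
        self._accumulate_rewards()

    def render(self):
        if self.render_mode == "human":
            print(self.state)

    def close(self):
        pass
```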
2.2 Example Custom Parallel Environment
- The environment code is in Example_Parallel_Environment.py
- The code that uses this environment is in t_epe.py
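With the Parallel API, every live agent submits an action for the same timestep in a single dict; a usage sketch (the parallel_env() constructor is assumed to come from Example_Parallel_Environment.py):

```python
env = parallel_env(render_mode="human")
observations = env.reset(seed=42)  # newer PettingZoo versions return (observations, infos) here

while env.agents:
    # one action per live agent, all applied in the same timestep
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)

env.close()
```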
2.3 Using Wrappers
A wrapper is an environment transformation that takes in an environment as input, and outputs a new environment that is similar to the input environment, but with some transformation or validation applied. PettingZoo provides wrappers to convert environments back and forth between the AEC API and the Parallel API, plus a set of simple utility wrappers which provide input validation and other convenient reusable logic. PettingZoo also includes wrappers via the SuperSuit companion package (pip install supersuit).
- SuperSuit is generally not needed here.
2.4 Developer Utils
The utils directory contains a few functions which are helpful for debugging environments. These are documented in the API docs.
The utils directory also contain some classes which are only helpful for developing new environments. These are documented below.
2.4.1 Agent selector
The agent_selector class steps through agents in a cycle.
It can be used as follows to cycle through the list of agents:
```python
from pettingzoo.utils import agent_selector
```
- i.e. the agent_selector class cycles through the agents over and over, as shown in the sketch below.
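A usage sketch (the agent names are illustrative):

```python
from pettingzoo.utils import agent_selector

agents = ["player_0", "player_1", "player_2"]
selector = agent_selector(agents)

agent_selection = selector.reset()   # "player_0"
agent_selection = selector.next()    # "player_1"
agent_selection = selector.next()    # "player_2"
print(selector.is_last())            # True: one full cycle (timestep) is complete
agent_selection = selector.next()    # wraps around to "player_0" again
```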
2.4.2 Deprecated Module
The DeprecatedModule is used in PettingZoo to help guide the user away from old obsolete environment versions and toward new ones. If you wish to create a similar versioning system, this may be helpful.
For example, when the user tries to import the knights_archers_zombies_v0 environment, they import the following variable (defined in pettingzoo/butterfly/__init__.py):
```python
from pettingzoo.utils.deprecated_module import DeprecatedModule

knights_archers_zombies_v0 = DeprecatedModule("knights_archers_zombies", "v0", "v10")
```
This declaration tells the user that knights_archers_zombies_v0 is deprecated and knights_archers_zombies_v10 should be used instead. In particular, it gives the following error:
```python
from pettingzoo.butterfly import knights_archers_zombies_v0

knights_archers_zombies_v0.env()  # raises an error telling you to use knights_archers_zombies_v10 instead
```
3 Testing Environment
PettingZoo has a number of compliance tests for environments. If you are adding a new environment, we encourage you to run these tests on your own environment.
3.1 API Test
PettingZoo’s API has a number of features and requirements. To make sure your environment is consistent with the API, we have the api_test. Below is an example:
```python
from pettingzoo.test import api_test
api_test(env, num_cycles=10, verbose_progress=False)
```
As you can tell, you simply pass an environment to the test. The test will assert or give some other error on an API issue, and will return normally if it passes.
The optional arguments are:
`num_cycles`
runs the environment for that many cycles and checks that the output is consistent with the API.
`verbose_progress`
Prints out messages to indicate partial completion of the test. Useful for debugging environments.
- The point of this API test: after writing my own (custom) environment, I can use the api_test() function to check whether the environment conforms to the API.
3.3 Seed Test
To have a properly reproducible environment that utilizes randomness, you need to be able to make it deterministic during evaluation by setting a seed for the random number generator that defines the random behavior. The seed test checks that calling the seed() method with a constant actually makes the environment deterministic.
The seed test takes in a function that creates a PettingZoo environment. For example:
```python
from pettingzoo.test import seed_test, parallel_seed_test
seed_test(env_fn, num_cycles=10)  # env_fn is a function that constructs the environment
```
Internally, there are two separate tests.
1.Do two separate environments give the same result after the environment is seeded?
2.Does a single environment give the same result after seed() then reset() is called?
The first optional argument, num_cycles, indicates how long the environment will be run to check for determinism. Some environments only fail the test long after initialization.
The second optional argument, test_kept_state, allows the user to disable the second test. Some physics-based environments fail this test due to barely detectable differences caused by caches etc., which are not important enough to matter.
- ==Not entirely sure about this.==
3.4 Max Cycles Test
The max cycles test tests that the max_cycles environment argument exists and the resulting environment actually runs for the correct number of cycles. If your environment does not take a max_cycles argument, you should not run this test. The reason this test exists is that many off-by-one errors are possible when implementing max_cycles. An example test usage looks like:
```python
from pettingzoo.test import max_cycles_test
max_cycles_test(env_module)  # takes the environment module, e.g. pistonball_v6
```
3.5 Render Test
The render test checks that rendering 1) does not crash and 2) produces output of the correct type when given a mode (only supports ‘human’ , ‘ansi’, and ‘rgb_array’ modes).
```python
from pettingzoo.test import render_test
render_test(env_fn)  # env_fn constructs the environment
```
The render test method takes in an optional argument custom_tests that allows for additional tests in non-standard modes.
```python
custom_tests = {"svg": lambda render_result: isinstance(render_result, str)}
render_test(env_fn, custom_tests=custom_tests)
```
- "Non-standard modes" means render modes other than 'human', 'ansi', and 'rgb_array'.
3.6 Performance Benchmark Test
To make sure we do not have performance regressions, we have the performance benchmark test. This test simply prints out the number of steps and cycles that the environment takes in 5 seconds. This test requires manual inspection of its outputs:
```python
from pettingzoo.test import performance_benchmark
performance_benchmark(env)
```
- This presumably means the number of timesteps that can be executed within 5 seconds; the more, the better the performance.
- The actual output is timesteps per second and cycles per second.
3.7 Save Observation Test
The save observation test is to visually inspect the observations of games with graphical observations to make sure they are what is intended. We have found that observations are a huge source of bugs in environments, so it is good to manually check them when possible. This test just tries to save the observations of all the agents. If it fails, then it just prints a warning. The output needs to be visually inspected for correctness.
```python
from pettingzoo.test import test_save_obs
test_save_obs(env)
```
- In practice, it only saves a single timestep (one frame); each saved image is a single agent's observation.