Preface

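Original article link

Code link

Reference blog

Use PettingZoo game environments to implement reinforcement learning algorithms (including CleanRL's PPO and Tianshou's DQN);
Create custom environments with PettingZoo.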

1 CleanRL: Implementing PPO

2 Tianshou: Basic API Usage

3 Tianshou: Training Agents

4 Tianshou: CLI and Logging

5 (WIP) Creating Environments: Repository Structure

5.1 Introduction

Welcome to the first of five short tutorials, guiding you through the process of creating your own PettingZoo environment, from conception to deployment.

Note: there are actually only four tutorials.

We will be creating a parallel environment, meaning that each agent acts simultaneously.

Before thinking about the environment logic, we should understand the structure of environment repositories.

5.2 Tree Structure

Environment repositories are usually laid out using the following structure:


Custom-Environment
├── custom-environment
│   ├── custom_environment_v0.py
│   └── env
│       └── custom_environment.py
├── README.md
└── requirements.txt

/custom-environment/env

is where your environment will be stored, along with any helper functions (in the case of a complicated environment).

/custom-environment/custom_environment_v0.py

is a file that imports the environment; the file name is used for environment version control. In other words, this is the file in which the custom environment gets imported.

/requirements.txt

is a file used to keep track of your environment dependencies. At the very least, pettingzoo should be in there. Please version control all your dependencies via "==". That is, list each dependency together with its version in this text file, for example: pettingzoo==1.22.3
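As a rough illustration of the "==" pinning advice, the sketch below scans the text of a requirements file and reports every dependency that is not pinned to an exact version. The function name find_unpinned and the sample requirements are hypothetical, made up for this example, and the check only understands simple one-package-per-line entries, not the full requirements syntax.

```python
def find_unpinned(requirements_text):
    """Return the requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        # Drop inline comments and surrounding whitespace
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank and comment-only lines
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Hypothetical requirements.txt content, for illustration only
sample = """
pettingzoo==1.22.3
numpy
gymnasium>=0.28  # a version range, not an exact pin
"""

print(find_unpinned(sample))  # ['numpy', 'gymnasium>=0.28']
```

An empty result means every listed dependency carries an exact version.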

5.3 Advanced: Additional (optional) files

The above file structure is minimal. A more deployment-ready environment would include

/docs/

for documentation,

/setup.py

for packaging,

/custom-environment/__init__.py

for deprecation handling, and GitHub Actions for continuous integration of environment tests.

5.4 Skeleton code

The entirety of your environment logic is stored within /custom-environment/env

/custom-environment/env/custom_environment.py

from pettingzoo.utils.env import ParallelEnv


class CustomEnvironment(ParallelEnv):
    def __init__(self):
        pass

    def reset(self, seed=None, options=None):
        pass

    def step(self, actions):
        pass

    def render(self):
        pass

    def observation_space(self, agent):
        return self.observation_spaces[agent]

    def action_space(self, agent):
        return self.action_spaces[agent]

The code of a custom environment must include all of the functions above.

6 (WIP) Creating Environments: Environment Logic

6.1 Introduction

Now that we have a basic understanding of the structure of environment repositories, we can start thinking about the fun part - environment logic!


For this tutorial, we will be creating a two-player game consisting of a prisoner, trying to escape, and a guard, trying to catch the prisoner. This game will be played on a 7x7 grid, where:

The prisoner starts in the top left corner,

the guard starts in the bottom right corner,

the escape door is randomly placed in the middle of the grid, and

both the prisoner and the guard can move in any of the four cardinal directions (up, down, left, right).
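The environment code in section 6.2 packs each (x, y) grid position into a single integer, x + 7 * y. Below is a minimal sketch of that encoding and its inverse, assuming x is the column and y is the row, both counted from 0. The helper names encode_position and decode_position are made up for this illustration and are not part of PettingZoo.

```python
GRID_SIZE = 7  # the game is played on a 7x7 grid


def encode_position(x, y):
    """Flatten an (x, y) cell into one integer in [0, GRID_SIZE**2 - 1]."""
    return x + GRID_SIZE * y


def decode_position(index):
    """Recover (x, y) from a flattened cell index."""
    y, x = divmod(index, GRID_SIZE)
    return x, y


print(encode_position(0, 0))  # 0, the top left corner
print(encode_position(6, 6))  # 48, the bottom right corner
print(decode_position(48))    # (6, 6)
```

The largest index is 48, so each component of the observation can take 49 distinct values.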

6.2 Code

/custom-environment/env/custom_environment.py

Without action masking

import functools
import random
from copy import copy

import numpy as np
from gymnasium.spaces import Discrete, MultiDiscrete

from pettingzoo.utils.env import ParallelEnv


class CustomEnvironment(ParallelEnv):
    def __init__(self):
        self.escape_y = None
        self.escape_x = None
        self.guard_y = None
        self.guard_x = None
        self.prisoner_y = None
        self.prisoner_x = None
        self.timestep = None
        self.possible_agents = ["prisoner", "guard"]

    def reset(self, seed=None, options=None):
        self.agents = copy(self.possible_agents)
        self.timestep = 0

        self.prisoner_x = 0
        self.prisoner_y = 0

        # Bottom right corner of a 7x7 grid (valid indices are 0..6)
        self.guard_x = 6
        self.guard_y = 6

        self.escape_x = random.randint(2, 5)
        self.escape_y = random.randint(2, 5)

        observations = {
            a: (
                self.prisoner_x + 7 * self.prisoner_y,
                self.guard_x + 7 * self.guard_y,
                self.escape_x + 7 * self.escape_y,
            )
            for a in self.agents
        }
        return observations

    def step(self, actions):
        # Execute actions
        prisoner_action = actions["prisoner"]
        guard_action = actions["guard"]

        if prisoner_action == 0 and self.prisoner_x > 0:
            self.prisoner_x -= 1
        elif prisoner_action == 1 and self.prisoner_x < 6:
            self.prisoner_x += 1
        elif prisoner_action == 2 and self.prisoner_y > 0:
            self.prisoner_y -= 1
        elif prisoner_action == 3 and self.prisoner_y < 6:
            self.prisoner_y += 1

        if guard_action == 0 and self.guard_x > 0:
            self.guard_x -= 1
        elif guard_action == 1 and self.guard_x < 6:
            self.guard_x += 1
        elif guard_action == 2 and self.guard_y > 0:
            self.guard_y -= 1
        elif guard_action == 3 and self.guard_y < 6:
            self.guard_y += 1

        # Check termination conditions
        terminations = {a: False for a in self.agents}
        rewards = {a: 0 for a in self.agents}
        if self.prisoner_x == self.guard_x and self.prisoner_y == self.guard_y:
            rewards = {"prisoner": -1, "guard": 1}
            terminations = {a: True for a in self.agents}

        elif self.prisoner_x == self.escape_x and self.prisoner_y == self.escape_y:
            rewards = {"prisoner": 1, "guard": -1}
            terminations = {a: True for a in self.agents}

        # Check truncation conditions (overwrites termination conditions)
        truncations = {a: False for a in self.agents}
        if self.timestep > 100:
            rewards = {"prisoner": 0, "guard": 0}
            truncations = {"prisoner": True, "guard": True}
        self.timestep += 1

        # Get observations
        observations = {
            a: (
                self.prisoner_x + 7 * self.prisoner_y,
                self.guard_x + 7 * self.guard_y,
                self.escape_x + 7 * self.escape_y,
            )
            for a in self.agents
        }

        # Get dummy infos (not used in this example)
        infos = {a: {} for a in self.agents}

        # Clear the agent list only after the per-agent dicts are built,
        # so the final step still reports both agents
        if any(terminations.values()) or all(truncations.values()):
            self.agents = []

        return observations, rewards, terminations, truncations, infos

    def render(self):
        # Use a character grid; a numeric array cannot hold the "P"/"G"/"E" markers
        grid = np.full((7, 7), " ")
        grid[self.prisoner_y, self.prisoner_x] = "P"
        grid[self.guard_y, self.guard_x] = "G"
        grid[self.escape_y, self.escape_x] = "E"
        print(f"{grid} \n")

    @functools.lru_cache(maxsize=None)
    def observation_space(self, agent):
        # Each encoded position lies in 0..48, so every component has 7 * 7 values
        return MultiDiscrete([7 * 7] * 3)

    @functools.lru_cache(maxsize=None)
    def action_space(self, agent):
        return Discrete(4)

With action masking

import functools
import random
from copy import copy

import numpy as np
from gymnasium.spaces import Discrete, MultiDiscrete

from pettingzoo.utils.env import ParallelEnv


class CustomEnvironment(ParallelEnv):
    def __init__(self):
        self.escape_y = None
        self.escape_x = None
        self.guard_y = None
        self.guard_x = None
        self.prisoner_y = None
        self.prisoner_x = None
        self.timestep = None
        self.possible_agents = ["prisoner", "guard"]

    def reset(self, seed=None, options=None):
        self.agents = copy(self.possible_agents)
        self.timestep = 0

        self.prisoner_x = 0
        self.prisoner_y = 0

        # Bottom right corner of a 7x7 grid (valid indices are 0..6)
        self.guard_x = 6
        self.guard_y = 6

        self.escape_x = random.randint(2, 5)
        self.escape_y = random.randint(2, 5)

        observation = (
            self.prisoner_x + 7 * self.prisoner_y,
            self.guard_x + 7 * self.guard_y,
            self.escape_x + 7 * self.escape_y,
        )
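        # Initial masks match the movement rules in step():
        # index 0: x - 1, index 1: x + 1, index 2: y - 1, index 3: y + 1
        observations = {
            "prisoner": {"observation": observation, "action_mask": [0, 1, 0, 1]},
            "guard": {"observation": observation, "action_mask": [1, 0, 1, 0]},
        }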
        return observations

    def step(self, actions):
        # Execute actions
        prisoner_action = actions["prisoner"]
        guard_action = actions["guard"]

        if prisoner_action == 0 and self.prisoner_x > 0:
            self.prisoner_x -= 1
        elif prisoner_action == 1 and self.prisoner_x < 6:
            self.prisoner_x += 1
        elif prisoner_action == 2 and self.prisoner_y > 0:
            self.prisoner_y -= 1
        elif prisoner_action == 3 and self.prisoner_y < 6:
            self.prisoner_y += 1

        if guard_action == 0 and self.guard_x > 0:
            self.guard_x -= 1
        elif guard_action == 1 and self.guard_x < 6:
            self.guard_x += 1
        elif guard_action == 2 and self.guard_y > 0:
            self.guard_y -= 1
        elif guard_action == 3 and self.guard_y < 6:
            self.guard_y += 1

        # Generate action masks
        prisoner_action_mask = np.ones(4)
        if self.prisoner_x == 0:
            prisoner_action_mask[0] = 0  # Block left movement
        elif self.prisoner_x == 6:
            prisoner_action_mask[1] = 0  # Block right movement
        if self.prisoner_y == 0:
            prisoner_action_mask[2] = 0  # Block y - 1 (top row of the rendered grid)
        elif self.prisoner_y == 6:
            prisoner_action_mask[3] = 0  # Block y + 1 (bottom row of the rendered grid)

        guard_action_mask = np.ones(4)
        if self.guard_x == 0:
            guard_action_mask[0] = 0
        elif self.guard_x == 6:
            guard_action_mask[1] = 0
        if self.guard_y == 0:
            guard_action_mask[2] = 0
        elif self.guard_y == 6:
            guard_action_mask[3] = 0

        if self.guard_x - 1 == self.escape_x:
            guard_action_mask[0] = 0
        elif self.guard_x + 1 == self.escape_x:
            guard_action_mask[1] = 0
        if self.guard_y - 1 == self.escape_y:
            guard_action_mask[2] = 0
        elif self.guard_y + 1 == self.escape_y:
            guard_action_mask[3] = 0

        # Check termination conditions
        terminations = {a: False for a in self.agents}
        rewards = {a: 0 for a in self.agents}
        if self.prisoner_x == self.guard_x and self.prisoner_y == self.guard_y:
            rewards = {"prisoner": -1, "guard": 1}
            terminations = {a: True for a in self.agents}
            self.agents = []

        elif self.prisoner_x == self.escape_x and self.prisoner_y == self.escape_y:
            rewards = {"prisoner": 1, "guard": -1}
            terminations = {a: True for a in self.agents}
            self.agents = []

        # Check truncation conditions (overwrites termination conditions)
        truncations = {"prisoner": False, "guard": False}
        if self.timestep > 100:
            rewards = {"prisoner": 0, "guard": 0}
            truncations = {"prisoner": True, "guard": True}
            self.agents = []
        self.timestep += 1

        # Get observations
        observation = (
            self.prisoner_x + 7 * self.prisoner_y,
            self.guard_x + 7 * self.guard_y,
            self.escape_x + 7 * self.escape_y,
        )
        observations = {
            "prisoner": {
                "observation": observation,
                "action_mask": prisoner_action_mask,
            },
            "guard": {"observation": observation, "action_mask": guard_action_mask},
        }

        # Get dummy infos (not used in this example)
        infos = {"prisoner": {}, "guard": {}}

        return observations, rewards, terminations, truncations, infos

    def render(self):
        # Use a character grid; a numeric array cannot hold the "P"/"G"/"E" markers
        grid = np.full((7, 7), " ")
        grid[self.prisoner_y, self.prisoner_x] = "P"
        grid[self.guard_y, self.guard_x] = "G"
        grid[self.escape_y, self.escape_x] = "E"
        print(f"{grid} \n")

    @functools.lru_cache(maxsize=None)
    def observation_space(self, agent):
        # Each encoded position lies in 0..48, so every component has 7 * 7 values
        return MultiDiscrete([7 * 7] * 3)

    @functools.lru_cache(maxsize=None)
    def action_space(self, agent):
        return Discrete(4)

7 (WIP) Creating Environments: Action Masking

7.1 Introduction

In many environments, it is natural for some actions to be invalid at certain times. For example, in a game of chess, it is impossible to move a pawn forward if it is already at the front of the board. In PettingZoo, we can use action masking to prevent invalid actions from being taken.


Action masking is a more natural way of handling invalid actions than having an action have no effect, which was how we handled bumping into walls in the previous tutorial.

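To make the idea concrete, here is a small sketch of how a policy might use an action mask: build the mask from the agent's position, then sample only among the actions marked valid. It assumes the action numbering used in the section 6.2 code (0: x - 1, 1: x + 1, 2: y - 1, 3: y + 1) on a 7x7 grid. boundary_mask and sample_valid_action are hypothetical helper names, not PettingZoo functions.

```python
import random

GRID_SIZE = 7


def boundary_mask(x, y):
    """Return a 4-element mask: 1 = action allowed, 0 = would leave the grid."""
    return [
        int(x > 0),              # action 0: x - 1
        int(x < GRID_SIZE - 1),  # action 1: x + 1
        int(y > 0),              # action 2: y - 1
        int(y < GRID_SIZE - 1),  # action 3: y + 1
    ]


def sample_valid_action(mask, rng=random):
    """Pick uniformly among the actions whose mask entry is 1."""
    valid = [action for action, allowed in enumerate(mask) if allowed]
    return rng.choice(valid)


mask = boundary_mask(0, 0)        # the top left corner
print(mask)                       # [0, 1, 0, 1]
print(sample_valid_action(mask))  # either 1 or 3
```

With masking, the agent never spends a step on a move that would have no effect, which is the difference from the wall-bumping approach described above.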

7.2 Code

See the action-masking version of the code in section 6.2 above.

8 (WIP) Creating Environments: Testing Your Environment

8.1 Introduction

Now that our environment is complete, we can test it to make sure it works as intended. PettingZoo has a built-in testing suite that can be used to test your environment.

8.2 Code

(add this code below the rest of the code in the file)

/custom-environment/env/custom_environment.py

from pettingzoo.test import parallel_api_test  # noqa: E402

if __name__ == "__main__":
    parallel_api_test(CustomEnvironment(), num_cycles=1_000_000)

parallel_api_test() is the API test for parallel environments. num_cycles sets how many cycles (environment steps) the test runs.

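The PettingZoo custom environment tutorials do not mention "registration" at all. A custom gym environment, by contrast, does need to be registered; only then can it be imported through from gym import … when you use it.

If you want to import the custom environment through from pettingzoo.xxx import …, you would need to mirror the hierarchy of the pettingzoo package and place the entire custom environment folder inside the pettingzoo folder of your Anaconda virtual environment.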