<2018-Bogdan-On Developing a UAV Pursuit-Evasion Policy Using Reinforcement Learning>

Abstract

Goal: a reactive maneuvering policy for close-range, one-on-one air combat
Algorithm: DNN + A3C
Validation: Monte Carlo experiments and the SCRIMMAGE environment

1. Introduction

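This paper takes a compromise approach: given a reasonably accurate aircraft dynamics model, train the policy on that model, then test it in practice
Algorithm: Asynchronous Advantage Actor-Critic (A3C)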

The problems to be faced: a high-dimensional observation space and a continuous action space

2. Problem Definition and Specifications

2.1 Mission Definition

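Evader: RL agent

Pursuer: Greedy Shooter (GS) agent

Goal: evade the pursuer's attacks while reversing the roles, so that the evader ends up in the attacking position

To simplify the learning problem, the aircraft fly only in the horizontal plane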

2.2 Reinforcement Learning Specification

Defines the flight parameters of the RL agent

Speed: bounded by a maximum and a minimum speed
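The horizontal-plane restriction and the speed limits can be illustrated with a minimal planar kinematic step. This is only a sketch under assumed names: `PlanarState`, `step`, and the parameters `turn_rate`, `accel`, `dt`, `v_min`, `v_max` are hypothetical and not taken from the paper, whose dynamics model is more accurate than this one.

```python
import math
from dataclasses import dataclass


@dataclass
class PlanarState:
    x: float        # position east (hypothetical units)
    y: float        # position north
    heading: float  # radians, 0 = +x axis
    speed: float    # forward speed


def step(state, turn_rate, accel, dt, v_min, v_max):
    """Advance a simple 2D point-mass model by one time step.

    Assumption: a unicycle-style model, not the paper's dynamics.
    The speed is clamped to [v_min, v_max] after applying acceleration.
    """
    speed = min(max(state.speed + accel * dt, v_min), v_max)
    heading = state.heading + turn_rate * dt
    # Normalize the heading to (-pi, pi].
    heading = math.atan2(math.sin(heading), math.cos(heading))
    x = state.x + speed * math.cos(heading) * dt
    y = state.y + speed * math.sin(heading) * dt
    return PlanarState(x, y, heading, speed)
```

Passing `v_min == v_max` makes the clamp return a fixed value, which is one way to model an agent that can only fly at constant speed.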

Defines the A3C model parameters
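As a reminder of what the "advantage" in A3C refers to, here is a minimal sketch of the n-step return and advantage computed by each worker over a short rollout. The function name `n_step_advantages` and the discount value used in the example are assumptions for illustration; the paper's actual hyperparameters are not reproduced in these notes.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma):
    """Compute discounted n-step returns and advantages for one rollout.

    rewards[t] is the reward received after step t, values[t] is the critic's
    estimate V(s_t), and bootstrap_value is V of the state after the last
    step (0.0 if the episode ended there).
    """
    returns = []
    running = bootstrap_value
    # Work backwards so each return reuses the one that follows it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    # Advantage: how much better the observed return was than the critic expected.
    advantages = [ret - v for ret, v in zip(returns, values)]
    return returns, advantages


returns, advantages = n_step_advantages(
    rewards=[1.0, 1.0, 1.0], values=[0.5, 0.5, 0.5],
    bootstrap_value=0.0, gamma=0.5,
)
```

The actor is updated with the log-probability of the chosen action weighted by the advantage, and the critic is regressed toward the returns.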

2.3 Greedy Shooter Specification

What is Greedy Shooter?

- An aircraft using the GS autonomous behavior employs pure pursuit to aim directly at its opponent. Using pure pursuit allows the GS agent to steadily close the distance to its opponent and keep the opponent close to or in its targeting zone.

- In other words: steadily close the distance to the opponent while keeping the opponent inside its own targeting zone
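Pure pursuit in the horizontal plane amounts to steering toward the line of sight to the opponent. The sketch below is an illustration only: `pure_pursuit_turn`, `in_targeting_zone`, and the parameters `max_turn`, `max_range`, `half_angle` are hypothetical names, and the targeting zone is assumed to be a simple cone in front of the aircraft, which may differ from the paper's definition.

```python
import math


def wrap_angle(a):
    """Map an angle in radians into [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def pure_pursuit_turn(px, py, heading, tx, ty, max_turn):
    """Heading change that points the pursuer at the target, clamped to max_turn."""
    bearing = math.atan2(ty - py, tx - px)  # line-of-sight angle to the target
    error = wrap_angle(bearing - heading)   # shortest signed turn toward it
    return max(-max_turn, min(max_turn, error))


def in_targeting_zone(px, py, heading, tx, ty, max_range, half_angle):
    """Assumed zone shape: within max_range and within half_angle of the nose."""
    dist = math.hypot(tx - px, ty - py)
    off_nose = abs(wrap_angle(math.atan2(ty - py, tx - px) - heading))
    return dist <= max_range and off_nose <= half_angle
```

Lead pursuit would instead aim at a predicted future position of the target; pure pursuit needs only the target's current position, which is why it is the simpler strategy.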

GS vs. lead pursuit strategy

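GS is the simpler of the two, but in 1 vs 1 it is still a strong opponent

Defines the flight parameters of the GS agent

Speed: constant speed only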

3. Training Results

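During 30 minutes of training, the RL agent's win rate starts to decline after 10 minutes

The analysis attributes this to:

During a sharp turn, the RL agent may pick some other action; this does not lose the game, but it produces a very large gradient
In subsequent training the RL agent picks this maneuver more and more often, because it produces a very large gradient (even though it is of no real use)
The end result is that the RL agent moves in a "zig-zag" pattern (shaped like the character 之) and is easily caught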

4. Experiments

The policy that was finally learned:

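In other words, the RL agent wanders with random maneuvers until the GS is close enough, then slows down and turns sharply so that the GS ends up in the RL agent's target zone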
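The behavior described above can be paraphrased as a hand-written rule. This is not the learned policy, which is a neural network; it is only a sketch of the two phases. Every name and threshold here (`paraphrased_evader_action`, `trigger_range`, `cruise_speed`, `turn_sign`) is hypothetical.

```python
import random


def paraphrased_evader_action(distance, trigger_range, cruise_speed,
                              v_min, max_turn, turn_sign, rng):
    """Return (turn_command, speed_command) mimicking the described behavior.

    Assumptions: the switch happens at a fixed trigger_range, and the
    direction of the hard turn (turn_sign = +1 or -1) is supplied by the caller.
    """
    if distance > trigger_range:
        # Phase 1: wander with random maneuvers while the GS is still far away.
        return rng.uniform(-max_turn, max_turn), cruise_speed
    # Phase 2: slow down and turn as hard as possible so the GS overshoots
    # and ends up in front of the evader.
    return turn_sign * max_turn, v_min


rng = random.Random(0)
far_action = paraphrased_evader_action(500.0, 100.0, 20.0, 10.0, 0.5, 1, rng)
near_action = paraphrased_evader_action(50.0, 100.0, 20.0, 10.0, 0.5, 1, rng)
```

The point of the slow, tight turn is that the GS flies at constant speed and cannot match the reduced turn radius, so it passes through the evader's target zone.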

5. Summary

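This paper contains no theoretical analysis whatsoever…

It does, however, provide an example of how to set up an air-combat scenario, along with the A3C algorithm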