Over the past decade or so, autonomous driving technology has attracted wide attention and in-depth research; in the future it can be used to improve road safety, relieve traffic congestion, and increase fuel efficiency. At present, the various perception tasks in autonomous driving systems achieve high accuracy with deep learning. However, the decision-making and control functions of autonomous driving are not well suited to being implemented with deep learning alone. Decision-making and control for an autonomous vehicle constitute a sequential decision problem: the vehicle must continuously learn from its constantly changing surroundings and make a decision at every step, so this problem is well suited to reinforcement learning. Reinforcement learning studies how an agent can maximize the reward it obtains in a complex, uncertain environment, and it has been shown to handle partially observable, long-horizon, and high-dimensional sequential decision problems effectively.
Most existing reinforcement-learning research models driving behaviors such as lane changing and car following separately, mainly because the two behaviors have different action spaces and reward functions: car-following actions are continuous, whereas lane-changing actions are discrete, and classical reinforcement-learning algorithms struggle with such hybrid action spaces. In real driving, however, drivers do not cleanly separate lane changing from car following; they typically adjust their behavior in one dimension to better achieve a driving objective in another. This study therefore proposes a deep reinforcement-learning algorithm with a hybrid action space to learn the driving policy of an intelligent vehicle, allowing it to learn free lane changing, lane keeping, car following, and collision avoidance at the same time. Around these objectives, the main research work is as follows:
First, the OpenAI Gym platform is used to build a numerical simulation environment. Compared with the traffic simulation software in common use, numerical simulation models the microscopic traffic behavior of an intelligent vehicle more directly, and extracting information from the environment is also easier. Six numerical experiments of different complexity are designed, covering classic car-following, free lane-changing, collision-avoidance, and lane-keeping scenarios. Because the focus of this study is the driving decisions of the vehicle itself, control measures such as traffic signals on urban roads are not considered. The main steps of building a reinforcement-learning simulation environment on the Gym platform are also described, including constructing the state space, constructing the vehicle state-transition equations, designing the reward function, and designing the termination conditions of the simulation.
Second, the Deep Deterministic Policy Gradient (DDPG) algorithm is chosen as the baseline model, and its performance is tested both when planning free lane changing and car following simultaneously and when handling the car-following and lane-changing tasks separately. As a classic reinforcement-learning algorithm with a continuous action space, DDPG has been used in many studies for the car-following task alone, but few studies have used it for car following and lane changing at the same time. Because DDPG outputs only continuous actions, the discrete lane-change decision cannot be included in the action design; the action space therefore contains two continuous actions, the lateral and longitudinal accelerations of the vehicle, and the reward function mainly consists of car-following, lane-changing, safety, and lane-keeping rewards. Experiments show that DDPG learns the car-following policy and the safety policy well when trained on them separately, but it fails to converge when the free lane-changing and car-following policies are learned together. This is mainly because DDPG cannot represent a hybrid action space and can only ignore the discrete action, which lowers its learning efficiency.
Finally, the Parametrized Deep Q-Network (P-DQN) algorithm is introduced to handle the hybrid action space in multi-task learning. Its reward function includes car-following, lane-changing, safety, and lane-keeping rewards. The action space contains one discrete action, namely whether the vehicle changes lanes and to which side, and two continuous actions, the lateral and longitudinal accelerations of the vehicle. The input to P-DQN consists of the relative information between the ego vehicle and the four surrounding vehicles and between the ego vehicle and the road boundaries; because this input is relatively complex, a convolutional neural network is used to help P-DQN process it. The performance of P-DQN in the six numerical experiments of different complexity verifies that it completes the car-following, lane-changing, lane-keeping, and collision-avoidance tasks of the intelligent vehicle simultaneously. Moreover, P-DQN learns faster and is safer than DDPG: DDPG needs 2,000 and 10,000 episodes to converge on the two separate tasks, whereas P-DQN converges within 1,000 episodes while learning all tasks at once; in testing, DDPG has collision rates of 1% and 2% on the two separate tasks, while P-DQN's collision rate is zero when all tasks are learned together.
The work in this thesis validates a new reinforcement-learning algorithm for learning autonomous driving policies and offers a new way of handling other autonomous-driving problems with hybrid action spaces. In addition, modeling car-following and lane-changing behaviors jointly contributes to the further development and application of autonomous driving technology.
Key Words: Deep Reinforcement Learning, Deep Deterministic Policy Gradient, Parametrized Deep Q-Network, Autonomous Driving, Lane-changing Behavior, Car-following Behavior
In recent years, autonomous driving technology has received widespread attention and extensive research, with the potential to improve road safety, alleviate traffic congestion, and enhance fuel efficiency. At present, the various perception tasks in autonomous driving systems achieve high accuracy with deep learning. However, the decision-making and control functions of autonomous driving are not well suited to being implemented with deep learning alone. Decision-making and control for an autonomous vehicle constitute a sequential decision-making problem, in which the vehicle must continuously learn from its ever-changing surroundings and make a decision at each step, and reinforcement learning is well suited to this challenge. Reinforcement learning studies how an agent can maximize the reward it receives in a complex and uncertain environment, and it has been shown to handle partially observable, long-horizon, and high-dimensional sequential decision-making problems effectively.
Existing reinforcement-learning research on driving mostly models behaviors such as lane changing and car following separately. The main reason is that car-following and lane-changing behaviors have different action spaces and reward functions: car-following actions are continuous, while lane-changing actions are discrete, and classical reinforcement-learning algorithms struggle with such hybrid action spaces. In real driving scenarios, however, drivers do not completely separate lane-changing and car-following behaviors; they typically adjust their driving behavior in one dimension to better achieve their driving objective in another.
Therefore, this study proposes a deep reinforcement learning algorithm with a hybrid action space to learn autonomous driving policies for intelligent vehicles. At the input end, a convolutional neural network helps the reinforcement-learning algorithm process the information about surrounding vehicles. At the output end, a reinforcement-learning algorithm with a hybrid action space models tasks, such as lane changing and car following, that involve both discrete and continuous actions. Using the OpenAI Gym platform in Python, six representative numerical simulation scenarios were built, and the proposed algorithm was trained and validated in these scenarios. The main research contents, organized around the research objectives and tasks, are as follows:
First, this paper uses the OpenAI Gym platform to build a numerical simulation environment. Compared with widely used traffic simulation software, numerical simulation can model the microscopic traffic behavior of intelligent vehicles more directly, and it is also easier to extract information from the environment. Six numerical experiments of varying complexity were designed, encompassing classic car-following, free lane-changing, collision-avoidance, and lane-keeping scenarios. The focus of this study is the driving decisions of the vehicle itself, so control measures such as traffic lights on urban roads are not considered. In addition, the main steps of building a reinforcement-learning simulation environment on the Gym platform are introduced, including the construction of the state space, the construction of the vehicle state-transition equations, the design of the reward function, and the design of the termination conditions of the simulation.
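To make these steps concrete, the sketch below outlines how such an environment can be laid out with the standard `gym.Env` interface. The observation variables, kinematic update, reward weights, and termination thresholds are illustrative assumptions for a single-leader scenario, not the environment actually built in this thesis.

```python
import numpy as np
import gym
from gym import spaces


class HighwayEnv(gym.Env):
    """Illustrative driving environment (placeholder, not the thesis's code).

    State: ego speed, lateral offset, gap to the lead vehicle, speed difference.
    Actions: longitudinal and lateral acceleration (both continuous).
    """

    def __init__(self, dt=0.1, lane_width=3.75):
        super().__init__()
        self.dt, self.lane_width = dt, lane_width
        # Two continuous actions: longitudinal and lateral acceleration (m/s^2).
        self.action_space = spaces.Box(low=np.array([-3.0, -1.0]),
                                       high=np.array([3.0, 1.0]), dtype=np.float32)
        # Observation: [ego speed, lateral offset, gap to leader, speed difference].
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(4,), dtype=np.float32)
        self.reset()

    def reset(self):
        self.ego_v, self.ego_y, self.lat_v = 20.0, 0.0, 0.0
        self.gap, self.dv = 30.0, 0.0          # gap (m), ego speed minus leader speed (m/s)
        return self._obs()

    def _obs(self):
        return np.array([self.ego_v, self.ego_y, self.gap, self.dv], dtype=np.float32)

    def step(self, action):
        a_lon, a_lat = np.clip(action, self.action_space.low, self.action_space.high)
        # Simple kinematic state transition; the leader keeps a constant speed.
        self.ego_v = max(0.0, self.ego_v + a_lon * self.dt)
        self.lat_v += a_lat * self.dt
        self.ego_y += self.lat_v * self.dt
        self.dv += a_lon * self.dt
        self.gap -= self.dv * self.dt
        # Reward: weighted sum of car-following, lane-keeping, and safety terms.
        r_follow = -abs(self.gap - 25.0) / 25.0       # track a desired gap
        r_keep = -abs(self.ego_y) / self.lane_width   # stay near the lane centre
        r_safe = -10.0 if self.gap <= 0.0 else 0.0    # collision penalty
        reward = r_follow + r_keep + r_safe
        # Terminate on collision or when the ego vehicle drifts off the road.
        done = self.gap <= 0.0 or abs(self.ego_y) > 1.5 * self.lane_width
        return self._obs(), reward, done, {}
```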
Second, this study selects the Deep Deterministic Policy Gradient (DDPG) algorithm as the baseline model, testing its performance both when planning free lane changes and car following simultaneously and on the individual car-following and lane-changing tasks. As a classic reinforcement-learning algorithm with a continuous action space, DDPG has been used in many studies for the car-following task alone, but few studies have used it to perform car following and lane changing at the same time. Because DDPG outputs only continuous actions, the discrete lane-changing decision cannot be included in the action design. The action space therefore includes two continuous actions, the lateral and longitudinal vehicle accelerations, and the reward function mainly consists of car-following, lane-changing, safety, and lane-keeping rewards. Experiments show that DDPG performs well when learning the car-following and safety strategies separately, but it cannot converge when learning the free lane-changing and car-following strategies simultaneously. This is mainly because DDPG cannot represent a mixed action space and can only ignore the discrete action, which reduces its learning efficiency.
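As a rough illustration of this limitation, a DDPG actor maps the state to a fixed-dimensional vector of bounded continuous values, so a "change left / keep lane / change right" choice has no natural slot in its output. The sketch below uses PyTorch; the state dimension, layer sizes, and acceleration bounds are assumptions for illustration, not the settings used in this thesis.

```python
import torch
import torch.nn as nn


class DDPGActor(nn.Module):
    """Deterministic policy mu(s): every output is a bounded continuous value,
    so a discrete lane-change decision cannot be represented directly."""

    def __init__(self, state_dim=4, hidden=256, max_lon_acc=3.0, max_lat_acc=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Tanh(),   # two continuous actions in [-1, 1]
        )
        # Scale the outputs to physical acceleration limits (m/s^2).
        self.scale = torch.tensor([max_lon_acc, max_lat_acc])

    def forward(self, state):
        return self.net(state) * self.scale


# Example: one forward pass on a dummy 4-dimensional state.
actor = DDPGActor()
action = actor(torch.zeros(1, 4))   # -> [[a_lon, a_lat]]
```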
Finally, this study introduces the Parametrized Deep Q-Network (P-DQN) algorithm to address the mixed action space in multi-task learning. The reward function of P-DQN includes car-following, lane-changing, safety, and lane-keeping rewards. The action space includes one discrete action, which determines whether the vehicle changes lanes and to which side, as well as two continuous actions, the vehicle's lateral and longitudinal accelerations. The input to P-DQN consists of the relative information between the host vehicle and the four surrounding vehicles, and between the host vehicle and the road boundary lines. Because this input is relatively complex, a convolutional neural network is used to help P-DQN process it. The performance of P-DQN in the six numerical simulation experiments of varying complexity verifies that it accomplishes the car-following, lane-changing, lane-keeping, and collision-avoidance tasks simultaneously. In addition, compared with DDPG, P-DQN learns faster and is safer: DDPG takes 2,000 and 10,000 episodes to converge on the two separate tasks, while P-DQN needs only 1,000 episodes when learning all tasks simultaneously; in testing, DDPG has collision rates of 1% and 2% on the two separate tasks, while P-DQN's collision rate is 0% when all tasks are learned together.
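For readers unfamiliar with parameterized action spaces, the sketch below shows the P-DQN action-selection pattern: a parameter network proposes continuous parameters for every discrete action, and a Q-network scores each discrete action together with its parameters. The three discrete choices and the two continuous accelerations follow the abstract; the flat state encoder (used here in place of the thesis's convolutional network), the layer widths, and the PyTorch implementation are illustrative assumptions.

```python
import torch
import torch.nn as nn

N_DISCRETE = 3   # keep lane, change left, change right
PARAM_DIM = 2    # longitudinal and lateral acceleration per discrete action


class ParamNet(nn.Module):
    """x(s): proposes a continuous parameter vector for every discrete action."""
    def __init__(self, state_dim=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, N_DISCRETE * PARAM_DIM), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state).view(-1, N_DISCRETE, PARAM_DIM)


class QNet(nn.Module):
    """Q(s, k, x_k): scores each discrete action together with its parameters."""
    def __init__(self, state_dim=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + N_DISCRETE * PARAM_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, N_DISCRETE),
        )

    def forward(self, state, params):
        return self.net(torch.cat([state, params.flatten(1)], dim=1))


def select_action(state, param_net, q_net):
    """Greedy hybrid action: a discrete index plus its continuous parameters."""
    with torch.no_grad():
        params = param_net(state)          # (1, N_DISCRETE, PARAM_DIM)
        q = q_net(state, params)           # (1, N_DISCRETE)
        k = int(q.argmax(dim=1))
        return k, params[0, k]             # e.g. (1, tensor([a_lon, a_lat]))


# Example call on a dummy 4-dimensional state.
k, cont = select_action(torch.zeros(1, 4), ParamNet(), QNet())
```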
This study validates a novel reinforcement learning algorithm for learning autonomous driving strategies, offering a new approach to dealing with other autonomous driving issues that involve mixed action spaces in the future. Additionally, this research models both car-following and lane-changing behaviors concurrently, contributing to the further development and application of autonomous driving technologies.
Key Words: Deep Reinforcement Learning, Deep Deterministic Policy Gradient, Parametrized Deep Q-Network, Autonomous Driving, Lane-changing Behavior, Car-following Behavior