Awesome Exploration Methods in Reinforcement Learning
Updated on 2024.06.12
- Here is a collection of research papers on Exploration methods in Reinforcement Learning (ERL). The repository will be continuously updated to track the frontier of ERL. Welcome to follow and star!
- The balance between exploration and exploitation is one of the central problems in reinforcement learning. To give readers an intuitive feel for exploration, we provide a visualization of a typical hard-exploration environment in MiniGrid below. In this task, reaching the goal often requires a sequence of dozens or even hundreds of actions, during which the agent must thoroughly explore the state-action space in order to learn the skills needed to achieve the goal.
A typical hard-exploration environment: MiniGrid-ObstructedMaze-Full-v0.
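For readers who want to try this task directly, here is a minimal sketch of loading the environment, assuming the `gymnasium` and `minigrid` packages are installed; the random-action loop is only meant to show how hard the task is to solve by undirected exploration.

```python
# Minimal sketch (assumes: pip install gymnasium minigrid).
import gymnasium as gym
import minigrid  # noqa: F401  # importing minigrid registers the MiniGrid-* environments

env = gym.make("MiniGrid-ObstructedMaze-Full-v0")
obs, info = env.reset(seed=0)

done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()  # purely random exploration almost never reaches the goal
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated

print(f"Episode return under random exploration: {episode_return}")
```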
A Taxonomy of Exploration RL Methods
In general, we can divide the reinforcement learning process into two phases: a collect phase and a train phase. In the collect phase, the agent chooses actions according to its current policy and interacts with the environment to gather useful experience. In the train phase, the agent uses the collected experience to update the current policy and obtain a better-performing one.
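The two phases can be summarized by the schematic loop below; this is a library-agnostic sketch, and `policy`, `buffer`, and their methods are hypothetical placeholders rather than any specific framework's API.

```python
# Schematic collect/train loop; all objects and method names are illustrative placeholders.
def run_exploration_rl(env, policy, buffer, num_iterations=1000):
    for _ in range(num_iterations):
        # Collect phase: the behavior policy (possibly augmented with an
        # exploration strategy, e.g. action-selection perturbation) gathers experience.
        obs, _ = env.reset()
        done = False
        while not done:
            action = policy.select_action(obs, explore=True)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            buffer.add(obs, action, reward, next_obs, terminated)
            obs, done = next_obs, terminated or truncated

        # Train phase: the collected experience updates the policy, possibly with
        # an augmented objective (e.g. an intrinsic reward bonus or entropy term).
        batch = buffer.sample()
        policy.update(batch)
```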
According to the phase in which the exploration component is explicitly applied, we divide the methods in Exploration RL into two main categories: Augmented Collecting Strategy and Augmented Training Strategy:
- Augmented Collecting Strategy represents a variety of different exploration strategies commonly used in the collect phase, which we further divide into four categories:
  - Action Selection Perturbation
  - Action Selection Guidance
  - State Selection Guidance
  - Parameter Space Perturbation
- Augmented Training Strategy represents a variety of different exploration strategies commonly used in the train phase, which we further divide into seven categories:
  - Count Based (a minimal sketch of a count-based bonus follows the note below)
  - Prediction Based
  - Information Theory Based
  - Entropy Augmented
  - Bayesian Posterior Based
  - Goal Based
  - (Expert) Demo Data
Note that these categories may overlap, and an algorithm may belong to several of them. For more detailed surveys on exploration methods in RL, you can refer to Tianpei Yang et al. and Susan Amin et al.
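As a concrete example of the Count Based category, here is a minimal sketch of a count-based intrinsic bonus in the spirit of #Exploration [4]; the class name, the coefficient value, and the assumption of a hashable state key are illustrative choices, not any paper's exact implementation.

```python
from collections import defaultdict
import math

class CountBasedBonus:
    """Intrinsic bonus that decays with the visit count of a (discretized) state."""

    def __init__(self, beta=0.1):
        self.beta = beta                # bonus scale (illustrative value)
        self.counts = defaultdict(int)  # N(s): visit counts keyed by state

    def __call__(self, state_key):
        # `state_key` is assumed hashable, e.g. a tabular state id or a hash of
        # the raw observation as in #Exploration [4].
        self.counts[state_key] += 1
        return self.beta / math.sqrt(self.counts[state_key])

# In the train phase, the bonus simply augments the extrinsic reward:
#   shaped_reward = extrinsic_reward + bonus(state_key)
```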
Here are the links to the papers that appeared in the taxonomy:
[1] Go-Explore: Adrien Ecoffet et al., 2021
[2] NoisyNet: Meire Fortunato et al., 2018
[3] DQN-PixelCNN: Marc G. Bellemare et al., 2016
[4] #Exploration: Haoran Tang et al., 2017
[5] EX2: Justin Fu et al., 2017
[6] ICM: Deepak Pathak et al., 2018
[7] RND: Yuri Burda et al., 2018
[8] NGU: Adrià Puigdomènech Badia et al., 2020
[9] Agent57: Adrià Puigdomènech Badia et al., 2020
[10] VIME: Rein Houthooft et al., 2016
[11] EMI: Wang et al., 2019
[12] DIAYN: Benjamin Eysenbach et al., 2019
[13] SAC: Tuomas Haarnoja et al., 2018
[14] BootstrappedDQN: Ian Osband et al., 2016
[15] PSRL: Ian Osband et al., 2013
[16] HER: Marcin Andrychowicz et al., 2017
[17] DQfD: Todd Hester et al., 2018
[18] R2D3: Caglar Gulcehre et al., 2019
Papers
format:
- [title](paper link) (presentation type, openreview score [if the score is public])
- author1, author2, author3, ...
- Key: key problems and insights
- ExpEnv: experiment environments
ICLR 2024
- Unlocking the Power of Representations in Long-term Novelty-based Exploration
- Alaa Saade, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo Sarra, Oliver Groth, Michal Valko, Bilal Piot
- Key: Robust Exploration via Clustering-based Online Density Estimation
- ExpEnv: Atari, DM-HARD-8
- A Theoretical Explanation of Deep RL Performance in Stochastic Environments
- Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan
- Key: Stochastic Environments, effective horizon, RL theory, instance-dependent bounds, empirical validation of theory
- ExpEnv: BRIDGE
- DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization
- Guowei Xu, Ruijie Zheng, Yongyuan Liang, Xiyao Wang, Zhecheng Yuan, Tianying Ji, Yu Luo, Xiaoyu Liu, Jiaxin Yuan, Pu Hua, Shuzhen Li, Yanjie Ze, Hal Daumé III, Furong Huang, Huazhe Xu
- Key: Visual RL, Dormant Ratio Minimization, Exploration
- ExpEnv: DeepMind Control Suite, MetaWorld, and Adroit
- METRA: Scalable Unsupervised RL with Metric-Aware Abstraction
- Seohong Park, Oleh Rybkin, Sergey Levine
- Key: unsupervised RL, metric-aware abstraction, scalable exploration
- ExpEnv: state-based Ant and HalfCheetah, Kitchen
- Text2Reward: Reward Shaping with Language Models for Reinforcement Learning
- Tianbao Xie, Siheng Zhao, Chen Henry Wu, Yitao Liu, Qian Luo, Victor Zhong, Yanchao Yang, Tao Yu
- Key: reward shaping, language models, text-based reward shaping
- ExpEnv: MUJOCO, MANISKILL2, METAWORLD
- Pre-Training Goal-based Models for Sample-Efficient Reinforcement Learning
- Haoqi Yuan, Zhancun Mu, Feiyang Xie, Zongqing Lu
- Key: goal-based models, pre-training, sample efficiency
- ExpEnv: Kitchen, Minecraft.
- Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning
- Hyungho Na, Yunkyeong Seo, Il-chul Moon
- Key: episodic memory, cooperative multi-agent, efficient utilization
- ExpEnv: StarCraft II and Google Research Football
- Simple Hierarchical Planning with Diffusion
- Chang Chen, Fei Deng, Kenji Kawaguchi, Caglar Gulcehre, Sungjin Ahn
- Key: hierarchical planning, diffusion, exploration
- ExpEnv: Maze2D and AntMaze
- Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks
- Ziping Xu, Zifan Xu, Runxuan Jiang, Peter Stone, Ambuj Tewari
- Key: myopic exploration, multitask reinforcement learning, diverse tasks
- ExpEnv: synthetic robotic control environment
- PAE: Reinforcement Learning from External Knowledge for Efficient Exploration
- Zhe Wu, Haofei Lu, Junliang Xing, You Wu, Renye Yan, Yaozhong Gan, Yuanchun Shi
- Key: external knowledge, efficient exploration, reinforcement learning
- ExpEnv: BabyAI and MiniHack
- In-context Exploration-Exploitation for Reinforcement Learning
- Zhenwen Dai, Federico Tomasi, Sina Ghiassian
- Key: in-context exploration-exploitation, reinforcement learning, exploration-exploitation trade-off
- ExpEnv: Dark Room, Dark Key-to-Door, Dark Room (Biased).
- Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining
- Licong Lin, Yu Bai, Song Mei
- Key: transformers, decision makers, in-context reinforcement learning
- ExpEnv: Linear bandit, Bernoulli bandits.
- Learning to Act without Actions
- Dominik Schmidt, Minqi Jiang
- Key: recovering latent action information, video, pre-training
- ExpEnv: Procgen
-
- Mingde Zhao, Safa Alver, Harm van Seijen, Romain Laroche, Doina Precup, Yoshua Bengio
- Key: spatio-temporal abstractions, hierarchical planning, task/goal decomposition
- ExpEnv: MiniGrid-BabyAI
NeurIPS 2023
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration
- Zhihan Liu, Miao Lu, Wei Xiong, Han Zhong, Hao Hu, Shenao Zhang, Sirui Zheng, Zhuoran Yang, Zhaoran Wang
- Key: a single objective that integrates the estimation and planning components, balancing exploration and exploitation automatically, sublinear regret
- ExpEnv: MuJoCo with sparse reward
- On the Importance of Exploration for Generalization in Reinforcement Learning
- Yiding Jiang, J Zico Kolter, Roberta Raileanu
- Key: exploration, generalization, Exploration via Distributional Ensemble
- ExpEnv: tabular contextual MDP, Procgen and Crafter
- Monte Carlo Tree Search with Boltzmann Exploration
- Michael Painter, Mohamed Baioumy, Nick Hawes, Bruno Lacerda
- Key: Boltzmann exploration with MCTS, optimal actions for the maximum entropy objective do not necessarily correspond to optimal actions for the original objective, two improved algorithms.
- ExpEnv: the Frozen Lake environment, the Sailing Problem, Go
- Breadcrumbs to the Goal: Supervised Goal Selection from Human-in-the-Loop Feedback
- Marcel Torne Villasevil, Max Balsells I Pamies, Zihan Wang, Samedh Desai, Tao Chen, Pulkit Agrawal, Abhishek Gupta
- Key: human-in-the-loop feedback, bifurcating human feedback and policy learning
- ExpEnv: Bandu, Block Stacking, Kitchen, Pusher, Four Rooms, and Maze
- MIMEx: Intrinsic Rewards from Masked Input Modeling
- Toru Lin, Allan Jabri
- Key: pseudo-likelihood estimation with different mask distributions
- ExpEnv: PixMC-Sparse, DeepMind Control suite
- Accelerating Exploration with Unlabeled Prior Data
- Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, Sergey Levine
- Key: prior data without reward labels, learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards
- ExpEnv: AntMaze domain, Adroit hand manipulation domain, and a visual simulated robotic manipulation domain.
- On the Convergence and Sample Complexity Analysis of Deep Q-Networks with ε-Greedy Exploration
- Shuai Zhang, Hongkang Li, Meng Wang, Miao Liu, Pin-Yu Chen, Songtao Lu, Sijia Liu, Keerthiram Murugesan, Subhajit Chaudhury
- Key: ε-greedy exploration, convergence, sample complexity
- ExpEnv: Numerical Experiments
- Pitfall of Optimism: Distributional Reinforcement Learning by Randomizing Risk Criterion
- Taehyun Cho, Seungyub Han, Heesoo Lee, Kyungjae Lee, Jungwoo Lee
- Key: distributional reinforcement learning, randomizing risk criterion, optimistic exploration
- ExpEnv: 55 Atari games
- CQM: Curriculum Reinforcement Learning with a Quantized World Model
- Seungjae Lee, Daesol Cho, Jonghae Park, H. Jin Kim
- Key: curriculum reinforcement learning, quantized world model
- ExpEnv: PointNMaze
- Safe Exploration in Reinforcement Learning: A Generalized Formulation and Algorithms
- Akifumi Wachi, Wataru Hashimoto, Xun Shen, Kazumune Hashimoto
- Key: safe exploration, generalized formulation, Meta-Algorithm for Safe Exploration
- ExpEnv: grid-world and Safety Gym
- Successor-Predecessor Intrinsic Exploration
- Changmin Yu, Neil Burgess, Maneesh Sahani, Samuel J. Gershman
- Key: retrospective