Dwango Media Village (DMV)

How to Use the Reinforcement Learning Framework RLlib

This is Sasaki from Dwango Media Village. I would like to introduce the reinforcement learning framework RLlib, which I recently started using.

Deep reinforcement learning is one of the machine learning technologies that has been gaining attention recently and is actively researched. Unlike supervised learning, which requires a large amount of labeled sample data, reinforcement learning requires experience data obtained from the interaction between an agent and its environment. Efficiently collecting this interaction data is therefore crucial. Parallelizing the collection of experience data across processes and nodes can raise throughput, but performance varies greatly depending on the implementation. Moreover, with new learning and exploration methods being proposed constantly, implementing experimental programs that support the latest methods from scratch is a significant undertaking.

That’s why we decided to use RLlib [Liang et al., 2018] as a framework for experiments. It implements representative methods, including Ape-X [Horgan et al., 2018], a distributed Q-learning algorithm known for fast convergence. Another reason is its high number of GitHub stars, which suggests that many people are using it.

Although RLlib is convenient, it supports many algorithms, which results in a large number of configuration items, and the documentation explains them only briefly. Therefore, in this article, I will explain RLlib's learning configuration items using Ape-X as an example. First, I will give an overview of Ape-X and RLlib as an introduction. Then, I will show how to start training with RLlib as a simple usage example. Finally, I will explain the configuration items for learning, focusing on those related to Ape-X.

About Ape-X

Ape-X is a type of DQN proposed in the paper Distributed Prioritized Experience Replay [Horgan et al., 2018].

Ape-X Concept Diagram

Ape-X combines methods previously proposed to improve the performance of DQN and adds distributed learning on top of them. It runs multiple exploration workers in parallel, stores the experience data they collect in a replay buffer, and updates the Q-network parameters in large batches on the GPU. The batch used for gradient calculation is sampled from the replay buffer according to priorities derived from the TD error. Specifically, it incorporates improvements such as Double Q-learning, the Dueling Network architecture, and Prioritized Experience Replay, which appear again as configuration items later in this article.

About RLlib

RLlib is a subpackage of the distributed execution library Ray in Python and can be used in combination with another subpackage, Tune, when conducting learning experiments.

Ray is a library for asynchronously executing Python functions that exchange data through immutable objects shared between processes and nodes. When a Python function foo is defined with a special decorator, foo.remote() becomes callable. Calling foo.remote() does not return the function's value directly; instead, the execution is scheduled by Ray's scheduler and the call immediately returns the unique ID of the shared object that will hold the result. Passing this ID to ray.get() blocks until the computation has finished and returns its value.
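For example, this pattern looks roughly as follows (a minimal sketch; the function foo and its argument are illustrative):

```python
import ray

ray.init()  # start a local scheduler and object store

@ray.remote
def foo(x):
    # runs asynchronously in a worker process
    return x * 2

obj_id = foo.remote(10)   # returns an object ID immediately, without waiting
result = ray.get(obj_id)  # blocks until the result is available
print(result)             # -> 20
```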

The scheduler that executes these computation processes is started with ray.init(). Conducting learning with RLlib corresponds to scheduling computations with this scheduler. You can either start a new scheduler each time or connect to an already running one over the network and register computations with it.
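For instance, under the Ray versions current at the time of writing, connecting to an already running cluster looked roughly like this (the address is a placeholder):

```python
import ray

# Start a new local scheduler:
ray.init()

# ...or, instead, attach to an existing cluster and register computations with it:
# ray.init(redis_address="192.168.0.1:6379")  # placeholder address
```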

Tune is a library for running and visualizing learning experiments and hyperparameter search. It provides functionality for combining multiple learning experiments with meta-optimization algorithms that explore the parameter space. Learning experiments are managed as Experiments, queued, and executed sequentially.

Starting Learning

For this example, we assume a single reinforcement learning experiment is being run, similar to RLlib's training example script.

To conduct a learning experiment, simply pass a dictionary describing the experiment to ray.tune.tune.run_experiments(). The dictionary specifies, among other things, which algorithm to run (run), which environment to use (env), and algorithm-specific settings (config), as in the sketch below.
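A minimal sketch of such an experiment definition might look as follows (the experiment name, environment, and parameter values are illustrative, not recommended settings):

```python
import ray
from ray.tune import run_experiments

ray.init()

run_experiments({
    "apex_example": {                  # arbitrary experiment name
        "run": "APEX",                 # algorithm to use (Ape-X DQN)
        "env": "PongNoFrameskip-v4",   # Gym environment ID
        "checkpoint_freq": 10,         # save a checkpoint every 10 iterations
        "config": {                    # overrides of the algorithm's default config
            "num_workers": 8,
        },
    },
})
```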

The run parameter needs to specify the name of an algorithm, such as A3C or IMPALA; see the RLlib documentation for the full list. For reinforcement learning, the Agent class corresponding to that name is instantiated, and its main method _train() is called repeatedly.

Ape-X is implemented in RLlib as a special case of DQN: the APEXAgent class inherits from the DQNAgent class, and the main parameters are described, along with their default values, in the Ray source code. If you want to change settings from the defaults, add a dictionary as the config entry of the Experiment settings above. Recommended config settings, together with their results, are also published.

To customize learning, you need to modify the config. Below is an explanation of its items.

Learning Configuration Items

Since there are many items, they are categorized for explanation. Note that the categories do not correspond to the implementation.

Workers are the processes that perform exploration in parallel. They need to be configured according to the resources (CPUs, GPUs) of the machine on which training runs. The total number of environments explored in parallel is num_workers × num_envs_per_worker.

| Name | Type | Description |
| --- | --- | --- |
| num_workers | int | Number of workers |
| num_cpus_per_worker | int | Number of CPUs allocated per worker |
| num_gpus_per_worker | int | Number of GPUs allocated per worker |
| num_envs_per_worker | int | Number of environments per worker |
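For example, on a machine with 16 CPU cores and a single GPU, the worker-related items might be set as follows (values are illustrative and depend on your hardware):

```python
# Illustrative worker settings: 8 workers x 4 environments = 32 parallel environments
config = {
    "num_workers": 8,           # 8 exploration worker processes
    "num_cpus_per_worker": 1,   # one CPU core per worker
    "num_gpus_per_worker": 0,   # workers explore on the CPU only
    "num_envs_per_worker": 4,   # each worker steps 4 environments
}
```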

In Ape-X, workers use the ε-greedy algorithm to interact with the environment (rollout) and generate experience data. The value of ε is annealed linearly from 1.0 down to exploration_final_eps over the first exploration_fraction × schedule_max_timesteps steps of training. Concretely, the update from \(\epsilon_{i}\) to \(\epsilon_{i+1}\) at each step is

\[ \epsilon_{i+1} = \epsilon_{i} + \frac{\epsilon_{final} - \epsilon_{0}}{f \, T} \]

where \(\epsilon_{0}\) is the initial value of \(\epsilon\) (= 1.0), \(\epsilon_{final}\) is exploration_final_eps, \(f\) is exploration_fraction, and \(T\) is schedule_max_timesteps.
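As a concrete illustration, this linear schedule can be re-implemented as a short function (a simplified sketch, not RLlib's own code; the parameter values are illustrative):

```python
def epsilon_at(step,
               schedule_max_timesteps=1000000,
               exploration_fraction=0.1,
               exploration_final_eps=0.02,
               initial_eps=1.0):
    """Linearly anneal epsilon over the first
    exploration_fraction * schedule_max_timesteps steps."""
    anneal_steps = exploration_fraction * schedule_max_timesteps
    progress = min(step / anneal_steps, 1.0)
    return initial_eps + progress * (exploration_final_eps - initial_eps)

print(epsilon_at(0))       # 1.0
print(epsilon_at(50000))   # 0.51 (halfway through annealing)
print(epsilon_at(200000))  # 0.02 (annealing finished)
```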

| Name | Type | Description |
| --- | --- | --- |
| per_worker_exploration | bool | Whether to use a different value of ε for each worker |
| exploration_fraction | float | Fraction of schedule_max_timesteps over which ε is annealed |
| horizon | int/null | Forcibly ends a rollout after the given number of steps |
| learning_starts | int | Number of rollout steps to collect before learning starts |
| sample_async | bool | Whether to perform rollouts asynchronously |
| sample_batch_size | int | Number of steps per batch sent to the replay buffer |

Prioritized Experience Replay [Schaul et al., 2016] assigns priorities to experience data in the replay buffer based on their usefulness for Q-learning, and learns more from high-priority experiences. The priority is determined according to the following algorithm:

Algorithm from the paper [Schaul et al., 2016]

In line 9 of the algorithm, \(p_i\) is the priority used to determine the sampling probability; it is defined as the absolute value of the TD error plus a small constant:

\[ p_i=|\delta_i|+\epsilon \]

If experiences were sampled according to the TD error alone, a small set of high-error experiences would be replayed over and over. Therefore, the priorities are tempered by the exponent \(\alpha\), and the resulting sampling bias is corrected with importance-sampling weights controlled by \(\beta\). In particular, \(\beta\) is increased from prioritized_replay_beta up to final_prioritized_replay_beta as learning progresses, similar to \(\epsilon\).
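For reference, the sampling probability \(P(i)\) and the importance-sampling weight \(w_i\) defined in the Prioritized Experience Replay paper [Schaul et al., 2016] are

\[ P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad w_i = \left( \frac{1}{N \cdot P(i)} \right)^{\beta} \]

where \(N\) is the number of experiences in the replay buffer; in practice the weights are further normalized by their maximum for stability.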

| Name | Type | Description |
| --- | --- | --- |
| buffer_size | int | Size of the replay buffer |
| prioritized_replay | bool | Whether to use prioritized experience replay |
| worker_side_prioritization | bool | Whether workers compute priorities for experiences during rollout |
| prioritized_replay_eps | float | Constant \(\epsilon\) added to the TD error when calculating the priority |
| prioritized_replay_alpha | float | Exponent \(\alpha\) applied to the priority when computing the sampling probability |
| beta_annealing_fraction | float | Fraction of schedule_max_timesteps over which \(\beta\) is annealed |
| prioritized_replay_beta | float | Initial value of \(\beta\) |
| final_prioritized_replay_beta | float | Upper limit value of \(\beta\) |
| compress_observations | bool | Whether to compress observations using LZ4 when storing them in the replay buffer |

How data from the environment is processed is outlined on the Ray documentation page. The preprocessor acts as a wrapper around the environment (gym.Env) and mainly handles preprocessing of the observation data. For Atari environments, four preprocessors are applied in sequence (unless custom_preprocessor is specified).

| Name | Type | Description |
| --- | --- | --- |
| preprocessor_pref | string | Method of preprocessing observations |
| observation_filter | string | Processing performed after the preprocessor. The default is NoFilter, which does nothing. |
| synchronize_filters | bool | Whether to synchronize filter parameters every iteration |
| clip_rewards | bool | Whether to clip the reward of each step to \([-1, 1]\). If true, numpy.sign() is used. |
| compress_observations | bool | Whether to compress observations using LZ4 when storing them in the replay buffer |
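For example, a DeepMind-style Atari setup can be expressed with these items roughly as follows (a sketch; the values are illustrative):

```python
# Illustrative preprocessing-related settings
config = {
    "preprocessor_pref": "deepmind",   # DeepMind-style Atari preprocessing
    "observation_filter": "NoFilter",  # no additional filtering after the preprocessor
    "clip_rewards": True,              # clip per-step rewards to [-1, 1]
    "compress_observations": True,     # LZ4-compress observations in the replay buffer
}
```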

One iteration corresponds to the following steps; note that checkpoint_freq is counted in these iteration units. A rough sketch of this flow is given below.

  1. Until at least timesteps_per_iteration steps and min_iter_time_s seconds have passed, the following is repeated:
    • Sampling from the replay buffer
    • Gradient calculation
    • Weight update
    • Updating the weights of the workers
    • Updating priorities
  2. Calculating and assigning priorities
  3. Updating logs (TensorBoard updates)

Data addition and replay in the replay buffer are performed asynchronously.
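Putting the above together, the flow of one iteration can be sketched roughly as follows (a hypothetical simplification for illustration, not RLlib's actual implementation; optimizer stands for the component that performs the sampling, gradient, and priority updates listed above):

```python
import time

def train_one_iteration(optimizer, config):
    """Hypothetical sketch of one training iteration."""
    start_steps = optimizer.num_steps_sampled
    start_time = time.time()
    # Repeat optimization steps until both the step and time thresholds are met
    while (optimizer.num_steps_sampled - start_steps < config["timesteps_per_iteration"]
           or time.time() - start_time < config["min_iter_time_s"]):
        # One step: sample a batch from the replay buffer, compute gradients,
        # update the Q-network weights, push new weights to the workers,
        # and update priorities.
        optimizer.step()
    # Afterwards, metrics are aggregated and written to the logs (TensorBoard).
```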

| Name | Type | Description |
| --- | --- | --- |
| double_q | bool | Whether to apply Double Q-learning [van Hasselt, 2010] |
| dueling | bool | Whether to apply the Dueling Network architecture [Wang et al., 2016] |
| gamma | float | Discount rate for rewards |
| n_step | int | Value of n in n-step Q-learning |
| target_network_update_freq | int | Number of steps between target network updates |
| timesteps_per_iteration | int | Minimum number of steps per iteration |
| min_iter_time_s | int | Minimum number of seconds per iteration |
| train_batch_size | int | Batch size used for Q-network updates |

Forward and backward computations of the Q-network are performed with TensorFlow or PyTorch. You can define the network structure yourself, but by default RLlib determines it automatically from the environment's observation_space. In particular, a CNN is used for image observations, and the sizes of the convolution filters can be overridden through the model settings.
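For example, the convolutional layers can be overridden through the model settings roughly as follows (a sketch; each entry of conv_filters is [out_channels, kernel, stride], and the concrete values are illustrative):

```python
# Illustrative model settings overriding the convolutional structure
config = {
    "model": {
        "conv_filters": [
            [16, [8, 8], 4],     # 16 filters, 8x8 kernel, stride 4
            [32, [4, 4], 2],     # 32 filters, 4x4 kernel, stride 2
            [256, [11, 11], 1],  # 256 filters, 11x11 kernel, stride 1
        ],
    },
}
```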

| Name | Type | Description |
| --- | --- | --- |
| hiddens | int | Size of the fully-connected layer after the convolutional layers |
| noisy | bool | Whether to apply a Noisy Network [Fortunato et al., 2017]. If true, ε-greedy is not used. |
| sigma0 | float | Initial parameter of the Noisy Network |
| num_atoms | int | Number of atoms in the Q-network's output distribution. If greater than 1, distributional Q-learning [Bellemare and Dabney, 2017] is used. |
| v_min | float | Parameter of distributional Q-learning (lower bound of the value distribution) |
| v_max | float | Parameter of distributional Q-learning (upper bound of the value distribution) |
The following items relate to optimization and distributed replay:

| Name | Type | Description |
| --- | --- | --- |
| optimizer_class | string | Type of optimizer. Use AsyncReplayOptimizer for Ape-X. |
| debug | bool | Whether to enable debug mode |
| max_weight_sync_delay | int | Minimum number of steps a worker samples before its model parameters are synchronized again |
| num_replay_buffer_shards | int | Number of parallel replay buffer shards |
| lr | float | Learning rate |
| adam_epsilon | float | ε value of the Adam optimizer |
| grad_norm_clipping | float | Maximum norm to which gradients are clipped |

Others

batch_mode can be either truncate_episodes or complete_episodes. With complete_episodes, experience is not sent to the buffer until the episode has finished.

| Name | Type | Description |
| --- | --- | --- |
| schedule_max_timesteps | int | Number of steps over which hyperparameters (such as ε and β) are scheduled according to learning progress |
| gpu | bool | Whether to use the GPU |
| gpu_fraction | float | Fraction of the GPU to use (0–1, where 1 is 100%) |
| monitor | bool | Whether to periodically save rollout results as videos |
| batch_mode | string | Batch sampling method (see above) |
| tf_session_args | dict | Parameters used when initializing the TensorFlow Session |

tf_session_args can be set as follows; see the TensorFlow documentation for details.
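For example (a sketch based on common tf.ConfigProto options; the exact keys and defaults may differ between RLlib versions):

```python
# Illustrative TensorFlow session settings passed through tf_session_args
config = {
    "tf_session_args": {
        "intra_op_parallelism_threads": 2,      # threads used within a single op
        "inter_op_parallelism_threads": 2,      # threads used across independent ops
        "gpu_options": {"allow_growth": True},  # allocate GPU memory on demand
        "log_device_placement": False,          # do not log op placement
        "allow_soft_placement": True,           # fall back to CPU when no GPU kernel exists
    },
}
```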

References

[horgan2018] [Horgan et al., 2018] Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., & Silver, D. (2018). Distributed Prioritized Experience Replay. In International Conference on Learning Representations (pp. 1–19). https://arxiv.org/abs/1803.00933

[liang2018] [Liang et al., 2018] Liang, E., Liaw, R., Moritz, P., Nishihara, R., Fox, R., Goldberg, K., … Stoica, I. (2018). RLlib: Abstractions for Distributed Reinforcement Learning. In International Conference on Machine Learning. https://arxiv.org/abs/1712.09381

[schaul2016] [Schaul et al., 2016] Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. In International Conference on Learning Representations (pp. 1–21). http://arxiv.org/abs/1511.05952

[vanhasselt2010] [van Hasselt, 2010] van Hasselt, H. (2010). Double Q-learning. In Advances in Neural Information Processing Systems 23 (pp. 2613–2621).

[wang2016] [Wang et al., 2016] Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2016). Dueling Network Architectures for Deep Reinforcement Learning. In International Conference on Machine Learning (Vol. 48, pp. 1995–2003). https://arxiv.org/abs/1511.06581

[fortunato2017] [Fortunato et al., 2017] Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., … Legg, S. (2018). Noisy Networks for Exploration. In International Conference on Learning Representations. https://arxiv.org/abs/1706.10295

[bellemare2017] [Bellemare and Dabney, 2017] Bellemare, M. G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. In International Conference on Machine Learning. https://arxiv.org/abs/1707.06887

Author

Publish: 2018/12/10

Kazuma Sasaki