Dwango Media Village (DMV)

How to Use the Reinforcement Learning Framework RLlib

This is Sasaki from Dwango Media Village. I would like to introduce the reinforcement learning framework RLlib, which I recently started using.

Deep reinforcement learning is one of the machine learning technologies that has been gaining attention recently and is actively researched. Unlike supervised learning, which requires a large amount of labeled sample data, reinforcement learning requires experience data obtained from the interaction between an agent and its environment. Efficiently collecting this interaction data is therefore crucial. Parallelizing the collection of experience data across processes and nodes can raise throughput, but performance varies greatly depending on the implementation. Moreover, with new learning and exploration methods being proposed constantly, implementing experimental programs that support the latest methods from scratch is a significant undertaking.

That’s why we decided to use RLlib [Liang et al., 2018] as a framework for experiments. It implements representative methods, including Ape-X [Horgan et al., 2018], a distributed Q-learning algorithm known for fast convergence. Another reason is its high number of GitHub stars, which suggests that many people are using it.

Although RLlib is convenient, it supports many algorithms, which results in a large number of configuration items, and the documentation explains them only briefly. Therefore, in this article, I will explain RLlib's learning configuration items using Ape-X as an example. First, I will give an overview of Ape-X and RLlib as an introduction. Then, I will show how to start training with RLlib as a simple usage example. Finally, I will explain the configuration items for learning, focusing on those related to Ape-X.

About Ape-X

Ape-X is a type of DQN proposed in the paper Distributed Prioritized Experience Replay [Horgan et al., 2018].

Ape-X Concept Diagram

Ape-X combines methods previously proposed to improve the performance of DQN and adds distributed learning on top of them. It runs multiple exploration workers in parallel, stores the experience data they collect in a replay buffer, and updates the Q-network parameters in large batches on the GPU. The batch used for gradient calculation is sampled from the replay buffer according to priorities derived from the TD error. Specifically, it incorporates improvements such as Double Q-learning, the Dueling Network architecture, and Prioritized Experience Replay, which appear again as configuration items later in this article.

About RLlib

RLlib is a subpackage of the distributed execution library Ray in Python and can be used in combination with another subpackage, Tune, when conducting learning experiments.

Ray is a library for asynchronously executing Python functions that exchange data through immutable objects shared between processes and nodes. When a Python function foo is defined with a special decorator, foo.remote() becomes callable. Calling foo.remote() does not return the function's value directly; instead, the execution is scheduled by Ray's scheduler and the call immediately returns the unique ID of the shared object that will hold the result. Passing this ID to ray.get() blocks until the computation has finished and returns its value.
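For example, this pattern looks roughly as follows (a minimal sketch; the function foo and its argument are illustrative):

```python
import ray

ray.init()  # start a local scheduler and object store

@ray.remote
def foo(x):
    # runs asynchronously in a worker process
    return x * 2

obj_id = foo.remote(10)   # returns an object ID immediately, without waiting
result = ray.get(obj_id)  # blocks until the result is available
print(result)             # -> 20
```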

The scheduler that executes these computation processes is started with ray.init(). Conducting learning with RLlib corresponds to scheduling computations with this scheduler. You can either start a new scheduler each time or connect to an already running one over the network and register computations with it.
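For instance, under the Ray versions current at the time of writing, connecting to an already running cluster looked roughly like this (the address is a placeholder):

```python
import ray

# Start a new local scheduler:
ray.init()

# ...or, instead, attach to an existing cluster and register computations with it:
# ray.init(redis_address="192.168.0.1:6379")  # placeholder address
```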

Tune is a library for running and visualizing learning experiments and hyperparameter search. It provides functionality for combining multiple learning experiments with meta-optimization algorithms that explore the parameter space. Learning experiments are managed as Experiments, queued, and executed sequentially.

Starting Learning

For this example, we assume a single reinforcement learning experiment is being run, similar to RLlib's training example script.

To conduct a learning experiment, simply pass a dictionary describing the experiment to ray.tune.tune.run_experiments(). The dictionary specifies, among other things, which algorithm to run (run), which environment to use (env), and algorithm-specific settings (config), as in the sketch below.
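A minimal sketch of such an experiment definition might look as follows (the experiment name, environment, and parameter values are illustrative, not recommended settings):

```python
import ray
from ray.tune import run_experiments

ray.init()

run_experiments({
    "apex_example": {                  # arbitrary experiment name
        "run": "APEX",                 # algorithm to use (Ape-X DQN)
        "env": "PongNoFrameskip-v4",   # Gym environment ID
        "checkpoint_freq": 10,         # save a checkpoint every 10 iterations
        "config": {                    # overrides of the algorithm's default config
            "num_workers": 8,
        },
    },
})
```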

The run parameter needs to specify the name of an algorithm, such as A3C or IMPALA; see the RLlib documentation for the full list. For reinforcement learning, the Agent class corresponding to that name is instantiated, and its main method _train() is called repeatedly.

Ape-X is implemented in RLlib as a special case of DQN: the APEXAgent class inherits from the DQNAgent class, and the main parameters are described, along with their default values, in the Ray source code. If you want to change settings from the defaults, add a dictionary as the config entry of the Experiment settings above. Recommended config settings, together with their results, are also published.

To customize learning, you need to modify the config. Below is an explanation of its items.

Learning Configuration Items

Since there are many items, they are categorized for explanation. Note that the categories do not correspond to the implementation.

Workers are the processes that perform exploration in parallel. They need to be configured according to the resources (CPUs, GPUs) of the machine on which training runs. The total number of environments explored in parallel is num_workers × num_envs_per_worker.

| Name | Type | Description |
| --- | --- | --- |
| num_workers | int | Number of workers |
| num_cpus_per_worker | int | Number of CPUs allocated per worker |
| num_gpus_per_worker | int | Number of GPUs allocated per worker |
| num_envs_per_worker | int | Number of environments per worker |
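For example, on a machine with 16 CPU cores and a single GPU, the worker-related items might be set as follows (values are illustrative and depend on your hardware):

```python
# Illustrative worker settings: 8 workers x 4 environments = 32 parallel environments
config = {
    "num_workers": 8,           # 8 exploration worker processes
    "num_cpus_per_worker": 1,   # one CPU core per worker
    "num_gpus_per_worker": 0,   # workers explore on the CPU only
    "num_envs_per_worker": 4,   # each worker steps 4 environments
}
```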

In Ape-X, workers use the ε-greedy algorithm to interact with the environment (rollout) and generate experience data. The value of ε is annealed linearly from 1.0 down to exploration_final_eps over the first exploration_fraction × schedule_max_timesteps steps of training. Concretely, the update from \(\epsilon_{i}\) to \(\epsilon_{i+1}\) at each step is

\[ \epsilon_{i+1} = \epsilon_{i} + \frac{\epsilon_{final} - \epsilon_{0}}{f \, T} \]

where \(\epsilon_{0}\) is the initial value of \(\epsilon\) (= 1.0), \(\epsilon_{final}\) is exploration_final_eps, \(f\) is exploration_fraction, and \(T\) is schedule_max_timesteps.
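As a concrete illustration, this linear schedule can be re-implemented as a short function (a simplified sketch, not RLlib's own code; the parameter values are illustrative):

```python
def epsilon_at(step,
               schedule_max_timesteps=1000000,
               exploration_fraction=0.1,
               exploration_final_eps=0.02,
               initial_eps=1.0):
    """Linearly anneal epsilon over the first
    exploration_fraction * schedule_max_timesteps steps."""
    anneal_steps = exploration_fraction * schedule_max_timesteps
    progress = min(step / anneal_steps, 1.0)
    return initial_eps + progress * (exploration_final_eps - initial_eps)

print(epsilon_at(0))       # 1.0
print(epsilon_at(50000))   # 0.51 (halfway through annealing)
print(epsilon_at(200000))  # 0.02 (annealing finished)
```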

| Name | Type | Description |
| --- | --- | --- |
| per_worker_exploration | bool | Whether to use a different value of ε for each worker |
| exploration_fraction | float | Fraction of schedule_max_timesteps over which ε is annealed |
| horizon | int/null | Forcibly ends a rollout after the given number of steps |
| learning_starts | int | Number of rollout steps to collect before learning starts |
| sample_async | bool | Whether to perform rollouts asynchronously |
| sample_batch_size | int | Number of steps per batch sent to the replay buffer |

Prioritized Experience Replay [Schaul et al., 2016] assigns priorities to experience data in the replay buffer based on their usefulness for Q-learning, and learns more from high-priority experiences. The priority is determined according to the following algorithm:

Algorithm from the paper [Schaul et al., 2016]

In line 9 of the algorithm, \(p_i\) is the priority used to determine the sampling probability; it is defined as the absolute value of the TD error plus a small constant:

\[ p_i=|\delta_i|+\epsilon \]

If experiences were sampled according to the TD error alone, a small set of high-error experiences would be replayed over and over. Therefore, the priorities are tempered by the exponent \(\alpha\), and the resulting sampling bias is corrected with importance-sampling weights controlled by \(\beta\). In particular, \(\beta\) is increased from prioritized_replay_beta up to final_prioritized_replay_beta as learning progresses, similar to \(\epsilon\).
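For reference, the sampling probability \(P(i)\) and the importance-sampling weight \(w_i\) defined in the Prioritized Experience Replay paper [Schaul et al., 2016] are

\[ P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad w_i = \left( \frac{1}{N \cdot P(i)} \right)^{\beta} \]

where \(N\) is the number of experiences in the replay buffer; in practice the weights are further normalized by their maximum for stability.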

| Name | Type | Description |
| --- | --- | --- |
| buffer_size | int | Size of the replay buffer |
| prioritized_replay | bool | Whether to use prioritized experience replay |
| worker_side_prioritization | bool | Whether workers compute priorities for experiences during rollout |
| prioritized_replay_eps | float | Constant \(\epsilon\) added to the TD error when calculating the priority |
| prioritized_replay_alpha | float | Exponent \(\alpha\) applied to the priority when computing the sampling probability |
| beta_annealing_fraction | float | Fraction of schedule_max_timesteps over which \(\beta\) is annealed |
| prioritized_replay_beta | float | Initial value of \(\beta\) |
| final_prioritized_replay_beta | float | Upper limit value of \(\beta\) |
| compress_observations | bool | Whether to compress observations using LZ4 when storing them in the replay buffer |

How data from the environment is processed is outlined on the Ray documentation page. The preprocessor acts as a wrapper around the environment (gym.Env) and mainly handles preprocessing of the observation data. For Atari environments, four preprocessors are applied in sequence (unless custom_preprocessor is specified).

| Name | Type | Description |
| --- | --- | --- |
| preprocessor_pref | string | Method of preprocessing observations |
| observation_filter | string | Processing performed after the preprocessor. The default is NoFilter, which does nothing. |
| synchronize_filters | bool | Whether to synchronize filter parameters every iteration |
| clip_rewards | bool | Whether to clip the reward of each step to \([-1, 1]\). If true, numpy.sign() is used. |
| compress_observations | bool | Whether to compress observations using LZ4 when storing them in the replay buffer |
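For example, a DeepMind-style Atari setup can be expressed with these items roughly as follows (a sketch; the values are illustrative):

```python
# Illustrative preprocessing-related settings
config = {
    "preprocessor_pref": "deepmind",   # DeepMind-style Atari preprocessing
    "observation_filter": "NoFilter",  # no additional filtering after the preprocessor
    "clip_rewards": True,              # clip per-step rewards to [-1, 1]
    "compress_observations": True,     # LZ4-compress observations in the replay buffer
}
```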

One iteration corresponds to the following steps; note that checkpoint_freq is counted in these iteration units. A rough sketch of this flow is given below.

  1. Until at least timesteps_per_iteration steps and min_iter_time_s seconds have passed, the following is repeated:
    • Sampling from the replay buffer
    • Gradient calculation
    • Weight update
    • Updating the weights of the workers
    • Updating priorities
  2. Calculating and assigning priorities
  3. Updating logs (TensorBoard updates)

Data addition and replay in the replay buffer are performed asynchronously.
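Putting the above together, the flow of one iteration can be sketched roughly as follows (a hypothetical simplification for illustration, not RLlib's actual implementation; optimizer stands for the component that performs the sampling, gradient, and priority updates listed above):

```python
import time

def train_one_iteration(optimizer, config):
    """Hypothetical sketch of one training iteration."""
    start_steps = optimizer.num_steps_sampled
    start_time = time.time()
    # Repeat optimization steps until both the step and time thresholds are met
    while (optimizer.num_steps_sampled - start_steps < config["timesteps_per_iteration"]
           or time.time() - start_time < config["min_iter_time_s"]):
        # One step: sample a batch from the replay buffer, compute gradients,
        # update the Q-network weights, push new weights to the workers,
        # and update priorities.
        optimizer.step()
    # Afterwards, metrics are aggregated and written to the logs (TensorBoard).
```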

| Name | Type | Description |
| --- | --- | --- |
| double_q | bool | Whether to apply Double Q-learning [van Hasselt, 2010] |
| dueling | bool | Whether to apply the Dueling Network architecture [Wang et al., 2016] |
| gamma | float | Discount rate for rewards |
| n_step | int | Value of n in n-step Q-learning |
| target_network_update_freq | int | Number of steps between target network updates |
| timesteps_per_iteration | int | Minimum number of steps per iteration |
| min_iter_time_s | int | Minimum number of seconds per iteration |
| train_batch_size | int | Batch size used for Q-network updates |

Forward and backward computations of the Q-network are performed with TensorFlow or PyTorch. You can define the network structure yourself, but by default RLlib determines it automatically from the environment's observation_space. In particular, a CNN is used for image observations, and the sizes of the convolution filters can be overridden through the model settings.
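For example, the convolutional layers can be overridden through the model settings roughly as follows (a sketch; each entry of conv_filters is [out_channels, kernel, stride], and the concrete values are illustrative):

```python
# Illustrative model settings overriding the convolutional structure
config = {
    "model": {
        "conv_filters": [
            [16, [8, 8], 4],     # 16 filters, 8x8 kernel, stride 4
            [32, [4, 4], 2],     # 32 filters, 4x4 kernel, stride 2
            [256, [11, 11], 1],  # 256 filters, 11x11 kernel, stride 1
        ],
    },
}
```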

| Name | Type | Description |
| --- | --- | --- |
| hiddens | int | Size of the fully-connected layer after the convolutional layers |
| noisy | bool | Whether to apply a Noisy Network [Fortunato et al., 2017]. If true, ε-greedy is not used. |
| sigma0 | float | Initial parameter of the Noisy Network |
| num_atoms | int | Number of atoms in the Q-network's output distribution. If greater than 1, distributional Q-learning [Bellemare and Dabney, 2017] is used. |
| v_min | float | Parameter of distributional Q-learning (lower bound of the value distribution) |
| v_max | float | Parameter of distributional Q-learning (upper bound of the value distribution) |
The following items relate to optimization and distributed replay:

| Name | Type | Description |
| --- | --- | --- |
| optimizer_class | string | Type of optimizer. Use AsyncReplayOptimizer for Ape-X. |
| debug | bool | Whether to enable debug mode |
| max_weight_sync_delay | int | Minimum number of steps a worker samples before its model parameters are synchronized again |
| num_replay_buffer_shards | int | Number of parallel replay buffer shards |
| lr | float | Learning rate |
| adam_epsilon | float | ε value of the Adam optimizer |
| grad_norm_clipping | float | Maximum norm to which gradients are clipped |

Others

batch_mode can be either truncate_episodes or complete_episodes. With complete_episodes, experience is not sent to the buffer until the episode has finished.

| Name | Type | Description |
| --- | --- | --- |
| schedule_max_timesteps | int | Number of steps over which hyperparameters (such as ε and β) are scheduled according to learning progress |
| gpu | bool | Whether to use the GPU |
| gpu_fraction | float | Fraction of the GPU to use (0–1, where 1 is 100%) |
| monitor | bool | Whether to periodically save rollout results as videos |
| batch_mode | string | Batch sampling method (see above) |
| tf_session_args | dict | Parameters used when initializing the TensorFlow Session |

tf_session_args can be set as follows; see the TensorFlow documentation for details.
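For example (a sketch based on common tf.ConfigProto options; the exact keys and defaults may differ between RLlib versions):

```python
# Illustrative TensorFlow session settings passed through tf_session_args
config = {
    "tf_session_args": {
        "intra_op_parallelism_threads": 2,      # threads used within a single op
        "inter_op_parallelism_threads": 2,      # threads used across independent ops
        "gpu_options": {"allow_growth": True},  # allocate GPU memory on demand
        "log_device_placement": False,          # do not log op placement
        "allow_soft_placement": True,           # fall back to CPU when no GPU kernel exists
    },
}
```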

References

[horgan2018] [Horgan et al., 2018] Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., & Silver, D. (2018). Distributed Prioritized Experience Replay. In International Conference on Learning Representations (pp. 1–19). https://arxiv.org/abs/1803.00933

[liang2018] [Liang et al., 2018] Liang, E., Liaw, R., Moritz, P., Nishihara, R., Fox, R., Goldberg, K., … Stoica, I. (2018). RLlib: Abstractions for Distributed Reinforcement Learning. In International Conference on Machine Learning. https://arxiv.org/abs/1712.09381

[schaul2016] [Schaul et al., 2016] Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. In International Conference on Learning Representations (pp. 1–21). http://arxiv.org/abs/1511.05952

[vanhasselt2010] [van Hasselt, 2010] van Hasselt, H. (2010). Double Q-learning. In Advances in Neural Information Processing Systems 23 (pp. 2613–2621).

[wang2016] [Wang et al., 2016] Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2016). Dueling Network Architectures for Deep Reinforcement Learning. In International Conference on Machine Learning (Vol. 48, pp. 1995–2003). https://arxiv.org/abs/1511.06581

[fortunato2017] [Fortunato et al., 2017] Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., … Legg, S. (2018). Noisy Networks for Exploration. In International Conference on Learning Representations. https://arxiv.org/abs/1706.10295

[bellemare2017] [Bellemare and Dabney, 2017] Bellemare, M. G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. In International Conference on Machine Learning. https://arxiv.org/abs/1707.06887

Author

Publish: 2018/12/10

Kazuma Sasaki