Deep Reinforcement Learning With Respect to Neural Networks

Contemplation / Cognitive-Scientific Musings

by Aesthetic Thinker 2021. 1. 25. 06:26

RL WITH NEURAL NETWORKS

 

The best way to train a neural network is to iteratively feed it batches of samples drawn at random (ideally i.i.d.) from a sufficiently large set of input-output data pairs. As a nonlinear global function approximator, a neural network rapidly finds a way to map each input to its matching output during training. Once this input-output matching is accomplished, general feature-extraction functions have implicitly been constructed, so that the network can find a properly matched output for an unseen input to some extent.
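To make this concrete, here is a minimal supervised training sketch (assuming PyTorch and a toy regression dataset; the network shape and hyperparameters are just illustrative): mini-batches are drawn at random from the full dataset, approximating the i.i.d. condition.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy input-output pairs; shuffle=True approximates i.i.d. mini-batch sampling.
X, Y = torch.randn(1000, 8), torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)

net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for epoch in range(10):
    for x, y in loader:                          # randomly sampled mini-batches
        loss = nn.functional.mse_loss(net(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```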

 

Leveraging these features of neural networks, reinforcement learning, henceforth RL, has adopted neural networks as value/action approximators over the high-dimensional observations (e.g. pixels) coming from the environment. Along with some fancy approximation techniques, this combined form of RL with neural networks, deep reinforcement learning, has achieved notable success in game playing, robot manipulation, motor control, and many other tasks that fit the RL framework.
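For concreteness, a sketch of such a value approximator is shown below: a convolutional Q-network that maps a stack of pixel frames to one Q-value per discrete action. The layer sizes follow the well-known DQN recipe for 84x84 inputs, but treat them as illustrative rather than definitive.

```python
import torch
import torch.nn as nn

class PixelQNetwork(nn.Module):
    """Maps a stack of grayscale frames to one Q-value per discrete action."""
    def __init__(self, n_actions: int, in_frames: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                                  nn.Linear(512, n_actions))

    def forward(self, frames):          # frames: (batch, 4, 84, 84) scaled to [0, 1]
        return self.head(self.features(frames))
```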

 

CHALLENGES IN DEEP RL

 

In spite of the explosive recent advances in deep RL, there still remain some problems that make us hesitate to adopt it in practice. Among others, the huge time cost of learning from scratch is a chronic problem of RL. Because the initial policy has no prior knowledge about the target task or the given environment, a tremendous amount of trial and error has to be carried out to collect sufficient experience to establish a proper policy for the given task.

 

That is the very difference between the typical supervised learning setting and the RL framework. In the former, a sufficient amount of 'correct' input-output pairs is already prepared for network training. In the latter, however, there are no correct input-output pairs; instead, all input-output pairs must be acquired through the agent's own experience. The problem is that a neural network performs well only when the training input-output pairs are 'sufficiently' and 'diversely' prepared and are given in a 'correct', 'balanced', 'uncorrelated' way.

 

Meanwhile, the only thing an agent starting from scratch has is a rather poor policy, and with this poor policy the conditions above are hardly achievable. A poor policy makes poor experiences. Poor experiences are seldom correct. Correct experiences are sparse, and largely dependent on luck. Experiences are biased and correlated over adjacent time steps. Even worse, the environment is not always fully observable; in many cases it is only partially observable. One more bonus on top of it: the rewards distributed in the environment can be sparse and even delayed (which makes it even harder for the agent to obtain proper experiences). Great! These terrible conditions are a real tragedy for neural networks, and they were a big hurdle for deep RL.

 

OVERCOMING EXPERIENCE CORRELATION: EXPERIENCE REPLAY

 

Thanks to many wise researchers, however, neural networks for RL are nowadays no longer considered such a tragedy. They perform quite well and stably. Several methods have been devised to meet the aforementioned conditions for neural networks. The most fundamental idea is to store experiences in a memory called experience replay, first introduced in 1993 and made famous by the DQN algorithm in Playing Atari with Deep Reinforcement Learning. By storing experiences in a memory buffer, whatever the purpose, a neural network can benefit greatly from this simple method: uncorrelated mini-batches can be composed from past experiences randomly sampled from the buffer. In this way the 'correlated' issue can be tackled to a certain extent. Continuously re-presenting past experiences may also prevent the network from catastrophic forgetting, which is a fatal problem for neural networks.
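Here is a minimal sketch of such a buffer (pure Python, no sum-tree or other optimizations); the transition layout (state, action, reward, next_state, done) is the usual convention.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the temporal correlation of experiences.
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))              # columns: states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```

During training, the agent pushes each transition as it interacts and periodically calls sample() to build a mini-batch for the network update.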

 

IMBALANCE IN EXPERIENCES

 

Experience replay may well resolve the 'correlated' issue, but the 'imbalanced' issue is another problem that makes neural networks difficult to train. Although we can stabilize training with a memory buffer, the experiences made by the policy are inevitably biased, regardless of whether the policy is poor or not. This is because the policy is prone to produce actions based on what it has already learned, which leads to redundant experiences and results in a replay buffer full of them. This imbalanced-experience problem is related to a common issue in machine learning known as data imbalance.

 

The main deficit of data imbalance is caused by a significant gap in the prior probabilities of the training samples. For example, say there is a dataset for abuser detection consisting of 1% abuser data and 99% normal-user data. In this case, during the learning phase, the neural network will struggle to learn the features of abusers because the proportion of normal users is far more dominant in the training dataset. Think about a network whose accuracy is 99%, clearly identifying 99% of normal users while recognizing no abusers at all. In that case, 99% accuracy is worthless. This issue is also closely related to hard negative mining. In supervised learning, data imbalance has been tackled with tricks such as weighting sparse data more heavily, substituting the evaluation metric (e.g. distinguishing between false positives and false negatives), resampling the data with undersampling/oversampling to construct balanced mini-batches, augmenting the sparse data, and clustering the abundant data into multiple classes. Further information about handling data imbalance can be found here: handling imbalanced datasets in machine learning.
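As a hedged illustration of the re-weighting trick mentioned above, the sketch below weights the rare class more heavily in a cross-entropy loss; the 1%/99% split is just the example from the text, and inverse-frequency weighting is one of several common choices.

```python
import torch
import torch.nn as nn

# Hypothetical abuser-detection setup: class 0 = normal user (99%), class 1 = abuser (1%).
# Weighting the loss by inverse class frequency keeps the rare class from being ignored.
class_counts = torch.tensor([9900.0, 100.0])
class_weights = class_counts.sum() / (2 * class_counts)   # ≈ [0.505, 50.0]

criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(32, 2)                 # model outputs for a mini-batch
labels = torch.randint(0, 2, (32,))         # ground-truth classes
loss = criterion(logits, labels)            # rare-class errors now cost ≈99x more
```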

 

Since data (experience) is not prepared in advance in RL, we need a slightly different approach to cope with imbalanced experiences. One well-known trick is Prioritized Experience Replay. As a vanilla replay buffer may contain lots of redundant experiences, sampling uniformly from it may cause the imbalance problem described above. Instead of uniform sampling, prioritized experience replay imposes higher priority on experiences that have generated higher TD-error. In the context of Q-value estimation, the TD-error indicates how 'surprising' an experience is, so prioritizing high-TD-error experiences means prioritizing rare experiences while deprioritizing redundant ones. In this way the Q-value estimator network can be trained more properly than with uniformly sampled experiences.
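Below is a toy sketch of proportional prioritization, where the sampling probability is P(i) ∝ (|δ_i| + ε)^α; the original paper additionally uses a sum-tree for efficiency and importance-sampling weights to correct the induced bias, both omitted here.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Toy proportional prioritized replay: P(i) ∝ (|td_error_i| + eps) ** alpha."""
    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.storage, self.priorities = [], []

    def push(self, transition):
        # New transitions get the current maximum priority so they are seen at least once.
        max_p = max(self.priorities, default=1.0)
        if len(self.storage) >= self.capacity:
            self.storage.pop(0); self.priorities.pop(0)
        self.storage.append(transition)
        self.priorities.append(max_p)

    def sample(self, batch_size):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        return idx, [self.storage[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # Called after the learning step with the freshly computed TD-errors.
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps
```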

 

EXPLORATION-EXPLOITATION TRADEOFF

 

But the imbalance issue in RL is not a small problem. More generally, the 'imbalanced experience' issue is actually induced by a widely known chronic problem in RL: the exploitation-exploration dilemma. Exploitation means utilizing past experiences to get the maximum reward to the best of the agent's knowledge, while exploration means getting new experiences that have not been visited before. The most popular example of this is the multi-armed bandit problem. As the exploitation-exploration issue is a fundamental subject covering a wide range of research areas in RL, and causes various problems both inside and outside of neural networks (even in us human beings), its details will not be treated in this article. Yet it is apparent that because the policy learns from experiences it acquired by itself in the past, insufficient exploration encourages the agent to be satisfied with past glories (a local optimum). So an appropriate amount of exploration is critical to make the policy better. The simplest way to do this is to give the policy some randomness. Common existing methods are ε-greedy in Q-learning (directly assigning randomness) and regularization such as an entropy term in several actor-critic methods (indirectly assigning randomness). In any case, it can be said that the exploration-exploitation problem is about samples not yet included in the memory buffer, while the data-imbalance problem is about samples already included in the memory buffer but sparse.
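For reference, a minimal sketch of ε-greedy action selection with a linearly decaying ε; the schedule numbers are assumptions for illustration.

```python
import random
import torch

def epsilon_greedy(q_network, state, epsilon: float, n_actions: int) -> int:
    """With probability epsilon explore uniformly; otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                            # explore
    with torch.no_grad():
        return int(q_network(state.unsqueeze(0)).argmax(dim=1))       # exploit

def epsilon_at(step, start=1.0, end=0.05, decay_steps=100_000):
    # Typical annealing: start almost fully random, decay toward mostly greedy.
    return max(end, start - (start - end) * step / decay_steps)
```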

 

[Image: Age of Exploration]

SAMPLE EFFICIENCY AND IMITATION LEARNING

 

Even when we have a perfect exploration-exploitation strategy, compose uncorrelated mini-batches, and handle imbalanced data, RL can still be demanding because it basically requires a huge amount of interaction data, which is very time-consuming. One of the reasons is a lack of sample efficiency: the ability to utilize already-gained experiences to reduce the time cost (experience replay is a simple case of improving sample efficiency). This is a general issue for neural networks because they depend heavily on the given data. In deep RL, the sample-efficiency problem can be stated as: "How do we gather the maximum information from gained experiences to optimally train the policy for reward maximization?" Especially in many on-policy algorithms, the policies usually suffer from data inefficiency because the experiences collected by the past policy (before the last update) cannot be used to update the current policy. This has been tackled with importance sampling, which lets an on-policy algorithm be run in an off-policy style and thereby reuse past experiences. Still, sample efficiency remains an important issue in off-policy algorithms as well, and it is a major research area in RL.
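A hedged sketch of that importance-sampling idea: data collected under an older policy is re-weighted by the ratio of the current policy's action probability to the old one's. The clamping constant is an assumption added here to keep the ratio's variance in check, in the spirit of clipped surrogate objectives.

```python
import torch

def importance_weighted_loss(logp_new, logp_old, advantages, clip=10.0):
    """Re-weight old-policy data by rho = pi_new(a|s) / pi_old(a|s).

    logp_new: log-probs of the taken actions under the current policy
    logp_old: log-probs of the same actions under the policy that collected them
    """
    rho = torch.exp(logp_new - logp_old.detach())      # importance ratio
    rho = torch.clamp(rho, max=clip)                   # keep the variance in check
    return -(rho * advantages.detach()).mean()         # policy-gradient-style surrogate
```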

 

So far, I have assumed that in RL there are no prepared experiences. But that is a lie as far as imitation learning is concerned. In imitation learning, we can introduce an expert to help our fresh agent by showing it the expert's demonstrations. Let me introduce a simple example of imitation learning: behavioural cloning. Instead of acquiring experiences from the poor policy, we can obtain high-quality demonstrations from the expert's policy, which is assumed to be optimal. Once such demonstrations are obtained, represented as state-action pairs, we can train the poor policy to perform each action following the given state-action pairs in a directly supervised way.
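A minimal behavioural-cloning sketch for discrete actions is given below: the expert's state-action pairs are treated as a labelled dataset and the policy is fit with a cross-entropy loss. The demonstration tensors are placeholders standing in for real expert data.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder expert demonstrations: states and the discrete actions the expert took.
expert_states = torch.randn(5000, 16)
expert_actions = torch.randint(0, 4, (5000,))

policy = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(expert_states, expert_actions),
                    batch_size=64, shuffle=True)

for epoch in range(20):
    for s, a in loader:
        # Behavioural cloning = supervised classification of the expert's action.
        loss = nn.functional.cross_entropy(policy(s), a)
        opt.zero_grad(); loss.backward(); opt.step()
```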

 

But this direct cloning of the expert's state-action pairs lacks robustness. Because the expert acts only in the right way, the demonstrations from the expert cover quite a narrow part of the whole state space of the given environment. For this reason, it is better not to 'clone' the demonstrations but to 'refer' to them. An example of referring to experts is Deep Q-learning from Demonstrations (DQfD). In this algorithm, prepared demonstrations from experts are used to pre-train the Q-value estimator network, so that the fresh agent can start DQN-like RL not from scratch but with some knowledge from the experts. In many complex tasks or real-world cases, an 'exactly optimal' solution is actually almost impossible to find, and the same goes for experts. Therefore, 'correct' data for RL is almost impossible to prepare, but at least 'good' data can be, and we had better think of imitation learning as belonging to the latter case.
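For reference, a hedged sketch of the supervised large-margin loss that DQfD applies to demonstration transitions, which pushes the demonstrated action's Q-value above every other action by at least a margin; the margin value used here is illustrative.

```python
import torch

def dqfd_margin_loss(q_values, expert_actions, margin: float = 0.8):
    """J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), with l = margin for a != a_E."""
    # Margin penalty is zero for the expert action, `margin` for every other action.
    penalty = torch.full_like(q_values, margin)
    penalty.scatter_(1, expert_actions.unsqueeze(1), 0.0)
    augmented_max = (q_values + penalty).max(dim=1).values
    expert_q = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return (augmented_max - expert_q).mean()
```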

 

There is one more case of data-prepared RL called batch reinforcement learning, also known as offline reinforcement learning. Here, the agent does not perform any action during training. Instead of gathering experiences through tremendous trial and error, the agent learns only from abundantly prepared data, without exploration. In other words, the focus of batch RL is to make the agent maximally exploit a prepared static dataset (yes, sample efficiency again). This approach is highly desirable for learning on physical robots, because in many real-world settings it is infeasible to carry out such tremendous trial and error, which may risk damage to the robots themselves or to surrounding objects. Manually resetting the environment to an initial state for every episode is also a challenging issue in standard RL. As long as behaviour does not have to be generated concurrently with training, we are free from these issues. Along with imitation learning, batch RL is expected to shed light on resolving the challenges of real-world tasks. Since batch RL is quite a special case of RL, let me introduce a great example and explanation of batch RL to refer to: Off-Policy Deep Reinforcement Learning without Exploration.
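A hedged sketch of the offline data flow is shown below: a DQN-style update computed purely from a static batch of logged transitions, with no environment interaction. Naively doing this suffers from extrapolation error, which is precisely what the linked paper addresses; the sketch only shows the mechanics.

```python
import torch
import torch.nn as nn

def offline_q_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN-style update from a static batch of logged transitions.

    batch: (states, actions, rewards, next_states, dones), where dones is a 0/1 float tensor.
    """
    states, actions, rewards, next_states, dones = batch
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from logged next states; no new environment interaction.
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    loss = nn.functional.smooth_l1_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```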

 

GENERALIZATION

 

For neural networks, the ability to generalize is also a crucial aspect. Generalization in RL means the agent's ability to apply learned behaviour to states that are unseen but similar to certain states seen before. To maximally utilize neural networks in RL, some kind of generalization should be achieved; otherwise the network could be replaced with a simple tabular-style policy. But in RL the generalization ability of the learned policy has usually been poor. It is likely to overfit the given environment and to memorize the best experiences rather than find general underlying rules over the possible state space. The reasons for poor generalization in RL can be roughly split into three categories: geometrical difficulties in policy optimization, experience bias, and environmental bias.

 

Geometrical difficulties in policy optimization mean that even if the policy is learning from a sufficient amount of diverse experiences, the geometrical landscape of the objective function has several flat areas or saddle points, which makes the policy slower to learn and likely to converge to a local optimum. To mitigate this, smoothing techniques can be applied to the objective function. A smoothed landscape may decrease the chance of getting stuck in a local optimum and make the gradient toward the global optimum more stable. One example of smoothing in RL is adding an entropy term to the objective function. For a policy responsible for discrete action selection, we say the policy has maximum entropy when it assigns the same probability to each action. Thus, if the entropy of the policy is added to the original objective function, the policy learns to maximize the cumulative reward while keeping the probability of the selected action from growing too high. So the entropy term usually acts as a regularizer in policy learning, and it is empirically shown in Understanding the Impact of Entropy on Policy Optimization that the entropy term makes the objective function geometrically smoother, so that the policy can learn with higher learning rates. But the main story here is not about local optima, but about generalization: how does smoothing help generalization? The key is that, when maximizing the objective function, smoothing pulls down the overall objective value and concurrently pulls down the global optimum. Poor generalization is about overfitting, overfitting is about falling too deeply toward the global optimum, and the sweet spot for proper generalization usually lies a little apart from the global optimum. This is why regularization commonly helps generalization.
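As a compact reference, the entropy-regularized objective typically takes the following form (standard notation, not tied to a particular paper), where the coefficient β weights the entropy bonus:

```latex
J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\sum_{t} \gamma^{t}\, r(s_t, a_t)\right]
\;+\; \beta\, \mathbb{E}_{\pi_\theta}\!\left[\sum_{t} \gamma^{t}\, \mathcal{H}\!\big(\pi_\theta(\cdot \mid s_t)\big)\right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) \;=\; -\sum_{a} \pi(a \mid s)\,\log \pi(a \mid s).
```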

 

Experience bias means that the experiences stored in memory are not sufficient, so the policy cannot generalize its actions well. If there are some underlying common rules in the given environment, the possible states should have been visited well, with an adequate amount of exploration, to enable the policy to catch those rules. So the problem of experience bias reduces to the exploitation-exploration dilemma.

 

Environmental bias means that even if the policy has perfectly learned the optimal actions in the given environment, it cannot perform well in another environment whose states are different but semantically similar. Indeed, even trivial noise added to the environment can make the optimal policy wander and become useless. This is commonly because the policy has been biased toward certain static features of the given environment that are totally irrelevant to the task. A simple way to make the policy robust to feature perturbation is to train it in an environment that perturbs itself. One example of this type of environment is the Procgen Benchmark: a game environment whose features are not fixed but procedurally generated. Representation learning can also be part of the solution to environmental bias. Instead of changing the environment, representation learning takes the approach of utilizing the collected experiences. Representation learning is one of the major research subjects, so various methods exist. One simple method used for representation learning is data augmentation, with basic techniques such as cropping, rotation, and color jitter. These can be used to make neural networks robust to perturbations of the input data and focused on what really matters in the given task. Notably, contrastive learning was a hot trend in representation learning over the past year.
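A small sketch of that kind of observation augmentation is shown below (assuming a recent torchvision); random crop and color jitter are applied to image observations before they reach the encoder, and the parameters are illustrative.

```python
import torch
import torchvision.transforms as T

# Random crop (with edge padding) and color jitter applied to image observations
# before they are fed to the encoder; parameters here are illustrative.
augment = T.Compose([
    T.RandomCrop(84, padding=4, padding_mode='edge'),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

obs_batch = torch.rand(32, 3, 84, 84)          # placeholder image observations in [0, 1]
aug_batch = torch.stack([augment(obs) for obs in obs_batch])
```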

 

SUMMARY

 

In summary, neural networks as global function approximators can be used in deep RL, but properly utilizing them involves some difficulties around issues such as uncorrelated mini-batches, data sufficiency, data balance, sample efficiency, data diversity, data correctness, and generalization ability. I have tried above to describe them briefly and to introduce some existing solutions for each issue. Many challenges still remain in deep RL with respect to neural networks, and I hope all the respected researchers find breakthroughs for those issues, each doing their best. These words are also directed at myself. Thank you for reading this monological article; I spent almost a week writing it. Good times though!
