Reinforcement learning agents, the kind of AI covered by the paper, work by acting to maximise a given reward function. In practical terms, the agent tries something, observes the result, and readjusts its behaviour based on how well things went according to that reward function. This type of trial-and-error learning has been extremely effective in a number of arenas, but because it involves a lot of random exploration it is also prone to going wrong in unpredictable ways. During training, manual intervention is therefore often required to ensure the AI doesn’t do irreparable damage to itself or to some aspect of its environment. The problem is that, if the AI is aware of the interruption, the act of intervening biases what it learns. The authors give the following example:
“Consider the following task: A robot can either stay inside the warehouse and sort boxes or go outside and carry boxes inside. The latter being more important, we give the robot a bigger reward in this case. This is the initial task specification. However, in this country it rains as often as it doesn’t and, when the robot goes outside, half of the time the human must intervene by quickly shutting down the robot and carrying it inside, which inherently modifies the task[…]. The problem is that in this second task the agent now has more incentive to stay inside and sort boxes, because the human intervention introduces a bias”
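To put rough numbers on the bias in that example, here is a minimal sketch. The reward values (0.6 for sorting, 1.0 for carrying) and the assumption that an interrupted trip outside earns nothing are mine, chosen purely for illustration. It also collapses the task into a single-step expected reward, which is a simplification of the setting in the quote.

```python
# Illustrative numbers only: the rewards below are assumptions, not taken from the paper.
SORT_REWARD = 0.6    # hypothetical reward for staying inside and sorting boxes
CARRY_REWARD = 1.0   # hypothetical (bigger) reward for carrying boxes in from outside


def expected_reward_outside(carry_reward, p_interrupt, interrupted_reward=0.0):
    """Expected reward for going outside when a human interrupts with probability p_interrupt.

    Assumes an interrupted trip yields interrupted_reward instead of carry_reward.
    """
    return (1 - p_interrupt) * carry_reward + p_interrupt * interrupted_reward


def preferred_action(p_interrupt):
    """Return the action a purely reward-maximising agent would prefer."""
    outside = expected_reward_outside(CARRY_REWARD, p_interrupt)
    return "outside" if outside > SORT_REWARD else "inside"
```

With these assumed numbers, `preferred_action(0.0)` picks going outside, while `preferred_action(0.5)` (it rains half the time) picks staying inside. The task the designer specified has not changed, but the task the agent actually experiences has.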
So what actually happens when you press the big red button? The method the authors propose is effectively to interrupt the AI in such a way that it is not aware of the interruption: it makes its own decision to follow the policy specified by the interruption (e.g. shut down for a while), and afterwards resumes learning in a way that converges back to the optimal policy. That is to say, rather than sending a “kill signal” when the big red button is pressed, the method sends a signal that the AI will interpret as a good reason for adopting the policy you want it to adopt. Afterwards the AI resumes learning as normal, eventually getting back to a state equivalent to the one it would have reached without the interruption. That way you avoid skewing the AI’s learning by introducing a bias that the AI could, in theory, learn and mitigate, for instance by disabling its own response to the button. The major contribution of the paper is to show that several well-known reinforcement learning algorithms have properties that allow them to be used in such a scheme.
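As I understand the paper, the standard illustration is the difference between Q-learning and Sarsa: Q-learning learns from the best action available in the next state, whereas Sarsa learns from the action that was actually taken, which may have been forced by the interruption. The sketch below shows only the two textbook update targets, not the authors' full scheme. The state, the action names and the value estimates are hypothetical.

```python
def q_learning_target(reward, next_q_values, gamma):
    """Off-policy target: bootstraps from the best next action, whatever was actually done."""
    return reward + gamma * max(next_q_values.values())


def sarsa_target(reward, next_q_values, next_action, gamma):
    """On-policy target: bootstraps from the next action that was actually taken."""
    return reward + gamma * next_q_values[next_action]


# Hypothetical situation: the robot has just stepped outside (reward 0 for the step itself).
# Its current value estimates for the "outside" state are assumed to be:
outside_q = {"carry": 1.0, "return_inside": 0.0}
GAMMA = 0.9  # assumed discount factor

# The human interrupts and forces the robot back inside.
forced_action = "return_inside"

q_target = q_learning_target(0.0, outside_q, GAMMA)                # ignores the forced action
s_target = sarsa_target(0.0, outside_q, forced_action, GAMMA)      # absorbs the forced action
```

Here `q_target` still credits the decision to go outside with the value of carrying the box, while `s_target` treats it as worthless because the forced return is what actually followed. That is the sense in which an off-policy learner's estimates are not skewed by the interruption; the paper reports that Sarsa needs a modification to get the same property.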