Dylan Hadfield-Menell Smitha Milli Pieter Abbeel∗ Stuart Russell Anca Dragan
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94709
{dhm, smilli, pabbeel, russell, anca}@cs.berkeley.edu
Abstract
Autonomous agents optimize the reward function we give them. What they don’t
know is how hard it is for us to design a reward function that actually captures
what we want. When designing the reward, we might think of some specific
training scenarios, and make sure that the reward will lead to the right behavior
in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of
terrain) where optimizing that same reward may lead to undesired behavior. Our
insight is that reward functions are merely observations about what the designer
actually wants, and that they should be interpreted in the context in which they were
designed. We introduce inverse reward design (IRD) as the problem of inferring the
true objective based on the designed reward and the training MDP. We introduce
approximate methods for solving IRD problems, and use their solution to plan
risk-averse behavior in test MDPs. Empirical results suggest that this approach can
help alleviate negative side effects of misspecified reward functions and mitigate
reward hacking.
Inverse Reward Design
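To make the abstract's two steps concrete (infer the true objective from the designed reward and the training environment, then plan risk-aversely at test time), here is a minimal sketch in a toy setting. Everything in it is an illustrative assumption rather than the authors' implementation: rewards are linear in hypothetical terrain features (dirt, grass, lava), the environment is reduced to a finite list of trajectory feature vectors, the candidate true rewards form a small finite set with a uniform prior, the designer is modeled as choosing proxies that score well under the true reward in training, and risk aversion is approximated by a worst case over candidates above a probability threshold.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_trajectory(weights, trajectories):
    """Feature vector of the trajectory with the highest linear reward."""
    return max(trajectories, key=lambda phi: dot(weights, phi))

def ird_posterior(proxy, candidates, train_trajs, beta=5.0):
    """Posterior over candidate true rewards, given the proxy and the
    training environment. Assumed observation model: the designer is more
    likely to pick a proxy whose optimal training behavior earns high TRUE
    reward. The candidate set doubles as the set of possible proxies, and
    the prior over candidates is uniform."""
    phi_proxy = best_trajectory(proxy, train_trajs)
    scores = []
    for w in candidates:
        # Normalizer: how well each alternative proxy would have done under w.
        z = sum(math.exp(beta * dot(w, best_trajectory(p, train_trajs)))
                for p in candidates)
        scores.append(math.exp(beta * dot(w, phi_proxy)) / z)
    total = sum(scores)
    return [s / total for s in scores]

def risk_averse_choice(candidates, posterior, test_trajs, min_prob=0.05):
    """Pick the test trajectory with the best worst-case reward over all
    candidates that remain plausible (a simplification chosen here)."""
    plausible = [w for w, p in zip(candidates, posterior) if p >= min_prob]
    return max(test_trajs, key=lambda phi: min(dot(w, phi) for w in plausible))

# Hypothetical features: fraction of the path on (dirt, grass, lava).
train_trajs = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.0, 1.0, 0.0)]  # no lava seen
proxy = (1.0, -1.0, 0.0)  # designer never considered lava, so its weight is 0
candidates = [proxy, (1.0, -1.0, -10.0), (1.0, -1.0, 10.0), (-1.0, 1.0, 0.0)]

posterior = ird_posterior(proxy, candidates, train_trajs)

# Test environment introduces lava: a shortcut through it versus a detour.
test_trajs = [(0.6, 0.0, 0.4), (0.7, 0.3, 0.0)]
literal_choice = best_trajectory(proxy, test_trajs)
cautious_choice = risk_averse_choice(candidates, posterior, test_trajs)
```

In this toy example, training contains no lava, so the three candidates that differ only in their lava weight end up equally plausible, while the candidate that contradicts the proxy on dirt and grass is nearly ruled out. Optimizing the proxy literally takes the lava shortcut; the worst-case planner takes the detour because it cannot exclude the possibility that lava is very bad.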