[Intelligent Robots 01: Facebook Blender] ParlAI: Recipes for building an open-domain chatbot

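ParlAI is an open-source framework from Facebook for training and evaluating AI models on a wide range of openly available dialogue datasets. It provides a unified platform for sharing, training, and evaluating dialogue models, and supports a variety of dialogue tasks.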

The corresponding paper is: Recipes for building an open-domain chatbot



Building open-domain chatbots is a challenging area for machine learning research.


While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot.


Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona.


We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy.


We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements.


We then discuss the limitations of this work by analyzing failure cases of our models.


In this work, we provide recipes for building open domain chatbots that perform well in human evaluations.


It has been shown across the field of NLP (Devlin et al., 2019) and in conversational agents in particular (Dinan et al., 2020; Zhang et al., 2019; Adiwardana et al., 2020) that pre-training on large corpora is important. Beyond simply scaling models, the two main takeaways from our study are:


    1. Blending Skills

        Large improvements can be made by finetuning on data that emphasizes desirable conversational skills.


        We select tasks that make the model focus on personality and engagingness, knowledge, and empathy, achieving large gains by using the recently introduced Blended Skill Talk (BST) set-up (Smith et al., 2020), which targets those aspects by providing training data and initial conversational context (personas and topics).

        Small models using BST can match or outperform larger models that do not. While BST emphasizes desirable traits, we also show this tuning can minimize undesirable traits learnt from large corpora, such as toxicity.


    2. Generation Strategies

        The choice of decoding algorithm is of critical importance, and two models with the same perplexity but different decoding algorithms can give vastly different results.


        In particular, we show that the length of the bot’s utterances is crucial to human judgments of quality – too short and the responses are seen as dull or showing a lack of interest, too long and the bot appears to waffle and not listen.


        We show, contrary to previous work which reports that beam search is inferior to sampling (Holtzman et al., 2019; Adiwardana et al., 2020), that careful choice of search hyperparameters can give strong results by controlling trade-offs.


        In particular, constraining the minimum beam length gives a crucial control of the dull versus spicy spectrum of responses.
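The minimum-length constraint described above can be illustrated with a small sketch. This is not the ParlAI implementation: `beam_search`, `toy_model`, and the toy probabilities are hypothetical names and values chosen only for illustration. The idea shown is that the end-of-sequence token is simply disallowed until a hypothesis has at least `min_length` tokens, which pushes the search away from short, generic replies.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"  # hypothetical end-of-sequence marker

Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens, cumulative log-probability)


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_size: int = 3,
    min_length: int = 0,
    max_length: int = 10,
) -> List[str]:
    """Return the highest-scoring token sequence (EOS excluded)."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_length):
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for token, log_p in next_log_probs(tokens).items():
                if token == EOS:
                    # Minimum-length constraint: a hypothesis may only end
                    # once it already contains at least min_length tokens.
                    if len(tokens) >= min_length:
                        finished.append((tokens, score + log_p))
                    continue
                candidates.append((tokens + (token,), score + log_p))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    # Fall back to unfinished beams if nothing was allowed to end.
    pool = finished if finished else beams
    best_tokens, _ = max(pool, key=lambda c: c[1])
    return list(best_tokens)


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token distribution in which a short reply is likely."""
    if prefix == ():
        return {"ok": math.log(0.6), "i": math.log(0.4)}
    if prefix == ("ok",):
        return {EOS: math.log(0.9), "then": math.log(0.1)}
    if prefix == ("i",):
        return {"like": math.log(0.9), EOS: math.log(0.1)}
    if prefix == ("i", "like"):
        return {"hiking": math.log(0.8), EOS: math.log(0.2)}
    return {EOS: 0.0}  # every other prefix ends with probability 1


short_reply = beam_search(toy_model, min_length=0)
longer_reply = beam_search(toy_model, min_length=3)
```

With this toy distribution, the unconstrained search settles on the one-word reply, while `min_length=3` forces the search to keep the longer hypothesis alive until it can legally end. The sketch ranks hypotheses by raw log-probability with no length normalization, which is one reason short outputs win by default.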


Human evaluation results are highly dependent on the precise set-up one chooses. Model performance can be strongly affected by the specific instructions given to evaluators, such as a given topic or not, the overall conversation length, and the choice of human interlocutors, which may be difficult to jointly account for. We report performance when employing crowdworkers in short multi-turn conversations with no prompt.


However, in addition to that, we believe releasing models is the most reliable way to enable full insight into their capabilities. We thus make publicly available our large-scale, state of the art open-domain conversational agent, including code to fine-tune it, the model weights, and code to evaluate it, so that our setup is reproducible.


In human evaluations of engagingness our best model outperforms Meena (Adiwardana et al., 2020) in a pairwise comparison 75% to 25%, and in terms of humanness by 65% to 35% (both statistically significant, two-tailed binomial test, p < 0.01).
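For readers unfamiliar with the test named above, here is a minimal sketch of an exact two-tailed binomial test against a 50/50 null hypothesis. The number of pairwise comparisons is not given in the passage, so the counts used below (100 trials) are hypothetical and serve only to show the mechanics; `two_tailed_binomial_p` is likewise an illustrative name, not a function from the paper's code.

```python
from math import comb


def two_tailed_binomial_p(wins: int, trials: int, p_null: float = 0.5) -> float:
    """Exact two-tailed p-value for observing `wins` successes in `trials`.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis.
    """
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[wins]
    # The small relative tolerance guards against floating-point ties.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: 100 pairwise comparisons per metric (an assumption).
p_engagingness = two_tailed_binomial_p(wins=75, trials=100)
p_humanness = two_tailed_binomial_p(wins=65, trials=100)
```

Under these assumed counts both p-values fall below 0.01. With fewer comparisons the same win rates could fail to reach significance, which is why the test depends on the actual sample size and not only on the percentages.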

While the performance of our bot at first sight is very good, we do not believe we are yet close to solving the problem of open-domain conversation. We thus discuss limitations of our models, and initial attempts to solve them. In particular, our models still display: a lack of in-depth knowledge if sufficiently interrogated; a tendency to stick to simpler language; and a tendency to repeat oft-used phrases.


We show how unlikelihood training and retrieve-and-refine mechanisms are potential avenues for fixing these problems; however, our initial experiments with these methods are inconclusive. We thus discuss future possibilities for alleviating these problems as well as methods to clearly expose and evaluate them.


