Human planning is hierarchical. Whether planning something simple like cooking dinner or something complex like a trip abroad, we usually begin with a rough mental sketch of the goals we want to achieve (“go to India, then return back home”). This sketch is then progressively refined into a detailed sequence of sub-goals (“book flight ticket”, “pack luggage”), sub-sub-goals, and so on, down to the actual sequence of bodily movements, which is far more detailed than the original sketch.
Efficient planning requires knowledge of the abstract high-level concepts that constitute the essence of hierarchical plans. Yet how humans learn such abstractions remains a mystery.
Here, we show that humans spontaneously form such high-level concepts in a way that allows them to plan efficiently given the tasks, rewards, and structure of their environment. We also show that this behavior is consistent with a formal model of the underlying computations, thus grounding these findings in established computational principles and relating them to previous studies of hierarchical planning.
Example of hierarchical planning [6]
The figure above depicts an example of hierarchical planning, namely how someone might plan to get from their office in Cambridge to purchase a dream wedding dress in Patna, India. Circles represent states and arrows represent actions that transition between states. Each higher-level state represents a cluster of states at the level below. Thicker arrows indicate transitions between higher-level states, which often come to mind first.
(A Bayesian perspective)
When applied to computationally intelligent agents, hierarchical planning could enable models with advanced planning abilities. Hierarchical representations can be modeled from a Bayesian viewpoint by assuming a generative process for the structure of a particular environment. Existing work on this problem includes the development of a computational framework for acquiring hierarchical representations under a set of simplified assumptions on the hierarchy, i.e. modeling how people create clusters of states in their mental representations of reward-free environments in order to facilitate planning.
In this work, we contribute a Bayesian cognitive model of hierarchical discovery that combines knowledge of clustering and rewards to predict cluster formation, and compare the model to data obtained from humans.
We analyze situations with both static and dynamic reward mechanisms, finding that humans generalize information about rewards to high-level clusters and use information about rewards to create clusters, and that reward generalization and reward-based cluster formation can be predicted by our proposed model.
(Theoretical background)
A key area where psychology and neuroscience combine is the formal understanding of how humans behave when given tasks to accomplish. We ask:
What is the planning and methodology employed by human agents when faced with accomplishing some task? How do humans discover useful abstractions?
This is especially interesting in light of the unique ability of humans and animals to adapt to new environments. Previous literature on animal learning suggests that this flexibility stems from a hierarchical representation of goals that allows for complex tasks to be broken up into low-level subroutines that can be extended across a variety of contexts.
(Chunking)
The process of chunking occurs when actions are stitched together into temporally extended action sequences that achieve distant goals. Chunking is often the result of the transfer of learning from a goal-directed system to a habitual system, which executes actions in a stereotyped way.
From a computational standpoint, such a hierarchical representation allows for agents to quickly execute actions in an open loop, reuse familiar action sequences whenever a known problem is encountered, learn faster by tweaking established action sequences to solve problems reminiscent of those seen previously, and plan over extended time horizons. Agents do not need to be concerned with the minuscule tasks associated with goal achievement, e.g., the goal of going to the store being broken down into leaving the house, walking, and entering the store as opposed to getting up out of bed, moving the left foot forward, then the right one, etc.
(Hierarchical reinforcement learning)
The question of how agents should make rewarding decisions is the subject of reinforcement learning. Hierarchical reinforcement learning (HRL) has become the prevailing framework for representing hierarchical learning and planning. Within research on modeling of HRL, several ideas have been presented around potential methods of model construction.
We focus on the idea that people spontaneously organize their environment into clusters of states that constrain planning. Such hierarchical planning is more efficient in time and memory than naive or flat planning over low-level actions, and is consistent with people’s limited working memory capacity [3].
In the diagram below, thick nodes and edges indicate that they must all be considered and maintained in short-term memory in order to compute the plan, and gray arrows indicate cluster membership. We observe that planning how to get from state s to state g in the low-level graph G takes at least as many steps as actually executing the plan (top), that introducing the high-level graph H alleviates this problem and reduces computational costs (middle), and that extending the hierarchy recursively further reduces the time and memory involved in planning (bottom).
Hierarchical representations reduce the computational costs of planning [6]
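To make the cost argument concrete, here is a minimal sketch (not from the paper; the toy graph and cluster names are hypothetical) comparing how many states a breadth-first planner must touch when searching the low-level graph G directly versus searching a cluster-level graph H first:

```python
from collections import deque

def bfs_expansions(adj, start, goal):
    """Count the nodes a breadth-first planner expands before reaching the goal."""
    frontier, seen, expanded = deque([start]), {start}, 0
    while frontier:
        node = frontier.popleft()
        expanded += 1
        if node == goal:
            break
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return expanded

# Low-level graph G: two chains of states joined by a single corridor (hypothetical).
G = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5, 7], 7: [6]}
# High-level graph H: each chain collapsed into one cluster, linked by one bridge.
H = {"left": ["right"], "right": ["left"]}

print(bfs_expansions(G, 0, 7))             # flat planning expands every state (8)
print(bfs_expansions(H, "left", "right"))  # cluster-level planning expands 2 states
```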
Solway et al. provide a formal definition of an optimal hierarchy, but they do not specify how the brain might discover it [2]. We hypothesize that an optimal hierarchy depends on the structure of the environment, including both graph structure and the distribution of observable features of the environment, specifically rewards.
(Model)
We assume that agents represent their environment as a graph, where nodes are states in the environment and edges are transitions between states. The states and transitions may be abstract or as concrete as subway stations and the train lines traveling between them.
(Structure)
We represent the observable environment as graph G = (V, E) and the latent hierarchy as H. Both G and H are unweighted, undirected graphs. H consists of clusters, where each low-level node in G belongs to exactly one cluster, and bridges, or high-level edges, that connect these clusters. Bridges can exist between clusters k and k’ only if there is an edge between some v, v’ ∈ V such that v ∈ k and v’∈ k’, i.e., each high-level edge in H has a corresponding low-level edge in G.
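As an illustration only (the names and toy graph below are hypothetical, not the paper's implementation), these structures can be written down roughly as follows, together with the constraint that every bridge must be supported by a low-level edge:

```python
from dataclasses import dataclass, field

@dataclass
class Hierarchy:
    cluster_of: dict                             # node -> cluster id (each node in exactly one cluster)
    bridges: set = field(default_factory=set)    # frozensets {k, k'} of connected cluster ids

def valid_bridges(G_edges, h):
    """A bridge between clusters k and k' is allowed only if some low-level edge
    (v, v') has v in k and v' in k'."""
    for k, kp in (tuple(b) for b in h.bridges):
        supported = any(
            {h.cluster_of[v], h.cluster_of[vp]} == {k, kp} for v, vp in G_edges
        )
        if not supported:
            return False
    return True

# Toy example: a 4-node path graph split into two clusters joined by one bridge.
edges = [(0, 1), (1, 2), (2, 3)]
h = Hierarchy(cluster_of={0: "a", 1: "a", 2: "b", 3: "b"}, bridges={frozenset({"a", "b"})})
print(valid_bridges(edges, h))  # True: the low-level edge (1, 2) supports the a-b bridge
```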
In the illustration below, colors denote cluster assignments. Black edges are considered during planning, while gray edges are ignored by the planner. Thick edges correspond to transitions across clusters. The transition between clusters w and z is accomplished via a bridge.
Example high-level graph (top) and low-level graph (bottom) [6]
Prior to the addition of rewards, the learning algorithm discovers optimal hierarchies given the following constraints:
- Small clusters
- Dense connectivity within clusters
- Sparse connectivity across clusters
However, we do not want clusters to be too small — in the extreme, each node is its own cluster, which renders the hierarchy useless. Additionally, while we want sparse connectivity across clusters, we want to maintain bridges across clusters in order to preserve properties of the underlying graphs.
We use the discrete-time stochastic Chinese Restaurant Process (CRP) as a prior for clusters. The discovery of hierarchies can be accomplished by inverting the generative model to obtain the posterior probability of hierarchy H. The generative model formally presented in [6] generates such hierarchies.
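A minimal sketch of the CRP prior as a sequential "seating" process follows; the concentration parameter alpha below is an illustrative value, not the one used in the paper:

```python
import random

def sample_crp(n_nodes, alpha=1.0, rng=random.Random(0)):
    """Seat nodes one at a time: join an existing cluster with probability proportional
    to its size, or start a new cluster with probability proportional to alpha."""
    assignments = []
    counts = []  # counts[k] = number of nodes already in cluster k
    for _ in range(n_nodes):
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(0)   # a new cluster was opened
        counts[k] += 1
        assignments.append(k)
    return assignments

print(sample_crp(10))  # e.g. a partition of 10 nodes into a few size-biased clusters
```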
(Rewards)
In the context of the graph G, rewards can be interpreted as observable features of vertices. Because people often cluster based on observable features, it is reasonable to model clusters induced by rewards [5]. Furthermore, we assume that each state delivers a randomly determined reward and that the agent’s goal is to maximize the total reward [8].
Since we hypothesize that clusters induce rewards, we model each cluster as having an average reward. Each node in that cluster has an average reward drawn from a distribution centered around the average cluster reward. Finally, each observed reward is drawn from a distribution centered around the average reward of that node. A formal discussion is provided in [1].
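The assumed generative process for rewards can be sketched as below; the Gaussian form and the specific variances are illustrative stand-ins for the distributions discussed formally in [1]:

```python
import random

rng = random.Random(1)

def generate_rewards(cluster_of, mu_0=15.0, sigma_cluster=5.0, sigma_node=2.0, sigma_obs=1.0):
    """Cluster mean -> node mean -> observed reward, each drawn around the level above."""
    clusters = set(cluster_of.values())
    cluster_mean = {k: rng.gauss(mu_0, sigma_cluster) for k in clusters}
    node_mean = {v: rng.gauss(cluster_mean[k], sigma_node) for v, k in cluster_of.items()}
    observed = {v: rng.gauss(m, sigma_obs) for v, m in node_mean.items()}
    return cluster_mean, node_mean, observed

cluster_of = {0: "a", 1: "a", 2: "b", 3: "b"}
print(generate_rewards(cluster_of))
```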
To simplify inference, we first assume that rewards are constant; we call such rewards static. Rewards that can change between observations with some fixed probability are labeled dynamic.
We conducted two experiments to test our hypothesis about human behavior and understand how well it could be predicted by our model. In particular, we studied to what degree clusters drive inferences about rewards, and to what degree rewards drive the formation of clusters. For each experiment, we collected human data and compared it to the predictions of the model.
(Clusters induce rewards)
The goal of the first experiment was to understand how rewards generalize within state clusters. We tested whether graph structure drives cluster formations and whether people generalize a reward observed at one node to the cluster that the node belongs to.
(Setup)
The experiment was conducted by asking 32 human subjects to choose a node to visit next as specified in the following scenario. Participants were randomly presented with either the graph below or a flipped version of it, to ensure that bias of handedness or graph structure was not introduced. We predicted that participants would choose the node adjacent to the labeled one that was located in the larger cluster, i.e. the gray node to the left of the blue one in the first case, and the gray node to the right of the blue one in the second case.
Participants were presented with the following task and associated graph:
You work in a large gold mine that is composed of multiple individual mines and tunnels. The layout of the mines is shown in the diagram below (each circle represents a mine, and each line represents a tunnel). You are paid daily, and are paid $10 per gram of gold you found that day. You dig in exactly one mine per day, and record the amount of gold (in grams) that mine yielded that day. Over the last few months, you have discovered that, on average, each mine yields about 15 grams of gold per day. Yesterday, you dug in the blue mine in the diagram below, and got 30 grams of gold. Which of the two shaded mines will you dig in today? Please circle the mine you choose.
Graph of mines presented to participants [1]
We expected most participants to automatically identify the following clusters, with nodes colored in peach and lavender to denote the different clusters, and to make a decision about which mine to select with these clusters in mind. It was hypothesized that participants would select a peach-colored node as opposed to a lavender one, since the node labeled 30, a reward considerably larger than average, is in the peach-colored cluster.
Graph of mines presented to participants, with likely clusters shown [1]
(Inference)
We approximated Bayesian inference over H using Metropolis-within-Gibbs sampling [4], which updates each component of H by sampling from its posterior, conditioning on all other components in a single Metropolis-Hastings step. We employed a Gaussian random walk as the proposal distribution for continuous components, and the conditional CRP prior as the proposal distribution for cluster assignments [7]. The approach can be interpreted as stochastic hill climbing with respect to a utility function defined by the posterior.
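The sampler can be sketched roughly as below. This is a simplified stand-in, not the paper's implementation: it scores only the CRP prior on the partition and a Gaussian reward likelihood, omits the graph-connectivity terms and the Metropolis-Hastings proposal correction, and uses illustrative parameter values:

```python
import math
import random

rng = random.Random(2)
ALPHA, SIGMA_R, STEP, MU_0 = 1.0, 5.0, 2.0, 15.0

def log_score(assign, cluster_mu, rewards):
    """Unnormalized log posterior: sequential CRP prior + Gaussian reward likelihood."""
    lp, counts = 0.0, {}
    for k in assign:
        n_prev = counts.get(k, 0)
        lp += math.log(n_prev if n_prev > 0 else ALPHA)
        counts[k] = n_prev + 1
    for v, r in enumerate(rewards):
        mu = cluster_mu.get(assign[v], MU_0)
        lp += -0.5 * ((r - mu) / SIGMA_R) ** 2
    return lp

def crp_proposal(assign, v):
    """Propose a cluster for node v from the CRP conditioned on all other nodes."""
    counts = {}
    for u, k in enumerate(assign):
        if u != v:
            counts[k] = counts.get(k, 0) + 1
    labels = list(counts) + [max(counts, default=-1) + 1]   # existing clusters + a fresh one
    weights = [counts.get(k, ALPHA) for k in labels]
    return rng.choices(labels, weights=weights)[0]

def accept(delta):
    return delta >= 0 or rng.random() < math.exp(delta)

def gibbs_sweep(assign, cluster_mu, rewards):
    for v in range(len(assign)):                      # MH step per cluster assignment
        prop = list(assign)
        prop[v] = crp_proposal(assign, v)
        if accept(log_score(prop, cluster_mu, rewards) - log_score(assign, cluster_mu, rewards)):
            assign = prop
    for k in set(assign):                             # random-walk MH step per cluster mean
        prop_mu = dict(cluster_mu)
        prop_mu[k] = cluster_mu.get(k, MU_0) + rng.gauss(0.0, STEP)
        if accept(log_score(assign, prop_mu, rewards) - log_score(assign, cluster_mu, rewards)):
            cluster_mu = prop_mu
    return assign, cluster_mu

rewards = [30, 28, 31, 10, 9, 12]                     # two informal reward groups
assign, cluster_mu = [0] * len(rewards), {0: MU_0}
for _ in range(200):
    assign, cluster_mu = gibbs_sweep(assign, cluster_mu, rewards)
print(assign, {k: round(m, 1) for k, m in cluster_mu.items()})
```

Run long enough, the sweep tends to separate the high- and low-reward nodes into different clusters, which is the hill-climbing behavior described above.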
(Results)
There were 32 participants in each of the human and simulated groups. The top three clusterings output by the model are shown below (left panel). All three were identical, indicating that the model identified the colored groupings with high confidence. The results for participants, as well as those for the static rewards model, are visualized in the bar chart below (right panel), depicting the proportion of human and simulated subjects who chose to visit node 2 next. The solid black line indicates the mean and the dotted black lines indicate the 2.5th and 97.5th percentiles.
Results of the rewards generalization within clusters experiment [1]
The p-values listed in the table below were calculated via a right-tailed binomial test, where the null hypothesis was a binomial distribution over choosing the left or right gray node. The significance level was taken to be 0.05, and both the human experimental results and the modeling results were statistically significant.
Actions taken by humans and the static rewards model [1]
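For reference, such a right-tailed binomial test can be computed with SciPy as below; the counts are hypothetical placeholders, not the reported data:

```python
from scipy.stats import binomtest

n_subjects = 32
n_chose_predicted = 24          # hypothetical count of subjects choosing the predicted node
result = binomtest(n_chose_predicted, n_subjects, p=0.5, alternative="greater")
print(result.pvalue)            # reject the null at alpha = 0.05 if pvalue < 0.05
```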
(Rewards induce clusters)
In the second experiment, the goal was to determine whether rewards induce clusters. We predicted that nodes with the same reward positioned adjacent to each other would be clustered together, even if the structure of the graph alone would not induce clusters.
Recall that Solway et al. showed that people prefer paths that cross the fewest hierarchy boundaries [2]. Therefore, between two otherwise identical paths, the only reason to prefer one over the other would be that it crosses fewer hierarchy boundaries. One possible counterargument is that people pick the path with higher rewards. However, in our setup detailed below, rewards are given only in the goal state, not cumulatively over the path taken. Additionally, the magnitude of rewards was varied between trials. Therefore, it is unlikely that people would favor a path because nodes along that path had higher rewards.
(Setup)
This experiment was conducted on the web using Amazon Mechanical Turk (MTurk). Participants were given the following context about the task:
Imagine you are a miner working in a network of gold mines connected by tunnels. Every mine yields a certain amount of gold (points) each day. On each day, your job is to navigate from a starting mine to a target mine and collect the points from the target mine. On some days, you will be free to choose any mine you like. On those days, you should try to pick the mine that yields the most points. On other days, only one mine will be available. The points of that mine will be written in green and the other mines will be grayed out. On those days, you should navigate to the available mine. The points of each mine will be written on it. The current mine will be highlighted with a thick border. You can navigate between mines using the arrow keys (up, down, left, right). Once you reach the target mine, press the space key to collect the points and start the next day. There will be 100 days (trials) in the experiment.
The graph below (left panel) was presented to participants. As in the previous experiment, participants were randomly given either the configuration shown or a horizontally flipped version of the same graph in order to control for potential left-right asymmetry. Expected induced clusters are depicted as well, with nodes numbered for reference (right panel).
Graph of mines presented to MTurk participants (left), with likely clusters shown (right) [1]
We will refer to the first case, where participants are free to navigate to any mine, as free-choice and the second case, where participants navigate to a specified mine, as fixed-choice. Participants received a monetary reward for each trial to discourage random responses.
At each trial, reward values were changed with probability 0.2. New rewards were drawn uniformly at random from the interval [0, 300]. However, the grouping of rewards remained the same across trials: nodes 1, 2, and 3 always had one reward value, nodes 4, 5, and 6 had a different reward value, and nodes 7, 8, 9, and 10 had a third reward value.
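A small sketch of these reward dynamics (node groupings as described above; the function name is ours):

```python
import random

rng = random.Random(3)
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]

def maybe_rerandomize(rewards, p_change=0.2):
    """With probability p_change, redraw one shared value per group from U[0, 300]."""
    if rng.random() < p_change:
        for group in groups:
            value = rng.uniform(0, 300)
            for node in group:
                rewards[node] = value
    return rewards

rewards = maybe_rerandomize({n: 0.0 for g in groups for n in g}, p_change=1.0)
for _ in range(99):
    rewards = maybe_rerandomize(rewards)
print(rewards)
```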
The first 99 trials allowed the participant to develop a hierarchy of clusters. The final trial, which acted as the test trial, asked participants to navigate from node 6 to node 1. Assuming that rewards induced the clusters shown above, we predicted that more participants would take the path through node 5, which crosses only one cluster boundary, over the path through node 7, which crosses two cluster boundaries.
(Inference)
We modeled the fixed-choice case, with the assumption that the task in all 100 trials was the same as the 100th trial presented to participants, the test trial. First, we assumed static rewards, where the rewards remained constant across all trials. Next, we assumed dynamic rewards, where rewards changed between trials.
In contrast to the previous experiment, where the participant picks a single node and the model predicts that node, this experiment is concerned with the second node of the full path the participant chose to take from the start node to the goal node. Therefore, in order to compare the model to human data, we used a variant of breadth-first search, hereafter referred to as hierarchical BFS, to predict a path from the start node (node 6) to the goal (node 1).
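A sketch of the hierarchical BFS idea, using a small hypothetical graph and clustering for illustration: plan first over the cluster graph defined by the bridges, then search the low-level graph restricted to clusters on that high-level path.

```python
from collections import deque

def bfs(adj, start, goal, allowed=None):
    """Plain BFS returning a path from start to goal, optionally restricted to `allowed` nodes."""
    frontier, parent = deque([start]), {start: None}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent and (allowed is None or nxt in allowed):
                parent[nxt] = node
                frontier.append(nxt)
    return None

def hierarchical_bfs(G, cluster_of, bridges, start, goal):
    # Build the cluster-level adjacency from the bridges and plan over it first.
    H = {}
    for k, kp in bridges:
        H.setdefault(k, []).append(kp)
        H.setdefault(kp, []).append(k)
    cluster_path = bfs(H, cluster_of[start], cluster_of[goal])
    if cluster_path is None:
        return None
    # Then plan at the node level, considering only clusters on the high-level path.
    allowed = {v for v, k in cluster_of.items() if k in set(cluster_path)}
    return bfs(G, start, goal, allowed=allowed)

# Hypothetical low-level graph, clustering, and bridges (for illustration only).
G = {1: [2], 2: [1, 3, 5], 3: [2], 4: [5], 5: [2, 4, 6, 7], 6: [5], 7: [5, 8], 8: [7]}
cluster_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B", 7: "C", 8: "C"}
bridges = [("A", "B"), ("B", "C")]
print(hierarchical_bfs(G, cluster_of, bridges, 6, 1))  # [6, 5, 2, 1]
```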
Static rewards. For each subject, we sampled from the posterior using Metropolis-within-Gibbs sampling and chose the most probable hierarchy, i.e., the hierarchy with the highest posterior probability. Then, we used hierarchical BFS to first find a path between clusters and then between the nodes within the clusters.
Dynamic rewards. For dynamic rewards, we used online inference. For each simulated participant, we allowed the sampling for each trial to progress only 10 steps. Then, we saved the hierarchy, and added information about the modified rewards. Next, we allowed sampling to progress again, starting from the saved hierarchy. As in the human experiment, at the beginning of each trial, there was a 0.2 probability that the rewards would be re-randomized to new values, although the rewards were always equal within clusters. This inference method simulated how human participants might learn cumulatively over the course of many trials. We assumed, for the purpose of this experiment, that people keep only one hierarchy in mind at a time, rather than updating multiple hierarchies in parallel. We also modified the log posterior to penalize disconnected clusters, because such clusters became much more common under this type of inference.
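The online-inference schedule can be sketched as the control flow below; gibbs_sweep and maybe_rerandomize are stubs standing in for the sampler and reward dynamics sketched earlier, and only the scheduling is the point here:

```python
def gibbs_sweep(hierarchy, rewards):
    """Stub standing in for one Metropolis-within-Gibbs sweep over the hierarchy."""
    return hierarchy

def maybe_rerandomize(rewards, p_change=0.2):
    """Stub standing in for the trial-to-trial reward dynamics."""
    return dict(rewards)

hierarchy = {"clusters": {}, "means": {}}        # carried over across trials, never reset
rewards = {node: 150.0 for node in range(1, 11)}
for trial in range(100):
    rewards = maybe_rerandomize(rewards)         # rewards may change at the start of a trial
    for _ in range(10):                          # only 10 sampling steps per trial
        hierarchy = gibbs_sweep(hierarchy, rewards)
```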
(Results)
There were 95 participants in each of the human and two simulated groups. The null hypothesis is represented by an equal number of participants choosing a path through node 5 and through node 7, since in the absence of any other information and given that both paths are of equal length, a participant is equally likely to choose either.
Actions taken by humans and the static and dynamic rewards models [1]
As given in the table above, the results of the human experiment and the static rewards modeling were statistically significant at α = 0.05. Furthermore, as shown below, the results of the human experiment are in the 90th percentile of a normal distribution centered around 0.5, the expected proportion given the null hypothesis. In the figure, we include clusterings identified by the static rewards model (first row), the static rewards model with cluster formation between disconnected components penalized (second row), and the dynamic rewards model (third row).
Clusters identified by simulations [1]
Static rewards. We used 1000 iterations of Metropolis-within-Gibbs sampling to generate each sample, with a burn-in and lag of 1 each. The simulation under static rewards clearly favors paths through node 5, to a level that is statistically significant. Moreover, since its purpose is to model human behavior, this result is meaningful in light of the human data being statistically significant as well (0.0321 < α = 0.05).
Human and simulated subjects’ choices [1]
Dynamic rewards. In order to mimic the human trials, we ran 100 trials, each with 10 iterations of Metropolis-within-Gibbs sampling to sample from the posterior. The burn-in and lag were again set to 1. The online inference method appears to have modeled human data better than the static rewards model, even though the group of simulated participants under dynamic rewards modeling is farther from the hypothesis than the group simulated under static rewards modeling. 56 human participants and 54 simulated participants under dynamic rewards modeling chose to go through node 5 (a 3.4% difference), compared to 64 simulated participants under static rewards modeling (an 18.5% difference).
The bar chart above visualizes the proportion of human and simulated subjects whose chosen path’s second node was node 5. The solid black line indicates the expected proportion given the null hypothesis and the dotted black lines indicate the 10th and 90th percentiles.
(Conclusions)
Humans seem to spontaneously organize environments into clusters of states that support hierarchical planning, enabling them to tackle challenging problems by breaking them down into sub-problems at various levels of abstraction. People constantly rely on such hierarchical representations to accomplish tasks big and small — from planning one’s day, to organizing a wedding, to getting a PhD — often succeeding on the very first attempt.
We have shown that an optimal hierarchy depends not only on graph structure, but also on observable characteristics of the environment, i.e., the distribution of rewards.
We built hierarchical Bayesian models to understand how clusters induce static rewards and how both static and dynamic rewards induce clusters, and found that most results were statistically significant in terms of how closely our models captured human actions. All data and code files for all the simulations and experiments presented are available in the GitHub repository linked here. We hope that the model presented, as well as related results in a recent paper, pave the way for future studies to investigate the neural algorithms that support the essential cognitive ability of hierarchical planning.
Translated from: https://towardsdatascience.com/teaching-ai-to-learn-how-humans-plan-efficiently-1d031c8727b