Bayesian Estimation

Supplement: an example that shows the difference between maximum a posteriori estimation (Bayesian estimation with the hit-or-miss cost function) and maximum likelihood

Xiao Ming did not come to school today. There are three possible hypotheses (θ):

Xiao Ming is sick today  /  US President Trump is meeting Xiao Ming  /  the Earth was hit by a meteorite

The maximum-likelihood (MLE) estimate θ̂ (our estimate of θ) is "the Earth was hit by a meteorite", because

Likelihood(Xiao Ming did not come to school today | the Earth was hit by a meteorite) = 1

Maximum a posteriori, on the other hand, gives "Xiao Ming is sick today", because it takes the prior into account: the probabilities of "the Earth is destroyed" and "Trump meets Xiao Ming" are both far lower than that of "Xiao Ming is sick today".

Using "Occam's razor" to explain this phenomenon: the more elaborate the model (cosmic model > international-relations model > everyday-life model), the lower its (prior) probability.
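A minimal MATLAB sketch of this example (all the probabilities below are made up for illustration):

% Hypotheses: 1 = sick, 2 = Trump meets Xiao Ming, 3 = meteorite strike
lik   = [0.3, 0.9, 1.0];       % assumed likelihoods p(absent | hypothesis)
prior = [1e-1, 1e-6, 1e-12];   % assumed priors p(hypothesis)
[~, mle] = max(lik)            % MLE picks 3: the meteorite
post = lik .* prior;           % numerator of Bayes' formula
post = post / sum(post);       % divide by the evidence
[~, map] = max(post)           % MAP picks 1: Xiao Ming is sick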

The intuition behind the theorem

Imagine you have a hypothesis about some phenomenon in the world. What is the probability that the hypothesis is true?

 

To answer the question, you make observations related to the hypothesis and use Bayes’ theorem to update its probability. The hypothesis has some prior probability which is based on past knowledge (previous observations). To update it with a new observation, you multiply the prior probability by the respective likelihood term and then divide by the evidence term. The updated prior probability is called the posterior probability.

The posterior probability then becomes the next prior which you can update from another observation. And so the cycle continues.
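A minimal MATLAB sketch of this cycle, assuming a coin with unknown bias θ and a few made-up flips:

theta = 0:0.01:1;                 % grid over the unknown bias of a coin
prior = ones(size(theta));        % flat prior
prior = prior / sum(prior);
flips = [1 0 1 1];                % made-up observations: 1 = heads, 0 = tails
for z = flips
    lik  = theta.^z .* (1-theta).^(1-z);  % Bernoulli likelihood of one flip
    post = lik .* prior;
    post = post / sum(post);      % dividing by the sum plays the role of the evidence
    prior = post;                 % the posterior becomes the next prior
end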

Notational Conventions

θ: the parameter (hypothesis) to be estimated; z: an observation; p(θ): prior; p(z|θ): likelihood; p(z): evidence; p(θ|z): posterior.

Example of Bayesian Inference (Risk Factors Associated with Mortality in Game of Thrones)

(Figure: a Bayesian analysis of risk factors associated with mortality in Game of Thrones.)

Some Key Terms in Bayesian Inference in plain English


Bayesian = Stems from the well-known Bayes' theorem, first derived by Reverend Thomas Bayes.

Inference = Educated guessing

Bayesian Inference = Guessing in the style of Bayes


Prior probability density p(θ)

Our knowledge about θ is assumed to be contained in a known prior distribution p(θ), which expresses previous knowledge of θ from, for example, past experience, in the absence of direct evidence.

Likelihood function p(z|θ)

The form of p(z|θ) is assumed known, but the value of θ is not known exactly. The likelihood reads as "the probability of the observation, given that the hypothesis is true". This term represents how strongly the hypothesis predicts the observation.

So, the higher the likelihood, the higher the posterior probability is going to be.

Normalization factor (evidence) p(z)

The rest of our knowledge about θ is contained in a set D of n random variables x1, x2, …, xn that follow p(z).

In the context of Bayes' theorem, the evidence is the overall probability of the observation, averaged over all hypotheses.

Posterior probability density p(θ|z)

The posterior probability is obtained after multiplying the prior probability by the likelihood and then dividing by the evidence.

Bayes' Theorem (quick reminder): posterior = (likelihood × prior) / evidence
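Written out in the notation defined above:

p(θ|z) = p(z|θ) p(θ) / p(z),   where   p(z) = ∫ p(z|θ) p(θ) dθ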

Life Example of Bayesian Inference

(Slide: the "long hair" example described below.)

There are two ways to apply Bayes' formula here. The first is the one on the slide. In the second, the prior is "woman & long hair", the likelihood is your observation, the evidence is "man & long hair" plus "woman & long hair", and the posterior is your updated knowledge about "woman & long hair".
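A numeric sketch of this second reading in MATLAB, with made-up probabilities:

P_w = 0.5;  P_m = 0.5;                    % assumed priors: half women, half men
P_long_w = 0.75;                          % assumed likelihood p(long hair | woman)
P_long_m = 0.15;                          % assumed likelihood p(long hair | man)
evidence  = P_long_w*P_w + P_long_m*P_m;  % 'woman & long hair' + 'man & long hair'
posterior = P_long_w*P_w / evidence       % p(woman | long hair), about 0.83 here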

Recursive Bayesian Filter


 If the variables are normally distributed and the transitions are linear, the Bayes filter becomes equal to the Kalman filter.

When the observation equation is linear in Θ, the recursive Bayesian filter is exactly the Kalman filter.

Here x is θ, and z is the data we observe about θ.
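For reference, the two standard steps of the recursive Bayes filter are:

Predict:  p(x_k | z_1:k-1) = ∫ p(x_k | x_k-1) p(x_k-1 | z_1:k-1) dx_k-1
Update:   p(x_k | z_1:k) = p(z_k | x_k) p(x_k | z_1:k-1) / p(z_k | z_1:k-1)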

The denominator is constant with respect to x, so we can always replace it with a coefficient C, which can usually be ignored in practice. The numerator can be computed and then simply normalized, since its integral must be unity.
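A minimal MATLAB sketch of this trick on a grid (the numbers are made up):

theta = 0:0.01:1;                    % parameter grid
prior = ones(size(theta)) / numel(theta);
lik   = theta.^3 .* (1-theta);       % e.g. likelihood of 3 heads and 1 tail
numer = lik .* prior;                % numerator of Bayes' formula
post  = numer / sum(numer);          % normalizing replaces the explicit evidence term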

 

The MATLAB example below estimates a static constant, so it can be viewed as observing a single x_k of a first-order Markov process; if θ were a dynamic process, then z_k would depend only on x_k.

Treating the process as Markov is a simplification of the non-stationary real environment: it keeps the essence of the complex problem while remaining mathematically tractable.

Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior.

→ Using evidence narrows the probability distribution.

Some parameter values fit (are more consistent with) the data better, so they become more probable. The formula shows why the distribution sharpens.
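A concrete instance: for Gaussian observations z_i ~ N(θ, σ²) with a flat prior, the posterior after n observations is

p(θ | z_1, …, z_n) = N( mean(z_i), σ²/n )

so its width shrinks like 1/√n as evidence accumulates. The MATLAB demo at the end of this section shows exactly this narrowing in two dimensions.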

The Benefit of the Bayesian Approach

(Slide: benefits of the Bayesian approach.)

Bayesian inference versus Frequentist inference

Two different interpretations of probability have long existed. In Bayesian inference, the prior probabilities are specified and then Bayes' theorem is used to make probability statements about the parameter, as in the formula above. In frequentist inference such prior probabilities are considered nonsensical. The parameter θ is considered an unknown constant, not a random variable. Since it is not random, making probability statements about it does not make sense. A counterargument is that even if it is a constant, since it is unknown we may view it as a random variable: the uncertainty may be considered randomness.

Here we should mention the two views of probability in statistics: the two schools.

The frequentist school focuses on a single point, while the Bayesian school focuses on the uncertainty.

It might be one value, it might be another, it might be a third. Such arguments can and have continued for many years and are very interesting.

Bayesian inference versus Frequentist inference

Bayesian inference updates the probability estimate for a hypothesis as additional evidence is acquired. Bayesian inference is explicitly based on the evidence and prior opinion, which allows it to be based on multiple sets of evidence.

Frequentist inference is capable of making operational decisions and estimating parameters with or without confidence intervals. Frequentist inference is based solely on the probability of the data, which is often one set of evidence.

Maximum Likelihood Estimation versus Bayesian Estimation

If you are just interested in determining θ, Bayesian and frequentist methods both offer promising paths toward a solution. Often the two methods generate extremely similar answers anyway, making any argument about which one is better nearly meaningless from the standpoint of whether the method arrives at the correct value of θ. Specifically, often the MSEs of the two methods are identical or nearly identical.

There are certain problems where the frequentist solution (usually Maximum Likelihood Estimation) is easier to follow, other problems where the Bayesian solution is easier to follow. Thus, a knowledge of both methods is useful.

Bayesian estimation

Bayesian estimation considers θ (the parameter vector to be estimated) to be a random variable.

Before we observe the data, the parameters are described by a prior which is typically very broad. Once we have observed the data, we can use Bayes' theorem to find the posterior.

Our general setup is that we have a random sample Z = (x1, x2, …, xn) from a distribution p(x|θ), with θ unknown.

Our goal is to use all the available information to construct an estimate θ̂ of θ.
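For an i.i.d. sample, Bayes' theorem gives

p(θ | Z) ∝ p(θ) · p(x1|θ) · p(x2|θ) · … · p(xn|θ)

which is exactly what the MATLAB demo at the end of this section evaluates on a grid, one observation at a time.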


Bayes Estimator & Cost function

Bayesian estimators are defined by a minimization problem which seeks the value of θ̂ that minimizes the average cost.
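In symbols, the Bayes estimator solves

θ̂ = argmin over θ̂ of  ∫ C(θ̂, θ) p(θ|z) dθ

Two classic cost functions and the estimators they yield (standard results):

Squared error: C(θ̂, θ) = (θ̂ − θ)²  →  θ̂ is the posterior mean.
Hit-or-miss: C(θ̂, θ) = 0 if |θ̂ − θ| < δ and 1 otherwise; as δ → 0, θ̂ is the posterior maximum (MAP), the case used in the example at the top of this section.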



MATLAB program

Result:

(Figures: the noisy observations scattered around the true value, the posterior surface sharpening over the iterations, and the two coordinate estimates converging to the true values.)

 

%%
figure(1); clf;
figure(2); clf;
N = 1000;
s = [3;5];                % true value
noise = 2*randn(2,N);     % Gaussian white noise with standard deviation 2
x = zeros(2,N);           % preallocate the observation sequence
figure(1);
h = plot(s(1), s(2), 'r.');   % plot the true value
set(h, 'markersize', 40, 'linewidth', 3);
axis([0,10,0,10]);
hold on
for i = 1:N               % plot the noisy observations
    x(:,i) = s + noise(:,i);
    plot(x(1,i), x(2,i), 'k.', 'markersize', 10);
end
pause
%% Initialize the Bayesian iteration: a 2-D grid with resolution 0.05
Sa = 2:0.05:4;
Sb = 4:0.05:6;            % no prior knowledge: uniform distribution (like a Gaussian with infinite variance)
L = length(Sa);
Pr = ones(L,L);           % initialize the prior
Po = ones(L,L);           % initialize the posterior
Pr = Pr/sum(sum(Pr));     % normalize
Po = Po/sum(sum(Po));     % normalize
figure(1); clf;
colormap(hsv)
mesh(Sa, Sb, Po'), axis([2 4 4 6 0 0.015])  % transpose so rows map to Sb on the y-axis
%% Start the Bayesian iteration
[~, idx] = max(Po(:));    % start from the maximum of the prior (the posterior equals the prior here)
[a, b] = ind2sub(size(Po), idx);
sest = [Sa(a); Sb(b)];    % initial estimate: the prior mode
figure(1); clf
figure(2); clf
subplot(211); plot(1, sest(1)); hold on;
line([1,N], [s(1),s(1)]);     % line at the true value
subplot(212); plot(1, sest(2)); hold on;
line([1,N], [s(2),s(2)]);     % line at the true value
K = [3,0; 0,3];           % assumed 2-D Gaussian covariance of the likelihood
for n = 2:size(x,2)
    Pr = Po;              % the previous posterior becomes the next prior
    m = 0*Pr;             % initialize the 2-D grid for the numerator
    % The likelihood is a Gaussian with covariance K evaluated around each noisy
    % observation; its height is the normalizing constant in front of the
    % exponential, and the peak of the resulting posterior effectively averages
    % all the samples (a probability-weighted mean).
    for i = 1:length(Pr)
        for j = 1:length(Pr)
            me = [Sa(i); Sb(j)];
            m(i,j) = 1/sqrt((2*pi)^2*det(K)) * exp(-(x(:,n)-me)'*inv(K)*(x(:,n)-me)/2); % Gaussian likelihood
            m(i,j) = m(i,j) * Pr(i,j);  % multiply by the prior: the numerator of Bayes' formula
        end
    end
    Po = m/sum(sum(m));   % normalize
    figure(1); colormap(hsv); surf(Sa, Sb, Po'), axis tight  % plot the 3-D posterior surface
    figure(2);
    [~, idx] = max(Po(:));            % grid coordinates of the current MAP point
    [a, b] = ind2sub(size(Po), idx);
    sest = [Sa(a); Sb(b)];            % the corresponding position on the 2-D plane
    subplot(211); plot(n, sest(1), 'k.'); axis([0 N 2 4]); h1 = text(n, sest(1)+0.1, num2str(sest(1)), 'color', 'r'); hold on; drawnow; set(h1, 'Visible', 'off');
    subplot(212); plot(n, sest(2), 'k.'); axis([0 N 4 6]); h2 = text(n, sest(2)+0.1, num2str(sest(2)), 'color', 'r'); hold on; drawnow; set(h2, 'Visible', 'off');
end
subplot(211); hold off;
subplot(212); hold off;