Abstract
This paper investigates a new numerical method to approximate the optimal stopping problem for a discrete-time, continuous state space Markov chain under partial observations. The method relies on a two-step discretization procedure built on optimal quantization. First, we discretize the state space of the unobserved variable by quantizing an underlying reference measure. Then we jointly discretize the resulting approximate filter and the observation process. We obtain a fully computable approximation of the value function with explicit error bounds for its convergence towards the true value function.
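To fix ideas, here is a minimal numerical sketch of the quantization step, assuming a reference measure that can only be sampled. It uses Lloyd's fixed-point iteration on Monte Carlo samples; the function name, grid size and Gaussian example are illustrative assumptions, not the paper's algorithm (this literature often implements optimal quantization via CLVQ-type stochastic gradient schemes instead, see Pagès et al. 2004b).

```python
# Illustrative sketch (not the paper's algorithm): optimal quantization of a
# reference measure mu on R^d, approximated by Lloyd's fixed-point iteration
# on Monte Carlo samples of mu.
import numpy as np

def lloyd_quantizer(samples, n_points, n_iter=50, seed=0):
    """Return an n_points grid (approximately) minimizing the L2 quantization
    error of the empirical measure of `samples` (shape (n, d)), together with
    the weights of its Voronoi cells."""
    rng = np.random.default_rng(seed)
    grid = samples[rng.choice(len(samples), n_points, replace=False)]
    for _ in range(n_iter):
        # Nearest-neighbour (Voronoi) assignment of each sample to the grid.
        idx = np.argmin(((samples[:, None, :] - grid[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(n_points):
            cell = samples[idx == k]
            if len(cell) > 0:
                grid[k] = cell.mean(axis=0)   # centroid update
    weights = np.bincount(idx, minlength=n_points) / len(samples)
    return grid, weights

# Example: quantize a 2-d Gaussian reference measure with 20 points.
mu_samples = np.random.default_rng(1).normal(size=(5000, 2))
grid, weights = lloyd_quantizer(mu_samples, n_points=20)
```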
References
Bally V, Pagès G, Printems J (2005) A quantization tree method for pricing and hedging multidimensional American options. Math Finance 15(1):119–168
Bäuerle N, Rieder U (2011) Markov decision processes with applications to finance. Springer, Heidelberg
Graf S, Luschgy H (2000) Foundations of quantization for probability distributions. Lecture notes in mathematics, vol 1730. Springer, Berlin
Hernández-Lerma O (1989) Adaptive Markov control processes. Applied mathematical sciences, vol 79. Springer, New York
Hernández-Lerma O, Lasserre JB (1996) Discrete-time Markov control processes. Applications of mathematics (New York), vol 30. Springer, New York
Pagès G, Pham H, Printems J (2004a) An optimal Markovian quantization algorithm for multi-dimensional stochastic control problems. Stoch Dyn 4(4):501–545
Pagès G, Pham H, Printems J (2004b) Optimal quantization methods and applications to numerical problems in finance. In: Handbook of computational and numerical methods in finance. Birkhäuser, Boston
Pham H, Runggaldier W, Sellami A (2005) Approximation by quantization of the filter process and applications to optimal stopping problems under partial observation. Monte Carlo Methods Appl 11(1):57–81
Ye F, Zhou E (2013) Optimal stopping of partially observable Markov processes: a filtering-based duality approach. IEEE Trans Autom Control 58(10):2698–2704
Zhou E (2013) Optimal stopping under partial observation: near-value iteration. IEEE Trans Autom Control 58(2):500–506
Zhou E, Fu M, Marcus S (2010) Solving continuous-state POMDPs via density projection. IEEE Trans Autom Control 55(5):1101–1116
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix A: Proof of Theorem 1
In order to prove Theorem 1, we need to introduce a new auxiliary control model \(\mathbf {M}\) given by the five-tuple \(\big (\mathbb {F}, \mathbb {A},T,H,h\big )\) where
(a) the state space is \(\mathbb {F}=\mathbb {X}\times \mathbb {Y}\times \{0,1\}\),
(b) the action space is \(\mathbb {A}=\{0,1\}\),
(c) the transition probability function is given by the stochastic kernel T on \(\mathbb {F}\) given \(\mathbb {F}\times \mathbb {A}\) defined by \(T(B\times C\times D |x,y,z,a)=R(B\times C |x,y) \big [ \delta _{z}(D) I_{\{a=0\}}+ \delta _{1}(D) I_{\{a=1\}} \big ]\) for any \(B\in \mathcal {B}(\mathbb {X}), C\in \mathcal {B}(\mathbb {Y}), D\subset \{0,1\}\) and \((x,y,z,a)\in \mathbb {F}\times \mathbb {A}\),
(d) the cost-per-stage function H and the terminal cost function h.
Define \(\varvec{\Omega }=\mathbb {F}^{N_{0}+1}\) and \(\varvec{\mathcal {F}}\) its associated product \(\sigma \)-algebra. Introduce the coordinate projections \(\mathbf {X}_{t}\) (respectively \(\mathbf {Y}_{t}\), and \(\mathbf {Z}_{t}\)) from \(\varvec{\Omega }\) to the set \(\mathbb {X}\) (respectively \(\mathbb {Y}\), and \(\{0,1\}\)). Consider an arbitrary policy \(\pi \in \varPi ^{o}\). Define recursively the action process \(\{\mathbf {A}_{t}\}_{t\in \llbracket 0 ; N_{0}-1 \rrbracket }\) by \(\mathbf {A}_{t}=\pi _{t}(\mathbf {Y}_{0},\mathbf {Z}_{0},\mathbf {A}_{0},\ldots , \mathbf {Y}_{t-1},\mathbf {Z}_{t-1},\mathbf {A}_{t-1},\mathbf {Y}_{t},\mathbf {Z}_{t})\) for \(t\in \llbracket 1 ; N_{0}-1 \rrbracket \) and \(\mathbf {A}_{0}=\pi _{0}(\mathbf {Y}_{0},\mathbf {Z}_{0})\). Define the filtration \(\{\varvec{\mathcal {F}}_{t}\}_{t\in \llbracket 0 ; N_{0} \rrbracket }\) by \(\varvec{\mathcal {F}}_{t}=\sigma \{\mathbf {X}_{0},\mathbf {Y}_{0},\mathbf {Z}_{0},\ldots ,\mathbf {X}_{t},\mathbf {Y}_{t},\mathbf {Z}_{t}\}\) for \(t\in \llbracket 0 ; N_{0} \rrbracket \). According to Bäuerle and Rieder (2011), Hernández-Lerma and Lasserre (1996), there exists a probability measure \(\mathbf {P}^{\pi }_{(\mathbf {x},\mathbf {y})}\) on \(\big ( \varvec{\Omega },\varvec{\mathcal {F}} \big )\) satisfying
(i) \(\mathbf {P}^{\pi }_{(\mathbf {x},\mathbf {y})}\big ((\mathbf {X}_{0},\mathbf {Y}_{0},\mathbf {Z}_{0})\in B\times C \times D\big )= \delta _{(\mathbf {x},\mathbf {y})}(B\times C) \delta _{0}(D)\),
(ii) \(\mathbf {P}^{\pi }_{(\mathbf {x},\mathbf {y})}\big ((\mathbf {X}_{t+1},\mathbf {Y}_{t+1},\mathbf {Z}_{t+1})\in B\times C\times D |\varvec{\mathcal {F}}_{t} \big ) =T(B\times C\times D|\mathbf {X}_{t},\mathbf {Y}_{t},\mathbf {Z}_{t},\mathbf {A}_{t})\),
for \(t\in \llbracket 0 ; N_{0}-1 \rrbracket , B\in \mathcal {B}(\mathbb {X}), C\in \mathcal {B}(\mathbb {Y}), D\subset \{0,1\}\).
The expectation under the probability \(\mathbf {P}^{\pi }_{(\mathbf {x},\mathbf {y})}\) is denoted by \(\mathbf {E}^{\pi }_{(\mathbf {x},\mathbf {y})}\). For a policy \(\pi \in \varPi ^{o}\), the performance criterion is given by
$$\begin{aligned} \varvec{\mathcal {H}}_{\mathbf {M}}(\mathbf {x},\mathbf {y},\pi )=\mathbf {E}^{\pi }_{(\mathbf {x},\mathbf {y})}\Big [ \sum _{t=0}^{N_{0}-1} H(\mathbf {X}_{t},\mathbf {Y}_{t},\mathbf {Z}_{t},\mathbf {A}_{t})+h(\mathbf {X}_{N_{0}},\mathbf {Y}_{N_{0}},\mathbf {Z}_{N_{0}}) \Big ]. \end{aligned}$$
The optimization problem we are interested in is to maximize the reward function \(\varvec{\mathcal {H}}_{\mathbf {M}}(\mathbf {x},\mathbf {y},\pi )\) over \(\varPi ^{o}\); the corresponding value function is \(\overline{\varvec{\mathcal {H}}}_{\mathbf {M}}(\mathbf {x},\mathbf {y})=\sup _{\pi \in \varPi ^{o}} \varvec{\mathcal {H}}_{\mathbf {M}}(\mathbf {x},\mathbf {y},\pi )\). We first need to prove the following technical lemma.
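For intuition, the dynamics of the auxiliary model and its criterion can be simulated directly. The following sketch is an illustration under assumed inputs: `sample_R` (a sampler for the kernel R), `H`, `h` and `policy` are hypothetical stand-ins for the objects defined above, and the estimate is plain Monte Carlo.

```python
# Hedged sketch: Monte Carlo evaluation of the criterion H_M(x, y, pi) for the
# auxiliary model M.  `sample_R`, `H`, `h` and `policy` are assumed inputs, not
# objects from the paper; the stop indicator Z is absorbed at 1 once a = 1,
# exactly as the kernel T prescribes.
import numpy as np

def rollout_value(x0, y0, policy, sample_R, H, h, N0, n_paths=10_000, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_paths):
        x, y, z = x0, y0, 0           # initial law delta_{(x,y)} x delta_0
        reward = 0.0
        ys = [y]                      # observations available to the policy
        for t in range(N0):
            a = policy(t, ys, z)      # A_t depends on the observed history and Z_t
            reward += H(x, y, z, a)
            x, y = sample_R(x, y, rng)   # (X_{t+1}, Y_{t+1}) ~ R(.|X_t, Y_t)
            z = 1 if a == 1 else z       # T sends z to 1 iff a = 1
            ys.append(y)
        reward += h(x, y, z)             # terminal cost at time N0
        total += reward
    return total / n_paths
```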
Lemma 6
For any \(t\in \llbracket 0 ; N_{0}\rrbracket \),
$$\begin{aligned} \sigma \{\mathbf {Y}_{0},\mathbf {Z}_{0},\ldots ,\mathbf {Y}_{t},\mathbf {Z}_{t}\} \subset \sigma \{\mathbf {Y}_{0},\ldots ,\mathbf {Y}_{t}\}. \end{aligned}$$
Proof
Clearly, \(\mathbf {A}_{t}\) is measurable with respect to \(\sigma \{\mathbf {Y}_{0},\mathbf {Z}_{0},\ldots ,\mathbf {Y}_{t},\mathbf {Z}_{t}\}\) for \(t\in \llbracket 0 ; N_{0}-1 \rrbracket \). Moreover, from the definition of the transition kernel T, we obtain that \(\mathbf {Z}_{t}=I_{\{\mathbf {A}_{t-1}=1\}}+\mathbf {Z}_{t-1}I_{\{\mathbf {A}_{t-1}=0\}}\) for any \(t\in \llbracket 1 ; N_{0} \rrbracket \). Indeed, the kernel T sends z onto itself whenever \(a=0\) and onto 1 whenever \(a=1\). Recalling that \(\mathbf {Z}_{0}=0\), it follows easily that \(\sigma \{\mathbf {Y}_{0},\mathbf {Z}_{0},\ldots ,\mathbf {Y}_{t},\mathbf {Z}_{t}\} \subset \sigma \{\mathbf {Y}_{0},\ldots ,\mathbf {Y}_{t}\}\) for \(t\in \llbracket 0 ; N_{0}\rrbracket \), showing the result. \(\square \)
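As a mechanical illustration of this recursion (not part of the proof), the following sketch checks exhaustively over all binary action sequences that \(\mathbf {Z}_{t}\), started at \(\mathbf {Z}_{0}=0\), coincides with the absorbed indicator \(\max (\mathbf {A}_{0},\ldots ,\mathbf {A}_{t-1})\); the horizon N0 = 6 is an arbitrary choice.

```python
# Illustrative exhaustive check (not part of the proof): Z_t, started at
# Z_0 = 0, equals the absorbed indicator "some A_s = 1 for s < t".
from itertools import product

N0 = 6
for actions in product([0, 1], repeat=N0):
    z = 0
    for t in range(1, N0 + 1):
        z = 1 if actions[t - 1] == 1 else z   # Z_t recursion from the kernel T
        assert z == max(actions[:t])          # absorbed stop indicator
```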
The next result shows that the optimization problem defined through \(\mathbf {M}\) is equivalent to the initial optimal stopping problem defined in Definition 1.
Proposition 2
The following assertions hold.
(i) For any control \(\ell \in L\), there exists a policy \(\pi \in \varPi ^{o}\) such that
$$\begin{aligned} \varvec{\mathcal {H}}_{\mathbf {M}}(\mathbf {x},\mathbf {y},\pi )=\varvec{\mathcal {H}}(\mathbf {x},\mathbf {y},\ell ). \end{aligned}$$
(ii) For any policy \(\pi \in \varPi ^{o}\), there exists a control \(\ell \in L\) such that
$$\begin{aligned} \varvec{\mathcal {H}}(\mathbf {x},\mathbf {y},\ell )=\varvec{\mathcal {H}}_{\mathbf {M}}(\mathbf {x},\mathbf {y},\pi ). \end{aligned}$$
Proof
Regarding item (i), consider a control \(\ell =\big ( {\varvec{\Xi }},{\varvec{\mathcal {G}}},{\mathbf {Q}},\{{\varvec{\mathcal {G}}}_{t}\}_{t\in \llbracket 0 ; N_{0} \rrbracket },\{\varvec{\mathcal {X}}_{t},\varvec{\mathcal {Y}}_{t}\}_{t\in \llbracket 0 ; N_{0} \rrbracket },\tau \big )\) in L. On the probability space \(\big ( \varvec{\Xi },{\varvec{\mathcal {G}}},{\mathbf {Q}}\big )\), let us define the processes \(\{\varvec{\mathcal {A}}_{t}\}_{t\in \llbracket 0 ; N_{0}-1 \rrbracket }\) and \(\{\varvec{\mathcal {Z}}_{t}\}_{t\in \llbracket 0 ; N_{0} \rrbracket }\) by \(\varvec{\mathcal {A}}_{t}=I_{\{\tau \le t\}}\) and \(\varvec{\mathcal {Z}}_{t}=\varvec{\mathcal {A}}_{t-1}\) for \(t\in \llbracket 1 ; N_{0} \rrbracket \) and \(\varvec{\mathcal {Z}}_{0}=0\). Introduce the filtrations \(\{{\varvec{\mathcal {T}}}_{t}\}_{t\in \llbracket 0 ; N_{0} \rrbracket }\) by \({\varvec{\mathcal {T}}}_{t}=\sigma \{\varvec{\mathcal {X}}_{0},\varvec{\mathcal {Y}}_{0},\varvec{\mathcal {Z}}_{0},\varvec{\mathcal {A}}_{0},\ldots ,\varvec{\mathcal {X}}_{t},\varvec{\mathcal {Y}}_{t},\varvec{\mathcal {Z}}_{t},\varvec{\mathcal {A}}_{t}\}\) and \(\{{\varvec{\mathcal {G}}}^{\varvec{\mathcal {Y}}}_{t} \}_{t\in \llbracket 0 ; N_{0} \rrbracket }\) by \({\varvec{\mathcal {G}}}^{\varvec{\mathcal {Y}}}_{t}=\sigma \{\varvec{\mathcal {Y}}_{0},\ldots ,\varvec{\mathcal {Y}}_{t}\}\). Since \(\tau \) is an \(\{{\varvec{\mathcal {G}}}^{\varvec{\mathcal {Y}}}_{t}\}_{t\in \llbracket 0 ; N_{0} \rrbracket }\)-stopping time, we have \(\varvec{\mathcal {T}}_{t}\subset {\varvec{\mathcal {G}}}_{t}\). Moreover, \(\varvec{\mathcal {Z}}_{t+1}\) is \(\varvec{\mathcal {T}}_{t}\)-measurable. Consequently, it is easy to show that
$$\begin{aligned} \mathbf {Q}\big ((\varvec{\mathcal {X}}_{t+1},\varvec{\mathcal {Y}}_{t+1},\varvec{\mathcal {Z}}_{t+1})\in B\times C\times D |\varvec{\mathcal {T}}_{t} \big )=R(B\times C|\varvec{\mathcal {X}}_{t},\varvec{\mathcal {Y}}_{t})\,\delta _{\varvec{\mathcal {A}}_{t}}(D). \end{aligned}$$
We have \(\{\varvec{\mathcal {A}}_{t}=1\}=\{\varvec{\mathcal {Z}}_{t+1}=1\}\) and \(\{\varvec{\mathcal {A}}_{t}=0\}=\{\varvec{\mathcal {Z}}_{t+1}=0\}\subset \{\varvec{\mathcal {A}}_{t-1}=0\}=\{\varvec{\mathcal {Z}}_{t}=0\}\), so that \(\delta _{\varvec{\mathcal {A}}_{t}}(D)=\delta _{\varvec{\mathcal {Z}}_{t}}(D) I_{\{\varvec{\mathcal {A}}_{t}=0\}}+ \delta _{1}(D) I_{\{\varvec{\mathcal {A}}_{t}=1\}}\), and so
$$\begin{aligned} \mathbf {Q}\big ((\varvec{\mathcal {X}}_{t+1},\varvec{\mathcal {Y}}_{t+1},\varvec{\mathcal {Z}}_{t+1})\in B\times C\times D |\varvec{\mathcal {T}}_{t} \big )=T(B\times C\times D|\varvec{\mathcal {X}}_{t},\varvec{\mathcal {Y}}_{t},\varvec{\mathcal {Z}}_{t},\varvec{\mathcal {A}}_{t}). \end{aligned}$$
(40)
Now, there exists an \(\mathbb {A}\)-valued measurable mapping \(\pi _{t}\) defined on \(\mathbb {Y}^{t+1}\) satisfying \(\varvec{\mathcal {A}}_{t}=\pi _{t}(\varvec{\mathcal {Y}}_{0},\ldots ,\varvec{\mathcal {Y}}_{t})\), and so
$$\begin{aligned} \mathbf {Q}\big (\varvec{\mathcal {A}}_{t}\in F | \sigma \{\varvec{\mathcal {X}}_{0},\varvec{\mathcal {Y}}_{0},\varvec{\mathcal {Z}}_{0},\varvec{\mathcal {A}}_{0},\ldots ,\varvec{\mathcal {X}}_{t},\varvec{\mathcal {Y}}_{t},\varvec{\mathcal {Z}}_{t}\}\big )=\delta _{\pi _{t}(\varvec{\mathcal {Y}}_{0},\ldots ,\varvec{\mathcal {Y}}_{t})}(F) \end{aligned}$$
(41)
for any \(t\in \llbracket 0 ; N_{0}-1 \rrbracket \) and \(F\subset \mathbb {A}\). Recall that
$$\begin{aligned} \mathbf {Q}\big ((\varvec{\mathcal {X}}_{0},\varvec{\mathcal {Y}}_{0},\varvec{\mathcal {Z}}_{0})\in B\times C\times D\big )=\delta _{(\mathbf {x},\mathbf {y})}(B\times C) \delta _{0}(D) \end{aligned}$$
(42)
for any \(B\in \mathcal {B}(\mathbb {X}), C\in \mathcal {B}(\mathbb {Y}), D\subset \{0,1\}\). Combining Eqs. (40)–(42) and the uniqueness property in the Theorem of Ionescu-Tulcea (see, e.g. Hernández-Lerma and Lasserre 1996, Proposition C.10), it follows that for the control policy \(\pi =\{\pi _{t}\}_{t\in \llbracket 0 ; N_{0}-1 \rrbracket }\)
$$\begin{aligned} \mathbf {P}^{\pi }_{(\mathbf {x},\mathbf {y})}\big ((\mathbf {X}_{0},\mathbf {Y}_{0},\mathbf {Z}_{0},\ldots ,\mathbf {X}_{N_{0}},\mathbf {Y}_{N_{0}},\mathbf {Z}_{N_{0}})\in H\big )=\mathbf {Q}\big ((\varvec{\mathcal {X}}_{0},\varvec{\mathcal {Y}}_{0},\varvec{\mathcal {Z}}_{0},\ldots ,\varvec{\mathcal {X}}_{N_{0}},\varvec{\mathcal {Y}}_{N_{0}},\varvec{\mathcal {Z}}_{N_{0}})\in H\big ) \end{aligned}$$
(43)
for any \(H\in \varvec{\mathcal {F}}\).
Observe that for \(k\in \llbracket 0 ; N_{0}-1 \rrbracket \) we have \(\{\tau =k\}=\{\mathbf {Z}_{k}=0\}\cap \{\mathbf {A}_{k}=1\}\) and \(\{\tau =N_{0}\}=\{\mathbf {Z}_{N_{0}}=0\}\). Consequently,
Now, by using the definitions of H and h we get
By using Eq. (43), it follows that
showing the first claim.
Regarding item (ii), let \(\pi \) be a policy in \(\varPi ^{o}\). Then, on the probability space \(\big (\varvec{\Omega },\varvec{\mathcal {F}},\mathbf {P}^{\pi }_{(\mathbf {x},\mathbf {y})} \big ), \{ \mathbf {X}_{t},\mathbf {Y}_{t} \}_{t\in \llbracket 0 ; N_{0}\rrbracket }\) is an \(\{\varvec{\mathcal {F}}_{t}\}_{t\in \llbracket 0 ; N_{0}\rrbracket }\)-adapted Markov chain with transition kernel R and with initial distribution \(\delta _{(\mathbf {x},\mathbf {y})}\). Introduce the \( \llbracket 0 ; N_{0}\rrbracket \)-valued random variable \(\tau \) defined by
$$\begin{aligned} \tau =\inf \big \{t\in \llbracket 0 ; N_{0}-1 \rrbracket : \mathbf {A}_{t}=1\big \}\wedge N_{0}, \end{aligned}$$
with the convention \(\inf \emptyset =+\infty \).
It follows from Lemma 6 that \(\tau \) is a stopping time with respect to the filtration \(\{\sigma \{\mathbf {Y}_{0},\ldots ,\mathbf {Y}_{t}\}\}_{t\in \llbracket 0 ;N_{0}\rrbracket }\), showing that the control \(\ell \) defined by \(\Big (\varvec{\Omega },\varvec{\mathcal {F}},\mathbf {P}^{\pi }_{(\mathbf {x},\mathbf {y})}, \{\varvec{\mathcal {F}}_{t}\}_{t\in \llbracket 0 ; N_{0}\rrbracket },\{ \mathbf {X}_{t},\mathbf {Y}_{t} \}_{t\in \llbracket 0 ; N_{0}\rrbracket },\tau \Big )\) belongs to L. Recalling that \(\mathbf {Z}_{0}=0\) and that \(\mathbf {Z}_{t}=I_{\{\mathbf {A}_{t-1}=1\}}+\mathbf {Z}_{t-1}I_{\{\mathbf {A}_{t-1}=0\}}\) for any \(t\in \llbracket 1 ; N_{0} \rrbracket \), we get that \(\{\tau =t\}=\{\mathbf {Z}_{t}=0\}\cap \{\mathbf {A}_{t}=1\}\) for \(t\in \llbracket 0 ; N_{0}-1 \rrbracket \) and \(\{\tau =N_{0}\}=\{\mathbf {Z}_{N_{0}}=0\}\). Now, by using the definitions of H and h it follows that
implying that \(\varvec{\mathcal {H}}(\mathbf {x},\mathbf {y},\ell )=\varvec{\mathcal {H}}_{\mathbf {M}}(\mathbf {x},\mathbf {y},\pi )\) and showing the second claim. \(\square \)
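The bookkeeping between the stopping time \(\tau \), the actions \(\mathbf {A}_{t}=I_{\{\tau \le t\}}\) and the indicator \(\mathbf {Z}_{t}\) can be verified by brute force for a small horizon. The following sketch is a plain illustration of the identities used above, with a hypothetical horizon N0 = 5.

```python
# Illustrative brute-force check of the correspondence used in Proposition 2:
# A_t = 1_{tau <= t}, Z_0 = 0, Z_t = A_{t-1}; then {tau = k} = {Z_k = 0, A_k = 1}
# for k < N0 and {tau = N0} = {Z_{N0} = 0}.
N0 = 5
for tau in range(N0 + 1):
    A = [1 if tau <= t else 0 for t in range(N0)]   # A_0, ..., A_{N0-1}
    Z = [0] + A                                     # Z_0 = 0, Z_t = A_{t-1}
    for k in range(N0):
        assert (tau == k) == (Z[k] == 0 and A[k] == 1)
    assert (tau == N0) == (Z[N0] == 0)
```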
Proof of Theorem 1
From Theorem 5.3.2 in Bäuerle and Rieder (2011) we get that \(\overline{\varvec{\mathcal {H}}}_{\mathbf {M}}(\mathbf {x},\mathbf {y})=\overline{\varvec{\mathcal {H}}}_{\mathcal {M}}(\mathbf {x},\mathbf {y})\) and so from Proposition 2, it follows that \(\overline{\varvec{\mathcal {H}}}(\mathbf {x},\mathbf {y})=\overline{\varvec{\mathcal {H}}}_{\mathcal {M}}(\mathbf {x},\mathbf {y})\), giving the first equality in Eq. (8). Under Assumptions (A1), B and D, the hypotheses of Theorem 5.3.3 in Bäuerle and Rieder (2011) are satisfied. Therefore, it follows that the Bellman equation \(\{v_{k}\}_{ k \in \llbracket 0 ; N_{0}\rrbracket }\) for the model \(\mathcal {M}\) is given by
and satisfies \(\overline{\varvec{\mathcal {H}}}_{\mathcal {M}}(\mathbf {x},\mathbf {y})=v_{N_{0}}(\delta _{\mathbf {x}},\mathbf {y},0)\). Since \(h(\theta ,y,1)=H(\theta ,y,1)=0\), it is easy to show that \(v_{k}(\theta ,y,1)=0\) for any \((\theta ,y)\in \mathcal {P}(\mathbb {X})\times \mathbb {Y}\) and \(k\in \llbracket 0 ; N_{0}\rrbracket \). Moreover, by using the definitions of h, H and the kernel Q we obtain that \(v_{0}(\theta ,y,0) = \mathbf {H}(\theta ,y)\) and
for any \((\theta ,y)\in \mathcal {P}(\mathbb {X})\times \mathbb {Y}\) and \(k\in \llbracket 1 ; N_{0}\rrbracket \) implying that \(\overline{\varvec{\mathcal {H}}}_{\mathcal {M}}(\mathbf {x},\mathbf {y})=\mathfrak {B}^{N_{0}}\mathbf {H}(\delta _{\mathbf {x}},\mathbf {y})\) and giving the second equality in Eq. (8). \(\square \)
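Computationally, once the filter has been discretized on a finite grid, the operator \(\mathfrak {B}\) reduces to a standard backward value iteration for optimal stopping. The sketch below illustrates this recursion under assumed inputs: `P` (a row-stochastic matrix standing in for the discretized filter/observation transition) and `reward` (the function \(\mathbf {H}\) evaluated on the grid) are hypothetical, and the scheme is the generic "stop now versus expected continuation" recursion, not the paper's specific quantized operator.

```python
# Hedged sketch: value iteration for optimal stopping on a finite grid, the
# discrete analogue of v_0 = H and v_k = max(H, expected v_{k-1}).
# `P` (row-stochastic transition matrix on grid points) and `reward` (the
# stopping reward H evaluated on the grid) are assumed inputs.
import numpy as np

def optimal_stopping_values(P, reward, N0):
    """Return [v_0, ..., v_N0] with v_0 = reward, v_k = max(reward, P @ v_{k-1})."""
    v = reward.copy()
    values = [v]
    for _ in range(N0):
        v = np.maximum(reward, P @ v)   # stop now vs. expected continuation
        values.append(v)
    return values

# Toy usage on a 3-point grid.
P = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.4, 0.5]])
reward = np.array([1.0, 0.2, 0.0])
v = optimal_stopping_values(P, reward, N0=4)
print(v[-1])   # approximation of B^{N0} H on the grid
```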
About this article
Cite this article
de Saporta, B., Dufour, F. & Nivot, C. Partially observed optimal stopping problem for discrete-time Markov processes. 4OR-Q J Oper Res 15, 277–302 (2017). https://doi.org/10.1007/s10288-016-0337-8
Keywords
- Optimal stopping
- Partial observations
- Markov chain
- Dynamic programming
- Numerical approximation
- Error bound
- Quantization