Abstract
This paper studies constrained Markov decision processes under the total expected discounted cost optimality criterion, with a state-action dependent discount factor that may take any value between zero and one. Both the state and the action space are assumed to be Borel spaces. By using the linear programming approach, consisting in stating the control problem as a linear problem on a set of occupation measures, we show the existence of an optimal stationary Markov policy. Our results are based on the study of both weak-strong topologies in the space of occupation measures and Young measures in the space of Markov policies.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Optimal control problems under the discounted optimality criterion have been widely studied in the last decades, both from the theoretical point of view by progressively providing more general and flexible models, and from the point of view of applications by modeling a large number of problems arising in areas such as mathematical economy, population growth, finance, biology, extraction of natural resources, among others. The discounted criterion’s appeal is its representation as a fair value of future flows, which is a common practice in the real world.
Most of the papers in the literature on discounted Markov decision processes (MDPs) consider a constant discount factor. This assumption produces, from the technical point of view, a significant simplification when solving the optimal control problem. In many situations, however, this assumption does not fit into real applications, as the discount factor may vary with time, or it could be affected by the current state of the system or, even more, it could be affected by the controller’s actions. This shows the need for new models that allow to deal with state-action dependent discount factors. In this new line of research, we can cite the references [27,28,29, 34, 38]; however, most of these references assume that the corresponding discount factor is uniformly bounded above by some constant \(\alpha _0\) strictly less than one.
In this sense, an important contribution of this paper is to consider a state-action dependent discount factor that is allowed to take any value in the interval [0, 1] and, in particular, a discounting equal to one is permitted. From a modeling point of view, this allows us to deal with a wide variety of situations. As an illustration, a discount factor equal to zero can be interpreted as “killing” the process, as in optimal stopping problems [29, 31], while a discount factor equal to one can be interpreted, in the context of a hybrid control model, as the possibility for the controller to take several simultaneous actions [30].
On the other side, independently of the type of discount (constant or not) already discussed, there exist models whose nature implies a restriction for some of their elements. The type of restriction we are interested in is one producing that certain additional discounted costs in the model do not exceed a given value. This type of cost restrictions is very useful in several areas and has generated mathematical challenges different from those arising in the unconstrained case; see, e.g., [2, 3, 8, 17, 20,21,22,23,24, 28, 33, 35].
This paper considers a discrete-time infinite horizon Markov decision process with varying discount factor. The controller wants to minimize a total expected discounted cost, subject to some constraints on some similar total expected discounted payoffs. The approach we follow is to rewrite this original constrained problem as a linear programming problem defined on a space of occupation measures. To study this linear programming problem, we introduce a suitable topology (namely, the weak-strong topology) and we study topological properties such as compactness of the set of occupation measures and semicontinuity of the functionals to be minimized. In addition, it is also useful to define a topology in the family of Markov stationary policies, which is achieved by considering the Young topology on a family of stochastic kernels. Once the linear programming problem has been solved, an optimal control policy can be derived by using a standard result on a decomposition of product measures.
From the previous paragraphs, we can conclude that the main contribution of this paper lies in filling an existing gap of previous results on discounted constrained problems [2, 3, 8, 17, 21,22,23,24, 28, 33, 35], with recent results in [29,30,31] that study unconstrained control problems with payoffs containing state-action dependent discount factors that have the flexibility to take the extreme values zero or one. To the best of our knowledge, models like the one in this paper have not been previously analyzed.
The rest of the paper is organized as follows: the remaining of this introductory section is concerned with some notation and terminology used in the manuscript. Later, in Sect. 2 we describe the basic elements of the Markov decision model under study, the control policies, and the optimization problem with constraints (CP) we want to solve. The section ends with the statement of two important facts: (i) our main result in this paper that establishes the existence of optimal control policies for (CP) and (ii) the arguments ensuring the existence of control policies for which the objective function is finite. On the other hand, Sect. 3 is devoted to characterizing (CP) in terms of a linear programming problem (LP) defined on space of occupation measures. The rest of the section focuses on some preliminary results aimed at proving topological properties of the aforementioned space of occupation measures. Section 4 is devoted to showing that (LP) is solvable by establishing compactness of the space of occupation measures. This fact, in turn, proves that the original (CP) has a solution as well. Section 5 gives several complementary results. First, a discussion on the hypotheses of the model: we aim to propose an alternative set of mild conditions that imply those originally proposed. We also give sufficient conditions yielding the existence of “simpler” optimal policies (namely, deterministic or chattering policies). We also describe some applications: we introduce a class of queueing systems with failure which can be modeled with our results herein. The paper ends with a section containing our conclusions.
Notation and terminology. For a measurable space \(({\textbf{X}},\varvec{\mathfrak {X}})\), we introduce the following notation: \({\mathcal {M}}({\textbf{X}})\) stands for the set of nonnegative measures, \({\mathcal {M}}^f({\textbf{X}})\) is the set of finite nonnegative measures, and \({\mathcal {P}}({\textbf{X}})\) denotes the family of probability measures. For any \(x\in {\textbf{X}}\) we will denote by \(\delta _x\in {\mathcal {P}}({\textbf{X}})\) the Dirac probability measure concentrated at x, i.e., \(\delta _x(B)={\textbf{I}}_B(x)\) for \(B\in \varvec{\mathfrak {X}}\), where \({\textbf{I}}_B(\cdot )\) is the indicator function of the set B. Given a measurable function \(f:{\textbf{X}}\rightarrow {\mathbb {R}}\) and \(\mu \in {\mathcal {M}}({\textbf{X}})\) we will use the notation \(\langle f,\mu \rangle =\int _{\textbf{X}}fd\mu \) provided that the integral is well defined.
If \(({\textbf{Y}},\varvec{\mathfrak {Y}})\) is another measurable space then we will consider the product space \({\textbf{X}}\times {\textbf{Y}}\) endowed with the product \(\sigma \)-algebra \(\varvec{\mathfrak {X}}\otimes \varvec{\mathfrak {Y}}\). For a measure \(\mu \in {\mathcal {M}}({\textbf{X}}\times {\textbf{Y}})\) we will denote by \(\mu ^{\textbf{X}}\in {\mathcal {M}}({\textbf{X}})\) and \(\mu ^{{\textbf{Y}}}\in {\mathcal {M}}({\textbf{Y}})\) the corresponding marginal measures, i.e.,
Similarly, if \(\varvec{\mathcal {O}}\subseteq {\mathcal {M}}({\textbf{X}}\times {\textbf{Y}})\) then we will write
A substochastic kernel \(R(\cdot |\cdot )\) on \({\textbf{Y}}\) given \({\textbf{X}}\) is a function \(R:\varvec{\mathfrak {Y}}\times {\textbf{X}}\rightarrow [0,1]\) such that \(R(\cdot |x)\) is in \({\mathcal {M}}^f({\textbf{Y}})\) for each \(x\in {\textbf{X}}\) (with, necessarily, \(R({\textbf{Y}}|x)\le 1\)) and such that the mapping \(x\mapsto R(B|x)\) is measurable on \({\textbf{X}}\) for each \(B\in \varvec{\mathfrak {Y}}\). We say that R is a stochastic kernel when \(R({\textbf{Y}}|x)=1\) for all \(x\in {\textbf{X}}\). Given a measure \(\mu \in {\mathcal {M}}({\textbf{X}})\) we define \(\mu R\in {\mathcal {M}}({\textbf{Y}})\) by means of
and \(\mu \otimes R\in {\mathcal {M}}({\textbf{X}}\times {\textbf{Y}})\) as
for \(B\in \varvec{\mathfrak {X}}\) and \(C\in \varvec{\mathfrak {Y}}\), with the obvious relation \((\mu \otimes R)^{\textbf{Y}}=\mu R\).
Given a probability measure \(\lambda \in {\mathcal {P}}({\textbf{X}})\), we denote by \(L^1({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\) and \(L^\infty ({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\) the spaces of measurable functions \(f:{\textbf{X}}\rightarrow {\mathbb {R}}\) which are \(\lambda \)-integrable and \(\lambda \)-essentially bounded, respectively. As usual, we will identify functions which are \(\lambda \)-a.s. equal. Suppose that the functions \(\{f_n\}_{n\in {\mathbb {N}}}\) and f are in \(L^1({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\). We will write
On \(L^\infty ({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\) we will consider the weak\(^*\) convergence defined as follows: given functions \(\{f_n\}_{n\in {\mathbb {N}}}\) and f in \(L^\infty ({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\), we will write \(f_n \ {{\mathop {\rightharpoonup }\limits ^{*}}\ }f\) when
A metric space \({\textbf{S}}\) will be always endowed with its Borel \(\sigma \)-algebra \({\mathcal {B}}({\textbf{S}})\). We say that \(f:{\textbf{S}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is lower semicontinuous (abbreviated l.s.c.) when the lower sections \(\{s\in {\textbf{S}}:f(s)\le z\}\) are closed for every \(z\in {\mathbb {R}}\). A function f is upper semicontinuous (u.s.c.) when \(-f\) is l.s.c.
For the measurable space \(({\textbf{X}},\varvec{\mathfrak {X}})\) and the metric space \({\textbf{S}}\), we say that \(f:{\textbf{X}}\times {\textbf{S}}\rightarrow {\mathbb {R}}\) is a Carathéodory function if \(f(\cdot ,s)\) is measurable on \({\textbf{X}}\) for each \(s\in {\textbf{S}}\) and \(f(x,\cdot )\) is continuous on \({\textbf{S}}\) for each \(x\in {\textbf{X}}\). We note that if \({\textbf{S}}\) is separable then any Carathéodory function is jointly measurable on \(\varvec{\mathfrak {X}}\otimes {\mathcal {B}}({\textbf{S}})\). We will denote by \(Car({\textbf{X}}\times {\textbf{S}})\) the set of all Carathéodory functions and by \(Car_b({\textbf{X}}\times {\textbf{S}})\) the family of bounded Carathéodory functions. If \(\lambda \in {\mathcal {P}}({\textbf{X}})\), we denote by \(Car_1({\textbf{X}}\times {\textbf{S}})\) the set of Carathéodory functions for which there exists \(G\in L^1({\textbf{X}},\varvec{\mathfrak {X}},\lambda )\) such that \(|f(x,s)|\le G(x)\) for every \((x,s)\in {\textbf{X}}\times {\textbf{S}}\). In the notation \(Car_1({\textbf{X}}\times {\textbf{S}})\), since there will be no risk of confusion, we will not make \(\lambda \) explicit.
We recall that a Borel space is a measurable subset of a Polish space (which is, in turn, a complete and separable metric space). Finally, we shall use the convention that \(\sum _{k}^{k-1}\bullet =0\) and \(\prod _{k}^{k-1}\bullet =1\) for any \(k\ge 0\), and will use the notation \({\mathbb {N}}_0:={\mathbb {N}}\cup \{0\}\).
2 The Control Problem
In this section we introduce the basic elements of the Markov decision process under study, such as the definitions of the state and action spaces, the transition law of the random process, the cost per stage, and the discount factor, among other elements. Next, we define the set of control policies to be used for the statement of the optimization problem with constraints we want to solve. Using the previous definitions, we introduce our main set of assumptions in order to (i) set our main result in this manuscript concerning the existence of optimal control policies for our constrained problem, and (ii) show that our control problem is well posed by proving the existence of at least a control policy that makes the total expected discount factor as well as the total expected discount cost to have a finite value under our stated assumptions.
Elements of the control model. We will consider a constrained Markov control model given by the following elements.
-
The state space \({\textbf{X}}\) is a Borel space endowed with its Borel \(\sigma \)-algebra \({\mathcal {B}}({\textbf{X}})\).
-
The action space \({\textbf{A}}\) is a Borel space endowed with its Borel \(\sigma \)-algebra \({\mathcal {B}}({\textbf{A}})\). For each \(x\in X\), let \({\textbf{A}}(x)\) be a nonempty measurable subset of \({\textbf{A}}\), which stands for the actions available at state x. We also define
$$\begin{aligned} {\mathbb {K}}=\{(x,a)\in {\textbf{X}}\times {\textbf{A}}: a\in {\textbf{A}}(x)\} \end{aligned}$$and assume that it is a measurable subset of \({\textbf{X}}\times {\textbf{A}}\) for the \(\sigma \)-algebra \({\mathcal {B}}({\textbf{X}})\otimes {\mathcal {B}}({\textbf{A}})\).
-
The transitions of the system are given by a stochastic kernel Q on \({\textbf{X}}\) given \({\textbf{X}}\times {\textbf{A}}\).
-
We define the following functions: the discount factor \(\alpha :{\textbf{X}}\times {\textbf{A}}\rightarrow [0,1]\), the cost-per-stage \(c:{\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}^+\), and the constraints \(d_i:{\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}\) for \(i=1,\ldots ,q\), which are bounded below functions. All these functions are assumed to be measurable. The constraint constants are denoted by \(k_i\in {\mathbb {R}}\) for \(i=1,\ldots ,q\). Finally, we will write
$$\begin{aligned} {\textbf{d}}=(d_{1},\ldots ,d_{q}):{\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}^{q} \quad \hbox {and}\quad {\textbf{k}}=(k_{1},\ldots ,k_{q}) \in {\mathbb {R}}^q. \end{aligned}$$(2)
Control policies. We define the following sets: \({\textbf{H}}_{0}={\textbf{X}}\) and, for any \(t\in {\mathbb {N}}\), we let \({\textbf{H}}_{t}=({\textbf{X}}\times {\textbf{A}})^{t}\times {\textbf{X}}\) endowed with the corresponding product \(\sigma \)-algebra. A typical element \(h_t\) of \({\textbf{H}}_t\) will be denoted by
A randomized history-dependent control policy (or a control policy, for short) is a sequence \(\pi _t\), for \(t\in {\mathbb {N}}_0\), of stochastic kernels on \({\textbf{A}}\) given \({\textbf{H}}_t\) which satisfy, in addition, that
We denote by \(\mathbf {\Pi }\) the set of all control policies.
Let \(\mathbf {\Phi }\) be the set of all stochastic kernels \(\varphi \) on \({\textbf{A}}\) given \({\textbf{X}}\) that satisfy \(\varphi ({\textbf{A}}(x)|x)=1\) for every \(x\in {\textbf{X}}\). We shall identify \(\varphi \in \mathbf {\Phi }\) with the control policy given by
so that we have \(\mathbf {\Phi }\subseteq \mathbf {\Pi }\). We will say that \(\mathbf {\Phi }\) is the family of randomized stationary Markovian policies (in short, stationary policies). Let \({\textbf{F}}\) be the family of measurable functions \(f:{\textbf{X}}\rightarrow {\textbf{A}}\) such that \(f(x)\in {\textbf{A}}(x)\) for every \(x\in {\textbf{X}}\). Each \(f\in {\textbf{F}}\) is identified with the policy \(\pi _t(\cdot |h_t)=\delta _{f(x_t)}(\cdot )\), and we will refer to \({\textbf{F}}\) as to the set of deterministic stationary Markov policies, with \({\textbf{F}}\subseteq \mathbf {\Phi }\).
The dynamic system. The space of all sample paths \(\mathbf {\Omega }=({\textbf{X}}\times {\textbf{A}})^{{\mathbb {N}}_0}\) is endowed with its product \(\sigma \)-algebra \(\varvec{\mathcal {F}}\). An element of \(\mathbf {\Omega }\) is denoted by \(\omega =(x_0,a_0,x_1,a_1,\ldots )\). We define the projection functions \(X_t,A_t,H_t\) on \(\mathbf {\Omega }\) and taking values in \({\textbf{X}},{\textbf{A}},{\textbf{H}}_t\) as the functions which associate to each \(\omega \in \mathbf {\Omega }\) the corresponding component \(x_t,a_t,h_t\), respectively.
By the Ionescu–Tulcea theorem (see, e.g., [25, Section 2]), given any initial distribution \(\nu \in {\mathcal {P}}({\textbf{X}})\) and an arbitrary control policy \(\pi \in \mathbf {\Pi }\), there exists a unique probability measure \({\textbf{P}}^\pi _\nu \) on \((\mathbf {\Omega },\varvec{\mathcal {F}})\) that satisfies:
-
(i).
\({\textbf{P}}^\pi _\nu \{X_0\in B\}=\nu (B)\);
-
(ii).
\({\textbf{P}}^\pi _\nu \{A_t\in C\mid H_t\}=\pi _t(C| H_t)\);
-
(iii).
\({\textbf{P}}^\pi _\nu \{X_{t+1}\in B\mid H_t,A_t\}=Q(B|X_t,A_t)\)
for any \(B\in {\mathcal {B}}({\textbf{X}})\), \(C\in {\mathcal {B}}({\textbf{A}})\) and \(t\in {\mathbb {N}}_0\). The probability measure \({\textbf{P}}_{\nu }^\pi \) is referred to as a strategic probability measure. Note that, by construction of the process, we have \({\textbf{P}}^\pi _\nu ({\mathbb {K}}^{{\mathbb {N}}_0})=1\). The expectation operator of \({\textbf{P}}_{\nu }^{\pi }\) is denoted by \({\textbf{E}}_{\nu }^{\pi }\). If the initial distribution is given by the Dirac distribution at some \(x\in {\textbf{X}}\) then we will simply write \({\textbf{P}}^\pi _x\) and \({\textbf{E}}^\pi _x\).
The optimization problem. Given the initial probability measure \(\nu \in {\mathcal {P}}({\textbf{X}})\) and a control policy \(\pi \in \mathbf {\Pi }\), the objective function to be minimized by the controller is
Since the cost-per-stage function c is nonnegative, it follows that \(V_0(\pi ,\nu )\) is well defined, although it might be infinite. The total expected discounted value of the restrictions is defined in a similar way:
Our assumptions below will ensure that these expectations are well-defined.
The constrained optimization problem (CP) for the controller is to minimize \(V_0(\pi ,\nu )\) over the set of all policies \(\pi \in \mathbf {\Pi }\) satisfying the restrictions: \(V_i(\pi ,\nu )\le k_i\) for each \(1\le i\le q\). This problem will be written
Markov chains induced by stationary policies. Given an initial state \(x_0\in {\textbf{X}}\) and a stationary policy \(\varphi \in \mathbf {\Phi }\), it is clear that the stochastic process \(\{X_t\}_{t\ge 0}\) taking values in \({\textbf{X}}\) is a time-homogeneous Markov chain under the probability measure \({\textbf{P}}^\varphi _{x_0}\). Namely, its transition probabilities are given by the stochastic kernel on \({\textbf{X}}\) defined as
In addition, we also have that given an initial state-action pair \((x_0,a_0)\in {\textbf{X}}\times {\textbf{A}}\) and a stationary policy \(\varphi \in \mathbf {\Phi }\), the process \(\{(X_t,A_t)\}_{t\ge 0}\) is also a time-homogeneous Markov chain taking values in \({\textbf{X}}\times {\textbf{A}}\) with transition probability given by
The probability measure on \(\mathbf {\Omega }\) induced by this Markov chain is denoted by \({\textbf{P}}^\varphi _{x_0,a_0}\) and its corresponding expectation operator is \({\textbf{E}}^\varphi _{x_0,a_0}\). This Markov chain will be referred to as the state-action Markov chain. It is related to the state-action process probability measure by the obvious relation
To conclude this section, we introduce the notations \(Q_\varphi ^t\) and \(\overline{Q}_\varphi ^t\), which stand for the t-step probability transitions of the previously defined Markov chains. Namely, we let
for every \(t\in {\mathbb {N}}_0\), \(x\in {\textbf{X}}\), and \(B\in {\mathcal {B}}({\textbf{X}})\). Similarly, we write \(\overline{Q}_\varphi ^0(\cdot |x,a)=\delta _{(x,a)}(\cdot )\) and
for all \(t\in {\mathbb {N}}_0\), \((x,a)\in {\textbf{X}}\times {\textbf{A}}\), \(B\in {\mathcal {B}}({\textbf{X}})\), and \(C\in {\mathcal {B}}({\textbf{A}})\). Note that \(Q_\varphi =Q^1_\varphi \) and \(\overline{Q}_\varphi =\overline{Q}^1_\varphi \).
Assumptions. Let us now state the conditions we impose on the control model. Given \(0<\epsilon <1\), we define the following lower and upper sections of the discount factor function:
and
which are both measurable subsets of \({\textbf{X}}\times {\textbf{A}}\).
Assumption 2.1
There exist constants \(0<\epsilon ,\beta <1\), \(k_0>0\), and an integer \(t_0\in {\mathbb {N}}\) such that:
-
(a).
For every \((x,a)\in U_\epsilon \) we have \(c(x,a)\ge k_0\).
-
(b).
For every \((x,a)\in U_\epsilon \) and \(\varphi \in \mathbf {\Phi }\), we have \({\textbf{P}}^\varphi _{x,a}\{(X_{t_0},A_{t_0})\in L_\epsilon \}\ge \beta \).
-
(c).
The optimization problem (CP) is feasible, i.e., there exists \(\pi _0\in \mathbf {\Pi }\) with \(V_0(\pi _0,\nu )<\infty \) and \(V_i(\pi _0,\nu )\le k_i\) for every \(1\le i\le q\).
We make some comments on this assumption. Observe that there is no loss of generality in supposing that the same \(\epsilon \) is valid in Assumptions 2.1(a) and (b). This is because, as \(\epsilon \) decreases to zero, the sets \(L_{\epsilon }\) increase while the sets \(U_{\epsilon }\) decrease. The condition in (a) fails when there exists a sequence \(\{(x_n,a_n)\}\) in \({\textbf{X}}\times {\textbf{A}}\) with \(\alpha (x_n,a_n)\uparrow 1\) and \(c(x_n,a_n)\downarrow 0\). The interpretation of (a) is that discount factors close to 1 are somehow penalized, so that it cannot be optimal for the controller to remain “very long” in \(U_\epsilon \).
The condition in (b) states that there is a probability at least \(\beta \) of reaching \(L_\epsilon \) after \(t_0\) transitions from whatever initial state-action pair when using any stationary policy.
Note that the condition (c) is just to avoid trivial cases. Indeed, if any policy satisfying the constraints is such that \(V_0(\pi ,\nu )=\infty \), then the optimization problem is trivial.
We also need to make the following technical hypotheses on the control model.
Assumption 2.2
-
(a)
The action space \({\textbf{A}}\) is a compact metric space, the action sets \({\textbf{A}}(x)\) are nonempty and compact for every \(x\in {\textbf{X}}\), and the correspondence \(x\mapsto {\textbf{A}}(x)\) from \({\textbf{X}}\) to \({\textbf{A}}\) is weakly measurable.
-
(b)
There exist a measurable function \(q:{\textbf{X}}\times {\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}^+\) and \(\lambda \in {\mathcal {P}}({\textbf{X}})\) such that
$$\begin{aligned} Q(B|x,a)=\int _B q(y|x,a)\lambda (dy)\quad \hbox {for every}\quad B\in {\mathcal {B}}({\textbf{X}})\quad \hbox {and}\quad (x,a)\in {\textbf{X}}\times {\textbf{A}}. \end{aligned}$$For each fixed \(x,y\in {\textbf{X}}\), the mapping \(a\mapsto q(y|x,a)\) is continuous on \({\textbf{A}}\). There exists a nonnegative measurable function \((x,y)\mapsto q(y|x)\) on \({\textbf{X}}^2\) and a constant \({\textbf{q}}>0\) with
$$\begin{aligned} 0\le q(y|x,a)\le q(y|x)\quad \hbox {and}\quad \int _{\textbf{X}}q(y|x)\lambda (dy)\le {\textbf{q}} \end{aligned}$$for every \(x,y\in {\textbf{X}}\) and \(a\in {\textbf{A}}\).
-
(c)
The cost function \(c:{\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}^+\) is l.s.c. on \({\textbf{A}}\) for each fixed \(x\in {\textbf{X}}\).
The constraint functions \(d_i:{\textbf{X}}\times {\textbf{A}}\rightarrow {\mathbb {R}}\) are bounded below and l.s.c. on \({\textbf{A}}\) for each fixed \(x\in {\textbf{X}}\), for every \(i=1,\ldots ,q\).
The discount factor function \(\alpha :{\textbf{X}}\times {\textbf{A}}\rightarrow [0,1]\) is in \(Car_b({\textbf{X}}\times {\textbf{A}})\).
-
(d)
The initial distribution \(\nu \) satisfies \(\nu \ll \lambda \).
Using [1, Theorem 18.6], it follows from Assumption 2.2(a) that \({\mathbb {K}}\) is a measurable subset of \({\textbf{X}}\times {\textbf{A}}\). Moreover, the Kuratowsky–Ryll–Nardzewski selection theorem [1, Theorem 18.13] implies the existence of measurable selectors for the correspondence \(x\mapsto {\textbf{A}}(x)\). In particular, the sets \({\textbf{F}}\), \(\mathbf {\Phi }\), and \(\mathbf {\Pi }\) are nonempty. As a consequence of Assumption 2.2(b), for each \(x\in {\textbf{X}}\) and any sequence \(a_n\rightarrow a\) in \({\textbf{A}}\) we have \(q(\cdot |x,a_n){\mathop {\longrightarrow }\limits ^{L^1}} q(\cdot |x,a)\).
We note that our hypotheses in Assumption 2.1 and 2.2 are quite standard, except Assumption 2.1(b). At the end of the paper—see Proposition 5.1 below—we will provide sufficient conditions for Assumption 2.1(b) which are more easily verifiable for some instances or special cases.
Main result. Under the previous conditions, we can solve the control problem (CP). More specifically, we have the following:
Theorem 2.3
If Assumptions 2.1 and 2.2 hold then there exists a randomized stationary Markovian policy which is optimal for the constrained control problem (CP).
The proof of this result will be given in Sect. 4.
Finiteness of the total expected discount factor. Now we prove some results that will be useful in the forthcoming. In particular, we will explore conditions under which the total expected discount factor over the whole time horizon, i.e.,
is finite. Note that in a control model with a discount factor bounded away from one, that is, \(\alpha (x,a)\le \alpha _0<1\), such a result is trivial because we have
In our case, however, the discount factor can take values arbitrarily close to one, or even equal to one, and the issue of the finiteness of (7) is far from being trivial. In fact, it is a key result because, as we shall see in later sections, it is closely related to the finiteness of the occupation measure of a policy.
Proposition 2.1
Suppose that Assumption 2.1 is satisfied.
-
(i)
If \(\pi \in \mathbf {\Pi }\) is such that \(V_0(\pi ,\nu )\) is finite then
$$\begin{aligned} {\textbf{E}}_\nu ^\pi \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1} \alpha (X_j,A_j)\right] <\infty . \end{aligned}$$ -
(ii)
For every \(\varphi \in \mathbf {\Phi }\) and \(x\in {\textbf{X}}\) we have
$$\begin{aligned} {\textbf{E}}_x^\varphi \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1} \alpha (X_j,A_j)\right] < \frac{1+t_0/\beta }{\epsilon }. \end{aligned}$$
Proof
We define the random times \(S_r\) and \(T_r\) on \(\mathbf {\Omega }\) taking values in \({\mathbb {N}}_0\cup \{+\infty \}\) as follows. Let \(S_0=0\) and for every \(r\in {\mathbb {N}}\) define
Consequently, we have either:
-
\((x_0,a_0)\in L_{\epsilon }\) and then \(0=S_0<T_1<S_1<T_2\ldots \), or
-
\((x_0,a_0)\in U_{\epsilon }\) and then \( 0=S_0=T_1<S_1<T_2\ldots \).
Observe that when some \(T_r\) or \(S_r\) is infinite, then all the subsequent times \(\{S_k\}_{k\ge r}\) and \(\{T_k\}_{k\ge r+1}\) are infinite as well. Hence, for every sample path \(\omega =(x_0,a_0,x_1,a_1,\ldots )\) the time horizon is partitioned into disjoint intervals (some of them could be empty)
so that
Given some \(\omega \in \mathbf {\Omega }\) and \(t\in {\mathbb {N}}\), we are going to derive bounds on \(\prod _{j=0}^{t-1}\alpha (x_j,a_j)\). Indeed, recalling (8), we will use the bounds
Therefore, if t is such that \(S_{r-1}<t\le T_r\) for some \(r\in {\mathbb {N}}\) then
while if \(T_r<t\le S_r\) for some \(r\in {\mathbb {N}}\) then
Once these bounds have been established, we observe that
Combining this equation with the previous bounds we observe that
because, as r varies, the exponent in the last expression starts at 1 and then increases by one unit at each step: if some \(S_r\) is infinite then there will be a finite number of terms and, otherwise, it will go across the whole geometric series. In either case, we obtain the desired inequality. This gives a bound on the first term of the right hand side of (9). For the second term in (9) we have
because each term in the exponent of \(1-\epsilon \) is greater than or equal to 1 except perhaps \(T_1-S_0\) which may vanish. Summarizing, we have so far established the following general inequalities: for any \(\omega \in \mathbf {\Omega }\)
Proof of (i). Let \(\pi \in \mathbf {\Pi }\) be such that \(V_0(\pi ,\nu )\) is finite. By (10) and recalling Assumption 2.1(a) we have
since \(c(x_{t-1},a_{t-1})\ge k_0\) when \(T_r<t\le S_r\) because, for such t, we have \(\alpha (x_{t-1},a_{t-1})\ge 1-\epsilon \). Therefore, the cost function c being nonnegative,
which concludes the proof of statement (i).
Proof of (ii). Let \(\varphi \in \mathbf {\Phi }\) be an arbitrary stationary policy. Suppose that we are given an initial state-action pair \((x_0,a_0)\in U_{\epsilon }\) and let the integer \(t_0\) be as in Assumption 2.1. We have
for every \((x_0,a_0)\in U_{\epsilon }\). On the other hand, for any integer \(k\ge 1\)
Given any j we can write, for \((x_0,a_0),\ldots ,(x_{(j-1)t_0},a_{(j-1)t_0})\in U_{\epsilon }\) and using the Markov property for the stationary policy \(\varphi \in \mathbf {\Phi }\),
where we make use of (12) for the initial state \((x_{(j-1)t_0},a_{(j-1)t_0})\in U_{\epsilon }\). Therefore, for all \(j=1,\ldots ,k\),
Consequently, recalling (13) and using the “total probability rule”,
with \(C_j=\{S_1>jt_0\}\), we obtain
In conclusion,
We proceed with the proof and, at this point, we recall the inequality (11). We know that the expected time the process spends in \(U_\epsilon \) when starting from \((x_0,a_0)\in U_\epsilon \) is bounded by \(t_0/\beta \) uniformly in \((x_0,a_0)\)—recall (14)—and, therefore, using the Markov property and the fact that \(\varphi \) is stationary, it follows that \({\textbf{E}}^\varphi _{x}[S_r-T_r]\le t_0/\beta \) for any \(r\in {\mathbb {N}}\) and \(x\in {\textbf{X}}\). Consequently, taking \({\textbf{E}}^\varphi _x\)-expectation in (11) and using this bound, we derive that
which concludes the proof. \(\square \)
3 Occupation Measures and Young Measures
In this section, we introduce the set of occupation measures and establish (i) sufficient conditions that guarantee the finiteness of such occupation measures, and (ii) a useful characterization of this set of measures by means of linear equations. This will allow us to establish the equivalence between the original constrained control problem (CP) defined in (6) and a linear programming problem on a space of measures defined in (20) below. We also introduce Young measures and the Young topology as a useful tool to define a topology in the space \(\mathbf {\Phi }\) of stationary policies, for which \(\mathbf {\Phi }\) becomes a compact metric space.
Definition 3.1
Given a policy \(\pi \in \mathbf {\Pi }\) and the initial distribution \(\nu \in {\mathcal {P}}({\textbf{X}})\), we define its occupation measure \(\mu _{\pi }\in {\mathcal {M}}({\textbf{X}}\times {\textbf{A}})\) as follows: for any measurable set \(\Gamma \subseteq {\textbf{X}}\times {\textbf{A}}\), let
By construction of the controlled process, the occupation measure is supported on \({\mathbb {K}}\), meaning that \(\mu _\pi ( {\mathbb {K}}^c)=0\). Since the initial distribution \(\nu \) will remain fixed, we will not make it explicit in the notation \(\mu _\pi \). The total mass of the measure \(\mu _\pi \) is given by (cf. (7))
and so, the occupation measure \(\mu _\pi \) might not necessarily be a finite measure. Under Assumption 2.1 and as a consequence of Proposition 2.1, however, if \(\pi \in \mathbf {\Pi }\) is either stationary of such that \(V_0(\pi ,\nu )\) is finite, then necessarily \(\mu _\pi \) is a finite measure.
The following proposition concerns a useful result about the decomposition of product measures. See the references [7, pp. 139–144] or [13, pp. 88–89] for further details.
Proposition 3.1
Let \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\).
-
(i)
There exists a \(\mu ^{\textbf{X}}\)-a.s. uniqueFootnote 1 stochastic kernel \(\varphi \) on \({\textbf{A}}\) given \({\textbf{X}}\) such that \(\mu =\mu ^{\textbf{X}}\otimes \varphi \).
-
(ii)
If \(\mu \) is supported on \({\mathbb {K}}\) (i.e., \(\mu ({\mathbb {K}}^c)=0\)), then there exists \(\varphi \in \mathbf {\Phi }\) such that \(\mu =\mu ^{\textbf{X}}\otimes \varphi \).
Our next result provides a useful characterization of the occupation measures introduced in (15) as well as some of their properties. In (17) below, note that \(\alpha Q\equiv \alpha (x,a)Q(B|x,a)\) is a substochastic kernel on \({\textbf{X}}\) given \({\textbf{X}}\times {\textbf{A}}\), and so the notation \(\mu \alpha Q=\mu (\alpha Q)\) comes from (1).
Theorem 3.1
Let Assumption 2.1 hold true and consider an initial distribution \(\nu \in {\mathcal {P}}({\textbf{X}})\).
-
(i)
(Linear equations.) Given any \(\pi \in \mathbf {\Pi }\), its occupation measure \(\mu _\pi \) satisfies the conditions
$$\begin{aligned} \mu ({\mathbb {K}}^c)=0\quad \hbox {and}\quad \mu ^{\textbf{X}}=\nu +\mu \alpha Q, \end{aligned}$$(17)written in the variable \(\mu \in {\mathcal {M}}({\textbf{X}}\times {\textbf{A}})\).
-
(ii)
(Conditions for finiteness.) The occupation measure of any stationary policy \(\varphi \in \mathbf {\Phi }\) is finite: \(\mu _\varphi \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\). If \(\pi \in \mathbf {\Pi }\) is such that \(V_0(\pi ,\nu )\) is finite, then its occupation measure is finite: \(\mu _\pi \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\).
-
(iii)
(Sufficiency of stationary policies.) If a finite measure \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) satisfies (17) then there exists \(\varphi \in \mathbf {\Phi }\) such that \(\mu =\mu _\varphi \). In particular, if \(\pi \in \mathbf {\Pi }\) is such that \(V_0(\pi ,\nu )\) is finite then there exists \(\varphi \in \mathbf {\Phi }\) with \(\mu _\pi =\mu _\varphi \).
-
(iv)
(Boundedness.) The set of measures \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) satisfying (17) is bounded.
Proof
(i). The fact that \(\mu _\pi ({\mathbb {K}}^c)=0\) follows by construction of the state-action process. Fix arbitrary \(B\in {\mathcal {B}}({\textbf{X}})\). By definition, we have
The previous expression becomes (after changing the indices)
which is precisely the condition \(\mu _\pi ^{\textbf{X}}=\nu +\mu _\pi \alpha Q\).
(ii). Regarding the result for stationary policies, recalling the expression (15), it follows that
By the result in Proposition 2.1(ii) we obtain that \(\mu _\varphi ({\mathbb {K}})\) is finite. Concerning a policy \(\pi \in \mathbf {\Pi }\), the fact that \(\mu _\pi \) is finite when \(V_0(\pi ,\nu )\) also comes from Proposition 2.1(i).
(iii). Suppose that \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) is a finite measure satisfying the equations \(\mu ({\mathbb {K}}^c)=0\) and \(\mu ^{\textbf{X}}=\nu +\mu \alpha Q\). By the disintegration result in Proposition 3.1(ii), there exists \(\varphi \in \mathbf {\Phi }\) such that \(\mu =\mu ^{\textbf{X}}\otimes \varphi \) and so we have
Iterating this expression, we obtain for each \(n\ge 1\)
The first expression in the right hand side of this equation converges, by monotone convergence, to \(\mu ^{\textbf{X}}_\varphi (B)\). Regarding the second expression, we have the inequality
By the result in Proposition 2.1, we have that the following series converges:
so that for any \(x\in {\textbf{X}}\) we have that
We can now use dominated convergence with respect to the finite measure \(\mu ^{\textbf{X}}(dx)\) to conclude from (19) that the rightmost term in (18) converges to 0 as \(n\rightarrow \infty \). This shows, taking the limit as \(n\rightarrow \infty \) in (18) that \(\mu ^{\textbf{X}}\) coincides with the \({\textbf{X}}\)-marginal of the occupation measure of \(\varphi \): \(\mu ^{\textbf{X}}=\mu ^{\textbf{X}}_\varphi \). Since, by disintegration, we have \(\mu =\mu ^{\textbf{X}}\otimes \varphi \), we conclude that \(\mu =\mu _\varphi \).
The last statement in (iii) easily follows from parts (i) and (ii).
(iv). If \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) satisfies (17) then \(\mu =\mu _\varphi \) for some \(\varphi \in \mathbf {\Phi }\) (see item (iii)) and so, recalling (16),
by Proposition 2.1(ii). Hence,
which completes the proof. \(\square \)
Linear programming formulation. It is clear from the definition of the occupation measure, and recalling that c is nonnegative, that for any \(\pi \in \mathbf {\Pi }\) we have
In addition, if \(\pi \in \mathbf {\Pi }\) is such that \(\mu _\pi \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) then the value of the constraints \(V_i(\pi ,\nu )\) for \(1\le i\le q\) are well defined and they verify
This is because, the functions \(d_i\) being bounded below and \(\mu _\pi \) being finite, the integral of the negative part of \(d_i\) is necessarily finite.
In the sequel, we will use the following notation. Let
which corresponds to the set of finite occupation measures, and let
which stands for the measures in \(\varvec{\mathcal {O}}\) which, in addition, satisfy the inequalities for the constraint functions \({\textbf{d}}\). Observe that all the equations and inequalities defining the sets \(\varvec{\mathcal {O}}\) and \(\varvec{\mathcal {O}}_d\) are linear.
By Assumption 2.1(c), there is no loss of generality in imposing, in the definition of the optimization problem (CP), that minimization is carried over all policies \(\pi \in \mathbf {\Pi }\) that satisfy the constraints and such that \(V_0(\pi ,\nu )\) is finite. In view of Theorem 3.1, we have the following characterization.
Corollary 3.1
Under Assumption 2.1, the constrained problem (CP) is equivalent to the linear programming problem
Once we have stated (CP) as a linear programming problem, the plan of the proof on the existence of an optimal stationary policy consists in endowing \(\varvec{\mathcal {O}}\) with an adequate topology under which it becomes compact, and under which \(\langle c,\mu \rangle \) is a lower semicontinuous function whose infimum is, therefore, attained.
Topologies on measure spaces. In the sequel, we suppose that Assumptions 2.1 and 2.2 are satisfied. We define the following topologies on the sets \({\mathcal {M}}^f({\textbf{X}})\) and \({\mathcal {M}}^f({\textbf{A}})\) of finite measures on \({\textbf{X}}\) and \({\textbf{A}}\), respectively.
Definition 3.2
-
(a)
The strong topology (s-topology) on \({\mathcal {M}}^f({\textbf{X}})\) is the coarsest topology for which the mappings \(D\mapsto \mu (D)\) are continuous for every \(D\in {\mathcal {B}}({\textbf{X}})\).
-
(b)
The weak topology (in short, the w-topology) on \({\mathcal {M}}^f({\textbf{A}})\) is the coarsest topology for which the mappings \(\mu \mapsto \int _{{\textbf{A}}}fd\mu \) are continuous for every bounded and continuous function \(f:{\textbf{A}}\rightarrow {\mathbb {R}}\).
The next result is a useful characterization of relative compactness in the s-topology. For further details see [37, Lemma 3.5].
Proposition 3.2
For any \(\Gamma \subseteq {\mathcal {M}}^f({\textbf{X}})\), the following statements are equivalent:
-
(i)
\(\Gamma \) is relatively compact it the s-topology.
-
(ii)
For any sequence \(\{f_n\}\) of bounded measurable functions on \({\textbf{X}}\) such that \(f_n\downarrow 0\)
$$\begin{aligned} \lim _{n\rightarrow \infty }\sup _{\mu \in \Gamma }\int _{{\textbf{X}}}f_nd\mu = 0. \end{aligned}$$
Finally, we define a topology on the set \({\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) of finite measures on \({\textbf{X}}\times {\textbf{A}}\) when endowed with the product \(\sigma \)-algebra \({\mathcal {B}}({\textbf{X}})\otimes {\mathcal {B}}({\textbf{A}})\). We recall that \(Car_b({\textbf{X}}\times {\textbf{A}})\) is the set of real-valued bounded Carathéodory functions on \({\textbf{X}}\times {\textbf{A}}\).
Definition 3.3
The weak-strong (ws) topology on \({\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\) is the coarsest topology for which the mappings \(\mu \mapsto \int _{{\textbf{X}}\times {\textbf{A}}} fd\mu \) are continuous for every function \(f\in Car_b({\textbf{X}}\times {\textbf{A}})\).
We will write \(\mu _n{\mathop {\longrightarrow }\limits ^{ws}}\mu \) when the sequence \(\{\mu _n\}\) converges to \(\mu \) in the weak-strong topology.
Young measures. Recall that \(\lambda \in {\mathcal {P}}({\textbf{X}})\) has been introduced in Assumption 2.2(b). On the set \(\mathbf {\Phi }\) of all stationary policies we define the following equivalence relation
We denote by \(\varvec{\mathcal {Y}}\) the family of all equivalence classes and we will say that an element in \(\varvec{\mathcal {Y}}\) (i.e., an equivalence class) is a Young measure.
Definition 3.4
The Young topology on \(\varvec{\mathcal {Y}}\) is the coarsest topology for which the mappings
are continuous for every \(f\in Car_1({\textbf{X}}\times {\textbf{A}})\).
Note that the integral in the above definition does not depend on the element \(\varphi \) that is chosen within the class of equivalence in \(\varvec{\mathcal {Y}}\).
Lemma 3.1
If \(\varphi ,\varphi '\in \mathbf {\Phi }\) are in the same class of equivalence of \(\varvec{\mathcal {Y}}\) then the probability measures \({\textbf{P}}^\varphi _\nu \) and \({\textbf{P}}^{\varphi '}_\nu \) coincide.
Proof
We will show that the joint distribution of \((X_0,A_0,\ldots ,X_t,A_t)\) is the same under both \({\textbf{P}}^\varphi _\nu \) and \({\textbf{P}}^{\varphi '}_\nu \) for any \(t\in {\mathbb {N}}\). We will show it by induction.
For \(t=0\), note that
where we use the fact that \(\nu \ll \lambda \) and that the equality \(\varphi (C|x)=\varphi '(C|x)\) holds \(\lambda \)-a.s.
Suppose now that the stated result is true for some \(t\in {\mathbb {N}}_0\). We have
Since the distribution of \((H_t,A_t)\) is the same under \({\textbf{P}}^\varphi _\nu \) and \({\textbf{P}}^{\varphi '}_\nu \), we conclude the result. \(\square \)
It follows from this lemma that two stationary policies \(\varphi ,\varphi '\) in the same class of equivalence of \(\varvec{\mathcal {Y}}\) have the same strategic probability measure. They are thus indistinguishable and they yield, in particular, the same expected payoffs and the same occupation measures. In the sequel, we shall therefore identify the family of Markov stationary policies \(\mathbf {\Phi }\) with the family of Young measures \(\varvec{\mathcal {Y}}\), and so we will refer to the Young topology on \(\mathbf {\Phi }\). We will write \(\varphi _n{\mathop {\longrightarrow }\limits ^{y}}\varphi \) when the sequence \(\{\varphi _n\}\) in \(\mathbf {\Phi }\) converges to \(\varphi \in \mathbf {\Phi }\) in the Young topology.
The space \(L^1({\textbf{X}},{\mathcal {B}}({\textbf{X}}),\lambda )\) is separable (this is because \({\mathcal {B}}({\textbf{X}})\) is countably generated), and so by [5, Lemma 1] we have the following.
Proposition 3.3
The set of stationary policies \(\mathbf {\Phi }\) is a compact metric space for the Young topology.
Lemma 3.2
For any \(t\in {\mathbb {N}}_0\) and any functions \(W_0,\ldots ,W_t\) in \(Car_b({\textbf{X}}\times {\textbf{A}})\), we have that
Proof
We shall consider the measurable product space \(({\textbf{X}}^{t+1},{\mathcal {B}}({\textbf{X}})\otimes \cdots \otimes {\mathcal {B}}({\textbf{X}}),\overline{\lambda })\) where the probability measure \({\overline{\lambda }}\in {\mathcal {P}}({\textbf{X}}^{t+1})\) is the \((t+1)\)-times product of \(\lambda \):
Similarly, considering the compact product Borel space \({\textbf{A}}^{t+1}\) of the action spaces, we shall be dealing with Carathéodory functions on \(({\textbf{X}}\times {\textbf{A}})^{t+1}\) that, for consistency in the notation, will be rather written as \({\textbf{X}}^{t+1}\times {\textbf{A}}^{t+1}\).
Observe that the expectation in the statement of the lemma can be written in integral form as
Under our conditions (recall Assumption 2.2(b)) we have that the function F given by
is in \( Car({\textbf{X}}^{t+1}\times {\textbf{A}}^{t+1})\). Let us now show that we have \(F\in Car_1({\textbf{X}}^{t+1}\times {\textbf{A}}^{t+1})\). Let \({\textbf{w}}_i\) be a bound for \(W_i\), that is, \(|W_i(x,a)|\le {\textbf{w}}_i\) for any \((x,a)\in {\textbf{X}}\times {\textbf{A}}\). We also have, by Assumption 2.2(b), for every \(0\le i\le t-1\), \(x_i,x_{i+1}\in {\textbf{X}}\) and \(a_i\in {\textbf{A}}\)
Hence, we conclude that
for every \((x_0,a_0,\ldots ,x_t,a_t)\in ({\textbf{X}}\times {\textbf{A}})^{t+1}\), where G is defined on \({\textbf{X}}^{t+1}\) as
It is now easy to see, integrating first with respect to \(\lambda (dx_t)\), then \(\lambda (dx_{t-1})\), etc., that
so that \(G\in L^1({\textbf{X}}^{t+1},{\mathcal {B}}({\textbf{X}})^{t+1},\overline{\lambda })\). This completes the proof that the function F is in \(Car_1({\textbf{X}}^{t+1}\times {\textbf{A}}^{t+1})\).
On the space \(({\textbf{X}}^{t+1},{\textbf{A}}^{t+1})\) we can consider the family \(\overline{\varvec{\mathcal {Y}}}\) of Young measures consisting of all stochastic kernels on \({\textbf{A}}^{t+1}\) given \({\textbf{X}}^{t+1}\), equipped with the corresponding Young topology (cf. Definition 3.4). Given any \(\varphi \in \mathbf {\Phi }\) we consider the tensor product \(\overline{\varphi }=\bigotimes _{i=0}^t \varphi \in \overline{\varvec{\mathcal {Y}}}\) of \(t+1\) copies of the Young measures \(\varphi \) defined as
By continuity of the tensor product of Young measures (see [18, Theorem 3.90] or [4, Theorem 4.17]) we have that the mapping from \(\mathbf {\Phi }\) to \(\overline{\varvec{\mathcal {Y}}}\) given by \(\varphi \mapsto \overline{\varphi }\) is continuous.
Now we proceed with the proof of the lemma. Suppose that \(\{\varphi _n\}\) is a sequence in \(\mathbf {\Phi }\) with \(\varphi _n{\mathop {\longrightarrow }\limits ^{y}}\varphi \). By continuity of the tensor product we have \(\overline{\varphi }_n{\mathop {\longrightarrow }\limits ^{y}}\overline{\varphi }\) in the Young topology of \(\overline{\varvec{\mathcal {Y}}}\) and so, since F is in \(Car_1({\textbf{X}}^{t+1}\times {\textbf{A}}^{t+1})\),
But the latter convergence is precisely (recall (21) and (22))
which completes the proof. \(\square \)
4 Proof of Theorem 2.3
In this section, we will show that the linear programming problem (LP) in (20) is solvable using both: (a) topological properties (i.e., compactness) of the sets \(\varvec{\mathcal {O}}\) and \(\varvec{\mathcal {O}}_d\), and (b): regularity properties of the mapping \(\mu \mapsto \langle c,\mu \rangle \). Then, from Corollary 3.1, we will derive the existence of a solution to the constrained problem (CP) in (6). In Sect. 5 we will provide mild sufficient conditions for the statement of Assumption 2.1(b).
We start this section by analyzing the compactness property of the sets \(\varvec{\mathcal {O}}\) and \(\varvec{\mathcal {O}}_d\). To this end, we need the following results that we borrow from [6] and whose proofs can be found in Proposition 2.3 and Theorems 2.5 and 3.1 of this same reference.
Lemma 4.1
Let \(({\textbf{U}},{\mathcal {U}})\) be a measurable space and \({\textbf{Z}}\) be a Polish space.
-
(i)
Assume that \({\mathcal {U}}\) is countably generated. Every \(\Gamma \subseteq {\mathcal {M}}^f({\textbf{U}}\times {\textbf{Z}})\) with the property of \(\Gamma ^{{\textbf{U}}}\) being relatively compact in the s-topology is metrizable in the ws-topology.
-
(ii)
Given \(\Gamma \subseteq {\mathcal {M}}^f({\textbf{U}}\times {\textbf{Z}})\), the following statements are equivalent.
-
(a)
\(\Gamma ^{{\textbf{U}}}\subseteq {\mathcal {M}}^f({\textbf{U}})\) is relatively compact in the s-topology and \(\Gamma ^{{\textbf{Z}}}\subseteq {\mathcal {M}}^f({\textbf{Z}})\) is tight.
-
(b)
\(\Gamma \) is tight in the ws-topology.
-
(c)
Every sequence in \(\Gamma \) is tight in the ws-topology.
-
(d)
\(\Gamma \) is relatively sequentially compact in the ws-topology.
-
(e)
\(\Gamma ^{{\textbf{U}}}\subseteq {\mathcal {M}}^f({\textbf{U}})\) is relatively compact in the s-topology and \(\Gamma ^{{\textbf{Z}}}\subseteq {\mathcal {M}}^f({\textbf{Z}})\) is relatively sequentially compact in the w-topology.
-
(a)
-
(iii)
Given \(\{\mu _n\}\) and \(\mu \) in \({\mathcal {M}}^f({\textbf{U}}\times {\textbf{Z}})\), we have that \(\mu _n{\mathop {\longrightarrow }\limits ^{ws}}\mu \) in \({\mathcal {M}}^f({\textbf{U}}\times {\textbf{Z}})\) if and only if
$$\begin{aligned} \liminf _{n\rightarrow \infty }\int _{{\textbf{U}}\times {\textbf{Z}}}gd\mu _n\ge \int _{{\textbf{U}}\times {\textbf{Z}}}gd\mu \end{aligned}$$for every real-valued measurable function g on \({\textbf{U}}\times {\textbf{Z}}\) such that \(z\mapsto g(u,z)\) is l.s.c on \({\textbf{Z}}\) for each \(u\in {\textbf{U}}\), and such that
$$\begin{aligned} \lim _{\gamma \rightarrow \infty }\sup _{n\ge 1}\int _{\{g\le -\gamma \}}\max (-g,0)d\mu _n=0 \end{aligned}$$(in particular, any bounded below g satisfies the latter condition).
Proposition 4.1
Under Assumptions 2.1 and 2.2, the sets \(\varvec{\mathcal {O}}\) and \(\varvec{\mathcal {O}}_d\) are compact metric spaces for the ws-topology.
Proof
The proof proceeds in several steps.
-
(i)
The set \(\varvec{\mathcal {O}}^{\textbf{X}}\) is relatively compact in the s-topology. To show this result, we will use the characterization given in Proposition 3.2. Namely, let \(\{f_n\}\) be an arbitrary sequence of bounded measurable functions on \({\textbf{X}}\) decreasing to the null function. We want to show that
$$\begin{aligned} \lim _{n\rightarrow \infty }\sup _{\mu \in \varvec{\mathcal {O}}} \int _{\textbf{X}}f_n d\mu ^{\textbf{X}}=0.\end{aligned}$$Recalling that any \(\mu \in \varvec{\mathcal {O}}\) is the occupation measure of some stationary policy \(\varphi \in \mathbf {\Phi }\), it suffices to show that
$$\begin{aligned} \lim _{n\rightarrow \infty } \sup _{\varphi \in \mathbf {\Phi }} {\textbf{E}}^\varphi _\nu \left[ \sum _{t=0}^\infty f_n(X_t)\prod _{j=0}^{t-1} \alpha (X_j,A_j)\right] =0. \end{aligned}$$(23)Choose now an arbitrary stationary policy \(\varphi \in \mathbf {\Phi }\). Observe that, for some constant \({\textbf{f}}>0\) and any integer k,
$$\begin{aligned}&{\textbf{E}}^\varphi _\nu \left[ \sum _{t=0}^\infty f_n(X_t)\prod _{i=0}^{t-1} \alpha (X_i,A_i)\right] \nonumber \\&\quad \le \sum _{t=0}^k {\textbf{E}}^\varphi _\nu [f_n(X_t)\prod _{i=0}^{t-1} \alpha (X_i,A_i)] + {\textbf{f}}\cdot {\textbf{E}}^\varphi _\nu \left[ \sum _{t=k+1}^\infty \prod _{i=0}^{t-1} \alpha (X_i,A_i)\right] . \end{aligned}$$(24)Regarding the rightmost term in (24),
$$\begin{aligned}&{\textbf{E}}^\varphi _\nu \left[ \sum _{t=k+1}^\infty \prod _{i=0}^{t-1} \alpha (X_i,A_i)\right] = {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i)\sum _{t=k+1}^\infty \prod _{i=k+1}^{t-1} \alpha (X_i,A_i)\right] \\&\quad = {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i) \cdot {\textbf{E}}^\varphi _\nu \left[ \sum _{t=k+1}^\infty \prod _{i=k+1}^{t-1} \alpha (X_i,A_i)\mid H_{k+1} \right] \right] . \end{aligned}$$Observe now that
$$\begin{aligned} {\textbf{E}}^\varphi _\nu \left[ \sum _{t=k+1}^\infty \prod _{i=k+1}^{t-1} \alpha (X_i,A_i)\mid H_{k+1}=h_{k+1} \right]= & {} {\textbf{E}}^\varphi _{x_{k+1}}\left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1} \alpha (X_j,A_j) \right] \\\le & {} \frac{1+t_0/\beta }{\epsilon }, \end{aligned}$$by Proposition 2.1(ii), so that
$$\begin{aligned} {\textbf{E}}^\varphi _\nu \left[ \sum _{t=k+1}^\infty \prod _{i=0}^{t-1} \alpha (X_i,A_i)\right] \le \frac{1+t_0/\beta }{\epsilon }\cdot {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i)\right] . \end{aligned}$$Since the series
$$\begin{aligned} \sum _{k=0}^\infty {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i)\right] \end{aligned}$$converges (Proposition 2.1(i)), it follows that for any \(\varphi \in \mathbf {\Phi }\) we have
$$\begin{aligned} \lim _{k\rightarrow \infty } {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i)\right] =0 \end{aligned}$$and convergence is monotonically decreasing as \(k\rightarrow \infty \). The mapping on \(\mathbf {\Phi }\) given by \(\displaystyle \varphi \mapsto {\textbf{E}}^\varphi _\nu [\prod _{i=0}^{k} \alpha (X_i,A_i)]\) is continuous (Lemma 3.2) and, since \(\mathbf {\Phi }\) is compact, it follows from Dini’s theorem that convergence is uniform on \(\mathbf {\Phi }\). Therefore,
$$\begin{aligned} \lim _{k\rightarrow \infty } \sup _{\varphi \in \mathbf {\Phi }} {\textbf{E}}^\varphi _\nu \left[ \prod _{i=0}^{k} \alpha (X_i,A_i)\right] =0 \end{aligned}$$and so the rightmost term in (24) converges to 0 uniformly on \(\mathbf {\Phi }\). Consequently, given some \(\varepsilon >0\) we can find \(k_0\) such that, for any \(\varphi \in \mathbf {\Phi }\) we have
$$\begin{aligned} {\textbf{f}}\cdot {\textbf{E}}^\varphi _\nu \left[ \sum _{t=k_0+1}^\infty \prod _{i=0}^{t-1} \alpha (X_i,A_i)\right] <\varepsilon /2. \end{aligned}$$(25)Now we turn our attention to the leftmost term in (24) for \(k_0\). For any fixed \(0\le t\le k_0\), the mapping
$$\begin{aligned} \varphi \mapsto {\textbf{E}}^\varphi _\nu [f_n(X_t)\prod _{i=0}^{t-1} \alpha (X_i,A_i)] \end{aligned}$$is continuous on \(\mathbf {\Phi }\) (Lemma 3.2) and, by dominated convergence, it converges monotonically to zero as \(n\rightarrow \infty \). Using Dini’s theorem again, we obtain that convergence is uniform on \(\mathbf {\Phi }\). Hence, we can find some \(n_0\) such that \(n\ge n_0\) implies that
$$\begin{aligned} \sum _{t=0}^{k_0}{\textbf{E}}^\varphi _\nu [f_n(X_t)\prod _{i=0}^{t-1} \alpha (X_i,A_i)]\le \varepsilon /2(k_0+1)\quad \hbox {for any}\quad \varphi \in \mathbf {\Phi }. \end{aligned}$$This completes the proof of (23), and this establishes that \(\varvec{\mathcal {O}}^{\textbf{X}}\) is indeed relatively s-compact. Once we know that \(\varvec{\mathcal {O}}^{\textbf{X}}\) is relatively s-compact, we can invoke Lemma 4.1(i) to conclude that \(\varvec{\mathcal {O}}\) is a metric space for the ws-topology.
-
(ii)
The set \(\varvec{\mathcal {O}}^{\textbf{A}}\) is relatively w-compact. This is obvious because \({\textbf{A}}\) is a compact metric space and \(\varvec{\mathcal {O}}\) is bounded; recall Theorem 3.1(iv).
From items (i) and (ii), and Lemma 4.1(ii) we derive that \(\varvec{\mathcal {O}}\) is relatively compact for the ws-topology. To prove that \(\varvec{\mathcal {O}}\) is compact in this topology it only remains to show the next result.
-
(iii)
\(\varvec{\mathcal {O}}\) is closed. To see this, suppose that \(\{\mu _n\}\) is a sequence in \(\varvec{\mathcal {O}}\) that converges to some \(\mu \in {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\), i.e., \(\mu _n{\mathop {\longrightarrow }\limits ^{ws}}\mu \). We must show that \(\mu \in \varvec{\mathcal {O}}\). First of all, we will show that \(\mu ^{\textbf{X}}=\nu +\mu \alpha Q\). By hypothesis, given any \(B\in {\mathcal {B}}({\textbf{X}})\) and \(n\in {\mathbb {N}}\), we have
$$\begin{aligned} \mu _n^{\textbf{X}}(B)= & {} \int _{{\textbf{X}}\times {\textbf{A}}} {\textbf{I}}_{B\times {\textbf{A}}}(x,a)\mu _n(dx,da)\\= & {} \nu (B)+\int _{{\textbf{X}}\times {\textbf{A}}}Q(B|x,a)\alpha (x,a)\mu _n(dx,da). \end{aligned}$$Using Assumption 2.2(b)–(c) it is easy to show that the functions Q(B|x, a) and \(\alpha (x,a)\) are in \(Car_b({\textbf{X}}\times {\textbf{A}})\), and the same happens to \( {\textbf{I}}_{B\times {\textbf{A}}}\). Hence, we can take the limit as \(n\rightarrow \infty \) since \(\mu _n{\mathop {\longrightarrow }\limits ^{ws}}\mu \) to conclude that \(\mu _n^{\textbf{X}}(B)=\int {\textbf{I}}_{B\times {\textbf{A}}}d\mu _n\rightarrow \int {\textbf{I}}_{B\times {\textbf{A}}}d\mu =\mu ^{\textbf{X}}(B)\) and so
$$\begin{aligned} \mu ^{\textbf{X}}(B)=\nu (B)+ \int _{{\textbf{X}}\times {\textbf{A}}}Q(B|x,a)\alpha (x,a)\mu (dx,da)\quad \hbox {for every}\quad B\in {\mathcal {B}}({\textbf{X}}), \end{aligned}$$as we wanted to prove.
Let us now show that \(\mu ({\mathbb {K}}^c)=0\). Since the action sets \({\textbf{A}}(x)\) are compact, it turns out that the function \((x,a)\mapsto {\textbf{I}}_{{\mathbb {K}}^c}(x,a)\) is measurable and nonnegative, and it is a l.s.c. function on \({\textbf{A}}\) for each fixed \(x\in {\textbf{X}}\). Using Lemma 4.1(iii), we have that
$$\begin{aligned} 0=\liminf _{n\rightarrow \infty } \mu _n({\mathbb {K}}^c)\ge \mu ({\mathbb {K}}^c) \end{aligned}$$and the result follows because \(\mu \) is a nonnegative measure. This completes the proof that \(\varvec{\mathcal {O}}\) is closed in the ws-topology.
Combining (i)–(iii) above we conclude that \(\varvec{\mathcal {O}}\) is a compact metric space in the ws-topology. To complete the proof of the proposition, it remains to establish the next statement.
-
(iv)
The set \(\varvec{\mathcal {O}}_d\) is compact. If we show that \(\varvec{\mathcal {O}}_d\) is closed, since it is contained in the compact set \(\varvec{\mathcal {O}}\), it will follow that it is compact as well. To see that \(\varvec{\mathcal {O}}_d\) is closed, suppose that \(\{\mu _n\}\) is a sequence in \(\varvec{\mathcal {O}}_d\) that converges in the ws-topology to some \(\mu \in \varvec{\mathcal {O}}\). Applying again Lemma 4.1(iii), since the functions \(d_i\) are bounded below and l.s.c. on \({\textbf{A}}\) for fixed \(x\in {\textbf{X}}\), we have that \(\mu _n{\mathop {\longrightarrow }\limits ^{ws}}\mu \) implies
$$\begin{aligned} \liminf _{n\rightarrow \infty } \langle d_i,\mu _n\rangle \ge \langle d_i,\mu \rangle . \end{aligned}$$Therefore, we have \(\langle d_i,\mu \rangle \le k_i\) for each \(1\le i\le q\), and so \(\varvec{\mathcal {O}}_d\) is closed and, hence, compact.
The proof of the proposition is now complete. \(\square \)
We also have the following result, whose proof directly follows from Lemma 4.1(iii).
Proposition 4.2
Under Assumptions 2.1 and 2.2, the function \(\mu \mapsto \langle c,\mu \rangle \) is l.s.c. on \(\varvec{\mathcal {O}}\).
We obtain our main result in the paper.
Proof of Theorem 2.3
We have that the linear programming problem (LP) in (20) is solvable because \(\varvec{\mathcal {O}}_d\) is compact (Proposition 4.1) and \(\mu \mapsto \langle c,\mu \rangle \) is l.s.c. (Proposition 4.2). If \(\mu ^*\in \varvec{\mathcal {O}}_d\) is an optimal solution of (LP), then \(\varphi ^*\in \mathbf {\Phi }\) such that \(\mu ^*=\mu _{\varphi ^*}\) is, by Corollary 3.1, an optimal control policy for the constrained problem (CP). \(\square \)
We conclude this section with some comments on related approaches used in the literature, and with a discussion on approximating techniques for the problem (LP).
Remark 4.1
-
(a)
The method employed in this section is also known in the literature as the direct method. To the best of our knowledge, earlier works that use such a method need the assumption that c is nonnegative (or, at least, bounded below). In the present work, the non-negativeness property ensures the existence of finite occupation measures \(\mu _\pi \) for arbitrary policies \(\pi \in \mathbf {\Pi }\)—see the proof of Theorem 3.1(ii) that in turn uses the arguments within the proof of the Proposition 2.1—. Thus, the case of c being negative and/or unbounded has not been previously studied with the direct method and we think that it is a not trivial task. There exist, however, previous works dealing with costs of this latter kind. For instance, the papers [19, 20] re-formulate the space \(\varvec{\mathcal {O}}_d\) as the restriction of a bigger linear program, so that, with the help of its corresponding dual problem, the existence of both the value function and an optimal policy can be obtained. More recently, the paper [32] assumes unbounded and possibly negative costs and a state-action dependent discount factor although the optimality results were obtained through the use of the well-known dynamic programming method.
-
(b)
As was argued earlier, the approach ensuring the existence of optimal control policies is based on the compactness of the sets \(\varvec{\mathcal {O}}\) and \(\varvec{\mathcal {O}}_d\), together with the semicontinuity of \(\mu \mapsto \langle c,\mu \rangle \), shown in Propositions 4.1 and 4.2, respectively. Our assumptions allowed us to obtain such optimality results by using the ws-topology and the Young topology. It is worth mentioning that previous works addressing this same problem for the case when the discount factor is a given constant \(\alpha <1\), have applied an approach similar to ours but whose main difference lies on the inf-compactnessFootnote 2 property of c (we do not make such assumption in this paper)—see, for instance, [20, 21, 28]. The approach in the aforementioned references is to apply the weak topology on a corresponding version of \(\varvec{\mathcal {O}}\) (rather than the topologies used here) so that it is proved that: (a) \(\varvec{\mathcal {O}}\) is weakly closed, and (b) a certain “minimizing family” of measures in \(\varvec{\mathcal {O}}\) is relative compact thanks to Prokhorov’s theorem. In addition, the semicontinuity property of the mapping \(\mu \mapsto \langle c,\mu \rangle \) is also ensured. We mention here the main advantages and disadvantages of these two approaches: on one side, the assumption on c to be inf-compact excludes the very simple case of c being bounded. Since we do not make this assumption, our model covers this case (as well as many others), but we pay the price of assuming both: (a) the compactness of the sets \({\textbf{A}}\) and \({\textbf{A}}(x),\) for \(x\in {\textbf{X}}\), as well as (b) the existence of a density function q with suitable properties—see Assumption 2.2.
-
(c)
Obtaining explicitly the optimal policy of the constrained problem (CP) requires solving the infinite-dimensional linear programming problem (LP) over a set of measures. It is not possible, in general, to give a solution to this infinite-dimensional program. To tackle this problem, approximation techniques become necessary. Making a detailed exposition of the techniques that could be used in this context would require extensive explanations and it is beyond the scope of this paper. Let us mention, however, some existing procedures. Finite approximations using approximating sequences and duality, as well as aggregation–relaxation techniques combined with inner approximations are discussed in [26, Chapter 12]. Some other methods rely on suitable discretization of the state and action spaces. These are based on discretizing the action space by using the metric on \({\textbf{A}}\), while the state space \({\textbf{X}}\) is discretized by using probabilistic methods. More precisely, the underlying probability measure \(\lambda \) is approximated by some probability measure \(\lambda _N\) on \({\textbf{X}}\) with finite support, and this support is taken as the discretization of \({\textbf{X}}\). The discrete measure \(\lambda _N\) can be obtained by quantization (i.e., making the best \(L^p\) approximation of \(\lambda \) by a probability measure supported on a given number N of points): see [9] for the results obtained by using this technique on a constrained discounted MDP, or by making a stochastic approximation letting \(\lambda _N\) be the empirical probability measure of \(\lambda \) for a sample of size N: this yields convergence rates which are stated in terms of Wasserstein distances, as seen in [10] also for a constrained discounted MDP.
5 Further Results
In this section, we present some complementary results as well as some applications.
5.1 Sufficient Conditions for Assumption 2.1(b)
Among our conditions (Assumptions 2.1 and 2.2), the most restrictive one is Assumption 2.1(b) because it requires that the probability of reaching \(L_\epsilon \) is bounded away from zero uniformly in the initial state \((x,a)\in U_\epsilon \) and in the policy \(\varphi \in \mathbf {\Phi }\). Next, we propose some additional conditions under which this condition is satisfied. In particular, we will just need the probability of reaching \(L_\epsilon \) to be positive for each single initial state \((x,a)\in U_\epsilon \) and each single stationary policy \(\varphi \in \mathbf {\Phi }\).
For fixed \(x\in {\textbf{X}}\), consider the x-section \(U_\epsilon (x)=\{a\in {\textbf{A}}:\alpha (x,a)\ge 1-\epsilon \}\). In addition to the previous hypotheses, we impose the following conditions.
Assumption 5.1
-
(a)
The discount value function \(\alpha \) is u.s.c. on \({\textbf{X}}\times {\textbf{A}}\).
-
(b)
The density function satisfies that \((x_n,a_n)\rightarrow (x,a)\) implies \(q(\cdot |x_n,a_n){\mathop {\longrightarrow }\limits ^{L^1}} q(\cdot |x,a)\).
-
(c)
For \(\epsilon >0\) in Assumption 2.1, the set \(U_\epsilon ^{\textbf{X}}=\{x\in {\textbf{X}}: U_\epsilon (x)\ne \emptyset \}\) is compact.
Proposition 5.1
Suppose that Assumptions 2.1 (except part (b)), 2.2, and 5.1 are satisfied. If there exists \(t_0\in {\mathbb {N}}\) such that
for any \((x,a)\in U_\epsilon \) and \(\varphi \in \mathbf {\Phi }\), then Assumption 2.1(b) holds.
Proof
By Assumption 5.1(a) we have that \(U_\epsilon \) is closed. Since it is contained in the compact set \(U_\epsilon ^{\textbf{X}}\times {\textbf{A}}\) (recall Assumption 5.1(c)), it follows that it is compact as well.
Suppose that an initial state \((x,a)\in {\textbf{X}}\times {\textbf{A}}\) is fixed and consider, for \(\varphi \in \mathbf {\Phi }\), the state-action Markov chain with transition probability \(\overline{Q}_\varphi (\cdot |x,a)\) and its successive compositions \(\overline{Q}^{\,t}_\varphi (\cdot |x,a)\) for any \(t\in {\mathbb {N}}\). Note that \(\overline{Q}^{\,t}_\varphi (\cdot |x,a)\) is in \({\mathcal {P}}({\textbf{X}}\times {\textbf{A}})\subseteq {\mathcal {M}}^f({\textbf{X}}\times {\textbf{A}})\), which we are assuming is endowed with the ws-topology. We are going to show that the mapping from \({\textbf{X}}\times {\textbf{A}}\times \mathbf {\Phi }\) to \({\mathcal {P}}({\textbf{X}}\times {\textbf{A}})\) given by
is continuous for any \(t\in {\mathbb {N}}\), where \(\mathbf {\Phi }\) is endowed with the Young topology. Namely, we are going to show that for any \(t\in {\mathbb {N}}\)
Given any \(f\in Car_b({\textbf{X}}\times {\textbf{A}})\) we have that
because the transition to the state at time 1 follows Q(dv|x, a), then we make \(t-1\) transitions according to the Markov chain \(Q_{\varphi _n}\) on \({\textbf{X}}\) and, finally, the action at time t is chosen according to \(\varphi _n\). Therefore,
Denote by \(h_n\) and h the functions in \(L^\infty ({\textbf{X}},{\mathcal {B}}({\textbf{X}}),\lambda )\) given by
By definition of the Young convergence it is clear that \(h_n\ {{\mathop {\rightharpoonup }\limits ^{*}}\ }h\) and so by Lemma 3.7 in [11], it follows that the expression within brackets in (27) satisfies (as functions in \(L^\infty ({\textbf{X}},{\mathcal {B}}({\textbf{X}}),\lambda )\) in the variable \(v\in {\textbf{X}}\))
Since q(v|x, a) is in \(L^1({\textbf{X}},{\mathcal {B}}({\textbf{X}}),\lambda )\) as a function of v for fixed (x, a), we conclude that the expression in (27) converges to
Regarding (28), the first expression within brackets is bounded and so this term converges to zero by Assumption 5.1(b). Consequently, (26) follows.
Observe that the function \((y,z)\mapsto {\textbf{I}}_{L_\epsilon }(y,z)\) on \({\textbf{X}}\times {\textbf{A}}\) is measurable and, for fixed \(y\in {\textbf{X}}\), it is l.s.c. in \(z\in {\textbf{A}}\) (this is because \(\alpha (y,z)\) is in \(Car_b({\textbf{X}}\times {\textbf{A}})\)). Hence, we have the function
is l.s.c. as a consequence of Lemma 4.1(iii) and so it attains its minimum on the compact space \(U_\epsilon \times \mathbf {\Phi }\). Consequently, if \({\textbf{P}}^{\varphi }_{x,a}\{ (X_{t_0},A_{t_0})\in L_\epsilon \}>0\) for every \((x,a)\in U_\epsilon \) and \(\varphi \in \mathbf {\Phi }\), then
as we wanted to prove. \(\square \)
5.2 Formulation as a Total Cost Markov Decision Process
For some purposes, it might be useful to formulate the MDP with varying discount factor as a total expected cost MDP with an absorbing state. More precisely, the control model described in Sect. 2 is augmented with a cemetery state denoted by \(\Delta \), so that
-
the state space is \({\hat{{\textbf{X}}}}={\textbf{X}}\cup \{\Delta \}\) with \(\Delta \) an isolated point,
-
the action space is \({\hat{{\textbf{A}}}}={\textbf{A}}\cup \{a_\Delta \}\) with \(a_\Delta \) an isolated point. We let \({\hat{{\textbf{A}}}}(x)={\textbf{A}}(x)\) for \(x\in {\textbf{X}}\) and \({\hat{{\textbf{A}}}}(\Delta )=\{a_\delta \}\).
-
the state \(\Delta \) is absorbing and has zero cost, i.e.,
$$\begin{aligned} {\hat{Q}}(\Delta |\Delta ,a_\Delta )=1\quad \hbox {and}\quad {\hat{c}}(\Delta ,a_\Delta )={\hat{d}}_i(\Delta ,a_\Delta )=0, \end{aligned}$$ -
the transition probability for \((x,a)\in {\textbf{X}}\times {\textbf{A}}\) is given by
$$\begin{aligned} {\hat{Q}}(B|x,a)=\alpha (x,a)Q(B|x,a)\quad \hbox {if}\quad B\in {\mathcal {B}}({\textbf{X}})\quad \hbox {and}\quad {\hat{Q}}(\Delta |x,a)=1-\alpha (x,a). \end{aligned}$$In particular, when taking action a in state x, the probability of not being absorbed by \(\Delta \) is \(\alpha (x,a)\).
-
The initial distribution \(\nu \) of the system is supported on \({\textbf{X}}\), i.e., \(\nu ({\textbf{X}})=1\).
Slightly abusing the notation, and since there is a single action at the cemetery state, we will also denote by \({\textbf{F}}\), \(\mathbf {\Phi }\), and \(\mathbf {\Pi }\) the sets of policies of this augmented model. The corresponding probability measure on the set \({\hat{\mathbf {\Omega }}}\) of augmented sample paths and the expectation operator are denoted by \({\hat{{\textbf{P}}}}^\pi _\nu \) and \({\hat{{\textbf{E}}}}^\pi _\nu \). We will use the same notations \(X_t,A_t\) for the state and action processes.
It is easy to see that the total cost MDP is equivalent to the original control model because we have (cf. (4)–(5))
for \(0\le i\le q\). Letting \(T_\Delta \) be the hitting time of \(\Delta \), i.e.,
the above sums can be restricted to the set of indices \(0\le t<T_\Delta \).
It is very important to notice that for any \(\pi \in \mathbf {\Pi }\) and any integer \(t\ge 0\)
so that, the total mass of the occupation measure of \(\pi \) (recall (16)) coincides with the expected absorption time:
According to [15], we will say that the total cost MDP is absorbing when
and uniformly absorbing when it is absorbing and, in addition,
We note that, under our conditions, (30) is not necessarily satisfied: indeed, by Proposition 2.1(i) we have \({\hat{{\textbf{E}}}}^\pi _\nu [T_\Delta ]<\infty \) provided that \(V_0(\pi ,\nu )\) is finite. However, the condition (31) holds and it has been established, in fact, in the proof of Proposition 4.1: see the Eq. (25). For related issues on uniformly absorbing MDPs, the interested reader can also consult [12, 36].
Existence of optimal chattering stationary policies. We say that a stationary policy \(\varphi \in \mathbf {\Phi }\) is m-chattering if, for some integer \(m\ge 1\), there exist measurable functions \(\gamma _1,\ldots ,\gamma _m:{\textbf{X}}\rightarrow [0,1]\) and deterministic stationary policies \(f_1,\ldots ,f_m\in {\textbf{F}}\) such that
This means that the random action taken by \(\varphi \) at state \(x\in {\textbf{X}}\) is a mixture of the actions \(f_i(x)\) for \(i=1,\ldots ,m\), though the coefficients of the mixture may vary with x.
Theorem 5.2
Suppose that Assumptions 2.1 and 2.2 hold and, in addition, we have:
-
(i)
for every \(\pi \in \mathbf {\Pi }\),
$$\begin{aligned} {\textbf{E}}^\pi _\nu \left[ \sum _{t=0}^\infty \prod _{j=0}^{t-1}\alpha (X_j,A_j)\right] <+\infty ; \end{aligned}$$ -
(ii)
the cost functions c and \(d_i\) are bounded.
Under these conditions, there exists a \((q+1)\)-chattering stationary Markovian policy in \(\mathbf {\Phi }\) which is optimal for the constrained problem (CP).
Proof
The condition (i), together with (31), ensures that the total cost MDP is uniformly absorbing. The cost functions c and \(d_i\) being bounded, this ensures that the total cost MDP satisfies all the conditions in [16, Theorem 9.1]. Hence, if \(\varphi \in \mathbf {\Phi }\) is an optimal policy for the constrained problem (CP) then there exist \(q+1\) deterministic stationary policies \(f_0,f_1,\ldots ,f_q\in {\textbf{F}}\) and there exist coefficients \(\beta _i\ge 0\) with \(\beta _0+\cdots +\beta _q=1\) such that the occupation measures \(\mu _\varphi \) and \(\mu _{f_i}\) satisfy
Without loss of generality, we can assume that all the \(\beta _i\) are strictly positive. This implies that \(\mu ^{\textbf{X}}_{f_i}\ll \mu _\varphi ^{\textbf{X}}\). Now we let
Mimicking the arguments in Lemma 5 and Theorem 2 in [14], it follows that the chattering policy obtained from the \(\gamma _i(x)\) and the \(f_i\) satisfies the stated result. \(\square \)
Existence of optimal deterministic stationary policies. In the paper [15] the authors prove under fairly general conditions that in an atomless discounted MDP (with constant discount factor) or in an atomless uniformly absorbing MDP, given a finite number of bounded cost functions, for every policy \(\pi \in \mathbf {\Pi }\) there exists some \(f\in {\textbf{F}}\) such that \(\pi \) and f have the same expected costs for, simultaneously, all the given cost functions. The following result is, therefore, a direct consequence of [15, Theorem 3.8].
Theorem 5.3
Suppose that Assumptions 2.1 and 2.2, together with the conditions (i)–(ii) in Theorem 5.2 hold, and that the control model is atomless, meaning that
for every \((x,a)\in {\mathbb {K}}\) and \(y\in {\textbf{X}}\). Then there exists a deterministic stationary Markovian policy in \({\textbf{F}}\) which is optimal for the constrained problem (CP).
Proof
We have that the total cost MDP is uniformly absorbing. Letting \(\varphi \in \mathbf {\Phi }\) be an optimal policy for the constrained problem (CP), by [15, Theorem 3.8] there exists some \(f\in {\textbf{F}}\) with \(V_i(\varphi ,\nu )=V_i(f,\nu )\) for all \(i=0,\ldots ,q\). Hence, \(f\in {\textbf{F}}\) is also optimal for the constrained problem (CP). \(\square \)
5.3 Example: Queueing Systems with Failures
We consider a queueing system with finite capacity \(M>0\). The controller controls the service rate \(a\in [0,1]\) over an infinite time horizon. Additionally, we allow for a fatal failure of the system which can occur with a certain probability depending on the state of the system and the action chosen: for instance, we may assume that when the system is close to its maximal capacity and a low service rate is chosen, then failure of the system becomes more likely; failure can also occur due to random exogenous causes, or endogenous causes depending on the nature of the system. The goal of the controller is to minimize a total expected running cost subject to the constraint that the expected lifetime of the system is above some constant and to some other similar constraints on some reward functions. Next, we show how this controlled queueing system falls into the framework of a constrained MDP with a non-constant discount factor.
The state space is \({\textbf{X}}=[0,M]\) for some constant \(M>0\) and the action space is \({\textbf{A}}=[0,1]\) with \({\textbf{A}}(x)={\textbf{A}}\) for every \(x\in {\textbf{X}}\). The state of the system stands for the size of the queue (ranging from 0 to the maximal capacity of the system M), and the action space models the service rate (from the absence of service at 0 to a maximal service rate at 1). The transition probability function has a density function q(y|x, a) with respect to the uniform distribution \(\lambda \) on [0, M]. We assume that q(y|x, a) is continuous on \({\textbf{X}}\times {\textbf{X}}\times {\textbf{A}}\). Suppose that the initial distribution of the system \(\nu \) is absolutely continuous with respect to the Lebesgue measure on [0, M].
Let \(\alpha (x,a)\) be the probability that the system does not collapse when taking action \(a\in {\textbf{A}}\) in state \(x\in {\textbf{X}}\). Intuitively, a small value of x combined with a large value of a should yield values of \(\alpha (x,a)\) close to one, whereas large values of x combined with small values of a should lead to values of \(\alpha (x,a)\) close to zero. Obviously, many choices of a functional expression for \(\alpha (x,a)\) are compatible with these requirements. For illustration, we can choose the following
In particular, we have \(\alpha (x,a)=1\) whenever \(a-\frac{x}{M}\ge \frac{1}{3}\), which is the triangle in the upper left side of \({\textbf{X}}\times {\textbf{A}}\) of vertices (0, 1), (0, 1/3), \((\frac{2}{3}M,1)\).
Assume that the cost-per-stage function is a strictly positive function c on \({\textbf{X}}\times {\textbf{A}}\) and that we are also given some continuous reward function r on \({\textbf{X}}\times {\textbf{A}}\). The control problem is to minimize the total expected cost when managing this queueing system, with the constraint that the expected lifetime of the system is above some \(T_0\) and that the total expected reward is above some \(r_0\). Denoting by T the lifetime of the system, the control problem can be written
subject to
which is indeed equivalent to (4)–(5) with \(d_1\equiv -1\) and \(d_2=-r\) (recall the formulation as a total cost MDP given in Sect. 5.2).
Let us now check our conditions on this model. It is clear that Assumption 2.1(a) holds, while (c) is satisfied by letting \(\pi _0\) be the policy that always takes action \(a=0\): indeed, we have \(\alpha (x,0)\le 3/4\) for every \(x\in {\textbf{X}}\) and thus \(E_\nu ^{\pi _0}[T]\le 4\), which implies statement (c). Regarding Assumption 2.1(b), we use Proposition 5.1. Conditions (a)–(b) in Assumption 5.1 are satisfied, while by choosing \(\epsilon >0\) small enough it is easily seen that \(U_\epsilon ^{\textbf{X}}={\textbf{X}}\) and so Assumption 5.1(c) holds (in this particular case, any \(\epsilon <1/4\) satisfies the desired condition). Concerning the condition in Proposition 5.1, it is satisfied for \(t_0=1\) and any \(\epsilon <1/4\) provided that for every \((x,a)\in {\textbf{X}}\times {\textbf{A}}\) there exists some \(2M/3<y\le M\) with \(q(y|x,a)>0\). Under the above conditions, the queueing system with constraints possesses an optimal stationary policy.
This illustrates that a large class of controlled queuing systems with possible failure can be modeled using the results in this paper. Many generalizations are possible. For instance, it is possible to give a strictly positive probability to the maximal capacity M, i.e., \(Q(\{M\}|x,a)>0\) for some x and a. This makes sense because a large demand can result in the system reaching its maximal capacity. This case can be treated just by changing the topology on \({\textbf{X}}\): on [0, M) we would still consider the usual topology, and we would let \(\{M\}\) to be an open set (loosely, this consists in letting M to be an isolated point). A similar treatment could be given to the state 0. We do not give the technical details on these generalizations, but we would like to stress that our framework in this paper is flexible enough to cover many variations of the described queueing system.
6 Concluding Remarks
In this paper we have studied a constrained control problem in Borel action-state spaces when the objective function is a total expected cost with a non-constant discount factor, that takes values in the interval [0, 1]. We have rewritten the original optimization problem as a linear optimization problem defined on a space of occupation measures satisfying certain conditions. Optimal elements have been obtained through the topological properties (i.e., compactness) of the space of occupation measures together with the semicontinuity property of the objective function.
Our main contribution can be divided into two facts: (1) the study of a constrained control problem when the discount factor is not necessarily bounded away from one, and (2) the set of hypotheses we have proposed in the paper are considerably different from those assumed in the literature since most of the constrained problems analyzed in the past assume some kind of inf-compactness on the cost-per-stage. We have not made such an assumption here.
Finally, let us mention that this new line of research could be extended to the theory of dynamic games; in particular, to the so-named hybrid games, which are very appealing models due to their ability to model deep structural modifications of the system. These changes are carried out thanks to the possibility that the discount factor may take the value of one at some stages; for more details, see [29,30,31].
Notes
This means that if \(\varphi ,\varphi '\) satisfy \(\mu =\mu ^{\textbf{X}}\otimes \varphi =\mu ^{\textbf{X}}\otimes \varphi '\) then \(\varphi (\cdot |x)=\varphi '(\cdot |x)\) for \(\mu ^{\textbf{X}}\)-almost every \(x\in {\textbf{X}}\).
For further references on this definition, see [28, Assumption 2.2(a)].
References
Aliprantis, C.D., Border, K.C.: Infinite Dimensional Analysis: A Hitchhiker’s Guide, 3rd edn. Springer, Berlin (2006)
Altman, E.: Constrained Markov decision processes with total cost criteria: occupation measures and primal LP. Math. Methods Oper. Res. 43, 45–72 (1996)
Alvarez-Mena, J., Hernández-Lerma, O.: Convergence of the optimal values of constrained Markov control processes. Math. Methods Oper. Res. 55, 461–484 (2002)
Balder, E.J.: Lectures on Young Measure Theory and Its Applications in Economics. Rijksuniversiteit Utrecht. Mathematisch Instituut. Utrecht University, Utrecht (1998)
Balder, E.J.: On Cournot–Nash equilibrium distributions for games with differential information and discontinuous payoffs. Econ. Theory 1, 339–354 (1991)
Balder, E.J.: On ws-convergence of product measures. Math. Oper. Res. 26, 494–518 (2001)
Bertsekas, D., Shreve, S.E.: Stochastic Optimal Control: The Discrete-Time Case, vol. 5. Academic Press, New York (1978)
Carmon, Y., Shwartz, A.: Markov decision processes with exponentially representable discounting. Oper. Res. Lett. 37, 51–55 (2009)
Dufour, F., Prieto-Rumeau, T.: Finite linear programming approximations of constrained discounted Markov decision processes. SIAM J. Control Optim. 51, 1298–1324 (2013)
Dufour, F., Prieto-Rumeau, T.: Stochastic approximations of constrained discounted Markov decision processes. J. Math. Anal. Appl. 413, 856–879 (2014)
Dufour, F., Prieto-Rumeau, T.: Stationary Markov Nash equilibria for nonzero-sum constrained ARAT Markov games. SIAM J. Control. Optim. 60, 945–967 (2022)
Dufour, F., Prieto-Rumeau, T.: Absorbing Markov decision processes. ESAIM Control Optim. Calc. Var. 30, 5 (2024)
Dynkin, E.B., Yushkevich, A.A.: Controlled Markov Processes. Springer, New York (1979)
Feinberg, E.A., Jaśkiewicz, A., Nowak, A.S.: Constrained discounted Markov decision processes with Borel state spaces. Automatica 111, 105582 (2020)
Feinberg, E.A., Piunovskiy, A.: Sufficiency of deterministic policies for atomless discounted and uniformly absorbing MDPs with multiple criteria. SIAM J. Control Optim. 57, 163–191 (2019)
Feinberg, E.A., Rothblum, U.G.: Splitting randomized stationary policies in total-reward Markov decision processes. Math. Oper. Res. 37, 129–153 (2012)
Feinberg, E.A., Shwartz, A.: Markov decision models with weighted discounted criteria. Math. Oper. Res. 19, 152–168 (1994)
Florescu, L.C., Godet-Thobie, C.: Young Measures and Compactness in Measure Spaces. De Gruyter, Berlin (2012)
Hernández-Hernández, D., Hernández-Lerma, O.: Discounted cost Markov decision processes on Borel spaces: the linear programming formulation. J. Math. Anal. Appl. 183, 335–351 (1994)
González-Hernández, J., Hernández-Lerma, O.: Constrained Markov control processes in Borel spaces: the discounted case. Math. Methods Oper. Res. 52, 271–285 (2000)
González-Hernández, J., López-Martínez, R.R., Minjárez-Sosa, J.A., Gabriel-Arguelles, J.R.: Constrained Markov control processes with randomized discounted cost criteria: occupation measures and extremal points. Risk Decis. Anal. 4, 163–176 (2013)
González-Hernández, J., López-Martínez, R.R., Minjárez-Sosa, J.A., Gabriel-Arguelles, J.R.: Constrained Markov control processes with randomized discounted cost criteria: infinite linear programming approach. Optim. Control Appl. Methods 35, 575–591 (2014)
Guo, X.: Constrained nonhomogeneous Markov decision processes with expected total reward criterion. Acta Appl. Math. Sin. (Engl. Ser.) 23, 230–235 (2000)
Hernández-Lerma, O., González-Hernández, J., López-Martínez, R.R.: Constrained average cost Markov control processes in Borel spaces. SIAM J. Control. Optim. 42, 442–468 (2003)
Hernández-Lerma, O., Lasserre, J.B.: Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, New York (1996)
Hernández-Lerma, O., Lasserre, J.B.: Further Topics on Discrete-Time Markov Control Processes. Springer, New York (1999)
Ilhuicatzi-Roldán, R., Cruz-Suárez, H., Chávez-Rodríguez, S.: Markov decision processes with time-varying discount factors and random horizon. Kybernetika 53, 82–98 (2017)
Jasso-Fuentes, H., López-Martinez, R.R., Minjárez-Sosa, J.A.: Some advances on constrained Markov decision processes in Borel spaces with random state-dependent discount factors. Optimization 73, 925–951 (2024)
Jasso-Fuentes, H., Menaldi, J.L., Prieto-Rumeau, T.: Discrete-time control with non-constant discount factor. Math. Methods Oper. Res. 92, 377–399 (2020)
Jasso-Fuentes, H., Menaldi, J.L., Prieto-Rumeau, T.: Discrete-time hybrid control in Borel spaces. Appl. Math. Optim. 81, 409–441 (2020)
Jasso-Fuentes, H., Menaldi, J.L., Prieto-Rumeau, T.: Recent results on discrete-time hybrid control models with general state and action spaces. Pure Appl. Funct. Anal. (in press) (2024)
Jasso-Fuentes, H., Salgado-Suárez G.D.: Discrete-time hybrid control processes with unbounded costs. Submitted (2024)
Mendoza-Pérez, A.F., Jasso-Fuentes, H., De la Cruz-Courtois, O.A.: Constrained Markov decision processes in Borel spaces: from discounted to average optimality. Math. Methods Oper. Res. 84, 489–525 (2016)
Minjárez-Sosa, J.A.: Markov control models with unknown random state-action-dependent discount factors. TOP 23, 743–772 (2015)
Piunovskiy, A.: Optimal Control of Random Sequences in Problems With Constraints. Springer, Dordrecht (2012)
Piunovskiy, A., Zhang, Y.: On the continuity of the projection mapping from strategic measures to occupation measures in absorbing Markov decision processes. arXiv:2311.14043 (2023)
Schäl, M.: On dynamic programming: compactness of the space of policies. Stoch. Proc. Appl. 3, 345–364 (1975)
Wei, Q., Guo, X.: Markov decision processes with state-dependent discount factors and unbounded rewards/costs. Oper. Res. Lett. 39, 369–374 (2011)
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. Research partially supported by CONACyT grant Ciencia de Frontera No. 87787 and by Grant No. PID2021-122442NB-I00 from the Spanish Ministerio de Ciencia e Innovación.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by Benoit Chachuat.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Jasso-Fuentes, H., Prieto-Rumeau, T. Constrained Markov Decision Processes with Non-constant Discount Factor. J Optim Theory Appl 202, 897–931 (2024). https://doi.org/10.1007/s10957-024-02453-y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10957-024-02453-y