To address convergence issues, we assume that the input to point-based DP update is uniformly improvable and require its output to be also uniformly improvable. We will explain later how the assumption can be facilitated and how the requirement guarantees convergence of the modified value iteration algorithm. In this subsection, we discuss how the requirement can be fulfilled.
Point-based DP update constructs new vectors by backing
up on belief points and the new vectors are all members of .
Hence the output of point-based DP update
is trivially dominated by
.
If the output also dominates
, then it must be uniformly
improvable by Lemma 2. The question is how to guarantee that the
output dominates
.
Consider the set resulted from
backUpWitnessPoints.
If it does not dominate
, then there must exist a belief
state b such
. Consequently, there must
exist a vector
in
such that
.
This gives us the following subroutine for testing
whether
dominates
and for, when this is not the case,
adding vectors to
so that it does.
The subroutine is called backUpLPPoints
because belief points are found by solving linear programs.
:
1. for each
![]()
2. do {
3.
.
4. if
,
5.
.
6.
.
7.
.
8. } while (
).
The subroutine examines vectors in one by one.
For each
in
, it calls another subroutine
dominanceCheck to try to find a belief point b such that
. If such a point is found, it backs up on it,
resulting in a new vector
(line 5).
By the property of the backup operator,
b is a witness point of
w.r.t
(line 6).
There cannot be any vector in
that
equals
.
Consequently, the vector is simply added to
without checking for duplicates
(line 7).
The process repeats for
until
dominanceCheck returns NULL, that is when
there are no belief
points b such that
.
When backUpLPPoints terminates, we have
for
any vector
in
and any belief point b. Hence
dominates
.
The subroutine first checks
whether there exists a vector
in
that
pointwise dominates
, that is
for all states s. If such an
exists, it returns
NULL right away. Otherwise, it solves the
following linear program
.
It returns the solution point b when the optimal
value of the objective
function is positive and returns NULL otherwise:
LP:
1. Variables: x, b(s) for each state s
2. Maximize: x.
3. Constraints:
4.
for all
![]()
5.
,
for all states s.