A Tutorial on Stochastic Programming

Alexander Shapiro1 and Andy Philpott2

Contents

1  Introduction
2  Two-stage stochastic programming
    2.1  Scenario construction
    2.2  Monte Carlo techniques
    2.3  Evaluating candidate solutions
3  Multi-stage stochastic programming
4  Risk averse optimization
5  Notes
6  Appendix
    6.1  Derivation of (1.5)
    6.2  Derivation of best convex approximation to chance constraints.

1  Introduction

This tutorial is aimed at introducing some basic ideas of stochastic programming. The intended audience of the tutorial is optimization practitioners and researchers who wish to acquaint themselves with the fundamental issues that arise when modeling optimization problems as stochastic programs. The emphasis of the paper is on motivation and intuition rather than technical completeness (although we could not avoid giving some technical details). Since it is not intended to be a historical overview of the subject, relevant references are given in the "Notes" section at the end of the paper, rather than in the text.
Stochastic programming is an approach for modeling optimization problems that involve uncertainty. Whereas deterministic optimization problems are formulated with known parameters, real world problems almost invariably include parameters which are unknown at the time a decision should be made. When the parameters are uncertain, but assumed to lie in some given set of possible values, one might seek a solution that is feasible for all possible parameter choices and optimizes a given objective function. Such an approach might make sense for example when designing a least-weight bridge with steel having a tensile strength that is known only to within some tolerance. Stochastic programming models are similar in style but try to take advantage of the fact that probability distributions governing the data are known or can be estimated. Often these models apply to settings in which decisions are made repeatedly in essentially the same circumstances, and the objective is to come up with a decision that will perform well on average. An example would be designing truck routes for daily milk delivery to customers with random demand. Here probability distributions (e.g., of demand) could be estimated from data that have been collected over time. The goal is to find some policy that is feasible for all (or almost all) the possible parameter realizations and optimizes the expectation of some function of the decisions and the random variables.
Stochastic programming can also be applied in a setting in which a one-off decision must be made. Here an example would be the construction of an investment portfolio to maximize return. Like the milk delivery example, probability distributions of the returns on the financial instruments being considered are assumed to be known, but in the absence of data from future periods, these distributions will have to be elicited from some accompanying model, which in its simplest form might derive solely from the prior beliefs of the decision maker. Another complication in this setting is the choice of objective function: maximizing expected return becomes less justifiable when the decision is to be made once only, and the decision-maker's attitude to risk then becomes important.
The most widely applied and studied stochastic programming models are two-stage (linear) programs. Here the decision maker takes some action in the first stage, after which a random event occurs affecting the outcome of the first-stage decision. A recourse decision can then be made in the second stage that compensates for any bad effects that might have been experienced as a result of the first-stage decision. The optimal policy from such a model is a single first-stage decision and a collection of recourse decisions (a decision rule) defining which second-stage action should be taken in response to each random outcome. As an introductory example we discuss below the classical inventory model.
Example 1 [Inventory model] Suppose that a company has to decide an order quantity x of a certain product to satisfy demand d. The cost of ordering is c > 0 per unit. If the demand d is bigger than x, then a back order penalty of b ³ 0 per unit is incurred. The cost of this is equal to b(d-x) if d > x , and is zero otherwise. On the other hand if d < x , then a holding cost of h(x-d) ³ 0 is incurred. The total cost is then
G(x,d)=cx+b [d-x]+ +h[x-d]+,
(1.1)
where [a]+ denotes the maximum of a and 0. We assume that b > c, i.e., the back order cost is bigger than the ordering cost. We will treat x and d as continuous (real valued) variables rather than integers. This will simplify the presentation and makes sense in various situations.
The objective is to minimize the total cost G(x,d). Here x is the decision variable and the demand d is a parameter. Therefore, if the demand is known, the corresponding optimization problem can be formulated in the form


min
x ³ 0 
G(x,d).
(1.2)
The nonnegativity constraint x ³ 0 can be removed if a back order policy is allowed. The objective function G(x,d) can be rewritten as
G(x,d)= max

ì
í
î
(c-b)x+bd,(c+h)x-hd ü
ý
þ
,
(1.3)
which is piecewise linear with a minimum attained at ¾x=d. That is, if the demand d is known, then (no surprises) the best decision is to order exactly the demand quantity d.
For a numerical instance suppose c= 1, b= 1.5, and h= 0.1. Then
G(x,d)= ì
ï
í
ï
î
-0.5x+1.5d,

if x < d
1.1x-0.1d,

if x ³ d.

Let d=50. Then G(x,50) is the pointwise maximum of the linear functions plotted in Figure 1.
fig1.png
Figure 1: Plot of G(x,50). Its minimum is at ¾x=50
Consider now the case when the ordering decision should be made before a realization of the demand becomes known. One possible way to proceed in such situation is to view the demand D as a random variable (denoted here by capital D in order to emphasize that it is now viewed as a random variable and to distinguish it from its particular realization d). We assume, further, that the probability distribution of D is known. This makes sense in situations where the ordering procedure repeats itself and the distribution of D can be estimated, say, from historical data. Then it makes sense to talk about the expected value, denoted E[G(x,D)], of the total cost and to write the corresponding optimization problem


min
x ³ 0 
E[G(x,D)].
(1.4)
The above formulation approaches the problem by optimizing (minimizing) the total cost on average. What would be a possible justification of such approach? If the process repeats itself then, by the Law of Large Numbers, for a given (fixed) x, the average of the total cost, over many repetitions, will converge with probability one to the expectation E[G(x,D)]. Indeed, in that case a solution of problem (1.4) will be optimal on average.
The above problem gives a simple example of a recourse action. At the first stage, before a realization of the demand D is known, one has to make a decision about ordering quantity x. At the second stage after demand D becomes known, it may happen that d > x. In that case the company can meet demand by taking the recourse action of ordering the required quantity d-x at a penalty cost of b > c.
The next question is how to solve the optimization problem (1.4). In the present case problem (1.4) can be solved in a closed form. Consider the cumulative distribution function (cdf) F(z):=Prob(D £ z) of the random variable D. Note that F(z)=0 for any z < 0. This is because the demand cannot be negative. It is possible to show (see the Appendix) that
E[G(x,D)]=b E[D]+(c-b)x+(b+h) ó
õ
x

0 
F(z)dz.
(1.5)
Therefore, by taking the derivative, with respect to x, of the right hand side of (1.5) and equating it to zero we obtain that optimal solutions of problem (1.4) are defined by the equation (b+h)F(x)+c-b=0, and hence an optimal solution of problem (1.4) is given by the quantile 3

-
x
 
=F-1( k) ,
(1.6)
where k:=[(b-c)/(b+h)].
Suppose for the moment that the random variable D has a finitely supported distribution, i.e., it takes values d1,...,dK (called scenarios) with respective probabilities p1,...,pK. In that case its cdf F(·) is a step function with jumps of size pk at each dk, k=1,...,K. Formula (1.6), for an optimal solution, still holds with the corresponding k-quantile, coinciding with one of the points dk, k=1,...,K. For example, the considered scenarios may represent historical data collected over a period of time. In such case the corresponding cdf is viewed as the empirical cdf giving an approximation (estimation) of the true cdf, and the associated k-quantile is viewed as the sample estimate of the k-quantile associated with the true distribution. It is instructive to compare the quantile solution ¾x with a solution corresponding to one scenario d=¾d, where ¾d is, say, the mean (expected value) of D. As it was mentioned earlier, the optimal solution of such (deterministic) problem is ¾d. The mean ¾d can be very different from the k-quantile ¾x. It is also worthwhile mentioning that typically sample quantiles are much less sensitive than the sample mean to random perturbations of the empirical data.
In most real settings, closed-form solutions for stochastic programming problems such as (1.4) are rarely available. In the case of finitely many scenarios it is possible to model the stochastic program as a deterministic optimization problem, by writing the expected value E[G(x,D)] as the weighted sum:
E[G(x,D)]= K
å
k=1 
pkG(x,dk).
(1.7)
The deterministic formulation (1.2) corresponds to one scenario d taken with probability one. By using the representation (1.3), we can write problem (1.2) as the linear programming problem



min
x,t 

t

s.t.
t ³ (c-b)x+bd,


t ³ (c+h)x-hd,


x ³ 0.


(1.8)
Indeed, for fixed x, the optimal value of (1.8) is equal to max{(c-b)x+bd,(c+h)x-hd}, which is equal to G(x,d). Similarly, the expected value problem (1.4), with scenarios d1,...,dK, can be written as the linear programming problem:



min
x,t1,...,tK 


K
å
k=1 
pktk

s.t.
tk ³ (c-b)x+bdk,  k=1,...,K,


tk ³ (c+h)x-hdk,  k=1,...,K,


x ³ 0.


(1.9)
The tractability of linear programming problems makes approximation by scenarios an attractive approach for attacking problems like (1.4). In the next section we will investigate the convergence of the solution of such a scenario approximation as K becomes large. It is also worth noting here the "almost separable" structure of problem (1.9). For fixed x, problem (1.9) separates into the sum of optimal values of problems of the form (1.8) with d=dk. Such decomposable structure is typical for two-stage linear stochastic programming problems.
We digress briefly here to compare the exact solution to (1.4) with the scenario solution for the numerical values c= 1.0, b= 1.5, and h= 0.1. Suppose that D has a uniform distribution on the interval [0,100]. Then for any x Î [0,100],

E[G(x,D)]
=
b E[D]+(c-b)x+(b+h) ó
õ
x

0 
F(z)dz



=
75-0.5x+0.008x2

as shown in Figure 2.
fig2.png
Figure 2: Plot of E[G(x,D)]. Its minimum is at ¾x=31.25
Suppose the demand is approximated by a finitely supported distribution with two equally likely scenarios d1=20, d2=80. Then
E[G(x,D)]= 1

2

2
å
k=1 
G(x,dk),
which is shown plotted in Figure 3 in comparison with the plot for uniformly distributed D.
fig3.png
Figure 3: Scenario approximations of E[G(x,D)].
Figure 3 also shows the same construction with three scenarios d1=20, d2=50, d3=80, with respective probabilities [2/5],[1/5],[2/5]. This illustrates how more scenarios in general yield a better approximation to the objective function (although in this case this has not made the approximate solution any better). The plot for three scenarios also illustrates the fact that if the scenarios are based on conditional expectations of D over respective intervals [0,40], (40,60], and (60,100] (with corresponding probabilities) and G(x,D) is convex in D, then the approximation gives a lower bounding approximation to E[G(x,D)] by virtue of Jensen's inequality. [¯]
This example raises several questions. First, how should we approximate the random variable by one with a finitely-supported probability distribution? Second, how should we solve the resulting (approximate) optimization problem? Third, how can we gauge the quality of the approximate solution? In the next section we will discuss these issues in the context of two-stage linear stochastic programming problems. One natural generalization of the two-stage model, which we discuss in section 3, extends it to many stages. Here each stage consists of a decision followed by a set of observations of the uncertain parameters which are gradually revealed over time. In this context stochastic programming is closely related to decision analysis, optimization of discrete event simulations, stochastic control theory, Markov decision processes, and dynamic programming. Another class of stochastic programming models seeks to safeguard the solution obtained against very bad outcomes. Some classical approaches to deal with this are mean-variance optimization, and so-called chance constraints. We briefly discuss these in section 4, and explore the relationship with their modern counterparts, coherent risk measures and robust optimization.

2  Two-stage stochastic programming

In this section we discuss the two-stage stochastic programming approach to optimization under uncertainty. The basic idea of two-stage stochastic programming is that (optimal) decisions should be based on data available at the time the decisions are made and should not depend on future observations. The classical two-stage linear stochastic programming problems can be formulated as


min
x Î X 

ì
í
î
g(x):=cTx+E[Q(x,x)] ü
ý
þ
,
(2.1)
where Q(x,x) is the optimal value of the second-stage problem


min
y 
qTy    subject  to    Tx+Wy £ h.
(2.2)
Here x Î Rn is the first-stage decision vector, X is a polyhedral set, defined by a finite number of linear constraints, y Î Rm is the second-stage decision vector and x = (q,T,W,h) contains the data of the second-stage problem. In this formulation, at the first stage we have to make a "here-and-now" decision x before the realization of the uncertain data x, viewed as a random vector, is known. At the second stage, after a realization of x becomes available, we optimize our behavior by solving an appropriate optimization problem.
At the first stage we optimize (minimize) the cost cTx of the first-stage decision plus the expected cost of the (optimal) second-stage decision. We can view the second-stage problem simply as an optimization problem which describes our supposedly optimal behavior when the uncertain data is revealed, or we can consider its solution as a recourse action where the term Wy compensates for a possible inconsistency of the system Tx £ h and qTy is the cost of this recourse action.
The considered two-stage problem is linear since the objective functions and the constraints are linear. Conceptually this is not essential and one can consider more general two-stage stochastic programs. For example, if the first-stage problem is integer (combinatorial), its feasible set X could be discrete (finite).
Let us take a closer look at the above two-stage problem. Its formulation involves the assumption that the second-stage data4 x can be modelled as a random (not just uncertain) vector with a known probability distribution. As mentioned in the inventory example this would be justified in situations where the problem is solved repeatedly under random conditions which do not significantly change over the considered period of time. In such situations one may reliably estimate the required probability distribution and the optimization on average could be justified by the Law of Large Numbers.
The other basic question is whether the formulated problem can be solved numerically. In that respect the standard approach is to assume that random vector x has a finite number of possible realizations, called scenarios, say x1,...,xK, with respective (positive) probabilities p1,...,pK. Then the expectation can be written as the summation
E[Q(x,x)]= K
å
k=1 
pkQ(x,xk),
(2.3)
and, moreover, the two-stage problem (2.1)-(2.2) can be formulated as one large linear programming problem:



min
x,y1,...,yK 

cTx+ K
å
k=1 
qkTyk


s.t.
x Î X,  Tkx+Wkyk £ hk,  k=1,...,K.



(2.4)
In the above formulation (2.4) we make one copy yk, of the second stage decision vector, for every scenario xk=(qk,Tk,Wk,hk) . By solving (2.4) we obtain an optimal solution ¾x of the first-stage problem and optimal solutions ¾yk of the second-stage problem for each scenario xk, k=1,...,K. Given ¾x, each ¾yk gives an optimal second-stage decision corresponding to a realization x = xk of the respective scenario.
Example 2 [Inventory model] Recall the inventory model with K scenarios. This can be written as



min
x,t1,...,tK 

0x + K
å
k=1 
pktk

s.t.
(c-b)x - tk £ -bdk,  k=1,...,K,


(c+h)x - tk £ hdk,  k=1,...,K,


x ³ 0.


(2.5)
The first-stage decision is x, and the second-stage decisions are tk, k=1,...,K.
[¯]
When x has an infinite (or very large) number of possible realizations the standard approach is then to represent this distribution by scenarios. As mentioned above this approach raises three questions, namely:
  1. how to construct scenarios;
  2. how to solve the linear programming problem (2.4);
  3. how to measure quality of the obtained solutions with respect to the `true' optimum.
The answers to these questions are of course not independent: for example, the number of scenarios constructed will affect the tractability of (2.4). We now proceed to address the first and third of these questions in the next subsections. The second question is not given much attention here, but a brief discussion of it is given in the Notes section at the end of the paper.

2.1  Scenario construction

In practice it might be possible to construct scenarios by eliciting experts' opinions on the future. We would like the number of constructed scenarios to be relatively modest so that the obtained linear programming problem can be solved with reasonable computational effort. It is often claimed that a solution that is optimal using only a few scenarios provides more adaptable plans than one that assumes a single scenario only. In some cases such a claim could be verified by a simulation. In theory, we would want some measure of guarantee that an obtained solution solves the original problem with reasonable accuracy. We also would like to point out that typically in applications only the first stage optimal solution ¾x has a practical value since almost certainly a "true" realization of the random (uncertain) data will be different from the set of constructed (generated) scenarios.
A natural question for a modeler to ask is: "what is a reasonable number of scenarios to choose so that an obtained solution will be close to the `true' optimum of the original problem". First, it is clear that this question supposes the existence of some good methodology to construct scenarios - a poor procedure may never get close to the optimum, no matter how many are used. If one seeks to construct scenarios, say by using experts, then it quickly becomes clear that, as the number of random parameters increases, some automated procedure to do this will become necessary. For example, suppose that the components of the random vector x Î  Rd are independent of each other and we construct scenarios by assigning to each component three possible values, by classifying future realizations as low, medium and high. Then the total number of scenarios is K=3d. Such exponential growth of the number of scenarios makes model development using expert opinion very difficult already for reasonably sized d (say d=25).
Even assuming an automated scenario generation process, with K=325 » 1012, say, the corresponding linear program (2.4) becomes too large to be tractable. The situation becomes even worse if we model xi as random variables with continuous distributions. To make a reasonable discretization of a (one dimensional) random variable xi we would need considerably more than 3 points. At this point the modeler is faced with two somewhat contradictory goals. On one hand, the number of employed scenarios should be computationally manageable. Note that reducing the number of scenarios by a factor of, say, 10 will not help much if the true number of scenarios is K=1012, say. On the other hand, the constructed approximation should provide a reasonably accurate solution of the true problem. A possible approach to reconcile these two contradictory goals is by randomization. That is, scenarios could be generated by Monte-Carlo sampling techniques.
Before we discuss the latter approach, it is important to mention boundedness and feasibility. The function Q(x,x) is defined as the optimal value of the second-stage problem (2.2). It may happen that for some feasible x Î X and a scenario xk, the problem (2.2) is unbounded from below, i.e., Q(x,xk)=-¥. This is a somewhat pathological and unrealistic situation meaning that for such feasible x one can improve with a positive probability the second-stage cost indefinitely. One should make sure at the modeling stage that this does not happen.
Another, more serious, problem is if for some x Î X and scenario xk, the system Tkx+Wky £ hk is incompatible, i.e., the second-stage problem is infeasible. In that case the standard practice is to set Q(x,xk)=+¥. That is, we impose an infinite penalty if the second-stage problem is infeasible and such x cannot be an optimal solution of the first-stage problem. This will make sense if such a scenario results in a catastrophic event. It is said that the problem has relatively complete recourse if such infeasibility does not happen, i.e., for every x Î X and every scenario xk, the second-stage problem is feasible. It is always possible to make the second-stage problem feasible (i.e., exhibit complete recourse) by changing it to


min
y,t 
qTy+gt    subject  to    Tx+Wy-te £ h,  t ³ 0,
(2.6)
where g > 0 is a chosen constant and e is a vector of an appropriate dimension with all its coordinates equal to one. In this formulation the penalty for a possible infeasibility is controlled by the parameter g.

2.2  Monte Carlo techniques

We now turn our attention to approaches that reduce the scenario set to a manageable size by using Monte Carlo simulation. That is, suppose that the total number of scenarios is very large or even infinite. Suppose, further, that we can generate a sample x1,...,xN of N replications of the random vector x. By this we mean that each xj, j=1,...,N, has the same probability distribution as x. Moreover, if xj are distributed independently of each other, it is said that the sample is iid (independent identically distributed). Given a sample, we can approximate the expectation function q(x)=E[Q(x,x)] by the average

^
q
 

N 
(x)=N-1 N
å
j=1 
Q(x,xj),
and consequently the "true" (expectation) problem (2.1) by the problem


min
x Î X 

ì
í
î
^
g
 

N 
(x):=cTx+ 1

N

N
å
j=1 
Q(x,xj) ü
ý
þ
.
(2.7)
This rather simple idea was suggested by different authors under different names. In recent literature it has became known as the sample average approximation (SAA) method. Note that for a generated sample x1,...,xN, the SAA problem (2.7) is of the same form as the two-stage problem (2.1)-(2.2) with the scenarios xj, j=1,...,N, each taken with the same probability pj=1/N. Note also that the SAA method is not an algorithm, one still has to solve the obtained SAA problem by an appropriate numerical procedure.
Example 3 [Inventory model continued] We can illustrate the SAA method on the instance of the inventory example discussed in section 1. Recall that
G(x,D)= ì
ï
í
ï
î
-0.5x+1.5D,

if x < D
1.1x-0.1D,

if x ³ D,

where D has a uniform distribution on [0,100]. In Figure 4, we show SAA approximations for three random samples, two with N=5 ( xj = 15, 60, 72, 78, 82, and xj= 24, 24, 32, 41, 62), and one with N=10 (xj= 8, 10, 21, 47, 62, 63, 75, 78, 79, 83). These samples were randomly generated from the uniform [0,100] distribution and rounded to integers for the sake of simplicity of presentation. The true cost function is shown in bold.
NewFig5.png
Figure 4: SAA approximations of E[G(x,D)].
[¯]
We now can return to the question above: "how large should we choose the sample size N (i.e., how many scenarios should be generated) in order for the SAA problem (2.7) to give a reasonably accurate approximation of the true problem (2.1)". Monte Carlo simulation methods are notorious for slow convergence. For a fixed x Î X and an iid sample, the variance of the sample average ÙqN(x) is equal to Var[Q(x,x)]/N . This implies that, in stochastic terms, the sample average converges to the corresponding expectation at a rate of Op(N-1/2). That is, in order to improve the accuracy by one digit, we need to increase the sample size by the factor of 100.
It should be clearly understood that it is not advantageous to use Monte Carlo techniques when the number d of random variables is small. For d £ 2 it would be much better to use a uniform discretization grid for numerical calculations of the required integrals. This is illustrated by the example above. For d £ 10 quasi-Monte Carlo methods usually work better than Monte Carlo techniques5. The point is, however, that in principle it is not possible to evaluate the required expectation with a high accuracy by Monte Carlo (or any other) techniques in multivariate cases. It is also possible to show theoretically that the numerical complexity of two-stage linear programming problems grows fast with an increase of the number of random variables and such problems cannot be solved with a high accuracy, like say 10-4, as we expect in deterministic optimization.
The good news about Monte Carlo techniques is that the accuracy of the sample average approximation does not depend on the number of scenarios (which can be infinite), but depends only on the variance of Q(x,x). It is remarkable and somewhat surprising that the SAA method turns out to be quite efficient for solving some classes of two-stage stochastic programming problems. It was shown theoretically and verified in numerical experiments that with a manageable sample size, the SAA method (and its variants) solves the "true" problem with a reasonable accuracy (say 1% or 2%) provided the following conditions are met:

    (i) it is possible to generate a sample of realizations of the random vector x,
    (ii) for moderate values of the sample size it is possible to efficiently solve the obtained SAA problem,
    (iii) the true problem has relatively complete recourse,
    (iv) variability of the second-stage (optimal value) function is not "too large".
Usually the requirement (i) does not pose a serious problem in stochastic programming applications, and there are standard techniques and software for generating random samples. Also modern computers coupled with good algorithms can solve linear programming problems with millions of variables (see the Notes section at the end of the tutorial).
Condition (iii), on the other hand, requires a few words of discussion. If for some feasible x Î X the second-stage problem is infeasible with a positive probability (it does not matter how small this positive probability is) then the expected value of Q(x,x), and hence its variability, is infinite. If the probability p of such an event (scenario) is very small, say p=10-6, then it is very difficult, or may even be impossible, to verify this by sampling. For example, one would need an iid sample of size N significantly bigger than p-1 to reproduce such a scenario in the sample with a reasonable probability.
The sample average approximation problems that arise from the above process are large-scale programming problems. Since our purpose in this tutorial is to introduce stochastic programming models and approaches, we point to the literature for a further reading on that subject in the "Notes" section.

2.3  Evaluating candidate solutions

Given a feasible point Ùx Î X, possibly obtained by solving a SAA problem, a practical problem is how to evaluate the quality of this point viewed as a candidate for solving the true problem. Since the point Ùx is feasible, we clearly have that g(Ùx) ³ v*, where v*=minx Î Xg(x) is the optimal value of the true problem. The quality of Ùx can be measured by the optimality gap
gap(
^
x
 
):=g(
^
x
 
)-v*.
We outline now a statistical procedure for estimating this optimality gap. The true value g(Ùx) can be estimated by Monte Carlo sampling. That is, an iid random sample xj, j=1,...,N¢, of x is generated and g(Ùx) is estimated by the corresponding sample average ÙgN¢(Ùx)=cTÙx+ÙqN¢(Ùx). At the same time the sample variance

^
s
 
2
N¢ 
(
^
x
 
):= 1

N¢(N¢-1)

N¢
å
j=1 

é
ë
Q(
^
x
 
,xj)-
^
q
 

N¢ 
(
^
x
 
) ù
û
2
 
,
(2.8)
of ÙqN¢(Ùx), is calculated. Note that it is feasible here to use a relatively large sample size N¢ since calculation of the required values Q(Ùx,xj) involves solving only individual second-stage problems. Then
UN¢(
^
x
 
):=
^
g
 

N¢ 
(
^
x
 
)+za
^
s
 

N¢ 
(
^
x
 
)
(2.9)
gives an approximate 100(1-a)% confidence upper bound for g(Ùx). This bound is justified by the Central Limit Theorem with the critical value za=F-1(1-a), where F(z) is the cdf of the standard normal distribution. For example, for a = 5% we have that za » 1.64, and for a = 1% we have that za » 2.33.
In order to calculate a lower bound for v* we proceed as follows. Denote by ÙvN the optimal value of the SAA problem based on a sample of size N. Note that ÙvN is a function of the (random) sample and hence is random. To obtain a lower bound for v* observe that E[ÙgN(x)]=g(x), i.e., the sample average ÙgN(x) is an unbiased estimator of the expectation g(x). We also have that for any x Î X the inequality ÙgN(x) ³ infx¢ Î XÙgN(x¢) holds, so for any x Î X, we have
g(x)= E é
ë
^
g
 

N 
(x) ù
û
³ E é
ë

inf
x¢ Î X 

^
g
 

N 
(x¢) ù
û
=E é
ë
^
v
 

N 
ù
û
.
By taking the minimum over x Î X of the left hand side of the above inequality we obtain that v* ³ E[ÙvN].
We can estimate E[ÙvN] by solving the SAA problems several times and averaging the calculated optimal values. That is, SAA problems based on independently generated samples, each of size N, are solved to optimality M times. Let ÙvN1,...,ÙvNM be the computed optimal values of these SAA problems. Then

-
v
 

N,M 
:= 1

M

M
å
j=1 

^
v
 
j
N 

(2.10)
is an unbiased estimator of E[ÙvN]. Since the samples, and hence ÙvN1,...,ÙvNM, are independent, we can estimate the variance of ¾vN,M by

^
s
 
2
N,M 
:= 1

M(M-1)

M
å
j=1 

æ
è
^
v
 
j
N 
-
-
v
 

N,M 
ö
ø
2
 
.
(2.11)
An approximate 100(1-a)% lower bound for E[ÙvN] is then given by
LN,M:=
-
v
 

N,M 
-ta,n
^
s
 

N,M 
,
(2.12)
where n = M-1 and ta,n is the a-critical value of the t-distribution with n degrees of freedom. In applications it often suffices to use small values of M, say M=5 or M=10. For a small number of degrees of freedom, the critical value ta,n is slightly bigger than the corresponding standard normal critical value za, and ta,n quickly approaches za as n increases. Since v* ³ E[ÙvN], we have that LN,M gives a valid lower statistical bound for v* as well. Consequently

^
gap
 
(
^
x
 
):=UN¢(
^
x
 
)-LN,M
(2.13)
gives a statistically valid (with confidence at least 1-2a) bound on the true gap(Ùx). It can be noted that the lower bound LN,M is somewhat conservative since the bias v*-E[ÙvN] of the SAA estimator of the true optimal value could be significant for not "too large" values of the sample size N.
Example 4 [Inventory model continued] We can illustrate this procedure on the instance of the inventory example (example 1) discussed in section 1. The following derivations are given for illustration purposes only. As discussed above, in the case of one dimensional random data it is much better to use a uniform discretization grid on the interval [0,100] rather than a uniformly distributed random sample. Recall
G(x,d)= ì
ï
í
ï
î
-0.5x+1.5d,

if x < d
1.1x-0.1d,

if x ³ d.

and for x Î [0,100]
E[G(x,D)]=75-0.5x+0.008x2
which has a minimum of v* = 67.19 at ¾x=31.25. (We round all real numbers to two decimal places in this example.) Suppose we have a candidate solution Ùx=40. Then it is easy to compute the true value of this candidate:
E[G(40,D)]=67.80.
To obtain statistical bounds on this value, we sample from D, to give d1,d2,¼,dN¢. Consider the following (ordered) random sample of {1, 11, 26, 26, 36, 45, 45, 46, 50, 54, 56, 56, 59, 62, 70, 98} (again this random sample is rounded to integers). With N¢=16, this gives an estimated cost of


^
g
 

N¢ 
(
^
x
 
)
=

1