The ‘gfiExtremes’ package

The ‘gfiExtremes’ package offers two main functions: gfigpd1 and gfigpd2. Each of them generates simulations from the fiducial distribution of a model involving the generalized Pareto distribution. The difference is that the threshold μ of the generalized Pareto distribution is assumed to be known by gfigpd1, whereas gfigpd2 estimates this threshold.

The algorithms are implemented in C++. They are translated from the original R code written by the authors of the reference paper.

Note

The package allows the MCMC chains to be run in parallel. In the examples below, as well as in the examples of the package documentation, I set nthreads = 2L because of CRAN restrictions: CRAN does not allow more than two R processes to run in parallel, and it would therefore not accept the package if a higher number of threads were used.
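Outside of CRAN you are not bound by this restriction. A reasonable choice, sketched below with the base ‘parallel’ package (adjust to your machine), is one thread per chain, capped by the number of available cores:

# not run on CRAN: up to one thread per chain,
# capped by the number of cores detected on the machine
nthreads <- min(4L, parallel::detectCores())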

library(gfiExtremes)
#> Loading required package: coda

The model with a known threshold

When the threshold μ is known, it is assumed that the data are independent realizations of a random variable X which follows a generalized Pareto distribution GP(μ, γ, σ) conditionally on X ≥ μ. On the event X < μ, no distributional assumption is made.

The gfigpd1 function then produces simulations of the fiducial distributions of γ, σ, and of the 100β%-quantiles of X for the requested values of β. These simulations take the form of MCMC chains.

For example, assume that X follows the GP(μ, γ, σ) distribution (so X < μ never happens).

set.seed(666L)
X <- rgpareto(200L, mu = 10, gamma = 0.3, sigma = 2)
gf1 <- gfigpd1(
  X, beta = c(0.99, 0.995, 0.999), threshold = 10, iter = 10000L,
  nchains = 4L, nthreads = 2L
)

The output of gfigpd1 is an R object ready for analysis with the ‘coda’ package. In particular, it has a summary method.

summary(gf1)
#> 
#> Iterations = 2001:61995
#> Thinning interval = 6 
#> Number of chains = 4 
#> Sample size per chain = 10000 
#> 
#> 1. Empirical mean and standard deviation for each variable,
#>    plus standard error of the mean:
#> 
#>          Mean      SD  Naive SE Time-series SE
#> gamma  0.3132  0.1017 0.0005086      0.0009532
#> sigma  1.9692  0.2351 0.0011757      0.0022424
#> beta1 30.9276  5.4264 0.0271322      0.0467451
#> beta2 38.0272  9.0772 0.0453861      0.0799586
#> beta3 63.5831 27.2337 0.1361684      0.2454226
#> 
#> 2. Quantiles for each variable:
#> 
#>          2.5%     25%     50%     75%    97.5%
#> gamma  0.1347  0.2415  0.3055  0.3769   0.5319
#> sigma  1.5371  1.8051  1.9585  2.1210   2.4602
#> beta1 23.9020 27.2199 29.7925 33.3429  44.5348
#> beta2 27.0268 31.9434 35.9689 41.6726  61.1854
#> beta3 35.3511 46.5221 56.5144 72.0191 134.1137
# compare with the true quantiles:
qgpareto(c(0.99, 0.995, 0.999), mu = 10, gamma = 0.3, sigma = 2)
#> [1] 29.87381 36.00849 56.28855

Every parameter is very well estimated by the median of its fiducial distribution.
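Since the output gathers the four chains in a ‘coda’ object, the usual MCMC diagnostics also apply; for instance (a quick sketch, output not shown):

# Gelman-Rubin convergence diagnostic, computed across the four chains
gelman.diag(gf1)
# effective sample size of each variable
effectiveSize(gf1)
# trace plots and density estimates of the chains
plot(gf1)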

We can get the shortest fiducial confidence intervals with the ‘coda’ function HPDinterval, after combining the four chains with joinMCMCchains:

HPDinterval(joinMCMCchains(gf1))
#>            lower       upper
#> gamma  0.1196782   0.5116534
#> sigma  1.5219247   2.4336786
#> beta1 22.7531531  41.4748133
#> beta2 25.1379496  55.3255893
#> beta3 31.1060321 113.4594573
#> attr(,"Probability")
#> [1] 0.95
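If another confidence level is desired, HPDinterval accepts a prob argument; for example, for 90% intervals (output not shown):

# shortest 90% fiducial confidence intervals
HPDinterval(joinMCMCchains(gf1), prob = 0.90)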

The model with an unknown threshold

When the threshold μ is unknown, it is also assumed that the data are independent realizations of a random variable X which follows a generalized Pareto distribution GP(μ, γ, σ) conditionally on X ≥ μ, but there are additional assumptions.

These additional assumptions have no meaningful interpretation, but this does not matter for estimating the quantiles of X: it can happen that the parameters γ and σ are not estimable (they always are if X strictly follows the unrealistic assumptions of the model), whereas μ can always be estimated and the fiducial distributions of the quantiles are always available.

Let’s assume for example that X follows a log-normal distribution:

set.seed(666L)
X <- rlnorm(400L, meanlog = 1)
gf2 <- gfigpd2(
  X, beta = c(0.99, 0.995, 0.999), iter = 10000L, burnin = 10000L,
  nchains = 4L, nthreads = 2L
)
summary(gf2)
#> 
#> Iterations = 10001:69995
#> Thinning interval = 6 
#> Number of chains = 4 
#> Sample size per chain = 10000 
#> 
#> 1. Empirical mean and standard deviation for each variable,
#>    plus standard error of the mean:
#> 
#>        Mean     SD Naive SE Time-series SE
#> beta1 25.64  4.025  0.02013         0.1160
#> beta2 32.77  6.471  0.03235         0.1970
#> beta3 55.60 17.167  0.08584         0.5701
#> 
#> 2. Quantiles for each variable:
#> 
#>        2.5%   25%   50%   75%  97.5%
#> beta1 19.73 22.75 24.97 27.77  35.44
#> beta2 23.69 28.19 31.59 35.96  48.99
#> beta3 34.07 43.71 51.84 62.69 100.44
# compare with the true quantiles:
qlnorm(c(0.99, 0.995, 0.999), meanlog = 1)
#> [1] 27.83649 35.72423 59.75377

As you can see, the inference on the quantiles is good.
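The shortest fiducial confidence intervals of the quantiles are obtained in the same way as before (output not shown):

# shortest 95% fiducial confidence intervals of the three quantiles
HPDinterval(joinMCMCchains(gf2))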