EmpiricalCDFs

# EmpiricalCDFs

Empirical cumulative distribution functions

The source repository is https://github.com/jlapeyre/EmpiricalCDFs.jl.

Provides empirical cumulative distribution functions (CDFs) (or "empirical distribution functions" as they are know to probabalists).

EmpiricalCDFs implements empirical CDFs; building, evaluating, random sampling, evaluating the inverse, etc. It is useful especially for examining the tail of the CDF obtained from streaming a large number of data, more than can be stored in memory. For this purpose, you specify a lower cutoff; data points below this value will be silently rejected, but the resulting CDF will still be properly normalized. This ability to process and filter data online is absent in `StatsBase.ecdf`.

`````` cdf = EmpiricalCDF()
append!(cdf, randn(10^5))
push!(cdf, randn())
sort!(cdf)

using Statistics
mean(cdf)
std(cdf)

...
print(io,cdf)

# reject points `x < xmin` to use less memory
cdf = EmpiricalCDF(xmin)``````

Before using the cdf, you must call `sort!(cdf)`. For efficiency data is not sorted as it is inserted. The exception is `print`, which does sort the cdf before printing.

## Empirical CDF types

``AbstractEmpiricalCDF``

Concrete types are `EmpiricalCDF` and `EmpiricalCDFHi`.

source
``EmpiricalCDF{T=Float64}()``

Construct an empirical CDF. After inserting elements with `push!` or `append!`, and before using most of the functions below, the CDF must be sorted with `sort!`.

`EmpiricalCDF` and `EmpiricalCDFHi` are callable objects. For `cdf::AbstractEmpiricalCDF`, `cdf(x)` returns the estimate of the CDF at `x`. By contrast, `cdf[inds]` indexes into the underlying data array.

``````julia> cdf = EmpiricalCDF();
julia> append!(cdf,randn(10^6));
julia> sort!(cdf);
julia> cdf(0.0)
0.499876
julia> cdf(1.0)
0.840944
julia> cdf(-1.0)
0.158494``````

In this example, we collected \$10^6\$ samples from the unit normal distribution. About half of the samples are greater than zero. Approximately the same mass is between zero and one as is between zero and minus one.

source
``EmpiricalCDF(lowreject::Real)``

If `lowereject` is finite return `EmpiricalCDFHi(lowreject)`. Otherwise return `EmpiricalCDF()`.

source
``EmpiricalCDFHi{T <: Real} <: AbstractEmpiricalCDF``

Empirical CDF with lower cutoff. That is, keep only the tail.

source

## Functions

``push!(cdf::EmpiricalCDF,x::Real)``

add the sample `x` to `cdf`.

source
``append!(cdf::EmpiricalCDF, a::AbstractArray)``

add samples in `a` to `cdf`.

source
``sort!(cdf::AbstractEmpiricalCDF)``

Sort the data collected in `cdf`. You must call `sort!` before using `cdf`.

source
``data(cdf::AbstractEmpiricalCDF)``

return the array holding samples for `cdf`.

source
``counts(cdf::AbstractEmpiricalCDF)``

Return the number of counts added to `cdf`. This includes counts that may have been discarded because they are below of the cutoff.

source
``rand(cdf::EmpiricalCDF)``

Pick a random sample from the distribution represented by `cdf`.

source
``finv(cdf::AbstractEmpiricalCDF) --> Function``

Return the quantile function, that is, the functional inverse of `cdf`. `cdf` is a callable object. Note that finv differs slightly from `quantile`.

Examples

Here, `cdf` contains \$10^6\$ samples from the unit normal distribution.

``````julia> icdf = finv(cdf);
julia> icdf(.5)
-0.00037235611091389375
julia> icdf(1.0-eps())
4.601393290425543
julia> maximum(cdf)
4.601393290425543``````
source

Methods are defined on `AbstractEmpiricalCDF` for the following functions: `length`, `minimum`, `maximum`, `extrema`, `mean`, `median`, `std`, `quantile`.

## Text file output

``print(io::IO, cdf::AbstractEmpiricalCDF)``

Call logprint(io,cdf)

source

`linprint(io::IO ,cdf::AbstractEmpiricalCDF, n=2000)` print (not more than) `n` linearly spaced points after sorting the data.

source
``linprint(fn::String, cdf::AbstractEmpiricalCDF, n=2000)``

print `cdf` to file `fn`. Print no more than `n` linearly spaced points.

source

`logprint(io::IO, cdf::EmpiricalCDF, n=2000)` print (not more than) `n` log spaced points after sorting the data.

source

## Binary IO

I found available serialization choices to be too slow. So, very simple, very fast, binary storage and retrieval is provided. By now, or in the future, there will certainly be packages that provide a sufficient or better replacement.

The type `CDFfile` supports reading and writing `AbstractEmpiricalCDF` objects in binary format. Most functions that operate on `AbstractEmpiricalCDF` also work with `CDFfile`, with the call being passed to the `cdf` field.

``````CDFfile(cdf::AbstractEmpiricalCDF, header="")

struct CDFfile{T <: AbstractEmpiricalCDF}
cdf::T
end``````

Binary data file for `AbstractEmpiricalCDF`

The file format is

• Identifying string
• `n::Int64` number of bytes in the header string
• `s::String` The header string
• `t::Int64` Type of `AbstractEmpiricalCDF`, 1 or 2. 1 for `EmpiricalCDF`, 2 for `EmpiricalCDFHi`.
• `lowreject::Float64` the lower cutoff, only for `EmpiricalCDFHi`.
• `npts::Int64` number of data points in the CDF
• `npts` data points of type `Float64`
source
``save(fn::String, cdf::AbstractEmpiricalCDF, header::String="")``

write `cdf` to file `fn` in a fast binary format.

source
``readcdf(fn::String)``

Read an empirical CDF from file `fn`. Return an object `cdff` of type `CDFfile`. The header is in field `header`. The cdf is in in field `cdf`.

source
``readcdfinfo(fn::String)``

Return an object containing information about the cdf saved in the binary file `fn`. The data itself is not read.

source
``header::String = header(cdff::CDFfile)``

Return the header from `cdff`.

source
``cdf::AbstractEmpiricalCDF = getcdf(cdff::CDFfile)``

Return the CDF from `cdff`.

source
``version(cdff::CDFfile)``

Return the version number of the file format.

source

## Comparison with `ecdf`

This package differs from the `ecdf` function from `StatsBase.jl`.

• `ecdf` takes a sorted vector as input and returns a function that looks up the value of the CDF. An instance of `EmpiricalCDF`, `cdf`, both stores data, eg via `push!(cdf,x)`, and looks up the value of the CDF via `cdf(x)`.
• When computing the CDF at an array of values, `ecdf`, sorts the input and uses an algorithm that scans the data. Instead, `EmpiricalCDFs` does a binary search for each element of an input vector. Tests showed that this is typically not slower. If the CDF stores a large number of points relative to the size of the input vector, the second method, the one used by `EmpiricalCDFs` is faster.