EmpiricalCDFs

Empirical cumulative distribution functions

The source repository is https://github.com/jlapeyre/EmpiricalCDFs.jl.

Provides empirical cumulative distribution functions (CDFs) (or "empirical distribution functions" as they are know to probabalists).

EmpiricalCDFs implements empirical CDFs; building, evaluating, random sampling, evaluating the inverse, etc. It is useful especially for examining the tail of the CDF obtained from streaming a large number of data, more than can be stored in memory. For this purpose, you specify a lower cutoff; data points below this value will be silently rejected, but the resulting CDF will still be properly normalized. This ability to process and filter data online is absent in StatsBase.ecdf.

 cdf = EmpiricalCDF()
 append!(cdf, randn(10^5))
 push!(cdf, randn())
 sort!(cdf)

 using Statistics
 mean(cdf)
 std(cdf)

 ...
 print(io,cdf)

# reject points `x < xmin` to use less memory
 cdf = EmpiricalCDF(xmin)

Warning about sorting

Before using the cdf, you must call sort!(cdf). For efficiency data is not sorted as it is inserted. The exception is print, which does sort the cdf before printing.

Contents

EmpiricalCDFs

Contents
Index
Empirical CDF types
Functions
Text file output
Binary IO
Comparison with ecdf

Index

EmpiricalCDFs.AbstractEmpiricalCDF
EmpiricalCDFs.EmpiricalCDF
EmpiricalCDFs.EmpiricalCDFHi
EmpiricalCDFs.IOcdf.CDFfile
Base.append!
Base.print
Base.push!
Base.rand
Base.sort!
EmpiricalCDFs.IOcdf.getcdf
EmpiricalCDFs.IOcdf.header
EmpiricalCDFs.IOcdf.readcdf
EmpiricalCDFs.IOcdf.readcdfinfo
EmpiricalCDFs.IOcdf.save
EmpiricalCDFs.IOcdf.version
EmpiricalCDFs.counts
EmpiricalCDFs.data
EmpiricalCDFs.finv
EmpiricalCDFs.linprint
EmpiricalCDFs.logprint

Empirical CDF types

EmpiricalCDFs.AbstractEmpiricalCDF — Type.

AbstractEmpiricalCDF

Concrete types are EmpiricalCDF and EmpiricalCDFHi.

EmpiricalCDFs.EmpiricalCDF — Type.

EmpiricalCDF{T=Float64}()

Construct an empirical CDF. After inserting elements with push! or append!, and before using most of the functions below, the CDF must be sorted with sort!.

EmpiricalCDF and EmpiricalCDFHi are callable objects. For cdf::AbstractEmpiricalCDF, cdf(x) returns the estimate of the CDF at x. By contrast, cdf[inds] indexes into the underlying data array.

julia> cdf = EmpiricalCDF();
julia> append!(cdf,randn(10^6));
julia> sort!(cdf);
julia> cdf(0.0)
0.499876
julia> cdf(1.0)
0.840944
julia> cdf(-1.0)
0.158494

In this example, we collected $10^6$ samples from the unit normal distribution. About half of the samples are greater than zero. Approximately the same mass is between zero and one as is between zero and minus one.

EmpiricalCDF(lowreject::Real)

If lowereject is finite return EmpiricalCDFHi(lowreject). Otherwise return EmpiricalCDF().

EmpiricalCDFs.EmpiricalCDFHi — Type.

EmpiricalCDFHi{T <: Real} <: AbstractEmpiricalCDF

Empirical CDF with lower cutoff. That is, keep only the tail.

Functions

Base.push! — Function.

push!(cdf::EmpiricalCDF,x::Real)

add the sample x to cdf.

Base.append! — Function.

append!(cdf::EmpiricalCDF, a::AbstractArray)

add samples in a to cdf.

Base.sort! — Function.

sort!(cdf::AbstractEmpiricalCDF)

Sort the data collected in cdf. You must call sort! before using cdf.

EmpiricalCDFs.data — Function.

data(cdf::AbstractEmpiricalCDF)

return the array holding samples for cdf.

EmpiricalCDFs.counts — Function.

counts(cdf::AbstractEmpiricalCDF)

Return the number of counts added to cdf. This includes counts that may have been discarded because they are below of the cutoff.

Base.rand — Function.

rand(cdf::EmpiricalCDF)

Pick a random sample from the distribution represented by cdf.

EmpiricalCDFs.finv — Function.

finv(cdf::AbstractEmpiricalCDF) --> Function

Return the quantile function, that is, the functional inverse of cdf. cdf is a callable object. Note that finv differs slightly from quantile.

Examples

Here, cdf contains $10^6$ samples from the unit normal distribution.

julia> icdf = finv(cdf);
julia> icdf(.5)
-0.00037235611091389375
julia> icdf(1.0-eps())
4.601393290425543
julia> maximum(cdf)
4.601393290425543

Methods are defined on AbstractEmpiricalCDF for the following functions: length, minimum, maximum, extrema, mean, median, std, quantile.

Text file output

Base.print — Function.

print(io::IO, cdf::AbstractEmpiricalCDF)

Call logprint(io,cdf)

EmpiricalCDFs.linprint — Function.

linprint(io::IO ,cdf::AbstractEmpiricalCDF, n=2000) print (not more than) n linearly spaced points after sorting the data.

linprint(fn::String, cdf::AbstractEmpiricalCDF, n=2000)

print cdf to file fn. Print no more than n linearly spaced points.

EmpiricalCDFs.logprint — Function.

logprint(io::IO, cdf::EmpiricalCDF, n=2000) print (not more than) n log spaced points after sorting the data.

Binary IO

I found available serialization choices to be too slow. So, very simple, very fast, binary storage and retrieval is provided. By now, or in the future, there will certainly be packages that provide a sufficient or better replacement.

The type CDFfile supports reading and writing AbstractEmpiricalCDF objects in binary format. Most functions that operate on AbstractEmpiricalCDF also work with CDFfile, with the call being passed to the cdf field.

EmpiricalCDFs.IOcdf.CDFfile — Type.

CDFfile(cdf::AbstractEmpiricalCDF, header="")

struct CDFfile{T <: AbstractEmpiricalCDF}
    cdf::T
    header::String
end

Binary data file for AbstractEmpiricalCDF

The file format is

Identifying string
n::Int64 number of bytes in the header string
s::String The header string
t::Int64 Type of AbstractEmpiricalCDF, 1 or 2. 1 for EmpiricalCDF, 2 for EmpiricalCDFHi.
lowreject::Float64 the lower cutoff, only for EmpiricalCDFHi.
npts::Int64 number of data points in the CDF
npts data points of type Float64

EmpiricalCDFs.IOcdf.save — Function.

save(fn::String, cdf::AbstractEmpiricalCDF, header::String="")

write cdf to file fn in a fast binary format.

EmpiricalCDFs.IOcdf.readcdf — Function.

readcdf(fn::String)

Read an empirical CDF from file fn. Return an object cdff of type CDFfile. The header is in field header. The cdf is in in field cdf.

EmpiricalCDFs.IOcdf.readcdfinfo — Function.

readcdfinfo(fn::String)

Return an object containing information about the cdf saved in the binary file fn. The data itself is not read.

EmpiricalCDFs.IOcdf.header — Function.

header::String = header(cdff::CDFfile)

Return the header from cdff.

EmpiricalCDFs.IOcdf.getcdf — Function.

cdf::AbstractEmpiricalCDF = getcdf(cdff::CDFfile)

Return the CDF from cdff.

EmpiricalCDFs.IOcdf.version — Function.

version(cdff::CDFfile)

Return the version number of the file format.

Comparison with `ecdf`

This package differs from the ecdf function from StatsBase.jl.

ecdf takes a sorted vector as input and returns a function that looks up the value of the CDF. An instance of EmpiricalCDF, cdf, both stores data, eg via push!(cdf,x), and looks up the value of the CDF via cdf(x).
When computing the CDF at an array of values, ecdf, sorts the input and uses an algorithm that scans the data. Instead, EmpiricalCDFs does a binary search for each element of an input vector. Tests showed that this is typically not slower. If the CDF stores a large number of points relative to the size of the input vector, the second method, the one used by EmpiricalCDFs is faster.