EmpiricalCDFs

EmpiricalCDFs

Empirical cumulative distribution functions

The source repository is https://github.com/jlapeyre/EmpiricalCDFs.jl.

Provides empirical cumulative distribution functions (CDFs) (or "empirical distribution functions" as they are know to probabalists).

EmpiricalCDFs implements empirical CDFs; building, evaluating, random sampling, evaluating the inverse, etc. It is useful especially for examining the tail of the CDF obtained from streaming a large number of data, more than can be stored in memory. For this purpose, you specify a lower cutoff; data points below this value will be silently rejected, but the resulting CDF will still be properly normalized. This ability to process and filter data online is absent in StatsBase.ecdf.

 cdf = EmpiricalCDF()
 append!(cdf, randn(10^5))
 push!(cdf, randn())
 sort!(cdf)

 using Statistics
 mean(cdf)
 std(cdf)

 ...
 print(io,cdf)

# reject points `x < xmin` to use less memory
 cdf = EmpiricalCDF(xmin)

Warning about sorting

Before using the cdf, you must call sort!(cdf). For efficiency data is not sorted as it is inserted. The exception is print, which does sort the cdf before printing.

Contents

Index

Empirical CDF types

AbstractEmpiricalCDF

Concrete types are EmpiricalCDF and EmpiricalCDFHi.

source
EmpiricalCDF{T=Float64}()

Construct an empirical CDF. After inserting elements with push! or append!, and before using most of the functions below, the CDF must be sorted with sort!.

EmpiricalCDF and EmpiricalCDFHi are callable objects. For cdf::AbstractEmpiricalCDF, cdf(x) returns the estimate of the CDF at x. By contrast, cdf[inds] indexes into the underlying data array.

julia> cdf = EmpiricalCDF();
julia> append!(cdf,randn(10^6));
julia> sort!(cdf);
julia> cdf(0.0)
0.499876
julia> cdf(1.0)
0.840944
julia> cdf(-1.0)
0.158494

In this example, we collected $10^6$ samples from the unit normal distribution. About half of the samples are greater than zero. Approximately the same mass is between zero and one as is between zero and minus one.

source
EmpiricalCDF(lowreject::Real)

If lowereject is finite return EmpiricalCDFHi(lowreject). Otherwise return EmpiricalCDF().

source
EmpiricalCDFHi{T <: Real} <: AbstractEmpiricalCDF

Empirical CDF with lower cutoff. That is, keep only the tail.

source

Functions

Base.push!Function.
push!(cdf::EmpiricalCDF,x::Real)

add the sample x to cdf.

source
Base.append!Function.
append!(cdf::EmpiricalCDF, a::AbstractArray)

add samples in a to cdf.

source
Base.sort!Function.
sort!(cdf::AbstractEmpiricalCDF)

Sort the data collected in cdf. You must call sort! before using cdf.

source
EmpiricalCDFs.dataFunction.
data(cdf::AbstractEmpiricalCDF)

return the array holding samples for cdf.

source
EmpiricalCDFs.countsFunction.
counts(cdf::AbstractEmpiricalCDF)

Return the number of counts added to cdf. This includes counts that may have been discarded because they are below of the cutoff.

source
Base.randFunction.
rand(cdf::EmpiricalCDF)

Pick a random sample from the distribution represented by cdf.

source
EmpiricalCDFs.finvFunction.
finv(cdf::AbstractEmpiricalCDF) --> Function

Return the quantile function, that is, the functional inverse of cdf. cdf is a callable object. Note that finv differs slightly from quantile.

Examples

Here, cdf contains $10^6$ samples from the unit normal distribution.

julia> icdf = finv(cdf);
julia> icdf(.5)
-0.00037235611091389375
julia> icdf(1.0-eps())
4.601393290425543
julia> maximum(cdf)
4.601393290425543
source

Methods are defined on AbstractEmpiricalCDF for the following functions: length, minimum, maximum, extrema, mean, median, std, quantile.

Text file output

Base.printFunction.
print(io::IO, cdf::AbstractEmpiricalCDF)

Call logprint(io,cdf)

source

linprint(io::IO ,cdf::AbstractEmpiricalCDF, n=2000) print (not more than) n linearly spaced points after sorting the data.

source
linprint(fn::String, cdf::AbstractEmpiricalCDF, n=2000)

print cdf to file fn. Print no more than n linearly spaced points.

source

logprint(io::IO, cdf::EmpiricalCDF, n=2000) print (not more than) n log spaced points after sorting the data.

source

Binary IO

I found available serialization choices to be too slow. So, very simple, very fast, binary storage and retrieval is provided. By now, or in the future, there will certainly be packages that provide a sufficient or better replacement.

The type CDFfile supports reading and writing AbstractEmpiricalCDF objects in binary format. Most functions that operate on AbstractEmpiricalCDF also work with CDFfile, with the call being passed to the cdf field.

CDFfile(cdf::AbstractEmpiricalCDF, header="")

struct CDFfile{T <: AbstractEmpiricalCDF}
    cdf::T
    header::String
end

Binary data file for AbstractEmpiricalCDF

The file format is

  • Identifying string
  • n::Int64 number of bytes in the header string
  • s::String The header string
  • t::Int64 Type of AbstractEmpiricalCDF, 1 or 2. 1 for EmpiricalCDF, 2 for EmpiricalCDFHi.
  • lowreject::Float64 the lower cutoff, only for EmpiricalCDFHi.
  • npts::Int64 number of data points in the CDF
  • npts data points of type Float64
source
save(fn::String, cdf::AbstractEmpiricalCDF, header::String="")

write cdf to file fn in a fast binary format.

source
readcdf(fn::String)

Read an empirical CDF from file fn. Return an object cdff of type CDFfile. The header is in field header. The cdf is in in field cdf.

source
readcdfinfo(fn::String)

Return an object containing information about the cdf saved in the binary file fn. The data itself is not read.

source
header::String = header(cdff::CDFfile)

Return the header from cdff.

source
cdf::AbstractEmpiricalCDF = getcdf(cdff::CDFfile)

Return the CDF from cdff.

source
version(cdff::CDFfile)

Return the version number of the file format.

source

Comparison with ecdf

This package differs from the ecdf function from StatsBase.jl.