EmpiricalCDFs
Empirical cumulative distribution functions
The source repository is https://github.com/jlapeyre/EmpiricalCDFs.jl.
Provides empirical cumulative distribution functions (CDFs) (or "empirical distribution functions" as they are known to probabilists).
EmpiricalCDFs implements empirical CDFs: building, evaluating, random sampling, evaluating the inverse, etc. It is especially useful for examining the tail of the CDF obtained by streaming a large amount of data, more than can be stored in memory. For this purpose, you specify a lower cutoff; data points below this value are silently rejected, but the resulting CDF is still properly normalized. This ability to process and filter data online is absent in StatsBase.ecdf.
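using EmpiricalCDFs
using Statistics   # provides mean and std
cdf = EmpiricalCDF()
append!(cdf, randn(10^5))
push!(cdf, randn())
sort!(cdf)   # required before the cdf is used
mean(cdf)
std(cdf)
...
print(io, cdf)   # io stands for any open IO stream
# reject points `x < xmin` to use less memory
cdf = EmpiricalCDF(xmin)
Warning about sorting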
Before using the cdf, you must call sort!(cdf). For efficiency, data is not sorted as it is inserted. The exception is print, which sorts the cdf before printing.
Index
- EmpiricalCDFs.AbstractEmpiricalCDF
- EmpiricalCDFs.EmpiricalCDF
- EmpiricalCDFs.EmpiricalCDFHi
- EmpiricalCDFs.IOcdf.CDFfile
- Base.append!
- Base.print
- Base.push!
- Base.rand
- Base.sort!
- EmpiricalCDFs.IOcdf.getcdf
- EmpiricalCDFs.IOcdf.header
- EmpiricalCDFs.IOcdf.readcdf
- EmpiricalCDFs.IOcdf.readcdfinfo
- EmpiricalCDFs.IOcdf.save
- EmpiricalCDFs.IOcdf.version
- EmpiricalCDFs.counts
- EmpiricalCDFs.data
- EmpiricalCDFs.finv
- EmpiricalCDFs.linprint
- EmpiricalCDFs.logprint
Empirical CDF types
AbstractEmpiricalCDF
Concrete types are EmpiricalCDF and EmpiricalCDFHi.
EmpiricalCDFs.EmpiricalCDF — Type.
EmpiricalCDF{T=Float64}()
Construct an empirical CDF. After inserting elements with push! or append!, and before using most of the functions below, the CDF must be sorted with sort!.
EmpiricalCDF and EmpiricalCDFHi are callable objects. For cdf::AbstractEmpiricalCDF, cdf(x) returns the estimate of the CDF at x. By contrast, cdf[inds] indexes into the underlying data array.
julia> cdf = EmpiricalCDF();
julia> append!(cdf,randn(10^6));
julia> sort!(cdf);
julia> cdf(0.0)
0.499876
julia> cdf(1.0)
0.840944
julia> cdf(-1.0)
0.158494
In this example, we collected $10^6$ samples from the unit normal distribution. About half of the samples are greater than zero. Approximately the same mass is between zero and one as is between zero and minus one.
EmpiricalCDF(lowreject::Real)
If lowreject is finite, return EmpiricalCDFHi(lowreject). Otherwise return EmpiricalCDF().
EmpiricalCDFs.EmpiricalCDFHi — Type.
EmpiricalCDFHi{T <: Real} <: AbstractEmpiricalCDF
Empirical CDF with lower cutoff. That is, keep only the tail.
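How a tail-only CDF can stay properly normalized may be easier to see in code. The following is a minimal sketch using only Base Julia, not the package's implementation: the type `TailCDF` and the functions `add!`, `sortkept!`, and `evaluate` are hypothetical names invented for this illustration. The idea assumed here is that every sample is counted, but only samples at or above the cutoff are stored.

```julia
# Hypothetical sketch; none of these names belong to EmpiricalCDFs.
mutable struct TailCDF
    xmin::Float64          # lower cutoff
    total::Int             # every sample seen, kept or rejected
    kept::Vector{Float64}  # only samples with x >= xmin
end

TailCDF(xmin::Real) = TailCDF(Float64(xmin), 0, Float64[])

# Count every sample, but store it only if it reaches the cutoff.
function add!(c::TailCDF, x::Real)
    c.total += 1
    x >= c.xmin && push!(c.kept, x)
    return c
end

# Sort the stored samples in place; required before evaluating.
sortkept!(c::TailCDF) = (sort!(c.kept); c)

# Estimate P(X <= x) for x >= xmin. All rejected samples lie below x,
# so they are added to the number of kept samples that are <= x.
function evaluate(c::TailCDF, x::Real)
    x >= c.xmin || throw(DomainError(x, "x is below the cutoff"))
    rejected = c.total - length(c.kept)
    return (rejected + searchsortedlast(c.kept, x)) / c.total
end

c = TailCDF(2.0)
foreach(x -> add!(c, x), [0.5, 1.0, 2.5, 3.0, 1.5, 4.0, 0.1, 2.0])
sortkept!(c)
println(evaluate(c, 3.0))   # 7 of the 8 samples are <= 3.0
```

Only four of the eight samples are stored, yet the estimate at the largest sample is still 1 because the divisor is the total count rather than the number of stored points.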
Functions
Base.push! — Function.
push!(cdf::EmpiricalCDF, x::Real)
Add the sample x to cdf.
Base.append! — Function.
append!(cdf::EmpiricalCDF, a::AbstractArray)
Add the samples in a to cdf.
Base.sort! — Function.
sort!(cdf::AbstractEmpiricalCDF)
Sort the data collected in cdf. You must call sort! before using cdf.
EmpiricalCDFs.data — Function.
data(cdf::AbstractEmpiricalCDF)
Return the array holding the samples for cdf.
EmpiricalCDFs.counts — Function.
counts(cdf::AbstractEmpiricalCDF)
Return the number of counts added to cdf. This includes counts that may have been discarded because they are below the cutoff.
Base.rand — Function.
rand(cdf::EmpiricalCDF)
Pick a random sample from the distribution represented by cdf.
EmpiricalCDFs.finv — Function.
finv(cdf::AbstractEmpiricalCDF) --> Function
Return the quantile function, that is, the functional inverse of cdf. cdf is a callable object. Note that finv differs slightly from quantile.
Examples
Here, cdf contains $10^6$ samples from the unit normal distribution.
julia> icdf = finv(cdf);
julia> icdf(.5)
-0.00037235611091389375
julia> icdf(1.0-eps())
4.601393290425543
julia> maximum(cdf)
4.601393290425543
Methods are defined on AbstractEmpiricalCDF for the following functions: length, minimum, maximum, extrema, mean, median, std, quantile.
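The quantile function returned by finv can be pictured as a lookup into the sorted samples. The sketch below uses only Base Julia and is not the package's code: `inverse_ecdf` is a hypothetical helper, and the convention it uses (return the smallest sample whose empirical CDF value is at least p) is an assumption, since the text above does not say exactly how finv differs from quantile.

```julia
# Hypothetical helper; not the package's finv.
# xs must already be sorted in ascending order.
function inverse_ecdf(xs::AbstractVector{<:Real})
    n = length(xs)
    n > 0 || throw(ArgumentError("need at least one sample"))
    # Return a closure mapping a probability p to the smallest sample
    # whose empirical CDF value is >= p.
    return p -> begin
        0 <= p <= 1 || throw(DomainError(p, "p must lie in [0, 1]"))
        xs[clamp(ceil(Int, p * n), 1, n)]
    end
end

xs = sort([0.3, -1.2, 2.0, 0.9])
icdf = inverse_ecdf(xs)
println(icdf(0.5))            # second smallest of the four samples
println(icdf(1.0 - eps()))    # the largest sample
```

Under this convention, evaluating just below 1 returns the maximum sample, which matches the behavior shown in the finv example.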
Text file output
Base.print — Function.
print(io::IO, cdf::AbstractEmpiricalCDF)
Call logprint(io,cdf).
EmpiricalCDFs.linprint — Function.
linprint(io::IO, cdf::AbstractEmpiricalCDF, n=2000)
Print (not more than) n linearly spaced points after sorting the data.
linprint(fn::String, cdf::AbstractEmpiricalCDF, n=2000)
Print cdf to file fn. Print no more than n linearly spaced points.
EmpiricalCDFs.logprint — Function.
logprint(io::IO, cdf::EmpiricalCDF, n=2000)
Print (not more than) n log spaced points after sorting the data.
Binary IO
I found the available serialization choices to be too slow, so very simple, very fast binary storage and retrieval is provided. By now, or in the future, there will certainly be packages that provide a sufficient or better replacement.
The type CDFfile supports reading and writing AbstractEmpiricalCDF objects in binary format. Most functions that operate on AbstractEmpiricalCDF also work with CDFfile, with the call being passed to the cdf field.
EmpiricalCDFs.IOcdf.CDFfile — Type.
CDFfile(cdf::AbstractEmpiricalCDF, header="")
struct CDFfile{T <: AbstractEmpiricalCDF}
    cdf::T
    header::String
end
Binary data file for AbstractEmpiricalCDF
The file format is
- Identifying string
- n::Int64 number of bytes in the header string
- s::String The header string
- t::Int64 Type of AbstractEmpiricalCDF, 1 or 2. 1 for EmpiricalCDF, 2 for EmpiricalCDFHi.
- lowreject::Float64 the lower cutoff, only for EmpiricalCDFHi.
- npts::Int64 number of data points in the CDF
- npts data points of type Float64
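To make the field order concrete, here is a rough sketch that writes and reads a record with the layout listed above, using only Base Julia I/O. It is not the package's reader or writer and makes no claim of byte compatibility with files produced by save. The identifying string `MAGIC` is made up, `write_layout` and `read_layout` are hypothetical names, values are written in the host byte order, and no version field is included because the list does not mention one.

```julia
# Hypothetical writer/reader for the field order described above.
const MAGIC = "DEMOCDF"   # made-up identifying string, not the real one

function write_layout(io::IO, header::String, data::Vector{Float64};
                      lowreject::Union{Nothing,Float64}=nothing)
    write(io, MAGIC)
    write(io, Int64(sizeof(header)))                  # n: bytes in the header
    write(io, header)                                 # s: the header string
    write(io, Int64(lowreject === nothing ? 1 : 2))   # t: 1 = full, 2 = tail only
    lowreject === nothing || write(io, lowreject)     # cutoff, tail type only
    write(io, Int64(length(data)))                    # npts
    write(io, data)                                   # npts Float64 values
    return io
end

function read_layout(io::IO)
    magic = String(read(io, sizeof(MAGIC)))
    magic == MAGIC || error("unrecognized identifying string")
    n = read(io, Int64)
    header = String(read(io, n))
    t = read(io, Int64)
    lowreject = t == 2 ? read(io, Float64) : nothing
    npts = read(io, Int64)
    data = Vector{Float64}(undef, npts)
    read!(io, data)
    return (header=header, kind=t, lowreject=lowreject, data=data)
end

# Round trip through an in-memory buffer instead of a file.
io = IOBuffer()
write_layout(io, "tail of a normal sample", [2.1, 2.4, 3.0]; lowreject=2.0)
seekstart(io)
rec = read_layout(io)
println(rec.header, " / ", rec.kind, " / ", rec.data)
```

Storing the header length and the point count ahead of the variable-length parts lets a reader allocate once and fill the array with a single read! call, which is one plausible reason such a layout is fast.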
EmpiricalCDFs.IOcdf.save — Function.
save(fn::String, cdf::AbstractEmpiricalCDF, header::String="")
Write cdf to file fn in a fast binary format.
EmpiricalCDFs.IOcdf.readcdf — Function.
readcdf(fn::String)
Read an empirical CDF from file fn. Return an object cdff of type CDFfile. The header is in field header. The cdf is in field cdf.
EmpiricalCDFs.IOcdf.readcdfinfo — Function.
readcdfinfo(fn::String)
Return an object containing information about the cdf saved in the binary file fn. The data itself is not read.
EmpiricalCDFs.IOcdf.header — Function.
header::String = header(cdff::CDFfile)
Return the header from cdff.
EmpiricalCDFs.IOcdf.getcdf — Function.
cdf::AbstractEmpiricalCDF = getcdf(cdff::CDFfile)
Return the CDF from cdff.
EmpiricalCDFs.IOcdf.version — Function.
version(cdff::CDFfile)
Return the version number of the file format.
Comparison with ecdf
This package differs from the ecdf function from StatsBase.jl.
- ecdf takes a sorted vector as input and returns a function that looks up the value of the CDF. An instance of EmpiricalCDF, cdf, both stores data, e.g. via push!(cdf,x), and looks up the value of the CDF via cdf(x).
- When computing the CDF at an array of values, ecdf sorts the input and uses an algorithm that scans the data. Instead, EmpiricalCDFs does a binary search for each element of an input vector. Tests showed that this is typically not slower. If the CDF stores a large number of points relative to the size of the input vector, the second method, the one used by EmpiricalCDFs, is faster.
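The two lookup strategies compared above can be sketched side by side. The code below is illustrative only and is taken from neither package: `ecdf_bsearch` and `ecdf_scan` are hypothetical names, and the scan shown is just one plausible reading of "sorts the input and scans the data". With n stored points and m query points, the binary-search version costs roughly m log n comparisons, while the scan costs roughly m log m + n + m, so the binary search is expected to do better when n is much larger than m.

```julia
# Hypothetical helpers; `sorted` must be in ascending order.

# Strategy 1: one binary search per query point.
ecdf_bsearch(sorted, queries) =
    [searchsortedlast(sorted, q) / length(sorted) for q in queries]

# Strategy 2: visit the queries in ascending order and walk the data once.
function ecdf_scan(sorted, queries)
    n = length(sorted)
    out = Vector{Float64}(undef, length(queries))
    i = 0                                # samples known to be <= current query
    for k in sortperm(queries)
        q = queries[k]
        while i < n && sorted[i + 1] <= q
            i += 1
        end
        out[k] = i / n                   # store at the query's original position
    end
    return out
end

sorted = sort([0.2, 1.5, -0.7, 3.1, 0.9])
queries = [1.0, -1.0, 3.1]
println(ecdf_bsearch(sorted, queries))
println(ecdf_scan(sorted, queries))
```

Both functions return the same values; they differ only in how the work scales with the number of stored points and the number of queries.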