EmpiricalCDFs
Empirical cumulative distribution functions
The source repository is https://github.com/jlapeyre/EmpiricalCDFs.jl.
Provides empirical cumulative distribution functions (CDFs), or "empirical distribution functions" as they are known to probabilists.
EmpiricalCDFs implements empirical CDFs: building, evaluating, random sampling, evaluating the inverse, etc. It is especially useful for examining the tail of a CDF obtained by streaming a large number of data points, more than can be stored in memory. For this purpose, you specify a lower cutoff; data points below this value are silently rejected, but the resulting CDF is still properly normalized. This ability to process and filter data online is absent in StatsBase.ecdf.
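using EmpiricalCDFs
using Statistics
cdf = EmpiricalCDF()
append!(cdf, randn(10^5))   # add many samples at once
push!(cdf, randn())         # add a single sample
sort!(cdf)                  # required before using the cdf
mean(cdf)
std(cdf)
# ...
print(io, cdf)              # `io` is any open IO stream
# reject points `x < xmin` to use less memory (`xmin` is a cutoff you choose)
cdf = EmpiricalCDF(xmin)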
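To illustrate how a CDF can stay properly normalized while points below the cutoff are discarded, here is a minimal sketch in plain Julia. The type `TailCDF` and its fields are hypothetical and are not the package's `EmpiricalCDFHi`. The sketch assumes that normalization works by counting every sample seen while storing only those at or above the cutoff.

```julia
# Hypothetical type for illustration only; not the package's EmpiricalCDFHi.
mutable struct TailCDF
    xmin::Float64          # lower cutoff
    kept::Vector{Float64}  # samples with x >= xmin
    total::Int             # all samples seen, including rejected ones
end

TailCDF(xmin::Real) = TailCDF(Float64(xmin), Float64[], 0)

# Count every sample, but store only those at or above the cutoff.
function Base.push!(c::TailCDF, x::Real)
    c.total += 1
    x >= c.xmin && push!(c.kept, x)
    return c
end

Base.sort!(c::TailCDF) = (sort!(c.kept); c)

# CDF estimate at x >= xmin; the stored samples must be sorted first.
function (c::TailCDF)(x::Real)
    x >= c.xmin || throw(DomainError(x, "x is below the cutoff"))
    rejected = c.total - length(c.kept)
    return (rejected + searchsortedlast(c.kept, x)) / c.total
end

c = TailCDF(2.5)
for x in 1:10        # stands in for a data stream
    push!(c, x)
end
sort!(c)
c(3)    # (2 rejected + 1 kept) / 10 = 0.3
c(10)   # (2 rejected + 8 kept) / 10 = 1.0
```

Only 8 of the 10 samples are stored, yet the estimate at the largest sample is still 1, because the rejected samples remain in the count.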
Warning about sorting
Before using the cdf, you must call sort!(cdf). For efficiency, data is not sorted as it is inserted. The exception is print, which does sort the cdf before printing.
Index
EmpiricalCDFs.AbstractEmpiricalCDF
EmpiricalCDFs.EmpiricalCDF
EmpiricalCDFs.EmpiricalCDFHi
EmpiricalCDFs.IOcdf.CDFfile
Base.append!
Base.print
Base.push!
Base.rand
Base.sort!
EmpiricalCDFs.IOcdf.getcdf
EmpiricalCDFs.IOcdf.header
EmpiricalCDFs.IOcdf.readcdf
EmpiricalCDFs.IOcdf.readcdfinfo
EmpiricalCDFs.IOcdf.save
EmpiricalCDFs.IOcdf.version
EmpiricalCDFs.counts
EmpiricalCDFs.data
EmpiricalCDFs.finv
EmpiricalCDFs.linprint
EmpiricalCDFs.logprint
Empirical CDF types
AbstractEmpiricalCDF
Concrete types are EmpiricalCDF and EmpiricalCDFHi.
EmpiricalCDFs.EmpiricalCDF
— Type.EmpiricalCDF{T=Float64}()
Construct an empirical CDF. After inserting elements with push! or append!, and before using most of the functions below, the CDF must be sorted with sort!.
EmpiricalCDF and EmpiricalCDFHi are callable objects. For cdf::AbstractEmpiricalCDF, cdf(x) returns the estimate of the CDF at x. By contrast, cdf[inds] indexes into the underlying data array.
julia> cdf = EmpiricalCDF();
julia> append!(cdf,randn(10^6));
julia> sort!(cdf);
julia> cdf(0.0)
0.499876
julia> cdf(1.0)
0.840944
julia> cdf(-1.0)
0.158494
In this example, we collected $10^6$ samples from the unit normal distribution. About half of the samples are greater than zero. Approximately the same mass is between zero and one as is between zero and minus one.
EmpiricalCDF(lowreject::Real)
If lowreject is finite, return EmpiricalCDFHi(lowreject). Otherwise, return EmpiricalCDF().
EmpiricalCDFs.EmpiricalCDFHi
— Type.EmpiricalCDFHi{T <: Real} <: AbstractEmpiricalCDF
Empirical CDF with lower cutoff. That is, keep only the tail.
Functions
Base.push!
— Function.push!(cdf::EmpiricalCDF,x::Real)
Add the sample x to cdf.
Base.append!
— Function.append!(cdf::EmpiricalCDF, a::AbstractArray)
Add the samples in a to cdf.
Base.sort!
— Function.sort!(cdf::AbstractEmpiricalCDF)
Sort the data collected in cdf. You must call sort! before using cdf.
EmpiricalCDFs.data
— Function.data(cdf::AbstractEmpiricalCDF)
Return the array holding the samples for cdf.
EmpiricalCDFs.counts
— Function.counts(cdf::AbstractEmpiricalCDF)
Return the number of counts added to cdf. This includes counts that may have been discarded because they are below the cutoff.
Base.rand
— Function.rand(cdf::EmpiricalCDF)
Pick a random sample from the distribution represented by cdf.
EmpiricalCDFs.finv
— Function.finv(cdf::AbstractEmpiricalCDF) --> Function
Return the quantile function, that is, the functional inverse of cdf. cdf is a callable object. Note that finv differs slightly from quantile.
Examples
Here, cdf contains $10^6$ samples from the unit normal distribution.
julia> icdf = finv(cdf);
julia> icdf(.5)
-0.00037235611091389375
julia> icdf(1.0-eps())
4.601393290425543
julia> maximum(cdf)
4.601393290425543
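The text above does not say how finv differs from quantile. One plausible source of a difference, assumed here and not confirmed by the package, is interpolation: by default `Statistics.quantile` interpolates linearly between neighboring order statistics, whereas inverting the step function of an empirical CDF always returns one of the samples. The helper `step_inverse` below is hypothetical and is not the package's `finv`.

```julia
using Statistics

# Hypothetical helper, not the package's finv: invert the ECDF step function
# by returning the smallest sample whose ECDF value is at least p.
function step_inverse(sorted::AbstractVector{<:Real})
    n = length(sorted)
    return p -> sorted[clamp(ceil(Int, p * n), 1, n)]
end

xs = sort([3.0, 1.0, 4.0, 2.0])
q = step_inverse(xs)

q(0.5)              # 2.0, an actual sample
quantile(xs, 0.5)   # 2.5, interpolated between 2.0 and 3.0
```

With many samples the two values are usually close, which fits the description "differs slightly".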
Methods are defined on AbstractEmpiricalCDF for the following functions: length, minimum, maximum, extrema, mean, median, std, quantile.
Text file output
Base.print
— Function.print(io::IO, cdf::AbstractEmpiricalCDF)
Call logprint(io,cdf)
EmpiricalCDFs.linprint
— Function.linprint(io::IO ,cdf::AbstractEmpiricalCDF, n=2000)
Print (not more than) n linearly spaced points after sorting the data.
linprint(fn::String, cdf::AbstractEmpiricalCDF, n=2000)
Print cdf to file fn. Print no more than n linearly spaced points.
EmpiricalCDFs.logprint
— Function.logprint(io::IO, cdf::EmpiricalCDF, n=2000)
Print (not more than) n log spaced points after sorting the data.
Binary IO
I found the available serialization choices to be too slow, so very simple, very fast binary storage and retrieval is provided. By now, or in the future, there will certainly be packages that provide a sufficient or better replacement.
The type CDFfile supports reading and writing AbstractEmpiricalCDF objects in binary format. Most functions that operate on AbstractEmpiricalCDF also work with CDFfile, with the call being passed to the cdf field.
EmpiricalCDFs.IOcdf.CDFfile
— Type.CDFfile(cdf::AbstractEmpiricalCDF, header="")
struct CDFfile{T <: AbstractEmpiricalCDF}
cdf::T
header::String
end
Binary data file for AbstractEmpiricalCDF.
The file format is
- Identifying string
- n::Int64 number of bytes in the header string
- s::String the header string
- t::Int64 type of AbstractEmpiricalCDF, 1 or 2. 1 for EmpiricalCDF, 2 for EmpiricalCDFHi.
- lowreject::Float64 the lower cutoff, only for EmpiricalCDFHi.
- npts::Int64 number of data points in the CDF
- npts data points of type Float64
EmpiricalCDFs.IOcdf.save
— Function.save(fn::String, cdf::AbstractEmpiricalCDF, header::String="")
Write cdf to file fn in a fast binary format.
EmpiricalCDFs.IOcdf.readcdf
— Function.readcdf(fn::String)
Read an empirical CDF from file fn. Return an object cdff of type CDFfile. The header is in field header. The cdf is in field cdf.
EmpiricalCDFs.IOcdf.readcdfinfo
— Function.readcdfinfo(fn::String)
Return an object containing information about the cdf saved in the binary file fn. The data itself is not read.
EmpiricalCDFs.IOcdf.header
— Function.header::String = header(cdff::CDFfile)
Return the header from cdff.
EmpiricalCDFs.IOcdf.getcdf
— Function.cdf::AbstractEmpiricalCDF = getcdf(cdff::CDFfile)
Return the CDF from cdff.
EmpiricalCDFs.IOcdf.version
— Function.version(cdff::CDFfile)
Return the version number of the file format.
Comparison with ecdf
This package differs from the ecdf function from StatsBase.jl.
- ecdf takes a sorted vector as input and returns a function that looks up the value of the CDF. An instance of EmpiricalCDF, cdf, both stores data, e.g. via push!(cdf,x), and looks up the value of the CDF via cdf(x).
- When computing the CDF at an array of values, ecdf sorts the input and uses an algorithm that scans the data. Instead, EmpiricalCDFs does a binary search for each element of an input vector. Tests showed that this is typically not slower. If the CDF stores a large number of points relative to the size of the input vector, the second method, the one used by EmpiricalCDFs, is faster.
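The two lookup strategies compared above can be sketched in plain Julia. Both functions are hypothetical illustrations written for this comparison; neither is the actual code of ecdf or of EmpiricalCDFs. With n stored samples and m query points, one binary search per query costs roughly O(m log n), while sorting the queries and scanning the samples costs roughly O(m log m + n), which explains why the binary search wins when n is large relative to m.

```julia
# Hypothetical helpers for illustration; not the code of ecdf or EmpiricalCDFs.

# One binary search per query against the sorted samples.
ecdf_bsearch(sorted, queries) =
    [searchsortedlast(sorted, q) / length(sorted) for q in queries]

# Sort the queries, then make a single pass over the sorted samples.
function ecdf_scan(sorted, queries)
    n = length(sorted)
    order = sortperm(queries)      # visit queries in increasing order
    out = zeros(length(queries))
    i = 0                          # number of samples <= current query
    for k in order
        while i < n && sorted[i + 1] <= queries[k]
            i += 1
        end
        out[k] = i / n
    end
    return out
end

samples = sort([0.3, -1.2, 2.5, 0.0, 1.1])
queries = [1.0, -2.0, 0.0, 3.0]

ecdf_bsearch(samples, queries)   # [0.6, 0.0, 0.4, 1.0]
ecdf_scan(samples, queries)      # same values
```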