NPT

Introduction
Downloads
Source Code
Options
Contact

NPT is a fast implementation of n-point spatial statistics. It uses kd trees in a new way that allows computations of n-point statistics at far greater speed than a naive nontree-based implementation.

For general information about NPT, please refer to this paper.
For detailed information about NPT, please refer to this paper.

Downloads

NPT Linux tar package (compiled on gcc egcs-2.91.66 and libc.so.6)
NPT SunOS tar package (compiled on SunOS_4.1.4_sun4m with gcc 2.7.2)
NPT R Library (compiled on Linux)

Source Code

To obtain NPT source code, please fill out this form.

Options

The standard command line syntax for NPT includes an input file and the following list of options.

./npt npt in datafilename [options]

where datafilename is the name of an ascii file in which each datapoint is given on a separated line, and each line is a series of numbers specified by spaces or commas.

Optionally the datafile may have the first row be a set of non-numeric strings correspnding to attribute names.

So, for example, .csv (comma separated values) files are fine.

Options are....

Option	Type	Default	Description
num_rows	int	999999999	If num_rows is less than #points in dataset, will use a random sample of dataset containing "num_rows" points. Useful if you're doing experiments vs dataset size.
all_equal_metric	bool	TRUE	Do we use the plain obvious distance metric? (TRUE means "don't autoscale the dimensions") (FALSE means "put everything inside a unit cube")
matcher	matchspec	0.05	Specify the n-point predicate to use. The syntax is described below.
thresh_ntuples	double	0	Skip any n-tuples of kdnodes in which the maximum possible number of tuples from them are less than thresh_ntuples. This argument is IGNORED if autofind is set to TRUE.
n	int	2	The "n" in "n-point"
autofind	bool	FALSE	If set to TRUE do repeated attempts at solving with successively smaller thresh_ntuples values, until you find one in which the maximum possible error is within fraction "errfrac" of the true count.
errfrac	double	0.05	Only relevant if autofind is TRUE. This is "epsilon" in the paper.
verbose	double	1	Set to 0 unless you want an animation (which'll slow it down)
rmin	int	20	The leaf-list size in the kd-tree. Unclear what its general effect is, but 'smaller the better subject to fitting in main memory' is NOT the way to go. I suggest not playing with this. 20 usually is fine.
min_rel_width	double	0.0001	kdnodes smaller than this are not split no matter how many points. I suggest not playing with this.
rdraw	bool	FALSE	If set to true, pops up a window and does a very fast flickery animation of progress in the search
winsize	int	512	Pixels in ag window edge (size of popup window)
binfile	filename	NULL	(The following courtesy of Nick Konidaris) A binfile is a /single line/ that looks like this: <field1> <field2> <field3> ... <fieldN> Where <fieldi> is a parameter that you would pass to the switch matcher. An example binfile looks like: 1,2 2,3 3,4 5,6 6,10 10,100 And it will iteratively go through each of these fields to find matches.
rdata	string	random.csv	If you want to do an n-point count between two datasets (e.g. data and random) then this is the argument with which you can specify the random dataset.
use_permute	bool	TRUE	You should not need to use this. By setting it to FALSE you can get the same behavior that used to appear in the "compound" predicate cases. i.e. a set of points is counted multiple times if it matches the template in multiple orderings and individual points can appear in the same tuple multiple times. In its default (TRUE) setting, it now counts each set of points only once and a set must consist of unique points. HORRIBLE WARNING FOR SETTING use_permute TO FALSE Okay there's something ugly going on. For efficiency, if the the code ever sees that the same dataset is used in all components of the format, it uses symmetry to avoid wasting time doing redundant counting (e.g. in 2-point it doesn't count the same pair twice). Of course, there's no such kind of symmetry with, say, DR, type searching. The warning is: MAKE SURE YOU TAKE THIS INTO ACCOUNT IF YOU EVER TRY TO DO COMPARISONS OF A DD vs DR vs RR COUNT.
format	string	(n d's, i.e. "ddd...d" )	If you are doing 2-point, use format dd for data vs data format dr for data vs random format rr for random vs random If you are doing 3-point, use (for example) format ddr for data vs data vs random etc... Note: the d's must be at the beginning and the r's must be at the end of the string.
nweights	int	0	The last nweights columns in the data set are assumed to be weightings on the data. If this argument is non-zero, all npoint operations will return both the counts of matching n-tuples, and the weighted counts. Each tuple matching the template will be counted with a weight equal to the product of the weights of the points in the tuple. Note: the approximation methods still use the plain counts to guide the coarseness of the approximation. Caveat: The weighted npoint computation currently only works for n <= 3. It can be extended for higher n if necessary.

The following is a sample NPT command line:

./npt npt in vsimple.csv rdraw TRUE

Contact

Webmaster
Auton Project webpage