Option |
Type |
Default |
Description |
num_rows |
int |
999999999 |
If num_rows is less than #points in dataset, will use a
random sample of dataset containing "num_rows" points. Useful
if you're doing experiments vs dataset size. |
all_equal_metric |
bool |
TRUE |
Do we use the plain obvious distance metric? (TRUE means
"don't autoscale the dimensions") (FALSE means "put everything
inside a unit cube") |
matcher |
matchspec |
0.05 |
Specify the n-point predicate to use. The syntax is described
below. |
thresh_ntuples |
double |
0 |
Skip any n-tuples of kdnodes in which the maximum possible
number of tuples from them are less than thresh_ntuples.
This argument is IGNORED if autofind is set to TRUE. |
n |
int |
2 |
The "n" in "n-point" |
autofind |
bool |
FALSE |
If set to TRUE do repeated attempts at solving with
successively smaller thresh_ntuples values, until you find one in which
the maximum possible error is within fraction "errfrac" of the true
count. |
errfrac |
double |
0.05 |
Only relevant if autofind is TRUE. This is "epsilon" in the
paper. |
verbose |
double |
1 |
Set to 0 unless you want an animation (which'll slow it
down) |
rmin |
int |
20 |
The leaf-list size in the kd-tree. Unclear what its general
effect is, but 'smaller the better subject to fitting in main memory' is
NOT the way to go. I suggest not playing with this. 20 usually is
fine. |
min_rel_width |
double |
0.0001 |
kdnodes smaller than this are not split no matter how many
points. I suggest not playing with this. |
rdraw |
bool |
FALSE |
If set to true, pops up a window and does a very fast
flickery animation of progress in the search |
winsize |
int |
512 |
Pixels in ag window edge (size of popup window) |
binfile |
filename |
NULL |
(The following courtesy of Nick Konidaris)
A binfile is a /single line/ that looks like this:
<field1> <field2> <field3> ... <fieldN>
Where
<fieldi> is a parameter that you would pass to the switch matcher.
An example binfile looks like:
1,2 2,3 3,4 5,6 6,10 10,100
And it will iteratively go through each of these fields to find
matches. |
rdata |
string |
random.csv |
If you want to do an n-point count between two datasets (e.g.
data and random) then this is the argument with which you can specify the
random dataset. |
use_permute |
bool |
TRUE |
You should not need to use this. By setting it to FALSE you
can get the same behavior that used to appear in the "compound" predicate cases.
i.e. a set of points is counted multiple times if it matches the
template in multiple orderings and individual points can appear in the
same tuple multiple times.
In its default (TRUE) setting, it now counts each set of points only
once and a set must consist of unique points.
HORRIBLE WARNING FOR SETTING use_permute TO FALSE
Okay there's something ugly going on. For efficiency, if the
the code ever sees that the same dataset is used in all components
of the format, it uses symmetry to avoid wasting time doing
redundant counting (e.g. in 2-point it doesn't count the same
pair twice).
Of course, there's no such kind of symmetry with, say, DR, type
searching.
The warning is: MAKE SURE YOU TAKE THIS INTO ACCOUNT IF YOU
EVER TRY TO DO COMPARISONS OF A DD vs DR vs RR COUNT. |
format |
string |
(n d's, i.e. "ddd...d" ) |
If you are doing 2-point, use format dd for data vs data
format dr for data vs random
format rr for random vs random
If you are doing 3-point, use (for example) format ddr
for data vs data vs random etc...
Note: the d's must be at the beginning and the r's must be at the end
of the string. |
nweights |
int |
0 |
The last nweights columns in the data set are assumed to be
weightings on the data. If this argument is non-zero, all npoint operations
will return both the counts of matching n-tuples, and the weighted counts.
Each tuple matching the template will be counted with a weight equal to
the product of the weights of the points in the tuple.
Note: the approximation methods still use the plain counts to guide the
coarseness of the approximation.
Caveat:
The weighted npoint computation currently only works for n <= 3.
It can be extended for higher n if necessary.
|