Patch for bugs in complete-link and single-link clustering with finite distance bounds
The bug
There are bugs in both single-link and complete-link
clustering that arise only when some elements are
farther than the distance bound from all other
elements. Thus the problem only arises with the
two-argument constructors; the one-argument constructors
set this bound to positive infinity.
The cause is that I was pruning the set of distance
pairs by removing any pair beyond the maximum distance.
Elements then get lost if they are not within the max distance
bound of any other element.
In both cases, the pruning happens within the hierarchicalCluster()
method, and the fix is to remove the offending checks:
    if (score > maxDistance) continue;
    if (distanceIJ > maxDistance) continue;
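To make the failure mode concrete, here is a minimal sketch of how pruning pairs by distance can drop an element entirely. It is not the LingPipe source: the class PruningSketch, the method elementsInPairs, and the use of one-dimensional points with absolute difference as the distance are all assumptions made for illustration. Only the shape of the `continue` check is taken from the lines shown above.

```java
import java.util.TreeSet;

public class PruningSketch {

    // Hypothetical helper: returns the indices of all elements that occur
    // in at least one retained pair. With prune == true, pairs farther
    // apart than maxDistance are skipped, as in the checks quoted above.
    static TreeSet<Integer> elementsInPairs(double[] points,
                                            double maxDistance,
                                            boolean prune) {
        TreeSet<Integer> seen = new TreeSet<>();
        for (int i = 0; i < points.length; ++i) {
            for (int j = i + 1; j < points.length; ++j) {
                // Assumed distance: absolute difference of 1-D points.
                double distanceIJ = Math.abs(points[i] - points[j]);
                if (prune && distanceIJ > maxDistance) continue;
                seen.add(i);
                seen.add(j);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Element 3 (at 100.0) is farther than 5.0 from every other element.
        double[] points = { 0.0, 1.0, 2.0, 100.0 };
        System.out.println(elementsInPairs(points, 5.0, true));  // [0, 1, 2]
        System.out.println(elementsInPairs(points, 5.0, false)); // [0, 1, 2, 3]
    }
}
```

With pruning on, element 3 never appears in any pair, so a clusterer that builds its result only from the retained pairs would never see it. Without the check, every element appears in some pair and can be placed in the hierarchy.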
Unfortunately, while the fix makes the methods match the
documentation and do a complete hierarchical clustering,
it will be slower and use more memory whenever some
pairs of elements are farther apart than the maximum
distance, because those pairs are no longer discarded.
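As a rough illustration of the cost, the sketch below counts how many distance pairs are kept with and without the pruning check. The class PairCountSketch, the method retainedPairs, and the evenly spaced one-dimensional points are hypothetical and chosen only to make the arithmetic easy to follow. They say nothing about the actual data structures used in the clusterers.

```java
public class PairCountSketch {

    // Hypothetical helper: counts the pairs that survive the optional
    // pruning check. Distance is assumed to be |points[i] - points[j]|.
    static int retainedPairs(double[] points,
                             double maxDistance,
                             boolean prune) {
        int count = 0;
        for (int i = 0; i < points.length; ++i) {
            for (int j = i + 1; j < points.length; ++j) {
                double distanceIJ = Math.abs(points[i] - points[j]);
                if (prune && distanceIJ > maxDistance) continue;
                ++count;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Ten points spaced 10 apart: 0, 10, ..., 90.
        double[] points = new double[10];
        for (int i = 0; i < points.length; ++i) points[i] = 10.0 * i;

        // With a bound of 15, only neighboring points are within range.
        System.out.println(retainedPairs(points, 15.0, true));  // 9
        System.out.println(retainedPairs(points, 15.0, false)); // 45
    }
}
```

Here pruning keeps 9 of the 45 possible pairs. Without pruning, all n(n-1)/2 pairs are kept, which is the price of not losing isolated elements.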
I found this through a very helpful anonymous
user who sent not only a bug report but a unit test
that failed when it should've succeeded.
- Bob Carpenter