Loading ...
Sorry, an error occurred while loading the content.

patch for bugs in complete and single-link clustering with finite distance bounds

Expand Messages
  • Bob Carpenter
    The bug ... There are bugs in both single and complete link clustering that arise only when there are elements that are further than the distance bound from
    Message 1 of 1 , Mar 7, 2011
    View Source
    • 0 Attachment
      The bug
      -------
      There are bugs in both single and complete link
      clustering that arise only when there are elements that
      are further than the distance bound from all other
      elements. Thus the problem only arises with the
      two-argument constructors -- the one-argument constructors
      set this bound to positive infinity.

      The cause
      ---------
      The cause is that I was pruning the set of distance
      pairs by removing any pair beyond the maximum distance.
      Elements then get lost if they are not within the max distance
      bound of any other element.

      The patches
      -----------

      In both cases, within the hierarchicalCluster() method:

      CompleteLinkClusterer.java
      remove line:
      if (score > maxDistance) continue;

      SingleLinkClusterer.java
      remove line:
      if (distanceIJ > maxDistance) continue;


      The downside
      ------------
      Unfortunately, while the fix makes the methods match the
      documentation and do a complete hierarchical clustering,
      it will be slower and use more memory in the case where
      there are pairs of elements further away from one another
      than the maximum distance.

      Thank You
      ---------
      I found this through a very helpful anonymous
      user who sent not only a bug report but a unit test
      that failed when it should've succeeded.

      - Bob Carpenter
      LingPipe
    Your message has been successfully submitted and would be delivered to recipients shortly.