Loading ...
Sorry, an error occurred while loading the content.

Re: [boost] Re: new component, extent_set<>

Expand Messages
  • Nathan Myers
    ... Thank you, Joel, for your patience with me. I hadn t thought of them as separable components; I got thrown off by the reference counting. The range_run
    Message 1 of 32 , Dec 31, 2001
    • 0 Attachment
      joel de guzman wrote:
      > From: "Nathan Myers" :
      > > Joel's implementation ... uses a sorted
      > > vector of ranges, rather than a map indexed on the first element. As a
      > > result, insert and erase operations involve block moves, but memory usage
      > > is tight, where in extent_set<> everything involves tree operations.
      > > Joel's uses reference counting, with copy-on-write to enable efficient
      > > copy/assignment.
      > >
      > > The important differences are more subtle. Joel's interface emphasizes
      > > construction, copying, combining and membership tests; extent_set<>
      > > emphasizes inserting and erasing ranges, and scanning the ranges present.
      > > ...
      > The actual implementation has two layers. You forgot to mention
      > the lower layer which is the actual range_run and range classes.
      > The range_run class (which is the real engine) has only a few
      > member functions. Most importantly, 1) test, 2) set 3) clear.
      > test tests for membership. set and clear, ummm, sets and clears
      > a range. It also exposes the range iterators. This class is not ref-
      > counted. This class is standalone and is not tied to any parser code
      > at all. I intend this to be the basis of a more generalized submission.

      Thank you, Joel, for your patience with me.

      I hadn't thought of them as separable components; I got thrown off
      by the reference counting. The range_run<> type is indeed very similar
      to extent_set<>, and more complete.

      > I may be wrong, but based on what I saw initially, extent_set does
      > not allow insertion and removal of overlapping ranges. The test
      > program does not perform any. For instance, is it possible to
      > do something like this?
      >
      > s.set(range(0, 100)); // result = [0..100]
      > s.add(range(200, 300)); // result = [0..100] [200,300]
      > s.set(range(50, 150)); // result = [0..300]
      >
      > or given the current state: [0..10] [20..30] [40..50] [60..70]
      >
      > s.set(range(0, 100)); // result = [0..100]
      >
      > and this?
      >
      > set.set(range(min_int, max_int)); // min_int..max_int
      > set.clear(range(-100,100)); // [min_int..-100] [100..max_int]

      Such functionality would be easy to add to extent_set<>, but would involve
      a loop to implement. Maybe the loop would only run one cycle or be skipped
      entirely in the non-overlapping case. However, it would not be quite so
      elegant :-).

      >> The preconditions radically simplify the logic of the insert and erase
      >> members. They correspond, logically, to the requirement the same memory
      >> not be passed to free() twice. More general functions that don't impose
      >> such preconditions must contain a loop. They would be easy to add (along
      >> with general set union, intersection, and complement-intersection), and I
      >> expect that will happen eventually, but they would not be a substitute for
      >> the more basic operations.
      >
      > If I read this correctly, Spirit's low level range_run class might be the
      > eventual result you are looking for?
      >
      > BTW, your other concerns e.g. less-than < vs. generic compare is of
      > course trivial to implement. Finally, I am still not sure if a map is a good
      > choice, performance wise, nevertheless, a higher level class might
      > use a policy to choose between the vector/map implementation.

      I agree, we could merge range_set and extent_set with probably minimal
      upset to either. A policy argument for whether to use a vector or a map
      to represent it, to trade off space against update speed, also seems
      workable, in principle.

      An important difference is that with a tree representation, incrementing
      or decrementing the iterator can be pretty expensive, so the algorithms
      need to be careful not to do it unnecessarily. (That is why extent_set<>
      stores its extents in reverse order; map<>::lower_bound gives the right
      answer, then.)

      A survey...

      Judging from the postings by Darin, Howard, and Dave, this represents one
      of a family of very similar interesting structures. What is common to
      them all is not so much their range interface, but rather that they
      interpolate coverage of their domain and/or range. They store one
      or both endpoints depending on whether they are continuous, and use
      a slope of one or zero. (Dave suggested allowing other slopes; this
      would allow mapping from, e.g., block number to byte offset. He also
      mentioned bidirectional mapping. That might be hard to shoehorn into
      a common structure.)

      Darin's IncrementalRangeMap, my extent_map<>, and Howard's range_map<>
      are almost the same.

      Darin's RangeMap differs from the other examples in having a 0 slope; it
      could tolerate a non-numeric type in its, er, range. RangeSet, extent_set,
      and range_run have what amounts to a boolean, um, range. In all four
      cases arithmetic on the domain type is limited to, at most, incrementing
      or decrementing.

      Treatment of argument domains that overlap existing mappings varies from
      "Precondition: not", with elegantly minimal operator code, to "anything
      goes". I see no reason not to have both interfaces; the difference is
      usually more a matter of preference -- minimalism vs. generality -- than
      of semantics or performance.

      (BTW, I find the "range" term difficult to use around mapping primitives,
      given the existing domain/range usage in the same context. That's another
      part of why I have preferred to say "extent".)

      It's not clear to me whether all these should be unified under a policy
      parameter, or just presented as a family of components. Probably the
      latter, although they might be able to share some implementation code.
      in the way that std::set<> and std::map<> do.

      Dave's cumulative extent tree, a 1D case of Beman's RTree and
      Reid's 3D structures, is another productive area to explore.

      Happy New Year to all,

      Nathan Myers
      ncm at cantrip dot org
    • joel de guzman
      ... BTW, symbol tables in Spirit are implemented using ternary state trees (TSTs). Common prefixes are shared. Searching for a string of length k in a ternary
      Message 32 of 32 , Jan 2, 2002
      • 0 Attachment
        ----- Original Message -----
        From: "Nathan Myers" :

        >
        > BTW^2 (speaking of work), trees keyed on strings group entries with identical
        > prefixes together, so that as you walk down them or rebalance them you are
        > comparing the same substrings over and over before you get to the parts that
        > differ. Specializing trees for the case of string keys to identify where
        > the difference begins, should speed up string-keyed trees dramatically.
        > (I think the extremum of such an optimization is a lexicographic tree.)

        BTW, symbol tables in Spirit are implemented using ternary state
        trees (TSTs). Common prefixes are shared. Searching for a string
        of length k in a ternary search tree with n strings will require at most
        O(log n+k) *character* comparisons. Quite nifty indeed. The data
        structure is faster than hashing for many typical search problems
        especially when the search interface is iterator based. Instead of
        passing in a string in its "find" function, you can just pass in first-last
        iterators straight from the stream, no need to pre-extract anything.
        TSTs are many times faster than hash tables for unsuccessful searches
        since mismatches are discovered earlier after examining only a few
        characters. Hash tables always examine an entire key when searching.

        --Joel
      Your message has been successfully submitted and would be delivered to recipients shortly.