Re: [boost] Re: new component, extent_set<>
- joel de guzman wrote:
> From: "Nathan Myers" :Thank you, Joel, for your patience with me.
> > Joel's implementation ... uses a sorted
> > vector of ranges, rather than a map indexed on the first element. As a
> > result, insert and erase operations involve block moves, but memory usage
> > is tight, where in extent_set<> everything involves tree operations.
> > Joel's uses reference counting, with copy-on-write to enable efficient
> > copy/assignment.
> > The important differences are more subtle. Joel's interface emphasizes
> > construction, copying, combining and membership tests; extent_set<>
> > emphasizes inserting and erasing ranges, and scanning the ranges present.
> > ...
> The actual implementation has two layers. You forgot to mention
> the lower layer which is the actual range_run and range classes.
> The range_run class (which is the real engine) has only a few
> member functions. Most importantly, 1) test, 2) set 3) clear.
> test tests for membership. set and clear, ummm, sets and clears
> a range. It also exposes the range iterators. This class is not ref-
> counted. This class is standalone and is not tied to any parser code
> at all. I intend this to be the basis of a more generalized submission.
I hadn't thought of them as separable components; I got thrown off
by the reference counting. The range_run<> type is indeed very similar
to extent_set<>, and more complete.
> I may be wrong, but based on what I saw initially, extent_set doesSuch functionality would be easy to add to extent_set<>, but would involve
> not allow insertion and removal of overlapping ranges. The test
> program does not perform any. For instance, is it possible to
> do something like this?
> s.set(range(0, 100)); // result = [0..100]
> s.add(range(200, 300)); // result = [0..100] [200,300]
> s.set(range(50, 150)); // result = [0..300]
> or given the current state: [0..10] [20..30] [40..50] [60..70]
> s.set(range(0, 100)); // result = [0..100]
> and this?
> set.set(range(min_int, max_int)); // min_int..max_int
> set.clear(range(-100,100)); // [min_int..-100] [100..max_int]
a loop to implement. Maybe the loop would only run one cycle or be skipped
entirely in the non-overlapping case. However, it would not be quite so
>> The preconditions radically simplify the logic of the insert and eraseI agree, we could merge range_set and extent_set with probably minimal
>> members. They correspond, logically, to the requirement the same memory
>> not be passed to free() twice. More general functions that don't impose
>> such preconditions must contain a loop. They would be easy to add (along
>> with general set union, intersection, and complement-intersection), and I
>> expect that will happen eventually, but they would not be a substitute for
>> the more basic operations.
> If I read this correctly, Spirit's low level range_run class might be the
> eventual result you are looking for?
> BTW, your other concerns e.g. less-than < vs. generic compare is of
> course trivial to implement. Finally, I am still not sure if a map is a good
> choice, performance wise, nevertheless, a higher level class might
> use a policy to choose between the vector/map implementation.
upset to either. A policy argument for whether to use a vector or a map
to represent it, to trade off space against update speed, also seems
workable, in principle.
An important difference is that with a tree representation, incrementing
or decrementing the iterator can be pretty expensive, so the algorithms
need to be careful not to do it unnecessarily. (That is why extent_set<>
stores its extents in reverse order; map<>::lower_bound gives the right
Judging from the postings by Darin, Howard, and Dave, this represents one
of a family of very similar interesting structures. What is common to
them all is not so much their range interface, but rather that they
interpolate coverage of their domain and/or range. They store one
or both endpoints depending on whether they are continuous, and use
a slope of one or zero. (Dave suggested allowing other slopes; this
would allow mapping from, e.g., block number to byte offset. He also
mentioned bidirectional mapping. That might be hard to shoehorn into
a common structure.)
Darin's IncrementalRangeMap, my extent_map<>, and Howard's range_map<>
are almost the same.
Darin's RangeMap differs from the other examples in having a 0 slope; it
could tolerate a non-numeric type in its, er, range. RangeSet, extent_set,
and range_run have what amounts to a boolean, um, range. In all four
cases arithmetic on the domain type is limited to, at most, incrementing
Treatment of argument domains that overlap existing mappings varies from
"Precondition: not", with elegantly minimal operator code, to "anything
goes". I see no reason not to have both interfaces; the difference is
usually more a matter of preference -- minimalism vs. generality -- than
of semantics or performance.
(BTW, I find the "range" term difficult to use around mapping primitives,
given the existing domain/range usage in the same context. That's another
part of why I have preferred to say "extent".)
It's not clear to me whether all these should be unified under a policy
parameter, or just presented as a family of components. Probably the
latter, although they might be able to share some implementation code.
in the way that std::set<> and std::map<> do.
Dave's cumulative extent tree, a 1D case of Beman's RTree and
Reid's 3D structures, is another productive area to explore.
Happy New Year to all,
ncm at cantrip dot org
----- Original Message -----
From: "Nathan Myers" :
> BTW^2 (speaking of work), trees keyed on strings group entries with identical
> prefixes together, so that as you walk down them or rebalance them you are
> comparing the same substrings over and over before you get to the parts that
> differ. Specializing trees for the case of string keys to identify where
> the difference begins, should speed up string-keyed trees dramatically.
> (I think the extremum of such an optimization is a lexicographic tree.)
BTW, symbol tables in Spirit are implemented using ternary state
trees (TSTs). Common prefixes are shared. Searching for a string
of length k in a ternary search tree with n strings will require at most
O(log n+k) *character* comparisons. Quite nifty indeed. The data
structure is faster than hashing for many typical search problems
especially when the search interface is iterator based. Instead of
passing in a string in its "find" function, you can just pass in first-last
iterators straight from the stream, no need to pre-extract anything.
TSTs are many times faster than hash tables for unsuccessful searches
since mismatches are discovered earlier after examining only a few
characters. Hash tables always examine an entire key when searching.