Re: Universal Binary JSON Specification
Great feedback so far, I have a few thoughts on the subject:
1. The hard-to-measure value of a specification being simple and immediately grok'able is more important than total coverage. I think we've all seen that any number of times... for example XML vs JSON. XML defines support for every possible data structure known and unknown through schema references. The *need* to become so incredibly verbose sent people screaming into the arms of a simpler format (JSON) at the first sign of alternatives.
2. As soon as a specification, of any kind, delves into concepts that are not immediately mappable to a mental model you are already familiar with, I would say assimilation of the concepts slows down about 4x.
3. Theoretically I agree with you 110% that the format needs to natively support arbitrarily large numeric formats to be successful in all sorts of use cases. There is absolutely no argument here, every reason you have given is spot on.
4. BUT, I have a very strong feeling (I don't know why... divine intervention maybe) that the addition of these two arbitrary types that are unfamiliar to most people writing software today, could be *just* strange enough to seriously slow down assimilation of a new data format.
e.g. "int, ok got it... double, yep use that all the time, String, yea makes sense, BigInt... wait... what is an arbitrarily long number? I don't get it, does <MY_LANG> support that? I've never used one... how do you convert a byte into a *number*... weird... I gotta go read some docs now"
That is exaggerated for sure, but you see what I am getting at. Because we are operating at the spec level, the nature of the work is to nit-pick and poke and prod and make sure every 'i' is dotted and 't' is crossed. That is good, but there is a point at which the intersection between what a spec provides and what people want peaks, and pushing past that point starts to erode the success of the *entire* spec, unfortunately.
I have a very strong feeling that the complexity of arbitrarily sized numbers is exactly that apex at which returns start to fall off for the greater good.
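As an aside on the "how do you convert a byte into a *number*" worry from the quote above: in languages with built-in arbitrary-precision integers it is a one-liner. A quick sketch in Python, purely illustrative and not part of the spec:

```python
# A 13-byte big-endian payload holding a value far too large for int64.
payload = (2 ** 100 + 7).to_bytes(13, byteorder="big")

# Python ints are arbitrary precision, so decoding needs no special BigInt type.
value = int.from_bytes(payload, byteorder="big")
assert value == 2 ** 100 + 7
assert value.bit_length() == 101  # well beyond 64 bits
```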
5. Given #3 and #4, I want to define the BigInt and BigDecimal support as proposals and add them to the specification on the site and let people discuss them further until there is a strong preference for or against them.
I don't want you to think that I disagree with you; I don't. It is just this very strong nagging gut feeling that I have to honor in the name of simplicity.
6. I would make the argument that if you took the group of people ALL using JSON as a data interchange format, say 100,000 people, the number of people in that group using BigInts and BigDecimals to exchange data between two internal systems that both support those numeric formats is... a small percentage. (This leads into point #7.)
7. Simplicity is what will make this specification succeed over other, potentially faster specs. JSON never won the format war because it was fast or more efficient... it won because it was so unbelievably easy to use.
I could sit down with a C developer and an Erlang developer and say, "OK guys, my web service is going to generate replies like THIS; you two need to process that and send me back results that look like that too."
There is no discussion of namespaces, schemas, DTDs, encoding, Dublin Core or endianness... it was like describing a CSV file format, just with braces.
I am trying to model that in binary, in more than just data representation but also in spirit. That is why some of the binary representations are possibly 1 or 2 bytes longer than they could be if maximally optimized, and why simple human-readable char markers were chosen for easy discovery in a hex editor.
It is my belief that the utter simplicity of describing a single layout (marker-size-data) that maps to known types in almost every modern language is what will make this work well.
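To make the marker-size-data idea concrete, here is a minimal sketch for a string value. The 'S' marker and the fixed 4-byte big-endian length are assumptions for the sketch, not normative spec details:

```python
import struct

def encode_string(s: str) -> bytes:
    """Encode a string as marker ('S') + 4-byte big-endian length + UTF-8 data."""
    data = s.encode("utf-8")
    return b"S" + struct.pack(">I", len(data)) + data

def decode_string(buf: bytes) -> str:
    """Decode the marker-size-data layout produced by encode_string."""
    assert buf[0:1] == b"S", "unexpected marker"
    (length,) = struct.unpack(">I", buf[1:5])
    return buf[5:5 + length].decode("utf-8")

encoded = encode_string("hello")
assert encoded == b"S\x00\x00\x00\x05hello"
assert decode_string(encoded) == "hello"
```

Every type follows the same single pattern, which is what makes a first implementation in any language an afternoon's work rather than a project.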
This may limit the Universal Binary JSON Spec from being the ultimate binary data format, but there are other more specific and difficult-to-use specs that offer faster performance if that level of detail is what you need (e.g. protobuf comes to mind).
My goal is to create the every-man's binary format just like JSON became the every-man's data interchange format.
It isn't for everybody, but it works wonderfully for a whole lot of people.
Thank you again for the well-thought-out feedback, Don.
--- In firstname.lastname@example.org, Don Owens <don@...> wrote:
> I forgot to add that encoders should only use the big number format if the
> number is too big to fit in int64 (or int32, depending on which will be the
> largest in the spec) or a double. That way, if a decoder can't handle a
> number larger than int64 anyway, it does not need to implement decoding of
> big numbers -- you don't want a number that will fit in an int32 put into a
> big number format anyway.
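The fallback rule Don describes is easy to state in code. A sketch, assuming illustrative markers 'l' (int32), 'L' (int64) and 'G' (big integer); the marker letters are placeholders, not spec text:

```python
INT32_MIN, INT32_MAX = -2 ** 31, 2 ** 31 - 1
INT64_MIN, INT64_MAX = -2 ** 63, 2 ** 63 - 1

def pick_integer_marker(n: int) -> str:
    """Use the smallest fixed-width integer that fits; fall back to the
    big-number format only when the value overflows int64."""
    if INT32_MIN <= n <= INT32_MAX:
        return "l"  # 32-bit integer
    if INT64_MIN <= n <= INT64_MAX:
        return "L"  # 64-bit integer
    return "G"      # arbitrary-precision big integer

assert pick_integer_marker(42) == "l"
assert pick_integer_marker(2 ** 40) == "L"
assert pick_integer_marker(2 ** 100) == "G"
```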
> On Thu, Sep 22, 2011 at 7:15 AM, Don Owens <don@...> wrote:
> > Yes, that is what I was getting at. But see comments embedded.
> > On Wed, Sep 21, 2011 at 7:50 PM, rkalla123 <rkalla@...> wrote:
> >> Don,
> >> I see your point. The way I understand it is that this would require 2 new
> >> data types, effectively BigInt and BigDecimal.
> >> So say something along these lines:
> >> bigint - marker 'G'
> >> [G][129 big-endian ordered bytes representing a BigInt]
> > It should be mentioned that they are signed ints, but doing two's
> > complement and such is probably too much work. Maybe just specify that the
> > first bit always represents the sign (0 for positive, 1 for negative).
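A sketch of that sign-magnitude idea in Python, for illustration only (the exact bit layout here is an assumption, not the proposal text):

```python
def encode_sign_magnitude(n: int) -> bytes:
    """Sign-magnitude big-endian encoding: the high bit of the first byte is
    the sign (0 = positive, 1 = negative); remaining bits are the magnitude."""
    mag = abs(n)
    nbytes = max(1, (mag.bit_length() + 8) // 8)  # reserve room for the sign bit
    out = bytearray(mag.to_bytes(nbytes, "big"))
    if n < 0:
        out[0] |= 0x80
    return bytes(out)

def decode_sign_magnitude(buf: bytes) -> int:
    negative = bool(buf[0] & 0x80)
    mag = int.from_bytes(bytes([buf[0] & 0x7F]) + buf[1:], "big")
    return -mag if negative else mag

for n in (0, 1, -1, 2 ** 130, -(2 ** 130)):
    assert decode_sign_magnitude(encode_sign_magnitude(n)) == n
```

This avoids two's complement entirely at the cost of one reserved bit.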
> >> bigdouble - marker 'W'
> >> [W][222 big-endian ordered bytes representing a BigDecimal]
> > BigDecimal should probably be renamed to something like BigFloat, since
> > decimal is ambiguous (used to mean base-10 and floating point). I'm less
> > familiar with large floating point, but I think a floating point number
> > should consist of a sign bit plus two integers (one for the
> > mantissa/significand and one for the exponent). In the interest of space
> > savings, I think the sign bit should just be included in the exponent and
> > order things so they look similar to the IEEE 754 spec, e.g.,
> > [W][3 big-endian ordered bytes (where first bit is sign bit) of
> > exponent][222 big-endian ordered bytes of mantissa]
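A sketch of decoding that [sign+exponent][mantissa] layout, assuming a base-2, non-negative exponent and the 3-byte exponent field from the example above (all field widths and names here are illustrative; how to represent negative exponents is a detail the proposal leaves open):

```python
from fractions import Fraction

def decode_bigfloat(exp_bytes: bytes, mantissa_bytes: bytes) -> Fraction:
    """Interpret [sign+exponent][mantissa] as (-1)^sign * mantissa * 2^exponent,
    where the top bit of the exponent field is the sign of the whole number."""
    sign = -1 if exp_bytes[0] & 0x80 else 1
    exponent = int.from_bytes(bytes([exp_bytes[0] & 0x7F]) + exp_bytes[1:], "big")
    mantissa = int.from_bytes(mantissa_bytes, "big")
    return sign * mantissa * Fraction(2) ** exponent

assert decode_bigfloat(b"\x00\x00\x05", b"\x00\x03") == 96   # 3 * 2^5
assert decode_bigfloat(b"\x80\x00\x05", b"\x00\x03") == -96  # sign bit set
```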
> >> Thoughts?
> > In terms of the documentation, I think the big integers and floats should
> > be qualified with a "should implement" instead of a "must implement", since,
> > as others have mentioned, not every encoder and decoder will be able to
> > handle these. I think this matches JSON implementations well. If an
> > encoder does not handle large numbers, it could just throw an error, just as
> > it should throw an error now if an oversized number is encountered in JSON.
> > The same goes for the decoder side. If there is no good way to represent a
> > large number in the language you are working in, throw an error indicating
> > that the number is too large.
> > Have you looked into using variable-length integers for length specifiers?
> > If you have a lot of short strings (or big numbers, etc.) in your data,
> > these could significantly reduce your space usage (at the cost of more
> > complexity for the developer and CPU). There should be a balance between
> > space efficiency and complexity. Thoughts?
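For reference, the most common variable-length integer scheme works like this (an LEB128-style varint with 7 payload bits per byte; one possible scheme, not something the spec currently defines):

```python
def encode_varint(n: int) -> bytes:
    """7 payload bits per byte; the high bit is set on every byte except the last."""
    if n < 0:
        raise ValueError("length specifiers are non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf: bytes) -> int:
    result = shift = 0
    for byte in buf:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return result

assert encode_varint(5) == b"\x05"   # short lengths cost a single byte
assert len(encode_varint(300)) == 2  # vs. a fixed 4-byte length field
assert decode_varint(encode_varint(300)) == 300
```

The space win on short strings is real, but so is the cost: length fields are no longer fixed-width, so a decoder can no longer skip ahead without parsing them.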
> >> --- In email@example.com, Don Owens <don@> wrote:
> >> >
> >> > I've seen very large numbers used in JSON. In Perl, that can be
> >> represented
> >> > as a Math::BigInt object. And that is the way I have implemented it in
> >> my
> >> > JSON module for Perl (JSON::DWIW). Python has arbitrary length integers
> >> > built-in. For my own language that I'm working on, I'm using libgmp in C
> >> to
> >> > handle arbitrary length integers.
> >> >
> >> > JSON is used as a data exchange format. I want to be able to do a
> >> > roundtrip, e.g., Python -> encoded -> Python with native integers (with
> >> > arbitrary length in this case). In JSON, this just works, as far as the
> >> > encoding is concerned. I see the need for this in any binary JSON format
> >> as
> >> > well. If a large number is represented as a string, then on the decoding
> >> > side, you don't know if that was a number or a string (just because it
> >> looks
> >> > like a number doesn't mean that the sender means it's a number). If,
> >> when
> >> > decoding JSON, the library can't handle large numbers, it has to throw
> >> an
> >> > error anyway. The same should go for binary JSON.
> >> >
> >> > ./don
> > --
> > Don Owens
> > don@...
> Don Owens
On Mon, Feb 20, 2012 at 9:42 AM, rkalla123 <rkalla@...> wrote:
> Stephan,
> No problem; your feedback is still very applicable and much appreciated.
> The additional view-point on the signed/unsigned issue was exactly what I was hoping for. My primary goal has always been simplicity and I know at least from the Java world, going with unsigned values would have made the impl distinctly *not* simple (and an annoying API).
> So I am glad to get some validation there that I am not alienating every other language at the cost of Java.
-+ Tatu +-