
Re: [json] Re: Universal Binary JSON Specification

  • Milo Sredkov
    Message 1 of 76, Sep 22, 2011
      Hello Riyad, Stephan, Don, Tatu, and all group members,

      I recently analysed about 70 of the libraries linked from json.org (almost
      all listed in the C++, C, Java, Python, Haskell, JavaScript, Ruby, C#, PHP,
      and Lisp sections) and would like to share some opinions about the presented
      specification, and also about some of the topics that arose in the discussion
      so far.

      First, I'm really happy to see people trying to do cool things in favour of
      the JSON community – a simple and efficient binary JSON representation is
      for sure a cool thing from which we can all benefit. Although you will
      probably need to do more work in order to show everyone that this format is
      *the one*, initiating a discussion is probably the right thing to do.

      IMHO there are two things, already pointed out by the others, which I find
      disturbing. First, as other binary JSON representations already exist, you
      need to position the Universal Binary JSON format very clearly against the
      alternatives, especially the aforementioned Smile, which seems to pursue
      very similar goals. It's obvious that the proposed format is simple, and
      that this may be its unique strength, but unless you clearly
      (quantitatively) show exactly how much simpler it is, people will not
      hurry to adopt it. What's more, with several binary JSON formats already
      in the wild, failing to persuade the community that the new one is
      superior will not only make your effort unsuccessful, it will also make
      things worse by introducing additional fragmentation.

      The second issue is about the numbers. This one is actually an issue of JSON
      itself, and more specifically of the fact that JSON is specified only at
      the syntax level and lacks a commonly accepted data model (or meta-model,
      information model, etc.) specifying the set of information that can be
      encoded in it. The JSON specification just states how numbers are written.
      It does not state whether 10, 10.0, and 1E1 are different numbers, nor
      does it say how large the numbers can be, or whether the concrete way in
      which they are written is important. As a result, across the libraries for
      the 10 programming languages I mentioned, there are huge variations in the
      supported number ranges and formats. Some distinguish integers from
      floats, others don't; some expose the concrete string in which the number
      was written, others don't; and so on. Most importantly, the supported
      ranges vary from library to library, from 30-bit (not 32) signed integers
      up to unlimited-precision decimal numbers.
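
      To make the variation concrete, here is a small Java illustration (my own
      example, not taken from any of the libraries) of how the same JSON number
      survives or loses precision depending on the native type a parser picks:

          public class NumberPrecision {
              public static void main(String[] args) {
                  // 9007199254740993 (2^53 + 1) fits in a 64-bit integer, but
                  // not in a 64-bit IEEE 754 double (53-bit significand).
                  long asLong = Long.parseLong("9007199254740993");
                  double asDouble = Double.parseDouble("9007199254740993");

                  System.out.println(asLong);            // 9007199254740993
                  System.out.printf("%.0f%n", asDouble); // 9007199254740992
              }
          }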

      Having said that, although you are not the one who is responsible for the
      situation, you should really treat the numbers very carefully. In my
      opinion, thinking language-neutrally, there are only 2 strategies (or data
      models) that make sense. The first, which many people imply because of
      JSON's origin, is to assume JavaScript semantics – that is, that numbers
      are 64-bit IEEE 754 floating-point numbers. This makes things really
      simple, but is not suitable for applications where rounding errors are not
      tolerable, e.g. storing monetary values. The second approach is to assume
      (unlimited) decimal numbers – tools are free to have their limitations, but
      any real number that can be encoded as a finite decimal fraction is
      supported by the specification, and tools try their best to deliver it
      without any loss of precision. This approach makes the most sense to me –
      it allows JSON to be used for a large number of applications. However, it
      contrasts with JSON's idea of being the intersection of modern programming
      languages, not their union. Adopting it means the following:
      * There is no reason for having big integers at the format level; decimal
      numbers alone should be enough (1.0 == 1 == 1.00e0)
      * The semantics and precision guarantees of the "double" encoding should be
      very carefully and strictly defined. Keep in mind that even simple decimal
      values like 1.7 cannot be expressed exactly in binary IEEE 754 floats (see
      the sketch after this list)
      * +0, -0, and 0.0e0 are the same value, and according to the rule of
      picking the smallest suitable type, they should be encoded as "byte"
      * "BigDecimal should probably be renamed to something like BigFloat" may
      not be a good idea – first, parsing binary floats with arbitrary precision
      is not easy or commonly supported, and second, decimal precision
      guarantees suit most high-precision applications better.
      * The encoding of decimal numbers should be very carefully specified.
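
      As a quick Java sketch of the points above (new BigDecimal(double)
      exposes the value a double actually stores):

          import java.math.BigDecimal;

          public class DecimalModel {
              public static void main(String[] args) {
                  // 1.7 has no exact binary IEEE 754 representation:
                  System.out.println(new BigDecimal(1.7));
                  // -> 1.6999999999999999555910790149937383830547332763671875

                  // An exact decimal representation keeps it intact:
                  System.out.println(new BigDecimal("1.7")); // -> 1.7

                  // +0, -0 and 0.0e0 denote the same value under a decimal
                  // data model, so the smallest-type rule would encode all
                  // of them as a single "byte" zero.
                  System.out.println(new BigDecimal("0.0e0")
                          .compareTo(BigDecimal.ZERO) == 0); // -> true
              }
          }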

      Btw, I hope that in a few days I will publish the exact results of the
      analysis I mentioned, which is actually a by-product of an effort to define
      a strict data model for JSON.

      Milo Sredkov

      On Thu, Sep 22, 2011 at 5:33 PM, Don Owens <don@...> wrote:

      > I forgot to add that encoders should only use the big number format if the
      > number is too big to fit in int64 (or int32, depending on which will be the
      > largest in the spec) or a double. That way, if a decoder can't handle a
      > number larger than int64 anyway, it does not need to implement decoding of
      > big numbers -- you don't want a number that will fit in an int32 put into a
      > big number format anyway.
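
      (A minimal Java sketch of the selection rule Don describes; the marker
      letters and the helper itself are illustrative, not taken from the spec:)

          import java.math.BigInteger;

          public class MarkerPicker {
              // Pick the smallest integer encoding that holds the value
              // exactly; fall back to the big-number format only when even
              // int64 overflows. bitLength() excludes the sign bit, so a
              // value fits a signed n-bit type when bitLength() < n.
              static char pickIntegerMarker(BigInteger v) {
                  if (v.bitLength() < 8)  return 'B'; // byte
                  if (v.bitLength() < 16) return 'i'; // int16
                  if (v.bitLength() < 32) return 'I'; // int32
                  if (v.bitLength() < 64) return 'L'; // int64
                  return 'G';                         // big integer
              }
          }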
      > On Thu, Sep 22, 2011 at 7:15 AM, Don Owens <don@...> wrote:
      > > Yes, that is what I was getting at. But see comments embedded.
      > >
      > > On Wed, Sep 21, 2011 at 7:50 PM, rkalla123 <rkalla@...> wrote:
      > >
      > >>
      > >> Don,
      > >>
      > >> I see your point. The way I understand it is that this would require 2
      > >> new data types, effectively BigInt and BigDecimal.
      > >>
      > >> So say something along these lines:
      > >>
      > >> bigint - marker 'G'
      > >> [G][129][129 big-endian ordered bytes representing a BigInt]
      > >>
      > > It should be mentioned that they are signed ints, but doing two's
      > > complement and such is probably too much work. Maybe just specify that
      > > the first bit always represents the sign (0 for no sign, 1 for minus).
      > >
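
      (A rough Java sketch of that sign-magnitude layout, assuming the 'G'
      marker and a single length byte as in the example above; the method name
      is hypothetical:)

          import java.io.ByteArrayOutputStream;
          import java.math.BigInteger;

          public class BigIntCodec {
              // [G][length][big-endian magnitude, first bit = sign]
              static byte[] encodeBigInt(BigInteger v) {
                  // toByteArray() on the absolute value prepends a 0x00 byte
                  // whenever the high bit is set, so the first bit is always
                  // free to carry the sign.
                  byte[] mag = v.abs().toByteArray();
                  if (v.signum() < 0) mag[0] |= 0x80; // 1 = minus
                  ByteArrayOutputStream out = new ByteArrayOutputStream();
                  out.write('G');
                  out.write(mag.length); // assumes length <= 255
                  out.write(mag, 0, mag.length);
                  return out.toByteArray();
              }
          }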
      > >
      > >> bigdouble - marker 'W'
      > >> [W][222][222 big-endian ordered bytes representing a BigDecimal]
      > >>
      > >>
      > > BigDecimal should probably be renamed to something like BigFloat, since
      > > decimal is ambiguous (used to mean base-10 and floating point). I'm less
      > > familiar with large floating point, but I think a floating point number
      > > should consist of a sign bit plus two integers (one for the
      > > mantissa/significand and one for the exponent). In the interest of space
      > > savings, I think the sign bit should just be included in the exponent and
      > > order things so they look similar to the IEEE 754 spec, e.g.,
      > >
      > > [W][3][3 big-endian ordered bytes (where first bit is sign bit) of
      > > exponent][222][222 big-endian ordered bytes of mantissa]
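
      (Whatever the final bit layout, the decomposition Don describes is
      already how Java's BigDecimal works, assuming a base-10 exponent – his
      IEEE-754-like variant would use base 2 instead. A quick sketch:)

          import java.math.BigDecimal;
          import java.math.BigInteger;

          public class Decompose {
              public static void main(String[] args) {
                  // value = unscaledValue * 10^(-scale)
                  BigDecimal v = new BigDecimal("-12.345");
                  BigInteger mantissa = v.unscaledValue(); // -12345
                  int exponent = -v.scale();               // -3

                  // The encoder only has to serialize two big-endian
                  // integers (plus a sign); the decoder rebuilds the value:
                  BigDecimal back = new BigDecimal(mantissa, -exponent);
                  System.out.println(back); // -12.345
              }
          }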
      > >
      > >
      > >
      > >> Thoughts?
      > >>
      > >
      > > In terms of the documentation, I think the big integers and floats
      > > should be qualified with a "should implement" instead of a "must
      > > implement", since, as others have mentioned, not every encoder and
      > > decoder will be able to handle these. I think this matches JSON
      > > implementations well. If an encoder does not handle large numbers, it
      > > could just throw an error, just as it should throw an error now if an
      > > oversized number is encountered in JSON. The same goes for the decoder
      > > side. If there is no good way to represent a large number in the
      > > language you are working in, throw an error indicating that the number
      > > is too large.
      > >
      > > Have you looked into using variable-length integers for length
      > > specifiers? If you have a lot of short strings (or big numbers, etc.)
      > > in your data, these could significantly reduce your space usage (at
      > > the cost of more complexity for the developer and CPU). There should
      > > be a balance between space efficiency and complexity. Thoughts?
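
      (For reference, the usual base-128 varint scheme – the kind Protocol
      Buffers uses – looks like this in Java; it is one possible trade-off,
      not something in the proposed spec:)

          import java.io.ByteArrayOutputStream;

          public class Varint {
              // 7 payload bits per byte; the high bit means "more bytes
              // follow". Lengths under 128 cost one byte; 16384 costs three.
              static void writeVarint(ByteArrayOutputStream out, long value) {
                  while ((value & ~0x7FL) != 0) {
                      out.write((int) ((value & 0x7F) | 0x80));
                      value >>>= 7;
                  }
                  out.write((int) value);
              }
          }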
      > >
      > >
      > >
      > >>
      > >>
      > >
      > >> --- In json@yahoogroups.com, Don Owens <don@...> wrote:
      > >> >
      > >> > I've seen very large numbers used in JSON. In Perl, that can be
      > >> > represented as a Math::BigInt object. And that is the way I have
      > >> > implemented it in my JSON module for Perl (JSON::DWIW). Python has
      > >> > arbitrary length integers built-in. For my own language that I'm
      > >> > working on, I'm using libgmp in C to handle arbitrary length
      > >> > integers.
      > >> >
      > >> > JSON is used as a data exchange format. I want to be able to do a
      > >> > roundtrip, e.g., Python -> encoded -> Python with native integers
      > >> > (with arbitrary length in this case). In JSON, this just works, as
      > >> > far as the encoding is concerned. I see the need for this in any
      > >> > binary JSON format as well. If a large number is represented as a
      > >> > string, then on the decoding side, you don't know if that was a
      > >> > number or a string (just because it looks like a number doesn't
      > >> > mean that the sender means it's a number). If, when decoding JSON,
      > >> > the library can't handle large numbers, it has to throw an error
      > >> > anyway. The same should go for binary JSON.
      > >> >
      > >> > ./don
      > >>
      > >>
      > >
      > >
      > >
      > > --
      > > Don Owens
      > > don@...
      > >
      > >
      > --
      > Don Owens
      > don@...

    • Tatu Saloranta
      Message 76 of 76, Feb 20, 2012
        On Mon, Feb 20, 2012 at 9:42 AM, rkalla123 <rkalla@...> wrote:
        > Stephan,
        > No problem; your feedback is still very applicable and much appreciated.
        > The additional viewpoint on the signed/unsigned issue was exactly what I was hoping for. My primary goal has always been simplicity, and I know, at least from the Java world, that going with unsigned values would have made the impl distinctly *not* simple (and an annoying API).
        > So I am glad to get some validation there that I am not alienating every other language at the cost of Java.
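
        (A quick illustration of why unsigned values are awkward in Java – the
        language has no unsigned integer types, so reading an unsigned byte
        requires masking; a hypothetical snippet, not from the thread:)

            import java.io.ByteArrayInputStream;
            import java.io.InputStream;

            public class UnsignedDemo {
                public static void main(String[] args) throws Exception {
                    // A single length byte of 200 (> 127, so negative when
                    // treated as a signed Java byte):
                    InputStream in =
                            new ByteArrayInputStream(new byte[] { (byte) 200 });
                    byte signed = (byte) in.read(); // -56 if used directly
                    int unsigned = signed & 0xFF;   // 200, mask required
                    System.out.println(signed + " vs " + unsigned);
                    // With signed-only values in the format, Java's byte,
                    // short, int and long map directly, with no masking.
                }
            }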

        For what it is worth, I also consider support for only signed values a
        good thing.

        -+ Tatu +-