Re: member names outside ASCII but still in the Unicode Basic Multilingual Plane

  • douglascrockford
    Message 1 of 14, Jan 5, 2012
      --- In jslint_com@yahoogroups.com, "Brennan" <brennan@...> wrote:

      > I am posting here with some trepidation, and I am thoroughly expecting to be insulted by DC. Here goes anyway.
      >
      > "The Good Parts" Appendix A5 (Awful parts) states:
      >
      > «JavaScript was designed at a time when Unicode was expected to have at most 65536 characters. It has since grown to have a capacity of more than 1 million characters.
      >
      > JavaScript's characters are 16 bits. That is enough to cover the original 65536 (which is now known as the Basic Multilingual Plane).
      >
      > Each of the remaining million characters can be represented as a pair of characters. Unicode considers the pair to be a single character. JavaScript thinks the pair is two distinct characters.»
      >
      > This indicates to me that - when naming members - going outside ASCII, but sticking to the Basic Multilingual Plane (e.g. using alphabetical glyphs like æ or ß) should work fine - and expands the expressiveness of the code. I claim that it's about time we moved on from honoring the ghost of ASCII in our source code, when the language specifies a wider set.
      >
      > But JSLint disagrees. I get (e.g.) "Unexpected æ". Is there an explanation?
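The distinction drawn in the quoted passage can be checked directly. This is a minimal sketch, assuming an ES2015-or-later engine (for `codePointAt` and `Array.from`); the two sample characters are chosen here for illustration and do not come from the book.

```javascript
// A BMP character such as æ (U+00E6) fits in one 16-bit code unit.
const bmp = "\u00E6";
// U+1F600 lies outside the BMP, so JavaScript stores it as a surrogate pair.
const astral = "\uD83D\uDE00";

console.log(bmp.length);                          // 1
console.log(astral.length);                       // 2 (two code units)
console.log(Array.from(astral).length);           // 1 (one code point)
console.log(astral.codePointAt(0).toString(16));  // "1f600"
```

Characters inside the BMP, such as æ or ß, never hit the pair problem; that is a separate question from whether a linter should accept them in names.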



      There is confusion beyond the ASCII set. There are codes with identical or similar glyphs. A correct program can be indistinguishable from an incorrect program. So JSLint prefers ASCII for identifiers. These are the characters that are recognized by all programmers, whatever their native language.

      JSLint likes the full Unicode set in strings. But for identifiers, it recommends a much smaller set.
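A rough sketch of what an ASCII-only rule for identifiers amounts to. The helper name `isAsciiIdentifier` and the regular expression are assumptions made for illustration; this is not JSLint's actual code, and reserved words are ignored.

```javascript
// Hypothetical helper: true when `name` uses only ASCII letters, digits,
// "_" and "$", and does not start with a digit.
function isAsciiIdentifier(name) {
    return /^[A-Za-z_$][A-Za-z0-9_$]*$/.test(name);
}

console.log(isAsciiIdentifier("total_count"));  // true
console.log(isAsciiIdentifier("\u00E6ble"));    // false (starts with æ)
console.log(isAsciiIdentifier("9lives"));       // false (starts with a digit)
```

String contents are not touched by a check like this, which matches the split described above: any Unicode in strings, a small set in names.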
    • Tom Worster
      Message 2 of 14, Jan 5, 2012
        I like to program in Unicode (☭ = ☃ + π;) but I accept that there can be
        difficulties. One question is what collation JS should use to decide
        equivalence (according to Unicode, whether é is different from e depends
        on locale). Another is that Unicode offers different character sequences,
        and thus different byte strings, to represent the exact same thing (ö and
        ö look the same to me but the first is U+006F U+0308 the second is U+00F6).
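The two spellings of ö can be told apart by listing their code points. A minimal sketch, assuming an ES2015-or-later engine; `codePoints` is a hypothetical helper written for this example.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX.
function codePoints(text) {
    return Array.from(text, (ch) =>
        "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    ).join(" ");
}

const decomposed = "o\u0308";   // o followed by COMBINING DIAERESIS
const precomposed = "\u00F6";   // LATIN SMALL LETTER O WITH DIAERESIS

console.log(codePoints(decomposed));                       // U+006F U+0308
console.log(codePoints(precomposed));                      // U+00F6
console.log(decomposed === precomposed);                   // false
console.log(decomposed.normalize("NFC") === precomposed);  // true
```

`String.prototype.normalize` makes the two forms comparable, but nothing applies it automatically to source text or to identifiers.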


        On 1/5/12 8:57 AM, "Brennan" <brennan@...> wrote:

        >I am posting here with some trepidation, and I am thoroughly expecting to
        >be insulted by DC. Here goes anyway.
        >
        >"The Good Parts" Appendix A5 (Awful parts) states:
        >
        >«JavaScript was designed at a time when Unicode was expected to have at
        >most 65536 characters. It has since grown to have a capacity of more than
        >1 million characters.
        >
        >JavaScript's characters are 16 bits. That is enough to cover the original
        >65536 (which is now known as the Basic Multilingual Plane).
        >
        >Each of the remaining million characters can be represented as a pair of
        >characters. Unicode considers the pair to be a single character.
        >JavaScript thinks the pair is two distinct characters.»
        >
        >This indicates to me that - when naming members - going outside ASCII,
        >but sticking to the Basic Multilingual Plane (e.g. using alphabetical
        >glyphs like æ or ß) should work fine - and expands the expressiveness of
        >the code. I claim that it's about time we moved on from honoring the
        >ghost of ASCII in our source code, when the language specifies a wider
        >set.
        >
        >But JSLint disagrees. I get (e.g.) "Unexpected æ". Is there an
        >explanation?
        >
        >If there is no explanation beyond the fact that DC is from an Anglo-Saxon
        >background, and just prefers ASCII out of pure habit, then this may be
        >considered a feature request: JSLint should tolerate all alphabetical
        >characters in the Basic Multilingual Plane for member names.
      • douglascrockford
        Message 3 of 14, Jan 5, 2012
          --- In jslint_com@yahoogroups.com, Tom Worster <fsb@...> wrote:

          > I like to program in Unicode (☭ = ☃ + π;) but I accept that there can be
          > difficulties. One question is what collation JS should use to decide
          > equivalence (according to Unicode, whether é is different from e depends
          > on locale). Another is that Unicode offers different character sequences,
          > and thus different byte strings, to represent the exact same thing (ö and
          > ö look the same to me but the first is U+006F U+0308 the second is U+00F6).


          All JavaScript sees is raw code points, so identifiers that look the same may be different identifiers.
        • Joshua Bell
          Message 4 of 14, Jan 6, 2012
            On Thu, Jan 5, 2012 at 6:54 AM, Tom Worster <fsb@...> wrote:

            > I like to program in Unicode (☭ = ☃ + π;) but I accept that there can be
            > difficulties. One question is what collation JS should use to decide
            > equivalence (according to Unicode, whether é is different from e depends
            > on locale). Another is that Unicode offers different character sequences,
            > and thus different byte strings, to represent the exact same thing (ö and
            > ö look the same to me but the first is U+006F U+0308 the second is U+00F6).
            >

            A Globalization API for JavaScript is under consideration on es-discuss,
            for implementation by browser vendors as host objects and/or inclusion in
            the next version of the ECMAScript standard as a module. I believe the
            latest version of the proposal can be found at:

            http://norbertlindenberg.com/2011/11/ecmascript-globalization-api/index.html

            The current proposal includes support for locale-specific collation and all
            the Unicode-goodness you'd expect. This is done with new objects/functions
            - existing JavaScript string comparison operations remain unchanged (i.e.
            continue to operate by ordinal comparison of the 16-bit elements of JS
            strings)
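The proposal discussed here was later standardized as the ECMAScript Internationalization API (`Intl`). A minimal sketch of the difference between ordinal and collator-based comparison, assuming an engine that ships `Intl.Collator` with English locale data; the word list is made up for the example.

```javascript
const words = ["zebra", "Apple", "\u00E9clair", "eagle"];

// Ordinal comparison: the default sort compares 16-bit code units,
// so "Apple" (A is 0x41) sorts first and "éclair" (é is 0xE9) sorts last.
const ordinal = [...words].sort();

// Locale-aware comparison goes through a separate collator object.
const collator = new Intl.Collator("en");
const collated = [...words].sort(collator.compare);

console.log(ordinal);   // ["Apple", "eagle", "zebra", "éclair"]
console.log(collated);  // ["Apple", "eagle", "éclair", "zebra"]
```

The built-in operators (`<`, `===`, default `sort`) keep their ordinal behaviour; collation is opt-in through the new object.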


          • Joshua Bell
            Message 5 of 14, Jan 6, 2012
              On Fri, Jan 6, 2012 at 8:32 AM, Joshua Bell <inexorabletash@...> wrote:

              > On Thu, Jan 5, 2012 at 6:54 AM, Tom Worster <fsb@...> wrote:
              >
              >> I like to program in Unicode (☭ = ☃ + π;) but I accept that there can be
              >> difficulties. One question is what collation JS should use to decide
              >> equivalence (according to Unicode, whether é is different from e depends
              >> on locale). Another is that Unicode offers different character sequences,
              >> and thus different byte strings, to represent the exact same thing (ö and
              >> ö look the same to me but the first is U+006F U+0308 the second is
              >> U+00F6).
              >>
              >
              > A Globalization API for JavaScript is under consideration on es-discuss,
              > for implementation by browser vendors as host objects and/or inclusion in
              > the next version of the ECMAScript standard as a module. I believe the
              > latest version of the proposal can be found at:
              >
              >
              > http://norbertlindenberg.com/2011/11/ecmascript-globalization-api/index.html
              >
              > The current proposal includes support for locale-specific collation and
              > all the Unicode-goodness you'd expect. This is done with new
              > objects/functions - existing JavaScript string comparison operations remain
              > unchanged (i.e. continue to operate by ordinal comparison of the 16-bit
              > elements of JS strings)
              >

              ... and to expound on Crockford's point on the other fork of this thread
              (mea culpa!), the above proposal assumes no changes to the ECMAScript
              language itself. Different JS strings (i.e. different sequences of 16-bit
              code units) would remain different identifiers, both in the source and,
              perhaps more importantly, in basic ECMAScript operations like keys for
              objects. E.g. o["ö"] and o["ö"] refer to different properties (assuming my
              clipboard didn't normalize), although other proposed changes in ECMAScript
              may enable collation-aware string maps with that convenient syntax.
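The property-key case can be reproduced with escapes, so the example does not depend on any clipboard or editor. A minimal sketch; the `put` wrapper is hypothetical and uses Unicode normalization (NFC), which is a simpler stand-in for the collation-aware maps mentioned above, not the same thing.

```javascript
const decomposed = "o\u0308";   // o + COMBINING DIAERESIS
const precomposed = "\u00F6";   // single code point ö

// Plain objects compare keys code unit by code unit.
const o = {};
o[decomposed] = "first";
o[precomposed] = "second";
console.log(Object.keys(o).length);  // 2

// Hypothetical wrapper: normalize every key to NFC before storing it.
const normalized = new Map();
function put(key, value) {
    normalized.set(key.normalize("NFC"), value);
}
put(decomposed, "first");
put(precomposed, "second");
console.log(normalized.size);              // 1
console.log(normalized.get(precomposed));  // "second"
```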

              Encoding is still a very real issue on the Web, and you don't want to find
              out that your server thought your script file was UTF-8 while some browsers
              thought your script file was Windows-1252 only after your code is in
              production, so keeping your source code ASCII is still the best practice.
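The failure mode described above can be simulated in a few lines. This sketch is Node.js-specific (it relies on `Buffer`), and it uses Node's `"latin1"` decoder as a stand-in for Windows-1252; the two agree on the bytes involved here.

```javascript
// Node.js only: Buffer is used here to simulate a charset mismatch.
const written = "\u00E6";                     // æ, as the author intended
const bytes = Buffer.from(written, "utf8");   // bytes the server sends
const misread = bytes.toString("latin1");     // what a single-byte decoder sees

console.log(bytes.toString("hex"));  // "c3a6"
console.log(misread);                // "Ã¦"
console.log(misread === written);    // false
```

Note that the source line `"\u00E6"` itself contains only ASCII bytes, so it reads the same under either encoding. That option exists for strings; an identifier spelled with raw non-ASCII characters has no such protection.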

              Are the well known minification tools able to cope with non-ASCII input?


            • douglascrockford
              Message 6 of 14, Jan 6, 2012
                --- In jslint_com@yahoogroups.com, Joshua Bell <inexorabletash@...> wrote:

                > Encoding is still a very real issue on the Web, and you don't want to find
                > out that your server thought your script file was UTF-8 while some browsers
                > thought your script file was Windows-1252 only after your code is in
                > production, so keeping your source code ASCII is still the best practice.
                >
                > Are the well known minification tools able to cope with non-ASCII input?

                JSMin likes UTF-8.
              • Tom Worster
                Message 7 of 14, Jan 6, 2012
                  On 1/6/12 11:32 AM, "Joshua Bell" <inexorabletash@...> wrote:

                  >On Thu, Jan 5, 2012 at 6:54 AM, Tom Worster <fsb@...> wrote:
                  >
                  >> I like to program in Unicode (☭ = ☃ + π;) but I accept that there can be
                  >> difficulties. One question is what collation JS should use to decide
                  >> equivalence (according to Unicode, whether é is different from e depends
                  >> on locale). Another is that Unicode offers different character sequences,
                  >> and thus different byte strings, to represent the exact same thing (ö and
                  >> ö look the same to me but the first is U+006F U+0308 the second is
                  >> U+00F6).
                  >>
                  >
                  >A Globalization API for JavaScript is under consideration on es-discuss,
                  >for implementation by browser vendors as host objects and/or inclusion in
                  >the next version of the ECMAScript standard as a module. I believe the
                  >latest version of the proposal can be found at:
                  >
                  >http://norbertlindenberg.com/2011/11/ecmascript-globalization-api/index.html
                  >
                  >The current proposal includes support for locale-specific collation and
                  >all the Unicode-goodness you'd expect.

                  Meaning that one script such as öle = olé + 3; does different things in
                  different countries?

                  öle = olé + 3;


                  >This is done with new objects/functions
                  >- existing JavaScript string comparison operations remain unchanged (i.e.
                  >continue to operate by ordinal comparison of the 16-bit elements of JS
                  >strings)

                  Strings are not the question. Identifiers.
                • Luke Page
                  Message 8 of 14, Jan 6, 2012
                    For a real-world example, I've seen a bug where an identifier had a с (Cyrillic
                    c) in it, written by a Russian coder, which looked in most fonts the same as
                    c.
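A sketch of how such a bug behaves at run time. The variable names are invented for the example, and the Cyrillic letter is written as a `\u0441` escape (escapes are legal inside identifiers) so that the difference stays visible; typed directly, both names would render as "count" in most fonts.

```javascript
function demo() {
    let count = 1;        // Latin c (U+0063)
    let \u0441ount = 2;   // Cyrillic с (U+0441): a different variable
    count += 10;          // only the Latin-named variable changes
    return [count, \u0441ount];
}

console.log(demo());  // [11, 2]
```

The engine reports no error, because both declarations are valid and distinct.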

                    Still I think it would be nice to not just stamp western standards on
                    programming.

                  • douglascrockford
                    Message 9 of 14, Jan 6, 2012
                      --- In jslint_com@yahoogroups.com, Luke Page <luke.a.page@...> wrote:

                      > For a real-world example, I've seen a bug where an identifier had a с (Cyrillic
                      > c) in it, written by a Russian coder, which looked in most fonts the same as
                      > c.
                      >
                      > Still I think it would be nice to not just stamp western standards on
                      > programming.


                      What are you trying to say? You gave us clear evidence of why it shouldn't accept both Cyrillic and Latin. So are you arguing that JSLint should only accept Cyrillic?
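For reference, mixing of the two scripts inside one name can at least be detected mechanically. This is only a sketch, assuming an engine with ES2018 Unicode property escapes; `mixesLatinAndCyrillic` is a hypothetical helper, not something JSLint provides.

```javascript
// Hypothetical helper: true when a name contains both Latin and Cyrillic letters.
function mixesLatinAndCyrillic(name) {
    return /\p{Script=Latin}/u.test(name) && /\p{Script=Cyrillic}/u.test(name);
}

console.log(mixesLatinAndCyrillic("count"));                     // false
console.log(mixesLatinAndCyrillic("\u0441ount"));                // true (Cyrillic с + Latin "ount")
console.log(mixesLatinAndCyrillic("\u0441\u0447\u0451\u0442"));  // false (all Cyrillic)
```

A check like this catches the mixed case but says nothing about two whole names, one per script, that happen to look alike.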
                    • Luke Page
                      Message 10 of 14, Jan 6, 2012
                        I'm arguing in favour of the current situation, for myself, working in
                        English.

                        I should have let someone else for whom it is of benefit argue for anything
                        different.
                      • Rob Richardson
                        Message 11 of 14, Jan 7, 2012
                          Programming in general is done in English. I'm sorry to be the dumb
                          American, but that's pretty much how it works. I've heard from many
                          international programmers that using localized versions of developer tools
                          or code documentation is ineffective, and that, even though English is not
                          their native tongue, it is their preferred programming metaphor. Thus
                          constraining non-string content to ASCII is likely not a hindrance to
                          many. Perhaps the "bug" is that we've learned some things since the book was
                          published.

                          Rob


                        • Brennan
                          Message 12 of 14, Feb 24, 2012
                            I'm afraid that I don't accept "that's the way it's always been" or
                            variants as a strong argument. Especially when exceptions can be so
                            readily found.

                            But OK, if I may restate the problem as I now understand it, with some
                            finer nuances:

                            A Latin "A" (U+0041) is distinct from the Cyrillic "А" (U+0410) and the
                            Greek alpha "Α" (U+0391). They look identical, but have different code
                            points. So, going outside of ascii when naming identifiers could cause
                            name-mismatch bugs which would be very difficult to spot. This is indeed
                            a good reason for not accepting those characters.

                            (They look identical as long as they haven't been mangled by passing
                            through some non-unicode system along the way, which appears to be
                            happening with some of the posts on this thread, and maybe this one too.
                            This is a separate problem, and should have no bearing on how jslint
                            behaves or ought to behave. BTW I notice that the web yahoo groups plain
                            text editor interface is not doing 'the right thing' to my non-ascii
                            chars when I preview this message, so I have switched to rich text.
                            Let's see what happens after I send it).

                            But while I respect the basic logic and simple pragmatism of rejecting
                            all non-ascii characters, there must surely be a subset of the basic
                            multilingual plane where the non-ascii glyphs do not resemble any
                            others, and therefore would be 'safe' to code with. Is this a reasonable
                            suggestion?

                            For example, I would like to feel free to use θ (lower case theta,
                            entity θ) for angles (as mathematicians have done for thousands of
                            years), and there are dozens of other non-ascii characters - mostly
                            Greek - which are conventionally used in various problem domains.

                            Theta appears four times in the basic multilingual plane:

                            Θ or entity Θ (U+0398)
                            θ or entity θ (U+03B8)
                            ϑ or entity ϑ (U+03D1)
                            ϴ or entity ϴ (U+03F4)

                            All four forms are clearly distinct from one another. To my eyes they
                            are at least as distinct as 1 and I and | and l or O and 0 in ascii. And
                            they do not resemble any other glyphs, least of all those found in
                            ascii.

                            I can see no good reason why such characters should not be tolerated by
                            jslint. (Except perhaps that jslint may be bloated with some kind of
                            lookup table, and - of course - some work always takes more time than no
                            work).

                            Another suggestion would be to make it an *option* to tolerate
                            characters that fall outside of ascii.
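To make the suggestion concrete, here is a sketch of what an opt-in allowlist could look like. Everything in it is hypothetical: `isToleratedIdentifier` and its `extraChars` parameter are invented for this example and do not correspond to any existing jslint option.

```javascript
// Hypothetical check: ASCII identifier characters are always accepted;
// anything else must be listed explicitly in `extraChars`.
function isToleratedIdentifier(name, extraChars = "") {
    const extras = new Set(extraChars);
    const chars = Array.from(name);
    return chars.length > 0 &&
        !/^[0-9]/.test(name) &&
        chars.every((ch) => /[A-Za-z0-9_$]/.test(ch) || extras.has(ch));
}

console.log(isToleratedIdentifier("\u03B8"));                // false (θ not opted in)
console.log(isToleratedIdentifier("\u03B8", "\u03B8"));      // true
console.log(isToleratedIdentifier("sin\u03B8", "\u03B8"));   // true
console.log(isToleratedIdentifier("\u0441ount", "\u03B8"));  // false (Cyrillic с not listed)
```

Whoever fills in the list is the one asserting that the listed characters resemble nothing else, so the burden stays with the author rather than the tool.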
                            If I am so perversely traditional (or radical and progressive) that I
                            insist on using θ in my trigonometry script, then I would hope that
                            I know what I am doing. This is not a matter of having my feelings hurt.
                            Rather it seems misleading for jslint to tell me that I did something
                            'unexpected', when the truth of the matter is that I did something that
                            jslint does indeed expect would cause a very particular (but
                            unmentioned) problem. A problem which (in the case of θ) would never
                            happen.


                          • Tom Worster
                            Message 13 of 14, Feb 24, 2012
                              On 2/24/12 4:46 AM, "Brennan" <brennan@...> wrote:

                              >But while I respect the basic logic and simple pragmatism of rejecting
                              >all non-ascii characters, there must surely be a subset of the basic
                              >multilingual plane where the non-ascii glyphs do not resemble any
                              >others, and therefore would be 'safe' to code with. Is this a reasonable
                              >suggestion?

                              This leads to a need for a standard resembles($char1, $char2) function,
                              but resemblance is subjective.

                              The ICU SpoofChecker?
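A toy version of such a function, to show the shape of the idea. The lookup table is an assumption and holds only the look-alikes named in this thread; real tools such as ICU's SpoofChecker work from Unicode's published confusables data, which is far larger and is itself a judgment call.

```javascript
// Illustrative table only: maps the look-alikes mentioned in this thread
// to an ASCII "skeleton" character.
const LOOKALIKES = new Map([
    ["\u0410", "A"],  // Cyrillic А
    ["\u0391", "A"],  // Greek Α
    ["\u0441", "c"]   // Cyrillic с
]);

// Replace every known look-alike with its skeleton character.
function skeleton(text) {
    return Array.from(text, (ch) => LOOKALIKES.get(ch) || ch).join("");
}

// Two strings "resemble" each other when their skeletons are equal.
function resembles(a, b) {
    return skeleton(a) === skeleton(b);
}

console.log(resembles("A", "\u0410"));          // true
console.log(resembles("count", "\u0441ount"));  // true
console.log(resembles("A", "B"));               // false
```

The subjectivity does not go away; it moves into whoever decides what belongs in the table.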