Loading ...
Sorry, an error occurred while loading the content.

Re: "ocaml_beginners"::[] Yet another OCaml tutorial

Expand Messages
  • james woodyatt
    ... Just out of curiosity... if real-world programmers don t care how theoretically elegant is some aspect of their programming language, then why should
    Message 1 of 26 , Jun 12, 2003
    • 0 Attachment
      On Thursday, Jun 12, 2003, at 02:36 US/Pacific, Richard Jones wrote:
      >
      > Well I didn't mean that to be as controversial [...], but here's my
      > list:
      >
      > * Lack of documentation. (In English, and please don't mention
      > the lambda-calculus, currying, etc. because real world programmers
      > don't care how theoretically elegant something is.)
      > [...]

      Just out of curiosity... if real-world programmers don't care how
      "theoretically elegant" is some aspect of their programming language,
      then why should they care about type safety in the first place?


      --
      j h woodyatt <jhw@...>
      that's my village calling... no doubt, they want their idiot back.
    • james woodyatt
      ... I m not sure many programmers of the popular languages in industry are seeing any real world benefit of strong typing. With all the big languages, it is
      Message 2 of 26 , Jun 12, 2003
      • 0 Attachment
        On Thursday, Jun 12, 2003, at 09:42 US/Pacific, Brian Hurt wrote:
        > On Thu, 12 Jun 2003, james woodyatt wrote:
        >> [Brian Hurt wrote:]
        >>>
        >>> [...] please don't mention the lambda-calculus, currying, etc.
        >>> because real world programmers don't care [...]
        >>
        >> Just out of curiosity... if real-world programmers don't care how
        >> "theoretically elegant" is some aspect of their programming language,
        >> then why should they care about type safety in the first place?
        >
        > Because it has real world benefits.

        I'm not sure many programmers of the popular languages in industry are
        seeing any real world benefit of strong typing. With all the big
        languages, it is practically impossible to write anything but toy
        programs without resorting to hard typecasting operations. With the
        full range of C++ tricks, you can almost sorta kinda live in a type
        safe way, but it is *not* easy. Java is continually forcing you to do
        dynamic casts and catch typecasting exceptions. Even Java with
        generics will only help so much there.

        > The more bugs the compiler catches automatically, the less time you
        > spend
        > in debugging. Strong typing is your first line of defense.

        Agreed. I'm just not sure that's something a lot of real-world
        programmers know it, let alone care very much about it. There are a
        lot of happy programmers who are quite comfortable passing around
        untyped string representations of internal application values. In
        fact, that's one of the reasons many of these programmers like PCRE
        support in their language as much as they do.

        And the other things you mentioned, i.e. lambda-calculus, currying,
        etc., are just as beneficial in the real world, even if real-world
        programmers don't know or care about them.

        The trick is selling them on the benefits of such curiosities, and you
        don't do sell your product without telling your potential customers
        about how your product will improve their lifestyle.


        --
        j h woodyatt <jhw@...>
        markets are only free to the people who own them.
      • Brian Hurt
        ... Because it has real world benefits. The ocaml compiler s error messages are occassionally obtuse- occassionally it takes me 10 minutes or so staring at a
        Message 3 of 26 , Jun 12, 2003
        • 0 Attachment
          On Thu, 12 Jun 2003, james woodyatt wrote:

          > On Thursday, Jun 12, 2003, at 02:36 US/Pacific, Richard Jones wrote:
          > >
          > > Well I didn't mean that to be as controversial [...], but here's my
          > > list:
          > >
          > > * Lack of documentation. (In English, and please don't mention
          > > the lambda-calculus, currying, etc. because real world programmers
          > > don't care how theoretically elegant something is.)
          > > [...]
          >
          > Just out of curiosity... if real-world programmers don't care how
          > "theoretically elegant" is some aspect of their programming language,
          > then why should they care about type safety in the first place?
          >

          Because it has real world benefits. The ocaml compiler's error messages
          are occassionally obtuse- occassionally it takes me 10 minutes or so
          staring at a peice of code to figure out what's wrong. That's less and
          less often as I get the hang of Ocaml. But firing up a debugger and
          recreating an error takes minutes *at least*. Anyone who has been
          programming for any length of time has horror stories of days-long
          debugging sessions to track down some minor mistake. I once spent a solid
          *month* of 60-hour weeks to track down a bug that resulted in a 2-line
          change to fix.

          The more bugs the compiler catches automatically, the less time you spend
          in debugging. Strong typing is your first line of defense.

          Brian
        • Brian Hurt
          ... [ Why is stong typing good? ] ... I agree- the type systems of C++ and Java are broken, and thus *force* you to work around them. Working around the type
          Message 4 of 26 , Jun 12, 2003
          • 0 Attachment
            On Thu, 12 Jun 2003, james woodyatt wrote:

            > On Thursday, Jun 12, 2003, at 09:42 US/Pacific, Brian Hurt wrote:

            [ Why is stong typing good? ]

            > > Because it has real world benefits.
            >
            > I'm not sure many programmers of the popular languages in industry are
            > seeing any real world benefit of strong typing. With all the big
            > languages, it is practically impossible to write anything but toy
            > programs without resorting to hard typecasting operations. With the
            > full range of C++ tricks, you can almost sorta kinda live in a type
            > safe way, but it is *not* easy. Java is continually forcing you to do
            > dynamic casts and catch typecasting exceptions. Even Java with
            > generics will only help so much there.

            I agree- the type systems of C++ and Java are broken, and thus *force* you
            to work around them. Working around the type systems then becomes normal,
            and the type systems stop catching as many bugs as they were claimed to
            (because you're not using them. Akin to seat belts only saving lives when
            they're worn). Type systems then get the reputation of being clumsy and
            annoying to work around, while giving very little benefit- so why not get
            rid of them?

            The real solution is to fix the type systems so you don't have to work
            around them, you can work *with* them. They stop being a burden and start
            becomming an aide.

            >
            > > The more bugs the compiler catches automatically, the less time you
            > > spend
            > > in debugging. Strong typing is your first line of defense.
            >
            > Agreed. I'm just not sure that's something a lot of real-world
            > programmers know it, let alone care very much about it. There are a
            > lot of happy programmers who are quite comfortable passing around
            > untyped string representations of internal application values. In
            > fact, that's one of the reasons many of these programmers like PCRE
            > support in their language as much as they do.
            >

            ...

            >
            > The trick is selling them on the benefits of such curiosities, and you
            > don't do sell your product without telling your potential customers
            > about how your product will improve their lifestyle.
            >
            >

            Yep. Rather like the starving, disease ridden kids in africa who don't
            realize that not *everyone* is starving and disease ridden. Which is why
            I pitched my response the way I did. What's the benefit of a workable
            strong typing system? Less debugging time. I don't know any programmer
            who really *likes* debugging- some hate it more than others is my
            experience. Spend less time debugging and more time coding- I don't know
            any programmer who wouldn't like *that*. Especially considering that
            Ocaml isn't really that much harder to program than C/C++, per meaningful
            behavior.

            BTW, I don't know of anything in RegEx than requires breaking strong
            typing. A regex operates on a string and returns a string. I'd guess
            that it can easily be done less painfully than printf formatting.

            Brian
          • Yamagata Yoriyuki
            From: Richard Jones Subject: Re: ocaml_beginners ::[] Yet another OCaml tutorial Date: Thu, 12 Jun 2003 10:36:55 +0100 ... Unicode/UTF-8
            Message 5 of 26 , Jun 12, 2003
            • 0 Attachment
              From: Richard Jones <rich@...>
              Subject: Re: "ocaml_beginners"::[] Yet another OCaml tutorial
              Date: Thu, 12 Jun 2003 10:36:55 +0100

              > * Unicode/UTF-8 support - absolutely vital for the contracts that I
              > might use OCaml on.

              Unicode/UTF-8 support can mean many different things. In a sense, you
              can handle UTF-8 in ocaml because UTF-8 is just a sequence of bytes.
              (I don't say this approach is ideal. I'm pointing out a possibility.)
              What you actually need? I'm just interested.

              Cheers,
              --
              Yamagata Yoriyuki
            • Richard Jones
              ... Treating UTF-8 as just a bunch of bytes is the lowest level of support. It doesn t help me if I want to, for example, find out how many characters are in
              Message 6 of 26 , Jun 13, 2003
              • 0 Attachment
                On Fri, Jun 13, 2003 at 10:27:10AM +0900, Yamagata Yoriyuki wrote:
                > From: Richard Jones <rich@...>
                > Subject: Re: "ocaml_beginners"::[] Yet another OCaml tutorial
                > Date: Thu, 12 Jun 2003 10:36:55 +0100
                >
                > > * Unicode/UTF-8 support - absolutely vital for the contracts that I
                > > might use OCaml on.
                >
                > Unicode/UTF-8 support can mean many different things. In a sense, you
                > can handle UTF-8 in ocaml because UTF-8 is just a sequence of bytes.
                > (I don't say this approach is ideal. I'm pointing out a possibility.)
                > What you actually need? I'm just interested.

                Treating UTF-8 as just a "bunch of bytes" is the lowest level of
                support. It doesn't help me if I want to, for example, find out
                how many characters are in the string (# characters <= # bytes), or
                extract the nth character, etc.

                I need to be able to, at a minimum:

                * normalise the string
                * find the length in bytes or characters
                * iterate over either bytes or characters
                * extract bytes or characters
                * use Unicode regexps on the string

                Of course I can write short library functions to do this, but
                nothing beats having it built into the language.

                Rich.

                --
                Richard Jones, Red Hat Inc. (London) and Merjis Ltd. http://www.merjis.com/
                http://www.annexia.org/ Freshmeat projects: http://freshmeat.net/users/rwmj
                All new technology is irrelevant until it is taken up by the public.
              • Brian Hurt
                ... The number of bytes a string takes up depends upon the encoding. For example, the string hello, world takes up 12 bytes in UTF-8 encoding, 24 in UF-16,
                Message 7 of 26 , Jun 13, 2003
                • 0 Attachment
                  On Fri, 13 Jun 2003, Richard Jones wrote:

                  > * find the length in bytes or characters
                  > * iterate over either bytes or characters
                  > * extract bytes or characters

                  The number of bytes a string takes up depends upon the encoding. For
                  example, the string "hello, world" takes up 12 bytes in UTF-8 encoding, 24
                  in UF-16, and 48 in UCS.

                  Two huge advantages that Ocaml already has: characters are not just short
                  integers (of either 8- or 16- bit quantities), but instead "magical" data
                  types. And mostly stored as full words anyways. And strings are not just
                  arrays of characters, but instead special datastructures. These two facts
                  make me optimistic that Ocaml can support internationalization in a
                  resonably backwards compatible way. Hmm. UCS is a 31-bit format. Fits
                  quite nicely into an Ocaml int...

                  With both UTF-16 and UCS you have endian questions.

                  > * use Unicode regexps on the string

                  This gets to be trickier. Here's the problem. Basically, all your
                  regular expression/lexing code breaks down to a huge finite state
                  automaton. In code, this is expressed by having a huge table of the form
                  (pseudo-C code):

                  extern int next_state[NUM_STATES][NUM_INPUTS];

                  Where NUM_INPUTS is the number of possible input characters- 128 for 7-bit
                  ASCII, 256 for 8-bit. Transitions then behave like:

                  state = next_state[state][input];

                  Now, this is already large enough (and redundant enough) that most lexers
                  provide a translation table, so the above code looks like:

                  extern char translation[NUM_INPUTS];
                  extern int next_state[NUM_STATES][NUM_UNIQUE_INPUTS];

                  state = next_state[state][translation[input]]

                  As NUM_UNIQUE_INPUTS is general less than NUM_INPUTS you save. Now, even
                  if assuming NUM_UNIQUE_INPUTS stays the same, you have a much larger
                  translation table. 256 times larger than 8-bit ASCII for UTF-16, and 8
                  million times larger for UCS.

                  A slightly better technique would be to do all pattern matching on UTF-8,
                  limiting the number of characters. But this increases the number of
                  states. For example, consider that you want to transition from state A to
                  state B on any digit. In addition to transitions on characters 0x30-0x39,
                  but also Bengali digits 0x9E6-0x9EF. These all become three-byte UTF-8
                  sequences, 0xE0, 0xA7, 0xA6-0xAF. This means I have to add two new states
                  that were not in the original state machine *each time* I want to
                  recognize Bengali digits as well as latin-1 digits.

                  Brian
                • Yamagata Yoriyuki
                  From: Richard Jones Subject: Re: ocaml_beginners ::[] Yet another OCaml tutorial Date: Fri, 13 Jun 2003 11:58:10 +0100 ... Actually, I once
                  Message 8 of 26 , Jun 13, 2003
                  • 0 Attachment
                    From: Richard Jones <rich@...>
                    Subject: Re: "ocaml_beginners"::[] Yet another OCaml tutorial
                    Date: Fri, 13 Jun 2003 11:58:10 +0100

                    > * normalise the string
                    > * find the length in bytes or characters
                    > * iterate over either bytes or characters
                    > * extract bytes or characters

                    Actually, I once made the library doing them, except regexps.
                    (http://camomile.sf.net) If we don't care about performance, then
                    regexps is perhaps not so difficult, but...

                    --
                    Yamagata Yoriyuki
                  • Yamagata Yoriyuki
                    From: Brian Hurt Subject: Re: ocaml_beginners ::[] Yet another OCaml tutorial Date: Fri, 13 Jun 2003 12:56:34 -0500 (CDT) ... Use of
                    Message 9 of 26 , Jun 14, 2003
                    • 0 Attachment
                      From: Brian Hurt <brian.hurt@...>
                      Subject: Re: "ocaml_beginners"::[] Yet another OCaml tutorial
                      Date: Fri, 13 Jun 2003 12:56:34 -0500 (CDT)

                      > A slightly better technique would be to do all pattern matching on UTF-8,
                      > limiting the number of characters.

                      Use of bytes representation of UTF-16 or UCS-4 would be better for
                      this purpose. Another solution would be use of better table
                      representation, such as diets, or multi-stage tables.

                      But if we need collation matching and ordering in regexps, things
                      become even more trickier, since coding collation by finite-automata
                      is practically (or maybe theoretically) impossible.

                      --
                      Yamagata Yoriyuki
                    • Brian Hurt
                      ... UTF-16 might be a workable tradeoff, with a translation table (see my example). If I can get down to
                      Message 10 of 26 , Jun 15, 2003
                      • 0 Attachment
                        On Sat, 14 Jun 2003, Yamagata Yoriyuki wrote:

                        > From: Brian Hurt <brian.hurt@...>
                        > Subject: Re: "ocaml_beginners"::[] Yet another OCaml tutorial
                        > Date: Fri, 13 Jun 2003 12:56:34 -0500 (CDT)
                        >
                        > > A slightly better technique would be to do all pattern matching on UTF-8,
                        > > limiting the number of characters.
                        >
                        > Use of bytes representation of UTF-16 or UCS-4 would be better for
                        > this purpose. Another solution would be use of better table
                        > representation, such as diets, or multi-stage tables.

                        UTF-16 might be a workable tradeoff, with a translation table (see my
                        example). If I can get down to < 256 "character classes", the translation
                        table needs only be 64K. And you have extra states, but fewer of them in
                        general I'd expect.

                        UCS-4 would be completely unworkable, as the translation table would need
                        to be 4 Gigabytes.

                        >
                        > But if we need collation matching and ordering in regexps, things
                        > become even more trickier, since coding collation by finite-automata
                        > is practically (or maybe theoretically) impossible.
                        >

                        I find this comment interesting, as if you can't do it with a finite
                        automata, you basically can't do it computationally. I suspect there is a
                        "Do what I mean" requirement hidden in the spec in this case. Which is
                        one of the problems with design by committee- anything is easy to those
                        who don't have to do it themselves.

                        Brian
                      • Yamagata Yoriyuki
                        From: Brian Hurt Subject: Re: ocaml_beginners ::[] Yet another OCaml tutorial Date: Sun, 15 Jun 2003 13:27:50 -0500 (CDT) ... No, no,
                        Message 11 of 26 , Jun 16, 2003
                        • 0 Attachment
                          From: Brian Hurt <brian.hurt@...>
                          Subject: Re: "ocaml_beginners"::[] Yet another OCaml tutorial
                          Date: Sun, 15 Jun 2003 13:27:50 -0500 (CDT)

                          > > Use of bytes representation of UTF-16 or UCS-4 would be better for
                          > > this purpose. Another solution would be use of better table
                          > > representation, such as diets, or multi-stage tables.
                          >
                          > UTF-16 might be a workable tradeoff, with a translation table (see my
                          > example). If I can get down to < 256 "character classes", the translation
                          > table needs only be 64K. And you have extra states, but fewer of them in
                          > general I'd expect.
                          >
                          > UCS-4 would be completely unworkable, as the translation table would need
                          > to be 4 Gigabytes.

                          No, no, I talked about UTF-16 or UCS-4 encoded as a byte sequence. In
                          general, UTF-16 is more compact than UTF-8. (UTF-8 behaves badly for
                          2-bytes characters.) UCS-4 consumes 4-bytes per character while UTF-8
                          may requires 6-bytes for one character.

                          From: Brian Hurt <brian.hurt@...>
                          Subject: Re: "ocaml_beginners"::[] Yet another OCaml tutorial
                          Date: Sun, 15 Jun 2003 13:27:50 -0500 (CDT)

                          > I find this comment interesting, as if you can't do it with a finite
                          > automata, you basically can't do it computationally. I suspect there is a
                          > "Do what I mean" requirement hidden in the spec in this case. Which is
                          > one of the problems with design by committee- anything is easy to those
                          > who don't have to do it themselves.

                          Sorry, I couldn't follow you. But I didn't mean it can't do
                          computationally. Just we can't compile regexps to automata, so
                          text-book tricks we all know about regexps does not work.

                          --
                          Yamagata Yoriyuki
                        • Brian Hurt
                          ... You still have a state explosion. Recognizing the character a now requires three extra states (to consume the three extra bytes). And you ve also
                          Message 12 of 26 , Jun 16, 2003
                          • 0 Attachment
                            On Mon, 16 Jun 2003, Yamagata Yoriyuki wrote:

                            > No, no, I talked about UTF-16 or UCS-4 encoded as a byte sequence. In
                            > general, UTF-16 is more compact than UTF-8. (UTF-8 behaves badly for
                            > 2-bytes characters.) UCS-4 consumes 4-bytes per character while UTF-8
                            > may requires 6-bytes for one character.

                            You still have a state explosion. Recognizing the character 'a' now
                            requires three extra states (to consume the three extra bytes). And
                            you've also abandoned any hope of character compressions- because now,
                            depending upon which byte it is, 0x61 may be the character 'a' or some
                            other byte in a four-byte character. So you can't simply declare
                            character 0x61 and character 0x62 to be the same- the 0x61 vr.s 0x62 could
                            be distinguishing betwenn the bengalese and the elven alphabets (I don't
                            know if that's literally true, but you get the idea). Not that you could
                            with UTF anyways.

                            > > I find this comment interesting, as if you can't do it with a finite
                            > > automata, you basically can't do it computationally. I suspect there is a
                            > > "Do what I mean" requirement hidden in the spec in this case. Which is
                            > > one of the problems with design by committee- anything is easy to those
                            > > who don't have to do it themselves.
                            >
                            > Sorry, I couldn't follow you. But I didn't mean it can't do
                            > computationally. Just we can't compile regexps to automata, so
                            > text-book tricks we all know about regexps does not work.

                            IIRC, cellular automata are turing equivelent. Which means if you can't
                            do it with cellular automata, you can't do it on computers. More likely,
                            you can do it with cellular automata, it's just that you need bignum
                            states.

                            Brian
                          • Yamagata Yoriyuki
                            From: Brian Hurt Subject: Re: ocaml_beginners ::[] Yet another OCaml tutorial Date: Mon, 16 Jun 2003 11:38:07 -0500 (CDT) ... My point
                            Message 13 of 26 , Jun 16, 2003
                            • 0 Attachment
                              From: Brian Hurt <brian.hurt@...>
                              Subject: Re: "ocaml_beginners"::[] Yet another OCaml tutorial
                              Date: Mon, 16 Jun 2003 11:38:07 -0500 (CDT)

                              > You still have a state explosion. Recognizing the character 'a' now
                              > requires three extra states (to consume the three extra bytes).

                              My point was that UTF-16 or UCS-4 is better than UTF-8, not that this
                              method itself is a good idea. (Admittedly, UCS-4 is not better than
                              UTF-8 because Unicode characters are contained mostly the 16-bits zone
                              (BMP), where UTF-8 requires 1-4 bytes while UCS-4 always need 4
                              bytes.)

                              > IIRC, cellular automata are turing equivelent. Which means if you can't
                              > do it with cellular automata, you can't do it on computers. More likely,
                              > you can do it with cellular automata, it's just that you need bignum
                              > states.

                              Ah, I meant *finite* automata. That clarifies things, No?

                              --
                              Yamagata Yoriyuki
                            • james woodyatt
                              ... Bad example. An 0x61 octet in a UTF-8 string can mean only one thing. The ambiguity you are talking about arises only in octets with a one in bit 7. The
                              Message 14 of 26 , Jun 16, 2003
                              • 0 Attachment
                                On Monday, Jun 16, 2003, at 09:38 US/Pacific, Brian Hurt wrote:
                                >
                                > You still have a state explosion. Recognizing the character 'a' now
                                > requires three extra states (to consume the three extra bytes). And
                                > you've also abandoned any hope of character compressions- because now,
                                > depending upon which byte it is, 0x61 may be the character 'a' or some
                                > other byte in a four-byte character.

                                Bad example. An 0x61 octet in a UTF-8 string can mean only one thing.
                                The ambiguity you are talking about arises only in octets with a one in
                                bit 7.

                                The way I have been attacking this problem is to write a generic DFA
                                compiler that uses function combinators to represent the NFA input
                                trees and a functor for compiling the DFA engine. The goal is to make
                                ASCII regular expression matching to be reasonably fast while still
                                allowing reasonable Unicode expressions to be handled with the same
                                interface. The engine for matching ASCII expressions use arrays of
                                state transitions indexed by character number, and the engine for
                                Unicode expressions (when I write it) will use interval maps of state
                                transitions indexed by code number. (I haven't yet had a burning need
                                for Unicode regular expressions.)

                                I haven't validated the code I have written for this to my satisfaction
                                yet, so it remains unreleased and I can't say when I will I be ready to
                                release it. I mention my approach because it may be that someone else
                                will start and finish a similar project before I do, assuming they get
                                started soon and have sufficient scheduling resources to finish it
                                quickly.


                                --
                                j h woodyatt <jhw@...>
                                that's my village calling... no doubt, they want their idiot back.
                              Your message has been successfully submitted and would be delivered to recipients shortly.