
Re: unicode case mechanisms

  • Tony Mechelynck
    Message 1 of 5 , Aug 22 9:05 AM
      ----- Original Message -----
      From: "Maiorana, Jason" <jmaiorana@...>
      To: <vim-multibyte@...>
      Sent: Thursday, August 22, 2002 3:51 PM
      Subject: unicode case mechanisms


      >
      > >> If anyone is interested, I have a bit of code that can create a
      > >> unicode case mapper (UPPER/lower) that works for all languages, to an
      > >> extent (none of the multi-letter mappings in the extended case
      > >> extensions, just simple single char to single char) which should work
      > >> for all unicode.
      >
      > >I'm interested, at least for latin scripts (including French, German
      > >and Esperanto), Greek and Russian.
      >
      > Well mostly those work fine except for german: it requires a special
      > case where two capital S become a ß. This is much more difficult,
      > because you must read through the string for context (for example, how
      > do you lowercase "SSS"?).

      I think I can afford to downcase -SSS- to -sss- and search manually for the
      latter. However, all the examples I can think of downcase to -ßs-. (-sß- is
      not possible because -ß- is always postvocalic.) Examples include Maßstab
      (cf. NL "maatstaf") and similar, always at a word-root boundary. OTOH, in
      classical High German, -SS- would have to downcase to -ss- in some cases and
      to -ß- in others (see example below); I don't think it's possible to
      automate it (without a dictionary), so I propose to always downcase -SS-
      to -ss-, and maybe warn the user that he'll have to do a /ss search
      afterwards to sort out the cases where -ss- must stay as such from those
      where -ss- is in fact -ß-. IIRC, the Swiss have a tendency to systematically
      use -ss- where the Germans (and Austrians?) use -ß- so it wouldn't be too
      awfully bad.

      Example (I think): Rußland "Russia" but russisch "Russian"
      >
      > >Is it a binary or a vim script?
      >
      > actually what I've got is a c++ program which generates a c library

      Hm. I suppose I'll have to wait until the next gvim release then, since I
      don't have M$W compiling tools.

      > which implements something like:
      >
      > wchar_t toupper( wchar_t );
      > wchar_t tolower( wchar_t );
      >
      > so obviously any script which is not 1:1 lower:upper isn't really going
      > to work with that sort of interface. Does anyone use vim with a german
      > locale? If so, does the SS->ß thing work correctly (does it really shrink
      > and grow the line to accommodate the extra character)? What happens when
      > you lowercase the string "SSS"?

      See above my "humble opinion" about downcasing. For upcasing, if you cannot
      upcase -ß- to -SS- it might be a problem. (BTW, some prewar books
      upcased -ß- to -SZ- which can be done reversibly but I think it's a thing of
      the past.) Utf-8 is an encoding where characters don't all have the same
      number of bytes anyway; and current-day replace-mode expands or shrinks the
      lines as needed when replacing one codepoint with another using a different
      number of bytes so there ought to be a solution.
      >
      > I'm not sure how this could be integrated with vim, but it is one way
      > to do case conversions correctly across most languages. If anyone is
      > still interested, I'll clean it up and make it available.
      >
      > >Is Japanese a-ka-sa-ta different from Japanese a-i-u-e-o which I had
      > >explained to me?
      >
      > no, it's probably the same. A notoriously hard language to sort; I
      > understand that the LC_COLLATE format cannot even really handle
      > proper sorting of Japanese.
      >
      As long as only kana and/or romaji are used it could probably be done; but
      what with the various readings of any single kanji... For the latter, the
      only "automatizable" sorting would probably depend on a dictionary by
      radicals or by stroke-count and that would exclude all three of a-i-u-e-o,
      i-ro-ha and romaji sorting. A difficult language indeed.

      Tony.
    • Maiorana, Jason
      Message 2 of 5 , Aug 22 10:06 AM
        > > >Is it a binary or a vim script?
        > >
        > > actually what I've got is a c++ program which generates a c library

        >Hm. I suppose I'll have to wait until the next gvim release then, since I
        >don't have M$W compiling tools.

        Hrm, though the generated c file would work with msvc, the process
        that generates it likely wouldn't (mix of makefiles, bash scripts, perl,
        and c++ template code). I would highly suggest cygwin, if you don't
        have it.

        >Utf-8 is an encoding where characters don't all have the same
        >number of bytes anyway; and current-day replace-mode expands or shrinks
        >the lines as needed when replacing one codepoint with another using a
        >different number of bytes so there ought to be a solution.

        Right, but german is just the tip of the iceberg. When you cover all of
        utf-8 you get into all kinds of special cases (ligatures, titlecase, and
        so on). Also imagine the context sensitivity issues when you have
        case-insensitive regular expressions: what will s/.ß/P/g do to "xSsßS"?
        Will the result be "PP", "PßS", or "xSPS"?

        What I have is a best-effort case converter for unicode, which can take
        a single ucs-4 codepoint to another single ucs-4 codepoint. The problem
        with having a more stateful converter is that:

        case_cvrt("ABC") != case_cvrt("A")+case_cvrt("BC")

        And this makes doing any type of partial string operation case
        sensitive.
        Not to mention the added complexity required for all the machinery.
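
A concrete instance of that inequality, as a minimal sketch: Greek was mentioned earlier in the thread, and its capital sigma lowercases to a different letter at the end of a word. The function names and the much-simplified "end of word" test below are assumptions made for illustration; they are not taken from the uconv code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified letter test covering basic Greek capitals and smalls only. */
static int is_greek_letter(uint32_t c)
{
    return (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2) ||
           (c >= 0x03B1 && c <= 0x03C9);
}

/* Context-free 1:1 lowercase: one codepoint in, one codepoint out. */
uint32_t simple_lower(uint32_t c)
{
    if (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2)
        return c + 0x20;
    return c;
}

/* Context-sensitive lowercase of n codepoints; out must hold n codepoints.
 * A capital sigma (U+03A3) that follows a letter and is not followed by
 * one becomes final sigma (U+03C2) instead of the regular U+03C3. */
void context_lower(const uint32_t *in, size_t n, uint32_t *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        int after_letter  = i > 0 && is_greek_letter(in[i - 1]);
        int before_letter = i + 1 < n && is_greek_letter(in[i + 1]);
        if (in[i] == 0x03A3 && after_letter && !before_letter)
            out[i] = 0x03C2;
        else
            out[i] = simple_lower(in[i]);
    }
}
```

Lowercasing the two codepoints U+0391 U+03A3 in one call gives U+03B1 U+03C2, while lowercasing them in two separate calls gives U+03B1 U+03C3, so the result depends on where the string was cut.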

        > no, it's probably the same. A notoriously hard language to sort; I
        > understand that the LC_COLLATE format cannot even really handle
        > proper sorting of Japanese.
        >
        >As long as only kana and/or romaji are used it could probably be done;
        >but what with the various readings of any single kanji... For the latter,
        >the only "automatizable" sorting would probably depend on a dictionary by
        >radicals or by stroke-count and that would exclude all three of
        >a-i-u-e-o, i-ro-ha and romaji sorting. A difficult language indeed.

        Also: iteration marks have a stateful collation, typically equal to that
        of their preceding glyph when it exists.
        For example

        昔々
        should sort right next to
        昔昔

        This introduces a subtle problem: if the iteration mark sorts exactly
        equal, then the output will be messy, because there is no standard for
        the iteration mark being before or after the regular kanji. So you need
        a special way to say that the iteration mark takes a special value that
        is either just more or just less than that of the preceding glyph, such
        that no other character can come between them.
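
One way to get such a "just more" value, sketched here under loud assumptions: every ordinary glyph gets an even key (twice its codepoint, which merely stands in for a real collation weight), and the iteration mark U+3005 takes the preceding glyph's key with the low bit set. The names key_at and compare_words are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ITERATION_MARK 0x3005u /* the kanji iteration mark */

/* Sort key of the glyph at position i. Ordinary glyphs get an even key
 * (2 * codepoint is only a placeholder for a real collation weight).
 * An iteration mark takes the key of the glyph before it, made odd, so
 * it sorts just after that glyph and nothing else fits in between. */
static uint32_t key_at(const uint32_t *s, size_t i)
{
    if (s[i] == ITERATION_MARK && i > 0)
        return key_at(s, i - 1) | 1u;
    return 2u * s[i];
}

/* Compare two codepoint strings key by key; returns <0, 0 or >0. */
int compare_words(const uint32_t *a, size_t na,
                  const uint32_t *b, size_t nb)
{
    size_t n = na < nb ? na : nb;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t ka = key_at(a, i);
        uint32_t kb = key_at(b, i);
        if (ka != kb)
            return ka < kb ? -1 : 1;
    }
    return (na > nb) - (na < nb);
}
```

With these keys, U+6614 U+6614 sorts immediately before U+6614 U+3005, and a word whose second glyph is the next codepoint (U+6615) still sorts after both.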
      • Tony Mechelynck
        Message 3 of 5 , Aug 22 11:19 AM
          ----- Original Message -----
          From: "Maiorana, Jason" <jmaiorana@...>
          To: <vim-multibyte@...>
          Sent: Thursday, August 22, 2002 7:06 PM
          Subject: RE: unicode case mechanisms

          [...]
          >
          > hrm, though the generated c file would work with msvc, the process
          > that generates it likely wouldn't (mix of makefiles, bash scripts, perl,
          > and c++ template code). I would highly suggest cygwin, if you don't
          > have it.

          I have it, not that I like it much; it is a quite primitive version where
          "man man" results in "BASH: man: command not found".
          >
          > >Utf-8 is an encoding where characters don't all have the same
          > >number of bytes anyway; and current-day replace-mode expands or shrinks
          > >the lines as needed when replacing one codepoint with another using a
          > >different number of bytes so there ought to be a solution.
          >
          > right, but german is just the tip of the iceberg. When you cover all of
          > utf-8 you get into all kinds of special cases (ligatures, titlecase, so
          > on). Also imagine the context sensitivity issues when you have case
          > insensitive regular expressions, what will s/.ß/P/g do to "xSsßS", will
          > the result be "PP", "PßS", or "xSPS"?
          >
          > What I have is a best-effort case-converter for unicode, which can take
          > a single ucs-4 codepoint to another single ucs-4 codepoint. The problem
          > with having a more stateful converter is that:
          >
          > case_cvrt("ABC") != case_cvrt("A")+case_cvrt("BC")
          >
          > And this makes doing any type of partial string operation case
          > sensitive.
          > Not to mention the added complexities required for all the machinery.

          Well, OK. Now Vim handles all Unicode internally as utf-8; so if the case
          inversion acts only on utf-32 I suppose a conversion is needed. Since utf-8
          strictly segregates the byte values for "isolated", "first of several" and
          "not-first of several", and since vim already doesn't (IIUC) cut strings in
          the middle of a codepoint, that shouldn't be much of an issue. It can
          influence speed but the conversion ought to be done only for some particular
          commands such as case-invert (tilde) or maybe caseless search. I hope we can
          live with that. The problem remains of whether any search involving German
          eszett can be regarded as caseless and what to do with the (deprecated IIUC?)
          ligatures. Also: shall we do the {utf-8_to_utf-32; case_invert;
          utf-32_to_utf-8} once for each character (codepoint) inside a string or once
          for the whole string (with three loops)? I suppose the former but I don't
          see the whole picture clearly.
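
For what it's worth, the "once for each codepoint" variant could look roughly like the sketch below. Everything in it is an assumption for illustration: the function names are invented, invert_case is a tiny stand-in covering only ASCII and Latin-1 letters (not the generated mapper), and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one UTF-8 sequence at s (assumed valid); returns its byte length. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12) |
          ((uint32_t)(s[2] & 0x3F) << 6) | (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* Encode one codepoint into out (at least 4 bytes); returns byte length. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Stand-in 1:1 case inverter: ASCII and Latin-1 letters only.
 * U+00DF (eszett) has no single-codepoint partner here and is left alone. */
uint32_t invert_case(uint32_t c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 0xC0 && c <= 0xDE && c != 0xD7))
        return c + 0x20;
    if ((c >= 'a' && c <= 'z') || (c >= 0xE0 && c <= 0xFE && c != 0xF7))
        return c - 0x20;
    return c;
}

/* Invert the case of NUL-terminated UTF-8 src into dst (cap bytes),
 * one codepoint at a time. Returns 0 on success, -1 if dst is too small. */
int utf8_invert_case(const char *src, char *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t used = 0;
    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return -1;
    while (*s != 0) {
        uint32_t cp;
        unsigned char buf[4];
        size_t len;
        s += utf8_decode(s, &cp);
        len = utf8_encode(invert_case(cp), buf);
        if (used + len + 1 > cap)
            return -1;
        memcpy(dst + used, buf, len);
        used += len;
    }
    dst[used] = '\0';
    return 0;
}
```

The capacity check is there because a fuller mapper may replace a codepoint with one that needs a different number of bytes, even though this stand-in never does.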

          Thinking back on it, maybe it would be worthwhile to have not only your C
          routine (for speed) but also an equivalent vim script (for portability). We
          must not forget that the Unicode standard is constantly being added to, so it
          might be useful to arrange a way to easily update the casepairs for Unicode
          updates independently of the rest of the code (which could then stay
          invariant except for bug fixes). There are people like me who use
          precompiled versions of vim and I don't think that, for a production
          version, we can force everyone to recompile it should a new case-aware
          script be added to the Unicode standard. A separate
          "architecture-independent" set-of-casepairs would IMHO be easier to maintain
          in the long run. But it would probably need specially handled exceptions
          like eszett, etc.
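
A rough sketch of what keeping the casepairs as data might look like: the lookup code stays fixed and only the table changes with each Unicode update. The struct, the function names and the five-entry excerpt are all illustrative assumptions; a real table would be generated from the Unicode data files (and could equally be loaded at run time rather than compiled in).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct case_pair { uint32_t upper, lower; };

/* Tiny illustrative excerpt, sorted by the upper-case codepoint. */
static const struct case_pair pairs[] = {
    { 0x0041, 0x0061 }, /* Latin A / a */
    { 0x00C9, 0x00E9 }, /* Latin E with acute (French) */
    { 0x0108, 0x0109 }, /* Latin C with circumflex (Esperanto) */
    { 0x0394, 0x03B4 }, /* Greek delta */
    { 0x0416, 0x0436 }, /* Cyrillic zhe */
};
#define N_PAIRS (sizeof pairs / sizeof pairs[0])

/* bsearch comparator: key is a codepoint, element is a case_pair. */
static int cmp_upper(const void *key, const void *elem)
{
    uint32_t k = *(const uint32_t *)key;
    uint32_t u = ((const struct case_pair *)elem)->upper;
    return (k > u) - (k < u);
}

/* Lowercase via binary search; codepoints not in the table map to themselves. */
uint32_t table_tolower(uint32_t c)
{
    const struct case_pair *p =
        bsearch(&c, pairs, N_PAIRS, sizeof pairs[0], cmp_upper);
    assert(p == NULL || p->upper == c);
    return p != NULL ? p->lower : c;
}

/* Uppercase via linear scan, since the table is only sorted by upper. */
uint32_t table_toupper(uint32_t c)
{
    for (size_t i = 0; i < N_PAIRS; i++)
        if (pairs[i].lower == c)
            return pairs[i].upper;
    return c;
}
```

Exceptions such as eszett simply have no entry here and fall through unchanged, which is where the specially handled cases would have to hook in.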
          >
          [...]
          >
          > also: iteration marks have a stateful collation; typically equal to
          > their preceding glyph when it exists.
          > for example
          >
          > 昔々
          > should sort right next to
          > 昔昔
          >
          > this introduces a subtle problem, if the iteration mark sorts exactly
          > equal, then the output will be messy, because there is no standard for
          > the iteration mark being before or after the regular kanji. So you need
          > a special way to say that the iteration mark takes up a special value
          > that is either just more or just less than the preceding glyph, such
          > that no other character can come between them.
          >
          Hm. Collation numbers with half-integer values; that's a new one. Of course
          it bears no direct relation to the case-inversion question.
          >

          Tony.
        • Maiorana, Jason
          Message 4 of 5 , Aug 26 1:43 PM
            >Thinking back on it, maybe it would be worth while to
            >have not only your C routine (for speed) but also an
            >equivalent vim script (for portability).

            Well I don't know how to write vim scripts atm, but the
            c version is at:
            http://members.telocity.com/~seer26/uconv_export.tgz

            The file uconv.c will contain the generated
            implementations, once built. Once again it's a language
            agnostic, unicode-wide, implementation of

            wchar_t unicode_tolower(wchar_t ucs);
            wchar_t unicode_toupper(wchar_t ucs);

            using nothing more than if statements and integer
            operators.

            An asymmetric, context-sensitive case swapper, such as
            needed by german, would require a different interface.
            Such an interface would most likely require memory
            operations or conversion contexts similar to those
            needed for doing an iconv.
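
As a minimal sketch of what such a buffer-based interface might look like (an assumption, not part of uconv): the caller passes an input run and an output buffer with a capacity, and the output may be longer than the input. The names full_upper and simple_upper are hypothetical, and only ASCII, Latin-1 and the ß to SS expansion are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in 1:1 uppercase for ASCII and Latin-1 small letters. */
static uint32_t simple_upper(uint32_t c)
{
    if ((c >= 'a' && c <= 'z') || (c >= 0xE0 && c <= 0xFE && c != 0xF7))
        return c - 0x20;
    return c;
}

/* Uppercase n codepoints from in into out (room for cap codepoints).
 * U+00DF expands to two codepoints, "SS", so the result can grow.
 * Returns the number of codepoints written, or (size_t)-1 if out is
 * too small. */
size_t full_upper(const uint32_t *in, size_t n, uint32_t *out, size_t cap)
{
    size_t used = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0x00DF) {
            if (used + 2 > cap)
                return (size_t)-1;
            out[used++] = 'S';
            out[used++] = 'S';
        } else {
            if (used + 1 > cap)
                return (size_t)-1;
            out[used++] = simple_upper(in[i]);
        }
    }
    return used;
}
```

Upcasing the three codepoints of "Maß" this way yields the four codepoints of "MASS", which a wchar_t-in, wchar_t-out function cannot express.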