
unicode case mechanisms

  • Maiorana, Jason
    Message 1 of 5 , Aug 22, 2002
      >> If anyone is interested, I have a bit of code that can create a unicode
      >> case mapper (UPPER/lower) that works for all languages, to an extent
      >> (none of the multi-letter mappings in the extended case extensions, just
      >> simple single char to single char) which should work for all unicode.

      >I'm interested, at least for latin scripts (including French, German and
      >Esperanto), Greek and Russian.

      Well, mostly those work fine except for German: it requires a special case
      where two capital S become a ß. This is much more difficult, because
      you must read through the string for context (for example, how do you
      lowercase "SSS"?).

      >Is it a binary or a vim script?

      Actually, what I've got is a C++ program which generates a C library
      which implements something like:

      wchar_t toupper( wchar_t );
      wchar_t tolower( wchar_t );

      so obviously any script which is not 1:1 lower:upper isn't really going
      to work with that sort of interface. Does anyone use vim with a German
      locale? If so, does the SS->ß thing work correctly (does it really shrink and
      grow the line to accommodate the extra character)? What happens when you
      lowercase the string "SSS"?

      I'm not sure how this could be integrated with vim, but it is one way
      to do case conversions correctly across most languages. If anyone is
      still interested, I'll clean it up and make it available.

      >Is Japanese a-ka-sa-ta different from Japanese a-i-u-e-o which I had
      >explained to me?

      No, it's probably the same. Japanese is a notoriously hard language to sort;
      I understand that the LC_COLLATE format cannot even really handle
      proper sorting of Japanese.
    • Tony Mechelynck
      Message 2 of 5 , Aug 22, 2002
        ----- Original Message -----
        From: "Maiorana, Jason" <jmaiorana@...>
        To: <vim-multibyte@...>
        Sent: Thursday, August 22, 2002 3:51 PM
        Subject: unicode case mechanisms


        >
        > >> If anyone is interested, I have a bit of code that can create a unicode
        > >> case mapper (UPPER/lower) that works for all languages, to an extent
        > >> (none of the multi-letter mappings in the extended case extensions, just
        > >> simple single char to single char) which should work for all unicode.
        >
        > >I'm interested, at least for latin scripts (including French, German and
        > >Esperanto), Greek and Russian.
        >
        > Well, mostly those work fine except for German: it requires a special case
        > where two capital S become a ß. This is much more difficult, because
        > you must read through the string for context (for example, how do you
        > lowercase "SSS"?).

        I think I can afford to downcase -SSS- to -sss- and search manually for the
        latter. However, all the examples I can think of downcase to -ßs-. (-sß- is
        not possible because -ß- is always postvocalic.) Examples include Maßstab
        (cf. NL "maatstaf") and similar, always at a word-root boundary. OTOH, in
        classical High German, -SS- would have to downcase to -ss- in some cases and
        to -ß- in others (see example below); I don't think it's possible to
        automate it (without a dictionary), so I propose to always downcase -SS-
        to -ss-, and maybe warn the user that he'll have to do a /ss search
        afterwards to sort out the cases where -ss- must stay as such from those
        where -ss- is in fact -ß-. IIRC, the Swiss have a tendency to systematically
        use -ss- where the Germans (and Austrians?) use -ß- so it wouldn't be too
        awfully bad.

        Example (I think): Rußland "Russia" but russisch "Russian"
        >
        > >Is it a binary or a vim script?
        >
        > Actually, what I've got is a C++ program which generates a C library

        Hm. I suppose I'll have to wait until the next gvim release then, since I
        don't have M$W compiling tools.

        > which implements something like:
        >
        > wchar_t toupper( wchar_t );
        > wchar_t tolower( wchar_t );
        >
        > so obviously any script which is not 1:1 lower:upper isn't really going
        > to work with that sort of interface. Does anyone use vim with a German
        > locale? If so, does the SS->ß thing work correctly (does it really shrink and
        > grow the line to accommodate the extra character)? What happens when you
        > lowercase the string "SSS"?

        See above my "humble opinion" about downcasing. For upcasing, if you cannot
        upcase -ß- to -SS- it might be a problem. (BTW, some prewar books
        upcased -ß- to -SZ- which can be done reversibly but I think it's a thing of
        the past.) Utf-8 is an encoding where characters don't all have the same
        number of bytes anyway; and current-day replace-mode expands or shrinks the
        lines as needed when replacing one codepoint with another using a different
        number of bytes so there ought to be a solution.
        >
        > I'm not sure how this could be integrated with vim, but it is one way
        > to do case conversions correctly across most languages. If anyone is
        > still interested, I'll clean it up and make it available.
        >
        > >Is Japanese a-ka-sa-ta different from Japanese a-i-u-e-o which I had
        > >explained to me?
        >
        > No, it's probably the same. Japanese is a notoriously hard language to sort;
        > I understand that the LC_COLLATE format cannot even really handle
        > proper sorting of Japanese.
        >
        As long as only kana and/or romaji are used it could probably be done; but
        what with the various readings of any single kanji... For the latter, the
        only "automatizable" sorting would probably depend on a dictionary by
        radicals or by stroke-count and that would exclude all three of a-i-u-e-o,
        i-ro-ha and romaji sorting. A difficult language indeed.

        Tony.
      • Maiorana, Jason
        Message 3 of 5 , Aug 22, 2002
          > > >Is it a binary or a vim script?
          > >
          > > Actually, what I've got is a C++ program which generates a C library

          >Hm. I suppose I'll have to wait until the next gvim release then, since I
          >don't have M$W compiling tools.

          Hrm, though the generated C file would work with MSVC, the process
          that generates it likely wouldn't (a mix of makefiles, bash scripts, perl,
          and C++ template code). I would highly suggest cygwin, if you don't
          have it.

          >Utf-8 is an encoding where characters don't all have the same
          >number of bytes anyway; and current-day replace-mode expands or shrinks the
          >lines as needed when replacing one codepoint with another using a different
          >number of bytes so there ought to be a solution.

          Right, but German is just the tip of the iceberg. When you cover all of
          Unicode you get into all kinds of special cases (ligatures, titlecase, and
          so on). Also imagine the context-sensitivity issues when you have
          case-insensitive regular expressions: what will s/.ß/P/g do to "xSsßS"?
          Will the result be "PP", "PßS", or "xSPS"?

          What I have is a best-effort case converter for Unicode, which maps
          a single UCS-4 code point to another single UCS-4 code point. The problem
          with having a more stateful converter is that:

          case_cvrt("ABC") != case_cvrt("A")+case_cvrt("BC")

          And this makes doing any type of partial string operation case
          sensitive. Not to mention the added complexity required for all the
          machinery.
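
The inequality above can be made concrete with a small sketch. It assumes a hypothetical context-sensitive lowercaser, `ctx_lower`, that greedily turns every "SS" pair into ß (the rule discussed in this thread, not an actual Unicode or vim rule) and handles ASCII only. The names `ctx_lower` and `ctx_lower_split` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical context-sensitive lowercaser for ASCII input.
 * Assumption for illustration only: every "SS" pair becomes the UTF-8
 * bytes of U+00DF (0xC3 0x9F), matched greedily from the left.
 * Returns the number of bytes written, excluding the terminator. */
size_t ctx_lower(const char *in, char *out, size_t cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL && cap > 0);
    /* Keep room for a two-byte sequence plus the terminator. */
    while (*in != '\0' && n + 2 < cap) {
        if (in[0] == 'S' && in[1] == 'S') {
            out[n++] = (char)0xC3;              /* first byte of U+00DF */
            out[n++] = (char)0x9F;              /* second byte of U+00DF */
            in += 2;
        } else if (*in >= 'A' && *in <= 'Z') {
            out[n++] = (char)(*in + ('a' - 'A'));
            in += 1;
        } else {
            out[n++] = *in;
            in += 1;
        }
    }
    out[n] = '\0';
    return n;
}

/* Convert the same string in two pieces, split at byte offset `cut`,
 * and concatenate the two results. */
size_t ctx_lower_split(const char *in, size_t cut, char *out, size_t cap)
{
    char head[64];
    size_t n;
    assert(cut < sizeof head && cut <= strlen(in));
    memcpy(head, in, cut);
    head[cut] = '\0';
    n = ctx_lower(head, out, cap);
    return n + ctx_lower(in + cut, out + n, cap - n);
}
```

Under these assumptions, `ctx_lower("SSS", ...)` yields "ßs", while `ctx_lower_split("SSS", 1, ...)` yields "sß": the result depends on where the string was cut.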

          > No, it's probably the same. Japanese is a notoriously hard language to sort;
          > I understand that the LC_COLLATE format cannot even really handle
          > proper sorting of Japanese.
          >
          >As long as only kana and/or romaji are used it could probably be done; but
          >what with the various readings of any single kanji... For the latter, the
          >only "automatizable" sorting would probably depend on a dictionary by
          >radicals or by stroke-count and that would exclude all three of a-i-u-e-o,
          >i-ro-ha and romaji sorting. A difficult language indeed.

          Also: iteration marks have a stateful collation; typically equal to their
          preceding glyph when it exists. For example,

          昔々
          should sort right next to
          昔昔

          This introduces a subtle problem: if the iteration mark sorts exactly equal,
          then the output will be messy, because there is no standard for the iteration
          mark being before or after the regular kanji. So you need a special way to
          say that the iteration mark takes a special value that is either just more
          or just less than the preceding glyph, such that no other character can come
          between them.
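
One way to picture the "just more than the preceding glyph" idea is the sketch below. It is built on loud assumptions: ordinary characters are weighted by their raw code point (a stand-in for real collation weights, which Japanese sorting would not use), every weight is doubled so that the odd slot above it is free, and the iteration mark U+3005 takes that odd slot. `make_keys` and `compare_keys` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ITERATION_MARK 0x3005u  /* U+3005 IDEOGRAPHIC ITERATION MARK */

/* Fill keys[] with one sort key per code point.
 * Assumption: raw code points stand in for real collation weights.
 * Doubling each weight leaves the odd value just above it unused,
 * so the iteration mark can sit there and nothing can sort between
 * a glyph and its repetition. */
void make_keys(const uint32_t *cps, size_t n, uint64_t *keys)
{
    size_t i;
    assert(cps != NULL && keys != NULL);
    for (i = 0; i < n; i++) {
        if (cps[i] == ITERATION_MARK && i > 0)
            keys[i] = keys[i - 1] | 1u;      /* just above the preceding glyph */
        else
            keys[i] = (uint64_t)cps[i] * 2u; /* ordinary character */
    }
}

/* Lexicographic comparison of two key sequences: <0, 0 or >0. */
int compare_keys(const uint64_t *a, size_t na, const uint64_t *b, size_t nb)
{
    size_t i, n = na < nb ? na : nb;
    for (i = 0; i < n; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return (na > nb) - (na < nb);
}
```

With these keys, 昔昔 (U+6614 U+6614) sorts immediately before 昔々 (U+6614 U+3005), and 昔々 still sorts before 昔 followed by any character with a higher weight than 昔.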
        • Tony Mechelynck
          Message 4 of 5 , Aug 22, 2002
            ----- Original Message -----
            From: "Maiorana, Jason" <jmaiorana@...>
            To: <vim-multibyte@...>
            Sent: Thursday, August 22, 2002 7:06 PM
            Subject: RE: unicode case mechanisms

            [...]
            >
            > Hrm, though the generated C file would work with MSVC, the process
            > that generates it likely wouldn't (a mix of makefiles, bash scripts, perl,
            > and C++ template code). I would highly suggest cygwin, if you don't
            > have it.

            I have it, not that I like it much; it is a quite primitive version where
            "man man" results in "BASH: man: command not found".
            >
            > >Utf-8 is an encoding where characters don't all have the same
            > >number of bytes anyway; and current-day replace-mode expands or shrinks the
            > >lines as needed when replacing one codepoint with another using a different
            > >number of bytes so there ought to be a solution.
            >
            > Right, but German is just the tip of the iceberg. When you cover all of
            > Unicode you get into all kinds of special cases (ligatures, titlecase, and
            > so on). Also imagine the context-sensitivity issues when you have
            > case-insensitive regular expressions: what will s/.ß/P/g do to "xSsßS"?
            > Will the result be "PP", "PßS", or "xSPS"?
            >
            > What I have is a best-effort case converter for Unicode, which maps
            > a single UCS-4 code point to another single UCS-4 code point. The problem
            > with having a more stateful converter is that:
            >
            > case_cvrt("ABC") != case_cvrt("A")+case_cvrt("BC")
            >
            > And this makes doing any type of partial string operation case
            > sensitive. Not to mention the added complexity required for all the
            > machinery.

            Well, OK. Now Vim handles all Unicode internally as utf-8; so if the case
            inversion acts only on utf-32 I suppose a conversion is needed. Since utf-8
            strictly segregates the byte values for "isolated" "first of several" and
            "not-first of several", and since vim already doesn't (IIUC) cut strings in
            the middle of a codepoint that shouldn't be much of an issue. It can
            influence speed but the conversion ought to be done only for some particular
            commands such as case-invert (tilde) or maybe caseless search. I hope we can
            live with that. The problem remains of whether any search involving German
            eszet can be regarded as caseless and what to do with the (deprecated IIUC?)
            ligatures. Also: shall we do the {utf-8_to_utf-32; case_invert;
            utf-32_to_utf-8} once for each character (codepoint) inside a string or once
            for the whole string (with three loops). I suppose the former but I don't
            see the whole picture clearly.
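
The per-codepoint variant of that {utf-8_to_utf-32; case_invert; utf-32_to_utf-8} loop might look like the sketch below. Everything in it is an assumption for illustration: the helper names are invented, the input is taken to be well-formed UTF-8 with no error handling, and `swap_case` only knows ASCII and the basic Cyrillic block. It is not how vim does it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from well-formed UTF-8; returns bytes consumed.
 * Assumption: no validation, the caller guarantees valid input. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if (s[0] < 0xE0) { *cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu); return 2; }
    if (s[0] < 0xF0) {
        *cp = ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
        return 3;
    }
    *cp = ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12)
        | ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
    return 4;
}

/* Encode one code point as UTF-8; returns bytes written (1 to 4). */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Placeholder 1:1 case inverter: ASCII and basic Cyrillic only. */
static uint32_t swap_case(uint32_t cp)
{
    if (cp >= 0x41u && cp <= 0x5Au) return cp + 32;      /* A..Z */
    if (cp >= 0x61u && cp <= 0x7Au) return cp - 32;      /* a..z */
    if (cp >= 0x0410u && cp <= 0x042Fu) return cp + 32;  /* Cyrillic capitals */
    if (cp >= 0x0430u && cp <= 0x044Fu) return cp - 32;  /* Cyrillic small */
    return cp;
}

/* Decode, invert and re-encode one code point at a time.
 * Returns the number of bytes written, excluding the terminator. */
size_t utf8_swap_case(const char *in, char *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;
    assert(in != NULL && out != NULL && cap > 0);
    while (*p != '\0' && n + 4 < cap) {   /* room for 4 bytes + terminator */
        uint32_t cp;
        p += utf8_decode(p, &cp);
        n += utf8_encode(swap_case(cp), (unsigned char *)out + n);
    }
    out[n] = '\0';
    return n;
}
```

Doing the three steps per code point keeps everything in one loop and needs no intermediate UTF-32 buffer; the whole-string variant would need one, but would make the 1:1 mapper easier to swap for a string-level one later.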

            Thinking back on it, maybe it would be worth while to have not only your C
            routine (for speed) but also an equivalent vim script (for portability). We
            must not forget that the Unicode standard is constantly being added to, so it
            might be useful to arrange a way to easily update the casepairs for Unicode
            updates independently of the rest of the code (which could then stay
            invariant except for bug fixes). There are people like me who use
            precompiled versions of vim and I don't think that, for a production
            version, we can force everyone to recompile it should a new case-aware
            script be added to the Unicode standard. A separate
            "architecture-independent" set-of-casepairs would IMHO be easier to maintain
            in the long run. But it would probably need specially handled exceptions
            like eszet, etc.
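
A separately maintained set of case pairs could be as simple as a sorted table plus a generic lookup, as in this sketch. The five entries are an illustrative excerpt only, and `case_pair`, `CASE_PAIRS` and `table_toupper` are hypothetical names; a real table would more plausibly be generated from the Unicode Character Database (for example its UnicodeData.txt file) than typed by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t lower; uint32_t upper; } case_pair;

/* Illustrative excerpt, sorted by `lower`. Only this data would need
 * to change when new cased scripts are added; the lookup code stays. */
static const case_pair CASE_PAIRS[] = {
    { 0x0061, 0x0041 },  /* a -> A */
    { 0x00E9, 0x00C9 },  /* e acute -> E acute (French) */
    { 0x011D, 0x011C },  /* g circumflex -> G circumflex (Esperanto) */
    { 0x03B1, 0x0391 },  /* alpha -> Alpha (Greek) */
    { 0x0436, 0x0416 },  /* zhe -> Zhe (Russian) */
};
#define N_PAIRS (sizeof CASE_PAIRS / sizeof CASE_PAIRS[0])

/* bsearch needs the table sorted by `lower`; check that invariant. */
void check_table_sorted(void)
{
    size_t i;
    for (i = 1; i < N_PAIRS; i++)
        assert(CASE_PAIRS[i - 1].lower < CASE_PAIRS[i].lower);
}

static int cmp_lower(const void *key, const void *elem)
{
    uint32_t k = *(const uint32_t *)key;
    uint32_t l = ((const case_pair *)elem)->lower;
    return (k > l) - (k < l);
}

/* Simple 1:1 uppercase lookup. Code points missing from the table,
 * including exceptions such as U+00DF, are returned unchanged. */
uint32_t table_toupper(uint32_t cp)
{
    const case_pair *hit = bsearch(&cp, CASE_PAIRS, N_PAIRS,
                                   sizeof CASE_PAIRS[0], cmp_lower);
    return hit != NULL ? hit->upper : cp;
}
```

Exceptions like eszett would still need their own handling outside the table, since a pair of single code points cannot express ß to SS.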
            >
            [...]
            >
            > Also: iteration marks have a stateful collation; typically equal to their
            > preceding glyph when it exists. For example,
            >
            > 昔々
            > should sort right next to
            > 昔昔
            >
            > This introduces a subtle problem: if the iteration mark sorts exactly equal,
            > then the output will be messy, because there is no standard for the iteration
            > mark being before or after the regular kanji. So you need a special way to
            > say that the iteration mark takes a special value that is either just more
            > or just less than the preceding glyph, such that no other character can come
            > between them.
            >
            Hm. Collation numbers with half-integer values; that's a new one. Of course
            it bears no direct relation to the case-inversion question.
            >

            Tony.
          • Maiorana, Jason
            Message 5 of 5 , Aug 26, 2002
              >Thinking back on it, maybe it would be worth while to
              >have not only your C routine (for speed) but also an
              >equivalent vim script (for portability).

              Well, I don't know how to write vim scripts atm, but the
              C version is at:
              http://members.telocity.com/~seer26/uconv_export.tgz

              The file uconv.c will contain the generated
              implementations, once built. Once again, it's a
              language-agnostic, Unicode-wide implementation of

              wchar_t unicode_tolower(wchar_t ucs);
              wchar_t unicode_toupper(wchar_t ucs);

              using nothing more than if statements and integer
              operators.
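
To give a feel for what "nothing more than if statements and integer operators" can look like, here is a hand-written sketch covering a few ranges (ASCII, Latin-1, part of Latin Extended-A, basic Greek and Cyrillic). It is not the generated uconv.c, whose contents are not shown in this thread; the names `sketch_tolower` and `sketch_toupper` are invented, and `uint32_t` is used instead of `wchar_t` purely as an assumption to avoid platform differences in `wchar_t` width.

```c
#include <assert.h>
#include <stdint.h>

/* Range-based 1:1 lowercase mapping for a handful of blocks. */
uint32_t sketch_tolower(uint32_t cp)
{
    if (cp >= 0x0041 && cp <= 0x005A) return cp + 0x20;   /* A..Z */
    if (cp >= 0x00C0 && cp <= 0x00DE && cp != 0x00D7)     /* Latin-1 capitals, */
        return cp + 0x20;                                 /* skipping U+00D7   */
    if (cp >= 0x0100 && cp <= 0x012F) return cp | 1u;     /* even = upper, odd = lower */
    if (cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2)     /* Greek capitals,   */
        return cp + 0x20;                                 /* U+03A2 unassigned */
    if (cp >= 0x0400 && cp <= 0x040F) return cp + 0x50;   /* Cyrillic extensions */
    if (cp >= 0x0410 && cp <= 0x042F) return cp + 0x20;   /* basic Cyrillic */
    return cp;
}

/* Matching 1:1 uppercase mapping. U+00DF (eszett) is left unchanged
 * because its uppercase form is not a single code point here. */
uint32_t sketch_toupper(uint32_t cp)
{
    if (cp == 0x03C2) return 0x03A3;                      /* final sigma -> Sigma */
    if (cp >= 0x0061 && cp <= 0x007A) return cp - 0x20;   /* a..z */
    if (cp >= 0x00E0 && cp <= 0x00FE && cp != 0x00F7)     /* Latin-1 small,    */
        return cp - 0x20;                                 /* skipping U+00F7   */
    if (cp >= 0x0100 && cp <= 0x012F) return cp & ~1u;    /* odd = lower, even = upper */
    if (cp >= 0x03B1 && cp <= 0x03C9) return cp - 0x20;   /* Greek small */
    if (cp >= 0x0430 && cp <= 0x044F) return cp - 0x20;   /* basic Cyrillic */
    if (cp >= 0x0450 && cp <= 0x045F) return cp - 0x50;   /* Cyrillic extensions */
    return cp;
}

/* Spot checks for the ranges above. */
void sketch_self_check(void)
{
    assert(sketch_tolower(0x0041) == 0x0061);   /* A -> a */
    assert(sketch_toupper(0x0436) == 0x0416);   /* zhe -> Zhe */
    assert(sketch_toupper(0x00DF) == 0x00DF);   /* eszett stays as is */
}
```

Note the final sigma (U+03C2): it uppercases to Σ, but Σ lowercases to the ordinary σ, so even the simple 1:1 mapping is not fully reversible.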

              An asymmetric, context-sensitive case swapper, such as
              needed by German, would require a different interface.
              Such an interface would most likely require memory
              operations or conversion contexts similar to those
              needed for doing an iconv.
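
As a rough idea of what such a "different interface" could look like, the sketch below uses a caller-owned context and explicit in/out buffers, loosely in the spirit of iconv. It is entirely hypothetical: `case_ctx` and `case_upper_chunk` are invented names, only ASCII letters and ß (UTF-8 bytes 0xC3 0x9F) are converted, and the input is assumed to be well-formed UTF-8 delivered in arbitrary chunks.

```c
#include <assert.h>
#include <stddef.h>

/* Conversion state carried between calls: set when the previous chunk
 * ended on a 0xC3 lead byte whose second byte has not arrived yet. */
typedef struct { int have_c3; } case_ctx;

/* Uppercase one chunk of UTF-8 into `out`.
 * ASCII letters are uppercased, U+00DF becomes "SS", every other
 * byte is copied unchanged. Requires outcap >= inlen + 1 because a
 * lead byte held back from the previous chunk may be emitted here.
 * Returns the number of bytes written (no terminator is added). */
size_t case_upper_chunk(case_ctx *ctx, const char *in, size_t inlen,
                        char *out, size_t outcap)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t i, n = 0;
    assert(ctx != NULL && in != NULL && out != NULL);
    assert(outcap >= inlen + 1);
    for (i = 0; i < inlen; i++) {
        unsigned char b = p[i];
        if (ctx->have_c3) {
            ctx->have_c3 = 0;
            if (b == 0x9F) {              /* 0xC3 0x9F is U+00DF */
                out[n++] = 'S';
                out[n++] = 'S';
            } else {                      /* some other U+00C0..U+00FF char */
                out[n++] = (char)0xC3;
                out[n++] = (char)b;
            }
        } else if (b == 0xC3) {
            ctx->have_c3 = 1;             /* wait for the second byte */
        } else if (b >= 'a' && b <= 'z') {
            out[n++] = (char)(b - ('a' - 'A'));
        } else {
            out[n++] = (char)b;
        }
    }
    return n;
}
```

For example, feeding "stra" plus the byte 0xC3 in one call and 0x9F plus "e" in the next produces "STRASSE" across the two output chunks, with the context remembering the split ß in between. In UTF-8 the byte length happens to stay the same for this particular mapping (two bytes in, two bytes out), but the character count grows.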