
Re: [gbinfitt] TAB to Unicode conversion algorithm

  • Sinnathurai Srivas
    Message 1 of 2 , Nov 8, 2003
      Dear Peter Jacobi,



      Thank you for working on this project. The au modifier is a simple problem. However, because of misleading Unicode encoding documentation, you may find it difficult to grasp the idea of the au marker in Unicode, hence your difficulty with the conversion from TAB to Unicode.



      What is the problem with the Unicode documentation?

      Answer: There is no au marker encoded in Unicode, i.e. there is no code point 0bd7. This may surprise you, but it is a fact.

      Then what do you see in the Unicode documentation as the "au" marker at code point 0bd7?

      Answer: This is an error by those who documented the Unicode architecture. In the name of continuity, Unicode refuses to correct this mistake in the documentation.

      There is only "au" in code point 0bcc.

      I hope you can now plan your conversion system and conversion program.

      Additional Info:

      Suratha has published multiple modifiers. You can find the TAB to Unicode conversion utility as part of these modifiers at

      http://www.jaffnalibrary.com/tools/tamilconverter.htm

      http://www.jaffnalibrary.com/tools/

      If you need more information on the necessary algorithm, please write to e-Uthavi@yahoogroups.com.

      Sinnathurai Srivas


      Peter Jacobi <peter_jacobi@...> wrote:
      Dear List Members,

      I'm a contributing developer for Open Source Software, namely the
      Firebird SQL database, mainly in the field of I18N.

      I'm looking for information on how to convert TAB to Unicode. The links I've
      found in the archives of this mailing list were dead or only led to
      executable programs, not source code.

      Whereas I have found the mapping for all 'easy' cases, I don't know how to
      disambiguate the au modifier. I understand that this is a non-problem for
      native speakers of Tamil, but I need an algorithm for use in a conversion
      program.

      Source code should be public domain or available under BSD- and (L)GPL-
      compatible licenses. Algorithmic descriptions should be free of patents,
      which I assume is not a problem.

      Regards,
      Peter Jacobi
      Hamburg, Germany



    • Sinnathurai Srivas
      Message 2 of 2 , Nov 8, 2003
        Dear Peter Jacobi,

        I think you may be looking at two different issues (mixing issues).

        In your statement you said the "au" marker is redundant.
        Because you are converting from TAB to Unicode, you never have to convert to the sequence ka+e+"au-marker". You always have to convert to ka+au. There will never be a situation where you need to produce ka+e+"au-marker" as Unicode data.

        Outside of this TAB to Unicode conversion, canonical handling may be essential: if someone inputs the wrong sequence in Unicode (for example, using the charmap facility in Windows), the software should correct it to the proper Unicode storage format. As I pointed out earlier, TAB to Unicode conversion will not encounter this problem, and there cannot be a solution for a non-existent problem.

        Mixing of issues: I do not know whether you now wish to discuss kau and kau,
        ie, 1icombu+ka+LL and 1icombu+ka+LL.

        Both look same, confusing!!

        Let me assure you there is only one form in use. The other is not in use, but some assume that it is and hunt hard to find one or two examples. So here we are going on a wild-goose chase. Assuming someone has convinced us to chase the wild goose, what is the solution?

        Your scenario is like this.

        1 Standard form
        1icombu+ka+LL + va + i = kauvi
        2 Un-used form is
        1icombu+ka+LL + va + i = keLLavi (keLavi)

        Solution: TAB does not distinguish this difference. So ideally there is no solution. Tamil writing does not distinguish this difference, so ideally there is no solution.

        Now assume someone informs you that he meant the unused "keLavi"; then you must allow the converted Unicode text to be manually transformed to "keLavi".
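As a rough illustration of such a manual transformation, here is a minimal Python sketch. It assumes the converter has already produced the standard ka+au form, and that a user then points at one AU sign that should instead be read as vowel sign E followed by the letter LLA. The function name `expand_au_at` and the index-based interface are hypothetical and only serve the example.

```python
E_SIGN = "\u0BC6"   # TAMIL VOWEL SIGN E
AU_SIGN = "\u0BCC"  # TAMIL VOWEL SIGN AU
LLA = "\u0BB3"      # TAMIL LETTER LLA


def expand_au_at(text: str, index: int) -> str:
    """Replace the AU sign at `index` with E sign + LLA (hypothetical helper).

    This is the manual override: the caller, not the converter, decides
    that this particular occurrence was meant as the unused form.
    """
    if text[index] != AU_SIGN:
        raise ValueError("no TAMIL VOWEL SIGN AU at that position")
    return text[:index] + E_SIGN + LLA + text[index + 1:]


# "kauvi" as the converter would emit it: KA + AU sign, VA + I sign.
kauvi = "\u0B95\u0BCC\u0BB5\u0BBF"
# The same word after the user asks for the other reading at position 1.
kelavi = expand_au_at(kauvi, 1)
```

The override is kept outside the converter on purpose: the conversion itself stays deterministic, and the rare exception is applied to the Unicode text afterwards.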

        Now assume that at some time in the future the TAB encoding itself decides to chase the wild-goose problem and includes a diacritic marker to distinguish kauvi from keLavi. You can plan for the future and reserve a diacritic marker identifier to solve this problem.

        You can see that this scenario can not be a requirement of Unicode.

        Can you clarify whether mixing of issues is a problem here, and which problem you wish to discuss further?

        Sinnathurai Srivas


        Peter Jacobi <peter_jacobi@...> wrote:
        Dear Sinnathurai Srivas,

        I assume I've stated my problem in unclear terms. I'll try to
        give some examples, but because of my lack of knowledge
        of Tamil I must apologize in advance if the example strings
        are silly or even offensive.

        > Answer: there is no au marker encoded in Unicode. ie, there is no code
        > point 0bd7. This may surprises you, but this is a fact.
        >
        > Then what do you see in Unicode documentation as "au" marker code point
        > 0bd7?
        >
        > Answer: This is an error by those documented the Unicode Architecture. In
        > the name of continuity Unicode refuses to correct this mistake found in
        > the documentation.

        Concerning the Unicode side of the issue, I'm aware of the fact that U+0BD7
        is redundant and, according to most commentators, should not be used.

        Despite that, conforming software is required to interpret
        U+0B94 and U+0B92 U+0BD7
        or
        U+0B95 U+0BCC and U+0B95 U+0BC6 U+0BD7
        as canonically equivalent.

        This is a technicality which is usually hidden from the user. The software
        will most likely always write the form without U+0BD7, but it must accept the
        other form as input.
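The equivalence described above can be checked with Python's standard `unicodedata` module. This is only a small sketch; the helper name `canonically_equal` is made up for the example. Two strings are canonically equivalent exactly when their normalized forms are identical, and NFC is the composed form that contains no U+0BD7 for these sequences.

```python
import unicodedata

KA_AU = "\u0B95\u0BCC"             # KA + VOWEL SIGN AU
KA_E_AULEN = "\u0B95\u0BC6\u0BD7"  # KA + VOWEL SIGN E + AU LENGTH MARK
AU = "\u0B94"                      # LETTER AU
O_AULEN = "\u0B92\u0BD7"           # LETTER O + AU LENGTH MARK


def canonically_equal(a: str, b: str) -> bool:
    """True if the two strings are canonically equivalent (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def to_storage_form(text: str) -> str:
    """Accept either spelling on input and write the composed (NFC) form."""
    return unicodedata.normalize("NFC", text)
```

Here `canonically_equal(KA_AU, KA_E_AULEN)` and `canonically_equal(AU, O_AULEN)` both hold, while KA + E + LLA is a different string under every normalization form.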

        What I don't understand is the TAB side (my TSCII conversion is working).

        When TAB input contains the sequence
        170 TAMIL VOWEL SIGN E
        232 TAMIL LETTER KA
        247 TAMIL LETTER LLA

        This can be transcoded to Unicode as:
        U+0B95 TAMIL LETTER KA
        U+0BC6 TAMIL VOWEL SIGN E
        U+0BB3 TAMIL LETTER LLA

        But, as far as I have learned from the archives,
        it can also be transcoded to Unicode:
        U+0B95 TAMIL LETTER KA
        U+0BCC TAMIL VOWEL SIGN AU

        All three forms give exactly the same glyph displays.

        Nevertheless, only one transcoding can be right, and
        I assume a native speaker of Tamil will immediately see
        which one, depending on context.

        My task now, is to put this decision into an algorithm.
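To make the ambiguity concrete, here is a minimal Python sketch that lists every Unicode reading of a short TAB byte sequence. It only knows the three TAB byte values quoted in this message (170, 232, 247); the rest of the TAB table, the function name `candidate_transcodings`, and the convention of listing the KA + AU reading first are assumptions for illustration, not a complete converter.

```python
from itertools import product

# TAB byte values as quoted in this message; everything else is omitted.
TAB_E_SIGN = 170                 # TAMIL VOWEL SIGN E (written before the consonant)
TAB_LLA = 247                    # TAMIL LETTER LLA (same glyph as the au length mark)
TAB_CONSONANTS = {232: "\u0B95", 247: "\u0BB3"}  # KA, LLA

E_SIGN = "\u0BC6"
AU_SIGN = "\u0BCC"
LLA = "\u0BB3"


def candidate_transcodings(tab_bytes: bytes) -> list[str]:
    """Return all Unicode readings of a TAB byte sequence (sketch only)."""
    slots = []  # one list of alternatives per cluster
    i = 0
    while i < len(tab_bytes):
        b = tab_bytes[i]
        nxt = tab_bytes[i + 1] if i + 1 < len(tab_bytes) else None
        if b == TAB_E_SIGN and nxt in TAB_CONSONANTS:
            cons = TAB_CONSONANTS[nxt]
            if i + 2 < len(tab_bytes) and tab_bytes[i + 2] == TAB_LLA:
                # Ambiguous: consonant + AU, or consonant + E followed by LLA.
                slots.append([cons + AU_SIGN, cons + E_SIGN + LLA])
                i += 3
            else:
                # Unicode stores the vowel sign after the consonant.
                slots.append([cons + E_SIGN])
                i += 2
        elif b in TAB_CONSONANTS:
            slots.append([TAB_CONSONANTS[b]])
            i += 1
        else:
            raise ValueError(f"TAB byte {b} is not covered by this sketch")
    return ["".join(choice) for choice in product(*slots)]


readings = candidate_transcodings(bytes([170, 232, 247]))
```

For the sequence 170 232 247, `readings` holds the two transcodings listed above; choosing between them is exactly the decision the algorithm still has to make.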

        > Suratha has published multiple modifiers. You can find the TAB to unicode
        > conversion utility as part of these modifiers at
        >
        > http://www.jaffnalibrary.com/tools/tamilconverter.htm
        >
        > http://www.jaffnalibrary.com/tools/

        Unfortunately, both the Tamil text on the Web site and the compact,
        uncommented JavaScript code of the converter have so far blocked my
        understanding.

        > If you need find more info on the necessary algorithm, please write to
        > e-Uthavi@yahoogroups.com.

        I'll try this. Thank you for the link.

        Regards,
        Peter Jacobi
