Re: [Clip] Code page/character issues

  • Axel Berger
    Message 1 of 14 , Jan 12, 2013
      John Shotsky wrote:
      > Text View: Speka Piragi
      > Hex/Ascii View: Speķa Pīrāgi
      > NoteTab correctly detects it as utf-8. But when I force it to
      > Windows 1252, it displays as in NoteTab – incorrectly.

      It has to, those characters are not in CP1252. Converting your sample
      and assuming mail transfer has not broken anything I get:

      Speķa Pīrāgi

      These are from the Unicode "Latin Extended-A" block:
      http://www.sql-und-xml.de/unicode-database/latin-extended-a.html
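      For anyone who wants to check this, here is a minimal Python sketch (not from the thread; the helper name `not_in_codepage` is made up for illustration). It tries to encode each letter in CP1252 and then decodes the UTF-8 bytes as CP1252, to show roughly what a forced 8-bit view looks like. It assumes Python's own `cp1252` codec, which may differ in detail from what NoteTab or Windows do.

```python
def not_in_codepage(text, codepage="cp1252"):
    """Return the distinct characters of text that the code page cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

# The sample from the thread, written with escapes so it survives any transfer.
sample = "Spe\u0137a P\u012br\u0101gi"

missing = not_in_codepage(sample)
code_points = [f"U+{ord(ch):04X}" for ch in missing]

# The same text as UTF-8 bytes: each accented letter takes two bytes.
utf8_bytes = sample.encode("utf-8")

# Reading those bytes as CP1252 turns each letter into two unrelated characters.
# errors="replace" is needed because Python's cp1252 codec treats byte 0x81
# (the second byte of U+0101) as undefined.
misread = utf8_bytes.decode("cp1252", errors="replace")

print(code_points)
print(utf8_bytes.hex(" "))
print(misread)
```

      All three code points fall in the range U+0100 to U+017F, which is the Latin Extended-A block.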

      NoteTab will never be able to deal with them satisfactorily. What I
      don't get at all is how Win7 interferes with them, but then I have so
      far refrained from using eXPerimental and stick to Win98. Even that
      tries to interfere and impose its preferences over mine, but there I can
      more or less control it. Your identical byte count might result from
      using UTF-16, don't newer Windoses do that? If so the byte count should
      be twice the letter count.
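      The byte-count reasoning can be checked with a few lines of Python. This is only a sketch on the sample string from this thread, not on the actual file, and the unaccented variant is typed in by hand.

```python
sample = "Spe\u0137a P\u012br\u0101gi"   # 12 characters, counting the space
stripped = "Speka Piragi"                # same text with the accents dropped

sizes = {
    "letters": len(sample),
    "utf-8": len(sample.encode("utf-8")),
    # "utf-16-le" writes no byte order mark; plain "utf-16" would add 2 bytes.
    "utf-16": len(sample.encode("utf-16-le")),
    "ascii, accents dropped": len(stripped.encode("ascii")),
}
print(sizes)
```

      So a UTF-16 file would be twice the letter count, a UTF-8 file would be three bytes longer than the letter count (one extra byte per accented letter), and only the unaccented text has a byte count equal to the letter count.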

      > But, since EditPad Pro detects it correctly, I
      > don't think it's Windows.

      If EditPad is truly Unicode-based, as you say, then it need not detect
      anything. NoteTab is strictly 8-bit and strictly codepage based; all it
      can do with UTF-8 is read letters that fall inside the single chosen
      codepage. Letters from more than one codepage inside the same document
      will never work.
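      To illustrate the single-codepage limit, here is a rough Python sketch. It says nothing about how NoteTab works internally; the function name `codepages_that_fit`, the candidate list and the Cyrillic sample word are all assumptions made up for the example. It simply asks which 8-bit code pages can hold a whole string.

```python
def codepages_that_fit(text, candidates=("cp1252", "cp1257", "cp1251")):
    """Return the candidate 8-bit code pages that can encode every character."""
    fits = []
    for codepage in candidates:
        try:
            text.encode(codepage)
        except UnicodeEncodeError:
            continue
        fits.append(codepage)
    return fits

latvian = "Spe\u0137a P\u012br\u0101gi"              # sample from the thread
cyrillic = "\u041f\u0438\u0440\u043e\u0433\u0438"    # made-up Cyrillic sample

print(codepages_that_fit(latvian))                   # Baltic code page only
print(codepages_that_fit(cyrillic))                  # Cyrillic code page only
print(codepages_that_fit(latvian + " " + cyrillic))  # no single code page
```

      Each string fits one code page on its own, but as soon as both appear in the same document no single 8-bit code page covers it.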

      Axel
    • John Shotsky
      Message 2 of 14 , Jan 12, 2013
        EditPad Pro is a Unicode editor, so yes, it displays Unicode and utf-8 and many other code pages correctly. But that
        file is not Unicode, it is 8-bit UTF. When one of these files is moved, NoteTab not only displays it correctly, but it
        also saves it correctly, that is, without the accents. So, that is the workaround for now. What is not acceptable is the
        file as first opened, which does not result in a question mark or any valid character in any code page. It is just
        garbage. Previously, NoteTab displayed a question mark for any character out of its map. Now, it doesn't.

        But that's not actually the point anyway. The file is UTF-8 when it is written, and after it is copied. Nothing is
        different about the file except that there is a copy in another location. The copy displays correctly in NoteTab, but
        the original doesn't. The copy works with my clip library, the original doesn't. If I export the original in NoteTab to
        UTF-8 it displays correctly, but of course just copying it works, as does renaming it, so I can't say the export
        actually does anything. However, if I export it to ASCII, question marks show up for those characters, as expected. The
        clip library can't work with a bunch of question marks either, of course, as there is no way to guess what the missing
        character is except through a very, very complex word map which replaces question marks with characters if the word is
        otherwise recognized. So, for the words you correctly detected below, I would simply substitute the unaccented
        characters for accented ones and that would be fine. But I can't do that with the original, because it displays EXTRA
        characters, as indicated in my 'Hex/Ascii' view below.
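        The question-mark behaviour can be reproduced in Python as a rough analogy. This is a sketch of Python's `errors="replace"` handling, not of NoteTab's export, and the second word is an arbitrary string chosen only to show the collision.

```python
def export_ascii(text):
    """Encode to ASCII, replacing every unencodable character with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")

sample = "Spe\u0137a P\u012br\u0101gi"
exported = export_ascii(sample)
print(exported)

# Two different strings collapse to the same output, so the original
# letter cannot be recovered from the question mark alone.
other = "Spe\u013ca"   # same shape with a different accented letter
print(export_ascii("Spe\u0137a") == export_ascii(other))
```

        Once the export has happened, every lost letter looks the same, which is why only a word map could guess it back.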

        So, for now, my instructions will include moving the Firefox-exported file to a work folder, and we'll go with that as
        long as it continues to work. As to the problem, I will leave it in the category of unresolvable.

        Regards,
        John
        RecipeTools Web Site: http://recipetools.gotdns.com/

      • Axel Berger
        Message 3 of 14 , Jan 12, 2013
          John Shotsky wrote:
          > But that file is not Unicode, it is 8-bit UTF.

          To my understanding UTF-8 is not something separate from Unicode but
          one of several possible encodings of it.

          > When one of these files is moved, NoteTab not only displays it
          > correctly, but it also saves it correctly, that is, without the
          > accents.

          Sorry, but if those letters do have accents, then anything without is
          INcorrect. It may be an acceptable workaround, like Muller or Mueller
          instead of Müller, but never correct.
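          If the unaccented form is accepted as a workaround, it can be produced deliberately rather than as a side effect of moving the file. A minimal Python sketch, assuming Unicode decomposition is good enough for the letters involved (the helper name `strip_accents` is made up):

```python
import unicodedata

def strip_accents(text):
    """Decompose each letter and drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Spe\u0137a P\u012br\u0101gi"))
print(strip_accents("M\u00fcller"))
```

          Note that this yields the "Muller" kind of substitute, never "Mueller"; language-specific transliterations would need their own mapping table.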

          > So, that is the workaround for now.

          Right

          > But that's not actually the point anyway.

          Agreed. Win7 does something strange here and I'm very happy I need not
          concern myself with that.

          > As to the problem, I will leave it in the category of unresolvable.

          Probably best.