
RE: [Clip] Code page/character issues

  • John Shotsky
    Message 1 of 14, Jan 12, 2013
      EditPad Pro is a Unicode editor, so yes, it displays Unicode and utf-8 and many other code pages correctly. But that
      file is not Unicode, it is 8-bit UTF. When one of these files is moved, NoteTab not only displays it correctly, but it
      also saves it correctly, that is, without the accents. So, that is the workaround for now. What is not acceptable is the
      file as first opened, which does not result in a question mark or any valid character in any code page. It is just
      garbage. Previously, NoteTab displayed a question mark for any character out of its map. Now, it doesn't.

      But that's not actually the point anyway. The file is UTF-8 when it is written, and after it is copied. Nothing is
      different about the file except that there is a copy in another location. The copy displays correctly in NoteTab, but
      the original doesn't. The copy works with my clip library, the original doesn't. If I export the original in NoteTab to
      UTF-8 it displays correctly, but of course just copying it works, as does renaming it, so I can't say the export
      actually does anything. However, if I export it to ASCII, question marks show up for those characters, as expected. The
      clip library can't work with a bunch of question marks either, of course, as there is no way to guess what the missing
      character is except through a very, very complex word map which replaces question marks with characters if the word is
      otherwise recognized. So, for the words you correctly detected below, I would simply substitute the unaccented
      characters for accented ones and that would be fine. But I can't do that with the original, because it displays EXTRA
      characters, as indicated in my 'Hex/Ascii' view below.
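A minimal Python sketch of the two effects described above: question marks after an ASCII export, and extra characters when UTF-8 bytes are shown one byte per character. The accented sample string is an assumption (the thread only shows the unaccented "Speka Piragi"), and Latin-1 stands in for the 8-bit view because it maps every byte value.

```python
# Hypothetical sample: the thread shows only "Speka Piragi", so the accented
# letters (k with cedilla, i with macron, a with macron) are an assumption.
sample = "Spe\u0137a P\u012br\u0101gi"

utf8_bytes = sample.encode("utf-8")

# Export to ASCII: every letter outside ASCII becomes a question mark.
ascii_export = sample.encode("ascii", errors="replace").decode("ascii")

# Show the UTF-8 bytes one byte per character, as an 8-bit view would.
# Latin-1 is used as a stand-in because it assigns a character to all 256
# byte values, so the decode cannot fail.
misread = utf8_bytes.decode("latin-1")

print(ascii_export)                # Spe?a P?r?gi
print(len(sample), len(misread))   # 12 15
```

Each accented letter takes two bytes in UTF-8, so the byte-per-character view is three characters longer than the real text.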

      So, for now, my instructions will include moving the Firefox-exported file to a work folder, and we'll go with that as
      long as it continues to work. As to the problem, I will leave it in the category of unresolvable.

      Regards,
      John
      RecipeTools Web Site: http://recipetools.gotdns.com/

      From: ntb-clips@yahoogroups.com [mailto:ntb-clips@yahoogroups.com] On Behalf Of Axel Berger
      Sent: Saturday, January 12, 2013 07:23
      To: ntb-clips@yahoogroups.com
      Subject: Re: [Clip] Code page/character issues


      John Shotsky wrote:
      > Text View: Speka Piragi
      > Hex/Ascii View: Spe��a P��r��gi
      > NoteTab correctly detects it as utf-8. But when I force it to
      > Windows 1252, it displays as in NoteTab - incorrectly.

      It has to, those characters are not in CP1252. Converting your sample
      and assuming mail transfer has not broken anything I get:

      Speka Piragi

      These are from the "extended block A"
      http://www.sql-und-xml.de/unicode-database/latin-extended-a.html

      NoteTab will never be able to deal with them satisfactorily. What I
      don't get at all is how Win7 interferes with them, but then I have so
      far refrained from using eXPerimental and stick to Win98. Even that
      tries to interfere and impose its preferences over mine, but there I can
      more or less control it. Your identical byte count might result from
      using UTF-16, don't newer Windoses do that? If so the byte count should
      be twice the letter count.
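The byte-count reasoning can be checked with a short Python sketch. The accented sample is an assumption, since the mail transfer left only "Speka Piragi" in the thread; the point is only how letter count, UTF-8 size and UTF-16 size relate, and which letters CP1252 lacks.

```python
# Hypothetical sample with three Latin Extended-A letters (assumed, not
# taken from the original file).
sample = "Spe\u0137a P\u012br\u0101gi"

letters = len(sample)
utf8_size = len(sample.encode("utf-8"))
# "utf-16-le" writes no byte order mark, so the size is exactly 2 per letter
# for characters in the Basic Multilingual Plane.
utf16_size = len(sample.encode("utf-16-le"))


def in_cp1252(ch):
    """Return True if the single character has a slot in Windows-1252."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


missing = [ch for ch in sample if not in_cp1252(ch)]

print(letters, utf8_size, utf16_size)   # 12 15 24
print(len(missing))                     # 3
```

So a byte count equal to the letter count would point to neither UTF-8 nor UTF-16 for text like this, but to a file whose accents are already gone.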

      > But, since EditPad Pro detects it correctly, I
      > don't think it's Windows.

      If EditPad is true UTF, as you say, then it need not detect anything.
      NoteTab is strictly 8-bit and strictly codepage based, all it can do is
      read letters from inside that single chosen codepage when encoded as
      UTF-8. Letters from more than one codepage inside the same document will
      never work.
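To illustrate the single-codepage limit, here is a rough Python sketch that asks which of a few Windows code pages can hold a given text. The function name, the candidate list and both sample strings are made up for the illustration; this says nothing about how NoteTab itself is implemented.

```python
def code_pages_that_fit(text, candidates=("cp1252", "cp1257", "cp1251")):
    """Return the candidate code pages that can encode every letter of text."""
    fitting = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one letter has no slot in this code page
        fitting.append(name)
    return fitting


# Hypothetical samples: Latvian letters (assumed) and a Cyrillic word.
latvian = "Spe\u0137a P\u012br\u0101gi"
russian = "\u043f\u0438\u0440\u043e\u0433\u0438"

print(code_pages_that_fit(latvian))                  # ['cp1257']
print(code_pages_that_fit(russian))                  # ['cp1251']
print(code_pages_that_fit(latvian + " " + russian))  # []
```

Each text fits one code page on its own, but the combined document fits none of them, which is the case that can only be handled by a real Unicode editor.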

      Axel



    • Axel Berger
      Message 2 of 14, Jan 12, 2013
        John Shotsky wrote:
        > But that file is not Unicode, it is 8-bit UTF.

        To my understanding UTF-8 is not something separate from Unicode, it is
        one of several possible encodings of it.

        > When one of these files is moved, NoteTab not only displays it
        > correctly, but it also saves it correctly, that is, without the
        > accents.

        Sorry, but if those letters do have accents, then anything without is
        INcorrect. It may be an acceptable workaround, like Muller or Mueller
        instead of Müller, but never correct.
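For what it is worth, the "Muller" style of workaround can be sketched in Python with the standard unicodedata module. The function name strip_accents is made up here, and the Latvian sample is an assumption; the "Mueller" style would need a language-specific mapping table and is not covered.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (lossy workaround)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("M\u00fcller"))                   # Muller
# Hypothetical accented form of the sample from this thread.
print(strip_accents("Spe\u0137a P\u012br\u0101gi"))   # Speka Piragi
```

This only works for letters that decompose into a base letter plus a mark; letters such as ø or ß have no such decomposition and pass through unchanged.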

        > So, that is the workaround for now.

        Right

        > But that's not actually the point anyway.

        Agreed. Win7 does something strange here and I'm very happy I need not
        concern myself with that.

        > As to the problem, I will leave it in the category of unresolvable.

        Probably best.