Re: ASCIIfication (removal of accent, cedilla, etc)

  • Benjamin R. Haskell
    Message 1 of 11 , Aug 29, 2012
      The tl;dr version, pipe it through:

      uconv -t ASCII -x nfd -c


      On Wed, 29 Aug 2012, Tim Chase wrote:

      > I've got some Portuguese text that I need to perform some
      > transformations on to make them ASCII (7-bit). That means removing
      > accent marks, cedillas, tildes, etc.

      Just to cover my bases: this seems like a bad idea in general. I don't
      know much about Portuguese, but one of the minimal pairs listed in the
      Wikipedia article for Portuguese phonology¹ is:

      pensamos "we think"
      vs.
      pensámos "we thought"


      > Is there some fast transform in Vim that I've missed, or an easy way
      > to go about this?

      In most contexts, Unicode strings are stored in Normal Form C (NFC),
      which means they're equivalent to having passed through Canonical
      Decomposition followed by Canonical Composition. This means that any
      characters that have "combined" codepoints are so combined.

      Characters in Unicode strings stored in Normal Form D (NFD) (==
      Canonical Decomposition) have their "combined" codepoints split into the
      base codepoint and "combining character" codepoints.

      As a practical example, the string "é" is:

      in NFC:

      U+00E9 LATIN SMALL LETTER E WITH ACUTE

      in NFD:

      U+0065 LATIN SMALL LETTER E
      U+0301 COMBINING ACUTE ACCENT

      The Unicode Consortium has full details².
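
The NFC/NFD difference described above can be checked with a short Python sketch. This is only an illustration using the standard-library `unicodedata` module; the helper name `describe` is made up for this example and is not part of the thread.

```python
import unicodedata

# "é" as a single precomposed code point, i.e. the NFC form
composed = "\u00e9"
# Canonical decomposition (NFD) splits it into base letter + combining mark
decomposed = unicodedata.normalize("NFD", composed)


def describe(text):
    """Return a 'U+XXXX NAME' string for every code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


nfc_names = describe(composed)
nfd_names = describe(decomposed)
# Recomposing the NFD string gives back the original NFC string
roundtrip = unicodedata.normalize("NFC", decomposed)

print(nfc_names)
print(nfd_names)
```

The two lists should match the code points listed for NFC and NFD above.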

      The 'icu' project³ (International Components for Unicode) has a
      converter similar to `iconv` called `uconv`, which also lets you specify
      a transliterator to run over the input. So, to get rid of accents,
      cedillas, tildes, etc, you can convert your text into Unicode NFD, then
      convert it to ASCII and discard any characters not in ASCII (which
      includes the combining accent marks).

      Assuming the text is encoded in the same encoding as your current
      locale and you're in a Unicode locale, you can pipe it through:

      uconv -t ASCII -x nfd -c

      -t ASCII = convert to ASCII (t = to/target)
      -x nfd = use the NFD transliterator
      -c = discard any characters that don't have equivalents in the target

      If your source data is in a different encoding and/or you're not in a
      Unicode locale (or just a differently-encoded locale), you might have to
      be more explicit, e.g.:

      uconv -f SOURCE-ENCODING -t ASCII -x nfd -c

      (where SOURCE-ENCODING could be, e.g. ISO-8859-1 or ISO-8859-15 -- full
      list from running `uconv -l`)
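
For readers without `uconv`, the same two steps (decompose to NFD, then drop everything that is not ASCII) can be sketched in Python. This is a rough stand-in under the assumption that the input is already a decoded Unicode string, not a reimplementation of `uconv`; the function name `asciify` is invented for this example.

```python
import unicodedata


def asciify(text):
    """Decompose to NFD, then silently drop every non-ASCII code point.

    Roughly mirrors `-x nfd` followed by `-t ASCII -c`: combining marks are
    discarded, and so is any character with no canonical decomposition.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(asciify("pensámos"))  # the accent is lost, so it collides with "pensamos"
print(asciify("ação"))
```

Note that the first call illustrates the minimal-pair caveat from the top of this message: after the conversion, "pensámos" and "pensamos" are indistinguishable.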

      --
      Best,
      Ben

      ¹: http://en.wikipedia.org/wiki/Portuguese_phonology
      ²: http://unicode.org/reports/TR15/
      ³: http://icu-project.org

      --
      You received this message from the "vim_use" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php
    • Andy Wokula
      Message 2 of 11 , Aug 30, 2012
        On 30.08.2012 04:36, Tim Chase wrote:
        > I've got some Portuguese text that I need to perform some
        > transformations on to make them ASCII (7-bit). That means removing
        > accent marks, cedillas, tildes, etc.
        >
        > Is there some fast transform in Vim that I've missed, or an easy way
        > to go about this?
        >
        > Thanks,
        >
        > -tim

        From the pattern point of view, there are equivalence classes:
        /[[=

        Maybe this can be used in:
        :%s/[[=e=]]/e/g
        etc.

        --
        Andy

      • Dominique Pellé
        Message 3 of 11 , Aug 30, 2012
          Andy Wokula wrote:

          > On 30.08.2012 04:36, Tim Chase wrote:
          >
          >> I've got some Portuguese text that I need to perform some
          >> transformations on to make them ASCII (7-bit). That means removing
          >> accent marks, cedillas, tildes, etc.
          >>
          >> Is there some fast transform in Vim that I've missed, or an easy way
          >> to go about this?
          >>
          >> Thanks,
          >>
          >> -tim
          >
          >
          > From the pattern point of view, there are equivalence classes:
          > /[[=
          >
          > Maybe this can be used in:
          > :%s/[[=e=]]/e/g
          > etc.


          You need a fairly recent version of Vim for this feature to work
          with non-latin1 characters, since equivalence classes [[=.=]] were
          improved in this patch: ftp://ftp.vim.org/pub/vim/patches/7.3/7.3.259

          Regards
          -- Dominique

        • Tim Chase
          Message 4 of 11 , Aug 30, 2012
            On 08/29/12 21:46, Salman Halim wrote:
            >> I've got some Portuguese text that I need to perform some
            >> transformations on to make them ASCII (7-bit). That means
            >> removing accent marks, cedillas, tildes, etc.
            >>
            >> Is there some fast transform in Vim that I've missed, or an
            >> easy way to go about this?
            >
            > I don't believe there is something that will figure out the
            > non-accented version of a given character, but you could do
            > something similar using tr() by passing in "èéêëē" and "eeeee",
            > for example.

            Thanks to everybody for their suggestions. Playing around a little,
            I went with using equivalence classes:

            :%s/[[=a=]]/a/g|%s/[[=e=]]/e/g|...

            which is still tedious, but at least a little less so. If there's
            some magic method I've missed (this happens to be a work thing, so
            I'm stuck on Win32 without the conversion utility mentioned
            elsewhere in the thread), I'd love to know how to improve this.
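
The tr()-style mapping suggested in the quoted text above can also be sketched outside Vim. The following Python snippet is only an illustration: the character list is an assumption chosen for common Portuguese lowercase letters and is not exhaustive, and the name `PT_TABLE` is made up here.

```python
# Explicit one-to-one mapping, in the spirit of tr("èéêëē", "eeeee").
# Assumption: this covers only a hand-picked set of lowercase letters.
ACCENTED = "áàâãéêíóôõúüç"
PLAIN = "aaaaeeiooouuc"

# str.maketrans requires both strings to have the same length
PT_TABLE = str.maketrans(ACCENTED, PLAIN)

print("coração".translate(PT_TABLE))
print("pensámos".translate(PT_TABLE))
```

Unlike the equivalence-class substitutions, an explicit table leaves any character it does not list untouched, so missing entries have to be added by hand.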

            -tim



          • Bee
            Message 5 of 11 , Aug 30, 2012
              Tim Chase wrote:
              > On 08/29/12 21:46, Salman Halim wrote:
              > >> I've got some Portuguese text that I need to perform some
              > >> transformations on to make them ASCII (7-bit). That means
              > >> removing accent marks, cedillas, tildes, etc.
              > >>
              > >> Is there some fast transform in Vim that I've missed, or an
              > >> easy way to go about this?
              > >
              > > I don't believe there is something that will figure out the
              > > non-accented version of a given character, but you could do
              > > something similar using tr() by passing in "èéêëē" and "eeeee",
              > > for example.
              >
              > Thanks to everybody for their suggestions. Playing around a little,
              > I went with using equivalence classes:
              >
              > :%s/[[=a=]]/a/g|%s/[[=e=]]/e/g|...
              >
              > which is still tedious, but at least a little less so. If there's
              > some magic method I've missed (this happens to be a work thing, so
              > I'm stuck on Win32 without the conversion utility mentioned
              > elsewhere in the thread), I'd love to know how to improve this.
              >
              > -tim

              Something like the following to simplify:

              function! AEIOU()
                for x in ["a","e","i","o","u"]
                  execute ':%s/[[='.x.'=]]/'.x.'/g'
                endfor
              endf " :call AEIOU()

              Bill

            • Bee
              Message 6 of 11 , Aug 30, 2012
                On Aug 30, 7:20 am, Tim Chase <v...@...> wrote:
                > On 08/29/12 21:46, Salman Halim wrote:
                >
                > >> I've got some Portuguese text that I need to perform some
                > >> transformations on to make them ASCII (7-bit).  That means
                > >> removing accent marks, cedillas, tildes, etc.
                >
                > >> Is there some fast transform in Vim that I've missed, or an
                > >> easy way to go about this?
                >
                > > I don't believe there is something that will figure out the
                > > non-accented version of a given character, but you could do
                > > something similar using tr() by passing in "èéêëē" and "eeeee",
                > > for example.
                >
                > Thanks to everybody for their suggestions. Playing around a little,
                > I went with using equivalence classes:
                >
                >   :%s/[[=a=]]/a/g|%s/[[=e=]]/e/g|...
                >
                > which is still tedious, but at least a little less so.  If there's
                > some magic method I've missed (this happens to be a work thing, so
                > I'm stuck on Win32 without the conversion utility mentioned
                > elsewhere in the thread), I'd love to know how to improve this.
                >
                > -tim


                " Something like:

                function! AEIOU()
                  for x in ["a","e","i","o","u","n","y"]
                    execute ':%s/[[='.x.'=]]/'.x.'/g'
                  endfor
                endf " :call AEIOU()

                " Test on:
                "? èéêëē
                ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

                Bill

              • Christian Brabandt
                Message 7 of 11 , Aug 30, 2012
                  Hi Benjamin!

                  [reformatted]

                  On Wed, 29 Aug 2012, Benjamin R. Haskell wrote:

                  > On Wed, 29 Aug 2012, Tim Chase wrote:
                  >
                  > >I've got some Portuguese text that I need to perform some
                  > >transformations on to make them ASCII (7-bit). That means
                  > >removing accent marks, cedillas, tildes, etc.
                  >
                  > Just to cover my bases: this seems like a bad idea in general. I
                  > don't know much about Portuguese, but one of the minimal pairs
                  > listed in the Wikipedia article for Portuguese phonology¹ is:
                  >
                  > pensamos "we think"
                  > vs.
                  > pensámos "we thought"
                  >
                  >
                  > >Is there some fast transform in Vim that I've missed, or an easy
                  > >way to go about this?
                  >
                  > In most contexts, Unicode strings are stored in Normal Form C (NFC),
                  > which means they're equivalent to having passed through Canonical
                  > Decomposition followed by Canonical Composition. This means that
                  > any characters that have "combined" codepoints are so combined.
                  >
                  > Characters in Unicode strings stored in Normal Form D (NFD) (==
                  > Canonical Decomposition) have their "combined" codepoints split into
                  > the base codepoint and "combining character" codepoints.
                  >
                  > As a practical example, the string "é" is:
                  >
                  > in NFC:
                  >
                  > U+00E9 LATIN SMALL LETTER E WITH ACUTE
                  >
                  > in NFD:
                  >
                  > U+0065 LATIN SMALL LETTER E
                  > U+0301 COMBINING ACUTE ACCENT
                  >
                  > Unicode consortium has full details².

                  The interesting part is that you can stack many combining chars on a
                  base char to create new chars that don't exist as precomposed code
                  points. And BTW: for decomposed chars, the 'delcombine' option can be
                  useful.

                  One of the major drawbacks is that this will probably cause a lot of
                  interoperability issues when exchanging data between Unix and Mac OS X,
                  because on Unix the NFC form is normally used, while Mac OS X saves
                  filenames in NFD form. I have already seen problems like this:

                  #v+
                  chrisbra@R500:~/charset$ ls
                  ä ä
                  chrisbra@R500:~/charset$ ls |xxd
                  0000000: 61cc 880a c3a4 0a a......
                  #v-

                  So one filename consists of
                  U+0061 LATIN SMALL LETTER A
                  U+0308 COMBINING DIAERESIS
                  while the other filename is stored as
                  U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

                  In this case you can convert the filenames using convmv and using the
                  --nfc or --nfd switch.
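
The byte sequences from the xxd dump above can be compared directly in Python. This is a small sketch, assuming the two names are exactly the bytes `61 cc 88` and `c3 a4` shown in the dump; the variable names are invented for this example.

```python
import unicodedata

# The two filenames from the xxd output, decoded from their UTF-8 bytes
decomposed_name = bytes.fromhex("61cc88").decode("utf-8")   # a + U+0308
precomposed_name = bytes.fromhex("c3a4").decode("utf-8")    # U+00E4

# They render identically but differ code point by code point
same_raw = decomposed_name == precomposed_name

# After normalizing both to the same form they compare equal
same_nfc = (unicodedata.normalize("NFC", decomposed_name)
            == unicodedata.normalize("NFC", precomposed_name))

print(same_raw, same_nfc)
```

Normalizing both sides to one form before comparing is the same idea that convmv applies to the filenames themselves.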

                  I have also seen queries from developers asking why data sometimes
                  looks totally garbled. After investigating, this turned out to be
                  caused by NFC/NFD confusion (or programs not converting those chars
                  correctly).

                  > The 'icu' project³ (International Components for Unicode) has a
                  > converter similar to `iconv` called `uconv`, which also lets you
                  > specify a transliterator to run over the input. So, to get rid of
                  > accents, cedillas, tildes, etc, you can convert your text into
                  > Unicode NFD, then convert it to ASCII and discard any characters not
                  > in ASCII (which includes the combining accent marks).
                  >
                  > Assuming the text is encoded in the same encoding as your current
                  > locale and you're in a Unicode locale, you can pipe it through:
                  >
                  > uconv -t ASCII -x nfd -c
                  >
                  > -t ASCII = convert to ASCII (t = to/target)
                  > -x nfd = use the NFD transliterator
                  > -c = discard any characters that don't have equivalents in the target
                  >
                  > If your source data is in a different encoding and/or you're not in
                  > a Unicode locale (or just a differently-encoded locale), you might
                  > have to be more explicit, e.g.:
                  >
                  > uconv -f SOURCE-ENCODING -t ASCII -x nfd -c
                  >
                  > (where SOURCE-ENCODING could be, e.g. ISO-8859-1 or ISO-8859-15 --
                  > full list from running `uconv -l`)

                  Thanks Benjamin, that is really useful. I didn't know about uconv and
                  this looks interesting. Unfortunately, this doesn't work really well.
                  Consider this test file:
                  #v+
                  chrisbra@R500:~/charset$ cat file_utf8_nfc.txt
                  èéêëē
                  ß
                  ü
                  €
                  Æ
                  Oﬃce
                  ế
                  2⁵
                  chrisbra@R500:~/charset$ uconv -f utf-8 -t ASCII -x nfd -c file_utf8_nfc.txt
                  eeeee

                  u


                  Oce
                  e
                  2
                  #v-

                  It is slightly better to transliterate into NFKD form (which also
                  maps single glyphs, such as ligatures, to similar plain letters)
                  before deleting non-ASCII chars, but this doesn't work correctly
                  either:

                  #v+
                  chrisbra@R500:~/charset$ uconv -f utf-8 -t ASCII -x nfkd -c file_utf8_nfc.txt
                  eeeee

                  u


                  Office
                  e
                  25
                  #v-

                  As you can see, this doesn't work really well for some of the more
                  exotic chars. Even the German Eszett 'ß', which is hardly obscure,
                  isn't converted to ss, although that should certainly be possible.
                  In this case, iconv still works better:

                  #v+
                  chrisbra@R500:~/charset$ iconv -f utf-8 -t ascii//translit file_utf8_nfc.txt
                  eeeee
                  ss
                  ue
                  EUR
                  AE
                  Office
                  e
                  2?
                  #v-

                  The //translit suffix means: convert using an approximation if a char
                  cannot be converted directly.
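
The difference between the NFD and NFKD runs above can be reproduced with Python's standard `unicodedata` module. This is a sketch under the assumption that the "Office" line of the test file uses the single-glyph ligature U+FB03 (as the "Oce" output suggests); the helper name `strip_non_ascii` is invented for this example.

```python
import unicodedata


def strip_non_ascii(text, form):
    """Normalize to the given form, then drop all non-ASCII code points."""
    normalized = unicodedata.normalize(form, text)
    return normalized.encode("ascii", "ignore").decode("ascii")


# ü, "O" + ffi ligature + "ce", "2" + superscript five, and Eszett
samples = ["\u00fc", "O\ufb03ce", "2\u2075", "\u00df"]

nfd_results = [strip_non_ascii(s, "NFD") for s in samples]
nfkd_results = [strip_non_ascii(s, "NFKD") for s in samples]

print(nfd_results)
print(nfkd_results)
```

In both forms the Eszett is dropped entirely, because it has neither a canonical nor a compatibility decomposition; getting "ss" needs a transliteration table such as the one behind iconv's //translit.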

                  To come back to Vim: it should be possible to use Vim's iconv()
                  function together with the //translit suffix to strip those
                  diacritics, but unfortunately this doesn't seem to work very well
                  (and doesn't seem to work on Windows at all, although my Vim has
                  +iconv/dyn and I have iconv.dll¹ lying around):

                  :%s#.#\=iconv(submatch(0), 'utf-8', 'ascii//translit')#g
                  produces:
                  ?????
                  ss
                  ?
                  EUR
                  AE
                  Office
                  ?
                  2?

                  For German readers, I have also blogged about this at:
                  https://blog.256bit.org/archives/768-Das-Problem-mit-UTF-8-Teil2.html
                  https://blog.256bit.org/archives/724-Das-Problem-mit-UTF-8.html

                  For reference, I'll save this mail at
                  http://www.256bit.org/~chrisbra/utf8_mail.html
                  in case Google Groups mangles the characters; browsers seem to be
                  better at rendering multibyte characters.

                  ¹) In case you are looking for an iconv.dll for Windows, you can
                  download it from here:
                  http://sourceforge.net/projects/gettext/files/latest/download
                  and while you are at it, you may also want to download intl.dll from
                  http://sourceforge.net/projects/gettext/files/gettext-win32/0.13.1/gettext-runtime-0.13.1.bin.woe32.zip


                  regards,
                  Christian

                • Bee
                  Message 8 of 11 , Aug 30, 2012
                    Or slightly simpler:

                    function! AEIOU() "! :help [=* equivalence class
                      for x in split("aeiouny", '\zs')
                        execute ':%s/[[='.x.'=]]/'.x.'/g'
                      endfor
                    endf "usage :silent! call AEIOU()

                    "! :help digraph-table
                    " Test on:
                    "? ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

                    Bill
