Re: [patch] exclude East Asian characters from spell checking

  • Bram Moolenaar
    Message 1 of 6 , Oct 7, 2013
      Tony Mechelynck wrote:

      > On 07/10/13 14:02, Ken Takata wrote:
      > > Hi,
      > >
      > > I wrote a patch for the following items from todo.txt:
      > >
      > >> Have an option for spell checking to not mark any Chinese, Japanese or other
      > >> double-width characters as error. Or perhaps all characters above 256.
      > >> (Bill Sun) Helps a lot for mixed Asian and latin text.
      > >
      > >> - have some way not to give spelling errors for a range of characters.
      > >> E.g. for Chinese and other languages with specific characters for which we
      > >> don't have a spell file. Useful when there is also text in other
      > >> languages in the file.
      > >
      > > When I write mixed Japanese and English text, it really annoys me.
      > > Current Vim's spell checking algorithm doesn't support Chinese, Japanese or
      > > other East Asian languages. So I just exclude these characters from spell
      > > checking. (No options)
      > > Please check the attached patch.
      > >
      > > Regards,
      > > Ken Takata
      > >
      >
      > "All characters above 256" would seem a little rash IMHO: after all,
      > Russian, Ukrainian, Bulgarian, Greek, etc. can (or should be able to)
      > use spell checking even though their writing systems are entirely above
      > U+00FF, and even in Latin script, some French nouns such as œil (eye),
      > œuf (egg), bœuf (ox or beef), œil-de-bœuf (a small round window), vœu
      > (wish), Œdipe (Oedipus), œsophage (oesophagus), etc., use characters
      > (the oe / OE digraphs, which in French are one character each) above
      > U+00FF. Similarly for the accented letters of non-West-European
      > languages, many of which fall outside the Latin1 range.
      >
      > I suppose that excluding CJK is the right thing to do, since the nearest
      > thing to "spell checking" for handwritten CJK would mean checking that
      > the correct brush strokes were used, but "wrong" brush stroke
      > combinations (other than simplified vs. traditional glyphs, or than
      > Japanese "national" /kokuji/ characters in a Chinese text, etc.) cannot
      > be produced as computer text even in Unicode; or else it might mean
      > checking that word elements ("Han syllables") are meaningfully combined,
      > which IMHO is more akin to checking semantics or syntax than orthography.

      I was wondering if this should be an option or a spell setting of some
      kind. So, you argue that we won't ever have useful spell checking for
      CJK characters, so we should just ignore them.

      What if I have some text in a language that is spell checked, and by
      some mistake a few CJK characters show up (copy/paste error, encoding
      conversion mistake, etc.)? Then they should be marked as errors, right?

      For me, I occasionally get these characters when an Asian name is used.
      I don't really care if that is highlighted as an error or not (can't
      read it anyway). Other names are marked as errors, so perhaps foreign
      names should be as well?

      Following that line of thinking it should be an option. Perhaps a
      special entry in 'spelllang' "cjk" ?
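
To make the distinction in this exchange concrete, here is a minimal Python sketch (an illustration only, not taken from the patch). It contrasts the "all characters above 256" test with a narrower test based on the Unicode East Asian Width property. The helper names above_latin1 and is_east_asian_wide are hypothetical, and treating width classes "W" and "F" as "CJK" is an assumption; the actual patch may classify characters differently.

```python
import unicodedata


def above_latin1(ch: str) -> bool:
    """The rash test: any code point beyond the Latin1 range (U+00FF)."""
    return ord(ch) > 0xFF


def is_east_asian_wide(ch: str) -> bool:
    """The narrower test (an assumption for this sketch): characters whose
    East Asian Width is Wide ("W") or Fullwidth ("F")."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


if __name__ == "__main__":
    # French oe ligature, Cyrillic, Greek, then Han, hiragana, hangul.
    for ch in ["é", "œ", "д", "λ", "日", "あ", "한"]:
        print(f"U+{ord(ch):04X}",
              "above Latin1:", above_latin1(ch),
              "wide:", is_east_asian_wide(ch))
```

Under this sketch, œ, Cyrillic and Greek letters are above U+00FF but are not classed as wide, so they would still be spell checked, while Han, kana and hangul characters would be skipped.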

      --
      DEAD PERSON: I'm getting better!
      CUSTOMER: No, you're not -- you'll be stone dead in a moment.
      MORTICIAN: Oh, I can't take him like that -- it's against regulations.
      The Quest for the Holy Grail (Monty Python)

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
      \\\ an exciting new programming language -- http://www.Zimbu.org ///
      \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

      --
      --
      You received this message from the "vim_dev" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php

      ---
      You received this message because you are subscribed to the Google Groups "vim_dev" group.
      To unsubscribe from this group and stop receiving emails from it, send an email to vim_dev+unsubscribe@....
      For more options, visit https://groups.google.com/groups/opt_out.
    • Ken Takata
      Message 2 of 6 , Oct 8, 2013
        Hi,

        2013/10/08 Tue 6:21:22 UTC+9 Bram Moolenaar wrote:
        > Following that line of thinking it should be an option. Perhaps a
        > special entry in 'spelllang' "cjk" ?

        My previous patch excludes only CJK characters, not "all characters above 256".
        But I agree that checking CJK characters is useful for some kinds of mistakes.
        How about adding "nocjk" in 'spelllang'? For example, if you want to check
        English but exclude CJK chars:
        :set spelllang=en,nocjk

        Please check the attached patch.
        (I also merged another patch of mine:
        https://groups.google.com/d/msg/vim_dev/UxuwQaj1HAc/BvjwIJg6WGIJ )

        Regards,
        Ken Takata

      • Bram Moolenaar
        Message 3 of 6 , Oct 8, 2013
          Ken Takata wrote:

          > How about adding "nocjk" in 'spelllang'? For example, if you want to check
          > English but exclude CJK chars:
          > :set spelllang=en,nocjk

          Thanks. "nocjk" is a bit strange: the other entries in 'spelllang'
          specify languages for which words will be recognized and not marked as
          errors. I suggested "cjk" as it would see all CJK letters as OK.
          Perhaps "ignore-cjk" would be clearer, but it's a bit long.

          I don't think there will ever be a "cjk" language, so there is no
          reason to avoid that name in case we do get a "cjk" spell checker.
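
The idea of a special "cjk" entry that marks characters as OK, instead of naming a word list, can be sketched roughly as follows. This is a hypothetical Python illustration, not Vim's implementation: split_spelllang, is_east_asian_wide, misspelled and the tiny word set are all made up for the example, and the rule "skip words made up entirely of wide characters" is an assumption.

```python
import unicodedata


def split_spelllang(value: str):
    """Split a comma-separated 'spelllang'-style value into real language
    names and a flag for the special "cjk" entry (hypothetical helper)."""
    entries = [e for e in value.split(",") if e]
    ignore_cjk = "cjk" in entries
    languages = [e for e in entries if e != "cjk"]
    return languages, ignore_cjk


def is_east_asian_wide(ch: str) -> bool:
    # Assumption: Wide ("W") and Fullwidth ("F") stand in for "CJK".
    return unicodedata.east_asian_width(ch) in ("W", "F")


def misspelled(words, known, ignore_cjk):
    """Return the words that would be flagged as errors."""
    bad = []
    for word in words:
        # With "cjk" set, words made only of wide characters count as OK.
        if ignore_cjk and all(is_east_asian_wide(c) for c in word):
            continue
        if word.lower() not in known:
            bad.append(word)
    return bad


if __name__ == "__main__":
    languages, ignore_cjk = split_spelllang("en,cjk")
    known_words = {"hello", "world"}  # stand-in for an English word list
    print(languages, ignore_cjk)
    print(misspelled(["hello", "日本語", "wrold"], known_words, ignore_cjk))
```

Without the "cjk" entry the same sketch flags the CJK word as well, which matches the case raised earlier in the thread where stray CJK characters in otherwise spell-checked text should still be marked.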

          --
          [clop clop]
          MORTICIAN: Who's that then?
          CUSTOMER: I don't know.
          MORTICIAN: Must be a king.
          CUSTOMER: Why?
          MORTICIAN: He hasn't got shit all over him.
          The Quest for the Holy Grail (Monty Python)

        • Ken Takata
          Message 4 of 6 , Oct 8, 2013
            Hi Bram,

            2013/10/09 Wed 6:05:24 UTC+9 Bram Moolenaar wrote:
            > Thanks. "nocjk" is a bit strange: the other entries in 'spelllang'
            > specify languages for which words will be recognized and not marked as
            > errors. I suggested "cjk" as it would see all CJK letters as OK.
            > Perhaps "ignore-cjk" would be clearer, but it's a bit long.
            >
            > I don't think there will ever be a "cjk" language, so there is no
            > reason to avoid that name in case we do get a "cjk" spell checker.

            Ah, I understand.
            I have updated the patch.

            Regards,
            Ken Takata
