Loading ...
Sorry, an error occurred while loading the content.

[bug] reproducible crash when trying to create dictionary with :mkspell

Expand Messages
  • Dominique Pellé
    Hi Vim-7.3.107 crashes when I try to create the Esperanto dictionary from OpenOffice-3. Steps to reproduce: $ wget
    Message 1 of 4 , Jan 29, 2011
    • 0 Attachment
      Hi

      Vim-7.3.107 crashes when I try to create the Esperanto
      dictionary from OpenOffice-3.

      Steps to reproduce:

      $ wget http://extensions.services.openoffice.org/e-files/3377/1/1.0-dev.oxt
      $ unzip 1.0-dev.oxt
      $ vim -u NONE --noplugin -c 'set nomore' -c 'mkspell! /tmp/eo literumilo'
      Vim: Caught deadly signal SEGV

      Vim: Finished.
      Segmentation fault (core dumped)


      Valgrind memory checker gives errors:

      ==3877== Invalid write of size 1
      ==3877== at 0x4026087: strcat (mc_replace_strmem.c:176)
      ==3877== by 0x818AF60: spell_read_aff (spell.c:5493)
      ==3877== by 0x819359D: mkspell (spell.c:9234)
      ==3877== by 0x8191F53: ex_mkspell (spell.c:8557)
      ==3877== by 0x80A81AC: do_one_cmd (ex_docmd.c:2657)
      ==3877== by 0x80A5A85: do_cmdline (ex_docmd.c:1123)
      ==3877== by 0x80A513F: do_cmdline_cmd (ex_docmd.c:728)
      ==3877== by 0x80EAEE5: exe_commands (main.c:2803)
      ==3877== by 0x80E85CA: main (main.c:881)
      ==3877== Address 0x7e3b434 is 0 bytes after a block of size 16,012 alloc'd
      ==3877== at 0x4025230: malloc (vg_replace_malloc.c:236)
      ==3877== by 0x8117B93: lalloc (misc2.c:918)
      ==3877== by 0x8117ACB: alloc_clear (misc2.c:829)
      ==3877== by 0x818F951: getroom (spell.c:7368)
      ==3877== by 0x818AEFA: spell_read_aff (spell.c:5485)
      ==3877== by 0x819359D: mkspell (spell.c:9234)
      ==3877== by 0x8191F53: ex_mkspell (spell.c:8557)
      ==3877== by 0x80A81AC: do_one_cmd (ex_docmd.c:2657)
      ==3877== by 0x80A5A85: do_cmdline (ex_docmd.c:1123)
      ==3877== by 0x80A513F: do_cmdline_cmd (ex_docmd.c:728)
      ==3877== by 0x80EAEE5: exe_commands (main.c:2803)
      ==3877== by 0x80E85CA: main (main.c:881)
      (several other errors after that)

      spell.c:

      5478 else if (is_aff_rule(items, itemcnt, "COMPOUNDRULE", 2))
      5479 {
      5480 /* Concatenate this string to previously defined ones, using a
      5481 * slash to separate them. */
      5482 l = (int)STRLEN(items[1]) + 1;
      5483 if (compflags != NULL)
      5484 l += (int)STRLEN(compflags) + 1;
      5485 p = getroom(spin, l, FALSE);
      5486 if (p != NULL)
      5487 {
      5488 if (compflags != NULL)
      5489 {
      5490 STRCPY(p, compflags);
      5491 STRCAT(p, "/");
      5492 }
      !!5493 STRCAT(p, items[1]);
      5494 compflags = p;
      5495 }
      5496 }

      When it crashes, I notice that variable l reaches 16005 which is just
      slightly bigger than SBLOCKSIZE (#define SBLOCKSIZE 16000 at
      spell.c:4885).

      I changed SBLOCKSIZE from 16000 to 1024000 at spell.c:4885
      and it longer crashes but that's quite a dramatic increase so I
      doubt whether that's right. While creating the dictionary,
      l variable at spell.c:5485 reached l=564,458.

      Vim temporary used 12.5 Gb of memory while creating the
      dictionary which is quite a lot.

      Size of the created dictionary file 'eo.utf-8.spl' is 587,214 bytes
      and it does not work. Trying to use it with...

      $ cp /tmp/eo.utf-8.spl ~/.vim/spell/.

      ... then in Vim...
      :setlocal spell spelllang=eo

      ... gives errors after waiting for ~10 sec or so:

      Error detected while processing /home/pel/.vim/spell/eo.utf-8.spl:
      E339: Pattern too long
      E759: Format error in spell file

      ":help E339" says: "This only happens on systems with 16 bit ints".

      Well, that's not true since I get E339 on Linux x86_64 where
      sizeof(int) is 4 (32 bits).

      E339 at regexp.c:1059 is in between SMALL_MALLOC
      but E339 at regexp.c:1077 is not in between SMALL_MALLOC.

      Either help file needs to be updated for E339 or regexp.c
      needs to handle longer regexes.

      Problem is triggered by the fact that file 'literumilo.aff' contains
      many COMPOUNDRULE lines (Esperanto being an agglutinative
      language).

      Esperanto dictionary from OpenOffice-2.x as currently used by
      Vim works fine.

      Regards
      -- Dominique

      --
      You received this message from the "vim_dev" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php
    • Bram Moolenaar
      ... First thing to fix would be to add a check in getroom() for a length more than what can be handled. ... Pfew, that s a lot of compound rules. Did you
      Message 2 of 4 , Jan 30, 2011
      • 0 Attachment
        Dominique Pelle wrote:

        > Vim-7.3.107 crashes when I try to create the Esperanto
        > dictionary from OpenOffice-3.
        >
        > Steps to reproduce:
        >
        > $ wget http://extensions.services.openoffice.org/e-files/3377/1/1.0-dev.oxt
        > $ unzip 1.0-dev.oxt
        > $ vim -u NONE --noplugin -c 'set nomore' -c 'mkspell! /tmp/eo literumilo'
        > Vim: Caught deadly signal SEGV
        >
        > Vim: Finished.
        > Segmentation fault (core dumped)
        >
        >
        > Valgrind memory checker gives errors:
        >
        > ==3877== Invalid write of size 1
        > ==3877== at 0x4026087: strcat (mc_replace_strmem.c:176)
        > ==3877== by 0x818AF60: spell_read_aff (spell.c:5493)
        > ==3877== by 0x819359D: mkspell (spell.c:9234)
        > ==3877== by 0x8191F53: ex_mkspell (spell.c:8557)
        > ==3877== by 0x80A81AC: do_one_cmd (ex_docmd.c:2657)
        > ==3877== by 0x80A5A85: do_cmdline (ex_docmd.c:1123)
        > ==3877== by 0x80A513F: do_cmdline_cmd (ex_docmd.c:728)
        > ==3877== by 0x80EAEE5: exe_commands (main.c:2803)
        > ==3877== by 0x80E85CA: main (main.c:881)
        > ==3877== Address 0x7e3b434 is 0 bytes after a block of size 16,012 alloc'd
        > ==3877== at 0x4025230: malloc (vg_replace_malloc.c:236)
        > ==3877== by 0x8117B93: lalloc (misc2.c:918)
        > ==3877== by 0x8117ACB: alloc_clear (misc2.c:829)
        > ==3877== by 0x818F951: getroom (spell.c:7368)
        > ==3877== by 0x818AEFA: spell_read_aff (spell.c:5485)
        > ==3877== by 0x819359D: mkspell (spell.c:9234)
        > ==3877== by 0x8191F53: ex_mkspell (spell.c:8557)
        > ==3877== by 0x80A81AC: do_one_cmd (ex_docmd.c:2657)
        > ==3877== by 0x80A5A85: do_cmdline (ex_docmd.c:1123)
        > ==3877== by 0x80A513F: do_cmdline_cmd (ex_docmd.c:728)
        > ==3877== by 0x80EAEE5: exe_commands (main.c:2803)
        > ==3877== by 0x80E85CA: main (main.c:881)
        > (several other errors after that)
        >
        > spell.c:
        >
        > 5478 else if (is_aff_rule(items, itemcnt, "COMPOUNDRULE", 2))
        > 5479 {
        > 5480 /* Concatenate this string to previously defined ones, using a
        > 5481 * slash to separate them. */
        > 5482 l = (int)STRLEN(items[1]) + 1;
        > 5483 if (compflags != NULL)
        > 5484 l += (int)STRLEN(compflags) + 1;
        > 5485 p = getroom(spin, l, FALSE);
        > 5486 if (p != NULL)
        > 5487 {
        > 5488 if (compflags != NULL)
        > 5489 {
        > 5490 STRCPY(p, compflags);
        > 5491 STRCAT(p, "/");
        > 5492 }
        > !!5493 STRCAT(p, items[1]);
        > 5494 compflags = p;
        > 5495 }
        > 5496 }
        >
        > When it crashes, I notice that variable l reaches 16005 which is just
        > slightly bigger than SBLOCKSIZE (#define SBLOCKSIZE 16000 at
        > spell.c:4885).

        First thing to fix would be to add a check in getroom() for a length
        more than what can be handled.

        > I changed SBLOCKSIZE from 16000 to 1024000 at spell.c:4885
        > and it longer crashes but that's quite a dramatic increase so I
        > doubt whether that's right. While creating the dictionary,
        > l variable at spell.c:5485 reached l=564,458.

        Pfew, that's a lot of compound rules. Did you check in the affix file
        that this is actually there? I suppose this must be generated, no human
        would type this.

        A solution would be to make compflags a growarray. The way it's done
        now it is re-allocated for each COMPOUNDRULE, very inefficient. This
        was made with the assumption there would be only a few COMPOUNDRULEs.

        > Vim temporary used 12.5 Gb of memory while creating the
        > dictionary which is quite a lot.

        Compound rules are indeed a problem, since Vim expands them into all
        possible words to be able to build the trie. It compresses really well,
        but lots of memory is needed before the compression happens.

        > Size of the created dictionary file 'eo.utf-8.spl' is 587,214 bytes
        > and it does not work. Trying to use it with...
        >
        > $ cp /tmp/eo.utf-8.spl ~/.vim/spell/.
        >
        > ... then in Vim...
        > :setlocal spell spelllang=eo
        >
        > ... gives errors after waiting for ~10 sec or so:
        >
        > Error detected while processing /home/pel/.vim/spell/eo.utf-8.spl:
        > E339: Pattern too long
        > E759: Format error in spell file
        >
        > ":help E339" says: "This only happens on systems with 16 bit ints".
        >
        > Well, that's not true since I get E339 on Linux x86_64 where
        > sizeof(int) is 4 (32 bits).
        >
        > E339 at regexp.c:1059 is in between SMALL_MALLOC
        > but E339 at regexp.c:1077 is not in between SMALL_MALLOC.
        >
        > Either help file needs to be updated for E339 or regexp.c
        > needs to handle longer regexes.

        The help is wrong, it can also happen when an offset gets too big.

        > Problem is triggered by the fact that file 'literumilo.aff' contains
        > many COMPOUNDRULE lines (Esperanto being an agglutinative
        > language).

        Did you check which regcomp() call results in E339?

        > Esperanto dictionary from OpenOffice-2.x as currently used by
        > Vim works fine.

        --
        hundred-and-one symptoms of being an internet addict:
        168. You have your own domain name.

        /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
        /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
        \\\ an exciting new programming language -- http://www.Zimbu.org ///
        \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

        --
        You received this message from the "vim_dev" maillist.
        Do not top-post! Type your reply below the text you are replying to.
        For more information, visit http://www.vim.org/maillist.php
      • Dominique Pellé
        ... I tried adding a check to avoid allocating more than SBLOCKSIZE in getroom(). $ hg diff spell.c diff -r 26fb122355d4 src/spell.c ... +++ b/src/spell.c Mon
        Message 3 of 4 , Jan 31, 2011
        • 0 Attachment
          Bram Moolenaar wrote:

          > Dominique Pelle wrote:
          >
          >> Vim-7.3.107 crashes when I try to create the Esperanto
          >> dictionary from OpenOffice-3.
          >>
          >> Steps to reproduce:
          >>
          >> $ wget http://extensions.services.openoffice.org/e-files/3377/1/1.0-dev.oxt
          >> $ unzip 1.0-dev.oxt
          >> $ vim -u NONE --noplugin -c 'set nomore' -c 'mkspell! /tmp/eo literumilo'
          >> Vim: Caught deadly signal SEGV
          >>
          >> Vim: Finished.
          >> Segmentation fault (core dumped)
          >>
          >>
          >> Valgrind memory checker gives errors:
          >>
          >> ==3877== Invalid write of size 1
          >> ==3877==    at 0x4026087: strcat (mc_replace_strmem.c:176)
          >> ==3877==    by 0x818AF60: spell_read_aff (spell.c:5493)
          >> ==3877==    by 0x819359D: mkspell (spell.c:9234)
          >> ==3877==    by 0x8191F53: ex_mkspell (spell.c:8557)
          >> ==3877==    by 0x80A81AC: do_one_cmd (ex_docmd.c:2657)
          >> ==3877==    by 0x80A5A85: do_cmdline (ex_docmd.c:1123)
          >> ==3877==    by 0x80A513F: do_cmdline_cmd (ex_docmd.c:728)
          >> ==3877==    by 0x80EAEE5: exe_commands (main.c:2803)
          >> ==3877==    by 0x80E85CA: main (main.c:881)
          >> ==3877==  Address 0x7e3b434 is 0 bytes after a block of size 16,012 alloc'd
          >> ==3877==    at 0x4025230: malloc (vg_replace_malloc.c:236)
          >> ==3877==    by 0x8117B93: lalloc (misc2.c:918)
          >> ==3877==    by 0x8117ACB: alloc_clear (misc2.c:829)
          >> ==3877==    by 0x818F951: getroom (spell.c:7368)
          >> ==3877==    by 0x818AEFA: spell_read_aff (spell.c:5485)
          >> ==3877==    by 0x819359D: mkspell (spell.c:9234)
          >> ==3877==    by 0x8191F53: ex_mkspell (spell.c:8557)
          >> ==3877==    by 0x80A81AC: do_one_cmd (ex_docmd.c:2657)
          >> ==3877==    by 0x80A5A85: do_cmdline (ex_docmd.c:1123)
          >> ==3877==    by 0x80A513F: do_cmdline_cmd (ex_docmd.c:728)
          >> ==3877==    by 0x80EAEE5: exe_commands (main.c:2803)
          >> ==3877==    by 0x80E85CA: main (main.c:881)
          >> (several other errors after that)
          >>
          >> spell.c:
          >>
          >>   5478     else if (is_aff_rule(items, itemcnt, "COMPOUNDRULE", 2))
          >>   5479     {
          >>   5480         /* Concatenate this string to previously defined ones, using a
          >>   5481          * slash to separate them. */
          >>   5482         l = (int)STRLEN(items[1]) + 1;
          >>   5483         if (compflags != NULL)
          >>   5484             l += (int)STRLEN(compflags) + 1;
          >>   5485         p = getroom(spin, l, FALSE);
          >>   5486         if (p != NULL)
          >>   5487         {
          >>   5488             if (compflags != NULL)
          >>   5489             {
          >>   5490                 STRCPY(p, compflags);
          >>   5491                 STRCAT(p, "/");
          >>   5492             }
          >> !!5493             STRCAT(p, items[1]);
          >>   5494             compflags = p;
          >>   5495         }
          >>   5496     }
          >>
          >> When it crashes, I notice that variable l reaches 16005 which is just
          >> slightly bigger than SBLOCKSIZE (#define  SBLOCKSIZE 16000 at
          >> spell.c:4885).
          >
          > First thing to fix would be to add a check in getroom() for a length
          > more than what can be handled.

          I tried adding a check to avoid allocating more than SBLOCKSIZE
          in getroom().

          $ hg diff spell.c
          diff -r 26fb122355d4 src/spell.c
          --- a/src/spell.c Sat Jan 22 21:25:11 2011 +0100
          +++ b/src/spell.c Mon Jan 31 21:12:13 2011 +0100
          @@ -7364,6 +7364,9 @@

          if (bl == NULL || bl->sb_used + len > SBLOCKSIZE)
          {
          + if (len > SBLOCKSIZE)
          + return NULL;
          +
          /* Allocate a block of memory. This is not freed until much later. */
          bl = (sblock_T *)alloc_clear((unsigned)(sizeof(sblock_T) + SBLOCKSIZE));
          if (bl == NULL)


          A dictionary is then created without crash but it's crude:
          many (and most) COMPOUNDRULE are skipped without
          warning when creating the dictionary. I did not try to improve
          to give a warning since dictionary is bad anyway (:spelldump
          shows very few words). It's probably best to not create a
          dictionary at all when getroom() returns NULL.

          >> I changed SBLOCKSIZE from 16000 to 1024000 at spell.c:4885
          >> and it longer crashes but that's quite a dramatic increase so I
          >> doubt whether that's right.  While creating the dictionary,
          >> l variable at spell.c:5485 reached l=564,458.
          >
          > Pfew, that's a lot of compound rules.  Did you check in the affix file
          > that this is actually there?  I suppose this must be generated, no human
          > would type this.

          The file "literumilo.aff" contains 37,300 lines of COMPOUNDRULE .
          (previous email contains a link to the OpenOffice file which contains
          the *.aff & *.dic files).

          COMPOUNDRULE lines are most certainly created automatically.
          It seems that most words are built using COMPOUNDRULE in this
          dictionary. I need to read the doc of Myspell or Hunspell
          carefully to understand this better.

          > A solution would be to make compflags a growarray.  The way it's done
          > now it is re-allocated for each COMPOUNDRULE, very inefficient.  This
          > was made with the assumption there would be only a few COMPOUNDRULEs.

          Even if we did that, a regex is created with all COMPOUNDRULE
          with many alternatives \|. The regex would be slow I think.
          See below...

          >> Vim temporary used 12.5 Gb of memory while creating the
          >> dictionary which is quite a lot.
          >
          > Compound rules are indeed a problem, since Vim expands them into all
          > possible words to be able to build the trie.  It compresses really well,
          > but lots of memory is needed before the compression happens.
          >
          >> Size of the created dictionary file 'eo.utf-8.spl' is 587,214 bytes
          >> and it does not work.  Trying to use it with...
          >>
          >>   $ cp /tmp/eo.utf-8.spl ~/.vim/spell/.
          >>
          >> ... then in Vim...
          >>   :setlocal spell spelllang=eo
          >>
          >>   ... gives errors after waiting for ~10 sec or so:
          >>
          >>   Error detected while processing /home/pel/.vim/spell/eo.utf-8.spl:
          >>   E339: Pattern too long
          >>   E759: Format error in spell file
          >>
          >> ":help E339" says: "This only happens on systems with 16 bit ints".
          >>
          >> Well, that's not true since I get E339 on Linux x86_64 where
          >> sizeof(int) is 4 (32 bits).
          >>
          >> E339 at regexp.c:1059 is in between SMALL_MALLOC
          >> but E339 at regexp.c:1077 is not in between SMALL_MALLOC.
          >>
          >> Either help file needs to be updated for E339 or regexp.c
          >> needs to handle longer regexes.
          >
          > The help is wrong, it can also happen when an offset gets too big.
          >
          >> Problem is triggered by the fact that file 'literumilo.aff' contains
          >> many COMPOUNDRULE lines (Esperanto being an agglutinative
          >> language).
          >
          > Did you check which regcomp() call results in E339?

          I've added a printf I see this:

          [vim_regcomp:1061] len(expr)=491048
          expr=[^\(<7f>\~}||\|{z\|{yz\|xwz\|vutz\|vstz\|rz\|wz\|qpz\|oz\|xn\|xmn\|xlkn\|xljin\|xhgn\|xhtn\|xhfn\|xhin\|xhen\|xhdn\|xghen\|
          .... etc, etc... (huge regexp of ~491 Kb)

          The stack trace when E339 happens is:

          #17 0x000000000052c0c3 in vim_regcomp (
          expr=0x7fc9c6a21010
          "^\\(\177\\~}||\\|{z\\|{yz\\|xwz\\|vutz\\|vstz\\|rz\\|wz\\|qpz\\|oz\\|xn\\|xmn\\|xlkn\\|xljin\\|xhgn\\|xhtn\\|xhfn\\|xhin\\|xhen\\|xhdn\\|xghen\\|xgcn\\|xbln\\|xbhn\\|xthn\\|xbtn\\|xbwn\\|xban\\|xt`_Zn\\|xbYn\\|xtYn\\|xwaiZn\\|xXtn\\|xXYn\\"...,
          re_flags=7) at regexp.c:1077
          #18 0x000000000055bc33 in read_compound (fd=0x92d1b0, slang=0x92d3f0,
          len=453893) at spell.c:3704
          #19 0x000000000055a7a9 in spell_load_file (fname=0xa0d5f0
          "/home/pel/.vim/spell/eo.utf-8.spl", lang=0x7fff44b8a430 "eo",
          old_lp=0x0, silent=0)
          at spell.c:2920
          #20 0x000000000055a352 in spell_load_cb (fname=0xa0d5f0
          "/home/pel/.vim/spell/eo.utf-8.spl", cookie=0x7fff44b8a430) at
          spell.c:2732
          #21 0x0000000000464602 in do_in_runtimepath (name=0x7fff44b8a540
          "spell/eo.utf-8.spl", all=0, callback=0x55a323 <spell_load_cb>,
          cookie=0x7fff44b8a430) at ex_cmds2.c:2715
          #22 0x0000000000559c09 in spell_load_lang (lang=0x7fff44b8b5f0 "eo")
          at spell.c:2486
          #23 0x000000000055cd74 in did_set_spelllang (wp=0x858e30) at spell.c:4304
          #24 0x0000000000512263 in did_set_string_option (opt_idx=284,
          varp=0x85c298, new_value_alloced=1, oldval=0x85d430 "en",
          errbuf=0x7fff44b8b9b0 "",
          opt_flags=4) at option.c:6551
          #25 0x000000000050f0fd in do_set (arg=0xa0d63c "", opt_flags=4) at option.c:4792
          #26 0x00000000004773bd in ex_set (eap=0x7fff44b8bb80) at ex_docmd.c:11121
          #27 0x00000000004697b2 in do_one_cmd (cmdlinep=0x7fff44b8c228,
          sourcing=1, cstack=0x7fff44b8bd80, fgetline=0x45253b <get_func_line>,
          cookie=0x92b860)
          at ex_docmd.c:2657
          #28 0x0000000000466f15 in do_cmdline (cmdline=0x0, fgetline=0x45253b
          <get_func_line>, cookie=0x92b860, flags=7) at ex_docmd.c:1123
          #29 0x0000000000451b79 in call_user_func (fp=0x884890, argcount=0,
          argvars=0x7fff44b8c750, rettv=0x7fff44b8c910, firstline=1, lastline=1,
          selfdict=0x0) at eval.c:22018
          #30 0x000000000043c9f9 in call_func (funcname=0x8611e0 "MySpellLang",
          len=11, rettv=0x7fff44b8c910, argcount=0, argvars=0x7fff44b8c750,
          firstline=1,
          lastline=1, doesrange=0x7fff44b8c90c, evaluate=1, selfdict=0x0) at
          eval.c:8316
          #31 0x000000000043c62c in get_func_tv (name=0x8611e0 "MySpellLang",
          len=11, rettv=0x7fff44b8c910, arg=0x7fff44b8c930, firstline=1,
          lastline=1,
          doesrange=0x7fff44b8c90c, evaluate=1, selfdict=0x0) at eval.c:8158
          #32 0x00000000004358b4 in ex_call (eap=0x7fff44b8ca10) at eval.c:3398
          #33 0x00000000004697b2 in do_one_cmd (cmdlinep=0x7fff44b8d0b8,
          sourcing=0, cstack=0x7fff44b8cc10, fgetline=0x47e54f <getexline>,
          cookie=0x0)
          at ex_docmd.c:2657
          #34 0x0000000000466f15 in do_cmdline (cmdline=0x0, fgetline=0x47e54f
          <getexline>, cookie=0x0, flags=0) at ex_docmd.c:1123
          #35 0x00000000004f96dc in nv_colon (cap=0x7fff44b8d210) at normal.c:5342
          #36 0x00000000004f239c in normal_cmd (oap=0x7fff44b8d2d0, toplevel=1)
          at normal.c:1193
          #37 0x00000000004ae938 in main_loop (cmdwin=0, noexmode=0) at main.c:1256
          #38 0x00000000004ae349 in main (argc=1, argv=0x7fff44b8d5f8) at main.c:961

          read_compound() turns the COMPOUNDRULE items into a
          regex so a huge number of COMPOUNDRULE won't work.

          No good solution so far.

          -- Dominique

          --
          You received this message from the "vim_dev" maillist.
          Do not top-post! Type your reply below the text you are replying to.
          For more information, visit http://www.vim.org/maillist.php
        • Bram Moolenaar
          ... I ll at least add this check to avoid a crash and give an error when this happens. ... Oof. That probably requires a bit of redesign to handle. ... Good
          Message 4 of 4 , Feb 1, 2011
          • 0 Attachment
            Dominique Pelle wrote:

            > >> Vim-7.3.107 crashes when I try to create the Esperanto
            > >> dictionary from OpenOffice-3.
            > >>
            > >> Steps to reproduce:
            > >>
            > >> $ wget http://extensions.services.openoffice.org/e-files/3377/1/1.0-dev.oxt
            > >> $ unzip 1.0-dev.oxt
            > >> $ vim -u NONE --noplugin -c 'set nomore' -c 'mkspell! /tmp/eo literumilo'
            > >> Vim: Caught deadly signal SEGV
            > >>
            > >> Vim: Finished.
            > >> Segmentation fault (core dumped)
            > >>
            > >>
            > >> Valgrind memory checker gives errors:
            > >>
            > >> ==3877== Invalid write of size 1
            > >> ==3877==    at 0x4026087: strcat (mc_replace_strmem.c:176)
            > >> ==3877==    by 0x818AF60: spell_read_aff (spell.c:5493)
            > >> ==3877==    by 0x819359D: mkspell (spell.c:9234)
            > >> ==3877==    by 0x8191F53: ex_mkspell (spell.c:8557)
            > >> ==3877==    by 0x80A81AC: do_one_cmd (ex_docmd.c:2657)
            > >> ==3877==    by 0x80A5A85: do_cmdline (ex_docmd.c:1123)
            > >> ==3877==    by 0x80A513F: do_cmdline_cmd (ex_docmd.c:728)
            > >> ==3877==    by 0x80EAEE5: exe_commands (main.c:2803)
            > >> ==3877==    by 0x80E85CA: main (main.c:881)
            > >> ==3877==  Address 0x7e3b434 is 0 bytes after a block of size 16,012 alloc'd
            > >> ==3877==    at 0x4025230: malloc (vg_replace_malloc.c:236)
            > >> ==3877==    by 0x8117B93: lalloc (misc2.c:918)
            > >> ==3877==    by 0x8117ACB: alloc_clear (misc2.c:829)
            > >> ==3877==    by 0x818F951: getroom (spell.c:7368)
            > >> ==3877==    by 0x818AEFA: spell_read_aff (spell.c:5485)
            > >> ==3877==    by 0x819359D: mkspell (spell.c:9234)
            > >> ==3877==    by 0x8191F53: ex_mkspell (spell.c:8557)
            > >> ==3877==    by 0x80A81AC: do_one_cmd (ex_docmd.c:2657)
            > >> ==3877==    by 0x80A5A85: do_cmdline (ex_docmd.c:1123)
            > >> ==3877==    by 0x80A513F: do_cmdline_cmd (ex_docmd.c:728)
            > >> ==3877==    by 0x80EAEE5: exe_commands (main.c:2803)
            > >> ==3877==    by 0x80E85CA: main (main.c:881)
            > >> (several other errors after that)
            > >>
            > >> spell.c:
            > >>
            > >>   5478     else if (is_aff_rule(items, itemcnt, "COMPOUNDRULE", 2))
            > >>   5479     {
            > >>   5480         /* Concatenate this string to previously defined ones, using a
            > >>   5481          * slash to separate them. */
            > >>   5482         l = (int)STRLEN(items[1]) + 1;
            > >>   5483         if (compflags != NULL)
            > >>   5484             l += (int)STRLEN(compflags) + 1;
            > >>   5485         p = getroom(spin, l, FALSE);
            > >>   5486         if (p != NULL)
            > >>   5487         {
            > >>   5488             if (compflags != NULL)
            > >>   5489             {
            > >>   5490                 STRCPY(p, compflags);
            > >>   5491                 STRCAT(p, "/");
            > >>   5492             }
            > >> !!5493             STRCAT(p, items[1]);
            > >>   5494             compflags = p;
            > >>   5495         }
            > >>   5496     }
            > >>
            > >> When it crashes, I notice that variable l reaches 16005 which is just
            > >> slightly bigger than SBLOCKSIZE (#define  SBLOCKSIZE 16000 at
            > >> spell.c:4885).
            > >
            > > First thing to fix would be to add a check in getroom() for a length
            > > more than what can be handled.
            >
            > I tried adding a check to avoid allocating more than SBLOCKSIZE
            > in getroom().
            >
            > $ hg diff spell.c
            > diff -r 26fb122355d4 src/spell.c
            > --- a/src/spell.c Sat Jan 22 21:25:11 2011 +0100
            > +++ b/src/spell.c Mon Jan 31 21:12:13 2011 +0100
            > @@ -7364,6 +7364,9 @@
            >
            > if (bl == NULL || bl->sb_used + len > SBLOCKSIZE)
            > {
            > + if (len > SBLOCKSIZE)
            > + return NULL;
            > +
            > /* Allocate a block of memory. This is not freed until much later. */
            > bl = (sblock_T *)alloc_clear((unsigned)(sizeof(sblock_T) + SBLOCKSIZE));
            > if (bl == NULL)
            >
            >
            > A dictionary is then created without crash but it's crude:
            > many (and most) COMPOUNDRULE are skipped without
            > warning when creating the dictionary. I did not try to improve
            > to give a warning since dictionary is bad anyway (:spelldump
            > shows very few words). It's probably best to not create a
            > dictionary at all when getroom() returns NULL.

            I'll at least add this check to avoid a crash and give an error when
            this happens.

            > >> I changed SBLOCKSIZE from 16000 to 1024000 at spell.c:4885
            > >> and it longer crashes but that's quite a dramatic increase so I
            > >> doubt whether that's right.  While creating the dictionary,
            > >> l variable at spell.c:5485 reached l=564,458.
            > >
            > > Pfew, that's a lot of compound rules.  Did you check in the affix file
            > > that this is actually there?  I suppose this must be generated, no human
            > > would type this.
            >
            > The file "literumilo.aff" contains 37,300 lines of COMPOUNDRULE .
            > (previous email contains a link to the OpenOffice file which contains
            > the *.aff & *.dic files).

            Oof. That probably requires a bit of redesign to handle.

            > COMPOUNDRULE lines are most certainly created automatically.
            > It seems that most words are built using COMPOUNDRULE in this
            > dictionary. I need to read the doc of Myspell or Hunspell
            > carefully to understand this better.

            Good luck, Hunspell docs are sparese. The Vim docs may also help,
            although recent changes in Hunspell may have changed behavior.

            > > A solution would be to make compflags a growarray.  The way it's done
            > > now it is re-allocated for each COMPOUNDRULE, very inefficient.  This
            > > was made with the assumption there would be only a few COMPOUNDRULEs.
            >
            > Even if we did that, a regex is created with all COMPOUNDRULE
            > with many alternatives \|. The regex would be slow I think.
            > See below...
            >
            > >> Vim temporary used 12.5 Gb of memory while creating the
            > >> dictionary which is quite a lot.
            > >
            > > Compound rules are indeed a problem, since Vim expands them into all
            > > possible words to be able to build the trie.  It compresses really well,
            > > but lots of memory is needed before the compression happens.
            > >
            > >> Size of the created dictionary file 'eo.utf-8.spl' is 587,214 bytes
            > >> and it does not work.  Trying to use it with...
            > >>
            > >>   $ cp /tmp/eo.utf-8.spl ~/.vim/spell/.
            > >>
            > >> ... then in Vim...
            > >>   :setlocal spell spelllang=eo
            > >>
            > >>   ... gives errors after waiting for ~10 sec or so:
            > >>
            > >>   Error detected while processing /home/pel/.vim/spell/eo.utf-8.spl:
            > >>   E339: Pattern too long
            > >>   E759: Format error in spell file
            > >>
            > >> ":help E339" says: "This only happens on systems with 16 bit ints".
            > >>
            > >> Well, that's not true since I get E339 on Linux x86_64 where
            > >> sizeof(int) is 4 (32 bits).
            > >>
            > >> E339 at regexp.c:1059 is in between SMALL_MALLOC
            > >> but E339 at regexp.c:1077 is not in between SMALL_MALLOC.
            > >>
            > >> Either help file needs to be updated for E339 or regexp.c
            > >> needs to handle longer regexes.
            > >
            > > The help is wrong, it can also happen when an offset gets too big.
            > >
            > >> Problem is triggered by the fact that file 'literumilo.aff' contains
            > >> many COMPOUNDRULE lines (Esperanto being an agglutinative
            > >> language).
            > >
            > > Did you check which regcomp() call results in E339?
            >
            > I've added a printf I see this:
            >
            > [vim_regcomp:1061] len(expr)=491048
            > expr=[^\(<7f>\~}||\|{z\|{yz\|xwz\|vutz\|vstz\|rz\|wz\|qpz\|oz\|xn\|xmn\|xlkn\|xljin\|xhgn\|xhtn\|xhfn\|xhin\|xhen\|xhdn\|xghen\|
            > .... etc, etc... (huge regexp of ~491 Kb)
            >
            > The stack trace when E339 happens is:
            >
            > #17 0x000000000052c0c3 in vim_regcomp (
            > expr=0x7fc9c6a21010
            > "^\\(\177\\~}||\\|{z\\|{yz\\|xwz\\|vutz\\|vstz\\|rz\\|wz\\|qpz\\|oz\\|xn\\|xmn\\|xlkn\\|xljin\\|xhgn\\|xhtn\\|xhfn\\|xhin\\|xhen\\|xhdn\\|xghen\\|xgcn\\|xbln\\|xbhn\\|xthn\\|xbtn\\|xbwn\\|xban\\|xt`_Zn\\|xbYn\\|xtYn\\|xwaiZn\\|xXtn\\|xXYn\\"...,
            > re_flags=7) at regexp.c:1077
            > #18 0x000000000055bc33 in read_compound (fd=0x92d1b0, slang=0x92d3f0,
            > len=453893) at spell.c:3704
            > #19 0x000000000055a7a9 in spell_load_file (fname=0xa0d5f0
            > "/home/pel/.vim/spell/eo.utf-8.spl", lang=0x7fff44b8a430 "eo",
            > old_lp=0x0, silent=0)
            > at spell.c:2920
            > #20 0x000000000055a352 in spell_load_cb (fname=0xa0d5f0
            > "/home/pel/.vim/spell/eo.utf-8.spl", cookie=0x7fff44b8a430) at
            > spell.c:2732
            > #21 0x0000000000464602 in do_in_runtimepath (name=0x7fff44b8a540
            > "spell/eo.utf-8.spl", all=0, callback=0x55a323 <spell_load_cb>,
            > cookie=0x7fff44b8a430) at ex_cmds2.c:2715
            > #22 0x0000000000559c09 in spell_load_lang (lang=0x7fff44b8b5f0 "eo")
            > at spell.c:2486
            > #23 0x000000000055cd74 in did_set_spelllang (wp=0x858e30) at spell.c:4304
            > #24 0x0000000000512263 in did_set_string_option (opt_idx=284,
            > varp=0x85c298, new_value_alloced=1, oldval=0x85d430 "en",
            > errbuf=0x7fff44b8b9b0 "",
            > opt_flags=4) at option.c:6551
            > #25 0x000000000050f0fd in do_set (arg=0xa0d63c "", opt_flags=4) at option.c:4792
            > #26 0x00000000004773bd in ex_set (eap=0x7fff44b8bb80) at ex_docmd.c:11121
            > #27 0x00000000004697b2 in do_one_cmd (cmdlinep=0x7fff44b8c228,
            > sourcing=1, cstack=0x7fff44b8bd80, fgetline=0x45253b <get_func_line>,
            > cookie=0x92b860)
            > at ex_docmd.c:2657
            > #28 0x0000000000466f15 in do_cmdline (cmdline=0x0, fgetline=0x45253b
            > <get_func_line>, cookie=0x92b860, flags=7) at ex_docmd.c:1123
            > #29 0x0000000000451b79 in call_user_func (fp=0x884890, argcount=0,
            > argvars=0x7fff44b8c750, rettv=0x7fff44b8c910, firstline=1, lastline=1,
            > selfdict=0x0) at eval.c:22018
            > #30 0x000000000043c9f9 in call_func (funcname=0x8611e0 "MySpellLang",
            > len=11, rettv=0x7fff44b8c910, argcount=0, argvars=0x7fff44b8c750,
            > firstline=1,
            > lastline=1, doesrange=0x7fff44b8c90c, evaluate=1, selfdict=0x0) at
            > eval.c:8316
            > #31 0x000000000043c62c in get_func_tv (name=0x8611e0 "MySpellLang",
            > len=11, rettv=0x7fff44b8c910, arg=0x7fff44b8c930, firstline=1,
            > lastline=1,
            > doesrange=0x7fff44b8c90c, evaluate=1, selfdict=0x0) at eval.c:8158
            > #32 0x00000000004358b4 in ex_call (eap=0x7fff44b8ca10) at eval.c:3398
            > #33 0x00000000004697b2 in do_one_cmd (cmdlinep=0x7fff44b8d0b8,
            > sourcing=0, cstack=0x7fff44b8cc10, fgetline=0x47e54f <getexline>,
            > cookie=0x0)
            > at ex_docmd.c:2657
            > #34 0x0000000000466f15 in do_cmdline (cmdline=0x0, fgetline=0x47e54f
            > <getexline>, cookie=0x0, flags=0) at ex_docmd.c:1123
            > #35 0x00000000004f96dc in nv_colon (cap=0x7fff44b8d210) at normal.c:5342
            > #36 0x00000000004f239c in normal_cmd (oap=0x7fff44b8d2d0, toplevel=1)
            > at normal.c:1193
            > #37 0x00000000004ae938 in main_loop (cmdwin=0, noexmode=0) at main.c:1256
            > #38 0x00000000004ae349 in main (argc=1, argv=0x7fff44b8d5f8) at main.c:961
            >
            > read_compound() turns the COMPOUNDRULE items into a
            > regex so a huge number of COMPOUNDRULE won't work.
            >
            > No good solution so far.

            The reason a regexp is used is that there can be * and + characters in the
            rules. However, this Esperanto affix file doesn't use them, thus we
            could instead fill a hashtab with the possibilities and do a lookup
            in that. An alternative would be to make a list of compiled regexp
            items, but that would be terribly slow.

            I do see ? characters in the COMPOUNDRULEs, is that an optional item?
            Yes, it is. Vim doesn't handle that correctly, I'll make that work.
            For the hastab we would need to add the item with and without the
            optional character.

            As an temporary solution, it should be possible to have Hunspell
            generate the full word list and feed that to Vim as a .dic file. The
            list will be huge, not sure if that works.

            --
            hundred-and-one symptoms of being an internet addict:
            172. You join listservers just for the extra e-mail.

            /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
            /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
            \\\ an exciting new programming language -- http://www.Zimbu.org ///
            \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

            --
            You received this message from the "vim_dev" maillist.
            Do not top-post! Type your reply below the text you are replying to.
            For more information, visit http://www.vim.org/maillist.php
          Your message has been successfully submitted and would be delivered to recipients shortly.