Loading ...
Sorry, an error occurred while loading the content.

Re: [bug] reproducible crash when trying to create dictionary with :mkspell

Expand Messages
  • Bram Moolenaar
    ... I ll at least add this check to avoid a crash and give an error when this happens. ... Oof. That probably requires a bit of redesign to handle. ... Good
    Message 1 of 4 , Feb 1, 2011
    • 0 Attachment
      Dominique Pelle wrote:

      > >> Vim-7.3.107 crashes when I try to create the Esperanto
      > >> dictionary from OpenOffice-3.
      > >>
      > >> Steps to reproduce:
      > >>
      > >> $ wget http://extensions.services.openoffice.org/e-files/3377/1/1.0-dev.oxt
      > >> $ unzip 1.0-dev.oxt
      > >> $ vim -u NONE --noplugin -c 'set nomore' -c 'mkspell! /tmp/eo literumilo'
      > >> Vim: Caught deadly signal SEGV
      > >>
      > >> Vim: Finished.
      > >> Segmentation fault (core dumped)
      > >>
      > >>
      > >> Valgrind memory checker gives errors:
      > >>
      > >> ==3877== Invalid write of size 1
      > >> ==3877==    at 0x4026087: strcat (mc_replace_strmem.c:176)
      > >> ==3877==    by 0x818AF60: spell_read_aff (spell.c:5493)
      > >> ==3877==    by 0x819359D: mkspell (spell.c:9234)
      > >> ==3877==    by 0x8191F53: ex_mkspell (spell.c:8557)
      > >> ==3877==    by 0x80A81AC: do_one_cmd (ex_docmd.c:2657)
      > >> ==3877==    by 0x80A5A85: do_cmdline (ex_docmd.c:1123)
      > >> ==3877==    by 0x80A513F: do_cmdline_cmd (ex_docmd.c:728)
      > >> ==3877==    by 0x80EAEE5: exe_commands (main.c:2803)
      > >> ==3877==    by 0x80E85CA: main (main.c:881)
      > >> ==3877==  Address 0x7e3b434 is 0 bytes after a block of size 16,012 alloc'd
      > >> ==3877==    at 0x4025230: malloc (vg_replace_malloc.c:236)
      > >> ==3877==    by 0x8117B93: lalloc (misc2.c:918)
      > >> ==3877==    by 0x8117ACB: alloc_clear (misc2.c:829)
      > >> ==3877==    by 0x818F951: getroom (spell.c:7368)
      > >> ==3877==    by 0x818AEFA: spell_read_aff (spell.c:5485)
      > >> ==3877==    by 0x819359D: mkspell (spell.c:9234)
      > >> ==3877==    by 0x8191F53: ex_mkspell (spell.c:8557)
      > >> ==3877==    by 0x80A81AC: do_one_cmd (ex_docmd.c:2657)
      > >> ==3877==    by 0x80A5A85: do_cmdline (ex_docmd.c:1123)
      > >> ==3877==    by 0x80A513F: do_cmdline_cmd (ex_docmd.c:728)
      > >> ==3877==    by 0x80EAEE5: exe_commands (main.c:2803)
      > >> ==3877==    by 0x80E85CA: main (main.c:881)
      > >> (several other errors after that)
      > >>
      > >> spell.c:
      > >>
      > >>   5478     else if (is_aff_rule(items, itemcnt, "COMPOUNDRULE", 2))
      > >>   5479     {
      > >>   5480         /* Concatenate this string to previously defined ones, using a
      > >>   5481          * slash to separate them. */
      > >>   5482         l = (int)STRLEN(items[1]) + 1;
      > >>   5483         if (compflags != NULL)
      > >>   5484             l += (int)STRLEN(compflags) + 1;
      > >>   5485         p = getroom(spin, l, FALSE);
      > >>   5486         if (p != NULL)
      > >>   5487         {
      > >>   5488             if (compflags != NULL)
      > >>   5489             {
      > >>   5490                 STRCPY(p, compflags);
      > >>   5491                 STRCAT(p, "/");
      > >>   5492             }
      > >> !!5493             STRCAT(p, items[1]);
      > >>   5494             compflags = p;
      > >>   5495         }
      > >>   5496     }
      > >>
      > >> When it crashes, I notice that variable l reaches 16005 which is just
      > >> slightly bigger than SBLOCKSIZE (#define  SBLOCKSIZE 16000 at
      > >> spell.c:4885).
      > >
      > > First thing to fix would be to add a check in getroom() for a length
      > > more than what can be handled.
      >
      > I tried adding a check to avoid allocating more than SBLOCKSIZE
      > in getroom().
      >
      > $ hg diff spell.c
      > diff -r 26fb122355d4 src/spell.c
      > --- a/src/spell.c Sat Jan 22 21:25:11 2011 +0100
      > +++ b/src/spell.c Mon Jan 31 21:12:13 2011 +0100
      > @@ -7364,6 +7364,9 @@
      >
      > if (bl == NULL || bl->sb_used + len > SBLOCKSIZE)
      > {
      > + if (len > SBLOCKSIZE)
      > + return NULL;
      > +
      > /* Allocate a block of memory. This is not freed until much later. */
      > bl = (sblock_T *)alloc_clear((unsigned)(sizeof(sblock_T) + SBLOCKSIZE));
      > if (bl == NULL)
      >
      >
      > A dictionary is then created without crash but it's crude:
      > many (and most) COMPOUNDRULE are skipped without
      > warning when creating the dictionary. I did not try to improve
      > to give a warning since dictionary is bad anyway (:spelldump
      > shows very few words). It's probably best to not create a
      > dictionary at all when getroom() returns NULL.

      I'll at least add this check to avoid a crash and give an error when
      this happens.

      > >> I changed SBLOCKSIZE from 16000 to 1024000 at spell.c:4885
      > >> and it longer crashes but that's quite a dramatic increase so I
      > >> doubt whether that's right.  While creating the dictionary,
      > >> l variable at spell.c:5485 reached l=564,458.
      > >
      > > Pfew, that's a lot of compound rules.  Did you check in the affix file
      > > that this is actually there?  I suppose this must be generated, no human
      > > would type this.
      >
      > The file "literumilo.aff" contains 37,300 lines of COMPOUNDRULE .
      > (previous email contains a link to the OpenOffice file which contains
      > the *.aff & *.dic files).

      Oof. That probably requires a bit of redesign to handle.

      > COMPOUNDRULE lines are most certainly created automatically.
      > It seems that most words are built using COMPOUNDRULE in this
      > dictionary. I need to read the doc of Myspell or Hunspell
      > carefully to understand this better.

      Good luck, Hunspell docs are sparese. The Vim docs may also help,
      although recent changes in Hunspell may have changed behavior.

      > > A solution would be to make compflags a growarray.  The way it's done
      > > now it is re-allocated for each COMPOUNDRULE, very inefficient.  This
      > > was made with the assumption there would be only a few COMPOUNDRULEs.
      >
      > Even if we did that, a regex is created with all COMPOUNDRULE
      > with many alternatives \|. The regex would be slow I think.
      > See below...
      >
      > >> Vim temporary used 12.5 Gb of memory while creating the
      > >> dictionary which is quite a lot.
      > >
      > > Compound rules are indeed a problem, since Vim expands them into all
      > > possible words to be able to build the trie.  It compresses really well,
      > > but lots of memory is needed before the compression happens.
      > >
      > >> Size of the created dictionary file 'eo.utf-8.spl' is 587,214 bytes
      > >> and it does not work.  Trying to use it with...
      > >>
      > >>   $ cp /tmp/eo.utf-8.spl ~/.vim/spell/.
      > >>
      > >> ... then in Vim...
      > >>   :setlocal spell spelllang=eo
      > >>
      > >>   ... gives errors after waiting for ~10 sec or so:
      > >>
      > >>   Error detected while processing /home/pel/.vim/spell/eo.utf-8.spl:
      > >>   E339: Pattern too long
      > >>   E759: Format error in spell file
      > >>
      > >> ":help E339" says: "This only happens on systems with 16 bit ints".
      > >>
      > >> Well, that's not true since I get E339 on Linux x86_64 where
      > >> sizeof(int) is 4 (32 bits).
      > >>
      > >> E339 at regexp.c:1059 is in between SMALL_MALLOC
      > >> but E339 at regexp.c:1077 is not in between SMALL_MALLOC.
      > >>
      > >> Either help file needs to be updated for E339 or regexp.c
      > >> needs to handle longer regexes.
      > >
      > > The help is wrong, it can also happen when an offset gets too big.
      > >
      > >> Problem is triggered by the fact that file 'literumilo.aff' contains
      > >> many COMPOUNDRULE lines (Esperanto being an agglutinative
      > >> language).
      > >
      > > Did you check which regcomp() call results in E339?
      >
      > I've added a printf I see this:
      >
      > [vim_regcomp:1061] len(expr)=491048
      > expr=[^\(<7f>\~}||\|{z\|{yz\|xwz\|vutz\|vstz\|rz\|wz\|qpz\|oz\|xn\|xmn\|xlkn\|xljin\|xhgn\|xhtn\|xhfn\|xhin\|xhen\|xhdn\|xghen\|
      > .... etc, etc... (huge regexp of ~491 Kb)
      >
      > The stack trace when E339 happens is:
      >
      > #17 0x000000000052c0c3 in vim_regcomp (
      > expr=0x7fc9c6a21010
      > "^\\(\177\\~}||\\|{z\\|{yz\\|xwz\\|vutz\\|vstz\\|rz\\|wz\\|qpz\\|oz\\|xn\\|xmn\\|xlkn\\|xljin\\|xhgn\\|xhtn\\|xhfn\\|xhin\\|xhen\\|xhdn\\|xghen\\|xgcn\\|xbln\\|xbhn\\|xthn\\|xbtn\\|xbwn\\|xban\\|xt`_Zn\\|xbYn\\|xtYn\\|xwaiZn\\|xXtn\\|xXYn\\"...,
      > re_flags=7) at regexp.c:1077
      > #18 0x000000000055bc33 in read_compound (fd=0x92d1b0, slang=0x92d3f0,
      > len=453893) at spell.c:3704
      > #19 0x000000000055a7a9 in spell_load_file (fname=0xa0d5f0
      > "/home/pel/.vim/spell/eo.utf-8.spl", lang=0x7fff44b8a430 "eo",
      > old_lp=0x0, silent=0)
      > at spell.c:2920
      > #20 0x000000000055a352 in spell_load_cb (fname=0xa0d5f0
      > "/home/pel/.vim/spell/eo.utf-8.spl", cookie=0x7fff44b8a430) at
      > spell.c:2732
      > #21 0x0000000000464602 in do_in_runtimepath (name=0x7fff44b8a540
      > "spell/eo.utf-8.spl", all=0, callback=0x55a323 <spell_load_cb>,
      > cookie=0x7fff44b8a430) at ex_cmds2.c:2715
      > #22 0x0000000000559c09 in spell_load_lang (lang=0x7fff44b8b5f0 "eo")
      > at spell.c:2486
      > #23 0x000000000055cd74 in did_set_spelllang (wp=0x858e30) at spell.c:4304
      > #24 0x0000000000512263 in did_set_string_option (opt_idx=284,
      > varp=0x85c298, new_value_alloced=1, oldval=0x85d430 "en",
      > errbuf=0x7fff44b8b9b0 "",
      > opt_flags=4) at option.c:6551
      > #25 0x000000000050f0fd in do_set (arg=0xa0d63c "", opt_flags=4) at option.c:4792
      > #26 0x00000000004773bd in ex_set (eap=0x7fff44b8bb80) at ex_docmd.c:11121
      > #27 0x00000000004697b2 in do_one_cmd (cmdlinep=0x7fff44b8c228,
      > sourcing=1, cstack=0x7fff44b8bd80, fgetline=0x45253b <get_func_line>,
      > cookie=0x92b860)
      > at ex_docmd.c:2657
      > #28 0x0000000000466f15 in do_cmdline (cmdline=0x0, fgetline=0x45253b
      > <get_func_line>, cookie=0x92b860, flags=7) at ex_docmd.c:1123
      > #29 0x0000000000451b79 in call_user_func (fp=0x884890, argcount=0,
      > argvars=0x7fff44b8c750, rettv=0x7fff44b8c910, firstline=1, lastline=1,
      > selfdict=0x0) at eval.c:22018
      > #30 0x000000000043c9f9 in call_func (funcname=0x8611e0 "MySpellLang",
      > len=11, rettv=0x7fff44b8c910, argcount=0, argvars=0x7fff44b8c750,
      > firstline=1,
      > lastline=1, doesrange=0x7fff44b8c90c, evaluate=1, selfdict=0x0) at
      > eval.c:8316
      > #31 0x000000000043c62c in get_func_tv (name=0x8611e0 "MySpellLang",
      > len=11, rettv=0x7fff44b8c910, arg=0x7fff44b8c930, firstline=1,
      > lastline=1,
      > doesrange=0x7fff44b8c90c, evaluate=1, selfdict=0x0) at eval.c:8158
      > #32 0x00000000004358b4 in ex_call (eap=0x7fff44b8ca10) at eval.c:3398
      > #33 0x00000000004697b2 in do_one_cmd (cmdlinep=0x7fff44b8d0b8,
      > sourcing=0, cstack=0x7fff44b8cc10, fgetline=0x47e54f <getexline>,
      > cookie=0x0)
      > at ex_docmd.c:2657
      > #34 0x0000000000466f15 in do_cmdline (cmdline=0x0, fgetline=0x47e54f
      > <getexline>, cookie=0x0, flags=0) at ex_docmd.c:1123
      > #35 0x00000000004f96dc in nv_colon (cap=0x7fff44b8d210) at normal.c:5342
      > #36 0x00000000004f239c in normal_cmd (oap=0x7fff44b8d2d0, toplevel=1)
      > at normal.c:1193
      > #37 0x00000000004ae938 in main_loop (cmdwin=0, noexmode=0) at main.c:1256
      > #38 0x00000000004ae349 in main (argc=1, argv=0x7fff44b8d5f8) at main.c:961
      >
      > read_compound() turns the COMPOUNDRULE items into a
      > regex so a huge number of COMPOUNDRULE won't work.
      >
      > No good solution so far.

      The reason a regexp is used is that there can be * and + characters in the
      rules. However, this Esperanto affix file doesn't use them, thus we
      could instead fill a hashtab with the possibilities and do a lookup
      in that. An alternative would be to make a list of compiled regexp
      items, but that would be terribly slow.

      I do see ? characters in the COMPOUNDRULEs, is that an optional item?
      Yes, it is. Vim doesn't handle that correctly, I'll make that work.
      For the hastab we would need to add the item with and without the
      optional character.

      As an temporary solution, it should be possible to have Hunspell
      generate the full word list and feed that to Vim as a .dic file. The
      list will be huge, not sure if that works.

      --
      hundred-and-one symptoms of being an internet addict:
      172. You join listservers just for the extra e-mail.

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
      \\\ an exciting new programming language -- http://www.Zimbu.org ///
      \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

      --
      You received this message from the "vim_dev" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php
    Your message has been successfully submitted and would be delivered to recipients shortly.