RE: [OmT] Findings on the Czech tokenizer

  • Didier Briel
    Message 1 of 77 , Jan 1, 2010
      -----Original Message-----
      >From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of Milan Hubacek
      >Sent: Thursday, December 31, 2009 3:56 AM
      >To: OmegaT@yahoogroups.com
      >Subject: Re: [OmT] Findings on the Czech tokenizer

      >It seems that a group at the University of Neuchatel has done some work on a
      Czech stemmer; see the links below. I wonder whether their Java code is usable
      for the OmegaT tokenizer...
      >
      >http://members.unine.ch/jacques.savoy/clef/CzechStemmerLight.txt
      >http://members.unine.ch/jacques.savoy/clef/CzechStemmerAgressive.txt
      >http://clef-campaign.org/2007/working_notes/DolamicCLEF2007.pdf

      Thank you for the links, they are very interesting.

      Notably because all the stop words and the code are under the BSD license
      (which is compatible with the tokenizers and OmegaT). Some of these stop
      words are already used in Lucene.

      I have no idea yet whether the Czech stemmer might be usable and, supposing
      it is, how much effort would be required to use it in an OmegaT tokenizer.

      Didier
    • Didier Briel
      Message 77 of 77 , Jan 18, 2010
        -----Original Message-----
        >From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of Milan Hubacek
        >Sent: Monday, January 18, 2010 9:37 AM
        >To: OmegaT@yahoogroups.com
        >Subject: Re: [OmT] Findings on the Czech tokenizer

        >> "???" being for "why would you want to convert these files, when
        >> OmegaT has a DocBook filter?"
        >>
        >
        >Simply because I could not get the XML files opened on import (unlike
        >XLIFF :-)
        >Maybe I missed something in the manual again...

        Or the extension used is not a usual one (OmegaT has .xml and .dbk by default).
        Or it's a variant of DocBook not supported by OmegaT. We support DocBook 4 and 5.

        >Just one question: Will the flags added in newer editions of the DocBook
        >help files be usable even after translation? That is what my customer is
        >worried about.

        I have no idea what these "flags" are.
        If you give an example, I might be able to give you an opinion.

        >> Having a look at Options/File Filters (and at the manual) would help.
        >> For instance, .utf8 is a defined extension for text files, where "=" (by
        >> definition) is *not* a separator. What you want is, either treat all .ini
        >> files as UTF-8, or define your own "ini8" extension, for .ini files with
        >> UTF-8 content.
        >>
        >
        >I tried to re-save the source files using a text editor like this: ANSI
        >-> UTF-8 before processing, but with no effect. In OmegaT everything is
        >displayed OK, but in the target files I still keep getting question
        >marks instead of characters with diacritics.
        >
        >But now I found that the filters are editable like <auto> -> UTF-8. Will
        >play around with the settings.

        That's why they are there.
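
For illustration only, here is a minimal Python sketch of one common way question marks end up in a target file: the text is held correctly in memory (so it displays fine), but it is written out in an encoding that has no slot for some Czech letters. The sample string and the choice of ISO-8859-1 as the lossy target encoding are assumptions made for the example; the thread does not say which encoding the filter actually used.

```python
# Hypothetical sample text with Czech diacritics (not taken from the thread's files).
source_text = "Příliš žluťoučký kůň"

# Writing with an encoding that lacks several Czech letters:
# every character it cannot represent is replaced by "?".
latin1_bytes = source_text.encode("iso-8859-1", errors="replace")
lossy = latin1_bytes.decode("iso-8859-1")

# Writing with UTF-8 instead keeps every character intact.
utf8_bytes = source_text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(lossy)       # some letters replaced by "?"
print(round_trip)  # identical to source_text
```

If this is what is happening, re-saving the source as UTF-8 alone would not help; the target encoding set in the file filter is what decides how the translated file is written.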

        Didier