Glossary, Umlauts & tokenizer

  • tde51k
    Message 1 of 6 , Dec 1, 2010
      While I'm still relatively new to OmegaT (using 2.2.1_1 to translate German to English), I'm slowly getting used to using the glossary.

      My command line is:
      "C:\Program Files (x86)\OmegaT-2.2.1\OmegaT.exe" --ITokenizer=org.omegat.plugins.tokenizer.LuceneGermanTokenizer --ITokenizerTarget=org.omegat.plugins.tokenizer.SnowballEnglishTokenizer

      Today I have run into a curious problem in that the tokenizer does not seem to recognize (i.e. mark) some quite simple words.

      After a bit of time, the only pattern I can recognize is that all the words which do not show up contain either an umlaut or the sharp 's' (ß).
      Sample words are:
      unablässig
      entreißen
      Ungestüm

      All these words occur in the original as is (i.e. no stemming, tokenizing or any other analysis is required), but they are not flagged as part of the glossary.

      I'm not sure whether I have seen this earlier, but if I had, I likely put it down to problems with analysis and accepted it as just one of those things.

      Has anyone else come across this problem?
      And most importantly, what, if anything, can I do about it?

      TIA
    • Marc Prior
      Message 2 of 6 , Dec 1, 2010
        This sounds to me like a problem with the glossary, not the tokenizer.
        What is the encoding of your glossary content, and what is the
        glossary's file extension?

        Marc


      • m27aa
        Message 3 of 6 , Dec 2, 2010
          I ran into something similar with Spanish and came to the conclusion that
          all I need to do is make sure that the glossary is a UTF-8 file. If you
          are using Windows, open it with Notepad and save it as a UTF-8 file with
          a .utf8 extension. If, when you try to do so, you see Notepad
          offering to save it as ANSI or another encoding, that was your problem.
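For anyone who prefers a script to Notepad, the same check and conversion can be sketched in Python. This is only an illustration, not part of OmegaT: the function names are made up for this example, and it assumes that "ANSI" means Windows-1252 (`cp1252`), which is the usual case on Western European Windows systems but may differ on yours.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_glossary_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode a glossary file as UTF-8 and return its text.

    source_encoding is an assumption: adjust it if the file was
    saved in a different legacy code page.
    """
    # Work on raw bytes so that line endings are left untouched.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

A file containing only unaccented ASCII characters passes `is_valid_utf8` whatever it was saved as, so the check is only meaningful for glossaries that contain characters such as ä, ü or ß.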

        • Дмитрий Габинский
          Message 4 of 6 , Dec 2, 2010
            2010/12/2, m27aa <m27aa@...>:
            > it as a UTF8 file with
            > a UTF8 extension.

            .txt is fine with latest builds.

            Best regards,

            Dmitri Gabinski
          • tde51k
            Message 5 of 6 , Dec 2, 2010
              --- In OmegaT@yahoogroups.com, Marc Prior <mail@...> wrote:
              >
              > This sounds to me like a problem with the glossary, not the tokenizer.
              > What is the encoding of your glossary content, and what is the
              > glossary's file extension?

              Thank you to all who responded.
              Yes, it was a problem with the encoding of the glossary.
              This glossary was for a new project, and the default encoding in the editor I use (Notepad++) seems to be ANSI. I forgot to change the encoding, and even forgot that it had to be done :-(

              Hope it will stick in my mind now.
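Why only the words with umlauts or ß went missing can be illustrated with a few lines of Python. This is a sketch of the general mechanism, not of OmegaT's actual glossary loader: it assumes the file was saved as Windows-1252 ("ANSI") and then read as UTF-8, and the helper name `read_as` is invented for the example.

```python
def read_as(term, saved_as, read_with):
    """Simulate saving a term in one encoding and reading it in another.

    Bytes that are invalid in the reading encoding are replaced with
    U+FFFD, the Unicode replacement character.
    """
    return term.encode(saved_as).decode(read_with, errors="replace")


for term in ["unablässig", "entreißen", "Ungestüm", "Glossar"]:
    misread = read_as(term, saved_as="cp1252", read_with="utf-8")
    status = "matches" if misread == term else "no longer matches"
    print(f"{term!r} is read back as {misread!r}: {status}")
```

Plain ASCII letters have the same byte values in both encodings, so a word like "Glossar" survives the mismatch, while any word containing ä, ü or ß is read back as a different string and can no longer be matched against the source text.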
            • Didier Briel
              Message 6 of 6 , Dec 3, 2010
                -----Original Message-----
                >From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]On Behalf Of
                tde51k
                >Sent: Thursday, December 02, 2010 12:11 AM
                >To: OmegaT@yahoogroups.com
                >Subject: [OmT] Glossary, Umlauts & tokenizer

                >My command line is:
                >"C:\Program Files (x86)\OmegaT-2.2.1\OmegaT.exe" --ITokenizer=org.omegat.plugins.tokenizer.LuceneGermanTokenizer --ITokenizerTarget=org.omegat.plugins.tokenizer.SnowballEnglishTokenizer

                Thank you for documenting that.

                Under Windows, it is indeed possible to use the tokenizers without creating
                a .bat file or using the command line.

                Here is a very short and simple Howto.

                - Create a shortcut of OmegaT.exe as usual (e.g., right click, drag,
                release, make shortcut).

                - Edit the shortcut properties (right click, Properties). In the "Target"
                field, just after the call to OmegaT.exe (e.g., "C:\Program Files\OmegaT
                2.2\OmegaT.exe"), simply add the tokenizer parameters
                (e.g., --ITokenizer=org.omegat.plugins.tokenizer.LuceneGermanTokenizer).
                The complete target field should now look like:
                "C:\Program Files\OmegaT 2.2\OmegaT.exe" --ITokenizer=org.omegat.plugins.tokenizer.LuceneGermanTokenizer
                on a single line.

                The advantage is that all the Java parameters (memory, user interface
                language, etc.) are still handled by OmegaT.l4J.ini, and thus have to be
                managed in only one place.
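Since line wrapping makes the target string easy to get wrong, here is a small Python sketch that assembles it on a single line. It only builds the text to paste into the shortcut's target field; it does not create or edit the shortcut, and the function name is made up for this example. The paths and tokenizer class names are the ones quoted in this thread.

```python
def build_shortcut_target(exe_path, source_tokenizer=None, target_tokenizer=None):
    """Return the single-line target string for an OmegaT shortcut.

    The executable path is wrapped in double quotes because it
    contains spaces; the tokenizer parameters follow, separated
    by single spaces.
    """
    parts = [f'"{exe_path}"']
    if source_tokenizer:
        parts.append(f"--ITokenizer={source_tokenizer}")
    if target_tokenizer:
        parts.append(f"--ITokenizerTarget={target_tokenizer}")
    return " ".join(parts)


print(build_shortcut_target(
    r"C:\Program Files (x86)\OmegaT-2.2.1\OmegaT.exe",
    source_tokenizer="org.omegat.plugins.tokenizer.LuceneGermanTokenizer",
    target_tokenizer="org.omegat.plugins.tokenizer.SnowballEnglishTokenizer",
))
```

The printed line can be pasted as is into the target field; adjust the path to match your own installation.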


                Didier