Script to convert OmegaT glossaries into DICT for QA Distiller

  • Roman Mironov
    Message 1 of 4 , Aug 1, 2012
      Hello,

      Could you please suggest any idea of a script to automatically convert
      OmegaT glossaries with notes in the third column into DICT files (used
      with QA Distiller) that have a short header, UCS-2 Little Endian
      encoding, just two columns (the third column with notes must be deleted)
      and | as a delimiter instead of a tab? As I am using QA Distiller
      extensively to check for glossary errors, manual conversion is too
      time-consuming.

      Please also feel free to contact me if you can develop such a script for a
      fee. Thank you.

      Best regards,

      Roman

      ----------------------------------------
      Roman Mironov, Velior
      153035 Russia, Ivanovo, 3 Poletnaya, 5-50
      Tel: 7 920 679 33 21
      Skype: velior_roman_mironov
      Blog: velior.ru/blog/en <http://www.velior.ru/blog/en>
      LinkedIn: ru.linkedin.com/in/romanmironov
      <http://ru.linkedin.com/in/romanmironov>


    • Didier Briel
      Message 2 of 4 , Aug 1, 2012
        > -----Original Message-----
        > From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com] On
        > Behalf Of Roman Mironov
        > Sent: Wednesday, August 01, 2012 11:02 AM
        > To: OmegaT@yahoogroups.com
        > Subject: [OmT] Script to convert OmegaT glossaries into DICT for QA Distiller


        > Could you please suggest any idea of a script to automatically convert
        > OmegaT glossaries with notes in the third column into DICT files (used with
        > QA Distiller) that have a short header,

        Is the header "text" (human-readable) or binary?

        >UCS-2 Little Endian encoding,

        Are you sure you really need UCS-2?
        http://en.wikipedia.org/wiki/UCS-2

        <<The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996.[2] It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.>>

        I would find it odd that recent software would use such an old character set.

        >just two
        > columns (the third column with notes must be deleted) and | as a delimiter
        > instead of a tab? As I am using QA Distiller extensively to check for glossary
        > errors, manual conversion is too time-consuming.

        Apart from the header and character set questions, which might require a bit more head scratching depending on your answers, that sounds reasonably simple to do in Perl, for instance.

        Open each line, split with tab, write first column, write |, write second column, go to next line.

        Didier
      • HOGG Maynard
        Message 3 of 4 , Aug 1, 2012
          On Wed, Aug 1, 2012 at 6:39 PM, Didier Briel <d.briel@...> wrote:
          > Open each line, split with tab, write first column, write |, write second column, go to next line.

          As little as two lines in AWK—without error checking.

          BEGIN {FS = "\t"}
          {print $1 "|" $2}
        • roman.mironov@ymail.com
          Message 4 of 4 , Aug 16, 2012
            Hello,

            Thank you for your input. I am sorry it took me so long to get back.

            >Is the header "text" (human-readable) or binary?

            It's human-readable. In fact, it's simply two lines:
            DICTFILE V1.1
            EN-US|RU-RU
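
For illustration, the two-line header quoted above could be written from the AWK `BEGIN` block, ahead of the converted entries. This is only a minimal sketch under assumptions: the file names `glossary.txt`, `add_header.awk` and `glossary.dict` and the sample entries are made up, `NF >= 2` is an added guard that skips blank or one-column lines, and the encoding question is left aside (the output keeps whatever encoding the input has).

```shell
# Hypothetical sample glossary: source term, target term, note (tab-separated).
printf 'file\tфайл\tnoun\nprint\tпечать\tverb or noun\n' > glossary.txt

# Hypothetical script name; the header text is taken from the message above.
cat > add_header.awk <<'EOF'
BEGIN {
    FS = "\t"               # glossary columns are separated by tabs
    print "DICTFILE V1.1"   # header line 1
    print "EN-US|RU-RU"     # header line 2: language pair
}
# Keep the first two columns, drop the note; skip lines with fewer than two columns.
NF >= 2 { print $1 "|" $2 }
EOF

awk -f add_header.awk glossary.txt > glossary.dict
cat glossary.dict
```

With the sample input, `glossary.dict` holds the two header lines followed by `file|файл` and `print|печать`. Whether QA Distiller accepts the result without a further encoding conversion is not something this sketch checks.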

            >Are you sure you really need UCS-2? http://en.wikipedia.org/wiki/UCS-2

            This is the character set the program uses by default when it creates DICT files; I'm not sure why. I tried converting a file to UTF-8 using Notepad++, and the Russian characters got corrupted in QA Distiller. I then tried ANSI, and this time it seemed to work, so using ANSI instead of UCS-2 might work as well.

            >As little as two lines in AWK—without error checking.

            >BEGIN {FS = "\t"}
            >{print $1 "|" $2}

            Thank you very much for your suggestion! What are the exact steps to run this script? I saved it as a text file with an .awk extension and installed Gawk for Windows (http://gnuwin32.sourceforge.net/packages/gawk.htm). How do I point it to the file I want to convert?
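
One possible way to run it, sketched under assumptions: the script is saved as `convert.awk` and the glossary as `glossary.txt` (both names are hypothetical), and the command is typed in a command prompt opened in the folder that holds both files. The `-f` option names the script, the file to convert follows as a plain argument, and `>` redirects the result into a new file. The sketch calls `awk` so that it runs wherever that is the installed name; with the GnuWin32 package the program is typically `gawk`.

```shell
# Save the two-line script from the thread (skip if convert.awk already exists).
cat > convert.awk <<'EOF'
BEGIN {FS = "\t"}
{print $1 "|" $2}
EOF

# Hypothetical one-entry glossary, only to make the example self-contained.
printf 'file\tфайл\tnote\n' > glossary.txt

# -f points to the script; the input file is given after it.
awk -f convert.awk glossary.txt > glossary.dict
```

In a Windows command prompt only the last line is needed, presumably in the form `gawk -f convert.awk glossary.txt > glossary.dict`; if `gawk` is not on the PATH, the full path to `gawk.exe` would have to be given instead.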