Replace commands in clips incorporating Chinese characters

  • simon.drimer
    Message 1 of 11 , Jan 17, 2010
      Hi - so I have this problem ... I regularly need to process about 60 large (~30 MB each) txt files that contain a mix of simplified Chinese characters and English numerals. 99% of the Chinese characters are words and phrases that appear repeatedly in the txt files (basically thousands of database records, with field names like "Registration number", "Start date" etc. - although all in Chinese), and the remaining 1% I can live without. I need to get out of Chinese and into English, so it seems the most efficient process would be to write a clip with a couple of hundred individual Replace commands to convert the Chinese field names into English. Now the problem is that, while I can open the txt files with the Chinese characters intact by using one of the Unicode settings when opening the document, I am unable to write a clip which will hold Chinese characters. Any Chinese characters I write or paste into a clip just get converted to "????".
      Anyone - is there a way to write clips which keep the integrity of Unicode character sets like Chinese? Or is there some other way I can convert Chinese characters into English in a repeatable way?
    • Alec Burgess
      Message 2 of 11 , Jan 18, 2010
        This is NOT an answer but perhaps a direction.
        Maybe you can use Perl to do the replacements?
        That first requires installing Perl if you don't have it already.
        Then have a look at the clips in Samples: "Perl script" and "Perl NumLines".

        If you can use that pair of scripts to number the lines in one of your
        files containing Chinese characters and have it "come back" without
        changing every Chinese character string to ????'s, you should be able to do
        the replacements.

        If that fails (and it may well, I haven't done much with Unicode
        myself) you might want to wade through enough of the Perl manuals to do
        the whole thing in a free-standing Perl script.

        I've used sed (one of the Unix programs you can get with UnixUtils or
        Cygwin to run in Windows) to do simple regex replacements. I quickly
        googled [sed Unicode] and from scanning the extracts it *appears* that
        sed can handle Unicode, which I believe would make it feasible.

        Probably Sheri (or one of the other NoteTab-Perl gurus on this list)
        can come up with a more definitive answer or even a way of doing the
        whole thing in NoteTab clips directly.
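
For anyone who goes the free-standing script route, here is a minimal sketch of the idea in Python rather than Perl (a swapped-in language, not something anyone in the thread tested). It assumes the files are UTF-8; the function name `translate_file` and the two-entry table are made up for illustration. It reads one line at a time, so a 30 MB file never has to sit in memory all at once.

```python
import os
import tempfile


def translate_file(src_path, dst_path, pairs, encoding="utf-8"):
    """Copy src_path to dst_path, applying every (old, new) pair to each line."""
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        for line in src:                  # stream the file line by line
            for old, new in pairs:
                line = line.replace(old, new)
            dst.write(line)


# Small demonstration on a throwaway file (hypothetical two-entry table).
pairs = [("姓名", "Name"), ("性别", "Gender")]
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("姓名 刘丽梅\n性别 女\n")
    translate_file(src, dst, pairs)
    with open(dst, encoding="utf-8") as f:
        result = f.read()
print(result)
```

If the files turn out to be UTF-16 or GB2312 instead, only the `encoding` argument should need to change.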

        --
        Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)
      • Axel Berger
        Message 3 of 11 , Jan 18, 2010
          "simon.drimer" wrote:
          > I am unable to write a clip which will hold Chinese characters.

          I don't quite see why you need to do that. You open the file, do all
          your ^!Replaces, and when you save afterwards, as you said yourself, it
          won't matter if a tiny unreplaced rest gets corrupted.

          There may be one problem though: I've had the case more than once that
          copying an otherwise illegible string and pasting it into a menu Find
          and Replace string will work, but doing the same thing in a clip won't.
          In those cases I've resorted to using \xnn instead. It's a tiresome
          nuisance to get that working right, but you only need to do it once.
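
Working out the \xnn values by hand is the tiresome part, so here is a small Python sketch that prints them for a given string. The helper name `to_hex_escapes` is invented for this example, and which byte encoding NoteTab actually matches against is an assumption: the bytes below are UTF-8 and would differ for UTF-16 or GB2312.

```python
def to_hex_escapes(text, encoding="utf-8"):
    """Return text as a run of \\xNN escapes, one per encoded byte."""
    return "".join(f"\\x{byte:02X}" for byte in text.encode(encoding))


# Two characters, three UTF-8 bytes each.
print(to_hex_escapes("姓名"))   # \xE5\xA7\x93\xE5\x90\x8D
```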

          Axel
        • John Shotsky
          Message 4 of 11 , Jan 18, 2010
            Can you open it with Google, translate it, then save it? Google is excellent with Chinese.



            Regards,

            John



          • Alec Burgess
            Message 5 of 11 , Jan 18, 2010
              John Shotsky (jshotsky@...) wrote (in part) (on 2010-01-18 at
              08:16):
              > Can you open it with Google, translate it, then save it? Google is
              > excellent with Chinese.

              I like that idea, John! Not sure if Google Translate will handle files
              that big - he's talking about 60 x 30 MB regularly. I *think* the
              easiest way would be to get the files up on the web somewhere (maybe to
              Google Documents?) so URLs can be passed to Google Translate; then all
              you need to do is use wget to ask Google to do the translation.

              Then the only regexps required of NoteTab would be to fix up any English
              where what Google gives isn't quite what Simon wants.

              --
              Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)




            • Sheri
              Message 6 of 11 , Jan 18, 2010

                Yes, I think it can be done in a NoteTab clip, using UTF-8 and regex.

                If you'd care to upload a small sample to show what you have vs what you
                need, I'd be happy to take a look.

                Regards,
                Sheri
              • simon.drimer
                Message 7 of 11 , Jan 18, 2010
                  Hey ... thanks everyone. Yes, Google Translate would be perfect, but the files are too big for it: it tends to fall over 5% of the way into the document and then fails to translate the Chinese text into English. I'll have a look at finding some other way to use the Google Translate engine, maybe with uploads to the web, but that could be a lot of trouble (I'm also looking at other machine translators like Systran). I would love to get NT working!
                  Sheri - thanks - here's a sample partial record (each file contains, let's say, 10,000 of these):
                  姓名 刘丽梅
                  性别 女
                  资格证书号码 00200812210000008520
                  资格证书状态 有效
                  有效期截止日期 2011-12-10
                  And Google Translate will turn that into:
                  Name Li-Mei Liu
                  Gender Female
                  Certificate Number 00200812210000008520
                  Certificate Status Effective
                  Valid cut-off date 2011-12-10
                  So what I am trying to do is write a clip that replaces "姓名" with "Name", "资格证书号码" with "Certificate Number" and so on - let's say there will be 50 different replacements...

                • Don - HtmlFixIt.com
                  Message 8 of 11 , Jan 18, 2010
                    Well, try something like the following:
                    ;clip by don AT htmlfixit DOT com
                    ;Chinese (ish) to English translation example
                    ^!Replace "姓名" >> "Name" ATIWS
                    ^!Replace "资格证书号码" >> "Certificate Number" ATIWS
                    ;copy below as needed
                    ;^!Replace "" >> "" ATIWS
                    ;line 7 -- watch for wrapped lines in email
                    ;line 8 -- all lines start with ^! or semi-colon or colon

                    It could also be done with an array and a loop if you wish.
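
To make the array-and-loop idea concrete, here is a rough sketch in Python rather than in clip language (a substitution for illustration only). The table `FIELD_NAMES` and the function `translate` are hypothetical names; the English labels are the Google Translate output quoted earlier in the thread. One design choice is an addition not raised in the thread: the loop handles the longest search strings first, so a short field name can never overwrite part of a longer one.

```python
# Hypothetical lookup table: Chinese field name -> English label.
FIELD_NAMES = [
    ("姓名", "Name"),
    ("性别", "Gender"),
    ("资格证书号码", "Certificate Number"),
    ("资格证书状态", "Certificate Status"),
    ("有效期截止日期", "Valid cut-off date"),
]


def translate(text, pairs=FIELD_NAMES):
    """Apply every replacement in turn, longest search string first."""
    for old, new in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(old, new)
    return text


record = "姓名 刘丽梅\n资格证书号码 00200812210000008520\n资格证书状态 有效"
print(translate(record))
```

Values that are not in the table, such as the person's name, pass through untouched.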



                  • John Shotsky
                    Message 9 of 11 , Jan 18, 2010
                      That is actually pretty straightforward. Create a clipbook, with one entry for each different field you need to manage,
                      and use a clip like this for each one:

                      ^!Replace "&\#22995;&\#21517;" >> "Name" ARSTW

                      ^!Replace "&\#36164;&\#26684;&\#35777;&\#20070;&\#21495;&\#30721;" >> "Certificate Number" ARSTW



                      Note the added backslashes before the pound signs - NoteTab is not reliable without them. Open a document, run the clipbook
                      (CH2ENG, for example), and just sit back and let it go.



                      Start with a small file with all the headings you need to capture and let Google give you the translation. Write the
                      clips, and you'll be good to go.
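
Typing a couple of hundred entity strings by hand invites typos, so the lines could be generated instead. The Python sketch below builds ^!Replace lines in the same form as the two shown above; the helper names `to_entities` and `replace_command` are invented, and the ARSTW options are simply copied from the example rather than checked against the NoteTab help.

```python
def to_entities(text):
    """Spell each character as a decimal HTML entity with the '#' backslash-escaped."""
    return "".join(f"&\\#{ord(ch)};" for ch in text)


def replace_command(chinese, english, options="ARSTW"):
    """Build one ^!Replace line for a Chinese field name and its English label."""
    return f'^!Replace "{to_entities(chinese)}" >> "{english}" {options}'


print(replace_command("姓名", "Name"))
print(replace_command("资格证书号码", "Certificate Number"))
```

The output can be pasted straight into a clip, and because it is plain ASCII it should not be affected by the "????" conversion.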



                      Regards,

                      John



                    • simon.drimer
                      Message 10 of 11 , Jan 18, 2010
                        Many thanks all. I see now that simply converting the Chinese characters to that different (readable) format is the way forward ...

                      • Axel Berger
                        Message 11 of 11 , Jan 19, 2010
                          "simon.drimer" wrote:
                          > here's a sample partial record
                          > 姓名 刘丽梅
                          > 性别 女

                          Ah, that's different, why didn't you say so? These are not Chinese
                          characters as such, be it UTF or any other encoding, but rather HTML
                          entities. From NoteTab's point of view all that is pure 7-bit US-ASCII.
                          Nothing simpler than finding and replacing that. (Simple in concept that
                          is, still a lot of work, but, as I said before, it needs only be done
                          once.)
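
A quick way to check this point outside NoteTab is sketched below in Python. It assumes the file stores the field name as decimal entities, as in John's clips; the variable names are illustrative only.

```python
import html

entity_text = "&#22995;&#21517;"        # the field name as stored in the file
assert entity_text.isascii()            # nothing outside 7-bit ASCII

decoded = html.unescape(entity_text)    # the characters a browser would display
print(decoded)                          # 姓名

# A plain ASCII search string is enough to replace it.
print(entity_text.replace("&#22995;&#21517;", "Name"))
```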

                          Axel