"noise" and "philosophy"...

  • hiz--dic@islandnet.com
    Message 1 of 24 , Feb 6, 2008
      Hi,

      surely many of you are familiar with one or other usage of the term "noise". Today i have a question concerning the philosophy behind what goes into the database and what does not, and this question was triggered by a certain kind of "noise", related to the glossdic, that i feel has been increasing in the last few years.

      When i have a text with a significant number of kanji or kanji combinations that i cannot read on sight, i use the "Translate Words" function to get a first approximation of what kind of text i am dealing with and what i need to look up further. Today, when using the glossdic, i had a high noise level, and i would like to share a few observations:

      In the text at hand, the expression ベネッセ occurs a few times. Though a foreign word (name), it is a word that many in Japan would have heard or read in some context or other. And what does the glossdic make of it? I don't mind that it does not "know" ベネッセ - after all, such "knowledge" is the direct result of some person having "told" the dictionary about this word, and that seems not to have happened yet (i'll add it after i have sent this post). But the glossdic provides the following (mis-)information:

      ベネッサレッドグレイヴ (u) Vanessa Redgrave; NA [Partial Match]

      No, that is not a "partial match" - that is simply "noise" (i guess if we were dealing with people we might call it a "potshot").

      And then i come across the following glossary buster item (never mind that it is not exactly well-written Japanese):

      この他、古い家屋を改修し、アーティストが家の空間そのものを作品化して公開している「家プロジェクト」やベネッセアートサイト直島に関する書籍を揃えた「本村ラウンジ&アーカイブ」、これらを拠点に、瀬戸内海のもつ自然と歴史のリズム、それに共振するセレクトされたアートがともに響きあう創造の場として展開を続けています。

      Let's look at what the glossdic presents for the katakana strings occurring in that sentence:

      * ベネッサレッドグレイヴ (u) Vanessa Redgrave; NA [Partial Match!]
      * セア (p) Sayre; NA
      * トサ Tosa (f); NA
      * イト Ito (f); NA
      * ラウンジ (n) lounge; (P); EP
      * アーカイブ (n) archive; ED
      * リズム (n) rhythm; (P); EP
      * セレクト (n,vs) select; (P); EP
      * アート (n) art; (P); EP

      If アーカイブ means "archive" and "セレクト" means "select" and "アート" means "art", why are those words even in the glossary? Are we going to by and by katakanize every English word known to man and put it in there? What is the philosophy behind putting plain English words in katakana into the database?

      And what about セア (Sayre), トサ (Tosa), and イト (Ito)? Almost any Japanese sound combination can be a name in Japanese - are we going to put them all in there? アイ, アイコ,エイコ, モモ, ナナ, ユリ, ワカナ, and so on? What is the philosophy behind putting such items into the database at all?

      And here is another one:

      豊かな自然と温暖な気候に恵まれた岡山県は、果物の栽培に適しています。加えて高度な栽培技術を有し、種類の豊富さと美味から「くだもの王国」と呼ばれています。ぶどう、白桃、メロン、いちご、ミカン、スイカ、オリーブほか、年間を通じて様々な果物が収穫されており、新鮮な旬の果物を食べることができます。また、産地の果樹園でフルーツ狩りを体験することのもおすすめです。

      Again, isolating the katakana items:

      * メロン (n) melon; (P)
      * ミカ Mika (f); NA
      * スイカ (n) Suica (rechargeable prepaid IC card that can be used as a train pass in the greater Tokyo, Osaka and Sendai regions and also as electric money in some stores)
      * オリーブ (n) (fr:) olive
      * フルーツ (n) fruit; (P)

      ミカ Mika (f); NA? Um... no, not really. "Rechargeable prepaid IC card"? Well, that is ONE possibility, but where are the more obvious results for the queries, like mandarin orange and watermelon? Is the problem that the names of the fruits are written in katakana? Well, that is how those words are commonly written in this era (i am the only person i know who uses kanji for fruits and fish).

      Thanks to ongoing volunteer effort the number of entries in the database keeps growing, and, as one can see from the "New Entries/Amendments" pages on the web, such entries are often rather sophisticated and specialised (and thus their inclusion makes the WWWJDIC so valuable), so it seems to me increasingly important that undesirable side effects like the "noise" i described above, be understood and counteracted.

      All that i have written here should not be construed as a complaint (the WWWJDIC is one of the best resources we have as translators, and the price is unbeatable) but as a contribution toward an analysis of the problems that users like me encounter on a regular basis - with the goal of limiting undesirable side effects of the continuing increase in quantity and quality of the data.

      When i ask questions about the philosophy i am, on the ground level, asking about both editorial policy and the algorithms at work, and i would like to contribute in some way toward refining both. To that effect i would like to learn more about them, and whether this list is the right place for that i don't know, but at least it seems the right place to start the quest.

      Thanks & regards: Hendrik





      --


      * 南風言語業(大工ヘンドリク) *
      http://www.paikaji-translation.com/

      --
    • Paul Blay
      Message 2 of 24 , Feb 6, 2008
        > If アーカイブ means "archive" and "セレクト" means "select" and
        > "アート" means "art", why are those words even in the glossary?
        > Are we going to by and by katakanize every English word known to
        > man and put it in there? What is the philosophy behind putting
        > plain English words in katakana into the database?

        The philosophy is to put 外来語 that is in use by Japanese into
        the database. All three of those fall firmly into the 'in use'
        category (and アーカイブ deserves a (P)).

        アーカイブ 47,400,000 Google hits.
        セレクト 13,500,000
        アート 41,200,000

        > Is the problem that the names of the fruits are written in katakana?
        > Well, that is how those words are commonly written in this era.

        I believe there is special handling for words that are usually in
        katakana but have kanji in glossdic. They have to be specified first,
        though. If you want your ミカンs and スイカs to be picked out
        correctly somebody needs to point them out to Jim (and then he needs
        to implement them).

        > When i ask questions about the philosophy i am, on the ground level,
        > asking about both editorial policy and the algorithms at work, and
        > i would like to contribute in some way toward refining both. To that
        > effect i would like to learn more about them, and whether this list
        > is the right place for that i don't know, but at least it seems the
        > right place to start the quest.

        I suspect that highest on the current list is the effort to let
        (select) people other than Jim work on adding, validating and editing
        Edict entries.
      • Darren Cook
        Message 3 of 24 , Feb 7, 2008
          >> If アーカイブ means "archive" and "セレクト" means "select" and
          >> "アート" means "art", why are those words even in the glossary?
          >> Are we going to by and by katakanize every English word known to
          >> man and put it in there? What is the philosophy behind putting
          >> plain English words in katakana into the database?
          >
          > The philosophy is to put 外来語 that is in use by Japanese into
          > the database.

          Yes, agreed. In other words the problem is the Japanese people's delight
          in using katakana.

          The other problem is that the katakana words are not always based on an
          English word, or if so, not with exactly the same meaning as the English
          word, so are very handy to have in the dictionary.

          Knowing what should be in a dictionary and what shouldn't is a tough call.

          > All three of those fall firmly into the 'in use'
          > category (and アーカイブ deserves a (P)).
          >
          > アーカイブ 47,400,000 Google hits.
          > セレクト 13,500,000
          > アート 41,200,000

          I've been getting suspicious of googits for katakana words. I'm betting
          that Google is doing some special processing (e.g. also matching
          hiragana) and/or is not splitting words very effectively (so longer
          words that contain アート, possibly after normalizing everything to
          hiragana, are being matched).
          Of course I've just skimmed the first 20 pages of google hits for アート
          and they also seemed to be talking about art. But, I'm still going to be
          suspicious of the bottom 99.99% of the search hits.

          Darren



          --
          Darren Cook
          http://dcook.org/mlsn/ (English-Japanese-German-Chinese free dictionary)
          http://dcook.org/work/ (About me and my work)
          http://dcook.org/work/charts/ (My flash charting demos)
        • Paul Blay
          Message 4 of 24 , Feb 7, 2008
            > > アーカイブ 47,400,000 Google hits.
            > > セレクト 13,500,000
            > > アート 41,200,000
            >
            > I've been getting suspicious of googits for katakana words. I'm betting
            > that Google is doing some special processing (e.g. also matching
            > hiragana)

            I don't think so. Some 'alternative spelling pairs' have special
            processing, but that can be disabled by using the + character and
            doesn't result in massive hit inflation.

            > and/or is not splitting words very effectively (so longer
            > words that contain アート, possibly after normalizing everything to
            > hiragana, are being matched).

            Possibly, but not going to be common.

            > Of course I've just skimmed the first 20 pages of google hits for アート
            > and they also seemed to be talking about art. But, I'm still going to be
            > suspicious of the bottom 99.99% of the search hits.

            Google hit numbers should be viewed with suspicion, but that's nothing
            special to katakana words (or even to Japanese). I've seen some
            truly bizarre effects and massive number inflation from time to time.

            Getting back to アート, if you stick a particle on each end and enclose
            it in quotes you still get a lot of hits.
            "やアートの" 27,200 (and the first 1,000 actually exist ;-)
            Compare it with a kanji equivalent
            "や美術の" 25,500
            and it certainly looks like kanji is losing the battle.
          • René Malenfant
            Message 5 of 24 , Feb 7, 2008
              The "noise and philosophy" thread reminded me of something I wanted
              to bring up as well.

              One thing I've been thinking about is the automated creation of a
              third version of edict.

              Right now there are two versions. There's the (P)-list version, and
              the full version. The former basically acts as a 小辞典, and
              the latter as a 大辞典. This leaves a rather large, and
              (hopefully) easily filled 中辞典 gap between the two.

              I agree with Hendrik that it is precisely the rare and specialized
              entries that make edict valuable, and it's rather been my policy to
              treat edict like Wikipedia: it's "not paper" so anything and
              everything should go in.

              That said, having so much rare "stuff" in the dictionary can result
              in search noise, and especially when searching E-J, long explanatory
              entries like "Buddhist statue of a figure sitting contemplatively in
              the half-lotus position (often of Maitreya)" result in false positives.

              The way I see it, a good edict 中辞典 would be the same as
              the 大辞典, but stripped of the (iK), (oK) and (io) headwords;
              the (ik) and (ok) readings; and the (obsc), (obs) and (arch) senses.
              (Of course, any word that is in the P-list would get an automatic
              pass, even if it has one of these tags.)
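A minimal sketch of the filter described above, assuming a simplified in-memory entry layout. The `Entry` class, its fields, and `to_medium` are hypothetical names for illustration only; they are not EDICT's actual file format or tooling.

```python
from dataclasses import dataclass

# Tags proposed for stripping, grouped by the part of the entry they mark.
HEADWORD_TAGS = {"iK", "oK", "io"}
READING_TAGS = {"ik", "ok"}
SENSE_TAGS = {"obsc", "obs", "arch"}


@dataclass
class Entry:
    # Each list item is a (text, set_of_tags) pair; this layout is an assumption.
    headwords: list
    readings: list
    senses: list
    common: bool = False  # True if the entry carries a (P) marker


def _keep(items, banned):
    """Drop every (text, tags) item that carries at least one banned tag."""
    return [(text, tags) for text, tags in items if not (tags & banned)]


def to_medium(entry):
    """Return a filtered entry, or None if nothing usable is left."""
    if entry.common:
        return entry  # P-list words get an automatic pass
    headwords = _keep(entry.headwords, HEADWORD_TAGS)
    readings = _keep(entry.readings, READING_TAGS)
    senses = _keep(entry.senses, SENSE_TAGS)
    if not senses or not (headwords or readings):
        return None
    return Entry(headwords, readings, senses, entry.common)
```

Because the filter works purely from tags, the medium-sized file could be regenerated automatically whenever the full file changes.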


              In addition, I think a more powerful search for the WWWJDIC page is
              in order. (And yes, I know there's one up and running on Arakawa,
              but that one is a bit intimidating, what with all its checkboxes and
              whatnot.)

              All of the Japanese dictionaries I've used on my computer offer
              "starts with...", "ends with...", "contains..." and "is..." search
              options.

              WWWJDIC, at present, seems to default to "contains...", which results
              in loads of funky hits coming up before the word I actually want. I
              know these can be drastically reduced by checking the "starting
              kanji" and "exact word-match" boxes. But the "starting kanji" box
              only applies to the Japanese headword, and the "exact match" box only
              applies to the English gloss.

              I think there should be an HTML select list with "starts with...",
              "ends with...", "contains..." and "is..." options, with either
              "starts with..." or "is ..." selected by default. This would
              *drastically* reduce the number of false positives.
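The four proposed options can be sketched roughly as follows. This assumes the search runs over a plain list of headword strings; `matches` and `search` are hypothetical names, not part of WWWJDIC.

```python
def matches(query, text, mode="starts"):
    """Compare query against text using one of the four proposed modes."""
    if mode == "starts":
        return text.startswith(query)
    if mode == "ends":
        return text.endswith(query)
    if mode == "contains":
        return query in text
    if mode == "is":
        return text == query
    raise ValueError(f"unknown mode: {mode!r}")


def search(query, words, mode="starts"):
    """Return the words that match query, preserving their original order."""
    return [word for word in words if matches(query, word, mode)]
```

With a default of "starts" or "is", a query such as アート no longer pulls in every longer word that merely contains it; "contains" remains available when the broader result set is wanted.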

              You'd then have one checkbox left for "common words", and could
              perhaps add another one for "ignore archaic Japanese", etc.


              My thoughts.


              Rene Malenfant
            • Jim Rose
              Message 6 of 24 , Feb 7, 2008

                On Feb 7, 2008, at 5:05 AM, René Malenfant wrote:

                > I think there should be an HTML select list with "starts with...",
                > "ends with...", "contains..." and "is..." options, with either
                > "starts with..." or "is ..." selected by default. This would
                > *drastically* reduce the number of false positives.


                I wouldn't mind a little "and"/"or" boolean capability built into WWWJDIC... unless I get around to fixing one up on Kanjicafe.
              • hiz--dic@islandnet.com
                Message 7 of 24 , Feb 7, 2008
                  Hi Paul and Darren, thanks for the feedback...

                  Regarding the message Paul Blay wrote at 07:58 8/02/07 +0000:
                  >The philosophy is to put 外来語 that is in use by Japanese into
                  >the database.

                  Regarding the message Darren Cook wrote at 17:14 8/02/07 +0900:
                  >The other problem is that the katakana words are not always based on an
                  >English word, or if so, not with exactly the same meaning as the English
                  >word, so are very handy to have in the dictionary.

                  Right - if the meaning of the 外来語 is not what the "back translation" into English suggests (e.g., リニューアル and リフォーム do not mean "renewal" and "reform"), then i think an entry is called for, but...

                  ... there is a large number of 外来語 that do not add any useful information (e.g., アート, meaning "art", meaning 芸術 or related concepts), and i don't see any point in cluttering up the database with such fashion words until it becomes apparent that their meaning differs in some fundamental way from the English meaning. :-)

                  Darren:
                  >Knowing what should be in a dictionary and what shouldn't is a tough call.

                  Well, that is where an ever expanding database needs a more defined editorial approach, which brings me back to the reason why i posted in the first place. And one of my concerns (mentioned quite a while ago but apparently still unmet) is that there seems to be no native speaker of Japanese on this list and so i suspect also that there is no native speaker of Japanese involved in the collective effort of those who are subscribed to this list. :-)

                  I would like to see this ongoing project and this mailing list advertised more among practicing translators and wouldn't mind doing that myself (but Jim shares at least one of the related mailing lists, so i do not want to pre-empt him).

                  Paul:
                  >> Is the problem that the names of the fruits are written in katakana?
                  >> Well, that is how those words are commonly written in this era.
                  >
                  >I believe there is special handling for words that are usually in
                  >katakana but have kanji in glossdic. They have to be specified first,
                  >though. If you want your ミカンs and スイカs to be picked out
                  >correctly somebody needs to point them out to Jim (and then he needs
                  >to implement them).

                  OK... i'll make an effort to collect such incidences and forward the information.

                  And as regards the enamdict noise (all those sounds like トサ, イト, and so on), if kana names are to be kept/included in there, then i would like to suggest to only call on enamdict's _kanji_ entries during a glossdic search - the reasoning being that, if it is not obvious from the context that a given kana string in a text refers to a person's name, then having a database telling me that this string might be a person's name is not going to help me, while, OTOH, getting such information in all those other cases where it is plainly inappropriate has a negative benefit. :-)
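A rough sketch of that suggestion: drop name-dictionary hits whose headword is written entirely in kana, and leave everything else alone. The hit layout (headword, gloss, source code) and the reading of "NA" as the name-dictionary marker are assumptions based on the output quoted earlier in the thread; `is_kana_only` and `drop_kana_names` are hypothetical names.

```python
def is_kana_only(text):
    """True if every character lies in the hiragana/katakana Unicode blocks."""
    # U+3040..U+309F is hiragana and U+30A0..U+30FF is katakana (incl. ー).
    return all("\u3040" <= ch <= "\u30ff" for ch in text)


def drop_kana_names(hits, name_source="NA"):
    """Remove hits that come from the name dictionary and contain no kanji.

    hits: list of (headword, gloss, source) tuples. The tuple layout and the
    "NA" source code are assumptions made for this illustration.
    """
    return [
        hit for hit in hits
        if not (hit[2] == name_source and is_kana_only(hit[0]))
    ]
```

Applied to the first sample sentence, this would suppress トサ and イト while keeping ordinary katakana words such as ラウンジ, and it would still let kanji name entries through.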

                  Thanks & regards: Hendrik





                • hiz--dic@islandnet.com
                  Message 8 of 24 , Feb 7, 2008
                    Hello René

                    Regarding the message René Malenfant wrote at 18:05 8/02/07 +0900:
                    >[...] it's rather been my policy to treat edict like Wikipedia:
                    > it's "not paper" so anything and everything should go in.

                    I see...

                    >[...] I think a more powerful search for the WWWJDIC page is
                    >in order. (And yes, I know there's one up and running on Arakawa,
                    >but that one is a bit intimidating, what with all its checkboxes and
                    >whatnot.)

                    I missed that somehow - please tell me where i can try that server...

                    >I think there should be an HTML select list with "starts with...",
                    >"ends with...", "contains..." and "is..." options, with either
                    >"starts with..." or "is ..." selected by default. This would
                    >*drastically* reduce the number of false positives.
                    >
                    >You'd then have one checkbox left for "common words", and could
                    >perhaps add another one for "ignore archaic Japanese", etc.

                    Having such options would be a great leap forward. One other option i would like to see is a wildcard character (maybe even a 2-byte wildcard character?) so that one can search for kanji combinations, such as 寒??渓 to find 寒霞渓, for example.
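A minimal sketch of such a wildcard search, assuming each `?` (ASCII or full-width) stands for exactly one character. That is an assumption: in a two-byte encoding a single kanji occupies two bytes, which may be what the doubled `??` in the example refers to. The function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a pattern with ? wildcards into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch in "?？":  # ASCII or full-width question mark
            parts.append(".")  # matches any single character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def wildcard_search(pattern, headwords):
    """Return the headwords that match the whole pattern."""
    rx = wildcard_to_regex(pattern)
    return [word for word in headwords if rx.fullmatch(word)]
```

Under this one-character reading, the pattern 寒?渓 matches 寒霞渓 but not longer or shorter headwords.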

                    To make the "translate words" search (using glossdic) less noisy _some_ of those options (e.g., "ignore archaic Japanese") could probably be used, as well, but other settings would be needed in addition (e.g., "don't show kana words").

                    Regards: Hendrik





                  • Paul Blay
                    Message 9 of 24 , Feb 7, 2008
                      > Right - if the meaning of the 外来語 is not what the "back
                      > translation" into English suggests (e.g., リニューアル and リフォーム
                      > do not mean "renewal" and "reform"), then i think an entry is called
                      > for, but...
                      >
                      > ... there is a large number of 外来語 that do not add any useful
                      > information (e.g., アート, meaning "art", meaning 芸術 or related
                      > concepts), and i don't see any point in cluttering up the database with
                      > such fashion words until it becomes apparent that their meaning differs
                      > in some fundamental way from the English meaning. :-)

                      There are about three important pieces of information that entries
                      like アーカイブ provide as Edict entries.
                      1. The information that it _is_ in use by Japanese (in particular
                      those with (P) tags).
                      2. How that word is represented in katakana (at all, and which version
                      is more popular).
                      and, last but not least,
                      3. Just because you and I can recognize what word アーカイブ is
                      doesn't mean that everybody using the dictionary would be able to.

                      Not all of those apply to Glossdic, number 3 does, though.
                      Personally I'm a little surprised that the "translate words in Japanese
                      text" function is still found useful by people past beginner mode.

                      What I use all the time is a 'right-click dictionary look-up'.
                      I've got links to WWWJDIC, 大辞林, SpaceALC and Yahoo and I find
                      it a lot less fiddly to use than wading through a glossdic output.

                      > Well, that is where an ever expanding database needs a more defined
                      > editorial approach, which brings me back to the reason why i posted in
                      > the first place.

                      I think 'editorial policy' is going to be more useful when it is actually
                      possible for there to be more actual editors than just Jim.

                      > And one of my concerns (mentioned quite a while ago but
                      > apparently still unmet) is that there seems to be no native speaker of
                      > Japanese on this list

                      There was (and presumably is) at least one native speaker of Japanese
                      on this list.

                      > and so i suspect also that there is no native
                      > speaker of Japanese involved in the collective effort of those who are
                      > subscribed to this list. :-)

                      Er, what does that mean that you haven't just said in the first half
                      of that sentence?
                    • Paul Blay
                      Message 10 of 24 , Feb 7, 2008
                        > Having such options would be a great leap forward. One other option i
                        > would like to see is a wildcard character (maybe even a 2-byte wildcard
                        > character?) so that one can search for kanji combinations, such as
                        > 寒??渓 to find 寒霞渓, for example.

                        I was going to say that there's an easy way to do something similar to
                        that - but there's something screwy going on with the search function.

                        Suppose you want to find 最終更新 but you don't remember the two middle
                        kanji (bit far-fetched, I know).

                        Search with "最 新" and "match from start" and you'll get every word
                        starting with 最 and also including 新 (which is a pretty short list).

                        The same thing should work for "寒 渓", but it didn't. Don't know why.
                        (P.S. Yes, I was looking in Enamdict).
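The behaviour described for a query like "最 新" with "match from start" can be approximated as below. This is only a guess at the semantics (first term anchors the start, remaining terms must appear somewhere after it), not WWWJDIC's actual algorithm; `start_and_contains` is a hypothetical name.

```python
def start_and_contains(query, headwords):
    """Headwords that start with the first term and contain the other terms."""
    terms = query.split()
    if not terms:
        return []
    first, rest = terms[0], terms[1:]
    return [
        word for word in headwords
        if word.startswith(first)
        and all(term in word[len(first):] for term in rest)
    ]
```

Under this reading, "最 新" would return both 最終更新 and 最新, so the result list depends on how many headwords share the first character.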
                      • hiz--dic@islandnet.com
                        Message 11 of 24 , Feb 7, 2008
                          Hi again, and thanks for the additional info...

                          Regarding the message Paul Blay wrote at 11:26 8/02/07 +0000:
                          >There are about three important pieces of information that entries
                          >like アーカイブ provide as Edict entries.
                          >1. The information that it _is_ in use by Japanese (in particular
                          >those with (P) tags).

                          Ah... i based my comments on the assumption that edict is for non-Japanese who need to figure out something that is written in Japanese (and encountering アーカイブ in a text would indicate to the reader that アーカイブ is being used). But if it is meant to be a general dictionary and to be used in both directions (nothing wrong with that), then having such an entry would make sense. By the way, the (P) tag is something whose purpose i confess has eluded me until now - when i translate an expression it is perfectly irrelevant whether that expression is considered "popular" or not - but again, i can see how that information can be useful in other situations, for example, if one goes E to J.

                          >2. How that word is represented in katakana (at all, and which version
                          >is more popular).

                          Same comments as under 1.

                          >and, last but not least,
                          >3. Just because you and I can recognize what word アーカイブ is
                          >doesn't mean that everybody using the dictionary would be able to.

                          That's true, and again, i think we are working from different assumptions: when i started using "translate words" some years ago it was a kanji lookup tool - but it seems to have slowly developed toward being a general dictionary (hence my comment about the increasing noise over the years). ;-)

                          >Not all of those apply to Glossdic, number 3 does, though.
                          >Personally I'm a little surprised that the "translate words in Japanese
                          >text" function is still found useful by people past beginner mode.

                          What are you surprised about? That someone who is not a beginner doesn't know the readings of all kanji or kanji combinations? 8-)

                          >What I use all the time is a 'right-click dictionary look-up'.
                          >I've got links to WWWJDIC, 大辞林, SpaceALC and Yahoo and I find
                          >it a lot less fiddly to use than wading through a glossdic output.

                          That assumes you have a mouse with a right button and software that works as you describe... maybe that is the way of the future for normal reading activities, but when translating it is helpful to have all items available at a glance in their written form.

                          >I think 'editorial policy' is going to be more useful when it is actually
                          >possible for there to be more actual editors than just Jim.

                          >> And one of my concerns (mentioned quite a while ago but
                          >> apparently still unmet) is that there seems to be no native speaker of
                          >> Japanese on this list
                          >
                          >There was (and presumably is) at least one native speaker of Japanese
                          >on this list.

                          We must have missed each other then...

                          >> and so i suspect also that there is no native
                          >> speaker of Japanese involved in the collective effort of those who are
                          >> subscribed to this list. :-)
                          >
                          >Er, what does that mean that you haven't just said in the first half
                          >of that sentence?

                          What it means is that i don't assume _all_ of those who use the "New Entry/Amendment" button regularly or who contribute in some other way to the development of the WWWJDIC are subscribed to this list. :-)

                          Thanks & regards: Hendrik






                        • Paul Blay
                          Message 12 of 24 , Feb 7, 2008
                            > Ah... i based my comments on the assumption that edict is for
                            > non-Japanese who need to figure out something that is written in
                            > Japanese (and encountering アーカイブ in a text would indicate to
                            > the reader that アーカイブ is being used). But if it is meant to be
                            > a general dictionary and to be used in both directions (nothing wrong
                            > with that), then having such an entry would make sense.

                            WWWJDIC is used mostly by non-Japanese (I do get the occasional comment
                            on example sentences that looks like it's from someone Japanese).
                            WWWJDIC is also more useful for J->E than vice versa, but it's not without
                            its uses the other way round. I often use it when I think I may know
                            what word to use but I'm not sure.

                            > What are you surprised about? That someone who is not a beginner
                            > doesn't know the readings of all kanji or kanji combinations? 8-)

                            That someone who is not a beginner finds it most useful to
                            get the lot 'translated' rather than pick out words he's unsure of.
                            Well, whatever suits your working style.

                            > >What I use all the time is a 'right-click dictionary look-up'.
                            > >I've got links to WWWJDIC, 大辞林, SpaceALC and Yahoo and I find
                            > >it a lot less fiddly to use than wading through a glossdic output.
                            >
                            > That assumes you have a mouse with a right button and software that
                            > works as you describe...

                            I'm sure there's an add-in for the browser of your choice (mine's
                            Firefox and DictionarySearch).

                            > >There was (and presumably is) at least one native speaker of Japanese
                            > >on this list.
                            >
                            > We must have missed each other then...

                            Maybe you just didn't realise he was one.
                          • Paul Blay
                            Message 13 of 24 , Feb 7, 2008
                              > Maybe you just didn't realise he was one.

                              Oh, I forgot to say - do you know why Yahoo Groups hates you?
                              Seriously, every email you send in gets marked as 'spam' and
                              I (or Jim) has to 'approve' you.
                            • Jim Breen
                              Message 14 of 24 , Feb 7, 2008
                                Greetings,

                                [I spend an evening away from the Internet and find a few long discussions
                                have happened. I think Paul and others have handled the general question of
                                the inclusion of 外来語. I'm quite happy for as many as possible to be
                                included. I'm not bursting to find lots; in fact I have a rather undigested
                                list of many thousands which could go in if I got the time and inspiration
                                to check them.

                                I have no idea why Yahoo keeps thinking Hendrik's posts are spam. Be patient
                                and I or Paul will push them through, but it's a nuisance they are
                                being delayed.

                                I'll just try and comment on points not picked up by others.]


                                On 07/02/2008, hiz--dic@... <hiz--dic@...> wrote:

                                > In the text at hand, the expression ベネッセ occurs a few times. Though a foreign word (name), it is a word that many in Japan would have heard or read in some or other context. And what does the glossdic make of it? I don't mind that it does not "know" ベネッセ - after all, such "knowledge" is the direct result of some person having "told" the dictionary about this word, and that seems to not have happened yet (i'll add it after i have sent this post). But the glossdic provides the following (mis-)information:
                                >
                                > べネッサレッドグレイヴ (u) Vanessa Redgrave; NA [Partial Match]
                                >
                                > No, that is not a "partial match" - that is simply "noise" (i guess if we were dealing with people we might call it a "potshot").

                                In the files that WWWJDIC has available, the longest match it could get with
                                ベネッセ was indeed the first three kana of べネッサレッドグレイヴ. Not a great
                                match, but I work on the principle that reporting a partial match is better
                                than no match at all. (ベネッセ has gone in, BTW, and should be live tomorrow.)

                                > And then i come across the following glossary buster item (never mind that it is not exactly well-written Japanese):
                                >
                                > この他、古い家屋を改修し、アーティストが家の空間そのものを作品化して公開している「家プロジェクト」やベネッセアートサイト直島に関する書籍を揃えた「本村ラウンジ&アーカイブ」、これらを拠点に、瀬戸内海のもつ自然と歴史のリズム、それに共振するセレクトされたアートがともに響きあう創造の場として展開を続けています。
                                >
                                > Let's look at what the glossdic presents for the katakana strings occurring in that sentence:
                                >
                                > * ベネッサレッドグレイヴ (u) Vanessa Redgrave; NA [Partial Match!]
                                > * セア (p) Sayre; NA
                                > * トサ Tosa (f); NA
                                > * イト Ito (f); NA

                                You can see what's happening here. Having chopped off at ベネッ, it has split
                                the rest as セア + トサ + イト. Yes, ugly. I think it's better than the only
                                other alternative, which is to skip any remaining katakana after a partial
                                match.

                                > * ラウンジ (n) lounge; (P); EP

                                And it's finally back in synch.
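
The behaviour described above can be sketched roughly as follows. This is only an illustration, not WWWJDIC's actual code: the function names, the `min_partial` threshold and the small word list are assumptions made for the example. It tries the longest exact headword first, falls back to the entry sharing the longest prefix with the remaining text (the "partial match"), and skips a single character when nothing fits.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def gloss_katakana(text, entries, min_partial=2):
    """Greedy longest-match segmentation with a partial-match fallback.

    entries is a list of headwords (hypothetical stand-in for the real files).
    Returns a list of (headword, "exact" | "partial") pairs.
    """
    results = []
    pos = 0
    while pos < len(text):
        rest = text[pos:]
        # 1. Longest headword that is an exact prefix of the remaining text.
        exact = max((e for e in entries if rest.startswith(e)), key=len, default=None)
        if exact:
            results.append((exact, "exact"))
            pos += len(exact)
            continue
        # 2. Otherwise, the headword sharing the longest prefix with the text.
        best = max(entries, key=lambda e: common_prefix_len(rest, e), default=None)
        shared = common_prefix_len(rest, best) if best else 0
        if shared >= min_partial:
            results.append((best, "partial"))
            pos += shared  # consume only the part that actually matched
        else:
            pos += 1  # nothing usable here: skip one character and retry
    return results


entries = ["ベネッサレッドグレイヴ", "セア", "トサ", "イト", "ラウンジ"]
print(gloss_katakana("ベネッセアートサイト", entries))
```

With this toy word list the string comes out as a partial match on ベネッサレッドグレイヴ followed by セア, トサ and イト, with the ー skipped, which is the same split reported in the thread.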


                                > And here is another one:
                                >
                                > 豊かな自然と温暖な気候に恵まれた岡山県は、果物の栽培に適しています。加えて高度な栽培技術を有し、種類の豊富さと美味から「くだもの王国」と呼ばれています。ぶどう、白桃、メロン、いちご、ミカン、スイカ、オリーブほか、年間を通じて様々な果物が収穫されており、新鮮な旬の果物を食べることができます。また、産地の果樹園でフルーツ狩りを体験することのもおすすめです。
                                >
                                > Again, isolating the katakana items:
                                >
                                > * メロン (n) melon; (P)
                                > * ミカ Mika (f); NA
                                > * スイカ (n) Suica (rechargeable prepaid IC card that can be used as a train pass in the greater Tokyo, Osaka and Sendai regions and also as electric money in some stores)
                                > * オリーブ (n) (fr:) olive
                                > * フルーツ (n) fruit; (P)

                                As Paul pointed out, it needs a specific ミカン to match on. I have now added it.

                                スイカ needed attention. I (now) have two スイカ entries - one for 西瓜/すいか/スイカ
                                and one for the travel card. Since WWWJDIC can't choose between them when
                                glossing text, I have a special file which now has a composite entry used for
                                this function.
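
One way to picture such a composite entry is a lookup table that merges the glosses of every entry sharing a headword. The sketch below is an assumption about how this could be done, not a description of the special file itself; the function name, the numbering format and the shortened glosses are made up for illustration.

```python
from collections import defaultdict


def build_gloss_table(entries):
    """Map each headword to one gloss string, merging duplicates.

    entries is an iterable of (headword, gloss) pairs (hypothetical format).
    """
    grouped = defaultdict(list)
    for headword, gloss in entries:
        grouped[headword].append(gloss)

    table = {}
    for headword, glosses in grouped.items():
        if len(glosses) == 1:
            table[headword] = glosses[0]
        else:
            # Composite entry: number the competing glosses so the reader
            # can choose the one that fits the context.
            table[headword] = "; ".join(
                f"({i}) {g}" for i, g in enumerate(glosses, 1)
            )
    return table


table = build_gloss_table([
    ("スイカ", "watermelon"),
    ("スイカ", "Suica (rechargeable prepaid IC card)"),
    ("メロン", "melon"),
])
print(table["スイカ"])
```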

                                > When i ask a question about the philosophy i am, on the ground level, asking about both editorial policy and the algorithms at work, and i would like to contribute in some way toward refining both. To that effect i would like to learn more about them, and whether this list is the right place for that i don't know, but at least it seems the right place to start the quest.

                                We are trying to grow an editorial policy. As Paul commented, it will really be
                                important when I am no longer sole editor. The beginnings can be seen at:
                                http://www.edrdg.org/wiki/index.php/Main_Page but it's clearly early days.

                                Cheers

                                Jim

                                --
                                Jim Breen
                                Honorary Senior Research Fellow
                                Clayton School of Information Technology,
                                Monash University, VIC 3800, Australia
                                http://www.csse.monash.edu.au/~jwb/
                              • Jim Breen
                                Message 15 of 24 , Feb 7, 2008
                                  On 07/02/2008, hiz--dic@... <hiz--dic@...> wrote:

                                  > Right - if the meaning of the 外来語 is not what the "back translation" into English suggests (e.g., リニューアル and リフォーム do not mean "renewal" and "reform"), then i think an entry is called for, but...
                                  >
                                  > ... there is a large number of 外来語 that do not add any useful information (e.g., アート, meaning "art", meaning 芸術 or related concepts), and i don't see any point in cluttering up the database with such fashion words until it becomes apparent that their meaning differs in some fundamental way from the English meaning. :-)

                                  Apart from E->J usage, one good reason for having 外来語 is to help the
                                  "Translate Words" function to segment text properly. In fact the database
                                  is not really "cluttered" in that unless you are looking for a word, you are
                                  unlikely to come across it. WWWJDIC's search functions may not be as flexible
                                  as would be optimal, but they are very fast and scale well (disk accesses rise
                                  linearly with the log of the file size).
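
The logarithmic scaling mentioned above is what a binary search over a sorted index gives. The following is a generic sketch under that assumption, not WWWJDIC's own lookup code; the `lookup` name and the probe counter are illustrative, and each probe stands in for one disk access.

```python
def lookup(sorted_keys, target):
    """Binary search that also reports how many keys were examined."""
    lo, hi = 0, len(sorted_keys)
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1  # one probe ~ one access to the index
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == target
    return found, probes


# Growing the index tenfold adds only a handful of probes.
for size in (1_000, 10_000, 100_000):
    keys = [f"{i:06d}" for i in range(size)]
    print(size, lookup(keys, keys[size // 3]))
```

The number of probes never exceeds the bit length of the index size, so adding entries that nobody searches for costs very little at lookup time.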

                                  > ... which brings me back to the reason why i posted in the first place. And one of my concerns (mentioned quite a while ago but apparently still unmet) is that there seems to be no native speaker of Japanese on this list and so i suspect also that there is no native speaker of Japanese involved in the collective effort of those who are subscribed to this list. :-)

                                  I wish there were more. We have/had Kanji Haitani, who was a Honyaku old-timer,
                                  involved, but he hasn't been active lately. There have been other native
                                  speakers involved over the years. I guess we are seeing the result of the
                                  files being mainly used by gaijin.

                                  > I would like to see this ongoing project and this mailing list advertised more among practicing translators and wouldn't mind doing that myself (but Jim shares at least one of the related mailing lists, so i do not want to pre-empt him).

                                  I was interested to see in the survey I did before last year's IJET just how
                                  many people use it regularly. I too wish practising translators contributed
                                  more. A few did, but I guess they are pretty busy.

                                  > And as regards the enamdict noise (all those sounds like トサ, イト, and so on), if kana names are to be kept/included in there, then i would like to suggest to only call on enamdic's _kanji_ entries during a glossdic search - the reasoning being that, if it is not obvious from the context that a given kana string in a text refers to a person's name, then having a database telling me that this string might be a person's name is not going to help me, while, OTOH, getting such information in all those other cases where it is plainly inappropriate has a negative benefit. :-)

                                  I'm not sure that will give the best outcome for all possible users. ENAMDICT
                                  has ~75,000 entries beginning with katakana. Most of these are the katakanaized
                                  names of people or places. I suspect the majority of people using the function
                                  value their presence.

                                  An easier thing for me to do is to add a checkbox option to tell it to ignore
                                  katakana when glossing text.
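
Such an option could amount to little more than blanking out katakana runs before the text reaches the glossing step. The sketch below assumes exactly that; the function name is hypothetical and the character class is simply the Unicode Katakana block (U+30A0 to U+30FF, which includes ー and ・).

```python
import re

# One or more consecutive characters from the Unicode Katakana block.
KATAKANA_RUN = re.compile(r"[\u30A0-\u30FF]+")


def strip_katakana(text):
    """Replace each katakana run with a space so neighbours are not joined."""
    return KATAKANA_RUN.sub(" ", text)


print(strip_katakana("新鮮なメロンとスイカを食べる"))
```

Replacing the run with a space rather than deleting it keeps the words on either side from being glued together and mis-segmented.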

                                  Cheers

                                  Jim

                                  --
                                  Jim Breen
                                  Honorary Senior Research Fellow
                                  Clayton School of Information Technology,
                                  Monash University, VIC 3800, Australia
                                  http://www.csse.monash.edu.au/~jwb/
                                • Jim Breen
                                  Message 16 of 24 , Feb 7, 2008
                                    On 07/02/2008, René Malenfant <rene_malenfant@...> wrote:

                                    > One thing I've been thinking about is the automated creation of a
                                    > third version of edict.

                                    [...]

                                    > Right now there are two versions. There's the (P)-list version, and
                                    > the full version. The former basically acts as a 小辞典, and
                                    > the latter as a 大辞典. This leaves a rather large, and
                                    > (hopefully) easily filled 中辞典 gap between the two.
                                    >
                                    > I agree with Hendrik that it is precisely the rare and specialized
                                    > entries that make edict valuable, and it's rather been my policy to
                                    > treat edict like Wikipedia: it's "not paper" so anything and
                                    > everything should go in.
                                    >
                                    > That said, having so much rare "stuff" in the dictionary can result
                                    > in search noise, and especially when searching E-J, long explanatory
                                    > entries like "Buddhist statue of a figure sitting contemplatively in
                                    > the half-lotus position (often of Maitreya)" result in false positives.
                                    >
                                    > The way I see it, a good edict 中辞典 would be the same as
                                    > the 大辞典, but stripped of the (iK), (oK) and (io) headwords;
                                    > the (ik) and (ok) readings; and the (obsc), (obs) and (arch) senses.
                                    > (Of course, any word that is in the P-list would get an automatic
                                    > pass, even if it has one of these tags.)

                                    Removing the arch/obs/obsc senses may not have much impact. About 1900
                                    senses have one or more of those markings, out of 127,000 senses currently
                                    identified in about 120,000 entries.

                                    Taking out the iK/oK/etc. would declutter it a bit, but again the grand
                                    total of such markers (698) is not that high.
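
For scale, 1900 out of 127,000 senses is about 1.5%. The filter René proposes could be sketched as below; the entry layout (a dict holding lists of `(text, tags)` pairs plus a `common` flag for the P-list) is purely an assumption for illustration and is not the EDICT file format.

```python
SKIP_KANJI_TAGS = {"iK", "oK", "io"}
SKIP_READING_TAGS = {"ik", "ok"}
SKIP_SENSE_TAGS = {"obsc", "obs", "arch"}


def _keep(items, banned):
    """Keep the (text, tags) pairs that carry none of the banned tags."""
    return [(text, tags) for text, tags in items if not set(tags) & banned]


def mid_size(entry):
    """Return a trimmed copy of entry, or None if nothing useful remains."""
    if entry["common"]:
        return entry  # P-list words get an automatic pass
    trimmed = {
        "common": False,
        "kanji": _keep(entry["kanji"], SKIP_KANJI_TAGS),
        "readings": _keep(entry["readings"], SKIP_READING_TAGS),
        "senses": _keep(entry["senses"], SKIP_SENSE_TAGS),
    }
    if not trimmed["readings"] or not trimmed["senses"]:
        return None  # every reading or every sense was filtered out
    return trimmed


sample = {
    "common": False,
    "kanji": [("我輩", [])],
    "readings": [("わがはい", [])],
    "senses": [("I, me, myself", ["arch"]), ("we, us, ourselves", [])],
}
print(mid_size(sample)["senses"])
```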

                                    > In addition, I think a more powerful search for the WWWJDIC page is
                                    > in order. (And yes, I know there's one up and running on Arakawa,
                                    > but that one is a bit intimidating, what with all its checkboxes and
                                    > whatnot.)
                                    >
                                    > All of the Japanese dictionaries I've used on my computer offer
                                    > "starts with...", "ends with...", "contains..." and "is..." search
                                    > options.
                                    >
                                    > WWWJDIC, at present, seems to default to "contains...", which results
                                    > in loads of funky hits coming up before the word I actually want. I
                                    > know these can be drastically reduced by checking the "starting
                                    > kanji" and "exact word-match" boxes. But the "starting kanji" box
                                    > only applies to the Japanese headword, and the "exact match" box only
                                    > applies to the English gloss.

                                    Strictly speaking WWWJDIC only has "starting with ..." for kanji, kana or
                                    alphabetic strings and "contains ..." for kanji strings. In addition there
                                    is a "must also contain" option which provides a crude sort of Boolean AND.

                                    Changing this would mean a complete rethink and rebuild of the way WWWJDIC
                                    looks things up. It's not very flexible now, but that helps it be
                                    fast and light-weight.
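
For comparison, the four match modes René lists, together with a "must also contain" filter, are easy to express as a plain linear scan. The sketch below is only that: it does not reflect how WWWJDIC indexes its files, the function names are invented, and it trades away the speed discussed above for flexibility.

```python
def matches(field, query, mode):
    """Test one headword or gloss against the query in the given mode."""
    if mode == "starts":
        return field.startswith(query)
    if mode == "ends":
        return field.endswith(query)
    if mode == "contains":
        return query in field
    if mode == "is":
        return field == query
    raise ValueError(f"unknown mode: {mode}")


def search(fields, query, mode="starts", must_also_contain=None):
    """Linear scan; must_also_contain acts as a crude Boolean AND."""
    hits = [f for f in fields if matches(f, query, mode)]
    if must_also_contain is not None:
        hits = [f for f in hits if must_also_contain in f]
    return hits


glosses = ["art", "art gallery", "heart", "state of the art", "artist"]
print(search(glosses, "art", mode="is"))
print(search(glosses, "art", mode="starts", must_also_contain="gallery"))
```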

                                    > I think there should be an HTML select list with "starts with...",
                                    > "ends with...", "contains..." and "is..." options, with either
                                    > "starts with..." or "is ..." selected by default. This would
                                    > *drastically* reduce the number of false positives.
                                    >
                                    > You'd then have one checkbox left for "common words", and could
                                    > perhaps add another one for "ignore archaic Japanese", etc.

                                    That would all be terrific. It would really mean a new system and a fresh
                                    start. And in many ways it needs it. WWWJDIC began as an HTML-generation
                                    wrapper around my Unix/X11 "xjdic" code, which I first released in 1992.
                                    Some of the C came from the 1990 DOS "jdic". It has grown like Topsy and
                                    frankly is a mess.

                                    Who'll make the fresh start? Not me - I'm too old and tired, and anyway
                                    my software skills are now very dated. I think new software needs new
                                    faces.

                                    Cheers

                                    Jim

                                    --
                                    Jim Breen
                                    Honorary Senior Research Fellow
                                    Clayton School of Information Technology,
                                    Monash University, VIC 3800, Australia
                                    http://www.csse.monash.edu.au/~jwb/
                                  • hiz--dic@islandnet.com
                                    Message 17 of 24 , Feb 9, 2008
                                      Hello from Okinawa... thanks, Paul and Jim, for your comments and further info. I'll send a combined reply to several of your posts and then recede a bit into the background for a few days to attend to my work that i am neglecting while posting here... :-)

                                      At 12:15 8/02/07 +0000, Paul Blay wrote:
                                      >> What are you surprised about? That someone who is not a beginner
                                      >> doesn't know the readings of all kanji or kanji combinations? 8-)
                                      >
                                      >That someone who is not a beginner finds it most useful to
                                      >get the lot 'translated' rather than pick out words he's unsure of.

                                      I see... for me it is in many cases easier to eliminate what i don't need from the bulk output than to pick out the needed items one by one. When i have a long text (several dozen to several hundred pages) many of the terms i need to look up appear more than once, and having the glossary output lined up with the text is really helpful, since, by adjusting the related glossary entries with global search-and-replace operations as i work my way through the text, it is easy to maintain consistency over the length of the document and the time it takes to complete the translation.

                                      I think you can see now why i am concerned about "noise": i strip away those parts of the output that i don't need (with global search and delete operations), and if more and more of the glossary output is related to kana in the input text (both katakana and hiragana, see my related comment to Jim right at the end), then this is, of course, bothersome in my particular case (but other users' mileage may be different, and Jim has hinted at a practical solution already).
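
A global delete of this kind can also be scripted. The sketch below assumes the glossary output has been saved as plain text with one item per line, and that unwanted items can be recognised by the "[Partial Match" marker seen in the examples in this thread; the function name is made up for illustration.

```python
def drop_partial_matches(gloss_output):
    """Remove every output line flagged as a partial match."""
    kept = [
        line
        for line in gloss_output.splitlines()
        if "[Partial Match" not in line
    ]
    return "\n".join(kept)


sample = "\n".join([
    "* フェルディナンド (u) Ferdinand; NA",
    "* フォンリントー (p) Fenglingdu; NA [Partial Match!]",
    "* ホーフェイ (p) Hefei (China); NA [Partial Match!]",
])
print(drop_partial_matches(sample))
```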

                                      After using the glossary the way i just described, i check the remaining output items against the text, and in those cases where they do not yield the meaning i need (with 'wrong parsing of the input' and 'having drawn the data from a part database that does not apply to the text' perhaps being the two most frequent causes), i do a secondary lookup regarding the involved kanji and add the newly derived output to the work file via copy-and-paste.

                                      Setting up this work file means that i will have gone through the whole source text twice already before i start the translation, and that means i have gotten a good understanding of its content, style, tone, and so on - this is all valuable information for me and creates a sense of "mental comfort" when i then work on the actual translation.

                                      At 12:16 8/02/07 +0000, Paul Blay wrote:
                                      >Oh, I forgot to say - do you know why Yahoo Groups hates you?
                                      >Seriously, every email you send in gets marked as 'spam' and
                                      >I (or Jim) has to 'approve' you.

                                      Hm... i don't know, although i've heard such comments from two other lists (one also with US Yahoogroups and one with Googlegroups), but there are no problems of this sort in relation to lists hosted with Yahoo Japan or with other services. Sorry about that... is there no whitelist to put my address in?

                                      At 15:46 8/02/08 +1100, Jim Breen wrote:
                                      >In the files that WWWJDIC has available, the longest match it could get with
                                      >ベネッセ was indeed the first three kana of べネッサレッドグレイヴ. Not a great
                                      >match, but I work on the principle that reporting a partial match is
                                      >better than no match at all.

                                      Yes, i can see how this can work well with expressions involving kanji: if you have a string with, say, two kanji and an okurigana, and the kanji match but not the kana, then the likelihood that the meaning of the input term and the output term are _related_ is obviously very high, but when we are dealing with a string of only kana or Latin characters i think we need to reconsider this approach - the fact that two given English expressions start with the same 3 letters or two Japanese expressions with the same 3 kana is not a good correlation to the concept "closeness of meaning". I don't think there is any English dictionary that would, for example, in case it did not contain the word "temple", offer as partial match for the input "temple" the term "temporary".

                                      Look at the following example, which gets positively exotic (i have removed the kanji related output from the data shown here):

                                      [...] シルクロードの命名者でもあるドイツの地理学者フェルディナンド・フォン・リヒトホーフェンが執筆した旅行記により世界中に紹介され[...]

                                      * フェルディナンド (u) Ferdinand; NA
                                      * フォンリントー (p) Fenglingdu; NA [Partial Match!]
                                      * ヒト 【ひと】 human being (n); Homo sapiens (Latin); human (n); human (adj,n); LS
                                      * ホーフェイ (p) Hefei (China); NA [Partial Match!]

                                      Like, what on earth do _Chinese_ expressions do in the database in _katakana_ format? And why is ヒト (a katakana version of a Japanese word) in there? If anything and everything goes in there just because _someone_ _somewhere_ wrote it in kana we will get a rapidly increasing amount of this unexpected side effect that i call "noise"... :-)

                                      [skip]

                                      >> Again, isolating the katakana items:
                                      >>
                                      >> * メロン (n) melon; (P)
                                      >> * ミカ Mika (f); NA
                                      >> * スイカ (n) Suica (rechargeable prepaid IC card that can be used as a train pass in the greater Tokyo, Osaka and Sendai regions and also as electric money in some stores)
                                      >> * オリーブ (n) (fr:) olive
                                      >> * フルーツ (n) fruit; (P)
                                      >
                                      >As Paul pointed out, it needs a specific ミカン to match on. I have now added it.

                                      This is another good illustration of what i explained above, that if we have headwords based on sounds rather than meaning (and that is what we get when we accept kana entries), the partial match function is not that useful - if i put in a new term, say "エレファンタイン", then i really would need to check that ALL other existing expressions starting with "エ", "エレ" and "エレフ" and so on, are already in the database - otherwise the database might even spit out "エレファンタイン [partial match]" next time an unsuspecting user inputs text that contains "エラーイ!エラーイ!"... :-)

                                      >スイカ needed attention. I (now) have two スイカ entries - one for
                                      > 西瓜/すいか/スイカ and one for the travel card.

                                      A very nice illustration of the point i just made...

                                      > Since WWWJDIC can't choose between them when glossing text

                                      This is something i wanted to ask you about anyway: what is the order in which part databases are queried? And what are the criteria by which a partial solution from one part database is skipped in favor of a complete solution from another one? To illustrate what i am getting at: i get many cases where a two-kanji combination from enamdic is offered when using glossdic where two individual kanji readings taken from edict would actually have been far more informative in terms of figuring out what the text is about. And is there no way to offer more than one result, from more than one part database, for a given item that is being queried? I guess the latter could make the code unwieldy...

                                      >We are trying to grow an editorial policy. As Paul commented, it will really be
                                      >important when I am no longer sole editor. The beginnings can be seen at:
                                      >http://www.edrdg.org/wiki/index.php/Main_Page but it's clearly early days.

                                      OK, i'll take a look... thanks...

                                      At 16:05 8/02/08 +1100, Jim Breen wrote:
                                      >> I would like to see this ongoing project and this mailing list advertised
                                      >> more among practicing translators and wouldn't mind doing that myself (but
                                      >> Jim shares at least one of the related mailing lists, so i do not want to
                                      >> pre-empt him).
                                      >
                                      >I was interested to see in the survey I did before last year's IJET
                                      >just how many people use it regularly. I too wish practising translators
                                      >contributed more. A few did, but I guess they are pretty busy.

                                      I'll give this matter some thought and get back to you about it... i want to do something but am not sure yet on the best course of action...

                                      >> [...] i would like to suggest to only call on enamdic's _kanji_ entries
                                      >> during a glossdic search [...]
                                      >
                                      >I'm not sure that will give the best outcome for all possible users.
                                      >ENAMDICT has ~75,000 entries beginning with katakana. Most of these are
                                      >the katakanaized names of people or places.

                                      How many of those are katakanaized versions of Japanese or Chinese names of people and places? Given that i got two in one stroke, フォンリントー and ホーフェイ, i am really worried now...

                                      >An easier thing for me to do is to add a checkbox option to tell it to ignore
                                      >katakana when glossing text.

                                      That is the kind of solution i like: adding refinements or features without limiting other people's choice. :-) In fact, if i may be so bold, i'd like to ask for a second button that suppresses output on _hiragana_ only expressions - some people may want the glosses for できません, しています, であります, and so on, but i have the sneaky suspicion i am not the only one who does _not_ want them. :-)

                                      Anyway, sorry for being around only sporadically - both my life in general and my work move in a rather irregular pattern, but i will continue to try and contribute in some way...

                                      Thanks, everybody for your patience...

                                      Regards: Hendrik





                                      --


                                      * 南風言語業(大工ヘンドリク) *
                                      http://www.paikaji-translation.com/

                                      --
                                    • Paul Blay
                                      Message 18 of 24 , Feb 9, 2008
                                        > That is the kind of solution i like: adding refinements or features
                                        > without limiting other people's choice. :-) In fact, if i may be so
                                        > bold, i'd like to ask for a second button that suppresses output on
                                        > _hiragana_ only expressions - some people may want the glosses for
                                        > できません, しています, でありあす, and so on, but i have the sneaky
                                        > suspicion i am not the only one who does _not_ want them. :-)

                                        Funny, I wasn't aware that the "translate words in Japanese text"
                                        function did any hiragana only stuff. I'll just throw something
                                        in it and check ...

                                        我輩は猫であります。

                                        * 我輩 【わがはい】 (n) (1) (arch) I, me, myself (masc) (nuance of
                                        arrogance); (2) we, us, ourselves; ED
                                        * 猫 【ねこ】 (n) (1) cat; (2) (uk) (col) submissive partner of a
                                        homosexual relationship; (P); EP
                                        * あります (v) is, be, am (polite copula); KD

                                        Well waddya know. It not only does hiragana only, it gets it wrong. ;-)

                                        > >Oh, I forgot to say - do you know why Yahoo Groups hates you?
                                        > >Seriously, every email you send in gets marked as 'spam' and
                                        > >I (or Jim) has to 'approve' you.
                                        >
                                        > Hm... i don't know, although i've heard such comments from two
                                        > other lists (one also with US Yahoogroups and one with Googlegroups),
                                        > but there are no problems of this sort in relation to lists hosted
                                        > with Yahoo Japan or with other services. Sorry about that... is there
                                        > no whitelist to put my address in?

                                        Well you're already on the 'whitelist' in that this is a members only
                                        group for posting. (By the way, for some reason this time your
                                        email wasn't marked as spam).

                                        As to why it happens, your ISP is going to be the first suspect.
                                        If it's a 'spam-haven' then non-spamming users in the same IP
                                        block are going to get tarred with the same brush.

                                        Next up is the possibility that your computer has viruses and
                                        is happily spamming places for you when you're not looking.
                                        (My mum had some trouble from that once :-/ )

                                        Lastly (but probably not in this case) certain patterns of activity
                                        can be mis-interpreted by Yahoo, Google, etc. as spam-like activity.
                                        For example if you do a series of quick Google searches on very
                                        closely related terms it can flag you as a possible 'robot' and
                                        require you to verify you are human. That happens to me every
                                        now and again when I Google to check frequencies of use of
                                        different kanji / okurigana versions of the same word/phrase.
                                      • hiz--dic@islandnet.com
                                        Message 19 of 24 , Feb 9, 2008
                                          Hi Paul,

                                          two short replies to your reply...

                                          1)

                                          At 09:39 8/02/09 +0000, Paul Blay wrote:
                                          >Funny, I wasn't aware that the "translate words in Japanese text"
                                          >function did any hiragana only stuff. I'll just throw something
                                          >in it and check ...
                                          >
                                          >我輩は猫であります。
                                          >
                                          > * 我輩 【わがはい】 (n) (1) (arch) I, me, myself (masc) (nuance of
                                          >arrogance); (2) we, us, ourselves; ED
                                          > * 猫 【ねこ】 (n) (1) cat; (2) (uk) (col) submissive partner of a
                                          >homosexual relationship; (P); EP
                                          > * あります (v) is, be, am (polite copula); KD
                                          >
                                          >Well waddya know. It not only does hiragana only, it gets it wrong. ;-)

                                          Well... you could try あります with edict, enjoy the resulting 蟻巻, and follow the invitation to propose a new entry for あります。;-) :-)

                                          2)

                                          >(By the way, for some reason this time your email wasn't marked as spam).

                                          I am sure #1 and #2 of your 3 suggestions don't apply and that #3 is highly unlikely, and i also think, from seeing the results on other lists, the solution to Yahoo's constipation might be to abstain from using the Japanese equivalent of "A wrote X on Y-day at Z-o'clock" when quoting messages. I just hadn't thought about changing the Japanese back to English with this account, since i hadn't been aware of the problem... i recently made the same change to other accounts for another reason (occasional mojibake with lists served by Yahoo or Google) with positive results...

                                          If this message also goes through without offending the spam cop, we prolly have a winner...

                                          Thanks & regards: Hendrik





                                          --
                                        • Jim Breen
                                          Message 20 of 24 , Feb 10, 2008
                                            On 09/02/2008, Paul Blay <blay.paul@...> wrote:

                                            > Funny, I wasn't aware that the "translate words in Japanese text"
                                            > function did any hiragana only stuff. I'll just throw something
                                            > in it and check ...

                                            It is mentioned in the docs. The handling of hiragana strings is very cautious.
                                            It only looks at them if they are 3 or more characters long, it uses a
                                            specials file, not the main glossaries, and it only accepts exact matches, not
                                            partial ones. On the whole it seems to function OK.
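
To make those three rules concrete, here is a minimal Python sketch. It is an illustration only, not the WWWJDIC code: the `SPECIALS` table, the name `gloss_hiragana`, and the choice to treat each maximal hiragana run as the candidate are all assumptions. (The real function evidently segments differently, since it matched あります inside であります.)

```python
import re

# Hypothetical stand-in for the "specials" kana file; the real file is not
# shown in this thread. The second gloss is a placeholder, not dictionary data.
SPECIALS = {
    "あります": "(v) is, be, am (polite copula)",
    "であります": "(placeholder gloss)",
}

HIRAGANA_RUN = re.compile(r"[\u3041-\u3096]+")  # ぁ .. ゖ
MIN_LEN = 3  # runs shorter than this are never looked up


def gloss_hiragana(text, specials=SPECIALS):
    """Return (run, gloss) pairs for hiragana runs that pass all three checks."""
    results = []
    for match in HIRAGANA_RUN.finditer(text):
        run = match.group()
        if len(run) < MIN_LEN:
            continue  # rule 1: too short to be considered
        gloss = specials.get(run)  # rules 2 and 3: specials table, exact match only
        if gloss is not None:
            results.append((run, gloss))
    return results
```

With this table, `gloss_hiragana("我輩は猫であります。")` skips は (too short) and returns only the であります entry.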

                                            > 我輩は猫であります。

                                            A fine example of modern Japanese (am I the only person who was annoyed when
                                            夏目漱石's picture was removed from the Y1000 note?)

                                            > * 我輩 【わがはい】 (n) (1) (arch) I, me, myself (masc) (nuance of
                                            > arrogance); (2) we, us, ourselves; ED
                                            > * 猫 【ねこ】 (n) (1) cat; (2) (uk) (col) submissive partner of a
                                            > homosexual relationship; (P); EP
                                            > * あります (v) is, be, am (polite copula); KD
                                            >
                                            > Well waddya know. It not only does hiragana only, it gets it wrong. ;-)

                                            Looking forward to your proposed amendment. I have added であります
                                            to the special kana file.

                                            Jim

                                            --
                                            Jim Breen
                                            Honorary Senior Research Fellow
                                            Clayton School of Information Technology,
                                            Monash University, VIC 3800, Australia
                                            http://www.csse.monash.edu.au/~jwb/
                                          • Jim Breen
                                            Message 21 of 24 , Feb 10, 2008
                                              On 09/02/2008, hiz--dic@... <hiz--dic@...> wrote:

                                              > I see... for me it is in many cases easier to eliminate what i don't need from the bulk output than picking up needed items one by one.

                                              That's pretty much my experience. The times when I have had to translate largish
                                              slabs of text, I tip it into WWWJDIC with the "no repeated translations" option
                                              selected, then use an editor to trim out what I don't need/know already. That
                                              includes most of the hiragana/katakana matches.

                                              I can add options to ignore words starting with katakana and words starting with
                                              hiragana. Would it be better to have these as two options, or have a single
                                              "only match words starting with kanji" option?

                                              > Look at the following example, which gets positively exotic (i have removed the kanji related output from the data shown here):
                                              >
                                              > [...] シルクロードの命名者でもあるドイツの地理学者フェルディナンド・フォン・リヒトホーフェンが執筆した旅行記により世界中に紹介され[...]
                                              >
                                              > * フェルディナンド (u) Ferdinand; NA
                                              > * フォンリントー (p) Fenglingdu; NA [Partial Match!]

                                              Yes, it has matched フォンリ from フォン・リヒトホーフェン

                                              > * ヒト 【ひと】 human being (n); Homo sapiens (Latin); human (n); human (adj,n); LS
                                              > * ホーフェイ (p) Hefei (China); NA [Partial Match!]
                                              >
                                              > Like, what on earth do _Chinese_ expressions do in the database in _katakana_ format? And why is ヒト (a katakana version of a Japanese word) in there?

                                              Hefei (place in China) seems to be written ホーフェイ in Japanese far more
                                              often than in its hanzi version (合肥).

                                              That "ヒト 【ひと】 human being " entry is from the Life Sciences file.

                                              > If anything and everything goes in there just because _someone_ _somewhere_ wrote it in kana we will get a rapidly increasing amount of this unexpected side effect that i call "noise"... :-)

                                              Certainly the parsing of long katakana strings ends up a mess after a mismatch,
                                              as it can take a while to resynchronize. I might look to see if such partial
                                              matches can be improved. One option would be to skip over the rest of a long
                                              katakana string after there has been a partial match at the beginning.
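
The skip-ahead idea can be sketched roughly as follows. This is a hypothetical illustration, not the actual glossing code: the name `gloss_katakana_run`, the three-character `min_prefix` threshold for a partial match, and the tiny headword set are assumptions, and the middle dots are assumed to have been stripped before matching.

```python
def gloss_katakana_run(run, headwords, min_prefix=3):
    """Gloss one katakana run; stop at the first partial match.

    Exact matches (longest first) are consumed normally. As soon as only a
    partial match is available, it is reported and the rest of the run is
    skipped, so no further spurious matches can appear.
    """
    results = []
    i = 0
    while i < len(run):
        # Longest exact headword starting at position i, if any.
        exact = next(
            (run[i:j] for j in range(len(run), i, -1) if run[i:j] in headwords),
            None,
        )
        if exact:
            results.append((exact, "exact"))
            i += len(exact)
            continue
        # Assumed definition of "partial": a headword sharing the next
        # min_prefix characters of the run.
        prefix = run[i:i + min_prefix]
        partial = next(
            (h for h in sorted(headwords)
             if len(prefix) == min_prefix and h.startswith(prefix)),
            None,
        )
        if partial:
            results.append((partial, "partial"))
            break  # skip the rest of this katakana string
        i += 1
    return results
```

For the run フォンリヒトホーフェン with the headwords フォンリントー, ヒト and ホーフェイ, this reports the one partial match on フォンリントー and then stops, so ヒト and ホーフェイ are no longer offered.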

                                              > This is something i wanted to ask you about anyway: what is the order in which part databases are queried?

                                              Actually a single file is used, containing EDICT, ENAMDICT and a heap of
                                              glossaries. The files that go into this file (glossdic) have a ranking tag,
                                              and where the same headword is found in more than one file, the lower ranked
                                              one(s) are dropped.
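
As a rough sketch of that merge step (an illustration under assumptions, not the real build script): `build_glossdic`, the numeric ranks, and the rule that a smaller number means a higher rank are all hypothetical, and real headwords may be keyed on kanji plus reading rather than kanji alone.

```python
def build_glossdic(sources):
    """Merge ranked source files into one headword -> gloss mapping.

    sources: iterable of (rank, entries) pairs, where entries maps a
    headword to its gloss line. Smaller rank number = higher priority
    (an assumption for this sketch).
    """
    merged = {}
    for _rank, entries in sorted(sources, key=lambda source: source[0]):
        for headword, gloss in entries.items():
            # Keep the first (highest-ranked) entry; drop later duplicates.
            merged.setdefault(headword, gloss)
    return merged


# Toy data; the relative ranking of the two files is assumed.
edict = {"空間": "【くうかん】 (n) space; room; airspace"}
enamdict = {"空間": "【そらま】 Sorama (s)", "ホーフェイ": "(p) Hefei (China)"}

glossdic = build_glossdic([(2, enamdict), (1, edict)])
```

Here `glossdic["空間"]` is the EDICT entry, while ホーフェイ survives because no higher-ranked file has that headword.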

                                              > And what are the criteria that a partial solution from one part database is skipped in favor of a complete solution from another one.

                                              Exact matches should always be preferred over partial ones.

                                              For example, if you give it 清水安静 it should select 清水 + 安静, not get
                                              a partial match on the name 清水安三.
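
A small sketch of that preference, for illustration only. The function name `segment`, the greedy longest-first strategy, and the definition of a partial match (a headword sharing the next two characters) are assumptions, not a description of the actual WWWJDIC matcher.

```python
def segment(text, headwords):
    """Greedy left-to-right segmentation that prefers exact matches.

    A partial match is only considered at a position where no exact
    headword match of any length exists.
    """
    out = []
    i = 0
    while i < len(text):
        # Longest exact headword starting at position i, if any.
        exact = next(
            (text[i:j] for j in range(len(text), i, -1) if text[i:j] in headwords),
            None,
        )
        if exact:
            out.append((exact, "exact"))
            i += len(exact)
            continue
        # Fallback (assumed rule): a headword that begins with the next two characters.
        prefix = text[i:i + 2]
        partial = next(
            (h for h in sorted(headwords) if len(prefix) == 2 and h.startswith(prefix)),
            None,
        )
        if partial:
            out.append((partial, "partial"))
            i += 2
        else:
            i += 1
    return out
```

With the headwords 清水, 安静 and 清水安三, the input 清水安静 comes out as two exact matches. The name 清水安三 is only offered as a partial match when 清水 itself is missing from the headword set.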

                                              > To illustrate what i am getting at: i get many cases where a two-kanji combination from enamdic is offered when using glossdic where two individual kanji readings taken from edic would actually have been far more informative in terms of figuring out what the text is about.

                                              I can't really help that, because the text may very well contain that name.

                                              Other tools such as RikaiChan do offer a breakdown kanji-by-kanji. I think it
                                              will distract most users.

                                              > And is there no way to offer more than one result, from more than one part database, for a given item that is being queried? I guess the latter could make the code unwieldy...

                                              More to the point, it would make the output huge.

                                              > >I'm not sure that will give the best outcome for all possible users.
                                              > >ENAMDICT has ~75,000 entries beginning with katakana. Most of these are
                                              > >the katakanaized names of people or places.
                                              >
                                              > How many of those are katakanaized versions of Japanese or Chinese names of people and places?

                                              Only a small proportion (almost none are katakanaized Japanese names.)

                                              Cheers

                                              Jim
                                              --
                                              Jim Breen
                                              Honorary Senior Research Fellow
                                              Clayton School of Information Technology,
                                              Monash University, VIC 3800, Australia
                                              http://www.csse.monash.edu.au/~jwb/
                                            • Jim Breen
                                              Message 22 of 24 , Feb 12, 2008
                                                Greetings

                                                On 11/02/2008, Jim Breen <jimbreen@...> wrote:
                                                > I can add options to ignore words starting with katakana and words starting with
                                                > hiragana. Would it be better to have these as two options, or have a single
                                                > "only match words starting with kanji" option?
                                                [...]

                                                > One option would be to skip over the rest of a long katakana
                                                > string after there has been a partial match at the beginning.

                                                OK. It's done. There are now checkbox options to exclude katakana and
                                                hiragana words/phrases in the Translate Words function. Also, whenever
                                                a "partial match" occurs in a katakana string, the rest of that string
                                                is skipped.
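
For anyone curious how such checkbox filters might work, here is a rough Python sketch. It is not the WWWJDIC implementation: the names `script_of` and `keep_entry`, the flag names, and the Unicode ranges used to classify the first character are assumptions made for illustration.

```python
def script_of(ch):
    """Classify a single character by Unicode block (simplified ranges)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x3096:   # ぁ .. ゖ
        return "hiragana"
    if 0x30A1 <= code <= 0x30FA:   # ァ .. ヺ
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (basic block only)
        return "kanji"
    return "other"


def keep_entry(headword, skip_katakana=False, skip_hiragana=False):
    """Decide whether a matched headword should appear in the output.

    The decision looks only at the first character, mirroring the
    "words starting with katakana / hiragana" wording in this thread.
    """
    if not headword:
        return False
    first = script_of(headword[0])
    if skip_katakana and first == "katakana":
        return False
    if skip_hiragana and first == "hiragana":
        return False
    return True
```

Ticking both boxes corresponds to `keep_entry(word, skip_katakana=True, skip_hiragana=True)`, which behaves much like a single "only match words starting with kanji" option, while leaving the two choices independent.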

                                                Cheers

                                                Jim
                                                --
                                                Jim Breen
                                                Honorary Senior Research Fellow
                                                Clayton School of Information Technology,
                                                Monash University, VIC 3800, Australia
                                                http://www.csse.monash.edu.au/~jwb/
                                              • hiz--dic@islandnet.com
                                                Message 23 of 24 , Feb 12, 2008
                                                  Thanks, Jim, for some reassuring explanations. :-)

                                                  At 13:08 8/02/11 +1100, Jim Breen wrote:
                                                  >Hefei (place in China) seems to be written ホーフェイ in Japanese far more
                                                  >often than in its hanzi version (合肥).

                                                  And i just had to hit on it. :-)

                                                  >That "ヒト 【ひと】 human being " entry is from the Life Sciences file.

                                                  And i just had to hit on it. :-)
                                                  OK, as an aside, why would it be in there? Are there many katakana versions of words in there that normal people write in kanji?

                                                  >> This is something i wanted to ask you about anyway: what is the order in which part databases are queried?
                                                  >
                                                  >Actually a single file is used, containing EDICT, ENAMDICT and a heap of
                                                  >glossaries.
                                                  >The files that go into this file (glossdic) have a ranking tag, and where the
                                                  >same headword is found in more than one file, the lower ranked one(s)
                                                  >are dropped.

                                                  I see - and depending on the text, one may need one of the lower ranking items, but that gets to the issue of multiple outputs (more on that further down)...

                                                  >> And what are the criteria that a partial solution from one part database
                                                  >> is skipped in favor of a complete solution from another one.
                                                  >
                                                  >Exact matches should always be preferred over partial ones.
                                                  >
                                                  >For example, if you give it 清水安静 it should select 清水 + 安静, not get
                                                  >a partial match on the name 清水安三.

                                                  I see... so if the (for me) obvious kanji combination does not come up, it usually means i need to send it in... which i have been doing lately...

                                                  >> And is there no way to offer more than one result, from more than one
                                                  >> part database, for a given item that is being queried? I guess the
                                                  >> latter could make the code unwieldy...
                                                  >
                                                  >More to the point, it would make the output huge.

                                                  If that is the only problem, could an option for multiple output be offered?

                                                  >> >I'm not sure that will give the best outcome for all possible users.
                                                  >> >ENAMDICT has ~75,000 entries beginning with katakana. Most of these are
                                                  >> >the katakanaized names of people or places.
                                                  >>
                                                  >> How many of those are katakanaized versions of Japanese or Chinese names
                                                  >> of people and places?
                                                  >
                                                  >Only a small proportion (almost none are katakanaized Japanese names.)

                                                  I am relieved... that day where i got two of them plus the ヒト must have been a statistical aberration...

                                                  At 08:38 8/02/13 +1100, Jim Breen wrote:
                                                  >OK. It's done. There are now checkbox options to exclude katakana and
                                                  >hiragana words/phrases in the Translate Words function. Also, whenever
                                                  >a "partial match" occurs in a katakana string, the rest of that string
                                                  >is skipped.

                                                  I missed the related discussion but am happy with the outcome. :-)
                                                  Any idea when those checkboxes will show on the TUFS interface?
                                                  (That is the server i usually use, since it is the closest by...)
                                                  Does it require much extra work, meaning i should use the Monash server in case i want to use those checkboxes?

                                                  Thanks & regards: Hendrik




                                                  --
                                                • Jim Breen
                                                  Message 24 of 24 , Feb 12, 2008
                                                    On 13/02/2008, hiz--dic@... <hiz--dic@...> wrote:
                                                    > >That "ヒト 【ひと】 human being " entry is from the Life Sciences file.
                                                    >
                                                    > And i just had to hit on it. :-)
                                                    > OK, as an aside, why would it be in there? Are there many katakana versions of words in there that normal people write in kanji?

                                                    That does seem an odd one. Most of the ~9,000 katakana words in Lifscidic are
                                                    外来語 or names of animals/plants, e.g. ヤツメウナギ. There is a batch
                                                    of terms using ヒト: ヒト型, ヒト絨毛性ゴナドトロピン, ヒト染色体,
                                                    ヒトゲノム解析計画, etc. etc., so I suspect use of ヒト is a bit special
                                                    in biomed circles.

                                                    > >For example, if you give it 清水安静 it should select 清水 + 安静, not get
                                                    > >a partial match on the name 清水安三.
                                                    >
                                                    > I see... so if the (for me) obvious kanji combination does not come up, it usually means i need to send it in... which i have been doing lately...

                                                    YES!

                                                    > >> And is there no way to offer more than one result, from more than one
                                                    > >> part database, for a given item that is being queried? I guess the
                                                    > >> latter could make the code unwieldy...
                                                    > >
                                                    > >More to the point, it would make the output huge.
                                                    >
                                                    > If that is the only problem, could an option for multiple output be offered?

                                                    It's not as easy as it sounds, as there is a lot of code working on showing the
                                                    best single entry. I just hacked a change into a testbed version that does
                                                    what you want. The results can be ugly, e.g. for "家の空間" it gave:

                                                    # 家 【いえ; うち; け; か】 (n) (いえ) house; (うち) house (one's own); (け,か)
                                                    (suff) house; family; person; expert; -ist; SP
                                                    # 家 【いえ】 (n) (1) house; residence; dwelling; (2) family; household;
                                                    (3) lineage; family name; (P); EP
                                                    # 家 【いえ】 Ie (s) 【いえたか】 Ietaka (s) 【いえつぐ】 Ietsugu (u) 【いえとく】 Ietoku (s)
                                                    【うち】 Uchi (p) 【おしお】 Oshio (u) 【おすお】 Osuo (u) 【かりゅう】 Karyuu (s) NA
                                                    # 家 【うち】 (n,adj-no) (1) house; (2) home (one's own); (one's) family;
                                                    (one's) household; (P); EP
                                                    # 家 【か】 (suf) -ist (used after a noun indicating someone's occupation,
                                                    pursuits, disposition, etc.); -er; ED
                                                    # 家 【け】 (suf) house (e.g. of Tokugawa); family; (P); EP
                                                    # 家 【ち】 (n,adj-no) (1) house; (2) home (one's own); (one's) family;
                                                    (one's) household; ED
                                                    # 家 【や】 (suf) (1) (something) shop; (suf) (2) somebody who sells
                                                    (something) or works as (something); (suf) (3) somebody with a
                                                    (certain) personality trait; (n) (4) house; (5) roof; ED
                                                    # 家 【んち】 (exp) (uk) (abbr) 's house; 's home; ED
                                                    # 空間 【あきま】 (n) vacancy; room for rent or lease; ED
                                                    # 空間 【くうかん】 (n) space; room; airspace; (P); EP
                                                    # 空間 【そらま】 Sorama (s) NA

                                                    Anyway, feel free to experiment with it. Use the URL:
                                                    http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic2.cgi?9T
                                                    It's got some bugs - some stray glosses are leaking in.

                                                    And if you want to get ALL matches from all files, select "the_lot" instead of
                                                    the default "glossdic". It's super-ugly.

                                                    > >Only a small proportion (almost none are katakanaized Japanese names.)
                                                    >
                                                    > I am relieved... that day where i got two of them plus the ヒト must have been a statistical aberration...

                                                    We call it Murphy's Law...

                                                    > At 08:38 8/02/13 +1100, Jim Breen wrote:
                                                    > >OK. It's done. There are now checkbox options to exclude katakana and
                                                    > >hiragana words/phrases in the Translate Words function. Also, whenever
                                                    > >a "partial match" occurs in a katakana string, the rest of that string
                                                    > >is skipped.
                                                    >
                                                    > I missed the related discussion but am happy with the outcome. :-)
                                                    > Any idea when those checkboxes will show on the TUFS interface?

                                                    Within 24 hours. The mirrors are all identical, but they only update
                                                    once a day; usually around their local midnight.

                                                    > Does it require much extra work, meaning i should use the Monash server in case i want to use those checkboxes?

                                                    Today only. Tomorrow it will be available at TUFS.

                                                    Cheers

                                                    Jim

                                                    --
                                                    Jim Breen
                                                    Honorary Senior Research Fellow
                                                    Clayton School of Information Technology,
                                                    Monash University, VIC 3800, Australia
                                                    http://www.csse.monash.edu.au/~jwb/