
Re: [edict-jmdict] Re: mysql limitations

  • Jim Breen
    Message 1 of 21, Oct 5, 2006
      [wmaton ([edict-jmdict] Re: mysql limitations) writes:]
      >> > (FWIW, according to
      >> >
      >> http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
      >> > Postgresql supports 4-byte utf-8.)
      >>
      >> I think I raised this issue previously, and someone responded that there
      >> were something like only 200 entries affected - but still, let's do
      >> this right at the start....

      Just pointing out that the "200 entries" are in the JIS X 0213 part of
      kanjidic2.xml, i.e. the "new" kanji that joined Unicode from JIS X 0213.
      There is no 4-byte utf-8 problem with either the JMdict/EDICT content or
      the JMNedict/ENAMDICT content, although it is possible that some odd
      JIS X 0213 kanji might get into the latter file at some stage.
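
      (For illustration, with a made-up table name: a kanji outside
      Unicode's Basic Multilingual Plane, e.g. the JIS X 0213 character
      𠮟 (U+20B9F), takes 4 bytes in utf-8. Postgresql stores it fine,
      while Mysql's "utf8" character set holds at most 3 bytes per
      character, so depending on version and settings it truncates or
      rejects the value.)

      CREATE TABLE t (txt VARCHAR(10)) CHARACTER SET utf8;  -- Mysql
      INSERT INTO t VALUES ('𠮟');  -- 4 bytes; mangled/rejected by "utf8"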

      I would not suggest letting the "4-byte utf-8 problem" drive the choice of
      RDBMS.

      Jim

      --
      Jim Breen http://www.csse.monash.edu.au/~jwb/
      Clayton School of Information Technology, Tel: +61 3 9905 9554
      Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
      (Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学
    • SrinTuar
      Message 2 of 21, Oct 5, 2006

        My only slant towards PostgreSQL is historical (very historical) and
        also that I've found that UTF-8 handling seems to be better at the moment.


        MySQL is not serious competition for Postgresql at any rate, imo.
        I would highly recommend postgresql for anyone starting a new project.


      • Stuart McGraw
        Message 3 of 21, Oct 5, 2006
          wmaton wrote:
          > --- In edict-jmdict@yahoogroups.com, "Stuart McGraw" <smcg4191@...> wrote:
          > [snip good sample test cases]
          > > Note that this is not a problem in Postgresql, MS Access, MS SQL
          > > Server which are the three other databases I have tried importing
          > > jmdict into.
          > >
          > > An application can compensate for this limitation in various ways
          > > such as putting the WHERE condition on the inner select. But that
          > > precludes such standard techniques as defining the query as a
          > > view, and applying a WHERE condition when the view is used.
          > >
          > > So, is using another database a possibility? The most obvious
          > > candidate would probably be Postgresql.
          >
          > I have staged both PostgreSQL and MySQL on my server, as well as
          > Arakawa. What I'd like to do is repeat your tests on these servers as
          > well. To make it a fair test and to make the comparisons meaningful,
          > you should supply the steps needed to duplicate the behaviour. And no,
          > we won't publicly embarrass you until it is time. ;-)

          This is the internet... any time is the right time! :-)

          Below are eight queries (plus one). They vary on three dimensions:

          Database: Mysql / Postgresql
          Example: Simplified example / Real query
          Extent: Return results for all entries / Return results for a single entry

          All assume a schema as given in a previous posting of mine
          and populated with my load_jmdict.py script (or equiv).
          All were run on a 700 MHz relic running MS Windows 2000.
          Mysql is set up with utf8 default character set and all tables
          are InnoDB. Postgresql is set up with "unicode" [sic]
          default encoding. All other settings are the installation
          defaults.
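
          In other words, the setup amounts to roughly the following
          (the database name here is made up):

          CREATE DATABASE jmdict DEFAULT CHARACTER SET utf8;  -- Mysql
          CREATE TABLE example (id INT PRIMARY KEY) ENGINE=InnoDB;
          CREATE DATABASE jmdict WITH ENCODING 'UNICODE';  -- Postgresql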

          Below each query I have put the times taken for execution
          as reported by the Mysql or Postgresql query browsers.
          Interestingly, despite Mysql's reputation for speed, Postgresql
          is even or wins on these. I have made no attempt to optimize
          the queries' performance (other than #5) though. (I had a
          hard enough time just getting them to return the right results!)

          Before running the Postgresql examples you will need to
          execute the following to create the aggregate function that
          I use to provide Postgresql with something like Mysql's
          group_concat function.

          CREATE AGGREGATE accum (
              SFUNC = ARRAY_APPEND,
              BASETYPE = ANYELEMENT,
              STYPE = ANYARRAY,
              INITCOND = '{}');
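
          To see what it does, here is a minimal usage sketch (the sense
          id 1 is arbitrary): the inner select fixes the row order, then
          ACCUM collects the values into an array and ARRAY_TO_STRING
          joins them, giving roughly Mysql's group_concat.

          SELECT ARRAY_TO_STRING(ACCUM(sg.txt), '; ')
          FROM (SELECT txt FROM gloss g
                WHERE g.sens=1 ORDER BY g.ord) AS sg;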

          The queries (below) are...

          1) Mysql: simplified "sense summary" example, all rows
          2) Mysql: simplified "sense summary" example, specific row
          3) Postgresql: simplified "sense summary" example, all rows
          4) Postgresql: simplified "sense summary" example, specific row

          The first two queries above are the Mysql queries that I posted
          in the first post of this thread. The second two are Postgresql
          equivalents.

          The returned rows in the above queries should look like:
          26;"to discuss / to find fault with; to criticize; to criticise; (schriftspr.) diskutieren; erörtern / kritisieren; bekritteln / aufgreifen; (ein Thema)"

          The next four queries are what I was actually trying to write
          when I found the problem with the Mysql subselects. Queries
          1-4 above are simplified versions (to illustrate the problem)
          of the actual queries I wanted (5-8) below.

          5a) Mysql: "entry summary" query, all rows. A natural extension of (1),
          it uses two nested subselects. Probably won't finish running until
          sometime next year.
          5b) Mysql: "entry summary" query, all rows. Uses a join to avoid nested subselects.
          6) Mysql: "entry summary" query, specific row
          7) Postgresql: "entry summary" query, all rows
          8) Postgresql: "entry summary" query, specific row

          The output of the above queries should look like:
          26;1000280;"あげつらう";"論う";"to discuss / to find fault with; to criticize; to criticise; (schriftspr.) diskutieren; erörtern / kritisieren; bekritteln / aufgreifen; (ein Thema)"

          Here are the actual queries...

          (1)
          SELECT
              sg.entr,
              group_concat(sg.txt ORDER BY sg.ord SEPARATOR ' / ') txt
          FROM
              (SELECT
                   s.entr, s.ord,
                   group_concat(g.txt ORDER BY g.ord SEPARATOR '; ') txt
               FROM sens s
               JOIN gloss g ON g.sens=s.id
               GROUP BY s.id) sg
          GROUP BY sg.entr

          39.446s (execution time)
          32.654s (data fetch time)
          106007 rows

          (2)
          SELECT
              sg.entr,
              group_concat(sg.txt ORDER BY sg.ord SEPARATOR ' / ') txt
          FROM
              (SELECT
                   s.entr, s.ord,
                   group_concat(g.txt ORDER BY g.ord SEPARATOR '; ') txt
               FROM sens s
               JOIN gloss g ON g.sens=s.id
               GROUP BY s.id) sg
          WHERE sg.entr=24
          GROUP BY sg.entr

          36.891s (execution time)
          00.037s (data fetch time)
          1 row

          (3)
          SELECT
              ss.entr,
              ARRAY_TO_STRING(ACCUM(ss.gtxt), ' / ')
          FROM
              (SELECT
                   entr,
                   (SELECT ARRAY_TO_STRING(ACCUM(sg.txt), '; ')
                    FROM
                        (SELECT txt FROM gloss g
                         WHERE g.sens=s.id ORDER BY g.ord) AS sg
                    ORDER BY entr, ord) AS gtxt
               FROM sens s
               ORDER BY s.ord) AS ss
          GROUP BY ss.entr
          ORDER BY ss.entr

          Total query runtime: 29147 ms.
          Data retrieval runtime: 10023 ms.
          106987 rows retrieved.

          (4)
          SELECT
              ss.entr,
              ARRAY_TO_STRING(ACCUM(ss.gtxt), ' / ')
          FROM
              (SELECT
                   entr,
                   (SELECT ARRAY_TO_STRING(ACCUM(sg.txt), '; ')
                    FROM
                        (SELECT txt FROM gloss g
                         WHERE g.sens=s.id ORDER BY g.ord) AS sg
                    ORDER BY entr, ord) AS gtxt
               FROM sens s
               ORDER BY s.ord) AS ss
          WHERE ss.entr=26
          GROUP BY ss.entr;

          Total query runtime: 21 ms.
          Data retrieval runtime: 10 ms.
          1 rows retrieved.

          (5a)
          SELECT entr.id, entr.seq,
              (SELECT GROUP_CONCAT(k.txt ORDER BY ord SEPARATOR '; ')
               FROM kana k WHERE k.entr=entr.id) AS kana,
              (SELECT GROUP_CONCAT(j.txt ORDER BY ord SEPARATOR '; ')
               FROM kanj j WHERE j.entr=entr.id) AS kanj,
              (SELECT group_concat(sg.txt ORDER BY sg.ord SEPARATOR ' / ')
               FROM
                   (SELECT s.entr, s.ord,
                           group_concat(g.txt ORDER BY g.ord SEPARATOR '; ') txt
                    FROM sens s JOIN gloss g ON g.sens=s.id
                    GROUP BY s.id) sg
               WHERE sg.entr=entr.id
               GROUP BY sg.entr)
          FROM entr;

          aborted after 10 minutes of execution time without results.

          (5b)
          SELECT e.id, e.seq,
              (SELECT GROUP_CONCAT(k.txt ORDER BY ord SEPARATOR '; ')
               FROM kana k WHERE k.entr=e.id) AS kana,
              (SELECT GROUP_CONCAT(j.txt ORDER BY ord SEPARATOR '; ')
               FROM kanj j WHERE j.entr=e.id) AS kanj,
              group_concat(sg.txt ORDER BY sg.ord SEPARATOR ' / ')
          FROM entr e
          JOIN
              (SELECT s.entr, s.ord,
                      group_concat(g.txt ORDER BY g.ord SEPARATOR '; ') txt
               FROM sens s JOIN gloss g ON g.sens=s.id
               GROUP BY s.id) sg ON sg.entr=e.id
          GROUP BY e.id;

          98.473s (execution time)
          78.015s (data fetch time)
          106007 rows

          (6)
          SELECT entr.id, entr.seq,
              (SELECT GROUP_CONCAT(k.txt ORDER BY ord SEPARATOR '; ')
               FROM kana k WHERE k.entr=entr.id) AS kana,
              (SELECT GROUP_CONCAT(j.txt ORDER BY ord SEPARATOR '; ')
               FROM kanj j WHERE j.entr=entr.id) AS kanj,
              (SELECT group_concat(sg.txt ORDER BY sg.ord SEPARATOR ' / ')
               FROM
                   (SELECT s.entr, s.ord,
                           group_concat(g.txt ORDER BY g.ord SEPARATOR '; ') txt
                    FROM sens s JOIN gloss g ON g.sens=s.id
                    GROUP BY s.id) sg
               WHERE sg.entr=entr.id
               GROUP BY sg.entr)
          FROM entr
          WHERE entr.id=26;

          39.597s (execution time)
          00.051s (data fetch time)
          1 row

          (7)
          SELECT e.id, e.seq,
              (SELECT ARRAY_TO_STRING(ACCUM(sk.txt), '; ')
               FROM (SELECT k.txt FROM kana k
                     WHERE k.entr=e.id ORDER BY k.ord) AS sk) AS kana,
              (SELECT ARRAY_TO_STRING(ACCUM(sj.txt), '; ')
               FROM (SELECT j.txt FROM kanj j
                     WHERE j.entr=e.id ORDER BY j.ord) AS sj) AS kanj,
              (SELECT ARRAY_TO_STRING(ACCUM(ss.gtxt), ' / ')
               FROM
                   (SELECT
                        (SELECT ARRAY_TO_STRING(ACCUM(sg.txt), '; ')
                         FROM (SELECT txt FROM gloss g
                               WHERE g.sens=s.id ORDER BY g.ord) AS sg
                         ORDER BY entr, ord) AS gtxt
                    FROM sens s WHERE s.entr=e.id ORDER BY s.ord) AS ss) AS gloss
          FROM entr e;

          Total query runtime: 36987 ms.
          Data retrieval runtime: 15860 ms.
          106987 rows retrieved.

          (8)
          SELECT e.id, e.seq,
              (SELECT ARRAY_TO_STRING(ACCUM(sk.txt), '; ')
               FROM (SELECT k.txt FROM kana k
                     WHERE k.entr=e.id ORDER BY k.ord) AS sk) AS kana,
              (SELECT ARRAY_TO_STRING(ACCUM(sj.txt), '; ')
               FROM (SELECT j.txt FROM kanj j
                     WHERE j.entr=e.id ORDER BY j.ord) AS sj) AS kanj,
              (SELECT ARRAY_TO_STRING(ACCUM(ss.gtxt), ' / ')
               FROM
                   (SELECT
                        (SELECT ARRAY_TO_STRING(ACCUM(sg.txt), '; ')
                         FROM (SELECT txt FROM gloss g
                               WHERE g.sens=s.id ORDER BY g.ord) AS sg
                         ORDER BY entr, ord) AS gtxt
                    FROM sens s WHERE s.entr=e.id ORDER BY s.ord) AS ss) AS gloss
          FROM entr e
          WHERE id=26;

          Total query runtime: 20 ms.
          Data retrieval runtime: 30 ms.
          1 rows retrieved.
        • Stuart McGraw
          Message 4 of 21, Oct 5, 2006
            Paul Stewart wrote:
            > On 2006.10.04 17:51, Stuart McGraw wrote:
            > > Note that this is not a problem in Postgresql, MS Access, MS SQL
            > > Server which are the three other database I have tried importing
            > > jmdict into.
            >
            > I was going to suggest doing the same with SQLite. but a bit of
            > pre-reading turned up the fact that SQLite doesn't have a group_concat
            > builtin (however inside Rails, you can easily create one using
            > db.create_aggregate(): http://www.bigbold.com/snippets/posts/show/283).
            > However, your overall aim, which was to make building blocks for
            > creating views in order to simplify common representations, would only
            > be visible within the app, not the database.

            None of the databases I mentioned have a group_concat either --
            in postgresql you need to create a user defined aggregate to
            do this, in MS Access you can use a VBA function, in MS SQL
            Server a stored procedure. But in all those cases, the pseudo-
            group_concat is usable within the database and is not restricted
            to an application.

            But just to be clear, the issue is not group_concat, but
            subqueries. These are a standard part of sql, used in
            answering many kinds of questions. I used the group_concat
            example simply because that is what I was doing when I bumped
            into the problem the first time, and because it is a query
            that clearly can't be rewritten as a join. Another case
            where I bumped into this, a query to list the headwords
            (all valid combinations of readings/kanji) for each entry,
            also required a correlated subquery (at least the way I
            came up with to write it).
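
            (Not the headwords query itself, but as a minimal sketch of
            the kind of correlated subquery I mean, against the same
            schema: list the readings of entries that have no kanji.
            The inner select runs once per outer row.)

            SELECT k.entr, k.txt
            FROM kana k
            WHERE NOT EXISTS
                (SELECT 1 FROM kanj j WHERE j.entr = k.entr);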

            > The tradeoff from my point of view is a decision about the extent to
            > which you are designing your system to be portable among a number of
            > different database client applications (in which case you design
            > complexity into the database like these views, at the cost of
            > performance or lack of functionality in any but a few chosen
            > databases), or if you are designing for the data to be portable among a
            > number of different databases (in which case you eschew database
            > features you expect to find variability in support, at the cost of
            > having to engineer the complexity in each application). Each approach
            > has its merits.

            Yup, I agree. This goes back to the data-centric versus app-centric
            views of design. If one has an app-centric design then it is very
            feasible to adopt a minimal-common-database-features approach.
            But one needs to be careful with this in a data-centric design.
            Portability is nice but for data-centric applications it is in my
            experience a chimera because

            o Demanding a high degree of database portability will force
            the design to become an app centric one.

            o Sql is a very incomplete specification and covers only a small
            part of real world requirements. Things like triggers or stored
            procedures are not part of sql for example. (I think they may
            be in sql-2003, but not sure).

            o Even standard sql is not terribly portable. Things like a
            CREATE TABLE statement need changing when moving between
            databases because datatypes are different (a small example
            follows this list).

            o There are several versions of the sql standards and
            few if any databases comply with any of them 100%.

            o One usually does not change databases very often, typically
            not at all, unless one made a bad decision at the start, the
            database vendor goes out of business, demands on the database
            outgrow its capabilities, etc. In any event, there will usually
            be substantial work involved.
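
            As a small made-up example of the CREATE TABLE point above,
            even an auto-numbered key by itself is spelled differently:

            CREATE TABLE example (id INT AUTO_INCREMENT PRIMARY KEY);  -- Mysql
            CREATE TABLE example (id SERIAL PRIMARY KEY);  -- Postgresql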

            I am not advocating using every shiny "gee-whiz" feature
            that the chosen database offers -- on the contrary I try to
            stick with standard sql where possible, and where not, to
            only use features that also can be implemented in other databases
            albeit with some work. But as I pointed out, subqueries are
            standard sql and short of doing the work in the application,
            it is difficult to work around their lack (whether that lack is
            because they are missing, or because they are too inefficient
            to use.)
          • wmaton
            Message 5 of 21, Oct 6, 2006
              [stuart wrote:]
              > > we won't publicly embarrass you until it is time. ;-)
              >
              > This is the internet... any time is the right time! :-)

              har-har! ;-)

              > Below are eight queries (plus one). They vary on three dimensions:
              >
              > Database: Mysql / Postgresql
              > Example: Simplified example / Real query
              > Extent: Return results for all entries / Return results for a
              single entry

              Thanks Stuart! I re-read your original message (ID 503) but it looks
              like Yahoo doesn't store the attachments for very long. Can you
              repost or point me to an alternate download site?

              Thanks!
            • wmaton
              Message 6 of 21, Oct 6, 2006
                correction, ID 508.

                --- In edict-jmdict@yahoogroups.com, "wmaton" <wmaton@...> wrote:
                > [...]
              • Stuart McGraw
                Message 7 of 21, Oct 6, 2006
                  wmaton wrote:
                  > [...]
                  >
                  > Thanks Stuart! I re-read your original message (ID 503) but it looks
                  > like Yahoo doesn't store the attachments for very long. Can you
                  > repost or point me to an alternate download site?

                  Jim said he put a copy at http://www.csse.monash.edu.au/~jwb/schema.zip

                  I also have schema and python loader script modified for
                  Postgresql if you or anyone wants it.
                • Jim Breen
                  Message 8 of 21, Oct 6, 2006
                    [Stuart McGraw (RE: [edict-jmdict] mysql limitations) writes:]
                    >>
                    >> But just to be clear, the issue is not group_concat, but
                    >> subqueries. These are a standard part of sql, used in
                    >> answering many kinds of questions.

                    My reaction here is: "well what is the database for?".

                    I guess my (initial) idea/concept is that it is primarily for supporting
                    an online edit capability, combined with the periodic generation of the
                    distributed forms (JMdict, EDICT, etc.)

                    A secondary role is to enable a small number of gurus to do smart things at
                    the database command/browser level, although my conservatism tells me such
                    things are probably better done on another system, for the sake of
                    integrity and performance. But yes, it would be great to be able to
                    do a bulk update according to some criteria.

                    >> I am not advocating using every shiny "gee-whiz" feature
                    >> that the chosen database offers -- on the contrary I try to
                    >> stick with standard sql where possible, and where not, to
                    >> only use features that also can be implemented in other databases
                    >> albeit with some work. But as I pointed out, subqueries are
                    >> standard sql and short of doing the work in the application,
                    >> it is difficult to work around their lack (whether that lack is
                    >> because they are missing, or because they are too inefficient
                    >> to use.)

                    I guess all other things being equal, having efficient subquery processing
                    is a Good Thing(TM), but if guru-level activity is going to be sparse,
                    performance with subqueries may not be a huge issue.

                    Jim

                    --
                    Jim Breen http://www.csse.monash.edu.au/~jwb/
                    Clayton School of Information Technology, Tel: +61 3 9905 9554
                    Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
                    (Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学
                  • wmaton
                    Message 9 of 21, Oct 6, 2006
                      --- In edict-jmdict@yahoogroups.com, "Stuart McGraw" <smcg4191@...>
                      > Jim said he put a copy at http://www.csse.monash.edu.au/~jwb/schema.zip

                      Got it, thanks for that!

                      > I also have schema and python loader script modified for
                      > Postgresql if you or anyone wants it.

                      Yes please. :-)

                      Then I can embarrass you - or more likely myself - publicly ;-)
                    • Stuart McGraw
                      Message 10 of 21, Oct 6, 2006
                        Jim Breen wrote:
                        > [Stuart McGraw (RE: [edict-jmdict] mysql limitations) writes:]
                        > >>
                        > >> But just to be clear, the issue is not group_concat, but
                        > >> subqueries. These are a standard part of sql, used in
                        > >> answering many kinds of questions.
                        >
                        > My reaction here is: "well what is the database for?".
                        >
                        > I guess my (initial) idea/concept is that it is primarily for supporting
                        > an online edit capability, combined with the periodic generation of the
                        > distributed forms (JMdict, EDICT, etc.)

                        That was the assumption I've been working under.

                        > A secondary role is to enable a small number of gurus to do smart things at
                        > the database command/browser level, although my conservatism tells me such
                        > things are probably better done on another system, for the sake of
                        > integrity and performance. But yes, it would be great to be able to
                        > do a bulk update according to some criteria.

                        I realize the sql I posted looks somewhat complex, especially
                        in the postgresql case, but that is because each group of
                        concatenated values requires two selects, one to put the rows
                        in the right order, and one to aggregate the strings into a
                        single column. In mysql the ordering and aggregating are both
                        done in the group_concat function.

                        The nice thing is that you can encapsulate such a query in
                        a view. Then you use the view just like it was another table.
                        So even for queries that require some level of guruness to
                        create, their use becomes simple.

                        As for "such things are probably better done on another
                        system, for the sake of integrity and performance", working
                        on live data always has risks and requires care (as I'm sure
                        you know better than me, being the maintainer of such systems!)
                        For a large change one would probably want to test in a clone
                        of the database before doing it on the live data. For almost
                        any maintenance activity I can imagine though, it would be
                        faster and less disruptive to execute a set of (tested) updates
                        on the live data than to update a copy of the data, shut down
                        the database, swap in the updated data, and restart the database.
                        (But I've never administered a serious mysql or postgresql system
                        so I could easily be wrong about that.)

                        > >> I am not advocating using every shiny "gee-whiz" feature
                        > >> that the chosen database offers -- on the contrary I try to
                        > >> stick with standard sql where possible, and where not, to
                        > >> only use features that also can be implemented in other databases
                        > >> albeit with some work. But as I pointed out, subqueries are
                        > >> standard sql and short of doing the work in the application,
                        > >> it is difficult to work around their lack (whether that lack is
                        > >> because they are missing, or because they are too inefficient
                        > >> to use.)
                        >
                        > I guess all other things being equal, having efficient subquery processing
                        > is a Good Thing(TM), but if guru-level activity is going to be sparse,
                        > performance with subqueries may not be a huge issue.

                        But subselects are not guru-level activity.
                        Please believe me, anything that comes from me is assuredly
                        not guru level!!

                        The reason I was interested in the query I posted is that it provides
                        a concise look at an entry. If you've played with the schema at all
                        you have undoubtedly discovered that looking at a table and seeing a
                        bunch of id numbers is not very enlightening. If you are debugging an
                        app, query, stored procedure, whatever, and it is doing something with
                        entry 40877 it is nice to know what that entry is (in terms of the kanji,
                        glosses, etc. associated with it). The query lets you see the entry,
                        and its kanji, reading, and gloss strings all in one row.

                        Typically, a view would be created based on the query, called, say,
                        "entry_summary". You would then simply do something like

                        select * from entry_summary where id=40877;

                        to take a peek at all the text strings for that entry. No need for
                        guruness.
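
                        (As a sketch, reusing the simplified Postgresql
                        query (3) from earlier in this thread as the view
                        body, with entr aliased to id so the select above
                        works as written; the real entry_summary would use
                        query (7)'s body instead.)

                        CREATE VIEW entry_summary AS
                            SELECT ss.entr AS id,
                                   ARRAY_TO_STRING(ACCUM(ss.gtxt), ' / ') AS txt
                            FROM
                                (SELECT entr,
                                        (SELECT ARRAY_TO_STRING(ACCUM(sg.txt), '; ')
                                         FROM (SELECT txt FROM gloss g
                                               WHERE g.sens=s.id ORDER BY g.ord) AS sg) AS gtxt
                                 FROM sens s
                                 ORDER BY entr, ord) AS ss
                            GROUP BY ss.entr;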

                        Applications that want to display similar information will find
                        it easier to use entry_summary, and receive the information
                        already condensed into one row per entry, than to do a join that
                        returns Nk*Nr*Ns*Ng rows per entry (where Nk, Nr, Ns, Ng are the
                        numbers of kanji, readings, senses, and glosses per sense, per
                        entry), or alternately four separate queries on each of those
                        tables and then condense all that information into one row.
                        Why do that coding in every application (even when it is as easy
                        as Paul Stewart showed for RoR), when it can be written once
                        in a single place (the database in the form of a view) and used
                        by every application?

                        Queries such as the one I posted are also combined with other
                        queries. For example, if you put together an ad-hoc query to
                        do a bulk update, you might want to verify that you were about
                        to update what you intended. Joining the select part of your
                        query with the entry_summary view would provide you with that
                        verification.

                        The point of all this is not that the specific query I presented
                        is particularly important, but rather that subselects are a common
                        non-guru tool, are used when putting together many kinds of queries,
                        and that if their availability is limited, it is likely to cause
                        pain at some point. If there are strong reasons to prefer Mysql
                        over Postgresql, then that pain may be a reasonable tradeoff.
                        But I haven't seen any such reasons presented here yet.
                      • Stuart McGraw
                        Message 11 of 21, Oct 6, 2006
                          wmaton wrote:
                          > --- In edict-jmdict@yahoogroups.com, "Stuart McGraw" <smcg4191@...>
                          > > I also have schema and python loader script modified for
                          > > Postgresql if you or anyone wants it.
                          >
                          > Yes please. :-)

                          I'll send it off tomorrow; I want to run it again to
                          make sure it still works before I post it.
                        • Jim Breen
                          Message 12 of 21, Oct 9, 2006
                            [Stuart McGraw (RE: [edict-jmdict] mysql limitations) writes:]
                            >> Jim Breen wrote:
                            >> > A secondary role is to enable a small nmber of gurus to do smart things at
                            >> > the database command/browser level, although my conservatism tells me such
                            >> > things are probably better done on another system, for the sake of
                            >> > integrity and performance. But yes, it would be great to be able to
                            >> > do a bulk update according to some criteria.
                            >>
                            >> I realize the sql I posted looks somewhat complex, especially
                            >> in the postgresql case, but that is because each group of
                            >> concatenated values requires two selects, one to put the rows
                            >> in the right order, and one to aggregate the strings into a
                            >> single column. In mysql the ordering and aggregating are both
                            >> done in the group_concat function.
                            >>
                            >> The nice thing is that you can encapsulate such a query in
                            >> a view. Then you use the view just like it was another table.
                            >> So even for queries that require some level of guruness to
                            >> create, their use becomes simple.
                            >>
                            >> As for "such things are probably better done on another
                            >> system, for the sake of integrity and performance", working
                            >> on live data always has risks and requires care (as I'm sure
                            >> you know better than me, being the maintainer of such systems!)
                            >> For a large change one would probably want to test in a clone
                            >> of the database before doing it on the live data. For almost
                            >> any maintenance activity I can imagine though, it would be
                            >> faster and less disruptive to execute a set of (tested) updates
                            >> on the live data than to update a copy of the data, shutdown
                            >> the database, swap in the updated data, and restart the database.
                            >> (But I've never administered a serious mysql or postgresql system
                            >> so I could easily be wrong about that.)

                            I was referring more to exploratory work rather than updating. Yes,
                            for bulk updates you'd want to do it to a live system, but probably
                            need to test it offline first, and do some before/after diffing to make
                            sure the updates happened correctly and there was no collateral damage.
                            That's what I do now.

                            >> > I guess all other things being equal, having efficient subquery processing
                            >> > is a Good Thing(TM), but if guru-level activity is going to be sparse,
                            >> > performance with subqueries may not be a huge issue.
                            >>
                            >> But subselects are not guru-level activity.
                            >> Please believe me, anything that comes from me is assuredly
                            >> not guru level!!
                            >>
                            >> The reason I was interested in the query I posted is that it provides
                            >> a concise look at an entry. If you've played with the schema at all
                            >> you have undoubtedly discovered that looking at a table and seeing a
                            >> bunch of id numbers is not very enlightening. If you are debugging an
                            >> app, query, stored procedure, whatever, and it is doing something with
                            >> entry 40877 it is nice to know what that entry is (in terms of the kanji,
                            >> glosses, etc associated with it). The query lets you see the entry,
                            >> and its kanji, reading, and gloss strings all in one row.

                            It's an interesting process. You have to do an elaborate deconstruction
                            of an entry to get it into all those tables in order to give you the
                            flexibility and scope needed, and then you need to do an equally
                            elaborate reconstruction to get it back to an entry you can see as a whole.
                            I guess that's the up-front pain of moving to a DBMS.

                            As I have been watching the DBMS discussion unfold, I have been pondering
                            the suggestion from someone a while ago (it may have been off-line) just
                            to drop a complete entry into a text window and let someone edit it.

                            For example, replacing my simple markup with XML-like labels, the
                            entry: 半々(P); 半半 【はんはん】 (n) half and half; fifty-fifty; (P)
                            is stored:

                            <entry> 1601210
                            半々
                            <k_pri> news1 nf18
                            半半
                            <reading>
                            はんはん
                            <re_pri> news1 nf18
                            <sense>
                            <pos> n
                            half and half
                            fifty-fifty

                            I guess for the average user about to fiddle with an entry, having things
                            in defined fields is a lot more understandable, and easier to check.

                            Jim

                            --
                            Jim Breen http://www.csse.monash.edu.au/~jwb/
                            Clayton School of Information Technology, Tel: +61 3 9905 9554
                            Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
                            (Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学
                          • Dennis Schrader
                            Message 13 of 21, Oct 10, 2006
                              I made this offering to Jim and he suggested I offer it via our group. I am somewhat reticent to do so in that I am not a "database guy".
                               
                              What I have, and paid for, when Oracle was interested in using my rather extensive LA facilities in the morning, evening and weekends for an Oracle training center, is a paid-for Oracle 8i, Windows-based and full kernel. If anyone out there would find it helpful, I'd be more than happy to send a copy to you. Oracle 8i, as many of you probably know, was fully operational and hence preceded the "buggy" versions that followed. Again, I am spread too thin on other projects to answer questions or provide assistance.
                               
                              A simple offering. Any takers?
                               
                              Dennis G. Schrader

                               
                              On 10/6/06, Stuart McGraw <smcg4191@...> wrote:

                              [...]


                            • Jim Breen
                              Message 14 of 21, Oct 10, 2006
                                [Dennis Schrader (Re: [edict-jmdict] mysql limitations) writes:]
                                >> What I have, and paid for, when Oracle was interested in using my rather
                                >> extensive LA facilities in the morning, evening and weekends for an
                                >> Oracle training center, is a paid-for Oracle 8i, Windows-based and full
                                >> kernel. If anyone out there would find it helpful, I'd be more than happy to
                                >> send a copy to you. Oracle 8i, as many of you probably know, was fully
                                >> operational and hence preceded the "buggy" versions that followed. Again, I
                                >> am spread too thin on other projects to answer questions or
                                >> provide assistance.

                                Thanks for the offer. A point I overlooked when it was first raised with
                                me was the fact that (of course) it is a Windows-based licence. All our
                                DBMS discussions have been in the context of the update server being
                                a Unix/Linux one. The service I have set up at arakawa.edrdg.org is such
                                a one. My experience and expertise with Windows servers is close to nil.

                                Jim

                                --
                                Jim Breen http://www.csse.monash.edu.au/~jwb/
                                Clayton School of Information Technology, Tel: +61 3 9905 9554
                                Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
                                (Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学
                              • Stuart McGraw
                                Message 15 of 21, Oct 10, 2006
                                  Stuart McGraw wrote:
                                  > wmaton wrote:
                                  > > --- In edict-jmdict@yahoogroups.com, "Stuart McGraw" <smcg4191@...>
                                  > > > I also have schema and python loader script modified for
                                  > > > Postgresql if you or anyone wants it.
                                  > >
                                  > > Yes please. :-)
                                  >
                                  > I'll send it off tomorrow; I want to run it again to
                                  > make sure it still works before I post it.

                                  As can be guessed from the delay, it didn't work and one
                                  bug fix led to another and...

                                  Anyway, here is a revised version of the schema.zip file
                                  I posted on 2006/09/21. The schema itself is unchanged,
                                  but the load_jmdict.py and showentr.py scripts now work
                                  with either Mysql or Postgresql and run on Linux or
                                  Windows. (Actually I wasn't able to test with mysql on
                                  Windows because I upgraded Python on my Windows
                                  box to 2.5 and the Python-Mysql connector software is
                                  not available for 2.5 yet, but it was working and I don't
                                  think I broke it.)

                                  There is a new schema.png file that presents the schema
                                  in more detail and with a clearer layout.

                                  README.txt contains more info on setting things up and
                                  loading jmdict.
                                • Stuart McGraw
                                  Message 16 of 21, Oct 10, 2006
                                    Jim Breen wrote:
                                    > [Stuart McGraw (RE: [edict-jmdict] mysql limitations) writes:]
                                    > >> Jim Breen wrote:
                                    [...]
                                    > >> > I guess all other things being equal, having efficient subquery processing
                                    > >> > is a Good Thing(TM), but if guru-level activity is going to be sparse,
                                    > >> > performance with subqueries may not be a huge issue.
                                    > >>
                                    > >> But subselects are not guru-level activity.
                                    > >> Please believe me, anything that comes from me is assuredly
                                    > >> not guru level!!
                                    > >>
                                    > >> The reason I was interested in the query I posted is that it provides
                                    > >> a concise look at an entry. If you've played with the schema at all
                                    > >> you have undoubtedly discovered that looking at a table and seeing a
                                    > >> bunch of id numbers is not very enlightening. If you are debugging an
                                    > >> app, query, stored procedure, whatever, and it is doing something with
                                    > >> entry 40877 it is nice to know what that entry is (in terms of the kanji,
                                    > >> glosses, etc associated with it). The query lets you see the entry,
                                    > >> and its kanji, reading, and gloss strings all in one row.
                                    >
                                    > It's an interesting process. You have to do an elaborate deconstruction
                                    > of an entry to get it into all those tables in order to give you the
                                    > flexibility and scope needed, and then you need to do an equally
                                    > elaborate reconstruction to get it back to an entry you can see as whole.

                                    <Chuckle> yes, I hadn't thought about it that way before. :-)

                                    > I guess that's the up-front pain of moving to a DBMS.

                                    There's some pain with any change. And if one is used to
                                    programming, especially in a procedural language like C,
                                    SQL (which is basically a declarative language that operates
                                    on sets) can seem pretty foreign initially. At least it
                                    was that way for me.
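
                                    To make the "reconstruction" concrete, here is a
                                    minimal sketch of the kind of query I mean. The
                                    table and column names (entr, kanj, rdng, gloss,
                                    each with entr and txt columns) are illustrative,
                                    not necessarily those in schema.zip, and
                                    GROUP_CONCAT() is the MySQL spelling (Postgresql
                                    needs an equivalent aggregate):

                                      -- One row per entry: its id plus the kanji,
                                      -- reading, and gloss strings all folded together.
                                      SELECT e.id,
                                        (SELECT GROUP_CONCAT(k.txt SEPARATOR '; ')
                                           FROM kanj k WHERE k.entr = e.id) AS kanji,
                                        (SELECT GROUP_CONCAT(r.txt SEPARATOR '; ')
                                           FROM rdng r WHERE r.entr = e.id) AS readings,
                                        (SELECT GROUP_CONCAT(g.txt SEPARATOR '; ')
                                           FROM gloss g WHERE g.entr = e.id) AS glosses
                                      FROM entr e
                                      WHERE e.id = 40877;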

                                    One of the biggest payoffs with a database system is
                                    data consistency. The database enforces rules that
                                    prevent inconsistent data from ever getting in there.
                                    For example, just the exercise of loading jmdict into
                                    a database identified a lot of minor things. I found
                                    a number of entities that were used inconsistently.
                                    Someone else (Ronan?) posted a list of xrefs without
                                    valid targets. My schema doesn't provide for having
                                    an entry marked "re_nokanji" if the entry has no kanji.
                                    As a result I found entries 2066740, 2067160, 2067300,
                                    and 2067680 all have this condition. All of this
                                    consistency checking helps prevent the "bit-rot" that
                                    tends to afflict large collections of data.
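
                                    For the record, that nokanji check is just a query
                                    along these lines (table and column names again
                                    illustrative rather than exactly my schema's):

                                      -- Readings flagged "nokanji" on entries that
                                      -- have no kanji at all.
                                      SELECT r.entr
                                      FROM rdng r
                                      WHERE r.nokanji = 1
                                        AND NOT EXISTS
                                          (SELECT 1 FROM kanj k WHERE k.entr = r.entr);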

                                    > As I have been watching the DBMS discussion unfold, I have been pondering
                                    > the suggestion from someone a while ago (it may have been off-line) just
                                    > to drop a complete entry into a text window and let someone edit it.

                                    Instead of a database-based solution?

                                    > For example, replacing my simple markup with XML-like labels, the
                                    > entry: 半々(P); 半半 【はんはん】 (n) half and half; fifty-fifty; (P)
                                    > is stored:
                                    >
                                    > <entry> 1601210
                                    > 半々
                                    > <k_pri> news1 nf18
                                    > 半半
                                    > <reading>
                                    > はんはん
                                    > <re_pri> news1 nf18
                                    > <sense>
                                    > <pos> n
                                    > half and half
                                    > fifty-fifty
                                    >
                                    > I guess for the average user about to fiddle with an entry, having things
                                    > in defined fields is a lot more understandable, and easier to check.

                                    Of course, one can have a database behind this
                                    interface as easily as a file.

                                    One nice thing about forms is dropdown/combo/selection
                                    boxes. Without those one needs a pretty
                                    extraordinary help system in order to let the casual
                                    user know quickly that they need to type "adj-na"
                                    rather than "na-adj" or "adjna". Or the parser has
                                    to be very heuristic.
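
                                    The database can backstop the form here, too. A
                                    sketch, with hypothetical table names, of how a
                                    foreign key makes "na-adj" impossible to store at
                                    all (note that MySQL only enforces foreign keys
                                    on InnoDB tables):

                                      -- The set of valid POS keywords.
                                      CREATE TABLE kwpos (
                                          kw VARCHAR(20) PRIMARY KEY
                                      );
                                      -- POS tags attached to a sense of an entry.
                                      CREATE TABLE pos (
                                          entr INT NOT NULL,
                                          sens INT NOT NULL,
                                          kw VARCHAR(20) NOT NULL,
                                          FOREIGN KEY (kw) REFERENCES kwpos(kw)
                                      );
                                      INSERT INTO kwpos VALUES ('adj-na');
                                      -- INSERT INTO pos VALUES (1601210, 1, 'na-adj');
                                      --   would be rejected: 'na-adj' is not in kwpos.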

                                    I built a task tracking system once that worked like
                                    that. Task priorities, status, notes, assignments,
                                    etc. were all displayed in a big window as formatted
                                    text. I used it for a long time by myself and liked
                                    it very much but when other people started to use it,
                                    they got frustrated at syntax errors when they submitted
                                    an edited entry and ended up making me rewrite it as a
                                    form-based system. I still wonder if better help, error
                                    messages, and parsing could have made it viable.

                                    I am currently in the (very slow) process of moving the
                                    front-end of my personal jmdict database app from MS
                                    Access to Python/wxWidgets. My motivation was that MS
                                    Access forms, although very fast to build a GUI with,
                                    have a lot of limitations (that, and I can't stand Visual
                                    Basic). What I want is something like what you suggest,
                                    but read-only, not for editing. I want a big text window
                                    in which an entry (or multiple entries) appear formatted
                                    much as you would see in a paper dictionary. I find
                                    that much easier to read than a form, and a much more
                                    efficient use of screen real estate. As for editing,
                                    that is still on the drawing board. I hope I will get
                                    some ideas from how it's done for jmdict. :-)
                                  • Jim Breen
                                    Message 17 of 21 , Oct 10, 2006
                                      [Stuart McGraw (RE: [edict-jmdict] Re: mysql limitations) writes:]
                                      >>
                                      >> Anyway, here is a revised version of the schema.zip file
                                      [...]

                                      Available at: http://www.csse.monash.edu.au/~jwb/schema.zip

                                      Jim

                                      --
                                      Jim Breen http://www.csse.monash.edu.au/~jwb/
                                      Clayton School of Information Technology, Tel: +61 3 9905 9554
                                      Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
                                      (Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学
                                    • Jim Breen
                                      Message 18 of 21 , Oct 10, 2006
                                        [Stuart McGraw (RE: [edict-jmdict] mysql limitations) writes:]
                                        >> One of the biggest payoffs with a database system is
                                        >> data consistency. The database enforces rules that
                                        >> prevent inconsistent data from ever getting in there.

                                        Yes, this is very important, as it can save a lot of manual labour
                                        later on.

                                        >> For example, just the exercise of loading jmdict into
                                        >> a database identified a lot of minor things. I found
                                        >> a number of entities that were used inconsistently.
                                        >> Someone else (Ronan?) posted a list of xrefs without
                                        >> valid targets.

                                        Yes, and very useful they were. I think I fixed them all.

                                        >> My schema doesn't provide for having
                                        >> an entry marked "re_nokanji" if the entry has no kanji.
                                        >> As a result I found entries 2066740, 2067160, 2067300,
                                        >> and 2067680 all have this condition. All of this
                                        >> consistency checking helps prevent the "bit-rot" that
                                        >> tends to afflict large collections of data.

                                        Thanks. I have fixed those re_nokanji errors. They weren't
                                        causing any problems, but they weren't needed either.

                                        >> One nice thing about forms is dropdown/combo/-
                                        >> selection boxes. Without those one needs a pretty
                                        >> extraordinary help system in order to let the casual
                                        >> user know quickly that they need to type "adj-na"
                                        >> rather than "na-adj" or "adjna". Or the parser has
                                        >> to be very heuristic.

                                        When I first put up the page at
                                        http://www.csse.monash.edu.au/~jwb/wwwnewword.html I just had people do
                                        their own POS according to guidelines. It was hopeless, so I changed to the
                                        current dropdowns. My problem now is people not bothering to change the
                                        default noun.

                                        >> I am currently in the (very slow) process of moving the
                                        >> front-end of my personal jmdict database app from MS
                                        >> Access to Python/wxWidgets. My motivation was that MS
                                        >> Access forms, although very fast to build a gui with,
                                        >> have a lot of limitations (that and I can't stand Visual
                                        >> Basic). What I want something like what you suggest,
                                        >> but read-only, not for editing. I want a big text window
                                        >> in which an entry (or multiple entries) appear formatted
                                        >> much as you would see in a paper dictionary. I find
                                        >> that much easier to read than a form, and a much more
                                        >> efficient use of screen real estate. As for editing,
                                        >> that is still on the drawing board. I hope I will get
                                        >> some ideas from how its done for jmdict. :-)

                                        Not too much on the drawing board, I hope 8-)}

                                        Jim

                                        --
                                        Jim Breen http://www.csse.monash.edu.au/~jwb/
                                        Clayton School of Information Technology, Tel: +61 3 9905 9554
                                        Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
                                        (Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学