Re: [xml-dbms] All in one answer....

  • Ronald Bourret
    Message 1 of 9 , Dec 11, 2000
      Pareena Shah wrote:
      >
      > Question for the people thinking about the new version of XML DBMS: What do
      > you think about using something like sqlloader to bulk load transformed XML
      > data into an Oracle database? If I have a situation where I am going to be
      > processing large volumes of XML data into an Oracle database, and I want to
      > optimize by buffering rows, and using Oracle's direct path load
      > functionality, is sql loader the best way? Could you comment on the
      > advantages/disadvantages?

      This is an interesting idea, although it won't be included in the next
      release due to lack of time. (It would require completely rearchitecting
      the DOMToDBMS and DBMSToDOM classes.)

      The following discussion is not specific to Oracle's bulk loader, but
      discusses how XML-DBMS might do bulk inserts in the future. This assumes
      such updates are possible using JDBC, and it is not clear to me that
      they are.

      The challenge is this. Suppose we have an XML document that looks like
      the following:

      <A>
         <A1>...</A1>
         <A2>...</A2>
         <A3>...</A3>
         <A4>...</A4>
         <B>
            <B1>...</B1>
            <B2>...</B2>
            <B3>...</B3>
         </B>
      </A>

      and that this document was mapped to tables A (columns A1-A4) and B
      (columns B1-B3) as expected, with the primary key in table A. Now
      suppose you have a whole lot of these structures in a single XML
      document:

      <root>
         <A>
            <A1>...</A1>
            <A2>...</A2>
            <A3>...</A3>
            <A4>...</A4>
            <B>
               <B1>...</B1>
               <B2>...</B2>
               <B3>...</B3>
            </B>
         </A>
         ...
         <A>
            <A1>...</A1>
            <A2>...</A2>
            <A3>...</A3>
            <A4>...</A4>
            <B>
               <B1>...</B1>
               <B2>...</B2>
               <B3>...</B3>
            </B>
         </A>
      </root>

      Currently, the code inserts the row for the first A, then the row for
      the first B, then the row for the second A, then the row for the
      second B, and so on.

      To use bulk loading, the code would need to buffer rows for A and rows
      for B, then insert them when there are a certain number of rows in the
      buffer -- say 100. While this probably wouldn't be too bad in the above
      case, it could get very complicated in the general case.

      For example, imagine there can be an arbitrary number of B children for
      each A parent. Thus, the buffer for B rows would fill up before the
      buffer for A rows. However, the code has to be careful about when it
      inserts rows. That is, it can't just wait until the buffer for B rows is
      full and then just insert them. Because of referential integrity, it has
      to insert the A rows before the B rows, so you need to coordinate when
      the buffers are emptied. Now, imagine doing this for an XML document
      that is nested arbitrarily deep and you'll see that the code is
      non-trivial.
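
A minimal sketch of one way to coordinate the buffers, not the XML-DBMS design. It assumes the tables can be listed parent-first and that a parent row is always buffered before its child rows. Whenever any buffer fills, every buffer is flushed in parent-first order, so no child batch is written before its parents. OrderedRowBuffer and BatchWriter are hypothetical names; a BatchWriter could, for example, wrap JDBC 2.0's PreparedStatement.addBatch and executeBatch, although whether that results in a true bulk load depends on the driver.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Buffers rows per table and always flushes parent tables before child tables. */
public class OrderedRowBuffer {

    /** Receives one batch of rows for one table (hypothetical interface). */
    public interface BatchWriter {
        void write(String table, List<String[]> rows);
    }

    // LinkedHashMap keeps the parent-first order given to the constructor.
    private final Map<String, List<String[]>> buffers = new LinkedHashMap<>();
    private final int threshold;
    private final BatchWriter writer;

    /** tablesParentFirst must list every table, parents before their children. */
    public OrderedRowBuffer(List<String> tablesParentFirst, int threshold,
                            BatchWriter writer) {
        for (String table : tablesParentFirst) {
            buffers.put(table, new ArrayList<>());
        }
        this.threshold = threshold;
        this.writer = writer;
    }

    /** Buffers a row; flushes all tables if this table's buffer is now full. */
    public void add(String table, String[] row) {
        List<String[]> rows = buffers.get(table);
        if (rows == null) {
            throw new IllegalArgumentException("Unknown table: " + table);
        }
        rows.add(row);
        if (rows.size() >= threshold) {
            flush();
        }
    }

    /** Writes every non-empty buffer in parent-first order, then empties it. */
    public void flush() {
        for (Map.Entry<String, List<String[]>> entry : buffers.entrySet()) {
            List<String[]> rows = entry.getValue();
            if (!rows.isEmpty()) {
                writer.write(entry.getKey(), new ArrayList<>(rows));
                rows.clear();
            }
        }
    }
}
```

The cost of this simple strategy is that parent batches are often smaller than the threshold, because a full child buffer forces the parent buffer out early. Call flush() once more at the end of the document to write any remaining rows.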

      So while this is a good idea and worth looking at in the future, we
      don't have time to do it now.

      --
      Ronald Bourret
      Programming, Writing, and Training
      XML, Databases, and Schemas
      http://www.rpbourret.com
    • Ronald Bourret
      Message 2 of 9 , Dec 11, 2000
        Ahhh. We're closer than I thought. I think the last thing we need to do
        is move the dispatch method from the transfer/map engine to the CLI
        class. In practical terms, this means moving the init, action, and
        transfer methods from Xmldbms to Transfer, and the init and action
        methods from Xmldbms to GenerateMap. Thus, assuming Transfer and
        GenerateMap inherit from a ProcessProperties class, they would look
        something like the following. Notice that dispatch is a public method,
        so clients (like the GUI) that want to do text-based programming can
        call it directly without going through a command line.

        import java.util.Properties;

        public class Transfer extends ProcessProperties {

           public static void main(String[] args) throws Exception {
              // Parse the arguments and generate a Properties object.
              // getProperties is assumed to be a static method inherited
              // from ProcessProperties ("this" is not available in main).
              Properties props = getProperties(args);

              // Dispatch the action
              dispatch(props);
           }

           public static void dispatch(Properties props) throws Exception
           {
              TransferEngine transferEngine = new TransferEngine();

              // Set up the parser and database
              transferEngine.setParserProperties(props);
              transferEngine.setDatabaseProperties(props);

              // Dispatch the action. getProperty returns a String,
              // whereas get returns an Object.
              String action = props.getProperty(ACTION);
              if (STOREDOCUMENT.equals(action)) {
                 String mapFilename = props.getProperty(MAPFILE);
                 String xmlFilename = props.getProperty(XMLFILE);
                 int commitMode = convertCommitMode(props.getProperty(COMMITMODE));
                 String keyGeneratorClass = props.getProperty(KEYGENERATORCLASS);
                 transferEngine.storeDocument(mapFilename, xmlFilename,
                                              commitMode, keyGeneratorClass);
              } else if (action.equals(...)) {
                 ...
                 ...
              } else ... {
                 ...
              }
           }
        }

        This will make the CLI classes more complex than they are now. (It
        requires them to know the transfer and map engine APIs.) However, this
        means we have a clean separation between the text-based interface and
        the programmatic interface. Furthermore, since the text-based interface
        is layered on top of the programmatic interface, it makes sense for the
        text-based interface (higher level) to know about the programmatic
        interface (lower level) but not vice versa. Finally, it means we can
        evolve both interfaces separately without too much worry of interference
        between the two.
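
To make the layering concrete, here is a small self-contained sketch. It is not the XML-DBMS code: DispatchSketch, the Engine interface, and the property names and values ("Action", "MapFile", "XMLFile", "StoreDocument") are all placeholders chosen for illustration. The point is only that the text-based layer knows about the engine's methods, while the engine never sees a Properties object.

```java
import java.util.Properties;

public class DispatchSketch {

    // Hypothetical property names and action value, for illustration only.
    static final String ACTION = "Action";
    static final String MAPFILE = "MapFile";
    static final String XMLFILE = "XMLFile";
    static final String STOREDOCUMENT = "StoreDocument";

    /** Stand-in for the programmatic (lower-level) interface. */
    public interface Engine {
        void storeDocument(String mapFilename, String xmlFilename);
    }

    /** Text-based layer: turns name=value arguments into a Properties object. */
    public static Properties getProperties(String[] args) {
        Properties props = new Properties();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected name=value: " + arg);
            }
            props.setProperty(arg.substring(0, eq), arg.substring(eq + 1));
        }
        return props;
    }

    /** Maps properties onto engine calls; public so a GUI can call it directly. */
    public static void dispatch(Properties props, Engine engine) {
        String action = props.getProperty(ACTION);
        if (STOREDOCUMENT.equals(action)) {
            engine.storeDocument(props.getProperty(MAPFILE),
                                 props.getProperty(XMLFILE));
        } else {
            throw new IllegalArgumentException("Unknown action: " + action);
        }
    }
}
```

A command-line main method would simply call dispatch(getProperties(args), engine), and a GUI could build the Properties object itself and skip the argument parsing.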

        As a first stab at the text-based interface, see what you think of the
        properties in:

        http://www.eGroups.com/message/xml-dbms/486

        Ignore the discussion of three separate files and the DatabasePropsFile
        and ParserPropsFile properties. The rest is pretty much the same as
        what's in textvalues.txt, except for: (1) renaming, and (2)
        consolidation of the action, t_status, and t_direction properties into a
        single Action property.

        Does this give people too many options? For example, should we remove
        commit mode and keygeneratorclass, infer the schema type from the file
        extension, and merge retrieveDocumentByKey and retrieveDocumentByKeys?
        On the one hand, this is supposed to be a simple API. On the other hand,
        people probably want control.

        You can also ignore the comment that these properties simply reflect the
        underlying API. Although that statement might be true now, I don't think
        it will be true in the future. In particular, the text-based API should
        have properties that make sense for execution in a language-independent,
        disconnected, probably stateless environment. The programmatic API
        should have methods that make sense for execution in a Java-based,
        connected, state-maintaining environment.

        For the moment, you can use the following as the transfer and map engine
        APIs, but we definitely need to take another look at these and see what
        makes sense for the future:

        http://www.eGroups.com/message/xml-dbms/480
        http://www.eGroups.com/message/xml-dbms/485

        Other comments below.

        adam flinton wrote:

        > Done Deal. I'll get that done ASAP. To which end could you cast your eyes
        > over the textvalues.txt & (a) Add anything which is missing (b) take out
        > anything unnecessary (c) check the names of the Key values e.g. XMLDocument
        > or Map or whatever.

        See comments above.

        > > The specific case I am thinking about is when a Web application calls
        > > the transfer engine to get an XML document. Currently, our
        > > API/property
        > > set only allows you to write the document to disk. This is inefficient
        > > and we should be able to stream the document directly back to the
        > > application as XML.
        >
        > This is very true & is something which I've given some thought to (esp re
        > servlets (I am playing with using servlets for messaging.....no one ever
        > said that servlets need to produce / accept just HTML or indeed that their
        > output needs to be "visible")). It is almost the same as where do you get
        > the file from / put it to. E.g. let's imagine that you want to send the
        > resulting doc somewhere via http put. My initial answer (& it remains the
        > same right now) is that it simply means adding stuff to the XMLwriting
        > methods (or possibly even moving the file read / write out to a separate
        > class as in writeFile(File,location) sort of thing).

        Let's leave this alone for the moment, get the architecture in place,
        possibly do a beta release, and then take another look at this before
        final release. I can't help but think there's a reasonable solution to
        this in the text-based case. Perhaps the Transfer.dispatch method can
        return an Object?

        > I'll have a look round....My only problems with XML DB'es per se are :
        >
        > A) Most of the world's data is & will remain in SQL table structures (i.e
        > Relational not tree based)
        > B) A number of very good tree based DB'es exist such as Cache which have
        > been built (& optimised) over many years & in essence the 2 are the same
        > thing.

        Note that this is an "XML-based API to databases", not an "API to XML
        databases". That is, just as ODBC/JDBC is based on the relational model,
        this API is based on XML. And just as you can implement ODBC/JDBC over
        non-relational data by mapping that data to the relational model, you
        can implement this API over relational data by mapping the relational
        data to XML. (Presumably using something like mapping documents in
        XML-DBMS, DAD in DB2, or annotated schemas in SQL Server.)

        The goal of the API is to make all databases that support XML look the
        same, regardless of whether the underlying storage is native,
        relational, object-oriented, hierarchical, or whatever else.

        > Yup. The feature set is very simple to set out:
        >
        > 1) Mapping / Design:
        > 1.1) Build a DB structure from an XML/ tree structure.
        > 1.2) Build XML from a DB / table based structure
        >
        > 2) Operation:
        > 2.1) Transfer information as fast as possible from XML to an SQL DB
        > 2.2) Transfer information as fast as possible from an SQL DB to XML.

        This is a good summary and worth remembering.

        > Let's be honest....if one were a java programmer then one
        > could sidestep both transfer & GenerateMap & call DOMtoDBMS etc. yourself
        > passing in structures which you'd created yourself. Equally you could build
        > your own transfer engine etc.

        Agreed.

        > That's not the person I've been aiming @. I've
        > been aiming @ the Oracle/DB2/SQLServer/Sybase etc.etc DBA or the guy who
        > wants to get an answer in XML.

        Also agreed. I think what took so long to get through my head is that
        the text-based API is the simplest API and is separate from the lower
        level APIs, which give more control to people who want it.

        > RMI probs include non Java apps, firewalling.
        >
        > JMS CORBA SOAP would all carry properties files as @ the end of the day they
        > carry text files & Properties files are just that. You could add servlets +
        > any other dynamic http protocol.

        OK. Let's set this aside for the moment. We've got enough to do...

        > I've been investigating the enhydra schemamapper class for use with
        > generating class'es / objects such that I can have a GUI app which accepts /
        > gets sent an XML doc & can then load the relevant class to deal with / map
        > to the xml doc. In essence XML per se is useless, unless something is done
        > with it (whether in a GUI or a servlet or whatever). So building / using
        > something which allows my developers to easily do something with the
        > resultant XML (& indeed provide XML for use by XMLDBMS) is also important.
        > Then it struck me (as things do when it's late & I'm tired) that in many
        > ways the org.xmlmiddleware kinda covered this too.
        >
        > i.e one "action" might well be to produce the relevant class'es to deal with
        > the XML docs produced according to the schema (or indeed to create a new XML
        > doc) such that you have an SQLDB. You produce the SQL structure you wish to
        > have mapped. This results in a map file & a schema. What then?
        > Wellllllll........run that schema through with "action=produceclasses" (or
        > something similar) & voila you have something which your servlet / GUI
        > developers can then use. The thought was triggered partly by my own needs &
        > partly as we may well (you mentioned it sometime back) use the schemamapper
        > any way & this would allow the use of the same code infrastructure (e.g. the
        > abstraction of the parsers etc). It would also tie in with moving transfer,
        > genmap etc into separate classes as all I would be doing would be to add a
        > "genJava" class....

        I've thought of this, too, and it's what's behind Castor, Bluestone,
        Informix's Object Translator, Sun's Project Adelard, and probably some
        other things I'm not aware of. Personally, I think this is where things
        will go in the future. Let's face it, transferring data between an XML
        document and a database is not nearly as interesting as having an
        intermediate object that you can use to manipulate that data.

        As for XML-DBMS' involvement in this sort of thing, I've kept clear of
        it for two reasons. First, there are enough interesting problems in the
        straight XML <=> DBMS world to keep me busy for a long time. Second, a
        bunch of other people are already doing this, so I see little point in
        duplicating other people's work, especially when some of that work is
        Open Source.

        That said, I was planning to keep it in mind when designing the map
        factory for XML schemas, which could form the basis for this sort of
        code.

        (By the way, last time I looked at the schemamapper class in Enhydra, it
        was woefully underpowered. That is, it supported just a tiny fragment of
        what schemas can do. I assume it will evolve as time goes on, but at the
        moment, it doesn't do us much good.)

        --
        Ronald Bourret
        Programming, Writing, and Training
        XML, Databases, and Schemas
        http://www.rpbourret.com
      • meyappan@yahoo.com
        Message 3 of 9 , Apr 9, 2001
          Hi:

          I am just wondering: if we have flat XML, that is, XML with no
          nested relationships, is it feasible to bulk load the XML data
          into Oracle using direct path load?

          Thanks
          Meyyappan


        • Ronald Bourret
          Message 4 of 9 , Apr 12, 2001
            meyappan@... wrote:

            > I am just wondering: if we have flat XML, that is, XML with no
            > nested relationships, is it feasible to bulk load the XML data
            > into Oracle using direct path load?

            Is "direct path load" a feature of Oracle? If so, XML-DBMS does not use
            it.

            By "flat XML" I assume you mean something like the following:

            <Table>
               <Row>
                  <Column1>...</Column1>
                  <Column2>...</Column2>
                  ...
               </Row>
               <Row>
                  ...
               </Row>
               ...
            </Table>

            If this is the case, XML-DBMS is probably overkill, as it is designed
            especially to work with nested XML. If you want to use a
            database-specific bulk-load utility, you can probably write your own
            code to do this fairly easily. Such code would presumably use ODBC
            (which supports bulk loads) or Oracle's own API.

            I've attached a rough example of what a SAX version of this code would
            look like at the end of this message -- you would need to modify it for
            bulk loading. (I have a vague feeling there is a state error somewhere
            in this code, but haven't ever run it so I'm not sure.)

            -- Ron

            The code to transfer data from XML to the database follows a common
            pattern, regardless of whether it uses SAX or DOM:

            1. Table element start: prepare an INSERT statement
            2. Row element start: clear INSERT statement parameters
            3. Column elements: buffer PCDATA and set INSERT statement parameters
            4. Row element end: execute INSERT statement
            5. Table element end: close INSERT statement

            The code does not make any assumptions about the names of the tags. In
            fact, it uses the name of the table-level tag to build the
            INSERT statement and the names of the column-level tags to identify
            parameters in the INSERT statement. Thus, these names
            could correspond exactly to the names in the database or could be mapped
            to names in the database using a configuration file.
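
As a rough illustration of how the tag names could drive the INSERT statement, the sketch below builds the SQL text from a table name and a list of column names. It assumes the column list is already known when the table element starts (for example, from a configuration file or database metadata), since SAX reports the table tag before any column tags. InsertSqlBuilder and its methods are hypothetical helpers, not part of XML-DBMS.

```java
import java.util.List;

public class InsertSqlBuilder {

    /** Builds a parameterized statement such as INSERT INTO T (C1, C2) VALUES (?, ?). */
    public static String buildInsert(String table, List<String> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("At least one column is required");
        }
        StringBuilder names = new StringBuilder();
        StringBuilder marks = new StringBuilder();
        for (String column : columns) {
            if (names.length() > 0) {
                names.append(", ");
                marks.append(", ");
            }
            names.append(checkName(column));
            marks.append('?');
        }
        return "INSERT INTO " + checkName(table)
                + " (" + names + ") VALUES (" + marks + ")";
    }

    /** Returns the 1-based JDBC parameter index for a column tag name. */
    public static int parameterIndex(List<String> columns, String column) {
        int index = columns.indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        return index + 1;
    }

    /** Accepts only simple identifiers, because names are spliced into the SQL text. */
    static String checkName(String name) {
        if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a simple identifier: " + name);
        }
        return name;
    }
}
```

The identifier check matters because tag names come from the document and cannot be passed as statement parameters; only the column values can.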

            Here is the code using SAX:

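            import java.sql.PreparedStatement;
            import java.sql.SQLException;
            import org.xml.sax.Attributes;
            import org.xml.sax.SAXException;

            // Element depths in the flat document. A single state variable
            // cannot tell the end of a column from the end of a row, so the
            // handler tracks element depth instead.
            static final int TABLE = 1, ROW = 2, COLUMN = 3;

            int depth = 0; // 0 = outside the table element
            PreparedStatement stmt;
            StringBuffer data;

            public void startElement(String uri, String name, String qName,
                                     Attributes attr) throws SAXException {
               depth++;
               try {
                  if (depth == TABLE) {
                     stmt = getInsertStmt(name);
                  } else if (depth == ROW) {
                     stmt.clearParameters();
                  } else { // if (depth == COLUMN)
                     data = new StringBuffer();
                  }
               } catch (SQLException e) {
                  throw new SAXException(e);
               }
            }

            public void characters (char[] chars, int start, int length) {
               // Ignore whitespace between rows and columns
               if (depth == COLUMN)
                  data.append(chars, start, length);
            }

            public void endElement(String uri, String name, String qName)
                  throws SAXException {
               try {
                  if (depth == TABLE) {
                     stmt.close();
                  } else if (depth == ROW) {
                     stmt.executeUpdate();
                  } else { // if (depth == COLUMN)
                     setParameter(stmt, name, data.toString());
                  }
               } catch (SQLException e) {
                  throw new SAXException(e);
               }
               depth--;
            }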
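
One possible direction for the bulk-loading modification is sketched below. Instead of executing an INSERT per row, the handler collects rows and hands them to a sink in batches. BatchingRowHandler and RowSink are hypothetical names, and the sink is deliberately left abstract: it might call a database-specific bulk-load API or a JDBC batch, which this sketch does not attempt. The handler uses qName because the default JAXP parser is not namespace-aware.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BatchingRowHandler extends DefaultHandler {

    /** Receives a batch of rows for one table (hypothetical interface). */
    public interface RowSink {
        void writeBatch(String table, List<Map<String, String>> rows);
    }

    private final int batchSize;
    private final RowSink sink;
    private int depth = 0;                  // 1 = table, 2 = row, 3 = column
    private String table;
    private Map<String, String> row;
    private StringBuilder data;
    private List<Map<String, String>> batch = new ArrayList<>();

    public BatchingRowHandler(int batchSize, RowSink sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String name, String qName, Attributes attr) {
        depth++;
        if (depth == 1) {
            table = qName;
        } else if (depth == 2) {
            row = new LinkedHashMap<>();
        } else if (depth == 3) {
            data = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        if (depth == 3) {
            data.append(chars, start, length);
        }
    }

    @Override
    public void endElement(String uri, String name, String qName) {
        if (depth == 3) {
            row.put(qName, data.toString());
        } else if (depth == 2) {
            batch.add(row);
            if (batch.size() >= batchSize) {
                flush();
            }
        } else if (depth == 1) {
            flush();                        // write the final, partial batch
        }
        depth--;
    }

    private void flush() {
        if (!batch.isEmpty()) {
            sink.writeBatch(table, batch);
            batch = new ArrayList<>();
        }
    }

    /** Parses a flat XML string and sends its rows to the sink in batches. */
    public static void load(String xml, int batchSize, RowSink sink) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new BatchingRowHandler(batchSize, sink));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }
}
```

Because the document is flat, there is only one buffer and no ordering problem between parent and child tables.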