Loading ...
Sorry, an error occurred while loading the content.

Re: [xml-dbms] SAX version?

Expand Messages
  • rpbourret@rpbourret.com
    ... OK. Three things you should really look for are: 1) The generated map uses keys generated by XML-DBMS. Are there any fields in the document that you can /
    Message 1 of 5 , Feb 5, 2004
    • 0 Attachment
      > On Thu, 5 Feb 2004 rpbourret@... wrote:
      > It looked OK to me. It's been awhile since I've done any real database
      > design. I thought about trying to apply one or more of the available data
      > modeling tools to check the map but I haven't got that far.

      OK. Three things you should really look for are:

      1) The generated map uses keys generated by XML-DBMS. Are there any fields in
      the document that you can / want to use as keys? Since this is coming from an
      Oracle database, there's a good bet this is the case (unless ROWIDs were used
      as keys in the original database).

      2) Does the XML schema have extra structure that needs to be eliminated. The
      classic example here is an address element. For an example, see the last few
      paragraphs of section 5.5 of "XML and Databases":

      http://www.rpbourret.com/xml/XMLAndDatabases.htm#generating

      Again, since the source was an Oracle database, you're probably OK.

      3) Use of generated order columns. These are useful only if you want to exactly
      reproduce the original XML document and a waste of time and space otherwise.
      Since you're using XML to transport data from one relational database to
      another, you can safely get rid of them.

      > > If you are interested, I can give you some pointers, but be prepared to
      write a
      > > reasonable amount of code.
      >
      > If you have a moment. I am up to my eyeballs between work and school, but
      > this is an interesting and useful problem to solve and I'm willing to
      > take a crack at it. I'm guessing there's a CVS tree somewhere? (I haven't
      > looked for it.)

      Nothing in the CVS tree, although there is one available at:

      http://sourceforge.net/cvs/?group_id=13613

      I'll send a separate email detailing the main design points.

      > The particular data set I'm working with came from an Oracle database.
      > Which raises an interesting point--have you ever tried 'round-tripping'
      > with XML-DBMS? I expect that there is a schema available which I'm going
      > to try to get. (Or I could just buy Oracle... ;-) )

      Yes. You can easily round-trip XML=>DB=XML and DB=>XML=>DB. The latter is
      always exact. The former is always exact except for:

      1) Sibling order. You can make it exact even with respect to sibling order if
      you store order information in the database. As I mentioned above, there is no
      need in your case.

      2) Comments, processing instructions, DTD, etc. These are always discarded.

      3) Entity references, character references, CDATA sections. These are always
      discarded.

      > > If the performance of XML-DBMS becomes an issue, you might try a native XML
      > > database such as eXist (http://exist.sourceforge.net) or Berkeley DB XML
      > > (http://www.sleepycat.com/products/xml.shtml).
      > >
      >
      > I'll take a look. I need to get something that works rather quickly and
      > then I can look into the best solution.

      Since the source of the data is a relational database, XML-DBMS will probably
      do just fine for you.

      > Good idea. That's _exactly_ the form of the data. A little perl script to
      > break it up at every nth tag would do nicely.
      >
      > > http://groups.yahoo.com/group/xml-dbms/message/2934

      OK. I don't know how easy it is to mix PERL and Java, but that's your
      choice. You can do that or use the tool mentioned in the mail above. It's also
      easy to write a simple SAX application that will do this for you. (

      --
      Ron
    • Ronald Bourret
      I just realized I hadn t answered this message yet. Here s the main design issues. Note that the proposed solutions definitely follow the 80/20 rule. They are
      Message 2 of 5 , Mar 8, 2004
      • 0 Attachment
        I just realized I hadn't answered this message yet.

        Here's the main design issues. Note that the proposed solutions
        definitely follow the 80/20 rule. They are not complete solutions, but
        should be useful in most situations.

        DBMSToSAX:
        ----------
        The main problem here is order.

        It is not feasible to use order columns at all. The problem is that
        order columns can place a child element (or text) anywhere within the
        set of siblings. This means you need to cache all siblings before you
        issue SAX events for any of them. Since siblings can come from child
        tables as well as the columns of a single table, the worst case here is
        that you cache the entire document before issuing events for any of it.
        Therefore, ignore order columns.

        You can, however, use fixed order values. Again, because of the
        potential for having to do a lot of caching, I suggest you ignore fixed
        order values that apply to related classes (child tables). Instead, just
        order the columns in a given table according to their fixed order values
        (I think you need to modify DMLGenerator to do this) and issue events
        for each column as you read it. Note that you will need to place columns
        mapped to attributes before any columns mapped to child elements; this
        is because attributes are sent as part of the event for the row-level
        (class) element.

        The strategy for DBMSToSAX is otherwise very straightforward:

        1) Start with a root table (or tables) and retrieve data based on a
        filter. Read through the result set. For a given row, issue an event for
        the table element (and any attribute columns), then for any columns
        mapped to child elements. Then execute step 2 before proceeding to the
        next row.

        2) Retrieve data for each child table (complex child elements), based
        on the key values in the current row and any table filters. Recursively
        process the row according to step (1). Note that child tables should be
        processed according to their fixed order values of their corresponding
        elements (if any).

        DOMToDBMS:
        ----------
        DOMToDBMS is somewhat more complex. There are two problems. First, you
        need to know when it is OK to insert a row of data. That is, when have
        you retrieved all the data from the XML document for a single row. For
        example, consider the following sales order document:

        <Order>
        <Number>...</Number>
        <CustNum>...</CustNum>
        <Date>...</Date>
        <Item>
        ... item children ...
        </Item>
        <Item>
        ... item children ...
        </Item>
        </Order>

        We can insert the row into the sales order table after we get the event
        for the Date element. The problem is that the mapping information
        doesn't tell us that the Date element is the last element containing
        data for the Orders table.

        Our second problem is referential integrity. Consider the following
        document, which is the same as our first document except that Date now
        appears after the Item elements:

        <Order>
        <Number>...</Number>
        <CustNum>...</CustNum>
        <Item>
        ... item children ...
        </Item>
        <Item>
        ... item children ...
        </Item>
        <Date>...</Date>
        </Order>

        In this case, we still can insert the row until after the event for the
        Date element, but notice we have already processed the Item elements.
        Because the Items table has a foreign key pointing to a row in the
        Orders table, we can't insert any rows in the Items table until we have
        inserted the corresponding row in the Orders table. Since SAX processes
        documents linearly, this means we will have to cache the item data and
        insert it after we insert the order data.

        To give you an idea of the range of problems here, consider the
        following document in which the Customer element is complex and mapped
        to its own table:

        <Order>
        <Number>...</Number>
        <Date>...</Date>
        <Customer>
        ...customer children...
        </Customer>
        <Item>
        ... item children ...
        </Item>
        <Item>
        ... item children ...
        </Item>
        </Order>

        Again, because of referential integrity constraints, we can't insert the
        sales order row until we have inserted the customer row.

        So how do we solve these problems? Let's look first at the problem of
        knowing when it is safe to insert a row.

        DOMToDBMS solves this problem by simply reading to the end of a set of
        siblings. There is no memory hit for this because the document is
        already stored in memory as a DOM tree. In SAXToDBMS, we have several
        options:

        1) Follow the same strategy as DOMToDBMS. That is, extract the data from
        the document as we go and cache it as Row objects. In virtually all
        cases, this will require us to cache all of the data for the document.
        Depending on the ratio of tag size to data, this will probably require
        50-90% of the size of the document. A good solution for big documents,
        but won't work with truly large documents. Probably not too hard to
        implement.

        2) Require all child elements to have fixed order values. This allows us
        to figure out when we have all the data for a row. (Or add a new element
        to the map DTD that tells us this information.) This will work for all
        documents, but I suspect it would be difficult to implement.
        Furthermore, the worst cases will require caching the entire document as
        in (1), so I see little real benefit.

        3) Require all child elements that are mapped as properties to be
        grouped together. This means that, as soon as we hit an element mapped
        as a related class (or the end of the parent), we can insert the row.
        This won't work for all documents. For example, it will work with the
        first document shown above but not the second. However, I suspect this
        is a common design pattern, so this might be a good 80/20 solution.
        Again, probably not too hard to implement.

        (Documents that don't match this pattern will need to be transformed.
        With XSLT, this usually means bringing the entire document into memory,
        which obviously doesn't work with very large documents.)

        Our second question is how to handle the referential integrity problems.
        There are really two choices:

        1) Cache data that depends on other data and insert it later. For
        example, in the third document, cache the order data until we have the
        Customer data. Insert the customer data, then insert the order data,
        then insert the item data. This is not too complex -- it is roughly what
        DOMToDBMS does today -- and it works for many documents.

        2) Require that the document be ordered such that data can be inserted
        as it is read. For example, the third document would have to have the
        customer data first, then the order data, then the item data. Although
        this is much easier to implement, it is not clear how many documents are
        structured this way, although I suspect that many data-centric documents
        are. This will work for any document where the primary key is always in
        the table of the parent element, but not for any documents where the
        primary key is ever in the table of the child element.

        Personally, I prefer (1) in this case. It is not that much harder than
        (2) to implement and works with a greater variety of documents. I don't
        think the caching will cause memory problems except in very rare cases.

        -- Ron


        rpbourret@... wrote:
        > > If you have a moment. I am up to my eyeballs between work and school, but
        > > this is an interesting and useful problem to solve and I'm willing to
        > > take a crack at it. I'm guessing there's a CVS tree somewhere? (I haven't
        > > looked for it.)
        >
        > Nothing in the CVS tree, although there is one available at:
        >
        > http://sourceforge.net/cvs/?group_id=13613
        >
        > I'll send a separate email detailing the main design points.
      Your message has been successfully submitted and would be delivered to recipients shortly.