Loading ...
Sorry, an error occurred while loading the content.

Re: [xml-dbms] SAX version?

Expand Messages
  • Ronald Bourret
    I just realized I hadn t answered this message yet. Here s the main design issues. Note that the proposed solutions definitely follow the 80/20 rule. They are
    Message 1 of 5 , Mar 8, 2004
    • 0 Attachment
      I just realized I hadn't answered this message yet.

      Here's the main design issues. Note that the proposed solutions
      definitely follow the 80/20 rule. They are not complete solutions, but
      should be useful in most situations.

      The main problem here is order.

      It is not feasible to use order columns at all. The problem is that
      order columns can place a child element (or text) anywhere within the
      set of siblings. This means you need to cache all siblings before you
      issue SAX events for any of them. Since siblings can come from child
      tables as well as the columns of a single table, the worst case here is
      that you cache the entire document before issuing events for any of it.
      Therefore, ignore order columns.

      You can, however, use fixed order values. Again, because of the
      potential for having to do a lot of caching, I suggest you ignore fixed
      order values that apply to related classes (child tables). Instead, just
      order the columns in a given table according to their fixed order values
      (I think you need to modify DMLGenerator to do this) and issue events
      for each column as you read it. Note that you will need to place columns
      mapped to attributes before any columns mapped to child elements; this
      is because attributes are sent as part of the event for the row-level
      (class) element.

      The strategy for DBMSToSAX is otherwise very straightforward:

      1) Start with a root table (or tables) and retrieve data based on a
      filter. Read through the result set. For a given row, issue an event for
      the table element (and any attribute columns), then for any columns
      mapped to child elements. Then execute step 2 before proceeding to the
      next row.

      2) Retrieve data for each child table (complex child elements), based
      on the key values in the current row and any table filters. Recursively
      process the row according to step (1). Note that child tables should be
      processed according to their fixed order values of their corresponding
      elements (if any).

      DOMToDBMS is somewhat more complex. There are two problems. First, you
      need to know when it is OK to insert a row of data. That is, when have
      you retrieved all the data from the XML document for a single row. For
      example, consider the following sales order document:

      ... item children ...
      ... item children ...

      We can insert the row into the sales order table after we get the event
      for the Date element. The problem is that the mapping information
      doesn't tell us that the Date element is the last element containing
      data for the Orders table.

      Our second problem is referential integrity. Consider the following
      document, which is the same as our first document except that Date now
      appears after the Item elements:

      ... item children ...
      ... item children ...

      In this case, we still can insert the row until after the event for the
      Date element, but notice we have already processed the Item elements.
      Because the Items table has a foreign key pointing to a row in the
      Orders table, we can't insert any rows in the Items table until we have
      inserted the corresponding row in the Orders table. Since SAX processes
      documents linearly, this means we will have to cache the item data and
      insert it after we insert the order data.

      To give you an idea of the range of problems here, consider the
      following document in which the Customer element is complex and mapped
      to its own table:

      ...customer children...
      ... item children ...
      ... item children ...

      Again, because of referential integrity constraints, we can't insert the
      sales order row until we have inserted the customer row.

      So how do we solve these problems? Let's look first at the problem of
      knowing when it is safe to insert a row.

      DOMToDBMS solves this problem by simply reading to the end of a set of
      siblings. There is no memory hit for this because the document is
      already stored in memory as a DOM tree. In SAXToDBMS, we have several

      1) Follow the same strategy as DOMToDBMS. That is, extract the data from
      the document as we go and cache it as Row objects. In virtually all
      cases, this will require us to cache all of the data for the document.
      Depending on the ratio of tag size to data, this will probably require
      50-90% of the size of the document. A good solution for big documents,
      but won't work with truly large documents. Probably not too hard to

      2) Require all child elements to have fixed order values. This allows us
      to figure out when we have all the data for a row. (Or add a new element
      to the map DTD that tells us this information.) This will work for all
      documents, but I suspect it would be difficult to implement.
      Furthermore, the worst cases will require caching the entire document as
      in (1), so I see little real benefit.

      3) Require all child elements that are mapped as properties to be
      grouped together. This means that, as soon as we hit an element mapped
      as a related class (or the end of the parent), we can insert the row.
      This won't work for all documents. For example, it will work with the
      first document shown above but not the second. However, I suspect this
      is a common design pattern, so this might be a good 80/20 solution.
      Again, probably not too hard to implement.

      (Documents that don't match this pattern will need to be transformed.
      With XSLT, this usually means bringing the entire document into memory,
      which obviously doesn't work with very large documents.)

      Our second question is how to handle the referential integrity problems.
      There are really two choices:

      1) Cache data that depends on other data and insert it later. For
      example, in the third document, cache the order data until we have the
      Customer data. Insert the customer data, then insert the order data,
      then insert the item data. This is not too complex -- it is roughly what
      DOMToDBMS does today -- and it works for many documents.

      2) Require that the document be ordered such that data can be inserted
      as it is read. For example, the third document would have to have the
      customer data first, then the order data, then the item data. Although
      this is much easier to implement, it is not clear how many documents are
      structured this way, although I suspect that many data-centric documents
      are. This will work for any document where the primary key is always in
      the table of the parent element, but not for any documents where the
      primary key is ever in the table of the child element.

      Personally, I prefer (1) in this case. It is not that much harder than
      (2) to implement and works with a greater variety of documents. I don't
      think the caching will cause memory problems except in very rare cases.

      -- Ron

      rpbourret@... wrote:
      > > If you have a moment. I am up to my eyeballs between work and school, but
      > > this is an interesting and useful problem to solve and I'm willing to
      > > take a crack at it. I'm guessing there's a CVS tree somewhere? (I haven't
      > > looked for it.)
      > Nothing in the CVS tree, although there is one available at:
      > http://sourceforge.net/cvs/?group_id=13613
      > I'll send a separate email detailing the main design points.
    Your message has been successfully submitted and would be delivered to recipients shortly.