performances elements

  • Yannick Thebault
    Message 1 of 4 , Jan 13, 2004
      Hi all,

      I'm working on an archiving project. Its goal is to provide companies
      with archiving solutions for their business and legal documents
      (contracts, emails, invoices, ...) for the whole of their legal
      retention period.
      Because document formats differ widely (between emails and contracts,
      and between contracts from two different companies), we thought that
      an XML DBMS could be a solution for storing and indexing the documents.
      However, the volume of data will be very high, so performance
      (essentially insertion and indexing performance) is a very important
      criterion in choosing an XML database product.

      Does anybody have benchmarks for XML database products, especially
      open source ones?

      Thanks in advance,

      regards

      Yannick Thebault
    • Ronald Bourret
      Message 2 of 4 , Jan 14, 2004
        1) I don't know of any publicly published benchmarks. The problem is
        that many licenses do not allow benchmark results to be published.

        2) There are a number of benchmarks for XML and databases. For a list,
        see:

        http://www.rpbourret.com/xml/XMLDBLinks.htm#Benchmarks

        3) I'm not sure that XML-DBMS is a good solution for you. There are two
        problems:

        a) Many legal documents must be stored exactly. XML-DBMS decomposes
        documents and stores the data in tables. The original document is then
        discarded and cannot be reconstructed. (For example, comments,
        processing instructions, entity references, and DTDs are lost, as well
        as the document's name, date, etc.)

        b) Most contracts and emails contain lots of mixed content. Although
        XML-DBMS can handle mixed content, it is *very* inefficient.
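
As a rough illustration of point a), the sketch below parses a small document into a data structure and serializes it again. It uses Python's standard xml.etree.ElementTree parser as a stand-in, not XML-DBMS itself, and the sample contract is invented for the example. The element data survives, but the comment and the processing instruction do not, so the rebuilt text is not an exact copy of the original.

```python
import xml.etree.ElementTree as ET

# Invented sample document: a comment, a processing instruction,
# and a clause with mixed content (text interleaved with child elements).
original = (
    '<?xml version="1.0"?>\n'
    '<!-- signed copy, do not alter -->\n'
    '<contract id="c-1">\n'
    '  <?review status="final"?>\n'
    '  <clause>The buyer <b>shall</b> pay within 30 days.</clause>\n'
    '</contract>\n'
)

# Parsing keeps elements, attributes, and text; the default parser
# discards comments and processing instructions.
root = ET.fromstring(original)

# Serializing the parsed tree gives back the data, not the original bytes.
rebuilt = ET.tostring(root, encoding="unicode")

data_survives = root.find("clause/b").text == "shall"
copy_is_exact = rebuilt == original

print("data survives:", data_survives)
print("exact copy:   ", copy_is_exact)
```

Any tool that keeps only the decomposed data has the same limitation, which is why an untouched copy has to be stored separately when exactness matters.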

        4) Here are two possible solutions:

        a) Use a native XML database. These handle mixed content much more
        efficiently. Two popular Open Source native XML databases are eXist and
        Xindice. For a list of native XML databases, see
        http://www.rpbourret.com/xml/ProdsNative.htm. Note that some native XML
        databases can store exact copies of documents and some cannot. If this
        is important to you for legal reasons, make sure your native XML
        database can do this.

        b) Store the documents in a CLOB or BLOB column in a relational
        database. Note that this allows you to store an exact copy of the
        document.

        Oracle and DB2 both allow you to index individual values in XML
        documents stored in BLOB/CLOB columns. Both also have add-ons that can
        perform XML-aware full-text searches. (I think SQL Server supports this
        as well, but I'm not sure.) Furthermore, Oracle 9i release 2 can perform
        XPath queries over these documents. (Note that performance of queries
        that cannot be resolved using indexes alone will probably be poor.)

        If you don't have Oracle or DB2, you can also index documents stored in
        CLOB or BLOB columns yourself using "side tables". This is what DB2 does
        and the code shouldn't be too hard to implement yourself. You can then
        query the documents by querying the side tables. For more information,
        see http://www.rpbourret.com/xml/XMLAndDatabases.htm#blob
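
A minimal sketch of the side-table idea, assuming SQLite as the relational database and Python's standard XML parser. The table names, column names, the INDEXED_PATHS list, and the sample contract are all hypothetical and chosen only for illustration; this is not how DB2 or any other product actually implements it. The exact document goes into a BLOB column, and selected values are copied into an indexed side table that is then used for queries.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: one table for exact copies, one side table for values.
SCHEMA = """
CREATE TABLE documents (
    doc_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    body   BLOB NOT NULL
);
CREATE TABLE doc_values (
    doc_id INTEGER NOT NULL REFERENCES documents(doc_id),
    path   TEXT NOT NULL,
    value  TEXT NOT NULL
);
CREATE INDEX doc_values_lookup ON doc_values (path, value);
"""

# Hypothetical element paths (relative to the root) worth indexing.
INDEXED_PATHS = ["party", "invoice/number"]


def store_document(conn, name, raw):
    """Store the untouched bytes, then copy selected values to the side table."""
    cur = conn.execute(
        "INSERT INTO documents (name, body) VALUES (?, ?)", (name, raw)
    )
    doc_id = cur.lastrowid
    root = ET.fromstring(raw)
    for path in INDEXED_PATHS:
        for elem in root.findall(path):
            if elem.text and elem.text.strip():
                conn.execute(
                    "INSERT INTO doc_values (doc_id, path, value) VALUES (?, ?, ?)",
                    (doc_id, path, elem.text.strip()),
                )
    return doc_id


def find_documents(conn, path, value):
    """Query the side table and return (name, exact bytes) of matching documents."""
    return conn.execute(
        "SELECT d.name, d.body FROM documents d "
        "JOIN doc_values v ON v.doc_id = d.doc_id "
        "WHERE v.path = ? AND v.value = ? ORDER BY d.doc_id",
        (path, value),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

raw = (
    b'<?xml version="1.0"?><!-- original -->'
    b"<contract><party>Acme</party><party>Globex</party></contract>"
)
store_document(conn, "contract-001.xml", raw)
matches = find_documents(conn, "party", "Globex")
print(matches[0][0])
```

The query never parses the stored documents; it touches only the side table and its index, and the bytes returned are the ones that were stored, comment included.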

        -- Ron

      • Yannick Thebault
        Message 3 of 4 , Jan 15, 2004
          Hi,

          Thanks for your answer.
          Our target is to use a native XML database, but because we will need
          to manage many terabytes of data, we need to define the target
          architecture carefully.

          I think we will try to run one of the benchmarks you pointed me to.

          We will solve the problem of legal documents being decomposed by
          attaching a scanned image or an electronic copy of each document. In
          effect, we will store each document twice.
          The problem is that our customers (the companies that store their
          documents with us) are asking us for a full-text search capability.

          If that is all right with you, I could send you the architecture of
          our project so that you could give us your feedback on it.

          Best regards
          Yannick

        • Ronald Bourret
          Message 4 of 4 , Jan 16, 2004
            Just so you understand. This mailing list is for a product named
            XML-DBMS. This product transfers data between XML documents and
            relational databases using an object-relational mapping (similar to that
            found in Oracle, DB2, and SQL Server). It is *not* a native XML
            database. (I wish now I had used a different name, but it's too
            late...)

            Now on to your questions.

            One problem you may have is finding a native XML database -- especially
            an Open Source database -- that can handle terabytes of data. While some
            native XML databases can handle this much data, not all of them can.

            A number of native XML databases (including eXist but not Xindice) also
            support full-text search, so this may solve your problems in that
            regard. (If you only need XML-aware full-text searches, you should
            consider TEXTML Server [1], which stores documents in text form and
            indexes them. This saves you from having to store the documents twice
            and storage in text form is smaller than storage in other forms,
            although the indexes will add size.)

            You are welcome to send me your architecture for review. However, I have
            never used a native XML database (I work on XML-enabled databases), so
            the amount of help I can give you is limited.

            -- Ron

            [1] http://www.rpbourret.com/xml/ProdsNative.htm#textml
