
RFD: XML schema for the Tipitaka and Pali literature

  • ong.yongpeng
    Message 1 of 11 , Aug 24, 2008
      Dear friends

      REQUEST FOR DISCUSSION (RFD)

      An XML schema for the Tipitaka and Pali literature will provide a
      standard description of the structure and content of any Pali text,
      e.g. a sutta, a gathaa, etc.

      Has anyone worked on, or thought of working on, the
      development of an XML schema for the Tipitaka and Pali literature?

      If so, this list may provide a starting point for discussion and
      exchange of ideas.

      I am currently working on such a proposal which will be posted online
      via the PDF page. http://www.tipitaka.net/forge/pdf/

      I am also happy to discuss with anyone interested in working together and
      contributing to it. Contributors will be named on the proposal.

      metta,
      Yong Peng.
    • Jon Fernquest
      Message 2 of 11 , Aug 25, 2008
        Dear Yong Peng;

        Sounds like a good idea. Would make it much easier to reformat and repurpose Tipitaka text, for example, matching the Pali original with translations in various languages, or with renditions of the Pali in different scripts.

        If the Pali were indexed by verse, it would be relatively easy to
        interleave the Pali with a translation.
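
A minimal sketch of that interleaving, assuming both the Pali and the translation are already keyed by verse number. The function name `interleave` and the sample data (first lines of Dhammapada 1 and 5, with a rough English gloss for verse 1 only) are illustrative assumptions, not part of any existing tool.

```python
def interleave(pali, translation):
    """Return (verse number, Pali, translation) rows in verse order.

    A verse missing from the translation gets an empty string, so gaps
    stay visible instead of shifting later verses out of step.
    """
    return [(n, pali[n], translation.get(n, "")) for n in sorted(pali)]


# Illustrative sample data keyed by verse number (assumed layout).
pali = {
    5: "Na hi verena verāni, sammantīdha kudācanaṃ;",
    1: "Manopubbaṅgamā dhammā, manoseṭṭhā manomayā;",
}
english = {
    1: "Mind precedes all phenomena; mind is their chief, they are mind-made.",
}

rows = interleave(pali, english)
for number, source, gloss in rows:
    print(f"{number}. {source}")
    print(f"   {gloss}")
```

Because the pairing relies only on the verse number, the same function would work for a rendition of the Pali in another script in place of the translation.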

        Some of the groundwork for this sort of thing must already have
        been done:

        Text Encoding Initiative:
        http://en.wikipedia.org/wiki/Text_Encoding_Initiative

        Open Scripture Information Standard:
        http://en.wikipedia.org/wiki/Open_Scripture_Information_Standard

        Knuth's TeX is a powerful open-source typesetting system that perhaps should be used:

        http://en.wikipedia.org/wiki/TeX

        I haven't checked recently whether, like Adobe's FrameMaker, it can take XML as input.

        With Metta
        Jon
      • ong.yongpeng
        Message 3 of 11 , Aug 25, 2008
          Dear Jon and friends,

          thank you. TEI is indeed one which we can learn from, especially its
          experience in the adoption of XML.

          As you mention, well-defined schema(s) would facilitate better
          exchange of Pali texts, and also scholarly works in translations,
          comparative studies, etc. Verse by verse capture of text is possible
          in XML, given its high degree of flexibility. I think the
          possibilities are endless.

          The schema(s) will remain open and free for any website or group to
          use. It will also remain a living document, to be frequently updated
          to reflect current knowledge and technology in text encoding.

          I believe such schema(s) would be very helpful in the development of
          other projects on tipitaka.net.

          Given the nature of your profession, your input to the development of
          the proposed schema(s) would be valuable. Would you be keen to be the
          editor?

          I believe there are also other members in the group who are able to
          provide important feedback too. However, allow me to first lay the
          groundwork. ;-)

          This week, I will present a draft to the group. The draft shall
          include your first suggestion, i.e. verse-by-verse text encoding. In
          the meantime, please keep your mails coming. Thank you.

          metta,
          Yong Peng.


          --- In Pali@yahoogroups.com, Jon Fernquest wrote:

          Sounds like a good idea. Would make it much easier to reformat and
          repurpose Tipitaka text, for example, matching the Pali original with
          translations in various languages, or with renditions of the Pali in
          different scripts.

          If the Pali was indexed by the verse it would be relatively easy to
          interleave the Pali with a translation.
        • Frank Snow
          Message 4 of 11 , Aug 25, 2008
            The VRI texts included with CST4 (
            http://www.tipitaka.org/cst/installation/ ) are in Text Encoding
            Initiative format. I used TEI Lite as a starting point:
            http://www.tei-c.org/Guidelines/Customization/Lite/ . Our current
            markup could be called "very lite".

            with metta,
            Frank


            --- In Pali@yahoogroups.com, "ong.yongpeng" <pali.smith@...> wrote:
            >
            > Dear friends
            >
            > REQUEST FOR DISCUSSION (RFD)
            >
            > An XML schema for the Tipitaka and Pali literature will provide a
            > standard description to the structure and content of any Pali text,
            > e.g. a sutta, a gathaa, etc.
            >
            > Is there anyone who had worked or thought of working on the
            > development of an XML schema for the Tipitaka and Pali literature?
            >
            > If so, this list may provide a starting point for discussion and
            > exchange of ideas.
            >
            > I am currently working on such a proposal which will be posted online
            > via the PDF page. http://www.tipitaka.net/forge/pdf/
            >
            > I am also happy to discuss with anyone interested to work together and
            > contribute to it. Contributors will be named on the proposal.
            >
            > metta,
            > Yong Peng.
            >
          • Jon Fernquest
            Message 5 of 11 , Aug 27, 2008
              Dear Frank and Yong Peng;

              Thanks for the information.

              "Our current markup could be called 'very lite'."

              This seems to be the best way to go on small projects.

              Use the minimum markup necessary, branching off,
              and adding markup for special purposes such as
              interlinear translation:

              http://linguistlist.org/emeld/workshop/2003/bowbadenbird-paper.html

              Using the XML tags for interlinear translation, a simple XSLT programme
              could format it into a set of web pages using CSS.
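
A rough sketch of that transformation, written in Python with the standard library's `xml.etree.ElementTree` standing in for XSLT. The element names (`interlinear`, `phrase`, `word`, `form`, `gloss`) and the CSS class names are assumptions made for illustration; they are not taken from the interlinear model linked above.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical interlinear markup: each word carries a form and a gloss.
SAMPLE = """\
<interlinear>
  <phrase n="1">
    <word><form>evaṃ</form><gloss>thus</gloss></word>
    <word><form>me</form><gloss>by me</gloss></word>
    <word><form>sutaṃ</form><gloss>heard</gloss></word>
  </phrase>
</interlinear>
"""


def to_html(xml_text):
    """Turn each phrase into a paragraph of word/gloss spans for CSS styling."""
    root = ET.fromstring(xml_text)
    parts = []
    for phrase in root.iter("phrase"):
        parts.append(f'<p class="phrase" id="p{escape(phrase.get("n", ""))}">')
        for word in phrase.iter("word"):
            form = escape(word.findtext("form", default=""))
            gloss = escape(word.findtext("gloss", default=""))
            parts.append(
                f'  <span class="word"><span class="form">{form}</span>'
                f'<span class="gloss">{gloss}</span></span>'
            )
        parts.append("</p>")
    return "\n".join(parts)


html_fragment = to_html(SAMPLE)
print(html_fragment)
```

A stylesheet could then stack each gloss under its form, for example by making `.word` an inline block and `.form` and `.gloss` block-level children.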

              Are there standard divisions of the Tipitaka into verses?

              (I know the Pali Text Society translations of Rhys Davids
              break texts into verses, but more generally?)

              With metta,
              Jon Fernquest
            • ong.yongpeng
              Message 6 of 11 , Aug 27, 2008
                Dear Frank and Jon,

                thanks for the information, Frank.

                My current understanding of TEI is still limited, and I will be
                spending some time going through it as part of the work I plan for
                PDF. Still, I had a look at the DTD of the Lite version, and there are
                at least a few hundred elements and attributes, which already sent my
                head spinning. ;-) I can understand why your markup for CST4 is "very
                lite".

                TEI is a very generalised scheme which allows a very high level of
                customisation. The names it uses for the elements and attributes are
                also generalised, and tend towards verbosity.

                Adopting TEI increases the chances of interchangeability with other
                similarly encoded texts. However, TEI may be overly complex for the
                technically challenged.

                I would prefer to use a non-TEI native scheme for Pali texts, and then
                employ tools to convert to TEI as and when needed. A specially
                developed scheme for Pali texts can use names that are more natural
                than those from TEI. An important aspect, the metadata, may also be
                better captured with a schema specific to the Pali texts. It is
                probably XML, not TEI, that we seek to apply to the Pali texts.
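
One way such a conversion tool might look, as a sketch only: a recursive rename from native element names to TEI ones. The native names on the left of the mapping are hypothetical; the TEI names on the right (`div`, `head`, `p`, `s`, `lg`, `l`) are standard TEI elements. A complete TEI document would also need a header and the TEI namespace, which this sketch leaves out.

```python
import xml.etree.ElementTree as ET

# Hypothetical native names on the left; TEI element names on the right.
NATIVE_TO_TEI = {
    "section": "div",
    "heading": "head",
    "paragraph": "p",
    "sentence": "s",
    "verse": "lg",
    "line": "l",
}


def to_tei(element):
    """Recursively copy an element, renaming tags that have a TEI mapping."""
    tag = NATIVE_TO_TEI.get(element.tag, element.tag)
    converted = ET.Element(tag, dict(element.attrib))
    converted.text = element.text
    converted.tail = element.tail
    for child in element:
        converted.append(to_tei(child))
    return converted


native = ET.fromstring(
    "<section><heading>Yamakavagga</heading>"
    "<verse><line>Manopubbaṅgamā dhammā, manoseṭṭhā manomayā;</line></verse>"
    "</section>"
)
tei = to_tei(native)
print(ET.tostring(tei, encoding="unicode"))
```

Keeping the mapping in one table means the native names can stay natural for Pali texts while the TEI export remains a mechanical step.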

                I also intend to build in some degree of knowledge management, instead
                of simply text markup, into the proposed schema(s). However, I would
                be happy to learn what you have already done with CST4, and
                collaborate with you in those aspects where we see advantages in having
                common definitions.

                metta,
                Yong Peng.


                --- In Pali@yahoogroups.com, Frank Snow wrote:

                The VRI texts included with CST4
                (http://www.tipitaka.org/cst/installation/ ) are in Text Encoding
                Initiative format. I used TEI Lite as a starting point:
                http://www.tei-c.org/Guidelines/Customization/Lite/ . Our current
                markup could be called "very lite".
              • ong.yongpeng
                Message 7 of 11 , Aug 28, 2008
                  Dear Jon and Frank,

                  thanks again, Jon, for the discussion.

                  I have come across texts which are "partitioned" (into paragraphs)
                  slightly differently in different editions. I think the logical
                  blocks, i.e. sentences of prose and lines (using meter) of verse, are
                  adequate for interlinear translation.
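
A small sketch of splitting text into those logical blocks, assuming romanised Pali with Western punctuation and one verse line per text line. Editions punctuate differently, so the rules here are illustrative assumptions rather than a general solution; the function names are hypothetical.

```python
import re


def split_sentences(prose):
    """Split prose after '.', '?' or '!' when followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", prose.strip())
    return [part for part in parts if part]


def split_lines(verse):
    """Treat each non-empty line of a verse as one block."""
    return [line.strip() for line in verse.splitlines() if line.strip()]


prose = (
    "Evaṃ me sutaṃ. Ekaṃ samayaṃ bhagavā sāvatthiyaṃ viharati "
    "jetavane anāthapiṇḍikassa ārāme."
)
verse = """\
Manopubbaṅgamā dhammā, manoseṭṭhā manomayā;
Manasā ce paduṭṭhena, bhāsati vā karoti vā;
Tato naṃ dukkhamanveti, cakkaṃva vahato padaṃ.
"""

sentences = split_sentences(prose)
lines = split_lines(verse)
print(sentences)
print(lines)
```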

                  metta,
                  Yong Peng.


                  --- In Pali@yahoogroups.com, Jon Fernquest wrote:

                  Are there standard divisions of the Tipitaka into verses?
                • Jon Fernquest
                  Message 8 of 11 , Aug 28, 2008
                    Dear Yong Peng;

                    I think you are on the right track with XML.

                    The criterion for using a given technology should be whether it makes
                    life easier. The interlinear XML resources seem to do this since they
                    already have style sheets:

                    http://www.cs.mu.oz.au/research/lt/projects/interlinear/#STYLESHEETS

                    If the XML markup can be easily transformed using scripting languages
                    like Perl or XSLT, the XML has added value and made life easier.

                    That individual Suttantas aren't identified in citations to Pali Text
                    Society texts always seemed strange to me, because that is the
                    greatest common denominator whatever script, language, or version you
                    are reading the Tipitaka in. The natural citation would be (Suttanta,
                    verse number). Putting a Suttanta in a given language (Pali, English)
                    in a plain text file would be logical.

                    Then the next step could be to break the text into verses or blocks of
                    text and then link the verses between the two different language
                    versions. TEI has cross-referencing tags that might provide a nice
                    standard and convenient way of doing this:

                    http://www.tei-c.org/Guidelines/P4/html/SA.html

                    But all that really seems necessary is to number the paragraphs so
                    they can be matched. Statistical text alignment programmes that are
                    used to make so-called "parallel corpora" could provide this
                    alignment or it could be done manually.
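
The manual route could be as simple as the following sketch, which numbers the paragraphs of two plain-text files and pairs them under a (Suttanta, paragraph number) citation key. It assumes paragraphs are separated by blank lines and that both versions have the same number of paragraphs; the function names, the "Sample Suttanta" label, and the placeholder second paragraphs are hypothetical.

```python
import re


def number_paragraphs(text):
    """Split plain text on blank lines and number the paragraphs from 1."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return {n: " ".join(block.split()) for n, block in enumerate(blocks, start=1)}


def match_paragraphs(suttanta, source, target):
    """Pair paragraphs by number under a (suttanta, number) citation key."""
    src = number_paragraphs(source)
    tgt = number_paragraphs(target)
    if len(src) != len(tgt):
        raise ValueError(
            f"{suttanta}: {len(src)} source vs {len(tgt)} target paragraphs"
        )
    return {(suttanta, n): (src[n], tgt[n]) for n in src}


# Placeholder texts; a real file would hold one Suttanta per language.
pali_text = "Evaṃ me sutaṃ.\n\n(second Pali paragraph)"
english_text = "Thus have I heard.\n\n(second English paragraph)"

matched = match_paragraphs("Sample Suttanta", pali_text, english_text)
for (name, number), (pali, english) in matched.items():
    print(f"({name}, {number}) {pali} | {english}")
```

When the paragraph counts differ, the sketch stops with an error rather than guessing, which is the point where a statistical aligner or a human editor would have to step in.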

                    Once we have the alignment, we have to think of how to print it in
                    a convenient and instructive way to read. We might want to design a
                    parallel language scripture format, like the one in the parallel
                    Thai-English bible I mentioned once. Or the interlinear strategy this
                    site has used in the keys to the Pali textbook questions (which can be
                    used as learning material itself).

                    Looking for technology transfer opportunities from bible translators
                    and printers is also natural since they have been doing this for a lot
                    longer. The natural objection would be that one is creating some
                    so-called protestant Buddhism and that Pali is mainly an oral
                    tradition. Elan provides a way of syncing audio with text:

                    http://www.lat-mpi.eu/tools/elan/

                    With metta,
                    Jon Fernquest

                    (Note: I seem to remember that the Pali Text Society translation does
                    not maintain the original punctuation in the Burmese or Sri Lankan
                    script original version. Will have to look into this.)
                  • Jon Fernquest
                    Message 9 of 11 , Aug 28, 2008
                      Dear Yong Peng;

                      Here is an interesting paper on aligning Thai and English text using
                      statistical methods based on word frequency. Uses minimal info and is
                      very simple. Used to align bible text:

                      http://www.iait2007.org/Proceedings/P00192.pdf

                      There are probably some literature review papers that summarize
                      these algorithms.

                      Dan Melamed publishes a lot in this area:

                      http://www.cs.nyu.edu/~melamed/pubs.html

                      Good articles in the following book on Google Books:
                      Parallel Text Processing By Jean Véronis
                      http://books.google.com/books?id=I_4FPNS-RrEC&pg=PA25&lpg=PA25&dq=parallel+corpus+bitext+dan+melamed&source=web&ots=PcruOHrGTV&sig=EWNAc82On72khdWf36cfCEuNF_A&hl=en&sa=X&oi=book_result&resnum=6&ct=result#PPA4,M1

                      With metta,
                      Jon
                    • ong.yongpeng
                      Message 10 of 11 , Aug 31, 2008
                        Dear Jon, Frank and friends,

                        as discussed earlier, I have prepared a very rough draft, which is
                        meant to highlight several of the points I have mentioned as
                        characteristics of the schema I intend to develop. I believe more work
                        is required to make it a working schema. Here it is:

                        <!ELEMENT pali_texts (text*)>
                        <!ELEMENT text (title+, principal, order, superorder*, suborder*,
                        ordinality, section*)>
                        <!ELEMENT title (#PCDATA)>
                        <!ATTLIST title language CDATA #IMPLIED>
                        <!ELEMENT principal (#PCDATA)>
                        <!ATTLIST principal edition CDATA #IMPLIED>
                        <!ELEMENT order (#PCDATA)>
                        <!ELEMENT superorder (#PCDATA)>
                        <!ELEMENT suborder (#PCDATA)>
                        <!ELEMENT ordinality (#PCDATA)>
                        <!ELEMENT section (heading?, contents*)>
                        <!ELEMENT heading (#PCDATA)>
                        <!ELEMENT contents (paragraph|verse)*>
                        <!ELEMENT paragraph (sentence+)>
                        <!ELEMENT sentence (#PCDATA)>
                        <!ELEMENT verse (line+)>
                        <!ELEMENT line (#PCDATA)>
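
To make the draft concrete, here is a sketch of an instance document that follows the element declarations above, read back with Python's standard library. The values in parentheses are placeholders, since the draft does not yet say what `principal`, `order`, and `ordinality` should hold, and the `language="pi"` value is an assumed convention. The standard library parser does not validate against a DTD, so the `missing_children` helper is only a hypothetical spot check, not real validation.

```python
import xml.etree.ElementTree as ET

# Sample instance following the draft's content models (placeholder metadata).
SAMPLE = """\
<pali_texts>
  <text>
    <title language="pi">Dhammapada</title>
    <principal edition="(edition)">(principal)</principal>
    <order>(order)</order>
    <ordinality>(ordinality)</ordinality>
    <section>
      <heading>Yamakavagga</heading>
      <contents>
        <verse>
          <line>Manopubbaṅgamā dhammā, manoseṭṭhā manomayā;</line>
          <line>Manasā ce paduṭṭhena, bhāsati vā karoti vā;</line>
          <line>Tato naṃ dukkhamanveti, cakkaṃva vahato padaṃ.</line>
        </verse>
      </contents>
    </section>
  </text>
</pali_texts>
"""

# Children the draft requires at least once inside <text>.
REQUIRED_IN_TEXT = ("title", "principal", "order", "ordinality")


def missing_children(text_element):
    """List required child elements that are absent from a <text> element."""
    present = {child.tag for child in text_element}
    return [name for name in REQUIRED_IN_TEXT if name not in present]


root = ET.fromstring(SAMPLE)
for text in root.findall("text"):
    print(text.findtext("title"), "missing:", missing_children(text))
    for verse in text.iter("verse"):
        for number, line in enumerate(verse.findall("line"), start=1):
            print(f"  line {number}: {line.text}")

verse_lines = [line.text for line in root.iter("line")]
```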

                        As we would not wish to turn this into a programming list, I have
                        created a mailing list to continue our discussion on this topic. All
                        members are welcome to participate on the new list. Jon and Frank will
                        receive an invitation. For more details, please visit:
                        https://lists.sourceforge.net/lists/listinfo/suttasenze-pdf-xml


                        metta,
                        Yong Peng.

                        --- In Pali@yahoogroups.com, ong.yongpeng wrote:

                        This week, I will present a draft to the group. The draft shall
                        include your first suggestion, i.e. verse-by-verse text encoding.
                      • Ong Yong Peng
                        Message 11 of 11 , Aug 10, 2009
                          Dear friends,

                          Re: RFD: XML schema for the Tipitaka and Pali literature

                          Last year we started the discussion of developing an XML schema for the Tipitaka and Pali literature. Things went a bit quiet after a while.

                          Now, I hope to reignite the project with a new name: Pali Data Architecture (PDA). My hope is that PDA remains a group project, and I invite anyone with the passion to take up a leading role in it. All queries are welcome.

                          http://groups.yahoo.com/group/Pali/message/12798


                          metta,
                          Yong Peng.