Parsing XML with Lex, Yacc, Nicolas and Tony

  • stalkern2
    Message 1 of 2 , May 2, 2002
      Hello to everybody

      I've been studying the beautiful example of XML-oriented parsing written by
      Nicolas Cannasse and called XML Light.

      -----------------------------------------------------------------------------------
      Code
      -----------------------------------------------------------------------------------
      1) Check out http://warplayer.free.fr for the original version of XML Light.
      2) Check out http://www.gaertner.de/~lindig/software/tony.html for the
      original version of Tony (another XML parser, now unmaintained as far as I know).
      3) Wait for a new message from me to download and look at my changes to
      XML Light, which allow text nodes and in-tag/out-of-tag parsing (I'm waiting for
      an OK on a name from Nicolas Cannasse, the author of XML Light).


      ----------------------------------------------------------------------------------------------
      What I've learnt/noticed/thought
      ----------------------------------------------------------------------------------------------

      1) Nicolas has been clever with the tokenizer: he uses several scanning
      actions and nests them when the content to tokenize is nested, such as the
      text between two doublequotes. E.g. in a string like
      key="value"
      the token <key> can be taken as a NAME, but the doublequoted token <"value">
      should be taken as a VALUE, not as a doublequoted NAME.
      To do so, Nicolas scans the input with the NAME-detecting action, but as
      soon as it finds a doublequote, it hands control of the scanning to a different,
      VALUE-detecting action. When this VALUE-detecting action finds a doublequote,
      or rather the "closing" doublequote, it hands control back to the
      NAME-detecting action and returns what it found before the doublequote as a
      VALUE, just as desired.
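
      The hand-over between the two actions can be sketched without ocamllex as
      two mutually recursive functions. This is only a minimal illustration, not
      the XML Light lexer: the token constructors and the function names
      (scan_names, scan_value, tokenize) are made up for the example.

```ocaml
(* Hypothetical token type; the names are not taken from XML Light. *)
type token = NAME of string | VALUE of string | EQ

let is_name_char c =
  match c with
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' | '-' | ':' | '.' -> true
  | _ -> false

(* NAME-detecting action: scan s from index i, collecting tokens in acc. *)
let rec scan_names s i acc =
  if i >= String.length s then List.rev acc
  else
    match s.[i] with
    | ' ' | '\t' | '\n' -> scan_names s (i + 1) acc
    | '=' -> scan_names s (i + 1) (EQ :: acc)
    (* An opening doublequote hands control to the VALUE-detecting action. *)
    | '"' -> scan_value s (i + 1) (Buffer.create 16) acc
    | c when is_name_char c ->
        let j = ref i in
        while !j < String.length s && is_name_char s.[!j] do incr j done;
        scan_names s !j (NAME (String.sub s i (!j - i)) :: acc)
    | c -> failwith (Printf.sprintf "unexpected character %C" c)

(* VALUE-detecting action: collect everything up to the closing doublequote,
   emit a VALUE token, then hand control back to scan_names. *)
and scan_value s i buf acc =
  if i >= String.length s then failwith "unterminated value"
  else
    match s.[i] with
    | '"' -> scan_names s (i + 1) (VALUE (Buffer.contents buf) :: acc)
    | c -> Buffer.add_char buf c; scan_value s (i + 1) buf acc

let tokenize s = scan_names s 0 []
```

      With this sketch, tokenize "key=\"value\"" gives [NAME "key"; EQ; VALUE "value"],
      and an '=' inside the doublequotes stays part of the VALUE.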

      2) Nicolas has also been clever with the parser: its rule looks for a
      closing tag before looking for more deeply nested tags.
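
      As a rough analogy only (the real thing is a yacc grammar, which I am not
      reproducing here), the "closing tag first" idea looks like this in a
      hand-written recursive parser. The token and tree types below are simplified
      assumptions: no attributes, no text.

```ocaml
(* Simplified, hypothetical types: tags only, no attributes and no text. *)
type token = OPEN of string | CLOSE of string
type xml = Element of string * xml list

(* Parse the children of the element called name. The closing tag is tested
   first; only if it is absent do we descend into a nested element. *)
let rec children name toks =
  match toks with
  | CLOSE n :: rest when n = name -> ([], rest)
  | CLOSE n :: _ -> failwith ("mismatched closing tag " ^ n)
  | OPEN n :: rest ->
      let kids, rest = children n rest in
      let siblings, rest = children name rest in
      (Element (n, kids) :: siblings, rest)
  | [] -> failwith ("missing closing tag for " ^ name)

(* Parse one complete document: a single root element and nothing after it. *)
let parse toks =
  match toks with
  | OPEN n :: rest ->
      (match children n rest with
       | kids, [] -> Element (n, kids)
       | _, _ :: _ -> failwith "trailing tokens")
  | _ -> failwith "expected an opening tag"
```

      Testing for the closing tag first gives the recursion its base case, so
      each element ends at its own closing tag.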

      3) I wanted to recognize text nodes also, so I've changed the type of Xml
      from
      type xml = string * (string * string) list * xml list
      to
      type xmlLeafTypes = Text | Header | Comment | Code
      type xmlNodeType =
      XmlLeafTypeBuild of (xmlLeafTypes * string)
      | XmlNodeTypeBuild of (string * (string * string) list * xmlNodeType list)

      That means XML is made of nodes AND leaves, i.e. terminal nodes, i.e. nodes
      without children. In human-readable form, an XML node is either a leaf, that
      is a string of one of the types
      Text,
      <?xml Header ?>,
      <!-- Comment --> or
      <%code%>,
      or
      a node with
      a name,
      a list of key-value attributes and
      a list of children, which can be either nodes or leaves,
      as in the first version of XML Light.
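
      To make the new type concrete, here is a small value built with it and a
      printer that turns it back into text. The two type definitions are the ones
      given above; to_string and example are hypothetical helpers written for
      this illustration, and the printer does no escaping.

```ocaml
type xmlLeafTypes = Text | Header | Comment | Code

type xmlNodeType =
  | XmlLeafTypeBuild of (xmlLeafTypes * string)
  | XmlNodeTypeBuild of (string * (string * string) list * xmlNodeType list)

(* Hypothetical printer, not part of XML Light: render a tree back to text.
   No escaping of special characters is attempted. *)
let rec to_string node =
  match node with
  | XmlLeafTypeBuild (Text, s) -> s
  | XmlLeafTypeBuild (Header, s) -> "<?xml " ^ s ^ " ?>"
  | XmlLeafTypeBuild (Comment, s) -> "<!--" ^ s ^ "-->"
  | XmlLeafTypeBuild (Code, s) -> "<%" ^ s ^ "%>"
  | XmlNodeTypeBuild (name, attrs, children) ->
      let attr (k, v) = " " ^ k ^ "=\"" ^ v ^ "\"" in
      "<" ^ name
      ^ String.concat "" (List.map attr attrs)
      ^ ">"
      ^ String.concat "" (List.map to_string children)
      ^ "</" ^ name ^ ">"

(* A node with one attribute and two children: a text leaf and a nested node. *)
let example =
  XmlNodeTypeBuild
    ("p", [ ("class", "intro") ],
     [ XmlLeafTypeBuild (Text, "tag = the joy of parsing ");
       XmlNodeTypeBuild ("b", [], [ XmlLeafTypeBuild (Text, "bold") ]) ])
```

      Here to_string example returns
      <p class="intro">tag = the joy of parsing <b>bold</b></p>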

      4) Detecting text leaves requires more careful scanning. Take for instance
      the equals character '=': inside a tag it binds a key to its value, while in a
      text leaf it is a "dumb" character like 'a', 'b' or '}'. So it
      deserves a token of its own when it's inside a tag, and no special token
      when it's outside one. The serious problem with scanning is that
      one can't rewind: what is not tokenized on the fly is lost. And some tokens
      can start identically, such as an IDENTIFIER named "tag" and a TEXT leaf
      such as "tag = the joy of parsing ". In sum, one has to know the right
      scanning style IN ADVANCE... so, how shall we set different scanning styles
      in advance?
      This is generally achieved by means of global variables and triggers such as, in
      XML, '<' and '>'. The hint here, as in Christian Lindig's Tony XML
      parser, is to have the scanner ruled by the parser. Huh? I used to think of
      the scanner as something that precedes the parser... But in fact only the parser
      deals with semantic analysis; the scanner is just a filter.
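
      A minimal sketch of a scanner ruled by its caller follows. It is neither
      Tony nor XML Light code: the mode type, the token names and the functions
      are assumptions made for the example, and the "parser" is reduced to a loop
      that only tracks whether it is inside a tag. Instead of a global variable,
      the scanning style is passed as an argument on every call.

```ocaml
(* Hypothetical names throughout; this is neither Tony nor XML Light code. *)
type mode = In_tag | Out_of_tag

type token = LT | GT | EQ | IDENT of string | TEXT of string | EOF

let is_ident_char = function
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' | '-' -> true
  | _ -> false

(* Index of the first position at or after i where stop holds, or the end of s. *)
let rec find_stop stop s i =
  if i < String.length s && not (stop s.[i]) then find_stop stop s (i + 1) else i

(* The caller, i.e. the parser, says which scanning style applies. *)
let rec next_token mode s i =
  if i >= String.length s then (EOF, i)
  else
    match mode, s.[i] with
    | Out_of_tag, '<' -> (LT, i + 1)
    | In_tag, '>' -> (GT, i + 1)
    | In_tag, '=' -> (EQ, i + 1)
    | In_tag, ' ' -> next_token mode s (i + 1)
    | In_tag, c when is_ident_char c ->
        let j = find_stop (fun c -> not (is_ident_char c)) s i in
        (IDENT (String.sub s i (j - i)), j)
    | In_tag, c -> failwith (Printf.sprintf "unexpected %C in tag" c)
    (* Outside a tag, everything up to the next '<' is one TEXT token. *)
    | Out_of_tag, _ ->
        let j = find_stop (fun c -> c = '<') s i in
        (TEXT (String.sub s i (j - i)), j)

(* Stand-in for the parser: it switches the mode on '<' and '>' and passes
   that knowledge down to the scanner on every call. *)
let tokens s =
  let rec loop mode i acc =
    match next_token mode s i with
    | (EOF, _) -> List.rev acc
    | (LT, j) -> loop In_tag j (LT :: acc)
    | (GT, j) -> loop Out_of_tag j (GT :: acc)
    | (t, j) -> loop mode j (t :: acc)
  in
  loop Out_of_tag 0 []
```

      On the input "<a tag=x>tag = the joy" this gives
      [LT; IDENT "a"; IDENT "tag"; EQ; IDENT "x"; GT; TEXT "tag = the joy"]:
      the same characters "tag" and "=" are tokenized differently depending on the
      mode the caller asked for.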

      That's it my friends!
      Ciao
      Ernesto
    • stalkern2
      Message 2 of 2 , May 9, 2002
        On Thursday 02 May 2002 12:32, you wrote:
        > 3) Wait for a new message from me for downloading and seeing my retouches
        > of XML Light allowing text nodes and in-Tag/out-of-Tag parsing (I'm waiting
        > for an OK for a name from Nicolas Cannasse, the author of XML Light)


        I'm now using the name "xml mild" for the sake of putting the code online.
        You can download xml mild, which is not a library of its own but rather an
        enhanced form of xml light, at http://membres.lycos.fr/ipotesi/OCAML/
        By the way, I discovered a very strange behaviour in my previous release:
        the parser couldn't recognize text children placed to the right of tag
        children at a nesting level > 1, even though the grammar allowed that. I think
        that this may be related to the management of states in yacc; I've
        (blindly) moved this case from one production rule to another, and now it
        works.


        Ciao
        Ernesto