Re: remove and clean CDATA out of xml

  • Tim Chase
    Message 1 of 11 , Feb 1, 2010
      bw wrote:
      > I am looking for a way to remove the CDATA and only get the text.
      > CURRENT:
      > <add>
      > <doc>
      > <some_title>My title</some_title>
      > <content><![[CDATA[
      > <p>The <strong>keyword</strong> is nice to have but is not needed to
      > include in a solr feed</p><p><table cellspacing="2" cellpadding="2"
      > border="1" width="100%"><tbody><tr><td>Étape 1 :</td></tr>
      > ]]></content>
      > </doc>
      > <doc>
      > ....
      > </doc>
      > </add>
      >
      > WANTED:
      > <add>
      > <doc>
      > <some_title>My title</some_title>
      > <content>The keyword is nice to have but is not needed to
      > include in a solr feed

      what happens to the rest of the content here?

      > </content>
      > </doc>
      > <doc>
      > ....
      > </doc>
      > </add>
      >
      > any vim tricks to do this?

      You might be able to do something like

      :%s/<!\[\[CDATA\[\(\%(\%(]]>\)\@!\_.\)\{-}\)]]>/\=substitute(submatch(1), '<[^>]*>', '', 'g')/g

      (all on one line)
      It doesn't post-process XML entities, but otherwise, it worked on
      your example...
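
For comparison outside Vim, the same idea can be sketched in Python. This is only an illustration and was not posted in the thread: the names `strip_cdata`, `CDATA` and `TAG` and the sample string are made up, and the pattern deliberately targets the `<![[CDATA[` spelling used in the example feed rather than the standard `<![CDATA[`.

```python
import re

# Opener as spelled in the example feed, then the shortest run up to "]]>".
# re.DOTALL lets "." cross line breaks, like \_. in the Vim pattern.
CDATA = re.compile(r"<!\[\[CDATA\[(.*?)\]\]>", re.DOTALL)
# Anything that looks like a tag; [^>] also matches newlines in Python.
TAG = re.compile(r"<[^>]*>")


def strip_cdata(text):
    """Replace each CDATA block with its content, minus anything tag-like."""
    return CDATA.sub(lambda m: TAG.sub("", m.group(1)), text)


sample = (
    "<content><![[CDATA[\n"
    "<p>The <strong>keyword</strong> is nice</p>\n"
    "]]></content>"
)
result = strip_cdata(sample)
print(result)
```

As with the Vim command, entities such as `&#201;` are left untouched by this sketch.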

      -tim



      --
      You received this message from the "vim_use" maillist.
      For more information, visit http://www.vim.org/maillist.php
    • bw
      Message 2 of 11 , Feb 1, 2010
        THX! that did the job!

        --
        [Bb](astia{2}n)?\s?[Ww](ak{2}ie)?$

      • Tony Mechelynck
        Message 3 of 11 , Feb 1, 2010
          On 01/02/10 15:10, bw wrote:
          > Hello,
          >
          > I have a big xml solr feed out of my content management system that
          > includes wysiwyg html tags inside CDATA tags.
          >
          > I am looking for a way to remove the CDATA and only get the text.
          > CURRENT:
          > <add>
          > <doc>
          > <some_title>My title</some_title>
          > <content><![[CDATA[
          > <p>The<strong>keyword</strong> is nice to have but is not needed to
          > include in a solr feed</p><p><table cellspacing="2" cellpadding="2"
          > border="1" width="100%"><tbody><tr><td>Étape 1 :</td></tr>
          > ]]></content>
          > </doc>
          > <doc>
          > ....
          > </doc>
          > </add>
          >
          > WANTED:
          > <add>
          > <doc>
          > <some_title>My title</some_title>
          > <content>The keyword is nice to have but is not needed to
          > include in a solr feed</content>
          > </doc>
          > <doc>
          > ....
          > </doc>
          > </add>
          >
          > any vim tricks to do this?
          >
          > thx

          That's a hard one. I think you would have to write an ad-hoc function,
          using search() and maybe :mark, unless you always have a linebreak after
          <![[CDATA[ and another one before the corresponding ]]>, in which case
          the following (untested) might work

          1
          %g/<!\[\[CDATA\[/.+1;/]]>/-1s/<.\{-}>//g
          %s/<!\[\[CDATA\[\|]]>//

          but only if you have no other ]]>


          Best regards,
          Tony.
          --
          hundred-and-one symptoms of being an internet addict:
          253. You wait for a slow loading web page before going to the toilet.

        • Christian Brabandt
          Message 4 of 11 , Feb 1, 2010
            On Mon, February 1, 2010 3:10 pm, bw wrote:
            > I am looking for a way to remove the CDATA and only get the text.
            > CURRENT:
            > <add>
            > <doc>
            > <some_title>My title</some_title>
            > <content><![[CDATA[
            > <p>The <strong>keyword</strong> is nice to have but is not needed to
            > include in a solr feed</p><p><table cellspacing="2" cellpadding="2"
            > border="1" width="100%"><tbody><tr><td>Étape 1 :</td></tr>
            > ]]></content>
            > </doc>
            > <doc>
            > ....
            > </doc>
            > </add>
            >
            > WANTED:
            > <add>
            > <doc>
            > <some_title>My title</some_title>
            > <content>The keyword is nice to have but is not needed to
            > include in a solr feed</content>
            > </doc>
            > <doc>
            > ....
            > </doc>
            > </add>
            >
            > any vim tricks to do this?

            If the start and end patterns are always on separate lines, you could
            possibly use something like this:
            :g/\V<![[CDATA[/+,/\V]]>/-s/<\_[^>]*>//g
            followed by an additional
            :%s/\V<![[CDATA[\|]]>//
            to remove the remaining <![[CDATA start and end delimiters.

            Alternatively, you could use something like
            :%s/\V<![[CDATA[\_.\{-}]]/\=substitute(submatch(0), '\(<[^>]*>\)\|\(^\V![[CDATA[\)\|\(\V]]\$\)', '', 'g')/
            (1 line, barely tested, should work in your example case).

            Nevertheless, both leave the Étape 1 : parts in your text. So
            you might be able to put the expression
            :s/&[^;]*;//
            into the previous expression, which would then look like this:
            :%s/\V<![[CDATA[\_.\{-}]]/\=substitute(submatch(0), '\(<[^>]*>\)\|\(^\V![[CDATA[\)\|\(\V]]\$\)\|\m\(&[^;]*;\)', '', 'g')/
            and should work. However, I have only barely tested it.
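
The combined idea (drop tags and entities inside each CDATA block in one pass) can be sketched in Python for comparison. This is a hedged illustration, not Christian's command: `strip_cdata_and_entities`, `TAG_OR_ENTITY` and the sample input are hypothetical, and the `&nbsp;` in the sample is only an assumed example of a named entity.

```python
import re

# CDATA opener as spelled in the example feed; lazy match up to the first "]]>".
CDATA = re.compile(r"<!\[\[CDATA\[(.*?)\]\]>", re.DOTALL)
# Either a tag-like run or an entity-like run ("&" up to the next ";").
TAG_OR_ENTITY = re.compile(r"<[^>]*>|&[^;]*;")


def strip_cdata_and_entities(text):
    """Unwrap CDATA blocks and delete tags and entities found inside them."""
    return CDATA.sub(lambda m: TAG_OR_ENTITY.sub("", m.group(1)), text)


cleaned = strip_cdata_and_entities(
    "<content><![[CDATA[<td>&#201;tape 1&nbsp;:</td>]]></content>"
)
print(cleaned)
```

Note that deleting entities loses characters (the sample ends up as "tape 1:"), and `&[^;]*;` will also swallow text between a bare `&` and a later `;`, so this is only reasonable when the entities really are unwanted.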

            regards,
            Christian

          • bw
            Message 5 of 11 , Feb 1, 2010
              Your last comment made me think. I would like all the html encoded
              parts like &#201;, &#233;, &#8217; etc. to be transformed into real
              utf8 as the feed should be utf8. (É, é and ’)

              Any tips here?

              --
              [Bb](astia{2}n)?\s?[Ww](ak{2}ie)?$

            • Christian Brabandt
              Message 6 of 11 , Feb 1, 2010
                On Mon, February 1, 2010 4:49 pm, bw wrote:
                > Your last comment made me think. I would like all the html encoded
                > parts like &#201;, &#233;, &#8217; etc. to be transformed into real
                > utf8 as the feed should be utf8. (É, é and ’)

                Please don't top post.

                Regarding your question, I believe this:
                :%s/&#\(\d\+\);/\=printf("%s", nr2char(str2nr(submatch(1),10)))/

                should do what you want.
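
The same conversion can be sketched in Python for comparison. This is an illustration under stated assumptions, not part of the original reply: `decode_numeric_entities` and the sample strings are hypothetical, and only decimal references of the form `&#NNN;` are handled by the regex version.

```python
import html
import re

# Decimal character references such as "&#201;".
NUMERIC_ENTITY = re.compile(r"&#(\d+);")


def decode_numeric_entities(text):
    """Replace each decimal reference with the character it denotes."""
    return NUMERIC_ENTITY.sub(lambda m: chr(int(m.group(1))), text)


decoded = decode_numeric_entities("&#201;tape 1 : caf&#233;")
print(decoded)

# The standard library's html.unescape also covers named and hex references.
also_decoded = html.unescape("l&#8217;&eacute;tape")
print(also_decoded)
```

Unlike the Vim command, `re.sub` replaces every occurrence by default, so no extra flag is needed to process all references on a line.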


                regards,
                Christian

              • bw
                Message 7 of 11 , Feb 1, 2010
                  Sorry, I do not understand the concept of top posting, but I guess
                  you mean starting a new thread for a different question ;-)

                  I just needed to add a /g in order to get it done everywhere.

                  Thanks! Very helpful for me to understand even more the power of vim :)

                  --
                  [Bb](astia{2}n)?\s?[Ww](ak{2}ie)?$

                • Raúl Núñez de Arenas Coronado
                  Message 8 of 11 , Feb 1, 2010
                    Saluton bw :)

                    bw <b...@...> skribis:
                    > Sorry, I do not understand the concept of top posting, but I guess
                    > you mean starting a new thread for a different question ;-)

                    No, it's putting the reply text *before* the quoted text:
                    http://en.wikipedia.org/wiki/Posting_style#Top-posting

                    The preferred style on the list is interleaved-posting (also explained
                    in the link above), but a good bunch of members just do as they please.

                    --
                    Raúl "DervishD" Núñez de Arenas Coronado
                    Linux Registered User 88736 | http://www.dervishd.net
                    It's my PC and I'll cry if I want to... RAmen!

                  • bw
                    Message 9 of 11 , Feb 2, 2010
                      > :%s/<!\[\[CDATA\[\(\%(\%(]]>\)\@!\_.\)\{-}\)]]>/\=substitute(submatch(1),'<[^>]*>', '', 'g')/g

                      I have a hard time understanding the \(\%(\%(]]>\)\@!\_.\)\{-}\) part.
                      What does it do? What does \% mean? I do understand it will take
                      anything in CDATA brackets and run the substitute command over it.


                      thanks

                    • Tim Chase
                      Message 10 of 11 , Feb 2, 2010
                        bw wrote:
                        >> :%s/<!\[\[CDATA\[\(\%(\%(]]>\)\@!\_.\)\{-}\)]]>/\=substitute(submatch(1),'<[^>]*>', '', 'g')/g
                        >
                        > I have a hard time understanding the \(\%(\%(]]>\)\@!\_.\)\{-}\) part.
                        > What does it do? What does \% mean? I do understand it will take
                        > anything in CDATA brackets and run the substitute command over it.

                        The \%(...\) is a non-capturing group.

                        The command breaks down as

                        :%s/             substitute
                        <!\[\[CDATA\[    a literal "<![[CDATA["
                        \(               begin capturing
                        \%(              begin non-capturing group #1
                        \%(              begin non-capturing group #2
                        ]]>              a literal "]]>" close tag
                        \)               (end non-cap group #2)
                        \@!              isn't allowed to match here
                        \_.              match any one character incl NL
                        \)               (end non-cap group #1)
                        \{-}             as few as possible
                        \)               end capture group
                        ]]>              the literal "]]>" that matches
                        /                and replace it with
                        \=               the following expression
                        substitute(      uh...substitute :)
                        submatch(1),     the content of the CDATA
                        '<[^>]*>',       all tags and replace them
                        '',              with nothing
                        'g')             for all of the tags
                        /g               for all of the matches on a line

                        In retrospect, because "]]>" unilaterally closes a CDATA and
                        you're capturing everything inside, you might be able to simplify
                        that to just

                        :%s/<!\[\[CDATA\[\(\_.\{-}\)]]>/\=substitute(submatch(1),'<[^>]*>', '', 'g')/g
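
The claim that the "not followed by ]]>" guard is redundant next to a lazy match can be checked with a small Python sketch. This is an illustration only: Python's `(?:(?!...).)*?` is assumed here as the counterpart of Vim's `\%(\%(...\)\@!\_.\)\{-}`, and the names `TEMPERED`, `LAZY` and the sample text are made up.

```python
import re

# Guarded form: each consumed character must not start a "]]>" sequence.
TEMPERED = re.compile(r"<!\[\[CDATA\[((?:(?!\]\]>).)*?)\]\]>", re.DOTALL)
# Simplified form: a plain lazy match that stops at the first "]]>".
LAZY = re.compile(r"<!\[\[CDATA\[(.*?)\]\]>", re.DOTALL)

text = "a<![[CDATA[one]]>b<![[CDATA[two\nlines]]>c"

tempered_hits = TEMPERED.findall(text)
lazy_hits = LAZY.findall(text)
print(tempered_hits, lazy_hits)
```

Both patterns capture the same two blocks here, because the lazy quantifier already stops at the first closing delimiter.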

                        HTH,

                        -tim

