remove and clean CDATA out of xml

  • bw
    Message 1 of 11, Feb 1, 2010
      Hello,

      I have a big xml solr feed out of my content management system that
      includes wysiwyg html tags inside CDATA tags.

      I am looking for a way to remove the CDATA and only get the text.
      CURRENT:
      <add>
      <doc>
      <some_title>My title</some_title>
      <content><![[CDATA[
      <p>The <strong>keyword</strong> is nice to have but is not needed to
      include in a solr feed</p><p><table cellspacing="2" cellpadding="2"
      border="1" width="100%"><tbody><tr><td>Étape 1 :</td></tr>
      ]]></content>
      </doc>
      <doc>
      ....
      </doc>
      </add>

      WANTED:
      <add>
      <doc>
      <some_title>My title</some_title>
      <content>The keyword is nice to have but is not needed to
      include in a solr feed</content>
      </doc>
      <doc>
      ....
      </doc>
      </add>

      any vim tricks to do this?

      thx
      --
      [Bb](astia{2}n)?\s?[Ww](ak{2}ie)?$

      --
      You received this message from the "vim_use" maillist.
      For more information, visit http://www.vim.org/maillist.php
    • Tim Chase
      Message 2 of 11, Feb 1, 2010
        bw wrote:
        > I am looking for a way to remove the CDATA and only get the text.
        > CURRENT:
        > <add>
        > <doc>
        > <some_title>My title</some_title>
        > <content><![[CDATA[
        > <p>The <strong>keyword</strong> is nice to have but is not needed to
        > include in a solr feed</p><p><table cellspacing="2" cellpadding="2"
        > border="1" width="100%"><tbody><tr><td>Étape 1 :</td></tr>
        > ]]></content>
        > </doc>
        > <doc>
        > ....
        > </doc>
        > </add>
        >
        > WANTED:
        > <add>
        > <doc>
        > <some_title>My title</some_title>
        > <content>The keyword is nice to have but is not needed to
        > include in a solr feed

        what happens to the rest of the content here?

        > </content>
        > </doc>
        > <doc>
        > ....
        > </doc>
        > </add>
        >
        > any vim tricks to do this?

        You might be able to do something like

        :%s/<!\[\[CDATA\[\(\%(\%(]]>\)\@!\_.\)\{-}\)]]>/\=substitute(submatch(1),
        '<[^>]*>', '', 'g')/g

        (all on one line)
        It doesn't post-process XML entities, but otherwise, it worked on
        your example...
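For readers who want to check the logic outside Vim, the substitution above can be sketched in Python. This is only an illustration and not part of the original reply: the function name `strip_cdata` and the sample string are made up, and the pattern assumes the same `<![[CDATA[` ... `]]>` delimiters as the example feed. Like the Vim command, it does not decode XML entities.

```python
import re

# Same delimiters as in the example feed (note the doubled "[[").
# DOTALL lets "." cross line breaks, like \_. in the Vim pattern.
CDATA_RE = re.compile(r"<!\[\[CDATA\[(.*?)\]\]>", re.DOTALL)
# Anything that looks like a tag: "<", then non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]*>")


def strip_cdata(xml_text: str) -> str:
    """Replace each CDATA section with its content, minus tag-like text."""
    return CDATA_RE.sub(lambda m: TAG_RE.sub("", m.group(1)), xml_text)


# Hypothetical sample, shortened from the feed in the question.
sample = (
    "<content><![[CDATA[\n"
    "<p>The <strong>keyword</strong> is nice</p>\n"
    "]]></content>"
)
print(strip_cdata(sample))
```

The inner `TAG_RE.sub` plays the role of `substitute(submatch(1), '<[^>]*>', '', 'g')`, and the outer `CDATA_RE.sub` plays the role of `:%s/.../\=.../g`.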

        -tim



      • bw
        Message 3 of 11, Feb 1, 2010
          THX! that did the job!



          --
          [Bb](astia{2}n)?\s?[Ww](ak{2}ie)?$

        • Tony Mechelynck
          Message 4 of 11, Feb 1, 2010
            On 01/02/10 15:10, bw wrote:
            > Hello,
            >
            > I have a big xml solr feed out of my content management system that
            > includes wysiwyg html tags inside CDATA tags.
            >
            > I am looking for a way to remove the CDATA and only get the text.
            > CURRENT:
            > <add>
            > <doc>
            > <some_title>My title</some_title>
            > <content><![[CDATA[
            > <p>The<strong>keyword</strong> is nice to have but is not needed to
            > include in a solr feed</p><p><table cellspacing="2" cellpadding="2"
            > border="1" width="100%"><tbody><tr><td>Étape 1 :</td></tr>
            > ]]></content>
            > </doc>
            > <doc>
            > ....
            > </doc>
            > </add>
            >
            > WANTED:
            > <add>
            > <doc>
            > <some_title>My title</some_title>
            > <content>The keyword is nice to have but is not needed to
            > include in a solr feed</content>
            > </doc>
            > <doc>
            > ....
            > </doc>
            > </add>
            >
            > any vim tricks to do this?
            >
            > thx

            That's a hard one. I think you would have to write an ad-hoc function,
            using search() and maybe :mark, unless you always have a linebreak after
            <![[CDATA[ and another one before the corresponding ]]>, in which case
            the following (untested) might work

            1
            %g/<!\[\[CDATA\[/.+1;/]]>/-1s/<.\{-}>//g
            %s/<!\[\[CDATA\[\|]]>//

            but only if you have no other ]]>


            Best regards,
            Tony.
            --
            hundred-and-one symptoms of being an internet addict:
            253. You wait for a slow loading web page before going to the toilet.

          • Christian Brabandt
            Message 5 of 11, Feb 1, 2010
              On Mon, February 1, 2010 3:10 pm, bw wrote:
              > I am looking for a way to remove the CDATA and only get the text.
              > CURRENT:
              > <add>
              > <doc>
              > <some_title>My title</some_title>
              > <content><![[CDATA[
              > <p>The <strong>keyword</strong> is nice to have but is not needed to
              > include in a solr feed</p><p><table cellspacing="2" cellpadding="2"
              > border="1" width="100%"><tbody><tr><td>Étape 1 :</td></tr>
              > ]]></content>
              > </doc>
              > <doc>
              > ....
              > </doc>
              > </add>
              >
              > WANTED:
              > <add>
              > <doc>
              > <some_title>My title</some_title>
              > <content>The keyword is nice to have but is not needed to
              > include in a solr feed</content>
              > </doc>
              > <doc>
              > ....
              > </doc>
              > </add>
              >
              > any vim tricks to do this?

              If the start and end patterns are always on separate lines, you could
              possibly use something like this:
              :g/\V<![[CDATA[/+,/\V]]>/-s/<\_[^>]*>//g
              followed by an additional
              :%s/\V<![[CDATA[\|]]>//
              to remove the remaining <![[CDATA start and end delimiters.

              Alternatively, you could use something like
              :%s/\V<![[CDATA[\_.\{-}]]/\=substitute(submatch(0),
              '\(<[^>]*>\)\|\(^\V![[CDATA[\)\|\(\V]]\$\)', '', 'g')/
              (1 line, barely tested, should work in your example case).

              Nevertheless, both leave the &#201;tape 1 : parts in your text. So
              you might be able to put the expression
              :s/&[^;]*;//
              into the previous expression, which would then look like this:
              %s/\V<![[CDATA[\_.\{-}]]/\=substitute(submatch(0),
              '\(<[^>]*>\)\|\(^\V![[CDATA[\)\|\(\V]]\$\)\|\m\(&[^;]*;\)', '', 'g')/
              and should work. However, I have only barely tested it.

              regards,
              Christian

            • bw
              Message 6 of 11, Feb 1, 2010
                Your last comment made me think. I would like all the HTML-encoded
                parts like &#201;, &#233;, &#8217; etc. to be transformed into real
                UTF-8, as the feed should be UTF-8 (É, é and ’).

                Any tips here?



                --
                [Bb](astia{2}n)?\s?[Ww](ak{2}ie)?$

              • Christian Brabandt
                Message 7 of 11, Feb 1, 2010
                  On Mon, February 1, 2010 4:49 pm, bw wrote:
                  > Your last comment made me think. I would like all the html encoded
                  > parts like É, é ’ etc... to be transformed into real
                  > utf8 as the feed should be utf8. (É, é and ’)

                  Please don't top post.

                  Regarding your question, I believe this:
                  :%s/&#\(\d\+\);/\=printf("%s ", nr2char(str2nr(submatch(1),10)))/

                  should do what you want.
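As a rough cross-check outside Vim, the same conversion of decimal character references can be sketched in Python. This is an illustration only: the function name `decode_numeric_entities` and the sample text are made up, and it covers only the decimal `&#NNN;` form that the pattern above matches (not hexadecimal or named entities).

```python
import re


def decode_numeric_entities(text: str) -> str:
    """Replace decimal references such as &#201; with the character they name."""
    # chr(int(...)) corresponds to nr2char(str2nr(..., 10)) in the Vim command.
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


# Hypothetical sample using the three code points mentioned in the thread.
print(decode_numeric_entities("&#201;tape 1, caf&#233;, l&#8217;heure"))
```

Two differences from the Vim command are worth noting: `re.sub` replaces every occurrence, which corresponds to adding the `g` flag to `:s`, and this sketch does not append the space that the `"%s "` format string adds after each converted character. Python's standard `html.unescape` also handles named and hexadecimal entities if those turn up in the feed.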


                  regards,
                  Christian

                • bw
                  Message 8 of 11, Feb 1, 2010
                    Sorry, I do not understand the concept of top posting, but I guess you mean
                    start a new thread for a different question ;-)

                    I just needed to add a /g in order to get it done everywhere.

                    Thanks! Very helpful for me to understand even more the power of vim :)



                    --
                    [Bb](astia{2}n)?\s?[Ww](ak{2}ie)?$

                  • Raúl Núñez de Arenas Coronado
                    Message 9 of 11, Feb 1, 2010
                      Saluton bw :)

                      bw <b...@...> skribis:
                      > Sorry, I do not understand the concept top post, but I guess you mean
                      > start a new thread for a different question ;-)

                      No, it's putting the reply text *before* the quoted text:
                      http://en.wikipedia.org/wiki/Posting_style#Top-posting

                      The preferred style on the list is interleaved-posting (also explained
                      in the link above), but a good bunch of members just do as they please.

                      --
                      Raúl "DervishD" Núñez de Arenas Coronado
                      Linux Registered User 88736 | http://www.dervishd.net
                      It's my PC and I'll cry if I want to... RAmen!

                    • bw
                      Message 10 of 11, Feb 2, 2010
                        > :%s/<!\[\[CDATA\[\(\%(\%(]]>\)\@!\_.\)\{-}\)]]>/\=substitute(submatch(1),'<[^>]*>', '', 'g')/g

                        I have a hard time understanding the \(\%(\%(]]>\)\@!\_.\)\{-}\) part.
                        What does it do? What does \% mean? I do understand it will take
                        anything in CDATA brackets and run the substitute command over it.


                        thanks

                      • Tim Chase
                        Message 11 of 11, Feb 2, 2010
                          bw wrote:
                          >> :%s/<!\[\[CDATA\[\(\%(\%(]]>\)\@!\_.\)\{-}\)]]>/\=substitute(submatch(1),'<[^>]*>', '', 'g')/g
                          >
                          > I have a hard time understand the \(\%(\%(]]>\)\@!\_.\)\{-}\) part.
                          > What does it do? What does \% mean? I do understand it will take
                          > anything in CDATA brackets and run the substiture command over it.

                          The \%(...\) is a non-capturing group.

                          The command breaks down as

                          :%s/                  substitute
                          <!\[\[CDATA\[         a literal "<![[CDATA["
                          \(                    begin capturing
                            \%(                 begin non-capturing group #1
                              \%(               begin non-capturing group #2
                                ]]>             a literal "]]>" close tag
                              \)                (end non-cap group #2)
                              \@!               isn't allowed to match here
                              \_.               match any one character incl NL
                            \)                  (end non-cap group #1)
                            \{-}                as few as possible
                          \)                    end capture group
                          ]]>                   the literal "]]>" that matches
                          /                     and replace it with
                          \=                    the following expression
                          substitute(           uh...substitute :)
                            submatch(1),        the content of the CDATA
                            '<[^>]*>',          all tags and replace them
                            '',                 with nothing
                            'g')                for all of the tags
                          /g                    for all of the matches on a line

                          In retrospect, because "]]>" unilaterally closes a CDATA and
                          you're capturing everything inside, you might be able to simplify
                          that to just

                          :%s/<!\[\[CDATA\[\(\_.\{-}\)]]>/\=substitute(submatch(1),'<[^>]*>', '', 'g')/g
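To see why the simplification is plausible, the two forms discussed above can be compared side by side in a Python sketch. This is an illustration under assumptions, not a statement about Vim's regex engine: Python's `(?:...)`, `(?!...)`, `.` with `re.DOTALL`, and `*?` are used as stand-ins for Vim's `\%(...\)`, `\@!`, `\_.`, and `\{-}`, and the sample string is made up.

```python
import re

# Tempered form: consume one character at a time, never starting on "]]>".
TEMPERED = re.compile(r"<!\[\[CDATA\[((?:(?!\]\]>).)*?)\]\]>", re.DOTALL)
# Simplified form: a plain lazy match that stops at the first "]]>".
LAZY = re.compile(r"<!\[\[CDATA\[(.*?)\]\]>", re.DOTALL)

# Hypothetical input with two CDATA sections, one spanning a line break.
sample = "<a><![[CDATA[one]]></a><b><![[CDATA[two\nlines]]></b>"

tempered_hits = TEMPERED.findall(sample)
lazy_hits = LAZY.findall(sample)
print(tempered_hits)
print(lazy_hits)
```

On this input both patterns capture the same two pieces of content, because the lazy quantifier already stops at the first closing `]]>` it can reach.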

                          HTH,

                          -tim

