cleaning extra spaces

  • peterhulm
    Message 1 of 17 , Jan 16, 2014
      I have been trying to devise a clip that will clean up extra spaces and newlines inserted into HTML pages by many programs. I have updated this clip from ClipWriter:

      ^!Jump DOC_START

      :loop
      ^!IfError EXIT

      ^!Replace "\s\r\n\s">>"\r\n" RW
      as
      ^!Goto loop

      That is, I seek to replace spaces plus return plus newline (line feed) followed by spaces with a return and newline, using a Regular Expression across the Whole document.

      It doesn't work, simply returning a message that \s\r\n\s was not found in the whole document.

      It's not what I want anyway. I want to get rid of the paragraph marks as well that split up the lines where I don't need them.

      Anyone else have this problem, and do you have a solution? Or at least an explanation why the old code no longer works? I cannot understand the help pages on Regular Expressions and how to use the (ANYCRLF) token.
    • Axel Berger
      Message 2 of 17 , Jan 16, 2014
        peterhulm@... wrote:
        > simply returning a message that \s\r\n\s was not found
        > in the whole document.

        Which is probably true.

        Try:

        ^!Replace "\s*\R+\s*" >> "\r\n" WRASTI

        > I want to get rid of the paragraph marks as
        > well that split up the lines where I don't need them.

        Hard to do unless you can explain the difference between newlines you want
        and newlines you don't want. If you can explain that to anyone without
        relying on phrases like "where it makes sense" or "that I don't remember
        having inserted myself", you will also be able to explain it to NoteTab.

        Axel
      • Alex Plantema
        Message 3 of 17 , Jan 16, 2014
          On Thursday, 16 January 2014 at 21:12, peterhulm@... wrote:

          > I have been trying to devise a clip that will clean up extra spaces
          > and newlines inserted into HTML pages by many programs. I have
          > updated this clip from ClipWriter:
          >
          > ^!Jump DOC_START
          >
          > :loop
          > ^!IfError EXIT
          >
          > ^!Replace "\s\r\n\s">>"\r\n" RW
          > as
          > ^!Goto loop
          >
          > That is, I seek to replace spaces plus return plus newline (line
          > feed) followed by spaces with a return and newline,using a Regular
          > Expression across the Whole document.
          >
          > It doesn't work, simply returning a message that \s\r\n\s was not
          > found in the whole document.
          >
          > It's not what I want anyway. I want to get rid of the paragraph marks
          > as well that split up the lines where I don't need them.
          >
          > Anyone else have this problem, and do you have a solution? Or at
          > least an explanation why the old code no longer works? I cannot
          > understand the help pages on Regular Expressions and how to use the
          > (ANYCRLF) token.

          Your clip only deletes a trailing blank if the next line has a leading blank and vice versa, and only one at a time.
          Try this:
          ^!Replace "^ *(.*?) *$" >> "$1" WRSA

          Alex.
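For anyone who wants to experiment with Alex's pattern outside NoteTab, here is a minimal sketch in Python rather than clip code. It assumes LF-only line ends and treats Python's re.MULTILINE flag as a stand-in for matching ^ and $ line by line; the function name strip_line_blanks is made up for this illustration.

```python
import re

def strip_line_blanks(text: str) -> str:
    """Remove leading and trailing spaces from every line (LF line ends assumed)."""
    # With re.MULTILINE, ^ and $ match at the start and end of each line.
    # The lazy group (.*?) keeps the line content; the " *" on either side
    # swallows the blanks, and the replacement writes back only the group.
    return re.sub(r"^ *(.*?) *$", r"\1", text, flags=re.MULTILINE)

sample = "<p>In Windows 8 think  \n  its original version \n  to use immediately.</p>"
print(strip_line_blanks(sample))
```

This only trims blanks; every line break stays in place. With CRLF line ends Python's $ sits after the \r, so trailing blanks would survive unless the line ends are normalized first.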
        • Peter Hulm
          Message 4 of 17 , Jan 17, 2014
            Thanks Axel, I'll try it and let you know.
            Since I am cleaning up HTML, the places that need to keep the paragraph marks all end with >, so I can theoretically use >+^p as a search term to keep the ones I need.

          • Peter Hulm
            Message 5 of 17 , Jan 17, 2014
              Neither of your suggestions worked on this text (the line endings are as they appear in NoteTab 7)

              <h1>1. Where's My Start Button?</h1>
              <p>In Windows 8 think Start screen not Start button. This is what Windows 8 in
                its original version gave you, with all your favourite apps (well they will
                be once you have finished customizing them) laid out for you as &quot;tiles&quot;
                to use immediately.</p>

              Any commentary you can give me on the RegEx terms used will also help. Thanks.

            • John Shotsky
              Message 6 of 17 , Jan 17, 2014

                It is not totally clear what you want here, because email may have wrapped lines. But I work with html all the time, and have the need to remove CR's that are not where they belong - between html tags.

                If your goal is to force the paragraph to have no interspersed carriage returns, the following would work:

                Find:

                [^>\x20]\K\x20*\R+\x20*

                That is: any spaces (not following a tag close, >), then one or more line breaks of any type, then any spaces.

                Replace:

                \x20

                You do want a space between the closed lines, but of course you COULD have hyphenated words that don't need that space. That is a whole nother problem, which I deal with extensively in my work.

                In terms of a clip, that would be

                ^!Replace "[^>\x20]\K\x20*\R+\x20*" >> "\x20" ARSW

                That would remove CR's that occur anywhere except after tag closes and place a space where the CR was.

                Regards,
                John
                RecipeTools Web Site: http://recipetools.gotdns.com/
                John's Mags Yahoo Group:  http://groups.yahoo.com/group/johnsmags/
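Python's re module supports neither \K nor \R, so the sketch below is an approximation of John's pattern, not a transcription of it. A lookbehind stands in for "[^>\x20]\K" and an explicit CRLF/CR/LF group stands in for \R. One deliberate difference, which is an assumption of this sketch: the lookbehind class also excludes \r and \n. In Python, without that, the LF of a CRLF pair after a > would still match, because the CR in front of it is neither a > nor a space. The names UNWRAP and unwrap_lines are invented here.

```python
import re

# (?<=[^>\x20\r\n])  previous character is not ">", space, CR or LF
# \x20*              optional trailing spaces
# (?:\r\n|\r|\n)+    one or more line breaks of any kind (stand-in for \R+)
# \x20*              optional leading spaces on the next line
UNWRAP = re.compile(r"(?<=[^>\x20\r\n])\x20*(?:\r\n|\r|\n)+\x20*")

def unwrap_lines(html: str) -> str:
    """Join wrapped text lines with one space; keep breaks that follow a tag close."""
    return UNWRAP.sub(" ", html)

sample = (
    "<h1>1. Where's My Start Button?</h1>\r\n"
    "<p>This is what Windows 8 in\r\n"
    "  its original version gave you\r\n"
    "  to use immediately.</p>\r\n"
)
print(unwrap_lines(sample))
```

The line break after </h1> and after </p> is left alone; the two breaks inside the paragraph each become a single space.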

                 


                 

              • Axel Berger
                Message 7 of 17 , Jan 17, 2014
                  Peter Hulm wrote:
                  > Neither of your suggestions worked on this text
                  > (the line endings are as they appear in NoteTab 7)

                  That surprises me. I just tried my clip (only one) on your text example and
                  it got rid of all spaces before and after newlines as requested. It did not
                  touch newlines as such but that's by design.
                  After you added your criterion for wanted vs. unwanted newlines John has
                  already supplied a solution for that too.

                  What I did is as follows

                  ^!Replace "\s*\R+\s*" >> "\r\n" WRASTI

                  Look for zero or more white-space characters followed by at least one
                  newline followed by zero or more white spaces
                  and replace all that with a single DOS-newline, CRLF.
                  Do this starting at the top, using regular expressions, for all instances,
                  silently, not limited to whole words, and disregarding capitalizing.

                  Axel
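The behaviour Axel describes can be tried out in Python, with the caveat that Python's re has no \R, so the sketch spells out CRLF, CR and LF explicitly. The function name collapse_around_newlines is hypothetical, and this is an illustration of the pattern, not of NoteTab's engine.

```python
import re

def collapse_around_newlines(text: str) -> str:
    """Replace whitespace + line break(s) + whitespace with a single CRLF."""
    # \s also matches line-break characters, so a run of blank lines
    # is swallowed by the same match and comes out as one CRLF.
    return re.sub(r"\s*(?:\r\n|\r|\n)+\s*", "\r\n", text)

sample = "<p>one in  \r\n   two\n\n three</p>"
print(repr(collapse_around_newlines(sample)))
```

As Axel says, the line breaks themselves are kept; only the blanks around them (and any empty lines between them) go away.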
                • Peter Hulm
                  Message 8 of 17 , Jan 18, 2014
                    You guys are magnificent. John even spotted something I knew but forgot: that I want to replace the carriage returns with a space rather than simply suppress them. And yes, you are right, the email reformatted the text so that it appeared as it should.

                    The code you gave me: ^!Replace "[^>\x20]\K\x20*\R+\x20*" >> "\x20" ARSW worked with one problem. The spaces vanished but so did all the line breaks. When I ran it through a web editor, however, these reappeared. It did its own beautifying, and when I reopened the saved text in NoteTab all was fine.

                    I can live with this two-step approach, if there is no alternative, but I am puzzled why [^`>] still replaced the linebreaks after >, and I don't really understand \K or should it be \K\20.


                  • Don
                    Message 9 of 17 , Jan 18, 2014
                      On 1/18/2014 3:54 PM, Peter Hulm wrote:
                      > You guys are magnificent. John even spotted something I knew but forgot: that I want to replace the carriage returns with a space rather than simply suppress it. And yes, you are right, the email reformatted the text so that it appeared as it should.
                      >
                      > The code you gave me: ^!Replace "[^>\x20]\K\x20*\R+\x20*" >> "\x20" ARSW worked with one problem. The spaces vanished but so did all the line breaks. When I ran it through a web editor, however, these reappeared. It did its own beautifying, and when I reopened the saved text in NoteTab all was fine.
                      >
                      > I can live with this two-step approach, if there is no alternative, but I am puzzled why [^`>] still replaced the linebreaks after >, and I don't really understand \K or should it be \K\20.
                      >

                      \K means do not capture (thus don't select or replace) what precedes it.

                      [^whatever] means NOT whatever. So the pattern looks for any character
                      other than a > or a space (\x20 is the code for a space), followed by
                      zero or more spaces, then one or more line breaks of any kind (\R),
                      then zero or more spaces.
                    • John Shotsky
                      Message 10 of 17 , Jan 18, 2014

                        \x20 is the written code for a space. Much easier to understand in email, and NoteTab accepts either.

                        As to replacing CR's after a '>', it should not have. The clip says it cannot stop on either a space or a >. So, the CR cannot be preceded by either spaces or > before the \K which simply means don't capture anything before the \K. It provides a stop point from which to proceed. Any character can be before the \K except a space or >. Following the \K, any combination of spaces with at least one CR will trigger the replacement. The replacement is one space. If it removed the CR's, it added the spaces. I can't imagine any 'problem' in which it removed the CR's and didn't insert the spaces - they have to be there or it would not have triggered. When you say they 'reappeared', I am confused, because nothing can 'reappear' that is not already there. If you are using NoteTab Pro, you can see the spaces as dots.

                        The only thing I can think of that could cause any inconsistency is if they are not actually spaces, but non-break spaces, which are Unicode characters which would not function in this clip as is. The way around that is to convert them to spaces first, or include them in each location in which there are spaces above.

                        To convert them first:

                        ^!Replace "&#160;" >> " " AIRSW

                        ^!IfError Next Else Skip_-1

                        After this, all spaces are normal spaces, and the code provided will run as expected. Otherwise, add &#160; to the command as follows:

                        ^!Replace "[^>\x20&#160;]\K[\x20&#160;]*\R+[\x20&#160;]*" >> "\x20" ARSW

                        I would have to see the original html to know what is happening for sure, but if spaces in html are not acting as expected, there is a good chance they are non-breaking spaces, which have no shortcut in regex that doesn't also include the \R.

                         

                        Regards,
                        John
                        RecipeTools Web Site: http://recipetools.gotdns.com/
                        John's Mags Yahoo Group:  http://groups.yahoo.com/group/johnsmags/
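John's two options for non-break spaces can be sketched in Python as follows. This assumes the non-break spaces are literal characters with code 160 (\xa0); if the file contains the entity text instead, a plain string replacement of that text would be needed. As before, a lookbehind and an explicit CRLF/CR/LF group stand in for \K and \R, the lookbehind also excludes \r and \n (an adjustment made for this sketch), and all names are invented for the example.

```python
import re

NBSP = "\xa0"  # character 160, the non-break space

def nbsp_to_space(text: str) -> str:
    """Option 1: convert every non-break space to a normal space first."""
    return text.replace(NBSP, " ")

# Option 2: accept either kind of space wherever the pattern expects a space.
UNWRAP_NBSP = re.compile(
    r"(?<=[^>\x20\xa0\r\n])[\x20\xa0]*(?:\r\n|\r|\n)+[\x20\xa0]*"
)

def unwrap_lines_nbsp(html: str) -> str:
    """Join wrapped lines with one space, treating NBSP like a space."""
    return UNWRAP_NBSP.sub(" ", html)

print(repr(unwrap_lines_nbsp("<p>one\xa0\r\n\xa0 two</p>\r\n")))
```

Converting first keeps the main pattern simple, which is presumably why John lists it first.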

                         


                         

                      • John Shotsky
                        Message 11 of 17 , Jan 18, 2014

                          I misspoke about non-break spaces being Unicode. While they are Unicode characters, they are also Ansi characters which work just fine in NoteTab, but must be explicitly called out as I showed in the revised clip in my previous message. If you have characters above the Ansi range (0-255), then you have a whole nother problem to solve. One that I have already done, but it takes a potful of code to deal with full Unicode in NoteTab's regex. My code either converts high-order characters to Ansi characters, or simply omits them entirely, thus removing, for example, Asian characters from the text entirely.

                          Regards,
                          John
                          RecipeTools Web Site: http://recipetools.gotdns.com/
                          John's Mags Yahoo Group:  http://groups.yahoo.com/group/johnsmags/

                           

                        • peterhulm
                          Message 12 of 17 , Jan 18, 2014
                            Thanks again John for taking time to explain. In NoteTab when I run your clip all the lines cascade together
                            <!DOCTYPE html> <html lang="en"> <head> <!--google fonts--> <link rel="stylesheet" type="text/css" href="http://fonts.googleapis.com/css?family=Domine:400,700|Nobile:400,400italic,700italic,700"> <meta charset="utf-8"> <title>Postmodern studies: House of Cards</title>

                             but when I paste them here they are reformatted, as:

                            <!DOCTYPE html>
                             <html lang="en">
                             <head>
                             <!--google fonts-->
                             <link rel="stylesheet" type="text/css" href="http://fonts.googleapis.com/css?family=Domine:400,700|Nobile:400,400italic,700italic,700">
                             <meta charset="utf-8">
                             <title>Postmodern studies: House of Cards</title>

                            Maybe I have something configured wrong in NoteTab's settings.

                            I still can't explain to myself why the old clip does not work, i.e.
                            ^!Replace "\s\R\s">>"\R" RW (replacing the old \r with \R.

                            I'm hoping you have an explanation for that too.
                          • Axel Berger
                            Message 13 of 17 , Jan 18, 2014
                              peterhulm@... wrote:
                              > I still can't explain to myself why the old clip does not work, i.e.
                              > ^!Replace "\s\R\s">>"\R" RW (replacing the old \r with \R.

                              You can't use \R in replaces, only in search strings. It's the same with
                              all ambiguous multi-value metacharacters like \s \d or \w.

                              Axel
                            • John Shotsky
                              Message 14 of 17 , Jan 18, 2014

                                Your example doesn't work because \s INCLUDES line endings.

                                To replace all kinds of line endings with a Windows one you would replace \R with \r\n.

                                But \s includes all whitespace - spaces, tabs, line endings, etc.

                                To do what you want requires treating spaces separately from line ends, because you want to force the situation where line ends occur, not just whitespace.

                                I can't exactly explain why your text runs together, but you might want to run a clip before the ones I've shown to make all line ends Windows CRLF. (\r\n as shown above) Do that first to make Windows line ends, then run what I've provided, and then all line ends should display as expected. In Html, your line ends could come from Windows, Mac or *nix, each of which treats line ends differently, and browsers obey them all. All my clips standardize line ends as a first step, and usually the same with non-break spaces. I have seen odd behavior in NoteTab when the full CRLF was not present.

                                Regards,
                                John
                                RecipeTools Web Site: http://recipetools.gotdns.com/
                                John's Mags Yahoo Group:  http://groups.yahoo.com/group/johnsmags/
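John's two points here (that \s already covers line-end characters, and that it helps to standardize line ends first) are easy to check in Python. This is a sketch in Python's re, not NoteTab clip code, and the function name to_crlf is made up for the example.

```python
import re

def to_crlf(text: str) -> str:
    """Rewrite every line end (CRLF, lone CR, lone LF) as Windows CRLF."""
    # "\r\n" is listed first so an existing CRLF is matched as one unit
    # and is not turned into two line ends.
    return re.sub(r"\r\n|\r|\n", "\r\n", text)

# \s is a whitespace class, so it matches line-end characters as well
# as spaces and tabs.
assert re.fullmatch(r"\s+", " \t\r\n") is not None

print(repr(to_crlf("mac\runix\nwin\r\n")))
```

Running a step like this first means the later patterns only ever see one kind of line end.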

                                 


                              • Alex Plantema
                                Message 15 of 17 , Jan 18, 2014
                                  On Saturday, 18 January 2014 at 23:12, peterhulm@... wrote:

                                  > I still can't explain to myself why the old clip does not work, i.e.
                                  > ^!Replace "\s\R\s">>"\R" RW (replacing the old \r with \R.
                                  >
                                  > I'm hoping you have an explanation for that too.

                                  Your original clip contained \r\n instead of \R.
                                  In my first reply, on Thursday, I explained why it doesn't work
                                  and I suggested another solution, which you haven't commented on yet.

                                  Alex.
                                • Peter Hulm
                                  Message 16 of 17 , Jan 19, 2014
                                    Thank you for the explanation. I think you should know that your RegEx works perfectly in Notepad++, so it may be that I am doing something wrong in NoteTab v7.

                                  • Peter Hulm
                                    Message 17 of 17 , Jan 19, 2014
                                      Ah. All is clear now. Thanks a million.