
Stripping HTML tags

  • Richard
    Message 1 of 10, Sep 4, 2007
      I have discovered a serious bug (I hope it is not a feature) in Note
      Tab. As most know, HTML rendering engines compress a series of spaces
      to a single space unless the coder specifies a &nbsp; (which is
      decimal 160, or 0xA0 in hex). This is done to "trick" the engine into
      displaying an unprintable character whenever it encounters one.

      The bug is: when stripping HTML tags (Modify > Strip HTML Tags >
      Remove All Tags -- shift+control+T) it does not restore a "real" space
      (0x20, dec 32); rather, it preserves the "faux" space, i.e., 0xA0.
      This causes Modify > Spaces not to work at all. Which causes me to
      use very bad language. It was not until recently that I looked at a
      problem document in a hex editor and realized what had happened.

      It is not practical to have to do a global replace on a character you
      cannot see -- how, pray tell, can you be certain that the character
      you have selected is indeed the one you wish replaced? -- and having
      to write a clip seems to me to be even more infuriating.
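
One way to confirm which character is present without a hex editor is to swap it for a visible marker before reading the text. The following is a minimal Python sketch, not a NoteTab feature; the function names and the `[NBSP]` marker are made up for illustration.

```python
NBSP = "\u00a0"  # non-breaking space: decimal 160, hex A0


def reveal_nbsp(text: str, marker: str = "[NBSP]") -> str:
    """Return text with every non-breaking space replaced by a visible marker."""
    return text.replace(NBSP, marker)


def count_nbsp(text: str) -> int:
    """Count the non-breaking spaces hiding in text."""
    return text.count(NBSP)


# One ordinary space followed by four non-breaking spaces.
sample = "five spaces \u00a0\u00a0\u00a0\u00a0followed"
print(reveal_nbsp(sample))
print(count_nbsp(sample))
```

Ordinary spaces pass through untouched, so only the 0xA0 characters show up as markers.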

      So, what I would like to see is for Note Tab to convert &nbsp; to the
      intended 0x20 space; or, at least to provide the option to select from
      0x20 or 0xA0. The Options > HTML tab would be as good a place as any
      to provide a checkbox to control this behavior.

      -Richard Anderson
    • ebbtidalflats
      Message 2 of 10, Sep 6, 2007
        Richard,

        I'm going to have ta spoil your day. You are indeed talking about a
        FEATURE of NoteTab. Read on for the explanation, as well as the
        solution to having your cake, and eating it too.


        --- In notetab@yahoogroups.com, "Richard" <anderson@...> wrote:
        ...
        > to a single space unless the coder specifies a &nbsp; (which is
        > decimal 160, or 0xA0 in hex). This is done to "trick" the engine into
        > displaying an unprintable character whenever it encounters one.

        Just because you, or anyone else uses the non-breaking-space to
        "trick" the browser into unusual behavior, doesn't mean that NoteTab
        has a bug in it. The bug is, that YOU are using the nbsp wrong.

        If you want to display more than 1 space in a web page, one legit way
        is to use the <pre> tag (see below). Rather than "trick" the browser
        into anything, the NBSP (non-breaking space) handles the occasional
        need to keep two words together, even in situations, where the space
        between them would separate them at the end of the line.


        In other words, when ___YOU___ insert an NBSP into html text, you are
        telling the browser, that you want what comes before the NBSP to stay
        together with what comes after.
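
The keep-together behavior can be seen outside a browser as well. As an illustrative sketch only (Python's standard `textwrap` module is used here as a stand-in for a renderer's line breaking; the sample strings are invented), a line wrapper that breaks at ordinary spaces will not break at U+00A0:

```python
import textwrap

NBSP = "\u00a0"  # non-breaking space, hex A0

plain = "aaa bbb ccc ddd"             # ordinary spaces everywhere
glued = "aaa bbb" + NBSP + "ccc ddd"  # "bbb" and "ccc" joined by a non-breaking space

# textwrap only treats ASCII whitespace as a break opportunity,
# so the pair joined by U+00A0 is moved to the next line as one unit.
print(textwrap.wrap(plain, width=7))
print(textwrap.wrap(glued, width=7))
```

With ordinary spaces the text wraps as two lines of two words; with the non-breaking space, "bbb" and "ccc" stay on the same line.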

        This has NOTHING to do with NoteTab. When you STRIP html from the doc
        (or selected text) you are telling it to convert the HTML nbsp to a
        Unicode non-breaking-space. Seems to me a very useful function, to
        actually do what you tell it to.

        If you want to retain a series of real spaces after stripping HTML,
        you'll have to START with real spaces, NOT non-breaking spaces!

        For example:

        Display 5 spaces inside quote: <pre>"     "</pre>
        both in the browser, AND after notetab strips html.

        Note, that I'm not counting spaces outside the quotes.
        Nor do you really need the quotes (except to point out
        where the spaces are).


        Cheers,


        Eb
      • anderson@richlandvillager.com
        Message 3 of 10, Sep 19, 2007
          Eb (and Eric),

          Now I'm going to have to spoil your day as well.
          While it is admirable to think that &nbsp; ought to be used in only
          (very) limited instances, the web community adopted it instantly as
          a method of presenting content. It is a SPACE and nothing but a SPACE.

          Let us examine how Note Tab handles &nbsp;:

          I prepared the following test case:

          1) an original string:
          Some normal text followed by five spaces     followed by more text.
          [ Note: the substring "s     f" is hex 73 20 20 20 20 20 66 ]

          2) using Modify>Document to HTML>With Paragraph Tags the result is:
          Some normal text followed by five spaces &nbsp;&nbsp;&nbsp;&nbsp;followed by more text.

          3) reconverting (2) using Modify>Strip HTML Tags>Remove All Tags the result is:
          Some normal text followed by five spaces     followed by more text.
          [ Note: the reconverted substring is hex 73 20 A0 A0 A0 A0 66 ]

          This is an ERROR. I (and I suspect most people) expect that the operation
          A --> B --> A to result in A == A. Note Tab does not do this.
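
The asymmetry in the test case can be reproduced with a small model. This is a Python sketch under assumptions: `to_html` and `strip_html` are hypothetical stand-ins that only imitate the space handling reported above (first space kept, the rest written as `&nbsp;`, entities decoded to 0xA0 on the way back); they are not NoteTab's actual code and they ignore tags entirely.

```python
import html
import re


def to_html(text: str) -> str:
    """Keep the first space of each run and write the rest as &nbsp; entities."""
    return re.sub(r" {2,}", lambda m: " " + "&nbsp;" * (len(m.group()) - 1), text)


def strip_html(markup: str) -> str:
    """Decode entities; &nbsp; becomes U+00A0, not an ordinary space."""
    return html.unescape(markup)


original = "s" + " " * 5 + "f"
round_trip = strip_html(to_html(original))

print(original.encode("latin-1").hex(" "))    # 73 20 20 20 20 20 66
print(round_trip.encode("latin-1").hex(" "))  # 73 20 a0 a0 a0 a0 66
print(original == round_trip)                 # False
```

The two strings look identical on screen, which is why the difference only shows up in a hex dump.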

          As I stated in my original post, this behavior causes me all sorts of problems
          when I wish to process the result of converting HTML email to standard text
          email (which is 75% of my Note Tab usage); and, I wish it to be corrected.

          Sincerely,
          Richard Anderson

        • Don - HtmlFixIt.com
          Message 4 of 10, Sep 19, 2007
            > I prepared the following test case:
            >
            > 1) an original string:
            > Some normal text followed by five spaces     followed by more text.
            > [ Note: the substring "s     f" is hex 73 20 20 20 20 20 66 ]
            >
            > 2) using Modify>Document to HTML>With Paragraph Tags the result is:
            > Some normal text followed by five spaces &nbsp;&nbsp;&nbsp;&nbsp;followed by more text.
            >
            > 3) reconverting (2) using Modify>Strip HTML Tags>Remove All Tags the result is:
            > Some normal text followed by five spaces followed by more text.
            > [ Note: the reconverted substring is hex 73 20 A0 A0 A0 A0 66 ]
            >
            > This is an ERROR. I (and I suspect most people) expect that the operation
            > A --> B --> A to result in A == A. Note Tab does not do this.
            >
            > As I stated in my original post, this behavior causes me all sorts of problems
            > when I wish to process the result of converting HTML email to standard text
            > email (which is 75% of my Note Tab usage); and, I wish it to be corrected.
            >
            Dear Richard Anderson :-)
            Try this two line clip:
            ^!Replace "&nbsp;" >> " " ACIWS
            ^!ToolBar Strip HTML

            Does that re-hitch thy wagon?
            Whether it should or should not, whether it is a feature or a bug (and I
            tend to agree a=b=a makes sense ...), I think this may solve it?
          • Richard
            Message 5 of 10, Sep 20, 2007
              Don - HtmlFixIt.com suggested:

              >Dear Richard Anderson :-)
              >Try this two line clip:
              >^!Replace "&nbsp;" >> " " ACIWS
              >^!ToolBar Strip HTML
              >
              >Does that re-hitch thy wagon?
              >Whether it should or should not, whether it is a feature or a bug (and I
              >tend to agree a=b=a makes sense ...), I think this may solve it?

              Well Don, I didn't bother using the clip; rather, I simply used a global
              &nbsp; --> [space] then presented the result to
              Modify>Strip HTML Tags>Remove All Tags.

              You will just love what the result was: a string such as [space][space]
              was converted to A SINGLE SPACE (just like the browser would have rendered it).
              That really is not what I had in mind.

              So, I suppose what I need to do is:
              1) select text and Modify>Strip HTML Tags>Remove All Tags (this will give
              me 0xA0 for &nbsp;)
              2) select "converted" text and apply global replace of \xA0 (or is it \xa0?)
              with a space (maybe \x20?) remembering all the while to check the [Reg Exp]
              box
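
The difference between the two orders of operation can be sketched in Python. This is a model under assumptions, not NoteTab's code: `entity_first` imitates the reported result of replacing the entity before stripping (with the whitespace collapse written as an explicit regex), and `decode_first` imitates the two steps listed above. Both function names are made up.

```python
import html
import re


def entity_first(markup: str) -> str:
    """Swap &nbsp; for a space, then collapse whitespace runs as a renderer would."""
    spaced = markup.replace("&nbsp;", " ")
    return re.sub(r"[ \t\r\n]+", " ", spaced)


def decode_first(markup: str) -> str:
    """Decode entities first, then turn each U+00A0 into an ordinary space."""
    return html.unescape(markup).replace("\u00a0", " ")


markup = "spaces &nbsp;&nbsp;&nbsp;&nbsp;followed"
print(repr(entity_first(markup)))  # 'spaces followed'
print(repr(decode_first(markup)))  # 'spaces     followed'

# In Python the case of the hex digits makes no difference.
print("\xA0" == "\xa0")  # True
```

Only the second order keeps the number of spaces, because by then no collapsing step remains.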

              As the old saying goes, "I think NOT!!!!"

              This "feature" is broken and needs to be fixed.

              Sincerely,
              Richard

            • Eric Fookes
              Message 6 of 10, Sep 20, 2007
                Richard,

                > Now I'm going to have to spoil your day as well.
                > While it is admirable to think that &nbsp; ought to be used in only
                > (very) limited instances, the web community adopted it instantly as
                > a method of presenting content. It is a SPACE and nothing but a SPACE.

                No, it isn't. &nbsp; stands for Non-Breaking Space, which is different
                from a regular space. The ANSI equivalent is not hex 20, but hex A0. The
                way NoteTab handles those characters is by design and I have no plans to
                change that.

                > Let us examine how Note Tab handles &nbsp;:
                >
                > I prepared the following test case:
                >
                > 1) an original string:
                > Some normal text followed by five spaces     followed by more text.
                > [ Note: the substring "s     f" is hex 73 20 20 20 20 20 66 ]
                >
                > 2) using Modify>Document to HTML>With Paragraph Tags the result is:
                > Some normal text followed by five spaces &nbsp;&nbsp;&nbsp;&nbsp;followed by more text.

                The feature is a compromise. It was not designed to work the same way in
                reverse, because NoteTab doesn't know exactly what the user intends to
                do with a sequence of spaces or non-breaking spaces.

                > 3) reconverting (2) using Modify>Strip HTML Tags>Remove All Tags the result is:
                > Some normal text followed by five spaces followed by more text.
                > [ Note: the reconverted substring is hex 73 20 A0 A0 A0 A0 66 ]
                >
                > This is an ERROR. I (and I suspect most people) expect that the operation
                > A --> B --> A to result in A == A. Note Tab does not do this.

                Not all operations are symmetrical. As I mentioned above, this is the
                result of a compromise.

                > As I stated in my original post, this behavior causes me all sorts of problems
                > when I wish to process the result of converting HTML email to standard text
                > email (which is 75% of my Note Tab usage); and, I wish it to be corrected.

                I think this must be the first time since NoteTab was released in 1995
                that this feature is discussed. My guess is that I must have done
                something right there.

                --
                Regards,

                Eric Fookes
                http://www.fookes.com/
              • Don - HtmlFixIt.com
                Message 7 of 10, Sep 20, 2007
                  > I think this must be the first time since NoteTab was released in 1995
                  > that this feature is discussed. My guess is that I must have done
                  > something right there.
                  >

                  Given that it is the way it was designed (and I guess now I see how it
                  might make sense with Eric's explanation) it seems to me that a simple
                  clip could in fact be written to fix the situation right up. The whole
                  point of notetab is you can almost always make it work. With a clip one
                  need not remember to check regex, because "do a regex" is part of the
                  clip.

                  I understand tone and tenor can be mistaken in emails, but Richard I am
                  reading a bit of a can't do attitude, instead of an ok, I'm finding a
                  way to make it work. You didn't even try my clip so who knows if it
                  might have worked. It will be too bad if you turn your back on this
                  great tool, but even worse if all you do is complain.

                  Maybe I'm misreading, but I don't feel inclined to help much more. Good
                  luck.
                • ebbtidalflats
                  Message 8 of 10, Sep 21, 2007
                    --- In notetab@yahoogroups.com, anderson@... wrote:
                    >
                    > Eb (and Eric),
                    >
                    > Now I'm going to have to spoil your day as well.
                    > While it is admirable to think that &nbsp; ought to be used in only
                    > (very) limited instances, the web community adopted it instantly as
                    > a method of presenting content. It is a SPACE and nothing but a SPACE.
                    >

                    Your arguments don't wash. If "&nbsp;" were indeed just a SPACE, why
                    then is there a difference between " " and "&nbsp;" in the first
                    place? Why do YOU use normal spaces at all, instead of using &nbsp;
                    everywhere?

                    "NBSP" in &nbsp; continues to stand for "Non Braking SPace".

                    Arguing this fact isn't going to get you a conversion of nbsp to plain
                    spaces, nor does it offer a solution for preventing the wrap of lines
                    at inconvenient places in text, while still allowing an automatic line
                    break at any other space.


                    Were you correct, you would be able to offer a means to deal with the
                    following example (underline represents the UNBREAKABLE SPACE):

                    "I want this text to be all on one line, except for
                    'Titles_go_together_on_one_line' when they happen to reach the end of
                    the line"


                    More importantly, how would you propose to solve this so it works
                    BEFORE and AFTER stripping html?
                  • tuttle.grey
                    Message 9 of 10, Sep 21, 2007
                      --- In notetab@yahoogroups.com, "ebbtidalflats" <ebbtidalflats@...>
                      wrote:

                      > Your arguments don't wash. If "&nbsp;" were indeed just a SPACE, why
                      > then is there a difference between " " and "&nbsp;" in the first
                      > place. Why do YOU use normal spaces at all, instead of using &nbsp;
                      > everywhere?

                      Good points.

                      > "NBSP" in &nbsp; continues to stand for "Non Braking SPace".

                      Non-breaking space

                      > Arguing this fact isn't going to get you a conversion of nbsp to plain
                      > spaces, nor does it offer a solution for preventing the wrap of lines
                      > at inconvenient places in text, while still allowing an automatic line
                      > break at any other space.

                      > Were you correct, you would be able to offer a means to deal with the
                      > following example (underline represents the UNBREAKABLE SPACE):
                      >
                      > "I want this text to be all on one line, except for
                      > 'Titles_go_together_on_one_line' when they happen to reach the end of
                      > the line"

                      The trouble here is that someone is making possibly inappropriate and
                      certainly inconvenient use of non-breaking spaces. The elegant and
                      flexible solution to the problem of keeping some strings of text on
                      one line is to use CSS:

                      selector {white-space: nowrap;}
                    • Scott Fordin
                      Message 10 of 10, Sep 21, 2007
                        Another thing to remember about &nbsp; is that multiple
                        &nbsp; entities in a row aren't ignored per HTML spec like
                        regular blank spaces. That is, if you have several blank
                        spaces in a row, HTML gloms them all together, treating
                        them as a single blank space. Not so with &nbsp;. All
                        &nbsp; entities count.
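
The collapsing rule described here can be modelled in a few lines of Python. This is a simplified sketch, assuming that only ASCII whitespace is collapsed; `collapse_html_whitespace` is a made-up helper, not a browser or NoteTab API.

```python
import re

NBSP = "\u00a0"  # the character that &nbsp; decodes to


def collapse_html_whitespace(text: str) -> str:
    """Collapse runs of ASCII whitespace only; U+00A0 is deliberately left alone."""
    return re.sub(r"[ \t\n\f\r]+", " ", text)


spaces = "a" + " " * 4 + "b"
nbsps = "a" + NBSP * 4 + "b"

print(len(collapse_html_whitespace(spaces)))  # 3: four spaces became one
print(len(collapse_html_whitespace(nbsps)))   # 6: every non-breaking space counts

# Caution: in Python str patterns, \s also matches U+00A0, so a
# careless \s+ would collapse the non-breaking spaces as well.
print(len(re.sub(r"\s+", " ", nbsps)))        # 3
```

The explicit character class is the design choice that matters: it is what keeps the two kinds of space distinct.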

                        FWIW, I think Eric has the handling of &nbsp; entities
                        just right. It is not too difficult to write a clip, if
                        you wish, that manipulates those &nbsp; entities. For
                        example, if you want to convert the &nbsp; entities to
                        spaces but still keep the number of spaces intact, you
                        could write a two-pass search and replace, wherein
                        the first pass converts the &nbsp; entities to some
                        dummy string (like -/zzzzz/-), and the second pass
                        converts the dummy string to space characters.
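
The two-pass idea can be sketched in Python rather than as a NoteTab clip. Everything here is illustrative: `nbsp_to_spaces_two_pass` and its `between` parameter are hypothetical names, and the `between` step merely stands in for whatever processing (such as tag stripping) would otherwise collapse spaces.

```python
import re
from typing import Callable

PLACEHOLDER = "-/zzzzz/-"  # dummy string from the post; must not occur in the input


def nbsp_to_spaces_two_pass(
    markup: str, between: Callable[[str], str] = lambda s: s
) -> str:
    """Park &nbsp; behind a placeholder, run `between`, then emit real spaces."""
    if PLACEHOLDER in markup:
        raise ValueError("placeholder already present in input")
    parked = markup.replace("&nbsp;", PLACEHOLDER)  # pass 1
    processed = between(parked)                     # e.g. a space-collapsing step
    return processed.replace(PLACEHOLDER, " ")      # pass 2


def collapse_spaces(text: str) -> str:
    """Stand-in for a step that squeezes runs of ordinary spaces."""
    return re.sub(r" {2,}", " ", text)


print(repr(nbsp_to_spaces_two_pass("a &nbsp;&nbsp;b", between=collapse_spaces)))
```

Because the placeholder contains no spaces, the middle step cannot squeeze it, and the original count comes back in the second pass.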

                        Scott
