Unicode U+2028 line separator

  • Bill Moseley
    Message 1 of 4, Mar 2, 2007
      I have a UTF-8 file that uses the Unicode line separator. Not
      something I've come across very often. In UTF-8 the sequence is:

      0xE2 0x80 0xA8 (e280a8)

      In a uxterm, Vim correctly detects (and sets) the file encoding as utf-8
      (there's no BOM in the file), but the U+2028 character is shown as an
      undisplayable character rather than as a line break.
      That is, all the text is displayed as a single line.

      Can anyone educate me a bit on the use of the Line Separator character
      and if or how it can be supported in Vim?

      I'm having other problems too: the Perl script that reads this file
      doesn't treat the character as a newline (although the character does
      match \s in a regular expression).
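
Editorial illustration: a small Python sketch (not the Perl script mentioned above, which isn't shown in the thread) of the behaviour being described. U+2028 encodes to the three bytes E2 80 A8 and counts as whitespace, yet it is not "\n", so code that splits only on newlines sees one long line. The sample string is made up for the example.

```python
import re

LSEP = "\u2028"  # Unicode LINE SEPARATOR

# Made-up sample text: two logical lines joined by U+2028 only.
sample = "first line" + LSEP + "second line"

# The UTF-8 encoding of U+2028 is the byte sequence E2 80 A8.
lsep_bytes = LSEP.encode("utf-8")

# In a str pattern, \s matches Unicode whitespace, which includes U+2028.
is_whitespace = re.fullmatch(r"\s", LSEP) is not None

# Splitting on "\n" alone leaves the whole text as a single line.
newline_only = sample.split("\n")

# str.splitlines() does treat U+2028 as a line boundary.
unicode_aware = sample.splitlines()

print(lsep_bytes.hex(), is_whitespace, len(newline_only), len(unicode_aware))
```

This is the same split in behaviour reported above: the character is "whitespace" to a regular expression but invisible to newline-based line reading unless the reader is explicitly Unicode-aware.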



      --
      Bill Moseley
      moseley@...
    • A.J.Mechelynck
      Message 2 of 4, Mar 2, 2007
        Bill Moseley wrote:
        > I have a utf-8 file that uses the unicode line separator. Not
        > something I've come across very often. In utf-8 the sequence is:
        >
        > 0xE2 0x80 0xA8 (e280a8)
        >
        > In a uxterm vim correctly reads (and sets) the file encoding as utf8
        > (there's no BOM on the file), but the U+2028 character is displayed
        > as an un-displayable character and not displayed as a new line.
        > That is, all the text is displayed as a single line.
        >
        > Can anyone educate me a bit on the use of the Line Separator character
        > and if or how it can be supported in Vim?
        >
        > I'm having other problems too: the Perl script that reads this file
        > doesn't treat the character as a newline (although the character does
        > match \s in a regular expression).
        >
        >
        >

        I may be wrong, but IIUC this codepoint plays the same role as the HTML <br>
        tag: it does not define an "end of line" in the text file which contains it,
        but it means that, when rendered typographically, as in a browser or a WYSIWYG
        editor (neither of which Vim is, or tries to mimic), the rendered output must
        have a linebreak at this point.

        IOW: I think it's a feature, not a bug.

        You can add a linebreak after every occurrence of that codepoint by using

        :exe "%s/\<Char-0x2028>/" . '\0\r/g'

        Note that I intentionally use double quotes in the first part and single
        quotes in the second part.
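
Editorial illustration: as far as I can tell, the quoting matters because Vim expands "\<Char-0x2028>" into the actual character only inside double quotes, while the single quotes pass \0 (the whole match) and \r (a line break) through to :s unchanged. For anyone doing the same thing outside Vim, here is a rough Python counterpart; the function name is hypothetical and the input is made up.

```python
LSEP = "\u2028"  # Unicode LINE SEPARATOR


def add_newline_after_lsep(text):
    # Keep every U+2028 in place and append an ordinary newline after it,
    # mirroring the \0\r replacement in the :s command.
    return text.replace(LSEP, LSEP + "\n")


# Made-up input: three logical lines separated only by U+2028.
original = "one" + LSEP + "two" + LSEP + "three"
converted = add_newline_after_lsep(original)

# Newline-based tools now see three lines; the separators are still present.
lines = converted.split("\n")
print(len(lines))
```

Because the separator is kept, removing the added newlines again gives back the original text.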


        Best regards,
        Tony.
        --
        It is said that the lonely eagle flies to the mountain peaks while the
        lowly ant crawls the ground, but cannot the soul of the ant soar as
        high as the eagle?
      • Matthew Winn
        Message 3 of 4, Mar 3, 2007
          On Fri, 02 Mar 2007 20:24:44 +0100, "A.J.Mechelynck"
          <antoine.mechelynck@...> wrote:

          > Bill Moseley wrote:
          > > I have a utf-8 file that uses the unicode line separator. Not
          > > something I've come across very often. In utf-8 the sequence is:
          > >
          > > 0xE2 0x80 0xA8 (e280a8)
          > >
          > > In a uxterm vim correctly reads (and sets) the file encoding as utf8
          > > (there's no BOM on the file), but the U+2028 character is displayed
          > > as an un-displayable character and not displayed as a new line.
          > > That is, all the text is displayed as a single line.
          > >
          > > Can anyone educate me a bit on the use of the Line Separator character
          > > and if or how it can be supported in Vim?
          >
          > I may be wrong, but IIUC this codepoint plays the same role as the HTML <br>
          > tag: it does not define an "end of line" in the text file which contains it,
          > but it means that, when rendered typographically, as in a browser or a WYSIWYG
          > editor (neither of which Vim is, or tries to mimic), the rendered output must
          > have a linebreak at this point.
          >
          > IOW: I think it's a feature, not a bug.
          >
          > You can add a linebreak after every occurrence of that codepoint by using
          >
          > :exe "%s/\<Char-0x2028>/" . '\0\r/g'
          >
          > Note that I intentionally use double quotes in the first part and single
          > quotes in the second part.

          According to http://www.unicode.org/reports/tr13/tr13-9.html the
          correct way to treat U+2028 and U+2029 (paragraph separator) is to
          translate them into the platform's standard sequence for representing
          the end of a line. (What it actually says is that if the purpose of
          the line break is unambiguously known -- that is, whether it is the
          end of a line or the end of a paragraph -- then the corresponding
          Unicode character should be used. But Vim is a text editor and knows
          nothing of paragraphs, so I would expect both these characters to be
          translated into the platform's end-of-line representation.)

          However, this would be lossy, so if this were to be implemented I
          suspect an option would be required for the benefit of people who want
          to edit Unicode text without losing the distinction between line and
          paragraph endings.
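
Editorial illustration: a minimal Python sketch of the trade-off described here, not of anything Vim implements. The function name and the `preserve` flag are hypothetical stand-ins for the kind of option suggested above, and "\n" stands in for the platform's end-of-line sequence.

```python
LINE_SEP = "\u2028"  # Unicode LINE SEPARATOR
PARA_SEP = "\u2029"  # Unicode PARAGRAPH SEPARATOR


def translate_separators(text, newline="\n", preserve=False):
    """Map U+2028/U+2029 to an end-of-line sequence.

    With preserve=False both separators are replaced (lossy).
    With preserve=True each separator is kept and the newline is appended.
    """
    if preserve:
        text = text.replace(LINE_SEP, LINE_SEP + newline)
        return text.replace(PARA_SEP, PARA_SEP + newline)
    return text.replace(LINE_SEP, newline).replace(PARA_SEP, newline)


# Made-up inputs that differ only in which separator they use.
with_line_sep = "a" + LINE_SEP + "b"
with_para_sep = "a" + PARA_SEP + "b"

# Lossy: both inputs collapse to the same text.
lossy_equal = translate_separators(with_line_sep) == translate_separators(with_para_sep)

# Preserving: the two inputs stay distinguishable.
kept_distinct = (translate_separators(with_line_sep, preserve=True)
                 != translate_separators(with_para_sep, preserve=True))

print(lossy_equal, kept_distinct)
```

The lossy branch shows why the line/paragraph distinction cannot be recovered afterwards, which is the case for making the behaviour optional.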

          --
          Matthew Winn
        • A.J.Mechelynck
          Message 4 of 4, Mar 3, 2007
            Matthew Winn wrote:
            > On Fri, 02 Mar 2007 20:24:44 +0100, "A.J.Mechelynck"
            > <antoine.mechelynck@...> wrote:
            >
            >> Bill Moseley wrote:
            >>> I have a utf-8 file that uses the unicode line separator. Not
            >>> something I've come across very often. In utf-8 the sequence is:
            >>>
            >>> 0xE2 0x80 0xA8 (e280a8)
            >>>
            >>> In a uxterm vim correctly reads (and sets) the file encoding as utf8
            >>> (there's no BOM on the file), but the U+2028 character is displayed
            >>> as an un-displayable character and not displayed as a new line.
            >>> That is, all the text is displayed as a single line.
            >>>
            >>> Can anyone educate me a bit on the use of the Line Separator character
            >>> and if or how it can be supported in Vim?
            >> I may be wrong, but IIUC this codepoint plays the same role as the HTML <br>
            >> tag: it does not define an "end of line" in the text file which contains it,
            >> but it means that, when rendered typographically, as in a browser or a WYSIWYG
            >> editor (neither of which Vim is, or tries to mimic), the rendered output must
            >> have a linebreak at this point.
            >>
            >> IOW: I think it's a feature, not a bug.
            >>
            >> You can add a linebreak after every occurrence of that codepoint by using
            >>
            >> :exe "%s/\<Char-0x2028>/" . '\0\r/g'
            >>
            >> Note that I intentionally use double quotes in the first part and single
            >> quotes in the second part.
            >
            > According to http://www.unicode.org/reports/tr13/tr13-9.html the
            > correct way to treat U+2028 and U+2029 (paragraph separator) is to
            > translate them into the platform's standard sequence for representing
            > the end of a line. (What it actually says is that if the purpose of
            > the line break is unambiguously known -- that is, whether it is the
            > end of a line or the end of a paragraph -- then the corresponding
            > Unicode character should be used. But Vim is a text editor and knows
            > nothing of paragraphs, so I would expect both these characters to be
            > translated into the platform's end-of-line representation.)
            >
            > However, this would be lossy, so if this were to be implemented I
            > suspect an option would be required for the benefit of people who want
            > to edit Unicode text without losing the distinction between line and
            > paragraph endings.
            >

            That's why I suggested adding an ASCII linebreak after the LSEP, not replacing it.

            Best regards,
            Tony.
            --
            There was a young lady from Maine
            Who claimed she had men on her brain.
            But you knew from the view,
            As her abdomen grew,
            It was not on her brain that he'd lain.