Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset

  • Benjamin Fritz
    Message 1 of 15, Aug 28, 2010
      On Sat, Aug 28, 2010 at 4:16 PM, Tony Mechelynck
      <antoine.mechelynck@...> wrote:
      >> From my understanding, 'fileencoding' is the encoding Vim is supposed
      >> to use to read/write the file. So, it does make sense that we should
      >> use this instead of just 'encoding' for the charset of the generated
      >> html. Does anyone know why TOhtml has used 'encoding' instead? I have
      >> not touched the charset detection code yet, other than to move it from
      >> the 2html.vim file into the autoload/tohtml.vim file.
      >
      > You got it right, and it does indeed make sense.
      > One possibility is that anything can be represented in UTF-8, including text
      > not yet saved from the latest edit of the file, and possibly incompatible
      > with the 'fileencoding' - such text is of course in error, and will cause an
      > error if one tries to save it.
      >

      Ok, I think I'll make the edit, then.

      Your response gives me an idea to fix something else that's been
      bothering me. Currently, if Vim cannot determine the correct charset
      to use, it defaults to not including one at all. I think I'll have it
      default the charset and file encoding to UTF-8 if neither the
      fileencoding nor the encoding option gives a valid charset. The user
      should be able to manually leave out the charset and manually set the
      encoding if desired.

      Here's what I'm thinking in more detail:

      For one buffer:
      1. If the user specified a charset, try to determine 'fileencoding' from
      the charset. If this fails, warn the user that they will need to set
      'fileencoding' manually.
      2. If no charset is specified, try to determine a charset from the
      'fileencoding' option. If successful, use the same 'fileencoding' and
      the associated charset in the generated buffer.
      3. If no charset can be determined from 'fileencoding', try again with
      'encoding'. If successful, set 'fileencoding' to blank in the new html
      buffer and use the charset from the 'encoding' option.
      4. If no charset can be determined from either 'encoding' or
      'fileencoding', default to UTF-8 and warn the user.
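
As a rough illustration of the four single-buffer steps, here is a minimal Python sketch of the decision logic. It is not the Vim script from the patch: the function name `choose_charset`, the tiny mapping table, and the warning strings are all hypothetical, and a real table would hold many more encodings.

```python
# Hypothetical, partial mapping from Vim encoding names to IANA charset names.
VIM_TO_CHARSET = {
    "latin1": "ISO-8859-1",
    "utf-8": "UTF-8",
    "cp1252": "windows-1252",
    "euc-jp": "EUC-JP",
}
# Reverse lookup, keyed by lower-cased charset name.
CHARSET_TO_VIM = {v.lower(): k for k, v in VIM_TO_CHARSET.items()}


def choose_charset(user_charset, fileencoding, encoding):
    """Return (charset, fileencoding for the html buffer, warning or None)."""
    # Step 1: a user-specified charset wins; try to derive 'fileencoding'.
    if user_charset:
        fenc = CHARSET_TO_VIM.get(user_charset.lower())
        if fenc is None:
            return user_charset, "", "please set 'fileencoding' manually"
        return user_charset, fenc, None
    # Step 2: derive the charset from 'fileencoding' and keep that option.
    charset = VIM_TO_CHARSET.get(fileencoding.lower())
    if charset:
        return charset, fileencoding, None
    # Step 3: fall back to 'encoding' and leave 'fileencoding' blank.
    charset = VIM_TO_CHARSET.get(encoding.lower())
    if charset:
        return charset, "", None
    # Step 4: nothing matched, so default to UTF-8 and warn.
    return "UTF-8", "utf-8", "unknown charset, defaulting to UTF-8"
```

For example, `choose_charset("", "latin1", "utf-8")` takes the step 2 branch and keeps 'fileencoding' as latin1, while an empty 'fileencoding' falls through to step 3.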

      Multiple buffers in diff mode will be handled similarly, except that we
      will determine the charset as above for ALL buffers. If they differ,
      set 'fileencoding' to blank and use the charset from 'encoding' (or
      UTF-8 if no charset can be determined from 'encoding').
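
The diff-mode rule can be sketched the same way. This is a minimal Python illustration under stated assumptions, not the patch itself: `choose_diff_charset` is a hypothetical name, the per-buffer charsets are assumed to have been computed already by the single-buffer rules, and `encoding_charset` is the charset derived from 'encoding' (or None if that failed).

```python
def choose_diff_charset(buffer_charsets, encoding_charset):
    """Return (charset, keep_fileencoding) for a set of diffed buffers.

    keep_fileencoding is False when 'fileencoding' should be left blank
    in the generated html buffer.
    """
    # Compare case-insensitively, since charset names are not case-sensitive.
    unique = {c.lower() for c in buffer_charsets}
    if len(unique) == 1:
        # All buffers agree: use their common charset.
        return buffer_charsets[0], True
    # Buffers disagree: fall back to 'encoding', or UTF-8 as a last resort.
    return (encoding_charset or "UTF-8"), False
```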

      What do you think? Or maybe this is too complicated and I should just
      use 'encoding' as done currently?

      --
      You received this message from the "vim_dev" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php
    • Tony Mechelynck
      Message 2 of 15, Aug 28, 2010
        On 29/08/10 04:29, Benjamin Fritz wrote:
        > On Sat, Aug 28, 2010 at 4:16 PM, Tony Mechelynck
        > <antoine.mechelynck@...> wrote:
        >>> From my understanding, 'fileencoding' is the encoding Vim is supposed
        >>> to use to read/write the file. So, it does make sense that we should
        >>> use this instead of just 'encoding' for the charset of the generated
        >>> html. Does anyone know why TOhtml has used 'encoding' instead? I have
        >>> not touched the charset detection code yet, other than to move it from
        >>> the 2html.vim file into the autoload/tohtml.vim file.
        >>
        >> You got it right, and it does indeed make sense.
        >> One possibility is that anything can be represented in UTF-8, including text
        >> not yet saved from the latest edit of the file, and possibly incompatible
        >> with the 'fileencoding' - such text is of course in error, and will cause an
        >> error if one tries to save it.
        >>
        >
        > Ok, I think I'll make the edit, then.
        >
        > Your response gives me an idea to fix something else that's been
        > bothering me. Currently, if Vim cannot determine the correct charset
        > to use, it defaults to not including one at all. I think I'll have it
        > default the charset and file encoding to UTF-8 if neither the
        > fileencoding nor the encoding option gives a valid charset. The user
        > should be able to manually leave out the charset and manually set the
        > encoding if desired.
        >
        > Here's what I'm thinking in more detail:
        >
        > For one buffer:
        > 1. If user specified a charset, try to determine 'fileencoding' from
        > charset. If this fails, warn the user they will need to manually set
        > the fileencoding.
        > 2. If no charset is specified, try to determine a charset from the
        > 'fileencoding' option. If successful, use the same 'fileencoding' and
        > the associated charset in the generated buffer.
        > 3. If could not determine charset from 'fileencoding', try again with
        > 'encoding'. If successful, set 'fileencoding' to blank in the new html
        > buffer and use the charset from the 'encoding' option.
        > 4. If could not determine charset from either 'encoding' or
        > 'fileencoding', default to UTF-8 and warn the user.
        >
        > Multiple buffers in diff mode will be done similarly, except that we
        > will determine the charset as above for ALL buffers. If they differ,
        > set 'fileencoding' to blank and use the charset from 'encoding' (or
        > UTF-8 if cannot determine charset from 'encoding').
        >
        > What do you think? Or maybe this is too complicated and I should just
        > use 'encoding' as done currently?
        >
        > What do you think?
        >

        I think you're on the right track. Maybe a little too complicated but
        I'm not sure. I would just use 'fileencoding', or if empty (or if it can
        be ascertained that the current buffer contains characters which are
        invalid for it) then fall back on 'encoding' (by leaving 'fileencoding'
        empty in the tohtml output buffer). But go ahead if you think you can
        refine it more or make it better.

        I don't know what is being done ATM, but I'd always include the line

        <meta http-equiv="Content-Type" content="text/html; charset=whatever" />

        (replacing "whatever" by the charset name) somewhere near the start of
        the <head> element. You may want to use a synonym, e.g. iso-8859-1 for
        Latin1, but that's just the finishing touch.
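
To make the suggestion concrete, here is a minimal Python sketch of the simpler fallback (use 'fileencoding', else 'encoding') together with the meta line. The function name `charset_meta` and the one-entry synonym table are hypothetical and only illustrate the idea.

```python
from html import escape

# Hypothetical, partial table of preferred synonyms for Vim encoding names.
SYNONYMS = {"latin1": "iso-8859-1"}


def charset_meta(fileencoding, encoding):
    """Build the Content-Type meta line for the generated html."""
    # An empty 'fileencoding' means the buffer inherits 'encoding'.
    name = fileencoding or encoding
    charset = SYNONYMS.get(name.lower(), name)
    return ('<meta http-equiv="Content-Type" '
            'content="text/html; charset=%s" />' % escape(charset, quote=True))
```

With 'fileencoding' set to latin1 this yields `charset=iso-8859-1`; with it empty and 'encoding' set to utf-8 it yields `charset=utf-8`.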


        Best regards,
        Tony.
        --
        "In defeat, unbeatable; in victory, unbearable."
        -- Winston Churchill, of Montgomery

      • Benjamin Fritz
        Message 3 of 15, Aug 29, 2010
          On Sat, Aug 28, 2010 at 10:00 PM, Tony Mechelynck
          <antoine.mechelynck@...> wrote:
          >
          > I don't know what is being done ATM, but I'd always include the line
          >
          > <meta http-equiv="Content-Type" content="text/html; charset=whatever" />
          >
          > (replacing "whatever" by the charset name) somewhere near the start of the
          > <head> element. You may want to use a synonym, e.g. iso-8859-1 for Latin1,
          > but that's just the finishing touch.
          >

          Yes, that's mostly what it does now, except it omits the line if it
          could not determine the charset, always uses 'encoding' instead of
          'fileencoding', and specifies the encoding in the <?xml line instead
          when optionally using xhtml. I think using utf-8 as a fallback instead
          of leaving it out entirely would be a better idea.

          The user can specify the charset now, but then the fileencoding will
          be wrong unless the user remembers to manually set it (or if it gets
          inherited...'fileencoding' seems to act like a "global-local" option).

        • JiaYanwei
          Message 4 of 15, Aug 29, 2010
            Sorry, that was my oversight: I had set 'fileencoding' in my '.vimrc'.

            P.S. Sorry for replying so late. I could not access Google Groups
            for the last few days.

            On 2010-8-28, 03:37 Ben Fritz <fritzophre...@...> wrote:
            > On Aug 25, 11:11 pm, JiaYanwei <jia...@...> wrote:
            >
            >
            >
            > > e.g. If the system/vim encoding is 'UTF-8', but a text file encoding is
            > > 'latin-1'. If the default HTML charset is 'encoding', after ':TOhtml', we
            > > should change the HTML charset to 'iso-8859-1', or save the generated HTML
            > > file by ':w ++enc=utf-8'.
            >
            > Hmm...unless I misunderstand, the sequence is:
            >
            > Load text file. File encoding is latin-1, Vim encoding is utf-8.
            > Do :TOhtml to create a new html buffer. File encoding defaults to
            > empty, Vim encoding is still utf-8.
            > :TOhtml sees encoding and sets the charset in the generated markup to
            > UTF-8.
            > :w the new html buffer. Vim sees empty file encoding, so uses utf-8 as
            > the new file's encoding. Thus file encoding matches the html charset.
            >
            > You claim that the new html buffer has "latin-1" encoding. Am I
            > missing something here?
            >
            > I still think using fileencoding might be the "correct" way to do it,
            > but doing so would require 2html.vim to set the file encoding of the
            > new html buffer explicitly to be equal to the source file.
            >
            > This also brings up another shortcoming of 2html, because using
            > g:html_use_encoding may change the html charset meta tag, but it does
            > NOT change the actual character encoding of the file. It looks like I
            > will need to set the fileencoding of the new html buffer to whatever
            > corresponds to the supplied user option as a separate fix.
            >
            > Any thoughts?
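
The mismatch under discussion (bytes written in one encoding, charset declared as another) can be shown with a small Python sketch. The helper name `decodes_as` is hypothetical; it only checks whether a byte string is valid in a given charset.

```python
def decodes_as(raw, charset):
    """Return True if the bytes are valid in the declared charset."""
    try:
        raw.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# Text written to disk as latin-1, as Vim would with fileencoding=latin1.
latin1_bytes = "caf\u00e9".encode("latin-1")

# Declaring charset=utf-8 for these bytes is wrong; iso-8859-1 matches.
utf8_ok = decodes_as(latin1_bytes, "utf-8")
latin1_ok = decodes_as(latin1_bytes, "iso-8859-1")
```

This is why the declared charset and the encoding actually used to write the html file need to agree.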

          • Tony Mechelynck
            Message 5 of 15, Aug 29, 2010
              On 30/08/10 04:51, Benjamin Fritz wrote:
              > On Sat, Aug 28, 2010 at 10:00 PM, Tony Mechelynck
              > <antoine.mechelynck@...> wrote:
              >>
              >> I don't know what is being done ATM, but I'd always include the line
              >>
              >> <meta http-equiv="Content-Type" content="text/html; charset=whatever" />
              >>
              >> (replacing "whatever" by the charset name) somewhere near the start of the
              >> <head> element. You may want to use a synonym, e.g. iso-8859-1 for Latin1,
              >> but that's just the finishing touch.
              >>
              >
              > Yes, that's mostly what it does now, except it omits the line if it
              > could not determine the charset, always uses 'encoding' instead of
              > 'fileencoding', and specifies the encoding in the <?xml line instead
              > when optionally using xhtml. I think using utf-8 as a fallback instead
              > of leaving it out entirely would be a better idea.
              >
              > The user can specify the charset now, but then the fileencoding will
              > be wrong unless the user remembers to manually set it (or if it gets
              > inherited...'fileencoding' seems to act like a "global-local" option).
              >

              Well, for existing files, 'fileencoding' will be set locally by the
              'fileencodings' (plural) heuristic if the latter option is set. For new
              files, you can :setg fenc=something and it will be used when creating a
              new file.

              If 'fileencoding' (singular) is the empty string for a file (which is
              the default for new files) you'll inherit the value of 'encoding'.


              Best regards,
              Tony.
              --
              Said a swinging young chick named Lyth
              Whose virtue was largely a myth,
              "Try as hard as I can,
              I can't find a man
              That it's fun to be virtuous with."

            • Benjamin Fritz
              Message 6 of 15, Sep 10, 2010
                The attached patch against the latest 7.3.3 changeset in Mercurial
                makes :TOhtml use 'fileencoding', when it is set, instead of
                'encoding' to determine the HTML charset, as requested.

                Additionally, it will now support a lot more encodings, and
                automatically set the file encoding of the new file to match the
                charset.

                All encodings that are both native to Vim (listed by name in :help
                encoding-names) and appear in the IANA registry (
                http://www.iana.org/assignments/character-sets ) are supported. Note
                that not all of these encodings are supported by major web browsers or
                the w3c validator. New options are provided to override specific
                encodings in the charset detection, or there is still
                g:html_use_encoding to override all automatic detection. It is
                probably a good idea to use this option if publishing to a web page.

                There may be some charsets that previously were automatically detected
                that no longer are, and there are some encodings supported by Vim
                which I could not find in the IANA registry.

                Unfortunately, I could not find a list of widely supported charsets,
                so I just used all the ones in Vim and the IANA registry, as mentioned
                previously. If there is such a list, would it be a good idea to limit
                the automatically detected charsets to those in the list? Along those
                lines, it could be a good idea to automatically use UTF-8 in place of
                UTF-16 and UTF-32. Currently these charsets are selected as-is.

                So, consider this a beta release. PLEASE test and comment, I expect
                some changes may be needed before final submission.

                Patch is attached, or the files are available for download at the site
                I use for the TOhtml test suite:

                http://code.google.com/p/vim-2html-test/downloads/list

              • Ben Fritz
                Message 7 of 15, Sep 11, 2010
                  On Sep 10, 10:22 pm, Benjamin Fritz <fritzophre...@...> wrote:
                  > Unfortunately, I could not find a list of widely supported charsets,
                  > so I just used all the ones in Vim and the IANA registry, as mentioned
                  > previously. If there is such a list, would it be a good idea to limit
                  > the automatically detected charsets to those in the list? Along those
                  > lines, it could be a good idea to automatically use UTF-8 in place of
                  > UTF-16 and UTF-32. Currently these charsets are selected as-is.
                  >

                  Notably, I should mention:

                  UTF-32 is not supported at all in Opera. In fact, they removed support
                  for UTF-32 in version 10: http://www.opera.com/docs/changelogs/windows/1000b1/

                  UTF-32 and UTF-16 do not seem to be supported by Firefox at all for
                  xhtml, and I had to manually select the correct encoding for the html
                  documents.

                  Google Chrome, Internet Explorer 8, and Safari seem to have no
                  problems (although IE8 does not support xhtml at all so I could not
                  test these in that browser).

                  I'm thinking that I will make the automatic detection from the Vim
                  encoding default to UTF-8 for these encodings, but will leave the
                  detection of encoding from charset in case the user specifies one of
                  them using g:html_use_encoding. The user can also use
                  g:html_charset_override if they want these to be automatically
                  detected.
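
A minimal Python sketch of this plan follows. It is an illustration only: `detect_charset` and the small `BASE` table are hypothetical, and the override is modelled here as a plain dict from Vim encoding name to charset, which is an assumption about how g:html_charset_override would be used.

```python
# Hypothetical, partial table of automatically detected charsets.
BASE = {
    "latin1": "ISO-8859-1",
    "utf-8": "UTF-8",
}


def detect_charset(vim_encoding, override=None):
    """Map a Vim encoding name to the charset written into the html."""
    enc = vim_encoding.lower()
    # A user-supplied override is consulted first.
    if override and enc in override:
        return override[enc]
    # UTF-16 and UTF-32 have poor browser support, so default to UTF-8.
    if enc.startswith(("utf-16", "utf-32")):
        return "UTF-8"
    # Otherwise use the normal table; None means "not detected".
    return BASE.get(enc)
```

So `detect_charset("utf-16")` gives UTF-8 by default, while passing `{"utf-16": "UTF-16"}` as the override restores the as-is behaviour for users who want it.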

                  Thoughts? There are some test files available here if you're curious:

                  http://code.google.com/p/vim-2html-test/source/browse/encoding_test/

                • Ben Fritz
                  Message 8 of 15, Oct 6, 2010
                    On Sep 11, 8:57 am, Ben Fritz <fritzophre...@...> wrote:
                    >
                    > I'm thinking that I will make the automatic detection from the Vim
                    > encoding default to UTF-8 for these encodings, but will leave the
                    > detection of encoding from charset in case the user specifies one of
                    > them using g:html_use_encoding. The user can also use
                    > g:html_charset_override if they want these to be automatically
                    > detected.
                    >

                    I created a separate thread with another beta release which makes this
                    and a couple other changes, on both vim_dev and vim_use for greater
                    visibility. I have not yet received any feedback from the first beta.

                    Here is the new thread on vim_dev:

                    http://groups.google.com/group/vim_dev/browse_thread/thread/a04e42e642872736
