
Text file differences - what's going on?

  • Alec Burgess
    Message 1 of 9 , Dec 2, 2010
      see:
      http://tech.groups.yahoo.com/group/ntb-OffTopic/files/HTML_vs_xHTML.zip

      I was having problems getting X1 to index a bunch of files extracted
      from a zip file torrent of cablegate.wikileaks.org
      I finally figured out that there must be something funny about the file
      formats.
      Above zip contains two files art_bad.html and art_good.html

      art_bad.html is original file after extracting from the zip file.
      art_good.html is the result of opening the "bad" file in Notepad.exe and
      then saving it under new name.

      Opening both files in NoteTab: status bar reports both as UTF-8

      If I try to open "art_bad.html" with PSPad Hex viewer it says "Can't
      open file"

      If I run file.exe (from cygwin) against both it reports:
      ----
      art_bad.html: xHTML document text
      art_good.html: HTML document text
      ---

      Using online hex dump tool http://www.fileformat.info/tool/hexdump.htm
      shows the first 16 bytes of each as:
      art_bad.html:
      0000-0010: 3c 3f 78 6d-6c 20 76 65-72 73 69 6f-6e 3d 27 31 <?xml.ve rsion='1
      art_good.html:
      0000-0010: ef bb bf 3c-3f 78 6d 6c-20 76 65 72-73 69 6f 6e ...<?xml .version

      so the difference is the first three bytes not present in the "bad"
      version (before Save as ... in Notepad)
      I know I've seen this discussed here before but can't remember what this
      is all about.
      Can anybody explain and/or point me to a tool I can use to change the
      "bad" files to "good" so I can get them indexed by X1?

      --
      Regards ... Alec (buralex@gmail& WinLiveMess - alec.m.burgess@skype)
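
The comparison of the first bytes described above can also be done locally rather than with an online hex dump tool. The following is a minimal sketch, assuming Python 3 is installed; the function names `has_utf8_bom` and `first_bytes_hex` are hypothetical helpers, not part of any tool mentioned in this thread.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the three bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def first_bytes_hex(path, count=16):
    """Return the first `count` bytes of the file as space-separated hex."""
    with open(path, "rb") as f:
        data = f.read(count)
    return " ".join(f"{b:02x}" for b in data)
```

Calling `first_bytes_hex("art_bad.html")` and `first_bytes_hex("art_good.html")` should reproduce the two dump lines shown in the message, and `has_utf8_bom` reduces the difference to a yes/no answer per file.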
    • loro
      Message 2 of 9 , Dec 2, 2010
        Alex wrote:
        >see:
        >http://tech.groups.yahoo.com/group/ntb-OffTopic/files/HTML_vs_xHTML.zip
        >
        >I was having problems getting X1 to index a bunch of files extracted
        >from a zip file torrent of cablegate.wikileaks.org
        >I finally figured out that there must be something funny about the file
        >formats.
        >Above zip contains two files art_bad.html and art_good.html

        What's X1? Maybe it's a BOM (byte order mark).
        <http://www.w3.org/International/questions/qa-utf8-bom.en.php>

        Lotta
      • Alex Plantema
        Message 3 of 9 , Dec 2, 2010
          On Thursday, 2 December 2010 at 13:01, Alec Burgess wrote:

          > Can anybody explain and/or point me to a tool I can use to change the
          > "bad" files to "good" so I can get them indexed by X1?

          EF BB BF is the UTF-8 representation of U+FEFF, the byte order mark.
          With Notepad, you can create a file containing only these three bytes
          by saving an empty file with UTF-8 encoding,
          and use it to add them to a file, e.g.:

          copy /b empty.txt+art_bad.html art_good2.html

          Alex.
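
For a large batch of extracted files, the same prepend step can be scripted. This is a minimal sketch, assuming Python 3; `add_utf8_bom` is a hypothetical helper name. It works on raw bytes, so the rest of the file is left unchanged, and it skips files that already start with EF BB BF.

```python
import codecs


def add_utf8_bom(src, dst):
    """Copy src to dst, prepending EF BB BF unless src already starts with it."""
    with open(src, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        data = codecs.BOM_UTF8 + data
    with open(dst, "wb") as f:
        f.write(data)
```

For example, `add_utf8_bom("art_bad.html", "art_good2.html")` would produce the same result as the command above. Looping over a directory listing would convert the whole set.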
        • Alec Burgess
          Message 4 of 9 , Dec 2, 2010
            On 2010-12-02 12:36, Alex Plantema wrote:
            > On Thursday, 2 December 2010 at 13:01, Alec Burgess wrote:
            >
            > > Can anybody explain and/or point me to a tool I can use to change the
            > > "bad" files to "good" so I can get them indexed by X1?
            >
            > EF BB BF is the UTF-8 representation of U+FEFF, the byte order mark.
            > With Notepad, you can create a file containing only these three bytes
            > by saving an empty file with UTF-8 encoding,
            > and use it to add them to a file, e.g.:
            >
            > copy /b empty.txt+art_bad.html art_good2.html
            Thanks Alex for that explanation and workaround. I'll use it.

            After googling [EF BB BF] and reading
            http://en.wikipedia.org/wiki/Byte_order_mark it appears that presence of
            the byte order mark is supposed to be optional in UTF-8 files (and
            deprecated).

            Am I reading that correctly? If so the behavior of PsPad (says can't
            read in hex format) and X1 (won't index) should be considered "bugs"?

            --
            Regards ... Alec (buralex@gmail& WinLiveMess - alec.m.burgess@skype)
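
A reader that treats the BOM as optional accepts UTF-8 input both with and without it. As an illustration only (this says nothing about how X1 or PSPad are implemented), Python's standard `utf-8-sig` codec behaves that way: it strips a leading EF BB BF if present and otherwise decodes plain UTF-8. The helper name `decode_utf8_optional_bom` is hypothetical.

```python
def decode_utf8_optional_bom(data):
    """Decode UTF-8 bytes, dropping a leading BOM (EF BB BF) if there is one."""
    return data.decode("utf-8-sig")


with_bom = b"\xef\xbb\xbf<?xml version='1.0'?>"
without_bom = b"<?xml version='1.0'?>"

# Both inputs decode to the same text, so neither form is "bad" to this reader.
same = decode_utf8_optional_bom(with_bom) == decode_utf8_optional_bom(without_bom)
```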
          • Alec Burgess
            Message 5 of 9 , Dec 2, 2010
              On 2010-12-02 11:40, loro wrote:
              > Alex wrote:
              > >see:
              > >http://tech.groups.yahoo.com/group/ntb-OffTopic/files/HTML_vs_xHTML.zip
              > >
              > >I was having problems getting X1 to index a bunch of files extracted
              > >from a zip file torrent of cablegate.wikileaks.org
              > >I finally figured out that there must be something funny about the file
              > >formats.
              > >Above zip contains two files art_bad.html and art_good.html
              >
              > What's X1? Maybe it's a BOM (byte order mark).
              > <http://www.w3.org/International/questions/qa-utf8-bom.en.php>
              Thanks Lotta and Alex (Plantema). Because his reply left my name as
              "alec" instead of "alex" it got filtered to my read-first folder and I
              saw his reply before yours. ;-)

              In the reference you (Lotta) cite I see this:
              > by the way
              > You will find that some text editors such as Windows Notepad will
              > automatically add a UTF-8 signature to any file you save as UTF-8.
              If I'd read that first I'm not at all sure I would have figured out Alex's
              neat workaround.

              About X1 http://www.x1.com/ : It's a desktop indexing and search program,
              (IMO) superior to Google Desktop or Windows Indexing. For some time it
              was available free to all and a lightly customized version was available
              as Yahoo's alternative to Google Desktop. AFAIK it's no longer
              available for free. :-( I have my license because I have been a beta-tester
              for it since 2002 (then using Win98) when I first heard about it.

              --
              Regards ... Alec (buralex@gmail& WinLiveMess - alec.m.burgess@skype)
            • Alex Plantema
              Message 6 of 9 , Dec 2, 2010
                On Thursday, 2 December 2010 at 23:01, Alec Burgess wrote:

                > After googling [EF BB BF] and reading
                > http://en.wikipedia.org/wiki/Byte_order_mark it appears that presence
                > of the byte order mark is supposed to be optional in UTF-8 files (and
                > deprecated).
                >
                > Am I reading that correctly? If so the behavior of PsPad (says can't
                > read in hex format) and X1 (won't index) should be considered "bugs"?

                I think you're right, see http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark

                Alex.
              • loro
                Message 7 of 9 , Dec 2, 2010
                  Al wrote:-)
                  >After googling [EF BB BF] and reading
                  >http://en.wikipedia.org/wiki/Byte_order_mark it appears that presence of
                  >the byte order mark is supposed to be optional in UTF-8 files (and
                  >deprecated).

                  FYI my copy of PSPad opens both files without problems. The hex view
                  doesn't show me the BOM though, the files look identical. I haven't
                  upgraded in a while, let's see, version 4.3.0 (1971).
                • Alec Burgess
                  Message 8 of 9 , Dec 2, 2010
                    On 2010-12-02 21:36, loro wrote:
                    > Al wrote:-)
                    > >After googling [EF BB BF] and reading
                    > >http://en.wikipedia.org/wiki/Byte_order_mark it appears that presence of
                    > >the byte order mark is supposed to be optional in UTF-8 files (and
                    > >deprecated).
                    >
                    > FYI my copy of PSPad opens both files without problems. The hex view
                    > doesn't show me the BOM though, the files look identical. I haven't
                    > upgraded in a while, let's see, version 4.3.0 (1971).
                    Thanks for checking ... my version was 4.5.1 and I updated to latest
                    stable 4.5.4 (There is a 4.5.5 beta available)
                    I hadn't realized ... didn't check ...
                    PSPad will add three Explorer context menu options: PSPad/PSPad Hex
                    View/PSPad Text Diff
                    I get the "Can not open file art_bad.html" error *ONLY* when I try to
                    open art_bad.html from the context menu [PSPad Hex View]

                    After opening the files in PSPad normally and then using View-Hex Edit Mode:
                    art_bad.html (w/o Byte order mark)
                    - shows in Status bar: Code page: ANSI (Windows) and shows first byte as
                    3C (the '<' in "<?xml version='1")

                    art_good.html (with Byte order mark)
                    - shows in Status bar: Code page: UTF-8 and shows the first four bytes
                    as FFFE 3C00 with first two chars: 'ÿþ<' in "ÿþ<?xml v" where the funny
                    character 'ÿþ' is (I guess) UTF-8 FFFE
                    - This doesn't appear to have any direct relation to the byte order mark
                    [EF BB BF] and who knows why it's being shown that way. :-)

                    When I use Alex's "trick" and open art_good.html, delete all text and
                    save as either BOM_only.txt or BOM_only.html then PSPad shows both files
                    as Code page: ANSI (Windows) with contents "" and in hex the
                    expected "EFBB BF".
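
One possible reading of the hex views above: FF FE is how the same character U+FEFF is encoded in little-endian UTF-16, and 3C 00 is '<' in that encoding, so the hex view of art_good.html may be showing the text after conversion to UTF-16 rather than the bytes on disk. That is an assumption about PSPad, not something confirmed in this thread. The sketch below, assuming Python 3 and using the hypothetical helper `bom_as_seen`, shows how U+FEFF looks in each encoding and how a viewer using the Windows-1252 ("ANSI") code page would render those bytes.

```python
BOM = "\ufeff"  # the byte order mark character, U+FEFF


def bom_as_seen(encoding, display="cp1252"):
    """Encode U+FEFF with `encoding`; return (hex bytes, text as shown in `display`)."""
    raw = BOM.encode(encoding)
    hex_text = " ".join(f"{b:02x}" for b in raw)
    return hex_text, raw.decode(display)


utf8_view = bom_as_seen("utf-8")       # hex is "ef bb bf"
utf16_view = bom_as_seen("utf-16-le")  # hex is "ff fe"
```

Under Windows-1252 the UTF-8 form renders as 'ï»¿' and the UTF-16 little-endian form as 'ÿþ', which matches the 'ÿþ<' seen in the status-bar example.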

                    I've been learning way more about this stuff than I ever really wanted !!

                    I'll probably post a question/bug-report with PSPad and ask on X1 forums
                    if this "bug?" has been corrected in later versions of X1 and beg for an
                    updated copy.

                    So far only 593 of 251,287 cables have been made available on
                    http://cablegate.wikileaks.org/ with torrent link:
                    http://file.wikileaks.org/torrent/cablegate/cablegate-201012021301.7z.torrent
                    and I gather from the media I've got a couple of weeks to figure out how
                    to search them when eventually released.

                    One more data point: HippoEDIT was the giveawayoftheday a couple of
                    weeks ago ... when I open art_bad.html in it the file is immediately
                    flagged as modified - haven't checked, but guess that if I save it the
                    BOM will have been inserted a la Notepad.

                    --
                    Regards ... Alec (buralex@gmail& WinLiveMess - alec.m.burgess@skype)
                  • loro
                    Message 9 of 9 , Dec 2, 2010
                      At 07:43 2010-12-03, Alec Burgess wrote:
                      >with first two chars: 'ÿþ<' in "ÿþ<?xml v" where the funny
                      >character 'ÿþ' is (I guess) UTF-8 FFFE

                      Yeah, ÿþ is how it usually shows up in the upper left corner of web pages.

                      Lotta