8992 Text file differences - what's going on?

  • Alec Burgess
    Dec 2, 2010
      see:
      http://tech.groups.yahoo.com/group/ntb-OffTopic/files/HTML_vs_xHTML.zip

      I was having problems getting X1 to index a bunch of files extracted
      from a zip file torrent of cablegate.wikileaks.org.
      I finally figured out that there must be something funny about the file
      formats.
      The zip above contains two files: art_bad.html and art_good.html

      art_bad.html is original file after extracting from the zip file.
      art_good.html is the result of opening the "bad" file in Notepad.exe and
      then saving it under new name.

      Opening both files in NoteTab: the status bar reports both as UTF-8

      If I try to open "art_bad.html" with PsPad-Hex viewer it says "Can't
      open file"

      If I run file.exe (from cygwin) against both it reports:
      ----
      art_bad.html: xHTML document text
      art_good.html: HTML document text
      ----

      The online hex dump tool at http://www.fileformat.info/tool/hexdump.htm
      shows the first 16 bytes of each as:
      art_bad.html:
      0000-0010: 3c 3f 78 6d-6c 20 76 65-72 73 69 6f-6e 3d 27 31 <?xml.ve rsion='1
      art_good.html:
      0000-0010: ef bb bf 3c-3f 78 6d 6c-20 76 65 72-73 69 6f 6e ...<?xml .version

      So the difference is the first three bytes (ef bb bf), which are not
      present in the "bad" version (before Save As ... in Notepad).
      I know I've seen this discussed here before but can't remember what this
      is all about.
      Can anybody explain and/or point me to a tool I can use to change the
      "bad" files to "good" so I can get them indexed by X1?

      --
      Regards ... Alec (buralex@gmail& WinLiveMess - alec.m.burgess@skype)