unicode strings, hashes and filenames in Windows XP?

  • thisis_not_anapple
    Message 1 of 1 , Jun 11 12:35 PM
      I'm having issues with filenames in Windows XP SP2 that contain
      international characters. I'm trying to parse the iTunes music database
      (from an XML file) to extract the filenames of each track in the database
      and (among other things) list the files that are missing in a text file.

      In the XML, filenames are represented by a file:// URL. Here's an
      example:
      file://localhost/H:/mp3/Gabriel%20O%20Pensador/Quebra%20Cabe%C3%A7a/Gabriel%20O%20Pensador%20-%20Quebra-Cabe%C3%A7a%20-%2006.%20En%20La%20Casa.mp3

      So part of my program converts this into the actual filepath using this
      code:
      # Add an extra property holding the filename decoded from the Location
      # (a URL with percent encoding and decimal numeric character references).
      $tracks{$TrackID}->{'filename'} = $value;
      # Remove the "file://localhost/" prefix from the start of the URL.
      $tracks{$TrackID}->{'filename'} =~ s/^file:\/\/localhost\///;
      # Replace the forward slashes with backslashes.
      $tracks{$TrackID}->{'filename'} =~ s/\//\\/g;
      # This should decode the percent encoding of the URL
      # (convert %XX to the character with that hex value).
      $tracks{$TrackID}->{'filename'} =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;
      # This should decode the decimal numeric character references
      # (e.g. '&#38;' = '&').
      $tracks{$TrackID}->{'filename'} =~ s/&\#(\d*);/chr $1/eg;
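
      To make the intermediate result visible, here is a minimal standalone
      sketch of the same decoding steps applied to the example URL. The sub
      name url_to_path is hypothetical and exists only for this illustration.
      It dumps the character codes of the "Cabeça" part, which shows that
      %C3%A7 turns into two separate characters (0xC3 and 0xA7, the UTF-8
      bytes for ç) rather than one:

```perl
use strict;
use warnings;

# Hypothetical helper: the same decoding steps, applied to one Location string.
sub url_to_path {
    my ($path) = @_;
    $path =~ s{^file://localhost/}{};            # drop the URL prefix
    $path =~ s{/}{\\}g;                          # forward slashes to backslashes
    $path =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;   # %XX to the character with that code
    $path =~ s/&\#(\d+);/chr $1/eg;              # decimal character references
    return $path;
}

my $path = url_to_path(
    'file://localhost/H:/mp3/Gabriel%20O%20Pensador/Quebra%20Cabe%C3%A7a/'
  . 'Gabriel%20O%20Pensador%20-%20Quebra-Cabe%C3%A7a%20-%2006.%20En%20La%20Casa.mp3'
);

# Pull out the album folder name after "Quebra " and list its character codes.
my ($word) = $path =~ /Quebra (Cabe[^\\]+)\\/;
my $codes  = join ' ', map { sprintf '%02X', ord } split //, $word;
print "$codes\n";   # prints: 43 61 62 65 C3 A7 61
```

      So the decoded string is seven characters long where the on-disk name
      "Cabeça" has six.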

      Later on the program checks if this file exists, and if not, outputs it
      to a text file:
      if (exists($tracks{$_}->{'filename'})) {
          # Write tracks whose file is missing to a separate output file.
          unless (-e $tracks{$_}->{'filename'}) {
              print MS $tracks{$_}->{'filename'}, "\n";
          }
      }

      Now if I load that text file into Microsoft Word, it detects it as a
      UTF-8 encoded text file, and if I accept that, the filenames in the
      file look like they do in the filesystem. So the example file I gave
      above appears in this file as:
      H:\mp3\Gabriel O Pensador\Quebra Cabeça\Gabriel O Pensador - Quebra-Cabeça - 06. En La Casa.mp3

      Only it shouldn't be there, since this file exists.

      As a test I wrote this little program:
      my $filename = 'H:\mp3\Gabriel O Pensador\Quebra Cabeça\Gabriel O Pensador - Quebra-Cabeça - 06. En La Casa.mp3';

      unless (-e $filename) {
          print "$filename DOES NOT EXIST\n";
      } else {
          print "$filename DOES EXIST\n";
      }

      and it DOES indicate the file exists.

      So back to my main program, I modified it to read like so:
      if (exists($tracks{$_}->{'filename'})) {
          if ($tracks{$_}->{'filename'} eq 'H:\mp3\Gabriel O Pensador\Quebra Cabeça\Gabriel O Pensador - Quebra-Cabeça - 06. En La Casa.mp3') {
              print 'I found H:\mp3\Gabriel O Pensador\Quebra Cabeça\Gabriel O Pensador - Quebra-Cabeça - 06. En La Casa.mp3!', "\n";
          }
          # Write tracks whose file is missing to a separate output file.
          unless (-e $tracks{$_}->{'filename'}) {
              print MS $tracks{$_}->{'filename'}, "\n";
          }
      }

      and I never get an indication that the filename was matched...

      So I tried opening the text file in Microsoft Word using Standard Windows
      Encoding, and the file shows up as:
      H:\mp3\Gabriel O Pensador\Quebra CabeÃ§a\Gabriel O Pensador - Quebra-CabeÃ§a - 06. En La Casa.mp3

      which is not what it looks like in the filesystem.
      Nevertheless, if I modify the code to look like:
      if (exists($tracks{$_}->{'filename'})) {
          if ($tracks{$_}->{'filename'} eq 'H:\mp3\Gabriel O Pensador\Quebra CabeÃ§a\Gabriel O Pensador - Quebra-CabeÃ§a - 06. En La Casa.mp3') {
              print 'I found H:\mp3\Gabriel O Pensador\Quebra Cabeça\Gabriel O Pensador - Quebra-Cabeça - 06. En La Casa.mp3!', "\n";
          }
          # Write tracks whose file is missing to a separate output file.
          unless (-e $tracks{$_}->{'filename'}) {
              print MS $tracks{$_}->{'filename'}, "\n";
          }
      }

      Now I get an output line indicating that it matched the filename to the
      hash value, even though it can't find the file on the system.
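
      The mismatch can be reproduced without touching the filesystem. This is
      only a minimal sketch under two assumptions: that the hash value holds
      the UTF-8 bytes produced by the percent decoding, and that a literal
      typed into the script is stored in Windows-1252 (the "Standard Windows
      Encoding" on this machine). It uses the core Encode module; the variable
      names are made up for the example:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $from_url    = "Cabe\xC3\xA7a";   # result of decoding %C3%A7: two bytes for ç
my $from_script = "Cabe\xE7a";       # ç as one byte, as in a Windows-1252 source file (assumption)

# Compared as-is, the two strings differ in length and content.
my $equal_raw = ($from_url eq $from_script) ? 'yes' : 'no';
print "equal as-is: $equal_raw\n";                    # prints: equal as-is: no

# Interpret the URL bytes as UTF-8, then re-encode to the assumed ANSI code page.
my $converted   = encode('cp1252', decode('UTF-8', $from_url));
my $equal_after = ($converted eq $from_script) ? 'yes' : 'no';
print "equal after conversion: $equal_after\n";       # prints: equal after conversion: yes
```

      Whether -e then finds the file is not something this sketch checks; it
      only shows that the two spellings of the name are different byte
      sequences until one of them is converted.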

      So obviously there is some issue with how the Unicode characters are
      being represented in different places, but I'm not sure how to resolve
      this...

      I just downloaded and installed the latest ActivePerl, which is supposed
      to fix some Unicode filename issues, but it didn't resolve this problem
      for me.

      Can anyone help?

      Thanks!