PerlIoOpenEncoding

If you want to open a file that is encoded in something other than ASCII, ISO-8859-1 or UTF-8, you have to use Perl's PerlIO encoding layers. These are included by default in Perl 5.8 and later.

The general idea is that internally, Perl uses a superset character encoding. That is, Perl strings can contain any possible character, and Perl 'magically' handles it. However, since you've got a file on disk that is encoded in some special way, you have to tell Perl how to read that file so that it gets the characters right.
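
To make the distinction between encoded bytes and Perl characters concrete, here is a minimal sketch using the Encode module's decode and encode functions on an in-memory string rather than a file. The sample byte (0x80, the euro sign in CP1252) is chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The single byte 0x80 is the euro sign in CP1252 (illustrative sample).
my $bytes = "\x80";

# Decoding turns encoded bytes into a Perl character string.
my $chars = decode('cp1252', $bytes);
printf "characters: %d, code point: U+%04X\n", length($chars), ord($chars);

# Encoding turns the characters back into bytes, here as UTF-8.
my $utf8 = encode('UTF-8', $chars);
printf "UTF-8 bytes: %d\n", length($utf8);
```

One byte on disk becomes one character inside Perl (U+20AC), and that character becomes three bytes when written out as UTF-8. A PerlIO encoding layer performs the same conversion as data passes through a filehandle.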

Here is where PerlIO comes in. You simply tell Perl how the file is encoded and it does the rest for you. You can use this technique to transcode files from one encoding to another too. Here's an example:
^

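 use Encode;

 # Read the file as CP1252; Perl decodes it into characters on the way in.
 open(FILE, "<:encoding(cp1252)", "myfile.txt") or die "Cannot open myfile.txt: $!";

 # Write the characters back out encoded as UTF-8.
 open(OUT, ">:utf8", "outfile.txt") or die "Cannot open outfile.txt: $!";
 while (<FILE>)
 {
   print OUT $_;
 }
 close(FILE);
 close(OUT) or die "Cannot close outfile.txt: $!";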

^

The above example will open the CP1252-encoded file "myfile.txt" and write it out in UTF-8 encoding.
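
To check that a transcode like this really changes the bytes on disk, the following sketch builds its own small CP1252 file, converts it, and dumps the result in hex. The file names and the sample text are hypothetical, and lexical filehandles are used here only as a matter of style.

```perl
use strict;
use warnings;

# Hypothetical file names, used only for this sketch.
my ($in_name, $out_name) = ('sketch_cp1252.txt', 'sketch_utf8.txt');

# Write a small CP1252 sample as raw bytes: "caf", e-acute (0xE9), newline.
open(my $raw, '>:raw', $in_name) or die "Cannot write $in_name: $!";
print $raw "caf\xE9\n";
close($raw) or die "Cannot close $in_name: $!";

# Transcode: decode CP1252 on input, encode UTF-8 on output.
open(my $in,  '<:encoding(cp1252)', $in_name)  or die "Cannot open $in_name: $!";
open(my $out, '>:utf8',             $out_name) or die "Cannot open $out_name: $!";
while (my $line = <$in>) {
    print $out $line;
}
close($in);
close($out) or die "Cannot close $out_name: $!";

# Read the result back as raw bytes and print them in hex.
open(my $check, '<:raw', $out_name) or die "Cannot open $out_name: $!";
my $result = do { local $/; <$check> };
close($check);
print join(' ', map { sprintf('%02X', ord($_)) } split(//, $result)), "\n";

# Remove the scratch files created by this sketch.
unlink($in_name, $out_name);
```

The single byte 0xE9 in the input should come out as the two-byte UTF-8 sequence C3 A9, while the ASCII letters are unchanged.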

There seems to be a caveat to all this. You might imagine the following would be equivalent:

^

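 use Encode;

 # Read the file as CP1252, exactly as before.
 open(FILE, "<:encoding(cp1252)", "myfile.txt") or die "Cannot open myfile.txt: $!";

 # The only difference: ":encoding(utf8)" instead of ":utf8" on the output.
 open(OUT, ">:encoding(utf8)", "outfile.txt") or die "Cannot open outfile.txt: $!";
 while (<FILE>)
 {
   print OUT $_;
 }
 close(FILE);
 close(OUT) or die "Cannot close outfile.txt: $!";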

^

...however, you'd be wrong. This seems to do strange things which I can't quite get to the bottom of. I found the problem while parsing HTML files: opening files (at least for reading) using the "encoding(utf8)" semantics can cause Perl to behave very strangely. It seems possible to get Perl stuck on a mangled UTF-8 character such that it never gets beyond it, yet consumes more and more memory.

The truth is, I haven't fully got to the bottom of this, and I may not be quite right about it. However, I have developed a (slightly convoluted) test case, which works fine when I use "<:utf8" but not when I use "<:encoding(utf8)".
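
If you suspect a file contains mangled UTF-8, one way to check it independently of either layer is to read the raw bytes and attempt a strict decode. This is only a sketch and does not reproduce or explain the behaviour described above. The helper name is_valid_utf8 and the sample strings are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# is_valid_utf8 is a hypothetical helper name used only in this sketch.
# It returns 1 if $bytes decodes as strict UTF-8, and 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $bytes in place.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $good = "caf\xC3\xA9 au lait";   # e-acute as a well-formed UTF-8 sequence
my $bad  = "caf\xE9 au lait";       # CP1252 e-acute, not valid UTF-8

printf "good: %s\n", is_valid_utf8($good) ? 'valid' : 'malformed';
printf "bad:  %s\n", is_valid_utf8($bad)  ? 'valid' : 'malformed';
```

For a real file, the bytes could be read through a "<:raw" filehandle and passed to the same helper before deciding which layer to open the file with.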

Caveat reader!