Perl – Writing to file outputting in UTF-16 instead of UTF-8

September 16, 2011

Recently I needed to write a Perl script that ran on Cygwin. With my default settings, any file written by Perl was being written in UTF-16. Opened in an editor, the output looked like a page of Japanese writing (a typical symptom of bytes being decoded with the wrong encoding), which made it completely illegible and unusable.

After a lot of digging around on the internet, I managed to hack together the following code:

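# Layers are listed bottom-up: :raw drops the default crlf layer first.
open my $SH, ">>:raw:encoding(UTF-16LE):crlf:utf8", "test1.txt"
    or die "Cannot open test1.txt: $!";
print $SH "\x{FEFF}";             # BOM; only wanted at the very start of the file
print $SH "Some test writing \n"; # write to the file handle, not STDOUT
close($SH);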

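The layers in the open mode apply to output from right to left. First, :utf8 tells Perl we’re going to pass “characters” to this file handle instead of bytes. Next, :crlf transforms \n into \r\n to give DOS line endings. Next, the :encoding layer converts the characters to little-endian UTF-16, so that 0x0A becomes 0x0A 0x00. Because the byte order is named explicitly, Perl does not write a byte order mark (BOM) at the beginning of the file itself. Finally, :raw removes any default crlf layer so that it is not in the wrong place: underneath the encoding layer it would act on individual bytes and insert a lone 0x0D before each 0x0A.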
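To check what that layer stack really puts on disk, here is a minimal sketch that writes a string through the same layers and reads the file back as raw bytes. The helper name bytes_written and the temporary file are illustrative assumptions, not part of the original script, and it uses ">" rather than ">>" so each run starts from an empty file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the layer stack from the post
# and return the bytes that ended up in the file as a hex string.
sub bytes_written {
    my ($text) = @_;

    # Create a scratch file that is deleted when the program exits.
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    # Same layers as in the post, but truncating instead of appending.
    open my $out, '>:raw:encoding(UTF-16LE):crlf:utf8', $path
        or die "Cannot open $path for writing: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    # Read the file back with no layers at all, so we see plain bytes.
    open my $in, '<:raw', $path
        or die "Cannot open $path for reading: $!";
    local $/;                      # slurp the whole file
    my $bytes = <$in>;
    close $in;

    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

print bytes_written("A\n"), "\n";
```

For the input "A\n" this should print 41 00 0D 00 0A 00: each character takes two bytes with the low byte first, and the newline has become a carriage return plus line feed before being encoded.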

Now that the file is being opened with the correct encoding, we need to write the BOM at the beginning of the file to tell readers of this file what endianness it is. We do this by printing \x{FEFF} to the file before anything else; the encoding layer writes it out as the bytes 0xFF 0xFE.
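To illustrate how a reader can use the BOM, here is a rough sketch that inspects the first two bytes of some data. The function name bom_endianness is a hypothetical helper, and it assumes the data is already known to be UTF-16 (it does not try to tell UTF-16 apart from other encodings).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: guess the byte order of UTF-16 data from its
# first two bytes. Assumes the input is a byte string, not characters.
sub bom_endianness {
    my ($bytes) = @_;
    my $head = substr $bytes, 0, 2;

    # U+FEFF encoded low byte first means little-endian.
    return 'little-endian' if $head eq "\xFF\xFE";
    # U+FEFF encoded high byte first means big-endian.
    return 'big-endian'    if $head eq "\xFE\xFF";
    return 'unknown (no BOM)';
}

# Encode the same text three ways and see what a reader would conclude.
my $le     = encode('UTF-16LE', "\x{FEFF}Hi");
my $be     = encode('UTF-16BE', "\x{FEFF}Hi");
my $no_bom = encode('UTF-16LE', 'Hi');

print bom_endianness($le),     "\n";
print bom_endianness($be),     "\n";
print bom_endianness($no_bom), "\n";
```

The last case shows why the explicit print of \x{FEFF} matters: without it, a reader has nothing to go on and must guess the byte order.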