Thursday, June 15, 2006

Java Charsets

Today I've spent some time pounding on java.lang.String to get a byte array I can use as a data format. This turned out to be harder than I thought it should be. So, why can't I use getBytes() or getBytes("ascii") or getBytes("iso-8859-1") or getBytes("utf-8")? Those are fine for certain tasks, but I'm looking for a very specific translation from chars to bytes, where every byte value comes back out exactly as it went in. The application I was working on is Zlib in Java, for JRuby. Since Ruby has the somewhat funny custom of using Strings as byte buffers, the output I get from a Ruby IO operation is a RubyString.

The reason I started trying different paths for this was that Zlib didn't work as it should. Not at all. I knew it worked when I did it one char at a time, because then I cast the char to an int instead (since InputStream#read() returns an int). So, I created this small program:


// Fill a buffer with every possible byte value, 0..255.
final byte[] chrs = new byte[256];
for (int i = 0, j = chrs.length; i < j; i++) {
    chrs[i] = (byte) i;
}
// Round-trip through String using the platform default charset.
final String str = new String(chrs);
final byte[] bts = str.getBytes();
for (int i = 0, j = chrs.length; i < j; i++) {
    System.out.println("[" + i + "]= " + (int) chrs[i] + ", " + bts[i] + " ... should be: " + (byte) chrs[i]);
}

to see what happened here. Now, I won't bore you with the complete printout from this. But there are a few specific portions that I'd like to share:

[127]= 127, 127 ... should be: 127
[128]= -128, -128 ... should be: -128
[129]= -127, 63 ... should be: -127
[130]= -126, -126 ... should be: -126

[140]= -116, -116 ... should be: -116
[141]= -115, 63 ... should be: -115
[142]= -114, -114 ... should be: -114
[143]= -113, 63 ... should be: -113
[144]= -112, 63 ... should be: -112
[145]= -111, -111 ... should be: -111

and

[156]= -100, -100 ... should be: -100
[157]= -99, 63 ... should be: -99
[158]= -98, -98 ... should be: -98
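
A note on where the 63 comes from: 63 is the ASCII code for '?', which is the byte Java's getBytes substitutes when a character cannot be encoded. The post does not say which default charset produced the printout above, although the affected positions (129, 141, 143, 144, 157) are exactly the ones windows-1252 leaves undefined. The sketch below reproduces the same mechanism with US-ASCII instead, only because that charset is guaranteed to exist on every JVM. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
final class ReplacementByteDemo {

    // Decode a single byte to a String and encode it straight back.
    static byte roundTrip(byte b) {
        String s = new String(new byte[] { b }, StandardCharsets.US_ASCII);
        return s.getBytes(StandardCharsets.US_ASCII)[0];
    }

    public static void main(String[] args) {
        System.out.println(roundTrip((byte) 65));   // 65: 'A' is valid US-ASCII and survives
        System.out.println(roundTrip((byte) 129));  // 63: not valid US-ASCII, comes back as '?'
        System.out.println((int) '?');              // 63
    }
}
```

The byte that has no meaning in the charset is decoded to a replacement character, and that character in turn cannot be encoded, so the encoder writes '?' in its place.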

Those 63 values keep showing up and destroying everything. (63 is the ASCII code for '?', the byte Java writes when it cannot encode a character.) If I try another encoding in the getBytes method it actually gets worse. I couldn't find any way to get this to write the expected output. So, I embarked on a quest. A quest to solve this small trouble, forever and always.

The result is plaincharset, a small project consisting of 4 classes. Nothing spectacular, but if you add the jar file to your classpath you can now use the charset name "PLAIN" to get every byte back correctly from getBytes and new String. If you have characters that are not within 0..255 I cannot guarantee anything at all. I hereby release the project into the public domain. The source can be found here, and if you just want the jar-file, download it here.

So, what is the secret behind this marvel? In one word: NIO. The jar file contains a subclass of CharsetProvider, a subclass of Charset, one CharsetDecoder and one CharsetEncoder. The only classes with anything in them are the decoder and the encoder, each of which gets an input NIO buffer and an output NIO buffer. I just read from the input and write to the output, casting where necessary. There is also one service-provider file in the META-INF directory in the jar, which says to use com.ologix.charset.PlainCharsetProvider as a provider for charsets.
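
To make the description above concrete, here is a minimal sketch of a charset whose decoder and encoder do nothing but cast. It is not the plaincharset source: the class name PlainCharsetSketch, the charset name "X-PLAIN-SKETCH" and the roundTrip helper are all invented for this example, and the provider class is left out so that the charset object is passed to String directly.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.util.Arrays;

// Hypothetical stand-in for the Charset subclass described in the post.
final class PlainCharsetSketch extends Charset {

    PlainCharsetSketch() {
        super("X-PLAIN-SKETCH", null); // assumed name, no aliases
    }

    @Override
    public boolean contains(Charset cs) {
        return cs instanceof PlainCharsetSketch;
    }

    @Override
    public CharsetDecoder newDecoder() {
        // One byte always becomes exactly one char.
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    // Mask so that bytes 128..255 become chars 128..255, not negative values.
                    out.put((char) (in.get() & 0xFF));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override
    public CharsetEncoder newEncoder() {
        // One char always becomes exactly one byte.
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    // Chars above 255 are silently truncated by the cast.
                    out.put((byte) in.get());
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    // Decode bytes to a String and encode them back, both with this charset.
    static byte[] roundTrip(byte[] input) {
        Charset plain = new PlainCharsetSketch();
        return new String(input, plain).getBytes(plain);
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        System.out.println(Arrays.equals(all, roundTrip(all))); // prints "true"
    }
}
```

Looking a charset up by name, as in getBytes("PLAIN"), additionally needs a public CharsetProvider subclass that is named in a file called META-INF/services/java.nio.charset.spi.CharsetProvider, which is the standard location the JVM consults for charset providers.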

And did this work for my Zlib implementation? I'm happy to say that it did. It works very well, and the code is both shorter and much, much faster. I'm happy.
