Making a Java SafeString that works with all Unicode characters

In Java, String has a problem with characters that need more than one 16-bit code unit (more than 2 bytes) in UTF-16, that is, characters outside the Basic Multilingual Plane (BMP). substring() and similar methods index by code unit, so they can split such a character in the middle of its surrogate pair. I was thinking that switching Strings to UTF-8 might be good, and there are currently two JEPs for Java 9 somewhat related to this: 226: UTF-8 Property Files and 254: Compact Strings. But thinking about this a little more, I don’t necessarily want a UTF-8 String class, but rather a String class that works with all Unicode characters.
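To make the problem concrete, here is a minimal sketch. The class name SurrogateSplitDemo and the helper sample() are made up for illustration. It builds a string containing U+10400, which UTF-16 stores as a surrogate pair, and shows that length() and substring() work on code units rather than characters.

```java
final class SurrogateSplitDemo {
    // Hypothetical helper: "a", then U+10400 (outside the BMP), then "cd".
    static String sample() {
        return new StringBuilder()
            .append("a")
            .appendCodePoint(0x10400)
            .append("cd")
            .toString();
    }

    public static void main(String[] args) {
        String s = sample();
        // length() counts UTF-16 code units, so the surrogate pair counts twice.
        System.out.println(s.length());                      // 5
        // codePointCount() counts actual Unicode code points.
        System.out.println(s.codePointCount(0, s.length())); // 4
        // Cutting at code unit index 2 lands between the two surrogates.
        String cut = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(cut.charAt(1))); // true
    }
}
```

The substring ends in a lone high surrogate, which is not a valid character on its own.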

As a workaround, maybe a String class that does not accept characters that can’t be represented by a single 16-bit char (two bytes)? Or maybe a class that has “safe” methods for substring, etc.? The first option might be handled by an annotation and the Checker Framework, but the second option might be easier to get going, so let’s take a look at it with a SafeString class.
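For comparison, the first option could be approximated at runtime with a simple check, sketched below. This is not the Checker Framework approach mentioned above, only a plain validation method, and the names BmpOnly, isBmpOnly and requireBmpOnly are hypothetical.

```java
final class BmpOnly {
    private BmpOnly() {}

    // True if the string contains no surrogate code units at all,
    // meaning every char is a complete BMP character.
    static boolean isBmpOnly(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Returns the string unchanged, or rejects it if it needs surrogates.
    static String requireBmpOnly(String s) {
        if (!isBmpOnly(s)) {
            throw new IllegalArgumentException("String contains characters outside the BMP");
        }
        return s;
    }
}
```

A check like this keeps substring() safe by refusing the problematic input, at the cost of not supporting all Unicode characters, which is why the rest of this post follows the second option.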

Let’s start with a test string of 4 characters (code points); the second one, U+10400, lies outside the BMP.

private static final int VALID_LENGTH = 4;
private String testString = new StringBuilder()
    .append("a")
    .appendCodePoint(0x10400)
    .append("cd")
    .toString();

First, make sure that String.length() gets it wrong: it counts UTF-16 code units and returns 5, not 4.

assertNotEquals(VALID_LENGTH, testString.length());

A safe string wrapping the same test string, on the other hand, should report the right length:

assertEquals(VALID_LENGTH, testSafeString.length());

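The same goes for substring. String.substring(2, 4) cuts through the surrogate pair and returns the low surrogate followed by "c", while the safe version returns "cd":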

String newString = testString.substring(2, 4);
assertNotEquals("cd", newString);
SafeString newSafeString = testSafeString.substring(2, 4);
assertEquals("cd", newSafeString.get());

Now let’s take a look at the SafeString class, whose data field holds the original String. To make this work, we need to translate a code point index into the real (UTF-16 code unit) index in data:

private int getRealIndex(int index) {
  return data.offsetByCodePoints(0, index);
}
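As a quick illustration of that mapping, the sketch below prints the real index for every code point index of the test string. The class RealIndexDemo and the static realIndex helper are invented for this example; only String.offsetByCodePoints comes from the JDK.

```java
final class RealIndexDemo {
    // Same idea as getRealIndex, written as a static helper for the demo.
    static int realIndex(String data, int index) {
        return data.offsetByCodePoints(0, index);
    }

    public static void main(String[] args) {
        String data = new StringBuilder()
            .append("a")
            .appendCodePoint(0x10400)
            .append("cd")
            .toString();
        // The surrogate pair occupies code units 1 and 2,
        // so every later code point is shifted by one.
        for (int i = 0; i <= 4; i++) {
            System.out.println(i + " -> " + realIndex(data, i));
        }
        // prints: 0 -> 0, 1 -> 1, 2 -> 3, 3 -> 4, 4 -> 5 (one pair per line)
    }
}
```

Two things are worth keeping in mind: offsetByCodePoints throws an IndexOutOfBoundsException when the string has fewer code points than requested, and it has to walk the string from the start, so each lookup takes time proportional to the index.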

With that in place, characterAt and substring are straightforward. Note that characterAt returns an int code point rather than a char:

public int characterAt(int index) {
  return data.codePointAt(getRealIndex(index));
}
public SafeString substring(int startIndex, int endIndex) {
  int codePointStartIndex = getRealIndex(startIndex);
  int codePointEndIndex = getRealIndex(endIndex);
  String newData = data.substring(codePointStartIndex, codePointEndIndex);
  return new SafeString(newData);
}

And the length is the number of code points:

public int length() {
  return data.codePointCount(0, data.length());
}
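Putting the pieces together, a minimal self-contained version of the class might look like the sketch below. It is assembled from the snippets in this post and may well differ from the version in the repository; in particular, the field layout and the main method are assumptions added so the example runs on its own.

```java
final class SafeString {
    // The wrapped String, still stored as UTF-16.
    private final String data;

    SafeString(String data) {
        this.data = data;
    }

    // Returns the underlying String.
    public String get() {
        return data;
    }

    // Number of Unicode code points, not UTF-16 code units.
    public int length() {
        return data.codePointCount(0, data.length());
    }

    // Code point at the given code point index.
    public int characterAt(int index) {
        return data.codePointAt(getRealIndex(index));
    }

    // Substring using code point indexes, so surrogate pairs are never split.
    public SafeString substring(int startIndex, int endIndex) {
        int codePointStartIndex = getRealIndex(startIndex);
        int codePointEndIndex = getRealIndex(endIndex);
        return new SafeString(data.substring(codePointStartIndex, codePointEndIndex));
    }

    // Translates a code point index into a UTF-16 code unit index.
    private int getRealIndex(int index) {
        return data.offsetByCodePoints(0, index);
    }

    public static void main(String[] args) {
        String testString = new StringBuilder()
            .append("a")
            .appendCodePoint(0x10400)
            .append("cd")
            .toString();
        SafeString safe = new SafeString(testString);
        System.out.println(safe.length());                             // 4
        System.out.println(safe.substring(2, 4).get());                // cd
        System.out.println(Integer.toHexString(safe.characterAt(1)));  // 10400
    }
}
```

Keeping the data as a regular String means the class stays cheap to construct, but every indexed operation pays for a scan from the start of the string.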

The full code is available on GitHub.