2020-10-28

Unicode in JavaScript

How JavaScript uses Unicode internally
Using Unicode in a string
Normalization
Emojis
Get the proper length of a string
ES6 Unicode code point escapes
Encoding ASCII chars

How JavaScript uses Unicode internally

Although a JavaScript source file can use any encoding, JavaScript uses UTF-16 internally.

JavaScript strings are all UTF-16 sequences.

According to Section 6 of the ECMAScript specification:

ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. […] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. […] If an actual source text is encoded in a form other than 16-bit code units, it must be processed as if it was first converted to UTF-16.
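
To make the "sequence of 16-bit code units" idea concrete, here is a minimal sketch that lists the code units a string is made of. The helper name codeUnits is made up for this illustration; charCodeAt() is the standard String method it relies on.

```javascript
// Hypothetical helper: return the UTF-16 code units of a string as 4-digit hex strings.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    // charCodeAt returns the 16-bit code unit at index i (0 to 65535)
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0'));
  }
  return units;
}

console.log(codeUnits('a'));      // [ '0061' ]
console.log(codeUnits('\u00E9')); // [ '00E9' ]
```

Note that length and index access on a string count these code units, not what a reader would call characters.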

Using Unicode in a string

A Unicode escape sequence can be used inside any string, using the format \uXXXX (4 hexadecimal digits):

var s1 = '\u00E9'; // é

The same symbol can also be created by combining two Unicode sequences, a base letter followed by a combining accent:

var s2 = '\u0065\u0301' // é

They both generate an accented e, but they are two different strings, and s2 is considered to be 2 characters long:

"é".length // 1
s1.length //1
s2.length //2
s1 == s2 // false

You can also write a string combining a unicode character with a plain char, as internally it’s actually the same thing:

const s3 = 'e\u0301' //é
s3.length === 2 //true
s2 === s3 //true
s1 !== s3 //true

Normalization

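ES6/ES2015 introduced the normalize() method on the String prototype, which converts a string to a Unicode normalization form (NFC by default) so that equivalent sequences compare equal: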

s1 === s3 // false

s1.normalize() === s3.normalize() // true
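
normalize() also accepts an optional form argument ('NFC', 'NFD', 'NFKC' or 'NFKD'). As a small sketch, with variable names chosen only for this example, here is how the composed and decomposed forms of é convert into each other:

```javascript
const composed = '\u00E9';    // é as a single code point
const decomposed = 'e\u0301'; // e followed by a combining acute accent

// NFC composes where possible, NFD decomposes.
console.log(composed.normalize('NFD') === decomposed); // true
console.log(decomposed.normalize('NFC') === composed); // true
console.log(decomposed.normalize('NFC').length);       // 1
console.log(composed.normalize('NFD').length);         // 2
```

Normalizing both sides to the same form before comparing is usually what you want when the strings come from user input.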

Emojis

Emojis are Unicode characters, so they are perfectly valid to be used in strings:

var s4 = '🐶';

The 🐶 symbol, which is U+1F436, does not fit in a single 16-bit code unit, so it is encoded as two code units, \uD83D\uDC36 (called a surrogate pair).
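
The two halves of a surrogate pair are derived from the code point with a bit of arithmetic defined by UTF-16. The sketch below shows that calculation; the function name toSurrogatePair is hypothetical, and it assumes the input is a code point above U+FFFF.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits go in the high surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits go in the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F436);
console.log(high.toString(16), low.toString(16));     // d83d dc36
console.log(String.fromCharCode(high, low) === '🐶'); // true
```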

Get the proper length of a string

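Because the emoji is stored as a surrogate pair, length reports 2 code units instead of 1 character:

s4.length // 2

One easy way to count code points in ES6+ is to use the spread operator, because iterating a string walks code points rather than code units:

[...'🐶'].length // 1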
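
Counting code points still differs from counting what a reader sees: a letter followed by a combining accent is 2 code points. The sketch below compares the two counts. Both function names are made up for this example, and countGraphemes assumes a runtime that provides Intl.Segmenter, returning undefined when it does not.

```javascript
// Count code points: the string iterator walks code points, not code units.
function countCodePoints(str) {
  return Array.from(str).length;
}

// Count user-perceived characters (grapheme clusters) when Intl.Segmenter exists.
// Returns undefined on runtimes without it.
function countGraphemes(str) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return undefined;
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(str)).length;
}

console.log(countCodePoints('🐶'));      // 1
console.log(countCodePoints('e\u0301')); // 2
console.log(countGraphemes('e\u0301'));
```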

ES6 Unicode code point escapes

ES6/ES2015 introduced a way to represent Unicode code points in the astral planes (any Unicode code point requiring more than 4 hexadecimal digits), by wrapping the code in curly braces: \u{XXXXX}.

The dog 🐶 symbol, which is U+1F436, can be represented as \u{1F436} instead of having to combine the two surrogate code units we showed before: \uD83D\uDC36.

But length still reports 2 for this string, because internally it's stored as the surrogate pair shown above.
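
A short sketch of how the two notations relate, using the standard codePointAt() and String.fromCodePoint() methods (the variable name dog is just for illustration):

```javascript
const dog = '\u{1F436}';

// The code point escape and the surrogate pair produce the same string.
console.log(dog === '\uD83D\uDC36');                // true
console.log(dog.length);                            // 2
// codePointAt reads the full code point starting at the given index.
console.log(dog.codePointAt(0).toString(16));       // 1f436
console.log(String.fromCodePoint(0x1F436) === dog); // true
```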

Encoding ASCII chars

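Characters with code points up to U+00FF can be encoded using the special escaping character \x, which takes exactly 2 hexadecimal digits:

'\x61' // a
'\x2A' // *

This will only work from \x00 to \xFF. The ASCII characters are the first 128 of these (\x00 to \x7F); the remaining values cover the Latin-1 Supplement characters.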
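
As a quick check, the sketch below shows that a \x escape is simply a shorter spelling of the equivalent \u escape, and prints the numeric values at a few points in the \x range:

```javascript
console.log('\x61' === 'a');       // true
console.log('\x61' === '\u0061');  // true
console.log('\xE9' === '\u00E9');  // true
console.log('\x7F'.charCodeAt(0)); // 127
console.log('\xFF'.charCodeAt(0)); // 255
```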