Question 1

Why is the byte count higher than the character count?

Accepted Answer

UTF-8 uses one byte for ASCII but two to four bytes for other characters: most accented and Greek/Cyrillic letters take two, most CJK characters three, and most emoji four. So any text that contains at least one non-ASCII character has more bytes than characters.
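As a rough illustration of those widths, the sketch below encodes one sample character from each group and counts the resulting bytes. The helper name `utf8_len` and the sample table are assumptions made for this example, not part of the tool.

```python
# Sample characters, written as escapes so the source stays ASCII.
SAMPLES = {
    "ASCII A": "A",
    "e with acute": "\u00e9",
    "Cyrillic ya": "\u044f",
    "CJK zhong": "\u4e2d",
    "grinning face emoji": "\U0001F600",
}


def utf8_len(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


for name, ch in SAMPLES.items():
    # len() on a Python str counts code points, not bytes.
    print(f"{name}: {len(ch)} code point, {utf8_len(ch)} byte(s)")
```

Each sample is a single code point, but the byte counts come out as 1, 2, 2, 3 and 4 respectively.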

Question 2

Which number does a database VARCHAR limit use?

Accepted Answer

It depends on the database and the column definition. Some databases count characters, but many count bytes, as do byte-typed columns. A VARCHAR(20) that counts UTF-8 bytes can therefore reject a 12-character string that encodes to 24 bytes. When in doubt, size against the UTF-8 byte count shown here.
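A pre-insert check along these lines can catch the problem before the database does. This is a minimal sketch that assumes the column limit is measured in UTF-8 bytes; `fits_byte_limit` is a hypothetical helper, and whether your column really counts bytes has to be confirmed in your database's documentation.

```python
def fits_byte_limit(text: str, max_bytes: int) -> bool:
    """Return True if text's UTF-8 encoding fits in max_bytes.

    Assumption: the column limit counts UTF-8 bytes, not characters.
    """
    return len(text.encode("utf-8")) <= max_bytes


ascii_text = "a" * 12          # 12 characters, 12 bytes
cyrillic_text = "\u044f" * 12  # 12 characters, 24 bytes (2 bytes each)

print(fits_byte_limit(ascii_text, 20))     # True
print(fits_byte_limit(cyrillic_text, 20))  # False
```

Both strings are 12 characters long, but only the ASCII one fits in a 20-byte limit.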

Question 3

Why does JavaScript report a different length?

Accepted Answer

JavaScript strings are UTF-16, so `.length` counts 16-bit units. Characters outside the Basic Multilingual Plane, including most emoji, are stored as a surrogate pair and count as 2. That is the "UTF-16 units" figure.
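The same figure can be reproduced outside a browser. The sketch below, written in Python rather than JavaScript, encodes the text as UTF-16 and divides the byte length by two. `utf16_units` is a name chosen for this example, and it assumes the input contains no lone surrogates, which Python refuses to encode.

```python
def utf16_units(text: str) -> int:
    """Return the number of 16-bit code units in the UTF-16 encoding of text.

    "utf-16-le" is used so that no byte-order mark is added to the output.
    """
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u00e9"       # inside the Basic Multilingual Plane
emoji = "\U0001F600"      # outside the BMP, stored as a surrogate pair

print(utf16_units(bmp_char))     # 1
print(utf16_units(emoji))        # 2
print(utf16_units("a" + emoji))  # 3
```

For these inputs the result matches what JavaScript's `.length` would report, while Python's own `len()` counts code points and returns 1 for the emoji.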

Question 4

How are emoji counted?

Accepted Answer

As one code point each in the character count (unless they are a combined sequence like a flag or skin-tone variant, which can be several), typically four bytes in UTF-8, and two units in UTF-16. The tool shows all three so you can see the difference.
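To make the three figures concrete, here is a small sketch that computes code points, UTF-8 bytes and UTF-16 units for a few emoji. The function `counts` and the chosen samples are assumptions for illustration; the tool's own implementation is not shown in this FAQ.

```python
def counts(text: str) -> tuple:
    """Return (code points, UTF-8 bytes, UTF-16 units) for text."""
    code_points = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    utf16_units = len(text.encode("utf-16-le")) // 2
    return code_points, utf8_bytes, utf16_units


grinning_face = "\U0001F600"         # a single code point
japan_flag = "\U0001F1EF\U0001F1F5"  # two regional indicator symbols
thumbs_up_tone = "\U0001F44D\U0001F3FD"  # thumbs up + skin-tone modifier
smiling_face = "\u263A"              # an emoji inside the BMP

print(counts(grinning_face))   # (1, 4, 2)
print(counts(japan_flag))      # (2, 8, 4)
print(counts(thumbs_up_tone))  # (2, 8, 4)
print(counts(smiling_face))    # (1, 3, 1)
```

The last sample shows why the answer says "typically": some older emoji sit inside the Basic Multilingual Plane and take three UTF-8 bytes and one UTF-16 unit.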

Question 5

Does this match `wc -c` and `wc -m`?

Accepted Answer

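Yes, as long as the input is identical: the UTF-8 byte count corresponds to `wc -c` (bytes) and the character count to `wc -m` (characters) in a UTF-8 locale. Keep in mind that `wc` also counts a trailing newline, which `echo` appends, so its figures can be one higher than the counts for the same text without that newline.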
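One common source of off-by-one differences is a trailing newline: `wc` counts it, and `echo` appends one. The sketch below does not call `wc`; it only computes in Python what `wc -c` and `wc -m` would be expected to report in a UTF-8 locale for the same input, with and without that newline. The helper name `wc_like` is made up for this example.

```python
def wc_like(text: str) -> tuple:
    """Return (bytes, characters), the figures `wc -c` and `wc -m`
    would be expected to report for text in a UTF-8 locale."""
    return len(text.encode("utf-8")), len(text)


text = "h\u00e9llo"  # "hello" with an accented e: 5 characters, 6 bytes

# Comparable to `printf '%s' "$text" | wc -c` (no trailing newline).
print(wc_like(text))         # (6, 5)

# Comparable to `echo "$text" | wc -c` (echo appends a newline).
print(wc_like(text + "\n"))  # (7, 6)
```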

Question 6

Is my text uploaded?

Accepted Answer

No. All counting uses the browser's built-in text encoder with no network request.

UTF-8 Byte Counter
