Unicode and UTF-8

RichardRussell

Post by RichardRussell »

In another thread there seems to be some confusion over the terms Unicode and UTF-8, although both are well explained in the relevant Wikipedia articles. They are not interchangeable!

Unicode is a method of representing most of the characters, glyphs and symbols used in the world's writing systems, by allocating each of them a unique numeric value or 'code point'. There are in excess of 140,000 such characters currently allocated (although BBC BASIC for Windows and BBC BASIC for SDL 2.0 are limited to the 65536 code points in the Basic Multilingual Plane or BMP).

UTF-8 is a method of encoding Unicode text into a sequence of bytes; it is a variable-length code in which each character occupies from one to four bytes. Alternative Unicode encodings are UTF-16 (a sequence of 16-bit words, also variable-length) and UTF-32 (a sequence of fixed-length 32-bit values). UTF-16 and UTF-32 come in two varieties (little-endian and big-endian) which determine the order of the bytes when the 16-bit or 32-bit values are serialised.
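
As a rough illustration of the variable-length property, here is a minimal sketch which counts the characters in a UTF-8 string by ignoring continuation bytes (these always have the bit pattern 10xxxxxx). FN_utf8_count is a hypothetical name, not a function in any existing library, and the code assumes the string is well-formed UTF-8:

```bbcbasic
      REM Three characters: "A", U+00E9 (two bytes) and U+20AC (three bytes)
      u$ = "A" + CHR$(&C3) + CHR$(&A9) + CHR$(&E2) + CHR$(&82) + CHR$(&AC)
      PRINT "Bytes: "; LEN(u$)
      PRINT "Characters: "; FN_utf8_count(u$)
      END

      REM Hypothetical helper: count the characters in a well-formed UTF-8 string
      DEF FN_utf8_count(u$)
      LOCAL I%, N%
      FOR I% = 1 TO LEN(u$)
        REM Continuation bytes are 10xxxxxx; any other byte starts a new character
        IF (ASC(MID$(u$,I%,1)) AND &C0) <> &80 THEN N% += 1
      NEXT
      = N%
```

The test string is six bytes long but holds only three characters, which is why LEN cannot be relied upon to count characters in UTF-8 text.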

You could use UTF-8 to represent only the 256 characters in the ANSI character set (not that there would be much point doing so, because it would introduce the complication of a variable-length encoding for no benefit) but that does not make it Unicode!
Last edited by RichardRussell on Fri 10 Apr 2020, 16:02, edited 1 time in total.
hellomike

Re: Unicode and UTF-8

Post by hellomike »

Thanks for the recap, Richard! I have to admit that I struggled and sometimes still struggle with these concepts.

I remember being an 'expert' with BASIC but couldn't grasp what POKE and PEEK really did. Once I understood the concept of memory locations and bytes in that memory, it was easy.

As always, once a concept is truly understood it is easy from then on.

Regards,

Mike
RichardRussell

Re: Unicode and UTF-8

Post by RichardRussell »

hellomike wrote: Fri 10 Apr 2020, 10:50
I have to admit that I struggled and sometimes still struggle with these concepts.

Here are a couple of functions, one for converting ANSI (Windows code page 1252) to UTF-8, which is always possible:

      DEF FN_ansi_to_utf8(a$)
      LOCAL C%, I%, u$
      FOR I% = 1 TO LEN(a$)
        C% = ASCMID$(a$,I%,1)
        REM Bytes &80 to &9F differ from their Unicode code points, so remap them
        IF C% > &7F THEN
          CASE C% OF
            WHEN &80: C% = &20AC : REM Euro
            WHEN &82: C% = &201A
            WHEN &83: C% = &0192
            WHEN &84: C% = &201E
            WHEN &85: C% = &2026
            WHEN &86: C% = &2020
            WHEN &87: C% = &2021
            WHEN &88: C% = &02C6
            WHEN &89: C% = &2030
            WHEN &8A: C% = &0160
            WHEN &8B: C% = &2039
            WHEN &8C: C% = &0152
            WHEN &8E: C% = &017D
            WHEN &91: C% = &2018
            WHEN &92: C% = &2019
            WHEN &93: C% = &201C
            WHEN &94: C% = &201D
            WHEN &95: C% = &2022
            WHEN &96: C% = &2013
            WHEN &97: C% = &2014
            WHEN &98: C% = &02DC
            WHEN &99: C% = &2122
            WHEN &9A: C% = &0161
            WHEN &9B: C% = &203A
            WHEN &9C: C% = &0153
            WHEN &9E: C% = &017E
            WHEN &9F: C% = &0178
          ENDCASE
        ENDIF
        REM Emit one, two or three bytes depending on the size of the code point
        IF C% < &80 THEN
          u$ += CHR$(C%)
        ELSEIF C% < &800 THEN;
          u$ += CHR$(&C0 + (C% >> 6 AND &3F)) + CHR$(&80 + (C% AND &3F))
        ELSE
          u$ += CHR$(&E0 + (C% >> 12)) + CHR$(&80 + (C% >> 6 AND &3F)) + CHR$(&80 + (C% AND &3F))
        ENDIF
      NEXT
      = u$
and one for converting UTF-8 to ANSI (if the source string contains any characters not available in ANSI, they will be replaced by meaningless, unrelated ANSI characters; no warning or error will result):

      DEF FN_utf8_to_ansi(u$)
      LOCAL C%, I%, a$
      WHILE I% < LEN(u$)
        I% += 1
        C% = ASCMID$(u$,I%,1)
        REM A lead byte of &E0 or above starts a three-byte sequence, &C0 or above a two-byte one
        IF C% >= &E0 THEN
          C% = (C% << 12) AND &F000
          C% OR= (ASCMID$(u$,I%+1,1) << 6) AND &0FC0
          C% OR= ASCMID$(u$,I%+2,1) AND &003F
          I% += 2
        ELSEIF C% >= &C0 THEN;
          C% = (C% << 6) AND &07C0
          C% OR= ASCMID$(u$,I%+1,1) AND &003F
          I% += 1
        ENDIF
        IF C% > &FF THEN
          CASE C% OF
            WHEN &20AC: C% = &80 : REM Euro
            WHEN &201A: C% = &82
            WHEN &0192: C% = &83
            WHEN &201E: C% = &84
            WHEN &2026: C% = &85
            WHEN &2020: C% = &86
            WHEN &2021: C% = &87
            WHEN &02C6: C% = &88
            WHEN &2030: C% = &89
            WHEN &0160: C% = &8A
            WHEN &2039: C% = &8B
            WHEN &0152: C% = &8C
            WHEN &017D: C% = &8E
            WHEN &2018: C% = &91
            WHEN &2019: C% = &92
            WHEN &201C: C% = &93
            WHEN &201D: C% = &94
            WHEN &2022: C% = &95
            WHEN &2013: C% = &96
            WHEN &2014: C% = &97
            WHEN &02DC: C% = &98
            WHEN &2122: C% = &99
            WHEN &0161: C% = &9A
            WHEN &203A: C% = &9B
            WHEN &0153: C% = &9C
            WHEN &017E: C% = &9E
            WHEN &0178: C% = &9F
          ENDCASE
        ENDIF
        a$ += CHR$C%
      ENDWHILE
      = a$
I propose to add these to the utf8lib library in due course.
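
A short usage sketch, assuming both function definitions above are placed after the END statement of the same program (the test string is purely illustrative). The Euro sign is the single byte &80 in ANSI but becomes the three bytes &E2 &82 &AC in UTF-8, so the converted string should be two bytes longer, and converting it back should restore the original:

```bbcbasic
      REM ANSI string containing the Euro sign (&80 in code page 1252)
      a$ = "Price: " + CHR$(&80) + "5"
      u$ = FN_ansi_to_utf8(a$)
      PRINT "ANSI length:  "; LEN(a$)
      PRINT "UTF-8 length: "; LEN(u$)
      REM Show the UTF-8 bytes in hexadecimal
      FOR I% = 1 TO LEN(u$)
        PRINT STR$~(ASC(MID$(u$,I%,1))); " ";
      NEXT
      PRINT
      IF FN_utf8_to_ansi(u$) = a$ THEN PRINT "Round trip OK" ELSE PRINT "Round trip failed"
      END
```

The round trip is only guaranteed in this direction: a UTF-8 string containing characters outside the ANSI set will not survive conversion to ANSI and back.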