In another thread there seems to be some confusion over the terms Unicode and UTF-8, although both are well explained in the relevant Wikipedia articles. They are not interchangeable!
Unicode is a method of representing most of the characters, glyphs and symbols used in the world's writing systems by allocating each of them a unique numeric value, or 'code point'. More than 140,000 such characters are currently allocated (although BBC BASIC for Windows and BBC BASIC for SDL 2.0 are limited to the 65,536 code points, U+0000 to U+FFFF, of the Basic Multilingual Plane or BMP).
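To make the idea of a code point concrete, here is a minimal sketch in Python (used only because its built-in `ord` exposes code points directly). The sample characters and the helper name `in_bmp` are my own choices for illustration, not anything from BBC BASIC.

```python
# A code point is just a number; no bytes are involved at this stage.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji

def in_bmp(ch: str) -> bool:
    """Return True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF

for ch in samples:
    print(f"U+{ord(ch):04X} in BMP: {in_bmp(ch)}")
```

The first three characters lie inside the BMP; the emoji (U+1F600) does not.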
UTF-8 is a method of encoding Unicode text into a sequence of bytes; it is a variable-length code in which each character occupies from one to four bytes. Alternative Unicode encodings are UTF-16 (a sequence of 16-bit words, also variable-length: one or two words per character) and UTF-32 (a sequence of fixed-length 32-bit values). UTF-16 and UTF-32 each come in two varieties (little-endian and big-endian), which determine the order of the bytes when the 16-bit or 32-bit values are serialised.
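As a rough illustration of how one code point turns into different byte sequences, the sketch below encodes two characters with Python's built-in codecs. The two sample characters are my own picks; the codec names are the standard Python ones.

```python
euro = "\u20ac"        # EURO SIGN, U+20AC (inside the BMP)
emoji = "\U0001F600"   # GRINNING FACE, U+1F600 (outside the BMP)

# Same code points, five different serialisations.
for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(f"{name:10} {euro.encode(name).hex():10} {emoji.encode(name).hex()}")
```

The euro sign takes three bytes in UTF-8, two in UTF-16 and four in UTF-32. The emoji takes four bytes in each, with UTF-16 using two 16-bit words (a surrogate pair). The little-endian and big-endian variants contain the same values with the bytes of each word reversed.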
You could use UTF-8 to represent only the 256 characters in the ANSI character set (not that there would be much point doing so, because it would introduce the complication of a variable-length encoding for no benefit) but that does not make it Unicode!
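A small sketch of that point, assuming 'ANSI' here means Windows-1252 (Python's `cp1252` codec): the same four-character string is one byte per character in ANSI, but becomes variable-length as soon as it is stored as UTF-8.

```python
text = "caf\u00e9"  # four characters, all present in Windows-1252

ansi = text.encode("cp1252")  # one byte per character
utf8 = text.encode("utf-8")   # the e-acute becomes two bytes

print(len(ansi), ansi.hex())
print(len(utf8), utf8.hex())
```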
Unicode and UTF-8
Last edited by RichardRussell on Fri 10 Apr 2020, 16:02, edited 1 time in total.
- hellomike
Re: Unicode and UTF-8
Thanks for the recap, Richard! I have to admit that I struggled, and sometimes still struggle, with these concepts.
I remember being an 'expert' with BASIC but couldn't grasp what POKE and PEEK really did. Once I understood the concept of memory locations and bytes in that memory, it was easy.
As always, once a concept is truly understood it is easy from then on.
Regards,
Mike
Re: Unicode and UTF-8
Here are a couple of functions, one for converting ANSI to UTF-8 (which is always possible):
DEF FN_ansi_to_utf8(a$)
LOCAL C%, I%, u$
FOR I% = 1 TO LEN(a$)
  C% = ASCMID$(a$,I%,1)
  REM Map the Windows-1252 characters in the range &80-&9F to their code points
  IF C% > &7F THEN
    CASE C% OF
      WHEN &80: C% = &20AC : REM Euro
      WHEN &82: C% = &201A
      WHEN &83: C% = &0192
      WHEN &84: C% = &201E
      WHEN &85: C% = &2026
      WHEN &86: C% = &2020
      WHEN &87: C% = &2021
      WHEN &88: C% = &02C6
      WHEN &89: C% = &2030
      WHEN &8A: C% = &0160
      WHEN &8B: C% = &2039
      WHEN &8C: C% = &0152
      WHEN &8E: C% = &017D
      WHEN &91: C% = &2018
      WHEN &92: C% = &2019
      WHEN &93: C% = &201C
      WHEN &94: C% = &201D
      WHEN &95: C% = &2022
      WHEN &96: C% = &2013
      WHEN &97: C% = &2014
      WHEN &98: C% = &02DC
      WHEN &99: C% = &2122
      WHEN &9A: C% = &0161
      WHEN &9B: C% = &203A
      WHEN &9C: C% = &0153
      WHEN &9E: C% = &017E
      WHEN &9F: C% = &0178
    ENDCASE
  ENDIF
  REM Encode the code point as one, two or three UTF-8 bytes
  IF C% < &80 THEN
    u$ += CHR$(C%)
  ELSEIF C% < &800 THEN;
    u$ += CHR$(&C0 + (C% >> 6 AND &3F)) + CHR$(&80 + (C% AND &3F))
  ELSE
    u$ += CHR$(&E0 + (C% >> 12)) + CHR$(&80 + (C% >> 6 AND &3F)) + CHR$(&80 + (C% AND &3F))
  ENDIF
NEXT
= u$
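For anyone wanting to cross-check the bit arithmetic, here is a hedged Python sketch of the same two steps: look up the code point for each ANSI byte, then build the UTF-8 bytes by hand. It assumes 'ANSI' means Windows-1252 (Python's `cp1252` codec); the function names are hypothetical, and bytes that the codec leaves undefined are passed through unchanged, which is my reading of what the BASIC function does.

```python
def utf8_encode_bmp(cp: int) -> bytes:
    """Encode one BMP code point (0..0xFFFF) as UTF-8 using shifts and masks."""
    if cp < 0x80:
        return bytes([cp])                              # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),                 # 110xxxxx
                      0x80 | (cp & 0x3F)])              # 10xxxxxx
    return bytes([0xE0 | (cp >> 12),                    # 1110xxxx
                  0x80 | ((cp >> 6) & 0x3F),            # 10xxxxxx
                  0x80 | (cp & 0x3F)])                  # 10xxxxxx

def ansi_byte_to_code_point(b: int) -> int:
    """Map one Windows-1252 byte to its Unicode code point."""
    try:
        return ord(bytes([b]).decode("cp1252"))
    except UnicodeDecodeError:
        return b  # position undefined in cp1252: pass the value through

def ansi_to_utf8(data: bytes) -> bytes:
    return b"".join(utf8_encode_bmp(ansi_byte_to_code_point(b)) for b in data)

print(ansi_to_utf8(b"\x80").hex())  # e282ac (the euro sign)
```

The hand-built encoder can be compared against the built-in one, for example `utf8_encode_bmp(0x20AC) == "\u20ac".encode("utf-8")`.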
The second function converts UTF-8 back to ANSI:
DEF FN_utf8_to_ansi(u$)
LOCAL C%, I%, a$
WHILE I% < LEN(u$)
  I% += 1
  C% = ASCMID$(u$,I%,1)
  REM Decode a three-byte or two-byte UTF-8 sequence into a code point
  IF C% >= &E0 THEN
    C% = (C% << 12) AND &F000
    C% OR= (ASCMID$(u$,I%+1,1) << 6) AND &0FC0
    C% OR= ASCMID$(u$,I%+2,1) AND &003F
    I% += 2
  ELSEIF C% >= &C0 THEN;
    C% = (C% << 6) AND &07C0
    C% OR= ASCMID$(u$,I%+1,1) AND &003F
    I% += 1
  ENDIF
  REM Map code points above &FF back to Windows-1252 where possible
  IF C% > &FF THEN
    CASE C% OF
      WHEN &20AC: C% = &80 : REM Euro
      WHEN &201A: C% = &82
      WHEN &0192: C% = &83
      WHEN &201E: C% = &84
      WHEN &2026: C% = &85
      WHEN &2020: C% = &86
      WHEN &2021: C% = &87
      WHEN &02C6: C% = &88
      WHEN &2030: C% = &89
      WHEN &0160: C% = &8A
      WHEN &2039: C% = &8B
      WHEN &0152: C% = &8C
      WHEN &017D: C% = &8E
      WHEN &2018: C% = &91
      WHEN &2019: C% = &92
      WHEN &201C: C% = &93
      WHEN &201D: C% = &94
      WHEN &2022: C% = &95
      WHEN &2013: C% = &96
      WHEN &2014: C% = &97
      WHEN &02DC: C% = &98
      WHEN &2122: C% = &99
      WHEN &0161: C% = &9A
      WHEN &203A: C% = &9B
      WHEN &0153: C% = &9C
      WHEN &017E: C% = &9E
      WHEN &0178: C% = &9F
      OTHERWISE: C% = &3F : REM No ANSI equivalent, so substitute "?"
    ENDCASE
  ENDIF
  a$ += CHR$(C%)
ENDWHILE
= a$
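The reverse direction is lossy, because most Unicode characters have no ANSI equivalent. A minimal Python sketch of that, again assuming 'ANSI' means Windows-1252 (`cp1252`); the function name and the `?` fallback are my own choices for illustration.

```python
def utf8_to_ansi(data: bytes, fallback: bytes = b"?") -> bytes:
    """Convert UTF-8 bytes to Windows-1252, substituting unmappable characters."""
    out = bytearray()
    for ch in data.decode("utf-8"):
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out += fallback  # this character does not exist in Windows-1252
    return bytes(out)

print(utf8_to_ansi("\u20ac".encode("utf-8")).hex())  # 80
print(utf8_to_ansi("\u0141".encode("utf-8")))        # b'?'
```

The euro sign maps back to the single byte &80, whereas U+0141 (L with stroke) is not in the ANSI set and is replaced by the fallback.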