Some Unicode in BBCSDL source not displayed.

hellomike · Post by **hellomike** » Sun 27 Aug 2023, 13:39

When the following lines of code are entered (or pasted) into SDLIDE, which is in Unicode mode, the first two lines with the accented characters (UTF8 versions, NOT ANSI) are shown correctly.

      DATA "Les Misérables"
      DATA "La Femme d'À Côté"
      DATA "Alternating ANSI and UTF8 pipe characters | ｜ | ｜ | ｜ | ｜"
      DATA "Alternating ANSI and UTF8 colons : ： : ： : ："
      DATA "Alternating ANSI and UTF8 asterisks * ＊ * ＊ * ＊"

In the alternating lines, only the ANSI version of the characters is shown. The UTF8 versions of the pipe, colon and asterisk are shown as unprintable rectangles.

blk.jpg

What is going on here?

Thanks,

Mike

Hated Moron · Post by **Hated Moron** » Sun 27 Aug 2023, 15:05

hellomike wrote: ↑Sun 27 Aug 2023, 13:39 What is going on here?

You've contrived to create an illegal file which contains both ANSI and UTF-8 encoded characters. Once you create something illegal you can't meaningfully ask how it will misbehave; it might explode your computer (not really, but my point is that you can't ask the question).

It's not permitted to mix ANSI and UTF-8 encodings in the same file; they are mutually exclusive. That's obviously the case because some sequences of bytes are valid in both encodings, and there is no way of telling which is the correct interpretation.

For example, the ANSI string "NESCAFÉ©" (where the last two characters are E-acute and the copyright symbol) and the UTF-8 string "NESCAFɩ" (where the last character is the Latin small letter iota, U+0269) have identical encodings: 4E 45 53 43 41 46 C9 A9.

Try it for yourself. Copy-and-paste this into the BB4W or BBCSDL editor in ANSI mode and run it:

      a$ = "NESCAFÉ©"
      FOR I% = 1 TO LEN(a$)
        PRINT " " RIGHT$("0"+STR$~ASCMID$(a$,I%),2);
      NEXT
      PRINT

Now copy-and-paste this into the BB4W or BBCSDL editor in Unicode mode and run it:

      a$ = "NESCAFɩ"
      FOR I% = 1 TO LEN(a$)
        PRINT " " RIGHT$("0"+STR$~ASCMID$(a$,I%),2);
      NEXT
      PRINT

You should find that they print the same sequence of bytes.
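
The same point can be cross-checked outside BBC BASIC. The following is a minimal Python sketch, not part of the original post, and it assumes that "ANSI" here means Windows code page 1252. It decodes the eight bytes quoted above under both encodings:

```python
# The eight bytes quoted in the post above.
raw = bytes.fromhex("4E 45 53 43 41 46 C9 A9")

# Assumption: "ANSI" means Windows-1252, where every byte is one character.
as_ansi = raw.decode("cp1252")

# In UTF-8 the trailing C9 A9 is a single two-byte sequence (U+0269).
as_utf8 = raw.decode("utf-8")

print(as_ansi, len(as_ansi))  # NESCAFÉ© 8
print(as_utf8, len(as_utf8))  # NESCAFɩ 7
```

Both decodes succeed without error, which is why nothing in the bytes themselves can tell a reader which interpretation was intended.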

hellomike · Post by **hellomike** » Sun 27 Aug 2023, 17:14

Thanks for the fast reply.

As mentioned in the topic, I can assure you that the accented characters are also UTF8 and not ANSI.
This program gives the same result in BB4W and BBCSDL and shows correctly in the BB4W editor but not in SDLIDE (both in Unicode mode).

      a$="d'À Côté"
      PRINT LENa$
      FOR I% = 1 TO LEN(a$)
        PRINT " " RIGHT$("0"+STR$~ASCMID$(a$,I%),2);
      NEXT
      PRINT

      a$="Asterisks * ＊ * ＊ * ＊ * ＊"
      PRINT LENa$
      FOR I% = 1 TO LEN(a$)
        PRINT " " RIGHT$("0"+STR$~ASCMID$(a$,I%),2);
      NEXT
      PRINT
      END

Output is the same in both, i.e.:

        11
 64 27 C3 80 20 43 C3 B4 74 C3 A9
        33
 41 73 74 65 72 69 73 6B 73 20 2A 20 EF BC 8A 20 2A 20 EF BC 8A 20 2A 20 EF BC 8A 20 2A 20 EF BC 8A
>
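
As a hedged cross-check of the dump above (a Python sketch added for illustration, not something from the thread), the bytes can be decoded as strict UTF-8 to confirm that they are valid and to identify the three-byte sequence EF BC 8A:

```python
import unicodedata

# Hex dumps copied from the program output above.
dump1 = "64 27 C3 80 20 43 C3 B4 74 C3 A9"
dump2 = ("41 73 74 65 72 69 73 6B 73 20 2A 20 EF BC 8A 20 2A 20 EF BC 8A "
         "20 2A 20 EF BC 8A 20 2A 20 EF BC 8A")

for dump in (dump1, dump2):
    raw = bytes.fromhex(dump)
    # A strict decode raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = raw.decode("utf-8")
    print(len(raw), "bytes ->", len(text), "characters:", text)

# Identify the repeated three-byte sequence on its own.
star = bytes.fromhex("EF BC 8A").decode("utf-8")
print(f"U+{ord(star):04X}", unicodedata.name(star))  # U+FF0A FULLWIDTH ASTERISK
```

The byte counts (11 and 33) match what LEN reports in BBC BASIC, while the character counts are 8 and 25, because LEN counts bytes rather than characters.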

Hated Moron · Post by **Hated Moron** » Sun 27 Aug 2023, 17:53

hellomike wrote: ↑Sun 27 Aug 2023, 17:14 I can assure you that the accented characters are also UTF8 and not ANSI.

I'm very confused. You explicitly stated in your original program that there were both ANSI and UTF-8 characters:

      DATA "Alternating ANSI and UTF8 asterisks * ＊ * ＊ * ＊"

Since including both ANSI and UTF-8 encodings in the same file is illegal, I replied on that basis.

Are you now saying that there aren't both ANSI and UTF-8 characters after all, despite what the program claims?

hellomike · Post by **hellomike** » Sun 27 Aug 2023, 17:54

Or, in short: a program like the original but without the first two DATA statements (i.e. with only the alternating lines) also fails to show the UTF8 characters correctly when loaded.

Regards,

Mike

hellomike · Post by **hellomike** » Sun 27 Aug 2023, 17:57

Oops our messages are crossing....

Yes, you are correct, my wording was confusing, apologies.
No, ALL the special characters are entered as UTF8. As you can see, both strings are longer in bytes than in characters, i.e. the special characters are multibyte.

Mike

Hated Moron · Post by **Hated Moron** » Sun 27 Aug 2023, 18:26

hellomike wrote: ↑Sun 27 Aug 2023, 17:57 Yes, you are correct, my wording was confusing, apologies.
No, ALL the special characters are entered as UTF8. As you can see, both strings are longer in bytes than in characters, i.e. the special characters are multibyte.

OK, so your question is simply "why are those characters displayed as boxes and not as asterisks?". The answer is just as simple: they are uncommon characters and are not included in the font you are using (which I expect is DejaVuSansMono.ttf since that is the default).

Indeed it's such an unusual character that there aren't many common fonts which include it. One that does is 'Arial Unicode MS' so I copied that into the lib/ directory and this is what I get:

arial-unicode-ms.png
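
For readers wondering why these characters are so rarely covered, the Python sketch below (an illustration added here, not from the thread) names them. It assumes the three pasted characters are U+FF5C, U+FF1A and U+FF0A. Only the asterisk is confirmed by the byte dump earlier in the thread; the other two are inferred from how they appear in the first post.

```python
import unicodedata

# Assumed code points of the pipe, colon and asterisk from the first post.
suspects = ["\uFF5C", "\uFF1A", "\uFF0A"]

for ch in suspects:
    # NFKC normalisation maps a fullwidth form to its ordinary ASCII equivalent.
    plain = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(plain):04X} {plain!r}")
```

All three are reported as FULLWIDTH forms. They sit in the Halfwidth and Fullwidth Forms block (U+FF00 to U+FFEF), which exists mainly for East Asian typography, so it is unsurprising that a Western monospaced font leaves them out.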

Hated Moron · Post by **Hated Moron** » Sun 27 Aug 2023, 19:03

Hated Moron wrote: ↑Sun 27 Aug 2023, 18:26 Indeed it's such an unusual character that there aren't many common fonts which include it. One that does is 'Arial Unicode MS' so I copied that into the lib/ directory and this is what I get:

Another is unifont.ttf which has probably the most comprehensive coverage of all free fonts, but is based on a bitmap font so isn't very high quality, especially at large sizes. It's also 12 Mbytes!

unifont.png

hellomike · Post by **hellomike** » Mon 28 Aug 2023, 06:03

Thanks for clarifying.

Regards,

Mike

Hated Moron · Post by **Hated Moron** » Thu 31 Aug 2023, 12:46

Ever since creating BBC BASIC for SDL 2.0 I've been looking for the 'perfect' typeface to use for the code editor: free (with non-restrictive licensing), good character coverage (at least the main Western languages), monospaced (ideally; SDLIDE will work with proportionally spaced fonts) and highly legible (zero easily distinguished from O, lower-case l from figure 1 etc.).

The best I've managed to find, and the default font used by SDLIDE, is DejaVuSansMono. It fulfils most of the above requirements, with the most notable shortcoming being the absence of Hebrew. That seems a very strange omission, given that DejaVuSans does include Hebrew, and I don't think it's particularly challenging to design a monospaced version of the alphabet.

If anybody knows of a better typeface I would be very interested, but I've not found one despite extensive searches. I'd also like to know if there is any feasible way of transplanting the Hebrew characters from DejaVuSans into DejaVuSansMono (even if they aren't strictly monospaced) which would be a big improvement.
