Add UTF8.c

This is a small library I wrote to handle UTF-8.

Usage is meant to be as simple as possible. For example, decoding a
NUL-terminated UTF-8 string one codepoint at a time:

  const char* str = "asdf";
  uint32_t codepoint;
  while ((codepoint = UTF8_next(&str)))
  {
      // you have a codepoint congrats
  }

Or encoding a single codepoint to add it to a string:

  std::string result;
  result.append(UTF8_encode(0x1234).bytes);
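
As a rough illustration of what encoding a codepoint involves, here is
a minimal sketch. It is not the code in UTF8.c: sketch_encoding and
sketch_encode are hypothetical names that only mirror the shape of
UTF8_encoding, and the error handling (rejecting surrogates and values
above 0x10FFFF, leaving the buffer empty) is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in mirroring the layout of UTF8_encoding. */
typedef struct
{
    char bytes[5];  /* up to 4 encoded bytes plus a terminating NUL */
    uint8_t nbytes; /* number of encoded bytes, excluding the NUL */
    bool error;     /* assumption: set when the codepoint is not encodable */
}
sketch_encoding;

static sketch_encoding sketch_encode(uint32_t cp)
{
    /* Zero-initialized, so bytes is always NUL-terminated. */
    sketch_encoding enc = {{0}, 0, false};

    /* Assumption: surrogates and out-of-range values are errors. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    {
        enc.error = true;
        return enc;
    }

    if (cp < 0x80)
    {
        /* 0xxxxxxx */
        enc.bytes[0] = (char) cp;
        enc.nbytes = 1;
    }
    else if (cp < 0x800)
    {
        /* 110xxxxx 10xxxxxx */
        enc.bytes[0] = (char) (0xC0 | (cp >> 6));
        enc.bytes[1] = (char) (0x80 | (cp & 0x3F));
        enc.nbytes = 2;
    }
    else if (cp < 0x10000)
    {
        /* 1110xxxx 10xxxxxx 10xxxxxx */
        enc.bytes[0] = (char) (0xE0 | (cp >> 12));
        enc.bytes[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        enc.bytes[2] = (char) (0x80 | (cp & 0x3F));
        enc.nbytes = 3;
    }
    else
    {
        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        enc.bytes[0] = (char) (0xF0 | (cp >> 18));
        enc.bytes[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
        enc.bytes[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
        enc.bytes[3] = (char) (0x80 | (cp & 0x3F));
        enc.nbytes = 4;
    }

    return enc;
}
```

Returning the struct by value with a NUL-terminated buffer is what lets
a caller pass .bytes straight to something like std::string::append().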

There are a few other functions: UTF8_total_codepoints() returns the
total number of codepoints in a string, UTF8_backspace() returns the
length of a string after backspacing one character, and
UTF8_peek_next() is a simpler version of UTF8_next() that does not
advance the string pointer. More functions can always be added if we
need them.
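
To make the counting and backspacing semantics concrete, here is a
minimal sketch. It is not the code in UTF8.c: the sketch_* names are
hypothetical, and it assumes the input is already valid UTF-8, so it
only needs to tell continuation bytes (10xxxxxx) apart from everything
else.

```c
#include <assert.h>
#include <stddef.h>

/* True for UTF-8 continuation bytes, which have the form 10xxxxxx. */
static int sketch_is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Count codepoints in a NUL-terminated string by counting every byte
 * that starts a sequence. Assumes valid UTF-8. */
static size_t sketch_total_codepoints(const char* str)
{
    size_t total = 0;
    assert(str != NULL);
    for (; *str != '\0'; str++)
    {
        if (!sketch_is_continuation((unsigned char) *str))
        {
            total++;
        }
    }
    return total;
}

/* Return the byte length the string would have after removing its
 * last codepoint: step back over trailing continuation bytes, then
 * over the byte that starts the sequence. */
static size_t sketch_backspace(const char* str, size_t len)
{
    assert(str != NULL);
    while (len > 0)
    {
        len--;
        if (!sketch_is_continuation((unsigned char) str[len]))
        {
            break;
        }
    }
    return len;
}
```

For example, for the 4-byte string "a" followed by U+1234 (bytes
61 E1 88 B4), the count is 2 and backspacing from length 4 gives 1.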

This will allow us to replace utfcpp (utf8::unchecked) and also fix
some less-than-ideal code:

- Some places have to resort to ignoring UTF-8 (next_wrap) or using
  UCS-4→UTF-8 functions (VFormat had to use PHYSFS ones, and one other
  place has four lines of code including a std::back_inserter just for
  one character)

- The iterator stuff is kinda confusing and verbose anyway
Dav999-v
2023-02-23 03:41:36 +01:00
committed by Misa Elizabeth Kai
parent 22f1a18fe7
commit 3ce4735d50
3 changed files with 238 additions and 0 deletions

@@ -0,0 +1,35 @@
#ifndef UTF8_H
#define UTF8_H
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __cplusplus
extern "C"
{
#endif
typedef struct
{
    char bytes[5];
    uint8_t nbytes;
    bool error;
}
UTF8_encoding;
uint32_t UTF8_peek_next(const char* s_str, uint8_t* codepoint_nbytes);
uint32_t UTF8_next(const char** p_str);
UTF8_encoding UTF8_encode(uint32_t codepoint);
size_t UTF8_total_codepoints(const char* str);
size_t UTF8_backspace(const char* str, size_t len);
#ifdef __cplusplus
} /* extern "C" */
#endif
#endif // UTF8_H