Module unicode

This module provides support to handle the Unicode UTF-8 encoding.

Types

RuneImpl = int32
  Source Edit
Rune = distinct RuneImpl
type that can hold any Unicode character   Source Edit
Rune16 = distinct int16
16 bit Unicode character   Source Edit

Procs

proc `<=%`(a, b: Rune): bool {.
raises: [], tags: []
.}
  Source Edit
proc `<%`(a, b: Rune): bool {.
raises: [], tags: []
.}
  Source Edit
proc `==`(a, b: Rune): bool {.
raises: [], tags: []
.}
  Source Edit
proc runeLen(s: string): int {.
gcsafe, extern: "nuc$1", raises: [], tags: []
.}
Returns the number of Unicode characters of the string s   Source Edit
proc runeLenAt(s: string; i: Natural): int {.
raises: [], tags: []
.}
Returns the number of bytes the rune starting at s[i] takes   Source Edit
proc validateUtf8(s: string): int {.
raises: [], tags: []
.}
Returns the position of the invalid byte in s if the string s does not hold valid UTF-8 data. Otherwise -1 is returned.   Source Edit
proc runeAt(s: string; i: Natural): Rune {.
raises: [], tags: []
.}
Returns the unicode character in s at byte index i   Source Edit
proc toUTF8(c: Rune): string {.
gcsafe, extern: "nuc$1", raises: [], tags: []
.}
Converts a rune into its UTF-8 representation   Source Edit
proc `$`(rune: Rune): string {.
raises: [], tags: []
.}
Converts a Rune to a string   Source Edit
proc `$`(runes: seq[Rune]): string {.
raises: [], tags: []
.}
Converts a sequence of Runes to a string   Source Edit
proc runeOffset(s: string; pos: Natural; start: Natural = 0): int {.
raises: [], tags: []
.}

Returns the byte position of unicode character at position pos in s with an optional start byte position. returns the special value -1 if it runs out of the string

Beware: This can lead to unoptimized code and slow execution! Most problems are solve more efficient by using an iterator or conversion to a seq of Rune.

  Source Edit
proc runeAtPos(s: string; pos: int): Rune {.
raises: [], tags: []
.}

Returns the unicode character at position pos

Beware: This can lead to unoptimized code and slow execution! Most problems are solve more efficient by using an iterator or conversion to a seq of Rune.

  Source Edit
proc runeStrAtPos(s: string; pos: Natural): string {.
raises: [], tags: []
.}

Returns the unicode character at position pos as UTF8 String

Beware: This can lead to unoptimized code and slow execution! Most problems are solve more efficient by using an iterator or conversion to a seq of Rune.

  Source Edit
proc runeReverseOffset(s: string; rev: Positive): (int, int) {.
raises: [], tags: []
.}

Returns a tuple with the the byte offset of the unicode character at position rev in s counting from the end (starting with 1) and the total number of runes in the string. Returns a negative value for offset if there are to few runes in the string to satisfy the request.

Beware: This can lead to unoptimized code and slow execution! Most problems are solve more efficient by using an iterator or conversion to a seq of Rune.

  Source Edit
proc runeSubStr(s: string; pos: int; len: int = int.high): string {.
raises: [], tags: []
.}

Returns the UTF-8 substring starting at codepoint pos with len codepoints. If pos or len is negativ they count from the end of the string. If len is not given it means the longest possible string.

(Needs some examples)

  Source Edit
proc toLower(c: Rune): Rune {.
gcsafe, extern: "nuc$1", procvar, raises: [], tags: []
.}
Converts c into lower case. This works for any Unicode character. If possible, prefer toLower over toUpper.   Source Edit
proc toUpper(c: Rune): Rune {.
gcsafe, extern: "nuc$1", procvar, raises: [], tags: []
.}
Converts c into upper case. This works for any Unicode character. If possible, prefer toLower over toUpper.   Source Edit
proc toTitle(c: Rune): Rune {.
gcsafe, extern: "nuc$1", procvar, raises: [], tags: []
.}
Converts c to title case   Source Edit
proc isLower(c: Rune): bool {.
gcsafe, extern: "nuc$1", procvar, raises: [], tags: []
.}
Returns true iff c is a lower case Unicode character. If possible, prefer isLower over isUpper.   Source Edit
proc isUpper(c: Rune): bool {.
gcsafe, extern: "nuc$1", procvar, raises: [], tags: []
.}
Returns true iff c is a upper case Unicode character. If possible, prefer isLower over isUpper.   Source Edit
proc isAlpha(c: Rune): bool {.
gcsafe, extern: "nuc$1", procvar, raises: [], tags: []
.}
Returns true iff c is an alpha Unicode character (i.e., a letter)   Source Edit
proc isTitle(c: Rune): bool {.
gcsafe, extern: "nuc$1", procvar, raises: [], tags: []
.}
Returns true iff c is a Unicode titlecase character   Source Edit
proc isWhiteSpace(c: Rune): bool {.
gcsafe, extern: "nuc$1", procvar, raises: [], tags: []
.}
Returns true iff c is a Unicode whitespace character   Source Edit
proc isCombining(c: Rune): bool {.
gcsafe, extern: "nuc$1", procvar, raises: [], tags: []
.}
Returns true iff c is a Unicode combining character   Source Edit
proc isUpper(s: string): bool {.
noSideEffect, procvar, gcsafe, extern: "nuc$1Str", raises: [], tags: []
.}
Returns true iff s contains all upper case unicode characters.   Source Edit
proc isLower(s: string): bool {.
noSideEffect, procvar, gcsafe, extern: "nuc$1Str", raises: [], tags: []
.}
Returns true iff s contains all lower case unicode characters.   Source Edit
proc isAlpha(s: string): bool {.
noSideEffect, procvar, gcsafe, extern: "nuc$1Str", raises: [], tags: []
.}
Returns true iff s contains all alphabetic unicode characters.   Source Edit
proc isSpace(s: string): bool {.
noSideEffect, procvar, gcsafe, extern: "nuc$1Str", raises: [], tags: []
.}
Returns true iff s contains all whitespace unicode characters.   Source Edit
proc toUpper(s: string): string {.
noSideEffect, procvar, gcsafe, extern: "nuc$1Str", raises: [], tags: []
.}
Converts s into upper-case unicode characters.   Source Edit
proc toLower(s: string): string {.
noSideEffect, procvar, gcsafe, extern: "nuc$1Str", raises: [], tags: []
.}
Converts s into lower-case unicode characters.   Source Edit
proc swapCase(s: string): string {.
noSideEffect, procvar, gcsafe, extern: "nuc$1", raises: [], tags: []
.}

Swaps the case of unicode characters in s

Returns a new string such that the cases of all unicode characters are swapped if possible

  Source Edit
proc capitalize(s: string): string {.
noSideEffect, procvar, gcsafe, extern: "nuc$1", raises: [], tags: []
.}
Converts the first character of s into an upper-case unicode character.   Source Edit
proc translate(s: string; replacements: proc (key: string): string): string {.
gcsafe, extern: "nuc$1", raises: [], tags: []
.}

Translates words in a string using the replacements proc to substitute words inside s with their replacements

replacements is any proc that takes a word and returns a new word to fill it's place.

  Source Edit
proc title(s: string): string {.
noSideEffect, procvar, gcsafe, extern: "nuc$1", raises: [], tags: []
.}

Converts s to a unicode title.

Returns a new string such that the first character in each word inside s is capitalized

  Source Edit
proc isTitle(s: string): bool {.
noSideEffect, procvar, gcsafe, extern: "nuc$1Str", raises: [], tags: []
.}

Checks whether or not s is a unicode title.

Returns true if the first character in each word inside s are upper case and there is at least one character in s.

  Source Edit
proc toRunes(s: string): seq[Rune] {.
raises: [], tags: []
.}
Obtains a sequence containing the Runes in s   Source Edit
proc cmpRunesIgnoreCase(a, b: string): int {.
gcsafe, extern: "nuc$1", procvar, raises: [], tags: []
.}
Compares two UTF-8 strings and ignores the case. Returns:

0 iff a == b
< 0 iff a < b
> 0 iff a > b

  Source Edit
proc reversed(s: string): string {.
raises: [], tags: []
.}
Returns the reverse of s, interpreting it as Unicode characters. Unicode combining characters are correctly interpreted as well:
assert reversed("Reverse this!") == "!siht esreveR"
assert reversed("先秦兩漢") == "漢兩秦先"
assert reversed("as⃝df̅") == "f̅ds⃝a"
assert reversed("a⃞b⃞c⃞") == "c⃞b⃞a⃞"
  Source Edit
proc graphemeLen(s: string; i: Natural): Natural {.
raises: [], tags: []
.}
The number of bytes belonging to 's[i]' including following combining characters.   Source Edit
proc lastRune(s: string; last: int): (Rune, int) {.
raises: [], tags: []
.}
length of the last rune in 's[0..last]'. Returns the rune and its length in bytes.   Source Edit

Iterators

iterator runes(s: string): Rune {.
raises: [], tags: []
.}
Iterates over any unicode character of the string s returning runes   Source Edit
iterator utf8(s: string): string {.
raises: [], tags: []
.}
Iterates over any unicode character of the string s returning utf8 values   Source Edit

Templates

template fastRuneAt(s: string; i: int; result: untyped; doInc = true)
Returns the Unicode character s[i] in result. If doInc == true i is incremented by the number of bytes that have been processed.   Source Edit
template fastToUTF8Copy(c: Rune; s: var string; pos: int; doInc = true)

Copies UTF-8 representation of c into the preallocated string s starting at position pos. If doInc == true, pos is incremented by the number of bytes that have been processed.

To be the most efficient, make sure s is preallocated with an additional amount equal to the byte length of c.

  Source Edit