Tuesday, 1 May 2018

egg Strings

Currently, strings are a fundamental type in egg. In many languages they are not. I'm still undecided how much compiler and language support to provide. In particular, because egg is supposed to be a stable language, if strings are a fundamental type, any decisions I make now will need to be "baked" into the specification. So I've been looking into what strings actually are.

In languages where strings are not fundamental, such as C++, there may be many representations of text:

  char*
  char[N]
  std::string
  std::wstring
  std::u16string
  std::u32string

This has the advantage that the different representations and implementations can be tailored for different criteria, but it can be a bit overwhelming for learners.

Strings in egg are simply sequences of Unicode characters. They are immutable (like Java) and do not directly expose their internal representation. They are also independent of your locale.

So what can you do with strings like these? The Wikipedia pages leave a bit to be desired on this subject, so I had to start from first principles.

Assignment

Strings can be assigned from string literals or from other strings:

  string greeting = "hello";
  string word = greeting;

Concatenation

Strings can be created by concatenating other strings using the type-name constructor:

  string greeting = string("hello"); // "hello"
  greeting = string(greeting," world"); // "hello world"

Property: int length

Strings have a single property, "length", which is the count of the Unicode code points (not code units) that make up the string:

  string greeting = "hello";
  print(greeting.length); // prints '5'

Operator: string[int]

Strings have an indexing operator to extract single-character substrings at the relevant integer code point index:

  string greeting = "hello";
  string letter = greeting[0]; // sets letter to "h"

If the index is negative, or beyond the end of the string, an exception is thrown.

Operators: == and !=

Strings have equality/inequality operators:

  print("hello" == "hello"); // prints 'true'
  print("hello" == "goodbye"); // prints 'false'
  print("hello" != "hello"); // prints 'false'
  print("hello" != "goodbye"); // prints 'true'

Equality is case-sensitive and based on the equality of the constituent code point sequence. Strings are never equal to non-string values.

Iteration

The code points that make up a string can be iterated:

  string greeting = "hello";
  for (string i : greeting) print(i); // prints 'h' 'e' 'l' 'l' 'o'


Note that the iteration returns single-character strings.

Member: int hash()

Computes a 64-bit signed hash integer for use in containers and algorithms.

Member: int compare(string other)

Compares the string to another in an invariant manner (code point by code point). Returns negative/zero/position depending on whether the string is less/equal.greater than the other.

  "hello".compare("goodbye") // returns a positive integer

Case-insensitive and locale-specific comparisons are outside the remit of the basic string methods.

To prevent accident assumptions about collation ordering, the relational operators "<", "<=", ">=" and ">" are not directly supported for strings.

Member: bool contains(string needle)

Searches for a substring within the string.

  "beggar".contains("egg") // returns true

This is an optimisation of

  haystack.indexOf(needle) != null

Member: bool startsWith(string needle)

Determines if a string starts with substring.

  "beggar".startsWith("beg") // returns true

This is an optimisation of

  haystack.indexOf(needle) == 0

Member: bool endsWith(string needle)

Determines if a string starts with substring.

  "beggar".endsWith("beg") // returns false

This is an optimisation of

  haystack.indexOf(needle) == (haystack.length - needle.length)

Member: int? indexOf(string needle)

Returns the zero-based index of the first occurrence of the substring, or "null" if the substring is not present.

  "beggar".indexOf("egg") // returns 1

The full signature is

  int? indexOf(string needle,
               int? start_index = null,
               int? limit = null,
               bool? negate = null)

Member: int? lastIndexOf(string needle)

Returns the zero-based index of the last occurrence of the substring or "null" if the substring is not present.

  "beggar".lastIndexOf("g") // returns 3

The full signature is

  int? lastIndexOf(string needle,
                   int? start_index = null,
                   int? limit = null,
                   bool? negate = null)

Member: string replace(string needle, string replacement)

Replaces occurrences of a substring with another string.

  "beggar".replace("egg", "e") // returns "bear"

Remember, the result is returned from this function; the original string is unaltered. The full signature is

  string replace(string needle,
                 string replacement,
                 int? max_occurrences = null)

If "max_occurrences" is negative, replacements are made for the last occurrences of the substring.

Member: string slice(int? begin = null, int? end = null)

Extracts a substring from the string. The parameters are both optional indices. Negative indices count back from the end of the string:

  "beggar".slice(1) // returns "eggar"
  "beggar".slice(1, 4) // returns "egg"
  "beggar".slice(1, -2) // returns "egg"

Member: string[] split(string separator, int? limit = null)

Splits a string into an array of substrings:

  "beggar".split("g") // returns ["be","","ar"]

If "limit" is negative, splits are limited to the last occurrences of the separator.

Member: string join(any[] ...params)

Joins strings together:

  "a".join("b", "n", "n", "") // returns "banana"

Member: string repeat(int count)

Creates a new string by concatenating repetitions of a source string:

  "beggar".repeat(0) // returns ""


  "beggar".repeat(3) // returns "beggarbeggarbeggar"

And that's pretty much it. There are some obvious (and deliberate) omissions:

MISSING Member: string format(any[] ...params)

The functionality for formatting general text will live in separate modules. This will allow for more than one formatting scheme and also allow the algorithms to evolve over time.

MISSING Member: int codePointAt(int index)

This is really a function of the (Unicode) encoding specification, so should be defined there:

  import unicode = /*...*/;
  print(unicode.getCodePointAt("hello", 1)); // prints 101

MISSING Members: string lowercase() and string uppercase()

These are omitted because they rely on the Unicode character database, which changes over time. See here. The functionality will probably be made available via a versioned module:

  import unicode = /*...*/;
  print(unicode.toUpper("hello")); // prints "HELLO"

MISSING Members: string trim(), string trimLeft() and string trimRight()

Removing white-space (or any class of character) from the ends of strings is a common operation. The Unicode standards committee designated twenty-odd code points as "white-space". This includes one of my favourite characters: U+1680: OGHAM SPACE MARK. Alas, the categorisation is somewhat fluid, so one requires access to the Unicode character database to get it right.

MISSING Member: string reverse()

Although characters in egg strings are Unicode, reversing a Unicode string may not produce the results you expect. To do it properly again requires access to the Unicode character database, so the reverse function should belong to a versioned module.

Also, there some members that I'm still dithering about:

POSSIBLE Members: string padLeft() and string padRight()

These members can be synthesised using the existing members. However, they are surprisingly easy to get wrong, so maybe they're a useful addition:

  string padLeft(int min_length, string? padding = null)
  string padRight(int min_length, string? padding = null)

No comments:

Post a Comment