WI 20. Advanced Text

So far, we have dealt with text as something which comes in little packets: we have printed it out, read it in from the keyboard, and compared it with other text. But we have never tried to open the packets and get at the contents, letter by letter, or to make any alterations, or look for certain combinations of letters. These tricks are surprisingly seldom needed - a surprise, that is, given that everything Inform does is textual - but they are in fact open to us. For example:

if character number 1 in "[time of day]" is "1", ...

will be true at, for example, 11:30 PM and 1:22 AM, but not at 3:15 PM. What happens here is that Inform expands the time of day into a text, say "11:30 PM", then extracts the first character, say "1", and tests it.

Until 2012, Inform had two kinds of text - plain "text", and "indexed text" - but there's now only "text", which has all of the abilities of both.

WI §20.2 Memory limitations

Inform creates "story files" for very small virtual computers (capable of running on phones, for instance) where memory is tight. If we create a number variable and keep on adding 1 to it, the value simply gets bigger. But if we make some text and keep on adding a letter "x" to it, the text takes up more and more space, growing into longer and longer runs of "x"s until there is no more space to hold it.

The following warnings are rather like the tiny print about side-effects on medicine bottles: that is, we mostly ignore them, and if the drugs should kill us, well, at least we have the consolation of knowing we were warned. There are basically three limitations on text:

(1) An amount of memory has to be set aside for text (and other flexible-sized data), and Inform guesses the amount needed. Story files using the Glulx format (see the Settings panel) are able to increase this as necessary in play, so there's no problem if the guess was wrong. But Z-machine story files are stuck with whatever amount of memory was initially chosen.

That choice can be increased with a use option, like so:

Use dynamic memory allocation of at least 16384.

Inform raises its estimate of the amount needed to ensure that this amount is always at least its own guess, and also at least any amount declared like this. (And then it rounds up to the nearest power of 2, as it happens.) The default value of "dynamic memory allocation" is 8192. In practice, this use option isn't needed much, though, because any story needing large amounts of dynamic memory will likely be on Glulx in any case.

(2) Text has a maximum length. This maximum is normally 1000 characters, which ought to be plenty, but can be raised by sentences such as:

Use maximum text length of at least 2000.

What happens if this is broken, that is, if we try to use text overrunning this length? The Z-machine may simply crash, so if there is any chance that any single text may grow unpredictably large, Glulx should always be used. On Glulx, overrunning text is truncated safely, except that under Glulx 3.1.0 or better the story file will try to use dynamic memory allocation to expand the limit as needed to avoid truncation. (Testing shows that text is slow to manipulate once it grows beyond about 20,000 characters in length, but this is not really surprising.)

(3) Under the Z-machine, text may only contain characters from the so-called "ZSCII" character set - standard numbers, letters, punctuation marks and the commonest West European accented letters. Anything more exotic is likely to be flattened into a question mark "?". Under Glulx, any character can be used.

All of this makes the Z-machine sound very inferior, for text purposes. But note that Z can handle all of the examples in this chapter perfectly happily.

WI §20.3 Characters, words, punctuated words, unpunctuated words, lines, paragraphs

Inform can get at the contents of text in a variety of ways. The lowest-level is by character - a character is a letter, digit, punctuation symbol, space or other letter-form. (We use the term "character" rather than "letter" because otherwise we would have to call "5" a letter, and so on.) Characters number upwards from 1: character number 1, to repeat that, starts the text. We can get the Nth character with:

character number (number) in (text) ⇒ text

This phrase produces the Nth character from the text, counting from 1. Characters include letters, digits, punctuation symbols, spaces or other letter-forms. Example:

character number 8 in "numberless projects of social reform"

produces "e". If the index is less than 1 or more than the length of the text, the result is an empty text, "".

The maximum character number varies with the current length of the text, and can be evaluated as:

number of characters in (text) ⇒ number

This phrase produces the number of characters in the text. Characters include letters, digits, punctuation symbols, spaces or other letter-forms. Examples:

number of characters in "War and Peace"
number of characters in ""

produce 13 and 0 respectively.

We can also use the adjective "empty":

if the description of the location is empty, ...

The empty text, "", is the only one with 0 characters.
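
As a minimal sketch of these phrases working together, the following rule walks through a text one character at a time. The room called the Lab and the variable names T, N and C are invented here for illustration; any room and any names would do.

```inform7
[The Lab is a placeholder room, needed only so that the source compiles.]
The Lab is a room.

When play begins:
	let T be "War and Peace";
	say "[T] has [number of characters in T] characters.";
	[Characters are numbered from 1 up to the length of the text.]
	repeat with N running from 1 to the number of characters in T:
		say "[N]: [character number N in T][line break]";
	[An index beyond the end yields the empty text rather than an error.]
	let C be character number 99 in T;
	if C is empty, say "There is no character number 99."
```

Going by the counts given above, the first line printed should report 13 characters, and the final test should succeed because 99 is past the end of the text.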

We can also extract the contents by word, again numbered from 1. Thus:

word number (number) in (text) ⇒ text

This phrase produces the Nth word from the text, counting from 1. Words for this purpose are what's left after breaking the text up at punctuation or spacing (spaces, line breaks, paragraph breaks) and then removing that punctuation or spacing. Example:

word number 3 in "ice-hot, don't you think?"

produces "don't". If the index is less than 1 or more than the number of words in the text, the result is an empty text, "".

number of words in (text) ⇒ number

This phrase produces the number of words in the text. Words for this purpose are what's left after breaking the text up at punctuation or spacing (spaces, line breaks, paragraph breaks) and then removing that punctuation or spacing. Example:

number of words in "ice-hot, don't you think?"

produces 5.

Note that the contraction apostrophe in "don't" doesn't count as punctuation. Because this is not always quite what we want, Inform offers two variations:

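punctuated word number (number) in (text) ⇒ text

This phrase produces the Nth punctuated word from the text, counting from 1. Punctuated words are what's left after breaking the text up at punctuation or spacing and removing the spacing, with each punctuation mark counted as a word in its own right. Example: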

punctuated word number 2 in "ice-hot, don't you think?"

produces "-". The punctuated words here are "ice", "-", "hot", ",", "don't", "you", "think", "?". If two or more punctuation marks are adjacent, they are counted as different words, except for runs of dashes or periods: thus ",," has two punctuated words, but "--" and "..." have only one each. If the index is less than 1 or more than the number of punctuated words in the text, the result is an empty text, "".

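number of punctuated words in (text) ⇒ number

This phrase produces the number of punctuated words in the text, with each punctuation mark counted as a word in its own right. Example: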

number of punctuated words in "ice-hot, don't you think?"

produces 8; see if you can find them all.

unpunctuated word number (number) in (text) ⇒ text

This phrase produces the Nth word from the text, counting from 1. Words for this purpose are what's left after breaking the text up at spacing (spaces, line breaks, paragraph breaks) but including all punctuation as if it were part of the spelling of the words it joins to. Example:

unpunctuated word number 1 in "ice-hot, don't you think?"

produces "ice-hot,". The unpunctuated words in "ice-hot, don't you think?" are "ice-hot,", "don't", "you", "think?". If the index is less than 1 or more than the number of unpunctuated words in the text, the result is an empty text, "".

number of unpunctuated words in (text) ⇒ number

This phrase produces the number of unpunctuated words in the text. Words for this purpose are what's left after breaking the text up at spacing (spaces, line breaks, paragraph breaks) but including all punctuation as if it were part of the spelling of the words it joins to. Example:

number of unpunctuated words in "ice-hot, don't you think?"

produces just 4.
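
To compare the three ways of dividing a text into words side by side, here is a short sketch. The room name and the variable names are assumptions made for the example, not part of Inform.

```inform7
[The Lab is a placeholder room, needed only so that the source compiles.]
The Lab is a room.

When play begins:
	let T be "ice-hot, don't you think?";
	say "Words: [number of words in T]. Punctuated words: [number of punctuated words in T]. Unpunctuated words: [number of unpunctuated words in T].";
	[List each punctuated word, punctuation marks included, on its own line.]
	repeat with N running from 1 to the number of punctuated words in T:
		say "[N]: [punctuated word number N in T][line break]".
```

If the counts quoted above hold, the first line should report 5, 8 and 4 respectively, and the loop should then list the eight punctuated words in order.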

Finally, on a larger scale still, we also have:

line number (number) in (text) ⇒ text

This phrase produces the Nth line from the text, counting from 1. Unless explicit use is made of line-breaking, lines and paragraphs will be the same - it doesn't refer to lines as visible on screen, because we have no way of knowing what size screen the player might have.

number of lines in (text) ⇒ number

This phrase produces the number of lines in the text. Unless explicit use is made of line-breaking, lines and paragraphs will be the same - it doesn't refer to lines as visible on screen, because we have no way of knowing what size screen the player might have. Example: the number of lines in

"Sensational news just in![paragraph break]The Martians have invaded Miranda.[line break](One of the moons of Uranus, that is.)"

is 3.

paragraph number (number) in (text) ⇒ text

This phrase produces the Nth paragraph from the text, counting from 1.

number of paragraphs in (text) ⇒ number

This phrase produces the number of paragraphs in the text. Example: the number of paragraphs in

"Sensational news just in![paragraph break]The Martians have invaded Miranda.[line break](One of the moons of Uranus, that is.)"

is 2.

(Attempting to make large enough texts to have a serious paragraph count is slightly risky if there is not much memory to play with, as on the Z-machine. But the facilities do exist.)
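
A small sketch of line and paragraph counting, using the same news bulletin as above. The Lab and the variable name T are placeholders chosen for this example.

```inform7
[The Lab is a placeholder room, needed only so that the source compiles.]
The Lab is a room.

When play begins:
	let T be "Sensational news just in![paragraph break]The Martians have invaded Miranda.[line break](One of the moons of Uranus, that is.)";
	say "Paragraphs: [number of paragraphs in T]. Lines: [number of lines in T].";
	[The second paragraph still contains its own internal line break.]
	say "Second paragraph:[line break][paragraph number 2 in T][line break]".
```

Following the examples above, the counts reported should be 2 paragraphs and 3 lines.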

WI §20.4 Upper and lower case letters

In most European languages the same letters can appear in two forms: as capitals, like "X", mainly used to mark a name or the start of a sentence; or in their ordinary less prominent form, like "x". These forms are called upper and lower case because, historically, typesetters kept lead castings of letters in two wooden cases, one above the other on the workbench. Lower case letters were in the lower box closer to hand, being more often needed.

Human languages are complicated. Not every lower case letter has an upper case partner: ordinal markers in Hispanic languages don't, for instance, and the German "ß" is never used in upper case. Sometimes two different lower case letters have the same upper case form: "ς" and "σ", two versions of the Greek sigma, both capitalise to "Σ". Inform follows the international Unicode standard in coping with all this.

We can test whether text is in either case like so:

if (text) is in lower case:

This condition is true if every character in the text is a lower case letter. Examples: this is true for "wax", but false for "wax seal" or "eZ mOnEy".

if (text) is in upper case:

This condition is true if every character in the text is in upper case. Examples: this is true for "BEESWAX", but false for "ROOM 101".

We can change the casing of text using:

(text) in lower case ⇒ text

This phrase produces a new version of the given text, but with all upper case letters reduced to lower case. Example: "a ticket to Tromsø via Østfold" becomes

"a ticket to tromsø via østfold"

(text) in upper case ⇒ text

This phrase produces a new version of the given text, but with all lower case letters raised to upper case. Example: "a ticket to Tromsø via Østfold" becomes

"A TICKET TO TROMSØ VIA ØSTFOLD"

(text) in title case ⇒ text

This phrase produces a new version of the given text, but with casing of words changed to title casing: this capitalises the first letter of each word, and lowers the rest. Example: "a ticket to Tromsø via Østfold" becomes

"A Ticket To Tromsø Via Østfold"

(text) in sentence case ⇒ text

This phrase produces a new version of the given text, but with casing of words changed to sentence casing: this capitalises the first letter of each sentence and reduces the rest to lower case. Example: "a ticket to Tromsø via Østfold" becomes

"A ticket to tromsø via østfold"
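
The casing phrases can be tried together in a sketch like the following. The room and variable names are invented for the example, and the accented letters assume a story file format that supports them.

```inform7
[The Lab is a placeholder room, needed only so that the source compiles.]
The Lab is a room.

When play begins:
	let T be "a ticket to Tromsø via Østfold";
	say "[T in upper case][line break]";
	say "[T in title case][line break]";
	say "[T in sentence case][line break]";
	[This test fails here: T contains spaces and two capital letters.]
	if T is in lower case:
		say "Every character is a lower case letter.";
	otherwise:
		say "Not every character is a lower case letter."
```

Each of the first three lines should reproduce the corresponding result shown above. The original text T is left unchanged, since each phrase produces a new text.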

Accents are preserved in case changes. So (if we are using Glulx and have Unicode available) title case can transform Aristophanes' discomfortingly lower-case lines

ἐξ οὗ γὰρ ἡμᾶς προὔδοσαν μιλήσιοι,
οὐκ εἶδον οὐδ᾽ ὄλισβον ὀκτωδάκτυλον,
ὃς ἦν ἂν ἡμῖν σκυτίνη "πικουρία

by raising them proudly up like so:

Ἐξ Οὗ Γὰρ Ἡμᾶς Προὔδοσαν Μιλήσιοι,
Οὐκ Εἶδον Οὐδ᾽ Ὄλισβον Ὀκτωδάκτυλον,
Ὃς Ἦν Ἂν Ἡμῖν Σκυτίνη "Πικουρία.

Title and sentence casing can only be approximate if done by computer. Inform looks at the letters, but is blind to the words and sentences they make up. (Note the way sentence casing did not realise "Tromsø" and "Østfold" were proper nouns.) If asked to put the name "MCKAY" into title casing, Inform will opt for "Mckay", not recognising this as the Scottish patronymic surname "McKay". Given "baym dnieper", the title of David Bergelson's great Yiddish novel of 1932, it will opt for "Baym Dnieper": but properly speaking Yiddish does not have upper case lettering at all, though nowadays it is sometimes printed as if it did. And conventions are very variable about which words should be capitalised in titles: English publishers mostly agree that connectives, articles and prepositions should be in lower case, but in France almost anything goes, with Académie Française rules giving way to avant-garde book design. In short, we cannot rely on Inform's title casing to produce a result which a human reader will always think perfect.

This discussion has all been about how Inform prints, not about how it reads commands from the keyboard, because the latter is done case-insensitively. The virtual machines for which Inform creates programs normally flatten all command input to lower case, and in any case Understand comparison ignores casing. Thus

Understand "mckay" as the Highland Piper.

means that "examine McKay", "examine MCKAY", "examine mckay", and so forth are all equivalent. The text of the player's command probably doesn't preserve the original casing typed in any event.

One more caution, though it will affect hardly anyone. For projects using the Z-machine, only a restricted character set is available in texts: for more, we must use Glulx. A mad anomaly of ZSCII, the Z-machine character set, is that it contains the lower case letter "ÿ" but not its upper case form "Ÿ", so that

"ÿ" in upper case

produces "Ÿ" in Glulx but "ÿ" in the Z-machine. This will come as a blow to Queensrÿche fans, but in all other respects any result on the Z-machine should agree with its counterpart on Glulx.

Examples

411.

Capital City ★

To arrange that the location information normally given on the left-hand side of the status line appears in block capitals.

RB 12.2 The Status Line

412.

Rocket Man ★

Using case changes on any text produced by a "to say…" phrase.

RB 2.1 Varying What Is Written

WI §20.5 Matching and exactly matching

Up to now, we have only been able to judge two texts by seeing if they are equal, but we can now ask more subtle questions.

if (text) matches the text (text):

This condition is true if the second text occurs anywhere inside the first. Examples:

if "[score]" matches the text "3", ...

tests whether the digit 3 occurs anywhere in the score, as printed out; and

if the printed name of the location matches the text "the", ...

tests to see whether "the" can be found anywhere in the current room's name. Note that the location "Smotheringly Hot Jungle" would pass this test - it's there if you look. On the other hand, "The Orangery" would not, because "The" does not match against "the". We can get around this in a variety of ways, one of which is to tell Inform to be insensitive to the case (upper or lower) of letters:

if the printed name of the location matches the text "the", case insensitively: ...

if (text) exactly matches the text (text):

This condition is true if the second text matches the first, starting at the beginning and finishing at the end. This appears to be the same as testing if one is equal to the other, but that's not quite true: for example,

if "[score]" exactly matches the text "[best score]", ...

is true if the score and best score currently print out as the same text, which will be true if they are currently equal as numbers; but

if "[score]" is "[best score]", ...

is never true - these are different texts, even if they sometimes look the same.

In the next section we shall see that "matches" and "exactly matches" can do much more than the simple text matching demonstrated above.

We can also see how many times something matches:

number of times (text) matches the text (text) ⇒ number

This produces the number of times the second text occurs within the first. The matches are not allowed to overlap. Example:

number of times "pell-mell sally" matches the text "ll" = 3
number of times "xyzzy" matches the text "Z" = 0
number of times "xyzzy" matches the text "Z", case insensitively = 2
number of times "aaaaaaaa" matches the text "aaaa" = 2

There's no "number of times WHATEVER exactly matches the text FIND" phrase since this is by definition going to have to be 0 or 1.
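
Here is a brief sketch combining a case-insensitive match with a count. The Lab, T and N are names made up for the example.

```inform7
[The Lab is a placeholder room, needed only so that the source compiles.]
The Lab is a room.

When play begins:
	let T be "pell-mell sally";
	[Without the option, the capitalised text would not be found.]
	if T matches the text "MELL", case insensitively:
		say "Found it, ignoring case.";
	[Occurrences are counted without overlapping.]
	let N be the number of times T matches the text "ll";
	say "The doubled letter occurs [N] times."
```

The count is stored in a temporary variable first because a quoted text cannot be written inside the square brackets of another quoted text. By the example above, N should come out as 3.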

WI §20.6 Regular expression matching

When playing around with text, we tend to get into longer and trickier wrangles of matching - we find that we want to look not for simple text like "gold", but for "gold" used only as a separate word, or for a date in YYYY-MM-DD format, or for a seemingly endless range of other possibilities. What we need is not just for Inform to provide a highly flexible matching program, but also a good notation in which to describe what we want.

Fortunately, such a notation already exists. This is the "regular expression" notation, named for a 1950s mathematical model by the logician Stephen Kleene, applied to computing in the late 60s by Ken Thompson, borrowed almost at once by the early Unix tools of the 70s, and developed further by Henry Spencer in the 80s and Philip Hazel in the 90s. The glue holding the Internet together - the Apache web-server, the scripting languages Perl and Python, and so forth - makes indispensable use of regular expressions.

As might be expected from the previous section, we simply have to describe the FIND text as "regular expression" rather than "text" and then the same facilities are available:

if (text) matches the regular expression (text):

This condition is true if any contiguous part of the text can be matched against the given regular expression. Examples:

if "taramasalata" matches the regular expression "a.*l", ...

is true, since this looks for a part of "taramasalata" which begins with "a", continues with any number of characters, and finishes with "l"; so it matches "aramasal". (Not "asal", because it makes the leftmost match it can.) The option "case insensitively" causes lower and upper case letters to be treated as equivalent.

if (text) exactly matches the regular expression (text):

This condition is true if the whole text (starting from the beginning and finishing at the end) can be matched against the given regular expression. The option "case insensitively" causes lower and upper case letters to be treated as equivalent.

And once again:

number of times (text) matches the regular expression (text) ⇒ number

This produces the number of times that contiguous pieces of the text can be matched against the regular expression, without allowing them to overlap.

Since a regular expression can match quite a variety of possibilities (for instance "b\w+t" could match "boast", "boat", "bonnet" and so on), it's sometimes useful to find what the match actually was:

text matching regular expression ⇒ text

This phrase is only meaningful immediately after a successful match of a regular expression against text, and it produces the text which matched. Example:

if "taramasalata" matches the regular expression "m.*l":
   say "[text matching regular expression].";

says "masal."
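
As a further sketch, the "b\w+t" pattern mentioned above can be tried against several candidate texts in turn. The list of candidates, the room and the variable name are assumptions made for the example.

```inform7
[The Lab is a placeholder room, needed only so that the source compiles.]
The Lab is a room.

When play begins:
	repeat with T running through {"boast", "boat", "bonnet", "bt"}:
		[The matched text is only meaningful straight after a successful match.]
		if T matches the regular expression "b\w+t":
			say "[T]: matched [text matching regular expression].";
		otherwise:
			say "[T]: no match."
```

The last candidate, "bt", is expected to fail, since "\w+" needs at least one character between the "b" and the "t".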

Perhaps fairly, perhaps not, regular expressions have a reputation for being inscrutable. The basic idea is that although alphanumeric characters (letters, numbers and spaces) mean just what they look like, punctuation characters are commands with sometimes dramatic effects. Thus:

if WHATEVER matches the regular expression "fish", ...
if WHATEVER matches the regular expression "f.*h", ...

behave very differently. The first is just like matching the text "fish", but the second matches on any sequence of characters starting with an "f" and ending with an "h". This is not at all obvious at first sight: reading regular expressions is a skill which must be learned, like reading a musical score. A really complex regular expression can look like a soup of punctuation and even an expert will blink for a few minutes before telling you what it does - but a beginner can pick up the basics very quickly. Newcomers might like to try out and become comfortable with the features a few at a time, reading down the following list.

1. Golden rule. Don't try to remember all the characters with weird effects. Instead, if you actually mean any symbol other than a letter, digit or space to be taken literally, place a backslash "\" in front of it. For instance, matching the regular expression

"\*A\* of the Galactic Patrol"

is the same as matching the text "*A* of the Galactic Patrol", because the asterisks are robbed of their normal powers. This includes backslash itself: "\\" means a literal backslash. (Don't backslash letters or digits - that turns out to have a meaning all its own, but anyway, there is never any need.)

2. Alternatives. The vertical stroke "|" - not a letter I or L, nor the digit 1 - divides alternatives. Thus

"the fish|fowl|crawling thing"

is the same as saying match "the fish", or "fowl", or "crawling thing".

3. Dividing with brackets. Round brackets "(" and ")" group parts of the expression together.

"the (fish|fowl|crawling thing) in question"

is the same as saying match "the fish in question", or "the fowl in question", or "the crawling thing in question". Note that the "|" ranges outwards only as far as the group it is in.

4. Any character. The period "." means any single character. So

"a...z"

matches on any sequence of five characters so long as the first is "a" and the last is "z".

5. Character alternatives. The angle brackets "<" and ">" are a more concise way of specifying alternatives for a single character. Thus

"b<aeiou>b"

matches on "bab", "beb", "bib", "bob" or "bub", but not "baob" or "beeb" - any single character within the angle brackets is accepted. Beginning the range with "^" means "any single character so long as it is not one of these": thus

"b<^aeiou>b"

matches on "blb" but not "bab", "beb", etc., nor on "blob" or "bb". Because long runs like this can be a little tiresome, we are also allowed to use "-" to indicate whole ranges. Thus

"b<a-z>b"

matches a "b", then any lower case English letter, then another "b".

In traditional regular expression language, square brackets rather than angle brackets are used for character ranges. In fact Inform does understand this notation if there are actual square brackets "[" and "]" in the pattern text, but in practice this would be tiresome to achieve, since Inform uses those to achieve text substitutions. So Inform allows "b<a-z>b" rather than making us type something like

"b[bracket]a-z[close bracket]b"

to create the text "b[a-z]b".
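
The angle-bracket notation can be sketched in use like this, with invented candidate texts and a placeholder room:

```inform7
[The Lab is a placeholder room, needed only so that the source compiles.]
The Lab is a room.

When play begins:
	repeat with T running through {"bab", "blb", "baob"}:
		if T exactly matches the regular expression "b<aeiou>b":
			say "[T]: one vowel between the outer letters.";
		otherwise if T exactly matches the regular expression "b<^aeiou>b":
			say "[T]: one non-vowel between the outer letters.";
		otherwise:
			say "[T]: neither pattern fits."
```

Going by the rules above, "bab" should take the first branch, "blb" the second, and "baob" the third, since a range in angle brackets stands for exactly one character.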

6. Popular character ranges. The range "<0-9>", matching any decimal digit, is needed so often that it has an abbreviation: "\d". Thus

"\d\d\d\d-\d\d-\d\d"

matches, say, "2006-12-03". Similarly, "\s" means "any spacing character" - a space, tab or line break. "\p" is a punctuation character, in the same sense used for word division earlier in this chapter: it actually matches any of

. , ! ? - / " : ; ( ) [ ] { }

"\w" means "any character appearing in a word", and Inform defines it as anything not matching "\s" or "\p".

"\l" and "\u" match lower and upper case letters, respectively. These are much stronger than "<a-z>" and "<A-Z>", since they use the complete definition in the Unicode 4.0.0 standard, so that letter-forms from all languages are catered for: for example "δ" matches "\l" and "Δ" matches "\u".

The reverse of these is achieved by capitalising the letter. So "\D" means "anything not a digit", "\P" means "anything not punctuation", "\W" means "anything not a word character", "\L" means "anything not a lower case letter" and so on.
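
A minimal sketch of these abbreviations at work, picking a date out of a longer text. The sample sentence, the room and the variable name are invented for the example.

```inform7
[The Lab is a placeholder room, needed only so that the source compiles.]
The Lab is a room.

When play begins:
	let T be "Filed on 2006-12-03 by the night clerk";
	[Four digits, a hyphen, two digits, a hyphen, two digits.]
	if T matches the regular expression "\d\d\d\d-\d\d-\d\d":
		say "Date found: [text matching regular expression].";
	[The capitalised form is the opposite: any character that is not a digit.]
	if T matches the regular expression "\D":
		say "First non-digit: [text matching regular expression]."
```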

7. Positional restrictions. The notation "^" does not match anything, as such, but instead requires that we be positioned at the start of the text. Thus

"^fish"

matches only "fish" at the start of the text, not occurring anywhere later on. Similarly, "$" requires that the position be the end of the text. So

"fish$"

matches only if the last four characters are "fish". Matching "^fish$" is the same thing as what Inform calls exactly matching "fish".

Another useful notation is "\b", which matches a word boundary: that is, it matches no actual text, but requires the position to be a junction between a word character and a non-word character (a "\w" and a "\W") or vice versa. Thus

"\bfish\b"

matches "fish" in "some fish" and also "some fish, please!", but not in "shellfish". (The regular expression "\w*fish\b" catches all words ending in "fish", as we will see below.) As usual, the capitalised version "\B" negates this, and means "not at a word boundary".
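
The word-boundary example can be sketched as a rule like the one below. The room and the variable name are placeholders.

```inform7
[The Lab is a placeholder room, needed only so that the source compiles.]
The Lab is a room.

When play begins:
	repeat with T running through {"some fish, please!", "shellfish"}:
		[A boundary is required on both sides of the four letters.]
		if T matches the regular expression "\bfish\b":
			say "[T]: contains fish as a separate word.";
		otherwise:
			say "[T]: no separate word fish here."
```

As described above, the first text should match and "shellfish" should not.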

8. Line break and tab. The notations "\n" and "\t" are used for a line break ("n" for "new line") and tab, respectively. Tabs normally do not occur in Inform strings, but can do when reading from files. It makes no sense to reverse these, so "\N" and "\T" produce errors.

9. Repetition. Placing a number in braces "{" and "}" after something says that it should be repeated that many times. Thus

"ax{25}"

matches only on "axxxxxxxxxxxxxxxxxxxxxxxxx". More usefully, perhaps, we can specify a range of the number of repetitions:

"ax{2,6}"

matches only on "axx", "axxx", "axxxx", "axxxxx", "axxxxxx". And we can leave the top end open: "ax{2,}" means "a" followed by at least two "x"s.

Note that the braces attach only to the most recent thing - so "ax{2}" means "a" followed by two of "x" - but, as always, we can use grouping brackets to change that. So "(ax){2,}" matches "axax", "axaxax", "axaxaxax",…

(It's probably best not to use Inform to try to match the human genome against "<acgt>{3000000000}", but one of the most important practical uses of regular expression matching in science is in treating DNA as a string of nucleotides represented by the letters "a", "c", "g", "t", and looking for patterns.)

10. Popular repetitions. Three cases are so often needed that they have standard short forms:

"{0,1}", which means 0 or 1 repetition of something - in other words, doesn't so much repeat it as make it optional - is written "?". Thus "ax?y" matches only on "ay" or "axy".

"{0,}", which means 0 or more repetitions - in other words, any number at all - is written "*". Thus "ax*y" matches on "ay", "axy", "axxy", "axxxy", … and the omnivorous ".*" - which means "anything, any number of times" - matches absolutely every text. (Perhaps unexpectedly, replacing ".*" in a text with "X" will produce "XX", not "X", because the ".*" first matches the text, then matches the empty gap at the end. To match the entire text just once, try "^.*$".)

"{1,}", which means 1 or more repetitions, is written "+". So "\d+" matches any run of digits, for instance.

11. Greedy vs lazy. Once we allow things to repeat an unknown number of times, we run into an ambiguity. Sure, "\d+" matches the text "16339b". But does it look only as far as the "1", then reason that it now has one or more digits in a row, and stop? Or does it run onward devouring digits until it can do so no longer, so matching the "16339" part? These two strategies are called "lazy" and "greedy" respectively.

Do we care? Well, the strategy used makes no difference to whether there is a match, but it does affect what part of the text is matched, and the number of matches there are. Unless we mark for it, all repetitions are greedy. Usually this is good, but it means that, for instance,

"-.+-"

applied to "-alpha- -beta- -gamma-" will match the whole text, because ".+" picks up all of "alpha- -beta- -gamma". To get around this, we can mark any of the repetition operators as lazy by adding a question mark "?". Thus:

"-.+?-"

applied to "-alpha- -beta- -gamma-" matches three times, producing "-alpha-" then "-beta-" then "-gamma-".

A logical but sometimes confusing consequence is that a doubled question mark "??" means "repeat 0 or 1 times, but prefer 0 matches to 1 if both are possibilities": whereas a single question mark "?", being greedy, means "repeat 0 or 1 times, but prefer 1 match to 0 if both are possibilities".
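
Counting matches is a convenient way to see the difference between the two strategies. This is only a sketch; the room and the variable names are invented.

```inform7
[The Lab is a placeholder room, needed only so that the source compiles.]
The Lab is a room.

When play begins:
	let T be "-alpha- -beta- -gamma-";
	[Greedy: the repetition runs as far as it can before the closing hyphen.]
	let the greedy count be the number of times T matches the regular expression "-.+-";
	[Lazy: the repetition stops at the first closing hyphen it can.]
	let the lazy count be the number of times T matches the regular expression "-.+?-";
	say "Greedy matches: [greedy count]. Lazy matches: [lazy count]."
```

If the behaviour is as described above, the greedy pattern should match once (the whole text) and the lazy one three times.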

12. Numbered groups. We have already seen that round brackets are useful to clump together parts of the regular expression - to choose within them, or repeat them. In fact, Inform numbers these from 1 upwards as they are used from left to right, and we can subsequently refer back to their contents with the notation "\1", "\2", … After a successful match, we can find the results of these subexpressions with:

text matching subexpression (number) ⇒ text

This phrase is only meaningful immediately after a successful match of a regular expression against text, and it produces the text which matched. The number must be from 1 to 9, and must correspond to one of the bracketed groups in the expression just matched. Example: after

if "taramasalata" matches the regular expression "a(r.*l)a(.)":

the "text matching regular expression" is "aramasalat", the "text matching subexpression 1" is "ramasal", and "text matching subexpression 2" is "t".

For instance:

"(\w)\w*\1"

matches any run of two or more word-characters, subject to the restriction that the last one has to be the same as the first - so it matches "xerox" but not "alphabet". When Inform matches this against "xerox", first it matches the initial "x" against the group "(\w)". It then matches "\w*" ("any number of word-characters") against "ero", so that the "*" runs up to 3 repetitions. It then matches "\1" against the final "x", because "\1" requires it to match against whatever last matched in sub-expression 1 - which was an "x".

Numbered groups allow wicked tricks in matching, it's true, but really come into their own when it comes to replacing - as we shall see.
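
The "xerox" example might be tried out as follows. The candidate texts, the room and the variable name are assumptions for the sketch.

```inform7
[The Lab is a placeholder room, needed only so that the source compiles.]
The Lab is a room.

When play begins:
	repeat with T running through {"xerox", "alphabet"}:
		[Group 1 captures the first character; \1 demands it again at the end.]
		if T exactly matches the regular expression "(\w)\w*\1":
			say "[T] begins and ends with [text matching subexpression 1].";
		otherwise:
			say "[T] does not begin and end with the same character."
```

Here "exactly matches" is used so that the whole text must fit the pattern, rather than some shorter run inside it.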

13. Switching case sensitivity on and off. The special notations "(?i)" and "(?-i)" switch sensitivity to upper vs. lower case off and on, mid-expression. Thus "a(?i)bcd(?-i)e" matches "abcde", "aBcDe", etc., but not "Abcde" or "abcdE".

14. Groups with special meanings. This is the last of the special syntaxes: but it's a doozy. A round-bracketed group can be marked to behave in a special way by following the open bracket by a symbol with a special meaning. Groups like this have no number and are not counted as part of \1, \2, and so forth - they are intended not to gather up material but to have some effect of their own.

"(# ...)"

Is a comment, that is, causes the group to do nothing and match against anything.

"(?= ...)"

Is a lookahead: it is a form of positional requirement, like "\b" or "^", but one which requires that the text ahead of us matches whatever is in the brackets. (It doesn't consume that text - only checks to see that it's there.) For instance "\w+(?=;)" matches a word followed by a semicolon, but does not match the semicolon itself.

"(?! ...)"

Is the same but negated: it requires that the text ahead of us does not match the material given. For instance, "a+(?!z)" matches any run of "a"s not followed by a "z".

"(?<= ...)" and "(?<! ...)"

Are the same but looking behind (hence the "<"), not forward. These are restricted to cases where Inform can determine that the material to be matched has a definite known width. For instance, "(?<!shell)fish" matches any "fish" not occurring in "shellfish".

"(> ...)"

Is a possessive, that is, causes the material to be matched and, once matched, never lets go. No matter what subsequently turns out to be convenient, it will never change its match. For instance, "\d+8" matches against "768" because Inform realises that "\d+" cannot be allowed to eat the "8" if there is to be a match, and stops it. But "(>\d+)8" does not match against "768" because now the "\d+", which initially eats "768", is possessive and refuses to give up the "8" once taken.

"(?(1)...)" and "(?(1)...|...)"

Are conditionals. These require us to match the material given if \1 has successfully matched already; in the second version, the material after the "|" must be matched if \1 has not successfully matched yet. And the same for 2, 3, …, 9, of course.

Finally, conditionals can also use lookaheads or lookbehinds as their conditions. So for instance:

"(?(?=\d)\d\d\d\d|AY-\d\d\d\d)"

means if you start with a digit, match four digits; otherwise match "AY-" followed by four digits. There are easier ways to do this, of course, but the really juicy uses of conditionals are only borderline legible and make poor examples - perhaps this is telling us something.
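The lookbehind, lookahead and conditional notations above can be tried out with a small test story like the following. This is only a sketch: the room and the sample texts are invented for illustration, and the phrases used are "replace the regular expression ... in ... with ..." and "if ... exactly matches the regular expression ...".

```inform7
The Lab is a room.

When play begins:
	[Negated lookbehind: only the free-standing "fish" should be replaced.]
	let V be "shellfish and fish";
	replace the regular expression "(?<!shell)fish" in V with "chips";
	say "[V][line break]";
	[Lookahead: the semicolon is required but is not part of the match.]
	let W be "alpha; beta";
	replace the regular expression "\w+(?=;)" in W with "gamma";
	say "[W][line break]";
	[Conditional with a lookahead as its condition.]
	repeat with T running through {"1234", "AY-1234", "AY-12"}:
		if T exactly matches the regular expression "(?(?=\d)\d\d\d\d|AY-\d\d\d\d)":
			say "[T]: accepted.";
		otherwise:
			say "[T]: rejected."
```

If the notations behave as described, we would expect "shellfish and chips" and "gamma; beta", with the first two codes accepted and "AY-12" rejected.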

Examples

413.

Alpha ★

Creating a beta-testing command that matches any line starting with punctuation.

RB 13.1 Testing

414.

About Inform's regular expression support ★

Some footnotes on Inform's regular expressions, and how they compare to those of other programming languages.

RB 1.4 Information Only

WI §20.7 Making new text with text substitutions

Substitutions are most often used just for printing, like so:

say "The clock reads [time of day].";

But they can also produce text which can be stored up or used in other ways. For example, defining

To decide what text is (T - text) doubled:
   decide on "[T][T]".

makes

let the Gerard Kenny reference be "NewYork" doubled;

set this temporary variable to "NewYorkNewYork".

There is, however, a subtlety here. A text with a substitution in it, like:

"The clock reads [time of day]."

is always waiting to be substituted, that is, to become something like:

"The clock reads 11:12 AM."

If all we do with text is to print it, there's nothing to worry about. But if we're storing it up, especially for multiple turns, there are ambiguities. For example, suppose we're changing the look of the black status line bar at the top of the text window:

now the left hand status line is "[time of day]";

Just copying "[time of day]" to the "left hand status line" variable doesn't make it substitute - which is just as well, or the top of the screen would perpetually show "9:00 AM".

On the other hand, looking back at the phrase example:

To decide what text is (T - text) doubled:
   decide on "[T][T]".

"[T][T]" is substituted immediately it's formed. That's also a good thing, because "T" loses its meaning the moment the phrase finishes, which would make "[T][T]" meaningless anywhere else.

What's going on here is this: Inform substitutes text immediately if it contains references to a temporary value such as "T", and otherwise only if it needs to access the contents. This is why "[time of day]" isn't substituted until we need to print it out (or, say, access the third character): "time of day" is a value which always exists, not a temporary one.

Another case where that might be important is if we want to set a text to an elaborated version of itself. For example, suppose there is a variable (not a temporary one) called "the accumulated tally", and consider this:

now the accumulated tally is "[the accumulated tally]X";

The intention of the writer here was to add an "X" each time this happens. But the result is a hang, because what it actually means is that the accumulated tally can only be printed if the accumulated tally is printed first… an infinite regress. The safe way to do this would be:

now the accumulated tally is the substituted form of "[the accumulated tally]X";

Using the adjectives "substituted" and "unsubstituted", it's always possible to test whether a given text is in either state, should this ever be useful. For example,

now the left hand status line is "[time of day]";
if the left hand status line is unsubstituted, say "Yes!";

will say "Yes!": the LHSL is like a bomb waiting to go off. Speaking of which:

The player is holding a temporal bomb.
 
When play begins:
   now the left hand status line is "Clock reads: [time of day]".
 
After dropping the temporal bomb:
   now the left hand status line is the substituted form of the left hand status line;
   say "Time itself is now broken. Well done."

This is making use of:

substituted form of (text) ⇒ text

This takes a text and makes substitution occur immediately. For example,

substituted form of "time of death, [time of day]"

produces something like "time of death, 9:15 AM" rather than "time of death, [time of day]". It's entirely legal to apply this to text which never had any substitutions in, so

substituted form of "balloon"

produces "balloon".

Note that there's no analogous phrase for "unsubstituted form of…", because once text has substituted, there's no way to go back.
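As a minimal sketch of the difference, the following test story freezes one text at the start of play and lets another grow safely each turn. The room and the variable "arrival stamp" are invented for illustration; "accumulated tally" is the variable from the discussion above.

```inform7
The Lab is a room.

The accumulated tally is a text that varies.
The arrival stamp is a text that varies.

When play begins:
	[Substituted once, here, so it should keep showing the starting time.]
	now the arrival stamp is the substituted form of "[time of day]".

Every turn:
	[Substituting immediately avoids the infinite regress described above.]
	now the accumulated tally is the substituted form of "[the accumulated tally]X";
	say "Arrived at [arrival stamp]; it is now [time of day]. Tally: [accumulated tally]."
```

The arrival stamp should stay fixed while the time of day moves on, and the tally should gain one "X" per turn.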

Examples

415.

Identity Theft ★

Allowing the player to enter a name to be used for the player character during the game.

RB 5.2 Traits Determined By the Player

416.

Mirror, Mirror ★

The sorcerer's mirror can, when held up high, form an impression of its surroundings which it then preserves.

RB 9.12 Cameras and Recording Devices

417.

The Cow Exonerated ★★

Creating a class of matches that burn for a time and then go out, with elegant reporting when several matches go out at once.

RB 10.8 Fire

WI §20.8 Replacements

Suppose V is a text which varies - perhaps a property of something, or a variable defined everywhere, or a temporary "let"-named value. How do we change its contents? The easiest way is simply to assign text to it. Thus:

let V be "It is now [the time of day in words]."

And, for instance,

let V be "[V]!"

adds an exclamation mark at the end of V.

Otherwise, it is more useful (also a little faster) to modify V by changing its characters, words and so on. Thus:

replace character number (number) in (text) with (text)

This phrase acts on the named text by placing the given text in place of the Nth character, counting from 1. Example:

let V be "mope";
replace character number 3 in V with "lecul";
say V;

says "molecule".

replace word number (number) in (text) with (text)

This phrase acts on the named text by placing the given text in place of the Nth word, counting from 1, and dividing words at spacing or punctuation. Example:

let V be "Does the well run dry?";
replace word number 3 in V with "jogger";
say V;

says "Does the jogger run dry?".

replace punctuated word number (number) in (text) with (text)

This phrase acts on the named text by placing the given text in place of the Nth word, counting from 1, and dividing words at spacing, counting punctuation runs as words in their own right. Example:

let V be "Frankly, yes, I agree.";
replace punctuated word number 2 in V with ":";
say V;

says "Frankly: yes, I agree.".

replace unpunctuated word number (number) in (text) with (text)

This phrase acts on the named text by placing the given text in place of the Nth word, counting from 1, and dividing words at spacing, counting punctuation as part of a word just as if it were lettering. Example:

let V be "Frankly, yes, I agree.";
replace unpunctuated word number 2 in V with "of course";
say V;

says "Frankly, of course I agree.".

replace line number (number) in (text) with (text)

This phrase acts on the named text by placing the given text in place of the Nth line, counting from 1. Lines are divided by paragraph or line breaks.

replace paragraph number (number) in (text) with (text)

This phrase acts on the named text by placing the given text in place of the Nth paragraph, counting from 1.
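No example is given for lines, so here is a hedged sketch. The room and the shopping list are invented, and the text is built with the "[line break]" substitution so that it contains three lines.

```inform7
The Pantry is a room.

When play begins:
	let V be "Eggs[line break]Milk[line break]Flour";
	[Lines are counted from 1, so this targets "Milk".]
	replace line number 2 in V with "Cream";
	say "[V][line break]".
```

We would expect this to print "Eggs", "Cream" and "Flour" on successive lines. Paragraphs work the same way, with "[paragraph break]" dividing the text.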

Last, but not least, we can replace text wherever it occurs:

replace the text (text) in (text) with (text)

This phrase acts on the named text by searching and replacing, as many non-overlapping times as possible. Example:

replace the text "a" in V with "z"

changes every lower-case "a" to "z": the same thing done with the "case insensitively" option would change each "a" or "A" to "z".

All very well for letters, but it can be unfortunate to try

replace the text "Bob" in V with "Robert"

if V happens to contain, say "The Olympic Bobsleigh Team": it would become "The Olympic Robertsleigh Team". What we want, of course, is for Bob to become Robert only when it's a whole word. We can get that with:

replace the word (text) in (text) with (text)

This phrase acts on the named text by searching and replacing, as many non-overlapping times as possible, where the search text must occur as a whole word. Example:

replace the word "Bob" in V with "Robert"

changes "Bob got on the Bobsleigh" to "Robert got on the Bobsleigh".

replace the punctuated word (text) in (text) with (text)

This phrase acts on the named text by searching and replacing, as many non-overlapping times as possible, where the search text must occur as a whole word or run of punctuation.
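The three search-and-replace phrases so far can be compared side by side. This is only a sketch with an invented room and sample texts. The first two results follow from the examples above, while the punctuated case is our reading of the description rather than a documented result.

```inform7
The Track is a room.

When play begins:
	let V be "Bob got on the Bobsleigh";
	let W be V;
	[Plain text search: every occurrence, whatever its casing.]
	replace the text "bob" in V with "Robert", case insensitively;
	say "[V][line break]";
	[Whole words only: "Bobsleigh" should be left alone.]
	replace the word "Bob" in W with "Robert";
	say "[W][line break]";
	[Punctuation runs count as words in their own right here.]
	let P be "Frankly, yes, I agree.";
	replace the punctuated word "," in P with ";";
	say "[P][line break]".
```

The first two lines printed should be "Robert got on the Robertsleigh" and "Robert got on the Bobsleigh". The third should turn each free-standing comma into a semicolon.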

But these are all just special cases of the grand-daddy of all replacement phrases:

replace the regular expression (text) in (text) with (text)

This phrase acts on the named text by matching the regular expression and replacing anything which fits it, as many non-overlapping times as possible. Example:

replace the regular expression "\d+" in V with "..."

changes "The Battle of Waterloo, 1815, rivalled Trafalgar, 1805" to "The Battle of Waterloo, ..., rivalled Trafalgar, ...". The "case insensitively" option causes lower and upper case letters to be treated as if the same letter.

When replacing a regular expression, the replacement text also has a few special meanings (though, thankfully, many fewer than for the expression itself). Once again "\n" and "\t" can be used for line break and tab characters, and "\\" must be used for an actual backslash. But, very usefully, "\1" to "\9" expand as the contents of groups numbered 1 to 9, and "\0" to the exact text matched. So:

replace the regular expression "\d+" in V with "roughly \0"

adds the word "roughly" in front of any run of digits in V, because \0 becomes in turn whichever run of digits matched. And

replace the regular expression "(\w+) (.*)" in V with "\2, \1"

performs the transformation "Frank Booth" to "Booth, Frank".

Finally, prefixing the number by "l" or "u" forces the text it represents into lower or upper case, respectively. For instance:

replace the regular expression "\b(\w)(\w*)" in X with "\u1\l2";

changes the casing of X to "title casing", where each individual word is capitalised. (This is a little slow on large texts, since so many matches and replacements are made: it's more efficient to use the official phrases for changing case.)
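To compare the two approaches, the sketch below title-cases the same text with the regular expression above and with the built-in "in title case" phrase. The room and sample text are invented for illustration.

```inform7
The Study is a room.

When play begins:
	let X be "the battle of waterloo";
	[The built-in case phrase, applied before X is altered.]
	let Y be "[X in title case]";
	[The regular expression route: \u1 upper-cases group 1, \l2 lower-cases group 2.]
	replace the regular expression "\b(\w)(\w*)" in X with "\u1\l2";
	say "[X][line break][Y][line break]".
```

Both lines should read "The Battle Of Waterloo".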

Examples

418.

Blackout ★

Filtering the names of rooms printed while in darkness.

RB 2.1 Varying What Is Written

419.

Fido ★

A dog the player can name and un-name at will.

RB 8.3 Animals

420.

Igpay Atinlay ★

A pig Latin filter for the player's commands.

RB 2.3 Using the Player's Input

421.

Mr. Burns' Repast ★★

Letting the player guess types for an unidentifiable fish.

RB 2.3 Using the Player's Input

422.

Northstar ★★

Making Inform understand ASK JOSH TO TAKE INVENTORY as JOSH, TAKE INVENTORY. This requires us to use a regular expression on the player's command, replacing some of the content.

RB 7.14 Obedient Characters

423.

Cave-troll ★★★

Determining that the command the player typed is invalid, editing it, and re-examining it to see whether it now reads correctly.

RB 6.17 Clarification and Correction

WI §20.9 Summary of regular expression notation

MATCHING

Positional restrictions

^     Matches (accepting no text) only at the start of the text
$     Matches (accepting no text) only at the end of the text
\b    Word boundary: matches at either end of text or between a \w and a \W
\B    Matches anywhere where \b does not match

Backslashed character classes

\char    If char is other than a-z, A-Z, 0-9 or space, matches that literal char
\\       For example, this matches literal backslash "\"
\n       Matches literal line break character
\t       Matches literal tab character (but use this only with external files)

\d       Matches any single digit
\l       Matches any lower case letter (by Unicode 4.0.0 definition)
\p       Matches any single punctuation mark: . , ! ? - / " : ; ( ) [ ] { }
\s       Matches any single spacing character (space, line break, tab)
\u       Matches any upper case letter (by Unicode 4.0.0 definition)
\w       Matches any single word character (neither \p nor \s)

\D       Matches any single non-digit
\L       Matches any non-lower-case-letter
\P       Matches any single non-punctuation-mark
\S       Matches any single non-spacing-character
\U       Matches any non-upper-case-letter
\W       Matches any single non-word-character (i.e., matches either \p or \s)

Other character classes

.         Matches any single character
<...>     Character range: matches any single character inside
<^...>    Negated character range: matches any single character not inside

Inside a character range

e-h     Any character in the run "e" to "h" inclusive (and so on for other runs)
>...    Starting with ">" means that a literal close angle bracket is included
\       Backslash has the same meaning as for backslashed character classes: see above

Structural

|        Divides alternatives: "fish|fowl" matches either
(?i)     Always matches: switches to case-insensitive matching from here on
(?-i)    Always matches: switches to case-sensitive matching from here on

Repetitions

...?        Matches "..." either 0 or 1 times, i.e., makes "..." optional
...*        Matches "..." 0 or more times: e.g. "\s*" matches an optional run of space
...+        Matches "..." 1 or more times: e.g. "x+" matches any run of "x"s
...{6}      Matches "..." exactly 6 times (similarly for other numbers, of course)
...{2,5}    Matches "..." between 2 and 5 times
...{3,}     Matches "..." 3 or more times
....?       "?" after any repetition makes it "lazy", matching as few repeats as it can

Numbered subexpressions

(...)    Groups part of the expression together: matches if the interior matches
\1       Matches the contents of the 1st subexpression reading left to right
\2       Matches the contents of the 2nd, and so on up to "\9" (but no further)

Unnumbered subexpressions

(# ...)               Comment: always matches, and the contents are ignored
(?= ...)              Lookahead: matches if the text ahead matches "...", but doesn't consume it
(?! ...)              Negated lookahead: matches if lookahead fails
(?<= ...)             Lookbehind: matches if the text behind matches "...", but doesn't consume it
(?<! ...)             Negated lookbehind: matches if lookbehind fails
(> ...)               Possessive: tries to match "..." and if it succeeds, never backtracks on this
(?(1)...)             Conditional: if \1 has matched by now, require that "..." be matched
(?(1)...|...)         Conditional: ditto, but if \1 has not matched, require the second part
(?(?=...)...|...)     Conditional with lookahead as its condition for which to match
(?(?<=...)...|...)    Conditional with lookbehind as its condition for which to match

IN REPLACEMENT TEXT

\char    If char is other than a-z, A-Z, 0-9 or space, expands to that literal char
\\       In particular, "\\" expands to a literal backslash "\"
\n       Expands to a line break character
\t       Expands to a tab character (but use this only with external files)
\0       Expands to the full text matched
\1       Expands to whatever the 1st bracketed subexpression matched
\2       Expands to whatever the 2nd matched, and so on up to "\9" (but no further)
\l0      Expands to \0 converted to lower case (and so on for "\l1" to "\l9")
\u0      Expands to \0 converted to upper case (and so on for "\u1" to "\u9")
