I6 Template Layer

Inform 7 6M62ContentsIntroductionFunction IndexRules Index

RegExp.i6t

RegExp contents

Debugging.

Set this to true at your peril.

12Global TEXT_TY_RE_Trace = false; ! Change to true for (a lot of) debugging data in use 13[ TEXT_TY_RE_SetTrace F; TEXT_TY_RE_Trace = F; ];

Algorithm.

Once Inform 7 supported (indexed) text, regular-expression matching became an obvious goal: regexp-based features offer more or less the gold standard in text search and replace facilities, and I7 is so concerned with text that we shouldn't make do with less. But the best and most portable implementation of regular expression matching, PCRE by Philip Hazel, is about a hundred times larger than the code in this section, and also had unacceptable memory needs: there was no practicable way to make it small enough to do useful work on the Z-machine. Nor could an I6 regexp-matcher compile just-in-time code, or translate the expression into a suitable deterministic finite state machine. One day, though, I read one of the papers which Brian Kernighan writes every few years to the effect that regular-expression matching is much easier than you think. Kernighan is right: writing a regexp matcher is indeed easier than you think (one day's worth of cheerful hacking), but debugging one until it passes the trickiest hundred of Perl's 645 test cases is another matter (and it took a whole week more). Still, the result seems to be robust. The main compromise made is that backtracking is not always comprehensive with regexps like ^(a\1?){4}$, because we do not allocate individual storage to backtrack individually through all possibilities of each of the four uses of the bracketed subexpression – which means we miss some cases, since the subexpression contains a reference to itself, so that its content can vary in the four uses. PCRE's approach here is to expand the expression as if it were a sequence of four bracketed expressions, thus removing the awkward quantifier {4}, but that costs memory: indeed this is why PCRE cannot solve all of Perl's test cases without its default memory allocation being raised. In other respects, the algorithm below appears to be accurate if not very fast.

Class Codes.

While in principle we could keep the match expression in textual form, in practice the syntax of regular expressions is complex enough that this would be tricky and rather slow: we would be parsing the same notations over and over again. So we begin by compiling it to a simple tree structure. The tree is made up of nodes, and each node has a "class code": these are identified by the *_RE_CC constants below. Note that the class codes below are all negative: this is so that they are distinct from all valid ZSCII or Unicode characters. (ZSCII is used only on the Z-machine, which has a 16-bit word but an 8-bit character set, so that all character values are positive; similarly, Unicode is (for our purposes) a 16-bit character set on a 32-bit virtual machine.)

58! Character classes 59 60Constant NEWLINE_RE_CC = -1; 61Constant TAB_RE_CC = -2; 62Constant DIGIT_RE_CC = -3; 63Constant NONDIGIT_RE_CC = -4; 64Constant WHITESPACE_RE_CC = -5; 65Constant NONWHITESPACE_RE_CC = -6; 66Constant PUNCTUATION_RE_CC = -7; 67Constant NONPUNCTUATION_RE_CC = -8; 68Constant WORD_RE_CC = -9; 69Constant NONWORD_RE_CC = -10; 70Constant ANYTHING_RE_CC = -11; 71Constant NOTHING_RE_CC = -12; 72Constant RANGE_RE_CC = -13; 73Constant LCASE_RE_CC = -14; 74Constant NONLCASE_RE_CC = -15; 75Constant UCASE_RE_CC = -16; 76Constant NONUCASE_RE_CC = -17; 77 78! Control structures 79 80Constant SUBEXP_RE_CC = -20; 81Constant DISJUNCTION_RE_CC = -21; 82Constant CHOICE_RE_CC = -22; 83Constant QUANTIFIER_RE_CC = -23; 84Constant IF_RE_CC = -24; 85Constant CONDITION_RE_CC = -25; 86Constant THEN_RE_CC = -26; 87Constant ELSE_RE_CC = -27; 88 89! Substring matchers 90 91Constant VARIABLE_RE_CC = -30; 92Constant LITERAL_RE_CC = -31; 93 94! Positional matchers 95 96Constant START_RE_CC = -40; 97Constant END_RE_CC = -41; 98Constant BOUNDARY_RE_CC = -42; 99Constant NONBOUNDARY_RE_CC = -43; 100Constant ALWAYS_RE_CC = -44; 101Constant NEVER_RE_CC = -45; 102 103! Mode switches 104 105Constant SENSITIVITY_RE_CC = -50;

Packets.

The nodes of the compiled expression tree are stored in "packets", which are segments of a fixed array. A regexp complicated enough that it cannot be stored in RE_MAX_PACKETS packets will be rejected with an error: it looks like a rather low limit, but in fact suffices to handle all of Perl's test cases, some of which are works of diabolism.

A packet is then a record containing 14 fields, with offsets defined by the constants defined below. These fields combine the compilation of the corresponding fragment of the regexp with both the tree structure holding these packets together and also the current state of the temporary variables recording how far we have progressed in trying all of the possible ways to match the packet.

122Constant RE_MAX_PACKETS = 32; 123 124Constant RE_PACKET_SIZE = 14; ! Words of memory used per packet 125Constant RE_PACKET_SIZE_IN_BYTES = WORDSIZE*RE_PACKET_SIZE; ! Bytes used per packet 126 127Array RE_PACKET_space --> RE_MAX_PACKETS*RE_PACKET_SIZE; 128 129Constant RE_CCLASS = 0; ! One of the class codes defined above 130Constant RE_PAR1 = 1; ! Three parameters whose meaning depends on class code 131Constant RE_PAR2 = 2; 132Constant RE_PAR3 = 3; 133Constant RE_NEXT = 4; ! Younger sibling in the compiled tree 134Constant RE_PREVIOUS = 5; ! Elder sibling 135Constant RE_DOWN = 6; ! Child 136Constant RE_UP = 7; ! Parent 137Constant RE_DATA1 = 8; ! Backtracking data 138Constant RE_DATA2 = 9; 139Constant RE_CONSTRAINT = 10; 140Constant RE_CACHE1 = 11; 141Constant RE_CACHE2 = 12; 142Constant RE_MODES = 13;

Nodes.

The routine to create a node, something which happens only during the compilation phase, and also the routine which returns the address of a given node. Nodes are numbered from 0 up to M-1, where M is the constant RE_MAX_PACKETS.

151[ TEXT_TY_RE_Node n cc par1 par2 par3 offset; 152    if ((n<0) || (n >= RE_MAX_PACKETS)) rfalse; 153    offset = RE_PACKET_space + n*RE_PACKET_SIZE_IN_BYTES; 154    offset-->RE_CCLASS = cc; 155    offset-->RE_PAR1 = par1; 156    offset-->RE_PAR2 = par2; 157    offset-->RE_PAR3 = par3; 158    offset-->RE_NEXT = NULL; 159    offset-->RE_PREVIOUS = NULL; 160    offset-->RE_DOWN = NULL; 161    offset-->RE_UP = NULL; 162    offset-->RE_DATA1 = -1; ! Match start 163    offset-->RE_DATA2 = -1; ! Match end 164    offset-->RE_CONSTRAINT = -1; ! Rewind edge 165    return offset; 166]; 167 168[ TEXT_TY_RE_NodeAddress n; 169    if ((n<0) || (n >= RE_MAX_PACKETS)) return -1; 170    return RE_PACKET_space + n*RE_PACKET_SIZE_IN_BYTES; 171];

Match Variables.

A bracketed subexpression can be used as a variable: we support \1, ..., \9 to mean "the value of subexpression 1 to 9", and \0 to mean "the whole text matched", as if the entire regexp were bracketed. (PCRE and Perl also allow \10, \11, ..., but we don't, because it complicates parsing and memory is too short.)

RE_Subexpressions-->10 stores the number of subexpressions in use, not counting \0. During the compiling stage, RE_Subexpressions-->N is set to point to the node representating \N, where N varies from 1 to 9. When matching is complete, and assuming we care about the contents of these variables – which we might not, and if not we certainly don't want to waste time and memory – we call TEXT_TY_RE_CreateMatchVars to allocate text variables and fill them in as appropriate, memory permitting.

TEXT_TY_RE_EmptyMatchVars empties any such variables which may survive from a previous match, setting them to the empty text.

192Array RE_Subexpressions --> 11; ! Address of node for this subexpression 193Array Allocated_Match_Vars --> 10; ! Indexed text to hold values of the variables 194 195[ TEXT_TY_RE_DebugMatchVars txt 196    offset n i; 197    print RE_Subexpressions-->10, " collecting subexps^"; 198    for (n=0:(n<RE_Subexpressions-->10) && (n<10): n++) { 199        offset = RE_Subexpressions-->n; 200        print "Subexp ", offset-->RE_PAR1, 201            " = [", offset-->RE_DATA1, ",", offset-->RE_DATA2, "] = "; 202        for (i=offset-->RE_DATA1:i<offset-->RE_DATA2:i++) 203            print (char) BlkValueRead(txt, i); 204        print "^"; 205    } 206]; 207 208[ TEXT_TY_RE_CreateMatchVars txt 209    offset n i ch ctxt cl csize; 210    for (n=0:(n<RE_Subexpressions-->10) && (n<10): n++) { 211        offset = RE_Subexpressions-->n; 212        if (Allocated_Match_Vars-->n) BlkValueFree(Allocated_Match_Vars-->n); 213        Allocated_Match_Vars-->n = BlkValueCreate(TEXT_TY); 214        TEXT_TY_Transmute(Allocated_Match_Vars-->n); 215        ctxt = Allocated_Match_Vars-->n; 216        csize = BlkValueLBCapacity(ctxt); 217        cl = 0; 218        for (i=offset-->RE_DATA1:i<offset-->RE_DATA2:i++) { 219            ch = BlkValueRead(txt, i); 220            if (cl+1 >= csize) { 221                if (BlkValueSetLBCapacity(ctxt, 2*cl) == false) break; 222                csize = BlkValueLBCapacity(ctxt); 223            } 224            BlkValueWrite(ctxt, cl++, ch); 225        } 226        BlkValueWrite(ctxt, cl, 0); 227    } 228]; 229 230[ TEXT_TY_RE_EmptyMatchVars txt 231    n; 232    for (n=0:(n<RE_Subexpressions-->10) && (n<10): n++) 233        if (Allocated_Match_Vars-->n ~= 0) 234            BlkValueWrite(Allocated_Match_Vars-->n, 0, 0); 235]; 236 237[ TEXT_TY_RE_GetMatchVar vn 238    offset; 239    if ((vn<0) || (vn>=10) || (vn >= RE_Subexpressions-->10)) return EMPTY_TEXT_VALUE; 240    offset = RE_Subexpressions-->vn; 241    if (offset == 0) return EMPTY_TEXT_VALUE; 242    if (offset-->RE_DATA1 < 0) return EMPTY_TEXT_VALUE; 243    if (Allocated_Match_Vars-->vn == 0) { 244        print "*** ", vn, " unallocated ***^"; 245        return EMPTY_TEXT_VALUE; 246    } 247    return Allocated_Match_Vars-->vn; 248];

Markers.

At each node, the -->RE_DATA1 and -->RE_DATA2 fields represent the character positions of the start and end of the text matched by the node and its subtree (if any). These are called markers.

Thus TEXT_TY_MV_End(N, 0) returns the start of \N and TEXT_TY_MV_End(N, 1) the end of \N, according to the current match of subexpression N.

259[ TEXT_TY_MV_End n end 260    offset; 261    offset = RE_Subexpressions-->n; 262    if (end==0) return offset-->RE_DATA1; 263    return offset-->RE_DATA2; 264]; 265 266[ TEXT_TY_RE_Clear_Markers token; 267    for (: token ~= NULL: token = token-->RE_NEXT) { 268        if (token-->RE_DOWN ~= NULL) TEXT_TY_RE_Clear_Markers(token-->RE_DOWN); 269        token-->RE_DATA1 = -1; 270        token-->RE_DATA2 = -1; 271        token-->RE_CONSTRAINT = -1; 272    } 273];

Debugging.

Code in this paragraph simply prints a convenient screen representation of the compiled regexp, together with the current values of its markers. It is invaluable for debugging purposes and, touch wood, may not be needed again, but it is relatively compact and we keep it just in case.

282[ TEXT_TY_RE_DebugTree ftxt detail; 283    print "Pattern: ", (TEXT_TY_Say) ftxt, "^"; 284    TEXT_TY_RE_DebugSubtree(ftxt, 1, RE_PACKET_space, detail); 285]; 286 287[ TEXT_TY_RE_DebugSubtree ftxt depth offset detail 288    cup; 289    if (offset ~= NULL) { 290        cup = offset-->RE_UP; 291        if (offset-->RE_PREVIOUS ~= NULL) print "*** broken initial previous ***^"; 292    } 293    while (offset ~= NULL) { 294        if (offset-->RE_UP ~= cup) print "*** broken up matching ***^"; 295        spaces(depth*2); 296        TEXT_TY_RE_DebugNode(offset, ftxt, detail); 297        if (offset-->RE_DOWN ~= NULL) { 298            if ((offset-->RE_DOWN)-->RE_UP ~= offset) 299                print "*** broken down/up ***^"; 300            TEXT_TY_RE_DebugSubtree(ftxt, depth+1, offset-->RE_DOWN, detail); 301        } 302        if (offset-->RE_NEXT ~= NULL) { 303            if ((offset-->RE_NEXT)-->RE_PREVIOUS ~= offset) 304                print "*** broken next/previous ***^"; 305        } 306        offset = offset-->RE_NEXT; 307    } 308]; 309 310[ TEXT_TY_RE_DebugNode offset ftxt detail 311    i par1 par2 par3; 312    if (offset == NULL) "[NULL]"; 313    print "[", (offset-RE_PACKET_space)/(RE_PACKET_SIZE_IN_BYTES), "] "; 314    ! for (i=0:i<RE_PACKET_SIZE:i++) print offset-->i, " "; 315    par1 = offset-->RE_PAR1; 316    par2 = offset-->RE_PAR2; 317    par3 = offset-->RE_PAR3; 318    switch (offset-->RE_CCLASS) { 319        DIGIT_RE_CC: print "DIGIT"; 320        NONDIGIT_RE_CC: print "NONDIGIT"; 321        UCASE_RE_CC: print "UCASE"; 322        NONUCASE_RE_CC: print "NONUCASE"; 323        LCASE_RE_CC: print "LCASE"; 324        NONLCASE_RE_CC: print "NONLCASE"; 325        WHITESPACE_RE_CC: print "WHITESPACE"; 326        NONWHITESPACE_RE_CC: print "NONWHITESPACE"; 327        PUNCTUATION_RE_CC: print "PUNCTUATION"; 328        NONPUNCTUATION_RE_CC: print "NONPUNCTUATION"; 329        WORD_RE_CC: print "WORD"; 330        NONWORD_RE_CC: print "NONWORD"; 331        ALWAYS_RE_CC: print "ALWAYS"; 332        NEVER_RE_CC: print "NEVER"; 333        START_RE_CC: print "START"; 334        END_RE_CC: print "END"; 335        BOUNDARY_RE_CC: print "BOUNDARY"; 336        NONBOUNDARY_RE_CC: print "NONBOUNDARY"; 337        ANYTHING_RE_CC: print "ANYTHING"; 338        NOTHING_RE_CC: print "NOTHING"; 339        RANGE_RE_CC: print "RANGE"; if (par3 == true) print " (negated)"; 340            print " "; 341            for (i=par1:i<par2:i++) print (char) BlkValueRead(ftxt, i); 342        VARIABLE_RE_CC: print "VARIABLE ", par1; 343        SUBEXP_RE_CC: 344            if (par1 == 0) print "EXP"; 345            else print "SUBEXP "; 346            if (par1 >= 0) print "= V", par1; 347            if (par2 == 1) { 348                if (par3 == 0) print " (?=...) lookahead"; 349                else print " (?<=...) lookbehind of width ", par3; 350            } 351            if (par2 == 2) { 352                if (par3 == 0) print " (?...) negated lookahead"; 353                else print " (?<...) negated lookbehind of width ", par3; 354            } 355            if (par2 == 3) print " uncollecting"; 356            if (par2 == 0 or 3) { 357                if (par3 == 1) print " forcing case sensitivity"; 358                if (par3 == 2) print " forcing case insensitivity"; 359            } 360            if (par2 == 4) print " (?>...) possessive"; 361        NEWLINE_RE_CC: print "NEWLINE"; 362        TAB_RE_CC: print "TAB"; 363        QUANTIFIER_RE_CC: print "QUANTIFIER min=", par1, " max=", par2; 364            if (par3) print " (lazy)"; else print " (greedy)"; 365        LITERAL_RE_CC: print "LITERAL"; 366            print " "; 367            for (i=par1:i<par2:i++) print (char) BlkValueRead(ftxt, i); 368        DISJUNCTION_RE_CC: print "DISJUNCTION of ", par1, " choices"; 369        CHOICE_RE_CC: print "CHOICE no ", par1; 370        SENSITIVITY_RE_CC: print "SENSITIVITY"; 371            if (par1) print " off"; else print " on"; 372        IF_RE_CC: print "IF"; if (par1 >= 1) print " = V", par1; 373        CONDITION_RE_CC: print "CONDITION"; if (par1 >= 1) print " = V", par1; 374        THEN_RE_CC: print "THEN"; 375        ELSE_RE_CC: print "ELSE"; 376    } 377    if (detail) 378        print ": ", offset-->RE_DATA1, ", ", offset-->RE_DATA2, ", ", offset-->RE_CONSTRAINT; 379    print "^"; 380];

When we search for a literal substring, say looking for "per" in "Supernumerary", we will in fact use the same apparatus as when searching for a regular expression: we compile a very simple node tree in which \0 as the root contains just one child node, a LITERAL_RE_CC matching exactly the text "per". We return 2 since that's the number of nodes in the tree.

390[ TEXT_TY_CHR_CompileTree ftxt exactly 391    root literal fto no_packets token attach_to; 392 393    fto = TEXT_TY_CharacterLength(ftxt); 394 395    root = TEXT_TY_RE_Node(0, SUBEXP_RE_CC, 0, 0, 0); 396    literal = TEXT_TY_RE_Node(1, LITERAL_RE_CC, 0, fto, 0); 397     398    root-->RE_DOWN = literal; 399    literal-->RE_UP = root; 400 401    if (exactly) { 402        no_packets = 2; 403        if (no_packets+3 > RE_MAX_PACKETS) return "regexp too complex"; 404        exactly = RE_PACKET_space-->RE_DOWN; 405        token = TEXT_TY_RE_Node(no_packets++, START_RE_CC, 0, 0, 0); 406        RE_PACKET_space-->RE_DOWN = token; token-->RE_UP = RE_PACKET_space; 407        attach_to = TEXT_TY_RE_Node(no_packets++, SUBEXP_RE_CC, -1, 3, 0); 408        token-->RE_NEXT = attach_to; attach_to-->RE_PREVIOUS = token; 409        attach_to-->RE_UP = RE_PACKET_space; 410        attach_to-->RE_NEXT = TEXT_TY_RE_Node(no_packets++, END_RE_CC, 0, 0, 0); 411        (attach_to-->RE_NEXT)-->RE_PREVIOUS = attach_to; 412        (attach_to-->RE_NEXT)-->RE_UP = RE_PACKET_space; 413        attach_to-->RE_DOWN = exactly; 414        while (exactly ~= NULL) { 415            exactly-->RE_UP = attach_to; exactly = exactly-->RE_NEXT; 416        } 417    } 418     419    no_packets = TEXT_TY_RE_ExpandChoices(RE_PACKET_space, no_packets); 420];

But in general we need to compile a regular expression string into a tree of the kind described above, and here is the routine which does that, returning the number of nodes used to build the tree. The syntax it accepts is very fully documented in Writing with Inform, so no details are given here.

430Array Subexp_Posns --> 20; 431[ TEXT_TY_RE_CompileTree ftxt exactly 432    no_packets ffrom fto cc par1 par2 par3 433    quantifiable token attach_to no_subs blevel bits; 434 435    fto = TEXT_TY_CharacterLength(ftxt); 436    if (fto == 0) { 437        TEXT_TY_RE_Node(no_packets++, NEVER_RE_CC, 0, 0, 0); ! Empty regexp never matches 438        return 1; 439    } 440 441    attach_to = TEXT_TY_RE_Node(no_packets++, SUBEXP_RE_CC, 0, 0, 0); 442    RE_Subexpressions-->0 = attach_to; RE_Subexpressions-->10 = 1; no_subs = 1; 443 444    quantifiable = false; blevel = 0; 445     446    for (ffrom = 0: ffrom < fto: ) { 447        cc = BlkValueRead(ftxt, ffrom++); par1 = 0; par2 = 0; par3 = 0; 448        if (cc == '\') { 449            if (ffrom == fto) return "Search pattern not terminated"; 450            cc = BlkValueRead(ftxt, ffrom++); 451            switch (cc) { 452                'b': cc = BOUNDARY_RE_CC; 453                'B': cc = NONBOUNDARY_RE_CC; 454                'd': cc = DIGIT_RE_CC; 455                'D': cc = NONDIGIT_RE_CC; 456                'l': cc = LCASE_RE_CC; 457                'L': cc = NONLCASE_RE_CC; 458                'n': cc = NEWLINE_RE_CC; 459                'p': cc = PUNCTUATION_RE_CC; 460                'P': cc = NONPUNCTUATION_RE_CC; 461                's': cc = WHITESPACE_RE_CC; 462                'S': cc = NONWHITESPACE_RE_CC; 463                't': cc = TAB_RE_CC; 464                'u': cc = UCASE_RE_CC; 465                'U': cc = NONUCASE_RE_CC; 466                'w': cc = WORD_RE_CC; 467                'W': cc = NONWORD_RE_CC; 468                default: 469                    if ((cc >= '1') && (cc <= '9')) { 470                        par1 = cc-'0'; 471                        cc = VARIABLE_RE_CC; 472                    } else { 473                        if (((cc >= 'a') && (cc <= 'z')) || 474                            ((cc >= 'A') && (cc <= 'Z'))) return "unknown escape"; 475                        cc = LITERAL_RE_CC; 476                        par1 = ffrom-1; par2 = ffrom; 477                    } 478            } 479            quantifiable = true; 480        } else { 481            switch (cc) { 482                '(': par2 = 0; 483                    !if (BlkValueRead(ftxt, ffrom) == ')') return "empty subexpression"; 484                    if (BlkValueRead(ftxt, ffrom) == '?') { 485                        ffrom++; 486                        bits = true; 487                        if (BlkValueRead(ftxt, ffrom) == '-') { ffrom++; bits = false; } 488                        else if (BlkValueRead(ftxt, ffrom) == '<') { ffrom++; bits = false; } 489                        switch (cc = BlkValueRead(ftxt, ffrom++)) { 490                            '#': while (BlkValueRead(ftxt, ffrom++) ~= 0 or ')') ; 491                                if (BlkValueRead(ftxt, ffrom-1) == 0) 492                                    return "comment never ends"; 493                                continue; 494                            '(': cc = BlkValueRead(ftxt, ffrom); 495                                if ((cc == '1' or '2' or '3' or '4' or 496                                    '5' or '6' or '7' or '8' or '9') && 497                                    (BlkValueRead(ftxt, ffrom+1) ==')')) { 498                                    ffrom = ffrom + 2; 499                                    par1 = cc - '0'; 500                                } else ffrom--; 501                                cc = IF_RE_CC; ! (?(...)...) conditional 502                                quantifiable = false; 503                                if (blevel == 20) return "subexpressions too deep"; 504                                Subexp_Posns-->(blevel++) = TEXT_TY_RE_NodeAddress(no_packets); 505                                jump CClassKnown; 506                            '=': par2 = 1; ! (?=...) lookahead/behind 507                                par3 = 0; if (bits == false) par3 = -1; 508                            '': par2 = 2; ! (?!...) negated lookahead/behind 509                                par3 = 0; if (bits == false) par3 = -1; 510                            ':': par2 = 3; ! (?:...) uncollecting subexpression 511                            '>': par2 = 4; ! (?>...) possessive 512                            default: 513                                if (BlkValueRead(ftxt, ffrom) == ')') { 514                                    if (cc == 'i') { 515                                        cc = SENSITIVITY_RE_CC; par1 = bits; ffrom++; 516                                        jump CClassKnown; 517                                    } 518                                } 519                                if (BlkValueRead(ftxt, ffrom) == ':') { 520                                    if (cc == 'i') { 521                                        par1 = bits; par2 = 3; par3 = bits+1; ffrom++; 522                                        jump AllowForm; 523                                    } 524                                } 525                                return "unknown (?...) form"; 526                        } 527                    } 528                    .AllowForm; 529                    if (par2 == 0) par1 = no_subs++; else par1 = -1; 530                    cc = SUBEXP_RE_CC; 531                    quantifiable = false; 532                    if (blevel == 20) return "subexpressions too deep"; 533                    Subexp_Posns-->(blevel++) = TEXT_TY_RE_NodeAddress(no_packets); 534                ')': if (blevel == 0) return "subexpression bracket mismatch"; 535                    blevel--; 536                    attach_to = Subexp_Posns-->blevel; 537                    if (attach_to-->RE_DOWN == NULL) { 538                        if (no_packets >= RE_MAX_PACKETS) return "regexp too complex"; 539                        attach_to-->RE_DOWN = 540                            TEXT_TY_RE_Node(no_packets++, ALWAYS_RE_CC, 0, 0, 0); 541                        (attach_to-->RE_DOWN)-->RE_UP = attach_to; 542                    } 543                    quantifiable = true; 544                    continue; 545                '.': cc = ANYTHING_RE_CC; quantifiable = true; 546                '|': cc = CHOICE_RE_CC; quantifiable = false; 547                '^': cc = START_RE_CC; quantifiable = false; 548                '$': cc = END_RE_CC; quantifiable = false; 549                '{': if (quantifiable == false) return "quantifier misplaced"; 550                    par1 = 0; par2 = -1; bits = 1; 551                    while ((cc=BlkValueRead(ftxt, ffrom++)) ~= 0 or '}') { 552                        if (cc == ',') { 553                            bits++; 554                            if (bits >= 3) return "too many colons in ?{...}"; 555                            continue; 556                        } 557                        if ((cc >= '0') || (cc <= '9')) { 558                            if (bits == 1) { 559                                if (par1 < 0) par1 = 0; 560                                par1 = par1*10 + (cc-'0'); 561                            } else { 562                                if (par2 < 0) par2 = 0; 563                                par2 = par2*10 + (cc-'0'); 564                            } 565                        } else return "non-digit in ?{...}"; 566                    } 567                    if (cc ~= '}') return "{x,y} quantifier never ends"; 568                    cc = QUANTIFIER_RE_CC; 569                    if (par2 == -1) { 570                        if (bits == 2) par2 = 30000; 571                        else par2 = par1; 572                    } 573                    if (par1 > par2) return "{x,y} with x greater than y"; 574                    if (BlkValueRead(ftxt, ffrom) == '?') { ffrom++; par3 = true; } 575                    quantifiable = false; 576                '<', '[': par3 = false; if (cc == '<') bits = '>'; else bits = ']'; 577                    if (BlkValueRead(ftxt, ffrom) == '^') { ffrom++; par3 = true; } 578                    par1 = ffrom; 579                    if (BlkValueRead(ftxt, ffrom) == bits) { ffrom++; } 580                    while (cc ~= bits or 0) { 581                        cc = BlkValueRead(ftxt, ffrom++); 582                        if (cc == '\') { 583                            cc = BlkValueRead(ftxt, ffrom++); 584                            if (cc ~= 0) cc = BlkValueRead(ftxt, ffrom++); 585                        } 586                    } 587                    if (cc == 0) return "Character range never ends"; 588                    par2 = ffrom-1; 589                    if ((par2 > par1 + 1) && 590                        (BlkValueRead(ftxt, par1) == ':') && 591                        (BlkValueRead(ftxt, par2-1) == ':') && 592                        (BlkValueRead(ftxt, par2-2) ~= '\')) 593                        return "POSIX named character classes unsupported"; 594                    bits = TEXT_TY_RE_RangeSyntaxCorrect(ftxt, par1, par2); 595                    if (bits) return bits; 596                    if (par1 < par2) cc = RANGE_RE_CC; 597                    else cc = NOTHING_RE_CC; 598                    quantifiable = true; 599                '*': if (quantifiable == false) return "quantifier misplaced"; 600                    cc = QUANTIFIER_RE_CC; 601                    par1 = 0; par2 = 30000; 602                    if (BlkValueRead(ftxt, ffrom) == '?') { ffrom++; par3 = true; } 603                    quantifiable = false; 604                '+': if (quantifiable == false) return "quantifier misplaced"; 605                    cc = QUANTIFIER_RE_CC; 606                    par1 = 1; par2 = 30000; 607                    if (BlkValueRead(ftxt, ffrom) == '?') { ffrom++; par3 = true; } 608                    quantifiable = false; 609                '?': if (quantifiable == false) return "quantifier misplaced"; 610                    cc = QUANTIFIER_RE_CC; 611                    par1 = 0; par2 = 1; 612                    if (BlkValueRead(ftxt, ffrom) == '?') { ffrom++; par3 = true; } 613                    quantifiable = false; 614            } 615        } 616         617        .CClassKnown; 618         619        if (cc >= 0) { 620            quantifiable = true; 621            if ((attach_to-->RE_CCLASS == LITERAL_RE_CC) && 622                (BlkValueRead(ftxt, ffrom) ~= '*' or '+' or '?' or '{')) { 623                (attach_to-->RE_PAR2)++; 624                if (TEXT_TY_RE_Trace == 2) { 625                    print "Extending literal by ", cc, "=", (char) cc, "^"; 626                } 627                continue; 628            } 629            cc = LITERAL_RE_CC; par1 = ffrom-1; par2 = ffrom; 630        } 631         632        if (no_packets >= RE_MAX_PACKETS) return "regexp too complex"; 633 634        if (TEXT_TY_RE_Trace == 2) { 635            print "Attaching packet ", no_packets+1, " to "; 636            TEXT_TY_RE_DebugNode(attach_to, ftxt); 637            TEXT_TY_RE_DebugTree(ftxt); 638        } 639 640        token = TEXT_TY_RE_Node(no_packets++, cc, par1, par2, par3); 641 642        if ((token-->RE_CCLASS == SUBEXP_RE_CC) && (token-->RE_PAR2 == 0)) { 643            RE_Subexpressions-->(token-->RE_PAR1) = token; 644            (RE_Subexpressions-->10)++; 645        } 646         647        if ((attach_to-->RE_CCLASS == SUBEXP_RE_CC or CHOICE_RE_CC or IF_RE_CC) && 648            (attach_to-->RE_DOWN == NULL)) { 649            attach_to-->RE_DOWN = token; token-->RE_UP = attach_to; 650        } else { 651            if ((token-->RE_CCLASS == CHOICE_RE_CC) && 652                ((attach_to-->RE_UP)-->RE_CCLASS == CHOICE_RE_CC)) { 653                no_packets--; token = attach_to-->RE_UP; 654            } else { 655                if (token-->RE_CCLASS == CHOICE_RE_CC) { 656                    while (attach_to-->RE_PREVIOUS ~= NULL) 657                        attach_to = attach_to-->RE_PREVIOUS; 658                } 659                if (token-->RE_CCLASS == QUANTIFIER_RE_CC or CHOICE_RE_CC) { 660                    token-->RE_PREVIOUS = attach_to-->RE_PREVIOUS; 661                    token-->RE_UP = attach_to-->RE_UP; 662                    if ((attach_to-->RE_UP ~= NULL) && (attach_to-->RE_PREVIOUS == NULL)) 663                        (attach_to-->RE_UP)-->RE_DOWN = token; 664                    token-->RE_DOWN = attach_to; 665                    bits = attach_to; 666                    while (bits ~= NULL) { 667                        bits-->RE_UP = token; 668                        bits = bits-->RE_NEXT; 669                    } 670                    attach_to-->RE_PREVIOUS = NULL; 671                    if (token-->RE_PREVIOUS ~= NULL) 672                        (token-->RE_PREVIOUS)-->RE_NEXT = token; 673                } else { 674                    attach_to-->RE_NEXT = token; token-->RE_PREVIOUS = attach_to; 675                    token-->RE_UP = attach_to-->RE_UP; 676                } 677            } 678        } 679         680        if (token-->RE_CCLASS == CHOICE_RE_CC) { 681            if (no_packets >= RE_MAX_PACKETS) return "regexp too complex"; 682            token-->RE_NEXT = TEXT_TY_RE_Node(no_packets++, CHOICE_RE_CC, 0, 0, 0); 683            (token-->RE_NEXT)-->RE_PREVIOUS = token; 684            (token-->RE_NEXT)-->RE_UP = token-->RE_UP; 685            token = token-->RE_NEXT; 686        } 687 688        attach_to = token; 689 690        if (TEXT_TY_RE_Trace == 2) { 691            print "Result:^"; 692            TEXT_TY_RE_DebugTree(ftxt); 693        } 694 695    } 696     697    if (blevel ~= 0) return "subexpression bracket mismatch"; 698 699    if (exactly) { 700        if (no_packets+3 > RE_MAX_PACKETS) return "regexp too complex"; 701        exactly = RE_PACKET_space-->RE_DOWN; 702        token = TEXT_TY_RE_Node(no_packets++, START_RE_CC, 0, 0, 0); 703        RE_PACKET_space-->RE_DOWN = token; token-->RE_UP = RE_PACKET_space; 704        attach_to = TEXT_TY_RE_Node(no_packets++, SUBEXP_RE_CC, -1, 3, 0); 705        token-->RE_NEXT = attach_to; attach_to-->RE_PREVIOUS = token; 706        attach_to-->RE_UP = RE_PACKET_space; 707        attach_to-->RE_NEXT = TEXT_TY_RE_Node(no_packets++, END_RE_CC, 0, 0, 0); 708        (attach_to-->RE_NEXT)-->RE_PREVIOUS = attach_to; 709        (attach_to-->RE_NEXT)-->RE_UP = RE_PACKET_space; 710        attach_to-->RE_DOWN = exactly; 711        while (exactly ~= NULL) { 712            exactly-->RE_UP = attach_to; exactly = exactly-->RE_NEXT; 713        } 714    } 715     716    no_packets = TEXT_TY_RE_ExpandChoices(RE_PACKET_space, no_packets); 717 718    if (TEXT_TY_RE_Trace) { 719        print "Compiled pattern:^"; 720        TEXT_TY_RE_DebugTree(ftxt); 721    } 722     723    bits = TEXT_TY_RE_CheckTree(RE_PACKET_space, no_subs); if (bits) return bits; 724     725    return no_packets; 726]; 727 728[ TEXT_TY_RE_RangeSyntaxCorrect ftxt rf rt 729    i chm; 730    for (i=rf: i<rt: i++) { 731        chm = BlkValueRead(ftxt, i); 732        if ((chm == '\') && (i+1<rt)) { 733            chm = BlkValueRead(ftxt, ++i); 734            if (((chm >= 'a') && (chm <= 'z')) || 735                ((chm >= 'A') && (chm <= 'Z'))) { 736                if (chm ~= 's' or 'S' or 'p' or 'P' or 'w' or 'W' or 'd' 737                    or 'D' or 'n' or 't' or 'l' or 'L' or 'u' or 'U') 738                    return "Invalid escape in {} range"; 739            } 740        } 741        if ((i+2<rt) && (BlkValueRead(ftxt, i+1) == '-')) { 742            if (chm > BlkValueRead(ftxt, i+2)) return "Invalid {} range"; 743            i=i+2; 744        } 745    } 746    rfalse; 747]; 748 749[ TEXT_TY_RE_ExpandChoices token no_packets 750    rv prev nex holder new ct n cond_node then_node else_node; 751    while (token ~= NULL) { 752        if (token-->RE_CCLASS == IF_RE_CC) { 753            if ((token-->RE_DOWN)-->RE_CCLASS == CHOICE_RE_CC) { 754                for (nex=token-->RE_DOWN, n=0: nex~=NULL: nex=nex-->RE_NEXT) n++; 755                if (n~=2) return "conditional has too many clauses"; 756                if (no_packets >= RE_MAX_PACKETS) return "regexp too complex"; 757                cond_node = TEXT_TY_RE_Node(no_packets++, CONDITION_RE_CC, 0, 0, 0); 758                if (token-->RE_PAR1 >= 1) { 759                    cond_node-->RE_PAR1 = token-->RE_PAR1; 760                } 761                then_node = token-->RE_DOWN; 762                then_node-->RE_CCLASS = THEN_RE_CC; 763                else_node = then_node-->RE_NEXT; 764                else_node-->RE_CCLASS = ELSE_RE_CC; 765                if (cond_node-->RE_PAR1 < 1) { 766                    cond_node-->RE_DOWN = then_node-->RE_DOWN; 767                    then_node-->RE_DOWN = (then_node-->RE_DOWN)-->RE_NEXT; 768                    if (then_node-->RE_DOWN ~= NULL) 769                        (then_node-->RE_DOWN)-->RE_PREVIOUS = NULL; 770                    (cond_node-->RE_DOWN)-->RE_NEXT = NULL; 771                    (cond_node-->RE_DOWN)-->RE_UP = cond_node; 772                } 773                token-->RE_DOWN = cond_node; cond_node-->RE_UP = token; 774                cond_node-->RE_NEXT = then_node; then_node-->RE_PREVIOUS = cond_node; 775            } else { 776                if (no_packets >= RE_MAX_PACKETS) return "regexp too complex"; 777                cond_node = TEXT_TY_RE_Node(no_packets++, CONDITION_RE_CC, 0, 0, 0); 778                if (no_packets >= RE_MAX_PACKETS) return "regexp too complex"; 779                then_node = TEXT_TY_RE_Node(no_packets++, THEN_RE_CC, 0, 0, 0); 780                if (token-->RE_PAR1 >= 1) { 781                    cond_node-->RE_PAR1 = token-->RE_PAR1; 782                    then_node-->RE_DOWN = token-->RE_DOWN; 783                } else { 784                    cond_node-->RE_DOWN = token-->RE_DOWN; 785                    then_node-->RE_DOWN = (token-->RE_DOWN)-->RE_NEXT; 786                    (cond_node-->RE_DOWN)-->RE_NEXT = NULL; 787                    (cond_node-->RE_DOWN)-->RE_UP = cond_node; 788                } 789                token-->RE_DOWN = cond_node; 790                cond_node-->RE_UP = token; cond_node-->RE_NEXT = then_node; 791                then_node-->RE_PREVIOUS = cond_node; then_node-->RE_UP = token; 792                then_node-->RE_NEXT = NULL; 793                if (then_node-->RE_DOWN ~= NULL) 794                    (then_node-->RE_DOWN)-->RE_PREVIOUS = NULL; 795                for (nex = then_node-->RE_DOWN: nex ~= NULL: nex = nex-->RE_NEXT) { 796                    nex-->RE_UP = then_node; 797                } 798            } 799             800            if (cond_node-->RE_DOWN ~= NULL) { 801                nex = cond_node-->RE_DOWN; 802                if ((nex-->RE_CCLASS ~= SUBEXP_RE_CC) || 803                    (nex-->RE_NEXT ~= NULL) || 804                    (nex-->RE_PAR2 ~= 1 or 2)) { 805                    !TEXT_TY_RE_DebugSubtree(0, 0, nex, true); 806                    return "condition not lookahead/behind"; 807                } 808            } 809        } 810        if ((token-->RE_CCLASS == CHOICE_RE_CC) && (token-->RE_PAR1 < 1)) { 811            prev = token-->RE_PREVIOUS; 812            nex = token-->RE_NEXT; 813            while ((nex ~= NULL) && (nex-->RE_CCLASS == CHOICE_RE_CC)) 814                nex = nex-->RE_NEXT; 815            holder = token-->RE_UP; if (holder == NULL) return "bang"; 816            if (no_packets >= RE_MAX_PACKETS) return "regexp too complex"; 817            new = TEXT_TY_RE_Node(no_packets++, DISJUNCTION_RE_CC, 0, 0, 0); 818            holder-->RE_DOWN = new; new-->RE_UP = holder; 819            if (prev ~= NULL) { 820                prev-->RE_NEXT = new; new-->RE_PREVIOUS = prev; 821            } 822            if (nex ~= NULL) { 823                nex-->RE_PREVIOUS = new; new-->RE_NEXT = nex; 824            } 825            new-->RE_DOWN = token; 826            token-->RE_PREVIOUS = NULL; 827            ct = 1; 828            while (token ~= NULL) { 829                token-->RE_PAR1 = ct++; 830                token-->RE_UP = new; 831                if ((token-->RE_NEXT ~= NULL) && 832                    ((token-->RE_NEXT)-->RE_CCLASS ~= CHOICE_RE_CC)) 833                    token-->RE_NEXT = NULL; 834                token = token-->RE_NEXT; 835            } 836            new-->RE_PAR1 = ct-1; 837            if (token ~= NULL) token-->RE_NEXT = NULL; 838            token = new; continue; 839        } 840        if (token-->RE_DOWN ~= NULL) { 841            no_packets = TEXT_TY_RE_ExpandChoices(token-->RE_DOWN, no_packets); 842            if ((no_packets<0) || (no_packets >= RE_MAX_PACKETS)) break; 843        } 844        token = token-->RE_NEXT; 845    } 846    return no_packets; 847]; 848 849[ TEXT_TY_RE_CheckTree token no_subs 850    rv; 851    while (token ~= NULL) { 852        if (token-->RE_CCLASS == VARIABLE_RE_CC) { 853            if (token-->RE_PAR1 >= no_subs) return "reference to nonexistent group"; 854        } 855        if ((token-->RE_CCLASS == SUBEXP_RE_CC) && 856            (token-->RE_PAR2 == 1 or 2) && 857            (token-->RE_PAR3 == -1)) { 858            token-->RE_PAR3 = TEXT_TY_RE_Width(token-->RE_DOWN); 859            if (token-->RE_PAR3 == -1) return "variable length lookbehind not implemented"; 860        } 861        if (token-->RE_DOWN ~= NULL) { 862            rv = TEXT_TY_RE_CheckTree(token-->RE_DOWN, no_subs); 863            if (rv) return rv; 864        } 865        token = token-->RE_NEXT; 866    } 867    rfalse; 868]; 869 870[ TEXT_TY_RE_Width token downwards 871    w rv aw choice; 872    while (token ~= NULL) { 873        switch (token-->RE_CCLASS) { 874            DIGIT_RE_CC, NONDIGIT_RE_CC, WHITESPACE_RE_CC, NONWHITESPACE_RE_CC, 875            PUNCTUATION_RE_CC, NONPUNCTUATION_RE_CC, WORD_RE_CC, NONWORD_RE_CC, 876            ANYTHING_RE_CC, NOTHING_RE_CC, RANGE_RE_CC, NEWLINE_RE_CC, TAB_RE_CC, 877            UCASE_RE_CC, NONUCASE_RE_CC, LCASE_RE_CC, NONLCASE_RE_CC: 878                w++; 879            START_RE_CC, END_RE_CC, BOUNDARY_RE_CC, NONBOUNDARY_RE_CC, ALWAYS_RE_CC: 880                ; 881            LITERAL_RE_CC: 882                w = w + token-->RE_PAR2 - token-->RE_PAR1; 883            VARIABLE_RE_CC: 884                return -1; 885            IF_RE_CC: 886                rv = TEXT_TY_RE_Width((token-->RE_DOWN)-->RE_NEXT); 887                if (rv == -1) return -1; 888                if (rv ~= TEXT_TY_RE_Width(((token-->RE_DOWN)-->RE_NEXT)-->RE_NEXT)) 889                    return -1; 890                w = w + rv; 891            SUBEXP_RE_CC: 892                if (token-->RE_PAR2 == 1 or 2) rv = 0; 893                else { 894                    rv = TEXT_TY_RE_Width(token-->RE_DOWN); 895                    if (rv == -1) return -1; 896                } 897                w = w + rv; 898            QUANTIFIER_RE_CC: 899                if (token-->RE_PAR1 ~= token-->RE_PAR2) return -1; 900                rv = TEXT_TY_RE_Width(token-->RE_DOWN); 901                if (rv == -1) return -1; 902                w = w + rv*(token-->RE_PAR1); 903            DISJUNCTION_RE_CC: 904                aw = -1; 905                for (choice = token-->RE_DOWN: choice ~= NULL: choice = choice-->RE_NEXT) { 906                    rv = TEXT_TY_RE_Width(choice-->RE_DOWN); 907                    !print "Option found ", rv, "^"; 908                    if (rv == -1) return -1; 909                    if ((aw >= 0) && (aw ~= rv)) return -1; 910                    aw = rv; 911                } 912                w = w + aw; 913            SENSITIVITY_RE_CC: 914                ; 915        } 916        if (downwards) return w; 917        if (token ~= NULL) token = token-->RE_NEXT; 918    } 919    return w; 920];

Parser.

The virtue of all of that tree compilation is that the code which actually does the work – which parses the source text to see if the regular expression matches it – is much shorter and quicker: indeed, it takes up fewer lines than the compiler part, which goes to show that decoding regular expression syntax is a more complex task than acting upon it. This would have surprised the pioneers of regexp, but the syntax has become much more complicated over the decades because of a steady increase in the number of extended notations. The process shows no sign of stopping, with Python and PCRE continuing to push boundaries beyond Perl, which was once thought the superest, duperest regexp syntax there could be. However: to work.

The main matcher simply starts a recursive subroutine to perform the match. However, the subroutine tests for a match at a particular position in the source text; so the main routine tries the subroutine everywhere convenient in the source text, from left to right, until a match is made – unless the regexp is constrained by a ^ glyph to begin matching at the start of the source text, which will cause a START_RE_CC node to be the eldest child of the \0 root.

943Global TEXT_TY_RE_RewindCount; 944[ TEXT_TY_RE_PrintNoRewinds; print TEXT_TY_RE_RewindCount; ]; 945 946Constant CIS_MFLAG = 1; 947Constant ACCUM_MFLAG = 2; 948 949[ TEXT_TY_RE_Parse ftxt txt ipos insens 950    ilen rv root i initial_mode; 951 952    ilen = TEXT_TY_CharacterLength(txt); 953    if ((ipos<0) || (ipos>ilen)) return -1; 954     955    root = RE_PACKET_space; 956     957    initial_mode = 0; if (insens) initial_mode = CIS_MFLAG; 958     959    TEXT_TY_RE_Clear_Markers(RE_PACKET_space); 960     961    for (:ipos<=ilen:ipos++) { 962        if ((RE_PACKET_space-->RE_DOWN ~= NULL) && 963            ((RE_PACKET_space-->RE_DOWN)-->RE_CCLASS == START_RE_CC) && 964            (ipos>0)) { rv = -1; break; } 965        if (ipos > 0) TEXT_TY_RE_EraseConstraints(RE_PACKET_space, initial_mode); 966        TEXT_TY_RE_RewindCount = 0; 967        rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ilen, RE_PACKET_space, initial_mode); 968        if (rv >= 0) break; 969    } 970 971    if (rv == -1) { 972        root-->RE_DATA1 = -1; 973        root-->RE_DATA2 = -1; 974    } else { 975        root-->RE_DATA1 = ipos; 976        root-->RE_DATA2 = ipos+rv; 977    } 978    return rv; 979];

Parse At Position.

TEXT_TY_RE_ParseAtPosition(ftxt, txt, ifrom, ito) attempts to match text beginning at position ifrom in the text txt and extending for any length up to position ito: it returns the number of characters which were matched (which can legitimately be 0), or -1 if no match could be made. ftxt is the original text of the regular expression in its precompiled form, which we need partly to print good debugging information, but mostly in order to match against a LITERAL_RE_CC node.

991[ TEXT_TY_RE_ParseAtPosition ftxt txt ifrom ito token mode_flags 992    outcome ipos npos rv i ch edge rewind_this; 993 994    if (ifrom > ito) return -1; 995 996    ipos = ifrom; 997 998    .Rewind; 999    while (token ~= NULL) { 1000        outcome = false; 1001        if (TEXT_TY_RE_Trace) { 1002            print "Matching at ", ipos, ": "; 1003            TEXT_TY_RE_DebugNode(token, ftxt, true); 1004        } 1005 1006        if (ipos<ito) ch = BlkValueRead(txt, ipos); else ch = 0; 1007 1008        token-->RE_MODES = mode_flags; ! Save in case of backtrack 1009 1010        switch (token-->RE_CCLASS) { 1011             1012            ! Should never happen 1013             1014            CHOICE_RE_CC: return "internal error"; 1015             1016            ! Mode switches 1017             1018            SENSITIVITY_RE_CC: 1019                if (token-->RE_PAR1) mode_flags = mode_flags | CIS_MFLAG; 1020                else mode_flags = mode_flags & (~CIS_MFLAG); 1021                outcome = true; 1022         1023            ! Zero-length positional markers 1024             1025            ALWAYS_RE_CC: 1026                outcome = true; 1027            NEVER_RE_CC: 1028            START_RE_CC: 1029                if (ipos == 0) outcome = true; 1030            END_RE_CC: 1031                if (BlkValueRead(txt, ipos) == 0) outcome = true; 1032            BOUNDARY_RE_CC: 1033                rv = 0; 1034                if (BlkValueRead(txt, ipos) == 0 or 10 or 13 or 32 or 9 1035                    or '.' or ',' or '' or '?' 1036                    or '-' or '/' or '' or ':' or ';' 1037                    or '(' or ')' or '[' or ']' or '{' or '}') rv++; 1038                if (ipos == 0) ch = 0; 1039                else ch = BlkValueRead(txt, ipos-1); 1040                if (ch == 0 or 10 or 13 or 32 or 9 1041                    or '.' or ',' or '' or '?' 1042                    or '-' or '/' or '' or ':' or ';' 1043                    or '(' or ')' or '[' or ']' or '{' or '}') rv++; 1044                if (rv == 1) outcome = true; 1045            NONBOUNDARY_RE_CC: 1046                rv = 0; 1047                if (BlkValueRead(txt, ipos) == 0 or 10 or 13 or 32 or 9 1048                    or '.' or ',' or '' or '?' 1049                    or '-' or '/' or '' or ':' or ';' 1050                    or '(' or ')' or '[' or ']' or '{' or '}') rv++; 1051                if (ipos == 0) ch = 0; 1052                else ch = BlkValueRead(txt, ipos-1); 1053                if (ch == 0 or 10 or 13 or 32 or 9 1054                    or '.' or ',' or '' or '?' 1055                    or '-' or '/' or '' or ':' or ';' 1056                    or '(' or ')' or '[' or ']' or '{' or '}') rv++; 1057                if (rv ~= 1) outcome = true; 1058 1059            ! Control constructs 1060         1061            IF_RE_CC: 1062                i = token-->RE_PAR1; ch = false; 1063                if (TEXT_TY_RE_Trace) { 1064                    print "Trying conditional from ", ipos, ": "; 1065                    TEXT_TY_RE_DebugNode(token, ftxt, true); 1066                } 1067                if (i >= 1) { 1068                     if ((i<RE_Subexpressions-->10) && 1069                         ((RE_Subexpressions-->i)-->RE_DATA1 >= 0)) ch = true; 1070                } else { 1071                    rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito, 1072                        (token-->RE_DOWN)-->RE_DOWN, mode_flags); 1073                    if (rv >= 0) ch = true; 1074                } 1075                if (TEXT_TY_RE_Trace) { 1076                    print "Condition found to be ", ch, "^"; 1077                } 1078                if (ch) { 1079                    rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito, 1080                        ((token-->RE_DOWN)-->RE_NEXT)-->RE_DOWN, mode_flags); 1081                    !print "Then clause returned ", rv, "^"; 1082                } else { 1083                    if ((((token-->RE_DOWN)-->RE_NEXT)-->RE_NEXT) == NULL) 1084                        rv = 0; ! The empty else clause matches 1085                    else rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito, 1086                        (((token-->RE_DOWN)-->RE_NEXT)-->RE_NEXT)-->RE_DOWN, mode_flags); 1087                    !print "Else clause returned ", rv, "^"; 1088                } 1089                if (rv >= 0) { 1090                    outcome = true; 1091                    ipos = ipos + rv; 1092                } 1093            DISJUNCTION_RE_CC: 1094                if (TEXT_TY_RE_Trace) { 1095                    print "Trying disjunction from ", ipos, ": "; 1096                    TEXT_TY_RE_DebugNode(token, ftxt, true); 1097                } 1098                for (ch = token-->RE_DOWN: ch ~= NULL: ch = ch-->RE_NEXT) { 1099                    if (ch-->RE_PAR1 <= token-->RE_CONSTRAINT) continue; 1100                    if (TEXT_TY_RE_Trace) { 1101                        print "Trying choice at ", ipos, ": "; 1102                        TEXT_TY_RE_DebugNode(ch, ftxt, true); 1103                    } 1104                    rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito, 1105                        ch-->RE_DOWN, mode_flags); 1106                    if (rv >= 0) { 1107                        token-->RE_DATA1 = ipos; ! Where match was made 1108                        token-->RE_DATA2 = ch-->RE_PAR1; ! Option taken 1109                        ipos = ipos + rv; 1110                        outcome = true; 1111                        if (TEXT_TY_RE_Trace) { 1112                            print "Choice worked with width ", rv, ": "; 1113                            TEXT_TY_RE_DebugNode(ch, ftxt, true); 1114                        } 1115                        break; 1116                    } else { 1117                        if (mode_flags & ACCUM_MFLAG == false) 1118                            TEXT_TY_RE_FailSubexpressions(ch-->RE_DOWN); 1119                    } 1120                } 1121                if (outcome == false) { 1122                    if (TEXT_TY_RE_Trace) { 1123                        print "Failed disjunction from ", ipos, ": "; 1124                        TEXT_TY_RE_DebugNode(token, ftxt, true); 1125                    } 1126                    token-->RE_DATA1 = ipos; ! Where match was tried 1127                    token-->RE_DATA2 = -1; ! No option was taken 1128                } 1129            SUBEXP_RE_CC: 1130                if (token-->RE_PAR2 == 1 or 2) { 1131                    npos = ipos - token-->RE_PAR3; 1132                    if (npos<0) rv = -1; ! Lookbehind fails: nothing behind 1133                    else rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, npos, ito, token-->RE_DOWN, 1134                        mode_flags); 1135                } else { 1136                    switch (token-->RE_PAR3) { 1137                        0: rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito, token-->RE_DOWN, 1138                            mode_flags); 1139                        1: rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito, token-->RE_DOWN, 1140                            mode_flags & (~CIS_MFLAG)); 1141                        2: rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito, token-->RE_DOWN, 1142                            mode_flags | CIS_MFLAG); 1143                    } 1144                } 1145                npos = ipos; 1146                if (rv >= 0) npos = ipos + rv; 1147                switch (token-->RE_PAR2) { 1148                    1: if (rv >= 0) rv = 0; 1149                    2: if (rv >= 0) rv = -1; else rv = 0; 1150                } 1151                if (rv >= 0) { 1152                    token-->RE_DATA1 = ipos; 1153                    ipos = ipos + rv; 1154                    token-->RE_DATA2 = npos; 1155                    outcome = true; 1156                } else { 1157                    if (mode_flags & ACCUM_MFLAG == false) { 1158                        token-->RE_DATA1 = -1; 1159                        token-->RE_DATA2 = -1; 1160                    } 1161                } 1162                if (token-->RE_PAR2 == 2) TEXT_TY_RE_FailSubexpressions(token, true); 1163            QUANTIFIER_RE_CC: 1164                token-->RE_DATA1 = ipos; 1165                if ((token-->RE_DOWN)-->RE_CCLASS == SUBEXP_RE_CC) { 1166                    (token-->RE_DOWN)-->RE_CACHE1 = -1; 1167                    (token-->RE_DOWN)-->RE_CACHE2 = -1; 1168                } 1169                if (TEXT_TY_RE_Trace) { 1170                    print "Trying quantifier from ", ipos, ": "; 1171                    TEXT_TY_RE_DebugNode(token, ftxt, true); 1172                } 1173                if (token-->RE_PAR3 == false) { ! Greedy quantifier 1174                    !edge = ito; if (token-->RE_CONSTRAINT >= 0) edge = token-->RE_CONSTRAINT; 1175                    edge = token-->RE_PAR2; 1176                    if (token-->RE_CONSTRAINT >= 0) edge = token-->RE_CONSTRAINT; 1177                    rv = -1; 1178                    for (i=0, npos=ipos: i<edge: i++) { 1179                        if (TEXT_TY_RE_Trace) { 1180                            print "Trying quant rep ", i+1, " at ", npos, ": "; 1181                            TEXT_TY_RE_DebugNode(token, ftxt, true); 1182                        } 1183                        rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, npos, ito, token-->RE_DOWN, 1184                            mode_flags | ACCUM_MFLAG); 1185                        if (rv < 0) break; 1186                        if ((token-->RE_DOWN)-->RE_CCLASS == SUBEXP_RE_CC) { 1187                            (token-->RE_DOWN)-->RE_CACHE1 = (token-->RE_DOWN)-->RE_DATA1; 1188                            (token-->RE_DOWN)-->RE_CACHE2 = (token-->RE_DOWN)-->RE_DATA2; 1189                        } 1190                        if ((rv == 0) && (token-->RE_PAR2 == 30000) && (i>=1)) { i++; break; } 1191                        npos = npos + rv; 1192                    } 1193                    if ((i >= token-->RE_PAR1) && (i <= token-->RE_PAR2)) 1194                        outcome = true; 1195                } else { ! Lazy quantifier 1196                    edge = token-->RE_PAR1; 1197                    if (token-->RE_CONSTRAINT > edge) edge = token-->RE_CONSTRAINT; 1198                    for (i=0, npos=ipos: (npos<ito) && (i < token-->RE_PAR2): i++) { 1199                        if (i >= edge) break; 1200                        if (TEXT_TY_RE_Trace) { 1201                            print "Trying quant rep ", i+1, " at ", npos, ": "; 1202                            TEXT_TY_RE_DebugNode(token, ftxt, true); 1203                        } 1204                        rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, npos, ito, token-->RE_DOWN, 1205                            mode_flags | ACCUM_MFLAG); 1206                        if (rv < 0) break; 1207                        if ((token-->RE_DOWN)-->RE_CCLASS == SUBEXP_RE_CC) { 1208                            (token-->RE_DOWN)-->RE_CACHE1 = (token-->RE_DOWN)-->RE_DATA1; 1209                            (token-->RE_DOWN)-->RE_CACHE2 = (token-->RE_DOWN)-->RE_DATA2; 1210                        } 1211                        if ((rv == 0) && (token-->RE_PAR2 == 30000) && (i>=1)) { i++; break; } 1212                        npos = npos + rv; 1213                    } 1214                    if ((i >= edge) && (i <= token-->RE_PAR2)) 1215                        outcome = true; 1216                } 1217                if (outcome) { 1218                    if (token-->RE_PAR3 == false) { ! Greedy quantifier 1219                        if (i > token-->RE_PAR1) { ! I.e., if we have been greedy 1220                            token-->RE_DATA2 = i-1; ! And its edge limitation 1221                        } else { 1222                            token-->RE_DATA2 = -1; 1223                        } 1224                    } else { ! Lazy quantifier 1225                        if (i < token-->RE_PAR2) { ! I.e., if we have been lazy 1226                            token-->RE_DATA2 = i+1; ! And its edge limitation 1227                        } else { 1228                            token-->RE_DATA2 = -1; 1229                        } 1230                    } 1231                    ipos = npos; 1232                    if ((i == 0) && (mode_flags & ACCUM_MFLAG == false)) 1233                        TEXT_TY_RE_FailSubexpressions(token-->RE_DOWN); 1234                    if ((token-->RE_DOWN)-->RE_CCLASS == SUBEXP_RE_CC) { 1235                        (token-->RE_DOWN)-->RE_DATA1 = (token-->RE_DOWN)-->RE_CACHE1; 1236                        (token-->RE_DOWN)-->RE_DATA2 = (token-->RE_DOWN)-->RE_CACHE2; 1237                    } 1238                    if (TEXT_TY_RE_Trace) { 1239                        print "Successful quant reps ", i, ": "; 1240                        TEXT_TY_RE_DebugNode(token, ftxt, true); 1241                    } 1242                } else { 1243                    !token-->RE_DATA2 = -1; 1244                    if (mode_flags & ACCUM_MFLAG == false) 1245                        TEXT_TY_RE_FailSubexpressions(token-->RE_DOWN); 1246                    if (TEXT_TY_RE_Trace) { 1247                        print "Failed quant reps ", i, ": "; 1248                        TEXT_TY_RE_DebugNode(token, ftxt, true); 1249                    } 1250                } 1251                 1252            ! Character classes 1253                 1254            NOTHING_RE_CC: ; 1255            ANYTHING_RE_CC: if (ch) outcome = true; ipos++; 1256            WHITESPACE_RE_CC: 1257                if (ch == 10 or 13 or 32 or 9) { outcome = true; ipos++; } 1258            NONWHITESPACE_RE_CC: 1259                if ((ch) && (ch ~= 10 or 13 or 32 or 9)) { outcome = true; ipos++; } 1260            PUNCTUATION_RE_CC: 1261                if (ch == '.' or ',' or '' or '?' 1262                    or '-' or '/' or '' or ':' or ';' 1263                    or '(' or ')' or '[' or ']' or '{' or '}') { outcome = true; ipos++; } 1264            NONPUNCTUATION_RE_CC: 1265                if ((ch) && (ch ~= '.' or ',' or '' or '?' 1266                    or '-' or '/' or '' or ':' or ';' 1267                    or '(' or ')' or '[' or ']' or '{' or '}')) { outcome = true; ipos++; } 1268            WORD_RE_CC: 1269                if ((ch) && (ch ~= 10 or 13 or 32 or 9 1270                    or '.' or ',' or '' or '?' 1271                    or '-' or '/' or '' or ':' or ';' 1272                    or '(' or ')' or '[' or ']' or '{' or '}')) { outcome = true; ipos++; } 1273            NONWORD_RE_CC: 1274                if (ch == 10 or 13 or 32 or 9 1275                    or '.' or ',' or '' or '?' 1276                    or '-' or '/' or '' or ':' or ';' 1277                    or '(' or ')' or '[' or ']' or '{' or '}') { outcome = true; ipos++; } 1278            DIGIT_RE_CC: 1279                if (ch == '0' or '1' or '2' or '3' or '4' 1280                    or '5' or '6' or '7' or '8' or '9') { outcome = true; ipos++; } 1281            NONDIGIT_RE_CC: 1282                if ((ch) && (ch ~= '0' or '1' or '2' or '3' or '4' 1283                    or '5' or '6' or '7' or '8' or '9')) { outcome = true; ipos++; } 1284            LCASE_RE_CC: 1285                if (CharIsOfCase(ch, 0)) { outcome = true; ipos++; } 1286            NONLCASE_RE_CC: 1287                if ((ch) && (CharIsOfCase(ch, 0) == false)) { outcome = true; ipos++; } 1288            UCASE_RE_CC: 1289                if (CharIsOfCase(ch, 1)) { outcome = true; ipos++; } 1290            NONUCASE_RE_CC: 1291                if ((ch) && (CharIsOfCase(ch, 1) == false)) { outcome = true; ipos++; } 1292            NEWLINE_RE_CC: if (ch == 10) { outcome = true; ipos++; } 1293            TAB_RE_CC: if (ch == 9) { outcome = true; ipos++; } 1294            RANGE_RE_CC: 1295                if (TEXT_TY_RE_Range(ch, ftxt, 1296                    token-->RE_PAR1, token-->RE_PAR2, token-->RE_PAR3, mode_flags & CIS_MFLAG)) 1297                    { outcome = true; ipos++; } 1298             1299            ! Substring matches 1300             1301            LITERAL_RE_CC: 1302                rv = TEXT_TY_RE_MatchSubstring(txt, ipos, 1303                    ftxt, token-->RE_PAR1, token-->RE_PAR2, mode_flags & CIS_MFLAG); 1304                if (rv >= 0) { ipos = ipos + rv; outcome = true; } 1305            VARIABLE_RE_CC: 1306                i = token-->RE_PAR1; 1307                if ((RE_Subexpressions-->i)-->RE_DATA1 >= 0) { 1308                    rv = TEXT_TY_RE_MatchSubstring(txt, ipos, 1309                        txt, (RE_Subexpressions-->i)-->RE_DATA1, 1310                        (RE_Subexpressions-->i)-->RE_DATA2, mode_flags & CIS_MFLAG); 1311                    if (rv >= 0) { ipos = ipos + rv; outcome = true; } 1312                } 1313                .NeverMatchIncompleteVar; 1314        } 1315         1316        if (outcome == false) { 1317            if (TEXT_TY_RE_RewindCount++ >= 10000) { 1318                if (TEXT_TY_RE_RewindCount == 10001) { 1319                    style bold; print "OVERFLOW^"; style roman; 1320                } 1321                return -1; 1322            } 1323            if (TEXT_TY_RE_Trace) { 1324                print "Rewind sought from failure at pos ", ipos, " with: "; 1325                    TEXT_TY_RE_DebugNode(token, ftxt, true); 1326            } 1327            if ((token-->RE_CCLASS == QUANTIFIER_RE_CC) && 1328                (TEXT_TY_RE_SeekBacktrack(token-->RE_DOWN, ftxt, false, ito, false))) 1329                jump RewindFound; 1330            if (mode_flags & ACCUM_MFLAG == false) TEXT_TY_RE_FailSubexpressions(token); 1331            token = token-->RE_PREVIOUS; 1332            while (token ~= NULL) { 1333                if (TEXT_TY_RE_SeekBacktrack(token, ftxt, true, ito, false)) { 1334                    .RewindFound; 1335                    ipos = token-->RE_DATA1; 1336                    mode_flags = token-->RE_MODES; 1337                    if (mode_flags & ACCUM_MFLAG == false) 1338                        TEXT_TY_RE_FailSubexpressions(token, true); 1339                    if (ipos == -1) 1340                        TEXT_TY_RE_DebugTree(ftxt, true); 1341                    if (TEXT_TY_RE_Trace) { 1342                        print "^[", ifrom, ",", ito, "] rewinding to ", ipos, " at "; 1343                        TEXT_TY_RE_DebugNode(token, ftxt, true); 1344                    } 1345                    jump Rewind; 1346                } 1347                token = token-->RE_PREVIOUS; 1348            } 1349            if (TEXT_TY_RE_Trace) 1350                print "^Rewind impossible^"; 1351            return -1; 1352        } 1353 1354        token = token-->RE_NEXT; 1355    } 1356    return ipos - ifrom; 1357];

Backtracking.

It would be very straightforward to match regular expressions with the above recursive code if, for every node, there were a fixed number of characters (depending on the node) such that there would either be a match eating that many characters, or else no match at all. If that were true, we could simply march through the text matching until we could match no more, and although some nodes might have ambiguous readings, we could always match the first possibility which worked. There would never be any need to retreat.

Well, in fact that happy state does apply to a surprising number of nodes, and some quite complicated regular expressions can be made which use only them: <abc>{2}\d\d\1, for instance, matches a sequence of exactly 6 characters or else fails to match altogether, and there is never any need to backtrack. One reason why backtracking is a fairly good algorithm in practice is that these "good" cases occur fairly often, in subexpressions if not in the entire expression, and the simple method above disposes of them efficiently.

But in an expression like ab+bb, there is no alternative to backtracking if we are going to try to match the nodes from left to right: we match the "a", then we match as many "b"s as we can – but then we find that we have to match "bb", and this is necessarily impossible because we have just eaten all of the "b"s available. We therefore backtrack one node to the b+ and try again. We obviously can't literally try again because that would give the same result: instead we impose a constraint. Suppose it previously matched a row of 23 letter "b"s, so that the quantifier + resulted in a multiplicity of 23. We then constrain the node and in effect consider it to be b{1,22}, that is, to match at least 1 and at most 22 letter "b"s. That still won't work, as it happens, so we backtrack again with a constraint tightened to make it b{1,21}, and now the match occurs as we would hope. When the expression becomes more complex, backtracking becomes a longer-distance, recursive procedure – we have to exhaust all possibilities of a more recent node before tracking back to one from longer ago. (This is why the worst test cases are those which entice us into a long, long series of matches only to find that a guess made right back at the start was ill-fated.)

Rather than describing TEXT_TY_RE_SeekBacktrack in detail here, it is probably more useful to suggest that the reader observe it in action by setting TEXT_TY_RE_Trace and trying a few regular expressions.

1402[ TEXT_TY_RE_SeekBacktrack token ftxt downwards ito report_only 1403    untried; 1404    for (: token ~= NULL: token = token-->RE_NEXT) { 1405        if ((TEXT_TY_RE_Trace) && (report_only == false)) { 1406            print "Scan for rewind: "; 1407            TEXT_TY_RE_DebugNode(token, ftxt, true); 1408        } 1409        if ((token-->RE_CCLASS == SUBEXP_RE_CC) && 1410            (token-->RE_PAR2 == 1 or 2 or 4)) { 1411            if (downwards) rfalse; 1412            continue; 1413        } 1414        if (token-->RE_DOWN ~= NULL) { 1415            if ((TEXT_TY_RE_Trace) && (report_only == false)) print "Descend^"; 1416            if (TEXT_TY_RE_SeekBacktrack(token-->RE_DOWN, ftxt, false, ito, report_only)) rtrue; 1417        } 1418        untried = false; 1419        switch (token-->RE_CCLASS) { 1420            DISJUNCTION_RE_CC: 1421                if ((token-->RE_DATA2 >= 1) && 1422                    (token-->RE_DATA2 < token-->RE_PAR1) && 1423                    (token-->RE_CONSTRAINT < token-->RE_PAR1)) { ! Matched but earlier than last 1424                    if (report_only) rtrue; 1425                    if (token-->RE_CONSTRAINT == -1) 1426                        token-->RE_CONSTRAINT = 1; 1427                    else 1428                        (token-->RE_CONSTRAINT)++; 1429                    untried = true; 1430                } 1431            QUANTIFIER_RE_CC: 1432                if (token-->RE_CONSTRAINT ~= -2) { 1433                    if ((TEXT_TY_RE_Trace) && (report_only == false)) { 1434                        print "Quant with cons not -2: "; 1435                        TEXT_TY_RE_DebugNode(token, ftxt, true); 1436                    } 1437                    if (token-->RE_DATA2 >= 0) { 1438                        if (report_only) rtrue; 1439                        token-->RE_CONSTRAINT = token-->RE_DATA2; 1440                        untried = true; 1441                    } 1442                } 1443        } 1444        if (untried) { 1445            if (TEXT_TY_RE_Trace) { 1446                print "Grounds for rewind at: "; 1447                TEXT_TY_RE_DebugNode(token, ftxt, true); 1448            } 1449            TEXT_TY_RE_EraseConstraints(token-->RE_NEXT); 1450            TEXT_TY_RE_EraseConstraints(token-->RE_DOWN); 1451            rtrue; 1452        } 1453        if (downwards) rfalse; 1454    } 1455    rfalse; 1456];

Fail Subexpressions.

Here, an attempt to make a complicated match against the node in token has failed: that means that any subexpressions which were matched in the course of the attempt must also in retrospect be considered unmatched. So we work down through the subtree at token and empty any markers for subexpressions, which in effect clears their backslash variables – this is important as, otherwise, the contents left over could cause the alternative reading of the token to be misparsed if it refers to the backslash variables in question. (If you think nobody would ever be crazy enough to write a regular expression like that, you haven't see Perl's test suite.)

If the downwards flag is clear, it not only invalidates subexpression matches below the node but also to the right of the node – this is useful for a backtrack which runs back quite some distance.

1474[ TEXT_TY_RE_FailSubexpressions token downwards; 1475    for (: token ~= NULL: token = token-->RE_NEXT) { 1476        if (token-->RE_DOWN ~= NULL) TEXT_TY_RE_FailSubexpressions(token-->RE_DOWN); 1477        if (token-->RE_CCLASS == SUBEXP_RE_CC) { 1478            token-->RE_DATA1 = -1; 1479            token-->RE_DATA2 = -1; 1480        } 1481        if (downwards) break; 1482    } 1483];

Erasing Constraints.

As explained above, temporary constraints are placed on some nodes when we are backtracking to test possible cases. When we do backtrack, though, it's important to lift any constraints left over from the previous attempt to parse material which is part of or subsequent to the token whose match attempt has been abandoned.

1493[ TEXT_TY_RE_EraseConstraints token; 1494    while (token ~= NULL) { 1495        switch (token-->RE_CCLASS) { 1496            DISJUNCTION_RE_CC: token-->RE_CONSTRAINT = -1; 1497            QUANTIFIER_RE_CC: token-->RE_CONSTRAINT = -1; 1498        } 1499        if (token-->RE_DOWN) TEXT_TY_RE_EraseConstraints(token-->RE_DOWN); 1500        token = token-->RE_NEXT; 1501    } 1502];

Matching Literal Text.

Here we attempt to make a match of the substring of the text mtxt which runs from character mfrom to character mto, looking for it at the given position ipos in the source text txt.

1510[ TEXT_TY_RE_MatchSubstring txt ipos mtxt mfrom mto insens 1511    i ch; 1512    if (mfrom < 0) return 0; 1513    if (insens) 1514        for (i=mfrom:i<mto:i++) { 1515            ch = BlkValueRead(mtxt, i); 1516            if (BlkValueRead(txt, ipos++) ~= ch or TEXT_TY_RevCase(ch)) 1517                return -1; 1518        } 1519    else 1520        for (i=mfrom:i<mto:i++) 1521            if (BlkValueRead(txt, ipos++) ~= BlkValueRead(mtxt, i)) 1522                return -1; 1523    return mto-mfrom; 1524];

Matching Character Range.

Suppose that a character range is stored in ftxt between the character positions rf and rt. Then TEXT_TY_RE_Range(ch, ftxt, rf, rt, negate, insens) tests whether a given character ch lies within that character range, negating the outcome if negate is set, and performing comparisons case insensitively if insens is set.

1534[ TEXT_TY_RE_Range ch ftxt rf rt negate insens 1535    i chm upper crev; 1536    if (ch == 0) rfalse; 1537    if (negate == true) { 1538        if (TEXT_TY_RE_Range(ch, ftxt, rf, rt, false, insens)) rfalse; 1539        rtrue; 1540    } 1541    for (i=rf: i<rt: i++) { 1542        chm = BlkValueRead(ftxt, i); 1543        if ((chm == '\') && (i+1<rt)) { 1544            chm = BlkValueRead(ftxt, ++i); 1545            switch (chm) { 1546                's': 1547                    if (ch == 10 or 13 or 32 or 9) rtrue; 1548                'S': 1549                    if ((ch) && (ch ~= 10 or 13 or 32 or 9)) rtrue; 1550                'p': 1551                    if (ch == '.' or ',' or '' or '?' 1552                        or '-' or '/' or '' or ':' or ';' 1553                        or '(' or ')' or '[' or ']' or '{' or '}') rtrue; 1554                'P': 1555                    if ((ch) && (ch ~= '.' or ',' or '' or '?' 1556                        or '-' or '/' or '' or ':' or ';' 1557                        or '(' or ')' or '[' or ']' or '{' or '}')) rtrue; 1558                'w': 1559                    if ((ch) && (ch ~= 10 or 13 or 32 or 9 1560                        or '.' or ',' or '' or '?' 1561                        or '-' or '/' or '' or ':' or ';' 1562                        or '(' or ')' or '[' or ']' or '{' or '}')) rtrue; 1563                'W': 1564                    if (ch == 10 or 13 or 32 or 9 1565                        or '.' or ',' or '' or '?' 1566                        or '-' or '/' or '' or ':' or ';' 1567                        or '(' or ')' or '[' or ']' or '{' or '}') rtrue; 1568                'd': 1569                    if (ch == '0' or '1' or '2' or '3' or '4' 1570                        or '5' or '6' or '7' or '8' or '9') rtrue; 1571                'D': 1572                    if ((ch) && (ch ~= '0' or '1' or '2' or '3' or '4' 1573                        or '5' or '6' or '7' or '8' or '9')) rtrue; 1574                'l': if (CharIsOfCase(ch, 0)) rtrue; 1575                'L': if (CharIsOfCase(ch, 0) == false) rtrue; 1576                'u': if (CharIsOfCase(ch, 1)) rtrue; 1577                'U': if (CharIsOfCase(ch, 1) == false) rtrue; 1578                'n': if (ch == 10) rtrue; 1579                't': if (ch == 9) rtrue; 1580            } 1581        } else { 1582            if ((i+2<rt) && (BlkValueRead(ftxt, i+1) == '-')) { 1583                upper = BlkValueRead(ftxt, i+2); 1584                if ((ch >= chm) && (ch <= upper)) rtrue; 1585                if (insens) { 1586                    crev = TEXT_TY_RevCase(ch); 1587                    if ((crev >= chm) && (crev <= upper)) rtrue; 1588                } 1589                i=i+2; 1590            } else { 1591                if (chm == ch) rtrue; 1592                if ((insens) && (chm == TEXT_TY_RevCase(ch))) rtrue; 1593            } 1594        } 1595    } 1596    rfalse; 1597];

Search And Replace.

And finally, last but not least: the routine which searches an indexed text txt trying to match it against ftxt. If ftxtype is set to REGEXP_BLOB then ftxt is expected to be a regular expression such as ab+(c*de)?, whereas if ftxtype is CHR_BLOB then it is expected only to be a simple string of characters taken literally, such as frog.

Each match found is replaced with the contents of rtxt, except that if the blob type is REGEXP_BLOB then we recognise a few syntaxes as special: for instance, \2 expands to the value of subexpression 2 as it was matched – see Writing with Inform for details.

The optional argument insens is a flag which, if set, causes the matching to be done case insensitively; the optional argument exactly, if set, causes the matching to work only if the entire txt is matched. (This is not especially useful with regular expressions, because the effect can equally be achieved by turning ab+c, say, into ^ab+c$, but it is indeed useful where the blob type is CHR_BLOB.)

For an explanation of the use of the word "blob", see Text.i6t.

1621[ TEXT_TY_Replace_RE ftxtype txt ftxt rtxt insens exactly 1622    r p p1 p2 cp cp1 cp2; 1623    !print "Find: "; BlkValueDebug(ftxt); print "^"; 1624    !print "Rep: "; BlkValueDebug(rtxt); print "^"; 1625    !print "In: "; BlkValueDebug(txt); print "^"; 1626    if (rtxt == 0 or 1) { cp = txt-->0; p = TEXT_TY_Temporarily_Transmute(txt); } 1627    else TEXT_TY_Transmute(txt); 1628    cp1 = ftxt-->0; p1 = TEXT_TY_Temporarily_Transmute(ftxt); 1629    cp2 = rtxt-->0; p2 = TEXT_TY_Temporarily_Transmute(rtxt); 1630    r = TEXT_TY_Replace_REI(ftxtype, txt, ftxt, rtxt, insens, exactly); 1631    TEXT_TY_Untransmute(ftxt, p1, cp1); 1632    TEXT_TY_Untransmute(rtxt, p2, cp2); 1633    if (rtxt == 0 or 1) TEXT_TY_Untransmute(txt, p, cp); 1634    return r; 1635]; 1636 1637[ TEXT_TY_Replace_REI ftxtype txt ftxt rtxt insens exactly 1638    ctxt csize ilen i cl mpos cpos ch chm; 1639 1640    ilen = TEXT_TY_CharacterLength(txt); 1641 1642    TEXT_TY_RE_Err = 0; 1643    switch (ftxtype) { 1644        REGEXP_BLOB: i = TEXT_TY_RE_CompileTree(ftxt, exactly); 1645        CHR_BLOB: i = TEXT_TY_CHR_CompileTree(ftxt, exactly); 1646        default: "*** bad ftxtype ***"; 1647    } 1648     1649    if ((i<0) || (i>RE_MAX_PACKETS)) { 1650        TEXT_TY_RE_Err = i; 1651        print "*** Regular expression error: ", (string) TEXT_TY_RE_Err, " ***^"; 1652        RunTimeProblem(RTP_REGEXPSYNTAXERROR); 1653        return 0; 1654    } 1655 1656    if (TEXT_TY_RE_Trace) { 1657        TEXT_TY_RE_DebugTree(ftxt); 1658        print "(compiled to ", i, " packets)^"; 1659    } 1660     1661    if (ftxtype == REGEXP_BLOB) TEXT_TY_RE_EmptyMatchVars(); 1662    mpos = 0; chm = 0; cpos = 0; 1663    while (TEXT_TY_RE_Parse(ftxt, txt, mpos, insens) >= 0) { 1664        chm++; 1665         1666        if (TEXT_TY_RE_Trace) { 1667            print "^*** Match ", chm, " found (", RE_PACKET_space-->RE_DATA1, ",", 1668                RE_PACKET_space-->RE_DATA2, "): "; 1669            if (RE_PACKET_space-->RE_DATA1 == RE_PACKET_space-->RE_DATA2) { 1670                print "<empty>"; 1671            } 1672            for (i=RE_PACKET_space-->RE_DATA1:i<RE_PACKET_space-->RE_DATA2:i++) { 1673                print (char) BlkValueRead(txt, i); 1674            } 1675            print " ***^"; 1676        } 1677         1678        if (rtxt == 0) break; ! Accept only one match, replace nothing 1679         1680        if (rtxt ~= 0 or 1) { 1681            if (chm == 1) { 1682                ctxt = BlkValueCreate(TEXT_TY); 1683                TEXT_TY_Transmute(ctxt); 1684                csize = BlkValueLBCapacity(ctxt); 1685            } 1686 1687            for (i=cpos:i<RE_PACKET_space-->RE_DATA1:i++) { 1688                ch = BlkValueRead(txt, i); 1689                if (cl+1 >= csize) { 1690                    if (BlkValueSetLBCapacity(ctxt, 2*cl) == false) break; 1691                    csize = BlkValueLBCapacity(ctxt); 1692                } 1693                BlkValueWrite(ctxt, cl++, ch); 1694            } 1695            BlkValueWrite(ctxt, cl, 0); 1696     1697            TEXT_TY_Concatenate(ctxt, rtxt, ftxtype, txt); 1698            csize = BlkValueLBCapacity(ctxt); 1699            cl = TEXT_TY_CharacterLength(ctxt); 1700        } 1701 1702        mpos = RE_PACKET_space-->RE_DATA2; cpos = mpos; 1703        if (RE_PACKET_space-->RE_DATA1 == RE_PACKET_space-->RE_DATA2) 1704            mpos++; 1705 1706        if (TEXT_TY_RE_Trace) { 1707            if (chm == 100) { ! Purely to keep the output from being excessive 1708                print "(Stopping after 100 matches.)^"; break; 1709            } 1710        } 1711    } 1712    if (chm > 0) { 1713        if (rtxt ~= 0 or 1) { 1714            for (i=cpos:i<ilen:i++) { 1715                ch = BlkValueRead(txt, i); 1716                if (cl+1 >= csize) { 1717                    if (BlkValueSetLBCapacity(ctxt, 2*cl) == false) break; 1718                    csize = BlkValueLBCapacity(ctxt); 1719                } 1720                BlkValueWrite(ctxt, cl++, ch); 1721            } 1722        } 1723         1724        if (ftxtype == REGEXP_BLOB) { 1725            TEXT_TY_RE_CreateMatchVars(txt); 1726            if (TEXT_TY_RE_Trace) 1727                TEXT_TY_RE_DebugMatchVars(txt); 1728        } 1729 1730        if (rtxt ~= 0 or 1) { 1731            BlkValueWrite(ctxt, cl, 0); 1732            BlkValueCopy(txt, ctxt); 1733            BlkValueFree(ctxt); 1734        } 1735    } 1736    return chm; 1737];

Concatenation.

See the corresponding routine in Text.i6t: this is a variation which handles the special syntaxes used in search-and-replace.

1744[ TEXT_TY_RE_Concatenate txt_to txt_from blobtype txt_ref 1745    pos len ch i tosize x y case; 1746    if ((txt_to==0) || (BlkValueWeakKind(txt_to) ~= TEXT_TY)) rfalse; 1747    if ((txt_from==0) || (BlkValueWeakKind(txt_from) ~= TEXT_TY)) return txt_to; 1748    pos = TEXT_TY_CharacterLength(txt_to); 1749    tosize = BlkValueLBCapacity(txt_to); 1750    len = TEXT_TY_CharacterLength(txt_from); 1751    for (i=0:i<len:i++) { 1752        ch = BlkValueRead(txt_from, i); 1753        if ((ch == '\') && (i < len-1)) { 1754            ch = BlkValueRead(txt_from, ++i); 1755            if (ch == 'n') ch = 10; 1756            if (ch == 't') ch = 9; 1757            case = -1; 1758            if (ch == 'l') case = 0; 1759            if (ch == 'u') case = 1; 1760            if (case >= 0) ch = BlkValueRead(txt_from, ++i); 1761            if ((ch >= '0') && (ch <= '9')) { 1762                ch = ch - '0'; 1763                if (ch < RE_Subexpressions-->10) { 1764                    x = (RE_Subexpressions-->ch)-->RE_DATA1; 1765                    y = (RE_Subexpressions-->ch)-->RE_DATA2; 1766                    if (x >= 0) { 1767                        for (:x<y:x++) { 1768                            ch = BlkValueRead(txt_ref, x); 1769                            if (pos+1 >= tosize) { 1770                                if (BlkValueSetLBCapacity(txt_to, 2*tosize) == false) break; 1771                                tosize = BlkValueLBCapacity(txt_to); 1772                            } 1773                            if (case >= 0) 1774                                BlkValueWrite(txt_to, pos++, CharToCase(ch, case)); 1775                            else 1776                                BlkValueWrite(txt_to, pos++, ch); 1777                        } 1778                    } 1779                } 1780                continue; 1781            } 1782             1783        } 1784        if (pos+1 >= tosize) { 1785            if (BlkValueSetLBCapacity(txt_to, 2*tosize) == false) break; 1786            tosize = BlkValueLBCapacity(txt_to); 1787        } 1788        BlkValueWrite(txt_to, pos++, ch); 1789    } 1790    BlkValueWrite(txt_to, pos, 0); 1791    return txt_to; 1792];