RegExp contents
Debugging.
Set this to true at your peril.
12Global TEXT_TY_RE_Trace = false;
13[ TEXT_TY_RE_SetTrace F; TEXT_TY_RE_Trace = F; ];
Algorithm.
Once Inform 7 supported (indexed) text, regular-expression matching became an obvious goal: regexp-based features offer more or less the gold standard in text search and replace facilities, and I7 is so concerned with text that we shouldn't make do with less. But the best and most portable implementation of regular expression matching, PCRE by Philip Hazel, is about a hundred times larger than the code in this section, and also had unacceptable memory needs: there was no practicable way to make it small enough to do useful work on the Z-machine. Nor could an I6 regexp-matcher compile just-in-time code, or translate the expression into a suitable deterministic finite state machine. One day, though, I read one of the papers which Brian Kernighan writes every few years to the effect that regular-expression matching is much easier than you think. Kernighan is right: writing a regexp matcher is indeed easier than you think (one day's worth of cheerful hacking), but debugging one until it passes the trickiest hundred of Perl's 645 test cases is another matter (and it took a whole week more). Still, the result seems to be robust. The main compromise made is that backtracking is not always comprehensive with regexps like ^(a\1?){4}$, because we do not allocate individual storage to backtrack individually through all possibilities of each of the four uses of the bracketed subexpression – which means we miss some cases, since the subexpression contains a reference to itself, so that its content can vary in the four uses. PCRE's approach here is to expand the expression as if it were a sequence of four bracketed expressions, thus removing the awkward quantifier {4}, but that costs memory: indeed this is why PCRE cannot solve all of Perl's test cases without its default memory allocation being raised. In other respects, the algorithm below appears to be accurate if not very fast.
Class Codes.
While in principle we could keep the match expression in textual form, in practice the syntax of regular expressions is complex enough that this would be tricky and rather slow: we would be parsing the same notations over and over again. So we begin by compiling it to a simple tree structure. The tree is made up of nodes, and each node has a "class code": these are identified by the *_RE_CC constants below. Note that the class codes below are all negative: this is so that they are distinct from all valid ZSCII or Unicode characters. (ZSCII is used only on the Z-machine, which has a 16-bit word but an 8-bit character set, so that all character values are positive; similarly, Unicode is (for our purposes) a 16-bit character set on a 32-bit virtual machine.)
58
59
60Constant NEWLINE_RE_CC = -1;
61Constant TAB_RE_CC = -2;
62Constant DIGIT_RE_CC = -3;
63Constant NONDIGIT_RE_CC = -4;
64Constant WHITESPACE_RE_CC = -5;
65Constant NONWHITESPACE_RE_CC = -6;
66Constant PUNCTUATION_RE_CC = -7;
67Constant NONPUNCTUATION_RE_CC = -8;
68Constant WORD_RE_CC = -9;
69Constant NONWORD_RE_CC = -10;
70Constant ANYTHING_RE_CC = -11;
71Constant NOTHING_RE_CC = -12;
72Constant RANGE_RE_CC = -13;
73Constant LCASE_RE_CC = -14;
74Constant NONLCASE_RE_CC = -15;
75Constant UCASE_RE_CC = -16;
76Constant NONUCASE_RE_CC = -17;
77
78
79
80Constant SUBEXP_RE_CC = -20;
81Constant DISJUNCTION_RE_CC = -21;
82Constant CHOICE_RE_CC = -22;
83Constant QUANTIFIER_RE_CC = -23;
84Constant IF_RE_CC = -24;
85Constant CONDITION_RE_CC = -25;
86Constant THEN_RE_CC = -26;
87Constant ELSE_RE_CC = -27;
88
89
90
91Constant VARIABLE_RE_CC = -30;
92Constant LITERAL_RE_CC = -31;
93
94
95
96Constant START_RE_CC = -40;
97Constant END_RE_CC = -41;
98Constant BOUNDARY_RE_CC = -42;
99Constant NONBOUNDARY_RE_CC = -43;
100Constant ALWAYS_RE_CC = -44;
101Constant NEVER_RE_CC = -45;
102
103
104
105Constant SENSITIVITY_RE_CC = -50;
Packets.
The nodes of the compiled expression tree are stored in "packets", which are segments of a fixed array. A regexp complicated enough that it cannot be stored in RE_MAX_PACKETS packets will be rejected with an error: it looks like a rather low limit, but in fact suffices to handle all of Perl's test cases, some of which are works of diabolism.
A packet is then a record containing 14 fields, with offsets defined by the constants defined below. These fields combine the compilation of the corresponding fragment of the regexp with both the tree structure holding these packets together and also the current state of the temporary variables recording how far we have progressed in trying all of the possible ways to match the packet.
122Constant RE_MAX_PACKETS = 32;
123
124Constant RE_PACKET_SIZE = 14;
125Constant RE_PACKET_SIZE_IN_BYTES = WORDSIZE*RE_PACKET_SIZE;
126
127Array RE_PACKET_space --> RE_MAX_PACKETS*RE_PACKET_SIZE;
128
129Constant RE_CCLASS = 0;
130Constant RE_PAR1 = 1;
131Constant RE_PAR2 = 2;
132Constant RE_PAR3 = 3;
133Constant RE_NEXT = 4;
134Constant RE_PREVIOUS = 5;
135Constant RE_DOWN = 6;
136Constant RE_UP = 7;
137Constant RE_DATA1 = 8;
138Constant RE_DATA2 = 9;
139Constant RE_CONSTRAINT = 10;
140Constant RE_CACHE1 = 11;
141Constant RE_CACHE2 = 12;
142Constant RE_MODES = 13;
Nodes.
The routine to create a node, something which happens only during the compilation phase, and also the routine which returns the address of a given node. Nodes are numbered from 0 up to M-1, where M is the constant RE_MAX_PACKETS.
151[ TEXT_TY_RE_Node n cc par1 par2 par3 offset;
152 if ((n<0) || (n >= RE_MAX_PACKETS)) rfalse;
153 offset = RE_PACKET_space + n*RE_PACKET_SIZE_IN_BYTES;
154 offset-->RE_CCLASS = cc;
155 offset-->RE_PAR1 = par1;
156 offset-->RE_PAR2 = par2;
157 offset-->RE_PAR3 = par3;
158 offset-->RE_NEXT = NULL;
159 offset-->RE_PREVIOUS = NULL;
160 offset-->RE_DOWN = NULL;
161 offset-->RE_UP = NULL;
162 offset-->RE_DATA1 = -1;
163 offset-->RE_DATA2 = -1;
164 offset-->RE_CONSTRAINT = -1;
165 return offset;
166];
167
168[ TEXT_TY_RE_NodeAddress n;
169 if ((n<0) || (n >= RE_MAX_PACKETS)) return -1;
170 return RE_PACKET_space + n*RE_PACKET_SIZE_IN_BYTES;
171];
Match Variables.
A bracketed subexpression can be used as a variable: we support \1, ..., \9 to mean "the value of subexpression 1 to 9", and \0 to mean "the whole text matched", as if the entire regexp were bracketed. (PCRE and Perl also allow \10, \11, ..., but we don't, because it complicates parsing and memory is too short.)
RE_Subexpressions-->10 stores the number of subexpressions in use, not counting \0. During the compiling stage, RE_Subexpressions-->N is set to point to the node representating \N, where N varies from 1 to 9. When matching is complete, and assuming we care about the contents of these variables – which we might not, and if not we certainly don't want to waste time and memory – we call TEXT_TY_RE_CreateMatchVars to allocate text variables and fill them in as appropriate, memory permitting.
TEXT_TY_RE_EmptyMatchVars empties any such variables which may survive from a previous match, setting them to the empty text.
192Array RE_Subexpressions --> 11;
193Array Allocated_Match_Vars --> 10;
194
195[ TEXT_TY_RE_DebugMatchVars txt
196 offset n i;
197 print RE_Subexpressions-->10, " collecting subexps^";
198 for (n=0:(n<RE_Subexpressions-->10) && (n<10): n++) {
199 offset = RE_Subexpressions-->n;
200 print "Subexp ", offset-->RE_PAR1,
201 " = [", offset-->RE_DATA1, ",", offset-->RE_DATA2, "] = ";
202 for (i=offset-->RE_DATA1:i<offset-->RE_DATA2:i++)
203 print (char) BlkValueRead(txt, i);
204 print "^";
205 }
206];
207
208[ TEXT_TY_RE_CreateMatchVars txt
209 offset n i ch ctxt cl csize;
210 for (n=0:(n<RE_Subexpressions-->10) && (n<10): n++) {
211 offset = RE_Subexpressions-->n;
212 if (Allocated_Match_Vars-->n) BlkValueFree(Allocated_Match_Vars-->n);
213 Allocated_Match_Vars-->n = BlkValueCreate(TEXT_TY);
214 TEXT_TY_Transmute(Allocated_Match_Vars-->n);
215 ctxt = Allocated_Match_Vars-->n;
216 csize = BlkValueLBCapacity(ctxt);
217 cl = 0;
218 for (i=offset-->RE_DATA1:i<offset-->RE_DATA2:i++) {
219 ch = BlkValueRead(txt, i);
220 if (cl+1 >= csize) {
221 if (BlkValueSetLBCapacity(ctxt, 2*cl) == false) break;
222 csize = BlkValueLBCapacity(ctxt);
223 }
224 BlkValueWrite(ctxt, cl++, ch);
225 }
226 BlkValueWrite(ctxt, cl, 0);
227 }
228];
229
230[ TEXT_TY_RE_EmptyMatchVars txt
231 n;
232 for (n=0:(n<RE_Subexpressions-->10) && (n<10): n++)
233 if (Allocated_Match_Vars-->n ~= 0)
234 BlkValueWrite(Allocated_Match_Vars-->n, 0, 0);
235];
236
237[ TEXT_TY_RE_GetMatchVar vn
238 offset;
239 if ((vn<0) || (vn>=10) || (vn >= RE_Subexpressions-->10)) return EMPTY_TEXT_VALUE;
240 offset = RE_Subexpressions-->vn;
241 if (offset == 0) return EMPTY_TEXT_VALUE;
242 if (offset-->RE_DATA1 < 0) return EMPTY_TEXT_VALUE;
243 if (Allocated_Match_Vars-->vn == 0) {
244 print "*** ", vn, " unallocated ***^";
245 return EMPTY_TEXT_VALUE;
246 }
247 return Allocated_Match_Vars-->vn;
248];
Markers.
At each node, the -->RE_DATA1 and -->RE_DATA2 fields represent the character positions of the start and end of the text matched by the node and its subtree (if any). These are called markers.
Thus TEXT_TY_MV_End(N, 0) returns the start of \N and TEXT_TY_MV_End(N, 1) the end of \N, according to the current match of subexpression N.
259[ TEXT_TY_MV_End n end
260 offset;
261 offset = RE_Subexpressions-->n;
262 if (end==0) return offset-->RE_DATA1;
263 return offset-->RE_DATA2;
264];
265
266[ TEXT_TY_RE_Clear_Markers token;
267 for (: token ~= NULL: token = token-->RE_NEXT) {
268 if (token-->RE_DOWN ~= NULL) TEXT_TY_RE_Clear_Markers(token-->RE_DOWN);
269 token-->RE_DATA1 = -1;
270 token-->RE_DATA2 = -1;
271 token-->RE_CONSTRAINT = -1;
272 }
273];
Debugging.
Code in this paragraph simply prints a convenient screen representation of the compiled regexp, together with the current values of its markers. It is invaluable for debugging purposes and, touch wood, may not be needed again, but it is relatively compact and we keep it just in case.
282[ TEXT_TY_RE_DebugTree ftxt detail;
283 print "Pattern: ", (TEXT_TY_Say) ftxt, "^";
284 TEXT_TY_RE_DebugSubtree(ftxt, 1, RE_PACKET_space, detail);
285];
286
287[ TEXT_TY_RE_DebugSubtree ftxt depth offset detail
288 cup;
289 if (offset ~= NULL) {
290 cup = offset-->RE_UP;
291 if (offset-->RE_PREVIOUS ~= NULL) print "*** broken initial previous ***^";
292 }
293 while (offset ~= NULL) {
294 if (offset-->RE_UP ~= cup) print "*** broken up matching ***^";
295 spaces(depth*2);
296 TEXT_TY_RE_DebugNode(offset, ftxt, detail);
297 if (offset-->RE_DOWN ~= NULL) {
298 if ((offset-->RE_DOWN)-->RE_UP ~= offset)
299 print "*** broken down/up ***^";
300 TEXT_TY_RE_DebugSubtree(ftxt, depth+1, offset-->RE_DOWN, detail);
301 }
302 if (offset-->RE_NEXT ~= NULL) {
303 if ((offset-->RE_NEXT)-->RE_PREVIOUS ~= offset)
304 print "*** broken next/previous ***^";
305 }
306 offset = offset-->RE_NEXT;
307 }
308];
309
310[ TEXT_TY_RE_DebugNode offset ftxt detail
311 i par1 par2 par3;
312 if (offset == NULL) "[NULL]";
313 print "[", (offset-RE_PACKET_space)/(RE_PACKET_SIZE_IN_BYTES), "] ";
314
315 par1 = offset-->RE_PAR1;
316 par2 = offset-->RE_PAR2;
317 par3 = offset-->RE_PAR3;
318 switch (offset-->RE_CCLASS) {
319 DIGIT_RE_CC: print "DIGIT";
320 NONDIGIT_RE_CC: print "NONDIGIT";
321 UCASE_RE_CC: print "UCASE";
322 NONUCASE_RE_CC: print "NONUCASE";
323 LCASE_RE_CC: print "LCASE";
324 NONLCASE_RE_CC: print "NONLCASE";
325 WHITESPACE_RE_CC: print "WHITESPACE";
326 NONWHITESPACE_RE_CC: print "NONWHITESPACE";
327 PUNCTUATION_RE_CC: print "PUNCTUATION";
328 NONPUNCTUATION_RE_CC: print "NONPUNCTUATION";
329 WORD_RE_CC: print "WORD";
330 NONWORD_RE_CC: print "NONWORD";
331 ALWAYS_RE_CC: print "ALWAYS";
332 NEVER_RE_CC: print "NEVER";
333 START_RE_CC: print "START";
334 END_RE_CC: print "END";
335 BOUNDARY_RE_CC: print "BOUNDARY";
336 NONBOUNDARY_RE_CC: print "NONBOUNDARY";
337 ANYTHING_RE_CC: print "ANYTHING";
338 NOTHING_RE_CC: print "NOTHING";
339 RANGE_RE_CC: print "RANGE"; if (par3 == true) print " (negated)";
340 print " ";
341 for (i=par1:i<par2:i++) print (char) BlkValueRead(ftxt, i);
342 VARIABLE_RE_CC: print "VARIABLE ", par1;
343 SUBEXP_RE_CC:
344 if (par1 == 0) print "EXP";
345 else print "SUBEXP ";
346 if (par1 >= 0) print "= V", par1;
347 if (par2 == 1) {
348 if (par3 == 0) print " (?=...) lookahead";
349 else print " (?<=...) lookbehind of width ", par3;
350 }
351 if (par2 == 2) {
352 if (par3 == 0) print " (?...) negated lookahead";
353 else print " (?<...) negated lookbehind of width ", par3;
354 }
355 if (par2 == 3) print " uncollecting";
356 if (par2 == 0 or 3) {
357 if (par3 == 1) print " forcing case sensitivity";
358 if (par3 == 2) print " forcing case insensitivity";
359 }
360 if (par2 == 4) print " (?>...) possessive";
361 NEWLINE_RE_CC: print "NEWLINE";
362 TAB_RE_CC: print "TAB";
363 QUANTIFIER_RE_CC: print "QUANTIFIER min=", par1, " max=", par2;
364 if (par3) print " (lazy)"; else print " (greedy)";
365 LITERAL_RE_CC: print "LITERAL";
366 print " ";
367 for (i=par1:i<par2:i++) print (char) BlkValueRead(ftxt, i);
368 DISJUNCTION_RE_CC: print "DISJUNCTION of ", par1, " choices";
369 CHOICE_RE_CC: print "CHOICE no ", par1;
370 SENSITIVITY_RE_CC: print "SENSITIVITY";
371 if (par1) print " off"; else print " on";
372 IF_RE_CC: print "IF"; if (par1 >= 1) print " = V", par1;
373 CONDITION_RE_CC: print "CONDITION"; if (par1 >= 1) print " = V", par1;
374 THEN_RE_CC: print "THEN";
375 ELSE_RE_CC: print "ELSE";
376 }
377 if (detail)
378 print ": ", offset-->RE_DATA1, ", ", offset-->RE_DATA2, ", ", offset-->RE_CONSTRAINT;
379 print "^";
380];
Compiling Tree For Substring Search.
When we search for a literal substring, say looking for "per" in "Supernumerary", we will in fact use the same apparatus as when searching for a regular expression: we compile a very simple node tree in which \0 as the root contains just one child node, a LITERAL_RE_CC matching exactly the text "per". We return 2 since that's the number of nodes in the tree.
390[ TEXT_TY_CHR_CompileTree ftxt exactly
391 root literal fto no_packets token attach_to;
392
393 fto = TEXT_TY_CharacterLength(ftxt);
394
395 root = TEXT_TY_RE_Node(0, SUBEXP_RE_CC, 0, 0, 0);
396 literal = TEXT_TY_RE_Node(1, LITERAL_RE_CC, 0, fto, 0);
397
398 root-->RE_DOWN = literal;
399 literal-->RE_UP = root;
400
401 if (exactly) {
402 no_packets = 2;
403 if (no_packets+3 > RE_MAX_PACKETS) return "regexp too complex";
404 exactly = RE_PACKET_space-->RE_DOWN;
405 token = TEXT_TY_RE_Node(no_packets++, START_RE_CC, 0, 0, 0);
406 RE_PACKET_space-->RE_DOWN = token; token-->RE_UP = RE_PACKET_space;
407 attach_to = TEXT_TY_RE_Node(no_packets++, SUBEXP_RE_CC, -1, 3, 0);
408 token-->RE_NEXT = attach_to; attach_to-->RE_PREVIOUS = token;
409 attach_to-->RE_UP = RE_PACKET_space;
410 attach_to-->RE_NEXT = TEXT_TY_RE_Node(no_packets++, END_RE_CC, 0, 0, 0);
411 (attach_to-->RE_NEXT)-->RE_PREVIOUS = attach_to;
412 (attach_to-->RE_NEXT)-->RE_UP = RE_PACKET_space;
413 attach_to-->RE_DOWN = exactly;
414 while (exactly ~= NULL) {
415 exactly-->RE_UP = attach_to; exactly = exactly-->RE_NEXT;
416 }
417 }
418
419 no_packets = TEXT_TY_RE_ExpandChoices(RE_PACKET_space, no_packets);
420];
Compiling Tree For Regexp Search.
But in general we need to compile a regular expression string into a tree of the kind described above, and here is the routine which does that, returning the number of nodes used to build the tree. The syntax it accepts is very fully documented in Writing with Inform, so no details are given here.
430Array Subexp_Posns --> 20;
431[ TEXT_TY_RE_CompileTree ftxt exactly
432 no_packets ffrom fto cc par1 par2 par3
433 quantifiable token attach_to no_subs blevel bits;
434
435 fto = TEXT_TY_CharacterLength(ftxt);
436 if (fto == 0) {
437 TEXT_TY_RE_Node(no_packets++, NEVER_RE_CC, 0, 0, 0);
438 return 1;
439 }
440
441 attach_to = TEXT_TY_RE_Node(no_packets++, SUBEXP_RE_CC, 0, 0, 0);
442 RE_Subexpressions-->0 = attach_to; RE_Subexpressions-->10 = 1; no_subs = 1;
443
444 quantifiable = false; blevel = 0;
445
446 for (ffrom = 0: ffrom < fto: ) {
447 cc = BlkValueRead(ftxt, ffrom++); par1 = 0; par2 = 0; par3 = 0;
448 if (cc == '\') {
449 if (ffrom == fto) return "Search pattern not terminated";
450 cc = BlkValueRead(ftxt, ffrom++);
451 switch (cc) {
452 'b': cc = BOUNDARY_RE_CC;
453 'B': cc = NONBOUNDARY_RE_CC;
454 'd': cc = DIGIT_RE_CC;
455 'D': cc = NONDIGIT_RE_CC;
456 'l': cc = LCASE_RE_CC;
457 'L': cc = NONLCASE_RE_CC;
458 'n': cc = NEWLINE_RE_CC;
459 'p': cc = PUNCTUATION_RE_CC;
460 'P': cc = NONPUNCTUATION_RE_CC;
461 's': cc = WHITESPACE_RE_CC;
462 'S': cc = NONWHITESPACE_RE_CC;
463 't': cc = TAB_RE_CC;
464 'u': cc = UCASE_RE_CC;
465 'U': cc = NONUCASE_RE_CC;
466 'w': cc = WORD_RE_CC;
467 'W': cc = NONWORD_RE_CC;
468 default:
469 if ((cc >= '1') && (cc <= '9')) {
470 par1 = cc-'0';
471 cc = VARIABLE_RE_CC;
472 } else {
473 if (((cc >= 'a') && (cc <= 'z')) ||
474 ((cc >= 'A') && (cc <= 'Z'))) return "unknown escape";
475 cc = LITERAL_RE_CC;
476 par1 = ffrom-1; par2 = ffrom;
477 }
478 }
479 quantifiable = true;
480 } else {
481 switch (cc) {
482 '(': par2 = 0;
483
484 if (BlkValueRead(ftxt, ffrom) == '?') {
485 ffrom++;
486 bits = true;
487 if (BlkValueRead(ftxt, ffrom) == '-') { ffrom++; bits = false; }
488 else if (BlkValueRead(ftxt, ffrom) == '<') { ffrom++; bits = false; }
489 switch (cc = BlkValueRead(ftxt, ffrom++)) {
490 '#': while (BlkValueRead(ftxt, ffrom++) ~= 0 or ')') ;
491 if (BlkValueRead(ftxt, ffrom-1) == 0)
492 return "comment never ends";
493 continue;
494 '(': cc = BlkValueRead(ftxt, ffrom);
495 if ((cc == '1' or '2' or '3' or '4' or
496 '5' or '6' or '7' or '8' or '9') &&
497 (BlkValueRead(ftxt, ffrom+1) ==')')) {
498 ffrom = ffrom + 2;
499 par1 = cc - '0';
500 } else ffrom--;
501 cc = IF_RE_CC;
502 quantifiable = false;
503 if (blevel == 20) return "subexpressions too deep";
504 Subexp_Posns-->(blevel++) = TEXT_TY_RE_NodeAddress(no_packets);
505 jump CClassKnown;
506 '=': par2 = 1;
507 par3 = 0; if (bits == false) par3 = -1;
508 '': par2 = 2;
509 par3 = 0; if (bits == false) par3 = -1;
510 ':': par2 = 3;
511 '>': par2 = 4;
512 default:
513 if (BlkValueRead(ftxt, ffrom) == ')') {
514 if (cc == 'i') {
515 cc = SENSITIVITY_RE_CC; par1 = bits; ffrom++;
516 jump CClassKnown;
517 }
518 }
519 if (BlkValueRead(ftxt, ffrom) == ':') {
520 if (cc == 'i') {
521 par1 = bits; par2 = 3; par3 = bits+1; ffrom++;
522 jump AllowForm;
523 }
524 }
525 return "unknown (?...) form";
526 }
527 }
528 .AllowForm;
529 if (par2 == 0) par1 = no_subs++; else par1 = -1;
530 cc = SUBEXP_RE_CC;
531 quantifiable = false;
532 if (blevel == 20) return "subexpressions too deep";
533 Subexp_Posns-->(blevel++) = TEXT_TY_RE_NodeAddress(no_packets);
534 ')': if (blevel == 0) return "subexpression bracket mismatch";
535 blevel--;
536 attach_to = Subexp_Posns-->blevel;
537 if (attach_to-->RE_DOWN == NULL) {
538 if (no_packets >= RE_MAX_PACKETS) return "regexp too complex";
539 attach_to-->RE_DOWN =
540 TEXT_TY_RE_Node(no_packets++, ALWAYS_RE_CC, 0, 0, 0);
541 (attach_to-->RE_DOWN)-->RE_UP = attach_to;
542 }
543 quantifiable = true;
544 continue;
545 '.': cc = ANYTHING_RE_CC; quantifiable = true;
546 '|': cc = CHOICE_RE_CC; quantifiable = false;
547 '^': cc = START_RE_CC; quantifiable = false;
548 '$': cc = END_RE_CC; quantifiable = false;
549 '{': if (quantifiable == false) return "quantifier misplaced";
550 par1 = 0; par2 = -1; bits = 1;
551 while ((cc=BlkValueRead(ftxt, ffrom++)) ~= 0 or '}') {
552 if (cc == ',') {
553 bits++;
554 if (bits >= 3) return "too many colons in ?{...}";
555 continue;
556 }
557 if ((cc >= '0') || (cc <= '9')) {
558 if (bits == 1) {
559 if (par1 < 0) par1 = 0;
560 par1 = par1*10 + (cc-'0');
561 } else {
562 if (par2 < 0) par2 = 0;
563 par2 = par2*10 + (cc-'0');
564 }
565 } else return "non-digit in ?{...}";
566 }
567 if (cc ~= '}') return "{x,y} quantifier never ends";
568 cc = QUANTIFIER_RE_CC;
569 if (par2 == -1) {
570 if (bits == 2) par2 = 30000;
571 else par2 = par1;
572 }
573 if (par1 > par2) return "{x,y} with x greater than y";
574 if (BlkValueRead(ftxt, ffrom) == '?') { ffrom++; par3 = true; }
575 quantifiable = false;
576 '<', '[': par3 = false; if (cc == '<') bits = '>'; else bits = ']';
577 if (BlkValueRead(ftxt, ffrom) == '^') { ffrom++; par3 = true; }
578 par1 = ffrom;
579 if (BlkValueRead(ftxt, ffrom) == bits) { ffrom++; }
580 while (cc ~= bits or 0) {
581 cc = BlkValueRead(ftxt, ffrom++);
582 if (cc == '\') {
583 cc = BlkValueRead(ftxt, ffrom++);
584 if (cc ~= 0) cc = BlkValueRead(ftxt, ffrom++);
585 }
586 }
587 if (cc == 0) return "Character range never ends";
588 par2 = ffrom-1;
589 if ((par2 > par1 + 1) &&
590 (BlkValueRead(ftxt, par1) == ':') &&
591 (BlkValueRead(ftxt, par2-1) == ':') &&
592 (BlkValueRead(ftxt, par2-2) ~= '\'))
593 return "POSIX named character classes unsupported";
594 bits = TEXT_TY_RE_RangeSyntaxCorrect(ftxt, par1, par2);
595 if (bits) return bits;
596 if (par1 < par2) cc = RANGE_RE_CC;
597 else cc = NOTHING_RE_CC;
598 quantifiable = true;
599 '*': if (quantifiable == false) return "quantifier misplaced";
600 cc = QUANTIFIER_RE_CC;
601 par1 = 0; par2 = 30000;
602 if (BlkValueRead(ftxt, ffrom) == '?') { ffrom++; par3 = true; }
603 quantifiable = false;
604 '+': if (quantifiable == false) return "quantifier misplaced";
605 cc = QUANTIFIER_RE_CC;
606 par1 = 1; par2 = 30000;
607 if (BlkValueRead(ftxt, ffrom) == '?') { ffrom++; par3 = true; }
608 quantifiable = false;
609 '?': if (quantifiable == false) return "quantifier misplaced";
610 cc = QUANTIFIER_RE_CC;
611 par1 = 0; par2 = 1;
612 if (BlkValueRead(ftxt, ffrom) == '?') { ffrom++; par3 = true; }
613 quantifiable = false;
614 }
615 }
616
617 .CClassKnown;
618
619 if (cc >= 0) {
620 quantifiable = true;
621 if ((attach_to-->RE_CCLASS == LITERAL_RE_CC) &&
622 (BlkValueRead(ftxt, ffrom) ~= '*' or '+' or '?' or '{')) {
623 (attach_to-->RE_PAR2)++;
624 if (TEXT_TY_RE_Trace == 2) {
625 print "Extending literal by ", cc, "=", (char) cc, "^";
626 }
627 continue;
628 }
629 cc = LITERAL_RE_CC; par1 = ffrom-1; par2 = ffrom;
630 }
631
632 if (no_packets >= RE_MAX_PACKETS) return "regexp too complex";
633
634 if (TEXT_TY_RE_Trace == 2) {
635 print "Attaching packet ", no_packets+1, " to ";
636 TEXT_TY_RE_DebugNode(attach_to, ftxt);
637 TEXT_TY_RE_DebugTree(ftxt);
638 }
639
640 token = TEXT_TY_RE_Node(no_packets++, cc, par1, par2, par3);
641
642 if ((token-->RE_CCLASS == SUBEXP_RE_CC) && (token-->RE_PAR2 == 0)) {
643 RE_Subexpressions-->(token-->RE_PAR1) = token;
644 (RE_Subexpressions-->10)++;
645 }
646
647 if ((attach_to-->RE_CCLASS == SUBEXP_RE_CC or CHOICE_RE_CC or IF_RE_CC) &&
648 (attach_to-->RE_DOWN == NULL)) {
649 attach_to-->RE_DOWN = token; token-->RE_UP = attach_to;
650 } else {
651 if ((token-->RE_CCLASS == CHOICE_RE_CC) &&
652 ((attach_to-->RE_UP)-->RE_CCLASS == CHOICE_RE_CC)) {
653 no_packets--; token = attach_to-->RE_UP;
654 } else {
655 if (token-->RE_CCLASS == CHOICE_RE_CC) {
656 while (attach_to-->RE_PREVIOUS ~= NULL)
657 attach_to = attach_to-->RE_PREVIOUS;
658 }
659 if (token-->RE_CCLASS == QUANTIFIER_RE_CC or CHOICE_RE_CC) {
660 token-->RE_PREVIOUS = attach_to-->RE_PREVIOUS;
661 token-->RE_UP = attach_to-->RE_UP;
662 if ((attach_to-->RE_UP ~= NULL) && (attach_to-->RE_PREVIOUS == NULL))
663 (attach_to-->RE_UP)-->RE_DOWN = token;
664 token-->RE_DOWN = attach_to;
665 bits = attach_to;
666 while (bits ~= NULL) {
667 bits-->RE_UP = token;
668 bits = bits-->RE_NEXT;
669 }
670 attach_to-->RE_PREVIOUS = NULL;
671 if (token-->RE_PREVIOUS ~= NULL)
672 (token-->RE_PREVIOUS)-->RE_NEXT = token;
673 } else {
674 attach_to-->RE_NEXT = token; token-->RE_PREVIOUS = attach_to;
675 token-->RE_UP = attach_to-->RE_UP;
676 }
677 }
678 }
679
680 if (token-->RE_CCLASS == CHOICE_RE_CC) {
681 if (no_packets >= RE_MAX_PACKETS) return "regexp too complex";
682 token-->RE_NEXT = TEXT_TY_RE_Node(no_packets++, CHOICE_RE_CC, 0, 0, 0);
683 (token-->RE_NEXT)-->RE_PREVIOUS = token;
684 (token-->RE_NEXT)-->RE_UP = token-->RE_UP;
685 token = token-->RE_NEXT;
686 }
687
688 attach_to = token;
689
690 if (TEXT_TY_RE_Trace == 2) {
691 print "Result:^";
692 TEXT_TY_RE_DebugTree(ftxt);
693 }
694
695 }
696
697 if (blevel ~= 0) return "subexpression bracket mismatch";
698
699 if (exactly) {
700 if (no_packets+3 > RE_MAX_PACKETS) return "regexp too complex";
701 exactly = RE_PACKET_space-->RE_DOWN;
702 token = TEXT_TY_RE_Node(no_packets++, START_RE_CC, 0, 0, 0);
703 RE_PACKET_space-->RE_DOWN = token; token-->RE_UP = RE_PACKET_space;
704 attach_to = TEXT_TY_RE_Node(no_packets++, SUBEXP_RE_CC, -1, 3, 0);
705 token-->RE_NEXT = attach_to; attach_to-->RE_PREVIOUS = token;
706 attach_to-->RE_UP = RE_PACKET_space;
707 attach_to-->RE_NEXT = TEXT_TY_RE_Node(no_packets++, END_RE_CC, 0, 0, 0);
708 (attach_to-->RE_NEXT)-->RE_PREVIOUS = attach_to;
709 (attach_to-->RE_NEXT)-->RE_UP = RE_PACKET_space;
710 attach_to-->RE_DOWN = exactly;
711 while (exactly ~= NULL) {
712 exactly-->RE_UP = attach_to; exactly = exactly-->RE_NEXT;
713 }
714 }
715
716 no_packets = TEXT_TY_RE_ExpandChoices(RE_PACKET_space, no_packets);
717
718 if (TEXT_TY_RE_Trace) {
719 print "Compiled pattern:^";
720 TEXT_TY_RE_DebugTree(ftxt);
721 }
722
723 bits = TEXT_TY_RE_CheckTree(RE_PACKET_space, no_subs); if (bits) return bits;
724
725 return no_packets;
726];
727
728[ TEXT_TY_RE_RangeSyntaxCorrect ftxt rf rt
729 i chm;
730 for (i=rf: i<rt: i++) {
731 chm = BlkValueRead(ftxt, i);
732 if ((chm == '\') && (i+1<rt)) {
733 chm = BlkValueRead(ftxt, ++i);
734 if (((chm >= 'a') && (chm <= 'z')) ||
735 ((chm >= 'A') && (chm <= 'Z'))) {
736 if (chm ~= 's' or 'S' or 'p' or 'P' or 'w' or 'W' or 'd'
737 or 'D' or 'n' or 't' or 'l' or 'L' or 'u' or 'U')
738 return "Invalid escape in {} range";
739 }
740 }
741 if ((i+2<rt) && (BlkValueRead(ftxt, i+1) == '-')) {
742 if (chm > BlkValueRead(ftxt, i+2)) return "Invalid {} range";
743 i=i+2;
744 }
745 }
746 rfalse;
747];
748
749[ TEXT_TY_RE_ExpandChoices token no_packets
750 rv prev nex holder new ct n cond_node then_node else_node;
751 while (token ~= NULL) {
752 if (token-->RE_CCLASS == IF_RE_CC) {
753 if ((token-->RE_DOWN)-->RE_CCLASS == CHOICE_RE_CC) {
754 for (nex=token-->RE_DOWN, n=0: nex~=NULL: nex=nex-->RE_NEXT) n++;
755 if (n~=2) return "conditional has too many clauses";
756 if (no_packets >= RE_MAX_PACKETS) return "regexp too complex";
757 cond_node = TEXT_TY_RE_Node(no_packets++, CONDITION_RE_CC, 0, 0, 0);
758 if (token-->RE_PAR1 >= 1) {
759 cond_node-->RE_PAR1 = token-->RE_PAR1;
760 }
761 then_node = token-->RE_DOWN;
762 then_node-->RE_CCLASS = THEN_RE_CC;
763 else_node = then_node-->RE_NEXT;
764 else_node-->RE_CCLASS = ELSE_RE_CC;
765 if (cond_node-->RE_PAR1 < 1) {
766 cond_node-->RE_DOWN = then_node-->RE_DOWN;
767 then_node-->RE_DOWN = (then_node-->RE_DOWN)-->RE_NEXT;
768 if (then_node-->RE_DOWN ~= NULL)
769 (then_node-->RE_DOWN)-->RE_PREVIOUS = NULL;
770 (cond_node-->RE_DOWN)-->RE_NEXT = NULL;
771 (cond_node-->RE_DOWN)-->RE_UP = cond_node;
772 }
773 token-->RE_DOWN = cond_node; cond_node-->RE_UP = token;
774 cond_node-->RE_NEXT = then_node; then_node-->RE_PREVIOUS = cond_node;
775 } else {
776 if (no_packets >= RE_MAX_PACKETS) return "regexp too complex";
777 cond_node = TEXT_TY_RE_Node(no_packets++, CONDITION_RE_CC, 0, 0, 0);
778 if (no_packets >= RE_MAX_PACKETS) return "regexp too complex";
779 then_node = TEXT_TY_RE_Node(no_packets++, THEN_RE_CC, 0, 0, 0);
780 if (token-->RE_PAR1 >= 1) {
781 cond_node-->RE_PAR1 = token-->RE_PAR1;
782 then_node-->RE_DOWN = token-->RE_DOWN;
783 } else {
784 cond_node-->RE_DOWN = token-->RE_DOWN;
785 then_node-->RE_DOWN = (token-->RE_DOWN)-->RE_NEXT;
786 (cond_node-->RE_DOWN)-->RE_NEXT = NULL;
787 (cond_node-->RE_DOWN)-->RE_UP = cond_node;
788 }
789 token-->RE_DOWN = cond_node;
790 cond_node-->RE_UP = token; cond_node-->RE_NEXT = then_node;
791 then_node-->RE_PREVIOUS = cond_node; then_node-->RE_UP = token;
792 then_node-->RE_NEXT = NULL;
793 if (then_node-->RE_DOWN ~= NULL)
794 (then_node-->RE_DOWN)-->RE_PREVIOUS = NULL;
795 for (nex = then_node-->RE_DOWN: nex ~= NULL: nex = nex-->RE_NEXT) {
796 nex-->RE_UP = then_node;
797 }
798 }
799
800 if (cond_node-->RE_DOWN ~= NULL) {
801 nex = cond_node-->RE_DOWN;
802 if ((nex-->RE_CCLASS ~= SUBEXP_RE_CC) ||
803 (nex-->RE_NEXT ~= NULL) ||
804 (nex-->RE_PAR2 ~= 1 or 2)) {
805
806 return "condition not lookahead/behind";
807 }
808 }
809 }
810 if ((token-->RE_CCLASS == CHOICE_RE_CC) && (token-->RE_PAR1 < 1)) {
811 prev = token-->RE_PREVIOUS;
812 nex = token-->RE_NEXT;
813 while ((nex ~= NULL) && (nex-->RE_CCLASS == CHOICE_RE_CC))
814 nex = nex-->RE_NEXT;
815 holder = token-->RE_UP; if (holder == NULL) return "bang";
816 if (no_packets >= RE_MAX_PACKETS) return "regexp too complex";
817 new = TEXT_TY_RE_Node(no_packets++, DISJUNCTION_RE_CC, 0, 0, 0);
818 holder-->RE_DOWN = new; new-->RE_UP = holder;
819 if (prev ~= NULL) {
820 prev-->RE_NEXT = new; new-->RE_PREVIOUS = prev;
821 }
822 if (nex ~= NULL) {
823 nex-->RE_PREVIOUS = new; new-->RE_NEXT = nex;
824 }
825 new-->RE_DOWN = token;
826 token-->RE_PREVIOUS = NULL;
827 ct = 1;
828 while (token ~= NULL) {
829 token-->RE_PAR1 = ct++;
830 token-->RE_UP = new;
831 if ((token-->RE_NEXT ~= NULL) &&
832 ((token-->RE_NEXT)-->RE_CCLASS ~= CHOICE_RE_CC))
833 token-->RE_NEXT = NULL;
834 token = token-->RE_NEXT;
835 }
836 new-->RE_PAR1 = ct-1;
837 if (token ~= NULL) token-->RE_NEXT = NULL;
838 token = new; continue;
839 }
840 if (token-->RE_DOWN ~= NULL) {
841 no_packets = TEXT_TY_RE_ExpandChoices(token-->RE_DOWN, no_packets);
842 if ((no_packets<0) || (no_packets >= RE_MAX_PACKETS)) break;
843 }
844 token = token-->RE_NEXT;
845 }
846 return no_packets;
847];
848
849[ TEXT_TY_RE_CheckTree token no_subs
850 rv;
851 while (token ~= NULL) {
852 if (token-->RE_CCLASS == VARIABLE_RE_CC) {
853 if (token-->RE_PAR1 >= no_subs) return "reference to nonexistent group";
854 }
855 if ((token-->RE_CCLASS == SUBEXP_RE_CC) &&
856 (token-->RE_PAR2 == 1 or 2) &&
857 (token-->RE_PAR3 == -1)) {
858 token-->RE_PAR3 = TEXT_TY_RE_Width(token-->RE_DOWN);
859 if (token-->RE_PAR3 == -1) return "variable length lookbehind not implemented";
860 }
861 if (token-->RE_DOWN ~= NULL) {
862 rv = TEXT_TY_RE_CheckTree(token-->RE_DOWN, no_subs);
863 if (rv) return rv;
864 }
865 token = token-->RE_NEXT;
866 }
867 rfalse;
868];
869
870[ TEXT_TY_RE_Width token downwards
871 w rv aw choice;
872 while (token ~= NULL) {
873 switch (token-->RE_CCLASS) {
874 DIGIT_RE_CC, NONDIGIT_RE_CC, WHITESPACE_RE_CC, NONWHITESPACE_RE_CC,
875 PUNCTUATION_RE_CC, NONPUNCTUATION_RE_CC, WORD_RE_CC, NONWORD_RE_CC,
876 ANYTHING_RE_CC, NOTHING_RE_CC, RANGE_RE_CC, NEWLINE_RE_CC, TAB_RE_CC,
877 UCASE_RE_CC, NONUCASE_RE_CC, LCASE_RE_CC, NONLCASE_RE_CC:
878 w++;
879 START_RE_CC, END_RE_CC, BOUNDARY_RE_CC, NONBOUNDARY_RE_CC, ALWAYS_RE_CC:
880 ;
881 LITERAL_RE_CC:
882 w = w + token-->RE_PAR2 - token-->RE_PAR1;
883 VARIABLE_RE_CC:
884 return -1;
885 IF_RE_CC:
886 rv = TEXT_TY_RE_Width((token-->RE_DOWN)-->RE_NEXT);
887 if (rv == -1) return -1;
888 if (rv ~= TEXT_TY_RE_Width(((token-->RE_DOWN)-->RE_NEXT)-->RE_NEXT))
889 return -1;
890 w = w + rv;
891 SUBEXP_RE_CC:
892 if (token-->RE_PAR2 == 1 or 2) rv = 0;
893 else {
894 rv = TEXT_TY_RE_Width(token-->RE_DOWN);
895 if (rv == -1) return -1;
896 }
897 w = w + rv;
898 QUANTIFIER_RE_CC:
899 if (token-->RE_PAR1 ~= token-->RE_PAR2) return -1;
900 rv = TEXT_TY_RE_Width(token-->RE_DOWN);
901 if (rv == -1) return -1;
902 w = w + rv*(token-->RE_PAR1);
903 DISJUNCTION_RE_CC:
904 aw = -1;
905 for (choice = token-->RE_DOWN: choice ~= NULL: choice = choice-->RE_NEXT) {
906 rv = TEXT_TY_RE_Width(choice-->RE_DOWN);
907
908 if (rv == -1) return -1;
909 if ((aw >= 0) && (aw ~= rv)) return -1;
910 aw = rv;
911 }
912 w = w + aw;
913 SENSITIVITY_RE_CC:
914 ;
915 }
916 if (downwards) return w;
917 if (token ~= NULL) token = token-->RE_NEXT;
918 }
919 return w;
920];
Parser.
The virtue of all of that tree compilation is that the code which actually does the work – which parses the source text to see if the regular expression matches it – is much shorter and quicker: indeed, it takes up fewer lines than the compiler part, which goes to show that decoding regular expression syntax is a more complex task than acting upon it. This would have surprised the pioneers of regexp, but the syntax has become much more complicated over the decades because of a steady increase in the number of extended notations. The process shows no sign of stopping, with Python and PCRE continuing to push boundaries beyond Perl, which was once thought the superest, duperest regexp syntax there could be. However: to work.
The main matcher simply starts a recursive subroutine to perform the match. However, the subroutine tests for a match at a particular position in the source text; so the main routine tries the subroutine everywhere convenient in the source text, from left to right, until a match is made – unless the regexp is constrained by a ^ glyph to begin matching at the start of the source text, which will cause a START_RE_CC node to be the eldest child of the \0 root.
943Global TEXT_TY_RE_RewindCount;
944[ TEXT_TY_RE_PrintNoRewinds; print TEXT_TY_RE_RewindCount; ];
945
946Constant CIS_MFLAG = 1;
947Constant ACCUM_MFLAG = 2;
948
949[ TEXT_TY_RE_Parse ftxt txt ipos insens
950 ilen rv root i initial_mode;
951
952 ilen = TEXT_TY_CharacterLength(txt);
953 if ((ipos<0) || (ipos>ilen)) return -1;
954
955 root = RE_PACKET_space;
956
957 initial_mode = 0; if (insens) initial_mode = CIS_MFLAG;
958
959 TEXT_TY_RE_Clear_Markers(RE_PACKET_space);
960
961 for (:ipos<=ilen:ipos++) {
962 if ((RE_PACKET_space-->RE_DOWN ~= NULL) &&
963 ((RE_PACKET_space-->RE_DOWN)-->RE_CCLASS == START_RE_CC) &&
964 (ipos>0)) { rv = -1; break; }
965 if (ipos > 0) TEXT_TY_RE_EraseConstraints(RE_PACKET_space, initial_mode);
966 TEXT_TY_RE_RewindCount = 0;
967 rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ilen, RE_PACKET_space, initial_mode);
968 if (rv >= 0) break;
969 }
970
971 if (rv == -1) {
972 root-->RE_DATA1 = -1;
973 root-->RE_DATA2 = -1;
974 } else {
975 root-->RE_DATA1 = ipos;
976 root-->RE_DATA2 = ipos+rv;
977 }
978 return rv;
979];
Parse At Position.
TEXT_TY_RE_ParseAtPosition(ftxt, txt, ifrom, ito) attempts to match text beginning at position ifrom in the text txt and extending for any length up to position ito: it returns the number of characters which were matched (which can legitimately be 0), or -1 if no match could be made. ftxt is the original text of the regular expression in its precompiled form, which we need partly to print good debugging information, but mostly in order to match against a LITERAL_RE_CC node.
991[ TEXT_TY_RE_ParseAtPosition ftxt txt ifrom ito token mode_flags
992 outcome ipos npos rv i ch edge rewind_this;
993
994 if (ifrom > ito) return -1;
995
996 ipos = ifrom;
997
998 .Rewind;
999 while (token ~= NULL) {
1000 outcome = false;
1001 if (TEXT_TY_RE_Trace) {
1002 print "Matching at ", ipos, ": ";
1003 TEXT_TY_RE_DebugNode(token, ftxt, true);
1004 }
1005
1006 if (ipos<ito) ch = BlkValueRead(txt, ipos); else ch = 0;
1007
1008 token-->RE_MODES = mode_flags;
1009
1010 switch (token-->RE_CCLASS) {
1011
1012
1013
1014 CHOICE_RE_CC: return "internal error";
1015
1016
1017
1018 SENSITIVITY_RE_CC:
1019 if (token-->RE_PAR1) mode_flags = mode_flags | CIS_MFLAG;
1020 else mode_flags = mode_flags & (~CIS_MFLAG);
1021 outcome = true;
1022
1023
1024
1025 ALWAYS_RE_CC:
1026 outcome = true;
1027 NEVER_RE_CC:
1028 START_RE_CC:
1029 if (ipos == 0) outcome = true;
1030 END_RE_CC:
1031 if (BlkValueRead(txt, ipos) == 0) outcome = true;
1032 BOUNDARY_RE_CC:
1033 rv = 0;
1034 if (BlkValueRead(txt, ipos) == 0 or 10 or 13 or 32 or 9
1035 or '.' or ',' or '' or '?'
1036 or '-' or '/' or '' or ':' or ';'
1037 or '(' or ')' or '[' or ']' or '{' or '}') rv++;
1038 if (ipos == 0) ch = 0;
1039 else ch = BlkValueRead(txt, ipos-1);
1040 if (ch == 0 or 10 or 13 or 32 or 9
1041 or '.' or ',' or '' or '?'
1042 or '-' or '/' or '' or ':' or ';'
1043 or '(' or ')' or '[' or ']' or '{' or '}') rv++;
1044 if (rv == 1) outcome = true;
1045 NONBOUNDARY_RE_CC:
1046 rv = 0;
1047 if (BlkValueRead(txt, ipos) == 0 or 10 or 13 or 32 or 9
1048 or '.' or ',' or '' or '?'
1049 or '-' or '/' or '' or ':' or ';'
1050 or '(' or ')' or '[' or ']' or '{' or '}') rv++;
1051 if (ipos == 0) ch = 0;
1052 else ch = BlkValueRead(txt, ipos-1);
1053 if (ch == 0 or 10 or 13 or 32 or 9
1054 or '.' or ',' or '' or '?'
1055 or '-' or '/' or '' or ':' or ';'
1056 or '(' or ')' or '[' or ']' or '{' or '}') rv++;
1057 if (rv ~= 1) outcome = true;
1058
1059
1060
1061 IF_RE_CC:
1062 i = token-->RE_PAR1; ch = false;
1063 if (TEXT_TY_RE_Trace) {
1064 print "Trying conditional from ", ipos, ": ";
1065 TEXT_TY_RE_DebugNode(token, ftxt, true);
1066 }
1067 if (i >= 1) {
1068 if ((i<RE_Subexpressions-->10) &&
1069 ((RE_Subexpressions-->i)-->RE_DATA1 >= 0)) ch = true;
1070 } else {
1071 rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito,
1072 (token-->RE_DOWN)-->RE_DOWN, mode_flags);
1073 if (rv >= 0) ch = true;
1074 }
1075 if (TEXT_TY_RE_Trace) {
1076 print "Condition found to be ", ch, "^";
1077 }
1078 if (ch) {
1079 rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito,
1080 ((token-->RE_DOWN)-->RE_NEXT)-->RE_DOWN, mode_flags);
1081
1082 } else {
1083 if ((((token-->RE_DOWN)-->RE_NEXT)-->RE_NEXT) == NULL)
1084 rv = 0;
1085 else rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito,
1086 (((token-->RE_DOWN)-->RE_NEXT)-->RE_NEXT)-->RE_DOWN, mode_flags);
1087
1088 }
1089 if (rv >= 0) {
1090 outcome = true;
1091 ipos = ipos + rv;
1092 }
1093 DISJUNCTION_RE_CC:
1094 if (TEXT_TY_RE_Trace) {
1095 print "Trying disjunction from ", ipos, ": ";
1096 TEXT_TY_RE_DebugNode(token, ftxt, true);
1097 }
1098 for (ch = token-->RE_DOWN: ch ~= NULL: ch = ch-->RE_NEXT) {
1099 if (ch-->RE_PAR1 <= token-->RE_CONSTRAINT) continue;
1100 if (TEXT_TY_RE_Trace) {
1101 print "Trying choice at ", ipos, ": ";
1102 TEXT_TY_RE_DebugNode(ch, ftxt, true);
1103 }
1104 rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito,
1105 ch-->RE_DOWN, mode_flags);
1106 if (rv >= 0) {
1107 token-->RE_DATA1 = ipos;
1108 token-->RE_DATA2 = ch-->RE_PAR1;
1109 ipos = ipos + rv;
1110 outcome = true;
1111 if (TEXT_TY_RE_Trace) {
1112 print "Choice worked with width ", rv, ": ";
1113 TEXT_TY_RE_DebugNode(ch, ftxt, true);
1114 }
1115 break;
1116 } else {
1117 if (mode_flags & ACCUM_MFLAG == false)
1118 TEXT_TY_RE_FailSubexpressions(ch-->RE_DOWN);
1119 }
1120 }
1121 if (outcome == false) {
1122 if (TEXT_TY_RE_Trace) {
1123 print "Failed disjunction from ", ipos, ": ";
1124 TEXT_TY_RE_DebugNode(token, ftxt, true);
1125 }
1126 token-->RE_DATA1 = ipos;
1127 token-->RE_DATA2 = -1;
1128 }
1129 SUBEXP_RE_CC:
1130 if (token-->RE_PAR2 == 1 or 2) {
1131 npos = ipos - token-->RE_PAR3;
1132 if (npos<0) rv = -1;
1133 else rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, npos, ito, token-->RE_DOWN,
1134 mode_flags);
1135 } else {
1136 switch (token-->RE_PAR3) {
1137 0: rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito, token-->RE_DOWN,
1138 mode_flags);
1139 1: rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito, token-->RE_DOWN,
1140 mode_flags & (~CIS_MFLAG));
1141 2: rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, ipos, ito, token-->RE_DOWN,
1142 mode_flags | CIS_MFLAG);
1143 }
1144 }
1145 npos = ipos;
1146 if (rv >= 0) npos = ipos + rv;
1147 switch (token-->RE_PAR2) {
1148 1: if (rv >= 0) rv = 0;
1149 2: if (rv >= 0) rv = -1; else rv = 0;
1150 }
1151 if (rv >= 0) {
1152 token-->RE_DATA1 = ipos;
1153 ipos = ipos + rv;
1154 token-->RE_DATA2 = npos;
1155 outcome = true;
1156 } else {
1157 if (mode_flags & ACCUM_MFLAG == false) {
1158 token-->RE_DATA1 = -1;
1159 token-->RE_DATA2 = -1;
1160 }
1161 }
1162 if (token-->RE_PAR2 == 2) TEXT_TY_RE_FailSubexpressions(token, true);
1163 QUANTIFIER_RE_CC:
1164 token-->RE_DATA1 = ipos;
1165 if ((token-->RE_DOWN)-->RE_CCLASS == SUBEXP_RE_CC) {
1166 (token-->RE_DOWN)-->RE_CACHE1 = -1;
1167 (token-->RE_DOWN)-->RE_CACHE2 = -1;
1168 }
1169 if (TEXT_TY_RE_Trace) {
1170 print "Trying quantifier from ", ipos, ": ";
1171 TEXT_TY_RE_DebugNode(token, ftxt, true);
1172 }
1173 if (token-->RE_PAR3 == false) {
1174
1175 edge = token-->RE_PAR2;
1176 if (token-->RE_CONSTRAINT >= 0) edge = token-->RE_CONSTRAINT;
1177 rv = -1;
1178 for (i=0, npos=ipos: i<edge: i++) {
1179 if (TEXT_TY_RE_Trace) {
1180 print "Trying quant rep ", i+1, " at ", npos, ": ";
1181 TEXT_TY_RE_DebugNode(token, ftxt, true);
1182 }
1183 rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, npos, ito, token-->RE_DOWN,
1184 mode_flags | ACCUM_MFLAG);
1185 if (rv < 0) break;
1186 if ((token-->RE_DOWN)-->RE_CCLASS == SUBEXP_RE_CC) {
1187 (token-->RE_DOWN)-->RE_CACHE1 = (token-->RE_DOWN)-->RE_DATA1;
1188 (token-->RE_DOWN)-->RE_CACHE2 = (token-->RE_DOWN)-->RE_DATA2;
1189 }
1190 if ((rv == 0) && (token-->RE_PAR2 == 30000) && (i>=1)) { i++; break; }
1191 npos = npos + rv;
1192 }
1193 if ((i >= token-->RE_PAR1) && (i <= token-->RE_PAR2))
1194 outcome = true;
1195 } else {
1196 edge = token-->RE_PAR1;
1197 if (token-->RE_CONSTRAINT > edge) edge = token-->RE_CONSTRAINT;
1198 for (i=0, npos=ipos: (npos<ito) && (i < token-->RE_PAR2): i++) {
1199 if (i >= edge) break;
1200 if (TEXT_TY_RE_Trace) {
1201 print "Trying quant rep ", i+1, " at ", npos, ": ";
1202 TEXT_TY_RE_DebugNode(token, ftxt, true);
1203 }
1204 rv = TEXT_TY_RE_ParseAtPosition(ftxt, txt, npos, ito, token-->RE_DOWN,
1205 mode_flags | ACCUM_MFLAG);
1206 if (rv < 0) break;
1207 if ((token-->RE_DOWN)-->RE_CCLASS == SUBEXP_RE_CC) {
1208 (token-->RE_DOWN)-->RE_CACHE1 = (token-->RE_DOWN)-->RE_DATA1;
1209 (token-->RE_DOWN)-->RE_CACHE2 = (token-->RE_DOWN)-->RE_DATA2;
1210 }
1211 if ((rv == 0) && (token-->RE_PAR2 == 30000) && (i>=1)) { i++; break; }
1212 npos = npos + rv;
1213 }
1214 if ((i >= edge) && (i <= token-->RE_PAR2))
1215 outcome = true;
1216 }
1217 if (outcome) {
1218 if (token-->RE_PAR3 == false) {
1219 if (i > token-->RE_PAR1) {
1220 token-->RE_DATA2 = i-1;
1221 } else {
1222 token-->RE_DATA2 = -1;
1223 }
1224 } else {
1225 if (i < token-->RE_PAR2) {
1226 token-->RE_DATA2 = i+1;
1227 } else {
1228 token-->RE_DATA2 = -1;
1229 }
1230 }
1231 ipos = npos;
1232 if ((i == 0) && (mode_flags & ACCUM_MFLAG == false))
1233 TEXT_TY_RE_FailSubexpressions(token-->RE_DOWN);
1234 if ((token-->RE_DOWN)-->RE_CCLASS == SUBEXP_RE_CC) {
1235 (token-->RE_DOWN)-->RE_DATA1 = (token-->RE_DOWN)-->RE_CACHE1;
1236 (token-->RE_DOWN)-->RE_DATA2 = (token-->RE_DOWN)-->RE_CACHE2;
1237 }
1238 if (TEXT_TY_RE_Trace) {
1239 print "Successful quant reps ", i, ": ";
1240 TEXT_TY_RE_DebugNode(token, ftxt, true);
1241 }
1242 } else {
1243
1244 if (mode_flags & ACCUM_MFLAG == false)
1245 TEXT_TY_RE_FailSubexpressions(token-->RE_DOWN);
1246 if (TEXT_TY_RE_Trace) {
1247 print "Failed quant reps ", i, ": ";
1248 TEXT_TY_RE_DebugNode(token, ftxt, true);
1249 }
1250 }
1251
1252
1253
1254 NOTHING_RE_CC: ;
1255 ANYTHING_RE_CC: if (ch) outcome = true; ipos++;
1256 WHITESPACE_RE_CC:
1257 if (ch == 10 or 13 or 32 or 9) { outcome = true; ipos++; }
1258 NONWHITESPACE_RE_CC:
1259 if ((ch) && (ch ~= 10 or 13 or 32 or 9)) { outcome = true; ipos++; }
1260 PUNCTUATION_RE_CC:
1261 if (ch == '.' or ',' or '' or '?'
1262 or '-' or '/' or '' or ':' or ';'
1263 or '(' or ')' or '[' or ']' or '{' or '}') { outcome = true; ipos++; }
1264 NONPUNCTUATION_RE_CC:
1265 if ((ch) && (ch ~= '.' or ',' or '' or '?'
1266 or '-' or '/' or '' or ':' or ';'
1267 or '(' or ')' or '[' or ']' or '{' or '}')) { outcome = true; ipos++; }
1268 WORD_RE_CC:
1269 if ((ch) && (ch ~= 10 or 13 or 32 or 9
1270 or '.' or ',' or '' or '?'
1271 or '-' or '/' or '' or ':' or ';'
1272 or '(' or ')' or '[' or ']' or '{' or '}')) { outcome = true; ipos++; }
1273 NONWORD_RE_CC:
1274 if (ch == 10 or 13 or 32 or 9
1275 or '.' or ',' or '' or '?'
1276 or '-' or '/' or '' or ':' or ';'
1277 or '(' or ')' or '[' or ']' or '{' or '}') { outcome = true; ipos++; }
1278 DIGIT_RE_CC:
1279 if (ch == '0' or '1' or '2' or '3' or '4'
1280 or '5' or '6' or '7' or '8' or '9') { outcome = true; ipos++; }
1281 NONDIGIT_RE_CC:
1282 if ((ch) && (ch ~= '0' or '1' or '2' or '3' or '4'
1283 or '5' or '6' or '7' or '8' or '9')) { outcome = true; ipos++; }
1284 LCASE_RE_CC:
1285 if (CharIsOfCase(ch, 0)) { outcome = true; ipos++; }
1286 NONLCASE_RE_CC:
1287 if ((ch) && (CharIsOfCase(ch, 0) == false)) { outcome = true; ipos++; }
1288 UCASE_RE_CC:
1289 if (CharIsOfCase(ch, 1)) { outcome = true; ipos++; }
1290 NONUCASE_RE_CC:
1291 if ((ch) && (CharIsOfCase(ch, 1) == false)) { outcome = true; ipos++; }
1292 NEWLINE_RE_CC: if (ch == 10) { outcome = true; ipos++; }
1293 TAB_RE_CC: if (ch == 9) { outcome = true; ipos++; }
1294 RANGE_RE_CC:
1295 if (TEXT_TY_RE_Range(ch, ftxt,
1296 token-->RE_PAR1, token-->RE_PAR2, token-->RE_PAR3, mode_flags & CIS_MFLAG))
1297 { outcome = true; ipos++; }
1298
1299
1300
1301 LITERAL_RE_CC:
1302 rv = TEXT_TY_RE_MatchSubstring(txt, ipos,
1303 ftxt, token-->RE_PAR1, token-->RE_PAR2, mode_flags & CIS_MFLAG);
1304 if (rv >= 0) { ipos = ipos + rv; outcome = true; }
1305 VARIABLE_RE_CC:
1306 i = token-->RE_PAR1;
1307 if ((RE_Subexpressions-->i)-->RE_DATA1 >= 0) {
1308 rv = TEXT_TY_RE_MatchSubstring(txt, ipos,
1309 txt, (RE_Subexpressions-->i)-->RE_DATA1,
1310 (RE_Subexpressions-->i)-->RE_DATA2, mode_flags & CIS_MFLAG);
1311 if (rv >= 0) { ipos = ipos + rv; outcome = true; }
1312 }
1313 .NeverMatchIncompleteVar;
1314 }
1315
1316 if (outcome == false) {
1317 if (TEXT_TY_RE_RewindCount++ >= 10000) {
1318 if (TEXT_TY_RE_RewindCount == 10001) {
1319 style bold; print "OVERFLOW^"; style roman;
1320 }
1321 return -1;
1322 }
1323 if (TEXT_TY_RE_Trace) {
1324 print "Rewind sought from failure at pos ", ipos, " with: ";
1325 TEXT_TY_RE_DebugNode(token, ftxt, true);
1326 }
1327 if ((token-->RE_CCLASS == QUANTIFIER_RE_CC) &&
1328 (TEXT_TY_RE_SeekBacktrack(token-->RE_DOWN, ftxt, false, ito, false)))
1329 jump RewindFound;
1330 if (mode_flags & ACCUM_MFLAG == false) TEXT_TY_RE_FailSubexpressions(token);
1331 token = token-->RE_PREVIOUS;
1332 while (token ~= NULL) {
1333 if (TEXT_TY_RE_SeekBacktrack(token, ftxt, true, ito, false)) {
1334 .RewindFound;
1335 ipos = token-->RE_DATA1;
1336 mode_flags = token-->RE_MODES;
1337 if (mode_flags & ACCUM_MFLAG == false)
1338 TEXT_TY_RE_FailSubexpressions(token, true);
1339 if (ipos == -1)
1340 TEXT_TY_RE_DebugTree(ftxt, true);
1341 if (TEXT_TY_RE_Trace) {
1342 print "^[", ifrom, ",", ito, "] rewinding to ", ipos, " at ";
1343 TEXT_TY_RE_DebugNode(token, ftxt, true);
1344 }
1345 jump Rewind;
1346 }
1347 token = token-->RE_PREVIOUS;
1348 }
1349 if (TEXT_TY_RE_Trace)
1350 print "^Rewind impossible^";
1351 return -1;
1352 }
1353
1354 token = token-->RE_NEXT;
1355 }
1356 return ipos - ifrom;
1357];
Backtracking.
It would be very straightforward to match regular expressions with the above recursive code if, for every node, there were a fixed number of characters (depending on the node) such that there would either be a match eating that many characters, or else no match at all. If that were true, we could simply march through the text matching until we could match no more, and although some nodes might have ambiguous readings, we could always match the first possibility which worked. There would never be any need to retreat.
Well, in fact that happy state does apply to a surprising number of nodes, and some quite complicated regular expressions can be made which use only them: <abc>{2}\d\d\1, for instance, matches a sequence of exactly 6 characters or else fails to match altogether, and there is never any need to backtrack. One reason why backtracking is a fairly good algorithm in practice is that these "good" cases occur fairly often, in subexpressions if not in the entire expression, and the simple method above disposes of them efficiently.
But in an expression like ab+bb, there is no alternative to backtracking if we are going to try to match the nodes from left to right: we match the "a", then we match as many "b"s as we can – but then we find that we have to match "bb", and this is necessarily impossible because we have just eaten all of the "b"s available. We therefore backtrack one node to the b+ and try again. We obviously can't literally try again because that would give the same result: instead we impose a constraint. Suppose it previously matched a row of 23 letter "b"s, so that the quantifier + resulted in a multiplicity of 23. We then constrain the node and in effect consider it to be b{1,22}, that is, to match at least 1 and at most 22 letter "b"s. That still won't work, as it happens, so we backtrack again with a constraint tightened to make it b{1,21}, and now the match occurs as we would hope. When the expression becomes more complex, backtracking becomes a longer-distance, recursive procedure – we have to exhaust all possibilities of a more recent node before tracking back to one from longer ago. (This is why the worst test cases are those which entice us into a long, long series of matches only to find that a guess made right back at the start was ill-fated.)
Rather than describing TEXT_TY_RE_SeekBacktrack in detail here, it is probably more useful to suggest that the reader observe it in action by setting TEXT_TY_RE_Trace and trying a few regular expressions.
1402[ TEXT_TY_RE_SeekBacktrack token ftxt downwards ito report_only
1403 untried;
1404 for (: token ~= NULL: token = token-->RE_NEXT) {
1405 if ((TEXT_TY_RE_Trace) && (report_only == false)) {
1406 print "Scan for rewind: ";
1407 TEXT_TY_RE_DebugNode(token, ftxt, true);
1408 }
1409 if ((token-->RE_CCLASS == SUBEXP_RE_CC) &&
1410 (token-->RE_PAR2 == 1 or 2 or 4)) {
1411 if (downwards) rfalse;
1412 continue;
1413 }
1414 if (token-->RE_DOWN ~= NULL) {
1415 if ((TEXT_TY_RE_Trace) && (report_only == false)) print "Descend^";
1416 if (TEXT_TY_RE_SeekBacktrack(token-->RE_DOWN, ftxt, false, ito, report_only)) rtrue;
1417 }
1418 untried = false;
1419 switch (token-->RE_CCLASS) {
1420 DISJUNCTION_RE_CC:
1421 if ((token-->RE_DATA2 >= 1) &&
1422 (token-->RE_DATA2 < token-->RE_PAR1) &&
1423 (token-->RE_CONSTRAINT < token-->RE_PAR1)) {
1424 if (report_only) rtrue;
1425 if (token-->RE_CONSTRAINT == -1)
1426 token-->RE_CONSTRAINT = 1;
1427 else
1428 (token-->RE_CONSTRAINT)++;
1429 untried = true;
1430 }
1431 QUANTIFIER_RE_CC:
1432 if (token-->RE_CONSTRAINT ~= -2) {
1433 if ((TEXT_TY_RE_Trace) && (report_only == false)) {
1434 print "Quant with cons not -2: ";
1435 TEXT_TY_RE_DebugNode(token, ftxt, true);
1436 }
1437 if (token-->RE_DATA2 >= 0) {
1438 if (report_only) rtrue;
1439 token-->RE_CONSTRAINT = token-->RE_DATA2;
1440 untried = true;
1441 }
1442 }
1443 }
1444 if (untried) {
1445 if (TEXT_TY_RE_Trace) {
1446 print "Grounds for rewind at: ";
1447 TEXT_TY_RE_DebugNode(token, ftxt, true);
1448 }
1449 TEXT_TY_RE_EraseConstraints(token-->RE_NEXT);
1450 TEXT_TY_RE_EraseConstraints(token-->RE_DOWN);
1451 rtrue;
1452 }
1453 if (downwards) rfalse;
1454 }
1455 rfalse;
1456];
Fail Subexpressions.
Here, an attempt to make a complicated match against the node in token has failed: that means that any subexpressions which were matched in the course of the attempt must also in retrospect be considered unmatched. So we work down through the subtree at token and empty any markers for subexpressions, which in effect clears their backslash variables – this is important as, otherwise, the contents left over could cause the alternative reading of the token to be misparsed if it refers to the backslash variables in question. (If you think nobody would ever be crazy enough to write a regular expression like that, you haven't see Perl's test suite.)
If the downwards flag is clear, it not only invalidates subexpression matches below the node but also to the right of the node – this is useful for a backtrack which runs back quite some distance.
1474[ TEXT_TY_RE_FailSubexpressions token downwards;
1475 for (: token ~= NULL: token = token-->RE_NEXT) {
1476 if (token-->RE_DOWN ~= NULL) TEXT_TY_RE_FailSubexpressions(token-->RE_DOWN);
1477 if (token-->RE_CCLASS == SUBEXP_RE_CC) {
1478 token-->RE_DATA1 = -1;
1479 token-->RE_DATA2 = -1;
1480 }
1481 if (downwards) break;
1482 }
1483];
Erasing Constraints.
As explained above, temporary constraints are placed on some nodes when we are backtracking to test possible cases. When we do backtrack, though, it's important to lift any constraints left over from the previous attempt to parse material which is part of or subsequent to the token whose match attempt has been abandoned.
1493[ TEXT_TY_RE_EraseConstraints token;
1494 while (token ~= NULL) {
1495 switch (token-->RE_CCLASS) {
1496 DISJUNCTION_RE_CC: token-->RE_CONSTRAINT = -1;
1497 QUANTIFIER_RE_CC: token-->RE_CONSTRAINT = -1;
1498 }
1499 if (token-->RE_DOWN) TEXT_TY_RE_EraseConstraints(token-->RE_DOWN);
1500 token = token-->RE_NEXT;
1501 }
1502];
Matching Literal Text.
Here we attempt to make a match of the substring of the text mtxt which runs from character mfrom to character mto, looking for it at the given position ipos in the source text txt.
1510[ TEXT_TY_RE_MatchSubstring txt ipos mtxt mfrom mto insens
1511 i ch;
1512 if (mfrom < 0) return 0;
1513 if (insens)
1514 for (i=mfrom:i<mto:i++) {
1515 ch = BlkValueRead(mtxt, i);
1516 if (BlkValueRead(txt, ipos++) ~= ch or TEXT_TY_RevCase(ch))
1517 return -1;
1518 }
1519 else
1520 for (i=mfrom:i<mto:i++)
1521 if (BlkValueRead(txt, ipos++) ~= BlkValueRead(mtxt, i))
1522 return -1;
1523 return mto-mfrom;
1524];
Matching Character Range.
Suppose that a character range is stored in ftxt between the character positions rf and rt. Then TEXT_TY_RE_Range(ch, ftxt, rf, rt, negate, insens) tests whether a given character ch lies within that character range, negating the outcome if negate is set, and performing comparisons case insensitively if insens is set.
1534[ TEXT_TY_RE_Range ch ftxt rf rt negate insens
1535 i chm upper crev;
1536 if (ch == 0) rfalse;
1537 if (negate == true) {
1538 if (TEXT_TY_RE_Range(ch, ftxt, rf, rt, false, insens)) rfalse;
1539 rtrue;
1540 }
1541 for (i=rf: i<rt: i++) {
1542 chm = BlkValueRead(ftxt, i);
1543 if ((chm == '\') && (i+1<rt)) {
1544 chm = BlkValueRead(ftxt, ++i);
1545 switch (chm) {
1546 's':
1547 if (ch == 10 or 13 or 32 or 9) rtrue;
1548 'S':
1549 if ((ch) && (ch ~= 10 or 13 or 32 or 9)) rtrue;
1550 'p':
1551 if (ch == '.' or ',' or '' or '?'
1552 or '-' or '/' or '' or ':' or ';'
1553 or '(' or ')' or '[' or ']' or '{' or '}') rtrue;
1554 'P':
1555 if ((ch) && (ch ~= '.' or ',' or '' or '?'
1556 or '-' or '/' or '' or ':' or ';'
1557 or '(' or ')' or '[' or ']' or '{' or '}')) rtrue;
1558 'w':
1559 if ((ch) && (ch ~= 10 or 13 or 32 or 9
1560 or '.' or ',' or '' or '?'
1561 or '-' or '/' or '' or ':' or ';'
1562 or '(' or ')' or '[' or ']' or '{' or '}')) rtrue;
1563 'W':
1564 if (ch == 10 or 13 or 32 or 9
1565 or '.' or ',' or '' or '?'
1566 or '-' or '/' or '' or ':' or ';'
1567 or '(' or ')' or '[' or ']' or '{' or '}') rtrue;
1568 'd':
1569 if (ch == '0' or '1' or '2' or '3' or '4'
1570 or '5' or '6' or '7' or '8' or '9') rtrue;
1571 'D':
1572 if ((ch) && (ch ~= '0' or '1' or '2' or '3' or '4'
1573 or '5' or '6' or '7' or '8' or '9')) rtrue;
1574 'l': if (CharIsOfCase(ch, 0)) rtrue;
1575 'L': if (CharIsOfCase(ch, 0) == false) rtrue;
1576 'u': if (CharIsOfCase(ch, 1)) rtrue;
1577 'U': if (CharIsOfCase(ch, 1) == false) rtrue;
1578 'n': if (ch == 10) rtrue;
1579 't': if (ch == 9) rtrue;
1580 }
1581 } else {
1582 if ((i+2<rt) && (BlkValueRead(ftxt, i+1) == '-')) {
1583 upper = BlkValueRead(ftxt, i+2);
1584 if ((ch >= chm) && (ch <= upper)) rtrue;
1585 if (insens) {
1586 crev = TEXT_TY_RevCase(ch);
1587 if ((crev >= chm) && (crev <= upper)) rtrue;
1588 }
1589 i=i+2;
1590 } else {
1591 if (chm == ch) rtrue;
1592 if ((insens) && (chm == TEXT_TY_RevCase(ch))) rtrue;
1593 }
1594 }
1595 }
1596 rfalse;
1597];
Search And Replace.
And finally, last but not least: the routine which searches an indexed text txt trying to match it against ftxt. If ftxtype is set to REGEXP_BLOB then ftxt is expected to be a regular expression such as ab+(c*de)?, whereas if ftxtype is CHR_BLOB then it is expected only to be a simple string of characters taken literally, such as frog.
Each match found is replaced with the contents of rtxt, except that if the blob type is REGEXP_BLOB then we recognise a few syntaxes as special: for instance, \2 expands to the value of subexpression 2 as it was matched – see Writing with Inform for details.
The optional argument insens is a flag which, if set, causes the matching to be done case insensitively; the optional argument exactly, if set, causes the matching to work only if the entire txt is matched. (This is not especially useful with regular expressions, because the effect can equally be achieved by turning ab+c, say, into ^ab+c$, but it is indeed useful where the blob type is CHR_BLOB.)
For an explanation of the use of the word "blob", see Text.i6t.
1621[ TEXT_TY_Replace_RE ftxtype txt ftxt rtxt insens exactly
1622 r p p1 p2 cp cp1 cp2;
1623
1624
1625
1626 if (rtxt == 0 or 1) { cp = txt-->0; p = TEXT_TY_Temporarily_Transmute(txt); }
1627 else TEXT_TY_Transmute(txt);
1628 cp1 = ftxt-->0; p1 = TEXT_TY_Temporarily_Transmute(ftxt);
1629 cp2 = rtxt-->0; p2 = TEXT_TY_Temporarily_Transmute(rtxt);
1630 r = TEXT_TY_Replace_REI(ftxtype, txt, ftxt, rtxt, insens, exactly);
1631 TEXT_TY_Untransmute(ftxt, p1, cp1);
1632 TEXT_TY_Untransmute(rtxt, p2, cp2);
1633 if (rtxt == 0 or 1) TEXT_TY_Untransmute(txt, p, cp);
1634 return r;
1635];
1636
1637[ TEXT_TY_Replace_REI ftxtype txt ftxt rtxt insens exactly
1638 ctxt csize ilen i cl mpos cpos ch chm;
1639
1640 ilen = TEXT_TY_CharacterLength(txt);
1641
1642 TEXT_TY_RE_Err = 0;
1643 switch (ftxtype) {
1644 REGEXP_BLOB: i = TEXT_TY_RE_CompileTree(ftxt, exactly);
1645 CHR_BLOB: i = TEXT_TY_CHR_CompileTree(ftxt, exactly);
1646 default: "*** bad ftxtype ***";
1647 }
1648
1649 if ((i<0) || (i>RE_MAX_PACKETS)) {
1650 TEXT_TY_RE_Err = i;
1651 print "*** Regular expression error: ", (string) TEXT_TY_RE_Err, " ***^";
1652 RunTimeProblem(RTP_REGEXPSYNTAXERROR);
1653 return 0;
1654 }
1655
1656 if (TEXT_TY_RE_Trace) {
1657 TEXT_TY_RE_DebugTree(ftxt);
1658 print "(compiled to ", i, " packets)^";
1659 }
1660
1661 if (ftxtype == REGEXP_BLOB) TEXT_TY_RE_EmptyMatchVars();
1662 mpos = 0; chm = 0; cpos = 0;
1663 while (TEXT_TY_RE_Parse(ftxt, txt, mpos, insens) >= 0) {
1664 chm++;
1665
1666 if (TEXT_TY_RE_Trace) {
1667 print "^*** Match ", chm, " found (", RE_PACKET_space-->RE_DATA1, ",",
1668 RE_PACKET_space-->RE_DATA2, "): ";
1669 if (RE_PACKET_space-->RE_DATA1 == RE_PACKET_space-->RE_DATA2) {
1670 print "<empty>";
1671 }
1672 for (i=RE_PACKET_space-->RE_DATA1:i<RE_PACKET_space-->RE_DATA2:i++) {
1673 print (char) BlkValueRead(txt, i);
1674 }
1675 print " ***^";
1676 }
1677
1678 if (rtxt == 0) break;
1679
1680 if (rtxt ~= 0 or 1) {
1681 if (chm == 1) {
1682 ctxt = BlkValueCreate(TEXT_TY);
1683 TEXT_TY_Transmute(ctxt);
1684 csize = BlkValueLBCapacity(ctxt);
1685 }
1686
1687 for (i=cpos:i<RE_PACKET_space-->RE_DATA1:i++) {
1688 ch = BlkValueRead(txt, i);
1689 if (cl+1 >= csize) {
1690 if (BlkValueSetLBCapacity(ctxt, 2*cl) == false) break;
1691 csize = BlkValueLBCapacity(ctxt);
1692 }
1693 BlkValueWrite(ctxt, cl++, ch);
1694 }
1695 BlkValueWrite(ctxt, cl, 0);
1696
1697 TEXT_TY_Concatenate(ctxt, rtxt, ftxtype, txt);
1698 csize = BlkValueLBCapacity(ctxt);
1699 cl = TEXT_TY_CharacterLength(ctxt);
1700 }
1701
1702 mpos = RE_PACKET_space-->RE_DATA2; cpos = mpos;
1703 if (RE_PACKET_space-->RE_DATA1 == RE_PACKET_space-->RE_DATA2)
1704 mpos++;
1705
1706 if (TEXT_TY_RE_Trace) {
1707 if (chm == 100) {
1708 print "(Stopping after 100 matches.)^"; break;
1709 }
1710 }
1711 }
1712 if (chm > 0) {
1713 if (rtxt ~= 0 or 1) {
1714 for (i=cpos:i<ilen:i++) {
1715 ch = BlkValueRead(txt, i);
1716 if (cl+1 >= csize) {
1717 if (BlkValueSetLBCapacity(ctxt, 2*cl) == false) break;
1718 csize = BlkValueLBCapacity(ctxt);
1719 }
1720 BlkValueWrite(ctxt, cl++, ch);
1721 }
1722 }
1723
1724 if (ftxtype == REGEXP_BLOB) {
1725 TEXT_TY_RE_CreateMatchVars(txt);
1726 if (TEXT_TY_RE_Trace)
1727 TEXT_TY_RE_DebugMatchVars(txt);
1728 }
1729
1730 if (rtxt ~= 0 or 1) {
1731 BlkValueWrite(ctxt, cl, 0);
1732 BlkValueCopy(txt, ctxt);
1733 BlkValueFree(ctxt);
1734 }
1735 }
1736 return chm;
1737];
Concatenation.
See the corresponding routine in Text.i6t: this is a variation which handles the special syntaxes used in search-and-replace.
1744[ TEXT_TY_RE_Concatenate txt_to txt_from blobtype txt_ref
1745 pos len ch i tosize x y case;
1746 if ((txt_to==0) || (BlkValueWeakKind(txt_to) ~= TEXT_TY)) rfalse;
1747 if ((txt_from==0) || (BlkValueWeakKind(txt_from) ~= TEXT_TY)) return txt_to;
1748 pos = TEXT_TY_CharacterLength(txt_to);
1749 tosize = BlkValueLBCapacity(txt_to);
1750 len = TEXT_TY_CharacterLength(txt_from);
1751 for (i=0:i<len:i++) {
1752 ch = BlkValueRead(txt_from, i);
1753 if ((ch == '\') && (i < len-1)) {
1754 ch = BlkValueRead(txt_from, ++i);
1755 if (ch == 'n') ch = 10;
1756 if (ch == 't') ch = 9;
1757 case = -1;
1758 if (ch == 'l') case = 0;
1759 if (ch == 'u') case = 1;
1760 if (case >= 0) ch = BlkValueRead(txt_from, ++i);
1761 if ((ch >= '0') && (ch <= '9')) {
1762 ch = ch - '0';
1763 if (ch < RE_Subexpressions-->10) {
1764 x = (RE_Subexpressions-->ch)-->RE_DATA1;
1765 y = (RE_Subexpressions-->ch)-->RE_DATA2;
1766 if (x >= 0) {
1767 for (:x<y:x++) {
1768 ch = BlkValueRead(txt_ref, x);
1769 if (pos+1 >= tosize) {
1770 if (BlkValueSetLBCapacity(txt_to, 2*tosize) == false) break;
1771 tosize = BlkValueLBCapacity(txt_to);
1772 }
1773 if (case >= 0)
1774 BlkValueWrite(txt_to, pos++, CharToCase(ch, case));
1775 else
1776 BlkValueWrite(txt_to, pos++, ch);
1777 }
1778 }
1779 }
1780 continue;
1781 }
1782
1783 }
1784 if (pos+1 >= tosize) {
1785 if (BlkValueSetLBCapacity(txt_to, 2*tosize) == false) break;
1786 tosize = BlkValueLBCapacity(txt_to);
1787 }
1788 BlkValueWrite(txt_to, pos++, ch);
1789 }
1790 BlkValueWrite(txt_to, pos, 0);
1791 return txt_to;
1792];