9 date 93.02.26.21.57.14; author jromine; state Exp;
14 date 92.10.31.07.44.56; author jromine; state Exp;
19 date 92.10.28.22.56.27; author jromine; state Exp;
24 date 92.05.12.21.49.37; author jromine; state Exp;
29 date 92.02.12.05.07.25; author jromine; state Exp;
34 date 92.01.31.21.54.32; author jromine; state Exp;
39 date 92.01.24.18.03.41; author jromine; state Exp;
44 date 90.04.18.13.48.12; author sources; state Exp;
49 date 90.04.05.15.31.37; author sources; state Exp;
54 date 90.04.05.14.46.02; author sources; state Exp;
59 date 90.04.02.14.39.47; author sources; state Exp;
64 date 90.03.12.10.12.38; author sources; state Exp;
69 date 90.02.08.16.11.03; author sources; state Exp;
74 date 90.02.05.15.06.23; author sources; state Exp;
79 date 90.02.05.15.05.52; author sources; state Exp;
93 @/* m_getfld.c - read/parse a message */
95 static char ident[] = "@@(#)$Id: m_getfld.c,v 1.14 1992/10/31 07:44:56 jromine Exp jromine $";
100 #include "../zotnet/mts.h"
104 /* This module has a long and checkered history. First, it didn't burst
105 maildrops correctly because it considered two CTRL-A:s in a row to be
106 an inter-message delimiter. It really is four CTRL-A:s followed by a
107 newline. Unfortunately, MMDF will convert this delimiter *inside* a
108 message to a CTRL-B followed by three CTRL-A:s and a newline. This
109 caused the old version of m_getfld() to declare eom prematurely. The
110 fix was a lot slower than
112 c == '\001' && peekc (iob) == '\001'
114 but it worked, and to increase generality, UUCP style maildrops could
115 be parsed as well. Unfortunately the speed issue finally caught up with
116 us since this routine is at the very heart of MH.
118 To speed things up considerably, the routine Eom() was made an auxiliary
119 function called by the macro eom(). Unless we are bursting a maildrop,
120 the eom() macro returns FALSE saying we aren't at the end of the
123 The next thing to do is to read the mtstailor file and initialize
124 delimiter[] and delimlen accordingly...
126 After mhl was made a built-in in msh, m_getfld() worked just fine
127 (using m_unknown() at startup). Until one day: a message which was
128 the result of a bursting was shown. Then, since the burst boundaries
129 aren't CTRL-A:s, m_getfld() would blindly plunge on past the boundary.
130 Very sad. The solution: introduce m_eomsbr(). This hook gets called
131 after the end of each line (since testing for eom involves an fseek()).
132 This worked fine, until one day: a message with no body portion arrived.
135 while (eom (c = Getc (iob), iob))
138 loop caused m_getfld() to return FMTERR. So, that logic was changed to
139 check for (*eom_action) and act accordingly.
141 This worked fine, until one day: someone didn't use four CTRL-A:s as
142 their delimiters. So, the bullet got bit and we read mts.h and
143 continue to struggle on. It's not that bad though, since the only time
144 the code gets executed is when inc (or msh) calls it, and both of these
145 have already called mts_init().
147 ------------------------
148 (Written by Van Jacobson for the mh6 m_getfld, January, 1986):
150 This routine was accounting for 60% of the cpu time used by most mh
151 programs. I spent a bit of time tuning and it now accounts for <10%
152 of the time used. Like any heavily tuned routine, it's a bit
153 complex and you want to be sure you understand everything that it's
154 doing before you start hacking on it. Let me try to emphasize
155 that: every line in this atrocity depends on every other line,
156 sometimes in subtle ways. You should understand it all, in detail,
157 before trying to change any part. If you do change it, test the
158 result thoroughly (I use a hand-constructed test file that exercises
159 all the ways a header name, header body, header continuation,
160 header-body separator, body line and body eom can align themselves
161 with respect to a buffer boundary). "Minor" bugs in this routine
162 result in garbaged or lost mail.
164 If you hack on this and slow it down, I, my children and my
165 children's children will curse you.
167 This routine gets used on three different types of files: normal,
168 single msg files, "packed" unix or mmdf mailboxes (when used by inc)
169 and packed, directoried bulletin board files (when used by msh).
170 The biggest impact of different file types is in "eom" testing. The
171 code has been carefully organized to test for eom at appropriate
172 times and at no other times (since the check is quite expensive).
173 I have tried to arrange things so that the eom check need only be
174 done on entry to this routine. Since an eom can only occur after a
175 newline, this is easy to manage for header fields. For the msg
176 body, we try to efficiently search the input buffer to see if
177 it contains the eom delimiter. If it does, we take up to the
178 delimiter, otherwise we take everything in the buffer. (The change
179 to the body eom/copy processing produced the most noticeable
180 performance difference, particularly for "inc" and "show".)
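The body rule just described (take up to the delimiter if the buffer holds one, otherwise take the whole buffer) can be sketched as below. This is only an illustration: body_take is a hypothetical helper that does not exist in this file, it uses a plain memcmp scan rather than the matchc routine, and it stops before the delimiter without reproducing how the real code treats the delimiter's leading newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper, not part of m_getfld.c. Returns how many bytes of
// buf[0..len) belong to the current message body: everything before the
// first full delimiter, or the whole buffer if no delimiter is present.
static size_t body_take(const char *buf, size_t len,
                        const char *delim, size_t dlen)
{
    assert(dlen > 0);
    if (len < dlen)
        return len;                 // too short to hold a full delimiter
    for (size_t i = 0; i + dlen <= len; i++)
        if (buf[i] == delim[0] && memcmp(buf + i, delim, dlen) == 0)
            return i;               // body ends where the delimiter starts
    return len;                     // no delimiter: take everything
}
```

A buffer that ends with only part of a delimiter is not handled here; that case needs the separate end-of-buffer check.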
182 There are three qualitatively different things this routine busts
183 out of a message: field names, field text and msg bodies. Field
184 names are typically short (~8 char) and the loop that extracts them
185 might terminate on a colon, newline or max width. I considered
186 using a Vax "scanc" to locate the end of the field followed by a
187 "bcopy" but the routine call overhead on a Vax is too large for this
188 to work on short names. If Berkeley ever makes "inline" part of the
189 C optimiser (so things like "scanc" turn into inline instructions) a
190 change here would be worthwhile.
192 Field text is typically 60 - 100 characters so there's (barely)
193 a win in doing a routine call to something that does a "locc"
194 followed by a "bmove". About 30% of the fields have continuations
195 (usually the 822 "received:" lines) and each continuation generates
196 another routine call. "Inline" would be a big win here, as well.
198 Messages, as of this writing, seem to come in two flavors: small
199 (~1K) and long (>2K). Most messages have 400 - 600 bytes of headers
200 so message bodies average at least a few hundred characters.
201 Assuming your system uses reasonably sized stdio buffers (1K or
202 more), this routine should be able to remove the body in large
203 (>500 byte) chunks. This makes the cost of a call to "bcopy"
204 small but there is a premium on checking for the eom in packed
205 maildrops. The eom pattern is always a simple string so we can
206 construct an efficient pattern matcher for it (e.g., a Vax "matchc"
207 instruction). Some thought went into recognizing the start of
208 an eom that has been split across two buffers.
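A minimal sketch of the split-delimiter problem, assuming a brute-force check in place of the table-driven matcher this routine actually uses: partial_delim_tail is a hypothetical name, and it reports how many bytes at the end of a buffer could be the start of a delimiter that continues in the next buffer. A caller would hold those bytes back instead of returning them as body text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper, not part of m_getfld.c. Returns the length of the
// longest proper prefix of delim that is also a suffix of buf[0..len),
// or 0 if the buffer does not end in the start of a delimiter.
static size_t partial_delim_tail(const char *buf, size_t len,
                                 const char *delim, size_t dlen)
{
    assert(dlen > 0);
    size_t k = dlen - 1;            // a full match is handled elsewhere
    if (k > len)
        k = len;
    for (; k > 0; k--)              // try the longest candidate first
        if (memcmp(buf + len - k, delim, k) == 0)
            return k;
    return 0;
}
```

The brute-force loop costs up to dlen comparisons per buffer; the point of the original's precomputed position map is to make the common case (buffer ends in a byte that is not in the delimiter) a single test.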
210 This routine wants to deal with large chunks of data so, rather
211 than "getc" into a local buffer, it uses stdio's buffer. If
212 you try to use it on a non-buffered file, you'll get what you
213 deserve. This routine "knows" that struct FILEs have a _ptr
214 and a _cnt to describe the current state of the buffer and
215 it knows that _filbuf ignores the _ptr & _cnt and simply fills
216 the buffer. If stdio on your system doesn't work this way, you
217 may have to make small changes in this routine.
219 This routine also "knows" that an EOF indication on a stream is
220 "sticky" (i.e., you will keep getting EOF until you reposition the
221 stream). If your system doesn't work this way it is broken and you
222 should complain to the vendor. As a consequence of the sticky
223 EOF, this routine will never return any kind of EOF status when
224 there is data in "name" or "buf".
228 #define Getc(iob) getc(iob)
229 #define eom(c,iob) (msg_style != MS_DEFAULT && \
230 (((c) == *msg_delim && m_Eom(c,iob)) ||\
231 (eom_action && (*eom_action)(c))))
233 static unsigned char *matchc();
234 static unsigned char *locc();
236 static unsigned char **pat_map;
238 extern int msg_count; /* defined in sbr/m_msgdef.c = 0
239 * disgusting hack for "inc" so it can
240 * know how many characters were stuffed
241 * in the buffer on the last call (see
242 * comments in uip/scansbr.c) */
244 extern int msg_style; /* defined in sbr/m_msgdef.c = MS_DEFAULT */
246 * The "full" delimiter string for a packed maildrop consists
247 * of a newline followed by the actual delimiter. E.g., the
248 * full string for a Unix maildrop would be: "\n\nFrom ".
249 * "Fdelim" points to the start of the full string and is used
250 * in the BODY case of the main routine to search the buffer for
251 * a possible eom. Msg_delim points to the first character of
252 * the actual delim. string (i.e., fdelim+1). Edelim
253 * points to the 2nd character of actual delimiter string. It
254 * is used in m_Eom because the first character of the string
255 * has been read and matched before m_Eom is called.
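The pointer layout described above can be sketched as follows. This is an illustration under assumptions, not the code in this file: struct delim_view and delim_view_init are hypothetical names, and the real module keeps these values in separate variables. The lengths follow the relationships stated in the comment: the full string is one byte longer than the delimiter, and the tail used by m_Eom is one byte shorter.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical container for the three views of one delimiter string.
struct delim_view {
    char  *fdelim;      // newline + delimiter: searched for in body text
    char  *msg_delim;   // fdelim + 1: first character of the delimiter
    char  *edelim;      // msg_delim + 1: rest, after the first char matched
    size_t fdelimlen;   // strlen(fdelim)
    size_t edelimlen;   // strlen(edelim)
};

// Returns 0 on success, -1 if the delimiter is shorter than 2 bytes
// or if allocation fails.
static int delim_view_init(struct delim_view *v, const char *delimstr)
{
    size_t n = strlen(delimstr);
    assert(v != NULL);
    if (n < 2)
        return -1;
    v->fdelim = malloc(n + 2);      // leading newline + delimiter + NUL
    if (v->fdelim == NULL)
        return -1;
    v->fdelim[0] = '\n';
    memcpy(v->fdelim + 1, delimstr, n + 1);
    v->msg_delim = v->fdelim + 1;
    v->edelim = v->msg_delim + 1;
    v->fdelimlen = n + 1;
    v->edelimlen = n - 1;
    return 0;
}
```

For a Unix maildrop the delimiter passed in would itself begin with a newline, so the full string comes out as two newlines followed by "From ".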
257 extern char *msg_delim; /* defined in sbr/m_msgdef.c = "" */
258 static unsigned char *fdelim;
259 static unsigned char *delimend;
260 static int fdelimlen;
261 static unsigned char *edelim;
262 static int edelimlen;
264 static int (*eom_action) () = NULL;
267 #define _ptr _p /* Gag */
268 #define _cnt _r /* Retch */
269 #define _filbuf __srget /* Puke */
274 m_getfld (state, name, buf, bufsz, iob)
281 register unsigned char *cp;
282 register unsigned char *bp;
283 register unsigned char *ep;
284 register unsigned char *sp;
290 if ((c = Getc(iob)) < 0) {
297 /* flush null messages */
298 while ((c = Getc(iob)) >= 0 && eom (c, iob))
301 (void) ungetc(c, iob);
312 if (c == '\n' || c == '-') {
313 /* we hit the header/body separator */
314 while (c != '\n' && (c = Getc(iob)) >= 0)
317 if (c < 0 || (c = Getc(iob)) < 0 || eom (c, iob)) {
319 /* flush null messages */
320 while ((c = Getc(iob)) >= 0 && eom (c, iob))
323 (void) ungetc(c, iob);
333 * get the name of this component. take characters up
334 * to a ':', a newline or NAMESZ-1 characters, whichever
337 cp = name; i = NAMESZ - 1;
339 bp = sp = (unsigned char *) iob->_ptr - 1;
340 j = (cnt = iob->_cnt+1) < i ? cnt : i;
341 while ((c = *bp++) != ':' && c != '\n' && --j >= 0)
345 if ((cnt -= j) <= 0) {
346 if (_filbuf(iob) == EOF) {
348 advise (NULLCP, "eof encountered in field \"%s\"",
360 * something went wrong. possibilities are:
361 * . hit a newline (error)
362 * . got more than namesz chars. (error)
363 * . hit the end of the buffer. (loop)
367 advise (NULLCP, "eol encountered in field \"%s\"", name);
373 advise (NULLCP, "field name \"%s\" exceeds %d bytes",
380 while (isspace (*--cp) && cp >= name)
387 * get (more of) the text of a field. take
388 * characters up to the end of this field (newline
389 * followed by non-blank) or bufsz-1 characters.
391 cp = buf; i = bufsz-1;
393 cnt = iob->_cnt++; bp = (unsigned char *) --iob->_ptr;
394 c = cnt < i ? cnt : i;
395 while (ep = locc( c, bp, '\n' )) {
397 * if we hit the end of this field, return.
399 if ((j = *++ep) != ' ' && j != '\t') {
400 j = ep - (unsigned char *) iob->_ptr;
401 (void) bcopy( iob->_ptr, cp, j);
402 iob->_ptr = ep; iob->_cnt -= j;
407 c -= ep - bp; bp = ep;
410 * end of input or dest buffer - copy what we've found.
412 c += bp - (unsigned char *) iob->_ptr;
413 (void) bcopy( iob->_ptr, cp, c);
416 /* the dest buffer is full */
417 iob->_cnt -= c; iob->_ptr += c;
422 * There's one character left in the input buffer.
423 * Copy it & fill the buffer. If the last char
424 * was a newline and the next char is not whitespace,
425 * this is the end of the field. Otherwise loop.
428 *cp++ = j = *(iob->_ptr + c);
430 if ((j == '\0' || j == '\n') && c != ' ' && c != '\t') {
432 --iob->_ptr, ++iob->_cnt;
442 * get the message body up to bufsz characters or the
443 * end of the message. Sleazy hack: if bufsz is negative
444 * we assume that we were called to copy directly into
445 * the output buffer and we don't add an eos.
447 i = (bufsz < 0) ? -bufsz : bufsz-1;
448 bp = (unsigned char *) --iob->_ptr; cnt = ++iob->_cnt;
449 c = (cnt < i ? cnt : i);
450 if (msg_style != MS_DEFAULT && c > 1) {
452 * packed maildrop - only take up to the (possible)
453 * start of the next message. This "matchc" should
454 * probably be a Boyer-Moore matcher for non-vaxen,
455 * particularly since we have the alignment table
456 * all built for the end-of-buffer test (next).
457 * But our vax timings indicate that the "matchc"
458 * instruction is 50% faster than a carefully coded
459 * B.M. matcher for most strings. (So much for elegant
460 * algorithms vs. brute force.) Since I (currently)
461 * run MH on a vax, we use the matchc instruction. --vj
463 if (ep = matchc( fdelimlen, fdelim, c, bp ) )
467 * There's no delim in the buffer but there may be
468 * a partial one at the end. If so, we want to leave
469 * it so the "eom" check on the next call picks it up.
470 * Use a modified Boyer-Moore matcher to make this
471 * check relatively cheap. The first "if" figures
472 * out what position in the pattern matches the last
473 * character in the buffer. The inner "while" matches
474 * the pattern against the buffer, backwards starting
475 * at that position. Note that unless the buffer
476 * ends with one of the characters in the pattern
477 * (excluding the first and last), we do only one test.
480 if (sp = pat_map[*ep]) {
483 while (*--ep == *--cp)
488 * ep < bp means that all the buffer
489 * contains is a prefix of delim.
490 * If this prefix is really a delim, the
491 * m_eom call at entry should have found
492 * it. Thus it's not a delim and we can
498 /* try matching one less char of delim string */
500 } while (--sp > fdelim);
504 (void) bcopy( bp, buf, c );
515 adios (NULLCP, "m_getfld() called with bogus state of %d", state);
519 msg_count = cp - buf;
526 static char unixbuf[BUFSIZ] = "";
537 register char *delimstr;
539 msg_style = MS_UNKNOWN;
541 /* Figure out what the message delimiter string is for this
542 * maildrop. (This used to be part of m_Eom but I didn't like
543 * the idea of an "if" statement that could only succeed on the
544 * first call to m_Eom getting executed on each call, i.e., at
545 * every newline in the message).
547 * If the first line of the maildrop is a Unix "from" line, we say the
548 * style is UUCP and eat the rest of the line. Otherwise we say the style
549 * is MMDF & look for the delimiter string specified when MH was built
550 * (or from the mtstailor file).
553 if (fread (text, sizeof *text, 5, iob) == 5
554 && strncmp (text, "From ", 5) == 0) {
556 delimstr = "\nFrom ";
558 while ((c = getc (iob)) != '\n' && c >= 0)
562 while ((c = getc (iob)) != '\n')
567 /* not a Unix style maildrop */
568 (void) fseek (iob, pos, 0);
569 if (mmdlm2 == NULLCP || *mmdlm2 == 0)
570 mmdlm2 = "\001\001\001\001\n";
574 c = strlen (delimstr);
575 fdelim = (unsigned char *)malloc((unsigned)c + 3);
578 msg_delim = (char *)fdelim+1;
579 edelim = (unsigned char *)msg_delim+1;
582 (void)strcpy(msg_delim, delimstr);
583 delimend = (unsigned char *)msg_delim + edelimlen;
585 adios (NULLCP, "maildrop delimiter must be at least 2 bytes");
587 * build a Boyer-Moore end-position map for the matcher in m_getfld.
588 * N.B. - we don't match just the first char (since it's the newline
589 * separator) or the last char (since the matchc would have found it
590 * if it was a real delim).
592 pat_map = (unsigned char **) calloc (256, sizeof (unsigned char *));
594 for (cp = (char *)fdelim + 1; cp < (char *)delimend; cp++ )
595 pat_map[*cp] = (unsigned char *)cp;
597 if (msg_style == MS_MMDF) {
598 /* flush extra msg hdrs */
599 while ((c = Getc(iob)) >= 0 && eom (c, iob))
602 (void) ungetc(c, iob);
607 void m_eomsbr (action)
610 if (eom_action = action) {
617 msg_delim = (char *)fdelim + 1;
618 fdelimlen = strlen((char *)fdelim);
619 delimend = (unsigned char *)(msg_delim + edelimlen);
625 /* test for msg delimiter string */
631 register long pos = 0L;
639 if ((i = fread (text, sizeof *text, edelimlen, iob)) != edelimlen
640 || strncmp (text, (char *)edelim, edelimlen)) {
641 if (i == 0 && msg_style == MS_UUCP)
642 /* the final newline in the (brain damaged) unix-format
643 * maildrop is part of the delimiter - delete it.
648 (void) fseek (iob, pos, 0);
650 (void) fseek (iob, (long)(pos-1), 0);
651 (void) getc (iob); /* should be OK */
656 if (msg_style == MS_UUCP) {
658 while ((c = getc (iob)) != '\n')
663 while ((c = getc (iob)) != '\n' && c >= 0)
679 static char unixfrom[BUFSIZ];
682 if (cp = dp = index (unixbuf, ' ')) {
683 while (cp = index (cp + 1, 'r'))
684 if (strncmp (cp, "remote from ", 12) == 0) {
686 (void) sprintf (pp, "%s!", cp + 12);
691 cp = unixbuf + strlen (unixbuf);
692 if ((cp -= 25) >= dp)
696 (void) sprintf (pp, "%s\n", unixbuf);
706 asm("_matchc: .word 0");
707 asm(" movq 4(ap),r0");
708 asm(" movq 12(ap),r2");
709 asm(" matchc r0,(r1),r2,(r3)");
711 asm(" movl 4(ap),r3");
712 asm("1: subl3 4(ap),r3,r0");
715 static unsigned char *
716 matchc( patln, pat, strln, str )
722 register char *es = str + strln - patln;
725 register char *ep = pat + patln;
726 register char pc = *pat++;
734 while (pp < ep && *sp++ == *pp)
737 return ((unsigned char *)--str);
745 * Locate character "term" in the next "cnt" characters of "src".
746 * If found, return its address, otherwise return 0.
750 asm("_locc: .word 0");
751 asm(" movq 4(ap),r0");
752 asm(" locc 12(ap),r0,(r1)");
757 static unsigned char *
758 locc( cnt, src, term )
760 register unsigned char *src;
761 register unsigned char term;
763 while (*src++ != term && --cnt > 0);
765 return (cnt > 0 ? --src : (unsigned char *)0);
771 #if !defined (BSD42) && !defined (bcopy)
772 int bcmp (b1, b2, length)
785 bcopy (b1, b2, length)
802 #endif /* not BSD42 */
813 static char ident[] = "@@(#)$Id: m_getfld.c,v 1.13 1992/10/28 22:56:27 jromine Exp jromine $";
820 @possible fix for no-newline in .mh_profile problem.
825 static char ident[] = "@@(#)$Id: m_getfld.c,v 1.12 1992/05/12 21:49:37 jromine Exp jromine $";
842 static char ident[] = "@@(#)$Id: m_getfld.c,v 1.11 1992/02/12 05:07:25 jromine Exp jromine $";
845 if (j == '\n' && c != ' ' && c != '\t') {
851 @second try at fseek() fix
856 static char ident[] = "@@(#)$Id: m_getfld.c,v 1.10 1992/01/31 21:54:32 jromine Exp jromine $";
869 fdelimlen = strlen(fdelim);
875 || strncmp (text, edelim, edelimlen)) {
898 static char ident[] = "@@(#)$Id: m_getfld.c,v 1.9 1992/01/24 18:03:41 jromine Exp jromine $";
907 (void) fseek (iob, pos, 0);
914 @move msg_count, msg_style & msg_delim to m_msgdef.c for
920 static char ident[] = "@@(#)$Id: m_getfld.c,v 1.8 1990/04/18 13:48:12 sources Exp jromine $";
941 if (mmdlm2 == NULLCP || *mmdlm2 == NULL)
962 @back out RAND fix -- under #ifdef notdef
967 static char ident[] = "@@(#)$Id: m_getfld.c,v 1.7 90/04/05 15:31:37 sources Exp Locker: sources $";
970 int msg_count = 0; /* disgusting hack for "inc" so it can
973 int msg_style = MS_DEFAULT;
976 char *msg_delim = "";
987 static char ident[] = "@@(#)$Id:$";
1000 static char ident[] = "$Id:";
1015 @cast iob->_ptr as unsigned char *. This stuff is really yukky!
1020 (void) fseek (iob, pos, 0);
1023 while (pp < ep && *sp++ == *pp++)
1030 @Fixes from Van Jacobson
1035 bp = sp = iob->_ptr - 1;
1038 cnt = iob->_cnt++; bp = --iob->_ptr;
1044 c += bp - iob->_ptr;
1047 bp = --iob->_ptr; cnt = ++iob->_cnt;
1053 @*** empty log message ***
1058 static char *matchc();
1059 static char *locc();
1062 static char **pat_map;
1065 static char *fdelim;
1066 static char *delimend;
1069 static char *edelim;
1087 while (ep = (unsigned char *) locc( c, bp, '\n' )) {
1093 if (ep = (unsigned char *) matchc( fdelimlen, fdelim, c, bp ) )
1097 * check relatively cheap. The first "while" figures
1103 while ((cp = pat_map[*ep]) < sp) {
1104 ep = bp + c - 1; sp = cp;
1106 sp = (unsigned char *) delimend;
1108 while ((cp = pat_map[*ep]) < (char *) sp) {
1109 ep = bp + c - 1; sp = (unsigned char *) cp;
1111 while (*--ep == *--cp && cp > fdelim)
1115 if (*ep == *cp && ep > bp)
1120 void m_unknown (iob)
1125 fdelim = "\n\nFrom ";
1128 fdelim = (char *)malloc((unsigned)strlen(mmdlm2)+2);
1130 (void)strcpy(fdelim+1, mmdlm2);
1133 fdelimlen = strlen(fdelim);
1134 msg_delim = fdelim+1;
1135 edelim = msg_delim+1;
1136 edelimlen = fdelimlen-2;
1137 delimend = msg_delim + edelimlen;
1140 pat_map = (char **) malloc( 256 * sizeof (char *));
1141 for (c = 256; c--; )
1142 pat_map[c] = delimend + 1;
1145 for (cp = fdelim + 1; cp < delimend; cp++ )
1149 msg_delim = fdelim + 1;
1150 fdelimlen = strlen (fdelim);
1151 delimend = msg_delim + edelimlen;
1167 return (cnt > 0 ? --src : NULLCP);
1170 #endif not BSD42 or SYS5