This is the pre-mmh version of sbr/m_getfld.c (dated 2008-12-26).
The current version is still unbearable, but this one is the original.
Enjoy! :-) -- 2012-04-01 markus schnalke <meillo@marmaro.de>
 * m_getfld.c -- read/parse a message
 *
 * This code is Copyright (c) 2002, by the authors of nmh. See the
 * COPYRIGHT file in the root directory of the nmh distribution for
 * complete copyright information.
/* This module has a long and checkered history. First, it didn't burst
   maildrops correctly because it considered two CTRL-A:s in a row to be
   an inter-message delimiter. It really is four CTRL-A:s followed by a
   newline. Unfortunately, MMDF will convert this delimiter *inside* a
   message to a CTRL-B followed by three CTRL-A:s and a newline. This
   caused the old version of m_getfld() to declare eom prematurely. The
   fix was a lot slower than

		c == '\001' && peekc (iob) == '\001'

   but it worked, and to increase generality, MBOX style maildrops could
   be parsed as well. Unfortunately the speed issue finally caught up with
   us since this routine is at the very heart of MH.

   To speed things up considerably, the routine Eom() was made an auxiliary
   function called by the macro eom(). Unless we are bursting a maildrop,
   the eom() macro returns FALSE saying we aren't at the end of the
   message.

   The next thing to do is to read the mts.conf file and initialize
   delimiter[] and delimlen accordingly...

   After mhl was made a built-in in msh, m_getfld() worked just fine
   (using m_unknown() at startup). Until one day: a message which was
   the result of a bursting was shown. Then, since the burst boundaries
   aren't CTRL-A:s, m_getfld() would blindly plunge on past the boundary.
   Very sad. The solution: introduce m_eomsbr(). This hook gets called
   after the end of each line (since testing for eom involves an fseek()).
   This worked fine, until one day: a message with no body portion arrived.
   Then the

		while (eom (c = Getc (iob), iob))

   loop caused m_getfld() to return FMTERR. So, that logic was changed to
   check for (*eom_action) and act accordingly.

   This worked fine, until one day: someone didn't use four CTRL-A:s as
   their delimiters. So, the bullet got bit and we read mts.h and
   continue to struggle on. It's not that bad though, since the only time
   the code gets executed is when inc (or msh) calls it, and both of these
   have already called mts_init().

   ------------------------
   (Written by Van Jacobson for the mh6 m_getfld, January, 1986):

   This routine was accounting for 60% of the cpu time used by most mh
   programs. I spent a bit of time tuning and it now accounts for <10%
   of the time used. Like any heavily tuned routine, it's a bit
   complex and you want to be sure you understand everything that it's
   doing before you start hacking on it. Let me try to emphasize
   that: every line in this atrocity depends on every other line,
   sometimes in subtle ways. You should understand it all, in detail,
   before trying to change any part. If you do change it, test the
   result thoroughly (I use a hand-constructed test file that exercises
   all the ways a header name, header body, header continuation,
   header-body separator, body line and body eom can align themselves
   with respect to a buffer boundary). "Minor" bugs in this routine
   result in garbaged or lost mail.

   If you hack on this and slow it down, I, my children and my
   children's children will curse you.

   This routine gets used on three different types of files: normal,
   single msg files, "packed" unix or mmdf mailboxes (when used by inc)
   and packed, directoried bulletin board files (when used by msh).
   The biggest impact of different file types is in "eom" testing. The
   code has been carefully organized to test for eom at appropriate
   times and at no other times (since the check is quite expensive).
   I have tried to arrange things so that the eom check need only be
   done on entry to this routine. Since an eom can only occur after a
   newline, this is easy to manage for header fields. For the msg
   body, we try to efficiently search the input buffer to see if it
   contains the eom delimiter. If it does, we take up to the
   delimiter, otherwise we take everything in the buffer. (The change
   to the body eom/copy processing produced the most noticeable
   performance difference, particularly for "inc" and "show".)

   There are three qualitatively different things this routine busts
   out of a message: field names, field text and msg bodies. Field
   names are typically short (~8 char) and the loop that extracts them
   might terminate on a colon, newline or max width. I considered
   using a Vax "scanc" to locate the end of the field followed by a
   "bcopy" but the routine call overhead on a Vax is too large for this
   to work on short names. If Berkeley ever makes "inline" part of the
   C optimiser (so things like "scanc" turn into inline instructions) a
   change here would be worthwhile.

   Field text is typically 60 - 100 characters so there's (barely)
   a win in doing a routine call to something that does a "locc"
   followed by a "bmove". About 30% of the fields have continuations
   (usually the 822 "received:" lines) and each continuation generates
   another routine call. "Inline" would be a big win here, as well.

   Messages, as of this writing, seem to come in two flavors: small
   (~1K) and long (>2K). Most messages have 400 - 600 bytes of headers
   so message bodies average at least a few hundred characters.
   Assuming your system uses reasonably sized stdio buffers (1K or
   more), this routine should be able to remove the body in large
   (>500 byte) chunks. This makes the cost of a call to "bcopy"
   small but there is a premium on checking for the eom in packed
   maildrops. The eom pattern is always a simple string so we can
   construct an efficient pattern matcher for it (e.g., a Vax "matchc"
   instruction). Some thought went into recognizing the start of
   an eom that has been split across two buffers.

   This routine wants to deal with large chunks of data so, rather
   than "getc" into a local buffer, it uses stdio's buffer. If
   you try to use it on a non-buffered file, you'll get what you
   deserve. This routine "knows" that struct FILEs have a _ptr
   and a _cnt to describe the current state of the buffer and
   it knows that _filbuf ignores the _ptr & _cnt and simply fills
   the buffer. If stdio on your system doesn't work this way, you
   may have to make small changes in this routine.

   This routine also "knows" that an EOF indication on a stream is
   "sticky" (i.e., you will keep getting EOF until you reposition the
   stream). If your system doesn't work this way it is broken and you
   should complain to the vendor. As a consequence of the sticky
   EOF, this routine will never return any kind of EOF status when
   there is data in "name" or "buf".
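The field-text rule used by this routine (a field ends at a newline that is followed by a non-blank character, so continuation lines starting with a space or tab stay in the same field) can be illustrated in isolation. The following is only a minimal sketch working on a plain memory buffer, not on stdio internals as the real code does; the function name `field_end` is hypothetical and does not appear in the original file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of m_getfld.c.
 * Return the offset just past the newline that terminates the header
 * field starting at buf, or 0 if the end of the field cannot be
 * decided from the len bytes available (no newline yet, or the
 * newline is the last byte, so the next character is unknown).
 */
static size_t
field_end(const char *buf, size_t len)
{
    const char *p = buf;
    const char *end = buf + len;
    const char *nl;

    assert(buf != NULL);
    while (p < end && (nl = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        const char *next = nl + 1;

        if (next == end)
            return 0;               /* need more input to decide */
        if (*next != ' ' && *next != '\t')
            return (size_t)(next - buf);   /* newline + non-blank */
        p = next;                   /* continuation line, keep going */
    }
    return 0;
}
```

The "undecided" result corresponds to the case the original handles by refilling the stdio buffer before looking at the next character.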
144 static int m_Eom (int, FILE *);
145 static unsigned char *matchc(int, char *, int, char *);
146 static unsigned char *locc(int, unsigned char *, unsigned char);
148 #define Getc(iob) getc(iob)
149 #define eom(c,iob) (msg_style != MS_DEFAULT && \
150 (((c) == *msg_delim && m_Eom(c,iob)) ||\
151 (eom_action && (*eom_action)(c))))
153 static unsigned char **pat_map;
156 * defined in sbr/m_msgdef.c = 0
157 * This is a disgusting hack for "inc" so it can know how many
158 * characters were stuffed in the buffer on the last call
159 * (see comments in uip/scansbr.c).
161 extern int msg_count;
164 * defined in sbr/m_msgdef.c = MS_DEFAULT
166 extern int msg_style;
169 * The "full" delimiter string for a packed maildrop consists
170 * of a newline followed by the actual delimiter. E.g., the
171 * full string for a Unix maildrop would be: "\n\nFrom ".
172 * "Fdelim" points to the start of the full string and is used
173 * in the BODY case of the main routine to search the buffer for
174 * a possible eom. Msg_delim points to the first character of
175 * the actual delim. string (i.e., fdelim+1). Edelim
176 * points to the 2nd character of actual delimiter string. It
177 * is used in m_Eom because the first character of the string
178 * has been read and matched before m_Eom is called.
180 extern char *msg_delim; /* defined in sbr/m_msgdef.c = "" */
181 static unsigned char *fdelim;
182 static unsigned char *delimend;
183 static int fdelimlen;
184 static unsigned char *edelim;
185 static int edelimlen;
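The relationship between the three delimiter pointers described in the comment above can be sketched as follows. This is an illustrative reconstruction under the stated layout (full string = newline + actual delimiter), not the code from m_unknown(); `struct delim_view` and `delim_view_init` are hypothetical names, and the caller supplies the storage.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the pointers the comment describes. */
struct delim_view {
    const char *fdelim;     /* start of the full string ("\n" + delimiter) */
    const char *msg_delim;  /* first char of the actual delimiter */
    const char *edelim;     /* second char of the actual delimiter */
    size_t fdelimlen;       /* length of the full string */
    size_t edelimlen;       /* length of the delimiter minus its first char */
};

/*
 * Build the full delimiter in storage and set up the pointers.
 * Returns 0 on success, -1 if the delimiter is shorter than 2 bytes
 * or storage is too small.
 */
static int
delim_view_init(struct delim_view *v, char *storage, size_t cap,
                const char *delimstr)
{
    size_t n;

    assert(v != NULL && storage != NULL && delimstr != NULL);
    n = strlen(delimstr);
    if (n < 2 || cap < n + 2)
        return -1;

    storage[0] = '\n';
    memcpy(storage + 1, delimstr, n + 1);   /* includes the NUL */

    v->fdelim = storage;
    v->msg_delim = storage + 1;
    v->edelim = storage + 2;
    v->fdelimlen = n + 1;
    v->edelimlen = n - 1;
    return 0;
}
```

For a Unix maildrop, passing `"\nFrom "` yields the full string `"\n\nFrom "` from the comment, with `edelim` pointing at `"From "`.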
187 static int (*eom_action)(int) = NULL;
190 # define _ptr _p /* Gag */
191 # define _cnt _r /* Retch */
192 # define _filbuf __srget /* Puke */
193 # define DEFINED__FILBUF_TO_SOMETHING_SPECIFIC
199 # define _base __base
200 # define _filbuf(fp) ((fp)->__cnt = 0, __filbuf(fp))
201 # define DEFINED__FILBUF_TO_SOMETHING_SPECIFIC
204 #ifndef DEFINED__FILBUF_TO_SOMETHING_SPECIFIC
205 extern int _filbuf(FILE*);
210 m_getfld (int state, unsigned char *name, unsigned char *buf,
211 int bufsz, FILE *iob)
213 register unsigned char *bp, *cp, *ep, *sp;
214 register int cnt, c, i, j;
216 if ((c = Getc(iob)) < 0) {
223 /* flush null messages */
224 while ((c = Getc(iob)) >= 0 && eom (c, iob))
238 if (c == '\n' || c == '-') {
239 /* we hit the header/body separator */
240 while (c != '\n' && (c = Getc(iob)) >= 0)
243 if (c < 0 || (c = Getc(iob)) < 0 || eom (c, iob)) {
245 /* flush null messages */
246 while ((c = Getc(iob)) >= 0 && eom (c, iob))
259 * get the name of this component. take characters up
260 * to a ':', a newline or NAMESZ-1 characters, whichever
267 bp = sp = (unsigned char *) iob->_IO_read_ptr - 1;
268 j = (cnt = ((long) iob->_IO_read_end -
269 (long) iob->_IO_read_ptr) + 1) < i ? cnt : i;
270 #elif defined(__DragonFly__)
271 bp = sp = (unsigned char *) ((struct __FILE_public *)iob)->_p - 1;
272 j = (cnt = ((struct __FILE_public *)iob)->_r+1) < i ? cnt : i;
274 bp = sp = (unsigned char *) iob->_ptr - 1;
275 j = (cnt = iob->_cnt+1) < i ? cnt : i;
277 while (--j >= 0 && (c = *bp++) != ':' && c != '\n')
281 if ((cnt -= j) <= 0) {
283 iob->_IO_read_ptr = iob->_IO_read_end;
284 if (__underflow(iob) == EOF) {
285 #elif defined(__DragonFly__)
286 if (__srget(iob) == EOF) {
288 if (_filbuf(iob) == EOF) {
291 advise (NULL, "eof encountered in field \"%s\"", name);
295 iob->_IO_read_ptr++; /* NOT automatic in __underflow()! */
299 iob->_IO_read_ptr = bp + 1;
300 #elif defined(__DragonFly__)
301 ((struct __FILE_public *)iob)->_p = bp + 1;
302 ((struct __FILE_public *)iob)->_r = cnt - 1;
312 * something went wrong. possibilities are:
313 * . hit a newline (error)
314 * . got more than namesz chars. (error)
315 * . hit the end of the buffer. (loop)
318 /* We hit the end of the line without seeing ':' to
319 * terminate the field name. This is usually (always?)
320 * spam. But, blowing up is lame, especially when
321 * scan(1)ing a folder with such messages. Pretend such
322 * lines are the first of the body (at least mutt also
323 * handles it this way). */
325 /* See if buf can hold this line, since we were assuming
326 * we had a buffer of NAMESZ, not bufsz. */
327 /* + 1 for the newline */
329 /* No, it can't. Oh well, guess we'll blow up. */
331 advise (NULL, "eol encountered in field \"%s\"", name);
335 memcpy (buf, name, j - 1);
338 /* mhparse.c:get_content wants to find the position of the
339 * body start, but it thinks there's a blank line between
340 * the header and the body (naturally!), so seek back so
341 * that things line up even though we don't have that
342 * blank line in this case. Simpler parsers (e.g. mhl)
343 * get extra newlines, but that should be harmless enough,
344 * right? This is a corrupt message anyway. */
345 fseek (iob, ftell (iob) - 2, SEEK_SET);
350 advise (NULL, "field name \"%s\" exceeds %d bytes", name, NAMESZ - 2);
356 while (--cp >= name && isspace (*cp))
363 * get (more of) the text of a field. take
364 * characters up to the end of this field (newline
365 * followed by non-blank) or bufsz-1 characters.
367 cp = buf; i = bufsz-1;
370 cnt = (long) iob->_IO_read_end - (long) iob->_IO_read_ptr;
371 bp = (unsigned char *) --iob->_IO_read_ptr;
372 #elif defined(__DragonFly__)
373 cnt = ((struct __FILE_public *)iob)->_r++;
374 bp = (unsigned char *) --((struct __FILE_public *)iob)->_p;
377 bp = (unsigned char *) --iob->_ptr;
379 c = cnt < i ? cnt : i;
380 while ((ep = locc( c, bp, '\n' ))) {
382 * if we hit the end of this field, return.
384 if ((j = *++ep) != ' ' && j != '\t') {
386 j = ep - (unsigned char *) iob->_IO_read_ptr;
387 memcpy (cp, iob->_IO_read_ptr, j);
388 iob->_IO_read_ptr = ep;
389 #elif defined(__DragonFly__)
390 j = ep - (unsigned char *) ((struct __FILE_public *)iob)->_p;
391 memcpy (cp, ((struct __FILE_public *)iob)->_p, j);
392 ((struct __FILE_public *)iob)->_p = ep;
393 ((struct __FILE_public *)iob)->_r -= j;
395 j = ep - (unsigned char *) iob->_ptr;
396 memcpy (cp, iob->_ptr, j);
408 * end of input or dest buffer - copy what we've found.
411 c += bp - (unsigned char *) iob->_IO_read_ptr;
412 memcpy( cp, iob->_IO_read_ptr, c);
413 #elif defined(__DragonFly__)
414 c += bp - (unsigned char *) ((struct __FILE_public *)iob)->_p;
415 memcpy( cp, ((struct __FILE_public *)iob)->_p, c);
417 c += bp - (unsigned char *) iob->_ptr;
418 memcpy( cp, iob->_ptr, c);
423 /* the dest buffer is full */
425 iob->_IO_read_ptr += c;
426 #elif defined(__DragonFly__)
427 ((struct __FILE_public *)iob)->_r -= c;
428 ((struct __FILE_public *)iob)->_p += c;
437 * There's one character left in the input buffer.
438 * Copy it & fill the buffer. If the last char
439 * was a newline and the next char is not whitespace,
440 * this is the end of the field. Otherwise loop.
444 *cp++ = j = *(iob->_IO_read_ptr + c);
445 iob->_IO_read_ptr = iob->_IO_read_end;
446 c = __underflow(iob);
447 iob->_IO_read_ptr++; /* NOT automatic! */
448 #elif defined(__DragonFly__)
449 *cp++ =j = *(((struct __FILE_public *)iob)->_p + c);
452 *cp++ = j = *(iob->_ptr + c);
456 ((j == '\0' || j == '\n') && c != ' ' && c != '\t')) {
460 #elif defined(__DragonFly__)
461 --((struct __FILE_public *)iob)->_p;
462 ++((struct __FILE_public *)iob)->_r;
477 * get the message body up to bufsz characters or the
478 * end of the message. Sleazy hack: if bufsz is negative
479 * we assume that we were called to copy directly into
480 * the output buffer and we don't add an eos.
482 i = (bufsz < 0) ? -bufsz : bufsz-1;
484 bp = (unsigned char *) --iob->_IO_read_ptr;
485 cnt = (long) iob->_IO_read_end - (long) iob->_IO_read_ptr;
486 #elif defined(__DragonFly__)
487 bp = (unsigned char *) --((struct __FILE_public *)iob)->_p;
488 cnt = ++((struct __FILE_public *)iob)->_r;
490 bp = (unsigned char *) --iob->_ptr;
493 c = (cnt < i ? cnt : i);
494 if (msg_style != MS_DEFAULT && c > 1) {
496 * packed maildrop - only take up to the (possible)
497 * start of the next message. This "matchc" should
498 * probably be a Boyer-Moore matcher for non-vaxen,
499 * particularly since we have the alignment table
500 * all built for the end-of-buffer test (next).
501 * But our vax timings indicate that the "matchc"
502 * instruction is 50% faster than a carefully coded
503 * B.M. matcher for most strings. (So much for elegant
504 * algorithms vs. brute force.) Since I (currently)
505 * run MH on a vax, we use the matchc instruction. --vj
507 if ((ep = matchc( fdelimlen, fdelim, c, bp )))
511 * There's no delim in the buffer but there may be
512 * a partial one at the end. If so, we want to leave
513 * it so the "eom" check on the next call picks it up.
514 * Use a modified Boyer-Moore matcher to make this
515 * check relatively cheap. The first "if" figures
516 * out what position in the pattern matches the last
517 * character in the buffer. The inner "while" matches
518 * the pattern against the buffer, backwards starting
519 * at that position. Note that unless the buffer
520 * ends with one of the characters in the pattern
521 * (excluding the first and last), we do only one test.
524 if ((sp = pat_map[*ep])) {
526 /* This if() is true unless (a) the buffer is too
527 * small to contain this delimiter prefix, or
528 * (b) it contains exactly enough chars for the
530 * For case (a) obviously we aren't going to match.
531 * For case (b), if the buffer really contained exactly
532 * a delim prefix, then the m_eom call at entry
533 * should have found it. Thus it's not a delim
534 * and we know we won't get a match.
536 if (((sp - fdelim) + 2) <= c) {
538 /* Unfortunately although fdelim has a preceding NUL
539 * we can't use this as a sentinel in case the buffer
540 * contains a NUL in exactly the wrong place (this
541 * would cause us to run off the front of fdelim).
543 while (*--ep == *--cp)
547 /* we matched the entire delim prefix,
548 * so only take the buffer up to there.
549 * we know ep >= bp -- check above prevents underrun
555 /* try matching one less char of delim string */
557 } while (--sp > fdelim);
561 memcpy( buf, bp, c );
563 iob->_IO_read_ptr += c;
564 #elif defined(__DragonFly__)
565 ((struct __FILE_public *)iob)->_r -= c;
566 ((struct __FILE_public *)iob)->_p += c;
579 adios (NULL, "m_getfld() called with bogus state of %d", state);
583 msg_count = cp - buf;
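The end-of-buffer check in the BODY case above looks for a delimiter that has been split across two buffers. The idea can be shown without the Boyer-Moore position map: find the longest proper prefix of the delimiter that the buffer ends with, and hold those bytes back for the next call. This is a simplified sketch using plain `memcmp`, not the tuned matcher from the original; the name `partial_delim_suffix` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of m_getfld.c.
 * Return the length of the longest proper prefix of delim that is a
 * suffix of buf, or 0 if the buffer does not end in any such prefix.
 * A full match is deliberately excluded: that case is assumed to have
 * been found already by the ordinary delimiter search.
 */
static size_t
partial_delim_suffix(const char *buf, size_t buflen,
                     const char *delim, size_t delimlen)
{
    size_t k;

    assert(buf != NULL && delim != NULL);
    if (delimlen == 0)
        return 0;

    /* Longest candidate: one byte short of the whole delimiter. */
    k = (delimlen - 1 < buflen) ? delimlen - 1 : buflen;
    for (; k > 0; k--) {
        if (memcmp(buf + buflen - k, delim, k) == 0)
            return k;
    }
    return 0;
}
```

A caller would copy `buflen - k` bytes to the output and leave the last `k` bytes in the input so that the eom test on the next call sees the whole delimiter.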
589 static char unixbuf[BUFSIZ] = "";
599 register char *delimstr;
602 * Figure out what the message delimiter string is for this
603 * maildrop. (This used to be part of m_Eom but I didn't like
604 * the idea of an "if" statement that could only succeed on the
605 * first call to m_Eom getting executed on each call, i.e., at
606 * every newline in the message).
608 * If the first line of the maildrop is a Unix "From " line, we
609 * say the style is MBOX and eat the rest of the line. Otherwise
610 * we say the style is MMDF and look for the delimiter string
611 * specified when nmh was built (or from the mts.conf file).
614 msg_style = MS_UNKNOWN;
617 if (fread (text, sizeof(*text), 5, iob) == 5
618 && strncmp (text, "From ", 5) == 0) {
620 delimstr = "\nFrom ";
622 while ((c = getc (iob)) != '\n' && c >= 0)
626 while ((c = getc (iob)) != '\n' && c >= 0 && cp - unixbuf < BUFSIZ - 1)
631 /* not a Unix style maildrop */
632 fseek (iob, pos, SEEK_SET);
633 if (mmdlm2 == NULL || *mmdlm2 == 0)
634 mmdlm2 = "\001\001\001\001\n";
638 c = strlen (delimstr);
639 fdelim = (unsigned char *) mh_xmalloc((size_t) (c + 3));
642 msg_delim = (char *)fdelim+1;
643 edelim = (unsigned char *)msg_delim+1;
646 strcpy (msg_delim, delimstr);
647 delimend = (unsigned char *)msg_delim + edelimlen;
649 adios (NULL, "maildrop delimiter must be at least 2 bytes");
651 * build a Boyer-Moore end-position map for the matcher in m_getfld.
652 * N.B. - we don't match just the first char (since it's the newline
653 * separator) or the last char (since the matchc would have found it
654 * if it was a real delim).
656 pat_map = (unsigned char **) calloc (256, sizeof(unsigned char *));
658 for (cp = (char *) fdelim + 1; cp < (char *) delimend; cp++ )
659 pat_map[(unsigned char)*cp] = (unsigned char *) cp;
661 if (msg_style == MS_MMDF) {
662 /* flush extra msg hdrs */
663 while ((c = Getc(iob)) >= 0 && eom (c, iob))
672 m_eomsbr (int (*action)(int))
674 if ((eom_action = action)) {
681 msg_delim = (char *)fdelim + 1;
682 fdelimlen = strlen((char *)fdelim);
683 delimend = (unsigned char *)(msg_delim + edelimlen);
689 * test for msg delimiter string
693 m_Eom (int c, FILE *iob)
695 register long pos = 0L;
703 if ((i = fread (text, sizeof *text, edelimlen, iob)) != edelimlen
704 || strncmp (text, (char *)edelim, edelimlen)) {
705 if (i == 0 && msg_style == MS_MBOX)
706 /* the final newline in the (brain damaged) unix-format
707 * maildrop is part of the delimiter - delete it.
712 fseek (iob, pos, SEEK_SET);
715 fseek (iob, (long)(pos-1), SEEK_SET);
716 getc (iob); /* should be OK */
720 if (msg_style == MS_MBOX) {
722 while ((c = getc (iob)) != '\n' && c >= 0)
727 while ((c = getc (iob)) != '\n' && c >= 0 && cp - unixbuf < BUFSIZ - 1)
739 * Return the Return-Path and Delivery-Date
740 * header information.
742 * Currently, I'm assuming that the "From " line
743 * takes one of the following forms.
745 * From sender date remote from host (for UUCP delivery)
746 * From sender@host date (for sendmail delivery)
750 get_returnpath (char *rp, int rplen, char *dd, int ddlen)
752 char *ap, *bp, *cp, *dp;
755 if (!(bp = cp = strchr(ap, ' ')))
759 * Check for "remote from" in envelope to see
760 * if this message uses UUCP style addressing
762 while ((cp = strchr(++cp, 'r'))) {
763 if (strncmp (cp, "remote from", 11) == 0) {
764 cp = strrchr (cp, ' ');
770 * Get the Return-Path information from
771 * the "From " envelope.
774 /* return path for UUCP style addressing */
775 dp = strchr (++cp, '\n');
776 snprintf (rp, rplen, "%.*s!%.*s\n", (int)(dp - cp), cp, (int)(bp - ap), ap);
778 /* return path for standard domain addressing */
779 snprintf (rp, rplen, "%.*s\n", (int)(bp - ap), ap);
783 * advance over the spaces to get to
784 * delivery date on envelope
789 /* Now get delivery date from envelope */
790 snprintf (dd, ddlen, "%.*s\n", 24, bp);
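The two "From " line forms listed in the comment above get_returnpath() lead to two different return paths: `host!sender` for UUCP delivery and plain `sender` otherwise. The sketch below illustrates that distinction on a string holding the text after `"From "`. It is an assumption-laden simplification: it uses `strstr` rather than the original's character scan, it does not append the trailing newline the original writes, and the name `envelope_return_path` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of m_getfld.c.
 * env is the envelope text following "From ", for example
 *   "sender date remote from host\n"   (UUCP delivery)
 *   "sender@host date\n"               (sendmail delivery)
 * Writes the return path into rp. Returns 0 on success, -1 if the
 * envelope has no space after the sender or rp is too small.
 */
static int
envelope_return_path(const char *env, char *rp, size_t rplen)
{
    static const char marker[] = " remote from ";
    const char *sp, *rf;
    size_t slen;
    int n;

    assert(env != NULL && rp != NULL);
    sp = strchr(env, ' ');
    if (sp == NULL)
        return -1;
    slen = (size_t)(sp - env);

    rf = strstr(sp, marker);
    if (rf != NULL) {
        /* UUCP style: host!sender */
        const char *host = rf + strlen(marker);
        size_t hlen = strcspn(host, " \n");

        n = snprintf(rp, rplen, "%.*s!%.*s",
                     (int)hlen, host, (int)slen, env);
    } else {
        /* standard domain addressing */
        n = snprintf(rp, rplen, "%.*s", (int)slen, env);
    }
    return (n < 0 || (size_t)n >= rplen) ? -1 : 0;
}
```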
798 static unsigned char *
799 matchc(int patln, char *pat, int strln, char *str)
801 register char *es = str + strln - patln;
804 register char *ep = pat + patln;
805 register char pc = *pat++;
814 while (pp < ep && *sp++ == *pp)
817 return ((unsigned char *)--str);
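matchc() searches a counted buffer for a counted pattern, which matters because message bodies may contain NUL bytes and so `strstr` is not suitable. A straightforward, bounds-checked equivalent might look like the sketch below. It is not the original implementation and makes no claim about matching its speed; the name `find_delim` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of m_getfld.c.
 * Return a pointer to the first occurrence of pat (patlen bytes) in
 * buf (buflen bytes), or NULL if there is none. Both are treated as
 * raw bytes, so embedded NULs are fine.
 */
static const char *
find_delim(const char *pat, size_t patlen, const char *buf, size_t buflen)
{
    size_t i;

    assert(pat != NULL && buf != NULL);
    if (patlen == 0 || patlen > buflen)
        return NULL;

    for (i = 0; i + patlen <= buflen; i++) {
        /* Cheap first-byte test before the full comparison. */
        if (buf[i] == pat[0] && memcmp(buf + i, pat, patlen) == 0)
            return buf + i;
    }
    return NULL;
}
```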
823 * Locate character "term" in the next "cnt" characters of "src".
824 * If found, return its address, otherwise return 0.
827 static unsigned char *
828 locc(int cnt, unsigned char *src, unsigned char term)
830 while (*src++ != term && --cnt > 0);
832 return (cnt > 0 ? --src : (unsigned char *)0);
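On a system with a reasonable C library, the job locc() does (find a byte within the next cnt bytes) is what the standard `memchr` provides. The wrapper below is a sketch of that equivalence, with the hypothetical name `locc_memchr`; unlike the loop above, it never reads `src[0]` when cnt is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of m_getfld.c.
 * Locate byte term in the first cnt bytes of src.
 * Returns its address, or NULL if it is not found.
 */
static const unsigned char *
locc_memchr(size_t cnt, const unsigned char *src, unsigned char term)
{
    assert(src != NULL);
    if (cnt == 0)
        return NULL;
    return memchr(src, term, cnt);
}
```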