Digestifying an ITS Mailing List

Why Digestify?
--------------

First, what is digestifying and why do it?  A mailing list is used by a
mailer program (such as ITS's COMSAT) to distribute messages to more than
one address, translating the single list address given into the addresses
of the intended recipients.  Normally this process occurs without any
built-in delays; the mailer receives a message, checks the set of
addressees for mailing lists to expand, performs such expansions, and
immediately delivers the message to the desired recipients.

However, some mailing lists have a consistently high volume of mail
travelling to them.  Any message sent through an ITS machine ties up its
COMSAT in proportion to the number of recipients and the number of messages
sent, rather than in proportion to the total number of characters sent --
for example, delivering any message to AI-List (one very large mailing
list) takes two and a half hours!  Mail to hosts that are neither local to
MIT nor directly connected to the central nodes of the Arpanet -- "weird"
hosts -- is particularly expensive.  Thus, any mailing list which gets more
than a few messages per day and which goes to more than about ten weird
hosts imposes a large load on COMSAT.

Also, the overhead to list members of reading mail sent to the list is a
function of the number of messages received, as well as the number of
characters.  For these two reasons, large mailing lists should be
-digested-; that is, arrangements should be made to collect all the
messages sent to such a list during each day, bundle them up as one single,
long message, and send that message, the digest, to the list members.  The
digest has a special format which can be undigested or burst -- broken up
into individual messages -- by many mail-reading programs.

This file explains how to get the digestifier daemon to digest an ITS
mailing list automatically, using as an example a hypothetical mailing list
called FOOBLATZ.


How To Digestify
----------------

So now that you've decided your mailing list should be digested, how do
you arrange that?

First, all automatically digestified ITS mailing lists currently must
reside on MC.LCS.MIT.EDU; if yours lives elsewhere, you must move it to MC.
You can make forwarding pointers from other machines to MC; on another ITS
machine, such a pointer would be a line like

(FOOBLATZ (EQV-LIST FOOBLATZ@MC))

in the other machine's .MAIL.;NAMES file.

Second, you must alter the mailing list entry in MC:.MAIL.;NAMES > .  Mail
sent to FOOBLATZ needs to be collected into a file, the inbox, for later
digestification, rather than immediately sent out to members of the mailing
list.  So the mailing list entry for FOOBLATZ should look like

(FOOBLATZ (EQV-LIST ([COMAIL;FOO INBOX] (R-OPTION FAST-APPEND))))

Note that the FAST-APPEND option makes COMSAT append new mail to the end of
the inbox, which will cause the FOOBLATZ Digest to include the messages
sent to FOOBLATZ in the chronological order in which they arrived at MC.
Not including this option will cause the digest to include the messages in
the reverse order of arrival, which will confuse many list members.
Further, as explained below in the digestifier algorithm section, when
discussion grows very brisk, the inbox may contain more than one digest's
worth of messages; in this case the digestifier will create a digest
starting at the beginning of the file and going until it reaches its size
limit, so the FAST-APPEND option will ensure that the older part of the
conversation is sent out first, and that very old messages don't accumulate
unsent.
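
The effect of arrival order on an overfull inbox can be sketched in Python.
This is only an illustration, not the digestifier's actual code: the
function name, the list-of-strings model of the inbox, and the rule that at
least one message is always taken are all assumptions.

```python
# Hypothetical model: the inbox is a list of message texts, oldest first,
# as FAST-APPEND would leave it.  SIZE_LIMIT is the approximate figure
# quoted later in this file, not an exact constant from the source.
SIZE_LIMIT = 48000


def take_one_digest(inbox, limit=SIZE_LIMIT):
    """Split inbox into (messages for this digest, messages left over)."""
    taken, used = [], 0
    for msg in inbox:
        # Always take at least one message so that a single oversized
        # message cannot block the queue (an assumption; the real policy
        # is not documented here).
        if taken and used + len(msg) > limit:
            break
        taken.append(msg)
        used += len(msg)
    return taken, inbox[len(taken):]
```

Because the scan starts at the front of the list, whatever is left over is
always the newest mail, which is what you want when the inbox holds more
than one digest's worth.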

Third, you must create an entry in the file MC:DIGEST;DEFS > for your
digest.  Everything in that file up to the first ^_ (ascii 037) is a
comment, and after that comes a series of digest definitions, separated by
^_'s; to make life easy, put yours at the end.

Each digest definition has a format that looks like an RFC822 mail header
-- that is, it consists of a series of named fields of the form:

Name: FooBlatz
Inbox: COMAIL;FOO INBOX
Administrivia: COMAIL;FOO ADMIN
Record: COMAIL;FOO RECORD
First-Issue-Number: 259
AUTHOR: FOOBLATZ-REQUEST
RCPT: (@FILE [COMAIL;FOO LIST])
From: FooBlatz Daily Blast
      <FooBlatz-Request%MC.LCS.MIT.EDU@Mintaka.LCS.MIT.EDU>
Reply-To: FOOBLATZ%MC.LCS.MIT.EDU@Mintaka.LCS.MIT.EDU
To: FOOBLATZ%MC.LCS.MIT.EDU@Mintaka.LCS.MIT.EDU

Continuation lines, such as the second line of the From: field in the
example above, -must- start with a space or a tab.  Blank lines between
fields are ignored, so you can insert them in your digest entry if you
like.  The order of the fields in the definition does not matter.
Capitalization of field names does not matter, but capitalization of field
values does -- if you want the From: field to look like "FooBlatz-Request"
in the digests sent out, capitalize it that way.
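
As a rough illustration of these rules, here is a minimal Python sketch of
a reader for this format.  It is not the digestifier's parser; the function
names are made up, and joining a continuation line to its field with a
single space is an assumption.

```python
import re

SEPARATOR = "\x1f"  # ^_ (ASCII 037 octal)


def parse_definition(text):
    """Parse one digest definition into {lowercased field name: value}."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines between fields are ignored
        if line[0] in " \t":
            if current is None:
                raise ValueError("continuation line before any field")
            # Joined with a single space (an assumption).
            fields[current] += " " + line.strip()
            continue
        match = re.match(r"([^:\s]+):[ \t]*(.*)$", line)
        if match is None:
            raise ValueError("not a field line: %r" % line)
        current = match.group(1).lower()  # field names ignore case
        fields[current] = match.group(2).rstrip()  # values keep their case
    return fields


def parse_defs(contents):
    """Everything before the first ^_ is a comment; the rest are definitions."""
    chunks = contents.split(SEPARATOR)[1:]
    return [parse_definition(c) for c in chunks if c.strip()]
```

Storing the field names in lower case makes lookups independent of how the
definition was typed, while the values are kept exactly as written.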

Here is a catalog of all of the currently accepted fields, their
meanings, and whether they are required for the digestifier to work:

  Name:			(required)

    The name of the digest, such as "FooBlatz".  This name is usually used
    before the word "Digest", as in "FooBlatz Digest #259".

  Inbox:		(required)

    The name of the file that COMSAT delivers mail to for this digest.  The
    device is defaulted to "DSK", the directory is defaulted to "COMAIL",
    and the second filename is defaulted to "INBOX".  You -must- supply the
    first filename.  Thus you can say just "Inbox: FOO" if your inbox is
    DSK:COMAIL;FOO INBOX.
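
    The defaulting can be pictured with the following Python sketch.  It
    assumes a simplified "DEV:DIR;FN1 FN2" syntax with the parts in that
    order, which is narrower than what ITS really accepts, and the function
    name is hypothetical.

```python
def default_inbox_name(spec):
    """Fill in DSK, COMAIL and INBOX for any part the Inbox: field omits."""
    device, directory = "DSK", "COMAIL"
    rest = spec.strip()
    if ":" in rest:
        device, rest = rest.split(":", 1)
    if ";" in rest:
        directory, rest = rest.split(";", 1)
    names = rest.split()
    if not names:
        # The first filename has no default and must be supplied.
        raise ValueError("Inbox: must supply a first filename")
    fn1 = names[0]
    fn2 = names[1] if len(names) > 1 else "INBOX"
    return "%s:%s;%s %s" % (device.strip(), directory.strip(), fn1, fn2)
```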

  Administrivia:	(optional)

    The name of the file that the digestifier should check for
    administrative messages that should be inserted at the front of the
    next digest.  If you omit the Administrivia field entirely, the
    digestifier will not look for an administrivia file at all.  If you
    include the field but leave its value blank, the file name defaults to
    that of the inbox file, but with the second filename "ADMIN".

    If this field exists, the digestifier will look for a file of the
    specified name; if the file exists, the digestifier will include its
    contents in the digest between the list of message topics and the first
    message, and delete the file.  Note that the administrivia file is not
    a mailbox -- its contents will be included in the digest exactly,
    including all the headers and other (for this purpose) extraneous
    nonsense of anything sent to the file as mail.  Spare your list members
    by avoiding this action; log in and write the file directly if that's
    at all possible.  When you write the file, you don't need to explicitly
    create white space around your text; the digestifier will automatically
    provide blank lines before and after it.

  Record:		(optional)

    The name of the file that the digestifier uses to keep track of the
    state of the digest.  This contains vital data like the current issue
    number and the time that the most recent digest was mailed.  By default
    this file has the same name as the inbox file, but with the second
    filename of "RECORD".

    Do not try to create this file yourself!  Doing so will only confuse
    the situation.  The digestifier will create this file the first time it
    processes your digest; if you don't specify a Record field, the
    digestifier will use the default name for this file.

  First-Issue-Number:	(optional, usually)

    This field is -only- used by the digestifier when it creates a new
    record file the first time it processes your digest.  This is used to
    initialize the issue number stored in the record file so that the next
    digest created will have the given number.  It should consist of a
    string of digits (only digits!) representing a decimal number, like
    "259".

    If this field is not present and the digestifier can't find a record
    file, then the digest definition is broken and no digest will be
    produced.  For safety, you can remove the First-Issue-Number from your
    digest definition after your record file is created; that way, if
    someone accidentally deletes your record file, the digestifier won't
    automatically recreate it and start duplicating issue numbers.

    When converting an existing digest to use this digestifier, the initial
    contents of this field should be set to the contents of the existing
    "NUMBER" file associated with that digest, to preserve continuity of
    issue numbers.
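
    The interaction between the record file and this field can be
    summarized in a short Python sketch.  The function and its arguments
    are hypothetical, the record file is reduced to a single number, and
    treating a malformed value as a broken definition is an assumption.

```python
def starting_issue_number(record_issue, first_issue_field):
    """Return the number for the next digest, or None if the definition
    is broken.

    record_issue      -- number for the next issue as tracked in the record
                         file, or None if no record file exists
                         (hypothetical representation)
    first_issue_field -- raw text of First-Issue-Number:, or None if absent
    """
    if record_issue is not None:
        return record_issue  # the field is ignored once a record exists
    if first_issue_field is None:
        return None  # no record file and no field: no digest is produced
    text = first_issue_field.strip()
    if not (text.isascii() and text.isdigit()):
        return None  # only a string of decimal digits is acceptable
    return int(text)
```

    This also shows why removing the field is a safety measure: with no
    record file and no field, the result is "broken" rather than a silent
    restart of the numbering.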

  RCPT:			(required)

    This is the digest recipient, the address to which the digest will
    actually be mailed (independent of what you may list in the To: and
    From: fields!).  This has exactly the format of a recipient from
    .MAIL.;NAMES >, except it cannot be continued onto a continuation line.
    Typically you will keep the actual mailing list in a file, say
    COMAIL;FOO LIST, and have a RCPT field like

	RCPT: (@FILE [COMAIL;FOO LIST])

    Defining the list in an indirect file is a good idea for large mailing
    lists that change frequently, since it allows you to avoid recompiling
    NAMES > and messing with the NAMED ERR file.

    When converting an existing digest to use this digestifier, note that a
    separate entry in NAMES > is no longer called for to hold the mailing
    list which includes the actual list members.  You can use such a list,
    but it's probably easier to simply specify the indirect file containing
    the list members explicitly in this field.

  AUTHOR:		(required)

    This is where delivery error messages should be directed.

    Unlike the RCPT: field, this is not a NAMES > style recipient; nor is
    it an RFC822 style recipient (something of the form User@Host).  It can
    only be a simple string -- "FOOBLATZ-REQUEST" in our example.  This
    means that unless you put a single person's name here, you will have to
    create a mailing list to receive the errors.

    If you want to try to keep human-generated requests apart from
    mailer-generated errors, you can create a mailing list separate from
    your administrative list -- called, say, FooBlatz-Errors -- and put it
    here in the AUTHOR: field.

  From:			(required)
  Reply-To:		(optional)
  To:			(optional)

    The values of these three fields are copied verbatim into the header of
    all generated digests.  If the optional fields are not given, the
    generated digests will not have these fields -- no default values are
    generated for them.  Please be careful to specify only RFC822-legal
    values for these fields.  Currently most digests use an address of the
    form

	From: FooBlatz Daily Blast
	      <FooBlatz-Request%MC.LCS.MIT.EDU@Mintaka.LCS.MIT.EDU>
    or
	To: FOOBLATZ%MC.LCS.MIT.EDU@Mintaka.LCS.MIT.EDU

    (By the way, there is no reason why MC's name has to appear here.  Your
    subscribers don't need to know that MC is involved in producing the
    digest as long as you give them -some- address that reaches your
    inbox.)

    Generally the From: field will contain the name of the mailing list's
    auxiliary administration list -- FooBlatz-Request in our example.  This
    is the address where people will generally send their administrative
    requests.  It need -not- be the same as the address that appears in the
    AUTHOR: field, although typically it is.

    The To: and Reply-To: fields should contain the address of the mailing
    list itself -- that is, the address where people send mail they want
    included in the digest.  Mail sent to this address should eventually
    reach your digest's inbox file.

    Actually, many other well-known RFC822 header fields can be given as
    fields in the digest definition, but most digests will want to use
    exactly these three.  (See the source code if you want to know what
    others will work.  Note that Date, Subject and Message-ID headers are
    automatically generated for each issue of each digest by the
    digestifier.)


Digestifier Algorithms
----------------------

The digestifier is run automatically once every hour.  It reads through the
file DIGEST;DEFS > and considers each digest in turn, keeping a log of its
actions in the file DIGEST;LOG >.

For each digest, the digestifier considers a number of factors to determine
whether or not it is going to produce a digest this time:

1.  The current time of day.  The hours between 2AM and 7AM are "Prime
    Time" and the digestifier prefers to create a digest then.

2.  The current size of the digest's inbox.  The digestifier never produces
    a digest larger than a certain size (around 48000 characters).  If the
    inbox looks like it contains more than 1.5 digests worth of material,
    then the inbox is "bloated" and the digestifier tries to create a
    digest soon.

3.  How long ago the previous digest was mailed.  The digestifier tries not
    to produce digests so frequently that people and mailers are
    overwhelmed with them, nor so infrequently that a message can sit in
    the inbox for an unreasonably long time.

The precise test is:

  (AND <the previous digest was created more than 90 minutes ago>
       (OR <the inbox is bloated>	; more than 1.5 digests pending
	   (AND <it is prime time>	; between 2AM and 7AM
		<the previous digest was created more than 18 hours ago>)))

In English, this translates as:
    If the last issue of this digest was sent out less than an hour and a
half ago, wait.  If the last issue went out longer ago than that and the
inbox is bloated, create a digest.  But if the inbox isn't bloated, check
whether it's prime time; if it is, and the last issue went out more than 18
hours ago, then create and send a new issue.
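
For readers who prefer something executable, the same test can be written
as a Python sketch.  The function names and the way the inbox size is
passed in are assumptions; the numbers are the ones given above.  Whatever
the digestifier does with an empty inbox is not part of the stated test and
is not modeled here.

```python
from datetime import datetime, timedelta

DIGEST_SIZE = 48000  # approximate maximum digest size, in characters


def is_prime_time(now):
    return 2 <= now.hour < 7  # between 2AM and 7AM


def is_bloated(inbox_size):
    return inbox_size > 1.5 * DIGEST_SIZE  # more than 1.5 digests pending


def should_digest(now, last_digest, inbox_size):
    """Direct transcription of the (AND ... (OR ... (AND ...))) test."""
    age = now - last_digest
    return age > timedelta(minutes=90) and (
        is_bloated(inbox_size)
        or (is_prime_time(now) and age > timedelta(hours=18))
    )
```

For example, should_digest(datetime(2024, 1, 2, 3, 0),
datetime(2024, 1, 1, 3, 0), 1000) is true: the inbox is small, but it is
prime time and the last issue is a day old.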

The various numbers and times are all subject to future adjustment, of
course.

This digestifier should be fairly robust in the face of system crashes,
being gunned down in the middle of processing, etc.  The worst that can
happen is that a duplicate issue can be produced, and that can only occur
if the digestifier is zapped during an extremely small window.  I'll be
surprised if it ever happens.

It is perfectly safe to run two digestifiers at once, since both the
digestifier and COMSAT use the LOCK device to coordinate access to inboxes.

In fact, if you edit the DEFS file, it is probably a good idea to run the
digestifier once yourself and check the LOG file to see if you made any
errors.  Even if your inbox is empty, this procedure will catch many
possible problems with your digest definition.  You should be able to run
the digestifier by typing:

    :DIGEST;DIGEST

to DDT.  This might take a couple of minutes to finish (the digestifier
might decide to produce some digests!) so be patient.  Then you should look
at what was appended to the end of the current DIGEST;LOG > file.

This digestifier tries to be fairly civic-minded about cleaning up after
itself.  If it encounters any errors during the processing of a digest it
logs the error, then carefully deletes any partially-written output and
either proceeds to the next digest, or kills itself (depending on the
nature of the error).  Only a few amazingly unlikely errors should ever
leave a dead disowned job as a corpse.

Mail is always delivered through the bulk COMSAT (that is, through the
.BULK. directory).

This digestifier is careful to check the messages included in the digests
it builds for lines of "-"s that could confuse digest bursting or
undigestifying tools, and modifies the first "-" in any suspect line to be
a space.
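
The idea can be illustrated with a few lines of Python.  The rule used here
for what makes a line "suspect" (a line consisting of nothing but two or
more hyphens, optionally followed by blanks) is purely an assumption; the
digestifier's real test may be broader, so consult the source for the exact
rule.

```python
import re

# Hypothetical rule: a line is "suspect" if it is nothing but hyphens
# (plus optional trailing blanks).  The real test may differ.
SUSPECT = re.compile(r"-{2,}[ \t]*$")


def defang_dashes(message):
    """Replace the first '-' of each suspect line with a space."""
    out = []
    for line in message.split("\n"):
        if SUSPECT.match(line):
            line = " " + line[1:]
        out.append(line)
    return "\n".join(out)
```

Under this rule a line such as "-- Joe" is left alone, while a row of
hyphens used as a separator inside a message loses its first hyphen.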

Send bug reports to Bug-DIGEST.


This digestifier was written by Alan Bawden and supersedes previous
digestifiers written by Rob Austein, David Wallace, and Chris Maeda.  A lot
of the documentation was adapted from documentation written by David
Chapman for the GUMBY digestifier.