Python DevCenter
Python Recipe of the Day


The following recipe is from Python Cookbook, 2nd Edition, by Alex Martelli, Anna Ravenscroft and David Ascher. All links in this recipe point to the online version of the book on the Safari Bookshelf.


In addition, visit the online Python Cookbook, a collaborative website, built by ActiveState and O'Reilly, which hosts contributions from the entire Python Community.


13.8. Removing Attachments from an Email Message

Credit: Anthony Baxter

Problem

You're handling email in Python and need to remove from email messages any attachments that might be dangerous.

Solution

Regular expressions can help us identify dangerous content types and file extensions, and thus code a function to remove any potentially dangerous attachments:

ReplFormat = """
This message contained an attachment that was stripped out.
The filename was: %(filename)s,
The original type was: %(content_type)s
(and it had additional parameters of:
%(params)s)
"""
import re
BAD_CONTENT_RE = re.compile(r'application/(msword|msexcel)', re.I)
# re.I here as well, so that upper-case extensions such as .EXE are caught
BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$', re.I)
def sanitise(msg):
    ''' Strip out all potentially dangerous payloads from a message '''
    ct = msg.get_content_type()
    fn = msg.get_filename()
    if BAD_CONTENT_RE.search(ct) or (fn and BAD_FILEEXT_RE.search(fn)):
        # bad message-part, pull out info for reporting then destroy it
        # present the parameters to the content-type, list of key, value
        # pairs, as key=value forms joined by comma-space
        params = msg.get_params()[1:]
        params = ', '.join(['='.join(p) for p in params])
        # put informative message text as new payload
        replace = ReplFormat % dict(content_type=ct, filename=fn, params=params)
        msg.set_payload(replace)
        # now remove parameters and set contents in content-type header
        for k, v in msg.get_params()[1:]:
            msg.del_param(k)
        msg.set_type('text/plain')
        # Also remove headers that make no sense without content-type
        del msg['Content-Transfer-Encoding']
        del msg['Content-Disposition']
    else:
        # Now we check for any sub-parts to the message
        if msg.is_multipart():
            # Call sanitise recursively on any subparts
            payload = [sanitise(x) for x in msg.get_payload()]
            # Replace the payload with our list of sanitised parts
            msg.set_payload(payload)
    # Return the sanitised message
    return msg
# Add a simple driver/example to show how to use this function
if __name__ == '__main__':
    import email, sys
    m = email.message_from_file(open(sys.argv[1]))
    print(sanitise(m))

Discussion

This issue has come up a few times on the newsgroup comp.lang.python, so I decided to post a cookbook entry to show how easy it is to deal with this kind of task. Specifically, this recipe shows how to read in an email message, strip out any dangerous or suspicious attachments, and replace them with a harmless text message informing the user of the alterations that we've performed.

This kind of task is particularly important when end users are using something like Microsoft Outlook, which is targeted by harmful virus and worm messages (collectively known as malware) on a daily basis.

The email parser in Python 2.4 has been completely rewritten to be robust first, correct second. Prior to that version, the parser was written for correctness first. But focusing on correctness was a problem because viruses, worms, and other malware routinely send email messages that are broken and nonconformant, malformed to the point that the old email parser chokes and dies. The new parser is designed to never actually break when reading a message. Instead, it tries its best to fix whatever it can fix in the message. (If you have a message that causes the parser to crash, please let us, the core Python developers, know. It's a bug, and we'll fix it. Please include a copy of the message that makes the parser crash, or else it's very unlikely that we can reproduce your problem!)
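As a rough illustration of this robust-first behavior, the sketch below feeds the parser a deliberately malformed message and inspects the defects it records. It assumes a current Python 3 standard library rather than the 2.4 release discussed here, and the sample message is invented for the example.

```python
import email
from email import errors

# A deliberately malformed message (made up for this sketch): it claims
# to be multipart but never declares a boundary parameter.
raw = (
    "From: sender@example.com\n"
    "Subject: broken\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "no boundary was declared for this body\n"
)

# Parsing does not raise; the problem is recorded on the message instead.
msg = email.message_from_string(raw)

defect_names = [type(d).__name__ for d in msg.defects]
has_boundary_defect = any(
    isinstance(d, errors.NoBoundaryInMultipartDefect) for d in msg.defects
)

print(defect_names)
print(msg.is_multipart())  # the body was kept as a plain string payload
```

Checking msg.defects before trusting a message is one possible way to flag suspicious mail even when parsing succeeds.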

The recipe's code itself is fairly well commented and should be easy enough to follow. A mail message consists of one or more parts; each of these parts can contain nested parts. We call the sanitise function on the top-level Message object, and it calls itself recursively on the subobjects if and as needed.
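To make the nesting concrete, here is a minimal sketch that builds a small message tree and lists its parts recursively, following the same traversal pattern that sanitise uses. It assumes Python 3; the helper name describe and the sample parts are hypothetical and not part of the recipe.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a message with one nested multipart and one attachment.
outer = MIMEMultipart("mixed")
inner = MIMEMultipart("alternative")
inner.attach(MIMEText("plain body", "plain"))
inner.attach(MIMEText("<p>html body</p>", "html"))
outer.attach(inner)
attachment = MIMEApplication(b"fake bytes", "octet-stream")
attachment.add_header("Content-Disposition", "attachment", filename="setup.exe")
outer.attach(attachment)

def describe(part, depth=0):
    """Return one indented line per part, recursing into multiparts."""
    lines = ["  " * depth + part.get_content_type()]
    if part.is_multipart():
        # For a multipart, get_payload() is a list of sub-messages.
        for sub in part.get_payload():
            lines.extend(describe(sub, depth + 1))
    return lines

tree = describe(outer)
print("\n".join(tree))
```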

The sanitise function first checks the Content-Type of the part, and if there's a filename, it also checks that filename's extension against a known-to-be-bad list. If the message part is bad, we replace its payload with a short text description of the now-removed part. We then set this message part's Content-Type to 'text/plain' and remove the other headers (Content-Transfer-Encoding and Content-Disposition) that no longer make sense for the replaced part.
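The check on its own can be sketched as a small predicate and tried against a few sample parts. The helper name is_suspicious and the sample parts are hypothetical; the patterns mirror the recipe's, except that the extension pattern here is compiled with re.I, which is an assumption made so that upper-case extensions also match.

```python
import re
from email.mime.application import MIMEApplication
from email.mime.text import MIMEText

BAD_CONTENT_RE = re.compile(r'application/(msword|msexcel)', re.I)
# Assumption: re.I added so that ".EXE" matches as well as ".exe".
BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$', re.I)

def is_suspicious(part):
    """True if the part's content type or filename extension looks bad."""
    ct = part.get_content_type()
    fn = part.get_filename()  # None when the part carries no filename
    return bool(BAD_CONTENT_RE.search(ct) or (fn and BAD_FILEEXT_RE.search(fn)))

# Three sample parts, invented for this sketch.
doc = MIMEApplication(b"fake document", "msword")
exe = MIMEApplication(b"fake program", "octet-stream")
exe.add_header("Content-Disposition", "attachment", filename="Invoice.EXE")
note = MIMEText("hello")

results = [is_suspicious(p) for p in (doc, exe, note)]
print(results)
```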

Otherwise, we check whether the part is a multipart message. If so, it has subparts, so we recursively call the sanitise function on each of them. We then replace the payload with our list of sanitised subparts.

If you're interested in working further on this recipe, the most important extra functionality, which is easy to add with a small amount of work, might be to store the attached file in some directory (instead of destroying all suspect attachments), and give the user a link to that file. Also consider extending the check in sanitise that filters dangerous attachments to have it verify more than just the content type and file extension; other headers may be able to carry known signs of worm or virus messages.
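One possible shape for the first extension is sketched below: instead of discarding a suspect part, its decoded payload is written to a quarantine directory and the resulting path is returned, so that it could be mentioned in the replacement text. This assumes Python 3; the function name quarantine, the file-name prefix, and the fallback name are all hypothetical.

```python
import os
import tempfile
from email.mime.application import MIMEApplication

def quarantine(part, directory):
    """Save a part's decoded payload under `directory`; return the path."""
    # basename() drops directory components such as "../" from the
    # sender-supplied name; "unnamed.bin" is an arbitrary fallback.
    name = os.path.basename(part.get_filename() or "") or "unnamed.bin"
    # mkstemp() picks a unique name, so files are never overwritten.
    fd, path = tempfile.mkstemp(prefix="quarantined-", suffix="-" + name,
                                dir=directory)
    with os.fdopen(fd, "wb") as out:
        # decode=True undoes base64 or quoted-printable transfer encoding.
        out.write(part.get_payload(decode=True) or b"")
    return path

# Usage with an invented attachment whose filename tries to escape.
part = MIMEApplication(b"fake executable bytes", "octet-stream")
part.add_header("Content-Disposition", "attachment", filename="../evil.exe")
workdir = tempfile.mkdtemp()
saved = quarantine(part, workdir)
print(os.path.basename(saved))
```

A real deployment would also need to decide who may read the quarantine directory and how long files are kept; the sketch leaves those choices open.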

See Also

Documentation for the standard library modules email and re in the Library Reference and Python in a Nutshell.

