"Linux Gazette...making Linux just a little more fun!"

Cracking open proprietary envelopes

By Adrian J. Chung

Anyone with an email address can expect to receive attachments in a multitude of formats. Unfortunately, some formats cannot be read using free software. This is especially true if our email buddies are still involved in the arguably risky practice of using proprietary programs in conjunction with their email readers.

Many free software advocates adopt a policy of ignoring all email with attachments dependent on closed source software, opting instead to lecture the sender on the importance of open standards. Others may not like missing out on the fun to be had from attachments being forwarded amongst their peers. If you find yourself in this situation, the techniques outlined in this article may serve as a partial solution.

There is not much a Linux user can do if the entire contents of an attachment are encoded using a jealously guarded secret algorithm. Very often, however, the problematic file is merely a thin proprietary envelope enclosing a loose collection of data objects that use well-known encoding standards. For instance, some MS Word documents being forwarded around the Net contain ordinary JPG and PNG images embedded within the file. If we can find a way to remove the envelope, reading these enclosed files would be a straightforward matter. The following sections describe how this can be accomplished using a little Python scripting together with a few image viewing and manipulation tools available on most Linux distributions.

Extracting the text

Before tackling the problem of the embedded images, we can easily view any readable text using the strings utility:

strings proprietary.file | less

This will output any strings of at least 4 bytes in length that consist of readable ASCII characters. Naturally, a lot more than just intelligible sentences will be returned. Most will be junk, but the readable text is easily spotted. The strings tool will also pick up the readable header information within the embedded images themselves. JPEG files contain the string "JFIF" in the header. This gives us a quick way to check what types of images a file may contain, and gives an indication of how many there are.

strings proprietary.file | grep JFIF
strings -n 3 proprietary.file | grep PNG
strings proprietary.file | grep GIF8

The -n 3 allows us to detect readable strings as short as 3 characters. Not every occurrence of "JFIF" is necessarily a JPEG image since the document itself may have mentioned JFIF in a paragraph of text -- though this is rare among the email attachments most commonly forwarded.
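
For readers curious about what strings is doing, its core behaviour can be approximated in a few lines of Python 3. This is a rough sketch rather than a reimplementation: the helper name printable_strings and the sample data are invented for illustration, and the real utility has more options and slightly different rules about which characters count as printable.

```python
import re

def printable_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII bytes found in data."""
    # bytes 0x20 (space) through 0x7e (~) are treated as printable here
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.decode("ascii") for m in pattern.findall(data)]

# made-up sample: binary junk around a JPEG-style and a PNG-style marker
sample = (b"\x00\x01\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02"
          b"\x89PNG\r\n\x1a\n\x00Hello there\x00")

print(printable_strings(sample))      # default: runs of 4 or more
print(printable_strings(sample, 3))   # like strings -n 3
```

With the default minimum of 4 the three-character "PNG" run is skipped, which is why the grep for PNG above needs -n 3.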

Locating the images

We need to find where exactly each image is located within the file. A little Python will help to find possible embedded images and report their positions as a byte offset:

#read in proprietary data as raw bytes
fh = open( "proprietary.file", "rb" )
dat = fh.read()
fh.close()

#search for JFIF
x = -1
while 1:
    x = dat.find(b"JFIF",x+1)
    if x<0: break
    #file actually started 6 bytes earlier
    print(x - 6)

This will find the byte offset of every embedded JPEG file, though not every offset is guaranteed to mark a valid file. The script can easily be extended to handle GIF and PNG images:

Listing 1

#!/usr/bin/env python3
from sys import argv

#keyword to search for, and how many bytes before it the file begins
headers = [(b"GIF8",0), (b"PNG",1), (b"JFIF",6)]
filepath = "proprietary.file"
if len(argv)>1: filepath = argv[1]

#read raw bytes; images are binary data
fh = open(filepath, "rb")
dat = fh.read()
fh.close()

for kw,off in headers:
    #start at -1 so that a match at byte 0 is not skipped
    x = -1
    while 1:
        x = dat.find(kw,x+1)
        if x<0: break
        print(kw.decode(),"file begins at byte",x - off)

Note that the image file begins a few bytes before the "PNG" or "JFIF" string.
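
Since these short keywords can also turn up by chance, a candidate offset can be double-checked against the complete signature that each format places at the very start of a file. The following Python 3 sketch does this. The function name looks_like_image and the sample bytes are hypothetical, and a matching signature still does not guarantee that a complete, valid image follows.

```python
# leading bytes that every file of the given type starts with
SIGNATURES = {
    "gif": (b"GIF87a", b"GIF89a"),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpg": (b"\xff\xd8\xff",),
}

def looks_like_image(data, offset, kind):
    """True if data[offset:] begins with a known signature for kind."""
    if offset < 0:
        return False
    return any(data.startswith(sig, offset) for sig in SIGNATURES[kind])

# made-up envelope: 5 junk bytes, then a JPEG-style header
sample = b"junk!" + b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
x = sample.find(b"JFIF")        # position of the keyword
print(x, looks_like_image(sample, x - 6, "jpg"))

# the word PNG in ordinary text is not preceded by the 0x89 byte
text = b"see the PNG spec"
print(looks_like_image(text, text.find(b"PNG") - 1, "png"))
```

A check like this could be used to filter the offsets reported by Listing 1 before anything is handed to an image viewer.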

Displaying the images

Now that we know where each image is likely to start, how do we display them? ImageMagick's display utility can help here. Suppose our proprietary file contains a JPEG image beginning at byte 1000. We can use tail to remove all the bytes that precede it and pipe the rest to display:

tail -c +1001 proprietary.file | display -

Note that tail -c begins counting bytes at 1. In case we have many dozens of embedded image files we can adapt our previous Python script to automate the process.
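
The off-by-one between the two conventions is easy to get wrong, so here is a small Python 3 sketch of the correspondence. The helper tail_from is a made-up stand-in that mimics tail -c +N on an in-memory byte string for N of 1 or more. Python offsets start at 0, while tail counts from 1.

```python
def tail_from(data, n):
    """Mimic `tail -c +n` for n >= 1: data starting at its n-th byte."""
    return data[n - 1:]

data = b"abcdefghij"
offset = 3                      # zero-based offset, as reported by find()

# tail needs offset + 1 to start at the same byte as data[offset:]
print(tail_from(data, offset + 1) == data[offset:])
print(tail_from(data, 1) == data)   # +1 means "the whole file"
```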

Listing 2

#!/usr/bin/env python3
from sys import argv
from os import system

headers = [(b"GIF8",0), (b"PNG",1), (b"JFIF",6)]
filepath = "proprietary.file"
if len(argv)>1: filepath = argv[1]

#read raw bytes; images are binary data
fh = open(filepath, "rb")
dat = fh.read()
fh.close()

for kw,off in headers:
    #start at -1 so that a match at byte 0 is not skipped
    x = -1
    while 1:
        x = dat.find(kw,x+1)
        if x<0: break
        #tail counts bytes from 1, hence the + 1
        system("tail -c +%d %s | display -" % (x - off + 1, filepath))

Extracting each image file

ImageMagick throws away any excess data fed to it after reading to the end of the image segment. If we want to separate the image data completely for storage as individual files, we also need to find the end of each image. One way to do this is to use a modified binary chop algorithm.

Listing 3

#!/usr/bin/env python3
from sys import argv
from subprocess import getstatusoutput

headers = [(b"GIF8",0,"giftopnm","gif"), (b"PNG",1,"pngtopnm","png"),
           (b"JFIF",6,"djpeg","jpg")]
filepath = "proprietary.file"
if len(argv)>1: filepath = argv[1]

#read raw bytes; images are binary data
fh = open(filepath, "rb")
dat = fh.read()
fh.close()

inum = 0
for kw,off,conv,ext in headers:
    x = -1
    while 1:
        x = dat.find(kw,x+1)
        if x<0: break
        beg = x - off
        #possible image located -- find end by binary chop
        #s0 = largest size known to fail, s1 = smallest size known to succeed
        s1 = len(dat) - beg
        s0 = 1
        sz = s1
        while s0<s1:
            (stat,output) = getstatusoutput("tail -c +%d %s | head -c %d | %s >/dev/null" % (beg + 1, filepath, sz, conv))
            if stat:
                #failed -- possibly too small
                if sz == s1:
                    #failed -- probably invalid data
                    print("failed... no image here")
                    break
                elif sz == s0:
                    #we've found the length -- write out image
                    imgname = "image%03d.%s" % (inum, ext)
                    print("writing",imgname)
                    fh = open( imgname, "wb")
                    fh.write(dat[beg :beg+s1])
                    fh.close()
                    inum = inum + 1
                    break
                s0 = sz
            else:
                #might be too big -- try smaller
                s1 = sz
            sz = (s0+s1)//2

One can make use of the image decoding utilities giftopnm, djpeg, and pngtopnm to locate the end of the file. Like display, these tools discard excess input data after the end of the image file and will terminate without error. If, however, they are given truncated image data, they will report an error and terminate unsuccessfully. The Python script feeds image data of varying lengths to the decoding tool and uses its completion status to home in on the correct length of the required file.
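
The search itself can be separated from the external tools. The Python 3 sketch below is not the script from Listing 3 but an illustration of the same idea. Here decodes_ok is a hypothetical callback standing in for a run of djpeg, giftopnm, or pngtopnm, and the search assumes that every length shorter than the true image fails while every length at or beyond it succeeds.

```python
def find_image_length(max_len, decodes_ok):
    """Smallest length n in 1..max_len for which decodes_ok(n) is true,
    or None if even max_len bytes fail to decode."""
    if not decodes_ok(max_len):
        return None             # probably not an image at all
    lo, hi = 0, max_len         # lo: known too short, hi: known to decode
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if decodes_ok(mid):
            hi = mid            # still decodes, so try shorter
        else:
            lo = mid            # truncated, so more bytes are needed
    return hi

# toy stand-in for a decoder: pretend the image is exactly 1234 bytes long
calls = []
def fake_decoder(n):
    calls.append(n)
    return n >= 1234

print(find_image_length(100000, fake_decoder))   # 1234
print(len(calls))   # number of decoder runs; grows with log2 of max_len
```

Because each probe halves the remaining range, even an image buried in a file of many megabytes needs only a couple of dozen decoder runs.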

Conclusion

This article has shown how to write scripts that extract data objects, encoded using platform-independent open standards, from within proprietary files. It should be a simple task to extend these scripts for handling other image formats and even other types of data objects, such as sound and music files. Note that there are many file formats that frustrate the techniques described here via a layer of simple encryption and/or obfuscation.

Even if one has access to the appropriate proprietary application for reading a particular email attachment, the scripts outlined above can be useful for avoiding any possible macro viruses or security exploits specific to that application.

And finally, a word of warning. Some countries have vaguely worded laws that can be interpreted in such a way that these scripts may be considered illegal copyright circumvention devices. This may or may not be relevant to you, depending on the country where you reside. As is always the case when mixing open and closed source systems, your mileage may vary.

[Editor's note: The Python Imaging Library (PIL) provides a way to work with images from within a larger program. You can open an image and read its type and dimensions, transform it, create thumbnails, etc. -Iron.]

Adrian J Chung

When not teaching undergraduate computing at the University of the West Indies, Trinidad, Adrian writes system level scripts to manage a network of Linux boxes and experiments with interfacing various scripting environments with home-brew computer graphics renderers and data visualization libraries.