Security Unplugged !!!: Forensic Analysis of Microsoft Office OOXML/OpenXML files using Python

Friday, January 2, 2015


Microsoft has supported the Office Open XML format since the Microsoft Office 2007 release.

Office Open XML is also known as OOXML or OpenXML. Office files such as docx, xlsx and pptx are ZIP archives that contain XML files.

In this post we will list the XML files that make up an Office file archive and look at its directory structure. We will parse the Office files using Python.

Libraries used

        import zipfile, sys
        import libxml2, datetime

zipfile for reading the ZIP archive

sys for the command line arguments and the exit() function

libxml2 for parsing XML files

datetime for printing date/time

Read the Office file (which is really a ZIP archive)

        file = zipfile.ZipFile(sys.argv[1], "r")

Where sys.argv[1] is the command line argument we pass to the program, as shown below

        ./office.py file.docx

Loop through all the files in the archive and print their names

        for name in file.namelist():
            # The parenthesized form works in both Python 2 and Python 3
            print("file name:" + name)
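
The loop above prints only the names. As a minimal sketch of how the sizes and timestamps of the archive members could be listed as well, the following uses ZipFile.infolist() from the standard library. It is written for Python 3, and list_members() and make_sample_archive() are hypothetical helper names that are not part of the original script. The sample archive only stands in for a real .docx file.

```python
import datetime
import io
import zipfile


def list_members(archive):
    """Return (name, size in bytes, modification time) for each archive entry."""
    rows = []
    for info in archive.infolist():
        # ZipInfo.date_time is a (year, month, day, hour, minute, second) tuple.
        modified = datetime.datetime(*info.date_time)
        rows.append((info.filename, info.file_size, modified))
    return rows


def make_sample_archive():
    """Build a tiny in-memory stand-in for a .docx file (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("docProps/core.xml", "<coreProperties/>")
    buf.seek(0)
    return zipfile.ZipFile(buf, "r")


if __name__ == "__main__":
    for name, size, modified in list_members(make_sample_archive()):
        print("file name: %s, size: %d bytes, modified: %s" % (name, size, modified))
```

For a real document, zipfile.ZipFile(sys.argv[1], "r") can be passed to list_members() instead of the sample archive.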

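To read an XML file from the archive

        xmlbuf = file.read(name)

file.open() is not a drop-in replacement here: it returns a file-like object rather than the content itself, which is what parseDoc() expects.

Parse the XML buffer. We can also use libxml2.parseMemory(xmlbuf, len(xmlbuf))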

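        try:
            # Parse the buffer read from the archive into a document tree
            xmlf = libxml2.parseDoc(xmlbuf)
        except (libxml2.parserError, TypeError):
            print("Error loading core.xml")
            # Exit with a non-zero status to signal the failure
            sys.exit(1)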

Get the root element of the XML

        root = xmlf.getRootElement()

For core.xml the root element is coreProperties.

Call the function below with the root element as its argument

        recursive_find(root)

This function walks the tree recursively and visits every XML tag.
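
The body of recursive_find() is not shown here. One possible shape is sketched below, under stated assumptions: it is written for Python 3, it uses the standard library's xml.etree.ElementTree in place of libxml2, and the depth and found parameters are additions made only for this illustration.

```python
import xml.etree.ElementTree as ET


def recursive_find(element, depth=0, found=None):
    """Collect (depth, tag, text) for an element and all of its descendants."""
    if found is None:
        found = []
    # element.text is None for elements without text content.
    text = (element.text or "").strip()
    found.append((depth, element.tag, text))
    for child in element:
        # Each child is visited one level deeper than its parent.
        recursive_find(child, depth + 1, found)
    return found


if __name__ == "__main__":
    # Hypothetical sample; a real core.xml uses namespaced tags.
    sample = "<coreProperties><creator>alice</creator><revision>2</revision></coreProperties>"
    for depth, tag, text in recursive_find(ET.fromstring(sample)):
        print("  " * depth + tag + (": " + text if text else ""))
```

Returning a list instead of printing inside the function keeps the traversal separate from the output, which makes it easier to reuse for other files in the archive.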

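While parsing the archive members, libxml2 may print an error such as

        Entity: line 1: parser error : Start tag expected, '<' not found
        docProps/core.xml

To suppress the above error, register an error handler that does nothing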

        # Error handler that silently discards libxml2 error messages
        def noerr(ctx, msg):
            pass
        libxml2.registerErrorHandler(noerr, None)

The error above was the main stumbling block.

The core.xml file inside the Office file archive holds details such as the creator's name, the creation time, the modification time and the name of the user who last modified the file. We will try to extract those values using Python. Apart from core.xml, we also print the names, sizes and time stamps of the files in the archive.
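
As a rough sketch of that extraction, the following reads docProps/core.xml from the archive and looks up a few properties by their namespaced tags. It is written for Python 3 and uses the standard library's xml.etree.ElementTree in place of libxml2. The names core_properties(), FIELDS and make_sample_archive() are hypothetical, and the sample values are made up for the illustration.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace URIs used by docProps/core.xml.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical selection of fields; a given core.xml may hold more or fewer.
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def core_properties(archive):
    """Return a dict of selected core properties; missing ones map to None."""
    root = ET.fromstring(archive.read("docProps/core.xml"))
    return {key: root.findtext(path, default=None, namespaces=NS)
            for key, path in FIELDS.items()}


def make_sample_archive():
    """Build an in-memory stand-in for a .docx file (made-up values)."""
    core = (
        '<cp:coreProperties xmlns:cp="{cp}" xmlns:dc="{dc}" xmlns:dcterms="{dcterms}">'
        "<dc:creator>alice</dc:creator>"
        "<cp:lastModifiedBy>bob</cp:lastModifiedBy>"
        "<dcterms:created>2015-01-01T10:00:00Z</dcterms:created>"
        "<dcterms:modified>2015-01-02T12:30:00Z</dcterms:modified>"
        "</cp:coreProperties>"
    ).format(**NS)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docProps/core.xml", core)
    buf.seek(0)
    return zipfile.ZipFile(buf, "r")


if __name__ == "__main__":
    for key, value in core_properties(make_sample_archive()).items():
        print("%s: %s" % (key, value))
```

Note that archive.read() raises KeyError when the archive has no docProps/core.xml entry, so a caller handling arbitrary files may want to catch that.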

The final result will look like this