Friday, January 2, 2015

Forensic Analysis of Microsoft Office OOXML/OpenXML files using Python

Microsoft has supported the Office Open XML format since the Microsoft Office 2007 release.
Office Open XML is also known as OOXML or OpenXML. All Office files in this format, such as docx, xlsx and pptx, are zip archives of XML files.
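
Since an OOXML file is just a zip archive, a quick sanity check is possible before any parsing. The sketch below is written for Python 3 using only the standard library. The helper name looks_like_ooxml is made up for illustration, and it relies on OOXML packages carrying a [Content_Types].xml part at the top level of the archive.

```python
import zipfile


def looks_like_ooxml(source):
    """Return True if source is a zip archive containing [Content_Types].xml.

    source may be a path or a seekable binary file object.
    """
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source, "r") as archive:
        # OOXML packages list their part types in this top-level member.
        return "[Content_Types].xml" in archive.namelist()
```

An ordinary zip file without that member, or a legacy binary .doc file, would return False here.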

In this post we will find the XML files that are part of the Office file archive and look at its directory structure. We will parse Office files using Python.

Libraries used
        import zipfile, sys
        import libxml2, datetime
zipfile     for reading the zip archive
sys         for command-line arguments and the exit() function
libxml2     for parsing XML files
datetime    for printing dates and times

Read the Office file (which is really a zip archive)
file = zipfile.ZipFile(sys.argv[1],"r")
Here sys.argv[1] is the command-line argument we pass to the program, as shown below
./ file.docx

Loop through all the files and print their names
        for name in file.namelist():
            print "file name:" + name
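
The same loop can also report sizes and time stamps, since each archive member carries a ZipInfo record. This is a minimal Python 3 sketch using only the standard library. The function name list_members is hypothetical, and it expects an already opened zipfile.ZipFile object.

```python
import datetime
import zipfile


def list_members(archive):
    """Return (name, size_in_bytes, modified) tuples for each archive member."""
    rows = []
    for info in archive.infolist():
        try:
            # date_time is a (year, month, day, hour, minute, second) tuple.
            modified = datetime.datetime(*info.date_time)
        except ValueError:
            # A zeroed timestamp field decodes to month 0, which datetime rejects.
            modified = None
        rows.append((info.filename, info.file_size, modified))
    return rows
```

Each row can then be printed in a loop, for example with print(name, size, modified).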
To read an XML file
        xmlbuf = will not work

Parse the XML buffer; libxml2.parseMemory(xmlbuf, len(xmlbuf)) can also be used

        try:
            xmlf = libxml2.parseDoc(xmlbuf)
        except (libxml2.parserError, TypeError):
            print "Error loading core.xml"
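
For readers without libxml2 installed, the read-and-parse step can be approximated with the standard library. The sketch below swaps in xml.etree.ElementTree under Python 3, so it is not the libxml2 code from this post. The function name load_core_root is hypothetical, and the default member path docProps/core.xml is an assumption based on where Office normally writes the core properties part.

```python
import xml.etree.ElementTree as ET
import zipfile


def load_core_root(archive, member="docProps/core.xml"):
    """Return the root element of an XML member, or None if it cannot be loaded."""
    try:
        # ZipFile.read returns the member's bytes and raises KeyError if absent.
        xmlbuf = archive.read(member)
    except KeyError:
        return None
    try:
        return ET.fromstring(xmlbuf)
    except ET.ParseError:
        # The member exists but is not well-formed XML.
        return None
```

Returning None rather than exiting leaves it to the caller to decide whether a missing core.xml is fatal.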

Get the root element of the XML
        root = xmlf.getRootElement()
The root element is coreProperties in the case of core.xml.

Pass the root element to a recursive function that visits every XML tag in turn.
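
A recursive walk of this kind might look like the sketch below. It is an illustration in Python 3 with xml.etree.ElementTree rather than libxml2, and the name walk is made up here. It yields each element's depth, its tag name with the namespace stripped, and its text.

```python
import xml.etree.ElementTree as ET


def walk(element, depth=0):
    """Yield (depth, local_tag_name, text) for element and all its descendants."""
    # ElementTree writes namespaced tags as "{uri}name"; keep only the name.
    local = element.tag.split("}", 1)[-1]
    text = (element.text or "").strip()
    yield depth, local, text
    for child in element:
        # Recurse into each child, one level deeper.
        yield from walk(child, depth + 1)
```

Printing "  " * depth + local for each result gives an indented outline of the document.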

Entity: line 1: parser error : Start tag expected, '<' not found

To ignore the above error, use
        def noerr(ctx, str):
            pass
        libxml2.registerErrorHandler(noerr, None)
The error above is the major blocking point.

The core.xml file in the Office file archive holds details such as the file creator's name, access time, modified time and the name of the user who last modified it. We will try to extract those values using Python. Apart from core.xml, we also print the names, sizes and time stamps of the files in the archive.
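
As a rough illustration of that extraction, the sketch below pulls four common fields out of a core.xml buffer. It uses Python 3 and xml.etree.ElementTree instead of libxml2, and the names NS, FIELDS and core_properties are hypothetical. The namespace URIs are the ones used by the OOXML core properties part and Dublin Core. Only fields that are direct children of the root are looked up, and any field that is absent comes back as None.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used inside docProps/core.xml.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key -> element to look up under the coreProperties root.
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def core_properties(xmlbuf):
    """Return a dict of selected core properties from a core.xml buffer."""
    root = ET.fromstring(xmlbuf)
    result = {}
    for key, path in FIELDS.items():
        node = root.find(path, NS)
        result[key] = node.text if node is not None else None
    return result
```

The time stamps are returned as the strings stored in the file; converting them to datetime objects is left out of this sketch.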

Final result will look like