Microsoft has supported the Office Open XML format since the Microsoft Office 2007 release.
Office Open XML is also known as OOXML or OpenXML. Office files such as docx, xlsx, and pptx are ZIP archives made up mostly of XML files.
In this blog post we will find and read the XML files inside an Office file archive and look at its directory structure. We will parse Office files using Python.
import zipfile, sys
import libxml2, datetime
zipfile for parsing zipfile
sys to use exit() function
libxml2 for parsing XML files
datetime for printing date/time
file = zipfile.ZipFile(sys.argv[1], "r")
Here sys.argv[1] is the command-line argument we pass to the program: the path to the Office file.
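Before opening the archive it can help to check that an argument was actually given and that it points to a ZIP file. The following is only a minimal sketch of that idea; the function name open_office_file and the wording of the messages are our own and not part of the original program.

```python
import sys
import zipfile


def open_office_file(argv):
    """Open the archive named by the first command-line argument.

    Hypothetical helper: exits with a message instead of a traceback
    when the argument is missing or is not a ZIP archive.
    """
    if len(argv) < 2:
        raise SystemExit("usage: " + argv[0] + " <office-file>")
    path = argv[1]
    # is_zipfile() returns False for missing files and non-ZIP content.
    if not zipfile.is_zipfile(path):
        raise SystemExit(path + " is not a ZIP-based Office file")
    return zipfile.ZipFile(path, "r")


if __name__ == "__main__":
    archive = open_office_file(sys.argv)
    print("opened archive with %d members" % len(archive.namelist()))
```

Run it with the path of a docx, xlsx, or pptx file as the only argument.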
Loop through all the files in the archive and print their names
for name in file.namelist():
    print("file name:" + name)
To read an XML file
xmlbuf = file.read(name)
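The listing and reading steps above can be tried without a real Office file. The sketch below builds a tiny in-memory archive that only mimics the layout of a docx file (the member names and contents are made up for illustration) and then loops over it the same way.

```python
import io
import zipfile


def build_sample_archive():
    """Build a tiny in-memory ZIP that mimics the layout of a .docx file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Placeholder contents, not real Office XML.
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("docProps/core.xml", "<coreProperties/>")
        zf.writestr("word/document.xml", "<document/>")
    buf.seek(0)
    return buf


def list_members(archive):
    """Return (name, raw bytes) for every member of the archive."""
    members = []
    with zipfile.ZipFile(archive, "r") as zf:
        for name in zf.namelist():
            members.append((name, zf.read(name)))
    return members


if __name__ == "__main__":
    for name, data in list_members(build_sample_archive()):
        print("file name: " + name + " (" + str(len(data)) + " bytes)")
```

Note that read() returns the raw bytes of a member; nothing is parsed at this point.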
Parse the XML buffer; libxml2.parseMemory(xmlbuf, len(xmlbuf)) can be used instead
try:
    xmlf = libxml2.parseDoc(xmlbuf)
except (libxml2.parserError, TypeError):
    print("Error loading core.xml")
    sys.exit()
Get the root element of the XML
root = xmlf.getRootElement()
Call the function below with the root element as its argument
recursive_find(root)
This is a recursive function that walks the tree and collects all the XML tags.
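The body of recursive_find is not shown in this post, so the following is only a guess at what such a function could look like. To keep the example runnable without libxml2 it uses the standard library's xml.etree.ElementTree instead; the depth and found parameters are our own additions.

```python
import xml.etree.ElementTree as ET


def recursive_find(element, depth=0, found=None):
    """Collect (depth, tag) for an element and all of its descendants.

    Assumed behaviour: a depth-first walk in document order.
    """
    if found is None:
        found = []
    found.append((depth, element.tag))
    for child in element:
        # Recurse into each child, one level deeper.
        recursive_find(child, depth + 1, found)
    return found


if __name__ == "__main__":
    root = ET.fromstring("<a><b><c/></b><d/></a>")
    for depth, tag in recursive_find(root):
        print("  " * depth + tag)
```

With libxml2 the same walk would follow the children of each node rather than iterate over the element, but the recursion has the same shape.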
Entity: line 1: parser error : Start tag expected, '<' not found
To suppress the error above, register a no-op error handler
def noerr(ctx, str):
    pass
libxml2.registerErrorHandler(noerr, None)
The error above is the major blocking point.
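Another way around this, sketched below, is to skip members that fail to parse instead of silencing the message. We are assuming here that the error shows up when a member that is not XML (an embedded image, for instance) is handed to the parser. The sketch uses xml.etree.ElementTree from the standard library rather than libxml2, and the sample archive is made up.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_sample_archive():
    """In-memory archive with one XML member and one binary member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<document/>")
        # Not XML: the first bytes of a PNG signature.
        zf.writestr("word/media/image1.png", b"\x89PNG\r\n")
    buf.seek(0)
    return buf


def parse_xml_members(archive):
    """Parse every well-formed XML member; list the rest as skipped."""
    parsed, skipped = {}, []
    with zipfile.ZipFile(archive, "r") as zf:
        for name in zf.namelist():
            try:
                parsed[name] = ET.fromstring(zf.read(name))
            except ET.ParseError:
                # The member is not well-formed XML, so leave it out.
                skipped.append(name)
    return parsed, skipped


if __name__ == "__main__":
    parsed, skipped = parse_xml_members(build_sample_archive())
    print("parsed: " + ", ".join(sorted(parsed)))
    print("skipped: " + ", ".join(skipped))
```

Keeping a list of skipped names makes it easy to see which members were ignored.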
The core.xml file inside the Office file archive holds details such as the creator's name, the creation time, the modification time, and the name of the user who last modified the file. We will try to extract those values using Python. Apart from core.xml, we also print the names, sizes, and time stamps of the files in the archive.
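The extraction described above might look roughly like the sketch below. It uses xml.etree.ElementTree from the standard library instead of libxml2, and it runs against a made-up in-memory archive, so the names Alice and Bob and the dates are sample values only. The function names and dictionary keys are our own choices; the docProps/core.xml path and the three namespaces are the ones used by OOXML packages.

```python
import datetime
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml in OOXML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical sample content standing in for a real core.xml.
SAMPLE_CORE = (
    '<cp:coreProperties xmlns:cp="%(cp)s" xmlns:dc="%(dc)s" '
    'xmlns:dcterms="%(dcterms)s">'
    "<dc:creator>Alice</dc:creator>"
    "<cp:lastModifiedBy>Bob</cp:lastModifiedBy>"
    "<dcterms:created>2012-01-02T03:04:05Z</dcterms:created>"
    "<dcterms:modified>2012-02-03T04:05:06Z</dcterms:modified>"
    "</cp:coreProperties>"
) % NS


def build_sample_archive():
    """Build an in-memory archive holding only docProps/core.xml."""
    buf = io.BytesIO()
    info = zipfile.ZipInfo("docProps/core.xml", date_time=(2012, 1, 2, 3, 4, 6))
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(info, SAMPLE_CORE)
    buf.seek(0)
    return buf


def read_core_properties(archive):
    """Return selected metadata fields from docProps/core.xml."""
    with zipfile.ZipFile(archive, "r") as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    fields = {
        "creator": "dc:creator",
        "last_modified_by": "cp:lastModifiedBy",
        "created": "dcterms:created",
        "modified": "dcterms:modified",
    }
    # findtext() returns None for any field that is absent.
    return {key: root.findtext(path, namespaces=NS) for key, path in fields.items()}


def member_info(archive):
    """Return (name, uncompressed size, timestamp) for every member."""
    with zipfile.ZipFile(archive, "r") as zf:
        return [
            (info.filename, info.file_size, datetime.datetime(*info.date_time))
            for info in zf.infolist()
        ]


if __name__ == "__main__":
    for key, value in sorted(read_core_properties(build_sample_archive()).items()):
        print(key + ": " + str(value))
    for name, size, stamp in member_info(build_sample_archive()):
        print(name + " " + str(size) + " bytes " + stamp.isoformat())
```

The time stamps printed by member_info come from the ZIP directory entries, while the created and modified values come from inside core.xml, so the two need not agree.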