Microsoft has supported the Office Open XML format since the Microsoft Office 2007 release.
Office Open XML is also known as OOXML or OpenXML. Office files such as docx, xlsx, and pptx are zip archives whose contents are mostly XML files.
In this blog we will find the XML files inside an Office file archive and look at its directory structure. We will parse Office files using Python.
Libraries used
import zipfile, sys
import libxml2, datetime
zipfile for reading the zip archive
sys for command-line arguments and the exit() function
libxml2 for parsing XML files
datetime for printing date/time
file = zipfile.ZipFile(sys.argv[1],"r")
Here sys.argv[1] is the command-line argument we pass to the program, as shown below:
./office.py file.docx
Loop through all the files in the archive and print their names
for name in file.namelist():
    print("file name:" + name)
To read an XML file
xmlbuf = file.read(name)
Parse the XML buffer; libxml2.parseMemory() can be used instead of parseDoc()
try:
    xmlf = libxml2.parseDoc(xmlbuf)
except (libxml2.parserError, TypeError):
    print("Error loading core.xml")
    sys.exit()
Get the root element of the XML
root = xmlf.getRootElement()
Call the function below with the root element as its argument
recursive_find(root)
This is a recursive function that visits every XML tag under the root.
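The body of recursive_find is not shown in this post, so the following is only a minimal sketch of what such a traversal could look like. It assumes Python 3 and swaps in the standard library's xml.etree.ElementTree instead of libxml2 so that it runs without extra packages; the depth and found parameters and the sample XML string are illustrative choices, not part of the original script.

```python
import xml.etree.ElementTree as ET


def recursive_find(element, depth=0, found=None):
    """Collect (depth, tag, text) for an element and all of its descendants."""
    if found is None:
        found = []
    text = (element.text or "").strip()
    found.append((depth, element.tag, text))
    # Iterating over an Element yields its direct children.
    for child in element:
        recursive_find(child, depth + 1, found)
    return found


# Hypothetical sample document, used only to exercise the function.
sample = "<root><a>one</a><b><c>two</c></b></root>"
tags = recursive_find(ET.fromstring(sample))

for depth, tag, text in tags:
    print("  " * depth + tag + (": " + text if text else ""))
```

Returning a list instead of printing inside the function keeps the traversal reusable, for example to search for a particular tag afterwards.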
While parsing, libxml2 may print an error like this one:
Entity: line 1: parser error : Start tag expected, '<' not found
docProps/core.xml
To ignore the error above, register an error handler that does nothing
def noerr(ctx, str):
    pass
libxml2.registerErrorHandler(noerr, None)
The error above is the major blocking point.
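An alternative to silencing the parser is to catch the failure for each entry and skip whatever does not parse as XML. The sketch below is an assumption about how that could be done, not the approach used in this post: it uses Python 3 with xml.etree.ElementTree rather than libxml2, and the helper name parse_xml_entries and the in-memory demo archive are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def parse_xml_entries(archive):
    """Return ({name: root element}, [names that are not well-formed XML])."""
    parsed, skipped = {}, []
    for name in archive.namelist():
        data = archive.read(name)
        try:
            parsed[name] = ET.fromstring(data)
        except ET.ParseError:
            # Images, other binary parts, or empty entries end up here.
            skipped.append(name)
    return parsed, skipped


# Build a tiny stand-in archive in memory (hypothetical contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("docProps/core.xml", "<coreProperties/>")
    zf.writestr("word/media/image1.png", b"\x89PNG not xml")

with zipfile.ZipFile(buf) as zf:
    parsed, skipped = parse_xml_entries(zf)

print("parsed:", list(parsed))
print("skipped:", skipped)
```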
The core.xml file inside the Office file archive holds details such as the creator's name, the creation and modification times, and the name of the user who last modified the file. We will try to extract those values using Python. Apart from core.xml, we also print the names, sizes, and time stamps of the files in the archive.
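The extraction described above could be sketched as follows. This assumes Python 3 and uses xml.etree.ElementTree instead of libxml2; the helper names core_properties and list_entries, the FIELDS mapping, and the sample archive with its "Alice" and "Bob" values are all made up for illustration. The namespace URIs are the ones used for the core properties part in OOXML packages.

```python
import datetime
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
# Hypothetical mapping from friendly names to element paths.
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def core_properties(archive):
    """Read selected values from docProps/core.xml; missing ones are None."""
    root = ET.fromstring(archive.read("docProps/core.xml"))
    return {key: root.findtext(path, default=None, namespaces=NS)
            for key, path in FIELDS.items()}


def list_entries(archive):
    """Return (name, uncompressed size, timestamp) for every archive entry."""
    return [(info.filename, info.file_size, datetime.datetime(*info.date_time))
            for info in archive.infolist()]


# Stand-in archive with invented metadata, used only for demonstration.
CORE_XML = (
    b'<cp:coreProperties xmlns:cp="' + NS["cp"].encode() + b'"'
    b' xmlns:dc="' + NS["dc"].encode() + b'"'
    b' xmlns:dcterms="' + NS["dcterms"].encode() + b'">'
    b"<dc:creator>Alice</dc:creator>"
    b"<cp:lastModifiedBy>Bob</cp:lastModifiedBy>"
    b"<dcterms:created>2012-01-02T03:04:05Z</dcterms:created>"
    b"</cp:coreProperties>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    entry = zipfile.ZipInfo("docProps/core.xml", date_time=(2012, 1, 2, 3, 4, 6))
    zf.writestr(entry, CORE_XML)

with zipfile.ZipFile(buf) as zf:
    props = core_properties(zf)
    entries = list_entries(zf)

for key, value in props.items():
    print(key + ": " + str(value))
for name, size, stamp in entries:
    print(name, size, stamp.isoformat())
```

With a real document, the same two functions could be called on zipfile.ZipFile(sys.argv[1]) in place of the in-memory archive.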
http://en.wikipedia.org/wiki/Office_Open_XML
http://www.ecma-international.org/publications/standards/Ecma-376.htm
http://msdn.microsoft.com/en-us/library/dd908153(v=office.12).aspx