An XML Adventure: Part 1 - Python

Recently I've had to deal with our XML at work more than usual. It's a mess to read, and trying to check it for errors is extremely unwieldy. We have a program that makes it easier to modify, but it doesn't help much if I want to find inconsistencies or compare properties. As far as I know, we don't have any other tools for working with our XML, at least none that I have access to, so I decided I'd make a new tool at home after work. I figured it might take a while, but the potential productivity boost seemed worth it.

I decided I'd write it in Java. Java is what our software is written in, so I figured I should write my XML tool in Java too. If anyone wanted to modify it for their specific purposes they could use the same language (not everyone is as interested in programming languages as I am), and they wouldn't have to install anything new on their machine to run it.

As far as I can tell, Java has at least three ways to search or unmarshal XML: DOM, SAX, and JAXB. There are probably other ways, but those were the ones I considered using. I spent an embarrassing three hours trying to figure out which Java library to use and how to use it. For some reason I struggle with Java-related documentation more than other docs I've come across, and after those torturous hours, I decided to see what other languages had to offer.

I started with Python. I've heard great things about it since I started programming and thought it was time to dig a little deeper. I also looked into Go because...well...I've been intrigued by the language for a while and thought I'd see how Go handles XML, if at all.

I spent 10 minutes researching solutions in Python and Go and felt like I already had a better handle on how to approach the problem in each language than I did after hours of research with Java. Chances are I had a better idea of what I was looking for after researching Java solutions, but the Python and Go solutions seemed so simple and easy to use. I couldn't help wanting to try them out.

The Python module I stumbled upon is the ElementTree XML API (xml.etree.ElementTree), which is part of the standard library. For all of my examples going forward, I'm going to use the following XML, which I took from the Python documentation and modified a bit:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="United States of America">
        <rank>3</rank>
        <year>2017</year>
        <gdppc>6000000</gdppc>
        <neighbor name="Canada" direction="N"/>
        <neighbor name="Mexico" direction="S"/>
        <states>
            <state name="Texas">
                <capital name="Austin"/>
                <flower name="Bluebonnet"/>
            </state>
            <state name="Virginia">
                <capital name="Richmond"/>
                <flower name="Dogwood"/>
            </state>
            <state name="Florida">
                <capital name="Tallahassee"/>
                <flower name="Orange Blossom"/>
            </state>
        </states>
    </country>
</data>

Unmarshalling XML with Python is extremely simple. If we want to run a Python script all we have to do is create a .py file and add the following:

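import sys
import xml.etree.ElementTree as ETree

# Path to the XML file, passed as the first command line argument
xmlFile = sys.argv[1]
# Parse the file into a tree and grab its root element (<data>)
XMLTree = ETree.parse(xmlFile)
treeRoot = XMLTree.getroot()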

And that's all we need to get started. I used the sys module so that I could pass the XML file I want to unmarshal as a command line argument, but I could have hard-coded the file name or used another module like argparse. Next, we wield the powerful API to work our magic on the tree. One way of doing this is to use the "iter()" method to find all the elements in the tree with a particular tag.

for state in treeRoot.iter("state"):
    print(state.attrib)

Running this script will give us the following output:

{'name': 'Texas'}
{'name': 'Virginia'}
{'name': 'Florida'}

Keep in mind that iter() walks the entire tree, so if the other countries had states in the XML, those states would be printed as well. The attrib property we used in print() is a dictionary of all of the XML tag's attributes, so each attribute can be accessed via the normal dictionary methods, using the attribute's name as the key. For example, if we used print(state.attrib.get("name")) we would see just Texas, Virginia, and Florida printed, one per line, instead of the full dictionaries.
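Here's a rough sketch of how the argparse route mentioned earlier might look, with an optional filter so that only one country's states get printed. The --country flag, the build_parser and state_names helpers, and the cut-down sample XML are all made up for this example; they aren't part of ElementTree or of my actual tool.

```python
import argparse
import os
import tempfile
import xml.etree.ElementTree as ETree

# Illustrative sample only: a cut-down version of the XML above, plus a
# second country with a state so the --country filter has something to do.
SAMPLE_XML = """<?xml version="1.0"?>
<data>
    <country name="United States of America">
        <states>
            <state name="Texas"/>
            <state name="Virginia"/>
        </states>
    </country>
    <country name="Malaysia">
        <states>
            <state name="Johor"/>
        </states>
    </country>
</data>
"""


def build_parser():
    """Command line interface: a required file path and an optional filter."""
    parser = argparse.ArgumentParser(description="List the states in an XML file.")
    parser.add_argument("xml_file", help="path to the XML file to parse")
    parser.add_argument("--country", help="only list states of this country")
    return parser


def state_names(root, country=None):
    """Return the name attribute of every <state>, optionally for one country."""
    names = []
    for country_el in root.iter("country"):
        if country is not None and country_el.get("name") != country:
            continue  # not the country that was asked for
        for state in country_el.iter("state"):
            # Element.get() reads a single attribute, much like attrib.get()
            names.append(state.get("name"))
    return names


def main(argv=None):
    """With argv=None, argparse falls back to sys.argv[1:]."""
    args = build_parser().parse_args(argv)
    root = ETree.parse(args.xml_file).getroot()
    for name in state_names(root, args.country):
        print(name)


# Demo: write the sample to a temporary file and pass the arguments
# explicitly. A real script would instead end with a plain call to main().
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "country.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE_XML)
    main([path, "--country", "Malaysia"])  # prints Johor
    main([path])  # prints Texas, Virginia, Johor (one per line)
```

Comparing the name attribute in plain Python keeps the filter simple. ElementTree's limited XPath support could probably do the same job in a single findall() call, but I haven't needed it yet.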

Up until now, I've only used Python via .py files, but the Python interactive shell turns out to be a surprisingly pleasant way to use the ElementTree API. I don't always know what I'm going to be looking for on any given day; I just know I'll be looking for something. The interactive shell seems well suited to these types of situations.

The steps for using the shell are almost exactly the same as for running a .py script. We navigate to the directory with the XML file we want to interact with and start the interpreter by typing python into our shell (with Python installed, this command should work out of the box on Linux, although some distributions call it python3, and it should work on Windows in PowerShell if you've added the path to python.exe to your PATH environment variable).

>>> import xml.etree.ElementTree as Etree
>>> XMLTree = Etree.parse("country.xml")
>>> treeRoot = XMLTree.getroot()
>>> for state in treeRoot.iter("state"):
...    print(state.attrib)
...
{'name': 'Texas'}
{'name': 'Virginia'}
{'name': 'Florida'}

The only thing I did differently was hard-code the XML file I want to use. Then, just like before, we use the API to find whatever it is we are looking for.

>>> for country in treeRoot.iter("country"):
...     gdppc = int(country.find("gdppc").text)
...     if gdppc > 100000:
...         print(country.attrib.get("name") + " has a gdppc over 100000")
...
Liechtenstein has a gdppc over 100000
United States of America has a gdppc over 100000
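One caveat about the loop above: find() returns None when a country has no gdppc child, so calling .text on the result would raise an AttributeError. Below is a slightly more defensive sketch that uses findtext() with a default instead. The countries_over function and the trimmed sample XML (where the last country deliberately has no gdppc) are my own inventions for illustration.

```python
import xml.etree.ElementTree as ETree

# Illustrative sample: the last country deliberately has no <gdppc> element.
SAMPLE_XML = """<data>
    <country name="Liechtenstein">
        <gdppc>141100</gdppc>
    </country>
    <country name="Singapore">
        <gdppc>59900</gdppc>
    </country>
    <country name="United States of America">
        <rank>3</rank>
    </country>
</data>
"""


def countries_over(root, threshold):
    """Names of countries whose <gdppc> exceeds threshold; skips missing values."""
    names = []
    for country in root.iter("country"):
        # findtext() hands back the default instead of failing when the tag is absent
        raw = country.findtext("gdppc", default="")
        if not raw.strip().isdigit():
            continue  # missing or non-numeric value, nothing to compare
        if int(raw) > threshold:
            names.append(country.get("name"))
    return names


root = ETree.fromstring(SAMPLE_XML)
for name in countries_over(root, 100000):
    print(name + " has a gdppc over 100000")
```

Silently skipping bad values is a judgment call. For a tool meant to find inconsistencies, it might make more sense to collect those countries and report them instead.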

I was blown away by how easy it was to get going with Python. With just a few lines of code we can easily access everything we want in an XML tree. Python will definitely be handling a larger portion of my workload going forward.

In Part 2 I'll write about my experience using Go and its XML package, and how I'm using it differently than Python.

References: Python XML Module, Go XML Package, Python argparse