XML

April 24, 2010: We are pleased to announce that Version 4 of this course is now under development. For updates and an early peek at the content, please check out the Software Carpentry blog at http://www.software-carpentry.org/blog/.

1) Introduction

2) You Can Skip This Lecture If...

3) In the Beginning

4) The Modern Era

5) Formatting Rules

6) Document Structure

7) Text

8) XHTML

9) Sample XHTML Page

<html>
<body>
<h1>Software Carpentry</h1>

<p>This course will introduce <em>essential software development skills</em>,
and show where and how they should be applied.</p>

<address>Greg Wilson (gvwilson@third-bit.com)</address>
</body>
</html>

Figure 20.1: Simple Page Rendered by Firefox

10) Critique of HTML/XHTML

11) Attributes

12) Attributes Vs. Elements

13) More XHTML Tags

<html>
<head>
  <title>Comments Page</title>
  <meta name="author" content="aturing"/>
</head>
<body>

<!-- House style puts all titles in italics -->
<h1><em>Welcome to the Comments Page</em></h1>

<!-- Update this paragraph to describe the forum. -->
<p>Welcome to the Comments Forum.</p>

</body>
</html>
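
The page above keeps its metadata in attributes and its editorial notes in comments. As a rough sketch (written for Python 3, whereas the listings in this lecture use Python 2's print statement), the standard library's xml.dom.minidom can read both back. The variable names and the shortened copy of the page are illustrative only.

```python
import xml.dom.minidom

page = '''<html>
<head>
  <title>Comments Page</title>
  <meta name="author" content="aturing"/>
</head>
<body>
<!-- House style puts all titles in italics -->
<h1><em>Welcome to the Comments Page</em></h1>
</body>
</html>'''

doc = xml.dom.minidom.parseString(page)

# Attributes are read by name from the element that carries them.
meta = doc.getElementsByTagName('meta')[0]
meta_name = meta.getAttribute('name')
meta_content = meta.getAttribute('content')
print(meta_name, '=', meta_content)

# Comments stay in the tree as nodes of type COMMENT_NODE.
body = doc.getElementsByTagName('body')[0]
comments = [node.data.strip() for node in body.childNodes
            if node.nodeType == node.COMMENT_NODE]
print(comments)
```

A browser never displays the comment, but a program that processes the page still sees it as a node.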

14) Lists and Tables

15) Example

<html>
<head>
  <title>Lists and Tables</title>
  <meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>

<table cellpadding="3" border="1">
  <tr>
    <td align="center"><em>Unordered List</em></td>
    <td align="center"><em>Ordered List</em></td>
  </tr>
  <tr>
    <td align="left" valign="top">
      <ul>
        <li>Hydrogen</li>
        <li>Lithium</li>
        <li>Sodium</li>
        <li>Potassium</li>
        <li>Rubidium</li>
        <li>Cesium</li>
        <li>Francium</li>
      </ul>
    </td>
    <td align="left" valign="top">
      <ol>
        <li>Helium</li>
        <li>Neon</li>
        <li>Argon</li>
        <li>Krypton</li>
        <li>Xenon</li>
        <li>Radon</li>
      </ol>
    </td>
  </tr>
</table>

</body>
</html>

Figure 20.2: Lists and Tables

16) Images

<html>
<head>
  <title>Images</title>
  <meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>

<h1>Our Logo</h1>

<img src="../../../img/sc_powered.jpg" alt="[Powered by Software Carpentry]"/>

</body>
</html>

Figure 20.3: Images in Pages

17) Links

<html>
<head>
  <title>Links</title>
  <meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>

<h1>A Few of My Favorite Places</h1>

<ul>
  <li><a href="http://www.google.com">Google</a></li>
  <li><a href="http://www.python.org">Python</a></li>
  <li><a href="http://www.nature.com/index.html">Nature Online</a></li>
  <li>Examples in this lecture:
    <ul>
      <li><a href="comments.html">Comments</a></li>
      <li><a href="image.html">Images</a></li>
      <li><a href="list_table.html">Lists and Tables</a></li>
    </ul>
  </li>
</ul>

</body>
</html>
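
Since every link is an a element with an href attribute, a program can list a page's links mechanically. The following is a minimal sketch in Python 3 using xml.dom.minidom on a shortened copy of the page above. It assumes each a element contains plain text only.

```python
import xml.dom.minidom

page = '''<html><body>
<ul>
  <li><a href="http://www.google.com">Google</a></li>
  <li><a href="http://www.python.org">Python</a></li>
  <li>Examples in this lecture:
    <ul>
      <li><a href="comments.html">Comments</a></li>
    </ul>
  </li>
</ul>
</body></html>'''

doc = xml.dom.minidom.parseString(page)

# getElementsByTagName searches the whole subtree, so the link inside
# the nested list is found as well.
links = [(a.firstChild.data, a.getAttribute('href'))
         for a in doc.getElementsByTagName('a')]
for label, target in links:
    print(label, '->', target)
```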

18) The Document Object Model

19) The Basics

20) DOM Tree Example

<root>
  <first>element</first>
  <second attr="value">element</second>
  <third-element/>
</root>

Figure 20.5: A DOM Tree
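
One way to check a picture like this against the parser is to load the same document and list what hangs off the root. This is only a sketch in Python 3 with xml.dom.minidom, and the summary list is an illustrative name, not part of the lecture.

```python
import xml.dom.minidom

src = '''<root>
  <first>element</first>
  <second attr="value">element</second>
  <third-element/>
</root>'''

doc = xml.dom.minidom.parseString(src)
root = doc.documentElement

# Keep only element children; the whitespace between tags is stored
# in the tree as text nodes.
elements = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]

summary = []
for elem in elements:
    attrs = dict(elem.attributes.items())
    text = elem.firstChild.data if elem.firstChild is not None else None
    summary.append((elem.tagName, attrs, text))
    print(elem.tagName, attrs, repr(text))
```

The root has seven children in total: the three elements plus four whitespace-only text nodes around them.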

21) More On Tree Structure

22) Creating a Tree

<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
  <period units="days">87.97</period>
</planet>

import xml.dom.minidom
doc = xml.dom.minidom.parse('mercury.xml')
print doc.toxml('utf-8')

<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
  <period units="days">87.97</period>
</planet>
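
The listing above assumes that mercury.xml already exists. A self-contained variant, sketched here for Python 3, writes the document to a scratch directory first and then reads individual values out of the parsed tree. The scratch path is created only for this illustration.

```python
import os
import tempfile
import xml.dom.minidom

source = '''<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
  <period units="days">87.97</period>
</planet>
'''

# Write the document to a scratch directory so that parse() has a file to read.
with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, 'mercury.xml')
    with open(path, 'w', encoding='utf-8') as writer:
        writer.write(source)
    doc = xml.dom.minidom.parse(path)

# The tree lives in memory, so it is still usable after the file is gone.
planet = doc.documentElement
period = planet.getElementsByTagName('period')[0]
print(planet.getAttribute('name'),
      period.firstChild.data,
      period.getAttribute('units'))
```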

23) Converting to Text

import xml.dom.minidom

my_xml = '''<name>Donald Knuth</name>'''
my_doc = xml.dom.minidom.parseString(my_xml)
name = my_doc.documentElement.firstChild.data
print 'name is:', name
print 'but name in full is:', repr(name)

name is: Donald Knuth
but name in full is: u'Donald Knuth'
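
Reading firstChild.data works here because the element holds a single text node. When text is mixed with child elements, only the first piece is returned. A small recursive helper can gather all of it; the sketch below is in Python 3, and collect_text is a hypothetical name rather than a minidom function.

```python
import xml.dom.minidom

def collect_text(node):
    """Return the text of every text node at or below node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    # Elements contribute the text of their children; comments and other
    # childless nodes contribute nothing.
    return ''.join(collect_text(child) for child in node.childNodes)

doc = xml.dom.minidom.parseString('<name>Donald <em>Ervin</em> Knuth</name>')
first_piece = doc.documentElement.firstChild.data
full_name = collect_text(doc.documentElement)
print(repr(first_piece))
print(repr(full_name))
```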

24) Other Ways To Create Documents

import xml.dom.minidom

src = '''<planet name="Venus">
  <period units="days">224.7</period>
</planet>'''

doc = xml.dom.minidom.parseString(src)
print doc.toxml('utf-8')

<?xml version="1.0" encoding="utf-8"?>
<planet name="Venus">
  <period units="days">224.7</period>
</planet>

import xml.dom.minidom

impl = xml.dom.minidom.getDOMImplementation()

doc = impl.createDocument(None, 'planet', None)
root = doc.documentElement
root.setAttribute('name', 'Mars')

period = doc.createElement('period')
root.appendChild(period)

text = doc.createTextNode('686.98')
period.appendChild(text)

print doc.toxml('utf-8')

<?xml version="1.0" encoding="utf-8"?>
<planet name="Mars"><period>686.98</period></planet>
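
The node-by-node construction above can be wrapped in a function so that each planet needs only one call. The following is a sketch in Python 3. The name make_planet and its parameters are assumptions made for illustration, and unlike the Mars listing it also sets the units attribute.

```python
import xml.dom.minidom

def make_planet(name, period, units='days'):
    """Build a <planet> document shaped like the ones shown above."""
    impl = xml.dom.minidom.getDOMImplementation()
    doc = impl.createDocument(None, 'planet', None)
    root = doc.documentElement
    root.setAttribute('name', name)

    period_elem = doc.createElement('period')
    period_elem.setAttribute('units', units)
    # Text nodes hold strings, so the number is converted explicitly.
    period_elem.appendChild(doc.createTextNode(str(period)))
    root.appendChild(period_elem)
    return doc

mars = make_planet('Mars', 686.98)
mars_xml = mars.documentElement.toxml()
print(mars_xml)
```

Calling toxml() on the root element rather than on the document leaves out the XML declaration.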

25) The Details

26) Finding Nodes

import xml.dom.minidom

src = '''<heavenly_bodies>
  <planet name="Mercury"/>
  <planet name="Venus"/>
  <planet name="Earth"/>
  <moon name="Moon"/>
  <planet name="Mars"/>
  <moon name="Phobos"/>
  <moon name="Deimos"/>
</heavenly_bodies>'''

doc = xml.dom.minidom.parseString(src)
for node in doc.getElementsByTagName('moon'):
    print node.getAttribute('name')

Moon
Phobos
Deimos
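
getElementsByTagName can also be called on an element, in which case it searches only that element's subtree. The sketch below (Python 3) assumes a variant of the document in which each moon is nested inside its planet, which is not how the listing above is laid out.

```python
import xml.dom.minidom

src = '''<heavenly_bodies>
  <planet name="Earth"><moon name="Moon"/></planet>
  <planet name="Mars"><moon name="Phobos"/><moon name="Deimos"/></planet>
</heavenly_bodies>'''

doc = xml.dom.minidom.parseString(src)

# Searching from each planet element limits the results to that planet's moons.
moons_of = {}
for planet in doc.getElementsByTagName('planet'):
    moons = [m.getAttribute('name') for m in planet.getElementsByTagName('moon')]
    moons_of[planet.getAttribute('name')] = moons
print(moons_of)
```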

27) Walking a Tree

28) Recursive Tree Walker

import xml.dom.minidom

src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
<planet name="Earth"><period units="days">365.26</period></planet>
</solarsystem>
'''

def walkTree(currentNode, indent=0):
    spaces = ' ' * indent
    if currentNode.nodeType == currentNode.TEXT_NODE:
        print spaces + 'TEXT' + ' (%d)' % len(currentNode.data)
    else:
        print spaces + currentNode.tagName
        for child in currentNode.childNodes:
            walkTree(child, indent+1)

doc = xml.dom.minidom.parseString(src)
walkTree(doc.documentElement)

solarsystem
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (6)
 TEXT (1)
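
The TEXT (1) entries in this output are the newlines between elements. If they are just noise for the task at hand, the walker can skip text nodes that contain only whitespace. The following is one possible sketch in Python 3; walk_elements is a hypothetical name, and it returns lines instead of printing them so the result can be checked.

```python
import xml.dom.minidom

src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
</solarsystem>
'''

def walk_elements(node, depth=0, lines=None):
    """Collect one indented line per element or non-blank text node."""
    if lines is None:
        lines = []
    if node.nodeType == node.TEXT_NODE:
        if node.data.strip():  # ignore whitespace-only text
            lines.append(' ' * depth + repr(node.data))
    elif node.nodeType == node.ELEMENT_NODE:
        lines.append(' ' * depth + node.tagName)
        for child in node.childNodes:
            walk_elements(child, depth + 1, lines)
    return lines

doc = xml.dom.minidom.parseString(src)
outline = walk_elements(doc.documentElement)
print('\n'.join(outline))
```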

29) Modifying the Tree

Figure 20.6: Modifying the DOM Tree

30) Complications

31) Solution

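import re
import xml.dom.minidom

def emphasize(doc):
    # Emphasize the first word of every paragraph that starts with text.
    paragraphs = doc.getElementsByTagName('p')
    for para in paragraphs:
        first = para.firstChild
        # Skip empty paragraphs, whose firstChild is None.
        if first is not None and first.nodeType == first.TEXT_NODE:
            emphasizeText(doc, para, first)

def emphasizeText(doc, para, textNode):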

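    # Look for optional spaces, a word, and the rest of the paragraph.
    # Match the node's data directly: str() fails on non-ASCII text in Python 2.
    # re.DOTALL lets the last group span line breaks inside the paragraph.
    m = re.match(r'^(\s*)(\S*)\b(.*)$', textNode.data, re.DOTALL)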
    if not m:
        return
    leadingSpace, firstWord, restOfText = m.groups()
    if not firstWord:
        return

    # If there's text after the first word, re-save it.
    if restOfText:
        restOfText = doc.createTextNode(restOfText)
        para.insertBefore(restOfText, para.firstChild)

    # Emphasize the first word.
    emph = doc.createElement('em')
    emph.appendChild(doc.createTextNode(firstWord))
    para.insertBefore(emph, para.firstChild)

    # If there's leading space, re-save it.
    if leadingSpace:
        leadingSpace = doc.createTextNode(leadingSpace)
        para.insertBefore(leadingSpace, para.firstChild)

    # Get rid of the original text.
    para.removeChild(textNode)

32) Not Finished Yet

if __name__ == '__main__':

    src = '''<html><body>
<p>First paragraph.</p>
<p>Second paragraph contains <em>emphasis</em>.</p>
<p>Third paragraph.</p>
</body></html>'''

    doc = xml.dom.minidom.parseString(src)
    emphasize(doc)
    print doc.toxml('utf-8')

<?xml version="1.0" encoding="utf-8"?>
<html><body>
<p><em>First</em> paragraph.</p>
<p><em>Second</em> paragraph contains <em>emphasis</em>.</p>
<p><em>Third</em> paragraph.</p>
</body></html>

33) Summary