April 24, 2010: We are pleased to announce that Version 4 of this course is now under development. For updates and an early peek at the content, please check out the Software Carpentry blog at http://www.software-carpentry.org/blog/.
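- Tags are written between `<` and `>`: an element opens with `<tagname>` and closes with `</tagname>`
- Elements must nest properly: `<X>...<Y>...</Y></X>` is legal...
- ...but `<X>...<Y>...</X></Y>` is not
- Because `<` and `>` delimit tags, they must be written as escape sequences when they appear in text
- Escape sequences have the form `&name;`

| Sequence | Character |
|---|---|
| `&lt;` | `<` |
| `&gt;` | `>` |
| `&quot;` | `"` |
| `&amp;` | `&` |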
Table 20.1: XML Character Escapes
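
The escapes in the table can be applied programmatically. The following is a minimal sketch, written for Python 3 and assuming the standard library's `xml.sax.saxutils` module is an acceptable way to do this; the sample string and the `QUOTE`/`UNQUOTE` mappings are made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

# By default escape() only handles &, < and >, so double quotes
# are added through an extra mapping (an assumption of this sketch).
QUOTE = {'"': '&quot;'}
UNQUOTE = {'&quot;': '"'}

raw = 'if x < 3 & y > "4"'
escaped = escape(raw, QUOTE)           # text that is safe to put in XML
restored = unescape(escaped, UNQUOTE)  # back to the original characters

print(escaped)
print(restored == raw)
```

Escaping `&` first and unescaping it last (which the library does internally) avoids turning `&lt;` into `&amp;lt;` or back into `<` by accident.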
| Tag | Usage |
|---|---|
| `<html>` | Root element of entire HTML document. |
| `<body>` | Body of page (i.e., visible content). |
| `<h1>` | Top-level heading. Use `<h2>`, `<h3>`, etc. for second- and third-level headings. |
| `<p>` | Paragraph. |
| `<em>` | Emphasized text; browser or editor will usually display it in italics. |
| `<address>` | Address of document author (also usually displayed in italics). |
Table 20.2: Basic XHTML Tags
    <html>
      <body>
        <h1>Software Carpentry</h1>
        <p>This course will introduce <em>essential software development skills</em>,
        and show where and how they should be applied.</p>
        <address>Greg Wilson (gvwilson@third-bit.com)</address>
      </body>
    </html>

Figure 20.1: Simple Page Rendered by Firefox
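- `h1` (level-1 heading) is semantic (meaning)
- `i` (italics) is display (formatting)
- Elements can be customized with attributes, e.g. `<h1 align="center">A Centered Heading</h1>` or `<p class="disclaimer">This planet provided as-is.</p>`
- An attribute may appear at most once per element: `<p align="left" align="right">...</p>` is illegal
- Attribute values must be quoted: old browsers accepted `<p align=center>...<p>`, but modern parsers will reject it
- A well-formed page should have a `head` element as well as a `body`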
Comments start with `<!--`, and end with `-->`:

    <html>
      <head>
        <title>Comments Page</title>
        <meta name="author" content="aturing"/>
      </head>
      <body>
        <!-- House style puts all titles in italics -->
        <h1><em>Welcome to the Comments Page</em></h1>
        <!-- Update this paragraph to describe the forum. -->
        <p>Welcome to the Comments Forum.</p>
      </body>
    </html>

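
The nesting, attribute, and comment rules above can be checked mechanically. This is a small sketch, written for Python 3, that assumes the standard `xml.dom.minidom` parser is a reasonable stand-in for a "modern parser"; the helper name `is_well_formed` and the sample strings are hypothetical.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False

samples = [
    '<X><Y>text</Y></X>',                   # properly nested
    '<X><Y>text</X></Y>',                   # overlapping elements
    '<p align="left" align="right">x</p>',  # attribute repeated
    '<p align=center>x</p>',                # unquoted attribute value
    '<p><!-- a comment --></p>',            # comments are allowed
]
for sample in samples:
    print(is_well_formed(sample), sample)
```

Only the first and last samples should be accepted; the parser rejects the other three for the reasons given in the comments.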
- Use `ul` for an unordered (bulleted) list, and `ol` for an ordered (numbered) one
- Each list item goes in an `li` element
- Use `table` for tables
- Each row is a `tr` (for "table row")
- Each cell within a row is a `td` (for "table data")

For example:

    <html>
      <head>
        <title>Lists and Tables</title>
        <meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
      </head>
      <body>
        <table cellpadding="3" border="1">
          <tr>
            <td align="center"><em>Unordered List</em></td>
            <td align="center"><em>Ordered List</em></td>
          </tr>
          <tr>
            <td align="left" valign="top">
              <ul>
                <li>Hydrogen</li>
                <li>Lithium</li>
                <li>Sodium</li>
                <li>Potassium</li>
                <li>Rubidium</li>
                <li>Cesium</li>
                <li>Francium</li>
              </ul>
            </td>
            <td align="left" valign="top">
              <ol>
                <li>Helium</li>
                <li>Neon</li>
                <li>Argon</li>
                <li>Krypton</li>
                <li>Xenon</li>
                <li>Radon</li>
              </ol>
            </td>
          </tr>
        </table>
      </body>
    </html>

Figure 20.2: Lists and Tables
- Metadata goes in `meta` elements in the document `head`
- Images are included with the `img` tag
- Its `src` attribute specifies where to find the image file

For example:

    <html>
      <head>
        <title>Images</title>
        <meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
      </head>
      <body>
        <h1>Our Logo</h1>
        <img src="../../../img/sc_powered.jpg" alt="[Powered by Software Carpentry]"/>
      </body>
    </html>

Figure 20.3: Images in Pages
- Use the `alt` attribute to specify alternative text for an image
- Use the `a` element to create a link
- Its `href` attribute specifies what the link is pointing at

For example:

    <html>
      <head>
        <title>Links</title>
        <meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
      </head>
      <body>
        <h1>A Few of My Favorite Places</h1>
        <ul>
          <li><a href="http://www.google.com">Google</a></li>
          <li><a href="http://www.python.org">Python</a></li>
          <li><a href="http://www.nature.com/index.html">Nature Online</a></li>
          <li>Examples in this lecture:
            <ul>
              <li><a href="comments.html">Comments</a></li>
              <li><a href="image.html">Images</a></li>
              <li><a href="list_table.html">Lists and Tables</a></li>
            </ul>
          </li>
        </ul>
      </body>
    </html>

Figure 20.4: Links in Pages
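Python's standard library includes a simple DOM implementation, `minidom`, which represents a document such as this one as a tree:

    <root>
      <first>element</first>
      <second attr="value">element</second>
      <third-element/>
    </root>
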
Figure 20.5: A DOM Tree
The file `mercury.xml`:

    <?xml version="1.0" encoding="utf-8"?>
    <planet name="Mercury">
      <period units="days">87.97</period>
    </planet>

Parsing it and writing it back out:

    import xml.dom.minidom
    doc = xml.dom.minidom.parse('mercury.xml')
    print doc.toxml('utf-8')

Output:

    <?xml version="1.0" encoding="utf-8"?>
    <planet name="Mercury">
      <period units="days">87.97</period>
    </planet>

- The `toxml` method can be called on the document, or on any element node, to create text

Text extracted from the tree is a Unicode string:

    import xml.dom.minidom
    my_xml = '''<name>Donald Knuth</name>'''
    my_doc = xml.dom.minidom.parseString(my_xml)
    name = my_doc.documentElement.firstChild.data
    print 'name is:', name
    print 'but name in full is:', repr(name)

Output:

    name is: Donald Knuth
    but name in full is: u'Donald Knuth'

- Note the `u` in front of the string the second time it is printed
- The `print` statement converts the Unicode string to ASCII for display

A document can also be parsed directly from a string with `parseString`:

    import xml.dom.minidom
    src = '''<planet name="Venus">
      <period units="days">224.7</period>
    </planet>'''
    doc = xml.dom.minidom.parseString(src)
    print doc.toxml('utf-8')

Output:

    <?xml version="1.0" encoding="utf-8"?>
    <planet name="Venus">
      <period units="days">224.7</period>
    </planet>

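
To illustrate the earlier point that `toxml` works on any element node as well as on the document, here is a short sketch. It is written for Python 3 (unlike the Python 2 listings in these notes), where `toxml()` without an encoding returns a string and `toxml('utf-8')` returns bytes.

```python
import xml.dom.minidom

src = '<planet name="Venus"><period units="days">224.7</period></planet>'
doc = xml.dom.minidom.parseString(src)
period = doc.getElementsByTagName('period')[0]

# Serializing the whole document includes the XML declaration...
print(doc.toxml())
# ...while serializing a single element gives just that element.
print(period.toxml())
# With an explicit encoding, Python 3 returns bytes rather than str.
print(type(doc.toxml('utf-8')))
```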
Building a document from scratch:

    import xml.dom.minidom
    impl = xml.dom.minidom.getDOMImplementation()
    doc = impl.createDocument(None, 'planet', None)
    root = doc.documentElement
    root.setAttribute('name', 'Mars')
    period = doc.createElement('period')
    root.appendChild(period)
    text = doc.createTextNode('686.98')
    period.appendChild(text)
    print doc.toxml('utf-8')

Output:

    <?xml version="1.0" encoding="utf-8"?>
    <planet name="Mars"><period>686.98</period></planet>

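- `xml.dom.minidom` is really just a wrapper around other platform-specific XML libraries
- `getDOMImplementation` returns the object that knows how to create the `document` node
- The second argument to `createDocument` specifies the type of the document's root node
- The other arguments to `createDocument` are the namespace URI and the document type (both `None` here)
- Attributes are set with `setAttribute(attributeName, newValue)`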
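
The construction steps listed above can be collected into one function. This is a sketch for Python 3; the helper name `make_planet` and the `units` attribute on `period` are assumptions added for illustration, not part of the original example.

```python
import xml.dom.minidom

def make_planet(name, period_days):
    """Build a <planet> document with one <period> child (hypothetical helper)."""
    impl = xml.dom.minidom.getDOMImplementation()
    # Arguments: namespace URI, root element name, document type.
    doc = impl.createDocument(None, 'planet', None)
    root = doc.documentElement
    root.setAttribute('name', name)

    # Nodes are always created by the document, then attached to a parent.
    period = doc.createElement('period')
    period.setAttribute('units', 'days')
    period.appendChild(doc.createTextNode(str(period_days)))
    root.appendChild(period)
    return doc

doc = make_planet('Mars', 686.98)
print(doc.documentElement.toxml())
```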
- A common task: find all `experimenter` nodes, extract names, and print a sorted list
- Use the `getElementsByTagName` method to do this

For example, to print the names of all the moons:

    import xml.dom.minidom
    src = '''<heavenly_bodies>
      <planet name="Mercury"/>
      <planet name="Venus"/>
      <planet name="Earth"/>
      <moon name="Moon"/>
      <planet name="Mars"/>
      <moon name="Phobos"/>
      <moon name="Deimos"/>
    </heavenly_bodies>'''
    doc = xml.dom.minidom.parseString(src)
    for node in doc.getElementsByTagName('moon'):
        print node.getAttribute('name')

Output:

    Moon
    Phobos
    Deimos

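
The "find the `experimenter` nodes and print a sorted list" task mentioned above might look like the following in Python 3. The input document is invented for this sketch: the `experiment` root element and the `name` attribute are assumptions about how such data could be stored.

```python
import xml.dom.minidom

# Hypothetical data: element and attribute names are assumptions.
src = '''<experiment>
<experimenter name="Turing"/>
<experimenter name="Hopper"/>
<experimenter name="Knuth"/>
</experiment>'''

doc = xml.dom.minidom.parseString(src)
# getElementsByTagName searches the whole tree below the document.
names = sorted(node.getAttribute('name')
               for node in doc.getElementsByTagName('experimenter'))
for name in names:
    print(name)
```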
- Every node has a `nodeType`: `ELEMENT_NODE`, `TEXT_NODE`, `ATTRIBUTE_NODE`, `DOCUMENT_NODE`
- A node's children are stored in `childNodes`
- A text node's text is stored in `data`

For example, to walk the whole tree recursively:

    import xml.dom.minidom
    src = '''<solarsystem>
    <planet name="Mercury"><period units="days">87.97</period></planet>
    <planet name="Venus"><period units="days">224.7</period></planet>
    <planet name="Earth"><period units="days">365.26</period></planet>
    </solarsystem>
    '''

    def walkTree(currentNode, indent=0):
        spaces = ' ' * indent
        if currentNode.nodeType == currentNode.TEXT_NODE:
            # Text nodes have no tag name, so show their length instead.
            print spaces + 'TEXT' + ' (%d)' % len(currentNode.data)
        else:
            print spaces + currentNode.tagName
            for child in currentNode.childNodes:
                walkTree(child, indent+1)

    doc = xml.dom.minidom.parseString(src)
    walkTree(doc.documentElement)

Output:

    solarsystem
     TEXT (1)
     planet
      period
       TEXT (5)
     TEXT (1)
     planet
      period
       TEXT (5)
     TEXT (1)
     planet
      period
       TEXT (6)
     TEXT (1)

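
The one-character `TEXT` nodes in that output are the newlines between elements, which usually have to be skipped when extracting data. The following Python 3 sketch shows one way to do that using only `nodeType`, `childNodes`, and `data`; the helper names `element_children` and `text_of` are hypothetical.

```python
import xml.dom.minidom

src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
</solarsystem>
'''

def element_children(node):
    """Return only the element children, skipping whitespace text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]

def text_of(node):
    """Concatenate the data of all text nodes directly under node."""
    return ''.join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)

doc = xml.dom.minidom.parseString(src)
periods = {}
for planet in element_children(doc.documentElement):
    for child in element_children(planet):
        if child.tagName == 'period':
            periods[planet.getAttribute('name')] = float(text_of(child))
print(periods)
```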
- Goal: replace the first word of each paragraph with an `em` element whose only child is a text node containing that word
Figure 20.6: Modifying the DOM Tree
em
Find all the paragraphs using `getElementsByTagName`, and iterate over them:

    import re
    import xml.dom.minidom

    def emphasize(doc):
        paragraphs = doc.getElementsByTagName('p')
        for para in paragraphs:
            first = para.firstChild
            # Skip empty paragraphs and ones that start with an element.
            if first is not None and first.nodeType == first.TEXT_NODE:
                emphasizeText(doc, para, first)

    def emphasizeText(doc, para, textNode):
        # Look for optional spaces, a word, and the rest of the paragraph.
        m = re.match(r'^(\s*)(\S*)\b(.*)$', str(textNode.data))
        if not m:
            return
        leadingSpace, firstWord, restOfText = m.groups()
        if not firstWord:
            return

        # If there's text after the first word, re-save it.
        if restOfText:
            restOfText = doc.createTextNode(restOfText)
            para.insertBefore(restOfText, para.firstChild)

        # Emphasize the first word.
        emph = doc.createElement('em')
        emph.appendChild(doc.createTextNode(firstWord))
        para.insertBefore(emph, para.firstChild)

        # If there's leading space, re-save it.
        if leadingSpace:
            leadingSpace = doc.createTextNode(leadingSpace)
            para.insertBefore(leadingSpace, para.firstChild)

        # Get rid of the original text.
        para.removeChild(textNode)

    if __name__ == '__main__':
        src = '''<html><body>
    <p>First paragraph.</p>
    <p>Second paragraph contains <em>emphasis</em>.</p>
    <p>Third paragraph.</p>
    </body></html>'''
        doc = xml.dom.minidom.parseString(src)
        emphasize(doc)
        print doc.toxml('utf-8')

Output:

    <?xml version="1.0" encoding="utf-8"?>
    <html><body>
    <p><em>First</em> paragraph.</p>
    <p><em>Second</em> paragraph contains <em>emphasis</em>.</p>
    <p><em>Third</em> paragraph.</p>
    </body></html>

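
The listing above rearranges the tree with `insertBefore` and `removeChild`. As a smaller illustration of the same kind of edit, this Python 3 sketch uses `replaceChild` to wrap a paragraph's whole leading text node in `em`, rather than just its first word; the helper name `wrap_text_in_em` and the sample markup are hypothetical, and this is a simplified variant rather than the method shown above.

```python
import xml.dom.minidom

def wrap_text_in_em(doc, element):
    """Wrap element's leading text node in a new <em> (hypothetical helper)."""
    first = element.firstChild
    if first is None or first.nodeType != first.TEXT_NODE:
        return
    emph = doc.createElement('em')
    # replaceChild(newChild, oldChild) swaps the nodes in one step...
    element.replaceChild(emph, first)
    # ...and the detached text node can then be re-attached under <em>.
    emph.appendChild(first)

doc = xml.dom.minidom.parseString('<p>Important<b>!</b></p>')
wrap_text_in_em(doc, doc.documentElement)
print(doc.documentElement.toxml())
```

A node can only have one parent at a time, which is why the text node is moved under `em` only after it has been taken out of the paragraph.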
Copyright © 2005-09 Python Software Foundation.
Created Thu Aug 6 21:56:06 2009 UTC