Config file processing with LibXML2

Level: Intermediate
David Dougall (david.dougall@gmail.com), Freelance writer
Nicholas Chase (ibmquestions@nicholaschase.com) Backstop Media
23 May 2006
Discover how to use XML (Extensible Markup Language) in your UNIX® applications. This article, for UNIX developers who are unfamiliar with XML, explores the XML libraries developed by the Gnome project. After briefly explaining XML in general, you'll examine example code that a UNIX application developer might use to parse and manage configuration files that are in the XML format using the LibXML2 libraries.
XML is a great choice for well-developed, highly interoperable applications and, as such, it is becoming more and more common both for data storage and for configuration file management. This article explores an example application that uses XML (Extensible Markup Language) as the format for its configuration file as a way of showing you how you can use XML in your own UNIX applications. The sample application is written in Perl and uses Perl modules based on the Gnome project's LibXML2 library.
After a brief definition of XML, this article shows a sample configuration file written using XML. Example code then shows how to parse this config file. A system manager can modify the config file by hand, but it is assumed that at some level an application will be required to modify the config file directly. The article then shows an example of how to programmatically add new configuration options to the XML document, as well as how to change values of current entries. Lastly, it presents the code for writing the changed configuration file out to disk.


Before getting into the actual LibXML2 libraries, let's start by getting a firm grounding in XML. XML is a text-based format for creating structured data accessible in any language and from any platform. It involves a series of HTML-like tags arranged in a treelike structure.
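Consider, for example, the simple document shown in Listing 1. This is actually a simplified version of the config file example explored in the Conf file section. It has been simplified in order to make general XML concepts easier to see.
Listing 1. A simple XML document
    <?xml version="1.0" encoding="UTF-8"?>
    <files>
       <owner>root</owner>
       <action>delete</action>
       <age units="...">10</age>
    </files>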
The first line in Listing 1 is the XML declaration, which tells the application responsible for processing XML, the parser, what version of XML it's dealing with. The overwhelming majority of files will be written using 1.0, but a very small number will be 1.1 files. It also defines the encoding to use. Most files use UTF-8, but XML is designed to incorporate data in virtually any language, including those that don't use the English language alphabet.
Next come the elements. An element starts with an open tag (such as <files>) and ends with a close tag (such as </files>), with a slash (/) distinguishing it from the open tag.
An element is a type of Node. The XML Document Object Model (DOM) defines several types of Nodes, including Elements (such as files or age), Attributes (such as units), and Text (such as root or 10). Elements can have child nodes. For example, the age element has one child, the text node 10. The files element, on the other hand, has seven children. Three are obvious: the element children owner, action, and age. The other four are the whitespace text nodes before and after the elements.
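The seven-children count described above can be checked with a short script. This is a minimal sketch, assuming the XML::LibXML Perl module is installed; the sample document is modeled on Listing 1, and the units value "days" and the helper name classify_children are assumptions made only for this illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample document modeled on Listing 1. The units value "days" is an
# assumption made for this illustration.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<files>
   <owner>root</owner>
   <action>delete</action>
   <age units="days">10</age>
</files>
XML

my $doc   = XML::LibXML->new->parse_string($xml);
my $files = $doc->documentElement;

# Hypothetical helper: returns the names of the element children and the
# number of text children of a node.
sub classify_children {
    my ($node) = @_;
    my @elements;
    my $text_count = 0;
    for my $child ($node->childNodes) {
        if ($child->nodeType == XML_ELEMENT_NODE) {
            push @elements, $child->nodeName;
        }
        elsif ($child->nodeType == XML_TEXT_NODE) {
            $text_count++;
        }
    }
    return (\@elements, $text_count);
}

my ($elements, $text_count) = classify_children($files);
printf "%d element children (%s), %d text children\n",
    scalar @$elements, join(", ", @$elements), $text_count;
```

With the indentation shown, the script should report three element children and four text children, matching the count given above.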
XML parsers can use this parent-child structure to navigate a document, and even to modify the structure or content of the document. LibXML2 is one of these parsers, and the sample application uses this structure for just that purpose. There are many parsers and libraries available for many different environments. LibXML2 is one of the most widely used on UNIX and has bindings for several scripting languages, such as Perl and Python.
To begin the investigation, let's look at an example config file.


The sample application for this article is one that reads a list of actions to take on specific files. A config file defines those files and actions. Assume the config file is a file located somewhere on the UNIX filesystem. This configuration file example might be used, for instance, by a UNIX cron job. The XML defines directory paths and actions to perform based on criteria, such as owner and file age (see Listing 2).
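Listing 2. Sample config file
    <?xml version="1.0" encoding="UTF-8"?>
    <filesystem>
       <path>
          <dirname>/var</dirname>
          <files>
             <owner>root</owner>
             <action>delete</action>
             <age units="...">10</age>
          </files>
          <files>
             <owner>any</owner>
             <action>delete</action>
             <age units="...">96</age>
          </files>
       </path>
       <path>
          <dirname>/tmp</dirname>
          <files>
             <owner>any</owner>
             <action>delete</action>
             <age units="...">24</age>
          </files>
       </path>
    </filesystem>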
In this case, the root element is filesystem, which contains two path elements. Each path element contains the directory name and one or more files elements. Each files element defines an action for the application to take regarding a user or user's files when they reach a particular age, with the units for that age value specified in the units attribute on the age element. Remember, the whitespace is significant. From a structural point of view, each piece makes up a separate Text node.
In a production environment, a well-written UNIX application would possess the ability to not only read the data and act on it, but also the ability to add, remove, and edit data in accordance with user input.
Now let's look at the application that uses this data.


The remainder of this article discusses a framework with example code for parsing and managing XML configuration files. The examples specifically read and alter a configuration file, but you can use these concepts for any of the tasks that arise in a UNIX developer's life. What's more, since you're using the LibXML2 libraries, you can plug these concepts into virtually any UNIX application.
We decided to show the examples in this article using a Perl version of the LibXML2 libraries. Most of the documentation on the Internet discusses programming in Java™ or Microsoft® Visual Studio tools, but for a UNIX user or developer, Perl is often more useful. Listing 3 shows the Perl modules required for parsing this XML document.
Listing 3. Required Perl modules
    XML::LibXML
    XML::LibXML::Common
    XML::NamespaceSupport
    XML::SAX
The code shown in the following sections is merely a framework. It is discussed in three parts: parsing, manipulating, and exporting.
During the loading and parsing stage, the data would likely be loaded into Perl variables, such as lists or hashes, but as every programmer has his or her own preferred way to do that, we'll leave it as an exercise for the reader. The following code merely prints data to show that the script has correctly found it.
In the manipulating stage, the program makes changes to the data by adding, modifying, and deleting elements in the XML document. Normally, this would be in response to user actions.
Finally, the exporting stage takes the final document after it has been modified and writes it back out to disk.


The first step in reading the XML file is for the application to load the data and parse it into a Document object. From there, you can navigate the DOM tree to get to specific nodes. Let's see how that works in Listing 4.
Listing 4. Parsing the config file
    use strict;
    use warnings;
    use XML::LibXML;    # also exports the XML_*_NODE node-type constants

    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_file("example.xml");

    my $filesystem = $doc->getDocumentElement();
    foreach my $node ($filesystem->childNodes) {
        # ignore text nodes
        next unless $node->nodeType == XML_ELEMENT_NODE;

        # just get the first match
        my ($dirname) = $node->getElementsByTagName("dirname");
        if ($dirname) {
            print "dirname: " . $dirname->textContent . "\n";
            # push this into an array
        }

        # get all children
        foreach my $file ($node->getChildrenByTagName("files")) {
            foreach my $values ($file->childNodes) {
                # ignore text nodes
                next if $values->nodeType == XML_TEXT_NODE;

                print $values->nodeName() . ": " . $values->textContent;
                if ($values->nodeName() eq "age") {
                    # check for attribute, otherwise, use default of 'hours'
                    if ($values->hasAttributes()) {
                        print " " . $values->attributes->item(0)->value();
                    }
                    else {
                        print " hours";
                    }
                    # calculate extended value from units and put in a
                    # hash linked with this dirname, etc.
                }
                # otherwise, put this value into a hash linked with $dirname.
                # We may have multiple entries for each $dirname, so
                # perhaps use an array within a hash
                print "\n";
            }
        }
    }
First, in Listing 4 you create the parser and load the XML from a file into an XML::LibXML::Document variable. This object contains the entire XML tree and has methods associated with it to search for nodes, export, validate, and create new nodes. This article discusses some of these in the next few sections. Starting at the top, you see the getDocumentElement() method, which returns the root node. From this root node, you can traverse the entire XML tree.
The main foreach loop cycles through each of the nodes within the parent filesystem element. Because the code deliberately selects only element nodes, this leaves just the path elements. The getElementsByTagName() method searches within the node for elements with the corresponding name and returns them in a NodeList object (or as a plain Perl list when called in list context). Each path element contains a single dirname element, so the code searches for elements named dirname and grabs the first entry. The code must select only ELEMENT type nodes, because TEXT nodes do not support this method and produce a non-recoverable error in Perl.
There might be multiple files elements within a single path element, so the code loops through each one with the getChildrenByTagName() method, which is similar to getElementsByTagName(), but searches only the direct children of the target node. This returns each files element, but you must parse one step further to get the owner, action, and age elements. Once you have these nodes, you can call textContent to get the actual value from the element. This is a shortcut to selecting the value of the TEXT node that is the child of the ELEMENT node, as in:
    print $values->nodeName() . ": ";
    print $values->firstChild()->nodeValue();
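The difference between the two lookup methods described above is easiest to see side by side. The following is a minimal sketch, assuming XML::LibXML is installed; the compact sample document is an assumption that stands in for the real config file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Compact stand-in for the config file: one path holding two files entries.
my $xml = '<filesystem><path><dirname>/var</dirname>'
        . '<files><owner>root</owner></files>'
        . '<files><owner>any</owner></files>'
        . '</path></filesystem>';

my $doc  = XML::LibXML->new->parse_string($xml);
my $root = $doc->documentElement;

# getElementsByTagName() searches all descendants of the node.
my @all_files = $root->getElementsByTagName("files");

# getChildrenByTagName() looks only at direct children. The files elements
# are grandchildren of the root, so nothing matches here.
my @root_children = $root->getChildrenByTagName("files");

# Called on the path element, the same method finds both entries.
my ($path)     = $root->getChildrenByTagName("path");
my @path_files = $path->getChildrenByTagName("files");

printf "descendants: %d, root children: %d, path children: %d\n",
    scalar @all_files, scalar @root_children, scalar @path_files;
```

Both methods are called in list context here, so they return ordinary Perl lists that can be counted or looped over directly.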
In the case of the age element, you also have a possible attribute to give the time units. Using the hasAttributes() and attributes() functions, the program extracts this attribute if it exists. If it doesn't, the program uses the default value of hours.
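One possible way to complete the step left to the reader, loading the parsed values into Perl data structures, is sketched below. It is only an illustration: the function name load_rules, the hash keys, and the sample document are all assumptions, and it reads the units attribute by name with getAttribute() instead of taking the first attribute.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: walks a parsed config document and returns one hash
# per files entry. The hash keys are illustrative choices.
sub load_rules {
    my ($doc) = @_;
    my @rules;
    for my $path ($doc->documentElement->getChildrenByTagName("path")) {
        my ($dirname) = $path->getChildrenByTagName("dirname");
        next unless $dirname;
        for my $files ($path->getChildrenByTagName("files")) {
            my ($age) = $files->getChildrenByTagName("age");
            push @rules, {
                dir    => $dirname->textContent,
                owner  => $files->findvalue("owner"),
                action => $files->findvalue("action"),
                age    => $age ? $age->textContent : undef,
                # Fall back to the default unit of hours when the
                # attribute is absent.
                units  => ($age && $age->hasAttribute("units"))
                              ? $age->getAttribute("units")
                              : "hours",
            };
        }
    }
    return @rules;
}

# Sample input; the values are assumptions for this illustration.
my $xml = <<'XML';
<filesystem>
  <path>
    <dirname>/tmp</dirname>
    <files><owner>any</owner><action>delete</action><age>24</age></files>
    <files><owner>root</owner><action>delete</action><age units="days">10</age></files>
  </path>
</filesystem>
XML

my @rules = load_rules(XML::LibXML->new->parse_string($xml));
print "$_->{dir}: $_->{owner} $_->{action} after $_->{age} $_->{units}\n"
    for @rules;
```

An array of hashes keeps multiple entries per directory in document order; a hash keyed by directory name would be an equally reasonable choice.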
Now, let's look at manipulating the data so that you can programmatically add, remove, and edit actions.


With the current code in place, this is already a useful program. A user can easily make changes to what the program does by hand editing the XML file. However, a skilled UNIX developer can also use XML functions to directly modify the file from within the program itself. For example, you might include a menu option for adding a new action, or removing an existing action. To that end, let's look at code for manipulating the data from within the program.
Listing 5. Adding a new path element
    # $doc and $filesystem are the Document and root element from Listing 4
    my $newnode = $doc->createElement("path");

    my $newdirnode = $doc->createElement("dirname");
    $newdirnode->appendText("/root");

    my $newfilesnode = $doc->createElement("files");

    my $newownernode = $doc->createElement("owner");
    $newownernode->appendText("any");

    my $newactionnode = $doc->createElement("action");
    $newactionnode->appendText("archive");

    my $newagenode = $doc->createElement("age");
    $newagenode->appendText("30");
    $newagenode->setAttribute("units", "days");

    # assemble the files element, then the path element
    $newfilesnode->addChild($newownernode);
    $newfilesnode->addChild($newactionnode);
    $newfilesnode->addChild($newagenode);
    $newnode->addChild($newdirnode);
    $newnode->addChild($newfilesnode);

    # attach the finished path element to the root
    $filesystem->addChild($newnode);
The code in Listing 5 creates and populates all of the elements of a path element. It then adds this newly created node to the root element, filesystem. Each element needs to be created using the createElement() method of the XML::LibXML::Document class. (It is the Document that creates any new nodes you need.) This method returns an empty node that is not linked anywhere in the document tree yet. You can then add content to each node using the appendText() method of the XML::LibXML::Element class. Again, this is a shortcut for creating a new TEXT node, populating it, and adding it to the element. You can add attributes using the setAttribute() method, which automatically creates a new ATTRIBUTE node if one does not already exist on the target element with the given name.
After you have created each of the nodes and populated them, call the addChild() method on the parent node with the requested child node as a parameter. Thus in the above code, $newownernode becomes a child of $newfilesnode. All nodes exist in the document in the order in which they are added. If you wish to specify a different order, you can use the insertAfter() or insertBefore() functions.
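As an illustration of controlling child order, the following sketch places new elements relative to an existing one with insertBefore() and insertAfter(). It assumes XML::LibXML is installed; the starting fragment and the values are made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Start with a files element that holds only an action child.
my $doc = XML::LibXML->new->parse_string(
    '<files><action>archive</action></files>');
my $files    = $doc->documentElement;
my ($action) = $files->getChildrenByTagName("action");

# Place owner in front of the existing action element.
my $owner = $doc->createElement("owner");
$owner->appendText("any");
$files->insertBefore($owner, $action);

# Place age directly behind the existing action element.
my $age = $doc->createElement("age");
$age->appendText("30");
$age->setAttribute("units", "days");
$files->insertAfter($age, $action);

my @order = map { $_->nodeName } $files->childNodes;
print join(", ", @order), "\n";
```

Both methods take the new node first and the reference node second, and both are called on the parent of the reference node.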
Add each node to its parent until you finally add the main parent node to a node that already exists in the document. In the above example, you add this node to the root filesystem node. (If you were creating this document from scratch, you could call addChild() on the Document object itself to add the root element, and then add any other nodes to that element.)
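For the from-scratch case, a minimal sketch might look like the following. Note that it swaps in setDocumentElement(), the call XML::LibXML documents for attaching a root element, in place of the addChild() call mentioned above; the element values are assumptions for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Start with an empty document and give it a root element.
my $doc  = XML::LibXML::Document->new("1.0", "UTF-8");
my $root = $doc->createElement("filesystem");
$doc->setDocumentElement($root);

# From here on, nodes are attached just as they are for a parsed file.
my $path    = $doc->createElement("path");
my $dirname = $doc->createElement("dirname");
$dirname->appendText("/root");
$path->addChild($dirname);
$root->addChild($path);

# A true argument asks the serializer to indent the output.
print $doc->toString(1);
```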
As explained earlier, the example XML code from Listing 2 is in human readable format. The line breaks and indentation help make it easier to read. Each of those characters is read by the XML parser as a TEXT type node. The example in Listing 5 does not add any of these TEXT nodes. Thus, the output from this example will not have line breaks or indentation. If you wish to create this whitespace, you would need to create a TEXT type node using the XML::LibXML::Text class, or use the Document object's createTextNode() function. The return value of this constructor is a node that you can add to the tree in the same way as the nodes in the above example.
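A rough sketch of adding the whitespace by hand with createTextNode() follows. The helper name add_indented_child and the two-space indent are assumptions made for this illustration, and appendChild() is used here in place of addChild().

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc   = XML::LibXML::Document->new("1.0", "UTF-8");
my $files = $doc->createElement("files");
$doc->setDocumentElement($files);

# Hypothetical helper: appends a newline-plus-indent TEXT node and then the
# child element, so each child is serialized on its own line.
sub add_indented_child {
    my ($parent, $name, $text, $indent) = @_;
    my $owner_doc = $parent->ownerDocument;
    $parent->appendChild($owner_doc->createTextNode("\n" . $indent));
    my $child = $owner_doc->createElement($name);
    $child->appendText($text);
    $parent->appendChild($child);
    return $child;
}

add_indented_child($files, "owner",  "any",     "  ");
add_indented_child($files, "action", "archive", "  ");

# Final newline so the closing tag starts on its own line.
$files->appendChild($doc->createTextNode("\n"));

print $files->toString, "\n";
```

When a tree contains no whitespace nodes of its own, passing a true format argument to toString() or toFile() is a simpler way to get indented output.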
To change the contents of the file, you can either directly set the nodeValue() of the TEXT node in question, or you can replace the element altogether:
    # $oldnode is the existing owner element to be replaced
    my $newnode = $doc->createElement("owner");
    $newnode->appendText("toor");
    $oldnode->replaceNode($newnode);
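The other route, changing the text in place without replacing the element, might be sketched as follows. It assumes XML::LibXML is installed and uses a made-up fragment; setData() on the TEXT child and removeChildNodes() followed by appendText() are two ways to reach the same result.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string(
    '<files><owner>root</owner><action>delete</action></files>');
my $files = $doc->documentElement;

# Option 1: overwrite the data of the existing TEXT child.
my ($owner) = $files->getChildrenByTagName("owner");
$owner->firstChild->setData("toor");

# Option 2: drop whatever children the element has and append fresh text.
my ($action) = $files->getChildrenByTagName("action");
$action->removeChildNodes();
$action->appendText("archive");

print $files->toString, "\n";
```

Option 1 assumes the element already has a TEXT node as its first child; Option 2 also works when the element is empty.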
To delete a node, you have several options. One is to simply remove it from the structure, as in:
$file->unbindNode();
Once you have found the node that needs to be removed, a single command removes it from the structure, but not from the document. This function call does not destroy the node's data, which remains available until the program ends. If you wanted to move the node to another part of the tree, you could call addChild() with the same variable to add it back into the document in a new location. You also have the option to call removeChild() on the parent to remove a single child, or removeChildNodes() to remove all of a node's children.
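Moving a node as described above can be sketched like this. The two-path sample document is an assumption made for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string(
      '<filesystem>'
    . '<path><dirname>/var</dirname><files><owner>any</owner></files></path>'
    . '<path><dirname>/tmp</dirname></path>'
    . '</filesystem>');

my ($var, $tmp) = $doc->documentElement->getChildrenByTagName("path");
my ($files)     = $var->getChildrenByTagName("files");

# Detach the files element from /var; $files still refers to the node.
$files->unbindNode();

# Reattach the same node under /tmp.
$tmp->addChild($files);

my @var_files = $var->getChildrenByTagName("files");
my @tmp_files = $tmp->getChildrenByTagName("files");
printf "/var has %d files entries, /tmp has %d\n",
    scalar @var_files, scalar @tmp_files;
```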


In some languages, saving an XML document to a file can be quite tedious, but fortunately LibXML makes it very easy:
$doc->toFile("example.xml");
Of all the manipulations you've performed on the data, this one is by far the simplest. Once you have made the changes to the XML document in memory, a single function call writes it back out to the configuration file. You can also use related Perl functions, such as toString() and toFH(), that output the XML to a string variable or an open Perl filehandle respectively, which gives you a greater range of options for building your own applications.
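A small round-trip check can confirm that what was written parses back cleanly. This sketch assumes XML::LibXML is installed and writes to a temporary file (a stand-in for the real config path) created with the core File::Temp module.

```perl
use strict;
use warnings;
use XML::LibXML;
use File::Temp qw(tempfile);

my $doc = XML::LibXML->new->parse_string(
    '<filesystem><path><dirname>/tmp</dirname></path></filesystem>');

# Serialize to a string, for example to log or diff before saving.
my $as_string = $doc->toString();

# Write to a temporary file, which is removed when the program exits.
my ($fh, $filename) = tempfile(SUFFIX => ".xml", UNLINK => 1);
close $fh;
$doc->toFile($filename);

# Parse the file back to confirm that it is well-formed.
my $reloaded = XML::LibXML->new->parse_file($filename);
print "round trip ok\n"
    if $reloaded->documentElement->toString
        eq $doc->documentElement->toString;
```

In a real application, writing to a temporary file and renaming it over the old config only after the re-parse succeeds would reduce the risk of leaving a damaged file behind.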


The Gnome project has done a valuable service by providing the LibXML2 libraries and supporting Perl modules. This article discussed three important steps required to manage and use XML configuration files. The parsing stage might be considered the most complicated, as it requires a somewhat recursive design to parse the XML tree. Although somewhat tedious, the manipulation of the XML document in memory is very straightforward. Exporting the modified configuration is also very simple with the LibXML2 libraries.
Although somewhat of a paradigm shift from the standard UNIX ideology, XML can provide a powerful way to manage data. The tree structure provides a view of the data that is more flexible than a simple database format. When developing new applications or modifying legacy ones, you can easily standardize on XML configuration files using the standard libraries provided free by the Gnome project, as shown in this article.

David Dougall graduated with a master's degree in computer engineering from Brigham Young University (BYU) in 2001. He has worked as a UNIX System Administrator for six years at the College of Engineering and Technology at BYU. You can contact him at david.dougall@gmail.com.



Nicholas Chase has been involved in Web site development for companies such as Lucent Technologies, Sun Microsystems, Oracle, and the Tampa Bay Buccaneers. Nick has been a high school physics teacher, a low-level radioactive waste facility manager, an online science fiction magazine editor, a multimedia engineer, an Oracle instructor, and the Chief Technology Officer of an interactive communications company. He is the author of several books, including XML Primer Plus (Sams). You can contact him at ibmquestions@nicholaschase.com.