Best xml questions in July 2012

Is there any way or framework in Python to create an object model from XML?

6 votes

For example, my XML file contains:

<layout name="layout1">
    <grid>
        <row>
            <cell colSpan="1" name="cell1"/>
        </row>
        <row>
            <cell name="cell2" flow="horizontal"/>
        </row>
    </grid>
</layout>

and I want to retrieve an object from the XML. For example, the returned object structure would look like this:

class layout(object):
     def __init__(self):
         self.grid=None
class grid(object):
     def __init__(self):
         self.rows=[]
class row(object):
     def __init__(self):
         self.cells=[]

I found my answer: I used objectify from the lxml package.

Here is some sample code:

from lxml import objectify

root = objectify.fromstring("""
 <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <a attr1="foo" attr2="bar">1</a>
   <a>1.2</a>
   <b>1</b>
   <b>true</b>
   <c>what?</c>
   <d xsi:nil="true"/>
 </root>
""")

print(objectify.dump(root))

It prints:

root = None [ObjectifiedElement]
    a = 1 [IntElement]
      * attr1 = 'foo'
      * attr2 = 'bar'
    a = 1.2 [FloatElement]
    b = 1 [IntElement]
    b = True [BoolElement]
    c = 'what?' [StringElement]
    d = None [NoneElement]
      * xsi:nil = 'true'
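For comparison, here is a minimal sketch that builds the classes from the question by hand, using only the standard library's xml.etree.ElementTree instead of lxml. The Cell class and the build_layout helper are illustrative names introduced here, not part of any library, and the sketch assumes the exact layout/grid/row/cell nesting shown in the question.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<layout name="layout1">
    <grid>
        <row>
            <cell colSpan="1" name="cell1"/>
        </row>
        <row>
            <cell name="cell2" flow="horizontal"/>
        </row>
    </grid>
</layout>
"""


class Cell:
    def __init__(self, attrs):
        self.attrs = dict(attrs)  # keep the XML attributes as a plain dict


class Row:
    def __init__(self):
        self.cells = []


class Grid:
    def __init__(self):
        self.rows = []


class Layout:
    def __init__(self, name):
        self.name = name
        self.grid = None


def build_layout(xml_text):
    """Hypothetical helper: map a <layout> document onto the classes above."""
    root = ET.fromstring(xml_text)
    layout = Layout(root.get("name"))
    grid_el = root.find("grid")
    if grid_el is not None:
        layout.grid = Grid()
        for row_el in grid_el.findall("row"):
            row = Row()
            row.cells = [Cell(c.attrib) for c in row_el.findall("cell")]
            layout.grid.rows.append(row)
    return layout


layout = build_layout(SAMPLE)
print(layout.name, len(layout.grid.rows))
```

Compared with objectify, this costs more code per element type but gives full control over the resulting class structure.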

lxml: add namespace to input file

5 votes

I am parsing an xml file generated by an external program. I would then like to add custom annotations to this file, using my own namespace. My input looks as below:

<sbml xmlns="http://www.sbml.org/sbml/level2/version4" xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner" level="2" version="4">
  <model metaid="untitled" id="untitled">
    <annotation>...</annotation>
    <listOfUnitDefinitions>...</listOfUnitDefinitions>
    <listOfCompartments>...</listOfCompartments>
    <listOfSpecies>
      <species metaid="s1" id="s1" name="GenA" compartment="default" initialAmount="0">
        <annotation>
          <celldesigner:extension>...</celldesigner:extension>
        </annotation>
      </species>
      <species metaid="s2" id="s2" name="s2" compartment="default" initialAmount="0">
        <annotation>
           <celldesigner:extension>...</celldesigner:extension>
        </annotation>
      </species>
    </listOfSpecies>
    <listOfReactions>...</listOfReactions>
  </model>
</sbml>

The issue is that lxml only declares a namespace where it is used, which means the declaration is repeated many times, like so (simplified):

<sbml xmlns="namespace" xmlns:celldesigner="morenamespace" level="2" version="4">
  <listOfSpecies>
    <species>
      <kjw:test xmlns:kjw="http://this.is.some/custom_namespace"/>
      <celldesigner:data>Some important data which must be kept</celldesigner:data>
    </species>
    <species>
      <kjw:test xmlns:kjw="http://this.is.some/custom_namespace"/>
    </species>
    ....
  </listOfSpecies>
</sbml>

Is it possible to force lxml to write this declaration only once in a parent element, such as sbml or listOfSpecies? Or is there a good reason not to do so? The result I want would be:

<sbml xmlns="namespace" xmlns:celldesigner="morenamespace" level="2" version="4"  xmlns:kjw="http://this.is.some/custom_namespace">
  <listOfSpecies>
    <species>
      <kjw:test/>
      <celldesigner:data>Some important data which must be kept</celldesigner:data>
    </species>
    <species>
      <kjw:test/>
    </species>
    ....
  </listOfSpecies>
</sbml>

The important constraint is that the existing data read from the file must be kept, so I cannot just create a new root element (I think?).

EDIT: Code attached below.

def annotateSbml(sbml_input):
  from lxml import etree

  checkSbml(sbml_input) # Makes sure the input is valid sbml/xml.

  ns = "http://this.is.some/custom_namespace"
  etree.register_namespace('kjw', ns)

  sbml_doc = etree.ElementTree()
  root = sbml_doc.parse(sbml_input, etree.XMLParser(remove_blank_text=True))
  nsmap = root.nsmap
  nsmap['sbml'] = nsmap[None] # Makes code more readable, but seems ugly. Any alternatives to this?
  nsmap['kjw'] = ns
  ns = '{' + ns + '}'
  sbmlns = '{' + nsmap['sbml'] + '}'

  for species in root.findall('sbml:model/sbml:listOfSpecies/sbml:species', nsmap):
    species.append(etree.Element(ns + 'test'))

  sbml_doc.write("test.sbml.xml", pretty_print=True, xml_declaration=True)

  return

Modifying the namespace mapping of a node is not possible in lxml. See this open ticket that has this feature as a wishlist item.

It originated from this thread on the lxml mailing list, where a workaround replacing the root node is given as an alternative. There are some issues with replacing the root node though: see the ticket above.

I'll put the suggested root replacement workaround code here for completeness:

>>> DOC = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner" level="2" version="4">
...   <model metaid="untitled" id="untitled">
...     <annotation>...</annotation>
...     <listOfUnitDefinitions>...</listOfUnitDefinitions>
...     <listOfCompartments>...</listOfCompartments>
...     <listOfSpecies>
...       <species metaid="s1" id="s1" name="GenA" compartment="default" initialAmount="0">
...         <annotation>
...           <celldesigner:extension>...</celldesigner:extension>
...         </annotation>
...       </species>
...       <species metaid="s2" id="s2" name="s2" compartment="default" initialAmount="0">
...         <annotation>
...            <celldesigner:extension>...</celldesigner:extension>
...         </annotation>
...       </species>
...     </listOfSpecies>
...     <listOfReactions>...</listOfReactions>
...   </model>
... </sbml>"""
>>> 
>>> from lxml import etree
>>> from io import StringIO
>>> NS = "http://this.is.some/custom_namespace"
>>> tree = etree.ElementTree(element=None, file=StringIO(DOC))
>>> root = tree.getroot()
>>> nsmap = root.nsmap
>>> nsmap['kjw'] = NS
>>> # Attributes of the old root (level, version) are not copied by this snippet.
>>> new_root = etree.Element(root.tag, nsmap=nsmap)
>>> new_root[:] = root[:]
>>> new_root.append(etree.Element('{%s}%s' % (NS, 'test')))
>>> new_root.append(etree.Element('{%s}%s' % (NS, 'test')))

>>> print(etree.tostring(new_root, pretty_print=True).decode())
<sbml xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner" xmlns:kjw="http://this.is.some/custom_namespace" xmlns="http://www.sbml.org/sbml/level2/version4"><model metaid="untitled" id="untitled">
    <annotation>...</annotation>
    <listOfUnitDefinitions>...</listOfUnitDefinitions>
    <listOfCompartments>...</listOfCompartments>
    <listOfSpecies>
      <species metaid="s1" id="s1" name="GenA" compartment="default" initialAmount="0">
        <annotation>
          <celldesigner:extension>...</celldesigner:extension>
        </annotation>
      </species>
      <species metaid="s2" id="s2" name="s2" compartment="default" initialAmount="0">
        <annotation>
           <celldesigner:extension>...</celldesigner:extension>
        </annotation>
      </species>
    </listOfSpecies>
    <listOfReactions>...</listOfReactions>
  </model>
<kjw:test/><kjw:test/></sbml>
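As a point of comparison, and not an lxml solution, the standard library's xml.etree.ElementTree serializer collects every namespace used in the tree and declares it once on the root element. The sketch below uses a cut-down version of the SBML input. Two assumptions to be aware of: ET.register_namespace changes a process-wide prefix table, and any namespace that is not registered (such as celldesigner in the full document) would be written with a generated prefix like ns0.

```python
import xml.etree.ElementTree as ET

SBML_NS = "http://www.sbml.org/sbml/level2/version4"
KJW_NS = "http://this.is.some/custom_namespace"

# Register prefixes globally; "" keeps SBML as the default namespace.
ET.register_namespace("", SBML_NS)
ET.register_namespace("kjw", KJW_NS)

# Cut-down input, assumed for illustration only.
DOC = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="untitled">
    <listOfSpecies>
      <species id="s1"/>
      <species id="s2"/>
    </listOfSpecies>
  </model>
</sbml>"""

root = ET.fromstring(DOC)
ns = {"sbml": SBML_NS}

# Append one annotation element in the custom namespace to every species.
for species in root.findall("sbml:model/sbml:listOfSpecies/sbml:species", ns):
    ET.SubElement(species, "{%s}test" % KJW_NS)

output = ET.tostring(root, encoding="unicode")
print(output)
```

Because the existing tree is modified in place, no root replacement is needed and the root's attributes are preserved.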

Generating an XML document from Java objects where the structure is very different

5 votes

The Situation

I have a complex model object graph in Java that needs to be translated back and forth into an XML document. The object graph structure of the XML document's schema is extremely different from the model's object tree. The two are interchangeable, but the translation requires lots of context-driven logic where parent/child-like relationships are used.

The Problem

I'm working with model objects that are well established in an older system, and the XML document's schema is fairly new. Since lots of our code depends on the structure of the model objects, we don't want to restructure them. Here is a simplified example of the type of structural differences I'm dealing with:

Example data model tree

Item

  • Description
  • cost
  • ...

Person

  • First Name
  • Last Name
  • Address
  • ...

Address

  • Street
  • City
  • ...

SaleTransaction (*this is the thing being translated)

  • Buyer (Person)
  • Seller (Person)
  • Sold Items[] (List)
  • Exchanged Items[] (List)
  • Location of Transaction (Address)

Example XML Document Structure

Exchange

  • Type
  • Parties
    • party_contact_ref
      • type
      • contact_id
  • Exchange Details
    • type
    • total_amount_exchanged
  • Items
    • Item
      • type
      • owning_party_contact_ref_id
      • exchange_use_type
  • Contacts
    • Contact
      • id
      • type

Exchange Type: [ CASH SALE | BARTER | COMBINATION CASH AND BARTER ]

Contact Type: [ PERSON | ADDRESS ]

Exchange Details Type: [ CASH EXCHANGE | BARTER EXCHANGE ]

Mapping between SaleTransaction and Exchange is possible, just not 1-1. In the example, the "buyer" in the model would be mapped to both a contact and a contact reference element in the XML document. Also, the value of the "owning_party_contact_ref_id" attribute of an "Item" element would be determined by looking at several different values in the SaleTransaction object graph.

If an object graph I'm working with needs some translation in order to be used in an XML document, my go-to tool is an XmlAdapter. In this case though, I don't see using JAXB XML adapters as a viable solution for three reasons.

  1. Which XML element an object in the model graph corresponds to is data dependent. I believe all XmlAdapter-to-class/property mappings are fixed.
  2. It doesn't seem to be possible to do a many-to-one or one-to-many mapping with XmlAdapters. MOXy has an interesting extension, but again, it requires fixed mappings to properties.
  3. As far as I know, XmlAdapters work with individual objects and don't have a way to get at the context of the entire graph being marshalled/unmarshalled.

The Question

I'm sure this type of problem is fairly common, so how do you handle it? Is there a way of handling this problem with standard tools?

What I've come up with

In case it's interesting, here are the possible approaches that I've come up with:

#1 Separate the object graph translation problem from the XML generation problem. I have a home-grown tool that assists with generating object graphs based on some context object. I could create the JAXB classes from the XML schema, then rely on this tool to generate the objects of those classes based on the context of our model object. This would work well to generate an XML document from the model object graph, but not the other way around. It also means relying on non-standard tools, which I'd like to avoid if possible.

#2 Go XmlAdapter crazy and modify the model classes to be able to retain translation state information (e.g. This object in the model tree was used to create this element in the XML document). This would keep the problem very close to the standard usage model for JAXB, but I think it would be a nightmare to develop, test and maintain.

#3 Separate the object graph problem like I would in #1, but use JDOM instead of JAXB. This would remove all of the JAXB required classes and mappings, but require another custom tool be built to manage the model object to DOM tree mappings.

I'm not super excited about any of the three solutions, but I'm most partial to #1.

#1 is your best bet, IMHO. Writing mapping code is tedious, but you should resist the urge to be too clever. Any mapping tool you use is going to require configuration, and I bet that's just as much work as writing the Java mapping code by hand. Just write lots of unit tests.

You can try Dozer for any classes that have similarly named fields; it uses reflection to do the mapping. I've used it in the past, but my schema looked more similar to my domain objects, so it might not be that helpful here.

To make the code more pleasant, use all the xjc plugins you can find for JAXB, such as the fluent-api and value-constructor ones.
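To make the "write the mapping by hand" advice concrete, here is a rough one-way sketch of the model-to-XML direction, written in Python for brevity; the same shape carries over to Java. The dataclasses are cut-down stand-ins for the model in the question, and the role names (BUYER, SELLER), the SOLD use type, the CASH SALE choice, and the c1/c2 contact ids are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    first_name: str
    last_name: str


@dataclass
class Item:
    description: str
    cost: float


@dataclass
class SaleTransaction:
    buyer: Person
    seller: Person
    sold_items: List[Item] = field(default_factory=list)


def to_exchange(sale):
    """Hand-written, one-way mapping from the model to the XML structure."""
    exchange = ET.Element("Exchange", {"type": "CASH SALE"})
    parties = ET.SubElement(exchange, "Parties")
    items = ET.SubElement(exchange, "Items")
    contacts = ET.SubElement(exchange, "Contacts")

    # One-to-many: each person becomes BOTH a Contact and a party_contact_ref.
    ids = {}
    for role, person in (("BUYER", sale.buyer), ("SELLER", sale.seller)):
        contact_id = "c%d" % (len(ids) + 1)  # hypothetical id scheme
        ids[role] = contact_id
        ET.SubElement(contacts, "Contact", {"id": contact_id, "type": "PERSON"})
        ET.SubElement(parties, "party_contact_ref",
                      {"type": role, "contact_id": contact_id})

    # Context-driven: in this sketch, sold items are owned by the seller.
    for item in sale.sold_items:
        ET.SubElement(items, "Item", {
            "type": item.description,
            "owning_party_contact_ref_id": ids["SELLER"],
            "exchange_use_type": "SOLD",
        })
    return exchange


sale = SaleTransaction(Person("Ann", "Lee"), Person("Bob", "Ray"),
                       [Item("book", 5.0)])
print(ET.tostring(to_exchange(sale), encoding="unicode"))
```

Keeping the whole graph in scope inside one function is what lets the mapping use context, which is the part XmlAdapters struggle with; it is also straightforward to cover with unit tests.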

Linq to XML - Find an element

4 votes

I am sure that this is basic and has probably been asked before, but I am only starting to use LINQ to XML.

I have a simple XML file that I need to read and write to.

<Documents>
...
    <Document>
      <GUID>09a1f55f-c248-44cd-9460-c0aab7c017c9-0</GUID>
      <ArchiveTime>2012-05-15T14:27:58.5270023+02:00</ArchiveTime>
      <ArchiveTimeUtc>2012-05-15T12:27:58.5270023Z</ArchiveTimeUtc>
      <IndexDatas>
        <IndexData>
          <Name>Name1</Name>
          <Value>Some value</Value>
          <DataType>1</DataType>
          <CreationTime>2012-05-15T14:27:39.6427753+02:00</CreationTime>
          <CreationTimeUtc>2012-05-15T12:27:39.6427753Z</CreationTimeUtc>
        </IndexData>
        <IndexData>
          <Name>Name2</Name>
          <Value>Some value</Value>
          <DataType>3</DataType>
          <CreationTime>2012-05-15T14:27:39.6427753+02:00</CreationTime>
          <CreationTimeUtc>2012-05-15T12:27:39.6427753Z</CreationTimeUtc>
        </IndexData>
   ...
 </IndexDatas>
</Document>
...
</Documents>

I have a "Documents" node that contains a bunch of "Document" nodes.

I have the GUID of the document and an "IndexData" name. I need to find the document by GUID and check whether it has an "IndexData" with that name. If it does not, I need to add it.

Any help would be appreciated, as I have problems reading and searching through elements.

Currently I am trying to use (in C#):

IEnumerable<XElement> xmlDocuments = from c in XElement
                                        .Load(filePath)
                                        .Elements("Documents") 
                                         select c;

// fetch document
 XElement documentElementToEdit = (from c in xmlDocuments where 
                    (string)c.Element("GUID").Value == GUID select c).Single();

EDIT

xmlDocuments.Element("Documents").Elements("Document")

This returns no result, even though xmlDocuments.Element("Documents") does. It looks like I can't get the Document nodes from the Documents node.

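You can find those documents (documents with the given GUID that have no IndexData entry with the given name) with the code below; after that, you can add your new element to the end of their IndexDatas element. Here doc is assumed to be the Documents element. Note that XElement.Load returns the root element itself, so if Documents is the root, query its Document children directly rather than calling Elements("Documents") on it.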

// doc is the <Documents> element, e.g. XElement.Load(filePath)
var relatedDocs = doc.Elements("Document")
   .Where(x => x.Element("GUID").Value == givenValue)
   .Where(x => !x.Element("IndexDatas")
              .Elements("IndexData")
              .Any(d => d.Element("Name").Value == someValue));
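The same find-or-add logic can be sketched in Python with the standard library, which may help when prototyping the query. The ensure_index_data helper is a hypothetical name, and the sketch assumes the Documents element is the root and that only Name and Value need to be written for a new entry.

```python
import xml.etree.ElementTree as ET


def ensure_index_data(root, guid, name, value=""):
    """Add an IndexData entry called `name` to the document with `guid`.

    Returns True if an entry was added, False if it already existed.
    """
    for doc in root.findall("Document"):
        if doc.findtext("GUID") != guid:
            continue
        index_datas = doc.find("IndexDatas")
        if index_datas is None:
            index_datas = ET.SubElement(doc, "IndexDatas")
        names = [d.findtext("Name") for d in index_datas.findall("IndexData")]
        if name in names:
            return False
        entry = ET.SubElement(index_datas, "IndexData")
        ET.SubElement(entry, "Name").text = name
        ET.SubElement(entry, "Value").text = value
        return True
    raise KeyError("no Document with GUID %r" % guid)


root = ET.fromstring(
    "<Documents><Document><GUID>g-1</GUID><IndexDatas>"
    "<IndexData><Name>Name1</Name><Value>v</Value></IndexData>"
    "</IndexDatas></Document></Documents>"
)
print(ensure_index_data(root, "g-1", "Name2", "Some value"))
```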

XPath normalize-space() to return a sequence of normalized strings

4 votes

I need to use the XPath function normalize-space() to normalize the text I want to extract from an XHTML document: http://test.anahnarciso.com/clean_bigbook_0.html

I'm using the following expression:

//*[@slot="address"]/normalize-space(.)

Which works perfectly in Qizx Studio, the tool I use to test XPath expressions.

    let $doc := doc('http://test.anahnarciso.com/clean_bigbook_0.html')
    return $doc//*[@slot="address"]/normalize-space(.)

This simple query returns a sequence of xs:string.

144 Hempstead Tpke
403 West St
880 Old Country Rd
8412 164th St
8412 164th St
1 Irving Pl
1622 McDonald Ave
255 Conklin Ave
22011 Hempstead Ave
7909 Queens Blvd
11820 Queens Blvd
1027 Atlantic Ave
1068 Utica Ave
1002 Clintonville St
1002 Clintonville St
1156 Hempstead Tpke
Route 49
10007 Rockaway Blvd
12694 Willets Point Blvd
343 James St

Now, I want to use the previous expression in my Java code.

String exp = "//*[@slot=\"address\"]/normalize-space(.)";
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile(exp);
Object result = expr.evaluate(doc, XPathConstants.NODESET);

But the last line throws an Exception:

Cannot convert XPath value to Java object: required class is org.w3c.dom.NodeList; supplied value has type xs:string

Obviously, I should change XPathConstants.NODESET to something else; I tried XPathConstants.STRING, but it only returns the first element of the sequence.

How can I obtain something like an array of Strings?

Thanks in advance.

Your expression works in XPath 2.0, but is illegal in XPath 1.0 (which is used in Java); in XPath 1.0 it would have to be written as normalize-space(//*[@slot='address']).

Anyway, in XPath 1.0, when normalize-space() is called on a node-set, only the first node (in document order) is taken.

In order to do what you want to do, you'll need to use an XPath 2.0 compatible processor, or traverse the resulting node-set and call normalize-space() on every node:

XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr;

String select = "//*[@slot='address']";
expr = xpath.compile(select);
NodeList result = (NodeList)expr.evaluate(input, XPathConstants.NODESET);

String normalize = "normalize-space(.)";
expr = xpath.compile(normalize);

int length = result.getLength();
for (int i = 0; i < length; i++) {
    System.out.println(expr.evaluate(result.item(i), XPathConstants.STRING));
}

...outputs exactly your given output.
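The "select the nodes first, then normalize each one" approach can also be sketched in Python with the standard library, whose ElementTree supports only a small XPath subset. The normalized_texts helper is a hypothetical name. One assumption to note: str.split() treats a few more characters as whitespace than XPath's normalize-space() does, which only matters for unusual input.

```python
import xml.etree.ElementTree as ET


def normalized_texts(root, slot):
    """Return the whitespace-normalized text of every element with @slot."""
    nodes = root.findall(".//*[@slot='%s']" % slot)
    # Join all descendant text, then collapse runs of whitespace and trim.
    return [" ".join("".join(node.itertext()).split()) for node in nodes]


doc = ET.fromstring(
    '<html><body>'
    '<p slot="address">  144   Hempstead\n Tpke </p>'
    '<p slot="address"><b>403</b> West St</p>'
    '</body></html>'
)
print(normalized_texts(doc, "address"))
```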

HTML5 Cross-domain XmlHttpRequest vs. Older XmlHttpRequests

4 votes

I've heard a lot of talk about easy cross-domain XmlHttpRequest methods with the new HTML5 JS XHR techniques. Given the standard JavaScript XHR code below...

  var xhr=new XMLHttpRequest();
  xhr.open("GET",url,false);
  xhr.send();
  var output=xhr.responseXML;

...what would be the equivalent HTML5 XHR cross-domain-enabled code that would give the same output?

There's nothing different from the JS perspective. The cross-domain authorization is handled by the browser on the HTTP level using CORS, so your server has to support cross-domain negotiation.
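What "the server has to support it" can look like is sketched below with Python's built-in http.server. This is a minimal illustration, not production code: the allow-list, the handler name, and the response body are assumptions. The key point is that the server echoes an allowed Origin back in the Access-Control-Allow-Origin response header, and the browser then lets the unchanged XHR code read the response.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical allow-list of origins that may read responses cross-domain.
ALLOWED_ORIGINS = {"http://example.com"}


def cors_headers(origin):
    """Return the CORS response headers for a request's Origin value."""
    if origin in ALLOWED_ORIGINS:
        return {"Access-Control-Allow-Origin": origin, "Vary": "Origin"}
    return {}  # no CORS headers: the browser blocks cross-domain reads


class XmlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<?xml version='1.0'?><ok/>"
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        for name, value in cors_headers(self.headers.get("Origin")).items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To try it locally (blocks until interrupted):
#   HTTPServer(("localhost", 8000), XmlHandler).serve_forever()
```

A simple GET like the one in the question needs nothing more; requests with custom headers or other methods additionally trigger a preflight OPTIONS request that the server must answer.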

How to Generate an XML File from a set of XPath Expressions?

4 votes

I want to be able to generate a complete XML file, given a set of XPath mappings.

The input could be specified in two mappings: (1) one which lists the XPath expressions and values; and (2) another which defines the appropriate namespaces.

/create/article[1]/id                 => 1
/create/article[1]/description        => bar
/create/article[1]/name[1]            => foo
/create/article[1]/price[1]/amount    => 00.00
/create/article[1]/price[1]/currency  => USD
/create/article[2]/id                 => 2
/create/article[2]/description        => some name
/create/article[2]/name[1]            => some description
/create/article[2]/price[1]/amount    => 00.01
/create/article[2]/price[1]/currency  => USD

For namespaces:

/create               => xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'
/create/article       => xmlns:ns1='http://predic8.com/material/1/'
/create/article/price => xmlns:ns1='http://predic8.com/common/1/'
/create/article/id    => xmlns:ns1='http://predic8.com/material/1/'

It is also important that I handle XPath attribute expressions, such as:

/create/article/@type => richtext

The final output should then look something like:

<ns1:create xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'>
    <ns1:article xmlns:ns1='http://predic8.com/material/1/' type='richtext'>
        <name>foo</name>
        <description>bar</description>
        <ns1:price xmlns:ns1='http://predic8.com/common/1/'>
            <amount>00.00</amount>
            <currency>USD</currency>
        </ns1:price>
        <ns1:id xmlns:ns1='http://predic8.com/material/1/'>1</ns1:id>
    </ns1:article>
    <ns1:article xmlns:ns1='http://predic8.com/material/2/' type='richtext'>
        <name>some name</name>
        <description>some description</description>
        <ns1:price xmlns:ns1='http://predic8.com/common/2/'>
            <amount>00.01</amount>
            <currency>USD</currency>
        </ns1:price>
        <ns1:id xmlns:ns1='http://predic8.com/material/2/'>2</ns1:id>
    </ns1:article>
</ns1:create>

PS: This is a more detailed version of a previous question; due to a series of further requirements and clarifications, I was advised to ask a broader question to address my needs.

Note also that I am implementing this in Java, so either a Java-based or an XSLT-based solution would be perfectly acceptable. Thanks.

Further note: I am really looking for a generic solution. The XML shown above is just an example.

This problem has an easy solution if one builds upon the solution of the previous problem:

<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema"
     xmlns:my="my:my">
     <xsl:output omit-xml-declaration="yes" indent="yes"/>

     <xsl:key name="kNSFor" match="namespace" use="@of"/>
     <xsl:variable name="vStylesheet" select="document('')"/>

     <xsl:variable name="vPop" as="element()*">
        <item path="/create/article/@type">richtext</item>
        <item path="/create/article/@lang">en-us</item>
        <item path="/create/article[1]/id">1</item>
        <item path="/create/article[1]/description">bar</item>
        <item path="/create/article[1]/name[1]">foo</item>
        <item path="/create/article[1]/price[1]/amount">00.00</item>
        <item path="/create/article[1]/price[1]/currency">USD</item>
        <item path="/create/article[1]/price[2]/amount">11.11</item>
        <item path="/create/article[1]/price[2]/currency">AUD</item>
        <item path="/create/article[2]/id">2</item>
        <item path="/create/article[2]/description">some name</item>
        <item path="/create/article[2]/name[1]">some description</item>
        <item path="/create/article[2]/price[1]/amount">00.01</item>
        <item path="/create/article[2]/price[1]/currency">USD</item>

        <namespace of="create" prefix="ns1:"
                   url="http://predic8.com/wsdl/material/ArticleService/1/"/>
        <namespace of="article" prefix="ns1:"
                   url="xmlns:ns1='http://predic8.com/material/1/"/>
        <namespace of="@lang" prefix="xml:"
                   url="http://www.w3.org/XML/1998/namespace"/>
        <namespace of="price" prefix="ns1:"
                   url="xmlns:ns1='http://predic8.com/material/1/"/>
        <namespace of="id" prefix="ns1:"
                   url="xmlns:ns1='http://predic8.com/material/1/"/>
     </xsl:variable>

     <xsl:template match="/">
      <xsl:sequence select="my:subTree($vPop/@path/concat(.,'/',string(..)))"/>
     </xsl:template>

     <xsl:function name="my:subTree" as="node()*">
      <xsl:param name="pPaths" as="xs:string*"/>

      <xsl:for-each-group select="$pPaths" group-adjacent=
            "substring-before(substring-after(concat(., '/'), '/'), '/')">
        <xsl:if test="current-grouping-key()">
         <xsl:choose>
           <xsl:when test=
              "substring-after(current-group()[1], current-grouping-key())">

             <xsl:variable name="vLocal-name" select=
              "substring-before(concat(current-grouping-key(), '['), '[')"/>

             <xsl:variable name="vNamespace"
                           select="key('kNSFor', $vLocal-name, $vStylesheet)"/>


             <xsl:choose>
              <xsl:when test="starts-with($vLocal-name, '@')">
               <xsl:attribute name=
                 "{$vNamespace/@prefix}{substring($vLocal-name,2)}"
                    namespace="{$vNamespace/@url}">
                 <xsl:value-of select=
                  "substring(
                       substring-after(current-group(), current-grouping-key()),
                       2
                             )"/>
               </xsl:attribute>
              </xsl:when>
              <xsl:otherwise>
               <xsl:element name="{$vNamespace/@prefix}{$vLocal-name}"
                          namespace="{$vNamespace/@url}">

                    <xsl:sequence select=
                     "my:subTree(for $s in current-group()
                                  return
                                     concat('/',substring-after(substring($s, 2),'/'))
                                   )
                     "/>
                 </xsl:element>
              </xsl:otherwise>
             </xsl:choose>
           </xsl:when>
           <xsl:otherwise>
            <xsl:value-of select="current-grouping-key()"/>
           </xsl:otherwise>
         </xsl:choose>
         </xsl:if>
      </xsl:for-each-group>
     </xsl:function>
</xsl:stylesheet>

When this transformation is applied on any XML document (not used), the wanted, correct result is produced:

<ns1:create xmlns:ns1="http://predic8.com/wsdl/material/ArticleService/1/">
   <ns1:article xmlns:ns1="xmlns:ns1='http://predic8.com/material/1/" type="richtext"
                xml:lang="en-us"/>
   <ns1:article xmlns:ns1="xmlns:ns1='http://predic8.com/material/1/">
      <ns1:id>1</ns1:id>
      <description>bar</description>
      <name>foo</name>
      <ns1:price>
         <amount>00.00</amount>
         <currency>USD</currency>
      </ns1:price>
      <ns1:price>
         <amount>11.11</amount>
         <currency>AUD</currency>
      </ns1:price>
   </ns1:article>
   <ns1:article xmlns:ns1="xmlns:ns1='http://predic8.com/material/1/">
      <ns1:id>2</ns1:id>
      <description>some name</description>
      <name>some description</name>
      <ns1:price>
         <amount>00.01</amount>
         <currency>USD</currency>
      </ns1:price>
   </ns1:article>
</ns1:create>

Explanation:

  1. A reasonable assumption is made that throughout the generated document any two elements with the same local-name() belong to the same namespace -- this covers the predominant majority of real-world XML documents.

  2. The namespace specifications follow the path specifications. A namespace specification has the form: <namespace of="target element's local-name" prefix="wanted prefix" url="namespace-uri"/>

  3. Before generating an element with xsl:element, the appropriate namespace specification is selected using an index created by an xsl:key. From this namespace specification the values of its prefix and url attributes are used in specifying in the xsl:element instruction the values of the full element name and the element's namespace-uri.
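For readers who prefer a procedural version, the core idea (walk each path, create missing elements, then set text or an attribute) can be sketched in Python with the standard library. This is a deliberately reduced sketch, not a port of the stylesheet above: it ignores namespaces entirely, supports only name and name[n] steps plus a final @attr step, and treats a step without a predicate as [1], so /create/article/@type lands on the first article rather than on a separate one. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# One location step: an element name with an optional positional predicate.
STEP = re.compile(r"^(?P<name>[^\[\]@]+)(?:\[(?P<index>\d+)\])?$")


def parse_step(step):
    match = STEP.match(step)
    if match is None:
        raise ValueError("unsupported step: %r" % step)
    return match.group("name"), int(match.group("index") or 1)


def build_tree(mappings):
    """Build an element tree from (path, value) pairs."""
    root = None
    for path, value in mappings:
        steps = path.strip("/").split("/")
        root_name, _ = parse_step(steps[0])
        if root is None:
            root = ET.Element(root_name)
        elif root.tag != root_name:
            raise ValueError("all paths must share one root element")
        node = root
        for step in steps[1:]:
            if step.startswith("@"):
                node.set(step[1:], value)  # attribute step ends the path
                break
            name, index = parse_step(step)
            siblings = node.findall(name)
            while len(siblings) < index:  # create missing siblings up to [index]
                siblings.append(ET.SubElement(node, name))
            node = siblings[index - 1]
        else:
            node.text = value  # no attribute step: the value is element text
    return root


tree = build_tree([
    ("/create/article[1]/id", "1"),
    ("/create/article[1]/price[1]/amount", "00.00"),
    ("/create/article[2]/id", "2"),
    ("/create/article/@type", "richtext"),
])
print(ET.tostring(tree, encoding="unicode"))
```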

Parsing huge, badly encoded XML files in Python

4 votes

I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream because loading them into memory is much too inefficient and often leads to OutOfMemory troubles.

I have used the libraries miniDOM, ElementTree, cElementTree and I am currently using lxml. Right now I have a working, pretty memory-efficient script, using lxml.etree.iterparse. The problem is that some of the XML files I need to parse contain encoding errors (they advertise as UTF-8, but contain differently encoded characters). When using lxml.etree.parse this can be fixed by using the recover=True option of a custom parser, but iterparse does not accept a custom parser. (see also: this question)

My current code looks like this:

from lxml import etree
events = ("start", "end")
context = etree.iterparse(xmlfile, events=events)
event, root_element = next(context) # <items>
for action, element in context:
    if action == 'end' and element.tag == 'item':
        # <parse>
        root_element.clear()

Error when iterparse encounters a bad character (in this case, it's a ^Y):

lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0x19 0x73 0x20 0x65, line 949490, column 25

I don't even wish to decode this data, I can just drop it. However I don't know any way to skip the element - I tried context.next and continue in try/except statements.

Any help would be appreciated!

Update

Some additional info: This is the line where iterparse fails:

<description><![CDATA:[musea de la photographie fonds mercator. Met meer dan 80.000 foto^Ys en 3 miljoen negatieven is het Muse de la...]]></description>

According to etree, the error occurs at bytes 0x19 0x73 0x20 0x65.
According to hexedit, 19 73 20 65 translates to ASCII .s e
The . in this place should be an apostrophe (foto's).

I also found this question, which does not provide a solution.

Since the problem is being caused by illegal XML characters, in this case the 0x19 byte, I decided to strip them off. I found the following regular expression on this site:

import re
import urllib.request

# Control bytes to strip; tab (0x09), LF (0x0A) and CR (0x0D) are kept.
invalid_xml = re.compile(b'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')

And I wrote this piece of code that removes illegal bytes while saving an XML feed:

conn = urllib.request.urlopen(xmlfeed)
xmlfile = open('output', 'wb')

while True:
    data = conn.read(4096)
    if data:
        newdata, count = invalid_xml.subn(b'', data)
        if count > 0:
            print('Removed %s illegal characters from XML feed' % count)
        xmlfile.write(newdata)
    else:
        break

xmlfile.close()
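If writing a cleaned copy to disk first is undesirable, the same byte filter can be wrapped around the input stream and handed straight to an incremental parser. The sketch below uses the standard library's xml.etree.ElementTree.iterparse rather than lxml, and the CleanedStream class is a hypothetical helper; whether a given lxml version accepts such a file-like wrapper in the same way is an assumption to verify.

```python
import io
import re
import xml.etree.ElementTree as ET

# Same control bytes as above; tab, LF and CR are left alone.
INVALID_XML_BYTES = re.compile(b"[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]")


class CleanedStream:
    """File-like wrapper that drops illegal control bytes while reading."""

    def __init__(self, raw):
        self._raw = raw

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            if not chunk:
                return chunk  # real end of input
            cleaned = INVALID_XML_BYTES.sub(b"", chunk)
            if cleaned:
                return cleaned
            # The chunk held only illegal bytes: read on so that an empty
            # result is never mistaken for end of file.


data = b"<items><item>foto\x19s</item><item>ok</item></items>"
texts = [el.text
         for _, el in ET.iterparse(CleanedStream(io.BytesIO(data)))
         if el.tag == "item"]
print(texts)
```

Because the filter works on single bytes, it is safe across chunk boundaries; it does not repair multi-byte sequences that are invalid UTF-8.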

Count the leaves of an XML string with C#

4 votes

Is there a simple way to get the number of all leaves of an XML string (XML document is provided as a string) with C#?

XDocument xDoc = XDocument.Parse(xml);
var count = xDoc.Descendants().Where(n => !n.Elements().Any()).Count();

or as @sixlettervariables suggested

var count = xDoc.Descendants().Count(e => !e.HasElements);
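The same count can be sketched in Python with the standard library, where an element is a leaf when it has no child elements. The count_leaves name is hypothetical. Like Descendants() on an XDocument, root.iter() includes the root element itself, so a document consisting of a single empty element counts as one leaf.

```python
import xml.etree.ElementTree as ET


def count_leaves(xml_text):
    """Count elements that have no child elements."""
    root = ET.fromstring(xml_text)
    # len(element) is the number of direct child elements.
    return sum(1 for element in root.iter() if len(element) == 0)


print(count_leaves("<a><b/><c><d/><e>text</e></c></a>"))
```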

Content MathML vs. OpenMath for model exchange

4 votes

In my research group, we have different people doing algebraic modelling in different symbolic tools such as Symbolic Toolbox in Matlab and Sympy in Python. These models are then typically exported to C-code and copy-pasted-adapted into our own symbolic C++-based tools for further symbolic manipulation.

When looking for an alternative to this hardly maintainable approach I found two formats that looked more or less standardized: OpenMath and "Content MathML". Note that we are only interested in the semantics, no pretty printing.

What is the relation between these two formats? Can both be used to store and exchange mathematical expressions between tools?

Are there other more or less standardized exchange formats for mathematical expressions?

The formats are very closely related (and were defined at roughly the same time by an overlapping set of people; I'm an editor of both the MathML and OpenMath specs, for example). In the current version of Content MathML (MathML 3) this is formalised far more than in earlier versions, and all the MathML content elements are given semantics in terms of OpenMath symbols. So formally the only difference is syntax: Content MathML has a "strict" subset that is a formal encoding of OpenMath, plus a set of convenience elements that are given formal rewrite rules to the OpenMath-equivalent subset.

Apart from the syntax of the expressions themselves, if you are straying away from the fixed set of operators pre-defined in MathML, you need some way of recording definitions, and here both OpenMath and MathML use the same OpenMath "Content Dictionary" format.

Editing a large xml file 'on the fly'

4 votes

I've got an xml file stored in a database blob which a user will download via a spring/hibernate web application. After it's retrieved via Hibernate as a byte[] but before it's sent to the output stream I need to edit some parts of the XML (a single node with two child nodes and an attribute).

My concern is if the files are larger (some are 40mb+) then I don't really want to do this by having the whole file in memory, editing it and then passing it to the user via the output stream. Is there a way to edit it 'on the fly' ?

byte[] b = blobRepository.get(blobID).getFile();
// What can I do here?
ServletOutputStream out = response.getOutputStream();
out.write(b);

You can use a SAX stream.

Parse the file using the SAX framework, and as your Handler receives the SAX events, pass the unchanged items back out to a SAX Handler that constructs XML output.

When you get to the "part to be changed", then your intermediary class would read in the unwanted events, and write out the wanted events.

This has the advantage of not holding the entire file in memory as an intermediate representation (say DOM); however, if the transformation is complex, you might have to cache a number of items (sections of the document) in order to have them available for rearranged output. A complex enough transformation (one that could do anything) eventually turns into the overhead of DOM, but if you know you're ignoring a large portion of your document, you can save a lot of memory.
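The pass-through idea can be sketched with Python's xml.sax module, which mirrors the Java SAX API closely (XMLFilterBase plays the role of Java's XMLFilterImpl). The SetAttributeFilter class, the rewrite function, and the node/id names are hypothetical; the sketch changes one attribute on one element type and forwards every other event untouched, so memory use stays flat regardless of document size.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class SetAttributeFilter(XMLFilterBase):
    """Forward all SAX events, overriding one attribute on one element."""

    def __init__(self, parent, target, attr, new_value):
        super().__init__(parent)
        self._target = target
        self._attr = attr
        self._new_value = new_value

    def startElement(self, name, attrs):
        if name == self._target:
            updated = dict(attrs.items())
            updated[self._attr] = self._new_value
            attrs = AttributesImpl(updated)
        super().startElement(name, attrs)  # pass the event downstream


def rewrite(source, sink, target, attr, new_value):
    """Stream XML from `source` to `sink`, editing events on the way."""
    reader = SetAttributeFilter(xml.sax.make_parser(), target, attr, new_value)
    reader.setContentHandler(XMLGenerator(sink, encoding="utf-8"))
    reader.parse(source)


out = io.StringIO()
rewrite(io.BytesIO(b'<root><node id="1"><a/><b/></node></root>'),
        out, "node", "id", "2")
print(out.getvalue())
```

Replacing child nodes works the same way: override the relevant start/end/characters callbacks, swallow the events to be dropped, and emit the replacement events instead.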

Why reference strings from an external resource file instead of hard-coding it to your Android XML layout?

4 votes

I keep getting warnings in Eclipse for hard-coding strings in my Android XML layout, but I think it makes more sense than putting everything in a string resource file and referencing it from there. I'm only going to use the said strings for that Activity anyway, and never again.

Are there any dangers in this kind of practice, like maybe initialization errors or performance issues, that I am overlooking? Why does Android encourage using a separate resource file?

The main reason is for internationalization. Putting strings in resource files makes it much easier to provide separate translations of each string for different languages, without having to copy your layout files.

Serialize a private backing data member in .NET 2.0?

4 votes

I am writing a small XML config file that will be saved to and loaded from a specific location (so no using user.config). My application is .NET 2.0 and cannot be moved to a newer version (so no DataContractSerializer). I am required to implement a "Save Password" option so the password field will be pre-filled when the user uses the app.

Here is how I currently do it:

public class UserSettings
{
    //Snip many other properties...

    public bool SavePassword { get; set; }

    [XmlIgnore]
    public string Password
    {
        get
        {
            string retVal = string.Empty;
            if (ProtectedPassword != null)
            {
                try
                {
                    retVal = Encoding.UTF8.GetString(ProtectedData.Unprotect(ProtectedPassword, _md5.ComputeHash(Encoding.UTF8.GetBytes(this.Username.ToUpper())), DataProtectionScope.LocalMachine));
                }
                catch
                {
                    retVal = string.Empty;
                }
            }
            return retVal;
        }
        set
        {
            ProtectedPassword = ProtectedData.Protect(Encoding.UTF8.GetBytes(value), _md5.ComputeHash(Encoding.UTF8.GetBytes(this.Username.ToUpper())), DataProtectionScope.LocalMachine);
        }
    }

    public byte[] ProtectedPassword;

    private readonly MD5 _md5 = MD5.Create();


    public void Save()
    {
        var xOver = new XmlAttributeOverrides();

        //If Save password is false do not store the encrypted password
        if (this.SavePassword == false)
        {
            var xAttrs = new XmlAttributes();
            xAttrs.XmlIgnore = true;
            xOver.Add(typeof(UserSettings), "ProtectedPassword", xAttrs);
        }

        XmlSerializer xSer = new XmlSerializer(typeof(UserSettings), xOver);
        Directory.CreateDirectory(Path.GetDirectoryName(savePath));
        using(var fs = new FileStream(savePath, FileMode.Create))
        {
            xSer.Serialize(fs, this);
        }

    }
}

I would like to make ProtectedPassword non-public; however, if I set it to anything other than public, xSer.Serialize(fs, this) will not include the field. What do I need to do to make this work correctly?

I know there are many similar questions, but none of them have the .NET 2.0 requirement, and they use solutions that are not available to someone limited to 2.0. Is there any option other than writing a custom XmlSerializer or living with the fact that ProtectedPassword is public?

As far as I know, the only way to get this done in .NET 2.0 would be to write a custom implementation of IXmlSerializable.

That said, if the config file does not need to be human readable/editable, I would recommend using the BinaryFormatter to perform a binary serialization, which would capture the private members.