Best xml questions in May 2012

Can you use antixml to create xml documents?

8 votes

There are a few examples of using Anti-Xml to extract information from XML documents, but none that I could find of using it to create them. Does Anti-Xml support creating documents, or should I use another library for this (and if so, which one)? Does anyone have an example of creating an XML document with Anti-Xml?

Yes, you can build (and serialize) XML documents:

import com.codecommit.antixml._

val doc = Elem(None, "doc", Attributes(), Map(), Group(
  Elem(None, "foo", Attributes("id" -> "bar"), Map(), Group(Text("baz")))
))

val writer = new java.io.StringWriter
val serializer = new XMLSerializer("UTF-8", true)

serializer.serializeDocument(doc, writer)

You can also use Anti-XML's zippers to do some interesting editing tricks:

val foos = doc \ "foo"
val newFoo = foos.head.copy(children = Group(Text("new text!")))
val newDoc = foos.updated(0, newFoo).unselect

Now newDoc contains the edited document:

scala> newDoc.toString
res1: String = <doc><foo id="bar">new text!</foo></doc>

The Zipper that doc \ "foo" returns is different from a NodeSeq in that it carries information about its context, which allows you to "undo" the selection operation done by \.


Update in response to ziggystar's comment below: if you want something like Scala's XML literals, you can just use convert on any scala.xml.Elem:

val test: com.codecommit.antixml.Elem = <test></test>.convert

I'd assumed the question was about programmatic creation.

Deserialize XML element presence to bool in C#

6 votes

I'm trying to deserialize some XML from a web service into C# POCOs. I've got this working for most of the properties I need. However, I need to set a bool property based on whether an element is present, and I can't see how to do this.

An example XML snippet:

<someThing test="true">
    <someThingElse>1</someThingElse>
    <target/>
</someThing>

An example C# class:

[Serializable, XmlRoot("someThing")]
public class Something
{
    [XmlAttribute("test")]
    public bool Test { get; set; }

    [XmlElement("someThingElse")]
    public int Else { get; set; }

    /// <summary>
    /// <c>true</c> if target element is present,
    /// otherwise, <c>false</c>.
    /// </summary>   
    [XmlElement("target")]
    public bool Target { get; set; }
}

This is a very simplified example of the actual XML and object hierarchy I'm processing, but demonstrates what I'm trying to achieve.

All the other questions I've read related to deserializing null/empty elements seem to involve using Nullable<T>, which doesn't do what I need.

Does anyone have any ideas?

One way to do it is to map the element to a separate property, then have the Target property report whether that element exists:

[XmlElement("target", IsNullable = true)]
public string TempProperty { get; set; }

[XmlIgnore]
public bool Target
{
    get
    {
        return this.TempProperty != null;
    }
}

Even when the element is empty, TempProperty is not null, so Target returns true whenever <target/> is present and false when it is absent.

Xml signature is invalidated on adding a c14n exclusive transform

6 votes

This is my code to generate an XML signature:

DOMSignContext dsc = new DOMSignContext
  (prk, xmldoc.getDocumentElement());

XMLSignatureFactory fac = 
  XMLSignatureFactory.getInstance("DOM");   

  DigestMethod digestMethod = 
      fac.newDigestMethod("http://www.w3.org/2000/09/xmldsig#sha1", null);
  C14NMethodParameterSpec spec = null;
  CanonicalizationMethod cm = fac.newCanonicalizationMethod(
      "http://www.w3.org/2001/10/xml-exc-c14n#",spec);
  SignatureMethod sm = fac.newSignatureMethod( 
      "http://www.w3.org/2000/09/xmldsig#rsa-sha1",null);
  ArrayList transformList = new ArrayList();
  TransformParameterSpec transformSpec = null;
  Transform envTransform =   fac.newTransform("http://www.w3.org/2000/09/xmldsig#enveloped-signature",transformSpec);
  Transform exc14nTransform = fac.newTransform(
      "http://www.w3.org/2001/10/xml-exc-c14n#",transformSpec);
transformList.add(exc14nTransform); 
transformList.add(envTransform);

 Reference ref = fac.newReference("",digestMethod,transformList,null,null);
 ArrayList refList = new ArrayList();
 refList.add(ref);
 SignedInfo si =fac.newSignedInfo(cm,sm,refList);

This gives reference validation as false and core validity as false. But when I remove the exc14nTransform variable, i.e. fac.newTransform("http://www.w3.org/2001/10/xml-exc-c14n#", transformSpec), and run the following code:

DOMSignContext dsc = new DOMSignContext
  (prk, xmldoc.getDocumentElement());

XMLSignatureFactory fac = 
  XMLSignatureFactory.getInstance("DOM");   

  DigestMethod digestMethod = 
      fac.newDigestMethod("http://www.w3.org/2000/09/xmldsig#sha1", null);
  C14NMethodParameterSpec spec = null;
  CanonicalizationMethod cm = fac.newCanonicalizationMethod(
      "http://www.w3.org/2001/10/xml-exc-c14n#",spec);
  SignatureMethod sm = fac.newSignatureMethod( 
      "http://www.w3.org/2000/09/xmldsig#rsa-sha1",null);
  ArrayList transformList = new ArrayList();
  TransformParameterSpec transformSpec = null;
  Transform envTransform = fac.newTransform(
      "http://www.w3.org/2000/09/xmldsig#enveloped-signature",transformSpec);
 transformList.add(envTransform);
 Reference ref = fac.newReference("",digestMethod,transformList,null,null);
 ArrayList refList = new ArrayList();
 refList.add(ref);
 SignedInfo si =fac.newSignedInfo(cm,sm,refList);

This gives both core validity and reference validity as true. Why is this happening? I got this code from this link (code fragment 2 in the "creating enveloped signature" section).

Actually, the c14n transform should be performed after the enveloped-signature transform. The document being signed also contains the signature element, so the signature has to be separated out first, and only then should the remaining content (the part actually being signed) be canonicalized. The order should be:

transformList.add(envTransform);
transformList.add(exc14nTransform);
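
To make the ordering concrete, here is a minimal, self-contained sketch that signs a small document with the enveloped transform first and exclusive c14n second, then validates the result. It is an illustration under assumptions, not the asker's code: the class name EnvelopedSignatureDemo, the throwaway RSA key pair, and the sample document are made up, and SHA-256 is swapped in for the SHA-1 algorithms used in the question because newer JDKs may reject SHA-1 during validation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.crypto.dsig.CanonicalizationMethod;
import javax.xml.crypto.dsig.DigestMethod;
import javax.xml.crypto.dsig.Reference;
import javax.xml.crypto.dsig.SignedInfo;
import javax.xml.crypto.dsig.Transform;
import javax.xml.crypto.dsig.XMLSignature;
import javax.xml.crypto.dsig.XMLSignatureFactory;
import javax.xml.crypto.dsig.dom.DOMSignContext;
import javax.xml.crypto.dsig.dom.DOMValidateContext;
import javax.xml.crypto.dsig.spec.C14NMethodParameterSpec;
import javax.xml.crypto.dsig.spec.TransformParameterSpec;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical demo class: signs and validates with a throwaway key pair.
class EnvelopedSignatureDemo {

    static boolean signAndValidate(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for XML signatures
            Document doc = dbf.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
            kpg.initialize(2048);
            KeyPair keys = kpg.generateKeyPair();

            XMLSignatureFactory fac = XMLSignatureFactory.getInstance("DOM");

            // Order matters: drop the Signature element first, canonicalize second.
            List<Transform> transforms = new ArrayList<Transform>();
            transforms.add(fac.newTransform(Transform.ENVELOPED, (TransformParameterSpec) null));
            transforms.add(fac.newTransform(CanonicalizationMethod.EXCLUSIVE, (TransformParameterSpec) null));

            // SHA-256 is an assumption of this sketch (the question uses SHA-1).
            Reference ref = fac.newReference("",
                    fac.newDigestMethod(DigestMethod.SHA256, null), transforms, null, null);
            SignedInfo si = fac.newSignedInfo(
                    fac.newCanonicalizationMethod(CanonicalizationMethod.EXCLUSIVE,
                            (C14NMethodParameterSpec) null),
                    fac.newSignatureMethod("http://www.w3.org/2001/04/xmldsig-more#rsa-sha256", null),
                    Collections.singletonList(ref));

            // Appends the Signature element as the last child of the root element.
            XMLSignature signature = fac.newXMLSignature(si, null);
            signature.sign(new DOMSignContext(keys.getPrivate(), doc.getDocumentElement()));

            // Validate against the public key of the same pair.
            Node sigNode = doc.getElementsByTagNameNS(XMLSignature.XMLNS, "Signature").item(0);
            DOMValidateContext ctx = new DOMValidateContext(keys.getPublic(), sigNode);
            return fac.unmarshalXMLSignature(ctx).validate(ctx);
        } catch (Exception e) {
            throw new IllegalStateException("signing or validation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("core validity: " + signAndValidate("<doc><item>1</item></doc>"));
    }
}
```

Swapping the two transforms.add lines reproduces the ordering from the question, which is a quick way to compare the two cases on your own JDK.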

How to preserve XML nodes that are not bound to an object when using SAX for parsing

5 votes

I am working on an android app which interfaces with a bluetooth camera. For each clip stored on the camera we store some fields about the clip (some of which the user can change) in an XML file.

Currently this app is the only app writing this xml data to the device but in the future it is possible a desktop app or an iphone app may write data here too. I don't want to make an assumption that another app couldn't have additional fields as well (especially if they had a newer version of the app which added new fields this version didn't support yet).

So what I want to prevent is a situation where we add new fields to this XML file in another application, and then the user goes to use the Android app and it wipes out those other fields because it doesn't know about them.

So let's take a hypothetical example:

<data>
  <title>My Title</title>
  <date>12/24/2012</date>
  <category>Blah</category>
</data>

When read from the device this would get translated to a Clip object that looks like this (simplified for brevity)

public class Clip {
  public String title, category;
  public Date date;
}

So I'm using SAX to parse the data and store it to a Clip. I simply store the characters in StringBuilder and write them out when I reach the end element for title,category and date.

I realized though that when I write this data back to the device, if there were any other tags in the original document they would not get written because I only write out the fields I know about.

This makes me think that maybe SAX is the wrong option and perhaps I should use DOM or something else where I could more easily write out any other elements that existed originally.

Alternatively I was thinking maybe my Clip class contains an ArrayList of some generic XML type (maybe DOM), and in startTag I check if the element is not one of the predefined tags, and if so, until I reach the end of that tag I store the whole structure (but in what?).. Then upon writing back out I would just go through all of the additional tags and write them out to the xml file (along with the fields I know about of course)

Is this a common problem with a good known solution?

-- Update 5/22/12 --

I didn't mention that the root node of the actual XML (actually called annotation) carries a version number, which has been set to 1. For the short term, I'm going to require that the version number my app supports is >= the version number of the XML data. If the XML has a greater version number, I will still attempt to parse it for reading but will deny any saves to the model. I'm still interested in any kind of working example of how to do this, though.

BTW, I thought of another solution that should be pretty easy: use XPath to find the nodes I know about and replace their content when the data is updated. However, I ran some benchmarks, and the overhead of parsing the XML into memory is absurd. The parsing operation alone, without any lookups, was 20 times slower than SAX, and using XPath was between 30 and 50 times slower in general, which is really bad considering I parse these in a list view. So my idea is to keep SAX for parsing the nodes into clips, but store the entire XML in a variable of the Clip class (remember, this XML is short, less than 2 KB). Then, when I write the data back out, I can use XPath to replace the nodes that I know about in the original XML.

Still interested in any other solutions though. I probably won't accept a solution though unless it includes some code examples.

Here's how you can go about it with SAX filters:

  1. When you read your document with SAX you record all the events. You record them and bubble them up further to the next level of SAX reader. You basically stack together two layers of SAX readers (with XMLFilter) - one will record and relay, and the other one is your current SAX handler that creates objects.
  2. When you're ready to write your modifications back to disk you fire up the recorded SAX events layered with your writer that would overwrite those values/nodes you have altered.

I spent some time with the idea and it worked. It basically came down to proper chaining of XMLFilters. Here's what the unit test looks like; your code would do something similar:

final SAXParserFactory factory = SAXParserFactory.newInstance();
final SAXParser parser = factory.newSAXParser();

final RecorderProxy recorder = new RecorderProxy(parser.getXMLReader());
final ClipHolder clipHolder = new ClipHolder(recorder);

clipHolder.parse(new InputSource(new StringReader(srcXml)));

assertTrue(recorder.hasRecordingToReplay());

final Clip clip = clipHolder.getClip();
assertNotNull(clip);
assertEquals(clip.title, "My Title");
assertEquals(clip.category, "Blah!");
assertEquals(clip.date, Clip.DATE_FORMAT.parse("12/24/2012"));

clip.title = "My Title Updated";
clip.category = "Something else";

final ClipSerializer serializer = new ClipSerializer(recorder);
serializer.setClip(clip);

final TransformerFactory xsltFactory = TransformerFactory.newInstance();
final Transformer t = xsltFactory.newTransformer();
final StringWriter outXmlBuffer = new StringWriter();

t.transform(new SAXSource(serializer, 
            new InputSource()), new StreamResult(outXmlBuffer));

assertEquals(targetXml, outXmlBuffer.getBuffer().toString());

The important lines are:

  • your SAX events recorder is wrapped around the SAX parser
  • your Clip parser (ClipHolder) is wrapped around the recorder
  • when the XML is parsed, recorder will record everything and your ClipHolder will only look at what it knows about
  • you then do whatever you need to do with the clip object
  • the serializer is then wrapped around the recorder (basically re-mapping it onto itself)
  • you then work with the serializer and it will take care of feeding the recorded events (delegating to the parent and registering itself as a ContentHandler) overlaid with what it has to say about the clip object.

Please find the DVR code and the Clip test over at github. I hope it helps.

p.s. it's not a generic solution and the whole record->replay+overlay concept is very rudimentary in the provided implementation. An illustration basically. If your XML is more complex and gets "hairy" (e.g. same element names on different levels, etc.) then the logic will need to be augmented. The concept will remain the same though.
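
As a smaller illustration of the overlay idea (not the DVR code referenced above), the sketch below chains a single XMLFilterImpl between the parser and an identity Transformer. It swaps the text of known elements and lets everything else pass through untouched. The class name OverlayFilter, the rewrite helper, and the updates map are hypothetical, and the sketch assumes the known fields are text-only direct children of the root element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: replaces the text of known child elements, relays the rest.
class OverlayFilter extends XMLFilterImpl {
    private final Map<String, String> updates; // element name -> new text
    private int depth = 0;
    private boolean replacing = false;

    OverlayFilter(XMLReader parent, Map<String, String> updates) {
        super(parent);
        this.updates = updates;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, localName, qName, atts);
        depth++;
        // Only direct children of the root are treated as known fields.
        if (depth == 2 && updates.containsKey(qName)) {
            char[] text = updates.get(qName).toCharArray();
            super.characters(text, 0, text.length); // emit the new value
            replacing = true;                       // and swallow the old one
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!replacing) {
            super.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth == 2) {
            replacing = false;
        }
        depth--;
        super.endElement(uri, localName, qName);
    }

    // Parses xml through the filter and serializes the filtered event stream.
    static String rewrite(String xml, Map<String, String> updates) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();

            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            identity.transform(
                    new SAXSource(new OverlayFilter(reader, updates),
                            new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("rewrite failed", e);
        }
    }
}
```

Calling OverlayFilter.rewrite(xml, updates) with a map from "title" to a new value changes only the title element; an element the filter does not know about, such as a field added by a newer app, comes out as it went in. Unlike the record-and-replay approach, this version re-parses the original XML text on each save, which is a trade-off rather than an improvement.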

How to parse an xml and extract the xslt element from it in javascript

5 votes

I am sending an XML response from my servlet to my HTML page. I receive it via the responseXML property of the XMLHttpRequest object. My XML document contains an xsl:stylesheet element; I want to extract this element and execute that XSLT code in my JavaScript.
Is it possible? This is my XML:

<samlp:AuthnRequest xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"
    Version="2.0" IssueInstant="2012-05-22T13:40:52:390"
    ProtocolBinding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST"
    AssertionConsumerServiceURL="localhost:8080/consumer.jsp">
<UserID>
   xyz
</UserID>
<testing>
   text
</testing>
<saml:Issuer xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
   http://localhost:8080/saml/SProvider.jsp
</saml:Issuer>
<xslt>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:output method="text" />
<xsl:template match="/">
UserID : <xsl:copy-of select="//UserID"/>
testing : <xsl:copy-of select="//testing"/>
</xsl:template>
</xsl:stylesheet>
</xslt>
</samlp:AuthnRequest>


Once I get this XML string from the Ajax response, I want to convert it into XML, extract the XSLT part, execute it, and show the output in a text area.
EDIT2
What is wrong with this code:

var xmlDoc = xmlhttp.responseXML;
//var xmltext = new XMLSerializer().serializeToString(xmlDoc);
var xsltProcessor = new XSLTProcessor();
var element = xmlDoc.getElementsByTagNameNS("http://www.w3.org/1999/XSL/Transform", "stylesheet");
//document.forms['my']['signature'].value = xmltext;
var stylesheet = xsltProcessor.importStylesheet(element[0]);
var result = xsltProcessor.transformToDocument(xmlDoc);
var xmltext1 = new XMLSerializer().serializeToString(result);
document.forms['my']['signature2'].value = xmltext1;


The output (xmltext1) of the XSLT transformation is:

<transformiix:result xmlns:transformiix="http://www.mozilla.org/TransforMiix">
UserID : 1212

Testing : 1212
</transformiix:result>

But as you can see in the XSLT code, the output method is set to "text". So why are XML tags included in the output?

Answer
This gives the explanation for EDIT2. Thanks for the answers :)

This is what worked for me, though I only tested it in the newest Chrome:

var getNsResolver = function (element) {
  var ns = {
    samlp: 'urn:oasis:names:tc:SAML:2.0:protocol',
    xsl: 'http://www.w3.org/1999/XSL/Transform'
  };

  return function (prefix) {
    return ns[prefix] || null;
  };
};

var handleResponse = function (xhr) {
  var
    doc = xhr.responseXML,
    xsl = doc.evaluate('/samlp:AuthnRequest/xslt/xsl:stylesheet', doc, getNsResolver(doc.documentElement), XPathResult.ANY_TYPE, null).iterateNext(),
    processor = new XSLTProcessor(),
    result;

  processor.importStylesheet(xsl);
  result = processor.transformToFragment(doc, document);

  document.getElementById('foo').value = result.textContent;
};

window.addEventListener('load', function () {
  var request = new XMLHttpRequest();
  request.addEventListener('load', function (evt) {
    handleResponse(request);
  }, false);

  request.open('GET', 'sample.xml', true); // sample.xml contains the xml from the question
  request.send();
}, false);

XML serialization array in C#

5 votes

I'm having trouble figuring this out. I have an XML file that looks like this:

<root>
  <list id="1" title="One">
    <word>TEST1</word>
    <word>TEST2</word>
    <word>TEST3</word>
    <word>TEST4</word>
    <word>TEST5</word>
    <word>TEST6</word>   
  </list>
  <list id="2" title="Two">
    <word>TEST1</word>
    <word>TEST2</word>
    <word>TEST3</word>
    <word>TEST4</word>
    <word>TEST5</word>
    <word>TEST6</word>   
  </list>
</root>

And I'm trying to deserialize it into:

public class Items
{
  [XmlAttribute("id")]
  public string ID { get; set; } 

  [XmlAttribute("title")]
  public string Title { get; set; }   

  //I don't know what to do for this
  [Xml... something]
  public List<string> Words { get; set; }
}

//I don't think this is right either
[XmlRoot("root")]
public class Lists
{
  [XmlArray("list")]
  [XmlArrayItem("word")]
  public List<Items> Get { get; set; }
}

//Deserialize XML to Lists Class
using (Stream s = File.OpenRead("myfile.xml"))
{
   Lists myLists = (Lists) new XmlSerializer(typeof (Lists)).Deserialize(s);
}

I'm really new to XML and XML serialization; any help would be much appreciated.

It should work if you declare your classes as

public class Items
{
    [XmlAttribute("id")]
    public string ID { get; set; }

    [XmlAttribute("title")]
    public string Title { get; set; }

    [XmlElement("word")]
    public List<string> Words { get; set; }
}

[XmlRoot("root")]
public class Lists
{
    [XmlElement("list")]
    public List<Items> Get { get; set; }
}

Using Java to remove all xml attributes from a XML file that match a attribute-name?

5 votes

I am trying to use Java to remove all attributes from an XML file that match a given attribute name. I am stuck at this point. At the bottom of this code I am able to get the attribute value of each node as I loop through, but I can't figure out how to delete the attribute from the node altogether. Any ideas?

import java.io.IOException;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.*;
import org.xml.sax.SAXException;


public class StripAttribute { 

  public static void main(String[] args) { 

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); 
    factory.setNamespaceAware(true); 
    org.w3c.dom.Document doc = null;
    NodeList nodes = null;
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();  
      DocumentBuilder db = dbf.newDocumentBuilder(); 
      doc = db.parse("a.xml");
      nodes = doc.getChildNodes();
    }  catch (IOException e) {
      e.printStackTrace();
    } catch (ParserConfigurationException e) {
      e.printStackTrace();
    } catch (SAXException e) {
      e.printStackTrace();
    }
    for ( int i = 0; i < nodes.getLength(); i++ ) { 
      String id = nodes.item(i).getNodeValue();
      if ( id.equals("siteKey")) {
        Element el = ((Attr) nodes.item(i)).getOwnerElement(); 
        el.removeAttribute(id);
      }
    } 

    Transformer transformer;
    StreamResult result = null;
    try {
      transformer = TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty(OutputKeys.INDENT, "yes");
      result = new StreamResult(new StringWriter()); 
      DOMSource source = new DOMSource(doc); 
      transformer.transform(source, result); 
    } catch (TransformerConfigurationException e) {
      e.printStackTrace();
    } catch (TransformerFactoryConfigurationError e) {
      e.printStackTrace();
    } catch (TransformerException e) {
      e.printStackTrace();
    } 
    String xmlString = result.getWriter().toString(); 
    System.out.println(xmlString); 
  } 
}    

Here is a sample of the XML I want to transform:

https://gist.github.com/2784907

Try:

   for (int i = 0; i < nodes.getLength(); i++) {
     // Compare the attribute's name; getNodeValue() would return its value instead
     String id = nodes.item(i).getNodeName();
     if (id.equals("siteKey")) {
         Element el = ((Attr) nodes.item(i)).getOwnerElement();
         el.removeAttribute(id);
     }
   }

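At first it seemed that the nodes returned by the query were detached from the document, because getParentNode returns null. They are not detached: an attribute node has no parent, and its element is reached through getOwnerElement instead. I updated the code accordingly.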

I found an article that says that the nodes returned by XPathExpression are still attached to the document.

Your original code plus the above change:

 public static void main(String[] args) throws Exception {

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document doc = null;
    NodeList nodes = null;
    Set<String> ids = null;
    try {
        doc = factory.newDocumentBuilder().parse(new File("d:/a.xml"));

        XPathExpression expr = XPathFactory.newInstance().newXPath()
                .compile("//@siteKey");
        ids = new HashSet<String>();
        nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (ParserConfigurationException e) {
        e.printStackTrace();
    } catch (XPathExpressionException e) {
        e.printStackTrace();
    }

    for (int i = 0; i < nodes.getLength(); i++) {
        // Compare the attribute's name; getNodeValue() would return its value instead
        String id = nodes.item(i).getNodeName();
        if (id.equals("siteKey")) {
            Element el = ((Attr) nodes.item(i)).getOwnerElement();
            el.removeAttribute(id);
        }
    }

    int dupes = 0;
    for (int i = 0; i < nodes.getLength(); i++) {
        String id = nodes.item(i).getNodeValue();
        if (ids.contains(id)) {
            System.out.format("%s is duplicate\n\n", id);
            dupes++;
        } else {
            ids.add(id);
        }
    }

    System.out.format("Total ids = %d\n Total Duplicates = %d\n", ids
            .size(), dupes);

    Transformer transformer;
    StreamResult result = null;
    try {
        transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        result = new StreamResult(new StringWriter());
        DOMSource source = new DOMSource(doc);
        transformer.transform(source, result);
    } catch (TransformerConfigurationException e) {
        e.printStackTrace();
    } catch (TransformerFactoryConfigurationError e) {
        e.printStackTrace();
    } catch (TransformerException e) {
        e.printStackTrace();
    }

    String xmlString = result.getWriter().toString();
    System.out.println(xmlString);

} 

Update:

for (int i = 0; i < nodes.getLength(); i++) {
    Attr attr = (Attr) nodes.item(i);
    Element el = attr.getOwnerElement();
    // removeAttribute expects the attribute's name, not its value
    el.removeAttribute(attr.getName());
}
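
Pulling the pieces together, here is a compact, self-contained sketch of the same technique that works on an in-memory string, so it can be tried without a file on disk. The class name AttributeStripper and its strip helper are made up for illustration, and the sketch assumes the attribute name is a plain, unprefixed name that is safe to paste into an XPath expression.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: removes every attribute with the given name.
class AttributeStripper {

    static String strip(String xml, String attrName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Select the attribute nodes themselves, anywhere in the document.
            NodeList found = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//@" + attrName, doc, XPathConstants.NODESET);

            // Copy the matches first so removal never interferes with iteration.
            List<Attr> attrs = new ArrayList<Attr>();
            for (int i = 0; i < found.getLength(); i++) {
                attrs.add((Attr) found.item(i));
            }
            for (Attr attr : attrs) {
                // An Attr has no parent node; its element is the owner element.
                attr.getOwnerElement().removeAttributeNode(attr);
            }

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not strip attribute " + attrName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(strip("<a siteKey=\"1\"><b siteKey=\"2\" id=\"x\"/></a>", "siteKey"));
    }
}
```

removeAttributeNode is used here instead of removeAttribute(name) simply because the Attr object is already in hand; either call should work for unprefixed attributes.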

How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?

4 votes

There are good answers on SO about how to use readHTMLTable from the XML package and I did that with regular http pages, however I am not able to solve my problem with https pages.

I am trying to read the table on this website (url string):

library(RTidyHTML)
library(XML)
url <- "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048"
h = htmlParse(url)
tables <- readHTMLTable(url)

But I get this error: File https://ned.nih.gov/search/Vi...does not exist.

I tried to get past the https problem with the first two lines below, which I found by googling for a solution (for example here: http://tonybreyal.wordpress.com/2012/01/13/r-a-quick-scrape-of-top-grossing-films-from-boxofficemojo-com/).

This trick helps to see more of the page, but any attempts to extract the table are not working. Any advice appreciated. I need the table fields like Organization, Organizational Title, Manager.

 #attempt to get past the https problem 
 raw <- getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
 head(raw)
[1] "\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; 
...
 h = htmlParse(raw)
Error in htmlParse(raw) : File ...
tables <- readHTMLTable(raw)
Error in htmlParse(doc) : File ...

The new package httr provides a wrapper around RCurl to make it easier to scrape all kinds of pages.

Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.

library("httr")
library("XML")

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Read page
page <- GET(
  "https://ned.nih.gov/", 
  path="search/ViewDetails.aspx", 
  query="NIHID=0010121048",
  config(cainfo = cafile, ssl.verifypeer = FALSE)
)

# Use regex to extract the desired table
x <- text_content(page)
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)

# Parse the table
readHTMLTable(tab)

The results:

$ctl00_ContentPlaceHolder_dvPerson
                V1                                      V2
1      Legal Name:                    Dr Francis S Collins
2  Preferred Name:                      Dr Francis Collins
3          E-mail:                 francis.collins@nih.gov
4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
5       Mail Stop:                                       Â
6           Phone:                            301-496-2433
7             Fax:                                       Â
8              IC:             OD (Office of the Director)
9    Organization:            Office of the Director (HNA)
10 Classification:                                Employee
11            TTY:                                       Â

Get httr here: http://cran.r-project.org/web/packages/httr/index.html


EDIT: Useful page with FAQ about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html

Why does the STAX parser think this is valid XML 1.0 but not 1.1?

4 votes

In the following code example, I use the StAX parser to parse a piece of XML. If I run the xml10 string through it, it works as expected. With the xml11 string (which is the same, except for the XML version), it throws a NullPointerException. I'm running this on a Mac using JDK 1.6.

import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.util.Stack;

/**
 */
public class StaxSucks {

    static String xml10 ="<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n"+
                        "<anElement/>";

    static String xml11 ="<?xml version=\"1.1\" encoding=\"utf-8\" ?>\n"+
            "<anElement/>";


    static void parse(InputStream is) throws Exception{
        final XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        final XMLStreamReader xmlStreamReader = factory.createXMLStreamReader(is);
        Stack<QName> XMLDEPTH = new Stack<QName>();
        int eventType = xmlStreamReader.next();
        while(eventType != XMLStreamConstants.END_DOCUMENT){
            if(XMLStreamConstants.START_ELEMENT == eventType){
                QName eventName = xmlStreamReader.getName();
                XMLDEPTH.push(eventName);
            }else if(XMLStreamConstants.END_ELEMENT == eventType){
                //ends should always match the starts.
                QName eventName = xmlStreamReader.getName();
                if(XMLDEPTH.peek().equals(eventName)){
                    XMLDEPTH.pop();
                }else{
                    System.out.println("Hit an end with a non-matching beginning:"+eventName);
                }
            } else{
                System.out.println("Hit event type:"+eventType);
            }
            eventType = xmlStreamReader.next();
        }
        System.out.println("Stack is empty:"+XMLDEPTH.empty());

    }

    public static void main(String[] args) throws Exception{
        System.out.println("Starting XML1.0");
        InputStream is = new ByteArrayInputStream(xml10.getBytes("utf8"));
        parse(is);
        System.out.println("Starting XML1.1");
        is = new ByteArrayInputStream(xml11.getBytes("utf8"));
        parse(is);
    }
}

Stack Trace:

Exception in thread "main" java.lang.NullPointerException
    at com.sun.org.apache.xerces.internal.impl.XML11NSDocumentScannerImpl.scanStartElement(XML11NSDocumentScannerImpl.java:351)
    at com.sun.org.apache.xerces.internal.impl.XML11NSDocumentScannerImpl$NS11ContentDriver.scanRootElementHook(XML11NSDocumentScannerImpl.java:889)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3104)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:922)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XML11NSDocumentScannerImpl.next(XML11NSDocumentScannerImpl.java:852)
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:554)
    at StaxSucks.parse(StaxSucks.java:46)
    at StaxSucks.main(StaxSucks.java:74)

This is a case of a broken StAX implementation in the Sun/Oracle JDK. The IBM JDK works fine, or you can just use the latest Xerces jars and you will be fine.

You can download xerces jars from: http://xerces.apache.org/mirrors.cgi#binary

dims@dims-laptop-520:~/test$ /usr/lib/jvm/java-6-sun/bin/java -cp . StaxSucks
Starting XML1.0
Stack is empty:true
Starting XML1.1
Exception in thread "main" java.lang.NullPointerException
    at com.sun.org.apache.xerces.internal.impl.XML11NSDocumentScannerImpl.scanStartElement(XML11NSDocumentScannerImpl.java:351)
    at com.sun.org.apache.xerces.internal.impl.XML11NSDocumentScannerImpl$NS11ContentDriver.scanRootElementHook(XML11NSDocumentScannerImpl.java:889)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3104)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:922)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XML11NSDocumentScannerImpl.next(XML11NSDocumentScannerImpl.java:852)
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:554)
    at StaxSucks.parse(StaxSucks.java:26)
    at StaxSucks.main(StaxSucks.java:54)
dims@dims-laptop-520:~/test$ java -cp .:xercesImpl.jar:xml-apis.jar StaxSucks
Starting XML1.0
Stack is empty:true
Starting XML1.1
Stack is empty:true

How to merge >1000 xml files into one in Java

4 votes

I am trying to merge many XML files into one. I have successfully done that with DOM, but this solution is limited to a few files. When I run it on more than 1000 files, I get a java.lang.OutOfMemoryError.

What I want to achieve is the following. Given these files:

file 1:

<root>
....
</root>

file 2:

<root>
......
</root>

file n:

<root>
....
</root>

the output should be:

<rootSet>
<root>
....
</root>
<root>
....
</root>
<root>
....
</root>
</rootSet>

This is my current implementation:

    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
    Document doc = docBuilder.newDocument();
    Element rootSetElement = doc.createElement("rootSet");
    Node rootSetNode = doc.appendChild(rootSetElement);
    Element creationElement = doc.createElement("creationDate");
    rootSetNode.appendChild(creationElement);
    creationElement.setTextContent(dateString); 
    File dir = new File("/tmp/rootFiles");
    String[] files = dir.list();
    if (files == null) {
        System.out.println("No roots to merge!");
    } else {
        Document rootDocument;
        for (int i = 0; i < files.length; i++) {
            File filename = new File(dir, files[i]);
            rootDocument = docBuilder.parse(filename);
            // Import the parsed file's <root> element into the output document
            Node tempDoc = doc.importNode(rootDocument.getElementsByTagName("root").item(0), true);
            rootSetNode.appendChild(tempDoc);
        }
    }

I have experimented a lot with XSLT and SAX, but I seem to keep missing something. Any help would be highly appreciated.

You might also consider using StAX. Here's code that would do what you want:

import java.io.File;
import java.io.FileWriter;
import java.io.Writer;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.stream.StreamSource;

public class XMLConcat {
    public static void main(String[] args) throws Throwable {
        File dir = new File("/tmp/rootFiles");
        File[] rootFiles = dir.listFiles();

        Writer outputWriter = new FileWriter("/tmp/mergedFile.xml");
        XMLOutputFactory xmlOutFactory = XMLOutputFactory.newFactory();
        XMLEventWriter xmlEventWriter = xmlOutFactory.createXMLEventWriter(outputWriter);
        XMLEventFactory xmlEventFactory = XMLEventFactory.newFactory();

        xmlEventWriter.add(xmlEventFactory.createStartDocument());
        xmlEventWriter.add(xmlEventFactory.createStartElement("", null, "rootSet"));

        XMLInputFactory xmlInFactory = XMLInputFactory.newFactory();
        for (File rootFile : rootFiles) {
            XMLEventReader xmlEventReader = xmlInFactory.createXMLEventReader(new StreamSource(rootFile));
            XMLEvent event = xmlEventReader.nextEvent();
            // Skip ahead in the input to the opening document element
            while (event.getEventType() != XMLEvent.START_ELEMENT) {
                event = xmlEventReader.nextEvent();
            }

            do {
                xmlEventWriter.add(event);
                event = xmlEventReader.nextEvent();
            } while (event.getEventType() != XMLEvent.END_DOCUMENT);
            xmlEventReader.close();
        }

        xmlEventWriter.add(xmlEventFactory.createEndElement("", null, "rootSet"));
        xmlEventWriter.add(xmlEventFactory.createEndDocument());

        xmlEventWriter.close();
        outputWriter.close();
    }
}

One minor caveat is that this API seems to expand empty tags, changing <foo/> into <foo></foo>.
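
The same event-copying idea can be tried out on in-memory strings before pointing it at a directory of files. The following is only a minimal sketch: the class name StaxMergeSketch, the merge helper and its creationDate parameter are made up for illustration (the creationDate element mirrors the one in the question's DOM code), and it has not been measured against thousands of files. It tracks element depth so that only the document element and its content are copied from each input, which also keeps prolog comments or a DOCTYPE out of the merged output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name, used only for this sketch.
class StaxMergeSketch {

    // Hypothetical helper: wraps the document element of every input in <rootSet>.
    static String merge(List<String> documents, String creationDate) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLInputFactory inFactory = XMLInputFactory.newInstance();

        writer.add(events.createStartDocument());
        writer.add(events.createStartElement("", "", "rootSet"));

        // <creationDate> as in the question's DOM version
        writer.add(events.createStartElement("", "", "creationDate"));
        writer.add(events.createCharacters(creationDate));
        writer.add(events.createEndElement("", "", "creationDate"));

        for (String document : documents) {
            XMLEventReader reader = inFactory.createXMLEventReader(new StringReader(document));
            try {
                int depth = 0; // 0 means we are outside the document element
                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    if (event.isStartElement()) {
                        depth++;
                    }
                    if (depth > 0) {
                        writer.add(event); // copy only what is inside the document element
                    }
                    if (event.isEndElement()) {
                        depth--;
                    }
                }
            } finally {
                reader.close();
            }
        }

        writer.add(events.createEndElement("", "", "rootSet"));
        writer.add(events.createEndDocument());
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        List<String> inputs = Arrays.asList(
                "<?xml version=\"1.0\"?><!-- prolog comment --><root><a/></root>",
                "<root>b</root>");
        System.out.println(merge(inputs, "2012-05-01"));
    }
}
```

Because each input is read and written one event at a time, memory use should depend on the size of a single event rather than on the number of files, which is the property the DOM version lacks.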

E4X select Nodes where descendants can be either A OR B or A && B

4 votes

So I have this XML structure:

<Items>
<Item name="aaa">
    <ProductRanges>
        <ProductRange id="1" />
    </ProductRanges>
</Item>
<Item name="bbb">
    <ProductRanges>
        <ProductRange id="2" />
    </ProductRanges>
</Item>
<Item name="ccc">
    <ProductRanges>
        <ProductRange id="1" />
        <ProductRange id="2" />
    </ProductRanges>
</Item>
</Items>

Using the following E4X query I get only item "aaa" and item "bbb".

trace(   Items.Item.(descendants("ProductRange").@id == "1" || descendants("ProductRange").@id == "2")   );

I can kind of see why I'm not getting item "ccc": it has BOTH id="1" and id="2".

So I'm not really sure what the correct query should be here, or even whether descendants is the right technique.

I don't want to end up writing long additional id="1" && id="2" queries either, because I have unlimited combinations of these values ("2" && "3", "1" && "2" && "3", etc.).

Any thoughts would be most helpful.

Thanks


So Patrick solved this with this expression:

xml.Item.(descendants('ProductRange').(@id=="1" || @id=="2").length()>0);

However, taking this one step further, how would I dynamically create the @id values? This will be a changing query that depends on user selections.

Something like this (which doesn't work):

var attributeValues:String = "@id==\"1\" || @id==\"2\" || @id==\"3\" || @id==\"4\"";
xml.Item.(descendants('ProductRange').(attributeValues).length()>0);

Any more thoughts, Patrick? Anyone?

Thanks

For the OR search you can simply do:

xml.Item.(descendants('ProductRange').(@id=="1" || @id=="2").length()>0);

The AND search is more involved than the OR search, but you can do it with a custom filter function:

var xml:XML=<Items>
<Item name="aaa">
    <ProductRanges>
        <ProductRange id="1" />
    </ProductRanges>
</Item>
<Item name="bbb">
    <ProductRanges>
        <ProductRange id="2" />
    </ProductRanges>
</Item>
<Item name="ccc">
    <ProductRanges>
        <ProductRange id="1" />
        <ProductRange id="3" />
        <ProductRange id="2" />
    </ProductRanges>
</Item>
</Items>;

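function andSearch(node:XMLList, searchFor:Array):Boolean {
    var mask:int = 0;
    // Target value: one bit set for every id in searchFor
    var match:int = (1 << searchFor.length) - 1;
    var fn:Function = function(id:String):Boolean {
        var i:int = searchFor.indexOf(id);
        if (i >= 0) {
            // Record that the id at position i was found
            mask = mask | (1 << i);
        }
        return mask == match;
    };
    // Run fn over every ProductRange; only its side effect on mask matters
    node.(ProductRange.(fn(@id)));
    return mask == match;
}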

trace( xml.Item.( andSearch( ProductRanges, ["1", "2"] ) ) );
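
The bitmask bookkeeping in andSearch is independent of E4X, so it may be easier to follow in isolation. Below is a minimal sketch of the same idea in Java rather than ActionScript; the class MaskSearchSketch and the methods containsAll and containsAny are hypothetical names, and each list of ids stands in for the @id values of one Item. Because the wanted ids are passed as a list, the set of values can be built at runtime from user selections. It assumes fewer than 31 wanted ids so the mask fits in an int, and an empty list of wanted ids matches everything under containsAll.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class name, used only for this sketch.
class MaskSearchSketch {

    // AND search: true when every value in searchFor occurs at least once in ids.
    static boolean containsAll(List<String> ids, List<String> searchFor) {
        int match = (1 << searchFor.size()) - 1; // one bit per wanted id
        int mask = 0;
        for (String id : ids) {
            int i = searchFor.indexOf(id);
            if (i >= 0) {
                mask |= 1 << i; // remember that wanted id number i was seen
            }
        }
        return mask == match;
    }

    // OR search: true when at least one value in searchFor occurs in ids.
    static boolean containsAny(List<String> ids, List<String> searchFor) {
        for (String id : ids) {
            if (searchFor.contains(id)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> wanted = Arrays.asList("1", "2");
        System.out.println(containsAll(Arrays.asList("1"), wanted));           // false (item "aaa")
        System.out.println(containsAll(Arrays.asList("2"), wanted));           // false (item "bbb")
        System.out.println(containsAll(Arrays.asList("1", "3", "2"), wanted)); // true  (item "ccc")
        System.out.println(containsAny(Arrays.asList("1"), wanted));           // true
    }
}
```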