Best xml questions in April 2011

How do I extract info deep inside XML using C# and LINQ?

8 votes

This is my first post on StackOverflow, so please bear with me. And I apologize upfront if my code example is a bit long.

Using C# and LINQ, I'm trying to identify a series of third level id elements (000049 in this case) in a much larger XML file. Each third level id is unique, and the ones I want are based on a series of descendant info for each. More specifically, if type == A and location type(old) == vault and location type(new) == out, then I want to select that id. Below is the XML and C# code that I'm using.

In general my code works. As written below it will return an id of 000049 twice, which is correct. However, I have found a glitch. If I remove the first history block that contains type == A, my code still returns an id of 000049 twice when it should only return it once. I know why it is happening, but I can't figure out a better way to run the query. Is there a better way to run my query to get the output I want and still use LINQ?

My XML:

<?xml version="1.0" encoding="ISO8859-1" ?>
<data type="historylist">
    <date type="runtime">
        <year>2011</year>
        <month>04</month>
        <day>22</day>
        <dayname>Friday</dayname>
        <hour>15</hour>
        <minutes>24</minutes>
        <seconds>46</seconds>
    </date>
    <customer>
        <id>0001</id>
        <description>customer</description>
        <mediatype>
            <id>kit</id>
            <description>customer kit</description>
            <volume>
                <id>000049</id>
                <history>
                    <date type="optime">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                        <hour>03</hour>
                        <minutes>00</minutes>
                        <seconds>02</seconds>
                    </date>
                    <userid>batch</userid>
                    <type>OD</type>
                    <location type="old">
                        <repository>vault</repository>
                        <slot>0</slot>
                    </location>
                    <location type="new">
                        <repository>out</repository>
                        <slot>0</slot>
                    </location>
                    <container>0001.kit.000049</container>
                    <date type="movedate">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                    </date>
                </history>
                <history>
                    <date type="optime">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                        <hour>06</hour>
                        <minutes>43</minutes>
                        <seconds>33</seconds>
                    </date>
                    <userid>vaultred</userid>
                    <type>A</type>
                    <location type="old">
                        <repository>vault</repository>
                        <slot>0</slot>
                    </location>
                    <location type="new">
                        <repository>out</repository>
                        <slot>0</slot>
                    </location>
                    <container>0001.kit.000049</container>
                    <date type="movedate">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                    </date>
                </history>
                <history>
                    <date type="optime">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                        <hour>06</hour>
                        <minutes>43</minutes>
                        <seconds>33</seconds>
                    </date>
                    <userid>vaultred</userid>
                    <type>S</type>
                    <location type="old">
                        <repository>vault</repository>
                        <slot>0</slot>
                    </location>
                    <location type="new">
                        <repository>out</repository>
                        <slot>0</slot>
                    </location>
                    <container>0001.kit.000049</container>
                    <date type="movedate">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                    </date>
                </history>
                <history>
                    <date type="optime">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                        <hour>06</hour>
                        <minutes>45</minutes>
                        <seconds>00</seconds>
                    </date>
                    <userid>batch</userid>
                    <type>O</type>
                    <location type="old">
                        <repository>out</repository>
                        <slot>0</slot>
                    </location>
                    <location type="new">
                        <repository>site</repository>
                        <slot>0</slot>
                    </location>
                    <container>0001.kit.000049</container>
                    <date type="movedate">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                    </date>
                </history>
                <history>
                    <date type="optime">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                        <hour>11</hour>
                        <minutes>25</minutes>
                        <seconds>59</seconds>
                    </date>
                    <userid>ihcmdm</userid>
                    <type>A</type>
                    <location type="old">
                        <repository>out</repository>
                        <slot>0</slot>
                    </location>
                    <location type="new">
                        <repository>site</repository>
                        <slot>0</slot>
                    </location>
                    <container>0001.kit.000049</container>
                    <date type="movedate">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                    </date>
                </history>
                <history>
                    <date type="optime">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                        <hour>11</hour>
                        <minutes>25</minutes>
                        <seconds>59</seconds>
                    </date>
                    <userid>ihcmdm</userid>
                    <type>S</type>
                    <location type="old">
                        <repository>out</repository>
                        <slot>0</slot>
                    </location>
                    <location type="new">
                        <repository>site</repository>
                        <slot>0</slot>
                    </location>
                    <container>0001.kit.000049</container>
                    <date type="movedate">
                        <year>2011</year>
                        <month>04</month>
                        <day>22</day>
                        <dayname>Friday</dayname>
                    </date>
                </history>
            </volume>
            ...

My C# code:

IEnumerable<XElement> caseIdLeavingVault =
    from volume in root.Descendants("volume")
    where
        (from type in volume.Descendants("type")
         where type.Value == "A"
         select type).Any() &&
        (from locationOld in volume.Descendants("location")
         where
             ((String)locationOld.Attribute("type") == "old" &&
              (String)locationOld.Element("repository") == "vault") &&
             (from locationNew in volume.Descendants("location")
              where
                  ((String)locationNew.Attribute("type") == "new" &&
                   (String)locationNew.Element("repository") == "out")
              select locationNew).Any()
         select locationOld).Any()
    select volume.Element("id");

    ...

foreach (XElement volume in caseIdLeavingVault)
{
    Console.WriteLine(volume.Value.ToString());
}

Thanks.


OK guys, I'm stumped again. Given this same situation and @Elian's solution below (which works great), I need the "optime" and "movedate" dates for the history used to select the id. Does that make sense? I was hoping to end with something like this:

select new { 
    id = volume.Element("id").Value, 

    // this is from "optime"
    opYear = <whatever>("year").Value, 
    opMonth = <whatever>("month").Value, 
    opDay = <whatever>("day").Value, 

    // this is from "movedate"
    mvYear = <whatever>("year").Value, 
    mvMonth = <whatever>("month").Value, 
    mvDay = <whatever>("day").Value 
} 

I have tried so many different combinations, but the Attributes for <date type="optime"> and <date type="movedate"> keep getting in my way and I can't seem to get what I want.


OK. I found a solution that works well:

select new {
    caseId = volume.Element("id").Value,

    // this is from "optime"
    opYear = volume.Descendants("date").Where(t => t.Attribute("type").Value == "optime").First().Element("year").Value,
    opMonth = volume.Descendants("date").Where(t => t.Attribute("type").Value == "optime").First().Element("month").Value,
    opDay = volume.Descendants("date").Where(t => t.Attribute("type").Value == "optime").First().Element("day").Value,

    // this is from "movedate"
    mvYear = volume.Descendants("date").Where(t => t.Attribute("type").Value == "movedate").First().Element("year").Value,
    mvMonth = volume.Descendants("date").Where(t => t.Attribute("type").Value == "movedate").First().Element("month").Value,
    mvDay = volume.Descendants("date").Where(t => t.Attribute("type").Value == "movedate").First().Element("day").Value
};

However, it does fail when it finds an id with no "movedate". A few of these exist, so now I am working on that.


Well, late yesterday afternoon I finally figured out the solution I had been wanting:

var caseIdLeavingSite =
    from volume in root.Descendants("volume")
    where volume.Elements("history").Any(
        h => h.Element("type").Value == "A" &&
        h.Elements("location").Any(l => l.Attribute("type").Value == "old" && ((l.Element("repository").Value == "site") ||
                                                                               (l.Element("repository").Value == "init"))) &&
        h.Elements("location").Any(l => l.Attribute("type").Value == "new" && l.Element("repository").Value == "toVault")
        )
    select new {
        caseId = volume.Element("id").Value,
        opYear = volume.Descendants("date").Where(t => t.Attribute("type").Value == "optime").First().Element("year").Value,
        opMonth = volume.Descendants("date").Where(t => t.Attribute("type").Value == "optime").First().Element("month").Value,
        opDay = volume.Descendants("date").Where(t => t.Attribute("type").Value == "optime").First().Element("day").Value,
        mvYear = (volume.Descendants("date").Where(t => t.Attribute("type").Value == "movedate").Any() == true) ? 
                 (volume.Descendants("date").Where(t => t.Attribute("type").Value == "movedate").First().Element("year").Value) : "0",
        mvMonth = (volume.Descendants("date").Where(t => t.Attribute("type").Value == "movedate").Any() == true) ? 
                  (volume.Descendants("date").Where(t => t.Attribute("type").Value == "movedate").First().Element("month").Value) : "0",
        mvDay = (volume.Descendants("date").Where(t => t.Attribute("type").Value == "movedate").Any() == true) ? 
                (volume.Descendants("date").Where(t => t.Attribute("type").Value == "movedate").First().Element("day").Value) : "0"
   };

This satisfies the requirements that @Elian helped with and grabs the additional date info necessary. It also accounts for those few instances when there is no element for "movedate" by using the ternary operator ?:.

Now, if anyone knows how to make this more efficient, I'm still interested. Thanks.

I think you want something like this:

IEnumerable<XElement> caseIdLeavingVault =
    from volume in document.Descendants("volume")
    where volume.Elements("history").Any(
        h => h.Element("type").Value == "A" &&
            h.Elements("location").Any(l => l.Attribute("type").Value == "old" && l.Element("repository").Value == "vault") &&
            h.Elements("location").Any(l => l.Attribute("type").Value == "new" && l.Element("repository").Value == "out")
        )
    select volume.Element("id");

Your code independently checks if a volume has a <history> element of type A and a (not necessarily the same) <history> element which has the required <location> elements.

The code above checks if there exists a <history> element that is both of type A and contains the required <location> elements.

Update: Abatishchev suggested a solution that uses an XPath query instead of LINQ to XML, but his query is too simple and doesn't return exactly what you asked for. The following XPath query will do the trick, but it's also a little longer:

data/customer/mediatype/volume[history[type = 'A' and location[@type = 'old' and repository = 'vault'] and location[@type = 'new' and repository = 'out']]]/id
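
For readers who want to see the "same <history> element" rule in isolation, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It is not the C# from this thread; the trimmed sample data, the second volume id (000050) and all function names are made up for illustration. The first volume has a type A history and, separately, a vault-to-out history, so a query that checks the conditions independently would select it, while the per-history check does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down data: only the elements the query looks at.
SAMPLE = """<data type="historylist">
  <customer><mediatype>
    <volume>
      <id>000049</id>
      <history>
        <type>OD</type>
        <location type="old"><repository>vault</repository></location>
        <location type="new"><repository>out</repository></location>
      </history>
      <history>
        <type>A</type>
        <location type="old"><repository>out</repository></location>
        <location type="new"><repository>site</repository></location>
      </history>
    </volume>
    <volume>
      <id>000050</id>
      <history>
        <type>A</type>
        <location type="old"><repository>vault</repository></location>
        <location type="new"><repository>out</repository></location>
      </history>
    </volume>
  </mediatype></customer>
</data>"""


def has_location(history, kind, repository):
    """True if this history has a <location type=kind> with the given repository."""
    return any(
        loc.get("type") == kind and loc.findtext("repository") == repository
        for loc in history.findall("location")
    )


def history_matches(history):
    """All three conditions must hold on the SAME <history> element."""
    return (
        history.findtext("type") == "A"
        and has_location(history, "old", "vault")
        and has_location(history, "new", "out")
    )


def ids_leaving_vault(root):
    """Ids of volumes that have at least one fully matching <history>."""
    return [
        volume.findtext("id")
        for volume in root.iter("volume")
        if any(history_matches(h) for h in volume.findall("history"))
    ]


root = ET.fromstring(SAMPLE)
print(ids_leaving_vault(root))
```

Only the second volume is selected, because its single history block satisfies all three conditions at once.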

Full text searching XML data with Python: best practices, pros & cons

8 votes

Task

I want to use Python for doing full text searches of XML data.

Example data

<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>

Basic functionality

The most basic functionality I want is that a search for "other" in an XPath ("/elements/elem") returns at least the value of the ID attribute for the matching element (elem 2) and nested element (elem 3, nested 1) or the matching XPaths.

Ideal functionality

The solution should be flexible and scalable. I am looking for possible combinations of these features:

  • search nested elements (infinite depth)
  • search attributes
  • search for sentences and paragraphs
  • search using wildcards
  • search using fuzzy matching
  • return precise matching info
  • good search speed for large XML files

Question

I don't expect a solution with all of the ideal functionality, I'll have to combine different existing functionalities and code the rest myself. But first I would like to know more about what there is out there, which libraries and approaches you would usually use for this, what their pros and cons are.

EDIT: Thanks for the answers so far, I added detail and started a bounty.

Not sure if that will be enough for your needs, but lxml has support for regular expressions in XPath (meaning: you can use XPath 1.0 plus the EXSLT extension functions for regular expressions).

Compared to the feature list that was added later:

  • search nested elements (infinite depth): yes
  • search attributes: yes
  • search for sentences and paragraphs: no. Assuming that "paragraphs" are actual xml elements, then yes. But "sentences" as such, no.
  • search using wildcards: yes (regular expressions)
  • search using fuzzy matching: no (assuming stemming, synonyms and so on...)
  • return precise matching info: yes
  • good search speed for large XML files: yes, except when your files are so extremely large that you would actually need a fulltext index to get good speed anyway.

The only way I see to satisfy all your requests would be to load your files into a native XML database that supports "real" fulltext search (probably via XQuery Full Text) and use that. (I can't help you much further with that; maybe Sedna, which seems to have a Python API and seems to support fulltext search?)
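
To make the "basic functionality" from the question concrete, here is a rough sketch that uses only the Python standard library (xml.etree.ElementTree plus re). Note that this swaps in a plain tree walk with Python regular expressions; it does not use lxml or the EXSLT regex functions mentioned above. The function name search and the path format are my own choices, not an existing API.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>"""


def search(root, pattern):
    """Return (path, id) for every element whose own text matches the regex."""
    regex = re.compile(pattern)
    hits = []

    def walk(node, path):
        # An element's own text is its leading text plus the tails of its
        # children, so text inside nested elements is not attributed to it.
        own_text = (node.text or "") + "".join(child.tail or "" for child in node)
        if regex.search(own_text):
            hits.append((path, node.get("id")))
        for child in node:
            step = child.tag
            if child.get("id") is not None:
                step += "[@id='%s']" % child.get("id")
            walk(child, path + "/" + step)

    walk(root, "/" + root.tag)
    return hits


root = ET.fromstring(SAMPLE)
for path, elem_id in search(root, r"\bother\b"):
    print(elem_id, path)
```

This reports elem 2 and the nested element inside elem 3, at any depth. It reads the whole document into memory, so for very large files the indexing caveat above still applies.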

Human-readable XML output from Scala?

6 votes

Scala seems to do two things to XML that you enter that make it no less parseable but make it less readable:

First, it expands tags that close themselves:

scala> <tag/>
res109: scala.xml.Elem = <tag></tag>

And second, it scrambles attributes into random order, as if it put them into a hash set:

scala> <tag a="a" b="b" c="c" d="d"/>         
res110: scala.xml.Elem = <tag d="d" a="a" c="c" b="b"></tag>

Together, these conspire to render XML considerably less human-readable (at least by me). I'm not very familiar with the XML library; is there a way to perform xml-to-string translation that yields a compact human-readable form? (If not by default, by recursing and writing one's own string conversions--or are there too many special cases that lurk there?)

Mostly, see scala.xml.Utility.toXML. The attribute thing doesn't have a solution, though (as far as I know).

scala> xml.Utility.toXML(<a/>, minimizeTags = true)
res13: StringBuilder = <a />

You may want to look at scala.xml.PrettyPrinter as well.

How to check if an XML file is well formed using Delphi?

6 votes

How can I check if an XML file is well formed, without invalid chars or tags?

For example, consider this XML:

<?xml version="1.0"?>
<PARTS>
   <TITLE>Computer Parts</TITLE>
   <PART>
      <ITEM>Motherboard</ITEM>
      <MANUFACTURER>ASUS</MANUFACTURER>
      <MODEL>P3B-F</MODEL>
      <COST> 123.00</COST>
   </PART>
   <PART>
      <ITEM>Video Card</ITEM>
      <MANUFACTURER>ATI</MANUFACTURER>
      <MODEL>All-in-Wonder Pro</MODEL>
      <COST> 160.00</COST>
   </PART>
</PARTSx>

The last tag, </PARTSx>, must be </PARTS>.

You can use the IXMLDOMParseError interface returned by the MSXML DOMDocument.

This interface returns a series of properties that help you identify the problem:

  • errorCode Contains the error code of the last parse error. Read-only.
  • filepos Contains the absolute file position where the error occurred. Read-only.
  • line Specifies the line number that contains the error. Read-only.
  • linepos Contains the character position within the line where the error occurred.
  • reason Describes the reason for the error. Read-only.
  • srcText Returns the full text of the line containing the error. Read-only.
  • url Contains the URL of the XML document containing the last error. Read-only.

Check these two functions, which use MSXML 6.0 (you can use other versions as well):

uses
  Variants,
  Comobj,
  SysUtils;

function IsValidXML(const XmlStr :string;var ErrorMsg:string) : Boolean;
var
  XmlDoc : OleVariant;
begin
  XmlDoc := CreateOleObject('Msxml2.DOMDocument.6.0');
  try
    XmlDoc.Async := False;
    XmlDoc.validateOnParse := True;
    Result:=(XmlDoc.LoadXML(XmlStr)) and (XmlDoc.parseError.errorCode = 0);
    if not Result then
     ErrorMsg:=Format('Error Code : %s  Msg : %s line : %s Character  Position : %s Pos in file : %s',
     [XmlDoc.parseError.errorCode,XmlDoc.parseError.reason,XmlDoc.parseError.Line,XmlDoc.parseError.linepos,XmlDoc.parseError.filepos]);
  finally
    XmlDoc:=Unassigned;
  end;
end;

function IsValidXMLFile(const XmlFile :TFileName;var ErrorMsg:string) : Boolean;
var
  XmlDoc : OleVariant;
begin
  XmlDoc := CreateOleObject('Msxml2.DOMDocument.6.0');
  try
    XmlDoc.Async := False;
    XmlDoc.validateOnParse := True;
    Result:=(XmlDoc.Load(XmlFile)) and (XmlDoc.parseError.errorCode = 0);
    if not Result then
     ErrorMsg:=Format('Error Code : %s  Msg : %s line : %s Character  Position : %s Pos in file : %s',
     [XmlDoc.parseError.errorCode,XmlDoc.parseError.reason,XmlDoc.parseError.Line,XmlDoc.parseError.linepos,XmlDoc.parseError.filepos]);
  finally
    XmlDoc:=Unassigned;
  end;
end;
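
For comparison only, the same "parse it and report where it failed" idea can be sketched outside Delphi. The snippet below is Python with the standard library's expat-based parser, not MSXML, so the error codes and messages differ from what IXMLDOMParseError reports. The function name check_well_formed and the shortened sample are my own.

```python
import xml.etree.ElementTree as ET

# Shortened version of the PARTS example; the closing tag is wrong on line 4.
BROKEN = (
    '<?xml version="1.0"?>\n'
    "<PARTS>\n"
    "   <TITLE>Computer Parts</TITLE>\n"
    "</PARTSx>\n"
)
FIXED = BROKEN.replace("</PARTSx>", "</PARTS>")


def check_well_formed(xml_text):
    """Return (ok, line, message); line is None when the text parses cleanly."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, _column = err.position  # position of the error, as reported by expat
        return False, line, str(err)
    return True, None, ""


print(check_well_formed(BROKEN))
print(check_well_formed(FIXED))
```

Like the Delphi functions, this only checks well-formedness; it does not validate against a DTD or schema.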

How can I create a schema from an example XML document in Perl?

6 votes

I need to create an XSD schema based on an XML file. Are there any Perl modules which can do this?

You can create the XSD by an XSL transformation using any XSLT processor. See XML::XSLT

An XSD file contains two element types: simple and complex. All leaf nodes have to be translated into simple type elements and the others have to be translated into complex types. Leaf nodes are elements without any element descendants. The corresponding XPath is //*[not(descendant::element())] (the element() test is XPath 2.0 syntax, so the stylesheet needs an XSLT 2.0 processor). The following XSLT implements this approach:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" 
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:template match="/">
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
               elementFormDefault="qualified" 
               attributeFormDefault="unqualified">
      <xsl:for-each select="//*[not(descendant::element())]">
        <xsl:element name="xs:element">
          <xsl:attribute name="name">
            <xsl:value-of select="name()"/>
          </xsl:attribute>
          <xs:simpleType>
            <xs:restriction base="xs:string"/>
          </xs:simpleType>
        </xsl:element>
      </xsl:for-each>
      <xsl:for-each select="//*[descendant::element()]">
        <xsl:element name="xs:element">
          <xsl:attribute name="name">
            <xsl:value-of select="name()"/>
          </xsl:attribute>
          <xs:complexType>
            <xs:sequence>
              <xsl:for-each select="child::*">
                <xsl:element name="xs:element">
                  <xsl:attribute name="ref">
                    <xsl:value-of select="name()"/>
                  </xsl:attribute>
                </xsl:element>
              </xsl:for-each>
            </xs:sequence>
          </xs:complexType>
        </xsl:element>
      </xsl:for-each>
    </xs:schema>
  </xsl:template>
</xsl:stylesheet>

The following example:

<?xml version="1.0" encoding="UTF-8"?>
<person>
  <firstname>Peter</firstname>
  <lastname>Pan</lastname>
  <born>
    <year>1904</year>
    <month>12</month>
    <day>27</day>
  </born>
</person>

Will produce the following schema:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
           elementFormDefault="qualified" 
           attributeFormDefault="unqualified">
  <xs:element name="firstname">
    <xs:simpleType>
      <xs:restriction base="xs:string"/>
    </xs:simpleType>
  </xs:element>
  <xs:element name="lastname">
    <xs:simpleType>
      <xs:restriction base="xs:string"/>
    </xs:simpleType>
  </xs:element>
  <xs:element name="year">
    <xs:simpleType>
      <xs:restriction base="xs:string"/>
    </xs:simpleType>
  </xs:element>
  <xs:element name="month">
    <xs:simpleType>
      <xs:restriction base="xs:string"/>
    </xs:simpleType>
  </xs:element>
  <xs:element name="day">
    <xs:simpleType>
      <xs:restriction base="xs:string"/>
    </xs:simpleType>
  </xs:element>
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="firstname"/>
        <xs:element ref="lastname"/>
        <xs:element ref="born"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="born">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="year"/>
        <xs:element ref="month"/>
        <xs:element ref="day"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
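
The leaf versus non-leaf split behind that stylesheet can also be sketched without XSLT. The following is a small Python illustration using xml.etree.ElementTree, not a Perl module; the function name classify is made up. One deliberate difference: it records each element name once, whereas the stylesheet above emits a declaration for every occurrence.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<person>
  <firstname>Peter</firstname>
  <lastname>Pan</lastname>
  <born>
    <year>1904</year>
    <month>12</month>
    <day>27</day>
  </born>
</person>"""


def classify(root):
    """Split element names into simple (leaf) and complex (has child elements)."""
    simple = []       # leaf element names, in document order
    complex_ = {}     # element name -> list of child element names
    for elem in root.iter():
        children = [child.tag for child in elem]
        if children:
            complex_.setdefault(elem.tag, children)
        elif elem.tag not in simple:
            simple.append(elem.tag)
    return simple, complex_


simple, complex_ = classify(ET.fromstring(SAMPLE))
print(simple)
print(complex_)
```

The simple names would become xs:element declarations with an xs:string restriction, and each complex entry would become an xs:sequence of xs:element ref="..." items, matching the schema shown above.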

What are the major methods / libraries available for parsing XML?

6 votes

I'm performing a trade study evaluating various methods for parsing XML for a large system. I'm looking at both analytical and actual relative performance (space & time) on multiple platforms (iOS, Linux, OS X, Windows). My current candidate evaluation list of methods and libraries is the following:

Am I missing any particularly valuable tools, or different parsing methods?

You are missing LINQ to XML/XDocument, which provides an alternative to XmlDocument/DOM.

How can I use LINQ to XML to query huge XML files with reasonable memory consumption?

6 votes

I've not done much with linq to xml, but all the examples I've seen load the entire XML document into memory.

What if the XML file is, say, 8GB, and you really don't have the option?

My first thought is to use the XElement.Load Method (TextReader) in combination with an instance of the FileStream Class.

QUESTION: will this work, and is this the right way to approach the problem of searching a very large XML file?

Note: high performance isn't required. I'm trying to get LINQ to XML to basically do the work of a program I could write that loops through every line of my big file and gathers things up, but since LINQ is "loop centric" I'd expect this to be possible.

Using XElement.Load will load the whole file into memory. Instead, use XmlReader with the XNode.ReadFrom method, where you can selectively load nodes found by XmlReader into an XElement for further processing, if you need to. MSDN has a very good example doing just that: http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx

If you just need to search the xml document, XmlReader alone will suffice and will not load the whole document into the memory.
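
The streaming pattern described above (read forward, materialize one small subtree at a time, then let it go) is not specific to .NET. As a rough analogue only, here is how it looks in Python with xml.etree.ElementTree.iterparse; this is not the XmlReader/XNode.ReadFrom API, and the function name stream_elements and the tiny in-memory sample are my own.

```python
import io
import xml.etree.ElementTree as ET


def stream_elements(source, tag):
    """Yield each completed <tag> element from a file-like object, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Free the element's children and text once the caller has moved on.
            # Empty stubs still stay attached to their parent.
            elem.clear()


# Tiny in-memory stand-in for a huge file on disk.
data = io.BytesIO(
    b"<data><volume><id>000049</id></volume>"
    b"<volume><id>000050</id></volume></data>"
)
ids = [volume.findtext("id") for volume in stream_elements(data, "volume")]
print(ids)
```

The caller queries each yielded element as a small in-memory tree, which is the same division of labour as XmlReader feeding XElement instances to a LINQ query.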

Ultra-portable, small complex config file library in ANSI C?

5 votes

I'm looking for a very portable, minimalistic/small XML/configuration language library in ANSI C with no external dependencies (or very few), compiling down to less than 100K. I need it for a moderately complex configuration file, and it must support Unicode.

Some more requirements:

  1. OK to use/embed/statically link into proprietary code. Credit will always be given where credit is due.
  2. Not necessarily XML.
  3. Really, clean code/no weird or inconsistent string handling.
  4. UTF-8.

Thank you fellas.

This is somewhat similar to this question: Is there a good tiny XML parser for an embedded C project?

I was able to tweak the compilation flags of the following XML parser libraries for C, and cut down more than 50% of their size on my Ubuntu machine. Mini-XML is the only one close to what you requested:

Optimising a Haskell XML parser

5 votes

I'm experimenting with Haskell at the moment and am very much enjoying the experience, but am evaluating it for a real project with some fairly stringent performance requirements. The first pass of my task is to process a complete (no-history) dump of Wikipedia (bzipped), totalling about 6 GB compressed. In Python, a script to do a full extract of each raw page (about 10 million in total) takes about 30 minutes on my box (and for reference, a Scala implementation using the pull parser takes about 40 minutes). I've been attempting to replicate this performance using Haskell and GHC and have been struggling to match it.

I've been using Codec.Compression.BZip for decompression and hexpat for parsing. I'm using lazy bytestrings as the input to hexpat and strict bytestrings for the element text type. To extract the text for each page I'm building up a DList of pointers to text elements and then iterating over this to dump it out to stdout. The code I've just described has already been through a number of profiling/refactor iterations (I quickly moved from strings to bytestrings, then from string concatenation to lists of pointers to text, then to DLists of pointers to text). I think I've got about 2 orders of magnitude speedup from the original code, but it still takes over an hour and a half to parse (although it has a lovely small memory footprint).

So I'm looking for a bit of inspiration from the community to get me the extra mile. The code is below (and I've broken it up into a number of subfunctions in order to get more detail from the profiler). Please excuse my Haskell - I've only been coding for a couple of days (having spent a week with Real World Haskell). And thanks in advance!

import System.Exit
import Data.Maybe
import Data.List
import Data.DList (DList)
import qualified Data.DList as DList

import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as LazyByteString
import qualified Codec.Compression.BZip as BZip

import Text.XML.Expat.Proc
import Text.XML.Expat.Tree
import Text.XML.Expat.Format

testFile = "../data/enwiki-latest-pages-articles.xml.bz2"

validPage pageData = case pageData of
    (Just _, Just _) -> True
    (_, _) -> False

scanChildren :: [UNode ByteString] -> DList ByteString
scanChildren c = case c of
    h:t -> DList.append (getContent h) (scanChildren t)
    []  -> DList.fromList []

getContent :: UNode ByteString -> DList ByteString
getContent treeElement =
    case treeElement of
        (Element name attributes children)  -> scanChildren children
        (Text text)                         -> DList.fromList [text]

rawData t = ((getContent.fromJust.fst) t, (getContent.fromJust.snd) t)

extractText page = do
    revision <- findChild (BS.pack "revision") page
    text <- findChild (BS.pack "text") revision
    return text

pageDetails tree =
    let pageNodes = filterChildren relevantChildren tree in
    let getPageData page = (findChild (BS.pack "title") page, extractText page) in
    map rawData $ filter validPage $ map getPageData pageNodes
    where
        relevantChildren node = case node of
            (Element name attributes children) -> name == (BS.pack "page")
            (Text _) -> False

outputPages pagesText = do
    let flattenedPages = map DList.toList pagesText
    mapM_ (mapM_ BS.putStr) flattenedPages

readCompressed fileName = fmap BZip.decompress (LazyByteString.readFile fileName)
parseXml byteStream = parse defaultParseOptions byteStream :: (UNode ByteString, Maybe XMLParseError)

main = do
    rawContent <- readCompressed testFile
    let (tree, mErr) = parseXml rawContent
    let pages = pageDetails tree
    let pagesText = map snd pages
    outputPages pagesText
    putStrLn "Complete!"
    exitWith ExitSuccess
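
The Python baseline mentioned in the question is not shown anywhere in this thread. Purely as a point of reference, a streaming extractor of that general kind might look roughly like the sketch below: bz2 for decompression and iterparse for pull-style parsing. Everything here is an assumption (function names, the namespace handling, the tiny fake dump), and it says nothing about how the asker's actual script was written or how fast it runs.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix, since dump files usually declare one."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page> that has both, decompressing lazily."""
    with bz2.open(source, "rb") as stream:
        title, text = None, None
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text
            elif name == "page":
                if title is not None and text is not None:
                    yield title, text
                title, text = None, None
                elem.clear()  # release the finished page before moving on


# Hypothetical two-page dump, compressed in memory for the demonstration.
fake_dump = bz2.compress(
    b'<mediawiki xmlns="http://example.org/export">'
    b"<page><title>A</title><revision><text>alpha</text></revision></page>"
    b"<page><title>B</title><revision><text>beta</text></revision></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(io.BytesIO(fake_dump)))
print(pages)
```

Like validPage in the Haskell code, this skips pages that lack either a title or a text element.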

After running your program I get somewhat weird results:

./wikiparse  +RTS -s -A5m -H5m | tail
./wikiparse +RTS -s -A5m -H5m
3,604,204,828,592 bytes allocated in the heap
  70,746,561,168 bytes copied during GC
      39,505,112 bytes maximum residency (37822 sample(s))
       2,564,716 bytes maximum slop
              83 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 620343 collections,     0 parallel, 15.84s, 368.69s elapsed
  Generation 1: 37822 collections,     0 parallel,  1.08s, 33.08s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time  243.85s  (4003.81s elapsed)
  GC    time   16.92s  (401.77s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  260.77s  (4405.58s elapsed)

  %GC time       6.5%  (9.1% elapsed)

  Alloc rate    14,780,341,336 bytes per MUT second

  Productivity  93.5% of total user, 5.5% of total elapsed

Total time is more than OK I think: 260s is way faster than 30m for Python. I have no idea, though, why the overall elapsed time is so large here. I really don't think that reading a 6 GB file would take more than an hour to complete.

I'm running your program again to check if the results are consistent.

If the result of those 4'20'' is right, then I believe something is wrong with the machine... or there is some other strange effect here.

The code was compiled on GHC 7.0.2.


Edit: I tried various versions of the program above. The most important optimization seems to be the {-# INLINE #-} pragma and specialization of functions. Some have pretty generic types, which is known to be bad for performance. OTOH, I believe inlining should be enough to trigger the specialization, so you should experiment further with this.

I didn't see any significant difference across the versions of GHC I tried (6.12 .. HEAD).

The Haskell bindings to bzlib seem to have optimal performance. The following program, which is a near-complete reimplementation of the standard bzcat program, is as fast as or even faster than the original.

module Main where

import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip
import System.Environment (getArgs)

readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)

main :: IO ()
main = do
    files <- getArgs
    mapM_ (\f -> readCompressed f >>= BSL.putStr) files

On my machine it takes ~1100s to decompress the test file to /dev/null. The fastest version I was able to get was based on a SAX-style parser. I'm not sure, though, whether its output matches that of the original. On small outputs the result is the same, and so is the performance. On the original file the SAX version is somewhat faster and completes in ~2400s. You can find it below.

{-# LANGUAGE OverloadedStrings #-}

import System.Exit
import Data.Maybe

import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip

import System.IO

import Text.XML.Expat.SAX as SAX

type ByteStringL = BSL.ByteString
type Token = ByteString
type TokenParser = [SAXEvent Token Token] -> [[Token]]

testFile = "/tmp/enwiki-latest-pages-articles.xml.bz2"


readCompressed :: FilePath -> IO ByteStringL
readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)

{-# INLINE pageStart #-}
pageStart :: TokenParser
pageStart ((StartElement "page" _):xs) = titleStart xs
pageStart (_:xs) = pageStart xs
pageStart [] = []

{-# INLINE titleStart #-}
titleStart :: TokenParser
titleStart ((StartElement "title" _):xs) = finish "title" revisionStart xs
titleStart ((EndElement "page"):xs) = pageStart xs
titleStart (_:xs) = titleStart xs
titleStart [] = error "could not find <title>"


{-# INLINE revisionStart #-}
revisionStart :: TokenParser
revisionStart ((StartElement "revision" _):xs) = textStart xs
revisionStart ((EndElement "page"):xs) = pageStart xs
revisionStart (_:xs) = revisionStart xs
revisionStart [] = error "could not find <revision>"

{-# INLINE textStart #-}
textStart :: TokenParser
textStart ((StartElement "text" _):xs) = textNode [] xs
textStart ((EndElement "page"):xs) = pageStart xs
textStart (_:xs) = textStart xs
textStart [] = error "could not find <text>"

{-# INLINE textNode #-}
textNode :: [Token] -> TokenParser
textNode acc ((CharacterData txt):xs) = textNode (txt:acc) xs
textNode acc xs = (reverse acc) : textEnd xs

{-# INLINE textEnd #-}
textEnd {- , revisionEnd, pageEnd -} :: TokenParser
textEnd = finish "text" . finish "revision" . finish "page" $ pageStart
--revisionEnd = finish "revision" pageEnd
--pageEnd = finish "page" pageStart

{-# INLINE finish #-}
finish :: Token -> TokenParser -> TokenParser
finish tag cont ((EndElement el):xs) | el == tag = cont xs
finish tag cont (_:xs) = finish tag cont xs
finish tag _ [] = error (show (tag,("finish []" :: String)))

main :: IO ()
main = do
  rawContent <- readCompressed testFile
  let parsed = (pageStart (SAX.parse defaultParseOptions rawContent))
  mapM_ (mapM_ BS.putStr) ({- take 5000 -} parsed) -- uncomment "take 5000" to finish early
  putStrLn "Complete!"
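
For comparison, the same streaming idea can be sketched in Python: decompress lazily, react to parser events, and keep only one page in memory at a time. This is a rough sketch and not a port of the Haskell program. The function names, the in-memory sample, and the assumption that every page has one title and one text element are mine; the real dump is far larger and uses an XML namespace, which local_name strips.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_page_texts(stream):
    """Yield (title, text) per <page> while parsing the stream incrementally."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None and text is not None:
                yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to bound memory use


# Tiny in-memory stand-in for the real dump (hypothetical sample data).
SAMPLE = (
    b"<mediawiki>"
    b"<page><title>A</title><revision><text>alpha</text></revision></page>"
    b"<page><title>B</title><revision><text>beta</text></revision></page>"
    b"</mediawiki>"
)

compressed = io.BytesIO(bz2.compress(SAMPLE))
with bz2.open(compressed, "rb") as stream:
    pages = list(iter_page_texts(stream))

print(pages)
```

On a real file, pass the path to bz2.open instead of the BytesIO object and consume the generator lazily rather than collecting it into a list.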

Generally I'm suspicious that Python's and Scala's versions are finishing early. I couldn't verify that claim though without the source code.

To sum up: inlining and specialization should give a reasonable, roughly two-fold increase in performance.

Can I get a specific example of XML in c#

5 votes

I am trying to pull multiple elements and their children out of an XML document, but I cannot find a useful example anywhere... MSDN is very vague. This is C# in .NET.

I am already creating this XML dynamically and transferring it to a string. I have been trying to use XmlNode with a NodeList to go through each file in a foreach section, but it is not working properly.

Here is some sample XML:

<searchdoc>
    <results>
        <result no = "1">
            <url>a.com</url>
            <lastmodified>1/1/1</lastmodified>
            <description>desc1</description>
            <title>title1</title>
        </result>
        <result no = "2">
            <url>b.com</url>
            <lastmodified>2/2/2/</lastmodified>
            <description>desc2</description>
            <title>title2</title>
        </result>
    </results>
</searchdoc>

I need to pull out each full <result> element.

There are multiple ways to solve this problem, depending on which version of the .NET Framework you are working on:

.NET 1.x, 2.0 and 3.0

You can easily obtain a filtered list of nodes from your XML document by issuing an XPath query via the XPathDocument class:

using (var reader = new StringReader("<results><result>...</result></results>"))
{
  var document = new XPathDocument(reader);
  var navigator = document.CreateNavigator();
  var results = navigator.Select("//result");

  while (results.MoveNext())
  {
    Console.WriteLine("{0}:{1}", results.Current.Name, results.Current.Value);
  }
}

.NET 3.5 and later

You should use LINQ to XML to query and filter XML hierarchies, since it offers a much more expressive API than the XPath syntax:

var document = XDocument.Parse("<results><result>...</result></results>");
var results = document.Descendants("result");

foreach (var item in results)
{
  Console.WriteLine("{0}:{1}", item.Name, item.Value);
}
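
The same "find every result element, then read its children" pattern, sketched outside .NET in Python with the sample XML from the question. This is only an illustration of the query shape; the rows variable and the dictionary layout are my own choices, not part of either .NET API.

```python
import xml.etree.ElementTree as ET

# Sample document from the question.
XML = """<searchdoc>
    <results>
        <result no="1">
            <url>a.com</url>
            <lastmodified>1/1/1</lastmodified>
            <description>desc1</description>
            <title>title1</title>
        </result>
        <result no="2">
            <url>b.com</url>
            <lastmodified>2/2/2/</lastmodified>
            <description>desc2</description>
            <title>title2</title>
        </result>
    </results>
</searchdoc>"""

root = ET.fromstring(XML)

# iter() walks all descendants, like "//result" in XPath or Descendants() in LINQ to XML.
rows = [
    {"no": result.get("no"), **{child.tag: child.text for child in result}}
    for result in root.iter("result")
]

for row in rows:
    print(row)
```

Note that element names are case-sensitive here too: searching for "Result" would match nothing in this document.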

How can I export a DBGrid to OOXML format (Excel 2007/2010 format) without Excel installed?

4 votes

I have a Delphi 2007 DBGrid that I'd like to allow the user to save in the newer Excel format (OOXML), but my requirement is that the user does not need to have Excel installed. Is anyone aware of any components that already do this? And yes, I did search already, but I have not found anything.

Off the top of my head I thought of TMS FlexCel Studio for VCL, but I was wrong: the current VCL version doesn't support xlsx. Their .NET edition does support xlsx, though...

So a quick google search pointed me to an EDN discussion that refers to these sites:

I have no knowledge about these products, but it might be worth a look...

Scala way of filling a template?

4 votes

In Ruby I could have this:

string=<<EOTEMPLATE
<root>
  <hello>
     <to>%s</to>
     <message>welcome mr %s</message>
  </hello>
  ...
</root>
EOTEMPLATE

And when I want to "render" the template, I would do this:

rendered = string % ["me@mail.com","Anderson"]

And it would fill the template with the values passed in the array. Is there a way to do this in Scala, other than using Java's String.format? If I write this in Scala:

val myStr = <root>
<hello>
<to>{address}</to>
<message>{message}</message>
</hello>
</root>

the resulting XML would already be "filled". Is there a way I could "templatize" the XML?

Using a function and Scala's XML:

val tmpl = { (address: String, message: String) =>
  <root>
    <hello>
      <to>{address}</to>
      <message>{message}</message>
    </hello>
  </root>
}

and:

tmpl("me@mail.com","Anderson")

Some sugar:

import scala.xml.Elem

def tmpl(f: Product => Elem) = new {
   def %(args: Product) = f(args)
}

val t = tmpl{case (address, message) => 
  <root>
    <hello>
      <to>{address}</to>
      <message>{message}</message>
    </hello>
  </root>
}

t % ("me@mail.com","Anderson")
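
The underlying idea, keeping the markup in one place and filling it through a function, is not specific to Scala. Here is a minimal Python sketch of it. TEMPLATE and render are hypothetical names, and the escaping step is my addition: Scala's XML literals escape embedded values for you, while plain string formatting does not.

```python
from xml.sax.saxutils import escape

# The template lives in one place; placeholders are named rather than positional.
TEMPLATE = """<root>
  <hello>
    <to>{address}</to>
    <message>welcome mr {name}</message>
  </hello>
</root>"""


def render(address, name):
    """Fill the template, escaping values so they cannot break the markup."""
    return TEMPLATE.format(address=escape(address), name=escape(name))


rendered = render("me@mail.com", "Anderson & Sons")
print(rendered)
```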

Deserialize <ArrayOf> in XML to List<>

4 votes

I have trouble deserializing the result from my WCF webservice. The method returns a List<RecipeEntity>, which is serialized to XML as shown below. When I try to deserialize I just get an exception, shown below. It seems I cannot deserialize <ArrayOfRecipe> to List<RecipeEntity>. Note that RecipeEntity is mapped by contract name to Recipe.

After searching I see many propose XmlArray and XmlElement attributes, but as far as I can tell they do not apply here on the GetRecipes() method. I have only seen them used on fields of serialized classes.

I know I could wrap the List<RecipeEntity> in a RecipeList class and return that instead, but I would rather deserialize directly to List<> for any given type.

Exception:

System.InvalidOperationException was caught
  Message=There is an error in XML document (1, 2).
  StackTrace:
       at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader, String encodingStyle, Object events)
       at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader, String encodingStyle)
       at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader)
       at GroceriesAppSL.Pages.Home.GetRecipesCallback(RestResponse response)
  InnerException: System.InvalidOperationException
       Message=<ArrayOfRecipe xmlns='Groceries.Entities'> was not expected.
       StackTrace:
            at Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializationReaderList1.Read5_Recipe()
       InnerException: 

Data contract:

[DataContract(Name = "Recipe", Namespace = "Groceries.Entities")]
public class RecipeEntity
{
    [DataMember] public int Id;
    [DataMember] public string Name;
    [DataMember] public string Description;
}

Implementation:

[ServiceContract]
public interface IMyService
{
    [OperationContract]
    [WebGet(ResponseFormat = WebMessageFormat.Xml, UriTemplate = "Recipes/{username}")]
    List<RecipeEntity> GetRecipes(string username);
}

public class MyService : IMyService
{
    public List<RecipeEntity> GetRecipes(string username)
    {
        return _recipeDB.Recipes.Select(ToEntity).ToList();
    }
}

Example XML result, for illustration purposes only.

<ArrayOfRecipe xmlns="Groceries.Entities" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Recipe>
<Id>139</Id>
<Name>ExampleRecipe</Name>
<Description>5 L milk;4 eggs</Description>
</Recipe>
<Recipe>...</Recipe>
<Recipe>...</Recipe>
<Recipe>...</Recipe>
...
</ArrayOfRecipe>

Deserialization code:

using (var xmlReader = XmlReader.Create(new StringReader(response.Content)))
{
    var xs = new System.Xml.Serialization.XmlSerializer(typeof(List<RecipeEntity>));
    var recipes = (List<RecipeEntity>)xs.Deserialize(xmlReader);
}

You are using DataContractSerializer to serialize and XmlSerializer to deserialize. Those two don't use the same approach. You must either use DataContractSerializer in your deserialization method, or mark your operation or service with the XmlSerializerFormat attribute (in which case WCF will use XmlSerializer instead of DataContractSerializer). The DataContract and DataMember attributes are only for DataContractSerializer. XmlSerializer uses its own attributes, defined in the System.Xml.Serialization namespace.
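
The "was not expected" message is about the qualified name of the root element: the service emits ArrayOfRecipe in the Groceries.Entities namespace, and a reader that expects a different name or namespace will not recognise it. The Python sketch below only illustrates that point with a trimmed copy of the sample response. It is not a WCF fix, and the variable names are mine.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample response from the question.
XML = """<ArrayOfRecipe xmlns="Groceries.Entities">
  <Recipe>
    <Id>139</Id>
    <Name>ExampleRecipe</Name>
    <Description>5 L milk;4 eggs</Description>
  </Recipe>
</ArrayOfRecipe>"""

root = ET.fromstring(XML)
NS = "{Groceries.Entities}"

# A lookup that ignores the namespace finds nothing...
unqualified = root.findall("Recipe")
# ...while the namespace-qualified name matches what the service emitted.
qualified = root.findall(NS + "Recipe")

print(root.tag)
print(len(unqualified), len(qualified))
names = [recipe.findtext(NS + "Name") for recipe in qualified]
print(names)
```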

Data Binding Wizard in Delphi XE - can It be configured to map to MSXML interfaces?

4 votes

The Data Binding Wizard in Delphi XE generates classes and interfaces that inherit from Delphi's own implementation of the DOM (ADOM XML v4), which doesn't appear to support validation against schemas - the 'validate on parse' option only works with the MSXML vendor type - as can be seen from the VCL source code and also the behavior of XMLDocument component in the IDE. All validation support seems to be based on the MSXML implementation, which makes the wizard useless if you need schema validation. Am I missing something here?

Is there a way to configure the binding wizard (or some underlying utility) to generate classes and interfaces based on MSXML, which supports validation?

Or are there calls/interfaces that support schema validation using Delphi's implementation of ADOM XML that I haven't come across?

MNG

The code generated by the XML Data Binding Wizard depends on the units XMLDoc and XMLIntf (the document references are TXMLDocument and IXMLDocument).

IXMLDocument is implemented by TXMLDocument, which is the generic wrapper around XML DOMs supported by Delphi. The DOM used by TXmlDocument depends on the value of the DOMVendor property.

If a DOMVendor is not specified when activating a TXMLDocument instance (it isn't as the XML Data Binding Wizard generates DOM neutral code), then the actual XML DOM used depends on these two members of the XMLDOM unit:

var
  DefaultDOMVendor: string;
  DOMVendors: TDOMVendorList;

In your case, it appears that the MSXML DOM is either the default XML DOM, or the only XML DOM available.

So you should check the values of DefaultDOMVendor and the DOMVendors list.

It would certainly help if you can edit your question with the values of the above, and a reproducible case that shows how you observed the MSXML DOM is being used.

Edit:

You can force the run-time use of a specific XML DOM vendor right before you load your XML root node, or create a new XML root node, like this:

DefaultDOMVendor := 'MSXML';

Select XML nodes using TSQL

Asked on Tue, 26 Apr 2011 by Xi xml tsql
4 votes

Hi:

My SQL Server 2008 database table has an XML field. I would like to select nodes from the field, along with other fields. For example, consider the following table:

DECLARE @TestTable AS TABLE ([Id] VARCHAR(20),  [Name] XML )
INSERT INTO @TestTable SELECT 
    '001', 
    '<Name><First>Ross</First><Last>Geller</Last></Name>'
UNION ALL SELECT
    '002', 
    '<Name><First>Rachel</First><Last>Green</Last></Name>'

I want the result to be:

001  |  Ross  |  Geller     
002  | Rachel | Green

Is that possible? Thanks,

This should do it:

DECLARE @TestTable AS TABLE (
     [Id] VARCHAR(20),
     [Name] XML
    )
INSERT  INTO @TestTable
        SELECT  '001',
                '<Name><First>Ross</First><Last>Geller</Last></Name>'
        UNION ALL
        SELECT  '002',
                '<Name><First>Rachel</First><Last>Green</Last></Name>'

SELECT  Id,
        x.value('(/Name/First)[1]', 'varchar(20)') AS [First],
        x.value('(/Name/Last)[1]', 'varchar(20)') AS [Last]
FROM    @TestTable t
CROSS APPLY [Name].nodes('/Name') AS tbl ( x )
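
If the rows are already in application code, the same shredding step (one output row per Name document, with First and Last pulled out as columns) can be sketched in Python. This is only a client-side analogue of the query above, assuming the two sample rows from the question; shred is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The two sample rows from the question: (Id, Name XML).
rows = [
    ("001", "<Name><First>Ross</First><Last>Geller</Last></Name>"),
    ("002", "<Name><First>Rachel</First><Last>Green</Last></Name>"),
]


def shred(row_id, name_xml):
    """Return (Id, First, Last) for one row by parsing its XML value."""
    name = ET.fromstring(name_xml)
    return row_id, name.findtext("First"), name.findtext("Last")


result = [shred(*row) for row in rows]
for record in result:
    print(" | ".join(record))
```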