I want to use Python for doing full text searches of XML data.
    <elements>
      <elem id="1">some element</elem>
      <elem id="2">some other element</elem>
      <elem id="3">some element
        <nested id="1"> other nested element </nested>
      </elem>
    </elements>
The most basic functionality I want is this: a search for "other" under an XPath ("/elements/elem") should return at least the value of the id attribute for the matching element (elem 2) and for the matching nested element (elem 3, nested 1), or else the matching XPaths.
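To make the expected result concrete, here is a minimal sketch using only the standard library's xml.etree.ElementTree. The function name `search` and the way the result paths are built from id attributes are my own assumptions for illustration, not an existing API, and the sketch assumes every element carries an id attribute.

```python
import xml.etree.ElementTree as ET

XML = """<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element <nested id="1"> other nested element </nested></elem>
</elements>"""


def search(root, child_tag, term):
    """Yield (path, id) for each element at or below root/child_tag whose own text contains term."""
    def walk(node, node_path):
        # "Own" text: the text before the first child plus the text after each child.
        own = (node.text or "") + "".join(child.tail or "" for child in node)
        if term in own:
            yield node_path, node.get("id")
        # Recurse, so nesting depth is not limited.
        for child in node:
            yield from walk(child, f"{node_path}/{child.tag}[@id='{child.get('id')}']")

    for start in root.findall(child_tag):
        yield from walk(start, f"/{root.tag}/{start.tag}[@id='{start.get('id')}']")


root = ET.fromstring(XML)
hits = list(search(root, "elem", "other"))
for path, elem_id in hits:
    print(path, elem_id)
```

For the sample document this should report elem 2 and the nested element inside elem 3, which is the behaviour I am after.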
The solution should be flexible and scalable. I am looking for possible combinations of these features:
- search nested elements (infinite depth)
- search attributes
- search for sentences and paragraphs
- search using wildcards
- search using fuzzy matching
- return precise matching info
- good search speed for large XML files
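As a rough illustration of what I mean by fuzzy matching, the standard library's difflib can compare a search term against the words of each element's text. This is only a sketch under my own assumptions: `fuzzy_search` and the `cutoff` value are hypothetical, and it only tolerates misspellings (no stemming or synonyms).

```python
import difflib
import xml.etree.ElementTree as ET


def fuzzy_search(root, term, cutoff=0.8):
    """Return (tag, id, matched_word) for elements whose text has a word close to term."""
    hits = []
    for elem in root.iter():  # iter() walks the whole tree, at any depth
        words = (elem.text or "").lower().split()
        close = difflib.get_close_matches(term.lower(), words, n=1, cutoff=cutoff)
        if close:
            hits.append((elem.tag, elem.get("id"), close[0]))
    return hits


root = ET.fromstring(
    '<elements><elem id="1">some element</elem>'
    '<elem id="2">some other element</elem></elements>'
)
fuzzy_hits = fuzzy_search(root, "othr")  # misspelled on purpose
print(fuzzy_hits)
```

Scanning every word of every element like this would not scale to large files, which is why I am asking about index-based alternatives as well.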
I don't expect a single solution to offer all of this functionality; I will have to combine existing pieces and code the rest myself. But first I would like to know what is out there: which libraries and approaches you would usually use for this, and what their pros and cons are.
EDIT: Thanks for the answers so far, I added detail and started a bounty.
Compared to the feature list that was added later:
- search nested elements (infinite depth): yes
- search attributes: yes
- search for sentences and paragraphs: no, unless "paragraphs" are actual XML elements, in which case yes. "Sentences" as such, no.
- search using wildcards: yes (regular expressions)
- search using fuzzy matching: no (assuming this means stemming, synonyms, and so on)
- return precise matching info: yes
- good search speed for large XML files: yes, except when your files are so large that you would need a full-text index to get good speed anyway.
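To illustrate the "yes" items above (nested elements, attributes, wildcards via regular expressions, precise match info), here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree and re, which may not be the library you end up using; `regex_search` and the `note` attribute in the sample are made up for the example.

```python
import re
import xml.etree.ElementTree as ET


def regex_search(root, pattern):
    """Return (tag, id, where, matched_text) for every text or attribute value matching pattern."""
    rx = re.compile(pattern)
    hits = []
    for elem in root.iter():  # visits nested elements at any depth
        m = rx.search(elem.text or "")
        if m:
            hits.append((elem.tag, elem.get("id"), "text", m.group(0)))
        for name, value in elem.attrib.items():
            m = rx.search(value)
            if m:
                hits.append((elem.tag, elem.get("id"), "@" + name, m.group(0)))
    return hits


doc = ET.fromstring(
    '<elements><elem id="1" note="another">some element</elem>'
    '<elem id="2">some other element</elem></elements>'
)
# \w*other acts like the wildcard "*other": any word ending in "other".
regex_hits = regex_search(doc, r"\b\w*other\b")
print(regex_hits)
```

Each hit says which element matched, whether the match was in the text or in an attribute, and the exact matched string. This is a linear scan, so the caveat about very large files still applies.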
The only way I see to satisfy all of your requests would be to load your files into a native XML database that supports "real" full-text search (probably via XQuery Full Text) and use that. I can't help you much further with that; maybe Sedna, which seems to have a Python API and seems to support full-text search?