• Forum
  • Lounge
  • Storage formats for hierarchical structu

 
 
Storage formats for hierarchical structures

Hi, all.

Our team is coming to need to make a decision regarding a file format for storing hierarchical structures, ie data saved to be used later for constructing our objects. We can choose pretty much whatever format, as long as it isn't too unsightly.

We've considered XML, but it seems to get unsightly quickly. We've looked at JSON, and it seems like it might fit nicely. Of course, there are countless formats.

Does anyone have any ideas/input regarding formats they've used and what they found worked well?
Last edited on
I've played around with JSON and I've found that it was quite easy to extract data, thanks to the wealth of good-quality C++ libraries which via a simple Google search.

I haven't used any parsers XML at all (I haven't found a use where I wouldn't simply use JSON) yet, so I can't tell you if the library situation is as good as/better than JSON's or not.
First, nearly everything in this post is opinion, so just imagine I've written "in my opinion" where necessary. :)

I've used symbolic expressions (i.e., Lisp) to great success.

It's more readable as JSON, and much more readable than XML, but the main advantages are that
- It's (obviously) trivial to manipulate/restructure/analyze/emit programmatically;
- It's got excellent support for hand-editing if necessary; and
- It's got almost no syntax.

JSON is not a bad choice. The tool support for JSON (especially for hand-editing) is worse than Lisp's, but it's increasingly popular. The equivalent manipulation language is Javascript, which is less powerful and less convenient. If you were working with Javascript I would select JSON. It is noisier (there is more syntax) than S-expressions.

There are existing parsers for S-expressions:
http://sexpr.sourceforge.net/
http://people.csail.mit.edu/rivest/sexp.html

There are also small embeddable Lisp interpreters if you really need such a thing. There are equivalent EMCAScript interpreters for JSON, but I expect that a lisp interpreter is likely to be much smaller.

With this being said, you should pick a format consistent with the other tools you'll be working with. Even if it's XML.

Last edited on
I really dislike xml. The only xml has ever accomplished is turning a KB or 2 of data into a MB or 2 of the same data. This is mostly the fault of the users, but its still a neverending source of bloat under most applications. Ive been working with it a lot this past year and I am not exaggerating... bzip can cut a 2gb xml file down to just a couple of MB. And it takes time to parse through all the bloat.

json seems to be much better, but I have only just started to look at it.



When abused, any format is shit.

XML is symbolic expressions, except noisier, and the average parser seems to be heavier. JSON, on the other hand, comes with built-in meaning, and is not extensible. I realized that I haven't mentioned that JSON does not attempt to solve the same problem that an extensible format attempts to solve. If you will not benefit from an extensible format, JSON is perhaps a superior choice. Make sure you know what JSON does not support while you make this decision.

As I've indicated, the main advantage of sexps over XML is that s-expressions are much easier to manipulate (e.g., converting from Lisp to XML is dead simple, but going the opposite direction is less so.) It's also less redundant, and there are widely accepted standard schema for sexps, while XML has no equivalent.

Last edited on
The s-expression is more interesting than I thought it would be. I think the Lisp syntax is what scares me off the most, especially the parens. But a small lisp interpreter could very well carry less overhead than the JSON parser.

It seems most don't seem to prefer XML for complex data storage.

We're using C++, so we're using a JSON library to parse it for us. So far it seems to work OK, but we haven't tried storing enormously complex data in it yet.
All of our tools are pretty independent of each other, so we (should) be able to use basically whatever format we want. It's just for storage, so it'll be read into memory and then passed off to other tools when needed.

I agree that any format can be abused. That's probably what puts me off from XML.

I see that the sexpr site you linked to says s-expressions would be good for storing "Arbitrary directed or un-directed graphs". I can foresee JSON presenting a problem here, and graphs will be in our data.

Did you find that losing the readability of JSON for sexpr was worth the trade?
That's probably what puts me off from XML.

"prone to abuse" cannot apply to XML but not to S-expressions.
I'm unsure whether XML is more or less prone to abuse given how much more difficult XML is to manipulate.

Did you find that losing the readability of JSON for sexpr was worth the trade?

Call me insane, but Lisp is perfectly readable (when indented correctly). The tree structure should be visible from the indentation; if you're counting parens, you're doing something wrong.

Apparently someone wrote an article about it (although maybe that's a bad sign):
https://yoo2080.wordpress.com/2014/07/04/it-is-not-hard-to-read-lisp-code/#sec-1

I see that the sexpr site you linked to says s-expressions would be good for storing "Arbitrary directed or un-directed graphs".

Yes, lisp programs can contain infinite cycles. There's a special notation for displaying cyclic data in lists. I don't remember whether the library at my first link supports them.

But a small lisp interpreter could very well carry less overhead than the JSON parser.

I suggest you avoid making the jump from parsing to full interpretation unless you actually need it: the advice is just basic minimalism -- embedding a second programming language into your project is usually a big deal conceptually.
If XML is equivalent to S-expressions, how can S-expressions contain cycles when XML can't? Or, at least, I've never seen an XML cycle.
Last edited on

Call me insane, but Lisp is perfectly readable (when indented correctly). The tree structure should be visible from the indentation; if you're counting parens, you're doing something wrong.

I've only moderate experience with Lisp, so learning how to properly indent would do me well. I shouldn't focus on readability. I'm so used to C-like languages that Lisp's appearance is extremely foreign to me.

Hmm, I think I may have worded things poorly. I should have said parser, not interpreter for the s-expressions. We already have a scripting engine embedded, and don't need another.
I'll have to give s-expressions some serious thought (and look at how to properly indent Lisp), since graphs will ultimately be an important part of our data.
If XML is equivalent to S-expressions, how can S-expressions contain cycles when XML can't?

XML is extensible, which implies it can be used to represent cycles.
A quick search yields this:
http://stackoverflow.com/a/11168563
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
<graph>
  <vertex id="1">
    <!-- vertex data -->
    <edge to="3"/>
  </vertex>
  <vertex id="2">
    <!-- vertex data -->
    <edge to="1"/>
    <edge to="3"/>
  </vertex>
  <vertex id="3">
    <!-- vertex data -->
    <edge to="1"/>
    <edge to="2"/>
  </vertex>
</graph>


You can obviously do the same thing using s-expressions:
1
2
3
(((v1 data goes here) . (3))
 ((v2 data goes here) . (1 3))
 ((v3 data goes here) . (1 2)))

Last edited on
Oh, that. I thought you meant something else.

OP, if you need to serialize arbitrary graphs, perhaps this would be more useful than any text-based format: https://github.com/Helios-vmg/cppserialization/
Interesting. I find the s-expressions far easier to read than the XML.

That's very interesting, Helios. I could go down the rabbit hole in regards to serialized v text-based, but the size difference would probably be considerable. Would the serialization of numerical data potentially present an issue with devices with different endianness? I'm not aware of a way to determine then endianness to determine whether a byte swapping would be necessary when deserializing.
It's possible and even common to define a binary format that's independent of the CPU endianness.
For example, a common solution is
0 = 00 00 00 00
1 = 01 00 00 00
16711935 = FF 00 FF 00
287454020 = 44 33 22 11
It's also possible to define variable-length encodings.
Hmm. I will have to loop up some more on how to do that, but it certainly seems reasonable.
Thank you for the feedback, everyone!
Topic archived. No new replies allowed.