Why are AppStream metainfo files XML data?

This is a question raised quite often, the last time in a blog post by Thomas, so I thought it would be a good idea to give a slightly longer explanation (and also create an article to link to…).

There are basically three reasons for using XML as the default format for metainfo files:

1. XML is easily forward/backward compatible, while YAML is not

This is a matter of extending the AppStream metainfo files with new entries, or adapting existing entries to new needs.

Take this example XML line for defining an icon for an application:

and now the equivalent YAML:

Now suppose we want to add width and height properties to the icons, because we started to allow more than one icon size. Easy for the XML:

This line of XML can be read correctly by both old parsers, which will just see the icon as before without reading the size information, and new parsers, which can make use of the additional information if they want. The change is both forward and backward compatible.
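
As a minimal sketch of that claim in Python: the icon line below is a reconstruction based on the description above (type "cached", file "foobar.png"), and the size values, the wrapping `component` element and the function names are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed icon line before and after adding the size attributes.
OLD_XML = '<component><icon type="cached">foobar.png</icon></component>'
NEW_XML = ('<component><icon type="cached" width="64" height="64">'
           'foobar.png</icon></component>')


def read_icon_v1(xml_text):
    """Parser written before sizes existed: reads only type and file name."""
    icon = ET.fromstring(xml_text).find("icon")
    return {"type": icon.get("type"), "name": icon.text}


def read_icon_v2(xml_text):
    """Size-aware parser: width and height are optional attributes."""
    icon = ET.fromstring(xml_text).find("icon")
    width, height = icon.get("width"), icon.get("height")
    return {
        "type": icon.get("type"),
        "name": icon.text,
        "width": int(width) if width else None,
        "height": int(height) if height else None,
    }


# The old parser simply ignores the attributes it does not know about,
# and the new parser copes with documents that lack them.
print(read_icon_v1(NEW_XML))
print(read_icon_v2(OLD_XML))
```

Neither parser needs a special case for the other document version, which is the whole point.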

This looks different in the YAML file. The “foobar.png” is a string, and parsers will expect a string as the value of the cached key, while we would need a dictionary there to include the additional width/height information:

The change shown above will break existing parsers though. Of course, we could add a cached2 key, but that would require people to write two entries, to keep compatibility with older parsers:

Less than ideal.
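
To make the breakage concrete, here is a minimal sketch in Python. The dictionaries stand in for what a YAML loader would hand to the parser (so the sketch has no dependencies); the `Icons` and `name` keys, the sizes, and the function names are assumptions for illustration, while `cached` and `cached2` are the keys discussed above.

```python
# Loader output for the original document:
#   Icons:
#     cached: foobar.png
OLD_DOC = {"Icons": {"cached": "foobar.png"}}

# Loader output after turning the value into a dictionary:
#   Icons:
#     cached:
#       name: foobar.png
#       width: 64
#       height: 64
NEW_DOC = {"Icons": {"cached": {"name": "foobar.png", "width": 64, "height": 64}}}

# Loader output for the workaround with a second key:
#   Icons:
#     cached: foobar.png
#     cached2:
#       name: foobar.png
#       width: 64
#       height: 64
COMPAT_DOC = {
    "Icons": {
        "cached": "foobar.png",
        "cached2": {"name": "foobar.png", "width": 64, "height": 64},
    }
}


def read_cached_icon(doc):
    """Parser written when 'cached' was always a plain string."""
    value = doc["Icons"]["cached"]
    if not isinstance(value, str):
        raise TypeError(f"expected a string for 'cached', got {type(value).__name__}")
    return value


def try_read(doc):
    """Run the old parser and report a failure as text instead of crashing."""
    try:
        return read_cached_icon(doc)
    except TypeError as err:
        return f"error: {err}"


for label, doc in (("old", OLD_DOC), ("new", NEW_DOC), ("compat", COMPAT_DOC)):
    print(label, "->", try_read(doc))
```

The old parser only survives the new format if authors duplicate the file name under both keys.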

While there are ways to break compatibility in XML documents too, as well as ways to design YAML documents so that the risk of breaking compatibility later is minimized, keeping the format future-proof is far easier with XML than with YAML (and sometimes simply not possible with YAML documents). This makes XML a good choice for this use case, since we cannot easily do transitions with thousands of independent upstream projects and need to care about backwards compatibility.

2. Translating YAML is not much fun

A property of AppStream metainfo files is that they can be easily translated into multiple languages. For that, tools like intltool and itstool exist to aid with translating XML using Gettext files. This can be done at project build-time, keeping a clean, minimal XML file, or before, storing the translated strings directly in the XML document. Generally, YAML files can be translated too. Take the following example (shamelessly copied from Dolphin):

This would become something like this in YAML:

Looks manageable, right? Now, AppStream also covers long descriptions, where individual paragraphs can be translated by the translators. This looks like this in XML:

Now, how would you represent this in YAML? Since we need to preserve the paragraph and enumeration markup somehow, and creating a large chain of YAML dictionaries is not really a sane option, the only choices would be:

  • Embed the HTML markup in the file, and risk careless translators breaking the markup by e.g. not closing tags.
  • Use Markdown, and risk people not writing the markup correctly when translating a really long string in Gettext.

In both cases, we would lose the ability to translate individual paragraphs, which also means that as soon as the developer changes the original text in YAML, translators would need to translate the whole thing again, which is inconvenient.
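
A rough sketch of the difference, in Python. It assumes that each `<p>` and `<li>` element becomes its own translatable message, as described above, and compares that with treating the whole marked-up text as one string. The description text and all function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical long description, before and after the developer edits
# a single paragraph.
DESCRIPTION_V1 = """
<description>
  <p>Foobar is a simple image viewer.</p>
  <p>Features:</p>
  <ul>
    <li>Opens PNG and JPEG files</li>
    <li>Runs a slideshow</li>
  </ul>
</description>
""".strip()

DESCRIPTION_V2 = DESCRIPTION_V1.replace(
    "simple image viewer", "simple and fast image viewer"
)


def paragraph_units(xml_text):
    """One message per <p> or <li> element, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter() if el.tag in ("p", "li")]


def blob_units(xml_text):
    """The whole marked-up text as a single message (the embedded-markup option)."""
    return [xml_text]


def needs_retranslation(old_units, new_units):
    """Messages in the new version that have no existing translation."""
    known = set(old_units)
    return [unit for unit in new_units if unit not in known]


per_paragraph = needs_retranslation(
    paragraph_units(DESCRIPTION_V1), paragraph_units(DESCRIPTION_V2)
)
whole_text = needs_retranslation(
    blob_units(DESCRIPTION_V1), blob_units(DESCRIPTION_V2)
)

print(per_paragraph)         # only the edited paragraph
print(len(whole_text[0]))    # the entire description, markup included
```

With per-paragraph messages, three of the four existing translations stay valid after the edit; with one big string, everything has to be redone.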

On top of that, there are no tools to translate YAML properly that I am aware of, so we would need to write those too.

3. Allowing XML and YAML makes a confusing story and adds complexity

While adding YAML as a format would not be too hard, given that we already support it for DEP-11 distro metadata (Debian uses this), it would make the business of creating metainfo files more confusing. At the moment, we have a clear story: write the XML, store it in /usr/share/metainfo, and use standard tools to translate the translatable entries. Adding YAML to the mix adds an additional choice that needs to be supported for eternity and also has the problems mentioned above.

I wanted to add YAML as a format for AppStream, and we discussed this at the hackfest as well, but in the end I think it isn’t worth the pain of supporting it for upstream projects (remember, someone needs to maintain the parsers and specification too, and keep XML and YAML in sync and updated). Don’t get me wrong, I love YAML, but for translated metadata which needs a guarantee of format stability it is not the ideal choice.

So yeah, XML isn’t fun to write by hand. But for this case, XML is a good choice.

11 Comments

  • not-a-yaml-supporter commented on 26. April 2016

    I have had no real contact with YAML in years, and never used it much, just changed some configs, so I wouldn’t call myself a supporter of YAML.

    But your examples for the first point you are trying to make are just bad. You’re using YAML wrong, so that the comparison is in favor of XML.

    Icon example should look more like:

    If you want a list of things, use a list …

    For the language part you could use:

    • Matthias commented on 27. April 2016

      That’s why it is an example 😉
      What if I don’t know that I need a list later, when initially designing the YAML? With a dictionary, I could filter out exactly the type I want, e.g. for icons:

      That cached icons want additional properties is something that’s added later. I am not saying it is impossible to sometimes have backward compatibility, but I am saying that breakage happens, and you can’t really protect against that without consulting a fortune teller or building a time machine.

      For the translation stuff, the main point is that translating the tag is almost impossible to do in a sane way, and that YAML is usually not a translated format, so no toolchain for translating it exists yet.
      A YAML version of AppStream *distro* metadata exists after all, and we already have the issue that certain extensions will lead to breaking changes. See https://appstream.debian.org/sid/main/metainfo/totem.html for what the YAML looks like.

      • Jimmy Berry commented on 27. April 2016

        You are still not comparing them equivalently. Just as XML can use either all tag/value pairs or properties so can YAML use lists or no list.

        Consider the following:

        value1
        value2

        vs

        At the time you write it that way you decided that thing1 was always a single value and could not be repeated. At the same time when you used separate element key/value pair you made them a repeatable list with a single initial value.

        Just as I cannot repeat thing1:

        When you write YAML the same principles apply.

        # or formatted on next line
        thing1: - value1
        thing2: - value2

        vs the properties approach


        thing1: value1
        thing2: value2

        They are backwards-compatibly expandable in the same ways. Your comparison was against different data representations. If this is not clear, consider the parsing code.


        root->thing1[0]

        vs


        root['thing1']

        YAML could just as easily be used. You just need to make the same sort of forward thinking decision as you have done for XML. May need to throw in default/value key or what not in lists to allow expansion as well, but it’s just like accessing element [0].

  • zanny commented on 27. April 2016

    So I was in the last comment thread, but this post is a great insight into the conversation, so thank you for it!

    Mindshare sucks. Someone can accept that XML is better in this situation, but still hate writing XML. The problem with appstream as has been discussed seems to be a combination of overworked developers, abandoned projects, and people who don’t like XML.

    I’m curious, though, about how much actual infrastructure is needed to parse YAML appstream data when Debian is already using it, and since Debian + Ubuntu is by far the most mature appstream repository out there, that in many ways the standard matters less since the most common consumer of it is shipping mixed YAML + XML appstream data to begin with.

    My experience is only really in depth with Archlinux’s availability of metadata, but let’s just say it is kind of lacking. But a lot of the metadata from Debian is unusable in Arch, since a lot of it is YAML based, which seems to show that maintainers who are willing to write appstream data write YAML versions for Debian while ignoring XML distro generic data, regardless of technical merit.

    None of this is an easy problem, or it would have been solved in the 90s. We probably really do need some tooling infrastructure to streamline the creation of metadata for developers if we ever want to see appstream become pervasive, though.

    • Matthias commented on 27. April 2016

      I think you confuse metainfo files and AppStream distro metadata here.

      For the distro metadata, YAML is supported for historic reasons at Debian.

      For upstream metadata, appdata and metainfo files, *only* XML is supported, so every distro only has to know XML. The YAML data of Debian and Ubuntu is only usable for Debian and Ubuntu, because it contains packaging information from those distributions, which doesn’t apply to Arch.
      Arch itself can also do (almost) nothing to increase the amount of metadata, since this is a thing upstream developers need to provide. Fedora though has patched in a lot of metainfo files into their packages, so you can likely take files from there and add them to the Arch packaging 😉 .

      So, there is no such problem that some distros are on YAML and others aren’t, since the source is always a metainfo XML file. Simple as that.

      And I too don’t like XML for most cases – I will prefer JSON (text data serialization over the net, fast parsing) and YAML (files which need to be read/edited by humans) usually. But in this situation, XML (extensible and translatable) is simply the technically best choice, no matter if I personally like that or not.

  • Gregor commented on 29. April 2016

    Hi,
    thanks for the article. I also would like to argue that the first XML/YAML comparison is not quite sensible. Why should the type property (“cached”) be mixed up with the name property (“foobar.png”) in one line? This is clearly not extensible.

    For example take a look at http://codebeautify.org/xml-to-yaml. This converts your XML example to:

    icon:
      _type: cached
      __text: "foobar.png"

    This is as extensible as the given XML.

    My suggestion would be to move the first paragraph from the top to bottom and rename it from “1. XML is easily forward/backward compatible, while YAML is not” to “3. XML is easily forward/backward compatible, while YAML is not when best practices are not applied” 🙂

    • Matthias commented on 1. May 2016

      By having that as-a-list approach, we implicitly allow defining multiple icons of type=cached, which is not allowed by the spec.
      So I think taking a dict there makes sense from that point of view. The automatic XML->YAML converter doesn’t know about that detail.

      The thing is that my argument is not that this example is great, but that breaking YAML forward-compatibility is really easy as the requirements of the formats evolve. Maybe in the past having a dict there made perfect sense, while today a list is wanted. XML doesn’t care much about that, in YAML you can’t go forward without breaking the document structure.

  • Jasem Mutlaq commented on 1. May 2016

    Thanks for the write-up. Any idea when Muon will use this data? I have 16.04 and the information for my project KStars is still from a few years ago, including screenshots from really, really old versions.

    We had the AppStream (org.kde.kstars.appdata.xml file) for a couple of years now and I have yet to see a software center that uses the information contained within!

    • Matthias commented on 1. May 2016

      Muon will likely never use AppStream, but Discover uses AppStream in Kubuntu 16.04. For some reason it doesn’t seem to load the description from AppStream and rather decides to load the package description instead.
      This is probably because Discover is using the APT backend on Kubuntu, instead of the PackageKit one.

      But that being said, KStars doesn’t even have correct metadata because the Debian packaging is messed up, leading to the .desktop file and the metainfo file being in separate packages.
      See
      https://appstream.debian.org/sid/main/issues/kstars-data.html and https://appstream.debian.org/sid/main/issues/kstars.html
      Normally I would say file a bug, but I just fixed that in Git master for that package 😉 But please report bugs in case you find other software doing this!
