Why are AppStream metainfo files XML data?

This is a question raised quite often, the last time in a blog post by Thomas, so I thought it would be a good idea to give a slightly longer explanation (and also create an article to link to…).

There are basically three reasons for using XML as the default format for metainfo files:

1. XML is easily forward/backward compatible, while YAML is not

This is a matter of extending the AppStream metainfo files with new entries, or adapting existing entries to new needs.

Take this example XML line for defining an icon for an application:

<icon type="cached">foobar.png</icon>

and now the equivalent YAML:

Icons:
  cached: foobar.png

Now consider we want to add a width and height property to the icons, because we started to allow more than one icon size. Easy for the XML:

<icon type="cached" width="128" height="128">foobar.png</icon>

This line of XML can be read correctly by both old parsers, which will just see the icon as before without reading the size information, and new parsers, which can make use of the additional information if they want. The change is both forward and backward compatible.
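To illustrate, here is a minimal Python sketch using the standard library's ElementTree. The two reader functions are hypothetical stand-ins for an "old" and a "new" parser, not code from any AppStream implementation; the point is only that a reader which looks up the attributes it knows is not disturbed by additional ones.

```python
import xml.etree.ElementTree as ET

OLD_DOC = '<icon type="cached">foobar.png</icon>'
NEW_DOC = '<icon type="cached" width="128" height="128">foobar.png</icon>'


def read_icon_old(xml_text):
    """Hypothetical 'old' reader: knows only the type attribute and the name."""
    elem = ET.fromstring(xml_text)
    return {"type": elem.get("type"), "name": elem.text}


def read_icon_new(xml_text):
    """Hypothetical 'new' reader: also picks up the optional size attributes."""
    elem = ET.fromstring(xml_text)
    icon = {"type": elem.get("type"), "name": elem.text}
    for key in ("width", "height"):
        value = elem.get(key)
        if value is not None:
            icon[key] = int(value)
    return icon


# The old reader returns the same result for both documents.
old_view_of_old = read_icon_old(OLD_DOC)
old_view_of_new = read_icon_old(NEW_DOC)
# The new reader copes with the old document and uses the sizes when present.
new_view_of_old = read_icon_new(OLD_DOC)
new_view_of_new = read_icon_new(NEW_DOC)
```
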

This looks different with the YAML file. The “foobar.png” is a string-type, and parsers will expect a string as value for the `cached` key, while we would need a dictionary there to include the additional width/height information:

Icons:
  cached:
    name: foobar.png
    width: 128
    height: 128

The change shown above will break existing parsers though. Of course, we could add a `cached2` key, but that would require people to write two entries, to keep compatibility with older parsers:

Icons:
  cached: foobar.png
  cached2:
    name: foobar.png
    width: 128
    height: 128

Less than ideal.
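To make the breakage concrete, here is a small Python sketch. It does not use a real YAML library: the dictionaries stand in for what a YAML parser would hand back for the documents above, and `read_cached_icon_old` is a made-up reader. A real old parser might fail in a different way (for example by passing the dictionary on as if it were a filename); the explicit type check is just the simplest way to show the mismatch.

```python
def read_cached_icon_old(doc):
    """Hypothetical 'old' reader: expects the value of `cached` to be a string."""
    value = doc["Icons"]["cached"]
    if not isinstance(value, str):
        raise TypeError("expected a string for 'cached', got " + type(value).__name__)
    return value


# Assumed parser output for the original and the extended YAML document.
old_doc = {"Icons": {"cached": "foobar.png"}}
new_doc = {"Icons": {"cached": {"name": "foobar.png", "width": 128, "height": 128}}}

# The original document is read without problems.
old_result = read_cached_icon_old(old_doc)

# The extended document is rejected, because the value is now a dictionary.
try:
    read_cached_icon_old(new_doc)
    new_doc_accepted = True
except TypeError:
    new_doc_accepted = False
```
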

While there are ways to break compatibility in XML documents too, as well as ways to design YAML documents in a way which minimizes the risk of breaking compatibility later, keeping the format future-proof is far easier with XML compared to YAML (and sometimes simply not possible with YAML documents). This makes XML a good choice for this use case, since we cannot easily do transitions with thousands of independent upstream projects, and need to care about backwards compatibility.

2. Translating YAML is not much fun

A property of AppStream metainfo files is that they can be easily translated into multiple languages. For that, tools like intltool and itstool exist to aid with translating XML using Gettext files. This can be done at project build-time, keeping a clean, minimal XML file, or before, storing the translated strings directly in the XML document. Generally, YAML files can be translated too. Take the following example (shamelessly copied from Dolphin):

<summary>File Manager</summary>
<summary xml:lang="bs">Upravitelj datoteka</summary>
<summary xml:lang="cs">Správce souborů</summary>
<summary xml:lang="da">Filhåndtering</summary>

This would become something like this in YAML:

Summary:
  C: File Manager
  bs: Upravitelj datoteka
  cs: Správce souborů
  da: Filhåndtering

Looks manageable, right? Now, AppStream also covers long descriptions, where individual paragraphs can be translated by the translators. This looks like this in XML:

<description>
  <p>Dolphin is a lightweight file manager. It has been designed with ease of use and simplicity in mind, while still allowing flexibility and customisation. This means that you can do your file management exactly the way you want to do it.</p>
  <p xml:lang="de">Dolphin ist ein schlankes Programm zur Dateiverwaltung. Es wurde mit dem Ziel entwickelt, einfach in der Anwendung, dabei aber auch flexibel und anpassungsfähig zu sein. Sie können daher Ihre Dateiverwaltungsaufgaben genau nach Ihren Bedürfnissen ausführen.</p>
  <p>Features:</p>
  <p xml:lang="de">Funktionen:</p>
  <p xml:lang="es">Características:</p>
  <ul>
    <li>Navigation (or breadcrumb) bar for URLs, allowing you to quickly navigate through the hierarchy of files and folders.</li>
    <li xml:lang="de">Navigationsleiste für Adressen (auch editierbar), mit der Sie schnell durch die Hierarchie der Dateien und Ordner navigieren können.</li>
    <li xml:lang="es">barra de navegación (o de ruta completa) para URL que permite navegar rápidamente a través de la jerarquía de archivos y carpetas.</li>
    <li>Supports several different kinds of view styles and properties and allows you to configure the view exactly how you want it.</li>
    ....
  </ul>
</description>
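As a rough sketch of why this structure is convenient for tools, the following Python snippet picks the paragraphs for one locale out of a shortened version of the description above. The function name `paragraphs_for` and the use of "C" for the untranslated text are choices made for this example only. It also does not fall back to the untranslated paragraph when a translation is missing, which a real consumer would probably want.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened version of the Dolphin description, for illustration only.
DESCRIPTION = """
<description>
  <p>Dolphin is a lightweight file manager.</p>
  <p xml:lang="de">Dolphin ist ein schlankes Programm zur Dateiverwaltung.</p>
  <p>Features:</p>
  <p xml:lang="de">Funktionen:</p>
  <p xml:lang="es">Características:</p>
</description>
"""


def paragraphs_for(xml_text, locale):
    """Return the <p> texts for `locale`; untranslated entries carry no xml:lang."""
    root = ET.fromstring(xml_text)
    wanted = None if locale == "C" else locale
    return [p.text for p in root.findall("p") if p.get(XML_LANG) == wanted]


untranslated = paragraphs_for(DESCRIPTION, "C")
german = paragraphs_for(DESCRIPTION, "de")
spanish = paragraphs_for(DESCRIPTION, "es")
```

Each paragraph is its own element, so a changed paragraph in the original text only invalidates the translations of that one paragraph.
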

Now, how would you represent this in YAML? Since we need to preserve the paragraph and enumeration markup somehow, and creating a large chain of YAML dictionaries is not really a sane option, the only choices would be:

  • Embed the HTML markup in the file, and risk non-careful translators breaking the markup by e.g. not closing tags.
  • Use Markdown, and risk people not writing the markup correctly when translating a really long string in Gettext.

In both cases, we would lose the ability to translate individual paragraphs, which also means that as soon as the developer changes the original text in YAML, translators would need to translate the whole bunch again, which is inconvenient.

On top of that, there are no tools to translate YAML properly that I am aware of, so we would need to write those too.

3. Allowing XML and YAML makes a confusing story and adds complexity

While adding YAML as a format would not be too hard, given that we already support it for DEP-11 distro metadata (Debian uses this), it would make the business of creating metainfo files more confusing. At the moment, we have a clear story: Write the XML, store it in `/usr/share/metainfo`, use standard tools to translate the translatable entries. Adding YAML to the mix adds an additional choice that needs to be supported for eternity and also has the problems mentioned above.

I wanted to add YAML as format for AppStream, and we discussed this at the hackfest as well, but in the end I think it isn’t worth the pain of supporting it for upstream projects (remember, someone needs to maintain the parsers and specification too and keep XML and YAML in sync and updated). Don’t get me wrong, I love YAML, but for translated metadata which needs a guarantee on format stability it is not the ideal choice.

So yeah, XML isn’t fun to write by hand. But for this case, XML is a good choice.

11 Comments

  • not-a-yaml-supporter commented on April 26, 2016

    I had no real contact with yaml in years, and never used it much, just changed some configs, so I wouldn’t call me a supporter of yaml.

    But your examples to the first point you are trying to make, are just bad. You’re using yaml wrong, so that the comparison is in favor of xml.

    Icon example should look more like:

    icons:
        - type: cached
          width: 128
          height: 128
    
        - type: cached2
          width: 128
          height: 128
    
    next version adds a name
    
    icons:
        - type: cached
          width: 128
          height: 128
          name: Foo
    
        - type: cached2
          width: 128
          height: 128
          name: Bar
    

    If you want a list of things, use a list …

    For the language part you could use:

    summary:
        default: File Manager
        en: File Manager
        bs: Upravitelj datoteka
        cs: Správce souborů
    
    • Matthias commented on April 27, 2016

      That’s why it is an example 😉
      What if I don’t know that I need a list later, when initially designing the YAML? With a dictionary, I could filter out exactly the type I want, e.g. for icons:

      See https://appstream.debian.org/sid/main/metainfo/totem.html for what the YAML looks like.

      • Jimmy Berry commented on April 27, 2016

        EDIT: Please delete/fix previous comment..trying to figure out how to format code since no instructions under comment field.

        You are still not comparing them equivalently. Just as XML can use either all tag/value pairs or properties so can YAML use lists or no list.

        Consider the following:

        value1
        value2

        vs

        At the time you write it that way you decided that thing1 was always a single value and could not be repeated. At the same time when you used separate element key/value pair you made them a repeatable list with a single initial value.

        Just as I cannot repeat thing1:

        When you write YAML the same principles apply.


        # or formatted on next line
        thing1:
          - value1
        thing2:
          - value2

        vs the properties approach


        thing1: value1
        thing2: value2

        They are backwards compatibly expandable in the same ways. Your comparison was against different data representations. If this is not clear consider the parsing code.


        root->thing1[0]

        vs


        root['thing1']

        YAML could just as easily be used. You just need to make the same sort of forward-thinking decision as you have done for XML. You may need to throw in a default/value key or whatnot in lists to allow expansion as well, but it’s just like accessing element [0].

  • zanny commented on April 27, 2016

    So I was in the last comment thread, but this post is a great insight into the conversation, so thank you for it!

    Mindshare sucks. Someone can accept that XML is better in this situation, but still hate writing XML. The problem with appstream as has been discussed seems to be a combination of overworked developers, abandoned projects, and people who don’t like XML.

    I’m curious, though, how much actual infrastructure is needed to parse YAML appstream data when Debian is already using it. Since Debian + Ubuntu is by far the most mature appstream repository out there, the standard matters less in many ways, because the most common consumer of it is shipping mixed YAML + XML appstream data to begin with.

    My experience is only really in depth with Archlinux’s availability of metadata, but let’s just say it is kind of lacking. A lot of the metadata from Debian is unusable in Arch, since a lot of it is YAML based. That seems to show that maintainers who are willing to write appstream data write YAML versions for Debian while ignoring the distro-generic XML data, regardless of technical merit.

    None of this is an easy problem, or it would have been solved in the 90s. We probably really do need some tooling infrastructure to streamline the creation of metadata for developers if we ever want to see appstream become pervasive, though.

    • Matthias commented on April 27, 2016

      I think you confuse metainfo files and AppStream distro metadata here.

      For the distro metadata, YAML is supported for historic reasons at Debian.

      For upstream metadata, appdata and metainfo files, *only* XML is supported, so every distro only has to know XML. The YAML data of Debian and Ubuntu is only usable for Debian and Ubuntu, because it contains packaging information from those distributions, which doesn’t apply to Arch.
      Arch itself can also do (almost) nothing to increase the amount of metadata, since this is a thing upstream developers need to provide. Fedora though has patched in a lot of metainfo files into their packages, so you can likely take files from there and add them to the Arch packaging 😉 .

      So, there is no such problem that some distros are on YAML and others aren’t, since the source is always a metainfo XML file. Simple as that.

      And I too don’t like XML for most cases. I usually prefer JSON (text data serialization over the net, fast parsing) and YAML (files which need to be read/edited by humans). But in this situation, XML (extensible and translatable) is simply the technically best choice, no matter if I personally like that or not.

  • Gregor commented on April 29, 2016

    Hi,
    thanks for the article. I also would like to argue that the first XML/YAML comparison is not quite sensible. Why should the type property (“cached”) be mixed up with the name property (“foobar.png”) in one line? This is clearly not extensible.

    For example take a look at http://codebeautify.org/xml-to-yaml. This converts your XML example to:

    icon:
      _type: cached
      __text: "foobar.png"

    This is as extensible as the given XML.

    My suggestion would be to move the first paragraph from the top to bottom and rename it from “1. XML is easily forward/backward compatible, while YAML is not” to “3. XML is easily forward/backward compatible, while YAML is not when best practices are not applied” 🙂

    • Matthias commented on May 1, 2016

      By having that as-a-list approach, we implicitly allow defining multiple icons of type=cached, which is not allowed by the spec.
      So I think taking a dict there makes sense from that point of view. The automatic XML->YAML converter doesn’t know about that detail.

      The thing is that my argument is not that this example is great, but that breaking YAML forward-compatibility is really easy as the requirements of the formats evolve. Maybe in the past having a dict there made perfect sense, while today a list is wanted. XML doesn’t care much about that; in YAML you can’t go forward without breaking the document structure.

  • Jasem Mutlaq commented on May 1, 2016

    Thanks for the write-up. Any idea when Muon will use this data? I have 16.04 and the information for my project KStars is still from a few years ago, including screenshots from really, really old versions.

    We have had the AppStream data (the org.kde.kstars.appdata.xml file) for a couple of years now and I have yet to see a software center that uses the information contained within!

    • Matthias commented on May 1, 2016

      Muon will likely never use AppStream, but Discover uses AppStream in Kubuntu 16.04. For some reason it doesn’t seem to load the description from AppStream and rather decides to load the package description instead.
      This is probably because Discover is using the APT backend on Kubuntu, instead of the PackageKit one.

      But that being said, KStars doesn’t even have correct metadata because the Debian packaging is messed up, leading to the .desktop file and the metainfo file being in separate packages.
      See
      https://appstream.debian.org/sid/main/issues/kstars-data.html and https://appstream.debian.org/sid/main/issues/kstars.html
      Normally I would say file a bug, but I just fixed that in Git master for that package 😉 But please report bugs in case you find other software doing this!
