Since mid-2015 we have been using the dep11-generator in Debian to build AppStream metadata about available software components in the distribution.
Getting rid of dep11-generator
Unfortunately, the old Python-based dep11-generator hit some hard limits pretty soon. For example, using multiprocessing with Python was a pain, since it resulted in some very hard-to-track bugs. Also, the multiprocessing approach (as opposed to multithreading) made it impossible to use the underlying LMDB database properly (it was basically closed and reopened in each forked-off process, since pickling the Python LMDB object caused some really funny bugs, which usually manifested themselves in the application hanging forever without any information on what was going on). In addition, the Python-based generator forced me to maintain two implementations of the AppStream YAML spec, one in C and one in Python, which consumed quite some time. There were also some other issues (e.g. no unit tests) in the implementation, which made me think about rewriting the generator.
Adventures in Go / Rust / D
Since I didn’t want to write this new piece of software in C (or basically, writing it in C was my last option 😉 ), I explored Go and Rust for this purpose and also did a small prototype in the D programming language, when I was starting to feel really adventurous. And while I never intended to write the new generator in D (I was pretty fixated on Go…), this is what happened. The strong points for D for this particular project were its close relation to C (and ease of using existing C code), its super-flat learning curve for someone who knows and likes C and C++ and its pretty powerful implementations of the concurrent and parallel programming paradigms. That being said, not all is great in D and there are some pretty dark spots too, mainly when it comes to the standard library and compilers. I will dive into my experiences with D in a separate blogpost.
What good to expect from appstream-generator?
So, what can the new appstream-generator do for you? Basically, the same as the old dep11-generator: It will extract metadata from a distribution’s package archive, download and resize screenshots, search for icons and size them properly and generate reports in JSON and HTML of found metadata and issues.
LibAppStream-based parsing, generation of YAML or XML, multi-distro support, …
As opposed to the old generator, the new generator utilizes the metadata parsers and writers of libappstream. This allows it to return the extracted metadata as AppStream YAML (for Debian) or XML (everyone else). It is also written in a distribution-agnostic way, so if someone wants to use it in a distribution other than Debian, this is possible now. It just requires a very small distribution-specific backend to be written; all of the details of the metadata extraction are abstracted away (just two interfaces need to be implemented). While I do not expect anyone except Debian to use this in the near future (most distros have found a solution to generate metadata already), the frontend-backend split is a much cleaner design than what was available in the previous code. It also makes it possible to unit-test the code properly, without providing a Debian archive in the testsuite.
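To illustrate the frontend-backend split, here is a minimal sketch in Python (the generator itself is written in D). The names `Package`, `PackageIndex`, `FakePackage`, `FakeIndex` and `find_desktop_files` are assumptions made up for this example and are not taken from the generator's code; the sketch only shows how two small interfaces can hide the distribution-specific parts and let a test run without a real archive.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Package(ABC):
    """Hypothetical interface: one package as the frontend sees it."""

    @abstractmethod
    def name(self) -> str:
        """Return the package name."""

    @abstractmethod
    def contents(self) -> List[str]:
        """Return the absolute paths of all files in the package."""

    @abstractmethod
    def read_file(self, path: str) -> bytes:
        """Return the raw data of one file in the package."""


class PackageIndex(ABC):
    """Hypothetical interface: lists the packages of one archive slice."""

    @abstractmethod
    def packages_for(self, suite: str, section: str, arch: str) -> List[Package]:
        """Return all packages for the given suite, section and architecture."""


class FakePackage(Package):
    """In-memory backend used for tests; no archive is needed."""

    def __init__(self, name: str, files: Dict[str, bytes]) -> None:
        self._name = name
        self._files = files

    def name(self) -> str:
        return self._name

    def contents(self) -> List[str]:
        return sorted(self._files)

    def read_file(self, path: str) -> bytes:
        return self._files[path]


class FakeIndex(PackageIndex):
    """Returns the same fixed package list for every archive slice."""

    def __init__(self, packages: List[Package]) -> None:
        self._packages = packages

    def packages_for(self, suite: str, section: str, arch: str) -> List[Package]:
        return list(self._packages)


def find_desktop_files(index: PackageIndex, suite: str, section: str,
                       arch: str) -> Dict[str, List[str]]:
    """Frontend logic: it only ever talks to the two interfaces above."""
    result: Dict[str, List[str]] = {}
    for pkg in index.packages_for(suite, section, arch):
        hits = [p for p in pkg.contents()
                if p.startswith("/usr/share/applications/")
                and p.endswith(".desktop")]
        if hits:
            result[pkg.name()] = hits
    return result
```

A Debian backend would implement the same two classes on top of real .deb files, while the frontend function stays unchanged.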
Feature Flags, Optipng, …
The new generator also allows certain sets of features to be enabled and disabled in a standardized way. For example, Ubuntu uses a language-pack system for translations, which Debian doesn’t use. Features like this can be implemented as separate modules in the generator that can be switched off. We use this at the moment to e.g. allow descriptions from packages to be used as AppStream descriptions, or for running optipng on the generated PNG images and icons.
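As a rough sketch of what such feature flags can look like, here is a small Python example. The flag names (`optimize_png`, `use_package_descriptions`, `process_langpacks`), the step names and the configuration shape are assumptions for illustration only; they are not the generator's actual option names or configuration format.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class Features:
    """Hypothetical set of optional features, each one a simple switch."""
    optimize_png: bool = True
    use_package_descriptions: bool = False
    process_langpacks: bool = False


def features_from_config(config: Dict[str, bool]) -> Features:
    """Build the flag set from a config mapping, rejecting unknown names."""
    known = {f.name for f in fields(Features)}
    unknown = sorted(set(config) - known)
    if unknown:
        raise ValueError("unknown feature(s): " + ", ".join(unknown))
    return Features(**config)


def planned_steps(features: Features) -> List[str]:
    """Return the processing steps that would run for this flag set."""
    steps = ["extract-metadata", "render-icons"]
    if features.use_package_descriptions:
        steps.append("package-descriptions")
    if features.process_langpacks:
        steps.append("langpacks")
    if features.optimize_png:
        steps.append("optipng")
    return steps
```

A distribution that uses language packs would then only set `process_langpacks` in its configuration, while the rest of the pipeline stays the same.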
No more Contents file dependency
Another issue the old generator had was that it used the Contents file from the Debian archive to find matching icons for an application. We could never be sure whether the contents in the Contents file actually matched the contents of the package we were currently dealing with. What made things worse is that at Ubuntu, the archive software only updates the Contents file daily (while the generator might run multiple times a day), which has led to software being ignored in the metadata, because icons could not yet be found. Even on Debian, with its quickly-updated Contents file, we could immediately see the effects of an out-of-date Contents file when updating it failed once. In the new generator, we now read the contents of each package ourselves and store them in an LMDB database, bypassing the Contents file and removing the whole class of problems resulting from missing or wrong contents data.
It can’t all be good, right?
That is true, there are also some known issues the new generator has:
Large amounts of RAM required
The better speed of the new generator comes at the cost of holding more stuff in RAM. Much more. When processing data from 5 architectures initially on Debian, the amount of required RAM might lie above 4GB, with the OOM killer sometimes being quicker than the garbage collector… That being said, on subsequent runs the amount of required memory is much lower. Still, this is something I am working to improve.
What are symbolic links?
To be faster, the appstream-generator will read the md5sums file in .deb packages instead of extracting the payload archive and reading its contents. Since the md5sums file does not list symbolic links, symlinks basically don’t exist for the new generator. This is a problem for software that symlinks icons or even .desktop files around, as LibreOffice does, for example.
I am still investigating how widespread the use of symlinks for icons and .desktop files is, but it looks like fixing packages (making them move the files instead of symlinking them) might be a better approach than investing additional computing power to find symlinks or even switching back to parsing the Contents file. Input on this is welcome!
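The effect can be sketched in a few lines of Python. This is not the generator's code: `parse_md5sums` and the sample package contents are made up for the example, and it assumes the usual md5sums layout of one `checksum  relative/path` entry per line.

```python
from typing import List

# Made-up md5sums content for a package that ships one .desktop file and
# one icon, plus a symlink /usr/share/pixmaps/foo.png pointing at the icon.
# The symlink has no checksum, so it has no line here.
SAMPLE_MD5SUMS = """\
d41d8cd98f00b204e9800998ecf8427e  usr/share/applications/foo.desktop
900150983cd24fb0d6963f7d28e17f72  usr/share/icons/hicolor/48x48/apps/foo.png
"""


def parse_md5sums(text: str) -> List[str]:
    """Return the absolute paths listed in an md5sums file."""
    paths = []
    for line in text.splitlines():
        parts = line.split(None, 1)  # checksum, then the rest of the line
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        paths.append("/" + parts[1])
    return paths


contents = parse_md5sums(SAMPLE_MD5SUMS)
# The real icon is found, but the symlinked location is invisible.
icon_found = "/usr/share/icons/hicolor/48x48/apps/foo.png" in contents
symlink_found = "/usr/share/pixmaps/foo.png" in contents
```

Here `icon_found` ends up true and `symlink_found` false: anything that refers to the icon only via the symlinked path will look as if the icon were missing.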
Deploying asgen
I finished the last pieces of the appstream-generator (together with doing lots of other cool things and talking to great people) at the GNOME Software Hackfest in London last week (detailed blogposts about things that happened there will follow – many thanks once again to the Ubuntu community for sponsoring my attendance!).
As of today, the new generator is running on the Debian infrastructure. If bigger issues are found, we can still roll back to the old code. I decided to deploy this early, so we can get some good testing done before the Stretch release. Please report any issues you may find!
A factual correction: Ubuntu’s Contents files are updated daily, not weekly. There used to be some races in the publication process that caused the updated Contents file to frequently fail to be installed, but I fixed those over a year ago.
(Not that this is to suggest that you should go back to using Contents for this application, of course.)
Sorry, I only just read this post (didn’t get an email notification for some reason…) – updated in the blogpost, thanks for the information!