
Creating a DEP-5 parser with Config::Model

September 14, 2010

Hello

Following a discussion on the #debian-perl IRC channel, I offered to provide a script to parse DEP-5 files. The goal is to be able to parse *and* validate DEP-5 files. DEP-5 is a proposal to make debian/copyright machine-interpretable. This proposal is driven by Lars Wirzenius.

With Config::Model, any evolution of the DEP-5 specification will
be easy to include in the DEP-5 model read by Config::Model.

What is a DEP-5 model?

To keep a long story short, let’s say that the DEP-5 model is a description of DEP-5 syntax and semantics that Config::Model can use to perform validation. For more detail on how to create a model, please read this doc. In other words, the DEP-5 model is the DEP-5 document translated into a special format.

The first step was to edit the DEP-5 document directly and munge it into a YAML document describing the structure of DEP-5. Here’s a small extract of this YAML file (slightly edited to remove most of the long descriptions):


---
class:
  Debian::Dep5:
    class_description: >
       Machine-readable debian/copyright
    accept:
      name_match: X-.*
      type: leaf
      value_type: string
    element:
      Format-Specification:
        mandatory: 1
        type: leaf
        value_type: uniline
        description: >-
          URI of the format specification, such as ...
      Name:
        type: leaf
        value_type: uniline
      Files:
        type: hash
        index_type: string
        ordered: 1
        cargo:
          type: node
          config_class_name: Debian::Dep5::Content
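
To make the role of such a model more concrete, here is a minimal Python sketch of the kind of checks the extract above declares: one mandatory field, and extra fields accepted only when their name matches X-.*. This is not Config::Model code. The MODEL dictionary and the validate_header function are hypothetical names invented for this illustration, and the sketch assumes the name_match pattern applies to the whole field name.

```python
import re

# Hypothetical, simplified stand-in for the YAML extract (not Config::Model).
MODEL = {
    # Assumption: the accept pattern must match the whole field name.
    "accept": re.compile(r"X-.*"),
    "elements": {
        "Format-Specification": {"mandatory": True},
        "Name": {"mandatory": False},
        "Files": {"mandatory": False},
    },
}


def validate_header(fields):
    """Return a list of error strings for a dict of field name -> value."""
    errors = []
    # Every mandatory element must be present and non-empty.
    for name, spec in MODEL["elements"].items():
        if spec["mandatory"] and not fields.get(name):
            errors.append(f"missing mandatory field: {name}")
    # Any other field is rejected unless the accept pattern allows it.
    for name in fields:
        known = name in MODEL["elements"]
        if not known and not MODEL["accept"].fullmatch(name):
            errors.append(f"unknown field: {name}")
    return errors
```

A header holding Format-Specification plus an X-Comment field passes, while a header with a Bogus field and no Format-Specification yields two errors.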


While creating this YAML file, the problem raised by the License keyword became obvious once its properties were listed:

  • A limited number of valid licenses (no problem, let’s use an enum)
  • License names are not case sensitive (optimism goes somewhat down)
  • License names have version number and an optional ‘+’ suffix (ok, let’s use a regular expression with Config::Model’s brand new ‘match’ specification)
  • License can be combined with ‘and’ or ‘or’. (uh oh, the ‘match’ regexp will not be enough. A grammar would be better.)
  • License can specify an abbreviation or the full text of the license.

Long story short, I had to add to Config::Model the possibility to specify a Parse::RecDescent grammar to validate a value. More on this later.

Of course, the first draft of a model in YAML was far from being perfect.

So the second step was to load it with config-edit-model. I had to fix a number of YAML errors and then some errors in the model description.

Then, I had to write a parser to load the DEP-5 data into the Config::Model tree. I first used Raphaël Hertzog’s Dpkg::Control::Hash module, but it cannot cope with repeated fields without clobbering them. So I had to provide my own parser.

The parser is divided into two parts:
- the parse function to load DEP-5 data in a simple data structure
- the read function to load the simple data structure into Config::Model’s configuration tree

You can view the code in the Config::Model repository.
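
As a rough illustration of the first part, here is a Python sketch of a parse step that keeps repeated fields instead of clobbering them. It is not the Perl code from the repository. The function name parse_dep5 is hypothetical, and the sketch assumes the usual control-file layout: paragraphs separated by blank lines, "Field: value" lines, and continuation lines starting with whitespace.

```python
def parse_dep5(text):
    """Split DEP-5 text into paragraphs of (field, value) pairs.

    Pairs are kept in a list, so a repeated field is never overwritten.
    """
    paragraphs, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current paragraph.
            if current:
                paragraphs.append(current)
                current = []
        elif line[0] in " \t":
            # Leading whitespace: continuation of the previous field.
            if not current:
                raise ValueError(f"continuation without a field: {line!r}")
            field, value = current[-1]
            current[-1] = (field, value + "\n" + line.strip())
        else:
            # Split on the first colon only, so URLs in values survive.
            field, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"not a 'Field: value' line: {line!r}")
            current.append((field.strip(), value.strip()))
    if current:
        paragraphs.append(current)
    return paragraphs
```

The result is deliberately a simple data structure (a list of lists of pairs). Loading it into a configuration tree, the second part, is a separate step that this sketch does not attempt.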

The model itself is divided into three configuration classes:

  • Debian::Dep5: the root class
  • Debian::Dep5::Content: To represent Files specification
  • Debian::Dep5::License

The full DEP-5 model can also be read in the Config::Model repository.

(The models are biggish because they include help text taken from DEP-5 documentation)

Now some explanation is required on how the License validation is performed.

The trick is that each License used must be listed in Debian::Dep5’s License element, which is specified this way (in YAML syntax):


License:
  type: hash
  index_type: string
  allow_keys_matching: '^(?i:Apache|Artistic|BSD|etc...|other)[\d\.\-]*\+?$'
  cargo:
     type: leaf
     value_type: string

So this License element contains the list of licenses with their full text and only Licenses whose name matches the regular expression are accepted.
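
To see what this pattern accepts, here is a small Python sketch that applies the same regular expression to a few candidate keys. Only the alternatives spelled out above are included, since the real list is longer and the post elides it. The names LICENSE_KEY and is_valid_license_key are hypothetical.

```python
import re

# Only the alternatives shown in the post; the elided "etc..." part of the
# original list is intentionally left out rather than guessed.
LICENSE_KEY = re.compile(r"^(?i:Apache|Artistic|BSD|other)[\d.\-]*\+?$")


def is_valid_license_key(name):
    """True if name is a known license, optional version, optional '+'."""
    return LICENSE_KEY.match(name) is not None
```

With this reduced list, keys such as "Apache-2.0", "bsd" and "Artistic+" are accepted, while "Apache 2.0" (with a space) and "BSD-3-clause" are rejected because only digits, dots and dashes may follow the name.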

Now, let’s explain how the files are tied to the declared licenses. The Debian::Dep5::Content class has a License element that represents the relation between the files and the license(s). This bond is represented by the Debian::Dep5::License configuration class. The ‘full_license’ element is fairly obvious.

The ‘abbrev’ element is another matter. Here’s its declaration (minus the description and help):

abbrev:
 type: leaf
 value_type: uniline
 default: other

So far, so good. Now the meaty part: the validation rule, based on a Parse::RecDescent grammar:

 
grammar: "license (oper license)(s?)
          oper: 'and' | 'or'
          license: /[\\w\\-\\.\\+]+/i\n
             { # PRD action to check if the license text is provided
               $return = $arg[0]->grab('! License')->defined($item[1]);
             } "

This grammar specifies:

  • The syntax of the License line itself (hence something like “Perl or GPL”)
  • An action performed when the grammar is matched. Using Config::Model API, this Perl snippet checks that the License abbreviation has a corresponding License declared in Debian::Dep5 License hash

This way, an abbreviation cannot be used without a proper License statement.
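
The same two checks can be sketched outside of Parse::RecDescent. The Python below is a hand-written illustration, not the grammar’s implementation. It assumes tokens are separated by whitespace, uses a plain set in place of the License hash of the configuration tree, and compares names case-sensitively. The names LICENSE_TOKEN and check_license_line are hypothetical.

```python
import re

# Same character class as the 'license' rule of the grammar.
LICENSE_TOKEN = re.compile(r"[\w\-.+]+")


def check_license_line(line, declared):
    """True if line reads 'license (oper license)*' and all are declared.

    declared: set of license names standing in for the License hash.
    """
    tokens = line.split()
    # A valid line has an odd number of tokens: lic, oper, lic, ...
    if not tokens or len(tokens) % 2 == 0:
        return False
    for position, token in enumerate(tokens):
        if position % 2:
            # Odd positions must hold an operator.
            if token not in ("and", "or"):
                return False
        elif not LICENSE_TOKEN.fullmatch(token) or token not in declared:
            # Even positions must hold a well-formed, declared license.
            return False
    return True
```

For a declared set of Perl, GPL and Artistic, “Perl or GPL” passes, “Perl or MIT” fails because MIT has no declaration, and “Perl or” fails on syntax alone.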

This code is a first stab. Some more work is to be done soon:

  • Implement exception parsing
  • Provide a DEP-5 writer
  • Provide a config-edit-dep5 CLI

Anyway, I’ll soon release Config::Model with this first version of the DEP-5 parser.

As always, feedback is more than welcome.

All the best

update: added missing Parse::RecDescent link

3 Comments
  1. I don’t claim to be too deep into specifics, but wouldn’t it make sense to create a machine-readable format in a way that can easily be read by a machine…?

    • Well, the trick is that the DEP-5 format must be easy to parse by a machine and easy to read and edit by a human. This means that some compromises must be made. From a syntax point of view, the file is quite easy to parse: about 20 lines of Perl. The rest of the parser’s task is to detect as many errors as possible in the semantic content of the file. Hence all the rules between the License declarations and the relations between files and Licenses. I hope this answers your question…

  2. Your method of explaining all in this article is in fact pleasant, every
    one be capable of effortlessly understand it, Thanks a lot.
