[[!tag idea]]

There is no common way to express license information in each source file at the moment. Some people embed license information in each file, others keep it in a README or another file at the top of the source tree.

Worse, there is no common syntax to express the license information in a machine-parseable way. If we had one, we could have tools that, for example, tell you if you're trying to merge code that has an incompatible license.

Obviously, this kind of thing can never work perfectly. People keep inventing new licenses, and it is not possible for a computer program to fully understand any license. It is not clear humans can do that, either. However, it would be possible to do it for a number of well-known licenses, which would help most of the time. A classic 80/20 situation.

I would like to suggest a syntax, similar to Emacs's "Hey Emacs" modelines, for embedding a summary of the license, or licenses, of a source file in such a way that it can be programmatically extracted, parsed, and analysed. With this syntax, one could then write a tool to ensure that all files in a project have the same license, or that all licenses are compatible with the project's overall license.

Such a tool will, of course, rely on heuristics and assumptions. For example, it needs to assume the machine-parseable license summary is correct, and rely on a ruleset describing which licenses are compatible with which. Things can go wrong. That's life. Remember, this is aiming at doing the 20% of the work that will work 80% of the time, not perfection.
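To make the ruleset idea concrete, here is a minimal Python sketch. The names `KNOWN_COMPATIBLE` and `can_merge` are made up for illustration, and the single rule is only an example entry, not a vetted compatibility table.

```python
# Illustrative ruleset: (license of incoming code, license of the project).
# The one entry below is an assumption for the example, not a complete table.
KNOWN_COMPATIBLE = {
    ("Expat", "GPL-3+"),
}


def can_merge(incoming, project):
    """Return True if the ruleset says `incoming` code may go into `project`.

    Combinations the ruleset does not know about are reported as not
    mergeable, so that a human gets to look at them.
    """
    if incoming == project:
        return True
    return (incoming, project) in KNOWN_COMPATIBLE
```

Treating unknown combinations as incompatible is a design choice in this sketch: a false alarm costs a moment of human attention, while a missed conflict is harder to undo later.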

I don't have a tool written, but I have a suggestion for the syntax.

    /*
     * Copyright 2013 Lars Wirzenius
     *
     * tl;dr =*= Licenses: GPL-3+ or Expat, and Artistic =*=
     *
     * Blah. Blah. Blah. Imagine long, boring license texts
     * here.
     */

The important part is this:

    =*= Licenses: GPL-3+ or Expat, and Artistic =*=

The =*= prefix and suffix and the word Licenses are there to make grepping reasonably reliable without too many false positives, and to allow comment characters and other text on the same line.

The actual license summary follows the syntax and semantics of the Debian copyright-format 1.0 specification, which I chose because it exists and has had a fair bit of review so far, and is reasonably expressive.
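As a rough illustration of those semantics, the following Python sketch parses a summary into a nested structure. It assumes the copyright-format rule that "and" binds tighter than "or" unless the "and" is preceded by a comma. The function name `parse_summary` and the tuple representation are my own choices for the example, and it does not try to handle license names that themselves contain " and " or " or ".

```python
def parse_summary(text):
    """Parse a license summary into a nested ("and"/"or", [parts]) tuple.

    A single license name is returned as a plain string.
    """
    def combine(op, parts):
        # Avoid wrapping a lone item in a one-element and/or node.
        return parts[0] if len(parts) == 1 else (op, parts)

    def parse_and(s):
        return combine("and", [p.strip() for p in s.split(" and ")])

    def parse_or(s):
        return combine("or", [parse_and(p) for p in s.split(" or ")])

    # ", and" has the lowest priority, so split on it first.
    return combine("and", [parse_or(p) for p in text.strip().split(", and ")])
```

With this, `GPL-3+ or Expat, and Artistic` comes out as "(GPL-3+ or Expat) and Artistic", while `A or B and C` without the comma comes out as "A or (B and C)".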

The license summaries can be extracted with the following GNU sed invocation:

    sed -n '/.*=\*= [Ll]icen[cs]es\?: \(.*\) =\*=.*/s//\1/p'

I allowed various forms of the word license, since it's a word that a lot of people will get wrong, and it's easy to catch all four common forms.
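For anyone who would rather not depend on GNU sed, here is a hedged Python sketch of the same extraction, plus the "do all files carry the same license" check mentioned earlier. The names `SUMMARY_RE`, `extract_summary`, and `inconsistent_files` are hypothetical, and passing file contents in as a dictionary is just a simplification for the example.

```python
import re

# Mirrors the sed pattern: "=*= Licenses: ... =*=", allowing the same
# spelling variations of the word "license".
SUMMARY_RE = re.compile(r"=\*= [Ll]icen[cs]es?: (.*)=\*=")


def extract_summary(text):
    """Return the first license summary found in `text`, or None."""
    for line in text.splitlines():
        match = SUMMARY_RE.search(line)
        if match:
            # Drop the whitespace next to the closing =*= marker.
            return match.group(1).strip()
    return None


def inconsistent_files(sources, expected):
    """Return sorted names of files whose summary is missing or differs.

    `sources` maps a file name to that file's contents.
    """
    return sorted(
        name for name, text in sources.items()
        if extract_summary(text) != expected
    )
```

The check compares summaries as plain strings, so `Expat or GPL-3+` and `GPL-3+ or Expat` would count as different. A real tool would probably want to compare parsed summaries instead.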

So, does anyone else think this might be useful? Would you use it in your own projects?