Accurately identifying the license of open source software is essential for license compliance. However, determining the license can be difficult when licensing information is missing or ambiguous. Even when some licensing information is present, the lack of a consistent way to express it makes automated license detection very difficult and forces significant manual effort. Some commercial tools apply machine learning to this problem to reduce false positives and to train license scanners, but a better solution is to fix the problem at the upstream source.
In 2013, the U-Boot project decided to use SPDX license identifiers in each source file instead of the "GPL v2.0 or later" header boilerplate that had been used up to that point. The initial commit message gave an eloquent explanation of the reasons behind this transition.
Licenses: introduce SPDX Unique Lincense Identifiers
Like many other projects, U-Boot has a tradition of including big blocks of License headers in all files. This not only blows up the source code with mostly redundant information, but also makes it very difficult to generate License Clearing Reports. An additional problem is that even the same lincenses are referred to by a number of slightly varying text blocks (full, abbreviated, different indentation, line wrapping and/or white space, with obsolete address information, ...) which makes automatic processing a nightmare.
To make this easier, such license headers in the source files will be replaced with a single line reference to Unique Lincense Identifiers as defined by the Linux Foundation's SPDX project. For example, in a source file the full "GPL v2.0 or later" header text will be replaced by a single line:
SPDX-License-Identifier: GPL-2.0+
We use the SPDX Unique Lincense Identifiers here; these are available at . . . .
http://spdx.org/
http://spdx.org/licenses/
The SPDX project liked the simplicity of this approach and formally adopted U-Boot’s syntax for embedding SPDX-License-Identifiers. Initially, the syntax was documented on the project wiki; it was later formalized in SPDX specification version 2.1, “Appendix V: Using SPDX short identifiers in Source Files”. Since then, other upstream open source projects and repositories have adopted these short identifiers to identify the licenses in use, including GitHub in its Licenses API. In 2017, the Free Software Foundation Europe created a project called REUSE.software that provides guidance for open source projects on how to apply SPDX-License-Identifiers in their code. The REUSE.software guidelines were followed when SPDX-License-Identifiers were added to the Linux kernel later that year.
The SPDX-License-Identifier syntax, used with short-form identifiers from the SPDX License List (referred to here as SPDX LIDs), can indicate license information at any level, from the package down to the individual source file. The “SPDX-License-Identifier” phrase followed by a license expression formed of SPDX LIDs, placed in a comment, is a precise, concise and language-neutral way to document licensing that is simple to process by machine. This leads to source code that is easier to read, which appeals to developers, and it lets the licensing information travel with the source code.
To use SPDX LIDs in your project’s source code, just add a single line in the following format, tailored to your license(s) and the comment style for that file’s language. For example:
// SPDX-License-Identifier: MIT
/* SPDX-License-Identifier: MIT OR Apache-2.0 */
# SPDX-License-Identifier: GPL-2.0-or-later
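The only thing that changes between languages is the comment syntax wrapped around the tag. As a minimal sketch, the Python helper below builds the line for a given file name and license expression. The COMMENT_STYLES table and the spdx_header function are hypothetical names introduced here for illustration and are not part of any SPDX tooling. The extension list is deliberately small and would need extending for a real project.

```python
import os

# Hypothetical mapping from file extension to (prefix, suffix) comment
# delimiters. Extend it to cover the languages used in your project.
COMMENT_STYLES = {
    ".c": ("/* ", " */"),
    ".h": ("/* ", " */"),
    ".go": ("// ", ""),
    ".rs": ("// ", ""),
    ".py": ("# ", ""),
    ".sh": ("# ", ""),
}


def spdx_header(filename: str, expression: str) -> str:
    """Return a one-line SPDX tag wrapped in the comment style for filename."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in COMMENT_STYLES:
        raise ValueError(f"no comment style known for {ext!r}")
    prefix, suffix = COMMENT_STYLES[ext]
    return f"{prefix}SPDX-License-Identifier: {expression}{suffix}"


if __name__ == "__main__":
    print(spdx_header("main.c", "MIT OR Apache-2.0"))
    print(spdx_header("tool.py", "GPL-2.0-or-later"))
```

Keeping the expression as a plain string means the helper does not validate it against the SPDX License List; that check is left to the caller.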
In addition to U-Boot and Linux transitioning to SPDX LIDs, newer projects like Zephyr and Hyperledger Fabric have adopted them from the start as a best practice. Indeed, to achieve the Core Infrastructure Initiative’s gold badge, each file in the source code must have a license statement, and the recommended way to provide one is an SPDX LID.
The project MUST include a license statement in each source file. This MAY be done by including the following inside a comment near the beginning of each file: SPDX-License-Identifier: [SPDX license expression for project].
When SPDX LIDs are used, gathering license information across your project files can become as easy as running grep. If a source file is reused in a different package, the license information travels with the source, reducing the risk of license identification errors and making license compliance in the recipient project easier. Using SPDX LIDs in license expressions also conveys the meaning of license combinations more accurately. Saying “this file is MPL/MIT” is ambiguous, and leaves recipients unclear about their compliance requirements. Saying “MPL-2.0 AND MIT” or “MPL-2.0 OR MIT” specifies precisely whether the licensee must comply with both licenses, or with either license, when redistributing the file.
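To illustrate the grep-like workflow, here is a minimal Python sketch that walks a directory tree and counts the license expression found in each file. The function names, the 20-line search window and the NO-SPDX-TAG label are assumptions made for this example. It is not an official SPDX or REUSE tool, and it does not validate expressions against the SPDX License List.

```python
import re
from collections import Counter
from pathlib import Path

# Matches the tag and captures everything after it up to the end of the line.
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def extract_expression(text, max_lines=20):
    """Return the first SPDX license expression near the top of text, or None."""
    for line in text.splitlines()[:max_lines]:
        match = SPDX_RE.search(line)
        if match:
            expr = match.group(1).strip()
            # Drop a trailing block-comment closer such as "*/" or "-->".
            for closer in ("*/", "-->"):
                if expr.endswith(closer):
                    expr = expr[: -len(closer)].rstrip()
            return expr
    return None


def summarize(root):
    """Count license expressions for every regular file under root."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        counts[extract_expression(text) or "NO-SPDX-TAG"] += 1
    return counts


if __name__ == "__main__":
    for expression, count in summarize(".").most_common():
        print(f"{count:6d}  {expression}")
```

Because the whole expression is kept as one string, a file tagged “MPL-2.0 AND MIT” is counted separately from one tagged “MPL-2.0 OR MIT”, which preserves the distinction described above.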
As illustrated by the transition underway in the Linux kernel, SPDX LIDs can be adopted gradually. You can start by adding SPDX LIDs to new files without changing anything already present in your codebase. A list of projects known to be using SPDX License Identifiers can be found at: https://spdx.org/ids-where, and if you know of one that’s missing, please send email to firstname.lastname@example.org.
Learn more in this presentation at Open Source Summit: Automating the Creation of Open Source BOMs