# Answers to frequently asked questions

Last updated Oct. 29, 2024

# :warning: This document is still a work in progress :warning:

## What is an Open Source AI?

TL;DR: An Open Source AI is one made freely available with all necessary code, data and parameters under legal terms approved by the Open Source Initiative. For more details read below.

## Why did you write the Open Source AI Definition?

Point \#2 of the [Open Source Definition (OSD)](https://opensource.org/osd) says: "The program must include source code [...] The source code must be the preferred form in which a programmer would modify the program [...]". Nobody had a clear answer to what the preferred form for modifying an AI system is, so OSI offered to find one together with the communities involved, through a [co-design process](https://opensource.org/deepdive/).

## What's the difference between the Open Source Definition and the Open Source AI Definition?

The [Open Source Definition (OSD)](https://opensource.org/osd) refers to software programs. AI systems, and machine learning systems specifically, are not simply software programs: they blur the boundaries between software and data, configuration options, documentation and new artifacts such as weights and biases. The *Open Source AI Definition* describes the preferred form for modifying an AI system, providing clarity on how to interpret the principles of the OSD in the domain of AI.

## What is the role of training data in the Open Source AI Definition?

Open Source means giving anyone the ability to meaningfully fork (study and modify) your system, without requiring additional permissions, to make it more useful for themselves and for everyone else. This is why OSD \#2 requires that the source code be provided in the preferred form for making modifications. This way everyone has the same rights and ability to fork as the original developers, starting a virtuous cycle of innovation.

However, training data does not equate to software source code.
Training data is important for studying modern machine learning systems. But it is not necessarily what AI researchers and practitioners use as part of the preferred form for making modifications to a trained model.

The Data Information and Code requirements allow Open Source AI systems to be forked by third-party AI builders downstream using the same information as the original developers. These forks could include removing non-public or non-open data from the training dataset, in order to train a new Open Source AI system on fully public or open data.

### Why do you allow the exclusion of some training data?

Because we want Open Source AI to exist also in fields where data cannot be legally shared, for example medical AI. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information, such as decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.

There are also many cases where the terms of use of publicly available data may give entity A the confidence that it may use the data freely and call it "open data", but not the confidence that it can give guarantees to entity B in a different jurisdiction. Meanwhile, entity B may or may not feel confident using that data in its own jurisdiction. One example is so-called public domain data, where the definition of public domain varies from country to country. Another is fair-use or private data, where determining whether fair use or privacy law applies may require good knowledge of the law of a given jurisdiction. In these cases resharing is not so much *limited* as [lacking legal certainty](https://opensource.org/blog/copyright-law-makes-a-case-for-requiring-data-information-rather-than-open-datasets-for-open-source-ai).
### How did you arrive at this conclusion? Is it compromising Open Source ideals?

During our co-design process, the relationship between the weights and the data drove the highest amount of community engagement. In the [“System analysis” phase](https://discuss.opensource.org/t/report-on-working-group-recommendations/247), the volunteer groups suggested that training code and data processing code were more important for modifying the AI system than access to the training and testing data. That result was validated in the [“Validation phase”](https://discuss.opensource.org/t/initial-report-on-definition-validation/368) and suggested a path that allows Open Source AI to exist on equal grounds with proprietary systems: both can train on the same [kind of data](#What-kind-of-data-should-be-required-in-the-Open-Source-AI-Definition).

Some people believe that full unfettered access to all training data (with no distinction of its [kind](#What-kind-of-data-should-be-required-in-the-Open-Source-AI-Definition)) is paramount, arguing that anything less would compromise full reproducibility of AI systems, transparency and security. This approach would relegate Open Source AI to a niche of AI trainable only on open data (see [FAQ](#What-kind-of-data-should-be-required-in-the-Open-Source-AI-Definition)). That niche would be tiny, even relative to the niche occupied by Open Source in the traditional software ecosystem.

The Data Information requirements keep the same approach as the Open Source Definition, which doesn't mandate full reproducibility and transparency but enables them (e.g. [reproducible builds](https://reproducible-builds.org/)).
At the same time, setting a baseline that requires Data Information doesn't preclude others from formulating and demanding further requirements, just as the [Digital Public Goods Standard](https://digitalpublicgoods.net/standard/) or the [Free System Distribution Guidelines](https://www.gnu.org/distros/free-system-distribution-guidelines.html) add requirements to the Open Source Definition.

One of the key aspects of OSI’s mission is to drive and promote Open Source innovation. The approach OSI takes here enables full user choice with Open Source AI. Users can take the insights derived from the training and data pre-processing code and from the description of unshareable training data, build upon them with their own unshareable data, and give the insights derived from further training to everyone, allowing for Open Source AI in areas like healthcare. Or users can obtain the available and public data from the Data Information and retrain their model without any unshareable data, resulting in more data transparency in the resulting AI system. Just as with copyleft and permissive licensing, this approach leaves the choice with the user.

### What kind of data should be required in the Open Source AI Definition?

There are four classes of data, based on their legal constraints, all of which can be used to train Open Source AI systems:

* **Open training data**: data that can be copied, preserved, modified and reshared. It provides the best way to enable users to study the system. This must be shared.
* **Public training data**: data that others can inspect as long as it remains available. This also enables users to study the work. However, this data can degrade as links or references are lost or removed from network availability. To mitigate this risk, different communities will have to work together to define standards, procedures, tools and governance models, and Data Information is required in case the data later becomes unavailable.
This must be disclosed with full details on where to obtain it.
* **Obtainable training data**: data that can be obtained, including for a fee. This information provides transparency and is similar to a purchasable component in an open hardware system. The Data Information provides a means of understanding this data other than obtaining or purchasing it. This is an area that is likely to change rapidly and will need careful monitoring to protect Open Source AI developers. This must be disclosed with full details on where to obtain it.
* **Unshareable non-public training data**: data that cannot be shared for explainable reasons, like Personally Identifiable Information (PII). For this class of data, the ability to study some of the system's biases demands a detailed description of the data (what it is, how it was collected, its characteristics, and so on) so that users can understand the biases and categorization underlying the system. This must be revealed in detail so that, for example, a hospital can create a dataset with identical structure using its own patient data.

OSI believes that all these classes of data can be part of the preferred form of making modifications to the AI system. This approach both advances openness in all the components of the AI system and drives more Open Source AI, e.g. in private-first areas such as healthcare.

## What is a skilled person?

In legal circles, **Skilled Person** means any person having the current knowledge, experience and competence to perform a certain duty. This [Wikipedia entry](https://en.wikipedia.org/wiki/Person_having_ordinary_skill_in_the_art) provides more details.

## Is the Open Source AI Definition covering models and weights and parameters?

Yes. The Open Source AI Definition makes no distinction between what might be called AI system, model, or weights and parameters.
To be called Open Source AI, whether the offering is characterized as an AI system, a model, or weights and parameters, the requirements for providing the preferred form for making modifications will be the same.

## Why do you require training code while OSD \#2 doesn’t require compilers?

AI and software are radically different domains and drawing comparisons between them is rarely productive. OSD \#2 doesn’t mandate that Open Source software use only compilers released under an OSI-Approved License because compilers are standardized, either de jure (like ANSI C) or de facto (like Turbo Pascal or Python). It was generally accepted that one could use a proprietary development environment to develop more Open Source software. For machine learning, the training code is not standardized and therefore it must be part of the preferred form of making modifications to preserve the right to fork an AI system.

## Why is there no mention of safety and risk limitations in the Open Source AI Definition?

The Open Source AI Definition does not specifically guide or enforce ethical, trustworthy, or responsible AI development practices. However, it does not put up any barriers that would prevent developers from adhering to such principles, if they choose to. The efforts to discuss the responsible development, deployment and use of AI systems, including through appropriate government regulation, are a separate conversation. A good starting point is the OECD's Recommendation of the Council on Artificial Intelligence, [Section 1: Principles for responsible stewardship of trustworthy AI](https://legalinstruments.oecd.org/en/instruments/oecd-legal-0449).

## Are model parameters copyrightable?

The Open Source AI Definition does not take any stance on the legal nature of Parameters. They may be free by their nature, or a license or other legal instrument may be required to ensure their freedom.
We expect this will become clearer over time, once the legal system has had more opportunity to address these issues. In any case, we require an explicit assertion accompanying the distribution of Parameters that assures they're freely available to all.

## Why will parameters be available under "OSI-approved terms" but the code will be under "OSI-approved licenses"? Are you going to allow restrictions on the terms for models?

We used the word "terms" instead of "license" for models because, as mentioned above, we do not yet know what the legal mechanism will be to assure that the models are available to use, study, modify and share. We used "terms" to avoid suggesting that a "license" is the only legal mechanism that could be used. That said, to be approved by the OSI, the terms for parameters must assure the freedoms to use, study, modify and share.

## Why is the "Preferred form to make modifications" limited to machine learning?

The principles stated in the Open Source AI Definition are generally applicable to any kind of AI, but it's machine learning that challenges the Open Source Definition. For machine learning, a set of artifacts (components) is required to study and modify the system, which calls for a new explanation of what is necessary to do so.

## Which AI systems comply with the Open Source AI Definition?

As part of our validation and testing of the OSAID, the volunteers checked whether the Definition could be used to evaluate if AI systems provided the freedoms expected.

The models that passed the Validation phase are: Pythia (Eleuther AI), OLMo (AI2), Amber and CrystalCoder (LLM360) and T5 (Google). A few others were analyzed and would probably pass if they changed their licenses/legal terms: BLOOM (BigScience), Starcoder2 (BigCode), Falcon (TII).
Those that were analyzed and don't pass, because they lack required components and/or their legal agreements are incompatible with the Open Source principles, are: Llama2 (Meta), Grok (X/Twitter), Phi-2 (Microsoft), Mixtral (Mistral).

These results should be seen as part of the definitional process and as a learning moment; they are not certifications of any kind. OSI will continue to validate only legal documents, and will not validate or review individual AI systems, just as it does not validate or review software projects.