While diverse issues persist, the world and the software ecosystem are
still moving forward with the advancement of AI. As a particular type of
software, AI is quite different from the traditional software paradigm,
since more components are involved as integral parts of an AI system.
People have gradually realized that the Open Source Definition[3],
derived from the DFSG[4], no longer covers AI software very well.
To answer the question "what kind of AI is free software / open source",
there have been multiple relevant efforts in recent years. Six years ago
we discussed the same question[6], and as a result I drafted an
unofficial document named ML-Policy[5]. In the past year or two, OSI
started the drafting process of the "Open Source AI Definition" (OSAID);
its 1.0-RC2 version[1] is available for public review and is about to be
formally released. FSF is working on a similar effort concurrently[2].
I think the upcoming release of OSAID will make a big impact on the open
source ecosystem. However, even though OSAID starts from the DFSG and
the free software definition, it is very concerning to me. Here I'll
only discuss the most pressing issue -- data.
The current OSAID-1.0-RC2 only requires "data information", not the
"original training data", to be available. That effectively allows an
"Open Source AI" to hide its original training datasets. A group of
people have expressed their concerns and disagreement about the draft on
OSI's forum[7][8][9][10], emphasizing the negative impacts of allowing
"Open Source AI" to hide its original training datasets.
Allowing "Open Source AI" to hide their original training dataset is
nothing different
than setting up a dataset barrier protecting the monopoly. The "open
source community"
around such "Open Source AI" is only able to conduct further development
based on
such AI, but not able to inspect the process of how the original piece
of "Open
Source AI" is produced, and not able to improve the "Open Souce AI" itself.
This leads to many implications including but not limited to security
and bias issues.
For instance, without access to the original training data of an "Open
Source AI", once that AI starts to say harmful or toxic things, or
starts to deliver advertisements, nobody other than the first party is
able to diagnose and fix the bias issue, or strip the advertisements out
and produce an improved AI. For traditional open source software this
would look ridiculous, because you can easily modify the source code,
rip out the advertisement pop-up window, and re-compile it.
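To make the contrast concrete, here is a purely hypothetical sketch in C
(the program and function names are made up for illustration): when an
advertisement is baked into the source of a traditional program,
removing it is a one-line edit plus a rebuild.

    /* main.c -- hypothetical program shipping an advertisement */
    #include <stdio.h>

    /* the unwanted behavior we want to rip out */
    static void show_advertisement(void)
    {
        printf("Buy our product!\n");
    }

    int main(void)
    {
        show_advertisement();   /* simply delete this call ... */
        printf("doing useful work ...\n");
        return 0;
    }

Delete the call, run "cc -o program main.c" again, and the advertisement
is gone. Nothing comparable is possible with a model whose original
training data is withheld.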
My mind remains mostly the same as six years ago. After these five to
six years, the most important concept in ML-Policy is still ToxicCandy,
which is exactly AI released under an open source license with its
training data hidden.
Some time ago I already felt that OSI was destined to draft something I
disagree with. And upon the release of OSAID-1.0, it will make a huge,
irreversible impact. I could not convince OSI to change their mind, but
I do not want to see free software communities being impacted by the
OSAID and starting to compromise software freedom.
No data, no trust. No data, no security. No data, no freedom[11].
Maybe it is time for us to build a consensus on how we tell whether a
piece of AI is DFSG-compliant, instead of waiting for the ftp-masters to
interpret those