GitHub Copilot: AI’s March Towards Fair Learning

August 15, 2023

Michael Peterson

Lead DevOps Engineer, Civis Analytics

GitHub Copilot and OpenAI Codex have been discussed in detail, from the legal and ethical issues to the security risks. As Civis weighs the technical and ethical pros and cons of incorporating AI, I wanted to delve into the fair-use perspective and provide some insight from my master’s work on the ethics of artificial intelligence.

Understanding the Legal and Ethical Dilemmas of GitHub Copilot and OpenAI Codex

Users have uploaded enormous amounts of public code to GitHub under a wide assortment of licenses. GitHub and OpenAI have used this public code as the training data for the OpenAI Codex model. GitHub suggests that Copilot is “fair use” because it creates transformative work from this source material. Questions of transformative use and market use are at the core of copyright’s fair use doctrine, and some researchers and legal scholars believe that GitHub Copilot may violate these principles. The legal questions sit alongside security risks: programmers need to remain aware and diligent in the use of any tool, because complacent use of GitHub Copilot only amplifies the risk of shipping flawed code.

Pair Programming and AI Development

GitHub describes Copilot as an “AI pair programmer” comparable to a human pair programmer. Codex, which OpenAI developed from GPT-3, is the large language model (LLM) that powers GitHub Copilot. It is a “black-box model”: inputs are provided, outputs are returned, and the model’s internal state is hidden. A black-box model sacrifices interpretability for performance. A human pair programmer acts as an open model: they question assumptions, identify problems, and offer insights, and the resulting code tends to be more accurate and more interpretable. GitHub Copilot doesn’t question assumptions, and its code suggestions often tend towards obfuscation. This lack of interpretability limits developers’ ability to work out bugs and security flaws. OpenAI Codex has been trained on code from public GitHub repositories, and much of that code is untested, malformed, or simply bad. A generative model mimics its training data, so it will output more bad code: uninterpretable code that statistically resembles what it was trained on.

Pair Programming, n.

A software development technique in which two programmers sit side-by-side and work together to write and debug code on the fly, switching back and forth between who is writing and who is helping to debug.

Decoding the Doctrine of Fair Use in AI Training

GitHub relies heavily on the doctrine of fair use to defend its use of public code for training purposes. Many, including OpenAI, have discussed the fair use doctrine and the transformative aspect of LLMs. Most have presupposed that the fair use of copyrighted code was the issue. Casey and Lemley took a step back and considered what machine learning systems actually use from a copyrighted work. They proposed a copyright doctrine of “fair learning” based on the idea that machine learning is analogous to human learning. A human can learn from a created work and reuse its concepts, ideas, and facts without running afoul of copyright law. A machine’s learning should be allowed in the same way when it aims to reuse only the unprotected aspects of the work.

Four Factors Determining Fair Use in AI Systems

For those with a less high-minded perspective, the tenets of fair use stipulate that the original source material must be transformed, creating new information, new aesthetics, new insights, and new understandings. The fair use doctrine intends to protect this type of activity for the enrichment of society. Four factors are considered when determining whether a use is fair:

the purpose and character of the use
the nature of the copyrighted work
the amount and substantiality of the part taken
the effect of the use on the potential market for the copyrighted work

Moving Beyond Copyright: The Debate on ‘Fair Use’ vs. ‘Fair Learning’ in AI

“If the purpose of the AI’s use is not to obtain or incorporate the copyrightable elements of a work but to access, learn, and use the unprotectable parts of the work, that use should be presumptively fair under the first fair use factor.” (Casey and Lemley, 2021)

The machines read copyrighted code not to copy its specific expression but to synthesize ideas and facts. Since the beginning of copyright, ideas and facts have been held to be the uncopyrightable attributes of a work. That code may be returned verbatim is not plagiarism but rather a truism of programming: when a language allows no other practical way to express an idea, its syntax limits how that idea can be expressed. Compare 100 if-else statements and you will find that 99 of them are the same.
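To make the if-else point concrete, here is a minimal Python sketch. It is an illustration only, not a legal test and not how Copilot or Codex works. The helper name `shape` and the three snippets are hypothetical, invented for this example. The sketch uses the standard library `ast` module to reduce each snippet to its sequence of syntax node types, ignoring variable names and values.

```python
import ast


def shape(source: str) -> list:
    """Return the syntax node types in `source`, ignoring names and values.

    Hypothetical helper for illustration: two snippets with the same
    shape express their logic through the same language constructs.
    """
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


# Two if-else statements written for unrelated purposes (made-up examples).
snippet_a = """
if count > 10:
    status = "high"
else:
    status = "low"
"""

snippet_b = """
if temperature > 72:
    label = "warm"
else:
    label = "cool"
"""

# A different construct, for contrast.
snippet_c = """
while count > 10:
    count = count - 1
"""

# Once identifiers and literals are set aside, the two if-else
# statements are structurally identical; the while loop is not.
same_shape = shape(snippet_a) == shape(snippet_b)
different_shape = shape(snippet_a) != shape(snippet_c)
print(same_shape, different_shape)
```

The two if-else snippets differ only in their identifiers and literal values, which is the sense in which the language leaves little room for distinct expression of the same idea.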

The copyright issues surrounding GitHub Copilot and other LLMs are complicated and far from settled. Opposition to GitHub Copilot centers on the “fair use” of copyrighted material, and in particular on the questions of transformative use and market use. “Fair learning” presents a new approach and moves the conversation forward.

Invitation to Discuss: Leveraging AI Tools in Platform

What are your opinions about fair use vs. “fair learning”? Are you planning on using AI as a pair programmer in your work?

Civis is keeping an eye on the developments that seem to be coming daily. We are thinking strategically about how we could potentially leverage tools like these inside Platform in the future. If you have thoughts about the pros and cons, reach out to your account manager and share them with us!