In recent months, we’ve seen an explosion of AI coding assistants. They do everything from explaining code to writing unit tests. Engineering teams at every company want to use them. While these tools can improve test coverage and detect code smells, for many legal teams, they are a nightmare. The technology is new, and the legal risks untested.
When evaluating AI coding providers, look for four key terms that will mitigate the primary risks associated with the tools.
No Training on Your Data
Be clear in your contract with your AI coding provider that your data (inputs and outputs) will not be used to train models.
When a model trains on your data, the model provider keeps your data for a period of time. If the provider then suffers a security breach, you may suddenly have a notifiable security incident on your hands, depending on how much sensitive data, such as personal or customer data, is in your code.
Per best practices, code shouldn’t contain these types of data. But when it does, contractual and regulatory obligations may require you to notify affected users and customers and government agencies. Security breaches can also trigger customer termination rights, depending on your customer contracts.
In addition, data subject rights may apply to any personal data processed by the model. Once a model has been trained on your data, it’s hard to see how a model provider can delete or correct that data. Models don’t store training data in a searchable form, as a database does, yet they can still recall or duplicate data from their training sets.
Minimal Data Retention
To reduce code leakage risk and bolster data privacy compliance, negotiate minimal data retention periods with your AI coding provider. It’s critical that the minimal retention periods apply to any models that process your code.
The default is often no specified retention period. Some AI coding assistants and models offer 30-day retention periods. The best is zero retention: Your inputs and outputs are not retained beyond the time it takes to generate the output. Zero retention significantly reduces security breach risk. It also helps you comply with user data deletion rights.
Output Ownership
Make sure that you own, rather than license, any IP rights in the outputs created by your use of your AI coding assistant. This is already the default position for many but not all AI coding products on the market. Some give you a license to use the outputs. Others are silent on output ownership.
Companies generally want to own their key code so they can prevent others from using it. When it comes to AI-generated code, companies might not be able to prevent others from using the same code, given recent US Copyright Office guidance that material generated by AI without sufficient human authorship is not protected by copyright.
However, you should still avoid licensing the outputs from your AI coding provider. Absent an irrevocable license, the owner can change or cancel your license at any time. If your AI coding provider of choice insists on a license, ask why, and make sure the license is at least irrevocable, perpetual, and royalty-free.
IP Indemnity
Cover your bases by negotiating a broad indemnity.
Today, your AI-generated code is unlikely to attract copyright suits. The practical risk of a suit arising from code snippets in your private codebase is low because they are private. Code snippets are also hard to copyright because functions, instructions, methods, processes, and generic code are not copyrightable. That describes most code snippets emitted by AI coding assistants today.
The lawsuit involving GitHub Copilot doesn’t allege actual copyright infringement. It alleges many other violations, including failure to comply with attribution requirements under open-source licenses. Companies can address that issue in their own codebases by using open-source compliance tools that help attribute or remove noncompliant code. This signals that copyright infringement suits arising from your use of AI coding assistants will not be easy to file.
To further reduce copyright risk, you can choose tools powered by the largest training sets. Generally, the bigger the training set, the better the performance and the fewer the verbatim outputs. Most AI coding assistants on the market are trained on billions of lines of code or more. However, some models trained on smaller datasets can also reduce legal risk.
For example, using models trained on only permissively licensed code largely eliminates concerns of infringement on copyleft or proprietary code. You can also implement routine open-source scans, apply filters to remove AI outputs that look similar to copyrighted public code, and negotiate an indemnity.
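As a rough illustration of the output filter mentioned above, the sketch below flags an AI suggestion when it overlaps heavily with a known public snippet. It is only a minimal sketch under stated assumptions: the function names, the five-token window, and the 0.8 threshold are all hypothetical choices made for this example, and commercial tools may use different and more sophisticated matching.

```python
import re


def shingles(code: str, k: int = 5) -> set:
    """Split code into tokens and return the set of k-token windows.

    Whitespace is ignored, so reformatting alone does not hide a match.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared windows divided by all windows."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_like_public_code(suggestion: str, public_snippets: list,
                           threshold: float = 0.8) -> bool:
    """Return True if the suggestion closely matches any public snippet.

    The threshold is an illustrative assumption, not an industry standard.
    """
    candidate = shingles(suggestion)
    return any(jaccard(candidate, shingles(snippet)) >= threshold
               for snippet in public_snippets)


# Hypothetical corpus of public code to compare against.
public_corpus = ["def add(a, b):\n    return a + b"]
flagged = looks_like_public_code("def add(a,b):\n\treturn a+b", public_corpus)
```

A suggestion that is flagged could be dropped or routed to a reviewer for an attribution check. Where to set the threshold is a judgment call that trades missed matches against false alarms.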
An IP indemnity would require the vendor to defend you if, despite your proper use of the product, someone claims that your AI-generated outputs infringe their IP. Ideally, the IP indemnity is uncapped or subject to a super-cap rather than the standard liability limit. The standard limit is usually capped at your annual fee and likely won’t cover the cost of a copyright claim unless you are paying your AI coding provider millions annually.
Most AI coding providers today will not warrant that their outputs are non-infringing, given the broad scope of their model training data, but to offset harms arising from your use of their product, they will offer a capped IP indemnity.
In the future, they will likely converge toward an uncapped IP indemnity. Uncapped IP indemnity is common in the rest of the SaaS industry, especially among large providers, from Google Cloud to Snowflake. The copyright risk is no greater than other IP risks like patent troll suits. As the market matures, we should see more AI coding providers offer uncapped IP indemnity.
If you have bargaining power and time, negotiate broad indemnity coverage. At the same time, the risk of a copyright suit is small, so a limited indemnity may be acceptable to you.
In addition to managing risk by negotiating key terms, enforcing coding best practices becomes more important than ever. Avoid personal and customer data in code, implement proper code reviews, and apply routine scans and testing.
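One way to picture a routine scan for personal data in code is sketched below. It is a minimal sketch, not a complete detector: the two patterns (email addresses and strings shaped like US Social Security numbers) and the function name are assumptions chosen for illustration, and a real scan would likely cover many more data types.

```python
import re

# Illustrative patterns only; both are assumptions made for this sketch.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_personal_data(source: str) -> list:
    """Return (line number, pattern label, matched text) for each hit."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((lineno, label, match.group()))
    return findings


# Hypothetical source file contents used to exercise the scan.
sample = 'x = 1\ncontact = "jane@example.com"\nssn = "123-45-6789"\n'
for lineno, label, text in find_personal_data(sample):
    print(f"line {lineno}: possible {label}: {text}")
```

A check like this could run before code is committed or sent to an AI coding assistant, so that flagged lines are cleaned up before they reach a third party.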
This article does not necessarily reflect the opinion of Bloomberg Industry Group, Inc., the publisher of Bloomberg Law and Bloomberg Tax, or its owners.
Tammy Zhu is a tech lawyer who helps companies build and use AI products. She is the VP of Legal at Sourcegraph Inc.