Copyright and training data deals shaping generative AI

Here’s a compact briefing on how copyright fights and training‑data deals are shaping generative AI right now, plus what to watch next.

Big picture

– Two tracks are converging: (1) litigation and regulation to define what’s lawful to use for training, and (2) private licensing to secure high‑value corpora and reduce legal risk. Most large labs are hedging—asserting fair use (or local TDM exceptions) while also signing content deals and offering customer indemnities.

Key legal fault lines

– Is training fair use in the US? Precedents like Authors Guild v. Google (the Google Books case) favor transformative indexing, but courts haven’t squarely ruled on modern model training. Outcomes will likely be domain‑specific (text vs. images vs. music) and fact‑intensive (scope of copying, market harm, security of the stored copies).

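– EU/UK tilt toward licensing: The EU’s general TDM exception (Article 4 of the 2019 Copyright Directive) allows text and data mining unless a rightsholder opts out, which for online content means machine‑readable means; the narrower scientific‑research exception (Article 3) has no opt‑out. The AI Act adds transparency and opt‑out compliance duties. The UK considered expanding its TDM exception but stepped back; licensing remains the safer path.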

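– Japan is permissive: Article 30‑4 of its Copyright Act allows use for information analysis, including data mining, with no opt‑out mechanism, unless the use unreasonably prejudices the copyright holder’s interests. That has drawn interest from some labs; political pressure is rising but the statute stands.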

– Human authorship: The US Copyright Office says purely AI‑generated content isn’t protectable, and registration applications for mixed works must disclose the AI‑generated portions. This doesn’t answer the training‑use question but affects downstream claims.

– Notable cases to track

– News/media: NYT v. OpenAI/Microsoft (outputs and training). Authors’ class actions against OpenAI and Meta have been largely trimmed, but some claims continue.

– Images: Artists v. Stability AI/Midjourney/DeviantArt; Getty Images v. Stability AI (US/UK).

– Code: GitHub Copilot litigation saw many claims dismissed; key issues (e.g., attribution/DMCA) linger.

– Music/lyrics: Major labels sued Suno and Udio; music publishers sued Anthropic over lyrics. These could set tougher precedents for music than for text.

– Proprietary databases: Thomson Reuters v. Ross (the court, ruling on summary judgment rather than after a jury verdict, found infringement over Westlaw headnotes used to train a legal‑research tool); underscores risk when sourcing from proprietary datasets.

What the big deals look like (illustrative, not exhaustive)

– News and publishers (text)

– OpenAI has signed multi‑year licenses with major outlets (e.g., Associated Press, Axel Springer, Financial Times, Vox Media, The Atlantic, TIME, News Corp). Terms vary: content access for training, display rights in products, and source attribution/links.

– Google licenses certain sources (e.g., Reddit) and integrates publisher content in products like AI Overviews; it also relies on EU TDM rules and robots directives where applicable.

– Expect more mid‑tier publisher consortium deals and standardized opt‑in marketplaces.

– Social/community data

– Reddit licensed its corpus to OpenAI and Google; Stack Overflow partnered with OpenAI and with Google (product integrations plus data access components).

– These deals acknowledge unique Q&A/forum value for reasoning and coding tasks.

– Stock media and photos

– Shutterstock licenses its library to multiple labs (OpenAI, Google, Meta and others) and runs contributor compensation programs; Adobe trains Firefly on Adobe Stock/other licensed or public‑domain sources and offers strong indemnities.

– Music and video

– Music remains litigation‑heavy; broad training licenses are rare. YouTube’s Music AI Incubator experiments with consent‑based models; labels are testing opt‑in vocal likeness programs while suing over wholesale ingestion of their catalogs.

– Enterprise/government/vertical data

– Labs are assembling “clean rooms” with licensed technical manuals, legal/tax content, scientific corpora, and proprietary enterprise datasets; many enterprise deployments train or fine‑tune on customer‑provided data to avoid public‑web risk.

Compliance and controls taking shape

– Web signals and opt‑outs: Robots directives for model crawlers (e.g., GPTBot), meta tags (noai/noimageai), EU TDM opt‑out requirements. Not legally dispositive everywhere, but widely adopted and increasingly respected.

– Documentation: The EU AI Act requires providers of general‑purpose models placed on the EU market to publish a “sufficiently detailed” summary of training content and to maintain a policy for honoring TDM opt‑outs.

– Provenance and labeling: C2PA/Content Credentials (Adobe, Google, Microsoft, OpenAI and others) for embedding provenance; watermarking like Google’s SynthID for generated media.

– Indemnities: OpenAI Copyright Shield, Microsoft Copilot Copyright Commitment, Adobe IP indemnities—these shift risk for customers but don’t resolve upstream legality.
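
As a rough illustration of the web signals above, here is a minimal Python sketch of how a crawler operator might check a robots.txt rule and a noai/noimageai meta tag before collecting a page. The robots.txt content, the example.com URLs, the "ExampleBot" agent, and the helper names may_fetch and has_noai_directive are assumptions made for this example; only the GPTBot token and the noai/noimageai tags come from the text. It is a sketch, not a complete compliance check.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher: block GPTBot, allow others.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        self.directives.update(
            token.strip().lower() for token in content.split(",") if token.strip()
        )


def has_noai_directive(html: str) -> bool:
    """Return True if the page carries a noai or noimageai robots meta tag."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & {"noai", "noimageai"})


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
    print(may_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/news/story"))
    print(may_fetch(ROBOTS_TXT, "ExampleBot", "https://example.com/news/story"))
    print(has_noai_directive(page))
```

Both checks are advisory signals rather than legal conclusions, which matches the point above that they are not dispositive everywhere.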

Strategic implications

– Data scarcity premium: High‑quality, well‑labeled, rights‑clear datasets (news, forums, stock media, domain corpora) are appreciating assets. Expect rising prices and exclusivity pushes, plus “most‑favored” and usage‑audit clauses.

– Model differentiation: Licensed data can improve recency, reliability, and on‑screen attribution. Open models and smaller labs may lean more on permissive jurisdictions, public domain/CC corpora, synthetic data, or opt‑in creator pools.

– Safety and privacy: Scraping personal data triggers GDPR/CCPA risks; regulators are scrutinizing biometric/face/voice training. Data minimization and subject‑access/deletion processes are becoming table stakes.

– Traffic and economics: Generative answers reduce referral traffic. Licensing deals often bundle link attribution or in‑product branding to offset this, but the balance remains contentious and may draw antitrust scrutiny.

What to watch in the next 6–12 months

– First substantive rulings on fair use for training in US federal courts; any settlements with behavioral remedies (e.g., expanded opt‑outs, revenue shares).

– EU AI Act implementation details: what counts as a “sufficiently detailed” dataset summary; enforcement posture on TDM opt‑outs.

– Music cases (Suno/Udio; publishers v. Anthropic) setting precedents that could spill over to other media.

– Expansion of creator‑opt‑in pools (visual artists, voice actors, musicians) with clearer rates and rev share; more media “incubators.”

– Growth of standardized data licenses and registries; better machine‑readable rights metadata across the web.

– More regional divergence: EU/UK compliance builds, US litigation continues, Japan remains permissive (unless amended).

Practical guidance

– For rights holders/publishers

– Decide your posture: license, allow with conditions, or opt out. Implement machine‑readable controls (robots, TDM tags, IPTC/C2PA metadata) and log access.

– Explore consortium bargaining or marketplaces to increase leverage and reduce transaction costs.

– Negotiate for attribution/links, update APIs/ToS, and monitor product surfaces for compliance.
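
One way a publisher might start on the "log access" step is to tally requests from known AI crawlers in its server logs. The sketch below assumes user-agent strings have already been extracted from the logs; the token list (GPTBot from the text, plus CCBot and ClaudeBot as illustrative additions) and the function name tally_ai_crawlers are assumptions, and a real deployment would maintain its own list.

```python
from collections import Counter
from typing import Iterable

# Illustrative crawler tokens. GPTBot is named in the text; the others are
# assumptions added for the example and should be replaced with your own list.
AI_CRAWLER_TOKENS = ("GPTBot", "CCBot", "ClaudeBot")


def tally_ai_crawlers(user_agents: Iterable[str]) -> Counter:
    """Count requests whose user-agent string contains a known crawler token."""
    counts: Counter = Counter()
    for user_agent in user_agents:
        lowered = user_agent.lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each request to at most one crawler
    return counts


if __name__ == "__main__":
    sample = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "CCBot/2.0",
        "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0",
    ]
    for token, hits in tally_ai_crawlers(sample).items():
        print(token, hits)
```

User-agent strings can be spoofed, so counts like these are a starting point for monitoring rather than proof of who accessed the content.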

– For AI builders/product teams

– Track and respect opt‑outs; keep audit trails of sources and licenses.

– Prefer licensed, well‑governed datasets for high‑risk domains; ring‑fence proprietary sources.

– Prepare EU AI Act documentation and provenance/labeling for generated outputs.

– Offer customer indemnities only with matching upstream coverage, and red‑team outputs for leakage of copyrighted content.
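
The audit‑trail advice above could be sketched roughly as follows: one record per source, a simple gate that excludes opted‑out or unlicensed material, and a JSON line for an append‑only log. The SourceRecord fields, the license identifiers, and the function names are hypothetical choices for this example, not an established schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One audit-trail entry for a training-data source (hypothetical schema)."""

    source: str             # URL or dataset name
    license_id: str         # e.g. "CC-BY-4.0", "contract:acme-2024", "unknown"
    opt_out_detected: bool  # result of robots/TDM/meta-tag checks
    checked_on: str         # ISO date of the last rights check


def usable_for_training(record: SourceRecord, allowed_licenses: set[str]) -> bool:
    """Admit a source only if no opt-out was seen and its license is approved."""
    return (not record.opt_out_detected) and record.license_id in allowed_licenses


def to_audit_line(record: SourceRecord) -> str:
    """Serialize a record as one JSON line for an append-only audit log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    allowed = {"CC-BY-4.0", "contract:acme-2024"}
    records = [
        SourceRecord("https://example.com/docs", "CC-BY-4.0", False, "2024-06-01"),
        SourceRecord("https://example.org/blog", "unknown", False, "2024-06-01"),
        SourceRecord("https://example.net/news", "CC-BY-4.0", True, "2024-06-01"),
    ]
    for record in records:
        print(usable_for_training(record, allowed), to_audit_line(record))
```

Keeping the rights check date in each record makes it easier to re‑verify sources when opt‑out signals or license terms change.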

– For creators

– Use available opt‑out tools and provenance tags; register works where valuable.

– Consider opt‑in programs with clear compensation and downstream usage terms.

If you share your role (publisher, startup, creator, counsel) and region, I can tailor a checklist and point to the most relevant deals, standards, and risks for your situation.
