Copyright and training data deals shaping generative AI

Here's a compact briefing on how copyright fights and training-data deals are shaping generative AI right now, plus what to watch next.

Big picture

– Two tracks are converging: (1) litigation and regulation to define what's lawful to use for training, and (2) private licensing to secure high-value corpora and reduce legal risk. Most large labs are hedging: asserting fair use (or local TDM exceptions) while also signing content deals and offering customer indemnities.

Key legal fault lines

– Is training fair use in the US? Precedents like Authors Guild v. Google (the Google Books case) favor transformative indexing, but courts haven't squarely ruled on modern model training. Outcomes will likely be domain-specific (text vs. images vs. music) and fact-intensive (scope, market harm, security).

– EU/UK tilt toward licensing: The EU's TDM exceptions allow text/data mining unless a rightsholder opts out via machine-readable means; the AI Act adds transparency and opt-out compliance. The UK considered expanding TDM but stepped back; licensing remains the safer path.

– Japan is permissive: Article 30-4 effectively allows data mining regardless of opt-out, drawing interest from some labs; political pressure is rising but the statute stands.

– Human authorship: The US Copyright Office says purely AI-generated content isn't protectable; mixed works need disclosure. This doesn't answer the training-use question but affects downstream claims.

– Notable cases to track

– News/media: NYT v. OpenAI/Microsoft (outputs and training). Authors' class actions against OpenAI/Meta have been largely trimmed, but some claims continue.

– Images: Artists v. Stability AI/Midjourney/DeviantArt; Getty Images v. Stability AI (US/UK).

– Code: GitHub Copilot litigation saw many claims dismissed; key issues (e.g., attribution/DMCA) linger.

– Music/lyrics: Major labels sued Suno and Udio; music publishers sued Anthropic over lyrics. These could set tougher precedents for music than for text.

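– Proprietary databases: Thomson Reuters v. Ross (a federal court held on summary judgment, not by jury verdict, that using Westlaw headnotes to train a non-generative legal search tool infringed and was not fair use); underscores risk when sourcing from proprietary datasets.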

What the big deals look like (illustrative, not exhaustive)

– News and publishers (text)

– OpenAI has signed multi-year licenses with major outlets (e.g., Associated Press, Axel Springer, Financial Times, Vox Media, The Atlantic, TIME, News Corp). Terms vary: content access for training, display rights in products, and source attribution/links.

– Google licenses certain sources (e.g., Reddit) and integrates publisher content in products like AI Overviews; it also relies on EU TDM rules and robots directives where applicable.

– Expect more mid-tier publisher consortium deals and standardized opt-in marketplaces.

– Social/community data

– Reddit licensed its corpus to OpenAI and Google; Stack Overflow partnered with OpenAI and with Google (product integrations plus data access components).

– These deals acknowledge unique Q&A/forum value for reasoning and coding tasks.

– Stock media and photos

– Shutterstock licenses its library to multiple labs (OpenAI, Google, Meta and others) and runs contributor compensation programs; Adobe trains Firefly on Adobe Stock/other licensed or public-domain sources and offers strong indemnities.

– Music and video

– Music remains litigation-heavy; broad training licenses are rare. YouTube's Music AI incubator experiments with consent-based models; labels are testing opt-in vocal likeness programs while suing over wholesale ingestion of their catalogs.

– Enterprise/government/vertical data

– Labs are assembling "clean rooms" with licensed technical manuals, legal/tax content, scientific corpora, and proprietary enterprise datasets; many enterprise deployments train or fine-tune on customer-provided data to avoid public-web risk.

Compliance and controls taking shape

– Web signals and opt-outs: Robots directives for model crawlers (e.g., GPTBot), meta tags (noai/noimageai), EU TDM opt-out requirements. Not legally dispositive everywhere, but widely adopted and increasingly respected.

– Documentation: EU AI Act will require a "sufficiently detailed" training data summary and opt-out compliance for general-purpose models supplied in the EU.

– Provenance and labeling: C2PA/Content Credentials (Adobe, Google, Microsoft, OpenAI and others) for embedding provenance; watermarking like Google's SynthID for generated media.

– Indemnities: OpenAI Copyright Shield, Microsoft Copilot Copyright Commitment, Adobe IP indemnities. These shift risk for customers but don't resolve upstream legality.
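
To make the web-signal bullet concrete, here is a minimal Python sketch of how a crawler might check two of those signals before using a page. It is an illustration under stated assumptions, not any lab's actual logic: it uses only the standard library (`urllib.robotparser`, `html.parser`), the helper names `robots_allows` and `has_noai_meta` and the sample robots.txt are hypothetical, and `noai`/`noimageai` are informal conventions rather than a formal standard.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks the GPTBot crawler, allows everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def robots_allows(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class _MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so default to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for token in attr_map.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noai_meta(html):
    """Return True if the page carries a noai or noimageai robots meta directive."""
    parser = _MetaRobotsParser()
    parser.feed(html)
    return bool(parser.directives & {"noai", "noimageai"})
```

A pipeline would typically skip a page when `robots_allows(...)` is false or `has_noai_meta(...)` is true, and record that decision for later audits.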

Strategic implications

– Data scarcity premium: High-quality, well-labeled, rights-clear datasets (news, forums, stock media, domain corpora) are appreciating assets. Expect rising prices and exclusivity pushes, plus "most-favored" and usage-audit clauses.

– Model differentiation: Licensed data can improve recency, reliability, and on-screen attribution. Open models and smaller labs may lean more on permissive jurisdictions, public domain/CC corpora, synthetic data, or opt-in creator pools.

– Safety and privacy: Scraping personal data triggers GDPR/CCPA risks; regulators are scrutinizing biometric/face/voice training. Data minimization and subject-access/deletion processes are becoming table stakes.

– Traffic and economics: Generative answers reduce referral traffic. Licensing deals often bundle link attribution or in-product branding to offset this, but the balance remains contentious and may draw antitrust scrutiny.

What to watch in the next 6–12 months

– First substantive rulings on fair use for training in US federal courts; any settlements with behavioral remedies (e.g., expanded opt-outs, revenue shares).

– EU AI Act implementation details: what counts as a "sufficiently detailed" dataset summary; enforcement posture on TDM opt-outs.

– Music cases (Suno/Udio; publishers v. Anthropic) setting precedents that could spill over to other media.

– Expansion of creator-opt-in pools (visual artists, voice actors, musicians) with clearer rates and rev share; more media "incubators."

– Growth of standardized data licenses and registries; better machine-readable rights metadata across the web.

– More regional divergence: EU/UK compliance builds, US litigation continues, Japan remains permissive (unless amended).

Practical guidance

– For rights holders/publishers

– Decide your posture: license, allow with conditions, or opt out. Implement machine-readable controls (robots, TDM tags, IPTC/C2PA metadata) and log access.

– Explore consortium bargaining or marketplaces to increase leverage and reduce transaction costs.

– Negotiate for attribution/links, update APIs/ToS, and monitor product surfaces for compliance.
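
For the opt-out posture, the robots control mentioned above can be generated rather than hand-edited. The sketch below is one hedged way to do it in Python; `build_robots_txt` is a hypothetical helper, and apart from GPTBot (named earlier in this briefing) the crawler tokens listed are assumptions that should be checked against each vendor's current documentation.

```python
def build_robots_txt(blocked_agents, default_allow=True):
    """Render a robots.txt that disallows each listed crawler site-wide."""
    lines = []
    for agent in blocked_agents:
        # One group per blocked crawler, separated by a blank line.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Fallback group for every crawler not named above.
    lines += ["User-agent: *", "Allow: /" if default_allow else "Disallow: /"]
    return "\n".join(lines) + "\n"


# GPTBot appears in the text; the other tokens are assumptions to verify.
AI_CRAWLERS = ["GPTBot", "Google-Extended", "CCBot"]

if __name__ == "__main__":
    print(build_robots_txt(AI_CRAWLERS))
```

Robots directives are a request, not an enforcement mechanism, so this pairs naturally with access logging to see which crawlers actually honor it.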

– For AI builders/product teams

– Track and respect opt-outs; keep audit trails of sources and licenses.

– Prefer licensed, well-governed datasets for high-risk domains; ring-fence proprietary sources.

– Prepare EU AI Act documentation and provenance/labeling for generated outputs.

– Offer customer indemnities only with matching upstream coverage and red-team outputs for copyrighted content leakage.
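
One way to picture the audit trail of sources and licenses is a small per-document record written at ingestion time. The Python sketch below is an assumption-laden illustration, not a standard schema: the `SourceRecord` fields, the license identifiers, and the helper names are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceRecord:
    url: str
    license_id: str         # e.g. "CC-BY-4.0" or an internal contract ID
    opt_out_detected: bool  # outcome of robots/TDM/meta-tag checks at fetch time
    content_sha256: str     # fingerprint of the exact bytes that were ingested
    fetched_at: str         # ISO 8601 timestamp in UTC


def make_record(url, content, license_id, opt_out_detected, now=None):
    """Build an audit record for one fetched document (content is bytes)."""
    now = now or datetime.now(timezone.utc)
    return SourceRecord(
        url=url,
        license_id=license_id,
        opt_out_detected=opt_out_detected,
        content_sha256=hashlib.sha256(content).hexdigest(),
        fetched_at=now.isoformat(),
    )


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "".join(json.dumps(asdict(r), sort_keys=True) + "\n" for r in records)


def usable_for_training(record, approved_licenses):
    """Keep a document only if no opt-out was seen and its license is approved."""
    return (not record.opt_out_detected) and record.license_id in approved_licenses
```

Storing the content hash alongside the license and opt-out result lets a team later show which version of a document was ingested and under what terms.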

– For creators

– Use available opt-out tools and provenance tags; register works where valuable.

– Consider opt-in programs with clear compensation and downstream usage terms.

If you share your role (publisher, startup, creator, counsel) and region, I can tailor a checklist and point to the most relevant deals, standards, and risks for your situation.

