Here's a compact briefing on how copyright fights and training-data deals are shaping generative AI right now, plus what to watch next.
Big picture
– Two tracks are converging: (1) litigation and regulation to define what's lawful to use for training, and (2) private licensing to secure high-value corpora and reduce legal risk. Most large labs are hedging: they assert fair use (or local TDM exceptions) while also signing content deals and offering customer indemnities.
Key legal fault lines
– Is training fair use in the US? Precedents like Authors Guild v. Google (the Google Books case) favor transformative indexing, but courts haven't squarely ruled on modern model training. Outcomes will likely be domain-specific (text vs. images vs. music) and fact-intensive (scope, market harm, security).
– EU/UK tilt toward licensing: The EU's general TDM exception (Article 4 of the DSM Copyright Directive) allows text/data mining unless a rightsholder opts out via machine-readable means; the AI Act adds transparency and opt-out compliance. The UK considered expanding TDM but stepped back; licensing remains the safer path.
– Japan is permissive: Article 30-4 of the Copyright Act effectively allows data mining regardless of opt-out, drawing interest from some labs; political pressure is rising but the statute stands.
– Human authorship: The US Copyright Office says purely AI-generated content isn't protectable; mixed works need disclosure. This doesn't answer the training-use question but affects downstream claims.
– Notable cases to track
  – News/media: NYT v. OpenAI/Microsoft (outputs and training). Authors' class actions against OpenAI and Meta have been largely trimmed, but some claims continue.
  – Images: Artists v. Stability AI/Midjourney/DeviantArt; Getty Images v. Stability AI (US/UK).
  – Code: The GitHub Copilot litigation saw many claims dismissed; key issues (e.g., attribution/DMCA) linger.
  – Music/lyrics: Major labels sued Suno and Udio; music publishers sued Anthropic over lyrics. These could set tougher precedents for music than for text.
  – Proprietary databases: Thomson Reuters v. Ross (the court, ruling on summary judgment rather than through a jury, found infringement over Westlaw headnotes used to train a legal research tool); this underscores the risk of sourcing from proprietary datasets.
What the big deals look like (illustrative, not exhaustive)
– News and publishers (text)
  – OpenAI has signed multi-year licenses with major outlets (e.g., Associated Press, Axel Springer, Financial Times, Vox Media, The Atlantic, TIME, News Corp). Terms vary: content access for training, display rights in products, and source attribution/links.
  – Google licenses certain sources (e.g., Reddit) and integrates publisher content in products like AI Overviews; it also relies on EU TDM rules and robots directives where applicable.
  – Expect more mid-tier publisher consortium deals and standardized opt-in marketplaces.
– Social/community data
  – Reddit licensed its corpus to OpenAI and Google; Stack Overflow partnered with OpenAI and with Google (product integrations plus data access components).
  – These deals acknowledge the unique value of Q&A/forum content for reasoning and coding tasks.
– Stock media and photos
  – Shutterstock licenses its library to multiple labs (OpenAI, Google, Meta and others) and runs contributor compensation programs; Adobe trains Firefly on Adobe Stock and other licensed or public-domain sources and offers strong indemnities.
– Music and video
  – Music remains litigation-heavy; broad training licenses are rare. YouTube's Music AI Incubator experiments with consent-based models; labels are testing opt-in vocal likeness programs while suing over wholesale ingestion.
– Enterprise/government/vertical data
  – Labs are assembling "clean rooms" with licensed technical manuals, legal/tax content, scientific corpora, and proprietary enterprise datasets; many enterprise deployments train or fine-tune on customer-provided data to avoid public-web risk.
Compliance and controls taking shape
– Web signals and opt-outs: Robots directives for model crawlers (e.g., GPTBot), meta tags (noai/noimageai), EU TDM opt-out requirements. Not legally dispositive everywhere, but widely adopted and increasingly respected.
– Documentation: The EU AI Act will require a "sufficiently detailed" training data summary and opt-out compliance for general-purpose models supplied in the EU.
– Provenance and labeling: C2PA/Content Credentials (Adobe, Google, Microsoft, OpenAI and others) for embedding provenance; watermarking like Google's SynthID for generated media.
– Indemnities: OpenAI Copyright Shield, Microsoft Copilot Copyright Commitment, Adobe IP indemnities. These shift risk for customers but don't resolve upstream legality.
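As a rough illustration of the web signals mentioned above, the following Python sketch checks a robots.txt body for a crawler-specific rule and scans a page for noai/noimageai meta directives. It uses only the standard library. The sample robots.txt, the sample page, and the helper names (`crawler_allowed`, `page_opts_out`) are assumptions made up for this example, and the noai/noimageai tags are an informal convention rather than a formal standard, so treat this as a starting point rather than a compliance tool.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def page_opts_out(html):
    """Return True if the page carries a noai or noimageai robots meta directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & {"noai", "noimageai"})


# Hypothetical inputs for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
PAGE = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'

if __name__ == "__main__":
    print(crawler_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article"))
    print(crawler_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article"))
    print(page_opts_out(PAGE))
```

With these sample inputs, the GPTBot check should come back disallowed, the generic crawler allowed, and the page flagged as opted out. A real crawler would also need to fetch and cache robots.txt per host and decide how to treat missing or malformed files.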
Strategic implications
– Data scarcity premium: High-quality, well-labeled, rights-clear datasets (news, forums, stock media, domain corpora) are appreciating assets. Expect rising prices and exclusivity pushes, plus "most-favored" and usage-audit clauses.
– Model differentiation: Licensed data can improve recency, reliability, and on-screen attribution. Open models and smaller labs may lean more on permissive jurisdictions, public domain/CC corpora, synthetic data, or opt-in creator pools.
– Safety and privacy: Scraping personal data triggers GDPR/CCPA risks; regulators are scrutinizing biometric/face/voice training. Data minimization and subject-access/deletion processes are becoming table stakes.
– Traffic and economics: Generative answers reduce referral traffic. Licensing deals often bundle link attribution or in-product branding to offset this, but the balance remains contentious and may draw antitrust scrutiny.
What to watch in the next 6–12 months
– First substantive rulings on fair use for training in US federal courts; any settlements with behavioral remedies (e.g., expanded opt-outs, revenue shares).
– EU AI Act implementation details: what counts as a "sufficiently detailed" dataset summary; enforcement posture on TDM opt-outs.
– Music cases (Suno/Udio; publishers v. Anthropic) setting precedents that could spill over to other media.
– Expansion of creator opt-in pools (visual artists, voice actors, musicians) with clearer rates and rev share; more media "incubators."
– Growth of standardized data licenses and registries; better machine-readable rights metadata across the web.
– More regional divergence: EU/UK compliance builds, US litigation continues, Japan remains permissive (unless amended).
Practical guidance
– For rights holders/publishers
  – Decide your posture: license, allow with conditions, or opt out. Implement machine-readable controls (robots, TDM tags, IPTC/C2PA metadata) and log access.
  – Explore consortium bargaining or marketplaces to increase leverage and reduce transaction costs.
  – Negotiate for attribution/links, update APIs/ToS, and monitor product surfaces for compliance.
– For AI builders/product teams
  – Track and respect opt-outs; keep audit trails of sources and licenses.
  – Prefer licensed, well-governed datasets for high-risk domains; ring-fence proprietary sources.
  – Prepare EU AI Act documentation and provenance/labeling for generated outputs.
  – Offer customer indemnities only with matching upstream coverage, and red-team outputs for leakage of copyrighted content.
– For creators
  – Use available opt-out tools and provenance tags; register works where valuable.
  – Consider opt-in programs with clear compensation and downstream usage terms.
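For builders, the advice to track opt-outs and keep audit trails of sources and licenses could be sketched roughly as below. This is a minimal Python illustration, not a reference design: the `SourceRecord` fields, the `ALLOWED_LICENSES` policy set, and the JSON-lines log format are all assumptions invented for the example, and a real pipeline would need legal review of what counts as an acceptable license.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One ingested source, with the rights facts known at check time (hypothetical schema)."""
    url: str
    license_tag: str   # e.g. "licensed", "public-domain", "CC-BY-4.0", "unknown"
    opted_out: bool    # True if a robots/TDM/meta opt-out signal was observed
    checked_at: str    # ISO 8601 timestamp of the rights check


# Hypothetical policy: which license tags this team treats as usable for training.
ALLOWED_LICENSES = {"licensed", "public-domain", "CC-BY-4.0"}


def is_eligible(record):
    """A source is eligible only if it has not opted out and its license is allowed."""
    return not record.opted_out and record.license_tag in ALLOWED_LICENSES


def audit_line(record):
    """Serialize the record plus the eligibility decision as one JSON line."""
    entry = asdict(record)
    entry["eligible"] = is_eligible(record)
    return json.dumps(entry, sort_keys=True)


if __name__ == "__main__":
    records = [
        SourceRecord("https://example.com/a", "licensed", False, "2024-01-01T00:00:00+00:00"),
        SourceRecord("https://example.com/b", "unknown", False, "2024-01-01T00:00:00+00:00"),
        SourceRecord("https://example.com/c", "CC-BY-4.0", True, "2024-01-01T00:00:00+00:00"),
    ]
    for record in records:
        print(audit_line(record))
```

Recording the decision alongside the inputs that produced it is the point of the design: if a policy or an opt-out status changes later, the log shows what was known when the source was ingested.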
If you share your role (publisher, startup, creator, counsel) and region, I can tailor a checklist and point to the most relevant deals, standards, and risks for your situation.