Disclosure: The views and opinions expressed here belong solely to the author and do not represent the views and opinions of crypto.news’ editorial.
Elon Musk has sued OpenAI over its alleged departure from its founding mission of developing AGI ‘for the benefit of humanity.’ Carlos E. Perez suspects the lawsuit could turn the current generative AI market leader into the next WeWork.
OpenAI’s for-profit transformation is a focus of this legal battle. However, the excessive emphasis on profit betrays vested corporate interests. It also diverts attention from concerns that matter more to end-users: ethical AI training and data management.
Grok, Elon’s brainchild and ChatGPT competitor, can access ‘real-time information’ from tweets. OpenAI, for its part, is infamous for scraping copyrighted data left, right, and center. And now Google has struck a $60 million deal to access Reddit users’ data to train Gemini and Cloud AI.
Merely pushing for open-source doesn’t serve users’ interests in this environment. They need ways to ensure meaningful consent and to be compensated for helping train LLMs. Emergent platforms building tools to crowdsource AI training data, for example, are critical in this regard. More on that later.
It’s mostly non-profit for users
Over 5.3 billion people use the internet globally, and roughly 93% of them use centralized social media. It’s therefore likely that most of the 147 billion terabytes of data produced online in 2023 were user-generated. The volume is expected to cross 180 billion terabytes by 2025.
While this massive data set, or ‘publicly available information,’ fuels AI’s training and evolution, users mostly don’t reap the benefits. They have neither control nor real ownership. The ‘I Agree’ way of giving consent isn’t meaningful either; it’s deception at best and coercion at worst.
Data is the new oil. It’s not in Big Tech’s interest to give end-users more control over their data. For one, paying users for data would significantly increase LLM training costs, which already run over $100 million. However, as Chris Dixon argues in “Read, Write, Own,” five big firms controlling and potentially ‘ruining everything’ is the fast lane to dystopia.
Yet with blockchains evolving into distributed data layers and sources of truth, the best era for users has just begun. Most importantly, unlike big corporations, new-age AI companies embrace such alternatives for better performance, cost-efficiency, and, ultimately, the betterment of humanity.
Crowdsourcing data for ethical AI training
Web2’s read-write-trust model relies on entities and stakeholders not being evil. But human greed knows no bounds—we’re all a bunch of ‘self-interested knaves’, per the 18th-century philosopher David Hume.
Web3’s read-write-own model, therefore, uses blockchain, cryptography, etc., so that distributed network participants can’t be evil. Chris explores this idea extensively in his book.
The web3 tech stack is fundamentally community-oriented and user-led. Providing the toolkit to let users regain control over their data—financial, social, creative, and otherwise—is a core premise in this domain. Blockchains, for instance, serve as distributed, verifiable data layers to settle transactions and immutably establish provenance.
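To make the provenance point concrete, here is a minimal sketch in Python, rather than an actual smart contract, of how hash-chaining lets anyone verify where a contributed data record came from and whether it was tampered with. The field names and the record_provenance helper are illustrative, not any particular chain’s API.

```python
import hashlib
import json
import time

def record_provenance(chain: list, contributor: str, data: bytes) -> dict:
    """Append a record whose hash commits to the data and the previous record."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    record = {
        "contributor": contributor,
        # Commit to the content via its hash, not the content itself.
        "data_hash": hashlib.sha256(data).hexdigest(),
        "timestamp": time.time(),
        "prev_hash": prev_hash,
    }
    # Hash the whole record so any later tampering breaks the chain.
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    chain.append(record)
    return record

chain: list = []
record_provenance(chain, "user_123", b"audio clip in a regional dialect")
```

Because each record commits to its predecessor, rewriting any earlier contribution invalidates every hash after it. That tamper-evidence is what makes blockchain-based provenance verifiable by anyone.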
Moreover, viable privacy and security mechanisms like zero-knowledge proofs (zkProofs) and multi-party computation (MPC) have matured into practical tools in the past couple of years. They open new avenues in data validation, sharing, and management by allowing counterparties to establish truths without revealing the underlying content.
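Real zkProof and MPC protocols are mathematically involved, but the core idea of establishing a truth without revealing the content can be sketched with a salted commitment. This is a toy commit-reveal scheme, not a zero-knowledge proof: it only shows how a contributor can publicly commit to data now and prove later what they held, without exposing the data at commitment time.

```python
import hashlib
import secrets

def commit(data: bytes) -> tuple[str, bytes]:
    """Publish a hash commitment to the data without revealing the data."""
    salt = secrets.token_bytes(16)  # random salt blocks guessing attacks
    digest = hashlib.sha256(salt + data).hexdigest()
    return digest, salt             # digest is public; salt stays private

def open_commitment(commitment: str, salt: bytes, data: bytes) -> bool:
    """Anyone can check that the opened data matches the earlier commitment."""
    return hashlib.sha256(salt + data).hexdigest() == commitment

digest, salt = commit(b"labeled training sample")
assert open_commitment(digest, salt, b"labeled training sample")
```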
These broad capabilities are highly relevant from an AI training point of view. It’s now possible to source reliable data without relying on centralized providers or validators. But most importantly, web3’s decentralized, non-intermediated nature directly connects those who produce data, i.e., users, with the projects that need it to train AI models.
Removing ‘trusted intermediaries’ and gatekeepers significantly reduces costs. It also aligns incentives so projects can compensate users for their efforts and contributions. For example, users can earn cryptocurrencies by completing microtasks like recording scripts in their native dialect, recognizing and labeling objects, sorting and categorizing images, structuring unstructured data, etc.
Companies, on the other hand, can build more accurate models using high-quality data validated by humans in the loop and at a fair price. It’s a win-win.
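As a sketch of how such compensation might be tracked, here is a minimal Python ledger that credits a contributor once a microtask passes human validation. The task types, reward amounts, and the credit_microtask helper are hypothetical, not any particular platform’s schema; in production the balances would live on-chain rather than in memory.

```python
from collections import defaultdict

# Hypothetical rewards per validated microtask, denominated in a project token.
REWARDS = {
    "record_dialect": 5,
    "label_objects": 2,
    "categorize_images": 1,
}

balances: defaultdict[str, int] = defaultdict(int)

def credit_microtask(contributor: str, task_type: str, validated: bool) -> int:
    """Credit the contributor only after human-in-the-loop validation passes."""
    if not validated:
        return 0
    payout = REWARDS[task_type]
    balances[contributor] += payout
    return payout

credit_microtask("user_123", "label_objects", validated=True)
print(balances["user_123"])  # 2
```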
Bottom-up advancements, not merely open-source
Traditional frameworks are so stacked against individuals and user communities that merely going open-source means little by itself. Radical shifts in existing business models and training frameworks are necessary to ensure ethical AI training.
Replacing top-down systems with a grassroots, bottom-up approach is the way to go. It’s also about establishing a meritocratic order that holds ownership, autonomy, and collaboration in high regard. In this world, equitable distribution, not maximization, is the most profitable strategy.
Interestingly, these systems will benefit big corporations as much as they empower smaller businesses and individual users. After all, everyone needs high-quality data, fair prices, and accurate AI models.
Now, with the incentives aligned, it’s in the industry’s shared interest to embrace and adopt new-age models. Holding on to narrow, short-sighted gains won’t help in the long run. The future has different demands than the past.
William Simonin is the chairman of Ta-da, an AI data marketplace that leverages blockchain to gamify data verification. He previously worked as a software engineer and researcher for the French Defense Ministry for about six years and with the Security Association of Epitech Nancy, serving as its President and later as a Professor of Functional Programming. He is a French entrepreneur and co-founder of multiple AI, tech, and cryptocurrency companies.