(ORDO NEWS) - An international team of 1,000 volunteer scientists has developed BLOOM, a large language model.
Its creators say the model will be public and will not be controlled by large IT corporations.
BLOOM was trained using $7 million worth of government-funded computing resources. It rivals in scale the models created by Google and OpenAI, but it is open source and will soon be available to any user.
Developing and training language models costs millions of dollars. So far, only IT giants have been able to do this. Now their monopoly is broken.
The BLOOM language model was developed and trained by BigScience, an international team of 1,000 volunteer scientists. The training ran on government-funded computing resources and cost $7 million. The first version of the model was launched on June 17th.
Models that recognize and generate language and can hold a dialogue with a user are increasingly being used by large technology firms in applications ranging from chatbots to translators.
Sometimes the dialogue sounds so “human” that it becomes creepy. A Google engineer this month said his company’s AI model is sentient (although Google vehemently denies this). But many of these models suffer from serious practical and ethical shortcomings: they mimic human biases.
(We wrote about the racist and sexist inclinations of the CLIP model.) Such problems are difficult to tackle, because the inner workings of most of these models are controlled by corporations and closed to external researchers.
BLOOM will be open: both its training data and its source code.
Several studies that will use BLOOM have already been planned, including extracting information from the correspondence of Renaissance merchants and building classifications in biology.
Learning machines
Large language models are algorithms that learn the statistical relationships between billions of words and phrases to perform tasks such as summary generation, translation, question answering, and text classification.
Built using neural networks, the models are trained by adjusting their parameter values step by step: a text is taken, some words are removed from it, and the model is asked to fill in the gaps. After each attempt the parameters are adjusted, so the model gradually restores the missing words more and more accurately, and in this way it learns.
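The fill-in-the-gap idea can be illustrated with a toy sketch. It is not BLOOM's training code: the tiny corpus, the two score tables and the names `predict`, `train_epoch` and `fill_gap` are all assumptions made up for this example, and the "model" looks only at the words immediately before and after the gap.

```python
import math

# Toy corpus (an assumption for illustration, not BLOOM's training data).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}
V = len(vocab)

# Parameters: one score table for the left neighbor, one for the right.
left = [[0.0] * V for _ in range(V)]
right = [[0.0] * V for _ in range(V)]


def predict(l, r):
    """Return a probability for every vocabulary word filling the gap."""
    logits = [left[l][k] + right[r][k] for k in range(V)]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_epoch(lr=0.5):
    """Hide each inner word in turn, guess it, and nudge the parameters."""
    loss = 0.0
    for pos in range(1, len(corpus) - 1):
        l, target, r = (index[corpus[pos + d]] for d in (-1, 0, 1))
        probs = predict(l, r)
        loss -= math.log(probs[target])
        for k in range(V):
            # Gradient of the cross-entropy loss with respect to each score.
            grad = probs[k] - (1.0 if k == target else 0.0)
            left[l][k] -= lr * grad
            right[r][k] -= lr * grad
    return loss / (len(corpus) - 2)


# Repeat the guess-and-adjust step many times over the same text.
losses = [train_epoch() for _ in range(50)]


def fill_gap(before, after):
    """Return the most probable word between two known neighbors."""
    probs = predict(index[before], index[after])
    return vocab[probs.index(max(probs))]


print(f"average loss: first pass {losses[0]:.2f}, last pass {losses[-1]:.2f}")
print("cat ___ on ->", fill_gap("cat", "on"))
```

Each pass hides every word once, compares the guess with the hidden word and shifts the parameters slightly toward the right answer, so the average error falls from pass to pass. Real models do the same thing with billions of parameters instead of two small tables.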
BLOOM has 176 billion parameters, on par with GPT-3, one of the best-known models of this kind, which was created by the non-profit firm OpenAI and licensed by Microsoft. (GPT-3's training was paid for by Microsoft; the first stage cost $4 million.)
Existing models can promote abuse, call for violence, and repeat racist or sexist language that is found in human-written texts. This is the drawback that the BLOOM developers tried to avoid.
Handpicked text
Most models are trained on text taken directly from the internet, including sites like Reddit. Instead, the BigScience researchers hand-picked nearly two-thirds of their 341-billion-word training dataset from 500 sources.
Among them was Semantic Scholar, an AI-enabled search engine for academic publications. By relying mostly on hand-picked sources, the team hopes to improve the behavior of its model.
In addition, since the underlying code and data set of BLOOM are open, any researcher can try to understand the reasons for the model’s inappropriate behavior and suggest improvements.
The model may perform a little worse in English than other large models, given the smaller volume of English text it was trained on, but this should be offset by better performance in other languages. BLOOM was conceived as a multilingual model from the start.
Free, but not for everyone
The fully trained BLOOM model will be available as a free download for researchers who want to experiment with it or train it on new application-specific data.
But since running the full model requires more computing power than all but a very small number of research groups have, BigScience will also publish smaller versions that are less demanding on hardware, and create a distributed system that will allow laboratories to share the model across their servers.
In addition, a web application will be released that will allow any user to query BLOOM without having to download it. BLOOM is certainly a huge step towards freer use of language models.
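A rough back-of-the-envelope estimate shows why so few groups can run the full model. The parameter count comes from the article; the 2 bytes per parameter (16-bit weights) and the 80 GB accelerator are assumptions chosen only for illustration, and the estimate covers storing the weights alone, not the extra memory needed while the model is running.

```python
import math

# Figure from the article.
PARAMETERS = 176_000_000_000

# Assumptions for illustration only (not stated in the article).
BYTES_PER_PARAMETER = 2      # 16-bit weights
ACCELERATOR_MEMORY_GB = 80   # hypothetical accelerator card


def weights_size_gb(parameters, bytes_per_parameter):
    """Memory needed just to hold the weights, in gigabytes (10**9 bytes)."""
    return parameters * bytes_per_parameter / 1e9


def accelerators_needed(size_gb, memory_gb):
    """Smallest whole number of cards whose combined memory fits the weights."""
    return math.ceil(size_gb / memory_gb)


size_gb = weights_size_gb(PARAMETERS, BYTES_PER_PARAMETER)
cards = accelerators_needed(size_gb, ACCELERATOR_MEMORY_GB)
print(f"weights alone: {size_gb:.0f} GB, at least {cards} cards of {ACCELERATOR_MEMORY_GB} GB")
```

Under these assumptions the weights alone take 352 GB, which is more than a single card of this size can hold and far beyond an ordinary workstation.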