Cheetah Mobile's Fu Sheng: Data is the real barrier to big model competition

21st Century Business Herald reporter Bai Yang reports from Beijing
In the fierce competition of AI big models, computing power resources and algorithm optimization have always been the focus pursued by major enterprises. However, as technology gradually matures, the focus of the industry is undergoing a subtle shift - from simple model training and computing power investment, to how to process and utilize massive, high-quality data.
In fact, data has become the decisive factor in whether big models can be successfully deployed. On November 27th, Fu Sheng, Chairman and CEO of Cheetah Mobile, stated plainly in an interview with 21st Century Business Herald that "algorithms and computing power are not the core competitiveness of big models; the real barrier is data."
Fu Sheng noted that most large model companies do not differ significantly in their algorithms. Although chips and algorithms are still crucial, the gaps there are not as deep as the gap in data. If the data is not of sufficient quality and quantity, no advantage in algorithms or computing power can be fully exploited.
The training of large models relies on a large amount of labeled data, which directly determines how the model actually performs. Fu Sheng offered an analogy: a model is like a growing child, and only by receiving correct information can it learn correctly.
Data faces dual challenges of quality and quantity
However, in terms of data acquisition and utilization, the development of large models is facing many challenges.
Firstly, the real data available for training large models is running out. In a paper on the scaling problem, DeepMind concluded that to fully train a model, the number of training tokens needs to be about 20 times the number of parameters in the model.
Currently, among closed source models, GPT is known to have the highest number of training tokens, approximately 20T; among open source models, the highest is LLaMA3, at about 15T. According to this calculation, if a 500 billion parameter Dense model wants to achieve the same training effect, it needs to train on about 107T tokens, which is far beyond the amount of data currently available in the industry.
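The arithmetic behind these figures can be checked with a short sketch. The article does not say which tokens-per-parameter ratio produces the 107T figure. Applying the 20x rule of thumb directly to a 500 billion parameter model gives about 10T tokens, so 107T implies a ratio of roughly 214 tokens per parameter. One possible reading, which is an assumption here and not something the article states, is that the figure scales the ratio of a 70 billion parameter model trained on 15T tokens. The function and variable names below are illustrative only.

```python
# Rule of thumb cited in the article: about 20 training tokens per parameter.
TOKENS_PER_PARAM_RULE = 20


def tokens_needed(params: float, tokens_per_param: float) -> float:
    """Return the training-token budget for a model with `params` parameters."""
    return params * tokens_per_param


params_500b = 500e9  # the 500 billion parameter Dense model from the article

# Budget if the 20x rule of thumb is applied directly.
budget_rule = tokens_needed(params_500b, TOKENS_PER_PARAM_RULE)

# Tokens-per-parameter ratio implied by the article's 107T figure.
implied_ratio = 107e12 / params_500b

# ASSUMPTION (not stated in the article): the ratio comes from a
# 70 billion parameter model trained on 15T tokens.
assumed_ratio = 15e12 / 70e9
budget_assumed = tokens_needed(params_500b, assumed_ratio)

print(f"20x rule: {budget_rule / 1e12:.1f}T tokens")
print(f"Ratio implied by 107T: {implied_ratio:.0f} tokens per parameter")
print(f"Assumed 15T / 70B ratio: {assumed_ratio:.0f} tokens per parameter "
      f"-> {budget_assumed / 1e12:.0f}T tokens")
```

Under that assumed reading, the two ratios agree at about 214 tokens per parameter, which would explain why the article's number is roughly ten times what the 20x rule alone suggests.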
Therefore, using synthetic data has become the consensus approach for large models. According to forecasts, by 2026 all natural data will have been used up by big models, and by 2030 artificial intelligence will use more synthetic data than real data.
But Fu Sheng believes that training large models directly on synthetic data carries huge risks. Because synthetic data contains inherent systematic biases, a model trained on it directly may mistake these biases for the norm, and in the long run the model's cognition may develop fatal flaws.
So the synthesized data also needs some processing, such as manual tuning or enhancement with other data, to improve the quality of the synthesized data.
The most significant issue with real data, meanwhile, is its low utilization rate. Many companies have plenty of data, yet the large models they train consistently underperform because the quality of that data is not high enough.
Explore business opportunities in data services
Based on this, Cheetah Mobile also sees a business opportunity, and its subsidiary OrionStar has launched a new data service product: AI Ready Data Service (AirDS).
The services provided by AirDS include data collection, cleaning, annotation, prompt engineering, and evaluation. Fu Sheng stated that because Cheetah Mobile also trains large models itself, it understands large models more deeply than traditional data annotation companies and is better able to meet the data needs of enterprises.
It should be pointed out that current data services still rely on manual labor. In the era of big models, tools can improve the efficiency of data filtering, cleaning, and other steps. However, fine-grained manual annotation remains indispensable for obtaining high-quality data.
Fu Sheng stated that in the era of big models, Cheetah Mobile's core business model is not to make money through model interfaces, but to create value by helping customers implement AI applications.
The core of this business model is to conduct in-depth mining around the application scenarios of large models. Taking AirDS as an example, Cheetah Mobile uses data service products to help enterprise customers achieve a full process service from data cleaning to labeling, and then to application optimization. This not only greatly improves the AI application effectiveness of enterprises, but also creates huge commercial space for Cheetah Mobile.
At present, AirDS has successful cases across many industries, including mobile communications, Internet entertainment, and new energy vehicles.
Regarding the future development of large models, Fu Sheng believes that although technological bottlenecks have slowed down the iteration speed of models, the depth and breadth of application scenarios are constantly expanding. Especially in vertical industries such as search and enterprise services, with the improvement of data quality and application capabilities, AI is expected to bring revolutionary changes to the industry.
"Next year will be a year of great prosperity for applications," Fu Sheng predicted. "The capabilities of big models have become relatively stable, and the next stage of competition will depend more on how to apply big models in specific scenarios. As long as the scenarios are clear enough, their explosive power will be very strong."