Cheetah Mobile's Fu Sheng: Data is the real barrier to big model competition
我放心你带套猛
Posted yesterday at 23:50
21st Century Business Herald reporter Bai Yang reports from Beijing
In the fierce competition over large AI models, computing power and algorithm optimization have long been the main focus of major companies. As the technology matures, however, the industry's attention is shifting subtly: from model training and compute investment alone to how to process and use massive volumes of high-quality data.
In fact, data has become a decisive factor in whether large models can be deployed successfully. On November 27, Fu Sheng, Chairman and CEO of Cheetah Mobile, said in an interview with 21st Century Business Herald that "algorithms and computing power are not the core competitiveness of large models; the real barrier is data."
Fu Sheng said that most large model companies do not differ significantly in algorithms. Chips and algorithms remain crucial, but the gap between companies there is not as deep as the gap in data. If data falls short in quality or quantity, no advantage in algorithms or computing power can be fully exploited.
Training large models relies on large amounts of labeled data, which directly determines how the model actually performs. Fu Sheng offered a metaphor: a model is like a growing child, and only by receiving correct information can it learn correctly.
Data faces dual challenges of quality and quantity
Yet when it comes to acquiring and using data, large model development faces many challenges.
First, the real-world data available for training large models is running out. In a paper on scaling, DeepMind concluded that to fully train a model, the number of training tokens should be about 20 times the number of model parameters.
Among closed-source models, GPT is reported to have the largest training token count, at approximately 20T; among open-source models, LLaMA3 has the largest, at about 15T. By this calculation, a 500-billion-parameter dense model would need about 107T training tokens to achieve the same training effect, far beyond the amount of data currently available in the industry.
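The arithmetic behind these figures can be sketched as follows. The article does not show how the 107T figure was derived; the sketch assumes it comes from applying LLaMA3's tokens-per-parameter ratio to a 500-billion-parameter model, and that the 15T figure refers to the 70-billion-parameter LLaMA3 model. Both assumptions, and all names in the code, are illustrative rather than taken from the interview.

```python
# Rule of thumb cited in the article: about 20 training tokens per parameter.
TOKENS_PER_PARAM_RULE = 20


def tokens_needed(params: float, tokens_per_param: float) -> float:
    """Training tokens required for a model of the given size at a given ratio."""
    return params * tokens_per_param


dense_params = 500e9  # the 500-billion-parameter dense model from the article

# Assumption: the 15T tokens were used to train LLaMA3's 70B-parameter model.
llama3_ratio = 15e12 / 70e9  # tokens per parameter

rule_tokens = tokens_needed(dense_params, TOKENS_PER_PARAM_RULE)
matched_tokens = tokens_needed(dense_params, llama3_ratio)

print(f"20x rule: {rule_tokens / 1e12:.0f}T tokens")
print(f"LLaMA3 ratio ({llama3_ratio:.0f} tokens/param): {matched_tokens / 1e12:.0f}T tokens")
```

Under these assumptions, the 20x rule alone gives about 10T tokens, while matching LLaMA3's much higher ratio gives about 107T, which is consistent with the figure quoted above.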
As a result, using synthetic data has become the consensus approach for large models. According to forecasts, large models will have used up all natural data by 2026, and by 2030 artificial intelligence will use more synthetic data than real data.
But Fu Sheng believes that training large models directly on synthetic data carries huge risks. Synthetic data contains inherent systematic biases; if it is used directly for training, the model may mistake these biases for the norm, and over time its understanding may develop fatal flaws.
Synthetic data therefore also needs processing, such as manual tuning or augmentation with other data, to improve its quality.
For real data, the biggest problem is low utilization. Many companies have plenty of data, yet the large models they train consistently underperform, because the quality of that data is not high enough.
Explore business opportunities in data services
Cheetah Mobile sees a business opportunity here. OrionStar, a company it controls, has launched a new data service product: AI Ready Data Service (AirDS).
AirDS covers data collection, cleaning, annotation, prompt engineering, and evaluation. Fu Sheng said that because Cheetah Mobile also trains large models itself, it understands large models more deeply than traditional data annotation companies and is better able to meet enterprises' data needs.
It should be noted that data services today still depend on manual labor. In the era of large models, tools can improve the efficiency of steps such as data filtering and cleaning, but fine-grained manual annotation remains indispensable for obtaining high-quality data.
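The division of labor described here, with tools handling filtering and cleaning and people handling annotation, might look roughly like the sketch below. It is not AirDS's actual pipeline, which the article does not describe; the function names, the `min_chars` threshold, and the sample records are all hypothetical.

```python
import re


def clean(records, min_chars=10):
    """Automated step: normalize whitespace, drop empty, too-short and duplicate records."""
    seen = set()
    cleaned = []
    for raw in records:
        text = re.sub(r"\s+", " ", raw).strip()
        key = text.lower()
        if len(text) < min_chars or key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


def build_annotation_queue(cleaned):
    """Manual step: every surviving record still waits for a human-assigned label."""
    return [{"text": text, "label": None} for text in cleaned]


raw_records = [
    "  Battery  range is 500 km. ",
    "Battery range is 500 km.",  # duplicate once whitespace is normalized
    "ok",                        # too short to be useful
    "",
    "The phone plan includes 20 GB of data.",
]

cleaned = clean(raw_records)
queue = build_annotation_queue(cleaned)
print(len(raw_records), "raw ->", len(cleaned), "cleaned ->", len(queue), "awaiting annotation")
```

The automated step only shrinks the volume of data that people must look at; the `label` field stays empty until an annotator fills it in, which reflects the point that manual annotation cannot be skipped.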
Fu Sheng said that in the era of large models, Cheetah Mobile's core business model is not to make money from model APIs but to create value by helping customers deploy AI applications.
The core of this business model is to dig deeply into the application scenarios of large models. With AirDS, for example, Cheetah Mobile offers enterprise customers an end-to-end service that runs from data cleaning through labeling to application optimization. This greatly improves how well enterprises' AI applications perform and also opens up substantial commercial room for Cheetah Mobile.
To date, AirDS success cases span many industries, including mobile communications, internet entertainment, and new energy vehicles.
Regarding the future of large models, Fu Sheng believes that although technological bottlenecks have slowed the pace of model iteration, application scenarios keep expanding in depth and breadth. In vertical industries such as search and enterprise services in particular, AI is expected to bring revolutionary change as data quality and application capabilities improve.
"Next year will be a year of great prosperity for applications," Fu Sheng predicted. "The capabilities of large models have become relatively stable, and the next stage of competition will depend more on how large models are applied in specific scenarios. As long as the scenarios are clear enough, their explosive power will be very strong."
CandyLake.com is an information publishing platform and only provides information storage space services.
Disclaimer: The views expressed in this article are those of the author only, this article does not represent the position of CandyLake.com, and does not constitute advice, please treat with caution.