Meta pauses plan to train AI on user data after drawing regulatory scrutiny

21st Century Business Herald reporter Xiao Xiao reports from Beijing
This week, Meta announced that it would suspend the use of data from EU and UK users to train AI and would postpone the launch of its own large model in Europe.
Regulators in Ireland, the UK, Norway and elsewhere have confirmed the move, saying the company acted in response to regulatory requirements. The Norwegian data protection agency stated that Meta has promised to suspend the use of posts and images on Facebook and Instagram to train large models, and that it is currently uncertain how long the delay will last. Discussions are under way with regulatory agencies in other EU countries.
Meta's plan to collect user data began last month, when the platform notified European users that a new privacy policy would officially take effect by the end of June: the company would use public content on Facebook and Instagram to train its large model, including interactions, status updates, photos, and captions, but excluding private chat records and information from minors' accounts. The updated privacy policy sparked opposition, and the Austrian non-profit organization NOYB immediately filed complaints in 11 EU member states, requesting that emergency procedures be initiated.
The controversy is not unique. How to obtain users' authorization to train AI on their data is a difficult problem for all internet companies. Companies must not only understand the compliance criteria but also take into account users' growing sensitivity about privacy. Experts interviewed by 21st Century Business Herald said that citing the EU's "legitimate interests" clause to obtain user data may become increasingly common. However, China's Personal Information Protection Law does not currently contain a directly comparable provision, so domestic enterprises need to pay special attention to obtaining users' explicit consent.
The "legitimate interests" clause may become a familiar face
In its complaint against Meta, NOYB identified two compliance failures:
First, Meta's description of artificial intelligence is too broad and does not specify the purpose for which user information is collected and processed. Meta's privacy policy uses only the term "artificial intelligence technology", which NOYB founder Max Schrems believes is equivalent to saying "we will use data in the database".
"Meta did not specify what it would use this data for, nor did it set any restrictions. Artificial intelligence technology may refer to a simple chatbot, highly aggressive personalized advertising, or even lethal drone weapons," Max Schrems explained.
Second, users are opted in to data collection by default, and the process for refusing is complex. Taking Facebook as an example, users who want to refuse collection of their data must navigate five levels of pages (Settings and Privacy > Privacy Center > Generative AI > More Information > "How Meta Trains Large Models with Data") to find an objection form at the end of the document. Only by actively filling out the form and having it approved by the company can users refuse data collection.
Meta argues that its large model needs to reflect the linguistic, geographic, and cultural diversity of the European people, so the company's collection of user data should fall under the "legitimate interests" provided for in the General Data Protection Regulation (GDPR) and does not require separate user consent.
Generally speaking, the General Data Protection Regulation presumes that collecting personal information is unlawful unless a legal basis applies, but the "legitimate interests" clause covers some situations in which data collection is necessary and user consent is not required. Such lawful collection can serve personal, commercial, or public interests.
"The industry generally believes that the EU has strict restrictions on personal information processing, but in fact it leaves some room for interpretation through the legitimate interests clause," said Wang Xinrui, a partner at Shihui Law Firm who has worked in data compliance for many years. Wang Xinrui told 21st Century Business Herald that the legitimate interests clause is both complex and flexible, and that relying on it requires passing a series of tests. It can be described as a legal basis with considerable room for interpretation.
Meta has cited legitimate interests before, to defend its collection of user data for personalized advertising. However, the European Court of Justice ultimately rejected that argument, and Max Schrems therefore believes that legitimate interests will also be difficult to apply to scraping data and using it to train AI. Wang Xinrui said that in some emerging technology scenarios other legal bases may be difficult to establish, while legitimate interests still leaves some room for interpretation. That is why Meta is trying to rely on it, and he expects that "this clause will repeatedly appear in various AI related cases in the future."
It should be noted that, unlike the European Union, China's Personal Information Protection Law does not directly list "legitimate interests" among the statutory exemptions from consent. However, Wang Xinrui pointed out that some of the typical situations covered by the EU's General Data Protection Regulation are also covered by other provisions in China.
Cheng Nian, a lawyer at Zhejiang Kenting (Beijing) Law Firm, told 21st Century Business Herald that the comparable provisions in China cover only limited situations: one is responding to public health emergencies or protecting natural persons in an emergency, and the other is conduct that must be kept confidential under the law. Examples include collecting data without user consent because of the epidemic or for anti-terrorism investigations by public security agencies. Ordinary business operations usually do not fall within this scope.
User data becomes a sensitive point for the industry
"We are very disappointed," Meta complained in its blog. "This is a setback for European innovation and artificial intelligence development competition, and further delays the benefits that artificial intelligence brings to the European people." The company said it was in fact following industry practice: Google and OpenAI have already used European user data to train AI, and "compared to peers, our data collection methods are more transparent."
However, this does not appear to be the case: caution about user data has gradually become the consensus approach. For example, ChatGPT was the first to let users keep their personal data from being used for training by turning off the chat history function, although this inevitably affects the quality of the large model's answers. On June 19, Adobe updated its terms of service specifically to state that its software will not use users' local or cloud content to train generative AI models.
Last year, the domestic office software WPS attempted to add a new clause to its privacy policy: "We will use the document materials you voluntarily upload as the basic materials for AI training after desensitization treatment." Once users discovered the clause, it triggered a collective boycott. WPS apologized to users and promised that user documents would not be used for AI training.
At present, technology giants that explicitly collect user data to train AI include Google and X. To pave the way for the launch of Musk's AI company xAI, X updated its privacy policy in September last year, stating in Section 2.1: "We may use collected and publicly available information to help train our machine learning or artificial intelligence models." Last July, Google's privacy policy also added a new clause: "We may collect publicly available online information or information from other public sources to help train Google's artificial intelligence models."
However, Deng Zhisong, senior partner at Beijing Dacheng Law Firm, told 21st Century Business Herald at the time that Google had explained in detail the scope and purpose of its collection and processing of users' personal information. Even measured against the stricter "notice and consent" rules of the EU's GDPR, Google's approach was at least formally compliant.
NOYB also pointed out that Meta hopes to collect all public and non-public personal information dating back to 2007, covering users' interaction traces on Facebook and Instagram, which differs from AI companies' usual practice of relying on information publicly available on the internet.
How can companies meet compliance requirements and develop technology while respecting user rights? Wang Xinrui emphasized to 21st Century Business Herald that domestic companies that want to collect user data to train AI must comply with the "Interim Measures for the Management of Generative Artificial Intelligence Services", which clearly stipulate that where personal information is involved, companies should obtain the individual's consent or satisfy another condition provided for by law. In other words, special attention must be paid to whether users have been clearly informed and have given their consent before their personal information is collected and used. Without prior consent, there must be another legal basis, such as a legal obligation or the public interest; otherwise the company faces compliance risks.
Cheng Nian added that personal information collected through users' use of a product requires explicit consent, and sensitive information additionally requires separate consent. Companies must also ensure that users can easily access, correct, and delete their personal information and withdraw their consent, in particular by giving them the option to refuse the collection of their data for AI training, so as to protect their right to know and their right to choose.
