Trust But Verify: Navigating Privacy Settings Across Major AI Platforms
Don't train on my data!
As we prepare to close out 2024 and welcome 2025, I'm sharing some insights on my most commonly asked question this year: how can I safely use these LLMs with my data? I've discussed this topic with hundreds of firms this year and want to summarize those insights here.
First, though, there are two critical announcements: I'm moving off the Substack platform and onto Beehiiv, so your next newsletter will come from there. This newsletter is also being renamed "Intelligence by Intent." The focus will primarily be practical AI use cases and insights you can use daily. Many examples will still focus on RevOps, but I will also cover several other areas, including legal, marketing, finance, and operations.
Second, many of you know we lost our beloved St. Bernard, Ollie, earlier this year to cancer. During this holiday season, we realized that the house was just too quiet, and we needed the presence of another dog. I want to introduce you to Magnus, our new 9-week-old St. Bernard. As in prior newsletters, I'll share a photo or two of him as he grows and fills our lives with joy.
Now, on to the main article.
DON'T TRAIN ON MY DATA (PLEASE!)
When ChatGPT was first announced to the public, there was widespread fear that OpenAI would train on anything you put into the model. Many companies initially forbade their employees from using it at all, fearing that their corporate secrets would be exposed to the world.
I've had the opportunity to discuss AI in front of hundreds of organizations over the past year, and privacy comes up constantly. I want to share how you can use these fantastic tools confidently, knowing your data is not being used for training, so you can get the full value of what they can do.
The most important takeaway: if you are using the free versions of these tools, you should not expect any confidentiality. That is not to say these companies will train on your data, but they reserve the option to. In each of the following sections, I will focus on the paid versions of these tools. Note also that all of these privacy statements refer to US privacy policies.
ChatGPT (OpenAI)
Let's start with OpenAI and ChatGPT. When you first subscribe to ChatGPT, the default setting allows OpenAI to train on your data; you need to opt out of this yourself. Here's how:
Click on your profile icon in the upper right-hand corner of the screen
Select 'Settings' from the menu options
Select 'Data controls' on the left-hand side of the dialog box
Toggle the top option, 'Improve the model for everyone,' to 'Off'
OpenAI also offers a second way to ensure your data is not trained on, and I recommend doing this as well.
Go to this page: https://privacy.openai.com/policies and select 'Make a privacy request' in the upper right-hand corner. Choose 'I have a ChatGPT account,' then, in the next window, select 'Do not train on my content.' You will be asked for the email address associated with your account; enter it, and shortly afterward you will receive confirmation from their privacy team that they will not be training on your data.
The first option alone is sufficient (and you should definitely do it), but I've been encouraging everyone to submit the privacy request as well, just to be safe.
Once you've done that, any new chats you start will not be trained on by OpenAI.
Let's talk about OpenAI's data access and retention policies.
Access Restrictions: OpenAI employs strict access controls to limit who can access user data. However, some authorized personnel may access data for purposes such as system maintenance, security, and compliance.
Data Retention Period: OpenAI retains personal data only as long as necessary to provide services or for other legitimate business purposes, such as resolving disputes or complying with legal obligations. The specific retention period depends on factors like the nature of the data and the purposes for which it was collected.
By default, OpenAI does not use content submitted to its business offerings – the API, ChatGPT Team, and ChatGPT Enterprise – to improve model performance unless you have explicitly opted in to share your data.
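If you go the API route, here's a minimal sketch using OpenAI's official Python library (the model name and prompt are illustrative placeholders); as noted above, API traffic falls under the business terms and is not used for training by default:

```python
# Minimal sketch: calling OpenAI's API with the official Python library.
# API inputs and outputs fall under OpenAI's business terms and are not
# used for model training by default. Model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Summarize this quarter's pipeline data..."},
    ],
)
print(response.choices[0].message.content)
```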
I am confident enough in these controls that I have no issue putting in selected financial data, medical records, and more (I would not include anything containing my SSN, but otherwise anything I'm personally comfortable sharing).
Claude (Anthropic)
Claude (by Anthropic) does not use your inputs (prompts) and outputs (responses) from the free Claude or Claude Pro to train their models by default; you must opt in if you want your data used for training. That makes Claude the most privacy-focused of the large language models most people use – you don't have to change a single setting.
Exceptions: There are specific scenarios where your data may be utilized for model training:
Explicit Permission: Your data may be used if you provide explicit consent, such as submitting feedback through features like thumbs up/down or directly reaching out with a request.
Trust and Safety Reviews: If your prompts or conversations are flagged for trust and safety concerns, Anthropic may analyze and use this data to enhance its ability to detect and enforce Usage Policy violations, including training trust and safety classifiers to improve service safety.
Data Access and Retention
Access Restrictions: Only a limited number of authorized staff members have access to conversation data, and they access it solely for explicit business purposes.
Data Retention Period: Anthropic retains inputs and outputs on the backend for up to 30 days to provide a consistent product experience. If you delete a conversation, it is immediately removed from your conversation history and automatically deleted from their backend within 30 days.
As with ChatGPT, I am confident enough in these controls that I have no issue putting selected financial data, medical records, and more into Claude (again, nothing containing my SSN, but otherwise anything I'm personally comfortable sharing). Anthropic has focused on trust, safety, and privacy from the start – it's a founding principle of their overall business.
Gemini (Google)
Google takes a very mixed approach to privacy compared to OpenAI and Anthropic, and you have fewer options to opt out of Google training on your data. There are significant differences between the consumer and business versions, so I will give you a high-level overview here (and remember, all of this relates to the paid versions of Gemini).
Consumer Gemini Advanced (paid): Google's privacy policies are, at best, unclear (to me at least) on whether Google can and will train on your data when you use the paid consumer version of Gemini Advanced. The available information does not specify whether data from paid consumer subscribers is used in model training, and it is common practice for AI providers to use aggregated and anonymized data to improve model performance. Because the service never explicitly states that it will not train on your data, I would not put confidential data here.
Google Workspace Gemini Advanced (business paid account): I have a regular Google Workspace account with a Gemini Advanced subscription, and at the bottom of the screen the tool explicitly states that 'chats aren't used to improve our models.' This message does not appear on consumer-paid accounts, which leads me to believe Google retains the option to train on those chats. I have no issue putting confidential data here, since Google explicitly states it does not train on it.
Google AI Studio: There is no paid version of Google AI Studio, but it's a free and fantastic tool that gives you access to all of Google's latest models, including real-time streaming with Google's multimodal models. That said, as a free tool, Google has the ability to train on the data you put in here. I've heard from various sources connected to Google's legal department that they are not currently training on the data (at least as of about two months ago), but because this is a free service, that could change at any time. Assume they *could* train on anything you put in here.
I use Google AI Studio extensively – it is a truly fantastic service – but I'm careful to ensure I don't put confidential data there.
Google Vertex AI Studio: Vertex AI Studio is part of Google's enterprise suite of tools and explicitly states that it does not train on your data. Because it sits within the broader Google Cloud platform, it comes with strong customer data protections: your data, including inputs, outputs, and any custom models developed within Vertex AI Studio, remains under your control, and without your explicit consent Google does not use it to train or improve its foundation models. Whenever I am dealing with confidential data, I always access the Google models through Vertex or the API associated with my paid services.
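As an illustration, here's a minimal sketch of calling Gemini through the Vertex AI Python SDK (the project ID, region, and model name are placeholders you'd swap for your own); requests made this way are covered by Google Cloud's customer data protections rather than the consumer Gemini policies:

```python
# Minimal sketch: calling Gemini through the Vertex AI Python SDK.
# Requests here are governed by Google Cloud's customer data protections,
# not the consumer Gemini policies. Project, region, and model are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project-id", location="us-central1")

model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content("Review this confidential contract clause...")
print(response.text)
```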
Google NotebookLM: Google NotebookLM is one of the best AI tools launched this year. It can query massive amounts of data and generate podcasts from your uploads. Here is the current privacy policy for NotebookLM:
We (Google) value your privacy and never use your personal data to train NotebookLM.
If you are logging in with your personal Google account and choose to provide feedback, human reviewers may review your queries, uploads, and the model's responses to troubleshoot, address abuse, or make improvements. Keep in mind that it's best to avoid submitting any information you wouldn't feel comfortable sharing.
As a Google Workspace or Google Workspace for Education user, your uploads, queries, and the model's responses in NotebookLM will not be reviewed by human reviewers and will not be used to train AI models.
WRAPPING UP
Each of these companies has made it possible to use their models securely with confidential data. I review each company's privacy policy roughly once a quarter to check for significant changes, and this is my best understanding of how those policies are structured as of December 2024. Because they can change at any time, I strongly urge you to review the latest privacy policies yourself, use only paid services, and select any available opt-out options to ensure your data isn't trained on. Before putting in any confidential data, also make sure you comply with any guidelines your company may have in place.
I have reached a level of comfort with these tools where I have no issue putting in financial or medical data, asking endless questions, and getting great insights. I hope this guide gives you the comfort and insight to use them securely in ways that benefit you.
With that, welcome to 2025!