What 'AI training' actually means (and whether your data is being used)

The question business owners ask most often when considering AI: "if I use this, does my data get used to train the model?"

Here's the actual answer, with no marketing dust.

What "training" actually means

When AI gets "trained," what happens is: a huge collection of text and images gets fed into a model, and the model adjusts billions of internal numbers to learn the patterns in that data. After training, those numbers are baked into the model.

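The data itself isn't stored inside the model as a file or a database. The patterns from the data are. So if you write something and it becomes training data, the AI generally doesn't retain your exact words, although models can occasionally reproduce memorized passages, particularly text that appeared many times in the training data. Either way, the patterns from your words contribute to how the AI predicts text in the future.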
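To make "adjusting internal numbers" concrete, here is a toy sketch with one number instead of billions. It is not how any real provider trains a model; the example pairs, the step count, and the learning rate are all made up for illustration.

```python
# Toy "training": fit y = w * x to a few examples with gradient descent.
# The examples and settings below are illustrative assumptions only.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs


def train(data, steps=200, learning_rate=0.05):
    w = 0.0  # the model's single internal number
    for _ in range(steps):
        # Average gradient of the squared error (w*x - y)**2 with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        # Nudge the number in the direction that reduces the error.
        w -= learning_rate * grad
    return w


weight = train(examples)
print(round(weight, 3))
```

The trained "model" here is just the number `weight`, which settles near 2 because every target is twice its input. The example pairs shaped that number, but they are not part of it.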

What this means for your data, in practice

Two distinct cases:

Case 1: The big initial training that built ChatGPT, Claude, Gemini.

This happened in the past. The data was largely scraped from the public internet: websites, books, Wikipedia, public Reddit, public forums, news articles, scientific papers. If you had a public website in 2023, your content was probably in there. Content behind a login generally was not.

This phase is mostly done. Models get retrained periodically, but the bigger story is that providers are now mostly refining what they already have rather than scraping at that scale again.

Case 2: When you use an AI tool day-to-day, is YOUR input training the model?

This is the actual question you care about. The honest answer depends on which tool, which plan, and how the vendor has configured it.

Three categories of AI tool, ranked by data risk

Category A: Free consumer chatbots.

Free ChatGPT.com, free Claude.ai, free Gemini. By default, your inputs may be used for model training. Each provider offers an opt-out setting, but you usually have to turn it on yourself.

What this means: don't paste client confidential data, NDAs, financials, or legal matters into free consumer AI tools. Even if the privacy policy says they don't train, you don't have an audit trail to prove it.

Category B: Paid consumer plans (ChatGPT Plus, Claude Pro, Gemini Advanced).

Stronger privacy controls, but don't assume training is off: defaults differ by provider and change over time, so check the setting on your plan. These plans are still consumer-grade: fine for individual productivity, not appropriate for matter-specific or HIPAA data.

Category C: API and Enterprise plans.

This is where business use should live: OpenAI API, Anthropic API, Google Vertex AI, Azure OpenAI Service. Under their standard commercial terms, your data is not used to train models unless you explicitly opt in. Many also offer data-residency options, SOC 2 reports, and BAAs for healthcare, though some of these require a specific plan or a signed agreement.

When a productized agent gets built for a business, it runs on Category C. Your data flows through the API, gets a response, and the API provider doesn't keep it for training.

The four questions to ask any vendor

Before signing up for any AI tool, ask:

1. Which underlying model do you use?

You want a specific answer: "Claude 3.5 via the Anthropic API" or "GPT-4 via Azure OpenAI." Vague answers ("our proprietary model") usually mean either the vendor doesn't know or they're hiding which API they're calling.

2. Is my data used to train models?

The answer should be no. If they hesitate or say "we anonymize and aggregate," push for clarity. Anonymized data can still leak through patterns, especially if your team uses distinctive terminology.

3. Where does my data live?

US, EU, somewhere else? Some vendors route through cheaper regions. If you have data residency requirements (HIPAA, EU customers, government work), get this in writing.

4. Do you offer a Business Associate Agreement (BAA) or DPA?

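If you're in healthcare, legal, or financial services, check whether you need a BAA (required under HIPAA when a vendor handles protected health information) or a DPA (required under GDPR when a vendor processes personal data on your behalf). Free-tier consumer products don't offer these. Enterprise plans usually do.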
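The four questions can be kept as a simple checklist. The sketch below is one hypothetical way to record a vendor's answers and list the red flags; the class name, field names, and the set of "vague" answers are assumptions made for illustration, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass
class VendorAnswers:
    """Hypothetical record of a vendor's answers to the four questions."""
    underlying_model: str            # e.g. "Claude 3.5 via the Anthropic API"
    trains_on_customer_data: bool    # question 2
    data_region_in_writing: bool     # question 3
    offers_baa_or_dpa: bool          # question 4


# Illustrative examples of answers that name no specific model or API.
VAGUE_MODEL_ANSWERS = {"", "proprietary", "our proprietary model"}


def red_flags(v: VendorAnswers, regulated: bool) -> list[str]:
    """Return the concerns raised by a vendor's answers."""
    flags = []
    if v.underlying_model.strip().lower() in VAGUE_MODEL_ANSWERS:
        flags.append("vague answer about the underlying model")
    if v.trains_on_customer_data:
        flags.append("customer data is used for training")
    # Questions 3 and 4 matter most when the data is regulated.
    if regulated and not v.data_region_in_writing:
        flags.append("no written data-location commitment")
    if regulated and not v.offers_baa_or_dpa:
        flags.append("no BAA or DPA on offer")
    return flags


vendor = VendorAnswers("Our proprietary model", False, False, True)
print(red_flags(vendor, regulated=True))
```

An empty list does not prove a vendor is safe. It only means none of the four questions turned up an obvious problem.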

What this means for the agents we build

When Alchmy builds a productized agent for a growing team, three things hold by default:

1. The agent calls the AI provider via a Category C API (Anthropic, OpenAI, Google enterprise tier).
2. Your data is not used to train models, by contract.
3. Your data lives in your environment (your CRM, your email, your filesystem). The agent reads and writes to those systems through APIs. The vendor doesn't keep a copy.

Custom builds for regulated industries get additional layers: BAAs, audit logs, data-residency commitments. Standard B2B builds don't need that level, because you already have the contractual protection from the API tier.

What to do if you've already pasted sensitive data into a free tool

Three steps:

1. Check the privacy settings on whatever tool you used. Turn off "improve the model" or "data sharing" if it's not already off.
2. Going forward, use the paid tier or a Category C tool for any business data.
3. Don't panic. Most paste-into-ChatGPT moments aren't actually a security incident. They're a habit you should change going forward, not a five-alarm fire.

What this means for you

Use the right tier for the data. Free tools for personal stuff. Paid tools for non-confidential business work. Enterprise/API tier for client data. The trade-up between tiers is mostly about contractual protection, not capability differences.
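The tier rule above can be written down as a small lookup, which some teams may find handy as an internal policy check. This is a minimal sketch: the data-kind labels and the choice to treat unknown data as the most sensitive are assumptions, not a standard.

```python
from enum import IntEnum


class Tier(IntEnum):
    FREE_CONSUMER = 1    # Category A
    PAID_CONSUMER = 2    # Category B
    API_ENTERPRISE = 3   # Category C


# Minimum tier for each kind of data (labels are illustrative).
MINIMUM_TIER = {
    "personal": Tier.FREE_CONSUMER,
    "business_non_confidential": Tier.PAID_CONSUMER,
    "client_data": Tier.API_ENTERPRISE,
}


def is_allowed(data_kind: str, tier: Tier) -> bool:
    """True if this kind of data may go into a tool on this tier."""
    # Unknown data kinds are treated as the most sensitive.
    required = MINIMUM_TIER.get(data_kind, Tier.API_ENTERPRISE)
    return tier >= required


print(is_allowed("client_data", Tier.PAID_CONSUMER))   # False
print(is_allowed("personal", Tier.FREE_CONSUMER))      # True
```

Defaulting unknown data to the strictest tier means a mislabeled document fails safe rather than ending up in a free chatbot.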

If you're paying $20/month for ChatGPT and using it for client data, upgrade to the Team plan or use the API. If you're using free ChatGPT for client data, stop, take a breath, and reconfigure today.

Get started

Want to know what we'd build for your team?

A 30-minute audit walks through your workflows and produces a list of 2-3 candidates ranked by ROI, with prices.
