Cloud Engineer Lab
© 2026
How to Save AI Tokens, Manage Licences, and Keep Your AI Spend Under Control

AI tools can get expensive fast if you are not careful. This practical guide shows IT admins and teams how to use fewer tokens, choose the right model for each task, set hard spend caps, and manage licences — so you get maximum value from every penny you spend on AI.


AI tools are genuinely useful. They are also genuinely expensive if nobody is paying attention.

A team of 20 people using ChatGPT Team, Microsoft Copilot, and the OpenAI API without any governance in place can quietly spend £3,000–£8,000 per month — and get far less value than they should, because most of those tokens are wasted on vague prompts, the wrong models, and duplication.

This article gives you a complete, practical system to stop that waste. Whether you are an individual trying to stay on the free tier, an IT admin managing AI tools for a team, or a developer building applications on top of AI APIs — every section here has something you can act on today.


First: What Is a Token and Why Does It Cost Money?

Before controlling costs, you need to understand what you are actually paying for.

Think of tokens like coins in a taxi meter. The meter starts running the moment you start talking to the AI — every word you type clicks the meter, and every word the AI types back clicks it again. When you run out of coins, you either top up or the journey stops.

More precisely:

  • A token is approximately 4 characters of text — roughly ¾ of an English word
  • 1,000 tokens ≈ 750 words ≈ about one and a half pages of text
  • Every request has two token counts: input tokens (what you send) and output tokens (what the AI sends back)
  • You pay for both

Here is the important insight most people miss: you are paying for every word in the conversation, not just the question you just typed. That includes the system prompt at the top, the entire chat history, and any documents you paste in.

That is where most waste hides.
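
To make that concrete, here is a minimal Python sketch that estimates what is billed as input on a single turn. It uses the rough 4-characters-per-token rule from above rather than a real tokenizer, and the function names and sample messages are made up for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return max(1, round(len(text) / 4))


def billed_input_tokens(system_prompt: str, history: list[str], new_message: str) -> int:
    """Everything sent with the request is billed as input, not just the new message."""
    parts = [system_prompt, *history, new_message]
    return sum(estimate_tokens(p) for p in parts)


# Hypothetical conversation, for illustration only.
system_prompt = "You are a helpful IT assistant. Answer briefly."
history = [
    "How do I list disabled AD accounts?",
    "Use Get-ADUser with the -Filter parameter and check the Enabled property.",
]
new_message = "Export that to CSV."

print(estimate_tokens(new_message))                              # the new question alone
print(billed_input_tokens(system_prompt, history, new_message))  # what is actually sent
```

The second number grows with every turn, which is why long chats and long system prompts get expensive even when each question is short.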


How Much Do Tokens Actually Cost?

Costs vary enormously depending on which model you use. This is one of the most powerful levers you have.

API Pricing (Pay-as-You-Go)

Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for
GPT-4o mini | ~$0.15 | ~$0.60 | Simple tasks, high volume
GPT-4o | ~$2.50 | ~$10.00 | Complex reasoning, important tasks
Claude Haiku 4 | ~$0.80 | ~$4.00 | Fast, simple, high volume
Claude Sonnet 4 | ~$3.00 | ~$15.00 | Balanced quality and cost
Claude Opus 4 | ~$15.00 | ~$75.00 | Hardest problems only
Gemini 1.5 Flash | ~$0.075 | ~$0.30 | Very cheap, large context
Gemini 1.5 Pro | ~$1.25 | ~$5.00 | Quality at reasonable cost

The ratio matters: for every model in the table above, output tokens cost 4–5× as much as input tokens. Shorter, more focused answers save significant money at scale.
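
A small sketch of the arithmetic, using a few of the approximate prices from the table above. The dictionary keys are just labels chosen for this example, and the token counts are assumed.

```python
# USD per 1M tokens (input, output), copied from the table above (approximate).
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet-4": (3.00, 15.00),
    "claude-opus-4": (15.00, 75.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request: input and output are priced separately."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Same 2,000-token input; the only change is a shorter answer.
long_answer = request_cost("gpt-4o", 2_000, 1_000)
short_answer = request_cost("gpt-4o", 2_000, 250)
print(f"long: ${long_answer:.4f}  short: ${short_answer:.4f}")
```

With these assumed numbers, trimming the answer from 1,000 to 250 tokens halves the cost of the request, even though the input is unchanged.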

Subscription Plans

Plan | Cost | Who it is for | Key limit
ChatGPT Free | £0/month | Occasional personal use | Limited GPT-4o access
ChatGPT Plus | ~£20/month per user | Individual power users | Higher limits, all models
ChatGPT Team | ~£25/month per user | Teams, shared workspace | Admin controls, no training on data
ChatGPT Enterprise | Custom pricing | Large organisations | SSO, compliance, unlimited
Microsoft Copilot for M365 | ~£25–30/month per user | Office 365 users | Per-user, admin managed
Anthropic Claude.ai Pro | ~£18/month per user | Individual Claude users | Priority access

Subscription vs API — which applies to you?

If you use ChatGPT, Claude.ai, or Copilot through a browser or app, you are on a subscription plan — a flat monthly fee. If you are a developer calling OpenAI, Anthropic, or Google through code, you are on the API and paying per token. The saving strategies differ slightly, but most principles in this guide apply to both.


Part 1: Smart Prompting — Use Fewer Tokens Per Request

This is the highest-impact area. The average person wastes 30–50% of their tokens before the AI even starts answering — through unfocused prompts, unnecessary context, and asking for longer answers than they need.

Technique 1: Be Specific, Not Conversational

Most people write to AI like they are texting a friend. Friendly, natural phrasing costs tokens.

Instead of this (47 tokens):

text
Hi, I was wondering if you could maybe help me understand how to write 
a PowerShell script? I need one that does something with Active Directory. 
I am not sure where to start.

Write this (22 tokens):

text
Write a PowerShell script to list all disabled AD user accounts. 
Output: CSV with Name, UPN, LastLogonDate.

Fewer words, better answer, less cost.

Technique 2: Tell the AI How Long to Be

AI models default to thorough answers: they add context, caveats, and explanations you often do not need. One short instruction at the end of your prompt cuts output tokens dramatically.

Add one of these to the end of your prompts:

Instruction | When to use it
Be concise. Max 3 sentences. | Quick factual answers
Bullet points only. No intro. | Lists and summaries
Output the code only. No explanation. | When you just need the script
Summary only: 5 bullet points max. | Summarising documents
One paragraph. Plain language. | Explaining concepts to non-technical users
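
If you build prompts in code, one possible way to apply these consistently is a small helper that appends the instruction for you. This is only a sketch: the category names ("fact", "list", and so on) are invented for this example.

```python
# Length instructions from the table above, keyed by hypothetical category names.
LENGTH_HINTS = {
    "fact": "Be concise. Max 3 sentences.",
    "list": "Bullet points only. No intro.",
    "code": "Output the code only. No explanation.",
    "summary": "Summary only: 5 bullet points max.",
    "explain": "One paragraph. Plain language.",
}


def with_length_hint(prompt: str, kind: str) -> str:
    """Append the matching length instruction to the end of a prompt."""
    return f"{prompt.rstrip()}\n{LENGTH_HINTS[kind]}"


print(with_length_hint("Write a PowerShell script to list all disabled AD user accounts.", "code"))
```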

Technique 3: Do Not Paste What You Do Not Need

When summarising a document, people often paste the entire thing — including headers, footers, disclaimers, and boilerplate. Strip all of that before pasting.

If you need to analyse a 40-page PDF, do not paste all 40 pages at once. Paste the relevant sections only. Or paste page by page and ask for a running summary.
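
As a rough illustration, a few lines of Python can do the stripping for you by keeping only the paragraphs that mention what you are asking about. Simple keyword matching is an assumption here, and it will miss relevant text that uses different wording, so treat it as a first pass.

```python
def relevant_sections(document: str, keywords: list[str]) -> str:
    """Keep only the paragraphs that mention at least one keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    kept = [p for p in paragraphs if any(k in p.lower() for k in wanted)]
    return "\n\n".join(kept)


# Made-up document with boilerplate around the part that matters.
document = """Confidential. Do not distribute.

Section 4: Password policy. Passwords must be at least 14 characters.

Section 5: Travel expenses. Claims are due within 30 days.

Page 3 of 40"""

print(relevant_sections(document, ["password"]))
```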

Technique 4: Use a Shorter System Prompt

If you are using the API or a custom GPT, your system prompt is sent with every single request. A 500-word system prompt is roughly 670 tokens, so running it 1,000 times costs about 670,000 input tokens before a single user has typed a word.

Audit your system prompts ruthlessly. Cut anything that is not doing active work. A system prompt should be instructions, not an essay.

Rule of thumb: Keep system prompts under 200 words unless you have a specific reason to go longer.
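
To see what trimming is worth, this sketch converts words to tokens using the rule of thumb from earlier (about 750 words per 1,000 tokens, so a 500-word prompt is roughly 667 tokens per request) and compares a 500-word prompt with a 200-word one. The request volume is an assumed figure.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from earlier: 1,000 tokens is about 750 words


def system_prompt_tokens(words: int, requests: int) -> int:
    """Approximate input tokens spent on the system prompt across all requests."""
    return round(words / WORDS_PER_TOKEN * requests)


before = system_prompt_tokens(words=500, requests=1_000)
after = system_prompt_tokens(words=200, requests=1_000)
print(f"500-word prompt: {before:,} tokens")
print(f"200-word prompt: {after:,} tokens")
print(f"saved: {before - after:,} tokens per 1,000 requests")
```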

Technique 5: Ask for Drafts, Not Finals

When writing something — an email, a policy document, a script — do not ask for a full finished version on the first pass. Ask for an outline or bullet points first. Review it. Only then ask for the full draft.

text
First, give me a bullet-point outline only. I will tell you which sections to expand.

This two-step approach often saves 60–70% of output tokens, because you avoid the AI writing sections you will delete anyway.

Technique 6: Avoid Repeating Context

In a long conversation, people often re-explain context they already gave ("As I mentioned earlier, I am working with Windows 11..."). The AI already has everything in the conversation window, so do not repeat what is there.

Technique 7: Use Code Instead of English for Structured Tasks

If you need data in a specific format, give the AI an example of the format rather than describing it in words.

Wordy (uses more tokens, less reliable output):

text
Can you give me a table with three columns showing the first column as the 
name of the country, the second column as the capital city, and the third 
column as the population? Please format it nicely.

Precise (uses fewer tokens, more reliable output):

text
List 5 countries as JSON: [{"country":"","capital":"","population":0}]
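
A side benefit of asking for JSON is that you can check the reply in code instead of reading it. A minimal sketch, assuming the model returns a bare JSON array in the shape of the template above (the sample reply here is made up):

```python
import json

# Same shape as the template in the prompt: the example values define the expected types.
TEMPLATE = {"country": "", "capital": "", "population": 0}


def parse_rows(reply: str) -> list[dict]:
    """Parse the model's reply and check every row matches the template's keys and types."""
    rows = json.loads(reply)
    for row in rows:
        if set(row) != set(TEMPLATE):
            raise ValueError(f"unexpected keys: {sorted(row)}")
        for key, example in TEMPLATE.items():
            if not isinstance(row[key], type(example)):
                raise ValueError(f"{key} should be {type(example).__name__}")
    return rows


# A made-up reply, standing in for what the model might return.
reply = '[{"country": "France", "capital": "Paris", "population": 68000000}]'
print(parse_rows(reply))
```

If the check fails, you can retry with the same short prompt rather than writing a longer one.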

Technique 8: One Question at a Time

Asking three questions in one message often results in the AI answering all three at length — three times the output tokens. Ask one question, get the answer, then ask the next.

Unless you genuinely need all three answers together, this single habit significantly reduces output length.


Part 2: Use the Right Model for Each Task

This is the fastest way to cut costs without changing how you work at all. Most teams default to the most expensive model for everything — and that is pure waste.

Think of it like hiring contractors:

  • You do not hire a senior architect to hang a picture frame
  • You do not hire an apprentice to design your building's structure

AI models work the same way.

The Three-Tier Model Strategy

Tier 1 — The Fast and Cheap Model

Use for: simple questions, formatting, short summaries, classification, quick lookups, drafting short emails.

Models: GPT-4o mini, Claude Haiku 4, Gemini 1.5 Flash

Cost: 10–50× cheaper than top-tier models. The quality for simple tasks is nearly identical to the expensive models.

Tier 2 — The Balanced Model

Use for: writing longer documents, reviewing code, analysing reports, multi-step tasks, technical explanations.

Models: GPT-4o, Claude Sonnet 4, Gemini 1.5 Pro

Cost: Mid-range. Good quality at a reasonable price. This should be your default for most professional work.

Tier 3 — The Premium Model

Use sparingly for: complex reasoning, critical decisions, long document analysis, debugging difficult problems, legal or financial review.

Models: Claude Opus 4, o1, o3

Cost: 5–10× more than Tier 2. Only justified when the task genuinely needs it.

Quick Model Selection Guide

Is this a simple, routine task? (format text, short email, basic question)

  • YES → Use Tier 1 (cheap and fast). Save 80–90% vs using top-tier.
  • NO → Does it involve complex reasoning, long analysis, or important decisions?
      • YES → Use Tier 3 sparingly. Document why Tier 2 was not enough.
      • NO → Use Tier 2. This covers the majority of professional work.
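
If requests are routed in code, the guide above reduces to two yes/no questions. A minimal sketch, with hypothetical parameter names; deciding whether a task counts as "simple" or "complex" is still up to you.

```python
def pick_tier(simple_routine: bool, complex_or_critical: bool) -> int:
    """Return 1, 2 or 3 following the selection guide above."""
    if simple_routine:
        return 1  # cheap and fast
    if complex_or_critical:
        return 3  # premium, use sparingly
    return 2      # balanced default


print(pick_tier(simple_routine=True, complex_or_critical=False))   # e.g. a short email
print(pick_tier(simple_routine=False, complex_or_critical=False))  # e.g. a code review
print(pick_tier(simple_routine=False, complex_or_critical=True))   # e.g. a legal review
```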

Real Savings Example

Say your team sends 500 requests per day:

Scenario | Model used | Monthly cost (estimate)
All requests on GPT-4o | Tier 2 for everything | ~£400/month
Simple tasks on GPT-4o mini, complex on GPT-4o | Smart tiering | ~£80/month
Saving | | ~£320/month

The work output quality is nearly identical. The cost is 80% lower.
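
The table does not state its underlying assumptions, so here is one set of assumed numbers that lands in the same region, using the approximate API prices from the pricing table earlier. The tokens per request and the share of simple tasks are illustrative guesses, and the result is in US dollars rather than pounds.

```python
# USD per 1M tokens (input, output), taken from the pricing table earlier.
GPT_4O = (2.50, 10.00)
GPT_4O_MINI = (0.15, 0.60)

# Assumptions for illustration only, not figures from the article:
REQUESTS_PER_MONTH = 500 * 30      # 500 requests/day over a 30-day month
INPUT_TOKENS = 5_000               # assumed average input per request
OUTPUT_TOKENS = 2_000              # assumed average output per request
SIMPLE_SHARE = 0.85                # assumed share of requests a cheap model can handle


def cost_per_request(prices: tuple[float, float]) -> float:
    """USD cost of one average request at the given (input, output) prices."""
    input_price, output_price = prices
    return (INPUT_TOKENS * input_price + OUTPUT_TOKENS * output_price) / 1_000_000


all_gpt4o = REQUESTS_PER_MONTH * cost_per_request(GPT_4O)
tiered = REQUESTS_PER_MONTH * (
    SIMPLE_SHARE * cost_per_request(GPT_4O_MINI)
    + (1 - SIMPLE_SHARE) * cost_per_request(GPT_4O)
)
saving_pct = 100 * (all_gpt4o - tiered) / all_gpt4o

print(f"all GPT-4o: ${all_gpt4o:,.2f}")
print(f"tiered:     ${tiered:,.2f}")
print(f"saving:     {saving_pct:.0f}%")
```

Under these assumptions the saving comes out at roughly 80%. Change SIMPLE_SHARE to match your own mix of tasks and the figure moves accordingly.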

Not sure which tier your task needs? Start cheap.

Always try the cheapest model first. If the output is not good enough, move up one tier. You will be surprised how often GPT-4o mini or Claude Haiku handles a task just as well as the expensive model.


Part 3: Licence Management — Paying Only for What You Actually Use

Licences are where organisations waste the most money — not through heavy usage, but through sheer inertia. Licences assigned to people who barely use them. Teams on Enterprise plans that do not need Enterprise features.

Audit Who Is Actually Using What

Before your next renewal, pull usage data:

  • Microsoft Copilot: Admin Centre → Reports → Usage → Copilot activity
  • ChatGPT Team/Enterprise: Settings → Usage in the admin console
  • OpenAI API: Platform → Usage dashboard — shows per-API-key consumption

The typical finding: 20–30% of licensed users account for 80%+ of usage. The remaining users log in once a month or less.

Action: Remove or downgrade licences for low-usage users. Reassign them to people who will actually benefit.

Right-Size the Plan for Each User Type

Not everyone needs the same tier. Segment your users honestly:

User type | What they need | Right plan
Occasional user (1–2 questions/day) | Basic AI access | Free tier or shared team account
Regular user (drafting, summarising, Q&A) | Reliable access, good models | Plus or Team
Power user (complex analysis, API, integrations) | High limits, all models, priority | Team or Enterprise
Developer (building apps on the API) | API access, fine control | Pay-as-you-go API

Consolidate Overlapping Tools

Many organisations are unknowingly paying for multiple AI tools that do the same thing:

  • ChatGPT Plus for some users
  • Microsoft Copilot for the same users
  • Gemini Advanced through Google Workspace
  • An independent AI writing tool

Do a quick audit. If Copilot already covers writing and summarisation for Microsoft 365 users, they may not need a separate ChatGPT Plus subscription. Picking one primary tool for most users and reserving specialist tools for specific roles can cut AI licence spend by 30–50%.


Part 4: Setting Hard Spend Caps and Token Limits

Unlimited usage is a liability. Every major AI platform now gives you tools to set hard limits — use them. Here is how on each platform:

OpenAI API — Spend Limits

Set a Monthly Spend Limit

Log in to platform.openai.com → Settings → Limits. Set a hard limit (usage stops at this amount) and a soft limit (you get an email warning when you approach it).

Start conservative — for a small team just beginning with the API, £50–100/month is a sensible hard limit until you understand your actual usage patterns.

Use Separate API Keys Per Project

In platform.openai.com → API Keys, create a different key for each project or use case. You can monitor spending per key in the Usage dashboard. This tells you exactly which application or team is consuming the most.

Set Per-Key Rate Limits

For each API key, you can set requests-per-minute and tokens-per-minute limits. This prevents a runaway script from consuming your entire monthly budget in an hour.
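
Platform limits are the real backstop, but a script can also police itself. The sketch below is a client-side guard and does not call any provider API: the class names are invented for this example, and the per-request cost is an assumed estimate you would supply yourself.

```python
class BudgetExceeded(RuntimeError):
    """Raised when recording a request would push spend past the cap."""


class SpendGuard:
    """Client-side running total; a complement to the platform limits, not a replacement."""

    def __init__(self, monthly_cap_usd: float) -> None:
        self.monthly_cap_usd = monthly_cap_usd
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> None:
        """Call before sending a request; refuses it if the cap would be exceeded."""
        if self.spent_usd + cost_usd > self.monthly_cap_usd:
            raise BudgetExceeded(
                f"cap ${self.monthly_cap_usd:.2f} reached (spent ${self.spent_usd:.2f})"
            )
        self.spent_usd += cost_usd


guard = SpendGuard(monthly_cap_usd=1.00)
try:
    for _ in range(1_000):          # stands in for a runaway loop
        guard.record(0.03)          # assumed cost of each request
except BudgetExceeded as err:
    print(f"stopped: {err}")
```

The loop stops itself after a few dozen iterations instead of running all 1,000.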

Microsoft Copilot — Admin Controls

If your organisation uses Microsoft Copilot for Microsoft 365:

Manage Licence Assignment in the Microsoft 365 Admin Centre

In the Microsoft 365 Admin Centre → Users → Active Users, assign Copilot licences only to users who will actively use it. Unassign licences from users who have not opened Copilot in 30 days.

Review Usage Reports Monthly

Admin Centre → Reports → Microsoft 365 Usage → Copilot. Review the Active Users and Feature Usage report. Filter to users with fewer than 5 interactions in the last 30 days — those are candidates for licence removal.
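
If you export the report, the filter is a one-liner. This sketch assumes you have already reduced the export to a mapping of user to interaction count; the addresses and numbers are made up, and the real report's column names are not shown here.

```python
def removal_candidates(interactions: dict[str, int], minimum: int = 5) -> list[str]:
    """Users with fewer than `minimum` Copilot interactions in the reporting period."""
    return sorted(user for user, count in interactions.items() if count < minimum)


# Made-up figures standing in for a 30-day usage report export.
last_30_days = {
    "alex@example.com": 42,
    "sam@example.com": 0,
    "priya@example.com": 3,
    "jo@example.com": 5,
}

print(removal_candidates(last_30_days))
```

Treat the result as a list of people to talk to, not an automatic removal list: someone on leave will look inactive too.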

Use Microsoft Purview for Data Controls

If your organisation handles sensitive data, enable Copilot sensitivity labels in Microsoft Purview. This prevents Copilot from referencing files above a certain classification level — both a governance control and a token saver (less context being pulled in automatically).

Anthropic API — Usage Controls

In console.anthropic.com → Settings → Limits:

  • Set a monthly spend limit — usage halts when the limit is reached
  • Create separate API keys per project with individual rate limits
  • Monitor per-key usage in the Usage tab to see which projects are most expensive

Azure OpenAI — Quota Management

If you are using Azure OpenAI (common in enterprises):

  • In the Azure Portal → your Azure OpenAI resource → Quotas, set TPM (Tokens Per Minute) limits per deployment
  • Create separate deployments for different teams or applications with individual quotas
  • Set Azure Cost Management budgets with email alerts at 80% and 100% of your monthly target
  • Use Azure Policy to restrict which resource groups can create OpenAI deployments

Set alerts at 80%, not 100%

A spend alert at 100% tells you when the damage is already done. Set your first alert at 50% of your monthly budget (a sanity check) and a second at 80% (time to investigate). If you hit 80% by mid-month, you have time to act before the bill lands.
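
The same logic as a short sketch: which thresholds have been crossed, and where the month would end if spending continued at the same daily rate. The straight-line projection and the 30-day month are simplifying assumptions.

```python
def crossed_thresholds(spent: float, budget: float, thresholds=(0.5, 0.8)) -> list[float]:
    """Return the alert thresholds (as fractions of budget) that spend has reached."""
    return [t for t in thresholds if spent / budget >= t]


def projected_month_end(spent: float, day_of_month: int, days_in_month: int = 30) -> float:
    """Straight-line projection: assumes the daily spend rate stays the same."""
    return spent / day_of_month * days_in_month


budget = 100.0
spent = 80.0          # 80% of budget used...
day = 15              # ...by mid-month

print(crossed_thresholds(spent, budget))    # which alerts have fired
print(projected_month_end(spent, day))      # where the month ends at this rate
```

Hitting 80% on day 15 projects to well over budget by month end, which is the signal to investigate now rather than when the bill arrives.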


Part 5: Monitor Usage and Catch Waste Early

Controls without monitoring are not controls — they are just hopes. Build a simple monthly review habit.

What to Review Each Month

What to check | Where to find it | What to do with it
Total spend vs last month | Platform billing dashboard | If growing, investigate which project or user is driving it
Top 5 highest-spending API keys | OpenAI / Anthropic usage dashboard | Review whether the usage is legitimate and expected
Copilot inactive licences | M365 Admin Centre usage report | Remove licences from users with <5 interactions
Average tokens per request | API usage metrics | Spikes indicate runaway prompts or a broken script
Model distribution | API logs | Check you are not using premium models for simple tasks
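
The last two rows are easy to compute yourself if your application logs the model and token count of each request. A minimal sketch over made-up records; the log format and the 3× spike threshold are assumptions.

```python
from collections import Counter
from statistics import mean

# Made-up log records: (model, total tokens in the request).
requests = [
    ("gpt-4o-mini", 900),
    ("gpt-4o-mini", 1_100),
    ("gpt-4o", 4_000),
    ("gpt-4o-mini", 1_000),
    ("gpt-4o", 30_000),   # an outlier worth investigating
]


def model_distribution(records):
    """Share of requests per model, as fractions that sum to 1."""
    counts = Counter(model for model, _ in records)
    return {model: n / len(records) for model, n in counts.items()}


def spikes(records, factor=3.0):
    """Requests whose token count exceeds `factor` times the average."""
    average = mean(tokens for _, tokens in records)
    return [(m, t) for m, t in records if t > factor * average]


print(mean(tokens for _, tokens in requests))   # average tokens per request
print(model_distribution(requests))
print(spikes(requests))
```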

Set Up Automated Alerts

Do not rely on remembering to check. Set email alerts so you are notified before a problem becomes a bill:

  • OpenAI: platform.openai.com → Settings → Limits → Soft limit email notification
  • Azure: Cost Management → Budgets → add email alert at 80% of target
  • Anthropic: console.anthropic.com → Settings → Billing → usage alert threshold
  • Google (Gemini API): Cloud Console → Billing → Budgets and Alerts

The Complete Cost-Control Checklist

Use this as your quarterly AI spend audit:

Prompting habits:

  • Are users asking focused, specific questions rather than long conversational ones?
  • Are output length instructions included in reusable prompts?
  • Are people pasting full documents when only sections are needed?

Model choices:

  • Is the team defaulting to the most expensive model out of habit?
  • Are simple tasks being routed to cheap models (Haiku, GPT-4o mini, Flash)?
  • Are premium models reserved for genuinely complex tasks?

Licences:

  • Have inactive users been identified and their licences removed?
  • Are users on the right tier for their actual usage level?
  • Are you paying for overlapping tools that do the same job?

Spend controls:

  • Are hard spend limits set on all API accounts?
  • Are separate API keys used per project?
  • Are email alerts configured at 80% of your monthly target?

Monitoring:

  • Is there a monthly usage review in the calendar?
  • Does someone own the AI cost review — or does it fall through the cracks?

Frequently Asked Questions

How much should a team of 10 realistically spend on AI per month? Highly variable — but as a benchmark, a 10-person team using AI for writing, summarisation, and code assistance with good cost hygiene typically lands at £100–250/month in total. Teams without controls in place often spend 3–5× that for the same output.

Is it worth switching between providers to save money? Sometimes, but do not underestimate the switching cost — learning curves, integration changes, different prompt styles. Better to optimise on one or two platforms than to constantly chase the cheapest per-token rate.

Can I use prompt caching to save on API costs? Yes — both Anthropic and OpenAI support prompt caching, which discounts repeated context (like a long system prompt) on subsequent requests. If you have a fixed system prompt or document that appears in every request, caching can reduce input token costs by up to 90% on those cached portions. Worth implementing for any application that runs at volume.
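
A rough sketch of what that discount means per request. It takes the best-case 90% figure at face value, applies it only to the fixed system prompt, and ignores the first uncached request and any extra charge a provider may make for writing to the cache. The token counts and price are assumed.

```python
def input_cost(system_tokens: int, user_tokens: int, price_per_million: float,
               cached_discount: float = 0.0) -> float:
    """Input cost of one request; `cached_discount` applies to the system prompt only."""
    system_cost = system_tokens * price_per_million * (1 - cached_discount)
    user_cost = user_tokens * price_per_million
    return (system_cost + user_cost) / 1_000_000


# Assumed figures: 3,000-token fixed prompt, 200-token question, $3.00 per 1M input tokens.
without_cache = input_cost(3_000, 200, 3.00)
with_cache = input_cost(3_000, 200, 3.00, cached_discount=0.90)
print(f"without cache: ${without_cache:.5f}")
print(f"with cache:    ${with_cache:.5f}")
```

The bigger the fixed portion of each request relative to the user's question, the more caching saves.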

Does ChatGPT Plus have token limits? ChatGPT Plus subscriptions have usage caps rather than hard token limits: you get a certain number of GPT-4o requests per period (the exact number changes based on demand). When you hit the cap, you are temporarily dropped to GPT-4o mini until the window resets.

What is the single biggest waste of tokens that most people do not realise? Pasting entire documents when you only need a section. A 10,000-word PDF pasted into ChatGPT to answer a question about one paragraph costs roughly 13,000 input tokens when you needed roughly 300. Extract the relevant section first, always.

Should I use streaming in my API applications? Streaming (receiving the response token by token as it is generated, rather than waiting for the full reply) does not affect your token cost — you pay the same either way. It improves perceived speed for users but does not reduce the bill.


Conclusion: Control Is a Feature, Not a Restriction

The goal of all of this is not to make AI less useful — it is to make it sustainably useful. Unlimited, unmonitored AI spend creates two problems: bills that grow out of control and a culture where nobody thinks carefully about how they use AI.

The best-run AI programmes treat token budgets like time budgets. You would not let a team book unlimited travel with no approval. The same discipline applies here.

Start with the quick wins:

  1. Add output-length instructions to your most-used prompts today
  2. Set a spend alert on your API accounts this week
  3. Pull your Copilot usage report and identify inactive licences this month

Each of these takes under 15 minutes and can save meaningfully on your next bill. Build from there — model tiering, system prompt audits, and monthly reviews — and you will find that better AI cost management often also means better AI quality. Focused prompts get better answers. Right-sized models respond faster. Less waste in means better signal out.

Written by

Chetan Yamger

Cloud Engineer · AI Automation Architect · Modern Workplace Consultant

Cloud Engineer, AI Automation Architect, and Modern Workplace Consultant based in Amsterdam, Netherlands. Specializing in scalable, secure enterprise solutions with Microsoft Azure, Intune, PowerShell, and AI-driven automation using ChatGPT, Gemini, and modern LLM technologies.
