LLMs in Operations: Where They Actually Work
By ClearEdge Intelligence
The hype around LLMs is intense. Vendors promise they'll revolutionize everything. The reality is more nuanced.
After deploying LLMs in dozens of operational contexts, here's our honest assessment of where they work—and where they don't.
Where LLMs Excel
1. Document Understanding and Extraction
The use case: Extracting structured data from unstructured documents—invoices, contracts, emails, specifications.
Why it works: LLMs are remarkably good at understanding context and handling variation. A traditional rule-based parser breaks when invoice formats change. An LLM handles the variation naturally.
Real example: We built an invoice processing system that extracts vendor, amount, line items, and due date from PDF invoices. It handles 50+ different vendor formats with 95%+ accuracy, no per-vendor customization required.
Watch out for: Numerical precision. LLMs can misread numbers. Always validate extracted amounts against expected ranges.
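The range-and-consistency check described above can be sketched in a few lines. The field names and the default ceiling below are illustrative assumptions, not a real schema:

```python
# Sanity-check sketch for LLM-extracted invoice fields. Field names and
# the max_total ceiling are illustrative assumptions, not a real schema.

def validate_invoice(extracted: dict, max_total: float = 100_000.0) -> list[str]:
    """Return a list of problems; an empty list means the extraction passes."""
    problems = []
    total = extracted.get("amount")
    items = extracted.get("line_items", [])

    if total is None or not (0 < total <= max_total):
        problems.append("total missing or outside expected range")

    # LLMs occasionally misread digits: cross-check the stated total
    # against the sum of the line items (small rounding tolerance).
    if items and total is not None:
        line_sum = sum(item["amount"] for item in items)
        if abs(line_sum - total) > 0.01:
            problems.append(f"line items sum to {line_sum}, not {total}")

    return problems
```

Anything this check flags goes back for human review rather than straight into the accounting system.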
2. Classification and Routing
The use case: Categorizing incoming items—support tickets, leads, documents—and routing them appropriately.
Why it works: LLMs understand nuance and context better than keyword matching. "My order never arrived" and "Package seems lost" get routed to the same queue, even though they share no keywords.
Real example: Customer service email classification. Previously required 15 manual routing rules and still had 30% misroutes. LLM-based classifier: 94% accuracy, handles edge cases gracefully.
Watch out for: Ambiguous cases. Some items genuinely could go multiple places. Build in a "low confidence" pathway for human review.
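The "low confidence" pathway is a small piece of routing logic. In this sketch, `classify` stands in for a real LLM classifier that returns a label and a confidence score; the threshold value is an assumption you'd tune on your own traffic:

```python
# Low-confidence routing sketch. classify() is a placeholder for a real
# LLM classifier returning (label, confidence); the 0.80 floor is an
# assumed value to be tuned on real traffic.

CONFIDENCE_FLOOR = 0.80

def route(ticket: str, classify) -> str:
    label, confidence = classify(ticket)
    if confidence < CONFIDENCE_FLOOR:
        return "human_review"   # genuinely ambiguous: a person decides
    return label                # confident enough: route automatically
```

The key design choice is that ambiguity is a first-class outcome, not an error: low-confidence items land in a named queue instead of being forced into a wrong category.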
3. Summarization and Synthesis
The use case: Condensing long documents, meeting notes, or report collections into actionable summaries.
Why it works: This is essentially what LLMs were designed for. They're excellent at identifying key points and restating them concisely.
Real example: Weekly operations summary that pulls from 15 different system reports and produces a one-page executive brief. What took 2 hours now takes 10 minutes of review time.
Watch out for: Hallucination risk. LLMs may confidently summarize things that aren't in the source. Always include source references.
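Source references can be enforced mechanically. This sketch assumes each summary bullet carries a `[source:NAME]` tag (the tag format is our illustration, not a standard) and flags any bullet that cites nothing, or cites a report that wasn't an input:

```python
# Minimal grounding check: every summary bullet must cite one of the
# reports it was built from. The [source:NAME] tag format is an
# assumption for illustration.
import re

def ungrounded_bullets(bullets: list[str], source_names: set[str]) -> list[str]:
    """Return bullets whose [source:...] tag is missing or unknown."""
    bad = []
    for bullet in bullets:
        match = re.search(r"\[source:([^\]]+)\]", bullet)
        if match is None or match.group(1) not in source_names:
            bad.append(bullet)
    return bad
```

This doesn't prove a claim is true, but it guarantees the reviewer can always jump to the document a claim supposedly came from.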
4. Draft Generation
The use case: Creating first drafts of routine communications—responses, reports, proposals.
Why it works: LLMs dramatically accelerate the "blank page to rough draft" phase. Human review and editing are still required, but the starting point is much further along.
Real example: RFP response drafting. Pull relevant sections from past proposals, adapt to new requirements, generate first draft. Reduced proposal prep time by 60%.
Watch out for: Over-reliance. If humans stop reading carefully because "the AI wrote it," quality will drift.
5. Conversational Interfaces to Data
The use case: Letting non-technical users query databases using natural language.
Why it works: LLMs can translate "show me sales by region for last quarter" into SQL remarkably well, especially for well-documented schemas.
Real example: Ops managers can now ask questions of the data warehouse directly, without waiting for analyst time. 70% of queries handled without human intervention.
Watch out for: Complex queries. Queries involving multiple joins or edge-case logic still need human query builders. And always validate outputs against known benchmarks.
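Whatever SQL the model generates should pass through a guardrail before it touches the warehouse. A minimal sketch, assuming a read-only use case and an allowlist of tables (the table names here are made up):

```python
# Guardrail sketch for LLM-generated SQL: accept only single-statement,
# read-only queries against allowlisted tables. The table names are
# assumptions; a real deployment would load them from the schema.
import re

ALLOWED_TABLES = {"sales", "regions", "orders"}

def is_safe_query(sql: str) -> bool:
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:                          # reject multi-statement input
        return False
    if not stmt.lower().startswith("select"):
        return False                         # read-only: SELECT only
    # Every table named after FROM/JOIN must be on the allowlist.
    tables = re.findall(r"\b(?:from|join)\s+(\w+)", stmt, re.IGNORECASE)
    return bool(tables) and all(t.lower() in ALLOWED_TABLES for t in tables)
```

Running the generated query through a check like this, and against a read-only database role, keeps a mis-translated question from becoming a destructive statement.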
Where LLMs Fall Short
1. Precise Calculations
LLMs are language models, not calculators. They can generate the formula for a calculation, but they shouldn't perform the calculation itself.
Don't: Ask an LLM to calculate monthly interest on a loan. Do: Have the LLM identify what needs to be calculated, then use actual computation tools.
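That division of labor looks like this in practice: the LLM's job ends at producing a structured description of the calculation (the spec format below is a hypothetical example), and ordinary code does the arithmetic deterministically:

```python
# Division-of-labor sketch: the LLM emits a structured spec (hypothetical
# format), and plain code performs the arithmetic deterministically.

def monthly_interest(spec: dict) -> float:
    """Compute simple monthly interest from an LLM-produced spec."""
    if spec["calculation"] != "monthly_interest":
        raise ValueError(f"unsupported calculation: {spec['calculation']}")
    # annual_rate is a fraction (0.06 == 6%); divide by 12 for one month.
    return round(spec["principal"] * spec["annual_rate"] / 12, 2)

# e.g. what the LLM might extract from "What's the monthly interest
# on a $10,000 loan at 6% APR?"
spec = {"calculation": "monthly_interest", "principal": 10_000, "annual_rate": 0.06}
```

With this split, `monthly_interest(spec)` returns 50.0 every time; the model never touches the digits, only the interpretation of the request.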
2. Real-Time Decision Making
Current LLMs have latency measured in seconds. For operational decisions that need to happen in milliseconds (routing, load balancing, real-time pricing), LLMs aren't appropriate.
Don't: Put an LLM in the hot path of a high-volume transaction system. Do: Use LLMs for decision support that humans or fast algorithms will execute.
3. Tasks Requiring Perfect Accuracy
If being wrong 5% of the time causes serious problems (medical diagnosis, legal filings, financial compliance), LLMs alone aren't sufficient.
Don't: Let an LLM autonomously submit regulatory filings. Do: Use LLMs to draft, but require human verification for high-stakes outputs.
4. Anything Requiring Memory of Past Interactions
Out of the box, LLMs don't remember previous conversations. Each request is independent. Building memory requires additional infrastructure.
Don't: Assume the LLM knows what you discussed yesterday. Do: Build explicit context passing if continuity matters.
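"Explicit context passing" just means the application stores the history and replays it into every prompt, because the model retains nothing between calls. A minimal sketch, with an assumed turn limit to keep prompts bounded:

```python
# Explicit context passing sketch: the application owns the history and
# replays it into each prompt. The 20-turn cap is an assumed limit to
# keep prompt size bounded.

class Conversation:
    def __init__(self, max_turns: int = 20):
        self.turns: list[tuple[str, str]] = []   # (role, text)
        self.max_turns = max_turns

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        # Drop the oldest turns once past the limit.
        self.turns = self.turns[-self.max_turns:]

    def build_prompt(self, new_message: str) -> str:
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        tail = f"user: {new_message}"
        return f"{history}\n{tail}" if history else tail
```

Production systems layer summarization or retrieval on top of this so long conversations don't overflow the context window, but the principle is the same: if continuity matters, you carry it yourself.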
5. Tasks Where Explainability Is Critical
LLMs can give you an answer but struggle to explain exactly why. For decisions that need to be audited or justified, this is problematic.
Don't: Use LLMs for decisions that might be legally challenged. Do: Pair LLM outputs with traditional rule-based logic that creates an audit trail.
Implementation Principles
Start with the 80/20. Find the tasks where an LLM handling 80% of the volume automatically (with humans handling the 20% of edge cases) creates massive value. Don't chase 100% automation.
Build validation layers. LLM outputs should be sanity-checked before action. Does this invoice amount seem reasonable? Does this classification match the keywords present?
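The "does this classification match the keywords present?" check is a cheap second opinion: a keyword pass that flags the LLM's label for review when the two signals disagree. The keyword map below is illustrative:

```python
# Second-opinion sketch: flag an LLM classification for human review
# when an obvious keyword points to a different queue. The keyword map
# is illustrative, not a real taxonomy.

KEYWORD_HINTS = {
    "refund": "billing",
    "invoice": "billing",
    "lost": "shipping",
    "tracking": "shipping",
}

def needs_review(text: str, llm_label: str) -> bool:
    """True when keyword hints exist and none agree with the LLM's label."""
    words = text.lower().split()
    hints = {KEYWORD_HINTS[w] for w in words if w in KEYWORD_HINTS}
    return bool(hints) and llm_label not in hints
```

Disagreement doesn't mean the LLM is wrong; it means the item is cheap to double-check, which is exactly where review effort should go.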
Monitor and improve. Track where the LLM struggles. Those patterns become training examples for improvement or triggers for human review.
Keep humans in the loop. For most operational use cases, the winning pattern is "LLM proposes, human approves" or "LLM handles routine, human handles exceptions."
The Bottom Line
LLMs are powerful tools for specific categories of work—understanding unstructured text, handling variation, accelerating drafts.
They're not magic. They're not good at math. They hallucinate. They don't remember things.
The wins come from matching LLM strengths to operational pain points, building appropriate safeguards, and keeping realistic expectations about what they can and can't do.
Used well, they're a genuine productivity multiplier. Used poorly, they create new categories of errors.
Choose your use cases wisely.