Telegram signal channels are creative. "BUY EU NOW SL 1.0850 TP 1.0920" and "📈 EURUSD long. Entry: market. Stop: 50p below. TP1 1.09, TP2 1.095" and a screenshot of a TradingView chart with no text — all valid signals to a human eye, all hostile to a regex.
Why we kept reaching for new regexes
Every channel we onboarded forced another regex. The collection grew to over forty patterns with conflicting precedence, so each added rule risked breaking another channel's parse. Maintenance cost rose with every new channel, the false-positive rate climbed, and none of the rules could handle screenshots.
What the rewrite changed
Two layers. The first is a deterministic parser, about 600 lines of TypeScript, that handles the clean structured cases (most of the volume, milliseconds to parse, no API cost). If it returns low confidence, the second layer is a small LLM call constrained to a strict JSON schema: the original message is the only context, and the system prompt explicitly forbids guessing.
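A minimal sketch of that dispatch, to make the control flow concrete. Everything named here is an assumption for illustration: the single regex stands in for the real 600-line parser, the 0.8 threshold is invented, and the LLM layer is passed in as a function (`LlmParser`) so the sketch runs without any API.

```typescript
// Illustrative sketch: the regex, the threshold, and every name here are assumptions.
type Direction = "buy" | "sell";

interface ParsedSignal {
  symbol: string;
  direction: Direction;
  stopLoss: number | null;
  takeProfits: number[];
}

interface ParseResult {
  signal: ParsedSignal | null;
  confidence: number; // 0 to 1
  path: "deterministic" | "llm";
}

// Second layer, injected so the sketch needs no network access or API key.
type LlmParser = (message: string) => Promise<ParseResult>;

const CONFIDENCE_THRESHOLD = 0.8; // assumed cut-off

// One clean shape only: "<BUY|SELL> <6-letter symbol> ... SL <price> ... TP <price>".
const CLEAN_SIGNAL =
  /^(BUY|SELL)\s+([A-Z]{6})\b.*?\bSL\s+(\d+(?:\.\d+)?)\b.*?\bTP\s+(\d+(?:\.\d+)?)/i;

function parseDeterministic(message: string): ParseResult {
  const m = CLEAN_SIGNAL.exec(message.trim());
  if (m === null) {
    // No match: report zero confidence so the caller falls through to the LLM.
    return { signal: null, confidence: 0, path: "deterministic" };
  }
  return {
    signal: {
      symbol: m[2].toUpperCase(),
      direction: m[1].toUpperCase() === "BUY" ? "buy" : "sell",
      stopLoss: Number(m[3]),
      takeProfits: [Number(m[4])],
    },
    confidence: 1,
    path: "deterministic",
  };
}

async function parseSignal(message: string, llm: LlmParser): Promise<ParseResult> {
  const first = parseDeterministic(message);
  if (first.confidence >= CONFIDENCE_THRESHOLD) {
    return first; // clean case: no API call is made
  }
  return llm(message); // low confidence: hand the original message to the second layer
}
```

With this toy pattern, "BUY EURUSD NOW SL 1.0850 TP 1.0920" is handled by the first layer, while "BUY EU NOW SL 1.0850 TP 1.0920" falls through to the LLM because the sketch has no symbol-alias table. A real deterministic layer would presumably cover far more shapes before giving up.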
The constraints that matter
- Output must be valid JSON conforming to a strict schema. Where the message does not state a value, the model returns a null field instead of a guess.
- The model is asked to refuse unless it can identify the direction, the symbol, and at least one of entry/SL/TP. Refusals route to operator review, not auto-execution.
- Images get a vision model pass with the same schema; OCR alone proved too lossy for hand-annotated charts.
- Every parsed signal stores the original message text, the parser path (deterministic vs LLM), and the confidence reported by that path. The audit trail is more important than the parse.
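The first two constraints can be enforced in code rather than trusted to the prompt. The sketch below treats the model's reply as untrusted text: it rejects anything that is not exactly the expected object, then re-applies the refusal rule itself. The field names, the `refused` flag, and the single `takeProfit` field are assumptions made for the example, not the production schema.

```typescript
// Sketch only: field names and the `refused` flag are assumptions, not the production schema.
type Direction = "buy" | "sell";

interface LlmSignal {
  refused: boolean; // the model's own refusal flag
  symbol: string | null;
  direction: Direction | null;
  entry: number | null;
  stopLoss: number | null;
  takeProfit: number | null;
  confidence: number; // 0 to 1, as reported by the model
}

type Validation =
  | { ok: true; signal: LlmSignal }
  | { ok: false; reason: string };

const EXPECTED_KEYS = [
  "refused", "symbol", "direction", "entry", "stopLoss", "takeProfit", "confidence",
] as const;

const isNumberOrNull = (v: unknown): v is number | null =>
  v === null || (typeof v === "number" && Number.isFinite(v));
const isStringOrNull = (v: unknown): v is string | null =>
  v === null || typeof v === "string";
const isDirectionOrNull = (v: unknown): v is Direction | null =>
  v === null || v === "buy" || v === "sell";

function validateLlmOutput(raw: string): Validation {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "not valid JSON" };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, reason: "not a JSON object" };
  }
  const obj = data as Record<string, unknown>;
  // Strict: exactly the expected keys, no extras and none missing.
  const exactKeys =
    Object.keys(obj).length === EXPECTED_KEYS.length &&
    EXPECTED_KEYS.every((k) => k in obj);
  if (!exactKeys) return { ok: false, reason: "unexpected or missing keys" };

  const { refused, symbol, direction, entry, stopLoss, takeProfit, confidence } = obj;
  if (
    typeof refused !== "boolean" ||
    !isStringOrNull(symbol) ||
    !isDirectionOrNull(direction) ||
    !isNumberOrNull(entry) ||
    !isNumberOrNull(stopLoss) ||
    !isNumberOrNull(takeProfit) ||
    typeof confidence !== "number" ||
    confidence < 0 ||
    confidence > 1
  ) {
    return { ok: false, reason: "field has wrong type or range" };
  }
  return {
    ok: true,
    signal: { refused, symbol, direction, entry, stopLoss, takeProfit, confidence },
  };
}

// Re-check the refusal rule in code instead of relying on the model to apply it.
function requiresReview(s: LlmSignal): boolean {
  const hasLevel = s.entry !== null || s.stopLoss !== null || s.takeProfit !== null;
  return s.refused || s.symbol === null || s.direction === null || !hasLevel;
}
```

A reply that fails `validateLlmOutput`, or passes it but trips `requiresReview`, would go to operator review with the original message attached. Checking the rule twice, once in the prompt and once in code, means a model that ignores its instructions still cannot reach execution.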
Practical lessons
Don't let the model invent missing fields. Don't let the model break the JSON schema. Don't auto-execute on low confidence — flag for review. The hard part isn't the model, it's the contract around the model.
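As a rough illustration of the third rule, here is a sketch of an execution gate that decides between auto-execution and review and records why. The 0.9 threshold, the record fields, and the reason strings are all assumptions; the point is only that the routing decision and the audit record are produced together.

```typescript
// Sketch of an execution gate: threshold, field names, and reasons are assumptions.
type ParserPath = "deterministic" | "llm";

interface ParseOutcome {
  originalText: string; // the raw message, kept verbatim for the audit trail
  parserPath: ParserPath;
  confidence: number; // 0 to 1
  refused: boolean; // parser declined, or its output failed validation
}

interface AuditRecord extends ParseOutcome {
  route: "execute" | "review";
  reason: string;
  recordedAt: string; // ISO 8601 timestamp
}

const MIN_EXECUTE_CONFIDENCE = 0.9; // assumed threshold

function routeSignal(outcome: ParseOutcome, now: Date = new Date()): AuditRecord {
  let route: AuditRecord["route"] = "execute";
  let reason = "confidence at or above threshold";
  if (outcome.refused) {
    // A refusal always wins, whatever confidence was reported.
    route = "review";
    reason = "parser refused or output failed validation";
  } else if (outcome.confidence < MIN_EXECUTE_CONFIDENCE) {
    route = "review";
    reason = "confidence below threshold";
  }
  return { ...outcome, route, reason, recordedAt: now.toISOString() };
}
```

Because every call returns a full record, including the untouched original text, an operator reviewing a flagged signal sees the same inputs the gate saw.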