Warning: I’m not advocating the following approach unless it is accompanied by appropriate quality testing and risk analysis. LLM performance will improve over time, but there is always a chance of non-deterministic behaviour - it’s inherent in their design. Even a heavily engineered, optimised prompt could work 95% of the time and then suddenly throw out garbage.
Modern stacks tend to perform analytic transformations in an ELT fashion - data is loaded to a warehouse, and SQL is used to transform it into metrics and analytics views. This article considers the stage before this: data is converted, cleaned and prepared into a form that can be loaded to a datastore.
Intro
In the following post, I explore the possibility of doing no-code conversions and basic transforms on data with the gpt-3.5-turbo LLM. I’ll take some data in JSON form, convert it to CSV, and translate the data to a more useful form - all via an LLM prompt.
TLDR: Overall conclusion? I would still prefer to use prompt-generated, deterministic code for transforms. See my previous article.
Source code
Find the source code here - it’s pretty small.
Transform weather forecast data
TLDR: The transform to CSV worked! I needed a prompt with one-shot examples to guide the LLM in places.
Pass 1:
I dumped a full version of the JSON in the prompt context. It looked like this - you can see it has many days of data.
However, there was an immediate problem: many rows were missing, along with some columns. This was due to the token limit (4,096 tokens for gpt-3.5-turbo); I suspect gpt-3.5 returns its best effort under these circumstances.
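One way to see this coming is to count tokens before sending the payload. A minimal sketch using OpenAI’s tiktoken tokeniser (the file path is a placeholder):

```python
# Check whether the JSON payload will fit in the model's context window.
import tiktoken

MODEL = "gpt-3.5-turbo"
CONTEXT_LIMIT = 4096  # shared between prompt and completion

enc = tiktoken.encoding_for_model(MODEL)

with open("forecast.json") as f:  # placeholder path
    payload = f.read()

used = len(enc.encode(payload))
print(f"{used} tokens of JSON, {CONTEXT_LIMIT - used} left for instructions and output")
```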
The results:
Pass 2:
To solve the previous problem, I reduced the JSON to only a single day’s data. The rows were all there, but some columns were still missing.
Here is the slimmer JSON used.
Pass 3:
Next, the focus was on prompt tuning to solve the missing-columns problem from pass 2 and to develop a more complex transformation:
- Ask to include all columns
- Ask to calculate the date-time
Note: I included a one-shot example in the prompt to get the date-time calculation consistently accurate.
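To make this concrete, here is a sketch of the kind of prompt involved, with a one-shot example for the date-time rule. The offset convention (minutes after midnight) follows the Met Office DataPoint style; treat the wording and field values as illustrative, not the experiment’s verbatim prompt.

```python
# Illustrative prompt - not the verbatim prompt from the experiment.
PROMPT = """Convert the JSON weather forecast below into CSV.
Include ALL fields as columns.
Add a DateTime column computed from the forecast date plus the
minutes-after-midnight offset of each period.

Example: date "2023-07-01Z" with offset 180 -> DateTime "2023-07-01T03:00:00Z"

JSON:
{json_payload}
"""
```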
Here is the JSON used (same as pass 2).
The results:
Final prompt
With this, we have a concise natural language prompt that could transform the weather forecast JSON to CSV without code. More complex transforms on the data could easily be prompted with one or more examples in the prompt if needed.
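As a sketch, wiring such a prompt into the chat API might look like the following (openai pre-1.0 style, contemporary with gpt-3.5-turbo; the function and variable names are mine):

```python
import openai

def transform_to_csv(prompt: str, json_payload: str) -> str:
    """Send the transform prompt plus JSON payload, return the CSV text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # greedy decoding: more repeatable, still not guaranteed
        messages=[{"role": "user",
                   "content": prompt.format(json_payload=json_payload)}],
    )
    return response["choices"][0]["message"]["content"]
```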
Limitations
An immediate limitation was the context length (gpt-3.5-turbo = 4,096 tokens). For larger datasets, the process would be to chunk the data, process each chunk, and recombine the results. Langchain has a pattern for this: MapReduceChain.
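A hand-rolled sketch of that pattern, assuming the source JSON is a list of record-like objects and reusing the transform_to_csv() sketch above:

```python
import json

def transform_large(records: list, chunk_size: int = 50) -> str:
    """Chunk -> transform each chunk -> recombine, keeping one CSV header."""
    parts = []
    for i in range(0, len(records), chunk_size):
        csv = transform_to_csv(PROMPT, json.dumps(records[i:i + chunk_size]))
        if parts:
            csv = "\n".join(csv.splitlines()[1:])  # drop repeated header row
        parts.append(csv)
    return "\n".join(parts)
```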
Here are the average per-request costs (for 6 rows of forecast data):
- Tokens Used: 1046
- Prompt Tokens: 804
- Completion Tokens: 242
- Total Cost (USD): $0.00169
Let’s do some rough math to understand scale.
Let’s assume we continued with gpt-3.5-turbo. If there were 1,000 rows of data (versus the 6 in my experiment), the cost would be (1000 / 6) × $0.00169 ≈ $0.28.
If there were 100,000 rows, it would cost about $28. Costs ramp up quickly, and this is still a relatively small dataset - less than 20MB.
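The same back-of-envelope scaling as code, using the measured per-request figures above:

```python
COST_PER_REQUEST = 0.00169  # USD, measured for 6 rows (1046 tokens)
ROWS_PER_REQUEST = 6

for rows in (1_000, 100_000):
    cost = rows / ROWS_PER_REQUEST * COST_PER_REQUEST
    print(f"{rows:>7} rows ≈ ${cost:.2f}")  # ≈ $0.28 and ≈ $28.17
```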
There are higher-context options like gpt-3.5-turbo-16k that allow bigger payloads per request. The published pricing (input / output):
- 4K context: $0.0015 / 1K input tokens, $0.002 / 1K output tokens
- 16K context: $0.003 / 1K input tokens, $0.004 / 1K output tokens
The prices would soon add up if running this on much larger event data with wide columns and over 100,000 rows generated regularly. Private LLMs become the go-to option in these cases.
Er, wait though …
Could the LLM be performing well because this Met Office data schema is public knowledge and well-documented? At the very least, ‘weather’ is a widely documented concept. The prompt was not exactly light, but the LLM might have been doing consistently well because it already knows the schema and the weather domain well enough to make the right leaps.
I ran another experiment on a similarly structured dataset with an entirely made-up (meaningless) domain.
Transform custom domain data
Here is the customised JSON used. I’ve changed all the JSON field names.
The LLM needed slightly more prompt guidance to map the coded field names to meaningful column names, and the prompt also had to map the ‘H’ field’s code values to word values. Beyond this, the prompt changed very little, and the LLM performed the transformation successfully.
The prompt:
The results:
Non-determinism
The tighter you constrain the prompt, the more consistent the results. The problem is that you never really know how deterministic the output is - i.e. the probability of a bad response. That’s fine for the many use cases where results are allowed to be ‘creative’.
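There is no API that reports that probability, but it can be probed empirically: run the same prompt repeatedly and count distinct outputs. A crude sketch, reusing transform_to_csv() from earlier (sample_json is a placeholder):

```python
# Crude determinism probe: identical inputs, how many distinct outputs?
outputs = {transform_to_csv(PROMPT, sample_json) for _ in range(5)}
print(f"{len(outputs)} distinct output(s) from 5 identical runs")
```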
LLM usage needs a high amount of critical thinking. Much of the cool stuff you see promoted works because it applies LLMs over publicly documented domains with well-known language and concepts. LLMs are trained on a vast corpus of this knowledge. And for many use cases, this is great - it will drive automation and productivity with low effort.
A public LLM could perform less well on domain-specific data where the language is more customised and contextualised. Is ‘Dog’ an animal or a brand of beer? If a non-contextualised LLM wrongly assumes it’s the animal, imagine the kind of response it could give - it may do well, it may not.
With growing context length limits, contextualisation can be done at runtime (via the prompt), but this only goes so far. Custom domains with large volumes of domain data are better served by fine-tuning an LLM.
Conclusion
For now, though, I’ll be sticking to deterministic code for transforms. However, a friendly LLM can certainly help me write that code. I fully expect my current understanding will get blown away as LLMs evolve.
It would be super cool to explore fine-tuning an LLM to domain-specific data and see what powerful features could result. I could also extend the transformation complexity and use a more powerful LLM like gpt-4 to see how it performs.