As data engineers or analysts, we often receive data in formats unsuited to our needs. This article explores transforming data from the Met Office API using a chat LLM (gpt-3.5-turbo) as a copilot.
To be clear, I am not going full-blown no-code with this - it still involves generated code. In a later post, I will experiment with an LLM's ability to transform data without code.
Raw data
The data has a few challenges:
- Verbose
- Cryptic field names
- Unfiltered
- Uncombined data
Here is a simplified version of the data. You can see it is still a relatively complex format, with field names mapped in a separate data structure (SiteRep.Wx.Param).
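(The snippet below is an illustrative Python sketch of that shape rather than the real Met Office payload; the field names and values are hypothetical placeholders that mirror the structure described here.)

```python
# Hypothetical, heavily simplified stand-in for the raw payload - the field
# names and values are placeholders that mirror the structure described above.
simplified = {
    "SiteRep": {
        "Wx": {
            "Param": [
                # Maps the terse field names used in each Rep to readable names
                {"name": "T", "units": "C", "$": "Temperature"},
                {"name": "W", "units": "", "$": "Weather Type"},
            ]
        },
        "DV": {
            "Location": {
                "Period": [
                    {
                        "value": "2023-05-01Z",  # the Period date
                        "Rep": [
                            {"T": "11", "W": "7", "$": "540"},  # 540 minutes = 09:00
                            {"T": "13", "W": "3", "$": "720"},  # 720 minutes = 12:00
                        ],
                    }
                ]
            }
        },
    }
}
```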
Note: The most complex part of this transformation is the SiteRep.DV.Location.Period[0].Rep[0].$ field. This field holds the “minutes from midnight” offset from the Period date, so each Rep is a 3-hourly block under that Period date. This is difficult to determine from the data alone (and indeed, the LLM needed prompting around it).
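A minimal sketch of that mapping, assuming the Period value is a date string such as 2023-05-01Z and $ is a string of minutes:

```python
from datetime import datetime, timedelta

def rep_datetime(period_value: str, rep: dict) -> datetime:
    """Combine a Period date (e.g. '2023-05-01Z') with a Rep's '$' field,
    which holds the minutes elapsed since midnight on that date."""
    period_date = datetime.strptime(period_value, "%Y-%m-%dZ")
    return period_date + timedelta(minutes=int(rep["$"]))

# rep_datetime("2023-05-01Z", {"$": "540"}) -> datetime(2023, 5, 1, 9, 0)
```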
Transformed data
I wanted to transform it to a single row representing the forecast matching the current date and time:
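For illustration (hypothetical values, with the readable names from SiteRep.Wx.Param promoted to column headers), the kind of row I was aiming for looks like this:

```
datetime,Temperature,Weather Type
2023-05-01T12:00:00,13,3
```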
I could then load this into a database, or include it in a prompt to a large language model for a generative AI use case, without using tokens unnecessarily.
Two paths
- Write the code manually, then use an LLM to optimise it
- Use the LLM to write and optimise the code
My instinct was that an LLM would struggle with path 2, but I was very wrong. I tested both paths and captured my observations below.
Results
Overall, the LLM could generate new code or optimise existing code, with minimal changes needed once it was integrated into the codebase.
Path 1
- Time to write manually: 5 minutes
- Time for LLM optimisation: 1 minute
- Total time: 6 minutes
I over-engineered parts of my version: I separated logic into functions without encapsulating the abstraction properly, and I favoured a compositional style that was overkill for the task. I could have refactored my code to collapse the complexity, but that would have added time (say another 5 minutes).
Before: Source code link - manually written
After: Source code link - LLM optimised
A few design-level issues remain since the LLM only optimised each function individually.
Path 2
- Time to prompt the LLM to write code: 2 minutes
- Total time: 2 minutes
- Lines of code reduced versus path 1: ~100 (and one module/file removed)
The result was usable with only two minor tweaks to resolve some bugs. More importantly, the LLM did not over-engineer or overthink the task. It produced a single function that balanced readability and performance.
Source code link - LLM generated
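To give a flavour of the shape of that result, here is my own minimal sketch (not the generated code itself), reusing the assumptions from the earlier snippets: Period values like 2023-05-01Z, $ as minutes from midnight, and a three-hour window per Rep.

```python
import csv
import io
from datetime import datetime, timedelta
from typing import Optional

def forecast_to_csv(site_rep: dict, now: Optional[datetime] = None) -> str:
    """Return a one-row CSV string for the Rep whose three-hour window
    contains `now`, using the readable names from Wx.Param as headers."""
    now = now or datetime.utcnow()
    # Map terse field names (e.g. "T") to readable names (e.g. "Temperature")
    names = {p["name"]: p["$"] for p in site_rep["SiteRep"]["Wx"]["Param"]}

    for period in site_rep["SiteRep"]["DV"]["Location"]["Period"]:
        period_date = datetime.strptime(period["value"], "%Y-%m-%dZ")
        for rep in period["Rep"]:
            start = period_date + timedelta(minutes=int(rep["$"]))
            if start <= now < start + timedelta(hours=3):
                row = {"datetime": start.isoformat()}
                row.update({names[key]: value for key, value in rep.items() if key != "$"})
                buffer = io.StringIO()
                writer = csv.DictWriter(buffer, fieldnames=row.keys())
                writer.writeheader()
                writer.writerow(row)
                return buffer.getvalue()
    return ""
```

With the simplified payload sketched earlier, forecast_to_csv(simplified, datetime(2023, 5, 1, 13, 0)) would return just the 12:00 row.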
Conclusions
Overall, I am impressed with the LLM copilot for helping me with my data transformation code. It is an objective tool that can generate and optimise code from short prompts and, at the very least, provide alternatives. It is also a great way to learn a new programming language, or to discover features you may not know about in a language you already use.
Copiloting is much easier if you have written code before: the feedback loop is faster because you can scan the returned code on each iteration and decide on the next prompt.
I’ve always been a fan of pair programming, yet a copilot replaces the knowledge transfer and the smaller-scoped, code-level problem-solving traditionally performed during pairing. Pair programming morphs into collaboration focused on contextual understanding, higher-level problem-solving, gathering domain expertise, solution design, strategic decisions and quality review.
Code can be subjective: LLM-generated code might spark debate over coding style and what constitutes maintainability. In the examples above, the LLM produced code that used Python generators - a Pythonic approach, though some will argue it is not the most readable form. Production codebases should still be team-reviewed, with alignment reached on style, so that teams can support LLM-generated code.
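As a purely hypothetical illustration of that style debate (not the code from the linked sources), compare a generator expression with the equivalent explicit loops:

```python
# A tiny stand-in for SiteRep.DV.Location.Period (hypothetical values).
periods = [
    {"value": "2023-05-01Z", "Rep": [{"T": "11", "$": "540"}, {"T": "13", "$": "720"}]}
]

# Generator expression: lazy and compact, but the nested iteration reads
# "inside out" and can be harder to scan.
rows = (
    (period["value"], rep)
    for period in periods
    for rep in period["Rep"]
)

# Equivalent explicit loops: more lines, but the control flow is obvious.
rows_list = []
for period in periods:
    for rep in period["Rep"]:
        rows_list.append((period["value"], rep))
```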
Conversations
I won’t share all the raw chat conversations as they are lengthy.
For path 1 (optimising existing code), the prompts were entirely variations of: Can you improve the code?
For path 2 (LLM-generated code), here is the ordered list of prompts. All were successful the first time, except where noted.
1. Write some Python code that converts this JSON to CSV
2. Can you improve it?
3. Can you include the datetime of the period?
   - Almost: the generated code did not include the Z in the date parser's format string, so I added it manually.
4. Thanks! Can you ensure the datetime includes the time? The time is defined by the $ field in each Rep, which indicates the minutes from midnight from the period's date.
5. Can you improve this function?
6. Can you further improve this function?
7. Can you change it so it only includes the row matching the current date time?
8. Can you ensure the current date time is within a three-hour window from the period datetime?
   - Failed: I could not articulate the windowing concept, so I corrected this manually.
9. Can you improve readability and performance?
10. Can you change this slightly so CSV is output to a string instead of a file?