javisantana.com

Building a simple SQL Agent from Scratch

There are more people talking about AI Agents than actual agents. The ratio of builders to people talking about building is probably 1:10000, and yet here I am talking about agents, so it looks like I’m contributing more shit. I hope not but, you know, I’m just a guy with a blog.

I wanted to create a really simple SQL Agent to teach myself how to do it: no libraries to simplify the process, just a bash script using the llm CLI tool. I don’t actually like bash that much, but it’s limited enough that it forces me to focus on the actual problem.

The basic algorithm to generate a working SQL query from a user question looks like this (python-ish):

chat = LLM(system="You are an expert SQL query generator, write the SQL given the prompt.")
answer = chat(prompt)
while not is_correct(answer):
    answer = chat(answer + ". Please fix the SQL query.")
print(answer)

So, in theory, I’d just need to write a function (is_correct) that tells the model whether the SQL is right, plus the initial prompt. In theory.

Ok, so let’s build a super simple agent that writes SQL given a prompt. For simplicity, the is_correct function will return “OK” if the SQL runs. It’s not a good criterion, but it’s a good starting point to understand the dynamics. I’m using DuckDB (an embedded analytics database I’ve been testing lately) as the runtime, so I don’t need to run a fully fledged database server.
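The loop hinges on is_correct, which in this first version just means “the SQL executes”. A hypothetical sketch of that check, using Python’s stdlib sqlite3 in place of DuckDB (any engine that raises on invalid SQL works the same way):

```python
import sqlite3

def is_correct(sql):
    """Return (True, "") if the SQL executes, (False, error message) otherwise.

    Stand-in sketch: sqlite3 instead of DuckDB, but the idea is identical --
    run the query and treat "it executed" as "it is correct"."""
    try:
        with sqlite3.connect(":memory:") as conn:
            conn.execute(sql)
        return True, ""
    except sqlite3.Error as e:
        return False, str(e)

ok, err = is_correct("SELECT 1 + 1")   # valid SQL
bad, err2 = is_correct("SELEC oops")   # syntax error, returns the message
```

The error string is the important part: it is what gets fed back into the next prompt so the model can fix its own query.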

1. Generate a SQL query for a given prompt

#!/bin/bash

LLM="llm -m gemini-2.0-flash-exp"
prompt="You are an expert SQL query generator for duckdb. Do not write any explanations, just the SQL query. Generate a SQL query for: $1"
flags=""

while true; do
    # Generate SQL using LLM
    SQL=$($LLM $flags "$prompt")
    # Remove any ``` markers from the SQL
    SQL=$(echo "$SQL" | sed 's/^```sql$//g' | sed 's/```$//g' | sed 's/^```//g')
    
    # Try executing the SQL with DuckDB and capture the error output
    ERROR=$(echo "$SQL" | duckdb 2>&1)
    if [ $? -eq 0 ]; then
        echo "*** DONE!"
        echo "$SQL"
        break
    else
        echo "*** Invalid SQL generated, retrying..."
        echo "*** Error: $ERROR"
        # Include the error in the next LLM prompt
        prompt="Previous attempt failed with error: $ERROR Please fix the SQL query."
        flags="-c" # to tell LLM to continue the conversation
        sleep 1
    fi
done

It kind of works. If you test it with prompts that don’t require any table, it gives you pretty decent SQL queries.

./sqlagent_1.sh "generate the fibonacci series"
./sqlagent_1.sh "generate the game of life"

The game of life needs a few iterations, but that’s OK, it’s a recursive query. I coded that query myself a few years ago and it took me an afternoon.

But if we ask for stuff like

./sqlagent_1.sh "generate a histogram with table http_requests"

It fails. It does because we are asking it to do something it can’t: the http_requests table doesn’t exist. Humans are constantly asking for things that are not possible, so we need a way to stop the agent from trying forever. Is there a way for the LLM to say “enough is enough” and stop trying?

To be honest, sometimes it finds a way and generates things like:

SELECT
    CASE 
        WHEN request_time < 1 THEN 1
        WHEN request_time < 2 THEN 2
        WHEN request_time < 3 THEN 3
        WHEN request_time < 4 THEN 4
        WHEN request_time < 5 THEN 5
        WHEN request_time < 6 THEN 6
        WHEN request_time < 7 THEN 7
        WHEN request_time < 8 THEN 8
        WHEN request_time < 9 THEN 9
        ELSE 10
    END AS bucket,
    count(*) AS count
FROM (SELECT random()*10 as request_time FROM range(100))
GROUP BY bucket
ORDER BY bucket;

Step 2: trying to stop the agent

So I tried adding “If you are unable to fix it, just return select 'STOP'” to the prompt, but it does not work :)

prompt="Previous attempt failed with error: $ERROR Please fix the SQL query. If you are not able to fix it, just return \`select 'STOP'\`."

But being more explicit and setting the prompt to:

prompt="Previous attempt failed with error: $ERROR Please fix the SQL query. If you can't find the tables just return select 'STOP'."

it works (I added the iteration number as debugging info):

-- [ITERATION 0] --------------------------------------
*** Invalid SQL generated, retrying...
*** Error: Catalog Error: Table with name http_requests does not exist!
Did you mean "pg_sequences"?
LINE 5:     http_requests
            ^
-- [ITERATION 1] --------------------------------------
*** DONE!

SELECT 'STOP'

Step 3: using the data inside the database

Generating a query that doesn’t read any data is not really useful, so let’s add a step where we pass the duckdb database to the agent so it can use the data inside it.

There are just two changes to the agent: 1) we add to the prompt a list of columns for each table, with their types (using the information_schema.columns table); 2) we pass the database file to the agent so it can query the data inside it.
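Building that column list for the prompt is a one-query job. A sketch of the idea, using Python’s stdlib sqlite3 (PRAGMA table_info) as a stand-in for DuckDB’s information_schema.columns; table and column names here are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_requests (timestamp INTEGER, session_id TEXT, action TEXT)")

def schema_for_prompt(conn):
    """Return one 'table.column TYPE' line per column, ready to paste into the prompt."""
    lines = []
    tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk)
        for _, col, ctype, *_ in conn.execute(f"PRAGMA table_info({table})"):
            lines.append(f"{table}.{col} {ctype}")
    return "\n".join(lines)

print(schema_for_prompt(conn))
```

Whatever the engine, the output is just a few lines of text that get concatenated into the system prompt.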

In this case I’m using a duckdb database I generated with all the page views I track on this website. It looks like this:

✗ duckdb test.db -c "describe web_requests" 
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ timestamp   │ UINTEGER    │ YES     │         │         │         │
│ session_id  │ VARCHAR     │ YES     │         │         │         │
│ action      │ VARCHAR     │ YES     │         │         │         │
│ version     │ VARCHAR     │ YES     │         │         │         │
│ payload     │ VARCHAR     │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

The code:

#!/bin/bash

LLM="llm -m gemini-2.0-flash-exp"

DB=$1

# Get the list of columns for each table
tables=$(duckdb $DB -c "select table_name, column_name, data_type, is_nullable  from information_schema.columns")

prompt="You are an expert SQL query generator for duckdb. Do not write any explanations, just the SQL query. The list of columns for each table is:\n $tables. Generate a SQL query for: $2"
flags=""

ITERATION=0
while true; do
    # Generate SQL using LLM
    echo "-- [ITERATION $ITERATION] --------------------------------------"
    SQL=$($LLM $flags "$prompt")
    # Remove any ``` markers from the SQL
    SQL=$(echo "$SQL" | sed 's/^```sql$//g' | sed 's/```$//g' | sed 's/^```//g')
    
    # Try executing the SQL with DuckDB and capture the error output
    ERROR=$(echo "$SQL" | duckdb $DB 2>&1)
    if [ $? -eq 0 ]; then
        echo "*** DONE!"
        echo "$SQL"
        res=$(duckdb $DB -c "$SQL")
        echo "$res"
        # Pass the actual results back so the model can judge them
        SQL=$($LLM $flags "The query ran successfully, these are the results:\n$res\nReturn select 'STOP' if you think this is what the initial prompt asked for. Otherwise, return a new SQL query.")
        SQL=$(echo "$SQL" | sed 's/^```sql$//g' | sed 's/```$//g' | sed 's/^```//g')
        if [[ "$SQL" == *"STOP"* ]]; then
            break
        fi
    else
        echo "*** Invalid SQL generated, retrying..."
        echo "*** Error: $ERROR"
        # Include the error in the next LLM prompt
        #prompt="Previous attempt failed with error: $ERROR Please fix the SQL query. If you are not able to fix it, just return `select 'STOP'`."
        prompt="Previous attempt failed with error: $ERROR\n Please fix the SQL query. If you don't find the tables just return select 'STOP'."
        flags="-c" # to tell LLM to continue the conversation
        sleep 1
    fi
    ITERATION=$((ITERATION + 1))
done

duckdb $DB -c "$SQL"

When running it, it can generate queries but the results are not good.

✗ ./sqlagent_3.sh test.db "visitors per day"
-- [ITERATION 0] --------------------------------------
*** Invalid SQL generated, retrying...
*** Error: Catalog Error: Scalar Function with name date does not exist!
Did you mean "datesub"?
...
                    ^
-- [ITERATION 8] --------------------------------------
*** DONE!

SELECT CAST(TO_TIMESTAMP(timestamp/1000) AS DATE) AS day, COUNT(DISTINCT session_id) AS visitors
FROM web_requests
GROUP BY day
ORDER BY day;
┌────────────┬──────────┐
│    day     │ visitors │
│    date    │  int64   │
├────────────┼──────────┤
│ 1970-01-21 │     7502 │
└────────────┴──────────┘

I see many ways to fix this:

  1. Run the query and ask the LLM whether it thinks it’s correct based on the initial prompt.
  2. Give it more info about the data (like a sample), so in this example it could actually see what the timestamp values look like.
  3. Put a human in the loop who can give more context, save that context for future runs, and so on. That’s actually what people are doing; check this example from Uber on how they do it: QueryGPT.

I tested 1 (as you can see) and it does not work well, and tested 3, which, of course, works.
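Option 2 is cheap to try: alongside the schema, paste a few sample rows into the prompt so the model can see, for example, that the timestamp is in epoch seconds rather than milliseconds. A hypothetical sketch, again using Python’s stdlib sqlite3 as a stand-in for DuckDB, with made-up rows:

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_requests (timestamp INTEGER, session_id TEXT)")
conn.execute("INSERT INTO web_requests VALUES (1694885193, 'abc'), (1694885194, 'def')")

def sample_for_prompt(conn, table, n=5):
    """Return up to n rows as JSON lines, ready to paste into the LLM prompt."""
    cols = [d[1] for d in conn.execute(f"PRAGMA table_info({table})")]
    rows = conn.execute(f"SELECT * FROM {table} LIMIT {n}").fetchall()
    return "\n".join(json.dumps(dict(zip(cols, r))) for r in rows)

print(sample_for_prompt(conn, "web_requests"))
```

Seeing a literal value like 1694885193 is usually enough for the model to pick the right epoch conversion on the first try.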

And this could go on and on, but I think I now understand how to build a super simple SQL Agent, and I hope you do too.

SQL is good

I used to dislike SQL. I learned to deal with databases using ORMs, and when I needed to open psql to understand what was going on, I felt like I was wasting my time having to deal with SQL. Why would anyone want to deal with SQL when they have a simple API in the language they’re working with?

After some time I moved into the database infra space (as CTO of a company dealing with hundreds of Postgres databases) and started working with SQL every day. It was not that bad, and I learned to appreciate it little by little. Eventually I loved it: I’ve done all sorts of crazy stuff with it, like rendering my company logo, recreating a game of life simulation, and starting a company (that raised +50M) where the only interface is SQL.

Don’t get me wrong, I don’t think ORMs are useless, quite the opposite: I think they are good and developers should use them (they help you avoid mistakes), but that shouldn’t stop anyone from learning SQL. You are going to run into it somewhere in your professional career; there are many systems that only talk SQL (actually, all database systems end up talking SQL, hello MongoDB). And honestly, any developer can learn the basics of SQL in a few hours. That, plus a few LLM prompts, and you are good to go for 90% of the use cases. Learning a little bit of database internals is also good; that’ll put you in the top 1% of developers worldwide.

But there is one thing about SQL that amazes me: the runtime. It’s so simple: a plain text language, you send it over the wire and boom, you have results. No npm install, just a simple language focused on a simple (but powerful) task, one that has lived through many generations of databases and systems. That’s hard to beat.

You have local runtimes (SQLite, DuckDB), you can install Postgres or MySQL, you can go to Neon and get a Postgres database in a few milliseconds. You don’t even need the data locally: you can query data in S3 and other places, you can even query data coming from other processes. It’s just amazing how simple it is. There is no other programming language with this simplicity.
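That simplicity is easy to demo: with Python’s bundled sqlite3 there is nothing to install, no server to run; you send text, you get rows back.

```python
import sqlite3

# No server, no install: an in-memory database and a text query.
rows = sqlite3.connect(":memory:").execute(
    "SELECT value, value * value FROM (SELECT 1 AS value UNION ALL SELECT 2 UNION ALL SELECT 3)"
).fetchall()
print(rows)  # [(1, 1), (2, 4), (3, 9)]
```

One import, one string, results. The same string would work, give or take a dialect detail, against engines written decades apart.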

I know SQL is not perfect when you need to express complex logic. You can always twist it, but you end up with crazy 2000-line SQL queries. In 20 years of working with it I’ve only found a few cases where it gets genuinely unmanageable for data operations, though many times it’s just easier to use another procedural language. You need to like functional programming as well, but you get used to thinking that way after a while.

I’ll probably end my career writing SQL or talking to an LLM that writes SQL :)

LLMs to transform data

I transform data every day and I usually do 2 kinds of transformations: changing the data format so I can use it in a tool (CSV to Parquet), or changing its shape, like running an aggregation so I can understand it. I’m using LLMs more and more for this because it saves me a lot of time (and it’s cool).

I see two approaches for using LLMs to transform data:

Asking to transform the data and output it

cat file | llm --system "transform the data into <format>"

Asking to generate code to transform the data

cat file | llm --system "generate python code to transform this data into <format>" > transform.py && \
python transform.py file > transformed

If you don’t know what the llm command is, please go check out Simon Willison’s fantastic llm CLI tool.

The second one has many benefits:

The code will run way faster, LLMs are still slow compared to regular CPUs

The transformation can be audited and fixed

Let’s test it. I have a file with NMEA records. According to Wikipedia, NMEA “is a combined electrical and data specification for communication between marine electronics such as echo sounder, sonars, anemometer, gyrocompass, autopilot, GPS receivers and many other types of instruments”. If NMEA were invented today it would have been NDJSON, but back then machines were sending data over a 9600 baud comm line.
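One detail worth knowing before reading the generated code: RMC sentences pack coordinates as degrees plus minutes (ddmm.mmmm), so 3929.0184 means 39° 29.0184', i.e. 39 + 29.0184/60 ≈ 39.48364 decimal degrees. A small converter:

```python
def nmea_to_decimal(value, direction, deg_digits):
    """Convert an NMEA ddmm.mmmm / dddmm.mmmm coordinate to decimal degrees."""
    degrees = float(value[:deg_digits])       # whole degrees
    minutes = float(value[deg_digits:])       # minutes, including fraction
    decimal = degrees + minutes / 60.0
    return -decimal if direction in ("S", "W") else decimal

# First fix of the sample data: 3929.0184,N / 00037.8573,W
print(nmea_to_decimal("3929.0184", "N", 2))   # 39.48364
print(nmea_to_decimal("00037.8573", "W", 3))  # ~ -0.630955
```

Those two numbers match the first NDJSON line the LLM produces below, which is a nice quick check that the generated parser got the format right.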

Anyway, I have some data I got from my car’s GPS (which still sends the info using NMEA these days); I filter the RMC sentences (the ones with the coordinates) and pipe them into the llm command (using gemini-2.0 code execution). This is the command (shortened for clarity):

head -n 40 cheste_fixed.nmea | grep RMC | llm  -m gemini-2.0-flash-exp -o code_execution 1 --system "write and EXECUTE a program to parse that data and generate ndjson with lat, lon, time and speed attributes. No explanations, just write and execute"

import json
import re
from datetime import datetime, timedelta

data = """
$GPRMC,172633.00,A,3929.0184,N,00037.8573,W,80.7,54.1,160923,,*22
$GPRMC,172634.00,A,3929.0331,N,00037.8280,W,84.0,55.5,160923,,*24
$GPRMC,172635.00,A,3929.0495,N,00037.7985,W,93.5,54.8,160923,,*22
...
"""

def parse_nmea(line):
    match = re.match(r'\$GPRMC,([\d.]+),A,([\d.]+),([NS]),([\d.]+),([EW]),([\d.]+),([\d.]+),(\d+),,', line)
    if match:
      time_str, lat_str, lat_dir, lon_str, lon_dir, speed_str, _, date_str = match.groups()

      time_obj = datetime.strptime(date_str+ time_str.split(".")[0], "%d%m%y%H%M%S")


      lat = float(lat_str[:2]) + float(lat_str[2:]) / 60.0
      if lat_dir == 'S':
          lat = -lat
      lon = float(lon_str[:3]) + float(lon_str[3:]) / 60.0
      if lon_dir == 'W':
         lon = -lon

      speed = float(speed_str) * 1.852

      return {
          "lat": lat,
          "lon": lon,
          "time": time_obj.isoformat(),
          "speed": speed
      }
    return None


parsed_data = []
for line in data.strip().splitlines():
  parsed_line = parse_nmea(line)
  if parsed_line:
    parsed_data.append(parsed_line)

for item in parsed_data:
  print(json.dumps(item))

{"lat": 39.48364, "lon": -0.630955, "time": "2023-09-16T17:26:33", "speed": 149.4564}
{"lat": 39.483885, "lon": -0.6304666666666667, "time": "2023-09-16T17:26:34", "speed": 155.568}
{"lat": 39.48415833333333, "lon": -0.629975, "time": "2023-09-16T17:26:35", "speed": 173.162}
{"lat": 39.48441833333333, "lon": -0.6294983333333334, "time": "2023-09-16T17:26:36", "speed": 177.9772}
{"lat": 39.48466666666667, "lon": -0.6290183333333333, "time": "2023-09-16T17:26:37", "speed": 177.9772}
{"lat": 39.48491666666666, "lon": -0.628515, "time": "2023-09-16T17:26:38", "speed": 179.27360000000002}

It feels like it did the right transformation (indeed, checking the data, it’s accurate). In case you’re checking the data carefully: the speed attribute feels too high, but it’s a car on a race track, so it’s expected.
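A quick sanity check on that: RMC reports speed over ground in knots, and the generated code multiplies by 1.852 (kilometers per nautical mile) to get km/h. The first sentence carries 80.7 knots, and 80.7 * 1.852 = 149.4564, exactly the first speed in the output.

```python
KNOTS_TO_KMH = 1.852          # one nautical mile per hour, in km/h

speed_knots = 80.7            # from the first $GPRMC sentence above
speed_kmh = speed_knots * KNOTS_TO_KMH
print(speed_kmh)              # 149.4564, matching the first NDJSON line
```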

But how could we make sure it’s doing it right? Let’s use what we’ve been using in software development for years: tests. So let’s ask the LLM to generate not just the transform, but also the test.

$ head -n 10  ~/Downloads/lap_timer/cheste_fixed.nmea | grep RMC | llm  -m gemini-2.0-flash-exp -o code_execution 1 --system "write and EXECUTE a program to parse that data and generate ndjson with lat, lon, time and speed attributes. Also write the code that transforms the output back to original format and check the transformation back. No explanations, just write and execute"

import json
import pynmea2
from datetime import datetime, timedelta

nmea_data = """$GPRMC,172633.00,A,3929.0184,N,00037.8573,W,80.7,54.1,160923,,*22
$GPRMC,172634.00,A,3929.0331,N,00037.8280,W,84.0,55.5,160923,,*24
$GPRMC,172635.00,A,3929.0495,N,00037.7985,W,93.5,54.8,160923,,*22
$GPRMC,172636.00,A,3929.0651,N,00037.7699,W,96.1,54.8,160923,,*28
$GPRMC,172637.00,A,3929.0800,N,00037.7411,W,96.1,55.5,160923,,*2D"""

def parse_nmea_to_ndjson(nmea_str):
  ndjson_output = []
  for line in nmea_str.strip().splitlines():
    try:
      msg = pynmea2.parse(line)
      if isinstance(msg, pynmea2.RMC):
          time_obj = datetime.combine(msg.datestamp, msg.timestamp)
          ndjson_output.append(json.dumps({
              "lat": msg.latitude,
              "lon": msg.longitude,
              "time": time_obj.isoformat(),
              "speed": msg.spd_over_grnd
          }))
    except pynmea2.ParseError:
      print(f"Could not parse: {line}")
      continue
  return "\n".join(ndjson_output)

def transform_ndjson_to_nmea(ndjson_str):
    nmea_output = []
    for line in ndjson_str.strip().splitlines():
        try:
            data = json.loads(line)
            time_obj = datetime.fromisoformat(data['time'])
            nmea_time = time_obj.strftime("%H%M%S.%f")[:-3]
            nmea_date = time_obj.strftime("%d%m%y")

            lat_deg = int(abs(data['lat']))
            lat_min = (abs(data['lat']) - lat_deg) * 60
            lat_dir = "N" if data['lat'] >= 0 else "S"

            lon_deg = int(abs(data['lon']))
            lon_min = (abs(data['lon']) - lon_deg) * 60
            lon_dir = "E" if data['lon'] >= 0 else "W"
            
            nmea_speed = data['speed']
            
            nmea_str = f"$GPRMC,{nmea_time},A,{lat_deg}{lat_min:.4f},{lat_dir},{lon_deg}{lon_min:.4f},{lon_dir},{nmea_speed:.1f},{nmea_speed:.1f},{nmea_date},,"
            
            checksum = 0
            for char in nmea_str[1:]:
                checksum ^= ord(char)
            nmea_str += f"*{hex(checksum)[2:].upper()}"
            nmea_output.append(nmea_str)
            
        except json.JSONDecodeError:
            print(f"Could not decode json: {line}")
            continue
    return "\n".join(nmea_output)

ndjson_data = parse_nmea_to_ndjson(nmea_data)
print("NDJSON output:")
print(ndjson_data)

nmea_transformed = transform_ndjson_to_nmea(ndjson_data)
print("\nTransformed back to NMEA:")
print(nmea_transformed)

print("\nOriginal NMEA:")
print(nmea_data)

def check_transformation(original, transformed):
    original_lines = original.strip().splitlines()
    transformed_lines = transformed.strip().splitlines()

    if len(original_lines) != len(transformed_lines):
        print("Number of lines do not match.")
        return False
    
    for orig_line, trans_line in zip(original_lines, transformed_lines):
        try:
            orig_msg = pynmea2.parse(orig_line)
            trans_msg = pynmea2.parse(trans_line)
            if not isinstance(orig_msg, pynmea2.RMC) or not isinstance(trans_msg, pynmea2.RMC):
                print("Message type is incorrect.")
                return False

            if not (abs(orig_msg.latitude - trans_msg.latitude) < 0.00001 and
                   abs(orig_msg.longitude - trans_msg.longitude) < 0.00001 and
                   abs(orig_msg.spd_over_grnd - trans_msg.spd_over_grnd) < 0.0001 and
                   orig_msg.timestamp == trans_msg.timestamp and
                   orig_msg.datestamp == trans_msg.datestamp
                   ):
                print(f"Original: {orig_msg}")
                print(f"Transformed:{trans_msg}")
                return False
        except pynmea2.ParseError:
            print(f"Could not parse: {orig_line} or {trans_line}")
            return False
    print("Transformation check passed!")
    return True

check_transformation(nmea_data, nmea_transformed)

ExecuteCode failed: MODULE_NOT_FOUND.
msg_err: Traceback (most recent call last):
  File "<string>", line 3, in <module>
ModuleNotFoundError: No module named 'pynmea2'

It fails to run because the pynmea2 module is not available in the code-execution sandbox, but if you run it locally it works. Running that self-test gives us some confidence about the transformation function, and I’d trust it in a pull request.

BTW, the first approach also works, at least with gemini 2.0: it manages to transform NMEA to JSON correctly, but it takes way more time than generating the script (as expected).

Learnings after 4 years as a

During the first 4 years of Tinybird (the company I founded) I’ve been helping our customers on the technical side (pre- and post-sales). I’ve probably talked to more than 100 companies and actively helped 50+, ranging from companies with just a few employees and gigabytes of data to some of the top companies in the world with terabytes.

Tinybird (bear with me, this is not a post about the product) helps solve specific projects, and most of the use cases we had to deal with were not just about building a data platform but also refactoring what the company already had to meet real-time requirements. I think helping these companies change their mindset has saved them millions of dollars a year.

Some clarifications before I start: when I talk about real-time, people think about Kafka, Spark, Flink, etc. But in reality it means “what you were doing before, but actually fast” or “what you were doing before, but without having to go grab a coffee while the pipeline runs”. I like to call it “high performance data engineering”. It usually means:

  1. Lots of data (more than tens of millions of rows a day with a few years of history, usually terabytes)
  2. Low latency end to end (so no or lightweight ETL)
  3. Sub-second queries, usually <100ms
  4. Reasonable costs (similar to or lower than traditional ETL)

Some practical learnings, in no particular order:

I learned a lot more about people themselves, but that’s another story.

The answer to the TGR23 LEGO contest

At this TRG, at Tinybird we ran a contest where you had to guess the execution time of a simple SQL query over 1 billion rows. If you guessed it, you’d win a LEGO set that is probably harder to assemble than learning SQL. The goal was not just to run a silly contest to collect “leads”: doing something that challenged people and at the same time helped explain what our company does made more sense. Besides, if you don’t know much about data, it lets you learn something.

Before writing anything: it is impossible to know the execution time of a query on a system where you don’t know all the components. Well, even if you know them, it’s impossible in practice. So it was a game of a lot of estimation and a lot of luck.

The challenge read like this:

“”” Answer with the time you think Tinybird takes to execute this SQL query

SELECT 
    toDate(timestamp) AS date, 
    avg(speed) AS avg_speed
FROM rocket_telemetry 
GROUP BY date

The table gets 1000 new rows per second, and at the start of TRG23 it had 1 American billion (10^9) rows.

HINT: it’s faster than you think. “””

As you can see, without knowing basics such as which system runs it, how it stores the data, the table schema, etc., it is absolutely impossible. Even so, let’s analyze how it could be estimated without getting too technical, in simple terms.

Let’s start with the basics: the time of such a simple query depends on the time it takes to read the data. This is true for the vast majority of queries executed on our beloved databases.

So in our case we have two columns. Let’s assume they are 4 bytes each (a Date field and a Float field); at the time the query ran there were about 1.1 billion rows, so we have a total of ~8.2 GB (= 1,000,000,000 * 4 * 2). My laptop, a MacBook Pro M1, can read approximately 3.8 GB/s from memory when all the values are perfectly laid out. I have no clue how long a Google server, where my Tinybird account was hosted, would take, but I assume it’s something similar, so the query should take a bit over 2 seconds.
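As a sanity check, the scan-time estimate in code, with round numbers (10^9 rows, two 4-byte columns, ~3.8 GB/s memory bandwidth):

```python
rows = 1_000_000_000      # ~1 billion rows at contest time
bytes_per_row = 4 * 2     # a 4-byte Date plus a 4-byte Float
scan_gb_per_s = 3.8       # rough M1 memory-read throughput

total_gb = rows * bytes_per_row / 1e9
print(total_gb)                    # 8.0 GB to scan
print(total_gb / scan_gb_per_s)    # ~2.1 s for a full, uncompressed scan
```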

Not so fast: we are actually smarter than that. The table had a timestamp, and the statement says there are about 1000 rows per second. I should have used a 64-bit timestamp, but I didn’t (my bad); even so, we know there are runs of 1000 identical values if we store them in the database in the order they arrive.

What if, instead of storing the value, we store the difference with the previous value? Then we get one 1 followed by 999 zeros. I don’t know about you, but that sounds like something that can be stored efficiently. Someone realized this years ago and invented something called RLE compression, which reduces the amount of information when values repeat a lot. In this case I used a somewhat more modern algorithm, LZ4. The story of LZ4 is pretty amazing; I recommend this podcast.
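The intuition behind RLE is small enough to sketch in a few lines: runs of repeated values collapse into (value, count) pairs, so the delta-encoded timestamp column, one 1 followed by 999 zeros every second, becomes just two pairs.

```python
from itertools import groupby

def rle(values):
    """Run-length encode: collapse each run of equal values into a (value, count) pair."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# Deltas of a timestamp column receiving 1000 identical values per second:
deltas = [1] + [0] * 999
print(rle(deltas))  # [(1, 1), (0, 999)] - 1000 values stored as two pairs
```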

Doing the math: if we store one 1 and 999 zeros, repeated 60 seconds * 60 minutes * 24 hours, that’s basically 1 KB (the size of 1000 values) plus a little more to say how many times to repeat it. In the database we use, ClickHouse, it takes about 22 MB. There are obviously inefficiencies, because the storage system is not designed for this exact use case and LZ4 does not work exactly as I described. Even so, going from 4 GB to 22 MB, we can say we did a good job.

So now we have two columns: timestamp, which takes 22 MB, and speed, which takes 4 GB. You might ask: why doesn’t that column compress? Because I used rand(), which by definition barely compresses. In reality, using ZSTD, another compression algorithm, it ends up at approximately 3.8 GB. It’s important to know where each compression algorithm is strong; you don’t need to know the details, but you should have the intuition.

Wrapping up the math, the query should take approximately 1 second, and indeed it did. Of the 224 people who participated, fewer than 5% gave an answer between 0.5 and 1.5 seconds, and fewer than 10% even if I widen the range from 500 ms to 3 s.

But it doesn’t end there. If you know any SQL at all, you can guess that you don’t need to read all the data every time you run that query: you can aggregate as the data arrives. If you are thinking right now “I’ll set up a scheduler that runs every minute” or “I’ll use Spark whatever” or “Snowflake blah blah”, I’m sorry to tell you that technology has advanced over the last 3 decades and things can now be done in real time. Well, processing 1000 rows per second could already be done 30 years ago on any PC; it’s almost 30 years since the release of DOOM, friends (for the millennials, DOOM was our Fortnite).

So if you aggregate as the data arrives, you need to store about 12 days of history (at 1000 ev/s, roughly 100M events a day), which means a table of 12 rows with 3 columns: date, total sum, count. That’s about 144 bytes if you use 4 bytes per data type.
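The aggregate-as-it-arrives idea can be sketched with a tiny running state, one (sum, count) pair per day, so the final query reads a dozen rows instead of a billion. A minimal sketch with made-up data, not the contest table:

```python
from collections import defaultdict

# Running state: day -> [sum_of_speed, event_count]. With ~12 days of
# history, the "table" the query reads is about 12 rows, not a billion.
state = defaultdict(lambda: [0.0, 0])

def ingest(day, speed):
    s = state[day]
    s[0] += speed
    s[1] += 1

def avg_speed_per_day():
    return {day: s / c for day, (s, c) in sorted(state.items())}

for i in range(3000):                      # 1000 events/s across 3 "days"
    ingest(f"2023-09-{16 + i // 1000}", speed=float(i % 10))

print(avg_speed_per_day())  # 3 rows, regardless of how many events arrived
```

The final query then only has to divide sum by count per row, which is why it can come in well under a millisecond.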

With that you can guess that the query is almost certainly under 1 millisecond: the time it takes to parse the SQL and little more. If your system takes more than 1 ms to parse the query and fetch 120 bytes (~8 KB in ClickHouse), I hate to be the one to tell you, but there is something you need to rethink.

Only 8 people answered 1 ms or less. In fact the whole query took about 0.8 ms; Claudia, who answered 0.99, won :)

If you’ve made it this far, it’s because the topic interests you. I made a roughly 3-hour course explaining all of this in detail; you can watch it here.

Thanks for stopping by the Tinybird stand, it was a pleasure talking with you.