javisantana.com

23 Million


To put it another way, more than 3,000 million pesetas (I had to double-check the figure) is the bet some investors are placing on an "almost Spanish" product. For a bit more background on this whole venture, read Miguel Arias's post, which explains it quite well.

The figure is scary, just like the 8 million from last year's series A round. Most people congratulate you; I think they believe those millions mean you become rich overnight and that the money is there to raise your salary and buy silly stuff for the office. Nothing could be further from the truth: those millions mean freedom, yes, but also an enormous amount of pressure and a lot of work to spend them properly. In fact, the celebration when we found out was a couple of Aquarius that Miguel Arias and I had at the bar next to the office; we probably couldn't have swallowed anything else, we had some kind of lump in our throats. The rest of the CartoDB exec team was scattered around the world getting the job done…

You can more or less imagine what that money means for a company like CartoDB. I won't go into details, but broadly it lets you move up to the first division (in the relegation spots, mind you): you can think big, replace technologies everyone assumes you can't change or can't afford, and reach people you otherwise couldn't. A year ago we stopped driving a car down a dirt track; now it's time to lay the rails for the locomotive.

But I came here to talk about my own book: what this story means on a personal level. Last year I went from being a developer to being the CTO. You go from worrying about your code to managing people, organizing the work, handling paperwork that makes little sense at the time, solving problems that shouldn't land on you, and, in your spare moments, doing the things you actually should be doing (because who knows what a CTO is supposed to do?). It has been a really hard year; not a single day has gone by without thinking "what the hell are you doing, and why aren't you programming?". You always feel you're falling short, that there's always a more urgent problem to solve. I think that feeling of falling short is what happens when you work with people of the caliber we have on the team.

As the months go by you realize that all those ideas you have about where to take the technology can't be done alone: you need a team that knows what's going on in each area, you need time to validate the happy ideas and, above all, you need to believe, and make others believe, that you can do it. A year ago, thinking we could send a patch to Postgres (and have it accepted), run our own build of Varnish in production, or handle 100k QPS seemed like a utopia; now those things are real. Now we know how to build a team, how to design work processes that scale, how to get people interested in the world of maps, how to negotiate so you can do what you really want. It has been hard and it still is; nobody said it would be easy.

So I still don't know what a CTO does, and honestly I don't care, but I know exactly what I want to do and I like it a lot.

It's a bit like driving: at first you worry about how you use each of the car's controls, but then you realize that what really makes you good at it is looking far ahead, as far as possible. Now there's no road left to look at, so we'll have to build it (with all these absolute legends).

Local Search

I was talking with some friends about a geo problem they have:

Given a set of elements, each with a position and a name, stored in a database, return the N elements closest to a certain point whose name matches some pattern.

So imagine you have the openstreetmap database and want to find the 300 banks and bars closest to Madrid city center (a pretty interesting combination, I'd say; in Spain the ATMs are inside the banks).

So in order to test it I loaded the Madrid OSM extract into a postgres database. I just downloaded the data from the geofabrik site and imported it with the osm2pgsql tool; pretty straightforward.

Then I created some indices for the way and amenity columns (using the full text search machinery):

-- gist index for the geometry, GIN full text search index for the text
create index on planet_osm_point using gist (way);
create index on planet_osm_point using gin (to_tsvector('spanish', amenity));

First try, use Nearest Neighbour search

Since postgis 2.0 we have Nearest Neighbour search (thanks to CartoDB, which funded it), which allows the geospatial index to be used to sort results (read this blogpost on the boundless blog), so the first try was to order the results of the text filter by the distance operator <->.

select way, amenity from planet_osm_point
where to_tsvector('spanish', amenity) @@ to_tsquery('ba:*')
order by way <-> 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry
limit 301

This takes around 48ms (everything cached). Looking at the explain analyze output out of curiosity, I realized the spatial index wasn't being used:

Limit  (cost=23947.86..23948.61 rows=301 width=41) (actual time=48.729..48.780 rows=301 loops=1)
   ->  Sort  (cost=23947.86..24035.30 rows=34976 width=41) (actual time=48.727..48.755 rows=301 loops=1)
         Sort Key: ((way <-> '010100002031BF0D00F8DAD368D52F19C1C324815603CB5241'::geometry))
         Sort Method: top-N heapsort  Memory: 48kB
         ->  Bitmap Heap Scan on planet_osm_point  (cost=1363.07..22333.09 rows=34976 width=41) (actual time=12.120..41.008 rows=17382 loops=1)
               Recheck Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
               ->  Bitmap Index Scan on planet_osm_point_to_tsvector_idx1  (cost=0.00..1354.32 rows=34976 width=0) (actual time=10.889..10.889 rows=17382 loops=1)
                     Index Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
 Total runtime: 48.877 ms

I don't fully understand how the postgres planner works, but it looks like it could do a BitmapAnd over both indices. Increasing the limit does not change anything; I thought it could change the selectivity. I tried a search by distance:

SELECT way, amenity FROM planet_osm_point 
WHERE
to_tsvector('spanish', amenity) @@ to_tsquery('ba:*') 
AND
st_dwithin(way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, 500)

and I get the desired BitmapAnd:

 ...
 ->  BitmapAnd  (cost=1422.90..1422.90 rows=42 width=0) (actual time=14.881..14.881 rows=0 loops=1)
         ->  Bitmap Index Scan on planet_osm_point_index  (cost=0.00..68.32 rows=2121 width=0) (actual time=2.229..2.229 rows=6028 loops=1)
               Index Cond: (way && '010300002031BF0D000100000005000000F8DAD368154F19C1C32481560FC95241F8DAD368154F19C1C3248156F7CC5241F8DAD368951019C1C3248156F7CC5241F8DAD368951019C1C32481560FC95241F8DAD368154F19C1C32481560FC95241'::geometry)
         ->  Bitmap Index Scan on planet_osm_point_to_tsvector_idx1  (cost=0.00..1354.32 rows=34976 width=0) (actual time=12.607..12.607 rows=17382 loops=1)
 ...

Second try, use a recursive query

But that's not the result I was looking for: I need the 300 bars and banks closest to my location, no matter if they are 30km away.

So I thought about having a kind of python generator: a query that returns results on demand until I have enough.

Reading the postgres documentation there is a nice trick in the CTE section: you can write a recursive query without a stop condition that stops when the outer query reaches its limit.

WITH RECURSIVE t(osm_id, way, amenity, distance) AS (
  SELECT osm_id, way, amenity, 4800.0 as distance from planet_osm_point 
        WHERE
    to_tsvector('spanish', amenity) @@ to_tsquery('ba:*')
      AND 
    st_dwithin(way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, 4800)
  UNION ALL
    -- select only one row from the previous iteration to know the distance
    SELECT p.osm_id, p.way, p.amenity, prev.distance * 2 as distance 
    FROM planet_osm_point p , (SELECT distance FROM t LIMIT 1) prev
       WHERE 
    to_tsvector('spanish', p.amenity) @@ to_tsquery('ba:*')
      AND 
    st_dwithin(p.way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, prev.distance * 2)
      AND not
    st_dwithin(p.way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, prev.distance)
),
results as  (
    select *, st_distance(way,'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry) as real_dist FROM t limit 300
)
SELECT * FROM results order by real_dist

The time for this query is around 48ms, so no improvement at all. But that depends on the number of iterations needed to fetch all the results: starting with 300 meters takes 4 loops, since the last bar is 2393 meters away from the city center.

If it starts the iteration with a bigger radius, like 1500 (2 iterations), the query time is 25ms. If it only needs the first iteration, it's 18ms, which is much better.

So how do we know a good starting value? Hard to say without some density stats… luckily postgres keeps pretty good statistics about indices and there are good ways to access them: EXPLAIN and _postgis_selectivity.

spain=# explain select 1 FROM planet_osm_point where   to_tsvector('spanish', amenity) @@ to_tsquery('ba:*')       ;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on planet_osm_point  (cost=1363.07..22245.65 rows=34976 width=0)
   Recheck Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
   ->  Bitmap Index Scan on planet_osm_point_to_tsvector_idx1  (cost=0.00..1354.32 rows=34976 width=0)
         Index Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
(4 rows)

Forget everything but rows=34976: that's the estimate of the total number of bars and banks in the whole table (the real count is 17k, but let's see if it's good enough).

In order to know how many points there are in a certain area we can use the _postgis_selectivity function. It's kind of hidden; it's what the postgres planner actually uses for its stats.

select _postgis_selectivity('planet_osm_point', 'way', st_expand('SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, 5000))
 _postgis_selectivity
----------------------
  0.00751143626570702

(you can get the same value using EXPLAIN but I like _postgis_selectivity)

So knowing the selectivity (the fraction of total rows selected by that bbox), the proportion of bars and banks, and the total row count (1.8M), we can estimate:

total_rows * category_percentage * selectivity
1748797 * 0.02 * 0.00751143626570702 ~ 262.0

So in that 10km box we have an estimate of 262 points; to get 300 points:

density = 262/(10000*10000) // points/m^2
area_to_300 = 300/density
R = sqrt(area_to_300/(2*PI))
R -> 4268.93996744354m
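That arithmetic can be double-checked with a plain SELECT in postgres (no tables involved):

```sql
-- 262 points estimated inside a 10km x 10km box; solve for the radius
-- whose circle (using the 2*PI area from above) should hold ~300 points
select sqrt((300 / (262 / (10000.0 * 10000.0))) / (2 * pi())) as radius_m;
-- radius_m ≈ 4268.94, matching the value above
```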

So using a radius of 4200 meters the query takes ~18ms, as it does a single iteration. It's important to say that calculating the stats is almost free; it takes less than 1ms (except the total count, which could be precalculated).

Another nice thing about the recursive query is that it can be paginated: fetch 100 results, keep the distance of the last one, and next time use that distance to start iterating.
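A sketch of that pagination, assuming the client sends back the distance of the last row it received (the :last_dist and :pt placeholders below are hypothetical bind parameters, not part of the original query): the seed of the recursive CTE simply skips everything within that distance.

```sql
WITH RECURSIVE t(osm_id, way, amenity, distance) AS (
    -- seed: first ring beyond the last distance already returned
    SELECT osm_id, way, amenity, :last_dist + 4800.0 AS distance
    FROM planet_osm_point
    WHERE to_tsvector('spanish', amenity) @@ to_tsquery('ba:*')
      AND st_dwithin(way, :pt, :last_dist + 4800.0)
      AND NOT st_dwithin(way, :pt, :last_dist)
  UNION ALL
    -- keep doubling the ring, exactly as in the non-paginated version
    SELECT p.osm_id, p.way, p.amenity, prev.distance * 2 AS distance
    FROM planet_osm_point p, (SELECT distance FROM t LIMIT 1) prev
    WHERE to_tsvector('spanish', p.amenity) @@ to_tsquery('ba:*')
      AND st_dwithin(p.way, :pt, prev.distance * 2)
      AND NOT st_dwithin(p.way, :pt, prev.distance)
),
results AS (
    SELECT *, st_distance(way, :pt) AS real_dist FROM t LIMIT 100
)
SELECT * FROM results ORDER BY real_dist;
```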

Maybe this case is too simple or has too few points, but it's the best approach I found. Any ideas?

Vector Tiles


Last week we had the n-th GIS drama about how Mapbox Vector Tiles should be called. I'm actually thinking of creating the GIS version of rubydramas (it's gone; it looks like the ruby community has moved to node.js, sorry, io.js). This post is not about naming; we all know standards with company names in the description are never used, look at those ESRI shapefiles…

The objective of vector tiles (whatever format you use) is to move the data closer to the rendering stage, so it's projected, clipped, transformed, stripped of extra precision, filtered with naive filters, encoded and finally compressed. You should take a look at Michal Migurski's blogpost or Dane Springmeyer's foss4g talk to understand the rationale behind them.

CartoDB is a platform that runs on top of postgis. It renders tiles fetching the data directly with a spatial query, so we pay the price of all the overhead of going to the database, fetching the data, preparing it for rendering and so on. That's why at some point someone thought about removing the dynamic part of the equation. But on the other hand we have all the power of postgres and postgis (you know, spatial indices, geometry functions and so on).

So would it be possible to easily generate a vector tile from postgres? Of course it is; you can do almost everything in postgres right now with an extension, but I'm one of those people who likes to do the obvious thing. It sounds easy; basically we need to:

  1. Get the geometry for a given tile with CDB_XYZ_Extent. PostGIS makes this pretty fast if you have an index on the column.
  2. Remove extra precision. A tile is usually 256x256 pixels, so you don't need 6 decimals of precision. This also makes a lot of points fall on the same pixel, so we can remove them: ST_SnapToGrid to the rescue. It's useful to know the resolution of the zoom level you are generating, so CDB_XYZ_Resolution is handy.
  3. We don't need geometry outside the tile, so ST_ClipByBox2D can remove it. This function is only present in postgis 2.2 (currently in development); you can use the slower ST_Intersection instead.
  4. Finally, change the coordinate system so coordinates are within the 0-255 range; ST_Affine makes the algebra easy.

So here it is, given a query that returns a resultset with cartodb_id and the_geom_webmercator (a geometry column in 3857):
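A minimal sketch of such a query, following steps 1-4 above. CDB_XYZ_Extent and CDB_XYZ_Resolution come from the cartodb-postgresql extension; everything else (the wrapping, the exact affine parameters) is an assumption, not the original snippet:

```sql
-- tile 0/0/0 of the source query; ext is the tile bbox, res the pixel size
WITH par AS (
  SELECT CDB_XYZ_Extent(0, 0, 0) AS ext,
         CDB_XYZ_Resolution(0) AS res
)
SELECT cartodb_id,
       ST_Affine(                                           -- 4) map to 0-255
         ST_ClipByBox2D(                                    -- 3) clip to the tile
           ST_Simplify(
             ST_SnapToGrid(the_geom_webmercator, res/20),   -- 2) drop precision
             res/20),
           ext::box2d),
         1/res, 0, 0, -1/res,                               -- scale, flip y
         -ST_XMin(ext)/res, ST_YMax(ext)/res) AS the_geom
FROM (SELECT cartodb_id, the_geom_webmercator
      FROM planet) _q, par
WHERE the_geom_webmercator && ext                           -- 1) use the index
```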

Notice this query does not manage buffer size, overzooming and so on; that's pretty easy to add though. Also, there is a res/20 that needs an extra explanation: if we used half a pixel for the snapping, we'd soon notice that some polygons and lines get removed too early, and that 20 fixes it. I have to say that value was calculated by hand and there is no math behind it; why spend hours thinking when a simple binary search fixes the thing… The geometry is also simplified after snapping (be sure you do it after snapping: the simplify algorithm's complexity is higher than the snapping's).

does this work?

let’s try with an extract of OSM planet where the geometry is about 350Mb

the CartoDB Vector Tile (did I say I'm pretty good at naming?) is 44Mb (3.8Mb gzip compressed), so not that bad.

But we still didn't do anything special with the geometry encoding: we are using WKB to store everything. Remember that WKB uses 16 bytes per coordinate in a geometry. Mapbox vector tiles use varint encoding of delta values to make this smaller. I personally don't like varint to encode numbers; it's better to let the compressor do its work and not try to be smart playing with bits. But ok, in postgis we have a way to delta-varint all the things: it's called TWKB:

copy (select st_astwkbagg(geom, 0, id) from cdb_tile(0, 0, 0, 'select id as cartodb_id, the_geom as the_geom_webmercator from planet')) TO '/tmp/tile2.cvt';

The result is a 9.8Mb tile (1.8Mb gzip compressed). Much better, and it took about the same time to encode. This also works much better with polygon/line tables than with mixed types, especially when there are a lot of points, like in this case.

There are a lot of things left: for example, when to use clipping, snapping and simplification (sometimes it's better to send every single geometry than to cut), coincident points, attribute optimization (I didn't talk about attributes here because with postgres it's pretty clear how to do it).

Real Engineers

real engineering

I was watching a wonderful documentary about turbo engines in Formula 1 and realized how similar all areas of engineering are, even software. The documentary is a couple of hours long (it comes in two parts) and basically covers the design of a turbo engine from scratch, with special emphasis on the use of computers as something novel at the time.

Puedes verlo con una palomitas en youtube, incluso si no te interesa para nada la fórmula 1 ni los motores es didáctico y ameno. De hecho me ha asombrado la calidad como documental: riguroso, parándose en los detalles, sin música ni efectos para tener entretenido al personal, sin repetir 200 veces lo mismo antes de hacer una parada de publicidad ni entrevistas a expertos que solo dicen obviedades. En pocas palabras, no tratando como un subnormal al que lo ve.

And best of all, they didn't mind spending more than half an hour of the documentary showing how those people actually worked in their offices. You don't have to search much to find embarrassing videos about how an engineer is supposed to work nowadays. Wait, don't search; I'll link you one from twitter.

Because when the technological challenges arrive, the reality of the work is what that documentary shows. They screw up again and again. From the beginning it shows how they advance little by little, how they run into problems, how they analyze and solve them. Hell, they even think.

I love moments like the following, which I have lived through a few times over my professional life:

At minute 14 they explain how, instead of wasting time designing the final system, they use a temporary solution based on an external compressor instead of a turbo to speed up testing (sound familiar?). Note the guy's comment at minute 16 when the engine blows up: "we had an accident" (now that's proper humor). Later, at minute 20, you can see the face of total despair.

From minute 48 on they test the new engine they decided to design from scratch. The engine won't start; they go around in circles, analyze it, and in the end they get it running. Note the process they follow and the reasoning they use to solve it (minute 7 of the second video).

Once they have the engine mounted in the car doing track tests, they try shutting the engine down from the steering wheel and it doesn't work (minute 31, second video). A small, stupid detail that ruins the whole test.

The documentary is full of computing-related moments that make us laugh now but were cutting-edge technology back then.

Now I think about how they want to portray the work those of us who develop software do, and I much prefer how it is in reality, much closer to what that video shows. I want to believe that when this business is a few years older they will see us as professionals and not as a bunch of idiots, all cut from the same cloth, banging on keys without thinking.

Non Mercator Tiles

working with non mercator projections in cartodb

CartoDB provides a way to render tiles in non mercator projections. It's not something we thought about when our Maps API was designed, but it's possible and works pretty well.

CartoDB geometry rendering intro

CartoDB stores data in postgres tables, using postgis for the geospatial data. I will not explain this; if you are reading this it's because you already know how it works. Tables in CartoDB are regular tables with some special column names like cartodb_id, the_geom and the_geom_webmercator.

In order to render map tiles, CartoDB relies on a naming convention: the Maps API uses the the_geom_webmercator column by default as the geometry source, so when a geometry needs to be rendered it has to be exposed under that name. The API also expects that column to be projected; no particular SRID is required, although web mercator is what it was designed for.

client side code

With that in mind we can tweak the Maps API by sending a geometry projected in something other than web mercator, so we can create this cartodb.js example:

cartodb.createLayer(map, {
  user_name: 'dev',
  sublayers: [{
    sql: 'select st_transform(the_geom, 32661) as the_geom_webmercator from tm_world_borders_s_11',
    cartocss: '#layer { polygon-fill: #F00; polygon-opacity: 0.3; line-color: #F00; }'
  }]
})
.addTo(map)

That renders a map projected in SRID 32661; that's it, it works. The problem is that leaflet still works in mercator. Luckily it provides a way to change the projection via the crs option on the map.

So, applying the same technique Gregor Aisch describes in this post, you can create a CRS that maps coordinates from the required projection to mercator (and the other way around), so you can use leaflet as you normally do but with a different projection:

var map = new L.Map('map', {
  center: center,
  zoom: 2,
  crs: cartodb.proj('+proj=stere +lat_0=90 +lat_ts=90 +lon_0=0 +k=0.994 +x_0=2000000 +y_0=2000000 +datum=WGS84 +units=m +no_defs', '32661')
});
cartodb.createLayer(map, {
...

it uses a small library I created for this purpose, cartodb.proj

interaction

CartoDB not only renders png tiles; it also generates utf grid tiles to provide interaction, and this also works as expected. In the following example an infowindow with some data about the countries is included.

cartodb.createLayer(map, {
  user_name: 'dev',
  type: 'cartodb',
  sublayers: [{
     sql: 'select area, iso2, st_transform(the_geom, 32661) as the_geom_webmercator from tm_world_borders_s_11 where st_y(st_centroid(the_geom)) > 0',
     cartocss: '#layer { polygon-fill: #F00; polygon-opacity: 0.3; line-color: #F00; }',
     interactivity: 'area, iso2'
  }]
})
.addTo(map)
.done(function(layer) {
    var sub = layer.getSubLayer(0)
    cdb.vis.Vis.addInfowindow(map, sub, ['area', 'iso2'])
    sub.on('featureOver', function(e, ll, pos, data) {
      console.log(data);
    })
});

Notice nothing special needs to be added, just the cartodb.js code that would work with mercator.

vector features

Leaflet provides a bunch of methods to work with vector features and, guess what, they work as expected, so you can use the CartoDB SQL API to fetch geometry and render it as GeoJSON layers.

cartodb.SQL({ user: 'dev', format: 'geojson' })
  .execute('select the_geom from tm_world_borders_s_11 where iso2 = \'ES\'')
  .done(function(data) {
    L.geoJson(data).addTo(map);
  });

L.marker(center).addTo(map);

in this case the projection is done client side; notice the example fetches 4326 geometry from CartoDB

the working map

some problems

Some projections don't work well depending on the zone you are showing, and that's something you should expect: most projections are meant to work well only in a concrete region. For example, a projection designed to show Antarctica is not going to work well showing Norway; it will be rendered with distortion or not rendered at all because the projection fails.

In PostGIS some projections fail when you try to transform coordinates outside their valid bounds:

db=# select st_transform(st_setsrid(st_makepoint(0, -181), 4326), 2857);
ERROR:  transform: couldn't project point (0 -181 0): latitude or longitude exceeded limits (-14)

in order to make it work we can clip the geometry first with the PostGIS function ST_Intersection, like:

SELECT ST_Intersection(the_geom, ST_MakeEnvelope(-180, 0, 180, 90, 4326))

in this case we get the geometry on the north side of the earth.

extra ball

A map I created with CartoDB with a dataset about Antarctica