Not clear if the author realises that all commercial SQL database engines support querying of the database's metadata using SQL. Or maybe I have misunderstood - I only skimmed the article.
Yeah, this seemed like a very long way to say, "Our RDBMS has system catalogs," as if it's 1987.
But then, they're also doing JOINs with the USING clause, which seems like one of those things that everybody tries... until they hit one of the several reasons not to use them, and then they go back to the ON clause which is explicit and concrete and works great in all cases.
Personally, I'd like to hear more about the claims made about Snowflake IDs.
> doing JOINs with the USING clause
I'm ashamed to say that despite using SQL from the late 1980s, and as someone that likes reading manuals and text books, I'd never come across USING. Probably a bit late for me now to use it (or not) :-(
I didn't really write USING in anger until around 10 years ago, and I have been around a long time too.
Not all databases support it. But once you start using it (pun) - a lot of naming conventions snap into place.
It has some funky semantics you should be aware of. Consider this:
CREATE TABLE foo (x INT);
CREATE TABLE bar (x INT);
SELECT * FROM foo JOIN bar USING (x);
There is only one `x` in the above `SELECT *` - the automatically disambiguated one. Which is typically what you want.
I've used SQL for around a decade and also never came across it. I'm maintaining SQL code with hundreds if not thousands of basic primary key joins, and this could make those queries way more concise. Now I want to know the reasons for not using USING!
There are reasons for not USING.
First, you need to be aware of the implicit disambiguation. When you join with USING, you are introducing a hidden column that represents both sides. This is typically what you want - but it can bite you.
Consider this PostgreSQL example:
CREATE TABLE foo (x INT);
INSERT INTO foo VALUES (1);
CREATE TABLE bar (x FLOAT);
INSERT INTO bar VALUES (1);
SELECT pg_typeof(x) FROM foo JOIN bar USING (x);
The type of x is double - because x was implicitly upcast, as we can see with EXPLAIN:
Merge Join (cost=338.29..931.54 rows=28815 width=4)
  Merge Cond: (bar.x = ((foo.x)::double precision))
Arguably, you should never be joining on keys of different types. It's just bad design. But you don't always get that choice if someone else made the data model for you.
It also means that this actually works:
CREATE TABLE foo (x INT);
INSERT INTO foo VALUES (1);
CREATE TABLE bar (x INT);
INSERT INTO bar VALUES (1);
CREATE TABLE baz (x INT);
INSERT INTO baz VALUES (1);
SELECT *
FROM foo
JOIN bar USING (x)
JOIN baz USING (x);
Which might not be what you expected :-)
If you are both the data modeller and the query writer - I have not been able to come up with a reason for not USING.
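To make the surprise concrete, here is a minimal sketch of what the chained USING is roughly equivalent to for inner joins - the second USING (x) matches baz against the already-merged x column:

```sql
-- Roughly equivalent to the chained USING above (inner joins only);
-- with USING, SELECT * shows a single x column instead of three.
SELECT *
FROM foo
JOIN bar ON bar.x = foo.x
JOIN baz ON baz.x = foo.x;   -- baz joins against the merged x, i.e. foo.x (= bar.x)
```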
Thanks for the reply. The use case I have in mind is joining onto an INT primary key using a foreign key column of another table. This alone would remove a massive amount of boilerplate code.
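A minimal sketch of that use case (the table and column names here are made up for illustration) - USING only works when the foreign key column shares the referenced key's name, which is exactly the naming convention that makes this pay off:

```sql
-- Hypothetical schema: the FK column is named like the key it references.
CREATE TABLE customer (customer_id INT PRIMARY KEY, name TEXT);
CREATE TABLE orders   (order_id INT PRIMARY KEY, customer_id INT REFERENCES customer);

-- Explicit ON clause:
SELECT o.order_id, c.name
FROM orders o JOIN customer c ON c.customer_id = o.customer_id;

-- Same join with USING; customer_id appears only once in SELECT *:
SELECT o.order_id, c.name
FROM orders o JOIN customer c USING (customer_id);
```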
@da_chicken: You can read more about Snowflake ID in the Wiki page linked in the article.
The short story:
They are a bit like UUIDs in that you can generate them across a system in a distributed way without coordination. Unlike UUIDs, they are only 64 bits.
The first bits of the snowflake ID are structured in such a way that the values end up roughly sequentially ordered on disk. That makes them great for large tables where you need to locate specific values (such as those that store query information).
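For illustration, a sketch of the classic 64-bit layout (the bit widths here follow the original Twitter scheme - an assumption, not any particular vendor's format); the timestamp occupying the high bits is what makes the IDs roughly time-ordered:

```sql
-- Compose a Snowflake-style ID: 41 bits of milliseconds since a custom epoch,
-- 10 bits of machine/worker id, 12 bits of per-millisecond sequence number.
SELECT (ms_since_epoch << 22) | (machine_id << 12) | seq AS snowflake_id
FROM (VALUES (123456789012::BIGINT, 7::BIGINT, 42::BIGINT))
     AS t(ms_since_epoch, machine_id, seq);
```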
I don't think it's as easy to do the example in the article just by using information_schema.
> Which tables have a column with the name country where that column has more than two different values
But on their product page, the definition of floesql left me puzzled
> It uses intelligent caching and LLVM-based vectorized execution to deliver the query execution speed your business users expect.
> With its powerful query planner, FloeSQL executes queries with lots of joins and complicated SQL syntax without breaking your budget.
INFORMATION_SCHEMA is a good start, but it does not get you to full metadata flexibility. The columns you need just aren't there. It is good to have a standard for the metadata - but the standard isn't ambitious enough (a point I also make in the blog, and as you observe, the sample query isn't possible with INFORMATION_SCHEMA alone).
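To make that concrete, here is a hypothetical sketch of the article's sample question - `sys.column_statistics` and its columns are invented for illustration, not our actual catalog; the point is that the distinct-value side of the join has no INFORMATION_SCHEMA equivalent:

```sql
-- Which tables have a column named country with more than two distinct values?
SELECT c.table_schema, c.table_name
FROM information_schema.columns AS c
JOIN sys.column_statistics AS s        -- assumed view exposing distinct-value counts
  USING (table_schema, table_name, column_name)
WHERE c.column_name = 'country'
  AND s.distinct_values > 2;
```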
The Floe engine is a full database on top of Iceberg and Delta storage. The system views are just the tip of the iceberg. We will be blogging more about what we are building.
Good - execution planning for major DBMSs hasn't received any groundbreaking evolution because it can be considered a "solved" problem, but I'm always curious about new ways to address it.
On a different note, AFAIR PostgreSQL does most schema changes transactionally, but MySQL does not.
I'd be too terrified to change the schema directly via SQL on the metadata tables even if the engine allowed it, transactional or not.
Isn't this like it is in many relational databases - you can query them about the tables in them?
The key difference is that it's not just about schema metadata (tables, indexes, views, columns, etc...). PostgreSQL is fabulous regarding this. Even native types are part of the catalog (pg_catalog).
Things are great in your DB... until they aren't. The post is about making observability a first-class citizen. Plans and query execution statistics, for example, queryable using a uniform interface (SQL) without the need to install DB extensions.
Thank you and yes!
By making the entire architecture of the database visible via system objects - you allow the user to form a mental model of how the database itself works. Instead of it being just a magic box that runs queries - it becomes a fully instrumented data model of itself.
Now, you could say: "The database should just work" and perhaps claim that it is a design error when it doesn't. Why do I need instrumentation at this level?
To that I say: every database ever made makes query planning mistakes or has places where it misbehaves. That's just the way this field works - because data is fiendishly complicated, particularly at high concurrency or when there is a lot of it. The solution isn't (just) to keep improving and fixing edge cases - it is to make those edge cases easy to detect for all users.
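As a hedged sketch of what "easy to detect" could look like (`sys.plan_operators` and its columns are assumptions for illustration, not our actual catalog): find operators whose row estimates were off by two orders of magnitude, the classic fingerprint of a planning mistake.

```sql
-- Estimates off by 100x in either direction usually mean a bad plan worth a look.
SELECT query_id, operator_type, estimated_rows, actual_rows
FROM sys.plan_operators                -- hypothetical per-operator statistics view
WHERE actual_rows > estimated_rows * 100
   OR estimated_rows > actual_rows * 100;
```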
This sounds conceptually similar to performance_schema [1] in MySQL or MariaDB, which is a built-in feature originally introduced in MySQL 5.5 (2010). Or perhaps the easier-to-use sys schema [2], which wraps performance_schema among other things, introduced in MySQL 5.7 (2015).
It's great to have that observability functionality, but I don't really understand the purpose of writing a new DBMS from scratch to add this though. Why not get something merged into Postgres core?
[1] https://dev.mysql.com/doc/refman/8.4/en/performance-schema.h...
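For reference, this is the kind of aggregated question performance_schema already answers out of the box (per statement digest, similar in spirit to pg_stat_statements):

```sql
-- Top statements by total time, straight from performance_schema:
SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```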
Merging into PostgreSQL core something that needs to run on top of petabytes of data in the cloud, on Iceberg, with an advanced query planner and a high-speed SIMD engine... AND trying to squeeze into the "pg_" naming mess?
I don't think so...
And yes, it's conceptually similar to MySQL, and also conceptually similar to SQL Server's implementation from 1997. That's by design. Obviously, we are not writing a new DBMS from scratch just to add system objects.
Have a look at some of the other blogs on that site to see what we are up to. Basically, we want to give you an experience that resembles the instrumentation you got used to from on-premise databases, but one that can run on top of Iceberg in the cloud.
> something that needs to run on top of a Petabytes of data in the cloud, on Iceberg, with an advanced query planner and a high speed SIMD engine
Part of my confusion was that this blog post makes no mention whatsoever of any of those things!
It gave me the (incorrect) impression that this observability functionality was the purpose of the product. And it is worded in a way which makes no mention of prior art in built-in DBMS observability.
Looking at the other threads here, I don't think I'm the only one who was confused about that. A couple intro paragraphs to the product might help a lot.
That's fair feedback and I shall take that into account in future blogs.
Thanks for letting me know - you can stare yourself blind on that stuff
I think the key difference is making that metadata first-class and queryable across the whole system (lineage, stats, access patterns), not just information_schema / catalog tables. Most RDBMSs expose schema metadata, but not things like which queries produced which rows, freshness, or cost/latency signals unless you bolt it on with tracing. Curious if Floe is treating metadata as data (versioned, joinable) or as observability sidecars?
> which queries produced which rows
I doubt many real-world applications could tolerate the amount of data/performance degradation this implies. If you need this (and I can't think why you would), then I think writing your own logging code is the answer, rather than lumbering everyone else with it.
To clarify a point: we store the row count of each operator in the query - not the actual rows (that would indeed be madness!). Though we DO have tracing you can control that allows you to enable row capture for very specific diagnostics - but that stream is massive and you need to opt in for it.
With that clarified, the logging is not as large as you might think (see my other response).
Think of web servers - they routinely store much larger log streams than this with metadata about each hit. You rely on that stream to do various forms of web analytics - yet you would not dream of rolling your own - nor worry about the small overhead you are already paying.
Example:
SELECT x, y FROM foo JOIN bar USING (k)
In a query like this, you will have these operators:
SCAN of foo
SCAN of bar
JOIN (on k between foo and bar)
That's 3 operators, and we log the row counts of each. That in turn allows you to answer questions about your data model and how it is being used. Hope that clarifies the point.
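As a hedged sketch of the kind of question this enables (`sys.plan_operators` and its columns are hypothetical here, purely for illustration):

```sql
-- Which scans of foo returned the most rows over the last day?
SELECT query_id, actual_rows
FROM sys.plan_operators                -- hypothetical per-operator statistics view
WHERE relation_name = 'foo'
  AND operator_type = 'SCAN'
  AND started_at > now() - INTERVAL '1 day'
ORDER BY actual_rows DESC
LIMIT 20;
```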
If they did it efficiently, then they might have a game changer.
Well, you can do it as efficiently as you like, but if the data is to be historically accurate on a writeable table, then if we are recording
SELECT * FROM foobar
it is likely to require a copy of foobar at the time of the query, or doing amazing things with the transaction log. I agree this would be a programming tour de force, but I question its utility.
Punishing someone for returning all rows in a large table is a feature, not a bug. ;-)
Having all state preserved at any given point in time is also useful in itself. It costs almost nothing if the table doesn’t change, and continuous point-in-time query capability can be extremely useful.
Access audits, when important, are very important. I’d be fine with the system blocking access after someone tries to return all lines in a table, even if only for the reason it might be someone trying to export data without a good reason.
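For what continuous point-in-time querying looks like in practice, here is the SQL:2011 temporal form (engines with system-versioned or time-travel tables support this under various spellings; the table name is just the example from upthread):

```sql
-- Read foobar as it existed at a given instant, without touching current data:
SELECT *
FROM foobar FOR SYSTEM_TIME AS OF TIMESTAMP '2024-01-01 00:00:00';
```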
That's quite expensive. Most systems that need this sort of data will instead implement some form of audit log or audit table. Which is still quite expensive.
At the record level, I've seldom seen more than an add timestamp, an add user id, a last-change timestamp, and a last-change user id. Even then, it covers any change to the whole row, not every field. It's still relatively expensive.
> Which is still quite expensive.
OTOH, if they managed to do that in an efficient way, they have something really interesting.
Writing to disk is never free.
True, but you do it with blocks that often contain padding. If you can make the padding useful, that's a win.
Hi All
Original author here (no, I am not an LLM).
First, a clarifying point on INFORMATION_SCHEMA. In the post I make it clear that this interface is supported by pretty much every database since the 1980s. Most tools would not exist without them. When you write an article like this - you are trying to hit a broad audience and not everyone knows that there are standards for this.
But, our design goes further and treats all metadata as data. It's joinable, persisted and acts, in every way, like all other data. Of course, some data we cannot allow you to delete - such as that in `sys.session_log` - because it is also an audit trail.
Consider, by contrast, PostgreSQL's `pg_stat_statements`. This is an aggregated, in-memory summary of recent statements. You can get the high-level view, but you cannot get every statement run and how that particular statement deviated from statements like it. You also cannot get the query plan for a statement that ran last week.
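For contrast, this is roughly all you can ask of pg_stat_statements - aggregates per normalised statement, with no per-execution history and no historical plans (column names as of PostgreSQL 13+):

```sql
-- Top statements by cumulative execution time, aggregated in memory:
SELECT query, calls, total_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```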
To address the obvious question: "Isn't that very expensive to store?"
Not really. Consider a pretty aggressive analytical system (not OLTP) - you get perhaps 1000 queries/sec. The query text is normalised and so is the plan - so the actual query data (runtimes, usernames, skewness, stats about various operators) is on the order of a few hundred bytes. Even on a heavily used system, we are talking double-digit GB per day - on cheap Object Storage. Your company web servers store orders of magnitude more data than that in their logs.
With a bit of data rotation - you can keep the aggregate sizes manageable over time.
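For a rough sense of scale (assuming ~300 bytes per query record): 1,000 queries/sec × 300 bytes ≈ 300 KB/sec, which works out to roughly 25 GB per day before compression - consistent with the double-digit GB figure above.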
What stats do we store about queries?
- Rows in each node (count, not the actual row data, as that would be a PII problem)
- Various runtimes
- Metadata about who, when and where (ex: cluster location)
Again, these are tiny amounts of data in the grand schema of things. But somehow our industry accepts that our web servers store all that - but our open source databases don't (this level of detail is not controversial in the old school databases by the way).
Of course, we can go further than just measuring the query plan.
Performance profiling of workers is a concept you can talk about - so it is also metadata. Let us say you want to really understand what is going on inside a node in a cluster.
You can do this:
```sql
SELECT stack_frame, samples
FROM sys.node_trace
WHERE node_id = 42
```
Which returns a 10 second sample (via `perf`) of the process running on one of the cluster nodes.
(Obviously, that data is ephemeral - we are good at making things fast, but we can't make tracing completely free)
Happy to answer all questions
"SQL everywhere" is decidedly dystopian. That being said, creating a standard for database introspection could be powerful for agents.