Eric Kavanagh

All right, ladies and gentlemen, we're back here at Current 2022, the Kafka Summit, the Confluent event in Austin, Texas. Lots of interviews yesterday, and a few today, including a really good one coming right now. We've got Daan Gerits. He is the Chief Data Officer, with a twist, for KOR Financial. I like twists, Daan. So tell me a bit about your job, and what makes it different from other Chief Data Officer roles?

Daan Gerits

Oh, yeah. So when you think about a Chief Data Officer, you think of data, data security, data governance, and everything that comes with that. We are a very young startup, but we have a very bold vision, and the vision is that we want to take a full streaming approach. But it's next level. One thing that is common in every organization is that you have action and you have reaction, and we've known this for a long time. So why are we so obsessed with tracking state? All of our organizations, all of the ways that we have been working for the past 30 to 40 years even, are focused on state and materializing that state. But state is a consequence. It's not the real reason why something happens.

Eric Kavanagh

Oh, interesting. So you're dealing with the realities of information systems as they had been designed, right? Which is to get data, persist data, and then use that data as needed, and track state as a way of understanding where we are in a particular process. But I think I see where you're going with this, which is that if you can know right now, and have a view into actual activity happening all the time, that's a lot better than worrying about persisting and then managing state and then extracting as needed and so forth. Is that right?

Daan Gerits

Yeah, that's indeed a big part of it. The other part is that state is something that is up for interpretation. It always has been, right? Many organizations have done multiple data management projects and things like that to get a common understanding, like what a customer is for the business. But it doesn't really work. The reason it doesn't work well is that it means something different depending on who you're talking to in your organization. So state is always up for interpretation, but the behavior isn't. The fact that a new client has been created, the fact that a trade happened, the fact that a sale happened, that is something that happened; it's a fact. So the only thing you do is interpret these facts and come up with a consequence. What is the consequence of all these things happening in my organization?

Now, the nice thing about it is that we have a way of figuring out not what the data looks like or what the result is, but how we got there. This is something that, for example, data scientists have worked on quite hard, because we said, “we have Hadoop and storage is cheap, so let's just dump everything in a data lake and allow for schema-on-read, and everything is fine, right?” But in reality, it becomes a very big dump of data and nobody has any idea of what is still in there. So we paid people, again, to understand how we actually came to the data that is in there. So yeah, it's different, it's a different twist. I went back through presentations I did in the past about big data, and even in 2012 we were already talking about focusing on events: we don't have to focus on the state because we can recreate the state once we have the original events. So why not do it like that? Why are we still so obsessed with keeping track of that state and trying to materialize it and make it available? We can actually make it very specific, depending on the use case.
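
The idea Daan describes here, recreating state from the original events rather than storing it, is the core of event sourcing. A minimal sketch in Python (the event names and the `Account` shape are purely illustrative, not anything from KOR Financial's system) looks like this:

```python
# Minimal event-sourcing sketch: state is never stored directly;
# it is derived on demand by replaying the immutable event log.
# Event types and fields are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Account:
    balance: int = 0

def apply(state: Account, event: dict) -> Account:
    """Interpret one fact and fold it into the derived state."""
    if event["type"] == "deposited":
        return Account(state.balance + event["amount"])
    if event["type"] == "withdrawn":
        return Account(state.balance - event["amount"])
    return state  # unknown event types are simply ignored

def replay(events: list) -> Account:
    """Recreate the current state from the full history of events."""
    state = Account()
    for e in events:
        state = apply(state, e)
    return state

log = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
]
print(replay(log).balance)  # 70
```

Because the log is the source of truth, a new interpretation of "balance" can be applied retroactively just by writing a new `apply` function and replaying from the start.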

Eric Kavanagh

Right? Well, I guess here's one way we could look at it in terms of the data lifecycle and what we're doing with the data. Historically, especially with Hadoop and its huge file system, we said, oh, let's just store everything, and three versions of it as well, right? And then any time a node goes down, you have to repopulate that node. It's remarkably inefficient if you get right down to brass tacks, right? It's not an efficient way to get or use data. And because of this policy of persist-first, access-later, it's almost like you're driving a car while looking back the whole time: looking back, looking back, asking yourself, “where am I? What's my speed?” We're looking back all the time. Why do that when you can get a view of the world as it is now, understand where you're going, and be able to steer in real time? And someone else can be in the back seat looking around and taking notes.

Daan Gerits

And just augment your experience while you're doing it. You basically have something that is sitting in the back seat that has the sole job of looking at the history, interpreting the history, and giving you the best feedback. So it's like your copilot in your car.

Eric Kavanagh

That's very interesting. So you're taking a streaming-first approach is what it sounds like to me. Now, does that mean you had to re-architect the systems of record, the systems that are running the business, or how did that happen?

Daan Gerits

So we had the benefit of starting from scratch. Being a startup is great in that way, and because we could start from scratch, we could also rethink several of the things that we saw as established ways of doing things. One of those was the decision not to have a database. We basically said: “we're going only on Kafka.” So we are running everything on Kafka and its infinite storage.

Usually when I tell people that, and I say, “oh, and we want to store data for 40 years, and in three years' time we expect it to be around 160 petabytes,” one of two things happens. Either they point me to the door, or we have cool conversations. Most of the time we have cool conversations, to be honest. Because once you start thinking about it, when you look at infinite storage and how it has been built, it is offloaded to something like S3, for example, right? So how different is that from what you would do with a Spark job going to S3 and getting your data from there? The only difference now is that, okay, it's in Kafka's format, it's already sharded. So it's already in a distributed format. I can just take it from there and do the calculations I need to do.

Eric Kavanagh

That's very interesting. And the fact that you were doing this from scratch is a wonderful opportunity. We talked earlier today and yesterday about this deep impedance mismatch between the streaming world and the batch world. Everything we've built to date, or at least up to a couple of years ago, was geared around batch; that's all we had, we didn't have streaming. So how do you connect these worlds? And the short answer is: with difficulty, right?

Daan Gerits

Well, yes and no. My belief is that we are dissolving the barrier between transactional and analytical because we have to. Our clients demand different things: they demand not only to know information about them, but also how our organization is working for them. To be able to do that, you need to couple that information back directly and make it part of your transactional information flowing through. So yes, from a traditional point of view, you're absolutely right. If you approach this as “we have an analytical plane and we have our transactional plane, and we have data exchanges between them,” then yes, it's difficult. It's really difficult. But if you approach it from “no, there is no separate plane, it's the same thing,” then things change. In the past, if you wanted to ask a question, you had your operational database and you could start querying it for analytics. But then you would have a very angry database administrator at your side. In this case, it's a little bit different, because you aren't putting query load on something like Kafka; Kafka isn't there for querying. It's not a database. But it is a transaction log, and actually a very good one. So we can take the log and reinterpret it into whatever we need at this moment in time.
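
The "reinterpret the log into what we need right now" idea can be sketched without any Kafka specifics: two consumers fold the same immutable event stream into completely different read models, one operational and one analytical. All event names and fields below are illustrative assumptions, not KOR Financial's actual schema:

```python
# One immutable log, two interpretations: each "view" is just a fold
# over the same stream of facts. Names and fields are illustrative.

trade_log = [
    {"type": "trade_opened",  "id": "T1", "notional": 500},
    {"type": "trade_opened",  "id": "T2", "notional": 200},
    {"type": "trade_expired", "id": "T2"},
]

def open_trades(events):
    """Operational view: which trades are live right now."""
    live = set()
    for e in events:
        if e["type"] == "trade_opened":
            live.add(e["id"])
        elif e["type"] == "trade_expired":
            live.discard(e["id"])
    return live

def total_notional_ever(events):
    """Analytical view: notional over the full history, expired included."""
    return sum(e["notional"] for e in events if e["type"] == "trade_opened")

print(open_trades(trade_log))          # {'T1'}
print(total_notional_ever(trade_log))  # 700
```

Neither view is "the" state; both are disposable derivations, which is what dissolves the transactional/analytical split Daan describes.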

Eric Kavanagh

So the log becomes the database, essentially, right? Because what is a database? It's a place to store information. And what is the log? It's a place to store information about transactions, about things that have been done.

It's very interesting, and I like this because you're right, there's a twist. You're taking a very creative approach to the very serious challenge of trying to help the business understand what's happening and how we can change things. We came up with a bunch of good analogies yesterday in terms of streaming versus batch. Think of the baseball world. These hitters are so good at being able to watch the pitch come in and determine: do I swing? When do I swing? Where do I put the bat to get the best contact? If you only had seven snapshots instead of the fluid view of the pitch coming in, if you just saw it here, here, here, you're never getting the ball. You won't hit it because you can't see the spin on the ball, you can't see the trajectory, you can't even get a feel for what's happening, because you don't get a feel from snapshots.

Daan Gerits

No, because you, as a human, have to interpret or interpolate whatever is missing from that information. But I don't want to interpolate, because if I do, I'm missing outliers. I'm missing the interesting parts of the data, and that is actually the stuff that matters. That's where the value is for a lot of things. But also, for us, it's really about the operational part. Things like: how do you deal with data that evolves over 40 years? How do you make sure that you can still read that data 40 years from now? We had to invent our own software and libraries to be able to do that, because most of the use cases you will hear about when dealing with Kafka are low-retention, high-throughput use cases, while we are actually doing the opposite. It is a way to build an operational system that can scale to huge sizes and gives us flexibility. Because we are a startup, the likelihood of having to pivot or to make changes to the approach is very, very real. So we need that flexibility. And the only way we can have it is to have a system that we can easily hook into and an architecture that's very flexible. So if at any given time we need new functionality, for example billing, we just hook into the streams.
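
One common way to handle the "still readable in 40 years" problem is upcasting: old event versions are migrated on read so that consumers only ever see the latest shape, while the stored history stays untouched. This is a generic sketch of that pattern; the version numbers, field names, and the USD default are all hypothetical, not details of KOR Financial's libraries:

```python
# Upcasting sketch: migrate old-format events to the current shape at
# read time, leaving the immutable stored records as they are.
# All field names and the currency default are illustrative assumptions.

def upcast(event):
    """Return the event in the latest (v2) shape."""
    if event.get("version", 1) == 1:
        # v1 had a bare "amount"; v2 splits it into value + currency.
        # Historical v1 records are assumed to be USD (an illustrative choice).
        return {"version": 2, "id": event["id"],
                "value": event["amount"], "currency": "USD"}
    return event  # already current

old = {"id": "T1", "amount": 100}                               # written years ago
new = {"version": 2, "id": "T2", "value": 5, "currency": "EUR"}  # written today

print(upcast(old))  # {'version': 2, 'id': 'T1', 'value': 100, 'currency': 'USD'}
print(upcast(new) is new)  # True: current events pass through untouched
```

The key property is that readers are insulated from every historical schema, so a consumer written today can replay decades of data.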

Eric Kavanagh

That's so cool, because, again, what you have in the enterprise today is this fragmentation of data across many, many systems, and the whole reason MDM came along in the first place was to reconcile what you have in all these disparate systems into a cohesive view. Well, why not get that cohesive view out of the box by focusing on the stream and ingesting what you can? This is one thing I thought when I first learned about Kafka: that's a very clever way to create a steady stream of information that anyone can subscribe to. I want to grab clickstream data, I grab it; I want to grab transactional data, I grab it; I want to grab revenue data, I grab that and use it. They're all constantly running, right? It doesn't stop; it's just constantly going, and you can choose to use it or not. Instead of the other way, which, again, is to persist it, have it fragmented across these different systems, and then manage a huge project to reconcile it and try to figure out what the hell we did. And what is a customer? Did you say we have 8,000 customers? He says we have 3,742? Well, which number is it? Because it can't be both.

Daan Gerits

No, indeed. That's where you really have to go back to reality and see what is happening and how the interpretation has been done. Because that's what matters, right? It's the interpretation that matters. It's what you interpret to be a client, or a customer, or an employee, or even a product. That's the thing that matters. And even if, over time, your understanding of what a customer is changes, you would still be able to handle that. You would be able to go back to the beginning of time with your new understanding of what a customer is.

Eric Kavanagh

That's very interesting, that's cool. So what are your main sources of data right now? What are the topics that you're managing?

Daan Gerits

We have submissions coming in from the crypto world. Basically, we collect trades and trade information and pass that through to the regulators. So most of what we get in is submissions and trades. A series of submissions eventually adds up to a trade, for example a new trade, the point where you just started trading. And then you get lifecycle events for that trade. It can be that you get valuations, but it can also mean that at some point your trade is terminated or expired. That is the thing we need to track. And we have to make sure that whatever is sent in makes sense. So for every submission that comes in, we have built a validation framework, a validator component, that goes through a ridiculous number of validations to make sure that the logic inside the data actually makes sense. That is quite challenging, because it's easy to create a submission that looks like it makes sense just by looking at the message, but that's not enough. You need to place it in context to understand whether it makes sense within that context. If I already got a message in, then the message I get now might be the same, or it might be a duplicate, or it might be an invalid message, or an unauthorized message. And all of that, whatever the authorizations are, the rules for that engine, that validator, all of that runs on Kafka.
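
The distinction Daan draws, between a message that is valid in isolation and one that is valid in context, can be sketched like this. The rules, field names, and party identifiers are illustrative assumptions, not KOR's actual validation logic:

```python
# Context-aware validation sketch: a submission can pass every
# message-level check and still be rejected once prior state
# (already-seen IDs, authorized parties) is taken into account.
# All rules and field names are illustrative assumptions.

def validate(submission, seen_ids, authorized_parties):
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    if submission.get("notional", 0) <= 0:             # message-level check
        errors.append("notional must be positive")
    if submission["id"] in seen_ids:                   # contextual: dedup
        errors.append("duplicate submission")
    if submission["party"] not in authorized_parties:  # contextual: auth
        errors.append("unauthorized party")
    return errors

seen = {"S1"}        # IDs already processed (derived from the stream)
auth = {"ACME"}      # parties allowed to submit

print(validate({"id": "S2", "party": "ACME", "notional": 100}, seen, auth))
# []
print(validate({"id": "S1", "party": "EVIL", "notional": 100}, seen, auth))
# ['duplicate submission', 'unauthorized party']
```

In a streaming setup, `seen_ids` and `authorized_parties` would themselves be state derived from Kafka topics, which is what Daan means by the rules and authorizations all living on Kafka.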

Eric Kavanagh

Wow, that's pretty intense. You said you had to build some of your own tools to be able to analyze the logs and parse things after the fact. We do see an ecosystem developing around streaming, just like we saw an ecosystem developing around Hadoop. I mean, my take on Hadoop from the earliest days was, “Are you guys sure?” Does the entire analytical world distill down to a MapReduce function? And that was their argument, basically. Right?

Daan Gerits

I think one of the things we learned when we were playing around with big data, and basically one of the reasons that Hadoop isn't really around anymore, is that many of the principles in distributed systems still come down to some sort of MapReduce form. It always comes down to something like that, because that's how you do real work: you spread it out and you collect it. That's how the technology works. The biggest problem we had with big data was that nobody got the mindset. And that is exactly the thing that is so important when you do streaming. It is not about the technology; you have to understand the mindset behind streaming. Yes, you can get away with not having that mindset and doing a kind of ETL++ approach, but once you start diving deeper, where you want to track everything and be able to create state, that's something different. For that you really need a different state of mind, and that's the tricky part. That's something we struggle with when, for example, we bring in developers. We have stellar developers, really good people, but it takes a lot of time to make them understand: “no, now you have immutable data, so how are you going to deal with something that was wrong? How are you going to correct it in a system where you cannot make corrections?” So it is a different way of working. And once you get your head around it, it has implications throughout your whole setup. That's also why I think this works for us: because we're just starting. It's infinitely more difficult if you have to do this in an existing organization. But in all honesty, if you're a startup, why wouldn't you do it like that?
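
The question Daan poses to new developers, how do you correct data you cannot mutate, has a standard answer in event-driven systems: you append a correction event and let the interpretation prefer the latest version. A minimal sketch (event shapes are illustrative, not KOR's model):

```python
# "Correcting" immutable data: history is never rewritten. Instead a
# correction event is appended, and the fold that derives current state
# lets later events win. Event shapes are illustrative assumptions.

def current_trades(events):
    """Derive the current notional per trade; later events override earlier ones."""
    trades = {}
    for e in events:
        if e["type"] in ("trade_submitted", "trade_corrected"):
            trades[e["id"]] = e["notional"]
    return trades

log = [
    {"type": "trade_submitted", "id": "T1", "notional": 1000},  # wrong value
    {"type": "trade_corrected", "id": "T1", "notional": 100},   # fix, appended
]
print(current_trades(log))  # {'T1': 100}
```

The audit trail is preserved for free: the log still shows that the wrong value was submitted and when it was corrected, which matters a great deal in a regulatory context.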

Eric Kavanagh

Well, I think this is some really good advice to existing organizations that are large and do have a lot of technical debt: start your pilot project, get something rolling on some key part of the business that makes a lot of sense for streaming, and then recognize that, yeah, you're going to have to build out around this new nucleus. One of the oldest jokes in the book... in fact, I had a great quote from a guy, you might like this. This guy, Gilbert van Cutsem, said to me one day out of the blue, “Elephants go to a special place to die. But there is no software graveyard. It all just goes to the cloud.” Which I thought was so funny, because sunsetting systems is an incredibly difficult thing to do. But at a certain point, you've got to rip the band-aid off, right, and start over.

Daan Gerits

Yeah, I think you're absolutely right. There's something we can do as humans, and that's reflect: look back at what we've been doing and start asking why we are doing it this way. And we did that. We came to our conclusion, which might not be anyone else's conclusion, though we hope it is. We came to the conclusion that there is a better way of doing this. We can do better. I couldn't have done this, or we couldn't have built this platform as a team, five years ago.

Eric Kavanagh

That's very cool. You make an excellent point because this really is a renaissance in how to use data to run your business. This is not an incremental change from the old way of doing things, it is a sea change. Literally.

Daan Gerits

Yeah, and for me, I've been talking to a lot of people I know from back in the day when we were doing Hadoop stuff and things like HBase, and not much has changed since then. You have Snowflake and you have the “lakehouse,” but it's still the same thing we're doing.

Eric Kavanagh

It's in the cloud. So that's what's different. But that's a differentiator.

Daan Gerits

In all honesty, if you're building a company that provides Kafka to others, then yeah, be on-prem. But if Kafka is not your core business, then there's no reason at all not to be in the cloud. If you think it's security, think again: you cannot do a better job than these people; it's their bread and butter. It's a no-brainer. And I am truly convinced, I talk to a lot of people about this, I'm truly convinced that this is the next step in data evolution. But I don't believe in evolution, I only believe in revolution. I think if you want significant change, you need to start doing things really differently. It's not a migration path, although we would like it to be. I think at some point you have to say, “Okay, let's start rethinking this, because we will get value out of it.” Things that are very hard right now will become significantly easier. You won't need four months of meetings to get an idea of what will be impacted if you make a change to something.

Eric Kavanagh

How funny is that? Yeah, that's a good way to end this call, folks. Because what he's basically saying is, yes, it's going to take some time to get it right in this new way of doing things. But the amount of time and effort that you'll save down the road is just mind-blowing. You won't have to fret for four months about making one change to your schema, for example. You'll just test something out and go, “Oh, wow, that was great. Do it.” Love it!

We've been talking to Daan Gerits from KOR Financial. Look these folks up online; they're going to do big stuff. Thank you for your time.

Daan Gerits

Thank you, I really enjoyed it.