Advanced Data Modeling Techniques

Choosing between domain models and meta models

José Manuel Aguirre
ITNEXT


The data model is the heart of every application.

In the end, it’s all about data: data comes in from the user’s keyboard or from an external source, it is processed according to business rules, and finally it is presented to the user (or to external applications) in some convenient way.

Photo by Eric Prouzet on Unsplash — A Data Model is like having many boxes for storage

Every aspect of your application, every function you write, involves data that gives meaning to the whole system.

So, the question is:

What are the key aspects of good Data Modeling?

We are going to answer this question in this article, but first, we need 2 definitions:

Definition 1: What is a Data Model?

The Data Model is a way to organize the data of your application. It’s not the data itself, and it’s not the device you use to store it (the database system you choose). So the following assertions are true:

  • You can store the same data, using different data models.
  • You can store different data using the same data model.
  • It should be possible to transform the data from one data model to another data model (this process is commonly called “data migration”).

Definition 2: How do we define a good Data Model?

In other words: How can we compare different data modeling options? What aspects should be considered?

The 5 dimensions of a good Data Model

I propose 5 dimensions as the most relevant:

  • Clearness: The capacity of the Data Model to be understood by people looking at it. As you may have heard, most of the time developers read code instead of writing it, so we need to clearly understand what we are doing with our data.
  • Flexibility: The ability of the model to evolve without having to make a huge impact on our code. The startup you work for is evolving, so the systems are changing and the data models behind them will need to evolve over time.
  • Performance: This is a very broad topic, and we are not talking here about database vendors, data indexing or technical tweaks to improve read and write speed. I think there are performance benefits that come purely from how we model the data. We will go deeper into this aspect later on.
  • Productivity: On the developer side of things, we want to have a model that is easy to work with and to reason about, so we can “produce a lot of good code” without using a lot of time (the definition of productivity 😄)
  • Traceability: Finally, we want to have not only the data that is relevant to our users, but also data that is relevant to the system itself: being able to know what happened in the past, what values the entities had at some point in time, going back and forward in time, etc.

In other words:

Give me a Data Model that is easy to understand, easy to extend or change, with good performance while also good for developer productivity and with the ability to know what happened in the past…

Boom! Everyone will be very very happy 🎊 🎉…

Main Data Modeling Techniques

As you may guess by now, in this article we are going to propose a generic way of modeling data that meets all our requirements.

I strongly believe that there is no silver bullet and the correct answer most of the time is “it depends”, but I have already used this new technique and it looks very promising.

So, my excuses in advance for being so opinionated on this subject 😉.

But first, let me tell you about the “normal way” of modeling data, which I think should look very familiar to you.

Standard Data Model (a.k.a. Domain Models)

It’s very simple: You define your entities and their properties based on the domain of the problem you are solving. So, each entity will be similar to the other entities in the same domain and very different from entities in another domain. It’s like having a different kind of box for each type of thing we want to store.

Photo by Clem Onojeghuo on Unsplash — The Standard Data Model: One kind of box for each purpose

Suppose you are developing a Meetings software solution. Your domains will look something like the following list:

  • Meetings: With basic info about the place, time and duration, acting as a container for the rest of the entities listed below.
  • People: The members of the meeting, possibly with a specific role (organizer, secretary, presenter, etc.)
  • Topics: The agenda for the meeting as a list of topics with some order, description, duration, etc.
  • Agreements: The main output of the conversation, maybe with some additional tagging for later searchability.
  • Notes: Main conversations inside a topic.
  • Actions: Short-term responsibilities assigned to a person.
  • you get the idea…

It’s obvious that this kind of modeling is pretty clear, as it’s defined just the way we think about the problem.
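
For illustration only, here is a minimal sketch of how a couple of those domains could look with Mongoose-style schemas (the library choice and field names are just an example, not a prescription):

```js
// A hypothetical sketch of the standard (domain model) approach with
// Mongoose-style schemas. Field names are illustrative only.
const mongoose = require('mongoose');

const MeetingSchema = new mongoose.Schema({
  place: String,
  startsAt: Date,
  durationMinutes: Number,
});

const TopicSchema = new mongoose.Schema({
  meeting: { type: mongoose.Schema.Types.ObjectId, ref: 'Meeting' },
  order: Number,
  description: String,
  durationMinutes: Number,
});

// Every new domain (People, Agreements, Notes, Actions, ...) needs its own
// schema, its own collection and its own CRUD code.
const Meeting = mongoose.model('Meeting', MeetingSchema);
const Topic = mongoose.model('Topic', TopicSchema);
```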

So, let’s do the first check on our 5 dimensions of analysis (points from 1 to 10):

  • 🔟 Clearness: 10 points. Yes, the model is super clear, just like humans think.
  • 3️⃣ Flexibility: 3 points. Not so good here, though. Each new field required will mean a change to the model 🤷🏻‍♂️.
  • 6️⃣ Performance: 6 points. This kind of model does not have the best performance; we will explain why later.
  • 3️⃣ Productivity: 3 points. Mmm, each collection (or table) will need its own method to update values in each field, right? This is not good for developer productivity, unless you develop a middleware to talk to the database in a “parametric way”, but that is also unnatural. We will propose a better way to accomplish this.
  • 2️⃣ Traceability: 2 points. This kind of modeling updates fields in place, so when an address changes, you lose the old address, right? The workaround is to have a separate table that records all changes (a log table), but it will be separate from the rest of the model and therefore “noisy”.
The Standard Data Model under 5 dimensions of analysis

A meta-modeling approach

Ok, we will go straight to the point now.

Structure of the Model

The proposal is to have just one table (or collection) that stores all the domain data, under the same structure and without losing any data (no updates, no deletes)*.

* There are 2 exceptions to this rule; we will cover them later.

The Shape of the Record in the Meta Model

The structure of this “universal record” is:

  • _id: Unique identifier of the entity.
  • uniqueKey: (Optional) This is also a unique identifier of the entity, but it’s driven by some business rule. For example: email addresses must be unique, or the relation between two entities may give birth to a new entity that has a unique key of “entity_id_1--entity_id_2”.
  • domain: Or the type of entity we are storing. Mostly useful for filtering when fetching data, and for clarity when the developer looks at the database.
  • company_id: This field may be controversial, but in the apps I have developed, the concept of a company (an organization your users belong to) is always present, so it’s good for indexing, for clarity and to have a “container entity” for all the domains.
  • parent: OK, we are now thinking about relations. Given that there is one table, there are no explicit relations at the model level (there are at the data level). Here we identify the parent entity of the current entity, so if we have access to the parent, we have access to this entity too. In my experience this is the company_id or the user_id in most cases.
  • attrs: Here is where “the meat” is (the actual Data). This is an array of objects in the form { key, value, timestamp }.
An example of a “meta-entity”
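
As a rough, hypothetical example (the values are made up for illustration), a meta-entity document stored in MongoDB could look like this:

```js
// A hypothetical meta-entity: one shape for every domain.
const personEntity = {
  _id: '5e67a1c2f1d2a40017b3c9aa',
  uniqueKey: 'jose@example.com', // optional, driven by a business rule
  domain: 'People',
  company_id: 'acme-corp',
  parent: 'acme-corp', // who "owns" this entity at the data level
  attrs: [
    { key: 'company_id', value: 'acme-corp', timestamp: 1575490495600 },
    { key: 'user_id', value: 'user-42', timestamp: 1575490495600 },
    { key: 'status', value: 1, timestamp: 1575490495600 },
    { key: 'name', value: 'José', timestamp: 1575490495682 },
    { key: 'name', value: 'José Manuel', timestamp: 1575490495795 },
    { key: 'role', value: 'organizer', timestamp: 1575490495682 },
  ],
};
```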

The attrs field

So, the whole model really lives under this field, and each key may appear more than once (for different timestamps).

For example:

  • { key: ‘name’, value: ‘José’, timestamp: 1575490495682 }.
  • { key: ‘name’, value: ‘José Manuel’, timestamp: 1575490495795 }.

This indicates that the key name had the value ‘José’ at timestamp 1575490495682, but after that it changed to ‘José Manuel’ at 1575490495795. Given that this timestamp is greater than the previous one, we consider this value the current value.
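
A minimal sketch of this “greatest timestamp wins” rule in plain JavaScript (the function name is just illustrative):

```js
// Resolve the current value of a key: the entry with the greatest timestamp wins.
function currentValue(entity, key) {
  const latest = entity.attrs
    .filter((attr) => attr.key === key)
    .reduce((acc, attr) => (!acc || attr.timestamp > acc.timestamp ? attr : acc), null);
  return latest ? latest.value : undefined;
}

// currentValue(personEntity, 'name'); // -> 'José Manuel'
```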

Also, we will always have 3 special fields inside the attrs field:

  • company_id: Already explained.
  • user_id: Or the user responsible for the creation of the entity.
  • status: 1 for active entities and -1 for deleted ones (we never actually delete an entity!)
The “attrs” field in detail

Notice that the “shape” of the value property inside each of the attributes can be of any type. If we think in JavaScript terms (I have used Mongo, but I think the same structure is suitable for other databases), we can have: Strings, Booleans, Numbers, Dates, Arrays, Objects, etc.

Now, we will cover each of the 5 dimensions of analysis in detail.

Clearness

This is not the greatest feature of this model, because every time you look at a record, you need to dig inside the attrs field. This is the cost we are paying: the trade-off for having the other benefits.

I should mention, though, that after working with this key-value mental model for a while you end up “seeing the model very clearly”, but for a new reader it may look confusing at first glance.

We also get a nice bonus when we model like this: we can describe our whole model with one query. So if we need a document (or better, a webpage) that shows the “real” fields of each entity, we can achieve that very easily.
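
For example, assuming the collection is called entities, a single MongoDB aggregation could list every key actually used by each domain:

```js
// In the Mongo shell: for each domain, list the set of keys actually in use.
db.entities.aggregate([
  { $unwind: '$attrs' },
  { $group: { _id: '$domain', keys: { $addToSet: '$attrs.key' } } },
]);
```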

Photo by Jacob Sapp on Unsplash — Clearness

Now, let’s look at flexibility.

Flexibility

Flexibility is built into the meta model; it’s the core concept: instead of defining the fields of each entity in advance for each domain (also known as “the schema”), we just define this generic structure that can hold any schema.

This is really powerful in these cases:

  • Your system needs to hold a new type of data in a given entity, or maybe new entities. No problem: you already know that any entity can be modeled with a simple key-value definition, so you will not break any code by making this change. The point is that the model really lives inside the data, not inside the database!
  • You need to change a relation between entities; maybe an entity that depended on a user now needs to depend on a group of users… No problem, you just update the parent field with a query (as sketched below). In this case you may need to change some code, but you are (again) not changing the model.
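
A sketch of that kind of “re-parenting” query, assuming the same entities collection (the domain name and ids are made up):

```js
// In the Mongo shell: move all 'Actions' entities from a user parent to a
// group parent, without touching anything else in the model.
db.entities.updateMany(
  { domain: 'Actions', parent: 'user-42' },
  { $set: { parent: 'group-7' } }
);
```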

It’s amazing to have this feature, really. The next meeting you attend called “Changes needed for the Data Model”, you can just walk into it with confidence and a big smile 😁

Photo by Wesley Tingey on Unsplash — Flexibility

Performance

Now, this is the least obvious benefit of this type of modeling. You can argue (and you will be right) that this model consumes more space than the traditional one. Yes, it does.

But nowadays storage is not an issue: we have plenty and it’s cheap.

The main performance gain does not come from the key-value way of storing things, but from the timestamp field, combined with the fact that we never update or delete anything.

Because of that, clients reading from this model (where the performance is actually perceived) don’t need to fetch all the entities, nor all the fields of the entities they are interested in, because they might already have them in place!

For example:

  • Some data was created in the database at timestamp t_0.
  • A client requested some data at t_1 and the server responded with the data the client was interested in (only the data the module requires / has access to). Then the client logged out of the app.
  • Later on, at timestamps t_2 and t_3, the server got new data, coming (for example) from other users that interacted with the application.
  • After that, at t_4, the client connects again and, instead of receiving all the data again (plus the new information), it only gets the new information, avoiding transferring data it already has over the network.
We get only the data we need, every time

This is good for performance!! Instead of tweaking some little details, we are just cutting off a lot of work every time we go to the database, by saying: just give me what you have for me from this point onwards.

This works not only at the entity level (new entities will be sent completely) but also at the field level (we will send just the new fields of old entities), reducing in the long term the amount of bytes sent over the wire.

In other words, we are implementing a local cache here, which is possible only because we never update fields in place; we are always appending new data.
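
A minimal sketch of that incremental read, assuming the client keeps the timestamp of its last sync:

```js
// Using the Node MongoDB driver: fetch only the entities that have at least
// one attribute newer than the client's last sync timestamp.
async function fetchChangesSince(db, companyId, lastSync) {
  return db
    .collection('entities')
    .find({ company_id: companyId, 'attrs.timestamp': { $gt: lastSync } })
    .toArray();
}
```

A further refinement could also project only the attributes newer than lastSync, so old fields of old entities are not re-sent either.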

Photo by Harley-Davidson on Unsplash — Performance

In which cases do we break the “just append” rule?

There are 2 situations in which we are not appending, just in case you are wondering how to manage them:

  1. The field is just state; it’s volatile. Suppose you have Notifications sent to your users. A user may want to mark one as read or unread, and can do this many times. So there is no real value in keeping the whole “click history”, because of the nature of the data.
  2. The field represents a sub-entity. For example, you may have a domain called “Skills” that has “grades” from 1 to n. You could create a new domain called “SkillGrades”, but being a child entity it’s simpler to have it nested inside the parent. The point is that instead of adding a new key=’grades’ entry each time you add or remove a grade, you have just one key=’grades’ and you append inside it.

In both cases we update the timestamp of these fields on each update, so they are recognized as new information the next time the client requests them.
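
For the first case, here is a sketch of that in-place update using MongoDB’s positional operator (the notificationId and the ‘read’ key are just illustrative):

```js
// In the Mongo shell: overwrite the volatile 'read' attribute in place, but
// refresh its timestamp so clients still detect it as new information.
db.entities.updateOne(
  { _id: notificationId, 'attrs.key': 'read' },
  { $set: { 'attrs.$.value': true, 'attrs.$.timestamp': Date.now() } }
);
```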

Productivity

If you have 20–30 types of entities in your model (and it’s not so difficult to reach this number), you will need some CRUD (Create-Read-Update-Delete) methods for each entity, right? So you will end up with around 100 methods written one by one.

I had that feeling before: duplicate existing code from one method, change the file name, add it to the index file, change the content of the method, add/remove fields, etc. And every time you change the fields of the entity, guess what? Locate the file, change the field, restart the server… 🤯

When you have a meta-model, you will quickly realize that you need to build some utility functions to interact with it. But! The difference with the 100 methods mentioned above is that the set of utilities is smaller and does not depend on the size of your domain (20–30 types of entities will use the same utility functions as 90 types of entities).

Here are some of these functions (a sketch of a couple of them follows the list):

  • createEntity: Builds the basic structure and is entity-agnostic (like all the other functions below).
  • addAttrsToEntity: The “update” function; it adds new data to the entity without losing the previous data.
  • addParentKey: Adds the “relation side” of the entity, to be able to determine later who can access it.
  • getEntities: By selector, by _id or by uniqueKey, you get an entity’s data.
  • pickKeyCurrentValue: Most of the time we are interested only in the last (current) value of each field, so we pass an entity and some keys to this method and we get back an object with the last value for each key.
  • getRelatedEntities: Returns entities that are at the company level (all users have access), at the user level or at other levels (route or query parameters, for example).

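As an illustration, here is a minimal sketch of the first two (the exact implementations are just one way to do it):

```js
const { randomUUID } = require('crypto'); // Node >= 14.17; any id generator works

// createEntity: builds the universal record shape for any domain.
function createEntity({ domain, company_id, user_id, parent, uniqueKey }) {
  const now = Date.now();
  return {
    _id: randomUUID(),
    uniqueKey,
    domain,
    company_id,
    parent,
    attrs: [
      { key: 'company_id', value: company_id, timestamp: now },
      { key: 'user_id', value: user_id, timestamp: now },
      { key: 'status', value: 1, timestamp: now },
    ],
  };
}

// addAttrsToEntity: the "update" that never loses data; it only appends
// new { key, value, timestamp } entries.
function addAttrsToEntity(entity, newAttrs) {
  const now = Date.now();
  const appended = Object.entries(newAttrs)
    .map(([key, value]) => ({ key, value, timestamp: now }));
  return { ...entity, attrs: [...entity.attrs, ...appended] };
}
```
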
You should expect around 50% to 70% less code written, based on a project I am working on now, and also fewer errors thanks to standardization.

Photo by Andreas Klassen on Unsplash — Productivity

Traceability

If you delete data (physically) or update data in place (the most common approach), you lose the ability to know what happened in the past. Sometimes this is what you actually want (for example in the 2 cases discussed above), but other times you just feel safer keeping it.

You know that the database is not only holding the data that is relevant to your users, but also holding the events related to each piece of data. This is good for later:

  • Debugging: You are able to “really see what happened”.
  • Analytics: You also get the notion of application usage by looking at the database.
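
For example, reconstructing the full history of a single field is just a read over attrs (a small sketch, with an illustrative function name):

```js
// Full history of one field, oldest first: useful for debugging
// ("what did this value look like last week?") and basic analytics.
function keyHistory(entity, key) {
  return entity.attrs
    .filter((attr) => attr.key === key)
    .sort((a, b) => a.timestamp - b.timestamp);
}

// keyHistory(personEntity, 'name');
// -> [ { value: 'José', ... }, { value: 'José Manuel', ... } ]
```
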
Photo by Wil Stewart on Unsplash — Traceability

Ok, now let’s take a look at our numbers with the new approach:

The 5 dimensions of analysis, comparing the 2 techniques

The Meta-Model has clear advantages in each dimension, except for being less clear (although you can adapt your brain to this model over time). Take the picture only as a way of comparing the 2 alternatives; I know the numbers are arbitrary and not based on serious studies.

Bonus: What about using this model also in the Browser?

This can easily be a “part 2” of this article, but in summary the key aspects are:

  • We will have a local copy of the data received from the server in LocalStorage (or even better, in IndexedDB). We will only touch this data when new data arrives.
  • Then we will populate an object (a large JS object) with all the data, applying 2 transformations:
  • 1) We will keep only the last value of each field (the timestamp will not be needed anymore), and
  • 2) Instead of having the data in the form [ { key: ‘name’, value: ‘José’ }, { key: ‘city’, value: ‘Santiago’ } ], we will have it just like { name: ‘José’, city: ‘Santiago’ }, which is more natural for our mental model.
  • This is very important, because we are taking care of the “accessibility” of the data for the client-side code as well.
  • When new data is produced by the user we will be optimistic: we will add it to our large JS object and also make the request to the server, so our UI reflects changes immediately.
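
A minimal sketch of that client-side transformation in plain JavaScript, using the attrs shape described above:

```js
// Reduce the attrs array to a plain object with only the latest value per key.
function toPlainObject(entity) {
  const latest = {};
  for (const { key, value, timestamp } of entity.attrs) {
    if (!latest[key] || timestamp > latest[key].timestamp) {
      latest[key] = { value, timestamp };
    }
  }
  return Object.fromEntries(
    Object.entries(latest).map(([key, { value }]) => [key, value])
  );
}

// toPlainObject(personEntity);
// -> { company_id: 'acme-corp', user_id: 'user-42', status: 1,
//      name: 'José Manuel', role: 'organizer' }
```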

Last but not least, building an Admin Module on top of the Meta-Model (create, delete, update data) is also very straightforward. You will have less special code for each collection, more consistency across your codebase and fewer UI components to write.

Let me know if you found this approach useful and your thoughts about it!

Cheers from Chile !!🇨🇱
