Storing chilean RUTs in the database
Passing the knowledge to you before you f*** it up.
Chilean RUTs (or RUNs, they’re practically the same) are the thing most chilean services like to use. It solves the problem of having one unique data point attached to an unique existing person, in comparison with emails or even phone numbers which can be easily generated and vinculated to the same individual.
RUT are standard in Chile. Most of services and companies allows to cross-reference RUT with other data, like financial health, residence verification, background certificates, vehicles registered, and so on. As a person, you wouldn’t use the RUT of another person, you can be in big trouble since it impersonation is punishable by law.
Back into the main topic. In case a RUT is sent to your application, you will end up looking for a way how to properly store it. I have seen many techniques, but the one that still use until this day is very simple.
Quick story lesson: RUTs where created in 1969 to identify a person or a company financially. One of the features was to made them unable to fake on-the-go, but little you know, computers happened some decades later which do that almost instantly.
The RUT number and verification digit
The RUT is comprised of a number, and a verification digit, the latter being a digit between
9, or the
K character. So the “digit” part is not a digit but a character, but everyone says it like it is so.
Since RUTs are unique for every person and company, you will never get a RUT with the same number but different verification digit. Also, RUTs are never re-assigned.
The verification digit is the result of a mathematical algorithm over the number that there is no point in describing for this article purpose.
Storing the RUT
As you may be aware now, the first part of the RUT is just a integer, while the other can be a integer or a character. So the best way to store them in a database is using our friends
| number | verification_digit |
| ---------- | ------------------ |
| 6898742 | 3 |
| 15846327 | K |
The RUT Number
RUTs numbers cannot be negative, and they can be as high as 90 million.
The best approach is to store as an unsigned integer. This uses only 4 bytes, and allows a number up to 2³² (4,294,967,295), so you don’t have to account for any other future-proof data type… unless you ban every birth control product in the country.
The RUT Verification Digit
The verification digit can should be stored as a fixed one character length, since it can be a integer or character. That amounts to only 1 byte.
MySQL databases allow for ENUM, which allows a column to store only a value from a predefined list of values. In this case, since the list is 0–9 or K, we could use it, but we would rather not. ENUM is not SQL standard.
In these cases, CHECK will check if the value to set is inside the list of valid values, while being a varchar or integer. But it won’t change the data type of the column, and only adds overhead to the insertion which can be passed to the application itself.
int allows to use the number column as an index to quickly find users in the database and make relationships to other tables based on it.
The resulting SQL Query
We can start with something like this:
CREATE TABLE `person` (
`rut_num` INT unsigned NOT NULL,
-- ... more columns UNIQUE KEY `rut_num`,
PRIMARY KEY (`rut_num`)
And with this we end up with only 5 bytes of data to identify a person, and 4 bytes for each index entry.
Storing only the number
The other technique is to rely on RUT validation. If your application has a way to ensure the number has the correct verification character before storing, you can easily dispose of the verification digit altogether.
| number | other info ...|
| ---------- | ------------- |
| 6898742 | |
| 15846327 | |
When retrieving the RUT number, you can tell your application to generate the verification digit, or much better, create an SQL function to add it to the number column automatically.
This only saves you 1 byte of data, 20% of the RUT data size, but may be less than 1% taking into account the whole data row you want to retrieve.
This start to sound awful when you take into consideration that it will add overhead to the application or database. Just imagine retrieving thousand of numbers and generate the verification digit for each one of them!
What about storing it as characters?
Welp.. there is a big problem on that.
A RUT can have between 8 and 12 characters if we add thousand separator and the hyphen. A variable character would need between 8 and 12 bytes, and using a fixed 12 character length column would give some performance on read but you will use 240% of space for the same data.
Also, you will have to filter the incoming data in your application, and choose between using thousand separator and hyphen, or just the numbers and assume the last character is the verification digit, to store.
So, yeah, don’t do this, it just adds more logic where it shouldn’t be.
I made PHP package for this
Since most of chilean devs must work with RUTs into their public web applications, I took the liberty to make a PHP package to tackle this. It has a lot of handy functions to work with RUTs, like mass generation and validation.
darkghosthunter/rut-utils - Packagist
A complete library for handling chilean RUTs and RUNs.
Have a nice year 2019.