
Generating data 100x faster with Rust — Part I

Anthony Potappel
Published in ITNEXT
Jan 2, 2024

Testing and assessing data tools requires realistic test data, and for this task, I’ve frequently turned to Faker — a popular Python library for data generation.

However, as I started handling larger datasets — up to 1GB, sizeable, but not excessive — Faker’s performance constraints became problematic.

Python’s speed, or lack thereof, is well-known. But libraries can counter this by delegating heavy tasks to efficient, lower-level languages like C, C++, or Rust. Interestingly, Faker hasn’t gone down this road — at least not yet.

With my increasing passion for Rust, I couldn’t resist the urge to explore how Rust could tackle this problem with a more efficient implementation.

Let’s find out!

A case of generating a billion people

In search of a challenging dataset, I came across this article about the creation of an extensive 36GB dataset, with data on one billion people.

While the article’s objective is to demonstrate DuckDB’s impressive querying capabilities, what caught my attention was the described method of data generation with Python Faker.

The process required an expensive 128-core machine to run for two hours. Given that a modern-day CPU can crunch gigabytes of data per second, this highlights a significant opportunity for optimization.

The code creates a list of tuples, each representing an individual. I adopted most of the code as-is, albeit with some minor adjustments. On analysis, it became clear that data generation consumed virtually all (>95%) of the compute, so I left out the DataFrame and Parquet parts in this first part of the series. We’ll dig into those later.
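
A claim like "generation consumes >95% of the compute" is easy to verify with the standard library's cProfile. Below is a minimal sketch of that kind of check; `fake_name` and `generate` are hypothetical stand-ins for the Faker-based functions, so the snippet runs without third-party dependencies.

```python
import cProfile
import pstats
import random
import string


def fake_name() -> str:
    # hypothetical stand-in for a Faker call, keeps the sketch dependency-free
    return "".join(random.choices(string.ascii_lowercase, k=8)).title()


def generate(count: int) -> list[dict[str, str]]:
    # mirrors the row-wise generation pattern used in the listing below
    return [
        {"first_name": fake_name(), "last_name": fake_name()}
        for _ in range(count)
    ]


profiler = cProfile.Profile()
profiler.enable()
rows = generate(10_000)
profiler.disable()

# show the five most expensive entries by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

With the real Faker-backed functions, the top entries of such a profile point straight at the generator calls, which is what justifies ignoring the serialization side for now.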

import sys
import random
import time

from faker import Faker
from typing import Any

NO_ROWS = 10000

fake = Faker()


def get_person() -> dict[str, Any]:
    person = {
        "id": random.randrange(1000, 9999999999999),
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "email": fake.unique.ascii_email(),
        "company": fake.company(),
        "phone": fake.phone_number()
    }
    return person


def generate_person_list(count: int) -> list[dict[str, Any]]:
    person_list = [get_person() for _ in range(count)]
    return person_list


def main() -> int:
    start_time = time.time()
    person_list = generate_person_list(NO_ROWS)
    end_time = time.time()

    print("First 3 records:", person_list[:3])
    print(f"Time taken to generate {NO_ROWS} people:")
    print(f"--- {round((end_time - start_time), 3)} seconds ---")
    return 0


if __name__ == "__main__":
    sys.exit(main())

# run the code
python pyfake/generate_row.py

First 3 records:
....
Time taken to generate 10000 people:
--- 2.163 seconds ---

Python update

Before moving to Rust, I explored a different approach within Python to potentially improve performance. I modified the original implementation, aiming for a more efficient way to handle the data.

My expectation was that writing the data via this new approach (lists of attributes instead of a list of people, i.e. row-oriented to columnar) would run faster:

import sys
import random
import time

from faker import Faker

NO_ROWS = 10000

fake = Faker()


class ColumnTable:
    def __init__(self, count: int):
        self.ids = [
            random.randrange(1000, 9999999999999) for _ in range(count)
        ]
        self.first_names = [fake.first_name() for _ in range(count)]
        self.last_names = [fake.last_name() for _ in range(count)]
        self.emails = [fake.unique.ascii_email() for _ in range(count)]
        self.companies = [fake.company() for _ in range(count)]
        self.phone_numbers = [fake.phone_number() for _ in range(count)]


def main() -> int:
    start_time = time.time()
    table = ColumnTable(NO_ROWS)
    end_time = time.time()

    print("First 3 records:")
    for i in range(3):
        print(
            f"Record {i + 1}: {{ id: {table.ids[i]}, "
            f"first_name: \"{table.first_names[i]}\", "
            f"last_name: \"{table.last_names[i]}\", "
            f"email: \"{table.emails[i]}\", "
            f"company: \"{table.companies[i]}\", "
            f"phone_number: \"{table.phone_numbers[i]}\" }}"
        )

    print(f"Time taken to generate {NO_ROWS} people:")
    print(f"--- {round((end_time - start_time), 3)} seconds ---")
    return 0


if __name__ == "__main__":
    sys.exit(main())

Let's run the code:

# run the code
python pyfake/generate.py

First 3 records:
....
Time taken to generate 10000 people:
--- 2.028 seconds ---

To my surprise, this was not actually much faster in Python. Converting the table to a DataFrame did not matter much either; in memory, both layouts are roughly equally fast.

Still, when looking closely, there is a tiny (2-5%) but fairly consistent performance bump in the generation step for the second approach.

The final scores (average from 10 runs, 10K people) are as follows:

# benchmark row (average over 10 runs)
python -c 'import pyfake; pyfake.benchmark_row()'

Average time taken to generate 10000 people:
--- 2.171 seconds ---

# benchmark column (average over 10 runs)
python -c 'import pyfake; pyfake.benchmark_column()'

Average time taken to generate 10000 people:
--- 2.116 seconds ---
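
Averaged timings like these can be taken with the standard library's timeit. A minimal sketch of that measurement, using a hypothetical stand-in generator instead of Faker so it runs standalone:

```python
import random
import string
import timeit


def fake_word() -> str:
    # hypothetical stand-in for a Faker call
    return "".join(random.choices(string.ascii_lowercase, k=8))


def generate_rows(count: int) -> list[dict[str, str]]:
    # row-oriented: one dict per person
    return [{"name": fake_word()} for _ in range(count)]


def generate_columns(count: int) -> dict[str, list[str]]:
    # column-oriented: one list per attribute
    return {"name": [fake_word() for _ in range(count)]}


RUNS = 10
row_avg = timeit.timeit(lambda: generate_rows(10_000), number=RUNS) / RUNS
col_avg = timeit.timeit(lambda: generate_columns(10_000), number=RUNS) / RUNS
print(f"row: {row_avg:.3f}s  column: {col_avg:.3f}s")
```

With the stand-in generator the two layouts are nearly indistinguishable, which matches the observation above: the cost sits in the per-value generation calls, not in how the results are arranged.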

Note that these results are still single-threaded; we’ll add multi-threading in the next part of this series. You can find all the code in this repository and play with it yourself. Your hardware may yield different results.

Rust implementation

As we transition to the Rust implementation, you may notice a parallel between the Python version’s class and the Rust code’s struct. Recognizing that Rust’s structs fill a role similar to Python’s classes helped me a lot in grasping their usefulness.

Let's look at the Rust implementation and run it:

use std::time::Instant;

use fake::Dummy;
use fake::{Fake, Faker};

use fake::faker::name::en::*;
use fake::faker::internet::en::*;
use fake::faker::company::en::*;
use fake::faker::phone_number::en::*;

const NO_ROWS: usize = 10000;

#[derive(Debug, Dummy)]
struct TableColumns {
    #[dummy(faker = "(1000..9999999999999, NO_ROWS)")]
    pub ids: Vec<i64>,

    #[dummy(faker = "(FirstName(), NO_ROWS)")]
    pub first_names: Vec<String>,

    #[dummy(faker = "(LastName(), NO_ROWS)")]
    pub last_names: Vec<String>,

    #[dummy(faker = "(FreeEmail(), NO_ROWS)")]
    pub emails: Vec<String>,

    #[dummy(faker = "(CompanyName(), NO_ROWS)")]
    pub companies: Vec<String>,

    #[dummy(faker = "(PhoneNumber(), NO_ROWS)")]
    pub phone_numbers: Vec<String>,
}

fn generate_table() {
    let start_time = Instant::now();
    let table: TableColumns = Faker.fake();
    let elapsed = start_time.elapsed().as_secs_f64();

    println!("First 3 records:");
    for i in 0..3 {
        println!(
            "Record {}: {{ id: {}, first_name: \"{}\", last_name: \"{}\", \
             email: \"{}\", company: \"{}\", phone_number: \"{}\" }}",
            i + 1,
            table.ids[i],
            table.first_names[i],
            table.last_names[i],
            table.emails[i],
            table.companies[i],
            table.phone_numbers[i]
        );
    }

    println!("Time taken to generate {NO_ROWS} people:");
    println!("--- {:.3} seconds ---", elapsed);
}

fn main() {
    generate_table();
}

# compile and run
cargo run

...
Running `target/debug/rsfake`
First 3 records:
...
Time taken to generate 10000 people:
--- 0.111 seconds ---

That is not 100x faster! That is correct: by default, cargo run builds an unoptimized debug binary to minimize compile time, making it quick to test whether code runs.
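
The difference comes down to Cargo's build profiles. The values below are Cargo's defaults, written out here purely for illustration; you normally don't need to add them to Cargo.toml:

```toml
# Cargo.toml - default profile settings (illustrative; these are built-in)
[profile.dev]
opt-level = 0    # fast compiles, slow code: what plain `cargo run` uses

[profile.release]
opt-level = 3    # full optimization: what `cargo build --release` produces
```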

Let's try the optimized version:

# compile for release
cargo build --release

# run compiled binary
target/release/rsfake

First 3 records:
...
Time taken to generate 10000 people:
--- 0.015 seconds ---

This run is 133x faster. You may still notice some variation in the performance gain across runs. Feel free to bump the number of records from 10K to 100K or 1M to find out how it performs at larger scales; just change NO_ROWS in the code. You may be pleasantly surprised.

Next steps

You might be wondering: it’s impressive that the Rust implementation is so much faster, but how does one transform this into a usable DataFrame or a Parquet file? Indeed, there are a few more steps to cover:

  • add CLI (Command Line Interface) parser functionality
  • convert the generated data into a DataFrame
  • export the data to a Parquet file
  • enable multi-threading to scale up the process — we’re aiming for 1 billion records, after all
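
One reason the columnar layout pays off in these next steps: each attribute list becomes a DataFrame column directly, with no per-row transposition. A quick Python sketch of that mapping (pandas here is only a stand-in for whatever DataFrame library part II ends up using, and the values are toy data):

```python
import pandas as pd

# columnar data, as produced by the ColumnTable approach above (toy values)
columns = {
    "id": [1042, 20571, 309],
    "first_name": ["Ada", "Linus", "Grace"],
    "last_name": ["Lovelace", "Torvalds", "Hopper"],
}

# each attribute list becomes a column directly; no row-by-row conversion
df = pd.DataFrame(columns)
print(df.shape)  # (3, 3)

# exporting to Parquet is then a single call (requires pyarrow or fastparquet):
# df.to_parquet("people.parquet")
```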

I will cover this in part II of this series. If you’d rather not wait, the code is already in this repository. It may still have some rough edges, but you’re welcome to explore and use it as you see fit.

Update: click here to read part II.


Seasoned IT practitioner, passionate about programming cloud environments, with a soft spot for AWS. Love to connect: https://linkedin.com/in/anthonypotappel/