Bulk endpoint accepting existing records without ids

21 Apr 2021 - Jorge Garcia

Last updated: 2023-02-19


Creating and updating data in bulk is always tricky. And it can get worse if the data you receive does not contains ids (in the case of an update).

Context

We have the model PagePermission, the user can have many permissions, and each permission refers to a single page.

PagePermission.new(user_id: user_id, page_id: page_id)

# User model

has_many :page_permissions

# PagePermission model

belongs_to :user
belongs_to :page

validates :page, presence: true, uniqueness: { scope: :user_id }
validates :user, presence: true

The feature

Given a list of users, I want to select N users and create permissions for N pages. If the user already has access to any of these pages, we will not create new ones or delete the existing ones, just ignore them.

There is a tricky part of this feature, we don’t receive the ids of the existing permissions, on the app, you only select a user_id alongside a page_id.

Goals

Payload

{
  "page_permissions": [
    { "user_id": 1, "page_id": 1 },
    { "user_id": 2, "page_id": 1 },
    { "user_id": 2, "page_id": 2 },
    { "user_id": 3, "page_id": 2 }
  ]
}

Updated solution, using UPSERT and unique index constraints.

Since I originally wrote this, I found a better way to solve this problem.

Postgres unique constraints

A unique constraint is a single field or combination of fields that uniquely defines a record

Using a unique constraint to validate the page uniqueness coped to the user allow us to use Potgres’ UPSERT.

The upsert allow us to insert many records and we can specify how to resolve the conflict, if any record is not respecting a validation.

Add the index

class MyMigration
  def change
    add_index :page_permissions,
              [:page_id, :user_id],
              unique: true,
              algorithm: :concurrently
  end
end

And then we can use the upsert all method.

PagePermission.upsert_all(
  payload,
  unique_by: [:page_id, :user_id]
)

And that’s it! I love finding simple solutions to hard problems.

Disclaimer: deprecated work ahead, check it out if you want to see an over complicated solution.

How to solve this problem?


Let’s go first with the simplest solution, iterating the array and create one PagePermission on each iteration.

Note: The code here is not 100% accurate with the code in production, but it is kinda similar. Most of this stuff is for test purposes.

I’m gonna start with a simple solution and iterate the idea.

1st Solution


page_permissions_params.each do |page_permission|
  PagePermission.create(**page_permission)
end

I couldn’t run rails performance tests, I found a weird issue on my test app. Given time constraints I write the next simple test but I know is probably not the best way to benchmark this problem.

def setup
  @users = []
  @pages = []

  5.times do
    @users << User.create
    @pages << Page.create
    PagePermission.create(page: @pages.first, user: @users.last)
  end

  generate_params
end

def generate_params
  @params = []

  @users.each do |user|
    @pages.each do |page|
      @params << { page_id: page.id, user_id: user.id }
    end
  end
end

test "Benchmark time" do
  puts Benchmark.measure {
    post path, params: { page_permissions: @params }
  }
end

test "Benchmark memory" do
  Benchmark.memory do |x|
    x.report("endpoint") {
      post path, params: { page_permissions: @params }
    }
  end
end

Results after running multiple times (around 5) and calculating means:

time: .08s
memory: 2.74mb

This solution works just fine, is good enough software but as I am sometimes obsessive with the performance, I wanted to go further into this problem.

The problem with this solution

Let’s improve this

2nd Solution


Use ActiveRecord-Import gem.

This gem executes a single insert statement.

PagePermission.import(page_permissions_params, validate: true)

Running the same performance test, we have better results:

time: .036s
memory: 509k

The problem with this solution

The endpoint allows partial creation of resources. Mixed between existing records and invalid ones. We need to return on the request useful information about why some records couldn’t be created. Thus we need to know which are the records who:

Alongside the mentioned problems, I’m omitting an important one. Why we’re trying to insert records that we know for sure that shouldn’t be created? I’m talking about the existing PagePermissions. I mean, is for simplicity but if we can avoid it, I’ll take it.

3rd Solution


Goals:

A simple solution can be the next one:

page_permissions_params_without_existing_records = page_permissions_params.filter_map do |page_permission|
  page_permission unless PagePermission.find_by(**page_permission)
end

bulk_creation = PagePermission.import(
  page_permissions_params_without_existing_records, validate: true
)

handle_errors(bulk_creation.failed_instances) if bulk_creation.failed_instances.present?

Performance results:

time: .042s
memory: 652k

It’s similar to the 2nd solution but we have a big problem here; we’re executing N+1 Select statements. So the performance would just get worse if the payload grows.

4th Solution


Goals:

Idea 1: Execute a single SELECT statement to find the existing records, then remove the existing records from the INSERT statement doing a hash diff.

Idea 2 Use relational algebra to find the non-existing records and use the result to execute the INSERT statement.

I didn’t like idea #1 that much specifically the hash diff. If I have the option to choose between iterating through hashes or using SQL, I prefer SQL. It may be worth it to explore this solution but for now, I did not take it. I went with the #2.

Exploring the Idea 2

We want to get the non-existing records from the params, the existing ones are stored in a table. So what we’re trying to do is basically a LEFT JOIN! But how we can do it?

There is no table for the non-existing records… but we can use a temporary table.

Let’s do it.

Please note that temporary tables must be wrapped inside a transaction.

bulk_creation = nil

ActiveRecord::Base.transaction do
  create_temporary_page_permission_table
  create_temporary_indexes
  create_temporary_page_permissions

  bulk_creation = PagePermission.import(new_permissions, validate: true)
end

handle_errors(bulk_creation.failed_instances) if bulk_creation.failed_instances.present?

def create_temporary_page_permission_table
  ActiveRecord::Base.connection.execute <<~SQL
    CREATE TEMP TABLE temporary_page_permissions (
      user_id BIG INTEGER NOT NULL,
      page_id BIG INTEGER NOT NULL
    )
  SQL
end

def create_temporary_indexes
  ActiveRecord::Base.connection.execute <<~SQL
    CREATE INDEX index_temporary_page_permissions_on_user_id ON
      temporary_page_permissions (user_id);
    CREATE INDEX index_temporary_page_permissions_on_page_id ON
      temporary_page_permissions (page_id);
  SQL
end

def create_temporary_page_permissions
  values = transform_params_to_values_string

  ActiveRecord::Base.connection.execute <<~SQL
    INSERT INTO temporary_page_permissions (user_id, page_id) VALUES #{values}
  SQL
end

def transform_params_to_values_string
  values = "".dup

  page_permissions_params.each do |page_permission|
    values.concat(
      "(#{page_permission[:user_id]}, #{page_permission[:page_id]}), "
    )
  end
  values.chomp(", ")
end

def new_permissions
  sql = <<~SQL
    SELECT temporary_page_permissions.user_id, temporary_page_permissions.page_id
    FROM temporary_page_permissions
    LEFT JOIN page_permissions ON temporary_page_permissions.user_id = page_permissions.user_id
      AND temporary_page_permissions.page_id = page_permissions.page_id
    WHERE page_permissions.user_id IS NULL AND page_permissions.page_id IS NULL
  SQL

  ActiveRecord::Base.connection.execute(sql).to_a
end

Performance results:

time: .033s
memory: 509k

With this solution, we’re executing: 2 INSERT statements, 1 SELECT statement so we successfully get rid of the N+1 problems. There can be a bottleneck tho, the method transform_params_to_values_string can cause some problems if the payload is too big. Not too much concern for now at least, but is something we need to consider.

Results and conclusion


Solution # Time Memory
1 0.080s 2.74mb
2 0.036s 509k
3 0.042s 652k
4 0.033s 509k

The #4 has the best results and fulfill all goals

I’m really happy with this solution, I made it almost a year ago and is working fine. We still didn’t have to make changes there so the time would tell if this was the best solution or not.

This is one of my favorites problems because I needed to think differently, my solution was not too intuitive and I explore multiple approaches and end up with a creative one.