Joins vs Preload vs Includes vs Eager load in Rails
An in-depth discussion with a practical example to load associated data in Rails
Rails provide a few ways to load associated data and before moving forward let’s consider one scenario as below,
there is a User table that has a one-to-many association with the Post table. With the scenario set, let’s begin with our discussion.
Joins
The Joins works perfectly fine for comparing and filtering data using associated data.
users = User.joins(:posts).where("posts.published = ?", true)
The query will load all users with posts having published attribute value as true and we can do so only because of the magic of joins
.
User Load (1.0ms) SELECT "users".* FROM "users" INNER JOIN "posts" ON "posts"."user_id" = "users"."id" WHERE (posts.published = TRUE) LIMIT $1 [["LIMIT", 11]]
=> #<ActiveRecord::Relation [#<User id: 1, name: "Tushar", email: "tushar@mail.com", created_at: "2021-01-16 14:55:30", updated_at: "2021-01-16 14:55:30">]>
Pros
- It uses INNER JOIN query to load associated data, as a result, associated data does not get impacted.
ex. If the first user has a total of 12 posts,
=> users = User.joins(:posts).where("posts.published = ?", true)
=> users.first.posts.length #it will ouptput 12
Cons
- It does not solve the (n+1) issue all alone, since it does not store associated data for future use. (Note: It can be solved by using joins+preload.
ex.users = User.joins(:posts).where(“posts.published = ?”, true).preload(:posts)
) - It may give a duplicate record in a few cases as it uses INNER JOIN. The solution over it is using select with it.
ex. Let’s consider that the total number of users is 4,
=> users1 = User.joins(:posts)
=> users1.count
# it will ouptput more than 4
# with some duplicate elements
=> users2 = User.joins(:posts).select('distinct users.*')
=> users2.count
# it will output only those user which have at list 1 post
# without duplication
Extra : Joins with select example as below,
user = User.joins(:posts).select(“users.id, posts.name as post_name, posts.published as published”).where(id: 1)
It will give output of ActiveRecord array with post name, user id and post published for user with id as 1, so we can have all associated post name asuser.map(&:post_name)
Preload
The Preload works perfectly fine for loading associated data.
users = User.preload(:posts)
The query will load and store all users with associated posts data.
User Load (0.7ms) SELECT "users".* FROM "users" LIMIT $1 [["LIMIT", 11]]
Post Load (0.5ms) SELECT "posts".* FROM "posts" WHERE "posts"."user_id" IN ($1, $2) [["user_id", 1], ["user_id", 2]]
=> #<ActiveRecord::Relation [#<User id: 1, name: "Tushar", email: "tushar@mail.com", created_at: "2021-01-16 14:55:30", updated_at: "2021-01-16 14:55:30">, #<User id: 2, name: "Yash", email: "yash@mail.com", created_at: "2021-01-16 14:55:30", updated_at: "2021-01-16 14:55:30">]>
Pros
- It stores the associated data, as a result, it solves the (n+1) issue
Cons
- It by default loads data with two separate queries for loading data for the user and its associated posts data
- It can not be used precisely for comparing and filtering for associated data, such as where can not be used for associated data filtering in preload
Includes
The Includes works perfectly fine for loading, comparing and filtering data with associated data.
users = User.includes(:posts).where(posts: { published: true })
The query will load all users with posts having published attribute values as true.
SQL (0.7ms) SELECT DISTINCT "users"."id" FROM "users" LEFT OUTER JOIN "posts" ON "posts"."user_id" = "users"."id" WHERE "posts"."published" = $1 LIMIT $2 [["published", true], ["LIMIT", 11]]
SQL (0.4ms) SELECT "users"."id" AS t0_r0, "users"."name" AS t0_r1, "users"."email" AS t0_r2, "users"."created_at" AS t0_r3, "users"."updated_at" AS t0_r4, "posts"."id" AS t1_r0, "posts"."name" AS t1_r1, "posts"."description" AS t1_r2, "posts"."published" AS t1_r3, "posts"."user_id" AS t1_r4, "posts"."created_at" AS t1_r5, "posts"."updated_at" AS t1_r6 FROM "users" LEFT OUTER JOIN "posts" ON "posts"."user_id" = "users"."id" WHERE "posts"."published" = $1 AND "users"."id" = $2 [["published", true], ["id", 1]]
=> #<ActiveRecord::Relation [#<User id: 1, name: "Tushar", email: "tushar@mail.com", created_at: "2021-01-16 14:55:30", updated_at: "2021-01-16 14:55:30">]>
The includes
uses most of the time preload
by default (hence gives an error when used withwhere
including SQL query inside it without references
) as given below,
> User.includes(:posts).where("posts.published = ?", true)
User Load (1.8ms) SELECT "users".* FROM "users" WHERE (posts.published = TRUE) LIMIT $1 [["LIMIT", 11]]
Traceback (most recent call last):
ActiveRecord::StatementInvalid (PG::UndefinedTable: ERROR: missing FROM-clause entry for table "posts")
LINE 1: SELECT "users".* FROM "users" WHERE (posts.published = TRUE)...
^
So by default, it will fire two queries the same as preload
(unless used as where
with Hash query or references
) as given below,
> User.includes(:posts)
User Load (0.4ms) SELECT "users".* FROM "users" LIMIT $1 [["LIMIT", 11]]
Post Load (0.5ms) SELECT "posts".* FROM "posts" WHERE "posts"."user_id" IN ($1, $2) [["user_id", 1], ["user_id", 2]]
=> #<ActiveRecord::Relation [#<User id: 1, name: "Tushar", email: "tushar@mail.com", created_at: "2021-01-16 14:55:30", updated_at: "2021-01-16 14:55:30">, #<User id: 2, name: "Yash", email: "yash@mail.com", created_at: "2021-01-16 14:55:30", updated_at: "2021-01-16 14:55:30">]>
Note: To use
where
withincludes
will work only with the Hash (given above). To usewhere
with SQL-fragmentreferences
can be used (reference) as below,
User.includes(:posts).where("posts.published = ?", true).references(:posts)
The includes
will by default use preload
unless the references are being made on the association so as to give SQL output (which will be the same as the Hash one) as below,
SQL (0.7ms) SELECT DISTINCT "users"."id" FROM "users" LEFT OUTER JOIN "posts" ON "posts"."user_id" = "users"."id" WHERE (posts.published = TRUE) LIMIT $1 [["LIMIT", 11]]
SQL (0.7ms) SELECT "users"."id" AS t0_r0, "users"."name" AS t0_r1, "users"."email" AS t0_r2, "users"."created_at" AS t0_r3, "users"."updated_at" AS t0_r4, "posts"."id" AS t1_r0, "posts"."name" AS t1_r1, "posts"."description" AS t1_r2, "posts"."published" AS t1_r3, "posts"."user_id" AS t1_r4, "posts"."created_at" AS t1_r5, "posts"."updated_at" AS t1_r6 FROM "users" LEFT OUTER JOIN "posts" ON "posts"."user_id" = "users"."id" WHERE (posts.published = TRUE) AND "users"."id" = $1 [["id", 1]]
=> #<ActiveRecord::Relation [#<User id: 1, name: "Tushar", email: "tushar@mail.com", created_at: "2021-01-16 14:55:30", updated_at: "2021-01-16 14:55:30">]>
Pros
- It can be used for either OUTER LEFT JOIN query or with two separate queries (same as
preload
) to load left and right table-associated data - It solves the (n+1) issue all alone since it stores associated data for future use
- It is smart enough to load associated data with either a single query for simple cases (in terms of joining table) or with two separate queries for other cases (like when working as
preload
). If a query forincludes
needs to force into a single query, then it can be performed with includes+references as discussed earlier (which will work the same aseager_load
)
ex.User.includes(:posts).references(:posts)
Cons
- It can fire two separate queries to load left and right table data (without using any kind of join same as
preload
) - It sometimes uses a LEFT OUTER JOIN query to load associated data, as a result, associated data gets impacted.
ex. Let’s consider if the first user has a total of 12 posts,
=> users = User.includes(:posts).where("posts.published = ?", true)
=> users.first.posts.length
# it may ouptput less than 12 as ex.4
Eager load
The Eager load works perfectly fine for loading, comparing and filtering data using associated data.
users = User.eager_load(:posts).where("posts.published = ?", true)
The query will load all users with posts having published attribute values as true.
SQL (0.7ms) SELECT DISTINCT "users"."id" FROM "users" LEFT OUTER JOIN "posts" ON "posts"."user_id" = "users"."id" WHERE (posts.published = TRUE) LIMIT $1 [["LIMIT", 11]]
SQL (0.6ms) SELECT "users"."id" AS t0_r0, "users"."name" AS t0_r1, "users"."email" AS t0_r2, "users"."created_at" AS t0_r3, "users"."updated_at" AS t0_r4, "posts"."id" AS t1_r0, "posts"."name" AS t1_r1, "posts"."description" AS t1_r2, "posts"."published" AS t1_r3, "posts"."user_id" AS t1_r4, "posts"."created_at" AS t1_r5, "posts"."updated_at" AS t1_r6 FROM "users" LEFT OUTER JOIN "posts" ON "posts"."user_id" = "users"."id" WHERE (posts.published = TRUE) AND "users"."id" = $1 [["id", 1]]
=> #<ActiveRecord::Relation [#<User id: 1, name: "Tushar", email: "tushar@mail.com", created_at: "2021-01-16 14:55:30", updated_at: "2021-01-16 14:55:30">]>
Pros
- It uses a LEFT OUTER JOIN query to load associated data
- It solves the (n+1) issue all alone since it stores associated data for future use.
- Usually, it loads data with a single query for users and its associated post data in terms of joining the table.
Cons
- It uses a LEFT OUTER JOIN query to load associated data, as a result, associated data gets impacted.
ex. If the first user has a total of 12 posts,
=> users = User.eager_load(:posts).where("posts.published = ?", true)
=> users.first.posts.length
# it may ouptput less than 12 as ex.4
Extra: Since eager_load force data to load with a single query (while joining hte table) to load user and its associated post data, it takes comparatively more time to load a query than to load two separate query in most of the cases. So it depends more on the use-case to use joins/preload/includes/eager_load while loading/comparing/filtering associated data.