The Problem
Imagine that you need to query an API for a large amount of data. When you execute the query, the server will give you back some JSON that looks something like this:
{
  data_entries: [ LOTS_OF_DATA, LOTS_OF_DATA, LOTS_OF_DATA ],
  last_page: false
}
Notice that since we’re dealing with a large amount of data, the API has paginated the response for us.
This pagination is not fun to deal with, as it means that sometimes we’ll need to make multiple calls to the API. Let’s see how we can set up some convenience methods for working with this.
Start by Imagining the Interface We’d Like to Use
It would be great to be able to call a method, data_entries, that returns something that behaves similarly to an array (that is, an Enumerable):
data_entries(query_options).each do |data_entry|
  # work with data
end
Here, query_options is just an imaginary hash of options that we might pass into this method, for example search terms or sorting options. These are handled on the API side and are not our concern.
It would also be nice if we could use other Enumerable methods, for example, fetching an array with the first 5 elements:
data_entries(query_options).first(5)
Importantly, our data_entries method must be lazy! Because we’re dealing with a large amount of data, it must retrieve as few entries as possible.
Enumerator basics
Let’s look at a silly example to get a handle on the basics.
people = Enumerator.new do |yielder|
  yielder.yield 'George'
  yielder.yield 'Karl'
  yielder.yield 'Janet'
end
- Calling people.next will return 'George'.
- Calling people.next again will return 'Karl'.
- Calling people.next two more times will return 'Janet' and then raise a StopIteration exception!
Essentially, an Enumerator will execute its block until it reaches a yielder.yield call, and then stop and await further instructions. When we ask for another element and there are no more yields left, it will raise a StopIteration exception!
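To make the "stop and await further instructions" behaviour visible, here is a small sketch of the same people example with some logging added. The log array and its messages are my own additions for illustration; they are not part of the original example.

```ruby
log = [] # hypothetical helper: records how far the block has run

people = Enumerator.new do |yielder|
  log << 'before George'
  yielder.yield 'George'
  log << 'before Karl'
  yielder.yield 'Karl'
  log << 'before Janet'
  yielder.yield 'Janet'
  log << 'after Janet'
end

first_name = people.next      # runs the block up to the first yield
log_after_first = log.dup     # only 'before George' has been logged so far

second_name = people.next     # resumes the block up to the second yield
log_after_second = log.dup

third_name = people.next

# A fourth call runs the rest of the block, finds no more yields, and raises.
begin
  people.next
  raised = false
rescue StopIteration
  raised = true
end
```

Nothing inside the block runs until the first call to next, and each call only advances as far as the next yield.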
So, now we have a plan of attack. What we’re going to do is loop through all of our data_entries and do a separate yield call for each one!
Enumerator to the rescue!
I’m going to show you the code up front. Don’t worry, I’ll go through it step by step!
def data_entries(query_options={})
  query_options[:page] = 1
  results = {}
  Enumerator.new do |yielder|
    loop do
      raise StopIteration if results[:last_page] == true
      results = call_api(query_options)
      results[:data_entries].each { |entry| yielder.yield entry }
      query_options[:page] += 1
    end
  end
end
def call_api(query_options)
  # this method handles details of querying the API,
  # and returns results as JSON in the format specified
  # at the beginning of the article
end
We’re defining a method called data_entries that will take a hash of query options (for the API’s use).
def data_entries(query_options={})
Next we do some setup by setting the initial page number to 1, and setting results to an empty hash so that the results[:last_page] check doesn’t raise an exception on the first pass through the loop.
query_options[:page] = 1
results = {}
Now we’ll instantiate a new Enumerator.
Enumerator.new do |yielder|
We start an infinite loop (this is similar to writing something like while true).
loop do
Here we have the exit clause for the loop:
raise StopIteration if results[:last_page] == true
This will raise an exception once the API has told us that the page we just processed was the last page of the results. The exception is rescued for us by loop (which treats StopIteration as the signal to stop looping), so the block finishes and we’re at the end of our Enumerator.
Next up, we’ve got the actual API call that will return the hash of results!
results = call_api(query_options)
results[:data_entries].each { |entry| yielder.yield entry }
query_options[:page] += 1
Note that we’ll need to further loop through the array of :data_entries, and yield each one in turn. That will give us a separate, individual data_entry for each iteration of our data_entries().each call! Finally, we bump the page number so that the next pass through the loop asks for the next page.
We’ve also got another method defined here, one which will actually do the API query and return proper JSON to us. I’ve not included the details, as they don’t really matter for our purposes.
def call_api(query_options)
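If you want to run the examples in this post without a real API, a stand-in for call_api could look something like the sketch below. Everything in it is an assumption for illustration: FAKE_PAGES is made-up data, and the real method would make an HTTP request instead of building the response body itself. The one detail worth noticing is symbolize_names: true, which makes JSON.parse return symbol keys such as :data_entries and :last_page, the form that data_entries expects.

```ruby
require 'json'

# Hypothetical data: 3 pages of 2 entries each.
FAKE_PAGES = [%w[a b], %w[c d], %w[e f]].freeze

def call_api(query_options)
  page = query_options.fetch(:page)

  # Stand-in for the response body that a real server would send back.
  body = JSON.generate({
    data_entries: FAKE_PAGES.fetch(page - 1),
    last_page: page == FAKE_PAGES.length
  })

  # Parse the JSON into a hash with symbol keys.
  JSON.parse(body, symbolize_names: true)
end

first_page = call_api({ page: 1 })
final_page = call_api({ page: 3 })
```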
All together one more time:
def data_entries(query_options={})
  query_options[:page] = 1
  results = {}
  Enumerator.new do |yielder|
    loop do
      raise StopIteration if results[:last_page] == true
      results = call_api(query_options)
      results[:data_entries].each { |entry| yielder.yield entry }
      query_options[:page] += 1
    end
  end
end
Remember! The Enumerator executes until it encounters a yield, and then it waits for a request for the next element. Upon receiving that request, it will resume execution until the next yield statement, and so on.
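One thing to be aware of: in the version above, the page counter and results are set up outside the Enumerator’s block, so the caller’s hash gets a :page key written into it, and iterating the same Enumerator object a second time starts from the leftover state rather than from page 1. If that matters for your project, one possible variation (a sketch, not the code from the original project) is to keep that state inside the block. FAKE_PAGES and this call_api are hypothetical stand-ins so the example can run on its own.

```ruby
# Hypothetical stand-in for the real API: 3 pages of 2 entries each.
FAKE_PAGES = [%w[a b], %w[c d], %w[e f]].freeze

def call_api(query_options)
  page = query_options[:page]
  { data_entries: FAKE_PAGES[page - 1], last_page: page == FAKE_PAGES.length }
end

def data_entries(query_options = {})
  Enumerator.new do |yielder|
    # Copy the caller's options and start from page 1 on every run of the block.
    options = query_options.merge(page: 1)
    loop do
      results = call_api(options)
      results[:data_entries].each { |entry| yielder.yield entry }
      break if results[:last_page]
      options[:page] += 1
    end
  end
end

caller_options = { sort: 'name' }
entries = data_entries(caller_options)
first_pass = entries.to_a
second_pass = entries.to_a # the block runs again from page 1
```

The behaviour for a single pass is the same as before; the only difference is where the state lives.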
Does this work for our Purposes?
Almost! With the above implementation, we can now perform our desired each loop!
data_entries(query_options).each do |data_entry|
  # work with data
end
At any point in our each loop, we can break out, and we will have only queried for the number of pages required, no more.
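A quick way to convince yourself of this is to count the requests made against a fake API. In the sketch below, FAKE_PAGES, REQUESTED_PAGES and this version of call_api are all hypothetical stand-ins; data_entries is the method from this post.

```ruby
# Hypothetical stand-in for the real API: 3 pages of 2 entries each.
FAKE_PAGES = [%w[a b], %w[c d], %w[e f]].freeze
REQUESTED_PAGES = [] # records every page number the fake API is asked for

def call_api(query_options)
  page = query_options[:page]
  REQUESTED_PAGES << page
  { data_entries: FAKE_PAGES[page - 1], last_page: page == FAKE_PAGES.length }
end

def data_entries(query_options = {})
  query_options[:page] = 1
  results = {}
  Enumerator.new do |yielder|
    loop do
      raise StopIteration if results[:last_page] == true
      results = call_api(query_options)
      results[:data_entries].each { |entry| yielder.yield entry }
      query_options[:page] += 1
    end
  end
end

seen = []
data_entries.each do |data_entry|
  seen << data_entry
  break if data_entry == 'c' # 'c' is the first entry on page 2
end
```

After the break, seen holds the three entries we looked at and REQUESTED_PAGES holds only pages 1 and 2; page 3 is never requested.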
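However, if we put an eager method such as map in front of first, like this:
data_entries(query_options).map { |data_entry| process(data_entry) }.first(5)
It’s going to take a while, because map works through every single page before first gets to pick its 5 elements! Uh oh! (Here process is just a placeholder for whatever work you do on each entry. A bare first(5) is fine on its own, since first stops iterating as soon as it has what it asked for.)
A Dash of Laziness
(note, this feature requires Ruby 2.0.0 or later)
In order to avoid loading all of the entries when we only want the first 5 results, we’ll have to explicitly mark our enumerator as lazy, like so:
data_entries(query_options).lazy.map { |data_entry| process(data_entry) }.first(5)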
The details of how this works are quite complicated, so I’m not going to go into specifics.
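Without going into those details, we can still measure the effect. The sketch below compares an eager map with a lazy one against a fake API and records which pages each of them requests. FAKE_PAGES, REQUESTED_PAGES and this call_api are hypothetical stand-ins, and upcase simply plays the role of "some work done on each entry".

```ruby
# Hypothetical stand-in for the real API: 3 pages of 2 entries each.
FAKE_PAGES = [%w[a b], %w[c d], %w[e f]].freeze
REQUESTED_PAGES = [] # records every page number the fake API is asked for

def call_api(query_options)
  page = query_options[:page]
  REQUESTED_PAGES << page
  { data_entries: FAKE_PAGES[page - 1], last_page: page == FAKE_PAGES.length }
end

def data_entries(query_options = {})
  query_options[:page] = 1
  results = {}
  Enumerator.new do |yielder|
    loop do
      raise StopIteration if results[:last_page] == true
      results = call_api(query_options)
      results[:data_entries].each { |entry| yielder.yield entry }
      query_options[:page] += 1
    end
  end
end

# Eager: map builds an array from every entry before first(3) runs.
eager = data_entries.map(&:upcase).first(3)
eager_pages = REQUESTED_PAGES.dup

REQUESTED_PAGES.clear

# Lazy: entries are pulled through map one at a time, and first(3) stops early.
lazy = data_entries.lazy.map(&:upcase).first(3)
lazy_pages = REQUESTED_PAGES.dup
```

Both calls return the same three elements, but the eager version requests all three pages while the lazy version stops after page 2.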
Recommended viewing & reading:
- Ruby Tapas has some great episodes on Enumerators
- Pat Shaughnessy has an incredibly in-depth article on lazy enumerators.
Thanks!
Hopefully this post has given you a basic understanding of how to incorporate Enumerators into your project!
For this post, I’ve manipulated actual project code into context-independent code. As such, I could have easily made some mistakes! Sorry about that, and if you catch anything wrong please let me know.
If you have any questions or comments tweet me @bolandrm! Thanks for reading!