A look at Enumerators and Laziness

The Problem

Imagine that you need to query an API for a large amount of data. When you execute the query, the server will give you back some JSON that looks something like this:

{ 
  data_entries: [ LOTS_OF_DATA, LOTS_OF_DATA, LOTS_OF_DATA ],
  last_page: false
}

Notice that since we’re dealing with a large amount of data, the API has paginated the response for us.

This pagination is not fun to deal with, as it means that sometimes we’ll need to make multiple calls to the API. Let’s see how we can set up some convenience methods for working with this.

Start by Imagining the Interface We’d Like to Use

It would be great to be able to call a method, data_entries, that returns something that behaves like an array (that is, an Enumerable):

data_entries(query_options).each do |data_entry|
  # work with data
end

Here, query_options is just an imaginary hash of options that we might pass into this method, for example search terms or sorting options. These are handled on the API side and are not our concern.

It would also be nice if we could use other Enumerable methods, for example, fetching an array with the first 5 elements:

data_entries(query_options).first(5)

Importantly, our data_entries method must be lazy! Because we’re dealing with a large amount of data, it must retrieve as few entries as possible.

Enumerator basics

Let’s look at a silly example to get a handle on the basics.

people = Enumerator.new do |yielder|
  yielder.yield 'George'
  yielder.yield 'Karl'
  yielder.yield 'Janet'
end

  • Calling people.next will return 'George'.
  • Calling people.next again will return 'Karl'.
  • Calling people.next a third time will return 'Janet'.
  • Calling people.next a fourth time will result in a StopIteration exception being raised!

Essentially, an Enumerator will execute until it reaches a yield statement, and then stop and await further instructions. When there are no more yields left, it will raise an exception!
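To make the stop-and-wait behaviour visible, here is a small sketch that records what the block has executed so far. The log array and the variable names are only there for illustration.

```ruby
log = []

people = Enumerator.new do |yielder|
  log << 'about to yield George'
  yielder.yield 'George'
  log << 'about to yield Karl'
  yielder.yield 'Karl'
  log << 'about to yield Janet'
  yielder.yield 'Janet'
end

# Nothing in the block has run yet; it only starts on the first call to next.
log_before_first = log.dup

first_name = people.next      # runs the block up to the first yield only
log_after_first = log.dup     # => ['about to yield George']

second_name = people.next     # resumes, runs up to the second yield
third_name = people.next      # resumes, runs up to the third yield

# A fourth call resumes the block, which finishes without yielding again.
finished = begin
  people.next
  false
rescue StopIteration
  true
end
```

Only one log entry exists after the first call to next, which shows that the rest of the block has not been executed yet.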

So, now we have a plan of attack. What we’re going to do is loop through all of our data_entries and do a separate yield call for each one!

Enumerator to the rescue!

I’m going to show you the code up front. Don’t worry, I’ll go through it step by step!

def data_entries(query_options={})
  query_options[:page] = 1
  results = {}
  
  Enumerator.new do |yielder|
    loop do
      raise StopIteration if results[:last_page] == true
      results = call_api(query_options)
      results[:data_entries].each { |entry| yielder.yield entry }
      query_options[:page] += 1
    end
  end
end

def call_api(query_options)
  # this method handles details of querying the API,
  # and returns the parsed JSON response as a hash with
  # symbol keys, in the format specified at the
  # beginning of the article
end

We’re defining a method called data_entries that will take a hash of query options (for the API’s use).

def data_entries(query_options={})

Next we do some setup by setting the initial page number to 1, and setting results to an empty hash so that we don’t raise an exception on the first loop.

query_options[:page] = 1
results = {}

Now we’ll instantiate a new Enumerator.

Enumerator.new do |yielder|

We start an infinite loop (this is similar to writing something like while true).

loop do

Here we have the exit clause for the loop:

raise StopIteration if results[:last_page] == true

This will raise an exception once the API has told us that the page we just processed was the last page of the results. The exception is rescued for us by loop, which ends the loop, lets the block finish, and so signifies that we’re at the end of our Enumerator.
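The fact that loop quietly swallows StopIteration can be checked in isolation. This is a minimal sketch; the page counter and the limit of 3 are made up for the example.

```ruby
pages_seen = []
page = 0

loop do
  # Kernel#loop rescues StopIteration and simply stops looping.
  raise StopIteration if page == 3
  page += 1
  pages_seen << page
end

# Execution carries on here; no exception escapes the loop.
reached_end = true
```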

Next up, we’ve got the actual API call that will return the hash of results!

results = call_api(query_options)
results[:data_entries].each { |entry| yielder.yield entry }
query_options[:page] += 1

Note that we’ll need to further loop through the array of :data_entries, and yield each one in turn. That will give us a separate, individual data_entry for each iteration of our data_entries().each call!

We’ve also got another method defined here, one which will actually do the API query and return proper JSON to us. I’ve not included the details, as they don’t really matter for our purposes.

def call_api(query_options)
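If you want to experiment without a real server, call_api can be replaced with an in-memory stand-in. The sketch below is purely hypothetical: FAKE_PAGES and the entry names are invented, and a real implementation would make an HTTP request and parse the response instead.

```ruby
# Hypothetical stand-in for the real API: three pages of two entries each.
FAKE_PAGES = [
  %w[entry_1 entry_2],
  %w[entry_3 entry_4],
  %w[entry_5 entry_6]
].freeze

def call_api(query_options)
  page = query_options.fetch(:page)
  {
    data_entries: FAKE_PAGES.fetch(page - 1),  # pages are numbered from 1
    last_page: page == FAKE_PAGES.length
  }
end
```

It returns a hash with the same two keys as the response shown at the beginning of the article, so it can be dropped in next to data_entries as is.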

All together one more time:

def data_entries(query_options={})
  query_options[:page] = 1
  results = {}
  
  Enumerator.new do |yielder|
    loop do
      raise StopIteration if results[:last_page] == true
      results = call_api(query_options)
      results[:data_entries].each { |entry| yielder.yield entry }
      query_options[:page] += 1
    end
  end
end

Remember! The Enumerator is executing until it encounters a yield, and then it is awaiting a request for the next element. Upon receiving that request, it will resume execution until the next yield statement, and so on.
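One thing to be aware of: in the version above, the page counter and results live outside the Enumerator block, and the caller’s hash is modified. Enumerating the same object a second time would therefore carry on from wherever the first pass stopped. A possible variant, not the code from my project, keeps that state inside the block. The two-page call_api stub here is invented for the example.

```ruby
# Hypothetical stub: two pages, for illustration only.
def call_api(query_options)
  pages = { 1 => %w[a b], 2 => %w[c d] }
  page = query_options.fetch(:page)
  { data_entries: pages.fetch(page), last_page: page == pages.size }
end

def data_entries(query_options = {})
  Enumerator.new do |yielder|
    # State is created inside the block, so every enumeration starts at
    # page 1 and the caller's hash is never modified.
    page = 1
    loop do
      results = call_api(query_options.merge(page: page))
      results[:data_entries].each { |entry| yielder.yield entry }
      break if results[:last_page]
      page += 1
    end
  end
end

options = { sort: 'newest' }
entries = data_entries(options)
first_pass = entries.to_a
second_pass = entries.to_a   # starts again from page 1
```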

Does this work for our Purposes?

Almost! With the above implementation, we can now perform our desired each loop!

data_entries(query_options).each do |data_entry|
  # work with data
end

At any point in our each loop, we can break out, and we will have only queried for the number of pages required, no more.
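To convince ourselves of this, we can record which pages get requested. In this sketch the call_api stub, the REQUESTED_PAGES array, and the entry names are all made up; data_entries is the method from above.

```ruby
# Hypothetical stub that records every page it is asked for.
REQUESTED_PAGES = []

def call_api(query_options)
  page = query_options[:page]
  REQUESTED_PAGES << page
  { data_entries: ["page#{page}-a", "page#{page}-b"], last_page: page == 100 }
end

# The method from the article, unchanged.
def data_entries(query_options={})
  query_options[:page] = 1
  results = {}

  Enumerator.new do |yielder|
    loop do
      raise StopIteration if results[:last_page] == true
      results = call_api(query_options)
      results[:data_entries].each { |entry| yielder.yield entry }
      query_options[:page] += 1
    end
  end
end

seen = []
data_entries.each do |data_entry|
  seen << data_entry
  break if seen.length == 3   # stop partway through the second page
end
```

After the break, REQUESTED_PAGES contains only pages 1 and 2, even though the fake API has 100 pages.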

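However, if we chain another Enumerable method such as select or map before asking for the first 5 elements:

data_entries(query_options).select { |entry| wanted?(entry) }.first(5)

It’s going to take a while, because select eagerly builds a complete array first, so we’re querying for every single page! Uh oh! (Here wanted? is just a made-up predicate standing in for whatever filter you need. Calling first(5) directly on the enumerator is fine, since it stops iterating as soon as it has 5 entries.)

A Dash of Laziness

(note, this feature requires Ruby 2.0.0 or later)

In order to avoid loading all of the entries when we only want the first 5 matches, we’ll have to explicitly mark our enumerator as lazy, like so:

data_entries(query_options).lazy.select { |entry| wanted?(entry) }.first(5)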

The details of how this works are quite complicated, so I’m not going to go into specifics.
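Even without the internals, a rough way to see the effect of lazy is to count the pages requested when a select is chained in, with and without it. Everything about the stub below is an assumption made for the example: the REQUESTED_PAGES array, the 50 pages, and the numeric entries. data_entries is the method from above.

```ruby
# Hypothetical stub: 50 pages of 20 consecutive integers each.
REQUESTED_PAGES = []

def call_api(query_options)
  page = query_options[:page]
  REQUESTED_PAGES << page
  first = (page - 1) * 20 + 1
  { data_entries: (first..first + 19).to_a, last_page: page == 50 }
end

# The method from the article, unchanged.
def data_entries(query_options={})
  query_options[:page] = 1
  results = {}

  Enumerator.new do |yielder|
    loop do
      raise StopIteration if results[:last_page] == true
      results = call_api(query_options)
      results[:data_entries].each { |entry| yielder.yield entry }
      query_options[:page] += 1
    end
  end
end

# Eager: select walks the whole enumerator before first(5) runs.
eager = data_entries.select(&:even?).first(5)
eager_pages = REQUESTED_PAGES.length

REQUESTED_PAGES.clear

# Lazy: entries are pulled through select one at a time, on demand.
lazy = data_entries.lazy.select(&:even?).first(5)
lazy_pages = REQUESTED_PAGES.length
```

Both calls return the same five entries, but the eager chain requests all 50 pages while the lazy chain requests only the first one.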

Recommended viewing & reading:

Thanks!

Hopefully this post has given you a basic understanding of how to incorporate Enumerators into your project!

For this post, I’ve manipulated actual project code into context-independent code. As such, I could have easily made some mistakes! Sorry about that, and if you catch anything wrong please let me know.

If you have any questions or comments tweet me @bolandrm! Thanks for reading!