Ray Tracing with Akka – Part 3

Where we left off

In the last two posts, we built up the capabilities of a simple ray tracer. We now have two types of shapes (sphere and plane), and we can calculate shadows, reflections and refractions recursively. Before we move on to the more computation-heavy shapes, I am curious about how I could move our computations to separate nodes. So we will first build a cluster!

 


 

Cluster

Before this ‘project’ I had never tried Akka clusters, so I have learned a lot! After I read the docs (at least twice), I had a clear vision of what I want to achieve: a grid built from worker nodes, plus some ‘master’ nodes. When a master joins the cluster, it tells the worker nodes about the actual scene, so these workers can build the scene locally. When a worker has finished building the scene and is ready to compute rays, it asks the corresponding master for work, solves the task and sends the result back. I will split the app at the renderer-scene connection and divide it into renderer-masterNode–[cluster]–workerNode-scene.

The responsibilities:

  • Worker
    1. Can build a scene from some kind of message
    2. Can track master-scene pairs
    3. Can request tasks, compute them on the given scene and send the results back
    4. Can tear down a scene when the master leaves the cluster (due to error or work completion)
  • Master
    1. Can tell what scene it wants to raytrace
    2. Can start a renderer
    3. Can track which trace messages have been requested and which have been answered by workers
    4. Can distribute work

In my first iteration I wanted to achieve W3 plus M2, M3 and M4. This is just splitting the renderer and the computation onto two nodes. Before we start, I want to talk about my performance expectations. On my computer, the time to render the previous scene was about 6.3-6.8 seconds. We will use a new transport layer and run two actor systems instead of one, so my expectation was that it would need about 2-3x more time to compute the same picture. My first tries with the cluster implementation needed 10-12x more time. It was disappointing. I asked some questions in the Akka Gitter channel, got some ideas and started experimenting. The end result is about 4x slower than the original single-machine code. But with a more computation-hungry scene and with more nodes (which are actually on other computers), I think this slow-down can be reversed at some point. (I will try to tune the performance of my implementation, maybe in the next post, and measure its speed in a bigger test environment, but for now I left it like this.)

We want a cluster! We need to modify the configs first. Create a new resources folder under src/main and create a new file (cluster.conf) in it.

Some words about what this config contains and what I tried before settling on these values. First of all, you need to change the actor provider to cluster, because we want a cluster 🙂

If you read the docs, they describe the options to replace the Java serializer with Protobuf, Kryo or even something you wrote yourself. I tried out Chill from Twitter and Kryo. Both produced warnings on the output, and in my case they didn’t change the performance in any way (the performance was poor because of other things, which you can see below). So because of the warnings and the lack of performance improvements, I simply switched back to the built-in Java serializer and turned off the warning.

I tried out both Netty and Artery. In my case, they performed about the same; Artery was maybe a bit better, and it’s newer and cooler, so I chose that. (I left the Netty-specific config commented out in case I change my mind.)

I added two seed-node addresses, turned off the old metrics, set the minimum number of members to two, and enabled “multi-mbeans-in-same-jvm” (because I will run the master and the worker from one method while playing).
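Based on the settings listed above, a minimal cluster.conf could look something like this (the actor system name, host and ports are illustrative, not the exact values from my repo):

```hocon
akka {
  actor {
    # we want a cluster
    provider = "cluster"
    # Kryo/Chill produced warnings and no speedup, so: built-in Java serialization
    warn-about-java-serializer-usage = off
  }
  remote {
    artery {
      enabled = on
      canonical.hostname = "127.0.0.1"
      canonical.port = 2551
    }
    # the Netty equivalent stays here, commented out, in case I change my mind
    # netty.tcp.hostname = "127.0.0.1"
    # netty.tcp.port = 2551
  }
  cluster {
    seed-nodes = [
      "akka://RayTracerSystem@127.0.0.1:2551",
      "akka://RayTracerSystem@127.0.0.1:2552"
    ]
    min-nr-of-members = 2
    # turn off the old metrics
    metrics.enabled = off
    # master and worker will run in the same JVM while playing
    jmx.multi-mbeans-in-same-jvm = on
  }
}
```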

 

Back to the naive implementation vs. performance tuning. Our application is CPU- and message-heavy. The renderer sends out a lot of trace messages. The naive implementation would be

“W: give me one trace,
M: trace,
W: (after computing) done,
W: give me one trace”.

This distributes the work to whichever node is free, but the communication and waiting between the “give me one” and the “trace” is a waste of time. We could, on the other hand, batch the messages like

“W: give me work,
M: trace,
M: trace,
M: trace,
M: trace…” or

“W: give me work,
M: Seq[trace]”.

In my tests the second method was far faster (10x slower vs. 20x slower than the non-clustered version). So I batch the master→worker direction. I tested batching in the other direction too, but it was slower. I have theories as to why batching helped in one direction but not in the other; if someone wants to share their own, leave a comment below. 😀

Move the cluster code into a subpackage with its own package object and the communication classes.
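Something like this: the package object holds the messages that cross the cluster boundary (ColorMessageAns and the trace message type come from the earlier parts; the other names are placeholders of mine):

```scala
// cluster/package.scala – everything here must be serializable,
// since these messages travel between nodes (we stayed on Java serialization)
package object cluster {
  case object RegisterWorker                           // worker -> master, after memberUp
  case object GiveMeWork                               // worker -> master, ask for a batch
  final case class JobPackage(jobs: Seq[TraceMessage]) // master -> worker, batched traces
}
```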

The worker node’s object has the main function with the port config and the worker node actor start.
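A minimal sketch of that main, assuming the cluster.conf above (object and system names are illustrative):

```scala
import akka.actor.{ActorSystem, Props}
import com.typesafe.config.ConfigFactory

object WorkerNodeMain extends App {
  // the port can come from the command line, so several workers can run on one machine
  val port = if (args.isEmpty) "2552" else args(0)
  val config = ConfigFactory
    .parseString(
      s"""akka.remote.artery.canonical.port = $port
         |akka.cluster.roles = ["worker"]""".stripMargin)
    .withFallback(ConfigFactory.load("cluster"))

  val system = ActorSystem("RayTracerSystem", config)
  system.actorOf(Props[WorkerNode], "workerNode")
}
```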

It is a dirty implementation, but the WorkerNode actor is not complex.

At start, it subscribes to memberUp events; at stop, it leaves the cluster.

The worker

  • sends a registration message when a master node comes up,
  • initiates the burn-in scene and requests work when it receives a “createScene” String,
  • sends the traces to the scene when it gets a package of them,
  • relays the ColorMessageAns-es from the scene to the master, and
  • requests new work when all of the batch is computed.
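Put together, a sketch of this first, single-master WorkerNode (the Scene props and the exact message shapes are my assumptions; error handling is omitted):

```scala
import akka.actor.{Actor, ActorLogging, ActorRef, Props, RootActorPath}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{InitialStateAsEvents, MemberUp}

class WorkerNode extends Actor with ActorLogging {
  val cluster = Cluster(context.system)
  var master: Option[ActorRef] = None
  var scene: Option[ActorRef] = None
  var pending = 0 // traces handed to the scene but not yet answered

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents, classOf[MemberUp])

  override def postStop(): Unit = cluster.leave(cluster.selfAddress)

  def receive: Receive = {
    case MemberUp(m) if m.hasRole("master") =>
      // a master came up: introduce ourselves
      context.actorSelection(RootActorPath(m.address) / "user" / "masterNode") ! RegisterWorker

    case "createScene" =>
      master = Some(sender())
      scene = Some(context.actorOf(Props[Scene], "scene")) // build the burn-in scene
      sender() ! GiveMeWork

    case JobPackage(jobs) =>
      pending = jobs.size
      jobs.foreach(job => scene.foreach(_ ! job))

    case ans: ColorMessageAns =>
      master.foreach(_ ! ans)
      pending -= 1
      if (pending == 0) master.foreach(_ ! GiveMeWork) // batch done, ask for more
  }
}
```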

(We didn’t need to touch the scene code. This is a good sign.)

The MasterNode object’s main function is nearly the same as the WorkerNode’s. (We only change the role string and the actor init.)

In the MasterNode implementation we wait for worker registrations. When a worker registers, we add it to our inner list (and if it is the first one, we start the renderer), then send it a “createScene” message. We handle trace messages from the renderer and write them into an inner map. We handle the ColorMessageAns-es from the workers: if a ColorMessageAns is in the map, we remove it and relay it to the renderer; if it is not in the map, we got this answer twice, so we log it and drop it. And when a worker asks for work, we make a jobPackage from the already-sent but not-yet-answered jobs. (This part of the code can cause a deadlock right now, because of the filter, if some messages get dropped. I have never seen it happen; I just want to mention that I know it’s not safe yet.)
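A sketch of that MasterNode (the renderer props, the pixel-keyed map and the batch size are my assumptions):

```scala
import akka.actor.{Actor, ActorLogging, ActorRef, Props}

class MasterNode extends Actor with ActorLogging {
  val batchSize = 1000 // illustrative
  var workers = List.empty[ActorRef]
  var renderer: Option[ActorRef] = None
  // requested by the renderer, not yet answered by any worker
  var inFlight = Map.empty[(Int, Int), TraceMessage]

  def receive: Receive = {
    case RegisterWorker =>
      if (workers.isEmpty) renderer = Some(context.actorOf(Props[Renderer], "renderer"))
      workers ::= sender()
      sender() ! "createScene"

    case t: TraceMessage => // from the renderer
      inFlight += ((t.x, t.y) -> t)

    case ans: ColorMessageAns => // from a worker
      if (inFlight.contains((ans.x, ans.y))) {
        inFlight -= ((ans.x, ans.y))
        renderer.foreach(_ ! ans)
      } else {
        log.debug("answer for ({},{}) arrived twice, dropping", ans.x, ans.y)
      }

    case GiveMeWork =>
      // re-package from the sent-but-unanswered jobs – this is the unsafe part
      sender() ! JobPackage(inFlight.values.take(batchSize).toSeq)
  }
}
```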

If we modify the Main function, we can run our ‘cluster’, and it will produce a picture. It is a small win because of the performance drop, but it works 🙂 An even bigger win is that we could wrap the functionality of the original code without any modification!

At this point, I made a small refactor and moved all the shape-related objects to a ‘shapes’ package (Shape, Sphere, Plane, Reflective, Refractive).

Cluster phase 2

We have a working cluster implementation, but it can handle only one master, and we need to redeploy the workers if we want a new scene. Let’s make our scene build dynamically from initial data. For this we will need to provide a list of shape-data, and we need to build concrete shape actors from this list. And here we need to stop for a moment: whose responsibility is it to know how to create the concrete actor from the data? We had a little discussion about this matter, came up with three insufficient solutions, and I chose the least bad.

The ideas:

  • The naive, grade-one method would be writing a big match-case: if the data is sphere-data, then do Sphere(data). The problem with this approach is that we end up with a god class, and whenever we create a new shape and shape-data, we need to register the shape-data with that shape object.
  • Another approach is adding the responsibility of creation to the data object, so every shape-data is capable of calling its own create function: shapedata.create. The problem with this approach is: why does a data object contain business logic?
  • The third one is creating a registration service where all of the factories can register; when it gets concrete shape-data, it relays the data to the appropriate factory. The problem with this is the programming overhead of the method.

Below is my implementation, but I’m not so sure that there isn’t a better (Scala) solution, so if you have an alternative method, please leave a comment below.

Let’s start to refactor all the things!

Create the shape-data objects with the proper “factory” call.
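A sketch of what I mean, with illustrative fields (the Vec3 type and the props signatures are assumptions based on the earlier parts):

```scala
import akka.actor.{ActorRef, ActorRefFactory}

sealed trait ShapeData extends Serializable {
  // the data knows how to create its own shape actor
  def create(factory: ActorRefFactory, id: String): ActorRef
}

final case class SphereData(center: Vec3, radius: Double) extends ShapeData {
  def create(factory: ActorRefFactory, id: String): ActorRef =
    factory.actorOf(Sphere.props(center, radius), id)
}

final case class PlaneData(point: Vec3, normal: Vec3) extends ShapeData {
  def create(factory: ActorRefFactory, id: String): ActorRef =
    factory.actorOf(Plane.props(point, normal), id)
}
```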

Modify the scene to create itself from shape-data.

It may not be a must-have step, but you should make the ids more id-like in order to avoid possible collisions caused by the “x,y”-like id generation on the masters.
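For example (one possible way, not necessarily what the final code does):

```scala
// a globally unique id instead of the "x,y"-style one
val shapeId = s"shape-${java.util.UUID.randomUUID()}"
```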

We can delete the AddShape from the package object.

Create a CreateScene message object in the package.

Move the scene definition to the main as a Seq[ShapeData]. When a worker registers, send the scene-data to it.
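Something like this (the values are illustrative, not my actual scene):

```scala
val sceneData: Seq[ShapeData] = Seq(
  SphereData(Vec3(0, 1, 5), 1.0),
  SphereData(Vec3(2, 1, 6), 1.0),
  PlaneData(Vec3(0, 0, 0), Vec3(0, 1, 0))
)

// later, when a worker registers:
// workerRef ! CreateScene(sceneData)
```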

It seems like a huge difference at first, but it isn’t. When a master sends the worker a CreateScene message, the worker creates a scene from the given data. It needs to relay the traces from the master to the scene, and the answers back from the scene to the master, so we need a lookup in both directions. I inserted the ActorRef↔ActorRef pairs into a map both as key→value and as value→key; from there, I can resolve the destination from the sender in either direction. When a master goes down or becomes unreachable, the worker needs to tear the scene down. This is our cleanup function.
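A sketch of that two-direction lookup inside the WorkerNode (the function names are mine):

```scala
// both (master -> scene) and (scene -> master) live in the same map
var pairs = Map.empty[ActorRef, ActorRef]

def register(master: ActorRef, scene: ActorRef): Unit =
  pairs = pairs + (master -> scene) + (scene -> master)

// resolve the destination from the sender, in either direction
def destinationOf(from: ActorRef): Option[ActorRef] = pairs.get(from)

// cleanup: the master left or became unreachable
def tearDown(master: ActorRef): Unit =
  pairs.get(master).foreach { scene =>
    context.stop(scene)
    pairs = pairs - master - scene
  }
```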

 

At this point, we can start some workers on different machines (and scale up at any time); they form a compute grid, and when we start the masters, they distribute and compute the work on this worker grid.

Some words about performance again. Both the worker and the master can be memory-hungry if you let them. Both actor systems eat up about 20-50 MB of memory without doing anything (I think this is relatively good for a JVM application). When a master starts to compute (and is not heap-limited), it can reach about 1.5 GB of memory (which is terrible if you ask me), but looking at the graphs, this is only because the GC is not executed frequently. The worker node is a bit better; I didn’t see it eat up more than 350 MB of memory. I tried both of them with -Xmx256m: they did not die with an out-of-memory exception, the GC just worked a lot more (and the run time increased by 10-15%).
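For reference, capping the heap is just the usual JVM flag (the jar and class names are placeholders):

```
java -Xmx256m -cp raytracer.jar WorkerNodeMain 2552
```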

 

Summary

At the end of the day, we managed to separate our tracer logic from our scene and image renderer. We can now start up a (static) worker grid and add master nodes to distribute and compute the scene on it. I left some “possibly problematic” decisions in the code, so some edge cases can deadlock our communication, but the happy cases work flawlessly. The performance is worse than the single-JVM version. The interesting question is how it will perform with more computation-heavy objects. (In the near future I want to write a cylinder and a metaball implementation too, to test how flexible my initial implementation is, and to see how I can play with performance tuning.)

This is a “learning” tutorial for both the readers and the author, so if you have any suggestions or questions, feel free to comment below!

PS: The fourth and final part of this series is coming soon! Stay up to date by liking us on Facebook or following us on Twitter!

Gergő Törcsvári
Software Developer at Wanari
I would love to change the world, but they won’t give me the source code (yet).