Case Study: Live video streaming from iOS devices made simple with Elixir & Membrane Framework


This article was originally published on Apr 23, 2020 as a Software Mansion blog entry by Dominik Stanaszek. The original content is available here: https://blog.swmansion.com/live-video-streaming-in-elixir-made-simple-with-membrane-fc5b2083982d

Multimedia and streaming are among Software Mansion's areas of expertise. While working on one of our projects, I recently faced the challenge of setting up a system for real-time live streaming of video. If you're familiar with Twitch, the idea here is similar: the feed from one client's camera is displayed on recipients' devices. Viewers can interact with the streamer, for example by tipping them or commenting on their stream. Put simply, the task was to create a domain-specific version of Twitch, where workout buffs could show off their weightlifting skills, get instant feedback, earn some money from their fans and do all of that in near real-time.

An image defining the problem

The main requirements we had to meet were:

  • The application should allow streaming video and audio content from one iOS device to other iOS devices and web clients.

  • The allowed latency was around 10 seconds, as scalability was more of a concern than being ultra-fast.

  • Video resolution should adjust to the receiver’s screen size and network conditions.

  • The number of viewers was expected to be significantly higher than the number of concurrent streamers.

  • The streamer should be able to archive videos.

  • Viewers and the author should be able to send messages during the stream.

  • Viewers should be able to tip the streamer.

  • Each stream should have a thumbnail displayed next to the stream link (preview). It should be taken from the stream itself.

  • The application should be ready to be migrated to Android in the future.

  • Time… last but not least, the project was at the MVP stage, so we were very tightly time-constrained and had to look for simple yet powerful, easy-to-introspect solutions.

Multimedia

First and foremost, we had to establish the multimedia protocol stack and codecs we wanted to use. We had to balance low-latency requirements against scalability. For clarity, I will divide the stream path into two logical steps: the so-called first and last mile. In many multimedia streaming systems this division is useful, unless there is no server (or CDN) between the clients at all.

Protocols we considered:

  • RTMP — It is the most frequently picked first-mile protocol. A lot of services (like YouTube) use it to upload videos. It's built on TCP. RTMP used to be a proprietary protocol owned by Macromedia and used in Flash Player (which gave it its popularity, as Flash Player was used all around the Internet). After Adobe took over Flash, an incomplete version of the protocol specification was released publicly.

  • RTP/RTCP — This is a pair of protocols commonly used when low latency is important; they power a lot of popular video chats. RTP describes the packet format, how to signal the multimedia format, and so on. Additional RFC documents describe how particular multimedia codecs are encapsulated into RTP (so-called payloading). RTCP is an accompanying protocol whose role is to control (hence the C in RTCP) the data flow.

  • WebRTC — This is a so-called umbrella protocol, meaning it incorporates several different protocols (and also a JS API) to allow low-latency multimedia exchange between browsers (in theory, any JS-based client can use it).

  • RTSP — This is an HTTP-like protocol that lets the client and the server control a video transmission and exchange information about its state.

  • HTTP progressive upload/RTSP upload — RTSP also allows transmitting the multimedia payload itself inside the RTSP connection, which means you are bound to TCP. The same goes for HTTP upload.

  • DASH/HLS — These are the two most common HTTP-based protocols used for last-mile purposes. They allow streaming content in an adaptive manner, meaning the stream quality can be adapted to, for example, network conditions or the receiver's screen resolution.

First mile

The first mile describes the communication pattern between the streamer and the server. In what way do you want to deliver the content to the server? How do you inform the server that you are streaming? Or that you paused? How does the server tell the streamer it stopped receiving packets? All these concerns have to be addressed, and different protocols address them differently.

For our purposes, we decided all protocols built on TCP were off the table, as they would add overhead to the stream's delivery time. As streamers were often expected to use the app in low-connectivity environments, frequent TCP retransmissions could make it impossible to upload the stream within the latency requirements. We would rather lose a few pixels here and there than stall the whole transmission.

RTMP seemed like a pretty good fit although, based on our previous experience, we knew the protocol was quite complicated and the specification released by Adobe was incomplete.

Eventually, we decided to go with RTP plus a few custom HTTP endpoints and Phoenix channels (you can read more about Phoenix channels in the official documentation) to control the stream flow. RTP is capable of maintaining low latency (even under 200 ms), can be encapsulated in UDP and is completely open and well described. We used custom HTTP/WS endpoints only due to development time constraints and plan to replace them with RTSP in the future.
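
To give an idea of how small the control part can be, a stream-control channel can look roughly like the sketch below. This is only an illustration, not the project's actual code; the module, topic and event names are made up.

    # Hypothetical stream-control channel; module, topic and event names are illustrative.
    defmodule MyAppWeb.StreamChannel do
      use Phoenix.Channel

      # The streamer and the viewers join a topic scoped to the stream id.
      def join("stream:" <> _stream_id, _params, socket) do
        {:ok, socket}
      end

      # The streamer announces a pause; everyone subscribed to the topic is notified.
      def handle_in("pause", _params, socket) do
        broadcast!(socket, "stream_paused", %{})
        {:noreply, socket}
      end

      def handle_in("resume", _params, socket) do
        broadcast!(socket, "stream_resumed", %{})
        {:noreply, socket}
      end
    end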

Last mile

The last mile, similarly, describes the communication pattern between the server and the stream receiver. At first, we were thinking about using RTP here as well. Indeed, why bother with a different protocol when the one we already use in the first mile keeps latency low? There were a few problems. Firstly, RTP is not very scalable: your server has to actively send UDP packets to the receivers and take care of flow control, failures and distribution, so you need to handle horizontal scaling yourself. Secondly, adaptivity of the stream is very hard to achieve; you would need a custom way of communicating with the client and establishing multimedia details (like resolution). There are protocols like RTSP that help with that, but that is yet another protocol in the stack. Thirdly, we wanted to be able to serve recordings straight from storage with as little effort as possible.

HLS came to the rescue. With HLS, all we needed to do was put the incoming stream inside an MP4 container and save it as a file, along with a short index file that describes where the particular parts of the recording are stored (under what URLs). This way our server did not even take part in the last-mile delivery, as we saved the files in Google Cloud Storage, which manages scalability for us (and HTTP requests are much easier to scale in general).
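
To illustrate how simple the last mile gets, an HLS index file (playlist) is just a short text file listing the initialization and media segments. The one below is made up, but shows the general shape for fMP4 segments:

    #EXTM3U
    #EXT-X-VERSION:7
    #EXT-X-TARGETDURATION:6
    #EXT-X-MEDIA-SEQUENCE:0
    #EXT-X-MAP:URI="init.mp4"
    #EXTINF:6.0,
    segment_0.m4s
    #EXTINF:6.0,
    segment_1.m4s

The player keeps re-fetching this file over plain HTTP and downloads whatever new segments appear in it, which is why an object store can serve the whole last mile by itself.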

A simplified view of the system. The first mile uses RTP (on the left), the server writes HLS files to an external storage system, and the last mile consists of HTTP communication to inform the receiver about file locations plus the HLS adaptive HTTP download itself (omitting the server).

Codecs and containers

We used the H264 codec for video and AAC for audio, as they are commonly used, free options that have RFCs describing how they should be payloaded inside RTP. HLS enforced the use of the fMP4 (fragmented MP4) container.

An image showing how the media are encapsulated in the described protocols: first mile on the left, last mile on the right.

Membrane Framework

If you’re not familiar with the Membrane Framework, you should definitely read about it on GitHub, check out the guides or listen to Marcin Lewandowski’s talk. The Membrane Framework allowed us to write custom multimedia processing elements in a high-level language whose ecosystem also comes with such awesome tools as Phoenix, so the web API implementation was easy enough not to draw us away from the core of the problem. Make sure you’re familiar with concepts like pipelines, elements and bins before proceeding with this article.

We used the Membrane Framework to accept RTP streams, encapsulate them in fMP4 and save the HLS files in Google Cloud Storage. The procedure looks like this: for each stream we create one pipeline. It listens on one port for two streams, audio and video (interleaved). In RTP, streams are identified by an SSRC, an integer carried in the RTP header.

Pipeline architecture

Several elements need to be present in the pipeline due to the RTP and HLS requirements: among others, elements that receive the UDP packets, demultiplex and depayload the RTP streams, parse H264 and AAC, mux them into fMP4 (CMAF) fragments and write out the HLS playlists and segments.

The low-level view would be:

A low-level view of the pipeline, showing all the elements taking part in the processing. Orange boxes outline the bins' borders.

Thanks to bins, a recent addition to the Membrane Framework, if you want to build this pipeline yourself, you can do it like this:

Pipeline view with the use of bins
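
A rough sketch of the idea, assuming the Membrane 0.x ParentSpec API; the module names, pad names and options come from the membrane_element_udp, membrane_rtp_plugin and membrane_http_adaptive_stream_plugin packages and may differ between plugin versions.

    # A sketch only; the dynamic-pad handling for the RTP streams is omitted and
    # module/option names may differ between Membrane plugin versions.
    defmodule StreamPipeline do
      use Membrane.Pipeline

      alias Membrane.ParentSpec

      @impl true
      def handle_init(%{port: port}) do
        children = %{
          # Receives the interleaved audio and video RTP packets on a single UDP port
          udp_source: %Membrane.Element.UDP.Source{local_port_no: port},
          # Demultiplexes the incoming packets by SSRC and depayloads the streams
          rtp: Membrane.RTP.SessionBin
        }

        links = [
          ParentSpec.link(:udp_source) |> ParentSpec.via_in(:rtp_input) |> ParentSpec.to(:rtp)
        ]

        # When the RTP bin reports a new SSRC (a :new_rtp_stream notification),
        # handle_notification/4 (not shown) links the branch with the H264/AAC
        # parser, the CMAF (fMP4) muxer and the HLS sink from
        # membrane_http_adaptive_stream_plugin.
        {{:ok, spec: %ParentSpec{children: children, links: links}}, %{}}
      end
    end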

With the Membrane pipeline in the picture, the whole streaming flow is complete (yellow: first mile, red: last mile).

First-mile flow:

  • Client makes a request to create a stream

  • Server creates a pipeline process that will listen on a UDP port (a sketch of these two steps follows the list)

  • Server responds with stream details such as the port

  • Client starts streaming; the packets go through the pipeline and end up saved to storage
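
A minimal sketch of the "create a stream" step, using a hypothetical controller together with the pipeline module sketched above. The port allocation logic is left out, and the return shape of Membrane.Pipeline.start_link differs between Membrane versions.

    # Hypothetical controller; module names and the hard-coded port are illustrative.
    defmodule MyAppWeb.StreamController do
      use MyAppWeb, :controller

      def create(conn, _params) do
        # Pick a free UDP port for this stream (allocation logic omitted)
        port = 5_000

        # One pipeline process per stream, listening on that port
        # (depending on the Membrane version, playback may also need to be
        # started explicitly)
        {:ok, _pid} = Membrane.Pipeline.start_link(StreamPipeline, %{port: port})

        # The streamer points its RTP output at the returned port
        json(conn, %{port: port})
      end
    end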

Last-mile flow:

  • Client makes a request to get a video URL

  • Server responds with the URL of the HLS index file

  • Client downloads the index

  • Client uses the URLs it finds inside the index file to download the next chunks of the recording and plays them with the iOS built-in player

Thumbnails

The Membrane Framework respects the open/closed principle, and adding thumbnails to our existing system was as easy as creating a few new elements and linking them to our existing pipeline using the Tee element. Tee is an element with one input pad and two output pads: it forwards (copies) the stream from the input to both outputs.

Pipeline with thumbnail branch. The black box outlines the old part of the pipeline.

Once we have one branch of the pipeline with a copied stream, we can add the elements we need for generating the thumbnails:

  • scissors — allows cropping specific parts out of the stream (membrane_element_scissors)

  • JPEG — converts H264 frames to the JPEG format (elixir_turbojpeg)

  • Video thumbnail — uploads the thumbnails to external storage (a custom element)

One problem you need to remember about when cropping single frames from H264-encoded video is that some frames are not standalone. To save space, H264 encodes some frames as a delta between other frames. At regular intervals the stream should contain so-called key frames (type I) that can be decoded without the previous or next frames. In our pipeline this is achieved by accessing the buffers' H264 metadata (the key_frame? key) in the scissors filter option, as sketched below.
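
A sketch of how the thumbnail branch can be wired up, reusing the ParentSpec style from the pipeline sketch above. The :video_parser and :hls_branch names stand in for children of the existing pipeline, the Turbojpeg and uploader element names are assumptions, and the Scissors element takes more options than shown here.

    # Sketch only; element names, option names and the metadata key path are approximate.
    children = %{
      # Copies the parsed H264 stream into two identical outputs
      tee: Membrane.Element.Tee.Parallel,
      # Keeps only standalone key frames (type I), based on the metadata the
      # H264 parser attaches to each buffer
      scissors: %Membrane.Element.Scissors{
        filter: fn buffer -> buffer.metadata.h264.key_frame? end
      },
      # Converts the selected H264 frames to JPEG (elixir_turbojpeg; name assumed)
      jpeg: Turbojpeg.Filter,
      # Custom element uploading the resulting thumbnails to external storage
      thumbnail_sink: ThumbnailUploader
    }

    links = [
      # One copy keeps feeding the original HLS part of the pipeline...
      ParentSpec.link(:video_parser) |> ParentSpec.to(:tee) |> ParentSpec.to(:hls_branch),
      # ...the other goes through the thumbnail branch
      ParentSpec.link(:tee)
      |> ParentSpec.to(:scissors)
      |> ParentSpec.to(:jpeg)
      |> ParentSpec.to(:thumbnail_sink)
    ]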

Conclusions

With the Membrane Framework, incorporating streaming features into your existing Phoenix server has never been easier. It is important to outline your specific needs, then pick the right streaming protocols, grab existing Membrane Framework elements (or contribute by creating new ones) and you're ready to go. With the power of the BEAM underneath, you can still benefit from everything it provides: concurrency, failover mechanisms, the distribution model and much more.