## Tuesday, May 27, 2014

### Playing TitanFall and Super Mario 3D World on the bitbox

Well, playing a video of Titanfall on a Bitbox: actually running a demanding PC / next-gen console game is of course out of reach :) But OK then, can we at least have video on the bitbox?

Full motion video on very small hardware like the bitbox poses some interesting optimization challenges. That's the kind of challenge you face when you've got a few KB of RAM - too little to store the whole image you're showing - and 180MHz to generate the content and synthesize your VGA signal in software ...

Can we make nice cinematic cutscenes for a game, i.e. full motion video on the bitbox? Can we make it 4K 3D video? Well, no (what did you think? 1080p50 with an Arduino?)

But ... how far can we go ?

### First approach and limitations

After some reflection, you quickly realize that the main limitation is RAM.

We can use the internal RAM to store one frame for display and another one to decompress the next frame into. That also lets us use a codec that encodes the first image completely, then only the differences between successive images: if an image is almost the same as the previous one, its compressed size will be small. (However, we have plenty of space on the SD card, so if we can stream data from the SD card to memory fast enough, we don't need to compress that much - or even at all.)

Also, using a frame buffer, we can separate the display frame rate (i.e. 60 fps, the speed at which images need to be sent to the display) from the movie frame rate, i.e. the number of new images provided per second. If we've got a 20 fps movie, we only need to update and switch our buffers every third display frame, leaving us a little more time to load and process the next frame while we repeatedly blit the preceding one.
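To make the 60 fps / 20 fps relationship concrete, here is a tiny sketch of which movie frame is on screen during each display frame. The function names are made up for illustration; nothing here comes from the bitbox code.

```python
def movie_frame_for(display_frame, display_fps=60, movie_fps=20):
    """Index of the movie frame shown during a given display frame."""
    return display_frame * movie_fps // display_fps


def needs_buffer_swap(display_frame, display_fps=60, movie_fps=20):
    """True when a new movie frame must be ready for this display frame."""
    if display_frame == 0:
        return True
    current = movie_frame_for(display_frame, display_fps, movie_fps)
    previous = movie_frame_for(display_frame - 1, display_fps, movie_fps)
    return current != previous


if __name__ == "__main__":
    for d in range(7):
        print(d, movie_frame_for(d), needs_buffer_swap(d))
```

With a 20 fps movie, each movie frame stays on screen for three display frames, so the loader gets three display frames' worth of time to fill the other buffer.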

We have 192KB in total. Let's say we use 64KB for our game RAM, and the rest is used for two 64KB frames. At 16 bits per pixel, we can store 32000 pixels; with a 16/9 aspect ratio (width/height), that's 238 by 134 pixels. Not very high def. Let's say we use a 256 color palette per frame and we store the palette and the image separately. We can then use ~8bpp, so twice the pixels, and a resolution of 337 by 189. Not that good either.
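The resolution figures can be reproduced with a small helper. This is only a sketch of the arithmetic: `max_resolution` is a made-up name, and it assumes a buffer of 64000 bytes (the round figure that gives the 32000 pixels quoted above) and no space reserved for a palette.

```python
import math


def max_resolution(buffer_bytes, bits_per_pixel, aspect=(16, 9)):
    """Largest width x height with the given aspect ratio that fits the buffer."""
    pixels = buffer_bytes * 8 // bits_per_pixel
    aw, ah = aspect
    scale = math.sqrt(pixels / (aw * ah))
    return int(aw * scale), int(ah * scale)


if __name__ == "__main__":
    for bpp in (16, 8):
        print(bpp, "bpp ->", max_resolution(64000, bpp))
```

The same helper works for any other bits-per-pixel figure, which is handy when comparing codecs by how many pixels they let you fit in one buffer.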

Example images at the preceding resolutions (click on them to see them at 640x480)
 238 x 134, click to see full size

 337 x 189, 256 colors ; click to see full size

(Note that the 24bpp -> 15bpp conversion with dithering was not done on the preceding images, and that they were cropped to 16/9.)

OK, not too bad, but some effort can be made to be a little better.

I demand a 640 x 360 resolution !

### A compression algorithm : DXT1

Let's say we want a native resolution of 640 pixels wide. At 16/9 that's 360 pixels high. If we want to store it in our buffers, that's 230 400 pixels to fit in 64kB, which leaves us a little more than 2bpp (bits per pixel). So we can store 4 colors in our frame buffer. That's simple, and a better resolution, but not a nice image. Maybe for black & white or special effects, but not very usable.

 640x480 @ 4bpp - click to see fullsize

But what did we do ? By storing a palette + image, we're effectively storing a compressed image on the frame buffer, and decompressing it at 60 fps now. (note that since we're not storing a decompressed video frame, we ruled out all MPEG encodings or most video encoders which encode frame differences).

What if we used a compression scheme other than a palette, such as MJPEG, and decompressed on the fly in the rendering engine at 60 fps? That will be easy!

NOT.

Remember we have a limited number of CPU cycles per pixel. The CPU clock is 168 MHz (which we can overclock a bit, for example to 192 MHz to stay compatible with USB clock speeds), and VGA lines come at 31kHz (one line every 1/31000 s), so we effectively have around 6000 CPU cycles to decompress and write 640 pixels. That's less than 10 CPU cycles per pixel; remember that just loading and storing a word in RAM takes 3 cycles. Ouch. And we must now decompress at 60fps, since we must stream each decompressed frame to the screen.
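The cycle budget is easy to check. The sketch below simply divides the clock by the line rate and then by the pixels per line; it assumes the whole line period (blanking included) is available for decoding, which is optimistic. `cycle_budget` is a made-up name.

```python
def cycle_budget(cpu_hz, line_hz=31_000, pixels_per_line=640):
    """Return (cycles per VGA line, cycles per pixel) for a given CPU clock."""
    cycles_per_line = cpu_hz / line_hz
    return cycles_per_line, cycles_per_line / pixels_per_line


if __name__ == "__main__":
    for mhz in (168, 192):
        per_line, per_pixel = cycle_budget(mhz * 1_000_000)
        print(f"{mhz} MHz: {per_line:.0f} cycles/line, {per_pixel:.1f} cycles/pixel")
```

Both clock speeds land under 10 cycles per pixel, which is why anything heavier than a table lookup per pixel is already too expensive.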

OK, what can we do ?

Remember that now we're just focusing on RAM to display decompression, not storage to RAM decompression. We might be able to compress more after that for on disk storage.

One fast, interesting compression scheme is S3TC DXT1 compression, which is used for texture compression on current video games (yes, the big current ones, don't we feel high tech to share this technology ?) It is said to be very fast for decompression.

This scheme is described thoroughly in the Wikipedia article on DXT1:

> DXT1 (also known as Block Compression 1 or BC1) is the smallest variation of S3TC, storing 16 input pixels in 64 bits of output, consisting of two 16-bit RGB 5:6:5 color values $c_0$ and $c_1$, and a 4x4 two bit lookup table.
>
> If $c_0 > c_1$, then two other colors are calculated, such that $c_2 = {2 \over 3} c_0 + {1 \over 3} c_1$ and $c_3 = {1 \over 3} c_0 + {2 \over 3} c_1$. This mode operates similarly to mode 0xC0 of the original Apple Video codec.
>
> Otherwise, if $c_0 \le c_1$, then $c_2 = {1 \over 2} c_0 + {1 \over 2} c_1$ and $c_3$ is transparent black corresponding to a premultiplied alpha format.
>
> The lookup table is then consulted to determine the color value for each pixel, with a value of 0 corresponding to $c_0$ and a value of 3 corresponding to $c_3$. DXT1 does not store alpha data enabling higher compression ratios.

So: take a 4x4 block, store 2 colors, interpolate 2 more to get 4 colors, and blit the block using those 4 colors.
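As a plain, unoptimized reference, decoding one DXT1 block could look like the sketch below. It follows the description quoted above, uses simple floor division for the interpolation (real decoders may round differently) and represents the transparent entry as `None`. This is not the bitbox decoder, just an illustration.

```python
import struct


def rgb565(c):
    """Split a 16-bit 5:6:5 color into (r, g, b) components."""
    return (c >> 11) & 0x1F, (c >> 5) & 0x3F, c & 0x1F


def decode_dxt1_block(block):
    """Decode one 8-byte DXT1 block into 4 rows of 4 (r, g, b) tuples."""
    # Two little-endian 16-bit colors, then 32 bits of 2-bit indices.
    c0, c1, bits = struct.unpack("<HHI", block)
    p0, p1 = rgb565(c0), rgb565(c1)
    if c0 > c1:
        p2 = tuple((2 * a + b) // 3 for a, b in zip(p0, p1))
        p3 = tuple((a + 2 * b) // 3 for a, b in zip(p0, p1))
    else:
        p2 = tuple((a + b) // 2 for a, b in zip(p0, p1))
        p3 = None  # transparent black
    palette = (p0, p1, p2, p3)
    # Pixel (x, y) uses the 2 bits starting at bit 2 * (4 * y + x).
    return [[palette[(bits >> (2 * (4 * y + x))) & 3] for x in range(4)]
            for y in range(4)]


if __name__ == "__main__":
    # Red and blue endpoints, first row uses indices 0, 1, 2, 3.
    demo = struct.pack("<HHI", 0xF800, 0x001F, 0xE4)
    print(decode_dxt1_block(demo)[0])
```

Note that the per-pixel work is only a 2-bit lookup; the cost is in building the 4-entry palette for every block.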

Well, note that it also seems we're in good company here, as Apple basically had the same idea as us .. a few years ago.

As a bonus, we can have free, high quality compressors, such as libsquish. And indeed, this looks very good ! And it works with 16 bits pixels, so we can change it a bit and make it work natively with 15 bit pixels.

That would give us a 4bpp image with no interframe compression (each image being independent, which is needed for on-the-fly decompression: we shouldn't depend on preceding frames!).
4bpp compression allows us 128000 pixels in 64KB, so a resolution of 477 by 268. Not 640, but better.

And the quality is very nice.

DXT1 compressed and original, 15bit 640x400 (so 1/4 the size). Click to enlarge. (I can't see a difference!)

Nice, eh ?

I implemented a highly optimized decoder with native Cortex M4 instructions to load and multiply several 16bit words in parallel. A nice article about what needs to be done to decompress DXT1 can be found online.

Basically, you need to:
- split c0, c1 into RGB components
- generate the c2 & c3 palette entries for each block, once every 4 lines (or for 1/4 of the blocks on each line, in preparation for the next block line). This is done by interpolating the RGB values: multiply (much faster than dividing) each 5-bit color component by 1/3 and 2/3, approximated as 16-bit fixed-point numbers, then repack the components into colors. This is parallelized within a 32-bit word, just like in base 10 you can multiply 25 by 20003000 and get 500075000 (both products in a single multiplication)
- blit c0, c1, c2 or c3 according to the pixel map.
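The packed multiplication trick from the second step can be illustrated as follows. This is a sketch of the idea, not the actual Cortex M4 code: it packs two 5-bit components into the two 16-bit halves of a 32-bit word and divides both by 3 with a single multiplication. The constant 86 (about 256/3) and all the names are choices made for this example.

```python
ONE_THIRD_Q8 = 86  # ~256/3; (n * 86) >> 8 == n // 3 holds for 0 <= n <= 127


def pack2(lo, hi):
    """Put two small values in the two 16-bit lanes of a 32-bit word."""
    return lo | (hi << 16)


def unpack2(word):
    return word & 0xFFFF, (word >> 16) & 0xFFFF


def blend_packed(a, b):
    """Compute (2*a + b) // 3 on both lanes at once (5-bit values per lane)."""
    total = 2 * a + b                    # at most 93 per lane, no carry across lanes
    scaled = (total * ONE_THIRD_Q8) & 0xFFFFFFFF  # at most 7998 per lane
    return (scaled >> 8) & 0x00FF00FF    # keep each lane's integer part


if __name__ == "__main__":
    # The base 10 analogy: one multiplication, two products.
    print(25 * 20003000)
    # Two components of c0 and c1 blended in one go.
    print(unpack2(blend_packed(pack2(31, 0), pack2(0, 31))))
```

Calling `blend_packed(c0, c1)` gives the c2 components and `blend_packed(c1, c0)` gives the c3 components, so two multiplications cover two components of both interpolated colors.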

And .. I failed: even these simple calculations took more than 10 cycles per pixel. Maybe it can be done, but the best I could do was around 20 cycles - with extra optimizations and cut corners, for example by assuming no transparency - which allowed a 360 pixels wide image.

Let's see what a 320 pixels wide image stretched horizontally looks like:

 a 320x480 horizontally stretched image (click)

Better than before  but .. not enough. Note that if we had a frame buffer, full motion would be possible at 30 fps since we could decompress once every other frame.

### A custom, simpler codec

OK, what if we keep the idea and remove the palette calculation? That would slash the CPU cost, but at the expense of much more bandwidth for colors: that would be 6bpp (for a block of 16 pixels, we now have 4 16-bit colors + 16 2-bit pixel references = 64 + 32 = 96 bits, and 96/16 = 6bpp). So 85k pixels, 389 pixels wide. A bit better, OK, why not.

Maybe we can store each color c0-c3 as an 8-bit index into a table instead; this would still give us 4bpp and remove the palette interpolation cost, at 477x268, 256 colors max.

But let's degrade the quality and only send 2 colors per block, not 4. And just palettize the color references c0, c1 - so a color reference will be 8 bits, plus a 512 byte palette for the image (256 x 16 bits). Each pixel will be encoded on a single bit (2 colors), so we're talking about 16 pixels in 2 x 8 bits + 16 x 1 bit = 32 bits: 2 bits per pixel! Remember what we said we needed for native bitbox resolution?
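The bits-per-pixel figures of the block layouts discussed so far can be double-checked with a few lines. `block_bpp` is a made-up helper that just adds the bits spent on colors to the bits spent on per-pixel references for one 4x4 block.

```python
def block_bpp(n_colors, bits_per_color, bits_per_pixel_ref, pixels=16):
    """Average bits per pixel of one block: stored colors + per-pixel references."""
    total_bits = n_colors * bits_per_color + pixels * bits_per_pixel_ref
    return total_bits / pixels


LAYOUTS = {
    "DXT1: 2 x 16-bit colors, 2-bit refs": block_bpp(2, 16, 2),
    "4 x 16-bit colors, 2-bit refs": block_bpp(4, 16, 2),
    "4 x 8-bit palette indices, 2-bit refs": block_bpp(4, 8, 2),
    "2 x 8-bit palette indices, 1-bit refs": block_bpp(2, 8, 1),
}

if __name__ == "__main__":
    for name, bpp in LAYOUTS.items():
        print(f"{name}: {bpp:g} bpp")
```

The 512 byte palette is not counted here since it is shared by the whole image rather than paid for by each block.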

That would let us store 64kB*8/2 = 256k pixels, so a resolution of 674 x 379 pixels!! But that would look like crap at 2 bits per pixel! Well, look at the sample results:

 our new video codec, 640x480, as displayed on bitbox - click

Other examples :

I officially declare this good enough (for now).

The compression algorithm is the following (quite crude for now) :

For each 4x4 block:
• compute the luminosity of each pixel
• compute the average luminosity
• separate the pixels into 2 groups: above and below average luminosity
• compute the average color of the "light" pixel group and of the "dark" pixel group
• compute the 16-bit bitmap of the pixels (one bit per pixel, telling which group it belongs to)

Then:
• reduce the resulting color pairs of all blocks to a 256 color table with a simple color reduction algorithm: median cut (http://en.wikipedia.org/wiki/Median_cut). This is done globally, over all colors of the whole image
• store each block as 2 u8 indices into the colormap + the 16 1-bit pixels of the block
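The per-block part of the steps above can be sketched as follows. This is not the actual encoder: the luminosity weights (Rec. 601 style), the bit order of the bitmap (bit i for pixel i in row-major order, set for "light" pixels) and the function names are all assumptions made for the example, and the global median cut step is left out.

```python
def luminosity(rgb):
    """Approximate luminosity of an (r, g, b) pixel (assumed Rec. 601 weights)."""
    r, g, b = rgb
    return (299 * r + 587 * g + 114 * b) // 1000


def average_color(pixels):
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) // n for i in range(3))


def encode_block(block):
    """Encode 16 (r, g, b) pixels into (dark_color, light_color, bitmap)."""
    lums = [luminosity(p) for p in block]
    total = sum(lums)
    light, dark, bitmap = [], [], 0
    for i, (pixel, lum) in enumerate(zip(block, lums)):
        if lum * len(block) > total:   # strictly above the average luminosity
            light.append(pixel)
            bitmap |= 1 << i
        else:
            dark.append(pixel)
    dark_color = average_color(dark)   # never empty: the minimum is <= average
    light_color = average_color(light) if light else dark_color
    return dark_color, light_color, bitmap


if __name__ == "__main__":
    block = [(0, 0, 0)] * 8 + [(255, 255, 255)] * 8
    print(encode_block(block))
```

After this step, every block holds two full colors; the palette reduction then replaces each of them with the index of its nearest entry in the 256 color table.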

The images above were encoded with a very naive (and slow) encoder. Encoding quality could be improved a lot while keeping the same decoder, and rewriting the encoder in C would make it MUCH faster (it is currently non-optimized Python). Remember that you're looking at still images, and that the video could run at 60fps (we have to decompress at that speed anyway!), so we can also spread the error over time.

Can it be done? Well, yes, it can be decoded at 60 fps in real time. I've made a demo that loops 8 frames from internal flash and renders them on screen, using an optimized decoder written in C only (no assembly). That doesn't leave much time to blit anything more on those lines (well, besides a mouse cursor for example), but who would want to draw sprites over such a nice picture :)
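For reference, the decoding side is little more than a lookup per pixel. The sketch below (in Python for readability, not the C decoder of the demo) expands one row of blocks into one scanline. It assumes each block is a tuple of two palette indices plus a 16-bit bitmap, with bit `4*y + x` selecting the second color for pixel (x, y); that layout is an assumption for illustration.

```python
def decode_line(blocks, palette, y_in_block):
    """Expand one scanline from a row of (index0, index1, bitmap) blocks."""
    line = []
    for index0, index1, bitmap in blocks:
        colors = (palette[index0], palette[index1])
        row_bits = (bitmap >> (4 * y_in_block)) & 0xF  # the 4 bits of this row
        for x in range(4):
            line.append(colors[(row_bits >> x) & 1])
    return line


if __name__ == "__main__":
    palette = [0x0000, 0x7FFF]          # black and white, 15-bit colors
    blocks = [(0, 1, 0xFF00), (1, 0, 0x0001)]
    print([hex(c) for c in decode_line(blocks, palette, 0)])
```

No interpolation and no multiplication are involved, which is what makes a per-pixel budget of under 10 cycles plausible.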

But 8 frames ain't much - so we need to stream from the SD card.

#### Streaming from microSD in realtime (ie 60 times per second)

A full scale stream would be 640x360*2bits/8*60 frames per second = about 3.5 megabytes per second. The SD card can read at 7MB/s, so we don't need further compression, which means we can DMA directly from the SD card to the display buffers! (If that's not good enough, let's reduce the frame rate, say to 30fps.)
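The stream rate is straightforward to compute. This sketch counts only the 2bpp block data by default; the optional `palette_bytes` argument is an assumption about how a per-frame palette might be added, not a description of the actual file format.

```python
def frame_bytes(width, height, bits_per_pixel=2, palette_bytes=0):
    """Size in bytes of one encoded frame."""
    return width * height * bits_per_pixel // 8 + palette_bytes


def stream_rate(width, height, fps, bits_per_pixel=2, palette_bytes=0):
    """Bytes per second needed to stream frames of that size."""
    return frame_bytes(width, height, bits_per_pixel, palette_bytes) * fps


if __name__ == "__main__":
    for fps in (60, 30):
        rate = stream_rate(640, 360, fps)
        print(f"{fps} fps: {rate} bytes/s ({rate / 1e6:.2f} MB/s)")
```

At 60 fps this comes out at roughly 3.5 MB/s (about 3.3 MiB/s), and half of that at 30 fps.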

I used the famous FatFS library to handle the FAT32 file format of the card, adapted for use with the stm32f4.

That's the theory. In practice, I can currently only load 16kB per frame, and at 30fps by splitting each transfer over two frames, because of:
- my crappy microSD card maybe,
- or the file system complexity, which adds some reads and slows down card access (it's not RAW file access - although that could be done, it would be much less practical),
- or maybe I'm saturating the internal transfer bus or DMA access? Who knows?

Anyway, here is the current, real result: full motion video on the bitbox at 320x180 using 32KB of RAM! (As promised, Titanfall playing on the Bitbox - real hardware here.)

Can we optimize further and do better? YES. Since our constraint is now file transfers, we should (and already do, up to a point):
• Not divide the SD card clock by 8 as I did in my first tests - DUH!
• Properly take care of storing data on the card so that it aligns with the filesystem and microSD card internal structures. This means for example storing only frames aligned on blocks of 512 bytes (1 sector of the card)
• Properly schedule the reads during V blanks and at the end of drawing instead of at the top of the frame
• Make the reads asynchronous, so that we don't wait for the DMA and it runs in the background until the next frame (but we need to know we won't have to issue several reads)
• Check whether the SD bandwidth is well used or not with a digital scope (which I don't have), understand why and act accordingly
• Buy a better microSD card (mine is not that impressive)
• Compress again: we could compress the compressed framebuffers for transmission, for example by taking inter-frame similarity into account! However, we then couldn't use the DMA to send data from SDIO to the buffer directly, so that would imply having another buffer and using the CPU to decompress into the display buffer, degrading bus utilization and so maybe decreasing performance ... I guess SD + FAT32 access can be optimized further before going there
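The sector alignment point in the list above can be sketched like this: pad every frame to a multiple of 512 bytes so that each frame starts on a sector boundary within the file. The frame layout used in the example (14400 bytes of 2bpp data for 320x180 plus a 512 byte palette) is an assumption for illustration, as are the function names.

```python
SECTOR_BYTES = 512


def padded_size(n_bytes, sector=SECTOR_BYTES):
    """Round a size up to the next multiple of the sector size."""
    return (n_bytes + sector - 1) // sector * sector


def frame_offset(frame_index, frame_size, sector=SECTOR_BYTES):
    """Byte offset of a frame in a file where every frame is sector-padded."""
    return frame_index * padded_size(frame_size, sector)


if __name__ == "__main__":
    size = 320 * 180 * 2 // 8 + 512     # hypothetical frame: pixels + palette
    print(size, "->", padded_size(size), "bytes,",
          padded_size(size) // SECTOR_BYTES, "sectors per frame")
```

The padding costs a few hundred bytes per frame, but every read then covers a whole number of sectors. Whether those sectors are also contiguous on the card depends on how the filesystem allocated the file.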

### Conclusion

Well, this sums up this long article. I hope you enjoyed it all: it shows that you can actually do a lot on this small microcontroller, and that optimizing it is something that can be fun in itself!

In any case, well, the code is open: see the GitHub repo.

Thanks and congratulations for reading this far! Please leave a comment on the quality of the writing or the usefulness/interest of it all, or on anything you feel is missing.

.. And as a side note, I'm beginning to be able to make bitbox prototypes to sell to those who would like to tinker with code and write games, play some games or demo programs (when available), write libraries or low level stuff, or just support the project! See the header of the blog!