I finally made a little progress on my sample game, while reworking the kernel for a better image and having to rewire part of the prototype (hence the "prototype" label).
The kernel upgrade involved switching to hardware sync (as did the new waha demo, jupiter and beyond - credit where it is due), which is much more reliable timing-wise than software sync. Even a few instructions fetched from flash can behave very differently depending on concurrent memory accesses : the bus is used both by the DMA to transfer the current line and by the CPU to prepare the next one, so the availability of cached data and instructions from flash can vary quite a lot. With 7 cycles of latency from flash at the speeds used, one cache miss means one pixel off !
Of course, the pins I was using couldn't be driven directly by hardware. However, I found a combination of timers, DMA channels, pins ... (I'll try to explain the kernel in a future blog post) that required changing only one pin, h-sync.
So I went ahead and soldered a tiny wire directly from the new pin of the LQFP64 chip to the vsync VGA output (I'm quite proud to have managed that with my firestick iron), and cut the original PCB trace.
With that done, the VGA output is now much more stable, and can achieve 640x480 at 60 fps with 4096 colors without distortion.
Again, the controller is a SNES-compatible gamepad with its plug replaced by a PS/2 mini-DIN plug (useful for connecting a keyboard too).
The kernel exposes the following main elements for the graphics (as a C library) :
- a game_init callback : initialize everything for your game
- a game_frame callback : do what is needed every frame, like reading user input or moving sprites (x, y, frame, ...), using the global integer "frame".
- the gamepad is simply a global variable "gamepad1", read as a uint16 with 1 bit for each button.
- a draw_buffer of 640 uint16 pixels and a "line" int giving the current line number (from 0 to 479)
- a game_line callback : in this callback, you draw each line (every frame) by blitting your screen line. So given a line number (from 0 to 479) and a pointer to a bunch of 16-bit unsigned data, you have ~5k cycles to blit your graphics.
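To make the list above a bit more concrete, here is a minimal sketch of what a game built on these callbacks could look like. The callback signatures, the PAD_RIGHT bit and the color values are assumptions made up for illustration (the real kernel header may differ), and the globals are declared here only so the sketch compiles on its own.

```c
#include <stdint.h>

/* Stand-ins for the kernel's globals, declared here only so that the
   sketch is self-contained; the real declarations may differ. */
uint16_t draw_buffer[640];   /* one scanline of 16-bit pixels */
int line;                    /* current line, 0..479 */
int frame;                   /* frame counter */
uint16_t gamepad1;           /* one bit per button */

/* ASSUMPTIONS: hypothetical button bit and pixel values. */
#define PAD_RIGHT 0x0001u
#define COLOR_BG  0x0000u
#define COLOR_BOX 0x0FFFu
#define BOX_SIZE  32

static int box_x, box_y;

/* Called once: set up the game state. */
void game_init(void)
{
    box_x = 0;
    box_y = 200;
}

/* Called once per frame: read input, move things. */
void game_frame(void)
{
    if ((gamepad1 & PAD_RIGHT) && box_x < 640 - BOX_SIZE)
        box_x += 2;
}

/* Called once per line: fill draw_buffer for the current line. */
void game_line(void)
{
    for (int x = 0; x < 640; x++)
        draw_buffer[x] = COLOR_BG;

    if (line >= box_y && line < box_y + BOX_SIZE)
        for (int x = box_x; x < box_x + BOX_SIZE; x++)
            draw_buffer[x] = COLOR_BOX;
}
```

The pixel-by-pixel background fill is the naive version; it is exactly the kind of loop you would later tune to fit the per-line budget.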
You can blit however you want and do whatever you want in 5k cycles per line : from aligned memsets (2 pixels at a time, word-by-word transfers, very quick) to fully antialiased, rotozoomed, alpha blurred sprites (very expensive - tiny sprites !) : you do it & tune it (that's part of the fun) !
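For a rough idea of where the ~5k figure comes from, here is a small worked calculation. The 168 MHz core clock is an assumption (the clock is not stated here); the 25.175 MHz pixel clock and 800 total pixels per line are the standard 640x480 at 60 Hz VGA timings.

```c
#include <stdint.h>

#define CPU_HZ          168000000u /* ASSUMPTION: core clock, not stated in the post */
#define PIXEL_CLK_HZ     25175000u /* standard 640x480@60 VGA pixel clock */
#define PIXELS_PER_LINE       800u /* 640 visible pixels + horizontal blanking */

/* CPU cycles available during one full VGA line (rounded down). */
static uint32_t cycles_per_line(void)
{
    return (uint32_t)((uint64_t)CPU_HZ * PIXELS_PER_LINE / PIXEL_CLK_HZ);
}

/* CPU cycles per pixel, times 100 to keep two decimals in integer math. */
static uint32_t cycles_per_pixel_x100(void)
{
    return (uint32_t)((uint64_t)CPU_HZ * 100u / PIXEL_CLK_HZ);
}
```

Under that assumption you get about 5.3k cycles per line and a bit under 7 cycles per pixel, which is why a single 7-cycle flash miss in software timing shifts the picture by about one pixel.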
Of course, building a library / engine that you can reuse & tune from game to game is useful, but sometimes the ad-hoc "just blit it" (tm) engine is simpler !
I then made a sample game, which I'll talk about in an upcoming entry.