Wednesday, February 17, 2016

Is It Time to Reinvent How Video is Done?

What if we're doing video wrong?  Video is created and transmitted essentially the same way today as it was in the earliest days of television.  That method, in turn, was based on the way motion pictures had been made since their invention.  Most people have seen the famous 1878 sequence by Eadweard Muybridge of a racing horse, but the earliest versions of sequential pictures being used to convey motion go back to 1824.  It's not a very large leap to say all progress in motion pictures and video since then has been transmitting progressively better quality frames at progressively higher rates.

For a quick review: a camera takes a picture of the scene, one frame, at whatever the frame rate is.  A movie is a sequence of vast numbers of these frames.  In a film movie, the frames are projected one at a time onto a screen, and your eye/brain combination perceives them as smooth motion.  Video cameras originally exposed a small sensor in a pattern of lines called a raster, re-tracing to start another "scan line" and progressing down the screen line by line.  Today, cameras hold solid state sensors like the ones in a digital camera that capture an entire frame at once, but those sensors are still read out in the same basic line-by-line way.  In the video realm, the rate varies with the mode but runs from around 30 to 60 frames per second here in the US.  Each one of those frames is conveyed to the user, either through video transmission or recording, and is played back at that same rate.
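To put rough numbers on the frame-based approach, here's a minimal back-of-the-envelope sketch in Python.  The 1080p/60 figures and 8-bit color are modern assumptions, not what broadcast TV uses, but they show the scale of raw frame data:

```python
# Rough data rate for uncompressed frame-based video.
# Resolution, color depth, and frame rate are illustrative assumptions.
width, height = 1920, 1080      # pixels per frame (1080p)
bytes_per_pixel = 3             # 8-bit red, green, and blue
fps = 60                        # frames per second

frame_bytes = width * height * bytes_per_pixel
print(f"{frame_bytes / 1e6:.1f} MB per frame")             # ~6.2 MB
print(f"{frame_bytes * fps / 1e6:.0f} MB/s uncompressed")  # ~373 MB/s
```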

In a sense, this was the only way video could have been developed: one photograph at a time, followed by one line at a time transmitted as an analog signal.  But think about a video of a scene being updated 60 times every second.  The vast majority of the picture is identical from frame to frame.  In a piece on radio communications, I briefly mentioned Shannon's information theory; let me give the 5-minute-university version: if I tell you something you already know, I haven't provided any information.  By that measure, video transmission is horribly inefficient, sending the same scene over and over - 60 times a second! - while conveying almost no new information.
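One way to see that redundancy is to difference consecutive frames and count how few pixels actually changed.  A minimal sketch, where the synthetic "static scene" and the noise tolerance are assumptions for illustration:

```python
import numpy as np

def redundancy(prev_frame, frame, tolerance=2):
    """Fraction of pixels effectively unchanged between two frames.
    `tolerance` is an assumed allowance for sensor noise."""
    changed = np.abs(frame.astype(int) - prev_frame.astype(int)) > tolerance
    return 1.0 - changed.mean()

# A static scene: two frames differ only by a little sensor noise.
rng = np.random.default_rng(0)
frame_a = rng.integers(0, 256, (480, 640), dtype=np.uint8)
noise = rng.integers(-1, 2, frame_a.shape)
frame_b = np.clip(frame_a.astype(int) + noise, 0, 255).astype(np.uint8)

print(f"{redundancy(frame_a, frame_b):.0%} of the pixels carry no new information")
```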

But what if we looked at the scene differently?  What if we looked at the whole frame and only told the receiver which pixels had changed?  If nothing changed - the characters in the sitcom just sat there for one frame and nobody moved - nothing would be transmitted.  That would cut the data being sent tremendously, but everything that was sent would be new information.  The data rate required for your cable system or other transmission just got hundreds of times lower while the information throughput stayed the same (only the pixels that changed are sent).  To some extent, that's how compression algorithms already work: they try to send what's changing and avoid re-sending the parts that aren't.  The bigger difference is changing from a frame-based system to an event-based system.
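Here's a minimal sketch of that frame-to-event idea - not any real codec, and the change threshold and frame sizes are assumptions:

```python
import numpy as np

def encode_events(prev_frame, frame, threshold=2):
    """Transmit only (row, col, new_value) for pixels that changed."""
    diff = np.abs(frame.astype(int) - prev_frame.astype(int))
    ys, xs = np.nonzero(diff > threshold)
    return [(y, x, frame[y, x]) for y, x in zip(ys, xs)]

def apply_events(frame, events):
    """Receiver side: update only the pixels named in the event list."""
    out = frame.copy()
    for y, x, value in events:
        out[y, x] = value
    return out

rng = np.random.default_rng(1)
frame0 = rng.integers(0, 256, (480, 640), dtype=np.uint8)
frame1 = frame0.copy()
frame1[100:120, 200:220] ^= 0x80   # one small region changes; the rest is static

events = encode_events(frame0, frame1)
print(len(events), "events sent instead of", frame1.size, "pixels")
assert np.array_equal(apply_events(frame0, events), frame1)  # receiver reconstructs the frame
```

If the scene is completely static, the event list is empty and nothing at all goes over the wire.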

It turns out that the eye and brain work in a similar way.  Years ago, I read about an experiment in which a dog's brain had been surgically tapped so that optic nerve activity could be measured.  When the dog first went into a new room, the activity was intense.  After a while in the room, with nothing moving or changing, nerve activity dropped to a much lower rate.  The experimenters introduced something new into the room and again the optic nerve started firing like crazy.  Once the new object was examined and accepted, the rate went down again.  The conclusion was that the eye was operating like a distributed processor, looking at everything but only sending updates for what changed.

The idea is now being pursued in the electronics industry, primarily motivated by the much-hyped "Internet of Things" that so many of us talk about.  Vision is being added, or is going to be added, to tons of systems, and the thought is that it might help to think about vision completely differently.  Consider a security camera.  Might it make more sense for a camera watching a door to sit and send nothing unless something approached the door, rather than send a non-changing image of the door over and over, 30 frames per second?  Think of it as a sensor rather than an imager.

French startup Chronocam comes from the vision research world.  Their two principal researchers are Ryad Benosman, a mathematician who has done original work on event-driven computation, retina prosthetics, and neural sensing models, and Christoph Posch, who has worked in neuromorphic analog VLSI (very large scale integration) circuits, CMOS image and vision sensors, biology-inspired signal processing, and biomedical devices and systems.  The inspiration for Chronocam's event-driven vision sensors comes from studies of how the human eye and brain work.
According to Benosman, human eyes and brains “do not record the visual information based on a series of frames.”  Biology is in fact much more sophisticated. “Humans capture the stuff of interest—spatial and temporal changes—and send that information to the brain very efficiently,” he said.
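Published event-camera research (the field Posch's earlier sensor work comes from) generally has each pixel fire an ON or OFF event when the brightness at that pixel changes by more than some contrast threshold since the pixel's last event.  Here's a rough software model of that behavior - not Chronocam's actual design; the log-intensity math and the threshold value are assumptions:

```python
import numpy as np

class EventPixelArray:
    """Toy model of a per-pixel, event-driven vision sensor.

    Each pixel remembers the log intensity at its last event and fires
    an ON (+1) or OFF (-1) event when the current log intensity moves
    more than `contrast` away from that reference.  Real designs differ
    in many details (analog front end, timestamps, event readout bus).
    """

    def __init__(self, first_frame, contrast=0.15):
        self.ref = np.log1p(first_frame.astype(float))  # log1p avoids log(0)
        self.contrast = contrast

    def update(self, frame):
        """Return this frame's events as (row, col, polarity) tuples."""
        log_now = np.log1p(frame.astype(float))
        delta = log_now - self.ref
        ys, xs = np.nonzero(np.abs(delta) > self.contrast)
        events = [(y, x, 1 if delta[y, x] > 0 else -1) for y, x in zip(ys, xs)]
        self.ref[ys, xs] = log_now[ys, xs]  # each firing pixel resets its own reference
        return events
```

Nothing moves, nothing fires; a pixel in a static part of the scene stays silent indefinitely, which is exactly the "sensor rather than imager" behavior described above.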
These are really just the early days of event-driven vision as a field.  The people conceptualizing the use of machine vision for systems coming in four to five years are thinking about this now.  It's the time for companies like Chronocam, selling all-new technologies and all-new ideas, to be getting their technical vision in place (pun intended).
How Chronocam's sensor sees a moving hand in a frame.  The picture on the right is what a conventional camera would be sending. 


9 comments:

  1. The process you describe - the sending/updating of only new information or movement, etc. - isn't this what the MPEG process does now? Sometimes not very well, obviously, as some streams are still huge...

  2. That is what MPEG compression does, but I sense that you mean putting event-based transmission circuitry down on the pixel level. This is an interesting approach, but if you want actual video, it has to be pre-loaded with the levels of the entire frame – and those can drift slowly and not be captured by your event sensor (how far does the light have to change before it registers an event?)

    I'm not sure this approach is better than simply doing it as we do now, but sending only a compressed signal out of the sensor, built-in MPEG as it were. Many cameras already do this, including the $19 Asian superwonders, but I doubt the compressor is actually on the sensor silicon.

    No reason why it can't be, though.

  3. I was going to say the same thing.

    MPEG sends "B" frames and "anchor" frames. The anchor frames contain complete information about the scene, and the B frames "update" that information with what's changed.

  4. Yeah but, we gots to sell all those terabytes of storage, you know, for the cloud out there somewhere over the rainbow!!! But on the other side, I wonder how much bandwidth would be saved by moving from current modes of transmission to data-saving methods? Interesting.

  5. See this link for an interesting application of this idea.
    https://woodgears.ca/imgcomp/index.html

    Extremely cool site.
    Paul

    1. Exactly the same principle, just using a SW approach instead of hardware.

  6. I wasn't clear enough on how this works. As Malatrope says, the big difference between this and MPEG compression is that this is done at the pixel level on the sensor. Right now they have relatively low resolution compared to the video standards we're used to, and they're working to get higher resolution. The way I read it, each pixel has a processor (of unknown complexity) under it on a piece of 3D silicon, so the effort is making the pixel and processor smaller.

    So think of the R&D they're doing as addressing all those questions, like "how far does the light have to change before it registers an event?", and improving the density of the pixel/processor stack. Sounds like a tough job, but there's nothing like a group of very clever people with a big pile of money in front of them - money they get if they just figure out a tough technical challenge.

    1. And their version of an anchor frame might just be called a "level reset" or something. If this isn't done, it will drift – that can't be avoided. To coin a phrase, "analog ain't digital". Heh.

  7. This comment has been removed by the author.
