3D Video Compression and Computer Vision & Imagination
For quite a while I have been tossing around in my mind the concept of 3D scenes as a means of both Video Compression and Computer Vision.
It seems reasonably clear to me, for instance, that when I have a memory of a physical experience (as opposed to a memory of a thought, music, emotion, concept, etc.), the memory is not a picture at all. The memory, in fact, is that of some number of things (objects, people, backgrounds, etc.). When I visualize the memory, I often don't remember things that would be obvious in a photographic sense, such as the color of someone's shirt or whether there were clouds in the sky.
Thinking about this makes me reasonably sure that what I am actually doing when I am visualizing the past is that I am imagining it based on what I remember happening at the time. In other words, I am mentally reconstructing the imagery much like a computerized 3D modeling program would. I know what different objects should and do look like and I know what they look like when they are modified by obvious actions. So, for example, if I want to visualize an old, blue, Ford Thunderbird with a dent in the side from an accident, I can do it even if I have never seen such a vehicle. The act of imagining that automobile, I believe, is identical to the act of remembering something that happened in my real past. The exercise feels the same and the imagery in my mind has the same real quality to it.
Take a second to try this out yourself. Imagine an old, blue Ford Thunderbird with a dent in the door and front-right quarter panel from a car accident. Wait a second before continuing to read.
Your brain is an amazing thing. Like mine, your imagination probably also furnished you with an appropriate scene for the car. It might have been parked in a parking lot, in a garage, or by the curb in front of your house. Most likely, it was placed in a scene that is comfortable and plausible for you by the interpolation engine that your imagination uses to fill in the blanks.
Ah, that interpolation engine is an interesting thing though. That bugger is probably responsible for more mistakes in the history of mankind than any other artifact of our intelligence. You see, if I am right and our memories are imagined reconstructions of the important details that we semantically and relationally encode, then we can very easily fall victim to ghost memories. Ghost memories are those memories that were imposed by the interpolation part of your imagination to fill in the gaps in the scenes that you create in your mind. In our example, it was the parking lot or garage. If we pay close attention to our memories while we reconstruct them, we should be able to avoid this problem, because I believe that the actual memories contain few, if any, real scenes. I do believe that imagery itself is stored, though, when it is unique and interesting.
This brings me to how this technique could be used for Video Compression and Computer Vision. It is probably pretty obvious based on what I just wrote, but let me go through it.
First, we must discuss Computer Vision because that is required for 3D Compression to be possible. 3D Computer Vision is the act of composing 3D scenes from video. There are several interesting projects doing this today. I'll try to dig up links another time.
We probably don't have the computational power to do this right today, but what we want is a system that does the following:
(1) Using stereo cameras and/or temporal (time-based) recording (e.g. video), be able to compose first-approximation 3D scenes at greater than 30fps.
(2) Within the processing constraints required to process video at greater than 30fps, be able to do model matching against a database of known objects. We want our system to be able to search a database of 3D models for objects that are similar to what is present in the current 3D scene. It should be noted that since we are working on a continuous stream, we should already have model data from previous time-states. So every new time-state or frame provides us with the opportunity to refine our recognition of objects and to recognize new objects. If this sounds eerily like what you do when you walk into a room, I believe that is no coincidence.
(3) After separating out known models and refining models that we did recognize (e.g. "that is a person" becomes "that is John"), we must be able to both update the database and add new objects to the database. This gets complicated, but for simplicity's sake, consider the database to be always a work in progress. Both 3D models (e.g. polygons, splines, ...) and textures (such as a unique 2D arrangement of color) should be so encoded.
(4) Now, start building a semantic/relational model of the scene (continue building from the previous frame). This model contains references to objects, attributes of those objects and relationships between objects instead of polygons and pixels. Again, the goal is >30fps. We will be encoding movement of objects, movement of the camera, appearance of new objects, recognition of objects unrecognized in previous frames (this may force us to walk backwards in the time-stream and adjust previous frames to account for new knowledge), etc.
At the end of this process, we have an updated database and a stream of frames that is purely a series of coded semantics: references, attributes and relationships.
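To make step (4) a little more concrete, here is a minimal sketch in Python of what one step of such a semantic stream might look like. Everything in it is an assumption made for illustration: the names ObjectRef, SemanticFrame and frame_delta, the event names ("appear", "refine", "set", "vanish", "relate", "unrelate") and the attribute keys are all hypothetical, and the hard part (getting from pixels to these frames) is not shown at all.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectRef:
    # Key into the shared model database (hypothetical), e.g. "person" or "john".
    model_id: str
    # Semantic attributes of this instance, e.g. {"position": (x, y, z)}.
    attributes: dict = field(default_factory=dict)


@dataclass
class SemanticFrame:
    # Instance name -> ObjectRef. No polygons or pixels are stored here.
    objects: dict = field(default_factory=dict)
    # Set of (subject, predicate, target) tuples.
    relations: set = field(default_factory=set)


def frame_delta(prev, cur):
    """List the semantic events that turn frame `prev` into frame `cur`."""
    events = []
    for name, ref in cur.objects.items():
        if name not in prev.objects:
            events.append(("appear", name, ref.model_id, dict(ref.attributes)))
            continue
        old = prev.objects[name]
        if old.model_id != ref.model_id:
            # Recognition got better: "that is a person" became "that is John".
            events.append(("refine", name, ref.model_id))
        # Attributes that disappear between frames are not handled in this sketch.
        for key, value in ref.attributes.items():
            if old.attributes.get(key) != value:
                events.append(("set", name, key, value))
    for name in prev.objects:
        if name not in cur.objects:
            events.append(("vanish", name))
    for rel in sorted(cur.relations - prev.relations):
        events.append(("relate",) + rel)
    for rel in sorted(prev.relations - cur.relations):
        events.append(("unrelate",) + rel)
    return events


before = SemanticFrame(
    objects={"person_1": ObjectRef("person", {"position": (0, 0, 0)})},
)
after = SemanticFrame(
    objects={
        "person_1": ObjectRef("john", {"position": (1, 0, 0)}),
        "chair_1": ObjectRef("chair", {"position": (3, 0, 0)}),
    },
    relations={("person_1", "faces", "chair_1")},
)
for event in frame_delta(before, after):
    print(event)
```

Only the events would need to be written to the stream for each frame; the models themselves stay in the database.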
From a Computational Intelligence point of view, this stream of semantic knowledge is incredibly valuable. Processing semantic knowledge is much, much simpler than processing image data. The Computational Vision system has already provided enough information to let a C.I. system know who is in the room and how things are moving around. This knowledge can now be run through another pass to push forward into new frontiers of awareness for the system.
If a C.I. system records its memories in this manner, they would take very little real memory. The database, of course, would hog a bit by encoding all the models of polygons and whatnot as well as textures and attributes that are common for the models. But the actual scenes use virtually no memory, which means that encoded video takes only a minuscule amount of storage.
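As a rough back-of-the-envelope illustration of that claim, the arithmetic below compares one uncompressed frame of pixels with one frame of semantic events. The figures are made up for illustration (a 640x480 frame at 3 bytes per pixel, 10 events per frame at 32 bytes each), the function names are hypothetical, and the comparison ignores both the size of the model database and the fact that conventional codecs already shrink raw pixels a great deal.

```python
def raw_frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame of pixels."""
    return width * height * bytes_per_pixel


def semantic_frame_bytes(num_events, bytes_per_event):
    """Size of one frame stored as a list of fixed-size semantic events."""
    return num_events * bytes_per_event


# All numbers below are assumptions chosen only to show the arithmetic.
raw = raw_frame_bytes(640, 480)           # 921600 bytes
semantic = semantic_frame_bytes(10, 32)   # 320 bytes
print(raw, semantic, raw / semantic)      # ratio of 2880.0 under these assumptions
```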
Reconstructing full imagery (imagining the scenes) from these semantic memories is reasonably easy. All the system must do is follow the semantic encoding, fetch the referenced objects and apply the semantic attributes and relations (such as movement, contact, distortion, etc.), and a full 3D scene becomes available to the computer imagination system. That scene can be rendered from various angles (including the original angle from which the scene was recorded). It can also be analyzed both from an image point-of-view and a semantic point-of-view.
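The reconstruction described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the model database is a plain dictionary of vertex lists, a frame maps an instance name to a (model id, attributes) pair, the only attributes applied are a hypothetical "position" and "scale", and "rendering from another angle" is reduced to rotating the scene about the vertical axis. None of these names come from a real system.

```python
import math

# Hypothetical model database: model id -> list of (x, y, z) vertices.
MODEL_DB = {
    "unit_square": [
        (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0),
    ],
}


def reconstruct(frame, model_db):
    """Fetch each referenced model and apply its scale and position attributes."""
    scene = {}
    for name, (model_id, attrs) in frame.items():
        scale = attrs.get("scale", 1.0)
        px, py, pz = attrs.get("position", (0.0, 0.0, 0.0))
        scene[name] = [
            (x * scale + px, y * scale + py, z * scale + pz)
            for x, y, z in model_db[model_id]
        ]
    return scene


def view_from(scene, yaw_degrees):
    """Rotate the whole scene about the vertical (y) axis to mimic a new camera angle."""
    angle = math.radians(yaw_degrees)
    c, s = math.cos(angle), math.sin(angle)
    return {
        name: [(c * x + s * z, y, -s * x + c * z) for x, y, z in vertices]
        for name, vertices in scene.items()
    }


# One remembered frame: a sign, twice the model's size, placed at (2, 0, 5).
frame = {"sign_1": ("unit_square", {"position": (2.0, 0.0, 5.0), "scale": 2.0})}
scene = reconstruct(frame, MODEL_DB)
print(scene["sign_1"])
print(view_from(scene, 90)["sign_1"])
```

The same remembered frame yields a scene that can be viewed from the recorded angle (a yaw of 0) or from any other.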
Remember that we were talking about interpolation earlier? Well, it happens here also, but mostly it happens because the database is not complete and so imperfect models will be encoded in scenes. Since the models are always improving (one would hope), old scenes remembered through the filter of new models may yield imprecise results. It would be advisable for model versioning to be present. It would also be advisable for some feedback encoding to occur both at original encode time and in later re-visualizations (e.g. take another look and see if the visualizations make sense or can be factored into better models).
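One simple way to picture model versioning is a database that never overwrites a model, so a remembered scene can be replayed either with the version it was encoded against or with the latest refinement. The sketch below is an assumption about how that could look, not a description of any existing system; the class ModelDatabase and its methods are hypothetical, and the "models" are just strings standing in for real geometry.

```python
class ModelDatabase:
    """Keeps every version of every model instead of overwriting old ones."""

    def __init__(self):
        self._versions = {}  # model id -> list of models, oldest first

    def add_version(self, model_id, model):
        """Store a new version and return its 1-based version number."""
        versions = self._versions.setdefault(model_id, [])
        versions.append(model)
        return len(versions)

    def get(self, model_id, version=None):
        """Return the given version, or the newest one if version is None."""
        versions = self._versions[model_id]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError((model_id, version))
        return versions[version - 1]


db = ModelDatabase()
v1 = db.add_version("john", "rough person-shaped mesh")
v2 = db.add_version("john", "refined mesh with face detail")

# A scene encoded early on remembers which version it referred to.
old_scene_ref = ("john", v1)
print(db.get(*old_scene_ref))   # replay as originally encoded
print(db.get("john"))           # replay through the newest model
```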
Using such a system for compression alone would have very interesting consequences. Every re-run of a movie would be different (depending on the interpolations and model refinements in the system). You could view a movie from multiple angles, you could alter the actors, you could change just about anything very very easily...
My predictions: (A) This technology will exist as soon as the computational horsepower to make it work is affordable and (B) it will become the de facto standard for computer vision and video compression.
But, what the heck do I know?