Jose Maria Gonzalez and Lawrence A. Rowe
(October 23, 2001)
This document presents a simple Open Mash script to capture a sequence of video images, encode them, and transmit them. The tutorial will show you the basic video capture abstractions, the video coding abstractions, RTP framing abstractions, and network transmission abstractions.
An overview of the process is given by the following:
This process is implemented by a series of objects. Some correspond to an actual operation, such as a VideoCapture object; others control the process, such as a VideoAgent or VideoPipeline object; and others provide a convenient abstraction, such as a Network object. All told, it takes almost 20 objects to capture, encode, and transmit a video stream. We omit most of these objects because they are not critical to understanding the general process, and focus on the objects that capture, process, frame, and transmit the packets.
The remainder of this note presents an Open Mash script to capture and send a video stream, a description of the software and hardware architecture, and the objects that implement the script.
The sample script defines an application, named SimpleVideoApp, which creates the objects to open a video capture device, grab video frames, encode them as an H.261 stream, and send them to a multicast session. The code is shown in the following box:
1.  Import enable
2.  import Application AddressBlock VideoAgent VideoPipeline
3.
4.  Class SimpleVideoApp -superclass Application
5.
6.  SimpleVideoApp instproc init {} {
7.      $self next sv
8.      $self add_default defaultTTL 16
9.      $self instvar agent_ vpipe_
10.     set agent_ [new VideoAgent $self 224.2.3.4/9822]
11.     set vpipe_ [new VideoPipeline $agent_]
12. }
13.
14. SimpleVideoApp instproc run {} {
15.     $self instvar agent_ vpipe_
16.     set l [$vpipe_ input_devices]
17.     if {[llength $l] == 0} {
18.         puts "No video capture device available."
19.         return
20.     }
21.     $vpipe_ select [lindex $l 0] h261
22.     $vpipe_ set_quality 10
23.     $vpipe_ start
24. }
25.
26. set app [new SimpleVideoApp]
27. puts "Starting to send video..."
28. $app run
29. vwait forever
You can run this code using either mash or smash. Recall that these programs are like shells or command processors in that they read commands from standard input (e.g., a command console) and execute them. Mash can run Tcl/Tk scripts, like wish, and smash runs Tcl scripts, like tclsh.
So, assuming that the program above is in the file simplevideoapp.tcl, you can execute it by running the following commands:
# smash
% source simplevideoapp.tcl
Starting to send video...
The video stream is sent to the multicast session specified by the IP Multicast address 224.2.3.4 and port 9822. By convention, this address and port combination is written as 224.2.3.4/9822. The program will continue to send the video stream until you kill it. You can see the resulting stream by running vic on another machine and passing it the same multicast address. For example, the following command will run vic on a Unix system:
# vic 224.2.3.4/9822
The video stream sent by SimpleVideoApp will appear as a miniature in the list of received streams. You can double-click on the miniature to display the stream in a larger window.
The Import and import commands on lines 1-2 enable the package importing system and import four classes used by the script. The Application class provides abstractions for defining an Open Mash application including command line argument processing, reading default values from the option database, and defining various attributes of the application. The AddressBlock class defines abstractions for managing information about socket addresses including IP addresses, port numbers, TTL levels, RTP data and control channels, and so forth. VideoAgent and VideoPipeline are the key abstractions for establishing and managing the capture, encode, and transmit process. The import command looks in the mash-code/mash/importTable file to locate which file contains the definition of each class. That file might contain other definitions and other import commands.
The class definition on line 4 creates the SimpleVideoApp application class. An init method is defined for this class on lines 6-12, and a run method is defined on lines 14-24. Otcl uses the command instproc to specify a method. The init method is called when an instance of the SimpleVideoApp class is created. It allocates the VideoAgent and VideoPipeline objects that control the process. The application object and a multicast session address are supplied as arguments to the VideoAgent object; the address defines where the video stream will be sent. The defaultTTL value is set to 16 to limit the transmission of the packets to the local network. The Time to Live (TTL) parameter controls how widely the packets are distributed on the Internet. A TTL of 32 sends the packet to other networks within a suborganization (e.g., hosts on the BMRC network at U.C. Berkeley), a TTL of 64 sends packets to the entire organization (e.g., U.C. Berkeley), and a TTL of 128 sends packets to the global Internet.
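The address/port convention and the TTL setting can also be illustrated outside Open Mash. The following Python sketch is not Open Mash code; it uses only the standard socket module to show how a session string such as 224.2.3.4/9822 splits into an address and a port, and how a TTL of 16 is applied to a UDP socket used for multicast. The helper names parse_session and open_multicast_sender are hypothetical.

```python
import socket


def parse_session(spec):
    """Split an 'address/port' session string into (address, port)."""
    address, port = spec.split("/")
    return address, int(port)


def open_multicast_sender(ttl):
    """Create a UDP socket whose outgoing multicast packets carry this TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock
```

A sender would then call sock.sendto(data, parse_session("224.2.3.4/9822")) for each packet; routers decrement the TTL at every hop and stop forwarding the packet when it runs out, which is what bounds the scope of the stream.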
Otcl and Open Mash use several coding conventions illustrated by the SimpleVideoApp code. The variable $self represents the object on which the method was invoked. The method instvar binds instance variables in the object to local Tcl variables. By convention, Open Mash uses an underscore after an identifier to indicate that the Tcl variable refers to an instance variable of an object. Note that this binding is by reference, not by value, so the variables agent_ and vpipe_ refer to the values in the object, not to local copies of the values. In other words, when you update these variables you are updating the object.
The command on line 26 creates the SimpleVideoApp application. Creation of the instance causes the init method to be called on the new object. The init method creates the objects needed to capture and transmit the video stream. The body of the method calls the superclass init method on line 7, defines the defaultTTL on line 8, and binds the local variables agent_ and vpipe_ to the instance variables of SimpleVideoApp on line 9. Lines 10-11 create an instance of a VideoAgent and a VideoPipeline and assign them to the instance variables.
Otcl is different than most object-oriented programming languages in that new instance variables can be defined at any time. Moreover, instances of a class may have different instance variables. In other words, one instance might have variables v1, v2, and v3 and another instance of the same class might have variables v2 and v4. The instvar command creates the instance variable if it is not already defined in the instance before binding that instance variable to the local variable with the same name. This behavior is similar to the behavior of variables in Tcl.
The run method initializes the capture and encoding objects and starts the process of capturing, encoding, and transmitting video. The command on line 16 assigns a list of available capture devices to the local variable l. The commands on lines 17-20 confirm that a capture device exists, and the commands on lines 21-22 select the compression algorithm (H.261) and set the quality parameter for that codec. Lastly, the command on line 23 starts streaming video by executing the start method on the VideoPipeline object.
The run method is called on line 28 after the application has been initialized. Lastly, the Tcl command "vwait forever" passes control to the event loop, which continues to produce the video stream. You can leave this last command out if you want to issue commands to the running application. For example, you can change various parameters of the streaming process including the frame rate, bit rate, compression quality, or compression algorithm, and you can stop and restart the streaming process by sending stop and start methods to the VideoPipeline. Remember that the vpipe_ variable is not defined at the top level. You can create your own top-level Tcl variable pointing to it by executing the command:
% set vp [$app vpipe_]
Now you can update the slots and call methods on the VideoPipeline object directly. For example, you can change the frame rate by entering the command:
% $vp set_fps 15
This command sets the frame rate to 15 frames per second. Other parameters you can query and modify include bits per second, the image size, and the port on the device from which video frames should be grabbed. The specific methods and definitions are given below.
A video source, like a camera, is connected to a video capture board. When the board is installed into the computer, an OS device driver for the board is added to the system. Open Mash supports many capture boards, and in fact, one machine might have several boards installed in it. Objects are provided that query the system to determine which boards are installed in the machine so the same code can be run on different computers with different boards installed.
Open Mash is a user-level program that has a class, named VideoDevice, that represents the capture board device. Each distinct capture device is defined by a subclass of VideoDevice that contains information about that type of board. In addition, there is a class VideoCapture which represents information about capturing frames from the device. Again, each type of device has a subclass of VideoCapture which defines methods for opening and closing the device, capturing frames, and controlling the device. The VideoDevice abstraction is used to determine what boards exist. The subclasses of the VideoCapture abstraction represent specific boards. An instance of one of those subclasses is created when the capture board device is opened by the VideoPipeline.
For example, here is a listing of VideoDevice and VideoCapture classes defined in Open Mash:
VideoDevice
TestVideoDevice - a device to load an image from a file
V4LVideoDevice - a Video4Linux device
V4WVideoDevice - a Video4Windows device
X11VideoDevice - a device to grab images from an X server
Remember that Tcl names can use the forward slash character ("/") in an identifier so VideoCapture/X11 is the name of a class. A computer can have different types of capture devices installed (e.g., a Video4Windows device and a device for loading images from a file) and there can be several boards of a particular type (e.g., there might be two Video4Linux boards in a computer). And, remember that the V4L and V4W abstractions work for many different types of boards so your computer might have boards from two manufacturers but two instances of VideoCapture/V4W.
All of these classes are defined in the directory mash-code/mash/video. The code is written in C++ because controlling a device requires low-level access to the OS primitives. However, as with much of Open Mash, this code is incorporated into an Otcl class using the TclCL split object abstractions which allow methods to be written in C++ and Tcl.
An application must open the selected device by creating an instance of the VideoCapture class for the board. This action is not shown in the sample code above because it is performed by the VideoPipeline when the start method is executed. Once a frame is captured, it is passed to an encoder which compresses the frame. Remember, the image is represented by an array of pixels. Different representations are used. A typical representation, called 4:2:2 YUV, uses 8 bits of luma (Y) and, on average, 8 bits of chrominance (U and V) for each pixel, because chrominance is sampled on every other pixel rather than on every pixel. The image array can either be interleaved, meaning that the bytes for Y, U, and V are intermixed (for example YUYVYUYV...), or planar, meaning that there are three separate arrays, one for Y, one for U, and one for V. In the latter case, the U and V arrays are each half the size of the Y array. The details of the image representation vary from board to board, so Open Mash has routines for translating between different formats.
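The relationship between the interleaved and the split-apart layouts can be sketched in a few lines of Python. This is an illustration only, not the Open Mash conversion code; it assumes the interleaved byte order Y0 U Y1 V (one common 4:2:2 ordering, other boards may use a different one), and the function name split_yuyv is hypothetical.

```python
def split_yuyv(packed, width, height):
    """Split an interleaved 4:2:2 buffer (Y0 U Y1 V ...) into Y, U, V arrays."""
    if width % 2 != 0:
        raise ValueError("4:2:2 needs an even width")
    if len(packed) != width * height * 2:
        raise ValueError("expected 2 bytes per pixel")
    y = packed[0::2]  # every other byte is a luma sample
    u = packed[1::4]  # one U sample for each pair of pixels
    v = packed[3::4]  # one V sample for each pair of pixels
    return y, u, v
```

For a 352x288 frame, the interleaved buffer holds 352 * 288 * 2 = 202,752 bytes; after splitting, the Y array holds 101,376 bytes and the U and V arrays hold 50,688 bytes each.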
The code above does not show the creation of the encoder object, in this case Module/VideoEncoder/Pixel/H261 which corresponds to the H.261 encoder. The source code for this object is located in the mash-code/mash/codec directory. Encoders are coded in C++ for performance. The following encoder classes are defined:
Module/VideoEncoder/Pixel/JPEG - a JPEG video encoder
Module/VideoEncoder/Pixel/H261 - an H.261 video encoder
Module/VideoEncoder/Pixel/H263 - an H.263 video encoder
Module/VideoEncoder/Pixel/H263+ - an H.263+ video encoder
Module/VideoEncoder/Pixel/NV - an nv video encoder
nv is a format defined by Ron Frederick in an early Mbone tool. Previous versions of the code had other codecs that are no longer supported (e.g., CellB, PVH, etc.). Over time, more codecs need to be added to the system.
The VideoAgent class, defined in the file mash-code/mash/tcl/agent-video.tcl, binds together the VideoPipeline and the network abstractions responsible for sending the packets.
The encoder passes encoded bytes to a framer which packages the data into RTP packets. The framer code is typically included in the encoder source files. Finally, the framer passes each packet to a transmitter which sends the packets to one or more receivers using a socket. The transmitter is part of the network abstractions defined in the directory mash-code/mash/net. This routine is called from the VideoAgent.
The video streaming process is managed by a VideoAgent. A VideoPipeline object connects the capture device object to the encoder object. The definition of the VideoPipeline class, which is written in Tcl, is located in the mash-code/mash/tcl/video/videopipeline.tcl file. You will notice that this file also defines a class, named VideoTap, which conceptually represents a video source. But, most methods defined on the VideoTap are also defined on the VideoPipeline so you can just call them on the pipeline.
Figure 1 shows the sequence of objects through which the data is passed. Although this figure suggests that the data is copied each time it is passed to the next object in the sequence, in fact the data is not copied at every step. Typically, the capture board DMAs the data into a kernel buffer managed by the device driver. The capture object copies the data into an Open Mash buffer in user space. This buffer is passed by reference to the encoder, which calls a framer method to process the compressed data it generates. The framer allocates a buffer that represents an RTP packet and writes the compressed bytes into that packet. When the packet is filled, the framer passes the buffer to the transmitter by reference and allocates a new buffer for the next packet. The transmitter adds the appropriate session headers (i.e., destination addresses and so forth) and writes the packet to a socket at the appropriate data rate. After the packet is sent, it is deallocated by the transmitter. The encoder deallocates the buffer which contains the uncompressed frame data.
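The fill-and-hand-off behavior of the framer can be sketched as follows. This Python code is a simplified illustration, not the Open Mash framer: the class names, the max_payload parameter, and the Transmitter stand-in (which records packets rather than writing them to a socket) are all assumptions. It writes only the 12-byte fixed RTP header and omits the H.261-specific payload header that a real H.261 framer adds.

```python
import struct

RTP_VERSION = 2
H261_PAYLOAD_TYPE = 31  # static RTP payload type assigned to H.261


class Transmitter:
    """Stand-in for the network side: records packets instead of sending."""

    def __init__(self):
        self.sent = []

    def recv(self, packet):
        self.sent.append(packet)


class Framer:
    """Collects compressed bytes into RTP packets of bounded payload size."""

    def __init__(self, target, ssrc, max_payload=1000):
        self.target = target
        self.ssrc = ssrc
        self.max_payload = max_payload
        self.seq = 0
        self.buf = bytearray()

    def write(self, data, timestamp):
        """Append compressed bytes; hand off a packet whenever one overflows."""
        self.buf.extend(data)
        while len(self.buf) > self.max_payload:
            self._emit(self.max_payload, timestamp, marker=0)

    def end_frame(self, timestamp):
        """Send what is left, with the marker bit flagging the end of frame."""
        if self.buf:
            self._emit(len(self.buf), timestamp, marker=1)

    def _emit(self, size, timestamp, marker):
        header = struct.pack(
            "!BBHII",
            RTP_VERSION << 6,                   # version 2, no padding/extension
            (marker << 7) | H261_PAYLOAD_TYPE,  # marker bit and payload type
            self.seq & 0xFFFF,                  # 16-bit sequence number
            timestamp & 0xFFFFFFFF,             # 32-bit media timestamp
            self.ssrc,                          # 32-bit source identifier
        )
        self.target.recv(header + bytes(self.buf[:size]))
        del self.buf[:size]                     # start filling the next packet
        self.seq += 1
```

All packets of one frame share a timestamp and carry consecutive sequence numbers, which lets a receiver detect loss and reassemble the frame.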
An important feature of the encoding system is the rate at which various events happen. There is a frame capture rate that controls how often a frame is grabbed. This data is handed to the encoder which has a target bit rate it uses for encoding the frame. The framer passes the RTP packets to the transmitter which keeps track of the transmission rate. It paces the data packet output so that if you are encoding 24 frames per second, the next frame is not sent until 1/24th of a second after the last frame was sent. In some cases the process is too slow; in other words, at N frames per second, encoding a frame may take longer than 1/Nth of a second. In that case, the capture object will fall behind and eventually skip a frame. Eventually the encoding abstractions should monitor their behavior so they do not take longer than real-time to encode, but for now the code does not do that.
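The effect of a slow encoder on the capture rate can be seen with a small simulation. The Python sketch below uses a deliberately simplified model that is an assumption, not the Open Mash implementation: capture tick k fires at time k/fps, and a tick that fires while the encoder is still busy with an earlier frame is skipped. The function name simulate_capture is hypothetical.

```python
from fractions import Fraction


def simulate_capture(fps, encode_times):
    """Return the indices of the capture ticks whose frames get encoded.

    encode_times lists, in seconds, how long each encoded frame takes.
    """
    interval = Fraction(1, fps)
    busy_until = Fraction(0)
    encoded = []
    tick = 0
    for cost in encode_times:
        # Skip every tick that fires before the encoder is free again.
        while tick * interval < busy_until:
            tick += 1
        encoded.append(tick)
        busy_until = tick * interval + Fraction(cost)
        tick += 1
    return encoded
```

At 24 frames per second, an encoder that needs 1/48th of a second per frame keeps up and encodes ticks 0, 1, 2, and so on, whereas one that needs 1/16th of a second encodes only every other tick (0, 2, 4, ...).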
The VideoPipeline is defined as follows:
# Definition of VideoPipeline Class
#
# VideoPipeline inherits from RTP/Video which inherits from RTP
# (Note: not sure why it inherits from these objects since none of
# the inherited slots or methods are meaningful.)
Class VideoPipeline -superclass RTP/Video -configuration {
bufferPool_ # buffers for captured frames
device_ # currently selected device
encoder_ # encoder to be called with captured frame
format_ # currently selected format for video coding
initialized_ # is the device initialized?
quality_ # quality factor for the encoding
running_ # is pipeline capturing video?
session_ # network session to which stream should be sent
tap_ # object that represents a video source
# Inherited Slots
classmap_
rtp_atop_
rtp_ptoa_
}
# Create/destroy a pipeline instance
VideoPipeline public init { session }
VideoPipeline public destroy {}
# A pipeline might be able to capture frames from several input devices
# and produce different compressed streams. These methods return
# info about input devices and available formats (remember some devices
# might support hardware encoding and not allow raw capture). You can open
# and close devices and change formats with select.
VideoPipeline public input_devices {}
VideoPipeline public available_formats { device }
VideoPipeline public select { device format }
VideoPipeline public open { device format }
VideoPipeline public close {}
# This method allows you to change the network session to which the stream
# is sent.
VideoPipeline public switch_session { session }
# This method provides access to hardware attributes
# that can be manipulated by the application such as color
# space. These parameters are defined by the capture device.
VideoPipeline public hardware { args }
# Start/stop stream. The running method returns true if a stream
# is currently being produced.
VideoPipeline public start { args }
VideoPipeline public stop { args }
VideoPipeline public running {}
# These methods set attributes about the pipeline.
VideoPipeline public set_bps { args }
VideoPipeline public set_fps { args }
VideoPipeline public set_decimate { v }
VideoPipeline public set_norm { args }
VideoPipeline public set_port { args }
VideoPipeline public set_quality { q }
VideoPipeline public fillrate { args }
## Inherited Methods
RTP rtp_type { payloadtype }
- Architecture
How it works (main view of objects):
- VideoPipeline
  - has variable tap_ = new VideoTap
    - has variable grabber_ = new VideoCapture/XXX
  - has variable encoder_ = new Module/VideoEncoder/XXX
- VideoAgent
  - has variable session_ = new Session/RTP/Video
- Encoding Description
a. Object Initialization
a.1. To build a basic encoder you need to create a VideoPipeline and a
VideoAgent object. As noted above, the VideoPipeline gets every
frame from the capture device, encodes it, and packetizes the result.
The VideoPipeline must be able to call the VideoAgent every time it
has an RTP packet ready, so when you create the former you have to pass
it a handle to the latter. This handle is stored in the variable
VideoPipeline::session_.
VideoPipeline::init{} creates a VideoTap object and stores its
handle in the variable VideoPipeline::tap_. A "VideoTap" object
(mash-1/tcl/video/pipeline.tcl) represents a generic source of
video.
a.2. VideoAgent inherits from RTPAgent (mash-1/tcl/net/agent-rtp.tcl).
VideoAgent::init{} uses next to invoke RTPAgent::init{}, which calls
VideoAgent::create_session{} (pretty dirty!). The latter creates a
Session/RTP/Video object (the object that will transmit the RTP packets),
and stores the resulting handle in the VideoAgent::session_ variable.
Session/RTP/Video is the tcl name of the VideoSession class
(both in mash-1/rtp/session-rtp.h,cc). VideoSession descends from
RTP_Session, which has a method called RTP_Session::recv that
accepts an RTP packet and sends it.
a.3. Once you have created a VideoPipeline, you have to select the
capture device and encoder you are going to use. For the former,
you can select any of the VideoCapture/XXX classes you find in
your mash distribution. Try the following command in mash:
% info commands VideoCapture/*
VideoCapture/Test VideoCapture/X11 VideoCapture/V4l0
On elmer, we can see three different capture devices. The first two are
virtual in the sense that no video-capture card is associated with them.
The first one is a test device (mash-1/video/video-test.cc) that creates
an artificial stream by repeating the same frame continuously. The frame
is obtained from an image file that you can select. The
second device (mash-1/video/video-x11.cc) uses X11 to transmit part of
your X11 display. The only "real" device is VideoCapture/V4l0, which
represents a Video4Linux-compatible capture card (mash-1/video/video-v4l.cc)
located at /dev/video0.
For the encoder, you have to select the name of the format. You can
use "h261" and "jpeg" to start with.
Once you have decided on the capture device and the encoding format,
you call VideoPipeline::select{}, which basically records the device
name (i.e., the complete name of the VideoCapture/XXX object that
represents the capturer) and the encoding format name (h261, jpeg, ...).
a.4. Once you have set up all the objects participating in the
transmission, the last step is to start it. VideoPipeline::start{}
calls VideoPipeline::open{}, which does two things:
a) First, it creates the encoder object,
"Module/VideoEncoder/Pixel/XXX",
which is just the tcl name of the C++ object XXXEncoder
(mash-1/codec/encoder-xxx.h,cc), and stores the resulting handle
in VideoPipeline::encoder_. Then it gives the encoder a link
to the transmitter ("$encoder_ target [$session_ get_transmitter]"),
which sets the Module::target_ variable. This Module::target_
variable is what the encoder will call using recv() when an RTP
packet is ready (linking encoder->transmitter).
b) Second, it calls VideoTap::open{}, which creates the capturer
object, "VideoCapture/XXX" (mash-1/video/video-xxx.cc), that will
produce the raw frames, and stores the resulting handle in
VideoTap::grabber_ ("set grabber_ [new $device_ $videoType]").
It then gives the capturer a link to the encoder
("$tap_ target $grabtarget $encoder_"), which calls VideoTap::target{}
with the encoder handle as second argument. This calls
"$grabber_ encoder $encoder_", which calls VideoCapture::command(),
which sets the "VideoCapture::encoder_" variable that the capturer
can call using recv() when a full frame is ready
(linking capturer->encoder).
Now the video stream starts being transmitted.
b. Run-time Explanation (how every proc is called)
b.1. The timing mechanism.
Every VideoCapture/XXX object descends from "VideoCapture"
(mash-1/video/video-device.h,cc), which itself descends from
Timer (tclcl/timer.h,cc).
VideoCapture::bps() is used to select the number of bits per second
that the encoder must generate; it in turn calls
VideoCapture::adjust_frameclock(). The latter uses Timer::usched() as
its timing mechanism. Timer::usched() is simply a wrapper doing
microsecond-to-millisecond conversion for Timer::msched(), which is
implemented as a tcl timer (Tcl_CreateTimerHandler() in
tcl8.0/generic/tclTimer.c).
The main idea is that the VideoCapture object wakes itself up with
a frequency that is a function of the user-selected bit (bps) and
frame (fps) rates. Once it has been woken up, it can request the capture
device to provide a frame.
In detail, the tcl timer is asked to call Timer::dispatch()
every so often. This function calls Timer::timeout(), which is a pure
virtual C++ method, so the method actually called is
VideoCapture::timeout(). It calls VideoCapture::tick(), which in turn
calls VideoCapture::grab(). VideoCapture::grab() is defined as a virtual
function, so the function really called is VideoCaptureXXX::grab().
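The self-rescheduling pattern described in b.1 can be sketched in Python with the standard sched module and a simulated clock. This is an illustration of the call chain (dispatch, timeout, tick, grab), not a port of the C++ code: the FakeClock class, the Capture constructor arguments, and the choice to re-arm the timer inside tick() are assumptions made for the example.

```python
import sched


class FakeClock:
    """Simulated clock so the example runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, delay):
        self.now += delay


class Timer:
    """Base class: schedules a wake-up and dispatches it to timeout()."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def msched(self, milliseconds):
        self.scheduler.enter(milliseconds / 1000.0, 1, self.dispatch)

    def usched(self, microseconds):
        # Thin wrapper: convert microseconds to milliseconds.
        self.msched(microseconds / 1000.0)

    def dispatch(self):
        self.timeout()

    def timeout(self):
        raise NotImplementedError  # supplied by the subclass


class Capture(Timer):
    """Wakes itself up once per frame interval and grabs a frame."""

    def __init__(self, scheduler, clock, fps, frames_wanted):
        super().__init__(scheduler)
        self.clock = clock
        self.interval_us = 1_000_000 // fps
        self.remaining = frames_wanted
        self.grabbed_at = []

    def start(self):
        self.usched(self.interval_us)

    def timeout(self):
        self.tick()

    def tick(self):
        self.grab()
        self.remaining -= 1
        if self.remaining > 0:
            self.usched(self.interval_us)  # re-arm for the next frame

    def grab(self):
        # A real capturer would read a frame here; we record the time.
        self.grabbed_at.append(self.clock.time())
```

Creating a scheduler with sched.scheduler(clock.time, clock.sleep), calling start() on a Capture built with fps=4 and frames_wanted=3, and then running the scheduler records grabs at 0.25, 0.5, and 0.75 simulated seconds.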
b.2. Capturing a frame
VideoCaptureXXX::grab() gets a frame and writes its contents into
VideoCapture::frame_. Then it calls VideoCaptureXXX::target_->recv(),
which is in fact a call to XXXEncoder::recv().
b.3. Encoding the frame
The encoder therefore starts in its recv() method. In the usual case,
XXXEncoder::recv() calls XXXEncoder::encode(), which encodes the
provided frame. When it has enough data to send one RTP packet (which
normally happens more than once per frame), it calls XXXEncoder::flush(),
which, after preparing the RTP packet header and body, calls
XXXEncoder::target_->recv().
b.4. Sending the packets
XXXEncoder::target_->recv() is in fact a call to RTP_Session::recv(),
which sends the packet.
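The run-time chain in b.2 to b.4 (capturer, encoder, transmitter, each calling recv() on its target) can be summarized with a small Python sketch. The classes below are stand-ins written for this illustration, not the Open Mash classes: the "encoder" passes the frame bytes through unchanged, the "transmitter" stores packets in a list, and build_chain and packet_size are hypothetical names.

```python
class Transmitter:
    """End of the chain: plays the role of RTP_Session::recv()."""

    def __init__(self):
        self.packets = []

    def recv(self, packet):
        self.packets.append(packet)


class Encoder:
    """Middle of the chain: turns a frame into one or more packets."""

    def __init__(self, packet_size):
        self.packet_size = packet_size
        self.target = None  # set to the transmitter when wiring

    def recv(self, frame):
        self.encode(frame)

    def encode(self, frame):
        # Placeholder "compression": the frame bytes pass through unchanged.
        for start in range(0, len(frame), self.packet_size):
            self.flush(frame[start:start + self.packet_size])

    def flush(self, payload):
        self.target.recv(payload)


class Capture:
    """Start of the chain: produces frames and hands them to its target."""

    def __init__(self, frames):
        self.frames = iter(frames)
        self.target = None  # set to the encoder when wiring

    def grab(self):
        self.target.recv(next(self.frames))


def build_chain(frames, packet_size):
    """Wire capturer -> encoder -> transmitter, as open{} does in a.4."""
    transmitter = Transmitter()
    encoder = Encoder(packet_size)
    capture = Capture(frames)
    encoder.target = transmitter  # link encoder -> transmitter
    capture.target = encoder      # link capturer -> encoder
    return capture, transmitter
```

Each call to grab() pushes one frame through the whole chain, so a frame larger than packet_size arrives at the transmitter as several packets, matching the remark that flush() normally runs more than once per frame.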