Martin Rubli
Building a Webcam Infrastructure for GNU/Linux
Master Thesis
EPFL, Switzerland, 2006
Prof. Matthias Grossglauser, Laboratory for Computer Communications and Applications, EPFL
Richard Nicolet, Logitech
Remy Zimmermann, Logitech
© 2006 Martin Rubli
School of Computer and Communication Sciences, Swiss Federal Institute of Technology, Lausanne, Switzerland
Logitech, Fremont, California
Revision a. All trademarks used are properties of their respective owners. This document was set in Meridien LT and Frutiger using the LaTeX typesetting system on Debian GNU/Linux.
Abstract

In this thesis we analyze the current state of webcam support on the GNU/Linux platform. Based on the results gained from that analysis we develop a framework of new software components and improve the current platform with the goal of enhancing the user experience of webcam owners. Along the way we get a close insight into the components involved in streaming video from a webcam and into what today's hardware is capable of doing.
Contents

1 Introduction

2 Current state of webcam hardware
  2.1 Introduction
  2.2 Terminology
  2.3 Logitech webcams
    2.3.1 History
    2.3.2 Cameras using proprietary protocols
    2.3.3 USB Video Class cameras
  2.4 USB Video Class
    2.4.1 Introduction
    2.4.2 Device descriptor
    2.4.3 Device topology
    2.4.4 Controls
    2.4.5 Payload formats
    2.4.6 Transfer modes
  2.5 Non-Logitech cameras

3 An introduction to Linux multimedia
  3.1 Introduction
  3.2 Linux kernel multimedia support
    3.2.1 A brief history of Video4Linux
    3.2.2 Linux audio support
  3.3 Linux user mode multimedia support
    3.3.1 GStreamer
    3.3.2 NMM
  3.4 Current discussion

4 Current state of Linux webcam support
  4.1 Introduction
    4.1.1 Webcams and audio
  4.2 V4L2: Video for Linux Two
    4.2.1 Overview
    4.2.2 The API
    4.2.3 Summary
  4.3 Drivers
    4.3.1 The Philips USB Webcam driver
    4.3.2 The Spca5xx Webcam driver
    4.3.3 The QuickCam Messenger & Communicate driver
    4.3.4 The QuickCam Express driver
    4.3.5 The Linux USB Video Class driver
  4.4 Applications
    4.4.1 V4L2 applications
    4.4.2 V4L applications
    4.4.3 GStreamer applications
  4.5 Problems and design issues
    4.5.1 Kernel mode vs. user mode
    4.5.2 The Video4Linux user mode library
    4.5.3 V4L2 related problems

5 Designing the webcam infrastructure
  5.1 Introduction
  5.2 Goals
  5.3 Architecture overview
  5.4 Components
    5.4.1 Overview
    5.4.2 UVC driver
    5.4.3 V4L2
    5.4.4 GStreamer
    5.4.5 v4l2src
    5.4.6 lvfilter
    5.4.7 LVGstCap (part 1 of 3: video streaming)
    5.4.8 libwebcam
    5.4.9 libwebcampanel
    5.4.10 LVGstCap (part 2 of 3: camera controls)
    5.4.11 liblumvp
    5.4.12 LVGstCap (part 3 of 3: feature controls)
    5.4.13 lvcmdpanel
  5.5 Flashback: current problems

6 Enhancing existing components
  6.1 Linux UVC driver
    6.1.1 Multiple open
    6.1.2 UVC extension support
    6.1.3 V4L2 controls in sysfs
  6.2 Video4Linux
  6.3 GStreamer
  6.4 Bits and pieces

7 New components
  7.1 libwebcam
    7.1.1 Enumeration functions
    7.1.2 Thread-safety
  7.2 liblumvp and lvfilter
  7.3 libwebcampanel
    7.3.1 Meta information
    7.3.2 Feature controls
  7.4 Build system
  7.5 Limitations
    7.5.1 UVC driver
    7.5.2 Linux webcam framework
  7.6 Outlook
  7.7 Licensing
    7.7.1 Libraries
    7.7.2 Applications
  7.8 Distribution

8 The new webcam infrastructure at work
  8.1 LVGstCap
  8.2 lvcmdpanel

9 Conclusion

A List of Logitech webcam USB PIDs
Chapter 1
Introduction

Getting a webcam to work on Linux is a challenge on different levels. Making the system recognize the device properly sets the bar to a level that many users feel unable to cross, often out of a mostly unsubstantiated fear of compiling kernel drivers. Even once that first hurdle is cleared, the adventure has only just started. A webcam is perfectly useless without good software that takes advantage of its features, so where do users go from here?

Since the first webcams appeared on the market, they have evolved from simple devices that captured relatively poor quality videos the size of a postage stamp to high-tech devices that allow screen-filling videos to be recorded, all while applying complex real-time video processing in hardware and software.

Traditionally, Linux has been used for server installations and only in recent years has it started to conquer the desktop. This fact still shows in the form of two important differences when one compares webcam support on Linux and Windows. For one, Linux applications have primarily focused on retrieving still images from the cameras, oftentimes for "live" cameras on the Internet that update a static picture every few seconds. These programs often work in a headless environment, i.e. one that does not require a graphical user interface and a physical screen. For another, webcam manufacturers have provided little support for the Linux platform, most of which was in the form of giving technical information to the open source community without taking the opportunity to actively participate and influence the direction that webcam software takes.

This project is an attempt by Logitech to change this in order to provide Linux users with an improved webcam experience that eventually converges towards the one that Windows users enjoy today. Obviously, the timeline of such an undertaking is on the order of years due to the sheer number of components and people involved. Luckily, the scope of a Master thesis is enough to lay the foundations that are required, not only of a technical nature but also in terms of establishing discussions between the parties involved.
In the course of this project, apart from presenting the newly developed framework, we will look at many of the components that already exist today, highlighting their strengths but also their weaknesses. It was this extensive analysis that eventually led to the design of the proposed framework in an attempt to learn from previous mistakes and raise awareness of current limitations. The latter is especially important for a platform that has to keep up with powerful and agile competitors. The foundations we laid with the Linux webcam framework make it easier for developers to base their products on a common core, which reduces development time, increases stability, and makes applications easier to maintain. All of these are key to establishing a successful multimedia platform and delivering users the experience they expect from an operating system that has officially set out to conquer the desktop.

I would like to thank first of all my supervisors at Logitech, Richard Nicolet and Remy Zimmermann, for their advice and the expertise they shared with me, but also the rest of the video driver and firmware team for their big help with various questions that kept coming up. Thanks also to Matthias Grossglauser, my supervisor at EPFL, for his guidance. A big thank you to the people in the open source community I got to work with or ask questions to. In particular this goes to Laurent Pinchart, the author of the Linux UVC driver, first of all for having written the driver, thereby letting me concentrate on the higher-level components, and second of all for the constructive collaboration in extending it. Last but not least, thanks to everybody who helped make this project happen in one way or another but whose name did not make it into this section.

Fremont, USA, September 2006
Chapter 2
Current state of webcam hardware

2.1 Introduction
The goal of this chapter is to give an overview of the webcams that are currently on the market. We will first focus on Logitech devices and devote a small section to cameras of other vendors later on. We will also give an overview of the USB Video Class, or simply UVC, specification, which is the designated standard for all future USB camera devices.

The Linux webcam framework was designed primarily with UVC devices in mind and one main goal of this chapter is to present the hardware requirements of the framework. Therefore, the majority of the chapter is dedicated to UVC cameras, as devices using proprietary protocols are slowly being phased out by the manufacturers. We will nevertheless mention the most important past generations of webcams because some of them remain in broad use and it will be interesting to see how they differ in functionality.
2.2 Terminology

There are a few terms that will keep coming up in the rest of the report. Let us quickly go over some of them to avoid any terminology-related confusion.

USB modes
In the context of USB we will often use the terms high-speed to denote USB 2.0 operation and full-speed for the USB 1.x case. There also exists a mode called low-speed that was designed for very low bandwidth devices like keyboards or mice. For webcams, low-speed is irrelevant.

Image resolutions
There is a number of standard resolutions that have corresponding acronyms. We will sometimes use these acronyms for readability's sake.
Table 2.1 has a list of the most common ones.¹

    Width [px]   Height [px]   Acronym
    160          120           QSIF
    176          144           QCIF
    320          240           QVGA (also SIF)
    352          288           CIF
    640          480           VGA
    1024         768           XGA
    1280         960           SXGA (4:3)
    1280         1024          SXGA (5:4)

Table 2.1: List of standard resolutions and commonly used acronyms.

¹ For some of the acronyms there exist different resolutions depending on the analog video standard they were derived from. For example, 352x288 is the PAL version of CIF whereas NTSC CIF is 352x240.
2.3 Logitech webcams

2.3.1 History

In the last years the market has seen a myriad of different webcam models and technologies. The first webcams were devices for the parallel port, allowing very limited bandwidth and a user experience that was far from the plug-and-play that users take for granted nowadays. With the advent of the Universal Serial Bus, webcams finally became comfortable and simple enough to use for the average PC user. Driver installation became simple and multiple devices could share the bus. Using a printer and a webcam at the same time was no longer a problem.

One of the limitations of USB, however, was a bandwidth that was still relatively low, and image resolutions above 320x240 pixels required compression algorithms that could send VGA images over the bus at tolerable frame rates. Higher resolution video at 25 or more frames per second only became possible when USB 2.0 was introduced. A maximum theoretical transfer rate of 480 Mb/s provides enough reserves for the next generations of webcams with multi-megapixel sensors. All recent Logitech cameras take advantage of USB 2.0, although they still work on USB 1.x controllers, albeit with a limited resolution set.
2.3.2 Cameras using proprietary protocols

From a driver point of view Logitech cameras are best distinguished by the ASIC² they are based on. While the sensors are also an important component that the driver has to know about, such knowledge becomes less important because the firmware hides sensor-specific commands from the USB interface. In the case of UVC cameras, even the ASIC is completely abstracted by the protocol and, in the optimal case, every UVC camera works with any UVC driver, at least as far as the functionality covered by the standard is concerned. The following list shows a number of Logitech's non-UVC cameras and is therefore grouped by the ASIC family they use. We will see in chapter 4 that this categorization is useful when it comes to selecting a driver.

Vimicro 30x based
Cameras with the Vimicro 301 or 302 chips are USB 1.1 devices, in the case of the 302 with built-in audio support. They support a maximum resolution of VGA at 15 frames per second. Apart from uncompressed YUV data, they can also deliver uncompressed 8 or 9-bit RGB Bayer data or, with the help of an integrated encoder chip, JPEG frames.
• Logitech QuickCam IM
• Logitech QuickCam Connect
• Logitech QuickCam Chat
• Logitech QuickCam Messenger
• Logitech QuickCam for Notebooks
• Logitech QuickCam for Notebooks Deluxe
• Logitech QuickCam Communicate STX
• Labtec Webcam Plus
• Labtec Notebook Pro

Philips SAA8116 based
The Philips SAA8116 is also a USB 1.1 chipset that supports VGA at a maximum of 15 fps. It has built-in microphone support and delivers image data in 8, 9, or 10-bit RGB Bayer format. It can also use a proprietary YUV compression format that we will encounter again in section 4.3.1 where we talk about the Linux driver for cameras based on this chip.
• Logitech QuickCam Zoom
• Logitech QuickCam Pro 3000
• Logitech QuickCam Pro 4000
• Logitech QuickCam Orbit/Sphere³

² The application-specific integrated circuit in a webcam is the processor designed to process the image data and communicate it to the host.
³ There also exists a model of this camera that does not use Philips ASICs but the SPCA525 described below. This model has a different USB identifier, as can be seen in the table in appendix A.
• Logitech QuickCam Pro for Notebooks
• Logitech ViewPort AV100
• Cisco VT Camera

Sunplus SPCA561 based
The Sunplus SPCA561 is a low-end USB 1.1 chipset that only supports the CIF format at up to 15 fps. The following is a list of cameras that are based on this chip:
• Logitech QuickCam Chat
• Logitech QuickCam Express
• Logitech QuickCam for Notebooks
• Labtec Webcam
• Labtec Webcam Plus
2.3.3 USB Video Class cameras

Logitech was the first webcam manufacturer to offer products that use the USB Video Class protocol, although this transition was done in two steps. It started with a first set of cameras containing the Sunplus SPCA525 chip, which supports both a proprietary protocol and the UVC standard. The USB descriptors of these cameras still announce the camera as a so-called vendor class device. This conservative approach was due to the fact that the first models did not pass all the tests required to qualify as UVC devices. As we will see later on when we talk about the Linux UVC driver in more detail, the UVC support of these cameras is nevertheless fairly complete, which is why the driver simply overrides the device class and treats them as ordinary UVC devices. The following is a complete list of these devices:
• Logitech QuickCam Fusion
• Logitech QuickCam Orbit MP/Sphere MP
• Logitech QuickCam Pro 5000
• Logitech QuickCam for Notebooks Pro
• Logitech QuickCam for Dell Notebooks (built-in camera for notebooks)
• Acer OrbiCam (built-in camera for notebooks)
• Cisco VT Camera II

Figure 2.1 shows product photos of some of these cameras. All SPCA525-based cameras are USB 2.0 compliant and include an audio chip. They support VGA at 30 fps and, depending on the sensor used, higher resolutions up to 1.3 megapixels at lower frame rates. To reduce the traffic on the bus they feature a built-in JPEG encoder to support streaming of MJPEG data in addition to uncompressed YUV.
Figure 2.1: The first Logitech webcams with UVC support: (a) QuickCam Fusion, (b) QuickCam Orbit MP, (c) QuickCam Pro 5000, (d) QuickCam for Notebooks Pro.
The next generation of Logitech webcams, scheduled for the second half of 2006, consists of pure UVC-compliant cameras. Among them are the QuickCam Ultra Vision and the 2006 model of the QuickCam Fusion.
Figure 2.2: The first pure Logitech UVC webcam: QuickCam UltraVision
All of these new cameras are supported by the Linux UVC driver and are automatically recognized because their USB descriptors mark them as USB Video Class devices, therefore eliminating the need to hardcode their product identifiers in the software.
2.4 USB Video Class
2.4.1 Introduction

We have already briefly mentioned the concept of USB device classes. Each device can either classify itself as a custom, vendor-specific device or as belonging to one of the different device classes that the USB forum has defined. There exist many device classes, with some of the best-known being mass storage, HID (Human Interface Devices), printers, and audio devices. If an operating system comes with a USB class driver for a given device class, it can take advantage of most or all of the device's features without requiring the installation of a specific driver, hence greatly adding to the user's plug-and-play experience.

The USB Video Class standard follows the same strategy, supporting video devices such as digital camcorders, television tuners, and webcams. It supports a variety of features that cover the most frequent use cases while allowing device manufacturers to add their own extensions. The remainder of this section gives the reader a short introduction to some of the key concepts of UVC. We will only cover what is important to understand the scope of this report and refer the interested reader to [6] for the technical details.
2.4.2 Device descriptor

USB devices are self-descriptive to a large degree, exporting all information necessary for a driver to make the device work in a so-called descriptor. While the USB standard imposes a few ground rules on what the descriptor must contain and on the format of that data, different device classes build their own class-specific descriptors on top of these. The UVC descriptor contains such information as the list of video standards, resolutions, and frame rates supported by the device as well as a description of all the entities that the device defines. The host can retrieve all the information it needs from these descriptors and make the device's features available to applications.
2.4.3 Device topology

The functionality of UVC devices is divided up into two different kinds of entities: units and terminals. Terminals are data sources or data sinks, typical examples being a CCD sensor or a USB endpoint. Terminals only have a single pin through which they can be connected to other entities. Units, on the other hand, are intermediate entities that have at least one input and one output pin. They can be used to select one of many inputs (selector unit) or to control image attributes (processing unit).

There is a special type of unit that we will talk most about in this report, the extension unit. Extension units are the means through which vendors can add features to their devices that the UVC standard does not specify. To do anything useful with the functionality that extension units provide, the host driver or application must have additional knowledge about the device because, while the extension units themselves are self-descriptive, the controls they contain are not. We shall see the implications of this fact later on when we discuss the Linux UVC driver.

When the driver initializes the device, it enumerates its entities and builds a graph with two terminal nodes, an input and an output terminal, and one or multiple units in between.
2.4.4 Controls

Both units and terminals contain sets of so-called controls through which a wide range of camera settings can be changed or retrieved. Table 2.2 lists a few typical examples of such controls, grouped by the entities they belong to. Note that the controls in the third column are not specified by the standard but are instead taken from the list of extension controls that the current Logitech UVC webcams provide.
    Camera terminal    Processing unit             Extension units
    • Exposure time    • Backlight compensation    • Pan/tilt reset
    • Zoom             • Brightness                • Lens focus
                       • Contrast                  • LED state
                       • Hue                       • Motor control (pan/tilt/roll)
                       • Saturation                • Pixel defect correction
                       • White balance             • Firmware version

Table 2.2: A selection of UVC terminal and unit controls. The controls in the first two columns are defined in the standard, the availability and definition of the controls in the last column depends on the camera model.
2.4.5 Payload formats

The UVC standard defines a number of different formats for the streaming data that is to be transferred from the device to the host, such as DV, MPEG-2, MJPEG, or uncompressed. Each of these formats has its own adapted header format that the driver needs to be able to parse and process correctly. MJPEG and uncompressed are the only formats used by today's Logitech webcams and they are also currently the only ones understood by the Linux UVC driver.
2.4.6 Transfer modes

UVC devices have the choice between using bulk and isochronous data transfers. Bulk transfers guarantee that all data arrives without loss but do not make any similar guarantees as to bandwidth or latency. They are commonly used in file transfers where reliability is more important than speed. Isochronous transfers are used when a minimum speed is required but the loss of certain packets is tolerable. Most webcams use isochronous transfers because it is more acceptable to drop a frame than to transmit and display frames with a delay. In the case of a lost frame, the driver can simply repeat the previous frame, something that is barely noticeable by the user, whereas delayed frames are usually considered more disruptive to a video conversation.
2.5 Non-Logitech cameras

Creative WebCam
Creative has a number of webcams that work on Linux, most of them with the SPCA5xx driver. A list of supported devices can be found on the developer's website[23]. Creative also has a collection of links to drivers that work with some of their older camera models[3].
Microsoft LifeCam
In summer 2006 Microsoft entered the webcam market with two new products, the LifeCam VX-3000 and VX-6000 models. Neither of them is currently supported on Linux due to the fact that they use a proprietary protocol. Further models are scheduled but none of them are reported to be UVC compliant at this time.
Chapter 3
An introduction to Linux multimedia

3.1 Introduction
This chapter gives an overview of what the current state of multimedia support looks like on GNU/Linux. We shall first look at the history of the involved components and then proceed to the more technical details. At the end of this chapter the reader should have an overview of the different multimedia components available on Linux and how they work together.
3.2 Linux kernel multimedia support

3.2.1 A brief history of Video4Linux

Video devices were available long before webcams became popular. TV tuner cards formed the first category of devices to spur the development of a multimedia framework for Linux. In 1996 a series of drivers targeted at the popular BrookTree Bt848 chipset that was used in many TV cards made it into the 2.0 kernel under the name of bttv. The driver evolved quickly to include support for radio tuners and other chipsets. Eventually, more drivers started to show up, among others the first webcam driver for the Connectix QuickCam.

The next stable kernel version, Linux 2.2, was released in 1999 and included a multimedia framework called Video4Linux, or V4L for short, that provided a common API for the available video drivers. It must be said that the name is somewhat misleading in the sense that Video4Linux not only supports video devices but a whole range of related functions like radio tuners or teletext decoders.

With V4L being criticized as too inflexible, work on a successor had started as early as 1998 and, after four years, was merged into version 2.5 of the official Linux kernel development tree.
When version 2.6 of the kernel was released, it was the first version of Linux to officially include Video for Linux Two, or simply V4L2.¹ Backports of V4L2 to earlier kernel versions, in particular 2.4, were developed and are still being used today. V4L and V4L2 coexisted for a long time in the Linux 2.6 series, but as of July 2006 the old V4L1 API was officially deprecated and removed from the kernel. This leaves Video4Linux 2 as the sole kernel subsystem for video processing on current Linux versions.

¹ Note the variety in spelling. Depending on the author and the context, Video for Linux Two is also referred to as Video4Linux 2 or just Video4Linux.
3.2.2 Linux audio support

Linux has traditionally separated audio and video support. For one thing, audio has been around much longer than video has, and for another, both subsystems have followed a rather strict separation of concerns. Even though they were developed by different teams at different times, their history is marked by somewhat similar events.

Open Sound System
The Open Sound System, or simply OSS, was originally developed not only for the Linux operating system but for a number of different Unix derivatives. While successful for a long time, its rather simple architecture suffers from a number of problems, the most serious of which, to the average user, is the inability to share a sound device between different applications. As an example, it is not possible to hear system notification sounds while an audio application is playing music in the background. The first application to claim the device blocks it for all other applications. Together with a number of non-technical reasons this eventually led to the development of ALSA, the Advanced Linux Sound Architecture.

Advanced Linux Sound Architecture
Starting with Linux 2.6, ALSA became the standard Linux sound subsystem, although OSS is still available as a deprecated option. The reason for this is the lack of ALSA audio drivers for some older sound devices. Thanks to features like allowing devices to be shared among applications, most new applications come with ALSA support built in and many existing applications are making the conversion from older audio frameworks.
3.3 Linux user mode multimedia support

The Linux kernel community tries to move as many components as possible into user space. On the one hand this approach brings a number of advantages like easier debugging, faster development, and increased stability. On the other hand, user space solutions can suffer from problems such as reduced flexibility, the lack of transparency, or lower performance due to increased overhead. Nevertheless the gains seem to outweigh the drawbacks, which is why a lot of effort has gone into the development of user space multimedia frameworks. Depending on the point of view, the fact that there is a variety of such frameworks available can be seen as a positive or negative outcome of this trend. The lack of a single common multimedia framework undoubtedly makes it more difficult for application developers to pick a basis for their software. The available choices range from simple media decoding libraries to fully grown network-oriented and pipeline-based frameworks.

For the rest of this section we will present two of what we consider the most promising frameworks available today, GStreamer and NMM. The latter is still relatively young and therefore not as widespread as GStreamer, which has found its way into all current Linux distributions, albeit not always in its latest and most complete version. Both projects are available under open source licenses (LGPL and LGPL/GPL combined, respectively).
3.3.1 GStreamer

GStreamer can be thought of as a rather generic multimedia layer that provides solid support for pipeline-centric primitives such as elements, pads, and buffers. It bears some resemblance to Microsoft DirectShow, which has been the center of Windows multimedia technology for many years now. The GStreamer architecture is strongly plugin-based, i.e. the core library provides basic functions like capability negotiation, routing facilities, or synchronization, while all input, processing, and output is handled by plugins that are loaded on the fly. Each plugin has an arbitrary number of so-called pads. Two elements can be linked by their pads, with the data flowing from the source pad to the sink pad. A typical pipeline consists of one or more sources that are connected via multiple processing elements to one or more sinks. Figure 3.1 shows a very simple example.
Figure 3.1: A simple GStreamer pipeline that plays an MP3 audio file on the default ALSA sink. The mad plugin decodes the MP3 data that it receives from the file source and sends the raw audio data to the ALSA sink.
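To make this more concrete, the following minimal sketch builds the pipeline of figure 3.1 in code. It is only an illustration and assumes a GStreamer 0.10 installation with the mad MP3 plugin present; the file name clip.mp3 is a placeholder.

    #include <gst/gst.h>

    int main(int argc, char *argv[])
    {
        GError *error = NULL;
        GstElement *pipeline;
        GstBus *bus;
        GstMessage *msg;

        gst_init(&argc, &argv);

        /* gst_parse_launch() accepts the same pipeline description syntax
         * that the gst-launch command line tool uses. */
        pipeline = gst_parse_launch("filesrc location=clip.mp3 ! mad ! alsasink",
                                    &error);
        if (pipeline == NULL) {
            g_printerr("Could not build pipeline: %s\n", error->message);
            return 1;
        }

        /* Start playback and block until the stream ends or an error occurs. */
        gst_element_set_state(pipeline, GST_STATE_PLAYING);
        bus = gst_element_get_bus(pipeline);
        msg = gst_bus_poll(bus, GST_MESSAGE_EOS | GST_MESSAGE_ERROR, -1);
        if (msg != NULL)
            gst_message_unref(msg);

        gst_element_set_state(pipeline, GST_STATE_NULL);
        gst_object_unref(bus);
        gst_object_unref(pipeline);
        return 0;
    }

The same mechanism carries over to video: replacing the source and sink with elements such as v4l2src and xvimagesink (see table 3.1) turns this into a simple webcam viewer.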
Table 3.1 lists a few plugins for each category. Source elements are characterized by the fact that they only have source pads, sink elements only have sink pads, and processing elements have at least one of each.

    Sources      Processing        Sinks
    • filesrc    • audioresample   • udpsink
    • alsasrc    • identity        • alsasink
    • v4l2src    • videoflip       • xvimagesink

Table 3.1: An arbitrary selection of GStreamer source, processing, and sink plugins.
3.3.2 NMM

NMM stands for Network-Integrated Multimedia Middleware and, as the name already suggests, it tightly integrates network resources into the process. By doing so NMM sets a counterpoint to most other multimedia frameworks that take a machine-centric approach where input, processing, and output usually all happen on the same machine. Let us look at two common examples of how today's multimedia software interacts with the network:

1. Playback of a file residing on a file server in the network
2. Playback of an on-demand audio or video stream coming from the network

1. Playback of a network file
From the point of view of a player application, this is the easiest case because it is almost entirely transparent to the applications. The main requirement is that the underlying layers (operating system or desktop environment) know how to make network resources available to their applications in a manner that resembles access to local resources as closely as possible. There are different ways in which this can be realized, e.g. in kernel mode or user mode, but all of these are classified under the name of a virtual file system. As an example, an application can simply open a file path such as \\192.168.0.10\media\clip.avi (UNC path for a Windows file server resource) or sftp://192.168.1.2/home/mrubli/music/clip.ogg (generic URL for a secure FTP resource as used by many Linux environments). The underlying layers make sure that all the usual input/output functions work the same on these files as on local files. So apart from supporting the syntax of such network paths the burden is not on the application writer.
2. Playback of an on-demand stream
Playing back on-demand multimedia streams has been made popular by applications such as RealPlayer or Windows Media Player. The applications communicate with a streaming server via partially proprietary protocols based on UDP or TCP. The burden of flow control, loss detection, and loss recovery lies entirely on the application's shoulders. Apart from that, the client plays a rather passive role by just processing the received data locally and exercising relatively little control over the provided data flow. It is usually limited to starting or stopping the stream and jumping to a particular location within the stream. In particular, the application has no way of actively controlling remote devices, e.g. the zoom factor of the camera from which the video stream originates. Note how there is no transparency from the point of view of the streaming client. It requires deep knowledge of different network layers and protocols, which strongly reduces platform independence and interoperability.

NMM tries to escape this machine-centric view by providing an infrastructure that makes the entire network topology transparent to applications using the framework. The elements of the flow graph can be distributed within a network without requiring the application to be aware of this fact. This allows applications to access remote hardware as if it were plugged into the local computer. It can change channels on a remote TV tuner card or control the zoom level of a digital camera connected to a remote machine. The NMM framework abstracts all these controls and builds communication channels that reliably transmit data between the involved machines.

The website of the NMM project[16] lists a number of impressive examples of the software's capabilities. One of them can be seen in figure 3.2. The photo is from an article that describes the setup of a video wall in detail[13].

Figure 3.2: Video wall based on NMM. It uses two laptop computers to display one half of a video each and a third system that renders the entire video.
3.4 Current discussion

Over the years many video device drivers have been developed by many different people. Each one of these developers had their own vision of what a driver should or should not do. While the V4L2 API specifies the syntax and semantics of the function calls that drivers have to implement, it does not provide much help in terms of higher-level guidance, therefore leaving room for interpretation.

The classic example where different people have different opinions is the case of video formats and whether V4L2 drivers should include support for format conversion. Some devices provide uncompressed data streams whereas others offer compressed video data in addition to uncompressed formats. Not every application, however, may be able to process compressed data, which is why certain driver writers have included decompressor modules in their drivers. In the case of a decompressor-enabled driver, format conversion can occur transparently if an application asks for uncompressed data but the device provides only compressed data.
This guarantees maximum compatibility and allows applications to focus on their core business: processing or displaying video data.

Other authors take the view that decompressor modules have no place in the kernel and base their opinion partly on ideological and partly on technical reasons, like the inability to use floating point mathematics in kernel space. Therefore, for an application to work with devices that provide compressed data, it has to supply its own decompressor module, possibly leading to code (and bug) duplication unless a common library is used to carry out such tasks. We will see the advantages and disadvantages of both approaches, together with possible solutions, existing and non-existing, in more detail in the next chapter.

What both sides have in common is the view that the main task of a multimedia framework is to abstract the device in a high-level manner so that applications need as little a priori knowledge as possible of the nature, brand, and model of the device they are talking to.
Chapter 4
Current state of Linux webcam support

4.1 Introduction
In the previous chapter we saw a number of components involved in getting multimedia data from the device to the user’s eyes and ears. This chapter will show how these components are linked together in order to support webcams. We will find out what exactly they do and don’t do and what the interfaces between them look like. After this chapter readers should understand what is going on behind the scenes when a user opens his favorite webcam application and they should have enough background to understand the necessity of the enhancements and additions that were part of this project.
4.1.1 Webcams and audio

With the advent of USB webcams, vendors started including microphones in the devices. To the host system these webcams appear as two separate devices, one of them being the video part, the other being the microphone. The microphone adheres to the USB Audio Class standard and is available to every host that supplies a USB audio class driver. On Linux, this driver is called snd-usb-audio and exposes recognized device functions as ALSA devices.

Due to the availability of the Linux USB audio class driver there was no particular need for us to concentrate on the audio part of current webcams, as they work out of the box. For this reason, and because Video4Linux does not (need to) know about the audio part of webcams, audio will only come up when it requires particular attention in the remainder of this report.
4.2 V4L2: Video for Linux Two

Video for Linux was already briefly introduced in section 3.2.1, where we saw the evolution from the first video device drivers into what is today known as Video for Linux Two, or just V4L2. This section focuses on the technical aspects of this subsystem.
4.2.1 Overview

In a nutshell, V4L2 abstracts different video devices behind a common API that applications can use to retrieve video data without being aware of the particularities of the involved hardware. Figure 4.1 shows a schematic of the architecture.
Figure 4.1: Simplified view of the components involved when a V4L2 application displays video. The dashed arrows indicate that there are further operating system layers involved between the driver and the hardware. The gray box shows which components run in kernel space.
The full story is a little more complicated than that. For one thing, V4L2 not only supports video devices but also related subdevices like audio chips integrated on multimedia boards, teletext decoders, or remote control interfaces. The fact that these subdevices have relatively little in common makes the job of specifying a common API difficult. The following is a list of device types that are supported by V4L2 and, where available, a few examples:

• Video capture devices (TV tuners, DVB decoders, webcams)
• Video overlay devices (TV tuners)
• Raw and sliced VBI input devices (Teletext, EPG, and closed captioning decoders)
• Radio receivers (Radio tuners integrated on some TV tuner cards)
• Video output devices
In addition, the V4L2 specification talks about codecs and effects, which are not real devices but virtual ones that can modify video data. However, support for these was never implemented, mostly due to disagreement about how they should be implemented, i.e. in user space or kernel space.

The scope of this project covers only the first category of the above list, video capture devices. Even though the API was originally designed with analog devices in mind, webcam drivers also fall into this category. It is also the category that has by far the greatest number of devices, drivers, and practical applications.
4.2.2 The API

Due to its nature as a subsystem that communicates both with kernel space components and user space processes, V4L2 has two different interfaces, one for user space and one for kernel space.

The V4L2 user space API
Every application that wishes to use the services that V4L2 provides needs a way to communicate with the V4L2 subsystem. This communication is based on two basic mechanisms: file I/O and ioctls.

Like most devices on Unix-like systems, V4L2 devices appear as so-called device nodes in a special tree within the file system. These device nodes can be read from and written to in a similar manner as ordinary files. Using the read and write system calls is one of two ways to exchange data between video devices and applications. The other one is the use of mapped memory, where kernel space buffers are mapped into an application's address space to eliminate the need to copy memory around, thereby increasing performance.

Ioctls are a way for an application and a kernel space component to communicate data without the usual read and write system calls. While ioctls are not used to exchange large amounts of data, they are an ideal means to exchange control commands. In V4L2 everything that is not reading or writing of video data is accomplished through ioctls.¹ The V4L2 API[5] defines more than 50 such ioctls, ranging from video format enumeration to stream control.

The fact that the entire V4L2 API is based on these two relatively basic elements makes it quite simple. That simplicity does, however, come with a few caveats, as we will see later on when we discuss the shortcomings of the current Linux video architecture.

¹ In the case of memory mapped communication, or mmap, even the readiness of buffers is communicated via ioctls.
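As a small illustration of the user space API, the following sketch opens a video device node and issues two of the ioctls mentioned above. The device path /dev/video0 and the complete absence of stream setup are simplifications for the purpose of the example.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    int main(void)
    {
        struct v4l2_capability cap;
        struct v4l2_format fmt;

        int fd = open("/dev/video0", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Ask the driver about itself and the device it handles. */
        if (ioctl(fd, VIDIOC_QUERYCAP, &cap) == 0)
            printf("driver: %s, card: %s\n",
                   (char *)cap.driver, (char *)cap.card);

        /* Retrieve the currently configured capture format. */
        memset(&fmt, 0, sizeof(fmt));
        fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        if (ioctl(fd, VIDIOC_G_FMT, &fmt) == 0)
            printf("current frame size: %ux%u\n",
                   fmt.fmt.pix.width, fmt.fmt.pix.height);

        close(fd);
        return 0;
    }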
The V4L2 kernel interface
The user space API is only one half of the V4L2 subsystem. The other half consists of the driver interface that every driver that abstracts a device for V4L2 must implement.

Obviously kernel space does not know the same abstractions as user space, so in the case of the V4L2 kernel interface all exchange is done through standard function calls. When a V4L2 driver loads, it registers itself with the V4L2 subsystem and gives it a number of function addresses that are called whenever V4L2 needs something from the driver, usually in response to a user space ioctl or read/write system call. At each callback the driver carries out the requested action and returns a value indicating success or failure.

The V4L2 kernel interface does not specify how drivers have to work internally because the devices that these drivers talk to are fundamentally different. While webcam drivers usually communicate with their webcams through the USB subsystem, other drivers find themselves accessing the PCI bus to which TV tuner cards are connected. Therefore, each driver depends on its own set of kernel subsystems. What makes them V4L2 drivers is the fact that they all implement a small number of V4L2 functions.
4.2.3 Summary

We have seen that the V4L2 subsystem itself is a rather thin layer that provides a standardized way through which video applications and video device drivers can communicate. Compared to other platforms, where the multimedia subsystems have many additional tasks like converting between formats and managing data flow, clocks, and pipelines, the V4L2 subsystem is rather low-level and focused on its core task: the exchange of video data and control commands.
4.3 Drivers

This section presents five drivers that are in one way or another relevant to the Logitech QuickCam series of webcams. All of them are either V4L1 or V4L2 drivers and available as open source.
4.3.1 The Philips USB Webcam driver

The Philips USB Webcam Driver, or simply PWC, has a troubled history and has caused a lot of discussion and controversy in the Linux community. The original version of the driver was written by a developer known under the pseudonym Nemosoft as a project he did with the support of Philips. At the time there was no USB 2.0, so video compression had to be applied for video streams above a certain data rate. These compression algorithms were proprietary and Philips did not want to release them as open source. Therefore, the driver was split into two parts: the actual device driver (pwc) that supported the basic video modes that could be used without compression and a decompressor module (pwcx) that attached to the driver and enabled the higher resolutions. Only the former was released in source code; the decompressor module remained available in binary form only.
The pwc driver eventually made it into the official kernel but the pwcx module had to be downloaded and installed separately.

In August 2004, the maintainer of the Linux kernel USB subsystem, Greg Kroah-Hartman, decided to remove the hook that allowed the pwcx module to hook into the video stream. The reason he gave was the fact that the kernel is licensed under the GPL and such functionality is considered in violation of it. As a reaction, Nemosoft demanded that the pwc driver be removed entirely from the kernel because he felt that his work had been crippled and did not agree with the way the situation was handled by the kernel maintainers. Much of the history can be found in [1] and the links in the article.

Only a few weeks later, Luc Saillard published a pure open source version of the driver after having reverse-engineered large parts of the original pwcx module. Ever since, the driver has been under continuous development and was even ported to V4L2. The driver works with many Philips-based webcams from different vendors, among others a number of Logitech cameras. The complete list of Logitech USB PIDs compatible with the PWC driver can be found in appendix A.
4.3.2 The Spca5xx Webcam driver

The name of the Spca5xx Webcam driver is a little misleading because it suggests that it only works with the Sunplus SPCA5xx series of chipsets. While that was true at one time, Michel Xhaard has developed the Spca5xx driver into one of the most versatile Linux webcam drivers that exist today. In addition to the mentioned Sunplus chipsets it supports a number of others from manufacturers such as Pixart, Sonix, Vimicro, or Zoran. The (incomplete) list of supported cameras at [23] contains more than 200 cameras and the author is working on additional chipsets.

The main drawback of the Spca5xx driver is the fact that it does not support the V4L2 API yet. This limitation, and the way the driver has quickly grown over time, are the main reasons why the author has recently started rewriting the driver from scratch, this time based on V4L2 and under the name of gspca. Among the many supported cameras on the list, there is a fair number of Logitech's older camera models as well as some newer ones. Again, appendix A has a list of these devices.
4.3.3 The QuickCam Messenger & Communicate driver

This driver supports a relatively small number of cameras, notably a few models of the QuickCam Messenger, QuickCam Communicate, and QuickCam Express series. They are all based on the STMicroelectronics 6422 chip. The driver supports only V4L1 at the time of this writing and can be found at [14].
4.3.4 The QuickCam Express driver

Another relatively limited V4L1 driver [19] focuses on the Logitech QuickCam Express and QuickCam Web models, which contain chipsets from STMicroelectronics' 6xx series. It is still actively maintained, although there are no signs yet of a V4L2 version.
4.3.5 The Linux USB Video Class driver

Robot contests have been the starting point for many an open source software project. The Linux UVC driver is one of the more prominent examples. It was developed in 2005 by Laurent Pinchart because he needed support for the Logitech QuickCam for Notebooks Pro camera that he was planning to use for his robot. The project quickly earned a lot of interest from Linux users who tried to get their cameras to work. Driven by both personal and community interest, the driver has left the status of a hobby project behind and is designated to become the official UVC driver of the Linux kernel. Since this driver is one of the cornerstones of this project, we will give a basic overview of it here. Later, in section 6.1, we shall discuss extensions and changes that were made to support the Linux webcam infrastructure. The official project website can be found at [17].

Technical overview
The Linux UVC driver, or short uvcvideo, is a Video4Linux 2 and a USB driver at the same time. It registers with the USB stack as a handler for devices of the UVC device class and, whenever a matching device is connected, the driver initializes the device and registers it as a V4L2 device. Let us now look at a few tasks and aspects of the UVC driver in the order they typically occur.

Device enumeration
The first task of any USB driver is to define a criteria list for the operating system so that the latter knows which devices the driver is willing and able to handle. We saw in section 2.3.3 that some Logitech cameras do not announce themselves as UVC devices even though they are capable of the protocol. For this reason, uvcvideo includes a hard-coded list of product IDs of such devices in addition to the generic class specifier.

Device initialization
As soon as a supported device is discovered, the driver reads and parses the device's control descriptor and, if successful, sets up the internal data structures for units and terminals before it finally registers the camera with the V4L2 subsystem. At this point, the device becomes visible to user space, usually in the form of a device node, e.g. /dev/video0.

Stream setup and streaming
If a V4L2 application requests a video stream, the driver enters the so-called probe/commit phase to negotiate the parameters of the video stream.
This includes setting attributes like video data format, frame size, and frame rate. When the driver finally receives video data from the device, it must parse the packets, check them for errors, and reassemble the raw frame data before it can send a frame to the application.

Controls
Video streaming does not only consist of receiving video data from the device; applications can also use different controls to change the settings of the camera or the properties of the video stream. These control requests must be translated from the V4L2 requests that the driver receives into UVC requests understood by the device. This process requires some mapping information because the translation is anything but obvious. We will have a closer look at this problem and how it can be solved later on.

Outlook
For obvious reasons V4L2 cannot support all possible features that the UVC specification defines. The driver thus needs to take measures that allow user space applications to access such features nonetheless. In section 6.1 we shall see one such example that was realized with the help of the sysfs virtual file system and is about to be included in the project.

It is safe to say that the Linux USB Video Class driver is going to be the most important Linux webcam driver in the foreseeable future. Logitech is already moving all cameras onto the UVC track and other vendors are expected to follow given that UVC is a Windows Vista logo requirement. For Linux users this means that all these cameras will be natively supported by the Linux UVC driver.
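To give an idea of what such a control request looks like from an application's point of view, the following sketch sets the brightness control through the standard V4L2 interface; for a UVC device it is the uvcvideo driver that translates this into the corresponding UVC request. The device path is a placeholder and error handling is kept to a minimum.

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/videodev2.h>

    /* Set the brightness control of the given video device, e.g. /dev/video0.
     * Returns 0 on success, -1 on failure. */
    int set_brightness(const char *device, int value)
    {
        struct v4l2_control ctrl;
        int fd, ret;

        fd = open(device, O_RDWR);
        if (fd < 0)
            return -1;

        ctrl.id    = V4L2_CID_BRIGHTNESS;
        ctrl.value = value;
        ret = ioctl(fd, VIDIOC_S_CTRL, &ctrl);

        close(fd);
        return ret;
    }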
4.4 Applications
4.4.1 V4L2 applications

Ekiga
Ekiga is a VoIP and video conferencing application that supports SIP and H.323, which makes it compatible not only with applications such as NetMeeting but also with conferencing hardware that supports the same standards. It comes with plugins for both V4L1 and V4L2 and is therefore able to support a large number of different webcams. Given the resemblance to other popular conferencing software, Ekiga is one of the main applications for webcams on Linux. It is licensed under the GPL; documentation, sources, and binary packages can be downloaded from [18].

Figure 4.2: The main window of Ekiga during a call.

luvcview
This tool was developed by the author of the Spca5xx driver with the intention of supporting some features unique to the Linux UVC driver, hence its name.
Thanks to its simplicity it has become one of the favorite programs for testing whether a newly installed camera works. It is based on V4L2 for video input and the SDL library for video output. The simple user interface allows basic camera controls to be manipulated, including some of the custom controls that the UVC driver provides to enable mechanical pan/tilt for the Logitech QuickCam Orbit camera series. The latest version includes a patch that was written during this project to help with debugging of camera and driver issues. It makes it easy to save the raw data received from the device into files with the help of command line options. luvcview can be downloaded from [22]. Figure 4.3 shows a screenshot of the luvcview user interface and the command line used to start it in the background.

Figure 4.3: The window of luvcview and the console used to start it in the background.

fswebcam
This nifty application is proof that not all webcam software needs a GUI to be useful. Purely command-line based, it can be used to retrieve pictures from a webcam and store them in files, e.g. for uploading them to a web server at regular intervals. The fswebcam website can be found at [9].
4.4.2 V4L applications

Camorama
Camorama is a V4L1-only application made for taking pictures either manually or at specified intervals. It can even upload the pictures to a remote web server. Camorama allows adjusting the most common camera controls and includes a number of video filters, some of which don't seem very stable, though. It can be downloaded from [11] and is part of many Linux distributions. Unfortunately development seems to have come to a standstill at the moment. Figure 4.4 shows Camorama in action.

Figure 4.4: Camorama streaming at QVGA resolution from a Logitech QuickCam Messenger camera using the Spca5xx driver.

4.4.3 GStreamer applications

There are many small multimedia applications that use the GStreamer engine as a back-end but only a relatively small number of prominent ones. The most used ones are probably Amarok, the default KDE music player, and Totem, GNOME's main media player. At the moment Amarok is limited to audio, although video support is being discussed. What makes Totem interesting from the point of view of webcam users is a little webcam utility called Vanity. Unfortunately it has received very little attention from both developers and users and it remains to be seen whether the project will be revived or even integrated into Totem.
We will see another webcam application based on GStreamer in the next chapter when we look at the software that was developed for this project. At that time we shall also see how GStreamer and V4L2 work together.
4.5 Problems and design issues

As with every architecture, there are a number of drawbacks, some of which were briefly hinted at in the previous sections. We will now look at these issues in more detail and see what their implications for webcam support on the Linux platform are. At the same time we will look at possible solutions to these problems and how other platforms handle them.
4.5.1 Kernel mode vs. user mode

The discussion of whether functionality X should be implemented in user mode or in kernel mode is an all-time classic in the open source community, particularly in the Linux kernel. Unfortunately, these discussions are oftentimes far from conclusive, leading to slower progress in the implementation of certain features or, in the worst case, to effectively discontinued projects due to a lack of consensus and acceptance. Table 4.1 shows the most notable differences between kernel mode and user mode implementations of multimedia functionality. While the points are focused on webcam applications, many of them also apply to other domains like audio processing or even to devices completely unrelated to multimedia. In the following we will analyze these different points and present possible solutions and workarounds.
+ (advantages)

  Kernel space:
  • Transparency for user space
  • Direct device access
  • Device works "out of the box"

  User space:
  • Simple upgrading
  • Simple debugging
  • Safer (bugs only affect one process)
  • More flexible licensing

– (disadvantages)

  Kernel space:
  • No floating point math
  • Complicated debugging
  • Open source only
  • No callback functions

  User space:
  • Difficult to establish standard
  • Requires flexible kernel back-end

Table 4.1: Kernel space vs. user space software development
Format transparency

One of the main problems in multimedia applications is the myriad of formats that are in use. Different vendors use different compression schemes for a number of reasons: licensing and implementation costs, memory and processing power constraints, backward compatibility, and personal or corporate preference. For application developers it becomes increasingly difficult to stay current on which devices use which formats and to support them all. In some cases, as with the cameras using the PWC driver, it may even be impossible to integrate certain algorithms for legal reasons. This is a strong argument for hiding the entire format conversion layer from the application, so that every application only needs to support a very small number of standard formats to remain compatible with all hardware and drivers. A typical example is the way the current Logitech webcam drivers for Windows are implemented. While the devices usually provide two formats, compressed MJPEG and uncompressed YUY2, applications get to see neither of these formats. Instead, they are offered the choice between I420 and 24-bit RGB, the latter being especially easy to process because each pixel is represented by a red, green, and blue 8-bit color value. These formats are provided independently of the mode in which the camera is being used. For example, if the camera is streaming in MJPEG mode and the capturing software requests RGB data, the driver uses its internal decompressor module to convert the JPEG data coming from the camera into uncompressed RGB. The capturing software is not aware of this process and does not need to have its own JPEG decoder; one nontrivial module less to implement.
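To make the idea of such a conversion module concrete, the following is a minimal, purely illustrative sketch of converting one line of YUY2 data to 24-bit RGB using integer BT.601 coefficients; it is not taken from any actual driver, and real implementations are considerably more optimized.

    #include <stdint.h>

    static inline uint8_t clamp(int v)
    {
        return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
    }

    /* Convert one YUY2 (YUYV) line of 'width' pixels into packed RGB24.
     * YUY2 stores two horizontally adjacent pixels in four bytes: Y0 U Y1 V. */
    void yuy2_to_rgb24(const uint8_t *src, uint8_t *dst, int width)
    {
        for (int x = 0; x < width; x += 2) {
            int y0 = src[0], u = src[1] - 128, y1 = src[2], v = src[3] - 128;

            /* BT.601 conversion in fixed-point arithmetic (scaled by 256). */
            int rv  = (359 * v) / 256;            /* 1.402 * V             */
            int guv = (88 * u + 183 * v) / 256;   /* 0.344 * U + 0.714 * V */
            int bu  = (454 * u) / 256;            /* 1.772 * U             */

            dst[0] = clamp(y0 + rv);  dst[1] = clamp(y0 - guv);  dst[2] = clamp(y0 + bu);
            dst[3] = clamp(y1 + rv);  dst[4] = clamp(y1 - guv);  dst[5] = clamp(y1 + bu);

            src += 4;
            dst += 6;
        }
    }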
At which layer this format conversion should happen depends on a number of factors of both a technical and a historical nature. Traditionally, Windows and Linux have seen different attempts at multimedia frameworks, and many of them have only survived because their removal would break compatibility with older applications still relying on these APIs. If vendors and driver developers are interested in supporting these outdated frameworks, they may need to provide format filters for each one of them in the case of a proprietary streaming format. If, however, the conversion takes place in the driver itself, all frameworks can be presented with some standard format that they are guaranteed to understand. This can greatly simplify development by concentrating the effort on a single driver instead of on different framework components. There are also performance considerations when deciding at which level a conversion should take place. If two or more applications want to access the video stream of a camera at the same time, they will create as many different pipelines as there are applications. If the format conversion–or any other computationally intensive process–is done in the user space framework, the same process has to be carried out in the pipeline of each application because there is no way for the applications to share the result. This has the effect of multiplying the required work, something that leads to poor scalability of the solution. In the opposite case, where the conversion process is carried out before the stream is multiplexed, the work is done just once in the driver and all the frameworks receive the processed data as input, significantly reducing the overhead associated with multiple streams in parallel.

Feature transparency

Up until now our discussion has focused primarily on format conversion. There exists another category of video processing that is different in a very important way: computer vision. Computer vision is a form of image or video processing with the goal of extracting meta data that enables computers to "see" or at least recognize certain features and patterns. A few classic examples are face tracking, where the algorithm tries to keep track of the position of one or multiple faces, feature tracking, where the computer locates not only the face but features like eyes, nose, or mouth, and face recognition, where software can recognize faces it has previously memorized. To see the fundamental difference between computer vision and format conversion modules we have to look first at a basic mechanism of multimedia frameworks: pipeline graph construction.
When an application wants to play a certain media source, it should not have to know the individual filters that become part of the pipeline in order to do so. The framework should automatically build a flow graph that puts the right decoders and converters in the right order. The algorithms that do this are usually based on capability descriptors that belong to each element, combined with priorities to resolve ambiguities. For example, a decoder filter could have a capability descriptor that says "Able to parse and decode .mp3 files" and "Able to output uncompressed audio/x-wav data". When an application wants to play an .mp3 file, it can simply request a pipeline that has the given .mp3 file as input and delivers audio/x-wav data as output. In many cases there exist multiple graphs that are able to fulfill the given task, so the graph builder algorithm has to make decisions. Back in our example there could be two MP3 decoders on the system, one that uses the SIMD instruction set of the CPU if available and one that uses only simple arithmetic. Let us call the first module mp3_simd and assume it has a priority of 100. The default MP3 decoder is called mp3_dec and has a lower priority of 50. Naturally, the graph builder algorithm will first try to build the graph using mp3_simd. If the current CPU supports the required SIMD instructions, the graph construction will succeed. In the opposite case, where the current machine lacks SIMD, mp3_simd can refuse to be part of the graph, but the framework will still be able to build a working graph because it can fall back to our standard decoder, mp3_dec. Imagine now an audio quality improvement filter called audio_qual that accepts uncompressed audio/x-wav data as input and outputs the same type of data. How can the application benefit from audio_qual without having to know about it? The graph builder algorithm will always take the simplest graph possible, so it does not see an advantage in introducing an additional filter element that–from the algorithm's capability-oriented perspective–is nothing but a null operation. This problem is not easy to solve because making every audio application aware of the plugin's existence is not always practical. The case of computer vision is very similar with respect to the pipeline graph creation process. The computer vision module does not modify the data, so the input and output formats are the same and the framework does not see the need to include the element in the graph. One elegant solution to this problem is to do the processing in kernel mode in the webcam driver before the data actually reaches the pipeline source. Obviously, this approach can require a format conversion in the driver if the computer vision algorithms cannot work directly on the video format delivered by the camera. So the solution presented in the previous section becomes not only a performance advantage but a necessity to support certain features transparently for all applications.
Direct device access

Another main advantage of a kernel mode multimedia framework is that the framework has easy access to special features that the device provides. For example, a new camera model can introduce motion control for pan and tilt. If the user mode multimedia framework is not aware of this or is incapable of mapping these controls onto its primitives, applications running on top of it cannot use these features. Obviously this point is also valid for kernel mode frameworks, but it is generally easier to communicate between kernel components than across the barrier between user mode and kernel mode. For an application to be able to communicate with the driver, it is not enough to use the framework API; a special side channel has to be established. The design of such a side channel can turn out to be rather complicated if future reusability is a requirement, because of the difficulty of predicting the features of upcoming devices. We will see a concrete example of this issue–and a possible solution–later on when we look at how the webcam framework developed as part of this project communicates with the device driver.

Callback

Many APIs rely on callbacks to implement certain features, as opposed to polling or waiting on handles. The advantage of this approach is that it has virtually no performance impact (especially compared to polling) and is much simpler because it does not require the application to use multiple threads to poll or wait. There are many cases where such notification schemes are useful:

• Notification about newly available or unplugged devices
• Notification about controls whose value has changed, possibly as a result of some device built-in automatism
• Notification about device buttons that have been pressed
• Notification about the success or failure of an action asynchronously triggered by the application (e.g. a pan or tilt request that can take some time to finish)
• Notification about non-fatal errors on the bus or in the driver

Unfortunately, current operating systems provide no way to do direct callbacks from kernel mode to user mode. Therefore, for V4L2 applications to be able to enjoy the comfort of callback notification, a user space component would have to be introduced that wraps polling or waiting and calls the application whenever an event occurs. In chapter 7 we propose a design that does just that.
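As an illustration, the following is a minimal sketch of such a wrapper, assuming a hypothetical per-frame callback registered by the application: a worker thread poll()s the V4L2 device handle and invokes the callback whenever data becomes available (single device only, error handling omitted; the names are illustrative and do not describe any existing library).

    #include <fcntl.h>
    #include <poll.h>
    #include <pthread.h>
    #include <unistd.h>

    typedef void (*frame_callback)(int fd, void *user_data);

    struct watcher {
        int            fd;
        frame_callback cb;
        void          *user_data;
    };

    static void *watch_device(void *arg)
    {
        struct watcher *w = arg;
        struct pollfd pfd = { .fd = w->fd, .events = POLLIN };

        for (;;) {
            /* Block until the driver signals new data, then "call back". */
            if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
                w->cb(w->fd, w->user_data);
        }
        return NULL;
    }

    /* The application registers its callback once and never polls itself. */
    int watch_camera(const char *device, frame_callback cb, void *user_data)
    {
        static struct watcher w;   /* one device only, for brevity */
        pthread_t thread;

        w.fd = open(device, O_RDWR);
        if (w.fd < 0)
            return -1;
        w.cb = cb;
        w.user_data = user_data;
        return pthread_create(&thread, NULL, watch_device, &w);
    }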
Ease of use

The Linux kernel comes with a variety of features built in, including many drivers that users of other operating systems have to download and install separately. If a certain device works "out of the box", this makes for a good user experience because people can immediately start using the device and launch their favorite applications. Such behavior is obviously desirable because it frees users from having to compile and install the driver themselves, something that not every Linux user may be comfortable doing. On the other hand, the disadvantage of such an approach is the limited upgradeability of kernel components. Even though current distributions provide comfortable packaging of precompiled kernels, such an upgrade usually requires rebooting the machine. In comparison, upgrading a user mode application is as easy as restarting it once the application package has been upgraded. In high-availability environments, e.g. in the case of a popular webcam streaming server, the downtime incurred by a reboot can be unacceptable.

Development aspects

For a number of reasons, programming in user mode tends to be easier than programming in kernel mode. Three of these reasons are the variety of development tools, the implications of a software bug, and the comfort of the API. Traditionally there are many more tools available for developing applications than kernel components. The simple reason is, for one, that developing user space tools is itself easier and, for another, that the number of application developers is just much higher than that of system developers. There is a large variety of debugging tools and helper libraries out there, but almost none of them are applicable to kernel mode software. Therefore the Linux kernel mode developer has to rely mostly on the kernel's built-in tools. While these are very useful, they cannot compare with the comfort of the kernel debugger tools available on the Windows platform. If a problem occurs in a kernel component, the implications can be manifold. In some cases the entire machine can freeze without so much as a single line of output that would help locate the problem. In less severe cases the kernel manages to write enough useful debug information to the system log and may even continue to run without the component in question. Nevertheless, such an isolated crash often requires a reboot of the test machine because the crashed component cannot be replaced by a new, and possibly fixed, version anymore. These circumstances inevitably call for two machines, one for development and one for testing. In user mode, an application bug is almost always limited to a single process, and trying out a new version is as easy as recompiling and relaunching the program. Finally, not all the comfort of the API that application programmers are used to is available in kernel space. Seemingly simple tasks like memory allocation, string handling, and basic mathematics can suddenly become much more complicated. One important difference is that floating point operations
are oftentimes not available in kernel mode for performance reasons². One has to resort to algorithms that avoid floating point computations or apply tricks that are unlikely to be received positively by the Linux kernel community. All of these points make the development of multimedia software in user mode much easier, an important point given the complexity that the involved algorithms and subsystems often have.

² Banning floating point from kernel mode allows the kernel to omit the otherwise expensive saving and restoring of floating point registers when the currently executing code is preempted.

Licensing

Nothing speaks against writing closed source software for Linux. As a matter of fact, there is a large number of commercial Linux applications out there that were ported from other operating systems or written from scratch without releasing their source code. The GNU General Public License (GPL), under which the Linux kernel and most of the system software is released, does not forbid closed source applications. The situation for kernel modules, however, is more complicated. Since the GPL requires derived works of a GPL-licensed product to be published under the same terms, most kernel modules are assumed to be derived works, which rules out the development of closed source kernel modules [20]. There seems, however, to be an acceptable way of including a binary module in the Linux kernel. It basically consists of having a wrapper module, itself under the GPL, that serves as a proxy for the kernel functions required by the second module. This second module can be distributed in binary-only form and does not have to adopt the kernel's license because it cannot be considered a derived work anymore. Even after sidestepping the legal issues of a binary-only kernel module, there remain a few arguments against realizing a project in such a way, notably the lack of acceptance in the community and the difficult maintenance given the large number of different kernel packages that exist. In many cases, the software would have to be recompiled for every minor upgrade and for every flavor and architecture of the supported Linux distributions. This can drastically limit the scope of supported platforms.

4.5.2 The Video4Linux user mode library

One solution to most of the problems just described keeps coming up when new and missing features and design issues are discussed on the V4L mailing list: a widely available, open source, user mode library that complements the kernel part of V4L2. Such a library could take over tasks like format conversion, providing a flexible interface for more direct hardware access, and taking complexity away from today's applications. At the same time, the kernel part could concentrate entirely on providing the drivers that abstract device capabilities and on making sure that they implement the interfaces required by the V4L library.
While the approach sounds very promising and would bring the Linux multimedia platform a large step forward, nobody has found themselves willing or able to start such a project. In the meantime, other user mode frameworks like GStreamer or NMM have partly stepped into the breach. Unfortunately, since these frameworks do not primarily target V4L, they are rarely able to abstract all desirable features. The growing popularity of these multimedia architectures, in turn, makes it increasingly hard for a V4L library to become widespread and eventually the tool of choice for V4L2 front-ends. It seems fair to say that the project of the V4L user mode library died long before it even reached the draft stage, and it would take a fair amount of initiative to revive it.
4.5.3 V4L2 related problems

Video4Linux has a number of problems that have their roots partially in the legacy of V4L1 and of Unix systems in general, as well as in design decisions that were made with strictly analog devices in mind. For some of them easy fixes are possible; for others, solutions are more difficult.

Input and output

We saw in section 4.2.2 that V4L2 provides two different ways for applications to read and write video data: the standard read and write system calls, and memory-mapped buffers (mmap). Device input and output using the read/write interface used to be–and still is in some cases–very popular, but it is not the technique of choice because it does not allow meta information such as frame timestamps to be communicated alongside the data. This classic I/O-based approach, in turn, has the advantage of enabling every application that supports file I/O to work with V4L2 devices. While it would be possible for drivers to implement both techniques, some of them choose not to support read/write and mmap at the same time. The uvcvideo driver, for example, does not support the read/write protocol in favor of the more flexible mmap. The fact that the availability of either protocol depends on the driver in use erodes the usefulness of the abstraction layer that V4L is supposed to provide. To be on the safe side, an application would have to implement both protocols, again something that not all application authors choose to do. Usually their decision depends on the purpose of their tool and the hardware they have access to during development.
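For reference, the following condensed sketch shows the mmap-based streaming path (error handling and cleanup omitted; fd is an already configured V4L2 capture device). Note how every dequeued buffer carries meta information such as a timestamp, which the plain read/write interface cannot convey.

    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/videodev2.h>

    void capture_frames(int fd)
    {
        struct v4l2_requestbuffers req;
        memset(&req, 0, sizeof(req));
        req.count  = 4;
        req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        req.memory = V4L2_MEMORY_MMAP;
        ioctl(fd, VIDIOC_REQBUFS, &req);              /* allocate driver buffers */

        void *mem[4];
        for (unsigned i = 0; i < req.count && i < 4; i++) {
            struct v4l2_buffer buf;
            memset(&buf, 0, sizeof(buf));
            buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
            buf.memory = V4L2_MEMORY_MMAP;
            buf.index  = i;
            ioctl(fd, VIDIOC_QUERYBUF, &buf);         /* where and how big is buffer i? */
            mem[i] = mmap(NULL, buf.length, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, buf.m.offset);
            ioctl(fd, VIDIOC_QBUF, &buf);             /* hand the buffer to the driver */
        }

        enum v4l2_buf_type type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        ioctl(fd, VIDIOC_STREAMON, &type);

        for (int n = 0; n < 100; n++) {
            struct v4l2_buffer buf;
            memset(&buf, 0, sizeof(buf));
            buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
            buf.memory = V4L2_MEMORY_MMAP;
            ioctl(fd, VIDIOC_DQBUF, &buf);            /* wait for a filled buffer */
            /* mem[buf.index] now holds buf.bytesused bytes of video data and
             * buf.timestamp tells when the frame was captured. */
            ioctl(fd, VIDIOC_QBUF, &buf);             /* give the buffer back */
        }
        ioctl(fd, VIDIOC_STREAMOFF, &type);
    }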
The legacy of ioctl

The ioctl system call was first introduced with AT&T Unix version 7 in the late seventies. It was used to exchange control data that did not fit into the stream-oriented I/O model. The operating system forwards ioctl requests directly to the driver responsible for the device. Let us look at the prototype of the ioctl function to understand where some of the design limitations in V4L2 come from:

    int ioctl(int device, int request, void *argp);

There are two properties that stick out for an interface based on this function:

1. There is only one untyped argument for passing data.
2. Every call needs a device handle.

The fact that ioctl provides only one argument for passing data between caller and callee is not a serious technical limitation in practice, and neither is its untypedness. The way this interface is used, however, prevents the compiler from doing any sort of compile-time type checking, which can lead to hard-to-find bugs if a wrong data type is passed. For developers this also makes for a rather unintuitive interface, since even relatively simple requests require data structures to be used where a few individual arguments of basic types would be simpler. While the first point is mostly a cosmetic one, the second one imposes a more important limitation on applications: no "stateless" calls to the V4L2 subsystem are possible. Since the operating system requires a device handle to be passed to the ioctl request, the application has no choice but to open the device prior to making the ioctl call. As a consequence, this eliminates the possibility of device independent V4L2 functions. It is easy to come up with a few occasions where such stateless functions would be desirable:

• Device enumeration. It is currently left to the application to enumerate the device nodes in the /dev directory and filter those that belong to V4L2 devices (a sketch of this procedure follows after this list).
• Device information querying. Unless the driver supports multiple opening of the same device, something that is not trivial to implement because the associated policies have to be carefully thought through, applications have no more information than what the name of the device node itself provides. Currently this is restricted to the device type (video devices are called videoN, radio devices radioN, etc., where N is a number).
• Module enumeration. If the V4L2 system were to provide format conversion and other processing filters, applications would want to retrieve a list of the currently available modules without having to open a device first.
• System capability querying. Similarly, V4L2 capabilities whose existence is independent of a device's presence in the system could be queried without the need for the application to know which capability was introduced with which kernel version and to hardcode corresponding conditionals.
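To illustrate the first point, the following is a minimal sketch of the enumeration every application currently has to perform itself: probe the /dev/videoN nodes one by one and ask each device for its capabilities (error handling omitted; the upper bound of 64 nodes is an arbitrary choice for this example).

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    void list_v4l2_devices(void)
    {
        char node[32];

        for (int i = 0; i < 64; i++) {
            snprintf(node, sizeof(node), "/dev/video%d", i);
            int fd = open(node, O_RDWR);    /* the node must be opened just to ask */
            if (fd < 0)
                continue;

            struct v4l2_capability cap;
            if (ioctl(fd, VIDIOC_QUERYCAP, &cap) == 0 &&
                (cap.capabilities & V4L2_CAP_VIDEO_CAPTURE))
                printf("%s: %s (driver %s)\n", node,
                       (char *)cap.card, (char *)cap.driver);

            close(fd);
        }
    }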
It is clear that the current API was designed to blend in nicely with the Unix way of communicating between applications and system components. While this keeps the API rather simple from a technical point of view, it has to be asked whether it is worth sticking to these legacy interfaces, which clearly were not–and at the time could not be–designed to handle all the cases that come up nowadays. Especially for fast-advancing areas like multimedia, a less generic but more flexible approach is often desirable.

Missing frame format enumeration

We have mentioned that the current Video4Linux API was designed mostly with analog devices in mind. Analog video devices have a certain advantage over digital ones in that they oftentimes have no constraints as to the video size and frame rate they can deliver. For digital devices this is different. While the sensors used by digital webcams theoretically provide similar capabilities, these are hidden by the firmware to adapt to the way that digital video data is transmitted and used. So while an analog TV card may very well be capable of delivering an image 673 pixels wide and 187 pixels high, most webcams are not. Instead, they limit the supported resolutions to a finite set, most of them with a particular aspect ratio such as 4:3. Similar restrictions apply to frame rates, where multiples of 5 or 2.5 dominate. One implication of this is that at the time V4L2 was designed, there was no need to provide applications with a way to retrieve these finite sets. This has peculiar effects at times:

• Many applications are completely unaware of the frame rate and rely on the driver to apply a default value.
• The only way for V4L2 applications to enumerate frame rates is to test them one by one and check whether the driver accepts them.
• Since a one-by-one enumeration of resolutions is impossible due to the sheer number of possible value combinations, applications simply have to live with this limitation and either provide a hardcoded list of resolutions likely to be supported or have the user enter them by hand. Once a selection is made, the application can test the given resolution.

To make this process less frustrating than it sounds, V4L2 drivers return the nearest valid resolution if a resolution switch fails. As an example, if an application requests 660x430, the driver is likely to set the resolution to 640x480, as the sketch below illustrates. We shall see in 6.2 how this severe limitation was removed by enhancing the V4L2 API.
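Until that enhancement is available, the probing just described amounts to something like the following minimal sketch (error handling omitted; fd is an already opened V4L2 capture device):

    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    void try_resolution(int fd)
    {
        struct v4l2_format fmt;
        memset(&fmt, 0, sizeof(fmt));
        fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUYV;
        fmt.fmt.pix.width  = 660;   /* wished-for size */
        fmt.fmt.pix.height = 430;

        /* The driver silently adjusts the request to the nearest frame size it
         * supports, e.g. 640x480 for the values above. */
        ioctl(fd, VIDIOC_S_FMT, &fmt);
        printf("driver configured %ux%u\n", fmt.fmt.pix.width, fmt.fmt.pix.height);
    }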
Control value size

Another limitation that is likely to become a severe problem in the future is the structure that V4L2 uses to get and set the values of device controls:

    struct v4l2_control {
        __u32 id;      /* Identifies the control.        */
        __s32 value;   /* New value or current value.    */
    };

The value field is limited to 32 bits, which is satisfactory for most simple controls but not for more complex ones. This has already given rise to the recent introduction of extended controls (see the VIDIOC_G_EXT_CTRLS, VIDIOC_S_EXT_CTRLS, and VIDIOC_TRY_EXT_CTRLS ioctls in [5]), which allow applications to group several control requests and provide some room for extension. We will come back to this issue at the beginning of chapter 5 when we discuss the goals of our webcam framework.

Lack of current documentation

The last problem we want to look at is unfortunately not limited to V4L2 but affects a wide range of software products, especially in the non-commercial and open source sector: poor documentation. The V4L2 documentation is split into two parts, an API specification for application programmers [5] and a driver writer's guide [4]. While the first is mostly complete and up-to-date, the latter is completely outdated, of little help except for getting a first overview, and gives no guidelines on how to implement a driver and what to watch out for. The main source of information on how to write a V4L2 driver is therefore the source code of existing drivers. The lack of a reference driver does not make the choice easy, though, and there are some poorly written drivers out there. Moreover, there is little documentation available on what the V4L2 subsystem actually does and does not do. Again, delving into the source code is the best and only way to get answers. This lack of starting points for developers is likely one of the biggest problems of V4L2 at the moment. It sets the threshold for newcomers quite high and makes it hard for established developers to find common guidelines to adhere to, something that in turn prevents code sharing and the modularization of common features. As part of this project the author has tried to set a good example by properly documenting the newly added frame format enumeration features and providing a reference implementation that demonstrates their usage. One can only hope that the current developers eventually take a little time out of their schedules to document the existing code while the knowledge and recollection are still there.

Stream synchronization

There is one important aspect normally present in multimedia frameworks that all applications known to the author have blissfully ignored without any obviously bad consequences: synchronization of multimedia streams.
Whenever a computer processes audio and video inputs simultaneously, there is an inevitable tendency for the two streams to slowly drift apart as they are recorded. There are numerous reasons for this, and there are different strategies to reduce the problem, many of which are explained in [12], an excellent article by the author of VirtualDub, an extremely popular video processing utility for Windows. The fact that no bad consequences can be observed with current Linux webcam software does not mean, however, that the problem does not exist on the Linux platform. The problem only becomes apparent when videos are recorded that include an audio stream, and none of the common applications seem to do that yet. Once this has changed, applications will need to figure out a way to avoid the problem of having the video and audio streams drift apart. V4L2 on its own cannot prevent this because it has no access to the audio data. Despite all these problems, Linux has a functioning platform for webcams today. It is only a matter of time and effort to resolve them one by one. The next chapter is a first step in that direction, as it provides some ideas and many real solutions.
Chapter 5
Designing the webcam infrastructure

5.1 Introduction
After having seen all the relevant requirements for operating a webcam on Linux, we can finally discuss what our webcam framework looks like. This chapter treats the ideas and goals behind the project, how we have tackled the difficulties, and why the solution looks the way it does today. We will present all the components involved in a high-level manner and save the technical details for the two following chapters. To conclude, we shall revisit the problems discussed in the previous chapters and summarize how our solution solves them and strives to avoid similar problems in the future. Before doing so, however, we need to be clear about the goals we want to achieve and set priorities. Software engineering without clear goals in mind is almost guaranteed to lose focus on the main tasks over the little things and features.
5.2 Goals
The main goal of the project, enhancing the webcam experience of Linux users, is a rather vague one and does not readily lend itself as a template for a technical specification. It does, however, entail a number of secondary goals, or means, that fit together to achieve the primary goal. These goals are of a more concrete nature and can be broken down into technical or environmental requirements. Apart from the obvious technical challenges that need to be solved, there is another group of problems that are less immediate but must nevertheless be carefully considered: business and legal decisions. When a company ventures into open source software, conflicts inevitably arise, usually between the protection
of intellectual property and the publication of source code. Their consideration has played an important role in defining the infrastructure of the webcam framework, and we will return to the topic when discussing the components affected by it. Let us now look at the different goals one by one and how they were achieved.

A solution that works

As trivial as it may sound, the solution should work. Not only on a small selection of systems that happens to be supported by the developer, but on as broad a system base as possible and for as many users as possible. Nothing is more frustrating for a user than downloading a program just to find out that it does not work on his system. Unfortunately, limiting the system base to a certain degree cannot always be avoided, for practical and technical reasons. Practical reasons are mostly due to the fact that it is impossible to test the software on every system combination out there; many different versions of the kernel can be combined with just as many different versions of the C runtime library. On the technical side, there is an entire list of features that a given solution is based on and without which it cannot work properly. The size of the supported system base is therefore a tradeoff between development and testing effort on one side and satisfying as many users as possible on the other. Making this tradeoff was not particularly difficult for this project, as one of the pillars of the webcam framework already sets a quite strict technical limit: for USB 2.0 isochronous mode to work properly, a Linux kernel with version 2.6.15 or higher is strongly recommended because the USB stack of earlier versions is known to have issues that can cause errors in the communication between drivers and devices. In a similar way, certain features of Video4Linux 2 only became available in recent versions of the kernel, notably the frame format enumeration that we will see in 6.2. This does not mean, however, that the solution does not work at all on systems that do not meet these requirements; the feature set of the webcam framework on older platforms is just smaller. Everything that does not depend on features of the UVC driver works on kernels older than 2.6.15, and a V4L2 implementation that does not provide frame format enumeration prevents only that particular feature from working.

A solution that works best–but not exclusively–with Logitech cameras

Parts of the solution we have developed are clearly optimized for the latest Logitech cameras, no need to hide this fact. Logitech has invested large amounts of money and time into developing the QuickCam hardware and software. There is a lot of intellectual property contained in the software, as well as some components licensed from third-party companies. Even if Logitech wanted to distribute these features in source code form, it would not be legally possible. As a result, these components must be distributed in binary format, and they are designed to work only if a Logitech camera is present in
the system, because other cameras don't implement the necessary features. These binary components are limited to a single dynamic library that is not required for the webcam infrastructure to work. For users this means that there is some extra functionality available if they are using a Logitech camera, but nothing stops them from using the same software with any other UVC compliant camera.

Planning ahead

In the fast moving world of consumer electronics it is sometimes hard to predict where technology will lead us in a few years from now. Future webcams will have many features that today's software does not know about. It is therefore important to be prepared for such features by designing interfaces in a way that makes them easily extensible to accommodate new challenges. A typical example of this necessity is the set of values of certain camera controls. Most controls are limited to 32-bit integer values, which is enough for simple controls such as image brightness or camera tilt. One can imagine, however, that certain software supported features could need to transmit chunks of data to the camera that do not fit in 32 bits. Image processing on the host could compute a list of defective pixels that the camera should interpolate in the firmware, or it could transmit region information to help the camera use different exposure settings for foreground and background. In the provided solution we have avoided fixed-length value limitations wherever possible. Each control can have arbitrarily long values, and all fixed-length strings, often used in APIs for simplicity reasons, have been replaced by variable-length, null-terminated strings. While it is true that this approach is slightly more complicated for all involved parties, it ensures that future extensions do not run into data width bottlenecks. We have carefully planned the API in a way that puts the burden on the libraries and not on the applications and their developers. For applications, buffer management is mostly transparent and the enumeration API functions are no different than if fixed-width data had been used. Another example that guarantees future extensibility is the generic access to UVC extension units that we added to the UVC driver. Without such a feature, the driver would need to be updated for every new camera model, the very process that generalizing standards like UVC strive to avoid. The new sysfs interface of the UVC driver allows user mode applications generic raw access to the controls provided by a camera's UVC extension units. Since these extension units are self-descriptive, the driver can retrieve all required information at runtime and need not be recompiled. There are a few other places where we have planned ahead for future extensions, such as the abstraction layers we are taking advantage of and the modularity of some of the involved modules. These examples will be explained in more detail in the rest of this chapter.
Dealing with current problems

A prerequisite for, and at the same time a goal of, this project was solving the problems we saw in chapter 4 in the best manner for everybody. This means that we did not want to further complicate the current situation by introducing parallel systems, but instead help solve these problems so that currently existing applications can also leverage the improvements we required for our framework. Admittedly, it may sometimes seem easier to reinvent the wheel than to improve the wheels already in place, but in the end having a single solution that suits multiple problems is preferable, because a combined effort often achieves a higher quality than two half-baked solutions do. The effects of a developer branching the software out of frustration with the line a project is following can be seen quite often in the open source community. The recent Mambo/Joomla dispute¹ is a typical example where it is doubtful that the split has resulted in an advantage for any of the involved parties.

¹ The open source content management system Mambo was forked in August 2005 after the company that owned the trademark founded a non-profit organization with which many of the developers did not agree. The fork was named Joomla.

Let us use the UVC driver as an example to illustrate the situation in the webcam context. Creating our own driver or forking the current one would have made it easier to introduce features that are interesting for Logitech, because we could have changed the interface without discussing the implications with anyone. By doing so, both drivers would have received less testing and it would have been harder to synchronize changes applicable to both branches. Keeping a single driver is a big win for the Linux webcam user and avoids the frustrating situation where two similar devices require two slightly different drivers.

Community acceptance

Many Linux projects with a commercial background have received a lukewarm reception from the open source community in the past, sometimes for valid reasons, sometimes out of fear and skepticism. There is no recipe for guaranteed acceptance by the Linux community, but there are a few traps one can try to avoid. One of the traps that many companies fall into is that they strictly limit the use of their software to their own products. Obviously, for certain device classes they may not have any choice; take the example of a graphics board. Fortunately, for the scope of this project, this was relatively easy given that the webcams for which it was primarily designed adhere to the USB Video Class standard. Linux users have every interest in good UVC support, so there were very few negative reactions to Logitech's involvement. The fact that somebody was already developing a UVC driver when we started the project may also have helped convince some of the more suspicious characters out there that it was not our intent to create a software solution that was merely for Logitech's benefit. Throughout the project we have strived to add features to the UVC driver that we depend on for the best support of our cameras in the most generic
way, so that devices of other vendors can take advantage of them. A typical example of this is the support for UVC extensions. While not strictly necessary for streaming video, all additional camera features are built on top of UVC extension units. It can therefore be expected that other vendors will use the same mechanisms as Logitech, so that by the time more UVC devices appear on the market, they will already be natively supported by Linux.

Avoid the slowness of democracy

This goal may at first seem diametrically opposed to the previous point. The open source community is a democracy where everyone can contribute their opinions, concerns, and suggestions. While this often helps make sure that bad solutions never even end up being realized, it renders the process almost as slow as in politics. For projects with time constraints and full-time jobs behind them, this is less than optimal, so we had to avoid being stalled by long discussions that dissolve without yielding an actual solution. However, as so often, it can turn out to be more fruitful to confront people with an actual piece of software that they can touch and test. Feedback becomes more concrete, the limitations become more visible, and so do the good points. If a project finds rapid acceptance with users, developers are likely to become inspired and contribute, or eventually use some of the ideas for other projects. We are confident that the webcam framework will show some of the pros as well as the cons that a user mode library brings. Maybe one day somebody will revive the project of a V4L2 user mode library and integrate parts of the webcam framework as a subset of its functionality, because that is where it would ideally lie.
5.3 Architecture overview

With a number of high-level goals in mind, we can start to translate these goals into an architecture of components and specify each component's tasks and interfaces. To start off, let us compare what the component stack looks like with the conventional approach on one side and with the webcam framework on the other. From section 4.2 we already know how V4L2 interfaces with the UVC driver on one side and the webcam application on the other (figure 5.1a). The stack is relatively simple, as all data, i.e. control and video data, flows through V4L2, which does not carry out any processing itself. This approach is used by all current webcam applications and suffers from a few issues identified in section 4.5. The webcam framework positions itself between the operating system and the application that receives live video from a camera. Figure 5.1b illustrates the different subsystems involved and where the core of the webcam framework is located. We see that the webcam framework fills a relatively small spot in the entire system, but it is one of the two interfaces that a webcam application uses
(a)
(b)
Figure 5.1: Layer schema of the components involved in a video stream with (a) the conventional approach and (b) the webcam framework in action. Note the border between user space and kernel space and how both V4L2 and sysfs have interfaces to either side.
to communicate with the camera. This leaves the application the flexibility to choose for every task the component that performs it best: V4L2 for video streaming and related tasks such as frame format enumeration or stream setup, and the webcam framework for accessing camera controls and advanced features that require more detailed information than what V4L2 provides.
5.4 Components
5.4.1 Overview

Despite what the previous schema suggests, the Linux webcam framework is not a single monolithic component but a collection of different libraries with strictly separated tasks. This modularity ensures that no single component grows too complicated and that the package remains easy to maintain and use. Figure 5.2 gives an overview of the entire framework in the context of the GStreamer and Qt based webcam application, as well as a panel application. Both of these applications are provided as part of the package and can be seen in action in chapter 8.
Figure 5.2: Overview of the webcam framework kernel space and user space components. The dashed box shows the three components that use the GStreamer multimedia framework.
In the remainder of this section we will look at all of these components,
what their tasks are, and what the interfaces between them look like. While doing so we shall see how they accomplish the goals discussed above.
5.4.2 UVC driver

The UVC driver was already introduced in section 4.3.5; therefore we will only give a short recapitulation at this point. Its key tasks are:

• Supervise device enumeration and register the camera with the system.
• Communicate with the camera using the UVC protocol over USB.
• Verify and interpret the received data.
• Respond to V4L2 requests originating from applications.
• Provide additional interfaces for features not supported by V4L2.

It is the last of these points that makes it a key component in the webcam framework. Conventional webcam drivers oriented themselves toward the features supported by V4L2 and tried to implement these as far as possible. This was not an easy task, since the specifications available to the developers were often incomplete or even had to be reverse engineered from scratch. Therefore the necessity to support features unknown to V4L2 rarely arose. With the USB Video Class standard this is completely different. The standard is publicly available, and if both device manufacturers and driver engineers stick to it, compatibility comes naturally. The challenge stems from the fact that the functions described in the UVC standard are not a subset of those supported by V4L2. It is therefore impossible for a Video4Linux application to make use of the entire UVC feature spectrum without resorting to interfaces that work in parallel to the V4L2 API. For the UVC driver the sysfs virtual file system takes over this role. It provides raw access to user mode software in a generic manner, all of this in parallel to the V4L2 API, which is still used for the entire video streaming part and provides support for a fairly general subset of the camera controls.
5.4.3 V4L2

We have seen previously that Video4Linux has two key tasks relevant to webcams:

• Make the video stream captured by the device driver available to applications.
• Provide image and camera related controls to applications.

V4L2 is good at the first point, but it has some deficiencies when it comes to the second one due to its limitation of control values to 32 bits (see 4.5.3). This is why our scenario does not rely solely on V4L2 for webcam controls but uses the UVC driver's sysfs interface where necessary.
We can see from the figures that V4L2 serves as the interface between user mode and kernel mode. In user mode it takes requests from the application, which it then redirects to the UVC driver running in kernel mode, and vice versa for the replies that originate from the driver and end up in the application. Another important point is that V4L2 is not limited to talking to one application at a time. As long as the driver supports it–there is no multiplexing done on Video4Linux's part–the same device can be opened multiple times by one or more processes. This is required by the current webcam framework because the video application is not the only component to access the V4L2 device handle. We will see the different access scenarios as we go.
5.4.4 GStreamer

Parts of our webcam framework are built on top of GStreamer because, in our opinion, it is currently the most advanced multimedia framework on the Linux platform. Its integration with the GNOME desktop environment proves that it has reached a respectable degree of stability and flexibility, and Phonon, the multimedia framework of KDE 4, will have a back-end for GStreamer. Together with the ongoing intensive development that takes place, this makes it a safe choice for multimedia applications and is likely to guarantee smooth integration into future software. Note that even though GStreamer is currently the only framework supported by the Linux webcam framework, plugins for different libraries like NMM can be written very easily. All that needs to be ported in such a case is the lvfilter plugin, the interface between GStreamer and liblumvp. This will become clear as we talk more about the components involved. There are three elements in the figure that take advantage of the GStreamer multimedia framework. Simply speaking, the box labeled GStreamer is the "application" as far as V4L2 is concerned. Technically speaking, only the GStreamer v4l2src plugin uses the V4L2 API; all other components use techniques provided by the GStreamer library to exchange data. Figure 5.3 visualizes this by comparing the component overview of a V4L2 application to that of a GStreamer application that uses a V4L2 video source.
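For readers unfamiliar with GStreamer, the following minimal sketch shows what the streaming part of such an application boils down to: a pipeline description whose source element is v4l2src (element names are those of the GStreamer 0.10 series; error handling and clean shutdown are omitted).

    #include <gst/gst.h>

    int main(int argc, char *argv[])
    {
        gst_init(&argc, &argv);

        /* v4l2src captures from the V4L2 device, ffmpegcolorspace converts the
         * camera's pixel format where necessary, xvimagesink displays the video. */
        GstElement *pipeline = gst_parse_launch(
            "v4l2src device=/dev/video0 ! ffmpegcolorspace ! xvimagesink", NULL);

        gst_element_set_state(pipeline, GST_STATE_PLAYING);
        g_main_loop_run(g_main_loop_new(NULL, FALSE));   /* stream until interrupted */
        return 0;
    }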
5.4.5 v4l2src

As the name already suggests, this plugin is the source of all V4L2 data that flows through the GStreamer pipeline. It translates V4L2 device properties into pad capabilities and pipeline state changes into V4L2 commands. This is best illustrated by an example. Table 5.1 shows the functions that v4l2src uses and the V4L2 counterparts that they call. Note that v4l2src does not directly process the GStreamer state transitions but is based on the GstPushSrc plugin, which wraps those and uses a callback mechanism. The capability negotiation that is carried out during stream initialization uses the information retrieved from V4L2 function calls like ENUM_FMT or
(a) Components involved when a V4L2 application displays video.
(b) Components involved when a GStreamer based application displays V4L2 video.
Figure 5.3: Component overview with and without the use of the GStreamer multimedia framework.
GStreamer    V4L2                 Description
start        open                 Initialization
get_caps     ENUM_FMT, TRY_FMT    Format enumeration
set_caps     S_FMT, STREAMON      Stream setup
create       DQBUF, QBUF          Streaming
...          ...
stop         STREAMOFF, close     Cleanup

Table 5.1: Translation between GStreamer and V4L2 elements and functions.
G_FMT to create a special data description format that GStreamer uses internally to check pads for compatibility. There are two so-called caps descriptors involved in our example, the pad capabilities and the fixed capabilities. The former is created by enumerating the device features during the get_caps phase. It is a set that contains the supported range of formats, resolutions, and frame rates and looks something like this:

    video/x-raw-yuv, format=YUY2, width=[ 160, 1280 ], height=[ 120, 960 ], framerate=[ 5/1, 30/1 ];
    image/jpeg, width=[ 160, 960 ], height=[ 120, 720 ], framerate=[ 5/1, 25/1 ]

The format is mostly self-explanatory. The camera supports two pixel formats, YUV (uncompressed) and MJPEG (compressed), and the intervals give the upper and lower limits on frame size and frame rate. Note that the section for the uncompressed format has an additional format attribute that specifies the FourCC code. This is necessary for the pipeline to identify the exact YUV format used, as there are many different ones, YUY2 being only one of them. The descriptor for the fixed capabilities is set only after the set_caps phase, when the stream format has been negotiated with V4L2. This capability contains no ranges or lists but is a simple subset of the pad capabilities. After requesting an uncompressed VGA stream at 25 fps from the camera, for example, it would look as follows:

    video/x-raw-yuv, format=YUY2, width=640, height=480, framerate=25/1

We can clearly see that the format chosen for the pipeline is a subset of the pad capabilities seen above. The intervals have disappeared and all attributes now have fixed values. All data that flows through the pipeline after the caps are fixed is of this format.
5.4.6 lvfilter

The Logitech video filter or, for short, lvfilter component is also realized as a GStreamer plugin. Its task is relatively simple: intercept the video stream when enabled (filter mode) and act as a no-op when disabled (pass-through mode).
We will come back to the functionality of lvfilter when we look at some of the other components, in particular liblumvp. For the moment, let lvfilter be a no-op.
5.4.7 LVGstCap (part 1 of 3: video streaming)

The sample webcam software provided as part of the framework is LVGstCap, the Logitech Video GStreamer Capture application. It is the third component in our schema that uses GStreamer and the only one with a user interface. LVGstCap is also the first webcam capture program to use the approach depicted in figure 5.1b, i.e. to use both V4L2 and the webcam framework simultaneously to access the device. This fact remains completely transparent to the user, as everything is nicely integrated into a single interface. Among others, LVGstCap provides the basic features expected from a webcam capture application:

• List the available cameras and select one.
• List the available frame formats (i.e. a combination of pixel format, image resolution, and frame rate) and select one.
• Start, stop, and freeze the video stream.
• Modify image controls (e.g. brightness, contrast, sharpness).

These features work with all webcams as long as the camera is supported by Linux and its driver works with the GStreamer v4l2src plugin. On top of this basic functionality LVGstCap supports some additional features. We will talk about them in parts 2 and 3.
5.4.8 libwebcam

The Webcam library is a cornerstone of the webcam framework in that all other new components rely on it in one way or another. Being more than just an important technical element, libwebcam realizes part of what the Video4Linux user space library was always supposed to be: an easy-to-use library that shields its users from many of the difficulties and problems of using the V4L2 API directly. Today libwebcam provides the following core features:

• Enumeration of all cameras available in the system.
• Provision of detailed information about the detected devices.
• A wrapper for the V4L2 frame format enumeration.
• Unified access to V4L2 and sysfs camera controls.

In addition, the interface is prepared to handle device events, ranging from newly detected cameras to control value changes and device button events. It is easy to add new features without breaking application compatibility, and the addition of new controls or events is straightforward.
5.4.9 libwebcampanel

The Webcam panel library takes libwebcam one step further. While libwebcam is still relatively low-level and does not interpret any of the controls or events directly, libwebcampanel does just that. It combines internal information about specific devices with the controls provided by libwebcam to provide applications with meta information and other added value. This makes it a common repository for device-specific information that would otherwise be distributed and duplicated within various applications. The core features of libwebcampanel are:

• Provide meta data that applications need to display camera information and user-friendly control elements.
• Implement a superset of libwebcam's functionality.
• Give access to the feature controls that liblumvp provides.

We can see that the main goal of libwebcampanel is making the development of generic webcam applications easier. It is for this reason that most applications will want to use libwebcampanel instead of the lower-level libwebcam. The last point of the above list will become clear when we discuss liblumvp. Before doing so, however, let us look at LVGstCap one more time to see how it uses the control meta information.
5.4.10 LVGstCap (part 2 of 3: camera controls)

When the user selects a device in LVGstCap, it immediately enumerates the controls that the chosen device provides and displays them in a side panel. Ordinarily, i.e. in the case of V4L2 controls, there is no additional information on the control apart from the value range and whether the control is a number, a Boolean, or a list of choices. While most controls can be made to fit into one of these categories, in practice there are a number of controls for which this representation is not quite right. Two examples are controls whose value is a bitmask and read-only controls. In the former case it seems inappropriate to present the user with an integer control that accepts values from, say, 0 to 255 when each bit has a distinct meaning. libwebcampanel might transform such a control either into a list of eight choices if the bits are mutually exclusive or split it up into eight different Boolean controls if arbitrary bit combinations are allowed. This allows LVGstCap to display the controls in a generic manner. In the case of read-only controls the user should not be allowed to change the GUI element but should still be able to read its current value. Therefore, if libwebcampanel sets the read-only flag on a certain control, LVGstCap will disable user interaction with it and gray it out to make this fact visually clear to the user. We will see a few concrete examples of such cases later in chapter 7.
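The bitmask-splitting idea itself boils down to simple bit manipulation, as the purely illustrative sketch below shows; the names and types do not reflect the actual libwebcampanel interface.

    #include <stdbool.h>
    #include <stdint.h>

    /* Read bit 'bit' of a bitmask control value as a virtual Boolean control. */
    static bool get_virtual_bool(uint32_t bitmask_value, unsigned bit)
    {
        return (bitmask_value >> bit) & 1u;
    }

    /* Set or clear bit 'bit' and return the new value to be written back to
     * the underlying hardware control. */
    static uint32_t set_virtual_bool(uint32_t bitmask_value, unsigned bit, bool on)
    {
        return on ? (bitmask_value | (1u << bit))
                  : (bitmask_value & ~(1u << bit));
    }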
5.4.11 liblumvp

The name liblumvp stands for Logitech user mode video processing library. It is the only component of the webcam framework that is not open source because it contains Logitech intellectual property. liblumvp consists of a fairly simple video pipeline that passes the video data it receives through a list of plugins that can process and modify the images before they are output again.
The library receives all its input from lvfilter. Whenever lvfilter is in filter mode, it sends the video data it intercepts to liblumvp and uses the (possibly modified) video buffer it receives back as its output. All of this remains transparent to the application [2].
One can think of a multitude of plugins that liblumvp could include; basically it could implement all the features that Logitech QuickCam provides on Windows. This requires applications to be able to communicate with these plugins, for example to enable or disable them or to change certain parameters. For this reason, the library exposes a number of controls, so-called feature controls, in a manner almost identical to how libwebcam does it. This is where the second reason for the additional layer introduced by libwebcampanel lies: it can provide applications with a list of hardware camera controls on the one hand and a list of liblumvp software controls on the other hand. Applications can handle both categories in an almost symmetric manner [3], which is just what LVGstCap does.
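To give an idea of what including lvfilter in a pipeline looks like from an application's point of view, here is a minimal sketch using the GStreamer C API of the time. The element names v4l2src, ffmpegcolorspace and xvimagesink are standard GStreamer 0.10 elements; the lvfilter element name is the one introduced above, and error handling is omitted.

#include <gst/gst.h>

int main(int argc, char *argv[])
{
    GstElement *pipeline;
    GMainLoop  *loop;

    gst_init(&argc, &argv);
    loop = g_main_loop_new(NULL, FALSE);

    /* v4l2src delivers frames from the camera, lvfilter routes them
     * through liblumvp, and xvimagesink displays the result. */
    pipeline = gst_parse_launch(
        "v4l2src device=/dev/video0 ! lvfilter ! "
        "ffmpegcolorspace ! xvimagesink", NULL);

    gst_element_set_state(pipeline, GST_STATE_PLAYING);
    g_main_loop_run(loop);      /* runs until the process is interrupted */

    gst_element_set_state(pipeline, GST_STATE_NULL);
    gst_object_unref(pipeline);
    return 0;
}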
5.4.12 LVGstCap (part 3 of 3: feature controls)

LVGstCap uses libwebcampanel not only for presenting camera controls to the user but also for feature controls if liblumvp is currently enabled. When a video stream is started, the feature control list is retrieved and its control items are displayed to the user in a special tab next to the ordinary controls. The application also has access to the names of the different features that liblumvp has compiled in. This information can be used to group the controls into categories when required.
When the user changes a feature control, LVGstCap communicates this to libwebcampanel, which takes care of the communication with liblumvp. We will later see that this communication is not as trivial in all cases as it may look at first. In the example of a video application that incorporates both video output and control panel in a single process, there is no need for special measures. There is, however, a case where this does not hold true: panel applications.

[2] As a matter of fact, the application must explicitly include lvfilter in its GStreamer pipeline, but once the pipeline stands, its presence is transparent and needs no further attention. We will see the advantages and disadvantages of this in chapter 7.
[3] The reasons why the two are not treated exactly the same are explained in chapter 7.
5.4.13 lvcmdpanel

A panel application is a (usually simple) program that does not do any video handling itself but allows the user to control a video stream that is currently active in another application. There are a few situations where panel applications are useful:
• Allow command line tools or scripts to modify video stream parameters.
• Permit control over the video stream of an application that does not have its own control panel.
• Provide an additional way of changing controls, e.g. from a tray application.
Our webcam framework includes an example application of the first kind, a command line tool called lvcmdpanel. Figure 5.4 shows the output of the help command. Chapter 8 has a sample session to illustrate some of the commands.

lvcmdpanel 0.1
Control webcam video using the command line

Usage: lvcmdpanel [OPTIONS]... [VALUES]...
  -h, --help                 Print help and exit
  -V, --version              Print version and exit
  -v, --verbose              Enable verbose output
  -d, --device=devicename    Specify the device to use
  -l, --list                 List available cameras
  -c, --clist                List available controls
  -g, --get=control          Retrieve the current control value
  -s, --set=control          Set a new control value
Figure 5.4: Command line options supported by lvcmdpanel.
5.5 Flashback: current problems

In chapter 4 we discovered a number of issues that current V4L2 applications have to deal with. Let us now revisit them one by one and show how our webcam framework avoids or solves them. Note that we do not go into great technical detail here but save that for chapter 7.

Avoid kernel mode components
Apart from some work on the UVC driver and V4L2 that is necessary to exploit the full feature set provided by current webcams, the entire framework consists of user mode components. This demonstrates that there are good ways to realize video processing and
related tasks in user mode today and that good solutions can be found for most of the associated drawbacks.

Direct device access
While direct device access can never be achieved without the support of select kernel mode components, we tackled this problem by extending the UVC driver so that it allows user mode applications to access the full spectrum of UVC extensions. With the help of sysfs, we have developed an interface that is superior to any standard C interface in that it allows shell scripts and system commands to access the hardware in an intuitive way.

Simple API
We have seen that mechanisms such as function callbacks are valuable, if not indispensable, for certain features like event notification. The webcam framework provides the corresponding interfaces, which can be used as soon as the kernel space components implement the necessary underlying mechanisms. In addition, the enumeration APIs that our libraries provide are superior in terms of usability to those that V4L2 offers. While some V4L2 functions like frame format enumeration can require dozens of ioctl calls and the management of dynamic data structures in the client, our framework allows all enumeration data to be retrieved in two function calls. The first one returns the required buffer size and the second one returns the data in one self-contained block of memory. The complexity on the application's side is minimal and so is the overhead.

Complicated device enumeration
Applications should not have to loop through the huge number of device nodes in the system and filter out the devices they can handle. This approach requires applications to know criteria they should not have to know, such as whether a given device node is a video device or not. If these criteria change, all applications have to be updated, which is a big problem for programs that are no longer maintained. This problem is solved by the device enumeration function of libwebcam.

No stateless device information querying
It seems unnecessary to open a device just to retrieve its name and other information an application may want to present to its user. In the same way that listing the contents of a directory with ls does not open each single file, it would be desirable to query device information at enumeration time. libwebcam does this by maintaining an internal list of camera devices that contains such data. It can be retrieved at any time by any application without opening a V4L2 device.

Missing frame format enumeration
As we will see later on, this problem was solved by adding the missing functionality directly to V4L2, with the UVC driver being the first one to support it. To keep the API as uniform and simple as possible for application developers, libwebcam has a wrapper for frame
format enumeration that greatly reduces the complexity associated with retrieving the supported frame formats.

Lack of current documentation
While we have not solved the problem of parts of the V4L2 documentation being outdated or incomplete, we did make sure that all libraries that application developers interact with are thoroughly documented; an extensive API specification is available in HTML format. In addition, this report gives a substantial amount of design and implementation background. This is a big advantage for developers who want to use parts of the webcam framework for their own applications.

The next two chapters are devoted to the more technical details of what was presented in this chapter. We will first look at the extensions and changes that were applied to existing components before we focus on the newly developed aspects of the webcam framework.
Chapter 6
Enhancing existing components

In order to realize the webcam framework as described in the previous chapter, a few extensions and changes to existing components were necessary. These range from small patches that correct wrong or inflexible behavior to rewrites of larger software parts. This chapter sums up the most important of these changes and lists them in order of their importance.
6.1
Linux UVC driver
With UVC devices being at the center of the Linux webcam framework, the UVC driver was the main focus of attention as far as preexisting components are concerned. The following sections describe some important changes and give an outlook on what is about to change in the near future.
6.1.1 Multiple open

From chapter 5 we know that multiple open is a useful feature to work around some of V4L2's limitations. Since the webcam framework relies on the camera driver being able to manage multiple simultaneously opened file handles to a given device, this was one of the most important extensions to the UVC driver.
The main challenges when developing a concept for multiple device opening are permissions and priorities. As with ordinary file handles, where the operating system must make sure that readers and writers do not disrupt each other, the video subsystem must make sure that two video device handles cannot influence each other in unwanted ways. Webcam drivers that are unable to multiplex the video stream must make sure that only a single device handle is streaming at a time. While this seems easy enough to do, the problem arises because the concept of "streaming" is not clearly definable. When does streaming start? When does
it stop? There are several steps involved between the moment an application decides to start the video stream and the moment it frees the device again:
1. Open the device.
2. Set up the stream format.
3. Start the stream.
4. Stop the stream.
5. Close the device.
Drawing the line at the right place is a trade-off between preventing ill interactions on the one hand and allowing a maximum of parallel access on the other. We decided to draw the boundary right before the stream setup. To this end we divided the Video4Linux functions into privileged (or streaming) ioctls and unprivileged ioctls and introduced a state machine for the device handles (figure 6.1).
Figure 6.1: The state machine for the device handles of the Linux UVC driver, used to guarantee device consistency for concurrent applications. The rounded rectangles show which ioctls can be carried out in the corresponding state.
There are four different states:
• Closed: The first unprivileged state. While not technically a state in the software, it serves as a visualization for all nonexistent handles that are about to spring into existence when they are opened by an application. It is also the state that all handles end up in when the application closes them.
• Passive: The second unprivileged state. Every handle is created in this state. It stands for the fact that the application has opened the device but has not yet made any steps towards starting the stream. Querying device information or enumerating controls can already happen in this state.
• Active: The first privileged state. A handle moves from passive to active when it starts setting up the video stream. Four ioctls can be identified in the UVC driver that applications use before they start streaming: TRY_FMT, S_FMT, and S_PARM for stream format setup and REQBUFS for buffer allocation. As soon as an application calls one of these functions, its handle moves into the active state, unless there already is another handle for the same device in a privileged state, in which case an error is returned.
• Streaming: The second privileged state. Using the STREAMON ioctl lets a handle move from active to streaming. Obviously only one handle can be in this state at a time for any given device because the driver made sure that no two handles could get into the active state in the first place.

The categorization of all ioctls into privileged and unprivileged ones not only yields the state transition events but also decides which ioctls can be used in which states. Table 6.1 contains a list of privileged ioctls. Also note that the only way for an application with a handle in a privileged state to give up its privileges is to close the handle.

ioctl        Description
S_INPUT      Select the current video input (no-op in uvcvideo).
QUERYBUF     Retrieve information about a buffer.
QBUF         Queue a video buffer.
DQBUF        Dequeue a video buffer.
STREAMON     Start streaming.
STREAMOFF    Stop streaming.
Table 6.1: Privileged ioctls in the uvcvideo state machine used for multiple open.
This scheme guarantees that different device handles for the same device can perform the tasks required for panel applications and the Linux webcam framework, while ensuring that the panel application cannot stop the stream or change its attributes in a way that could endanger the video application.
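The following sketch illustrates the idea behind the privilege check that the privileged ioctls perform. The structure and function names are simplified for illustration and do not correspond to the actual uvcvideo code; locking is omitted as well.

#include <errno.h>
#include <stddef.h>

struct fh;

struct dev {
    struct fh *privileged_handle;   /* NULL while no handle is privileged */
};

enum handle_state { HANDLE_PASSIVE, HANDLE_ACTIVE, HANDLE_STREAMING };

struct fh {
    struct dev *device;
    enum handle_state state;
};

/* Called at the beginning of TRY_FMT, S_FMT, S_PARM and REQBUFS. */
static int acquire_privileges(struct fh *handle)
{
    if (handle->state != HANDLE_PASSIVE)
        return 0;                           /* handle is already privileged */
    if (handle->device->privileged_handle != NULL)
        return -EBUSY;                      /* another handle owns the device */
    handle->device->privileged_handle = handle;
    handle->state = HANDLE_ACTIVE;
    return 0;
}

/* Called when the handle is closed; the only way to give up privileges. */
static void release_privileges(struct fh *handle)
{
    if (handle->device->privileged_handle == handle)
        handle->device->privileged_handle = NULL;
    handle->state = HANDLE_PASSIVE;
}

STREAMON would then move a handle from the active to the streaming state, and STREAMOFF back to active, without ever letting a second handle of the same device pass the check above.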
6.1.2 UVC extension support

We saw in section 2.4, when we discussed the USB Video Class specification, that extension units are important for device manufacturers to add additional features. For this reason, UVC drivers should have an interface that allows applications to access these extension units. Otherwise, they may not be able to exploit the full range of device capabilities.

Raw extension control support through sysfs
The first and obvious way to expose UVC extension controls in a generic way is to give applications raw access. Under Linux, sysfs is an ideal way to realize such an interface. Extensions and their controls are mapped to a hierarchical structure of virtual directories and files that applications can read from and write to. The files are treated like binary files, i.e. what the application writes to the file is sent as is to the device and what the application reads from the file is the same buffer that the driver has received from the device. During this whole process no interpretation of the relayed data is done on the driver's side. Let us look at a simplified example of such a sysfs directory structure:

extensions/
|-- 63610682-5070-49AB-B8CC-B3855E8D221D
|-- 63610682-5070-49AB-B8CC-B3855E8D221E
|-- 63610682-5070-49AB-B8CC-B3855E8D221F
+-- 63610682-5070-49AB-B8CC-B3855E8D2256
    |-- ctrl_1
    +-- ctrl_2
        |-- cur
        |-- def
        |-- info
        |-- len
        |-- max
        |-- min
        |-- name
        +-- res

We can see that the camera supports four different extension units, each of which is identified by a unique ID. The contents of the last one show two controls, and one of the controls has its virtual files visible. All these files correspond directly to the UVC commands of the same name. For example, the read-only files def and len map to GET_DEF and GET_LEN. In the case of the only writable file, cur, there are two corresponding UVC commands: GET_CUR and SET_CUR. Whatever is written to the cur file is wrapped within a SET_CUR command and sent to the device. In the opposite case, where an application opens cur and reads from it, the driver creates a GET_CUR request, sends it to the device and turns the device response into the file contents, followed by an
end-of-file marker. If an error occurs during the process, the corresponding read or write call returns an error.
While this approach works well and is supported by our extended UVC driver, there is a limitation associated with it that has to do with the way ownership and permissions are set on these virtual files. This can lead to security issues on multi-user machines, as section 7.5.1 will show. Another problem with this approach of using raw data is that applications must know exactly what they are doing. This is undesirable in the case of generic applications because the knowledge has to be duplicated in every single one of them. The following section describes a possible way to resolve this issue.

Mapping UVC to V4L2 controls
V4L2 applications cannot use the raw sysfs controls unless they include the necessary tools and knowledge. Obviously, it would be easier to just use a library like libwebcam or libwebcampanel that can wrap any sort of controls behind a simple and consistent interface, but there are situations where this may not be an option, for example in the case of applications that are no longer maintained. If such an application has functions to enumerate V4L2 controls and present them in a generic manner, then all it would take to allow the program to use UVC extension controls is a mapping between the two.
Designing and implementing a flexible mechanism that can cover most of the cases to be expected in the foreseeable future is an ongoing process for which the ground stones were laid as part of this project. One of the assumptions we made was that there could be a 1:n mapping between UVC and V4L2 controls but not in the opposite direction. The rationale behind this is that V4L2 controls must already be as simple and sensible as possible since the application is in contact with them. For UVC controls, however, it is conceivable that a device would pack multiple related settings into a single control [1]. If that is the case, applications should see multiple V4L2 controls without knowing that the driver maps them to one and the same UVC control in the background. Figure 6.2 gives a schema of such a mapping.

[1] As a matter of fact, we shall see such an example in section 7.3.

Figure 6.2: Schema of a UVC control to V4L2 control mapping. The UVC control descriptor contains information about how to locate and access the UVC control. The V4L2 control part has attributes that determine offset and length inside the UVC control as well as the properties of the V4L2 control.

The next fundamental point was the question where the mapping definitions should come from. The obvious answer is from the driver itself, but with the perspective of an increasing release frequency of new UVC devices in mind, this cannot be the final answer. It would mean that new driver versions would have to be released on a very frequent basis only to update the mappings. We therefore came to the conclusion that the driver should hardcode as few control mappings as possible, with the majority coming from user space. The decision on how such mappings are going to be fed to the driver has not yet been made. Two solutions seem reasonable:
1. Through sysfs. User space applications could write mapping data to a sysfs file and the driver would generate a mapping from the data. The main challenge here would be to find a reasonable format that is both human-readable and easily parseable by the driver. XML would be ideal for the former, but a driver cannot be expected to parse XML. Binary data would be easier for the driver to parse but contradicts the philosophy of sysfs, according to which exchanged data should be human-readable. Whatever the format looks like, the mapping setup would be as easy as redirecting a configuration file to a sysfs file.
2. Through custom ioctls. For the driver side the same argument as for a binary sysfs file applies here, with the difference that ioctls were designed for binary data. The drawback is that a specialized user space application would be necessary to install the mapping data, such as a control daemon.
For the moment, we restrict ourselves to hardcoded mappings. The future will show which way turns out to be the best to manage the mapping configuration from user space.
Internally the driver manages a global list of control descriptors with their V4L2 mappings. In addition, a device-dependent list of controls, the so-called control instances, is used to store information about each device's controls like
the range of valid values. When a control descriptor is added, the driver loops through all devices and adds a control instance only if the device in question supports the new control.
This process required another change to the driver's architecture: the addition of a global device list. Many drivers do not need to maintain an internal list of devices because almost all APIs provide space for a custom pointer in the structures that they make available when they call into the driver. Such a pointer allows for better scaling and less overhead because the driver does not have to walk any data structures to retrieve its internal state. This is indispensable for performance-critical applications and helps simplify the code in any case. The Linux UVC driver also uses this technique whenever possible, but for adding and removing control mappings it must fall back to using the device list. Luckily, this does not cause any performance problems because these are exceptional events that do not occur during streaming.
Once all the data structures are in place, the V4L2 control access functions must be rewritten to use the mappings. Laurent Pinchart is currently working on this as part of his rewrite of the control code, which also fixes a number of other small problems.
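To make the mapping schema of figure 6.2 more tangible, the following sketch shows the kind of information a control mapping entry could carry. The field names are assumptions made for illustration and do not reflect the actual data structures of the Linux UVC driver.

#include <stdint.h>

/* One V4L2 control derived from (part of) a UVC control. */
struct control_mapping_sketch {
    uint32_t v4l2_id;        /* V4L2 control ID exposed to applications    */
    uint8_t  entity[16];     /* GUID of the UVC unit or terminal           */
    uint8_t  selector;       /* UVC control selector within that entity    */
    uint16_t offset;         /* bit offset of the field in the UVC payload */
    uint16_t size;           /* bit length of the field                    */
    uint8_t  v4l2_type;      /* integer, Boolean, menu, ...                */
};

/* A 1:n mapping means that several such entries may reference the same UVC
 * control (same entity and selector) at different offsets, for example a
 * pan field and a tilt field inside a single combined control. */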
6.1.3 V4L2 controls in sysfs

In connection with the topics mentioned above there is an interesting discussion going on about whether all V4L2 controls could be exposed through sysfs by default and in a generic manner. The idea comes from the pvrusb2 driver [10], which does just that. What originally started out as a tool for debugging turned out to be a useful option for scripting the supported devices. Given the broad application scenarios and the generic nature of the feature, it would be optimal if the V4L2 core took care of automatically exposing all device controls to sysfs in addition to the V4L2 controls that are available today.
While currently not more than a point of discussion and an entry on the wish list, it is likely that Video4Linux will eventually receive such a control mapping layer. It would complete the sysfs interface of uvcvideo in a very nice manner and open the doors for entirely new tools. If such an interface became reality, libwebcam could automatically fall back to it when the current driver does not support multiple opening of the same device, a situation that would otherwise prevent it from using the V4L2 controls it relies on now. This switch would be completely transparent to users of libwebcam.
6.2
Video4Linux
In section 4.5.3 we saw a number of issues that developers of software using the current version of V4L2 have to deal with. While most of them could not be fixed without breaking backwards compatibility, the most severe one, the lack of frame format enumeration, was relatively easy to overcome.
V4L2 currently provides a way to enumerate a device's supported pixel formats using the VIDIOC_ENUM_FMT ioctl. It does this by using the standard way for list enumeration in V4L2: the application repeatedly calls a given ioctl with an increasing index, starting at zero, and receives the corresponding list entry in return. If there are no more entries left, i.e. the index is out of bounds, the driver returns the EINVAL error value. There are two fundamental problems with this approach:
• Application complexity. The application cannot know how many entries there are in the list. Using a single dynamically allocated memory buffer is therefore out of the question unless the buffer size is chosen much bigger than the average expected size. The only reliable and scalable way is to build up a linked list within the application and add an entry for each ioctl call. This shifts the complexity towards the application, something that an API should avoid in order to encourage developers to use it in the first place and to discourage possibly unreliable hacks.
• Non-atomicity. If the list that the application wants to enumerate does not remain static over time, there is always a chance that the list changes while the application is enumerating its contents. If this happens, the received data is inevitably inconsistent, leading to unexpected behavior in the best case or crashes in the worst case. The first workaround that comes to mind is that the driver could return a special error value indicating that the data has changed and that the application should restart the enumeration. Unfortunately this does not work because the driver has no way of knowing whether an application is currently enumerating at all. Nothing forbids the application from starting at an index other than zero or from quitting the enumeration process before the driver has had a chance to return the end-of-list marker.
When we decided to add frame size and frame rate enumeration, our first draft would have solved both of these problems at once. The entire list would have been returned in a single buffer, making it easy for the application to parse on the one hand and rendering the mechanism insusceptible to consistency issues on the other. The draft received little positive feedback, however, and we had to settle for a less elegant version that we present in the remainder of this section. The advantage of the second approach is its obvious simplicity on the driver side. It is left up to the reader to decide whether driver simplicity justifies the above problems.
No matter which enumeration approach is chosen, an important point must be kept in mind: the attributes pixel format, frame size, and frame rate are not independent of each other. For any given pixel format, there is a list of supported frame sizes, and any given combination of pixel format and frame size determines the supported frame rates. This seems to imply a certain hierarchy of these three attributes, but it is not necessarily clear what this hierarchy
should look like. Technical details, like the UVC descriptor format, suggest the following hierarchy:
1. Pixel format
2. Frame size
3. Frame rate
However, for users it may not be obvious why they should even care about the pixel format. A video stream should mainly have a large enough image and a high enough frame rate. The pixel format, and whether compression is used, is just a technicality that the application should deal with in an intelligent and transparent manner. As a result, a user might prefer to first choose from a list of frame sizes and then, possibly, from a list of frame rates as a function of the selected resolution. In order to keep the V4L2 frame format enumeration API consistent with the other layers, we decided to leave the hierarchy in the order mentioned above. An application can still opt to collect the entire attribute hierarchy and present the user with a more suitable order.
Once such a hierarchy has been established, the input and output values of each of the enumeration functions become obvious. The highest level has no dependency on lower levels; the lower levels depend only on the higher levels. This mechanism can theoretically be extended to an arbitrary number of attributes, although in practice there are limits to what can be considered a reasonable number of input values. Table 6.2 gives the situation for the three attributes used by webcams.

Enumeration attribute   Input parameters                Output values
Pixel format            none                            Pixel formats
Frame size              Pixel format f                  Frame sizes supported for pixel format f
Frame rate              Pixel format f, frame size s    Frame rates supported for pixel format f and frame size s
Table 6.2: Input and output values of the frame format enumeration functions.
As it happens, the V4L2 API already provided a function for pixel format enumeration, which means that it could be seamlessly integrated with our design for frame size and frame rate enumeration. These functions are now part of the official V4L2 API, the documentation for which can be found at [5].
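The following sketch shows how an application can walk this hierarchy using the resulting V4L2 ioctls, VIDIOC_ENUM_FMT, VIDIOC_ENUM_FRAMESIZES and VIDIOC_ENUM_FRAMEINTERVALS. It only handles discrete frame sizes and intervals and keeps error handling to the end-of-list check.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Enumerate pixel formats, frame sizes and frame intervals of a capture
 * device. Each inner loop is parameterized by the levels above it. */
int main(void)
{
    int fd = open("/dev/video0", O_RDONLY);
    struct v4l2_fmtdesc fmt;
    struct v4l2_frmsizeenum size;
    struct v4l2_frmivalenum ival;

    memset(&fmt, 0, sizeof(fmt));
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    for (fmt.index = 0; ioctl(fd, VIDIOC_ENUM_FMT, &fmt) == 0; fmt.index++) {
        printf("%s\n", fmt.description);

        memset(&size, 0, sizeof(size));
        size.pixel_format = fmt.pixelformat;
        for (size.index = 0;
             ioctl(fd, VIDIOC_ENUM_FRAMESIZES, &size) == 0; size.index++) {
            if (size.type != V4L2_FRMSIZE_TYPE_DISCRETE)
                continue;
            printf("  %ux%u:", size.discrete.width, size.discrete.height);

            memset(&ival, 0, sizeof(ival));
            ival.pixel_format = fmt.pixelformat;
            ival.width  = size.discrete.width;
            ival.height = size.discrete.height;
            for (ival.index = 0;
                 ioctl(fd, VIDIOC_ENUM_FRAMEINTERVALS, &ival) == 0;
                 ival.index++) {
                if (ival.type == V4L2_FRMIVAL_TYPE_DISCRETE)
                    printf(" %u/%u s", ival.discrete.numerator,
                           ival.discrete.denominator);
            }
            printf("\n");
        }
    }
    return 0;
}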
6.3
GStreamer
GStreamer has had V4L2 support in the form of the v4l2src plugin for a while, but it had not received any testing with webcams using the UVC driver. There is a particularity about the UVC driver that causes it not to work with a few applications, namely the absence of the VIDIOC_G_PARM and VIDIOC_S_PARM ioctls, which do not apply to digital devices. The GStreamer V4L2 source was one of the applications that relied on these functions being present and failed otherwise. After two small patches, however, the first to remove the above dependency and the second to fix a small bug in the frame rate detection code, the v4l2src plugin worked well with UVC webcams and proved to be a good choice as a basis for our project. In September 2006, Edgard Lima, one of the plugin's authors, added proper support for frame rate negotiation using GStreamer capabilities, which allows GStreamer applications to take full advantage of the spectrum of streaming parameters.
6.4
Bits and pieces
Especially during the first few weeks the project involved a lot of testing and bug fixing in various applications. Some of these changes are listed below.

Ekiga
During some tests with a prototype camera a bug in the JPEG decoder of Ekiga became apparent. The JPEG standard allows an encoder to add a customized Huffman table if it does not want to use the one defined in the standard. The decoder did not process such images properly and failed to display the image as a result. There were also two issues with unsupported ioctls and the frame rate computation, very similar to those in GStreamer's v4l2src.

Spca5xx
The Spca5xx driver already supports a large number of webcams, as we saw in section 4.3.2, and the author relies to a large part on user feedback to maintain his compatibility list. We also did some tests at Logitech with a number of our older cameras and found a few that were not recognized by the driver but would still work after patching its USB PID list.

luvcview
The luvcview tool had a problem with empty frames that could occur with certain cameras and would make the application crash. This was fixed as part of a patch that added two different modes for capturing raw frames. One mode writes each received frame into a separate file (raw frame capturing), the other creates one single file where it stores the complete video stream (raw frame stream capturing). The first mode can be used to easily
capture frames from the camera, although, depending on the pixel format, the data may require some post-processing, e.g. adding an image header.
Chapter 7
New components

Chapter 5 gave an overview of our webcam framework and described its goals without going into much technical detail. This chapter is dedicated to elaborating how some of these goals were achieved and implemented. It will also explain the design decisions and why we chose certain solutions over others. At the same time we will show the limitations of the current solution and their implications for future extensibility. Another topic of this chapter is the licensing model of the framework, a crucial topic for any open source project. We will also give an outlook on future work and opportunities.
7.1
libwebcam
The goals of the Webcam library, or simply libwebcam, were briefly covered in section 5.4.8. The API is described in great detail in the documentation that comes with the sources. The functions can be grouped into the following categories:
• Initialization and cleanup
• Opening and closing devices
• Device enumeration and information retrieval
• Frame format enumeration
• Control enumeration and usage
• Event enumeration and registration
The general usage is rather simple. Each application must initialize the library before it is first used. This allows the library to properly set up its internal data structures. The client can then continue by either enumerating devices or, if it already knows which device it wants to open, directly go ahead and open a device. If a device was successfully opened, the library returns a handle
that the application has to use for all subsequent requests. This handle is then used for tasks such as enumerating frame formats or controls and reading or writing control values. Once the application is done, it should close the device handles and uninitialize the library to properly free any resources that the library may have allocated. Let us now look at a few implementation details that application developers using libwebcam should know about.
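The following minimal sketch illustrates this sequence of calls. The prototypes shown are assumptions made for illustration; the actual function names and signatures are defined in the libwebcam API documentation mentioned above.

/* Hypothetical prototypes mirroring the usage pattern described in the
 * text; they do not claim to match the real libwebcam API. */
typedef unsigned int CHandle;
int     c_init(void);
void    c_cleanup(void);
CHandle c_open_device(const char *name);
void    c_close_device(CHandle handle);

int main(void)
{
    CHandle dev;

    if (c_init() != 0)              /* 1. initialize the library         */
        return 1;
    dev = c_open_device("video0");  /* 2. open a device (by name)        */
    if (dev == 0) {
        c_cleanup();
        return 1;
    }
    /* 3. use the handle: enumerate frame formats, read/write controls   */
    c_close_device(dev);            /* 4. close the handle               */
    c_cleanup();                    /* 5. free the library's resources   */
    return 0;
}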
7.1.1 Enumeration functions

All enumeration functions use an approach that makes it very easy for applications to retrieve the contents of the list in question. This means that enumeration usually takes exactly two calls, the first one to determine the required buffer size, the second one to fill the buffer. In the rare occasion where the list changes between the two calls, a third call can be necessary, but with the current implementation this situation can only arise for devices. The following pseudo code illustrates the usage of this enumeration schema from the point of view of the application.

buffer := NULL
buffer_size := 0
required_size := c_enum(buffer : NULL, size : buffer_size)
while (required_size > buffer_size)
    buffer_size := required_size
    buffer := allocate_memory(size : buffer_size)
    required_size := c_enum(buffer : buffer, size : buffer_size)

Obviously, the syntax of the actual API looks slightly different and applications must do proper memory management and error handling. Another aspect that makes this type of enumeration very easy for its users is that the buffer is completely self-contained. Even though the buffer can contain variable-sized data, it can be treated as an array through which the application can loop. Figure 7.1 illustrates the memory layout of such a buffer.
7.1.2 Thread-safety

The entire library is programmed in a way that makes it safe to use from multi-threaded applications. All internal data structures are protected against simultaneous changes from different threads that could otherwise lead to inconsistent data or program errors. Since most GUI applications are multi-threaded, this spares the application developer from taking additional steps to prevent multiple simultaneous calls to libwebcam functions.
Figure 7.1: Illustration of the memory block returned by a libwebcam enumeration function. The buffer contains three list items and a number of variable-sized items (strings in the example). Each list item has four words of fixed-sized data and two char pointers. The second item shows pointers to two strings in the variable-sized data area at the end of the buffer. Pointers can also be NULL, in which case there is no space reserved for them to point to. Note that only the pointers belonging to the second item are illustrated.
7.2 liblumvp and lvfilter

The Logitech user mode video processing library is in some ways similar to libwebcam. It also provides controls, as we have seen in section 5.4.11, and its interface is very similar when it comes to library initialization/cleanup or control enumeration. The function categories are:
• Initialization and cleanup
• Opening and closing devices
• Video stream initialization
• Feature enumeration and management
• Feature control enumeration and usage
• Video processing
In our webcam framework, liblumvp is not directly used by the application. Instead, its two clients are lvfilter, the video interception filter that delivers video data, and libwebcampanel, from which it receives commands directed at the features that it provides. Nothing, however, prevents an application from using liblumvp directly, apart from the fact that this would make the application directly dependent on a library that was designed to act transparently in the background.
lvfilter hooks into the GStreamer video pipeline, where it influences the stream capability negotiation in a way that makes sure that the format is understood by liblumvp. It then initializes the latter with the negotiated stream parameters and waits for the pipeline state to change. When the stream starts, it redirects all video frames through liblumvp, where they can be processed and, possibly, modified before it outputs them to the remaining elements in the pipeline. While lvfilter takes care of the proper initialization of liblumvp, it does not use the feature controls that liblumvp provides. Interaction with these happens through libwebcampanel, as we will see shortly.
We have mentioned that applications must explicitly make use of liblumvp by including lvfilter in their GStreamer pipeline. This has positive and negative sides. The list of drawbacks is led by the fact that it does not smoothly integrate into existing applications and that each application must test for the existence of lvfilter if it wants to use the extra features. It is this very fact, however, that can also be seen as an opportunity. Some users do not like components that work transparently, either because they could potentially have negative interactions that would make problems hard to debug or because they do not trust closed source libraries. Before we move on to the next topic, a few words about the two plugins that are currently available:
Mirror
The first, very simple plugin is available for any camera and lets the user mirror the image vertically and horizontally. While the former can be used to turn a webcam into an actual mirror, the latter can be useful for laptops with cameras built into the top of the screen, because these are usually rotatable by 180 degrees along the upper edge and allow switching between targeting the user and what is in front of the user.

Face tracking
This module corresponds closely to what users of the QuickCam software on Windows know as the "Track two or more of us" mode of the face tracking feature. The algorithm detects people's faces and zooms in on them, so that they are better visible when the user moves away from the camera. If the camera supports mechanical pan and tilt, like the Logitech QuickCam Orbit, it does so by moving the lens head in the right direction. For other cameras the same is done digitally. This feature is only available for Logitech cameras that are UVC compatible.

In the future, more features from the Logitech QuickCam software will be made available for Linux users through similar plugins.
7.3 libwebcampanel

The interface of the Webcam panel library is very similar to the one provided by libwebcam. This was a design decision that should make it easy for applications that started out using libwebcam to switch to libwebcampanel when they want more functionality.
7.3.1 Meta information

Section 5.4.9 gave a high-level overview of what sort of information filtering libwebcampanel adds on top of what libwebcam provides. Let us look at these in more detail.
• Devices
– Camera name change: The camera name string in libwebcam comes from the V4L2 driver and is usually generic. In the case of the UVC driver it is always "USB Video Class device", which is not very helpful for the user who has three different UVC cameras connected. For this reason libwebcampanel has a built-in database of device names that it associates with the help of their USB vendor and product IDs. This gives the application more descriptive names like "Logitech QuickCam Fusion". If the library recognizes only the vendor but not the device ID, say 0x1234, it is still able to provide a somewhat useful string like "Unknown Logitech camera (0x1234)".
• Controls
– Control attribute modification: These modifications range from simple name changes to more complex ones like modification of the value ranges or completely changing the type of a control. Controls can also be made read-only or write-only.
– Control deletion: A control can be hidden from the application. This can be useful in cases where a driver wrongly reports a generic control that is not supported by the hardware. The library can filter those out, stopping them from appearing in the application and confusing users.
– Control splitting: A single control can be split into multiple controls. As a fictitious example, a 3D motion control could be split up into three different motion controls, one for each axis.
While the first point is pretty self-explanatory, the second one deserves a few real-life examples.

Example 1: Control attribute modification
The UVC standard defines a control called Auto-exposure mode. It determines which parameters the camera changes to adapt to different lighting conditions. This control is an 8-bit wide bitmask with only four of the eight bits actually being used. The bits are mutually exclusive, leaving 1, 2, 4, 8 as the set of legal values. However, due to the limited control description capabilities of UVC, the control is usually exported as an integer control with valid values ranging from 1 to 255. If an application uses a generic algorithm to display such a control, it might present the user with a slider or range control that can take all possible values between 1 and 255. Unfortunately, most values will have no effect because they do not represent a valid bitmask. libwebcampanel comes with enough information to avoid this situation by turning the auto-exposure mode control into a selection control that allows only four different settings, the ones defined in the UVC standard. Now the user will see a list box, or whatever the application developer decided to use to represent a selection control, with each entry having a distinct and clear meaning and no chance for the user to accidentally select invalid values, a major gain in usability.

Example 2: Control splitting
The Logitech QuickCam Orbit series has mechanical pan and tilt capabilities with the help of two little motors. Both motors can be moved separately by a given angle. The control through which these capabilities are exposed, however, combines both values, i.e. relative pan angle and relative tilt angle, in a single 4-byte control containing a signed 2-byte integer for each. For an application such a control is virtually unusable without the knowledge of how the control values have to be interpreted.
libwebcampanel solves this problem very elegantly by splitting up the control into two separate controls: relative pan angle and relative tilt angle. It also
marks both controls as write-only, because it makes no sense to read a relative angle, and as action controls, meaning that changing the control causes a one-time action to be performed. The application can use this information, as in the example of LVGstCap, to present the user with a slider control that can be dragged to either side and jumps back to the neutral position when let go.
Obviously, most of this information is device-specific and needs to be kept up to date whenever new devices become available. It can therefore be expected that new minor versions of the library will appear rather frequently, containing only minor changes. An alternative approach would be to move all device-specific information outside the library, e.g. into XML configuration files. While this would make it easier to keep the information current, it would also make it harder to describe device-specific behavior. The future will show which one of these approaches is more suitable.
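To make example 2 more concrete, the following sketch shows how the combined 4-byte value could be packed and unpacked, assuming the pan angle occupies the first two bytes and the values use USB's little-endian byte order; neither these assumptions nor the helper names are taken from the actual libwebcampanel code.

#include <stdint.h>

/* Pack separate pan and tilt angles into the combined 4-byte control. */
static void pack_pan_tilt(uint8_t buf[4], int16_t pan, int16_t tilt)
{
    buf[0] = (uint8_t)(pan & 0xff);
    buf[1] = (uint8_t)((pan >> 8) & 0xff);
    buf[2] = (uint8_t)(tilt & 0xff);
    buf[3] = (uint8_t)((tilt >> 8) & 0xff);
}

/* Split the combined control value back into the two angles. */
static void unpack_pan_tilt(const uint8_t buf[4], int16_t *pan, int16_t *tilt)
{
    *pan  = (int16_t)(buf[0] | (buf[1] << 8));
    *tilt = (int16_t)(buf[2] | (buf[3] << 8));
}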
7.3.2 Feature controls

Feature controls directly influence what goes on inside liblumvp. They can enable or disable certain features or change the way video effects operate. They are different from ordinary controls in a few ways and require a few special provisions, as we shall see now.

Controls vs. feature controls
We have previously mentioned that controls and feature controls are handled in an almost symmetrical manner. The small but important difference between the two is that ordinary controls are device-related whereas feature controls are stream-related. What this means is that the list of device controls can be queried before the application takes any steps to start the video stream. The driver, and therefore V4L2, know about them from the very start. At this time, the GStreamer pipeline may not even be built and lvfilter and liblumvp not loaded. So in practice, a video application will probably query the camera controls right after device connection but the feature controls only when the video is about to be displayed. This timing difference would make it considerably more complicated for applications to manage a combined control list in a user-friendly manner. As a nice side effect, it becomes easy for the application to selectively support only one set of controls or to clearly separate the two sets.

Communication between client and library
There is another very important point that was left unmentioned until now and that only occurs in the case of a panel application. The video stream, and therefore liblumvp, and the panel application that uses libwebcampanel run in
two different processes. This means that the application would try in vain to change feature controls. liblumvp could well be loaded into the application's address space, but it would be a second and completely independent instance. To avoid this problem, the two libraries must be able to communicate across process borders, a clear case for inter-process communication.
Both libwebcampanel and liblumvp have a socket implementation over which they can transfer all requests related to feature controls. Their semantics are completely identical; only the medium differs. Whenever a client opens a device using liblumvp (in our case this is done by lvfilter), it creates a socket server thread that waits for such requests. libwebcampanel, on the other side, has a socket client that it uses to send requests to liblumvp whenever one of the feature control functions is used.
There is a possible optimization here, namely the use of the C interface instead of the IPC interface whenever both libraries run in the same process. However, the IPC implementation does not cause any noticeable delays since the amount of transmitted data remains in the order of kilobytes. We opted for the simpler solution of using the same interface in both cases, although the C version is still available and ready to use if circumstances make it seem preferable.
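As a rough illustration of the client side of this mechanism, the following sketch sends a feature control change over a UNIX domain socket. Both the socket path and the request layout are invented for illustration; the actual wire protocol of libwebcampanel and liblumvp is not reproduced here.

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Illustrative request record; the real protocol may differ. */
struct feature_request {
    unsigned int control_id;    /* which feature control to change */
    int          value;         /* new value                       */
};

/* Send a feature control change to the (hypothetical) liblumvp socket. */
static int send_feature_request(unsigned int control_id, int value)
{
    struct sockaddr_un addr;
    struct feature_request req = { control_id, value };
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/liblumvp.socket", sizeof(addr.sun_path) - 1);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        write(fd, &req, sizeof(req)) != sizeof(req)) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}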
7.4
Build system
The build system of the webcam framework is based on the Autotools suite, the traditional choice for most Linux software. The project is mostly self-contained with the exception of liblumvp, which has some dependencies on convenience libraries [1] outside the build tree. These convenience libraries contain some of the functionality that liblumvp plugins rely on and were ported from the corresponding Windows libraries. The directory structure of the open source part looks as follows:

/
+--lib
|  +--libwebcam
|  +--libwebcampanel
|  +--gstlvfilter
|  +--src
+--lvgstcap

The top level Makefile generated by Autotools compiles all the components, although each component can also be built and installed on its own. Generic build instructions are included in the source archive.

[1] Convenience libraries group a number of partially linked object files together. While they are not suitable for use as-is, they can be compiled into other projects in a similar way to ordinary object files.
7.5
Limitations
Each solution has its trade-offs and limitations, and it is important to be aware of them. Some of them have technical reasons; others are the result of time constraints or are beyond the project's scope. This section is dedicated to making developers and users of the Linux webcam framework aware of these limitations. At the same time it gives pointers for future work, which are the topic of the next section.
7.5.1 UVC driver

Even though the Linux UVC driver is stable and provides support for all basic UVC features needed to do video streaming and manage video controls, it is still work in progress and there remains a lot of work to be done before it implements the entire UVC standard. At the moment, however, having a complete UVC driver cannot be more than a long-term goal. For one thing, the UVC standard describes many features and technologies for which there exist no devices today, and for another, not even Windows ships with such a driver. What is important, and this is a short-term goal that will be achieved soon, is that the driver supports the features that today's devices use. Luckily, the list of tasks to get there is now down to a relatively small number of items. A few of these are discussed below.

Support for status interrupts
The UVC standard defines a status interrupt endpoint that devices must implement if they want to take advantage of certain special features. These are:
• Hardware triggers (e.g. buttons on the camera device for functions such as still image capturing)
• Asynchronous controls (e.g. motor controls whose execution can take a considerable amount of time and after whose completion the driver should be notified)
• AutoUpdate controls (controls whose values can change without an external set request, e.g. sensor-based controls)
When such an event occurs, the device sends a corresponding interrupt packet to the host and the UVC driver can take the necessary action, for example update its internal state or pass the notification on to user space applications. Currently, the Linux UVC driver has no support for status interrupts and consequently ignores these packets. While this has no influence on the video stream itself, it prevents applications from receiving device button events or being notified when a motor control command has finished. The latter can be quite useful for applications because they may want to prevent the user from sending further motion commands while the device is still moving.
In the context of mechanical pan/tilt there are two other issues that the lack of such a notification brings with it:
1. Motion tracking. When a motion tracking algorithm, like the one used for multiple face tracking in liblumvp, issues a pan or tilt command to the camera, it must temporarily stop processing the video frames for the duration of the movement. Otherwise, the entire scene would be interpreted as being in motion due to the viewport translation that happens. After the motion has completed, the algorithm must resynchronize. If the algorithm has no way of knowing the exact completion time, it must resort to approximations and guesswork, thereby decreasing its performance. This is what liblumvp does at the moment.
2. Keeping track of the current angle. If the hardware itself does not provide the driver with information as to the current pan and tilt angles, the driver or user space library can approximate this by keeping track of the relative motion commands it sends to the device. For this purpose, it needs to know whether a given command has succeeded and, if so, at what point in time in order to avoid overlapping requests.
One of the reasons why the UVC driver does not currently process status interrupts is that the V4L2 API does not itself have any event notification support. As we saw in section 4.5.1, such a scheme is not easy to implement due to the lack of callback techniques that kernel space components have at their disposal. The sysfs interface that is about to be included in the UVC driver is a first step in the direction of adding a notification scheme. Since kernel 2.6.17 it is possible to make sysfs attributes pollable (see [2] for an overview of the interface). This polling process does not impose any CPU load on the system because it is implemented with the help of the poll system call. The polling process sleeps and wakes up as soon as one of the monitored attributes changes. For the application this incurs some extra complexity, notably the necessity of multi-threading. This is clearly a task for a library like libwebcam. The polling functionality only needs to be written once, and at the same time the notifications can be sent using a more application-friendly mechanism like callbacks. libwebcam already has an interface designed for this exact purpose. As soon as the driver is up to the task, applications will be able to register callback functions for individual events, some of them coming from the hardware, others being synthesized by the library itself.

Sysfs permissions
Another problem that still awaits resolution is to find a method to avoid giving all users arbitrary access to the controls exported to the sysfs virtual file system. Since sysfs attributes have fixed root:root ownership when the UVC driver creates them, this does not leave it much choice when it comes to defining
permissions. Modes 0660 and 0664, on the one hand, would only give the superuser write access to the sysfs attributes, and therefore to the UVC extension controls. Mode 0666, on the other hand, would permit every user to change the behavior of the attached video devices, leading to a rather undesirable situation: a guest user who happens to be logged in via SSH on a machine on which a video conference is in progress could change settings such as brightness or even cause the camera to tilt, despite not having access to the video stream or the V4L2 interface itself.
For device nodes this problem is usually resolved by changing the group ownership to something like root:video and giving them 0660 permissions. This still does not give fine-grained permissions to individual users, but at least a user has to be a member of the video group to be able to access the camera. A good solution would be to duplicate the ownership and permissions from the device node and apply them to the sysfs nodes. This would make sure that whoever has access to the V4L2 video device also has access to the device's UVC extensions and controls. Currently, however, such a solution does not seem feasible due to the hard-coded attribute ownership.
Another approach to the problem would be to let user space handle the permissions. Even though sysfs attributes have their UID and GID set to 0 on creation, they do preserve new values when these are set from user space, e.g. using chmod. A user space application running with elevated privileges could therefore take care of this task.

Ongoing development
The ongoing development of the UVC driver is of course not a limitation in itself. The limitation merely stems from the fact that not all of the proposed changes have made their way into the main driver branch yet. As of the time of this writing, the author is rewriting parts of the driver to be more modular and to better adapt them to future needs. At the same time, he is integrating the extensions presented in 6.1 piece by piece. The latest SVN version of the UVC driver does not yet contain the sysfs interface, but it will be added as soon as the completely rewritten control management is finished. Therefore, for the time being, users who want to try out the webcam framework in its entirety, in particular the functions that require raw access to the extension units, need to use the version distributed as part of the framework.
Another aspect of the current rewrite is the consolidation of some internal structures, notably the combination of the uvc_terminal and uvc_unit structs. This will simplify large parts of the control code because both entity types can contain controls. The version distributed with the framework does not properly support controls on the camera terminal. This only affects the controls related to exposure parameters and will automatically be fixed during the merge back.
Still image support
The UVC specification includes features to retrieve still images from the camera. Still images are treated differently from streaming video in that they do not have to be delivered in real time, which gives the camera time to apply image quality enhancing algorithms and techniques. At the moment, the Linux UVC driver does not support this method at all. This is hardly a limitation right now because current applications are simply not prepared for such a special mode. All single frame capture applications that currently exist open a video stream and then process single frames only, something that obviously works perfectly fine with the UVC driver.
In the future one could, however, think of some interesting features like the ability to read still images directly from /dev/videoX after setting a few parameters in sysfs. This would allow frame capturing with simple command line tools or amazingly simple scripts. Imagine the following, for example:

dd if=/dev/video0 of=capture.jpg

It would be fairly simple to extend the driver to support such a feature, but the priorities are clearly elsewhere at the moment.
7.5.2 Linux webcam framework

Missing event support
The fact that libwebcam currently lacks support for events, even though the interface for them is already there, was mentioned above. To give the reader an idea of what the future holds, let us look at the list of events that libwebcam and libwebcampanel could support:
• Device discovered/unplugged
• Control value changed automatically (e.g. for UVC AutoUpdate controls)
• Control value changed by client (to synchronize multiple clients of libwebcam)
• Control value change completed (for asynchronous controls)
• Other, driver-specific events
• Feature control value changed (libwebcampanel only)
• Events specific to liblumvp feature plugins (libwebcampanel only)
Again, the events supported by libwebcampanel will be a superset of those known to libwebcam, in a manner analogous to controls.
Single stream per device

The entire framework is laid out to work with only a single video stream at a time. This means that it is impossible to multiplex the stream, for example with the help of the GStreamer tee element, and control the feature plugins separately for both substreams. This design decision was made for simple practicality; the additional work required would hardly justify the benefits. For most conceivable applications this is not a limitation, though. There are no applications today that provide multiple video windows per camera at the same time, and the possible use cases seem restricted to development and debugging purposes. There is another reason why it is unlikely that such applications will appear in the near future: the XVideo extension used on Linux to accelerate video rendering can only be used by one stream at a time, so any additional streams would have to be rendered using unaccelerated methods. In GStreamer terms this means that the slower ximagesink would have to be used instead of xvimagesink, which is the default in LVGstCap.
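To illustrate the kind of multiplexing the framework deliberately does not cater to, the following GStreamer sketch duplicates a camera stream with the tee element; only one branch can use the XVideo-accelerated xvimagesink, the other falls back to the plain ximagesink. The exact pipeline, including the ffmpegcolorspace converters, is an assumption for illustration only, and it bypasses lvfilter and the feature plugins entirely.

/* Sketch: duplicating a single camera stream with GStreamer's tee element. */
#include <gst/gst.h>

int main(int argc, char *argv[])
{
    GstElement *pipeline;
    GError *error = NULL;

    gst_init(&argc, &argv);

    /* One branch gets the accelerated sink, the duplicate branch has to
     * use the plain, unaccelerated ximagesink. */
    pipeline = gst_parse_launch(
        "v4l2src ! ffmpegcolorspace ! tee name=split "
        "split. ! queue ! xvimagesink "
        "split. ! queue ! ffmpegcolorspace ! ximagesink",
        &error);
    if (pipeline == NULL) {
        g_printerr("Failed to build pipeline: %s\n", error->message);
        return 1;
    }

    gst_element_set_state(pipeline, GST_STATE_PLAYING);
    g_main_loop_run(g_main_loop_new(NULL, FALSE));
    return 0;
}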
7.6 Outlook
Providing an outlook on the further development of the Linux webcam framework is not easy at this moment, given that the framework has not been published yet and has therefore received very little feedback. There are, however, a few signs that there is considerable demand for Linux webcam software as well as for related information. For one thing, the requests and responses that come up on the Linux UVC mailing list clearly show that the current software has deficits. A classic example is the fact that many programs still do not support V4L2 but are based on the deprecated V4L1 interface. Even V4L2 applications still use API calls that are not suitable for digital devices, clearly showing their origins in the world of TV cards. For another thing, the demand for detailed and reliable information is quite large. Linux users who want to use webcams have a number of information-related problems to overcome. Typical questions that arise are:
• What camera should I buy so that it works on Linux?
• I have camera X. Does it work on Linux?
• Which driver do I need? Where do I download it?
• How do I compile and install the driver? How can I verify that it works properly?
• What applications are there? What can they do?
• What camera features are supported? What would it take to fix this?
None of these questions is easy to answer. Even though the information is present somewhere on the web, it is usually not easy to find because there
is no single point to start from. Many sites are incomplete or feature outdated information, making the search even harder. Providing software is thus not the only task on the to-do list of Linux webcam developers. More and better information is required, an area in which Logitech is taking the initiative. Together with the webcam framework, Logitech will publish a website that is intended to become such an information portal. At the end of this chapter we give more details about that project.

In terms of software, the Linux webcam framework certainly has the potential to spur the development of new and great webcam applications as well as to give new and improved tools to existing ones. Our hope is that, on the one hand, broader use of the framework will bring forth further needs that can be satisfied by future versions and, on the other hand, that the project will provide impetus for improving the existing components. The Linux UVC driver is one such component that is rapidly improving. As we have seen during the discussion of limitations above, new versions will create the need for libwebcam extensions. But libwebcam is not the only component that will see further improvements. Logitech will add more feature plugins to liblumvp as the framework gains momentum, the most prominent one being an algorithm for face tracking. Compared to the current motion tracker algorithm, it performs much better when there is only a single person visible in the picture.
7.7 Licensing
The licensing of open source software is a complex topic, especially when combined with closed source components. There are hundreds of different open source licenses, and many projects choose to use their own adapted license, further complicating the situation.
7.7.1 Libraries

One key point that constrains the licensing of a project is the set of licenses used by the underlying components. In our case, this situation is quite simple. The only closed source component of our framework, liblumvp, uses GStreamer, which is in turn developed under the LGPL. The LGPL is considered one of the most appropriate licenses for libraries because it allows both open and closed source components to link against the library. Such a licensing scheme considerably increases the number of potential users because developers of closed source applications do not need to reinvent the wheel, but can instead rely on libraries proven to be stable. For this reason libwebcam and libwebcampanel are also released under the LGPL, enabling any application to link against them and use their features. The same reasoning applies to the lvfilter GStreamer plugin.

The only closed source component of the webcam framework is the liblumvp library. Some of the feature plugins contain code that Logitech has
licensed from third parties under conditions that disallow its distribution in source code form. While liblumvp is free of charge, it is covered by an end-user license agreement very similar to the one used for Logitech's Windows applications. One question keeps coming up in Internet forums when closed source components are discussed: "Why doesn't the company want to publish the source code?" The answer is usually not that companies do not want to, but that they cannot for legal reasons. Hardware manufacturers often buy software modules from specialized companies, and these licenses do not allow the source to be made public.
7.7.2 Applications

All non-library code, in particular LVGstCap and lvcmdpanel, is licensed under version 2 of the GNU GPL. This allows anybody to make changes to the code and publish new versions as long as the modified source code is also made available. Table 7.1 gives an overview of the licenses used for the different components of this project. The complete text of the GPL and LGPL licenses can be found in [7] and [8].

Component        License
libwebcam        LGPL
libwebcampanel   LGPL
lvfilter         LGPL
liblumvp         Closed source
LVGstCap         GPL
lvcmdpanel       GPL
Samples          Public domain
Table 7.1: Overview of the licenses used for the Linux webcam framework components.
7.8 Distribution
Making the webcam framework public, getting people to use and test it, and collecting their feedback will be an important task in the upcoming months. Logitech is currently setting up a web server that is expected to go online in the last quarter of 2006 and will contain the following:
• List of drivers: Overview of the different webcam drivers available for Logitech cameras.
• Compatibility information: Which devices work with which drivers?
• FAQ: Answers to questions that frequently come up in the context of webcams.
• Downloads: All components of the Linux webcam framework (including sources, except for liblumvp).
• Forum: Possibility for users to discuss problems with each other and to ask Logitech developers questions.
The address will be announced through the appropriate channels, for example on the mailing list of the Linux UVC driver.
Chapter 8
The new webcam infrastructure at work

After the technical details it is now time to see the webcam framework in action, or at least static snapshots of that action. The user only has direct contact with the video capture application LVGstCap and the panel application lvcmdpanel. The work of the remaining components is, however, still visible, especially in the case of lvcmdpanel, whose interface is very close to libwebcampanel's.
8.1 LVGstCap
Figure 8.1 shows a screenshot of LVGstCap with its separation into a video area and a control area. The video window on the left displays the current picture streaming from the webcam, while the right-hand side contains both camera and feature controls in separate tabs. The Camera tab allows the user to change settings directly related to the image and the camera itself. All control elements are dynamically generated from the information that libwebcampanel provides. The Features tab gives control over the plugins that liblumvp contains. Currently it allows flipping the image about the horizontal and vertical axes and enabling or disabling the face tracker.

Figure 8.1: A screenshot of LVGstCap with the format choice menu open.
8.2 lvcmdpanel
The following console transcript shows an example of how lvcmdpanel can be used.

$ lvcmdpanel -l
Listing available devices:
video0    Unknown Logitech camera (0x08cc)
video1    Logitech QuickCam Fusion
There are two devices in the system; one was recognized, the other one was detected as an unknown Logitech device and its USB PID is displayed instead.

$ lvcmdpanel -d video1 -c
Listing available controls for device video1:
  Power Line Frequency
  Backlight Compensation
  Gamma
  Contrast
  Brightness

$ lvcmdpanel -d video1 -cv
Listing available controls for device video1:
  Power Line Frequency
    ID      : 13,
    Type    : Choice,
    Flags   : { CAN_READ, CAN_WRITE, IS_CUSTOM },
    Values  : { 'Disabled'[0], '50 Hz'[1], '60 Hz'[2] },
    Default : 2
  Backlight Compensation
    ID      : 12,
    Type    : Dword,
    Flags   : { CAN_READ, CAN_WRITE, IS_CUSTOM },
    Values  : [ 0 .. 2, step size: 1 ],
    Default : 1
  Gamma
    ID      : 6,
    Type    : Dword,
    Flags   : { CAN_READ, CAN_WRITE },
    Values  : [ 100 .. 220, step size: 120 ],
    Default : 220
  Contrast
    ID      : 2,
    Type    : Dword,
    Flags   : { CAN_READ, CAN_WRITE },
    Values  : [ 0 .. 255, step size: 1 ],
    Default : 32
  Brightness
    ID      : 1,
    Type    : Dword,
    Flags   : { CAN_READ, CAN_WRITE },
    Values  : [ 0 .. 255, step size: 1 ],
    Default : 127
The -c command line switch outputs a list of controls supported by the specified video device, in this case the second one. For the second list the verbose switch was enabled, which yields detailed information about the type of control, the accepted and default values, etc. (Note that the output was slightly shortened by leaving out a number of less interesting controls.)

The final part of the transcript is easiest to follow by first starting an instance of luvcview in the background. The commands below change the brightness of the image while luvcview, or any other video application, is running.

$ lvcmdpanel -d video1 -g brightness
127
$ lvcmdpanel -d video1 -s brightness 255
$ lvcmdpanel -d video1 -g brightness
255
The current brightness value is 127 as printed by the first command. The second command changes the brightness value to the maximum of 255 and the third one shows that the value was in fact changed. The last example shows how simple it is to create scripts to automate tasks with the help of panel applications. Even writing an actual panel application is very straightforward; lvcmdpanel consists of less than 400 lines of code and already covers the basic functionality.
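For comparison, the following sketch shows what such a brightness change amounts to at the V4L2 level, below libwebcam and libwebcampanel. It assumes that the camera is reachable as /dev/video1 and that the driver maps the control to the standard V4L2_CID_BRIGHTNESS identifier; a real panel application would of course go through libwebcampanel instead.

/* Sketch: reading and setting the brightness control directly through
 * the V4L2 API, i.e. the layer underneath the panel libraries. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>

int main(void)
{
    int fd = open("/dev/video1", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Read the current brightness value ... */
    struct v4l2_control ctrl;
    memset(&ctrl, 0, sizeof(ctrl));
    ctrl.id = V4L2_CID_BRIGHTNESS;
    if (ioctl(fd, VIDIOC_G_CTRL, &ctrl) == 0)
        printf("Current brightness: %d\n", ctrl.value);

    /* ... and set it to the maximum reported by lvcmdpanel (255). */
    ctrl.value = 255;
    if (ioctl(fd, VIDIOC_S_CTRL, &ctrl) < 0)
        perror("VIDIOC_S_CTRL");

    close(fd);
    return 0;
}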
Chapter 9
Conclusion

Jumping into work in the open source community with the backing of a company is truly gratifying. The expression "it's the little things that count" immediately comes to mind, and the positive reactions one receives, even for small favors, are a great motivation along the way. Having been on the user side of hardware and software products for many years myself, I know how helpful the little insider tips can be. Until recently most companies were unaware of the fact that small pieces of information that seem obvious inside a product team can have a much higher value when carried outside. The success of modern media like Internet forums with employee participation and corporate blogs is a clear sign of this. Open source is in some ways similar to these media. Simple information that is given out comes back in the form of improved product support, drivers written from scratch, and, last but not least, reputation.

The Logitech video team has had such a relationship with the open source community for a while, although in a rather low-profile manner that led to little public awareness. This is the first time that we have actively participated, and while it remains to be seen what the influence of the project will be, the little feedback we have received so far makes us confident that the project is a success and will not end here.

As far as the author's personal experience is concerned, the vast majority of it was positive. I was in contact with project mailing lists, developers, and ordinary users of open source software without a strong programming background. Of these three, the last two are certainly the easiest to work with. Developers are grateful for feedback, test results, suggestions, and patches, whereas users appreciate help with questions to which the answers are not necessarily obvious. Mailing lists are a category of their own. While many fruitful discussions are held, some of them reminded me of modern politics. What makes democracy a successful process is the fact that everybody has their say and everybody is encouraged to speak up, something that holds true for mailing lists as well. Unfortunately, the good and bad sides go hand in hand, and so mailing lists inherit
the dangers of slow decision making and standstill. Many discussions fail to reach a conclusion and silently dissolve, much to the frustration of the person who brought up the topic. If open source developers need to learn one thing, it is to see their users as customers and treat them as such. The pragmatic solution often beats the technically more elegant one in terms of utility, a fact that every developer must learn to live with.

The future will show whether we are able to reach our long-term goal: a webcam experience for Linux users that catches up with what Windows offers today. The Linux platform has undoubtedly become competitive, but in order not to lose its momentum it must focus on its weaknesses, and multimedia is clearly one of them. The components are there for the most part, but they need to be consistently improved to make sure that they work together more closely. There are high hopes for KDE 4 and its multimedia architecture, and camera support will definitely have its place in it. The moment when Linux users can plug in their webcam, start their favorite instant messenger, and hold a video conference that takes advantage of all the camera's features is within grasp, an opportunity not to be missed.
Appendix A
List of Logitech webcam USB PIDs

This appendix contains a list of webcams manufactured by Logitech, their USB identifiers, and the name of the driver they are reported or tested to work with. We use the following abbreviated driver names in the table:

Key         Driver
pwc         Philips USB Webcam driver (see 4.3.1)
qcexpress   QuickCam Express driver (see 4.3.4)
quickcam    QuickCam Messenger & Communicate driver (see 4.3.3)
spca5xx     Spca5xx Webcam driver (see 4.3.2)
uvcvideo    Linux USB Video Class driver (see 4.3.5)
The table below contains the following information:
1. The USB product ID as reported, for example, by lsusb. Note that the vendor ID is always 0x046D.
2. The ASIC that the camera is based on.
3. The name under which the product was released.
4. The driver by which the camera is supported. An asterisk means that support for the given camera is untested but that the camera is likely to work with the driver given its ASIC; the driver may need patching in order to recognize the given PID. A dash means that the camera is not currently supported.
PID    ASIC       Product name                             Driver
0840   ST600      Logitech QuickCam Express                qcexpress
0850   ST610      Logitech QuickCam Web                    qcexpress
0870   ST602      Logitech QuickCam Express                qcexpress
                  Logitech QuickCam for Notebooks
                  Labtec WebCam
0892   VC321      Acer OrbiCam                             –
0896   VC321      Acer OrbiCam                             –
08A0   VC301      Logitech QuickCam IM                     spca5xx
08A2   VC302      Labtec Webcam Plus                       spca5xx
08A4   VC301      Logitech QuickCam IM                     spca5xx (*)
08A7   VC302      Logitech QuickCam Image                  spca5xx (*)
08A9   VC302      Logitech QuickCam for Notebooks Deluxe   spca5xx
08AA   VC302      Labtec Notebook Pro                      spca5xx
08AC   VC301      Logitech QuickCam IM                     spca5xx (*)
08AD   VC302      Logitech QuickCam Communicate STX        spca5xx
08AE   VC302      Logitech QuickCam for Notebooks          spca5xx
08B0   SAA8116    Logitech QuickCam Pro                    pwc
                  Logitech QuickCam Pro 3000
08B1   SAA8116    Logitech QuickCam Pro for Notebooks      pwc
08B2   SAA8116    Logitech QuickCam Pro 4000               pwc
08B3   SAA8116    Logitech QuickCam Zoom                   pwc
08B4   SAA8116    Logitech QuickCam Zoom                   pwc
08B5   SAA8116    Logitech QuickCam Orbit                  pwc
                  Logitech QuickCam Sphere
08B6   SAA8116    Cisco VT Camera                          pwc
08B7   SAA8116    Logitech ViewPort AV100                  pwc
08BD   SAA8116    Logitech QuickCam Pro 4000               pwc
08BE   SAA8116    Logitech QuickCam Zoom                   pwc
08C1   SPCA525    Logitech QuickCam Fusion                 uvcvideo
08C2   SPCA525    Logitech QuickCam Orbit MP               uvcvideo
                  Logitech QuickCam Sphere MP
08C3   SPCA525    Logitech QuickCam for Notebooks Pro      uvcvideo
08C5   SPCA525    Logitech QuickCam Pro 5000               uvcvideo
08C6   SPCA525    QuickCam for Dell Notebooks              uvcvideo
08C7   SPCA525    Cisco VT Camera II                       uvcvideo
08D9   VC302      Logitech QuickCam IM                     spca5xx
                  Logitech QuickCam Connect
08DA   VC302      Logitech QuickCam Messenger              spca5xx
08F0   ST6422     Logitech QuickCam Messenger              quickcam
08F1   ST6422     Logitech QuickCam Express                quickcam (*)
08F4   ST6422     Labtec WebCam                            quickcam (*)
08F5   ST6422     Logitech QuickCam Communicate            quickcam
08F6   ST6422     Logitech QuickCam Communicate            quickcam
0920   ICM532     Logitech QuickCam Express                spca5xx
0921   ICM532     Labtec WebCam                            spca5xx
0922   ICM532     Logitech QuickCam Live                   spca5xx (*)
0928   SPCA561B   Logitech QuickCam Express                spca5xx
0929   SPCA561B   Labtec WebCam                            spca5xx
092A   SPCA561B   Logitech QuickCam for Notebooks          spca5xx
092B   SPCA561B   Labtec WebCam Plus                       spca5xx
092C   SPCA561B   Logitech QuickCam Chat                   spca5xx
092D   SPCA561B   Logitech QuickCam Express                spca5xx (*)
092E   SPCA561B   Logitech QuickCam Chat                   spca5xx (*)
092F   SPCA561B   Logitech QuickCam Express                spca5xx
09C0   SPCA525    QuickCam for Dell Notebooks              uvcvideo
Bibliography

[1] Jonathan Corbet. Linux loses the Philips webcam driver. LWN, 2004. URL http://lwn.net/Articles/99615/.
[2] Jonathan Corbet. Some upcoming sysfs enhancements. LWN, 2006. URL http://lwn.net/Articles/174660/.
[3] Creative. Creative Open Source: Webcam support. URL http://opensource.creative.com/.
[4] Bill Dirks. Video for Linux Two - Driver Writer's Guide, 1999. URL http://www.thedirks.org/v4l2/v4l2dwg.htm.
[5] Bill Dirks, Michael H. Schimek, and Hans Verkuil. Video for Linux Two API Specification, 1999-2006. URL http://www.linuxtv.org/downloads/video4linux/API/V4L2_API/.
[6] USB Implementers Forum. Universal Serial Bus Device Class Definition for Video Devices. Revision 1.1 edition, 2005. URL http://www.usb.org/developers/devclass_docs.
[7] Free Software Foundation. GNU General Public License, 1991. URL http://www.gnu.org/copyleft/gpl.html.
[8] Free Software Foundation. GNU Lesser General Public License, 1999. URL http://www.gnu.org/copyleft/lesser.html.
[9] Philip Heron. fswebcam, 2006. URL http://www.firestorm.cx/fswebcam/.
[10] Mike Isely. pvrusb2 driver, 2006. URL http://www.isely.net/pvrusb2/pvrusb2.html.
[11] Greg Jones and Jens Knutson. Camorama, 2005. URL http://camorama.fixedgear.org/.
[12] Avery Lee. Capture timing and capture sync, 2005. URL http://www.virtualdub.org/blog/pivot/entry.php?id=78.