SmartVision: active vision for the blind
financed by

FCT/MCTES - PTDC/EIA/73633/2006
|
|
Project Objectives | |
|
The goal is to develop a prototype system that consists of a palmtop or laptop with GPS and WiFi link to a server, plus a stereo camera. This system must be able to detect outlines of sidewalks and zebra crossings within 5 m, obstacles bigger than 20 cm at a distance of 2 m, and typical household items in a pantry at 0.5 m. Response time must be less than 2 s. Specific objectives are: (1) to integrate all hard- and software in a backpack with external battery, plus chest-mount equipped with stereo camera, microphone and speakers; (2) to develop a simple voice control and query interface with auditory feedback; (3) to implement and further develop computer and human vision algorithms for 3D object detection, categorisation and recognition, including position and distance estimation; (4) to do psychophysical and fMRI studies for optimising algorithms; and (5) to benchmark computer and human vision algorithms. More objectives are specified in the tasks.
|
||
| Task 1 - System integration and prototype development | ||
|
Task 1 concerns the hard- and software development and integration for the prototype. The hardware will consist of a small palmtop or laptop with GPS and WiFi to maintain a network link to a server (depending on the requirements, the server can be a small cluster with standard MPI communications). The palmtop is also connected to two USB or WiFi cameras, and comes equipped as standard with microphone input and speaker outputs. In order to increase autonomy, the palmtop will have external batteries in a backpack. The cameras, microphone and speakers will be fixed to a chest-mount at a height of about 1.3 m (previous experience showed that blind persons do not like to wear a helmet and that they must be able to rely on binaural information at all times, hence earphones cannot be used and the volume setting of the speakers needs careful adjustment). The palmtop will be almost completely busy with the audio and video streams, i.e. most processing (GPS and GIS, audio and video) must be done remotely. If a Centrino laptop is used, more processing can be done "on board", but this can only be decided experimentally. Part of Task 1 is the development of the GPS/GIS system and audio interface. Since the prototype needs testing within relatively simple scenarios (see below), the GIS and audio interface will be restricted to those scenarios. Microphone input will be restricted to queries of a few words, like: where, am-I, zebra, corner, post office, ketchup. Audio output will be based on MBROLA's text-to-speech technology with a Portuguese dictionary (the quality is not yet very good, but other groups are working on this). A special audio mode will be applied while "cruising" freely: this concerns the position and heading on sidewalks and zebra crossings, plus obstacles indicated by e.g. bleeping, where the frequency can code the orientation and the volume the distance.
The spectrum can code the type of obstacle, for example special sounds for elongated vertical structures (tree trunk, lamp post) and moving objects (child, dog). During the final year of the project, the prototype system will be tested under different conditions, and feedback will be used to optimise the aid and solve unforeseen problems. Field testing will be done by (a) persons with normal vision, (b) the same persons but blindfolded, and (c) blind persons. The following scenarios will be used: (1) outdoor: one or two streets with artificial and normal obstacles (boxes, trees, children) and fixed landmarks (shops, post office, zebra crossings); (2) indoor: a normal house with a few rooms and a corridor, the rooms not being cluttered by furniture; (3) household items in a pantry: an open cupboard with two or three shelves containing between 20 and 30 items with different positions and views (upright, lying, front or back side visible).
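The "cruising" audio mode described above, where bleep frequency codes orientation and volume codes distance, can be illustrated with a minimal sketch. The 2 m warning range comes from the project objectives; the frequency limits, the field of view, the linear mappings and all names (`Bleep`, `bleep_for_obstacle`) are assumptions made for illustration only, not the prototype's actual design.

```python
from dataclasses import dataclass
from typing import Optional

MAX_RANGE_M = 2.0      # obstacle warning range from the project objectives
MIN_FREQ_HZ = 400.0    # assumed pitch for an obstacle at the far left
MAX_FREQ_HZ = 1600.0   # assumed pitch for an obstacle at the far right
HALF_FOV_DEG = 45.0    # assumed half field of view of the stereo camera


@dataclass
class Bleep:
    frequency_hz: float
    volume: float  # 0.0 (silent) to 1.0 (loudest)


def bleep_for_obstacle(azimuth_deg: float, distance_m: float) -> Optional[Bleep]:
    """Map obstacle direction to pitch and obstacle distance to loudness."""
    if distance_m > MAX_RANGE_M:
        return None  # outside the warning range: stay silent
    # Clamp the azimuth to the field of view and normalise to 0..1 (left..right).
    azimuth = max(-HALF_FOV_DEG, min(HALF_FOV_DEG, azimuth_deg))
    position = (azimuth + HALF_FOV_DEG) / (2.0 * HALF_FOV_DEG)
    frequency = MIN_FREQ_HZ + position * (MAX_FREQ_HZ - MIN_FREQ_HZ)
    # Closer obstacles are louder; volume falls linearly to 0 at MAX_RANGE_M.
    volume = 1.0 - max(0.0, distance_m) / MAX_RANGE_M
    return Bleep(frequency, volume)
```

Under these assumptions, an obstacle straight ahead at 1 m gives a mid-range pitch at half volume, and anything beyond 2 m produces no bleep. Coding the obstacle type in the spectrum would need a further parameter that is not sketched here.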
|
||
|
Task 2 - Computer vision algorithms |
||
|
Task 2 concerns computer vision, but with a strong link to human vision (Task 3) with respect to using specific image features. There are three subtasks related to navigation (2.1), detection of specific objects during navigation (2.2) and detection of household items (2.3). (2.1) Navigation requires detection of position relative to sidewalks (curbs), zebra crossings, steps and stairs, and corridors (indoor). There are two problems to be solved: heading and obstacle avoidance, where curbs and steps are also considered obstacles. Concerning the heading problem, in a first phase relatively simple and fast algorithms from the literature will be employed, i.e. Canny edge detection, straight line detection by the Hough transform, and RANSAC for retrieving lines which satisfy the vanishing point constraint (Se and Brady, 2003). Resulting vanishing points are combined with an estimate of the ground plane as derived from disparity, using the (Canny) edges already available. Although ground plane information allows obstacles like curbs to be detected (Se and Brady, 2002), direct disparity information will be used to improve certainty and to issue warnings. In a second phase, the use of additional features from Task 3 (human vision) will be studied: instead of using only edges at a fine scale, this concerns the multi-scale line/edge and keypoint representations, including advanced models of disparity. (2.2) Blind persons memorise fixed obstacles and need to be informed about specific landmarks like trees, lamp posts and benches. Indoor navigation requires detection of doors and furniture. Again, in a first phase simple and fast algorithms will be employed, i.e. SLAM (simultaneous localisation and mapping; Se et al., 2005) by tracking SIFT features (scale-invariant feature transform; SIFT code is publicly available; Lowe, 2004).
Since 3D object detection is a notoriously difficult problem, where CPU time depends heavily on object/scene complexity, in a second phase the processing will be accelerated by only scrutinising complex image regions, using first a simple saliency map (Itti and Koch, 2000) and then our own, optimised saliency map based on multi-scale keypoints (Task 3). Moving obstacles like children and dogs will be detected by means of a special motion module that detects suspicious movements in ego-motion-corrected successive video frames. (2.3) Detection of specific household items in a pantry will be realised by a selection of algorithms already employed in tasks 2.1 and 2.2, using Canny edges supplemented by edge crossings and colour information. Although this is a difficult task, it is made easier because the stereo camera will not be moving wildly. In view of the good results obtained with cortical models (object categorisations), still to be improved in Task 3, it is likely that task 2.3 will employ the cortical models developed by the group involved in Task 3, such that the groups involved in Task 2 can concentrate on 2.1 and 2.2.
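The RANSAC step of task 2.1, which retrieves the lines satisfying the vanishing point constraint, can be sketched as follows. This is an illustrative sketch only, not the method of Se and Brady: it assumes that straight line segments (for example from a Hough transform on Canny edges) are already available as `(x1, y1, x2, y2)` tuples of non-zero length, and the function names, the iteration count and the pixel tolerance are hypothetical choices.

```python
import math
import random


def cross(u, v):
    """Cross product of two 3-vectors (homogeneous points or lines)."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def line_through(segment):
    """Homogeneous line (a, b, c) with ax + by + c = 0 through a segment."""
    x1, y1, x2, y2 = segment
    return cross((x1, y1, 1.0), (x2, y2, 1.0))


def point_line_distance(point, line):
    """Perpendicular distance in pixels from a point to a homogeneous line."""
    a, b, c = line
    return abs(a * point[0] + b * point[1] + c) / math.hypot(a, b)


def ransac_vanishing_point(segments, iterations=200, tol_px=2.0, seed=0):
    """Return (vanishing_point, inlier_indices) supported by the most lines."""
    rng = random.Random(seed)
    lines = [line_through(s) for s in segments]
    best_point, best_inliers = None, []
    for _ in range(iterations):
        l1, l2 = rng.sample(lines, 2)      # minimal sample: two lines
        x, y, w = cross(l1, l2)            # their intersection point
        if abs(w) < 1e-9:
            continue                       # lines are parallel in the image
        candidate = (x / w, y / w)
        # Lines passing within tol_px of the candidate support it.
        inliers = [i for i, line in enumerate(lines)
                   if point_line_distance(candidate, line) <= tol_px]
        if len(inliers) > len(best_inliers):
            best_point, best_inliers = candidate, inliers
    return best_point, best_inliers
```

The inlier lines are those consistent with a common vanishing point, such as the two curbs of a sidewalk. A fuller version would refit the point to all inliers by least squares, which is omitted here.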
|
||
| Task 3 - Cortical disparity and object recognition | ||
|
Task 3, like task 2, concerns stereo vision and object recognition, but it is based on models of cortical processing, i.e. already implemented and tested models of simple, complex, end-stopped and grating cells, with advanced models of line/edge and keypoint detection, Focus-of-Attention and texture segregation (Rodrigues and du Buf, 2006a,b; du Buf, 2006). The following sub-tasks will be integrated: (3.1) A model of disparity, which is based on detected lines and edges in combination with the linear response in the center of simple cells, must be extended to the multi-scale case with coarse-to-fine-scale processing in order to solve the correspondence problem (which edge in the left image corresponds to one in the right image) and to stabilise accuracy. This way, depth information can be attributed to lines and edges, which results in a sort of 3D wireframe representation of objects and of the entire video frames (the latter for obstacle avoidance). (3.2) The grating cell model can be used to code texture complexity on the basis of the symmetry order (du Buf, 2006). In order to extract shape from texture gradients, new models must be developed which can group on and off responses in neighbouring frequency and orientation channels. The shape information must be complemented with shape-from-shading in the cytochrome-oxidase (CO) blobs (luminance background) and simple cells. (3.3) For object recognition, disparity and shape information, together with colour (CO blobs), must be integrated in the multi-scale line/edge representation in the bottom-up what and where data streams. In parallel, an optimised model of Focus-of-Attention (Task 4) must guide the top-down data streams in testing object templates in memory. (3.4) Each object to be recognised will be represented by a sufficient number of templates (views) in memory. These templates will be tested, going from coarse to fine scales (Bar, 2003), against the visual input. 
This is a dynamic process that goes from detection to categorisation to recognition, linking area V1 to V2 and V4 etc. This linking can resolve small differences in orientation etc., but it cannot resolve significantly different object views. One of the main goals is to study the maximum plasticity of the hierarchical system in order to determine the minimum number of views. With respect to the integration of the algorithms of this task in the prototype system, the modular architecture of the software (see task 2) provides very simple solutions: (1) the depth information extracted by the disparity model (task 3.1), represented by the 3D wireframe model of the entire video frames, is used for obstacle detection. If there is something within a range of 2 m, the closest position (distance and two angles) can be used to modulate the bleeping. (2) Likewise, object detection/recognition (tasks 3.2-3.4) can replace the module developed in task 2.
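Integration step (1) above, finding the closest position (distance and two angles) within 2 m from depth attributed to lines and edges, can be illustrated with a short sketch. It assumes a rectified pinhole stereo pair, for which depth follows from disparity as Z = f B / d; the function name, the parameter names and the input format (a list of edge pixels with their disparities) are hypothetical and do not describe the cortical disparity model itself.

```python
import math


def nearest_obstacle(points, focal_px, baseline_m, cx, cy, max_range_m=2.0):
    """Return (distance_m, azimuth_deg, elevation_deg) of the closest point.

    points: iterable of (u, v, disparity_px) for matched line/edge pixels.
    focal_px, cx, cy: assumed focal length and principal point in pixels.
    baseline_m: assumed distance between the two cameras in metres.
    Returns None if nothing lies within max_range_m.
    """
    best = None
    for u, v, disparity in points:
        if disparity <= 0:
            continue  # no valid match, or a point at infinity
        z = focal_px * baseline_m / disparity   # depth from disparity
        x = (u - cx) * z / focal_px             # metres to the right
        y = (v - cy) * z / focal_px             # metres downwards
        distance = math.sqrt(x * x + y * y + z * z)
        if distance > max_range_m:
            continue
        if best is None or distance < best[0]:
            azimuth = math.degrees(math.atan2(x, z))
            elevation = math.degrees(math.atan2(-y, math.hypot(x, z)))
            best = (distance, azimuth, elevation)
    return best
```

The returned triple is what an audio module would need to modulate the bleeping: the distance for the volume and the angles for the frequency.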
|
||
|
Task 4 - Eye tracking and fMRI/EEG experiments |
||
|
Task 4 is devoted to psychophysics and brain imaging (fMRI/ERP). The reason for this is that advanced models of human vision are going to be developed, and these need to reflect the actual processing: (a) FoA in object and face detection and recognition, and (b) hierarchical processing in the cortex with coarse-to-fine-scale hypothesis testing. The following subtasks are specified: (4.1) Eye-tracking will be used to analyse the neural and behavioural (oculomotor) correlates of feature and region selection. Scan paths and fixation points will be analysed when observers with normal vision are looking at 2D and 3D objects and faces with different characteristics, i.e. depth and colour cues, facial expressions, etc. Eye movements will be recorded during different tasks, i.e. object categorisation and recognition. Likewise, a distinction between face detection and recognition will be considered. The results of these experiments will be used in Task 3, to optimise FoA based on keypoints during detection and to balance the line/edge and keypoint representations during recognition. Apart from using standard databases, like the Psychological Image Collection at Stirling University (UK), this task also requires manipulation of a set of images to remove e.g. colour and/or depth cues. (4.2) Event-related fMRI and EEG experiments may unravel functional connectivity patterns of different brain areas in the dynamic detection, categorisation and recognition process. These interactions are initiated by coarse-scale information (which propagates first from V1 to higher areas) and then refined by the arrival of successively finer-scale information. More recent neuroimaging techniques not only allow the networks involved to be localised; it is also possible to model the dynamical information flow by Structural Equation Modeling and Granger Causality approaches.
Brain imaging/ERP studies will also be used to understand the neural correlates of recognition of 3D objects with canonical and non-canonical views, including position, size and pose invariance. Differences between fast view-based and slow mental rotation-based recognition will be studied by means of cortical chronometry, correlating temporal activation patterns from event-related ERP/fMRI with measured reaction times. Apart from analysing the effects of different facial expressions by using image morphing, this subtask also addresses recognition of objects with different levels of complexity and in/out of context, with and without different colour/shape cues. The results of this subtask will be employed in creating the hierarchical cortical architecture in Task 3.
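The chronometry analysis mentioned above correlates activation timing with measured reaction times. As a minimal sketch, assuming the simplest case in which each trial yields one activation latency and one reaction time, the Pearson correlation coefficient could be computed as below. The function name and the input format are hypothetical, and the actual analysis may well use more elaborate statistics.

```python
import math


def pearson_r(latencies_ms, reaction_times_ms):
    """Pearson correlation between per-trial latencies and reaction times."""
    n = len(latencies_ms)
    if n != len(reaction_times_ms) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 trials")
    mean_x = sum(latencies_ms) / n
    mean_y = sum(reaction_times_ms) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(latencies_ms, reaction_times_ms))
    var_x = sum((x - mean_x) ** 2 for x in latencies_ms)
    var_y = sum((y - mean_y) ** 2 for y in reaction_times_ms)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near +1 would indicate that later activation goes together with slower responses, which is the kind of relation that could separate fast view-based from slow mental rotation-based recognition.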
|
||
| Task 5 - Project coordination | ||
|
This is a relatively small project with four partners
involved, but it is necessary that the coordinator assists all partners and
researchers in assuring data flows, the timing of experiments and tests, the
organisation of project meetings, the organisation and preparation of scientific
papers (co-authorships and contributions), the dissemination to the general public
through "popular" papers etc., and the communication with the FCT, and supports the
administrative project management at CINTAL.
|
||
Updated: 01-10-2009 |