
1 Introduction

In 2014, the World Health Organization estimated the number of visually impaired individuals worldwide at approximately 285 million [1]. Many of them are trained by sighted supporters to move along paths, for example, from their houses to their offices. During such training, they are often taught information related to important locations (called spots) on the paths. For example, a visually impaired individual may be taught that there is a restroom just outside a ticket gate at a station. If the individual remembers the information about the restroom, he or she can use it later; otherwise, he or she cannot. The individual's independence is thus strongly affected by whether he or she remembers such information. It is therefore necessary to build an assistive system that helps visually impaired individuals remember information related to the spots they visit.

There are several systems that help visually impaired individuals obtain information about their everyday environments. The Digital Sign system [3] and the NAVI system [4] determine the current positions of visually impaired individuals by using passive and AR markers, respectively; for everyday use, however, these systems require markers to be deployed on a large scale in the infrastructure. A navigation system [2] determines the current position of a visually impaired individual by GPS and then guides the individual along a predefined route, but such GPS-based systems cannot be used in, for example, reinforced concrete buildings. Sekai Camera [5] is an AR application for mobile phones: digital information, called an Air Tag, can be virtually attached to the real world, and a user can obtain local information from the Air Tag. However, the main targets of this system are sighted people, so it is difficult for visually impaired individuals to use. e.Typist Mobile [6] converts characters in the environment to speech. Tap Tap See [7] and LookTel Recognizer [8] help visually impaired individuals identify objects.

In this paper, we propose the concept of spot navigation to help visually impaired individuals remember information related to spots they visit. The concept is implemented as application software on a mobile system and applied to actual indoor and outdoor scenes.

2 Concept of Spot Navigation

Spot navigation is a framework that assists visually impaired individuals in recalling memories related to spots that they often visit. In this framework, a visually impaired individual first visits several spots with a sighted supporter and records the position data and voice memos of the spots on a mobile system. Later, when the visually impaired individual visits one of the recorded spots, the mobile system determines the spot position and plays the voice memo corresponding to that spot. The visually impaired individual can thus obtain the spot information by listening to the voice memo.
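
To make the framework concrete, the following is a minimal sketch of the per-spot record such a system might keep. The class name, field names, and types are illustrative assumptions, not the authors' data schema.

```python
from dataclasses import dataclass, field

@dataclass
class SpotRecord:
    """One entry in the spot dictionary (hypothetical schema)."""
    position: str                                      # position data recorded at the spot
    image_paths: list = field(default_factory=list)    # scene images from several perspectives
    voice_memo_path: str = ""                          # recorded supplemental information
```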

3 Implementation of a Spot Navigation System

3.1 Spot Navigation System

The spot navigation concept is implemented as application software on an Android smartphone system (Google Nexus 4 [9]) and on our Kinect cane system [10–12]. The application software has the following two modes:

1. Registration mode:

A visually impaired individual visits each spot with his or her sighted supporter. The individual takes scene images from several perspectives with a camera on the mobile system and records a voice memo containing supplemental information related to the spot. The system registers the images and the voice memo in a dictionary; the images serve as keys in the dictionary for determining spot positions.

2. Spot navigation mode:

When the visually impaired individual visits one of the recorded spots, he or she takes a scene image and inputs it as a query to the system. The system determines the current spot from the results of image matching between the query image and the dictionary images, and plays the voice memo corresponding to the matched dictionary image, as sketched below.
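
The two modes can be summarized in a short Python sketch. This is a minimal illustration assuming an in-memory dictionary; `register_spot`, `navigate`, and the `match_images()` predicate (which stands in for the SIFT-based matching of Sect. 3.2) are hypothetical names, not the authors' API.

```python
spot_dictionary = []  # list of (scene_images, voice_memo) entries

def register_spot(scene_images, voice_memo):
    """Registration mode: store scene images and a voice memo for one spot."""
    spot_dictionary.append((scene_images, voice_memo))

def navigate(query_image, match_images):
    """Spot navigation mode: return the memo of the matching spot, if any."""
    for scene_images, voice_memo in spot_dictionary:
        if any(match_images(query_image, d) for d in scene_images):
            return voice_memo  # the system plays this memo to the user
    return None  # "no correspondence": the spot is not in the dictionary
```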


3.2 Image Matching Based on SIFT

The Scale Invariant Feature Transform (SIFT) [13] extracts pixels with distinctive features, called key points, each of which is described by a 128-dimensional feature vector. The feature vectors are invariant to changes in scale, rotation, and illumination.
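
For illustration, key points and their 128-dimensional descriptors can be extracted as follows. The paper does not name its SIFT implementation, so this OpenCV-based sketch is an assumption.

```python
import cv2  # OpenCV with SIFT support (opencv-python >= 4.4)

def extract_keypoints(image_path):
    """Extract SIFT key points and their 128-dimensional descriptors."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    # keypoints: locations with scale and orientation; descriptors: N x 128 float array
    keypoints, descriptors = sift.detectAndCompute(img, None)
    return keypoints, descriptors
```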

Let \(k^q\) and \(k^d\) denote key points in a query image q and a dictionary image d, respectively, and let \(v^q_i\) and \(v^d_i\) denote the i-th feature values of \(k^q\) and \(k^d\), respectively. The system searches for the key point pair that minimizes the following distance:

$$\begin{aligned} \delta (v^{d},v^{q})=\sqrt{\sum _{i=1}^{128}(v^{d}_i - v^{q}_i)^2}. \end{aligned}$$
(1)
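
As a sketch, Eq. (1) can be evaluated for all pairs at once with NumPy, pairing each query key point with its nearest dictionary key point; this nearest-neighbour-per-query pairing policy is our assumption, as the paper does not state it explicitly.

```python
import numpy as np

def match_pairs(desc_d, desc_q):
    """Pair each query key point with its nearest dictionary key point under Eq. (1)."""
    # Pairwise Euclidean distances between all dictionary and query descriptors.
    diff = desc_d[:, None, :] - desc_q[None, :, :]   # shape (Nd, Nq, 128)
    dist = np.sqrt((diff ** 2).sum(axis=2))          # delta(v^d, v^q) for every pair
    nearest = dist.argmin(axis=0)                    # closest dictionary point per query point
    return [(int(i_d), i_q) for i_q, i_d in enumerate(nearest)]
```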

Figure 1 shows a matching result, in which the lines represent the key point pairs.

Fig. 1. Example of key point pairs in a query image and a dictionary image at an indoor spot.

The system evaluates the following six criteria, proposed by Kameda et al. [14, 15], which are based on geometric relations between the key point pairs:

1. too few pairs,
2. size consistency,
3. direction consistency,
4. 2D affine constraint,
5. area size, and
6. axis inversion.

If all the criteria are satisfied, the query image is determined to correspond to the dictionary image; a minimal sketch of this screening follows.
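
In this hedged sketch, the error measures E_size, E_dir, E_affine, and E_area and the axis-inversion test follow Kameda et al. [14, 15] and are abstracted as precomputed inputs; the thresholds are the values reported in Sect. 4.1, and whether each is an upper or lower bound is inferred from the example values in Sect. 4.1.

```python
# Thresholds from Sect. 4.1; t_dir is presumably in degrees.
K0, T_SIZE, T_DIR, T_AFFINE, T_AREA = 10, 0.35, 45, 12, 15

def is_same_spot(pairs, e_size, e_dir, e_affine, e_area, axis_inverted):
    """Return True only if all six criteria of [14, 15] are satisfied."""
    if len(pairs) < K0:        # 1. too few pairs
        return False
    if e_size > T_SIZE:        # 2. size consistency (upper bound, per the Sect. 4.1 example)
        return False
    if e_dir > T_DIR:          # 3. direction consistency (upper bound)
        return False
    if e_affine > T_AFFINE:    # 4. 2D affine constraint (upper bound)
        return False
    if e_area < T_AREA:        # 5. area size (lower bound: E_area = 38.8 satisfies t_area = 15)
        return False
    if axis_inverted:          # 6. axis inversion
        return False
    return True
```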

4 Experimental Results

4.1 Image Matching Test 1

Conditions: 22 indoor spots and 22 outdoor spots were selected for the experiment, and eight images were taken at each spot. The resolution of the images was \(144 \times 192\) pixels. A two-fold cross-validation test was employed with the following parameters used in the image matching: \(K_0=10\), \(t_{size}=0.35\), \(t_{dir}=45\), \(t_{affine}=12\), and \(t_{area}=15\).
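
In a two-fold protocol such as this, the eight images per spot are split into two halves that alternately serve as dictionary and query sets. The sketch below assumes a simple first-half/second-half split, which the paper does not specify.

```python
def two_fold_splits(images_per_spot):
    """Yield (dictionary_images, query_images) for the two folds of one spot."""
    half = len(images_per_spot) // 2         # 4 of the 8 images per spot
    first, second = images_per_spot[:half], images_per_spot[half:]
    yield first, second                      # fold 1: first half is the dictionary
    yield second, first                      # fold 2: the halves are swapped
```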

Results: Table 1 lists the accuracy of the image matching. Correct detection represents the situation where the system finds the dictionary image that corresponds to a query image. False detection represents the situation where the system mistakenly selects a dictionary image that does not correspond to the query image. No correspondence represents the situation where the system determines that the dictionary does not include any image corresponding to the query image. In this test, no correspondence is a failure of the image matching, because the dictionary includes at least one image corresponding to every query image.
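
These three outcomes can be stated as a small classification rule; this is a sketch assuming each image carries a known spot label.

```python
def classify_outcome(matched_spot, true_spot):
    """Map one matching result to the categories of Table 1."""
    if matched_spot is None:
        return "no correspondence"   # a failure in Test 1, a success in Test 2
    if matched_spot == true_spot:
        return "correct detection"
    return "false detection"
```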

In this result, 317 query images were matched successfully, but 35 query images could not be matched. Figure 2(a), (b), and (c) show matching results, where the right and left images are the dictionary and query images, respectively. Figure 2(a) shows images taken at the same indoor spot; 25 key point pairs were obtained correctly from the same objects. All the criteria were satisfied (i.e., \(K=25\), \(E_{size}=0.05\), \(E_{dir}=2.4\), \(E_{affine}=5.5\), and \(E_{area}=38.8\)), and thus the system successfully determined the spot. Figure 2(b) shows images taken at different spots; no key point pairs were obtained from the images. Because the number of key point pairs was smaller than the threshold \(K_0\), the system successfully determined that they were different spots. Figure 2(c) shows images taken at different spots; 19 key point pairs were obtained, but the size consistency, the direction consistency, and the axis inversion were out of the permissible ranges. Therefore, the images were successfully determined to show different spots.

Table 1. Matching accuracy in Test 1.

Fig. 2. Matching results in Test 1.

Table 2. Matching accuracy in Test 2.

Fig. 3. Matching results in Test 2.

Fig. 4. Registration and spot navigation modes in the user study.

4.2 Image Matching Test 2

Conditions: We verified whether the system correctly returns no correspondence when the dictionary does not include any image corresponding to a query image. Twenty-eight images were taken at other spots (14 indoor spots and 14 outdoor spots), and the image matching was performed using these 28 images as queries against the dictionary images.

Results: Table 2 lists the matching results. In this test, no correspondence means that the system successfully indicates that there are no dictionary images corresponding to a query image. All the query images were correctly determined to have no correspondence.

Figure 3(a) and (b) show matching results. Figure 3(a) shows images taken at different outdoor spots; the system successfully determined that they were different spots because the size consistency and the direction consistency were out of the permissible ranges. Figure 3(b) shows images taken at an indoor spot and an outdoor spot; these images yielded only two key point pairs. Because the number of pairs was smaller than the threshold \(K_0\), these spots were successfully determined to be different.

4.3 User Study

We conducted a user study in which a blindfolded subject used an Android smartphone system on which the spot navigation method was implemented. In Fig. 4(a), a person with a white cane played the role of a visually impaired individual, and a person in a white T-shirt played the role of a supporter. They were at the entrance of a building. The supporter set up the smartphone, took an image of the scene, and input a voice memo: "Here is an entrance. There is a direction board." In Fig. 4(b), they were in front of a multi-purpose room, and the supporter input a voice memo: "Here is a multi-purpose room. There is a kitchen inside." In Fig. 4(c), they were in front of our laboratory; in this case, the visually impaired individual input a voice memo, "Here is our laboratory," on the advice of the supporter. Figure 4(d) shows the situation where the visually impaired individual came to the multi-purpose room by himself. He took a query image, and the smartphone system played the correct voice memo. By hearing the voice memo, he could remember the kitchen in the multi-purpose room.

5 Conclusion

In this report, we have proposed a spot navigation system that assists visually impaired individuals in recalling memories related to spots that they often visit. The system identifies spots by SIFT-based image matching and gives a visually impaired individual supplemental information related to the spots through voice memos. The experimental results indicate that the proposed system is promising for helping visually impaired individuals.

One of our future works is to improve the accuracy of the image matching by using PCA-SIFT [16], BSIFT [17], or CSIFT [18].