Dante - Di Michelino 150° sponsors

Corporate & Society Sponsors
Loquendo diamond package
Nuance gold package
ATT bronze package
Google silver package
Appen bronze package
Interactive Media bronze package
Microsoft bronze package
SpeechOcean bronze package
Avios logo package
NDI logo package


Université d'Avignon
Speech Cycle
Università di Firenze
Univ. Trento
Univ. Napoli
Univ. Tuscia
Univ. Calabria
Univ. Venezia


Comune di Firenze
Firenze Fiera
Florence Convention Bureau


12th Annual Conference of the
International Speech Communication Association


Interspeech 2011 Florence

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Show & Tell Demonstration - Mobility and Web-services

Time: Tuesday 13:30
Place: Donatello (Room Onice) - Pala Congressi - Ground Floor
Type: Poster
Chair: Mazin Gilbert

#1 Making an automatic speech recognition service freely available on the web

Stuart Nicholas Wrigley (University of Sheffield)
Thomas Hain (University of Sheffield)

The state-of-the-art speech recognition system developed by the AMIDA project, which performed well in the NIST RT'09 evaluation, has been made available as a web service. The service provides free access to ASR aimed specifically at the scientific community. It can be accessed in two ways: via a standard web browser and programmatically via an API.

#2 AT&T VoiceBuilder: A Cloud-based Text-To-Speech Voice Builder Tool

Yeon-Jun Kim (AT&T Labs - Research, Inc.)
Thomas Okken (AT&T Labs - Research, Inc.)
Alistair Conkie (AT&T Labs - Research, Inc.)
Giuseppe Di Fabbrizio (AT&T Labs - Research, Inc.)

The AT&T VoiceBuilder provides a new tool to researchers and practitioners who want to have their voices synthesized by a high-quality, commercial-grade text-to-speech (TTS) system without the need to install, configure, or manage speech processing software and equipment. It is implemented as a web service on the AT&T Speech Mashup Portal. The proposed system records, processes, and validates users' utterances, and provides a web service API to make the new voice immediately available to real-time applications. All procedures are fully automated, so no human intervention is needed.

#3 Extending Audio Notetaker to Browse WebASR Transcriptions

Roger Tucker (Sonocent Ltd, Chepstow, UK)
Dan Fry (Sonocent Ltd, Chepstow, UK)
Vincent Wan (Department of Computer Science, University of Sheffield, UK)
Stuart Wrigley (Department of Computer Science, University of Sheffield, UK)
Thomas Hain (Department of Computer Science, University of Sheffield, UK)

The audio annotation tool Audio Notetaker has been extended to allow browsing of transcripts produced with the WebASR system from Sheffield University. The interface has been designed to be usable with as much as 50% recognition error.

#4 A Web-Based Tool for Developing Multilingual Pronunciation Lexicons

Samantha Ainsley (Department of Computer Science, Columbia University, USA)
Linne Ha (Google Inc., USA)
Martin Jansche (Google Inc., USA)
Ara Kim (Formerly Google Inc., USA)
Masayuki Nanzawa (Google Inc., USA)

We present a web-based tool for generating and editing pronunciation lexicons in multiple languages. The tool is implemented as a web application on Google App Engine and can be accessed remotely from a web browser. The client application displays to users a textual prompt and interface that reconfigures based on language and task. It lets users generate pronunciations via constrained phoneme selection, which allows users with no special training to provide phonemic transcriptions efficiently and accurately.
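
As a rough illustration of constrained phoneme selection, the sketch below accepts a pronunciation only if every symbol belongs to a fixed per-language inventory. The inventory contents, the language code, and the function name `validate_pronunciation` are hypothetical and are not taken from the tool described above.

```python
# Hypothetical, heavily abridged phoneme inventory; a real one would be larger.
INVENTORIES = {
    "en-US": {"k", "ae", "t", "d", "ao", "g", "s", "ih"},
}


def validate_pronunciation(language, phonemes):
    """Return the symbols in `phonemes` that are not in the language's inventory."""
    inventory = INVENTORIES[language]
    return [p for p in phonemes if p not in inventory]


# An empty result means every symbol came from the allowed set.
ok = validate_pronunciation("en-US", ["k", "ae", "t"])
bad = validate_pronunciation("en-US", ["k", "ae", "x"])
```

Restricting input to a closed symbol set is what would let untrained users avoid malformed transcriptions; a selection-based interface enforces the same constraint before submission rather than after.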

#5 Speak4it and the Multimodal Semantic Interpretation System

Michael Johnston (AT&T Labs Research)
Patrick Ehlen (AT&T Labs)

Multimodal interaction allows users to specify commands using combinations of inputs from multiple modalities. For example, in a local search application, a user might say “gas stations” while simultaneously tracing a route on a touchscreen display. In this demonstration, we describe the extension of our cloud-based speech recognition architecture to a Multimodal Semantic Interpretation System (MSIS) that supports processing of multimodal inputs streamed over HTTP. We illustrate the capabilities of the framework using Speak4it, a deployed mobile local search application supporting combined speech and gesture input. We provide interactive demonstrations of Speak4it on the iPhone and iPad and explain the challenges of supporting true multimodal interaction in a deployed mobile service.

#6 TSAB -- Web Interface for Transcribed Speech Collections

Tanel Alumäe (Institute of Cybernetics at Tallinn University of Technology, Estonia)
Ahti Kitsik (Codehoop OU)

This paper describes a new web interface for accessing large transcribed spoken data collections. The system uses automatic or manual time-aligned transcriptions with speaker and topic segmentation information to present structured speech data more efficiently and make accessing relevant speech data quicker. The system is independent of the underlying speech processing technology. The software is free and open-source.
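
To make the idea of time-aligned transcriptions with speaker segmentation concrete, here is a small sketch of one possible in-memory representation, together with a lookup of the segments that overlap a time range. The `Segment` fields and the function name are hypothetical and do not describe TSAB's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds from the beginning of the recording
    end: float
    speaker: str
    text: str


def segments_in_range(segments, t0, t1):
    """Return segments that overlap the half-open interval [t0, t1)."""
    return [s for s in segments if s.start < t1 and s.end > t0]


# Made-up example data for illustration only.
transcript = [
    Segment(0.0, 4.2, "A", "good morning"),
    Segment(4.2, 9.0, "B", "thanks for joining"),
    Segment(9.0, 12.5, "A", "let us begin"),
]
hits = segments_in_range(transcript, 4.0, 9.0)
```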

#7 Visual Voice Mail to Text on the iPhone/iPad

Andrej Ljolje (AT&T Labs - Research)
Vincent Goffin (AT&T Labs - Research)
Diamantino Caseiro (AT&T Labs - Research)
Taniya Mishra (AT&T Labs - Research)
Mazin Gilbert (AT&T Labs - Research)

A visual Voice-Mail-to-Text (VMTT) transcription system takes a conventional voice mail and converts it to formatted text following the standard punctuation, capitalization and presentation conventions. The text can then be used in a plethora of applications, from emails to databases, text messages, etc., which in turn allow searching, classification, data extraction, statistical analyses and other processes. Here we demonstrate the VMTT application by displaying the best scoring hypotheses from various recognition passes, the addition of punctuation and capitalization, formatting by using appropriate conventions for times, dates, dollar amounts and abbreviations, and finally applying grayscaling to lower the impact of the words recognized with low confidence scores.
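
One way to picture confidence-based grayscaling is a linear map from each word's confidence score to a gray level, so that low-confidence words render lighter. The sketch below assumes confidences in [0, 1], a white background, and HTML output; the function names and the linear mapping are illustrative assumptions, not the method used in the demonstrated system.

```python
import html


def gray_level(confidence, lightest=200):
    """Map a confidence in [0, 1] to an 8-bit gray value (0 = black).

    Confidence 1.0 gives black text; confidence 0.0 gives the `lightest` gray.
    """
    c = min(max(confidence, 0.0), 1.0)  # clamp out-of-range scores
    return round((1.0 - c) * lightest)


def render_words(words):
    """Render (word, confidence) pairs as HTML spans with gray text colors."""
    spans = []
    for word, conf in words:
        g = gray_level(conf)
        spans.append(
            f'<span style="color: rgb({g},{g},{g})">{html.escape(word)}</span>'
        )
    return " ".join(spans)


# Made-up words and scores for illustration only.
markup = render_words([("call", 0.95), ("me", 0.9), ("Thursday", 0.4)])
```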

#8 Percy - an HTML5 framework for media rich web experiments on mobile devices

Christoph Draxler (Institute of Phonetics and Speech Processing, LMU Munich)

Percy is a small software framework for perception experiments via the WWW. It is implemented entirely in dynamic HTML and makes use of the new multimedia tags available in HTML5, eliminating the need for browser plug-ins or external players to display media content. With Percy, perception experiments can be run on any platform supporting HTML5, including tablet computers, smartphones or game consoles and thus access new participant populations. Percy supports touch interfaces and measures reaction times. It stores its data in a relational database system on a server. This allows immediate access to the experiment data from statistics packages, spreadsheet programs or via standard database access application programming interfaces. The system has been used for an online experiment on the identification of regional variants by phonetic features in German. Furthermore, the software has been used in a number of experiments in German, Castilian Spanish and English.

#9 The KLAIR toolkit for recording interactive dialogues with a virtual infant

Mark Huckvale (University College London)

The goal of the KLAIR project is to facilitate research into the computational modelling of spoken language acquisition. Previously we have described the KLAIR toolkit that implements a virtual infant that can see, hear and talk. In this demonstration we show how the toolkit can be used to record interactive dialogues with caregivers. The outcomes are both an audio-video recording and a log of the "beliefs" and "goals" of the infant control program. These recordings can then be analysed by machine learning systems to model spoken language acquisition. In our demonstration, visitors will be able to interact with KLAIR and try to teach it the names of some toys.

#10 Real-time Prototype for Integration of Blind Source Extraction and Robust Automatic Speech Recognition

Francesco Nesta (Fondazione Bruno Kessler-Irst)
Marco Matassoni (Fondazione Bruno Kessler-Irst)
Hari Krishna Maganti (Fondazione Bruno Kessler-Irst)

This demo presents a real-time prototype for automatic blind source extraction and speech recognition in the presence of multiple interfering noise sources. Binaurally recorded mixtures are processed by a combined Blind/Semi-Blind Source Separation algorithm in order to obtain an estimate of the target signal. The recovered target signal is segmented and used as input to a real-time automatic speech recognition (ASR) system. Further, to improve the recognition performance, noise-robust features based on Gammatone Frequency Cepstral Coefficients (GFCC) are used. The demo utilizes the data provided for the CHiME Pascal speech separation and recognition challenge and also real-time mixtures recorded on-site. Users will be able to listen to the recovered target signal and compare it with the original mixture and ASR output.
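
Gammatone-based features are typically computed from a filterbank whose center frequencies are spaced evenly on the ERB-rate scale. As a minimal sketch of that one step, the code below uses the Glasberg and Moore ERB-rate formula; the number of channels and the frequency range are illustrative assumptions, not settings reported for this prototype.

```python
import math


def hz_to_erb_rate(f_hz):
    """ERB-rate (Glasberg & Moore, 1990): 21.4 * log10(1 + 0.00437 * f)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437


def gammatone_center_frequencies(n_channels, f_min, f_max):
    """Center frequencies (Hz) equally spaced on the ERB-rate scale."""
    if n_channels < 2:
        raise ValueError("need at least two channels")
    lo, hi = hz_to_erb_rate(f_min), hz_to_erb_rate(f_max)
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_channels)]


# Assumed configuration: 32 channels between 50 Hz and 8 kHz.
centers = gammatone_center_frequencies(32, 50.0, 8000.0)
```

The resulting spacing is dense at low frequencies and sparse at high frequencies. A full GFCC front end would additionally filter the signal, compress the channel energies, and apply a cosine transform, none of which is shown here.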