To support machine learning of cross-language prosodic mappings and other approaches to improving speech-to-speech translation, we present a protocol for collecting closely matched pairs of utterances across languages, a description of the resulting data collection and its public release, and some observations and musings. This report is intended for:
- people using this corpus
- people extending this corpus
- people designing similar collections of bilingual dialog data.
Change Notes. This version supersedes UTEP-CS-22-108. It adds some new information and numerous clarifications, mostly arising from our experiences diversifying our corpus and helping a vendor use this protocol.