Adding Linguistic Information to Statistical Machine Translation:
Confidence in Syntax for SMT

Second phase of PhD project, 2010–2012. Supervisors: Mark Dras, Robert Dale

Background

Project

Machine translation (MT) is the task of automatically translating a written text from one human language to another. In statistical machine translation (SMT), this is accomplished by developing a probabilistic model of the translation process. Intuitively, linguistic information about the sentence should aid translation, but so far the addition of such information to the statistical model has not consistently proven useful. This project investigates how such information can be usefully incorporated into the system.

Syntax in SMT

There has been considerable work on incorporating information about the syntactic structure of sentences into SMT systems. Results, however, have been mixed. This may be because the syntactic information is obtained automatically and is therefore error-prone. If so, incorporating measures of our confidence in the accuracy of the information may allow the translation system to adjust its reliance on it accordingly.

System Overview and General Notes

The system used in this project is a phrase-based SMT system that incorporates syntax through a reordering-as-preprocessing approach. I have re-implemented the Collins et al. (2005) system for German-to-English translation using Moses and the Berkeley parser.

The German parsing model that comes with the Berkeley parser does not include the function labels that the reordering step needs, so additional parsing models are required. These are trained on version 1 of the Tiger corpus and are available for download below. By using these models you indicate that you agree to the licences for both the Berkeley parser and the Tiger corpus, which can be found on their respective websites.

Code and models are distributed in a bundle for each publication using the system. Code is copyright © Susan Howlett and is provided for your use free of charge and without warranty. If you use the code, please cite the corresponding publication.

Pilot Using Lattice Input

In a pilot system, I used Moses' lattice input to translate sentences both with and without the reordering step, and to incorporate confidence measures in the translation model. I referred to this system as dual-path phrase-based SMT, named for the structure of the lattice used. Results are given in the following paper.

Howlett and Dras (ALTA 2010): Dual-Path Phrase-Based Statistical Machine Translation. (Paper available on the publications page.)

Please also read the System Overview and General Notes section above.

On 6 December 2011, we published an erratum for this paper. The erratum is available with the paper on the publications page. Configuration files and notes to accompany the erratum are available in this download package.
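The two-path lattice idea described in this section can be illustrated with a small sketch. This is not the code from the download package. It assumes that the lattice is written in the Python-style nested-tuple format that Moses accepts for lattice input (PLF), that each edge carries a single score, and that the confidence for a path is attached to that path's first edge only. The function names and the default weights are hypothetical; check the Moses documentation for the exact input format before using anything like this.

```python
import ast


def dual_path_plf(original, reordered, conf_original=0.5, conf_reordered=0.5):
    """Build a PLF-style string for a lattice with two parallel paths.

    One path spells out the original token sequence, the other the
    reordered sequence. Both start at node 0 and meet at a shared end node.
    Assumption: the path confidence sits on the first edge of each path,
    and all other edges get a score of 1.0.
    """
    if not original or not reordered:
        raise ValueError("both token lists must be non-empty")

    n, m = len(original), len(reordered)
    end = n + m - 1  # index of the shared final node

    # Node indices visited by each path, in topological order.
    orig_nodes = [0] + list(range(1, n)) + [end]
    reord_nodes = [0] + list(range(n, n + m - 1)) + [end]

    # Outgoing edges for every non-final node.
    nodes = [[] for _ in range(end)]
    paths = (
        (orig_nodes, original, conf_original),
        (reord_nodes, reordered, conf_reordered),
    )
    for path, tokens, conf in paths:
        for i, tok in enumerate(tokens):
            src, dst = path[i], path[i + 1]
            score = conf if i == 0 else 1.0
            # PLF edges store the distance to the target node, not its index.
            nodes[src].append((tok, score, dst - src))

    def quote(tok):
        # Escape backslashes and single quotes inside a token.
        return "'" + tok.replace("\\", "\\\\").replace("'", "\\'") + "'"

    node_strs = []
    for edges in nodes:
        edge_strs = "".join(f"({quote(w)},{s},{d})," for w, s, d in edges)
        node_strs.append("(" + edge_strs + "),")
    return "(" + "".join(node_strs) + ")"


def plf_to_edges(plf):
    """Parse a PLF-style string back into nested tuples for inspection."""
    return ast.literal_eval(plf)


if __name__ == "__main__":
    # Toy German clause before and after a hypothetical verb-movement step.
    source = "dass ich das Buch gelesen habe".split()
    moved = "dass ich habe gelesen das Buch".split()
    print(dual_path_plf(source, moved, 0.4, 0.6))
```

Because only the first edge of each path differs in score, the decoder's choice between the two paths is governed by a single pair of weights, which keeps the sketch easy to inspect. A real system might instead spread several confidence features over the edges.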

Exploring Reordering-as-Preprocessing

The next phase of the project involves a more in-depth exploration of the performance gains of the reordering-as-preprocessing system. Some results appear in the following paper.

Howlett and Dras (ACL 2011): Clause Restructuring for SMT Not Absolutely Helpful. (Paper available on the publications page.)

Please also read the System Overview and General Notes section above.

On 6 December 2011, we published an erratum for this paper. The erratum is available with the paper on the publications page. Configuration files and notes to accompany the erratum are available in this download package.

Continuing Work

For information about further work on this project, contact Mark Dras.