Suzy Howlett: Confidence in Syntax for SMT

Machine translation (MT) is the task of automatically translating a written text from one human language to another. In statistical machine translation (SMT), this is accomplished by developing a probabilistic model of the translation process. Intuitively, linguistic information about the sentence should aid translation, but so far the addition of such information to the statistical model has not consistently proven useful. This project investigates how such information can be usefully incorporated into the system.

Syntax in SMT

There has been considerable work on incorporating information about the syntactic structure of sentences into SMT systems. Results, however, have been mixed. This may be because the syntactic information is obtained automatically and is therefore error-prone. If so, incorporating measures of our confidence in its accuracy may allow the translation system to adjust its reliance on that information accordingly.

System Overview and General Notes

The system used in this project is a phrase-based SMT system that incorporates syntax through a reordering-as-preprocessing approach. I have re-implemented the Collins et al. (2005) system for German-to-English translation using Moses and the Berkeley parser.
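To make the idea of reordering-as-preprocessing concrete, here is a minimal Python sketch. It is not the code distributed with the papers and not the actual Collins et al. rule set: the function names, the flat token representation, and the single toy rule (move a clause-final finite verb to directly after a complementizer and a one-word subject) are all assumptions made for illustration. The real system operates on parse trees produced by the Berkeley parser.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, STTS part-of-speech tag)


def move_final_verb(clause: List[Token]) -> List[Token]:
    """Toy reordering rule (hypothetical, for illustration only).

    Assumes the clause starts with a complementizer (KOUS) followed by a
    one-word subject and ends with a finite verb (VVFIN). The verb is moved
    to directly after the subject, giving a more English-like word order.
    Clauses that do not match are returned unchanged.
    """
    if len(clause) < 4 or clause[0][1] != "KOUS" or clause[-1][1] != "VVFIN":
        return list(clause)
    return clause[:2] + [clause[-1]] + clause[2:-1]


def preprocess(clause: List[Token]) -> str:
    """Apply the toy rule and return the reordered source-side string."""
    return " ".join(word for word, _ in move_final_verb(clause))


example = [
    ("dass", "KOUS"),
    ("ich", "PPER"),
    ("das", "ART"),
    ("Buch", "NN"),
    ("lese", "VVFIN"),
]
reordered = preprocess(example)  # "dass ich lese das Buch"
```

In a reordering-as-preprocessing setup, a step of this kind is applied to the source side of both the training data and the test input, and the phrase-based system is then trained and run on the reordered text.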

The German parsing model that comes with the Berkeley parser does not include the function labels needed, so additional parsing models are required. These are trained on version 1 of the Tiger corpus and are available for download below. By using these models you indicate that you agree to the licences for both the Berkeley parser and the Tiger corpus, which can be found on their respective websites.

Code and models are distributed in a bundle for each publication using the system. Code is copyright © Susan Howlett and is provided for your use free of charge and without warranty. If you use the code, please cite the corresponding publication.

Pilot Using Lattice Input

In a pilot system, I used Moses' lattice input to translate sentences both with and without the reordering step, and to incorporate confidence measures in the translation model. I referred to this system as dual-path phrase-based SMT, named for the structure of the lattice used. Results are given in the following paper.

Howlett and Dras (ALTA 2010): Dual-Path Phrase-Based Statistical Machine Translation. (Paper available on the publications page.)
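The dual-path lattice described above can be sketched as follows. This is an illustrative assumption, not the code from the download package: it builds a lattice with one path for the original sentence and one for the reordered sentence, and serializes it in the Python Lattice Format (PLF) that Moses reads for lattice input. The function names are hypothetical, and attaching the confidence value as the score of every edge on the reordered path is only one possible choice; the paper describes how confidence was actually incorporated.

```python
from typing import List, Tuple

Edge = Tuple[str, float, int]  # (word, score, distance to the target node)


def dual_path_nodes(original: List[str], reordered: List[str],
                    confidence: float) -> List[List[Edge]]:
    """Build a two-path lattice that shares only its start and end nodes.

    Node 0 is the start. Nodes 1..n-1 lie on the original path, the
    following m-1 nodes lie on the reordered path, and the final node is
    implicit, as in PLF. Scores are an assumption: `confidence` on the
    reordered path and 1 - confidence on the original path.
    """
    n, m = len(original), len(reordered)
    if n == 0 or m == 0:
        raise ValueError("both paths need at least one word")
    final = n + m - 1
    nodes: List[List[Edge]] = [[] for _ in range(final)]
    for i, word in enumerate(original):
        target = i + 1 if i < n - 1 else final
        nodes[i].append((word, 1.0 - confidence, target - i))
    for j, word in enumerate(reordered):
        source = 0 if j == 0 else n + j - 1
        target = n + j if j < m - 1 else final
        nodes[source].append((word, confidence, target - source))
    return nodes


def to_plf(nodes: List[List[Edge]]) -> str:
    """Serialize the lattice as a single-line PLF string."""
    def edge(word: str, score: float, dist: int) -> str:
        escaped = word.replace("\\", "\\\\").replace("'", "\\'")
        return f"('{escaped}',{score},{dist})"
    body = "".join(
        "(" + "".join(edge(*e) + "," for e in node) + "),"
        for node in nodes
    )
    return "(" + body + ")"


lattice = to_plf(dual_path_nodes(["a0", "a1"], ["b0", "b1"], 0.5))
```

Because the two paths share only the start and end nodes, the decoder must commit to either the original or the reordered sentence as a whole, which is what makes the lattice "dual-path".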

Please also read the General Notes section above.

Download package to accompany the paper, including the system's code and configuration files. View the README file from the package.
German parsing model used in the paper. View details of how the model was trained.
The data for the experiments in this paper are available from the 2010 Workshop on Statistical Machine Translation (WMT'10).

On 6 December 2011, we published an erratum for this paper. The erratum is available with the paper on the publications page. Configuration files and notes to accompany the erratum are available in this download package.

Exploring Reordering-as-Preprocessing

The next phase of the project involves a more in-depth exploration of the performance gains of the reordering-as-preprocessing system. Some results appear in the following paper.

Howlett and Dras (ACL 2011): Clause Restructuring for SMT Not Absolutely Helpful. (Paper available on the publications page.)

Please also read the General Notes section above.

Download package to accompany this paper (hosted by the ACL Anthology), including the system's code and configuration files. There is a partial overlap with the scripts contained in the package for the ALTA 2010 paper. View the README file from the package.
This paper uses the German parsing model from the ALTA 2010 paper in the previous section. The paper also uses several additional models: models trained on 50%, 25%, and 10% of the training data, and lowercased models trained on 100% and 50% of the data. Details of how each model was trained are given in the download package for the paper.
The data for most of the experiments in this paper are available from the 2009 and 2010 Workshops on Statistical Machine Translation. The data used to replicate the Collins et al. (2005) experiment as closely as possible were kindly provided to us by Michael Collins.

Adding Linguistic Information to Statistical Machine Translation:
Confidence in Syntax for SMT