This package contains the downloads associated with

  Susan Howlett and Mark Dras (2010) "Dual-Path Phrase-Based Statistical
  Machine Translation" in Proceedings of the Australasian Language
  Technology Association Workshop (ALTA2010)

Package last updated: 26 November 2010

This code is provided for your use free of charge and without warranty.
Please cite the above publication if you use this code in your research.
Please address questions and comments to Suzy Howlett (suzy@showlett.id.au).

==========

Package Contents
----------------

1. README

2. Code:
   analysis/
     oracle.py
   lattice/
     lattice.py
   parsing/
     distributed_parser
     parse_german
   preprocessing/
     Collins_baseline
     lattice_system
   reordering/
     Collins_rules.py
     Collins_rules_test.py

3. Experiment Management System configuration files:
   configs/
     config.baseline
     config.feature-lattice
     config.oracle-evaluation
     config.plain-lattice
     config.reordered

4. Moses diffs:
   notes/
     moses-diff.txt
     moses-mod-diff.txt

==========

README Contents
---------------

1. Moses setup
   1.1 Baseline system
   1.2 Modifications for our cluster
   1.3 Modifications for the dual-path systems
2. Parsing setup
3. Notes about data
4. Using this code
   4.1 Overview of files
   4.2 Parsing
   4.3 Preprocessing
       4.3.1 REORDER (Collins et al. (05) reordering-as-preprocessing baseline)
       4.3.2 LATTICE (plain dual-path PSMT system)
       4.3.3 +FEATURES (dual-path PSMT system with all confidence features)
       4.3.4 Lattice systems with one vocabulary
   4.4 Running baselines
   4.5 Running dual-path systems
   4.6 Running approximate oracle

==========

1. MOSES SETUP

1.1 Baseline system

Check out the Moses repository (this paper used revision 3590):

  svn co -r 3590 https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk moses

Install Moses with the SRILM toolkit. For more information, visit the Moses
website: http://www.statmt.org/moses/

1.2 Modifications for our cluster

The experiments in this paper were run across a cluster using TORQUE for job
scheduling.
The TORQUE qsub command is different from that used in Moses; the
accompanying file notes/moses-diff.txt contains the diff between our baseline
and the repository revision. These changes should not affect the performance
of the system.

Our cluster setup also affects the configuration files for the Moses
Experiment Management System, specifically the "qsub-settings" variables.
These settings instruct the cluster to assign 8 or 16 CPUs to each job. This
number of CPUs was not actually needed; it was a workaround for our setup to
ensure that not too many jobs were assigned to each cluster node. These
changes should also not affect the results of the systems.

1.3 Modifications for the dual-path systems

The dual-path systems we describe in the paper use disjoint vocabularies for
the two paths of the lattice, created by prepending "1_" or "2_" to each
token. For the dual-path systems we therefore use a second installation of
Moses, modified to remove these prefixes from out-of-vocabulary items (which
are otherwise copied directly to the output). The required change, which was
applied on top of the changes mentioned in (1.2), is given in the
accompanying file notes/moses-mod-diff.txt.

==========

2. PARSING SETUP

We use the Berkeley parser, repository revision 14:

  svn co -r 14 http://berkeleyparser.googlecode.com/svn/trunk/ berkeleyparser

To parse we use the BerkeleyParser.jar file at the top level of this
repository. For the grammar file we use ger_tiger.gr, which is available as
a separate download from http://www.showlett.id.au/ along with a description
of how it was trained.

==========

3. NOTES ABOUT DATA

The experiments in this paper use the data provided for the 2010 Workshop on
Statistical Machine Translation translation task:
http://www.statmt.org/wmt10/translation-task.html

Specifically, we use the parallel corpus training data
(training-parallel.tgz), monolingual language model training data
(training-monolingual.tgz) and development sets (dev.tgz), extracted to
/usr/local/data/wmt10. Note that we do not use the WMT10 test sets in this
paper.

The preprocessing scripts included in this bundle and described below take
the data from /usr/local/data/wmt10 and place their output in a new
directory, as described below in (4.2) and (4.3). Most systems will get
their data from this new directory; we use /home/showlett/corpora. This is
done in two phases (parsing and the rest) since the parsing step is
time-consuming and need not be repeated for the different systems being
created.

When using the configuration files, remember to change these data directory
paths to those used on your system.

==========

4. USING THIS CODE

4.1 Overview of files

analysis/oracle.py
  This Python script compares the outputs of the baseline and reordered
  systems with the reference translation and selects whichever system output
  is likely to contribute to a higher BLEU score overall. Where it cannot
  make a choice, the default is to use the baseline system output. At
  completion, the script prints statistics of the selections made. This
  script only compiles the oracle's selections; it does not calculate the
  final BLEU score.

lattice/lattice.py
  This Python script generates the lattice input that the dual-path system
  uses in tuning and evaluation. It takes as input files that contain the
  original set of sentences, their reordered equivalents, and (optionally)
  the set of confidence features to include on the lattice. As output it
  produces a file containing the corresponding lattices.
  If the feature file is provided, each lattice will include all feature
  values; if not, it includes only the reordering indicator feature
  described in the paper.

parsing/distributed_parser
  This bash script was used to distribute the job of parsing the German side
  of the corpora across our cluster, and should be replaced with an
  equivalent script for your setup. The important thing is that the Berkeley
  parser is called with the flags "-confidence -tree_likelihood".
  Distribution across a cluster is not necessary, but is recommended.

parsing/parse_german
  This bash script takes the German corpus file, removes the SGM markup (if
  necessary), tokenizes the text, replaces parentheses with *LRB* and *RRB*
  (as required by the German parsing model we used) and calls the
  parsing/distributed_parser script above.

preprocessing/Collins_baseline
  This bash script takes the output from the parsing/parse_german script
  above and uses the reordering/Collins_rules.py script below to produce the
  training, tuning and evaluation data for the reordering-as-preprocessing
  baseline system (reimplementing Collins et al. (05)).

preprocessing/lattice_system
  This bash script takes the output from the parsing/parse_german script
  above and uses the reordering/Collins_rules.py script below and the
  lattice/lattice.py script above to produce the training, tuning and
  evaluation data for the four variants of the dual-path lattice system. The
  four variants are produced by using disjoint vocabularies or a single
  vocabulary for the two paths, and by including all features or only the
  indicator feature. Note that this paper only used two of these four
  variants: those with disjoint vocabularies. For the disjoint-vocabulary
  systems, this script prepends "1_" or "2_" to each token, as described
  above. The script also reverses the parenthesis conversion in the
  parsing/parse_german script, replacing *LRB* and *RRB* with their
  respective parenthesis characters.
  Finally, this script produces the 'filter file' required by the
  translation system to filter the phrase tables; this is simply a single
  file containing the sentences forming both paths of the lattices.

reordering/Collins_rules.py
  This Python script takes German sentences parsed by the Berkeley parser
  and reorders them according to the rules in Collins et al. (05). It is
  used by preprocessing/Collins_baseline to produce the data used by the
  reordering-as-preprocessing baseline, and by preprocessing/lattice_system
  to produce part of the data needed for the lattice systems. In addition to
  producing the reordered sentences, this script also optionally collects
  the confidence feature information used in this paper and stores it in a
  file, later passed to lattice/lattice.py.

reordering/Collins_rules_test.py
  This Python script contains a small set of 10 test cases for the
  reordering/Collins_rules.py script above.

4.2 Parsing

The parsing/parse_german script takes the following command-line arguments:

- The directory containing the file to be parsed. We used
  /usr/local/data/wmt10/training or /usr/local/data/wmt10/dev.
- The name of the file in this directory to be parsed. This is given as a
  separate argument since the script uses the same filename for the output.
- The directory to store the results. As mentioned in (3), we used
  /home/showlett/corpora. The script creates subdirectories of this
  directory named "tokenized" and "parsed".
- A temp directory in which intermediate results will be stored.

Please note that some of the variables in the script will need to be changed
to run on a new system. The script calls the parsing/distributed_parser
script; as mentioned in (4.1), this should be replaced with an equivalent
script for the new system.

The core parsing command is:

  java -jar BerkeleyParser.jar -gr ger_tiger.gr -confidence -tree_likelihood < $infile > $outfile

where BerkeleyParser.jar and ger_tiger.gr are as described in (2).
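The parenthesis conversion performed by parsing/parse_german, and reversed by preprocessing/lattice_system, amounts to a simple token substitution. A minimal sketch (the helper names are hypothetical, not taken from the scripts):

```python
def escape_parens(line):
    """Replace parentheses with the *LRB*/*RRB* tokens expected by the
    German parsing model (as done in parsing/parse_german)."""
    return line.replace("(", "*LRB*").replace(")", "*RRB*")

def unescape_parens(line):
    """Reverse the conversion, restoring the parenthesis characters (as
    done in preprocessing/lattice_system)."""
    return line.replace("*LRB*", "(").replace("*RRB*", ")")

line = "der Vertrag ( Artikel 5 ) wurde unterzeichnet"
escaped = escape_parens(line)
assert escaped == "der Vertrag *LRB* Artikel 5 *RRB* wurde unterzeichnet"
assert unescape_parens(escaped) == line
```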
4.3 Preprocessing

After parsing is complete, the preprocessing scripts are used to create the
actual training, tuning and evaluation data for each system, for both German
and English.

4.3.1 REORDER (Collins et al. (05) reordering-as-preprocessing baseline)

The preprocessing/Collins_baseline script takes the following command-line
arguments:

- The directory containing the original corpora. We used
  /usr/local/data/wmt10/training or /usr/local/data/wmt10/dev. Please note
  that the filename manipulations used in the script are specific to the
  WMT10 German--English data, and so the script is not necessarily directly
  usable for other corpora.
- A string indicating which set of files is to be used. The precise
  filenames are constructed from this string. We used "europarl-v5" and
  "news-commentary10" for training, "news-test2008" for tuning, and
  "newstest2009" for evaluation.
- The directory given as the third command-line argument in parsing (i.e.
  for us, /home/showlett/corpora). The script uses the files in the
  "tokenized" and "parsed" subdirectories created by the parsing script, and
  creates a new subdirectory "reordered" to store its own results.
- A temp directory in which intermediate results will be stored.
- One of the strings "TRAIN", "TUNE" or "TEST", specifying whether the
  script is to create data for the training, tuning or evaluation phase,
  respectively. Currently there is no difference between TUNE and TEST.

Please note that some of the variables in the script will need to be changed
to run on a new system.

This script calls the reordering/Collins_rules.py script. For information
about the command-line arguments of reordering/Collins_rules.py, see the
comments at the beginning of that file.
4.3.2 LATTICE (plain dual-path PSMT system)

The preprocessing/lattice_system script takes the same command-line
arguments as the preprocessing/Collins_baseline script in (4.3.1), plus two
arguments whose value should be either "T" or "F":

- The first indicates whether ("T") or not ("F") the two paths of the
  lattice should have disjoint vocabularies. For the LATTICE system, this
  should be "T".
- The second indicates whether ("T") or not ("F") to include the confidence
  feature values in the lattices that are created. For the LATTICE system,
  this should be "F".

As for (4.3.1), the script is specific to the WMT10 data and is not
necessarily directly usable for other corpora, contains variables that will
need to be changed to run on a new system, and currently makes no
distinction between TUNE and TEST.

As in (4.3.1), this script uses the files in the "tokenized" and "parsed"
subdirectories of the third command-line argument. With the T/F settings
given above, the script will create a new subdirectory "plain-lattice".

The script calls the reordering/Collins_rules.py script for all phases, and
the lattice/lattice.py script for TUNE and TEST. As mentioned in the paper,
this is because lattices are only used in tuning and evaluation; training
simply proceeds using the concatenation of the original and reordered
training data. For TUNE and TEST, the script also creates the filter file
corresponding to the lattices.

4.3.3 +FEATURES (dual-path PSMT system with all confidence features)

Preprocessing for this system proceeds exactly as for LATTICE in (4.3.2),
with the exception that the final command-line argument to the
preprocessing/lattice_system script should be set to "T", and the script
creates the subdirectory "feature-lattice" instead of "plain-lattice".
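For orientation, Moses accepts lattice input in its Python Lattice Format (PLF). The following sketch builds a two-path lattice with the disjoint "1_"/"2_" vocabularies of (1.3); it is an illustration only, since the exact node layout, scores and features emitted by lattice/lattice.py are not documented in this README, and the helper name build_dual_path_plf is hypothetical:

```python
def build_dual_path_plf(original, reordered, disjoint_vocab=True):
    """Build a PLF lattice whose two paths are the original token sequence
    and its reordered equivalent.

    A PLF lattice is a tuple of nodes; each node is a tuple of outgoing
    arcs, and each arc is (word, score, distance), where distance is the
    number of nodes the arc jumps forward. The final node is implicit.
    """
    assert len(original) >= 2 and len(reordered) >= 2, "sketch assumes multi-token paths"
    if disjoint_vocab:
        # Disjoint vocabularies, as in (1.3): prefix each path's tokens.
        original = ["1_" + w for w in original]
        reordered = ["2_" + w for w in reordered]
    m, n = len(original), len(reordered)
    # The start node forks: the first original token advances one node,
    # while the first reordered token jumps past the rest of the original
    # path.
    nodes = [((original[0], 1.0, 1), (reordered[0], 1.0, m))]
    # Remaining original tokens; the last one jumps over the reordered
    # path's nodes to the shared (implicit) final node.
    for i, w in enumerate(original[1:], start=2):
        nodes.append(((w, 1.0, 1 if i < m else n),))
    # Remaining reordered tokens run straight on to the final node.
    for w in reordered[1:]:
        nodes.append(((w, 1.0, 1),))
    return tuple(nodes)

# One lattice per input line; Moses reads each lattice as a one-line tuple.
plf = build_dual_path_plf(["heute", "kommt", "er"], ["kommt", "er", "heute"])
```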
4.3.4 Lattice systems with one vocabulary

Although lattice systems with one vocabulary for the two paths were not used
in this paper, the same preprocessing/lattice_system script can be used to
create the lattices needed, simply by setting the first T/F command-line
argument of the script to "F". The subdirectory that the script creates to
store the resulting data is called either "onevocab-plain-lattice" or
"onevocab-feature-lattice", depending on the choice of the second T/F
command-line argument.

4.4 Running baselines

The Moses baseline (MOSES) is run using the Moses Experiment Management
System (EMS) with the configuration file configs/config.baseline. On our
system this took roughly 1.5 days.

The Collins et al. (05) baseline (REORDER) is run using the Moses EMS with
the configuration file configs/config.reordered. On our system this took
roughly 2 days.

For details about using the Moses EMS, see the Moses website:
http://www.statmt.org/moses/. Note that several variables in the
configuration files will need to be changed to run on a new system. (See
also (1.2).)

Note that the REORDER system reuses the language model and recaser trained
with the MOSES baseline system, and so the configuration file needs to know
the locations of these files.

4.5 Running dual-path systems

The LATTICE and +FEATURES systems are run using the Moses EMS with the
configuration files configs/config.plain-lattice and
configs/config.feature-lattice respectively. On our system, LATTICE took
roughly 4 days, while +FEATURES took roughly 1 week.

For details about using the Moses EMS, see the Moses website:
http://www.statmt.org/moses/. As in (4.4), several variables in the
configuration files will need to be changed to run on a new system. (See
also (1.2).) Also like the REORDER system in (4.4), these dual-path systems
reuse the language model and recaser trained with the MOSES baseline system,
and so the configuration files require the locations of these files.
4.6 Running approximate oracle

The approximate oracle described in the paper is implemented in
analysis/oracle.py. The script requires the following command-line
arguments:

- The file containing the reference translations for this evaluation set.
- The output of the MOSES system.
- The output of the REORDER system.
- The file to which the oracle's selections are written. (If it exists,
  this file will be overwritten.)

More detail on each of these is given below.

The reference translations and system output files should be tokenized and
not lowercased. Run the tokenizer (given as the value of the "en_tokenizer"
variable in the preprocessing scripts) on "newstest2009.reference.txt.1" in
the "evaluation" directory of the MOSES system's working directory to
produce the file needed for the first command-line argument. The second and
third command-line arguments should be "newstest2009.recased.1" from the
"evaluation" directories of the MOSES and REORDER systems' working
directories.

Create a working directory "oracle-evaluation" for the approximate oracle
system. This path will need to be specified in the configuration file
configs/config.oracle-evaluation. Within this directory, create an
"evaluation" directory. For the final command-line argument, give
"oracle-evaluation/evaluation/newstest2009.oracle-output".

When the script has completed, the file indicated by the final command-line
argument will contain the oracle's choices. Each line of the file will be
identical to the corresponding line of either the MOSES output or the
REORDER output. To score these choices, run the Moses Experiment Management
System with the configuration file configs/config.oracle-evaluation. Note
that several variables in this configuration file will need to be changed to
run on a new system.

==========
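The per-sentence selection that analysis/oracle.py performs can be sketched as follows. This is not the script's actual criterion (which chooses the output "likely to contribute to a higher BLEU score overall"); the sketch substitutes a simple unigram-F1 proxy, and the helper names are hypothetical:

```python
from collections import Counter

def overlap_score(hyp, ref):
    """Unigram F1 between a hypothesis and the reference (a proxy only;
    not the measure used in analysis/oracle.py)."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    match = sum((hyp_counts & ref_counts).values())
    if match == 0:
        return 0.0
    p = match / sum(hyp_counts.values())
    r = match / sum(ref_counts.values())
    return 2 * p * r / (p + r)

def select_oracle(refs, baseline_out, reorder_out):
    """For each sentence, keep whichever output scores higher against the
    reference; ties default to the baseline output, as analysis/oracle.py
    does when it cannot make a choice."""
    choices = []
    for ref, base, reord in zip(refs, baseline_out, reorder_out):
        better = overlap_score(reord, ref) > overlap_score(base, ref)
        choices.append(reord if better else base)
    return choices
```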