Description
-
Overview
1.1 Canadian Hansards
The main corpus for this assignment comes from the o cial records (Hansards) of the 36th Canadian Parliament, including debates from both the House of Representatives and the Senate. This corpus is available at /u/cs401/A2/data/Hansard/ and has been split into Training/ and Testing/ directories.
This data set consists of pairs of corresponding les (*.e is the English equivalent of the French *.f) in which every line is a sentence. Here, sentence alignment has already been performed for you. That is, the nth sentence in one le corresponds to the nth sentence in its corresponding le (e.g., line n in fubar.e is aligned with line n in fubar.f). Note that this data only consists of sentence pairs; many-to-one, many-to-many, and one-to-many alignments are not included.
1.2 Seq2seq
We will be implementing a simple seq2seq model, with and without attention, based largely on the course material. You will train the models with teacher-forcing and decode using beam search. We will write it in PyTorch version 1.2.0 (https://pytorch.org/docs/1.2.0/), which is the version installed on the teach.cs servers. For those unfamiliar with PyTorch, we suggest you rst read the PyTorch tutorial (https: //pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html).
1.3 Tensors and batches
PyTorch, like many deep learning frameworks, operate with tensors, which are multi-dimensional arrays. When you work in PyTorch, you will rarely if ever work with just one bitext pair at a time. You’ll instead be working with multiple sequences in one tensor, organized along one dimension of the batch. This means that a pair of source and target tensors F and E actually correspond to multiple sequences
-
= (F1:(nS)(n) )n2[1;N], E = (E1:(nT)(n) )n2[1;N]. We work with batches instead of individual sequences because: a) backpropagating the average gradient over a batch tends to converge faster than single samples, and b)
sample computations can be performed in parallel. For example, if we want to multiply source sequences F (n) and F (n+1) with an embedding matrix W , we can tell one CPU core to compute the result for F (n) and another for F (n+1), halving the overall time it would take to multiply them independently. Learning to work with tensors can be di cult at rst, but is integral to e cient computation. We suggest you read more about it in the NumPy docs (https://docs.scipy.org/doc/numpy/user/theory.broadcasting. html#array-broadcasting-in-numpy), which PyTorch borrows for tensors.
Copyright c 2020 University of Toronto. All rights reserved.
1
1.4 Di erences from the lectures
There are two changes to the seq2seq architectures that we make for this assignment. First, instead of scaled dot-product energy scores for attention score(a; b) = jaj 1=2ha; bi, we’ll use the cosine similarity between vectors a and b:
ha; bi
score(a; b) = kak2 kbk2
The second relates to how we calculate the rst hidden state for the decoder when we don’t use attention. Recall that a bidirectional recurrent architecture processes its input in both directions separately: the forward direction processes (x1; x2; : : : ; xS) whereas the backward direction processes (xS; xS 1; : : : ; x1).
The bidirectional hidden state concatenates the forward and backward hidden states for the same time
ht = [hforwardt; hbackwardt]. This implies hS has processed all the input in the forward direction, but only one input in the backward direction (and vice versa for h1). To ensure the decoder gets access to all input from both directions, you should initialize the rst decoder state
-
~
forward
backward
]
h1
= [hS
; h1
~
When you use attention, set h1 = 0.
-
Your tasks
2.1 Setup
You are expected to run your solutions on teach.cs. Both code and data are available in the /u/cs401/A2/ directory on teach. Speci cally, starter code is available in the code/ sub-directory. The Hansards parallel text data are located in the data/ sub-directory.
Copy the code from the code/ directory into your working directory to get started. You should have the les a2 bleu score.py, a2 abc.py, a2 encoder decoder.py, a2 training and testing.py, a2 dataloader.py, a2 run.py, test a2 bleu score.py, and test a2 encoder decoder.py.
2.2 Calculating BLEU scores
Modify a2 bleu score.py to be able to calculate BLEU scores on single reference and candidate strings.
We will be using the de nition of BLEU scores from the lecture slides:
BLEU = BPC (p1p2 : : : pn)(1=n)
To do this, you will need to implement the functions grouper(…), n gram precision(…),
brevity penalty(…), and BLEU score(…). Make sure to carefully follow the doc strings of each function. Do not re-implement functionality that is clearly performed by some other function.
Your functions will operate on sequences (e.g., lists) of tokens. These tokens could be the words themselves (strings) or an integer ID corresponding to the words. Your code should be agnostic to the type of token used, though you can assume that both the reference and candidate sequences will use tokens of the same type.
2.3 Building the encoder/decoder
You are expected to ll out a number of methods in a2 encoder decoder.py. These methods belong to sub-classes of the abstract base classes in a2 abcs.py. The latter de nes the abstract classes EncoderBase, DecoderBase, and EncoderDecoderBase, which implement much of the boilerplate code necessary to get
2
a seq2seq model up and running. Though you are welcome to read and understand this code, it is not necessary to do so for this assignment. You will, however, need to read the doc strings in a2 abcs.py to understand what you’re supposed to ll out in a2 encoder decoder.py. Do not modify any of the code in a2 abcs.py.
A high-level description of the contents of the requirements for a2 encoder decoder.py follows here.
More details can be found in the doc strings in a2 fabcs,encoder decoderg.py.
2.3.1 Encoder
a2 encoder decoder.Encoder will be the concrete implementation of all encoders you will use. The encoder is always a multi-layer neural network with a bidirectional recurrent architecture. The encoder gets a batch of source sequences as input and outputs the corresponding sequence of hidden states from the last recurrent layer.
Encoder.init submodules(…) should be lled out to initialize a word embedding layer and a recur-rent network architecture. Encoder.get all rnn inputs(…) accepts a batch of source sequences F1:(nS)(n) and lengths S(n) and outputs word embeddings for the sequences x(1:nS)(n) .
(n)
Encoder.get all hidden states(…) converts the word embeddings x1:S(n) into hidden states for
(n)
the last layer of the RNN h1:S(n) (note we’re using (n) here for the batch index, not the layer index).
2.3.2 Decoder without attention
a2 encoder decoder.DecoderWithoutAttention will be the concrete implementation of the decoders that do not use attention (so-called \transducer” models). Method implementations should thus be tailored to not use attention.
In order to feed the previous output into the decoder as input, the decoder can only process one step of input at a time and produce one output. Thus DecoderWithoutAttention is designed to process one slice of input at a time (though it will still be a batch of input for that given slice of time). The goal, then, is to take some target slice from the previous time step Et(n)1 and produce an un-normalized log-probability
distribution over target words at time step t, called logits(tn). Logits can be converted to a categorical distribution using a softmax:
exp(logits(n)) |
|||||||||||||||
P (yt(n) = ij : : 🙂 = |
t;i |
||||||||||||||
exp(logits |
(n) |
) |
|||||||||||||
j |
t;j |
||||||||||||||
…) |
|||||||||||||||
DecoderWithoutAttention.init |
submodules( |
P should be lled out to initialize a word embedding |
|||||||||||||
layer, a recurrent cell, and a feed-forward layer to convert the hidden state to logits. |
|||||||||||||||
~(n) |
|||||||||||||||
DecoderWithoutAttention.get |
first |
hidden |
state(…) |
produces h1 |
given the encoder hidden |
||||||||||
states h(1:nS)(n) .
DecoderWithoutAttention.get current rnn input(…) takes the previous target Et(n)1 (or the pre-
vious output yt(n)1 in testing) and outputs word embedding x~(tn) for the current step.
(n) |
~(n) |
|||||||||
DecoderWithoutAttention.get |
current |
hidden |
state(…) takes x~t |
and ht 1 |
and produces the |
|||||
~(n) |
||||||||||
current decoder hidden state ht 1. |
||||||||||
~(n) |
(n) |
|||||||||
DecoderWithoutAttention.get |
current |
logits(…) takes ht |
and produces logitst |
. |
||||||
2.3.3 Decoder with attention
a2 encoder decoder.DecoderWithAttention will be the concrete use attention. It inherits from DecoderWithoutAttention to avoid
implementation of the decoders that re-implementing
3
get current hidden states(…) and get current logits(…). The remaining methods must be re-implemented, but slightly modi ed for the attention context.
Two new methods must be implemented for DecoderWithAttention.
~(n)
DecoderWithAttention.get energy scores(…) takes in a decoder state ht and all encoder hidden states h(1:nS)(n) and produces energy scores for that decoder state but all encoder hidden states: e(t;n1:)S(n) .
~(n) |
(n) |
DecoderWithAttention.attend(…) takes in a decoder state ht |
and all encoder hidden states h1:S(n) |
and produces the attention context vector c(tn). Between get energy scores and attend, use
get attention weights(…) to convert e(t;n1:)S(n) to t;(n1:)S(n) , which has been implemented for you.
2.3.4 Putting it together: the Encoder/Decoder
a2 encoder decoder.EncoderDecoder coordinates the encoder and decoder. Its behaviour depends on whether it’s being used for training or testing. In training, it receives both F1:(nS)(n) and E1:(nT)(n) and outputs logits(1:nT)(n) un-normalized log-probabilities over y1:(nT)(n) . In testing, it receives only F1:(nS)(n) and outputs K
(n;k)
paths from beam search per batch element n: y1:T (n;k) .
EncoderDecoder.init submodules(…) initializes the encoder and decoder.
EncoderDecoder.get logits for teacher forcing(…) provides you the encoder output h(1:nS)(n) and
the targets E1:(nT)(n) and asks you to derive logits(1:nT)(n) according to the MLE (teacher-forcing) objective. EncoderDecoder.update beam(…) asks you to handle one iteration of a simpli ed version of the beam search from the slides. While a proper beam search requires you to handle the set of nished paths f, update beam doesn’t need to. Letting (n; k) indicate the nth batch elements’ kth path:
-
(n;k!v)
~(n;k)
8n; k; v:bt;0
!v)
ht+1
bt;(n;k1
[bt;(n;k1
); v]
(n;k!v)
(n;k)
~(n;k)
log P (bt
)
P (bt
) + log P (yt+1 = vjht+1
)
-
(n;k)
k
8n; k:bt+1
argmax
bt(n;k0!v)
log P (b(tn;k0!v))
In short, extend the existing paths, then prune back to the beam width.
2.3.5 Padding
An important detail when dealing with sequences of input and output is how to deal with sequence lengths. Individual sequences within a batch F (n) and E(n) can have unequal lengths S(n) 6= S(n+1), T (n) 6= T (n+1), but we pad the shorter sequences to the right to match the longest sequence. This allows us to parallelize across multiple sequences, but it’s important that whatever the network learns (i.e., the error signal) is not impacted by padding. We’ve mostly handled this for you in the functions we’ve implemented, with three exceptions: rst, no word embedding should be learned for padding (which you’ll have to guarantee); second, you’ll have to ensure the bidirectional encoder doesn’t process the padding; and third, the rst hidden state of the decoder (without attention) should not be based on padded hidden states. You are given plenty of warning in the starter code when these three cases ought to be considered. The decoder uses the end-of-sequence symbol as padding, which is entirely handled in a2 training and testing.py.
2.4 The training and testing loops
After following the PyTorch tutorial, you should be familiar with how models are trained and tested in PyTorch. You are expected to implement training and testing loops in a2 training and testing.py.
4
In a2 training and testing.compute batch total bleu(…), you are given reference (from the dataset) and candidate batches (from the model) in the target language and asked to compute the to-tal BLEU score over the batch. You will have to convert the PyTorch tensors in order to use
a2 bleu score.BLEU score(…).
In a2 training and testing.compute average bleu over dataset(…), you are to follow instruc-tions in the doc string and use compute batch total bleu(…) to determine the average BLEU score over a data set.
In a2 training and testing.train for epoch(…), once again follow the doc strings to iterate through a training data set and update model parameters using gradient descent.
2.5 Running the models
Once you have completed the coding portion of the assignment, it is time you run your models. In order to do so in a reasonable amount of time, you’ll have to train your models using a machine with a GPU. A number of teaching labs in the Bahen building have GPUs (listed in https://www.teach.cs.toronto. edu/faq.html#ABOUT4), but you must log in at the physical machines to use them (as opposed to remote access). Even on a GPU, the code can take upwards of 2 hours to complete in full. Be sure to plan accordingly!
If you have access to your own GPU, you may run this code locally and report the results. However, any modi cations you make to run the code locally must be reverted to work on teach before you submit!
You are going to interface with your models using the script a2 run.py. This script glues together the components you implemented previously. The only meaningful remaining code is in a2 dataloader.py, which converts the Hansard sentences into sequences of IDs. Su ce to say that you do not need to know how either a2 run.py nor a2 dataloader.py works, only use them (unless you are interested).
Run the following code block line-by-line from your working directory. In order, it:
-
Builds maps between words and unique numerical identi ers for each language.
-
Splits the training data into a portion to train on and a hold-out portion.
-
Trains the encoder/decoder without attention and stores the model parameters.
-
Trains the encoder/decoder with attention and stores the model parameters.
-
Returns the average BLEU score of the encoder/decoder without attention on the test set.
-
Returns the average BLEU score of the encoder/decoder with attention on the test set.
TRAIN=/h/u1/cs401/A2/data/Hansard/Training/
TEST=/h/u1/cs401/A2/data/Hansard/Testing/
# 1.
python3.7 a2_run.py vocab $TRAIN e vocab.e.gz python3.7 a2_run.py vocab $TRAIN f vocab.f.gz # 2.
python3.7 a2_run.py split $TRAIN train.txt.gz dev.txt.gz
# 3.
python3.7 a2_run.py train $TRAIN \
vocab.e.gz vocab.f.gz \
train.txt.gz dev.txt.gz \
model_wo_att.pt.gz \
–device cuda
5
python3.7 a2_run.py train $TRAIN \
vocab.e.gz vocab.f.gz \
train.txt.gz dev.txt.gz \
model_w_att.pt.gz \
–with-attention \
–device cuda
# 5.
python3.7 a2_run.py test $TEST \
vocab.e.gz vocab.f.gz model_wo_att.pt.gz \
–device cuda
# 6.
python3.7 a2_run.py test $TEST \
vocab.e.gz vocab.f.gz model_w_att.pt.gz \
–with-attention –device cuda
Steps 1 and 2 should not fail and need only be run once. Step 3 onward depend on the correctness of your code.
In a le called analysis.txt, provide the following:
The printout after every epoch of the training loop of both the model with attention and without. Clearly indicate which is which.
The average BLEU score reported on the test set for each model. Again, clearly indicate which is which.
A brief discussion on your ndings. Was there a discrepancy in between training and testing results? Why do you think that is? If one model did better than the other, why do you think that is?
2.6 Bonus [up to 15 marks]
We will give bonus marks for innovative work going substantially beyond the minimal requirements. How-ever, your overall mark for this assignment cannot exceed 100%. Submit your write-up in bonus.txt.
You may decide to pursue any number of tasks of your own design related to this assignment, although you should consult with the instructor or the TA before embarking on such exploration. Certainly, the rest of the assignment takes higher priority. Some ideas:
Perform substantial data analysis of the error trends observed in each method you implement. This must go well beyond the basic discussion already included in the assignment.
There are many possible ways to assemble attentions for the encoder / decoder. For example, dot-product attention similar to (Vaswani et al., 2017)1, and structured self attention (Lin et al., 2017)2. Analyze several attention mechanisms, compare their performances, and include discussions towards their reasons of the performance di erences.
6
This assignment is submitted electronically. Submit your assignment on MarkUs. Do not tar or compress your les, and do not place your les in subdirectories. Do not format your discussion as a PDF or Word document | use plain text only. You should submit:
-
The les a2 bleu score.py, a2 encoder decoder.py, and a2 training and testing.py that you lled out according to assignment speci cations. We will not accept a2 abc.py, a2 dataloader.py, nor a2 run.py – your assignment must be compatible with the versions we provided you.
-
Your write-up on the experiment in analysis.txt.
-
If you are submitting a bonus, tell us what you’ve done by submitting a write-up in bonus.txt. Please distribute bonus code amongst the above *.py les, being careful not to break functions, methods, and classes related to the assignment requirements.
You should not submit any additional les that you generated to train and test your models. For example, do not submit your model parameter les *.pt.gz or vocab les *.fe,fg.gz. Only submit the above les. Additional source les containing helper functions are not permitted.
7
A.1 Check Piazza regularly
Updates to this assignment as well as additional assistance outside tutorials will be primarily distributed via Piazza (http://piazza.com/utoronto.ca/winter2020/csc4012511/home). It is your responsibility to check Piazza regularly for updates.
A.2 Using your own computer
If you want to do some or all of this assignment on your laptop or other computer, you will have to do the extra work of downloading and installing the requisite software and data. You take on the risk that your computer might not be adequate for the task. You are strongly advised to upload regular backups of your work to teach.cs, so that if your machine fails or proves to be inadequate, you can immediately continue working on the assignment at teach.cs. When you have completed the assignment, you should try your programs out on teach.cs to make sure that they run correctly there. A submission that does not work on teach.cs will get zero marks.
A.3 Unit testing
We strongly recommend you test the methods and functions you’ve implemented prior to running the train-ing loop. You can test actual output against expected input for complex methods like update beam(…). The Python 3.7 teach environment has installed pytest (https://docs.pytest.org/en/5.1.2/). We have included some preliminary tests for BLEU score(…) and update beam(…). You can run your test suite on teach by calling
python3.7 -m pytest
While the test suite will execute initially, the provided tests will fail until you implement the necessary methods and functions. While passing these initial tests is a necessity for full marks, they are not su cient on their own. Please be sure to add your own tests.
Unit testing is not a requirement, nor will you receive bonus marks for it.
A.4 Debugging remotely
While you will be unable to use the teaching labs’ GPUs remotely, you can still debug your program remotely on the teach server on the CPU. The following commands may be used to set up a smaller task with fewer parameters and a much shorter runtime.
# Variables TRAIN and TEST as before.
export OMP_NUM_THREADS=4 # avoids a libgomp error on teach
-
create an input and output vocabulary of only 100 words python3.7 a2_run.py vocab $TRAIN e vocab_tiny.e.gz –max-vocab 100 python3.7 a2_run.py vocab $TRAIN f vocab_tiny.e.gz –max-vocab 100
-
only use the proceedings of 4 meetings, 3 for training and 1 for dev python3.7 a2_run.py split $TRAIN train_tiny.txt.gz dev_tiny.txt.gz –limit 4
-
use far fewer parameters in your model
python3.7 a2_run.py train $TRAIN \
vocab_tiny.e.gz vocab_tiny.f.gz \
train_tiny.txt.gz dev_tiny.txt.gz \
model.pt.gz \
–epochs 2 \
8
–encoder-hidden-size 101 \
–batch-size 5 \
–cell-type gru \
–beam-width 2
Each epoch on teach should be less than 30 seconds using these settings. Note that your BLEU will likely in the reduced vocabulary condition – even using very few parameters – since your model will end up learning to output the out-of-vocabulary symbol. Do not report your ndings on the toy task in analysis.txt.
You can (and should) experiment with di erent sizes, types, and widths before running. You can check what you can modify with the command a2 run.py -h. At some point, you may come across the error:
RuntimeError: Beam search has not finished by t=100. Increase …
This just means your model failed to output an end-of-sequence token after t = 100. This does not necessarily mean you made a mistake – you just might’ve found a parameter combination that fails to learn anything. Try increasing the number of parameters, the number of proceedings, and/or changing the cell type. If the problem persists, you might have a bug.
A.5 Recurrent cell type
You’ll notice that the code asks you to set up a di erent recurrent cell type depending on the setting of the attribute self.cell type. This could be an LSTM, GRU, or RNN (the last refers to the simple linear weighting ht = (W [x; ht 1] + b) that you saw in class). The three cell types act very similarly – GRUs and RNNs are largely interchangeable from a programming perspective – except the LSTM cell often requires you to carry around both a cell state and a hidden state. Pay careful attention to the documentation for when h t, htilde t, etc. might actually be a pair of the hidden state and cell state as opposed to just a hidden state. Sometimes no change is necessary to handle the LSTM cell. Other times you might have to repeat an operation on the elements individually. Take advantage of the following pattern:
if self.cell_type == ’lstm’:
-
do something
else:
-
do something else
Be sure to rerun training with di erent cell types (i.e. use the ag –cell-type) to ensure your code can handle the di erence.
9
The teaching labs are frequently booked for courses in order to perform course lab work. As a result, workstations will be occupied during these times and, even if you were able to nd a seat, you might be asked to leave. Below is a schedule of the days of the week and the times various labs are booked during the term ( marks occupied).
-
time
BA2200
BA3175
BA3185
BA3195
Monday
10-11am
12-2pm
2-3pm
8-9pm
Tuesday
1-3pm
6-8pm
8-9pm
Wednesday
12-2pm
2-4pm
4-5pm
6-8pm
Thursday
9-9pm
Friday
10-12pm
1-4pm
4-6pm
10
We try to match the variable names in a2 abc.py to those in the lecture slides on machine translation.
The table below serves as reference for converting between the two.
In general, we denote a speci c indexed value of a variable from the slides, such as Et, with an underscored variable name, e.g. E t. When an index is omitted, it means the variable contains all the indexed values at once, e.g. F corresponds to F1:S.
Note that the slides are 1-indexed, whereas code is 0-indexed. Also, all variables in the PyTorch code are batched, but the slides only look at one sequence at a time.
-
Variable
Slides
Notes
F
F1:S
Source sequence.
F
lens
S
In the code, the maximum-length source sequence in the
batch is said to have length S. The actual length of each
sequence in the batch (before padding) is stored in F
lens.
x
x
Encoder RNN inputs.
h
h1:S
Encoder hidden states.
Always refers to the last encoder
layer’s hidden states, with both directions concatenated.
~
htilde
0
h1
The rst decoder hidden state.
E
tm1
Et
1
The target token at t
1 (previous).
xtilde
t
x~t
Decoder RNN input at time t (current).
~
htilde
t
ht
Decoder hidden state at time t (current).
For the LSTM
architecture, this can be a pair with the cell state. Note in
the beam search update htilde
t also includes paths, i.e.
~(1:K)
.
ht
logits
t
log P (ytj : : 🙂 + C
Un-normalized log-probabilities over the target vocabulary
at time t (current). Pre-softmax.
E
E1:T
Target sequence.
logits
log P (ytj : : 🙂1:T + C
Un-normalized log-probabilities over the the target vocabu-
lary across each time step. Pre-softmax.
(1:K)
b
tm1
1
bt 1;1
All pre xes in the beam search at time t
1 (previous).
logpb
tm1
log P (bt(1:1K))
The log-probabilities of the beam search pre xes up to time
t
1 (previous).
logpy
t
log P (ytj : : 🙂
Valid (normalized) log-probabilities over the target vocabu-
lary at time t (current). Post-softmax.
b
t
0
bt;(1:0
K)
Decoder hidden states at time t (current).
The di erence
(1:K)
~(1:K)
between bt;0
and ht
is contextual: the latter points to
the paths in the beam before the update, the former after
the update.
b
t
1
bt;(1:1
K)
All pre xes in the beam seach at time t (current).
logpb
t
log P (bt(1:K))
The log-probabilities of the beam search pre xes up to time
t (current).
c
t
ct
Context vector (for attention) at time t (current).
alpha
t
t;1:S
Attention weights over all source times at target time t (cur-
rent).
e
t
et;1:S
Energy scores (for attention) over all source times at target
time t (current).
11