Frequently Asked Questions (FAQ)


(Last three updates: 2 Mar 2023: Added Q2-3 for Assignment 2.
22 Feb 2023: Added FAQs for Assignment 2.
30 Jan 2023: Added 12 Assignment Questions and Cleared Typos.)

Administrivia

What is the format of this course?
This course’s administration is currently in flux due to higher demand for the subject. We have opened two sections of the course. L1 will be taught by Kan Min-Yen on Friday AMs; L2 taught by Christian von der Weth on Monday evenings.
Will the L1 and L2 sections of the course work the same?
Generally, yes. We will be offering synchronized lectures such that for Weeks 3 through 12, the L1 section on Friday starts the topic and the subsequent L2 section on Monday covers the same topic. This means that L2 will have no third lecture (no class on 23 Jan) and L1 will have no lecture on the last Friday (no class on 7 Apr). For the remaining weeks (Weeks 1-2, 13) L2 will introduce the topic and L1 will follow.
How about tutorials?
Tutorials will be offered every other week after Week 3. The teaching mode for tutorials has yet to be decided. We may add evening tutorials to help cope with the load.
I’m doing an ATAP, SIP, or FYP concerning NLP. Can I get an exception to enter the course?
Generally yes. Our course (even with the 2x increase in instructors) is still heavily oversubscribed. If you absolutely need to get access to the course, make an official appeal and if it can be considered, we’ll hear about it from ModReg directly. You can check with us but we may not have the capacity to answer you directly.
Can you add me as a guest student and if so, what are my responsibilities?
We do accept guest student enrolment in this course. You may write to either lecturer to gain access to the course notes and lecture webcasts. We do not accept auditing students (see NUS Registrar for the formal difference; but basically it appears on your transcript with an ‘Audit’ grade), as per NUS SoC policy. However, we have a quota cap for a reason, and that is to best serve the fully enrolled students. Thus, if our teaching staff lack bandwidth, we may not be able to answer your questions and concerns. Please do understand.
I’m having problems enrolling in a tutorial. How do I get help?
Each module has a Tutorial Registration Coordinator (TRC) assigned to it. Our TRC for CS4248 is Pranavan Theivendiram. Please contact Pranavan for help in registering for a tutorial if you cannot secure a slot. Our instructors will not be able to help you, as tutorial registration is centrally controlled.
Will the module be webcast (recorded or broadcast live)?
Yes, barring technical difficulties, we plan to record the course lectures and make them available on Canvas or MediaWeb. That is, we will make our best effort to do this, but do not guarantee quality of service for webcast or remote learning. We will attempt to simulcast the lectures on Zoom, but only for students who cannot physically make it to campus for official reasons.
I’d like to take this module but couldn’t secure it during ModReg because the quota is full. Will it be possible to make an exception?
See above answers to similar questions. Generally the answer is no, but we welcome students to be a guest in the class, or take the class in subsequent semesters (CS4248 is offered in both Sem 1 and 2 now). We know the demand is high (we heard 160 students for 100 slots), but we need to keep the student cap at a reasonable number this semester.

If you’d like to be a guest student, please send an email to cs4248@comp.nus.edu.sg, cc:ing kanmy@comp.nus.edu.sg, with your name and NUS LumiNUS-registered email address.
I do not fulfill the exact prerequisite requirements for this module. Can you grant me this exception?
Module coordinators do not approve prerequisite waivers in our School. You need to approach our School’s curriculum coordinator for permission. Generally, the answer is no, except in extenuating circumstances. E-mail cs-curriculum@comp.nus.edu.sg and apply for the waiver on ModReg.
I need this course in order to graduate this coming semester. Can you grant me this exception?
Please ask your academic advisor to explain the situation to us. We will see what we can do, and perhaps help your academic advisor plan for future students accordingly.
I heard about some edX MOOC option. What is it?
Our instructors are working on the edX version of this course and have converted the first third into a 4-week MOOC, "Natural Language Processing: Foundations". You may be able to do this course through DYOM.

Assignments

Assignment 2 (Actively growing)

Q: Are pre-trained word embeddings (e.g. embeddings by GloVe and Word2Vec, or as some of you asked, Sentence-BERT embeddings) allowed?
A: No, pretrained word embeddings are not allowed, nor is training new ones. The objective of the assignment is to engineer your own text classification features, so you may not get performance as strong as a system that uses pretrained language models (PLMs).
Q: Is there a private test case/leaderboard for the 6th (competitive performance) component of Assignment 2?
A: Yes. Your component grade will be based on a combination of both the public and private leaderboard.
Q: For the competitive component, it says that 15% of the grade is on tf.idf-based Naïve Bayes classification. However, I found that NB performs much worse than LOGREG. If I use LOGREG as my base and improve on it, will I be able to get that 15%? I have attached the screenshot below for your kind review.
A: That's completely up to you. :D But, yes you may.

Assignment 1

Question 1. Regexes

Q1-1: The Q1D bonus requirement is vague: it seems that some matches are supposed to happen even though they are not in emoticons.txt. What are the expected inputs and outputs?
A1-1: Inputs are emoticons, optionally including rearranged variants of those in the given list; the rearranged ones are not required for grading. You may tweak the test cases (e.g. the one for ;p) as you want. Outputs are the matching results.

Q1-2: About Q1C, I’m not too sure what the intended solution here is, and the wording seems a bit vague on what is expected.
A1-2: You are only required to specify the relations between R3, R4, and R1 in your writeup. If you believe it is not possible to describe them as an FSM, justify this properly. Any plausible answer will be given full marks.

Question 2. Tokenization, Zipf's Law

Q2-1: Training the BPE tokenizer takes a while, so are we allowed to train once and write the merges to a text file for faster testing?
A2-1: A correct implementation of BPE won’t take long to train on the given corpus. Also, the vocabulary can be easily stored in main memory.

Q2-2: Can I just clarify what is meant by left-right byte order precedence for q1? (mentioned in the code file)
A2-2: It means to use the lexicographical order to decide precedence consistently when breaking ties.
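
As an illustration only, lexicographic tie-breaking during BPE training could be sketched as below. The function names (most_frequent_pair, merge_pair, learn_bpe) and the data layout are hypothetical and are not taken from the assignment's code file; the reading of "lexicographic order" as Python's tuple comparison over the symbol pair is also an assumption.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none.

    `words` maps a tuple of symbols to its frequency in the corpus.
    Assumption: ties in frequency go to the lexicographically smallest pair.
    """
    pair_counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    if not pair_counts:
        return None
    # Highest count wins; among equal counts, the smallest pair wins.
    return min(pair_counts, key=lambda pair: (-pair_counts[pair], pair))


def merge_pair(words, pair):
    """Replace every occurrence of `pair` by the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus_words, num_merges):
    """Learn up to `num_merges` merges from a list of word strings."""
    words = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges
```

For instance, with the toy corpus ["low", "low", "lower"], the pairs ('l', 'o') and ('o', 'w') both occur three times, and this sketch picks ('l', 'o') first because it sorts earlier.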

Q2-3: How should I store the vocabulary for tokenize_sentence when we cannot change its input format? And does the tokenize function return the vocabulary for the tokenizer?
A2-3: We won’t test the output of the tokenize method, so you can choose whichever output format you find most convenient. And yes, you should store the vocabulary as a class attribute so that other methods (e.g., plot_word_frequency) can use it conveniently.

Q2-4: For tokenize_sentence (BPE), is the sentence the training corpus? Or is the book text the training corpus and we tokenise the sentence according to the book’s vocabulary? (equivalent qn: for the method tokenize_sentence, do we have to tokenize the sentence that is given as input based on the vocabulary generated from the corpus (Pride and Prejudice) or should we create a new vocabulary from the sentence itself?)
A2-4: The sentence is not part of the training corpus; only the book text is the training corpus for BPE.
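
A minimal sketch of this separation, assuming the merges have already been learned from the book text and are kept on the object so that later calls can reuse them. The class name BpeTokenizerSketch, its constructor, and whitespace word splitting are hypothetical choices for illustration; the assignment's skeleton may differ.

```python
class BpeTokenizerSketch:
    """Hypothetical tokenizer that applies merges learned elsewhere."""

    def __init__(self, merges):
        # Merges learned once from the training corpus (the book text),
        # stored as an attribute so other methods can use them.
        self.merges = list(merges)

    def tokenize_sentence(self, sentence):
        """Tokenize a new sentence using only the stored merges."""
        tokens = []
        for word in sentence.split():
            symbols = list(word)
            # Replay the merges in the order in which they were learned.
            for left, right in self.merges:
                out, i = [], 0
                while i < len(symbols):
                    if (i + 1 < len(symbols)
                            and symbols[i] == left
                            and symbols[i + 1] == right):
                        out.append(left + right)
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                symbols = out
            tokens.extend(symbols)
        return tokens
```

Nothing is learned from the input sentence here: a word whose character pairs were never merged during training simply stays split into single characters.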

Q2-5: Are we allowed to use libraries like nltk or spaCy to help us with the tokenization?
A2-5: In this question, no.

Question 3. Language Modelling, Regular Expressions

Q3-1: Do we have to include the padding ‘~’ when calculating the perplexity of the text? Should the padding be counted in the vocabulary?
A3-1: Yes and yes.
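
To make the role of the padding concrete, here is a rough sketch for a bigram model. Everything beyond the ‘~’ symbol is an assumption for illustration: the function names, the use of a single leading pad, add-one smoothing, and normalising by the number of predicted tokens are not prescribed by the assignment.

```python
import math
from collections import Counter

PAD = "~"


def train_bigram(tokens):
    """Count bigrams over the padded token sequence."""
    padded = [PAD] + tokens
    vocab = set(padded)  # the padding symbol is counted in the vocabulary
    context_counts = Counter(padded[:-1])
    bigram_counts = Counter(zip(padded, padded[1:]))
    return vocab, context_counts, bigram_counts


def perplexity(tokens, vocab, context_counts, bigram_counts):
    """Perplexity of `tokens`, with the pad as context for the first token."""
    if not tokens:
        raise ValueError("cannot compute perplexity of an empty text")
    padded = [PAD] + tokens
    log_prob = 0.0
    for prev, word in zip(padded, padded[1:]):
        # Add-one smoothing (an assumption); len(vocab) includes the pad.
        p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))
```

In this sketch the padding affects the result in two places: it supplies the context for the first token, and it enlarges the vocabulary size used in the smoothing denominator.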

Q3-2: For the generate_word method, are we allowed to draw from a probability distribution over all the possible words succeeding the context, or do we have to always return the most common word following the context?
A3-2: You are allowed, and encouraged, to draw from a probability distribution, as this introduces randomness and diversity into your outputs.
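
One possible way to draw the next word in proportion to its count, sketched with the standard library. The signature shown here (a mapping of follower counts plus an optional random generator) is hypothetical and is not the assignment's generate_word signature.

```python
import random


def generate_word(following_counts, rng=random):
    """Sample one word with probability proportional to its count.

    `following_counts` maps each candidate next word to how often it
    followed the current context in training (hypothetical layout).
    """
    if not following_counts:
        raise ValueError("no candidate words for this context")
    words = list(following_counts)
    weights = [following_counts[word] for word in words]
    # random.choices performs weighted sampling with replacement.
    return rng.choices(words, weights=weights, k=1)[0]
```

Passing a seeded random.Random instance as rng makes the output reproducible while testing, whereas always returning the most common follower would produce the same continuation every time.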

Q3-3: After tokenization, should I keep ‘. ? !’ (punctuation) and put them in my ngram model?
A3-3: We suggest keeping those tokens in your ngrams, as they help with end-of-sequence prediction.

Q3-4: Should we ignore stop words, like ‘the’, ‘in’, ‘I’, during word prediction?
A3-4: No, it is not required. Stop words can occur in the word prediction.

Question 4. Theory Question, Language Modelling

Q4-1: Are we required to consider test corpora with zero probability for Q4? From my understanding, it’s possible to construct such a corpus, but then the answers to Q4 would become quite trivial. Also, given that the sum of probabilities over V+ is 1, does this mean that we only consider sentences from V+?
A4-1: We assume in this question that all words seen in any test corpus are in the vocabulary V and each word in any test corpus is seen at least once in training. And for the second question, yes.

Project

Team Formation

Q: Are we allowed to have cross tutorial teams for the project?
A: You don't need to be in the same tutorial group, although that might help.
Q: Can we indicate whom we wish to work with if we already have names in mind? Or is this fully random?
A: It is partially randomised. We will go over this during lecture and through other slides. You will have to fill out a survey to declare your mini-team members and your mini-team preferences. We'll then do an algorithmic match up.
Q: Do all members of a team have to fill out the Canvas subteam declaration survey? Or is it sufficient for 1 member to submit this?
A: One member should fill out the form. If you need to modify it, please have the same member edit their submission. 😎
Q: I was wondering if we must join a subgroup of 3? (I don’t really have friends taking this module..)
A: You can join as a subgroup of between 1 (just yourself) and 3 students. You can use the Discussion for Project Teammates to solicit other classmates looking for project partners.
Q: Must we use one of the specified datasets in the Project Dataset Description PDF?
A: Yes, you must use one of the specified datasets. You may use it in conjunction with any other dataset that you wish to use (some advanced teams augment their dataset's data with other datasets, suitably processed). Note that the prescription of the dataset does not restrict the purpose of the dataset, although most datasets have an intended purpose.
Q: Can we change the limit from 3 to another number?
A: No, the hard limit is 3. If you have additional interested members, please split your larger group into 2 or more groups of 3 or fewer.

Lectures

No questions at this time ... so no answers either.