# Regular Expression
- regexpal.com
- [wW]oodchuck = Woodchuck, woodchuck
- ranges: [A-Z], [a-z], [0-9]
- negations: [^Ss], [^A-Z] (^ negates only when it is first inside brackets, so a^b matches the literal string a^b)
- disjunction: pipe (|), yours|mine (= yours, mine)
- ?, *, +, .: colou?r (= color, colour), o+h! (= oh!, ooh!, oooh!, …), beg.n (= begin, began)
- ^ (start), $ (end): ^[A-Z], \.$ (\ escapes the period so it matches literally)
- capture groups: ([0-9]+) captures a group, and \1 is a backreference to it (if the group captured fast, \1er matches faster)
- non-capturing groups: /(?:some|a few) (people|cats)/, where (?:…) groups without capturing, so \1 refers to (people|cats)
- lookahead: (?= …) matches only if the pattern follows (zero-width), (?! …) matches only if it does not
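
A quick sketch of these operators using Python's `re` module (the test strings are invented for illustration):

```python
import re

# character class: [wW] matches either case
assert re.search(r"[wW]oodchuck", "a woodchuck chucking wood")

# ranges and negation
assert re.findall(r"[0-9]+", "room 101, floor 3") == ["101", "3"]
assert re.search(r"[^A-Z]", "USa")            # matches the lowercase "a"

# optionality (?), zero-or-more (*), one-or-more (+), wildcard (.)
assert re.fullmatch(r"colou?r", "color") and re.fullmatch(r"colou?r", "colour")
assert re.fullmatch(r"oo*h!", "oh!")          # * means zero or more
assert re.fullmatch(r"o+h!", "oooh!")         # + means one or more
assert re.fullmatch(r"beg.n", "began")        # . matches any single character

# anchors: ^ for line start, $ for line end, \ escapes the period
assert re.search(r"^[A-Z]", "The end.")
assert re.search(r"\.$", "The end.")

# capture group + backreference: \1 repeats whatever group 1 matched
assert re.search(r"the (.+)er they were, the \1er they will be",
                 "the faster they were, the faster they will be")

# non-capturing group: (?:...) groups without creating a backreference,
# so group(1) is (people|cats), not the (?:...) part
m = re.search(r"(?:some|a few) (people|cats)", "a few cats sat down")
assert m.group(1) == "cats"

# lookahead: (?=...) requires a match ahead, (?!...) forbids one
assert re.search(r"\d+(?= dollars)", "50 dollars").group() == "50"
assert re.search(r"\d+(?! dollars)", "50 euros").group() == "50"
```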
# Tokenization
- type vs. token: a type is a distinct word; each occurrence of a type is a token
- N = number of tokens
- V = vocabulary (set of types)
- NLP tasks: 1) tokenizing words, 2) normalizing word formats, 3) segmenting sentences
- simple way to tokenize: split on space characters
- issues: punctuation (like Ph.D.), clitics (like we're), multiword expressions (like New York)
- Chinese tokenization: Chinese is written without spaces, and deciding what counts as a word is complex, so each character is often treated as a token
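
A minimal sketch of the type/token distinction and of why whitespace splitting is too naive (the sample sentence is made up):

```python
import re

text = "We're moving to New York. New York is big."

# naive: split on whitespace; punctuation sticks to words ("York.")
# and the clitic in "We're" stays fused
tokens = text.split()
N = len(tokens)      # N = number of tokens -> 9
V = set(tokens)      # V = vocabulary, the set of types -> 8
print(N, len(V))     # "York." and "York" wrongly count as distinct types

# a slightly better regex tokenizer: words (keeping internal apostrophes)
# or single punctuation marks; multiword units like "New York" still split
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(tokens)        # ["We're", 'moving', 'to', 'New', 'York', '.', ...]
```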
# Byte Pair Encoding
- subword tokenization (tokens can be parts of words)
- BPE, Unigram language modeling tokenization, WordPiece
- repeatedly merge the most frequent pair of adjacent tokens, starting from individual characters
- example merge sequence: er -> er_ -> ne -> new -> lo -> low -> …
- the most frequent subwords are often morphemes like -est or -er
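
A toy sketch of BPE training; the corpus and the `_` end-of-word marker follow the common textbook example, and tie-breaking details may differ from real implementations:

```python
from collections import Counter

# toy corpus: 5x low, 2x lowest, 6x newer, 3x wider, 2x new
corpus = ["low"] * 5 + ["lowest"] * 2 + ["newer"] * 6 + ["wider"] * 3 + ["new"] * 2

# start with each word as a sequence of characters plus an end-of-word symbol
vocab = Counter(tuple(word) + ("_",) for word in corpus)

def most_frequent_pair(vocab):
    """Count every pair of adjacent tokens across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    """Replace every occurrence of `pair` with the concatenated token."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for step in range(6):
    pair = most_frequent_pair(vocab)
    vocab = merge(vocab, pair)
    # expected merges: (e,r), (er,_), (n,e), (ne,w), (l,o), (lo,w)
    print(step + 1, pair)
```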
# Word Normalization
- put tokens in a standard format
- case folding: map all letters to lower case (but information can be lost: US vs us)
- lemmatization: map each word to its lemma (he is reading -> he be read)
- morphemes: the smallest meaningful units (stems, affixes)
- Porter stemmer: applies suffix rules such as ational -> ate, sses -> ss
- for sentence segmentation, an abbreviation dictionary can help decide whether a period ends a sentence or marks an abbreviation
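
A toy sketch of normalization: case folding plus the two suffix rules listed above. The real Porter stemmer applies many more rules in ordered steps; this is only an illustration:

```python
# two Porter-style rules from the notes; the real stemmer has many more
RULES = [("ational", "ate"), ("sses", "ss")]

def normalize(token: str) -> str:
    token = token.lower()                    # case folding
    for suffix, replacement in RULES:
        if token.endswith(suffix):
            return token[: -len(suffix)] + replacement
    return token

print(normalize("Relational"))   # relate
print(normalize("grasses"))      # grass
print(normalize("US"))           # us: case folding erases the US/us distinction
```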