Regular Expressions

  • regexpal.com
  • [wW]oodchuck = Woodchuck, woodchuck
  • ranges: [A-Z], [a-z], [0-9]
  • negation with ^ (only at the start of brackets): [^Ss], [^A-Z]; elsewhere ^ is literal, so a^b matches "a^b"
  • disjunction: pipe(|), yours|mine
  • ?, *, +, .: colou?r (= color, colour), o+h! (= oh!, ooh!, oooh!), beg.n (= begin, began, begun)
  • ^(start), $(end): ^[A-Z], \.$(\: escape)
  • capture groups: ([0-9]+); backreference \1 repeats the captured text, e.g. (fast)er … \1er (= faster)
  • non-capturing groups: /(?:some|a few) (people|cats)/ — (?:…) is not captured, so \1 refers to (people|cats)
  • lookahead: (?= …) positive lookahead, (?! …) negative lookahead (both zero-width: they assert without consuming text)
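The operators above can be tried directly in Python's re module (the phrase used with the backreference is an illustrative example, not from the notes):

```python
import re

# character class: matches "Woodchuck" or "woodchuck"
assert re.search(r"[wW]oodchuck", "a Woodchuck appeared")

# optionality with ?: colou?r matches both spellings
assert re.fullmatch(r"colou?r", "color")
assert re.fullmatch(r"colou?r", "colour")

# anchors: $ matches end of string; \. escapes the dot
assert re.search(r"\.$", "The end.")

# backreference: \1 repeats whatever group 1 captured
m = re.search(r"the (.*)er they were, the \1er they will be",
              "the faster they were, the faster they will be")
assert m and m.group(1) == "fast"

# lookahead: (?!...) asserts without consuming characters
assert re.search(r"foo(?!bar)", "foobaz")        # "foo" not followed by "bar"
assert not re.search(r"foo(?!bar)", "foobar")    # lookahead blocks the match
```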

Tokenization

  • Type vs Token
  • N = number of tokens
  • V = vocabulary (set of types)
  • NLP tasks: 1) tokenizing words, 2) normalizing word formats, 3) segmenting sentences
  • simple way to tokenize: split on space characters
  • issues: punctuation (like Ph.D.), clitics (like we’re), multiword expressions (like New York)
  • Chinese tokenization: no spaces between words, and deciding what counts as a word is complex, so often each character is treated as a token
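A quick sketch of space-based tokenization and the type/token distinction (the sentences are made-up examples):

```python
# Naive whitespace tokenization: N = number of tokens, |V| = number of types
text = "the cat sat on the mat"
tokens = text.split()
N = len(tokens)     # 6 tokens
V = set(tokens)     # 5 types: "the" occurs twice

# Why splitting on spaces is too simple: clitics and punctuation stay glued
print("We're in N.Y.!".split())   # ["We're", 'in', 'N.Y.!']
```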

Byte Pair Encoding

  • subword tokenization (tokens can be parts of words)
  • BPE, Unigram language modeling tokenization, WordPiece
  • repeatedly merge the most frequent pair of adjacent tokens into a new token, starting from characters
  • er -> er_ -> ne -> new -> lo -> low -> …
  • the most frequent subwords are often morphemes like -est or -er

Word Normalization

  • put tokens in a standard format
  • all letters to lower case (but, US vs us)
  • all words to lemma (he is reading -> he be read)
  • morphemes: the smallest meaningful units (stems, affixes)
  • Porter stemmer: ational -> ate, sses -> ss
  • abbreviation dictionary can help
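Case folding and the two suffix rules above can be sketched like this (only the two rules from the notes; the real Porter stemmer applies many ordered steps):

```python
# Toy suffix-rewrite rules in the spirit of the Porter stemmer
RULES = [("ational", "ate"), ("sses", "ss")]

def stem(word):
    # Apply the first matching suffix rule, if any
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

# Case folding loses information: "US" (the country) collapses into "us"
assert "US".lower() == "us"

assert stem("relational") == "relate"
assert stem("grasses") == "grass"
```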