Regular Expressions

  • regexpal.com
  • [wW]oodchuck = Woodchuck, woodchuck
  • ranges: [A-Z], [a-z], [0-9]
  • negation with ^ (only at the start of brackets): [^Ss], [^A-Z]; elsewhere ^ is literal, so a^b matches "a^b"
  • disjunction: pipe(|), yours|mine
  • ?, *, +, .: colou?r (= color, colour), o+h! (= oh!, ooh!, oooh!), beg.n (= begin, began, begun)
  • ^(start), $(end): ^[A-Z], \.$(\: escape)
  • capture groups: ([0-9]+); backreference \1 repeats the captured text, e.g. (fast)er … \1er (= faster)
  • non-capturing groups: /(?:some|a few) (people|cats)/ — (?:…) is not captured, so \1 refers to (people|cats)
  • lookahead: (?= …) positive lookahead, (?! …) negative lookahead (both zero-width: they assert without consuming text)
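The operators above can be tried directly in Python's re module (the phrase used with the backreference is an illustrative example, not from the notes):

```python
import re

# character class: matches "Woodchuck" or "woodchuck"
assert re.search(r"[wW]oodchuck", "a Woodchuck appeared")

# optionality with ?: colou?r matches both spellings
assert re.fullmatch(r"colou?r", "color")
assert re.fullmatch(r"colou?r", "colour")

# anchors: $ matches end of string; \. escapes the dot
assert re.search(r"\.$", "The end.")

# backreference: \1 repeats whatever group 1 captured
m = re.search(r"the (.*)er they were, the \1er they will be",
              "the faster they were, the faster they will be")
assert m and m.group(1) == "fast"

# lookahead: (?!...) asserts without consuming characters
assert re.search(r"foo(?!bar)", "foobaz")        # "foo" not followed by "bar"
assert not re.search(r"foo(?!bar)", "foobar")    # lookahead blocks the match
```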

Tokenization

  • Type vs Token
  • N = number of tokens
  • V = vocabulary (set of types)
  • NLP tasks: 1) tokenizing words, 2) normalizing word formats, 3) segmenting sentences
  • simple way to tokenize: split on space characters
  • issues: punctuation (like Ph.D.), clitics (like we’re), multiword expressions (like New York)
  • Chinese tokenization: no spaces between words, and deciding what counts as a word is complex, so often each character is treated as a token
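A quick sketch of space-based tokenization and the type/token distinction (the sentences are made-up examples):

```python
# Naive whitespace tokenization: N = number of tokens, |V| = number of types
text = "the cat sat on the mat"
tokens = text.split()
N = len(tokens)     # 6 tokens
V = set(tokens)     # 5 types: "the" occurs twice

# Why splitting on spaces is too simple: clitics and punctuation stay glued
print("We're in N.Y.!".split())   # ["We're", 'in', 'N.Y.!']
```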

Byte Pair Encoding

  • subword tokenization (tokens can be parts of words)
  • BPE, Unigram language modeling tokenization, WordPiece
  • repeatedly merge the most frequent pair of adjacent tokens into a new token, starting from characters
  • er -> er_ -> ne -> new -> lo -> low -> …
  • the most frequent subwords are often morphemes like -est or -er

Word Normalization

  • put tokens in a standard format
  • all letters to lower case (but, US vs us)
  • all words to lemma (he is reading -> he be read)
  • morphemes: the smallest meaningful units (stems, affixes)
  • Porter stemmer: ational -> ate, sses -> ss
  • abbreviation dictionary can help
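Case folding and the two suffix rules above can be sketched like this (only the two rules from the notes; the real Porter stemmer applies many ordered steps):

```python
# Toy suffix-rewrite rules in the spirit of the Porter stemmer
RULES = [("ational", "ate"), ("sses", "ss")]

def stem(word):
    # Apply the first matching suffix rule, if any
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

# Case folding loses information: "US" (the country) collapses into "us"
assert "US".lower() == "us"

assert stem("relational") == "relate"
assert stem("grasses") == "grass"
```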