Text File Processing
TF: the number of occurrences of a word in a document.
IDF: inverse document frequency, a measure of how rare a word is across the documents in the corpus.
The idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
TF-IDF: The tf–idf value increases proportionally to the number of times a word appears in
the document and is offset by the number of documents in the corpus that contain the word,
which helps to adjust for the fact that some words appear more frequently in general.
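The more documents a word appears in, the lower its value; the more often it appears within a single document, the higher its value.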
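A minimal sketch of the TF-IDF idea above. The notes do not give a formula for IDF, so this assumes the common unsmoothed form idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf} dict per tokenised document.

    Assumption: tf is the raw count and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

# Toy corpus (hypothetical): three tiny documents split on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)
```

In this toy corpus "the" appears in every document, so its score is zero, while "mat" appears in only one document and gets the highest score there.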
Collocations: multi-word expressions consisting of two or more words that occur more
frequently than by chance, and correspond to some conventional way of saying things.
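One way to make "more frequently than by chance" concrete is pointwise mutual information (PMI), which compares how often a word pair is observed with how often it would be expected if the two words were independent. This is only a sketch of one possible scoring method, not necessarily the one used in the lecture. The function name `bigram_pmi`, the `min_count` threshold and the toy text are assumptions.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by PMI = log2(P(w1, w2) / (P(w1) * P(w2)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    pmi = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # very rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        pmi[(w1, w2)] = math.log2(p_pair / p_indep)
    return pmi

# Toy text (hypothetical), already split on whitespace.
tokens = ("new york is big . i like new york . "
          "the big dog is new . the dog is big").split()
pmi = bigram_pmi(tokens)
```

A positive PMI means the pair occurs more often than independence would predict. Here ("new", "york") scores higher than ("is", "big"), because "york" almost always follows "new".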
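In the Jupyter Notebook: the Fundamentals of Text Pre-processing-checkpoint.ipynb
Splitting sentences (on whitespace); handling special tokens (U.S.A, websites, IP addresses, dollar amounts...)
Calling libraries to handle: verb tense, contractions, formatting...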
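A rough sketch of why plain whitespace splitting is not enough, and how special tokens can be kept in one piece. It uses only the standard `re` module rather than the libraries the notebook calls. The regular expression, the `CONTRACTIONS` table and the function names are illustrative assumptions, not the notebook's code.

```python
import re

# Alternatives are tried in order, so special patterns come before plain words.
TOKEN_RE = re.compile(r"""
    https?://\S+ | www\.\S+          # websites
  | \d{1,3}(?:\.\d{1,3}){3}          # IPv4-style addresses
  | \$\d+(?:,\d{3})*(?:\.\d+)?       # dollar amounts such as $1,200.50
  | (?:[A-Za-z]\.){2,}[A-Za-z]?      # dotted abbreviations such as U.S.A
  | \w+(?:'\w+)?                     # ordinary words, including don't
  | [^\w\s]                          # any other single punctuation mark
""", re.VERBOSE)

# Tiny hypothetical lookup table; a real one would be much larger.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def tokenize(text):
    """Split text into tokens while keeping special tokens intact."""
    return TOKEN_RE.findall(text)

def expand_contractions(tokens):
    """Replace known contractions with their expanded form."""
    out = []
    for tok in tokens:
        out.extend(CONTRACTIONS.get(tok.lower(), tok).split())
    return out

text = "The U.S.A site www.example.com at 192.168.0.1 costs $1,200.50 today."
tokens = tokenize(text)
expanded = expand_contractions(tokenize("I don't know"))
```

With `text.split()` the final "today." would keep its full stop attached, whereas the pattern above separates it and still leaves "U.S.A", the website, the IP address and the dollar amount as single tokens.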
Tutorial content
Jupyter Notebook: Tutorial_04_answer.ipynb
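Use a library to count the occurrences of each token, remove meaningless words (stop words) using the provided dictionary, identify a word's part of speech,
and count words that differ only in tense or in singular/plural form as the same word (e.g. finding & findings)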
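The tutorial steps can be approximated with the standard library alone. The sketch below is not the code from Tutorial_04_answer.ipynb: `STOP_WORDS` is a tiny made-up list standing in for the provided dictionary, and `crude_stem` is a deliberately rough stand-in for a real stemmer or lemmatizer. Part-of-speech tagging is left out because it needs a trained tagger.

```python
from collections import Counter

# Hypothetical stop-word list; the tutorial uses a provided dictionary instead.
STOP_WORDS = {"the", "a", "of", "and", "in", "are", "is", "we", "our"}

def crude_stem(word):
    """Very rough suffix stripping, so that finding and findings collapse."""
    for suffix in ("ings", "ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def word_counts(text):
    """Lowercase, split on whitespace, drop stop words, stem, then count."""
    tokens = [t.strip(".,;:!?") for t in text.lower().split()]
    kept = [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]
    return Counter(kept)

counts = word_counts(
    "The finding is new. Our findings are in the report, and the findings matter."
)
```

Here "finding" and both occurrences of "findings" are counted under the single key "find", and the stop words do not appear in the result. A real stemmer handles far more cases than this suffix list does.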