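For a natural language processing project, I was responsible for answer processing. The rough pipeline: take the question as the query and run information retrieval over the input article. The main technique here is tf-idf.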

tf-idf (term frequency–inverse document frequency)
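- tf is term frequency: how often a term appears in a given document
  - There are several ways to compute it; the simplest is (number of times the term appears in the document) / (total number of words in the document)
- idf is inverse document frequency: it measures how common a term is across the whole corpus. Common words such as "the" have a very small idf over English as a whole, because they appear almost everywhere
  - Formula: log(total number of documents in the corpus / (number of documents containing the term + 1)) (the +1 prevents a divide-by-zero error)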
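The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy corpus are made up for the example, and documents are assumed to be pre-tokenized lists of lowercase words.

```python
import math
from collections import Counter

def term_frequency(term, doc_tokens):
    """Occurrences of `term` in the document divided by the document's token count."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)

def inverse_document_frequency(term, corpus):
    """log(N / (df + 1)), where df is the number of documents containing `term`."""
    df = sum(1 for doc_tokens in corpus if term in doc_tokens)
    return math.log(len(corpus) / (df + 1))

def tf_idf(term, doc_tokens, corpus):
    """Product of the two quantities above."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)

# Toy corpus (hypothetical data): each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "a fish swam".split(),
]

tf_the = term_frequency("the", corpus[0])            # 2 of 6 tokens
idf_the = inverse_document_frequency("the", corpus)  # log(4 / (3 + 1))
idf_fish = inverse_document_frequency("fish", corpus)  # log(4 / (1 + 1))
```

One side effect of the +1 in the denominator: a term that appears in every document gets log(N / (N + 1)), which is slightly negative.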

tf-idf is used to find the sentence in the doc that is most relevant to the question.
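A minimal sketch of that sentence-selection step, under some assumptions that are mine rather than the project's: each sentence is treated as its own "document" for the idf count, tokenization is a simple regex, and a sentence's score is the sum of the tf-idf weights of the question's terms. The helper names and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def best_sentence(question, sentences):
    """Return the sentence with the highest summed tf-idf over the question's terms."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)

    def idf(term):
        # Each sentence counts as one document here.
        df = sum(1 for toks in tokenized if term in toks)
        return math.log(n / (df + 1))

    query_terms = set(tokenize(question))

    def score(toks):
        if not toks:
            return 0.0
        counts = Counter(toks)
        return sum((counts[t] / len(toks)) * idf(t) for t in query_terms)

    scores = [score(toks) for toks in tokenized]
    best_index = max(range(n), key=scores.__getitem__)
    return sentences[best_index], scores

sentences = [
    "Pittsburgh is a city in Pennsylvania.",
    "The university was founded in 1900 by Andrew Carnegie.",
    "It has many bridges.",
    "The city hosts several sports teams.",
]
answer_sentence, scores = best_sentence("When was the university founded?", sentences)
print(answer_sentence)
```

Cosine similarity between tf-idf vectors is a common alternative scoring rule; the summed weight is used here only because it is the shortest thing to write down.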

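Once that sentence is found, I need syntactic methods to extract the answer from it.
The main analysis tools are a dependency parser and a constituency parser.
First, a comparison of the two kinds of parse: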

constituency tree:
The usual phrase-structure form, e.g. S -> NP VP
dependency tree:
Every phrase has a head, and the other words depend on it. For example:
I love you.
The head is love, the subject is I, and the object is you.
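To make the contrast concrete, here is a small sketch that writes both views of "I love you" as plain Python data. The parses are written by hand, not produced by a parser, and the `Token` tuple, the helper names, and the use of `None` to mark the root are conventions chosen only for this example.

```python
from collections import namedtuple

# Dependency view: every token stores the index of its head and the relation to it.
Token = namedtuple("Token", ["text", "head", "dep"])

dependency = [
    Token("I", 1, "nsubj"),       # subject of "love"
    Token("love", None, "ROOT"),  # the head of the sentence
    Token("you", 1, "dobj"),      # object of "love"
]

def root(tokens):
    """Return the token that has no head."""
    return next(t for t in tokens if t.head is None)

def dependents(tokens, head_index, dep):
    """Texts of the tokens attached to `head_index` with relation `dep`."""
    return [t.text for t in tokens if t.head == head_index and t.dep == dep]

# Constituency view: nested (label, children...) tuples, following S -> NP VP.
constituency = ("S", ("NP", "I"), ("VP", ("V", "love"), ("NP", "you")))

def leaves(tree):
    """Collect the words of a constituency tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

print(root(dependency).text, dependents(dependency, 1, "nsubj"), leaves(constituency))
```

The dependency view answers "who did what to whom" directly from the labels, while the constituency view groups words into phrases such as NP and VP.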

I mainly use the spaCy parser API:

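import spacy

# The 'en' shortcut works in spaCy 1.x/2.x; spaCy 3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
# For each noun chunk: its text, its root word, the root's dependency label, and the root's head
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)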
Text: The original noun chunk text.
Root text: The original text of the word connecting the noun chunk to the rest of the parse.
Root dep: Dependency relation connecting the root to its head.
Root head text: The text of the root token's head.
(u'Autonomous cars', u'cars', u'nsubj', u'shift')
(u'insurance liability', u'liability', u'dobj', u'shift')
(u'manufacturers', u'manufacturers', u'pobj', u'toward')

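The question types to handle are who, what, when, where, yes/no, and why.
- For when and where, I parse the sentence first, traverse the tree to collect all the PPs, then run an NER tagger over them to check whether any of them contains a time or LOCATION entity.
- Yes/no is simpler: run the dependency parse and look for a neg relation; if there is none, the answer is true.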
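The two rules above can be sketched without a parser by feeding in hand-written parser and tagger output. Everything here is illustrative: the nested-tuple tree format, the (text, label) entity pairs, the label sets in `WANTED`, and the function names are assumptions for the example, not the project's actual code. With spaCy, the (text, dep) pairs would come from each token's `text` and `dep_` attributes.

```python
def leaves(tree):
    """Words of a nested (label, children...) tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

def find_pps(tree):
    """Text of every PP subtree, in pre-order."""
    if isinstance(tree, str):
        return []
    found = [" ".join(leaves(tree))] if tree[0] == "PP" else []
    for child in tree[1:]:
        found.extend(find_pps(child))
    return found

# Entity labels that count as an answer for each question type (assumed label names).
WANTED = {"when": {"DATE", "TIME"}, "where": {"LOCATION", "GPE", "LOC"}}

def answer_when_where(question_type, tree, entities):
    """Return the first PP containing an entity of the wanted type, else None."""
    wanted = WANTED[question_type]
    for pp in find_pps(tree):
        # Substring matching is a shortcut; real code would compare token spans.
        if any(label in wanted and text in pp for text, label in entities):
            return pp
    return None

def answer_yes_no(tokens):
    """tokens: (text, dep) pairs. True when no token carries the 'neg' relation."""
    return not any(dep == "neg" for _, dep in tokens)

# Hand-written parse and entities for "The university was founded in Pittsburgh in 1900".
tree = ("S",
        ("NP", ("DT", "The"), ("NN", "university")),
        ("VP", ("VBD", "was"),
         ("VP", ("VBN", "founded"),
          ("PP", ("IN", "in"), ("NP", ("NNP", "Pittsburgh"))),
          ("PP", ("IN", "in"), ("NP", ("CD", "1900"))))))
entities = [("Pittsburgh", "LOCATION"), ("1900", "DATE")]

when_answer = answer_when_where("when", tree, entities)
where_answer = answer_when_where("where", tree, entities)
negated = [("He", "nsubj"), ("did", "aux"), ("not", "neg"), ("go", "ROOT")]
```

Returning the whole PP keeps the preposition ("in 1900" rather than "1900"), which usually reads more naturally as an answer to a when or where question.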

answer_processing

question generation