spacynlp.py source file (Python code snippet)

Project: scienceie17  Author: OC-ScienceIE

import numpy as np


def map_chars_to_tokens(doc):
    """
    Creates a mapping from input characters to corresponding input tokens

    For instance, given the input:

    Nuclear theory ...
    |||||||||||||||
    012345678911111...
              01234

    it returns an array whose length is the number of input chars plus one,
    which looks like this:

    000000011111112...

    This means that the first 7 chars map to the first token ("Nuclear"),
    the next 7 chars (including the initial whitespace) map to the second
    token ("theory") and so on.
    """
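    n_chars = len(doc.text_with_ws)
    # One slot per character, plus a final sentinel slot.
    char2token = np.zeros(n_chars + 1, dtype=int)
    start_char = 0
    for token in doc:
        # token.idx is the token's character offset in the document, so the
        # whitespace preceding a token is assigned to that token.
        end_char = token.idx + len(token)
        char2token[start_char:end_char] = token.i
        start_char = end_char
    if len(doc) > 0:
        # Trailing whitespace after the last token belongs to the last token.
        char2token[start_char:n_chars] = len(doc) - 1
    # Sentinel: one past the index of the last token (0 for an empty doc).
    char2token[-1] = len(doc)
    return char2token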
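
One plausible use of this mapping is converting character-offset annotations into token spans. The sketch below is only an illustration: `char_span_to_token_span` is a hypothetical helper that is not part of the original file, and the mapping is written out by hand as a plain list for the docstring example "Nuclear theory" rather than computed from a spaCy doc.

```python
# Hand-built mapping for the text "Nuclear theory" (14 characters), written as
# a plain list standing in for the array returned by map_chars_to_tokens.
text = "Nuclear theory"
char2token = [0] * 7 + [1] * 7 + [2]


def char_span_to_token_span(char2token, start_char, end_char):
    """Hypothetical helper (an assumption, not from the original project):
    map a character span [start_char, end_char) to a token span
    [start_token, end_token)."""
    if not 0 <= start_char < end_char < len(char2token):
        raise ValueError("character span out of range")
    start_token = char2token[start_char]
    # Look up the last character inside the span, then step one past it.
    end_token = char2token[end_char - 1] + 1
    return start_token, end_token


# "theory" occupies the character span [8, 14).
span = char_span_to_token_span(char2token, 8, 14)
print(text[8:14], span)
```

Looking up `end_char - 1` instead of `end_char` keeps the end of the token span exclusive without relying on the final sentinel element.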