wp.py 文件源码

python
阅读 20 收藏 0 点赞 0 评论 0

项目:prox-server 作者: mozilla-mobile 项目源码 文件源码
def _match_place_name_to_wiki_page(place_name, wiki_page_titles):
    """Work horse of `geosearch`: separated for easier testing & debugging.

    For example places we can't yet match, see `test_wp._CHALLENGE_PLACE_NAME_TO_WIKI`.

    Potential improvements:
    - Change existing dials (for each pass?): local vars (e.g. _THRESHOLD), radius/limit kwarg to Wikipedia API
    - Changes scorers on different passes, e.g. partial_ratio is more lenient than ratio.
    - Modify full_process processor: it removes non-letter-number characters so wiki disambiguation markup can cause
      undesired matching. For example, "Boulevard (restaurant)" becomes "boulevard  restaurant", which matches
      "mourad restaurant" at 79.
    - Add additional processors:
      - Modify plurals, articles, accents (full_process will just remove accented characters :( ).
      - Remove city/state name occurences in wiki pages, e.g. "San Francisco Ferry Building" -> "Ferry Building"
        could better match the Yelp "Ferry Building Marketplace" (disclaimer: US-centric)
    - Modify place_name query string. These may be better than their "remove" counterparts because adding more
      characters gives more information to try to match against and may produce more accurate results than removing characters.
      - (reverse ^) add city/state to place names: "Ferry Building Marketplace" -> "San Francisco Ferry Building Marketplace"
      - Reverse wiki_disambiguation_processor: add common wikipedia endings: (restaurant), (California), etc.
    - Consider running most lenient processors first, moving towards more strict, like a filter. Right now we run the
      strictest first.
    """
    # We run multiple processor passes: if there is no match, the next processor may be more lenient.
    for processor in _PLACE_NAME_TO_WIKI_PAGE_PROCESSORS:
        matches = process.extractBests(place_name, wiki_page_titles, scorer=_SCORER, processor=processor,
                                       score_cutoff=_THRESHOLD)
        if len(matches) >= 1:
            if len(matches) > 1:
                print('More than one match above threshold', matches, file=sys.stderr)
            return matches[0][0]
    return None
评论列表
文章目录


问题


面经


文章

微信
公众号

扫码关注公众号