lda_train.py source code - Python code snippet

python

Project: Sentences-analysis  Author: sungminoh

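from gensim import utils  # assumed: utils.to_unicode here is gensim's helper
from unidecode import unidecode

# PUNCTUATION, EMOTICON and WHITESPACE are compiled regular expressions
# defined elsewhere in the module (not shown in this snippet).


def preprocess(post):
  # Example:
  #   {(romeo and juliet 2013),(romeo and juliet),(douglas booth),(hailee steinfeld)}
  #   -> romeo and juliet 2013 romeo and juliet douglas booth hailee steinfeld
  print(post)  # debug output of the raw post

  # Replace every punctuation character with a space.
  post = PUNCTUATION.sub(' ', utils.to_unicode(post))

  # Replace each emoji with the token '_emoticon_', padded with spaces.
  post = EMOTICON.sub(' _emoticon_ ', post)

  # Transliterate non-ASCII characters to their closest ASCII equivalents,
  # dropping anything that cannot be represented in ASCII.
  post = unidecode(post).encode('ascii', 'ignore').decode('ascii')

  # Collapse runs of whitespace into a single space and trim both ends.
  post = WHITESPACE.sub(' ', post).strip()
  return utils.to_unicode(post)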
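
The snippet depends on three module-level patterns that are not shown. As a rough, standard-library-only sketch of the same pipeline, the version below fills them in with hypothetical definitions: the regex bodies for `PUNCTUATION`, `EMOTICON` and `WHITESPACE` are assumptions, not the project's actual patterns, and `preprocess_sketch` is an illustrative name. It also swaps `unidecode` for `unicodedata.normalize('NFKD', ...)`, which only strips accents and drops other non-ASCII characters rather than transliterating them.

```python
import re
import string
import unicodedata

# Hypothetical patterns: the original module defines PUNCTUATION, EMOTICON and
# WHITESPACE elsewhere, and their real definitions are not shown.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))
EMOTICON = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')
WHITESPACE = re.compile(r'\s+')


def preprocess_sketch(post):
    # Replace every punctuation character with a space.
    post = PUNCTUATION.sub(' ', post)
    # Replace each emoji with a space-padded placeholder token.
    post = EMOTICON.sub(' _emoticon_ ', post)
    # Stand-in for unidecode: decompose accented characters, then drop
    # whatever is still non-ASCII.
    post = unicodedata.normalize('NFKD', post)
    post = post.encode('ascii', 'ignore').decode('ascii')
    # Collapse runs of whitespace into a single space and trim both ends.
    return WHITESPACE.sub(' ', post).strip()


raw = '{(romeo and juliet 2013),(romeo and juliet),(douglas booth),(hailee steinfeld)}'
print(preprocess_sketch(raw))
```

Punctuation is stripped before the placeholder is inserted, mirroring the order in the original function; otherwise the underscores in `_emoticon_` would be removed along with the rest of the punctuation.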