Python

Text processing in pandas with NLTK

Posted on 2021-01-29 19:34:34

Lowercasing and removing punctuation and digits does not work when I use nltk.

My code

stopwords=nltk.corpus.stopwords.words('english')+ list(string.punctuation)
user_defined_stop_words=['st','rd','hong','kong']                    
new_stop_words=stopwords+user_defined_stop_words

def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in new_stop_words and not word.isdigit()]

miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)

Sample input

23FLOOR 9 DES VOEUX RD WEST     HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG

Expected output

 floor des voeux west
 pag consulting flat aia central connaught central
 co city lost studios flat f hillier sheung

1 answer

面试哥 2021-01-29

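Your function is slow and incomplete. First, the issues -
1. You are not lowercasing your data: `word.lower()` is used only for the stopword comparison, and the original `word` is what gets returned.
2. You are not getting rid of digits and punctuation properly: `isdigit()` only rejects tokens made up entirely of digits, so tokens such as `23FLOOR` or `13-15` pass through.
3. You are not returning a string (you should join the list with `str.join` and return that).
4. Furthermore, a list comprehension that does text processing is a prime way to introduce readability issues, not to mention possible redundancy (you may end up calling a function several times, once for each `if` condition it appears in).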
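To make the first two issues concrete, here is a minimal sketch of the same filter. It assumes plain `str.split` as a stand-in for `word_tokenize` and a shortened, hypothetical stopword set, so it illustrates only the filtering logic, not NLTK's tokenisation.

```python
# Stand-ins (assumptions): str.split instead of word_tokenize,
# and a tiny stopword set instead of the full NLTK list.
stop = {"rd", "hong", "kong"}
tokens = "23FLOOR 9 DES VOEUX RD WEST HONG KONG".split()

# Same filter as in the question: lower() is used only for the comparison,
# and isdigit() is True only when every character of the token is a digit.
kept = [w for w in tokens if w.lower() not in stop and not w.isdigit()]

print(kept)                 # ['23FLOOR', 'DES', 'VOEUX', 'WEST']
print("23FLOOR".isdigit())  # False
print("13-15".isdigit())    # False
```

The result is still a list, still upper case, and still contains the digits embedded in `23FLOOR`.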
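Next, there are a few glaring inefficiencies in your function, especially in the stopword removal code.
1. Your `stopwords` structure is a list, and `in` checks on lists are slow. The first thing to do is convert it to a `set`, which makes the `not in` check constant time.
2. You are using `nltk.word_tokenize`, which is unnecessarily slow for this task.
3. Lastly, you should not always rely on `apply`, even when you are working with NLTK, where a vectorised solution is rarely available. There is almost always another way to do exactly the same thing, and often even a plain Python loop is faster. This is not set in stone, though.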
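The list-versus-set point can be checked with a small timing sketch. The word list below is hypothetical and the absolute numbers depend on the machine, so treat it as an illustration rather than a benchmark.

```python
import timeit

# Hypothetical stopword collection; the real one would come from NLTK.
words = [f"w{i}" for i in range(2000)]
as_list = words
as_set = set(words)

# A word that is absent forces the list check to scan every element.
probe = "not-a-stopword"

t_list = timeit.timeit(lambda: probe in as_list, number=2000)
t_set = timeit.timeit(lambda: probe in as_set, number=2000)

print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

Both containers give the same answer to the membership question; only the cost of asking differs.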
First, build your extended `stopwords` as a set -
```
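import string
import nltk  # the stopword corpus must be available: nltk.download('stopwords')

user_defined_stop_words = ['st', 'rd', 'hong', 'kong']

i = nltk.corpus.stopwords.words('english')
j = list(string.punctuation) + user_defined_stop_words

stopwords = set(i).union(j)  # a set gives constant-time membership checks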
```
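The next fix is to get rid of the list comprehension and turn it into a multi-line function. This makes things much easier to work with. Each line of the function should be dedicated to one specific task (for example, lowercasing, removing digits and punctuation, or removing stopwords) -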
```
import re

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())            # lowercase, then drop digits and punctuation
    x = [w for w in x.split() if w not in stopwords]  # remove stopwords (stopwords is already a set)
    return ' '.join(x)                                # join the tokens back into a string
```
This function would then be applied to your column with `apply` -
```
df['Clean_addr'] = df['Adj_Addr'].apply(preprocess)
```
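As a quick check against the sample input from the question, the following self-contained sketch runs a function of the same shape on the three lines. It assumes a small stand-in stopword set (the custom words plus `and`) so that it runs without the NLTK corpus.

```python
import re

# Assumption: a small stand-in for the NLTK-based stopword set.
stopwords = {"st", "rd", "hong", "kong", "and"}

def preprocess(x):
    x = re.sub(r"[^a-z\s]", "", x.lower())  # lowercase, drop digits/punctuation
    return " ".join(w for w in x.split() if w not in stopwords)

samples = [
    "23FLOOR 9 DES VOEUX RD WEST     HONG KONG",
    "PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL",
    "C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG",
]

cleaned = [preprocess(s) for s in samples]
for line in cleaned:
    print(line)
# floor des voeux west
# pag consulting flat aia central connaught central
# co city lost studios flat f hillier sheung
```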
As an alternative, here is an approach that does not rely on `apply`. It should work well for short sentences.

Load your data into a Series -
```
v = miss_data['Adj_Addr']
v

0            23FLOOR 9 DES VOEUX RD WEST     HONG KONG
1    PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT...
2    C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIE...
Name: Adj_Addr, dtype: object
```
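Now comes the heavy lifting.
1. Lowercase with `str.lower`
2. Remove noise with `str.replace`
3. Split the words into separate cells with `str.split`
4. Remove stopwords with `pd.DataFrame.isin` + `pd.DataFrame.where`
5. Finally, join the dataframe back together with `agg`.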
  
```
v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)

v.where(~v.isin(stopwords) & v.notnull(), '')\
 .agg(' '.join, axis=1)\
 .str.replace(r'\s+', ' ', regex=True)\
 .str.strip()

0                                 floor des voeux west
1    pag consulting flat aia central connaught central
2           co city lost studios flat f hillier sheung
dtype: object
```
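The five steps can also be written one per line, which makes the intermediate objects easier to inspect. This is only a sketch: it assumes pandas is installed, uses a small stand-in stopword set instead of the NLTK one, and passes `regex=True` explicitly because newer pandas versions treat the pattern in `str.replace` as a literal string by default.

```python
import pandas as pd

# Assumption: a small stand-in for the NLTK-based stopword set.
stopwords = {"st", "rd", "hong", "kong", "and"}

s = pd.Series([
    "23FLOOR 9 DES VOEUX RD WEST     HONG KONG",
    "C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG",
])

lowered = s.str.lower()                                     # step 1
cleaned = lowered.str.replace(r"[^a-z\s]", "", regex=True)  # step 2
cells = cleaned.str.split(expand=True)                      # step 3: one word per cell

# Step 4: blank out stopwords and the missing cells of shorter rows.
masked = cells.where(~cells.isin(stopwords) & cells.notnull(), "")

# Step 5: join each row, then collapse the runs of spaces left by blanked cells.
result = (
    masked.agg(" ".join, axis=1)
          .str.replace(r"\s+", " ", regex=True)
          .str.strip()
)

print(cells.shape)      # (2, 12)
print(result.tolist())
```

The shorter first row is padded with missing values after the split, which is why the `notnull` mask is needed before joining.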
To use this on multiple columns, put the code in a function, `preprocess2`, and call `apply` -
```
def preprocess2(v):
    v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)

    return v.where(~v.isin(stopwords) & v.notnull(), '')\
            .agg(' '.join, axis=1)\
            .str.replace(r'\s+', ' ', regex=True)\
            .str.strip()


c = ['Col1', 'Col2', ...] # columns to operate on
df[c] = df[c].apply(preprocess2, axis=0)
```
You still need one `apply` call, but with a small number of columns it should not scale too badly. If you dislike `apply`, here is a loop variant for you -
```
for _c in c:
    df[_c] = preprocess2(df[_c])
```
Let's compare our non-loop version with the `apply`-based version -
```
s = pd.concat([miss_data['Adj_Addr']] * 100000, ignore_index=True)  # repeat the 3 sample rows

s.size
300000
```
First, a sanity check -
```
preprocess2(s).eq(s.apply(preprocess)).all()
True
```
Now for the timings.
```
%timeit preprocess2(s)
1 loop, best of 3: 13.8 s per loop

%timeit s.apply(preprocess)
1 loop, best of 3: 9.72 s per loop
```
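This is surprising, because `apply` is seldom faster than a non-loop solution. It does make sense here, though: we have optimised `preprocess` quite a bit, and pandas string methods, although vectorised in their interface, rarely deliver the speed-up you would expect.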

Let's see if we can do better by bypassing `apply` and using `np.vectorize` -
```
preprocess3 = np.vectorize(preprocess)

%timeit preprocess3(s)
1 loop, best of 3: 9.65 s per loop
```
This is essentially the same as `apply`, but happens to be a bit faster because there is less overhead around the "hidden" loop.
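Since the answer notes earlier that a plain Python loop is often competitive, here is a sketch of that variant as well. It was not part of the timings above, so no speed is claimed for it. The helper name `preprocess_all` and the stand-in stopword set are assumptions made for illustration; with a Series you could pass `s.tolist()`.

```python
import re

# Assumption: a small stand-in for the NLTK-based stopword set.
stopwords = {"st", "rd", "hong", "kong", "and"}

# Compile the pattern once so the loop does not look it up on every call.
noise = re.compile(r"[^a-z\s]")

def preprocess_all(values):
    """Clean every string in `values` with an explicit Python loop."""
    out = []
    for x in values:
        words = noise.sub("", x.lower()).split()
        out.append(" ".join(w for w in words if w not in stopwords))
    return out

rows = [
    "23FLOOR 9 DES VOEUX RD WEST     HONG KONG",
    "PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL",
]
print(preprocess_all(rows))
# ['floor des voeux west', 'pag consulting flat aia central connaught central']
```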
