Reading a CSV file into scipy / numpy in Python
Posted on 2021-01-29 17:00:21
I am having trouble reading a tab-delimited CSV file in Python. I use the following function:
def csv2array(filename, skiprows=0, delimiter='\t', raw_header=False, missing=None, with_header=True):
    """
    Parse a file name into an array. Return the array and additional header lines. By default,
    parse the header lines into dictionaries, assuming the parameters are numeric,
    using 'parse_header'.
    """
    f = open(filename, 'r')
    skipped_rows = []
    for n in range(skiprows):
        header_line = f.readline().strip()
        if raw_header:
            skipped_rows.append(header_line)
        else:
            skipped_rows.append(parse_header(header_line))
    f.close()
    if missing:
        data = genfromtxt(filename, dtype=None, names=with_header,
                          deletechars='', skiprows=skiprows, missing=missing)
    else:
        if delimiter != '\t':
            data = genfromtxt(filename, dtype=None, names=with_header, delimiter=delimiter,
                              deletechars='', skiprows=skiprows)
        else:
            data = genfromtxt(filename, dtype=None, names=with_header,
                              deletechars='', skiprows=skiprows)
    if data.ndim == 0:
        data = array([data.item()])
    return (data, skipped_rows)
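The problem is that genfromtxt complains about my file, for example with the error:
Line #27100 (got 12 columns instead of 16)
I am not sure where these errors come from. Any ideas?
Here is a sample of the file that causes the problem: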
#Gene 120-1 120-3 120-4 30-1 30-3 30-4 C-1 C-2 C-5 genesymbol genedesc
ENSMUSG00000000001 7.32 9.5 7.76 7.24 11.35 8.83 6.67 11.35 7.12 Gnai3 guanine nucleotide binding protein alpha
ENSMUSG00000000003 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Pbsn probasin
Is there a better way to write a generic csv2array function? Thanks.
1 Answer
Take a look at the Python csv module: http://docs.python.org/library/csv.html

import csv

fields = 16             # expected number of columns
thereIsAHeader = True   # set to False if the file has no header row
header = []
records = []

# newline="" lets the csv module handle line endings itself
with open("myfile.csv", newline="") as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    if thereIsAHeader:
        header = next(reader)
    for row, record in enumerate(reader):
        if len(record) != fields:
            print("Skipping malformed record %i, contains %i fields (%i expected)"
                  % (row, len(record), fields))
        else:
            records.append(record)

# do numpy stuff.
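One possible explanation, judging only from the sample rows in the question: csv2array passes delimiter to genfromtxt only when it is not '\t', so for a tab-separated file genfromtxt falls back to splitting on any whitespace. The free-text genedesc column then produces a different number of fields on each line. In the sample, 11 fields plus the 5 words of "guanine nucleotide binding protein alpha" give 16, while 11 plus the single word "probasin" give 12, which would match the reported message. The sketch below counts fields both ways and then passes the tab delimiter explicitly. The SAMPLE string and the column_counts helper are made up for illustration and are not part of the original code. Note also that recent NumPy releases no longer accept the skiprows and missing keywords used in the question; skip_header and missing_values are the current names.

```python
import io
from collections import Counter

import numpy as np

# Hypothetical sample shaped like the question's file: tab-separated,
# with a free-text last column that contains spaces.
SAMPLE = (
    "#Gene\t120-1\t120-3\tgenesymbol\tgenedesc\n"
    "ENSMUSG00000000001\t7.32\t9.5\tGnai3\tguanine nucleotide binding protein alpha\n"
    "ENSMUSG00000000003\t0.0\t0.0\tPbsn\tprobasin\n"
)


def column_counts(text, delimiter=None):
    """Map 'number of fields' to 'number of lines with that many fields'.

    delimiter=None splits on any run of whitespace, which is how
    genfromtxt behaves when no delimiter is given.
    """
    counts = Counter()
    for line in text.splitlines():
        if line.strip():  # ignore blank lines
            counts[len(line.split(delimiter))] += 1
    return dict(counts)


# Whitespace splitting gives inconsistent widths; tab splitting does not.
whitespace_counts = column_counts(SAMPLE)
tab_counts = column_counts(SAMPLE, delimiter="\t")

# Passing the tab delimiter explicitly keeps the description in one field.
data = np.genfromtxt(
    io.StringIO(SAMPLE),
    delimiter="\t",
    names=True,        # field names come from the '#'-prefixed header line
    dtype=None,        # infer a type per column
    encoding="utf-8",  # return str rather than bytes for text columns
    deletechars="",    # as in the question: do not strip characters from names
)
```

If column_counts still reports more than one width when splitting on tabs, the file itself has malformed rows, and filtering them with the csv module as shown above (or passing invalid_raise=False to genfromtxt) is the more appropriate fix.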