Reading a csv file in Python with scipy / numpy

Posted on 2021-01-29 17:00:21

I am having trouble reading a tab-delimited csv file in Python. I use the following function:

def csv2array(filename, skiprows=0, delimiter='\t', raw_header=False, missing=None, with_header=True):
    """
    Parse a file name into an array. Return the array and additional header lines. By default,
    parse the header lines into dictionaries, assuming the parameters are numeric,
    using 'parse_header'.
    """
    f = open(filename, 'r')
    skipped_rows = []
    for n in range(skiprows):
        header_line = f.readline().strip()
        if raw_header:
            skipped_rows.append(header_line)
        else:
            skipped_rows.append(parse_header(header_line))
    f.close()
    if missing:
        data = genfromtxt(filename, dtype=None, names=with_header,
                          deletechars='', skiprows=skiprows, missing=missing)
    else:
        if delimiter != '\t':
            data = genfromtxt(filename, dtype=None, names=with_header, delimiter=delimiter,
                              deletechars='', skiprows=skiprows)
        else:
            data = genfromtxt(filename, dtype=None, names=with_header,
                              deletechars='', skiprows=skiprows)
    if data.ndim == 0:
        data = array([data.item()])
    return (data, skipped_rows)

The problem is that genfromtxt complains about my file, for example with this error:

Line #27100 (got 12 columns instead of 16)

I am not sure where these errors come from. Any ideas?

Here is a sample of the file that causes the problem:

#Gene   120-1   120-3   120-4   30-1    30-3    30-4    C-1 C-2 C-5 genesymbol  genedesc
ENSMUSG00000000001  7.32    9.5 7.76    7.24    11.35   8.83    6.67    11.35   7.12    Gnai3   guanine nucleotide binding protein alpha
ENSMUSG00000000003  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Pbsn    probasin

Is there a better way to write a generic csv2array function? Thanks.

1 Answer
  • 面试哥 2021-01-29

    Take a look at the Python csv module: http://docs.python.org/library/csv.html

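    import csv

    thereIsAHeader = True  # set to False if the file has no header row
    fields = 16            # expected number of columns per record

    header = []
    records = []

    # newline="" lets the csv module handle line endings itself (Python 3).
    with open("myfile.csv", newline="") as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)

        if thereIsAHeader:
            header = next(reader)

        # row is the zero-based index of the record after the header.
        for row, record in enumerate(reader):
            if len(record) != fields:
                print("Skipping malformed record %i, contains %i fields (%i expected)"
                      % (row, len(record), fields))
            else:
                records.append(record)

    # do numpy stuff.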
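
As for where the column-count error may come from: when genfromtxt is given no delimiter, it splits each line on any run of whitespace. The function in the question only passes `delimiter` when it is not a tab, so a free-text column such as genedesc would be split on its spaces. The sketch below illustrates this with plain string splitting rather than genfromtxt itself; the two rows are modeled on the sample in the question, under the assumption that the real file separates columns with tabs, and `count_fields` is a hypothetical helper introduced only for this illustration.

```python
# Two data rows modeled on the sample in the question.
# Assumption: the real file uses tabs between columns.
rows = [
    "ENSMUSG00000000001\t7.32\t9.5\t7.76\t7.24\t11.35\t8.83\t6.67\t11.35\t7.12"
    "\tGnai3\tguanine nucleotide binding protein alpha",
    "ENSMUSG00000000003\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0"
    "\tPbsn\tprobasin",
]


def count_fields(line, delimiter=None):
    """Count the fields in a line.

    delimiter=None splits on any run of whitespace (tabs and spaces alike);
    an explicit delimiter such as a tab splits only on that character.
    """
    return len(line.split(delimiter))


# Whitespace splitting breaks the multi-word description into extra fields.
whitespace_counts = [count_fields(r) for r in rows]
# Tab splitting keeps the description as a single field.
tab_counts = [count_fields(r, "\t") for r in rows]

print(whitespace_counts)  # [16, 12]
print(tab_counts)         # [12, 12]
```

If this is indeed the cause, always passing `delimiter='\t'` to genfromtxt for tab-separated files should give every row the same number of columns.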
    

