I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory yet small enough to fit on a hard drive.

My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier-to-use alternative. My question is this: what are some best-practice workflows for accomplishing the following?

- Loading flat files into a permanent, on-disk database structure
- Querying that database to retrieve data to feed into a pandas data structure
- Updating the database after manipulating pieces in pandas

Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

Edit -- an example of how I would like this to work:

- Iteratively import a large flat file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
- In order to use pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
- I would create new columns by performing various operations on the selected columns.
- I would then have to append these new columns into the database structure.

I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables, it seems that appending a new column could be a problem.

Edit -- responding to Jeff's questions specifically:

- I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc. The datasets I use every day have nearly 1,000 to 2,000 fields on average, of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
- Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset.
- Finally, I would like to append these new columns into the on-disk data structure. I would then repeat step 2, exploring the data with crosstabs and descriptive statistics, trying to find interesting, intuitive relationships to model.
- A typical project file is usually about 1GB. Files are organized in such a manner that a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.
- It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say retail credit cards. To do this, I would select only those records where the line of business = retail, in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.
- The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of, say, 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.
- It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).
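A minimal sketch of the workflow described above, assuming an HDFStore-backed setup; the file name, store name, chunk size, and columns (var1, var2, line_of_business) are invented for illustration:

```python
import numpy as np
import pandas as pd

store = pd.HDFStore('store.h5', mode='w')

# 1. Iteratively import a large flat file into an on-disk table.
for chunk in pd.read_csv('consumer_data.csv', chunksize=100_000):
    # data_columns=True makes individual columns queryable later
    store.append('raw', chunk, data_columns=True)

# 2. Read only the columns needed into memory.
subset = store.select('raw', columns=['var1', 'var2', 'line_of_business'])

# 3. Build a new compound column with conditional logic.
subset['newvar'] = np.where(subset['var1'] > 2, 'A',
                            np.where(subset['var2'] == 4, 'B', ''))

# 4. PyTables cannot add columns to an existing table, so store the derived
#    column(s) as a separate table that shares the same index.
store.append('derived', subset[['newvar']], data_columns=True)

store.close()
```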


Is the ratio core size / full size 1%, 10%? Does it matter -- if you could compress cols to int8, or filter out noisy rows, would that change your compute-think loop from, say, hours to minutes? (Also add the tag large-data.)

Storing float32 instead of float64, and int8 where possible, should be trivial (though I don't know which tools/functions use float64 internally).
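As a rough illustration of the downcasting suggested here (column names and value ranges are made up), smaller dtypes can be applied explicitly in pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'prop_value': np.random.rand(1_000_000) * 500_000,    # float64 by default
    'n_accounts': np.random.randint(0, 100, 1_000_000),   # int64 by default
})
print(df.memory_usage(deep=True).sum())                   # baseline bytes

df['prop_value'] = df['prop_value'].astype('float32')
df['n_accounts'] = pd.to_numeric(df['n_accounts'], downcast='integer')  # int8 for this range
print(df.memory_usage(deep=True).sum())                   # well under half the baseline
```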

Can you split your task into chunks of work?

Thanks for the links. The second link makes me a bit worried that I can't append new columns to the tables in HDFStore? Is that correct? Also, I added an example of how I would use this setup.

The actual structure in the HDF file is up to you. PyTables is row-oriented, with fixed columns at creation time. You cannot append columns once a table is created. However, you can create a new table indexed the same as your existing table (see the select_as_multiple examples in the docs). This way you can create arbitrarily sized objects while having pretty efficient queries. The way you use the data is key to how it should be organized on disk. Send me an off-list e-mail with pseudo code of a more specific example.
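A rough sketch of the pattern described here, using HDFStore's append_to_multiple / select_as_multiple; the table and column names are invented:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4),
                  columns=['var1', 'var2', 'prop_value', 'default_flag'])

store = pd.HDFStore('store.h5', mode='w')

# Split one wide frame across two tables that share the same index;
# 'core' acts as the selector whose rows drive queries, None = all remaining columns.
store.append_to_multiple({'core': ['var1', 'var2'], 'extra': None},
                         df, selector='core', data_columns=['var1'])

# Query across both tables as if they were a single frame.
result = store.select_as_multiple(['core', 'extra'],
                                  where='var1 > 0', selector='core')

store.close()
```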

I have updated my question to respond to your detailed points. I will work on an example to send you off-list. Thanks!

Jeff, with pandas now at 0.17.x, have the issues outlined above been resolved?

Jeff, keen on adding a quick update to your answer to promote dask?

Agreed. With all the hype, it's easy to forget that command-line tools can be 235x faster than a Hadoop cluster.

And for a fully worked-out example with dask, just have a look here: stackoverflow.com/questions/37979167/…
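For readers who skip the link, the dask.dataframe pattern looks roughly like this (file pattern and column names are hypothetical); it mirrors much of the pandas API but evaluates lazily, chunk by chunk, from disk:

```python
import dask.dataframe as dd

# Lazily reference one or more CSV files that together exceed memory
df = dd.read_csv('consumer_data_*.csv')

# pandas-style operations only build a task graph; nothing runs yet
df['newvar'] = (df['var1'] > 2).astype('int8')

# .compute() materializes a result small enough to fit in memory
summary = df.groupby('line_of_business')['newvar'].mean().compute()
print(summary)
```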

"doing scientific computing on a notebook with 4GB of RAM isn't reasonable" Define reasonable. I think UNIVAC would take a different view. arstechnica.com/tech-policy/2011/09/…

Turns out performance is relative to time and place ;)

Agreed! Try to continue working in memory even if it costs $$ up front. If your work leads to a financial return, then over time you will recoup the expense through your increased efficiency.

Doing scientific computing on a workstation with 48GB of RAM isn't reasonable.

rjurney, sorry, maybe I should have deleted my comment. Your judgement of "unreasonable" scientific computing seems very subjective. I have done my scientific computations on laptops for years, and that seems enough for me, because most of the time I write code. My algorithms are much more difficult from a programming point of view than from a computational one. Also, I'm pretty sure that to write scalable algorithms one should not rely on current hardware limitations. Your comment on other people's computing may sound a bit offensive (apart from the subjectivity), would you mind deleting those few words?

Another library that might be worth looking at is GraphLab Create: it has an efficient DataFrame-like structure that is not limited by memory capacity. blog.dato.com/…

Hi, I'm playing around with your example as well, and I ran into this error when trying to insert into a database: In [96]: test.insert((a[1].to_dict() for a in df.iterrows())) --------------- InvalidDocument: Cannot encode object: 0. Any ideas what might be wrong? My dataframe consists of all int64 dtypes and is very simple.

Yeah, I did the same for a simple range DF, and the int64 from numpy seems to bother pymongo. All the data I have played with converts from CSV (vs artificially via range()) and has type long, hence no issues. In numpy you can convert, but I do see that as detracting. I must admit the 10.1 items for HDF look exciting.
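One possible workaround for the numpy.int64 encoding error, assuming a local MongoDB and made-up database/collection names, is to coerce values to plain Python ints before handing the records to pymongo:

```python
import pandas as pd
from pymongo import MongoClient

df = pd.DataFrame({'a': range(3), 'b': range(3)})  # int64 dtypes, as in the failing case

client = MongoClient()                    # assumes a local mongod is running
collection = client['testdb']['testcol']  # database/collection names are made up

# Convert numpy scalars to native Python ints so the BSON encoder accepts them
records = [{k: int(v) for k, v in row.items()}
           for row in df.to_dict(orient='records')]
collection.insert_many(records)
```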

Would you mind sharing with me some of your code? I am interested in how you load the data from some flat text format without knowing the data types before pushing to pytables. Also, it looks like you only work with data of one type. Is that correct?

First of all, I assume I know the types of the columns before loading, rather than trying to guess from the data. I save a JSON "data spec" file with the column names and types and use that when processing the data. (The file is usually some awful BCP output without any labels.) The data types I use are strings, floats, integers, or monthly dates. I turn the strings into ints by saving an enumeration table and convert the dates into ints (months past 2000), so I'm just left with ints and floats in my data, plus the enumeration. I save the floats as float64 now, but I experimented with float32.
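A loose sketch of this approach; the spec format, file names, and the '|' separator are all invented for illustration:

```python
import json
import pandas as pd

# Hypothetical "data spec": column names mapped to logical types, e.g.
# {"ssn": "str", "prop_value": "float", "open_month": "date"}
with open('data_spec.json') as f:
    spec = json.load(f)

date_cols = [col for col, t in spec.items() if t == 'date']
dtypes = {col: {'str': 'object', 'float': 'float64', 'int': 'int64'}[t]
          for col, t in spec.items() if t != 'date'}

# The flat file has no header row, so column names come from the spec.
df = pd.read_csv('bcp_dump.txt', sep='|', header=None,
                 names=list(spec), dtype=dtypes, parse_dates=date_cols)

# Strings -> ints, keeping the enumeration table for later decoding
enums = {}
for col, t in spec.items():
    if t == 'str':
        codes, labels = pd.factorize(df[col])
        df[col] = codes
        enums[col] = list(labels)

# Monthly dates -> integer months past 2000 (Jan 2000 == 0)
for col in date_cols:
    df[col] = (df[col].dt.year - 2000) * 12 + (df[col].dt.month - 1)

# The frame is now all ints and floats, ready for pytables/HDFStore.
```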

If you have time, please give this a try for external compatibility with R: pandas.pydata.org/pandas-docs/dev/…, and if you have difficulty, maybe we can tweak it.

I'll try, if I can. rhdf5 is a pain, since it's a Bioconductor package rather than just being on CRAN like h5r. I'm at the mercy of our technical architecture team, and there was some issue with rhdf5 last time I asked for it. In any case, it just seems a mistake to go row-oriented rather than column-oriented with an OLAP store, but now I'm rambling.

I have added this answer since Private's post shows up regularly on the suggested-for-deletion list for content and length.

Why not edit his answer?
