データが主食 (Data Is My Staple Food)

A data engineer's notebook: analyses, notes on books I have read, and so on.

Measuring the speed of read_csv in modin.pandas

A while ago, an article comparing the speed of various ways to run read_csv with pandas and dask got a lot of attention.

yutori-datascience.hatenablog.com

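I recently came across modin, a pandas variant that speeds things up with distributed processing, so this post runs the same kind of comparison with it.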

github.com

What is modin?

modin is a research product of RISELab at UC Berkeley. RISELab is the lab of Prof. Ion Stoica, who is well known for Spark.

We have focused heavily on bridging the solutions between DataFrames for small data (e.g. pandas) and large data. Often data scientists require different tools for doing the same thing on different sizes of data. The DataFrame solutions that exist for 1KB do not scale to 1TB+, and the overheads of the solutions for 1TB+ are too costly for datasets in the 1KB range. With Modin, because of its light-weight, robust, and scalable nature, you get a fast DataFrame at small and large data. With preliminary cluster and out of core support, Modin is a DataFrame library with great single-node performance and high scalability in a cluster.

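From https://github.com/modin-project/modin/blob/master/README.md

  • It bridges small data of the kind handled with pandas and large data.
  • Tools used to analyze 1 KB of data often cannot be used at 1 TB+, but modin handles everything from small to large data with the same code.
  • With preliminary cluster and out-of-core support, it aims for both single-node performance and cluster scalability.

That said, it is still under development, so it is better to avoid using it in production.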
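
Because modin.pandas is meant to mirror the pandas API, the usual way to try it is to change only the import line. The sketch below illustrates that idea and is not taken from the modin documentation: `pick_dataframe_backend` is a hypothetical helper name, and it simply returns the first importable module from a list, so the same calling code can run on modin where it is installed and on plain pandas otherwise.

```python
import importlib


def pick_dataframe_backend(candidates=("modin.pandas", "pandas")):
    """Return (name, module) for the first importable candidate, else (None, None).

    Hypothetical helper for illustration; it is not part of modin or pandas.
    """
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            # Not installed (or its dependencies do not match); try the next one.
            continue
    return None, None
```

Usage would look like `name, pd = pick_dataframe_backend()` followed by `pd.read_csv("random.csv")`; the rest of the analysis code stays the same whichever backend was picked.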

Test environment

Set up on EC2 (c4.4xlarge) with the following commands.*1

sudo yum install python3
sudo pip-3.7 install modin dask Benchmarker toolz cloudpickle pandas
sudo pip-3.7 install pandas==0.24.1
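
Given the version error described in footnote *1, it can save time to check the pin before starting a long run. As a small sketch (the helper name `pandas_matches_pin` is my own, and the default pin is the version named in that error message), the standard library's `importlib.metadata` can report the installed version. Note that `importlib.metadata` needs Python 3.8 or later, which is newer than the 3.7.2 used for the runs in this post.

```python
from importlib import metadata


def pandas_matches_pin(required="0.24.1", package="pandas"):
    """Return (ok, installed); installed is None when the package is missing.

    Hypothetical helper: compares the installed version string with the pin.
    """
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False, None
    return installed == required, installed
```

If `ok` comes back as False, reinstalling with the pinned version (as in the last command above) would be the next step.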

Benchmark script

from benchmarker import Benchmarker
import numpy as np

from modin import pandas as mpd
import pandas as ppd
import dask.dataframe as ddf


def run(row, col, loop=3):
    # Write a row x col table of random floats to the CSV file under test.
    df_random = ppd.DataFrame(np.random.randn(row, col))
    df_random.to_csv("random.csv")

    with Benchmarker(loop) as bench:

        @bench(f'original pandas row:{row} col:{col}')
        def original_pandas_read(bm):
            ppd.read_csv('random.csv')

        @bench(f'modin pandas row:{row} col:{col}')
        def modin_read(bm):
            mpd.read_csv('random.csv')

        @bench(f'dask pandas row:{row} col:{col}')
        def dask_read(bm):
            # dask is lazy, so compute() forces the read into a pandas DataFrame.
            df = ddf.read_csv('random.csv')
            df = df.compute()


if __name__ == "__main__":
    # Row counts grow by 10x per step; the column count is fixed at 273.
    for r in [1, 10, 100,
              1_000, 10_000, 100_000,
              1_000_000, 10_000_000, 100_000_000]:
        run(row=r, col=273)

The data sizes and other settings follow "PythonでCSVを高速&省メモリに読みたい - tkm2261's blog" (a post on reading CSV quickly and with little memory in Python).

Full output
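
For readers without the Benchmarker package, the wall-clock ("real") part of the measurement can be approximated with the standard library alone. This is only a rough sketch, not a description of what Benchmarker does internally: `time_call` is a hypothetical helper that runs a callable a few times and returns each elapsed time measured with `time.perf_counter`.

```python
import time


def time_call(func, repeat=3):
    """Run func() `repeat` times and return the wall-clock seconds of each run."""
    elapsed = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        elapsed.append(time.perf_counter() - start)
    return elapsed


if __name__ == "__main__":
    # Stand-in workload; in the benchmark this would be something like
    # lambda: ppd.read_csv("random.csv")
    times = time_call(lambda: sum(range(100_000)))
    print(f"best of {len(times)}: {min(times):.4f} s")
```

Reporting the minimum of several runs is one common choice because it is less sensitive to other activity on the machine; the mean or median would be equally easy to compute from the returned list.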

## benchmarker:         release 4.0.1 (for python)
## python version:      3.7.2
## python compiler:     GCC 7.3.1 20180303 (Red Hat 7.3.1-5)
## python platform:     Linux-4.14.104-95.84.amzn2.x86_64-x86_64-with-glibc2.2.5
## python executable:   /usr/bin/python3
## cpu model:           Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz  # 2900.272 MHz
## parameters:          loop=3, cycle=1, extra=0

##                                       real    (total    = user    + sys)
original pandas row:1 col:273          0.0295    0.0300    0.0200    0.0100
modin pandas row:1 col:273             0.3122    0.1400    0.1400    0.0000
dask pandas row:1 col:273              0.1540    0.1500    0.1500    0.0000

## Ranking                               real
original pandas row:1 col:273          0.0295  (100.0) ********************
dask pandas row:1 col:273              0.1540  ( 19.2) ****
modin pandas row:1 col:273             0.3122  (  9.5) **

## Matrix                                real    [01]    [02]    [03]
[01] original pandas row:1 col:273     0.0295   100.0   521.5  1056.7
[02] dask pandas row:1 col:273         0.1540    19.2   100.0   202.6
[03] modin pandas row:1 col:273        0.3122     9.5    49.3   100.0

##                                       real    (total    = user    + sys)
original pandas row:10 col:273         0.0147    0.0200    0.0100    0.0100
modin pandas row:10 col:273            0.1212    0.0700    0.0700    0.0000
dask pandas row:10 col:273             0.1208    0.1200    0.1200    0.0000

## Ranking                               real
original pandas row:10 col:273         0.0147  (100.0) ********************
dask pandas row:10 col:273             0.1208  ( 12.2) **
modin pandas row:10 col:273            0.1212  ( 12.1) **

## Matrix                                real    [01]    [02]    [03]
[01] original pandas row:10 col:273    0.0147   100.0   821.3   824.1
[02] dask pandas row:10 col:273        0.1208    12.2   100.0   100.3
[03] modin pandas row:10 col:273       0.1212    12.1    99.7   100.0

##                                       real    (total    = user    + sys)
original pandas row:100 col:273        0.0195    0.0200    0.0200    0.0000
modin pandas row:100 col:273           0.1182    0.0700    0.0700    0.0000
dask pandas row:100 col:273            0.1300    0.1300    0.1300    0.0000

## Ranking                               real
original pandas row:100 col:273        0.0195  (100.0) ********************
modin pandas row:100 col:273           0.1182  ( 16.5) ***
dask pandas row:100 col:273            0.1300  ( 15.0) ***

## Matrix                                real    [01]    [02]    [03]
[01] original pandas row:100 col:273    0.0195   100.0   606.0   666.7
[02] modin pandas row:100 col:273      0.1182    16.5   100.0   110.0
[03] dask pandas row:100 col:273       0.1300    15.0    90.9   100.0

##                                       real    (total    = user    + sys)
original pandas row:1000 col:273       0.0881    0.0900    0.0800    0.0100
modin pandas row:1000 col:273          0.1316    0.0600    0.0600    0.0000
dask pandas row:1000 col:273           0.2032    0.2100    0.1900    0.0200

## Ranking                               real
original pandas row:1000 col:273       0.0881  (100.0) ********************
modin pandas row:1000 col:273          0.1316  ( 66.9) *************
dask pandas row:1000 col:273           0.2032  ( 43.3) *********

## Matrix                                real    [01]    [02]    [03]
[01] original pandas row:1000 col:273    0.0881   100.0   149.4   230.7
[02] modin pandas row:1000 col:273     0.1316    66.9   100.0   154.5
[03] dask pandas row:1000 col:273      0.2032    43.3    64.7   100.0

##                                       real    (total    = user    + sys)
original pandas row:10000 col:273      0.7424    0.7400    0.7100    0.0300
modin pandas row:10000 col:273         0.2286    0.0700    0.0700    0.0000
dask pandas row:10000 col:273          0.7186    1.0600    0.9600    0.1000

## Ranking                               real
modin pandas row:10000 col:273         0.2286  (100.0) ********************
dask pandas row:10000 col:273          0.7186  ( 31.8) ******
original pandas row:10000 col:273      0.7424  ( 30.8) ******

## Matrix                                real    [01]    [02]    [03]
[01] modin pandas row:10000 col:273    0.2286   100.0   314.4   324.8
[02] dask pandas row:10000 col:273     0.7186    31.8   100.0   103.3
[03] original pandas row:10000 col:273    0.7424    30.8    96.8   100.0

##                                       real    (total    = user    + sys)
original pandas row:100000 col:273     7.1640    7.1600    6.8700    0.2900
modin pandas row:100000 col:273        1.5127    0.0900    0.0900    0.0000
dask pandas row:100000 col:273         3.4386   14.3200   12.8400    1.4800

## Ranking                               real
modin pandas row:100000 col:273        1.5127  (100.0) ********************
dask pandas row:100000 col:273         3.4386  ( 44.0) *********
original pandas row:100000 col:273     7.1640  ( 21.1) ****

## Matrix                                real    [01]    [02]    [03]
[01] modin pandas row:100000 col:273    1.5127   100.0   227.3   473.6
[02] dask pandas row:100000 col:273    3.4386    44.0   100.0   208.3
[03] original pandas row:100000 col:273    7.1640    21.1    48.0   100.0

##                                       real    (total    = user    + sys)
original pandas row:1000000 col:273   71.5742   71.6100   68.2400    3.3700
modin pandas row:1000000 col:273      13.4387    0.0900    0.0900    0.0000
dask pandas row:1000000 col:273       32.5911  129.6800  121.1700    8.5100

## Ranking                               real
modin pandas row:1000000 col:273      13.4387  (100.0) ********************
dask pandas row:1000000 col:273       32.5911  ( 41.2) ********
original pandas row:1000000 col:273   71.5742  ( 18.8) ****

## Matrix                                real    [01]    [02]    [03]
[01] modin pandas row:1000000 col:273   13.4387   100.0   242.5   532.6
[02] dask pandas row:1000000 col:273   32.5911    41.2   100.0   219.6
[03] original pandas row:1000000 col:273   71.5742    18.8    45.5   100.0
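
The percentages in the Ranking and Matrix sections are plain ratios of the real times. As a check on how to read them, the sketch below recomputes both for the 1,000,000-row run from the three real values printed above. The function names are mine, and the formulas are inferred from the numbers in the output rather than taken from Benchmarker's source.

```python
def ranking_percent(real_times):
    """Score per label: fastest real time divided by that label's real time, times 100."""
    fastest = min(real_times.values())
    return {label: round(fastest / t * 100, 1) for label, t in real_times.items()}


def matrix_row(real_times, base):
    """Matrix row for `base`: every real time as a percentage of base's real time."""
    return {label: round(t / real_times[base] * 100, 1) for label, t in real_times.items()}


# Real times in seconds for row:1000000 col:273, copied from the output above.
real = {"modin": 13.4387, "dask": 32.5911, "pandas": 71.5742}

print(ranking_percent(real))      # modin 100.0, dask 41.2, pandas 18.8
print(matrix_row(real, "modin"))  # modin 100.0, dask 242.5, pandas 532.6
```

Read this way, a Matrix entry of 532.6 in the modin row means that pandas took about 5.3 times as long as modin for that file.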

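Summary

  • At 10,000 rows or more, modin's read_csv was several times faster than pandas and dask: about 2.3x to 3.1x faster than dask and 3.2x to 5.3x faster than pandas in these runs. At 1,000 rows or fewer, plain pandas was the fastest.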
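
The speedup behind this summary can be recomputed directly from the real column. The sketch below uses the pandas and modin real times copied from the output in this post and reports, for each row count, how many times faster modin was; `speedups` is just an illustrative helper name.

```python
# Real times in seconds, copied from the benchmark output in this post.
ROWS = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]
PANDAS = [0.0295, 0.0147, 0.0195, 0.0881, 0.7424, 7.1640, 71.5742]
MODIN = [0.3122, 0.1212, 0.1182, 0.1316, 0.2286, 1.5127, 13.4387]


def speedups(baseline, candidate):
    """Return baseline/candidate per size; values above 1 mean the candidate is faster."""
    return [b / c for b, c in zip(baseline, candidate)]


ratios = speedups(PANDAS, MODIN)
for rows, ratio in zip(ROWS, ratios):
    print(f"{rows:>9,} rows: modin ran at {ratio:.2f}x the speed of pandas")

# Smallest row count at which modin beat pandas in these runs.
crossover = next(rows for rows, ratio in zip(ROWS, ratios) if ratio > 1)
print("crossover:", crossover)
```

These are single-machine numbers from one c4.4xlarge instance, so the crossover point may well differ on other hardware or with other column counts.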

*1: I got the error "ImportError: The pandas version installed does not match the required pandas version in Modin. Please install pandas 0.24.1 to use Modin.", so I downgraded pandas.