Generate multiple, large, sorted CSV files with pseudo-random data

For this exercise, I was tasked to create a 500 million rows dataset for an IMPORT INTO speed test.
Details about the IMPORT INTO command can be found here.
The requirement was to create 100 sorted CSV files, each containing 5 million rows, and save them to an S3 bucket.
The data should also be sorted across all files.

In this blog, I show how to use pgworkload to create such pseudo-random dataset.

Setup pgworkload and aws-cli

Provision a large machine with plenty of RAM.
The AWS instance type r7i.8xlarge, which sports 32 vCPUs and 256GB RAM, is a good candidate.
Make sure to attach an IAM Role with permissions to write to your S3 bucket.

Once ready, ssh into the box and install the tools

<span>sudo </span>apt update
<span>sudo </span>apt <span>install</span> <span>-y</span> python3-pip unzip
pip <span>install</span> <span>-U</span> pip
pip <span>install </span>pgworkload
<span>sudo </span>apt update
<span>sudo </span>apt <span>install</span> <span>-y</span> python3-pip unzip

pip <span>install</span> <span>-U</span> pip
pip <span>install </span>pgworkload
sudo apt update sudo apt install -y python3-pip unzip pip install -U pip pip install pgworkload

Enter fullscreen mode Exit fullscreen mode

Logout and log back in so that pgworkload is in the PATH…

Now install AWS CLI, here the official docs.

curl <span>"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"</span> <span>-o</span> <span>"awscliv2.zip"</span>
unzip awscliv2.zip
<span>sudo</span> ./aws/install
curl <span>"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"</span> <span>-o</span> <span>"awscliv2.zip"</span>
unzip awscliv2.zip
<span>sudo</span> ./aws/install
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" unzip awscliv2.zip sudo ./aws/install

Enter fullscreen mode Exit fullscreen mode

Confirm it’s working

$ aws --version
aws-cli/2.15.17 Python/3.11.6 Linux/5.15.0-1052-aws exe/x86_64.ubuntu.20 prompt/off
# confirm you're authenticated..
$ aws s3 ls workshop-ca
PRE fab/
# put some files in the bucket
$ aws s3 cp s.sql 's3://workshop-ca/fab/'
upload: ./s.sql to s3://workshop-ca/fab/s.sql
# check the value is there
$ aws s3 ls workshop-ca/fab/
PRE 2024/
PRE metadata/
2024-02-02 19:03:10 3585 s.sql
$ aws --version
aws-cli/2.15.17 Python/3.11.6 Linux/5.15.0-1052-aws exe/x86_64.ubuntu.20 prompt/off

# confirm you're authenticated..
$ aws s3 ls workshop-ca
                          PRE fab/

# put some files in the bucket
$ aws s3 cp s.sql 's3://workshop-ca/fab/'
upload: ./s.sql to s3://workshop-ca/fab/s.sql                  

# check the value is there
$ aws s3 ls workshop-ca/fab/
                           PRE 2024/
                           PRE metadata/
2024-02-02 19:03:10       3585 s.sql
$ aws --version aws-cli/2.15.17 Python/3.11.6 Linux/5.15.0-1052-aws exe/x86_64.ubuntu.20 prompt/off # confirm you're authenticated.. $ aws s3 ls workshop-ca PRE fab/ # put some files in the bucket $ aws s3 cp s.sql 's3://workshop-ca/fab/' upload: ./s.sql to s3://workshop-ca/fab/s.sql # check the value is there $ aws s3 ls workshop-ca/fab/ PRE 2024/ PRE metadata/ 2024-02-02 19:03:10 3585 s.sql

Enter fullscreen mode Exit fullscreen mode

Good, we’re all set!

Generate data

The data that needs to be created must match the schema of the table into which it will be imported.

The DDL of the schema is in file s.sql.

<span>-- file s.sql</span>
<span>CREATE</span> <span>TABLE</span> <span>accounts</span> <span>(</span>
<span>acc_no</span> <span>STRING</span><span>(</span><span>20</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
<span>pos_dt</span> <span>DATE</span> <span>NOT</span> <span>NULL</span><span>,</span>
<span>ent_id</span> <span>DECIMAL</span><span>(</span><span>25</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
<span>col_00</span> <span>STRING</span><span>(</span><span>35</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
<span>col_01</span> <span>STRING</span><span>(</span><span>16</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_02</span> <span>STRING</span><span>(</span><span>9</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
<span>col_03</span> <span>STRING</span><span>(</span><span>30</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
<span>col_04</span> <span>DECIMAL</span><span>(</span><span>22</span><span>,</span><span>6</span><span>)</span> <span>NOT</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
<span>col_05</span> <span>DECIMAL</span><span>(</span><span>22</span><span>,</span><span>6</span><span>)</span> <span>NOT</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
<span>col_06</span> <span>STRING</span><span>(</span><span>3</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
<span>col_07</span> <span>DECIMAL</span><span>(</span><span>22</span><span>,</span><span>8</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_08</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>3</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_09</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>4</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_10</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>3</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_11</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_12</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
<span>col_13</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
<span>col_14</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_15</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>3</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_16</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>3</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
<span>col_17</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>8</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_18</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>8</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
<span>col_19</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>2</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
<span>col_20</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>2</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_21</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>2</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_22</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>2</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
<span>col_23</span> <span>STRING</span><span>(</span><span>1</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
<span>col_24</span> <span>STRING</span><span>(</span><span>3</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_25</span> <span>DECIMAL</span><span>(</span><span>22</span><span>,</span><span>9</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_26</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>8</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_27</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>8</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
<span>col_28</span> <span>DECIMAL</span><span>(</span><span>22</span><span>,</span><span>9</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_29</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_30</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_31</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_32</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_33</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_34</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_35</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>2</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_36</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>2</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_37</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_38</span> <span>DATE</span> <span>NULL</span><span>,</span>
<span>col_39</span> <span>DATE</span> <span>NULL</span><span>,</span>
<span>col_40</span> <span>DATE</span> <span>NULL</span><span>,</span>
<span>col_41</span> <span>STRING</span><span>(</span><span>25</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_42</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>6</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_43</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>6</span><span>)</span> <span>NULL</span><span>,</span>
<span>col_44</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>6</span><span>)</span> <span>NULL</span><span>,</span>
<span>CONSTRAINT</span> <span>pk</span> <span>PRIMARY</span> <span>KEY</span> <span>(</span><span>acc_no</span> <span>ASC</span><span>,</span> <span>pos_dt</span> <span>ASC</span><span>,</span> <span>ent_id</span> <span>ASC</span><span>)</span>
<span>);</span>
<span>-- file s.sql</span>
<span>CREATE</span> <span>TABLE</span> <span>accounts</span> <span>(</span>
    <span>acc_no</span> <span>STRING</span><span>(</span><span>20</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
    <span>pos_dt</span> <span>DATE</span> <span>NOT</span> <span>NULL</span><span>,</span>
    <span>ent_id</span> <span>DECIMAL</span><span>(</span><span>25</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
    <span>col_00</span> <span>STRING</span><span>(</span><span>35</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
    <span>col_01</span> <span>STRING</span><span>(</span><span>16</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_02</span> <span>STRING</span><span>(</span><span>9</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
    <span>col_03</span> <span>STRING</span><span>(</span><span>30</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
    <span>col_04</span> <span>DECIMAL</span><span>(</span><span>22</span><span>,</span><span>6</span><span>)</span> <span>NOT</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
    <span>col_05</span> <span>DECIMAL</span><span>(</span><span>22</span><span>,</span><span>6</span><span>)</span> <span>NOT</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
    <span>col_06</span> <span>STRING</span><span>(</span><span>3</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
    <span>col_07</span> <span>DECIMAL</span><span>(</span><span>22</span><span>,</span><span>8</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_08</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>3</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_09</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>4</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_10</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>3</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_11</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_12</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
    <span>col_13</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
    <span>col_14</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_15</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>3</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_16</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>3</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
    <span>col_17</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>8</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_18</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>8</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
    <span>col_19</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>2</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
    <span>col_20</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>2</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_21</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>2</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_22</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>2</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
    <span>col_23</span> <span>STRING</span><span>(</span><span>1</span><span>)</span> <span>NOT</span> <span>NULL</span><span>,</span>
    <span>col_24</span> <span>STRING</span><span>(</span><span>3</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_25</span> <span>DECIMAL</span><span>(</span><span>22</span><span>,</span><span>9</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_26</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>8</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_27</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>8</span><span>)</span> <span>NULL</span> <span>DEFAULT</span> <span>0</span><span>:::</span><span>DECIMAL</span><span>,</span>
    <span>col_28</span> <span>DECIMAL</span><span>(</span><span>22</span><span>,</span><span>9</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_29</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_30</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_31</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_32</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_33</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_34</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_35</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>2</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_36</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>2</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_37</span> <span>DECIMAL</span><span>(</span><span>20</span><span>,</span><span>5</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_38</span> <span>DATE</span> <span>NULL</span><span>,</span>
    <span>col_39</span> <span>DATE</span> <span>NULL</span><span>,</span>
    <span>col_40</span> <span>DATE</span> <span>NULL</span><span>,</span>
    <span>col_41</span> <span>STRING</span><span>(</span><span>25</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_42</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>6</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_43</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>6</span><span>)</span> <span>NULL</span><span>,</span>
    <span>col_44</span> <span>DECIMAL</span><span>(</span><span>21</span><span>,</span><span>6</span><span>)</span> <span>NULL</span><span>,</span>
    <span>CONSTRAINT</span> <span>pk</span> <span>PRIMARY</span> <span>KEY</span> <span>(</span><span>acc_no</span> <span>ASC</span><span>,</span> <span>pos_dt</span> <span>ASC</span><span>,</span> <span>ent_id</span> <span>ASC</span><span>)</span>
<span>);</span>
-- file s.sql CREATE TABLE accounts ( acc_no STRING(20) NOT NULL, pos_dt DATE NOT NULL, ent_id DECIMAL(25) NOT NULL, col_00 STRING(35) NOT NULL, col_01 STRING(16) NULL, col_02 STRING(9) NOT NULL, col_03 STRING(30) NOT NULL, col_04 DECIMAL(22,6) NOT NULL DEFAULT 0:::DECIMAL, col_05 DECIMAL(22,6) NOT NULL DEFAULT 0:::DECIMAL, col_06 STRING(3) NOT NULL, col_07 DECIMAL(22,8) NULL, col_08 DECIMAL(20,3) NULL, col_09 DECIMAL(20,4) NULL, col_10 DECIMAL(20,3) NULL, col_11 DECIMAL(20,5) NULL, col_12 DECIMAL(20,5) NULL DEFAULT 0:::DECIMAL, col_13 DECIMAL(20,5) NULL DEFAULT 0:::DECIMAL, col_14 DECIMAL(20,5) NULL, col_15 DECIMAL(20,3) NULL, col_16 DECIMAL(20,3) NULL DEFAULT 0:::DECIMAL, col_17 DECIMAL(21,8) NULL, col_18 DECIMAL(21,8) NULL DEFAULT 0:::DECIMAL, col_19 DECIMAL(21,2) NULL DEFAULT 0:::DECIMAL, col_20 DECIMAL(21,2) NULL, col_21 DECIMAL(21,2) NULL, col_22 DECIMAL(21,2) NULL DEFAULT 0:::DECIMAL, col_23 STRING(1) NOT NULL, col_24 STRING(3) NULL, col_25 DECIMAL(22,9) NULL, col_26 DECIMAL(21,8) NULL, col_27 DECIMAL(21,8) NULL DEFAULT 0:::DECIMAL, col_28 DECIMAL(22,9) NULL, col_29 DECIMAL(20,5) NULL, col_30 DECIMAL(20,5) NULL, col_31 DECIMAL(20,5) NULL, col_32 DECIMAL(20,5) NULL, col_33 DECIMAL(20,5) NULL, col_34 DECIMAL(20,5) NULL, col_35 DECIMAL(20,2) NULL, col_36 DECIMAL(20,2) NULL, col_37 DECIMAL(20,5) NULL, col_38 DATE NULL, col_39 DATE NULL, col_40 DATE NULL, col_41 STRING(25) NULL, col_42 DECIMAL(21,6) NULL, col_43 DECIMAL(21,6) NULL, col_44 DECIMAL(21,6) NULL, CONSTRAINT pk PRIMARY KEY (acc_no ASC, pos_dt ASC, ent_id ASC) );

Enter fullscreen mode Exit fullscreen mode

Given the schema, pgworkload can generate an intermediate representation of what needs to be generated – a definition file – in YAML syntax.

pgworkload util yaml <span>-i</span> s.sql
pgworkload util yaml <span>-i</span> s.sql
pgworkload util yaml -i s.sql

Enter fullscreen mode Exit fullscreen mode

The result is a file called, by default, s.yaml, below a snippet of the first few lines

<span># file s.yaml</span>
<span>accounts</span><span>:</span>
<span>-</span> <span>count</span><span>:</span> <span>1000</span>
<span>sort-by</span><span>:</span> <span>[]</span>
<span>columns</span><span>:</span>
<span>acc_no</span><span>:</span>
<span>type</span><span>:</span> <span>string</span>
<span>args</span><span>:</span>
<span>min</span><span>:</span> <span>10</span>
<span>max</span><span>:</span> <span>30</span>
<span>seed</span><span>:</span> <span>0.7225861820526325</span>
<span>null_pct</span><span>:</span> <span>0.0</span>
<span>array</span><span>:</span> <span>0</span>
<span>pos_dt</span><span>:</span>
<span>type</span><span>:</span> <span>date</span>
<span>args</span><span>:</span>
<span>start</span><span>:</span> <span>'</span><span>2022-01-01'</span>
<span>end</span><span>:</span> <span>'</span><span>2022-12-31'</span>
<span>format</span><span>:</span> <span>'</span><span>%Y-%m-%d'</span>
<span>seed</span><span>:</span> <span>0.24769809060740589</span>
<span>null_pct</span><span>:</span> <span>0.0</span>
<span>array</span><span>:</span> <span>0</span>
<span>ent_id</span><span>:</span>
<span>type</span><span>:</span> <span>float</span>
<span>args</span><span>:</span>
<span>max</span><span>:</span> <span>10000</span>
<span>round</span><span>:</span> <span>2</span>
<span>seed</span><span>:</span> <span>0.028215986930010706</span>
<span>null_pct</span><span>:</span> <span>0.0</span>
<span>array</span><span>:</span> <span>0</span>
<span>col_00</span><span>:</span>
<span>type</span><span>:</span> <span>string</span>
<span>args</span><span>:</span>
<span>min</span><span>:</span> <span>10</span>
<span>max</span><span>:</span> <span>30</span>
<span>seed</span><span>:</span> <span>0.8785098436269427</span>
<span>null_pct</span><span>:</span> <span>0.0</span>
<span>array</span><span>:</span> <span>0</span>
<span>col_01</span><span>:</span>
<span>type</span><span>:</span> <span>string</span>
<span>args</span><span>:</span>
<span>min</span><span>:</span> <span>10</span>
<span>max</span><span>:</span> <span>30</span>
<span>seed</span><span>:</span> <span>0.8561702097239098</span>
<span>null_pct</span><span>:</span> <span>0.0</span>
<span>array</span><span>:</span> <span>0</span>
<span>[</span><span>...</span><span>]</span>
<span># file s.yaml</span>
<span>accounts</span><span>:</span>
<span>-</span> <span>count</span><span>:</span> <span>1000</span>
  <span>sort-by</span><span>:</span> <span>[]</span>
  <span>columns</span><span>:</span>
    <span>acc_no</span><span>:</span>
      <span>type</span><span>:</span> <span>string</span>
      <span>args</span><span>:</span>
        <span>min</span><span>:</span> <span>10</span>
        <span>max</span><span>:</span> <span>30</span>
        <span>seed</span><span>:</span> <span>0.7225861820526325</span>
        <span>null_pct</span><span>:</span> <span>0.0</span>
        <span>array</span><span>:</span> <span>0</span>
    <span>pos_dt</span><span>:</span>
      <span>type</span><span>:</span> <span>date</span>
      <span>args</span><span>:</span>
        <span>start</span><span>:</span> <span>'</span><span>2022-01-01'</span>
        <span>end</span><span>:</span> <span>'</span><span>2022-12-31'</span>
        <span>format</span><span>:</span> <span>'</span><span>%Y-%m-%d'</span>
        <span>seed</span><span>:</span> <span>0.24769809060740589</span>
        <span>null_pct</span><span>:</span> <span>0.0</span>
        <span>array</span><span>:</span> <span>0</span>
    <span>ent_id</span><span>:</span>
      <span>type</span><span>:</span> <span>float</span>
      <span>args</span><span>:</span>
        <span>max</span><span>:</span> <span>10000</span>
        <span>round</span><span>:</span> <span>2</span>
        <span>seed</span><span>:</span> <span>0.028215986930010706</span>
        <span>null_pct</span><span>:</span> <span>0.0</span>
        <span>array</span><span>:</span> <span>0</span>
    <span>col_00</span><span>:</span>
      <span>type</span><span>:</span> <span>string</span>
      <span>args</span><span>:</span>
        <span>min</span><span>:</span> <span>10</span>
        <span>max</span><span>:</span> <span>30</span>
        <span>seed</span><span>:</span> <span>0.8785098436269427</span>
        <span>null_pct</span><span>:</span> <span>0.0</span>
        <span>array</span><span>:</span> <span>0</span>
    <span>col_01</span><span>:</span>
      <span>type</span><span>:</span> <span>string</span>
      <span>args</span><span>:</span>
        <span>min</span><span>:</span> <span>10</span>
        <span>max</span><span>:</span> <span>30</span>
        <span>seed</span><span>:</span> <span>0.8561702097239098</span>
        <span>null_pct</span><span>:</span> <span>0.0</span>
        <span>array</span><span>:</span> <span>0</span>
<span>[</span><span>...</span><span>]</span>
# file s.yaml accounts: - count: 1000 sort-by: [] columns: acc_no: type: string args: min: 10 max: 30 seed: 0.7225861820526325 null_pct: 0.0 array: 0 pos_dt: type: date args: start: '2022-01-01' end: '2022-12-31' format: '%Y-%m-%d' seed: 0.24769809060740589 null_pct: 0.0 array: 0 ent_id: type: float args: max: 10000 round: 2 seed: 0.028215986930010706 null_pct: 0.0 array: 0 col_00: type: string args: min: 10 max: 30 seed: 0.8785098436269427 null_pct: 0.0 array: 0 col_01: type: string args: min: 10 max: 30 seed: 0.8561702097239098 null_pct: 0.0 array: 0 [...]

Enter fullscreen mode Exit fullscreen mode

This is just a template, we need to configure it as per our needs:

  • generate 500,000,000 rows
  • ensure dataset is sorted as per Primary Key
  • the 2nd column must always be the same date.

Here is therefore the updated head of the file

<span>accounts</span><span>:</span>
<span>-</span> <span>count</span><span>:</span> <span>500000000</span>
<span>sort-by</span><span>:</span> <span>[</span><span>"</span><span>acc_no"</span><span>,</span> <span>"</span><span>pos_dt"</span><span>,</span> <span>"</span><span>ent_id"</span><span>]</span>
<span>columns</span><span>:</span>
<span>acc_no</span><span>:</span>
<span>type</span><span>:</span> <span>string</span>
<span>args</span><span>:</span>
<span>min</span><span>:</span> <span>10</span>
<span>max</span><span>:</span> <span>30</span>
<span>seed</span><span>:</span> <span>0.7225861820526325</span>
<span>null_pct</span><span>:</span> <span>0.0</span>
<span>array</span><span>:</span> <span>0</span>
<span>pos_dt</span><span>:</span>
<span>type</span><span>:</span> <span>costant</span>
<span>args</span><span>:</span>
<span>value</span><span>:</span> <span>"</span><span>2024-02-01"</span>
<span>accounts</span><span>:</span>
<span>-</span> <span>count</span><span>:</span> <span>500000000</span>
  <span>sort-by</span><span>:</span> <span>[</span><span>"</span><span>acc_no"</span><span>,</span> <span>"</span><span>pos_dt"</span><span>,</span> <span>"</span><span>ent_id"</span><span>]</span>
  <span>columns</span><span>:</span>
    <span>acc_no</span><span>:</span>
      <span>type</span><span>:</span> <span>string</span>
      <span>args</span><span>:</span>
        <span>min</span><span>:</span> <span>10</span>
        <span>max</span><span>:</span> <span>30</span>
        <span>seed</span><span>:</span> <span>0.7225861820526325</span>
        <span>null_pct</span><span>:</span> <span>0.0</span>
        <span>array</span><span>:</span> <span>0</span>
    <span>pos_dt</span><span>:</span>
      <span>type</span><span>:</span> <span>costant</span>
      <span>args</span><span>:</span>
        <span>value</span><span>:</span> <span>"</span><span>2024-02-01"</span>
accounts: - count: 500000000 sort-by: ["acc_no", "pos_dt", "ent_id"] columns: acc_no: type: string args: min: 10 max: 30 seed: 0.7225861820526325 null_pct: 0.0 array: 0 pos_dt: type: costant args: value: "2024-02-01"

Enter fullscreen mode Exit fullscreen mode

At this point, we’re ready to generate the data.

<span># check the options available for the `util csv` command</span>
<span>$ </span>pgworkload util csv <span>--help</span>
<span># check the options available for the `util csv` command</span>
<span>$ </span>pgworkload util csv <span>--help</span>
# check the options available for the `util csv` command $ pgworkload util csv --help

Enter fullscreen mode Exit fullscreen mode

╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
* --input -i FILE Filepath to the YAML data generation file. [default: None] [required]
│ --output -o DIRECTORY Output directory for the CSV files. Defaults to <input-basename>. │
│ --procs -x INTEGER Number of processes to spawn. Defaults to <system-cpu-count>. │
│ --csv-max-rows INTEGER Max count of rows per resulting CSV file. [default: 100000]
│ --hostname -n INTEGER The hostname of the http server that serves the CSV files. │
│ --port -p INTEGER The port of the http server that servers the CSV files. [default: 3000]
│ --table-name -t TEXT The table name used in the import statement. [default: table_name]
│ --compression -c TEXT The compression format. [default: None]
│ --delimiter -d TEXT The delimeter char to use for the CSV files. Defaults to "tab". │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --input         -i      FILE       Filepath to the YAML data generation file. [default: None] [required]                                  │
│    --output        -o      DIRECTORY  Output directory for the CSV files. Defaults to <input-basename>.                                      │
│    --procs         -x      INTEGER    Number of processes to spawn. Defaults to <system-cpu-count>.                                          │
│    --csv-max-rows          INTEGER    Max count of rows per resulting CSV file. [default: 100000]                                            │
│    --hostname      -n      INTEGER    The hostname of the http server that serves the CSV files.                                             │
│    --port          -p      INTEGER    The port of the http server that servers the CSV files. [default: 3000]                                │
│    --table-name    -t      TEXT       The table name used in the import statement. [default: table_name]                                     │
│    --compression   -c      TEXT       The compression format. [default: None]                                                                │
│    --delimiter     -d      TEXT       The delimeter char to use for the CSV files. Defaults to "tab".                                        │
│    --help                             Show this message and exit.                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ * --input -i FILE Filepath to the YAML data generation file. [default: None] [required] │ │ --output -o DIRECTORY Output directory for the CSV files. Defaults to <input-basename>. │ │ --procs -x INTEGER Number of processes to spawn. Defaults to <system-cpu-count>. │ │ --csv-max-rows INTEGER Max count of rows per resulting CSV file. [default: 100000] │ │ --hostname -n INTEGER The hostname of the http server that serves the CSV files. │ │ --port -p INTEGER The port of the http server that servers the CSV files. [default: 3000] │ │ --table-name -t TEXT The table name used in the import statement. [default: table_name] │ │ --compression -c TEXT The compression format. [default: None] │ │ --delimiter -d TEXT The delimeter char to use for the CSV files. Defaults to "tab". │ │ --help Show this message and exit. │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Enter fullscreen mode Exit fullscreen mode

Here, you are interested in setting the --csv-max-rows and -x parameters.

To make sure we end up with 100 files, we will use 10 processes (-x) with each process generating files of size 5,000,000 until the total is 500,000,000.

pgworkload util csv <span>-i</span> s.yaml <span>-d</span> <span>","</span> <span>--csv-max-rows</span> 5000000 <span>-x</span> 10
pgworkload util csv <span>-i</span> s.yaml <span>-d</span> <span>","</span> <span>--csv-max-rows</span> 5000000 <span>-x</span> 10
pgworkload util csv -i s.yaml -d "," --csv-max-rows 5000000 -x 10

Enter fullscreen mode Exit fullscreen mode

This will take time, and memory: monitor memory utilization using top or free.
Once completed, you have 100 CSV files in a s directory.

<span>$ </span><span>ls</span> <span>-A1</span> s/
accounts.0_0_0.csv
accounts.0_0_1.csv
accounts.0_0_2.csv
accounts.0_0_3.csv
accounts.0_0_4.csv
accounts.0_0_5.csv
accounts.0_0_6.csv
accounts.0_0_7.csv
accounts.0_0_8.csv
accounts.0_0_9.csv
accounts.0_1_0.csv
accounts.0_1_1.csv
<span>[</span>...]
accounts.0_9_8.csv
accounts.0_9_9.csv
<span>$ </span><span>ls</span> <span>-A1</span> s/
accounts.0_0_0.csv
accounts.0_0_1.csv
accounts.0_0_2.csv
accounts.0_0_3.csv
accounts.0_0_4.csv
accounts.0_0_5.csv
accounts.0_0_6.csv
accounts.0_0_7.csv
accounts.0_0_8.csv
accounts.0_0_9.csv
accounts.0_1_0.csv
accounts.0_1_1.csv
<span>[</span>...]
accounts.0_9_8.csv
accounts.0_9_9.csv
$ ls -A1 s/ accounts.0_0_0.csv accounts.0_0_1.csv accounts.0_0_2.csv accounts.0_0_3.csv accounts.0_0_4.csv accounts.0_0_5.csv accounts.0_0_6.csv accounts.0_0_7.csv accounts.0_0_8.csv accounts.0_0_9.csv accounts.0_1_0.csv accounts.0_1_1.csv [...] accounts.0_9_8.csv accounts.0_9_9.csv

Enter fullscreen mode Exit fullscreen mode

Inspecting one file, we see it’s sorted

<span>$ </span><span>head </span>s/accounts.0_0_0.csv
abcdgzrjyphtemsbcjvt,2024-02-01,4314.45,jkiuotkttqwjnjnbxzbxhgsyke,uxejzkiirmitunpzybjnakoic,ovvqhgmsbwoajhwmiyhnugj,...
aefkzjdckjylb,2024-02-01,9375.53,bswvhyjkodukhwpcxf,uevjmwqhdfaobtlf,oahiaiztayyzftmfkyuez,qtxhjuwpfalfzaeoiiahuoxamns,...
aerkycbddriqtygvilb,2024-02-01,6150.55,mprfweeqoe,nvddlibqqzncrwdnffm,phcnnzvxrauxllj,vjnabkrzgiimmt,...
agjiqkisyshjeorqna,2024-02-01,9901.21,fipnlqezgzzdfreg,yokerzbkxcrzfdeckjkk,guaeeecdqgbbwtnzleopfznzcuv,lqglvuyetypvnovdbflbnodozfebz,...
atpkoobmhvhhxuqxceurv,2024-02-01,2455.17,cwmkzijlrqhtdcx,jtbelvfoajfdagwigpevnmameq,uedouumekwxagdgwbtivewaq,uytgiqpewfexlqkmbelpik,...
<span>$ </span><span>head </span>s/accounts.0_0_0.csv
abcdgzrjyphtemsbcjvt,2024-02-01,4314.45,jkiuotkttqwjnjnbxzbxhgsyke,uxejzkiirmitunpzybjnakoic,ovvqhgmsbwoajhwmiyhnugj,...
aefkzjdckjylb,2024-02-01,9375.53,bswvhyjkodukhwpcxf,uevjmwqhdfaobtlf,oahiaiztayyzftmfkyuez,qtxhjuwpfalfzaeoiiahuoxamns,...
aerkycbddriqtygvilb,2024-02-01,6150.55,mprfweeqoe,nvddlibqqzncrwdnffm,phcnnzvxrauxllj,vjnabkrzgiimmt,...
agjiqkisyshjeorqna,2024-02-01,9901.21,fipnlqezgzzdfreg,yokerzbkxcrzfdeckjkk,guaeeecdqgbbwtnzleopfznzcuv,lqglvuyetypvnovdbflbnodozfebz,...
atpkoobmhvhhxuqxceurv,2024-02-01,2455.17,cwmkzijlrqhtdcx,jtbelvfoajfdagwigpevnmameq,uedouumekwxagdgwbtivewaq,uytgiqpewfexlqkmbelpik,...
$ head s/accounts.0_0_0.csv abcdgzrjyphtemsbcjvt,2024-02-01,4314.45,jkiuotkttqwjnjnbxzbxhgsyke,uxejzkiirmitunpzybjnakoic,ovvqhgmsbwoajhwmiyhnugj,... aefkzjdckjylb,2024-02-01,9375.53,bswvhyjkodukhwpcxf,uevjmwqhdfaobtlf,oahiaiztayyzftmfkyuez,qtxhjuwpfalfzaeoiiahuoxamns,... aerkycbddriqtygvilb,2024-02-01,6150.55,mprfweeqoe,nvddlibqqzncrwdnffm,phcnnzvxrauxllj,vjnabkrzgiimmt,... agjiqkisyshjeorqna,2024-02-01,9901.21,fipnlqezgzzdfreg,yokerzbkxcrzfdeckjkk,guaeeecdqgbbwtnzleopfznzcuv,lqglvuyetypvnovdbflbnodozfebz,... atpkoobmhvhhxuqxceurv,2024-02-01,2455.17,cwmkzijlrqhtdcx,jtbelvfoajfdagwigpevnmameq,uedouumekwxagdgwbtivewaq,uytgiqpewfexlqkmbelpik,...

Enter fullscreen mode Exit fullscreen mode

Now you have the data, sorted by file, but not across all files.
That is, each file is sorted as per PK, but across all the files the data is not yet sorted.
Currently, pgworkload doesn’t have the capability to do so, so we have to develop our own sorted-merge script.

Note You can’t possibly be thinking to load everything into memory, sort, and save.
That’s too big to fit into a single machine, no matter how big it is.
Instead, you must read in chunks and write in chunks, so that it can scale.
Below script will work no matter how many files you have or how large they are.

Here’s my quick and dirty Python script.

UPDATE: as of version 0.1.8, the sort merge functionality has been added to pgworkload, check with pgworkload util merge --help.

<span># file: sort_merge.py </span><span>from</span> <span>io</span> <span>import</span> <span>TextIOWrapper</span>
<span>import</span> <span>sys</span>
<span># input CSV files - it assumes files are already sorted </span><span>CSVs</span> <span>=</span> <span>sys</span><span>.</span><span>argv</span><span>[</span><span>1</span><span>:]</span>
<span>CSV_MAX_ROWS</span> <span>=</span> <span>5000000</span>
<span>COUNTER</span> <span>=</span> <span>0</span>
<span>C</span> <span>=</span> <span>0</span>
<span>source</span><span>:</span> <span>dict</span><span>[</span><span>int</span><span>,</span> <span>list</span><span>]</span> <span>=</span> <span>{}</span>
<span>file_handlers</span><span>:</span> <span>dict</span><span>[</span><span>int</span><span>,</span> <span>TextIOWrapper</span><span>]</span> <span>=</span> <span>{}</span>
<span>def</span> <span>initial_fill</span><span>(</span><span>csv</span><span>:</span> <span>str</span><span>,</span> <span>idx</span><span>:</span> <span>int</span><span>):</span>
<span>"""</span><span> opens the CSV file, saves the file handler, read few lines into source list for the index. </span><span>"""</span>
<span>f</span> <span>=</span> <span>open</span><span>(</span><span>csv</span><span>,</span> <span>"</span><span>r</span><span>"</span><span>)</span>
<span>file_handlers</span><span>[</span><span>idx</span><span>]</span> <span>=</span> <span>f</span>
<span>while</span> <span>len</span><span>(</span><span>source</span><span>[</span><span>idx</span><span>])</span> <span><</span> <span>5</span><span>:</span>
<span>line</span> <span>=</span> <span>f</span><span>.</span><span>readline</span><span>()</span>
<span>if</span> <span>line</span> <span>!=</span> <span>""</span><span>:</span>
<span>source</span><span>[</span><span>idx</span><span>].</span><span>append</span><span>(</span><span>line</span><span>)</span>
<span>else</span><span>:</span>
<span># reached end of file </span> <span>print</span><span>(</span><span>f</span><span>"</span><span>initial_fill: CSV file </span><span>'</span><span>{</span><span>csv</span><span>}</span><span>'</span><span> at source index </span><span>{</span><span>idx</span><span>}</span><span> reached EOF.</span><span>"</span><span>)</span>
<span>f</span><span>.</span><span>close</span><span>()</span>
<span>break</span>
<span>def</span> <span>replenish_source_list</span><span>(</span><span>idx</span><span>:</span> <span>int</span><span>):</span>
<span>"""</span><span> Refills the source list with a new value from the source file </span><span>"""</span>
<span>try</span><span>:</span>
<span>f</span> <span>=</span> <span>file_handlers</span><span>.</span><span>get</span><span>(</span><span>idx</span><span>,</span> <span>None</span><span>)</span>
<span>if</span> <span>not</span> <span>f</span><span>:</span>
<span>return</span>
<span>line</span> <span>=</span> <span>f</span><span>.</span><span>readline</span><span>()</span>
<span>if</span> <span>line</span> <span>!=</span> <span>""</span><span>:</span>
<span>source</span><span>[</span><span>idx</span><span>].</span><span>append</span><span>(</span><span>line</span><span>)</span>
<span>else</span><span>:</span>
<span># reached end of file </span> <span>print</span><span>(</span><span>f</span><span>"</span><span>index </span><span>{</span><span>idx</span><span>}</span><span> reached EOF.</span><span>"</span><span>)</span>
<span>f</span><span>.</span><span>close</span><span>()</span>
<span>del</span> <span>file_handlers</span><span>[</span><span>idx</span><span>]</span>
<span>except</span> <span>Exception</span> <span>as</span> <span>e</span><span>:</span>
<span>print</span><span>(</span><span>"</span><span>Excepton in replenish_queue: </span><span>"</span><span>,</span> <span>e</span><span>)</span>
<span>def</span> <span>write_to_csv</span><span>(</span><span>v</span><span>):</span>
<span>global</span> <span>C</span>
<span>global</span> <span>output</span>
<span>global</span> <span>COUNTER</span>
<span>if</span> <span>C</span> <span>>=</span> <span>CSV_MAX_ROWS</span><span>:</span>
<span>output</span><span>.</span><span>close</span><span>()</span>
<span>COUNTER</span> <span>+=</span> <span>1</span>
<span>C</span> <span>=</span> <span>0</span>
<span>output</span> <span>=</span> <span>open</span><span>(</span><span>f</span><span>"</span><span>out_</span><span>{</span><span>str</span><span>.</span><span>zfill</span><span>(</span><span>str</span><span>(</span><span>COUNTER</span><span>),</span> <span>3</span><span>)</span><span>}</span><span>.csv</span><span>"</span><span>,</span> <span>"</span><span>+w</span><span>"</span><span>)</span>
<span>output</span><span>.</span><span>write</span><span>(</span><span>v</span><span>)</span>
<span>C</span> <span>+=</span> <span>1</span>
<span># init the source dict by opening each CSV file # and only reading few lines. </span><span>for</span> <span>idx</span><span>,</span> <span>csv</span> <span>in</span> <span>enumerate</span><span>(</span><span>CSVs</span><span>):</span>
<span>source</span><span>[</span><span>idx</span><span>]</span> <span>=</span> <span>[]</span>
<span>initial_fill</span><span>(</span><span>csv</span><span>,</span> <span>idx</span><span>)</span>
<span># the source dict now has a key for every file and a list of the first values read </span>
<span>l</span> <span>=</span> <span>[]</span>
<span># pop the first value in each source to a list `l` # `l` will have the first values of all source CSV files </span><span>for</span> <span>k</span><span>,</span> <span>v</span> <span>in</span> <span>source</span><span>.</span><span>items</span><span>():</span>
<span>try</span><span>:</span>
<span>l</span><span>.</span><span>append</span><span>((</span><span>v</span><span>.</span><span>pop</span><span>(</span><span>0</span><span>),</span> <span>k</span><span>))</span>
<span>except</span> <span>IndexError</span> <span>as</span> <span>e</span><span>:</span>
<span>pass</span>
<span>first_k</span> <span>=</span> <span>None</span>
<span>first_v</span> <span>=</span> <span>None</span>
<span>output</span> <span>=</span> <span>open</span><span>(</span><span>f</span><span>"</span><span>out_</span><span>{</span><span>str</span><span>.</span><span>zfill</span><span>(</span><span>str</span><span>(</span><span>COUNTER</span><span>),</span> <span>3</span><span>)</span><span>}</span><span>.csv</span><span>"</span><span>,</span> <span>"</span><span>+w</span><span>"</span><span>)</span>
<span># sort list `l` # pop the first value (the smallest) in `first_v` # make a note of the source of that value in `first_k` # replenish the corrisponding source </span><span>while</span> <span>True</span><span>:</span>
<span>if</span> <span>first_k</span> <span>is</span> <span>not</span> <span>None</span><span>:</span>
<span>try</span><span>:</span>
<span>replenish_source_list</span><span>(</span><span>first_k</span><span>)</span>
<span>l</span><span>.</span><span>append</span><span>((</span><span>source</span><span>[</span><span>first_k</span><span>].</span><span>pop</span><span>(</span><span>0</span><span>),</span> <span>first_k</span><span>))</span>
<span>except</span> <span>IndexError</span> <span>as</span> <span>e</span><span>:</span>
<span># the source list is empty </span> <span>print</span><span>(</span><span>f</span><span>"</span><span>source list </span><span>{</span><span>first_k</span><span>}</span><span> is now empty</span><span>"</span><span>)</span>
<span>first_k</span> <span>=</span> <span>None</span>
<span>if</span> <span>l</span><span>:</span>
<span>l</span><span>.</span><span>sort</span><span>(</span><span>key</span><span>=</span><span>lambda</span> <span>x</span><span>:</span> <span>x</span><span>[</span><span>0</span><span>])</span>
<span>try</span><span>:</span>
<span>first_v</span><span>,</span> <span>first_k</span> <span>=</span> <span>l</span><span>.</span><span>pop</span><span>(</span><span>0</span><span>)</span>
<span>write_to_csv</span><span>(</span><span>first_v</span><span>)</span>
<span>except</span> <span>IndexError</span> <span>as</span> <span>e</span><span>:</span>
<span>print</span><span>(</span><span>"</span><span>Exception in main: </span><span>"</span><span>,</span> <span>e</span><span>)</span>
<span>output</span><span>.</span><span>close</span><span>()</span>
<span>else</span><span>:</span>
<span>break</span>
<span>output</span><span>.</span><span>close</span><span>()</span>
<span>print</span><span>(</span><span>"</span><span>\n</span><span>done!</span><span>"</span><span>)</span>
<span># file: sort_merge.py </span><span>from</span> <span>io</span> <span>import</span> <span>TextIOWrapper</span>
<span>import</span> <span>sys</span>

<span># input CSV files - it assumes files are already sorted </span><span>CSVs</span> <span>=</span> <span>sys</span><span>.</span><span>argv</span><span>[</span><span>1</span><span>:]</span>

<span>CSV_MAX_ROWS</span> <span>=</span> <span>5000000</span>
<span>COUNTER</span> <span>=</span> <span>0</span>
<span>C</span> <span>=</span> <span>0</span>

<span>source</span><span>:</span> <span>dict</span><span>[</span><span>int</span><span>,</span> <span>list</span><span>]</span> <span>=</span> <span>{}</span>
<span>file_handlers</span><span>:</span> <span>dict</span><span>[</span><span>int</span><span>,</span> <span>TextIOWrapper</span><span>]</span> <span>=</span> <span>{}</span>


<span>def</span> <span>initial_fill</span><span>(</span><span>csv</span><span>:</span> <span>str</span><span>,</span> <span>idx</span><span>:</span> <span>int</span><span>):</span>
    <span>"""</span><span> opens the CSV file, saves the file handler, read few lines into source list for the index. </span><span>"""</span>
    <span>f</span> <span>=</span> <span>open</span><span>(</span><span>csv</span><span>,</span> <span>"</span><span>r</span><span>"</span><span>)</span>
    <span>file_handlers</span><span>[</span><span>idx</span><span>]</span> <span>=</span> <span>f</span>
    <span>while</span> <span>len</span><span>(</span><span>source</span><span>[</span><span>idx</span><span>])</span> <span><</span> <span>5</span><span>:</span>
        <span>line</span> <span>=</span> <span>f</span><span>.</span><span>readline</span><span>()</span>
        <span>if</span> <span>line</span> <span>!=</span> <span>""</span><span>:</span>
            <span>source</span><span>[</span><span>idx</span><span>].</span><span>append</span><span>(</span><span>line</span><span>)</span>
        <span>else</span><span>:</span>
            <span># reached end of file </span>            <span>print</span><span>(</span><span>f</span><span>"</span><span>initial_fill: CSV file </span><span>'</span><span>{</span><span>csv</span><span>}</span><span>'</span><span> at source index </span><span>{</span><span>idx</span><span>}</span><span> reached EOF.</span><span>"</span><span>)</span>
            <span>f</span><span>.</span><span>close</span><span>()</span>
            <span>break</span>


<span>def</span> <span>replenish_source_list</span><span>(</span><span>idx</span><span>:</span> <span>int</span><span>):</span>
    <span>"""</span><span> Refills the source list with a new value from the source file </span><span>"""</span>
    <span>try</span><span>:</span>
        <span>f</span> <span>=</span> <span>file_handlers</span><span>.</span><span>get</span><span>(</span><span>idx</span><span>,</span> <span>None</span><span>)</span>
        <span>if</span> <span>not</span> <span>f</span><span>:</span>
            <span>return</span>
        <span>line</span> <span>=</span> <span>f</span><span>.</span><span>readline</span><span>()</span>
        <span>if</span> <span>line</span> <span>!=</span> <span>""</span><span>:</span>
            <span>source</span><span>[</span><span>idx</span><span>].</span><span>append</span><span>(</span><span>line</span><span>)</span>
        <span>else</span><span>:</span>
            <span># reached end of file </span>            <span>print</span><span>(</span><span>f</span><span>"</span><span>index </span><span>{</span><span>idx</span><span>}</span><span> reached EOF.</span><span>"</span><span>)</span>
            <span>f</span><span>.</span><span>close</span><span>()</span>
            <span>del</span> <span>file_handlers</span><span>[</span><span>idx</span><span>]</span>
    <span>except</span> <span>Exception</span> <span>as</span> <span>e</span><span>:</span>
        <span>print</span><span>(</span><span>"</span><span>Excepton in replenish_queue: </span><span>"</span><span>,</span> <span>e</span><span>)</span>


<span>def</span> <span>write_to_csv</span><span>(</span><span>v</span><span>):</span>
    <span>global</span> <span>C</span>
    <span>global</span> <span>output</span>
    <span>global</span> <span>COUNTER</span>
    <span>if</span> <span>C</span> <span>>=</span> <span>CSV_MAX_ROWS</span><span>:</span>
        <span>output</span><span>.</span><span>close</span><span>()</span>
        <span>COUNTER</span> <span>+=</span> <span>1</span>
        <span>C</span> <span>=</span> <span>0</span>
        <span>output</span> <span>=</span> <span>open</span><span>(</span><span>f</span><span>"</span><span>out_</span><span>{</span><span>str</span><span>.</span><span>zfill</span><span>(</span><span>str</span><span>(</span><span>COUNTER</span><span>),</span> <span>3</span><span>)</span><span>}</span><span>.csv</span><span>"</span><span>,</span> <span>"</span><span>+w</span><span>"</span><span>)</span>

    <span>output</span><span>.</span><span>write</span><span>(</span><span>v</span><span>)</span>
    <span>C</span> <span>+=</span> <span>1</span>    

<span># init the source dict by opening each CSV file # and only reading few lines. </span><span>for</span> <span>idx</span><span>,</span> <span>csv</span> <span>in</span> <span>enumerate</span><span>(</span><span>CSVs</span><span>):</span>
    <span>source</span><span>[</span><span>idx</span><span>]</span> <span>=</span> <span>[]</span>

    <span>initial_fill</span><span>(</span><span>csv</span><span>,</span> <span>idx</span><span>)</span>

<span># the source dict now has a key for every file and a list of the first values read </span>
<span>l</span> <span>=</span> <span>[]</span>
<span># pop the first value in each source to a list `l` # `l` will have the first values of all source CSV files </span><span>for</span> <span>k</span><span>,</span> <span>v</span> <span>in</span> <span>source</span><span>.</span><span>items</span><span>():</span>
    <span>try</span><span>:</span>
        <span>l</span><span>.</span><span>append</span><span>((</span><span>v</span><span>.</span><span>pop</span><span>(</span><span>0</span><span>),</span> <span>k</span><span>))</span>
    <span>except</span> <span>IndexError</span> <span>as</span> <span>e</span><span>:</span>
        <span>pass</span>

<span>first_k</span> <span>=</span> <span>None</span>
<span>first_v</span> <span>=</span> <span>None</span>
<span>output</span> <span>=</span> <span>open</span><span>(</span><span>f</span><span>"</span><span>out_</span><span>{</span><span>str</span><span>.</span><span>zfill</span><span>(</span><span>str</span><span>(</span><span>COUNTER</span><span>),</span> <span>3</span><span>)</span><span>}</span><span>.csv</span><span>"</span><span>,</span> <span>"</span><span>+w</span><span>"</span><span>)</span>

<span># sort list `l` # pop the first value (the smallest) in `first_v` # make a note of the source of that value in `first_k` # replenish the corrisponding source </span><span>while</span> <span>True</span><span>:</span>
    <span>if</span> <span>first_k</span> <span>is</span> <span>not</span> <span>None</span><span>:</span>
        <span>try</span><span>:</span>
            <span>replenish_source_list</span><span>(</span><span>first_k</span><span>)</span>
            <span>l</span><span>.</span><span>append</span><span>((</span><span>source</span><span>[</span><span>first_k</span><span>].</span><span>pop</span><span>(</span><span>0</span><span>),</span> <span>first_k</span><span>))</span>

        <span>except</span> <span>IndexError</span> <span>as</span> <span>e</span><span>:</span>
            <span># the source list is empty </span>            <span>print</span><span>(</span><span>f</span><span>"</span><span>source list </span><span>{</span><span>first_k</span><span>}</span><span> is now empty</span><span>"</span><span>)</span>
            <span>first_k</span> <span>=</span> <span>None</span>

    <span>if</span> <span>l</span><span>:</span>
        <span>l</span><span>.</span><span>sort</span><span>(</span><span>key</span><span>=</span><span>lambda</span> <span>x</span><span>:</span> <span>x</span><span>[</span><span>0</span><span>])</span>
        <span>try</span><span>:</span>
            <span>first_v</span><span>,</span> <span>first_k</span> <span>=</span> <span>l</span><span>.</span><span>pop</span><span>(</span><span>0</span><span>)</span>
            <span>write_to_csv</span><span>(</span><span>first_v</span><span>)</span>
        <span>except</span> <span>IndexError</span> <span>as</span> <span>e</span><span>:</span>
            <span>print</span><span>(</span><span>"</span><span>Exception in main: </span><span>"</span><span>,</span> <span>e</span><span>)</span>
            <span>output</span><span>.</span><span>close</span><span>()</span>
    <span>else</span><span>:</span>
        <span>break</span>


<span>output</span><span>.</span><span>close</span><span>()</span>

<span>print</span><span>(</span><span>"</span><span>\n</span><span>done!</span><span>"</span><span>)</span>
# file: sort_merge.py from io import TextIOWrapper import sys # input CSV files - it assumes files are already sorted CSVs = sys.argv[1:] CSV_MAX_ROWS = 5000000 COUNTER = 0 C = 0 source: dict[int, list] = {} file_handlers: dict[int, TextIOWrapper] = {} def initial_fill(csv: str, idx: int): """ opens the CSV file, saves the file handler, read few lines into source list for the index. """ f = open(csv, "r") file_handlers[idx] = f while len(source[idx]) < 5: line = f.readline() if line != "": source[idx].append(line) else: # reached end of file print(f"initial_fill: CSV file '{csv}' at source index {idx} reached EOF.") f.close() break def replenish_source_list(idx: int): """ Refills the source list with a new value from the source file """ try: f = file_handlers.get(idx, None) if not f: return line = f.readline() if line != "": source[idx].append(line) else: # reached end of file print(f"index {idx} reached EOF.") f.close() del file_handlers[idx] except Exception as e: print("Excepton in replenish_queue: ", e) def write_to_csv(v): global C global output global COUNTER if C >= CSV_MAX_ROWS: output.close() COUNTER += 1 C = 0 output = open(f"out_{str.zfill(str(COUNTER), 3)}.csv", "+w") output.write(v) C += 1 # init the source dict by opening each CSV file # and only reading few lines. for idx, csv in enumerate(CSVs): source[idx] = [] initial_fill(csv, idx) # the source dict now has a key for every file and a list of the first values read l = [] # pop the first value in each source to a list `l` # `l` will have the first values of all source CSV files for k, v in source.items(): try: l.append((v.pop(0), k)) except IndexError as e: pass first_k = None first_v = None output = open(f"out_{str.zfill(str(COUNTER), 3)}.csv", "+w") # sort list `l` # pop the first value (the smallest) in `first_v` # make a note of the source of that value in `first_k` # replenish the corrisponding source while True: if first_k is not None: try: replenish_source_list(first_k) l.append((source[first_k].pop(0), first_k)) except IndexError as e: # the source list is empty print(f"source list {first_k} is now empty") first_k = None if l: l.sort(key=lambda x: x[0]) try: first_v, first_k = l.pop(0) write_to_csv(first_v) except IndexError as e: print("Exception in main: ", e) output.close() else: break output.close() print("\ndone!")

Enter fullscreen mode Exit fullscreen mode

Run it

<span>$ </span>python3 sort_merge.py s/<span>*</span>
index 58 reached EOF.
index 82 reached EOF.
<span>[</span>...]
<span>source </span>list 26 is now empty
<span>source </span>list 69 is now empty
<span>done</span><span>!</span>
<span>$ </span>python3 sort_merge.py s/<span>*</span>
index 58 reached EOF.
index 82 reached EOF.
<span>[</span>...]
<span>source </span>list 26 is now empty
<span>source </span>list 69 is now empty

<span>done</span><span>!</span>
$ python3 sort_merge.py s/* index 58 reached EOF. index 82 reached EOF. [...] source list 26 is now empty source list 69 is now empty done!

Enter fullscreen mode Exit fullscreen mode

Inspect the new files, also 100 in total

<span>$ </span><span>ls</span> <span>-A1</span> out_<span>*</span>
out_000.csv
out_001.csv
out_002.csv
<span>[</span>...]
out_098.csv
out_099.csv
<span>$ </span><span>ls</span> <span>-A1</span> out_<span>*</span>
out_000.csv
out_001.csv
out_002.csv
<span>[</span>...]
out_098.csv
out_099.csv
$ ls -A1 out_* out_000.csv out_001.csv out_002.csv [...] out_098.csv out_099.csv

Enter fullscreen mode Exit fullscreen mode

You should now see that the data is now sorted also across files, too.

<span>$ </span><span>head </span>out_000.csv
aabrwawoedcqnosvgzcvf,2024-02-01,5285.54,...
aasobyznvehzvrppwijpbbxfrjzdj,2024-02-01,7942.57,..
abcppzyblqnksovdnf,2024-02-01,7577.34,...
abjfxjatqangpalindkcdzsmzbasfx,2024-02-01,9831.37,...
aepkvifbrl,2024-02-01,1239.02,...
<span>$ </span><span>head </span>out_099.csv
zxwcmhfnwuqarb,2024-02-01,8477.3,...
zyjucqqytplxf,2024-02-01,5049.06,...
zyvwzspgaxzcymlvo,2024-02-01,5590.13,...
zzmrrnytooz,2024-02-01,7936.68,...
zzqpnbksbdheo,2024-02-01,7950.73,...
<span>$ </span><span>head </span>out_000.csv 
aabrwawoedcqnosvgzcvf,2024-02-01,5285.54,...
aasobyznvehzvrppwijpbbxfrjzdj,2024-02-01,7942.57,..
abcppzyblqnksovdnf,2024-02-01,7577.34,...
abjfxjatqangpalindkcdzsmzbasfx,2024-02-01,9831.37,...
aepkvifbrl,2024-02-01,1239.02,...

<span>$ </span><span>head </span>out_099.csv 
zxwcmhfnwuqarb,2024-02-01,8477.3,...
zyjucqqytplxf,2024-02-01,5049.06,...
zyvwzspgaxzcymlvo,2024-02-01,5590.13,...
zzmrrnytooz,2024-02-01,7936.68,...
zzqpnbksbdheo,2024-02-01,7950.73,...
$ head out_000.csv aabrwawoedcqnosvgzcvf,2024-02-01,5285.54,... aasobyznvehzvrppwijpbbxfrjzdj,2024-02-01,7942.57,.. abcppzyblqnksovdnf,2024-02-01,7577.34,... abjfxjatqangpalindkcdzsmzbasfx,2024-02-01,9831.37,... aepkvifbrl,2024-02-01,1239.02,... $ head out_099.csv zxwcmhfnwuqarb,2024-02-01,8477.3,... zyjucqqytplxf,2024-02-01,5049.06,... zyvwzspgaxzcymlvo,2024-02-01,5590.13,... zzmrrnytooz,2024-02-01,7936.68,... zzqpnbksbdheo,2024-02-01,7950.73,...

Enter fullscreen mode Exit fullscreen mode

You can now upload those files to your S3 bucket using the aws cli

aws s3 <span>cp </span>s/ <span>'s3://workshop-ca/sorted_across_files/'</span> <span>--recursive</span>
aws s3 <span>cp </span>s/ <span>'s3://workshop-ca/sorted_across_files/'</span> <span>--recursive</span>
aws s3 cp s/ 's3://workshop-ca/sorted_across_files/' --recursive

Enter fullscreen mode Exit fullscreen mode

At this point, you can safely terminate the AWS instance.

原文链接:Generate multiple, large, sorted CSV files with pseudo-random data

© 版权声明
THE END
喜欢就支持一下吧
点赞6 分享
I may not be perfect but at least I’m not fake.
我可能不完美,但是我至少不虚伪
评论 抢沙发

请登录后发表评论

    暂无评论内容