Re-vectorize data on CHPC

Looking at last night’s submission to CHPC

Got an error in fd_pedestal_data_vectorization.err:

Traceback (most recent call last):
  File "fd_pedestal_data_vectorization.py", line 286, in <module>
    main()
  File "fd_pedestal_data_vectorization.py", line 151, in main
    X, part_status = _vectorize_pedestal_data_df(ymds, part)
  File "fd_pedestal_data_vectorization.py", line 205, in _vectorize_pedestal_data_df
    X[row[i], col[i]] = ped_fluct[i]
IndexError: index 3072 is out of bounds for axis 0 with size 3072

The script made it to LR and then ran into an error. It might have been that I was using the BR max frames for LR, which I should not be doing. I edited LR out of the script.
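
The failing line is the assignment inside _vectorize_pedestal_data_df. A minimal defensive sketch of that step (only X, row, col, and ped_fluct come from the traceback; the wrapper function and the error message are assumptions) would check each index against the allocated shape, so a BR-sized array filled with LR indices fails with a readable message instead of a bare IndexError:

import numpy as np

def fill_with_bounds_check(X, row, col, ped_fluct):
    """Copy ped_fluct values into a preallocated array X, reporting any
    record whose (row, col) index falls outside X's shape."""
    for i in range(len(ped_fluct)):
        if row[i] >= X.shape[0] or col[i] >= X.shape[1]:
            raise IndexError(
                "record {}: ({}, {}) is outside allocated shape {} - check that "
                "the max frame/pixel counts match this site".format(
                    i, row[i], col[i], X.shape))
        X[row[i], col[i]] = ped_fluct[i]
    return X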

Looking at the shell .out log:

$ cat out/fd_pedestal_data_vectorization.out
Beginning job on kp007 on Mon Mar 18 16:50:36 MDT 2019
...
Found 19 Frames for y2009m05d28s1 part 12
Saving Vectorized FD Pedestal Data as Numpy Arrays with shape (19, 32, 96) to /scratch/local/u0949991/Data/fd_ped_vect/y2009m05d28s1p12_ped_fluct_vectorized_padded.npy...
Saving Padded Vectorized FD Pedestal Data as Numpy Arrays with shape (19, 32, 96) to /scratch/local/u0949991/Data/fd_ped_vect/y2009m05d28s1p12_ped_fluct_vectorized_padded.npy...
Found 29 Frames for y2009m05d28s1 part 4
Saving Vectorized FD Pedestal Data as Numpy Arrays with shape (29, 32, 96) to /scratch/local/u0949991/Data/fd_ped_vect/y2009m05d28s1p4_ped_fluct_vectorized_padded.npy...
Saving Padded Vectorized FD Pedestal Data as Numpy Arrays with shape (29, 32, 96) to /scratch/local/u0949991/Data/fd_ped_vect/y2009m05d28s1p4_ped_fluct_vectorized_padded.npy...
Found 32 Frames for y2009m05d29s1 part 8
Finished Data Vectorization
Copying Outputs from Local Scratch
...
Cleaning Up Local Scratch
Job completed on Mon Mar 18 21:40:45 MDT 2019

Ran for about 5 hours. I noticed an error in printing out the shape of the padded array. Fixed that in the vectorization scripts.
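
In the log above, both save messages report the same shape and the same _padded.npy path, so the fix is presumably to have each message reference its own array and output path. A minimal sketch, with X, X_padded, out_path, and out_path_padded as assumed stand-ins for whatever the script actually uses:

import numpy as np

def save_vectorized(X, X_padded, out_path, out_path_padded):
    """Save the nonpadded and padded arrays, each logged with its own shape and path."""
    print("Saving Vectorized FD Pedestal Data as Numpy Arrays with shape {} to {}...".format(
        X.shape, out_path))
    np.save(out_path, X)
    print("Saving Padded Vectorized FD Pedestal Data as Numpy Arrays with shape {} to {}...".format(
        X_padded.shape, out_path_padded))
    np.save(out_path_padded, X_padded)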

Looking at last night’s data outputs

Size and amount of BR vectorized data

Padded and Nonpadded Data Files:

$ ls /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect/*s0* | wc -l
19759
$ ls /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/*s0* | wc -l
19759

Clear out LR data on CHPC:

$ ls -lrt /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect/*s1* | wc -l
3032
$ rm /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect/*s1*
$ ls /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect/*s1*
ls: cannot access /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect/*s1*: No such file or directory
$ rm /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/*s1*
$ ls /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/*s1*
ls: cannot access /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/*s1*: No such file or directory

Size of Data:

$ du -sh /scratch/kingspeak/serial/u0949991/Data/*
16G	/scratch/kingspeak/serial/u0949991/Data/fd_ped_h5
98G	/scratch/kingspeak/serial/u0949991/Data/fd_ped_vect
14G	/scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded

Looking at Data with Numpy

$ du -sh '/scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/y2007m11d07s0p4_ped_fluct_vectorized.npy'
196K	/scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/y2007m11d07s0p4_ped_fluct_vectorized.npy

Looking at y2007m11d07 on CHPC:

>>> X = np.load('/scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/y2007m11d07s0p4_ped_fluct_vectorized.npy')
>>> X.shape
(8, 32, 96)
>>> X[0]
array([[ 0.55176031,  0.69698009,  0.64952331, ...,  0.47971234,
         0.47971234,  0.42252466],
       [ 0.57747534,  0.63466303,  0.57747534, ...,  0.47971234,
         0.47971234,  0.32476166],
       [ 0.42252466,  0.64952331,  0.69698009, ...,  0.47971234,
         0.55176031,  0.47971234],
       ...,
       [ 0.32476166,  0.32476166,  0.47971234, ...,  0.32476166,
         0.32476166,  0.32476166],
       [ 0.32476166,  0.32476166,  0.32476166, ...,  0.32476166,
         0.42252466,  0.32476166],
       [ 0.32476166,  0.47971234,  0.32476166, ...,  0.32476166,
         0.32476166,  0.32476166]])
>>> X[-1]
array([[ 0.62178303,  0.32672288,  0.42507626, ...,  0.32672288,
         0.32672288,  0.32672288],
       [ 0.32672288,  0.32672288,  0.32672288, ...,  0.32672288,
         0.32672288,  0.32672288],
       [ 0.42507626,  0.32672288,  0.52342965, ...,  0.32672288,
         0.42507626,  0.32672288],
       ...,
       [ 0.32672288,  0.32672288,  0.32672288, ...,  0.32672288,
         0.32672288,  0.32672288],
       [ 0.32672288,  0.32672288,  0.32672288, ...,  0.32672288,
         0.32672288,  0.32672288],
       [ 0.32672288,  0.32672288,  0.32672288, ...,  0.32672288,
         0.32672288,  0.32672288]])

Issues with colormaps on CHPC. Creating quick plots on GF-Ultra:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import matplotlib as mpl
>>> X = np.load('y2007m11d07s0p4_ped_fluct_vectorized.npy')
>>> plt.imshow(X[0], cmap='inferno', vmin=.1, vmax=X[-1].max())
>>> plt.savefig('y2007m11d07s0p4_framefirst.png', bbox_inches='tight')
>>> X = np.load('y2007m11d07s0p30_ped_fluct_vectorized.npy')
>>> plt.imshow(X[-1], cmap='inferno', vmin=.1, vmax=X[-1].max())
<matplotlib.image.AxesImage object at 0x7fd076e24bd0>
>>> plt.savefig('y2007m11d07s0p30_framelast.png', bbox_inches='tight')

y2007m11d07 images

First Frame
Comparing photos of DataFrame-generated images (left) and NumPy vectorized data (right)


Last Frame
Comparing photos of DataFrame-generated images (left) and NumPy vectorized data (right)


In agreement.

Backup to GF-Ultra

Cleared out old data on GF-Ultra.

Before clearing:

$ du -sh *
16G	fd_ped_h5
8.9G	fd_ped_vect
1.2G	fd_ped_vect_nonpadded

After clearing:

$ du -sh *
16G	fd_ped_h5
148K	fd_ped_vect
132K	fd_ped_vect_nonpadded

Backup from CHPC to GF-Ultra

$ rsync -avn kingspeak:/scratch/kingspeak/serial/u0949991/Data/* .
...
fd_ped_vect_nonpadded/y2017m11d28s0p16_ped_fluct_vectorized.npy
fd_ped_vect_nonpadded/y2017m11d28s0p20_ped_fluct_vectorized.npy
fd_ped_vect_nonpadded/y2017m11d28s0p21_ped_fluct_vectorized.npy
fd_ped_vect_nonpadded/y2017m11d28s0p5_ped_fluct_vectorized.npy
fd_ped_vect_nonpadded/y2017m11d28s0p6_ped_fluct_vectorized.npy

sent 118,593 bytes  received 2,005,409 bytes  83,294.20 bytes/sec
total size is 135,588,134,692  speedup is 63,836.16 (DRY RUN)

$ rsync -av kingspeak:/scratch/kingspeak/serial/u0949991/Data/* .
receiving incremental file list
fd_ped_vect/
fd_ped_vect/y2007m11d01s0p4_ped_fluct_vectorized_padded.npy
fd_ped_vect/y2007m11d01s0p7_ped_fluct_vectorized_padded.npy
...
fd_ped_vect_nonpadded/y2017m11d28s0p5_ped_fluct_vectorized.npy
fd_ped_vect_nonpadded/y2017m11d28s0p6_ped_fluct_vectorized.npy

sent 750,929 bytes  received 119,005,598,217 bytes  55,779,868.36 bytes/sec
total size is 135,588,134,692  speedup is 1.14

Size on GF-Ultra

$ du -sh *
16G	fd_ped_h5
98G	fd_ped_vect
14G	fd_ped_vect_nonpadded

Loading Files into Keras in Batch

Since my data set is so large, it would be nice to just point Keras at which .npy files to load for the training data.
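
As a first step, something like this could collect the vectorized file IDs from the data directory and split them into the train/validation partition a generator would consume (the use of the padded directory and the 90/10 split are assumptions for illustration):

import os
import random

data_dir = '/GDF/TAResearch/FD_Ped_Weather/Data/fd_ped_vect'
# file IDs without the .npy extension, e.g. 'y2007m11d01s0p4_ped_fluct_vectorized_padded'
ids = [f[:-4] for f in os.listdir(data_dir) if f.endswith('.npy')]
random.shuffle(ids)

split = int(0.9 * len(ids))
partition = {'train': ids[:split], 'validation': ids[split:]}
print(len(partition['train']), len(partition['validation']))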

Example Sources 1 2

Creating a test directory to try out these functions:

$ find . -maxdepth 1 -type f | head -100 |xargs cp -t "../test"
$ ls -lrt /GDF/TAResearch/FD_Ped_Weather/Data/fd_ped_vect/ | wc -l
19760
$ ls ../test| wc -l
100

Another really good machine learning source, from a TA at Stanford, covers using a data generator with Keras in Python. Test using this example.
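
Before wiring this into the real training pipeline, here is a minimal sketch of that pattern to try against the 100-file test directory. It follows the tutorial's keras.utils.Sequence approach; the names (DataGenerator, labels, dim), the single-frame-per-file loading, and the integer labels are assumptions for illustration, not the final design.

import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    """Loads batches of vectorized FD pedestal .npy files on the fly."""

    def __init__(self, list_IDs, labels, data_dir, batch_size=32,
                 dim=(32, 96), n_channels=1, shuffle=True):
        self.list_IDs = list_IDs      # file IDs, e.g. 'y2007m11d07s0p4_ped_fluct_vectorized'
        self.labels = labels          # dict mapping ID -> integer label (e.g. a weather class)
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.dim = dim
        self.n_channels = n_channels
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # number of batches per epoch
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        # pick the IDs belonging to this batch
        idxs = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
        batch_IDs = [self.list_IDs[k] for k in idxs]
        X = np.empty((self.batch_size,) + self.dim + (self.n_channels,))
        y = np.empty((self.batch_size,), dtype=int)
        for i, ID in enumerate(batch_IDs):
            arr = np.load('{}/{}.npy'.format(self.data_dir, ID))
            # take only the first frame for illustration; the real files hold many frames
            X[i, :, :, 0] = arr[0]
            y[i] = self.labels[ID]
        return X, y

    def on_epoch_end(self):
        # reshuffle the IDs between epochs
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle:
            np.random.shuffle(self.indexes)

A model would then consume this with something like model.fit_generator(DataGenerator(partition['train'], labels, '../test'), validation_data=DataGenerator(partition['validation'], labels, '../test')), rather than loading all 98 GB of arrays into memory at once.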