TAResearch 2019-03-19
Categories: CHPC, Weather, ML Data Preprocessing, Keras
Re-vectorize data on CHPC
Looking at last night’s submission to CHPC
Got an error in fd_pedestal_data_vectorization.err:
Traceback (most recent call last):
  File "fd_pedestal_data_vectorization.py", line 286, in <module>
    main()
  File "fd_pedestal_data_vectorization.py", line 151, in main
    X, part_status = _vectorize_pedestal_data_df(ymds, part)
  File "fd_pedestal_data_vectorization.py", line 205, in _vectorize_pedestal_data_df
    X[row[i], col[i]] = ped_fluct[i]
IndexError: index 3072 is out of bounds for axis 0 with size 3072
The script made it to LR and then ran into this error. It might be that I was using the BR max frames for LR, which I should not be doing. I edited LR out of the script.
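For reference, a minimal sketch of a guard that would surface this kind of geometry mismatch before the fill loop; the 3072 comes from the traceback, but the function and argument names are hypothetical, not the script's actual API:

import numpy as np

def fill_frame(row, col, ped_fluct, n_rows=3072, n_cols=96):
    # Hypothetical guard: fail with a clear message if the row/col indices
    # exceed the allocated grid, e.g. BR dimensions applied to an LR part.
    if row.max() >= n_rows or col.max() >= n_cols:
        raise ValueError('row/col indices exceed the (%d, %d) grid for this site'
                         % (n_rows, n_cols))
    X = np.zeros((n_rows, n_cols))
    X[row, col] = ped_fluct  # vectorized form of the per-i assignment loop
    return X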
Looking at the shell output log:
$ cat out/fd_pedestal_data_vectorization.out
Beginning job on kp007 on Mon Mar 18 16:50:36 MDT 2019
...
Found 19 Frames for y2009m05d28s1 part 12
Saving Vectorized FD Pedestal Data as Numpy Arrays with shape (19, 32, 96) to /scratch/local/u0949991/Data/fd_ped_vect/y2009m05d28s1p12_ped_fluct_vectorized_padded.npy...
Saving Padded Vectorized FD Pedestal Data as Numpy Arrays with shape (19, 32, 96) to /scratch/local/u0949991/Data/fd_ped_vect/y2009m05d28s1p12_ped_fluct_vectorized_padded.npy...
Found 29 Frames for y2009m05d28s1 part 4
Saving Vectorized FD Pedestal Data as Numpy Arrays with shape (29, 32, 96) to /scratch/local/u0949991/Data/fd_ped_vect/y2009m05d28s1p4_ped_fluct_vectorized_padded.npy...
Saving Padded Vectorized FD Pedestal Data as Numpy Arrays with shape (29, 32, 96) to /scratch/local/u0949991/Data/fd_ped_vect/y2009m05d28s1p4_ped_fluct_vectorized_padded.npy...
Found 32 Frames for y2009m05d29s1 part 8
Finished Data Vectorization
Copying Outputs from Local Scratch
...
Cleaning Up Local Scratch
Job completed on Mon Mar 18 21:40:45 MDT 2019
Ran for about 5 hours. I noticed an error in the printout of the padded array's shape (note that both save messages above report the same shape and the padded filename). Fixed that in the vectorization scripts.
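For reference, a sketch of what the corrected save messages might look like. The message format comes from the log above, but the variable names, shapes, and paths here are stand-ins, not the script's actual code:

import numpy as np

# Hypothetical stand-ins for the script's real arrays and output paths.
X = np.zeros((19, 32, 96))           # non-padded stack of frames
X_padded = np.zeros((32, 32, 96))    # padded stack (pad length is an assumption)
nonpadded_path = 'y2009m05d28s1p12_ped_fluct_vectorized.npy'
padded_path = 'y2009m05d28s1p12_ped_fluct_vectorized_padded.npy'

print('Saving Vectorized FD Pedestal Data as Numpy Arrays with shape '
      '{} to {}...'.format(X.shape, nonpadded_path))
np.save(nonpadded_path, X)
print('Saving Padded Vectorized FD Pedestal Data as Numpy Arrays with shape '
      '{} to {}...'.format(X_padded.shape, padded_path))
np.save(padded_path, X_padded)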
Looking at last night’s data outputs
Size and number of BR vectorized data files
Padded and Nonpadded Data Files:
$ ls /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect/*s0* | wc -l
19759
$ ls /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/*s0* | wc -l
19759
Clear out LR data on CHPC:
$ ls -lrt /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect/*s1* | wc -l
3032
$ rm /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect/*s1*
$ ls /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect/*s1*
ls: cannot access /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect/*s1*: No such file or directory
$ rm /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/*s1*
$ ls /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/*s1*
ls: cannot access /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/*s1*: No such file or directory
Size of Data:
$ du -sh /scratch/kingspeak/serial/u0949991/Data/*
16G /scratch/kingspeak/serial/u0949991/Data/fd_ped_h5
98G /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect
14G /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded
Looking at Data with NumPy
$ du -sh '/scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/y2007m11d07s0p4_ped_fluct_vectorized.npy'
196K /scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/y2007m11d07s0p4_ped_fluct_vectorized.npy
Looking at y2007m11d07 on CHPC:
>>> import numpy as np
>>> X = np.load('/scratch/kingspeak/serial/u0949991/Data/fd_ped_vect_nonpadded/y2007m11d07s0p4_ped_fluct_vectorized.npy')
>>> X.shape
(8, 32, 96)
>>> X[0]
array([[ 0.55176031, 0.69698009, 0.64952331, ..., 0.47971234,
0.47971234, 0.42252466],
[ 0.57747534, 0.63466303, 0.57747534, ..., 0.47971234,
0.47971234, 0.32476166],
[ 0.42252466, 0.64952331, 0.69698009, ..., 0.47971234,
0.55176031, 0.47971234],
...,
[ 0.32476166, 0.32476166, 0.47971234, ..., 0.32476166,
0.32476166, 0.32476166],
[ 0.32476166, 0.32476166, 0.32476166, ..., 0.32476166,
0.42252466, 0.32476166],
[ 0.32476166, 0.47971234, 0.32476166, ..., 0.32476166,
0.32476166, 0.32476166]])
>>> X[-1]
array([[ 0.62178303, 0.32672288, 0.42507626, ..., 0.32672288,
0.32672288, 0.32672288],
[ 0.32672288, 0.32672288, 0.32672288, ..., 0.32672288,
0.32672288, 0.32672288],
[ 0.42507626, 0.32672288, 0.52342965, ..., 0.32672288,
0.42507626, 0.32672288],
...,
[ 0.32672288, 0.32672288, 0.32672288, ..., 0.32672288,
0.32672288, 0.32672288],
[ 0.32672288, 0.32672288, 0.32672288, ..., 0.32672288,
0.32672288, 0.32672288],
[ 0.32672288, 0.32672288, 0.32672288, ..., 0.32672288,
0.32672288, 0.32672288]])
Issues with colormaps on CHPC; creating quick plots on GF-Ultra instead:
>>> import numpy as np
>>> import matplotlib as mpl
>>> import matplotlib.pyplot as plt
>>> X = np.load('y2007m11d07s0p4_ped_fluct_vectorized.npy')
>>> plt.imshow(X[0], cmap='inferno', vmin=.1, vmax=X[-1].max())
>>> plt.savefig('y2007m11d07s0p4_framefirst.png', bbox_inches='tight')
>>> X = np.load('y2007m11d07s0p30_ped_fluct_vectorized.npy')
>>> plt.imshow(X[-1], cmap='inferno', vmin=.1, vmax=X[-1].max())
<matplotlib.image.AxesImage object at 0x7fd076e24bd0>
>>> plt.savefig('y2007m11d07s0p30_framelast.png', bbox_inches='tight')
y2007m11d07 images
In agreement with the array values printed above.
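If the CHPC colormap trouble was actually a headless-display/backend problem (an assumption; the note above only mentions colormaps), the usual workaround would be to force a non-interactive backend before importing pyplot:

import matplotlib
matplotlib.use('Agg')  # render off-screen; no display needed on a cluster node
import matplotlib.pyplot as plt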
Backup to GF-Ultra
Cleared out old data on GF-Ultra. Before:
$ du -sh *
16G fd_ped_h5
8.9G fd_ped_vect
1.2G fd_ped_vect_nonpadded
After clearing:
$ du -sh *
16G fd_ped_h5
148K fd_ped_vect
132K fd_ped_vect_nonpadded
Backup from CHPC to GF-Ultra
$ rsync -avn kingspeak:/scratch/kingspeak/serial/u0949991/Data/* .
...
fd_ped_vect_nonpadded/y2017m11d28s0p16_ped_fluct_vectorized.npy
fd_ped_vect_nonpadded/y2017m11d28s0p20_ped_fluct_vectorized.npy
fd_ped_vect_nonpadded/y2017m11d28s0p21_ped_fluct_vectorized.npy
fd_ped_vect_nonpadded/y2017m11d28s0p5_ped_fluct_vectorized.npy
fd_ped_vect_nonpadded/y2017m11d28s0p6_ped_fluct_vectorized.npy
sent 118,593 bytes received 2,005,409 bytes 83,294.20 bytes/sec
total size is 135,588,134,692 speedup is 63,836.16 (DRY RUN)
$ rsync -av kingspeak:/scratch/kingspeak/serial/u0949991/Data/* .
receiving incremental file list
fd_ped_vect/
fd_ped_vect/y2007m11d01s0p4_ped_fluct_vectorized_padded.npy
fd_ped_vect/y2007m11d01s0p7_ped_fluct_vectorized_padded.npy
...
fd_ped_vect_nonpadded/y2017m11d28s0p5_ped_fluct_vectorized.npy
fd_ped_vect_nonpadded/y2017m11d28s0p6_ped_fluct_vectorized.npy
sent 750,929 bytes received 119,005,598,217 bytes 55,779,868.36 bytes/sec
total size is 135,588,134,692 speedup is 1.14
Size on GF-Ultra
$ du -sh *
16G fd_ped_h5
98G fd_ped_vect
14G fd_ped_vect_nonpadded
Loading Files into Keras in Batch
Since my data is so large, it would be nice to just point Keras at which .npy files to load for the training data.
Creating a test directory to test these functions with:
$ find . -maxdepth 1 -type f | head -100 |xargs cp -t "../test"
$ ls -lrt /GDF/TAResearch/FD_Ped_Weather/Data/fd_ped_vect/ | wc -l
19760
$ ls ../test| wc -l
100
Another really good machine learning source from a TA at Stanford. It covers using a data generator with Keras in Python. I will test using this example; a first sketch follows below.
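As a starting point, a minimal sketch of such a generator, assuming every padded .npy file shares one fixed (frames, 32, 96) shape and that labels come from some file-to-label mapping; all names here are my assumptions, not the Stanford example verbatim:

import numpy as np
from keras.utils import Sequence

class NpyBatchGenerator(Sequence):
    # Loads batches of .npy files lazily instead of holding all ~98G in RAM.
    def __init__(self, file_paths, labels, batch_size=32, shuffle=True):
        self.file_paths = file_paths  # e.g. a glob of *_ped_fluct_vectorized_padded.npy
        self.labels = labels          # hypothetical dict: file path -> label
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.file_paths) / float(self.batch_size)))

    def __getitem__(self, index):
        ids = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
        # assumes every padded file shares the same fixed (frames, 32, 96) shape
        X = np.stack([np.load(self.file_paths[i]) for i in ids])
        y = np.array([self.labels[self.file_paths[i]] for i in ids])
        return X, y

    def on_epoch_end(self):
        # reshuffle the file order between epochs
        self.indexes = np.arange(len(self.file_paths))
        if self.shuffle:
            np.random.shuffle(self.indexes)

A generator like this would plug into model.fit_generator() the way the Stanford example describes, e.g. pointed at the 100 files copied into ../test above.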