TAResearch 2019-03-15
Categories : Weather, Machine Learning, Pandas, Numpy
- Checking Weather Classification Data
- Re-vectorize data from each night pedestal dataframes in fd_ped_h5 data
Checking Weather Classification Data
Looking at the data saved in fd_ped_h5
vs. fd_ped_vect
vs. fd_ped_vect_nonpadded
to see how far yesterday's discrepancies extend.
y2016m07d15s1
fd_ped_h5
>>> store = pd.HDFStore('y2016m07d15s1_ped_fluct.h5')
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: y2016m07d15s1_ped_fluct.h5
>>> store.info
<bound method HDFStore.info of <class 'pandas.io.pytables.HDFStore'>
File path: y2016m07d15s1_ped_fluct.h5
>
>>> store.info()
u"<class 'pandas.io.pytables.HDFStore'>\nFile path: y2016m07d15s1_ped_fluct.h5\nEmpty"
Looks empty. Try reprocessing? This is an LR night, so wait until later.
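Rather than eyeballing `store.info()` per night, the emptiness check above can be scripted. A minimal sketch, assuming pandas with PyTables installed; `hdf_is_empty` is a hypothetical helper, not part of the processing scripts:

```python
import pandas as pd

def hdf_is_empty(path):
    """Return True if the HDF5 store at `path` holds no pandas objects,
    like the y2016m07d15s1 file above."""
    with pd.HDFStore(path, mode='r') as store:
        return len(store.keys()) == 0
```

This could be looped over every `*_ped_fluct.h5` file to list the nights that need reprocessing.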
y2007m11d07 (First Training Night)
Night Animation
As the animation created from the dataframe below shows, the night consists of parts 4-30.
First Frame :
Last Frame :
fd_ped_h5
Load HDF5 file with Pandas
>>> store = pd.HDFStore('y2007m11d07s0_ped_fluct.h5')
>>> store.info()
u"<class 'pandas.io.pytables.HDFStore'>\nFile path: y2007m11d07s0_ped_fluct.h5\n/frame_info_df frame (shape->[401,7]) \n/ped_fluct_df frame (shape->[401,3072])\n/ped_fluct_norm_df frame (shape->[401,3072])"
Load frame_info_df :
>>> df = store['frame_info_df']
>>> df.head()
frame_max ... frame_time
0 120.0 ... 2007-11-07 02:45:38.182826622
1 311.0 ... 2007-11-07 02:46:38.182826622
2 402.0 ... 2007-11-07 02:47:38.182826622
3 260.0 ... 2007-11-07 02:48:38.182826622
4 84.0 ... 2007-11-07 02:49:38.182826622
[5 rows x 7 columns]
>>> df.columns
Index([u'frame_max', u'frame_mean', u'frame_min', u'frame_minute',
u'frame_part', u'frame_sigma', u'frame_time'],
dtype='object')
Load ped_fluct_df :
>>> df = store['ped_fluct_df']
>>> df.columns
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
3062, 3063, 3064, 3065, 3066, 3067, 3068, 3069, 3070, 3071],
dtype='int64', length=3072)
>>> df.head()
0 1 2 3 4 5 6 ... 3065 3066 3067 3068 3069 3070 3071
0 3 1 4 8 2 8 2 ... 1 1 1 2 1 1 1
1 2 1 3 4 2 6 2 ... 3 2 2 1 2 1 1
2 1 1 3 1 4 1 2 ... 1 1 1 1 1 1 1
3 4 1 2 3 5 4 1 ... 1 1 1 1 1 1 1
4 1 2 1 5 6 3 7 ... 1 1 1 1 1 1 1
Load ped_fluct_norm_df :
>>> df = store['ped_fluct_norm_df']
>>> df.columns
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
3062, 3063, 3064, 3065, 3066, 3067, 3068, 3069, 3070, 3071],
dtype='int64', length=3072)
>>> df.head()
0 1 2 3 ... 3068 3069 3070 3071
0 0.479712 0.324762 0.520288 0.618051 ... 0.422525 0.324762 0.324762 0.324762
1 0.372493 0.286307 0.422909 0.458680 ... 0.286307 0.372493 0.286307 0.286307
2 0.277452 0.277452 0.409830 0.277452 ... 0.277452 0.277452 0.277452 0.277452
3 0.469128 0.292828 0.380978 0.432543 ... 0.292828 0.292828 0.292828 0.292828
4 0.341965 0.444906 0.341965 0.580988 ... 0.341965 0.341965 0.341965 0.341965
[5 rows x 3072 columns]
fd_ped_vect_nonpadded
-rw-r--r-- 1 gfurlich gfurlich 128 Dec 4 10:37 y2007m11d07s0_ped_fluct_vectorized.npy
Looks empty… Every training-night file is empty… hmm… Not good.
-rw-r--r-- 1 gfurlich gfurlich 128 Dec 4 13:04 y2017m09d13s0_ped_fluct_vectorized.npy
-rw-r--r-- 1 gfurlich gfurlich 245888 Dec 4 13:04 y2017m09d14s0_ped_fluct_vectorized.npy
-rw-r--r-- 1 gfurlich gfurlich 128 Dec 4 13:04 y2017m09d16s0_ped_fluct_vectorized.npy
Looking at it with Numpy
>>> import numpy as np
>>> np.load('y2007m11d07s0_ped_fluct_vectorized.npy')
array([], shape=(0, 32, 96), dtype=float64)
Looks empty. Look at another night.
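Since the broken files all come back with shape (0, 32, 96), a quick sweep can list every affected night at once. A sketch; the glob pattern is assumed from the filenames in the listing above:

```python
import glob
import numpy as np

def find_empty_vectorized(pattern):
    """Return the vectorized .npy files whose arrays hold zero frames."""
    empty = []
    for path in sorted(glob.glob(pattern)):
        arr = np.load(path)
        if arr.shape[0] == 0:  # no frames were vectorized for this night
            empty.append(path)
    return empty

# e.g. find_empty_vectorized('fd_ped_vect_nonpadded/*_ped_fluct_vectorized.npy')
```

The ~128-byte file sizes above are consistent with this: that is roughly the size of a bare .npy header with an empty data section.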
fd_ped_vect
y2017m11d28s0 non training data
fd_ped_vect_nonpadded
File :
-rw-r--r-- 1 gfurlich gfurlich 1106048 Dec 4 13:08 y2017m11d28s0_ped_fluct_vectorized.npy
>>> X = np.load('y2017m11d28s0_ped_fluct_vectorized.npy')
>>> X.shape
(45, 32, 96)
>>> X[-1]
array([[0.34257391, 0.34257391, 0.34257391, ..., 0.34257391, 0.34257391,
0.34257391],
[0.34257391, 0.34257391, 0.34257391, ..., 0.34257391, 0.34257391,
0.34257391],
[0.34257391, 0.34257391, 0.34257391, ..., 0.34257391, 0.44569893,
0.34257391],
...,
[0.34257391, 0.34257391, 0.34257391, ..., 0.34257391, 0.34257391,
0.34257391],
[0.34257391, 0.34257391, 0.34257391, ..., 0.34257391, 0.34257391,
0.34257391],
[0.34257391, 0.34257391, 0.34257391, ..., 0.34257391, 0.34257391,
0.44569893]])
Quick plot :
>>> import matplotlib.pyplot as plt
>>> plt.imshow(X[-1], cmap='inferno', vmin=.1, vmax=X[-1].max())
<matplotlib.image.AxesImage object at 0x7f17fc2e4710>
>>> plt.show()
Looks right. Check shape against master db.
Master fd ped part database
Files :
[2019-03-15 10:21:59] $ ls -lrt *.h5
-rw-r--r-- 1 gfurlich gfurlich 3274744 Oct 1 10:58 fd_pedestal_nights_db.h5
-rw-r--r-- 1 gfurlich gfurlich 150787 Oct 1 10:58 pmt_positions.h5
-rw-r--r-- 1 gfurlich gfurlich 16793758 Oct 1 10:58 master_fd_ped_db_by_part.h5
-rw-r--r-- 1 gfurlich gfurlich 2370620 Oct 10 15:09 fd_dark_time_db.h5
[2019-03-15 10:22:13] $ du -sh master_fd_ped_db_by_part.h5
17M master_fd_ped_db_by_part.h5
Load master_fd_ped_db_by_part.h5
>>> import pandas as pd
>>> store = pd.HDFStore('master_fd_ped_db_by_part.h5')
>>> store.info()
u"<class 'pandas.io.pytables.HDFStore'>\nFile path: master_fd_ped_db_by_part.h5\n/master_br_fd_ped_db_by_part frame (shape->[19759,6])\n/master_lr_fd_ped_db_by_part frame (shape->[16493,6])"
>>> df = store['master_br_fd_ped_db_by_part']
>>> df.columns
Index([u'part', u'part_duration', u'part_start', u'part_stop',
u'part_weather_status', u'run_night'],
dtype='object')
Find the last night, y2017m11d28, and print its info :
>>> df[df['run_night'] == pd.to_datetime('2017-11-28').date()]
part part_duration ... part_weather_status run_night
0 5 00:22:21.008656 ... 0 2017-11-28
1 6 00:43:02.009053 ... 0 2017-11-28
2 10 00:25:47.215286 ... 0 2017-11-28
3 11 00:37:24.218904 ... 0 2017-11-28
4 15 00:24:41.500328 ... 0 2017-11-28
6 20 00:23:18.214296 ... 0 2017-11-28
7 21 00:08:32.148072 ... 0 2017-11-28
5 16 00:45:02.840683 ... 0 2017-11-28
>>> df[df['run_night'] == pd.to_datetime('2017-11-28').date()].to_string()
part part_duration part_start part_stop part_weather_status run_night
0 5 00:22:21.008656 2017-11-28 08:45:34.818238216 2017-11-28 09:07:55.826895139 0 2017-11-28
1 6 00:43:02.009053 2017-11-28 09:08:23.289856964 2017-11-28 09:51:25.298910500 0 2017-11-28
2 10 00:25:47.215286 2017-11-28 09:53:48.009802586 2017-11-28 10:19:35.225089529 0 2017-11-28
3 11 00:37:24.218904 2017-11-28 10:20:02.569907793 2017-11-28 10:57:26.788812717 0 2017-11-28
4 15 00:24:41.500328 2017-11-28 10:59:49.881877531 2017-11-28 11:24:31.382205713 0 2017-11-28
6 20 00:23:18.214296 2017-11-28 12:12:24.851270744 2017-11-28 12:35:43.065567492 0 2017-11-28
7 21 00:08:32.148072 2017-11-28 12:36:10.301215743 2017-11-28 12:44:42.449288199 0 2017-11-28
5 16 00:45:02.840683 2017-11-28 11:24:58.684055964 2017-11-28 12:10:01.524739755 0 2017-11-28
I realize I don't have frame counts for each part, but I can infer them as ceil(part_duration in minutes) - 1, since the frames come from the minute-to-minute diffs.
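The ceil-minus-one rule can be written out directly; `frames_in_part` is an illustrative helper, not a function from the scripts:

```python
import math
import pandas as pd

def frames_in_part(part_duration):
    """Frames inferred for a part: frames come from minute-to-minute
    diffs, so a part spanning d minutes yields ceil(d) - 1 frames."""
    minutes = part_duration / pd.Timedelta(minutes=1)
    return math.ceil(minutes) - 1

# Part 16 of y2017m11d28 lasted 00:45:02.840683, i.e. 45.05 minutes,
# giving ceil(45.05) - 1 = 45 frames.
```

Notably, 45 matches the (45, 32, 96) array loaded earlier for that whole night, which is consistent with only a single part having been saved.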
Looking at other dataframe for part info
>>> store2 = pd.HDFStore('fd_pedestal_nights_db.h5')
>>> t = store2.info()
>>> print(t)
<class 'pandas.io.pytables.HDFStore'>
File path: fd_pedestal_nights_db.h5
/br_df frame (shape->[1808,2])
/lr_df frame (shape->[1640,2])
>>> df2 = store2['br_df']
>>> df2.columns
Index([u'ped_status', u'run_night'], dtype='object')
Nothing important there…
Max part duration :
>>> df['part_duration'].max()
Timedelta('0 days 03:36:42.021044')
So 3 x 60 + 36 = 216 is the max number of frames in a part. This matches what I had earlier for padding the vectorized data.
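The padding length can be checked directly from the max duration with the same ceil-minus-one rule used for per-part frame counts:

```python
import math
import pandas as pd

# The longest part in the master db spans 216.7 minutes, so the padded
# frame axis needs ceil(216.7) - 1 = 216 frames.
max_dur = pd.Timedelta('0 days 03:36:42.021044')
max_frames = math.ceil(max_dur / pd.Timedelta(minutes=1)) - 1
print(max_frames)  # 216
```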
Re-vectorize data from each night pedestal dataframes in fd_ped_h5 data
Re-transfer data back to CHPC
Deleted the wrongly aggregated vectorized data arrays on kingspeak under Vectorized_Data, then transferred the data over:
[2019-03-15 11:31:00] $ rsync -av fd_ped_h5 kingspeak:~/weat_ml/Data/
fd_ped_h5/y2018m08d12s1_ped_fluct.h5
sent 16,619,429,753 bytes received 65,437 bytes 98,632,018.93 bytes/sec
total size is 16,615,108,644 speedup is 1.00
$ du -sh *
16G fd_ped_h5
Transfer looks good.
Edit fd_pedestal_data_vectorization.py
Renamed fd_pedestal_rnn_data_vectorization_v_chpc.py to fd_pedestal_data_vectorization.py.
[x] Update to save each part into its own numpy array
[ ] Make sure the frames look correct
[ ] Make sure each is correctly padded
[x] ~Update master database to include pedestal frame info?~ Print frame length to the out file of the processing on CHPC instead
Transferred the new vectorization script to kingspeak.
Proof that past vectorization was wrong
Looking back at the old log file fd_pedestal_rnn_vectorization.out on my CHPC account :
Saving Vectorized and Padded FD Pedestal Data as Numpy Arrays to /scratch/local/u0949991/Data/fd_ped_vect/y2017m11d28s0_ped_fluct_vectorized_padded.npy...
Saving Vectorized and Padded FD Pedestal Data as Numpy Arrays to /scratch/local/u0949991/Data/fd_ped_vect/y2017m11d28s0_ped_fluct_vectorized_padded.npy...
Saving Vectorized and Padded FD Pedestal Data as Numpy Arrays to /scratch/local/u0949991/Data/fd_ped_vect/y2017m11d28s0_ped_fluct_vectorized_padded.npy...
Saving Vectorized and Padded FD Pedestal Data as Numpy Arrays to /scratch/local/u0949991/Data/fd_ped_vect/y2017m11d28s0_ped_fluct_vectorized_padded.npy...
Saving Vectorized and Padded FD Pedestal Data as Numpy Arrays to /scratch/local/u0949991/Data/fd_ped_vect/y2017m11d28s0_ped_fluct_vectorized_padded.npy...
Parts in the same night were not being saved to different files; each part overwrote the same output file, so only the last part survived.
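A hypothetical sketch of the fix (the actual naming scheme in fd_pedestal_data_vectorization.py may differ): put the part number into each output filename so parts from one night stop clobbering each other.

```python
import os
import numpy as np

def save_part(out_dir, night, part, arr):
    """Save one part's vectorized array under a part-specific filename.
    The `_pNN` naming scheme here is illustrative only."""
    path = os.path.join(out_dir, f'{night}_p{part:02d}_ped_fluct_vectorized.npy')
    np.save(path, arr)
    return path
```

With per-part filenames, the repeated "Saving ... y2017m11d28s0_ped_fluct_vectorized_padded.npy" lines in the log would instead show a distinct path per part.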
Submit to CHPC and vectorize data
$ sbatch fd_pedestal_data_vectorization.slm
Submitted batch job 6856344
$ squeue -u u0949991
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6856344 kingspeak vect_rnn u0949991 PD 0:00 1 (Priority)