Data Preparation

This page documents the full API for data preparation (hidrokit.prep).

Excel module

Module for reading excel files.

prep.excel._cell_index(dataframe, template='phderi')

Return the cell index (column, row) of the first value in the pivot table.

Parameters
dataframe : DataFrame

Raw dataframe imported from Excel.

template : str, optional

Template, by default 'phderi'.

Returns
list

Return [column index, row index]

Raises
Exception

If the dataframe does not match the template.

prep.excel._dataframe_data(pivot, year)

Transform a pivot table into a list.

Parameters
pivot : DataFrame

Pivot table

year : int

Year

Returns
list

Return list of data
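
The page does not describe the pivot layout, so the following is only a minimal sketch under a stated assumption: the pivot has 31 rows (days) and 12 columns (months), and cells for non-existent dates are skipped. The helper name `pivot_to_list` is hypothetical and is not part of hidrokit.

```python
import calendar

import numpy as np
import pandas as pd


def pivot_to_list(pivot, year):
    """Flatten a day-by-month pivot into a chronological list of daily values.

    Assumption (not confirmed by the hidrokit docs): rows are days 1..31
    and columns are months 1..12.
    """
    data = []
    for month in range(1, 13):
        # Number of valid days in this month for the given year.
        n_days = calendar.monthrange(year, month)[1]
        # Keep only the valid days from this month's column.
        data.extend(pivot.iloc[:n_days, month - 1].tolist())
    return data


# Dummy pivot for illustration: every cell holds its month number.
pivot = pd.DataFrame(np.tile(np.arange(1, 13), (31, 1)))
values = pivot_to_list(pivot, 2019)
print(len(values))  # 365 values for a non-leap year
```

Passing the year matters because the number of days in February changes the length of the resulting list.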

prep.excel._dataframe_table(pivot, year, name='ch')

Transform a pivot table into a single-column dataframe.

Parameters
pivot : DataFrame

Pivot table

year : int

Year

name : str, optional

Column name, by default 'ch'.

Returns
DataFrame

Dataframe

prep.excel._dataframe_year(year)

Return an empty dataframe with a date index.

Parameters
year : int

Year

Returns
DataFrame

Empty dataframe with date index
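
One way to build such a frame with plain pandas is sketched below. It assumes a daily index covering the whole calendar year; the name `empty_year_frame` is hypothetical and this is not necessarily how hidrokit implements it.

```python
import pandas as pd


def empty_year_frame(year):
    """Return a dataframe with no columns, indexed by every day of `year`."""
    index = pd.date_range(start=f"{year}-01-01", end=f"{year}-12-31", freq="D")
    return pd.DataFrame(index=index)


frame = empty_year_frame(2020)
print(frame.shape)  # (366, 0): 2020 is a leap year and there are no columns
```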

prep.excel._file_single_pivot(file, template='phderi')

Return the pivot table found inside the file.

Parameters
file : str

File path

template : str, optional

Template, by default 'phderi'.

Returns
DataFrame

Pivot table

Read module

Module for reading data.

prep.read.missing_row(dataframe, date_index=True, date_format='%Y/%m/%d')

Return a dictionary of the missing values in a dataframe.

The dictionary maps each column name to a list of the index labels where that column has missing values.

Parameters
dataframe : DataFrame

Dataframe to check for missing values.

date_index : bool, optional

If True, format the index labels using date_format, by default True.

date_format : str, optional

Format string made of strftime() directives, by default '%Y/%m/%d'.

Returns
dict

Dictionary mapping each column name to the index labels of its missing values.

Examples

Example with a non-date index:

>>> import numpy as np
>>> import pandas as pd
>>> from hidrokit.prep.read import missing_row
>>> A = pd.DataFrame(data=[[1, 3, 4, np.nan, 2, np.nan],
...                        [np.nan, 2, 3, np.nan, 1, 4],
...                        [2, np.nan, 1, 3, 4, np.nan]],
...               columns=['A', 'B', 'C', 'D', 'E', 'F'])
>>> A
     A    B  C    D  E    F
0  1.0  3.0  4  NaN  2  NaN
1  NaN  2.0  3  NaN  1  4.0
2  2.0  NaN  1  3.0  4  NaN
>>> missing_row(A, date_index=False)
{'A': [1], 'B': [2], 'C': [], 'D': [0, 1], 'E': [], 'F': [0, 2]}

Example with a timestamp index:

>>> date_index = pd.date_range("20190617", "20190619")
>>> A.set_index(date_index, inplace=True)
>>> A
              A    B  C    D  E    F
2019-06-17  1.0  3.0  4  NaN  2  NaN
2019-06-18  NaN  2.0  3  NaN  1  4.0
2019-06-19  2.0  NaN  1  3.0  4  NaN
>>> missing_row(A, date_format="%m%d")
{'A': ['0618'],
'B': ['0619'],
'C': [],
'D': ['0617', '0618'],
'E': [],
'F': ['0617', '0619']}
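
The behaviour shown in these examples can be reproduced with plain pandas. The sketch below is an illustration under assumptions, not the library's implementation: `missing_row_sketch` is a hypothetical name, and the date branch assumes the index is a DatetimeIndex.

```python
import numpy as np
import pandas as pd


def missing_row_sketch(dataframe, date_index=True, date_format="%Y/%m/%d"):
    """Map each column name to the index labels where its values are missing."""
    index = dataframe.index
    if date_index:
        # Render timestamps as strings; assumes a DatetimeIndex.
        index = index.strftime(date_format)
    return {
        # Boolean mask selects the labels of the missing rows.
        column: index[dataframe[column].isna().to_numpy()].tolist()
        for column in dataframe.columns
    }


A = pd.DataFrame(
    data=[[1, 3, 4, np.nan, 2, np.nan],
          [np.nan, 2, 3, np.nan, 1, 4],
          [2, np.nan, 1, 3, 4, np.nan]],
    columns=["A", "B", "C", "D", "E", "F"],
)
result = missing_row_sketch(A, date_index=False)
print(result)
```

Columns without missing values map to an empty list, so every column of the input always appears as a key.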

Time Series module

Module for adding timestep (lagged) columns to a dataframe.

prep.timeseries._timestep_multi(array, index=None, timesteps=2, keep_first=True)

Add timestep columns to a multi-column array.

Parameters
array : array

Two-dimensional numeric array with multiple columns.

index : list of int, optional

List of column indices, by default None.

timesteps : int, optional

Number of timesteps, by default 2.

keep_first : bool, optional

Include the original column if set to True, by default True.

Returns
array

Return 2D array with timesteps.

prep.timeseries._timestep_single(array, index=0, timesteps=2, keep_first=True)

Add timestep columns for a single column of an array.

Parameters
array : array

Two-dimensional array with a single column.

index : int, optional

Column index, by default 0.

timesteps : int, optional

Number of timesteps, by default 2.

keep_first : bool, optional

Include the original array if set to True, by default True.

Returns
array

Return 2D array with timesteps.
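
To make the idea of timesteps concrete, the sketch below stacks a column with its lagged copies using NumPy. The column order (t, t-1, ..., t-timesteps) and the dropping of the first `timesteps` rows are assumptions made for illustration; `lag_single` is a hypothetical name and may not match the library's output exactly.

```python
import numpy as np


def lag_single(array, index=0, timesteps=2, keep_first=True):
    """Stack one column with its lagged copies (t, t-1, ..., t-timesteps)."""
    column = array[:, index]
    # Rows that have a complete history of `timesteps` earlier values.
    n_rows = len(column) - timesteps
    lags = range(0, timesteps + 1) if keep_first else range(1, timesteps + 1)
    # For lag i, take the slice that sits i steps behind the unlagged one.
    stacked = [column[timesteps - i:timesteps - i + n_rows] for i in lags]
    return np.column_stack(stacked)


data = np.arange(1, 6).reshape(-1, 1)  # one column: 1, 2, 3, 4, 5
print(lag_single(data))
```

With five observations and two timesteps, three rows remain, because the first two observations have no complete history.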

prep.timeseries.timestep_table(dataframe, columns=None, timesteps=2, keep_first=True, template='{column}_tmin{i}')

Generate timesteps directly from DataFrame.

Parameters
dataframe : DataFrame

Dataframe consisting of numeric columns only.

columns : list of str, optional

List of column names to generate timesteps for, by default None.

timesteps : int, optional

Number of timesteps, by default 2.

keep_first : bool, optional

Column _tmin0 will be included if set to True, by default True.

template : str, optional

Format of the generated column names, by default '{column}_tmin{i}'.

Returns
DataFrame

DataFrame with additional timesteps columns.
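
A rough equivalent can be written with `DataFrame.shift`, which shows how the `template` parameter names the generated columns. This is a sketch under assumptions, not hidrokit's implementation: `timestep_table_sketch` is a hypothetical name, and both the column order and the dropping of the first `timesteps` rows are guesses.

```python
import pandas as pd


def timestep_table_sketch(dataframe, columns=None, timesteps=2,
                          keep_first=True, template="{column}_tmin{i}"):
    """Add lagged copies of the selected columns, named with `template`."""
    columns = list(dataframe.columns) if columns is None else columns
    start = 0 if keep_first else 1
    pieces = []
    for column in dataframe.columns:
        if column in columns:
            for i in range(start, timesteps + 1):
                name = template.format(column=column, i=i)
                # shift(i) moves values down so each row sees the value at t-i.
                pieces.append(dataframe[column].shift(i).rename(name))
        else:
            # Columns that were not selected are passed through unchanged.
            pieces.append(dataframe[column])
    # Assumption: leading rows without a complete history are dropped.
    return pd.concat(pieces, axis=1).iloc[timesteps:]


df = pd.DataFrame({"ch": [1.0, 2.0, 3.0, 4.0], "q": [10.0, 20.0, 30.0, 40.0]})
table = timestep_table_sketch(df, columns=["ch"], timesteps=1)
print(list(table.columns))  # ['ch_tmin0', 'ch_tmin1', 'q']
```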