Getter Module

class suntzu.getter.Getter[source]

Bases: object

get_best_dtype(col: str) → str[source]

Determines the most memory-efficient data type for a column based on its values.

The method inspects the column’s current data type and value range to infer a more optimal dtype:

- Integers are downcast to the smallest possible integer type.
- Floats are downcast to the smallest possible floating-point type.
- Object columns with a low number of unique values are converted to category.
- Other types are returned unchanged.

Parameters:

col (str) – Name of the column to analyze.

Returns:

The name of the most suitable data type for the column.

Return type:

str

Examples

>>> from suntzu import Getter
>>> Getter.get_best_dtype(df, "age")
'int8'
>>> Getter.get_best_dtype(df, "price")
'float32'
>>> Getter.get_best_dtype(df, "status")
'category'
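The rules listed for get_best_dtype can be sketched in plain Python. This is only an illustration, not the library's implementation: `suggest_dtype` is a hypothetical helper that works on a list of values rather than a DataFrame column, the `max_unique_ratio` threshold for "a low number of unique values" is an assumption, floats are sized by magnitude only, and null values are not handled.

```python
def suggest_dtype(values, max_unique_ratio=0.5):
    """Hypothetical stand-in: suggest a compact dtype name for a list of values."""
    if not values:
        return "object"
    # Booleans are checked first because bool is a subclass of int in Python.
    if all(isinstance(v, bool) for v in values):
        return "bool"  # "other types are returned unchanged"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        lo, hi = min(values), max(values)
        for bits in (8, 16, 32):
            if -2 ** (bits - 1) <= lo and hi <= 2 ** (bits - 1) - 1:
                return f"int{bits}"
        return "int64"
    if all(isinstance(v, float) for v in values):
        largest = max(abs(v) for v in values)
        if largest <= 65504.0:  # largest finite float16
            return "float16"
        if largest <= 3.4028234663852886e38:  # largest finite float32
            return "float32"
        return "float64"
    if all(isinstance(v, str) for v in values):
        # Assumed rule: few distinct values relative to the row count.
        if len(set(values)) / len(values) <= max_unique_ratio:
            return "category"
    return "object"


print(suggest_dtype([23, 41, 67]))
print(suggest_dtype(["open", "closed", "open", "open"]))
```

The order of the checks matters: testing for booleans before integers keeps a boolean column from being reported as int8.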

get_best_float(col_min: float, col_max: float) → str[source]

Determines the most memory-efficient floating-point type capable of representing a range of values.

Parameters:

col_min (float) – The minimum value in the range.
col_max (float) – The maximum value in the range.

Returns:

The name of the smallest floating-point type that can accommodate all values in the range. Possible return values are “float16”, “float32”, or “float64”.

Return type:

str

Examples

>>> from suntzu import Getter
>>> Getter.get_best_float(0.1, 100.0)
'float16'
>>> Getter.get_best_float(-1e5, 1e5)
'float32'
>>> Getter.get_best_float(-1e40, 1e40)
'float64'
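One way to reproduce the results above is to compare the largest magnitude in the range against the largest finite value of each IEEE 754 format. The sketch below assumes that is the only criterion; `smallest_float_dtype` is a hypothetical name, and the actual method may also take precision into account.

```python
# Largest finite magnitude of each format, from its IEEE 754 layout.
FLOAT_LIMITS = [
    ("float16", (2 - 2 ** -10) * 2.0 ** 15),   # 65504.0
    ("float32", (2 - 2 ** -23) * 2.0 ** 127),  # about 3.4028235e38
]


def smallest_float_dtype(col_min: float, col_max: float) -> str:
    """Hypothetical helper: narrowest float type whose range covers both bounds."""
    largest = max(abs(col_min), abs(col_max))
    for name, limit in FLOAT_LIMITS:
        if largest <= limit:
            return name
    return "float64"


print(smallest_float_dtype(0.1, 100.0))
print(smallest_float_dtype(-1e5, 1e5))
print(smallest_float_dtype(-1e40, 1e40))
```

A range check like this says nothing about precision: 0.1 fits within the float16 range but is stored with only about three significant decimal digits.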

get_best_int(col_min: int, col_max: int) → str[source]

Determines the smallest integer type capable of representing a range of values.

Parameters:

col_min (int) – The minimum value in the range.
col_max (int) – The maximum value in the range.

Returns:

The name of the smallest integer type that can accommodate all values in the range. Possible return values are “int8”, “int16”, “int32”, or “int64”.

Return type:

str

Examples

>>> from suntzu import Getter
>>> Getter.get_best_int(-50, 100)
'int8'
>>> Getter.get_best_int(-200, 30000)
'int16'
>>> Getter.get_best_int(-50000, 100000)
'int32'
>>> Getter.get_best_int(-5000000000, 5000000000)
'int64'
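The results above follow from the two's-complement bounds of each signed type. A minimal sketch of that lookup, using a hypothetical `smallest_int_dtype` helper rather than the library's own code:

```python
# Inclusive bounds of the signed integer types (two's complement).
INT_RANGES = [
    ("int8", -2 ** 7, 2 ** 7 - 1),
    ("int16", -2 ** 15, 2 ** 15 - 1),
    ("int32", -2 ** 31, 2 ** 31 - 1),
    ("int64", -2 ** 63, 2 ** 63 - 1),
]


def smallest_int_dtype(col_min: int, col_max: int) -> str:
    """Hypothetical helper: narrowest signed type covering [col_min, col_max]."""
    for name, lo, hi in INT_RANGES:
        if lo <= col_min and col_max <= hi:
            return name
    raise OverflowError("range does not fit in int64")


print(smallest_int_dtype(-50, 100))
print(smallest_int_dtype(-200, 30000))
```

Both bounds have to be checked: a range such as 0 to 128 already needs int16, because int8 stops at 127.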

get_max_value(col: str) → int | str[source]

Returns the maximum value of a DataFrame column, handling different data types appropriately.

Parameters:

col (str) – The column of the DataFrame to inspect.

Returns:

For numeric columns, returns the maximum value.
For categorical or boolean columns, returns the most frequent value (mode).

Return type:

int | str

Raises:

MixedDtypeError – If the column contains mixed types or null values.

Examples

>>> from suntzu import Getter
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 3, 2], 'b': [True, False, True], 'c': ['x', 'y', 'x']})
>>> Getter.get_max_value(df, 'a')
3
>>> Getter.get_max_value(df, 'b')
True
>>> Getter.get_max_value(df, 'c')
'x'
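The split between numeric and non-numeric columns can be illustrated without a DataFrame. The sketch below is an assumption about the behavior described above, not the library's code: `max_or_mode` is a hypothetical helper, and ties in frequency are resolved in favor of the value seen first.

```python
from collections import Counter
from numbers import Number


def max_or_mode(values):
    """Hypothetical helper: maximum for numeric data, most frequent value otherwise."""
    # bool is a Number in Python, so it is excluded explicitly.
    is_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        return max(values)
    # most_common(1) returns a list with one (value, count) pair.
    return Counter(values).most_common(1)[0][0]


print(max_or_mode([1, 3, 2]))
print(max_or_mode([True, False, True]))
print(max_or_mode(["x", "y", "x"]))
```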

get_memory_insights(col: str, total_usage: int) → list[source]

get_memory_usage(col, unit) → float[source]

Calculates the memory usage of a specific column in the DataFrame.

Parameters:

col (str) – Name of the column to measure.
unit (str) – Unit for memory measurement. Options are:

- “b” for bytes
- “kb” for kilobytes
- “mb” for megabytes

Returns:

Memory usage of the specified column, rounded to 2 decimal places.

Return type:

float

Examples

>>> df.get_memory_usage("age", "kb")
12.5
>>> df.get_memory_usage("price", "mb")
0.01
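The unit handling amounts to dividing a byte count by a fixed factor and rounding. The sketch below assumes binary units (1 kb = 1024 bytes), which this page does not state, and uses a hypothetical `convert_bytes` helper.

```python
# Assumed factors: binary units. The library may use 1000-based units instead.
UNIT_FACTORS = {"b": 1, "kb": 1024, "mb": 1024 ** 2}


def convert_bytes(n_bytes: int, unit: str) -> float:
    """Hypothetical helper: express a byte count in the given unit, rounded to 2 places."""
    try:
        factor = UNIT_FACTORS[unit.lower()]
    except KeyError:
        raise ValueError(f"unit must be one of {sorted(UNIT_FACTORS)}") from None
    return round(n_bytes / factor, 2)


print(convert_bytes(12800, "kb"))
```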

get_min_value(col: str) → int | str[source]

Returns the minimum value of a DataFrame column, handling different data types appropriately.

Parameters:

col (str) – The column of the DataFrame to inspect.

Returns:

For numeric columns, returns the minimum value.
For categorical or boolean columns, returns the least frequent value.

Return type:

int | str

Raises:

MixedDtypeError – If the column contains mixed types or null values.

Examples

>>> from suntzu import Getter
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 3, 2], 'b': [True, False, True], 'c': ['x', 'y', 'x']})
>>> Getter.get_min_value(df, 'a')
1
>>> Getter.get_min_value(df, 'b')
False
>>> Getter.get_min_value(df, 'c')
'y'

get_total_memory_usage(unit) → float[source]

Calculates the total memory usage of the DataFrame in the specified unit.

Parameters:

unit (str) – Unit for memory measurement. Options are:

- “b” for bytes
- “kb” for kilobytes
- “mb” for megabytes

Returns:

Total memory usage of the DataFrame, rounded to 2 decimal places.

Return type:

float

Examples

>>> df.get_total_memory_usage("kb")
125.5
>>> df.get_total_memory_usage("mb")
0.12