Recipes for pandas

normalize_str[source]

normalize_str()

Convert spaces to underscores and lowercase all characters. For s: str, use built-in Python methods; for s: pd.Series, use vectorized pandas methods.
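
The source isn't shown on this page; a minimal sketch of what the docstring describes, dispatching on input type (the actual implementation may differ):

```python
import pandas as pd

def normalize_str(s):
    """Convert spaces to underscores and lowercase all characters.

    Uses built-in str methods for str input, and the vectorized
    .str accessor for pd.Series input.
    """
    if isinstance(s, pd.Series):
        return s.str.replace(' ', '_', regex=False).str.lower()
    return s.replace(' ', '_').lower()
```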

Vectorization

Vectorized vs. Pure Python

  • In general, avoid Python for loops with pandas or numpy
  • For large N, a pure-Python for loop is roughly 100x slower than the equivalent pandas or numpy method
  • See also benchmarks for comparisons with Julia
import pandas as pd
import numpy as np
import perfplot

def make_series(n):
    return pd.Series(np.random.randint(10, size=n))

def sum_python(x):
    return sum(e for e in x)

def sum_pandas(x):
    return x.sum()

perfplot.show(
    setup=make_series,
    kernels=[sum_python, sum_pandas],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,  
    xlabel='len(series)'
)

Query styles and performance

  • For breakdown, see: https://stackoverflow.com/a/49937318
  • When using pandas data structures, use boolean indexing, e.g. x[x > k]
  • For maximum performance, use numexpr on numpy arrays directly, e.g. x[numexpr.evaluate('x > k')]
import numexpr
import pandas as pd
import numpy as np
import perfplot

np.random.seed(1337)


def numexpr_nparray(x):
    x = x.A.values
    return x[numexpr.evaluate('(x > 5)')]

def numexpr_pandas(x):
    return x[numexpr.evaluate('(x > 5)')]

def np_boolean_mask(x):
    return x[x.A.values > 5]

def query(x):
    return x.query('A > 5')

def pandas_eval(x):
    return x[x.eval('A > 5')]

def pdseries_boolean_mask(x):
    return x[x.A > 5]


def make_df(n):
    df = pd.DataFrame(np.random.randint(10, size=n), columns=['A'])
    return df


perfplot.show(
    setup=make_df,
    kernels=[numexpr_nparray, numexpr_pandas, np_boolean_mask, query, pandas_eval, pdseries_boolean_mask],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,  
    xlabel='len(df)')

Misc

categorize[source]

categorize(df, categories)

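A plausible sketch of `categorize` using `pd.CategoricalDtype` (hypothetical; the linked source may differ):

```python
import pandas as pd

def categorize(df, categories):
    """Cast the given columns to (optionally ordered) categorical dtypes.

    `categories` maps column name -> kwargs for pd.CategoricalDtype,
    e.g. {'cut': {'ordered': True, 'categories': [...]}}.
    """
    df = df.copy()
    for col, kwargs in categories.items():
        df[col] = df[col].astype(pd.CategoricalDtype(**kwargs))
    return df
```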
df = pd.read_csv('https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv')
df.head(3)
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
categories = {
    'cut':     {'ordered': True, 'categories': ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']},
    'color':   {'ordered': True, 'categories': ['J', 'I', 'H', 'G', 'F', 'E', 'D']},
    'clarity': {'ordered': True, 'categories': ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']},
}
    
df2 = categorize(df, categories)

pd.DataFrame({'old_dtypes': df.dtypes, 'new_dtypes': df2.dtypes})
old_dtypes new_dtypes
carat float64 float64
cut category category
color category category
clarity category category
depth float64 float64
table float64 float64
price int64 int64
x float64 float64
y float64 float64
z float64 float64

deduplicate[source]

deduplicate(df, axis='rows', subset=None, keep='first')
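
A hypothetical sketch matching this signature, built on `DataFrame.drop_duplicates` (duplicate columns handled via a transpose; the actual source may differ):

```python
import pandas as pd

def deduplicate(df, axis='rows', subset=None, keep='first'):
    """Drop duplicate rows, or duplicate columns via a transpose."""
    if axis == 'rows':
        return df.drop_duplicates(subset=subset, keep=keep)
    # Duplicate columns: transpose so columns become rows, dedupe, transpose back.
    return df.T.drop_duplicates(subset=subset, keep=keep).T
```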

normalize_col_names[source]

normalize_col_names(df)

Replace whitespace with underscores, and lowercase all column names.

log_dtypes[source]

log_dtypes(func)

Decorate your function to log df.dtypes after it is complete.

Example:

@log_dtypes
def func(df): pass
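
A minimal sketch of such a decorator (hypothetical; the linked source may log differently):

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)

def log_dtypes(func):
    """Log the returned DataFrame's dtypes after `func` completes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logging.info('%s dtypes:\n%s', func.__name__, result.dtypes)
        return result
    return wrapper
```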

log_transform[source]

log_transform(func)

Decorate your function to get execution time and log df.shape after it is complete.

Example:

@log_transform
def func(df): pass
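
A minimal sketch, assuming a `functools.wraps`-based decorator that times the call and logs the result's shape (the actual source may differ):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def log_transform(func):
    """Log execution time and the returned DataFrame's shape."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logging.info('%s took %.3fs, shape=%s', func.__name__, elapsed, result.shape)
        return result
    return wrapper
```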

Method chaining

  • Creates a declarative (English-readable) pipeline of operations on the dataframe
  • Abstracts complexity away from the reader, but keeps it available if they want to dig in
  • Subjective: looks cleaner to my eyes
  • Drawback: can make debugging difficult if you are not familiar with how it works
  • Method chaining syntax (.pipe) may be familiar if you are coming from JS (.then), Bash (|), or R (%>%)
    df = (
        pd.read_parquet('my_file.parquet')
        .pipe(clean_data)
        .pipe(generate_features)
    )
    

read_files[source]

read_files(path, pattern, filetype='csv')
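
A hypothetical sketch matching this signature, globbing the directory and concatenating the matches (only 'csv' and 'parquet' handled here; the actual source may support more):

```python
from pathlib import Path
import pandas as pd

def read_files(path, pattern, filetype='csv'):
    """Read all files under `path` matching `pattern` into one DataFrame."""
    readers = {'csv': pd.read_csv, 'parquet': pd.read_parquet}
    files = sorted(Path(path).glob(pattern))
    return pd.concat([readers[filetype](f) for f in files], ignore_index=True)
```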

localize[source]

localize()

strip_whitespace[source]

strip_whitespace(df)
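
A minimal sketch, assuming it strips leading/trailing whitespace from every string (object-dtype) column (hypothetical; the linked source may differ):

```python
import pandas as pd

def strip_whitespace(df):
    """Strip leading/trailing whitespace from all object-dtype columns."""
    df = df.copy()
    for col in df.select_dtypes(include='object'):
        df[col] = df[col].str.strip()
    return df
```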

fold[source]

fold(x, y)

normalize[source]

normalize(df, by, how='first')

DataFrame Accessors

  • Namespace and add methods directly to the dataframe for easy discovery
  • Provide multi-level dot access if required
@pd.api.extensions.register_dataframe_accessor("top_level")
class TopLevelAccessor:
    def __init__(self, df):
        self._df = df
        self.level_a = LevelA(df)
        self.level_b = LevelB(df)

    def do_something(self): pass

class LevelA:
    def __init__(self, df):
        self._df = df

    def do_something_else(self): pass

class LevelB:
    def __init__(self, df):
        self._df = df
df.top_level.do_something()
df.top_level.level_a.do_something_else()

For larger datasets

Consider:

  • Vaex
  • Dask
  • Spark (Koalas)
  • Modin
  • cuDF (NVIDIA GPU, Linux only)