Example with code

Distance vs. Dimensionality

  • The x-axis represents the distance between two points
  • The y-axis represents the count (this is a histogram)
  • The play button increases the number of dimensions, causing the histogram to shift to the right. This means distances are increasing as dimensions go up!
  • The curse of dimensionality is that as the number of dimensions increases, the distance between any two points also increases. Every point ends up far from every other point, which makes it hard to group things together!
import numpy as np
import pandas as pd
import plotly.express as px

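def generate_data(n, dim):
    # Two clusters of n points each, centered at 0 and at 3 in every dimension
    x = np.random.normal(0, 1, (n, dim))
    y = np.random.normal(3, 1, (n, dim))

    data = np.concatenate([x, y], axis=0)
    # Standardize each dimension to zero mean and unit variance
    normalized = (data - data.mean(axis=0)) / data.std(axis=0)
    # Euclidean distance between every pair of points, as a (2n, 2n) matrix
    pairwise = np.linalg.norm(normalized[:, None, :] - normalized[None, :, :], axis=2)
    # Keep each pair once and drop the zero self-distances on the diagonal
    distances = pairwise[np.triu_indices_from(pairwise, k=1)]

    return distances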

n = 100
dims = list(range(1, 10)) + list(range(10, 200, 10))

distances = {dim: generate_data(n, dim) for dim in dims}
df = pd.DataFrame(distances).melt(var_name='dims', value_name='samples')
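# Plot the distance column explicitly; one animation frame per dimensionality
fig = px.histogram(df, x='samples', animation_frame='dims',
                   labels={'samples': 'distance between two points'})

fig.update_xaxes(range=[0, 20])
fig.show()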
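
The shift to the right can also be checked numerically, without a plot. The following is a minimal sketch, not part of the original notebook: it assumes a single standard-normal cluster rather than the two clusters used above, and the helper name `pairwise_distance_stats` and its `seed` parameter are hypothetical. It reports the mean pairwise distance and the spread of distances relative to that mean for a few dimensionalities.

```python
import numpy as np

def pairwise_distance_stats(n, dim, seed=0):
    """Return (mean, std / mean) of the unique pairwise distances.

    Assumption for illustration: n points drawn from one standard-normal
    cluster, not the two-cluster data used in the histogram example.
    """
    rng = np.random.default_rng(seed)
    points = rng.normal(0, 1, (n, dim))
    # (n, n) matrix of Euclidean distances via broadcasting
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Keep each unordered pair once and drop the zero self-distances
    unique = dist[np.triu_indices(n, k=1)]
    return unique.mean(), unique.std() / unique.mean()

for dim in (1, 10, 100):
    mean, relative_spread = pairwise_distance_stats(200, dim)
    print(f"dim={dim:>3}  mean distance={mean:.2f}  relative spread={relative_spread:.2f}")
```

Under these assumptions the mean distance should grow with `dim` while the relative spread shrinks: points get farther apart, and they also become more nearly equidistant from one another.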

Required Samples vs. Dimensionality

As dimensionality increases, you also need more samples.

Consider the simple binary variable case.

  • For each variable, there are two choices.
  • For $k$ variables, we have $2^k$ choices.
  • For a linear increase in variables, you have an exponential increase in choices. The amount of data you need to collect increases exponentially!
  • Suppose you wanted 25 observations of each combination of variables. You would need $25 \cdot 2^k$ rows of data. With 24 binary variables, that is already more than 400 million rows!
import pandas as pd
import plotly.express as px

num_variables = range(1, 25)
# 25 observations for each of the 2**k combinations of k binary variables
rows_of_data_required = [25 * 2**k for k in num_variables]
df = pd.DataFrame(
    {'Rows of Data Required': rows_of_data_required,
     'Num of Binary Variables': num_variables}
).set_index('Num of Binary Variables')
fig = px.line(df, x=df.index, y='Rows of Data Required')
fig.update_layout(hovermode="x unified")
fig.show()
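
To make the $2^k$ count concrete, the combinations can be enumerated directly for small $k$. This is a small sketch rather than part of the original notebook: the function name `rows_required` and its `observations_per_combination` parameter are hypothetical, and the default of 25 is taken from the example above.

```python
from itertools import product

def rows_required(num_binary_variables, observations_per_combination=25):
    """Count rows needed to observe every combination a fixed number of times."""
    # Every assignment of 0/1 to each variable, e.g. (0, 0), (0, 1), (1, 0), (1, 1)
    combinations = list(product([0, 1], repeat=num_binary_variables))
    return len(combinations) * observations_per_combination

for k in (1, 2, 3, 10):
    print(f"{k} binary variables -> {rows_required(k)} rows")
```

Enumeration is only practical for small $k$, since the list of combinations itself doubles with each added variable. For larger $k$, the closed form `25 * 2**k` used in the plot above gives the same count without building the list.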