Julia vs NumPy Performance

Element-wise Multiply

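  • NumPy is about 1.1-1.5x faster for large n (n ≥ 100,000); Julia is faster for n ≤ 10,000.
  • Note that %timeit reports the mean over runs, while @btime reports the minimum, so the comparison is approximate.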

Python 3.8.8, NumPy 1.19.2

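# String macro: writes the literal to the stdin of the command PYTHON_PATH
# (defined elsewhere; assumed here to launch IPython) and echoes its output.
macro python_str(s) open(PYTHON_PATH,"w",stdout) do io; print(io, s); end; end;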

python"""
from IPython.utils import io
from numpy.random import rand

for n in [10**k for k in range(9)]:
    x = rand(n)
    y = rand(n)
    with io.capture_output():
        time = %timeit -o (x*y).sum()
    print(f'n = {n:>12,}: time={time}')

"""
Python 3.8.8 (default, Feb 24 2021, 21:46:12) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.22.0 -- An enhanced Interactive Python. Type '?' for help.

n =            1: time=2.34 µs ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
n =           10: time=2.36 µs ± 27.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
n =          100: time=2.41 µs ± 46.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
n =        1,000: time=3.03 µs ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
n =       10,000: time=6.92 µs ± 50.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
n =      100,000: time=47.5 µs ± 127 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n =    1,000,000: time=1.05 ms ± 9.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n =   10,000,000: time=16.2 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
n =  100,000,000: time=151 ms ± 872 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
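
%timeit is an IPython magic, so the loop above will not run under plain python. A rough stand-in using the standard-library timeit module might look like the sketch below. The helper name best_time and its number/repeat defaults are hypothetical choices, not part of the original notebook, and it reports the best run rather than the mean ± std. dev. that %timeit prints, so its numbers are not directly comparable to the output above.

```python
import timeit

import numpy as np


def best_time(n, number=100, repeat=5):
    """Best per-call time in seconds for (x*y).sum() on length-n vectors.

    Hypothetical helper: `number` calls are timed per run, `repeat` runs
    are made, and the fastest run is divided by `number`.
    """
    x = np.random.rand(n)
    y = np.random.rand(n)
    totals = timeit.repeat(lambda: (x * y).sum(), number=number, repeat=repeat)
    return min(totals) / number


# Smaller range than the notebook (up to 10**4) so the sketch runs quickly.
for n in [10**k for k in range(5)]:
    print(f'n = {n:>12,}: time={best_time(n) * 1e6:.2f} µs')
```

Taking the minimum is closer to what @btime reports on the Julia side.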


Julia 1.6

using BenchmarkTools
using Strs

pr"""\%10s("n") \%10s("result")\n"""
for n in 10 .^ (0:8)
    x = rand(n)
    y = rand(n)
    pr"\%10s(n)"
    @btime sum($x .* $y);
end
         n     result
         1  25.590 ns (1 allocation: 96 bytes)
        10  35.727 ns (1 allocation: 160 bytes)
       100  56.584 ns (1 allocation: 896 bytes)
      1000  625.689 ns (1 allocation: 7.94 KiB)
     10000  6.244 μs (2 allocations: 78.20 KiB)
    100000  64.332 μs (2 allocations: 781.33 KiB)
   1000000  1.181 ms (2 allocations: 7.63 MiB)
  10000000  23.123 ms (2 allocations: 76.29 MiB)
 100000000  219.613 ms (2 allocations: 762.94 MiB)
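
The allocation counts above show that sum(x .* y) materializes a full array of n products before reducing it (762.94 MiB at n = 10^8). The NumPy expression (x*y).sum() likewise builds a temporary array for x*y. A minimal sketch, assuming only NumPy and a small n chosen for illustration, of how one might confirm the size of that temporary and check that a fused dot product gives the same value:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
n = 1_000
x = rng.random(n)
y = rng.random(n)

# x * y allocates a new n-element float64 array before .sum() reduces it.
product = x * y
temporary_bytes = product.nbytes

# np.dot performs the multiply and the sum in one pass,
# without the intermediate array.
elementwise_sum = product.sum()
fused_dot = np.dot(x, y)

print(temporary_bytes)
print(np.isclose(elementwise_sum, fused_dot))
```

This is the same quantity the next section measures, which is one plausible reason its timings are lower at large n.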

Matrix Multiply

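  • Julia is slightly faster for large n (37.4 ms vs 38.3 ms at n = 10^8), a negligible difference in practice.
  • For small n, NumPy's times plateau near 1.1 µs, consistent with a fixed per-call overhead, while Julia takes tens of nanoseconds.
  • Both operands are 1-D vectors, so this benchmark is a dot product rather than a full matrix product.
  • Both NumPy and Julia are probably using BLAS under the hood.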
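
One way to check the "probably" on the NumPy side is to ask NumPy which BLAS/LAPACK libraries it was built against. The sketch below, assuming only NumPy, also shows that .T is a no-op on a 1-D array, so x.T @ y is the same inner product as np.dot(x, y). The array length is an arbitrary illustrative choice.

```python
import numpy as np

# Print the BLAS/LAPACK build information NumPy was compiled with.
np.show_config()

x = np.random.rand(1_000)
y = np.random.rand(1_000)

# Transposing a 1-D array returns an array of the same shape.
same_shape = x.T.shape == x.shape

# With two 1-D operands, @ computes the inner product (a scalar).
matmul_result = x.T @ y
dot_result = np.dot(x, y)

print(same_shape)
print(np.isclose(matmul_result, dot_result))
```

The printed configuration varies by installation, so it only confirms the BLAS backend for the machine it is run on.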

Python 3.8.8, NumPy 1.19.2

macro python_str(s) open(PYTHON_PATH,"w",stdout) do io; print(io, s); end; end;

python"""
import numpy as np

from IPython.utils import io
from numpy.random import rand

for n in [10**k for k in range(9)]:
    x = rand(n)
    y = rand(n)
    with io.capture_output():
        time = %timeit -o x.T @ y
    print(f'n = {n:>12,}: time={time}')

"""
Python 3.8.8 (default, Feb 24 2021, 21:46:12) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.22.0 -- An enhanced Interactive Python. Type '?' for help.

n =            1: time=1.1 µs ± 2.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
n =           10: time=1.16 µs ± 12.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
n =          100: time=1.16 µs ± 4.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
n =        1,000: time=1.29 µs ± 5.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
n =       10,000: time=3.18 µs ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
n =      100,000: time=5.2 µs ± 517 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
n =    1,000,000: time=28.6 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n =   10,000,000: time=4.12 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
n =  100,000,000: time=38.3 ms ± 998 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Julia 1.6

using BenchmarkTools
using Strs

pr"""\%10s("n") \%10s("result")\n"""
for n in 10 .^ (0:8)
    x = rand(n)
    y = rand(n)
    pr"\%10s(n)"
    @btime ($x)' * $y
end
         n     result
         1  8.555 ns (0 allocations: 0 bytes)
        10  11.754 ns (0 allocations: 0 bytes)
       100  17.275 ns (0 allocations: 0 bytes)
      1000  79.726 ns (0 allocations: 0 bytes)
     10000  1.200 μs (0 allocations: 0 bytes)
    100000  2.873 μs (0 allocations: 0 bytes)
   1000000  20.548 μs (0 allocations: 0 bytes)
  10000000  3.631 ms (0 allocations: 0 bytes)
 100000000  37.358 ms (0 allocations: 0 bytes)