Install pyarrow on Alpine 3.20

by Daniel Pham

This article shows how to install pyarrow on Alpine 3.20, and how to build a Docker image that includes the pyarrow package on top of python:3.12.7-alpine (which ships Alpine 3.20).

I need to build my Python app on the Alpine image for security reasons.

With the base image python:3.12.7-slim, I can build the image for the app successfully because that base image is based on Debian. Alpine is different, and this is the situation I encountered.

Alpine does not support libraries for the pyarrow package

When I install the pyarrow package on the python:3.12.7-alpine image (which ships Alpine 3.20), I get the following error.

/ # pip install pyarrow
Collecting pyarrow
  Downloading pyarrow-18.0.0.tar.gz (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 5.5 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: pyarrow
  Building wheel for pyarrow (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for pyarrow (pyproject.toml) did not run successfully.
   exit code: 1
  ╰─> [783 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build/lib.linux-x86_64-cpython-312/pyarrow
      ...
      copying pyarrow/vendored/version.py -> build/lib.linux-x86_64-cpython-312/pyarrow/vendored
      running build_ext
      creating /tmp/pip-install-ogcm3i1n/pyarrow_45fbd110dc51439481e9486f9d43996a/build/temp.linux-x86_64-cpython-312
      -- Running cmake for PyArrow
      cmake -DCMAKE_INSTALL_PREFIX=/tmp/pip-install-ogcm3i1n/pyarrow_45fbd110dc51439481e9486f9d43996a/build/lib.linux-x86_64-cpython-312/pyarrow -DPYTHON_EXECUTABLE=/usr/local/bin/python3.12 -DPython3_EXECUTABLE=/usr/local/bin/python3.12 -DPYARROW_CXXFLAGS= -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_CYTHON_CPP=off -DPYARROW_GENERATE_COVERAGE=off -DCMAKE_BUILD_TYPE=release /tmp/pip-install-ogcm3i1n/pyarrow_45fbd110dc51439481e9486f9d43996a
      error: command 'cmake' failed: No such file or directory
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pyarrow
Failed to build pyarrow

[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: pip install --upgrade pip
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (pyarrow)
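
The `Downloading pyarrow-18.0.0.tar.gz` line in the log above is the key detail: pip fetched the source distribution, which means it found no prebuilt wheel matching this platform and fell back to compiling pyarrow locally. The sketch below shows one way to check that by hand. It parses the platform tags out of wheel filenames and reports whether any of them targets musl-based Linux (the `musllinux` tag that Alpine needs). The helper names and the sample filenames are assumptions made for illustration, not taken from the article; the real file list for a release can be fetched from the PyPI JSON API with `release_filenames`.

```python
import json
import urllib.request


def platform_tags(wheel_filename):
    """Return the platform tags encoded in a wheel filename.

    Wheel names follow name-version[-build]-python-abi-platform.whl, and the
    platform field may hold several tags joined by dots.
    """
    stem = wheel_filename[: -len(".whl")]
    return stem.split("-")[-1].split(".")


def has_musllinux_wheel(filenames):
    """Return True if any wheel in the list targets musl-based Linux."""
    return any(
        tag.startswith("musllinux")
        for name in filenames
        if name.endswith(".whl")
        for tag in platform_tags(name)
    )


def release_filenames(project, version):
    """Fetch the filenames of one release from the PyPI JSON API (needs network)."""
    url = f"https://pypi.org/pypi/{project}/{version}/json"
    with urllib.request.urlopen(url) as response:
        return [entry["filename"] for entry in json.load(response)["urls"]]


# Illustrative filenames (assumed for this sketch, not copied from PyPI).
sample = [
    "pyarrow-18.0.0.tar.gz",
    "pyarrow-18.0.0-cp312-cp312-manylinux_2_28_x86_64.whl",
    "pyarrow-18.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
]
print(has_musllinux_wheel(sample))  # False: only manylinux (glibc) wheels listed
```

To check a real release, `has_musllinux_wheel(release_filenames("pyarrow", "18.0.0"))` can be run on a machine with network access.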

At the time of writing, a search on the internet shows many people hitting this error when installing pyarrow on Alpine. You can see several such issues on the Apache Arrow GitHub.

Figure: GitHub issues about failing to build pyarrow using python:3.10-alpine.

This is where it got really tricky: my Python application requires the pyarrow package, and I did not know how to proceed.

I spent more than a week trying different approaches, and every image build failed. Finally, I found an issue on Apache Arrow’s GitHub which mentioned that the Arrow C++ library has to be built from source before the pyarrow package can be installed.

Install pyarrow on Alpine

Below is the Dockerfile I am using for my Python app. I will explain some important points.

FROM python:3.12.7-alpine

# Setup env
ENV LANG=C.UTF-8 \
    LC_ALL=C.UTF-8 \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONFAULTHANDLER=1 \
    ACCEPT_EULA=Y

# Check the latest version for the package 'arrow' at https://github.com/apache/arrow/releases
ARG ARROW_VERSION=18.0.0

# Set the working directory in the container
WORKDIR /app

# Copy requirements first to leverage caching
COPY requirements.txt .

# Install build dependencies and required packages
RUN apk add --no-cache --virtual .build-deps \
        autoconf \
        bash \
        bison \
        boost-dev \
        brotli-dev \
        build-base \
        bzip2-dev \
        cargo \
        ca-certificates \
        clang \
        clang-dev \
        cmake \
        curl \
        curl-dev \
        flex \
        gcc \
        g++ \
        git \
        grpc-dev \
        jemalloc-dev \
        libc-dev \
        libffi-dev \
        libgcc \
        libjpeg-turbo-dev \
        libstdc++ \
        libxml2-dev \
        libxslt-dev \
        libre2 \
        lld \
        llvm-dev \
        linux-headers \
        lz4-dev \
        make \
        musl-dev \
        ncurses-libs \
        ninja \
        openssl-dev \
        postgresql-dev \
        protobuf-dev \
        rapidjson-dev \
        re2-dev \
        rust \
        snappy-dev \
        thrift-dev \
        unixodbc-dev \
        utf8proc-dev \
        xsimd-dev \
        xz-dev \
        zlib-dev \
        zstd-dev \
        py3-pip \
        py3-numpy \
        py3-wheel \
    && apk upgrade --no-cache openssl \
    && pip install --upgrade --no-cache-dir pip Werkzeug \
    && pip install --no-cache-dir cython numpy pandas pipenv pytest setuptools six \
    && git clone --no-checkout https://github.com/apache/arrow.git /arrow \
    && cd /arrow \
    && git checkout tags/apache-arrow-${ARROW_VERSION} \
    && mkdir -p /arrow/cpp/build \
    && cd /arrow/cpp/build \
    && cmake .. --preset ninja-release-python-maximal -DARROW_GANDIVA=OFF -DARROW_ACERO=OFF -DARROW_AZURE=OFF -DARROW_CUDA=OFF -DARROW_BUILD_TESTS=OFF \
    && ninja -j$(nproc) \
    && ninja install \
    && pip install --no-cache-dir pyarrow \
    && cd /app && pip install --no-cache-dir -r requirements.txt \
    && rm -rf /var/cache/apk/* /arrow /usr/local/lib/python3.12/site-packages/examples

# Copy the rest of the application code
COPY . .

# Expose the desired port (change if necessary)
EXPOSE 80

# Run the application
CMD ["python", "app.py"]
  • ARG ARROW_VERSION=18.0.0: Check the latest Arrow release at https://github.com/apache/arrow/releases and set the version you want to install. In my case that is Arrow 18.0.0.
  • RUN apk add --no-cache --virtual .build-deps ...: Installs the libraries and build tools needed to compile Arrow and pyarrow.
  • git clone --no-checkout ... to ninja install: These commands clone the Arrow repository, check out the chosen release tag, and build and install the Arrow C++ library.
  • pip install --no-cache-dir pyarrow: Installs the pyarrow package on Alpine once the Arrow C++ library has been compiled and installed.
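
The original failure was `command 'cmake' failed: No such file or directory`, so before starting the long compile it can help to confirm that the build tools are actually on PATH. The following is a minimal sketch of such a preflight check. The tool list is an assumption based on the commands the Dockerfile runs (cmake, ninja, git) plus a C++ compiler, and the function name is hypothetical.

```python
import shutil

# Tools the build steps invoke. This list is an assumption derived from the
# Dockerfile commands, not an official list of Arrow build requirements.
REQUIRED_TOOLS = ("cmake", "ninja", "git", "g++")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing build tools: " + ", ".join(missing))
    else:
        print("all build tools found")
```

Run inside the container after the `apk add` step, it should report that all tools are found; on the bare python:3.12.7-alpine image it would be expected to list cmake among the missing ones, matching the error in the log.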

The most important part of the Dockerfile above is the following fragment. It is what makes it possible to install pyarrow on Alpine.

    && git clone --no-checkout https://github.com/apache/arrow.git /arrow \
    && cd /arrow \
    && git checkout tags/apache-arrow-${ARROW_VERSION} \
    && mkdir -p /arrow/cpp/build \
    && cd /arrow/cpp/build \
    && cmake .. --preset ninja-release-python-maximal -DARROW_GANDIVA=OFF -DARROW_ACERO=OFF -DARROW_AZURE=OFF -DARROW_CUDA=OFF -DARROW_BUILD_TESTS=OFF \
    && ninja -j$(nproc) \
    && ninja install \
    && pip install --no-cache-dir pyarrow \

Build Docker image with pyarrow and Alpine

Now that you have the Dockerfile above, download it to your machine, then change the port in the EXPOSE instruction and the file name in the CMD instruction. Adjust any other instructions your application needs.

Run the command below to build the image. I assume you saved the Dockerfile above in your repository as Dockerfile.alpine.

docker build -t my-app:latest -f Dockerfile.alpine .

The image build may take up to 30 minutes because the Arrow C++ library is compiled from source.
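
Once the build finishes, a short smoke test can confirm that pyarrow imports and works inside the image. The sketch below is a hypothetical `check_pyarrow.py`, not part of the original setup. It round-trips a small table through the Arrow IPC stream format in memory and reports the Python package version next to the version of the Arrow C++ library it is linked against. Because the Dockerfile's `pip install pyarrow` is not pinned to `ARROW_VERSION`, comparing the two versions is a cheap sanity check.

```python
def smoke_test():
    """Import pyarrow and round-trip a tiny table through an in-memory IPC stream.

    Returns a dict with version info and the check result, or None if pyarrow
    cannot be imported (not installed, or its shared libraries fail to load).
    """
    try:
        import pyarrow as pa
        from pyarrow import ipc
    except ImportError:
        return None

    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

    # Serialize the table to an in-memory buffer.
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)

    # Read it back and compare with the original.
    restored = ipc.open_stream(sink.getvalue()).read_all()
    return {
        "pyarrow_version": pa.__version__,
        "arrow_cpp_version": pa.cpp_version,
        "rows": restored.num_rows,
        "round_trip_ok": restored.equals(table),
    }


if __name__ == "__main__":
    result = smoke_test()
    print(result if result is not None else "pyarrow could not be imported")
```

Assuming the script sits next to the application code (so `COPY . .` puts it in /app), it could be run with `docker run --rm my-app:latest python check_pyarrow.py`.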

With python:3.12.7-slim, my image is about 1.6 GB (including about 700 MB of app code).

With the Dockerfile above, using python:3.12.7-alpine, my image is about 3.4 GB, roughly double the size of the slim image. The Dockerfile never removes the build toolchain, which likely accounts for much of the difference.

The large image size is a drawback. In my case, however, security takes priority, so the increased size is acceptable.

Conclusion

I hope this article helps you if you run into the error of not being able to install pyarrow on Alpine.

Alpine seems to be very popular with companies now because of its security benefits. I struggled with this error for more than a week, and I hope the solution above is useful to you.
