#+title: Packaging Searx - Part 1
#+date: <2022-07-24 Sun>
#+begin_src shell :exports none :results none
git clone https://github.com/searx/searx
#+end_src
#+begin_src gitignore :exports none :tangle .gitignore
searx
searx-nix
#+end_src
In this N-part blog post series, I'll show you the exact process of packaging [[https://github.com/searx/searx][Searx]], a metasearch engine. Here's an excerpt from Searx's readme to shed a bit of light on what we'll be packaging.
#+begin_quote
Searx is a free internet metasearch engine which aggregates results from more than 70 search services. Users are neither tracked nor profiled. Additionally, searx can be used over Tor for online anonymity.
#+end_quote
So if you're a privacy nerd or want to ensure Google doesn't know what you're cooking tonight, read on and you'll learn how Searx works from a system administrator's and packager's perspective.
Searx is already packaged in nixpkgs, but for the sake of this blog post, let's pretend it isn't. I'll go over everything I check and verify, and everything I do when packaging. So I'll quit mumbling and start Nix-ing!
* Discovery
First, it's imperative that we find the upstream repo we'll be working with. That may sound simple enough, and in the case of Searx it luckily is, but it can also be challenging; it all depends on how well-known the project is and how unique its name is. My recommendation is to use a search engine, in this case searching for ~searx git~, which gets us [[https://github.com/searx/searx]].
Now that we have a link to the repo, we need to identify the language and, for some languages, the build system. There are several ways to do this; one is to look at the root of the repo for a few recognizable files. I'll leave an incomplete table below.
| files / directories               | language / build system |
|-----------------------------------+-------------------------|
| Cargo.toml, Cargo.lock            | Rust - cargo            |
| requirements.txt, setup.py        | Python 2/3              |
| CMakeLists.txt                    | C, C++ - cmake          |
| meson.build                       | C, C++ - meson          |
| composer.json, composer.lock      | PHP - composer          |
| package.json, package-lock.json   | Node - npm              |
| package.json, yarn.lock           | Node - yarn             |
| *.cabal, stack.yaml, package.yaml | Haskell - stack/cabal   |
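Each of those rows roughly maps to a builder on the nixpkgs side. Purely as an illustration, here's a hypothetical sketch for two of them; the project names, versions and hashes are placeholders I made up, only ~stdenv.mkDerivation~, ~rustPlatform.buildRustPackage~ and ~fetchFromGitHub~ are real nixpkgs helpers.
#+begin_src nix
# A hedged sketch only: two of the table's rows mapped to their usual nixpkgs builders.
# Every name, version and hash below is a placeholder.
{ lib, stdenv, cmake, rustPlatform, fetchFromGitHub }:
{
  # CMakeLists.txt -> stdenv.mkDerivation with cmake as a native build input
  some-c-project = stdenv.mkDerivation {
    pname = "some-c-project";
    version = "0.1.0";
    src = fetchFromGitHub {
      owner = "example";
      repo = "some-c-project";
      rev = "v0.1.0";
      sha256 = lib.fakeSha256;
    };
    nativeBuildInputs = [ cmake ];
  };

  # Cargo.toml + Cargo.lock -> rustPlatform.buildRustPackage
  some-rust-project = rustPlatform.buildRustPackage {
    pname = "some-rust-project";
    version = "0.1.0";
    src = fetchFromGitHub {
      owner = "example";
      repo = "some-rust-project";
      rev = "v0.1.0";
      sha256 = lib.fakeSha256;
    };
    cargoSha256 = lib.fakeSha256;
  };
}
#+end_src
These builders are the hand-written entry points; for ecosystems with lockfiles you usually also reach for a generator tool, which brings us to the next point.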
I won't list tools to use when packaging these different languages, because the recommended set changes often and I'd have to keep this blog post up to date :), but they're easy enough to find. Generally, searching for ~<package-manager>2nix~ will get you what you need.
#+begin_note dream2nix
[[https://github.com/nix-community/dream2nix][dream2nix]] is a new and shiny thing, I personally haven't used it yet and won't use it in this blog post, but do keep it in mind and check whether it's relevant to your project the next time you're packaging.
#+end_note
Looking at the repository we see a ~requirements.txt~ and a ~setup.py~. The first one is valuable because it *should* give us a list of all the Python packages we need; the second we need to keep in mind, since it contains arbitrary custom Python that we may need to inspect and fix.
#+begin_src fundamental
certifi==2022.5.18.1
babel==2.9.1
flask-babel==2.0.0
flask==2.1.1
jinja2==3.1.2
lxml==4.9.0
pygments==2.8.0
python-dateutil==2.8.2
pyyaml==6.0
httpx[http2]==0.23.0
Brotli==1.0.9
uvloop==0.16.0; python_version >= '3.7'
uvloop==0.14.0; python_version < '3.7'
httpx-socks[asyncio]==0.7.4
langdetect==1.0.9
setproctitle==1.2.2
#+end_src
It's also worth looking at the ~Dockerfile~ and any ~Makefile~, ~Justfile~, or ~scripts~ folder. Here we have a ~Dockerfile~ and also a ~Makefile~, lucky! Let's start with the ~Dockerfile~; I'll pick out the important bits only.
#+begin_src dockerfile
FROM alpine:3.15
#+end_src
A pretty crucial piece of information here: we now know the distro the container uses, so we can discern the environment a bit, and we know that Searx will happily run on musl libc.
#+begin_src dockerfile
ENTRYPOINT ["/sbin/tini","--","/usr/local/searx/dockerfiles/docker-entrypoint.sh"]
#+end_src
Here we see where we should look for the startup script.
#+begin_src dockerfile
ENV INSTANCE_NAME=searx \
    AUTOCOMPLETE= \
    BASE_URL= \
    MORTY_KEY= \
    MORTY_URL= \
    SEARX_SETTINGS_PATH=/etc/searx/settings.yml \
    UWSGI_SETTINGS_PATH=/etc/searx/uwsgi.ini
#+end_src
Here we have an *incomplete* list of environment variables we can pass into the Docker container. It's important to later notice where they're handled: in the scripting or in the actual program itself?
#+begin_src shell
apk add --no-cache -t build-dependencies \
    build-base \
    py3-setuptools \
    python3-dev \
    libffi-dev \
    libxslt-dev \
    libxml2-dev \
    openssl-dev \
    tar \
    git \
#+end_src
Here we see a list of packages installed with apk, but you (and me too, actually) may not know what ~-t build-dependencies~ does. It's best to look at the manpage for ~apk add~, so search for ~apk-add man~. According to [[https://www.mankier.com/8/apk-add]], ~-t~ creates a virtual package with the dependencies listed on the command line and then installs that package. So we have one package, ~build-dependencies~, containing the set of packages we need at build time.
#+begin_src shell
apk add --no-cache \
    ca-certificates \
    su-exec \
    python3 \
    py3-pip \
    libxml2 \
    libxslt \
    openssl \
    tini \
    uwsgi \
    uwsgi-python3 \
    brotli \
#+end_src
Next we have the list of packages needed at runtime. This one is really important to remember, since we may have to add these in a special way later. You'll see what I mean.
#+begin_src shell
pip3 install --upgrade pip wheel setuptools \
#+end_src
Then it upgrades ~pip~, ~wheel~, and ~setuptools~. I personally had to look up what ~wheel~ is, but searching [[https://pkgs.alpinelinux.org/packages?name=*wheel*&branch=edge][Alpine Linux packages]] yields no results, so let's just ignore it for now. If it doesn't come up later, it's not important.
#+begin_src shell
pip3 install --no-cache -r requirements.txt \
#+end_src
Second to last, it installs the packages specified in ~requirements.txt~, as expected.
#+begin_src shell
apk del build-dependencies \
&& rm -rf /root/.cache
#+end_src
And lastly it does some cleanup, which is interesting, because I expected those build dependencies to still be needed later by some custom Searx native component, but I guess it makes sense that they're not.
#+begin_src dockerfile
COPY searx ./searx
COPY dockerfiles ./dockerfiles
#+end_src
We now see where that startup script comes from.
#+begin_src dockerfile
RUN /usr/bin/python3 -m compileall -q searx; \
    touch -c --date=@${TIMESTAMP_SETTINGS} searx/settings.yml; \
    touch -c --date=@${TIMESTAMP_UWSGI} dockerfiles/uwsgi.ini; \
    if [ ! -z $VERSION_GITCOMMIT ]; then\
      echo "VERSION_STRING = VERSION_STRING + \"-$VERSION_GITCOMMIT\"" >> /usr/local/searx/searx/version.py; \
    fi; \
    find /usr/local/searx/searx/static -a \( -name '*.html' -o -name '*.css' -o -name '*.js' \
        -o -name '*.svg' -o -name '*.ttf' -o -name '*.eot' \) \
        -type f -exec gzip -9 -k {} \+ -exec brotli --best {} \+
#+end_src
This is a complicated little beast. We see ~searx/settings.yml~, ~dockerfiles/uwsgi.ini~ and ~/usr/local/searx/searx/version.py~; we also see that it byte-compiles all the Python files, but that will be taken care of by nixpkgs. Interestingly, it also compresses all the static assets. The find command looks for all files ending in ~.html~, ~.css~, ~.js~, ~.svg~, ~.ttf~ and ~.eot~, then executes ~gzip -9 -k~ and ~brotli --best~ on them (here I again had to search for what brotli is; it turns out to be a [[https://github.com/google/brotli][compression scheme]]).
That's all from the Dockerfile. Now we need to look at the script it calls.
** ~docker-entrypoint.sh~ script
#+begin_src shell
printf "\nEnvironment variables:\n\n"
printf " INSTANCE_NAME settings.yml : general.instance_name\n"
printf " AUTOCOMPLETE settings.yml : search.autocomplete\n"
printf " BASE_URL settings.yml : server.base_url\n"
printf " MORTY_URL settings.yml : result_proxy.url\n"
printf " MORTY_KEY settings.yml : result_proxy.key\n"
printf " BIND_ADDRESS uwsgi bind to the specified TCP socket using HTTP protocol. Default value: \"${DEFAULT_BIND_ADDRESS}\"\n"
#+end_src
That's a nice little rundown of the supported configuration options, and it also tells us that Searx is configured through ~settings.yml~. This knowledge will come in handy when we're writing the NixOS module for Searx.
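We'll only get to the module in a later part, but just to hint at how this pays off, here's a hypothetical sketch of rendering such a ~settings.yml~ from a Nix attribute set with nixpkgs' ~pkgs.formats.yaml~ helper; the keys are the ones echoed by the entrypoint script above, the values are made up.
#+begin_src nix
# Hypothetical sketch, not part of the packaging yet: rendering a settings.yml
# from a Nix attribute set. Only keys mentioned by the entrypoint script are
# shown; the values are placeholders.
let
  pkgs = import <nixpkgs> { };
  settingsFormat = pkgs.formats.yaml { };
in
settingsFormat.generate "settings.yml" {
  general.instance_name = "my-searx";
  search.autocomplete = "duckduckgo";
  server.base_url = "https://searx.example.org/";
}
#+end_src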
#+begin_src shell
# update settings.yml
sed -i -e "s|base_url : False|base_url : ${BASE_URL}|g" \
-e "s/instance_name : \"searx\"/instance_name : \"${INSTANCE_NAME}\"/g" \
-e "s/autocomplete : \"\"/autocomplete : \"${AUTOCOMPLETE}\"/g" \
-e "s/ultrasecretkey/$(openssl rand -hex 32)/g" \
"${CONF}"
#+end_src
This command confirms that we are in fact dealing with a YAML ~settings.yml~.
#+begin_src shell
sed -i -e "s/image_proxy : False/image_proxy : True/g" \
"${CONF}"
cat >> "${CONF}" <<-EOF
# Morty configuration
result_proxy:
    url : ${MORTY_URL}
    key : !!binary "${MORTY_KEY}"
EOF
#+end_src
This bit is interesting. I initially thought the script updates the existing config with new values, but the code block above would mean that on every restart a new ~result_proxy~ block would be appended. So it must instead take a default config, write your settings into it and replace the current one with that.
It's common to realize things like this; it's unusual to get all assumptions right initially. As you go further into the package, you'll naturally stumble upon issues caused by your assumptions. Just make sure you keep track of what you know and what you assume.
#+begin_src bash
if [ -f "${CONF}" ]; then
if [ "${REF_CONF}" -nt "${CONF}" ]; then
# There is a new version
if [ $FORCE_CONF_UPDATE -ne 0 ]; then
# Replace the current configuration
printf '⚠️ Automaticaly update %s to the new version\n' "${CONF}"
if [ ! -f "${OLD_CONF}" ]; then
printf 'The previous configuration is saved to %s\n' "${OLD_CONF}"
mv "${CONF}" "${OLD_CONF}"
fi
cp "${REF_CONF}" "${CONF}"
$PATCH_REF_CONF "${CONF}"
else
# Keep the current configuration
printf '⚠️ Check new version %s to make sure searx is working properly\n' "${NEW_CONF}"
cp "${REF_CONF}" "${NEW_CONF}"
$PATCH_REF_CONF "${NEW_CONF}"
fi
else
printf 'Use existing %s\n' "${CONF}"
fi
else
printf 'Create %s\n' "${CONF}"
cp "${REF_CONF}" "${CONF}"
$PATCH_REF_CONF "${CONF}"
fi
#+end_src
When you encounter such an ugly piece of code, you don't need to understand it fully; the general gist of it is more than enough. At a glance we see that configuration is based on a reference config, which is patched to produce the final config.
#+begin_src shell
# make sure there are uwsgi settings
update_conf ${FORCE_CONF_UPDATE} "${UWSGI_SETTINGS_PATH}" "/usr/local/searx/dockerfiles/uwsgi.ini" "patch_uwsgi_settings"
# make sure there are searx settings
update_conf "${FORCE_CONF_UPDATE}" "${SEARX_SETTINGS_PATH}" "/usr/local/searx/searx/settings.yml" "patch_searx_settings"
#+end_src
Looking at the call sites, we see both the reference config file paths and the functions used for patching.
#+begin_src shell
patch_uwsgi_settings() {
CONF="$1"
# Nothing
}
#+end_src
Interestingly the ~uwsgi~ config doesn't get patched, so the reference one should be fine in most cases.
#+begin_src shell
exec su-exec searx:searx uwsgi --master --http-socket "${BIND_ADDRESS}" "${UWSGI_SETTINGS_PATH}"
#+end_src
And finally we see the command used to actually launch Searx.
** What is ~uwsgi~?
I once again had to look this up. According to Wikipedia, it's similar to CGI, if you're familiar with that. If not, well, it's used to let webservers like Nginx serve applications written in arbitrary languages. So: ~client -> Nginx -> uwsgi -> Python backend~.
*** Aren't we missing a full webserver?
#+begin_quote
uWSGI natively speaks HTTP, FastCGI, SCGI and its specific protocol named “uwsgi”
#+end_quote
No, uwsgi can serve as a lightweight webserver itself. So ideally, in the NixOS module we'd support all methods, HTTP, FastCGI, SCGI and uwsgi, but that's something to worry about later.
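Just to make that data flow concrete, here's a hypothetical sketch of pointing Nginx at a backend over the native ~uwsgi~ protocol on NixOS; the domain and socket path are made up, only the ~services.nginx~ options and the ~uwsgi_pass~ directive are real.
#+begin_src nix
# Hypothetical sketch: nginx speaking the native uwsgi protocol to a local
# uWSGI socket. Domain and socket path are invented for illustration;
# the uwsgi_params file ships with the nginx package itself.
{ pkgs, ... }:
{
  services.nginx.enable = true;
  services.nginx.virtualHosts."searx.example.org".locations."/".extraConfig = ''
    include ${pkgs.nginx}/conf/uwsgi_params;
    uwsgi_pass unix:/run/searx/uwsgi.sock;
  '';
}
#+end_src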
* Packaging
Now that we know all there is to know from the Docker image and related files, we can start writing Nix expressions. First, let's quickly create a new repository. We'll start out with a flake; it's easier, and if done right it can be easily ported to nixpkgs later.
#+begin_src shell :results none
git init searx-nix
#+end_src
#+begin_src nix :tangle searx-nix/flake.nix
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs";
  outputs =
    {
      self,
      nixpkgs
    }:
    let
      supportedSystems = [ "x86_64-linux" ];
      forAllSystems' = nixpkgs.lib.genAttrs;
      forAllSystems = forAllSystems' supportedSystems;
      pkgsForSystem =
        system:
        import nixpkgs { inherit system; };
    in
    {
      packages = forAllSystems
        (system:
          let
            pkgs = pkgsForSystem system;
          in
          {
            default = pkgs.callPackage ./searx.nix {};
          }
        );
    };
}
#+end_src
We then create a tiny ~flake.nix~. The cruft around it is generic and not really important; the important bit is src_nix{pkgs.callPackage ./searx.nix {}}, which ensures that our actual package doesn't care whether it lives in a flake or not.
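To illustrate that point, a non-flake consumer could hypothetically call the exact same file from a plain ~default.nix~ (this file isn't part of our repo, it's just a sketch):
#+begin_src nix
# Hypothetical default.nix next to searx.nix: same package, no flake involved.
let
  pkgs = import <nixpkgs> { };
in
pkgs.callPackage ./searx.nix { }
#+end_src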
Looking up ~nixpkgs python~ gets us to the nixpkgs manual (the information is both in the official one and in ryantm's rendering, but the latter is nicer since it isn't one huge HTML page): [[https://ryantm.github.io/nixpkgs/languages-frameworks/python/][ryantm's nixpkgs manual]].
#+begin_src nix
{ lib, python3 }:
python3.pkgs.buildPythonApplication rec {
  pname = "luigi";
  version = "2.7.9";
  src = python3.pkgs.fetchPypi {
    inherit pname version;
    sha256 = "035w8gqql36zlan0xjrzz9j4lh9hs0qrsgnbyw07qs7lnkvbdv9x";
  };
  propagatedBuildInputs = with python3.pkgs; [ tornado python-daemon ];
  meta = with lib; {
    ...
  };
}
#+end_src
As an example we're given a derivation for luigi. I don't know, and don't need to know, what luigi is; to speed up packaging it's important to ignore irrelevant information and not go researching it.
Based on the example derivation we can build our own. Instead of ~python3.pkgs.fetchPypi~ we're going to use ~fetchFromGitHub~ as that's more universal and easier to work with.
#+begin_src nix :tangle searx-nix/searx.nix
{
  lib,
  python3,
  fetchFromGitHub
}:
with lib;
let
  pname = "searx";
  version = "1.0.0";
in
python3.pkgs.buildPythonApplication {
  inherit pname version;
  src = fetchFromGitHub {
    rev = version;
    repo = pname;
    owner = pname;
    # If you update the version, you need to switch back to ~lib.fakeSha256~ and copy the new hash
    sha256 = "sha256-sIJ+QXwUdsRIpg6ffUS3ItQvrFy0kmtI8whaiR7qEz4="; # lib.fakeSha256;
  };
  postPatch = ''
    sed -i 's/==.*$//' requirements.txt
  '';
  # tests try to connect to network
  doCheck = false;
  pythonImportsCheck = [ "searx" ];
  # Since Python is weird, we need to put any dependencies we know of here
  # and not into ~buildInputs~ or ~nativeBuildInputs~ as one might expect.
  # As a starting point, just copy everything from ~requirements.txt~ and
  # hope for the best.
  propagatedBuildInputs = with python3.pkgs;
    [
      certifi
      babel
      flask-babel
      flask
      jinja2
      lxml
      pygments
      python-dateutil
      pyyaml
      # httpx[http2]
      httpx
      brotli
      # uvloop==0.16.0; python_version >= '3.7'
      # uvloop==0.14.0; python_version < '3.7'
      uvloop
      # httpx-socks[asyncio]
      httpx-socks
      langdetect
      setproctitle
      # sometimes the packages in ~requirements.txt~ may not be enough, so if something is missing, just add it
      requests
    ];
  meta = with lib; {
    # You'll fill this in later when upstreaming to nixpkgs
  };
}
#+end_src
#+begin_src shell :results none :exports none
git -C searx-nix add searx.nix flake.nix
#+end_src
#+begin_note clarifications
Let me just clarify a few things.
#+begin_src nix
{
  cmake,
  gnumake,
  gcc
}:
#+end_src
That pattern works because Nix has a special builtin that can inspect the arguments of a function, returning the set of all its argument names. ~callPackage~ then uses that set to call the function with the packages it asks for.
#+begin_src nix
{
  deps =
    [
      "cmake"
      "gnumake"
      "gcc"
    ];
  fn =
    {
      cmake,
      gnumake,
      gcc
    }:
}
#+end_src
The above would also work, but we like conciseness.
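If you're curious how that introspection could look, here's a simplified sketch of a ~callPackage~-like helper built on src_nix{builtins.functionArgs}; the real ~lib.callPackageWith~ in nixpkgs does quite a bit more (overrides, better error messages), so treat this as an approximation, not the actual implementation.
#+begin_src nix
# Simplified approximation of callPackage, not the real nixpkgs implementation.
let
  pkgs = import <nixpkgs> { };

  # builtins.functionArgs ({ cmake, gnumake, gcc }: ...) returns
  # { cmake = false; gnumake = false; gcc = false; } (false = no default given),
  # so intersectAttrs can pick exactly those attributes out of pkgs.
  myCallPackage = fn: overrides:
    fn ((builtins.intersectAttrs (builtins.functionArgs fn) pkgs) // overrides);
in
myCallPackage ({ cmake, gnumake, gcc }: [ cmake gnumake gcc ]) { }
#+end_src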
Lastly, you may ask what's up with ~lib.fakeSha256~. Well, it evaluates to ~sha256-AAAAAAAAAAAAAAAAAAAAA=~ (I didn't count the number of ~A~ characters, so that's probably wrong), which stands for /I don't know yet/. The point is that when Nix downloads the source code and checks the hash, it won't match, so it will print out both the hash you gave it and the one it calculated. You can then replace ~lib.fakeSha256~ with the actual hash.
#+end_note
At this point I looked at the already existing derivation in nixpkgs, because I was curious.
#+begin_src nix
# tests try to connect to network
doCheck = false;
pythonImportsCheck = [ "searx" ];
postPatch = ''
  sed -i 's/==.*$//' requirements.txt
'';
#+end_src
The src_nix{doCheck = false} is there by experimentation. I didn't know what src_nix{pythonImportsCheck = [ "searx" ]} does, so I looked around. I first went to [[https://github.com/NixOS/nixpkgs][nixpkgs]], clicked on ~Go to file~, searched for ~python~ and went to ~pkgs/top-level/python-packages.nix~. Inspecting the file, on line 41 I found the definition of ~buildPythonApplication~.
#+begin_src nix
buildPythonPackage = makeOverridablePythonPackage (lib.makeOverridable (callPackage ../development/interpreters/python/mk-python-derivation.nix {
  inherit namePrefix; # We want Python libraries to be named like e.g. "python3.6-${name}"
  inherit toPythonModule; # Libraries provide modules
}));
#+end_src
This points to a file called ~mk-python-derivation.nix~, so again, ~Go to file~. [[https://github.com/NixOS/nixpkgs/blob/nixos-22.05/pkgs/development/interpreters/python/mk-python-derivation.nix][mk-python-derivation.nix]] tells us a lot, but still not what ~pythonImportsCheck~ does; it's only mentioned alongside ~pythonImportsCheckHook~, which prompted me to look for said hook. Going to the containing directory and into ~hooks/python-imports-check-hook.sh~ we can satiate our curiosity: the hook essentially tries to import each listed module against the freshly built package, failing the build if any import fails.
Lastly, the src_nix{postPatch = ''...''} is used to patch out the version constraints in ~requirements.txt~, since the pinned versions otherwise cause an error at build time.
With all these things, we get a successful build.
In the next blog post we'll start with the NixOS module by first trying to actually get a full launch of Searx. Till then!