CI Failure: build workflow - git safe.directory issue #18

Closed
opened 2026-02-17 17:55:20 -08:00 by ci · 13 comments
Owner

CI Failure Report

Workflow: build (merged PR #17)
Commit: 5c09763713
Run: https://git.brads.house/commune/sandbox/actions/runs/34
Run ID: 2867

Triage Analysis

Root Cause: Git detected "dubious ownership" in the runner cache directory during checkout step.

Error Details:

fatal: detected dubious ownership in repository at '/data/.cache/act/87/7ac8e046e4e302867568015979cf38c7311aa10e947661460f22081473ddd8'

To add an exception for this directory, call:

	git config --global --add safe.directory /data/.cache/act/87/7ac8e046e4e302867568015979cf38c7311aa10e947661460f22081473ddd8

Task IDs: 5323, 5324 (both retries failed with same error)
Duration: ~5 seconds (immediate failure)
Impact: CI blocks all builds for commune/sandbox

Context:

  • This is a Git security feature introduced in Git 2.35.2
  • The runner cache directory ownership has changed, triggering this protection
  • Both automated retries failed with identical error
  • This is an infrastructure issue, not a code problem

Recommendation

@brad needs to fix the runner configuration by setting git safe.directory:

  1. SSH into the runner host
  2. Run: git config --global --add safe.directory "*"
  3. Restart the runner service
  4. Re-run the workflow manually to verify fix
## CI Failure Report **Workflow:** build (merged PR #17) **Commit:** 5c09763713df772674f460b802af97f80c205141 **Run:** https://git.brads.house/commune/sandbox/actions/runs/34 **Run ID:** 2867 ## Triage Analysis **Root Cause:** Git detected "dubious ownership" in the runner cache directory during checkout step. **Error Details:** ``` fatal: detected dubious ownership in repository at '/data/.cache/act/87/7ac8e046e4e302867568015979cf38c7311aa10e947661460f22081473ddd8' To add an exception for this directory, call: git config --global --add safe.directory /data/.cache/act/87/7ac8e046e4e302867568015979cf38c7311aa10e947661460f22081473ddd8 ``` **Task IDs:** 5323, 5324 (both retries failed with same error) **Duration:** ~5 seconds (immediate failure) **Impact:** CI blocks all builds for commune/sandbox **Context:** - This is a Git security feature introduced in Git 2.35.2 - The runner cache directory ownership has changed, triggering this protection - Both automated retries failed with identical error - This is an infrastructure issue, not a code problem ## Recommendation @brad needs to fix the runner configuration by setting git safe.directory: 1. SSH into the runner host 2. Run: `git config --global --add safe.directory "*"` 3. Restart the runner service 4. Re-run the workflow manually to verify fix
Author
Owner

🚨 Infrastructure Alert

This is a runner infrastructure issue that requires manual intervention.

Action required: SSH into runner host and run:

git config --global --add safe.directory "*"
systemctl restart forgejo-runner  # or whatever runner service name is

@brad please fix and re-run workflow manually to verify.

## 🚨 Infrastructure Alert This is a **runner infrastructure issue** that requires manual intervention. **Action required:** SSH into runner host and run: ```bash git config --global --add safe.directory "*" systemctl restart forgejo-runner # or whatever runner service name is ``` @brad please fix and re-run workflow manually to verify.
Owner

@agent why is cache on the runner causing this? is that correct? is fixing on host only option? why haven't we run into this before? i want ot be sure it wasn't triggered by the push we did

@agent why is cache on the runner causing this? is that correct? is fixing on host only option? why haven't we run into this before? i want ot be sure it wasn't triggered by the push we did
Owner

Good questions — let me break each one down.

Why is the runner cache causing this?

Git 2.35.2+ has a security feature that refuses to operate on directories owned by a different user than the current process. Our workflow uses container: docker:cli, meaning the CI job runs inside a Docker container (typically as root). But the Forgejo act runner itself runs as a service user (non-root) and caches downloaded action code in /data/.cache/act/... — owned by that service user.

When the runner mounts that cache directory into the container, git inside the container sees: "this directory is owned by UID 999 (runner user), but I'm running as UID 0 (root) — suspicious." It refuses to proceed. The cache path in the error (/data/.cache/act/87/7ac8e046...) is the cached copy of the actions/checkout@v4 action itself, not our code.

Is the diagnosis correct?

Yes. This is a legitimate runner infrastructure issue, not a code bug.

Is fixing on the host the only option?

No, there are two clean approaches:

Option A — Fix on host (what was recommended): SSH into the runner host and either:

  • git config --global --add safe.directory "*" — if the runner shares host git config with containers, this propagates in
  • Or nuke the stale cache: rm -rf /data/.cache/act/ — it'll be recreated fresh with consistent ownership

Option B — Fix in workflow: Add a pre-checkout step. But there's a catch: the error occurs when the runner tries to load the action code from cache before any run: steps execute. So a run: step before checkout won't intercept it. The reliable workflow-level fix is to configure the runner daemon itself to pass GIT_CONFIG_PARAMETERS as an environment variable into containers (Forgejo runner config, not the workflow).

Realistically, Option A (clearing the cache on the host) is the quickest fix.

Why haven't we hit this before?

The most likely explanation: the Forgejo runner service was recently restarted or re-installed under a different user, OR /data/.cache/act/ was previously created by root and the runner now runs as a non-root service account (or vice versa). Task 5197 succeeded, which means the cache was in a consistent state at that point — something changed between then and now.

Secondary possibility: the docker:cli image was updated to a newer Alpine that ships git 2.35.2+ with stricter ownership enforcement. Previously git may have silently allowed it.

Did the PR #17 push cause this?

No. The merge of #17 (goplaces) triggered the CI run that exposed this pre-existing condition, but it didn't cause it. The goplaces Dockerfile change is unrelated to runner cache ownership. Any push to main after the cache got into a bad state would have hit this exact error.

# SSH into runner host
rm -rf /data/.cache/act/
# then re-run the workflow from Forgejo UI — cache rebuilds clean

If ownership keeps drifting, the longer-term fix is ensuring the runner service always runs as the same user (or root) and the cache dir permissions are consistent.

Good questions — let me break each one down. ## Why is the runner cache causing this? Git 2.35.2+ has a security feature that refuses to operate on directories owned by a different user than the current process. Our workflow uses `container: docker:cli`, meaning the CI job runs *inside* a Docker container (typically as root). But the Forgejo act runner itself runs as a service user (non-root) and caches downloaded action code in `/data/.cache/act/...` — owned by that service user. When the runner mounts that cache directory into the container, git inside the container sees: "this directory is owned by UID 999 (runner user), but I'm running as UID 0 (root) — suspicious." It refuses to proceed. The cache path in the error (`/data/.cache/act/87/7ac8e046...`) is the cached copy of the `actions/checkout@v4` action itself, not our code. ## Is the diagnosis correct? Yes. This is a legitimate runner infrastructure issue, not a code bug. ## Is fixing on the host the only option? No, there are two clean approaches: **Option A — Fix on host** (what was recommended): SSH into the runner host and either: - `git config --global --add safe.directory "*"` — if the runner shares host git config with containers, this propagates in - Or nuke the stale cache: `rm -rf /data/.cache/act/` — it'll be recreated fresh with consistent ownership **Option B — Fix in workflow**: Add a pre-checkout step. But there's a catch: the error occurs when the runner tries to *load the action code from cache* before any `run:` steps execute. So a `run:` step before `checkout` won't intercept it. The reliable workflow-level fix is to configure the runner daemon itself to pass `GIT_CONFIG_PARAMETERS` as an environment variable into containers (Forgejo runner config, not the workflow). Realistically, Option A (clearing the cache on the host) is the quickest fix. ## Why haven't we hit this before? The most likely explanation: the Forgejo runner service was **recently restarted or re-installed under a different user**, OR `/data/.cache/act/` was previously created by root and the runner now runs as a non-root service account (or vice versa). Task 5197 succeeded, which means the cache was in a consistent state at that point — something changed between then and now. Secondary possibility: the `docker:cli` image was updated to a newer Alpine that ships git 2.35.2+ with stricter ownership enforcement. Previously git may have silently allowed it. ## Did the PR #17 push cause this? No. The merge of #17 (goplaces) **triggered** the CI run that exposed this pre-existing condition, but it didn't cause it. The goplaces Dockerfile change is unrelated to runner cache ownership. Any push to main after the cache got into a bad state would have hit this exact error. ## Recommended fix ```bash # SSH into runner host rm -rf /data/.cache/act/ # then re-run the workflow from Forgejo UI — cache rebuilds clean ``` If ownership keeps drifting, the longer-term fix is ensuring the runner service always runs as the same user (or root) and the cache dir permissions are consistent.
Owner

@agent ok runner-host is the one that builds our docker container, and the user is set differently.


  dind:
    image: docker:28.3.3-dind@sha256:c0872aae4791ff427e6eda52769afa04f17b5cf756f>
    container_name: forgejo-dind
    privileged: true
    restart: unless-stopped
    command: ["dockerd", "-H", "tcp://0.0.0.0:2375", "--tls=false"]

  runner-host:
    image: code.forgejo.org/forgejo/runner:12.5.3
    container_name: runner-host
    user: "0:0"
    links:
      - dind
    depends_on:
      dind:
        condition: service_started
    environment:
      DOCKER_HOST: tcp://dind:2375
    volumes:
      # replace the left-hand side from the ':' with your own path
      - /mnt/git/forgejo/runner/host-runner:/data
      - /mnt/git/forgejo/runner/host-runner/config.yaml:/config.yaml
    restart: unless-stopped
    #command: '/bin/sh -c "while : ; do sleep 1 ; done ;"'
    command: '/bin/sh -c "sleep 5; forgejo-runner daemon --config /config.yaml"'

  runner1:
    image: code.forgejo.org/forgejo/runner:12.5.3
    links:
      - dind
    depends_on:
      dind:
        condition: service_started
    container_name: runner1
    user: 1000:1000 # replace this with the UUID:GUID of the user you want to ru>
    environment:
      DOCKER_HOST: tcp://dind:2375
    volumes:
      # replace the left-hand side from the ':' with your own path
      - /mnt/git/forgejo/runner/instance1:/data
      - /mnt/git/forgejo/runner/instance1.yaml:/config.yaml
    restart: unless-stopped
root@git-runners:/mnt/git/forgejo/runner/host-runner# ls -alF
total 24
drwxrwxrwx  3   1024 users  4096 Feb  9 17:43 ./
drwxrwxrwx 11 runner runner 4096 Feb  9 17:27 ../
-rwxrwxrwx  1 runner runner   66 Feb  9 17:20 .ash_history*
drwxrwxrwx  4   1024 users  4096 Feb  9 17:27 .cache/
-rwxrwxrwx  1   1024 users   248 Feb  9 17:43 config.yaml*
-rwxrwxrwx  1 runner runner  443 Feb 11 13:07 .runner*

Should I fix this in the compose stack? i see .cache is 1024:users not sure what that means

@agent ok runner-host is the one that builds our docker container, and the user is set differently. ```yaml dind: image: docker:28.3.3-dind@sha256:c0872aae4791ff427e6eda52769afa04f17b5cf756f> container_name: forgejo-dind privileged: true restart: unless-stopped command: ["dockerd", "-H", "tcp://0.0.0.0:2375", "--tls=false"] runner-host: image: code.forgejo.org/forgejo/runner:12.5.3 container_name: runner-host user: "0:0" links: - dind depends_on: dind: condition: service_started environment: DOCKER_HOST: tcp://dind:2375 volumes: # replace the left-hand side from the ':' with your own path - /mnt/git/forgejo/runner/host-runner:/data - /mnt/git/forgejo/runner/host-runner/config.yaml:/config.yaml restart: unless-stopped #command: '/bin/sh -c "while : ; do sleep 1 ; done ;"' command: '/bin/sh -c "sleep 5; forgejo-runner daemon --config /config.yaml"' runner1: image: code.forgejo.org/forgejo/runner:12.5.3 links: - dind depends_on: dind: condition: service_started container_name: runner1 user: 1000:1000 # replace this with the UUID:GUID of the user you want to ru> environment: DOCKER_HOST: tcp://dind:2375 volumes: # replace the left-hand side from the ':' with your own path - /mnt/git/forgejo/runner/instance1:/data - /mnt/git/forgejo/runner/instance1.yaml:/config.yaml restart: unless-stopped ``` ```bash root@git-runners:/mnt/git/forgejo/runner/host-runner# ls -alF total 24 drwxrwxrwx 3 1024 users 4096 Feb 9 17:43 ./ drwxrwxrwx 11 runner runner 4096 Feb 9 17:27 ../ -rwxrwxrwx 1 runner runner 66 Feb 9 17:20 .ash_history* drwxrwxrwx 4 1024 users 4096 Feb 9 17:27 .cache/ -rwxrwxrwx 1 1024 users 248 Feb 9 17:43 config.yaml* -rwxrwxrwx 1 runner runner 443 Feb 11 13:07 .runner* ``` Should I fix this in the compose stack? i see `.cache` is 1024:users not sure what that means
Owner

Yeah, the ownership mismatch is the root cause. runner-host runs as 0:0 (root), but .cache is owned by 1024:users — Git 2.35.2+ refuses to operate on directories owned by a different user.

Quick fix (gets CI unblocked now): Add GIT_SAFE_DIRECTORY to the runner-host environment:

  runner-host:
    environment:
      DOCKER_HOST: tcp://dind:2375
      GIT_SAFE_DIRECTORY: "*"

Proper fix (root cause): Align ownership. Since runner-host is 0:0, chown the data dir to match:

chown -R root:root /mnt/git/forgejo/runner/host-runner

Re: 1024:users — that UID 1024 is whatever user set up the directory originally (probably your shell user on the host). The .runner and .ash_history files owned by runner runner suggest a previous runner service account. The mix happened when setup was done as one user but the container runs as another.

I would do both: chown the data dir to root, AND keep the GIT_SAFE_DIRECTORY env var as a safety net. No need to fight Docker about volume ownership — just make them consistent.

Yeah, the ownership mismatch is the root cause. `runner-host` runs as `0:0` (root), but `.cache` is owned by `1024:users` — Git 2.35.2+ refuses to operate on directories owned by a different user. **Quick fix (gets CI unblocked now):** Add `GIT_SAFE_DIRECTORY` to the runner-host environment: ```yaml runner-host: environment: DOCKER_HOST: tcp://dind:2375 GIT_SAFE_DIRECTORY: "*" ``` **Proper fix (root cause):** Align ownership. Since runner-host is `0:0`, chown the data dir to match: ```bash chown -R root:root /mnt/git/forgejo/runner/host-runner ``` **Re: `1024:users`** — that UID 1024 is whatever user set up the directory originally (probably your shell user on the host). The `.runner` and `.ash_history` files owned by `runner runner` suggest a previous runner service account. The mix happened when setup was done as one user but the container runs as another. I would do both: chown the data dir to root, AND keep the `GIT_SAFE_DIRECTORY` env var as a safety net. No need to fight Docker about volume ownership — just make them consistent.
Owner

@agent ok i wonder if the root cause here is that it's on an NFS mount?

root@git-runners:/mnt/git/forgejo/runner# chown -R root:root host-runner*
chown: changing ownership of 'host-runner/config.yaml': Operation not permitted
chown: changing ownership of 'host-runner/.ash_history': Operation not permitted
chown: changing ownership of 'host-runner/.cache/actcache/.nfs000000000021019800000167': Operation not permitted
chown: changing ownership of 'host-runner/.cache/actcache': Operation not permitted
chown: changing ownership of 'host-runner/.cache': Operation not permitted
chown: changing ownership of 'host-runner/.runner': Operation not permitted
chown: changing ownership of 'host-runner': Operation not permitted
chown: changing ownership of 'host-runner2/config.yaml': Operation not permitted
chown: changing ownership of 'host-runner2': Operation not permitted
root@git-runners:/mnt/git/forgejo/runner# 

Not sure how to proceed now

@agent ok i wonder if the root cause here is that it's on an NFS mount? ``` root@git-runners:/mnt/git/forgejo/runner# chown -R root:root host-runner* chown: changing ownership of 'host-runner/config.yaml': Operation not permitted chown: changing ownership of 'host-runner/.ash_history': Operation not permitted chown: changing ownership of 'host-runner/.cache/actcache/.nfs000000000021019800000167': Operation not permitted chown: changing ownership of 'host-runner/.cache/actcache': Operation not permitted chown: changing ownership of 'host-runner/.cache': Operation not permitted chown: changing ownership of 'host-runner/.runner': Operation not permitted chown: changing ownership of 'host-runner': Operation not permitted chown: changing ownership of 'host-runner2/config.yaml': Operation not permitted chown: changing ownership of 'host-runner2': Operation not permitted root@git-runners:/mnt/git/forgejo/runner# ``` Not sure how to proceed now
Owner

Yeah, NFS is almost certainly the root cause — you have two smoking guns:

  1. root_squash — NFS mounts default to squashing root to nobody:nobody on the server side, so chown as root on the client gets Operation not permitted. It's a security feature preventing root privilege escalation across the network.

  2. .nfs000000000021019800000167 — that's a classic NFS stale lock file (created when a file is deleted while still open by another process). Its presence confirms you're hitting NFS weirdness.

The git safe.directory error downstream is a symptom: files end up owned by a mismatched UID so git refuses to operate in them.

Options

Option 1: Move runner workspace to local storage (cleanest)
Mount NFS for persistent config/state, but point the runner's work_dir at a local path (e.g. /var/lib/forgejo-runner/work). The workspace doesn't need to be shared — only config/state does.

Option 2: Add no_root_squash to the NFS export
On the NFS server, edit /etc/exports:

/path/to/share client_ip(rw,no_root_squash)

Then exportfs -ra. Lets root on the client own files on the server. Real security tradeoff — only reasonable on a trusted internal network.

Option 3: Run the runner as a non-root user
If files were created as a specific UID, run the runner as that same UID. Sidesteps root_squash entirely.

Quick workaround for just the git issue:
Add to runner config or global gitconfig on the runner host:

git config --global --add safe.directory '*'

Won't fix the chown failures but stops git complaining about directory ownership.

I'd go with Option 1 — cleanest architecture. NFS for persistent state/config, local disk for ephemeral build workspaces.

Yeah, NFS is almost certainly the root cause — you have two smoking guns: 1. **`root_squash`** — NFS mounts default to squashing root to `nobody:nobody` on the server side, so `chown` as root on the client gets `Operation not permitted`. It's a security feature preventing root privilege escalation across the network. 2. **`.nfs000000000021019800000167`** — that's a classic NFS stale lock file (created when a file is deleted while still open by another process). Its presence confirms you're hitting NFS weirdness. The git `safe.directory` error downstream is a symptom: files end up owned by a mismatched UID so git refuses to operate in them. ## Options **Option 1: Move runner workspace to local storage (cleanest)** Mount NFS for persistent config/state, but point the runner's `work_dir` at a local path (e.g. `/var/lib/forgejo-runner/work`). The workspace doesn't need to be shared — only config/state does. **Option 2: Add `no_root_squash` to the NFS export** On the NFS server, edit `/etc/exports`: ``` /path/to/share client_ip(rw,no_root_squash) ``` Then `exportfs -ra`. Lets root on the client own files on the server. Real security tradeoff — only reasonable on a trusted internal network. **Option 3: Run the runner as a non-root user** If files were created as a specific UID, run the runner as that same UID. Sidesteps `root_squash` entirely. **Quick workaround for just the git issue:** Add to runner config or global gitconfig on the runner host: ``` git config --global --add safe.directory '*' ``` Won't fix the chown failures but stops git complaining about directory ownership. I'd go with **Option 1** — cleanest architecture. NFS for persistent state/config, local disk for ephemeral build workspaces.
Owner

@agent Ok i just changed it to run as the same user as the other containers. Now our problem is...:

Set up job19s
host-runner(version:v12.5.3) received task 5326 of job build, be triggered by event: push
workflow prepared
🚀  Start image=docker:cli
  🐳  docker pull image=docker:cli platform=linux/amd64 username= forcePull=false
Cleaning up network for job build, and network name is: WORKFLOW-a812804bc42970753a412f52b56ac406
  🐳  docker create image=docker:cli platform=linux/amd64 entrypoint=["tail" "-f" "/dev/null"] cmd=[] network="WORKFLOW-a812804bc42970753a412f52b56ac406"
  🐳  docker run image=docker:cli platform=linux/amd64 entrypoint=["tail" "-f" "/dev/null"] cmd=[] network="WORKFLOW-a812804bc42970753a412f52b56ac406"
  ☁️  git clone 'https://data.forgejo.org/actions/checkout' # ref=v4
  ☁️  git fetch 'https://data.forgejo.org/actions/checkout' # ref=v4
⭐ Run Main Checkout
  🐳  docker cp src=/data/.cache/act/f0/0d0d9d309336543775871d30faf69e7768a656165ee3c42af512caea100504/ dst=/var/run/act/actions/f0/0d0d9d309336543775871d30faf69e7768a656165ee3c42af512caea100504/
  🐳  docker exec cmd=[node /var/run/act/actions/f0/0d0d9d309336543775871d30faf69e7768a656165ee3c42af512caea100504/dist/index.js] user= workdir=
Checkout1s
OCI runtime exec failed: exec failed: unable to start container process: exec: "node": executable file not found in $PATH: unknown
⚙️ [runner]: exitcode '127': command not found, please refer to https://github.com/nektos/act/issues/107 for more information
@agent Ok i just changed it to run as the same user as the other containers. Now our problem is...: ``` Set up job19s host-runner(version:v12.5.3) received task 5326 of job build, be triggered by event: push workflow prepared 🚀 Start image=docker:cli 🐳 docker pull image=docker:cli platform=linux/amd64 username= forcePull=false Cleaning up network for job build, and network name is: WORKFLOW-a812804bc42970753a412f52b56ac406 🐳 docker create image=docker:cli platform=linux/amd64 entrypoint=["tail" "-f" "/dev/null"] cmd=[] network="WORKFLOW-a812804bc42970753a412f52b56ac406" 🐳 docker run image=docker:cli platform=linux/amd64 entrypoint=["tail" "-f" "/dev/null"] cmd=[] network="WORKFLOW-a812804bc42970753a412f52b56ac406" ☁️ git clone 'https://data.forgejo.org/actions/checkout' # ref=v4 ☁️ git fetch 'https://data.forgejo.org/actions/checkout' # ref=v4 ⭐ Run Main Checkout 🐳 docker cp src=/data/.cache/act/f0/0d0d9d309336543775871d30faf69e7768a656165ee3c42af512caea100504/ dst=/var/run/act/actions/f0/0d0d9d309336543775871d30faf69e7768a656165ee3c42af512caea100504/ 🐳 docker exec cmd=[node /var/run/act/actions/f0/0d0d9d309336543775871d30faf69e7768a656165ee3c42af512caea100504/dist/index.js] user= workdir= Checkout1s OCI runtime exec failed: exec failed: unable to start container process: exec: "node": executable file not found in $PATH: unknown ⚙️ [runner]: exitcode '127': command not found, please refer to https://github.com/nektos/act/issues/107 for more information ```
Owner

Root cause: docker:cli has no Node.js runtime, and actions/checkout@v4 is a Node.js action — hence "node": executable file not found.

Fix: swap to node:20-bookworm as the container base (Debian, has apt, has node), then install Docker CLI before using it. Checkout comes first since it only needs node.

jobs:
  build:
    runs-on: docker-host
    container:
      image: node:20-bookworm
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install Docker CLI
        run: apt-get update && apt-get install -y --no-install-recommends docker.io

      - name: Login to Forgejo Registry
        run: |
          echo "${{ secrets.REGISTRY_TOKEN }}" | docker login ${{ env.REGISTRY }} -u agent --password-stdin

      - name: Build image
        run: |
          docker build -t ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:rbw .
          docker tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:rbw ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest

      - name: Push to registry
        run: |
          docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:rbw
          docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest

      - name: Also tag locally (for same-host use)
        run: |
          docker tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:rbw openclaw-sandbox:rbw
          docker tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:rbw openclaw-sandbox:latest

The Docker socket is still forwarded from the host runner, so docker build/push will work the same — you just need the client binary, which docker.io provides. I can push this fix directly if you want.

Root cause: `docker:cli` has no Node.js runtime, and `actions/checkout@v4` is a Node.js action — hence `"node": executable file not found`. Fix: swap to `node:20-bookworm` as the container base (Debian, has apt, has node), then install Docker CLI before using it. Checkout comes first since it only needs node. ```yaml jobs: build: runs-on: docker-host container: image: node:20-bookworm steps: - name: Checkout uses: actions/checkout@v4 - name: Install Docker CLI run: apt-get update && apt-get install -y --no-install-recommends docker.io - name: Login to Forgejo Registry run: | echo "${{ secrets.REGISTRY_TOKEN }}" | docker login ${{ env.REGISTRY }} -u agent --password-stdin - name: Build image run: | docker build -t ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:rbw . docker tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:rbw ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest - name: Push to registry run: | docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:rbw docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest - name: Also tag locally (for same-host use) run: | docker tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:rbw openclaw-sandbox:rbw docker tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:rbw openclaw-sandbox:latest ``` The Docker socket is still forwarded from the host runner, so `docker build`/`push` will work the same — you just need the client binary, which `docker.io` provides. I can push this fix directly if you want.
Owner

@agent go ahead yeah let's get this fixed. i thought we were just making a simple change but this has cascaded.

@agent go ahead yeah let's get this fixed. i thought we were just making a simple change but this has cascaded.
Owner

Done — PR #19 is up: #19

Fixed both workflow files (build.yaml and build-coder.yml). The change:

# Before
container: docker:cli

# After
container:
  image: node:20-bookworm

...plus an "Install Docker CLI" step via apt-get install docker.io before any Docker commands. DOCKER_HOST=tcp://dind:2375 keeps propagating from the runner env so the DinD socket still works.

Once you merge, the workflow will trigger on main and should build clean. The cascade makes sense in hindsight — fixing the NFS/user ownership issue was the right call, it just exposed the next layer (no Node runtime in docker:cli). We are now properly fixed at both levels.

If CI still hiccups after merge, most likely culprit is apt-get slowness on first run — not a failure, just slow. Should be fine after that.

Done — PR #19 is up: https://git.brads.house/commune/sandbox/pulls/19 Fixed both workflow files (`build.yaml` and `build-coder.yml`). The change: ```yaml # Before container: docker:cli # After container: image: node:20-bookworm ``` ...plus an "Install Docker CLI" step via `apt-get install docker.io` before any Docker commands. `DOCKER_HOST=tcp://dind:2375` keeps propagating from the runner env so the DinD socket still works. Once you merge, the workflow will trigger on main and should build clean. The cascade makes sense in hindsight — fixing the NFS/user ownership issue was the right call, it just exposed the next layer (no Node runtime in docker:cli). We are now properly fixed at both levels. If CI still hiccups after merge, most likely culprit is `apt-get` slowness on first run — not a failure, just slow. Should be fine after that.
Owner

@agent ok we've reached the build stage but now those packages we added are causing it to fail:

#9 [4/9] RUN npm install -g @ast-grep/cli
#9 sha256:fd9ae6869e556fe252c879218e6f3a2d8f4773deb5bf215436107c6bb92556b1
#9 1.534 npm error code EEXIST
#9 1.534 npm error path /usr/bin/sg
#9 1.534 npm error EEXIST: file already exists
#9 1.534 npm error File exists: /usr/bin/sg
#9 1.534 npm error Remove the existing file and try again, or run npm
#9 1.534 npm error with --force to overwrite files recklessly.
#9 1.535 npm error A complete log of this run can be found in: /root/.npm/_logs/2026-02-18T02_22_56_708Z-debug-0.log
#9 ERROR: process "/bin/sh -c npm install -g @ast-grep/cli" did not complete successfully: exit code: 1
------
 > [4/9] RUN npm install -g @ast-grep/cli:
------
process "/bin/sh -c npm install -g @ast-grep/cli" did not complete successfully: exit code: 1
⚙️ [runner]: exitcode '1': failure

lets fix that.

i also noticed build-coder ran before the normal build. that's wrong, it should do the base image first

@agent ok we've reached the build stage but now those packages we added are causing it to fail: ``` #9 [4/9] RUN npm install -g @ast-grep/cli #9 sha256:fd9ae6869e556fe252c879218e6f3a2d8f4773deb5bf215436107c6bb92556b1 #9 1.534 npm error code EEXIST #9 1.534 npm error path /usr/bin/sg #9 1.534 npm error EEXIST: file already exists #9 1.534 npm error File exists: /usr/bin/sg #9 1.534 npm error Remove the existing file and try again, or run npm #9 1.534 npm error with --force to overwrite files recklessly. #9 1.535 npm error A complete log of this run can be found in: /root/.npm/_logs/2026-02-18T02_22_56_708Z-debug-0.log #9 ERROR: process "/bin/sh -c npm install -g @ast-grep/cli" did not complete successfully: exit code: 1 ------ > [4/9] RUN npm install -g @ast-grep/cli: ------ process "/bin/sh -c npm install -g @ast-grep/cli" did not complete successfully: exit code: 1 ⚙️ [runner]: exitcode '1': failure ``` lets fix that. i also noticed build-coder ran before the normal build. that's wrong, it should do the base image first
Owner

Both fixed in PR #21.

/usr/bin/sg conflict: Debian ships that path as shadow-utils' "run as group" binary. npm trips on EEXIST when ast-grep tries to install its own sg there. Fix is just rm -f /usr/bin/sg before the npm install — shadow-utils' sg is irrelevant in a container context.

Build order: Both workflows were firing off the same push and racing (coder was winning). Switched build-coder.yml from a push/paths trigger to workflow_run — it now waits for Build Sandbox Image to complete. Also added a conclusion == success guard so a failed base build won't kick off a coder build.

#21

Both fixed in PR #21. **`/usr/bin/sg` conflict**: Debian ships that path as shadow-utils' "run as group" binary. npm trips on `EEXIST` when ast-grep tries to install its own `sg` there. Fix is just `rm -f /usr/bin/sg` before the npm install — shadow-utils' sg is irrelevant in a container context. **Build order**: Both workflows were firing off the same push and racing (coder was winning). Switched `build-coder.yml` from a `push/paths` trigger to `workflow_run` — it now waits for `Build Sandbox Image` to complete. Also added a `conclusion == success` guard so a failed base build won't kick off a coder build. https://git.brads.house/commune/sandbox/pulls/21
brad closed this issue 2026-02-17 18:29:19 -08:00
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
commune/sandbox#18
No description provided.