-
Notifications
You must be signed in to change notification settings - Fork 6k
General dockerfile cleanup #6836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
b787458
to
66dad33
Compare
Thanks for this. I have some questions/comments, and tests listed below. I'm curious, how is moving COPY to the top of the file supposed to decrease build times? Are you referring to CI build or local build? I ask because I have had different experiences with this, and was wondering if you could point me to where this is documented. This PR will change the hashes of the layers, causing newer versions of the image to be unable to reuse existing layers. This is a minor issue, considering the Dockerfile in master is stuctured such that all the changing layers were near the end anyway. Moving to the top would cause newer builds to not reuse any existing layers if Related to build times, here is the difference between master and this branch (both with the base image cached)… Example, from master:
From this branch:
Looking into the effect on layer reuse, here's master's layers:
And this branch's layers:
As you can see, no layer can be reused from previous builds. And regarding my point about moving it to the top of the file and changing all subsequent layers…
And its build time:
If I switch back to master and build with a minor change to docker-entrypoint.sh, you'll see that I'm able to reuse everything up to where we begin adding the source. This is because the root directory changes when we modify docker-entrypoint.sh, so there's not really a good, clean way around this and there's very little gain considering the repository's structure is intended to change on every commit.
All that said, there are some modifications that could be made to improve layer reusability. For instance, because We could also increase layer reuse by adding each subdirectory individually in increasing order of popularity or likelihood of modification. For instance, here's an example Dockerfile for what I'm suggesting:
The problem with this approach is that it could easily go unnoticed if a new module is added to the project and not copied into the image. This is because the image only builds and tests swagger-codegen-cli (3 modules), but is intended to be a "full" swagger codegen container. It would be surprising after 3 release or so for anyone to assume Not only this, but Dockerfile doesn't currently support exclusions in COPY. There may be a way to work out a pattern that excludes only those manually added Any thoughts or comments on any of that? Sorry, I know it was a lot… but I've put a lot of thought into this Dockerfile to reduce it from the previous image (which I think was about 2GB). |
@@ -1,22 +1,22 @@ | |||
FROM jimschubert/8-jdk-alpine-mvn:1.0 | |||
|
|||
COPY docker-entrypoint.sh /usr/local/bin/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COPY should be after apk add
otherwise a change to the file would result in not using the cache install of OS packages.
@@ -1,22 +1,22 @@ | |||
FROM jimschubert/8-jdk-alpine-mvn:1.0 | |||
|
|||
COPY docker-entrypoint.sh /usr/local/bin/ | |||
|
|||
ENV GEN_DIR /opt/swagger-codegen | |||
|
|||
RUN set -x && \ | |||
apk add --no-cache bash | |||
|
|||
RUN mkdir /opt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line could be changed to WORKDIR ${GEN_DIR}
which will result in the directory being created.
https://docs.docker.com/engine/reference/builder/#workdir
snippet
If the WORKDIR doesn’t exist, it will be created even if it’s not used in any subsequent Dockerfile instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iirc, WORKDIR is evaluated and created after COPY. Do you know if this has changed? I can play around with it when I'm not in mobile.
Also, in the past, COPY did not work the same as mkdir -p
, so I'm not certain WORKDIR would be a great solution in the case that we add new subfolders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WORKDIR https://docs.docker.com/engine/reference/builder/#workdir
The WORKDIR instruction sets the working directory for any RUN, CMD, ENTRYPOINT, COPY and ADD instructions that follow it in the Dockerfile.
COPY https://docs.docker.com/engine/reference/builder/#copy
If <dest> doesn’t exist, it is created along with all missing directories in its path.
If any change is to be made to the existing Dockerfile, the following
FROM jimschubert/8-jdk-alpine-mvn:1.0
# install OS packages; should only change when new packages required
# or when the base image is updated.
RUN apk add --no-cache bash
# only changes if the value of GEN_DIR is changed
ENV GEN_DIR /opt/swagger-codegen
WORKDIR ${GEN_DIR}
# most frequent change as it includes the entire context
# copies all of the contents of ./modules/ to the directory created
# by the COPY command 'modules/'. This means if any new modules
# are added then they will get included.
COPY ./modules/ ${GEN_DIR}/modules/
VOLUME ${MAVEN_HOME}/.m2/repository
# most costly to execute from a time perspective
RUN mvn -am -pl "modules/swagger-codegen-cli" package
# least likely to change but if any higher in the file it could result in the maven build running
COPY docker-entrypoint.sh /usr/local/bin/
ENTRYPOINT ["docker-entrypoint.sh"]
CMD ["build"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kenjones-cisco I'm a little confused by your proposal, so I'd like to clarify a bit.
Is your suggestion basically to move WORKDIR
up and replace the mkdir
with it?
Also, are you suggesting to include only ./modules
? You'd also need at a minimum the pom.xml
from the root for this to even build. I'd think this is fine to not include all the samples and build related files in the image.
Being that WORKDIR
, ENV
, and VOLUME
are 0B layers, I don't think there's any benefit to having them in different areas of the Dockerfile. These are build-time variables and specifying them along with docker run
doesn't really make a difference as it doesn't change the built image, only the running container.
I also have some questions about your comments.
only changes if the value of GEN_DIR is changed
I don't understand this. The layer of the image doesn't change if you pass -e "GEN_DIR=/src"
. Also, because WORKDIR
is defined within the build context as /opt/swagger-codegen
, this doesn't change when you specify an override for GEN_DIR
variable. Example:
$ docker run -it -e "GEN_DIR=/src" -v $PWD:/src swagger-codegen /bin/bash
bash-4.3# pwd
/opt/swagger-codegen
least likely to change but if any higher in the file it could result in the maven build running
I don't understand what this means. The addition of the entry point is 899B. It only matters that it comes after the apk add
. Everything else in the Dockerfile is a 0B layer, or will change on every build (this image is built on every commit). The reason this is in a different location is because copying to /usr/local/bin
is a common practice, and because tweaking the entrypoint and rebuilding means we don't have to add the whole project and run the compilation again when we want to modify the entrypoint... is this what you're referring to? It absolutely would result in the build running through again if this was like:
COPY docker-entrypoint.sh /usr/local/bin/
# 0 or more instructions, then…
RUN mvn -am -pl "modules/swagger-codegen-cli" package
…and we were tweaking docker-entrypoint.sh
. This is why I had it at the bottom of the file to begin with.
However, regardless of where this instruction is in the dockerfile, the maven build will always run when you run docker build -t imagename .
on a new branch (this is where I'm confused by the statement).
tl;dr
I think your proposal would work if we add the pom.xml
and clean up some of the comments as they seem to discuss the image build context as if it's the container run context.
Also, I've realized that running with the default CMD of build
is not valid. We should probably change it to langs
or help
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yeah it makes sense that it sets the working directory for all following instructions, otherwise that wouldn't really be intuitive. I've always used absolute paths for copy, so never had to consider WORKDIR there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kenjones-cisco played around with this a bit, samples and other unnecessary files are already excluded by the dockerignore. Also, in addition to pom.xml
we also need the checkstyles config.
Here's a working example:
FROM jimschubert/8-jdk-alpine-mvn:1.0
RUN set -x && \
apk add --no-cache bash
ENV GEN_DIR /opt/swagger-codegen
WORKDIR ${GEN_DIR}
VOLUME ${MAVEN_HOME}/.m2/repository
COPY ./LICENSE ${GEN_DIR}
COPY ./google_checkstyle.xml ${GEN_DIR}
COPY ./modules/ ${GEN_DIR}/modules
COPY ./pom.xml ${GEN_DIR}
RUN mvn -am -pl "modules/swagger-codegen-cli" package
COPY docker-entrypoint.sh /usr/local/bin/
ENTRYPOINT ["docker-entrypoint.sh"]
CMD ["help"]
Yet, because almost everything else is already excluded, we receive very little gains from restructuring the file in this way (top is built with my changes above, bottom is built from master):
swagger-codegen-dockerfile latest 301867e4fa00 40 seconds ago 497MB
swagger-codegen latest f1dbcf25e9df About an hour ago 500MB
So, in the end... I think the only worthwhile change to the Dockerfile would be to change the CMD to help
or langs
. Even changing ADD
to COPY
has very little benefit as ADD internally does COPY with additional functionality for tgz
extraction and externally added files. I'd say since we have to change CMD, we can change ADD -> COPY and maybe add comments. But breaking apart the COPY instruction has very little gain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would agree, the WORKDIR change and ADD -> COPY are good changes.
The one way that breaking apart the COPY instructions is positive is related to layer caching. When you do COPY . ${GEN_DIR}
, and the only thing that changed was docker-entrypoint.sh
then because you are copying from .
then it will invalidate that cache.
Otherwise from a size perspective, there will not be any change or real value especially if the .dockerignore
is already well defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: breaking apart modules into layers
do you think this is worth it? It'll benefit is who pull the latest tag often. I think that's beneficial enough to eat the additional maintenance (although this would be minimal as a cli container). I've can make the changes in a separate PR, if you agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed!
The exclude should be handled by using It reduces extra files being copied into the image, and it reduces the size of the context sent to dockerd which can be very helpful if you have a large number of files, and only a small number should be included in the image. |
@kenjones-cisco The exclude I mentioned is conditional based on previous COPY instructions. The ignore file prevents these from being sent to the build context, which won't work. |
I had commented that the Dockerfile in this file is built on every commit. That doesn't seem to be the case (I thought it was previously as a tracking master build). Since this is only used by engineers who want to build without modification to their system (e.g. if they're stuck on Java 6 or have conflicting maven issues), we can make more assumptions and concessions in this image. Also, it's not clear to me why |
The image on master is rebuilt on every commit to master, |
The reason it is likely not used as part of run-in-docker.sh is because if you are developing a change, then you need to rebuild the modules, so if you using an image with the modules already included then it is not going to provide the expected results. This image is used for those actually generate code for their projects. |
The GEN_DIR and its contents are remapped in run-in-docker.sh. I guess because it also maps the Maven directory to the local file system, the container provides no additional benefit. Seems like that would cause issues, especially if that script runs in the container as root, it'll cause permissions issues in the host Maven cache. |
The permissions are different for sure, but as the files are maven cache which are only used within the container, that is not usually an issue. And the generated files coming from actions run via the container are part of .gitignore they never make it into the repository. For this project I actually have a custom All my development for all projects I work on are done similarly to |
PR checklist
./bin/
to update Petstore sample so that CIs can verify the change. (For instance, only need to run./bin/{LANG}-petstore.sh
and./bin/security/{LANG}-petstore.sh
if updating the {LANG} (e.g. php, ruby, python, etc) code generator or {LANG} client's mustache templates). Windows batch files can be found in.\bin\windows\
.3.0.0
branch for changes related to OpenAPI spec 3.0. Default:master
.Description of the PR
I've taken a little time to switch unnecessary add statements to copy and re-order layers for build the docker images to decrease build time.
fix #6835