External Python dependencies can be installed with pip and managed as a per-project list in a requirements.txt file. To install the dependencies for a project, use:
> python3 -m pip install -r requirements.txt
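The file itself is just a list of package names, usually with pinned versions, one per line; for example (the entries here are illustrative):
# requirements.txt - example entries, versions pinned for reproducible installs
requests==2.31.0
pytest==7.4.0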
On macOS, it’s no longer possible to install modules globally unless you force the install with the ‘--break-system-packages’ option; if you run pip as above, you’ll see the error:
error: externally-managed-environment
× This environment is externally managed
Instead, create a virtual environment for your project, where dependencies are installed locally to the project:
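For example, a minimal sketch (.venv is just a conventional directory name):
# create and activate a project-local virtual environment
$ python3 -m venv .venv
$ source .venv/bin/activate
# dependencies now install into .venv instead of the system Python
$ python3 -m pip install -r requirements.txt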
While trying to install some Python packages with pip inside a Docker container, I ran into this issue:
# pip3 install pytest
Collecting pytest
Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7feb52c30630>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/pytest/
At first I thought this was something to do with network restrictions, since I’m running this on a Linux AWS Workspace, but I have internet access enabled. Running the command on the Workspace itself works as expected, so this is something specific to the Docker container. Next I thought it might be something to do with the specific container image I was using, but after trying a few others I saw the same error with every container.
Searching online for the error “Failed to establish a new connection: [Errno -3] Temporary failure in name resolution” I found this question and answer, which suggested running the Docker container with the host networking option.
So instead of running bash in the container like this:
$ docker run -it tensorflow/tensorflow:1.12.0-py3 bash
Pass in the --network=host option like this:
$ docker run --network=host -it tensorflow/tensorflow:1.12.0-py3 bash
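To confirm it really is DNS resolution that’s failing rather than pip itself, a quick check with and without host networking (a sketch, reusing the same image):
# without host networking: fails with the same name resolution error
$ docker run --rm tensorflow/tensorflow:1.12.0-py3 python -c "import socket; print(socket.gethostbyname('pypi.org'))"
# with host networking: resolves and prints an IP address
$ docker run --rm --network=host tensorflow/tensorflow:1.12.0-py3 python -c "import socket; print(socket.gethostbyname('pypi.org'))"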
I’ve been trying to deploy a Python-based AWS Lambda that uses PyTorch. The problem I’ve run into is that the size of the deployment package with PyTorch and its platform-specific dependencies is far beyond the maximum size of a zip you can deploy as an AWS Lambda. Per the AWS Lambda Limits page, the maximum deployable zip is 50MB (and unzipped it needs to be less than 250MB).
I found this article which suggested to build PyTorch from source in an Amazon AMI EC2, using build options to reduce the build size. I followed all steps up to but not including line 65 as I don’t need torchvision.
If you’re looking for the tl;dr summary, here are the key points:
yes, this approach works! (although it took many hours spread over a few weeks to get to this point!)
the specific Amazon AMI you need is the one that’s currently in use for running AWS Lambdas (this will obviously change at some point but as of 9/3/18 this AMI works) : amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 (ami-aa5ebdd2)
a t2.micro EC2 instance does not have enough RAM to successfully build PyTorch. I used a t2.medium with 4GB RAM.
you can’t take a trained model .pt file generated with one version of PyTorch/torch and use it to generate text with a different version; the PyTorch version used for training and for generating output must be identical
Ok, beyond the tl;dr summary above, here’s my experience following the steps in this article.
At line 63:
python setup.py install
I got this error:
Could not find /home/ec2-user/pytorch/torch/lib/gloo/CMakeLists.txt Did you run 'git submodule update --init'?
I ran the suggested ‘git submodule update --init’ and then re-ran the setup.py script; this time it ran for a while but ended with the error:
gcc: error trying to exec 'cc1plus': execvp: No such file or directory
error: command 'gcc' failed with exit status 1
I spent a bunch of time trying to work out what was going on here, but decided to take a different direction: skip building Python 3.6 from source and try recreating these steps using the Python 2.7 that is preinstalled in the Amazon Linux 2 AMI. The only part that’s slightly different is that pip is not preinstalled, so I installed that (plus virtualenv) first.
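Something like this works (a sketch; the exact yum package name is my assumption of the usual Amazon Linux one):
# install pip from the distro repos, then virtualenv via pip
$ sudo yum install -y python-pip
$ sudo pip install virtualenv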
At this point I pick up the steps from creating the virtualenv:
virtualenv ~/shrink_venv
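and then activate it before building anything into it (standard virtualenv usage, matching the venv just created):
$ source ~/shrink_venv/bin/activate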
After the step to build PyTorch, I’ve now got (another) different error:
as: out of memory allocating 4064 bytes after a total of 45686784 bytes
{standard input}: Assembler messages:
{standard input}:934524: Fatal error: can't close build/temp.linux-x86_64-2.7/torch/csrc/jit/python_ir.o: Memory exhausted
torch/csrc/jit/python_ir.cpp:215:2: fatal error: error writing to -: Broken pipe
Ugh, I’m running in a t2.micro that only has 1GB of RAM. Let’s stop the instance, change the instance type to a t2.medium with 4GB, and try building again.
Running free before:
$ free
              total        used        free      shared  buff/cache   available
Mem:        1009384       40468      870556         288       98360      840700
Swap:             0           0           0
And now after resizing:
$ free
              total        used        free      shared  buff/cache   available
Mem:        4040024       55004     3825940         292      159080     3780552
Swap:             0           0           0
Ok, trying again, but since we’ve rebooted the instance, remembering to set the flags that minimize the build options, which was the whole reason we were doing this:
$ export NO_CUDA=1
$ export NO_CUDNN=1
Next error:
error: could not create '/usr/lib64/python2.7/site-packages/torch': Permission denied
Ok, let’s run the build with sudo instead then. That fixes that.
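One thing to watch when switching to sudo: by default it doesn’t pass your environment through, so the NO_CUDA/NO_CUDNN exports above would be lost. sudo’s -E option preserves them; a sketch:
# -E keeps NO_CUDA/NO_CUDNN visible to the build
$ sudo -E python setup.py install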
Now I’m at a point where I can actually run the generate.py script but now I’ve got a completely different error:
/home/ec2-user/shrinkenv/lib/python2.7/site-packages/torch/serialization.py:316: SourceChangeWarning: source code of class 'torch.nn.modules.sparse.Embedding' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
Traceback (most recent call last):
File "generate.py", line 54, in <module>
decoder = torch.load(args.filename)
File "/home/ec2-user/shrinkenv/lib/python2.7/site-packages/torch/serialization.py", line 261, in load
return _load(f, map_location, pickle_module)
File "/home/ec2-user/shrinkenv/lib/python2.7/site-packages/torch/serialization.py", line 409, in _load
result = unpickler.load()
AttributeError: 'module' object has no attribute '_rebuild_tensor_v2'
Searching for the last part of this error, I found this post, which implies my trained model .pt file is from a different torch/pytorch version … which it most likely is, as I trained using a version installed with pip and am now trying to generate with a version built from source.
Rather than spend more time on this (some articles suggested you can read the .pt model from one pytorch version and convert it, but this doesn’t seem like a trivial activity and requires writing some code to do the conversion), I’m going to train a new model with the same version I just built from source.
Now that’s successfully done, I have my Lambda handler script ready to go and package up, so back to the final steps from the article to zip up everything built and installed so far in my virtualenv:
cd $VIRTUAL_ENV/lib/python2.7/site-packages
zip -r ~/kevinhookebot-ml-lambda-generate-py.zip *
We’re at 57MB, which is already over the 50MB limit for a zip uploaded directly to Lambda, but for a package deployed via S3 the limit that matters is the 250MB unzipped size. Adding char-rnn.pytorch, my generated model and Lambda handler into the same zip brings it to 58MB, so still well within that limit.
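Since the zip is over the 50MB direct-upload limit, it has to go to S3 first and be deployed from there; a sketch, with hypothetical bucket and function names:
# upload the package to S3 (bucket name is hypothetical)
$ aws s3 cp ~/kevinhookebot-ml-lambda-generate-py.zip s3://my-lambda-packages/
# point the Lambda (function name is hypothetical) at the new package
$ aws lambda update-function-code --function-name kevinhookebot-generate \
    --s3-bucket my-lambda-packages --s3-key kevinhookebot-ml-lambda-generate-py.zip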
Let’s upload and deploy. Test calling the Lambda, and now we get:
Unable to import module 'generatelambda': /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /var/task/torch/lib/libshm.so)
Searching for this error I found this post, which has a link to this page listing a specific AMI version to be used when compiling dependencies for a Lambda deployment (amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2). The default Amazon Linux 2 AMI is probably not this same image (and I’d already tried the Amazon Linux AMI 2018.03.0 and ran into other issues on that one), so it looks like I need to start over (but getting close now, surely!)
Ok, new EC2 t2.medium instance with exactly the same AMI as mentioned above. Retraced my steps, and now I’m almost back at the same error as before:
gcc: error trying to exec 'cc1plus': execvp: No such file or directory
error: command 'gcc' failed with exit status 1
Searching some more for this I found this post with a solution: change the PATH to point exactly where cc1plus is installed. Instead of 4.8.3 as in that post, in this AMI it seems I have 4.8.5, so the settings need adjusting to match.
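Here’s the shape of the settings I used; a sketch, assuming gcc 4.8.5 keeps its private binaries and libraries under the usual /usr/libexec and /usr/lib locations on this AMI (both paths below are assumptions):
# where cc1plus lives on this AMI (path is an assumption)
$ export COMPILER_PATH=/usr/libexec/gcc/x86_64-amazon-linux/4.8.5/
# the matching gcc libraries (path is an assumption)
$ export LIBRARY_PATH=/usr/lib/gcc/x86_64-amazon-linux/4.8.5/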
And then I noticed in the post they hadn’t included either of these when setting the new PATH, which seems like an oversight (I don’t think they make any difference if they’re not in the PATH), so I set my PATH like this, including COMPILER_PATH first.
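Along these lines (following from the exports above, so the exact value is an assumption):
# prepend COMPILER_PATH so gcc can find cc1plus
$ export PATH=$COMPILER_PATH:$PATH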
Trying to parse a file that had some unusual characters in it, I spent a while trying to work out whether the file was in an unusual encoding, or whether it was a CR vs CRLF issue, but no, I really did have some unusual chars in the file. Removed the offending chars and now all good.
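A quick way to track down characters like these is to ask GNU grep to flag anything outside the ASCII range; a sketch (the file name is a placeholder):
# -P enables Perl-style regex, -n prints line numbers; matches any non-ASCII byte
$ grep -nP '[^\x00-\x7F]' somefile.txt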