CLaDS: A Cloud Based Lab for Data Science
The rise of the “big data” era has created a pressing demand for educating many data scientists and engineers quickly at low cost. It is essential they learn by working on assignments that involve real world data sets to develop the skills needed to be successful in the workplace. However, enabling instructors to flexibly deliver all kinds of data science assignments using real world data sets to large numbers of learners (both on-campus and off-campus) at low cost is a significant open challenge.
To address this emerging challenge generally, CLaDS has been developed to enable many learners around the world to work on real-world data science problems without having to move or otherwise distribute prohibitively large data sets. Leveraging version control and continuous integration, CLaDS provides a general infrastructure to enable any instructor to conveniently deliver any hands-on data science assignment that uses large real world data sets to as many learners as our cloud-computing infrastructure allows at very low cost.
The instructions below are specific to Microsoft Azure, but they should generalize to any cloud provider supported by docker-machine: substitute the Azure-specific arguments in the set-up scripts with the arguments required by your preferred cloud provider.
You should ensure that you have some experience using Microsoft Azure before setting up CLaDS for your class! We are not responsible for any loss of funds caused by following these instructions or otherwise. You should make sure to periodically monitor your subscription utilization to make sure things are working within expected parameters.
First, clone this repository using git:
git clone https://github.com/TIMAN-group/clads.git
In the clads folder created, you should see an azure directory containing scripts for spawning the required virtual machines for CLaDS on Microsoft Azure. Navigate to that folder:
cd clads
cd azure
Now, you should set the following environment variables for the scripts:
export AZURE_SUBSCRIPTION_ID="your-subscription-id-here"
export AZURE_CLIENT_ID="your-client-id-here"
export AZURE_CLIENT_SECRET="your-client-secret-here"
This sets the scripts to tie all machines created to the specified Azure subscription, authenticating with the provided Client ID and Client Secret.
Next, you will need to determine a proper Azure Region to host your CLaDS instance. Once you’ve determined this, set an environment variable as follows:
export AZURE_LOCATION="westcentralus" # or whatever other region you choose
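Because every later script depends on these four variables, it can help to confirm they are all set before creating any machines. The following is a minimal sketch, not part of the CLaDS scripts; the function name check_azure_env is an assumption made for illustration.

```shell
# Hypothetical helper (not part of the CLaDS repository).
# Reports each of the four variables above that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
check_azure_env() {
    missing=0
    for name in AZURE_SUBSCRIPTION_ID AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_LOCATION; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_azure_env in the same shell session before the create scripts should print nothing when everything is in place.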
The first machine we will spawn is the VCS server. It runs Gitlab, chosen for its seamless integration with the auto-scaling Gitlab CI tool that will maintain the worker cluster.
First, let’s create the machine:
cd gitlab-server
./create-gitlab-server.sh Standard_D2_v2
The argument to this script is the size of the virtual machine you would like to create. The VCS server handles a lot of I/O load, so we recommend using a size that does not place a strong limit on IOPS. Remember, though, that a larger machine will cost more, as the VCS will always be running. We found a reasonable tradeoff with the D2_v2 size, but you may also want to try out other sizes.
Running this command will execute docker-machine with the correct arguments to create the virtual machine. This process will take a bit of time, so be patient. You will see the relevant resources created on the supplied Azure subscription, a virtual machine created, and Docker installed on that virtual machine.
Once the VM is created and the script returns, we can then start configuring the VM. The first thing you will likely want to do is provide a hostname to your virtual machine. For now, we will just use the free Azure subdomain system, but you can also configure your own domain name (this is outside of the scope of this guide, however).
Log in to the Azure portal with your username/password that is associated with the subscription ID used to create the VM earlier. You should be able to see it on the dashboard by clicking “All Resources” on the sidebar, and then look for the virtual machine resource called “clads-vcs”. If you click on it, a new blade should open. On the top, in the second column should be a heading called “DNS name” with a “Configure” link below. Click that link, and set a DNS name for your instance, which will look something like “xxxxxxx.azureregionhere.cloudapp.azure.com”.
Let’s configure the hostname. Copy the example file:
cp environment.sh.example environment.sh
Now, edit environment.sh to have the correct hostname.
Next, let’s copy over our scripts to the VCS VM:
docker-machine scp environment.sh clads-vcs:~
docker-machine scp start-gitlab.sh clads-vcs:~
docker-machine scp reconfigure-gitlab.sh clads-vcs:~
Now we are going to install Gitlab into a docker container on that machine. Get a shell with:
docker-machine ssh clads-vcs
Now you have a shell on the remote VCS server. Let’s start a Gitlab instance now:
# start the Gitlab container (run these commands *on the remote server*)
source environment.sh
./start-gitlab.sh
This will start the installation process for Gitlab in the newly created container in the background, which will take some time. Eventually, you should be able to open your web browser and point it at the DNS name you chose earlier and you should see a page that prompts you to reset your admin password. Don’t do that yet; we want to configure Gitlab first.
On the remote machine, you can edit Gitlab’s configuration file like so:
# on the remote server
sudo vim /srv/gitlab/config/gitlab.rb
There are two things you will want to configure here. First, you will likely want to set up SMTP to allow your Gitlab instance to send email. You can follow the guidelines here to set that up.
Next, you will want to configure Gitlab to use HTTPS. We recommend following the “Free and automated HTTPS with Let’s Encrypt” guide there; specifically, you’ll want to add the following lines:
external_url "https://YOURHOSTNAME.azureregion.cloudapp.azure.com"
letsencrypt['enable'] = true
letsencrypt['contact_emails'] = ['foo@email.com'] # Optional
Finally, add the following line to the configuration file as well (as the default SSH port will already be taken):
gitlab_rails['gitlab_shell_ssh_port'] = 222
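Before reconfiguring, a quick check that the three settings made it into the file can catch a line that was left commented out. This is only a sketch: check_gitlab_rb is a hypothetical helper, and it expects the URL to be written as a quoted string, since gitlab.rb is a Ruby file.

```shell
# Hypothetical helper (not part of the CLaDS repository).
# Checks that a gitlab.rb file has uncommented lines for the settings above.
# Usage: check_gitlab_rb /srv/gitlab/config/gitlab.rb
check_gitlab_rb() {
    conf="$1"
    status=0
    # external_url must be a quoted https:// URL.
    grep -Eq "^[[:space:]]*external_url[[:space:]]+[\"']https://" "$conf" \
        || { echo "external_url is not a quoted https:// URL" >&2; status=1; }
    # Let's Encrypt must be switched on.
    grep -Eq "^[[:space:]]*letsencrypt\['enable'\][[:space:]]*=[[:space:]]*true" "$conf" \
        || { echo "letsencrypt is not enabled" >&2; status=1; }
    # The Gitlab SSH port must be moved off the default.
    grep -Eq "^[[:space:]]*gitlab_rails\['gitlab_shell_ssh_port'\][[:space:]]*=[[:space:]]*222" "$conf" \
        || { echo "gitlab_shell_ssh_port is not 222" >&2; status=1; }
    return "$status"
}
```

On the remote server the file is owned by root, so the helper would likely need to run under sudo there.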
Once you’ve made those changes to the configuration file, you can reconfigure Gitlab:
# on the remote server
./reconfigure-gitlab.sh
Once this command is complete, you should be all set! You should return to your browser and refresh the page at your configured DNS name; you should now be using https:// in the URL, at which point you can go ahead and set your initial admin password. Note that the administrator user is root.
Next, let’s configure the storage for our datasets and upload any files we need to the cloud. We’ll start by creating a storage account on the Azure portal. Click on “Storage Accounts” on the side panel. Next, click the “Add” button. You can name this whatever you’d like, but do make sure that it is in the same region as your other servers. We recommend using only “Locally-redundant storage” for the datasets, as our resilience requirements are not high in this scenario and this is the lowest cost option. For the resource group, you should be able to use the pre-created “CLaDS” resource group that was made when we created the VCS server. When you’re done, click the “Create” button.
Once the storage account is created, you should get a notification in the portal. Click the button to go to the resource, and on this blade you should see a section called “Files”. Click this, and then click the “+ File share” button in the top left. You can name it whatever you want, and set a Quota if you’d like.
Once the share is created, you should be able to click on it in the blade and then add new directories and upload whatever files you would like to through the Portal UI. Once you’ve got whatever datasets you would like to start with uploaded, click the “Connect” button at the top, and navigate down to the “Connecting from Linux” heading. Copy the command there to your clipboard and save it somewhere.
Now, let’s install Gitlab CI. We’re going to follow the recommended setup and use three servers here: the main Gitlab CI bastion server, a cache server, and a Docker registry mirror cache server.
We’ll start by creating the last two servers first.
You can create the cache server as follows (starting from the repository root folder):
cd azure
cd gitlab-ci
./create-cache-server.sh Standard_A1_v2
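Among other things, this will start a minio S3 server on the new VM. Make a note of the access key and secret key from this setup, as they are needed later in config.toml. The private IP for the cache server should be configured to be 10.0.0.5, and the server will be listening on port 9005.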
We can now create the Docker registry mirror cache server:
./create-docker-registry.sh Standard_A1_v2
This creates the VM that hosts the Docker registry mirror cache. The private IP of the Docker registry mirror cache should be configured to be 10.0.0.6, and the mirror will be listening on port 6000.
Now that we have the two cache servers running, we are ready to set up the main Gitlab bastion server:
./create-ci-bastion.sh Standard_A1_v2
Among other things, this will start the runner registration process.
The registration process will ask for several pieces of information, including the registration token shown on the /admin/runners page of your Gitlab instance and a default Docker image (e.g. python:3). Note that the default image can be changed in a project-specific configuration file, so it’s not necessarily important to be comprehensive with the default choice.
Once this is done, the script should print out your runner’s token. Save this. Now, in the azure/gitlab-ci folder of the repository you should find a config.toml.example configuration file. Let’s copy it and make our modifications:
cp config.toml.example config.toml
vim config.toml
We will want to change the following things:
concurrent = XX: set this to the number of concurrent build jobs you want the auto-scaling build cluster to accept. Any jobs beyond this number will be shown as “Pending” in the Gitlab UI. This helps serve as an upper bound on the number of created worker VMs.
name = "XXXXX" in the [[runners]] block: the name your runner will use in the Gitlab interface.
url = "XXXXX" in the [[runners]] block: the URL (including the https:// prefix) to use to communicate with the VCS server.
token = "XXXXX": the token to use for communication with the VCS server; set this to the token we obtained earlier.
limit = XX: set this to the number of concurrent VMs you wish to allow, including any idle machines.
image = "XXXXX": set this to the desired Docker image for your assignments (keep in mind that students can install software during the build job as well, so this need not be comprehensive, but pre-installing things into a Docker image for them can make things faster).
AccessKey = "XXXXX": the access key you copied from the cache server setup.
SecretKey = "XXXXX": the secret key you copied from the cache server setup.
IdleCount = XX: the number of build machines you want to be always running, waiting for build jobs. This is an optimization to eliminate the delay associated with creating new build VMs for the first few jobs that arrive. This can be set to 0, which will result in zero build machines running if there are no pending build jobs.
IdleTime = XXXX: the amount of time (in seconds) to allow a machine to sit idle before it is decommissioned. Setting this higher increases the chance that new jobs will use a pre-existing machine, but setting it too high may result in a large number of machines sitting idle for longer, increasing cost.
MachineOptions = []: an array of options to pass to the docker-machine commands issued by the runner itself. The settings here should be more or less obvious and correspond to the subscription id, client id and secret, Azure location, and Azure VM size.
Once you’ve got that set up how you’d like, there is one last configuration file to edit: cloud-init.txt. This file helps configure the new worker VMs as they are spawned, and we need to edit this file to properly mount our shared dataset folder. You will need to replace the STORAGEACCOUNTNAME and password=XXXXXX parts of this file with the correct values for the file share you created earlier.
cp cloud-init.txt.example cloud-init.txt
vim cloud-init.txt
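Since both files start out full of XX-style placeholders, a simple scan can show whether any were missed. The sketch below is an assumption-laden convenience, not part of the CLaDS scripts: find_placeholders is a hypothetical name, and it treats any run of two or more capital X characters as a leftover placeholder, which could in principle also flag a real key containing such a run.

```shell
# Hypothetical helper (not part of the CLaDS repository).
# Prints file name, line number, and text of every line that still contains
# a run of two or more capital X characters or the STORAGEACCOUNTNAME marker.
# Returns 1 if any such line is found, 0 otherwise.
find_placeholders() {
    status=0
    for file in "$@"; do
        # Passing /dev/null as a second file makes grep prefix matches
        # with the file name.
        if grep -nE 'X{2,}|STORAGEACCOUNTNAME' "$file" /dev/null; then
            status=1
        fi
    done
    return "$status"
}
```

For example, find_placeholders config.toml cloud-init.txt should print nothing once both files are fully edited.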
Once these files are properly configured, you will want to upload them to the remote server:
./reconfigure-ci.sh
This should update the configuration on the remote and the runner should automatically pick up the new changes in the configuration file.
At this point, you are all set! You can test things out by creating a new project with a .gitlab-ci.yml file; refer to the Gitlab documentation for how to set this up.
On the build workers, the file share you created earlier will be mounted read-only as /data, so the build workers can use any files you have uploaded to the file share in their build scripts.
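As a rough starting point for such a test project, the sketch below writes a minimal .gitlab-ci.yml whose single job lists the mounted share. The helper name write_ci_file, the job name list-data, and the choice of the python:3 image are all illustrative assumptions; the Gitlab documentation remains the reference for the file format.

```shell
# Hypothetical helper (not part of the CLaDS repository).
# Writes a minimal .gitlab-ci.yml into the given project directory.
# The job only lists /data and prints the Python version, which is enough
# to confirm that a worker picked up the job and can see the share.
write_ci_file() {
    project_dir="$1"
    cat > "$project_dir/.gitlab-ci.yml" <<'EOF'
image: python:3

list-data:
  script:
    - ls /data
    - python --version
EOF
}
```

After running write_ci_file in a clone of the test project, committing and pushing the file should queue a job on the runner.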
CLaDS currently comes with two leaderboards: one for hosting a search competition, and one for hosting a (multiclass) classification competition. You can find the associated leaderboard server code, instructions on deploying a leaderboard, and the associated student skeleton code in the leaderboards folder.
I get the following error when reconfiguring Gitlab:
Running handlers:
There was an error running gitlab-ctl reconfigure:
letsencrypt_certificate[clads-vcs.westcentralus.cloudapp.azure.com] (letsencrypt::http_authorization line 3) had an error: RuntimeError: acme_certificate[staging] (/opt/gitlab/embedded/cookbooks/cache/cookbooks/letsencrypt/resources/certificate.rb line 20) had an error: RuntimeError: ruby_block[create certificate for clads-vcs.westcentralus.cloudapp.azure.com] (/opt/gitlab/embedded/cookbooks/cache/cookbooks/acme/providers/certificate.rb line 100) had an error: RuntimeError: [clads-vcs.westcentralus.cloudapp.azure.com] Certificate request failed: Error creating new cert
This can happen if the hostname of the machine matches the prefix of the DNS name for the LetsEncrypt setup. Fortunately, resolving this is simple. On the remote machine, edit the /etc/hosts file. Near the bottom, you should see a line like:
127.0.0.1 clads-vcs
Simply comment out this line by placing a # in front, and then re-run ./reconfigure-gitlab.sh and the process should complete normally.
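If you would rather not edit the file by hand, the same change can be scripted. The following is a sketch only: comment_out_host is a hypothetical helper that comments out every uncommented line in which the given hostname appears as a whole field, and it keeps a .bak copy of the original. Editing /etc/hosts itself requires root.

```shell
# Hypothetical helper (not part of the CLaDS repository).
# Usage: comment_out_host HOSTS_FILE HOSTNAME
# Prefixes with "#" every uncommented line that has HOSTNAME as a whole
# whitespace-separated field. The original is saved as HOSTS_FILE.bak.
comment_out_host() {
    hosts_file="$1"
    host="$2"
    cp "$hosts_file" "$hosts_file.bak"
    awk -v h="$host" '
        /^[ \t]*#/ { print; next }      # already a comment: keep as-is
        {
            found = 0
            for (i = 1; i <= NF; i++) if ($i == h) found = 1
            if (found) print "#" $0; else print
        }
    ' "$hosts_file.bak" > "$hosts_file"
}
```

On the remote machine this might be invoked from a root shell as comment_out_host /etc/hosts clads-vcs, followed by ./reconfigure-gitlab.sh.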
I get the following error when creating one of the VMs with docker-machine:
Error creating machine: Error running provisioning: Error running "sudo apt-get update": ssh command error:
...
E: can not open /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_xenial_InRelease - fopen (2: No such file or directory)
A simple workaround is to just retry:
docker-machine provision MACHINE-NAME-HERE
If this succeeds, look at the script you are running and just manually run the commands that followed the docker-machine create call.
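If the error persists for more than one attempt, the retry itself can be automated. The sketch below is a generic retry wrapper, not something the CLaDS scripts provide; the function name retry and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper (not part of the CLaDS repository).
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most MAX_ATTEMPTS times, sleeping
# DELAY_SECONDS between failed attempts. Returns 0 on success, 1 otherwise.
retry() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}
```

For example, retry 3 30 docker-machine provision MACHINE-NAME-HERE would try provisioning up to three times with a 30 second pause between attempts.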