Training NER models on multiple GPUs (not just one) #8093

Open
Julia-Penfield opened this issue May 14, 2021 · 14 comments
Labels
feat / ner (Feature: Named Entity Recognizer), scaling (Scaling, serving and parallelizing spaCy), training (Training and updating models)

Comments

@Julia-Penfield

Julia-Penfield commented May 14, 2021

Hello,
I am training my NER model using the following code:

Start of Code

import random
import subprocess
import sys

import spacy
from spacy.training import Example
from spacy.util import minibatch
from thinc.api import compounding
from pynvml import nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex, nvmlDeviceGetUtilizationRates

def load_word_vectors(model_name, word_vectors):
    # v2-style CLI call; in v3 the equivalent command is `spacy init vectors`
    subprocess.run([sys.executable,
                    "-m",
                    "spacy",
                    "init-model",
                    "en",
                    model_name,
                    "--vectors-loc",
                    word_vectors
                    ])

def train_spacy(nlp, training_data, iterations):

    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe("ner", last = True)

    training_examples = []
    faulty_dataset = []
    
    for text, annotations in training_data:
        doc = nlp.make_doc(text)
        try:
            training_examples.append(Example.from_dict(doc, annotations)) # creating Example objects for training, as per spaCy v3
        except Exception:
            faulty_dataset.append([doc, annotations])
        for ent in annotations['entities']:
            ner.add_label(ent[2])

    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()  # deprecated alias of nlp.initialize() in v3, but still works

        for itn in range(iterations):

            print('Starting iteration: ' + str(itn))
            random.shuffle(training_examples)
            losses = {}
            batches = minibatch(training_examples, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                nlp.update(
                            batch,
                            drop = 0.2,
                            sgd = optimizer,
                            losses = losses
                            )
            print(losses)

            for i in range(nvmlDeviceGetCount()): # to see how many GPUs are being used
                handle = nvmlDeviceGetHandleByIndex(i)
                util = nvmlDeviceGetUtilizationRates(handle)
                print(util.gpu)

    return nlp, faulty_dataset, training_examples

nvmlInit() # initialize NVML so GPU utilization can be queried during training
spacy.require_gpu() #this returns "True"
nlp = spacy.blank('en')
word_vectors = 'w2v_model.txt'
model_name = "nlp"
load_word_vectors(model_name, word_vectors) #I have some trained word vectors that I try to load here.

test = train_spacy(nlp, training_data, 30) #training for 30 iterations

End of Code

The problem:

The issue is that each iteration takes about 30 minutes - I have 8,000 training records, which include very long texts, and 6 labels.

So I was hoping to reduce the time by using more GPUs, but it seems that only one GPU is being used - when I execute print(util.gpu) in the code above, only the first GPU returns a non-zero value.

Question 1: Is there any way I could use more GPUs in the training process to make it faster? I would appreciate any leads.

Edit: After some more research, it seems that spacy-ray is intended to enable parallel training. But I cannot find documentation on using Ray with nlp.update; all I can find is about using "python -m spacy ray train config.cfg --n-workers 2".
Question 2: Does Ray enable parallel processing using GPUs, or is it only for CPU cores?
Question 3: How could I integrate Ray into the Python code I have using nlp.update, as opposed to using "python -m spacy ray train config.cfg --n-workers 2"?

Thank you!

Environment

All of the code above is in one conda_python3 notebook on AWS SageMaker, using an ml.p3.2xlarge EC2 instance.
Python Version Used: 3
spaCy Version Used: 3.0.6

@svlandeg added the feat / ner, scaling and training labels on May 14, 2021
@Julia-Penfield changed the title from "Training NER models on multiple GPU (not just one)" to "Training NER models on multiple GPUs (not just one)" on May 14, 2021
@adrianeboyd
Contributor

The basic spacy train training loop only supports one GPU.

I think in theory you would want to configure ray workers so that each was associated with one particular GPU, but I haven't tried this in practice and I'm not sure how difficult it would be.

Taking a quick look at spacy-ray, it looks like there might be a bug in how it sets the GPU ID from CUDA_VISIBLE_DEVICES, but it looks like most of the setup is there for this to work if the workers are configured correctly. But maybe I'm completely misunderstanding how ray manages this in the first place. See this comment:

https://github.com/explosion/spacy-ray/blob/master/spacy_ray/worker.py#L300-L308

The main thing I don't understand is why this bit calls require_gpu(0) rather than require_gpu(gpu_id) after referencing CUDA_VISIBLE_DEVICES:

https://github.com/explosion/spacy-ray/blob/75cdb637411529f7f0a41c723a8ab71cbae9cc79/spacy_ray/worker.py#L251-L259
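In other words, a hypothetical sketch (mine, not the actual spacy-ray code) of the alternative I'd expect there, passing the parsed id along instead of hard-coding 0:

import os

import spacy

# Hypothetical illustration only: activate whichever GPU id the environment
# exposes, rather than always calling require_gpu(0). Assumes a single id in
# CUDA_VISIBLE_DEVICES, as in the linked worker code.
gpu_id = int(os.environ.get("CUDA_VISIBLE_DEVICES", -1))
if gpu_id >= 0:
    spacy.require_gpu(gpu_id)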

I think it would be difficult to use it with a script based around nlp.update vs. using spacy ray train. spacy-ray is still under development and has mostly been tested on CPU, so it's possible it would just require a few small patches/PRs for the GPU support to be improved enough to work in this scenario.

Let us know how it works for you if you try it out!

@Julia-Penfield
Author

Julia-Penfield commented May 17, 2021

Thank you for your reply. I think it is time for me to move on to spaCy v3, so I converted my code and data to use spaCy ray train. I am providing some details below before showing you the error I get.

Code to convert training data to the v3 format - the data is in the v2 format: [(text, {'entities': [(start, end, label), ...]}), ...]

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank('en')
def make_v3_training_data(data):
    failed_record = []
    db = DocBin()
    for text, annot in tqdm(data):
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot['entities']:
            span = doc.char_span(start, end, label = label, alignment_mode = 'contract')
            if span is None:
                print('empty entity') #I expect this to never happen
            else:
                ents.append(span)
        try:
            doc.ents = ents
        except Exception:
            failed_record.append((text, annot))
        db.add(doc)
    return db, failed_record

End of training data code. The output is saved to train.spacy and val.spacy for training (a sketch of that step is below). So far so good.
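For completeness, here is a minimal sketch of that save step, assuming training_data and validation_data hold the v2-style tuples (those variable names are placeholders):

# DocBin.to_disk writes the binary .spacy files that the config's [paths] point to.
train_db, train_failed = make_v3_training_data(training_data)
train_db.to_disk("./train.spacy")

val_db, val_failed = make_v3_training_data(validation_data)
val_db.to_disk("./val.spacy")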

I downloaded the base config file from the spaCy guidelines and edited the first two settings under [paths] (the train and dev paths):

# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = "train.spacy"
dev = "val.spacy"

[system]
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[initialize]
vectors = null

Then I used: !python -m spacy init fill-config base_config.cfg config.cfg
It returned:
✔ Auto-filled config with all values
✔ Saved config
config.cfg

Working toward using ray, I first executed the following line, because I thought I'd try it without the GPU first:
!python -m spacy train config.cfg --output ./output

This worked fine and training started. Afterwards, to push the envelope a little, I tried:
!python -m spacy train config.cfg --output ./output --gpu-id 0

This worked fine too, using the GPU - just one GPU, I presume. Eventually it failed due to memory:
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 2,466,849,280 bytes (allocated so far: 10,176,587,264 bytes).

Then I tried the following to use ray:
!python -m spacy ray train config.cfg --n-workers 2 --output ./output

However, this time I got:
TypeError: create_train_batches() missing 1 required positional argument: 'max_epochs'

At this point, I have a few questions that I am researching to find answers for.

1- Does max_epochs belong in the config file?

2- I share your confusion about the spacy-ray code, since it uses 0 no matter what the gpu ID is. Along those lines, once the max_epochs issue is resolved, I wonder whether I should use the command below for GPU:
!python -m spacy ray train config.cfg --n-workers 2 --output ./output #--gpu-id 0

In the command above, the issue is that I am not sure how to tell ray which GPU IDs to use. Does it take a list of IDs? Any ideas?

3- As a note unrelated to GPUs, I already have a trained word2vec model. In v2, I used the following to load it:

def load_word_vectors(model_name, word_vectors):
    subprocess.run([sys.executable,
                    "-m",
                    "spacy",
                    "init-model",
                    "en",
                    model_name,
                    "--vectors-loc",
                    word_vectors
                    ])

Do you know how to include it in the v3 config file so that training does not spend time learning the tok2vec embedding matrix from scratch?

@adrianeboyd
Contributor

If you have created a model with vectors using spacy init vectors (the v3 CLI command for this), you then specify it under [initialize.vectors] and set include_static_vectors = true for the relevant components.

@Julia-Penfield
Author

I created my word2vec model using gensim.

Here is the code:

!pip install --upgrade gensim
from gensim.models.phrases import Phrases, Phraser

#Phrases() takes a list of list of words as input. "txt" is my corpus of text.

sent = [text.split() for text in txt]

#Creates the relevant phrases from the list of sentences:
phrases = Phrases(sent, min_count=30, progress_per=10000)

#The goal of Phraser() is to cut down memory consumption of Phrases(), by discarding model state not strictly needed for the bigram detection task:
bigram = Phraser(phrases)

#Transform the corpus based on the bigrams detected:
sentences = bigram[sent]

import multiprocessing
from gensim.models import Word2Vec

cores = multiprocessing.cpu_count() # number of CPU cores available for the Word2Vec workers
w2v_model = Word2Vec(min_count=20,
                 window=2,
                 vector_size=300,
                 sample=6e-5,
                 alpha=0.03,
                 min_alpha=0.0007,
                 negative=20,
                 workers=cores-1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30)
w2v_model.save('w2v_model.model')
w2v_model.wv.save_word2vec_format('w2v_model.txt')

In v2, I used the code at the end of my last post to load the word2vec model. I am experimenting to see if I can do something similar in v3.

@adrianeboyd
Contributor

Yes, just run spacy init vectors. The options are in a slightly different format than v2 spacy init-model, but it's very similar.

Since spacy actually doesn't include any code for training static word vectors, the only way to include them is to use the initialize.vectors setting. But to have them used as features in the pipeline components, you also need to set include_static_vectors = true in the relevant places related to the HashEmbed config sections.

@Julia-Penfield
Author

Julia-Penfield commented May 18, 2021

Thanks for that! I really appreciate your support.

I read the guidelines and successfully converted the word2vec txt model to the v3 format, which is saved in a folder called "vocab", using:

!python -m spacy init vectors en w2v_model.txt ./

I also found the include_static_vectors field in the base_config.cfg file and changed it to True:

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = true

Interestingly, in the last line, "True" returned an error and "true" worked.

How could I use initialize.vectors? Should I change the base_config.cfg file as follows?

[initialize]
vectors = "./vocab"

It seems to be working, and the NER loss is lower than it used to be. Is there any way/test to ensure that my word vectors are being used as the initial values?
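One quick sanity check I am considering (just a sketch, assuming the trained pipeline is the one saved to ./output/model-last) is to load the pipeline and inspect its vectors table:

import spacy

# If the static vectors were picked up, the table shape should match the gensim model.
nlp = spacy.load("./output/model-last")
print(nlp.vocab.vectors.shape)          # e.g. (n_words, 300)
print(nlp.vocab["patient"].has_vector)  # "patient" is just an example token; use any in-vocabulary word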

Going forward, I will shift my focus to ray. In the first phase, I will try multi-CPU training using the following command. Once I get this to work, I will try multi-GPU:

!python -m spacy ray train config.cfg --n-workers 2 --output ./output

At this time, the problem is that I am getting this error:

ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
2021-05-18 18:29:29,031 INFO resource_spec.py:231 -- Starting Ray with 35.16 GiB memory available for workers and up to 17.6 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2021-05-18 18:29:30,175 INFO services.py:1193 -- View the Ray dashboard at localhost:8265
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/main.py", line 4, in
setup_cli()
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/cli/_util.py", line 69, in setup_cli
command(prog_name=COMMAND)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy_ray/train_cli.py", line 52, in ray_train_cli
code_path=code_path,
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy_ray/train_cli.py", line 87, in ray_train
ray.get(worker.train.remote(workers, evaluator))
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/worker.py", line 1538, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::Worker.train() (pid=5175, ip=172.16.14.165)
File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
TypeError: create_train_batches() missing 1 required positional argument: 'max_epochs'

I posted it as a separate issue. Once I get it resolved, I will proceed with a test for multiple GPUs.

@thejamesmarq

I'm having the same issue regarding the error
TypeError: create_train_batches() missing 1 required positional argument: 'max_epochs'

I've looked at nvidia-smi after using spacy-ray with --n-workers 4, and I see memory usage spiking on all of my GPUs, so it seems like something is happening on all of them; however, after a while the training fails with that error message.

@Julia-Penfield
Author

Julia-Penfield commented May 19, 2021

@thejamesmarq I am happy that it was not just me. Also, glad that someone else is working towards multiple GPU training.

@adrianeboyd fixed the "max_epochs" issue about 10 hours ago (see #8137) and released a new version of spacy-ray. I reinstalled and the max_epochs problem is gone. If they had a Patreon, I would not hesitate for a second to contribute!

I have not tried the multiple-GPU solution yet, but the multi-CPU solution using ray seems to be working! I cannot say whether it increased the training speed though - I need to dig in further. @thejamesmarq Could I ask you to give it a shot and let me know if spacy-ray v0.1.2 works for you - with and without GPUs? I am a little surprised that all 4 of your GPUs were busy, as the workers are for CPUs as far as I understand, not GPUs. Did you use the following command?

!python -m spacy ray train config.cfg --n-workers 2 --output ./output --gpu-id 0

@thejamesmarq

Tried this out on a machine with 4 GPUs, using --n-workers 4, and while the job does run now, it looks like only one GPU is being used, although four processes are being created. Memory on each GPU is occupied, but utilization was at 0% for all except one GPU.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   47C    P0    56W / 300W |   1757MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   45C    P0    56W / 300W |   1774MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   44C    P0    61W / 300W |   1782MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   48C    P0    59W / 300W |   1818MiB / 16160MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     82467      C   ray::Worker                      1755MiB |
|    1   N/A  N/A     82460      C   ray::Worker                      1771MiB |
|    2   N/A  N/A     82464      C   ray::Worker                      1779MiB |
|    3   N/A  N/A     82469      C   ray::Worker                      1815MiB |
+-----------------------------------------------------------------------------+

I'm wondering if that is at all related to how ray_train in https://github.com/explosion/spacy-ray/blob/master/spacy_ray/train_cli.py uses ray.remote:

RemoteWorker = ray.remote(Worker).options(num_gpus=int(use_gpu >= 0), num_cpus=2)

I believe this will use at most 1 GPU, although I could be wrong.

Also, I noticed that the Worker in https://github.com/explosion/spacy-ray/blob/master/spacy_ray/worker.py uses gpu_id = int(os.environ.get("CUDA_VISIBLE_DEVICES", -1)). Does this mean that if we don't have that variable set, it will default to using the CPU (with the default set at -1)? I'm wondering if we might instead specify the GPU id directly, e.g. as a CLI option.
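For what it's worth, here's a hypothetical sketch (not the spacy-ray implementation) of how each Ray actor could be pinned to its own GPU, by requesting one GPU per actor via num_gpus and reading the id Ray assigns inside the actor:

import ray

ray.init()

@ray.remote(num_gpus=1)  # Ray reserves one GPU per actor and narrows CUDA_VISIBLE_DEVICES for it
class TrainingWorker:  # hypothetical class, for illustration only
    def assigned_gpu(self):
        # Returns something like [2]: the GPU(s) Ray reserved for this actor.
        return ray.get_gpu_ids()

workers = [TrainingWorker.remote() for _ in range(4)]
print(ray.get([w.assigned_gpu.remote() for w in workers]))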

@Julia-Penfield
Author

Julia-Penfield commented May 20, 2021

Gotcha!

I think we could specify the gpu ID using the CLI option like you mentioned:

!python -m spacy ray train config.cfg --n-workers 4 --output ./output --gpu-id 0

However, there are two issues:

  1. Firstly, it seems that only one gpu ID can be provided.
  2. As @adrianeboyd mentioned before, regardless of which gpu ID is used in the CLI command, only ID=0 is used, as per:
    a) https://github.com/explosion/spacy-ray/blob/master/spacy_ray/worker.py#L300-L308
    b) https://github.com/explosion/spacy-ray/blob/75cdb637411529f7f0a41c723a8ab71cbae9cc79/spacy_ray/worker.py#L251-L259

Could you try one more time using the CLI command and see if changing the gpu-id makes a difference in terms of which GPU is used? I will do the same on my end.

@adrianeboyd
Contributor

(No need for patreon: this is my job!)

For local testing I only have one GPU, so I may not be much immediate help. The spacy train CLI doesn't have a way to specify multiple GPUs, so if this is going to work, I think it's most likely that you'd use -g 0 to enable GPU in the CLI in general and then check how ray sets CUDA_VISIBLE_DEVICES for the workers. If ray manages the GPU IDs that way, then you'd just need to update the function that sets the GPU ID with spacy.require_gpu to use the one provided by ray.

@Julia-Penfield
Author

Julia-Penfield commented May 21, 2021

Thanks @adrianeboyd. I actually did not know it was your job. You're great at it!! :)

I have two questions and four reports for you. I really hope those reports are helpful for the further development of spaCy v3, given how amazing it is!

The questions:

Q1) What am I doing wrong in using -g 0? Below is my CLI command:
!python -m spacy ray train -g 0 config.cfg --n-workers 8 --output ./output

The error is:
✘ Invalid config override '0': name should start with --

Also, isn't -g 0 a spaCy v2 notion? The spaCy 3 guidelines say to use --gpu-id.

Q2) Isn't the 0 in "-g 0" referring to the GPU ID? How is that different from using --gpu-id as in:
!python -m spacy ray train config.cfg --n-workers 8 --output ./output --gpu-id 0

Now onto the reports:

I tried 4 main experiments. All experiments were executed on an ml.p3.8xlarge AWS EC2 instance with:
4 GPUs, 32 vCPUs, 244 GiB memory, and 64 GiB total GPU memory.

1) In the first experiment, I did not use ray, nor did I use the GPU. The command was:

!python -m spacy train config.cfg --output ./output

Here is the output:

Starting time: 18:41:46
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

=========================== Initializing pipeline ===========================
[2021-05-20 18:41:47,737] [INFO] Set up nlp object from config
[2021-05-20 18:41:47,751] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-05-20 18:41:47,755] [INFO] Created vocabulary
[2021-05-20 18:41:48,137] [INFO] Added vectors: ./vocab
[2021-05-20 18:41:48,137] [INFO] Finished initializing nlp object
[2021-05-20 18:43:51,149] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE


0 0 0.00 848.15 0.01 0.01 0.01 0.00
0 200 3862.07 15785.41 18.75 34.10 12.93 0.19
0 400 517.14 1764.46 36.33 45.66 30.17 0.36
0 600 4547.08 1981.70 36.27 49.80 28.53 0.36
0 800 1288.11 1814.77 38.01 52.13 29.91 0.38
0 1000 479.66 1090.03 45.89 54.09 39.85 0.46
0 1200 645.63 870.98 46.93 55.09 40.87 0.47
0 1400 996.74 841.99 46.35 56.44 39.32 0.46
0 1600 6994.02 976.14 39.57 62.33 28.99 0.40
0 1800 2305.24 1002.07 46.91 55.71 40.51 0.47
0 2000 658.52 625.77 48.02 55.59 42.26 0.48
0 2200 1872.88 782.63 50.34 55.98 45.73 0.50
0 2400 1100.66 634.23 48.51 65.04 38.68 0.49
0 2600 3185.53 855.52 49.80 64.04 40.74 0.50
0 2800 24985.04 943.65 44.04 54.52 36.95 0.44
0 3000 7799.75 996.41 48.51 64.74 38.78 0.49
0 3200 3739.99 662.83 52.15 66.79 42.77 0.52
0 3400 1390.63 598.66 45.86 52.33 40.82 0.46
1 3600 2314.42 533.74 49.05 69.72 37.83 0.49
1 3800 1527.68 454.00 46.90 53.44 41.79 0.47
1 4000 4403.58 561.79 51.23 71.50 39.92 0.51
1 4200 2051.66 559.38 44.76 45.04 44.49 0.45
1 4400 3131.92 651.52 49.34 70.12 38.06 0.49
1 4600 3381.79 602.43 39.74 33.66 48.50 0.40
1 4800 1502.89 601.55 49.72 58.03 43.49 0.50
✔ Saved pipeline to output directory
output/model-last
Ending time: 20:32:09
Total elapsed time: 1.84 hours

Also, here is a snapshot of the CPU and GPU usage. It is as expected - one CPU at 100% and 0% utilization on all 4 GPUs.

CPU:

top - 18:47:27 up 31 min, 0 users, load average: 1.00, 0.94, 0.90
Tasks: 359 total, 2 running, 223 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.0%us, 0.4%sy, 0.0%ni, 95.6%id, 0.5%wa, 0.0%hi, 0.0%si, 1.4%st
Mem: 251745828k total, 10681440k used, 241064388k free, 209764k buffers
Swap: 0k total, 0k used, 0k free, 3111976k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28251 ec2-user 20 0 27.4g 5.6g 198m R 100.5 2.3 5:44.80 python
25540 ec2-user 20 0 616m 50m 13m S 2.0 0.0 0:01.72 python
1 root 20 0 19780 2640 2200 S 0.0 0.0 0:10.63 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
5 root 20 0 0 0 0 I 0.0 0.0 0:00.64 kworker/u256:0
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
7 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
8 root 20 0 0 0 0 I 0.0 0.0 0:00.61 rcu_sched
9 root 20 0 0 0 0 I 0.0 0.0 0:00.00 rcu_bh
10 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/0
11 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
14 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
15 root RT 0 0 0 0 S 0.0 0.0 0:00.36 migration/1

top - 18:47:30 up 31 min, 0 users, load average: 1.00, 0.94, 0.90
Tasks: 359 total, 2 running, 223 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.2%us, 0.0%sy, 0.0%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 251745828k total, 10681596k used, 241064232k free, 209764k buffers

GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   35C    P0    52W / 300W |    312MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   33C    P0    35W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   32C    P0    39W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P0    38W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     28251      C   python                            309MiB |
+-----------------------------------------------------------------------------+

Sorry for the table format. I do not know how to neatly paste it like @thejamesmarq did.

Execution-wise, so far so good.

2) In the second experiment, I still did not use ray, but I did use the GPU, with the following command:

!python -m spacy train config.cfg --output ./output --gpu-id 0

Here is the output:

Starting time: 20:18:14
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2021-05-20 20:18:16,231] [INFO] Set up nlp object from config
[2021-05-20 20:18:16,245] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-05-20 20:18:16,249] [INFO] Created vocabulary
[2021-05-20 20:18:16,719] [INFO] Added vectors: ./vocab
[2021-05-20 20:18:16,720] [INFO] Finished initializing nlp object
[2021-05-20 20:20:40,810] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE


⚠ Aborting and saving the final best model. Encountered exception:
OutOfMemoryError('Out of memory allocating 3,327,510,528 bytes (allocated so
far: 13,736,083,968 bytes).',)
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/main.py", line 4, in
setup_cli()
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/cli/_util.py", line 69, in setup_cli
command(prog_name=COMMAND)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/cli/train.py", line 59, in train_cli
train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/training/loop.py", line 115, in train
raise e
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/training/loop.py", line 98, in train
for batch, info, is_best_checkpoint in training_step_iterator:
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/training/loop.py", line 213, in train_while_improving
score, other_scores = evaluate()
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/training/loop.py", line 268, in evaluate
scores = nlp.evaluate(dev_corpus(nlp))
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/language.py", line 1363, in evaluate
examples,
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/util.py", line 1484, in _pipe
yield from proc.pipe(docs, **kwargs)
File "spacy/pipeline/trainable_pipe.pyx", line 79, in pipe
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/util.py", line 1503, in raise_error
raise e
File "spacy/pipeline/trainable_pipe.pyx", line 75, in spacy.pipeline.trainable_pipe.TrainablePipe.pipe
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/pipeline/tok2vec.py", line 121, in predict
tokvecs = self.model.predict(docs)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/model.py", line 312, in predict
return self._func(self, X, is_train=False)[0]
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/model.py", line 288, in call
return self._func(self, X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/layers/with_array.py", line 40, in forward
return _list_forward(cast(Model[List2d, List2d], model), Xseq, is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/layers/with_array.py", line 76, in _list_forward
Yf, get_dXf = layer(Xf, is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/model.py", line 288, in call
return self._func(self, X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/model.py", line 288, in call
return self._func(self, X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/layers/residual.py", line 40, in forward
Y, backprop_layer = model.layers[0](X, is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/model.py", line 288, in call
return self._func(self, X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/model.py", line 288, in call
return self._func(self, X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/model.py", line 288, in call
return self._func(self, X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/model.py", line 288, in call
return self._func(self, X, is_train=is_train)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/layers/maxout.py", line 49, in forward
Y = model.ops.gemm(X, W, trans2=True)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/thinc/backends/cupy_ops.py", line 51, in gemm
return self.xp.dot(x, y)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/cupy/linalg/_product.py", line 34, in dot
return a.dot(b, out)
File "cupy/core/core.pyx", line 1432, in cupy.core.core.ndarray.dot
File "cupy/core/_routines_linalg.pyx", line 427, in cupy.core._routines_linalg.dot
File "cupy/core/_routines_linalg.pyx", line 450, in cupy.core._routines_linalg.tensordot_core
File "cupy/core/core.pyx", line 2392, in cupy.core.core._ndarray_init
File "cupy/core/core.pyx", line 151, in cupy.core.core.ndarray._init_fast
File "cupy/cuda/memory.pyx", line 578, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1250, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1271, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 939, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 959, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 1210, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 3,327,510,528 bytes (allocated so far: 13,736,083,968 bytes).
Ending time: 20:21:41
Total elapsed time: 0.06 hours

3) In the 3rd experiment, I did use ray with 8 CPU workers, but I did not use the GPU, with the following command:

!python -m spacy ray train config.cfg --n-workers 8 --output ./output

Here is the output:
Starting time: 20:56:09
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
2021-05-20 20:56:10,996 INFO resource_spec.py:231 -- Starting Ray with 157.81 GiB memory available for workers and up to 71.63 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2021-05-20 20:56:11,433 INFO services.py:1193 -- View the Ray dashboard at localhost:8265
(pid=50810) E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE
(pid=50810) --- ------ ------------ -------- ------ ------ ------ ------
(pid=50810) 0 0 0.00 594.77 0.00 0.00 0.00 0.00
(pid=50810) 0 200 55405.84 5046.95 20.80 28.77 16.29 0.21
(pid=50810) 0 400 110147.05 4137.74 26.28 26.35 26.21 0.26
(pid=50810) 0 600 269715.63 4061.11 35.32 38.92 32.33 0.35
2021-05-20 21:30:21,276 ERROR worker.py:1074 -- Possible unhandled error from worker: ray::Worker.inc_grad() (pid=50810, ip=172.16.41.157)
File "python/ray/_raylet.pyx", line 440, in ray._raylet.execute_task
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/memory_monitor.py", line 128, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-16-41-157 is used (228.18 / 240.08 GB). The top 10 memory consumers are:

PID MEM COMMAND
50810 31.99GiB ray::Worker
50818 29.35GiB ray::Worker
50800 29.24GiB ray::Worker
50788 29.23GiB ray::Worker
50803 28.99GiB ray::Worker
50791 28.3GiB ray::Worker
50783 26.65GiB ray::Worker
50802 6.46GiB ray::Worker
50718 1.58GiB /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/core/src/ray/thirdparty/redis/
50738 0.43GiB /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/core/src/ray/raylet/raylet r

In addition, up to 11.08 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the object_store_memory parameter when starting Ray.

Tip: Use the ray memory command to list active objects in the cluster.

2021-05-20 21:30:22,277 ERROR worker.py:1074 -- Possible unhandled error from worker: ray::Worker.set_param() (pid=50800, ip=172.16.41.157)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_common.py", line 449, in wrapper
ret = self._cache[fun]
AttributeError: _cache

During handling of the above exception, another exception occurred:

ray::Worker.set_param() (pid=50800, ip=172.16.41.157)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1515, in wrapper
return fun(self, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_common.py", line 452, in wrapper
return fun(self)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1557, in _parse_stat_file
with open_binary("%s/%s/stat" % (self._procfs_path, self.pid)) as f:
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_common.py", line 713, in open_binary
return open(fname, "rb", **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/74685/stat'

During handling of the above exception, another exception occurred:

ray::Worker.set_param() (pid=50800, ip=172.16.41.157)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/init.py", line 371, in _init
self.create_time()
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/init.py", line 727, in create_time
self._create_time = self._proc.create_time()
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1515, in wrapper
return fun(self, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1727, in create_time
ctime = float(self._parse_stat_file()['create_time'])
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1522, in wrapper
raise NoSuchProcess(self.pid, self._name)
psutil.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=74685)

During handling of the above exception, another exception occurred:
....
....
....
Julia: This error was too long to copy entirely
....
....
....
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/main.py", line 4, in
setup_cli()
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy/cli/util.py", line 69, in setup_cli
command(prog_name=COMMAND)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy_ray/train_cli.py", line 52, in ray_train_cli
code_path=code_path,
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy_ray/train_cli.py", line 91, in ray_train
todo = [w for w in workers if ray.get(w.is_running.remote())]
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/spacy_ray/train_cli.py", line 91, in
todo = [w for w in workers if ray.get(w.is_running.remote())]
File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/ray/worker.py", line 1540, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
(pid=50791) F0520 21:31:04.837767 50791 52026 core_worker.cc:182] Check failed: instance
->global_worker
global_worker
must not be NULL
(pid=50791) *** Check failure stack trace: ***
(pid=50791) @ 0x7f03714437dd google::LogMessage::Fail()
(pid=50791) @ 0x7f037144493c google::LogMessage::SendToLog()
(pid=50791) @ 0x7f03714434b9 google::LogMessage::Flush()
(pid=50791) @ 0x7f03714436d1 google::LogMessage::~LogMessage()
(pid=50791) @ 0x7f037142e629 ray::RayLog::~RayLog()
(pid=50791) @ 0x7f03710cac80 ray::CoreWorkerProcess::GetCoreWorker()
(pid=50791) @ 0x7f0370fde6eb __pyx_pw_3ray_7_raylet_10CoreWorker_51remove_actor_handle_reference()
(pid=50791) @ 0x560320fad06a call_function
(pid=50791) @ 0x56032100de84 _PyEval_EvalFrameDefault
(pid=50791) @ 0x560320f55f6a _PyFunction_FastCallDict
(pid=50791) @ 0x560320fbdad3 method_call
(pid=50791) @ 0x560320f6a98b _PyObject_FastCallDict
(pid=50791) @ 0x56032106a1cd slot_tp_finalize
(pid=50791) @ 0x560320f4e1b5 PyObject_CallFinalizerFromDealloc
(pid=50791) @ 0x560320fdf691 subtype_dealloc
(pid=50791) @ 0x560320f3d5b0 cell_dealloc
(pid=50791) @ 0x560320f449c2 tupledealloc
(pid=50791) @ 0x560320fda0c1 func_dealloc
(pid=50791) @ 0x560320f43ed8 frame_dealloc
(pid=50791) @ 0x560320f43b28 tb_dealloc
(pid=50791) @ 0x560320f43b8c tb_dealloc
(pid=50791) @ 0x560320f43b8c tb_dealloc
(pid=50791) @ 0x560320f43b8c tb_dealloc
(pid=50791) @ 0x56032101031f _PyEval_EvalFrameDefault
(pid=50791) @ 0x560320fab2a9 fast_function
(pid=50791) @ 0x560320facf9f call_function
(pid=50791) @ 0x56032100de84 _PyEval_EvalFrameDefault
(pid=50791) @ 0x560320f55f6a _PyFunction_FastCallDict
(pid=50791) @ 0x560320fbdad3 method_call
(pid=50791) @ 0x560320f5bd29 PyEval_CallObjectWithKeywords
(pid=50791) @ 0x560321087263 t_bootstrap
(pid=50791) @ 0x56032101a514 pythread_wrapper
E0520 21:31:05.177330 50665 50665 raylet_client.cc:124] IOError: Broken pipe [RayletClient] Failed to disconnect from raylet.
Ending time: 21:31:08
Total elapsed time: 0.58 hours

Also, here is a snapshot of the CPU and GPU usage before it crashed. It is as expected - several CPUs above 100% usage and 0% utilization on all 4 GPUs.

CPU:

top - 20:26:13 up 24 min, 0 users, load average: 9.45, 5.35, 2.89
Tasks: 398 total, 2 running, 261 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.0%us, 0.8%sy, 0.0%ni, 90.8%id, 0.5%wa, 0.0%hi, 0.0%si, 1.8%st
Mem: 251745828k total, 43391664k used, 208354164k free, 212780k buffers
Swap: 0k total, 0k used, 0k free, 4250572k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24491 ec2-user 20 0 78.3g 4.9g 565m S 202.5 2.1 10:14.52 ray::Worker
24476 ec2-user 20 0 77.5g 4.0g 492m R 169.1 1.7 6:52.37 ray::Worker
24488 ec2-user 20 0 78.6g 5.1g 731m S 147.5 2.1 5:21.48 ray::Worker
24465 ec2-user 20 0 78.6g 5.2g 797m S 137.6 2.2 5:20.63 ray::Worker
24485 ec2-user 20 0 78.6g 5.4g 985m S 135.7 2.2 5:22.22 ray::Worker
24458 ec2-user 20 0 78.4g 5.0g 798m S 121.9 2.1 5:30.25 ray::Worker
24460 ec2-user 20 0 78.6g 5.2g 730m S 119.9 2.2 5:20.65 ray::Worker
24457 ec2-user 20 0 78.6g 5.1g 752m S 112.1 2.1 5:21.96 ray::Worker
24417 ec2-user 20 0 145g 113m 32m S 25.6 0.0 0:32.22 raylet
24390 ec2-user 20 0 838m 42m 9960 S 11.8 0.0 0:28.25 gcs_server
24386 ec2-user 20 0 354m 132m 7652 S 3.9 0.1 0:03.09 redis-server
24477 ec2-user 20 0 72.9g 61m 28m S 3.9 0.0 0:05.03 ray::IDLE
8958 root 20 0 6524 96 0 S 2.0 0.0 0:03.62 rngd
24243 ec2-user 20 0 5891m 172m 86m S 2.0 0.1 0:15.80 python
24330 ec2-user 20 0 94.6g 183m 100m S 2.0 0.1 0:18.47 python -m spacy
24419 ec2-user 20 0 224m 54m 24m S 2.0 0.0 0:14.73 /home/ec2-user/
24462 ec2-user 20 0 72.9g 61m 28m S 2.0 0.0 0:05.00 ray::IDLE

top - 20:26:16 up 24 min, 0 users, load average: 9.45, 5.35, 2.89
Tasks: 398 total, 4 running, 259 sleeping, 0 stopped, 0 zombie
Cpu(s): 31.2%us, 2.6%sy, 0.0%ni, 65.4%id, 0.0%wa, 0.0%hi, 0.4%si, 0.5%st
Mem: 251745828k total, 43732944k used, 208012884k free, 212780k buffers
Swap: 0k total, 0k used, 0k free, 4266300k cached

GPU:

Thu May 20 20:27:06 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   31C    P0    37W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   31C    P0    35W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   32C    P0    40W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P0    39W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

4) In the 4th experiment, I used both ray and the GPU, with the following command:

!python -m spacy ray train config.cfg --n-workers 8 --output ./output --gpu-id 0

Here is the output:

Starting time: 20:20:36
ℹ Using GPU: 0
2021-05-20 20:20:38,393 INFO resource_spec.py:231 -- Starting Ray with 157.86 GiB memory available for workers and up to 71.67 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2021-05-20 20:20:39,675 INFO services.py:1193 -- View the Ray dashboard at localhost:8265
2021-05-20 20:23:10,240 WARNING worker.py:1134 -- The actor or task with ID ffffffffffffffff7e0a4dfc0100 is pending and cannot currently be scheduled. It requires {CPU: 2.000000}, {GPU: 1.000000} for execution and {CPU: 2.000000}, {GPU: 1.000000} for placement, but this node only has remaining {node:172.16.69.237: 1.000000}, {GPUType:V100: 1.000000}, {CPU: 24.000000}, {memory: 157.861328 GiB}, {object_store_memory: 49.414062 GiB}. In total there are 0 pending tasks and 4 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
2021-05-20 20:23:10,403 INFO (unknown file):0 -- gc.collect() freed 9 refs in 0.06164744400007294 seconds
(pid=21047) 2021-05-20 20:23:10,378 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.031971847000022535 seconds
(pid=21045) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032695064999984425 seconds
(pid=21044) 2021-05-20 20:23:10,374 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03262755499997638 seconds
(pid=21046) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032567863000053876 seconds
(pid=21049) 2021-05-20 20:23:10,374 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.033858239999972284 seconds
(pid=21054) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03255391899995175 seconds
(pid=21052) 2021-05-20 20:23:10,374 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03423778600006244 seconds
(pid=21057) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032688471000028585 seconds
(pid=21061) 2021-05-20 20:23:10,372 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03215635399999428 seconds
(pid=21048) 2021-05-20 20:23:10,381 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.04032361099996251 seconds
(pid=21070) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03237629200009451 seconds
(pid=21051) 2021-05-20 20:23:10,374 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.033061161999967226 seconds
(pid=21068) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032373207999967235 seconds
(pid=21074) 2021-05-20 20:23:10,372 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03210057000001143 seconds
(pid=21050) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032656975999998394 seconds
(pid=21064) 2021-05-20 20:23:10,374 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032910458000060316 seconds
(pid=21056) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032549074999906225 seconds
(pid=21069) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032395918000020174 seconds
(pid=21062) 2021-05-20 20:23:10,379 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.0334426680000206 seconds
(pid=21065) 2021-05-20 20:23:10,374 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03312325100000635 seconds
(pid=21053) 2021-05-20 20:23:10,372 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03251463300000523 seconds
(pid=21067) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03261046800002987 seconds
(pid=21072) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03262552300009247 seconds
(pid=21059) 2021-05-20 20:23:10,372 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03226556799995706 seconds
(pid=21073) 2021-05-20 20:23:10,372 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03212633500004358 seconds
(pid=21066) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032421788999954515 seconds
(pid=21063) 2021-05-20 20:23:10,374 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.033115713000029245 seconds
(pid=21071) 2021-05-20 20:23:10,373 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032647198999939064 seconds
(pid=21076) 2021-05-20 20:23:10,404 INFO (unknown file):0 -- gc.collect() freed 556 refs in 0.0638765250000688 seconds
(pid=21060) 2021-05-20 20:23:10,400 INFO (unknown file):0 -- gc.collect() freed 556 refs in 0.06019154000000526 seconds
(pid=21075) 2021-05-20 20:23:10,400 INFO (unknown file):0 -- gc.collect() freed 556 refs in 0.06001683799991042 seconds
(pid=21078) 2021-05-20 20:23:10,405 INFO (unknown file):0 -- gc.collect() freed 556 refs in 0.06546428700005436 seconds
^C
Ending time: 20:23:32
Total elapsed time: 0.05 hours

Follow-up thoughts and questions:

1- ray does not seem to be working for me. It started and successfully completed a few iterations, as shown in experiment 3, but then it failed. When I use a higher number of vCPUs with ray than in experiment 3, for example 20 workers, it crashes right from the start with the following output.

Starting time: 22:29:04
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
2021-05-20 22:29:06,082 INFO resource_spec.py:231 -- Starting Ray with 157.76 GiB memory available for workers and up to 71.62 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2021-05-20 22:29:06,562 INFO services.py:1193 -- View the Ray dashboard at localhost:8265
2021-05-20 22:31:47,417 WARNING worker.py:1134 -- The actor or task with ID ffffffffffffffff7dec85640100 is pending and cannot currently be scheduled. It requires {CPU: 2.000000} for execution and {CPU: 2.000000} for placement, but this node only has remaining {node:172.16.41.157: 1.000000}, {GPUType:V100: 1.000000}, {memory: 157.763672 GiB}, {GPU: 4.000000}, {object_store_memory: 49.414062 GiB}. In total there are 0 pending tasks and 4 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
2021-05-20 22:31:47,592 INFO (unknown file):0 -- gc.collect() freed 9 refs in 0.07938626100076362 seconds
(pid=94602) 2021-05-20 22:31:47,545 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03231386100014788 seconds
(pid=94591) 2021-05-20 22:31:47,545 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.033138599999801954 seconds
(pid=94596) 2021-05-20 22:31:47,545 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.033120588999736356 seconds
(pid=94592) 2021-05-20 22:31:47,545 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03255322800032445 seconds
(pid=94606) 2021-05-20 22:31:47,544 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.0321920959995623 seconds
(pid=94595) 2021-05-20 22:31:47,545 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03324755700123205 seconds
(pid=94600) 2021-05-20 22:31:47,549 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.0321982380010013 seconds
(pid=94605) 2021-05-20 22:31:47,545 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032477471000674996 seconds
(pid=94593) 2021-05-20 22:31:47,545 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03354283599946939 seconds
(pid=94616) 2021-05-20 22:31:47,544 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.0324636309997004 seconds
(pid=94610) 2021-05-20 22:31:47,545 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03271439499985718 seconds
(pid=94608) 2021-05-20 22:31:47,546 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.033416103999115876 seconds
(pid=94617) 2021-05-20 22:31:47,545 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.0325973620001605 seconds
(pid=94601) 2021-05-20 22:31:47,545 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03246596499957377 seconds
(pid=94598) 2021-05-20 22:31:47,545 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.03236289599954034 seconds
(pid=94615) 2021-05-20 22:31:47,544 INFO (unknown file):0 -- gc.collect() freed 220 refs in 0.032645387000229675 seconds
(pid=94612) 2021-05-20 22:31:47,583 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.05626858300092863 seconds
(pid=94604) 2021-05-20 22:31:47,583 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.05624505599917029 seconds
(pid=94609) 2021-05-20 22:31:47,583 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.07093844699920737 seconds
(pid=94607) 2021-05-20 22:31:47,579 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.06330851600068854 seconds
(pid=94652) 2021-05-20 22:31:47,584 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.06107115099985094 seconds
(pid=94648) 2021-05-20 22:31:47,576 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.064253087999532 seconds
(pid=94619) 2021-05-20 22:31:47,583 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.0717628220008919 seconds
(pid=94594) 2021-05-20 22:31:47,583 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.07135501999982807 seconds
(pid=94597) 2021-05-20 22:31:47,585 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.06944330199985416 seconds
(pid=94614) 2021-05-20 22:31:47,585 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.06994644700171193 seconds
(pid=94620) 2021-05-20 22:31:47,583 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.07105557100112492 seconds
(pid=94599) 2021-05-20 22:31:47,576 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.06384818100013945 seconds
(pid=94613) 2021-05-20 22:31:47,583 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.07101286999932199 seconds
(pid=94621) 2021-05-20 22:31:47,583 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.07165839899971616 seconds
(pid=94618) 2021-05-20 22:31:47,575 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.06377418000010948 seconds
(pid=94603) 2021-05-20 22:31:47,585 INFO (unknown file):0 -- gc.collect() freed 469 refs in 0.0729347680007777 seconds

So, given that the output changes with the number of CPU cores used, I repeated experiment 3 with only 4 cores/workers, and this time it ran fine with no errors! The error above suggests that each Ray actor requests 2 CPUs, so perhaps 20 workers simply ask for more vCPUs than the node can schedule. Below are the results. Is there a limit on the number of vCPUs?

Output of re-execution of experiment 3 (ray without GPU) with only 4 cores:

Starting time: 21:01:57
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
2021-05-20 21:01:59,277 INFO resource_spec.py:231 -- Starting Ray with 157.86 GiB memory available for workers and up to 71.65 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2021-05-20 21:01:59,715 INFO services.py:1193 -- View the Ray dashboard at localhost:8265
(pid=37307) E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE
(pid=37307) --- ------ ------------ -------- ------ ------ ------ ------
(pid=37307) 0 0 0.00 897.92 0.00 0.00 0.00 0.00
(pid=37307) 0 200 32856.77 4700.50 32.12 55.82 22.55 0.32
(pid=37307) 0 400 17082.62 1448.50 38.11 59.42 28.05 0.38
(pid=37307) 0 600 11633.67 1212.16 26.75 24.43 29.55 0.27
(pid=37307) 0 800 7331.37 697.37 34.39 50.26 26.14 0.34
(pid=37307) 0 1000 7129.24 828.39 43.66 43.85 43.47 0.44
(pid=37307) 0 1200 7831.97 607.17 47.32 56.78 40.56 0.47
(pid=37307) 0 1400 5814.65 650.83 49.21 60.28 41.57 0.49
(pid=37307) 0 1600 5162.50 449.55 49.63 60.98 41.84 0.50
(pid=37307) 0 1800 7094.22 489.45 50.20 59.55 43.39 0.50
(pid=37307) 0 2000 3087.17 487.18 50.78 59.44 44.32 0.51
(pid=37307) 0 2200 2951.61 399.42 51.45 66.15 42.10 0.51
(pid=37307) 0 2400 13687.99 829.05 51.54 66.72 41.99 0.52
(pid=37307) 0 2600 7874.76 526.23 50.36 54.80 46.58 0.50
(pid=37307) 0 2800 4799.43 454.21 52.74 67.11 43.44 0.53
(pid=37307) 0 3000 6958.26 546.46 52.50 60.12 46.60 0.53
(pid=37307) 0 3200 4742.86 496.26 53.97 64.81 46.24 0.54
(pid=37307) 0 3400 6886.49 515.26 51.65 60.83 44.88 0.52
(pid=37307) 1 3600 5239.00 512.20 52.56 64.64 44.29 0.53
(pid=37307) 1 3800 4681.29 392.98 50.73 63.73 42.13 0.51
(pid=37307) 1 4000 10974.83 430.59 50.81 53.32 48.53 0.51
(pid=37307) 1 4200 4485.91 405.25 52.86 66.55 43.84 0.53
(pid=37307) 1 4400 7079.71 311.49 52.62 57.45 48.54 0.53
(pid=37307) 1 4600 4363.86 347.24 53.81 63.08 46.91 0.54
(pid=37307) 1 4800 4414.88 343.98 54.54 65.41 46.76 0.55
(pid=37307) 1 5000 3692.08 264.70 51.70 61.36 44.67 0.52
(pid=37307) 1 5200 6353.79 468.12 52.79 57.21 49.00 0.53
(pid=37307) 1 5400 5957.36 485.37 55.74 69.81 46.39 0.56
(pid=37307) 1 5600 5961.01 439.30 52.51 67.50 42.96 0.53
(pid=37307) 1 5800 7334.70 436.68 54.48 63.03 47.97 0.54
(pid=37307) 1 6000 4904.55 415.79 52.00 52.53 51.48 0.52
(pid=37307) 1 6200 12293.48 404.71 54.30 67.49 45.43 0.54
(pid=37307) 1 6400 2937.12 312.57 53.04 55.78 50.55 0.53
(pid=37307) 1 6600 7039.74 385.74 53.42 70.02 43.18 0.53
(pid=37307) 1 6800 6382.27 554.14 53.86 64.27 46.35 0.54
(pid=37307) 1 7000 5487.13 424.08 52.91 58.62 48.21 0.53
Ending time: 00:14:20
Total elapsed time: 3.21 hours

Below is a snapshot of CPU and GPU usage - I am not sure why only 2 CPUs are heavily utilized instead of all 4.

CPU:

top - 22:30:43 up 2:19, 0 users, load average: 2.62, 2.58, 2.74
Tasks: 404 total, 1 running, 262 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.6%us, 1.0%sy, 0.0%ni, 90.7%id, 0.1%wa, 0.0%hi, 0.1%si, 0.4%st
Mem: 251745828k total, 68348844k used, 183396984k free, 1134280k buffers
Swap: 0k total, 0k used, 0k free, 4776260k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
37307 ec2-user 20 0 99.5g 24g 1.3g S 167.2 10.4 149:22.53 ray::Worker
37306 ec2-user 20 0 81.0g 8.0g 1.3g S 133.8 3.3 123:08.39 ray::Worker
37298 ec2-user 20 0 85.7g 12g 1.3g S 17.7 5.4 51:21.52 ray::Worker
37328 ec2-user 20 0 86.2g 13g 1.3g S 13.8 5.5 51:03.83 ray::Worker
37217 ec2-user 20 0 838m 40m 9728 S 9.8 0.0 5:21.60 gcs_server
37233 ec2-user 20 0 145g 222m 117m S 3.9 0.1 5:00.49 raylet
37235 ec2-user 20 0 224m 54m 24m S 3.9 0.0 2:25.46 /home/ec2-user/
9229 ec2-user 20 0 616m 50m 13m S 2.0 0.0 0:00.76 python
37053 ec2-user 20 0 5891m 172m 86m S 2.0 0.1 1:56.94 python
37148 ec2-user 20 0 94.6g 183m 101m S 2.0 0.1 2:19.44 python -m spacy
37205 ec2-user 20 0 177m 11m 7836 S 2.0 0.0 0:47.88 redis-server
37276 ec2-user 20 0 72.9g 61m 28m S 2.0 0.0 0:45.10 ray::IDLE
37277 ec2-user 20 0 72.9g 62m 28m S 2.0 0.0 0:45.12 ray::IDLE
37278 ec2-user 20 0 72.9g 62m 28m S 2.0 0.0 0:45.32 ray::IDLE
37281 ec2-user 20 0 72.9g 62m 28m S 2.0 0.0 0:45.32 ray::IDLE
37285 ec2-user 20 0 72.9g 62m 28m S 2.0 0.0 0:45.29 ray::IDLE
37287 ec2-user 20 0 72.9g 61m 28m S 2.0 0.0 0:45.49 ray::IDLE top - 22:30:46 up 2:19, 0 users, load average: 2.62, 2.58, 2.74
Tasks: 404 total, 1 running, 262 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.9%us, 1.6%sy, 0.0%ni, 90.3%id, 0.0%wa, 0.0%hi, 0.1%si, 0.1%st
Mem: 251745828k total, 69199252k used, 182546576k free, 1134280k buffers

GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03 Driver Version: 450.119.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 30C P0 35W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 31C P0 35W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 31C P0 38W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 31C P0 39W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

2- It looks like there are issues in spaCy v3 with using the GPU, whether or not ray is used. I got two different kinds of errors, one with ray and one without, as described in experiments #3 and #4. I understand the issue may be on my end rather than in spaCy, but I did successfully run training on the GPU with spaCy v2, as described at the beginning of this thread. Any ideas? At this point, I cannot even get one GPU working :)
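
For what it's worth, here is the kind of minimal check I can run to confirm the GPUs are at least visible to spaCy/CuPy before training starts (this assumes a CuPy build matching the CUDA 11.0 driver shown above, and it only checks visibility, not that training actually uses the device):

import cupy
import spacy

# Number of CUDA devices CuPy can see (should be 4 on this machine).
print("CUDA devices visible:", cupy.cuda.runtime.getDeviceCount())

# prefer_gpu() returns True if spaCy managed to activate a GPU, False otherwise.
print("spaCy activated a GPU:", spacy.prefer_gpu())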

3- I re-executed experiment 1 to see if the results are reproducible. I am happy to say that I got exactly the same results, with a slightly different execution time, which is understandable.

4- When I use ray, it appears that the word2vec initialization is ignored. I suspect this because of the tok2vec loss in experiments 1 and 3: when ray is used in experiment 3, the loss starts at a significantly higher value. Does ray use a different initialization mechanism?
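
To make it easier to tell whether the vectors are picked up, this is roughly how I understand the v3 workflow for wiring in a word2vec file (a sketch with placeholder paths, assuming the config's [initialize] block reads vectors = ${paths.vectors}):

import subprocess
import sys

# v3 replacement for init-model: convert the word2vec text file into a spaCy vectors directory.
subprocess.run([sys.executable, "-m", "spacy", "init", "vectors",
                "en", "./word_vectors.txt", "./w2v_vectors"])

# Point training at those vectors explicitly via a config override, so it is easy
# to compare the starting tok2vec loss with and without ray on the same config.
subprocess.run([sys.executable, "-m", "spacy", "train", "config.cfg",
                "--output", "./output",
                "--paths.vectors", "./w2v_vectors"])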

5- When I used 4 cores in the re-execution of experiment 3, the model took a lot longer to train than experiment 1 with a single core - neither experiment used the GPU, since I have not succeeded in using the GPU with spaCy v3 yet. As shown in the results of experiment 1, training with one core took 1.84 hours, whereas with 4 cores it took 3.21 hours. This is a counter-intuitive outcome, even if we only consider the same number of iterations as the single-core run in experiment 1 - I checked on my end, and the same number of iterations took twice as long! This gave me the idea that the issue might be with ray, so I re-executed experiment 3 (ray without GPU) with one core only, and it took a similar time (under 2 hours) to experiment 1, which does not use ray - the accuracies were also different, but I understand that part.

In conclusion, something is fishy when using ray: the more cores I use, the slower the execution gets!! I also checked ray with 2 cores and it took just over 2 hours! :) Are there any known issues with ray that are being worked on for the v3.1 release?
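
For reference, the ray runs above boil down to something like the following, with only the worker count changing between runs (a sketch, assuming the spacy-ray plugin's "spacy ray train" entry point; config.cfg is a placeholder and other flags are omitted):

import subprocess
import sys

# Same config each time; only the number of ray workers changes between runs.
for n_workers in (1, 2, 4):
    subprocess.run([sys.executable, "-m", "spacy", "ray", "train", "config.cfg",
                    "--n-workers", str(n_workers)])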

6- If you teach me how to use -g 0, I am happy to re-run the two experiments with the GPU and see whether multiple cores get used. Though at this point, even training on a single GPU runs out of memory!
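
My best guess at what -g 0 means, based on the hint printed in the logs above ("To switch to GPU 0, use the option: --gpu-id 0"), is something like this; please correct me if that is not what you had in mind (paths are placeholders):

import subprocess
import sys

# --gpu-id 0 (short form: -g 0) should run training on the first GPU.
subprocess.run([sys.executable, "-m", "spacy", "train", "config.cfg",
                "--output", "./output",
                "--gpu-id", "0"])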

Please let me know if there are any other experiments you are interested in. I am happy to assist you.

Thanks!
-Jules

@delucca

delucca commented Apr 19, 2023

Any updates on this?

@rmitsch
Contributor

rmitsch commented Apr 20, 2023

Hi @delucca, not yet, unfortunately.
