Getting Started
With supported tasks, you can perform evalation easily by calling code-eval:
$ code-eval --model_name microsoft/phi-1 --task humaneval
Note
This will result in using a single GPU for generation. If you have more than one, check Multi-GPU Evaluation.
Backend will default select transformers (tf), switch to vllm if you want
to use vLLM engine:
$ code-eval --model_name microsoft/phi-1 --task humaneval --backend vllm
Multi-GPU Evalation
vLLM native support running on multi-GPU, specify your desired GPU with CUDA_DEVICES_VISIBLE:
$ export CUDA_VISIBLE_DEVICES=0,1
$ code-eval --model_name microsoft/phi-1 --task humaneval --backend vllm
This will block vllm to only see device:0 and device:1.
Multi-GPU transformers launch distributed with accelerate:
$ export CUDA_VISIBLE_DEVICES=0,1
$ accelerate launch --num_processes=2 --no-python code-eval \
--model_name microsoft/phi-1 --task humaneval
Customize Tasks
We provide easy overwrited interface for customize task.
You can inherit TaskBase and only need to overwrite prepare_dataset function.
The follow example create a new custom task with its own preprocessing function.
from code_eval.tasks import TaskBase
class CustomTask(TaskBase):
TASK_NAME = "custom_task" # Name for display only.
DATASET_NAME_OR_PATH='./test.jsonl' # huggingface path or local file.
def prepare_dataset(self, *args: Any, **kwargs: Any):
def _preprocess(example):
example['new_inputs'] = f"Question: {example['input']}\nAnswer:"
return example
updated_dataset = self.dataset['test'].map(_preprocess)
return updated_dataset
Note
prepare_dataset must return a Dataset or Dictionary types.
Create Evaluator and run parallel generation:
task = CustomTask()
evaluator = Evaluator(task=task, model_name="microsoft/phi-1", batch_size=16)
print(evaluator.dataset['new_inputs'][1]) # Visualize prompt
evaluator.generate(max_tokens=128, temperature=0.0)
Tip
Name our file as eval.py, run distributed with accelerate:
$ export CUDA_VISIBLE_DEVICES=0,1
$ accelerate launch --num_process=2 eval.py
Supported task
Taskname |
Evaluate dataset |
Metrics |
|---|---|---|
|
Pass@k (k=[1, 10, 100]) |
|
|
Pass@k (k=[1, 10, 100]) |