CodeLLM Evaluator

Easy evaluation of CodeLLMs with fast inference settings

Overview

CodeLLM Evaluator provides fast and efficient evaluation on code generation tasks. Inspired by lm-evaluation-harness and bigcode-eval-harness, the framework is designed for multiple use cases and makes it easy to add new metrics and custom tasks.

Features:

HumanEval and MBPP benchmarks implemented for code LLMs.
Support for models loaded via transformers and DeepSpeed.
Support for evaluating adapters (e.g. LoRA) supported by HuggingFace’s PEFT library.
Support for distributed inference with native transformers or fast inference with the vLLM backend.
Easy support for custom prompts, tasks, and metrics.

Setup

Install the code-eval package from the GitHub repository via pip:

$ git clone https://github.com/FSoft-AI4Code/code-llm-evaluator.git
$ cd code-llm-evaluator
$ pip install -e .

Quick-start

To evaluate a supported task in Python, use code_eval.Evaluator() to generate completions and compute evaluation metrics on the fly.

from code_eval import Evaluator
from code_eval.task import HumanEval

task = HumanEval()
evaluator = Evaluator(task=task)

output = evaluator.generate(num_return_sequences=3,
                            batch_size=16,
                            temperature=0.9)
result = evaluator.evaluate(output)
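
HumanEval and MBPP are conventionally scored with pass@k, estimated from several sampled completions per problem (here, num_return_sequences=3). The sketch below shows the standard unbiased estimator for that metric. It is an illustration only: the source does not state which metric evaluator.evaluate() computes or what its output looks like, so the function names and the (n, c) input format are assumptions, not part of the code_eval API.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that passed the unit tests
    k: budget of attempts being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample
    # is C(n - c, k) / C(n, k); math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given hypothetical (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must not be empty")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Three problems, three samples each, with 1, 0 and 3 passing samples.
    per_problem = [(3, 1), (3, 0), (3, 3)]
    print("pass@1:", mean_pass_at_k(per_problem, k=1))
    print("pass@3:", mean_pass_at_k(per_problem, k=3))
```

Sampling more completions per problem (a larger num_return_sequences) lowers the variance of the estimate and allows larger values of k to be reported.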

CLI Usage

Inference with Transformers

Load a model and generate answers using native transformers (hf) by passing a local model path or a HuggingFace Hub name. transformers is the default backend, but you can pass --backend hf to select it explicitly:

$ code-eval --model_name microsoft/phi-1 \
    --task humaneval \
    --batch_size 8 \
    --backend hf

Tip

Load LoRA adapters by adding the --peft_model argument. The --model_name argument must still point to the full model architecture.

$ code-eval --model_name microsoft/phi-1 \
    --peft_model <adapters-name> \
    --task humaneval \
    --batch_size 8 \
    --backend hf

Inference with vLLM engine

We recommend the vLLM engine for fast inference. vLLM supports tensor parallelism, data parallelism, or a combination of both. Refer to the vLLM documentation for more detail.

To use code-eval with the vLLM engine, follow the vLLM documentation to install it.

Note

You can install vLLM using pip:

$ pip install vllm

With a model supported by vLLM (see the vLLM list of supported models), run:

$ code-eval --model_name microsoft/phi-1 \
    --task humaneval \
    --batch_size 8 \
    --backend vllm

Tip

You can use LoRA with similar syntax.

$ code-eval --model_name microsoft/phi-1 \
    --peft_model <adapters-name> \
    --task humaneval \
    --batch_size 8 \
    --backend vllm
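
The commands above differ only in the --backend and --peft_model arguments, so they are easy to script. The following sketch assembles the same command lines from Python, for example to run one task on both backends. The helper build_code_eval_command is hypothetical and not part of the package; it uses only the flags shown in this section, and the actual invocation is left commented out.

```python
import shlex
from typing import List, Optional


def build_code_eval_command(
    model_name: str,
    task: str,
    batch_size: int,
    backend: str,
    peft_model: Optional[str] = None,
) -> List[str]:
    """Build a code-eval argument list (hypothetical helper)."""
    cmd = ["code-eval", "--model_name", model_name]
    if peft_model is not None:
        # LoRA adapters are passed in addition to the full base model.
        cmd += ["--peft_model", peft_model]
    cmd += [
        "--task", task,
        "--batch_size", str(batch_size),
        "--backend", backend,
    ]
    return cmd


if __name__ == "__main__":
    for backend in ("hf", "vllm"):
        cmd = build_code_eval_command("microsoft/phi-1", "humaneval", 8, backend)
        print(shlex.join(cmd))
        # To execute, import subprocess and uncomment the next line:
        # subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when a model path contains spaces.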

Cite as

@misc{code-eval,
    author       = {Dung Nguyen Manh},
    title        = {A framework for easily evaluation code generation model},
    month        = 3,
    year         = 2024,
    publisher    = {github},
    version      = {v0.0.1},
    url          = {https://github.com/FSoft-AI4Code/code-llm-evaluator}
}
