Evaluating Llama 3 on Code Tasks
To test Meta Llama 3’s performance against existing models, we used the HumanEval coding benchmark, which tests a model’s ability to complete a function body from its signature and docstring. We ran the benchmark across 137 publicly available large language models (LLMs).
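To make the setup concrete, the sketch below shows how a single HumanEval-style check works: the model sees a signature plus docstring and must produce a body that passes hidden unit tests. `generate_completion` is a hypothetical stand-in for whatever model API is under test, and the task and assertions here are illustrative, not the benchmark’s own harness.

```python
# HumanEval-style check: prompt = signature + docstring; the model must
# supply a function body that passes the task's unit tests.

PROMPT = '''def incr_list(l: list):
    """Return list with elements incremented by 1.
    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    """
'''

def generate_completion(prompt: str) -> str:
    # Hypothetical placeholder: a real run would call the model under test here.
    return "    return [x + 1 for x in l]\n"

def passes_tests(prompt: str, completion: str) -> bool:
    """Assemble the candidate solution, execute it, and run the unit tests."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # defines incr_list
        assert namespace["incr_list"]([1, 2, 3]) == [2, 3, 4]
        assert namespace["incr_list"]([]) == []
        return True
    except Exception:
        return False

print(passes_tests(PROMPT, generate_completion(PROMPT)))  # True for a correct completion
```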
| Model | Accuracy¹ |
|---|---|
| openai/gpt-4-vision-preview | 60 |
| openai/gpt-4-turbo | 60 |
| openai/gpt-4-0314 | 60 |
| openai/gpt-4-32k-0314 | 59 |
| google/gemini-pro-1.5 | 59 |
| openai/gpt-4-turbo-preview | 58 |
| openai/gpt-4 | 58 |
| openai/gpt-3.5-turbo-0125 | 58 |
| openai/gpt-3.5-turbo | 58 |
| meta-llama/llama-3-70b-instruct | 56 |
| anthropic/claude-3-sonnet:beta | 56 |
| openai/gpt-3.5-turbo-0301 | 54 |
| anthropic/claude-3-sonnet | 54 |
| meta-llama/llama-3-70b-instruct:nitro | 53 |
| openai/gpt-4-32k | 52 |
| anthropic/claude-3-opus:beta | 52 |
| anthropic/claude-3-opus | 52 |
| phind/phind-codellama-34b | 51 |
| nousresearch/nous-capybara-34b | 51 |
| openai/gpt-4-1106-preview | 50 |
| openai/gpt-3.5-turbo-1106 | 50 |
| mistralai/mistral-medium | 50 |
| microsoft/wizardlm-2-8x22b:nitro | 50 |
| microsoft/wizardlm-2-8x22b | 50 |
| meta-llama/llama-2-70b-chat:nitro | 50 |
| google/palm-2-chat-bison | 50 |
| cohere/command-r-plus | 50 |
| anthropic/claude-3-haiku | 50 |
| anthropic/claude-2.1:beta | 50 |
| anthropic/claude-2.1 | 50 |
| meta-llama/llama-2-7b-chat | 50 |
| 01-ai/yi-34b-chat | 49 |
| perplexity/sonar-medium-chat | 49 |
| perplexity/pplx-70b-chat | 48 |
| mistralai/mistral-7b-instruct:nitro | 48 |
| google/gemma-7b-it:free | 48 |
| anthropic/claude-2.0 | 48 |
| mistralai/mixtral-8x7b-instruct | 47 |
| anthropic/claude-instant-1.1 | 47 |
| sao10k/fimbulvetr-11b-v2 | 46 |
| openchat/openchat-7b | 46 |
| openai/gpt-3.5-turbo-16k | 46 |
| nousresearch/nous-hermes-mistral | 46 |
| mistralai/mistral-7b-instruct:free | 46 |
| mistralai/mistral-7b-instruct | 46 |
| cohere/command-r | 46 |
| teknium/openhermes-2.5-mistral-7b | 45 |
| teknium/openhermes-2-mistral-7b | 45 |
| perplexity/pplx-7b-chat | 45 |
| mistralai/mixtral-8x7b-instruct:nitro | 45 |
| google/gemma-7b-it | 45 |
| meta-llama/codellama-34b-instruct | 44 |
| google/palm-2-codechat-bison-32k | 44 |
| google/palm-2-codechat-bison | 44 |
| google/gemini-pro-vision | 44 |
| cognitivecomputations/dolphin-mixtral | 44 |
| perplexity/sonar-small-chat | 43 |
| nousresearch/nous-hermes-yi-34b | 43 |
| nousresearch/nous-hermes-2-mixtral | 43 |
| lizpreciatior/lzlv-70b-fp16-hf | 43 |
| jondurbin/airoboros-l2-7b | 43 |
| google/gemini-pro | 43 |
| anthropic/claude-3-haiku:beta | 43 |
| anthropic/claude-2.0:beta | 43 |
| sophosympatheia/midnight-rose-70b | 42 |
| openai/gpt-3.5-turbo-0613 | 42 |
| mistralai/mixtral-8x22b-instruct | 42 |
| mistralai/mistral-small | 42 |
| meta-llama/llama-2-13b-chat | 42 |
| google/gemma-7b-it:nitro | 42 |
| anthropic/claude-instant-1.2 | 42 |
| anthropic/claude-1.2 | 42 |
| togethercomputer/stripedhyena-nous-7b | 41 |
| databricks/dbrx-instruct | 41 |
| rwkv/rwkv-5-world-3b | 40 |
| openrouter/cinematika-7b | 40 |
| nousresearch/nous-hermes-2-mixtral | 40 |
| nousresearch/nous-capybara-7b | 40 |
| mistralai/mistral-large | 40 |
| huggingfaceh4/zephyr-orpo-141b-a35b | 40 |
| google/palm-2-chat-bison-32k | 40 |
| meta-llama/llama-3-8b-instruct | 39 |
| codellama/codellama-70b-instruct | 39 |
| xwinlm/xwin-lm-70b | 38 |
| perplexity/sonar-medium-online | 38 |
| meta-llama/llama-3-8b-instruct:nitro | 37 |
| anthropic/claude-1 | 36 |
| perplexity/pplx-7b-online | 35 |
| openrouter/cinematika-7b:free | 35 |
| gryphe/mythomax-l2-13b:nitro | 35 |
| gryphe/mythomax-l2-13b | 35 |
| recursal/eagle-7b | 34 |
| perplexity/sonar-small-online | 34 |
| huggingfaceh4/zephyr-7b-beta | 34 |
| 01-ai/yi-6b | 34 |
| perplexity/pplx-70b-online | 33 |
| open-orca/mistral-7b-openorca | 33 |
| nousresearch/nous-hermes-llama2 | 33 |
| mistralai/mixtral-8x22b | 33 |
| gryphe/mythomax-l2-13b:extended | 33 |
| alpindale/goliath-120b | 33 |
| mistralai/mistral-tiny | 32 |
| microsoft/wizardlm-2-7b | 32 |
| cohere/command | 32 |
| austism/chronos-hermes-13b | 32 |
| undi95/toppy-m-7b:free | 31 |
| undi95/toppy-m-7b | 31 |
| openchat/openchat-7b:free | 31 |
| pygmalionai/mythalion-13b | 30 |
| nousresearch/nous-capybara-7b:free | 30 |
| huggingfaceh4/zephyr-7b-beta:free | 30 |
| undi95/toppy-m-7b:nitro | 29 |
| undi95/remm-slerp-l2-13b | 29 |
| mistralai/mixtral-8x7b | 29 |
| anthropic/claude-instant-1.0 | 28 |
| recursal/rwkv-5-3b-ai-town | 27 |
| undi95/remm-slerp-l2-13b:extended | 26 |
| koboldai/psyfighter-13b-2 | 26 |
| 01-ai/yi-34b | 26 |
| neversleep/noromaid-mixtral-8x7b-instruct | 25 |
| togethercomputer/stripedhyena-hessian-7b | 24 |
| openai/gpt-3.5-turbo-instruct | 24 |
| neversleep/noromaid-20b | 24 |
| gryphe/mythomist-7b | 22 |
| meta-llama/llama-3-8b-instruct:extended | 20 |
| mancer/weaver | 20 |
| intel/neural-chat-7b | 20 |
| gryphe/mythomist-7b:free | 20 |
| fireworks/firellava-13b | 16 |
| lynn/soliloquy-l3 | 14 |
| nousresearch/nous-hermes-2-vision | 0 |
| jondurbin/bagel-34b | 0 |
| jebcarter/psyfighter-13b | 0 |
| haotian-liu/llava-13b | 0 |
| anthropic/claude-instant-1:beta | 0 |
| anthropic/claude-instant-1 | 0 |
| anthropic/claude-2:beta | 0 |
| anthropic/claude-2 | 0 |
The benchmark showed that Llama-3-70b-Instruct (56) performed better than the open-source, code-specific LLMs Phind-CodeLlama-34b (51) and CodeLlama-70b-Instruct (39), and also outperformed Claude-3-Opus (52).
Reference
¹ HumanEval (pass@1) accuracy; higher is better.
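Since the table reports pass@1, it may help to see how the metric is computed. The sketch below uses the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); with a single sample per task, pass@1 reduces to the fraction of tasks whose completion passes the tests.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).
    n: samples generated per task, c: samples that pass, k: evaluation budget."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so every k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 of 10 generated samples pass the tests -> pass@1 = 0.2
print(pass_at_k(n=10, c=2, k=1))
```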