To test Meta Llama 3’s performance against existing models, we used the HumanEval coding benchmark. HumanEval tests a model’s ability to complete code based on a function’s docstring.
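Each HumanEval task supplies a function signature and docstring, and the model must generate a body that passes the task’s unit tests. The sketch below, modeled on the style of the benchmark’s first problem, shows how a single sample is checked; the prompt, completion, and tests here are illustrative rather than taken verbatim from the benchmark:

```python
# Illustrative HumanEval-style task: the model sees only the prompt
# (signature + docstring) and must generate the function body.
prompt = '''
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
'''

# A hypothetical model completion, appended verbatim to the prompt.
completion = '''
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False
'''

# The benchmark scores a sample by executing prompt + completion
# against the task's unit tests.
namespace: dict = {}
exec(prompt + completion, namespace)
assert namespace["has_close_elements"]([1.0, 2.0, 2.05], 0.1)
assert not namespace["has_close_elements"]([1.0, 2.0, 3.0], 0.5)
```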

We ran the benchmark against 137 publicly available large language models (LLMs).
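The model identifiers in the table below appear to follow OpenRouter’s naming scheme (including the `:nitro` and `:free` variants). As a rough sketch of how a single completion could be requested from any model in the list, assuming an OpenRouter API key and the OpenAI-compatible Python client (the exact harness used for these results is not reproduced here):

```python
# Sketch: request one code completion from a listed model via
# OpenRouter's OpenAI-compatible endpoint. Assumes the `openai` v1+
# client and an OPENROUTER_API_KEY environment variable.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

prompt = 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n'

response = client.chat.completions.create(
    model="meta-llama/llama-3-70b-instruct",
    messages=[
        {"role": "system", "content": "Complete the Python function. Reply with code only."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.0,  # deterministic-ish decoding is typical for pass@1 runs
)
print(response.choices[0].message.content)
```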

| Model | Accuracy¹ |
| --- | --- |
| openai/gpt-4-vision-preview | 60 |
| openai/gpt-4-turbo | 60 |
| openai/gpt-4-0314 | 60 |
| openai/gpt-4-32k-0314 | 59 |
| google/gemini-pro-1.5 | 59 |
| openai/gpt-4-turbo-preview | 58 |
| openai/gpt-4 | 58 |
| openai/gpt-3.5-turbo-0125 | 58 |
| openai/gpt-3.5-turbo | 58 |
| meta-llama/llama-3-70b-instruct | 56 |
| anthropic/claude-3-sonnet:beta | 56 |
| openai/gpt-3.5-turbo-0301 | 54 |
| anthropic/claude-3-sonnet | 54 |
| meta-llama/llama-3-70b-instruct:nitro | 53 |
| openai/gpt-4-32k | 52 |
| anthropic/claude-3-opus:beta | 52 |
| anthropic/claude-3-opus | 52 |
| phind/phind-codellama-34b | 51 |
| nousresearch/nous-capybara-34b | 51 |
| openai/gpt-4-1106-preview | 50 |
| openai/gpt-3.5-turbo-1106 | 50 |
| mistralai/mistral-medium | 50 |
| microsoft/wizardlm-2-8x22b:nitro | 50 |
| microsoft/wizardlm-2-8x22b | 50 |
| meta-llama/llama-2-70b-chat:nitro | 50 |
| google/palm-2-chat-bison | 50 |
| cohere/command-r-plus | 50 |
| anthropic/claude-3-haiku | 50 |
| anthropic/claude-2.1:beta | 50 |
| anthropic/claude-2.1 | 50 |
| meta-llama/llama-2-7b-chat | 50 |
| 01-ai/yi-34b-chat | 49 |
| perplexity/sonar-medium-chat | 49 |
| perplexity/pplx-70b-chat | 48 |
| mistralai/mistral-7b-instruct:nitro | 48 |
| google/gemma-7b-it:free | 48 |
| anthropic/claude-2.0 | 48 |
| mistralai/mixtral-8x7b-instruct | 47 |
| anthropic/claude-instant-1.1 | 47 |
| sao10k/fimbulvetr-11b-v2 | 46 |
| openchat/openchat-7b | 46 |
| openai/gpt-3.5-turbo-16k | 46 |
| nousresearch/nous-hermes-mistral | 46 |
| mistralai/mistral-7b-instruct:free | 46 |
| mistralai/mistral-7b-instruct | 46 |
| cohere/command-r | 46 |
| teknium/openhermes-2.5-mistral-7b | 45 |
| teknium/openhermes-2-mistral-7b | 45 |
| perplexity/pplx-7b-chat | 45 |
| mistralai/mixtral-8x7b-instruct:nitro | 45 |
| google/gemma-7b-it | 45 |
| meta-llama/codellama-34b-instruct | 44 |
| google/palm-2-codechat-bison-32k | 44 |
| google/palm-2-codechat-bison | 44 |
| google/gemini-pro-vision | 44 |
| cognitivecomputations/dolphin-mixtral | 44 |
| perplexity/sonar-small-chat | 43 |
| nousresearch/nous-hermes-yi-34b | 43 |
| nousresearch/nous-hermes-2-mixtral | 43 |
| lizpreciatior/lzlv-70b-fp16-hf | 43 |
| jondurbin/airoboros-l2-7b | 43 |
| google/gemini-pro | 43 |
| anthropic/claude-3-haiku:beta | 43 |
| anthropic/claude-2.0:beta | 43 |
| sophosympatheia/midnight-rose-70b | 42 |
| openai/gpt-3.5-turbo-0613 | 42 |
| mistralai/mixtral-8x22b-instruct | 42 |
| mistralai/mistral-small | 42 |
| meta-llama/llama-2-13b-chat | 42 |
| google/gemma-7b-it:nitro | 42 |
| anthropic/claude-instant-1.2 | 42 |
| anthropic/claude-1.2 | 42 |
| togethercomputer/stripedhyena-nous-7b | 41 |
| databricks/dbrx-instruct | 41 |
| rwkv/rwkv-5-world-3b | 40 |
| openrouter/cinematika-7b | 40 |
| nousresearch/nous-hermes-2-mixtral | 40 |
| nousresearch/nous-capybara-7b | 40 |
| mistralai/mistral-large | 40 |
| huggingfaceh4/zephyr-orpo-141b-a35b | 40 |
| google/palm-2-chat-bison-32k | 40 |
| meta-llama/llama-3-8b-instruct | 39 |
| codellama/codellama-70b-instruct | 39 |
| xwinlm/xwin-lm-70b | 38 |
| perplexity/sonar-medium-online | 38 |
| meta-llama/llama-3-8b-instruct:nitro | 37 |
| anthropic/claude-1 | 36 |
| perplexity/pplx-7b-online | 35 |
| openrouter/cinematika-7b:free | 35 |
| gryphe/mythomax-l2-13b:nitro | 35 |
| gryphe/mythomax-l2-13b | 35 |
| recursal/eagle-7b | 34 |
| perplexity/sonar-small-online | 34 |
| huggingfaceh4/zephyr-7b-beta | 34 |
| 01-ai/yi-6b | 34 |
| perplexity/pplx-70b-online | 33 |
| open-orca/mistral-7b-openorca | 33 |
| nousresearch/nous-hermes-llama2 | 33 |
| mistralai/mixtral-8x22b | 33 |
| gryphe/mythomax-l2-13b:extended | 33 |
| alpindale/goliath-120b | 33 |
| mistralai/mistral-tiny | 32 |
| microsoft/wizardlm-2-7b | 32 |
| cohere/command | 32 |
| austism/chronos-hermes-13b | 32 |
| undi95/toppy-m-7b:free | 31 |
| undi95/toppy-m-7b | 31 |
| openchat/openchat-7b:free | 31 |
| pygmalionai/mythalion-13b | 30 |
| nousresearch/nous-capybara-7b:free | 30 |
| huggingfaceh4/zephyr-7b-beta:free | 30 |
| undi95/toppy-m-7b:nitro | 29 |
| undi95/remm-slerp-l2-13b | 29 |
| mistralai/mixtral-8x7b | 29 |
| anthropic/claude-instant-1.0 | 28 |
| recursal/rwkv-5-3b-ai-town | 27 |
| undi95/remm-slerp-l2-13b:extended | 26 |
| koboldai/psyfighter-13b-2 | 26 |
| 01-ai/yi-34b | 26 |
| neversleep/noromaid-mixtral-8x7b-instruct | 25 |
| togethercomputer/stripedhyena-hessian-7b | 24 |
| openai/gpt-3.5-turbo-instruct | 24 |
| neversleep/noromaid-20b | 24 |
| gryphe/mythomist-7b | 22 |
| meta-llama/llama-3-8b-instruct:extended | 20 |
| mancer/weaver | 20 |
| intel/neural-chat-7b | 20 |
| gryphe/mythomist-7b:free | 20 |
| fireworks/firellava-13b | 16 |
| lynn/soliloquy-l3 | 14 |
| nousresearch/nous-hermes-2-vision | 0 |
| jondurbin/bagel-34b | 0 |
| jebcarter/psyfighter-13b | 0 |
| haotian-liu/llava-13b | 0 |
| anthropic/claude-instant-1:beta | 0 |
| anthropic/claude-instant-1 | 0 |
| anthropic/claude-2:beta | 0 |
| anthropic/claude-2 | 0 |

The benchmark showed that Llama-3-70b-Instruct (56) outperformed the open-source, code-specific LLMs Phind-CodeLlama-34b (51) and CodeLlama-70b-Instruct (39), and also edged out Claude-3-Opus (52).
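The scores above are pass@1: the fraction of problems for which a single generated sample passes every unit test. When more than one sample is drawn per problem, the unbiased estimator introduced alongside HumanEval (Chen et al., 2021) is the usual way to compute it; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem, given n samples of
    which c passed the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 4 of them correct.
print(pass_at_k(n=10, c=4, k=1))  # 0.4 -- for k=1 this is just c / n
```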

Reference

1. HumanEval (pass@1) accuracy; higher is better.