Tutorial for LLaMA inference using llama.cpp. LLaMA hackers are going to love this.
Backend Software Engineer
👋 I’m Cedric Chee. I’ve been a software engineer, writer, and entrepreneur.
I code and write about it sometimes. I create systems software and apps in Go/JS.
I do product engineering and web development at startups/consulting. I enjoy backend development.
I’m currently focusing on Large Language Models (LLMs). I tinker with LLMs and AI systems at night.
Recent Posts
HuggingFace Transformers inference for Alpaca
Transformers inference for Stanford Alpaca, one of the first instruction-following language models, fine-tuned from the LLaMA model.
Hacking LLaMA Models
Having some fun hacking LLaMA models to run locally on my own hardware while witnessing the first LLaMA community coming alive.
Node.js Good Practices and Mastering Node.js
Going beyond the Node.js basics.
Learning GitHub Copilot
Useful resources for learning GitHub Copilot and AI coding tools.
Turbopack Experiment
I just kicked the tires on Turbopack and it looks pretty good so far.
Creating a new monorepo with Turborepo
Learn all about your new monorepo, and how Turborepo makes handling your tasks easier.
How to Design and Fix Tech Hiring Processes
Reboot hiring with alternatives to coding interviews: take-home projects, or having candidates read code and talk about how it works.
Latency Numbers Every Programmer Should Know
Latency, humanized: a visual representation of latencies.
LiveView and React in Contrast
Hey what’s up? I played with Remix shortly after they open-sourced it. I had some fun with webdev after a while of working on backend stuff.
Use the Right Tool for the Job
You Are Not Google.
Microservice Chassis
Looking for an easier way to deploy my apps (containerized services) to Kubernetes cluster and learning Dapr.
SICP Reading Project
My first encounter with the SICP book (Structure and Interpretation of Computer Programs) was in 2018 through the “Teach Yourself Computer Science” website.
Getting Better at Elasticsearch and Backend Development in General
Two short book reviews: Elasticsearch in Action and Effective Kafka.
Apache Kafka: A Primer
A summary of architecture and core concepts.
React 18 Alpha is Out
React 18 is coming, and I want to take this chance to compile some material for exploring new features like Suspense and many more.
Onsi Haiku Test
The Turing Test for Cloud computing?
Rethinking "Clean Code"
I’ve written about this topic before.
Today, I’m seeing anti-“clean code” stuff topping social media again. This time, it’s about Robert C. Martin’s book “Clean Code”. I’m talking about this blog post, “It’s probably time to stop recommending Clean Code”.
I have actually read Clean Code. It’s not a perfect book. It’s not going to make anyone into a great programmer.
What I Discovered
I’m going to quote some good points from an old (2020) /r/programming thread.
I’ve more or less given up on lists of rules for “clean code”. Every time I’ve proposed a list, someone creates some working code that assiduously follows every rule, and yet is a complete pile of crap. And yes, the someone is doing this in good faith.
Probably the only rule that really matters is: “use good judgement”.
Personally, I think the principles in Clean Code are very important. However, the book itself isn’t the best thing I’ve ever read, and attaching Uncle Bob’s name to it isn’t necessarily doing the subject matter a service.
In my opinion, Sandi Metz’s blog and books (i.e. POODR) present the same principles as Clean Code but in a much more concise, clear fashion. If I had to pick two “required reading” books for every software developer, I absolutely think POODR and Code Complete (by Steve McConnell) would be on the top of the list.
I’ll be honest, reading POODR a few years ago felt like a wake-up call for me in terms of realizing just how much of a junior developer I am. There really is an art to designing abstractions, and if I ever end up doing imperative programming again, I’m going to try to do OO “the right way” this time.
I would personally recommend another Sandi Metz book, 99 Bottles of OOP - 2nd Edition. I have read it and completed the exercises. I liked the Flocking Rules taught throughout the book for uncovering abstractions (not premature or forced abstraction, and not abusing OOP, but practicing continuous refactoring with tests to improve the code; testing in this context does not necessarily mean strictly following TDD, which is good).
The author of that blog post suggested “A Philosophy of Software Design” (2018) by John Ousterhout. If you’re interested, I found these two blog posts and they have good reviews of that book.
- Book Review by Johnz - Johnz explains why he recommends it to other software engineers and developers. What caught my attention is his point on “Teaching Principles Over Rules”.
- My Take (and a Book Review) by Gergely Orosz
Aside:
- I’ve also seen the Semantic Compression idea from Casey Muratori, mainly this part:
Like a good compressor, I don’t reuse anything until I have at least two instances of it occurring. Many programmers don’t understand how important this is, and try to write “reusable” code right off the bat, but that is probably one of the biggest mistakes you can make. My mantra is, “make your code usable before you try to make it reusable”.
- Goodbye, Clean Code post by Dan.
I sure didn’t think deeply about any of those things. I thought a lot about how the code looked — but not about how it evolved with a team of squishy humans. … Don’t be a clean code zealot. Clean code is not a goal. It’s an attempt to make some sense out of the immense complexity of systems we’re dealing with.
That’s it for now. Till next time.
System Design Cheatsheet
Picking the right architecture = Picking the right battles + Managing trade-offs
Basic Steps
- Clarify and agree on the scope of the system
- Use cases (description of sequences of events that, taken together, lead to a system doing something useful)
- Who is going to use it?
- How are they going to use it?
- Constraints
- Mainly identify traffic and data handling constraints at scale.
- Scale of the system, such as requests per second, request types, data written per second, and data read per second.
- Special system requirements, such as multi-threading or read- versus write-oriented workloads.
- High level architecture design (Abstract design)
- Sketch the important components and the connections between them, but don’t go into too much detail.
- Application service layer (serves the requests)
- List different services required.
- Data Storage layer
- eg. Usually a scalable system includes a webserver (load balancer), services (service partitions), a database (master/slave database cluster) and caching systems.
- Component Design
- Component + specific APIs required for each of them.
- Object-oriented design for functionality.
- Map features to modules: One scenario for one module.
- Consider the relationships among modules:
- Certain functions must have a unique instance (singletons).
- Core object can be made up of many other objects (composition).
- One object is another object (inheritance)
- Database schema design.
- Understanding Bottlenecks
- Perhaps your system needs a load balancer and many machines behind it to handle the user requests.
- Or maybe the data is so huge that you need to distribute your database on multiple machines. What are some of the downsides that occur from doing that?
- Is the database too slow and does it need some in-memory caching?
- Scaling your abstract design
- Vertical scaling
- You scale by adding more power (CPU, RAM) to your existing machine.
- Horizontal scaling
- You scale by adding more machines into your pool of resources.
- Caching
- Load balancing helps you scale horizontally across an ever-increasing number of servers, but caching will enable you to make vastly better use of the resources you already have, as well as making otherwise unattainable product requirements feasible.
- Application caching requires explicit integration in the application code itself. Usually it will check if a value is in the cache; if not, retrieve the value from the database.
- Database caching tends to be “free”. When you flip your database on, you’re going to get some level of default configuration which will provide some degree of caching and performance. Those initial settings will be optimized for a generic use case, and by tweaking them to your system’s access patterns you can generally squeeze out a great deal of performance improvement.
- In-memory caches are the most potent in terms of raw performance. This is because they store their entire set of data in memory, and accesses to RAM are orders of magnitude faster than those to disk. eg. Memcached or Redis.
- eg. Precalculating results (e.g. the number of visits from each referring domain for the previous day).
- eg. Pre-generating expensive indexes (e.g. suggested stories based on a user’s click history).
- eg. Storing copies of frequently accessed data in a faster backend (e.g. Memcached instead of PostgreSQL).
- Load balancing
- Public servers of a scalable web service are hidden behind a load balancer. This load balancer evenly distributes load (requests from your users) onto your group/cluster of application servers.
- Types: Smart client (hard to get it perfect), Hardware load balancers ($$$ but reliable), Software load balancers (hybrid - works for most systems)
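The application caching bullet above (check the cache first, fall back to the database on a miss) can be sketched roughly as follows. This is a minimal in-process illustration, not a production cache: the `CacheAside` class, the `fake_db_lookup` function, and the 60-second TTL are all assumptions made up for the example, and a real system would more likely put Memcached or Redis behind the same interface.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Check the cache first; on a miss, load from the database and remember it."""

    def __init__(self, load_from_db: Callable[[str], str], ttl_seconds: float = 60.0):
        self._load_from_db = load_from_db  # hypothetical database lookup
        self._ttl = ttl_seconds            # assumed expiry, tune per access pattern
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> str:
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                # cache hit: no database round trip
        value = self._load_from_db(key)    # cache miss: go to the database
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key, e.g. after the underlying row has been updated."""
        self._store.pop(key, None)


db_calls = []  # records every (simulated) database hit


def fake_db_lookup(key: str) -> str:
    # Stand-in for a real query; hypothetical, for illustration only.
    db_calls.append(key)
    return f"row-for-{key}"


cache = CacheAside(fake_db_lookup, ttl_seconds=60.0)
first = cache.get("user:1")   # miss, hits the fake database
second = cache.get("user:1")  # hit, served from memory
```

The explicit `invalidate` call is the part that makes application caching "require explicit integration": the application code has to know when cached data goes stale.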

- Database replication
- Database replication is the frequent electronic copying of data from a database on one computer or server to a database on another, so that all users share the same level of information. The result is a distributed database in which users can access data relevant to their tasks without interfering with the work of others. Using replication to eliminate data ambiguity or inconsistency among users is sometimes called normalization (a different sense from schema normalization in relational design).
- Database partitioning
- Partitioning of relational data usually refers to decomposing your tables either row-wise (horizontally) or column-wise (vertically).
- Map-Reduce
- For sufficiently small systems you can often get away with ad hoc queries on a SQL database, but that approach may not scale up trivially once the quantity of data stored or the write load requires sharding your database, and it will usually require dedicated slaves for the purpose of performing these queries (at which point, maybe you’d rather use a system designed for analyzing large quantities of data, rather than fighting your database).
- Adding a map-reduce layer makes it possible to perform data and/or processing intensive operations in a reasonable amount of time. You might use it for calculating suggested users in a social graph, or for generating analytics reports. eg. Hadoop, and maybe Hive or HBase.
- Platform Layer (Services)
- Separating the platform and web application allows you to scale the pieces independently. If you add a new API, you can add platform servers without adding unnecessary capacity for your web application tier.
- Adding a platform layer can be a way to reuse your infrastructure for multiple products or interfaces (a web application, an API, an iPhone app, etc) without writing too much redundant boilerplate code for dealing with caches, databases, etc.
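To make the database partitioning and map-reduce bullets above more concrete, here is a small sketch that splits visit rows across shards by a hash of the user id (horizontal partitioning), runs a map step per shard, and reduces the results into visits per referring domain. Everything here is an assumption for illustration: the shard count, the `shard_for`, `map_visits` and `reduce_counts` helpers, and the sample data. A real deployment would use something like Hadoop rather than in-memory lists.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

NUM_SHARDS = 3  # assumed shard count for the sketch


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Horizontal partitioning: pick a shard from a stable hash of the row key."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def map_visits(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Map step, run independently per shard: emit (referring_domain, 1) per visit."""
    return [(domain, 1) for _user_id, domain in rows]


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the emitted counts for each referring domain."""
    totals: Dict[str, int] = defaultdict(int)
    for domain, count in pairs:
        totals[domain] += count
    return dict(totals)


# Hypothetical visit log: (user_id, referring_domain)
visits = [
    ("u1", "news.example"),
    ("u2", "search.example"),
    ("u3", "news.example"),
    ("u1", "search.example"),
    ("u4", "news.example"),
]

# Distribute rows across shards by user id.
shards: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
for user_id, domain in visits:
    shards[shard_for(user_id)].append((user_id, domain))

# Each shard could run its map step on a different machine.
mapped: List[Tuple[str, int]] = []
for rows in shards.values():
    mapped.extend(map_visits(rows))

visits_per_domain = reduce_counts(mapped)
```

Because the map step only looks at rows local to one shard, the final totals do not depend on how rows were distributed, which is what lets the work spread across machines.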

Key topics for designing a system
- Concurrency
- Do you understand threads, deadlock, and starvation? Do you know how to parallelize algorithms? Do you understand consistency and coherence?
- Networking
- Do you roughly understand IPC and TCP/IP? Do you know the difference between throughput and latency, and when each is the relevant factor?
- Abstraction
- You should understand the systems you’re building upon. Do you know roughly how an OS, file system, and database work? Do you know about the various levels of caching in a modern OS?
- Real-World Performance
- You should be familiar with the speed of everything your computer can do, including the relative performance of RAM, disk, SSD and your network.
- Estimation
- Estimation, especially in the form of a back-of-the-envelope calculation, is important because it helps you narrow down the list of possible solutions to only the ones that are feasible. Then you have only a few prototypes or micro-benchmarks to write.
- Availability & Reliability
- Are you thinking about how things can fail, especially in a distributed environment? Do you know how to design a system to cope with network failures? Do you understand durability?
Web App System design considerations:
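As an illustration of the estimation point above, a back-of-the-envelope calculation might look like the sketch below. All of the input numbers (100 million requests per day, 1 KB payloads, a 3x peak factor, 10% writes) are made-up assumptions, and the `estimate` helper is hypothetical; the point is only the shape of the arithmetic.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate(daily_requests: int, avg_payload_bytes: int,
             peak_factor: float, write_ratio: float) -> dict:
    """Rough capacity numbers from a handful of assumed inputs."""
    avg_rps = daily_requests / SECONDS_PER_DAY
    peak_rps = avg_rps * peak_factor
    writes_per_day = daily_requests * write_ratio
    storage_gb_per_day = writes_per_day * avg_payload_bytes / 1e9
    return {
        "avg_rps": avg_rps,
        "peak_rps": peak_rps,
        "storage_gb_per_day": storage_gb_per_day,
        "storage_tb_per_year": storage_gb_per_day * 365 / 1000,
    }


# Assumed workload, for illustration only.
est = estimate(
    daily_requests=100_000_000,
    avg_payload_bytes=1_000,
    peak_factor=3.0,
    write_ratio=0.1,
)

for name, value in est.items():
    print(f"{name}: {value:,.1f}")
```

With these assumptions the average load is on the order of a thousand requests per second and storage grows by roughly 10 GB per day, which is already enough to rule some designs in or out before writing a prototype.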
- Security (CORS)
- Using CDN
- A content delivery network (CDN) is a system of distributed servers (network) that deliver webpages and other Web content to a user based on the geographic locations of the user, the origin of the webpage and a content delivery server.
- This service is effective in speeding the delivery of content of websites with high traffic and websites that have global reach. The closer the CDN server is to the user geographically, the faster the content will be delivered to the user.
- CDNs also provide protection from large surges in traffic.
- Full Text Search
- Using Sphinx/Lucene/Solr, which achieve fast search responses because they search an index instead of searching the text directly.
- Offline support/Progressive enhancement
- Service Workers
- Web Workers
- Server Side rendering
- Asynchronous loading of assets (Lazy load items)
- Minimizing network requests (HTTP/2 + bundling/sprites etc)
- Developer productivity/Tooling
- Accessibility
- Internationalization
- Responsive design
- Browser compatibility
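The full text search bullet above says that engines answer queries from an index rather than scanning the text. A toy inverted index shows the idea. This is only a sketch under simple assumptions (lowercase word tokens, AND semantics for multi-word queries, hypothetical `tokenize`, `build_index` and `search` helpers); it does not reflect how Sphinx, Lucene, or Solr are actually implemented, which add ranking, stemming, and much more.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Inverted index: term -> set of ids of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Sample documents, for illustration only.
docs = {
    1: "Apache Kafka: A Primer",
    2: "Effective Kafka and Elasticsearch in Action",
    3: "Latency Numbers Every Programmer Should Know",
}

index = build_index(docs)
kafka_hits = search(index, "kafka")
primer_hits = search(index, "Kafka primer")
```

The index is built once up front; each query then costs a few dictionary lookups and set intersections instead of a pass over every document.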
Working Components of Front-end Architecture
- Code
- HTML5/WAI-ARIA
- CSS/Sass Code standards and organization
- Object-Oriented approach (how do objects break down and get put together)
- JS frameworks/organization/performance optimization techniques
- Asset Delivery - Front-end Ops
- Documentation
- Onboarding Docs
- Styleguide/Pattern Library
- Architecture Diagrams (code flow, tool chain)
- Testing
- Performance Testing
- Visual Regression
- Unit Testing
- End-to-End Testing
- Process
- Git Workflow
- Dependency Management (npm, Bundler, Bower)
- Build Systems (Grunt/Gulp)
- Deploy Process
- Continuous Integration (Travis CI, Jenkins)
Links
How to rock a systems design interview
Introduction to Architecting Systems for Scale
Scalable System Design Patterns
Scalable Web Architecture and Distributed Systems
What is the best way to design a web site to be highly scalable?
Adapted from vasanthk/System Design.md.
All credit goes to the rightful owner.
Hot Topics in Operating Systems
The HotOS XVIII program will be great! We will get to see and hear new ideas in Operating Systems research on June 1 2021. It’s been a while for me. I think it will be a good time to pause and take the chance to catch up and learn about how tech advances and new applications in OS research are shaping our computational infra. I don’t know where I heard this quip, “Always bet on Linux”. lol.
I think this would get me enjoying reading papers again (PDFs published by SIGOPS):
- From Warm to Hot Starts: Leveraging Runtimes for the Serverless Era
- Cores That Don’t Count
- From Cloud Computing to Sky Computing (Sky Computing, what!? Is this yet another buzzword?)
- FlexOS: Making OS Isolation Flexible
- Don’t Be a Blockhead: Zoned Namespaces Make Work on Conventional SSDs Obsolete
- Contextual Concurrency Control
- Metastable Failures in Distributed Systems
- In Reference to RPC: It’s Time to Add an Immutable Shared Address Space
- Zerializer: Towards Zero-Copy Serialization