CS336: Language Modeling from Scratch

(cs336.stanford.edu)

269 points | by kristianpaul 7 hours ago

35 comments

fg137 3 hours ago
I recently completed the 2025 version of this course (videos plus most assignments, skipping some of the most costly parts of the tasks). That's quite something. There is a lot going on in the first two assignments, which required a ton of thinking and debugging. Despite having a decent foundation in deep learning, it took me several months to finish, using bits of my after-work hours and weekends. (I am not a model part-time student by any means, and sometimes I didn't get to work on this for days, but it could have been much worse.) It's hard to imagine how enrolled Stanford students manage to submit assignments on a two-week cadence.
Coming back to the course, kudos to the course staff, including professors and TAs. They obviously put a ton of thought into designing the course, putting together slides that contain the latest updates in the field, and preparing the wonderful assignments. You get to create a real LM and explore other important parts of the LLM pipeline from small building blocks, validate each step, and see for yourself how everything comes together. You can really feel a sense of achievement after completing the assignments.
That said, while the staff obviously put a lot of effort into making this accessible to everyone, I wish they had made a bit more effort to clarify the environment requirements. Their harness works best in a Linux environment with an NVIDIA GPU, which may be taken for granted by researchers but is rare for a home computer setup. Their setup also expects specific CUDA versions and/or architectures. For following along at home, the next best setup is Windows with WSL2 + an NVIDIA GPU, plus leased GPUs on various platforms, none of which is exactly trivial (or cheap, for that matter). It would be nice if the staff could put together a bit more guidance in that area, especially on how someone without any compatible GPU can make the most of the course. (One thing I learned is that if you use macOS and are not careful about memory usage, your Python code can freeze your machine and force a reboot.)
- marcelroed 2 hours ago
  TA here. Noted! I now have more resources to test more environments, and will do so whenever possible. I think freezing due to memory overuse is going to be a problem with anything you code yourself, but I do think we could be more rigorous with guiding people to achieve limited memory use for the tokenizer task.
  IMO the cost of renting GPUs is a bit overstated in these comments. Generally, almost all of the development can be done locally and then run for a short period of time on on-demand GPUs. For assignment 1, you can run everything on your local machine, even if you don't have a GPU. For A1 and A2, you can do (most of) the tasks with only a few hours of renting. If you aren't too careful, using rental GPUs throughout will cost you around $200 in compute, but you can easily get this under $50 if you're willing to scale down many of the problems. I think we could work on making this clear and charting what these changes are.
  If you have further feedback or encounter problems, feel free to open issues in the repos so we can resolve them! It's hard for us to fix issues we're not aware of.
  - fg137 2 hours ago
    Memory overuse: for context, it's about parallelism on the gloo backend with CPU. My observation is that on Linux, the same (bad) Python code will result in the process getting killed quickly, saving the user the trouble of rebooting. Not sure if the macOS behavior is expected in the first place.
    GPU cost: most of us will spend at least a few hours troubleshooting to get started on a leased GPU, including but not limited to figuring out how much storage is needed, whether the CUDA version works well, etc. Going without a GPU is definitely possible but difficult. Plus, one issue might be that most of us just don't have enough experience working with leased GPUs, resulting in more time spent figuring things out.
    Github issues -- noted, will create any issue that I can think of.
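
On the limited-memory point for the tokenizer task discussed above, here is a minimal sketch of one way to keep memory bounded: count words from a stream one chunk at a time instead of loading the whole corpus. It is not the course's reference approach. The function name `count_words_streaming` and the plain whitespace splitting are assumptions made for illustration; the actual assignment may call for a different pre-tokenization rule.

```python
import io
from collections import Counter


def count_words_streaming(stream, chunk_size=1 << 20):
    """Count whitespace-separated words, holding about one chunk in memory.

    Hypothetical helper for illustration; real pre-tokenization rules may differ.
    """
    counts = Counter()
    carry = ""  # possibly incomplete word left over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = carry + chunk
        # Everything after the last space/newline may be a word cut in half,
        # so hold it back until the next chunk arrives.
        cut = max(text.rfind(" "), text.rfind("\n"))
        if cut == -1:
            carry = text
            continue
        counts.update(text[:cut].split())
        carry = text[cut + 1:]
    counts.update(carry.split())  # flush whatever remains at end of stream
    return counts


if __name__ == "__main__":
    demo = io.StringIO("the cat the dog\nthe end")
    print(count_words_streaming(demo, chunk_size=4))
```

The tiny `chunk_size` in the demo only exercises the boundary handling; with a real file you would pass an open text-mode file object and a much larger chunk size.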
meken 6 hours ago
I have fond memories of cs224d [1] taught by richardsocher. It’s a bit dated at this point as it was created in the pre-transformer era, but it was a very cool introduction to applying deep learning to nlp at the time.
[1] https://cs224d.stanford.edu
- egl2020 5 hours ago
  Similar thoughts here. That was when I realized the potential of the Internet: I didn't have to be a grad student at a tier 1 research university to learn about the frontier.
skerit 5 hours ago
> GPU compute for self-study
Those suggestions they make for a B200 start at $4.99 an hour.
Is that really required, for starting out? I've been tinkering with my own from-scratch LLM, but in the early phases I don't need anything more than a 4090 on Vast.ai
- marcelroed 4 hours ago
  TA here. Definitely not! In fact we explicitly added sections in the first assignment to allow for scaling down to even local compute (M-series GPUs). For assignment 2 there are a few regions that require Triton support for your GPU, but everything can be adapted for much cheaper GPUs.
  We were lucky enough to get Blackwell GPUs for Stanford students this year, which is why the writeups are written mostly around them.
- derefr 3 hours ago
  I imagine it's a lot like FPGAs:
  - the hardware you need for a production use-case is relatively small, because production {models, bitstreams} have been heavily size-optimized, stripping out everything not needed to get a good result for the target use-cases
  - but the hardware you need when tinkering/learning how to design {compute kernels, IP blocks} in the first place, must be quite a bit more powerful / higher-capacity, because your experiments will intentionally be the opposite of optimized: they'll be built for legibility / introspectability / debuggability at every level, which massively inflates and de-optimizes the resulting {model, bitstream}.
  (And, to be clear here, "running someone else's finished model, which was designed and optimized to be used on something like a 4090, against your own prompt" is a kind of experimenting, which is cheap, in the same way that "deploying someone else's pre-baked FPGA bitstream, that was designed and synthesized for a $20 target FPGA, onto your own instance of that $20 FPGA, and then feeding your own input signals to it" is cheap. But that's not the kind of experimenting you'd be doing in this course while learning to design your own models!)
- grahameb 4 hours ago
  It seems strange that the required resources aren't provided by the educational institution?
  - marcelroed 4 hours ago
    We do provide resources for enrolled students. The online suggestions are for external students or Stanford students who we weren't able to admit.
  - Maxatar 2 hours ago
    It says it's for self-study, i.e. those who are not enrolled in the course.
  - ReptileMan 4 hours ago
    Two schools of thought. The first: people are paying 100K per year, so we should provide everything. The second: they are paying 100K per year, so do you think they will care about a couple of hundred more?
- _0ffh 4 hours ago
  You're right to be sceptical. I have trained reasonably good SLMs for the TinyStories dataset on my 4060Ti (16GB) with no problems. You'll only run into trouble if you want to test whether your ideas scale up to models any bigger than "arguably tiny".
- root-parent 5 hours ago
  You don't even need a GPU to train your own LLM.
- flakiness 4 hours ago
  I believe these are affordable enough for the intended audience (which is Stanford undergrad/master's students).
  - mrcrm9494 4 hours ago
    For them, Modal is sponsoring the compute, as stated on the website. The prices are for remote followers.
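
To put the numbers from this thread side by side, here is a rough back-of-the-envelope sketch of how many rental hours a budget buys. It assumes the $4.99/hour B200 starting price quoted above and the $50 and $200 totals the TA mentions elsewhere in the discussion. The TA did not say those totals are for B200s, so pairing the two is an assumption, and `rental_hours` is a hypothetical helper.

```python
def rental_hours(budget_usd, hourly_rate_usd):
    """Hours of GPU time a budget buys at a flat hourly rate (hypothetical helper)."""
    if hourly_rate_usd <= 0:
        raise ValueError("hourly rate must be positive")
    return budget_usd / hourly_rate_usd


B200_RATE_USD = 4.99  # starting price quoted in this thread; real prices vary

if __name__ == "__main__":
    for budget in (50, 200):
        hours = rental_hours(budget, B200_RATE_USD)
        print(f"${budget} buys about {hours:.1f} hours at ${B200_RATE_USD}/hour")
    # prints:
    # $50 buys about 10.0 hours at $4.99/hour
    # $200 buys about 40.1 hours at $4.99/hour
```

Swapping in the hourly rate of a cheaper card stretches the same budget proportionally.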
chainsaw10 3 hours ago
I’m intrigued by this course. However I’m also curious about its prerequisite:
> Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N) You should be comfortable with the basics of machine learning and deep learning.
Anyone have a good implementation-heavy self-study resource for those topics, or experience with the recorded lectures for those Stanford courses?
- alec_heif 3 hours ago
  I found the 2024 Spring CS224N course sufficient for this pre-requisite, coupled with the textbook (chapters 1-13). Like CS336, this one also has videos and assignments available, and it being from 2024 is not a problem since the basics are mostly the same as recent years. Notably this is not true for 336, which spends much more time discussing cutting edge techniques, so the 2026 version there is essential.
  Course: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1246...
  Lecture videos: https://www.youtube.com/playlist?list=PLoROMvodv4rOaMFbaqxPD...
  Textbook: https://web.stanford.edu/~jurafsky/slp3/
Oarch 3 hours ago
Oh this is brilliant, I've spent the last month doing something just like this. As a challenge, no libraries allowed besides Python standard libs (so no numpy).
Started with Word2Vec, built an RNN, then LSTM and am halfway through building transformer architecture.
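
For a sense of what the standard-library-only constraint looks like in practice, here is a minimal sketch of single-query scaled dot-product attention using nothing but `math` and plain lists. It is not the commenter's code. The function names `softmax` and `attention` and the list-of-lists layout are assumptions chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    query: list of d floats; keys: n lists of d floats;
    values: n lists of floats, all the same length.
    """
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(d)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    # Weighted average of the value vectors, one output dimension at a time
    return [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]


if __name__ == "__main__":
    # Identical keys give uniform weights, so the output is the mean of the values.
    print(attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 2.0]]))
```

Without numpy, every matrix operation becomes explicit loops like these, which is slow but makes each step easy to inspect.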
sonabinu 4 hours ago
I brought a group together to do this class using the YouTube videos and course materials available online. It is challenging but rewarding. We tackled it one lecture video per week. We started with over 30 learners, and by the last session we were down to 8.
AJRF 2 hours ago
Can anyone answer a question: what's the minimum viable GPU to follow along with this course at home?
I have a 5080 16GB. Does this course really need more than that?
- pell 2 hours ago
  The first section can be done on an M1 chip. I think the second one needs Triton support, so your 5080 should be fine.
  - AJRF 13 minutes ago
    Thank you
armas 3 hours ago
I independently worked on the first two assignments over the course of a year. I learned so much! I was wondering what other courses people took on afterwards :)
airstrike 5 hours ago
I wonder if people prefer to learn this on their own or if building a community around open learning is something that others are interested in
- danbrooks 4 hours ago
  I'd be interested in joining a discord server.
  Would be great to have a community to discuss the material - even if folks can't commit to the full course.
storus 6 hours ago
Thanks for releasing this again! What are this year's changes compared to prior offerings?
- marcelroed 4 hours ago
  TA here. Biggest changes are in the second assignment (distributed) where we added a bunch of memory, profiling and distributed tasks, as well as in the fifth assignment (alignment), where most of the RL tasks are fresh this year. Assignment 3 (scaling laws) was also completely updated, but in a way that might be difficult to run without substantial resources. I'm working on a way for external students to be able to run simulated experiments for free!
  Assignment 1 (basics) has the most hours of preparation invested in it, and only minor modernization/bug fixes were necessary this year.
tmule 6 hours ago
Are video lectures available online?
- Bilal_io 6 hours ago
  Youtube playlist link from the page https://www.youtube.com/watch?v=JuoVZkPBiKk&list=PLoROMvodv4...
- aerohit 6 hours ago
  https://www.youtube.com/watch?v=JuoVZkPBiKk&list=PLoROMvodv4...
- mindcrime 6 hours ago
  https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaT...
ChrisArchitect 4 hours ago
Related:
AI Agent Guidelines for CS336 at Stanford https://github.com/stanford-cs336/assignment1-basics/blob/ma... (https://news.ycombinator.com/item?id=48359232)
dominotw 4 hours ago
I recently started reading "build reasoning model from scratch", then I realized that I am not really interested in the building part and just want to understand the theory and practice behind it.
I want something like a casual, LessWrong-style, from-the-ground-up explanation.
- ianand 3 hours ago
  In that case I humbly suggest my talk from AI Engineer World's Fair https://www.youtube.com/watch?v=ZuiJjkbX0Og
  Gives you the basics on LLM internals in about 90 minutes and includes an already built model in JavaScript that you can step through in browser devtools to get as detailed as you want.