I've seen Claude ignore important parts of skills/agent files multiple times. I was running a clean up SKILL.md on a hundred markdown files, manually in small groups of 5, and about half the time it listened and ran the skill as written. The other half it would start trying to understand the codebase looking for markdown stuff for 2min, for no good reason, before reverting back to what the skill said.
Try this: Keep your CLAUDE.md as simple as possible, disable skills, and request Opus to start a subagent for each of the files and process at most 10 at a time (so you don't get rate limited) and give it the instructions in the skill for whatever processing you're doing to the markdowns as a prompt, see if that helps.
yes we had to tune the claude.md and the skill trigger quite a bit, to get it much better. But to be honest also 4.6 did improve it quite a bit. Did you run into your issues under 4.5 or 4.6?
> 26% of sessions are abandoned, most within the first 60 seconds
Starting new sessions frequently and using separate new sessions for small tasks is a good practice.
Keeping context clean and focused is a highly effective way to keep the agent on task. Having an up to date AGENTS.md should allow for new sessions to get into simple tasks quickly so you can use single-purpose sessions for small tasks without carrying the baggage of a long past context into them.
this jumped out at me too. What counts as "abandoned"? How do you know the goal was not simply met?
I have longer threads that I don't want to pollute with side quests. I will pull up multiple other chats and ask one or two questions about completely tangential or unrelated things.
insights is straight ego fluffing - it just tells you how brilliant you are and the only actionable insights are the ones hardcoded into the skill that appear for everyone. things like be very specific with the success criteria ahead of time (more than any human could ever possibly be), tell the llm exactly what steps to follow to the letter (instead of doing those steps yourself), use more skills (here's an example you can copy paste that has 2 lines and just tells it to be careful), and a couple of actually neat ideas (like having it use playwright to test changes visually after a UI change)
Ohh this is exciting, I kinda overlooked it. I assume there are still a lot of differences, especially for accross teams. But I immediately ran it, when I saw your comment. Actually still running.
I have seen numbers claiming tools are only called 59% of the time.
Saw another comment on a different platform where someone floated the idea of dynamically injecting context with hooks in the workflow to make things more deterministic.
From session analysis, it would be interesting to understand how crucial the documentation, the level of detail in CLAUDE.md, is.
It seems to me that sometimes documentation (that's too long and often out of date) contributes to greater entropy rather than greater efficiency of the model and agent.
It seems to me that sometimes it's better and more effective to remove, clean up, and simplify (both from CLAUDE.md and the code) rather than having everything documented in detail.
Therefore, from session analysis, it would be interesting to identify the relationship between documentation in CLAUDE.md and model efficiency. How often does the developer reject the LLM output in relation to the level of detail in CLAUDE.md?
Its our own sessions, from our team, over the last 3 months. We used them to develop the product and learn about our usage. You are right, they will remain closed. But I am happy to share aggregated information, if you have specific questions about the dataset.
> A local-first desktop and web app for browsing, searching, and analyzing your past AI coding sessions. See what your agents actually did across every project.
Our focus is a little bit more cross team, and in our internal version, we have also some continuous improvement monitoring, which we will probably release as well.
Hey, here is Rafa, another Rudel AI developer. The ultimate goal is to make developers more productive. Suddenly, we had everyone having dozens of sessions per day, producing 10X more code, we were having 10X more activity but not necessarily 10X productivity.
With this data, you can measure if you are spending too many tokens on sessions, how successful sessions are, and what makes them successful. Developers can also share individual sessions where they struggle with their peers and share learnings and avoid errors that others have had.
the skill usage was one of these "I am wondering about...." things, and we just prompted it into the dashboard to undertand it. We have some of these "hunches" where its easier to analyze having sessions from everyone together to understand similarities as well as differences.
And we answered a few of those kinda one off questions this way. Ongoing, we are also using a lot our "learning" tracking, which is not really usable right now, because it integrates with a few of our other things, but we are planning to release it also soon.
Also the single session view sometimes helps to debug a sessions, and then better guide a "learning".
So its a mix of different things, since we have multiple projects, we can even derive how much we are working on each project, and it kinda maps better than our Linear points :)
I was about to say they have a self-hosting guide, but I see they use third party services that seem absolutely pointless for such a tiny dataset. For comparison, I have a project that happily analyzes 150 million tokens worth of Claude session data w/some basic caching in plain text files on a $300 mini pc in seconds... If/when I reach billions, I might throw Sqlite into the stack. Maybe once I reach tens of billions, something bigger will be worthwhile.
One potential reason for sessions being abandoned within 60 seconds in my experience is realizing you forgot to set something in the environment: github token missing, tool set for the language not on the path, etc. Claude doesn't provide elegant ways to fix those things in-session so I'll just exit, fix up and start Claude again. It does have the option to continue a previous session but there's typically no point in these "oops I forgot that" cases.
The tool does have a quite detailed view for individual sessions. Which allows you to understand input and output much better, but obviously its still mysterious how the output is generated from that input.
I've seen Claude ignore important parts of skills/agent files multiple times. I was running a clean up SKILL.md on a hundred markdown files, manually in small groups of 5, and about half the time it listened and ran the skill as written. The other half it would start trying to understand the codebase looking for markdown stuff for 2min, for no good reason, before reverting back to what the skill said.
LLMs are far from consistent.
Try this: Keep your CLAUDE.md as simple as possible, disable skills, and request Opus to start a subagent for each of the files and process at most 10 at a time (so you don't get rate limited) and give it the instructions in the skill for whatever processing you're doing to the markdowns as a prompt, see if that helps.
https://scottspence.com/posts/measuring-claude-code-skill-ac...
This works in my experience
Try the latest skill-creator, has a/b testing
yes we had to tune the claude.md and the skill trigger quite a bit, to get it much better. But to be honest also 4.6 did improve it quite a bit. Did you run into your issues under 4.5 or 4.6?
> 26% of sessions are abandoned, most within the first 60 seconds
Starting new sessions frequently and using separate new sessions for small tasks is a good practice.
Keeping context clean and focused is a highly effective way to keep the agent on task. Having an up to date AGENTS.md should allow for new sessions to get into simple tasks quickly so you can use single-purpose sessions for small tasks without carrying the baggage of a long past context into them.
this jumped out at me too. What counts as "abandoned"? How do you know the goal was not simply met?
I have longer threads that I don't want to pollute with side quests. I will pull up multiple other chats and ask one or two questions about completely tangential or unrelated things.
I agree. In my experience: "single-purpose sessions for small tasks" is the key
For those unaware, Claude Code comes with a built in /insights command...
insights is straight ego fluffing - it just tells you how brilliant you are and the only actionable insights are the ones hardcoded into the skill that appear for everyone. things like be very specific with the success criteria ahead of time (more than any human could ever possibly be), tell the llm exactly what steps to follow to the letter (instead of doing those steps yourself), use more skills (here's an example you can copy paste that has 2 lines and just tells it to be careful), and a couple of actually neat ideas (like having it use playwright to test changes visually after a UI change)
Ohh this is exciting, I kinda overlooked it. I assume there are still a lot of differences, especially for accross teams. But I immediately ran it, when I saw your comment. Actually still running.
true, the best comes out of it when one uses claude code and codex as a tag team
I have seen numbers claiming tools are only called 59% of the time.
Saw another comment on a different platform where someone floated the idea of dynamically injecting context with hooks in the workflow to make things more deterministic.
interesting, where did you see that?
From session analysis, it would be interesting to understand how crucial the documentation, the level of detail in CLAUDE.md, is. It seems to me that sometimes documentation (that's too long and often out of date) contributes to greater entropy rather than greater efficiency of the model and agent.
It seems to me that sometimes it's better and more effective to remove, clean up, and simplify (both from CLAUDE.md and the code) rather than having everything documented in detail.
Therefore, from session analysis, it would be interesting to identify the relationship between documentation in CLAUDE.md and model efficiency. How often does the developer reject the LLM output in relation to the level of detail in CLAUDE.md?
This is a great idea, documented and added to our roadmap.
> content, the content or transcript of the agent session
Does this include the files being worked on by the agent in the session, or just the chat transcript?
file content is also be uploaded as well https://github.com/obsessiondb/rudel?tab=readme-ov-file#secu...
if you dont trust us with that data though (which i can understand) you can host that thing locally on your machine
is there a reason, other than general faith in humanity, to assume those '1573 sessions' are real?
I do not see any link or source for the data. I assume it is to remain closed, if it exists.
Its our own sessions, from our team, over the last 3 months. We used them to develop the product and learn about our usage. You are right, they will remain closed. But I am happy to share aggregated information, if you have specific questions about the dataset.
it's reasonable to note that w/o sharing the data these findings can't be audited or built upon
but i think the prior on 'this team fabricated these findings' is v low
It might be worthwhile to include some of an example run in your readme.
I scrolled through and didn’t see enough to justify installing and running a thing
Ah sorry, the readme is more about how to run the repo. The "product" information is rather on the website: https://rudel.ai
Reminds me https://www.agentsview.io/.
> A local-first desktop and web app for browsing, searching, and analyzing your past AI coding sessions. See what your agents actually did across every project.
Thx for the link - sounds great !
Our focus is a little bit more cross team, and in our internal version, we have also some continuous improvement monitoring, which we will probably release as well.
This is awesome! I’m working on the Open Prompt Initiative as a way for open source to share prompting knowledge.
Cool, whats the link? We have some learnings, especially in the "Skill guiding" part of our example.
So what conclusions have you drawn or could a person reasonably draw with this data?
Hey, here is Rafa, another Rudel AI developer. The ultimate goal is to make developers more productive. Suddenly, we had everyone having dozens of sessions per day, producing 10X more code, we were having 10X more activity but not necessarily 10X productivity.
With this data, you can measure if you are spending too many tokens on sessions, how successful sessions are, and what makes them successful. Developers can also share individual sessions where they struggle with their peers and share learnings and avoid errors that others have had.
yes what rafa said... aaand we see who wastes the 200 bucks claude subscription by not using it
I 100% agree that we need tools to understand and audit these workflows for opportunities. Nice work.
TBH, I am very hesitant to upload my CC logs to a third-party service.
you can host the whole thing locally :)
Why does it need login and cloud upload? A local cli tool analyzing logs should be sufficient.
We used it across the team, and when you want to bring metrics together across multiple people, its easier on a server, than local.
> That's it. Your Claude Code sessions will now be uploaded automatically.
No, thanks
It will be only enabled for the repo where you called the `enable` command. Or use the cli `upload` command for specific sessions.
Or you can run your own instance, but we will need to add docs, on how to control the endpoint properly in the CLI.
Why is the comment calling out the biggest issue with this so heavily downvoted? Privacy is a massive concern with this.
is this observability for your claude code calls or specifically for high level insights like skill usage?
would love to know your actual day to day use case for what you built
the skill usage was one of these "I am wondering about...." things, and we just prompted it into the dashboard to undertand it. We have some of these "hunches" where its easier to analyze having sessions from everyone together to understand similarities as well as differences. And we answered a few of those kinda one off questions this way. Ongoing, we are also using a lot our "learning" tracking, which is not really usable right now, because it integrates with a few of our other things, but we are planning to release it also soon. Also the single session view sometimes helps to debug a sessions, and then better guide a "learning". So its a mix of different things, since we have multiple projects, we can even derive how much we are working on each project, and it kinda maps better than our Linear points :)
How diverse is your dataset?
Team of 4 engineers, 1 data & business person, 1 design engineer.
I would say roughly equal amount of sessions between them (very roughly)
Also maybe 40% of coding sessions in large brownfield project. 50% greenfield, and remaining 10% non coding tasks.
Nice. Now, to vibe myself a locally hosted alternative.
I was about to say they have a self-hosting guide, but I see they use third party services that seem absolutely pointless for such a tiny dataset. For comparison, I have a project that happily analyzes 150 million tokens worth of Claude session data w/some basic caching in plain text files on a $300 mini pc in seconds... If/when I reach billions, I might throw Sqlite into the stack. Maybe once I reach tens of billions, something bigger will be worthwhile.
The docker-compose contain everything you should need: https://github.com/obsessiondb/rudel/blob/main/docker-compos...
Does it work for Codex?
Yes we added codex support, but its not yet extensively tested. Session upload works, but we kinda have to still QA all the analytics extraction.
One potential reason for sessions being abandoned within 60 seconds in my experience is realizing you forgot to set something in the environment: github token missing, tool set for the language not on the path, etc. Claude doesn't provide elegant ways to fix those things in-session so I'll just exit, fix up and start Claude again. It does have the option to continue a previous session but there's typically no point in these "oops I forgot that" cases.
This is so sad that on top of black box LLMs we also build all these tools that are pretty much black box as well.
It became very hard to understand what exactly is sent to LLM as input/context and how exactly is the output processed.
The tool does have a quite detailed view for individual sessions. Which allows you to understand input and output much better, but obviously its still mysterious how the output is generated from that input.