denysvitali 1 day ago
At KubeCon Europe a very good chunk of the booths were observability stacks. Everyone was claiming they're better than the competitors (with some of them justifying themselves just by saying "it's written in Rust").
Having dealt with Prometheus (+Thanos) / Grafana / OTEL and other stacks (e.g. custom solutions on ClickHouse, Victoria{Metrics,Logs}, Jaeger/Tempo, Loki, ...) and even cloud ones (Google's Monarch rebranded as Prometheus)... what's your selling point? This to me seems like yet another way to re-invent the wheel.
If it's just for running locally, okay, fine, but when it comes to production (where the stack really matters) at scale, you end up with lots of tradeoffs and approaches.
Why is this one a winning one compared to the overwhelming "competition"? Seems like we're re-inventing the wheel for the 100th time instead of focusing on unifying efforts to make the existing solutions better. Thankfully we now have OTEL, so at least the interoperability part is somewhat solved (or mitigated).
yuppiepuppie 1 day ago
I was thinking this might be a result of the cheap-money (post-COVID) era ending and everyone scrambling to reduce their Datadog/cloud costs. Thinking back on 2023/2024, lots of companies were leaking large amounts of capital to those vendors, and I imagine lots of people saw an opportunity to create leaner and cheaper stacks.
dusanstanojevic 18 hours ago
No need to guess, I'll tell you the exact story of why I made Traceway!
Last Dec I had a customer complaint; it took me 2 days to find the issue. I had to pay $800 for Sentry and a bit more for New Relic. The issue was a locking problem that happened only in very, very specific cases, erroring in different places and timing out in others; unfortunately, power users were running into it often. I had two systems, no SLO to catch this, and they were completely disconnected. Super annoying.
Anyhow, I spent a day looking at those and eventually went, screw this, I'm gonna just make this actually work. So I spent a few hours, hooked it up, no auth or anything nice, pulled the traces and found the issue. Turns out it was locking caused by a long transaction in a scheduled task that had existed for years.
The big thing for me is that it automatically flags issues and prioritizes them, taking into account errors, response codes, and timing. That's why I'm making it: no venture capital, funded by actual revenue from the start (not paying for Sentry or New Relic anymore). It's really a dev-focused tool to help smallish teams find and fix issues before customers even have time to complain.
Anyhow, hope that explains it, kinda related to cloud costs, mostly just my personal frustration with existing tools. Also I did NOT want to host a 5-service stack (Grafana, OTel Collector, Prometheus, Mimir, Loki, K8s) for something that can be done in a 60 MB Go binary that runs on a $3 server...
robertlagrant 1 day ago
This is my instinct too. I've had the pleasure of using DataDog and the pain of negotiating with their salespeople!
Ocha 1 day ago
Yes. Their salespeople don't even negotiate - they just tell you this is the price, done. Dunno why they need salespeople if prices are non-negotiable.
SOLAR_FIELDS 23 hours ago
It's because they have the leverage. Why would they negotiate when they don't need to? Tell them thanks and that you're churning tomorrow and watch the "OH WAIT"s come flying through the door. The insidious thing about Datadog is that it snakes its way into your entire business line, so it's really hard to extricate yourself from it down the road.
NortySpock 1 day ago
If I can ask a separate question: what scalability problems did you run into with Victoria{Metrics|Logs|Traces}, and at what scale did you hit them?
VictoriaMetrics and Logs have worked fine in my quiet homelab, and VictoriaMetrics appeared to work great for the infrastructure team of an open source online video game I contribute to (say, about 10 physical nodes and 20 applications/services)... I was going to suggest VictoriaLogs to them next but wanted to ask what roadblocks could come up.
dusanstanojevic 18 hours ago
I honestly think you are a bot. Whenever I see Victoria mentioned it is always the same: always asking about hitting a scaling problem + promoting it, never responding to any comments. Hope I'm wrong, but it's been one too many. I refuse to use a product that is this dishonest.
dusanstanojevic 18 hours ago
Hi, creator of Traceway here. Sorry for the late response, I didn't know this got posted, and then my account was being rate limited on comments.
A lot of tools in this space, most pretty good. The goals when I started Traceway were:
- simple to host and reason about
- cheap to host
- comes preconfigured for sub-15-dev teams
- completely open source, no paid add-ons
It's not aimed at teams that can afford SREs (yet); the idea was to provide a good tool for smaller teams and startups in the sub-15-dev range.
The base of Traceway is ClickHouse, nothing special there; if you want, you can run it with SQLite for self-hosting. Sessions are also stored in S3, so the costs are minimal.
It is opinionated: it comes with preconfigured SLOs for flagging issues with endpoints, and it will never try to sell you an AI SRE. You can file your exceptions/SLO issues with the Git integration and run whatever AI you want on it (I was sick of observability tools trying to sell me an AI). The goal is to have a one-line setup, via OpenTelemetry, that gets you everything you need in Traceway without anything needing to be additionally configured. It's Datadog/Sentry combined, but fully open sourced.
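To give a feel for what that flagging looks like, here is a heavily simplified sketch of the idea - placeholder thresholds and names, not the actual implementation:

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    // Span is a trimmed-down view of one recorded request.
    type Span struct {
        Status   int           // HTTP response code
        Duration time.Duration // server-side latency
        Err      bool          // an exception was recorded
    }

    // SLO holds default targets applied to every endpoint.
    // The numbers used below are placeholders, not real defaults.
    type SLO struct {
        MaxErrorRatio float64       // fraction of requests allowed to fail
        P95Latency    time.Duration // 95th percentile latency target
    }

    // Evaluate flags an endpoint whose recent spans breach the SLO.
    func (s SLO) Evaluate(endpoint string, spans []Span) (string, bool) {
        if len(spans) == 0 {
            return "", false
        }
        errors := 0
        durations := make([]time.Duration, 0, len(spans))
        for _, sp := range spans {
            if sp.Err || sp.Status >= 500 {
                errors++
            }
            durations = append(durations, sp.Duration)
        }
        if ratio := float64(errors) / float64(len(spans)); ratio > s.MaxErrorRatio {
            return fmt.Sprintf("%s: error ratio %.1f%% over target", endpoint, ratio*100), true
        }
        sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })
        if p95 := durations[len(durations)*95/100]; p95 > s.P95Latency {
            return fmt.Sprintf("%s: p95 latency %v over target", endpoint, p95), true
        }
        return "", false
    }

    func main() {
        slo := SLO{MaxErrorRatio: 0.01, P95Latency: 500 * time.Millisecond}
        spans := []Span{
            {Status: 200, Duration: 120 * time.Millisecond},
            {Status: 500, Duration: 30 * time.Second, Err: true}, // the locking/timeout case
        }
        if issue, bad := slo.Evaluate("/checkout", spans); bad {
            fmt.Println("flagged:", issue)
        }
    }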
I'm a huge fan of open source, here is what we've done so far for making existing solutions better:
1 - Session Replays/RUM
Session replays are usually a premium/expensive feature. With Traceway you can self-host them and add them to your app in minutes. I am working on making this a standalone feature that ties into the OTel SDKs for mobile/JS, so that you can get your spans/logs/metrics/exceptions from any platform connected to your session replays in Traceway. At one point I got nerd-sniped into making it work with Flutter, so we are the only solution I know of that has affordable, usable session replays for Flutter.
2 - Symfony OTel
Symfony, the PHP framework, had no library that offered a few-line setup and worked out of the box with OpenTelemetry. We wrote one; you can use it with any tool out there.
3 - Symbolicator
We're working on a symbolicator that will be OpenTelemetry Collector compatible, so that you can get your JS/Flutter/Android/iOS stack traces resolved. From what I can tell, no good solution exists for this currently.
I will make a proper HN post at some point with more info on the project; right now I am focusing on building. If you have any ideas or things you'd like to see, feel free to comment, join our Discord community, or open an issue in our Git repo - we're always happy to accept PRs.
yard2010 1 day ago
I have tried to self-host Grafana (Loki, Prom and Alloy) as an o11y stack for prepbook.app. This is hard. I have a BSc in CS, not that it means much. I managed to do it eventually, after some research. It was not plug-and-play in any way. The docs even kept saying this solution is not production-ready. I couldn't find the production guide, only the "forget about self-hosting and simply pay for us hosting this". After I deployed it, the UX was so abrasive my partner won't even try to go into it to figure out a problem. It was a few months ago. Since then new solutions have arrived and I'm waiting to have the time to migrate. I saw PostHog have a solution, but I prefer something I could self-host and completely own.
I thought: how come no one is trying to solve this problem? It looks like it's just a matter of time.
With that being said, my experience can be very skewed since prepbook is a passion project running on a VPS with essentially 0 scale. All I care about is the UX of the stack, not scale. Just for context.
embedding-shape 1 day ago
FWIW, I have no CS degree and barely attended school at all, and found Grafana + Prometheus + Loki fairly easy to set up, at least compared to what we used to use before those tools were available. Maybe it's because I used NixOS for the setup, but besides learning some new domain-specific things I didn't know before, I don't recall hitting any particular bumps or roadblocks. I also went the 100% self-hosted route (spread across two hosts at home).
What exactly were you struggling with when it came to the setup? Just a ton of new concepts to learn which took time, or something specific to Grafana/Prometheus/Loki?
dijit 1 day ago
"Getting it running" is the easy part.
"Getting it ready for production" is a different game.
I've fallen on my sword many times by trying to explain that Prometheus fails every metric of production readiness; in fact, Google themselves replaced Borgmon (Prometheus's ancestor) with Monarch because the "tiny unreliable time series databases everywhere" approach was, in fact, not the successful and reliable deployment strategy that they had claimed.
But, it is very easy to set up. Just don't go looking for failure modes, because they're everywhere and every single one of them is catastrophic.
denysvitali 1 day ago
There are ways to scale Prometheus (look at Thanos), but none of the solutions is really bug free.
See this PR for example (https://github.com/prometheus/prometheus/pull/18364) - this used to impact a production deployment I worked on. Prometheus, Thanos and even OpenTelemetry are full of those kinds of problems - but at the same time it's the best we have, and we should be grateful they're free and open source.
I'd still choose an open source stack (and contribute to it) rather than go for a proprietary solution - we've all seen what happens with DataDog & co.
Please don't take my words lightly: I worked with the rest of my team on a large-scale observability platform, and scalability should not be underestimated - at the same time, DataDog / Splunk prices are simply unjustified. It's ironically cheaper to build a team of engineers that will maintain a sane observability stack instead of feeding the monster(s).
otterley 1 day ago
> It's ironically cheaper to build a team of engineers that will maintain a sane observability stack instead of feeding the monster(s).
Can you show the math here? This is a very bold claim, and I’m super curious. A shared Google Sheet would work well.
embedding-shape 1 day ago
Well, I am running the stack in production right now, but everyone has a different understanding of what that actually means...
Do you have concrete examples of these catastrophic failures? I personally haven't experienced any myself during these years, but I'm doing very boring and typical stuff, so it wouldn't surprise me if there were still hard edges.
dijit 1 day ago
There's a difficult distinction here, you're right.
Technically even a single server running LAMP as root but taking frontend traffic meets the definition of in production but I think we all recognise that it's not the right idea.
What I'm referring to is: should the disk start to have issues, what does Prometheus do? If the scrapers start to stall due to connection timeouts, what does Prometheus do? If you are doing linear interpolation of data and you have massive gaps because you're polling opportunistically, what does Prometheus do?
I'm all about boring technology, but prometheus assumes too much happy path. It assumes that a single node is enough for time series data that is used for alerting.
Which, it is: at very small scale and with best effort reliability.
It's not acceptable as soon as lost data could be critically important in diagnosing major issues in billing systems, or actually billing users, or to infer issues that need to be correlated across multiple systems.
jmalicki 1 day ago
What is "the disk"? You've already failed by not running distributed. The problem isn't Prometheus, it's "the cloud is too expensive, I'll just run on a single VPS".
dijit 1 day ago
Prometheus does not run a distributed TSDB.
jmalicki 1 day ago
Oh right, I forgot that even existed, I am used to seeing it as a layer that gets your data to a distributed TSDB
embedding-shape 1 day ago
> should the disk start to have issues
If that happens, is Prometheus really the biggest of your worries here? Software breaks left and right when disks disappear from under it; I'm not sure this is either unexpected or unique to Prometheus.
> If the scrapers start to stall due to connection timeouts: what does prometheus do?
I'm having this "issue" all the time, as some of my WiFi-connected (less important) cameras are just barely within WiFi range, and I'm using Prometheus to scrape metrics from them. Sometimes a request times out, then the next time it doesn't, and everything just works? What's the issue you're experiencing with this exactly?
> It's not acceptable as soon as lost data could be critically important in diagnosing major issues in billing systems, or actually billing users, or
Wait, what? Billing systems? That stuff would go into your proper database, wouldn't it? Sure, if prometheus/node_exporter fails or whatever, you won't get metrics out of the host, but again, if those things start failing on that host, the host has bigger issues than "Prometheus sucks at scale".
I was eagerly awaiting to be educated about potential gaps in my understanding of Prometheus; instead it seems like you simply don't happen to like the way they do things. I was under the impression they did something wrong or something was broken, but these things just seem like the typical stuff you have to think about for any service you deploy.
dijit 1 day ago
Yes, my monitoring system not alerting me when the systems it runs on are failing is the entire problem.
That's not a general "software breaks when disks fail" situation: that's a monitoring system failing at its one job.
Your monitoring system failing silently when your infrastructure is under stress is precisely the failure mode that monitoring exists to prevent.
Zabbix solves this with native HA and self-checks. Prometheus makes it your problem to solve with external tooling, and most people don't, until they need it.
embedding-shape 1 day ago
Why wouldn't your monitoring system alert you when metrics suddenly disappear? Sounds like you need a better monitoring system, prometheus is not gonna magically solve that problem for you. No wonder you were having issues with prometheus...
dijit 1 day ago
I'm not sure what you mean.
Of course the systems that have to alert me to failure have to be designed with mechanisms to alert me to the fact that they themselves are failing.
Zabbix, Nagios, Munin - practically everything that existed before - understood this.
Prometheus doesn't, because it intentionally optimised for being easy to deploy and for there being a hierarchy of Prometheuses in a tree-like formation. Which makes sense, but forces a much more distributed and difficult-to-reason-about model.
Monitoring systems can't be designed for the happy path. By definition, they only matter when things are going wrong - which is precisely when the happy path isn't available. Prometheus is excellent when everything is fine (scaling aside). That's not when you need your monitoring system to be excellent.
embedding-shape 1 day ago
I think we're running really different monitoring setups. I'd never expect my alerting solution to still be able to alert me if it's down or degraded, nor would I expect my metrics-gathering software to alert me if it's down; that's why I have monitoring set up for those things in the first place.
But, I'm sure your setup makes as much sense in your context as mine makes in my context. As long as it works for you, we're all happy :)
dijit 1 day ago
"I have monitoring set up for those things" - but that doesn't solve the ambiguity. When Prometheus misses a scrape, nothing fires. Silence looks identical whether your service is down, the network blipped, or Prometheus itself is struggling. A defensive monitoring system has to treat absence of data as a signal, not just absence of a problem.
dusanstanojevic 18 hours ago
Hi, I'm the creator of Traceway.
I have created Traceway because I looked at that stack and decided that I'm not going to add 7 more services to my stack that could all fail and that I'd now have to maintain as well. Here is the list: Grafana, OTel Collector (to forward metrics), Prometheus, Loki, Tempo, Mimir, K8s.
This is not maintainable in production unless you have a person to manage it. My app had about 500-1000 req/sec; this sounds like a lot, but it's extremely light from the observability perspective. Why would I add 7 more points of failure - services that themselves need monitoring and proper resource allocation - for something like this? To add insult to injury, I would have to keep building my SLOs (they wouldn't be tracked automatically by default) and keep paying for Sentry, because the issue tracking is quite lacking in Grafana. Oh, almost forgot, I would also have to get an alerting provider or pay for that (maybe I'm wrong, it was 6 mo ago).
Anyhow, Traceway is a 60 MB binary in Go; it works with ClickHouse or SQLite, and the data is stored on S3 when not used. That means you can host it with SQLite on a $2 server or even a free tier and have it working for your side projects, or host it with managed ClickHouse and get auto-scalability on the DB level.
The goal is to provide full observability and tools to fix issues directly for developers. What we have so far: alerts, notifications, SSO (Google & GitHub), integrations, metrics, preconfigured SLOs, distributed tracing, RUM/session recordings (JS & Flutter).
Almost forgot: you'd need a symbolicator as well, or your frontend/mobile exception stack traces will be messed up in Grafana. I don't even know which tool they have for that, but it's always a new service to host and maintain...
SOLAR_FIELDS 23 hours ago
FWIW, if you come flying in saying you used NixOS to set something up you’re not what we would call a “casual user”
embedding-shape 22 hours ago
Why not? Hardly unheard of for managing infrastructure. If we were talking about desktop environments, then maybe, and to be fair, I never said I was a casual user, just that I didn't find prometheus particularly difficult to manage in a production environment.
SOLAR_FIELDS 6 hours ago
The implication is that by virtue of using NixOS, you're already a self-selected power user. The people who would find setting this thing up in production difficult and the people who would use NixOS have a very small overlap, if any, on that Venn diagram.
embedding-shape 39 minutes ago
NixOS is an additional thing on top of prometheus, not a replacement. Not sure why it'd dictate how easy/hard it is to run prometheus & co, you still have to know the same stuff as without it.
parliament32 1 day ago
FWIW we've also tried all sorts of different things, and honestly the very vanilla (prometheus -> central thanos, fluentbit -> central loki, grafana) ends up on top. The resource consumption is surprisingly minimal (for a sense of scale, we run about 200k eps for metrics and 1k eps for logs). For all these solutions, I find myself asking the same question as you.. what problem are you trying to solve? Is there anything actually different about your product other than less stability than the battle-tested stack?
ting0 1 day ago
Do you think Prometheus + Grafana is the way to go?
denysvitali 1 day ago
Really depends on the use case. Home lab? Probably.
Production? As soon as you scale you need a proper solution. Prometheus (by itself) doesn't scale - you need Mimir or Thanos (or similar).
ClickHouse (the "ClickStack") seems to be the new kid on the block. Looks very promising.
parliament32 1 day ago
Note ClickHouse is quite old (2010ish?), but they've always been a "web server access log analytics" solution. The pivot to "we do observability too" is new; we'll see how that plays out. Not terribly optimistic given how badly a similar pivot went for Elastic, but who knows.
dusanstanojevic 19 hours ago
ClickHouse is just a database, but it has a really neat feature: infrequently accessed data is pushed back to S3, minimizing costs. It also heavily compresses the data when storing it. This is why it's uniquely suitable for telemetry data and why I've used it as the base of Traceway.
I am the creator of Traceway and it's my all-time favorite database. Having said that, the repositories in Traceway are completely modular; I've implemented the SQLite version so that I can skip Docker containers locally and to simplify self-hosting for side projects (it runs on like $2 servers without issues).
They acquired HyperDX because it was a major ClickHouse user - their whole platform was telemetry on top of ClickHouse. I hope they don't fully pivot into the space, as it would be quite awkward, but there are alternatives and I can always redo the repositories with a different storage engine/DB.
NortySpock 1 day ago
I thought observability was shoved onto ClickHouse by other stacks deciding to use ClickHouse as their recommended database for observability (SigNoz springs to mind, but they were not the only one)
valyala 10 hours ago
VictoriaMetrics CTO here.
We at VictoriaMetrics took another approach. We tried using ClickHouse as a database for metrics in 2017, but then decided to implement a specialized database for metrics. This database uses ClickHouse architecture ideas to achieve the best performance and the lowest resource usage. The main difference between ClickHouse and VictoriaMetrics is that VictoriaMetrics is optimized solely for typical observability tasks. It supports all the popular data ingestion protocols, and provides a PromQL-compatible querying API, a Graphite-compatible querying API, Prometheus-compatible service discovery and relabeling, and Prometheus-compatible alerting and recording rules. It provides a built-in web UI for quick exploration and analysis of the ingested metrics, with the ability to investigate the source of high cardinality. It consists of a single small executable (~20 MB) without external dependencies, with minimum configs and minimum maintenance. See https://altinity.com/wp-content/uploads/2021/11/How-ClickHou... for more details.
We used the same approach for building VictoriaLogs - a specialized database for logs. It uses the most appropriate architecture ideas from ClickHouse to achieve high performance and low resource usage. It is schemaless and zero-config. It consists of a single small executable without external dependencies. It accepts logs via popular data ingestion protocols. It provides a specialized query language for typical queries over production logs - LogsQL. This language is much simpler to use than SQL for querying typical logs. It provides a built-in web UI for quick exploration of the ingested logs, a Grafana plugin for building arbitrarily complex dashboards from the stored logs, and the ability to build alerts and metrics from the stored logs. See https://docs.victoriametrics.com/victorialogs/faq/#what-is-t...
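To illustrate what "PromQL-compatible querying API" means in practice: the standard /api/v1/query endpoint works against VictoriaMetrics with nothing vendor-specific, so existing Prometheus client code can be pointed at it unchanged. A minimal sketch (the address and metric name below are placeholders):

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "net/url"
    )

    // instantQueryResponse covers the standard Prometheus /api/v1/query shape.
    type instantQueryResponse struct {
        Status string `json:"status"`
        Data   struct {
            Result []struct {
                Metric map[string]string `json:"metric"`
                Value  [2]any            `json:"value"` // [unix timestamp, value string]
            } `json:"result"`
        } `json:"data"`
    }

    func main() {
        // Placeholder address; point this at your VictoriaMetrics instance.
        base := "http://localhost:8428"
        q := url.Values{"query": {`sum(rate(http_requests_total[5m])) by (job)`}}

        resp, err := http.Get(base + "/api/v1/query?" + q.Encode())
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out instantQueryResponse
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            panic(err)
        }
        for _, r := range out.Data.Result {
            fmt.Printf("%v => %v\n", r.Metric["job"], r.Value[1])
        }
    }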
denysvitali 22 hours ago
I mean, the idea of using OTEL with ClickHouse is rather new, and solves the most painful part of metrics: high cardinality. Has its use-cases, but for sure comes with its own problems
drzaiusx11 19 hours ago
We're on AWS Managed Prometheus + Grafana in production and it certainly scales just fine, although I'm sure under the hood it's an entirely different beast than FOSS Prometheus; likely only AWS engineers truly know.
CyberDildonics 1 day ago
Is "observability stack" the new term for logs and stats?
denysvitali 22 hours ago
You have more than that nowadays. Tracing and profiling are part of O11y too
dusanstanojevic 1 day ago
Hi, I am the creator of Traceway. I've just realized that someone posted about it.
Unfortunately my account is being rate limited and I can't respond to each comment.
Thank you for your support - the attention the project has received has been unreal.
I'll be responding to everyone as the rate limit subsides, but I've made this in the meantime: https://github.com/tracewayapp/traceway/blob/main/HN.md
Again, thank you for your support!
I was looking into this just yesterday. The Loki + … comparison is a bit off in the open-source space; the main ones here are SigNoz and ClickStack, both using ClickHouse as the database. Heavy compared to something like Loki, but they are OTel-native, not log monitoring, so not in the same category.
jillesvangurp 2 days ago
I used SigNoz + ClickStack on a vibe-coded Go server project a few weeks ago. I just made Codex figure out how to set up SigNoz + dependencies via Docker Compose. I even got it to pre-populate SigNoz with dashboards. It wasn't too bad. The whole thing runs in a few GB. I tried to cover metrics, tracing, and logging at the same time. This is not a production-ready setup, but you need to trade off cost vs. utility here. If it's useful enough, that could justify the extra cost.
I have a background in having done a lot of stuff on the Elastic stack related to this; including setting up a big Elastic Fleet based stack for one client at some point. It might not be the cheapest, but it does provide awesome filtering and querying capabilities. However, a lot of teams that use it don't really know how to tap into that capability so it tends to be overengineered for what it does in the end. And the extra, underutilized complexity is why a lot of teams are wary of dealing with that stack.
Storing the data is the easy part, but what's the point if you can't run queries against it and produce dashboards and diagnostic tools that actually help you? Prometheus/Grafana or older Graphite-type setups tend to be compromises where you get lots of data but are then limited on the querying front or the number of metrics. The tradeoff is always between scale and querying flexibility. If you store tens/hundreds of GB of telemetry per day, you need a way to make sense of it. ClickHouse seems to be quite good at scaling and querying. It's basically a column database. I don't have direct experience with Loki.
But in the end, all that power only matters if people actually use it. And, again, in my experience teams tend not to. They tend to have a lot of unrealized aspirations around their tools and infrastructure. If it's just a dumping ground for data + a few simplistic dashboards, optimize for that. A lot of that data is actually only kept for compliance/auditing reasons. For that, querying is usually a secondary concern and it's OK if queries take a bit longer and are less powerful.
tecoholic 1 day ago
I agree. The sentiment applies to most analytics. People who set up analytics are not the same as the end users.
dusanstanojevic 23 hours ago
You're absolutely on point with this. I've made the perf tracking opinionated, so it comes preconfigured with SLOs that are good defaults for most projects - the kind nobody would otherwise bother to set up.
Traceway has custom dashboards, supports OTel logs/traces/metrics/exceptions fully, has session replays for web and Flutter (working on iOS/Android now), has alerting integrations with Slack/email/GitHub, OAuth login with Google/GitHub, and a bunch of other features... All MIT. None behind a paywall.
It has a specific set of trade-offs; those are by design, but I am also always open to changing them and improving it. If you try it and have any thoughts, the GitHub issues are constantly monitored.
dusanstanojevic 23 hours ago
Agreed, it's a trade-off I am ok with for now.
In reality it's a very modular system; the telemetry repositories can be swapped out easily. I have implemented a ClickHouse and a SQLite version (to simplify self-hosting), so adding a Loki-like repository would be a breeze. It's not on the roadmap currently, as I am putting a lot of effort into three different parts right now.
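Roughly, the seam is a small repository interface with one implementation per backend - a simplified sketch with hypothetical names, not the actual Traceway code:

    package main

    import (
        "context"
        "time"
    )

    // Span is a trimmed-down telemetry record; a real model has far more fields.
    type Span struct {
        TraceID  string
        Endpoint string
        Start    time.Time
        Duration time.Duration
        Status   int
    }

    // SpanRepository is the seam: everything above it (SLOs, issue flagging,
    // dashboards) talks only to this interface, never to a concrete database.
    type SpanRepository interface {
        Insert(ctx context.Context, spans []Span) error
        QueryByEndpoint(ctx context.Context, endpoint string, from, to time.Time) ([]Span, error)
    }

    // clickhouseRepo and sqliteRepo would each satisfy the interface with
    // backend-specific SQL; stubbed out here.
    type clickhouseRepo struct{ /* clickhouse connection */ }
    type sqliteRepo struct{ /* database/sql + sqlite driver */ }

    func (r *clickhouseRepo) Insert(ctx context.Context, spans []Span) error { return nil }
    func (r *clickhouseRepo) QueryByEndpoint(ctx context.Context, endpoint string, from, to time.Time) ([]Span, error) {
        return nil, nil
    }
    func (r *sqliteRepo) Insert(ctx context.Context, spans []Span) error { return nil }
    func (r *sqliteRepo) QueryByEndpoint(ctx context.Context, endpoint string, from, to time.Time) ([]Span, error) {
        return nil, nil
    }

    // NewRepository picks the backend from config: SQLite for cheap
    // self-hosting, ClickHouse when you need scale.
    func NewRepository(backend string) SpanRepository {
        if backend == "sqlite" {
            return &sqliteRepo{}
        }
        return &clickhouseRepo{}
    }

    func main() {
        repo := NewRepository("sqlite")
        _ = repo.Insert(context.Background(), nil)
    }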
The truth is that ClickHouse is an incredible DB that scales really well for observability data.
adenta 2 days ago
I'm partial to OpenObserve, especially because in Ruby the OTel stuff isn't great for metrics and logs yet.
dusanstanojevic 23 hours ago
When I was starting Traceway I was heavily inspired by Skylight (skylight.io) from the Ruby ecosystem. I loved their SLOs/ranking of perf issues, but I also wanted the features that Sentry offered, in one place.
lytedev 2 days ago
I also run OpenObserve at home, but I can't help but feel that the interface could use some... sparkle, and the mobile experience kinda sucks.
But you can't beat the excellent price and performance. Does what I need and much more
blazarquasar 16 hours ago
Given the heavy LLM usage, I'd probably be a little concerned about the project's longevity. I personally also can't stand seeing that typeface on websites anymore...
ddux1389 1 day ago
Hey everyone, I'm the original creator of this project. Just saw this thread, I'll do my best to respond to everyone.
amne 1 day ago
How can you claim in the readme "no per-language vendor SDK" and then link to a list of per-language client SDKs?
dusanstanojevic 23 hours ago
Hi, sorry for not responding sooner, didn't realize this post existed.
Traceway is fully OTel compliant.
Go: The original version started with Go SDKs. I've since moved to using Go OTel. I haven't updated those docs yet because the Go SDKs still work and are used in the wild, but they're on the deprecation track. Thanks for pointing it out.
Symfony: There were no good one-line OTel integrations out there for Symfony, so we wrote one. It is not a custom SDK, it's an OTel configurator. You can use it with any backend, not just Traceway. We're firm believers in contributing back to the OpenTelemetry community.
Frontend / mobile: This is more complicated. The current frontend and mobile OTel spec does not allow session replays to be sent, so for those platforms we still keep SDKs with a custom protocol alongside OTel. As soon as the spec matures I'm hoping to move it fully to OTel.
danparsonson 1 day ago
Aren't they two different things? Vendor SDKs to get the data in, client SDKs as an option to get the data out?
oulipo2 2 days ago
There are a few contenders in self-hostable OTel:
- ClickStack (ex HyperDX)
- SigNoz
- Traceway
- a few more
Does anyone have enough feedback on those to be able to tell which one works best?
Creator of Traceway here. Sorry for not responding sooner, didn't realize this HN post existed.
I saw it recently, I think it looks amazing, I haven't looked into it enough to know of any downsides. I am currently heads down in building as I have the roadmap cut out for the next few months, I will circle back to them as soon as I have a bit more time.
If you're familiar with their platform, feel free to check out Traceway and let me know if there are any incredible features you'd like to see in Traceway or anything they're missing. I am always looking for feedback!
sgt 2 days ago
Funny, the first thing I look for in infra projects like these is to find out if it's written in Go. At that point, my confidence level increases.
I'm the main contributor to Traceway, I LOVE Elixir! Traceway is strictly for monitoring your app, not the actual usage/product analytics. It's for making sure you know how well your backend is performing and to be able to quickly fix issues that show up.
neya 13 hours ago
Hey, thanks for responding! I was just pointing out to the parent comment that going by a programming language to judge app quality is a very relative metric, so I just used Elixir to help them understand. Thanks again for your hard work on the project!
sexylinux 1 day ago
Why is it better? On the internet it is not enough to just say something. You need to deliver some facts and/or a comparison. Please try it.
etiam 1 day ago
Do you have any proof of that?
neya 13 hours ago
Nope. On the internet, I don't owe anyone anything, especially to someone who created a new account just to argue. Do your own research. And using your real account will elicit better quality responses. Please try it.
ddux1389 1 day ago
Go has been incredible for building Traceway, glad you like it too
ting0 1 day ago
This looks cool
ddux1389 1 day ago
Thank you
ArslanS1997 1 day ago
This is awesome bro
ddux1389 1 day ago
Not the OP, but I am the one making Traceway, thank you
RGJorge 1 day ago
The "easy to set up" framing usually skips the hardest part: whether the metric you're alerting on is meaningful. Most stacks pull container memory from cAdvisor's `container_memory_usage_bytes`, which is the
same broken `memory_stats.usage` that `docker stats` reports — includes the kernel's reclaimable page cache. For DB containers with hot working sets, that metric stays at 95%+ constantly. Beautiful Grafana
dashboards alerting on a structurally wrong number. The fix is computing real anonymous memory (subtract active_file + inactive_file) — most stacks leave that as a custom exporter exercise. Curious how Traceway handles this out of the box.
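For reference, a minimal sketch of that subtraction in Go, reading cgroup v2 accounting directly (assumes the unified hierarchy at the standard in-container path; cgroup v2 also exposes an "anon" key in memory.stat directly, so treat this as illustrative):

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strconv"
        "strings"
    )

    // readUint parses a single-value cgroup file such as memory.current.
    func readUint(path string) (uint64, error) {
        b, err := os.ReadFile(path)
        if err != nil {
            return 0, err
        }
        return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
    }

    // statValue pulls one key (e.g. "active_file") out of memory.stat.
    func statValue(path, key string) (uint64, error) {
        f, err := os.Open(path)
        if err != nil {
            return 0, err
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            fields := strings.Fields(sc.Text())
            if len(fields) == 2 && fields[0] == key {
                return strconv.ParseUint(fields[1], 10, 64)
            }
        }
        return 0, fmt.Errorf("%s not found in %s", key, path)
    }

    func main() {
        const cg = "/sys/fs/cgroup" // v2 unified hierarchy as seen in-container
        usage, err := readUint(cg + "/memory.current")
        if err != nil {
            panic(err)
        }
        active, _ := statValue(cg+"/memory.stat", "active_file")
        inactive, _ := statValue(cg+"/memory.stat", "inactive_file")

        // Page cache is reclaimable; subtract it to approximate memory the
        // kernel cannot simply drop under pressure (the commenter's formula).
        anon := usage - active - inactive
        fmt.Printf("usage=%d B, non-reclaimable(approx)=%d B\n", usage, anon)
    }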