You Wouldn't Download a Hacker News

(jasonthorsness.com)

166 points | by jasonthorsness 9 hours ago

65 comments

  • mattkevan 40 minutes ago

    I did something similar a while back to the @fesshole Twitter/Bluesky account. Downloaded the entire archive and fine-tuned a model on it to create more unhinged confessions.

    Was feeling pretty pleased with myself until I realised that all I’d done was teach an innocent machine about wanking and divorce. Felt like that bit in a sci-fi movie where the alien/super-intelligent AI speed-watches humanity’s history and decides we’re not worth saving after all.

  • bambax an hour ago

    > Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots on it and run them as contributors, slowly and inevitably replacing all human text with the output of a chinese room oscillator perpetually echoing and recycling the past.

    The author said this in jest, but I fear someone, someday, will try this; I hope it never happens but if it does, could we stop it?

    • icoder 36 minutes ago

      I'm more and more convinced of an old idea that seems to become more relevant over time: to somehow form a network of trust between humans so that I know that your account is trusted by a person (you) that is trusted by a person (I don't know) [...] that is trusted by a person (that I do know) that is trusted by me.

      Lots of issues there to solve, privacy being one (the links don't have to be known to the users, but in a naive approach they are there on the server).

      Paths of distrust could be added as negative weight, so I can distrust people directly or indirectly (based on the accounts that they trust) and that lowers the trust value of the chain(s) that link me to them.

      Because it's a network, it can adjust itself to people trying to game the system, but it remains a question as to how robust it will be.

      • littlestymaar 28 minutes ago

        Ultimately, guaranteeing common trust between citizens is a fundamental role of the State.

        For a mix of ideological reasons and a lack of genuine interest in the internet from legislators, mainly due to the generational factor, it hasn't happened yet, but I expect government-issued equivalents of IDs and passports for the internet to become mainstream sooner rather than later.

      • drcongo 14 minutes ago

        I actually built this once, a long time ago for a very bizarre social network project. I visualised it as a mesh where individuals were the points where the threads met, and as someone's trust level rose, it would pull up the trust levels of those directly connected, and to a lesser degree those connected to them - picture a trawler fishing net and lifting one of the points where the threads meet. Similarly, a user whose trust lowered over time would pull their connections down with them. Sadly I never got to see it at the scale it needed to become useful as the project's funding went sideways.

      • XorNot 28 minutes ago

        I think technically this is the idea that GPG's web of trust was circling without quite staring at, which is the oddest thing about the protocol: it's used mostly today for machine authentication, which it's quite good at (i.e. deb repos)...but the tooling actually generally is oriented around verifying and trusting people.

    • holuponemoment 42 minutes ago

      Does it even matter?

      Perhaps I am jaded but most if not all people regurgitate about topics without thought or reason along very predictable paths, myself very much included. You can mention a single word covered with a muleta (Spanish bullfighting flag) and the average person will happily run at it and give you a predictable response.

      • bob1029 36 minutes ago

        It's like a Pavlovian response in me to respond to anything SQL or C# adjacent.

        I see the exact same in others. There are some HN usernames that I have memorized because they show up deterministically in these threads. Some are so determined it seems like a dedicated PR team, but I know better...

    • nashashmi an hour ago

      We LLMs only output the average response of humanity because we can only give results that are confirmed by multiple sources. On the contrary, many of HN’s comments are quite unique insights that run contrary to the average popular thought. If this is ever to be emulated by an LLM, we would give only gibberish answers. If we had a filter to that gibberish to only permit answers that are reasonable and sensible, our answers would be boring and still be gibberish. In order for our answers to be precise, accurate and unique, we must use something other than LLMs.

    • miki123211 an hour ago

      How do you know it isn't already happening?

      With long and substantive comments, sure, you can usually tell, though much less so now than a year or two ago. With short, 1 to 2 sentence comments though? I think LLMs are good enough to pass as humans by now.

    • no_time an hour ago

      I can’t think of a solution that preserves the open and anonymous nature that we enjoy now. I think most open internet forums will go one of the following routes:

      - ID/proof of human verification. Scan your ID, give me your phone number, rotate your head around while holding up a piece of paper, etc. Note that some sites already do this by proxy when they whitelist like 5 big email providers they accept for a new account.

      - Going invite only. Self explanatory and works quite well to prevent spam, but limits growth. lobste.rs and private trackers come to mind as an example.

      - Playing whack-a-mole with spammers (and losing eventually). 4chan does this by requiring you to solve a captcha and pass the Cloudflare Turnstile, which may or may not do some browser fingerprinting/bot detection. CF is probably pretty good at deanonymizing you through this process too.

      All options sound pretty grim to me. I'm not looking forward to the AI spam era of the internet.

      • icoder 26 minutes ago

        I'm sometimes thinking about account verification that requires work/effort over time, could be something fun even, so that it becomes a lot harder to verify a whole army of them. We don't need identification per se, just being human and (somewhat) unique.

        See also my other comment on the same parent wrt network of trust. That could perhaps vet out spammers and trolls. On one hand it seems far-fetched and a quite underdeveloped idea; on the other hand, social interaction (including discussions like these) as we know it is in serious danger.

      • dns_snek 32 minutes ago

        There must be a technical solution to this based on some cryptographic black magic that both verifies you to be a unique person to a given website without divulging your identity, and without creating a globally unique identifier that would make it easy to track us across the web.

        Of course this goes against the interests of tracking/spying industry and increasingly authoritarian governments, so it's unlikely to ever happen.

    • ahoka an hour ago

      Probably already happening.

    • drcongo 10 minutes ago

      The internet is going to become like William Basinski's Disintegration Loops, regurgitating itself with worse fidelity until it's all just unintelligible noise.

    • _Algernon_ 44 minutes ago

      This is probably already happening to some extent. I think the best we can hope for is xkcd 810: https://xkcd.com/810/

  • montebicyclelo 2 hours ago

    There are also two DBs I know of that have an updated Hacker News table for running analytics on without needing to download it first.

    - BigQuery, (requires Google Cloud account, querying will be free tier I'd guess) — `bigquery-public-data.hacker_news.full`

    - ClickHouse, no signup needed, can run queries in browser directly, [1]

    [1] https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...

  • SilverBirch an hour ago

    What is the netiquette of downloading HN? Do you ping Dang and ask him before you blow up his servers? Or do you just assume at this point that every billion dollar tech company is doing this many times over so you probably won't even be noticed?

    • euroderf an hour ago

      Not to mention three-letter agencies, incidentally attaching real names to HN monikers ?

    • krapp 16 minutes ago

      HN has an API, as mentioned in the article, which isn't even rate limited. And all of the data is hosted on Firebase, which is a YC company. It's fine.
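
      For reference, a minimal sketch of fetching one item through that API. The endpoint pattern is the documented Firebase one; the `Item` struct covers only a subset of the returned fields, and the names `itemURL` and `parseItem` are hypothetical, chosen for illustration:

      ```go
      package main

      import (
      	"encoding/json"
      	"fmt"
      	"io"
      	"net/http"
      )

      // Item holds a subset of the fields the HN API returns for an item.
      type Item struct {
      	ID    int64   `json:"id"`
      	Type  string  `json:"type"`
      	By    string  `json:"by"`
      	Time  int64   `json:"time"` // Unix seconds
      	Title string  `json:"title"`
      	Text  string  `json:"text"`
      	Kids  []int64 `json:"kids"` // IDs of direct replies
      }

      // itemURL builds the public API URL for one item ID.
      func itemURL(id int64) string {
      	return fmt.Sprintf("https://hacker-news.firebaseio.com/v0/item/%d.json", id)
      }

      // parseItem decodes the JSON body of a single item response.
      func parseItem(data []byte) (Item, error) {
      	var it Item
      	err := json.Unmarshal(data, &it)
      	return it, err
      }

      func main() {
      	resp, err := http.Get(itemURL(8863))
      	if err != nil {
      		fmt.Println("request failed:", err)
      		return
      	}
      	defer resp.Body.Close()

      	body, err := io.ReadAll(resp.Body)
      	if err != nil {
      		fmt.Println("read failed:", err)
      		return
      	}
      	it, err := parseItem(body)
      	if err != nil {
      		fmt.Println("decode failed:", err)
      		return
      	}
      	fmt.Println(it.Type, it.By, it.Title, len(it.Kids))
      }
      ```

      Walking a whole thread would mean repeating the request for each ID in `Kids`, which is where a polite delay between requests would go.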

  • userbinator 2 hours ago

    I had a 20 GiB JSON file of everything that has ever happened on Hacker News

    I'm actually surprised at that volume, given this is a text-only site. Humans have managed to post over 20 billion bytes of text to it over the 18 years that HN existed? That averages to over 2MB per day, or roughly 35 to 40 bytes per second.
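
    A quick check of that arithmetic, as a sketch assuming exactly 20 GiB and 18 years of 365.25 days (both are round figures from this thread, not measured values); `averageRate` is a hypothetical helper:

    ```go
    package main

    import "fmt"

    // averageRate returns the mean bytes per day and per second for
    // totalBytes written over the given number of years.
    func averageRate(totalBytes, years float64) (perDay, perSecond float64) {
    	days := years * 365.25
    	perDay = totalBytes / days
    	perSecond = perDay / (24 * 60 * 60)
    	return perDay, perSecond
    }

    func main() {
    	const gib = 1 << 30
    	perDay, perSecond := averageRate(20*gib, 18)
    	fmt.Printf("%.2f MB/day, %.1f bytes/s\n", perDay/1e6, perSecond)
    }
    ```

    Under those assumptions the average works out to roughly 3.3 MB per day, or a little under 40 bytes per second.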

    • sph 2 hours ago

      2 MB per day doesn't sound like a lot. The amount of posts probably has increased exponentially over the years, especially after the Reddit fiasco when we had our latest, and biggest neverending September.

      Also, I bet a decent amount of that is not from humans. /newest is full of bot spam.

      • samplatt 2 hours ago

        Plus the JSON structure metadata, which for the average comment is going to add, what, 10%?

      • FabHK an hour ago

        Around one book every 12 hours.

  • jakegmaths 5 hours ago

    Your query for Java will include all instances of JavaScript as well, so you're over representing Java.

    • smarnach 4 hours ago

      Similarly, the Rust query will include "trust", "antitrust", "frustration" and a bunch of other words
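
      One way to avoid both kinds of false positive is to match whole words only. A minimal sketch, assuming the comment text is available as a plain string; the helper name `mentions` is hypothetical and this is not the article's actual query:

      ```go
      package main

      import (
      	"fmt"
      	"regexp"
      )

      // mentions reports whether text contains term as a whole word,
      // ignoring case.
      func mentions(text, term string) bool {
      	// \b is an ASCII word boundary in Go's regexp syntax; QuoteMeta
      	// escapes any regex metacharacters in the term itself.
      	re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(term) + `\b`)
      	return re.MatchString(text)
      }

      func main() {
      	fmt.Println(mentions("I rewrote it in Rust", "rust"))      // true
      	fmt.Println(mentions("antitrust and frustration", "rust")) // false
      	fmt.Println(mentions("JavaScript everywhere", "java"))     // false
      }
      ```

      Word boundaries only help for names made of letters and digits; a term such as "C++" ends in non-word characters, so the trailing `\b` would need different handling.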

      • sph 2 hours ago

        A guerilla marketing plan for a new language is to call it a common one-syllable word, so that it appears much more prominent than it really is on badly-done popularity contests.

        Call it "Go", for example.

        (Necessary disclaimer for the irony-impaired: this is a joke and an attempt at being witty.)

        • setopt an hour ago

          Let’s make a language called “A” in that case. (I mean C was fine, so why not one letter?)

        • InDubioProRubio 2 hours ago

          You also wouldn't acronym hijack overload to boost mental presence in gamers LOL

      • matsemann 3 hours ago

        Reminded me of the Scunthorpe problem https://en.wikipedia.org/wiki/Scunthorpe_problem

    • jasonthorsness 5 hours ago

      Ah right… maybe even more unexpected then to see a decline

      • cs02rm0 2 hours ago

        I'm not so sure, while Java's never looked better to me, it does "feel" to me to be in significant decline in terms of what people are asking for on LinkedIn.

        I'd imagine these days typescript or node might be taking over some of what would have hit on javascript.

  • ashish01 6 hours ago

    I wrote one a while back https://github.com/ashish01/hn-data-dumps and it was a lot of fun. One thing that would be cool to implement: more recent items change more over time, so recently downloaded items go stale faster than older ones.

    • jasonthorsness 5 hours ago

      Yeah I’m really happy HN offers an API like this instead of locking things down like a bunch of other sites…

      I used a function based on the age for staleness, it considers things stale after a minute or two initially and immutable after about two weeks old.

          // DefaultStaleIf marks stale at 60 seconds after creation, then frequently for the first few days after an item is
          // created, then quickly tapers after the first week to never again mark stale items more than a few weeks old.
          const DefaultStaleIf = "(:now-refreshed)>" +
              "(60.0*(log2(max(0.0,((:now-Time)/60.0))+1.0)+pow(((:now-Time)/(24.0*60.0*60.0)),3)))"

      https://github.com/jasonthorsness/unlurker/blob/main/hn/core...
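
      To see how that threshold behaves, the right-hand side of the expression can be evaluated on its own. A minimal sketch, assuming `:now` and `Time` are Unix timestamps in seconds and that an item counts as stale once the time since its last refresh exceeds the computed value; the function name `staleThreshold` is hypothetical and not part of the linked repository:

      ```go
      package main

      import (
      	"fmt"
      	"math"
      )

      // staleThreshold evaluates the right-hand side of the DefaultStaleIf
      // expression for an item that is ageSeconds old.
      func staleThreshold(ageSeconds float64) float64 {
      	minutes := ageSeconds / 60.0
      	days := ageSeconds / (24.0 * 60.0 * 60.0)
      	// Logarithmic growth dominates early; the cubic term dominates later.
      	return 60.0 * (math.Log2(math.Max(0.0, minutes)+1.0) + math.Pow(days, 3))
      }

      func main() {
      	const day = 24.0 * 60.0 * 60.0
      	for _, age := range []float64{60, 3600, day, 7 * day, 40 * day} {
      		fmt.Printf("age %10.0fs -> refresh if last fetch is older than %12.0fs\n",
      			age, staleThreshold(age))
      	}
      }
      ```

      Under these assumptions the threshold is about 60 seconds for a one-minute-old item, and the cubic term eventually makes it larger than the item's own age (at roughly 38 days by this arithmetic), after which an item that has been fetched once would not be marked stale again.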

  • flakiness 5 hours ago

    I have done something similar. I cheated by using the BigQuery dataset (which somehow keeps getting updated): I exported the data to Parquet, downloaded it, and queried it using DuckDB.

    • minimaxir 4 hours ago

      That's not cheating, that's just pragmatic.

  • 9rx 3 hours ago

    > The Rise Of Rust

    Shouldn't that be The Fall Of Rust? According to this, it saw the most attention during the years before it was created!

    • emilbratt 3 hours ago

      The chart is a stacked one, so we are looking at the height each category takes up and not the height each category reaches.

  • stefs 2 hours ago

    please do not use stacked charts! i think it's close to impossible not to distort the reader's impression because a) it's very hard to gauge the height of a certain data point in the noise and b) they're implying a dependency where there _probably_ is none.

    • seabass an hour ago

      My first thought as well! The author of uPlot has a good demo illustrating their pitfalls https://leeoniya.github.io/uPlot/demos/stacked-series.html

    • dguest an hour ago

      How do you feel about stacked plots on a logarithmic y axis? Some physics experiments do this all the time [1] but I find them pretty unintuitive.

      [1]: https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PUBNOTES/ATL-...

      • lblume an hour ago

        What is this even supposed to represent? The entire justification I could give for stacked bars is that you could permute the sub-bars and obtain comparable results. Do the bars still represent additive terms? Multiplicative constants? As a non-physicist I would have no idea on how to interpret this.

        • dguest 14 minutes ago

          It's a histogram. Each color is a different simulated physical process: they can all happen in particle collisions, so the sum of all of them should add up to the data the experiment takes. The data isn't shown here because it hasn't been taken yet: this is an extrapolation to a future dataset. And the dotted lines are some hypothetical signal.

          The area occupied by each color is basically meaningless, though, because of the logarithmic y-scale. It always looks like there's way more of whatever you put on the bottom. And obviously you can grow it without bound: if you move the lower y-limit to 1e-20 you'll have the whole plot dominated by whatever is on the bottom.

          For the record I think it's a terrible convention, it just somehow became standard in some fields.

  • tacker2000 2 hours ago

    Yea, i also get the feeling that these rust evangelists get more annoying every day ;p

  • matsemann 4 hours ago

    One thing I'm curious about, but I guess not visible in any way, is random stats about my own user/usage of the site. What's my upvote/downvote ratio? Are there users I constantly upvote/downvote? Who is liking/hating my comments the most? And some I'd guess could be scrapable: Which days/times am I the most active (like the GitHub green grid thingy)? How's my activity changed over the years?

    • pjc50 35 minutes ago

      I don't think you can get the individual vote interactions, and that's probably a good thing. It is irritating that the "API" won't let me get vote counts; I should go back to my Python scraper of the comments page, since that's the only way to get data on post scores.

      I've probably written over 50k words on here and was wondering if I could restructure my best comments into a long meta-commentary on what does well here and what I've learned about what the audience likes and dislikes.

      (HN does not like jokes, but you can get away with it if you also include an explanation)

    • minimaxir 4 hours ago

      The only vote data that is visible via any HN API is the scores on submissions.

      Day/Hour activity maps for a given user are relatively trivial to do in a single query, but only public submission/comment data could be used to infer it.
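
      As a sketch of what that inference could look like once a user's public timestamps are in hand, the following buckets them into a weekday-by-hour grid. It does the bucketing in application code rather than in a query, assumes times are Unix seconds (as in the HN API's `time` field), and uses UTC; `activityGrid` is a hypothetical name:

      ```go
      package main

      import (
      	"fmt"
      	"time"
      )

      // activityGrid counts timestamps per weekday (0 = Sunday) and hour of
      // day, both taken in UTC.
      func activityGrid(unixSeconds []int64) [7][24]int {
      	var grid [7][24]int
      	for _, ts := range unixSeconds {
      		t := time.Unix(ts, 0).UTC()
      		grid[int(t.Weekday())][t.Hour()]++
      	}
      	return grid
      }

      func main() {
      	// Three made-up timestamps on 1970-01-01, which was a Thursday.
      	grid := activityGrid([]int64{0, 1800, 3600})
      	fmt.Println(grid[time.Thursday][0], grid[time.Thursday][1])
      }
      ```

      Converting to the user's local time zone before bucketing would give a more meaningful picture, but the zone is not part of the public data, so it would have to be guessed.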

      • ryandrake 4 hours ago

        Too bad! I’ve always sort of wanted to be able to query things like what were my most upvoted and downvoted comments, how often are my comments flagged, and so on.

        • saagarjha 3 hours ago

          I did this once by scraping the site (very slowly, to be nice). It’s not that hard since the HTML is pretty consistent.

    • nottorp 3 hours ago

      > Are there users I constantly upvote/downvote?

      Hmm. Personally I never look at user names when I comment on something. It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...

      • pjc50 27 minutes ago

        I recognize twenty or so of the most frequent and/or annoying posters.

        The leaderboard https://news.ycombinator.com/leaders absolutely doesn't correlate with posting frequency. Which is probably a good thing. You can't bang out good posts non-stop on every subject.

      • vidarh 2 hours ago

        The exception, to me, is if I'm questioning whether the comment was in good faith or not, where the track record of the user on a given topic could go some way to untangle that. It happens rarely here, compared to e.g. Reddit, but sometimes it's mildly useful.

      • matsemann 3 hours ago

        Same, which is why it would be cool to see. Perhaps there are people I both upvote and downvote?

      • thaumasiotes 2 hours ago

        > It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...

        ...is that supposed to pose some kind of problem? The problem would be in the other direction, surely?

    • 9rx 3 hours ago

      > What's my upvote/downvote ratio?

      Undefined, presumably. For what reason would there be to take time out of your day to press a pointless button?

      It doesn't communicate anything other than that you pressed a button. For someone participating in good faith, that doesn't add any value. But those not participating in good faith, i.e. trolls, it adds incredible value knowing that their trolling is being seen. So it is actually a net negative to the community if you did somehow accidentally press one of those buttons.

      For those who seek fidget toys, there are better devices for that.

      • immibis 3 hours ago

        Actually, its most useful purpose is to hide opinions you disagree with - if 3 other people agree with you.

        Like when someone says GUIs are better than CLIs, or C++ is better than Rust, or you don't need microservices, you can just hide that inconvenient truth from the masses.

        • 9rx 3 hours ago

          So, what you are saying is that if the masses agree that some opinion is disagreeable, they will hide it from themselves? But they already read it to know it was disagreeable, so... What are they hiding it for, exactly? So that they don't have to read it again when they revisit the same comments 10 years later? Does anyone actually go back and reread the comments from 10 years ago?

          • jpc0 31 minutes ago

            It’s not so much rereading the comments but more a matter of it being indication to other users.

            Take the C++ example above, for instance: you are likely to be downvoted for supporting C++ over Rust, and therefore most people reading through the comments (and LLMs correlating comment “karma” to how liked a comment is) will generally associate Rust > C++, which isn’t a nuanced opinion at all and IMHO is just plain wrong a decent amount of the time. They are tools and have their uses.

            So generally it shows the sentiment of the group, and humans are conditioned to follow the group.

        • matsemann 3 hours ago

          Since there are no rules on down voting, people probably use it for different things. Some to show dissent, some to down vote things they think don't belong only, etc. Which is why it would be interesting to see. Am I overusing it compared to the community? Underusing it?

      • saagarjha 3 hours ago

        If Hacker News had reactions I’d put an eye roll here.

        • 9rx 3 hours ago

          You could have assigned 'eye roll' to one of the arrow buttons! Nobody else would have been able to infer your intent, but if you are pressing the arrow buttons it is not like you want anyone else to understand your intent anyway.

  • hsbauauvhabzb 3 hours ago

    Is the raw dataset available anywhere? I really don’t like the HN search function, and grepping through the data would be handy.

    • Havoc 2 hours ago

      It’s on firebase/bigquery to avoid people doing what OP did

      If you click the API link at the bottom of the page it’ll explain.

  • pier25 4 hours ago

    would love to see the graph of React, Vue, Angular, and Svelte

  • andrewshadura 3 hours ago

    Funny nobody's mentioned "correct horse battery staple" in the comments yet…