The surprisingly complex journey to text-selectable client-side generated PDFs

(sdocs.dev)

31 points | by FailMore a day ago

22 comments

cbolton 19 minutes ago
I wonder if using Typst would be a viable solution: the compiler can be built into a wasm component that runs locally in the browser (that's what the Typst webapp does) and it generates good PDFs with working selection/copy/paste.
There's even a package (cmarker) that can translate Markdown to Typst, which could be enough for an MVP.
Worf an hour ago
PDFs should be only for printing or maybe for keeping scanned versions of things. For anything else they're just not the right tool for the job. Not for things meant to be accessed on a computer like books, scientific papers or, for some weird reason, catalogs and price lists from websites.
We have responsive and open standards like HTML and EPUB (zipped XHTML) and they work great. arXiv has HTML papers, and libgen and anna's archive often have EPUB versions of books. The issue for me with EPUB is the lack of good readers now.
- jkscm 15 minutes ago
  Slightly disagree with this. A fixed page layout has its own advantages. The reason we have more high-quality PDF readers than EPUB readers is probably connected to the format itself. PDF readers are usually more feature-complete when it comes to stuff like annotations, too.
- gf000 36 minutes ago
  I don't know, I really love well-typeset books/papers. Especially when they feature figures that are deliberately placed close to the relevant section in the text; it's just not something we can replicate with HTML, which can barely do proper justified text.
  Sure, I would like that beautifully designed page to magically become a single column beautiful document on my phone, but I will take the former over a badly designed text extract where the relevant figure is 10 pages away.
  Epub (=html) is good for novels, but there is nothing replacing PDF for science papers. If anything, the latex (or ideally typst) source would come the closest, if properly written (not absolute offsets). That could be used to produce different page sized versions.
  - Worf 7 minutes ago
    The "figures that are deliberately placed close to the relevant section in the text" is something I've heard often, and I'd agree to an extent. But the figure is never 10 pages away (unless you have a tiny screen or something). It's easy to put an image in between two paragraphs. With PDF papers, one figure is often referenced in several places throughout the paper, so I just open two windows with the paper anyway.
    For justified text - what's the point of stretching each line artificially just so they align at the end? It looks awful to me even when done "correctly". Having uneven spaces makes it harder to read. Having every line align on the right also makes it harder to read. With uneven lines, I subconsciously use the difference at the end as an anchor for where I am in the text or where a certain phrase was. Hyphenating words is another thing that doesn't make a lot of sense nowadays - we have enough words with a hyphen naturally in them, so reading a broken-up word is mentally taxing as I have to figure out if it's a normal word with a hyphen or a broken-up one.
    All the arXiv HTML papers are much better to read in the browser, IMO. And they'll only get better. PDF will likely stay the same.
    For small screens like phones or tablets, having to constantly scroll up and down and left and right for a 2-column paper is just painful. PDF is much better on a big screen.
- mcdonje 26 minutes ago
  EPUB is the ebook standard, outside of Amazon-land, so it has staying power in its space. I think it would be good for the ecosystem if it broke containment and got tooling in enough places to challenge PDF.
- FailMore an hour ago
  Interesting point. What do you feel about the "business world"'s heavy use of PDFs? There is something to be said about the file format being trusted/so dominant now... probably some random sequence of events led to this happening... but perhaps hard to shift
  - Sharlin 28 minutes ago
    Because the business world used to run on paper, and PDF became the de facto standard desktop publishing file format because Adobe became the de facto king of desktop publishing. Storing, transferring, and reading documents on paper has given way to doing all of that digitally, but path dependency guarantees that there’s no way of getting rid of PDF now.
    Purely psychologically, I think there’s something that feels more "secure" or long-lasting about PDF’s perceived quasi-immutability compared to formats designed to be edited.
    - FailMore 20 minutes ago
      Yeah, I think the point about editing is a very good one. There is something comforting about them and perhaps that's it (+ maybe we are used to them being A4 pages, so you know what to expect). I think also the lack of flexibility with rendering is good, if you see it on one device you know exactly how it will look on another device.
- danhor 30 minutes ago
  A PDF of a long document such as a standard or reference manual is almost always preferable to an HTML version. HTML versions have issues with formatting, searching (as browsers struggle with multi-thousand page documents and non-native document search implementations almost always suck), indexing, correct behavior on windows size change (especially a side-by-side PDF view is almost unheard of for webpages), and so on. Some vendors have switched to online-only for some documents and it always annoys me.
  - Worf 4 minutes ago
    > correct behavior on windows size change
    Except the PDF is not responsive at all and you can't increase or decrease the font size without increasing the width of the whole page.
    > Some vendors have switched to online-only for some documents and it always annoys me.
    HTML shouldn't mean online-only. If the vendor isn't trying to make it hard to download, you should always be able to convert to PDF. But PDF to HTML is very hard or impossible.
ashishb an hour ago
Software engineers drastically underestimate GUI work: web layouts, mobile app layouts, and even PDF layouts are non-trivial pieces of work to get right in all circumstances.
- freedomben an hour ago
  Nobody who has actually worked on those things thinks that. You might want to qualify if you're only talking about people who have never worked in this area.
  In my experience it's the NON software engineers who tend to underestimate the complexity
- FailMore an hour ago
  Yep, they (can) rarely enter your domain... so it's easy to assume it's going to be trivial (maybe because things like .md or .txt files are trivial, so it's easy to think there's not much of a delta)
josefrichter 2 hours ago
It’s not that surprising. It’s one of those well-known Pandora’s boxes of web development: email templates, PDFs, printing, …
- FailMore 2 hours ago
  Ah, I didn't know that. It's not something I had worked on before, and the file format is highly prevalent (so I assumed things would be easy), so it was surprising to me
  - caspper69 7 minutes ago
    You would think that, but PDF is not really a format for text. It's a format that describes typography and graphics layout & formatting. It's not uncommon for a text PDF to not contain all of the text it renders (due to ligatures).
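To make the ligature point concrete: a PDF can draw the single glyph "ﬁ" where the author typed "fi", and a text extractor may hand back the ligature code point instead of the two letters. A minimal Python sketch, assuming the extractor returned Unicode presentation-form ligatures (the sample string is made up), uses NFKC normalization to recover searchable text:

```python
import unicodedata

def expand_ligatures(text: str) -> str:
    """Replace compatibility ligatures (e.g. U+FB01) with their letters."""
    # NFKC applies compatibility decomposition, which splits the Latin
    # ligatures U+FB00..U+FB06 into plain letters, then recomposes
    # canonical sequences.
    return unicodedata.normalize("NFKC", text)

# Hypothetical extractor output: U+FB01 is the "fi" ligature and
# U+FB02 is the "fl" ligature, each a single code point.
extracted = "The \ufb01nal work\ufb02ow"
print(expand_ligatures(extracted))  # prints "The final workflow"
```

This only helps when the glyph maps back to a Unicode code point at all, and NFKC also rewrites other compatibility characters (superscripts, full-width forms), so it is a blunt tool rather than a fix for PDF text extraction in general.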
  - SirHumphrey 2 hours ago
    Nothing about PDF is easy. Similarly to what Tom Scott once said about time zones, every time I must deal with PDFs I pray that PDF.js can be hacked into doing it instead; otherwise I just don’t bother.
    It’s one of the few cases where converting it into a picture and chucking it into a multimodal LLM is a more sensible solution than trying to parse it.
gobdovan an hour ago
Thanks, this puts into perspective why copy-paste from PDFs is so bad.
I'm months into building a pasteboard transform library that normalises provider-specific data from VS Code, Google Docs, PDFs and a bunch of Chromium apps so I can start pasting everything everywhere exactly how I want it. It's much, much messier than I expected.
Apps put different UTTypes on the pasteboard that are not really compatible with each other. Usually there's a plain text fallback, then rich text/HTML, then provider-specific data. You show how much insane work is needed just to make text selectable with glyph mappings, layout, links, code blocks, rendered styles, etc. But once you copy from that PDF, most viewers still only expose raw text, and often broken raw text at that...
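A rough Python sketch of the fallback order described above. The pasteboard is modelled as a plain dict from type identifier to payload; `com.example.editor.fragment`, the `pick_representation` helper and the preference list are hypothetical, and real apps choose their own order:

```python
# Preference order, richest first. The first entry is a made-up
# provider-specific type; the other three are standard UTI strings.
PREFERRED_TYPES = [
    "com.example.editor.fragment",
    "public.html",
    "public.rtf",
    "public.utf8-plain-text",
]

def pick_representation(items, accepted=PREFERRED_TYPES):
    """Return (type, payload) for the richest type the destination accepts."""
    for type_id in accepted:
        if type_id in items:
            return type_id, items[type_id]
    return None  # nothing on the pasteboard that this destination understands

# A PDF viewer that only exposes raw text, versus an editor that also
# offers HTML alongside the plain-text fallback.
copied_from_pdf_viewer = {"public.utf8-plain-text": b"raw text only"}
copied_from_editor = {
    "public.utf8-plain-text": b"hello",
    "public.html": b"<b>hello</b>",
}

print(pick_representation(copied_from_pdf_viewer))
print(pick_representation(copied_from_editor))
```

The same copy can therefore paste differently depending on which types the source put on the pasteboard and which ones the destination is willing to take.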
- FailMore an hour ago
  Yep, it is a very interesting space for improvements imo. Broadly speaking, copy and paste is so central to working smoothly with a computer that it should probably have more power / quality built into it (e.g. not having to install some random plug-in to get clipboard history, etc.)
  - gobdovan 28 minutes ago
    Part of it is security/privacy and providers avoiding liability. People constantly copy passwords, tokens, personal data, etc., so clipboard history is risky by default. Apple probably does not want to expose a rich API here and then be responsible for securing that surface forever.
    So macOS does not really give you a clean "this app copied this semantic object" API. Clipboard-history apps generally poll NSPasteboard.changeCount, which already makes provenance fuzzy, since you can observe that the pasteboard changed, but not reliably know the source app.
    Pasting is fuzzy too. You know what representations were available, but not what the destination app actually accepted, because that decision happens inside the app and is generally opaque for the OS. So what even is history? Is it the raw object, the fallback text, the richest representation, the thing you intended to paste, or the thing the target app consumed? And even if you define history as "the observed events", polling can also miss states. And once you add transforms (like I want to), you basically have to define your own history model. A coherent OS clipboard-history API probably will never happen without big effort and liability policy changes from providers.
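The "polling can also miss states" point can be shown with a small simulation. This is a Python sketch with a fake pasteboard, not the macOS API: `FakePasteboard` and `PollingHistory` are hypothetical names, and the fake counter goes up by exactly one per copy, which is an assumption of the simulation rather than a claim about the real counter.

```python
class FakePasteboard:
    """Stand-in for a system pasteboard: a value plus a change counter."""

    def __init__(self):
        self.change_count = 0
        self.value = None

    def copy(self, value):
        self.value = value
        self.change_count += 1


class PollingHistory:
    """Records the current value whenever the counter moved since the last poll."""

    def __init__(self, pasteboard):
        self.pasteboard = pasteboard
        self.last_seen = pasteboard.change_count
        self.entries = []
        self.missed = 0

    def poll(self):
        current = self.pasteboard.change_count
        if current != self.last_seen:
            # Only the latest value is observable; anything copied and
            # replaced between two polls is already gone.
            self.missed += current - self.last_seen - 1
            self.entries.append(self.pasteboard.value)
            self.last_seen = current


pb = FakePasteboard()
history = PollingHistory(pb)

pb.copy("first")    # replaced before the next poll, so never recorded
pb.copy("hello")
history.poll()
pb.copy("world")
history.poll()

print(history.entries, history.missed)  # prints ['hello', 'world'] 1
```

The poller can tell that something was skipped from the counter gap, but it has no way to recover what the skipped value was or which app produced it.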
alansaber 33 minutes ago
You don't know the hell of trawling through PDF XML and HTML construction until you've done it