this post was submitted on 17 Nov 2025

What is an efficient workflow to separate and organize bulk scanned PDF documents? (At work; software is limited.) (lemmy.ml)

submitted 2 months ago by endless@lemmy.ml to c/nostupidquestions@lemmy.world

At work I have been given a task to organize some documents. Please help me, I'm overwhelmed.

TLDR: Manually separate PDFs into individual documents (variable page length), assign each document to a category, and identify the date of the document. Need a fast way to do it. On Windows.

Here is a longer description of my task.

Goals:

Chop up the bulk PDFs to create 1 PDF per document
Sort each of them into one of 10 categories
Identify the date in each document. Include it in the filename. Optimally, insert it into the document itself (e.g. top right hand corner of first page, in the margin so it doesn't cover anything)
Would be nice:
- Rerun the OCR; I think it could be better. At home I would use ocrmypdf
- Clean up the scans: fix alignment, remove artifacts etc (only if effort is trivial and 0% risk of data loss)

Inputs:

Printed out, it's a stack of papers about 6-8cm tall
Has been provided in bulk PDFs about 40-60 pages each
Most individual documents are 2-5 pages in length, with some being 10-20

Document characteristics:

Scanned on an actual scanner; some carefully, others not
- Have been optimized for small file size
- Mostly black & white, some grey scale
Business documents, records, official correspondence etc. Typed, not handwriting. English.
They are all in some sort of standardized format, but from many different sources, each with their own format
Have had some sort of OCR applied to them; it isn't very good especially when the scans aren't perfect.

Work environment and constraints

Because I am at work, I am using a standard Windows workstation that is set up for office (not developer) use.
I have asked for the full version of Adobe Acrobat to be installed, because they have a license for that. I've never used it. Maybe it will do all of this but based on how profoundly annoying Reader is, I am skeptical.
I can install things that don't require administrator privileges. I can ask an administrator to install something if I am reasonably confident it'll be useful and safe, but I can't be annoying by asking for things all the time. I strongly prefer open source tools.
Cannot under any circumstances use anything online, cloud, external AI. All data must stay local.
I might be able to justify using linux in the future if this is an ongoing task (it might be) so linux-only suggestions are welcome but won't be implemented first-line. Desktop applications > self hosted servers or command line.

[–] foggy@lemmy.world 8 points 2 months ago* (last edited 2 months ago)

Am I understanding correctly that you have some number of .pdf files that are 40-60 pages each, that within them there are documents varying from 2 to 10 pages, and that your task is to parse them?

If so, how many .PDFs are we talking here? You said on paper it's like 8cm high? So like 1000 pages ish?

So like 20 .PDFs?

Just... Do it, dawg. The amount of time you spent on this post you coulda finished 1 of those PDFs. That's 5% of the task. Do that 19 more times.

If you're asking "can Adobe Acrobat break a 60 page .pdf into some number of .PDFs based on page numbers I tell it to?" the answer is yes.

Breathe a bit friend. No need to get overwhelmed.

[–] ArseAssassin@sopuli.xyz 4 points 2 months ago (1 children)

Paperless-ngx?

[–] endless@lemmy.ml 2 points 2 months ago

no way I'll be able to install docker and whatever else if needed, run a server etc.

[–] Sunsofold@lemmings.world 3 points 2 months ago (1 children)

I know there are scripting ways to work with PDFs. I was listening to someone talking just earlier about using a script and a localhosted LLM to organise and rename PDFs with author and title. If you can identify some kind of patterns (such as a heading that starts each document of a type) that you can detect, a script could find those pages and then feed that into something that will segment page ranges for each doc. It's definitely possible but the patterns to look for will be determined by the docs you are looking at.
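A rough sketch of the pattern idea above, in Python. It assumes the text of each page has already been extracted into a list of strings (by whatever OCR layer the scans carry), and the patterns in `START_PATTERNS` are made-up placeholders; the real ones depend entirely on the documents in question.

```python
import re

# Hypothetical patterns that might mark the first page of a document.
# These are placeholders, not taken from any real document set.
START_PATTERNS = [
    re.compile(r"^\s*INVOICE\b", re.IGNORECASE | re.MULTILINE),
    re.compile(r"^\s*Dear\b", re.IGNORECASE | re.MULTILINE),
]


def find_document_starts(page_texts, head_chars=300):
    """Return 1-based page numbers that look like the start of a document.

    Only the first `head_chars` characters of each page are checked, on the
    assumption that a heading or salutation sits near the top of the page.
    Page 1 is always treated as a start.
    """
    starts = []
    for number, text in enumerate(page_texts, start=1):
        head = text[:head_chars]
        if number == 1 or any(p.search(head) for p in START_PATTERNS):
            starts.append(number)
    return starts


pages = [
    "INVOICE 123\nAmount due ...",
    "continued text",
    "Dear Sir,\nWe write to ...",
    "more",
    "  invoice no 5",
]
print(find_document_starts(pages))
```

The resulting page numbers would still need a manual check, since poor OCR can easily hide or mangle a heading.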

[–] endless@lemmy.ml 1 points 2 months ago (1 children)

I don't have the volume where learning a completely new technology would be worthwhile. I would have to manually verify each one anyways because it has to be perfect. The documents do not have any format as nice as a heading at the top. I'm willing to put in the time to go through each page, I just need a fast way to tag them, then automate separation and renaming.

[–] Sunsofold@lemmings.world 3 points 2 months ago (1 children)

Hmm. Well, first off, if you mean you don't know how to write a script and don't view it as worth learning for this task, that limits the task a fair amount. If you mean you don't want to learn about the particulars of script based PDF editing or OCR, that's understandable.

If you don't want to script at all, you should be able to segment the PDFs via acrobat, or even just 'print to PDF' with page ranges on most viewers. There are ways of bulk renaming files once you have segmented them, even without scripting, though it'd be use case dependent as to whether/how that'd be useful to you.

If you want to script just a little, I made a script ages ago where I used the document's name to hold the metadata of what needed to be modified. You could certainly do that. For example: open the doc in one window, select the file for renaming in your file explorer, scroll through and type the sequence of starting pages into the rename field (e.g. documentName3,7,15,22,29.PDF), then run a script to segment the PDF at those page numbers so you end up with 'documentName-1.PDF' containing pages 1 to 2, another with 3 to 6, etc.
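This is not the original script, just a minimal Python sketch of the same idea. The function names are invented for illustration, and the actual splitting assumes the third-party `pypdf` package is available, which on a locked-down workstation may not be the case. It also assumes the base file name does not itself end in a digit, otherwise the page list becomes ambiguous.

```python
import re
from pathlib import Path


def parse_name(filename):
    """Split 'documentName3,7,15,22,29.PDF' into ('documentName', [3, 7, 15, 22, 29])."""
    m = re.fullmatch(r"(.*?)(\d+(?:,\d+)*)\.pdf", filename, flags=re.IGNORECASE)
    if m is None:
        raise ValueError(f"no page list found in {filename!r}")
    starts = sorted({int(n) for n in m.group(2).split(",")})
    return m.group(1), starts


def page_ranges(starts, total_pages):
    """Turn starting pages into inclusive 1-based (first, last) ranges."""
    starts = sorted(set(starts))
    if not starts or starts[0] != 1:
        starts = [1, *starts]  # everything before the first break is document 1
    if starts[0] < 1 or starts[-1] > total_pages:
        raise ValueError("page number outside the document")
    ranges = []
    for i, first in enumerate(starts):
        last = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        ranges.append((first, last))
    return ranges


def split_pdf(path, out_dir):
    """Write one PDF per range. Requires pypdf (third-party, assumed installed)."""
    from pypdf import PdfReader, PdfWriter

    base, starts = parse_name(Path(path).name)
    reader = PdfReader(path)
    ranges = page_ranges(starts, len(reader.pages))
    for n, (first, last) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for i in range(first - 1, last):  # pypdf pages are 0-based
            writer.add_page(reader.pages[i])
        writer.write(str(Path(out_dir) / f"{base}-{n}.pdf"))
```

With the file name from the example and a 40-page scan, `page_ranges` gives pages 1 to 2, 3 to 6, 7 to 14, 15 to 21, 22 to 28 and 29 to 40, which `split_pdf` would write out as `documentName-1.pdf` through `documentName-6.pdf`.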

A bit more effort could maybe be used to do some level of renaming, though how much use that would be would depend on the particulars of your case. I could see extending the previous script a little and making the page annotations include a doc type. (e.g. 13cn meaning segment at page 13 and label it as 'originalDocumentName-clientNotification', or even 13'arbitraryText' and use the arbitrary text as the new file name)
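A small Python sketch of how such annotations could be read, under some loud assumptions: the `LABELS` table and its short codes are hypothetical, annotations are separated by commas, and the quoted free text contains no commas or apostrophes.

```python
import re

# Hypothetical short codes; a real table would list the 10 categories in use.
LABELS = {"cn": "clientNotification", "inv": "invoice"}

# A page number, optionally followed by 'free text' in quotes or a short code.
TOKEN = re.compile(r"(\d+)(?:'([^']*)'|([A-Za-z]+))?")


def parse_token(token):
    """Return (start_page, free_text, code) for one annotation like 13cn."""
    m = TOKEN.fullmatch(token.strip())
    if m is None:
        raise ValueError(f"cannot read annotation {token!r}")
    return int(m.group(1)), m.group(2), m.group(3)


def segment_names(base, annotation):
    """Map '3cn,7' to a list of (start_page, new_file_stem) pairs."""
    result = []
    for token in annotation.split(","):
        page, free_text, code = parse_token(token)
        if free_text is not None:
            stem = free_text                  # arbitrary text is the whole name
        elif code is not None:
            stem = f"{base}-{LABELS[code]}"   # unknown codes raise KeyError
        else:
            stem = f"{base}-p{page}"          # no label given: fall back to page
        result.append((page, stem))
    return result


print(segment_names("originalDocumentName", "3cn,7,13'arbitraryText'"))
```

Failing loudly on an unknown code is deliberate here, since a silently mislabelled file would be harder to spot than an error.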

The particularity of your case may be precisely why it hasn't been automated yet.

[–] endless@lemmy.ml 1 points 2 months ago (1 children)

I am not going to learn how to train an AI for this task. It is non trivial to install anything and I cannot use any remote/online tools. I would need to find an appropriate local AI (deepseek?) and learn how to use it from scratch.

I could write a bash script to modify filenames at home on my linux machine. But at work I just have windows. It has... powershell? I guess. I've never used that and to be honest I have no desire to. I would have to install something to cut up the PDFs. ocrmypdf could do everything, and there are various other cli PDF manipulation tools in the repos. But I would have to ask to have it installed, along with any other dependencies required. Not gonna happen.

I want a way to easily go through hundreds of pages, look at them and quickly tag them. That is a perfect task for a GUI. To use a script I would have to scroll through the PDF in one application then switch back and forth into a text editor, to manually create a text document specifying which pages are in what document, and what category etc. I'd sooner do it on paper. But I'm sure there is a solution for this, I just don't know what it is.

[–] Sunsofold@lemmings.world 2 points 2 months ago

At the purely GUI level, if you're being granted acrobat, it turns out you can extract arbitrary subsets of pages manually, very quickly. You can then rename them. I haven't learned powershell personally but it absolutely could be used to batch rename files, even if it's a somewhat silly looking language compared to bash. Again, though, how much work that involves depends on your desired naming conventions.
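For illustration only, here is what a batch rename could look like as a short script. It is written in Python rather than PowerShell, and the naming convention (date, then category, then the old name) is an assumption, not something settled in this thread. The `tags` mapping stands in for whatever record is kept of each file's date and category.

```python
from datetime import date
from pathlib import Path


def new_name(old, doc_date, category):
    """Build 'YYYY-MM-DD_category_oldname.pdf' from an existing file name."""
    p = Path(old)
    return f"{doc_date.isoformat()}_{category}_{p.stem}{p.suffix.lower()}"


def rename_all(folder, tags, dry_run=True):
    """Plan (and optionally apply) renames for every file listed in `tags`.

    `tags` maps an existing file name to a (date, category) pair.
    With dry_run=True nothing on disk is touched; the plan is only returned.
    """
    plan = []
    for old, (doc_date, category) in sorted(tags.items()):
        target = new_name(old, doc_date, category)
        plan.append((old, target))
        if not dry_run:
            (Path(folder) / old).rename(Path(folder) / target)
    return plan


tags = {
    "documentName-1.PDF": (date(2021, 3, 4), "invoice"),
    "documentName-2.PDF": (date(2020, 11, 30), "correspondence"),
}
for old, new in rename_all(".", tags):
    print(old, "->", new)
```

Running it in dry-run mode first and reading the plan is a cheap safeguard when the result has to be perfect.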

[–] MisterNeon@lemmy.world 2 points 2 months ago (1 children)

Ask the person who asked you to do it how they would do it?

[–] endless@lemmy.ml 3 points 2 months ago (1 children)

They print it out on paper, organize it with sticky notes and paper clips, then have someone re-scan it and name/organize files digitally according to the sticky notes. I don't want to do it that way.

[–] MisterNeon@lemmy.world 3 points 2 months ago (1 children)

Oh that sucks! Good luck, I'm out of ideas.

[–] endless@lemmy.ml 1 points 2 months ago

that's how I know how tall it is printed out! lol

[–] Zwuzelmaus@feddit.org 2 points 2 months ago (1 children)

My opinion: Scan it again. It will be faster than fiddling with the bad PDFs.

Look at the dates while scanning.

Later sort them into folders first by year, then by the name of the author/addressee/business partner.

[–] endless@lemmy.ml 1 points 2 months ago

Originals are unavailable, I only have the scans, which have been printed out of tradition. I could scan them again but it would take a very long time and further decrease the quality. And I don't have the ability to sit by the scanner to catch the files as they come in. Scanner and workstation are in different locations. Manually separating the existing PDFs by using "print to PDF" would be faster.