Automating transcoding and OCRing media

Hey,
I take tuition outside of school to help with my studies. They provide me with videos, notes, problems and solutions after each session, but viewing them on my tuition’s website is a hassle. The UI/UX design is terrible, which makes the site infuriating to navigate. I can’t blame them: their job is teaching, which they’re good at, not webdev.
So I decided to do something about it.

They don’t have download buttons for the videos on their website, because they don’t want me to download them, which is understandable.
But I easily circumvented this by using XDMan, paired with its Firefox browser extension.
Their videos occupy a significant amount of storage despite them being of terrible quality.
I noticed the videos were using the H.264 codec, and thought that transcoding them to H.265 might take up less space. Since they’re mostly slides in a 3-hour video, I didn’t expect much of a quality drop. Transcoding is a CPU-heavy process, but luckily I have a decent computer that I already use as a server to store my media.
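Whether the re-encode actually pays off is easy to check by comparing file sizes before and after. Here is a minimal sketch of that arithmetic; `percent_saved` is a hypothetical helper name, the sizes and file names are made up, and `stat -c %s` in the comment assumes GNU coreutils.

```shell
#!/bin/sh
# percent_saved OLD_BYTES NEW_BYTES
# Prints the whole-number percentage by which NEW is smaller than OLD.
# (Hypothetical helper for illustration, not part of the service.)
percent_saved() {
    old=$1
    new=$2
    if [ "$old" -le 0 ]; then
        echo "old size must be positive" >&2
        return 1
    fi
    # Integer arithmetic, so the result is rounded toward zero.
    echo $(( (old - new) * 100 / old ))
}

# Example with made-up sizes: a 1000-byte file shrinking to 400 bytes.
percent_saved 1000 400

# With real files (placeholder names), the sizes can come from GNU stat:
# percent_saved "$(stat -c %s lecture.mp4)" "$(stat -c %s lecture.h265.mp4)"
```

The result depends heavily on the source material and the chosen quality setting, so it is worth running this on one or two files before converting a whole library.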

I came up with the idea of writing a systemd service to transcode the videos as soon as I upload them to my server.
These are the unit file and the script I wrote for the process:

# /etc/systemd/system/transcoding.service
[Unit]
Description=Transcode tuition videos
[Service]
ExecStart=/home/agastya/Videos/Tuition/transcode.sh
WorkingDirectory=/home/agastya/Videos/Tuition/
Restart=always
User=user
[Install]
WantedBy=multi-user.target
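
#!/bin/bash
# transcode.sh: re-encode every non-HEVC video under the working directory.

# Remove partial outputs left behind by an interrupted run.
find . -name '*.temptrans*' -delete
folder_path="."
IFS=$'\n'   # split find's output on newlines only, so spaces in names survive
while true; do
    # Group the -name tests so that -type f applies to all of them.
    files=$(find "$folder_path" -type f \( -name "*.mp4" -o -name "*.mkv" -o -name "*.ts" \))
    for file in $files; do
        if lsof -w "$file" > /dev/null; then
            # Still open in another process (for example, mid-upload): skip for now.
            echo "$file busy"
            sleep 2
        else
            codec=$(ffprobe -v error -select_streams v:0 -show_entries stream=codec_name -of default=noprint_wrappers=1:nokey=1 "$file")
            if [[ "$codec" != "hevc" ]]; then
                tmp="${file%.*}.h265.temptrans.mp4"
                # Swap in the transcoded file only if ffmpeg succeeded.
                if ffmpeg -hide_banner -i "$file" -c:v libx265 -crf 28 -c:a aac -b:a 128k "$tmp"; then
                    mv "$tmp" "${file%.*}.h265.mp4"
                    rm "$file"
                else
                    # Drop the partial output so the next pass doesn't pick it up.
                    rm -f "$tmp"
                fi
            fi
        fi
    done
    sleep 5
done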

All it does is loop over the video directory, check each video file’s codec with ffprobe, and start transcoding anything that isn’t H.265. If a file is still open in another program, for example while it’s being uploaded from my laptop to the server, the script skips it. The script transcodes to a temporary file and, once that finishes successfully, renames it to its final name and deletes the original.
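The naming scheme and the codec check described above boil down to a few lines of parameter expansion. The following is a sketch rather than the script itself: `temp_name`, `final_name` and `should_transcode` are hypothetical helper names introduced for illustration, and the sample path is made up.

```shell
#!/bin/sh
# Name of the in-progress output: strip the extension, add the temp suffix.
temp_name() {
    printf '%s\n' "${1%.*}.h265.temptrans.mp4"
}

# Name the finished file is renamed to once the encode succeeds.
final_name() {
    printf '%s\n' "${1%.*}.h265.mp4"
}

# Succeeds (exit status 0) when the codec name is anything other than "hevc",
# which is the name ffprobe reports for H.265 video.
should_transcode() {
    [ "$1" != "hevc" ]
}

sample="./Physics/Session 3.mkv"   # made-up path
echo "temp:  $(temp_name "$sample")"
echo "final: $(final_name "$sample")"
if should_transcode "h264"; then
    echo "h264 input would be transcoded"
fi
```

Writing to a temporary name and renaming only on success means an interrupted encode never leaves a half-written file under the final name, and the `find . -name '*.temptrans*' -delete` at the top of the script clears any leftovers on restart.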

This, paired with my Jellyfin server, gave me a beautiful and unparalleled studying experience. I might do a post on setting up a media server with Jellyfin.

The gratifying sense of accomplishment derived from automating this process sparked within me a desire to pursue further automation, even if it seemed like a pursuit of something insignificant.

Then I remembered my tuition supplies us with PDF files of problems, which have text embedded as images. That’s frustrating, because it forces me to type out the entire question if I need to google it.

So I created another script to automatically OCR every document in my media folder.

#!/bin/bash
# OCR service: find "*Module.pdf" files that have no OCRed copy yet.
folder_path="."
ocr_script="./ocrpdf.sh"
IFS=$'\n'   # split find's output on newlines only, so spaces in names survive
while true; do
    files=$(find "$folder_path" -type f -name "*Module.pdf")
    for file in $files; do
        # ocrpdf.sh writes its output next to the input as "..._t.pdf".
        t_file="${file%Module.pdf}Module_t.pdf"
        if [ ! -f "$t_file" ]; then
            if lsof -w "$file" > /dev/null; then
                echo "File '$file' is being written to. Skipping OCR."
            else
                echo "Running OCR for file: '$file'"
                "$ocr_script" "$file"
            fi
        fi
    done
    sleep 5
done
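
#!/bin/bash
# ocrpdf.sh: OCR one PDF and write the result next to it as <name>_t.pdf.
if ! command -v ocrmypdf &>/dev/null; then
    echo "Can't find ocrmypdf"
    exit 1
fi
if [[ $# -ne 1 ]]; then
    echo "Usage: $0 <filename.pdf>"
    exit 1
fi
filename="$1"
if [[ ! -f "$filename" ]]; then
    echo "File not found."
    exit 1
fi
new_filename="${filename%.*}_t.pdf"
# Report success only if ocrmypdf actually succeeded.
if ocrmypdf "$filename" "$new_filename"; then
    echo "OCR completed"
else
    echo "OCR failed for '$filename'"
    exit 1
fi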

Well, two scripts, technically.


The second one is a helper script whose job is to actually OCR the PDF. It uses the ocrmypdf utility to do this.
The first script is the actual service. It loops over all PDF files ending in “Module.pdf”, since those are the only ones that need OCRing, and invokes the second script on every one that doesn’t have an OCRed copy yet. And like last time, the script ignores files that are currently in use by another program.
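The selection rule the service applies (a “Module.pdf” file with no “Module_t.pdf” next to it) can be tried out on its own, without running any OCR. This is a sketch under assumptions: `pending_pdfs` is a hypothetical helper name, the demo file names are made up, and because it reads `find` output line by line it would mishandle file names that contain newlines.

```shell
#!/bin/sh
# pending_pdfs DIR
# Prints every "*Module.pdf" under DIR that has no "*Module_t.pdf" beside it.
pending_pdfs() {
    find "$1" -type f -name '*Module.pdf' | while IFS= read -r pdf; do
        ocr_out="${pdf%.pdf}_t.pdf"
        if [ ! -f "$ocr_out" ]; then
            printf '%s\n' "$pdf"
        fi
    done
}

# Demo in a throwaway directory with empty placeholder files.
demo_dir=$(mktemp -d)
touch "$demo_dir/Algebra Module.pdf" \
      "$demo_dir/Optics Module.pdf" "$demo_dir/Optics Module_t.pdf" \
      "$demo_dir/Notes.pdf"
# Only the Algebra file is listed: Optics already has an OCRed copy,
# and Notes.pdf does not match the name pattern.
pending_pdfs "$demo_dir"
rm -r "$demo_dir"
```

Using the existence of the output file as the "already done" marker keeps the service stateless: it can be restarted at any time and simply picks up whatever is still missing.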

So there you have it, the script that took me an hour to write which saves me from a couple dozen minutes of manual work. At least it gives me satisfaction.

