I had to change the configuration used by my youtube_dl.py script because all newly downloaded videos of a channel or playlist went into the same directory. As a result, some directories held tens of thousands of files (including the subtitles, thumbnails and JSON output of each video), so it sometimes took several minutes just to list a directory.

The solution is to change the output path from

--output "%(uploader)s/%(upload_date)s_%(playlist_index)s_%(title)s_%(id)s"

to

--output "%(uploader)s/%(upload_date>%Y)s/%(upload_date>%m)s/%(upload_date)s_%(playlist_index)s_%(title)s_%(id)s"

so that all files are sorted by yyyy/mm (year/month) in sub-directories.
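As a rough illustration of the new layout, here is a minimal sketch that computes the same uploader/yyyy/mm sub-directory in plain Python. It assumes that the >%Y and >%m suffixes format the upload date with strftime-style codes; the function name dated_subdir is hypothetical and is not part of the downloader.

```python
from datetime import datetime
from pathlib import PurePosixPath


def dated_subdir(uploader: str, upload_date: str) -> PurePosixPath:
    """Return uploader/yyyy/mm for an upload date given as yyyymmdd."""
    # strptime raises ValueError if upload_date is not a valid yyyymmdd date.
    parsed = datetime.strptime(upload_date, '%Y%m%d')
    return PurePosixPath(uploader, parsed.strftime('%Y'), parsed.strftime('%m'))


if __name__ == '__main__':
    print(dated_subdir('chan_a', '20220420'))
```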

The remaining problem was to fix the existing files: the script presented below does just that.

The %(upload_date)s variable encodes the date in the format yyyymmdd. The script selects every file whose name starts with yyyymmdd_ and that is not already archived under yyyy/mm. The yyyymmdd prefix is then unpacked into three components:

  • year = yyyy
  • month = mm
  • day = dd
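The selection and unpacking steps can be sketched in isolation as follows. This is only an illustration: the helper split_date_prefix is a hypothetical name, and it uses regular expression groups where the script itself uses string slicing.

```python
import re
from typing import Optional, Tuple

# Eight digits followed by an underscore, split into yyyy, mm and dd.
DATE_PREFIX = re.compile(r'(\d{4})(\d{2})(\d{2})_')


def split_date_prefix(name: str) -> Optional[Tuple[str, str, str]]:
    """Return (year, month, day) if name starts with yyyymmdd_, else None."""
    match = DATE_PREFIX.match(name)
    if match is None:
        return None
    return match.groups()


if __name__ == '__main__':
    print(split_date_prefix('20220420_abcd.mkv'))
    print(split_date_prefix('notes.txt'))
```

Note that this step only checks the shape of the prefix: a name such as 20221340_abcd.mkv still matches, which is why a separate date validation is needed.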

The call to datetime.date checks that these three values form a valid date and raises ValueError otherwise. This is an extra filter to make sure we select the right files. Once that is done, the script creates two nested directories under the parent of the file being processed. For example, for a file called /home/my-user/movies/news/chan_a/20220420_abcd.mkv the new sub-directories will be /home/my-user/movies/news/chan_a/2022/04/.
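To make the validation and the destination path concrete, here is a small sketch that works on pure paths, so it can be tried without touching the filesystem. The function name destination_dir is hypothetical; the real script also creates the directory with mkdir.

```python
import datetime
from pathlib import PurePosixPath
from typing import Optional


def destination_dir(path: PurePosixPath) -> Optional[PurePosixPath]:
    """Return parent/yyyy/mm for a yyyymmdd_-prefixed file, or None if the date is invalid."""
    date = path.name[:8]
    year, month, day = date[:4], date[4:6], date[6:8]
    try:
        # datetime.date raises ValueError for impossible dates such as month 13;
        # int() raises ValueError as well if the prefix is not numeric.
        datetime.date(int(year), int(month), int(day))
    except ValueError:
        return None
    return path.parent / year / month


if __name__ == '__main__':
    example = PurePosixPath('/home/my-user/movies/news/chan_a/20220420_abcd.mkv')
    print(destination_dir(example))
```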

Finally, rsync copies the file into the new subdirectory and then deletes it from its original location (this is what the --remove-source-files option does). In the example, the file ends up at /home/my-user/movies/news/chan_a/2022/04/20220420_abcd.mkv.
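The rsync invocation can also be built as an argument list, which sidesteps quoting problems with file names that contain spaces. This is a sketch of an alternative, not what the script below does: rsync_move_command is a hypothetical helper, and running the result with subprocess.run replaces the fpyutils call.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def rsync_move_command(src: PurePosixPath, dst_dir: PurePosixPath,
                       rsync: str = '/usr/bin/rsync') -> List[str]:
    """Build the rsync command that copies src into dst_dir and removes the source."""
    return [
        rsync,
        '-avAX',
        '--progress',
        '--remove-source-files',
        str(src),
        str(dst_dir),
    ]


if __name__ == '__main__':
    cmd = rsync_move_command(
        PurePosixPath('/home/my-user/movies/news/chan_a/20220420_ab cd.mkv'),
        PurePosixPath('/home/my-user/movies/news/chan_a/2022/04'),
    )
    # Only print the command here; to run it, pass the list to
    # subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```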

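Before running the script you need to install its two external dependencies: the fpyutils Python package, which the script imports, and rsync, which it calls.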

#!/usr/bin/env python3
#
# Copyright (C) 2022 Franco Masotti (franco \D\o\T masotti {-A-T-} tutanota \D\o\T com)
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <https://www.gnu.org/licenses/>.

import datetime
import pathlib
import re
import shlex

import fpyutils


def main():
    # Edit these variables.
    SRC: str = '/home/my-user/movies'
    RSYNC: str = '/usr/bin/rsync'
    GLOB: str = '*'

    # Collect all the paths first, so that the directories created below
    # do not interfere with the traversal.
    ff: list = list(pathlib.Path(SRC).rglob(GLOB))

    for f in ff:
        # Select regular files whose name starts with yyyymmdd_
        if not (f.is_file() and re.match(r'\d{8}_', f.name)):
            continue

        # Skip files already archived under yyyy/mm. Use fullmatch so that
        # directory names merely starting with digits are not mistaken
        # for year or month directories.
        p1: str = f.parents[0].name
        p2: str = f.parents[1].name
        if re.fullmatch(r'\d{4}', p2) and re.fullmatch(r'\d{2}', p1):
            continue

        date: str = f.name[:8]
        year: str = date[:4]
        month: str = date[4:6]
        day: str = date[6:]

        try:
            # Check if it's a valid date.
            datetime.date(int(year), int(month), int(day))
        except ValueError:
            print('error: invalid date, skipping ' + str(f))
            continue

        # Parent with date component.
        dst_dir: pathlib.Path = pathlib.Path(f.parent, year, month)
        dst_dir.mkdir(mode=0o777, parents=True, exist_ok=True)

        # Copy and then REMOVE. The paths are quoted because the command
        # is passed as a single string and file names may contain spaces.
        cmd: str = (
            RSYNC
            + ' -avAX --progress --remove-source-files '
            + shlex.quote(str(f))
            + ' '
            + shlex.quote(str(dst_dir))
        )

        fpyutils.shell.execute_command_live_output(cmd)


if __name__ == '__main__':
    main()

As you can see, this script is not too dissimilar to the one in a previous post.