This function scans a directory tree for duplicated files (by content hash) and optionally duplicated directories (by identical file sets). It supports filtering by file extension and minimum file size, as well as parallel processing.

Usage

find_duplicates(
  path = ".",
  size_threshold = 0L,
  extensions = NULL,
  n_cores = 1L,
  print_results = TRUE
)
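A minimal sketch of a default call (the path is illustrative):

# Scan the current directory with default settings:
# all files, one worker, results printed to the console
dups <- find_duplicates(path = ".")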

Arguments

path

Character. Root directory to scan for duplicates.

size_threshold

Numeric. Minimum file size (in MB) for a file to be reported as a duplicate (default: 0).

extensions

Character vector. Optional file extensions (without the leading dot) to filter by, matched case-insensitively; e.g., c("csv", "txt"). If provided, only files with these extensions are considered; directories are excluded if they don't match. Defaults to NULL (all files are considered).

n_cores

Integer. Number of parallel workers to use (default: 1).

print_results

Logical. Whether to print results to the console (default: TRUE).
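As a sketch of combining these arguments (the directory path and settings are illustrative), a filtered parallel scan might look like:

# Report only CSV/TXT duplicates of at least 10 MB, using four workers
find_duplicates(
  path = "data",
  size_threshold = 10,
  extensions = c("csv", "txt"),
  n_cores = 4L
)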

Value

A list of tibbles with duplicated files and directories, if found. The duplicated_files tibble (if any) contains the following columns:

  • path: root path scanned;

  • dup_group: duplicate group ID;

  • files: list-column of tibbles with absolute/relative paths and modification times;

  • file_ext: file extension(s) of the group;

  • n_files: number of duplicated files in the group;

  • file_size_mb: size (MB) of the first file in the group; and

  • content_hash: MD5 hash of the file content.

The duplicated_dirs tibble (if any) contains the following columns:

  • dir: directory path at the duplicated level;

  • dir_abs: absolute path to the directory;

  • n_files: number of files in the directory;

  • n_dup_dirs: number of duplicated directories in the group;

  • dup_group: unique identifier for each group of duplicated directories.
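A sketch of inspecting the returned list, assuming its elements are named duplicated_files and duplicated_dirs as described above:

# Capture results without console output, then inspect each tibble
results <- find_duplicates(path = ".", print_results = FALSE)
results$duplicated_files  # groups of identical files, if any
results$duplicated_dirs   # groups of identical directories, if any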

Author

Ahmed El-Gabbas