sindresorhus/hasha


It takes a long time to hash a large file with hasha; could a bloom filter be provided to speed it up? #41

mankeheaven posted on GitHub

When I hash a 400 MB file with hasha, it takes 900 ms; if I hash a batch of 1 GB files, it may take several seconds, so I have to speed it up, and a bloom filter resolves it. Are there any good ideas for hashing a batch of big files?


Which method are you using?

posted by sindresorhus over 3 years ago

I wrote this in my project; maybe you could provide more options for hasha.fromFile? If it did, I would gladly use hasha. Here is my function:

import { close, open, read, readFile, stat } from 'fs-extra';
import { createHash } from 'crypto';

const defaultHashOptions = {
  fileSize: 10 * 1024,   // files smaller than this are hashed in full
  bufSize: 2 * 1024,     // size of the reusable read buffer
  sampleSize: 1 * 1024,  // bytes read per sample
  sampleCount: 5,        // number of evenly spaced samples
};

interface HashOptions {
  fileSize: number;
  bufSize: number;
  sampleSize: number;
  sampleCount: number;
}

const getHashFromFilePath = async (
  filePath: string,
  options: HashOptions = defaultHashOptions,
) => {
  const hash = createHash('sha256');
  let stats = null;
  try {
    stats = await stat(filePath);
  } catch (e) {
    // logError(`[md5-util] error when stating ${filePath}`, e);
  }
  const size = stats?.size || 0;

  if (size < options.fileSize) {
    const buf = await readFile(filePath);
    hash.update(buf);
    return hash.digest('hex');
  }

  const fd = await open(filePath, 'r');
  const buf = Buffer.alloc(options.bufSize);
  const offset = 0;

  // Hash sampleCount evenly spaced chunks instead of the whole file.
  for (let i = 0; i < options.sampleCount; i++) {
    const pos = Math.floor((size * i) / options.sampleCount);
    const length = Math.min(options.sampleSize, size - pos);
    const { bytesRead } = await read(fd, buf, offset, length, pos);
    if (bytesRead > 0) {
      // Only feed the bytes actually read into the hash, not the whole buffer.
      hash.update(buf.subarray(0, bytesRead));
    }
  }
  await close(fd);
  return hash.digest('hex');
};

export { getHashFromFilePath };
posted by mankeheaven over 3 years ago
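
For context, a hypothetical way to call the sampled-hash helper over a batch of files (the module path and file names are placeholders, not from the post):

import { getHashFromFilePath } from './hash-util';

const hashAll = async (paths: string[]) => {
  // Hash several large files concurrently; each call reads only a few samples.
  return Promise.all(paths.map((p) => getHashFromFilePath(p)));
};

hashAll(['./big-file-1.bin', './big-file-2.bin']).then((hashes) => {
  console.log(hashes); // one sampled sha256 hex digest per file
});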

I have to hash a lot of large files in a few seconds; do you have any good ideas?

posted by mankeheaven over 3 years ago

And a bloom filter resolves it.

Nice, how did you do it?

posted by papb over 3 years ago

@papb This is the code, but it has a false-positive rate (two files that differ only outside the sampled regions get the same hash). Do you have any ideas to improve it?

for (let i = 0; i < options.sampleCount; i++) {
  const pos = Math.floor((size * i) / options.sampleCount);
  const length = Math.min(options.sampleSize, size - pos);
  const { bytesRead } = await read(fd, buf, offset, length, pos);
  if (bytesRead > 0) {
    hash.update(buf.subarray(0, bytesRead));
  }
}
posted by mankeheaven over 3 years ago
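
One possible way to handle the false positives, not something suggested in the thread, is a two-stage check: treat matching sampled hashes only as "possibly identical" and confirm with a full-content hash. A rough sketch using the getHashFromFilePath function above (filesLikelyEqual and fullHash are made-up names):

import { readFile, stat } from 'fs-extra';
import { createHash } from 'crypto';

const fullHash = async (filePath: string) =>
  createHash('sha256').update(await readFile(filePath)).digest('hex');

const filesLikelyEqual = async (a: string, b: string) => {
  // Cheap checks first: different sizes or different sampled hashes
  // mean the files are definitely different.
  const [sa, sb] = await Promise.all([stat(a), stat(b)]);
  if (sa.size !== sb.size) return false;
  const [ha, hb] = await Promise.all([getHashFromFilePath(a), getHashFromFilePath(b)]);
  if (ha !== hb) return false;
  // Sampled hashes match: fall back to hashing the full contents
  // to rule out a false positive.
  const [fa, fb] = await Promise.all([fullHash(a), fullHash(b)]);
  return fa === fb;
};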

You're using .fromFile which comes with some initial overhead as it has to spawn a new worker_thread. This is only done once though, so multiple calls should be cheaper. Depending on your use-case, .fromFileSync is probably a lot faster (but it's blocking, so not good for servers).

posted by sindresorhus over 3 years ago
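
For illustration, the two call styles might look roughly like this (a sketch assuming hasha v5's fromFile/fromFileSync signatures; the algorithm option and path are just for the example):

import hasha from 'hasha';

const compareCallStyles = async (filePath: string) => {
  // Async: hashing runs in a worker_thread; the worker is spawned once and
  // reused, so repeated calls avoid the startup cost.
  const asyncHash = await hasha.fromFile(filePath, { algorithm: 'sha256' });

  // Sync: no worker overhead, but it blocks the event loop while it reads and hashes.
  const syncHash = hasha.fromFileSync(filePath, { algorithm: 'sha256' });

  return { asyncHash, syncHash };
};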
