Not reproducible #562

Open
jnmaloof opened this issue May 13, 2024 · 7 comments

Comments

@jnmaloof

Version 0.23.4

I get a different number of reads and bases after filtering every time I run fastp:

(base) exouser@julin-2:$ for i in $(seq 1 10); do echo "run $i";  fastp --in1 GH.lane67.fastq.gz --out1 GH.lane67.trimmed.fastq.gz --length_required 50 -q 20 -u 70 --phred64 --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 2>&1 >/dev/null | grep -E "(too short)|(total bases: 9)" ; done
run 1
total bases: 99875000
reads failed due to too short: 593
run 2
total bases: 99844193
reads failed due to too short: 705
run 3
total bases: 99868071
reads failed due to too short: 595
run 4
total bases: 99864238
reads failed due to too short: 608
run 5
total bases: 99862997
reads failed due to too short: 625
run 6
total bases: 99900970
reads failed due to too short: 451
run 7
total bases: 99894738
reads failed due to too short: 454
run 8
total bases: 99879032
reads failed due to too short: 538
run 9
total bases: 99844285
reads failed due to too short: 711
run 10
total bases: 99875189
reads failed due to too short: 575
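
The summary counts can be cross-checked against the filtered output itself. The following is only a sketch, assuming each run writes to its own output file (the `run_N.trimmed.fastq.gz` names are hypothetical, as is the helper name `count_distinct_outputs`; it is not part of fastp). It decompresses each file before checksumming so that only the read content is compared, and prints how many distinct outputs were seen. A value of 1 would mean all runs produced identical reads.

```shell
# Hypothetical helper: print the number of distinct decompressed contents
# among the gzip-compressed files given as arguments.
count_distinct_outputs() {
    for f in "$@"; do
        # cksum on stdin prints "<CRC> <byte count>" with no file name
        gzip -dc "$f" | cksum
    done | sort -u | wc -l | tr -d ' '
}

# Example usage (file names are assumptions, not from the runs above):
#   count_distinct_outputs run_1.trimmed.fastq.gz run_2.trimmed.fastq.gz
```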
@jnmaloof
Author

Even worse, if I set the number of threads to 1, I get very different results:

~/Assignments/assignment-09-jnmaloof/input/Brapa_fastq$ for i in $(seq 1 10); do echo "run $i";  fastp --in1 GH.lane67.fastq.gz --out1 GH.lane67.trimmed.fastq.gz --length_required 50 -q 20 -u 70 --phred64 --thread 1 --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 2>&1 >/dev/null | grep -E "(too short)|(total bases: 9)" ; done
run 1
total bases: 95838782
reads failed due to too short: 18579
run 2
total bases: 95252057
reads failed due to too short: 21209
run 3
total bases: 95333601
reads failed due to too short: 20843
run 4
total bases: 95233893
reads failed due to too short: 21348
run 5
total bases: 95322517
reads failed due to too short: 20845
run 6
total bases: 95282512
reads failed due to too short: 21062
run 7
total bases: 95392059
reads failed due to too short: 20542
run 8
total bases: 95425813
reads failed due to too short: 20426
run 9
total bases: 95422163
reads failed due to too short: 20373
run 10
total bases: 95362468
reads failed due to too short: 20601
@jnmaloof
Author

Examining thread dependency a bit more:

# "2>&1 >/dev/null" sends fastp's stderr summary to grep and discards stdout
for t in {1..8}; do
    for i in {1..2}; do
        echo "threads $t rep $i"
        fastp --in1 GH.lane67.fastq.gz --out1 GH.lane67.trimmed.fastq.gz --length_required 50 -q 20 -u 70 --phred64 --thread "$t" --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 2>&1 >/dev/null | grep -E "(too short)|(total bases: 9)"
    done
done
threads 1 rep 1
total bases: 95469777
reads failed due to too short: 20233
threads 1 rep 2
total bases: 95242300
reads failed due to too short: 21336
threads 2 rep 1
total bases: 97828605
reads failed due to too short: 9742
threads 2 rep 2
total bases: 97857064
reads failed due to too short: 9565
threads 3 rep 1
total bases: 99872436
reads failed due to too short: 581
threads 3 rep 2
total bases: 99899052
reads failed due to too short: 451
threads 4 rep 1
total bases: 99939418
reads failed due to too short: 284
threads 4 rep 2
total bases: 99930726
reads failed due to too short: 318
threads 5 rep 1
total bases: 99926647
reads failed due to too short: 338
threads 5 rep 2
total bases: 99926602
reads failed due to too short: 345
threads 6 rep 1
total bases: 99923806
reads failed due to too short: 346
threads 6 rep 2
total bases: 99917299
reads failed due to too short: 384
threads 7 rep 1
total bases: 99894695
reads failed due to too short: 490
threads 7 rep 2
total bases: 99904110
reads failed due to too short: 441
threads 8 rep 1
total bases: 99888522
reads failed due to too short: 510
threads 8 rep 2
total bases: 99892127
reads failed due to too short: 514
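
The output above is easier to compare across thread counts once it is in tabular form. A minimal sketch, assuming the loop's output was saved to a file (the name `threads.log` and the helper name `tabulate_runs` are hypothetical): it reads the three-line groups in the format shown above and prints one tab-separated row per run with the columns threads, rep, total bases, and too-short reads.

```shell
# Hypothetical helper: turn the "threads T rep I" log on stdin into
# tab-separated rows: threads, rep, total_bases, too_short
tabulate_runs() {
    awk '
        /^threads [0-9]+ rep [0-9]+$/ { threads = $2; rep = $4 }
        /^total bases: /              { bases = $3 }
        /^reads failed due to too short: / {
            # the count is the last field on this line
            printf "%s\t%s\t%s\t%s\n", threads, rep, bases, $NF
        }
    '
}

# Example usage (threads.log is an assumed file name):
#   tabulate_runs < threads.log | sort -n
```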
@jnmaloof
Author

jnmaloof commented May 15, 2024

If I revert to version 0.20.1, then the results are reproducible (and give me a different result than any of those above: 42341 reads are removed for being too short). That is on par with what trimmomatic returns and is probably the correct result.

@peter-kanvas

I stumbled upon the same problem with version 0.23.4, but found that the results are reproducible when I run using 1 thread. It's interesting that version 0.23.0 claims to have fixed the reproducibility problem...

@sfchen
Member

sfchen commented Jul 3, 2024

Could you please share a piece of sample data, along with the command?

@peter-kanvas

peter-kanvas commented Jul 3, 2024

My test dataset is too large to share, but here is the exact fastp call I used:
fastp -i SRR13921546_sub_1.fastq.gz -I SRR13921546_sub_2.fastq.gz -o SRR13921546_filter_1.fastq.gz -O SRR13921546_filter_2.fastq.gz -j SRR13921546_filter.json -w 1 --dedup

I'm running in a modified version of this Docker container, which is based on Ubuntu: "mambaorg/micromamba:1.5.8-jammy"

fastp was installed with micromamba:
RUN micromamba create -q -y -c conda-forge -c bioconda -n fastp fastp=0.23.4 && micromamba clean --all -y

I used diff to compare the .json files from multiple runs. Much of the content is identical, but not all of it. I think I also tried removing --dedup, and that did not solve it. Only setting the thread count to 1 fixed it.
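
The comparison of reports described above could be scripted roughly as follows. This is only a sketch: the helper name `compare_reports` and the per-run report names in the usage comment are hypothetical (the command above writes a single `SRR13921546_filter.json`, so each run would need its own `-j` path). It checks every report byte-for-byte against the first one; `diff` can then be run on any pair flagged as different.

```shell
# Hypothetical helper: compare each report against the first one given.
compare_reports() {
    ref=$1
    shift
    for f in "$@"; do
        if cmp -s "$ref" "$f"; then
            echo "identical: $f"
        else
            echo "differs: $f"
        fi
    done
}

# Example usage (report names are assumptions):
#   compare_reports run1.json run2.json run3.json
#   diff run1.json run2.json
```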

Honestly, fastp is so fast that 1 thread is still usable. Love the program and thanks for following up!

EDIT:
If you really want to recreate my test set, you can download the sequencing data for SRR13921546 and then take the first million reads from each file.
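
Taking the first million reads can be sketched as below, assuming standard FASTQ with exactly four lines per record, so that one million reads correspond to the first 4,000,000 lines. The helper name `take_reads` and the name of the downloaded input file in the usage comment are hypothetical; only the `_sub_` output name comes from the command above. For paired-end data the same read count would be applied to both files so the pairs stay in step.

```shell
# Hypothetical helper: write the first N reads of a gzipped FASTQ to stdout,
# recompressed. Assumes 4 lines per FASTQ record.
take_reads() {
    gzip -dc "$1" | head -n "$(( $2 * 4 ))" | gzip -c
}

# Example usage (the input file name is an assumption):
#   take_reads SRR13921546_1.fastq.gz 1000000 > SRR13921546_sub_1.fastq.gz
```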

@jnmaloof
Author

You can download my sample dataset here:

https://bis180ldata.s3.amazonaws.com/downloads/Illumina_Assignment/GH.lane67.fastq.gz

My commands are in my earlier posts in this thread.

3 participants