Introduction

Splitting file data into small chunks is a common way to scale data processing when you run an analysis on a cluster. In distributed mode, Eoulsan automatically splits and merges common biological data (FASTQ and SAM files) using the Hadoop framework, with low overhead. You can use the same strategy to parallelize computation on a non-Hadoop cluster, provided that you manually declare in the workflow file where data must be split and merged.
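
As a minimal sketch of this manual strategy, a workflow file could place a splitter step before the processing steps and a merger step after them. The step ids below are hypothetical, and the processing steps are left out because they depend on the analysis. Only the splitter and merger modules and their format parameter are taken from this page.

    <!-- Split the reads before processing (hypothetical step id) -->
    <step id="splitreads" skip="false" discardoutput="false">
    	<module>splitter</module>
    	<parameters>
    		<parameter>
    			<name>format</name>
    			<value>fastq</value>
    		</parameter>
    	</parameters>
    </step>

    <!-- The processing steps that run on each chunk go here -->

    <!-- Merge the per-chunk SAM results (hypothetical step id) -->
    <step id="mergesam" skip="false" discardoutput="false">
    	<module>merger</module>
    	<parameters>
    		<parameter>
    			<name>format</name>
    			<value>sam</value>
    		</parameter>
    	</parameters>
    </step>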

Splitter module

This module splits data into small chunks.

  • Internal name: splitter
  • Available: Both local and distributed mode

  • Input port:
    • input: data to split (format defined in the parameters)

  • Output port:
    • output: split data (format defined in the parameters)


  • Mandatory parameter:
    Parameter  Type    Description
    format     format  Name of the format of the data to split. See below for the list of formats that can be split

  • Optional parameters: Splitters can have optional arguments to set the splitting method according to the data format
  • Configuration example:
    <!-- Split reads step (1,000,000 max entries per file) -->
    <step id="mysplitterstep" skip="false" discardoutput="false">
    	<module>splitter</module>
    	<parameters>
    		<parameter>
    			<name>format</name>
    			<value>fastq</value>
    		</parameter>
    		<parameter>
    			<name>max.entries</name>
    			<value>1000000</value>
    		</parameter>
    	</parameters>
    </step>
    

Merger module

This module merges small chunks of data into a single larger file.

  • Internal name: merger
  • Available: Both local and distributed mode
  • Multithreaded in local mode: N/A
  • Input: Data in the format defined in the parameters
  • Output: Data merged in the same format as the input

  • Mandatory parameter:
    Parameter  Type    Description
    format     format  Name of the format of the data to merge. See below for the list of formats that can be merged

  • Optional parameters: Mergers can have optional arguments to set the merge method according to the data format
  • Configuration example:
    <!-- Merge SAM files step -->
    <step id="mymergerstep" skip="false" discardoutput="false">
    	<module>merger</module>
    	<parameters>
    		<parameter>
    			<name>format</name>
    			<value>sam</value>
    		</parameter>
    	</parameters>
    </step>
    

Technical replicate merger module

This module merges all the data related to the same technical replicates. It uses the RepTechGroup column of the design to define the data to merge.

  • Internal name: technicalreplicatemerger
  • Available: Both local and distributed mode
  • Multithreaded in local mode: N/A
  • Input: Data in the format defined in the parameters
  • Output: Data merged in the same format as the input

  • Mandatory parameter:
    Parameter  Type    Description
    format     format  Name of the format of the data to merge. See below for the list of formats that can be merged

  • Optional parameters: Mergers can have optional arguments to set the merge method according to the data format
  • Configuration example:
    <!-- Merge technical replicates step -->
    <step id="mytechrepmerger" skip="false" discardoutput="false">
    	<module>technicalreplicatemerger</module>
    	<parameters>
    		<parameter>
    			<name>format</name>
    			<value>fastq</value>
    		</parameter>
    	</parameters>
    </step>
    

Supported formats

fastq

  • Format name: reads_fastq or fastq
  • Description: FASTQ format

  • Splitter optional parameters:
    Parameter    Type     Default value  Description
    max.entries  integer  1000000        The maximum number of entries in each splitter output file

  • Merger optional parameters: None

sam

  • Format name: mapper_results_sam or sam
  • Description: SAM format

  • Splitter optional parameters:
    Parameter    Type     Default value  Description
    max.entries  integer  1000000        The maximum number of entries in each splitter output file
    chromosomes  boolean  false          Split the original SAM file into files that each contain only the entries mapped to the same chromosome. This option cannot be used with the max.entries option

  • Merger optional parameters: None
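
For illustration, a splitter step that splits a SAM file by chromosome might look like the sketch below. The step id is hypothetical, and writing the boolean value as true is an assumption based on the false default listed above. The max.entries parameter is left out because the two options cannot be combined.

    <!-- Split a SAM file by chromosome (hypothetical step id) -->
    <step id="splitsambychromosome" skip="false" discardoutput="false">
    	<module>splitter</module>
    	<parameters>
    		<parameter>
    			<name>format</name>
    			<value>sam</value>
    		</parameter>
    		<parameter>
    			<name>chromosomes</name>
    			<value>true</value>
    		</parameter>
    	</parameters>
    </step>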

bam

  • Format name: mapper_results_bam or bam
  • Description: BAM format

  • Splitter optional parameters:
    Parameter    Type     Default value  Description
    max.entries  integer  1000000        The maximum number of entries in each splitter output file
    chromosomes  boolean  false          Split the original BAM file into files that each contain only the entries mapped to the same chromosome. This option cannot be used with the max.entries option

  • Merger optional parameters: None

expression

  • Format name: expression_results_tsv or expression
  • Description: Expression format

  • Splitter optional parameters:
    Parameter    Type     Default value  Description
    max.entries  integer  10000          The maximum number of entries in each splitter output file

  • Merger optional parameters: None
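
As a sketch, a splitter step for expression files that lowers the chunk size from the default of 10000 entries could be declared as follows. The step id and the value 5000 are hypothetical and chosen only for illustration.

    <!-- Split expression results (5,000 max entries per file, hypothetical step id) -->
    <step id="splitexpression" skip="false" discardoutput="false">
    	<module>splitter</module>
    	<parameters>
    		<parameter>
    			<name>format</name>
    			<value>expression</value>
    		</parameter>
    		<parameter>
    			<name>max.entries</name>
    			<value>5000</value>
    		</parameter>
    	</parameters>
    </step>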