In the last post, we have seen how to reject rows using tFileInputDelimited component. The rows got rejected because the records were not the valid records. Talend provides some additional component for the stricter validation. In this post, we will see how to reject rows if it doesn't fulfill the schema criteria. You can use this method to add one more layer of validation.

Requirement 

Check for the input records and reject the rows if it doesn't satisfy the schema criteria.

Steps For Job Processing :

First, let's have a look at the input file. I am highlighting some of the incorrect records and these records should get rejected.


Before strictly checking for the schema compliance, let's use our old method of rejecting records using the input (tFileInputDelimited) component only. I have defined the schema as per the below screenshot to read the input file. The process is same as we did in the last post.


Now, let's just run the Job and check for the output. The tFileInputDelimited component rejected 2 rows. The records are correctly rejected as these are not fulfilling the schema criteria. But, I still haven't got the record which I have highlighted in the above mentioned input file screenshot. As per the schema restriction, the record length is of 50 characters but the tFileInputDelimited reject row is not able to capture it.


To get those records are out, we need to apply an extra layer of validation. And Talend provides a component for this: tSchemaComplianceCheck. It's a very useful component for ensuring that the data passing downstream is correct with respect to the defined schema. The Job design would be as per the below screenshot after adding tSchemaComplianceCheck component :


The next step is to configure this component. We have 3 modes for the schema checking. We can check schema for all columns or pick specific columns using Custom Defined options. To get the log of this content, I have just connected tSchemaComplianceCheck with the tLogRow. The log will come with the errCode. The description of the error code is shown as per the below screenshot :



Let's check the output now..


We get the error code as 8 with the message Exceed max length  in the tSchemaComplianceCheck log.







Generally whenever the developer has a requirement to reject rows from input files, the first component that comes to the mind is tMap. This component is obviously the powerful component because of numerous options it provides. It's not recommended to use tMap for smaller tasks as it expects some input and output. So, for normal operation also, a developer need to use at least three Talend components. Also, it is a bit heavy on memory. The same can be accomplished using single component if the developer is aware of the components reject row option availability.

Let's check how to do this..

Requirement :

The requirement is to reject rows without using tMap component. While doing so, identify how to check the reject rows option availability in Talend components.

Steps For Processing:

For this demo, I am going to use a very simple Talend Job. This Talend Job just reads the data from the input file and display it on the console. The input is as shown below:



If I run the Job now, it will get fail with the below mentioned error. It's a very simple Job with tFileInputDelimited component to read the input file and tLogRow to display input records on the console. The Job is getting failed because of the incorrect format of input records under timestamp column.



Go to tFileInputDelimited component and check for the configurations. Just notice the checked on Die On Error box. This checkbox is forcing the Job to abort. Now, if you click the tFileInputDelimited component, and go to the Rows. You will find only one option: Iterate. No Reject Rows !!



Now if I have to capture the reject rows, I have to include some other components, preferably, tMap. But before going there, we have an option to check whether the particular Talend component provides the reject option or not. To validate the availability, just uncheck the Die on Error box. And again go to the tFileInputDelimited component, right click and go to Rows. Now, you will be able to see two options: Reject and Iterate.



If we use this option, we can directly capture the rejected rows without using any other Talend component. The Final Job would be like this with the Reject row. I am using another tLogRow component to capture the output.



If I execute this Job, the output would be like this:

The rejected rows are getting captured in the tLogRow with the error message.












In this tutorial, we will check how to run your Talend Jobs without using Talend Open Studio. Once you are done with the development in one environment, you have to move your code in the other environments too. That's required exporting your jobs and importing it in other environments. And, in the real environment, the Talend Jobs are going to be run on some kind of scheduler. Let's learn how to make our Talend Jobs scheduler ready..

Requirement :

The requirement is to export the Talend Job and run it without Talend Open Studio. In other words, the execution of jobs would be outside the Talend Designer window.

Steps For Processing:

First, look for the jobs that you want to export and run externally. You can use any of the existing jobs or you can create a new one just for the testing purpose. I am going to use the existing job for this demo.

You can get the detail of the Job here: How To Read Multiple Files Using Talend Open Studio



Right click on the Job and go to Build Job option. Once you click on the Build Job option, you will get a window as mentioned below. Select the Zip file location where you want to keep your zip file for the Job.



You will see plenty of options here. Just keep the Select build type as Standalone Job. Also, check Extract the zip file if you don't want to extract it manually. Also, based on your OS, you can select the launcher type. Once you are done with the selections, just click Finish. Go to the directory and you will see the extract files as shown below:



We are done with the exporting and extracting the files. Now, the execution of the job. As I am working on windows, I will try to execute the Windows Batch File (.bat). To run this file, I will open the Command Prompt and navigate to this directory.

Once I am in the directory, will just call the bat file using HowToReadMultipleFiles_run.bat command.



Once the command gets executed, I can see the same output on the command prompt as I was getting on the tLogRow console.




Requirement :

The requirement is to run the jobs in the sequential and parallel mode in Talend. All child jobs will be called under the Parent Job. The scenario is similar to calling the jobs under the Sequencer stage in Datastage. The requirement is to achieve similar kind of functionality here also.

Steps For Processing:

First, I am going to create two child jobs. These child jobs will be called under the Parent Job.

The First Child Job

It's a very simple job. I am linking tJava components with the tSleep component on a trigger condition. Under tJava_1 component, I have put the below statement: 

System.out.println( " Start : First Child Job ");

Same way, I have put the below statement in tJava_2 component:

System.out.println( " End : First Child Job ");

To halt the process for 5 seconds, tSleep component has been configured with the 5 seconds value.



The Second Child Job

It's a very simple job. I am linking tJava components with the tSleep component on a trigger condition. Under tJava_1 component, I have put the below statement: 

System.out.println( " Start : Second Child Job ");

Same way, I have put the below statement in tJava_2 component:

System.out.println( " End : Second Child Job ");

To halt the process for 5 seconds, tSleep component has been configured with the 5 seconds value.


The Parent Job

I have created one Parent Job to call these child jobs. The execution mode is different. First, I will call these jobs in the Sequential mode. You can guess the mode just by seeing the output. In the Sequential mode, the jobs will execute in an order. The first job will execute first and later the second one whereas, Parallel mode, the jobs will get triggered simultaneously.

Sequential Execution 




Parallel Execution




To enable the Parallel Execution, just check the Multi thread execution box.









Yesterday, I was going through the unanswered posts of Talendforge forum and one question caught my eye. It was titled as Problem with Joins. First, I thought it would be related to some settings problem but it was not. It was related to one scenario that required the implementation of Right Outer Join. In Talend, we can join using two components: tJoin and tMap. The problem is: tJoin only supports Inner Join whereas tMap supports only Inner Join and Left Outer Join. So, we have no way of direct usage of Right Outer Join. You have to find another way of implementing it. This is my workaround for the mentioned problem, you can think of another approach also.

Talendforgre forum question: Problem with the joins

Requirement 

Click To Enlarge


Final Job Design:


Steps For Processing:

First, I have created 3 delimited files with a comma (,) as a delimiter and placed all these records as per the requirement in the files. Now, to read all these input files, I am using tFileInputDelimited component.My thought is to create two data sets and then merge them to Right Outer Join way. To achieve this, I am joining first two input files Literacy_rate.csv and Graduation_rate.csv. As I don't have to skip any records from here, I am using tJoin component for INNER JOIN.


We want all columns from both input files. To get this, go to Edit Schema and add columns as per your need.You can just simply use -> (arrow symbol) to move columns from one panel to another.We will get Year, Grade_Rate and Literacy_Rate columns from the input files but we also have to add one extra column specifically so that it can work with the tUnite component. tUnite expects same schema structure from its sources. So, this will give me the first dataset. And, it will have all records from both input files.


In the second dataset, we will try to capture the records from Grade_count.csv. In the output, we should have matched records for the year 2011 and 2012 as these years are present in the above-mentioned input files. In the output, we should have the values for the respective columns.Also, we should have an entry for Year 2013. Grade_count.csv data will be our Main records and rest two files records will be lookup and the join mode will be Left Outer Join.



The next step is to merge this dataset with the other dataset. To merge it, I am going to use tUnite component. Also, in this case, and as per our requirement, set the order as mentioned in the below screenshot.



The next step is to remove the duplicate records. We would be getting two sets of records for the year 2011 and 2012 form the first and second dataset and we just have to keep the values from the second dataset as it will have correct values for all the columns. I have defined the key as Year to filter out records. tUnique will take first records and drop the second occurrence. To make sure the tUnique picks the dataset coming from tMap. We have defined the merge order as mentioned in the above screenshot.



I am using tSort row next to just align the data in the ascending order of the years.



Once you execute this job, you will get an output like this:






In the last post, we covered how to write records into multiple sheets of an excel from a single delimited input file. This post is about reading the records from multiple sheets present in an excel file. You can't directly read and write records from different schemas excel sheets so we need to involve few more Talend components while doing the transformation. We have already seen how to use tBufferInput and tBufferOutput component for the purpose of temporary storing records while writing into Excel. For the reading purpose, we will use tMap component to define different schemas as per our output requirement.

Requirement

The requirement is to read multiple sheets present in an excel file. Also, the schema of each sheet is different. Extract all records available in the sheets and place it into two different CSV files based on the sheet name.

Input 



Output


Final Job Diagram 


Steps For Processing 

The first step is to read the excel file. The first step is to create a Metadata for this excel file. We will define the schema as the Repository level. Also, our excel has sheets with different schemas so we can't read both sheets with the predefined schemas specific to one sheet. The trick is to first, select All Sheets while defining the schema and second, take the schema of the sheet which has the most number of columns. Please check below screenshots:



In next screen, you will get the preview of the records. You may not get the desired sheet record as per your requirement. You need not bother about the records. Just make sure you get the number the exact number of columns as per your need. In my input, the max number of columns(4) are on the Details sheet. Even though I am getting the Name sheet records are here but I am ok with as the number of columns is 4 here.


In the next screen, you will get the preview of the Metadata. You can rename the column's name based on your whim. But in this scenario, just keep the column names as general as possible. As these 4 columns will only store the records of both sheets. Later, we will define the exact schema as per our need.



The next step is to split records as per our need. In our requirement, we have to split records based on our sheets. To split the records, we are going to use tMap component. To read the CURRENT SHEET , we are going to use the global variable : ((String)globalMap.get("tFileInputExcel_1_CURRENT_SHEET")) 
I am just storing this Expression under Variable var1. If you want, you can use it directly in the ou1 and out2 expression editor. Using a variable here makes code clean and concise.


So. once done with the mapping, you can rename the columns based on your output requirement. As you can see it here, the input columns will store both sheets records. I am detecting the current sheet to be read and then mapping the input columns to output based on the current sheet's schema. This is the only trick that you have to apply while using multiple sheets of an excel file.

The next step is to just map this output to the two different CSV file. The configuration would be like this:









In my previous posts, I have covered how to read an excel and put the data into it. But, it was only related to one sheet of the excel file. What if we have to generate the multiple sheets for the records coming from an input file? It's possible and there is a way to do it.You can create N number of sheets based on your requirement. In this post, I am going to create two sheets in an excel file. Let's check how to do it.

Requirement 

The requirement is to process the fixed length file and generates output in the Excel sheet. The output will be displayed in two different sheets of the same Excel file.

Input 



Output 


The Final Job Diagram 


Steps For The Processing 

The first step is to read our input file.The input file is the fixed length file.To read this file, define the metadata using File Positional option.To define the Metadata, go to the File Positional option and define the positions as per your input file. I have defined the Metadata as per my input file:



Once you define the Metadata, you can read the input file using tFilePositional input file. The configuration would be as mentioned in the below screenshot:


As we have to split the single input file into multiple sheets, we are going to use tMap components to split the contents. The configuration of tMap component would be as mentioned in the below screenshot:



Now, next thing is to put these records into the Excel file. For this, we are going to use tFileOutputExcel. The configuration would be like this:



We can't define more than one sheet name here. But, there is a workaround for this. We are going to use tBufferOutput to store the records for the second sheet in memory and later will use the tBufferInput component to retrieve those values. tBuffer* components provide you the temporary location to store and retrieve records.


You can automatically populate the schema for tBufferOutput by pressing Sync columns. But, this can't be done for tBufferInput, you have to manually define the schema for this component.As we are going to retrieve what we store in memory using tBufferOutput, we have to use the same schema as in tBufferOutput. The schema would be same as mentioned in the below screenshot:




The last step is to link the tBufferOutput to the Excel using tFileOutputExcel component. We have to link to the same Excel as we want to write into the same Excel but in a different sheet. Give the new name for the sheet and don't forget to tick the Append Existing File and Sheet.






© 2015 Techie's House. Designed by Bloggertheme9