In one of my previous posts, I have covered how to iterate through the files present in a directory or subdirectory and display the data present in it. But, what if the requirement is to iterate through the data present in an input file. This requires few more Talend components to be included in the Job design.

Requirement 

The requirement is to read the input file and create a separate file based on the CITY column present in the file.This requirement is definitely required to read and iterate data from the input file.



Steps To Iterate Data Present In Input File 

  • The first step is to read the input file. As used in the previous posts also, I am going to use tFileInputDelimited component to read the file. The configuration of this components is as below:
Talend_tFileInputDelimited Component
Talend_tFileInputDelimited Component

  • The next step is to make this input data iterable. Talend has a component for this also: tFlowToIterate. This component converts Main flow to iterate flow. It iterates input and converts into global variables. These global variables can be used later to access the data individually from the input file. The surprising thing is that you don't even have to do any kind of configuration. Everything will be handled by this component only.



  • Once you use the tFlowToIterate, it will convert your data into an iterable form. And, you can't put this data directly into a file. Even if you use tFileOutputDelimited component and try to link tFlowToIterat to tFileOutpuDelimited, the Talend won't accept it. You need to convert your data from iterable form to fixed form so that it can be put into a file. And, no surprise, Talend has a component for this also: tFixedFlowInput. 


  • tFixedFlowInput component requires config settings. First, define the schema, it will be same as defined in the input file. As soon as you define the columns the same will be populated in the COLUMN section. To map the values, go the value column and hit CTRL + SPACE. This will give you the list of all global variables. Select your global variables from the list.

  • Once everything has been set, your component settings will look like this. Notice, these values are the global values. Using these global values, we can access the data available in the specific column.And, we haven't done anything extra to set these values, just picked the right values from the drop down list. Talend makes it very easy.

  • So, once you get all values, you just have to set these values in the output file. To put the values in the output file, we are going to use tFileOutputDelimited component. You could have guessed it by now how the output will come out with the CITY name. We already have the global variable for the CITY column. We will use this global variable to generate the output file.
  • File Name : "C:/TalendTesting/TestingFiles/"+((String)globalMap.get("row1.CITY"))+".txt"


  • There could be multiple values linked to the same CITY. And, we want all CITY data to be clubbed in one file. To enforce this, please check the APPEND checkbox. Otherwise, each time when the Job encounters a CITY, it will create a new file. To check this scenario, I will add one more entry for PUNE city.



Component Details:

tFileInputDelimited :

tFileInputDelimited reads a given file row by row with simple separated fields. Opens a file and reads it row by row to split them up into fields then sends fields as defined in the Schema to the next Job component, via a Row link.

Get more details from here: tFileInputDelimited Component Detail

tFlowToIterate :

tFlowToIterate iterates on the input data and generates global variables.This component is used to read data line by line from the input flow and store the data entries in iterative global variables.

Get more details from here: tFlowToIterate Component Detail

tFixedFlowInput:

tFixedFlowInput generates as many lines and columns as you want using the context variables. It allows you to generate fixed flow from internal variables.

Get more details from here: tFixedFlowInput Component Detail

tFileOutputDelimited:

tFileOutputDelimited outputs data to a delimited file. This component writes a delimited file that holds data organized according to the defined schema.

Get more details from here: tFileOutputDelimited Component Detail

Final Job Design:


Talend_FinalJobDesign

Final Job Execution:

So, once I execute this Job, I should be getting 3 files for each CITY. And, for the PUNE city, my output file should have two records.

Talend_FinalJobExecution

The Output:

Talend_FinalJobOutput





Image Source : freepik

Ever wondered how could you find the specific component used in the heap of Jobs? This is required for a lot of reasons. You may want to trace the filenames that you have used in some specific components (eg: tFileInputDelimited ) but there are already 100 Jobs in the environment and you are not sure how many Jobs are actually using this component. So, you don't have any filter criteria. The last option would be to check all the Jobs one by one. But, what if you can apply some filter to get at least those Jobs which are actually using this specific component?

Steps To Check For Specific Component In A Job:

You may have a lot of Jobs under Talend Repository. The count could be anything. In my Talend Job Designs, I have these many Jobs as of now. And, I am not sure what components are being used in all these Jobs.

Talend_RepositoryJobDesign

Now, Go to Palette's Seach bar. And, search for your component. Or, if you know the exact folder location of the Talend component, you can directly go to the component. In this demo, I am searching for the tFileInputDelimited component. As you can see in the below screenshot, I have searched with tFileInput and got the list of all components where this string (tFileInput) has been used.


Talend_PaletteComponentSearch

So, the tFileInputDelimited component is under File -> Input directory. Right click on the component, and go to Find Component In Jobs option.


Talend_FindComponentInJobs

As soon as you click on it, Talend will scan all the Jobs available in Job designs and give you the list of the Jobs where this specific component is being used. In this case, Talend has given me the list of all Jobs where tFileInputDelimited component has been used.

Talend_SearchedJobDesigns
This may seem unimportant at the first sight but you will only get to know the importance of this feature when working in a real environment. I have worked on Datastage before there you may have options to find the Job but not any specific component. There were numerous occasions when I was craving for the feature like this. Fortunately, Talend provides this option and this feature is very helpful and useful. 





Variables are the soul of any program. As a programmer, we never code with the static values. Also, the main aim of any program to minimize the redundancy and repetition of the same thing. The Variables make it possible by storing the values dynamically. In Talend world also, we have variables concept and like in any other programming language, it also has the concept of Local and Global variables. The Local variables are known as Context Variables in Talend. A context variable is a variable which is defined by the user for a particular context.

Let's Explore Contexts..

Depending on the circumstances the Job is being used in, you might want to manage it differently for various execution types, known as Contexts (Ex: Prod and Test environments). The simple example is, testing environments where we have to run the jobs for different instances. You can consider all these testing environments as Contexts. You can set your Job to run in all these Contexts once you set the variables for it.

How To Define Context Variables For A Job :

There are mainly TWO ways to define variables for the job...

Define Variables Through Context View:

You can find the Context View among the configuration tabs and below the design workspace.

Talend_ContextView

In simple words, It's more of  Job Properties as used in Datastage. In this window, you can define all the parameters value that are going to be used in your Job. Just define it once, so that you don't have to define it again and again.

Talend_SetContextVariables
 To define, new environments or contexts, you can just click on the PLUS button as shown in below screenshot.


Click on New button, and give a name to the new contexts. You can give different contexts names for different environments. You can define the default contexts here, also set the position using UP /DOWN button.




Now, the new context would be as shown in the below screenshot. Same way, you can set contexts for different environments.

Talend_NewContextGotAdded

The next step is to set and define variables for the Context view, Just press the Plus button at the bottom of the Contexts tab. Set the Name, Type and Values for the variables. These values can be different or same based on your requirement for different environments.

Talend_SetVariablesInContextView

If you want to display prompt for variables before running the Job, Just check the box and mention the prompt name in the next column.


  
Define Variables Through Component View:

This is the quickest way to define a single context variable. You can parameterize any field using this.

Go to the component settings which you want to parameterize. In this example, I am going to use tFileTouch component. This component requires File Name value to execute. So, Instead of providing the File Name directly, I will parameterize this field using a variable.

Go to the field which you want to parameterize and press F5



Hit finish button and you will be back to component config window. Now, you will be able to see new value in File Name text field.


The same will get added to the Contexts tab of the configuration tab automatically. You don't need to come in the Contexts view tab and define the parameters individually. That's why this method is the fastest way to create a single contexts variable.


 If you want to change the file name or the directory, you can do it while running the Job. As we have set the prompt for the file name, Talend will ask you for the filename before running the Job. The benefit of using variables..





<a href="http://www.freepik.com/free-vector/laptop-with-rocket_763378.htm">Designed by Freepik</a>

In most of the ETL tools, we have the option to create routines. The language used to write these routine depends on the ETL platform.Like on the Datastage, it's the bash script. In Talend language, the routines are well written Java code which is required to do some specific task. The routines are small Java code snippets that we can directly use in our Job design. In Talend, You can use the System defined routines or if you are not able to find the routines based on your requirement, you can always define a new one.

Requirement 

The requirement is to add the timestamp as a suffix while generating the output file using Talend System routine.

Steps To Use System Routine In Talend

Talend offers two kinds of routines: System and User. The System routines are already existing routines in the Talend. these are classed according to the type of data which they process: numerical, string, date. The User routines are defined by the user either by updating current System routines or by creating new from scratch using Java code.

Where Are The System Routines?

To access the System routines, Go to Repository -> Code -> Routines-> System. In the below screenshot, I am using TalendDate routine. Once you click on the TalendRoutine, you will get the complete Java code for this routine in the code window. Also, if you check on the left side, you will find all the methods used in this routine at the bottom of the window.



Now, the requirement is to append timestamp as a suffix in the generated file name. For this demo, I will create a simple Job. Talend has a component to generate a blank file: tFileTouch component. This component will help you to generate a new file. You can mention the directory and file name and based on your input, this component will create the blank file under specified folder as per the requirement. You can find this component under File -> Management -> tFileTouch.




tFileTouch Component Details:


tFileTouch either creates an empty file or, if the specified file already exists, updates its date of modification and of last access while keeping the contents unchanged. This component creates an empty file or updates the details of an existing file for further operations and creates the destination directory if it does not exist.

Get the more details here: tFileTouch Component


How To Use System Routine ?

Now, let's configure this component. I will browse for any existing file and will add the timestamp using TalendDate routine. You can also create a new file and directory. Just mention the path and file name and check Create directory if it doesn't exist box.

Once you browse and select your file, it will come like this.



Now, as per the requirement, I will remove the file extension and will add an underscore ( _ ). Next, will add plus (+) sign. Notice the quotes. Hit CTRL + SPACE and you will get a lot of options to choose from.



If you scroll to the bottom, you will find a lot of methods, tagged as TalenDate. Based on your requirement, you can pick a value from the list. You can also play with the different methods and check what's the outcome. As I have to attach timestamp, I will go with the GetDate() option.

But, on windows platform, you have to remove colons(:). So, your GetDate() value would be TalendDate.getDate("CCYY-MM-DD hhmmss") in place of TalendDate.getDate("CCYY-MM-DD hh:mm:ss").



Final Job Execution :

It's time to run the Job now. Let's see the Job execution in action:







Sometimes, we don't get the proper records in the input file. Some columns come as blank. Based on the requirement, we may have to fill those blank values with some values. As we can't manipulate the records in the input file, we have to handle it in the Talend while processing it. One scenario could be to replace blank values as NULL.

Requirement 

The requirement is to replace the blank values coming in the input file with default values. The input file layout is as shown below :


Steps To Fill Up Blank Values With Default Values 

  • First of all, we are going to read this file using tFileInputDelimited. As I already discussed, we need to define the schema before reading any input file. 
  • To Define the schema , Go to MetaData -> FileDelimited -> Create File Delimited. Give any name as per your requirement and click on Next.
  • Browse for the file and select the Format as per your OS. Click Next.
  • As I am getting the Header columns in the input file only, I have checked the Set heading row as column name. As soon as you check this box, the Header in the Rows To Skip section will automatically be get selected.




  • You can see in the above screenshot, the input file records got properly structured in the table format and it's easy to identify Blank column here. It's Company and City column. You may have the requirement to fill these columns with some default values instead of leaving it as blank. I am going to take NULL string as filler. Don't be confused with NULL string. It's not a NULL value. The NULL string, in this case, is similar to any other string value like XXX or YYY.


  • You can put any value as per your requirement. Also, as the column type is String, I am putting these values in double quotes otherwise Talend will throw an error.

The Final Job :

As expected, the blank values got replaced with NULL string. 






Each input file follows a standard layout. Normally, each input files comes with Header and Trailer records. The Header and Trailer records give some additional information about the file Ex : When the file gets delivered and what is the record count in the file. But, while processing the file, the Header and Trailer records are not required. And, we only have to deal with the Detailed records present in the file.

Requirement 

The requirement is to skip Header and Trailer Records from an input file.


Steps To Exclude Header And Trailer From File 

  • First of all, we are going to read this file using tFileInputDelimited. As I already discussed, we need to define the schema before reading any input file. And in this case, we will use specifically Repository schema to read the file.
  • To Define the schema , Go to MetaData -> FileDelimited -> Create File Delimited. Give any name as per your requirement and click on Next.
  • Browse for the file and select the Format as per your OS.


  • Now, in the next screen, we have to do some changes based on our requirement. As you can see my input file is the delimited one and the delimiter is SPACE. I will let the Talend know to split the columns based on the SPACE. Also, to exclude the Header and Trailer , just check the Header and Footer checkbox and put the count as 1. Click on Refresh Preview.




  • The Column name and Column type will be automatically selected. If you are not satisfied with it, you can modify it based on your requirement in the next screen. Once done, click Finish and that's it !!




  • Once you are done with the Metadata, the next step is to select it while reading the records. As we are using tFileInputDelimited component to read the file, we have to select the Repository schema and then the metadata.



The Final Job Design And Execution 

I have used tLogRow component to display the output. If everything works well, we shouldn't be getting Header and Trailer records in the output.









In one of my previous posts. I have covered how to read and Excel File. As you already know, the Excel file comes with a structure. We call this structure the Schema of Excel file. When we read Excel file through Talend, First, we define its schema using Built In or Repository option. Now, what if we want to read multiple Excel files or the multiple worksheets present in Excel and all have different schemas. We can't define the fixed schema for these. All we can do is to define separate schemas for separate layout and then find some links to club these records.

Requirement

The requirement is to read different schemas worksheets present in Excel file and club it into a new Excel File. Instead of different worksheets, there could be different Excel sheets also.


Steps To Read Excels With Different Schemas

  • Let's first explore the Input source. For this demo, I have created an excel file InputExcel. It has two worksheets with different schemas. The first sheet has columns as Name, ID and Score whereas in the Second sheet, we have columns as ID, Department
















  • In both sheet, we have a common column ID. We are going to use this column to match records from both worksheets and use this to club data and output the records in a new excel file.
  • First thing first, we have to define schemas for our Input source. As we are dealing with two schemas, we have to define two different schemas. I have defined Repository schemas for this.



  • The process of defining the schema is same. For the TalendTestingInputExcel metadata, I have selected the FirstSheet as the worksheet and for TalendTestingInputExcel_Sheet2 metadata, I have used SecondSheet as the worksheet. Now, I don't have to define these schemas later while configuring the component. I can just use these schemas and everything will be populated automatically in Talend.


  • Now, the next question, how to link these two worksheets ? For this, we are going to use a new Talend component tMap. It's more like a Transformer of Datastage ETL tool. You can transform and manipulate data using tMap component.
  • Using tMap component, we can join the record based on ID column and extract the required fields.



  • In the above screenshot, I am linking both worksheets based on ID column. You just have drag and drop the row1 ID column in front of row2 ID column. By default, the join is Left outer join. But if you want, you can change it to Inner Join based on your requirement.



  • The final piece is to configure the tFileOutputExcel component.


  • The Final output would be like this :




The Final Job Execution 



© 2015 Techie's House. Designed by Bloggertheme9