Searching for substrings in a string

Hi All,

I've been given a school assignment, and so far we have only covered C in school.
For this assignment we are supposed to use C++, so I'm pretty confused.

The requirements are as follows:
Given an input file with multiple lines of DNA sequences, we are supposed to search for motifs within them and write some information to the output file.

The DNA will be read line by line, and each line is known as a "sequence" (i.e. Sequence 1, 2, ...).

Information to be shown in the output file:
Number of occurrences of each motif, locations of the motifs, and time taken to process each sequence.
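
For the count and the locations, a minimal C++ sketch could look like the following. The helper name findMotif is hypothetical (it is not part of the assignment), and two things are assumed: locations are 1-based, as in the sample output further down, and overlapping matches are counted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the 1-based start position of every
// occurrence of `motif` in `sequence`. Overlapping occurrences are
// included (an assumption; the assignment does not say either way).
std::vector<std::size_t> findMotif(const std::string& sequence,
                                   const std::string& motif)
{
    std::vector<std::size_t> locations;
    if (motif.empty()) {
        return locations;  // an empty motif would match everywhere; report nothing
    }
    std::size_t pos = sequence.find(motif);
    while (pos != std::string::npos) {
        locations.push_back(pos + 1);         // convert 0-based index to 1-based
        pos = sequence.find(motif, pos + 1);  // resume one character further on
    }
    return locations;
}
```

The number of occurrences is simply the size of the returned vector. With the assignment's own sample, findMotif("GTTCAGTCAAAGTAATTCTTC", "TTC") gives 2, 16, 19, which matches the expected output for Seq-1.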

At the same time, the program itself is supposed to show the progress of the search through each sequence.

Example of text from input file is as follows:
ACTGTCCCTGGGAGAACATATCGTCGTCAACACGTAAACAA
AGCCATCTTGTATTGCGACCTTAATGTAAGCGGCGTAATGATCATTTGGCAGGAAACATTCCTTCGGCGGTCCTCAC

Example of the motifs to be searched for:
TA
GA
TT

Example of display in program to user:
>> Welcome to Motif Finder program!
>> Make sure your input file is named as Data.txt and has
all the DNA sequences!
>> Motif searching started… Be patient and you will be prompted
to see the output.
>> Seq-1 completed…
>> Seq-2 completed…

>> Processing all the sequences in Data.txt is now completed. See
the statistics reported in Output.txt file.
>> Thank you!

Example of output:
STATISTICS (Generated on: DATE and TIME HERE)
GENERATED BY: <YOUR NAME HERE>
============
Motifs to be searched: TTC, AGGA
For each sequence the format followed is as below:
Motif: Number of Occurrences (locations)
===========

Seq-1: GTTCAGTCAAAGTAATTCTTC
Length of the sequence: 21

Motif (Locations):

TTC: 3 (2,16,19)
AGGA: 0
.
.
.
Time Taken to Process Seq-1: 1.025 secs
==================

.
.
.
==================

Seq-125: GCAAGGTCTAAAGGCTATATCTAAGATTTGAGAGTAGAAAAAAAAAT
Length of the sequence: 47

Motif (Locations):

TCT: 1 (7)
AGGA: 0
.
.
.
Time Taken to Process Seq-125: 1.025 secs
==========

Number of Sequences Processed: 125
Total time taken for processing: (sum all the above times) secs
==============================================

CONSOLIDATED STATISTICS FOR ALL MOTIFS
Format used is as follows:
Motif: Total number of Occurrences (Sequence indices)
===========
AT: 0
TCT: 70 (1, 11, 22, 56, 102)
.
.
.
======

Other useful statistics:
Most Frequently Occurring motif(s): TCT, CCC
Least frequently Occurring motif(s): AT
Most Common motif(s): CGC, GTG
===========================
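
For the consolidated statistics, one possible way to keep the totals is a std::map keyed by motif, as sketched below. MotifTotals, recordHits and mostFrequent are hypothetical names. The assignment text does not define how "Most Frequently Occurring" differs from "Most Common"; this sketch only computes the first (highest total count) and stores the sequence indices, which could also support "appears in the most sequences" if that is what "Most Common" means.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical running totals for one motif across all sequences.
struct MotifTotals {
    std::size_t occurrences = 0;      // total hits in all sequences
    std::set<std::size_t> sequences;  // 1-based indices of sequences with a hit
};

// Adds the result for one (motif, sequence) pair to the totals.
void recordHits(std::map<std::string, MotifTotals>& totals,
                const std::string& motif,
                std::size_t sequenceIndex,
                std::size_t hits)
{
    MotifTotals& entry = totals[motif];  // creates a zero entry, so "AT: 0" is still reported
    if (hits == 0) {
        return;
    }
    entry.occurrences += hits;
    entry.sequences.insert(sequenceIndex);
}

// Returns every motif that shares the highest total count (ties included),
// in alphabetical order because std::map iterates in key order.
std::vector<std::string> mostFrequent(const std::map<std::string, MotifTotals>& totals)
{
    std::vector<std::string> best;
    std::size_t bestCount = 0;
    for (const auto& item : totals) {
        if (item.second.occurrences > bestCount) {
            best.clear();
            bestCount = item.second.occurrences;
        }
        if (item.second.occurrences == bestCount) {
            best.push_back(item.first);
        }
    }
    return best;
}
```

The least frequently occurring motifs can be found the same way with the comparison reversed.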

This is my code so far, and I'm pretty much stuck:
#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h> /* Input & Output Functions Header File */
#include <time.h> /* Date & Time Functions Header File */
#include <string.h> /* String Functions Header File */
#include <stdlib.h> /* General Utility Functions Header File */
#include <iostream> /* C++ stream I/O (std::cout, std::cin); not used yet */

FILE *input_seq; /* Declare a file pointer for the input file */
FILE *output; /* Declare a file pointer for the output file */

/* Declare Functions that will be used */
void SearchMotifs(); /* Function structure to search for Motifs in the Input File */
void FindStatistics(); /* Function to tabulate the statistics*/

/* Declare variables used */
char buffer[256]; /* Buffer to store the sequences from input file*/
int i; /* Row number */
int j; /* Column number */
char *result[40][100];

/* Numbering of sequence */
int sequence;

/* Timing of searching for Motifs in sequence */
clock_t begin, end;

/* Declare Motifs (substrings) to search*/
char *TA;

/* Declare the count of occurrences */
int CountTA = 0;

/* Declare location of Motifs */
int L_TA=0;

int main(void)
{
	printf("Welcome to Motif Finder Program \n");
	printf("Please ensure that your input file is saved as InputData.txt and has all the DNA sequences in it. \n");
	system("pause");

	input_seq = fopen("InputData.txt", "r");
	if (input_seq == NULL)
	{
		printf("Error in opening file. \nPlease check naming of file.\n");
		system("pause");
		exit(EXIT_FAILURE); /* Report failure to the caller */
	}
	else
	{
		printf("File successfully opened.\nPlease hold while the data is being tabulated.\n");
		system("pause");

		SearchMotifs();
		FindStatistics();
		fclose(input_seq); /* Done with the input file */

		printf("Searching of Motifs in all sequences has been completed.\n");
		printf("See the statistics report in the Output File.\n");

		system("pause");
		exit(0);
	}
}
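void SearchMotifs() /* Search every sequence in the input file for the Motif TA */
{
	for (i = 0; i < 40; i++) { /* No ';' after the for header, or the body runs only once */
		for (j = 0; j < 100; j++) {
			result[i][j] = NULL;
		}
	}
	sequence = 0;
	while (fgets(buffer, sizeof buffer, input_seq) != NULL) /* Read one line at a time till end of file */
	{
		buffer[strcspn(buffer, "\r\n")] = '\0'; /* Strip the trailing newline */
		sequence++;
		begin = clock();
		/* Search details for Motif TA */
		TA = buffer; /* Start searching at the beginning of the sequence */
		while ((TA = strstr(TA, "TA")) != NULL)
		{
			CountTA++;
			L_TA = (int)(TA - buffer) + 1; /* 1-based location of this occurrence */
			TA++; /* Continue one character further so the same match is not found again */
		}
		end = clock();
		printf("Seq-%d completed...\n", sequence);
	}
}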
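void FindStatistics()
{
	char timestamp[32]; /* Holds the date and time as text */
	time_t current_time = time(NULL); /* Current date and time */

	output = fopen("Output.txt", "w");
	if (output == NULL)
	{
		printf("Error in creating Output.txt.\n");
		return;
	}
	strftime(timestamp, sizeof timestamp, "%Y-%m-%d %H:%M:%S", localtime(&current_time));
	fprintf(output, "Statistics \n(Date & time: %s)\nGenerated by: Motifs Finder\n\n", timestamp);
	fclose(output);
}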



